NetSolP: predicting protein solubility in Escherichia coli using language models
Vineet Thumuluri, Hannah-Marie Martiny, Jose J. Almagro Armenteros, Jesper Salomon, Henrik Nielsen,
Alexander R. Johansen
Bioinformatics (2021) DOI:10.1093/bioinformatics/btab801
Motivation
Solubility and expression levels of proteins can be a limiting factor for large-scale studies and
industrial production. By determining the solubility and expression directly from the protein
sequence, the success rate of wet-lab experiments can be increased.
Results
In this study, we focus on predicting the solubility and usability for purification of proteins
expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep
learning protein language models called transformers and we show that it achieves
state-of-the-art performance and improves extrapolation across datasets. As we find current
methods are built on biased datasets, we curate existing datasets by using strict
sequence-identity partitioning and ensure that there is minimal bias in the sequences.
Availability and implementation
The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP
and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0.
This folder contains the datasets with the partitions used in the paper.
- PSI Biology dataset:
  Data taken from: SoDoPe
- NESG / Price dataset used in two ways:
  Data taken from: SoluProt
- CamSol dataset:
  - FASTA file of sequences
  - CSV file of mutations
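The sequence files listed above are in FASTA format, so they can be inspected with a few lines of Python before running any predictor. The following is a minimal sketch only: `parse_fasta` is a hypothetical helper written for this illustration and is not part of the NetSolP code, and the two records are invented placeholders rather than entries from any of the datasets.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} mapping.

    The record ID is taken to be the first whitespace-delimited token
    after '>' (an assumption; adjust if your headers differ).
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[current_id] += line
    return records


# Invented example records, for illustration only.
example = """>seq1 hypothetical example
MKTAYIAKQR
QISFVKSHFS
>seq2
MSDNE
""".splitlines()

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```

To read a file from disk instead, pass an open file handle, for example `parse_fasta(open("sequences.fasta"))`, where the file name is a placeholder.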
The download contains the following:
- Datasets used in the paper
- Code used in the webserver
- Trained models
- Code for training and testing the models
The code and the trained models are made available under the 3-Clause BSD License.
By downloading the file, you agree to the terms of this license.
Copyright © 2021 Technical University of Denmark
Redistribution and use in source and binary forms, with or without modification, are permitted provided
that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.