DTU Health Tech

Department of Health Technology

DeepLoc - 2.1

Prediction of eukaryotic protein subcellular localization using deep learning

DeepLoc 2.1 predicts the subcellular localization(s) and the associated membrane type(s) of eukaryotic proteins. DeepLoc 2.1 is a multi-label predictor, which means that it is able to predict one or more localizations for any given protein. It can differentiate between 10 different localizations and 4 different membrane associations. The localization labels are defined by the 10 classes: Nucleus, Cytoplasm, Extracellular, Mitochondrion, Cell membrane, Endoplasmic reticulum, Chloroplast, Golgi apparatus, Lysosome/Vacuole and Peroxisome. The membrane association of a protein is defined by the following four labels: Peripheral, Transmembrane, Lipid anchor and Soluble (non-membrane). Additionally, DeepLoc 2.1 can predict the presence of the sorting signal(s) that had an influence on the prediction of the subcellular localization(s).

The webserver grants free access to all users, including commercial users.


Submission


Submit data

Paste or upload protein sequence(s) in fasta format to predict the subcellular localization and membrane association. A maximum of 500 sequences is allowed. The prediction can take a few seconds per sequence, depending on the model selected.

Protein sequences should not be shorter than 10 amino acids.
Sequences beyond the limit of the language model will be truncated by removing the middle part of the sequence. In Slow mode, the limit is 4000 amino acids, so longer sequences are represented by the concatenation of the first and last 2000 amino acids. For the Fast model, the limit is 1022 amino acids. Furthermore, it is recommended to use the "short output (no figures)" option for high-throughput analysis (i.e. when submitting more than 100 sequences in one batch).
The maximal active execution time of a job in the queue is 4 hours. If a job fails, please submit a smaller job next time.
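
The length rules above can be illustrated with a minimal Python sketch. It is not the server's code: the function names are hypothetical, and the even split for the Fast model (first and last 511 amino acids) is an assumption, since the text only states the split explicitly for Slow mode.

```python
# Limits quoted in the text above. The names used here are illustrative only.
MODEL_LIMITS = {"slow": 4000, "fast": 1022}
MIN_LENGTH = 10


def truncate_middle(sequence: str, limit: int) -> str:
    """Drop the middle of a sequence so that at most `limit` residues remain.

    Keeps the first and last `limit // 2` residues. For the Slow limit this
    matches the stated "first and last 2000"; for the Fast limit the even
    split is an assumption.
    """
    if len(sequence) <= limit:
        return sequence
    half = limit // 2
    return sequence[:half] + sequence[-half:]


def prepare_sequence(sequence: str, model: str = "slow") -> str:
    """Reject sequences that are too short, then truncate to the model limit."""
    if len(sequence) < MIN_LENGTH:
        raise ValueError(
            f"sequence has {len(sequence)} residues; the minimum is {MIN_LENGTH}"
        )
    return truncate_middle(sequence, MODEL_LIMITS[model])
```

For example, a 5000-residue sequence passed through `prepare_sequence(seq, "slow")` comes back as its first 2000 residues followed by its last 2000.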




Model:
  • High-quality (Slow)
  • High-throughput (Fast)

Output format:
  • Long output
  • Short output (no figures)

Instructions/Help


The DeepLoc 2.1 server predicts the multi-label subcellular localization and membrane association of eukaryotic proteins using a neural network algorithm trained on UniProt proteins with experimental evidence of subcellular localization (including membrane association). The model can predict whether a protein is found in one or multiple localizations inside the eukaryotic cell, along with the membrane association of the protein. It uses only the sequence information to perform the prediction. Additionally, DeepLoc 2.1 can predict the presence of the sorting signal(s) that had an influence on the prediction of the subcellular localization(s). The importance of each amino acid in the predicted localization is also included as an "attention" plot. Positions in the sequence with a high attention value are deemed more relevant for the prediction. This does not mean that a particular amino acid is very important for the prediction, but that a region in the neighbourhood of those positions has more weight in the final prediction of the model.
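
Because attention highlights regions rather than single residues, one way to read the values is to look for the stretch of sequence with the highest average attention. The sketch below does that with a sliding window. It is only an illustration: the function name, the window size and the input format (one attention value per residue) are assumptions, and it is not how the server draws its plot.

```python
def top_attention_window(attention: list[float], window: int = 5) -> tuple[int, int, float]:
    """Return (start, end, mean) for the window with the highest mean attention.

    `start` is 0-based and inclusive, `end` is exclusive. If several windows
    tie, the first one is returned.
    """
    if not attention:
        raise ValueError("attention must not be empty")
    if window < 1:
        raise ValueError("window must be at least 1")
    window = min(window, len(attention))
    starts = range(len(attention) - window + 1)
    # Recompute each window sum directly; simple and fine for protein-length input.
    best_start = max(starts, key=lambda s: sum(attention[s:s + window]))
    mean = sum(attention[best_start:best_start + window]) / window
    return best_start, best_start + window, mean
```

A window near the N-terminus with a high mean would, for instance, be consistent with an N-terminal sorting signal, but the window itself is only a reading aid.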

The DeepLoc 2.1 server can be run using two versions of the same model.

  • The high-quality model utilizes the ProtT5-XL-Uniref50 transformer (ProtT5). This model provides a more accurate prediction at the expense of longer computation time due to the size of the model (3 billion parameters). Use case: high-quality prediction for a small number of proteins.
  • The high-throughput model utilizes the 33-layer ESM transformer (ESM1b). This smaller model (650 million parameters) has the advantage of a faster computation time with a slight decrease in accuracy compared to the ProtT5 model. Use case: high-throughput prediction for a larger number of proteins.

The DeepLoc 2.1 server requires protein sequence(s) in fasta format and cannot handle nucleic acid sequences.
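
Before submitting, it can help to check a batch locally. The following Python sketch parses fasta text and reports sequences that are too short or that look like nucleic acid. The 500-sequence and 10-residue limits come from the submission notes on this page. The function names and the 90% letter cutoff are assumptions made for illustration, and the server's own input checks may differ.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse fasta text into (header, sequence) pairs."""
    records: list[tuple[str, str]] = []
    header = None
    chunks: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def looks_like_nucleic_acid(sequence: str, cutoff: float = 0.9) -> bool:
    """Heuristic (assumed cutoff): nearly all letters are A, C, G, T, U or N."""
    if not sequence:
        return False
    hits = sum(1 for letter in sequence if letter in NUCLEOTIDE_LETTERS)
    return hits / len(sequence) >= cutoff


def check_batch(records, max_sequences: int = 500, min_length: int = 10) -> list[str]:
    """Return human-readable problems; an empty list means the batch looks fine."""
    problems = []
    if len(records) > max_sequences:
        problems.append(f"{len(records)} sequences exceed the limit of {max_sequences}")
    for header, sequence in records:
        if len(sequence) < min_length:
            problems.append(f"{header}: shorter than {min_length} amino acids")
        elif looks_like_nucleic_acid(sequence):
            problems.append(f"{header}: looks like a nucleic acid sequence")
    return problems
```

The nucleotide check is deliberately crude: a short peptide made only of alanine, cysteine, glycine and threonine would be flagged too, so treat it as a warning rather than a verdict.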

Two different versions of the output can be selected before running DeepLoc 2.1. The long output will generate an attention plot per sequence while the short output will not generate any plots.

Paste protein sequence(s) in fasta format or upload a fasta file.

After the server successfully finishes the job, a summary page is shown. If an error occurs during the prediction, a log appears specifying the error.

Output format


The DeepLoc 2.1 output is composed of three main components:

  • The Predicted localizations, Predicted membrane types and Predicted signals display the subcellular localizations, membrane association and sorting signals predicted for the query protein, respectively.
  • The Probability table displays the probability assigned by the model to each of the subcellular localizations and membrane types. Localizations with a probability above the threshold are highlighted in green. The green intensity reflects the proximity of the localization probability to the threshold. The intensity increases the farther the probability is from the threshold.
  • The Sorting signal importance displays a logo-like plot of the positions in the query protein with higher importance for the prediction and highly associated with sorting signals.
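
The relation between the probability table and the predicted labels can be sketched in Python as follows. The threshold values in the usage note are placeholders, not the thresholds used by DeepLoc 2.1, and the linear intensity scale is an assumption; the text above only states that intensity grows with the distance from the threshold.

```python
def predicted_labels(probabilities: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Multi-label decision: keep every label whose probability exceeds its threshold."""
    return [
        label
        for label, probability in probabilities.items()
        if probability > thresholds[label]
    ]


def highlight_intensity(probability: float, threshold: float) -> float:
    """Map a probability to a 0..1 highlight strength (assumed linear scale).

    Returns 0.0 at or below the threshold and 1.0 when the probability is 1.0.
    """
    if probability <= threshold:
        return 0.0
    return (probability - threshold) / (1.0 - threshold)
```

With placeholder thresholds of 0.5 for every label, the probabilities `{"Nucleus": 0.8, "Cytoplasm": 0.55, "Extracellular": 0.1}` give the labels Nucleus and Cytoplasm, which is how a single protein can receive more than one localization.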


Training and testing data sets


The dataset used to train and test the DeepLoc 2.1 server is available here:

The Partition column in the Training/Validation sets indicates the five partitions (0-4) that the dataset was homology partitioned into (maximum 30% sequence similarity).
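
One common way to use such a column is to hold out one partition at a time for validation and train on the other four. The sketch below assumes the rows have been loaded as dictionaries with an integer "Partition" value; the "ACC" key and the function name are hypothetical, and the exact evaluation protocol should be taken from the publication rather than from this example.

```python
def cross_validation_folds(rows: list[dict], n_partitions: int = 5):
    """Yield (held_out, train_rows, validation_rows) for each partition 0..n-1."""
    for held_out in range(n_partitions):
        train = [row for row in rows if row["Partition"] != held_out]
        validation = [row for row in rows if row["Partition"] == held_out]
        yield held_out, train, validation


# Toy rows standing in for the real data set (identifiers are made up).
rows = [{"ACC": f"P{i}", "Partition": i % 5} for i in range(10)]
for held_out, train, validation in cross_validation_folds(rows):
    print(held_out, len(train), len(validation))
```

Because the partitions were built by homology, splitting along them keeps similar sequences from appearing in both the training and the validation side of a fold, which a random split would not guarantee. When reading the file with `csv.DictReader`, the Partition values arrive as strings and would need converting with `int()` first.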

References


Please cite:

DeepLoc 2.0: multi-label subcellular localization prediction using protein language models.
Vineet Thumuluri, Jose Juan Almagro Armenteros, Alexander Rosenberg Johansen, Henrik Nielsen, Ole Winther.
Nucleic Acids Research, Web server issue 2022.

Abstract

The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals.

Version history


2.0 The current server. New in this version:
  • Model architecture: DeepLoc 2.1 is based on a transformer language model, trained on a massive dataset of unlabeled protein sequences. DeepLoc 2.1 leverages all the features from DeepLoc 2.0 (multi-localization and sorting signal predictions) but additionally provides predictions regarding the membrane associativity of a query protein.
  • Multi-localization prediction: DeepLoc 2.1 is able to predict proteins that are located in more than one compartment.
  • Multi-membrane type prediction: DeepLoc 2.1 is able to predict the membrane associativity of proteins.
  • Sorting signal prediction: DeepLoc 2.1 predicts the presence of nine types of sorting signals. For prediction of the precise positions of N- or C-terminal sorting signals, we refer to specific predictors such as SignalP, TargetP, or NetGPI.
  • Logo-like attention plot: The plot visualizes which part(s) of the input sequence were important for prediction. We show in the article of DeepLoc 2.0 that there is a correlation between the attention and the positions of known sorting signals, and that this correlation is stronger than for DeepLoc 1.0.
Publication:
DeepLoc 2.0: multi-label subcellular localization prediction using protein language models.
Vineet Thumuluri, Jose Juan Almagro Armenteros, Alexander Rosenberg Johansen, Henrik Nielsen, Ole Winther.
Nucleic Acids Research, Web server issue 2022.
1.0 The original DeepLoc server.
Publication:
DeepLoc: prediction of protein subcellular localization using deep learning
Jose Juan Almagro Armenteros, Casper Kaae Sønderby, Søren Kaae Sønderby, Henrik Nielsen, Ole Winther.
Bioinformatics, 33:3387–3395 (2017).

Software Downloads


  • Version 2.0
  • Version 1.0


GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0). If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
