DTU Health Tech

Department of Health Technology

DeepLoc - 2.0

Prediction of eukaryotic protein subcellular localization using deep learning

DeepLoc 2.0 predicts the subcellular localization(s) of eukaryotic proteins. DeepLoc 2.0 is a multi-label predictor, which means that it is able to predict one or more localizations for any given protein. It can differentiate between 10 different localizations: Nucleus, Cytoplasm, Extracellular, Mitochondrion, Cell membrane, Endoplasmic reticulum, Chloroplast, Golgi apparatus, Lysosome/Vacuole and Peroxisome. Additionally, DeepLoc 2.0 can predict the presence of the sorting signal(s) that influenced the prediction of the subcellular localization(s).

Prokaryotic proteins: To predict the locations of proteins in prokaryotes, use DeepLocPro.
RNA: To predict the locations of RNA, use DeepLocRNA.

NOTE: This is not the newest version of DeepLoc. To use the current version with the added capability of predicting membrane protein types, please go to DeepLoc 2.1!

Submission


Submit data

Paste or upload protein sequence(s) in FASTA format to predict the subcellular localization. A maximum of 500 sequences is allowed. The prediction can take a few seconds per sequence, depending on the model selected.
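
Before submitting, it can help to check a FASTA file locally against the 500-sequence limit. The following is a minimal sketch, not part of DeepLoc itself: the function names `parse_fasta` and `check_submission_size` and the example sequences are hypothetical, and the server may apply additional checks that are not described here.

```python
MAX_SEQUENCES = 500  # submission limit stated on this page


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data found before the first '>' header")
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission_size(records, limit=MAX_SEQUENCES):
    """Raise ValueError if there are more sequences than the limit allows."""
    if len(records) > limit:
        raise ValueError(f"{len(records)} sequences exceed the limit of {limit}")
    return records


# Hypothetical two-sequence input, for illustration only.
example = """>protein_A hypothetical example
MKTAYIAKQR
QISFVKSHFS
>protein_B hypothetical example
MSLLTEVETP
"""
records = check_submission_size(parse_fasta(example))
print(len(records), "sequences parsed")
```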

Protein sequences must be between 10 and 6000 amino acids long.
Be aware that sequences longer than 4000 (Slow mode) or 1022 (Fast mode) amino acids will be truncated. Truncation removes residues from the middle of the sequence.
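
The length rules can be illustrated with a short sketch. The limits (10, 6000, 4000 and 1022) are taken from the text above; everything else is an assumption. In particular, this page does not say how the retained residues are divided between the two ends, so the even split used in the hypothetical `truncate_middle` below is only one plausible reading and may differ from what the server does.

```python
MIN_LENGTH = 10      # shortest accepted sequence
MAX_LENGTH = 6000    # longest accepted sequence
TRUNCATE_AT = {"slow": 4000, "fast": 1022}  # per-mode truncation lengths


def truncate_middle(sequence, max_len):
    """Shorten a sequence to max_len by dropping residues from the middle.

    Assumption: the kept residues are split evenly between the N-terminal
    and C-terminal ends (the N-terminal end gets the extra residue when
    max_len is odd).
    """
    if len(sequence) <= max_len:
        return sequence
    head = (max_len + 1) // 2  # residues kept from the start
    tail = max_len - head      # residues kept from the end
    return sequence[:head] + (sequence[-tail:] if tail else "")


def prepare_sequence(sequence, mode="fast"):
    """Check the length limits, then truncate for the chosen mode."""
    if not MIN_LENGTH <= len(sequence) <= MAX_LENGTH:
        raise ValueError(
            f"length {len(sequence)} is outside {MIN_LENGTH}-{MAX_LENGTH}"
        )
    return truncate_middle(sequence, TRUNCATE_AT[mode])


# A 1500-residue dummy sequence is shortened in Fast mode but not in Slow mode.
dummy = "M" + "A" * 1498 + "K"
print(len(prepare_sequence(dummy, "fast")), len(prepare_sequence(dummy, "slow")))
```

Under this assumption both ends of the sequence survive truncation, which is where N- or C-terminal sorting signals would be found.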




Model
High-quality (Slow)
High-throughput (Fast)
Output format:
Long output
Short output (no figures)

Instructions/Help


The DeepLoc 2.0 server predicts the multi-label subcellular localization of eukaryotic proteins using a neural network trained on UniProt proteins with experimental evidence of subcellular localization. The model can predict whether a protein is found in one or in multiple localizations inside the eukaryotic cell. It uses only the sequence information to perform the prediction. Additionally, DeepLoc 2.0 can predict the presence of the sorting signal(s) that influenced the prediction of the subcellular localization(s). The importance of each amino acid for the predicted localization is also shown as an "attention" plot. Positions in the sequence with a high attention value are deemed more relevant for the prediction. A high value does not necessarily mean that the individual amino acid at that position is very important, but rather that the region around that position carries more weight in the model's final prediction.
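
Because attention is best read per region rather than per residue, one simple way to summarize a profile is to look for the window of consecutive positions with the highest total attention. The sketch below is an illustration only: the function `top_attention_window`, the window size and the attention values are all hypothetical, and this is not how the server builds its plot.

```python
def top_attention_window(attention, window=5):
    """Return (start, end, total) for the window with the highest summed
    attention. Positions are 0-based and `end` is exclusive."""
    if not 0 < window <= len(attention):
        raise ValueError("window must be between 1 and len(attention)")
    current = sum(attention[:window])
    best_start, best_total = 0, current
    for start in range(1, len(attention) - window + 1):
        # Slide the window one position to the right.
        current += attention[start + window - 1] - attention[start - 1]
        if current > best_total:
            best_start, best_total = start, current
    return best_start, best_start + window, best_total


# Made-up per-residue attention values for a 12-residue sequence.
attention = [0.01, 0.02, 0.30, 0.25, 0.20, 0.05, 0.02, 0.01,
             0.04, 0.05, 0.03, 0.02]
start, end, total = top_attention_window(attention, window=3)
print(f"highest-attention region: positions {start + 1}-{end}")
```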

The DeepLoc 2.0 server can be run using two versions of the same model.

  • The high-quality model utilizes the ProtT5-XL-Uniref50 transformer (ProtT5). This model provides a more accurate prediction at the expense of longer computation time due to the size of the model (3 billion parameters). Use case: high-quality prediction for a small number of proteins.
  • The high-throughput model utilizes the 33-layer ESM transformer (ESM1b). This smaller model (650 million parameters) has the advantage of a faster computation time with a slight decrease in accuracy compared to the ProtT5 model. Use case: high-throughput prediction for a larger number of proteins.

The DeepLoc 2.0 server requires protein sequence(s) in FASTA format and cannot handle nucleic acid sequences.

Two different versions of the output can be selected before running DeepLoc 2.0. The long output will generate an attention plot per sequence while the short output will not generate any plots.

Paste protein sequence(s) in FASTA format or upload a FASTA file.

After the server successfully finishes the job, a summary page is displayed. If an error occurs during the prediction, a log specifying the error will appear instead.

Output format


The DeepLoc 2.0 output is composed of three main components:

  • The Predicted localizations and Predicted signals display the subcellular localizations and sorting signals predicted for the query protein, respectively.
  • The Probability table displays the probability assigned by the model to each of the subcellular localizations. Localizations with a probability above the threshold are highlighted in green. The intensity of the green increases the farther the probability lies above the threshold.
  • The Sorting signal importance displays a logo-like plot of the positions in the query protein that are most important for the prediction and are strongly associated with sorting signals.
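
The thresholding behind the Probability table can be sketched as follows. This is a hedged illustration: the probabilities and threshold values are made up, the per-localization threshold dictionary is an assumption (this page only refers to "the threshold"), and the linear scaling in the hypothetical `highlight_intensity` is just one way to make the intensity grow with distance from the threshold.

```python
def predict_localizations(probabilities, thresholds):
    """Return every localization whose probability is above its threshold.

    Several localizations can pass at once, which is what makes the
    prediction multi-label.
    """
    return [loc for loc, p in probabilities.items() if p > thresholds[loc]]


def highlight_intensity(probability, threshold):
    """Map a probability to a 0-1 intensity (threshold must be below 1).

    Returns 0.0 at or below the threshold and grows linearly to 1.0 at
    probability 1.0. The linear scaling is an assumption.
    """
    if probability <= threshold:
        return 0.0
    return (probability - threshold) / (1.0 - threshold)


# Hypothetical model output and thresholds for three of the ten localizations.
probabilities = {"Nucleus": 0.80, "Cytoplasm": 0.55, "Peroxisome": 0.05}
thresholds = {"Nucleus": 0.50, "Cytoplasm": 0.50, "Peroxisome": 0.50}

for loc in predict_localizations(probabilities, thresholds):
    intensity = highlight_intensity(probabilities[loc], thresholds[loc])
    print(f"{loc}: p={probabilities[loc]:.2f}, intensity={intensity:.2f}")
```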


Training and testing data sets


The dataset used to train and test the DeepLoc 2.0 server is available here:

The Partition column in the Training/Validation set indicates which of the five partitions (0-4) each protein belongs to. The dataset was homology partitioned, with a maximum of 30% sequence similarity between partitions.
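
One common use of such a column is five-fold cross-validation, holding out one partition at a time. The sketch below assumes a CSV file with a `Partition` column as described above; the `ACC` column name, the example identifiers and the function `cross_validation_folds` are hypothetical and are not taken from the actual dataset.

```python
import csv
import io


def cross_validation_folds(rows, n_partitions=5):
    """Yield (held_out, train_rows, validation_rows) for each partition."""
    for held_out in range(n_partitions):
        train = [r for r in rows if int(r["Partition"]) != held_out]
        valid = [r for r in rows if int(r["Partition"]) == held_out]
        yield held_out, train, valid


# Made-up stand-in for the real data set file.
example_csv = """ACC,Partition
example_01,0
example_02,1
example_03,2
example_04,3
example_05,4
example_06,0
"""
rows = list(csv.DictReader(io.StringIO(example_csv)))

for held_out, train, valid in cross_validation_folds(rows):
    print(f"partition {held_out}: {len(train)} training, {len(valid)} validation")
```

Splitting along the homology partitions, rather than at random, keeps closely related sequences out of both the training and the validation side of the same fold.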

References


Please cite:

DeepLoc 2.0: multi-label subcellular localization prediction using protein language models.
Vineet Thumuluri, Jose Juan Almagro Armenteros, Alexander Rosenberg Johansen, Henrik Nielsen, Ole Winther.
Nucleic Acids Research, Web server issue 2022.

Abstract

The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals.

Version history


2.0 The current server. New in this version:
  • Model architecture: DeepLoc 2.0 is based on a transformer language model, trained on a massive dataset of unlabeled protein sequences.
  • Multi-localization prediction: DeepLoc 2.0 is able to predict proteins that are located in more than one compartment.
  • Sorting signal prediction: DeepLoc 2.0 predicts the presence of nine types of sorting signals. For prediction of the precise positions of N- or C-terminal sorting signals, we refer to specific predictors such as SignalP, TargetP, or NetGPI.
  • Logo-like attention plot: The plot visualizes which part(s) of the input sequence were important for prediction. We show in the article that there is a correlation between the attention and the positions of known sorting signals, and that this correlation is stronger than for DeepLoc 1.0.
Publication:
DeepLoc 2.0: multi-label subcellular localization prediction using protein language models.
Vineet Thumuluri, Jose Juan Almagro Armenteros, Alexander Rosenberg Johansen, Henrik Nielsen, Ole Winther.
Nucleic Acids Research, Web server issue 2022.
1.0 The original DeepLoc server.
Publication:
DeepLoc: prediction of protein subcellular localization using deep learning
Jose Juan Almagro Armenteros, Casper Kaae Sønderby, Søren Kaae Sønderby, Henrik Nielsen, Ole Winther.
Bioinformatics, 33:3387–3395 (2017).

Software Downloads


  • Version 2.0
  • Version 1.0


GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
