SignalP - 3.0

Signal peptide and cleavage sites in gram+, gram- and eukaryotic amino acid sequences

The SignalP 3.0 server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks and hidden Markov models.

NOTE: This is not the newest version of SignalP. To use the current version, please go to the main SignalP site!

Submission

Sequence submission: paste the sequence(s) and/or upload a local file

Paste a single sequence or several sequences in FASTA format into the field below:

Submit a file in FASTA format directly from your local disk:

Organism group: Eukaryotes, Gram-negative bacteria, Gram-positive bacteria
Method: Neural networks, Hidden Markov models, Both
Graphics: No graphics, GIF (inline), GIF (inline) and EPS (as links)
Output format: Standard, Full, Short (no graphics!)
Truncation: Truncate each sequence to a maximum number of residues. We recommend that only the N-terminal part of each protein sequence is submitted. Enter 0 (zero) to disable truncation.

Restrictions:
At most 2,000 sequences and 200,000 amino acids per submission; each sequence not more than 6,000 amino acids.

Confidentiality:
The sequences are kept confidential and will be deleted after processing.

CITATIONS

For publication of results, please cite:

Current version:
Improved prediction of signal peptides: SignalP 3.0.
Jannick Dyrløv Bendtsen, Henrik Nielsen, Gunnar von Heijne and Søren Brunak.
J. Mol. Biol., 340:783-795, 2004.
Download the full article in PDF.
Original paper:
Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.
Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne.
Protein Engineering, 10:1-6, 1997.
If you specifically use the SignalP-HMM output, please also cite:
Prediction of signal peptides and signal anchors by a hidden Markov model.
Henrik Nielsen and Anders Krogh.
Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology (ISMB 6),
AAAI Press, Menlo Park, California, pp. 122-130, 1998.
A paper about using SignalP and other protein subcellular localization prediction methods:

Locating proteins in the cell using TargetP, SignalP, and related tools
Olof Emanuelsson, Søren Brunak, Gunnar von Heijne, Henrik Nielsen
Nature Protocols 2, 953-971 (2007).

is available for download - please click here to access the paper and supplementary materials.

Usage instructions

1. Specify the input sequences

All the input sequences must be in one-letter amino acid code. The allowed alphabet (not case sensitive) is as follows:

A C D E F G H I K L M N P Q R S T V W Y and X (unknown)

All the alphabetic symbols not in the allowed alphabet will be converted to X before processing. All the non-alphabetic symbols, including white space and digits, will be ignored.
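
The conversion rules above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the server's own code; the function name normalize_sequence is hypothetical, and returning the result in upper case is an assumption (the page only says the alphabet is not case sensitive).

```python
# Allowed one-letter codes as listed above, including X (unknown).
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYX")

def normalize_sequence(raw: str) -> str:
    """Apply the stated input rules to one raw sequence string."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            continue  # white space, digits and other symbols are ignored
        up = ch.upper()  # assumption: report the cleaned sequence in upper case
        # Alphabetic symbols outside the allowed alphabet become X.
        cleaned.append(up if up in ALLOWED else "X")
    return "".join(cleaned)

# "b" and "z" are not in the allowed alphabet, so both become X.
print(normalize_sequence("mkt 10 LLb*z-a"))
```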

The sequences can be input in the following two ways:

Paste a single sequence (just the amino acids) or a number of sequences in FASTA format into the upper window of the main server page.
Select a FASTA file on your local disk, either by typing the file name into the lower window or by browsing the disk.

Both ways can be employed at the same time: all the specified sequences will be processed. However, there may be no more than 2,000 sequences and 200,000 amino acids in total in one submission. No sequence may be longer than 6,000 amino acids.
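
Before submitting a large file it can be convenient to check it against these limits locally. The following is a minimal sketch, assuming the input is FASTA text whose sequences have already been cleaned; the helper names parse_fasta and check_limits are hypothetical and are not part of SignalP. Lines before the first ">" header are ignored, so a bare pasted sequence is not handled here.

```python
MAX_SEQUENCES = 2000          # limits as stated on this page
MAX_TOTAL_RESIDUES = 200_000
MAX_SEQUENCE_LENGTH = 6000

def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def check_limits(records):
    """Return a list of problems; an empty list means within the limits."""
    problems = []
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences (max {MAX_SEQUENCES})")
    total = sum(len(seq) for _, seq in records)
    if total > MAX_TOTAL_RESIDUES:
        problems.append(f"{total} residues in total (max {MAX_TOTAL_RESIDUES})")
    for name, seq in records:
        if len(seq) > MAX_SEQUENCE_LENGTH:
            problems.append(f"{name}: {len(seq)} residues (max {MAX_SEQUENCE_LENGTH})")
    return problems

records = parse_fasta(">a example\nMKT\nLL\n>b\nACD\n")
print(records, check_limits(records))
```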

2. Customize your run

Organism group:
Eukaryotes, Gram-negative bacteria or Gram-positive bacteria.
Method:
Neural networks, hidden Markov models or both.
Graphics output:
No graphics, inline GIF, or inline GIF and EPS as links. See the Output format section for examples.
Text output:
Standard, full or short output format. See the Output format section for examples.
Sequence truncation:
Signal peptides occur at the N-terminal end of protein sequences; they are seldom longer than 45 amino acids. It is normally not meaningful to submit more than 60-70 amino acids per sequence. Therefore, the default truncation has been set to 70.
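
As an illustration of the truncation option, the sketch below keeps only the N-terminal part of a sequence. The function name truncate is hypothetical; the default of 70 and the meaning of 0 (no truncation) are taken from this page.

```python
def truncate(sequence: str, max_residues: int = 70) -> str:
    """Keep the first max_residues residues; 0 disables truncation."""
    if max_residues < 0:
        raise ValueError("max_residues must be 0 or a positive number")
    if max_residues == 0:
        return sequence  # truncation disabled
    return sequence[:max_residues]

print(len(truncate("A" * 100)))     # default truncation
print(len(truncate("A" * 100, 0)))  # truncation disabled
```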

3. Submit the job

Click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.

At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

Output format

Description of the scores
Examples of standard output
Examples of short output

DESCRIPTION OF THE SCORES

The graphical output from SignalP (neural network) comprises three different scores, C, S and Y. Two additional scores are reported in the SignalP3-NN output, namely the S-mean and the D-score, but these are only reported as numerical values.

For each organism class in SignalP (eukaryotes, Gram-negative bacteria and Gram-positive bacteria), two different neural networks are used: one for predicting the actual signal peptide and one for predicting the position of the signal peptidase I (SPase I) cleavage site. The S-score for the signal peptide prediction is reported for every single amino acid position in the submitted sequence, with high scores indicating that the corresponding amino acid is part of a signal peptide, and low scores indicating that the amino acid is part of a mature protein.

The C-score is the "cleavage site" score. For each position in the submitted sequence, a C-score is reported, which should only be significantly high at the cleavage site. The position numbering of the cleavage site often causes confusion. When a cleavage site position is referred to by a single number, the number indicates the first residue in the mature protein: a reported cleavage site between amino acids 26 and 27 means that the mature protein starts at (and includes) position 27.
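
The numbering convention can be made concrete with a small sketch. The function name split_at_cleavage_site and the toy sequence are hypothetical; the only rule taken from the text is that a single-number cleavage site position is the 1-based position of the first mature residue.

```python
def split_at_cleavage_site(sequence: str, mature_start: int):
    """Split a precursor into (signal peptide, mature protein).

    mature_start is the 1-based position of the first residue of the
    mature protein, i.e. the single number used to report a cleavage site.
    """
    if not 2 <= mature_start <= len(sequence):
        raise ValueError("mature_start must lie inside the sequence")
    signal_peptide = sequence[:mature_start - 1]  # positions 1 .. mature_start-1
    mature = sequence[mature_start - 1:]          # positions mature_start .. end
    return signal_peptide, mature

# Toy sequence (not a real protein): cleavage between positions 8 and 9
# is reported as position 9.
print(split_at_cleavage_site("MKALLVTTEIPQ", 9))
```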

Y-max is a derivative of the C-score combined with the S-score resulting in a better cleavage site prediction than the raw C-score alone. This is due to the fact that multiple high-peaking C-scores can be found in one sequence, where only one is the true cleavage site. The cleavage site is assigned from the Y-score where the slope of the S-score is steep and a significant C-score is found.

The S-mean is the average of the S-score, ranging from the N-terminal amino acid to the amino acid assigned with the highest Y-max score; the S-mean score is thus calculated over the length of the predicted signal peptide. In SignalP version 2.0, the S-mean score was used as the criterion for discriminating between secretory and non-secretory proteins.

The D-score is introduced in SignalP version 3.0 and is a simple average of the S-mean and Y-max scores. It discriminates between secretory and non-secretory proteins better than the S-mean score, which was used in SignalP versions 1 and 2.
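
The two summary scores can be sketched as follows. This is an illustration of the definitions given here, not the server's implementation; the function names are hypothetical. The averaging range 1 to (Y-max position - 1) is an assumption based on the standard output example on this page, where Y-max is at position 30 and the mean S is reported for positions 1-29.

```python
def s_mean(s_scores, ymax_position):
    """Average S-score over positions 1 .. ymax_position - 1 (1-based).

    s_scores holds one S-score per residue, starting at the N-terminus.
    """
    signal_part = s_scores[:ymax_position - 1]
    if not signal_part:
        raise ValueError("ymax_position must be at least 2")
    return sum(signal_part) / len(signal_part)

def d_score(s_mean_value, y_max_value):
    """Simple average of the S-mean and Y-max scores."""
    return (s_mean_value + y_max_value) / 2

# Values from the TXN4_HUMAN example on this page: mean S 0.852, max. Y 0.690.
print(round(d_score(0.852, 0.690), 3))
```

With the TXN4_HUMAN values the average agrees with the D value of 0.771 shown in the example output.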

For non-secretory proteins all the scores represented in the SignalP3-NN output should ideally be very low.

The hidden Markov model calculates the probability that the submitted sequence contains a signal peptide. The eukaryotic HMM also reports the probability of a signal anchor (previously called an uncleaved signal peptide). Furthermore, if a signal peptide is found, the cleavage site is assigned a probability score, together with scores for the n-region, h-region and c-region of the signal peptide.

EXAMPLES OF STANDARD OUTPUT

By default the server produces the following output for each input sequence:

Example 1: secretory protein

The example below shows the output for thioredoxin domain containing protein 4 precursor (endoplasmic reticulum protein ERp44), taken from the Swiss-Prot entry TXN4_HUMAN. The signal peptide prediction is consistent with the database annotation.

>TXN4_HUMAN
SignalP-NN result:

# data

>Sequence length = 70
# Measure  Position  Value  Cutoff  signal peptide?
  max. C   30        0.565  0.32    YES
  max. Y   30        0.690  0.33    YES
  max. S   12        0.989  0.87    YES
  mean S   1-29      0.852  0.48    YES
  D        1-29      0.771  0.43    YES
# Most likely cleavage site between pos. 29 and 30: VTT-EI

SignalP-HMM result:

# data

>TXN4_HUMAN
Prediction: Signal peptide
Signal peptide probability: 0.984
Signal anchor probability: 0.015
Max cleavage site probability: 0.962 between pos. 29 and 30
# gnuplot script for making the plot(s)

Example 2: non-secretory protein

The example below shows the output for BMP-2 inducible protein kinase (EC 2.7.1.37), a nuclear protein taken from the Swiss-Prot entry BM2K_HUMAN. No signal peptide is predicted.

>BM2K_HUMAN
SignalP-NN result:

# data

>BM2K_HUMAN length = 70
# Measure  Position  Value  Cutoff  signal peptide?
  max. C   20        0.035  0.32    NO
  max. Y   20        0.034  0.33    NO
  max. S   12        0.263  0.87    NO
  mean S   1-19      0.063  0.48    NO
  D        1-19      0.049  0.43    NO

SignalP-HMM result:

# data

>BM2K_HUMAN
Prediction: Non-secretory protein
Signal peptide probability: 0.157
Signal anchor probability: 0.023
Max cleavage site probability: 0.027 between pos. 28 and 29
# gnuplot script for making the plot(s)
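
The YES/NO column in the two examples can be reproduced by comparing each value with its cutoff. The sketch below uses the cutoffs shown in the eukaryotic examples above; the names EUK_NN_CUTOFFS and call_measures are hypothetical, and treating a value as positive only when it is strictly greater than the cutoff is an assumption, since this page does not say how ties are handled.

```python
# Cutoffs as they appear in the eukaryotic SignalP-NN examples on this page.
EUK_NN_CUTOFFS = {
    "max. C": 0.32,
    "max. Y": 0.33,
    "max. S": 0.87,
    "mean S": 0.48,
    "D": 0.43,
}

def call_measures(values, cutoffs=EUK_NN_CUTOFFS):
    """Return a YES/NO call per measure (assumption: strictly above cutoff)."""
    return {m: ("YES" if values[m] > cutoffs[m] else "NO") for m in cutoffs}

txn4 = {"max. C": 0.565, "max. Y": 0.690, "max. S": 0.989, "mean S": 0.852, "D": 0.771}
bm2k = {"max. C": 0.035, "max. Y": 0.034, "max. S": 0.263, "mean S": 0.063, "D": 0.049}
print(call_measures(txn4))
print(call_measures(bm2k))
```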

EXAMPLE OF SHORT OUTPUT

When the short output format is selected, the prediction for each submitted sequence (in a multi-sequence FASTA file) is reported on a single line, one for each FASTA entry. A two-line header is included, describing the columns.

# SignalP-NN euk predictions                                         # SignalP-HMM euk predictions
# name      Cmax  pos ?  Ymax  pos ?  Smax  pos ?  Smean ?  D     ?  # name      !  Cmax  pos ?  Sprob ?
TXN4_HUMAN  0.565  30 Y  0.690  30 Y  0.989  12 Y  0.852 Y  0.771 Y  TXN4_HUMAN  S  0.962  30 Y  0.984 Y
BM2K_HUMAN  0.035  20 N  0.034  20 N  0.263  12 N  0.063 N  0.049 N  BM2K_HUMAN  Q  0.027  29 N  0.157 N
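
Because every entry occupies one whitespace-separated line, the short format is easy to post-process. The sketch below is a hypothetical parser (parse_short_line is not part of SignalP) that assumes both methods were run, so that each data line has the 21 fields shown in the example above. The one-letter code in the "!" column is returned as-is; in the example it is S for the predicted signal peptide and Q for the non-secretory protein.

```python
def parse_short_line(line: str) -> dict:
    """Parse one data line of the short output (NN and HMM columns)."""
    f = line.split()
    if len(f) != 21:
        raise ValueError(f"expected 21 fields, got {len(f)}")
    return {
        "name": f[0],
        "nn": {
            "Cmax": float(f[1]), "Cmax_pos": int(f[2]), "Cmax_call": f[3],
            "Ymax": float(f[4]), "Ymax_pos": int(f[5]), "Ymax_call": f[6],
            "Smax": float(f[7]), "Smax_pos": int(f[8]), "Smax_call": f[9],
            "Smean": float(f[10]), "Smean_call": f[11],
            "D": float(f[12]), "D_call": f[13],
        },
        "hmm": {
            "name": f[14], "prediction": f[15],
            "Cmax": float(f[16]), "Cmax_pos": int(f[17]), "Cmax_call": f[18],
            "Sprob": float(f[19]), "Sprob_call": f[20],
        },
    }

line = ("TXN4_HUMAN 0.565 30 Y 0.690 30 Y 0.989 12 Y 0.852 Y 0.771 Y "
        "TXN4_HUMAN S 0.962 30 Y 0.984 Y")
result = parse_short_line(line)
print(result["name"], result["nn"]["D"], result["hmm"]["prediction"])
```

Header lines start with "#" and should be skipped before calling the parser.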

Scientific background

For a brief description of the SignalP method please consult the article abstracts.

Biological background

Signal peptides have long been one of the hot topics in bioinformatics. Their importance was emphasized in 1999, when Günter Blobel received the Nobel Prize in Physiology or Medicine "for the discovery that proteins have intrinsic signals that govern their transport and localization in the cell". He pointed out the importance of defined peptide motifs for targeting proteins to their site of function. The press release can be read here.
For biological background of protein localization we refer to the following pages.
Signal peptides
Signal anchors
Other secretory signals

Data sets and statistics

A very important task in machine learning methods is to obtain a clean and accurate dataset for training and testing. Bias and noise in the data set often lead to wrong predictions, which is undesirable.
Description of data sets
Dataset extraction
Dataset cleanup
Sequence logos
Length distributions
Characteristics of signal peptides
Download the training sets

Methods for prediction of signal peptides

With the current growth of sequence databases and speed of genome sequencing, accurate prediction methods have become increasingly important. For SignalP we have focused on neural networks as well as Hidden Markov Models.
Neural Networks
Hidden Markov Models

Performance and results

Any machine learning approach must be evaluated to test the predictive performance on unknown sequences.
Performance of the current prediction method
Five fold crossvalidation
Independent test set by Menne
Signal anchor prediction

Acknowledgements

The information on these pages was partly written by the original creator of SignalP, Henrik Nielsen. It has been updated with new knowledge, but most of the biological background text stems from Henrik's work.

References

Main references:

Original method (SignalP v. 1.1)
Update to SignalP v. 2.0
Update to SignalP v. 3.0 (current method)

Other publications

Original method (SignalP v. 1.1)

Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.
Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne.
Protein Engineering, 10:1-6, 1997.

We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequence. The method performs significantly better than previous prediction schemes and can easily be applied on genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server.

PMID: 9051728 (full text pdf version)

Update to SignalP v. 2.0

Prediction of signal peptides and signal anchors by a hidden Markov model.
Henrik Nielsen and Anders Krogh.
Proc Int Conf Intell Syst Mol Biol. (ISMB 6), 6:122-130, 1998.

A hidden Markov model of signal peptides has been developed. It contains submodels for the N-terminal part, the hydrophobic region, and the region around the cleavage site. For known signal peptides, the model can be used to assign objective boundaries between these three regions. Applied to our data, the length distributions for the three regions are significantly different from expectations. For instance, the assigned hydrophobic region is between 8 and 12 residues long in almost all eukaryotic signal peptides. This analysis also makes obvious the difference between eukaryotes, Gram-positive bacteria, and Gram-negative bacteria. The model can be used to predict the location of the cleavage site, which it finds correctly in nearly 70% of signal peptides in a cross-validated test--almost the same accuracy as the best previous method. One of the problems for existing prediction methods is the poor discrimination between signal peptides and uncleaved signal anchors, but this is substantially improved by the hidden Markov model when expanding it with a very simple signal anchor model.

PMID: 9783217

Update to SignalP v. 3.0

Improved prediction of signal peptides: SignalP 3.0.
Jannick Dyrløv Bendtsen, Henrik Nielsen, Gunnar von Heijne and Søren Brunak.
J. Mol. Biol., 340:783-795, 2004.

We describe improvements of the currently most popular method for prediction of classically secreted proteins, SignalP. SignalP consists of two different predictors based on neural network and hidden Markov model algorithms, and both components have been updated. Motivated by the idea that the cleavage site position and the amino acid composition of the signal peptide are correlated, new features have been included as input to the neural network. This addition, together with a thorough error-correction of a new data set, have improved the performance of the predictor significantly over SignalP version 2. In version 3, correctness of the cleavage site predictions have increased notably for all three organism groups, eukaryotes, Gram negative and Gram positive bacteria. The accuracy of cleavage site prediction has increased in the range from 6-17 % over the previous version, whereas the signal peptide discrimination improvement mainly is due to the elimination of false positive predictions, as well as the introduction of a new discrimination score for the neural network. The new method has also been benchmarked against other available methods.

PMID: 15223320 doi: 10.1016/j.jmb.2004.05.028

Other publications

Machine learning approaches to the prediction of signal peptides and other protein sorting signals.
Henrik Nielsen, Søren Brunak, and Gunnar von Heijne.
Protein Engineering, 12:3-9, 1999, Review.

Prediction of protein sorting signals from the sequence of amino acids has great importance in the field of proteomics today. Recently, the growth of protein databases, combined with machine learning approaches, such as neural networks and hidden Markov models, have made it possible to achieve a level of reliability where practical use in, for example automatic database annotation is feasible. In this review, we concentrate on the present status and future perspectives of SignalP, our neural network-based method for prediction of the most well-known sorting signal: the secretory signal peptide. We discuss the problems associated with the use of SignalP on genomic sequences, showing that signal peptide prediction will improve further if integrated with predictions of start codons and transmembrane helices. As a step towards this goal, a hidden Markov model version of SignalP has been developed, making it possible to discriminate between cleaved signal peptides and uncleaved signal anchors. Furthermore, we show how SignalP can be used to characterize putative signal peptides from an archaeon, Methanococcus jannaschii. Finally, we briefly review a few methods for predicting other protein sorting signals and discuss the future of protein sorting prediction in general.

PMID: 10065704

A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.
Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne.
Int. J. Neural Sys., 8:581-599, 1997.

We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequences. The method performs significantly better than previous prediction schemes, and can easily be applied to genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server: http://www.cbs.dtu.dk/services/SignalP/.

PMID: 10065837

Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site.
Henrik Nielsen, Jacob Engelbrecht, Gunnar von Heijne and Søren Brunak.
Proteins, 24(2):165-77, 1996.

When preparing data sets of amino acid or nucleotide sequences it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive performance of an algorithm. For some time methods for doing this have been available in the area of protein structure prediction. We have developed a similar procedure based on pair-wise alignments for sequences with functional sites. We show how a correlation coefficient between sequence similarity and functional homology can be used to compare the efficiency of different similarity measures and choose a nonarbitrary threshold value for excluding redundant sequences. The impact of the choice of scoring matrix used in the alignments is examined. We demonstrate that the parameter determining the quality of the correlation is the relative entropy of the matrix, rather than the assumed (PAM or identity) substitution mode. Results are presented for the case of prediction of cleavage sites in signal peptides. By inspection of the false positives, several errors in the database were found. The procedure presented may be used as a general outline for finding a problem-specific similarity measure and threshold value for analysis of other functional amino acid or nucleotide sequence patterns.

PMID: 8820484

From sequence to sorting: Prediction of signal peptides.
Henrik Nielsen.
Ph.D. thesis, defended at Department of Biochemistry, Stockholm University, Sweden, May 25, 1999.

In the present age of genome sequencing, a vast number of predicted genes are initially known only by their putative nucleotide sequence. The newly established field of bioinformatics is concerned with the computational prediction of structural and functional properties of genes and the proteins they encode, based on their nucleotide and amino acid sequences.
Since one of the crucial properties of a protein is its subcellular location, prediction of protein sorting is an important question in bioinformatics. A fundamental distinction in protein sorting is that between secretory and non-secretory proteins, determined by a cleavable N-terminal sorting signal, the secretory signal peptide.
The main part of this thesis, including four of the six papers, concerns prediction of secretory signal peptides in both eukaryotic and bacterial data using two machine learning techniques: artificial neural networks and hidden Markov models. A central result is the SignalP prediction method, which has been made available as a World Wide Web server and is very widely used.
Two additional prediction methods are also included, with one paper each. ChloroP predicts chloroplast transit peptides, another cleavable N-terminal sorting signal; while NetStart predicts start codons in eukaryotic genes. For prediction of all N-terminal signals, the assignment of correct start codon can be critical, which is why prediction of translation initiation from the nucleotide sequence is also important for protein sorting prediction.
This thesis comprises a detailed review of the molecular biology of protein secretion, a short introduction to the most important machine learning algorithms in bioinformatics, and a critical review of existing methods for protein sorting prediction. In addition, it contains general treatment of the principles of data set construction and performance evaluation for prediction methods in bioinformatics.

Version history

Please click on the version number to activate the corresponding server where available.

4.1	The current server. New in this version:
For the web page, an option to set the D-score cutoff values so that the sensitivity is the same as that of SignalP 3.0.
Option included to set the minimum cleavage site position i.e. Ymax position - default value is 10.
For the signalp package an option has been included to specify a temporary directory (-T dir).
For the signalp package an option has been included to show signalp version (-V).
Documentation rewritten.
Main publication: SignalP 4.0: discriminating signal peptides from transmembrane regions. Thomas Nordahl Petersen, Søren Brunak, Gunnar von Heijne and Henrik Nielsen. Nature Methods, 8:785-786, 2011.
4.0	New in this version: Improved discrimination between signal peptides and transmembrane regions. No HMM method - only one prediction. Main publication: SignalP 4.0: discriminating signal peptides from transmembrane regions Thomas Nordahl Petersen, Søren Brunak, Gunnar von Heijne and Henrik Nielsen. Nature Methods, 8:785-786, 2011.
3.0	New in this version: D-score. Improved quality of prediction. Main publication: Improved prediction of signal peptides: SignalP 3.0. Jannick Dyrløv Bendtsen, Henrik Nielsen, Gunnar von Heijne and Søren Brunak. J. Mol. Biol., 340:783-795, 2004.
2.0	New in this version:
Incorporation of a hidden Markov model version: SignalP V2.0 comprises two signal peptide prediction methods, SignalP-NN (based on neural networks, corresponding to SignalP V1.1) and SignalP-HMM (based on hidden Markov models). For eukaryotic data, SignalP-HMM has a substantially improved discrimination between signal peptides and uncleaved signal anchors, but it has a slightly lower accuracy in predicting the precise location of the cleavage site. The user can choose whether to run SignalP-NN, SignalP-HMM, or both.
Retraining of the neural networks: SignalP-NN in SignalP V2.0 is trained on a newer data set derived from SWISS-PROT rel. 35 (instead of rel. 29 as in SignalP V1.1).
Graphics integrated in the output: SignalP V2.0 shows signal peptide and cleavage site scores for each position as plots in GIF format on the output page. The plots provide more information than the prediction summary, e.g. about possible cleavage sites other than the strongest prediction.
Signal peptide region assignment: SignalP-HMM provides not only a prediction of the presence of a signal peptide and the position of the cleavage site, but also an approximate assignment of n-, h- and c-regions within the signal peptide. These are shown in the graphical output as probabilities for each position being in one of these three regions.
Automatic truncation: in SignalP V1.1, we recommended that you should submit only the N-terminal part of each protein, not more than 50-70 amino acids. SignalP V2.0 now offers to truncate your sequences automatically.
Main publication: Prediction of signal peptides and signal anchors by a hidden Markov model. Henrik Nielsen and Anders Krogh. Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology (ISMB 6), AAAI Press, Menlo Park, California, pp. 122-130, 1998.
1.1	The original server: the method based on artificial neural networks. Main publication: Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne. Protein Engineering, 10:1-6, 1997.

Software Downloads

Version 6.0h

Version 5.0b

Linux
Darwin

Version 4.1g

Version 3.0

Version 2.0

GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
