Usage instructions
1. Specify the input sequences
All the input sequences must be in one-letter amino acid
code. The allowed alphabet (not case sensitive) is as follows:
A C D E F G H I K L M N P Q R S T V W Y and X (unknown)
All the alphabetic symbols not in the allowed alphabet
will be converted to X before processing. All the non-alphabetic
symbols, including white space and digits, will be ignored.
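The conversion rules above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the server's own code; the function name `clean_sequence` and the constant `ALLOWED` are hypothetical.

```python
# Hypothetical helper illustrating the input rules described above;
# this is not the server's actual implementation.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYX")


def clean_sequence(raw: str) -> str:
    """Return the sequence as it would look after the stated conversion."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            # White space, digits and other non-alphabetic symbols are ignored.
            continue
        upper = ch.upper()  # the alphabet is not case sensitive
        # Alphabetic symbols outside the allowed alphabet become X.
        cleaned.append(upper if upper in ALLOWED else "X")
    return "".join(cleaned)


print(clean_sequence("mk b1 llv*Z"))  # prints MKXLLVX
```

In this example, `b` and `Z` are alphabetic but not in the allowed alphabet, so both become X, while the spaces, the digit and the asterisk are dropped.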
The sequences can be input in the following two ways:
- Paste a single sequence (just the amino acids) or a number of sequences in
FASTA format into the upper window of the main server page.
- Select a FASTA file on your local disk, either by typing the file name into
the lower window or by browsing the disk.
Both ways can be employed at the same time: all the specified sequences will
be processed. However, a single submission may contain no more than 2,000
sequences and 200,000 amino acids in total, and no sequence may be longer
than 6,000 amino acids.
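Before submitting a large batch, it may be convenient to check these limits locally. The following is a minimal sketch: the function name `check_submission` and the constant names are hypothetical, and it assumes that lengths are counted on sequences that have already been reduced to their amino acid letters.

```python
from typing import List

# Limits as stated in the text above; the constant names are hypothetical.
MAX_SEQUENCES = 2_000
MAX_TOTAL_RESIDUES = 200_000
MAX_SEQUENCE_LENGTH = 6_000


def check_submission(sequences: List[str]) -> List[str]:
    """Return a list of human-readable problems (empty if within limits)."""
    problems = []
    if len(sequences) > MAX_SEQUENCES:
        problems.append(
            f"too many sequences: {len(sequences)} > {MAX_SEQUENCES}"
        )
    total = sum(len(seq) for seq in sequences)
    if total > MAX_TOTAL_RESIDUES:
        problems.append(
            f"too many amino acids in total: {total} > {MAX_TOTAL_RESIDUES}"
        )
    for index, seq in enumerate(sequences, start=1):
        if len(seq) > MAX_SEQUENCE_LENGTH:
            problems.append(
                f"sequence {index} is too long: {len(seq)} > {MAX_SEQUENCE_LENGTH}"
            )
    return problems


print(check_submission(["MKTLLVAAL", "A" * 6001]))
```

An empty list means the batch respects all three limits; otherwise each entry names the limit that was exceeded.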
2. Customize your run
- Organism group:
Eukaryotes, Gram-negative bacteria or Gram-positive bacteria.
- Method:
Neural networks, hidden Markov models or both.
- Graphics output:
No graphics, inline GIF, or inline GIF with
EPS as links. See the Output format for examples.
- Text output:
Standard, full or short output format. See the Output format for examples.
- Sequence truncation:
Signal peptides occur at the N-terminal end of protein sequences;
they are seldom longer than 45 amino acids. It is normally not meaningful
to submit more than 60-70 amino acids per sequence. Therefore, the default
truncation length has been set to 70 amino acids.
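The effect of truncation can be pictured as keeping only the N-terminal part of each sequence. The sketch below assumes that truncation simply keeps the first residues up to the chosen length; the function name `truncate_n_terminal` is hypothetical and the server's own handling may differ in detail.

```python
DEFAULT_TRUNCATION = 70  # default value stated in the text above


def truncate_n_terminal(sequence: str, length: int = DEFAULT_TRUNCATION) -> str:
    """Keep only the first `length` residues (the N-terminal end)."""
    if length < 1:
        raise ValueError("truncation length must be at least 1")
    # Sequences shorter than `length` are returned unchanged.
    return sequence[:length]


protein = "M" + "A" * 99  # a made-up 100-residue sequence
print(len(truncate_n_terminal(protein)))  # prints 70
```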
3. Submit the job
Click on the "Submit" button. The status of your job (either 'queued'
or 'running') will be displayed and constantly updated until the job
terminates and the server output appears in the browser window.
At any time during the wait you may enter your e-mail address and simply leave
the window. Your job will continue; you will be notified by e-mail when it has
terminated. The e-mail message will contain the URL under which the results are
stored; they will remain on the server for 24 hours for you to collect them.
Scientific background
For a brief description of the SignalP method please consult the article abstracts.
Biological background
Signal peptides have long been one of the hot topics in bioinformatics.
Their importance was emphasized in 1999 when Günter Blobel received the
Nobel Prize in Physiology or Medicine for the discovery that "proteins have
intrinsic signals that govern their transport and localization in the cell".
He pointed out the importance of defined peptide motifs for
targeting proteins to their site of function.
The press release can be read
here
For the biological background of protein localization, see the following
pages:
Signal peptides
Signal anchors
Other secretory signals
Data sets and statistics
A very important task in machine learning is to obtain a clean and accurate data set for training
and testing. Bias and noise in the data set often lead to wrong predictions.
Description of data sets
Dataset extraction
Dataset cleanup
Sequence logos
Length distributions
Characteristics of signal peptides
Download the training sets
Methods for prediction of signal peptides
With the current growth of sequence databases and speed of genome sequencing,
accurate prediction methods have become increasingly important.
For SignalP we have focused on neural networks and hidden Markov models.
Neural Networks
Hidden Markov Models
Performance and results
Any machine learning approach must be evaluated to test the predictive performance on unknown sequences.
Performance of the current prediction method
Five-fold cross-validation
Independent test set by Menne
Signal anchor prediction
Acknowledgements
The information on these pages was partly written by the initial creator of
SignalP, Henrik Nielsen. The information has been updated with new knowledge,
but most of the biological background text stems from Henrik's work.
References
Main references:
Original method (SignalP v. 1.1)
Identification of prokaryotic and eukaryotic signal peptides
and prediction of their cleavage sites.
Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von
Heijne.
Protein Engineering, 10:1-6,
1997.
We have developed a new method for the identification of signal peptides and
their cleavage sites based on neural networks trained on separate sets of
prokaryotic and eukaryotic sequence. The method performs significantly better
than previous prediction schemes and can easily be applied on genome-wide data
sets. Discrimination between cleaved signal peptides and uncleaved N-terminal
signal-anchor sequences is also possible, though with lower precision.
Predictions can be made on a publicly available WWW server.
PMID: 9051728
Update to SignalP v. 2.0
Prediction of signal peptides and signal anchors by a hidden Markov
model.
Henrik Nielsen and Anders Krogh.
Proc Int Conf Intell Syst Mol Biol. (ISMB 6), 6:122-130,
1998.
A hidden Markov model of signal peptides has been developed. It contains
submodels for the N-terminal part, the hydrophobic region, and the region
around the cleavage site. For known signal peptides, the model can be used to
assign objective boundaries between these three regions. Applied to our data,
the length distributions for the three regions are significantly different from
expectations. For instance, the assigned hydrophobic region is between 8 and 12
residues long in almost all eukaryotic signal peptides. This analysis also
makes obvious the difference between eukaryotes, Gram-positive bacteria, and
Gram-negative bacteria. The model can be used to predict the location of the
cleavage site, which it finds correctly in nearly 70% of signal peptides in a
cross-validated test--almost the same accuracy as the best previous method. One
of the problems for existing prediction methods is the poor discrimination
between signal peptides and uncleaved signal anchors, but this is substantially
improved by the hidden Markov model when expanding it with a very simple signal
anchor model.
PMID: 9783217
Update to SignalP v. 3.0
Improved prediction of signal peptides: SignalP 3.0.
Jannick Dyrløv Bendtsen, Henrik Nielsen,
Gunnar von Heijne and Søren Brunak.
J. Mol. Biol., 340:783-795,
2004.
We describe improvements of the currently most
popular method for prediction of classically secreted proteins,
SignalP. SignalP consists of two different predictors based on
neural network and hidden Markov model algorithms, and both
components have been updated. Motivated by the idea that the
cleavage site position and the amino acid composition of the
signal peptide are correlated, new features have been included as
input to the neural network. This addition, together with a
thorough error-correction of a new data set, have improved the
performance of the predictor significantly over SignalP version 2.
In version 3, correctness of the cleavage site predictions have
increased notably for all three organism groups, eukaryotes, Gram
negative and Gram positive bacteria. The accuracy of cleavage site
prediction has increased in the range from 6-17 % over the
previous version, whereas the signal peptide discrimination
improvement mainly is due to the elimination of false positive
predictions, as well as the introduction of a new discrimination
score for the neural network. The new method has also been
benchmarked against other available methods.
PMID: 15223320
doi: 10.1016/j.jmb.2004.05.028
Other publications
Machine learning approaches to the prediction of signal peptides
and other protein sorting signals.
Henrik Nielsen, Søren Brunak, and Gunnar von Heijne.
Protein Engineering, 12:3-9, 1999, Review.
Prediction of protein sorting signals from the sequence of amino acids has
great importance in the field of proteomics today. Recently, the growth of
protein databases, combined with machine learning approaches, such as neural
networks and hidden Markov models, have made it possible to achieve a level of
reliability where practical use in, for example automatic database annotation
is feasible. In this review, we concentrate on the present status and future
perspectives of SignalP, our neural network-based method for prediction of the
most well-known sorting signal: the secretory signal peptide. We discuss the
problems associated with the use of SignalP on genomic sequences, showing that
signal peptide prediction will improve further if integrated with predictions
of start codons and transmembrane helices. As a step towards this goal, a
hidden Markov model version of SignalP has been developed, making it possible
to discriminate between cleaved signal peptides and uncleaved signal anchors.
Furthermore, we show how SignalP can be used to characterize putative signal
peptides from an archaeon, Methanococcus jannaschii. Finally, we briefly review
a few methods for predicting other protein sorting signals and discuss the
future of protein sorting prediction in general.
PMID: 10065704
A neural network method for identification of prokaryotic and eukaryotic
signal peptides and prediction of their cleavage sites.
Henrik Nielsen, Jacob Engelbrecht, Søren Brunak
and Gunnar von Heijne.
Int. J. Neural Sys., 8:581-599, 1997.
We have developed a new method for the identification of signal peptides and
their cleavage sites based on neural networks trained on separate sets of
prokaryotic and eukaryotic sequences. The method performs significantly better
than previous prediction schemes, and can easily be applied to genome-wide data
sets. Discrimination between cleaved signal peptides and uncleaved N-terminal
signal-anchor sequences is also possible, though with lower precision.
Predictions can be made on a publicly available WWW server:
http://www.cbs.dtu.dk/services/SignalP/.
PMID: 10065837
Defining a similarity threshold for a functional protein sequence pattern:
the signal peptide cleavage site.
Henrik Nielsen, Jacob Engelbrecht, Gunnar von Heijne
and Søren Brunak.
Proteins, 24(2):165-77, 1996.
When preparing data sets of amino acid or nucleotide sequences it is
necessary to exclude redundant or homologous sequences in order to avoid
overestimating the predictive performance of an algorithm. For some time
methods for doing this have been available in the area of protein structure
prediction. We have developed a similar procedure based on pair-wise
alignments for sequences with functional sites. We show how a correlation
coefficient between sequence similarity and functional homology can be used
to compare the efficiency of different similarity measures and choose a
nonarbitrary threshold value for excluding redundant sequences. The impact
of the choice of scoring matrix used in the alignments is examined. We
demonstrate that the parameter determining the quality of the correlation is
the relative entropy of the matrix, rather than the assumed (PAM or
identity) substitution mode. Results are presented for the case of
prediction of cleavage sites in signal peptides. By inspection of the false
positives, several errors in the database were found. The procedure
presented may be used as a general outline for finding a problem-specific
similarity measure and threshold value for analysis of other functional
amino acid or nucleotide sequence patterns.
PMID: 8820484
From sequence to sorting: Prediction of signal peptides.
Henrik Nielsen.
Ph.D. thesis, defended at Department of Biochemistry,
Stockholm University, Sweden, May 25, 1999.
In the present age of genome sequencing, a vast number of predicted
genes are initially known only by their putative nucleotide
sequence. The newly established field of bioinformatics is concerned
with the computational prediction of structural and functional
properties of genes and the proteins they encode, based on their
nucleotide and amino acid sequences.
Since one of the crucial properties of a protein is its subcellular
location, prediction of protein sorting is an important question in
bioinformatics. A fundamental distinction in protein sorting is that
between secretory and non-secretory proteins, determined by a
cleavable N-terminal sorting signal, the secretory signal peptide.
The main part of this thesis, including four of the six papers,
concerns prediction of secretory signal peptides in both eukaryotic
and bacterial data using two machine learning techniques: artificial
neural networks and hidden Markov models. A central result is the
SignalP prediction method, which has been made available as a World
Wide Web server and is very widely used.
Two additional prediction methods are also included, with one paper
each. ChloroP predicts chloroplast transit peptides, another
cleavable N-terminal sorting signal; while NetStart predicts start
codons in eukaryotic genes. For prediction of all N-terminal signals,
the assignment of correct start codon can be critical, which is why
prediction of translation initiation from the nucleotide sequence is
also important for protein sorting prediction.
This thesis comprises a detailed review of the molecular biology of
protein secretion, a short introduction to the most important machine
learning algorithms in bioinformatics, and a critical review of
existing methods for protein sorting prediction. In addition, it
contains general treatment of the principles of data set construction
and performance evaluation for prediction methods in bioinformatics.
Version history
Please click on the version number to activate the corresponding server where available.
4.1
The current server. New in this version:
- For the web page, an option to set the D-score cutoff values so
that the sensitivity is the same as that of SignalP 3.0.
- An option to set the minimum cleavage site position (i.e. the Ymax position); the default value is 10.
- For the signalp package an option has been included to specify a temporary directory (-T dir).
- For the signalp package an option has been included to show signalp version (-V).
- Documentation rewritten.
Main publication:
- SignalP 4.0: discriminating signal peptides from transmembrane regions
Thomas Nordahl Petersen, Søren Brunak,
Gunnar von Heijne and Henrik Nielsen.
Nature Methods, 8:785-786, 2011.
4.0
New in this version:
- Improved discrimination between signal peptides and transmembrane regions.
- No HMM method - only one prediction.
Main publication:
- SignalP 4.0: discriminating signal peptides from transmembrane regions
Thomas Nordahl Petersen, Søren Brunak,
Gunnar von Heijne and Henrik Nielsen.
Nature Methods, 8:785-786, 2011.
3.0
New in this version:
- D-score. Improved quality of prediction.
Main publication:
- Improved prediction of signal peptides: SignalP 3.0.
Jannick Dyrløv Bendtsen, Henrik Nielsen,
Gunnar von Heijne and Søren Brunak.
J. Mol. Biol., 340:783-795, 2004.
2.0
New in this version:
- Incorporation of a hidden Markov model version:
SignalP V2.0 comprises two signal peptide prediction methods,
SignalP-NN (based on neural networks, corresponding to SignalP V1.1)
and SignalP-HMM (based on hidden Markov models). For eukaryotic data,
SignalP-HMM has a substantially improved discrimination between signal
peptides and uncleaved signal anchors, but it has a slightly lower
accuracy in predicting the precise location of the cleavage site.
The user can choose whether to run SignalP-NN, SignalP-HMM, or both.
- Retraining of the neural networks:
SignalP-NN in SignalP V2.0 is trained on a newer data set derived
from SWISS-PROT rel. 35 (instead of rel. 29 as in SignalP V1.1).
- Graphics integrated in the output:
SignalP V2.0 shows signal peptide and cleavage site scores for each
position as plots in GIF format on the output page. The plots provide
more information than the prediction summary, e.g. about possible
cleavage sites other than the strongest prediction.
- Signal peptide region assignment:
SignalP-HMM provides not only a prediction of the presence of a signal
peptide and the position of the cleavage site, but also an approximate
assignment of n-, h- and c-regions within the signal peptide. These are
shown in the graphical output as probabilities for each position being
in one of these three regions.
- Automatic truncation:
In SignalP V1.1, we recommended that you submit only the
N-terminal part of each protein, not more than 50-70 amino acids.
SignalP V2.0 now offers to truncate your sequences automatically.
Main publication:
- Prediction of signal peptides and signal anchors by a hidden
Markov model.
Henrik Nielsen and Anders Krogh.
Proceedings of the Sixth International Conference on Intelligent
Systems for Molecular Biology (ISMB 6),
AAAI Press, Menlo Park, California, pp. 122-130, 1998.
1.1
The original server: the method based on artificial
neural networks.
Main publication:
- Identification of prokaryotic and eukaryotic signal peptides
and prediction of their cleavage sites.
Henrik Nielsen, Jacob Engelbrecht, Søren Brunak
and Gunnar von Heijne.
Protein Engineering, 10:1-6, 1997.