In the mid-1980s, Gunnar von Heijne introduced the first approach for predicting signal peptides (Nucleic Acids Res. 1986, 14, 4683-4690). The method was based on a weight matrix, an approach that is still in use. With increasing computational power, various machine learning approaches entered the prediction scene, and in 1996 SignalP version 1 was introduced to the academic community. SignalP version 1 was based on an artificial neural network, which in version 2 was extended with a hidden Markov model.
In SignalP version 3 we have used data from SWISS-PROT version 40. The current data set has been thoroughly cleaned using different methods. From the training data we have removed spurious residues around the cleavage site, together with other obviously wrongly annotated signal peptide sequences. The entire cleaning procedure is described in the latest SignalP paper and on the datasets page.
Artificial neural networks consist of a large number of independent computational units
(so-called neurons) that are able to influence the computations of each other. A neuron has
several inputs, and one output. The output from a neuron (a real number between 0 and 1) is
calculated as a function of a weighted sum of the inputs. Several neurons can be connected
(with the output of one neuron being the input of another neuron) thus forming a neural
network. When a network is presented with an input (e.g. a string of real numbers that
represent a sequence of amino acids) it will calculate an output that can be interpreted as
a classification of the input (e.g. ``is the input sequence a signal peptide or not?''). It
is possible to ``teach'' a neural network how to make a classification by presenting it with
a set of known inputs (the training set) several times, and simultaneously modifying the
weights associated with each input in such a way that the difference between the desired
output and the actual output is minimized.
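As a minimal sketch of the idea (with hypothetical weights, not values from any trained SignalP network), the output of a sigmoid neuron and of a tiny feed-forward network with one hidden layer could be computed like this:

```python
import math

def neuron_output(inputs, weights, bias):
    """A weighted sum of the inputs passed through a sigmoid,
    giving a real number between 0 and 1."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def network_output(inputs, hidden_params, out_weights, out_bias):
    """A tiny feed-forward network: the outputs of the hidden
    neurons become the inputs of the single output neuron."""
    hidden = [neuron_output(inputs, w, b) for w, b in hidden_params]
    return neuron_output(hidden, out_weights, out_bias)

# Example with invented weights (purely illustrative):
x = [0.2, 0.8, 0.5]
hidden_params = [([1.0, -2.0, 0.5], 0.1), ([-1.5, 1.0, 2.0], -0.2)]
y = network_output(x, hidden_params, [2.0, -1.0], 0.0)
```

Training would then consist of adjusting the weights and biases so that the output approaches the desired classification for each example in the training set.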
Simple neural networks of the kind used here (feed-forward networks) are closely related to
the weight matrix method (von Heijne 1986), the two main differences being (1) that the
weights in neural networks are found by training rather than by statistical analysis, and (2)
that neural networks are able to solve non-linear classification problems by introducing a
layer of ``hidden neurons'' between input and output.
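For comparison, the weight matrix method scores a sequence window simply by summing position-specific weights. A toy example (the weights are made up, and real matrices cover the whole region around the cleavage site rather than three positions) might look like:

```python
# Hypothetical position-specific weight matrix over a 3-residue window.
# Each column maps a residue to a score for that window position.
weight_matrix = [
    {"A": 0.5, "L": 1.2, "G": -0.3},   # position -3
    {"A": -0.1, "L": 0.4, "G": 0.2},   # position -2
    {"A": 1.0, "L": -0.8, "G": 0.1},   # position -1
]

def score_window(window):
    """Sum of position-specific weights; higher scores mean the
    window looks more like a cleavage site. Unlisted residues get
    an (arbitrary) penalty of -1.0."""
    return sum(col.get(res, -1.0) for col, res in zip(weight_matrix, window))
```

Because the score is a plain sum over positions, such a matrix is a linear classifier; the hidden layer is what lets a neural network go beyond this.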
In this study, the output layer consisted of only one neuron which classified the sequence
windows in two ways: Cleavage sites vs. other sequence positions, and signal peptide vs.
non-signal-peptide. In the latter case, negative examples included both the first 70
positions of non-secretory proteins, and the first 30 positions of the mature part of
secretory proteins.
For each of the five data sets and two classification problems, we tested several networks
with different numbers of input positions and hidden units, and selected the smallest
network that reached the optimal performance. While cleavage site networks worked best with
asymmetric windows (i.e. windows including more positions upstream than downstream of the
cleavage site), signal peptide networks worked best with symmetric windows.
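Extracting such an asymmetric window around a candidate cleavage site could be sketched like this (the window sizes and padding symbol are illustrative, not the actual SignalP parameters):

```python
def window_around(seq, pos, upstream, downstream, pad="X"):
    """Extract a window of `upstream` residues before position `pos`
    and `downstream` residues from `pos` onwards, padding with `pad`
    when the window runs off either end of the sequence."""
    left = seq[max(0, pos - upstream):pos].rjust(upstream, pad)
    right = seq[pos:pos + downstream].ljust(downstream, pad)
    return left + right

# e.g. a cleavage-site window with more upstream than downstream context:
w = window_around("MKLLVLLAAG", 8, upstream=5, downstream=2)
```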
As a second machine learning method, a hidden Markov model (HMM) was developed for SignalP version 2.
In addition to predicting signal peptides, the HMM also predicts signal anchors. The figure
below shows the architecture of the HMM.
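As an illustration of how such an HMM decodes a sequence into regions, here is a toy Viterbi decoder over invented n-, h- and c-region states followed by a mature state (all parameters are hypothetical; the real SignalP model architecture is far more detailed):

```python
import math

# Toy state space and log transition probabilities (invented values).
states = ["n", "h", "c", "mature"]
trans = {
    "n": {"n": math.log(0.7), "h": math.log(0.3)},
    "h": {"h": math.log(0.8), "c": math.log(0.2)},
    "c": {"c": math.log(0.6), "mature": math.log(0.4)},
    "mature": {"mature": 0.0},
}

def emit(state, res):
    """Hypothetical log emission probabilities: hydrophobic residues
    are favoured in the h-region."""
    hydrophobic = res in "AVLIFMWC"
    if state == "h":
        return math.log(0.8 if hydrophobic else 0.2)
    return math.log(0.4 if hydrophobic else 0.6)

def viterbi(seq):
    """Most probable state path; the c -> mature transition in the
    path marks the predicted cleavage site."""
    v = {s: float("-inf") for s in states}
    v["n"] = emit("n", seq[0])          # paths must start in the n-region
    back = []
    for res in seq[1:]:
        nv, ptr = {}, {}
        for s in states:
            best, arg = float("-inf"), None
            for p in states:
                sc = v[p] + trans[p].get(s, float("-inf"))
                if sc > best:
                    best, arg = sc, p
            nv[s] = best + emit(s, res)
            ptr[s] = arg
        v, back = nv, back + [ptr]
    s = max(v, key=v.get)               # backtrack from the best final state
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return path[::-1]
```

Unlike the neural networks, which classify one window at a time, the HMM assigns a state to every position in one consistent path, which is what makes it natural to model signal anchors as an alternative branch of the model.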