Output format

The output is essentially in GFF format. The default (long) output looks like this:

# ANIA_NEIGO SpII score=29.6052 margin=11.2327 cleavage=18-19 Pos+2=G
# Cut-off=-3
ANIA_NEIGO	LipoP1.0:Best	SpII	1	1	29.6052
ANIA_NEIGO	LipoP1.0:Margin	SpII	1	1	11.2327
ANIA_NEIGO	LipoP1.0:Class	SpI	1	1	18.3725
ANIA_NEIGO	LipoP1.0:Class	CYT	1	1	-0.200913
ANIA_NEIGO	LipoP1.0:Signal	CleavII	18	19	29.6052	# FALAA|CGGEQ Pos+2=G
ANIA_NEIGO	LipoP1.0:Signal	CleavI	24	25	18.0333	# GGEQA|AQAPA
ANIA_NEIGO	LipoP1.0:Signal	CleavI	20	21	15.9259	# LAACG|GEQAA
ANIA_NEIGO	LipoP1.0:Signal	CleavI	26	27	12.0794	# EQAAQ|APAET
ANIA_NEIGO	LipoP1.0:Signal	CleavI	25	26	11.4077	# GEQAA|QAPAE
ANIA_NEIGO	LipoP1.0:Signal	CleavI	27	28	9.40252	# QAAQA|PAETP

(output truncated)

The first line, which is the only line if short output is chosen, summarizes the best prediction. In the example, the best prediction is a lipoprotein (SpII) with a cleavage site between amino acids 18 and 19 and a glycine (G) in position +2 after the cleavage site. The second line gives the cut-off used. The columns of the remaining lines contain:

Sequence ID
Type of prediction. Best is the highest-scoring class, Margin is the difference between the best score and the second-best score, Class lines give the scores of the other classes, and Signal lines contain predicted cleavage sites.
Feature type, see below
Start location in the sequence. For lines with a class prediction it is always 1. For cleavage sites it is the position of the last amino acid before the predicted cleavage site, i.e. the last residue of the signal peptide.
End location. As above, except that for cleavage sites it is the position of the first amino acid after the cleavage site.
Score. For the "Margin" type it is the difference between the best and the second-best class score. Otherwise it is the log-odds score.
For cleavage sites the ±5 context is shown after the #, and for lipoprotein cleavage sites the amino acid in position +2 is also shown (which may determine whether the lipoprotein is attached to the inner or outer membrane, see below).
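The column layout above can be read with a few lines of Python. This is only a sketch, not part of LipoP: the names LipoPRecord and parse_record are made up for illustration, and it assumes that sequence IDs contain neither whitespace nor a # character.

```python
from typing import NamedTuple, Optional


class LipoPRecord(NamedTuple):
    """One record line of the long output (hypothetical container)."""
    seq_id: str
    kind: str               # "Best", "Margin", "Class" or "Signal"
    feature: str            # e.g. "SpII", "CYT", "CleavI"
    start: int
    end: int
    score: float
    comment: Optional[str]  # text after '#', present on Signal lines


def parse_record(line: str) -> LipoPRecord:
    """Parse one record line; summary lines starting with '#' are not handled."""
    # Everything after the first '#' is the optional context comment.
    body, _, comment = line.partition("#")
    fields = body.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    seq_id, source, feature, start, end, score = fields
    # The second column looks like "LipoP1.0:Signal"; keep the part after ':'.
    kind = source.split(":", 1)[-1]
    return LipoPRecord(seq_id, kind, feature, int(start), int(end),
                       float(score), comment.strip() or None)


rec = parse_record(
    "ANIA_NEIGO\tLipoP1.0:Signal\tCleavII\t18\t19\t29.6052\t# FALAA|CGGEQ Pos+2=G"
)
print(rec.kind, rec.feature, rec.start, rec.end)  # Signal CleavII 18 19
```

The two header lines that start with # (summary and cut-off) have a different layout and would need to be skipped or parsed separately.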

These 4 classes are predicted:

SpI: signal peptide (signal peptidase I)

SpII: lipoprotein signal peptide (signal peptidase II)

TMH: N-terminal transmembrane helix. This is generally not a very reliable prediction and should be tested. This part of the model is mainly there to avoid transmembrane helices being falsely predicted as signal peptides.

CYT: cytoplasmic. It really just means all the rest.

For technical reasons (see paper) the score for CYT is always the same.

These signals are predicted:

CleavI: cleavage sites for signal peptidase I.

CleavII: cleavage sites for signal peptidase II.

Plot of scores

A plot of the cleavage site scores is made in PostScript unless you have chosen the short output format or disabled the plot. The score of each predicted cleavage site is shown, with different colors for SpI and SpII sites. The scores of the classes scoring higher than the cut-off are shown to the left. The PostScript is converted to a PNG image and included in the HTML output (if selected).

Below the plot there are links to

The plot in Encapsulated PostScript
A script for making the plot in gnuplot.

If there are only a few predicted cleavage sites, no plot is made.

It is shown in the paper that the margin, i.e., the difference between the best and the second best prediction, correlates well with the number of falsely predicted signal peptides.
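As an illustration of how the margin relates to the class scores, the sketch below recomputes it from the Best and Class lines of the example output. The function name margin is hypothetical, and the input is assumed to contain only the class scores that were actually printed.

```python
from typing import Mapping


def margin(class_scores: Mapping[str, float]) -> float:
    """Difference between the best and the second-best class score."""
    ranked = sorted(class_scores.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two class scores")
    return ranked[0] - ranked[1]


# Scores taken from the Best and Class lines of the example above.
scores = {"SpII": 29.6052, "SpI": 18.3725, "CYT": -0.200913}
print(round(margin(scores), 4))  # 11.2327
```

The result agrees with the value on the Margin line of the example.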

An aspartic acid (D) in position +2 after the cleavage site of a lipoprotein means that the lipoprotein is attached to the inner membrane, whereas most other lipoproteins are attached to the outer membrane. Therefore we report the amino acid in this position for predicted lipoproteins. See e.g. Seydel et al. (1999) Molecular Microbiology 34: 810-821 for more details.
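The +2 rule can be applied mechanically to the summary line. The following is a rough sketch rather than anything LipoP provides: the function names pos2_residue and membrane_hint are invented here, and the rule is the simple "D means inner membrane" heuristic described above, so the result is a hint and not a prediction.

```python
import re
from typing import Optional


def pos2_residue(summary_line: str) -> Optional[str]:
    """Extract the residue from a 'Pos+2=X' field, if present."""
    match = re.search(r"Pos\+2=([A-Za-z])", summary_line)
    return match.group(1).upper() if match else None


def membrane_hint(residue: str) -> str:
    """Apply the +2 rule: D suggests inner membrane, anything else outer."""
    if residue.upper() == "D":
        return "inner membrane"
    return "probably outer membrane"


line = "# ANIA_NEIGO SpII score=29.6052 margin=11.2327 cleavage=18-19 Pos+2=G"
residue = pos2_residue(line)
if residue is not None:
    print(residue, membrane_hint(residue))  # G probably outer membrane
```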

The cross-validation test reported in the paper gave the results shown in the table below. The highest-scoring class was taken as the prediction. For signal peptides, 309 out of 328 were correctly classified as such, whereas 2 were classified as lipoproteins, 14 as cytoplasmic and 3 as having an N-terminal transmembrane helix. Of 63 lipoproteins, 61 were classified correctly.

Correct class	Predicted class
Correct class	SPaseI	SPaseII	Cytoplasmic	TMH	Total
SPaseI	309	2	14	3	328
SPaseII	2	61	0	0	63
Cytoplasmic	5	1	382	0	388
TMH	8	0	21	142	171
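The fraction of correctly classified sequences per class follows directly from the table. The sketch below copies the counts from the table (rows are the correct class, columns the predicted class); the names CLASSES, CONFUSION, recall and accuracy are illustrative only.

```python
CLASSES = ["SPaseI", "SPaseII", "Cytoplasmic", "TMH"]

# Rows: correct class. Columns: predicted class, in the order of CLASSES.
CONFUSION = {
    "SPaseI":      [309, 2, 14, 3],
    "SPaseII":     [2, 61, 0, 0],
    "Cytoplasmic": [5, 1, 382, 0],
    "TMH":         [8, 0, 21, 142],
}


def recall(true_class: str) -> float:
    """Fraction of sequences of a class that were predicted as that class."""
    row = CONFUSION[true_class]
    return row[CLASSES.index(true_class)] / sum(row)


def accuracy() -> float:
    """Fraction of all sequences assigned to their correct class."""
    correct = sum(CONFUSION[c][i] for i, c in enumerate(CLASSES))
    total = sum(sum(row) for row in CONFUSION.values())
    return correct / total


for name in CLASSES:
    print(f"{name}: {recall(name):.3f}")
print(f"overall: {accuracy():.3f}")
```

The lower value for TMH is consistent with the remark above that the transmembrane helix prediction is the least reliable of the four classes.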

It is also shown in the paper that the prediction is more reliable the higher the margin is.