Examples of proteome predictions for three organism types

Eukaryota - Human proteome GRCh37.62

Gram positive bacteria - B. subtilis EB2

Gram negative bacteria - E. coli K12

Training and testing data sets

These are the annotated sequence data described in Table A of the Supplementary Materials. The complete datasets correspond to the "Total" columns in the table (before homology reduction). Sequences labeled "Train" correspond to the "Train" columns, while sequences labeled "Evaluation" correspond to the "Comp." columns (used for comparing performance with SignalP 3.0 and other methods). Sequences used to train SignalP 3.0, or homologous to those used to train it, have been removed from the "Comp." sets.

Note that the "Comp." sets are subsets of the "Train" sets. SignalP 4.0 was evaluated using a nested cross-validation approach, in which different partitions were used for training, optimization and evaluation; see the Supplementary Materials for details.

   166 AJL2_ANGJA Evaluation
MVSFKLPAFLCVAVLSSMALVSHGAVLGLCEGACPEGWVEHKNRCYLHVAEKKTWLDAELNCLHHGGNLASEHSEDEHQF
LKDLHKGSDDPFWIGLSAVHEGRSWLWSDGTSASAEGDFSMWNPGEPNDAGGKEDCVHDNYGGQKHWNDIKCDLLFPSIC
VLRMVE
SSSSSSSSSSSSSSSSSSSSSSSS........................................................
................................................................................
......
   503 A1BG_BOVIN  Evaluation Train
MSAWAALLLLWGLSLSPVTEQATFFDPRPSLWAEAGSPLAPWADVTLTCQSPLPTQEFQLLKDGVGQEPVHLESPAHEHR
FPLGPVTSTTRGLYRCSYKGNNDWISPSNLVEVTGAEPLPAPSISTSPVSWITPGLNTTLLCLSGLRGVTFLLRLEGEDQ
FLEVAEAPEATQATFPVHRAGNYSCSYRTHAAGTPSEPSATVTIEELDPPPAPTLTVDRESAKVLRPGSSASLTCVAPLS
GVDFQLRRGAEEQLVPRASTSPDRVFFRLSALAAGDGSGYTCRYRLRSELAAWSRDSAPAELVLSDGTLPAPELSAEPAI
LSPTPGALVQLRCRAPRAGVRFALVRKDAGGRQVQRVLSPAGPEAQFELRGVSAVDSGNYSCVYVDTSPPFAGSKPSATL
ELRVDGPLPRPQLRALWTGALTPGRDAVLRCEAEVPDVSFLLLRAGEEEPLAVAWSTHGPADLVLTSVGPQHAGTYSCRY
RTGGPRSLLSELSDPVELRVAGS
SSSSSSSSSSSSSSSSSSSSS...........................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
.......................

The format is:

A header line with the number of amino acids, the sequence name (UniProt ID) and, optionally, a description field ('Evaluation' and/or 'Train').
The protein sequence.
The annotations, one for each amino acid.
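As an illustration of the three-part record layout described above, here is a minimal Python sketch of a reader. It assumes that sequence and annotation lines may be wrapped over several lines (as in the example records) and uses the length in the header to decide where each part ends. The names `Record`, `parse_records` and `_take` are hypothetical helpers introduced here, not part of any SignalP distribution.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Record(NamedTuple):
    length: int        # number of amino acids, from the header line
    name: str          # sequence name (UniProt ID)
    labels: List[str]  # optional description fields, e.g. ['Evaluation', 'Train']
    sequence: str
    annotation: str


def _take(lines: Iterator[str], length: int, what: str, name: str) -> str:
    """Join wrapped lines until `length` characters have been collected."""
    chunks: List[str] = []
    total = 0
    while total < length:
        chunk = next(lines, None)
        if chunk is None:
            raise ValueError(f"{name}: input ended while reading {what}")
        chunks.append(chunk)
        total += len(chunk)
    text = "".join(chunks)
    if len(text) != length:
        raise ValueError(f"{name}: {what} has {len(text)} characters, expected {length}")
    return text


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per header / sequence / annotation block."""
    # Strip whitespace and skip blank lines (an assumption about the files).
    stream = (ln.strip() for ln in lines)
    stream = (ln for ln in stream if ln)
    for header in stream:
        fields = header.split()
        if len(fields) < 2:
            raise ValueError(f"malformed header line: {header!r}")
        length, name, labels = int(fields[0]), fields[1], fields[2:]
        sequence = _take(stream, length, "sequence", name)
        annotation = _take(stream, length, "annotation", name)
        yield Record(length, name, labels, sequence, annotation)


# Usage with a made-up record (not taken from the datasets):
demo = [
    "    12 DEMO_TEST Evaluation Train",
    "MKLAVL",
    "LACDEF",
    "SSSS..",
    "......",
]
for rec in parse_records(demo):
    print(rec.name, rec.length, rec.labels)
```

Relying on the declared length, rather than on the characters in a line, avoids having to guess whether a line made up only of letters such as S or T belongs to the sequence or to the annotation.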

Annotations:
S — Amino acid is part of a Signal peptide (experimentally verified)
T — Amino acid is part of a Transmembrane region (experimentally verified)
t — Amino acid is part of a Transmembrane region (not experimentally verified)
. — An annotation different from those shown above
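To show how the per-residue annotation codes can be used, the following Python sketch collapses an annotation string into contiguous regions and reports the length of a leading signal peptide. The function names are hypothetical, the 1-based inclusive positions are a choice made here for readability, and `signal_peptide_length` assumes that the run of S codes starts at the first residue, as it does in the two example records.

```python
from itertools import groupby
from typing import List, Tuple


def annotation_regions(annotation: str) -> List[Tuple[str, int, int]]:
    """Return (code, start, end) for each run of identical annotation codes.

    Positions are 1-based and inclusive (a convention chosen for this sketch).
    """
    regions = []
    start = 1
    for code, run in groupby(annotation):
        n = sum(1 for _ in run)
        regions.append((code, start, start + n - 1))
        start += n
    return regions


def signal_peptide_length(annotation: str) -> int:
    """Length of the leading run of 'S' codes; 0 if the string does not start with 'S'."""
    return len(annotation) - len(annotation.lstrip("S"))


# Usage with a made-up annotation string (not taken from the datasets):
example = "SSS..TTt."
for code, start, end in annotation_regions(example):
    print(code, start, end)
print(signal_peptide_length(example))
```

Because T and t are distinct codes, experimentally verified and unverified transmembrane stretches are reported as separate regions even when they are adjacent.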

Eukaryota sequence data
Gram positive sequence data
Gram negative sequence data