EasyPred - 1.0

Development of neural network and weight matrix prediction methods for protein sequences.

EasyPred Prediction method training server.

Submission

Training examples: paste in training examples, or upload training examples. Valid format: column format (example: Training set).
Evaluation examples: paste in evaluation examples, or upload evaluation examples. Valid formats: column format (example: Evaluation set) or fasta format (example: gp120).

Instructions: Paste in or upload training examples to train a prediction method. To evaluate the performance of the method, paste in or upload evaluation examples as well. Please read the DTU Health Tech access policies for information about limitations on the daily number of submissions.

General parameters

Cutoff for counting an example as a positive example.

Sorting of output
Sort output on predicted values
Don't sort output

Load saved prediction method

Paste in parameters,

or upload parameter file

Select method

Matrix method parameters

Clustering method.
Henikoff & Henikoff 1/nr method
Cluster at 62% identity
No clustering

Weight on prior:

CITATIONS

For publication of results, please cite:

EasyPred. To be published

Usage instructions

1. Specify the training sequences

All the input sequences must be in one-letter amino acid code. The allowed alphabet (not case sensitive) is as follows:

A C D E F G H I K L M N P Q R S T V W Y

The training sequences can be input in the following two ways:

Paste a set of sequences, one sequence per line (just the amino acids) into the upper left window. Look here to see an example of the format.
You can also select a file (in the same format) on your local disk, either by typing the file name into the lower left window or by browsing the disk.
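
Before submitting, a pasted training set can be checked locally against the allowed alphabet. The following is a minimal Python sketch, not part of the server: the function name `check_sequences` is hypothetical, and it assumes each line holds only the amino acid sequence, as described above.

```python
# Allowed one-letter amino acid codes, as listed in the usage instructions.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY")


def check_sequences(text):
    """Split pasted text into sequences and report lines with illegal letters.

    Returns (valid, rejected), where rejected holds tuples of
    (line number, sequence, sorted list of offending letters).
    """
    valid, rejected = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        seq = line.strip().upper()  # the alphabet is not case sensitive
        if not seq:
            continue  # ignore blank lines
        bad = sorted(set(seq) - ALLOWED)
        if bad:
            rejected.append((line_no, seq, bad))
        else:
            valid.append(seq)
    return valid, rejected
```

Lines reported in `rejected` can then be corrected or removed before the set is pasted into the form.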

2. Select evaluation examples (Optional)

The evaluation examples can be given either as one example per line (optionally followed by an assigned value) or in fasta format.
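
The two accepted layouts can be told apart by the leading ">" of a fasta header. The sketch below shows one way to read either layout in Python. It is an illustration only: the function name `parse_evaluation_examples` and the tuple layout are assumptions, and the server's own parser may behave differently.

```python
def parse_evaluation_examples(text):
    """Return a list of (name, sequence, value) tuples.

    Fasta input yields (header, sequence, None).
    Column input yields (None, peptide, value), with value None if absent.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and lines[0].startswith(">"):
        records, name, chunks = [], None, []
        for ln in lines:
            if ln.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks).upper(), None))
                name, chunks = ln[1:].strip(), []
            else:
                chunks.append(ln)  # sequence may span several lines
        records.append((name, "".join(chunks).upper(), None))
        return records
    records = []
    for ln in lines:
        fields = ln.split()
        value = float(fields[1]) if len(fields) > 1 else None
        records.append((None, fields[0].upper(), value))
    return records
```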

3. Customize your run by changing some of the advanced options (Optional)

4. Submit the job

Click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.

At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

Output format

DESCRIPTION

An example of the output is found below. The output is divided into the following sections:

Description of training data
Prediction method
Prediction data
Evaluation of predictions
Predictions. This section contains the line "Number Sequence (Assignment) Prediction" followed by the predictions in four columns:
1. Residue number
2. Sequence of peptide that the prediction is made on
3. Assignment of the correct output (if made available by the user)
4. Predicted value
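
For downstream analysis, the four-column table can be read back from a saved copy of the output. This is a minimal Python sketch under stated assumptions: the function name `parse_predictions` is hypothetical, the table is assumed to start at the header line beginning with "Number Sequence", and rows with three fields are taken to be rows without an assignment.

```python
def parse_predictions(output_text):
    """Extract the rows of the 'Predictions' table from server output text.

    Returns a list of dicts with keys number, sequence, assignment and
    prediction; assignment is None when no target value was supplied.
    """
    rows, in_table = [], False
    for line in output_text.splitlines():
        fields = line.split()
        if fields[:2] == ["Number", "Sequence"]:
            in_table = True  # the header line marks the start of the table
            continue
        if not in_table or not fields:
            continue
        if not fields[0].isdigit():
            break  # the first non-row line ends the table
        if len(fields) == 4:
            number, seq, assignment, pred = fields
            assignment = float(assignment)
        elif len(fields) == 3:
            number, seq, pred = fields
            assignment = None
        else:
            continue  # skip rows with an unexpected layout
        rows.append({
            "number": int(number),
            "sequence": seq,
            "assignment": assignment,
            "prediction": float(pred),
        })
    return rows
```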

EXAMPLE OUTPUT

Description of training data

Length of motif: 9
Number of training data: 200
Threshold for counting example as positive: 0.500000

Prediction method.

Neural network
Number of input units: 180
Number of hidden units: 2
Number of bins used for balancing training: 2
Doing 5 fold cross validation

Cross validation number 1
Output from the neural network program (HOW)
Maximal test set correlation coefficent sum = 0.323400 in epoch 255
Maximal test set pearson correlation coefficent sum = 0.329300 in epoch 300
minimal per example squared error = 0.039100 in epoch 299

Cross validation number 2
Output from the neural network program (HOW)
Maximal test set correlation coefficent sum = 0.397700 in epoch 239
Maximal test set pearson correlation coefficent sum = 0.514800 in epoch 299
minimal per example squared error = 0.017400 in epoch 299

Cross validation number 3
Output from the neural network program (HOW)
Maximal test set correlation coefficent sum = 0.285700 in epoch 256
Maximal test set pearson correlation coefficent sum = 0.441000 in epoch 300
minimal per example squared error = 0.027800 in epoch 272

Cross validation number 4
Output from the neural network program (HOW)
Maximal test set correlation coefficent sum = 0.369800 in epoch 282
Maximal test set pearson correlation coefficent sum = 0.561800 in epoch 225
minimal per example squared error = 0.021600 in epoch 300

Cross validation number 5
Output from the neural network program (HOW)
Maximal test set correlation coefficent sum = 0.315100 in epoch 208
Maximal test set pearson correlation coefficent sum = 0.546600 in epoch 261
minimal per example squared error = 0.021800 in epoch 272

Parameters for prediction method

Prediction data

Number of evaluation data: 66
Predicting using a neural network
Using all networks

Evaluation of predictions

Pearson coefficient for N= 66 data: 0.53066
Aroc value: 0.77124

Predictions



Number Sequence Assignment Prediction
1      ILYQVPFSV    0.853    0.696
2      VVMGTLVAL    0.589    0.542
3      ILDEAYVMA    0.494    0.608
4      KILSVFFLA    0.851    0.526
5      HLYQGCQVV    0.539    0.558
6      YLDLALMSV    0.843    0.689
7      ALAKAAAAA    0.563    0.499
8      MALLRLPLV    0.634    0.555
9      FLLTRILTI    0.803    0.586
10     ILSSLGLPV    0.638    0.533
11     RMYGVLPWI    0.689    0.621
12     ALPYWNFAT    0.323    0.575
13     YLEPGPVTV    0.647    0.665
14     FLPWHRLFL    0.564    0.556
15     LLPSLFLLL    0.554    0.516
16     MLQDMAILT    0.527    0.542
17     LVSLLTFMI    0.301    0.423
18     GLMTAVYLV    0.798    0.592
19     ILTVILGVL    0.451    0.473
20     GLYSSTVPV    0.697    0.620
21     SLYFGGICV    0.782    0.500
22     GLYYLTTEV    0.719    0.595
23     ALYGALLLA    0.818    0.669
24     IMPGQEAGL    0.614    0.580
25     WLSLLVPFV    0.822    0.560
26     YLVAYQATV    0.639    0.645
27     RLMIGTAAA    0.499    0.525
28     WLDQVPFSV    0.774    0.657
29     AAAKAAAAV    0.446    0.450
30     KTWGQYWQV    0.778    0.575
31     VIHAFQYVI    0.343    0.407
32     GLLGWSPQA    0.793    0.585
33     YMLDLQPET    0.654    0.599
34     HLAVIGALL    0.571    0.449
35     MLLAVLYCL    0.463    0.614
36     MMWYWGPSL    0.770    0.552
37     FVNHDFTVV    0.473    0.479
38     FLLRWEQEI    0.700    0.563
39     IIDQVPFSV    0.659    0.646
40     QVMSLHNLV    0.367    0.435
41     SVYVDAKLV    0.572    0.456
42     RLLDDTPEV    0.578    0.605
43     IAATYNFAV    0.581    0.515
44     YLVSFGVWI    0.941    0.520
45     ILLLCLIFL    0.541    0.569
46     AIAKAAAAV    0.399    0.474
47     LLLCLIFLL    0.699    0.530
48     GLQDCTMLV    0.710    0.578
49     ALAKAAAAL    0.470    0.492
50     MLGNAPSVV    0.499    0.561
51     FTDQVPFSV    0.619    0.652
52     YLAPGPVTA    0.794    0.649
53     GLLGNVSTV    0.706    0.586
54     GTLGIVCPI    0.503    0.531
55     YLEPGPVTI    0.614    0.633
56     LLFLGVVFL    0.638    0.542
57     SLAGFVRML    0.565    0.528
58     GLYLSQIAV    0.578    0.578
59     WTDQVPFSV    0.392    0.611
60     RLTEELNTI    0.374    0.499
61     KLTPLCVTL    0.572    0.586
62     YLYPGPVTA    0.739    0.695
63     TVLRFVPPL    0.599    0.506
64     ILSPFMPLL    0.648    0.584
65     FVWLHYYSV    0.749    0.573
66     ILDQVPFSV    0.635    0.677
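
The Pearson coefficient and Aroc value reported under "Evaluation of predictions" can in principle be recomputed from the Assignment and Prediction columns. The Python sketch below shows the standard definitions. It makes two assumptions that the output does not confirm: an example counts as positive when its assignment is at or above the cutoff (the server may use a strict inequality), and the Aroc is the fraction of positive/negative pairs ranked correctly, with ties counted as one half. The function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def aroc(assignments, predictions, cutoff=0.5):
    """Area under the ROC curve by pairwise comparison of predictions."""
    pos = [p for a, p in zip(assignments, predictions) if a >= cutoff]
    neg = [p for a, p in zip(assignments, predictions) if a < cutoff]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# First five rows of the example table above, for illustration only.
assigned = [0.853, 0.589, 0.494, 0.851, 0.539]
predicted = [0.696, 0.542, 0.608, 0.526, 0.558]
print(pearson(assigned, predicted), aroc(assigned, predicted))
```

Running both functions on all 66 rows rather than the five shown here is needed to compare against the values printed by the server.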

Article Abstract

REFERENCE

X3M a Computer Program to Extract 3D Models.
O. Lund, M. Nielsen, C. Lundegaard, P. Worning
Abstract at the CASP5 conference A102, 2002.

Center for Biological Sequence Analysis, BioCentrum-DTU, The Technical University of Denmark, DK-2800 Lyngby, Denmark

ABSTRACT

Summary

A novel method was developed for fold recognition/homology modeling, in which a large sequence database is iteratively searched to construct a sequence profile until a template can be found in a database of proteins with known structure. The method differs from the PDB-BLAST method in that a sequence profile is only made if a template is not readily found in the database of known structures. A sequence profile is subsequently made for the template, using the same number of PSI-BLAST iterations that were used to identify it. Query and template sequences are subsequently aligned using a score based on profile-profile comparisons. The alignment score is modified so as to ensure that unreliable parts of the alignment are discarded.

Background

A problem often encountered when doing iterative sequence searches in a database is that the search may go astray and start picking up unrelated sequences, often with hydrophobic or low-complexity regions. It has been found that using PSI-BLAST [1] to build a profile from a sequence database and subsequently using this profile to search a database of proteins with known structures (PDB-BLAST) works better than searching one merged database [2]. We have developed a method related to PDB-BLAST where we only perform iterative searches against the sequence database if no match can be found in the database of proteins with known structure.

It has been shown that methods based on profile-profile alignment can produce more accurate alignments than methods based on sequence-profile or sequence-sequence alignment [3]. A number of different methods for scoring two profiles against each other have been suggested in recent years: the average score between all amino acid pairs according to the probability distribution in each profile [2], the probability that the same amino acid is found in given positions in the two profiles (the dot product of the amino acid probability vectors) [4], the probability that two amino acid distributions are the same [5], or combinations of different profile-profile scores with other scoring terms [6]. Kelley et al. [7] use the average alignment score of the query profile with the template sequence and the query sequence with the template profile for fold recognition. Here we take that average for each residue pair and use it as a scoring matrix for the alignment algorithm. This approach has the advantage that it reduces to the classical sequence-sequence alignment in the case that no homologous proteins can be found.

In CASP4, Venclovas [8] successfully selected correctly aligned regions by discarding regions that aligned differently in different BLAST searches. Another way to select for reliable parts of the alignment is to change the scoring matrix that is used to align the two proteins. It has been found that scoring matrices with low PAM values (corresponding to high BLOSUM values) are appropriate for making shorter alignments [9]. Subtracting a number from the scoring matrix also leads to shorter but more accurate alignments [10,3]. BLOSUM alignment scores S are often measured in half bits and derived from log-odds scores, S = 2*log2(Qij/(Pi*Pj)) [11]. In this case, subtracting two from the alignment score corresponds to demanding that the probability Qij of finding amino acids i and j aligned be more than twice as large as the background probability Pi*Pj in order for S to be positive. We have used this method in an attempt to make a reliable profile-profile alignment.
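
The factor of two follows from the half-bit definition, reading the logarithm as base 2: a score shifted down by d half bits stays positive only if Qij/(Pi*Pj) exceeds 2^(d/2). A short Python sketch of this arithmetic is given below. The function names and the probabilities in the usage note are made up for illustration and are not taken from any published matrix.

```python
from math import log2


def half_bit_score(q_ij, p_i, p_j):
    """Log-odds score in half bits: S = 2 * log2(Qij / (Pi * Pj))."""
    return 2.0 * log2(q_ij / (p_i * p_j))


def min_odds_ratio(shift):
    """Odds ratio Qij / (Pi * Pj) at which S - shift equals zero."""
    return 2.0 ** (shift / 2.0)
```

With illustrative values Qij = 0.02 and Pi = Pj = 0.1, the odds ratio is 2 and `half_bit_score` gives approximately 2 half bits, so a shift of two brings the score to approximately zero, which is the break-even point returned by `min_odds_ratio(2)`.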

Databases

A fasta file containing all PDB entries (pdb) was downloaded from NCBI (ftp://ftp.ncbi.nih.gov/blast/db/pdbaa.Z). A non-redundant database of known protein sequences (sp) was compiled from files downloaded from Swiss-Prot (ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/fasta/*.fas.gz). PDB entries were downloaded from RCSB (ftp://ftp.rcsb.org/pub/pdb/data/structures/all/pdb/).

Template identification

The program blastpgp [1] was used to search the databases. In order to find a template, the query sequence was run against the pdb database. If a template could not be found with an E value of less than 0.05, the sequence was run two iterations against sp, and a binary checkpoint file was saved as well as the position-specific scoring matrix in ASCII format (blastpgp does not update these files after the last iteration, so the saved files correspond to the profile obtained after the first iteration). The checkpoint file was used to restart a blastpgp search of the query sequence against the pdb database. The procedure of iteratively using the sp database to generate a profile that in turn is used to search the pdb database was continued until a template was found with an E value of less than 0.05 or a total of five iterations against the pdb database had been performed.
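
The control flow of this search can be sketched in Python as follows. This is an illustration, not the authors' code: `search_pdb` and `iterate_sp` are hypothetical callables standing in for blastpgp runs, no real blastpgp options are shown, and the limit of five is assumed to count every search against pdb, including the first one made without a profile.

```python
E_VALUE_CUTOFF = 0.05   # a hit must score below this E value
MAX_PDB_SEARCHES = 5    # assumed to include the initial search


def find_template(query, search_pdb, iterate_sp):
    """Alternate pdb searches with profile building against sp.

    search_pdb(query, checkpoint) -> (template_id, e_value) or None
    iterate_sp(query, checkpoint) -> new checkpoint
    Both callables are hypothetical wrappers around blastpgp runs.
    Returns (template_id or None, number of profile-building rounds).
    """
    checkpoint = None  # the first pdb search uses the bare sequence
    sp_rounds = 0
    for pdb_search in range(1, MAX_PDB_SEARCHES + 1):
        hit = search_pdb(query, checkpoint)
        if hit is not None and hit[1] < E_VALUE_CUTOFF:
            return hit[0], sp_rounds
        if pdb_search < MAX_PDB_SEARCHES:
            # No template yet: refine the profile against sp and retry.
            checkpoint = iterate_sp(query, checkpoint)
            sp_rounds += 1
    return None, sp_rounds
```

Returning the number of profile-building rounds alongside the template is a design choice of the sketch; it records how many rounds were needed to find the hit.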

Alignment

If a template was identified, we attempted to improve the alignment by performing a profile-profile alignment. In order to make a sequence profile for the template sequence we ran the template sequence the same number of iterations as the query sequence against the sp database and saved the scoring matrix in ASCII format. If no sequence profile was generated for either the query or the template sequence, it was constructed from a blosum62 matrix [11]. A scoring matrix Sij was constructed based on the two profiles.

Sij = (QPi(TAj) + TPj(QAi))/2 - 1

Where QPi(TAj) is the score of residue j in the template sequence with the profile at position i in the query sequence, and TPj(QAi) is the score of residue i in the query sequence with the profile at position j in the template sequence. These two scores were averaged and 1 was subtracted to reduce the lengths of the alignments and make them more accurate. The query was then aligned to the template using a local alignment algorithm [12], with a maximum number of gaps set to 20, a first gap penalty of 11, and a gap elongation penalty of 1.
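
The construction of Sij can be written out in a few lines of Python. The sketch below assumes each profile is held in memory as a list of dictionaries mapping a one-letter amino acid to its position-specific score. That representation and the function name `profile_profile_matrix` are hypothetical, and the local alignment step itself is not shown.

```python
def profile_profile_matrix(query_seq, query_profile,
                           template_seq, template_profile, shift=1.0):
    """Build S[i][j] = (QP_i(TA_j) + TP_j(QA_i)) / 2 - shift.

    query_profile[i][aa] is the score of amino acid aa at query position i;
    template_profile[j][aa] is the score of aa at template position j.
    """
    return [
        [
            (query_profile[i][t_res] + template_profile[j][q_res]) / 2.0 - shift
            for j, t_res in enumerate(template_seq)
        ]
        for i, q_res in enumerate(query_seq)
    ]
```

If both profiles are filled from the rows of the same symmetric substitution matrix, the two terms in the average are equal and Sij becomes the ordinary substitution score minus the shift, which is the sequence-sequence limit mentioned in the Background.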

Modeling

The corresponding atoms derived from the alignment can be extracted from the template file and used as a starting point for the homology modeling. Missing atoms were added using the segmod program [13] from the GeneMine package (www.bioinformatics.ucla.edu/genemine/). The structures can then be refined using the encad program [14], also from the GeneMine package. The modeling step was not in place for CASP5, so only alignments were submitted.

Alignments were submitted for 41/67 (61%) of the targets (T0130, T0132, T0133, T0137, T0140, T0141, T0142, T0143, T0144, T0149, T0150, T0151, T0152, T0153, T0154, T0155, T0158, T0160, T0163, T0164, T0165, T0166, T0167, T0169, T0171, T0172, T0175, T0178, T0179, T0182, T0183, T0184, T0185, T0186, T0188, T0189, T0190, T0191, T0192, T0193, T0195). We only submitted alignments for targets where we estimated that it was at least 95% certain that we had identified the correct fold. We furthermore sought to perform the alignment in such a way that regions where a reliable alignment could not be made were excluded. We look forward to seeing whether this strategy worked and to comparing our results with those submitted by other groups.

References

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Rychlewski L, Zhang B, Godzik A. (1998) Fold and function predictions for Mycoplasma genitalium proteins. Fold Des. 3 (4), 229-38.
3. Jaroszewski L, Rychlewski L, Godzik A. (2000) Improving the quality of twilight-zone alignments. Protein Sci. 9 (8), 1487-96.
4. Lyngsø RB, Pedersen CN, Nielsen H. R. (1999) Metrics and similarity measures for hidden Markov models. Proc Int Conf Intell Syst Mol Biol. 178-86.
5. Yona G, Levitt M. (2000) Towards a complete map of the protein space based on a unified sequence and structure analysis of all known proteins. Proc Int Conf Intell Syst Mol Biol. 8, 395-406.
6. Fischer D. (2000) Hybrid fold recognition: combining sequence derived properties with evolutionary information. Pac Symp Biocomput. 119-30.
7. Kelley LA, MacCallum RM, Sternberg MJ. (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol. 299 (2), 499-520.
8. Venclovas C. (2001) Comparative modeling of CASP4 target proteins: combining results of sequence search with three-dimensional structure assessment. Proteins Suppl 5, 47-54.
9. Altschul SF. (1991) Amino acid substitution matrices from an information theoretic perspective. J Mol Biol. 219 (3), 555-65.
10. Vogt G, Etzold T, Argos P. (1995) An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J Mol Biol. 249 (4), 816-31.
11. Henikoff S, Henikoff JG. (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 89 (22), 10915-9.
12. Smith TF, Waterman MS. (1981) Identification of common molecular subsequences. J Mol Biol. 147 (1), 195-7.
13. Levitt M. (1992) Accurate modeling of protein conformation by automatic segment matching. J Mol Biol. 226 (2), 507-533.
14. Levitt M, Hirshberg M, Sharon R, Daggett V. (1995) Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Computer Physics Comm. 91, 215-231.

GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
