Center for Biological Sequence Analysis, BioCentrum-DTU,
The Technical University of Denmark, DK-2800 Lyngby, Denmark
A novel method was developed for fold recognition/homology modeling, in which a large sequence database is iteratively searched to construct a sequence profile until a template can be found in a database of proteins with known structure. The method differs from the PDB-BLAST method in that a sequence profile is only made if a template is not readily found in the database of known structures. A sequence profile is subsequently made for the template, using the same number of PSI-BLAST iterations that were used to identify it. Query and template sequences are then aligned using a score based on profile-profile comparisons. The alignment score is modified so that unreliable parts of the alignment are discarded.
A problem often encountered when doing iterative sequence searches in a database is that the search may go astray and start picking up unrelated sequences, often with hydrophobic or low-complexity regions. It has been found that using PSI-BLAST [1] to build a profile from a sequence database and subsequently using this profile to search a database of proteins with known structures (PDB-BLAST) works better than searching one merged database [2]. We have developed a method related to PDB-BLAST in which we only perform iterative searches against the sequence database if no match can be found in the database of proteins with known structure.
It has been shown that methods based on profile-profile alignment can produce more accurate alignments than methods based on sequence-profile or sequence-sequence alignment [3]. A number of different methods for scoring two profiles against each other have been suggested in recent years: the average score between all amino acid pairs according to the probability distribution in each profile [2], the probability that the same amino acid is found in given positions in the two profiles (the dot product of the amino acid probability vectors) [4], the probability that two amino acid distributions are the same [5], or combinations of different profile-profile scores with other scoring terms [6]. Kelley et al. [7] use the average alignment score of the query profile with the template sequence and the query sequence with the template profile for fold recognition. Here we take that average for each residue pair and use it as a scoring matrix for the alignment algorithm. This approach has the advantage that it reduces to classical sequence-sequence alignment when no homologous proteins can be found.
In CASP4, Venclovas [8] successfully selected correctly aligned regions by discarding regions which aligned differently in different BLAST searches. Another way to select for reliable parts of the alignment is to change the scoring matrix that is used to align the two proteins. It has been found that scoring matrices with low PAM values (corresponding to high BLOSUM values) are appropriate for making shorter alignments [9]. Subtracting a number from the scoring matrix also leads to shorter but more accurate alignments [10,3]. BLOSUM alignment scores S are often measured in half bits and derived from log-odds scores S = 2*log2(Qij/(Pi*Pj)) [11]. In this case, subtracting two from the alignment score corresponds to demanding that the probability Qij of finding amino acids i and j aligned must be more than twice as big as the background probability Pi*Pj in order for S to be positive. We have used this method in an attempt to make a reliable profile-profile alignment.
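The arithmetic behind the shift can be checked with a short Python sketch. It assumes the half-bit log-odds form S = 2*log2(Qij/(Pi*Pj)); the function names and the example probabilities are illustrative and are not taken from any published matrix.

```python
import math


def half_bit_score(q_ij: float, p_i: float, p_j: float) -> float:
    """Log-odds score in half bits: S = 2 * log2(Qij / (Pi * Pj))."""
    return 2.0 * math.log2(q_ij / (p_i * p_j))


def shifted_score(q_ij: float, p_i: float, p_j: float, shift: float = 2.0) -> float:
    """Score after subtracting a constant (in half bits) from the matrix entry."""
    return half_bit_score(q_ij, p_i, p_j) - shift


# Illustrative numbers only: Qij is exactly twice the background Pi * Pj,
# so the unshifted score is 2 half bits and the shifted score is zero.
example_unshifted = half_bit_score(0.125, 0.25, 0.25)
example_shifted = shifted_score(0.125, 0.25, 0.25)
```

With a shift of two half bits, a pair only scores positively when Qij exceeds twice the background; a shift of one half bit corresponds to a ratio of sqrt(2).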
A FASTA file containing all PDB entries (pdb) was downloaded from NCBI (ftp://ftp.ncbi.nih.gov/blast/db/pdbaa.Z). A non-redundant database of known protein sequences (sp) was compiled from files downloaded from Swiss-Prot (ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/fasta/*.fas.gz). PDB entries were downloaded from RCSB (ftp://ftp.rcsb.org/pub/pdb/data/structures/all/pdb/).
Template identification
The program blastpgp [1] was used to search the databases. In order to find a template, the query sequence was run against the pdb database. If a template could not be found with an E value of less than 0.05, the sequence was run for two iterations against sp, and a binary checkpoint file was saved together with the position-specific scoring matrix in ASCII format (blastpgp does not update these files after the last iteration, so the saved files correspond to the profile obtained after the first iteration). The checkpoint file was used to restart a blastpgp search of the query sequence against the pdb database. This procedure of iteratively using the sp database to generate a profile, which in turn is used to search the pdb database, was continued until a template was found with an E value of less than 0.05 or a total of five iterations against the pdb database had been performed.
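The control flow of the template search can be sketched in Python as follows. This is a minimal sketch, not the original implementation: `search_pdb` and `extend_profile` are hypothetical callables standing in for the blastpgp runs against pdb and sp, and the way the profile is carried from one round to the next is an assumption.

```python
from typing import Callable, Optional, Tuple

E_VALUE_CUTOFF = 0.05   # E-value threshold stated in the text
MAX_PDB_SEARCHES = 5    # at most five searches against the pdb database

Hit = Tuple[str, float]  # (template identifier, E value)


def find_template(
    query: str,
    search_pdb: Callable[[str, Optional[object]], Optional[Hit]],
    extend_profile: Callable[[str, Optional[object]], object],
) -> Tuple[Optional[str], int]:
    """Return (template id or None, number of profile rounds against sp).

    search_pdb(query, profile) is a hypothetical wrapper that searches pdb,
    restarting from `profile` when it is not None, and returns the best hit.
    extend_profile(query, profile) is a hypothetical wrapper that runs the
    search against sp and returns the saved checkpoint.
    """
    profile = None
    rounds = 0
    for attempt in range(MAX_PDB_SEARCHES):
        hit = search_pdb(query, profile)
        if hit is not None and hit[1] < E_VALUE_CUTOFF:
            return hit[0], rounds
        # No reliable template yet: refine the profile against sp,
        # unless this was the last allowed pdb search.
        if attempt < MAX_PDB_SEARCHES - 1:
            profile = extend_profile(query, profile)
            rounds += 1
    return None, rounds
```

The returned round count is what would be reused to build the template profile with the same number of iterations.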
If a template was identified, we attempted to improve the alignment by performing a profile-profile alignment. In order to make a sequence profile for the template sequence, we ran the template sequence against the sp database for the same number of iterations as the query sequence and saved the scoring matrix in ASCII format. If no sequence profile was generated for either the query or the template sequence, it was constructed from a BLOSUM62 matrix [11]. A scoring matrix Sij was constructed based on the two profiles.
Sij = (QPi(TAj) + TPj(QAi))/2 - 1
where QPi(TAj) is the score of residue j in the template sequence with the profile at position i in the query sequence, and TPj(QAi) is the score of residue i in the query sequence with the profile at position j in the template sequence. These two scores were averaged and 1 was subtracted to shorten the alignments and make them more accurate. The query was then aligned to the template using a local alignment algorithm [12], with the maximum number of gaps set to 20, a first gap penalty of 11, and a gap elongation penalty of 1.
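The scoring matrix and the local alignment can be illustrated with the following Python sketch. It makes several assumptions: profiles are lists of 20 scores per position in the column order given by `AMINO_ACIDS`, a gap of length k costs 11 + (k - 1)*1, and the limit of 20 gaps is not enforced. The toy substitution matrix is only a stand-in used to show the fallback case and is not BLOSUM62.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # assumed column order of the profiles
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}

# Toy stand-in for a substitution matrix (NOT BLOSUM62): 4 on the diagonal, -2 elsewhere.
TOY_MATRIX = [[4 if a == b else -2 for b in range(20)] for a in range(20)]


def profile_from_matrix(seq, matrix):
    """Fallback profile: position i gets the matrix row of residue i."""
    return [list(matrix[AA_INDEX[aa]]) for aa in seq]


def pair_scores(query_seq, query_profile, template_seq, template_profile, shift=1.0):
    """S[i][j] = (QPi(TAj) + TPj(QAi)) / 2 - shift."""
    scores = []
    for i, qa in enumerate(query_seq):
        row = []
        for j, ta in enumerate(template_seq):
            qp = query_profile[i][AA_INDEX[ta]]      # query profile vs template residue
            tp = template_profile[j][AA_INDEX[qa]]   # template profile vs query residue
            row.append((qp + tp) / 2.0 - shift)
        scores.append(row)
    return scores


def local_alignment_score(scores, gap_first=11.0, gap_extend=1.0):
    """Best Smith-Waterman score with affine gaps (score only, no traceback)."""
    n = len(scores)
    m = len(scores[0]) if n else 0
    neg = float("-inf")
    best_any = [[0.0] * (m + 1) for _ in range(n + 1)]   # best score ending at (i, j)
    gap_row = [[neg] * (m + 1) for _ in range(n + 1)]    # ends with a gap along the template axis
    gap_col = [[neg] * (m + 1) for _ in range(n + 1)]    # ends with a gap along the query axis
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gap_row[i][j] = max(best_any[i][j - 1] - gap_first, gap_row[i][j - 1] - gap_extend)
            gap_col[i][j] = max(best_any[i - 1][j] - gap_first, gap_col[i - 1][j] - gap_extend)
            best_any[i][j] = max(
                0.0,
                best_any[i - 1][j - 1] + scores[i - 1][j - 1],
                gap_row[i][j],
                gap_col[i][j],
            )
            best = max(best, best_any[i][j])
    return best
```

When both profiles come from `profile_from_matrix`, the two terms in the average are equal for a symmetric matrix, so the score reduces to an ordinary sequence-sequence score minus the shift.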
The corresponding atoms derived from the alignment can be extracted from the template file and used as a starting point for the homology modeling. Missing atoms can be added using the segmod program [13] from the GeneMine package (www.bioinformatics.ucla.edu/genemine/). The structures can then be refined using the encad program [14], also from the GeneMine package. The modeling step was not in place for CASP5, so only alignments were submitted.
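As a minimal sketch of the first step, the aligned residue pairs can be read off a gapped alignment as shown below. The representation of the alignment as two equal-length strings with "-" for gaps, and the function name, are assumptions made for illustration; the resulting template positions would then be used to select atoms from the template file.

```python
def aligned_pairs(query_aln, template_aln, query_start=0, template_start=0):
    """Return (query_pos, template_pos) for every column without a gap.

    Positions are zero-based offsets into the ungapped sequences, starting at
    query_start / template_start (useful for local alignments).
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned strings must have equal length")
    pairs = []
    q_pos, t_pos = query_start, template_start
    for q_char, t_char in zip(query_aln, template_aln):
        if q_char != "-" and t_char != "-":
            pairs.append((q_pos, t_pos))
        if q_char != "-":
            q_pos += 1
        if t_char != "-":
            t_pos += 1
    return pairs
```

Query residues that fall opposite a gap have no template coordinates and are the ones a program such as segmod would have to build.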
Alignments were submitted for 41/67 (61%) of the targets (T0130, T0132, T0133, T0137, T0140, T0141, T0142, T0143, T0144, T0149, T0150, T0151, T0152, T0153, T0154, T0155, T0158, T0160, T0163, T0164, T0165, T0166, T0167, T0169, T0171, T0172, T0175, T0178, T0179, T0182, T0183, T0184, T0185, T0186, T0188, T0189, T0190, T0191, T0192, T0193, T0195). We only submitted alignments for targets where we estimated that it was at least 95% certain that we had identified the correct fold. We furthermore sought to perform the alignment in such a way that regions where a reliable alignment could not be made were excluded. We look forward to seeing whether this strategy worked and to comparing our results with those submitted by other groups.
References
1. Altschul SF et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Rychlewski L, Zhang B, Godzik A. (1998) Fold and function predictions for Mycoplasma genitalium proteins. Fold Des. 3 (4), 229-38.
3. Jaroszewski L, Rychlewski L, Godzik A. (2000) Improving the quality of twilight-zone alignments. Protein Sci. 9 (8), 1487-96.
4. Lyngsø RB, Pedersen CN, Nielsen H. R. (1999) Metrics and similarity measures for hidden Markov models. Proc Int Conf Intell Syst Mol Biol. 178-86.
5. Yona G, Levitt M. (2000) Towards a complete map of the protein space based on a unified sequence and structure analysis of all known proteins. Proc Int Conf Intell Syst Mol Biol. 8, 395-406.
6. Fischer D. (2000) Hybrid fold recognition: combining sequence derived properties with evolutionary information. Pac Symp Biocomput. 119-30.
7. Kelley LA, MacCallum RM, Sternberg MJ. (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol. 299 (2), 499-520.
8. Venclovas C. (2001) Comparative modeling of CASP4 target proteins: combining results of sequence search with three-dimensional structure assessment. Proteins Suppl 5, 47-54.
9. Altschul SF. (1991) Amino acid substitution matrices from an information theoretic perspective. J Mol Biol. 219 (3), 555-65.
10. Vogt G, Etzold T, Argos P. (1995) An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J Mol Biol. 249 (4), 816-31.
11. Henikoff S, Henikoff JG. (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 89 (22), 10915-9.
12. Smith TF, Waterman MS. (1981) Identification of common molecular subsequences. J Mol Biol. 147 (1), 195-7.
13. Levitt M. (1992) Accurate modeling of protein conformation by automatic segment matching. J Mol Biol. 226 (2), 507-533.
14. Levitt M, Hirshberg M, Sharon R, Daggett V. (1995) Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Computer Physics Comm. 91, 215-231.