There are currently models for vertebrates and C. elegans. The vertebrate model is trained entirely on human genes, but it should work reasonably well for other vertebrates.
>SEQ1 Any text following the identifier is ignored
TCATTGTATCAGAAAGATAAAGAAAAAATAATCGTATTTCAGTACTTCTATACATCCTAAAAGGGAAGAC
GGAACACTTAAGTGGTTGATAAATTTGAAAAGCTGATTAAACATAATAATCACCATGTTGGGGGAAGACA
TAAAAGTCATAAAACAGATTTTTTATAATATTAAAAAAGTGACATGAAAATTATACAATTTTAGAAAGGA
ATATAAAAAGGCAGGAGTTAAAAAATAGTGGGACTAATATCATAGAAAACTATCCATGAGGAAGGTCAAA
TTTATTTTCAACATGTAAAAAGGATAAAGAGTAGAGGTATTTTAAAAATTCACAGATTCTTAATGAGGCA
AATGTTAAAATATGGAACCCAATCTCAGACAAATACATAGAAAGGAGTAAGGGCCAACTCTCATGCATAA
GGTATCCCATCCTATAGCAAATCAGATATATAGGTACGCTTGA
Letters can be upper or lower case. Spaces and other non-letter characters in the sequence are ignored. Letter U is translated to T. All letters not equal to A, C, G, T or U are treated as unknown (N). The sequences can be of any length.
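The letter rules above can be illustrated with a small Python sketch. This is not HMMgene's own code; the function name normalize_sequence is hypothetical, and the sketch only mirrors the stated rules (case-insensitive, non-letters dropped, U read as T, everything else read as N).

```python
def normalize_sequence(raw: str) -> str:
    """Apply the documented letter rules to a raw sequence string.

    Hypothetical helper for illustration; not part of HMMgene.
    """
    known = {"A", "C", "G", "T"}
    out = []
    for ch in raw:
        if not ch.isalpha():
            continue  # spaces, digits and other non-letters are ignored
        ch = ch.upper()  # upper and lower case are equivalent
        if ch == "U":
            ch = "T"  # U is translated to T
        elif ch not in known:
            ch = "N"  # any other letter is treated as unknown
        out.append(ch)
    return "".join(out)
```

Running a sequence through such a function before submission is one way to see how ambiguous or RNA-style input would be interpreted.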
All lines starting with `#' are treated as comment lines; lines starting with `%%' may contain annotation (see below). The execution time of the program is roughly proportional to the sequence length.
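As a rough illustration of how the three kinds of lines in a sequence file could be told apart, here is a minimal Python sketch. It is an assumption-laden reading of the rules above, not the program's actual parser: the function name split_input is hypothetical, sequence entries are assumed to start with a `>' header line whose first word is the identifier, and no letter translation is applied.

```python
def split_input(text: str):
    """Separate annotation lines and named sequences in a sequence file.

    Hypothetical helper for illustration; not part of HMMgene.
    Returns (annotations, sequences) where sequences maps id -> raw letters.
    """
    annotations = []
    parts = {}
    current = None
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # comment line
        if line.startswith("%%"):
            annotations.append(line[2:].strip())  # annotation line
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # only the identifier is kept; text after it is ignored
            current = fields[0] if fields else ""
            parts[current] = []
            continue
        if current is not None:
            parts[current].append(line.strip())  # sequence data
    sequences = {name: "".join(chunks) for name, chunks in parts.items()}
    return annotations, sequences
```

For example, a file with one comment, one `%%' line and one entry named SEQ2 would yield a single annotation string and a dictionary with the key SEQ2.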
The signal prediction differs from most other predictors of splice sites and start/stop codons in that only signals that fit well into a whole gene structure are predicted, i.e., the signals are not predicted from the local sequence alone. This yields fewer, and usually better, predictions. However, if there is an error, for instance one that frameshifts an actual gene, the splice sites might be missed along with the gene.
Because of the slow-down of the program and the large amount of information produced, it is best to use this option on a region where it is likely that there is only one gene. Then it will be possible to see alternative ways of splicing it together. Although it is quite possible that real alternative splicing can be predicted in this way, this has not yet been investigated. Whether a gene is alternatively spliced or not, it will often be useful to see the alternative possibilities that might score almost as well as the best prediction.
SEQ2 non-coding 105 443
The same can be specified in the sequence file by preceding each line with `%%':
%% SEQ2 non-coding 105 443
This has to come before the actual sequence in the file; for example, all annotation lines can come at the very beginning.
This is very useful if there are database hits to a sequence or if repeats are mapped by some other program. Assume for instance that there is a database hit to bases 1503-1594 and Alu repeats are found at positions 10731-10890 and 13205-13356 in SEQ2. Then one might want to enter the lines
SEQ2 coding 1503 1594 +
SEQ2 non-coding 1503 1594 -
SEQ2 non-coding 10731 10890
SEQ2 non-coding 13205 13356
Here we indicated that the sequence is coding on the direct strand from 1503 to 1594 and non-coding in this region on the complementary strand. The last two lines mean that the regions are non-coding on BOTH STRANDS.
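When hits come from another program, annotation lines in the layout shown above could be generated with a few lines of Python. This is only a sketch: the function annotation_line and its parameters are hypothetical, the field order is copied from the examples, and the check that start does not exceed end is an assumption rather than a documented rule.

```python
def annotation_line(seq_id, label, start, end, strand=None, in_sequence_file=False):
    """Format one annotation line: identifier, label, start, end, optional strand.

    Hypothetical helper for illustration; not part of HMMgene.
    """
    if strand not in (None, "+", "-"):
        raise ValueError("strand must be '+', '-' or None (both strands)")
    if start > end:
        # assumption: coordinates are given with start <= end
        raise ValueError("start must not exceed end")
    fields = [seq_id, label, str(start), str(end)]
    if strand is not None:
        fields.append(strand)  # omitted strand means both strands
    line = " ".join(fields)
    # inside the sequence file, annotation lines are preceded by %%
    return "%% " + line if in_sequence_file else line


# The database hit and Alu repeats from the example above
hits = [
    ("SEQ2", "coding", 1503, 1594, "+"),
    ("SEQ2", "non-coding", 1503, 1594, "-"),
    ("SEQ2", "non-coding", 10731, 10890, None),
    ("SEQ2", "non-coding", 13205, 13356, None),
]
lines = [annotation_line(*hit) for hit in hits]
```

Passing in_sequence_file=True gives the `%%' form, which has to be placed before the sequence it refers to.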
Regions specified in the file are not allowed to overlap except on opposite strands. If the annotation you give does not conform to the model, the program will die. This happens for instance if the annotation you give forces
If no start codon or stop codon is predicted for a gene (e.g. if the prediction begins and ends with an intron), the frame information and scores might be wrong.
HMMgene can in principle predict a gene with a stop codon in frame, if splicing happens in the middle of it. I have not yet seen any examples though.