Output format


The output is a prediction of partial or complete genes in the sequences. It is in GFF format, a sequence annotation format developed with gene finding in mind. The format is very simple, so it is easy to write Perl or awk programs that post-process the output. The following is an example of the form it takes with hmmgene.

Note that hmmgene only predicts coding regions. That is, the first exon (`firstex' below) is only the coding part of the first coding exon, and similarly for the last exon (`lastex' below). In what follows, a `gene' therefore means the region of the gene from start codon to stop codon.

SEQ1 HMMgene1.1 firstex 692     702     0.347   +  2    bestparse:cds_1
SEQ1 HMMgene1.1 exon_1  2473    2711    0.421   +  1    bestparse:cds_1
SEQ1 HMMgene1.1 exon_2  2897    3081    0.544   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 exon_3  10376   10563   0.861   +  2    bestparse:cds_1
SEQ1 HMMgene1.1 exon_4  11841   11891   0.857   +  2    bestparse:cds_1
SEQ1 HMMgene1.1 exon_5  12387   12483   0.993   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 exon_6  13076   13211   0.970   +  1    bestparse:cds_1
SEQ1 HMMgene1.1 exon_7  13332   13415   0.926   +  1    bestparse:cds_1
SEQ1 HMMgene1.1 exon_8  13515   13603   1.000   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 exon_9  14180   14235   1.000   +  2    bestparse:cds_1
SEQ1 HMMgene1.1 exon_10 14321   14408   0.999   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 exon_11 14483   14579   0.877   +  1    bestparse:cds_1
SEQ1 HMMgene1.1 exon_12 14697   14764   0.639   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 exon_13 14901   15030   0.835   +  1    bestparse:cds_1
SEQ1 HMMgene1.1 lastex  15643   15704   0.987   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 CDS     692     15704   0.132   +  .    bestparse:cds_1
(the real list is tab separated)
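
As an illustration of how little is needed to post-process such a listing, here is a minimal Python sketch that splits one line into its nine fields. The names `Prediction` and `parse_line` and the field names are assumptions made for this example and are not part of hmmgene. The sketch splits on tabs when the line contains any, and otherwise on runs of blanks so that the space-aligned listing above can also be read.

```python
from typing import NamedTuple, Optional


class Prediction(NamedTuple):
    """One output line; the field names are illustrative only."""
    seq_id: str
    program: str
    kind: str
    start: int
    end: int
    score: float
    strand: str
    frame: Optional[int]  # None when the column holds '.'
    group: str


def parse_line(line: str) -> Optional[Prediction]:
    """Parse one line; return None for blank lines and '#' comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Real output is tab separated; fall back to blanks for aligned listings.
    fields = line.split("\t") if "\t" in line else line.split()
    fields = [f.strip() for f in fields]
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
    seq_id, program, kind, start, end, score, strand, frame, group = fields
    return Prediction(
        seq_id, program, kind, int(start), int(end), float(score),
        strand, None if frame == "." else int(frame), group,
    )


record = parse_line("SEQ1\tHMMgene1.1\tfirstex\t692\t702\t0.347\t+\t2\tbestparse:cds_1")
print(record.kind, record.start, record.end, record.score)
```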

Columns

  1. Sequence identifier
  2. Program name
  3. Prediction (see table below for the meaning).
  4. Beginning
  5. End
  6. Score between 0 and 1
  7. Strand: + for direct and - for complementary
  8. Frame (for exons it is the position of the donor in the frame)
  9. Group to which the prediction belongs. If several CDSs are found they will be called cds_1, cds_2, etc. `bestparse:' is there because alternative predictions will also be available (see below).

The score that comes with all the exons, as well as with the entire gene (`CDS' above), is a probability, so a value close to one means that the program is fairly certain. (See `Known Bugs'.) The program also outputs some comment lines, which are preceded by `#'.
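
Because the score is a probability, one simple use of it is to keep only the exons the program is fairly certain about. The following Python sketch does that while skipping comment lines. The function name `confident_exons`, the 0.9 threshold and the text of the sample comment line are arbitrary choices made for this example, not hmmgene defaults.

```python
def is_exon(kind):
    """True for the exon-type predictions (not CDS or signal lines)."""
    return kind in ("firstex", "lastex", "singleex") or kind.startswith("exon_")


def confident_exons(lines, min_score=0.9):
    """Return (name, begin, end, score) for exons scoring at least min_score."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment line
        fields = line.split()  # works for tabs as well as aligned blanks
        if len(fields) != 9:
            continue  # not a prediction line
        kind, score = fields[2], float(fields[5])
        if is_exon(kind) and score >= min_score:
            kept.append((kind, int(fields[3]), int(fields[4]), score))
    return kept


sample = [
    "# a made-up comment line",
    "SEQ1 HMMgene1.1 firstex 692     702     0.347   +  2    bestparse:cds_1",
    "SEQ1 HMMgene1.1 exon_8  13515   13603   1.000   +  0    bestparse:cds_1",
    "SEQ1 HMMgene1.1 lastex  15643   15704   0.987   +  0    bestparse:cds_1",
    "SEQ1 HMMgene1.1 CDS     692     15704   0.132   +  .    bestparse:cds_1",
]
for exon in confident_exons(sample):
    print(exon)
```

Note that the score of the `CDS' line refers to the gene as a whole and is deliberately left out of this filter.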

Predictions

Name      Meaning
firstex   The coding part of the first coding exon starting with the first base of the start codon.
exon_N    The N'th predicted internal coding exon.
lastex    The coding part of the last coding exon ending with the last base of the stop codon.
singleex  The coding part of an exon in a gene with only one coding exon.
CDS       Coding region composed of the exon predictions prior to this line.
START     Predicted start codon with position of first and last base (only with signal option).
STOP      Predicted stop codon with position of first and last base (only with signals option).
DON       Predicted donor site with position of the base before and after the splice site (only with signal option).
ACC       Predicted acceptor site with position of the base before and after the splice site (only with signal option).
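
The exon-type predictions can be combined per group to check a predicted gene. The Python sketch below sums the exon lengths of each group and, for the + strand, compares the running total modulo 3 with the frame column. That relation is an observation that holds for every exon of the example listing above; it is not a documented guarantee, and the - strand is not checked here. The function name `summarize` is made up for this example.

```python
EXAMPLE = """\
SEQ1 HMMgene1.1 firstex 692 702 0.347 + 2 bestparse:cds_1
SEQ1 HMMgene1.1 exon_1 2473 2711 0.421 + 1 bestparse:cds_1
SEQ1 HMMgene1.1 exon_2 2897 3081 0.544 + 0 bestparse:cds_1
SEQ1 HMMgene1.1 exon_3 10376 10563 0.861 + 2 bestparse:cds_1
SEQ1 HMMgene1.1 exon_4 11841 11891 0.857 + 2 bestparse:cds_1
SEQ1 HMMgene1.1 exon_5 12387 12483 0.993 + 0 bestparse:cds_1
SEQ1 HMMgene1.1 exon_6 13076 13211 0.970 + 1 bestparse:cds_1
SEQ1 HMMgene1.1 exon_7 13332 13415 0.926 + 1 bestparse:cds_1
SEQ1 HMMgene1.1 exon_8 13515 13603 1.000 + 0 bestparse:cds_1
SEQ1 HMMgene1.1 exon_9 14180 14235 1.000 + 2 bestparse:cds_1
SEQ1 HMMgene1.1 exon_10 14321 14408 0.999 + 0 bestparse:cds_1
SEQ1 HMMgene1.1 exon_11 14483 14579 0.877 + 1 bestparse:cds_1
SEQ1 HMMgene1.1 exon_12 14697 14764 0.639 + 0 bestparse:cds_1
SEQ1 HMMgene1.1 exon_13 14901 15030 0.835 + 1 bestparse:cds_1
SEQ1 HMMgene1.1 lastex 15643 15704 0.987 + 0 bestparse:cds_1
SEQ1 HMMgene1.1 CDS 692 15704 0.132 + . bestparse:cds_1
"""


def is_exon(kind):
    """True for firstex, exon_N, lastex and singleex lines."""
    return kind in ("firstex", "lastex", "singleex") or kind.startswith("exon_")


def summarize(lines):
    """Return (coding length per (sequence, group), list of frame mismatches)."""
    totals = {}
    mismatches = []
    for line in lines:
        fields = line.split()
        if line.startswith("#") or len(fields) != 9 or not is_exon(fields[2]):
            continue
        seq_id, _prog, kind, begin, end, _score, strand, frame, group = fields
        key = (seq_id, group)
        # Begin and end are both inclusive, hence the +1.
        totals[key] = totals.get(key, 0) + int(end) - int(begin) + 1
        if strand == "+" and frame != "." and totals[key] % 3 != int(frame):
            mismatches.append((key, kind))
    return totals, mismatches


totals, mismatches = summarize(EXAMPLE.splitlines())
for (seq_id, group), length in totals.items():
    print(seq_id, group, length, "bases,", length // 3, "codons")
print("frame mismatches:", mismatches)
```

For the example gene this gives a coding length of 1581 bases, that is 527 codons including the stop codon, and no frame mismatches.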