<h2>Instructions</h2>
<hr>
The server has two forms:
you can either submit a local file with sequences in FASTA format
or paste your sequence into the window.  Then select which organism
the sequence is from and which options you would like to use.
Optionally, you can also specify known annotation, e.g. from database hits.
<P>

There are currently models for vertebrates and C. elegans.  The vertebrate
model is trained entirely on human genes, but it should work reasonably
well for other vertebrates.

<H3><A NAME="as-h2-63846">Sequence format</A></H3>


DNA sequences must be in FASTA format, which looks like this:

<PRE>
>SEQ1 Any text following the identifier is ignored
TCATTGTATCAGAAAGATAAAGAAAAAATAATCGTATTTCAGTACTTCTATACATCCTAAAAGGGAAGAC
GGAACACTTAAGTGGTTGATAAATTTGAAAAGCTGATTAAACATAATAATCACCATGTTGGGGGAAGACA
TAAAAGTCATAAAACAGATTTTTTATAATATTAAAAAAGTGACATGAAAATTATACAATTTTAGAAAGGA
ATATAAAAAGGCAGGAGTTAAAAAATAGTGGGACTAATATCATAGAAAACTATCCATGAGGAAGGTCAAA
TTTATTTTCAACATGTAAAAAGGATAAAGAGTAGAGGTATTTTAAAAATTCACAGATTCTTAATGAGGCA
AATGTTAAAATATGGAACCCAATCTCAGACAAATACATAGAAAGGAGTAAGGGCCAACTCTCATGCATAA
GGTATCCCATCCTATAGCAAATCAGATATATAGGTACGCTTGA
</PRE>

Letters can be upper or lower case.  Spaces and other non-letter
characters in the sequence are ignored.  Letter U is translated to T.
All letters not equal to A, C, G, T or U are treated as unknown (N).
The sequences can be of any length.
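<P>

The cleanup rules above can be sketched in a few lines of Python.  This is
only an illustration of the documented behaviour, not the server's own
code; the function name <code>normalize_dna</code> and the choice to emit
upper-case output are assumptions.

```python
def normalize_dna(raw: str) -> str:
    """Apply the documented cleanup rules to one raw sequence string."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            continue  # spaces, digits and other non-letters are ignored
        base = ch.upper()  # case does not matter (upper-case output is an assumption)
        if base == "U":
            base = "T"  # U is translated to T
        if base not in "ACGT":
            base = "N"  # every other letter is treated as unknown
        cleaned.append(base)
    return "".join(cleaned)
```

Running a sequence through such a function before submission makes it easy
to see how many positions will be treated as unknown.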

<P>

Lines starting with `#' are treated as comment
lines; lines starting with `%%' may contain annotation (see below).

The execution time of the program is roughly proportional to the
sequence length.
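<P>

As a rough sketch of how the three kinds of lines in an input file can be
told apart, the following Python function sorts them into comments,
annotation lines and sequences.  The function name
<code>split_input_lines</code> and its return layout are hypothetical, and
the sketch does not try to reproduce the server's actual parser.

```python
def split_input_lines(text: str):
    """Sort the lines of an input file into comments, annotation and sequences."""
    comments, annotation, records = [], [], {}
    current = None
    for line in text.splitlines():
        if line.startswith("#"):
            comments.append(line[1:].strip())  # comment line
        elif line.startswith("%%"):
            annotation.append(line[2:].strip())  # annotation line
        elif line.startswith(">"):
            # Only the identifier counts; any text after it is ignored.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line.strip())  # sequence data
    sequences = {name: "".join(parts) for name, parts in records.items()}
    return comments, annotation, sequences
```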

<H3><A NAME="as-h2-638410">Options</A></H3>

<H4><A NAME="as-h3-638411">Predict signals</A></H4>

Predict splice sites and start/stop codons together with their
associated probabilities.  The output
format is <A target = "_blank" HREF="https://www.sanger.ac.uk/resources/software/gff/spec.html">GFF</A>.

<P>

The signal prediction differs from most other predictors of
splice sites and start/stop codons in that only signals that fit well into a
whole gene structure are predicted, i.e., the signals are not predicted
from the local sequence alone.  This yields fewer, and usually better,
predictions.  However, if there is an error that causes a frameshift in an
actual gene, or something similar, the splice sites might be missed
as well as the gene.
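<P>

Since the output is GFF, it can be read with a small script.  The sketch
below assumes the usual GFF layout of eight whitespace-separated columns
(seqname, source, feature, start, end, score, strand, frame) followed by an
optional attributes field.  The names <code>GffFeature</code> and
<code>parse_gff</code> are hypothetical, and the feature and source labels
that HMMgene actually writes are not specified here.

```python
from typing import List, NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the column holds "."
    strand: str
    frame: str
    attributes: str


def parse_gff(text: str) -> List[GffFeature]:
    """Parse GFF text, skipping blank lines and '#' comment/meta lines."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split(None, 8)  # at most 9 fields; the last keeps its spaces
        if len(cols) < 8:
            raise ValueError("expected at least 8 columns: " + line)
        score = None if cols[5] == "." else float(cols[5])
        attributes = cols[8] if len(cols) > 8 else ""
        features.append(GffFeature(cols[0], cols[1], cols[2], int(cols[3]),
                                   int(cols[4]), score, cols[6], cols[7],
                                   attributes))
    return features
```

With the features in this form it is straightforward to filter, for
example, on score or strand.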


<H4><A NAME="as-h3-638412">Alternative predictions</A></H4>

The predicted genes in a sequence are the most probable ones according
to the program (or rather the underlying hidden Markov model).  It is
also possible to see suboptimal predictions.  For instance, to see the
3 most probable predictions, HMMgene is run with `3 best predictions'
instead of `best prediction'.
The program will run approximately 3 times slower in that case.

<P>

Because of the slow-down of the program and the large amount of
information produced, it is best to use this option on a region where
it is likely that there is only one gene.  Then it will be possible to
see alternative ways of splicing it together.  Although it is quite
possible that real alternative splicing can be predicted in this way,
this has not yet been investigated.  Whether a gene is alternatively
spliced or not, it will often be useful to see the alternative
possibilities that might score almost as well as the best prediction.


<H4><A NAME="as-h2-638413">Annotation</A></H4>

If something is known about one or more of the sequences, it can be
specified either in a separate annotation file or in the sequence
file.  For instance, if it is known that SEQ2 is non-coding from base
number 105 to 443, the annotation file must contain a line of the form

<P>

SEQ2 non-coding 105 443

<P>

The second field must be one of the following keywords:

<UL>
        <LI>coding
        <LI>non-coding
        <LI>intron
        <LI>non-intron
        <LI>intergenic
        <LI>non-intergenic
</UL>


Note that these keywords must appear exactly as written here (lower
case).
An optional + or - at the end of a line indicates the direct strand
(the direction of the sequence in the file) or the complementary
strand, respectively.
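<P>

The line format described above can be checked before submission with a
short parser.  This is a minimal sketch: the names <code>Annotation</code>
and <code>parse_annotation_line</code> are hypothetical, and the sketch
only checks the keyword spelling, the number of fields and the strand
symbol.

```python
from typing import NamedTuple, Optional

# The six keywords, exactly as they must be written (lower case).
KEYWORDS = {"coding", "non-coding", "intron", "non-intron",
            "intergenic", "non-intergenic"}


class Annotation(NamedTuple):
    seq_id: str
    keyword: str
    start: int
    end: int
    strand: Optional[str]  # "+", "-", or None when no strand is given


def parse_annotation_line(line: str) -> Annotation:
    """Parse one line such as 'SEQ2 non-coding 105 443 +'."""
    fields = line.split()
    if len(fields) not in (4, 5):
        raise ValueError("expected 4 or 5 fields: " + line)
    seq_id, keyword, start, end = fields[:4]
    if keyword not in KEYWORDS:
        raise ValueError("unknown keyword: " + keyword)
    strand = fields[4] if len(fields) == 5 else None
    if strand not in (None, "+", "-"):
        raise ValueError("strand must be + or -: " + line)
    return Annotation(seq_id, keyword, int(start), int(end), strand)
```

A keyword written with a capital letter, such as `Coding', is rejected by
this check, in line with the note on exact spelling.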

<P>

The same can be specified in the sequence file by preceding each line
with `%%':

<P>

%% SEQ2 non-coding 105 443

Such lines must come <EM>before</EM> the actual sequence in the file; for
example, all annotation lines can be placed at the very beginning.
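<P>

If the annotation already exists as a separate list of lines, it can be
moved into the sequence file with a small helper.  The sketch below places
all annotation lines at the very beginning, which satisfies the ordering
requirement; the function name <code>embed_annotation</code> is
hypothetical.

```python
def embed_annotation(annotation_lines, fasta_text: str) -> str:
    """Prefix each annotation line with '%% ' and put them before the FASTA text."""
    prefixed = ["%% " + line.strip() for line in annotation_lines if line.strip()]
    return "\n".join(prefixed + [fasta_text])
```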

<P>

Annotation is very useful if there are database hits to a sequence or
if repeats are mapped by some other program.
Assume for instance that there is a database hit to bases
1503-1594 and Alu repeats are found at positions 10731-10890 and
13205-13356 in SEQ2.  Then one might want to enter the lines

<PRE>
SEQ2 coding 1503 1594 +
SEQ2 non-coding 1503 1594 -
SEQ2 non-coding 10731 10890
SEQ2 non-coding 13205 13356
</PRE>

Here we indicate that the sequence is coding on the direct strand
from 1503 to 1594 and non-coding in this region on the complementary
strand.  The last two lines mean that those regions are non-coding
on <EM>both</EM> strands.

<P>

Regions specified in the file are not allowed to overlap except
on opposite strands.
If the annotation you give does not conform to the model, the program
will die.  This happens, for instance, if the annotation you give forces

<UL>
        <LI>a non-consensus start or stop codon.
        <LI>a donor different from GT, or an acceptor different from AG.
        <LI>the start of an exon less than 25 bases from the beginning
                of the sequence, or an exon extending closer than 10 bases to
                the end.
        <LI>a stop codon in a coding region.
</UL>
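
<P>

The overlap rule can be checked before submission.  The following sketch
covers only that rule, not the conditions in the list above.  It assumes
that both end points of a region are included, and it treats a line
without a strand symbol as covering both strands; the function names are
hypothetical.

```python
from itertools import combinations


def strands_of(strand):
    """A region without a strand symbol applies to both strands."""
    return {"+", "-"} if strand is None else {strand}


def find_conflicts(regions):
    """Return pairs of regions that overlap on a shared strand.

    regions: list of (seq_id, start, end, strand) tuples, where strand
    is "+", "-" or None.  End points are assumed to be inclusive.
    """
    conflicts = []
    for a, b in combinations(regions, 2):
        same_sequence = a[0] == b[0]
        overlapping = a[1] <= b[2] and b[1] <= a[2]
        shared_strand = strands_of(a[3]) & strands_of(b[3])
        if same_sequence and overlapping and shared_strand:
            conflicts.append((a, b))
    return conflicts
```

For the four example lines given earlier this check reports no conflicts,
because the two regions at 1503-1594 lie on opposite strands.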

<H3><A NAME="as-h2-638414">Known Bugs</A></H3>


For some reason, the probability of the final exon is sometimes
larger than 1, usually not by much.  I can't find the
error.  Please let me know if it happens in other cases.

<P>

If no start codon or stop codon is predicted for a gene (e.g. the
prediction begins and ends with an intron), the frame information and
scores might be wrong.

<P>

HMMgene can in principle predict a gene with an in-frame stop codon
if splicing happens in the middle of it.  I have not yet seen
any examples, though.