NetUTR - 1.0b

Splice sites in 5' UTR regions of human genes

The NetUTR 1.0 server predicts splice sites in 5' UTR regions of human genes.

Submission

Restrictions
Currently at most 200 sequences and 500,000 nucleotides per submission, no sequence longer than 500,000 nucleotides.
Confidentiality
The sequences are kept confidential and will be deleted after processing.

CITATIONS

For publication of results, please cite:

Analysis and recognition of 5' UTR intron splice sites in human pre-mRNA.
E. Eden and S. Brunak.
Nucleic Acids Research, 32:1131-1142, 2004.

Instructions

1. Specify the input sequences

All the input sequences must be in one-letter nucleotide code. The allowed alphabet (not case sensitive) is as follows:

A C G T and X (unknown)

All the non-alphabetic symbols will be ignored; all the alphabetic symbols not in the alphabet will be converted to X before processing. The sequences can be input in the following two ways:

Paste a single sequence (just the nucleotides) or a number of sequences in FASTA format into the upper window of the main server page.
Select a FASTA file on your local disk, either by typing the file name into the lower window or by browsing the disk.

Both ways can be used at the same time: all the specified sequences will be processed. However, a single submission may currently contain no more than 200 sequences and 500,000 nucleotides in total. Sequences longer than 500,000 nucleotides are not allowed.
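The input rules above can be checked locally before submitting. The following is a minimal sketch, not the server's own code: the function names (`normalize`, `parse_fasta`, `check_limits`) are hypothetical, and the FASTA handling is a simplified assumption (a paste without a header line is treated as one unnamed sequence).

```python
# Hypothetical pre-submission check; names and FASTA handling are assumptions,
# only the alphabet and the limits are taken from the instructions above.
ALPHABET = set("ACGTX")
MAX_SEQUENCES = 200
MAX_TOTAL_NT = 500_000
MAX_SEQ_NT = 500_000


def normalize(raw):
    """Drop non-alphabetic symbols, upper-case, map unknown letters to X."""
    out = []
    for ch in raw:
        if not ch.isalpha():
            continue  # non-alphabetic symbols are ignored
        up = ch.upper()
        out.append(up if up in ALPHABET else "X")
    return "".join(out)


def parse_fasta(text):
    """Return a list of (name, normalized_sequence) pairs."""
    records = []
    name, chunks, seen = "unnamed", [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen:
                records.append((name, normalize("".join(chunks))))
            words = line[1:].split()
            name, chunks, seen = (words[0] if words else "unnamed"), [], True
        else:
            chunks.append(line)
            seen = True
    if seen:
        records.append((name, normalize("".join(chunks))))
    return records


def check_limits(records):
    """Return a list of human-readable problems; empty if within limits."""
    problems = []
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences (limit {MAX_SEQUENCES})")
    total = sum(len(seq) for _, seq in records)
    if total > MAX_TOTAL_NT:
        problems.append(f"{total} nucleotides in total (limit {MAX_TOTAL_NT})")
    for name, seq in records:
        if len(seq) > MAX_SEQ_NT:
            problems.append(f"{name}: {len(seq)} nucleotides (limit {MAX_SEQ_NT})")
    return problems
```

For example, `check_limits(parse_fasta(open("input.fsa").read()))` (with a hypothetical file name) returns an empty list when the file fits within the stated restrictions.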

2. Customize your run

Currently no custom options are enabled.

3. Submit the job

Click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.

At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

Output format

DESCRIPTION

The output conforms to the GFF version 2 format. For each input sequence the server prints a list of predicted splice sites, both donor and acceptor, showing their positions in the sequence and the prediction confidence scores. Only sites with scores higher than 0.5 are predicted as splice sites and reported in the list. Sites with scores higher than 0.8 are marked with 'H' as high-confidence predictions.
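The two score thresholds can be summarized in a few lines of code. This is an illustrative sketch only; the function name `classify` and the constant names are hypothetical, and the strict comparisons follow the "higher than" wording above.

```python
# Illustrative only: thresholds are from the description above,
# the names are assumptions made for this sketch.
REPORT_THRESHOLD = 0.5
HIGH_CONFIDENCE_THRESHOLD = 0.8


def classify(score):
    """Return 'H' for high confidence, '' for a reported site, None if not reported."""
    if score > HIGH_CONFIDENCE_THRESHOLD:
        return "H"
    if score > REPORT_THRESHOLD:
        return ""
    return None
```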

The example below shows the NetUTR 1.0 output for the sequence taken from the GenBank entry X05196, containing the human aldolase C gene. The 5' UTR intron (155-1411) in that gene is predicted correctly, with high confidence.

EXAMPLE OUTPUT

##gff-version 2
##source-version netUTR-1.0b
##date 2006-03-16
##Type DNA
# seqname   source        feature    start   end    score   +/-  ?
# ---------------------------------------------------------------------------
X05196      netUTR-1.0b   acceptor   512     512    0.575   +
X05196      netUTR-1.0b   acceptor   577     577    0.692   +
X05196      netUTR-1.0b   acceptor   1189    1189   0.631   +
X05196      netUTR-1.0b   acceptor   1411    1411   0.861   +    H
X05196      netUTR-1.0b   acceptor   1621    1621   0.501   +
X05196      netUTR-1.0b   acceptor   1949    1949   0.755   +
X05196      netUTR-1.0b   acceptor   2378    2378   0.831   +    H
X05196      netUTR-1.0b   acceptor   2590    2590   0.604   +
X05196      netUTR-1.0b   acceptor   3334    3334   0.532   +
X05196      netUTR-1.0b   acceptor   3692    3692   0.574   +
X05196      netUTR-1.0b   acceptor   5292    5292   0.522   +
X05196      netUTR-1.0b   donor      155     155    0.971   +    H
X05196      netUTR-1.0b   donor      993     993    0.612   +
X05196      netUTR-1.0b   donor      1536    1536   0.623   +
X05196      netUTR-1.0b   donor      2261    2261   0.508   +
X05196      netUTR-1.0b   donor      2463    2463   0.604   +
X05196      netUTR-1.0b   donor      3335    3335   0.509   +
X05196      netUTR-1.0b   donor      4297    4297   0.589   +
X05196      netUTR-1.0b   donor      4455    4455   0.625   +
X05196      netUTR-1.0b   donor      5929    5929   0.616   +
# ---------------------------------------------------------------------------
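Output of this kind can be post-processed with a short script. The sketch below assumes one whitespace-separated record per line, with comment lines starting with '#' and an optional trailing 'H' flag, as in the example; the names `SpliceSite` and `parse_netutr_output` are hypothetical and not part of the server.

```python
from typing import List, NamedTuple


class SpliceSite(NamedTuple):
    seqname: str
    source: str
    feature: str        # "donor" or "acceptor"
    position: int       # start and end are identical in the example output
    score: float
    strand: str
    high_confidence: bool


def parse_netutr_output(text: str) -> List[SpliceSite]:
    """Parse prediction lines, skipping '#' comments and blank lines."""
    sites = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7 or fields[0].startswith("#"):
            continue
        seqname, source, feature, start, _end, score, strand = fields[:7]
        sites.append(SpliceSite(seqname, source, feature, int(start),
                                float(score), strand, "H" in fields[7:]))
    return sites


# A few lines copied from the example output above.
sample = """\
##gff-version 2
X05196 netUTR-1.0b acceptor 1189 1189 0.631 +
X05196 netUTR-1.0b acceptor 1411 1411 0.861 + H
X05196 netUTR-1.0b donor 155 155 0.971 + H
X05196 netUTR-1.0b donor 993 993 0.612 +
"""

high = [(s.feature, s.position) for s in parse_netutr_output(sample)
        if s.high_confidence]
```

In this sample, `high` contains the acceptor at 1411 and the donor at 155, the two ends of the 5' UTR intron mentioned in the description.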

References

Analysis and recognition of 5' UTR intron splice sites in human pre-mRNA.
E. Eden and S. Brunak.
Nucleic Acids Research, 32:1131-1142, 2004.

Abstract

Prediction of splice sites in non-coding regions of genes is one of the most challenging aspects of gene structure recognition. We perform a rigorous analysis of such splice sites embedded in human 5' UTR regions, and investigate correlations between this class of splice sites and other features found in the adjacent exons and introns. By restricting the training of neural network algorithms to 'pure' untranslated regions (not extending partially into protein coding regions), we for the first time investigate the predictive power of the splicing signal proper, in contrast to conventional splice site prediction, which typically relies on the change in sequence at the transition from protein coding to non-coding. By doing so, the algorithms were able to pick up subtler splicing signals that were otherwise masked by 'coding' noise, thus significantly enhancing the prediction of 5' UTR splice sites. For example, the non-coding splice site predicting networks pick up compositional and positional bias in the 3' ends of non-coding exons and 5' non-coding intron ends, where cytosine and guanine are overrepresented. This compositional bias at the true UTR donor sites is also visible in the synaptic weights of the neural networks trained to identify UTR donor sites. Conventional splice site prediction methods perform poorly in UTR regions, because the reading frame pattern is absent. The NetUTR method presented here performs 2-3 fold better compared to NetGene2 and GenScan in 5' UTR regions. We also tested the 5' UTR trained method on protein coding regions, and discovered surprisingly that it works quite well (although it cannot compete with NetGene2). This indicates that the local splicing pattern in UTR and coding regions is largely the same.

GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
