Usage instructions


Quick start

Define the GenBank entries to be analysed by specifying GenBank accession IDs (paste in or upload) or by pasting in (or uploading) GenBank files. A combination of IDs and GenBank files is equally acceptable. Hitting "Submit query" at this point will run the server with default settings: all protein-coding genes ("CDS" features) will be extracted with full intron/exon annotation.

The desired feature types (CDS, rRNA, etc.), the naming preferences and the definition of flanking regions can be specified using the Basic options.

Please note that all three "Submit query" buttons perform the same action. The idea is that it is not necessary to scroll down the web page if the options are not altered.


Detailed instructions

Specifying the input data in GenBank format

1) Specify GenBank entries by accession IDs (NOT AVAILABLE IN 1.2L (light))

The easiest way to specify GenBank information is simply to supply a list of GenBank entry IDs. The GenBank database that the FeatureExtract server uses is a mirror of the GenBank flat file distribution with the addition of several eukaryotic genomes (see databases for details).

2) Supply your own GenBank format files

Use the "Upload file" option for large file(s). Smaller files can be pasted in. Multiple files can be concatenated.

Any file complying with the GenBank format definition can be used here. For example, this could be chromosome files from the eukaryotic genomes mentioned above. Another example could be files with custom predictions (genes, promoters, etc.).
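
As a rough illustration of concatenating several GenBank files before pasting or uploading them, the sketch below joins the texts of multiple files into one. It relies only on the GenBank convention that every record ends with a line containing "//". The function name is hypothetical and is not part of FeatureExtract.

```python
def concatenate_genbank(texts):
    """Join GenBank flat-file texts into one multi-record text."""
    records = []
    for text in texts:
        body = text.strip()
        if not body:
            continue  # skip empty inputs
        # Each GenBank record is terminated by a line holding only "//".
        if not body.endswith("//"):
            body += "\n//"
        records.append(body)
    return "\n".join(records) + "\n" if records else ""
```

The result can be saved to a single file and supplied through the "Upload file" option.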


Basic options

Select type of features to extract

Select which feature type(s) to extract. A number of predefined feature types can be selected. Multiple feature types can be entered in the text field as a comma-separated list, e.g. CDS,rRNA,tRNA,repeat. The MOST keyword (see below) can be useful when extracting intergenic regions.

Note that some feature types are not always used with the same meaning. In particular, the actual meaning of gene and mRNA varies a lot.

Intergenic regions: Selecting this option will include the intergenic regions in the set of extracted sequences. The intergenic regions are simply defined as the areas between the features defined here. Intergenic regions can be extracted with flanks.
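
The definition of intergenic regions as "the areas between the features" can be sketched as a gap computation over feature coordinates. This is only an illustration, not the server's code. It assumes 1-based inclusive coordinates, merges overlapping features, and also reports the stretches before the first and after the last feature; whether FeatureExtract treats the sequence ends this way is an assumption. The function name is hypothetical.

```python
def intergenic_regions(features, seq_length):
    """Return 1-based inclusive (start, end) gaps not covered by any feature."""
    gaps = []
    next_free = 1  # first position not yet covered by a feature
    for start, end in sorted(features):
        if start > next_free:
            gaps.append((next_free, start - 1))
        # Overlapping or nested features never move the pointer backwards.
        next_free = max(next_free, end + 1)
    if next_free <= seq_length:
        gaps.append((next_free, seq_length))
    return gaps
```

With features at 10..20, 15..30 and 50..60 on a 100 bp sequence, this yields the gaps 1..9, 31..49 and 61..100.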

Naming preferences

Specify the preferred naming of each extracted entry. If the desired type of name is not available, the server falls back to the next level: 1 > 2 > 3.

  1. Gene name
    - GenBank field: /gene="xxx"
     
  2. Systematic name
    - GenBank field: /locus_tag="xxx"
     
  3. Entry ID + distance
    - GenBank field: LOCUS
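
The fallback order above can be sketched as follows. This is a minimal illustration under stated assumptions: the qualifiers of a feature are given as a plain dictionary, the distance is passed in as an opaque number, and the "_" separator used for level 3 is invented for the example. The actual format produced by the server may differ. All names are hypothetical.

```python
LEVELS = ("gene", "locus_tag")  # qualifiers behind level 1 and level 2

def choose_name(qualifiers, entry_id, distance, preferred_level=1):
    """Pick a name at the preferred level, falling back 1 > 2 > 3."""
    # Try the preferred level first, then every later level in order.
    for key in LEVELS[preferred_level - 1:]:
        value = qualifiers.get(key)
        if value:
            return value
    # Level 3: entry ID plus distance (separator is an assumption).
    return f"{entry_id}_{distance}"
```

For example, a feature with only a /locus_tag qualifier receives its systematic name even when level 1 is preferred.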

Flanking regions

Define flanking regions, if any.

Notice: computations concerning flanking region elements are only performed if flanking regions have been requested using this option.

Submit query

Click on the "Submit query" button. If the processing of the query takes more than a few seconds, you will get the option of supplying your email address and being notified when the job is done.


Advanced options

FeatureExtract has support for a number of advanced options. Typically it is not necessary to set these manually and most users can safely skip this section and proceed to submitting the query.

Frameshifts

This option defines the cut-off value which determines whether an intervening sequence will be annotated as a frameshift or an intron. Intervening sequences shorter than the specified value will be considered frameshifts; this includes negative frameshifts.
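
The cut-off rule can be sketched like this. The example assumes exon segments given as 1-based inclusive (start, end) pairs in order, and it interprets a "negative frameshift" as two consecutive segments that overlap, which gives a negative gap length. The cut-off value is passed in explicitly because the server's default is not stated here. The function name is hypothetical.

```python
def classify_gaps(exons, cutoff):
    """Label the gap between consecutive exons as 'frameshift' or 'intron'."""
    labels = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        gap = next_start - prev_end - 1  # negative when the segments overlap
        kind = "frameshift" if gap < cutoff else "intron"
        labels.append((gap, kind))
    return labels
```

With a cut-off of 10, gaps of 1 and -2 bases are labelled as frameshifts, while a gap of 500 bases is labelled as an intron.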

Custom defined annotation

Using this option, it is possible to extend (or redefine) the built-in annotation table.

Splicing

Notice: For all intron- and frameshift-containing sequences, the spliced sequence and annotation are by default added to the comment field.

Splice all intron containing sequences
Enabling this option will cause the server to produce spliced sequences (and annotation) for all intron-containing sequences. The full-length sequence and annotation are then moved into the comment field.

Only output intron containing sequences
Enabling this option will suppress the output of sequences which do not contain introns or frameshifts. This option can be used in combination with the "Splice all..." option mentioned above, as a quick way of producing a spliced-only dataset.
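
What the two options do conceptually can be sketched as follows: splicing joins the exon segments of the full-length sequence, and the filter keeps only entries with more than one segment. This is an illustration only, assuming 1-based inclusive exon coordinates on the forward strand. It does not reproduce the server's annotation or comment-field handling, and the function names are hypothetical.

```python
def splice(sequence, exons):
    """Concatenate the exon segments (1-based, inclusive) of a sequence."""
    return "".join(sequence[start - 1:end] for start, end in exons)

def has_intervening_sequence(exons):
    """True when the entry consists of more than one segment."""
    return len(exons) > 1

def spliced_only(entries):
    """Keep multi-segment entries and return their spliced sequences."""
    return [splice(seq, exons) for seq, exons in entries
            if has_intervening_sequence(exons)]
```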

Feature types to annotate in flanking regions

This option governs which feature types to annotate in the flanking regions. The default value, the keyword MOST, is a list built to minimize the problem with feature type synonyms (e.g. 'CDS' vs. 'gene' vs. 'mRNA') but at the same time extract as much information as possible. The keywords are defined below:

A custom defined list can be specified as a comma separated list.

Flanking region annotation scheme

This option governs how features in flanking regions are annotated.

Troubleshooting

Verbose mode: Output additional information about the contents of the GenBank files and the general progress of the extraction.


Examples

Example 1: Alpha globins using GenBank accession IDs (NOT AVAILABLE IN 1.2L (light))

The following list of GenBank entries contains alpha globins from a wide range of organisms. This example illustrates the annotation of exon and intron regions in protein coding genes.

Instructions: Paste in the list and hit "Submit query".

AB001981
X01831
J00923
J00043
J00044
X01086
X07053
AF098919

Example 2: Yeast mitochondrial genes

This is an example of how to work with an uploaded GenBank file.

Instructions: Download the GenBank file NC_001224 (this file contains the yeast mitochondrial chromosome, part of the yeast genome build from SGD). Upload the file to the FeatureExtract server, using the "Upload file containing one or more GenBank files" option. Hit "Submit query".

Notes on working with a chromosomal file

The mitochondrial GenBank file is also a good example of how FeatureExtract works with a chromosomal file containing multiple sequence features. For experimentation, try to enable the extraction of flanks, say 300 bp upstream and 200 bp downstream. Also, try to widen the set of feature types to be extracted from the default (CDS) to a custom list: CDS,rRNA,tRNA.
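
To make the flank settings concrete, the sketch below computes the coordinates of a 300 bp upstream and a 200 bp downstream flank for a single feature. It is a minimal illustration, not the server's implementation. It assumes 1-based inclusive coordinates, that upstream and downstream are swapped for features on the reverse strand, and that flanks are clipped at the ends of the sequence. The function name and the "+"/"-" strand notation are hypothetical.

```python
def flank_coordinates(start, end, strand, seq_length,
                      upstream=300, downstream=200):
    """Return (upstream_flank, downstream_flank) as (start, end) or None."""
    # On the reverse strand, upstream lies to the right of the feature.
    if strand == "+":
        left, right = upstream, downstream
    else:
        left, right = downstream, upstream
    left_flank = None
    if left > 0 and start > 1:
        left_flank = (max(1, start - left), start - 1)
    right_flank = None
    if right > 0 and end < seq_length:
        right_flank = (end + 1, min(seq_length, end + right))
    if strand == "+":
        return left_flank, right_flank
    return right_flank, left_flank
```

For a forward-strand feature at 1000..2000 this gives an upstream flank of 700..999 and a downstream flank of 2001..2200.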