DTU Health Tech
Department of Health Technology
If you need help with the bioinformatics programs, see the "Getting Help" section below the program.
The FeatureExtract server extracts sequence and feature annotation, such as intron/exon structure, from GenBank entries and other GenBank format files.
New in version 1.2: Placeholder GenBank entries are expanded into subentries automatically. New options for handling spliced genes have been added. A command-line version (any platform) of FeatureExtract is now available as Open Source under Download.
Light version (2017): Automatic lookup of GenBank IDs is disabled due to stability issues.
Restrictions: A maximum of 100 MB of GenBank files will be processed in each run.
Confidentiality:
The sequences are kept confidential and will be deleted
after processing.
For publication of results, please cite:
FeatureExtract - extraction of sequence annotation made easy.
Rasmus Wernersson.
Nucleic Acids Research, 2005, Vol. 33, Web Server issue W567-W569
The command-line version of FeatureExtract is open source software (GPL license) and can be downloaded from the "Software download" page.
If you require FeatureExtract under a commercial license, please contact software@cbs.dtu.dk.
Define the GenBank entries to be analysed by specifying GenBank accession IDs (paste in or upload) or by pasting in (or uploading) GenBank files. A combination of IDs and GenBank files is equally acceptable. Hitting "Submit query" at this point will run the server with default settings: all protein-coding genes ("CDSs") will be extracted with full intron/exon annotation.
The desired feature types (CDS, rRNA, etc.), the naming preferences and the definition of flanking regions can be specified using the Basic options.
Please note that all three "Submit query" buttons perform the same action. The idea is that it is not necessary to scroll down the web page if the options are not altered.
The easiest way to specify GenBank information is to supply a list of GenBank entry IDs. The GenBank database used by the FeatureExtract server is a mirror of the GenBank flat file distribution with the addition of several eukaryotic genomes (see databases for details).
Use the "Upload file" option for large file(s). Smaller files can be pasted in. Multiple files can be concatenated.
Any file complying with the GenBank format definition can be used here. For example, this could be chromosome files from the eukaryotic genomes mentioned above. Another example could be files with custom predictions of genes, promoters, etc.
Select which feature type(s) to extract. A number of predefined feature types can be selected. Multiple features can be entered in the text field as a comma-separated list, e.g. CDS,rRNA,tRNA,repeat. The MOST keyword (see below) can be useful when extracting intergenic regions.
Note that some feature types are not always defined to mean the same thing. In particular, the actual meaning of gene and mRNA varies a lot.
Intergenic regions: Selecting this option will include the intergenic regions in the set of extracted sequences. The intergenic regions are simply defined as the areas between the features defined here. Intergenic regions can be extracted with flanks.
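To illustrate the definition of intergenic regions as "the areas between the features", the following Python sketch computes the gaps between a set of feature intervals. It is not the server's implementation. The function name intergenic_regions and the 0-based, half-open coordinates are assumptions made for this example, and whether the server also reports the regions before the first and after the last feature is not stated here; the sketch includes them.

```python
def intergenic_regions(features, seq_length):
    """Return the gaps between features as (start, end) tuples.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    Overlapping or adjacent features are merged implicitly, because a gap
    is only reported when a feature starts beyond the covered region.
    """
    gaps = []
    cursor = 0  # end of the region covered by features so far
    for start, end in sorted(features):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < seq_length:
        gaps.append((cursor, seq_length))
    return gaps


# Hypothetical feature coordinates on a 1500 bp sequence.
features = [(100, 400), (350, 600), (900, 1200)]
print(intergenic_regions(features, 1500))
# [(0, 100), (600, 900), (1200, 1500)]
```

Sorting the features first means the input may be given in any order, and overlapping features never produce a negative-length gap.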
Specify the preferred naming of each extracted entry. If the desired type of name is not available, the server falls back to the next level: 1 > 2 > 3.
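The fall-back rule (1 > 2 > 3) can be sketched as a simple ordered lookup. This is an illustration only: the function pick_name, the qualifier names used as the three levels, and the sample values are all hypothetical. The actual naming choices are the ones offered on the web form.

```python
def pick_name(qualifiers, preference):
    """Return the first available name, trying each level in order."""
    for key in preference:
        value = qualifiers.get(key)
        if value:  # skip missing or empty qualifiers
            return value
    return None  # no naming level was available


# Hypothetical preference order (level 1 > level 2 > level 3).
preference = ["gene", "locus_tag", "product"]

# Hypothetical feature without a "gene" qualifier: level 2 is used.
feature = {"locus_tag": "ABC_0001", "product": "hypothetical protein"}
print(pick_name(feature, preference))
# ABC_0001
```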
Define flanking regions, if any.
Notice: computations concerning flanking region elements are only performed if flanking regions have been requested using this option.
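As a rough illustration of what extracting a feature with flanking regions means, the sketch below cuts a feature out of a longer sequence and marks the flanks in lowercase, following the UPPERCASE/lowercase convention of the output format. The function extract_with_flanks, its parameters, the 0-based half-open coordinates and the restriction to the plus strand are assumptions of this example, not a description of the server code.

```python
def extract_with_flanks(genome, start, end, upstream=0, downstream=0):
    """Return genome[start:end] in UPPERCASE with lowercase flanks.

    Coordinates are 0-based and half-open, plus strand only (assumptions
    of this sketch). Flanks are truncated at the ends of the sequence.
    """
    left = max(0, start - upstream)
    right = min(len(genome), end + downstream)
    return (
        genome[left:start].lower()
        + genome[start:end].upper()
        + genome[end:right].lower()
    )


genome = "ACGTACGTACGTACGT"  # toy sequence
print(extract_with_flanks(genome, 4, 8, upstream=2, downstream=3))
# gtACGTacg
```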
Click on the "Submit query" button. If the processing of the query takes more than a few seconds, you will get the option of supplying your email address and being notified when the job is done.
FeatureExtract has support for a number of advanced options. Typically it is not necessary to set these manually and most users can safely skip this section and proceed to submitting the query.
This option defines the cut-off value that determines whether an intervening sequence will be annotated as a frameshift or an intron. Intervening sequences shorter than the specified value will be considered frameshifts; this includes negative frameshifts.
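The cut-off rule can be sketched as follows. This is a minimal illustration, assuming 1-based inclusive coordinates as used in GenBank locations; the function classify_gap and the cut-off value of 10 are hypothetical and do not reflect the server's default.

```python
def classify_gap(prev_block_end, next_block_start, cutoff):
    """Classify the sequence between two consecutive exon blocks.

    prev_block_end is the last position of the upstream block and
    next_block_start the first position of the downstream block
    (1-based, inclusive). The gap length is negative when the blocks
    overlap, which corresponds to a negative frameshift.
    """
    gap_length = next_block_start - prev_block_end - 1
    kind = "frameshift" if gap_length < cutoff else "intron"
    return kind, gap_length


CUTOFF = 10  # hypothetical value, chosen for illustration only

for prev_end, next_start in [(100, 102), (100, 99), (100, 351)]:
    print(prev_end, next_start, classify_gap(prev_end, next_start, CUTOFF))
```

The three cases correspond to a one-base gap (frameshift), a two-base overlap (negative frameshift) and a 250-base gap (intron).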
Using this option it is possible to extend (or redefine) the built-in annotation table.
Notice: For all intron- and frameshift-containing sequences, the spliced sequence and annotation are by default added to the comment field.
Splice all intron-containing sequences
Enabling this option will cause the server to produce spliced sequences
(and annotation) for all intron-containing sequences. The full-length
sequence and annotation are then moved into the comment field.
Only output intron-containing sequences
Enabling this option will suppress the output of entries that do not
contain introns or frameshifts. This option can be used in combination
with the "Splice all..." option mentioned above, as a quick way of
producing a spliced-only dataset.
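The effect of these two options can be imitated on already extracted data. The sketch below uses the single-letter annotation codes from the output format (D, I and A for introns; <, F and > for frameshifts) to filter entries and to remove intron positions. It is an approximation under stated assumptions, not the server's method: the helper names are hypothetical, frameshifts are detected but not spliced, and the entry is assumed to have no flanking regions, since the letters A..Z have a different meaning in flanks.

```python
INTRON_CODES = set("DIA")      # donor site, intron, acceptor site
FRAMESHIFT_CODES = set("<F>")  # start, body and end of a frameshift


def has_intron_or_frameshift(ann):
    """True if the annotation contains any intron or frameshift code."""
    return any(c in INTRON_CODES or c in FRAMESHIFT_CODES for c in ann)


def splice(seq, ann):
    """Drop intron positions from a sequence and its annotation."""
    if len(seq) != len(ann):
        raise ValueError("sequence and annotation differ in length")
    kept = [(s, a) for s, a in zip(seq, ann) if a not in INTRON_CODES]
    return "".join(s for s, _ in kept), "".join(a for _, a in kept)


# Toy entry: two 6 bp exon blocks separated by a 12 bp intron.
seq = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
ann = "(EEEEE" + "D" + "I" * 10 + "A" + "EEEEE)"

if has_intron_or_frameshift(ann):
    spliced_seq, spliced_ann = splice(seq, ann)
    print(spliced_seq)
    print(spliced_ann)
```

For real data, the spliced sequence and annotation that the server places in the comment field should be preferred, because they are derived from the JOIN statement and also cover negative frameshifts.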
This option governs which feature types to annotate in the flanking regions. The default value, the keyword MOST, is a list built to minimize the problem with feature type synonyms (e.g. 'CDS' vs. 'gene' vs. 'mRNA') but at the same time extract as much information as possible. The keywords are defined below:
A custom defined list can be specified as a comma separated list.
This option governs how features in flanking regions are annotated.
Verbose mode: Output additional information about the contents of the GenBank files and the general progress of the extraction.
The following list of GenBank entries contains alpha globins from a wide range of organisms. This example illustrates the annotation of exon and intron regions in protein coding genes.
Instructions: Paste in the list and hit "Submit query".
AB001981 X01831 J00923 J00043 J00044 X01086 X07053 AF098919
This is an example of how to work with an uploaded GenBank file.
Instructions: Download GenBank file NC_001224 (This file contains the Yeast mitochondrial chromosome - part of the Yeast genome build from SGD). Upload the file to the FeatureExtract server, using the "Upload file containing one or more GenBank files" option. Hit "Submit query".
CDS,rRNA,tRNA
Tables illustrating the data format: Sample output.
File containing 270 intron-containing genes from Yeast:
yeast_genome.with_introns.tab [1470 kb]
View the TAB file using a text editor (e.g. UltraEdit on Windows, BBEdit on Mac or NEdit on Unix), or import the file into a spreadsheet such as Excel or a database such as MySQL or Access.
The output data format uses a scheme with one entry per line in the following format (tab separated):

name    seq    ann    com

name: The sequence name, as determined by the "Naming preference" option.

seq: The DNA sequence itself. UPPERCASE is used for the main sequence, lowercase is used for flanks (if any).

ann: Single-letter sequence annotation. Position for position, the annotation describes the DNA sequence: the first letter in the annotation describes the first position in the DNA sequence, and so forth. The annotation code is defined as follows:

FEATURE BLOCKS (AKA. "EXON BLOCKS")
(    First position
E    Exon
T    tRNA exonic region
R    rRNA / generic RNA exonic region
P    Promoter
X    Unknown feature type
)    Last position
?    Ambiguous first or last position
[    First UTR region position
3    3'UTR
5    5'UTR
]    Last UTR region position
NOTICE: custom feature blocks can be defined using the "Custom defined annotation" option.

INTRONS and FRAMESHIFTS
D    First intron position (donor site)
I    Intron position
A    Last intron position (acceptor site)
<    Start of frameshift
F    Frameshift
>    End of frameshift

REGIONS WITHOUT FEATURES
.    NULL annotation (no annotation).

ONLY IN FLANKING REGIONS:
+    Other feature defined on the SAME STRAND as the current entry.
-    Other feature defined on the OPPOSITE STRAND relative to the current entry.
#    Multiple or overlapping features.
A..Z: Feature on the SAME STRAND as the current entry.
a..z: Feature on the OPPOSITE STRAND as the current entry.
Notice: The type of features annotated in the flanking regions is determined by the following option: "Feature types to annotate in flanking regions"

com: Comments (free text). All text, extra information etc. defined in the GenBank files are concatenated into a single comment. The following extra information is added by this program:
*) Strand ("+" or "-").
*) GenBank entry ID ("LOCUS").
*) Feature type (e.g. "CDS" or "rRNA")
*) Spliced DNA sequence. Simply the DNA sequence defined by the JOIN statement. This is provided for two reasons: 1) To overcome negative frameshifts. 2) As an easy way of extracting the sequence of the spliced product.
*) Spliced DNA annotation.
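Because each entry is a single tab-separated line with the four fields name, seq, ann and com, the output is straightforward to parse. The following Python sketch reads one line and separates the flanks from the main sequence using the UPPERCASE/lowercase convention. The names Entry, parse_line and split_flanks are hypothetical, the sample line is invented for illustration, and the sketch assumes that the file has no header line and that every entry has a main (uppercase) sequence.

```python
import string
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    seq: str
    ann: str
    com: str


def parse_line(line):
    """Parse one tab-separated output line into an Entry."""
    # Split on the first three tabs only, so that the free-text
    # comment is kept intact even if it contains tabs itself.
    name, seq, ann, com = line.rstrip("\n").split("\t", 3)
    if len(seq) != len(ann):
        raise ValueError(f"{name}: sequence and annotation differ in length")
    return Entry(name, seq, ann, com)


def split_flanks(entry):
    """Return (upstream flank, main sequence, downstream flank)."""
    seq = entry.seq
    main_start = len(seq) - len(seq.lstrip(string.ascii_lowercase))
    main_end = len(seq.rstrip(string.ascii_lowercase))
    return seq[:main_start], seq[main_start:main_end], seq[main_end:]


# Invented example line: 2 bp upstream flank, 4 bp feature, 3 bp downstream flank.
line = "demo\tgtACGTacg\t..(EE)...\thypothetical comment text\n"
entry = parse_line(line)
print(entry.name, split_flanks(entry))
# demo ('gt', 'ACGT', 'acg')
```

To process a whole file, the same function can be applied to each line, for example with `entries = [parse_line(l) for l in open(path) if l.strip()]`, where path is the location of the downloaded TAB file.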
Work on a large number of biological problems benefits tremendously from having an easy way to access the annotation of DNA sequence features, such as intron/exon structure, the contents of promoter regions and the location of other genes in upstream and downstream regions. For example, taking the placement of introns within a gene into account can help in a phylogenetic analysis of homologous genes. Designing experiments for investigating UTR regions using PCR or DNA microarrays requires knowledge of known elements in UTR regions and the positions and strandedness of other genes nearby on the chromosome. A wealth of such information is already known and documented in databases such as GenBank and the NCBI Human Genome builds. However, it usually requires significant bioinformatics skills and intimate knowledge of the data format to access this information.
Presented here is a highly flexible and easy-to-use tool for extracting feature annotation from GenBank entries. The tool is also useful for extracting datasets corresponding to a particular feature (e.g. promoters). Most importantly, the output data format is highly consistent, easy to handle for the user and easy to parse computationally.
Contact
Rasmus Wernersson: raz@cbs.dtu.dk
If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).
If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
Correspondence:
Technical Support: