DTU Health Tech

Department of Health Technology

FeatureExtract - 1.2

Extraction of sequence and annotation, e.g. intron/exon structure, from GenBank format files


The FeatureExtract server extracts sequence and feature annotation, such as intron/exon structure, from GenBank entries and other GenBank format files.

New in version 1.2: Placeholder GenBank entries are expanded into subentries automatically. New options for spliced genes. A command-line version (any platform) of FeatureExtract is now available as open source under Download.

Light version (2017): Automatic look up of GenBank IDs disabled due to stability issues.

Submission


Paste in one or more GenBank file(s)

Upload file containing one or more GenBank entries


View example GenBank file
Notice: Multiple GenBank format files can be concatenated. A comprehensive source for GenBank files is the NCBI web-site: http://www.ncbi.nlm.nih.gov/.

Mar 13th, 2017: Light version (automatic look-up of GenBank IDs disabled).


Instructions: Basic usage - Paste in or upload a set of GenBank format files and hit submit. The FeatureExtract server will then by default extract all protein coding genes with full intron/exon annotation.

Please read the DTU Health Tech access policies for information about limitations on the daily number of submissions. For processing large datasets (e.g. the Human Genome builds from NCBI) it is recommended to download the command-line version of FeatureExtract from the "Software download" page and do the processing locally.


Basic options

Select type of features to extract

Alternatively, enter the desired feature type(s) below:

Example: CDS,rRNA,tRNA

Include intergenic regions.

Naming preferences

1) Gene name
2) Systematic name
3) EntryId + distance

If the desired type of naming is not available, fall back to the level below: 1 -> 2 -> 3.

Flanking regions

bp : Upstream (5')
bp : Downstream (3')

Optional: Define flanking regions


Advanced options

Frameshifts

(bp): Frameshift cutoff

"Introns" shorter than this length are considered annotated frameshifts

Custom defined annotation

Example: snRNA=(N),promoter={P},unknown=QQQ

Splicing (new in 1.2)

Splice all intron-containing sequences
Full-length sequences are kept in the comments field

Only output intron-containing sequences
Can be used in combination with the "splice all..." option


Feature types to annotate in flanking regions

Alternatively, enter the desired feature type(s) below:

Example: MOST,polyA

Flanking region annotation scheme

Full annotation
Uppercase = same strand, Lowercase = opposite strand.
Presence/absence annotation
+ = same strand, - = opposite strand, # = overlapping

Troubleshooting

Produce verbose information

Verbose: Output additional information about the contents of the GenBank files and the general progress of the extraction.

Restrictions: A maximum of 100 MB of GenBank files will be processed in each run.

Confidentiality:
The sequences are kept confidential and will be deleted after processing.


CITATIONS

For publication of results, please cite:

FeatureExtract - extraction of sequence annotation made easy.
Rasmus Wernersson.
Nucleic Acids Research, 2005, Vol. 33, Web Server issue W567-W569


PORTABLE VERSION

The command-line version of FeatureExtract is open source software (GPL license) and can be downloaded from the "Software download" page.

If you require FeatureExtract under a commercial license, please contact software@cbs.dtu.dk.

Usage instructions


Quick start

Define the GenBank entries to be analysed by specifying GenBank accession IDs (paste in or upload) or by pasting in (or uploading) GenBank files. A combination of IDs and GenBank files is equally acceptable. Hitting "Submit query" at this point will run the server with default settings: all protein coding genes ("CDSs") will be extracted with full intron/exon annotation.

The wanted feature types (CDS, rRNA, etc.), preferences for naming and definition of flanking regions can be specified using the Basic options.

Please notice that all three "Submit query" buttons perform the same action. The idea is that it is not necessary to scroll down the web page if the options are not altered.


Detailed instructions

Specifying the input data in GenBank format

1) Specify GenBank entries by accession IDs (NOT AVAILABLE IN 1.2L (light))

The easiest way to specify GenBank information is to supply a list of GenBank entry IDs. The GenBank database used by the FeatureExtract server is a mirror of the GenBank flat file distribution with the addition of several eukaryotic genomes (see databases for details).

2) Supply your own GenBank format files

Use the "Upload file" option for large file(s). Smaller files can be pasted in. Multiple files can be concatenated.
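Because several GenBank files can be concatenated into one upload, it can be useful to check how many entries a file holds before submitting it. The following is a minimal Python sketch, not part of FeatureExtract. It assumes the standard GenBank flat file layout, in which each entry ends with a line containing only `//`. The function name `split_genbank_entries` and the toy entries are invented for this illustration.

```python
def split_genbank_entries(text):
    """Split concatenated GenBank text into one string per entry.

    Assumes every entry is terminated by a line holding only '//'.
    """
    entries, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":
            entries.append("\n".join(current) + "\n")
            current = []
    # Keep a trailing entry that lacks its terminator, if it has content
    if any(line.strip() for line in current):
        entries.append("\n".join(current) + "\n")
    return entries


# Two toy entries (heavily abbreviated, for illustration only)
example = (
    "LOCUS       AAA 10 bp\nORIGIN\n        1 acgtacgtac\n//\n"
    "LOCUS       BBB 5 bp\nORIGIN\n        1 acgta\n//\n"
)
entries = split_genbank_entries(example)
print(len(entries))
```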

Any file complying with the GenBank format definition can be used here, for example chromosome files from the eukaryotic genomes mentioned above. Another example could be files with custom gene/promoter etc. predictions.


Basic options

Select type of features to extract

Select which feature type(s) to extract. A number of predefined feature types can be selected. Multiple feature types can be entered in the text field as a comma-separated list, e.g. CDS,rRNA,tRNA,repeat. The MOST keyword (see below) can be useful when extracting intergenic regions.

Notice that some feature types are not used consistently across GenBank entries. In particular, the actual meaning of gene and mRNA varies a lot.

Intergenic regions: Selecting this option will include the intergenic regions in the set of extracted sequences. The intergenic regions are simply defined as the areas between the features defined here. Intergenic regions can be extracted with flanks.
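The definition of intergenic regions as "the areas between the features" can be sketched in a few lines of Python. This is an illustration only, not the server's code. It assumes 1-based inclusive coordinates and merges overlapping features. Whether the server also reports the stretches before the first and after the last feature is not stated here; this sketch includes them. The function name `intergenic_regions` is hypothetical.

```python
def intergenic_regions(features, seq_length):
    """Return (start, end) gaps between features, 1-based inclusive.

    `features` is a list of (start, end) tuples; overlapping features
    are effectively merged before the gaps are computed.
    """
    regions = []
    covered_to = 0  # last position covered by any feature so far
    for start, end in sorted(features):
        if start > covered_to + 1:
            regions.append((covered_to + 1, start - 1))
        covered_to = max(covered_to, end)
    # Assumption: the tail after the last feature counts as a region too
    if covered_to < seq_length:
        regions.append((covered_to + 1, seq_length))
    return regions


gaps = intergenic_regions([(11, 20), (15, 30), (41, 50)], 60)
print(gaps)  # [(1, 10), (31, 40), (51, 60)]
```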

Naming preferences

Specify the preferred naming of each extracted entry. If the desired type of name is not available, fall back to the next level: 1 > 2 > 3.

  1. Gene name
    - GenBank field: /gene="xxx"
     
  2. Systematic name
    - GenBank field: /locus_tag="xxx"
     
  3. Entry ID + distance
    - GenBank field: LOCUS
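The fall-back order can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the server's implementation: the function `choose_name`, the qualifier dictionary, the example values and the exact format of the level 3 name (entry ID joined with a position) are all made up here.

```python
def choose_name(qualifiers, locus, start, preference=1):
    """Pick a name, falling back 1 (gene) -> 2 (locus_tag) -> 3 (entry ID).

    `qualifiers` maps GenBank qualifier names to values, e.g.
    {"gene": "abc1", "locus_tag": "XX_0001"} (hypothetical values).
    """
    if preference <= 1 and qualifiers.get("gene"):
        return qualifiers["gene"]
    if preference <= 2 and qualifiers.get("locus_tag"):
        return qualifiers["locus_tag"]
    # Level 3: entry ID plus a position; this output format is an assumption
    return f"{locus}_{start}"


both = {"gene": "abc1", "locus_tag": "XX_0001"}
print(choose_name(both, "ENTRY1", 1200))                # abc1
print(choose_name(both, "ENTRY1", 1200, preference=2))  # XX_0001
print(choose_name({}, "ENTRY1", 1200))                  # ENTRY1_1200
```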

Flanking regions

Define flanking regions, if any.

Notice: computations concerning flanking region elements are only performed if flanking regions have been requested using this option.

Submit query

Click on the "Submit query" button. If the processing of the query takes more than a few seconds, you will get the option of supplying your email address to be notified when the job is done.


Advanced options

FeatureExtract has support for a number of advanced options. Typically it is not necessary to set these manually and most users can safely skip this section and proceed to submitting the query.

Frameshifts

This option defines the cut-off value that determines whether an intervening sequence is annotated as a frameshift or an intron. Intervening sequences shorter than the specified value are considered frameshifts; this includes negative frameshifts.
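As a rough illustration of the cut-off rule, the Python sketch below labels each gap between consecutive exons. It assumes 1-based inclusive exon coordinates in transcript order. The cut-off of 10 bp is an arbitrary example value, not the server default, and `classify_gaps` is a hypothetical name. A negative gap length corresponds to overlapping exons, i.e. a negative frameshift.

```python
def classify_gaps(exons, cutoff):
    """Label each gap between consecutive exons as 'intron' or 'frameshift'.

    `exons` is a list of (start, end) tuples, 1-based inclusive, in order.
    The gap length is negative when consecutive exons overlap.
    """
    labels = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        gap_length = next_start - prev_end - 1
        kind = "frameshift" if gap_length < cutoff else "intron"
        labels.append((gap_length, kind))
    return labels


gap_labels = classify_gaps([(1, 100), (102, 200), (199, 300), (800, 900)], cutoff=10)
print(gap_labels)  # [(1, 'frameshift'), (-2, 'frameshift'), (499, 'intron')]
```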

Custom defined annotation

Using this option it is possible to extend (or redefine) the built-in annotation table.
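The example string from the form, snRNA=(N),promoter={P},unknown=QQQ, suggests that each feature type maps to three characters. The Python sketch below reads them as first position, inner positions and last position, by analogy with blocks such as (EEEEEEE). That reading is an assumption, and the names `parse_custom_annotation` and `render_block` are invented for this illustration.

```python
def parse_custom_annotation(spec):
    """Parse 'type=XYZ,...' into {type: (first, inner, last)}."""
    table = {}
    for item in spec.split(","):
        feature, _, code = item.strip().partition("=")
        if len(code) != 3:
            raise ValueError(f"expected three characters for {feature!r}, got {code!r}")
        table[feature] = (code[0], code[1], code[2])
    return table


def render_block(codes, length):
    """Build an annotation block of `length` positions from a 3-char code."""
    first, inner, last = codes
    if length < 2:
        return inner * length  # assumption: very short blocks use the inner code
    return first + inner * (length - 2) + last


table = parse_custom_annotation("snRNA=(N),promoter={P},unknown=QQQ")
print(render_block(table["snRNA"], 6))     # (NNNN)
print(render_block(table["promoter"], 5))  # {PPP}
```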

Splicing

Notice: For all sequences containing introns or frameshifts, the spliced sequence and annotation are by default added to the comment field.

Splice all intron-containing sequences
Enabling this option will cause the server to produce spliced sequences (and annotation) for all intron-containing sequences. The full-length sequence and annotation are then moved into the comment field.

Only output intron-containing sequences
Enabling this option will suppress output that does not contain introns or frameshifts. This option can be used in combination with the "Splice all..." option mentioned above as a quick way of producing a spliced-only dataset.
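For readers working with unspliced output, the relation between a full-length entry and its spliced form can be sketched in Python using the annotation codes described under "Output format" (D, I, A for introns and <, F, > for frameshifts). This is an illustration under assumptions, not the server's code. It only handles a main sequence without flanks, and it cannot reproduce negative frameshifts, which is one reason the server supplies the spliced sequence itself. The function name `splice` and the toy sequence are made up.

```python
INTRON_CODES = set("DIA")      # donor site, intron body, acceptor site
FRAMESHIFT_CODES = set("<F>")  # annotated (positive) frameshifts


def splice(seq, ann):
    """Drop intron and frameshift positions from a sequence and its annotation."""
    if len(seq) != len(ann):
        raise ValueError("sequence and annotation must have the same length")
    drop = INTRON_CODES | FRAMESHIFT_CODES
    kept = [(base, code) for base, code in zip(seq, ann) if code not in drop]
    return "".join(b for b, _ in kept), "".join(c for _, c in kept)


# Toy gene: two exons separated by one intron
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTCAG", "GGATAA"
seq = exon1 + intron + exon2
ann = "(EEEEE" + "D" + "I" * (len(intron) - 2) + "A" + "EEEEE)"
spliced_seq, spliced_ann = splice(seq, ann)
print(spliced_seq)  # ATGGCCGGATAA
print(spliced_ann)  # (EEEEEEEEEE)
```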

Feature types to annotate in flanking regions

This option governs which feature types to annotate in the flanking regions. The default value, the keyword MOST, is a list built to minimize the problem with feature type synonyms (e.g. 'CDS' vs. 'gene' vs. 'mRNA') while still extracting as much information as possible. The keywords are defined below:

  • ALL
    • All feature types.

  • MOST
    • CDS
    • 3'UTR
    • 5'UTR
    • promoter
    • -35_signal
    • -10_signal
    • RBS
    • rRNA
    • tRNA
    • snoRNA
    • scRNA
    • misc_RNA
    • misc_feature

A custom defined list can be specified as a comma separated list.
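How a value such as MOST,polyA (the example shown on the form) might expand into a concrete list can be sketched as follows. The MOST list is copied from the definition above. The function `expand_feature_types`, and the choice to return `None` for ALL to mean "no filtering", are assumptions made for this illustration.

```python
MOST = ["CDS", "3'UTR", "5'UTR", "promoter", "-35_signal", "-10_signal",
        "RBS", "rRNA", "tRNA", "snoRNA", "scRNA", "misc_RNA", "misc_feature"]


def expand_feature_types(spec):
    """Return the list of feature types named in `spec`, or None for ALL."""
    expanded = []
    for token in (t.strip() for t in spec.split(",")):
        if token == "ALL":
            return None  # no filtering: every feature type qualifies
        for name in (MOST if token == "MOST" else [token]):
            if name and name not in expanded:
                expanded.append(name)
    return expanded


selected = expand_feature_types("MOST,polyA")
print(len(selected), selected[-1])  # 14 polyA
```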

Flanking region annotation scheme

This option governs how features in flanking regions are annotated.

  • Full annotation
    • Use the same annotation scheme as in the extracted sequences (e.g. (EEEEEEE) for exons).
    • Features on the opposite strand relative to the individual extracted sequence are annotated in lowercase.

  • Presence/absence annotation
    • Only annotate the presence or absence of features.
    • Characters used:
      "+" : a feature on the same strand.
      "-" : a feature on the opposite strand.
      "#" : overlapping features.

Troubleshooting

Verbose mode: Output additional information about the contents of the GenBank files and the general progress of the extraction


Examples

Example 1: Alphaglobins using GenBank accession IDs (NOT AVAILABLE IN 1.2L (light))

The following list of GenBank entries contains alpha globins from a wide range of organisms. This example illustrates the annotation of exon and intron regions in protein coding genes.

Instruction: Paste in the list and hit "submit".

AB001981
X01831
J00923
J00043
J00044
X01086
X07053
AF098919

Example 2: Yeast mitochondrial genes

This is an example of how to work with an uploaded GenBank file.

Instructions: Download GenBank file NC_001224 (This file contains the Yeast mitochondrial chromosome - part of the Yeast genome build from SGD). Upload the file to the FeatureExtract server, using the "Upload file containing one or more GenBank files" option. Hit "Submit query".

Notes on working with a chromosomal file

The mitochondrial GenBank file is also a good example of how FeatureExtract works with a chromosomal file containing multiple sequence features. For experimentation, try to enable the extraction of flanks, say 300 bp upstream and 200 bp downstream. Also, try to widen the set of feature types to be extracted from the default (CDS) to a custom list: CDS,rRNA,tRNA.
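To make clear what upstream and downstream mean for features on either strand, here is a small Python sketch of flank extraction. It is not the server's implementation. It assumes 1-based inclusive coordinates on the forward strand and follows the output convention of UPPERCASE for the main sequence and lowercase for flanks. The function names and the toy sequence are hypothetical, and the flank lengths are kept tiny so the result is easy to read.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string, preserving upper/lower case."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_with_flanks(genome, start, end, strand="+", upstream=0, downstream=0):
    """Return lowercase flanks around an UPPERCASE main sequence."""
    # On the minus strand, the upstream flank lies to the right on the forward strand
    left_len, right_len = (upstream, downstream) if strand == "+" else (downstream, upstream)
    left = genome[max(start - 1 - left_len, 0):start - 1]
    main = genome[start - 1:end]
    right = genome[end:end + right_len]
    result = left.lower() + main.upper() + right.lower()
    return reverse_complement(result) if strand == "-" else result


genome = "AAACCCGGGTTTACGT"
plus = extract_with_flanks(genome, 7, 9, "+", upstream=3, downstream=2)
minus = extract_with_flanks(genome, 7, 9, "-", upstream=3, downstream=2)
print(plus)   # cccGGGtt
print(minus)  # aaaCCCgg
```

Flanks that would run past either end of the sequence are simply truncated in this sketch.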

Output format


SAMPLE OUTPUT

Tables illustrating the data format: Sample output.

File containing 270 intron containing genes from Yeast:
yeast_genome.with_introns.tab [1470 kb]

View the TAB file using a text editor (e.g. UltraEdit on Windows, BBedit on Mac or NEdit on Unix), or import the file into a spreadsheet like Excel or a database like MySQL or Access.


FORMAL DESCRIPTION

	The output data format uses a scheme with one
	entry per line in the following format (tab separated):
	
	name	seq	ann	com
	
	name:	The sequence name, as determined by the "Naming preference"
		option.
		
	seq:	The DNA sequence itself. UPPERCASE is used for the
		main sequence, lowercase is used for flanks (if any).
		
	ann:	Single letter sequence annotation. Position for position
		the annotation describes the DNA sequence: The first
		letter in the annotation describes the annotation for
		the first position in the DNA sequence and so forth.
		
		The annotation code is defined as follows:
		
		FEATURE BLOCKS (AKA. "EXON BLOCKS")
		
		(	First position
		E	Exon
		T	tRNA exonic region
		R	rRNA / generic RNA exonic region
		P	Promoter
		X	Unknown feature type
		)	Last position
		
		?	Ambiguous first or last position
		
		[	First UTR region position
		3	3'UTR
		5	5'UTR
		]	Last UTR region position		
		
			NOTICE: custom feature blocks can be defined using
			the "Custom defined annotation" option.
			
		INTRONS and FRAMESHIFTS
			
		D	First intron position (donor site)
		I	Intron position
		A	Last intron position (acceptor site)
		
		<	Start of frameshift
		F	Frameshift
		>	End of frameshift
		
		REGIONS WITHOUT FEATURES
		
		.	NULL annotation (no annotation).
		
		ONLY IN FLANKING REGIONS:
		
		+	Other feature defined on the SAME STRAND
			as the current entry.
		-	Other feature defined on the OPPOSITE STRAND
			relative to the current entry.
		#	Multiple or overlapping features.

		A..Z:	Feature on the SAME STRAND as the current entry.
		a..z:	Feature on the OPPOSITE STRAND relative to the current entry.
				
			Notice: The type of features annotated in the flanking
			regions is determined by the following option: 
			"Feature types to annotate in flanking regions"
		
	com:	Comments (free text). All text, extra information etc
		defined in the GenBank files are concatenated into a single
		comment.
		
		The following extra information is added by this program:
		
		*) Strand ("+" or "-").
		*) GenBank entry ID ("LOCUS").
		*) Feature type (e.g. "CDS" or "rRNA")
		*) Spliced DNA sequence. Simply the DNA sequence defined
		   by the JOIN statement. 
		   This is provided for two reasons. 1) To overcome negative
		   frameshifts. 2) As an easy way of extracting the sequence
		   of the spliced product.
		*) Spliced DNA annotation.
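
Because each output line holds the four tab-separated fields name, seq, ann and com, the format is easy to parse. The Python sketch below is only an illustration. It assumes one entry per line, that seq and ann have equal length, and that flanks are exactly the lowercase part of seq. The names `Entry`, `parse_line` and `main_region`, and the toy line, are invented here.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "name seq ann com")


def parse_line(line):
    """Parse one output line into its four tab-separated fields."""
    # maxsplit=3 keeps any further tabs inside the free-text comment
    name, seq, ann, com = line.rstrip("\n").split("\t", 3)
    if len(seq) != len(ann):
        raise ValueError(f"{name}: sequence and annotation differ in length")
    return Entry(name, seq, ann, com)


def main_region(entry):
    """Return (seq, ann) for the UPPERCASE main sequence, without flanks."""
    upper = [i for i, base in enumerate(entry.seq) if base.isupper()]
    if not upper:
        return "", ""
    lo, hi = upper[0], upper[-1] + 1
    return entry.seq[lo:hi], entry.ann[lo:hi]


# Toy line: 2 bp upstream flank, 6 bp feature, 2 bp downstream flank
line = "demo1\taaATGGCCtt\t..(EEEE)++\tillustrative comment\n"
entry = parse_line(line)
core_seq, core_ann = main_region(entry)
print(entry.name, core_seq, core_ann)  # demo1 ATGGCC (EEEE)
```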

Article abstract


ABSTRACT

Work on a large number of biological problems benefits tremendously from having an easy way to access the annotation of DNA sequence features, such as intron/exon structure, the contents of promoter regions and the location of other genes in upstream and downstream regions. For example, taking the placement of introns within a gene into account can help in a phylogenetic analysis of homologous genes. Designing experiments for investigating UTR regions using PCR or DNA microarrays requires knowledge of known elements in UTR regions and the positions and strandness of other genes nearby on the chromosome. A wealth of such information is already known and documented in databases such as GenBank and the NCBI Human Genome builds. However, it usually requires significant bioinformatics skills and intimate knowledge of the data format to access this information.

Presented here is a highly flexible and easy-to-use tool for extracting feature annotation from GenBank entries. The tool is also useful for extracting datasets corresponding to a particular feature (e.g. promoters). Most importantly, the output data format is highly consistent, easy to handle for the user and easy to parse computationally.

REFERENCE

Rasmus Wernersson.
FeatureExtract - extraction of sequence annotation made easy.
Nucleic Acids Research, 2005, Vol. 33, Web Server issue W567-W569

Contact
Rasmus Wernersson: raz@cbs.dtu.dk (Web)



GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
