DTU Health Tech

Department of Health Technology


EasyGene - 1.2

Gene finding in prokaryotes

The EasyGene 1.2 server produces a list of predicted genes given a sequence of prokaryotic DNA. The current version contains models for 138 different organisms. Each prediction is assigned a significance score (R-value) indicating how likely it is to be just a non-coding open reading frame rather than a real gene. All that is required of you as a user is to submit the query sequence(s) and to select the organism model to use.

The pre-calculated EasyGene 1.2 predictions for the complete genomes of the 138 organisms can be downloaded from the EasyGene site at BINF at the University of Copenhagen.

Submission



Restrictions
At most 10,000,000 nucleotides per submission in at most 50 sequences.
Confidentiality
The sequences are kept confidential and will be deleted after processing.


CITATIONS

For publication of results, please cite:

Large-scale prokaryotic gene prediction and comparison to genome annotation.
P. Nielsen and A. Krogh.
Bioinformatics: 21:4322-4329, 2005.

PMID: 16249266

EasyGene - a prokaryotic gene finder that ranks ORFs by statistical significance.
Thomas Schou Larsen and Anders Krogh.
BMC Bioinformatics: 4:21, 2003

PMID: 12783628

Instructions


1. Specify the input sequences

All the input sequences must be in one-letter nucleotide code. The allowed alphabet (not case sensitive) is as follows:

A C G T and N (unknown)

All the other symbols will be converted to N before processing. The sequences can be input in the following two ways:

  • Paste a single sequence (just the nucleotides) or a number of sequences in FASTA format into the upper window of the main server page.

  • Select a FASTA file on your local disk, either by typing the file name into the lower window or by browsing the disk.

Both ways can be employed at the same time: all the specified sequences will be processed. However, a submission may contain at most 50 sequences and 1,000,000 nucleotides in total, and no single sequence may exceed 500,000 nucleotides.
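The input rules above can be checked locally before submitting. The following is a minimal sketch, not the server's own code: the function names are illustrative, dropping whitespace inside a sequence is an assumption, and the limits are passed in as parameters rather than hard-coded.

```python
import re


def parse_fasta(text):
    """Split FASTA text into (name, sequence) pairs.

    A bare sequence without a '>' header line is named 'Sequence'
    (a placeholder chosen for this sketch).
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None or chunks:
                records.append((name or "Sequence", "".join(chunks)))
            name = (line[1:].split() or ["Sequence"])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None or chunks:
        records.append((name or "Sequence", "".join(chunks)))
    return records


def normalize(seq):
    """Upper-case the sequence and turn every symbol outside A, C, G, T, N into N."""
    seq = "".join(seq.split())  # assumption: interior whitespace is dropped
    return re.sub(r"[^ACGTN]", "N", seq.upper())


def check_limits(records, max_sequences, max_total, max_per_sequence):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(records) > max_sequences:
        problems.append(f"{len(records)} sequences (limit {max_sequences})")
    total = sum(len(seq) for _, seq in records)
    if total > max_total:
        problems.append(f"{total} nucleotides in total (limit {max_total})")
    for name, seq in records:
        if len(seq) > max_per_sequence:
            problems.append(f"{name}: {len(seq)} nucleotides (limit {max_per_sequence})")
    return problems


if __name__ == "__main__":
    raw = ">seq1 example\nACGTacgu\n>seq2\nNNXX\n"
    cleaned = [(name, normalize(seq)) for name, seq in parse_fasta(raw)]
    print(cleaned)
    print(check_limits(cleaned, 50, 1_000_000, 500_000))
```

Calling `check_limits` with the numbers stated in this section (50, 1,000,000 and 500,000) returns an empty list when a submission is within bounds.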

2. Customize your run

  • Model organism
    Select the model organism most closely related to the source organism of your data. The pre-calculated EasyGene 1.2 predictions for the complete genomes of the 138 organisms on the current model list can be downloaded from the EasyGene site at BINF at the University of Copenhagen.

  • R-value cutoff
    Each prediction is attributed with a significance score (R-value) indicating how likely it is to be just a non-coding open reading frame rather than a real gene. By default the server only shows gene predictions with R-values below R=2. If more (and less certain!) predictions are wanted, one may enter a higher R-value in the specified field. For small genes especially, it may be a good idea to consider R-values up to around R=60.

  • Suboptimal gene starts
    By default the server only shows the highest scoring start codons. If alternative start codons are wanted, check the corresponding button.

3. Submit the job

Click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.

At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

Output format


DESCRIPTION

The output conforms to the GFF format. For each input sequence the server prints a list of predicted genes, one per line. The columns are:
  • seqname:  input sequence name;
  • model:  organism model code (also in plain text in the table head);
  • feature:  predicted feature, 'CDS' or 'CDSsub' (alternative translation start);
  • start and end:  positions in the sequence;
  • score:  R-value, indicating how likely the fragment is to be just a non-coding open reading frame rather than a real gene;
  • strand:  '+' or '-';
  • startc:  predicted start codon;
  • odds:  log odds score.
Only the predictions with R-values lower than the selected R-value cutoff (the default is 2) are reported.
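For downstream processing, the columns listed above can be read into a small record type. This is a sketch under assumptions, not an official parser: it assumes the fields are whitespace-separated as in the example output on this page, that each prediction line has ten fields (the table head shows one extra, unlabelled column marked '?' between strand and startc, kept here as `extra`), and that the leading '#' printed before the start codon can be stripped. The names `Prediction`, `parse_prediction` and `below_cutoff` are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    seqname: str   # input sequence name
    model: str     # organism model code
    feature: str   # 'CDS' or 'CDSsub'
    start: int
    end: int
    score: float   # R-value
    strand: str    # '+' or '-'
    extra: str     # unlabelled column, shown as '?' in the table head
    startc: str    # predicted start codon
    odds: float    # log odds score


def parse_prediction(line):
    """Parse one prediction line; return None for blank and '#' comment lines."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    f = text.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}: {line!r}")
    return Prediction(
        seqname=f[0], model=f[1], feature=f[2],
        start=int(f[3]), end=int(f[4]), score=float(f[5]),
        strand=f[6], extra=f[7],
        startc=f[8].lstrip("#"),  # assumption: '#' is only a prefix marker
        odds=float(f[9]),
    )


def below_cutoff(predictions, cutoff=2.0):
    """Keep predictions whose R-value is lower than the cutoff (default 2)."""
    return [p for p in predictions if p.score < cutoff]
```

Applying `below_cutoff` with a smaller cutoff to server output gives a stricter selection without resubmitting the job; a larger cutoff requires a new run, since the server does not report predictions above the cutoff it was given.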

The example below shows the EasyGene 1.2 output for the sequence taken from the GenBank entry AB010576, containing the Bacillus subtilis comX, comQ and degQ genes. All three genes are predicted as annotated in the database, with high confidence, although an alternative translation start is preferred for comQ. Two additional genes not annotated in the GenBank entry are also predicted.

EXAMPLE OUTPUT


##gff-version 2
##source-version easygene-1.2b
##date 2007-08-15
##Type DNA
# model:  BS03 Bacillus subtilis
# seqname       model   feature start   end       score        +/-      ?       startc  odds
# ---------------------------------------------------------------------------------------------
AB010576        BS03    CDS     67      324     0.0271875       +       0       #ATG    20.1861
AB010576        BS03    CDSsub  55      324     0.031955        +       0       #ATG    20.1731
AB010576        BS03    CDS     1129    1269    0.0190622       +       0       #ATG    15.7102
AB010576        BS03    CDS     1370    2314    2.13273e-12     +       0       #ATG    74.7815
AB010576        BS03    CDSsub  1454    2314    1.92405e-12     +       0       #ATG    74.6356
AB010576        BS03    CDS     2327    2491    0.0167943       +       0       #ATG    17.2951
AB010576        BS03    CDS     300     668     1.43511         -       0       #ATG    10.6215
# ---------------------------------------------------------------------------------------------
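To see which 'CDSsub' lines are alternative starts of which 'CDS' prediction, the lines can be grouped by their shared stop position. The sketch below makes two assumptions that are not stated on this page: that the stop end of a gene is the `end` coordinate on the '+' strand and the `start` coordinate on the '-' strand, and that the reported coordinates include the stop codon. The function name `group_by_stop` is illustrative.

```python
from collections import defaultdict

# Prediction lines copied from the example output above.
EXAMPLE = """\
AB010576        BS03    CDS     67      324     0.0271875       +       0       #ATG    20.1861
AB010576        BS03    CDSsub  55      324     0.031955        +       0       #ATG    20.1731
AB010576        BS03    CDS     1129    1269    0.0190622       +       0       #ATG    15.7102
AB010576        BS03    CDS     1370    2314    2.13273e-12     +       0       #ATG    74.7815
AB010576        BS03    CDSsub  1454    2314    1.92405e-12     +       0       #ATG    74.6356
AB010576        BS03    CDS     2327    2491    0.0167943       +       0       #ATG    17.2951
AB010576        BS03    CDS     300     668     1.43511         -       0       #ATG    10.6215
"""


def group_by_stop(text):
    """Group predictions that share sequence, strand and stop position."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comment lines
        f = line.split()
        feature, start, end = f[2], int(f[3]), int(f[4])
        r_value, strand = float(f[5]), f[6]
        # Assumption: the stop is the 3' end of the gene on its own strand.
        stop = end if strand == "+" else start
        groups[(f[0], strand, stop)].append((feature, start, end, r_value))
    return dict(groups)


if __name__ == "__main__":
    for (seqname, strand, stop), preds in group_by_stop(EXAMPLE).items():
        print(f"{seqname} strand {strand} stop {stop}")
        for feature, start, end, r_value in preds:
            length = end - start + 1
            print(f"  {feature:7s} {start}-{end}  {length} nt  R={r_value:g}")
```

With this grouping, the predictions ending at positions 324 and 2314 each appear with two candidate starts, while the remaining three predictions have a single start.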

Organism models (model code, organism):

AP02	Aeropyrum pernix 
ATW03	Agrobacterium tumefaciens str. C58
AA02	Aquifex aeolicus 
AF02	Archaeoglobus fulgidus DSM 4304 
BAA03	Bacillus anthracis str. Ames 
BCE03	Bacillus cereus ATCC 10987 
BH03	Bacillus halodurans 
BPS01	Burkholderia pseudomallei K96243
BS03	Bacillus subtilis 
BT02	Bacteroides thetaiotaomicron VPI-5482 
BBA01	Bdellovibrio bacteriovorus 
BL03	Bifidobacterium longum NCC2705 
BBR02	Bordetella bronchiseptica 
BPA02	Bordetella parapertussis 
BPE02	Bordetella pertussis 
BJ02	Bradyrhizobium japonicum 
BM02	Brucella melitensis
BSU03	Brucella suis 1330
BAS02	Buchnera aphidicola
CJ02	Campylobacter jejuni 
CF02	Candidatus Blochmannia floridanus 
CC02	Caulobacter crescentus CB15 
CM02	Chlamydia muridarum  
CPN03	Chlamydia pneumoniae AR39 
CT02	Chlamydia trachomatis 
CCA02	Chlamydophila caviae GPIC    
CTE02	Chlorobium tepidum TLS 
CV02	Chromobacterium violaceum ATCC 12472  
CA02	Clostridium acetobutylicum ATCC824  
CP02	Clostridium perfringens 
CTEE02	Clostridium tetani E88 
CDI01	Corynebacterium diphtheriae 
CEF01	Corynebacterium efficiens YS-314 
CG03	Corynebacterium glutamicum ATCC 13032 
CB02	Coxiella burnetii RSA 493 
DR02	Deinococcus radiodurans
EF02	Enterococcus faecalis V583 
ECC02	Escherichia coli CFT073 
EC03	Escherichia coli K12 
ECE03	Escherichia coli O157:H7 EDL933 
ECO02	Escherichia coli O157:H7  
FN02	Fusobacterium nucleatum subsp. nucleatum ATCC 2558... 
GS01	Geobacter sulfurreducens PCA 
GV01	Gloeobacter violaceus 
HD02	Haemophilus ducreyi 35000HP 
HI02	Haemophilus influenzae Rd 
HM01	Haloarcula marismortui ATCC 43049
HS02	Halobacterium sp. NRC-1 
HW01	Haloquadratum walsbyi DSM 16790  
HP02	Helicobacter pylori 26695 
HPJ02	Helicobacter pylori str. J99 
LJ01	Lactobacillus johnsonii NCC 533 
LP02	Lactobacillus plantarum WCFS1 
LL02	Lactococcus lactis subsp. lactis 
LIN02	Leptospira interrogans serovar lai str. 56601 
LI02	Listeria innocua Clip11262 
LM02	Listeria monocytogenes EGD 
MLO03	Mesorhizobium loti 
MET02	Methanobacterium thermoautotrophicum str. Delta H 
MBU01	Methanococcoides burtonii DSM 6242 
MJ02	Methanococcus jannaschii 
MM01	Methanococcus maripaludis S2 
MK02	Methanopyrus kandleri AV19 
MTE01	Methanosaeta thermophila PT 
MA02	Methanosarcina acetivorans str. C2A 
MBA01	Methanosarcina barkeri str. fusaro
MM02	Methanosarcina mazei Goe1 
MST01	Methanosphaera stadtmanae DSM 3091 
MHU01	Methanospirillum hungatei JF-1 
MAP01	Mycobacterium avium subsp. paratuberculosis str. k... 
MB02	Mycobacterium bovis subsp. bovis AF2122/97 
MT03	Mycobacterium tuberculosis CDC1551 
MTH03	Mycobacterium tuberculosis H37Rv 
NEQ01	Nanoarchaeum equitans Kin4-M 
NP01	Natronomonas pharaonis DSM 2160 
NMA02	Neisseria meningitidis serogroup A Z2491 
NM02	Neisseria meningitidis serogroup B MC58 
NE02	Nitrosomonas europaea 
NO02	Nostoc sp. PCC 7120 
OI02	Oceanobacillus iheyensis HTE831 
OYP01	Onion yellows phytoplasma 
PM02	Pasteurella multocida 
PL01	Photorhabdus luminescens subsp. laumondii TTO1 
PT01	Picrophilus torridus DSM 9790 
PI02	Pirellula sp 
PG02	Porphyromonas gingivalis W83 
PMMI02	Prochlorococcus marinus str. MIT 9313 
PMA02	Prochlorococcus marinus subsp marinus CCMP1375 
PMM02	Prochlorococcus marinus subsp. pastoris str. CCMP1... 
PA02	Pseudomonas aeruginosa PAO1 
PS02	Pseudomonas syringae pv. tomato str. DC3000 
PAE02	Pyrobaculum aerophilum 
PRI01	Pyrobaculum islandicum DSM 4184 
PAB02	Pyrococcus abyssi 
PF02	Pyrococcus furiosus DSM 3638 
PH02	Pyrococcus horikoshii 
RS02	Ralstonia solanacearum
RPA01	Rhodopseudomonas palustris CGA009 
RC02	Rickettsia conorii Malish 7 
RP02	Rickettsia prowazekii Madrid E 
SE02	Salmonella enterica subsp. enterica serovar Typhi ... 
STT02	Salmonella enterica subsp. enterica serovar Typhi ... 
STY02	Salmonella typhimurium LT2 
SO02	Shewanella oneidensis MR-1 
SM02	Sinorhizobium meliloti 1021 
SAM03	Staphylococcus aureus MU50  
SA02	Staphylococcus aureus subsp aureus N315 
SEA02	Staphylococcus epidermidis ATCC 12228 
SAG02	Streptococcus agalactiae 2603V/R 
SAN02	Streptococcus agalactiae NEM316 
SMU02	Streptococcus mutans UA159 
SP02	Streptococcus pneumoniae
SPY02	Streptococcus pyogenes 
SAV03	Streptomyces avermitilis MA-4680 
SC02	Streptomyces coelicolor A3(2) 
UC01	Sulfolobus acidocaldarius DSM 639 
SS02	Sulfolobus solfataricus 
ST02	Sulfolobus tokodaii 
SSW02	Synechococcus sp. WH 8102 
SPC02	Synechocystis sp. PCC 6803 
TT02	Thermoanaerobacter tengcongensis strain MB4T 
TK01	Thermococcus kodakarensis KOD1  
THP01	Thermofilum pendens Hrk 5 
TA02	Thermoplasma acidophilum 
TV02	Thermoplasma volcanium 
TE02	Thermosynechococcus elongatus BP-1 
TM02	Thermotoga maritima 
TD01	Treponema denticola ATCC 35405 
TP02	Treponema pallidum 
TW02	Tropheryma whipplei Twist 
VC02	Vibrio cholerae 
VP02	Vibrio parahaemolyticus RIMD 2210633
VV02	Vibrio vulnificus
WB02	Wigglesworthia glossinidia endosymbiont of Glossin... 
WDM01	Wolbachia endosymbiont of Drosophila melanogaster 
XA02	Xanthomonas axonopodis pv. citri str. 306 
XC02	Xanthomonas campestris pv. campestris str. ATCC 33... 
XF02	Xylella fastidiosa
YP02	Yersinia pestis

References


Current version (1.2):

Large-scale prokaryotic gene prediction and comparison to genome annotation.
P. Nielsen and A. Krogh.
Bioinformatics: 21:4322-4329, 2005.

Bioinformatics Centre, Institute of Molecular Biology and Physiology, University of Copenhagen, Universitetsparken 15, 2100 Copenhagen, Denmark

PMID: 16249266

Abstract

MOTIVATION: Prokaryotic genomes are sequenced and annotated at an increasing rate. The methods of annotation vary between sequencing groups. It makes genome comparison difficult and may lead to propagation of errors when questionable assignments are adapted from one genome to another. Genome comparison either on a large or small scale would be facilitated by using a single standard for annotation, which incorporates a transparency of why an open reading frame (ORF) is considered to be a gene. RESULTS: A total of 143 prokaryotic genomes were scored with an updated version of the prokaryotic genefinder EasyGene. Comparison of the GenBank and RefSeq annotations with the EasyGene predictions reveals that in some genomes up to approximately 60% of the genes may have been annotated with a wrong start codon, especially in the GC-rich genomes. The fractional difference between annotated and predicted confirms that too many short genes are annotated in numerous organisms. Furthermore, genes might be missing in the annotation of some of the genomes. We predict 41 of 143 genomes to be over-annotated by >5%, meaning that too many ORFs are annotated as genes. We also predict that 12 of 143 genomes are under-annotated. These results are based on the difference between the number of annotated genes not found by EasyGene and the number of predicted genes that are not annotated in GenBank. We argue that the average performance of our standardized and fully automated method is slightly better than the annotation.



Original EasyGene paper:

EasyGene - a prokaryotic gene finder that ranks ORFs by statistical significance.
Thomas Schou Larsen (1) and Anders Krogh (1,2).
BMC Bioinformatics: 4:21, 2003

(1) Center for Biological Sequence Analysis, BioCentrum, Technical University of Denmark, Building 208, 2800 Lyngby, Denmark
(2) Present address: The Bioinformatics Centre, University of Copenhagen, Universitetsparken 15, 2100 Copenhagen, Denmark

PMID: 12783628

Abstract

BACKGROUND: Contrary to other areas of sequence analysis, a measure of statistical significance of a putative gene has not been devised to help in discriminating real genes from the masses of random Open Reading Frames (ORFs) in prokaryotic genomes. Therefore, many genomes have too many short ORFs annotated as genes. RESULTS: In this paper, we present a new automated gene-finding method, EasyGene, which estimates the statistical significance of a predicted gene. The gene finder is based on a hidden Markov model (HMM) that is automatically estimated for a new genome. Using extensions of similarities in Swiss-Prot, a high quality training set of genes is automatically extracted from the genome and used to estimate the HMM. Putative genes are then scored with the HMM, and based on score and length of an ORF, the statistical significance is calculated. The measure of statistical significance for an ORF is the expected number of ORFs in one megabase of random sequence at the same significance level or better, where the random sequence has the same statistics as the genome in the sense of a third order Markov chain. CONCLUSIONS: The result is a flexible gene finder whose overall performance matches or exceeds other methods. The entire pipeline of computer processing from the raw input of a genome or set of contigs to a list of putative genes with significance is automated, making it easy to apply EasyGene to newly sequenced organisms.

GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0). If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
