EasyGene - 1.2
Gene finding in prokaryotes
The EasyGene 1.2 server produces a list of predicted genes given a sequence of prokaryotic DNA. The current version contains models for 138 different organisms. Each prediction is assigned a significance score (R-value) indicating how likely it is to be just a non-coding open reading frame rather than a real gene. All that is required of you as a user is to submit the query sequence(s) and to select the organism model to use.
The pre-calculated EasyGene 1.2 predictions for the complete genomes of the 138 organisms can be downloaded from the EasyGene site at BINF at the University of Copenhagen.
Submission
Sequence submission: paste the sequence(s) or upload a local file
Restrictions
At most 10,000,000 nucleotides per submission in at most 50 sequences.
Confidentiality
The sequences are kept confidential and will be deleted after processing.
CITATIONS
For publication of results, please cite:
Large-scale prokaryotic gene prediction and comparison
to genome annotation.
P. Nielsen and A. Krogh.
Bioinformatics: 21:4322-4329, 2005.
PMID: 16249266
EasyGene - a prokaryotic gene finder that ranks ORFs by statistical
significance.
Thomas Schou Larsen and Anders Krogh.
BMC Bioinformatics: 4:21, 2003
PMID: 12783628
Instructions
1. Specify the input sequences
All the input sequences must be in one-letter nucleotide code. The allowed alphabet (not case sensitive) is as follows:
All the other symbols will be converted to N before processing. The sequences can be input in the following two ways:
- Paste a single sequence (just the nucleotides) or a number of sequences in FASTA format into the upper window of the main server page.
- Select a FASTA file on your local disk, either by typing the file name into the lower window or by browsing the disk.
Both ways can be employed at the same time: all the specified sequences
will be processed. However, there may be at most 50 sequences and
1,000,000 nucleotides per submission; each sequence not more than
500,000 nucleotides.
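The input rules above can be checked locally before submitting. The following Python sketch is not part of the EasyGene server: the function names are hypothetical, the alphabet A, C, G, T, N is an assumption (the alphabet table is not reproduced in this text), and the limits are taken from the paragraph above and exposed as parameters so they can be adjusted.

```python
# Assumption: the allowed alphabet is A, C, G, T and N (not stated in this text).
ALLOWED = set("ACGTN")


def parse_fasta(text):
    """Return (name, sequence) pairs; a paste without a header is one unnamed sequence."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None or chunks:
                records.append((name or "unnamed", "".join(chunks)))
            name, chunks = (line[1:].split() or ["unnamed"])[0], []
        else:
            # Drop any whitespace inside the sequence line.
            chunks.append("".join(line.split()))
    if name is not None or chunks:
        records.append((name or "unnamed", "".join(chunks)))
    return records


def normalize(seq):
    """Upper-case the sequence and replace every other symbol with N."""
    return "".join(ch if ch in ALLOWED else "N" for ch in seq.upper())


def check_submission(text, max_seqs=50, max_total=1_000_000, max_per_seq=500_000):
    """Return the normalized records and a list of limit violations (empty if none)."""
    records = [(name, normalize(seq)) for name, seq in parse_fasta(text)]
    problems = []
    if len(records) > max_seqs:
        problems.append(f"{len(records)} sequences, limit is {max_seqs}")
    total = sum(len(seq) for _, seq in records)
    if total > max_total:
        problems.append(f"{total} nucleotides in total, limit is {max_total}")
    for name, seq in records:
        if len(seq) > max_per_seq:
            problems.append(f"{name}: {len(seq)} nucleotides, limit is {max_per_seq}")
    return records, problems
```

Calling `check_submission(open("genome.fasta").read())` on a local file (the file name is only an example) returns the sequences as they would look after conversion to N, together with any limit violations.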
2. Customize your run
- Model organism: Select the model organism most closely related to the source organism of your data. The pre-calculated EasyGene 1.2 predictions for the complete genomes of the 138 organisms on the current model list can be downloaded from the EasyGene site at BINF at the University of Copenhagen.
- R-value cutoff: Each prediction is attributed with a significance score (R-value) indicating how likely it is to be just a non-coding open reading frame rather than a real gene. By default the server only shows gene predictions with R-values below R=2. If more (and less certain!) predictions are wanted, one may enter a higher R-value in the specified field. For small genes especially, it may be a good idea to consider R-values up to around R=60.
- Suboptimal gene starts: By default the server only shows the highest scoring start codons. If alternative start codons are wanted, check the corresponding button.
3. Submit the job
Click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.
At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.
Output format
DESCRIPTION
- seqname: input sequence name;
- model: organism model code (also in plain text in the table head);
- feature: predicted feature, 'CDS' or 'CDSsub' (alternative translation start);
- start and end: positions in the sequence;
- score: R-value, indicating how likely the fragment is to be just a non-coding open reading frame rather than a real gene;
- strand: '+' or '-';
- startc: predicted start codon;
- odds: log odds score.
The example below shows the EasyGene 1.2 output for the sequence taken from the GenBank entry AB010576, containing the Bacillus subtilis ComX, ComQ and DegQ genes. All three genes are predicted as annotated in the database (shown in green) with high confidence, although an alternative translation start is preferred for comQ (shown in orange). Two additional genes not annotated in the GenBank entry are also predicted.
EXAMPLE OUTPUT
##gff-version 2
##source-version easygene-1.2b
##date 2007-08-15
##Type DNA
# model: BS03 Bacillus subtilis
# seqname  model  feature  start  end   score        +/-  ?  startc  odds
# ---------------------------------------------------------------------------------------------
AB010576   BS03   CDS      67     324   0.0271875    +    0  #ATG    20.1861
AB010576   BS03   CDSsub   55     324   0.031955     +    0  #ATG    20.1731
AB010576   BS03   CDS      1129   1269  0.0190622    +    0  #ATG    15.7102
AB010576   BS03   CDS      1370   2314  2.13273e-12  +    0  #ATG    74.7815
AB010576   BS03   CDSsub   1454   2314  1.92405e-12  +    0  #ATG    74.6356
AB010576   BS03   CDS      2327   2491  0.0167943    +    0  #ATG    17.2951
AB010576   BS03   CDS      300    668   1.43511      -    0  #ATG    10.6215
# ---------------------------------------------------------------------------------------------
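For downstream processing, the output can be read into records and filtered by R-value. The Python sketch below is an illustration only, not an official client: it assumes the output is saved with one record per line and ten whitespace-separated columns as in the example, and the names `Prediction`, `parse_easygene` and `select` are hypothetical. The column headed "?" is kept out of the record because its meaning is not described here.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    seqname: str
    model: str
    feature: str   # 'CDS' or 'CDSsub'
    start: int
    end: int
    r_value: float  # the 'score' column
    strand: str
    startc: str
    odds: float


def parse_easygene(text):
    """Parse EasyGene-style output, skipping blank lines and '#' comment lines."""
    predictions = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        seqname, model, feature, start, end, score, strand, _unlabelled, startc, odds = fields
        predictions.append(Prediction(
            seqname, model, feature, int(start), int(end),
            float(score), strand, startc.lstrip("#"), float(odds),
        ))
    return predictions


def select(predictions, r_cutoff=2.0, include_suboptimal=False):
    """Keep predictions with R-value below the cutoff; optionally keep 'CDSsub' rows."""
    return [
        p for p in predictions
        if p.r_value < r_cutoff and (include_suboptimal or p.feature == "CDS")
    ]


# Four records copied from the example output above.
SAMPLE = """\
# seqname model feature start end score +/- ? startc odds
AB010576 BS03 CDS 67 324 0.0271875 + 0 #ATG 20.1861
AB010576 BS03 CDSsub 55 324 0.031955 + 0 #ATG 20.1731
AB010576 BS03 CDS 1370 2314 2.13273e-12 + 0 #ATG 74.7815
AB010576 BS03 CDS 300 668 1.43511 - 0 #ATG 10.6215
"""

if __name__ == "__main__":
    for p in select(parse_easygene(SAMPLE), r_cutoff=0.03):
        print(p.seqname, p.start, p.end, p.strand, p.r_value)
```

The default `r_cutoff=2.0` mirrors the server default described under "R-value cutoff"; filtering locally only narrows what the server already returned, so a higher cutoff must be requested at submission time.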
Organism models (code, organism):
AP02 Aeropyrum pernix
ATW03 Agrobacterium tumefaciens str. C58
AA02 Aquifex aeolicus
AF02 Archaeoglobus fulgidus DSM 4304
BAA03 Bacillus anthracis str. Ames
BCE03 Bacillus cereus ATCC 10987
BH03 Bacillus halodurans
BPS01 Burkholderia pseudomallei K96243
BS03 Bacillus subtilis
BT02 Bacteroides thetaiotaomicron VPI-5482
BBA01 Bdellovibrio bacteriovorus
BL03 Bifidobacterium longum NCC2705
BBR02 Bordetella bronchiseptica
BPA02 Bordetella parapertussis
BPE02 Bordetella pertussis
BJ02 Bradyrhizobium japonicum
BM02 Brucella melitensis
BSU03 Brucella suis 1330
BAS02 Buchnera aphidicola
CJ02 Campylobacter jejuni
CF02 Candidatus Blochmannia floridanus
CC02 Caulobacter crescentus CB15
CM02 Chlamydia muridarum
CPN03 Chlamydia pneumoniae AR39
CT02 Chlamydia trachomatis
CCA02 Chlamydophila caviae GPIC
CTE02 Chlorobium tepidum TLS
CV02 Chromobacterium violaceum ATCC 12472
CA02 Clostridium acetobutylicum ATCC824
CP02 Clostridium perfringens
CTEE02 Clostridium tetani E88
CDI01 Corynebacterium diphtheriae
CEF01 Corynebacterium efficiens YS-314
CG03 Corynebacterium glutamicum ATCC 13032
CB02 Coxiella burnetii RSA 493
DR02 Deinococcus radiodurans
EF02 Enterococcus faecalis V583
ECC02 Escherichia coli CFT073
EC03 Escherichia coli K12
ECE03 Escherichia coli O157:H7 EDL933
ECO02 Escherichia coli O157:H7
FN02 Fusobacterium nucleatum subsp. nucleatum ATCC 2558...
GS01 Geobacter sulfurreducens PCA
GV01 Gloeobacter violaceus
HD02 Haemophilus ducreyi 35000HP
HI02 Haemophilus influenzae Rd
HM01 Haloarcula marismortui ATCC 43049
HS02 Halobacterium sp. NRC-1
HW01 Haloquadratum walsbyi DSM 16790
HP02 Helicobacter pylori 26695
HPJ02 Helicobacter pylori str. J99
LJ01 Lactobacillus johnsonii NCC 533
LP02 Lactobacillus plantarum WCFS1
LL02 Lactococcus lactis subsp. lactis
LIN02 Leptospira interrogans serovar lai str. 56601
LI02 Listeria innocua Clip11262
LM02 Listeria monocytogenes EGD
MLO03 Mesorhizobium loti
MET02 Methanobacterium thermoautotrophicum str. Delta H
MBU01 Methanococcoides burtonii DSM 6242
MJ02 Methanococcus jannaschii
MM01 Methanococcus maripaludis S2
MK02 Methanopyrus kandleri AV19
MTE01 Methanosaeta thermophila PT
MA02 Methanosarcina acetivorans str. C2A
MBA01 Methanosarcina barkeri str. fusaro
MM02 Methanosarcina mazei Goe1
MST01 Methanosphaera stadtmanae DSM 3091
MHU01 Methanospirillum hungatei JF-1
MAP01 Mycobacterium avium subsp. paratuberculosis str. k...
MB02 Mycobacterium bovis subsp. bovis AF2122/97
MT03 Mycobacterium tuberculosis CDC1551
MTH03 Mycobacterium tuberculosis H23Rv
NEQ01 Nanoarchaeum equitans Kin4-M
NP01 Natronomonas pharaonis DSM 2160
NMA02 Neisseria meningitidis serogroup A Z2491
NM02 Neisseria meningitidis serogroup B MC58
NE02 Nitrosomonas europaea
NO02 Nostoc sp. PCC 7120
OI02 Oceanobacillus iheyensis HTE831
OYP01 Onion yellows phytoplasma
PM02 Pasteurella multocida
PL01 Photorhabdus luminescens subsp. laumondii TTO1
PT01 Picrophilus torridus DSM 9790
PI02 Pirellula sp
PG02 Porphyromonas gingivalis W83
PMMI02 Prochlorococcus marinus str. MIT 9313
PMA02 Prochlorococcus marinus subsp marinus CCMP1375
PMM02 Prochlorococcus marinus subsp. pastoris str. CCMP1...
PA02 Pseudomonas aeruginosa PA01
PS02 Pseudomonas syringae pv. tomato str. DC3000
PAE02 Pyrobaculum aerophilum
PRI01 Pyrobaculum islandicum DSM 4184
PAB02 Pyrococcus abyssi
PF02 Pyrococcus furiosus DSM 3638
PH02 Pyrococcus horikoshii
RS02 Ralstonia solanacearum
RPA01 Rhodopseudomonas palustris CGA009
RC02 Rickettsia conorii Malish 7
RP02 Rickettsia prowazekii Madrid E
SE02 Salmonella enterica subsp. enterica serovar Typhi ...
STT02 Salmonella enterica subsp. enterica serovar Typhi ...
STY02 Salmonella typhimurium LT2
SO02 Shewanella oneidensis MR-1
SM02 Sinorhizobium meliloti 1021
SAM03 Staphylococcus aureus MU50
SA02 Staphylococcus aureus subsp aureus N315
SEA02 Staphylococcus epidermidis ATCC 12228
SAG02 Streptococcus agalactiae 2603V/R
SAN02 Streptococcus agalactiae NEM316
SMU02 Streptococcus mutans UA159
SP02 Streptococcus pneumoniae
SPY02 Streptococcus pyogenes
SAV03 Streptomyces avermitilis MA-4680
SC02 Streptomyces coelicolor A3(2)
UC01 Sulfolobus acidocaldarius DSM 639
SS02 Sulfolobus solfataricus
ST02 Sulfolobus tokodaii
SSW02 Synechococcus sp. WH 8102
SPC02 Synechocystis sp. PCC 6803
TT02 Thermoanaerobacter tengcongensis strain MB4T
TK01 Thermococcus kodakarensis KOD1
THP01 Thermofilum pendens Hrk 5
TA02 Thermoplasma acidophilum
TV02 Thermoplasma volcanium
TE02 Thermosynechococcus elongatus BP-1
TM02 Thermotoga maritima
TD01 Treponema denticola ATCC 35405
TP02 Treponema pallidum
TW02 Tropheryma whipplei Twist
VC02 Vibrio cholerae
VP02 Vibrio parahaemolyticus RIMD 2210633
VV02 Vibrio vulnificus
WB02 Wigglesworthia glossinidia endosymbiont of Glossin...
WDM01 Wolbachia endosymbiont of Drosophila melanogaster
XA02 Xanthomonas axonopodis pv. citri str. 306
XC02 Xanthomonas campestris pv. campestris str. ATCC 33...
XF02 Xylella fastidiosa
YP02 Yersinia pestis
References
Current version (1.2):
Large-scale prokaryotic gene prediction and comparison to genome annotation.
P. Nielsen and A. Krogh.
Bioinformatics: 21:4322-4329, 2005.
Bioinformatics Centre, Institute of Molecular Biology and Physiology, University of Copenhagen, Universitetsparken 15, 2100 Copenhagen, Denmark
PMID: 16249266
Abstract
MOTIVATION: Prokaryotic genomes are sequenced and annotated at an increasing rate. The methods of annotation vary between sequencing groups. It makes genome comparison difficult and may lead to propagation of errors when questionable assignments are adapted from one genome to another. Genome comparison either on a large or small scale would be facilitated by using a single standard for annotation, which incorporates a transparency of why an open reading frame (ORF) is considered to be a gene. RESULTS: A total of 143 prokaryotic genomes were scored with an updated version of the prokaryotic genefinder EasyGene. Comparison of the GenBank and RefSeq annotations with the EasyGene predictions reveals that in some genomes up to approximately 60% of the genes may have been annotated with a wrong start codon, especially in the GC-rich genomes. The fractional difference between annotated and predicted confirms that too many short genes are annotated in numerous organisms. Furthermore, genes might be missing in the annotation of some of the genomes. We predict 41 of 143 genomes to be over-annotated by >5%, meaning that too many ORFs are annotated as genes. We also predict that 12 of 143 genomes are under-annotated. These results are based on the difference between the number of annotated genes not found by EasyGene and the number of predicted genes that are not annotated in GenBank. We argue that the average performance of our standardized and fully automated method is slightly better than the annotation.
Original EasyGene paper:
EasyGene - a prokaryotic gene finder that ranks ORFs by statistical significance.
Thomas Schou Larsen and Anders Krogh.
BMC Bioinformatics: 4:21, 2003
1 Center for Biological Sequence Analysis, BioCentrum, Technical University of Denmark, Building 208, 2800 Lyngby, Denmark
2 Present address: The Bioinformatics Centre, University of Copenhagen, Universitetsparken 15, 2100 Copenhagen, Denmark
PMID: 12783628
Abstract
BACKGROUND: Contrary to other areas of sequence analysis, a measure of statistical significance of a putative gene has not been devised to help in discriminating real genes from the masses of random Open Reading Frames (ORFs) in prokaryotic genomes. Therefore, many genomes have too many short ORFs annotated as genes. RESULTS: In this paper, we present a new automated gene-finding method, EasyGene, which estimates the statistical significance of a predicted gene. The gene finder is based on a hidden Markov model (HMM) that is automatically estimated for a new genome. Using extensions of similarities in Swiss-Prot, a high quality training set of genes is automatically extracted from the genome and used to estimate the HMM. Putative genes are then scored with the HMM, and based on score and length of an ORF, the statistical significance is calculated. The measure of statistical significance for an ORF is the expected number of ORFs in one megabase of random sequence at the same significance level or better, where the random sequence has the same statistics as the genome in the sense of a third order Markov chain. CONCLUSIONS: The result is a flexible gene finder whose overall performance matches or exceeds other methods. The entire pipeline of computer processing from the raw input of a genome or set of contigs to a list of putative genes with significance is automated, making it easy to apply EasyGene to newly sequenced organisms.
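The null model mentioned in the abstract, a random sequence with the same third-order Markov statistics as the genome, can be illustrated with a short Python sketch. This is not EasyGene's implementation: the function names are hypothetical, and falling back to a uniform choice for contexts never seen in the genome is an assumption made here for simplicity.

```python
import random
from collections import Counter, defaultdict


def fit_markov3(genome):
    """Count which nucleotide follows each 3-nucleotide context in the genome."""
    counts = defaultdict(Counter)
    for i in range(len(genome) - 3):
        counts[genome[i:i + 3]][genome[i + 3]] += 1
    return counts


def sample_markov3(counts, length, seed_context, rng=None):
    """Generate a random sequence whose next symbol depends on the previous three."""
    rng = rng or random.Random()
    seq = list(seed_context)
    while len(seq) < length:
        context = "".join(seq[-3:])
        followers = counts.get(context)
        if followers:
            bases, weights = zip(*followers.items())
            seq.append(rng.choices(bases, weights=weights)[0])
        else:
            # Assumption: unseen context, choose uniformly among the four bases.
            seq.append(rng.choice("ACGT"))
    return "".join(seq[:length])
```

ORFs found in sequences sampled this way are non-coding by construction, which is what makes such a sequence usable as a background for counting how many ORFs reach a given score by chance.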