DeepCDS - 1.0

Prediction of coding sequences (CDS) in prokaryotic sequencing reads

The DeepCDS 1.0 server predicts coding sequences, including start and stop codon positions, in short prokaryotic sequences.
Three model versions are available, adapted to different levels of sequencing noise: from error-free sequences to Illumina short reads with substitution, insertion, and deletion errors.

Submission

1. Sequence submission: paste the sequence(s) or upload a local file

Restrictions

At most 50 sequences and 1,000,000 nucleotides per submission; each sequence not more than 500,000 nucleotides.
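The limits above can be checked locally before submission. The following is a minimal sketch, not part of DeepCDS: the function names `read_fasta_lengths` and `check_limits` are hypothetical, and the code assumes that whitespace and header lines do not count towards the nucleotide totals.

```python
# Submission limits as stated in the restrictions above.
MAX_SEQUENCES = 50
MAX_TOTAL_NT = 1_000_000
MAX_SEQ_NT = 500_000


def read_fasta_lengths(text):
    """Return a list of (header, length) pairs from FASTA-formatted text.

    A bare sequence without a header line is reported as "unnamed".
    """
    records = []
    header, length = None, 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, length))
            header, length = line[1:], 0
        else:
            if header is None:
                header = "unnamed"
            length += len(line)
    if header is not None:
        records.append((header, length))
    return records


def check_limits(text):
    """Return a list of human-readable problems; empty if within limits."""
    records = read_fasta_lengths(text)
    problems = []
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences (limit {MAX_SEQUENCES})")
    total = sum(length for _, length in records)
    if total > MAX_TOTAL_NT:
        problems.append(f"{total} nucleotides in total (limit {MAX_TOTAL_NT})")
    for header, length in records:
        if length > MAX_SEQ_NT:
            problems.append(f"{header}: {length} nucleotides (limit {MAX_SEQ_NT})")
    return problems
```

An empty list from `check_limits` means the input satisfies all three limits under the assumptions above.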

Confidentiality

The sequences are kept confidential and will be deleted after processing.

Instructions

1. Specify the input sequences

The sequences intended for processing can be input in the following two ways:

Paste a single sequence (just the nucleotides) or a number of sequences in FASTA format into the upper window of the main server page.

Select a FASTA file on your local disk, either by typing the file name into the lower window or by browsing the disk.

The allowed input alphabet is A, C, G, T, U and N (unknown); all the other letters will be converted to N before processing. T and U are treated as equivalent.
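To preview how a sequence would look after this conversion, the rule can be sketched as follows. This is an illustration only, not the server's code: the function name `normalize_sequence` is hypothetical, and the sketch assumes that input is case-insensitive, that whitespace is ignored, that U is rewritten as T, and that any other character (not only letters) becomes N.

```python
def normalize_sequence(seq):
    """Map a raw nucleotide string onto the alphabet A, C, G, T, N.

    Assumptions (not confirmed by the server documentation):
    lower case is accepted, whitespace is dropped, U is written as T,
    and every character outside A/C/G/T/U/N becomes N.
    """
    out = []
    for ch in seq.upper():
        if ch.isspace():
            continue
        if ch == "U":
            out.append("T")
        elif ch in "ACGTN":
            out.append(ch)
        else:
            out.append("N")
    return "".join(out)
```

For example, IUPAC ambiguity codes such as R or Y are replaced by N under this rule, so heavily ambiguous input loses that information before prediction.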

2. Specify the error model

DeepCDS was trained in three versions adapted to different degrees of sequencing noise. Below is a guide for choosing the appropriate error model:

Error-free data: Use this version if your data consist of complete genomic sequences without sequencing errors. This model was trained on clean sequences and does not account for sequencing noise.

Substitution errors: Use this version if you have error-prone sequences (such as short sequencing reads) but do not require explicit insertion and deletion position predictions. This model was trained on sequences with Illumina-like substitution error profiles and is the most robust version overall to sequencing noise.

Substitution, insertion, and deletion errors: Use this version if you have error-prone sequences with frequent insertion and deletion errors and want explicit insertion and deletion position predictions. This model was trained on sequences with Illumina-like substitution, insertion, and deletion error profiles and explicitly accounts for each of these sequencing error types.

3. Specify the minimum CDS length to be predicted

DeepCDS can predict both complete and fragmented coding sequences (CDS). In the selection box you can specify the minimum length a coding sequence should have in order to be predicted. The default is 60 bp, for which the model has been tested thoroughly.

4. Submit the job

Click on the "Submit" button. The status of your job (either 'queued' or 'active') will be displayed and constantly updated until it terminates and the server output appears in your browser window.

NOTE: At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

Output formats

The output is provided as three files: a .gff file with the CDS annotations (including start codon and stop codon positions), a .fna file with the predicted CDS sequences, and a .faa file with the predicted CDS sequences translated into the corresponding amino acid sequence.

.gff notes

Feature types

CDS: A coding sequence region annotation.

start_codon: Start codon annotation. Please note that the beginning of a CDS annotation in short sequence fragments does not necessarily equal a start codon position, as DeepCDS can predict CDS regions that cover only an internal region, only the start, or only the end of a protein.

stop_codon: Stop codon annotation. Please note that the end of a CDS annotation in short sequence fragments does not necessarily equal a stop codon position, as DeepCDS can predict CDS regions that cover only an internal region, only the start, or only the end of a protein.

insertion: A specific inserted nucleotide position that has been directly identified. This is a special case where the exact insertion site is known, in contrast to the uncertain_region feature which marks ambiguous positions when the exact site cannot be determined. CDS fragments flanking an insertion share a group_id attribute. This feature type is only predicted if the error model selected is "Substitution, insertion, and deletion errors".

uncertain_region: Marks ambiguous nucleotide positions between CDS fragments interrupted by a predicted insertion or deletion error, where the exact indel position cannot be directly determined. CDS fragments flanking an uncertain_region share a group_id attribute. This feature type is only predicted if the error model selected is "Substitution, insertion, and deletion errors".

Attribute information

Attributes are provided as a list of tag-value pairs. Consecutive pairs are separated by semicolons.

ID: Unique ID for annotation.

start: the state that the given feature started in, for example start_codon or internal_region.

end: the state that the given feature ended in, for example stop_codon or internal_region.

group_id: CDS regions interrupted by an insertion or deletion error are split into two or more CDS feature annotations, and share a common group_id attribute in order to connect CDS fragments that belong to the same coding sequence. This attribute is only provided if the error model selected is "Substitution, insertion, and deletion errors".

indel_type: Provided together with group_id. Marks the kind of sequencing error predicted (either insertion or deletion). This attribute is only provided if the error model selected is "Substitution, insertion, and deletion errors".

overlapping_frames: Marks which reading frames the two CDS fragments flanking a type=uncertain_region are placed in. This attribute is only provided if the error model selected is "Substitution, insertion, and deletion errors".

Note: Any additional notes related to the given annotation.
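The attributes described above can be used to reconnect CDS fragments that were split by a predicted insertion or deletion. The sketch below is one possible way to do this, under several assumptions: the file follows the usual nine-column tab-separated GFF layout, tags and values are joined with "=" as in GFF3, and the function names `parse_attributes` and `group_cds_fragments` are hypothetical.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a semicolon-separated attribute column into a dict.

    Assumes GFF3-style "tag=value" pairs.
    """
    attrs = {}
    for pair in column9.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


def group_cds_fragments(gff_text):
    """Collect CDS features that share a group_id attribute.

    Returns {group_id: [(start, end, strand, ID), ...]} in file order.
    CDS features without a group_id are skipped.
    """
    groups = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        group_id = attrs.get("group_id")
        if group_id is not None:
            groups[group_id].append(
                (int(cols[3]), int(cols[4]), cols[6], attrs.get("ID"))
            )
    return dict(groups)
```

Each list in the result holds the fragments of one coding sequence; the insertion or uncertain_region feature lying between them can be looked up in the same file by its coordinates.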

.fna notes

FASTA file containing the predicted CDS sequences. In cases where a deletion error has been predicted, the missing region in the merged CDS sequence is represented as an "NNN" codon.

.faa notes

FASTA file containing the translated CDS sequences (using the standard prokaryotic translation table; NCBI genetic code 11). In cases where a deletion error has been predicted, the missing region in the merged CDS sequence is represented as an "NNN" codon that is translated as "X". Furthermore, all codons with one or more unknown nucleotide positions are translated as "X", and stop codons are denoted as "*".
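The translation rules above can be reproduced with a short sketch, for example to check a .faa entry against its .fna counterpart. This is an illustration and not the DeepCDS implementation: the name `translate` is hypothetical, trailing bases that do not fill a codon are assumed to be dropped, and alternative start codons (such as GTG or TTG) are not recoded to M, since the notes above do not state how they are handled. NCBI genetic code 11 assigns the same amino acid to each codon as the standard code, so the standard 64-codon table is used.

```python
# Codon table in NCBI order: first, second and third base each cycle
# through T, C, A, G. Genetic code 11 shares these codon assignments
# with the standard code; the two differ only in permitted start codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a CDS nucleotide string codon by codon.

    Any codon containing a character other than A, C, G, T (for example
    the "NNN" placeholder for a predicted deletion) becomes "X".
    Stop codons become "*". Incomplete trailing codons are ignored
    (an assumption of this sketch).
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        protein.append(CODON_TABLE.get(cds[i:i + 3], "X"))
    return "".join(protein)
```

Note that a codon such as GCN is reported as "X" here even though all four completions encode alanine, which follows the rule that any codon with an unknown position is translated as "X".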

Data Sets

Training and Validation Sets

The processed training and validation datasets are provided as gzipped CSV files. They can be downloaded for each of the three trained model versions from the command line with wget.

Datasets without sequencing errors

wget -r ftp://ftp.healthtech.dtu.dk/public/deepcds_1.0_server_datasets/train_val/datasets_without_errors

Datasets with substitution sequencing errors

wget -r ftp://ftp.healthtech.dtu.dk/public/deepcds_1.0_server_datasets/train_val/datasets_with_substitution_errors

Datasets with substitution, insertion, and deletion sequencing errors

wget -r ftp://ftp.healthtech.dtu.dk/public/deepcds_1.0_server_datasets/train_val/datasets_with_errors

Test Sets

The test sets contain simulated test sequences in FASTA format and can be downloaded via command line with wget.

wget -r ftp://ftp.healthtech.dtu.dk/public/deepcds_1.0_server_datasets/test

Dataset Information

Additional details about each dataset can be found in the README file included in each archive.

DeepCDS: Ab initio coding sequence prediction in prokaryotic short reads

Nielsen, L.S., Nielsen, H. and Winther, O.

Accurate coding sequence prediction in short prokaryotic metagenomic reads remains challenging due to sequence fragmentation, unknown sequence origins, and sequencing errors. Here we introduce DeepCDS, a deep learning-based ab initio coding sequence predictor trained on short prokaryotic sequences with and without simulated Illumina-like sequencing errors. DeepCDS integrates ESM-2 protein language model embeddings with nucleotide-level information to predict complete and fragmented coding sequence regions. Benchmarking on 215 phylogenetically diverse prokaryotic organisms demonstrates that DeepCDS consistently outperforms current state-of-the-art methods in coding sequence detection, start and stop codon localization, and robustness to different sequencing error profiles, while remaining operational at shorter sequence lengths than existing tools support. These findings demonstrate that protein language models capture distinct signals relevant for nucleotide-level coding sequence detection, especially at very short lengths. Ultimately, DeepCDS may help uncover the functional potential of the vast microbial diversity that remains genomically uncharacterized.

DeepCDS 1.0 can be downloaded to run locally from here.

GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.

Correspondence: Technical Support: