DTU Health Tech
Department of Health Technology
If you need help with the bioinformatics programs, see the "Getting Help" section below the program.
1. Sequence submission: paste the sequence(s) or upload a local file
Restrictions
At most 50 sequences and 1,000,000 nucleotides per submission; each sequence not more than 500,000 nucleotides.
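The limits above can be checked locally before submitting. The following is a minimal sketch, not part of the service: the function name `check_submission` and the constant names are hypothetical, and it assumes the sequences have already been read into plain strings with no header lines or whitespace.

```python
# Limits as stated in the Restrictions paragraph.
MAX_SEQUENCES = 50
MAX_TOTAL_NT = 1_000_000
MAX_SEQ_NT = 500_000


def check_submission(sequences):
    """Return a list of problems; an empty list means the limits are met.

    `sequences` is a list of nucleotide strings (hypothetical input format).
    """
    problems = []
    if len(sequences) > MAX_SEQUENCES:
        problems.append(
            f"{len(sequences)} sequences; at most {MAX_SEQUENCES} allowed"
        )
    total = sum(len(seq) for seq in sequences)
    if total > MAX_TOTAL_NT:
        problems.append(
            f"{total} nucleotides in total; at most {MAX_TOTAL_NT} allowed"
        )
    for index, seq in enumerate(sequences, start=1):
        if len(seq) > MAX_SEQ_NT:
            problems.append(
                f"sequence {index} has {len(seq)} nucleotides; "
                f"at most {MAX_SEQ_NT} allowed"
            )
    return problems
```

Calling `check_submission(["ATGGCC", "ATGTAA"])` returns an empty list, since both sequences are well within all three limits.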
Confidentiality
The sequences are kept confidential and will be deleted after processing.
Sequences can be submitted in two ways: pasted directly or uploaded as a local file.
The allowed input alphabet is A, C, G, T, U and N (unknown); all other letters are converted to N before processing. T and U are treated as equivalent.
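To preview how a sequence would look after this conversion, the rule can be sketched as follows. This is an illustration only, not the server's code: the function name `normalize_sequence` is hypothetical, upper-casing the input is an assumption (the page does not say how lower-case letters are handled), and writing U as T is just one way to express that the two are equivalent.

```python
def normalize_sequence(seq):
    """Map a raw sequence onto the alphabet A, C, G, T, N.

    Assumptions (not confirmed by the service): input is upper-cased
    first, and U is rewritten as T to reflect T/U equivalence.
    """
    allowed = set("ACGTN")
    normalized = []
    for letter in seq.upper():
        if letter == "U":
            normalized.append("T")      # T and U are treated as equivalent
        elif letter in allowed:
            normalized.append(letter)   # already in the allowed alphabet
        else:
            normalized.append("N")      # any other letter becomes N
    return "".join(normalized)


print(normalize_sequence("AUGRYT"))  # prints ATGNNT
```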
NOTE: At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.
The output is provided as a CSV file containing the following attributes for each prediction (ATG):
origin specifies the origin of the sequence that the prediction was made on (provided by the user).
atg_pos states the position of the ATG that the prediction was made on (the position corresponds to the A in the codon).
entry_line specifies the FASTA entry line of the specific sequence.
preds provides the predicted probability of the specific ATG being a translation initiation site (in the range [0.0, 1.0]).
stop_codon_position states the position of the first in-frame stop codon relative to the ATG that the prediction was made on (the position corresponds to the first position of the stop codon).
peptide_len states the length of the hypothetical peptide.
strand states the strand that the prediction was made on (+ denotes the template strand, - denotes the complement strand).
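A downloaded result file can be filtered with the Python standard library. The sketch below assumes that the CSV has a header row using exactly the attribute names listed above and that atg_pos is written as an integer; neither is confirmed here. The function name `load_predictions` and the 0.5 cut-off are hypothetical choices for illustration, not recommendations from the authors.

```python
import csv
import io


def load_predictions(csv_text, min_prob=0.5):
    """Return predictions with preds >= min_prob, highest probability first.

    Assumes the header row uses the column names documented above.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row in reader:
        prob = float(row["preds"])
        if prob >= min_prob:
            hits.append(
                {
                    "entry_line": row["entry_line"],
                    "atg_pos": int(row["atg_pos"]),
                    "strand": row["strand"],
                    "preds": prob,
                }
            )
    hits.sort(key=lambda hit: hit["preds"], reverse=True)
    return hits


# Made-up rows for illustration only; not real NetStart output.
example = (
    "origin,atg_pos,entry_line,preds,stop_codon_position,peptide_len,strand\n"
    "user,12,>seq1,0.93,312,100,+\n"
    "user,57,>seq1,0.08,90,11,+\n"
)
for hit in load_predictions(example):
    print(hit["entry_line"], hit["atg_pos"], hit["strand"], hit["preds"])
```

For a file on disk, the text can be obtained with `open(path, newline="").read()` before passing it to the function.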
The article is published in BMC Bioinformatics and can be accessed here.
Background: Accurate identification of translation initiation sites is essential for the proper translation of mRNA into functional proteins. In eukaryotes, the choice of the translation initiation site is influenced by multiple factors, including its proximity to the 5' end and the local start codon context. Translation initiation sites mark the transition from non-coding to coding regions. This fact motivates the expectation that the upstream sequence, if translated, would assemble a nonsensical order of amino acids, while the downstream sequence would correspond to the structured beginning of a protein. This distinction suggests potential for predicting translation initiation sites using a protein language model.
Results: We present NetStart 2.0, a deep learning-based model that integrates the ESM-2 protein language model with the local sequence context to predict translation initiation sites across a broad range of eukaryotic species. NetStart 2.0 was trained as a single model across multiple species, and despite the broad phylogenetic diversity represented in the training data, it consistently relied on features marking the transition from non-coding to coding regions.
Conclusion: By leveraging "protein-ness", NetStart 2.0 achieves state-of-the-art performance in predicting translation initiation sites across a diverse range of eukaryotic species. This success underscores the potential of protein language models to bridge transcript- and peptide-level information in complex biological prediction tasks.
If you need help with technical issues (e.g. errors or missing results), contact Technical Support. Please include the name and version of the service (e.g. NetPhos-4.0) and the options you selected. If the error occurs after the job has started running, please also include the JOB ID (the long code shown while the job is running).
If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
Correspondence:
Technical Support: