NetStart - 2.0

Prediction of Translation Initiation Sites in Eukaryotes

The NetStart 2.0 server predicts canonical translation initiation sites in eukaryotic mRNA sequences.

NetStart 2.0 has been trained on sequences from 60 phylogenetically diverse eukaryotic species. Predictions can be made either for one of these species specifically, or based on the phylum from which the sequences originate. If the species of origin is unknown, predictions can be made without using taxonomical information.

Submission

1. Sequence submission: paste the sequence(s) or upload a local file

Paste a single sequence or several sequences in FASTA format into the field below:

Submit a FASTA-formatted file directly from your local disk:

2. Input origin of sequence

3. Select which predictions should be returned
All
Highest predicted ATG per transcript
All ATGs predicted with a probability above threshold

4. Consider the reverse complement sequences?
No (predicting on mRNA transcripts)
Yes (recommended for predicting on DNA sequences)
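As an illustration of why this option matters for DNA input, a start codon on the opposite strand appears as CAT in the submitted sequence. The sketch below shows one common way to compute a reverse complement; the function name is hypothetical and this is not taken from the NetStart 2.0 code.

```python
# Translation table for DNA bases; N (unknown) stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

# "CAT" on the submitted strand reads as "ATG" on the other strand.
print(reverse_complement("CAT"))
```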

Restrictions
At most 50 sequences and 1,000,000 nucleotides per submission; each sequence not more than 500,000 nucleotides.
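Before submitting, these limits can be checked locally. The following is a minimal sketch, assuming plain FASTA text as input; the function names and messages are hypothetical and not part of the server.

```python
MAX_SEQUENCES = 50        # sequences per submission
MAX_TOTAL_NT = 1_000_000  # nucleotides per submission
MAX_SEQ_NT = 500_000      # nucleotides per sequence

def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def check_limits(records):
    """Return a list of problems; the list is empty if all limits are met."""
    problems = []
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences (limit {MAX_SEQUENCES})")
    total = sum(len(seq) for _, seq in records)
    if total > MAX_TOTAL_NT:
        problems.append(f"{total} nucleotides in total (limit {MAX_TOTAL_NT})")
    for header, seq in records:
        if len(seq) > MAX_SEQ_NT:
            problems.append(f"{header}: {len(seq)} nucleotides (limit {MAX_SEQ_NT})")
    return problems
```

An empty list from check_limits(parse_fasta(text)) suggests the submission is within the stated restrictions.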

Confidentiality
The sequences are kept confidential and will be deleted after processing.

Instructions

1. Specify the input sequences

The sequences intended for processing can be input in the following two ways:

Paste a single sequence (just the nucleotides) or a number of sequences in FASTA format into the upper window of the main server page.
Select a FASTA file on your local disk, either by typing the file name into the lower window or by browsing the disk.

The allowed input alphabet is A, C, G, T, U and N (unknown); all the other letters will be converted to N before processing. T and U are treated as equivalent.
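The alphabet rule above can be approximated locally to preview what the server will see. This is a sketch under assumptions: the server's handling of letter case and whitespace is not documented here, so uppercasing, whitespace removal, and mapping U to T are choices made for this example only.

```python
import re

def normalize_sequence(seq):
    """Approximate the input rule: keep A, C, G, T, N; treat U as T;
    replace every other character with N.

    Assumptions (not confirmed by the server documentation):
    input is case-insensitive and whitespace is ignored.
    """
    seq = re.sub(r"\s+", "", seq).upper()
    seq = seq.replace("U", "T")          # T and U are equivalent
    return re.sub(r"[^ACGTN]", "N", seq)  # anything else becomes N

print(normalize_sequence("augRyn"))
```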

2. Select origin of sequence

In the selection box you can provide the phylogenetic origin of the sequence. This can be any of the 60 species used in training NetStart 2.0. If your input sequence does not originate from any of these species, selecting a phylum will make the predictions use taxonomical information down to phylum level. If the origin of the sequence is not represented by any of the above, Unknown can be selected as origin, in which case the model will not use any taxonomical information when predicting.

Note that NetStart 2.0 has been trained using taxonomical information specific to each of the 60 species, and not on sequences defined only at phylum level or labelled with an unknown origin. We refer to the experiments presented in the paper accompanying this tool for an assessment of this impact.

3. Select output format of predictions

The predictions can be provided in three versions:

All: Predicted probabilities for all ATGs in the input sequence(s) are provided.

Highest predicted ATG per transcript: Only the ATG with the highest predicted probability for being a translation initiation site for each input sequence is provided.

All ATGs predicted with a probability above threshold: All ATGs having a probability of being a translation initiation site above the specified threshold are provided. The default threshold is 0.625, which was found to be optimal (see accompanying paper).
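The three output versions can be thought of as filters over the same list of per-ATG probabilities. The sketch below is illustrative only: the Prediction type and function names are hypothetical, the example probabilities are made up, and the tie-breaking rule (first ATG wins) is an assumption.

```python
from collections import namedtuple

# Hypothetical record type; field names follow the output attributes.
Prediction = namedtuple("Prediction", ["entry_line", "atg_pos", "preds"])

DEFAULT_THRESHOLD = 0.625

def select_all(predictions):
    """'All': every ATG is reported."""
    return list(predictions)

def select_highest(predictions):
    """'Highest predicted ATG per transcript': one ATG per input sequence."""
    best = {}
    for p in predictions:
        current = best.get(p.entry_line)
        if current is None or p.preds > current.preds:
            best[p.entry_line] = p
    return list(best.values())

def select_above_threshold(predictions, threshold=DEFAULT_THRESHOLD):
    """'Above threshold': ATGs with probability strictly above the threshold."""
    return [p for p in predictions if p.preds > threshold]

# Made-up probabilities for two transcripts.
example = [
    Prediction("t1", 10, 0.9),
    Prediction("t1", 40, 0.7),
    Prediction("t2", 5, 0.3),
]
print(select_highest(example))
print(select_above_threshold(example))
```

Note that a transcript such as t2 in this example still yields one ATG under the "highest" option even though nothing in it passes the threshold.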

4. Submit the job

Click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in your browser window.

NOTE: At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

Output format

The output is provided as a CSV file containing the following attributes for each prediction (ATG):
origin specifies the origin of the sequence predicted on (provided by the user).
atg_pos states the position of the ATG predicted upon (position corresponds to the A in the codon).
entry_line specifies the fasta entry line of the specific sequence.
preds provides the predicted probability of the specific ATG being a translation initiation site (in the range [0.0, 1.0]).
stop_codon_position states the position of the first in-frame stop codon relative to the ATG predicted upon (position corresponds to the first position of the stop codon).
peptide_len states the length of the hypothetical peptide.
strand states the strand predicted upon (+ denotes the template strand, - denotes the complement strand).
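The CSV output can be loaded with standard tools for further filtering. The sketch below assumes a comma-delimited file with a header row whose column names match the attributes listed above; the column order and all values in the sample are made up for illustration and are not real server output.

```python
import csv
import io

def read_predictions(handle):
    """Read prediction rows, converting atg_pos and preds to numbers.

    Other columns are left as strings, since their exact formatting
    (for example when no stop codon is found) is not documented here.
    """
    rows = []
    for row in csv.DictReader(handle):
        row["atg_pos"] = int(row["atg_pos"])
        row["preds"] = float(row["preds"])
        rows.append(row)
    return rows

# Illustrative sample only; not actual NetStart 2.0 output.
SAMPLE = (
    "origin,atg_pos,entry_line,preds,stop_codon_position,peptide_len,strand\n"
    "Unknown,12,>seq1,0.91,312,100,+\n"
    "Unknown,57,>seq1,0.08,90,11,+\n"
)

rows = read_predictions(io.StringIO(SAMPLE))
best = max(rows, key=lambda r: r["preds"])
print(best["entry_line"], best["atg_pos"], best["preds"])
```

For a downloaded result file, the same function can be called on an opened file handle, for example open(path, newline="").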

Training and Test Data Sets

Training Set

The training data consists of four CSV files, each corresponding to one of the four data partitions used for training and validation in rotation across the four models. You can download the data as a zip-file here.
Additional details about the dataset can be found in the included file, datasets_description.txt, which comes with the download.
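One plausible reading of "in rotation" is that each of the four models is validated on a different partition and trained on the remaining three. The sketch below illustrates that reading only; the file names are hypothetical, and datasets_description.txt and the accompanying paper remain the authoritative description of the setup.

```python
def rotation_splits(partitions):
    """Return one (training partitions, validation partition) pair per model."""
    splits = []
    for i, held_out in enumerate(partitions):
        train = [p for j, p in enumerate(partitions) if j != i]
        splits.append((train, held_out))
    return splits

# Hypothetical file names for the four training partitions.
partitions = ["partition_1.csv", "partition_2.csv",
              "partition_3.csv", "partition_4.csv"]
for train, validation in rotation_splits(partitions):
    print("train:", train, "validate:", validation)
```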

Test Sets

The test sets are provided as FASTA-formatted files, with one file for each of the 60 species. The homology-partitioned test sets (partition 5), with modifications as described in the accompanying paper, can be downloaded here.
Additional details about the test set can be found in the included file testset_description.txt, which comes with the download.

The genomic test set contains labeled gene sequences of the corresponding TIS-labeled transcript sequences from the homology-partitioned test set and can be downloaded here.

Reference

Nielsen, L.S., Pedersen, A.G., Winther, O. and Nielsen, H. NetStart 2.0: prediction of eukaryotic translation initiation sites using a protein language model. BMC Bioinformatics 26, 216 (2025). https://doi.org/10.1186/s12859-025-06220-2

The article is published in BMC Bioinformatics and can be accessed here.

Abstract

Background: Accurate identification of translation initiation sites is essential for the proper translation of mRNA into functional proteins. In eukaryotes, the choice of the translation initiation site is influenced by multiple factors, including its proximity to the 5' end and the local start codon context. Translation initiation sites mark the transition from non-coding to coding regions. This fact motivates the expectation that the upstream sequence, if translated, would assemble a nonsensical order of amino acids, while the downstream sequence would correspond to the structured beginning of a protein. This distinction suggests potential for predicting translation initiation sites using a protein language model.
Results: We present NetStart 2.0, a deep learning-based model that integrates the ESM-2 protein language model with the local sequence context to predict translation initiation sites across a broad range of eukaryotic species. NetStart 2.0 was trained as a single model across multiple species, and despite the broad phylogenetic diversity represented in the training data, it consistently relied on features marking the transition from non-coding to coding regions.
Conclusion: By leveraging "protein-ness", NetStart 2.0 achieves state-of-the-art performance in predicting translation initiation sites across a diverse range of eukaryotic species. This success underscores the potential of protein language models to bridge transcript- and peptide-level information in complex biological prediction tasks.

NetStart 2.0 can be downloaded to run locally from here.

Software Downloads

Version 1.0c

Linux
IRIX

GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
