DTU Health Tech

Department of Health Technology


NNAlign - 2.0

Identifying sequence motifs in quantitative peptide data

The NNAlign server allows generating artificial neural network models of receptor-ligand interactions. The program takes as input a set of ligand sequences with target values; it returns a sequence alignment, a binding motif of the interaction, and a model that can be used to scan for occurrences of the motif in other sequences.
Visit the links on the pink bar below to read detailed instructions and guidelines, see output formats, or download the code.

New in version 2.0:

Submission


1. TRAIN or UPLOAD a model

Paste peptides in PEPTIDE format

or submit a file directly from your local disk:

To load some SAMPLE DATA click here:

More sample training data:


2. EVALUATION data (optional)

Paste in evaluation examples in PEPTIDE or FASTA format

or upload evaluation examples:

Sample evaluation data in FASTA or PEPTIDE format

3. SUBMIT job




PRESET parameter configurations


MHC CLASS I ligands of variable length
MHC CLASS II ligands:
DNA/RNA data:


CUSTOMIZE your run

Hover the mouse cursor over the symbol for a short description of the options

BASIC options

Job name

Motif length

DATA PROCESSING options

Order of the data
High values are positive instances
Low values are positive instances

Data rescaling
Linear rescale
Log-transform
No rescale

Average target values of identical sequences

Folds for cross-validation

Perform nested cross-validation

Stop training on best test-set performance

Method to create subsets
Random subsets
Homology clustering
Common-motif clustering
User-defined partitions

Alphabet

NEURAL NETWORK architecture

Number of training cycles

Number of seeds

Number of hidden neurons

Amino acid encoding

Maximum length for Deletions

Maximum length for Insertions

Only allow insertions in sequences shorter than the motif length

Burn-in period

Length of the PFR for composition encoding

Encode PFR composition as sparse

Encode PFR length

Expected peptide length for encoding

Binned peptide length encoding

Load receptor pseudo-sequences

Example

SORTING and VISUALIZATION options

Number of networks (per fold) in the final network ensemble

Sort results by prediction value

Exclude offset correction

Show all logos in the final ensemble

EVALUATION DATA options

Length of peptides generated from FASTA entries

Sort evaluation results by prediction value

Threshold on evaluation set predictions


SUBMIT job



NOTE: depending on the size of your datasets and the selected parameters, the job may take up to a few hours to complete.
Please be patient.

Confidentiality:
The sequences are kept confidential and will be deleted after processing.


CITATIONS

For publication of results, please cite:

  • NNAlign: a platform to construct and evaluate artificial neural network models of receptor-ligand interactions
    Nielsen M, Andreatta M
    Nucleic Acids Research (2017) Apr 12. doi: 10.1093/nar/gkx276
    Pubmed: 28407117

Instructions & Guidelines



NNAlign is a server for the discovery of sequence motifs in quantitative peptide data, i.e. a set of amino acid sequences, each associated with some numerical value. This value (the quantitative measure) defines negative, positive and intermediate examples across a numerical spectrum. It could be, for example, the binding strength of each peptide to a certain molecule, or the signals measured on a peptide array.
For effective training of NNAlign, it is important that the training set includes not only positive instances but also negative (and possibly intermediate) examples. The neural networks can then attempt to correlate the amino acid sequences with their relative quantitative values, and learn what differentiates positives from negatives. The generated model can then be used to produce quantitative predictions on new data.

This page introduces the data formats, the parameters available to customize the analysis and some guidelines for the use of version 2.0. Users are welcome to contact the authors for any questions.


1. Specify TRAINING sequences

Paste a set of peptides, one sequence per line into the upper left window, or upload a file from your local disk. Training data should be in two columns. The first column is composed of peptide sequences, the second column of a target value for each sequence. Look here to see an example of the format.
  • If you provide receptor pseudo-sequences with Load receptor pseudo-sequences, add a third column to the input data with the name of the receptor associated with each training point. Here is an example of the format.
  • To specify your custom partitioning of the data, add the partition number of each datapoint (an integer from 0 to N) in the last column of the training data, and check "User-defined partitions" in the Method to create subsets option.
The program accepts a continuous spectrum of signal intensities associated with the sequences, and by default it assumes that positive examples (e.g. binders) have higher values, whereas negative examples lie in the lower part of the spectrum. NOTE that by default the program assumes that the data is expressed in the standard one-letter 20 amino acid alphabet:
A C D E F G H I K L M N P Q R S T V W Y

plus the X symbol for "unknown" amino acid, treated as a wildcard. If you wish to use a different alphabet (e.g. modified amino acids, DNA/RNA, etc) you must specify the list of symbols with the Alphabet option (see below). In all cases, X is a reserved wildcard symbol with neutral value.
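The training format described above can be checked locally before submission. The following is a minimal sketch, assuming whitespace-separated columns; the function names are hypothetical and are not part of NNAlign.

```python
STANDARD_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")
WILDCARD = "X"  # reserved wildcard symbol, always accepted


def parse_training_line(line, alphabet=STANDARD_ALPHABET):
    """Split one training line into (peptide, target, extra columns).

    Assumption: columns are separated by whitespace. Extra columns
    (receptor name, partition number) are returned untouched.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected at least 2 columns, got: {line!r}")
    peptide, target = fields[0], float(fields[1])
    unknown = set(peptide) - set(alphabet) - {WILDCARD}
    if unknown:
        raise ValueError(f"symbols outside the alphabet in {peptide!r}: {sorted(unknown)}")
    return peptide, target, fields[2:]


def parse_training_data(text, alphabet=STANDARD_ALPHABET):
    """Parse every non-empty line of a pasted training set."""
    return [parse_training_line(line, alphabet)
            for line in text.splitlines() if line.strip()]
```

For a custom alphabet (e.g. DNA), pass `alphabet=set("ACGT")`.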

2. Select EVALUATION examples (Optional)

NNAlign creates an ensemble of neural networks trained to recognize sequence motifs contained in the training data. If you wish to use the neural networks to discover occurrences of the motif on new data, paste in an evaluation set or load it from your disk. Two formats are accepted:

  • A list of peptides, one sequence per line. If values are provided together with the peptides (in a format similar to the training data), the method will calculate statistical measures of correlation between observed and predicted values for the evaluation set. Example
  • A set of amino acid sequences in FASTA format. The sequences are digested into peptides with the length of the motif (plus any flanks) and then run through the neural networks. Example
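The digestion of FASTA entries into peptides can be pictured as a sliding window. This is a sketch only, assuming overlapping windows that advance one residue at a time; the helper names are hypothetical. Joining the source IDs of shared peptides with a slash follows the output description further down this page.

```python
def read_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA text."""
    entries, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                entries.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        entries.append((name, "".join(chunks)))
    return entries


def digest(sequence, length):
    """All overlapping windows of the given length (assumed step of 1)."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def digest_entries(entries, length):
    """Map each unique peptide to the slash-separated IDs it came from."""
    sources = {}
    for name, sequence in entries:
        for peptide in digest(sequence, length):
            ids = sources.setdefault(peptide, [])
            if name not in ids:
                ids.append(name)
    return {peptide: "/".join(ids) for peptide, ids in sources.items()}
```

Sequences shorter than the window length simply yield no peptides.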

3. Set OPTIONS to customize your analysis

BASIC options

Job name:
This prefix is prepended to all files generated by the current run. If left empty, a system-generated number will be assigned as the prefix.

Motif length:
The length of the alignment core can be specified as:

  • Single value. e.g. 7
  • Interval. e.g. 6-8
  • Interval with step size. e.g. 6-10/2
The algorithm will align the sequences for all specified motif lengths, and select the solution that maximizes cross-validated performance in terms of correlation between observed and predicted values.
Note that sequences shorter than the maximum motif length (+ insertions, if enabled) will be removed from the dataset.
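The three accepted notations expand to a list of candidate motif lengths. A small sketch of that expansion, with a hypothetical function name (the server's own parser is not shown on this page):

```python
def parse_motif_lengths(spec):
    """Expand '7', '6-8' or '6-10/2' into a list of motif lengths."""
    spec = spec.strip()
    step = 1
    if "/" in spec:
        spec, step_text = spec.split("/", 1)
        step = int(step_text)
    if "-" in spec:
        low_text, high_text = spec.split("-", 1)
        low, high = int(low_text), int(high_text)
    else:
        low = high = int(spec)
    if step < 1 or low < 1 or high < low:
        raise ValueError(f"invalid motif length specification: {spec!r}")
    return list(range(low, high + 1, step))
```

For example, `parse_motif_lengths("6-10/2")` gives the lengths 6, 8 and 10.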

DATA PROCESSING options

Order of the data:
By default, peptides with high values are positive instances and high prediction scores are used to derive the sequence logo. You can invert this behaviour by selecting "Low values are positive instances".

Data rescaling:
The optimal data distribution for NN training is between 0 and 1, with the bulk of the data in the middle region of the spectrum. With the default option the program linearly rescales the data between 0 and 1, but it is also possible to apply a logarithmic transformation if the data appears squashed towards lower values. If your data is already rescaled between 0 and 1, select the No rescale option.
You can also inspect the data distribution before and after the transformation in the output panel, following the link "View data distribution". Example
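To get a feel for the two transformations before submitting, they can be approximated locally. This is a sketch under stated assumptions: the linear rescale is taken to be a min-max rescale, and the logarithmic variant shown here (log(1 + v - min) followed by min-max) is only one plausible choice, since the exact formula used by the server is not given on this page.

```python
import math


def linear_rescale(values):
    """Min-max rescale to the interval [0, 1] (assumed form of 'Linear rescale')."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; cannot rescale")
    return [(v - low) / (high - low) for v in values]


def log_rescale(values):
    """Hypothetical log-transform: log(1 + v - min), then min-max rescale."""
    low = min(values)
    return linear_rescale([math.log1p(v - low) for v in values])
```

With data spanning several orders of magnitude, such as 0, 9 and 99, the linear rescale pushes the middle value close to 0, while the logarithmic variant places it at the midpoint.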

Average target values of identical sequences:
If there are duplicated sequences, by default they are all used together with their target values. Toggle this option to only use each sequence once with the average of the multiple target values.
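The averaging of duplicates amounts to a group-by on the peptide sequence. A minimal sketch with a hypothetical function name:

```python
def average_duplicates(pairs):
    """Collapse repeated peptides to one entry with the mean target value.

    `pairs` is a list of (peptide, target) tuples; the first-seen order
    of the peptides is preserved.
    """
    totals = {}
    for peptide, value in pairs:
        total, count = totals.get(peptide, (0.0, 0))
        totals[peptide] = (total + value, count + 1)
    return [(peptide, total / count) for peptide, (total, count) in totals.items()]
```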

Folds for cross-validation:
Specify the number of subsets to be created for the estimation of performance on cross-validation. It is also possible to skip cross-validation, ticking the 'NO' button. In this case all data are used for training, and execution will be faster, but it won't be possible to calculate performance measures.

Cross-validation method:
The predictive performance of the method is estimated in cross-validation (CV) on the training set. At each cross-validation step, one of 'n' subsets is left out as an evaluation set, where 'n' is the number of folds, rotating the evaluation set n times. Two CV methods are available:

  • Simple cross-validation: uses n-1 sets to train the ANN and 1 set for evaluation.
  • Nested cross-validation: uses n-2 sets to train the ANN, 1 set for early stopping (see below), and 1 set for evaluation, where n is the number of folds.
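The difference between the two schemes is easiest to see by listing which partitions play which role. In this sketch the function names are hypothetical, and the nested variant assumes that, for each evaluation partition, every remaining partition takes a turn as the early-stopping set; the page does not state how the stopping set is chosen.

```python
def simple_cv(n_folds):
    """Yield (training partitions, evaluation partition)."""
    for eval_fold in range(n_folds):
        train = [f for f in range(n_folds) if f != eval_fold]
        yield train, eval_fold


def nested_cv(n_folds):
    """Yield (training partitions, stopping partition, evaluation partition).

    Assumption: each non-evaluation partition is used once as the
    early-stopping set, giving n * (n - 1) configurations.
    """
    for eval_fold in range(n_folds):
        for stop_fold in range(n_folds):
            if stop_fold == eval_fold:
                continue
            train = [f for f in range(n_folds) if f not in (eval_fold, stop_fold)]
            yield train, stop_fold, eval_fold
```

Under this assumption, 5 folds give 5 training configurations for simple cross-validation and 20 for nested cross-validation, which is one reason the nested setup is slower.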

Stop training on best test-set performance:
If this option is selected, training of the networks will be stopped on the highest CV test-set performance (Early stopping). A completely unbiased evaluation of the performance requires an additional independent test set, by selecting Nested cross-validation. However, for large datasets an accurate and much faster estimate of the predictive performance can be done on the same subsets used for early stopping (Simple cross-validation together with Early stopping).
Leaving the Early stopping option unticked will continue the training until the maximum number of training cycles as specified in the "Number of training cycles" option.

Method to create subsets:
The data can be prepared for cross-validation in four ways:

  • Random subsets: the raw data is simply split randomly into subsets of equal size
  • Homology clustering: a Hobohm 1 algorithm is used to group homologous sequences and limit overlap between subsets. Also specify the maximum identity between sequences in the same subset (e.g. 0.8 means that peptides in the same subsets are no more than 80% identical).
  • Common-motif clustering: two sequences are considered homologous if they share a stretch of at least N identical amino acids, where N is the common motif length specified by the user.
  • User-defined partitions: you may specify your own partitions for cross-validation. Specify the groups as an additional column of the input data, assigning to each data point the partition number from 0 to N.
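As an illustration of the common-motif criterion, the sketch below groups sequences that share an identical stretch of N residues and then distributes whole groups over the subsets. It is not the server's implementation: the transitive grouping (A and C end up together if both share a stretch with B) and the greedy balancing of subset sizes are assumptions, and all names are hypothetical.

```python
def common_motif_clusters(sequences, n):
    """Group sequence indices that share an identical stretch of n symbols."""
    parent = list(range(len(sequences)))

    def find(i):
        # Union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}  # n-mer -> index of the first sequence containing it
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - n + 1):
            kmer = seq[start:start + n]
            if kmer in first_owner:
                root_a, root_b = find(idx), find(first_owner[kmer])
                if root_a != root_b:
                    parent[root_a] = root_b
            else:
                first_owner[kmer] = idx

    clusters = {}
    for idx in range(len(sequences)):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())


def assign_to_folds(clusters, n_folds):
    """Place whole clusters into the currently smallest subset (greedy)."""
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds
```

Keeping each cluster inside a single subset is what limits the overlap between training and evaluation data.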

Remove homologous sequences from training set:
Homologous sequences are by default clustered in the same subset. Check the box to keep only one instance of homologous sequences.

Alphabet:
You may use a custom alphabet (e.g. nucleic acids, or non-standard amino acids). All upper-case letters and the symbols + and @ are permitted. The symbol X is reserved as a wildcard. Note that if you modify the alphabet, all BLOSUM options will be disabled.

NEURAL NETWORK architecture

Number of training cycles:
This option specifies how many times each example in the training set is presented to the neural networks. If training is stopped on the best test-set performance, this value represents the maximum number of training cycles.

Number of seeds:
It is possible to train the model from different initial random network configurations. The ensemble of several neural networks has been shown to perform better than a single network. However, note that the time required to train a model increases linearly with this parameter.

Number of hidden neurons:
A higher number of hidden neurons in the ANNs potentially allows detecting higher order correlations, but increases the number of parameters of the model. Different hidden layer sizes can be specified in a comma separated list (e.g. 3,7,12,20), in which case an ensemble of networks with different architectures is constructed.

Amino acid encoding:
Amino acids must be converted to numbers in order to be presented to the neural networks. Sparse encoding converts an amino acid into a binary vector, whereas Blosum encoding uses the BLOSUM62 substitution scores, accounting for physicochemical similarity between amino acids. With the "Combined" option, networks are trained with both Blosum and Sparse encoding, and the predictions of the two approaches are combined. Note that Blosum encoding is only available if the training data uses the standard one-letter 20 amino acid alphabet (used as default).
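Sparse encoding can be sketched as a one-hot vector per position. The version below is illustrative only: the 0/1 values and the all-zero vector for the wildcard X are assumptions, and the actual input values used by NNAlign may differ.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encode(peptide, alphabet=ALPHABET):
    """Concatenate one binary vector per residue.

    Assumption: the wildcard X is encoded as an all-zero (neutral) vector.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for symbol in peptide:
        vector = [0] * len(alphabet)
        if symbol != "X":
            vector[index[symbol]] = 1
        encoded.extend(vector)
    return encoded
```

A 9-residue core in the 20-letter alphabet therefore occupies 180 input values.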

Maximum length for Deletions:
Allow deletions in the alignment. For a description of how deletions (and insertions) are treated refer to this paper.

Maximum length for Insertions:
Allow insertions in the alignment.

Only allow insertions in sequences shorter than the motif length:
Apply insertions in any sequence, or only on sequences shorter than the motif length.

Burn-in period:
The burn-in is a number of initial iterations during which no deletions or insertions are allowed. As gaps dramatically increase the number of possible solutions, it may be useful to use a burn-in period > 0 in order to limit the search space in the initial training phases.

Length of the PFR for composition encoding:
In some instances, the amino acid composition of the regions surrounding the motif core (peptide flanking region, PFR) can have an influence on the response. See for example in this paper, where the amino acid composition of a PFR of at least two amino acids around the core was shown to influence peptide-MHC binding strength. With this option you can specify the length of the regions flanking the alignment core, which will be encoded as input to NNAlign.

Encode PFR composition as sparse:
By default, the composition of the regions flanking the binding cores is encoded using the Blosum substitution matrix. Turning this option on, the raw frequency of each amino acid in the PFR (sparse alphabet) is used for encoding.
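The sparse variant of the PFR encoding boils down to amino acid frequencies in the residues flanking the core. A sketch, assuming that the left and right flanks are encoded separately and that an empty flank yields all zeros; neither detail is stated on this page, and the function name is hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def pfr_composition(peptide, core_start, core_length, pfr_length, alphabet=ALPHABET):
    """Return (left, right) frequency vectors for the peptide flanking regions.

    core_start is the 0-based position of the first core residue.
    """
    core_end = core_start + core_length
    left = peptide[max(0, core_start - pfr_length):core_start]
    right = peptide[core_end:core_end + pfr_length]

    def frequencies(region):
        if not region:
            return [0.0] * len(alphabet)
        return [region.count(symbol) / len(region) for symbol in alphabet]

    return frequencies(left), frequencies(right)
```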

Encode PFR length:
Encodes the length of the flanks, i.e. the number of amino acids before/after the motif core. This essentially carries information about the position of the core within the peptide: whether it is found at the extremes or in the middle. If this option is set to N > 0, the flank length is truncated to N amino acids; if N = 0, the encoding is unbounded (recommended). Setting this option to -1 disables the encoding.

Encode peptide length:
Assigns input neurons to encode the length of the input sequences if set to > 0. For an optimal encoding, give a rough estimate of the expected optimal peptide length.

Load receptor pseudo-sequences:
If you have different receptors associated with your training examples, specify the receptor names as the third column in the training file. Then, upload here a file with two columns: the receptor names in the first, and the aligned pseudo-sequences in the second. Note that pseudo-sequences must all have the same length and be in the same alphabet as the training sequences (including X).
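The constraints on the pseudo-sequence file (two columns, equal lengths) can be verified before upload. A minimal sketch, assuming whitespace-separated columns; the function name is hypothetical.

```python
def load_pseudo_sequences(text):
    """Parse 'receptor_name pseudo_sequence' lines and check equal lengths."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"expected 2 columns, got: {line!r}")
        name, pseudo_sequence = fields
        table[name] = pseudo_sequence
    lengths = {len(seq) for seq in table.values()}
    if len(lengths) > 1:
        raise ValueError(f"pseudo-sequences differ in length: {sorted(lengths)}")
    return table
```

Every receptor name in the third column of the training data should then appear as a key in the returned table.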

SORTING and VISUALIZATION options

Number of networks (per fold) in the final ensemble:
When training with cross-validation, each neural network's performance is evaluated in terms of Pearson's correlation between target and predicted values. The top N networks (for each cross-validation fold) can be selected using this parameter, and they will constitute the final model.
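The selection step can be illustrated as ranking the networks of one fold by Pearson's correlation and keeping the best N. This sketch uses hypothetical names and plain lists of predictions; it is not the server's code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def select_top_networks(predictions_by_network, targets, top_n):
    """Return the names of the top_n networks, best correlation first."""
    ranked = sorted(
        predictions_by_network,
        key=lambda name: pearson(predictions_by_network[name], targets),
        reverse=True,
    )
    return ranked[:top_n]
```

Note that `pearson` raises a `ZeroDivisionError` if either list has no variance, so constant predictions need to be handled separately.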

Sort results by prediction value:
Predictions can be sorted by the NNAlign predicted value. If left unticked, sequences are presented in their original order.

Exclude offset correction:
Offset correction is a procedure that realigns individual networks to enhance the combined sequence motif (see the section "Improving the LOGO sequence motif representation by an offset correction" in this paper). You can disable offset correction by ticking this box.

Show all logos in the final ensemble:
Displays the sequence motif identified by each neural network in the model.

EVALUATION DATA options

Length of peptides generated from FASTA entries:
Evaluation data submitted in FASTA format will be digested into fragments of the specified length. These peptides will then be submitted to the network ensemble to scan for the presence of the learned sequence motifs.

Sort evaluation results by prediction value:
Predictions on the independent evaluation set can be sorted by the NNAlign predicted value. If left unticked, sequences are presented in their original order.

Threshold on evaluation set predictions:
For large FASTA file submissions, the size of the results file may become very big. Use this parameter to limit the size of evaluation set results, and only show sequences with high predicted values. It should be given as a number between 0 and 1 (set to 0, all results will be displayed).
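The same thresholding and sorting can be applied to a downloaded results file. A sketch, assuming each result is a (peptide, prediction) pair and that predictions equal to the threshold are kept; both the function name and the inclusive comparison are assumptions.

```python
def filter_predictions(rows, threshold=0.0, sort_by_score=False):
    """Keep rows whose prediction reaches the threshold; optionally sort them."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    kept = [row for row in rows if row[1] >= threshold]
    if sort_by_score:
        kept.sort(key=lambda row: row[1], reverse=True)
    return kept
```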

4. Submit the job

Click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.

At any time during the wait you may enter your e-mail address and leave the window. Your job will continue and you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

Loading a trained model

Once your job is completed, you can download the trained model to your computer. You may then upload this model to the NNAlign submission page at any time and use it for new predictions on evaluation sets.
Simply select the option Upload a MODEL and paste your model file in the submission form.

Output format (version 2.0)



DESCRIPTION

An example of output is found below. The results page is composed of the following sections:
  1. Training data & Neural network architecture
    Information about the data used to train the ANNs, including the number of datapoints and the parameters used to train the ANN ensemble. It is also reported whether repeated flanks are found in the data, and whether sequences were removed from the dataset (because they are shorter than the specified motif).
    You can inspect the distribution of the training data before and after rescaling. If the linear rescale produces a distribution that is too skewed towards zero, you might consider running the analysis again using a logarithmic transformation.

  2. Performance measures
    • Predictive performance is estimated in cross-validation on the training set, and given as Root mean square error (RMSE), Pearson and Spearman correlations.
    • For a visual depiction of the correlation between observed and predicted values, inspect the "scatterplot" figure.
    • The "complete alignment core" file reports the prediction for each sequence and the core of the alignment. This file consists of several columns:
      • Core: the predicted binding core for the sequence
      • P1: position of the first residue of the core within the sequence
      • Measure: the target value
      • Prediction: the score predicted by the ensemble
      • Peptide: complete sequence of the training example
      • Gap_pos: starting position of the deletion, if any
      • Gap_lgt: length of the deletion, if any
      • Insert_pos: starting position of the insertion, if any
      • Insert_lgt: length of the insertion, if any
      • Core+Gap: the binding core including inserted or deleted amino acids, if any
      • P1_rel: reliability of the starting position of the core. It gives a confidence measure on the location of the core (reliability scores are described in this paper).
    • The trained "model", i.e. the set of network weights optimized on the training data, can be downloaded to local disk using the corresponding link. The model file can then be uploaded to the server at any time to obtain predictions on new data.


  3. Sequence motif
    A sequence logo representation of the motif. The height of each column, and the relative height of AA letters, represent the information content in bits at each position of the alignment. Logos are generated using the Seq2Logo program.
    The amino acid preferences at each position in the alignment may also be viewed in a Log-odds matrix (or frequency matrix) format, with positive values indicating favored residues and negative values disallowed amino acids.

  4. Evaluation data
    If you provided evaluation data upon submission, you will find the predictions here.
    For evaluation files in peptide format with associated values (i.e. a similar format as for the training data), performance measures will also be available. If the submission was in FASTA format, the source protein sequence ID is also shown here, and in the case of peptides shared by multiple entries, the sequence IDs are listed separated by / (slash).



EXAMPLE OUTPUT


Version: 2.0
Run ID: 12857
Run Name: DRB1_0301.example

Training data

Read 1715 unique sequences
View data distribution
(See Instructions for optimal data distribution)
Pre-processing: Linear rescale

Neural network architecture

Motif length: 9
Flanking region (PFR) size: 3
Number of hidden neurons: 5,15
Peptide length encoding: 13
Flank length encoding: 0
Maximum length of deletions in alignment: 0
Maximum length of insertions in alignment: 0
Amino acid numerical encoding: Blosum
Number of training cycles: 500
Number of NN seeds: 4
Number of networks in final ensemble: 40
Stop training on best test-set performance: Yes
Cross-validation setup: Simple
Folds for cross-validation : 5
Method to create subsets: Random


RESULTS

Performance measures - motif length 9

RMSE = 0.149188
Pearson correlation coefficient = 0.735081
Spearman rank coefficient = 0.731830

View scatterplot of Predicted vs. Observed values
Download complete alignment core on the training data

Save the trained MODEL. You may use this model for a new submission

Sequence motif

Cores realigned with offset correction

Click here if you have problems visualizing this image

Figure: Visualization of the sequence motif using the Seq2Logo program

View a Log-odds matrix or Frequency matrix representation of the motif


Evaluation data

Uploaded 1068 peptides from 6 FASTA entries

See the predictions on the evaluation set




DOWNLOAD
a compressed archive with all results files

Article abstract


NNAlign: a platform to construct and evaluate artificial neural network models of receptor-ligand interactions

Morten Nielsen1,2, Massimo Andreatta1

Nucleic Acids Research, 2017 Apr 12. doi: 10.1093/nar/gkx276

1Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, 1650 San Martín, Argentina
2Department of Bio and Health Informatics, Technical University of Denmark, DK-2800 Lyngby, Denmark

Peptides are extensively used to characterize functional or (linear) structural aspects of receptor-ligand interactions in biological systems e.g. SH2, SH3, PDZ peptide-recognition domains, the MHC membrane receptors and enzymes such as kinases and phosphatases. NNAlign is a method for the identification of such linear motifs in biological sequences. The algorithm aligns the amino acid or nucleotide sequences provided as training set, and generates a model of the sequence motif detected in the data. The webserver allows setting up cross-validation experiments to estimate the performance of the model, as well as evaluations on independent data. Many features of the training sequences can be encoded as input, and the network architecture is highly customizable. The results returned by the server include a graphical representation of the motif identified by the method, performance values and a downloadable model that can be applied to scan protein sequences for occurrence of the motif. While its performance for the characterization of peptide-MHC interactions is widely documented, we extended NNAlign to be applicable to other receptor-ligand systems as well. Version 2.0 supports alignments with insertions and deletions, encoding of receptor pseudo-sequences, and custom alphabets for the training sequences. The server is available at http://www.cbs.dtu.dk/services/NNAlign-2.0

Full text

Software Downloads




GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0). If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.

Correspondence: Technical Support: