ClusteringTool - 0.9

Clustering of peptide data

This tool is currently unavailable. This is due to misuse coupled with an inappropriate setup on our server, which makes it unstable.
When both issues have been addressed, we will put it up again.

ClusteringTool is a server for unsupervised clustering of peptide sequences. The program takes as input a list of peptide sequences (FASTA file) and attempts to cluster them into meaningful groups, using the algorithm described in this paper.
Visit the links on the grey bar below to read instructions and guidelines, see output formats, or download the code.

For large data sets, you are encouraged to download a stand-alone version of the program, with full functionality and no parameter limitations.

Submission

To load some SAMPLE DATA click here:

TRAINING DATASET SUBMISSION

Paste training dataset in the box:

or submit a file directly from your local disk:

TEST DATASET SUBMISSION

Paste test dataset in the box:

or submit a file directly from your local disk:

ANNOTATION SUBMISSION

Paste annotation CSV in the box:

or submit a file directly from your local disk:

SUBMIT job

Optionally set some of the parameters before starting your job, or use the recommended configuration:

Suggest parameters:

BASIC options

Hover the mouse cursor over the symbol for a short description of the options

Job name

Number of clusters

FILTERING options

Number of standard deviations

SOM options

Initial sigma

Initial learning rate

Seed number

Number of iterations

SUBMIT job

Confidentiality:
The sequences are kept confidential and will be deleted after processing.

CITATIONS

GibbsCluster: unsupervised clustering and alignment of peptide sequences
Andreatta M, Alvarez B, Nielsen M
Nucleic Acids Research (2017) doi: 10.1093/nar/gkx248
Full text: PUBMED

Instructions & Guidelines

GibbsCluster takes as input a list of peptide sequences and attempts to cluster them into meaningful groups.
Sequence alignment and clustering are performed simultaneously by sampling the space of possible solutions using a Gibbs sampling strategy. Each cluster is represented by a position-specific scoring matrix (PSSM), and the algorithm aims at maximizing the information content of individual matrices while minimizing the overlap between distinct clusters. The server returns a detailed report on the optimal clustering solutions, including plots of the optimal number of clusters and graphical representations of the identified sequence motifs as sequence logos. For details on the algorithm, refer to this paper.
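As an illustration of what "information content" means for one column of a PSSM, the sketch below computes the Kullback-Leibler distance between the observed amino acid frequencies at one position and a background distribution. The function name and the flat background are assumptions made for this example; the exact scoring used by the server is described in the paper, not here.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def column_information(freqs, background):
    """Kullback-Leibler distance (in bits) of one alignment column.

    freqs and background map amino acids to probabilities.
    Residues with zero observed frequency contribute nothing.
    """
    return sum(
        p * math.log2(p / background[aa])
        for aa, p in freqs.items()
        if p > 0
    )


# Flat background: every amino acid equally likely (an assumption).
flat = {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}

conserved = {"A": 1.0}  # a fully conserved column
print(column_information(conserved, flat))  # equals log2(20)
print(column_information(flat, flat))       # no information beyond background
```

A fully conserved column carries the maximum information for a 20-letter alphabet, while a column matching the background carries none.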

This page introduces the data formats, the parameters available to customize the analysis and some guidelines for the use of version 2.0. Users are welcome to contact the authors for any questions.

1. Specify INPUT sequences

Paste a set of peptides (up to 50 amino acids in length), one sequence per line into the upper left window, or upload a file from your local disk.

The input file can be a plain list of peptides (Sample 1) or an annotated list (Sample 2). The annotation is carried over to the results, and may be useful for correlating a known classification with the clustering produced by the method. All input sequences must be in one-letter amino acid code. The allowed alphabet (not case sensitive) is as follows:

A C D E F G H I K L M N P Q R S T V W Y
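Before submitting, it may help to check a plain peptide list locally against the rules above (one sequence per line, at most 50 amino acids, the 20-letter alphabet, case ignored). The following is a minimal sketch; `validate_peptides` is a hypothetical helper, not part of the server, and it does not handle annotated lists.

```python
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY")
MAX_LENGTH = 50  # upper limit stated in the instructions above


def validate_peptides(lines):
    """Split input lines into (valid, rejected) peptide lists.

    Sequences are upper-cased because the alphabet is not case sensitive.
    Blank lines are skipped.
    """
    valid, rejected = [], []
    for line in lines:
        seq = line.strip().upper()
        if not seq:
            continue
        if len(seq) <= MAX_LENGTH and set(seq) <= ALLOWED:
            valid.append(seq)
        else:
            rejected.append(seq)
    return valid, rejected


valid, rejected = validate_peptides(["acdefghik", "PEPTIDEX", "", "A" * 51])
print(valid)     # sequences that pass both checks
print(rejected)  # sequences with a bad character or too long
```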

2. Set OPTIONS to customize your analysis

The options are divided into three levels: Basic, Filtering and SOM. The only essential parameters to be set are in the Basic options; if you are using the server for the first time you may leave all the other options unchanged.

A brief explanation of each option can be displayed by hovering the mouse over the symbol next to each option in the submission form.

BASIC options

Job name:
This prefix is prepended to all files generated by the current run. If left empty, a system-generated number will be assigned as prefix.

Number of clusters:
You may provide a specific number of clusters (e.g. 3), or an interval of partitions (e.g. 2-5). In the second case, the method will suggest the optimal number of clusters it found in the data, given the parameter configuration of the job. Maximum number of clusters: 15.
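The two accepted forms of this field can be illustrated with a small parser. This is only a sketch of the rule as stated above (a single number or an interval, with at most 15 clusters); `parse_cluster_spec` is a hypothetical name and the server's own validation may differ.

```python
MAX_CLUSTERS = 15  # maximum stated above


def parse_cluster_spec(spec):
    """Return the list of cluster counts described by '3' or '2-5'."""
    parts = [p.strip() for p in spec.split("-")]
    if len(parts) == 1:
        low = high = int(parts[0])
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"cannot parse cluster specification: {spec!r}")
    if low < 1 or high < low or high > MAX_CLUSTERS:
        raise ValueError(f"cluster counts must satisfy 1 <= low <= high <= {MAX_CLUSTERS}")
    return list(range(low, high + 1))


print(parse_cluster_spec("3"))    # a single partition size
print(parse_cluster_spec("2-5"))  # every size in the interval is tried
```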

FILTERING options

Make clustering moves at each iteration:
By default, simple shift moves are performed at each iteration, indel moves every 10 iterations, single peptide moves every 20 iterations, phase shift moves every 100 iterations.
You can alter this behavior by ticking this option; simple shift and phase shift moves become disabled, and single peptide moves are made at each iteration. This set-up is recommended for "nearly-aligned" data, where clustering and indels should be sampled more regularly than extensions at the termini. That is the case, for example, of sets of MHC class I ligands of different length, which would in most cases require central indels to model peptide bulging of long ligands.
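The default schedule and the effect of ticking the option can be sketched as a function that lists the moves attempted at a given iteration. The function and move names are hypothetical. Two points are assumptions not stated above: iterations are counted from 1, and indel moves keep their interval when the option is ticked.

```python
def moves_at(iteration, cluster_moves_every_iteration=False,
             indel_interval=10, single_peptide_interval=20,
             phase_shift_interval=100):
    """Return the names of the moves attempted at this iteration."""
    moves = []
    if cluster_moves_every_iteration:
        # Option ticked: no simple shift or phase shift moves,
        # single peptide moves at every iteration.
        moves.append("single_peptide")
        if iteration % indel_interval == 0:
            moves.append("indel")  # assumption: interval unchanged
        return moves
    moves.append("simple_shift")  # default: every iteration
    if iteration % indel_interval == 0:
        moves.append("indel")
    if iteration % single_peptide_interval == 0:
        moves.append("single_peptide")
    if iteration % phase_shift_interval == 0:
        moves.append("phase_shift")
    return moves


print(moves_at(7))
print(moves_at(100))
print(moves_at(20, cluster_moves_every_iteration=True))
```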

SOM options

Number of iterations per sequence per temperature step:
This parameter ("I") specifies how long your clustering schedule should be. Note that the total number of iterations is "I" multiplied by the number of sequences and by the number of temperature steps, so the execution time increases linearly with it.
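The arithmetic above can be written out directly. The numbers in the usage lines are made up for illustration only; they are not the server defaults.

```python
def total_iterations(iters_per_sequence, n_sequences, n_temperature_steps):
    """Total iterations = I x number of sequences x number of temperature steps."""
    return iters_per_sequence * n_sequences * n_temperature_steps


# Hypothetical job: I = 10, 200 sequences, 20 temperature steps.
print(total_iterations(10, 200, 20))
# Doubling I doubles the total, and therefore roughly the run time.
print(total_iterations(20, 200, 20))
```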

Initial Monte Carlo temperature:
The temperature is a scalar, lowered in discrete steps as the iterations progress. The temperature influences the probability of accepting or rejecting the moves of the algorithm. In the initial iterations (high temperature) the program is free to explore the landscape of solutions, and as the system cools off only moves that increase the energy will be accepted.

Number of temperature steps:
The number of steps in the cooling schedule (starting from the initial temperature specified above).
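To make the role of the temperature concrete, here is a minimal sketch of a cooling schedule and a Metropolis-style acceptance rule. Both are assumptions for illustration: the text above does not state that the schedule is linear, what the final temperature is, or the exact acceptance formula, so `cooling_schedule`, `final_temperature` and `accept_move` are hypothetical.

```python
import math
import random


def cooling_schedule(initial_temperature, n_steps, final_temperature):
    """Temperatures lowered in equal discrete steps (linear cooling assumed)."""
    if n_steps == 1:
        return [initial_temperature]
    step = (initial_temperature - final_temperature) / (n_steps - 1)
    return [initial_temperature - i * step for i in range(n_steps)]


def accept_move(delta_energy, temperature, rng=random):
    """Accept moves that increase the energy; accept others with a
    probability that shrinks as the temperature drops (Metropolis-style)."""
    if delta_energy >= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(delta_energy / temperature)


print(cooling_schedule(1.5, 4, 0.0))
print(accept_move(0.3, 1.0))  # an improving move is always accepted
```

At high temperature a move that lowers the energy still has a fair chance of being accepted, which is what lets the sampler explore; near the end of the schedule such moves are almost always rejected.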

Interval between Indel moves:
Specifies how often to attempt introducing insertions and deletions (see glossary).

Interval between Single peptide moves:
Specifies how often to attempt moving a sequence between clusters (see glossary).

Interval between Phase shift moves:
Specifies how often to attempt shifting the alignment window of a single cluster.

Background amino acid frequencies:
Construction of PSSMs relies on calculating the frequency of a given residue at a given position, compared to the expected background frequency of that amino acid. You may use a flat background model identical for all amino acids (Flat), a pre-calculated distribution reflecting the relative frequency of each residue in naturally occurring proteins (Pre-calculated Uniprot), or determine the background model directly from the dataset you submitted (From data).
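The comparison between observed and background frequencies can be sketched as a log-odds matrix. This is an illustration under stated assumptions, not the server's implementation: the pseudocount of 1, the base-2 logarithm and all function names are choices made for this example, and the Pre-calculated Uniprot option is omitted because its values are not given here.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def flat_background():
    """'Flat' option: the same frequency for every amino acid."""
    return {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}


def background_from_data(sequences):
    """'From data' option: residue frequencies in the submitted peptides.
    One pseudocount per residue avoids zero frequencies (an assumption)."""
    counts = Counter(aa for seq in sequences for aa in seq)
    total = sum(counts[aa] for aa in ALPHABET) + len(ALPHABET)
    return {aa: (counts[aa] + 1) / total for aa in ALPHABET}


def log_odds_pssm(cores, background, pseudocount=1.0):
    """Build an L x A matrix of log2(observed / background) from aligned cores."""
    denom = len(cores) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(len(cores[0])):
        counts = Counter(core[pos] for core in cores)
        row = {aa: math.log2(((counts[aa] + pseudocount) / denom) / background[aa])
               for aa in ALPHABET}
        pssm.append(row)
    return pssm


cores = ["AAA", "AAC"]
pssm = log_odds_pssm(cores, flat_background())
print(pssm[0]["A"] > 0)  # enriched residue scores above zero
print(pssm[0]["W"] < 0)  # unobserved residue scores below zero
```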

Preference for hydrophobic AAs at P1:
In the special case of MHC class II data, we have previously found it helpful to guide the alignment by expressing a preference for hydrophobic residues at position P1 of the alignment.

Sequence weighting type:
Data redundancy may affect the quality of the clustering, so redundant sequences can be downweighted. You may use an explicit clustering of the sequences in a given group (Clustering), or use a faster heuristic that calculates the degree of variability at each column in the alignment (Heuristic, recommended); you may also disable sequence weighting altogether (None).
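One well-known heuristic of this kind is position-based weighting in the style of Henikoff and Henikoff, where each sequence gains weight from columns in which its residue is rare. The sketch below shows that scheme as an example only; whether the Heuristic option uses exactly this formula is an assumption, and `position_based_weights` is a hypothetical name.

```python
from collections import Counter


def position_based_weights(cores):
    """Weight each aligned core by how unusual its residues are per column.

    For every column, a sequence receives 1 / (k * n), where k is the number
    of distinct residues in the column and n is how many sequences share this
    sequence's residue. Weights are normalized to sum to 1.
    """
    weights = [0.0] * len(cores)
    for pos in range(len(cores[0])):
        counts = Counter(core[pos] for core in cores)
        n_distinct = len(counts)
        for i, core in enumerate(cores):
            weights[i] += 1.0 / (n_distinct * counts[core[pos]])
    total = sum(weights)
    return [w / total for w in weights]


# Two identical sequences share the weight that the unique one gets alone.
print(position_based_weights(["AAA", "AAA", "CCC"]))
```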

3. SUBMIT the job

Click on the "Submit query" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.

At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

GLOSSARY

PSSM: Position-specific scoring matrix. A matrix of size L x A, where L is the length of the alignment window and A is the length of the alphabet; it stores the weight (or "preference") of any given amino acid at each position of the alignment.
Indels: Insertions and deletions. An insertion adds one or more gaps in the alignment; a deletion removes one or more amino acids of a given sequence.
Simple shift move: A move affecting the alignment core of a single peptide. The algorithm attempts to shift the alignment core of a peptide to the left or right, and accepts or rejects the move with a probability that depends on the temperature of the system.
Single peptide move: A move that attempts to change the cluster of a single peptide sequence.
Indel move: A move that introduces or removes an insertion or a deletion in a peptide sequence. In GibbsCluster, indels can have any length up to the values specified as parameters, but there can only be one single indel stretch per peptide (i.e. a single sequence cannot contain indels at multiple positions).
Phase shift move: A move affecting the alignment core of all peptide sequences in a cluster. In simple terms, the algorithm will attempt to shift en bloc the alignment window of a given cluster.

Output format

DESCRIPTION

An example of output is found below. The output is composed of the following sections:

Run identifiers and settings

Barplot of KLD vs number of clusters

Table of validation set

Self-organising maps

LOGO

Seq2Logo

Clustering Report

Clustering Solution

Gn: Cluster number
Num: Sequential number of the sequence in the cluster
Sequence: Complete peptide sequence
Core: Portion of the peptide in the alignment window
of: Offset value, i.e. the starting position of the alignment core
IP: Position of the insertion, if any
IL: Length of the insertion, if any
DP: Position of the deletion, if any
DL: Length of the deletion, if any
Annotation: Text annotation to the peptide, if provided at submission
Self: Score of the peptide to its own cluster
bgG: Identifier of the nearest cluster (-1 means no nearest cluster)
bgScore: Score of the peptide to the nearest cluster
cScore: corrected score (S_self - λ x S_nearest)
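The corrected score in the last column follows directly from the formula above. In this sketch the value of λ is passed in by the caller because it is not given in this section, and returning the Self score unchanged when there is no nearest cluster (bgG = -1) is an assumption.

```python
def corrected_score(self_score, nearest_score, lam):
    """cScore = S_self - lambda x S_nearest.

    nearest_score may be None when bgG is -1 (no nearest cluster); in that
    case the Self score is returned unchanged (an assumption).
    """
    if nearest_score is None:
        return self_score
    return self_score - lam * nearest_score


# Hypothetical values for illustration only.
print(corrected_score(10.0, 4.0, 0.5))
print(corrected_score(10.0, None, 0.5))
```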

Remember

DOWNLOAD

EXAMPLE OUTPUT

ClusteringTool Server - Results

Technical University of Denmark

Version: 1.0
Run ID: 27911
Run name: sample_run
Platform: Linux x86_64

Read 200 unique sequences from file

Settings:
No shift moves, cluster moves at every iteration
Number of clusters: 2 - 5
Number of standard deviation: 1.96
Initial sigma: 6
Initial learning rate: 0.5
Seed number: 42
Number of SOM iterations: 5000

[Figures: SOM, SOM_admut, SOM_sil and SOM_with_validation plots]

Article abstract

GibbsCluster: unsupervised clustering and alignment of peptide sequences

Massimo Andreatta, Bruno Alvarez, Morten Nielsen

Nucleic Acids Research, 2017 Apr 12. doi: 10.1093/nar/gkx248

Receptor interactions with short linear peptide fragments (ligands) are at the base of many biological signaling processes. Conserved and information-rich amino acid patterns, commonly called sequence motifs, shape and regulate these interactions. Because of the properties of a receptor-ligand system or of the assay used to interrogate it, experimental data often contain multiple sequence motifs. GibbsCluster is a powerful tool for unsupervised motif discovery because it can simultaneously cluster and align peptide data. The GibbsCluster 2.0 presented here is an improved version incorporating insertion and deletions accounting for variations in motif length in the peptide input. In basic terms, the program takes as input a set of peptide sequences and clusters them into meaningful groups. It returns the optimal number of clusters it identified, together with the sequence alignment and sequence motif characterizing each cluster. Several parameters are available to customize cluster analysis, including adjustable penalties for small clusters and overlapping groups, and a trash cluster to remove outliers. As an example application, we used the server to deconvolute multiple specificities in large-scale peptidome data generated by mass spectrometry. The server is available at http://www.cbs.dtu.dk/services/GibbsCluster-2.0.

Full text

GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.

Correspondence: Technical Support:
