Instructions & Guidelines

GibbsCluster takes as input a list of peptide sequences and attempts to cluster them into meaningful groups.
Sequence alignment and clustering are performed simultaneously by sampling the space of possible solutions using a Gibbs sampling strategy. Each cluster is represented by a position-specific scoring matrix (PSSM), and the algorithm aims to maximize the information content of the individual matrices while minimizing the overlap between distinct clusters. The server returns a detailed report on the optimal clustering solutions, including plots of the optimal number of clusters and graphical representations of the identified sequence motifs as sequence logos. For details on the algorithm, refer to this paper.

This page introduces the data formats, the parameters available to customize the analysis and some guidelines for the use of version 2.0. Users are welcome to contact the authors for any questions.

1. Specify INPUT sequences

Paste a set of peptides (up to 50 amino acids in length), one sequence per line, into the upper left window, or upload a file from your local disk.
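As a minimal sketch, a plain peptide list could be checked locally before submission. The function name read_peptides is hypothetical and not part of the server; it only enforces the one-sequence-per-line layout and the 50-residue limit stated above, and it does not handle annotated lists or check the alphabet.

```python
MAX_PEPTIDE_LENGTH = 50  # upper limit stated in the instructions


def read_peptides(text, max_length=MAX_PEPTIDE_LENGTH):
    """Split pasted text into peptides (one per line) and check their lengths.

    Hypothetical helper for a plain list only; blank lines are skipped.
    """
    peptides = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        peptide = line.strip()
        if not peptide:
            continue  # ignore empty lines
        if len(peptide) > max_length:
            raise ValueError(
                f"line {line_number}: peptide is longer than {max_length} residues"
            )
        peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    print(read_peptides("ACDEFGHIK\nLMNPQRSTV\n"))
```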

The input file can be a plain list of peptides (Sample 1) or an annotated list (Sample 2). The annotation is carried over to the results, and may be useful for correlating a known classification with the clustering produced by the method. All input sequences must be in one-letter amino acid code. The allowed alphabet (not case sensitive) is as follows:


2. Set OPTIONS to customize your analysis

The options are divided into three levels: Basic, Advanced and Very advanced. The only essential parameters to be set are in the Basic options; if you are using the server for the first time you may leave the advanced options unchanged.

A brief explanation of each option is displayed when you hover the mouse over the symbol next to that option in the submission form.

BASIC options

Job name:
This prefix is prepended to the names of all files generated by the current run. If left empty, a system-generated number will be used as the prefix.

Number of clusters:
You may provide a specific number of clusters (e.g. 3), or a range of cluster numbers (e.g. 1-8). In the second case, the method will suggest the optimal number of clusters it found in the data, given the parameter configuration of the job. Maximum number of clusters: 15.
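The two accepted forms of this field can be illustrated with a small parser. This is only a sketch: parse_cluster_spec is a hypothetical name, and the server's own validation may differ; the limit of 15 is the one stated above.

```python
MAX_CLUSTERS = 15  # maximum stated in the instructions


def parse_cluster_spec(spec, max_clusters=MAX_CLUSTERS):
    """Turn '3' or '1-8' into the list of cluster numbers to be tested."""
    parts = spec.strip().split("-")
    if len(parts) == 1:
        low = high = int(parts[0])
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"cannot parse cluster specification: {spec!r}")
    if not 1 <= low <= high <= max_clusters:
        raise ValueError(f"cluster numbers must lie between 1 and {max_clusters}")
    return list(range(low, high + 1))


if __name__ == "__main__":
    print(parse_cluster_spec("3"))
    print(parse_cluster_spec("1-8"))
```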

Motif length:
The algorithm will attempt to align all sequences to common windows of N amino acids, and construct its PSSMs on these alignments. Use this option to specify the length N of the alignment window. Minimum motif length: 2.
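To make the notion of an alignment window concrete, the sketch below lists every window of N residues that a single peptide offers when no insertions or deletions are allowed. The function name motif_windows is hypothetical; the server chooses among such windows by sampling rather than by enumerating them.

```python
def motif_windows(peptide, motif_length):
    """Return all contiguous windows of motif_length residues in a peptide.

    Illustration only: gaps (insertions and deletions) are not considered.
    A peptide shorter than the motif length yields an empty list.
    """
    if motif_length < 2:
        raise ValueError("the minimum motif length is 2")
    last_start = len(peptide) - motif_length
    return [peptide[i:i + motif_length] for i in range(last_start + 1)]


if __name__ == "__main__":
    for window in motif_windows("ACDEFGHIK", 7):
        print(window)
```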

ADVANCED options

Make clustering moves at each iteration:
By default, simple shift moves are performed at each iteration, indel moves every 10 iterations, single peptide moves every 20 iterations, phase shift moves every 100 iterations.
You can alter this behavior by ticking this option: simple shift and phase shift moves are then disabled, and single peptide moves are made at each iteration. This set-up is recommended for "nearly-aligned" data, where clustering and indels should be sampled more regularly than extensions at the termini. This is the case, for example, for sets of MHC class I ligands of different lengths, which in most cases require central indels to model the peptide bulging of long ligands.
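The move schedule described above can be sketched as a function that lists the moves attempted at a given iteration. Several details are assumptions made for illustration: iterations are counted from 1, "every N iterations" is read as "at multiples of N", and indel moves keep their own interval when the option is ticked. The function and move names are hypothetical.

```python
def moves_at(iteration, clustering_every_iteration=False,
             indel_interval=10, single_peptide_interval=20,
             phase_shift_interval=100):
    """List the move types attempted at one iteration (counted from 1)."""
    moves = []
    if clustering_every_iteration:
        # Option ticked: simple shift and phase shift moves are disabled,
        # single peptide moves are made at every iteration.
        moves.append("single_peptide")
    else:
        moves.append("simple_shift")
        if iteration % single_peptide_interval == 0:
            moves.append("single_peptide")
        if iteration % phase_shift_interval == 0:
            moves.append("phase_shift")
    # Assumption: indel moves follow their interval in both modes.
    if iteration % indel_interval == 0:
        moves.append("indel")
    return moves


if __name__ == "__main__":
    for i in (1, 10, 20, 100):
        print(i, moves_at(i), moves_at(i, clustering_every_iteration=True))
```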

Max deletion length:
The maximum length of consecutive deletions in a peptide sequence.

Max insertion length:
The maximum length of consecutive insertions in a peptide sequence.

Number of seeds for initial conditions:
Gibbs sampling is a heuristic rather than a rigorous optimization procedure. Therefore, it cannot guarantee that the optimal solution is reached from every starting configuration. A common procedure to boost performance is to repeat the sampling from a number of random initial configurations and select the solution that scores best in terms of the fitness function that governs the system. Use this parameter to specify the number of initial configurations.
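The multi-seed strategy can be sketched generically as follows. The function toy_run is a stand-in for a full sampling run and is not the GibbsCluster algorithm; best_of_seeds is a hypothetical helper that keeps the highest-scoring result.

```python
import random


def toy_run(rng):
    """Stand-in for one sampling run: returns a 'solution' and its score."""
    solution = rng.random()
    score = -abs(solution - 0.5)  # toy fitness, higher is better
    return solution, score


def best_of_seeds(run_once, seeds):
    """Repeat a stochastic run from several seeds and keep the best result."""
    best = None
    for seed in seeds:
        rng = random.Random(seed)  # independent, reproducible start
        solution, score = run_once(rng)
        if best is None or score > best[2]:
            best = (seed, solution, score)
    return best


if __name__ == "__main__":
    seed, solution, score = best_of_seeds(toy_run, range(5))
    print(f"best seed: {seed}, score: {score:.4f}")
```

More seeds raise the chance of finding a good solution, at the cost of a proportionally longer run.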

Penalty factor for inter-cluster similarity (λ):
This parameter modulates how similar the clusters are allowed to be. If you believe your data contains multiple specificities with well-defined motifs, λ can be relatively high; on the other hand, if your aim is to detect subtle differences in mostly homogeneous data, λ should be set to a lower value.

Weight on small clusters (σ):
This parameter can be used to specify how small clusters are allowed to be. With low values of σ the method will tend to produce small specialized clusters, while larger σ will return larger and more general clusters.

Use trash cluster to remove outliers:
The trash cluster collects the peptides that do not appear to match any of the motifs being identified. The trash cluster behaves like any other cluster, except that the sequences it contains do not contribute to the overall score of the system.

Threshold for discarding to trash:
This parameter specifies a baseline on the peptide scores, below which peptides are tossed into the trash cluster. If you believe your data contains some degree of noise, you may experiment with increasing this value and observe how many sequences become filtered out by the trash cluster.
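The effect of the threshold can be pictured with a simplified sketch that separates already-scored peptides into kept and discarded sets. On the server this decision is made during sampling, not as a single post-processing step; the function name and the scores used here are hypothetical.

```python
def split_by_threshold(peptide_scores, threshold):
    """Separate peptides scoring below the threshold from the rest.

    peptide_scores maps each peptide to its score against its best motif.
    Peptides scoring exactly at the threshold are kept.
    """
    kept = {p: s for p, s in peptide_scores.items() if s >= threshold}
    trash = {p: s for p, s in peptide_scores.items() if s < threshold}
    return kept, trash


if __name__ == "__main__":
    scores = {"ACDEFGHIK": 4.2, "LMNPQRSTV": 0.3, "WYACDEFGH": 1.9}  # made-up values
    for threshold in (0.0, 1.0, 2.0):
        kept, trash = split_by_threshold(scores, threshold)
        print(f"threshold {threshold}: {len(trash)} peptide(s) discarded")
```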


Number of iterations per sequence per temperature step:
This parameter ("I") specifies how long your clustering schedule should be. Note that the total number of iterations is "I" multiplied by the number of sequences and by the number of temperature steps, so the execution time increases linearly with "I".
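The arithmetic is easy to check for a planned job. The sketch below uses a hypothetical helper and made-up input values.

```python
def total_iterations(iterations_per_sequence, n_sequences, n_temperature_steps):
    """Total iterations = I x number of sequences x number of temperature steps."""
    return iterations_per_sequence * n_sequences * n_temperature_steps


if __name__ == "__main__":
    # Made-up example: I = 10, 500 peptides, 20 temperature steps.
    print(total_iterations(10, 500, 20))
```

Doubling "I" doubles the total, and so does doubling the number of temperature steps or the number of submitted sequences.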

Initial Monte Carlo temperature:
The temperature is a scalar that is lowered in discrete steps as the iterations progress. It influences the probability of accepting or rejecting the moves of the algorithm. In the initial iterations (high temperature) the program is free to explore the landscape of solutions; as the system cools off, only moves that increase the energy will be accepted.

Number of temperature steps:
The number of steps in the cooling schedule (starting from the initial temperature specified above).
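The interplay between temperature and move acceptance can be sketched with a standard Metropolis-style rule. Both the linear cooling schedule and the exponential acceptance probability are assumptions chosen for illustration, as are the function names and the default final temperature; the server's exact formulas are not given on this page.

```python
import math
import random


def cooling_schedule(initial_temperature, n_steps, final_temperature=0.0001):
    """Temperatures lowered in n_steps equal decrements (assumed linear)."""
    if n_steps == 1:
        return [initial_temperature]
    decrement = (initial_temperature - final_temperature) / (n_steps - 1)
    return [initial_temperature - i * decrement for i in range(n_steps)]


def accept_move(energy_change, temperature, rng=random):
    """Metropolis-style test for a system whose energy is being maximized.

    Moves that increase the energy are always accepted; moves that lower it
    are accepted with probability exp(energy_change / temperature).
    """
    if energy_change >= 0:
        return True
    return rng.random() < math.exp(energy_change / temperature)


if __name__ == "__main__":
    rng = random.Random(0)
    for temperature in cooling_schedule(1.5, 4):
        accepted = sum(accept_move(-0.5, temperature, rng) for _ in range(1000))
        print(f"T = {temperature:.4f}: {accepted} of 1000 unfavourable moves accepted")
```

At high temperature a large share of unfavourable moves is accepted, which lets the sampler explore; near the end of the schedule almost none are.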

Interval between Indel moves:
Specifies how often to attempt introducing insertions and deletions (see glossary).

Interval between Single peptide moves:
Specifies how often to attempt moving a sequence between clusters (see glossary).

Interval between Phase shift moves:
Specifies how often to attempt shifting the alignment window of a single cluster.

Background amino acid frequencies:
Construction of PSSMs relies on calculating the frequency of a given residue at a given position, compared to the expected background frequency of that amino acid. You may use a flat background model identical for all amino acids (Flat), a pre-calculated distribution reflecting the relative frequency of each residue in naturally occurring proteins (Pre-calculated Uniprot), or determine the background model directly from the dataset you submitted (From data).
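The comparison between observed and background frequencies can be illustrated with a log-odds score for one alignment column. This is a simplified sketch: it uses a base-2 log-odds ratio without pseudo-counts or sequence weighting, assumes the 20 standard amino acids, and shows only the Flat and From data backgrounds. All function names are hypothetical.

```python
import math
from collections import Counter

# Assumption: the 20 standard amino acids; the server's alphabet may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def flat_background():
    """Flat model: the same frequency for every amino acid."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def background_from_data(peptides):
    """Background estimated from the residue composition of the input."""
    counts = Counter(aa for peptide in peptides for aa in peptide.upper())
    total = sum(counts.values())
    return {aa: count / total for aa, count in counts.items()}


def log_odds_column(column, background):
    """log2(observed frequency / background frequency) for one column."""
    counts = Counter(column)
    n = len(column)
    return {aa: math.log2((count / n) / background[aa])
            for aa, count in counts.items()}


if __name__ == "__main__":
    column = "LLLV"  # residues observed at one alignment position
    print(log_odds_column(column, flat_background()))
```

With a flat background, a residue scores above zero whenever its observed frequency exceeds 1/20; with a data-derived background, residues that are simply abundant in the whole dataset are scored less favourably.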

Preference for hydrophobic AAs at P1:
In the special case of MHC class II data, we have previously found it helpful to guide the alignment by expressing a preference for hydrophobic residues at position P1 of the alignment.

Sequence weighting type:
Data redundancy may affect the quality of the clustering. You may use an explicit clustering of the sequences in a given group (Clustering), or a faster heuristic that calculates the degree of variability at each column of the alignment (Heuristic, recommended); you may also disable the down-weighting of redundant sequences altogether (None).

3. SUBMIT the job

Click on the "Submit query" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.

At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.