GibbsCluster - 2.0
Simultaneous alignment and clustering of peptide data
GibbsCluster is a server for unsupervised alignment and clustering of peptide sequences.
The program takes as input a list of peptide sequences and attempts to cluster them
into meaningful groups, using the algorithm described in this paper.
Visit the links on the grey bar below to read instructions and guidelines, see output formats, or download the code.
Update (Nov 2016): Implements deletions and insertions in the sequence alignment.
For very large data sets, you are encouraged to download a stand-alone version of the program, with full functionality and no parameter limitations.
CITATIONS
- GibbsCluster: unsupervised clustering and alignment of peptide sequences
Andreatta M, Alvarez B, Nielsen M
Nucleic Acids Research (2017) doi: 10.1093/nar/gkx248
Full text: PUBMED
- Simultaneous alignment and clustering of peptide data using a Gibbs sampling approach
Andreatta M, Lund O, Nielsen M
Bioinformatics (2012) doi: 10.1093/bioinformatics/bts621
Full text: PUBMED
Instructions & Guidelines
Sequence alignment and clustering are performed simultaneously by sampling the space of possible solutions using a Gibbs sampling strategy. Each cluster is represented by a position-specific scoring matrix (PSSM), and the algorithm aims to maximize the information content of individual matrices while minimizing the overlap between distinct clusters. The server returns a detailed report on the optimal clustering solutions, including plots of the optimal number of clusters and graphical representations of the identified sequence motifs as sequence logos. For details on the algorithm, refer to this paper.
This page introduces the data formats, the parameters available to customize the analysis, and some guidelines for the use of version 2.0. Users are welcome to contact the authors with any questions.
1. Specify INPUT sequences
The input file can be a plain list of peptides (Sample 1) or an annotated list (Sample 2). The annotation is carried over to the results, and may be useful for correlating a known classification with the clustering produced by the method. All input sequences must be in one-letter amino acid code. The allowed alphabet (not case sensitive) is as follows:
2. Set OPTIONS to customize your analysis
BASIC options
Job name:
This prefix is prepended to all files generated by the current run. If left empty, a
system-generated number will be assigned as prefix.
Number of clusters:
You may provide a specific number of clusters (e.g. 3), or an interval of cluster numbers (e.g. 1-8).
In the second case, the method will suggest the optimal number of clusters it found in the
data, given the parameter configuration of the job. Maximum number of clusters: 15.
Motif length:
The algorithm will attempt to align all sequences to common windows of N amino acids,
and construct its PSSMs on these alignments. Specify with this option the length of the
alignment window. Minimum motif length: 2
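As a rough illustration of the limits stated for the BASIC options, the following sketch parses a cluster specification such as "3" or "1-8" and checks the motif length. The function names are hypothetical and are not part of GibbsCluster; only the limits (at most 15 clusters, motif length of at least 2) come from the text above.

```python
from __future__ import annotations

MAX_CLUSTERS = 15      # limit stated in the option description
MIN_MOTIF_LENGTH = 2   # limit stated in the option description


def parse_cluster_spec(spec: str) -> list[int]:
    """Return the cluster counts to try for a spec such as "3" or "1-8"."""
    parts = spec.strip().split("-")
    if len(parts) == 1:
        low = high = int(parts[0])
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"unrecognised cluster specification: {spec!r}")
    if not 1 <= low <= high <= MAX_CLUSTERS:
        raise ValueError(
            f"cluster counts must satisfy 1 <= low <= high <= {MAX_CLUSTERS}"
        )
    return list(range(low, high + 1))


def check_motif_length(length: int) -> int:
    """Return the motif length unchanged, or raise if it is too short."""
    if length < MIN_MOTIF_LENGTH:
        raise ValueError(f"motif length must be at least {MIN_MOTIF_LENGTH}")
    return length


if __name__ == "__main__":
    print(parse_cluster_spec("1-8"))
    print(check_motif_length(9))
```

A single number yields a one-element list, so downstream code can treat both forms of the option in the same way.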
ADVANCED options
Make clustering moves at each iteration:
By default, simple shift moves are performed at each iteration, indel moves every 10
iterations, single peptide moves every 20 iterations, phase shift moves every 100 iterations.
You can alter this behavior by ticking this option; simple shift and phase shift moves
are then disabled, and single peptide moves are made at each iteration. This set-up is
recommended for "nearly-aligned" data, where clustering and indels should be sampled more
regularly than extensions at the termini. This is the case, for example, for sets of MHC
class I ligands of different lengths, which in most cases require central indels
to model the bulging of long ligands.
Max deletion length:
The maximum length of consecutive deletions in a peptide sequence.
Max insertion length:
The maximum length of consecutive insertions in a peptide sequence.
Number of seeds for initial conditions:
Gibbs sampling is a heuristic rather than a rigorous optimization procedure. Therefore,
it cannot guarantee that the optimal solution is reached from every starting
configuration. A common way to improve performance is to repeat the sampling from a
number of random initial configurations and select the solution that scores best
under the fitness function that governs the system. Use this parameter to specify
the number of initial configurations.
Penalty factor for inter-cluster similarity (λ):
This parameter modulates how similar the clusters are allowed to be. If you believe your
data contains multiple specificities with well-defined motifs, λ can be relatively high;
on the other hand, if your aim is to detect subtle differences in mostly homogeneous data, the
parameter λ should be set to a lower value.
Weight on small clusters (σ):
This parameter can be used to specify how small clusters are allowed to be. With low values
of σ the method will tend to produce small specialized clusters, while larger σ will
return larger and more general clusters.
Use trash cluster to remove outliers:
The trash cluster is used to collect the peptides that appear not to match any of the
motifs being identified. The behaviour of the trash-cluster is identical to any of the
other clusters, with the difference that the sequences in the trash cluster do not
contribute to the overall score of the system.
Threshold for discarding to trash:
This parameter specifies a baseline on the peptide scores, below which peptides are tossed
into the trash cluster. If you believe your data contains some degree of noise, you may
experiment with increasing this value and observe how many sequences become filtered out
by the trash cluster.
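The idea behind the "Number of seeds for initial conditions" option can be sketched as a generic best-of-N restart loop. This is a minimal illustration and not the GibbsCluster implementation: `best_of_seeds` and `toy_run` are hypothetical names, and the toy fitness stands in for a full sampling run.

```python
import random
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")


def best_of_seeds(
    run_once: Callable[[int], Tuple[float, S]], n_seeds: int
) -> Tuple[float, S]:
    """Run the sampler once per seed and keep the highest-scoring solution."""
    if n_seeds < 1:
        raise ValueError("n_seeds must be at least 1")
    results = [run_once(seed) for seed in range(n_seeds)]
    return max(results, key=lambda result: result[0])


def toy_run(seed: int) -> Tuple[float, float]:
    """Stand-in for one sampling run: a random guess scored by a toy fitness."""
    rng = random.Random(seed)
    x = rng.uniform(-5.0, 5.0)
    fitness = -(x - 1.0) ** 2  # toy fitness, maximal at x = 1
    return fitness, x


if __name__ == "__main__":
    fitness, solution = best_of_seeds(toy_run, n_seeds=3)
    print(f"best fitness {fitness:.3f} at x = {solution:.3f}")
```

Each additional seed adds one full sampling run, so the cost grows in proportion to the number of seeds.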
VERY ADVANCED options
Number of iterations per sequence per temperature step:
This parameter ("I") specifies how long your clustering schedule should be. Note that the total
number of iterations is "I" multiplied by the number of sequences and by the
number of temperature steps, and that the execution time increases linearly with it.
Initial Monte Carlo temperature:
The temperature is a scalar, lowered in discrete steps as the iterations progress. The
temperature influences the probability of accepting or rejecting the moves of the algorithm.
In the initial iterations (high temperature) the program is free to explore the landscape
of solutions, and as the system cools off only moves that increase the energy will be
accepted.
Number of temperature steps:
The number of steps in the cooling schedule (starting from the initial temperature specified
above).
Interval between Indel moves:
Specifies how often to attempt introducing insertions and deletions (see glossary).
Interval between Single peptide moves:
Specifies how often to attempt moving a sequence between clusters (see glossary).
Interval between Phase shift moves:
Specifies how often to attempt shifting the alignment window of a single cluster.
Background amino acid frequencies:
Construction of PSSMs relies on calculating the frequency of a given residue at a given position,
compared to the expected background frequency of that amino acid. You may use a flat background model
identical for all amino acids (Flat), a pre-calculated distribution reflecting the relative
frequency of each residue in naturally occurring proteins (Pre-calculated Uniprot), or determine
the background model directly from the dataset you submitted (From data).
Preference for hydrophobic AAs at P1:
In the special case of MHC class II data, we have previously found it helpful to guide the
alignment by expressing a preference for hydrophobic residues at position P1 of the alignment.
Sequence weighting type:
Data redundancy may affect the quality of the clustering. You may use an explicit clustering
of the sequences in a given group (Clustering), or use a faster heuristic that calculates the degree of
variability at each column in the alignment (Heuristic, recommended); you may also disable
sequence weighting, i.e. the downweighting of redundant sequences, altogether (None).
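The relation between the iteration count, the cooling schedule and move acceptance can be sketched as follows. Only the product formula for the total number of iterations comes from the text above. The linear cooling schedule, the small final temperature and the Metropolis-style acceptance rule (written in the convention used here, where a higher energy is better) are assumptions made for illustration; the function names are hypothetical.

```python
from __future__ import annotations

import math
import random


def total_iterations(iters_per_seq_per_step: int, n_sequences: int,
                     n_temperature_steps: int) -> int:
    """Total iterations = I x number of sequences x number of temperature steps."""
    return iters_per_seq_per_step * n_sequences * n_temperature_steps


def temperature_schedule(t_initial: float, n_steps: int,
                         t_final: float = 1e-4) -> list[float]:
    """Assumed schedule: equal steps from t_initial down to a small t_final."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    if n_steps == 1:
        return [t_initial]
    step = (t_initial - t_final) / (n_steps - 1)
    return [t_initial - i * step for i in range(n_steps)]


def accept_move(delta_energy: float, temperature: float,
                rng: random.Random) -> bool:
    """Assumed Metropolis-style rule: always accept improvements, and accept
    worsening moves with probability exp(delta_energy / temperature)."""
    if delta_energy >= 0:
        return True
    return rng.random() < math.exp(delta_energy / temperature)


if __name__ == "__main__":
    # Values taken from the settings of the example run on this page:
    # I = 100, 200 sequences, 20 temperature steps, initial temperature 0.8.
    print(total_iterations(100, 200, 20))
    print(temperature_schedule(0.8, 20)[:3])
```

With these settings the example run performs 100 x 200 x 20 = 400,000 iterations per seed, which shows why the execution time grows quickly on large data sets.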
3. SUBMIT the job
At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.
GLOSSARY
- PSSM: Position-specific scoring matrix. A matrix of size L x A, where L is the length of the alignment window and A is the length of the alphabet; it stores the weight (or "preference") of any given amino acid at each position of the alignment.
- Indels: Insertions and deletions. An insertion adds one or more gaps in the alignment, a deletion removes one or more amino acids of a given sequence.
- Simple shift move: A move affecting the alignment core of a single peptide. The algorithm attempts to shift the alignment core of a peptide to the left or right, and accepts or rejects the move with a probability that depends on the temperature of the system.
- Single peptide move: A move that attempts to change the cluster of a single peptide sequence.
- Indel move: A move that introduces or removes an insertion or a deletion in a peptide sequence. In GibbsCluster, indels can have any length up to the values specified as parameters, but there can only be one indel stretch per peptide (i.e. a single sequence cannot contain indels at multiple positions).
- Phase shift move: A move affecting the alignment core of all peptide sequences in a cluster. In simple terms, the algorithm will attempt to shift en bloc the alignment window of a given cluster.
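To make the PSSM definition in the glossary concrete, the sketch below builds an L x A table from a set of already aligned cores and scores it against a flat background. It is a simplified illustration under stated assumptions: the alphabet is taken to be the 20 standard amino acids, a plain pseudocount replaces the sequence weighting described earlier, and the information content is computed as a Kullback-Leibler distance in bits summed over positions. None of the function names come from GibbsCluster.

```python
from __future__ import annotations

import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumption: the 20 standard amino acids


def position_frequencies(cores: list[str],
                         pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Return an L x A table of smoothed residue frequencies, one row per position."""
    length = len(cores[0])
    if any(len(core) != length for core in cores):
        raise ValueError("all cores must have the same length")
    table = []
    for pos in range(length):
        counts = Counter(core[pos].upper() for core in cores)
        total = sum(counts[aa] for aa in ALPHABET) + pseudocount * len(ALPHABET)
        table.append({aa: (counts[aa] + pseudocount) / total for aa in ALPHABET})
    return table


def log_odds_pssm(freqs: list[dict[str, float]],
                  background: dict[str, float]) -> list[dict[str, float]]:
    """Log-odds weight of each residue at each position, relative to the background."""
    return [{aa: math.log2(col[aa] / background[aa]) for aa in ALPHABET}
            for col in freqs]


def kld(freqs: list[dict[str, float]], background: dict[str, float]) -> float:
    """Kullback-Leibler distance to the background, summed over positions (bits)."""
    return sum(col[aa] * math.log2(col[aa] / background[aa])
               for col in freqs for aa in ALPHABET)


if __name__ == "__main__":
    flat = {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}  # "Flat" background model
    freqs = position_frequencies(["ACD", "ACE", "ACF"])  # made-up toy cores
    pssm = log_odds_pssm(freqs, flat)
    print(round(pssm[0]["A"], 3), round(kld(freqs, flat), 3))
```

Conserved positions get large positive weights for the preferred residue and contribute most to the KLD, while variable positions contribute little.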
Output format
DESCRIPTION
- Run identifiers and settings: The parameters specified by the user are reported here, together with the number of sequences loaded as input.
- Barplot of KLD vs number of clusters: For each initial number of clusters, the information content of the alignments is shown as a barplot. The relative size of each block within a bar is proportional to the size of a given cluster. In the example below, the run with 3 initial clusters produced one empty cluster, therefore only 2 boxes are depicted in the third column of the barplot.
- Sequence logos of the optimal solution: The sequence motifs identified by the Gibbs Clustering are shown to the right of the barplot. This is the optimal solution over the range of initial numbers of clusters. Hovering the cursor over the logos shows the KLD of each cluster in the solution.
- Complete results for all initial numbers of clusters: If a range of initial numbers of clusters was specified (1 to 3 in the example below), the results for each case are listed in succession. The optimal number of clusters shown above is just a suggestion and depends on the specified parameters, so the user is encouraged to inspect all solutions with different numbers of clusters.
It is possible to customize the logos further by clicking on the LOGO button. This transfers the data to the Seq2Logo server, which allows plotting several different kinds of sequence logo.
Inspect the complete Clustering Report and the formatted Clustering Solution for a tabular version of the results. Format of the Clustering Solution files:
- Gn: Cluster number
- Num: Sequential number of the sequence in the cluster
- Sequence: Complete peptide sequence
- Core: Portion of the peptide in the alignment window
- of: Offset value, i.e. the starting position of the alignment core
- IP: Position of the insertion, if any
- IL: Length of the insertion, if any
- DP: Position of the deletion, if any
- DL: Length of the deletion, if any
- Annotation: Text annotation to the peptide, if provided at submission
- Self: Score of the peptide to its own cluster
- bgG: Identifier of the nearest cluster (-1 means no nearest cluster)
- bgScore: Score of the peptide to the nearest cluster
- cScore: Corrected score (Sself - λ x Snearest)
Remember that results are only stored on the CBS server for about 24 hours. Save your results to disk by clicking on the DOWNLOAD link at the bottom of the results page.
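As a small worked example of the cScore column in the Clustering Solution files, the sketch below computes the corrected score from the Self and bgScore values and the penalty λ, following the formula Sself - λ x Snearest. The function name and the numbers are illustrative assumptions and are not taken from real server output.

```python
def corrected_score(self_score: float, nearest_score: float, lam: float) -> float:
    """cScore = Self - lambda * bgScore, as given in the column description."""
    return self_score - lam * nearest_score


if __name__ == "__main__":
    # Made-up values: a peptide scoring 10.0 against its own cluster and
    # 5.0 against the nearest other cluster, with lambda = 0.8.
    print(corrected_score(10.0, 5.0, 0.8))
```

A larger λ subtracts more of the score to the nearest cluster, so peptides that fit two clusters almost equally well receive a low corrected score.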
EXAMPLE OUTPUT
GibbsCluster Server - Results
Technical University of Denmark
Version: 2.0
Run ID: 27911
Run name: gibbs_27911
Platform: Linux x86_64
Read 200 unique sequences from file
Settings:
No shift moves, cluster moves at every iteration
Number of clusters: 1 - 3
Motif length: 9
Initial MC temperture: 0.8
Number of temperature steps: 20
Number of iterations x Sequence x Tstep: 100
Max insertion length: 1
Max deletion length: 5
Interval between Indel moves: 10
Number of initial seeds: 3
Penalty lambda: 0.8
Weight on small clusters: 10
Sequence weighting type: 0
Background model: Uniprot pre-calculated
Use trash cluster to remove outliers: 1
Threshold for trash cluster: 0
KLD vs. Number of clusters with λ = 0.8
Identified 2 sequence motifs
[Figure: barplot of KLD vs. number of clusters, with sequence logos of the identified motifs]
RESULTS for 1 CLUSTERS
Final Average KLD: 9.466122
RESULTS for 2 CLUSTERS
Final Average KLD: 10.917174
RESULTS for 3 CLUSTERS
Final Average KLD: 10.907671
See the Activity log for this job
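The "Final Average KLD" values in the example output can be compared directly. The short sketch below assumes that the suggested solution is the one with the highest average KLD; the page does not state the selection rule explicitly, but this assumption is consistent with the two motifs reported for this run.

```python
# Final Average KLD per initial number of clusters, copied from the example output.
average_kld = {1: 9.466122, 2: 10.917174, 3: 10.907671}

# Assumption: the suggested number of clusters maximises the average KLD.
best_k = max(average_kld, key=average_kld.get)

if __name__ == "__main__":
    print(best_k)  # prints 2
```

The runs with 2 and 3 initial clusters score almost the same here, which illustrates why inspecting all solutions is recommended rather than relying on the suggested optimum alone.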
Article abstract
GibbsCluster: unsupervised clustering and alignment of peptide sequences
Massimo Andreatta, Bruno Alvarez, Morten Nielsen
Nucleic Acids Research, 2017 Apr 12. doi: 10.1093/nar/gkx248
Receptor interactions with short linear peptide fragments (ligands) are at the base of many biological signaling processes. Conserved and information-rich amino acid patterns, commonly called sequence motifs, shape and regulate these interactions. Because of the properties of a receptor-ligand system or of the assay used to interrogate it, experimental data often contain multiple sequence motifs. GibbsCluster is a powerful tool for unsupervised motif discovery because it can simultaneously cluster and align peptide data. The GibbsCluster 2.0 presented here is an improved version incorporating insertion and deletions accounting for variations in motif length in the peptide input. In basic terms, the program takes as input a set of peptide sequences and clusters them into meaningful groups. It returns the optimal number of clusters it identified, together with the sequence alignment and sequence motif characterizing each cluster. Several parameters are available to customize cluster analysis, including adjustable penalties for small clusters and overlapping groups, and a trash cluster to remove outliers. As an example application, we used the server to deconvolute multiple specificities in large-scale peptidome data generated by mass spectrometry. The server is available at http://www.cbs.dtu.dk/services/GibbsCluster-2.0.