DTU Health Tech

Department of Health Technology

TCRcluster - 1.0

Latent embeddings of TCR sequences and agglomerative clustering

TCRcluster is a variational autoencoder and agglomerative clustering pipeline. It clusters TCRs together based on latent cosine distances extracted with a VAE.
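
As a rough illustration of clustering by cosine distance under a threshold, the following pure-Python sketch may help. It is not the TCRcluster implementation: the toy 2-D vectors stand in for VAE latent vectors, and the average-linkage merge rule and the helper names `cosine_distance` and `agglomerate` are assumptions made for this example.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(angle), for two non-zero vectors (range 0 to 2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerate(vectors, threshold):
    """Merge the closest pair of clusters until no pair is within `threshold`.

    Average linkage is an assumption; the linkage used by the server is not
    stated on this page.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        dists = [cosine_distance(vectors[p], vectors[q]) for p in c1 for q in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        (a, b), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Toy stand-ins for latent vectors (made up for illustration).
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (-1.0, 0.0)]
clusters = agglomerate(vectors, threshold=0.5)
print(clusters)
```

With a threshold of 0.5 the two pairs of near-parallel vectors are grouped and the last vector is left as a singleton; raising the threshold merges more aggressively.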

Submit data


Paste in the sequences for all six CDRs (A1,A2,A3,B1,B2,B3), with an optional label column. The leading C and trailing F amino acids must be clipped from A3 and B3.
Alternatively, load an example input or upload a file from your local machine.
The input must be comma separated, with headers A1,A2,A3,B1,B2,B3,label (the label column is optional).

For detailed instructions, see Instructions tab above.

For an overview of the method and citation information, see Abstract tab.



Options

VAE model: One Stage (No Triplet), One Stage (Cosine Triplet), Two Stages (No Triplet), or Two Stages (Cosine Triplet)

Threshold Options

Set a threshold: Use default (Optimisation) or Set manually

Cite

Yat-tsai Richie Wan, Morten Nielsen.

Instructions for TCRcluster-1.0

Input format

  • The data should have the format A1,A2,A3,B1,B2,B3,label or A1,A2,A3,B1,B2,B3.
  • Each CDR has its own maximum length, and sequences must use the standard 20-letter uppercase amino acid alphabet:

  • A C D E F G H I K L M N P Q R S T V W Y
  • The input needs to be comma separated, with headers: A1,A2,A3,B1,B2,B3
  • The input can be submitted with or without labels. Labels can be any specificity to a known peptide or the source of a sample, and are not used during clustering.
  • If labels are provided, the header row must be A1,A2,A3,B1,B2,B3,label
  • Additional columns that will not be used by the program (for example indices, etc.) can be added, but only columns with the column names A1,A2,A3,B1,B2,B3 are required.

  • Load Example data on the Submission page to ensure your data format is correct.
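
Before submitting, the format rules above can be checked locally. The sketch below is only an illustration and is not part of the server: the helper names `check_input` and `clip_cdr3` are hypothetical, the example sequences are made up, and the per-CDR length limits are not checked because they are not listed here.

```python
import csv
import io

REQUIRED = ["A1", "A2", "A3", "B1", "B2", "B3"]
ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def clip_cdr3(seq):
    """Drop one leading C and one trailing F, if present.

    Only apply this to full-length CDR3 sequences that still carry these
    residues; an already clipped sequence could otherwise be over-trimmed.
    """
    if seq.startswith("C"):
        seq = seq[1:]
    if seq.endswith("F"):
        seq = seq[:-1]
    return seq


def check_input(text):
    """Return a list of format problems found in comma-separated TCR input."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in REQUIRED:
            seq = row[col] or ""
            bad = sorted(set(seq) - ALPHABET)
            if bad:
                problems.append(
                    f"line {line_no}: {col} has invalid character(s) {''.join(bad)}"
                )
    return problems


# Made-up rows for illustration only.
good = "A1,A2,A3,B1,B2,B3,label\nTSGFNG,NVLDGL,AVRDSNYQLI,SGHTA,FQGNSA,ASSLGQAYEQY,demo\n"
bad = "A1,A2,A3,B1,B2,B3\nTSGFNG,NVLDGL,AVRDSNYQLI,SGHTA,FQGNSA,ASSLGXAYEQY\n"

print(check_input(good))
print(check_input(bad))
print(clip_cdr3("CAVRDSNYQLIF"))
```

Extra columns are ignored by the check, in line with the note above that only the six CDR columns are required.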
List of supported CDR lengths

  1. Paste the data into the box, or load an example file, or load a file from your local machine.
  2. Select which VAE model to use.
  3. Select whether to run the clustering optimisation (recommended), or provide a custom distance threshold.
Click the submit button when all data are entered.



EXAMPLE OUTPUT

Using the sample data, with labels, with optimisation:

Click here to download the latent vector and predicted clusters in .csv format.
Click here to download the clusters summary in .csv format.
Click here to download the cosine distance matrix in .csv format.
Click here to download the optimisation results in .csv format.
Click here to download the optimisation curve plot in .png format.


Below is a table preview of the clustering metrics at each threshold tested.
A total of 80 points are tested; only the points around the best solution (10 on either side) are shown.
The 'best' column denotes the best silhouette solution.

    threshold   best  n_cluster  n_singletons  silhouette  mean_purity  retention  mean_size  max_size
40    0.77949  False          6             6       0.194        0.875       0.70      2.333       4.0
41    0.79165  False          6             6       0.194        0.875       0.70      2.333       4.0
42    0.80380  False          6             6       0.194        0.875       0.70      2.333       4.0
43    0.81596  False          6             6       0.194        0.875       0.70      2.333       4.0
44    0.82811  False          6             5       0.197        0.850       0.75      2.500       5.0
45    0.84026  False          5             5       0.184        0.820       0.75      3.000       5.0
46    0.85242  False          6             3       0.195        0.767       0.85      2.833       5.0
47    0.86457  False          6             3       0.195        0.767       0.85      2.833       5.0
48    0.87673  False          6             3       0.195        0.767       0.85      2.833       5.0
49    0.88888  False          6             3       0.195        0.767       0.85      2.833       5.0
50    0.90104   True          6             1       0.206        0.739       0.95      3.167       5.0
51    0.91319  False          6             1       0.206        0.739       0.95      3.167       5.0
52    0.92534  False          6             1       0.206        0.739       0.95      3.167       5.0
53    0.93750  False          6             1       0.206        0.739       0.95      3.167       5.0
54    0.94965  False          6             1       0.206        0.739       0.95      3.167       5.0
55    0.96181  False          6             1       0.206        0.739       0.95      3.167       5.0
56    0.97396  False          6             1       0.206        0.739       0.95      3.167       5.0
57    0.98612  False          6             1       0.206        0.739       0.95      3.167       5.0
58    0.99827  False          6             0       0.202        0.683       1.00      3.333       5.0
59    1.01042  False          6             0       0.202        0.683       1.00      3.333       5.0
60    1.02258  False          6             0       0.202        0.683       1.00      3.333       5.0
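
The numbers in this preview appear to be related in a simple way. Assuming the sample holds 20 sequences (an inference from the values, not something stated here), retention matches the fraction of sequences not left as singletons, and mean_size matches the retained sequences divided by n_cluster. The sketch below recomputes both for a few rows and picks the first threshold with the highest displayed silhouette, which agrees with row 50 being flagged while rows 51 to 57 show the same rounded value. The actual tie-breaking rule of the server is not documented here.

```python
# Rows copied from the preview table:
# (index, threshold, n_cluster, n_singletons, silhouette)
rows = [
    (40, 0.77949, 6, 6, 0.194),
    (44, 0.82811, 6, 5, 0.197),
    (45, 0.84026, 5, 5, 0.184),
    (46, 0.85242, 6, 3, 0.195),
    (50, 0.90104, 6, 1, 0.206),
    (51, 0.91319, 6, 1, 0.206),
    (58, 0.99827, 6, 0, 0.202),
]

N_SEQUENCES = 20  # assumption inferred from the table values


def derived_metrics(n_cluster, n_singletons, n_total=N_SEQUENCES):
    """Return (retention, mean_size) under the assumed definitions."""
    retained = n_total - n_singletons
    return round(retained / n_total, 2), round(retained / n_cluster, 3)


# max() returns the first row among ties, i.e. the lowest threshold here.
best = max(rows, key=lambda r: r[4])

for index, threshold, n_cluster, n_singletons, silhouette in rows:
    print(index, derived_metrics(n_cluster, n_singletons))
print("best row:", best[0], "threshold:", best[1])
```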

DESCRIPTION

The output contains up to 5 files when optimising, or 3 files when using a custom threshold:

  • Latent vector and predicted clusters: CSV table containing the input datapoints and sequences, the extracted latent vector (columns z_XX) and the predicted cluster label.

  • Clusters summary: CSV table containing a summary of the clusters for each label. Purity and majority labels are only informative if labels were provided with the input data.

  • Cosine distance matrix: CSV table of the square cosine distance matrix (all versus all). Additional placeholder label and index columns are included; they are only used in the Python code for compatibility purposes.

  • Optimisation results: CSV table containing the clustering metrics at each hyperparameter value (distance threshold) used in the optimisation process. The column "best" denotes which threshold was found to be optimal (e.g. maximum silhouette score).

  • Optimisation plot: PNG plot of the optimisation table, with the optimal silhouette score denoted by a dashed line.
    Silhouette score, purity and retention are plotted in the range 0-1 on the main Y-axis. Mean size and n_cluster above are plotted in log2 scale on the secondary Y-axis. n_above denotes the number of clusters above size 6 and purity 80% (only useful if labels were provided).
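
To illustrate what purity and majority label mean in the clusters summary, here is a small sketch. It assumes the common definition (purity is the fraction of a cluster's members carrying its most frequent label); the exact computation used by TCRcluster is not spelled out here, and the function name `summarise_clusters` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def summarise_clusters(cluster_ids, labels):
    """Per-cluster size, majority label and purity (assumed definition)."""
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, labels):
        members[cluster_id].append(label)

    summary = {}
    for cluster_id, cluster_labels in members.items():
        majority, count = Counter(cluster_labels).most_common(1)[0]
        summary[cluster_id] = {
            "size": len(cluster_labels),
            "majority_label": majority,
            "purity": count / len(cluster_labels),
        }
    return summary


# Toy cluster assignments and labels, made up for illustration.
cluster_ids = [0, 0, 0, 1, 1, 2]
labels = ["pepA", "pepA", "pepB", "pepB", "pepB", "pepA"]
summary = summarise_clusters(cluster_ids, labels)
for cluster_id, stats in sorted(summary.items()):
    print(cluster_id, stats)
```

Note that a singleton always has purity 1 under this definition, which is one reason purity is usually read together with retention.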
    References


    - Yat-tsai Richie Wan, Morten Nielsen, Manuscript in preparation, 2025, Vol. 1, No. 2, 1234-1235.

    Abstract

    T cells are a crucial part of the adaptive immune system, tasked with identifying and eliminating pathogen-infected or cancerous cells. However, predicting T cell specificity and clustering T cell repertoires remain highly challenging tasks, mainly due to their large sequence diversity. As a means to resolve this, accurately encoding T-cell receptor (TCR) sequences into a latent feature space is critical for advancing TCR-specificity prediction and downstream applications. Here, we developed a variational autoencoder (VAE)-based model trained on paired TCR α-β chain data. Our approach incorporates all six complementarity-determining regions (CDRs) for enhanced specificity assessment. Optimized through a custom loss regime, our best models achieved up to 99.3% sequence reconstruction accuracy. We introduced a semi-supervised “Two-Stage VAE” framework, combining a cosine triplet loss with an integrated classifier predictor to refine the latent representations for peptide specificity. Benchmarking against sequence-based methods (TCRBase and tcrdist3), our VAE models demonstrated superior performance for peptide-specificity prediction. Clustering analyses leveraging our VAE latent space were evaluated using K-means, agglomerative clustering, and a novel graph-based Top1Cut method. Agglomerative clustering achieved the most biologically relevant results, balancing cluster purity and retention despite noise in TCR specificity annotations. We extended these insights to TCR repertoire data, evaluating immune repertoires from SARS-CoV-2 and cancer patients. Across datasets, VAE-based models outperformed sequence-based methods, particularly in retention metrics, with notable improvements in the SARS-CoV-2 repertoire likely reflecting the presence of both SARS-CoV-2 epitopes and TCRs in the training dataset. Moreover, the cancer repertoire analysis highlighted the generalizability of our approach, where the Two-Stage VAE without triplet loss excelled despite minimal similarity between the training and test data. Collectively, these results demonstrate the potential of VAE-based latent representations for TCR-specificity prediction, clustering, and repertoire analysis, offering a robust framework for analyzing diverse immune datasets.

    Supplementary material

    Here, you will find the data set used for training and evaluating TCRcluster-1.0.


    TCRcluster-1.0

    Training data

    Used to train the VAE models. Only partitions 2, 3 and 4 were used for training.
    Partition 0 was used as the validation set.
    Partition 1 was used as the held-out test set.
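
For readers reproducing the split from the downloadable files, the partition roles above can be expressed as a small lookup. This is only a sketch: the field name `partition` is hypothetical and may differ from the column name in the actual files.

```python
# Partition roles as described above.
SPLIT_OF_PARTITION = {0: "valid", 1: "test", 2: "train", 3: "train", 4: "train"}


def split_by_partition(rows, key="partition"):
    """Group rows into train/valid/test by their partition number.

    `key` is a hypothetical field name used for illustration.
    """
    splits = {"train": [], "valid": [], "test": []}
    for row in rows:
        splits[SPLIT_OF_PARTITION[int(row[key])]].append(row)
    return splits


# Toy rows cycling through partitions 0 to 4.
rows = [{"id": i, "partition": i % 5} for i in range(10)]
splits = split_by_partition(rows)
print({name: [r["id"] for r in members] for name, members in splits.items()})
```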

    All partitions, binders only (Used to train models)
    All partitions, with swapped negatives (Used to train models)

    Validation data

    Only partition 0 was used for validation

    17 peptides subset, test set binders only (used for model development)
    17 peptides subset, with swapped negatives (used for model development)

    Test data

    Only partition 1 was used for testing; it was never seen during training or validation.

    17 peptides subset, test set with swapped negatives (used for figure 2)
    17 peptides subset, test set binders only (used for figure 3)

    References

    ...
    Submitted 2025.



    GETTING HELP

    If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

    If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
