ProtParts - 1.0

Protein clustering and partitioning

ProtParts is a web server that clusters protein sequences by the E-value of their pairwise alignments, a measure of sequence homology. The resulting clusters align with protein domain and family annotations. ProtParts then randomly assigns clusters to partitions so that the partitions are independent of one another, which prevents data leakage during model training and evaluation in machine learning. ProtParts can also reduce the sequence redundancy of a protein dataset using the Hobohm 1 algorithm with the same E-value measure of homology.

Submission

Sequence submission: paste the sequence(s) or upload a local file in FASTA format

Instructions

ProtParts-1.0 clusters protein sequences with a graph method based on BLAST E-values and randomly assigns the resulting clusters to partitions.

ProtParts accepts protein sequences in FASTA format and performs pairwise sequence alignments with BLAST. Because ProtParts clusters proteins by sequence homology, it requires an E-value threshold that defines homology. ProtParts then randomly allocates the clusters from the graph into partitions. At a given threshold, the independence of the clusters ensures that the partitions are independent as well, as machine learning requires. Users can choose among several output formats. By default, ProtParts runs BLAST, clustering, and partitioning in sequence. Optionally, redundancy reduction can be applied before clustering; it uses the Hobohm 1 algorithm with the same E-value measure of homology.
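
The clustering and partitioning steps described above can be sketched in Python. This is a minimal illustration, not the ProtParts implementation: it assumes that a cluster is a connected component of the graph whose edges are BLAST hits with an E-value at or below the threshold, and that each cluster is placed in a uniformly random partition without size balancing. The function names, the `hits` tuple layout, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def cluster_by_evalue(seq_ids, hits, threshold=1e-9):
    """Group sequences into connected components of the homology graph.

    hits: iterable of (query_id, subject_id, evalue) tuples; every ID in
    hits must also appear in seq_ids. An edge is drawn when
    evalue <= threshold (assumed rule, see the lead-in).
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if query != subject and evalue <= threshold:
            parent[find(query)] = find(subject)

    groups = defaultdict(list)
    for s in seq_ids:
        groups[find(s)].append(s)
    return list(groups.values())


def assign_partitions(clusters, n_partitions=5, seed=0):
    """Place each whole cluster into one randomly chosen partition."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(n_partitions)]
    for cluster in clusters:
        partitions[rng.randrange(n_partitions)].append(cluster)
    return partitions


# Toy input, made up for illustration only.
ids = ["A", "B", "C", "D"]
hits = [("A", "B", 1e-30), ("B", "C", 1e-12), ("C", "D", 0.5)]
clusters = cluster_by_evalue(ids, hits)
partitions = assign_partitions(clusters, n_partitions=2)
```

In the toy input, A, B, and C end up in one cluster because A-B and B-C both pass the threshold, even though A and C have no direct hit. Since clusters are never split, no pair of sequences below the threshold can sit in two different partitions.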

Options

Parameters	Description
➀ Sequence submission	Upload or paste protein sequences in FASTA.
➁ Clustering threshold	Build clusters in the graph using an E-value threshold (Default: 1e-9). Users can provide multiple space-separated thresholds to obtain different clustering results. When the clustering threshold is left empty, the number of partitions becomes mandatory. Given the number of partitions, the program uses the lowest sequence homology (i.e., the highest E-value) at which the clusters still fit the maximum partition capacity.
➂ Number of partitions	Allocate clusters into partitions randomly. (Default: 5)
Output format:	Save the clustering and partitioning results in JSON, FASTA, CSV, or TXT format.
➃ Redundancy reduction threshold (optional)	Reduce dataset redundancy using the Hobohm 1 algorithm with an E-value threshold. (Default: None, which skips redundancy reduction)
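
The optional redundancy reduction can be illustrated with a short sketch of the greedy Hobohm 1 procedure: walk through the sequences in order and keep one only if it has no hit at or below the threshold to a sequence that was already kept. The function name, the `hits` layout, and the choice to process sequences in input order are assumptions made for this example; the order ProtParts uses is not stated here.

```python
from collections import defaultdict


def hobohm1(seq_ids, hits, threshold):
    """Greedy Hobohm 1 selection on E-value homology (illustrative only).

    hits: iterable of (query_id, subject_id, evalue) tuples.
    Sequences are visited in the order given by seq_ids (assumption).
    """
    neighbours = defaultdict(set)
    for query, subject, evalue in hits:
        if query != subject and evalue <= threshold:
            neighbours[query].add(subject)
            neighbours[subject].add(query)

    kept = []
    kept_set = set()
    for s in seq_ids:
        # Keep s only if none of its homologues has been kept already.
        if neighbours[s].isdisjoint(kept_set):
            kept.append(s)
            kept_set.add(s)
    return kept


# Toy input, made up for illustration only.
ids = ["A", "B", "C", "D"]
hits = [("A", "B", 1e-30), ("B", "C", 1e-12), ("C", "D", 0.5)]
kept = hobohm1(ids, hits, threshold=1e-9)
```

With this toy input, B is dropped because it is homologous to the already kept A, while C is kept because its only homologue, B, was not retained.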

Output

Result page

➀	Parameters used to run the program.
➁	Statistical summary of the clustering results. Download the output file in JSON, TXT, CSV, or FASTA format. See the examples in the Output file format section below.
➂	Graphical analyses of the clustering results: distribution of cluster sizes and silhouette coefficient per sample (only for large clusters with more than 10 data points).

Output file format

1. JSON
JSON output format without partitioning

{
    "Cluster_0": [
        "A0002",
        "A0003",
        "A0004",
        ...

JSON output format with partitioning

{
    "Partition_0": {
        "Cluster_4": [
            "A0010",
            "A0118",
            "A0119",
            ...
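
As a hedged usage sketch, the partitioned JSON layout shown above (partition, then cluster, then sequence IDs) can be flattened into train and test ID lists with the standard `json` module. The helper names are hypothetical, and the `Partition_1` entry in the example string is invented to make the example complete; it does not come from real ProtParts output.

```python
import json


def ids_by_partition(partitioned):
    """Map each partition name to the flat list of its sequence IDs."""
    return {
        part: [sid for ids in clusters.values() for sid in ids]
        for part, clusters in partitioned.items()
    }


def train_test_split(partitioned, test_partition):
    """Hold out one partition as the test set and merge the rest."""
    folds = ids_by_partition(partitioned)
    test = folds[test_partition]
    train = [
        sid
        for part, ids in folds.items()
        if part != test_partition
        for sid in ids
    ]
    return train, test


# Illustrative input; the Partition_1 content is made up.
example = json.loads("""{
  "Partition_0": {"Cluster_4": ["A0010", "A0118", "A0119"]},
  "Partition_1": {"Cluster_0": ["A0002", "A0003", "A0004"]}
}""")
train_ids, test_ids = train_test_split(example, "Partition_0")
```

Repeating the call with each partition name in turn gives one cross-validation fold per partition.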

2. TXT
TXT output format without partitioning. The comment lines start with #. "ClustID 0" indicates the numbering of clusters. "A0002" is the sequence ID.

# Clustering method: graph
# Threshold: 1e-09
# Number of clusters: 2030
ClustID 0 A0002
ClustID 0 A0003
ClustID 0 A0004
...

TXT output format with partitioning. "PartID 0" indicates the numbering of partitions.

# Clustering method: graph
# Threshold: 1e-09
# Number of partitions: 5
ClustID 4 PartID 0 A0010
ClustID 4 PartID 0 A0118
ClustID 4 PartID 0 A0119
...
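
The TXT layout above can be read with a few lines of Python. The sketch below assumes what the examples suggest: comment lines have the form `# key: value`, and each record is a series of label/number pairs followed by the sequence ID, with no whitespace inside the ID. The function name is hypothetical.

```python
def parse_protparts_txt(lines):
    """Parse ProtParts-style TXT lines into (header, records).

    header: dict built from '# key: value' comment lines.
    records: list of dicts such as
             {"SequenceID": "A0010", "ClustID": 4, "PartID": 0}.
    """
    header = {}
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, _, value = line.lstrip("#").partition(":")
            header[key.strip()] = value.strip()
            continue
        tokens = line.split()
        record = {"SequenceID": tokens[-1]}
        # Everything before the ID comes in label/value pairs.
        for label, value in zip(tokens[:-1:2], tokens[1:-1:2]):
            record[label] = int(value)
        records.append(record)
    return header, records


sample = """# Clustering method: graph
# Threshold: 1e-09
# Number of partitions: 5
ClustID 4 PartID 0 A0010
ClustID 4 PartID 0 A0118
ClustID 4 PartID 0 A0119
""".splitlines()
header, records = parse_protparts_txt(sample)
```

The same function handles the format without partitioning, in which case the records simply lack the `PartID` key.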

3. CSV
CSV output format without partitioning.

SequenceID,ClusterID
A0002,0
A0003,0
A0004,0
...

CSV output format with partitioning.

SequenceID,PartitionID,ClusterID
A0010,0,4
A0118,0,4
A0119,0,4
...
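
One way to sanity-check a partitioned CSV file before training is to confirm that no cluster appears in more than one partition. The following sketch uses only the standard `csv` module and the column names shown above; the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def partitions_per_cluster(csv_text):
    """Return {ClusterID: set of PartitionIDs that contain it}."""
    seen = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        seen[row["ClusterID"]].add(row["PartitionID"])
    return dict(seen)


def leaking_clusters(csv_text):
    """List clusters that are split across several partitions."""
    return sorted(
        cluster
        for cluster, parts in partitions_per_cluster(csv_text).items()
        if len(parts) > 1
    )


sample_csv = """SequenceID,PartitionID,ClusterID
A0010,0,4
A0118,0,4
A0119,0,4
"""
mapping = partitions_per_cluster(sample_csv)
leaks = leaking_clusters(sample_csv)
```

An empty `leaks` list means every cluster is confined to a single partition, which is the property the partitioning is meant to guarantee. Note that `csv.DictReader` returns the IDs as strings.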

4. FASTA
FASTA output format without partitioning. The clustering and partitioning results are appended after the sequence IDs.

>A0002 Cluster_0
MAQLTLLLLSLFLTLISLPPPGASISSCNGPCRDLNDCDGQLICIKGKCNDDPEVGTHICGGTTPSPQPGSCNPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
>A0003 Cluster_0
MAQLTLLLLSLFFTLISLPPPGASISSCNGPCRDLNDCNGQLICIKGKCNDDPEVGTHICGGTTPSPQPGSCKPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
>A0004 Cluster_0
MAQLTLLLLSLFLTLISLPPPGASISSCNGPCRDLNDCDGQLICIKGKCNDDPEVGTHICGGTTPSPQPGGCNPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
...

FASTA output format with partitioning.

>A0010 Cluster_4 Partition_0
MARPSFLSLVSLSLLVLSHSSAANRQPSKYQQQQKGECQIQRLNAQEPQQRIQAEAGVTEFWDWTDDQFQCAGVAACRNMIQPRGLLLPSYTNAPTLIYILKGRGITGVMIPGCPETYQSSQQSREGDVSHRQFRDQHQKIRRFQQGDVIALPAGVAHWCYNDGDSDLVTVSVEDTGNRQNQLDNNPRRFFLAGNPQQQQKEMYAKRPQQQHSGNVFRGFDTEVLAETFGVDMEMARRLQGKDDYRGHIIQVERELKIVRPPRTREEQEQQERGERDNGMEETICTARLVENIDNPSRADIFNPRAGRLTSVNSFNLPILNYLRLSAEKGVLYRNALMPPHWKLNAHCVLYATRGEAQMQIVDQRGEAVFNDRIREGQLVVVPQNFVVMKQAGNQGFEWVAIKTNENAMFNTLAGRTSALRAMPVDVLANAYQISQSEARRLKMGREEAVLFEPRSEGRDVD
>A0118 Cluster_4 Partition_0
PPTKFSFSLFLVSVLVLCLGFALAKIDPELKQCKHQCKVQRQYDEQQKEQCVKECEKYYKEKKGREREHEEEEEEWGTGGVDEPSTHEPAEKHLSQCMRQCERQEGGQQKQLCRFRCQERYKKERGQHNYKREDDEDEDEDEAEEEDENPYVFEDEDFTTKVKTEQGKVVLLPKFTQKSKLLHALEKYRLAVLVANPQAFVVPSHMDADSIFFVSWGRGTITKILENKRESINVRQGDIVSISSGTPFYIANNDENEKLYLVQFLRPVNLPGHFEVFHGPGGENPESFYRAFSWEILEAALKTSKDTLEKLFEKQDQGTIMKASKEQVRAMSRRGEGPKIWPFTEESTGSFKLFKKDPSQSNKYGQLFEAERIDYPPLEKLDMVVSYANITKGGMSVPFYNSRATKIAIVVSGEGCVEIACPHLSSSKSSHPSYKKLRARIRKDTVFIVPAGHPFATVASGNENLEIVCFEVNAEGNIRYTLAGKKNIIKVMEKEAKELAFKMEGEEVDKVFGKQDEEFFFQGPEWRKEKEGRADE
>A0119 Cluster_4 Partition_0
MGPPTKFSFSLFLVSVLVLCLGFALAKIDPELKQCKHQCKVQRQYDEQQKEQCVKECEKYYKEKKGREREHEEEEEEWGTGGVDEPSTHEPAEKHLSQCMRQCERQEGGQQKQLCRFRCQERYKKERGQHNYKREDDEDEDEDEAEEEDENPYVFEDEDFTTKVKTEQGKVVLLPKFTQKSKLLHALEKYRLAVLVANPQAFVVPSHMDADSIFFVSWGRGTITKILENKRESINVRQGDIVSISSGTPFYIANNDENEKLYLVQFLRPVNLPGHFEVFHGPGGENPESFYRAFSWEILEAALKTSKDTLEKLFEKQDQGTIMKASKEQIRAMSRRGEGPKIWPFTEESTGSFKLFKKDPSQSNKYGQLFEAERIDYPPLEKLDMVVSYANITKGGMSVPFYNSRATKIAIVVSGEGCVEIACPHLSSSKSSHPSYKKLRARIRKDTVFIVPAGHPFATVASGNENLEIVCFEVNAEGNIRYTLAGKKNIIKVMEKEAKELAFKMEGEEVDKVFGKQDEEFFFQGPEWRKEKEGRADE
...
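
The annotated FASTA headers can be split back into ID, cluster, and partition with a small reader. This is a sketch that assumes the header layout shown above (`>ID Cluster_N` optionally followed by `Partition_M`); the function name is hypothetical, and the sequences in the example are truncated to their first ten residues for brevity.

```python
def read_annotated_fasta(lines):
    """Return {sequence_id: {"cluster", "partition", "sequence"}}."""
    records = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = {"cluster": None, "partition": None, "sequence": ""}
            for field in fields[1:]:
                if field.startswith("Cluster_"):
                    current["cluster"] = int(field[len("Cluster_"):])
                elif field.startswith("Partition_"):
                    current["partition"] = int(field[len("Partition_"):])
            records[fields[0]] = current
        elif current is not None:
            # Sequence lines may be wrapped, so concatenate them.
            current["sequence"] += line
    return records


# Headers follow the examples above; sequences are truncated.
sample_fasta = """>A0010 Cluster_4 Partition_0
MARPSFLSLV
>A0118 Cluster_4 Partition_0
PPTKFSFSLF
""".splitlines()
fasta_records = read_annotated_fasta(sample_fasta)
```

For files without partitioning, the `partition` field stays `None`.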

Abstract

In bioinformatics, sequence redundancy in a dataset is likely to introduce data leakage when training machine learning models. Data leakage results in overfitting and compromises the model's ability to generalise. However, current clustering tools either cannot provide completely independent partitions for training or use sequence similarity measures that lack statistical significance. In this study, we developed a clustering and partitioning tool, ProtParts, which uses the BLAST E-value for accurate similarity distance calculations and a graph algorithm to cluster protein sequences. This ensures independent partitions and prevents data leakage between training and test datasets, thereby enhancing model generalisation. A series of comparative analyses indicated that ProtParts has a higher silhouette coefficient and adjusted mutual information. Re-training three predictive models (based on random forest, feedforward neural network, and convolutional neural network) revealed that partitioning with data leakage leads to overfitting and inflated performance during cross-validation. In contrast, training on ProtParts partitions demonstrated more robust and improved model performance on predicting independent data. Based on these results, we deployed the user-friendly web server ProtParts for protein partitioning in machine learning applications.

References

ProtParts, an automated web server for clustering and partitioning protein dataset
Li Y. and Barra C.,
bioRxiv, https://doi.org/10.1101/2024.07.12.603234.

Abstract

Data leakage originating from protein sequence similarity shared among train and test sets can result in model overfitting and overestimation of model performance and utility. However, leakage is often subtle and might be difficult to eliminate. Available clustering tools often do not provide completely independent partitions, and in addition it is difficult to assess the statistical significance of those differences. In this study, we developed a clustering and partitioning tool, ProtParts, utilizing the E-value of BLAST to compute pairwise similarities between each pair of proteins and using a graph algorithm to generate clusters of similar sequences. This exhaustive clustering ensures the most independent partitions, giving a metric of statistical significance and, thereby enhancing the model generalization. A series of comparative analyses indicated that ProtParts clusters have higher silhouette coefficient and adjusted mutual information than other algorithms using k-mers or sequence percentage identity. Re-training three distinct predictive models revealed how sub-optimal data clustering and partitioning leads to overfitting and inflated performance during cross-validation. In contrast, training on ProtParts partitions demonstrated a more robust and improved model performance on predicting independent data. Based on these results, we deployed the user-friendly web server ProtParts (https://services.healthtech.dtu.dk/services/ProtParts-1.0) for protein partitioning prior to machine learning applications.

Supplementary material

Here you will find the datasets used in the publication in FASTA format, and the sequence information and annotations in CSV files.
FASTA
Allergen dataset (Dataset 1 and 2)
Evaluation dataset (Dataset 3)
ASTRAL SCOPe dataset
CSV
Dataset 1
Dataset 2
Dataset 3

Software Downloads

Version 1.0b

Python

GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
