DTU Health Tech

Department of Health Technology

ProtParts - 1.0

Protein clustering and partitioning

ProtParts is a web server for protein sequence clustering based on the E-value of sequence alignments, which serves as a measure of sequence homology. The clustering results align with protein domain and family annotations. ProtParts also randomly assigns clusters to partitions so that the partitions are independent of one another for machine learning, preventing data leakage during model training and evaluation. In addition, ProtParts can reduce the sequence redundancy of a protein dataset using the Hobohm 1 algorithm, again with the E-value as the homology measure.

Submission


Sequence submission: paste one or more sequences in FASTA format, upload a local file in FASTA format, or load the example data.

E-value threshold for clustering, or a range of E-values to find the optimal clustering

Number of partitions


Output format


Optional parameters
E-value threshold for redundancy reduction


Pruning to improve clustering performance


Instructions


ProtParts-1.0 is a program that clusters protein sequences with a graph method based on BLAST E-values and generates random partitions from the clusters.


ProtParts accepts protein sequences in FASTA format and performs pairwise sequence alignments with BLAST. Since ProtParts clusters proteins by sequence homology, an E-value threshold representing that homology is required to proceed. ProtParts then randomly allocates the clusters from the graph into partitions. Under a given threshold, the clusters are independent of one another, so the partitions built from them are independent for machine learning as well. Users can choose between different output formats to meet their requirements. By default, ProtParts runs BLAST, clustering and partitioning in sequential order. An optional redundancy reduction step can be applied before clustering; it reduces dataset redundancy with the Hobohm 1 algorithm, also based on E-value sequence homology.
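The clustering and partitioning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ProtParts implementation: it assumes that the "graph method" means taking connected components of a graph whose edges join sequence pairs with an E-value at or below the threshold, and that clusters are shuffled and dealt to partitions in turn. The function names `cluster_by_evalue` and `partition_clusters` and the `hits` tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def cluster_by_evalue(seq_ids, hits, threshold=1e-9):
    """Group sequences into connected components of a homology graph.

    hits: iterable of (query_id, subject_id, evalue) tuples, for example
    parsed from BLAST tabular output. Every ID in hits must be in seq_ids.
    Assumption: an edge exists when evalue <= threshold.
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if query != subject and evalue <= threshold:
            parent[find(query)] = find(subject)

    clusters = defaultdict(list)
    for s in seq_ids:
        clusters[find(s)].append(s)
    return list(clusters.values())


def partition_clusters(clusters, n_partitions=5, seed=None):
    """Randomly allocate whole clusters to partitions (clusters are never split)."""
    rng = random.Random(seed)
    shuffled = list(clusters)
    rng.shuffle(shuffled)
    partitions = [[] for _ in range(n_partitions)]
    for i, cluster in enumerate(shuffled):
        partitions[i % n_partitions].extend(cluster)
    return partitions
```

Because a cluster is always placed as a unit, no two partitions share sequences that are linked at the chosen threshold, which is the property that prevents leakage between training and test data.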

Options


Parameters and descriptions
➀ Sequence submission: Upload or paste protein sequences in FASTA format.
➁ Clustering threshold: Build clusters in the graph using an E-value threshold (Default: 1e-9). Users can provide multiple thresholds, separated by spaces, to obtain a clustering result for each. When the clustering threshold is left empty, the number of partitions becomes mandatory: given that number, the program uses the lowest sequence homology (i.e., the highest E-value) to create clusters to fit the maximum partition capacity.
➂ Number of partitions: Allocate clusters into partitions randomly (Default: 5).
Output format: Save the clustering and partitioning results in JSON, FASTA, CSV, or TXT format.
➃ Redundancy reduction threshold (optional): Reduce dataset redundancy using the Hobohm 1 algorithm with an E-value threshold (Default: None, which skips the redundancy reduction).
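For readers unfamiliar with Hobohm 1, the greedy selection behind the redundancy reduction option can be sketched as follows. This is a minimal sketch, not the code run by the server: the function name `hobohm1`, the `evalues` lookup keyed by unordered ID pairs, and the rule that a pair counts as redundant when its E-value is at or below the threshold are all assumptions made for illustration. The order of `seq_ids` determines which member of a redundant pair is kept.

```python
def hobohm1(seq_ids, evalues, threshold=1e-9):
    """Greedy Hobohm 1 selection of a non-redundant subset.

    seq_ids: sequence IDs in priority order (for example, longest first).
    evalues: dict mapping frozenset({a, b}) to the E-value between a and b;
             pairs without an alignment hit are simply absent.
    """
    kept = []
    for candidate in seq_ids:
        # A candidate is redundant if it is homologous to any kept sequence.
        redundant = any(
            evalues.get(frozenset((candidate, k)), float("inf")) <= threshold
            for k in kept
        )
        if not redundant:
            kept.append(candidate)
    return kept
```

Each candidate is compared only with sequences that have already been kept, so the result is a set in which no pair falls at or below the threshold.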

Output


Result page



Parameters used to run the program.
Summary statistics of the clustering results. Download the output file in JSON, TXT, CSV, or FASTA format. See examples in the Output file format section below.
Graphical analyses of the clustering results: distribution of cluster sizes and silhouette coefficient per sample (only for large clusters with more than 10 samples).
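As a reminder of what the per-sample silhouette coefficient measures, the sketch below computes it from a precomputed distance matrix using the standard definition s = (b - a) / max(a, b), where a is the mean distance to the other members of the sample's own cluster and b is the smallest mean distance to any other cluster. How ProtParts derives distances from E-values is not described on this page, so the `dist` matrix here is a hypothetical input, and the function name `silhouette_samples` is chosen for illustration. At least two clusters are required.

```python
def silhouette_samples(dist, labels):
    """Per-sample silhouette coefficients.

    dist: square distance matrix as a list of lists.
    labels: cluster label for each row of dist.
    """
    members = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)

    scores = []
    for i, lab in enumerate(labels):
        own = members[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in idx) / len(idx)
            for other, idx in members.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores
```

Values close to 1 indicate a sample that sits well inside its cluster, while values near 0 or below suggest it lies between clusters or is closer to another one.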

Output file format


1. JSON
JSON output format without partitioning
{
    "Cluster_0": [
        "A0002",
        "A0003",
        "A0004",
        ...
JSON output format with partitioning
{
    "Partition_0": {
        "Cluster_4": [
            "A0010",
            "A0118",
            "A0119",
            ...
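A partitioned JSON file can be turned into cross-validation splits with a few lines of Python. The sketch below assumes the complete file follows the nesting shown in the truncated example above (partition name, then cluster name, then a list of sequence IDs); the function names `load_partitions` and `cross_validation_folds` are hypothetical.

```python
import json


def load_partitions(json_text):
    """Flatten {partition: {cluster: [ids]}} into {partition: [ids]}."""
    data = json.loads(json_text)
    return {
        part: [seq_id for ids in clusters.values() for seq_id in ids]
        for part, clusters in data.items()
    }


def cross_validation_folds(partitions):
    """Yield (held_out_partition, train_ids, test_ids), one fold per partition."""
    for held_out, test_ids in partitions.items():
        train_ids = [
            seq_id
            for part, ids in partitions.items()
            if part != held_out
            for seq_id in ids
        ]
        yield held_out, train_ids, test_ids
```

Each partition is held out once as the test set while the remaining partitions form the training set, so homologous sequences from the same cluster never appear on both sides of a split.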
2. TXT
TXT output format without partitioning. The comment lines start with #. "ClustID 0" indicates the numbering of clusters. "A0002" is the sequence ID.
# Clustering method: graph
# Threshold: 1e-09
# Number of clusters: 2030
ClustID 0 A0002
ClustID 0 A0003
ClustID 0 A0004
...
TXT output format with partitioning. "PartID 0" indicates the numbering of partitions.
# Clustering method: graph
# Threshold: 1e-09
# Number of partitions: 5
ClustID 4 PartID 0 A0010
ClustID 4 PartID 0 A0118
ClustID 4 PartID 0 A0119
...
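Both TXT variants can be read with one small parser. The sketch below assumes that fields are separated by whitespace and that each data line consists of "label value" pairs followed by the sequence ID, as in the examples above; the function name `parse_protparts_txt` is hypothetical.

```python
def parse_protparts_txt(text):
    """Return (metadata, rows) from a ProtParts-style TXT output.

    metadata: dict built from the '# key: value' comment lines.
    rows: one dict per sequence, e.g.
          {'ClustID': '4', 'PartID': '0', 'SequenceID': 'A0010'}.
    """
    meta, rows = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, _, value = line[1:].partition(":")
            meta[key.strip()] = value.strip()
            continue
        fields = line.split()
        # Pair up labels (even positions) with values (odd positions);
        # the last field is the sequence ID.
        record = dict(zip(fields[:-1:2], fields[1:-1:2]))
        record["SequenceID"] = fields[-1]
        rows.append(record)
    return meta, rows
```

Values are kept as strings; convert them with `int()` or `float()` where numeric comparison is needed.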
3. CSV
CSV output format without partitioning.
SequenceID,ClusterID
A0002,0
A0003,0
A0004,0
...
CSV output format with partitioning.
SequenceID,PartitionID,ClusterID
A0010,0,4
A0118,0,4
A0119,0,4
...
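The CSV output with partitioning lends itself to a quick sanity check that no cluster is spread over more than one partition. The sketch below uses only the Python standard library and the column names shown above; the function name `clusters_spanning_partitions` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def clusters_spanning_partitions(csv_text):
    """Return {ClusterID: [PartitionIDs]} for clusters found in several partitions.

    An empty result means every cluster sits in exactly one partition.
    """
    partitions_of = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        partitions_of[row["ClusterID"]].add(row["PartitionID"])
    return {
        cluster: sorted(parts)
        for cluster, parts in partitions_of.items()
        if len(parts) > 1
    }
```

The same check is useful after merging or filtering the file by hand, since such edits can accidentally place members of one cluster in different partitions.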
4. FASTA
FASTA output format without partitioning. The clustering and partitioning results are appended after the sequence IDs.
>A0002 Cluster_0
MAQLTLLLLSLFLTLISLPPPGASISSCNGPCRDLNDCDGQLICIKGKCNDDPEVGTHICGGTTPSPQPGSCNPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
>A0003 Cluster_0
MAQLTLLLLSLFFTLISLPPPGASISSCNGPCRDLNDCNGQLICIKGKCNDDPEVGTHICGGTTPSPQPGSCKPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
>A0004 Cluster_0
MAQLTLLLLSLFLTLISLPPPGASISSCNGPCRDLNDCDGQLICIKGKCNDDPEVGTHICGGTTPSPQPGGCNPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
...
FASTA output format with partitioning.
>A0010 Cluster_4 Partition_0
MARPSFLSLVSLSLLVLSHSSAANRQPSKYQQQQKGECQIQRLNAQEPQQRIQAEAGVTEFWDWTDDQFQCAGVAACRNMIQPRGLLLPSYTNAPTLIYILKGRGITGVMIPGCPETYQSSQQSREGDVSHRQFRDQHQKIRRFQQGDVIALPAGVAHWCYNDGDSDLVTVSVEDTGNRQNQLDNNPRRFFLAGNPQQQQKEMYAKRPQQQHSGNVFRGFDTEVLAETFGVDMEMARRLQGKDDYRGHIIQVERELKIVRPPRTREEQEQQERGERDNGMEETICTARLVENIDNPSRADIFNPRAGRLTSVNSFNLPILNYLRLSAEKGVLYRNALMPPHWKLNAHCVLYATRGEAQMQIVDQRGEAVFNDRIREGQLVVVPQNFVVMKQAGNQGFEWVAIKTNENAMFNTLAGRTSALRAMPVDVLANAYQISQSEARRLKMGREEAVLFEPRSEGRDVD
>A0118 Cluster_4 Partition_0
PPTKFSFSLFLVSVLVLCLGFALAKIDPELKQCKHQCKVQRQYDEQQKEQCVKECEKYYKEKKGREREHEEEEEEWGTGGVDEPSTHEPAEKHLSQCMRQCERQEGGQQKQLCRFRCQERYKKERGQHNYKREDDEDEDEDEAEEEDENPYVFEDEDFTTKVKTEQGKVVLLPKFTQKSKLLHALEKYRLAVLVANPQAFVVPSHMDADSIFFVSWGRGTITKILENKRESINVRQGDIVSISSGTPFYIANNDENEKLYLVQFLRPVNLPGHFEVFHGPGGENPESFYRAFSWEILEAALKTSKDTLEKLFEKQDQGTIMKASKEQVRAMSRRGEGPKIWPFTEESTGSFKLFKKDPSQSNKYGQLFEAERIDYPPLEKLDMVVSYANITKGGMSVPFYNSRATKIAIVVSGEGCVEIACPHLSSSKSSHPSYKKLRARIRKDTVFIVPAGHPFATVASGNENLEIVCFEVNAEGNIRYTLAGKKNIIKVMEKEAKELAFKMEGEEVDKVFGKQDEEFFFQGPEWRKEKEGRADE
>A0119 Cluster_4 Partition_0
MGPPTKFSFSLFLVSVLVLCLGFALAKIDPELKQCKHQCKVQRQYDEQQKEQCVKECEKYYKEKKGREREHEEEEEEWGTGGVDEPSTHEPAEKHLSQCMRQCERQEGGQQKQLCRFRCQERYKKERGQHNYKREDDEDEDEDEAEEEDENPYVFEDEDFTTKVKTEQGKVVLLPKFTQKSKLLHALEKYRLAVLVANPQAFVVPSHMDADSIFFVSWGRGTITKILENKRESINVRQGDIVSISSGTPFYIANNDENEKLYLVQFLRPVNLPGHFEVFHGPGGENPESFYRAFSWEILEAALKTSKDTLEKLFEKQDQGTIMKASKEQIRAMSRRGEGPKIWPFTEESTGSFKLFKKDPSQSNKYGQLFEAERIDYPPLEKLDMVVSYANITKGGMSVPFYNSRATKIAIVVSGEGCVEIACPHLSSSKSSHPSYKKLRARIRKDTVFIVPAGHPFATVASGNENLEIVCFEVNAEGNIRYTLAGKKNIIKVMEKEAKELAFKMEGEEVDKVFGKQDEEFFFQGPEWRKEKEGRADE
...
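The cluster and partition labels can be recovered from the FASTA headers without any external library. The sketch below assumes headers of the form `>ID Cluster_N` or `>ID Cluster_N Partition_M`, as in the examples above; the function name `read_annotated_fasta` is hypothetical.

```python
def read_annotated_fasta(text):
    """Parse annotated FASTA text into a list of record dicts.

    Each record has the keys 'id', 'cluster', 'partition' and 'sequence';
    'partition' stays None when the header carries no Partition_ tag.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq_id, *tags = line[1:].split()
            record = {"id": seq_id, "cluster": None, "partition": None, "sequence": ""}
            for tag in tags:
                name, _, number = tag.partition("_")
                if name == "Cluster":
                    record["cluster"] = int(number)
                elif name == "Partition":
                    record["partition"] = int(number)
            records.append(record)
        elif records:
            # Sequence lines may be wrapped; join them onto the current record.
            records[-1]["sequence"] += line
    return records
```

For larger files, a dedicated FASTA parser from an established bioinformatics library would be a sensible replacement for the line loop.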

Abstract


In bioinformatics, sequence redundancy in a dataset is likely to introduce data leakage during training in machine learning. Data leakage results in overfitting, compromising the model's capacity to generalise. However, current clustering tools either cannot provide completely independent partitions for training, or rely on sequence similarity measures that lack statistical significance. In this study, we developed a clustering and partitioning tool, ProtParts, which uses the E-value of BLAST for accurate similarity distance calculations and incorporates a graph algorithm to cluster protein sequences. This ensures independent partitions and prevents data leakage between training and test datasets, thereby enhancing model generalisation. A series of comparative analyses indicated that ProtParts achieves a higher silhouette coefficient and adjusted mutual information. Re-training three predictive models (based on random forest, feedforward neural network, and convolutional neural network) revealed that partitioning with data leakage leads to overfitting and inflated performance during cross-validation. In contrast, training on ProtParts partitions yielded more robust and improved model performance when predicting independent data. Based on these results, we deployed the user-friendly web server ProtParts for protein partitioning in machine learning applications.

References


Supplementary material


Here, you will find the datasets used in the publication.
Allergen dataset
ASTRAL SCOPe dataset

Software Downloads




GETTING HELP

If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).

If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
