DTU Health Tech
Department of Health Technology
This link is for the general contact of the DTU Health Tech institute.
If you need help with the bioinformatics programs, see the "Getting Help" section below the program.
ProtParts is a web server for protein sequence clustering based on E-value of alignments, which represents sequence homology. The clustering results align with protein domain and family annotations. Furthermore, ProtParts randomly assigns clusters into partitions to ensure independence among partitions for machine learning, thereby, preventing data leakage during model training and evaluation. Additionally, ProtParts can reduce sequence redundancy of a protein sequence dataset as well, which is based on Hobohm 1 algorithm and E-value sequence homology.
Sequence submission: paste the sequence(s) or upload a local file in FASTA format |
ProtParts-1.0 is a program that clusters protein sequences with graph method based on E-value of BLAST, and generates random partitions on clusters.
Parameters | Description |
---|---|
➀ Sequence submission | Upload or paste protein sequences in FASTA. |
➁ Clustering threshold | Build clusters in graph using E-value as a threshold (Default: 1e-9). Users can provide multiple thresholds with space separated to get different clustering results. When the clustering threshold is empty, the parameter, number of partitions, becomes mandatory. Based on a given number of partitions, the program will use the lowest sequence homology (i.e., the highest E-value) to create clusters to fit the maximum partition capacity. |
➂ Number of partitions | Allocate clusters into partitions randomly. (Default: 5) |
Output format: | Save the clustering and partitioning results in JSON, FASTA, CSV, or TXT format. |
➃ Redundancy reduction threshold (optional) | Reduce dataset redundancy using Hobohm 1 algorithm with an E-value threshold. (Default: None, skip the redundancy reduction) |
➀ | Pararmeters of running the program. |
➁ | Summary of statistical description for clustering results. Download the output file in JSON, TXT, CSV, or FASTA. See examples in the following Output file format section. |
➂ | Graphical analyses of clustering results, distribution of cluster size and silhouette coefficient per sample (only for large clusters with more than 10 data) |
{
"Cluster_0": [
"A0002",
"A0003",
"A0004",
...
JSON output format with partitioning
{
"Partition_0": {
"Cluster_4": [
"A0010",
"A0118",
"A0119",
...
2. TXT# Clustering method: graph
# Threshold: 1e-09
# Number of clusters: 2030
ClustID 0 A0002
ClustID 0 A0003
ClustID 0 A0004
...
TXT output format with partitioning. "PartID 0" indicates the numbering of partitions.
# Clustering method: graph
# Threshold: 1e-09
# Number of partitions: 5
ClustID 4 PartID 0 A0010
ClustID 4 PartID 0 A0118
ClustID 4 PartID 0 A0119
...
3. CSVSequenceID,ClusterID
A0002,0
A0003,0
A0004,0
...
CSV output format with partitioning.
SequenceID,PartitionID,ClusterID
A0010,0,4
A0118,0,4
A0119,0,4
...
4. FASTA>A0002 Cluster_0
MAQLTLLLLSLFLTLISLPPPGASISSCNGPCRDLNDCDGQLICIKGKCNDDPEVGTHICGGTTPSPQPGSCNPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
>A0003 Cluster_0
MAQLTLLLLSLFFTLISLPPPGASISSCNGPCRDLNDCNGQLICIKGKCNDDPEVGTHICGGTTPSPQPGSCKPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
>A0004 Cluster_0
MAQLTLLLLSLFLTLISLPPPGASISSCNGPCRDLNDCDGQLICIKGKCNDDPEVGTHICGGTTPSPQPGGCNPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
...
FASTA output format with partitioning.
>A0010 Cluster_4 Partition_0
MARPSFLSLVSLSLLVLSHSSAANRQPSKYQQQQKGECQIQRLNAQEPQQRIQAEAGVTEFWDWTDDQFQCAGVAACRNMIQPRGLLLPSYTNAPTLIYILKGRGITGVMIPGCPETYQSSQQSREGDVSHRQFRDQHQKIRRFQQGDVIALPAGVAHWCYNDGDSDLVTVSVEDTGNRQNQLDNNPRRFFLAGNPQQQQKEMYAKRPQQQHSGNVFRGFDTEVLAETFGVDMEMARRLQGKDDYRGHIIQVERELKIVRPPRTREEQEQQERGERDNGMEETICTARLVENIDNPSRADIFNPRAGRLTSVNSFNLPILNYLRLSAEKGVLYRNALMPPHWKLNAHCVLYATRGEAQMQIVDQRGEAVFNDRIREGQLVVVPQNFVVMKQAGNQGFEWVAIKTNENAMFNTLAGRTSALRAMPVDVLANAYQISQSEARRLKMGREEAVLFEPRSEGRDVD
>A0118 Cluster_4 Partition_0
PPTKFSFSLFLVSVLVLCLGFALAKIDPELKQCKHQCKVQRQYDEQQKEQCVKECEKYYKEKKGREREHEEEEEEWGTGGVDEPSTHEPAEKHLSQCMRQCERQEGGQQKQLCRFRCQERYKKERGQHNYKREDDEDEDEDEAEEEDENPYVFEDEDFTTKVKTEQGKVVLLPKFTQKSKLLHALEKYRLAVLVANPQAFVVPSHMDADSIFFVSWGRGTITKILENKRESINVRQGDIVSISSGTPFYIANNDENEKLYLVQFLRPVNLPGHFEVFHGPGGENPESFYRAFSWEILEAALKTSKDTLEKLFEKQDQGTIMKASKEQVRAMSRRGEGPKIWPFTEESTGSFKLFKKDPSQSNKYGQLFEAERIDYPPLEKLDMVVSYANITKGGMSVPFYNSRATKIAIVVSGEGCVEIACPHLSSSKSSHPSYKKLRARIRKDTVFIVPAGHPFATVASGNENLEIVCFEVNAEGNIRYTLAGKKNIIKVMEKEAKELAFKMEGEEVDKVFGKQDEEFFFQGPEWRKEKEGRADE
>A0119 Cluster_4 Partition_0
MGPPTKFSFSLFLVSVLVLCLGFALAKIDPELKQCKHQCKVQRQYDEQQKEQCVKECEKYYKEKKGREREHEEEEEEWGTGGVDEPSTHEPAEKHLSQCMRQCERQEGGQQKQLCRFRCQERYKKERGQHNYKREDDEDEDEDEAEEEDENPYVFEDEDFTTKVKTEQGKVVLLPKFTQKSKLLHALEKYRLAVLVANPQAFVVPSHMDADSIFFVSWGRGTITKILENKRESINVRQGDIVSISSGTPFYIANNDENEKLYLVQFLRPVNLPGHFEVFHGPGGENPESFYRAFSWEILEAALKTSKDTLEKLFEKQDQGTIMKASKEQIRAMSRRGEGPKIWPFTEESTGSFKLFKKDPSQSNKYGQLFEAERIDYPPLEKLDMVVSYANITKGGMSVPFYNSRATKIAIVVSGEGCVEIACPHLSSSKSSHPSYKKLRARIRKDTVFIVPAGHPFATVASGNENLEIVCFEVNAEGNIRYTLAGKKNIIKVMEKEAKELAFKMEGEEVDKVFGKQDEEFFFQGPEWRKEKEGRADE
...
Data leakage originating from protein sequence similarity shared among train and test sets can result in model overfitting and overestimation of model performance and utility. However, leakage is often subtle and might be difficult to eliminate. Available clustering tools often do not provide completely independent partitions, and in addition it is difficult to assess the statistical significance of those differences. In this study, we developed a clustering and partitioning tool, ProtParts, utilizing the E-value of BLAST to compute pairwise similarities between each pair of proteins and using a graph algorithm to generate clusters of similar sequences. This exhaustive clustering ensures the most independent partitions, giving a metric of statistical significance and, thereby enhancing the model generalization. A series of comparative analyses indicated that ProtParts clusters have higher silhouette coefficient and adjusted mutual information than other algorithms using k-mers or sequence percentage identity. Re-training three distinct predictive models revealed how sub-optimal data clustering and partitioning leads to overfitting and inflated performance during cross-validation. In contrast, training on ProtParts partitions demonstrated a more robust and improved model performance on predicting independent data. Based on these results, we deployed the user-friendly web server ProtParts (https://services.healthtech.dtu.dk/services/ProtParts-1.0) for protein partitioning prior to machine learning applications.
If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).
If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
Correspondence:
Technical Support: