DTU Health Tech
Department of Health Technology
If you need help with the bioinformatics programs, see the "Getting Help" section below the program.
ProtParts is a web server for clustering protein sequences based on the E-values of their alignments, which serve as a measure of sequence homology. The clustering results align with protein domain and family annotations. ProtParts then randomly assigns clusters to partitions so that the partitions are independent of one another for machine learning, thereby preventing data leakage during model training and evaluation. ProtParts can also reduce the sequence redundancy of a protein sequence dataset, using the Hobohm 1 algorithm with an E-value threshold as the homology criterion.
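The redundancy reduction step can be pictured with a minimal sketch of a Hobohm 1 style greedy selection. This is not the ProtParts implementation: the function name `hobohm1`, the dictionary of pairwise E-values, and the input order of the sequences are all assumptions made for illustration.

```python
def hobohm1(ids, evalues, threshold):
    """Greedy selection: keep a sequence only if it has no hit with an
    E-value at or below `threshold` to any sequence already kept.

    `ids` is assumed to be pre-sorted in the desired priority order.
    `evalues` maps frozenset({id_a, id_b}) to a pairwise E-value; pairs
    that are absent are treated as having no detectable homology.
    """
    kept = []
    for seq_id in ids:
        similar = any(
            evalues.get(frozenset((seq_id, other)), float("inf")) <= threshold
            for other in kept
        )
        if not similar:
            kept.append(seq_id)
    return kept


# Hypothetical pairwise E-values, for illustration only.
evalues = {
    frozenset(("A", "B")): 1e-30,
    frozenset(("B", "C")): 1e-12,
    frozenset(("C", "D")): 0.5,
}
representatives = hobohm1(["A", "B", "C", "D"], evalues, threshold=1e-9)
```

In this toy input, "B" is dropped because it is homologous to the already kept "A", while "C" is kept because its only homolog, "B", was not retained.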
Sequence submission: paste the sequence(s) or upload a local file in FASTA format.
ProtParts-1.0 is a program that clusters protein sequences using a graph method based on BLAST E-values, and generates random partitions from the resulting clusters.
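As a rough sketch of the idea, sequences can be treated as graph nodes, BLAST hits with an E-value at or below the threshold as edges, and connected components as clusters. Whether ProtParts uses connected components, and how exactly it randomizes the allocation, is not stated here, so both choices below are assumptions, as are the function names and the toy hit list.

```python
import random
from collections import defaultdict


def cluster_by_evalue(ids, hits, threshold):
    """Connected components of the graph whose edges are hits with
    E-value <= threshold. `hits` is an iterable of (id_a, id_b, evalue)."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in hits:
        if evalue <= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    return list(clusters.values())


def partition_clusters(clusters, n_partitions, seed=0):
    """Shuffle the clusters and deal them out round-robin, so that no
    cluster is ever split across two partitions (assumed scheme)."""
    rng = random.Random(seed)
    shuffled = clusters[:]
    rng.shuffle(shuffled)
    partitions = [[] for _ in range(n_partitions)]
    for index, cluster in enumerate(shuffled):
        partitions[index % n_partitions].extend(cluster)
    return partitions


# Hypothetical BLAST hits, for illustration only.
hits = [("A", "B", 1e-30), ("B", "C", 1e-12), ("C", "D", 0.5), ("E", "F", 1e-20)]
clusters = cluster_by_evalue(list("ABCDEF"), hits, threshold=1e-9)
partitions = partition_clusters(clusters, n_partitions=2)
```

Because whole clusters are moved as units, homologous sequences always end up in the same partition, which is the property that prevents leakage between training and evaluation data.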
Parameters | Description |
---|---|
➀ Sequence submission | Upload or paste protein sequences in FASTA. |
➁ Clustering threshold | Build clusters in the graph using an E-value threshold (Default: 1e-9). Multiple thresholds, separated by spaces, can be provided to obtain a separate clustering result for each. When the clustering threshold is left empty, the number of partitions becomes mandatory: for the given number of partitions, the program uses the lowest sequence homology (i.e., the highest E-value) that creates clusters fitting the maximum partition capacity. |
➂ Number of partitions | Allocate clusters into partitions randomly. (Default: 5) |
Output format | Save the clustering and partitioning results in JSON, FASTA, CSV, or TXT format. |
➃ Redundancy reduction threshold (optional) | Reduce dataset redundancy using the Hobohm 1 algorithm with an E-value threshold. (Default: None, which skips redundancy reduction) |

Output | Description |
---|---|
➀ | Parameters used to run the program. |
➁ | Summary statistics of the clustering results. Download the output file in JSON, TXT, CSV, or FASTA. See examples in the following Output file format section. |
➂ | Graphical analyses of the clustering results: distribution of cluster size and silhouette coefficient per sample (only for large clusters with more than 10 sequences). |
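The per-sample silhouette coefficient mentioned above follows the standard definition s(i) = (b - a) / max(a, b), where a is the mean distance from sample i to the other members of its cluster and b is the smallest mean distance to any other cluster. The sketch below only illustrates that formula. Which distance ProtParts derives from the alignments is not stated, so the distance matrix is a hypothetical input and the function name is illustrative.

```python
def silhouette_per_sample(dist, labels):
    """Compute s(i) = (b - a) / max(a, b) for every sample.

    dist   : square matrix (list of lists) of pairwise distances
    labels : cluster label of each sample, in the same order as `dist`
    Samples in singleton clusters, or data with a single cluster, get 0.0.
    """
    n = len(labels)
    scores = []
    for i in range(n):
        # Group the distances from sample i by the cluster of the other sample.
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist[i][j])
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical distances: two tight clusters that are far apart.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
scores = silhouette_per_sample(dist, labels=[0, 0, 1, 1])
```

Values close to 1 indicate samples that sit well inside their cluster, and values near 0 or below indicate samples close to another cluster.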
1. JSON
{
"Cluster_0": [
"A0002",
"A0003",
"A0004",
...
JSON output format with partitioning.
{
"Partition_0": {
"Cluster_4": [
"A0010",
"A0118",
"A0119",
...
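A partitioned JSON result of this shape can be turned into cross-validation folds with a few lines of Python. The sketch below assumes the nesting shown above (partition, then cluster, then a list of sequence IDs). The second partition in the example string and the helper names are made up for illustration.

```python
import json

# Illustrative input: "Partition_1" and its contents are invented here
# to make the example complete; they are not real ProtParts output.
example = """
{
  "Partition_0": {"Cluster_4": ["A0010", "A0118", "A0119"]},
  "Partition_1": {"Cluster_0": ["A0002", "A0003", "A0004"]}
}
"""


def partition_members(result):
    """Map each partition name to the flat list of sequence IDs it contains."""
    return {
        part: [seq for members in clusters.values() for seq in members]
        for part, clusters in result.items()
    }


def cross_validation_folds(members):
    """Yield (held_out_partition, train_ids, test_ids), one fold per partition."""
    for held_out, test_ids in members.items():
        train_ids = [
            seq
            for part, ids in members.items()
            if part != held_out
            for seq in ids
        ]
        yield held_out, train_ids, test_ids


members = partition_members(json.loads(example))
folds = list(cross_validation_folds(members))
```

Each fold holds out one partition for evaluation and trains on the rest, so homologous sequences never appear on both sides of a split.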
2. TXT
# Clustering method: graph
# Threshold: 1e-09
# Number of clusters: 2030
ClustID 0 A0002
ClustID 0 A0003
ClustID 0 A0004
...
TXT output format with partitioning. The "PartID" field gives the partition number (e.g. "PartID 0").
# Clustering method: graph
# Threshold: 1e-09
# Number of partitions: 5
ClustID 4 PartID 0 A0010
ClustID 4 PartID 0 A0118
ClustID 4 PartID 0 A0119
...
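The TXT layout can be read with a small parser such as the sketch below. It assumes that fields are separated by whitespace (spaces or tabs), that header lines start with "#" and have the form "key: value", and that the last field of each data line is the sequence ID. The function name is hypothetical.

```python
def parse_protparts_txt(text):
    """Return (metadata, rows) from TXT output.

    metadata: dict built from the '# key: value' header lines
    rows    : one dict per data line, e.g.
              {"ClustID": 4, "PartID": 0, "SequenceID": "A0010"}
    """
    metadata, rows = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, _, value = line.lstrip("# ").partition(":")
            metadata[key.strip()] = value.strip()
            continue
        # Data lines are name/number pairs followed by the sequence ID.
        *pairs, seq_id = line.split()
        row = {pairs[i]: int(pairs[i + 1]) for i in range(0, len(pairs), 2)}
        row["SequenceID"] = seq_id
        rows.append(row)
    return metadata, rows


sample = """# Clustering method: graph
# Threshold: 1e-09
# Number of partitions: 5
ClustID 4 PartID 0 A0010
ClustID 4 PartID 0 A0118
"""
metadata, rows = parse_protparts_txt(sample)
```

The same function handles output without partitioning, in which case the rows simply have no "PartID" key.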
3. CSV
SequenceID,ClusterID
A0002,0
A0003,0
A0004,0
...
CSV output format with partitioning.
SequenceID,PartitionID,ClusterID
A0010,0,4
A0118,0,4
A0119,0,4
...
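With the partitioned CSV, a quick sanity check before training is to confirm that no cluster is spread over more than one partition. The sketch below uses only the column names shown above. The helper name is hypothetical and the rows are a short illustrative excerpt.

```python
import csv
import io

# Illustrative rows in the partitioned CSV layout.
csv_text = """SequenceID,PartitionID,ClusterID
A0010,0,4
A0118,0,4
A0119,0,4
"""


def clusters_spanning_partitions(rows):
    """Return the cluster IDs that occur in more than one partition."""
    partitions_of = {}
    for row in rows:
        partitions_of.setdefault(row["ClusterID"], set()).add(row["PartitionID"])
    return sorted(c for c, parts in partitions_of.items() if len(parts) > 1)


rows = list(csv.DictReader(io.StringIO(csv_text)))
leaky = clusters_spanning_partitions(rows)  # empty when partitions are independent
```

An empty result means every cluster lives in exactly one partition, which is what the partitioning is meant to guarantee.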
4. FASTA
>A0002 Cluster_0
MAQLTLLLLSLFLTLISLPPPGASISSCNGPCRDLNDCDGQLICIKGKCNDDPEVGTHICGGTTPSPQPGSCNPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
>A0003 Cluster_0
MAQLTLLLLSLFFTLISLPPPGASISSCNGPCRDLNDCNGQLICIKGKCNDDPEVGTHICGGTTPSPQPGSCKPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
>A0004 Cluster_0
MAQLTLLLLSLFLTLISLPPPGASISSCNGPCRDLNDCDGQLICIKGKCNDDPEVGTHICGGTTPSPQPGGCNPSGTLTCQGKSYPTYDCSPPVTSSTPAKLTNNDFSEGGDGGGPSECDESYHSNNERIVALSTGWYNGGSRCGKMIRITASNGKSVSAKVVDECDSRHGCDKEHAGQPPCRNNIVDGSNAVWSALGLDKNVGVVDITWSMA
...
FASTA output format with partitioning.
>A0010 Cluster_4 Partition_0
MARPSFLSLVSLSLLVLSHSSAANRQPSKYQQQQKGECQIQRLNAQEPQQRIQAEAGVTEFWDWTDDQFQCAGVAACRNMIQPRGLLLPSYTNAPTLIYILKGRGITGVMIPGCPETYQSSQQSREGDVSHRQFRDQHQKIRRFQQGDVIALPAGVAHWCYNDGDSDLVTVSVEDTGNRQNQLDNNPRRFFLAGNPQQQQKEMYAKRPQQQHSGNVFRGFDTEVLAETFGVDMEMARRLQGKDDYRGHIIQVERELKIVRPPRTREEQEQQERGERDNGMEETICTARLVENIDNPSRADIFNPRAGRLTSVNSFNLPILNYLRLSAEKGVLYRNALMPPHWKLNAHCVLYATRGEAQMQIVDQRGEAVFNDRIREGQLVVVPQNFVVMKQAGNQGFEWVAIKTNENAMFNTLAGRTSALRAMPVDVLANAYQISQSEARRLKMGREEAVLFEPRSEGRDVD
>A0118 Cluster_4 Partition_0
PPTKFSFSLFLVSVLVLCLGFALAKIDPELKQCKHQCKVQRQYDEQQKEQCVKECEKYYKEKKGREREHEEEEEEWGTGGVDEPSTHEPAEKHLSQCMRQCERQEGGQQKQLCRFRCQERYKKERGQHNYKREDDEDEDEDEAEEEDENPYVFEDEDFTTKVKTEQGKVVLLPKFTQKSKLLHALEKYRLAVLVANPQAFVVPSHMDADSIFFVSWGRGTITKILENKRESINVRQGDIVSISSGTPFYIANNDENEKLYLVQFLRPVNLPGHFEVFHGPGGENPESFYRAFSWEILEAALKTSKDTLEKLFEKQDQGTIMKASKEQVRAMSRRGEGPKIWPFTEESTGSFKLFKKDPSQSNKYGQLFEAERIDYPPLEKLDMVVSYANITKGGMSVPFYNSRATKIAIVVSGEGCVEIACPHLSSSKSSHPSYKKLRARIRKDTVFIVPAGHPFATVASGNENLEIVCFEVNAEGNIRYTLAGKKNIIKVMEKEAKELAFKMEGEEVDKVFGKQDEEFFFQGPEWRKEKEGRADE
>A0119 Cluster_4 Partition_0
MGPPTKFSFSLFLVSVLVLCLGFALAKIDPELKQCKHQCKVQRQYDEQQKEQCVKECEKYYKEKKGREREHEEEEEEWGTGGVDEPSTHEPAEKHLSQCMRQCERQEGGQQKQLCRFRCQERYKKERGQHNYKREDDEDEDEDEAEEEDENPYVFEDEDFTTKVKTEQGKVVLLPKFTQKSKLLHALEKYRLAVLVANPQAFVVPSHMDADSIFFVSWGRGTITKILENKRESINVRQGDIVSISSGTPFYIANNDENEKLYLVQFLRPVNLPGHFEVFHGPGGENPESFYRAFSWEILEAALKTSKDTLEKLFEKQDQGTIMKASKEQIRAMSRRGEGPKIWPFTEESTGSFKLFKKDPSQSNKYGQLFEAERIDYPPLEKLDMVVSYANITKGGMSVPFYNSRATKIAIVVSGEGCVEIACPHLSSSKSSHPSYKKLRARIRKDTVFIVPAGHPFATVASGNENLEIVCFEVNAEGNIRYTLAGKKNIIKVMEKEAKELAFKMEGEEVDKVFGKQDEEFFFQGPEWRKEKEGRADE
...
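The cluster and partition labels in the FASTA headers can be recovered with a short reader like the sketch below. It assumes headers of the form ">ID Cluster_N" or ">ID Cluster_N Partition_M" as shown above. The function name is hypothetical and the sequences in the example are shortened placeholders, not real entries.

```python
def read_protparts_fasta(text):
    """Parse FASTA text into a list of records with keys
    'id', 'cluster', 'partition' (None when absent) and 'sequence'."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq_id, *tags = line[1:].split()
            record = {"id": seq_id, "cluster": None, "partition": None, "sequence": ""}
            for tag in tags:
                name, _, number = tag.partition("_")
                if name == "Cluster":
                    record["cluster"] = int(number)
                elif name == "Partition":
                    record["partition"] = int(number)
            records.append(record)
        elif records:
            # Sequence lines may be wrapped; join them to the current record.
            records[-1]["sequence"] += line
    return records


# Placeholder sequences, shortened for illustration.
fasta_text = """>A0010 Cluster_4 Partition_0
MARPSF
LSLV
>A0002 Cluster_0
MAQLT
"""
records = read_protparts_fasta(fasta_text)
```

Records from output without partitioning keep `partition` as `None`, so the same reader covers both FASTA variants.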
If you need help regarding technical issues (e.g. errors or missing results) contact Technical Support. Please include the name of the service and version (e.g. NetPhos-4.0) and the options you have selected. If the error occurs after the job has started running, please include the JOB ID (the long code that you see while the job is running).
If you have scientific questions (e.g. how the method works or how to interpret results), contact Correspondence.
Correspondence:
Technical Support: