Scientific Background

For a brief description of the NetAcet method please consult the article abstract.

Data sets and statistics

An essential step in any machine learning method is obtaining a clean and accurate data set for training and testing. Bias and noise in the data often lead to wrong predictions.

Description of data sets
Dataset extraction
Homology reduction
Sequence logos
Download the training sets

Data set

This section describes the extraction and homology reduction of the data sets used for training NetAcet 1.0.


The data used for NetAcet were extracted from Table 2 in Polevoda et al. and from the Yeast Protein Map. All inconsistencies between the two data sets were removed, resulting in a positive set of 61 sequences and a negative set of 76 sequences.
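The consistency filter described above can be illustrated with a minimal sketch. It is not the procedure actually used for NetAcet: the function name, the protein identifiers and the dictionary layout are hypothetical, and it assumes that a protein annotated in only one source is kept while a protein with conflicting annotations is dropped.

```python
def remove_inconsistent(source_a, source_b):
    """Merge two annotation sources and drop proteins with conflicting labels.

    Each source maps a protein identifier to True (N-terminally acetylated)
    or False (not acetylated). Both the layout and the rule for proteins seen
    in only one source are assumptions made for this illustration.
    """
    merged = {}
    for protein in source_a.keys() | source_b.keys():
        # Collect every label reported for this protein.
        labels = {src[protein] for src in (source_a, source_b) if protein in src}
        if len(labels) == 1:  # the sources agree, or only one source has it
            merged[protein] = labels.pop()
    positives = sorted(p for p, label in merged.items() if label)
    negatives = sorted(p for p, label in merged.items() if not label)
    return positives, negatives


# Toy example with made-up identifiers; P3 is dropped because the sources disagree.
table_2 = {"P1": True, "P2": False, "P3": True}
protein_map = {"P2": False, "P3": False, "P4": True}
positives, negatives = remove_inconsistent(table_2, protein_map)
```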

Homology reduction

Sequences were truncated to their 40 N-terminal residues and then homology reduced by visual inspection of a neighbour-joining tree generated from a ClustalW multiple alignment. Four sequences were removed from the positive data set because of close homology to other sequences. After this reduction, the two closest homologs were 52% identical, although the average sequence identity is much lower.
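The truncation step and the percent identity figure can be made concrete with a short sketch. The homology reduction itself was done by visual inspection of the tree, so the code below only illustrates the two quantities mentioned above. The function names are hypothetical, the two sequences are assumed to be already aligned (for example taken from the ClustalW output), and the identity is computed over all columns in which at least one sequence has a residue, which is one of several common conventions.

```python
def truncate_n_terminal(sequence, length=40):
    """Return the first `length` residues (the N-terminal end) of a sequence."""
    return sequence[:length]


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two rows of an alignment.

    Columns where both rows contain a gap are ignored; a residue facing
    a gap counts as a mismatch. This convention is an assumption.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy example with made-up sequences.
short = truncate_n_terminal("M" + "S" * 59)      # 60 residues cut to 40
identity = percent_identity("MKT-A", "MKSLA")    # 3 matches in 5 columns
```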

An unrooted phylogenetic tree of the positive data set before homology reduction is shown below.

Positive training set. Save image to disk to see proper picture.

An unrooted phylogenetic tree of the negative data set before homology reduction is shown below.

Negative training set. Save image to disk to see proper picture.

Sequence logos

To visualise the sequence information content for N-terminal acetylation, we have generated sequence logos for the yeast training set. The total height of the stack of letters at each position shows the amount of sequence conservation at the position, while the relative height of each letter shows the relative abundance of the corresponding amino acid.

Shannon logo

Kullback logo

Blue: Positively charged residues
Red: Negatively charged residues
Green: Neutral polar residues
Black: Hydrophobic residues
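The two logo types differ in how the stack height at each position is computed. The sketch below shows the standard definitions: the Shannon information content, log2(20) minus the entropy of the observed amino acid distribution, and the Kullback-Leibler distance to a background distribution. The function names and the example column are hypothetical, no small-sample correction is applied, and the background frequencies actually used for the NetAcet logos are not stated here, so a uniform background is assumed purely for illustration.

```python
import math
from collections import Counter


def shannon_information(column, alphabet_size=20):
    """Information content in bits of one alignment column (Shannon logo)."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def kullback_leibler(column, background):
    """Kullback-Leibler distance in bits between the column and a background.

    `background` maps each amino acid to its expected frequency.
    """
    counts = Counter(column)
    total = len(column)
    return sum(
        (n / total) * math.log2((n / total) / background[aa])
        for aa, n in counts.items()
    )


def letter_heights(column, stack_height):
    """Height of each letter: its relative frequency times the stack height."""
    counts = Counter(column)
    total = len(column)
    return {aa: stack_height * n / total for aa, n in counts.items()}


# Toy column: residues observed at one position in four sequences.
column = "SSSA"
uniform = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}  # assumed background
info = shannon_information(column)
kl = kullback_leibler(column, uniform)
heights = letter_heights(column, info)
```

With a uniform background the two measures coincide; they differ as soon as the background reflects the real amino acid composition, in which case the Kullback-Leibler logo emphasises residues that are enriched relative to that composition.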

Download the training sets

The data sets used for training NetAcet can be downloaded here:

Positive training set
Negative training set