DTU Health Tech

Department of Health Technology

GlycateBase ver. 1.0

Release notes

Generation of data set

This webpage describes the generation of the data set used to develop the artificial neural network based glycation predictor NetGlycate-1.0. NetGlycate-1.0 predicts glycation of ε amino groups of lysines in mammalian proteins, and the data set therefore contains only such data. Glycation data were obtained from the literature. The resulting data set consists of 20 proteins with 89 glycated lysines and 126 non-glycated lysines and can be downloaded below. Only experimentally verified glycation sites were used, and all sequences were extracted from the UniProt database (Bairoch et al., 2005). Lysines in pro- and signal peptides were masked out, since these parts of the proteins are cleaved off during maturation and are thus not available for glycation. The references from which the glycation data were taken are shown in Table 1. To avoid confusing the prediction algorithm, unvalidated glycation sites were masked out. Some of the studies listed in Table 1, however, claim to have validated sites that were masked out as unvalidated in this data set. The reasons for masking out these sites are described below.

UniProt ID    UniProt AC    Citation
GPX1_BOVIN    P00435        Baldwin et al., 1995
RNP_BOVIN     P00656        Watkins et al., 1985
                            Cotham et al., 2003
CFAB_HUMAN    P00751        Niemann et al., 1991
B2MG_HUMAN    P01884        Miyata et al., 1994
HBA_HUMAN     P01922        Shapiro et al., 1980
                            Zhang et al., 2001
HBB_HUMAN     P02023        Shapiro et al., 1980
                            Zhang et al., 2001
CRAA_BOVIN    P02470        Abraham et al., 1994
CRAB_BOVIN    P02510        Abraham et al., 1994
CRB2_BOVIN    P02522        Zhao et al., 1996
CRGB_BOVIN    P02526        Smith et al., 1996
APA1_HUMAN    P02647        Calvo et al., 1993
APE_HUMAN     P02649        Shuvaev et al., 1999
ALBU_HUMAN    P02768        Garlick & Mazer, 1983
                            Shaklai et al., 1984
                            Iberg & Flückiger, 1986
                            Lapolla et al., 2004
MIP_BOVIN     P06624        Swamy-Mruthinti & Schey, 1997
PMGE_HUMAN    P07738        Fujita et al., 1998
SODE_HUMAN    P08294        Adachi et al., 1992
TAU_HUMAN     P10636        Nacharaju et al., 1997
ALAT_PIG      P13191        Beranek et al., 2001
CD59_HUMAN    P13987        Acosta et al., 2000
AKA1_RAT      P51635        Takahashi et al., 1995

Table 1. References from which the glycation data were taken.

For RNP_BOVIN, Cotham et al., 2003, state that all ten lysines are glycated but find the same four major sites as Watkins et al., 1985. It was therefore decided to use the four glycation sites from Watkins et al., 1985, and to mask out the remaining six lysines. Of these six lysines, the predictor predicts only K-92 to be glycated, thus supporting the notion that the remaining six lysines are mainly minor sites. In fact, Ames, 2005, finds K-92 to be glycated, thus confirming the prediction made by our predictor.

There has been some controversy about the glycation sites of CRGB_BOVIN, in particular K-163. According to the most recent article, Smith et al., 1996, only the N-terminus and K-2 are glycated, and K-163 is not. It was therefore decided to mask out K-163. The predictor predicts K-163 to be non-glycated, thus agreeing with Smith et al., 1996.

The protein APE_HUMAN contains suspiciously few glycation sites compared to other proteins of similar length. Furthermore, since K-93 accounts for only 20% of the detected Amadori products (Shuvaev et al., 1999), the other lysines in the protein are masked out as unvalidated. The same problem arises for the protein CFAB_HUMAN, and it was therefore decided to mask out the non-glycated lysines in this protein as well. For both APE_HUMAN and CFAB_HUMAN, the predictor predicts several glycation sites among the masked-out lysines, thus suggesting that there is more than one glycation site in each of these two proteins.

The neural networks were trained using three-fold cross-validation. The division into cross-validation groups was made at the site level, meaning that each sequence in the cross-validation groups contained only one site. The other sites were masked out, and the sequence was repeated once for each site.
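The site-level expansion described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes each protein carries one annotation symbol per residue (G for a glycated lysine, K for a non-glycated lysine), and the function name `expand_sites` and the mask symbol `.` are hypothetical choices made for this example.

```python
def expand_sites(annotation, mask="."):
    """Return one annotation string per trainable site.

    In each returned string exactly one G/K site is kept; every other
    G/K site is replaced by `mask` (a hypothetical mask symbol).
    All other symbols are left unchanged.
    """
    records = []
    for i, label in enumerate(annotation):
        if label not in "GK":
            continue  # only glycated (G) and non-glycated (K) lysines are sites
        masked = "".join(
            c if (j == i or c not in "GK") else mask
            for j, c in enumerate(annotation)
        )
        records.append(masked)
    return records


# A toy annotation with two trainable sites yields two records.
for record in expand_sites("-G-K-U"):
    print(record)
```

Each protein therefore contributes as many records as it has trainable sites, which is what allows sites from the same protein to end up in different cross-validation groups.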

The positive and negative sites were extracted as a window of 21 amino acid residues and a phylogenetic tree was constructed. The tree was then inspected visually and the related sites were placed in the same cross-validation group. This was done to prevent the situation where the network had already learned the sites in the test set from the learning set. This situation would occur if related sites were placed in both the test set and the learning set, and it could lead to an overestimation of the performance of the network if the related sites belonged to the same category (glycated or non-glycated lysine). If the related sites belonged to different categories, it could make it difficult for the network to learn to classify the sites correctly.
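As an illustration of the window extraction, the sketch below returns the 21-residue window centred on a given lysine, that is, ten residues on each side. How windows near the protein termini were handled is not stated here, so padding with the symbol `X` is an assumption, as are the function name `extract_window` and its parameters.

```python
def extract_window(sequence, position, flank=10, pad="X"):
    """Return the window of 2*flank + 1 residues centred on `position`.

    `position` is a 0-based index into `sequence`. Positions that fall
    outside the sequence are filled with `pad` (an assumed convention).
    """
    if not 0 <= position < len(sequence):
        raise IndexError("position is outside the sequence")
    padded = pad * flank + sequence + pad * flank
    # After padding, the residue at `position` sits at index position + flank,
    # so the window starts at index `position` in the padded string.
    return padded[position : position + 2 * flank + 1]


# Toy sequence with a lysine at 0-based index 8.
window = extract_window("ACDEFGHIKLMNPQRSTVWY", 8)
print(window, len(window))
```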

The remaining positive and negative sites were then added randomly to the three cross-validation groups in such a way that the groups contained almost the same number of positive and negative sites (see Table 2) and that all sites within each group were placed in random order.

Group Positive Negative
1 29 43
2 30 45
3 30 38

Table 2. Number of positive and negative sites in each cross-validation group.
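One way to obtain groups like those in Table 2 is sketched below. It is a hedged illustration, not the procedure actually used: related sites are assumed to have been assigned to a group by hand already (the `preassigned` mapping), and each remaining site is added, in random order, to the group that currently holds the fewest sites of its class. The function name `assign_groups`, its parameters, and the greedy balancing rule are all assumptions.

```python
import random


def assign_groups(sites, n_groups=3, preassigned=None, seed=0):
    """Distribute sites over cross-validation groups.

    sites: list of (site_id, label) tuples, label "G" or "K".
    preassigned: optional dict mapping site_id to a group index for
        related sites that must stay together (chosen by hand).
    """
    preassigned = preassigned or {}
    rng = random.Random(seed)
    groups = [[] for _ in range(n_groups)]
    counts = [{"G": 0, "K": 0} for _ in range(n_groups)]

    remaining = []
    for site_id, label in sites:
        if site_id in preassigned:
            g = preassigned[site_id]
            groups[g].append((site_id, label))
            counts[g][label] += 1
        else:
            remaining.append((site_id, label))

    # Add the remaining sites in random order, always to the group that
    # currently has the fewest sites of the same class.
    rng.shuffle(remaining)
    for site_id, label in remaining:
        g = min(range(n_groups), key=lambda k: counts[k][label])
        groups[g].append((site_id, label))
        counts[g][label] += 1

    # Randomize the order of the sites within each group.
    for group in groups:
        rng.shuffle(group)
    return groups
```

With no preassigned sites, 89 positive and 126 negative sites would be split as evenly as possible. The less even split in Table 2 is consistent with related sites having been placed together before the random step.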

Both in vitro and in vivo data were used to make the data set as large as possible. It was, however, decided to include only in vitro data that were obtained under conditions that resemble physiological conditions. Note that the glycated proteins used in this study are of mammalian origin.

Data set

G: positive site
K: negative site
S: signal peptide (not used for training)
P: propeptide (not used for training)
U: unvalidated site (not used for training)
-: non-lysine residue (not used for training)
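The legend above can be applied programmatically. The sketch below counts the symbols in a per-residue annotation string and reports how many positions are trainable sites. It assumes the annotation is available as a plain string using exactly the six symbols of the legend; the actual file layout is not described here, and the names `LEGEND` and `summarize` are hypothetical.

```python
from collections import Counter

# Symbols as given in the legend above.
LEGEND = {
    "G": "positive site",
    "K": "negative site",
    "S": "signal peptide",
    "P": "propeptide",
    "U": "unvalidated site",
    "-": "non-lysine residue",
}
TRAINABLE = {"G", "K"}  # all other symbols are not used for training


def summarize(annotation):
    """Count positive, negative and unused positions in an annotation string."""
    unknown = set(annotation) - set(LEGEND)
    if unknown:
        raise ValueError("unexpected symbols: %s" % sorted(unknown))
    counts = Counter(annotation)
    return {
        "positive": counts["G"],
        "negative": counts["K"],
        "not_used": sum(n for s, n in counts.items() if s not in TRAINABLE),
    }


# Toy annotation: two signal-peptide residues, one positive site, one
# negative site, one unvalidated site and one propeptide residue.
print(summarize("SS-G-K-U-P"))
```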

For the complete dataset click here.


Abraham et al., 1994
Abraham,E.C., Cherian,M. and Smith,J.B. (1994). Site selectivity in the glycation of alpha A- and alpha B-crystallins by glucose. Biochem Biophys Res Commun, 201, 1451-1456.

Acosta et al., 2000
Acosta,J., Hettinga,J., Flückiger,R., Krumrei,N., Goldfine,A., Angarita,L. and Halperin,J. (2000). Molecular basis for a link between complement and the vascular complications of diabetes. Proc Natl Acad Sci U S A, 97, 5450-5455.

Adachi et al., 1992
Adachi,T., Ohta,H., Hayashi,K., Hirano,K. and Marklund,S.L. (1992). The site of nonenzymic glycation of human extracellular-superoxide dismutase in vitro. Free Radic Biol Med, 13, 205-210.

Ames, 2005
Ames,J.M. (2005). Application of semiquantitative proteomics techniques to the Maillard reaction. Ann N Y Acad Sci, 1043, 225-235.

Bairoch et al., 2005
Bairoch,A., Apweiler,R., Wu,C.H., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M., Martin,M.J., Natale,D.A., O'Donovan,C., Redaschi,N. and Yeh,L.S.L. (2005). The Universal Protein Resource (UniProt). Nucleic Acids Res, 33, 154-159.

Baldwin et al., 1995
Baldwin,J.S., Lee,L., Leung,T.K., Muruganandam,A. and Mutus,B. (1995). Identification of the site of non-enzymatic glycation of glutathione peroxidase: rationalization of the glycation-related catalytic alterations on the basis of three-dimensional protein structure. Biochim Biophys Acta, 1247, 60-64.

Beranek et al., 2001
Beranek,M., Drsata,J. and Palicka,V. (2001). Inhibitory effect of glycation on catalytic activity of alanine aminotransferase. Mol Cell Biochem, 218, 35-39.

Calvo et al., 1993
Calvo,C., Ulloa,N., Campos,M., Verdugo,C. and Ayrault-Jarrier,M. (1993). The preferential site of non-enzymatic glycation of human apolipoprotein A-I in vivo. Clin Chim Acta, 217, 193-198.

Cotham et al., 2003
Cotham,W.E., Hinton,D.J.S., Metz,T.O., Brock,J.W.C., Thorpe,S.R., Baynes,J.W. and Ames,J.M. (2003). Mass spectrometric analysis of glucose-modified ribonuclease. Biochem Soc Trans, 31, 1426-1427.

Fujita et al., 1998
Fujita,T., Suzuki,K., Tada,T., Yoshihara,Y., Hamaoka,R., Uchida,K., Matuo,Y., Sasaki,T., Hanafusa,T. and Taniguchi,N. (1998). Human erythrocyte bisphosphoglycerate mutase: inactivation by glycation in vivo and in vitro. J Biochem (Tokyo), 124, 1237-1244.

Garlick & Mazer, 1983
Garlick,R.L. and Mazer,J.S. (1983). The principal site of nonenzymatic glycosylation of human serum albumin in vivo. J Biol Chem, 258, 6142-6146.

Iberg & Flückiger, 1986
Iberg,N. and Flückiger,R. (1986). Nonenzymatic glycosylation of albumin in vivo. Identification of multiple glycosylated sites. J Biol Chem, 261, 13542-13545.

Lapolla et al., 2004
Lapolla,A., Fedele,D., Reitano,R., Arico,N.C., Seraglia,R., Traldi,P., Marotta,E. and Tonani,R. (2004). Enzymatic digestion and mass spectrometry in the study of advanced glycation end products/peptides. J Am Soc Mass Spectrom, 15, 496-509.

Miyata et al., 1994
Miyata,T., Inagi,R., Wada,Y., Ueda,Y., Iida,Y., Takahashi,M., Taniguchi,N. and Maeda,K. (1994). Glycation of human beta 2-microglobulin in patients with hemodialysis-associated amyloidosis: identification of the glycated sites. Biochemistry, 33, 12215-12221.

Nacharaju et al., 1997
Nacharaju,P., Ko,L. and Yen,S.H. (1997). Characterization of in vitro glycation sites of tau. J Neurochem, 69, 1709-1719.

Niemann et al., 1991
Niemann,M.A., Bhown,A.S. and Miller,E.J. (1991). The principal site of glycation of human complement factor B. Biochem J, 274 (Pt 2), 473-480.

Shaklai et al., 1984
Shaklai,N., Garlick,R.L. and Bunn,H.F. (1984). Nonenzymatic glycosylation of human serum albumin alters its conformation and function. J Biol Chem, 259, 3812-3817.

Shapiro et al., 1980
Shapiro,R., McManus,M.J., Zalut,C. and Bunn,H.F. (1980). Sites of nonenzymatic glycosylation of human hemoglobin A. J Biol Chem, 255, 3120-3127.

Shuvaev et al., 1999
Shuvaev,V.V., Fujii,J., Kawasaki,Y., Itoh,H., Hamaoka,R., Barbier,A., Ziegler,O., Siest,G. and Taniguchi,N. (1999). Glycation of apolipoprotein E impairs its binding to heparin: identification of the major glycation site. Biochim Biophys Acta, 1454, 296-308.

Smith et al., 1996
Smith,J.B., Hanson,S.R., Cerny,R.L., Zhao,H.R. and Abraham,E.C. (1996). Identification of the glycation site of lens gamma B-crystallin by fast atom bombardment tandem mass spectrometry. Anal Biochem, 243, 186-189.

Swamy-Mruthinti & Schey, 1997
Swamy-Mruthinti,S. and Schey,K.L. (1997). Mass spectroscopic identification of in vitro glycated sites of MIP. Curr Eye Res, 16, 936-941.

Takahashi et al., 1995
Takahashi,M., Lu,Y.B., Myint,T., Fujii,J., Wada,Y. and Taniguchi,N. (1995). In vivo glycation of aldehyde reductase, a major 3-deoxyglucosone reducing enzyme: identification of glycation sites. Biochemistry, 34, 1433-1438.

Watkins et al., 1985
Watkins,N.G., Thorpe,S.R. and Baynes,J.W. (1985). Glycation of amino groups in protein. Studies on the specificity of modification of RNase by glucose. J Biol Chem, 260, 10629-10636.

Zhang et al., 2001
Zhang,X., Medzihradszky,K.F., Cunningham,J., Lee,P.D., Rognerud,C.L., Ou,C.N., Harmatz,P. and Witkowska,H.E. (2001). Characterization of glycated hemoglobin in diabetic patients: usefulness of electrospray mass spectrometry in monitoring the extent and distribution of glycation. J Chromatogr B Biomed Sci Appl, 759, 1-15.

Zhao et al., 1996
Zhao,H.R., Smith,J.B., Jiang,X.Y. and Abraham,E.C. (1996). Sites of glycation of beta B2-crystallin by glucose and fructose. Biochem Biophys Res Commun, 229, 128-133.


New data, comments and suggestions may be sent to Morten Bo Johansen
E-mail address: mbj@cbs.dtu.dk


If the use of this database contributes significantly to your results, please cite:

Analysis and prediction of mammalian protein glycation.
Morten Bo Johansen, Lars Kiemer and Søren Brunak
Glycobiology, 16:844-853, 2006

PMID: 16762979, doi: 10.1093/glycob/cwl009