CAZyP - 0.1

Prediction of Glycoside Hydrolase Activity and Substrate Specificity

The CAZyP 0.1 server predicts the Enzyme Activity class and Substrate Specificity of Glycoside Hydrolases. In version 0.1, the predicted Enzyme Activity class follows the Enzyme Commission number (EC number) convention. EC numbers provide a numerical classification scheme for enzymes, based on the chemical reactions they catalyze. (citation: wiki) CAZyP-0.1 predictions are based on a 'nearest-qualified, biochemically-characterized neighbor' principle: each query is compared, using a sequence comparison algorithm such as DIAMOND or BLAST, against a high-quality, curated library of positive and negative enzyme activities from the literature. Data on characterized enzymes in a number of enzyme classes have been compiled and curated by the CAZy group, AFMB, Marseille, and by members of the DTU Bioengineering Enzyme Discovery group. The prediction machinery has been developed in a collaboration between members of the DTU Bioengineering Enzyme Discovery group, DTU Healthtech and DTU Compute, and is supported by funding from the Novo Nordisk Foundation.

Manuscript in Preparation:

EC numbers: caveat emptor: CAZyP-0.1 performs a statistical analysis of 'best BLAST hits' derived from a database containing CAZy GH catalytic domain subsequences that are associated with one or more experimentally determined EC numbers. Please note that EC numbers vary widely in how precisely they describe biochemical specificity, and a single EC number may represent a highly diverse and, in some cases, ambiguous (to the reader) biochemical reaction. Some standardized substrates for which activity is historically seen as diagnostic for a specific EC number may in fact be considered a 'weak' positive; this observation is reflected in the confidence score. In some cases both positive and negative biochemical activities are recorded in the primary literature and found by our curators.
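The nearest characterized neighbor idea can be illustrated with a minimal sketch. It is not the CAZyP-0.1 implementation and does not reproduce its statistical analysis or confidence score. It assumes hits are available in the standard 12-column tabular format written by BLAST and DIAMOND (output format 6), and that the curated library can be represented as a mapping from subject identifier to EC numbers; the function name and the mapping are hypothetical.

```python
import csv
import io
from typing import Dict, Set, Tuple


def best_characterized_hits(
    tabular: str, ec_by_subject: Dict[str, Set[str]]
) -> Dict[str, Tuple[str, float, Set[str]]]:
    """For each query, keep the highest-scoring hit that has curated EC numbers.

    `tabular` is text in the 12-column BLAST/DIAMOND tabular format
    (qseqid, sseqid, ..., evalue, bitscore). `ec_by_subject` is a
    hypothetical stand-in for a curated activity library.
    """
    best: Dict[str, Tuple[str, float, Set[str]]] = {}
    for row in csv.reader(io.StringIO(tabular), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if subject not in ec_by_subject:
            continue  # neighbor has no characterized activity: not "qualified"
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore, ec_by_subject[subject])
    return best
```

Note that a higher-scoring but uncharacterized hit is skipped in favor of the best hit with recorded activity, which is the sense of "qualified" assumed here.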

Prediction output tabulates the following:

  • Protein identifier (input from user)
  • Predicted Reaction mechanism (retaining/inverting, etc)
  • Predicted EC number(s)
  • Prediction confidence
  • Suggested positive substrates = predicted active glycoside hydrolysis
  • Suggested negative substrates = predicted not to be active

Submit data

Sequence submission: paste the sequence(s) and/or upload a local file

Protein sequences must contain at least 10 amino acids. The maximum number of proteins is 1000.

For example proteins Click here
Format directly from your local disk:

Search database:
Family GH5
Clan GH-A
Entire GH
Output format:
Long output
Short output


1. Specify the input sequences

All input sequences must be in standard FASTA format. The sequences are comprised of one-letter amino acid codes. The allowed alphabet (identical to UniProt, not case sensitive) is as follows:

A C D E F G H I K L M N P Q R S T V W Y and X U B Z O (unknown/ambiguous/non-standard)

All the alphabetic symbols not in the allowed alphabet will be converted to X before processing. All the non-alphabetic symbols, including white space and digits, will be ignored.
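The normalization rules above can be sketched as follows. This is an illustration of the stated rules, not the server's own code; the function name `sanitize_sequence` is hypothetical.

```python
# Allowed one-letter codes as listed above (case is ignored on input).
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYXUBZO")


def sanitize_sequence(raw: str) -> str:
    """Apply the stated rules to the residue lines of one sequence."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            continue  # white space, digits and other symbols are ignored
        upper = ch.upper()
        # Alphabetic symbols outside the allowed alphabet become X.
        cleaned.append(upper if upper in ALLOWED else "X")
    return "".join(cleaned)
```

For example, `sanitize_sequence("mk t1J*a")` drops the space, the digit and the asterisk, converts `J` to `X`, and returns `"MKTXA"`.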

The sequences can be input in the following two ways:

  • Paste a single sequence (just the amino acids) or a number of sequences in FASTA format into the upper window of the main server page.

  • Select a FASTA file on your local disk, either by typing the file name into the lower window or by browsing the disk.

Both ways can be used at the same time: all the specified sequences will be processed. However, a single submission may contain no more than 5,000 sequences, and no sequence may be longer than 10,000 amino acids.
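Before submitting a large file, it can help to check it locally against these limits. The sketch below is a hedged illustration: the parser is a minimal one written for this example, the function names are hypothetical, and the default limits are simply the figures quoted in the paragraph above, exposed as parameters so they can be changed.

```python
from typing import Dict, List

MAX_SEQUENCES = 5000  # per-submission limit quoted above
MAX_LENGTH = 10000    # per-sequence limit quoted above


def parse_fasta(text: str) -> Dict[str, str]:
    """Minimal FASTA reader; a bare pasted sequence is stored as 'query'.

    Assumption: identifiers are unique (a repeated header overwrites
    the earlier record in this sketch).
    """
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else f"seq{len(records) + 1}"
            records[name] = ""
        else:
            if name is None:  # sequence pasted without a header line
                name = "query"
                records[name] = ""
            records[name] += line
    return records


def check_submission(
    records: Dict[str, str],
    max_sequences: int = MAX_SEQUENCES,
    max_length: int = MAX_LENGTH,
) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if len(records) > max_sequences:
        problems.append(f"{len(records)} sequences exceed the limit of {max_sequences}")
    for name, seq in records.items():
        if len(seq) > max_length:
            problems.append(f"{name}: {len(seq)} residues exceed the limit of {max_length}")
    return problems
```

Calling `check_submission(parse_fasta(open("proteins.fasta").read()))` on a local file (the file name is only an example) returns a list of any problems found.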

2. Customize your run

  • Search database:
    You should specify the search database: Family GH5, Clan GH-A or Entire GH.
  • Output format:
    You can choose between two output formats: Long, or Short.

3. Submit the job

Click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.

At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

Example Outputs


Example: secretory protein - standard output format

Example: secretory protein - short output format

From the Downloads tab, the user can obtain the results of the run as a simple table or as a JSON format file.
Results include:
  • 1 line per input sequence;
  • predictions taking the form of tuples, e.g., 'EC': (ECnumber, confidence, model), and '007': ([-2glycan or None, -2-1bond or None, -1glycan or Error, -1+1bond or Error, +1glycan or Error, +1+2bond or None, +2glycan or None]);
  • prediction output that includes the input accession and predictions named (perhaps) family, mechanism, ecnumber (ecnumbers), doubleoseven.
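Since the output format is still tentative, the following is only a sketch of how one JSON result line might be read once the format settles. The key names `accession` and `007`, the function names, and the layout of one JSON object per line are assumptions made for illustration; only the order of the seven subsite positions is taken from the description above.

```python
import json
from typing import Any, Dict, Sequence

# Order of the seven positions in the '007' entry, as described above.
SUBSITE_FIELDS = (
    "-2 glycan", "-2/-1 bond", "-1 glycan", "-1/+1 bond",
    "+1 glycan", "+1/+2 bond", "+2 glycan",
)


def label_subsites(values: Sequence[Any]) -> Dict[str, Any]:
    """Attach position names to a seven-element '007' entry."""
    if len(values) != len(SUBSITE_FIELDS):
        raise ValueError(f"expected {len(SUBSITE_FIELDS)} positions, got {len(values)}")
    return dict(zip(SUBSITE_FIELDS, values))


def parse_result_line(line: str) -> Dict[str, Any]:
    """Read one result line; the key names used here are hypothetical."""
    record = json.loads(line)
    return {
        "accession": record["accession"],
        "subsites": label_subsites(record["007"]),
    }
```

Naming the positions up front avoids relying on list indices when the outer subsites are empty (None, which is `null` in JSON).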

Training and testing data sets

The datasets for training and testing CAZyP-0.1 are not yet available. When they are released, they might be provided as a TSV file.

Family GH5 dataset: download

Clan GH-A dataset: download

Entire GH dataset: download

Article abstracts

Main references:

Current version (CAZyP v. 0.1)

CAZyP 0.1 predicts EC class for only family GH5 and clan GH-A proteins as defined in CAZy.
(list of authors and reference information)

Frequently Asked Questions

Positive activity
Negative activity
Confidence interval

True positive Enzyme Activity

True negative Enzyme Activity

How confidence is calculated

History of EC number prediction / other references

— Why EC numbers ?

So far no other biochemical activity classification scheme is publicly available (although rumors abound!).

Portable version

Would you prefer to run CAZyP-0.1 at your own site, or have in-house proprietary enzyme activity data you would like to use? CAZyP-0.1 may become available as a Python package. If you might be interested, please write to for more information.


Correspondence:        Technical Support: