The raw data is provided as compressed NumPy (Python) files, each containing an array of PDB/chain IDs (pdbids) and a 3-dimensional array of input and output features.
The first dimension indexes samples, the second indexes sequence positions, and the third indexes features (both inputs and outputs):
# [0:20] Amino Acids (sparse encoding)
# Unknown residues are stored as an all-zero vector
# [20:50] hmm profile
# [50] Seq mask (1 = seq, 0 = empty)
# [51] Disordered mask (0 = disordered, 1 = ordered)
# [52] Evaluation mask (For CB513 dataset, 1 = eval, 0 = ignore)
# [53] ASA (isolated)
# [54] ASA (complexed)
# [55] RSA (isolated)
# [56] RSA (complexed)
# [57:65] Q8 GHIBESTC (Q8 -> Q3: HHHEECCC)
# [65:67] Phi+Psi
# [67] ASA_max
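
The feature layout above can be sketched in Python as follows. This is a minimal illustration, not the authors' loading code: it uses a small synthetic array in place of a real file, and the variable names (data, seq_mask, q8_onehot and so on) are chosen here for illustration only. With the real files, the arrays would presumably be read with np.load; the name under which the feature array is stored is not given above, so it is left out.

```python
import numpy as np

# Synthetic stand-in for the feature array (assumption: 68 features per
# position, since the layout runs from index 0 to index 67).
n_samples, seq_len, n_features = 2, 5, 68
data = np.zeros((n_samples, seq_len, n_features), dtype=np.float32)

# Sample 0: four real residues followed by one padded position.
Q8_LABELS = "GHIBESTC"
data[0, :4, 50] = 1.0  # sequence mask: 1 = seq, 0 = empty
for pos, ss in enumerate("HEGC"):
    data[0, pos, 57 + Q8_LABELS.index(ss)] = 1.0  # one-hot Q8 label

# Slice the third dimension according to the documented layout.
amino_acids = data[:, :, 0:20]    # sparse amino acid encoding
hmm_profile = data[:, :, 20:50]   # HMM profile
seq_mask = data[:, :, 50].astype(bool)
disorder_mask = data[:, :, 51]    # 0 = disordered, 1 = ordered
eval_mask = data[:, :, 52]        # CB513 only
rsa_isolated = data[:, :, 55]
q8_onehot = data[:, :, 57:65]
phi_psi = data[:, :, 65:67]
asa_max = data[:, :, 67]

# Q8 -> Q3 mapping, read position by position from GHIBESTC -> HHHEECCC.
Q8_TO_Q3 = dict(zip(Q8_LABELS, "HHHEECCC"))

# Decode sample 0, keeping only positions where the sequence mask is set.
q8_idx = q8_onehot[0].argmax(axis=-1)
q8_str = "".join(Q8_LABELS[i] for i in q8_idx[seq_mask[0]])
q3_str = "".join(Q8_TO_Q3[c] for c in q8_str)
```

Applying the sequence mask before decoding matters because padded positions are all-zero, and argmax over an all-zero vector would otherwise report the first class (G).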