Model description


VDJsolver was developed using Yabasic (www.yabasic.de). The program uses the maximum likelihood method to obtain the best fit to the following model:

VH-PH-N1-PDdown-D-PDup-N2-PJ-JH,

where N1 and N2 designate N nucleotides and PH, PDdown, PDup and PJ designate P (palindromic) nucleotides, positioned upstream or downstream of the D gene as indicated. Any segment may be omitted except VH and JH. VH was compared with the IGHV3-23*01 germline gene (GenBank accession number M99660), while JH was compared with the germline JH gene that had the highest identity score from codon 114 through the splice site among all JH genes in the IMGT database. The D segments were compared with every germline D segment available in the IMGT database. P segments were defined as extensions of 2-8 nucleotides from the VH, Dx or JH genes that were reverse complementary to the corresponding germline sequence.

Maximum likelihood was determined by running through all possible combinations of segments for a given rearrangement and finding the combination that maximized the likelihood score. The score was defined as the product of the estimated probabilities of all events deviating from the germline sequences in question. Probabilities for transitions and transversions in the VH, Dx and JH segments were calculated from the number of substitutions found in the VH region from codon 1 through codon 100 (assuming a 5/4 ratio of transitions to transversions). For unmutated sequences, the estimated Taq error rate was used. A given N nucleotide was assigned a probability equal to its frequency in all N segments (determined by iterating the model over all sequences).

To reduce stochastic assignment of D segments, two filters were applied: D segments shorter than 4 nucleotides were not accepted, and neither were D segments with more mutations than the 95th percentile of the number expected from the assumed mutation rate and the length of the D segment (Poisson distribution). A dynamic probability for including a D segment was introduced, dependent on the length of the joint region (codon 101 through the downstream splice site) and the mutation rate of the VH region.
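The scoring and D-segment filtering rules described above can be illustrated with a minimal Python sketch. It is not the VDJsolver code (which was written in Yabasic), and several details are assumptions made for illustration: the function names are hypothetical, the per-base mutation rate is split 5:4 between the one possible transition and the two possible transversions, germline-matching positions contribute a factor of 1 to the score, and "more mutations than the 95th percentile" is read as exceeding the smallest count k with P(X <= k) >= 0.95. Logarithms are summed instead of multiplying probabilities, which gives the same ranking of candidate assignments.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def substitution_probability(germline_base, observed_base, mutation_rate):
    """Probability of one specific substitution (assumed 5:4 split)."""
    if (germline_base, observed_base) in TRANSITIONS:
        return mutation_rate * 5.0 / 9.0   # the single possible transition
    return mutation_rate * 2.0 / 9.0       # one of two possible transversions


def log_score(germline, observed, mutation_rate, n_segment="", n_frequencies=None):
    """Log of the product of probabilities of all deviations from germline.

    `mutation_rate` must be > 0 (the text uses the Taq error rate for
    unmutated sequences). `n_frequencies` maps each base to its frequency
    in N segments and is required when `n_segment` is non-empty.
    """
    if len(germline) != len(observed):
        raise ValueError("germline and observed must be aligned to equal length")
    total = 0.0
    for g, o in zip(germline, observed):
        if g != o:  # only deviations contribute (assumption)
            total += math.log(substitution_probability(g, o, mutation_rate))
    for base in n_segment:
        total += math.log(n_frequencies[base])
    return total


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))


def max_allowed_mutations(d_length, mutation_rate, percentile=0.95):
    """Smallest k with P(X <= k) >= percentile, lam = rate * length."""
    lam = mutation_rate * d_length
    k = 0
    while poisson_cdf(k, lam) < percentile:
        k += 1
    return k


def accept_d_segment(d_length, n_mutations, mutation_rate, min_length=4):
    """Apply the minimum-length and Poisson mutation-count filters."""
    if d_length < min_length:
        return False
    return n_mutations <= max_allowed_mutations(d_length, mutation_rate)
```

Under these assumptions, a 10-nucleotide D segment at a mutation rate of 0.05 per base has an expected 0.5 mutations, so `accept_d_segment(10, 2, 0.05)` accepts the segment while three mutations would reject it.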
The parameters were fine-tuned so that a D gene was found in 5% of the sequences in a set of artificial rearrangements, generated by randomly permuting the bases between the VH and JH segments of real rearrangements. D segments were generally at least eight nucleotides long.
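The calibration against shuffled rearrangements can be sketched as follows. This is only an illustration of the idea: the function names, the representation of a rearrangement as a `(sequence, vh_end, jh_start)` tuple, and the `finds_d` callable standing in for the D-assignment step are all hypothetical and not taken from VDJsolver.

```python
import random


def shuffle_joint(sequence, vh_end, jh_start, rng=None):
    """Return `sequence` with the bases between VH and JH randomly permuted.

    `vh_end` is the index just past the last VH base and `jh_start` the
    index of the first JH base (assumed coordinates).
    """
    rng = rng or random.Random()
    joint = list(sequence[vh_end:jh_start])
    rng.shuffle(joint)  # base composition is preserved, order is destroyed
    return sequence[:vh_end] + "".join(joint) + sequence[jh_start:]


def false_d_rate(rearrangements, finds_d, rng=None):
    """Fraction of shuffled rearrangements in which `finds_d` reports a D gene.

    `rearrangements` is a list of (sequence, vh_end, jh_start) tuples and
    `finds_d` is any callable taking a sequence and returning True or False.
    """
    if not rearrangements:
        raise ValueError("at least one rearrangement is required")
    hits = 0
    for sequence, vh_end, jh_start in rearrangements:
        if finds_d(shuffle_joint(sequence, vh_end, jh_start, rng)):
            hits += 1
    return hits / len(rearrangements)
```

With such a helper, the model parameters would be adjusted until `false_d_rate` on the shuffled set is close to 0.05, the target stated above.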