The stretch of covarying positions corresponds to the main stem of the RNA
secondary structure. Note that the trace is interrupted by a few positions
having complete sequence conservation. The respective parts of the sequence
logo profile have also been displayed along the edges of the plot.
Export user data directly to MatrixPlot
The user data can automatically be exported to MatrixPlot through the
web. This would typically be data generated from another web page which then
can be exported for graphical manipulation, such as zooming in on a region.
Only a single button is needed, as the form can be written using hidden
variables.
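The single-button export can be sketched as a small helper that writes such a form. This is only an illustration: the function name, the field names and the target URL are hypothetical placeholders, not the actual MatrixPlot interface.

```python
from html import escape


def build_export_form(action_url, fields, button_label="Export to MatrixPlot"):
    """Return an HTML form that carries its data in hidden inputs.

    action_url -- address of the receiving page (hypothetical placeholder here)
    fields     -- dict mapping hidden variable names to their string values
    """
    lines = ['<form method="post" action="%s">' % escape(action_url, quote=True)]
    for name, value in fields.items():
        # Each piece of user data travels as a hidden variable.
        lines.append('  <input type="hidden" name="%s" value="%s">'
                     % (escape(name, quote=True), escape(value, quote=True)))
    # The only visible element is the submit button.
    lines.append('  <input type="submit" value="%s">' % escape(button_label, quote=True))
    lines.append("</form>")
    return "\n".join(lines)


# Hypothetical field names and URL, for illustration only.
example_form = build_export_form(
    "https://example.org/matrixplot-submit",
    {"matrix": "1 2 0.53\n1 3 0.10", "title": "My alignment"},
)
print(example_form)
```

The page that generated the data would embed the returned markup, so the user sees a single button while the data itself is submitted in the background.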
Discussion of Information Content
Two issues on information content for sequence/structure alignments are
discussed here. First the contribution to the computation of information
content when gaps are included in the alignment, and secondly the sample size
robustness of mutual information for RNA secondary structure co-variance
analysis. For background and details consult the mentioned references.
-
Calculating the information content of an alignment that has gaps.
The conceptual problem in computing the information content or relative entropy
of an alignment that has gaps lies in the background probabilities. The background
probabilities are typically found by counting base (or amino acid) frequencies
in the genomes (or data set) from which the aligned sequences originate. The
gaps, however, do not appear before the alignment is made, so a ``gap
background probability'' or a ``gap expectation value'' cannot be dealt with in
the same way. It only makes sense to talk about this when the alignment has
been performed.
Calculation of the information content of an alignment containing gaps has been
derived by Hertz and Stormo, 1995 (In Lim & Cantor (eds), Proc. of the
Third Int. Conf. on Bioinformatics and Genome Res., pp. 201-216). They derived
the expression
\[
I = \sum_{i=1}^{L} \left[ \sum_{k \in A} q_{ik} \log_2 \frac{q_{ik}}{p_k} + g_i \log_2 g_i \right]
\]
where $g_i$ and $q_{ik}$ are the fraction
of gaps and symbol k at position i in the alignment, and where
k is a symbol in the alphabet A. The fraction $p_k$ is the background probability of symbol
k, so $\sum_{k \in A} p_k = 1$. Note that if
only a few sequences contain a gap ($g_i \approx 0$) or almost all
sequences have a gap ($g_i \approx 1$) at some position
i, the ``gap'' term does not contribute to the sum. Since the gap
background probability is one in this expression, all the background
probabilities sum to 2 rather than to one, whereas a sum of one is the formal
requirement for defining information content or relative entropy. This is
resolved as follows. Hertz and Stormo derived a large-deviation rate function given by
\[
I' = \sum_{i=1}^{L} \left[ \sum_{k \in A} q_{ik} \log_2 \frac{q_{ik}}{p_k/2} + g_i \log_2 \frac{g_i}{1/2} \right]
\]
and as they write, it normalizes all background probabilities
with a factor of 1/2. Because $g_i + \sum_{k \in A} q_{ik} = 1$, this corresponds to adding
$\log_2 2 = 1$ to the
information content on each of L positions in the alignment, so $I' = I + L$. Clearly,
the difference between $I$ and $I'$ is where to put
the baseline in a sequence logo profile plot. The normalizing factor of
Hertz and Stormo was found
by considering the minimum of $I$.
When using the normalizing factor, one interpretation is that the a
priori expectation of gaps is a probability of 0.5. Another
interpretation is that one has two models: one for which the expectation values
for all the symbols sum to one, and one dealing only with gaps, for which the
expectation of gaps is one. When combining the two models, a (re-)normalization
of all the parameters can be introduced.
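The effect of the normalizing factor can be checked numerically. The sketch below is not the Inform implementation; it assumes a per-position relative entropy in bits in which the gap has background probability one before normalization, and the function names are illustrative.

```python
import math
from collections import Counter


def column_information(column, background, gap="-", normalize=False):
    """Information content in bits of one alignment column that may contain gaps.

    column     -- string with one character per aligned sequence
    background -- dict of symbol background probabilities summing to one
    normalize  -- if True, scale every background probability by 1/2
    """
    n = len(column)
    scale = 0.5 if normalize else 1.0
    info = 0.0
    for symbol, count in Counter(column).items():
        fraction = count / n
        # Assumption: the gap has background probability one before normalization.
        prior = 1.0 if symbol == gap else background[symbol]
        info += fraction * math.log2(fraction / (scale * prior))
    return info


def alignment_information(columns, background, normalize=False):
    """Sum of the per-column information content over all positions."""
    return sum(column_information(c, background, normalize=normalize) for c in columns)


uniform = {base: 0.25 for base in "ACGU"}
print(column_information("AAAAAAAA", uniform))                  # fully conserved column
print(column_information("ACGU----", uniform))                  # half gaps, rest like background
print(column_information("ACGU----", uniform, normalize=True))  # same column, normalized
```

The second column is the case that minimizes the unnormalized expression: half the sequences have a gap and the rest follow the background, which gives -1 bit without normalization and 0 with it. Over a whole alignment the two variants differ by exactly one bit per position.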
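The considerations can in general be extended to any gap background probability
$\gamma$ by subtracting $g_i \log_2 \gamma$ from the standard
form
\[
I_i = \sum_{k \in A} q_{ik} \log_2 \frac{q_{ik}}{p_k} + g_i \log_2 g_i ,
\]
where $g_i$ and $q_{ik}$ denote the fractions of gaps and of symbol k at
position i, and $p_k$ is the background probability of symbol k.
When using the normalization $p_k \rightarrow \alpha p_k$, it is easily
found that $\gamma = 1 - \alpha$, since all background probabilities must sum to
one. Clearly $0 < \alpha < 1$. Hence, the
normalizing factor and the a priori expectation of gap frequency are
directly related, and they are the same when $\alpha = 1/2$. Using this
relation together with $\sum_{k \in A} q_{ik} = 1 - g_i$, one can readily write
\[
I'_i - I_i = -(1 - g_i) \log_2 \alpha - g_i \log_2 (1 - \alpha) ,
\]
where $I'_i$ is the information content at position i computed with the
normalized background probabilities. So when $\alpha = 1/2$ the difference is
equal to $\log_2 2 = 1$, and the same as above is obtained. The result
becomes independent of $g_i$. The
interpretation of a gap a priori probability of 0.5 also makes sense.
If no prior knowledge is available about the alignment, it makes sense to expect
gaps to appear randomly, i.e. having an a priori expectation of 0.5.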
So a ``renormalization'' of the two sets of probabilities, the set of base
frequencies, and the gap probability, ensures that the background probabilities
sum to one and fulfill the formal claim to define information content.
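As a numerical illustration of the renormalization, the following sketch evaluates how far the baseline moves at one position. It assumes that the symbol background probabilities are scaled by a factor alpha and that the gap a priori probability is 1 - alpha; the name `baseline_shift` is illustrative only.

```python
import math


def baseline_shift(gap_fraction, alpha):
    """Bits added to one position when symbol backgrounds are scaled by alpha
    and the gap a priori probability is 1 - alpha (assumed parametrization)."""
    symbol_part = -(1.0 - gap_fraction) * math.log2(alpha)
    gap_part = -gap_fraction * math.log2(1.0 - alpha)
    return symbol_part + gap_part


for gap_fraction in (0.0, 0.3, 0.9):
    # With alpha = 0.5 the shift does not depend on the gap fraction;
    # with any other alpha it does.
    print(gap_fraction, baseline_shift(gap_fraction, 0.5), baseline_shift(gap_fraction, 0.8))
```

Only for alpha = 0.5 is the shift the same constant at every position, which is why this choice amounts to moving the baseline of the profile rather than reshaping it.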
The current version of Inform computes the information content according to
the Hertz and Stormo expression with a gap background probability of one, but
MatrixPlot contains an option ``neg'' for
which the baseline can be moved as described (when neg=n). Otherwise (when
neg=y) the baseline is placed so that ``negative information content'' can show
up on the profile. This can be useful in identifying positions that contain
neither many nor few gaps but are truly degenerate.
-
Mutual information only for basepairs
As has been discussed by Gorodkin et al. (Comput. Appl. Biosci.,
13:583-586, 1997) and used in the display of structure logos the
mutual information content for RNA sequences can be limited to include only
pairs of bases that are complementary. In that way a more relevant measure is
constructed. This is done by defining a complementary matrix (Gorodkin
et al. Nucl. Acids. Res. 25:3724-3732,1997) which lists the bases that
are complementary. The complementary matrix lists all bases against themselves,
and a number is assigned to indicate the degree of ``belief'' in the
complementarity. The matrix is clearly symmetric and most positions in the
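matrix hold the value zero. The mutual information between position i
and j in the alignment using the complementary matrix is given by
\[
M_{ij} = q_{ij} \log_2 \frac{q_{ij}}{p_{ij}} + (1 - q_{ij}) \log_2 \frac{1 - q_{ij}}{1 - p_{ij}} \qquad \mathrm{(A)}
\]
where
\[
q_{ij} = \sum_{k,l \in A} q_{ij}^{kl} C_{kl}
\quad \mathrm{and} \quad
p_{ij} = \sum_{k,l \in A} q_i^k q_j^l C_{kl} ,
\]
with A the alphabet. The element $C_{kl}$ of the complementary matrix
lists the degree of complementarity between base k and l. The
fraction of pairs of base k at position i and base l at
position j is $q_{ij}^{kl}$, and $q_i^k$ is the
fraction of base k at position i. Gaps are dealt with by having
$C_{kl} = 0$ whenever k or l is a gap.
There are in particular two qualitative differences between this measure
and the standard form of mutual
information given by
\[
M_{ij} = \sum_{k,l} q_{ij}^{kl} \log_2 \frac{q_{ij}^{kl}}{q_i^k q_j^l} \qquad \mathrm{(B)}
\]
where the sum runs over all symbols, gaps included.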
First in (A) the observed and expected values are computed independently for
two positions i and j, and then the overall covariance is
computed. In contrast in (B) the observed and expected values are compared for
each combination of the involved symbols. Secondly, in (A) only symbols for
which a basepair rule is defined are included in the sum. In (B) all symbols
are included (also gaps) even if it is already known that they do not (for RNA
secondary structure) contribute to the covariance. This fact makes (A) much
more robust to sampling size noise as illustrated below by two examples. In the
Inform program measure (A) is referred to as mtype 2, and (B) to mtype 1.
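The two measures can be compared on toy columns with the sketch below. It is not the Inform code. It assumes that (A) takes a two-state form (basepaired versus not basepaired) built from the summed observed and expected basepair fractions, and it uses a hypothetical complementary matrix with weight 1 for Watson-Crick and G-U pairs.

```python
import math
from collections import Counter

# Assumed complementary matrix: weight 1 for Watson-Crick and G-U pairs.
# Every other combination, including any pair with a gap, counts as 0.
PAIRS = ("AU", "UA", "GC", "CG", "GU", "UG")
COMPLEMENT = {(k, l): 1.0 for k, l in PAIRS}


def _term(observed, expected):
    """One relative-entropy term in bits, with the convention 0 * log(0) = 0."""
    return 0.0 if observed == 0.0 else observed * math.log2(observed / expected)


def mutual_information_standard(col_i, col_j):
    """Measure (B): observed versus expected for every combination of symbols."""
    n = len(col_i)
    count_i, count_j = Counter(col_i), Counter(col_j)
    total = 0.0
    for (k, l), count in Counter(zip(col_i, col_j)).items():
        expected = count_i[k] * count_j[l] / (n * n)
        total += _term(count / n, expected)
    return total


def mutual_information_basepair(col_i, col_j, complement=COMPLEMENT):
    """Measure (A), assumed two-state form: only complementary pairs are summed."""
    n = len(col_i)
    count_i, count_j = Counter(col_i), Counter(col_j)
    # Observed fraction of basepairs between the two positions.
    observed = sum(count * complement.get(pair, 0.0)
                   for pair, count in Counter(zip(col_i, col_j)).items()) / n
    # Fraction of basepairs expected if the two positions were independent.
    expected = sum(ci * cj * complement.get((k, l), 0.0)
                   for k, ci in count_i.items()
                   for l, cj in count_j.items()) / (n * n)
    return _term(observed, expected) + _term(1.0 - observed, 1.0 - expected)


# A covarying stem position, and a pair of positions that covary without basepairing.
for col_i, col_j in (("GGCCAAUU", "CCGGUUAA"), ("AACC", "AACC")):
    print(col_i, col_j,
          mutual_information_standard(col_i, col_j),
          mutual_information_basepair(col_i, col_j))
```

Both measures score the first pair of columns, but only the standard form scores the second, where the symbols covary without forming basepairs. Restricting the sum to complementary pairs removes that kind of contribution.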
The upper triangle is mutual information content computed by type 2, and
the lower triangle is the mutual information content computed by using type 1.
Sample size corrections for plain sequence information content (e.g.
Schneider et al. J. Mol. Biol. 188:415-431, 1986; Basharin, Theory
Probability Appl. 4:333-336, 1959), and for the standard form of mutual
information and correlation functions (e.g. Weiss and Herzel, J.
Theor. Biol. 190:341-353, 1998) have previously been studied.
Acknowledgements
Thanks to Anders Krogh for his contribution in the discussion of information
content. Thanks to Claus A. Andersen, Lars J. Jensen, and Christopher Workman
for suggestions and feedback.
Reference
MatrixPlot: visualizing sequence constraints.
J. Gorodkin, H. H. Stærfeldt, O. Lund, and S. Brunak.
Bioinformatics. 15:769-770, 1999. (http://www.cbs.dtu.dk/services/MatrixPlot/)