DTU Health Tech

Department of Health Technology

MatrixPlot - 1.2

Visualizing structural sequence constraints.


MatrixPlot can be used to generate mutual information plots of sequence alignments, distance matrices of sequences with known 3D coordinates, and plots of user-provided matrix files. For details consult the "Introduction" and "Data format" pages.

For publication of results, please cite:
MatrixPlot: visualizing sequence constraints. J. Gorodkin, H. H. Stærfeldt, O. Lund, and S. Brunak. Bioinformatics. 15:769-770, 1999.

Introduction




Background and Motivation

MatrixPlot is a program to display high quality matrix plots of any type of data given in a simple format, the .mp format. MatrixPlot has a number of options to tune the plot according to specifications given by the user. In particular, the user may display additional information along the edges of the plot and zoom in on any region of the plot. For further introduction consult the MatrixPlot paper:
MatrixPlot: visualizing sequence constraints. J. Gorodkin, H. H. Stærfeldt, O. Lund, and S. Brunak. Bioinformatics. 15:769-770, 1999. (http://www.cbs.dtu.dk/services/MatrixPlot/)
MatrixPlot is accompanied by two other programs. The first, Inform, computes the mutual information between any two positions in a sequence alignment and produces the output in the mp format. The second, pdb2mp, takes a PDB file from the Brookhaven Protein Data Bank and computes all the interatomic distances between either the C-alpha atoms in proteins or the P atoms in nucleotide sequences. Its output is also produced in the mp format. The modularity of the program structure is simple:
Picture
In particular, for the user program, one can export the output of any web server directly to MatrixPlot by using hidden variables in a submission form. The zoom program cuts out a desired region of the matrix plot. It can be reused arbitrarily many times on the mp file, typically after consulting the output from MatrixPlot.
The mutual information content of a structural alignment (refound by foldalign) of the following RNA sequences (from Tuerk et al., PNAS 89, pp. 6988-6992, 1992):
CCAGAGGCCCAACUGGUAAACGGGC
CCG-AAGCUCAACGGGAUAAUGAGC
CCG-AAGCCGAACGGGAAAACCGGC
CC-CAAGCGC-AGGGGAGAA-GCGC
CCG-ACGCCA-ACGGGAGAA-UGGC
CCGUUUUCAG-UCGGGAAAAACUGA
CCGUUACUCC-UCGGGAUAAAGGAG
CCGUAAGAGG-ACGGGAUAAACCUC
CCG-UAGGAG-GCGGGAUAU-CUCC
CCG--UGCCG-GCGGGAUAU-CGGC
CCG-AACUCG-ACGGGAUAA-CGAG
CCG--ACUCG--CGGGAUAA-CGAG
can be displayed by MatrixPlot as:
Picture
The degree of mutual information is indicated by the color scale. Below we discuss convenient ways to compute the mutual information content of such a structural alignment. Along the edges, the plain sequence information content is displayed together with the gap frequencies. The scale to the left of the profile indicates the sequence information, which is plotted in black. The scale to the right indicates the frequency of gaps. The sequence information is computed as in the structure logo, but when gaps are included we have introduced a displacement of the baseline of the plain sequence information content profile. This is discussed below. If, at some position in the alignment, the gap frequency bar is larger than the sequence information content bar, the information bar is displayed on top of the gap bar, and vice versa.
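
The quantities plotted for this alignment can be illustrated with a minimal Python sketch. It is not the Inform program: it assumes the standard form of mutual information with base-2 logarithms and counts gaps as an ordinary symbol, and the function names are chosen here for illustration only.

```python
import math
from collections import Counter

# The twelve aligned RNA sequences quoted above; '-' marks a gap.
ALIGNMENT = [
    "CCAGAGGCCCAACUGGUAAACGGGC",
    "CCG-AAGCUCAACGGGAUAAUGAGC",
    "CCG-AAGCCGAACGGGAAAACCGGC",
    "CC-CAAGCGC-AGGGGAGAA-GCGC",
    "CCG-ACGCCA-ACGGGAGAA-UGGC",
    "CCGUUUUCAG-UCGGGAAAAACUGA",
    "CCGUUACUCC-UCGGGAUAAAGGAG",
    "CCGUAAGAGG-ACGGGAUAAACCUC",
    "CCG-UAGGAG-GCGGGAUAU-CUCC",
    "CCG--UGCCG-GCGGGAUAU-CGGC",
    "CCG-AACUCG-ACGGGAUAA-CGAG",
    "CCG--ACUCG--CGGGAUAA-CGAG",
]


def gap_frequency(alignment, i):
    """Fraction of sequences that have a gap at column i (0-based)."""
    return sum(seq[i] == "-" for seq in alignment) / len(alignment)


def mutual_information(alignment, i, j):
    """Standard mutual information, in bits, between columns i and j.

    Gaps are counted as an ordinary symbol (an assumption of this sketch).
    """
    n = len(alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    left = Counter(seq[i] for seq in alignment)
    right = Counter(seq[j] for seq in alignment)
    total = 0.0
    for (a, b), count in pairs.items():
        observed = count / n                       # pair frequency
        expected = (left[a] / n) * (right[b] / n)  # product of column frequencies
        total += observed * math.log2(observed / expected)
    return total


def mutual_information_matrix(alignment):
    """Square matrix of mutual information for all pairs of columns."""
    length = len(alignment[0])
    return [[mutual_information(alignment, i, j) for j in range(length)]
            for i in range(length)]
```

Calling `mutual_information_matrix(ALIGNMENT)` gives a 25 x 25 matrix of the kind that the color scale encodes, and `gap_frequency(ALIGNMENT, i)` gives the value plotted in the gap profile for column i. Fully conserved columns, such as the first one, have zero mutual information with every other column.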
For sequences with a set of three-dimensional coordinates, for example PDB entry 1raa chain A, MatrixPlot generates the plot
Picture
The interatomic distances between the C-alpha atoms are indicated by the color scale, which measures distances in Angstrom. The colors in the bar express the secondary structure assignment in the PDB file: helix is indicated in red, sheet in green, and turn in blue. Gray indicates residues without an assignment in the PDB file.
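
A distance matrix of this kind can be sketched in a few lines of Python. This is not the pdb2mp source: it assumes the standard fixed-column layout of PDB ATOM records, handles only C-alpha atoms of a single chain, and ignores alternate locations. The function names are hypothetical.

```python
import math


def parse_ca_atoms(pdb_text, chain="A"):
    """Collect (x, y, z) of the C-alpha atoms of one chain from PDB ATOM records."""
    coords = []
    for line in pdb_text.splitlines():
        # Fixed columns: atom name 13-16, chain identifier 22, x/y/z 31-54.
        if (len(line) >= 54 and line.startswith("ATOM")
                and line[12:16].strip() == "CA" and line[21] == chain):
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def distance_matrix(coords):
    """Symmetric matrix of Euclidean distances, in the units of the input (Angstrom)."""
    return [[math.dist(a, b) for b in coords] for a in coords]
```

Reading a PDB file into a string and calling `distance_matrix(parse_ca_atoms(text, chain="A"))` gives one row and one column per residue with a C-alpha atom, with zeros on the diagonal.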
Below it is discussed how to modify the graphical output of MatrixPlot. The graphical output is given in postscript files. Any submitted data is kept confidential and deleted a short time after submission.



Data Formats



Program Options (man pages)

The options to Inform, pdb2mp, MatrixPlot, and Zoom are described in the pages given below. Source code requests: send email to gorodkin@cbs.dtu.dk.



WWW Interface

The web interface of MatrixPlot consists of four pages and has been built from the programs presented above: mutual information plots for RNA or DNA alignments, mutual information plots for protein alignments, distance matrices of sequences with a set of three-dimensional coordinates, and a page that takes a user-produced mp file. Some of the command line options of the individual programs have been combined into one field on the corresponding web page. This is described here:
  • Mutual information for nucleotide sequence alignments
    This page contains the options to Inform and all the relevant options of MatrixPlot to generate mutual information plots.
  • Mutual information for protein sequence alignments
    This page contains the options to Inform ("complementarity matrix" is not defined) with mtype=1 (standard form of mutual information), and all the relevant options of MatrixPlot.
  • Distance matrices
    This page contains the options to pdb2mp, as well as the relevant MatrixPlot options to generate distance matrices of sequences with a set of 3D coordinates. The user may also enter the name of a PDB entry. If the chain identifier is omitted a distance matrix will be generated for the first identifier in the PDB entry.
  • User matrix
    This page allows the user to submit an mp file. All MatrixPlot options are available on that page.
After submission to the web pages, the matrix plot is returned as gif and postscript files. The colors in the gif file are limited, so they might be distorted depending on the browser. Along with the matrix plots a new form is given. This form includes all the previous MatrixPlot settings as well as options to zoom around in the plot. Upon resubmission, new matrix plots are returned together with a new submission form. The following two plots illustrate this. First, a full matrix plot of the archaea RNA sequences downloaded from the Signal recognition particle database (Samuelsson and Zwieb, Nucl. Acids Res. 27, pp. 169-170, 1999). The trace of lines from the upper left corner towards the lower right corner indicates stretches of covariant positions in the alignment.
Picture
Using the zoom option then results in the following plot:
Picture

The stretch of covarying positions corresponds to the main stem of the RNA secondary structure. Note that the trace is interrupted by a few positions having complete sequence conservation. The respective parts of the sequence logo profile have also been displayed along the edges of the plot.
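
The zoom operation amounts to cutting a rectangular region out of the matrix. The sketch below illustrates the idea on a plain list of rows. It is not the Zoom program, which works on mp files, and the 1-based inclusive position ranges are an assumption made for this example.

```python
def zoom(matrix, x_range, y_range):
    """Return the sub-matrix covering the given column (x) and row (y) ranges.

    Both ranges are (first, last) pairs, 1-based and inclusive.
    """
    x_first, x_last = x_range
    y_first, y_last = y_range
    return [row[x_first - 1:x_last] for row in matrix[y_first - 1:y_last]]
```

Because the result is again a matrix, the function can be applied repeatedly, in the same way that the zoom program can be rerun on its own output.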

Export user data directly to MatrixPlot
User data can be exported automatically to MatrixPlot through the web. This would typically be data generated by another web page, which can then be exported for graphical manipulation, such as zooming in on a region. Only a single button is needed, as the form can be written using hidden variables.


Discussion of Information Content

Two issues on information content for sequence/structure alignments are discussed here. First the contribution to the computation of information content when gaps are included in the alignment, and secondly the sample size robustness of mutual information for RNA secondary structure co-variance analysis. For background and details consult the mentioned references.

  • Calculating the information content of an alignment that has gaps.

    The conceptual problem in computing the information content or relative entropy of an alignment that has gaps is the background probabilities. The background probabilities are typically found by counting base (or amino acid) frequencies in the genomes (or data set) from which the aligned sequences originate. The gaps, however, do not appear before the alignment is made, so a "gap background probability" or a "gap expectation value" cannot be dealt with in the same way. It only makes sense to talk about this once the alignment has been performed.

    Calculation of the information content of an alignment containing gaps has been derived by Hertz and Stormo, 1995 (In Lim & Cantor (eds), Proc. of the Third Int. Conf. on Bioinformatics and genome Res., pp. 201-216). They derived the expression

    tex2html_wrap_inline99
    where tex2html_wrap_inline101 and tex2html_wrap_inline103 are the fractions of gaps and of symbol k at position i in the alignment, and where k is a symbol in the alphabet A. The fraction tex2html_wrap_inline113 is the background probability of symbol k, so tex2html_wrap_inline117. Note that if only a few sequences contain a gap (tex2html_wrap_inline119) or almost all sequences have a gap (tex2html_wrap_inline121) at some position i, the "gap" term does not contribute to the sum. Since the gap background probability is one in this expression, all the background probabilities sum to 2 rather than to one, which is the sum formally required to define information content or relative entropy. This is resolved as follows. Hertz and Stormo derived a large-deviation rate function given by
    tex2html_wrap_inline125
    and as they write, it normalizes all background probabilities with a factor of 1/2. This corresponds to adding tex2html_wrap_inline129 to the information content on each of the L positions in the alignment. Clearly, the difference between I and tex2html_wrap_inline135 is where to put the baseline in a sequence logo profile plot. The normalizing factor of Hertz and Stormo was found by considering the minimum of tex2html_wrap_inline135. When using the normalizing factor, one interpretation is that the a priori expectation of a gap is a probability of 0.5. Another interpretation is that one has two models: one for which the expectation values for all the symbols sum to one, and one dealing only with gaps, for which the expectation of a gap is one. When combining the two models, a (re-)normalization of all the parameters can be introduced.

    The considerations for any gap background probability can in general be extended by subtracting tex2html_wrap_inline135 from the standard form

    tex2html_wrap_inline141
    When using the normalization tex2html_wrap_inline143, it is easily found that tex2html_wrap_inline145. Clearly tex2html_wrap_inline147. Hence, the normalizing factor and the a priori expectation of the gap frequency are directly related, and are the same when tex2html_wrap_inline149. Using this relation one can readily write
    tex2html_wrap_inline151
    so when tex2html_wrap_inline149 the difference is equal to tex2html_wrap_inline155, and the same result as above is obtained. The result becomes independent of tex2html_wrap_inline101. The interpretation of an a priori gap probability of 0.5 also makes sense. If nothing is known in advance about the alignment, it makes sense to expect gaps to appear randomly, i.e. with an a priori expectation of 0.5. So a "renormalization" of the two sets of probabilities, the set of base frequencies and the gap probability, ensures that the background probabilities sum to one and fulfill the formal requirement for defining information content.

    The current version of Inform computes the information content according to tex2html_wrap_inline135, but MatrixPlot contains an option "neg" with which the baseline can be moved as described (when neg=n). Otherwise (when neg=y) the baseline is placed so that "negative information content" can show up on the profile. This can be useful in identifying positions that contain neither many nor few gaps, but really are degenerate.

  • Mutual information only for basepairs
    As has been discussed by Gorodkin et al. (Comput. Appl. Biosci., 13:583-586, 1997) and used in the display of structure logos, the mutual information content for RNA sequences can be limited to include only pairs of bases that are complementary. In that way a more relevant measure is constructed. This is done by defining a complementary matrix (Gorodkin et al. Nucl. Acids. Res. 25:3724-3732, 1997) which lists the bases that are complementary. The complementary matrix lists all bases against themselves, and a number is assigned to indicate the degree of "belief" in the complementarity. The matrix is clearly symmetric and most positions in the matrix hold the value zero. The mutual information between position i and j in the alignment using the complementary matrix is given by
    tex2html_wrap_inline64 (A)
    where tex2html_wrap_inline66, A the alphabet, and where tex2html_wrap_inline70 and tex2html_wrap_inline72. The element tex2html_wrap_inline74 of the complementary matrix gives the degree of complementarity between base k and l. The fraction of pairs of base k at position i and base l at position j is tex2html_wrap_inline88, and tex2html_wrap_inline90 is the fraction of base k at position i. Gaps are dealt with by having tex2html_wrap_inline96 for any base k. There are in particular two qualitative differences between this measure and the standard form of mutual information given by
    tex2html_wrap_inline104 (B)
    First, in (A) the observed and expected values are computed independently for two positions i and j, and then the overall covariance is computed. In contrast, in (B) the observed and expected values are compared for each combination of the involved symbols. Secondly, in (A) only symbols for which a basepair rule is defined are included in the sum. In (B) all symbols are included (also gaps) even if it is already known that they do not (for RNA secondary structure) contribute to the covariance. This fact makes (A) much more robust to sample size noise, as illustrated below by two examples. In the Inform program, measure (A) is referred to as mtype 2, and (B) as mtype 1. The upper triangle is the mutual information content computed by type 2, and the lower triangle is the mutual information content computed by type 1.
    Picture

    Sample size corrections for plain sequence information content (e.g. Schneider et al. J. Mol. Biol. 188:415-431, 1986; Basharin, Theory Probability Appl. 4:333-336, 1959), and for the standard form of mutual information and correlation functions (e.g. Weiss and Herzel, J. Theor. Biol. 190:341-353, 1998) have previously been studied.
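
For the first issue above (information content with gaps), the role of the gap background probability can be followed in a short derivation. The notation is assumed here and is not taken from the original formulas: g_i is the fraction of gaps at position i, q_ik the fraction of symbol k, p_k the background probability of symbol k, and r a hypothetical a priori gap probability. Logarithms are taken to base 2, and the fractions at each position are assumed to sum to one. Under these assumptions, a sketch of the per-position argument is:

```latex
\begin{align*}
% Gap background probability equal to one (backgrounds sum to 2):
I_i^{(1)} &= g_i \log_2 \frac{g_i}{1}
           + \sum_{k \in A} q_{ik} \log_2 \frac{q_{ik}}{p_k},
  \qquad g_i + \sum_{k \in A} q_{ik} = 1, \quad \sum_{k \in A} p_k = 1 \\
% Gap background r, symbol backgrounds rescaled so that r + (1-r)\sum_k p_k = 1:
I_i^{(r)} &= g_i \log_2 \frac{g_i}{r}
           + \sum_{k \in A} q_{ik} \log_2 \frac{q_{ik}}{(1-r)\,p_k} \\
% Difference between the two forms:
I_i^{(r)} - I_i^{(1)} &= -\,g_i \log_2 r \;-\; (1 - g_i) \log_2 (1 - r) \\
% For r = 1/2 the difference no longer depends on g_i:
\left. I_i^{(r)} - I_i^{(1)} \right|_{r = 1/2} &= \log_2 2 = 1
\end{align*}
```

With r = 1/2 every background probability is halved, and each position gains exactly one bit, independently of the gap fraction. This matches the statement in the text that the two forms differ only in where the baseline of the profile is placed.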

Acknowledgements

Thanks to Anders Krogh for his contribution in the discussion of information content. Thanks to Claus A. Andersen, Lars J. Jensen, and Christopher Workman for suggestions and feedback.


Reference

MatrixPlot: visualizing sequence constraints. J. Gorodkin, H. H. Stærfeldt, O. Lund, and S. Brunak. Bioinformatics. 15:769-770, 1999. (http://www.cbs.dtu.dk/services/MatrixPlot/)

Data formats


This page is an individual page for the data format, and can also be found as a section on the introduction page. The data formats used by the programs, Inform, pdb2mp, and MatrixPlot are described and illustrated by examples.

  • Inform formats:
    The formats for Inform are those that can be used on the MatrixPlot pages for generating mutual information plots of nucleotide sequences and proteins. The formats are a simple align format, the fasta format, and the msf format. Examples:
  • pdb2mp formats
  • MatrixPlot format

User Matrix


Submit your own computed matrix plot file by pasting it in, or read it from a file. For publication of results, please cite

MatrixPlot: visualizing sequence constraints. J. Gorodkin, H. H. Stærfeldt, O. Lund, and S. Brunak. Bioinformatics. 15:769-770, 1999.


Submission by pasting the mp file:

Paste the mp file: (see the data format description)





MatrixPlot options:
Title:
Title: size toffsetx toffsety
Color scale (interval R G B)
Color scale out:
Colorgrain:
Spacing:
Color: size
xcoloff ycoloff

Grid: thickness
xgridsize ygridsize
Show max value:
Position numbers:
size xnumpos ynumpos

Assignment along the edges of the matrix plot. Note that the "Letter" option overrides the "Bar" option for each of the two possible assignments.
Bar of first
assignment col:
Properties: barthick
xbarpos ybarpos
Letters of first
assignment col:
Properties: scale
xbarpos ybarpos
Bar of second
assignment col:
Properties: barthick2
xbarpos2 ybarpos2
Letters of second
assignment col:
Properties: scale
xbarpos2 ybarpos2

The bars and letters can be replaced by one entire text string.
Read all cols as one string (you can have more than two cols).
Show sequence
information profile:
Move baseline of info profile
when having gaps in alignment:
Include gap
profile (usegaps):
Complementary matrix
elements (cout):
Reverse horizontal
direction (xreverse):
Reverse vertical
direction (yreverse):


Submission of an mp file:

Read the mp file: (see the data format description)







Mutual information in RNA and DNA sequences


Compute mutual information of your sequence alignment. You can submit the data by pasting the alignment, or read it from a file. When the processing is done you can zoom around in your plot as much as you want. For pure sequence information go to sequence-structure logos. For publication of results, please cite

MatrixPlot: visualizing sequence constraints. J. Gorodkin, H. H. Stærfeldt, O. Lund, and S. Brunak. Bioinformatics. 15:769-770, 1999.


Submission by pasting the alignment:

Alignment: (see the data format description)





Mutual information options:

Compute by
Zeros along the diagonal:
Complementary matrix elements (bp):
Alphabet:
Background nucleotide distribution
(for logo profile):
Complementarity matrix
(when using type 2)

MatrixPlot options:

Title:
Title: size toffsetx toffsety
Color scale (interval R G B)
Color scale out:
Colorgrain:
Spacing:
Color: size
xcoloff ycoloff

Grid: thickness
xgridsize ygridsize
Show max
mutual value:
Show sequence
information profile:
Include gap
profile (usegaps):
Reverse horizontal
direction (xreverse):
Position numbers:
size xnumpos ynumpos
Move baseline of info profile
when having gaps in alignment:
Complementary matrix
elements (cout):
Reverse vertical
direction (yreverse):

Submission of a file containing the alignment:

Alignment: (see the data format description)






Mutual information in protein sequences


Compute mutual information of your sequence alignment. You can submit the data by pasting the alignment, or read it from a file. When the processing is done you can zoom around in your plot as much as you want. For pure sequence information go to protein sequence logos. For publication of results, please cite

MatrixPlot: visualizing sequence constraints. J. Gorodkin, H. H. Stærfeldt, O. Lund, and S. Brunak. Bioinformatics. 15:769-770, 1999.


Submission by pasting the alignment:

Alignment: (see the data format description)





Mutual information options:

Zeros along the diagonal:
Complementary matrix elements (bp):
Alphabet:
Background nucleotide distribution (for logo profile):

MatrixPlot options:

Title:
Title: size toffsetx toffsety
Color scale (interval R G B)
Color scale out:
Colorgrain:
Spacing:
Color: size
xcoloff ycoloff

Grid: thickness
xgridsize ygridsize
Show max
mutual value:
Show sequence
information profile:
Include gap
profile (usegaps):
Reverse horizontal
direction (xreverse):
Position numbers:
size xnumpos ynumpos
Move baseline of info profile
when having gaps in alignment:
Complementary matrix
elements (cout):
Reverse vertical
direction (yreverse):

Submission of a file containing the alignment:

Alignment: (see the data format description)






Distance matrices


Get a distance matrix of an RNA or DNA sequence with 3D coordinates. You can submit the sequence with its 3D coordinates by pasting it in, or read it from a file. You can also type a PDB entry name with corresponding chain identifier. After return of the matrix plot you can zoom around in it. For publication of results, please cite

MatrixPlot: visualizing sequence constraints. J. Gorodkin, H. H. Stærfeldt, O. Lund, and S. Brunak. Bioinformatics. 15:769-770, 1999.


Submission by pasting the sequence:

Give PDB entry (small letters) (and chain, capital letter):
OR
Type in the sequence with 3D coordinates: (see the data format description)





pdb2mp column format options: (Ignore if you are using PDB format)
xcol:
ycol:
zcol:
acol:
scol:

MatrixPlot options:
Title:
Title: size toffsetx toffsety
Color scale (interval R G B)
Color scale out:
Colorgrain:
Spacing:
Color: size
xcoloff ycoloff

Grid: thickness
xgridsize ygridsize
Show max
distance:
Position numbers:
size xnumpos ynumpos
Secondary
structure bar
Structure bar:
barthick xbarpos ybarpos
Sequence
letters
Sequence letters:
scale xbarpos2 ybarpos2
Complementary matrix
elements (cout):
Reverse horizontal
direction (xreverse):
Reverse vertical
direction (yreverse):

Submission of a file containing the sequence:

Give PDB entry (small letters) (and chain, capital letter):
OR
Read the sequence file with 3D coordinates: (see the data format description)






