MaxAlign - 1.2
Maximizing the size of gap-free columns by selecting an optimal subset of sequences in an alignment.
The presence of gaps in an alignment of nucleotide or protein sequences is often an inconvenience for bioinformatical studies. In phylogenetic analyses, for instance, gapped columns are often discarded entirely from the alignment.
MaxAlign is a program that optimizes the alignment prior to such analyses. Specifically, it maximizes the number of characters that are present in gap-free columns - alignment area - by selecting an optimal subset of sequences. MaxAlign can be used prior to any phylogenetic analysis as well as in other situations where this form of alignment clean up is useful, such as the presence of badly aligned or truncated sequences.
For publication of results, please cite:
MaxAlign: maximizing usable data in an alignment
BMC Bioinformatics 2007, 8:312
Gouveia-Oliveira R, Sackett P W, Pedersen A G
Would you prefer to run MaxAlign at your own machine?
MaxAlign 1.1 is available for download. Right-click - Save as.
You need a perl installation to run MaxAlign, but nothing else as the program is made in pure perl and requires no extra modules.
1. Specify the input sequencesAll sequence headers/names MUST be different.
All the input sequences must be in one-letter amino acid or nucleotide code. The suggested alphabet (not case sensitive) is as follows: A C D E F G H I K L M N P Q R S T U V W Y -
Gaps should be represented only by "-".
Other symbols e.g. B,J,X will be considered as nucleotides / amino acids. The sequences can be input in the following two ways:
Paste an alignment in
format into the upper window of the main server page.
- Select a FASTA file on your local disk, either by typing the file name into the lower window or by browsing the disk.
We have set a limit of 3000 sequences and a length limit of 5000 nucleotides or residues.
2. Advanced optionsEnsure optimal solution (costly)
There are two algorithms in MaxAlign. One heuristic and the other branch-and-bound. The branch-and-bound is guaranteed to find the optimal solution, because it tries all possible solutions. Therefore it is a slower process. The heuristic algorithm finds the optimal solution in 96% of the test cases, and in the last 4% it finds a solution within 99% of the optimal solution (see the paper). Note: This has been improved later to the heuristic algorithm finding the optimal solution in 98% of the test cases.
As a default, only the heuristic algorithm is used. If, however, you want to be absolutely sure that you have the optimal alignment, you can check the "Ensure optimal solution" checkbox, and the branch-and-bound algorithm will be used as well. However, there is a limit (5 min) on the time used for finding the optimum.
You might also be interested in knowing which sequences were discarded first (the heuristic algorithm is iterative). These are probably the worst sequences. You might even want to use some intermediate alignment that is not so good but keeps some more sequences. Or you might have clicked on "Ensure optimal solution" and being interested in comparing the algorithms. In that case you can check the "Detailed output" checkbox. You will get the same output as before plus the information relative to the progression of the MaxAlign run.
Preserving selected sequences
You might want to keep some sequences in your alignment, even at the cost of excluding some sites. You can do that by marking those sequences with a plus sign, "+", before their name, as in the example below:
Sequence 1 above will always be incorporated in the output of MaxAlign, while Sequence 2 incorporation will be evaluated.
Please be sure your sequence names are not starting with a plus "+" if you dont want them to be marked.
Remove gap-only columns
This option will remove all the columns that only have gaps from the alignment. Please note that this is the only time when MaxAlign will meddle with the columns in the alignment. Otherwise it will only remove sequences. But gap-only columns are useless, and therefore you have the option of removing them. If the input alignment was in frame, removing gap-only columns will not cause any frameshift.
Run MaxAlign on protein sequences and obtain the result in nucleotide sequences (or vice versa)
You might be interested in running a codon-based analysis after MaxAlign. So it would be convenient to get the output as an alignment of nucleotides. However, you will have to run MaxAlign on the protein sequence if you are to run a codon-based analysis. So you can provide both alignments to MaxAlign. You must provide the alignment to be analysed (in this case, the protein one) as the first one, on the top of the page, but also provide the nucleotidic one in the second box, at the end of the page. MaxAlign will process the first one, but then use the second one to construct the output. The way it does this is by comparing the sequence identifiers. Please be sure that your sequences have the same identifiers (which, in fasta format, are the whole headers until the first space character).
3. Submit the jobClick on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.
At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.
Under MaxAlign alignment you have a report on how much better your alignment became, in which you can compare the alignment area before and after the MaxAlign run . You can also see how many sequences were excluded and included, and follow the links to see which ones.
In case you have clicked on Detailed output you will have access to the details of the MaxAlign iterative run, seeing which sequences were excluded at which stage.
And if you have provided a second alignment, you will get your output based on that alignment, and not on your first alignment. For example, you can provide a nucleotide alignment as the second alignment, and the translation as the first alignment. Computations will be done on the first alignment, but the output will be printed as a nucleotide alignment.
EXAMPLE OUTPUT ALIGNMENT
>Sequence_1 TAGCACC---ACG >Sequence_3 TAGTTCC---ACA >Sequence_4 TATGTCC---ACG >Sequence_5 TATCCCCACGACC
Alignment area is thus equal to the number of sequences included in the alignment times the number of columns without any gaps. This maximization of the alignment area is done by selectively removing sequences from the alignment. In the example alignment shown below,
Alignment area = 21
one can see that the alignment area equals 21. That corresponds to the nucleotides in columns 4, 5 and 6 in sequences from A to G (3 sites times 7 sequences). Out of the 56 nucleotides in the alignment, only 21 can be used in a potential phylogenetic analysis, as 5 of the 3 columns will be excluded (the ones in grey).
If, however, the first two sequences (A and B) are excluded, we get the alignment shown below,
Alignment area = 30
MaxAlign does exactly this: removes the subset of sequences that maximizes the alignment area.
To see how MaxAlign does this, it is useful to look at the alignment through the concepts of gap pattern and sequence sets: The first three sites (1,2,3) have the same gap pattern, which is of having gaps only in the first two sequences (A and B). All sites sharing that gap pattern can yield ungapped columns if the aforementioned sequences are removed. So for every gap pattern there is a set of sequences requiring removal to produce ungapped columns, what could then increase the alignment area. In this example, the correspondence between gap patterns and sets is shown in the following table:
|Sites sharing a gap pattern||Sequences in Set|
|1, 2, 3||A , B|
|4, 5, 6||-|
|8||D , E|
Before we move on, it is worth mentioning that there are two algorithms under the hood of MaxAlign: one heuristic and one exhaustive search. The heuristic is extremely quick and finds the optimal solution in 96% of the test cases (note: new improvements makes it find the optimal solution in 98% of the test cases), and in the last 4% it finds a solution within 99% of the optimal solution. The exhaustive search will test all possible solutions and thus it will always find the optimal one. As the number of solutions can be immense, this is not a practical approach. Nevertheless, with the checkbox "Ensure optimal solution" checked, MaxAlign will run both algorithms and tell the user whether the solution encountered was the same. The exhaustive search can take a long time on huge datasets. Therefore we incorporated a time limit on it. It will quit if the run exceeds that time limit. The time limit is chosen by the user on the command line version and it was set to 5 min in the web server version.
Let us now see how the heuristic algorithm works: It begins by finding relationships between the above-mentioned sets of sequences. If two sets are alike, they are fused into only one set (in the example, the sets of sites 1, 2 and 3 will be fused into one). If a set is a subset of a major set its "benefits" are also added to a major set (in the example, column 7 is also added to the set D,E). This can be seen in the following table:
|Sets||New columns by removing the set||Number of new columns|
|A , B||1 , 2, 3||3|
|D , E||7 , 8||2|
Then starts the iterative step. At each iteration, the set with more impact on the alignment area is identified and removed. In an earlier version, there where 3 very different methods, but they have been replaced with 3 different versions of the best one: "Greedy ratio". Here the set that has the highest "alignment area per sequence removed ratio" is removed.
1-No synergy: Synergy between sets are disregarded.
2-Synergy between 2 sets: Synergy between any 2 sets are taken into account. It is the only one available on this webserver and the default one on the command line version of MaxAlign
3-Synergy between 3 sets: Synergy between any 3 sets are taken into account. Runtime increases dramatically but is still acceptable, however the improvement is not considered worth it.
The algorithm runs until no set removal can bring an improvement. And that is, most probably, the optimal solution. In this anedoctical example, only one iteration would be enough to find the optimal solution, which is the removal of the sequences A and B.
Both the webversion and the command line version (available for download in the main page) are programmed in pure Perl, without requiring any modules, and therefore it can run in any machine.
MaxAlign: maximizing usable data in an alignment
BMC Bioinformatics 2007
Gouveia-Oliveira R, Sackett P W, Pedersen A G
Center for Biological Sequence Analysis, BioCentrum-DTU, The Technical University of Denmark, DK-2800 Lyngby, Denmark
Results: MaxAlign is a program that optimizes the alignment prior to such analyses. Specifically, it maximizes the number of nucleotide (or amino acid) symbols that are present in gap-free columns - the alignment area - by selecting the optimal subset of sequences to exclude from the alignment.MaxAlign can be used prior to phylogenetic and bioinformatical analyses as well as in other situations where this form of alignment improvement is useful. In this work we test MaxAligns performance in these tasks and compare the accuracy of phylogenetic estimates including and excluding gapped columns from the analysis, with and without processing with MaxAlign. In this paper we also introduce a new simple measure of tree similarity, Normalized Symmetric Similarity (NSS) that we consider useful for comparing tree topologies.
Conclusions: We demonstrate how MaxAlign is helpful in detecting misaligned or defective sequences without requiring manual inspection. We also show that it is not advisable to exclude gapped columns from phylogenetic analyses unless MaxAlign is used first. Finally, we find that the sequences removed by MaxAlign from an alignment tend to be those that would otherwise be associated with low phylogenetic accuracy, and that the presence of gaps in any given sequence does not seem to disturb the phylogenetic estimates of other sequences.