Algorithm description

The presence of gaps in an alignment of nucleotide or protein sequences is often an inconvenience for bioinformatical studies. In phylogenetic analyses, for instance, gapped columns are often discarded entirely from the alignment. The goal of this tool is to maximize the alignment area, defined as the number of nucleotide (or amino acid) symbols used in the alignment, when columns having at least one gap are excluded.

Alignment area is thus equal to the number of sequences included in the alignment times the number of columns without any gaps. This maximization of the alignment area is done by selectively removing sequences from the alignment. In the example alignment shown below,

1 2 3 4 5 6 7 8
Sequence A - - - a c c t a
Sequence B - - - a c c t a
Sequence C g g t a c c g a
Sequence D g g t a g g - -
Sequence E c g t a c g g -
Sequence F t g a a t g c a
Sequence G t g t a c g c a

Alignment area = 21

one can see that the alignment area equals 21. That corresponds to the nucleotides in columns 4, 5 and 6 in sequences from A to G (3 sites times 7 sequences). Out of the 56 nucleotides in the alignment, only 21 can be used in a potential phylogenetic analysis, as 5 of the 3 columns will be excluded (the ones in grey).
If, however, the first two sequences (A and B) are excluded, we get the alignment shown below,

Sequence C g g t a c c g a
Sequence D g g t a g g - -
Sequence E c g t a c g g -
Sequence F t g a a t g c a
Sequence G t g t a c g c a

Alignment area = 30

in which 6 columns are now available to contribute to the phylogenetic analysis, and the alignment area becomes 30 (6 sites times 5 sequences).

MaxAlign does exactly this: removes the subset of sequences that maximizes the alignment area.

To see how MaxAlign does this, it is useful to look at the alignment through the concepts of gap pattern and sequence sets: The first three sites (1,2,3) have the same gap pattern, which is of having gaps only in the first two sequences (A and B). All sites sharing that gap pattern can yield ungapped columns if the aforementioned sequences are removed. So for every gap pattern there is a set of sequences requiring removal to produce ungapped columns, what could then increase the alignment area. In this example, the correspondence between gap patterns and sets is shown in the following table:

Sites sharing a gap pattern Sequences in Set
1, 2, 3 A , B
4, 5, 6 -
7 D
8 D , E

Before we move on, it is worth mentioning that there are two algorithms under the hood of MaxAlign: one heuristic and one exhaustive search. The heuristic is extremely quick and finds the optimal solution in 96% of the test cases (note: new improvements makes it find the optimal solution in 98% of the test cases), and in the last 4% it finds a solution within 99% of the optimal solution. The exhaustive search will test all possible solutions and thus it will always find the optimal one. As the number of solutions can be immense, this is not a practical approach. Nevertheless, with the checkbox "Ensure optimal solution" checked, MaxAlign will run both algorithms and tell the user whether the solution encountered was the same. The exhaustive search can take a long time on huge datasets. Therefore we incorporated a time limit on it. It will quit if the run exceeds that time limit. The time limit is chosen by the user on the command line version and it was set to 5 min in the web server version.
Let us now see how the heuristic algorithm works: It begins by finding relationships between the above-mentioned sets of sequences. If two sets are alike, they are fused into only one set (in the example, the sets of sites 1, 2 and 3 will be fused into one). If a set is a subset of a major set its "benefits" are also added to a major set (in the example, column 7 is also added to the set D,E). This can be seen in the following table:

Sets New columns by removing the set Number of new columns
A , B 1 , 2, 3 3
D 7 1
D , E 7 , 8 2

Then starts the iterative step. At each iteration, the set with more impact on the alignment area is identified and removed. In an earlier version, there where 3 very different methods, but they have been replaced with 3 different versions of the best one: "Greedy ratio". Here the set that has the highest "alignment area per sequence removed ratio" is removed.

    1-No synergy: Synergy between sets are disregarded.

    2-Synergy between 2 sets: Synergy between any 2 sets are taken into account. It is the only one available on this webserver and the default one on the command line version of MaxAlign

    3-Synergy between 3 sets: Synergy between any 3 sets are taken into account. Runtime increases dramatically but is still acceptable, however the improvement is not considered worth it.

The algorithm runs until no set removal can bring an improvement. And that is, most probably, the optimal solution. In this anedoctical example, only one iteration would be enough to find the optimal solution, which is the removal of the sequences A and B.

Both the webversion and the command line version (available for download in the main page) are programmed in pure Perl, without requiring any modules, and therefore it can run in any machine.