FOGSAA PairwiseAligner implementation #4784

michaelfm1211 · 2024-08-03T09:14:18Z

I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run pre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)

This PR adds an implementation of the Fast Optimal Global Sequence Alignment Algorithm (FOGSAA) to Bio.Align.PairwiseAligner. Right now this PR is still a work in progress and not ready to be merged. The implementation is still in its infancy and there are likely a number of bugs. I wanted to open a PR early to give time for feedback on the code and for discussion of whether adding FOGSAA to PairwiseAligner is even a good idea.

The majority of this PR is taken and adapted from Angana Chakraborty's C++ code. In the linked issue @CallMeMisterOwl has indicated the author has OK'd the code for use in Biopython, but we'll probably want to check in again if this PR is eventually merged.

Right now only the DNA scoring algorithm has been added. Matrix scoring for proteins and alignment are still TODO. To make writing the code easier for me, I haven't put everything in macros like the other algorithms, but this will be done before the final version. There are some flaws in the DNA scoring implementation. Chakraborty's code only uses integers for scoring, but the rest of PairwiseAligner uses doubles. This will be addressed before the final version (it will require replacing or changing the queue data structure used in the original). The code also creates a full 2D alignment matrix, but some sources (e.g., this README) say the algorithm can be done without it. There are almost certainly more optimizations that can be done too.

PairwiseAligner selects the best algorithm based on the parameters given. In the linked issue, it was suggested that FOGSAA be made the default algorithm, but IMO that won't be a good idea until after the final version has been battle-tested in production. For now, I've added a setter to the .algorithm property of PairwiseAligner. I think this setter is a good idea in its own right, but if others disagree it can be removed in the final version.

I'll update this description as progress continues.

Update August 9 (commit 793ff19): Support for scoring with a scoring matrix should now be implemented. The code has now been moved into macros similar to the other algorithm implementations.

Update August 14 (commit 3f398f5): Per mdehoon's suggestions, the .algorithm setter has been removed. The FOGSAA algorithm is now selected by setting the mode attribute of PairwiseAligner to "fogsaa".
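As a rough illustration of the mode-based selection described in this update, the sketch below sets `mode` to `"fogsaa"` and scores two sequences. It assumes a Biopython release that already contains this PR; the helper name `fogsaa_score` and the chosen score values are hypothetical, and the function returns `None` when Biopython is missing or does not accept the mode.

```python
def fogsaa_score(seq_a, seq_b):
    """Score a global alignment in FOGSAA mode, or return None if unsupported."""
    try:
        from Bio import Align  # requires Biopython to be installed
    except ImportError:
        return None
    aligner = Align.PairwiseAligner()
    try:
        aligner.mode = "fogsaa"  # mode name as given in the update above
    except ValueError:
        return None  # assumption: older releases reject the unknown mode
    # Illustrative scores chosen so that match >= mismatch >= gap.
    aligner.match_score = 1.0
    aligner.mismatch_score = -1.0
    aligner.gap_score = -2.0
    return aligner.score(seq_a, seq_b)


print(fogsaa_score("GATTACA", "GATTACA"))
```

Setting the scores explicitly avoids relying on library defaults, which may differ between releases.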

Update August 22: Commits b0c365f and 98eb4f5 should now add basic support for aligning two sequences with FOGSAA.

Update August 27 (commit 463aabc): Support for non-integer scores should now be implemented. This was done by replacing the priority queue implementation used in Chakraborty's code with a max heap.
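The idea of a max heap that orders entries by floating-point priority can be sketched in a few lines of Python. This is only an illustration of the data structure, not the C code in this PR; the class name `MaxHeap` and its methods are hypothetical, and it leans on the standard-library `heapq` min-heap by negating priorities.

```python
import heapq


class MaxHeap:
    """Max-heap keyed on float priorities, built on the stdlib min-heap."""

    def __init__(self):
        self._items = []
        self._counter = 0  # tie-breaker so payloads are never compared

    def push(self, priority, payload):
        # Negate the priority: the smallest negated value is the largest priority.
        heapq.heappush(self._items, (-priority, self._counter, payload))
        self._counter += 1

    def pop(self):
        neg_priority, _, payload = heapq.heappop(self._items)
        return -neg_priority, payload

    def __len__(self):
        return len(self._items)


heap = MaxHeap()
for bound, cell in [(0.2, (1, 1)), (1.5, (2, 1)), (-3.0, (1, 2))]:
    heap.push(bound, cell)
while heap:
    print(heap.pop())
```

Because the heap compares priorities directly, nothing in it requires integer scores, which is the property the update relies on.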

Update September 28: End gap scores and affine gaps should now work, and warnings will be raised if potentially problematic gap scores are used for FOGSAA. Numerous bug fixes and some speedups have also been added.

mdehoon · 2024-08-03T09:22:33Z

Bio/Align/_pairwisealigner.c

+static int
+Aligner_set_algorithm(Aligner* self, PyObject* value, void* closure)
+{
+    if (value == Py_None) {


I haven't tried the code, but it seems that with this method the user is able to choose an algorithm that is inconsistent with the gap scores.

Good catch. I pushed another commit today that should fix this. Can you take a look at it?

I think that the algorithm should be read-only. The appropriate algorithm is chosen automatically based on the gap scores and mode. Other than for FOGSAA, there is no point in choosing a different algorithm, because then the algorithm is either inconsistent with the gap scores, or it is slower than the automatically selected one (while producing identical results).

Also, if a user sets the algorithm to Gotoh, but then changes the gap scores, the algorithm may flip back to Needleman-Wunsch if the gap scores are consistent with that algorithm.

I would suggest to make 'FOGSAA' (or 'fast global' or whatever you want to call it) one of the options for the mode attribute. If that option is chosen, the algorithm attribute automatically switches to FOGSAA.
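The design suggested here, a read-only `algorithm` derived from `mode` and the gap scores, can be sketched with a toy Python class. Everything in it is hypothetical (the class `ToyAligner`, the returned labels, and the rule that equal open and extend scores mean linear gaps); the real aligner is written in C and supports more cases.

```python
class ToyAligner:
    """Toy model: 'algorithm' is derived state and cannot be assigned."""

    MODES = ("global", "local", "fogsaa")

    def __init__(self):
        self._mode = "global"
        self.open_gap_score = -1.0
        self.extend_gap_score = -1.0

    @property
    def mode(self):
        return self._mode

    @mode.setter
    def mode(self, value):
        if value not in self.MODES:
            raise ValueError(f"invalid mode {value!r}")
        self._mode = value

    @property
    def algorithm(self):
        # No setter is defined, so assigning to .algorithm raises AttributeError.
        if self._mode == "fogsaa":
            return "FOGSAA"
        linear = self.open_gap_score == self.extend_gap_score
        if self._mode == "global":
            return "Needleman-Wunsch" if linear else "Gotoh global"
        return "Smith-Waterman" if linear else "Gotoh local"


aligner = ToyAligner()
print(aligner.algorithm)
aligner.extend_gap_score = -0.5  # affine gaps now, so the algorithm changes
print(aligner.algorithm)
aligner.mode = "fogsaa"
print(aligner.algorithm)
```

Since the algorithm is recomputed on each access, it cannot drift out of sync with the gap scores, which is the inconsistency raised earlier in this thread.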

michaelfm1211 · 2024-08-20T20:46:45Z

~~Note to self (and others following this PR, I guess): Tests are failing on macOS and Windows because of a bug in NumPy (see #4797). Passes on Linux and is working locally.~~ The problem has been fixed.

michaelfm1211 · 2024-09-16T01:15:28Z

I haven't had much time to work on this over the past month, but I should have more time soon. Right now the only feature yet to be implemented is support for different gap scores on the ends of strands. A bigger problem, though, seems to be performance. From my rudimentary benchmarking with a few random sequences from GenBank, it looks like FOGSAA is running at least 10x slower than Needleman-Wunsch. Right now I'm not sure where this is coming from (my current guesses are inefficiencies in the priority queue, incorrect bounds, or some other sort of bug); I'll need to do some more testing.

Never mind about the performance; most of the slowness came from running Python with the debug allocator. It is still usually slower than the Needleman-Wunsch code, but it runs faster than the original Chakraborty code.

I'll also do more extensive testing on the correctness of this code. My last few commits fixed a good number of bugs, but there very well could be more bugs I haven't caught yet.

michaelfm1211 · 2024-09-28T18:43:14Z

Odd. Tests are failing on macOS with Python 3.12.6. In my dev environment (macOS with Python 3.12.5), the tests pass fine.

The exception being raised should only be raised if the lower bound is still less than the upper bound when all feasible paths have been exhausted, but a debug message I added says the lower bound is 0.200000 and the upper bound is also 0.200000, so no exception should be raised. I don't really know what's going on but I suspect some kind of floating point comparison bug.

Edit: it was in fact a floating point comparison bug
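The symptom described above, two bounds that both print as 0.200000 yet compare as unequal, is easy to reproduce. The sketch below shows one common remedy, a tolerance-based comparison with `math.isclose`; the actual fix in this PR is not shown in the thread, and the helper name `bounds_disagree` and the tolerances are assumptions.

```python
import math


def bounds_disagree(lower, upper, rel_tol=1e-9, abs_tol=1e-12):
    """Return True only if lower is below upper by more than rounding noise."""
    if math.isclose(lower, upper, rel_tol=rel_tol, abs_tol=abs_tol):
        return False
    return lower < upper


lower = 0.3 - 0.1  # accumulates a rounding error
upper = 0.2
print("%f %f" % (lower, upper))       # both values look identical at 6 digits
print(lower < upper)                  # the naive comparison still sees a gap
print(bounds_disagree(lower, upper))  # the tolerant comparison does not
```

A tolerance of this kind trades a tiny loss of strictness for robustness against platform-dependent rounding in accumulated scores.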

michaelfm1211 · 2024-09-29T02:16:30Z

I think this PR should be out of draft stage now. As long as I haven't missed anything, this FOGSAA implementation should give the same results as the Chakraborty code and implement all the features that the other PairwiseAligner algorithms have.

@mdehoon Do you want to take a look over this again? I know it's a lot of code, so no rush.

mdehoon · 2024-10-15T08:33:34Z

Bio/Align/__init__.py

-    Waterman-Smith-Beyer global or local alignment algorithm).
+    alignment algorithm (the Needleman-Wunsch, Smith-Waterman, Gotoh,
+    Waterman-Smith-Beyer, or Fast Optimal Global Sequence Alignment Algorithm
+    global or local alignment algorithm).


I guess this should be

(the Needleman-Wunsch, Smith-Waterman, Gotoh, or Waterman-Smith-Beyer global or local alignment algorithm, or the Fast Optimal Global Sequence Alignment Algorithm).

Just for my curiosity, is there a version of FOGSAA for local alignments?

I think one could theoretically adapt FOGSAA for local alignments, but I think it would almost always be slower than Smith-Waterman because FOGSAA would almost never be able to prune a branch. Also, the name is Fast Optimal Global Alignment Algorithm, so it really was never designed for local alignments.
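To make the pruning argument concrete, here is a simplified best-first branch-and-bound search for the global alignment score. It is a sketch in the spirit of FOGSAA rather than the algorithm from the paper or the C code in this PR: it assumes linear gap scores with `match >= mismatch` and `match >= 2 * gap`, and the function name and default scores are hypothetical.

```python
import heapq


def best_first_global_score(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Optimal global alignment score via best-first branch and bound.

    Assumes linear gaps with match >= mismatch and match >= 2 * gap, so the
    optimistic bound below never underestimates the achievable score.
    """
    n, m = len(a), len(b)

    def upper_bound(i, j, present):
        # Best case for the rest: every alignable pair matches, and only the
        # unavoidable length difference is paid for in gaps.
        r1, r2 = n - i, m - j
        return present + min(r1, r2) * match + abs(r1 - r2) * gap

    best = {(0, 0): 0.0}  # best score found so far for each cell
    heap = [(-upper_bound(0, 0, 0.0), 0, 0, 0.0)]  # min-heap on negated bound
    while heap:
        _, i, j, present = heapq.heappop(heap)
        if present < best[(i, j)]:
            continue  # a better path to this cell was found later; prune
        if i == n and j == m:
            return present  # no remaining entry has a higher bound
        children = []
        if i < n and j < m:
            step = match if a[i] == b[j] else mismatch
            children.append((i + 1, j + 1, present + step))
        if i < n:
            children.append((i + 1, j, present + gap))
        if j < m:
            children.append((i, j + 1, present + gap))
        for ci, cj, score in children:
            if score > best.get((ci, cj), float("-inf")):
                best[(ci, cj)] = score
                heapq.heappush(heap, (-upper_bound(ci, cj, score), ci, cj, score))
    raise RuntimeError("search ended without reaching the final cell")


print(best_first_global_score("GATTACA", "GATTACA"))
print(best_first_global_score("GATTACA", "GCATGCT"))
```

For similar sequences the diagonal path keeps the highest bound and few other cells are expanded. In a local alignment any cell could start or end the best path, so a bound of this kind would rule out far fewer branches.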

mdehoon · 2024-10-15T08:34:42Z

Bio/Align/_pairwisealigner.c

-  const char text[] = "Pairwise aligner, implementing the Needleman-Wunsch, Smith-Waterman, Gotoh, and Waterman-Smith-Beyer global and local alignment algorithms";
+  const char text[] = "Pairwise aligner, implementing the Needleman-Wunsch, "
+      "Smith-Waterman, Gotoh, Waterman-Smith-Beyer, and Fast Optimal Global "
+      "Sequence Alignment Algorithm global and local alignment algorithms";


Same comment here

mdehoon · 2024-10-15T08:48:08Z

Bio/Align/_pairwisealigner.c

+        PyObject *BiopythonWarning = PyObject_GetAttrString(Bio_module, "BiopythonWarning"); \
+        Py_DECREF(Bio_module); \
+        if (PyErr_WarnEx(BiopythonWarning, \
+                    "Match score is less than mismatch score. Algorithm may return incorrect results.", 1)) { \


One issue with warnings is that they get raised only once during a Python session. So even if the same warning occurs in a completely different context, the warning will not be shown. But I don't know a better solution to this, as raising an exception is too strict.
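For callers who need to see every occurrence, Python's standard `warnings` filters can be adjusted on the caller's side. The sketch below uses a hypothetical warning class and function as stand-ins for the library (neither is part of Biopython) and compares the `"default"` filter action, which reports a given warning once per source location, with `"always"`.

```python
import warnings


class ToyAlignmentWarning(UserWarning):
    """Hypothetical stand-in for a library warning category."""


def score_with_suspect_parameters():
    # stacklevel=2 attributes the warning to the caller's line.
    warnings.warn(
        "Match score is less than mismatch score.",
        ToyAlignmentWarning,
        stacklevel=2,
    )
    return 0.0


def count_warnings(action, calls=3):
    """Count how many warnings are delivered under the given filter action."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action, category=ToyAlignmentWarning)
        for _ in range(calls):
            score_with_suspect_parameters()
    return len(caught)


print(count_warnings("default"))  # repeats from the same line are suppressed
print(count_warnings("always"))   # every call is reported
```

This does not change what the library emits; it only controls which of the emitted warnings reach the user.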

mdehoon

Can you add a line or two to the Biopython tutorial documentation (chapter_align.rst) to make users aware that FOGSAA is now available? Search engines will also be able to find it then.
Otherwise, it looks fine. Great job!

mdehoon · 2024-10-16T00:21:00Z

Doc/Tutorial/chapter_pairwise.rst

+If `aligner.mode` is set to `"fogsaa"`, then the Fast Optimal Global Alignment
+Algorithm [Chakraborty2013]_ with some modifications is used. This mode
+calculates a global alignment, but it is not like the regular `"global"` mode.
+It is best suited for long alignments between similar sequences. If the the


Oops, thanks for catching that!

mdehoon · 2024-10-16T00:22:19Z

Doc/Tutorial/chapter_pairwise.rst

+match score is less than the mismatch score or any gap score, or if any gap
+score is greater than the mismatch score, then a warning is raised and the
+algorithm may return incorrect results. Unlike other modes that may return more
+than one alignment, FOGSAA always returns only one alignment.


A key point to include here is that FOGSAA relies on a heuristic and (unlike the other alignment algorithms) is not guaranteed to find the alignment with the highest score.

I changed the description to explain how the heuristic is used. FOGSAA should be guaranteed to find the optimal alignment (one of the alignments returned by Needleman-Wunsch) if the assumptions of the heuristic are not violated. These are what the warnings are for: if a warning is raised then the heuristic may be incorrect.

Does the new explanation make this clear to users, or should I modify it?

I think it's fine. Users who want to know the details can look in the literature.
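The conditions quoted from the tutorial text in this thread can be written down as a small check. This is a sketch of the stated rules only, not the C code that raises the warnings; the function name `fogsaa_parameter_problems` and the message strings are hypothetical.

```python
def fogsaa_parameter_problems(match, mismatch, gap_scores):
    """List the ways the given scores break the assumptions quoted above."""
    problems = []
    if match < mismatch:
        problems.append("match score is less than mismatch score")
    if any(match < gap for gap in gap_scores):
        problems.append("match score is less than a gap score")
    if any(gap > mismatch for gap in gap_scores):
        problems.append("a gap score is greater than the mismatch score")
    return problems


# Scores that satisfy the assumptions produce no messages.
print(fogsaa_parameter_problems(1.0, -1.0, [-2.0, -0.5]))
# A gap score above the mismatch score is flagged.
print(fogsaa_parameter_problems(1.0, -1.0, [-2.0, 0.0]))
```

An empty list means the heuristic's assumptions hold for those scores; any entry corresponds to a case where a warning would be expected.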

michaelfm1211 · 2024-10-16T02:09:24Z

Odd, it looks like CircleCI failed to check out the branch, but the rest of the tests pass, so it can't be an issue with the code. I'm guessing this is just a CircleCI thing or bad luck.

mdehoon · 2024-10-16T07:41:28Z

Bio/Align/_pairwisealigner.c

+{
+    /* No need to create path because FOGSAA only finds one optimal alignment
+     * the .path fields should be populated by FOGSAA_EXIT_ALIGN. To indicate
+     * we've exausted the iterator, just set self->M[0][0].path to DONE */


Oops. I added another commit to fix that. Thanks!

mdehoon · 2024-10-17T01:07:32Z

Thank you!

michaelfm1211 added 2 commits July 31, 2024 11:39

scaffold FOGSAA and add setter for PairwiseAligner.algorithm

47a4b4a

add first version of FOGSAA scoring

983bcd0

mdehoon reviewed Aug 3, 2024

add restrictions to algorithm setter and support matrix scoring in FOGSAA

793ff19

michaelfm1211 force-pushed the fogsaa branch from cd7e7de to 793ff19 Compare August 9, 2024 15:44

michaelfm1211 added 2 commits August 9, 2024 17:40

add test for running fogsaa with matrix scoring

3b3e4ec

Remove algorithm setter in lieu of FOGSAA_Mode, scaffold PathGenerator for FOGSAA

3f398f5

michaelfm1211 force-pushed the fogsaa branch from cede290 to 3f398f5 Compare August 15, 2024 04:05

Add basic support for FOGSAA alignment

b0c365f

michaelfm1211 added 6 commits August 22, 2024 18:16

Add FOGSAA alignment with matrix scoring

98eb4f5

Add more restrictions to FOGSAA parameters.

670bedd

These restrictions come from the queue data structure used, not from the algorithm itself. Changing the priority queue implementation may ease these restrictions at a possible cost to performance.

Change FOGSAA priority queue implementation to a max heap

463aabc

This allows the queue to sort doubles, which in turn removes the requirement of integer scores in FOGSAA.

Allocate memory once, fix affine gaps, remove threshold

cd75ba2

more fixes

430e873

Add error checking, debugging code, and fix lower bounds

3804b7c

add support for different affine gaps on edges

ca7f7ad

michaelfm1211 force-pushed the fogsaa branch 2 times, most recently from e303d68 to 3feeade Compare September 28, 2024 18:22

remove debug printfs, warn on invalid parameters

05953f7

michaelfm1211 force-pushed the fogsaa branch from 3feeade to 05953f7 Compare September 28, 2024 18:25

michaelfm1211 added 3 commits September 28, 2024 13:54

fix floating point comparison bugs

9381ed6

only copy cells of optimal path in fogsaa align

5801ac9

stop using different macros for fogsaa cell types

1577360

michaelfm1211 marked this pull request as ready for review September 29, 2024 02:14

michaelfm1211 requested a review from mdehoon October 12, 2024 18:40

mdehoon reviewed Oct 15, 2024

michaelfm1211 and others added 3 commits October 15, 2024 17:53

fix fogsaa docstrings, iterator, and reset algorithm on set_mode

6e15ee9

mention fogsaa in Tutorial/chapter_pairwise.rst

b4093c4

Merge branch 'master' into fogsaa

c6c7e0e

mdehoon reviewed Oct 16, 2024

fix fogsaa documentation in Doc/Tutorial/chapter_pairwise.rst

841d6cb

mdehoon reviewed Oct 16, 2024

fix typo

0fbc519

mdehoon merged commit eedf82d into biopython:master Oct 17, 2024
32 checks passed
