Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FOGSAA PairwiseAligner implementation #4784

Merged
merged 22 commits into from
Oct 17, 2024
Merged

Conversation

michaelfm1211
Copy link
Contributor

@michaelfm1211 michaelfm1211 commented Aug 3, 2024

  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run pre-commit
    locally, and understand that continuous integration checks will be used to
    confirm the Biopython unit tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

Closes #4776

This PR adds an implementation of the Fast Optimal Global Alignment Algorithm (FOGSAA) to Bio.Align.PairwiseAligner. Right now this PR is still a work in progress and not ready to be merged. The implementation is still in its infancy and there are likely a number of bugs. I wanted to make a PR to give time for feedback on the code and discussion on whether or not adding FOGSAA to PairwiseAligner is even a good idea.

The majority of this PR is taken and adapted from Angana Chakraborty's C++ code. In the linked issue @CallMeMisterOwl has indicated the author has OK'd the code for use in BioPython, but we'll probably want to check in again if this PR will eventually be merged.

Right now only the DNA scoring algorithm has been added. Matrix scoring for proteins and alignment are still TODO. To make writing the code easier for me, I haven't put everything in macros like the other algorithms, but this will be done before the final version. There are some flaws in the DNA scoring implementation. Chakraborty's code only uses integers for scoring, but the rest of PairwiseAligner uses doubles. Before the final version this will be addressed (it will require replacing/changing the queue data structure used in the original). The code also creates a full 2D alignment matrix, but some sources (ex, this README) say the algorithm can be done without it. There are almost certainly more optimizations that can be done too.

PairwiseAligner selects the best algorithm by the parameters given. In the linked issue, it was suggested that FOGSAA be made the default algorithm, but IMO that won't be a good idea until after the final version has been battle-tested in production. For now, I've added a setter to the .algorithm property of PairwiseAligner. I think this setter is a good idea on its own right, but if others disagree it can be removed in the final version.

I'll update this description as progress continues.

Update August 9 (commit 793ff19): Support for scoring with a scoring matrix should now be implemented. The code has now been moved into macros similar to the other algorithm implementations.

Update August 14 (commit 3f398f5): Per mdehoon's suggestions, the .algorithm setter has been removed. The FOGSAA algorithm is now selected by setting the mode attribute of PairwiseAligner to "fogsaa".

Update August 22: Commits b0c365f and 98eb4f5 should now add basic support for aligning two sequences with FOGSAA.

Update August 27 (commit 463aabc): Support for non-integer scores should now be implemented. This was done by replacing the priority queue implementation used in Chakraborty's code with a max heap.

Update September 28: End gap scores and affine gaps should now work, and warnings will be raised if potentially problematic gap scores are used for FOGSAA. Numerous bugs fixes and some speedups have also been added.

static int
Aligner_set_algorithm(Aligner* self, PyObject* value, void* closure)
{
if (value == Py_None) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tried the code, but it seems that with this method the user is able to choose an algorithm that is inconsistent with the gap scores.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I pushed another commit today that should fix this. Can you take a look at it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the algorithm should be read-only. The appropriate algorithm is chosen automatically based on the gap scores and mode. Other than for FOGSAA, there is no point in choosing a different algorithm, because then the algorithm is either inconsistent with the gap scores, or it is slower than the automatically selected one (while producing identical results).

Also, if a user sets the algorithm to Gotoh, but then changes the gap scores, the algorithm may flip back to Needleman-Wunsch if the gap scores are consistent with that algorithm.

I would suggest to make 'FOGSAA' (or 'fast global' or whatever you want to call it) one of the options for the mode attribute. If that option is chosen, the algorithm attribute automatically switches to FOGSAA.

@michaelfm1211
Copy link
Contributor Author

michaelfm1211 commented Aug 20, 2024

Note to self (and others following this PR, I guess): Tests are failing on macOS and Windows because of a bug in NumPy (see #4797). Passes on Linux and is working locally. The problem has been fixed.

These restrictions come from the queue data structure used not from the
algorithm itself. Changing the priority queue implementation may ease
these restrictions at a possible loss to performance.
This allows for the queue sort doubles, which in turn removes the
requirement of integer scores in FOGSAA.
@michaelfm1211
Copy link
Contributor Author

michaelfm1211 commented Sep 16, 2024

I haven't had much time to work on this over the past month, but I should have more time soon. Right now the only feature yet to be implemented is support for different gap scores on the ends of strands. A bigger problem though seems to be performance. From my rudimentary benchmarking with a few random sequences from GenBank, it looks like FOGSAA is running at least 10x slower than Needleman-Wunsch. Right now I'm not sure where this is coming from (my current guesses are inefficiencies in the priority queue, incorrect bounds, or some other sort of bug); I'll need to do some more testing. Nevermind about the performance; most of the slowness was from running Python with the debug allocator. It still is usually slower than the Needleman-Wunsch code, but runs faster than the original Chakraborty code.

I'll also do more extensive testing on the correctness of this code. My last few commits fixed a good number of bugs, but there very well could be more bugs I haven't caught yet.

@michaelfm1211
Copy link
Contributor Author

michaelfm1211 commented Sep 28, 2024

Odd. Tests are failing on macOS with Python 3.12.6. My dev environment is macOS with Python 3.12.5 the tests pass fine.

The exception being raised should only be raised if the lower bound is still less than the upper bound when all feasible paths have been exhausted, but a debug message I added says the lower bound is 0.200000 and the upper bound is also 0.200000, so no exception should be raised. I don't really know what's going on but I suspect some kind of floating point comparison bug.

Edit: it was in fact a floating point comparison bug

@michaelfm1211 michaelfm1211 marked this pull request as ready for review September 29, 2024 02:14
@michaelfm1211
Copy link
Contributor Author

I think this PR should be out of draft stage now. As long as I haven't missed anything, this FOGSAA implementation should give the same results as the Chakraborty code and implement all the features all the other PairwiseAligners have.

@mdehoon Do you want to take a look over this again? I know it's a lot of code, so no rush.

Waterman-Smith-Beyer global or local alignment algorithm).
alignment algorithm (the Needleman-Wunsch, Smith-Waterman, Gotoh,
Waterman-Smith-Beyer, or Fast Optimal Global Sequence Alignment Algorithm
global or local alignment algorithm).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this should be

(the Needleman-Wunsch, Smith-Waterman, Gotoh, or Waterman-Smith-Beyer global or local alignment algorithm, or the Fast Optimal Global Sequence Alignment Algorithm).

Just for my curiosity, is there a version of FOGSAA for local alignments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one could theoretically adapt FOGSAA for local alignments, but I think it would almost always be slower than Smith-Waterman because FOGSAA would almost never be able to prune a branch. Also, the name is Fast Optimal Global Alignment Algorithm, so it really was never designed for local alignments.

const char text[] = "Pairwise aligner, implementing the Needleman-Wunsch, Smith-Waterman, Gotoh, and Waterman-Smith-Beyer global and local alignment algorithms";
const char text[] = "Pairwise aligner, implementing the Needleman-Wunsch, "
"Smith-Waterman, Gotoh, Waterman-Smith-Beyer, and Fast Optimal Global "
"Sequence Alignment Algorithm global and local alignment algorithms";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here

PyObject *BiopythonWarning = PyObject_GetAttrString(Bio_module, "BiopythonWarning"); \
Py_DECREF(Bio_module); \
if (PyErr_WarnEx(BiopythonWarning, \
"Match score is less than mismatch score. Algorithm may return incorrect results.", 1)) { \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue with warnings is that they get raised only once during a Python session. So even if the same warning occurs in a completely different context, the warning will not be shown. But I don't know a better solution to this, as raising an exception is too strict.

Copy link
Contributor

@mdehoon mdehoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a line or two to the Biopython tutorial documentation (chapter_align.rst ) to make users aware that FOGSAA is now available? Search engines will also be able to find it then.
Otherwise, it looks fine. Great job!

If `aligner.mode` is set to `"fogsaa"`, then the Fast Optimal Global Alignment
Algorithm [Chakraborty2013]_ with some modifications is used. This mode
calculates a global alignment, but it is not like the regular `"global"` mode.
It is best suited for long alignments between similar sequences. If the the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the the

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, thanks for catching that!

match score is less than the mismatch score or any gap score, or if any gap
score is greater than the mismatch score, then a warning is raised and the
algorithm may return incorrect results. Unlike other modes that may return more
than one alignment, FOGSAA always returns only one alignment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key point to include here is that FOGSAA relies on heuristic, and (unlike the other alignment algorithms) is not guaranteed to find the alignment with the highest score.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the description to explain how the heuristic is used. FOGSAA should be guaranteed to find the optimal alignment (one of the alignments returned by Needleman-Wunsch) if the assumptions of the heuristic are not violated. These are what the warnings are for: if a warning is raised then the heuristic may be incorrect.

Does the new explanation make this clear to users, or should I modify it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine. Users who want to know the details can look in the literature.

@michaelfm1211
Copy link
Contributor Author

michaelfm1211 commented Oct 16, 2024

Odd, it looks like CircleCI failed to checkout the branch, but the rest of the tests work so it can't be an issue with the code. I'm guessing this is just a CircleCI thing or just bad luck.

{
/* No need to create path because FOGSAA only finds one optimal alignment
* the .path fields should be populated by FOGSAA_EXIT_ALIGN. To indicate
* we've exausted the iterator, just set self->M[0][0].path to DONE */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exhausted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. I added another commit to fix that. Thanks!

@mdehoon mdehoon merged commit eedf82d into biopython:master Oct 17, 2024
32 checks passed
@mdehoon
Copy link
Contributor

mdehoon commented Oct 17, 2024

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[feature] implement FOGSAA for global sequence alignment
2 participants