-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FOGSAA PairwiseAligner implementation #4784
Conversation
Bio/Align/_pairwisealigner.c
Outdated
static int | ||
Aligner_set_algorithm(Aligner* self, PyObject* value, void* closure) | ||
{ | ||
if (value == Py_None) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tried the code, but it seems that with this method the user is able to choose an algorithm that is inconsistent with the gap scores.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I pushed another commit today that should fix this. Can you take a look at it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the algorithm should be read-only. The appropriate algorithm is chosen automatically based on the gap scores and mode. Other than for FOGSAA, there is no point in choosing a different algorithm, because then the algorithm is either inconsistent with the gap scores, or it is slower than the automatically selected one (while producing identical results).
Also, if a user sets the algorithm
to Gotoh, but then changes the gap scores, the algorithm may flip back to Needleman-Wunsch if the gap scores are consistent with that algorithm.
I would suggest to make 'FOGSAA' (or 'fast global' or whatever you want to call it) one of the options for the mode
attribute. If that option is chosen, the algorithm
attribute automatically switches to FOGSAA
.
|
These restrictions come from the queue data structure used not from the algorithm itself. Changing the priority queue implementation may ease these restrictions at a possible loss to performance.
This allows for the queue sort doubles, which in turn removes the requirement of integer scores in FOGSAA.
I haven't had much time to work on this over the past month, but I should have more time soon. Right now the only feature yet to be implemented is support for different gap scores on the ends of strands. I'll also do more extensive testing on the correctness of this code. My last few commits fixed a good number of bugs, but there very well could be more bugs I haven't caught yet. |
e303d68
to
3feeade
Compare
3feeade
to
05953f7
Compare
Odd. Tests are failing on macOS with Python 3.12.6. My dev environment is macOS with Python 3.12.5 the tests pass fine. The exception being raised should only be raised if the lower bound is still less than the upper bound when all feasible paths have been exhausted, but a debug message I added says the lower bound is Edit: it was in fact a floating point comparison bug |
I think this PR should be out of draft stage now. As long as I haven't missed anything, this FOGSAA implementation should give the same results as the Chakraborty code and implement all the features all the other PairwiseAligners have. @mdehoon Do you want to take a look over this again? I know it's a lot of code, so no rush. |
Bio/Align/__init__.py
Outdated
Waterman-Smith-Beyer global or local alignment algorithm). | ||
alignment algorithm (the Needleman-Wunsch, Smith-Waterman, Gotoh, | ||
Waterman-Smith-Beyer, or Fast Optimal Global Sequence Alignment Algorithm | ||
global or local alignment algorithm). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this should be
(the Needleman-Wunsch, Smith-Waterman, Gotoh, or Waterman-Smith-Beyer global or local alignment algorithm, or the Fast Optimal Global Sequence Alignment Algorithm).
Just for my curiosity, is there a version of FOGSAA for local alignments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one could theoretically adapt FOGSAA for local alignments, but I think it would almost always be slower than Smith-Waterman because FOGSAA would almost never be able to prune a branch. Also, the name is Fast Optimal Global Alignment Algorithm, so it really was never designed for local alignments.
Bio/Align/_pairwisealigner.c
Outdated
const char text[] = "Pairwise aligner, implementing the Needleman-Wunsch, Smith-Waterman, Gotoh, and Waterman-Smith-Beyer global and local alignment algorithms"; | ||
const char text[] = "Pairwise aligner, implementing the Needleman-Wunsch, " | ||
"Smith-Waterman, Gotoh, Waterman-Smith-Beyer, and Fast Optimal Global " | ||
"Sequence Alignment Algorithm global and local alignment algorithms"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here
PyObject *BiopythonWarning = PyObject_GetAttrString(Bio_module, "BiopythonWarning"); \ | ||
Py_DECREF(Bio_module); \ | ||
if (PyErr_WarnEx(BiopythonWarning, \ | ||
"Match score is less than mismatch score. Algorithm may return incorrect results.", 1)) { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue with warnings is that they get raised only once during a Python session. So even if the same warning occurs in a completely different context, the warning will not be shown. But I don't know a better solution to this, as raising an exception is too strict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a line or two to the Biopython tutorial documentation (chapter_align.rst ) to make users aware that FOGSAA is now available? Search engines will also be able to find it then.
Otherwise, it looks fine. Great job!
Doc/Tutorial/chapter_pairwise.rst
Outdated
If `aligner.mode` is set to `"fogsaa"`, then the Fast Optimal Global Alignment | ||
Algorithm [Chakraborty2013]_ with some modifications is used. This mode | ||
calculates a global alignment, but it is not like the regular `"global"` mode. | ||
It is best suited for long alignments between similar sequences. If the the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the the
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, thanks for catching that!
Doc/Tutorial/chapter_pairwise.rst
Outdated
match score is less than the mismatch score or any gap score, or if any gap | ||
score is greater than the mismatch score, then a warning is raised and the | ||
algorithm may return incorrect results. Unlike other modes that may return more | ||
than one alignment, FOGSAA always returns only one alignment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key point to include here is that FOGSAA relies on heuristic, and (unlike the other alignment algorithms) is not guaranteed to find the alignment with the highest score.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the description to explain how the heuristic is used. FOGSAA should be guaranteed to find the optimal alignment (one of the alignments returned by Needleman-Wunsch) if the assumptions of the heuristic are not violated. These are what the warnings are for: if a warning is raised then the heuristic may be incorrect.
Does the new explanation make this clear to users, or should I modify it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine. Users who want to know the details can look in the literature.
Odd, it looks like CircleCI failed to checkout the branch, but the rest of the tests work so it can't be an issue with the code. I'm guessing this is just a CircleCI thing or just bad luck. |
Bio/Align/_pairwisealigner.c
Outdated
{ | ||
/* No need to create path because FOGSAA only finds one optimal alignment | ||
* the .path fields should be populated by FOGSAA_EXIT_ALIGN. To indicate | ||
* we've exausted the iterator, just set self->M[0][0].path to DONE */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exhausted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. I added another commit to fix that. Thanks!
Thank you! |
I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the
CONTRIBUTING.rst
file, have runpre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst
andCONTRIB.rst
as part of this pull request, am listedalready, or do not wish to be listed. (This acknowledgement is optional.)
Closes #4776
This PR adds an implementation of the Fast Optimal Global Alignment Algorithm (FOGSAA) to
Bio.Align.PairwiseAligner
. Right now this PR is still a work in progress and not ready to be merged. The implementation is still in its infancy and there are likely a number of bugs. I wanted to make a PR to give time for feedback on the code and discussion on whether or not adding FOGSAA toPairwiseAligner
is even a good idea.The majority of this PR is taken and adapted from Angana Chakraborty's C++ code. In the linked issue @CallMeMisterOwl has indicated the author has OK'd the code for use in BioPython, but we'll probably want to check in again if this PR will eventually be merged.
Right now only the DNA scoring algorithm has been added. Matrix scoring for proteins and alignment are still TODO. To make writing the code easier for me, I haven't put everything in macros like the other algorithms, but this will be done before the final version. There are some flaws in the DNA scoring implementation. Chakraborty's code only uses integers for scoring, but the rest of
PairwiseAligner
usesdouble
s. Before the final version this will be addressed (it will require replacing/changing the queue data structure used in the original). The code also creates a full 2D alignment matrix, but some sources (ex, this README) say the algorithm can be done without it. There are almost certainly more optimizations that can be done too.PairwiseAligner
selects the best algorithm by the parameters given. In the linked issue, it was suggested that FOGSAA be made the default algorithm, but IMO that won't be a good idea until after the final version has been battle-tested in production. For now, I've added a setter to the.algorithm
property ofPairwiseAligner
. I think this setter is a good idea on its own right, but if others disagree it can be removed in the final version.I'll update this description as progress continues.
Update August 9 (commit 793ff19): Support for scoring with a scoring matrix should now be implemented. The code has now been moved into macros similar to the other algorithm implementations.
Update August 14 (commit 3f398f5): Per mdehoon's suggestions, the
.algorithm
setter has been removed. The FOGSAA algorithm is now selected by setting themode
attribute ofPairwiseAligner
to"fogsaa"
.Update August 22: Commits b0c365f and 98eb4f5 should now add basic support for aligning two sequences with FOGSAA.
Update August 27 (commit
463aabc
): Support for non-integer scores should now be implemented. This was done by replacing the priority queue implementation used in Chakraborty's code with a max heap.Update September 28: End gap scores and affine gaps should now work, and warnings will be raised if potentially problematic gap scores are used for FOGSAA. Numerous bugs fixes and some speedups have also been added.