stdlib: Improved BIF for binary matches and split. #1480
Conversation
Nice find and thanks for the pr! I haven't looked at the code yet, just wanted to answer your questions.
No, it is not. If you run a debug emulator, a bunch of assertions should trigger.
Yes it is possible for it to terminate. Yes it should be ok (iirc).
@garazdawi Would you happen to have any recommendations for memory allocation techniques that are safe between traps? Are all of the allocator types for
@potatosalad only
@garazdawi Thanks for the answers. I added a
@potatosalad that looks correct. Have you run the test suite under a debug emulator to make sure all the assertions in the code still hold?
@garazdawi I'm not entirely sure what you mean by "under a debug emulator". Is there something specific I'm supposed to do to make that happen? Here's what I've done so far:
It is described here how to build a debug emulator: https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system tldr;
@garazdawi Thanks! So, it does look like there is an assertion failure which gets triggered when running
The full stack trace is:
However, when I change the class of the
Are you sure that changing it to

The erts_check_off_heap checks whether all the data that is part of the heap actually consists of proper terms. So I would start by looking at the code that creates Erlang terms and check that it is correct.
@garazdawi I retract my comments about

Thanks again for your guidance thus far, but I have a few more questions due to my lack of understanding of how the garbage collection/allocation/trapping works:
@garazdawi I answered my first question, I think:
Yes, it appears so. I just pushed a commit that allocates only the memory that can be used before the reductions run out. Any unused heap memory is released prior to trapping (for example, if the flags

It seems to have fixed the assertion error on macOS. I'm still not sure whether it's safe or not to keep process heap memory separate from the terms returned to
If you disable GC while doing the trapping, it should be possible to allocate one piece of memory at the start of the operation and then use that. However before enabling GC again, you have to make sure that the allocated area is of exactly the correct size and does not contain any uninitialized data. Use
If you disable GC you don't have to.
So, I tried that on this branch in this specific commit, but the assertion error returned again:
The garbage collection gets turned off after the heap has already been allocated, just like it does in

Does garbage collection need to be disabled before the heap has been allocated? If so, I'm not sure how
No, it shouldn't have to. I just now realized that I probably have sent you down the wrong debugging path. The

Anyway, I think the problem may be in the area that you have been looking at. One thing that could be the cause is that you are not allowed to do
@garazdawi I think that's exactly what was happening. There were two locations where this was possible:
So, after rearranging everything, I have everything working so that the magic refs get created before the factory gets preallocated. The assertion error is no longer present on macOS or illumos. However, the AC operations are now ~2x slower for some reason (still ~70x faster than the existing solution, however). The BM operations are only around 10% slower. There are probably some super obvious inefficiencies that I'm blind to, but they might be caught during code review.
great, we will try to get some time "soon" to look at this closer |
Any chance this will make it in OTP 20? Thanks |
no it will not get into OTP 20. |
So, now that OTP-20 is out I took some time to look at this. I'm very happy with these changes, nice refactorings together with faster code is great :)
I couldn't find anything obvious that causes the performance degradation you are seeing. Is it the last refactoring commit that causes the degradation? I'll take a closer look at it when all else is taken care of.
I've put the branch into our tests to see if anything triggers in there.
I put some minor comments on some code style things to make the code look more like the other code in erts.
Thanks for this pr, great work!
erts/emulator/beam/erl_bif_binary.c (outdated):

```
@@ -171,7 +171,23 @@ static void *my_alloc(MyAllocator *my, Uint size)

#define ALPHABET_SIZE 256

typedef struct _ac_node {
typedef struct _findall_data FindallData;
```
Please do the typedefs inlined into the struct, i.e.

```c
typedef struct _a {} A;
```
erts/emulator/beam/erl_bif_binary.c
Outdated
typedef Eterm (*BinaryFindResult)(Process *, Eterm, BinaryFindContext **); | ||
typedef struct BinaryFindSearch BinaryFindSearch; | ||
|
||
typedef enum BinaryFindState { |
Prefix typedef:ed names with _ so that it is clear that only the typedef:ed name is/should be used.
```
@@ -782,55 +854,31 @@ static Sint bm_find_first_match(BMFindFirstState *state, BMData *bmd,
        ;
    if (i < 0) { /* found */
        *reductions = reds;
        return j;
        *mpos = (Uint) j;
        *mlen = (Uint) blen;
```
Could `j` and `blen` be changed to be Uint?
So, I played around with this, but couldn't find a simple way to get it done without changing the data types for BMData and BMFindFirstState to Uint as well. Some of the loops in the BM functions were previously written around a signed integer that terminates the loop once the value goes negative.
Would you like me to refactor the BM related functions so that everything can be a Uint similar to how the AC functions work?
no need, just leave it as it is
erts/emulator/beam/erl_bif_binary.c
Outdated
} BMFindFirstState; | ||
|
||
#define BM_OK 0 /* used only for find_all */ | ||
#define BM_OK 0 |
While we are refactoring, can we change this and the AC_* equivalent to an enum instead?
@garazdawi Thanks for the feedback!
It is the most recent commit, specifically 17172aa. The performance degradation may be as simple as the fact that the size required for allocating the state struct has increased from the previous implementation. I'll try to get your requested changes committed soon.
erts/emulator/beam/erl_bif_binary.c (outdated):

```c
if (ctx->pat_bin != NULL && ctx->pat_term == THE_NON_VALUE) {
    hp = HAlloc(p, ERTS_MAGIC_REF_THING_SIZE * 2);
    ctx->pat_term = erts_mk_magic_ref(&hp, &MSO(p), ctx->pat_bin);
    hp += ERTS_MAGIC_REF_THING_SIZE;
```
You should not increment hp here, it is done by erts_mk_magic_ref. I found this fault by running emulator bif_SUITE:shadow_comments.
```c
erts_free_aligned_binary_bytes(temp_alloc);
if (*res_term == THE_NON_VALUE) {
    if (is_first_call) {
        erts_set_gc_state(p, 0);
```
Don't you need to export the context here as well? I get an assertion that triggers when running stdlib binary_module_SUITE:random_ref_sr_comp.
Looked a little bit closer at this, and the assert was not due to an un-exported context. It seems to be because the calculations of when to export a context inside the _result functions are off by one.
I believe I have fixed the assertion issue in the most recent set of changes.

The assertion was triggered when the pattern didn't match anything and the size of the subject binary was exactly reductions - 1, which resulted in the _result function being called with only 1 reduction left, but with no results in the returned list other than the original subject. So the context would not be exported (because there were no results), but the BIF would trap before creating the list containing the original subject because it ran out of reductions.
@potatosalad ping?
@lpgauth pong! Sorry for the delay, I have some half-finished changes I'm hoping to have pushed up by the end of the week.
@potatosalad awesome, really looking forward to having this patch in prod 👍
@garazdawi With the exception of the Sint to Uint change, which I commented on above, I believe I have implemented/fixed all of the changes from your review. I have run
Great! I've put it back into testing.
There still seem to be some bug(s) left. I've found two different scenarios where the code crashes. The failures seem to be non-deterministic: the slower the machine, the greater the chance of triggering the error. Let me know if you need some more information to debug the errors.

For some reason BIF__ARGS[1] (i.e. BIF_ARG_2) is TNV (THE_NON_VALUE).

It seems like done is called for a non-global search when terminating.
@garazdawi So the second case I was able to replicate and fix using the following quick test:

```erlang
Source = << 0:67108864, 1, 2 >>,
Pattern = binary:compile_pattern(<<1>>),
Function = fun() -> [] = binary:split(Source, Pattern) end,
[erlang:spawn_link(Function) || _ <- lists:seq(0, 10)].
```

This would immediately and consistently cause the emulator to crash for me. I just pushed up a fix that should prevent it from happening in the future (simply check whether

The first case, however, I am having some difficulty replicating. I've tried it on my macOS machine, a SmartOS VM, a 64-bit Linux VM, and a 64-bit PowerPC Linux VM, all while restricting the memory and CPU resources at different levels and repeating the test(s) over and over again. The assertions prior to trapping should catch

Right before trapping, the following assertion is always run:

```c
ASSERT(result == THE_NON_VALUE && ctx->trap_term != result && ctx->pat_term != result);
BIF_TRAP3(&binary_find_trap_export, p, arg1, ctx->trap_term, ctx->pat_term);
```

The only trap that is different is inside

```c
ASSERT(result == THE_NON_VALUE && ctx->trap_term != result && ctx->pat_term != result);
BIF_TRAP3(&binary_find_trap_export, BIF_P, BIF_ARG_1, BIF_ARG_2, BIF_ARG_3);
```

Should the second type of trap (the recursive one) be changed from this:

```c
BIF_TRAP3(&binary_find_trap_export, BIF_P, BIF_ARG_1, BIF_ARG_2, BIF_ARG_3);
```

to this?

```c
BIF_TRAP3(&binary_find_trap_export, BIF_P, BIF_ARG_1, ctx->trap_term, ctx->pat_term);
```

Do you have any additional information about the environment, or about how the bug might be triggered in a repeatable manner? I've kind of exhausted my knowledge of how to repeat it. Thanks again for your help.
I'll put it into testing again and see if I can reproduce the issue |
So there were a bunch of segfaults again last night. I did some digging, and it seems that I can reliably reproduce the fault if I do this:
at the start of

Edit: I ran all of binary_module_SUITE in order to trigger the fault.
@garazdawi It was the same off-by-one issue that was there before in

I was able to reproduce it reliably with

Here's why I think 69 exposed the bug: I added

I temporarily added

The fix has been pushed up; hopefully it'll now survive the testing onslaught. Thanks again for your patience.
I haven't found any new faults yet. I will let the patch stew over the weekend and then hopefully merge next week. |
We saw now that our valgrind logs contain memory leak reports looking like this:

```
Error[0x89db]:Leak_DefinitelyLost: sys.c:973
Stackdump:
```

It looks like the BinaryFindContext.pat_bin binary is lost.
The leak is first reported after binary_module_SUITE:badargs |
@sverker I was able to consistently trigger the leak while running the emulator with valgrind using either of the following:

```erlang
binary:match(<<1,2,3:3>>, <<1>>),
binary:matches(<<1,2,3:3>>, <<1>>).
```

However, from glancing at my changes, it seems as though the memory leak should have been there already, as the behavior for returning a badarg hasn't really been changed. I didn't do any similar tests on the "before this pull request" emulator to prove it one way or the other; it was just something that puzzled me a little 😕

I fixed it by checking that we have a binary instead of a bitstring prior to compiling a pattern that would only result in a badarg further down the line. I may have uncovered an additional issue with

After looking at the code, I realized that it doesn't save the change to the trim flag while trapping. So if the following scenario happens:

Now, when trapping, it also unsets the trim bit on the context if it doesn't match the local variable.
I haven't found any new failures or issues, so everything seems good to be merged. Please do a rebase onto latest master and check that you are happy with everything and I'll merge it after that. |
Force-pushed from dd0c292 to 964354b.
@garazdawi Squashed and rebased everything onto master. Everything builds and the stdlib tests work for me. As a side note, I'm giving a talk at ElixirConf in a couple of days related to well-behaved NIFs, and I heavily referenced your talks from a few years ago about scheduling in the Erlang VM. They were extremely helpful in explaining some of the behaviors I measured, especially the part about "Thread Progress". Thank you for giving those talks!
merged for release in OTP 21, thanks for contributing! I'm glad you found my presentations helpful! That's the reason why I do them!
Summary

Affects `binary:matches/2`, `binary:matches/3`, and `binary:split/3` when the global flag is enabled. Binary find state is kept in memory between traps instead of being serialized and restored inefficiently.

Overview
A performance issue was first reported at elixir-lang/elixir#6148, where @waj noticed that for the same large binary, a single global `binary:split/3` was considerably slower than multiple `binary:split/2` calls. `binary:matches/2` and `binary:matches/3` are also affected.

Using tail recursion with multiple calls to `binary:split/2` is much more performant. For example, in this benchmark, a ~128MB binary with 1KB "parts" separated by 1 byte is ~80x slower when using `binary:split/3` with the global flag enabled than when using tail recursion. With the changes from this pull request, it is ~3x faster than tail recursion, or ~140x faster than the current implementation.

The performance impact worsens as the length of each "part" increases, for example (using a "part" of size 1, 4, and 16):

With the changes from this pull request, the performance degradation is no longer present to the same degree:
A more detailed benchmark and results can be seen here: https://gist.github.com/potatosalad/4b7117679c048bfda23b4e2b8382e448
Cause

For the current implementation, every time the function `do_binary_find` has exhausted its reductions and traps, it allocates new memory from the process heap and copies the temporary find state into it. When it picks up where it left off, it allocates new temporary memory and copies the entire find state back from the process heap.

So, using the ~128MB example from earlier, worst-case traps result in as much as 64MB worth of memory churn (hopefully I calculated that right: ~32 bytes for each `FindallData`, `32 bytes * 1000000 = ~32 megabytes`, and that amount gets allocated twice per trap).

In addition, accumulating the actual large list of results for `binary:split/3` and `binary:matches/2` does not count reductions, which is the cause of some of the performance degradation.

Solution
Borrowing some ideas from the trapping technique I found in `erts_term_to_binary_int`, I made the following changes:

- Refactored `do_binary_find` using the new static `BinaryFindMap` and `BinaryFindReduce` definitions (so `bfs->map->exec()` can be called instead of nested `if/else` statements).
- Added a `BinaryFindState` with three states:
  - `BFMap`: actively finding/searching the subject
  - `BFReduce`: finding is finished and the resulting term is being built
  - `BFDone`: finished and temporary memory has been released

A few things I'm not sure about:
- Is it safe to keep memory from `erts_alloc(ERTS_ALC_T_TMP, ...)` between traps?