stdlib: Improved BIF for binary matches and split. #1480
Conversation
Nice find and thanks for the pr! I haven't looked at the code yet, just wanted to answer your questions.
No, it is not. If you run a debug emulator, a bunch of assertions should trigger.
Yes it is possible for it to terminate. Yes it should be ok (iirc).
@garazdawi Would you happen to have any recommendations for memory allocation techniques that are safe between traps? Are all of the allocator types for
@potatosalad only
@garazdawi Thanks for the answers. I added a
@potatosalad that looks correct. Have you run the test suite under a debug emulator to make sure all the assertions in the code still hold?
@garazdawi I'm not entirely sure what you mean by "under a debug emulator". Is there something specific I'm supposed to do to make that happen? Here's what I've done so far:
It is described here how to build a debug emulator: https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system tldr;
@garazdawi Thanks! So, it does look like there is an assertion failure which gets triggered when running
The full stack trace is:
However, when I change the class of the
Are you sure that changing it to

The erts_check_off_heap checks whether all the data that is part of the heap actually consists of proper terms. So I would start by looking at the code that creates Erlang terms and check that it is correct.
@garazdawi I retract my comments about

Thanks again for your guidance thus far, but I have a few more questions due to my lack of understanding of how the garbage collection/allocation/trapping works:
@garazdawi I answered my first question, I think:
Yes, it appears so. I just pushed a commit that allocates only the memory that can be used before the reductions run out. Any unused heap memory is released prior to trapping (for example, if the flags

It seems to have fixed the assertion error on macOS. I'm still not sure whether it's safe or not to keep process heap memory separate from the terms returned to
If you disable GC while doing the trapping, it should be possible to allocate one piece of memory at the start of the operation and then use that. However before enabling GC again, you have to make sure that the allocated area is of exactly the correct size and does not contain any uninitialized data. Use
If you disable GC you don't have to.
So, I tried that on this branch in this specific commit, but the assertion error returned again:
The garbage collection gets turned off after the heap has already been allocated, just like it does in

Does garbage collection need to be disabled before the heap has been allocated? If so, I'm not sure how
No, it shouldn't have to. I just now realized that I probably have sent you down the wrong debugging path. The

Anyway, I think the problem may be in the area that you have been looking at. One thing that could be the cause is that you are not allowed to do
@garazdawi I think that's exactly what was happening. There were two locations where this was possible:
So, after rearranging everything, I have everything working so that the magic refs get created before the factory gets preallocated. The assertion error is no longer present on macOS or illumos. However, the AC operations are now ~2x slower for some reason (still ~70x faster than the existing solution, however). The BM operations are only around 10% slower. There are probably some super obvious inefficiencies that I'm blind to, but they might be caught during code review.
great, we will try to get some time "soon" to look at this closer |
Any chance this will make it in OTP 20? Thanks |
no it will not get into OTP 20. |
So, now that OTP-20 is out I took some time to look at this. I'm very happy with these changes, nice refactorings together with faster code is great :)
I couldn't find anything obvious that causes the performance degradation you are seeing. Is it the last refactoring commit that causes the degradation? I'll take a closer look at it when all else is taken care of.
I've put the branch into our tests to see if anything triggers in there.
I put some minor comments on some code style things to make the code look more like the other code in erts.
Thanks for this pr, great work!
erts/emulator/beam/erl_bif_binary.c (outdated):

```
@@ -171,7 +171,23 @@ static void *my_alloc(MyAllocator *my, Uint size)

#define ALPHABET_SIZE 256

typedef struct _ac_node {
typedef struct _findall_data FindallData;
```
Please do the typedefs inlined into the struct, i.e.

```c
typedef struct _a {} A;
```
erts/emulator/beam/erl_bif_binary.c
Outdated
typedef Eterm (*BinaryFindResult)(Process *, Eterm, BinaryFindContext **); | ||
typedef struct BinaryFindSearch BinaryFindSearch; | ||
|
||
typedef enum BinaryFindState { |
Prefix typedef:ed names with _ so that it is clear that only the typedef:ed name is/should be used.
```
@@ -782,55 +854,31 @@ static Sint bm_find_first_match(BMFindFirstState *state, BMData *bmd,
        ;
    if (i < 0) { /* found */
        *reductions = reds;
        return j;
        *mpos = (Uint) j;
        *mlen = (Uint) blen;
```
Could `j` and `blen` be changed to be Uint?
So, I played around with this, but couldn't find a simple way to get it done without changing the data types for BMData and BMFindFirstState to Uint as well. Some of the loops in the BM functions were previously written around a signed integer that terminates the loop once the value goes negative.
Would you like me to refactor the BM related functions so that everything can be a Uint similar to how the AC functions work?
no need, just leave it as it is
erts/emulator/beam/erl_bif_binary.c
Outdated
} BMFindFirstState; | ||
|
||
#define BM_OK 0 /* used only for find_all */ | ||
#define BM_OK 0 |
While we are refactoring, can we change this and the AC_* equivalent to an enum instead?
@garazdawi Thanks for the feedback!
It is the most recent commit, specifically 17172aa. The performance degradation may be as simple as the fact that the size required for allocating the state struct has increased from the previous implementation. I'll try to get your requested changes committed soon.
erts/emulator/beam/erl_bif_binary.c (outdated):

```c
if (ctx->pat_bin != NULL && ctx->pat_term == THE_NON_VALUE) {
    hp = HAlloc(p, ERTS_MAGIC_REF_THING_SIZE * 2);
    ctx->pat_term = erts_mk_magic_ref(&hp, &MSO(p), ctx->pat_bin);
    hp += ERTS_MAGIC_REF_THING_SIZE;
```
You should not increment hp here, it is done by erts_mk_magic_ref. I found this fault by running emulator bif_SUITE:shadow_comments.
```c
erts_free_aligned_binary_bytes(temp_alloc);
if (*res_term == THE_NON_VALUE) {
    if (is_first_call) {
        erts_set_gc_state(p, 0);
```
Don't you need to export the context here as well? I get an assertion that triggers when running stdlib binary_module_SUITE:random_ref_sr_comp.
Looked a little bit closer at this, and the assert was not due to an un-exported context. It seems to be because the calculations of when to export a context inside the _result functions are off by one.
I believe I have fixed the assertion issue in the most recent set of changes.

The assertion was triggered when the pattern didn't match anything and the size of the subject binary was exactly reductions - 1, which resulted in the _result function being called with only 1 reduction left, but with no results in the returned list other than the original subject. So the context would not be exported (because there were no results), but the BIF would trap before creating the list containing the original subject because it ran out of reductions.
@potatosalad ping?
@lpgauth pong! Sorry for the delay, I have some half-finished changes I'm hoping to have pushed up by the end of the week.
@potatosalad awesome, really looking forward to having this patch in prod 👍
@garazdawi With the exception of the Sint to Uint change, which I commented on above, I believe I have implemented/fixed all of the changes from your review. I have run
Great! I've put it back into testing.
There still seem to be some bug(s) left. I've found two different scenarios where the code crashes. The failures seem to be non-deterministic: the slower the machine, the greater the chance of triggering the error. Let me know if you need some more information to debug the errors.

For some reason BIF__ARGS[1] (i.e. BIF_ARG_2) is TNV (THE_NON_VALUE).

It seems like done is called for a non-global search when terminating.
@garazdawi So the second case I was able to replicate and fix using the following quick test:

```erlang
Source = << 0:67108864, 1, 2 >>,
Pattern = binary:compile_pattern(<<1>>),
Function = fun() -> [] = binary:split(Source, Pattern) end,
[erlang:spawn_link(Function) || _ <- lists:seq(0, 10)].
```

This would immediately and consistently cause the emulator to crash for me. I just pushed up a fix that should prevent it from happening in the future (simply check whether

The first case, however, I am having some difficulty replicating. I've tried it on my macOS machine, a SmartOS VM, a 64-bit Linux VM, and a 64-bit PowerPC Linux VM, all while restricting the memory and CPU resources at different levels and repeating the test(s) over and over again. The assertions prior to trapping should catch

Right before trapping, the following assertion is always run:

```c
ASSERT(result == THE_NON_VALUE && ctx->trap_term != result && ctx->pat_term != result);
BIF_TRAP3(&binary_find_trap_export, p, arg1, ctx->trap_term, ctx->pat_term);
```

The only trap that is different is inside

```c
ASSERT(result == THE_NON_VALUE && ctx->trap_term != result && ctx->pat_term != result);
BIF_TRAP3(&binary_find_trap_export, BIF_P, BIF_ARG_1, BIF_ARG_2, BIF_ARG_3);
```

Should the second type of trap (the recursive one) be changed from this:

```c
BIF_TRAP3(&binary_find_trap_export, BIF_P, BIF_ARG_1, BIF_ARG_2, BIF_ARG_3);
```

to this?

```c
BIF_TRAP3(&binary_find_trap_export, BIF_P, BIF_ARG_1, ctx->trap_term, ctx->pat_term);
```

Do you have any additional information about the environment, or about how the bug might be triggered in a repeatable manner? I've kind of exhausted my knowledge of how to repeat it. Thanks again for your help.
I'll put it into testing again and see if I can reproduce the issue |
So there were a bunch of segfaults again last night. I did some digging, and it seems that I can reliably reproduce the fault if I do this:
at the start of

Edit: I ran all of binary_module_SUITE in order to trigger the fault.
@garazdawi It was the same off-by-one issue that was there before in

I was able to reproduce it reliably with

Here's why I think 69 exposed the bug: I added

I temporarily added

The fix has been pushed up; hopefully it'll now survive the testing onslaught. Thanks again for your patience.
I haven't found any new faults yet. I will let the patch stew over the weekend and then hopefully merge next week. |
We saw now that our valgrind logs contain memory leak reports looking like this:

```
Error[0x89db]:Leak_DefinitelyLost: sys.c:973
Stackdump:
```

It looks like the BinaryFindContext.pat_bin binary is lost.
The leak is first reported after binary_module_SUITE:badargs |
@sverker I was able to consistently trigger the leak while running the emulator with valgrind using either of the following:

```erlang
binary:match(<<1,2,3:3>>, <<1>>),
binary:matches(<<1,2,3:3>>, <<1>>).
```

However, from glancing at my changes, it seems as though the memory leak should have been there already, as the behavior for returning a badarg hasn't really been changed. I didn't do any similar tests on the "before this pull request" emulator to prove it one way or the other; it was just something that puzzled me a little 😕

I fixed it by checking that we have a binary instead of a bitstring prior to compiling a pattern that would only result in a badarg further down the line. I may have uncovered an additional issue with

After looking at the code, I realized that it doesn't save the change to the trim flag while trapping. So if the following scenario happens:

Now, when trapping, it also unsets the trim bit on the context if it doesn't match the local variable.
I haven't found any new failures or issues, so everything seems good to be merged. Please do a rebase onto latest master and check that you are happy with everything and I'll merge it after that. |
Force-pushed from dd0c292 to 964354b.
@garazdawi Squashed and rebased everything onto master. Everything builds and the stdlib tests work for me. As a side note, I'm giving a talk at ElixirConf in a couple of days related to well-behaved NIFs, and I heavily referenced your talks from a few years ago about scheduling in the Erlang VM. They were extremely helpful in explaining some of the behaviors I measured, especially the part about "Thread Progress". Thank you for giving those talks!
merged for release in OTP 21, thanks for contributing! I'm glad you found my presentations helpful! That's the reason why I do them!
Summary

Affects `binary:matches/2`, `binary:matches/3`, and `binary:split/3` when the global flag is enabled. Binary find state is kept in memory between traps instead of being serialized and restored inefficiently.

Overview
A performance issue was first reported at elixir-lang/elixir#6148, where @waj noticed that for the same large binary, a single global `binary:split/3` was considerably slower than multiple `binary:split/2` calls. `binary:matches/2` and `binary:matches/3` are also affected.

Using tail recursion with multiple calls to `binary:split/2` is much more performant. For example, in this benchmark, a ~128MB binary with 1KB "parts" separated by 1 byte is ~80x slower when using `binary:split/3` with the global flag enabled than when using tail recursion. With the changes from this pull request, it is ~3x faster than tail recursion, or ~140x faster than the current implementation.

The performance impact worsens as the length of each "part" increases, for example (using a "part" of size 1, 4, and 16):

With the changes from this pull request, the performance degradation is no longer present to the same degree:
A more detailed benchmark and results can be seen here: https://gist.github.com/potatosalad/4b7117679c048bfda23b4e2b8382e448
Cause

For the current implementation, every time the function `do_binary_find` has exhausted its reductions and traps, it allocates new memory from the process heap and copies the temporary find state into it. When it picks up where it left off, it allocates new temporary memory and copies the entire find state back from the process heap.

So, using the ~128MB example from earlier, worst-case traps result in as much as 64MB worth of memory churn (hopefully I calculated that right: ~32 bytes for each `FindallData`, `32 bytes * 1000000 = ~32 megabytes`, and that amount gets allocated twice per trap).

In addition, accumulating the actual large list of results for `binary:split/3` and `binary:matches/2` does not count reductions, which is the cause of some of the performance degradation.

Solution
Borrowing some ideas from the trapping technique I found in `erts_term_to_binary_int`, I made the following changes:

- Refactored `do_binary_find` using the new static `BinaryFindMap` and `BinaryFindReduce` definitions (so `bfs->map->exec()` can be called instead of nested `if/else` statements).
- Added a `BinaryFindState` with three states:
  - `BFMap`: actively finding/searching the subject
  - `BFReduce`: finding is finished and the resulting term is being built
  - `BFDone`: finished and temporary memory has been released

A few things I'm not sure about:
- Is it safe to keep memory from `erts_alloc(ERTS_ALC_T_TMP, ...)` between traps?