Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERL-700: seq_trace token not propagated to gen_statem on gen_statem:call with timeout set #3906

Closed
OTP-Maintainer opened this issue Aug 13, 2018 · 23 comments
Assignees
Labels
bug Issue is reported as a bug priority:medium team:VM Assigned to OTP team VM
Milestone

Comments

@OTP-Maintainer
Copy link

Original reporter: picaoao
Affected version: OTP-21.0.4
Fixed in version: OTP-23
Component: otp
Migrated from: https://bugs.erlang.org/browse/ERL-700


In our project we are taking advantage of the `seq_trace` token label's field to propagate context between processes, without having to change the processes interface (API), which is super cool. This is now possible to do it because the restriction on the size of the `label` field was lifted.

However we noticed that at some point in time during the request process, the token was getting reset.

After investigating I realized that the issue was with doing `gen_statem.call()` with timeout set to any value other than :infinity. The gen_statem.call() without specifying a `timeout` value also works. 
We tested with other OTP behaviors (gen_server, gen_fsm) and it doesn't happen in those cases, so it is an issue particular with gen_statem.

I will have an example how to reproduce it in Elixir, but it can be easily translated to Erlang:


{noformat}
# Create :process1 gen_statem
iex> SeqTraceTest.ProcessGenStatem.start_link(:process1)

# spawn a process so that we can set a token a call the process we just created in the step before
iex> spawn_link(fn -> :seq_trace.set_token(:label, :my_context); SeqTraceTest.ProcessGenStatem.echo(:process1, "msg"); IO.puts("token after: #{inspect(:seq_trace.get_token(:label))}") end)

# Here we will output correctly the token from the gen_statem process (it was propagated). The token will also be output after the call to gen_statem (it wasn't reset)
{:label, :my_context}
#PID<0.165.0>
token after: {:label, :my_context}

#Now call with a timeout set to 5000
iex> spawn_link(fn -> :seq_trace.set_token(:label, :my_context); SeqTraceTest.ProcessGenStatem.echo(:process1, "msg", 5000); IO.puts("token after: #{inspect(:seq_trace.get_token(:label))}") end)

# We won't see the token output in by the gen_statem process, and we won't output from the caller process too (it was reset) because we called with the timeout.
[]
#PID<0.162.0>
token after: []
{noformat}


I also left some code to add one more process in the request chain. It will correctly propagate also with more than one process in case we don't use timeout.

{noformat}
defmodule SeqTraceTest.ProcessGenStatem do
  def start_link(name) do
    :gen_statem.start_link({:local, name}, __MODULE__, [], [])
  end

  def init(_arg) do
    {:ok, :run, []}
  end

  def proxy_echo(name, proxy_name, msg) do
    :gen_statem.call(name, {:proxy_echo, proxy_name, msg})
  end

  def echo(name, msg) do
    :gen_statem.call(name, {:echo, msg})
  end

  def proxy_echo(name, proxy_name, msg, timeout) do
    :gen_statem.call(name, {:proxy_echo, proxy_name, msg, timeout}, timeout)
  end

  def echo(name, msg, timeout) do
    :gen_statem.call(name, {:echo, msg}, timeout)
  end

  def run({:call, from}, {:echo, msg}, data) do
    IO.puts(inspect(:seq_trace.get_token(:label)))
    {:keep_state, data, [{:reply, from, msg}]}
  end

  def run({:call, from}, {:proxy_echo, proxy_name, msg}, data) do
    {:keep_state, data, [{:reply, from, SeqTraceTest.ProcessGenStatem.echo(proxy_name, msg)}]}
  end

  def run({:call, from}, {:proxy_echo, proxy_name, msg, timeout}, data) do
    {:keep_state, data, [{:reply, from, SeqTraceTest.ProcessGenStatem.echo(proxy_name, msg, timeout)}]}
  end

  def terminate(_, _, _) do
    :ok
  end

  def code_change(_vsn, state, data, _extra) do
    {:ok, state, data}
  end

  def callback_mode do
    :state_functions
  end
end

{noformat}

@OTP-Maintainer
Copy link
Author

lukas said:

I would guess that this is because of the way the clean_timeout is implemented. Have you tried to use a dirty_timeout and see if that works?

@OTP-Maintainer
Copy link
Author

picaoao said:

You are right ! With `dirty_timeout` it works !

I confess I wasn't aware of the dirty_timeout/clean_timeout options, but after just reading about it I guess we won't have much chance of passing the token with the clean_timeout since we use a BIF to create a process to do the call, and the token won't get propagated to it, right ?
It would be cool if we could have it also in the initI() callbacks of the behaviors, but I guess it is related with the same thing, right ?

So I figure this is by design ? Or was it supposed to propagate the seq_trace token with the `clean_timeout` also ? Or will you (OTP team) plan to do something about it ? :)

@OTP-Maintainer
Copy link
Author

picaoao said:

I just noticed also that the token is reset after calling `:supervisor.start_link()` :( 
Is there anything we can do about it ?

@OTP-Maintainer
Copy link
Author

lukas said:

We were talking at lunch about propagating the token when a spawn happens, this would solve both problems. Making the token propagate is not very difficult, however designing what the trace messages would look like and how to best keep with backwards compatibility with tools needs to be considered.

We'd welcome any suggestions that you may have.

@OTP-Maintainer
Copy link
Author

picaoao said:

That would be great ! We are hitting several cases where the token gets reset because we call libraries that spawn processes on the calls. Without that being done implicitly, we it really not very practical wrapping all the time these calls with a get_token() and set_token() not to loose it. 

At the moment we are not really taking advantage or using the tracing facility of the seq_trace library. We are just using the label field to implicitly transport a request context across processes (and that is awesome btw).

Regarding the trace messages and your concern, from what I've read in the :seq_trace documentation (and pardon my ignorance), I don't see how they are affected by setting the token at spawn of the process. Are you referring to the serial component of the token ?

@OTP-Maintainer
Copy link
Author

lukas said:

I was referring to the SeqTraceInfo parts.

Having thought a bit more on this it would make sense that all process interactions where the receiving side can react to the interaction should be possible to seq_trace. i.e. we should pass the token for spawn/send/recv/'EXIT'/'DOWN' and maybe more? but we should not pass it for monitor/link etc.

The SeqTraceInfo messages should mimic the way that normal tracing is done as much as possible, so I guess we should add both spawn and spawned events to the trace events that happen.

We also of course have to deal with distributed seq_traceing, which makes this a a lot more time consuming to implement.

@OTP-Maintainer
Copy link
Author

picaoao said:

Hi Lukas, is there any plan for releasing the feature above ?

It would be a great addition and it would open doors for implement tracing between process calls, "'a la OpenTracing", without having to change the API between processes to include the tracing context.... like if a process was a micro-server(ice); I think it works already like that :) (GenServer) :)

I can't think of any other way of having it...

@OTP-Maintainer
Copy link
Author

lukas said:

We are planning to do it eventually, but I don't know when that will be.

@OTP-Maintainer
Copy link
Author

john said:

I've made a stab at this in [PR 2189|https://github.com/erlang/otp/pull/2189]. I couldn't make your example work out of the box because {{IO.puts}} in {{run/3}} sometimes loses the token (it calls a server that sometimes handles unrelated messages before responding), but it runs fine once it's removed. Feel free to try it out. :)

@OTP-Maintainer
Copy link
Author

picaoao said:

That is fantastic :)

I will give it a spin and let you know !

@OTP-Maintainer
Copy link
Author

picaoao said:

I was giving it a try and the tokens seems to be propagating correctly :)
However I’m hitting SEGFAULTs if I make several requests

I’m not sure if you patched it on top of OTP 21.2. It seems to have fixed this bug https://bugs.erlang.org/browse/ERL-741

@OTP-Maintainer
Copy link
Author

john said:

The PR included that fix so the crash you're seeing has to be something else, can you provide a test case?

Core dumps (with corresponding beam.smp) would also be helpful.

@OTP-Maintainer
Copy link
Author

picaoao said:

I was testing better and it crashes with an abort (rarely), and mostly with a cannot allocate  memory of type (heap_flag)

Test: 
Basically I'm hitting the app with 100 requests per second (parallel), so generating many traces.
I'm also sending a map with some fields as the `label` of the token.

Results of testing:

OTP 21.2
- with seq_trace:  GOOD
- without: GOOD

OTP 21.1
- with seq_trace : SEGFAULTS
- without: GOOD

OTP-fix (your branch)
- with seq_trace: ABORT or cannot allocate memory heap_flag (with core dump)
- without: GOOD

Details about error in your branch:

Rarely I get:
  size_object: bad tag for 0x0
  ERROR: signal: aborted (core dumped)

Mostly I get:
Cannot allocate 87937984512448 bytes of memory (of type "heap_frag").

How can I pass you the crash dump (it is a bit bigger than 10MB and I cannot attach it directly)

Since I'm not sure, I added links to dropbox for the smp and core dump.
https://www.dropbox.com/s/j2miaz3wjz8uqhh/erl_crash_allocate.dump?dl=0
https://www.dropbox.com/s/y5st0zloyoqvgkn/beam.smp?dl=0

@OTP-Maintainer
Copy link
Author

john said:

Thank you, though that's a crash dump and not a core dump. You'll only get core dumps if it actually aborts/segfaults, which is a bit unfortunate in this case. :(

I'd wager that this is related to the label field having its restrictions lifted, you'll see this kind of crash if there's one spot where it doesn't get copied right. I don't believe this is caused by the PR but rather that it provokes an existing bug, so I'll make it a priority to fix this.

@OTP-Maintainer
Copy link
Author

john said:

I've pushed a fix for a bug where the heap would be corrupted if both {{seq_trace}} and regular tracing were enabled, and the token was matched out with {{get_seq_token}}. It's a pretty unusual set of circumstances but it's possible that this was the cause.

@OTP-Maintainer
Copy link
Author

picaoao said:

Thank you !

I will give it a spin in the afternoon and give you feedback about it.

@OTP-Maintainer
Copy link
Author

picaoao said:

Sorry, but I was quite busy in the last days, so I could only find time for testing now.

I'm still getting problems when I test it :(
I was using simply 1 request/s and I was able to get SEGFAULTS now with the message (eheap_alloc: Cannot allocate 87343595872928 bytes of memory (of type "heap_frag").


Here are the new files:

https://www.dropbox.com/s/9s99libk4uko1of/beam_segfault.smp?dl=0
https://www.dropbox.com/s/xr5mj9qcua1351l/erl_crash_segfault.dump?dl=0

@OTP-Maintainer
Copy link
Author

john said:

That's a crash dump again, not a core dump. :(

Can you try running your test with a debug-compiled emulator? Compile it with {{make emulator TYPE=debug}}, and start it with {{erl -emu_type debug}}. It will hopefully crash a lot sooner with a core dump instead of crash-dumping.

@OTP-Maintainer
Copy link
Author

picaoao said:

Thank you for the steps.

I've managed to get the core:

https://www.dropbox.com/s/usdetpxd8dknduz/core?dl=0
https://www.dropbox.com/s/4q4yutm4nvpnycu/beam.debug.smp?dl=0

@OTP-Maintainer
Copy link
Author

sverker said:

We have found a bug that fits very well  with the symptoms in your core dump.

Proposed fix based on plain OTP-21.0:
https://github.com/sverker/otp/commits/sverker/seq-trace-label-old-heap-bug/ERL-700/OTP-15849

@OTP-Maintainer
Copy link
Author

picaoao said:

This is great news ! I've been talking with John in the Erlanger Slack channel about the problem.
I will give it a try to see if I reproduce it with your fix.

I was thinking and it might make sense to test your fix on top of Johns feature implementation, no ?
If so, could you also provide a branch with both so I can test ? 

Thanks!

@OTP-Maintainer
Copy link
Author

sverker said:

The fix OTP-15849 has been released in OTP-21.3.8.3 and merged to maint (OTP-22.1) and master (OTP-23).

Johns [PR 2189|https://github.com/erlang/otp/pull/2189] has also been merged to master.

Can we close this issue now?

@OTP-Maintainer
Copy link
Author

picaoao said:

I did a test and it works great !
I'm no longer able to reproduce the SEGFAULT and the new propagation features of seq_trace are working like a charm !

Can't wait for these fixes and features being released !

Thanks !
Great work John, Sverker, Lukas !

@OTP-Maintainer OTP-Maintainer added bug Issue is reported as a bug team:VM Assigned to OTP team VM priority:medium labels Feb 10, 2021
@OTP-Maintainer OTP-Maintainer added this to the OTP-23 milestone Feb 10, 2021
@jhogberg jhogberg removed this from the OTP-23 milestone Feb 10, 2021
@jhogberg jhogberg added this to the OTP-23.0 milestone Feb 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug priority:medium team:VM Assigned to OTP team VM
Projects
None yet
Development

No branches or pull requests

3 participants