Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Segmentation fault at 0x0000000000000000 in rb_fiber_transfer #113

Open
casperisfine opened this issue Sep 3, 2024 · 7 comments
Open

Comments

@casperisfine
Copy link

Opening here first because I'm too unfamiliar with this to know whether it's a Ruby bug or a bug in io-event. All I can say is that rb_fiber_transfer expects rb_current_execution_context() to return something, but it's called from a thread that doesn't have a current execution context.

So I don't know if it's IO-event trying to resume a fiber from a wrong thread or wrong assumption in rb_fiber_transfer.

I unfortunately don't have a repro script, but this is happening on a fairly frequent basis (dozens of times a day).

/artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/scheduler.rb:167: [BUG] Segmentation fault at 0x0000000000000000
ruby 3.4.0dev (2024-08-27T14:45:53Z v3.4.0-pshopify-pr.. 2863085136) +YJIT [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0016 p:---- s:0100 e:000099 CFUNC  :resume
c:0015 p:0011 s:0095 e:000094 METHOD /artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/scheduler.rb:167
c:0014 p:0026 s:0089 E:001cf8 METHOD /artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/task.rb:421
c:0013 p:0018 s:0084 E:001c98 METHOD /artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/task.rb:189
c:0012 p:0033 s:0079 e:000078 METHOD /artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/task.rb:215 [FINISH]
c:0011 p:0024 s:0071 e:000070 METHOD /artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/kernel/async.rb:26
c:0010 p:0004 s:0061 E:0004c8 METHOD /artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/async_client.rb:370
c:0009 p:0008 s:0054 E:001c08 BLOCK  /artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/async_client.rb:156
c:0008 p:0012 s:0050 e:000049 METHOD /artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/instrumentation/timings.rb:67
c:0007 p:0021 s:0042 E:002398 METHOD /artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/async_client.rb:155
c:0006 p:0027 s:0031 e:000027 BLOCK  /artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/async_client.rb:129
c:0005 p:0012 s:0025 e:000024 METHOD /artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/instrumentation/timings.rb:67
c:0004 p:0015 s:0017 e:000016 BLOCK  /artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/async_client.rb:128
c:0003 p:0011 s:0013 e:000012 BLOCK  /artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/task.rb:190
c:0002 p:0004 s:0009 e:000007 BLOCK  /artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/task.rb:403 [FINISH]
c:0001 p:---- s:0003 e:000002 DUMMY  [FINISH]

-- Ruby level backtrace information ----------------------------------------
/artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/task.rb:403:in 'block in schedule'
/artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/task.rb:190:in 'block in run'
/artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/async_client.rb:128:in 'block (2 levels) in _process'
/artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/instrumentation/timings.rb:67:in 'measure'
/artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/async_client.rb:129:in 'block (3 levels) in _process'
/artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/async_client.rb:155:in 'make_request'
/artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/instrumentation/timings.rb:67:in 'measure'
/artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/async_client.rb:156:in 'block in make_request'
/artifacts/bundle/ruby/3.4.0+0/gems/runtime-engine-client-0.10.1/lib/runtime_engine/async_client.rb:370:in 'timeout'
/artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/kernel/async.rb:26:in 'Async'
/artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/task.rb:215:in 'async'
/artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/task.rb:189:in 'run'
/artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/task.rb:421:in 'schedule'
/artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/scheduler.rb:167:in 'resume'
/artifacts/bundle/ruby/3.4.0+0/gems/async-2.14.2/lib/async/scheduler.rb:167:in 'resume'

-- Threading information ---------------------------------------------------
Total ractor count: 1
Ruby thread count for this ractor: 16

-- Machine register context ------------------------------------------------
 RIP: 0x00005b77a8f4971f RBP: 0x00007fd19fa2c500 RSP: 0x00007fd23a74b970
 RAX: 0x00007fd25504d000 RBX: 0x00007fd1feb42800 RCX: 0x0000000000000000
 RDX: 0x00007fd255068280 RDI: 0x00007fd1febd44a0 RSI: 0x00007fd19fa3a7a0
  R8: 0x0000000000000001  R9: 0x00007fd207a14c40 R10: 0x00007fd16c93d0b0
 R11: 0x0000000000000002 R12: 0x00007fd19fa3a580 R13: 0x00007fd23a74b920
 R14: 0x00007fd23a74b920 R15: 0x00005b77a91b12a0 EFL: 0x0000000000010287

-- C level backtrace information -------------------------------------------
/usr/local/ruby/bin/ruby(rb_print_backtrace+0x14) [0x5b77a8dbe0c1] /tmp/ruby-build/ruby-3.4.0-pshopify-preview5/vm_dump.c:820
/usr/local/ruby/bin/ruby(rb_vm_bugreport) /tmp/ruby-build/ruby-3.4.0-pshopify-preview5/vm_dump.c:1151
/usr/local/ruby/bin/ruby(rb_bug_for_fatal_signal+0xfc) [0x5b77a8f768dc] /tmp/ruby-build/ruby-3.4.0-pshopify-preview5/error.c:1087
/usr/local/ruby/bin/ruby(sigsegv+0x46) [0x5b77a8d05be6] /tmp/ruby-build/ruby-3.4.0-pshopify-preview5/signal.c:929
/lib/x86_64-linux-gnu/libc.so.6(0x7fd255732520) [0x7fd255732520]
/usr/local/ruby/bin/ruby(rb_current_execution_context+0x0) [0x5b77a8f4971f] /tmp/ruby-build/ruby-3.4.0-pshopify-preview5/cont.c:2677
/usr/local/ruby/bin/ruby(fiber_current) /tmp/ruby-build/ruby-3.4.0-pshopify-preview5/cont.c:2067
/usr/local/ruby/bin/ruby(fiber_switch) /tmp/ruby-build/ruby-3.4.0-pshopify-preview5/cont.c:2771
/usr/local/ruby/bin/ruby(rb_fiber_transfer) /tmp/ruby-build/ruby-3.4.0-pshopify-preview5/cont.c:2795
/usr/local/ruby/bin/ruby(rb_ensure+0x10c) [0x5b77a8baa5ac] /tmp/ruby-build/ruby-3.4.0-pshopify-preview5/eval.c:1000
/artifacts/bundle/ruby/3.4.0+0/gems/io-event-1.6.5/lib/IO_Event.so(IO_Event_Selector_resume+0x9c) [0x7fd21526bf2c] ./io/event/selector/selector.c:234
/artifacts/bundle/ruby/3.4.0+0/gems/io-event-1.6.5/lib/IO_Event.so(IO_Event_Selector_resume) (null):0
[0x5b77aaae022f]
@ioquatix
Copy link
Member

ioquatix commented Sep 3, 2024

Thanks for your report, I will take a look tomorrow.

If I prepare a debug release, would you be able to test it?

@casperisfine
Copy link
Author

Of io-event possibly yes. Of Ruby, no I'm afraid I can't.

@ioquatix
Copy link
Member

ioquatix commented Sep 3, 2024

I have a couple of questions to help me understand the nature of the problem.

  1. Is this issue occurring inside Falcon or are you using Async{...} inside an existing (alternative) server? Can you provide any details?
  2. Can you please elaborate on the setup and/or organization of threads and if you know, the specific work/operation that is causing this error e.g. async-http client request, async-redis, some other native library.
  3. Are you using connection pooling for the above, and if so, what library is doing that? e.g. async-pool, connection-pool, ActiveRecord's own connection pooling, etc.
  4. Are you able to clarify "happening on a fairly frequent basis (dozens of times a day)." - as in a percentage of total requests/operations - it would help me to have an idea how rare this is so I know how deep I may have to dig.

@casperisfine
Copy link
Author

casperisfine commented Sep 4, 2024

I have a couple of questions to help me understand the nature of the problem.

Ok, but keep in mind I'm here with some sort of SRE hat. I don't know the code using async at all, I'm just seeing segfaults. And since it's async I don't even have a backtrace to see if there is a pattern in the code that trigger this.

Is this issue occurring inside Falcon or are you using Async{...} inside an existing (alternative) server? Can you provide any details?

Yeah, it's running inside Pitchfork, in a fairly gigantic application.

Can you please elaborate on the setup and/or organization of threads and if you know, the specific work/operation that is causing this error

Yeah, unfortunately I don't know that, and as mentioned I can't just rely on the backtrace to point me at the calling code like I usually do for this sort of issues. All I know is that it's in a generic /graphql endpoint, so it could be anything... All I can say is the gems used:

$ rg -F async Gemfile.lock 
720:      async
1905:      async-io (>= 1.41.0)
2367:      async (~> 2.12)
2368:      async-http (>= 0.67, < 0.70)
2600:    async (2.14.2)
2604:    async-http (0.69.0)
2605:      async (>= 2.10.2)
2606:      async-pool (~> 0.7)
2613:    async-io (1.43.2)
2614:      async
2615:    async-pool (0.8.0)
2616:      async (>= 1.25)

So I think it's a pool of HTTP connections? If I grep for Async do I find a couple callsites:

        results = Async do |task|
          tasks = batch_entries.map do |batch_entry|
            task.async do
              run_with_cache(batch_entry.context, batch_entry.http_request)
            end
          end

          tasks.map(&:wait)
        end.wait
            Async do |task_scheduler|
              tasks = scope_key_to_missing_fetch_results.map do |scope_key, missing_fetch_results|
                task_scheduler.async do
                  perform_fetch(scope_key: scope_key, fetch_results: missing_fetch_results)
                rescue StandardError => e
                  missing_fetch_results.each do |fetch_result|
                    fetch_result.error_result = e.message
                  end
                  log_fetch_error(e, message: "[REDACTED]")
                end
              end
              tasks.map(&:wait)
            end.wait

That seems to be all what it's used for.

As for the organization of threads, as you can see in the crash report there's 16 Ruby threads, but also a whole lot of native threads started by gems like grpc. But these are mostly all "service threads" AFAICT Async is only called from the main thread.

Are you able to clarify "happening on a fairly frequent basis (dozens of times a day)." - as in a percentage of total requests/operations

Well, I don't know how often this codepath is called, but probably quite a lot, dozens of times a day isn't a whole lot for us, probably 0.000001% or something like that, but this was to say it's not a once in a blue-moon issue.

@ioquatix
Copy link
Member

ioquatix commented Sep 4, 2024

Thanks, that's helpful.

@trevorturk
Copy link

The mention of /graphql makes me want to call attention to the new AsyncDataloader stuff in graphql-ruby, just in case that might be relevant. I'm not sure if @rmosolgo might be worth looping in, in case he has any ideas. I doubt Shopify is using AsyncDataloader yet, but you never know! (Apologies if this is unhelpful / a red herring...)

@casperisfine
Copy link
Author

It's not specific to graphql, it's also failing on some non graphql endpoints.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants