Closed Bug 1966936 Opened 2 months ago Closed 2 months ago

Crash in [@ mozilla::ipc::MessageChannel::Send | mozilla::ipc::IProtocol::ChannelSend | mozilla::dom::locks::PLockRequestParent::SendResolve]

Categories

(Core :: IPC, defect)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
141 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox-esr128 --- unaffected
firefox138 --- wontfix
firefox139 --- wontfix
firefox140 --- fixed
firefox141 --- fixed

People

(Reporter: mccr8, Assigned: nika)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/33ff6e48-596d-44b3-8ad2-79e430250516

MOZ_CRASH Reason:

MOZ_RELEASE_ASSERT(aMsg->routing_id() != MSG_ROUTING_NONE)

Top 10 frames:

0  xul.dll  mozilla::ipc::MessageChannel::Send(mozilla::UniquePtr<IPC::Message, mozilla::...  ipc/glue/MessageChannel.cpp:716
1  xul.dll  mozilla::ipc::IProtocol::ChannelSend(mozilla::UniquePtr<IPC::Message, mozilla...  ipc/glue/ProtocolUtils.cpp:523
1  xul.dll  mozilla::dom::locks::PLockRequestParent::SendResolve(mozilla::dom::LockMode c...  ipc/ipdl/PLockRequestParent.cpp:117
2  xul.dll  mozilla::dom::locks::LockManagerParent::ProcessRequestQueue(nsTArray<RefPtr<m...  dom/locks/LockManagerParent.cpp:98
3  xul.dll  mozilla::dom::locks::LockManagerParent::RecvPLockRequestConstructor(mozilla::...  dom/locks/LockManagerParent.cpp:174
4  xul.dll  mozilla::dom::locks::PLockManagerParent::OnMessageReceived(IPC::Message const&)  ipc/ipdl/PLockManagerParent.cpp:278
5  xul.dll  mozilla::ipc::PBackgroundParent::OnMessageReceived(IPC::Message const&)  ipc/ipdl/PBackgroundParent.cpp:2105
6  xul.dll  mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecyc...  ipc/glue/MessageChannel.cpp:1789
6  xul.dll  mozilla::ipc::MessageChannel::DispatchMessage(mozilla::ipc::ActorLifecyclePro...  ipc/glue/MessageChannel.cpp:1716
6  xul.dll  mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::ActorLifecycleProxy*, ...  ipc/glue/MessageChannel.cpp:1507

Maybe we're trying to send on an actor that hasn't finished setting up or has been torn down?

It kind of looks like this started showing up recently across all channels, so maybe a site change? Or a crash generation change?

A few comments about Firefox having crashed while they were away from the computer, and a few about it crashing while watching a video. Nothing particularly notable in the URLs.

Kagami, could you look at this?

Flags: needinfo?(krosylight)

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release
  • Top 20 desktop browser crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

So ChannelSend has CanSend check but the send still fails? The assertion is added in bug 1936917 by Nika, so probably she knows better than me 👀

Flags: needinfo?(krosylight) → needinfo?(nika)
See Also: → 1936917

This assertion is checking that in the header of the message, the routing ID is not kint32min, which is used to indicate an uninitialized routing ID. This routing ID should have been initialized in PLockRequest::SendResolve, when the message was created. This assertion failing implies Id() == kint32min, which is strange. If an actor has been closed or destroyed, it'll have an ID of 1 (for kFreedActorId).

The tags being used here are generated by IToplevelProtocol::NextId. This is designed a bit weirdly because at the time it was written, we had 3 different "sides" on the protocol due to webreplay (bug 1512085), but that code is gone now. Due to the release assertion there, the largest value mLastLocalId could have is (1 << 29) - 1, which in turn means the largest ID possible on a child side actor is 1 << 31, which is exactly kint32min.

This code has never really been written with much thought to overflow (beyond the assertion in the code), as it was assumed we'd never have an actor which manages ~500 million different actors over its lifetime, but it seems vaguely like that has happened here? I suppose makes sense that it's possible for this to happen with LockManager over a long period of time I suppose, but exhausting the ID counter in as little as 3 hours of uptime (https://crash-stats.mozilla.org/report/index/dcdad739-6065-4d97-99cc-06f700250516) feels surprising.

Probably wouldn't be too difficult, just a bit tedious, to switch all of these routing & actor IDs over to be 64-bit integers, reducing the likelihood that we can overflow them like this? Could also change the assertion in NextID to fire earlier, such that it's not possible to generate the invalid ID (though I would expect that would start crashing in the content process instead?)

Perhaps there's something else I'm missing which happened recently as well? I don't know why this would've surged in frequency so much so recently, as the lock manager code seems like it's been around for a bit.

Flags: needinfo?(nika)

This changes a few IDs which have historically been 32-bit integers to
instead be 64-bit integers, reducing the chance of the value
overflowing.

Assignee: nobody → nika
Status: NEW → ASSIGNED
Severity: -- → S3

I guess we can call this IPC if we're hitting some internal IPC limit, and the fix is increasing that limit. Thanks, Nika!

Component: DOM: Core & HTML → IPC

I don't see crashes on any versions earlier than 135. I'll leave everything as "affected" as this is nominally a topcrash but I don't know if the release volume is really high enough for a dot release or whatever. On the other hand, given how it started showing up across a bunch of branches it is possible we're in the early stages of some A/B test that could get scaled up.

Pushed by nlayzell@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/248b373768c1 Switch various IPC IDs to 64-bits, r=ipc-reviewers,mccr8,jld
Pushed by nfay@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/14dddd29370d Revert "Bug 1966936 - Switch various IPC IDs to 64-bits, r=ipc-reviewers,mccr8,jld" for causing multiple build bustages

Backed out for causing multiple build bustages

Backout link

Push with failures

Failure log

Flags: needinfo?(nika)

Oops. Not sure how I missed this in my local builds.

Flags: needinfo?(nika)
Pushed by nlayzell@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e774648b9b92 Switch various IPC IDs to 64-bits, r=ipc-reviewers,mccr8,jld
Pushed by imoraru@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9a7ffcf7876e Revert "Bug 1966936 - Switch various IPC IDs to 64-bits, r=ipc-reviewers,mccr8,jld" for causing build bustages on IPCFuzzController.cpp.

Revert for causing build bustages on IPCFuzzController.cpp.

[task 2025-05-28T17:24:56.488Z] 17:24:56     INFO -  gmake[4]: Entering directory '/builds/worker/workspace/obj-build/tools/fuzzing/ipc'
[task 2025-05-28T17:24:56.491Z] 17:24:56     INFO -  /builds/worker/fetches/sccache/sccache /builds/worker/fetches/afl-instrumentation/bin/afl-clang-fast++ --sysroot /builds/worker/fetches/sysroot-x86_64-linux-gnu -o IPCFuzzController.o -c  -I/builds/worker/workspace/obj-build/dist/stl_wrappers -I/builds/worker/workspace/obj-build/dist/system_wrappers -include /builds/worker/checkouts/gecko/config/gcc_hidden.h -fstrict-flex-arrays=1 -D_FORTIFY_SOURCE=0 -fno-common -DNDEBUG=1 -DTRIMMED=1 -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -DMOZ_SUPPORT_LEAKCHECKING -DSTATIC_EXPORTABLE_JS_API -I/builds/worker/checkouts/gecko/tools/fuzzing/ipc -I/builds/worker/workspace/obj-build/tools/fuzzing/ipc -I/builds/worker/checkouts/gecko/dom/base -I/builds/worker/checkouts/gecko/dom/ipc -I/builds/worker/workspace/obj-build/ipc/ipdl/_ipdlheaders -I/builds/worker/checkouts/gecko/ipc/chromium/src -I/builds/worker/workspace/obj-build/dist/include -I/builds/worker/workspace/obj-build/dist/include/nspr -I/builds/worker/workspace/obj-build/dist/include/nss -DMOZILLA_CLIENT -include /builds/worker/workspace/obj-build/mozilla-config.h -fsanitize=address -fsanitize=bool,bounds,enum,function,integer-divide-by-zero,pointer-overflow,return,vla-bound,object-size -fno-sanitize-recover=bool,bounds,enum,function,integer-divide-by-zero,pointer-overflow,return,vla-bound,object-size -fsanitize-blacklist=/builds/worker/workspace/obj-build/ubsan_blacklist.txt -fno-rtti -pthread -fno-sized-deallocation -fno-aligned-new -ffunction-sections -fdata-sections -fno-math-errno -fno-exceptions -fPIC -fcrash-diagnostics-dir=/builds/worker/artifacts -gline-tables-only -Xclang -load -Xclang /builds/worker/workspace/obj-build/build/clang-plugin/libclang-plugin.so -Xclang -add-plugin -Xclang moz-check -O2 -fno-omit-frame-pointer -funwind-tables -Werror -Wall -Wbitfield-enum-conversion -Wempty-body -Wformat-type-confusion -Wignored-qualifiers -Wpointer-arith -Wshadow-field-in-constructor-modified -Wsign-compare -Wtautological-constant-in-range-compare -Wtype-limits -Wno-error=tautological-type-limit-compare -Wunreachable-code -Wunreachable-code-return -Wunused-but-set-parameter -Wno-invalid-offsetof -Wclass-varargs -Wempty-init-stmt -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wloop-analysis -Wno-range-loop-analysis -Wenum-compare-conditional -Wenum-float-conversion -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-enum-enum-conversion -Wno-deprecated-this-capture -Wcomma -Wimplicit-fallthrough -Wstring-conversion -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=free-nonheap-object -Wno-error=atomic-alignment -Wno-error=deprecated-builtins -Wformat -Wformat-security -Wno-psabi -Wthread-safety -Wno-error=builtin-macro-redefined -Wno-vla-cxx-extension -Wno-unknown-warning-option -fno-strict-aliasing -ffp-contract=off  -MD -MP -MF .deps/IPCFuzzController.o.pp   /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp
[task 2025-05-28T17:24:56.492Z] 17:24:56    ERROR -  /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp:245:27: error: result of comparison of constant 'MSG_ROUTING_CONTROL' (9223372036854775807) with expression of type 'int' is always false [-Werror,-Wtautological-constant-out-of-range-compare]
[task 2025-05-28T17:24:56.492Z] 17:24:56     INFO -    245 |         (maybeLastActorId == MSG_ROUTING_CONTROL &&
[task 2025-05-28T17:24:56.492Z] 17:24:56     INFO -        |          ~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~~~~~~
[task 2025-05-28T17:24:56.492Z] 17:24:56    ERROR -  /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp:330:28: error: format specifies type 'int' but the argument has type 'ActorId' (aka 'long') [-Werror,-Wformat]
[task 2025-05-28T17:24:56.493Z] 17:24:56     INFO -    329 |     MOZ_FUZZING_NYX_PRINTF("INFO: [OnActorConnected] ActorID %d Protocol: %s\n",
[task 2025-05-28T17:24:56.493Z] 17:24:56     INFO -        |                                                              ~~
[task 2025-05-28T17:24:56.493Z] 17:24:56     INFO -        |                                                              %ld
[task 2025-05-28T17:24:56.493Z] 17:24:56     INFO -    330 |                            protocol->Id(), protocol->GetProtocolName());
[task 2025-05-28T17:24:56.493Z] 17:24:56     INFO -        |                            ^~~~~~~~~~~~~~
[task 2025-05-28T17:24:56.494Z] 17:24:56     INFO -  /builds/worker/workspace/obj-build/dist/include/mozilla/Fuzzing.h:59:58: note: expanded from macro 'MOZ_FUZZING_NYX_PRINTF'
[task 2025-05-28T17:24:56.494Z] 17:24:56     INFO -     59 |         snprintf(msgbuf, sizeof(msgbuf) - 1, "" aFormat, __VA_ARGS__); \
[task 2025-05-28T17:24:56.494Z] 17:24:56     INFO -        |                                                 ~~~~~~~  ^~~~~~~~~~~
[task 2025-05-28T17:24:56.500Z] 17:24:56    ERROR -  /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp:330:28: error: format specifies type 'int' but the argument has type 'ActorId' (aka 'long') [-Werror,-Wformat]
[task 2025-05-28T17:24:56.500Z] 17:24:56     INFO -    329 |     MOZ_FUZZING_NYX_PRINTF("INFO: [OnActorConnected] ActorID %d Protocol: %s\n",
[task 2025-05-28T17:24:56.500Z] 17:24:56     INFO -        |                                                              ~~
[task 2025-05-28T17:24:56.501Z] 17:24:56     INFO -        |                                                              %ld
[task 2025-05-28T17:24:56.501Z] 17:24:56     INFO -    330 |                            protocol->Id(), protocol->GetProtocolName());
[task 2025-05-28T17:24:56.501Z] 17:24:56     INFO -        |                            ^~~~~~~~~~~~~~
[task 2025-05-28T17:24:56.502Z] 17:24:56     INFO -  /builds/worker/workspace/obj-build/dist/include/mozilla/Fuzzing.h:62:34: note: expanded from macro 'MOZ_FUZZING_NYX_PRINTF'
[task 2025-05-28T17:24:56.502Z] 17:24:56     INFO -     62 |         fprintf(stderr, aFormat, __VA_ARGS__);                         \
[task 2025-05-28T17:24:56.502Z] 17:24:56     INFO -        |                         ~~~~~~~  ^~~~~~~~~~~
[task 2025-05-28T17:24:56.503Z] 17:24:56    ERROR -  /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp:339:13: error: format specifies type 'int' but the argument has type 'ActorId' (aka 'long') [-Werror,-Wformat]
[task 2025-05-28T17:24:56.503Z] 17:24:56     INFO -    337 |             "INFO: [OnActorConnected] ActorID %d Protocol: %s matches "
[task 2025-05-28T17:24:56.503Z] 17:24:56     INFO -        |                                               ~~
[task 2025-05-28T17:24:56.504Z] 17:24:56     INFO -        |                                               %ld
[task 2025-05-28T17:24:56.504Z] 17:24:56     INFO -    338 |             "target.\n",
[task 2025-05-28T17:24:56.504Z] 17:24:56     INFO -    339 |             protocol->Id(), protocol->GetProtocolName());
[task 2025-05-28T17:24:56.504Z] 17:24:56     INFO -        |             ^~~~~~~~~~~~~~
[task 2025-05-28T17:24:56.505Z] 17:24:56     INFO -  /builds/worker/workspace/obj-build/dist/include/mozilla/Fuzzing.h:59:58: note: expanded from macro 'MOZ_FUZZING_NYX_PRINTF'
[task 2025-05-28T17:24:56.505Z] 17:24:56     INFO -     59 |         snprintf(msgbuf, sizeof(msgbuf) - 1, "" aFormat, __VA_ARGS__); \
[task 2025-05-28T17:24:56.505Z] 17:24:56     INFO -        |                                                 ~~~~~~~  ^~~~~~~~~~~
[task 2025-05-28T17:24:56.506Z] 17:24:56    ERROR -  /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp:339:13: error: format specifies type 'int' but the argument has type 'ActorId' (aka 'long') [-Werror,-Wformat]
[task 2025-05-28T17:24:56.506Z] 17:24:56     INFO -    337 |             "INFO: [OnActorConnected] ActorID %d Protocol: %s matches "
[task 2025-05-28T17:24:56.506Z] 17:24:56     INFO -        |  
<...>
Flags: needinfo?(nika)
Pushed by nlayzell@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/328ef07b7a7a Switch various IPC IDs to 64-bits, r=ipc-reviewers,mccr8,jld
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 141 Branch
Flags: needinfo?(nika)

The patch landed in nightly and beta is affected.
:nika, is this bug important enough to require an uplift?

For more information, please visit BugBot documentation.

Flags: needinfo?(nika)

Comment on attachment 9489752 [details]
Bug 1966936 - Switch various IPC IDs to 64-bits, r=#ipc-reviewers!

Beta/Release Uplift Approval Request

  • User impact if declined/Reason for urgency: I'm not sure this bug warrants an uplift. Based on the comments in the first part, it seemed like we were exhausting some IPC identifier counters on the backend due to creating a very large number of IPC actors, leading to the counters wrapping around and the browser crashing. This patch is fairly safe to uplift (it is just changing the width of these counters to 64 bits).

The crash rates on this particular crash seem to have dropped in release & beta independently of this change, which suggests the origin of this issue (perhaps a major website having a WebLock bug?) has disappeared. Because we don't seem to be seeing crash volume for this signature as of recently it may not be worth uplifting this change.

  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch changes the width of various IPC identifiers to 64 bits. The biggest risk is if there are consumers of the IDs which are missed in the uplift and truncate the ID to 32 bits, which should only be an issue if over 1 billion actors are created.
  • String changes made/needed:
  • Is Android affected?: No
Flags: needinfo?(nika)
Attachment #9489752 - Flags: approval-mozilla-beta?

I like the idea of taking this on Beta as a hardening step for ESR140 at least.

Comment on attachment 9489752 [details]
Bug 1966936 - Switch various IPC IDs to 64-bits, r=#ipc-reviewers!

Approved for 140.0b5.

Attachment #9489752 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Duplicate of this bug: 1972730

Copying crash signatures from duplicate bugs.

Crash Signature: [@ mozilla::ipc::MessageChannel::Send | mozilla::ipc::IProtocol::ChannelSend | mozilla::dom::locks::PLockRequestParent::SendResolve] → [@ mozilla::ipc::MessageChannel::Send | mozilla::ipc::IProtocol::ChannelSend | mozilla::dom::locks::PLockRequestParent::SendResolve] [@ mozilla::ipc::IToplevelProtocol::NextId]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: