Crash in [@ mozilla::ipc::Message Channel::Send | mozilla::ipc::IProtocol::Channel Send | mozilla::dom::locks::PLock Request Parent::Send Resolve]
Categories
(Core :: IPC, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr115 | --- | unaffected |
firefox-esr128 | --- | unaffected |
firefox138 | --- | wontfix |
firefox139 | --- | wontfix |
firefox140 | --- | fixed |
firefox141 | --- | fixed |
People
(Reporter: mccr8, Assigned: nika)
References
Details
(Keywords: crash, topcrash)
Crash Data
Attachments
(1 file)
48 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details | Review |
Crash report: https://crash-stats.mozilla.org/report/index/33ff6e48-596d-44b3-8ad2-79e430250516
MOZ_CRASH Reason:
MOZ_RELEASE_ASSERT(aMsg->routing_id() != MSG_ROUTING_NONE)
Top 10 frames:
0 xul.dll mozilla::ipc::MessageChannel::Send(mozilla::UniquePtr<IPC::Message, mozilla::... ipc/glue/MessageChannel.cpp:716
1 xul.dll mozilla::ipc::IProtocol::ChannelSend(mozilla::UniquePtr<IPC::Message, mozilla... ipc/glue/ProtocolUtils.cpp:523
1 xul.dll mozilla::dom::locks::PLockRequestParent::SendResolve(mozilla::dom::LockMode c... ipc/ipdl/PLockRequestParent.cpp:117
2 xul.dll mozilla::dom::locks::LockManagerParent::ProcessRequestQueue(nsTArray<RefPtr<m... dom/locks/LockManagerParent.cpp:98
3 xul.dll mozilla::dom::locks::LockManagerParent::RecvPLockRequestConstructor(mozilla::... dom/locks/LockManagerParent.cpp:174
4 xul.dll mozilla::dom::locks::PLockManagerParent::OnMessageReceived(IPC::Message const&) ipc/ipdl/PLockManagerParent.cpp:278
5 xul.dll mozilla::ipc::PBackgroundParent::OnMessageReceived(IPC::Message const&) ipc/ipdl/PBackgroundParent.cpp:2105
6 xul.dll mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecyc... ipc/glue/MessageChannel.cpp:1789
6 xul.dll mozilla::ipc::MessageChannel::DispatchMessage(mozilla::ipc::ActorLifecyclePro... ipc/glue/MessageChannel.cpp:1716
6 xul.dll mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::ActorLifecycleProxy*, ... ipc/glue/MessageChannel.cpp:1507
Maybe we're trying to send on an actor that hasn't finished setting up or has been torn down?
Reporter | ||
Comment 1•2 months ago
|
||
It kind of looks like this started showing up recently across all channels, so maybe a site change? Or a crash generation change?
A few comments about Firefox having crashed while they were away from the computer, and a few about it crashing while watching a video. Nothing particularly notable in the URLs.
Comment hidden (duplicate) |
Comment 4•2 months ago
|
||
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on release
- Top 20 desktop browser crashes on beta
For more information, please visit BugBot documentation.
Comment 5•2 months ago
|
||
So ChannelSend has CanSend check but the send still fails? The assertion is added in bug 1936917 by Nika, so probably she knows better than me 👀
Assignee | ||
Comment 6•2 months ago
|
||
This assertion is checking that in the header of the message, the routing ID is not kint32min
, which is used to indicate an uninitialized routing ID. This routing ID should have been initialized in PLockRequest::SendResolve
, when the message was created. This assertion failing implies Id() == kint32min
, which is strange. If an actor has been closed or destroyed, it'll have an ID of 1
(for kFreedActorId
).
The tags being used here are generated by IToplevelProtocol::NextId
. This is designed a bit weirdly because at the time it was written, we had 3 different "sides" on the protocol due to webreplay (bug 1512085), but that code is gone now. Due to the release assertion there, the largest value mLastLocalId
could have is (1 << 29) - 1
, which in turn means the largest ID possible on a child side actor is 1 << 31, which is exactly kint32min
.
This code has never really been written with much thought to overflow (beyond the assertion in the code), as it was assumed we'd never have an actor which manages ~500 million different actors over its lifetime, but it seems vaguely like that has happened here? I suppose makes sense that it's possible for this to happen with LockManager
over a long period of time I suppose, but exhausting the ID counter in as little as 3 hours of uptime (https://crash-stats.mozilla.org/report/index/dcdad739-6065-4d97-99cc-06f700250516) feels surprising.
Probably wouldn't be too difficult, just a bit tedious, to switch all of these routing & actor IDs over to be 64-bit integers, reducing the likelihood that we can overflow them like this? Could also change the assertion in NextID
to fire earlier, such that it's not possible to generate the invalid ID (though I would expect that would start crashing in the content process instead?)
Perhaps there's something else I'm missing which happened recently as well? I don't know why this would've surged in frequency so much so recently, as the lock manager code seems like it's been around for a bit.
Assignee | ||
Comment 7•2 months ago
|
||
This changes a few IDs which have historically been 32-bit integers to
instead be 64-bit integers, reducing the chance of the value
overflowing.
Updated•2 months ago
|
Updated•2 months ago
|
Reporter | ||
Comment 8•2 months ago
|
||
I guess we can call this IPC if we're hitting some internal IPC limit, and the fix is increasing that limit. Thanks, Nika!
Reporter | ||
Comment 9•2 months ago
|
||
I don't see crashes on any versions earlier than 135. I'll leave everything as "affected" as this is nominally a topcrash but I don't know if the release volume is really high enough for a dot release or whatever. On the other hand, given how it started showing up across a bunch of branches it is possible we're in the early stages of some A/B test that could get scaled up.
Comment 10•2 months ago
|
||
Comment 11•2 months ago
|
||
Comment 12•2 months ago
|
||
Assignee | ||
Comment 13•2 months ago
|
||
Oops. Not sure how I missed this in my local builds.
Comment 14•2 months ago
|
||
Comment 15•2 months ago
|
||
Comment 16•2 months ago
|
||
Revert for causing build bustages on IPCFuzzController.cpp.
[task 2025-05-28T17:24:56.488Z] 17:24:56 INFO - gmake[4]: Entering directory '/builds/worker/workspace/obj-build/tools/fuzzing/ipc'
[task 2025-05-28T17:24:56.491Z] 17:24:56 INFO - /builds/worker/fetches/sccache/sccache /builds/worker/fetches/afl-instrumentation/bin/afl-clang-fast++ --sysroot /builds/worker/fetches/sysroot-x86_64-linux-gnu -o IPCFuzzController.o -c -I/builds/worker/workspace/obj-build/dist/stl_wrappers -I/builds/worker/workspace/obj-build/dist/system_wrappers -include /builds/worker/checkouts/gecko/config/gcc_hidden.h -fstrict-flex-arrays=1 -D_FORTIFY_SOURCE=0 -fno-common -DNDEBUG=1 -DTRIMMED=1 -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -DMOZ_SUPPORT_LEAKCHECKING -DSTATIC_EXPORTABLE_JS_API -I/builds/worker/checkouts/gecko/tools/fuzzing/ipc -I/builds/worker/workspace/obj-build/tools/fuzzing/ipc -I/builds/worker/checkouts/gecko/dom/base -I/builds/worker/checkouts/gecko/dom/ipc -I/builds/worker/workspace/obj-build/ipc/ipdl/_ipdlheaders -I/builds/worker/checkouts/gecko/ipc/chromium/src -I/builds/worker/workspace/obj-build/dist/include -I/builds/worker/workspace/obj-build/dist/include/nspr -I/builds/worker/workspace/obj-build/dist/include/nss -DMOZILLA_CLIENT -include /builds/worker/workspace/obj-build/mozilla-config.h -fsanitize=address -fsanitize=bool,bounds,enum,function,integer-divide-by-zero,pointer-overflow,return,vla-bound,object-size -fno-sanitize-recover=bool,bounds,enum,function,integer-divide-by-zero,pointer-overflow,return,vla-bound,object-size -fsanitize-blacklist=/builds/worker/workspace/obj-build/ubsan_blacklist.txt -fno-rtti -pthread -fno-sized-deallocation -fno-aligned-new -ffunction-sections -fdata-sections -fno-math-errno -fno-exceptions -fPIC -fcrash-diagnostics-dir=/builds/worker/artifacts -gline-tables-only -Xclang -load -Xclang /builds/worker/workspace/obj-build/build/clang-plugin/libclang-plugin.so -Xclang -add-plugin -Xclang moz-check -O2 -fno-omit-frame-pointer -funwind-tables -Werror -Wall -Wbitfield-enum-conversion -Wempty-body -Wformat-type-confusion -Wignored-qualifiers -Wpointer-arith -Wshadow-field-in-constructor-modified -Wsign-compare -Wtautological-constant-in-range-compare -Wtype-limits -Wno-error=tautological-type-limit-compare -Wunreachable-code -Wunreachable-code-return -Wunused-but-set-parameter -Wno-invalid-offsetof -Wclass-varargs -Wempty-init-stmt -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wloop-analysis -Wno-range-loop-analysis -Wenum-compare-conditional -Wenum-float-conversion -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-enum-enum-conversion -Wno-deprecated-this-capture -Wcomma -Wimplicit-fallthrough -Wstring-conversion -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=free-nonheap-object -Wno-error=atomic-alignment -Wno-error=deprecated-builtins -Wformat -Wformat-security -Wno-psabi -Wthread-safety -Wno-error=builtin-macro-redefined -Wno-vla-cxx-extension -Wno-unknown-warning-option -fno-strict-aliasing -ffp-contract=off -MD -MP -MF .deps/IPCFuzzController.o.pp /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp
[task 2025-05-28T17:24:56.492Z] 17:24:56 ERROR - /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp:245:27: error: result of comparison of constant 'MSG_ROUTING_CONTROL' (9223372036854775807) with expression of type 'int' is always false [-Werror,-Wtautological-constant-out-of-range-compare]
[task 2025-05-28T17:24:56.492Z] 17:24:56 INFO - 245 | (maybeLastActorId == MSG_ROUTING_CONTROL &&
[task 2025-05-28T17:24:56.492Z] 17:24:56 INFO - | ~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~
[task 2025-05-28T17:24:56.492Z] 17:24:56 ERROR - /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp:330:28: error: format specifies type 'int' but the argument has type 'ActorId' (aka 'long') [-Werror,-Wformat]
[task 2025-05-28T17:24:56.493Z] 17:24:56 INFO - 329 | MOZ_FUZZING_NYX_PRINTF("INFO: [OnActorConnected] ActorID %d Protocol: %s\n",
[task 2025-05-28T17:24:56.493Z] 17:24:56 INFO - | ~~
[task 2025-05-28T17:24:56.493Z] 17:24:56 INFO - | %ld
[task 2025-05-28T17:24:56.493Z] 17:24:56 INFO - 330 | protocol->Id(), protocol->GetProtocolName());
[task 2025-05-28T17:24:56.493Z] 17:24:56 INFO - | ^~~~~~~~~~~~~~
[task 2025-05-28T17:24:56.494Z] 17:24:56 INFO - /builds/worker/workspace/obj-build/dist/include/mozilla/Fuzzing.h:59:58: note: expanded from macro 'MOZ_FUZZING_NYX_PRINTF'
[task 2025-05-28T17:24:56.494Z] 17:24:56 INFO - 59 | snprintf(msgbuf, sizeof(msgbuf) - 1, "" aFormat, __VA_ARGS__); \
[task 2025-05-28T17:24:56.494Z] 17:24:56 INFO - | ~~~~~~~ ^~~~~~~~~~~
[task 2025-05-28T17:24:56.500Z] 17:24:56 ERROR - /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp:330:28: error: format specifies type 'int' but the argument has type 'ActorId' (aka 'long') [-Werror,-Wformat]
[task 2025-05-28T17:24:56.500Z] 17:24:56 INFO - 329 | MOZ_FUZZING_NYX_PRINTF("INFO: [OnActorConnected] ActorID %d Protocol: %s\n",
[task 2025-05-28T17:24:56.500Z] 17:24:56 INFO - | ~~
[task 2025-05-28T17:24:56.501Z] 17:24:56 INFO - | %ld
[task 2025-05-28T17:24:56.501Z] 17:24:56 INFO - 330 | protocol->Id(), protocol->GetProtocolName());
[task 2025-05-28T17:24:56.501Z] 17:24:56 INFO - | ^~~~~~~~~~~~~~
[task 2025-05-28T17:24:56.502Z] 17:24:56 INFO - /builds/worker/workspace/obj-build/dist/include/mozilla/Fuzzing.h:62:34: note: expanded from macro 'MOZ_FUZZING_NYX_PRINTF'
[task 2025-05-28T17:24:56.502Z] 17:24:56 INFO - 62 | fprintf(stderr, aFormat, __VA_ARGS__); \
[task 2025-05-28T17:24:56.502Z] 17:24:56 INFO - | ~~~~~~~ ^~~~~~~~~~~
[task 2025-05-28T17:24:56.503Z] 17:24:56 ERROR - /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp:339:13: error: format specifies type 'int' but the argument has type 'ActorId' (aka 'long') [-Werror,-Wformat]
[task 2025-05-28T17:24:56.503Z] 17:24:56 INFO - 337 | "INFO: [OnActorConnected] ActorID %d Protocol: %s matches "
[task 2025-05-28T17:24:56.503Z] 17:24:56 INFO - | ~~
[task 2025-05-28T17:24:56.504Z] 17:24:56 INFO - | %ld
[task 2025-05-28T17:24:56.504Z] 17:24:56 INFO - 338 | "target.\n",
[task 2025-05-28T17:24:56.504Z] 17:24:56 INFO - 339 | protocol->Id(), protocol->GetProtocolName());
[task 2025-05-28T17:24:56.504Z] 17:24:56 INFO - | ^~~~~~~~~~~~~~
[task 2025-05-28T17:24:56.505Z] 17:24:56 INFO - /builds/worker/workspace/obj-build/dist/include/mozilla/Fuzzing.h:59:58: note: expanded from macro 'MOZ_FUZZING_NYX_PRINTF'
[task 2025-05-28T17:24:56.505Z] 17:24:56 INFO - 59 | snprintf(msgbuf, sizeof(msgbuf) - 1, "" aFormat, __VA_ARGS__); \
[task 2025-05-28T17:24:56.505Z] 17:24:56 INFO - | ~~~~~~~ ^~~~~~~~~~~
[task 2025-05-28T17:24:56.506Z] 17:24:56 ERROR - /builds/worker/checkouts/gecko/tools/fuzzing/ipc/IPCFuzzController.cpp:339:13: error: format specifies type 'int' but the argument has type 'ActorId' (aka 'long') [-Werror,-Wformat]
[task 2025-05-28T17:24:56.506Z] 17:24:56 INFO - 337 | "INFO: [OnActorConnected] ActorID %d Protocol: %s matches "
[task 2025-05-28T17:24:56.506Z] 17:24:56 INFO - |
<...>
Comment 17•2 months ago
|
||
Comment 18•2 months ago
|
||
bugherder |
Assignee | ||
Updated•2 months ago
|
Comment 19•2 months ago
|
||
The patch landed in nightly and beta is affected.
:nika, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- See https://wiki.mozilla.org/Release_Management/Requesting_an_Uplift for documentation on how to request an uplift.
- If no, please set
status-firefox140
towontfix
.
For more information, please visit BugBot documentation.
Assignee | ||
Comment 20•2 months ago
|
||
Comment on attachment 9489752 [details]
Bug 1966936 - Switch various IPC IDs to 64-bits, r=#ipc-reviewers!
Beta/Release Uplift Approval Request
- User impact if declined/Reason for urgency: I'm not sure this bug warrants an uplift. Based on the comments in the first part, it seemed like we were exhausting some IPC identifier counters on the backend due to creating a very large number of IPC actors, leading to the counters wrapping around and the browser crashing. This patch is fairly safe to uplift (it is just changing the width of these counters to 64 bits).
The crash rates on this particular crash seem to have dropped in release & beta independently of this change, which suggests the origin of this issue (perhaps a major website having a WebLock bug?) has disappeared. Because we don't seem to be seeing crash volume for this signature as of recently it may not be worth uplifting this change.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The patch changes the width of various IPC identifiers to 64 bits. The biggest risk is if there are consumers of the IDs which are missed in the uplift and truncate the ID to 32 bits, which should only be an issue if over 1 billion actors are created.
- String changes made/needed:
- Is Android affected?: No
Comment 21•2 months ago
|
||
I like the idea of taking this on Beta as a hardening step for ESR140 at least.
Comment 22•2 months ago
|
||
Comment on attachment 9489752 [details]
Bug 1966936 - Switch various IPC IDs to 64-bits, r=#ipc-reviewers!
Approved for 140.0b5.
Updated•2 months ago
|
Comment 23•2 months ago
|
||
uplift |
Comment 25•1 month ago
|
||
Copying crash signatures from duplicate bugs.
Description
•