The silph title state machine (tid13) blocked on event 0x10a0, never signaled. Root: the event's producer chain runs on the silph worker (entry 0x821C4AD0, our tid14), which was starved. tid14 shares a HW slot with a guest spinlock/ barrier participant (sub_824D1328, entry 0x824D2940) that busy-spins on the db16cyc hint `or r31,r31,r31` (encoding 0x7FFFFB78) at 0x824D140C. Under our round-robin lockstep the spinner consumed its whole block every round and starved the co-located tid14 (only 9 progress hits over 200M instr) — so the producer never reached the event-create/duplicate/signal dance the canary oracle performs (handle F80000E8 set by the submitter F8000044 via a duplicated handle). Fix (canary-faithful): recognize the db16cyc spin hint exactly as canary's InstrEmit_orx does (code 0x7FFFFB78 -> DelayExecution) and surface it as a new StepResult::Yield. The scheduler's yield_current() promotes every Ready peer on the slot past STARVE_LIMIT so begin_slot_visit picks one next round, then they reset and the spinner reclaims the slot — fair alternation, no priority inversion, pure function of slot state (deterministic). Result (lockstep, cache-persist, -n 200M): tid14 progresses past its old stall into a real wait; tid13 advances off 0x10a0 to a new event; hub/submitter re-enter their wait loops. imports 280k->592k, packets 124M->164M, swaps 1->2. draws still 0 (the splash's first draw is a further-upstream gate). Determinism preserved (two cold n50m runs byte-identical). n50m golden re-baselined (imports 90296->339766, swaps 1->2; draws unchanged 0). n2m golden unchanged (db16cyc not reached in first 2M). Tests 670/670. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
9.5 KiB
AUDIT-059 — handle disambiguation (iterate 2.BD)
Date: 2026-06-06. Engines: ours target/release/xenia-rs -n 50M (3.9 s wall, 50M instr, 40k import calls), canary Wine xenia_canary.exe --mute=true --audit_handle_lifecycle=true (~35 s wall, 34k log lines, 0 fatals).
Verdict — HANDOFF's wedge handles are stale
HANDOFF said: "opt_callback signals 0x108c, tid=1 wedges on 0x10e8." Both IDs are now <UNCREATED> in ours, along with 0x1090 / 0x10dc / 0x10fc / 0x1104 (also in HANDOFF's adjacent list). The allocation order shifted since that snapshot.
Real wedges, current code state
| Handle | Kind | Engine state | Waiter | Notes |
|---|---|---|---|---|
| 0x12a4 | <UNCREATED> |
<AUDIT_BLIND>, waiters=1 |
tid=1 main, pc=0x824ac578 | Wait went via do_wait_single but creation never hit NtCreateEvent — KeInitializeEvent path. This is the iterate-2.BC wedge (recorded as "0x10e8" in HANDOFF — same site, different ID). |
| 0x12ac | Event/Auto | <NO_SIGNALS_DESPITE_WAITS>, waiters=1 |
tid=13 silph UI cluster, pc=0x824ac578 lr=0x821cb1e0 | Frame trail: 0x821cb1e0 → 0x821cbae0 → 0x821cc454 → 0x821c4f18 → 0x82174a80. Frames 3-5 carry silph::UImpl@GamePart_Title / silph::VGamePart_Title vtables — audit-049's cluster, unchanged. |
| 0x12b8 | Event/Auto | NO_SIGNALS, waiters=1 | (tid TBD) | Sibling, 0xC bytes from 0x12ac. |
| 0x1020 | Event/Manual | NO_SIGNALS, waiters=1 | — | γ-class. |
| 0x1040 | Event/Auto | NO_SIGNALS, waits=32 (hot poll) | — | Heavy wait, no signal. |
| 0x10a8 | Event/Auto | NO_SIGNALS, waits=7 | — | γ-class. |
| 0x10e4 | Event/Manual | NO_SIGNALS, waiters=1, waits=2 | — | γ-class. |
Working handles (sanity baseline): 0x1028 (Sema, 8 waits / 7 signals / 7 wakes), 0x10d0 (Sema, 2 waits / 1 signal / 1 wake), 0x10f0 (Event/Auto, 1/1/1 ✓ marked <SUSPECT> but actually fine), 0x10e0 (Event/Manual, 32 primary signals from somewhere).
GPU interrupt delivery — the iterate-2.BC delta confirmed
| Engine | gpu.interrupt.delivered (vsync) | EmulateCPInterruptDPC / vblank pump |
|---|---|---|
| ours | 54 (source=0) + 1 (source=1) | — |
| canary | — | 4712 in 30 s ≈ 157 Hz |
~87× ratio. Confirms HANDOFF's diagnosis: ours' victim-thread injector dies once guest threads all park; canary's host frame-limiter thread keeps firing regardless.
Canary signaler attribution
Top KeSetEvent guest_ptrs in canary (30 s window):
| guest_ptr | KeSetEvent fires | Inferred role |
|---|---|---|
0x828A3254 |
5729 | Audio host-pump worker (per AUDIT-032: r3=0x828A3230 region) |
0x828A3244 |
5728 | Audio host-pump sibling |
0x828A3244 + 16-byte stride |
— | Static XEX-image audio event struct |
0xBCE25234 |
1301 | silph UI cluster PKEVENT (heap-allocated, 0x10 stride). Likely ours' 0x12ac analog. |
0xBCE25214 / 0xBCE25244 / 0xBCE25224 |
648 / 603 / 603 | Sibling silph UI PKEVENTs (0x10 stride struct). Likely ours' 0x12a4 / 0x12b8 / 0x1040 analogs. |
Ours signals every one of those equivalents 0 times.
Round 2 — LR-extended probes name the producer
Extended the canary probes with guest-LR capture (5 sites in xboxkrnl_threading.cc, 10 LOC). Re-ran the harness. Now each KeSetEvent line carries the guest function that signaled the event. Result for the silph UI cluster:
| PKEVENT | KeSetEvent count | Producer LR(s) |
|---|---|---|
0xBCE25214 |
574 | 0x82508510 (single producer) |
0xBCE25224 |
565 | 0x82508358 (single producer) |
0xBCE25234 |
1153 | 0x82506C90 (579) + 0x82508524 (574) |
0xBCE25244 |
570 | 0x82506F9C (single producer) |
0xBCE25284 |
1 | 0x82507ABC (one-shot 5th-worker init?) |
All 6 producer LRs sit in 0x82506000–0x82509000. This is exactly the sub_825070F0 worker thread cluster that audit-057/058 already named:
audit-057: "sub_825070F0 (4 missing, initializes 4 workers w/ shared ctx 0xBCE25340, entries 0x82506528/58/88/B8)"
The 4 worker entries (0x82506528/58/88/B8) are inside sub_82506xxx — exactly where the producer LRs 0x82506C90/0x82506F9C live. The other producer LRs 0x825083xx / 0x825085xx are in downstream callees (workers call deeper code which itself calls KeSetEvent).
For comparison the audio host-pump pair gets a single sharp producer too:
0x828A3254× 5271 ←lr=0x824D2A440x828A3244× 5271 ←lr=0x824D292C
(These match AUDIT-032's PC 0x824D229C / r3=0x828A3230 region — already-understood audio host-pump.)
Verdict — 2.BE is INSUFFICIENT for the silph UI wedge
The silph UI PKEVENTs are signaled exclusively by threads spawned by sub_825070F0. Per audit-057/058, sub_825070F0 fires 0× in ours — those 4 worker threads never spawn. Therefore the PKEVENTs are never signaled. Therefore tid=13 (0x12ac in ours) wedges forever.
sub_825070F0's call chain is gated by the audit-009 "unreachability island" — a CRT-driven fnptr-array bootstrap that ours fails to enumerate. VSync delivery is irrelevant to that bootstrap; the host frame-limiter thread does not drive CRT initializers.
Therefore:
- 2.BE alone CANNOT unwedge tid=13. It will close the 54-vs-4712 VSync delivery gap and may unblock things downstream of vsync, but the silph UI wedge has an independent missing-signaler root cause.
- 2.BE may still unwedge tid=1 main on
0x12a4— that wait went viaKeInitializeEvent(handle never hitNtCreateEventin ours, hence<AUDIT_BLIND>). Whether0x12a4's signaler depends on VSync is unknown without further probing.
Implications for next moves
A single fix won't take us to draws > 0. We need at least two:
- 2.BE (VSync delivery) — still worth landing for the architectural correctness it brings, AND because it's the only fix that can unwedge tid=1 main's
0x12a4if that's vsync-derived. ~60–80 LOC per Agent C's plan. - 2.BF (sub_825070F0 activation) — this is the audit-058 unfinished business. Options:
- (a) Static work: trace canary's CRT-driven fnptr-array path that activates the silph UI bootstrap; backport the missing init into ours. High info, slow. Requires more probing.
- (b) Direct synthetic spawn: ours injects host-side
ExCreateThreadcalls for the 4 worker entries at boot completion, mirroring AUDIT-048's audio-host-pump precedent. Pragmatic; ~40 LOC; risks getting context (0xBCE25340) wrong.
A possible third move:
- Re-probe with LR on Wait paths (we already added it but didn't grep for it) — to tell us whether tid=1's wait on
0x12a4is the same LR assub_825070F0-chain or a totally different signaler. If different, it's a 3rd missing producer.
Round 4 — wait-side guest LR via one-frame back-chain walk
After fixing the PPC stack-walk offset (Xbox 360 stores saved LR at [prev_sp - 8], not the +4 AIX convention), wait-side LR comes through cleanly.
Canary's top wait sites:
| canary handle | wait count | guest_lr | LR region | mapping |
|---|---|---|---|---|
F800005C |
1635 | 0x8216EE14 |
kernel early-boot infra | unrelated |
F800000C |
1597 | 0x824AFFC4 |
xboxkrnl wrapper (scheduler / work-queue?) | unrelated |
F80000DC |
476 | 0x821C7D3C |
silph::UImpl/GamePart | = ours' 0x12ac silph UI wedge |
F80000B0 |
6 across | 0x821CBAE0 + 0x821CC19C + 0x822DFE2x/D0 |
exact match with audit-049's frame trail | sibling silph UI wait |
Identity proof: ours' audit-049 frame trail for the silph UI wedge was 0x821cb1e0 / 0x821cbae0 / 0x821cc454 / 0x821c4f18 / 0x82174a80. Round 4 captures 0x821CBAE0 and 0x821CC19C (adjacent PCs) as wait LRs in canary — same cluster, same code.
Refined verdict. ours' 0x12a4 (tid=1 main, AUDIT_BLIND) and 0x12ac (tid=13 silph UI) are 8 bytes apart — likely sibling KEVENT fields in the same silph UI struct. canary's analogs are in the F80000xx namespace, similarly clustered. The single fix that addresses both:
2.BF (b) — synthetic host-side spawn of
sub_825070F0's 4 workers at the audit-058-identified context (0xBCE25340), entries0x82506528/58/88/B8. Once those workers run, they signal the silph UI PKEVENT cluster, unwedging BOTH tid=1 main and tid=13 silph UI in one shot.
2.BE (host-driven VSync ISR delivery) becomes follow-on work after the UI bootstrap completes and frame pacing actually matters.
Open questions for iterate 2.BD′ / 2.BE planning
- Does 2.BE alone unwedge tid=13? Cheapest verification path: land 2.BE and re-run audit-059, see whether
0x12acsignal count goes 0 → non-zero. - What is the LR-pattern of canary's
KeSetEvent guest_ptr=0xBCE25234callers? The current probe doesn't capture LR — extending the cvar to do so on a filtered subset would let us name the producer function in canary's namespace. - Does the GPU frame-limiter's CP interrupt actually walk into the silph UI cluster? I.e., does
EmulateCPInterruptDPC→interrupt_callback→ guest code ever hitsub_821CB030or its callees? An LR probe insideEmulateCPInterruptDPCwould answer this.
Artifacts
canary.log2.2 MB / 34,095 lines / 32,977 AUDIT-HLC linescanary.stdout2.2 MB (duplicate of canary.log due to log_file fallback)canary.stderr8.4 KB (Wine diagnostics)ours.log479 lines (focus ledger + thread diagnostics + final state)ours.stderr317 lines (kernel-call counters)vkd3d-proton.cache.write15 KB (build artifact, ignored)
Commits in play (xenia-canary, fork-local only):
03362b59fcross-build-wine (cross-compile toolchain)d031d7c51audit-handle-lifecycle-probes (this audit's probes)