Files
xenia-rs/audit-runs/audit-059-handle-disambiguation/FINDINGS.md
MechaCat02 de21c7a544 [iterate-2G] db16cyc spin-hint cooperative yield: unblock title-screen 0x10a0 gate
The silph title state machine (tid13) blocked on event 0x10a0, never signaled.
Root: the event's producer chain runs on the silph worker (entry 0x821C4AD0,
our tid14), which was starved. tid14 shares a HW slot with a guest spinlock/
barrier participant (sub_824D1328, entry 0x824D2940) that busy-spins on the
db16cyc hint `or r31,r31,r31` (encoding 0x7FFFFB78) at 0x824D140C. Under our
round-robin lockstep the spinner consumed its whole block every round and
starved the co-located tid14 (only 9 progress hits over 200M instr) — so the
producer never reached the event-create/duplicate/signal dance the canary
oracle performs (handle F80000E8 set by the submitter F8000044 via a duplicated
handle).

Fix (canary-faithful): recognize the db16cyc spin hint exactly as canary's
InstrEmit_orx does (code 0x7FFFFB78 -> DelayExecution) and surface it as a new
StepResult::Yield. The scheduler's yield_current() promotes every Ready peer on
the slot past STARVE_LIMIT so begin_slot_visit picks one next round, then they
reset and the spinner reclaims the slot — fair alternation, no priority
inversion, pure function of slot state (deterministic).

Result (lockstep, cache-persist, -n 200M): tid14 progresses past its old stall
into a real wait; tid13 advances off 0x10a0 to a new event; hub/submitter
re-enter their wait loops. imports 280k->592k, packets 124M->164M, swaps 1->2.
draws still 0 (the splash's first draw is a further-upstream gate).

Determinism preserved (two cold n50m runs byte-identical). n50m golden
re-baselined (imports 90296->339766, swaps 1->2; draws unchanged 0). n2m
golden unchanged (db16cyc not reached in first 2M). Tests 670/670.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 10:38:17 +02:00

9.5 KiB
Raw Blame History

AUDIT-059 — handle disambiguation (iterate 2.BD)

Date: 2026-06-06. Engines: ours target/release/xenia-rs -n 50M (3.9 s wall, 50M instr, 40k import calls), canary Wine xenia_canary.exe --mute=true --audit_handle_lifecycle=true (~35 s wall, 34k log lines, 0 fatals).

Verdict — HANDOFF's wedge handles are stale

HANDOFF said: "opt_callback signals 0x108c, tid=1 wedges on 0x10e8." Both IDs are now <UNCREATED> in ours, along with 0x1090 / 0x10dc / 0x10fc / 0x1104 (also in HANDOFF's adjacent list). The allocation order shifted since that snapshot.

Real wedges, current code state

Handle Kind Engine state Waiter Notes
0x12a4 <UNCREATED> <AUDIT_BLIND>, waiters=1 tid=1 main, pc=0x824ac578 Wait went via do_wait_single but creation never hit NtCreateEventKeInitializeEvent path. This is the iterate-2.BC wedge (recorded as "0x10e8" in HANDOFF — same site, different ID).
0x12ac Event/Auto <NO_SIGNALS_DESPITE_WAITS>, waiters=1 tid=13 silph UI cluster, pc=0x824ac578 lr=0x821cb1e0 Frame trail: 0x821cb1e0 → 0x821cbae0 → 0x821cc454 → 0x821c4f18 → 0x82174a80. Frames 3-5 carry silph::UImpl@GamePart_Title / silph::VGamePart_Title vtables — audit-049's cluster, unchanged.
0x12b8 Event/Auto NO_SIGNALS, waiters=1 (tid TBD) Sibling, 0xC bytes from 0x12ac.
0x1020 Event/Manual NO_SIGNALS, waiters=1 γ-class.
0x1040 Event/Auto NO_SIGNALS, waits=32 (hot poll) Heavy wait, no signal.
0x10a8 Event/Auto NO_SIGNALS, waits=7 γ-class.
0x10e4 Event/Manual NO_SIGNALS, waiters=1, waits=2 γ-class.

Working handles (sanity baseline): 0x1028 (Sema, 8 waits / 7 signals / 7 wakes), 0x10d0 (Sema, 2 waits / 1 signal / 1 wake), 0x10f0 (Event/Auto, 1/1/1 ✓ marked <SUSPECT> but actually fine), 0x10e0 (Event/Manual, 32 primary signals from somewhere).

GPU interrupt delivery — the iterate-2.BC delta confirmed

Engine gpu.interrupt.delivered (vsync) EmulateCPInterruptDPC / vblank pump
ours 54 (source=0) + 1 (source=1)
canary 4712 in 30 s ≈ 157 Hz

~87× ratio. Confirms HANDOFF's diagnosis: ours' victim-thread injector dies once guest threads all park; canary's host frame-limiter thread keeps firing regardless.

Canary signaler attribution

Top KeSetEvent guest_ptrs in canary (30 s window):

guest_ptr KeSetEvent fires Inferred role
0x828A3254 5729 Audio host-pump worker (per AUDIT-032: r3=0x828A3230 region)
0x828A3244 5728 Audio host-pump sibling
0x828A3244 + 16-byte stride Static XEX-image audio event struct
0xBCE25234 1301 silph UI cluster PKEVENT (heap-allocated, 0x10 stride). Likely ours' 0x12ac analog.
0xBCE25214 / 0xBCE25244 / 0xBCE25224 648 / 603 / 603 Sibling silph UI PKEVENTs (0x10 stride struct). Likely ours' 0x12a4 / 0x12b8 / 0x1040 analogs.

Ours signals every one of those equivalents 0 times.

Round 2 — LR-extended probes name the producer

Extended the canary probes with guest-LR capture (5 sites in xboxkrnl_threading.cc, 10 LOC). Re-ran the harness. Now each KeSetEvent line carries the guest function that signaled the event. Result for the silph UI cluster:

PKEVENT KeSetEvent count Producer LR(s)
0xBCE25214 574 0x82508510 (single producer)
0xBCE25224 565 0x82508358 (single producer)
0xBCE25234 1153 0x82506C90 (579) + 0x82508524 (574)
0xBCE25244 570 0x82506F9C (single producer)
0xBCE25284 1 0x82507ABC (one-shot 5th-worker init?)

All 6 producer LRs sit in 0x825060000x82509000. This is exactly the sub_825070F0 worker thread cluster that audit-057/058 already named:

audit-057: "sub_825070F0 (4 missing, initializes 4 workers w/ shared ctx 0xBCE25340, entries 0x82506528/58/88/B8)"

The 4 worker entries (0x82506528/58/88/B8) are inside sub_82506xxx — exactly where the producer LRs 0x82506C90/0x82506F9C live. The other producer LRs 0x825083xx / 0x825085xx are in downstream callees (workers call deeper code which itself calls KeSetEvent).

For comparison the audio host-pump pair gets a single sharp producer too:

  • 0x828A3254 × 5271 ← lr=0x824D2A44
  • 0x828A3244 × 5271 ← lr=0x824D292C

(These match AUDIT-032's PC 0x824D229C / r3=0x828A3230 region — already-understood audio host-pump.)

Verdict — 2.BE is INSUFFICIENT for the silph UI wedge

The silph UI PKEVENTs are signaled exclusively by threads spawned by sub_825070F0. Per audit-057/058, sub_825070F0 fires 0× in ours — those 4 worker threads never spawn. Therefore the PKEVENTs are never signaled. Therefore tid=13 (0x12ac in ours) wedges forever.

sub_825070F0's call chain is gated by the audit-009 "unreachability island" — a CRT-driven fnptr-array bootstrap that ours fails to enumerate. VSync delivery is irrelevant to that bootstrap; the host frame-limiter thread does not drive CRT initializers.

Therefore:

  • 2.BE alone CANNOT unwedge tid=13. It will close the 54-vs-4712 VSync delivery gap and may unblock things downstream of vsync, but the silph UI wedge has an independent missing-signaler root cause.
  • 2.BE may still unwedge tid=1 main on 0x12a4 — that wait went via KeInitializeEvent (handle never hit NtCreateEvent in ours, hence <AUDIT_BLIND>). Whether 0x12a4's signaler depends on VSync is unknown without further probing.

Implications for next moves

A single fix won't take us to draws > 0. We need at least two:

  1. 2.BE (VSync delivery) — still worth landing for the architectural correctness it brings, AND because it's the only fix that can unwedge tid=1 main's 0x12a4 if that's vsync-derived. ~6080 LOC per Agent C's plan.
  2. 2.BF (sub_825070F0 activation) — this is the audit-058 unfinished business. Options:
    • (a) Static work: trace canary's CRT-driven fnptr-array path that activates the silph UI bootstrap; backport the missing init into ours. High info, slow. Requires more probing.
    • (b) Direct synthetic spawn: ours injects host-side ExCreateThread calls for the 4 worker entries at boot completion, mirroring AUDIT-048's audio-host-pump precedent. Pragmatic; ~40 LOC; risks getting context (0xBCE25340) wrong.

A possible third move:

  1. Re-probe with LR on Wait paths (we already added it but didn't grep for it) — to tell us whether tid=1's wait on 0x12a4 is the same LR as sub_825070F0-chain or a totally different signaler. If different, it's a 3rd missing producer.

Round 4 — wait-side guest LR via one-frame back-chain walk

After fixing the PPC stack-walk offset (Xbox 360 stores saved LR at [prev_sp - 8], not the +4 AIX convention), wait-side LR comes through cleanly.

Canary's top wait sites:

canary handle wait count guest_lr LR region mapping
F800005C 1635 0x8216EE14 kernel early-boot infra unrelated
F800000C 1597 0x824AFFC4 xboxkrnl wrapper (scheduler / work-queue?) unrelated
F80000DC 476 0x821C7D3C silph::UImpl/GamePart = ours' 0x12ac silph UI wedge
F80000B0 6 across 0x821CBAE0 + 0x821CC19C + 0x822DFE2x/D0 exact match with audit-049's frame trail sibling silph UI wait

Identity proof: ours' audit-049 frame trail for the silph UI wedge was 0x821cb1e0 / 0x821cbae0 / 0x821cc454 / 0x821c4f18 / 0x82174a80. Round 4 captures 0x821CBAE0 and 0x821CC19C (adjacent PCs) as wait LRs in canary — same cluster, same code.

Refined verdict. ours' 0x12a4 (tid=1 main, AUDIT_BLIND) and 0x12ac (tid=13 silph UI) are 8 bytes apart — likely sibling KEVENT fields in the same silph UI struct. canary's analogs are in the F80000xx namespace, similarly clustered. The single fix that addresses both:

2.BF (b) — synthetic host-side spawn of sub_825070F0's 4 workers at the audit-058-identified context (0xBCE25340), entries 0x82506528/58/88/B8. Once those workers run, they signal the silph UI PKEVENT cluster, unwedging BOTH tid=1 main and tid=13 silph UI in one shot.

2.BE (host-driven VSync ISR delivery) becomes follow-on work after the UI bootstrap completes and frame pacing actually matters.

Open questions for iterate 2.BD / 2.BE planning

  1. Does 2.BE alone unwedge tid=13? Cheapest verification path: land 2.BE and re-run audit-059, see whether 0x12ac signal count goes 0 → non-zero.
  2. What is the LR-pattern of canary's KeSetEvent guest_ptr=0xBCE25234 callers? The current probe doesn't capture LR — extending the cvar to do so on a filtered subset would let us name the producer function in canary's namespace.
  3. Does the GPU frame-limiter's CP interrupt actually walk into the silph UI cluster? I.e., does EmulateCPInterruptDPCinterrupt_callback → guest code ever hit sub_821CB030 or its callees? An LR probe inside EmulateCPInterruptDPC would answer this.

Artifacts

  • canary.log 2.2 MB / 34,095 lines / 32,977 AUDIT-HLC lines
  • canary.stdout 2.2 MB (duplicate of canary.log due to log_file fallback)
  • canary.stderr 8.4 KB (Wine diagnostics)
  • ours.log 479 lines (focus ledger + thread diagnostics + final state)
  • ours.stderr 317 lines (kernel-call counters)
  • vkd3d-proton.cache.write 15 KB (build artifact, ignored)

Commits in play (xenia-canary, fork-local only):

  • 03362b59f cross-build-wine (cross-compile toolchain)
  • d031d7c51 audit-handle-lifecycle-probes (this audit's probes)