Files
xenia-rs/audit-runs/audit-059-handle-disambiguation/round-D2-autosignal-poc/FINDINGS.md
MechaCat02 db90ad0f7d [AUDIT-059 R-D2] Phase D auto-signal POC confirms audit-049 wedge diagnosis
Hook NtCreateEvent for the silph::UImpl tid=13 chain (entry=0x821748F0,
start_context=0x4024a840, frame-1 LR=0x821CB15C inside sub_821CB030+0x128)
and auto-signal the resulting handle after XENIA_SILPH_UI_AUTOSIGNAL_DELAY
instructions. Env-gated; default off.

SR4 verdict B (partial unwedge):
- handle 0x1078 signal_attempts 0->1
- tid=13 Blocked(WaitAny[0x1078]) -> Ready pc=0x824a9108
- ExCreateThread 10 -> 12 (new silph::UImpl tid=14, worker tid=15)
- New downstream wedges 0x1084 + 0x1088
- cxx_throw runtime_error on tid=5 inside R26 dispatcher
  (BST not-registered instance lhs=0x715a7af0)
- VdSwap stays 1; no draws (POC is diagnostic, not final fix)

Confirms Phase C diagnosis end-to-end. The real signaler must (a) drive
NtSetEvent on the silph KEVENT AND (b) register the dispatcher's BST
instance upstream; this POC only does (a).

Reading-error class #20: ctx.lr at kernel export entry is the thunk
wrapper's return slot, NOT the guest caller's post-bl PC. Walk back-chain
1 step to get frames[1].lr.

Reading-error class #21: --parallel and lockstep have SEPARATE outer
loops in main.rs (run_execution_parallel line 2928 vs run_execution
line 2706). Per-round hooks must be wired in BOTH paths.

Files:
- crates/xenia-cpu/src/scheduler.rs: GuestThread.start_entry/start_context
  fields + spawn() population + current_thread_entry_and_ctx() helper
- crates/xenia-kernel/src/state.rs: AutoSignalPending struct, env-parsed
  silph_autosignal_delay, pending Vec, last_cycle_hint, set_now_cycle_hint,
  maybe_register_silph_autosignal (walks back-chain), fire_due_silph_autosignals
- crates/xenia-kernel/src/exports.rs: hook in nt_create_event
- crates/xenia-app/src/main.rs: fire-site + cycle hint in both outer loops
- audit-runs/audit-059-handle-disambiguation/round-D2-autosignal-poc/FINDINGS.md

Tests 655/655 green. Default behavior byte-identical when env unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-11 18:38:38 +02:00

9.5 KiB

Phase D — Audit-049 Auto-Signal POC — FINDINGS

Branch: iterate-2C/silph-ui-spawn-trace (extends Phase C 481591f) Date: 2026-06-11 Sub-rounds: D2.SR1 → D2.SR4 (4/4 used) Verdict: B — partial unwedge

Mission

Phase C diagnosed the audit-049 wedge as tid=13 (silph::UImpl@GamePart_Title) waiting INFINITE on a KEVENT created at sub_821CB030+0x128 (lr=0x821cb15c, post-bl PC). The Phase D POC tests this diagnosis by hooking NtCreateEvent from that exact call site and auto-signaling the resulting handle after a configurable delay (XENIA_SILPH_UI_AUTOSIGNAL_DELAY instructions).

If tid=13 unblocks, the diagnosis is confirmed. If new wedges or new threads appear downstream, even better — that's actual game progression past the wedge.

Result summary

Symptom SR2/SR3 baseline SR4 (POC firing)
silph autosignal: scheduled handle=0x1078 caller_lr=0x821cb15c yes (SR2/SR3) yes
silph autosignal: firing handle=0x1078 NO yes (cycle 16326209)
handle 0x1078 final signaled=false waiters=1 <NO_SIGNALS_DESPITE_WAITS> signal_attempts=1 waiters=0
tid=13 final state Blocked(WaitAny[0x1078]) Ready pc=0x824a9108
tid=1 final state Blocked(WaitAny[0x1070]) thread-join Blocked(WaitAny[0x1070]) (tid=13 not yet exited)
ExCreateThread total 10 12 (+tid=14, +tid=15)
New downstream wedges none past 0x1078 0x1084 (Event/Auto), 0x1088 (Event/Manual)
cxx_throw runtime_error decoded none yes, stack depth 6, top L0=0x82612b50 → L4=sub_82450B60+0x1A8 → L6=sub_82450a50
VdSwap 1 1
gpu.interrupt.delivered{source=0} 6393 4539 (different trajectory, no draws)

Conclusion: tid=13 unwedged cleanly from the audit-049 wait, spawned two follow-on threads (tid=14 entry=silph ctx=0x40929c00, tid=15 a worker), and progressed deep enough into the silph::UImpl state machine to throw a runtime_error from sub_82450a50 → sub_82450B60+0x1A8 (the dispatcher cluster from round 26). The auto-signal is not the proper signaler — it lets tid=13 proceed but downstream state-machine invariants the missing real signaler would have established are not in place, so the dispatcher trips on a "not-registered instance" lookup.

This is a clean confirmation of the Phase C diagnosis: the wedge handle, the wait site, and the LR filter are all correct. The fix shape is:

  • Either: synthesize the missing signaler properly (cluster-B silph_ui_synth.rs analogue from R33's deferred plan)
  • Or: track what the auto-signal needed to write into the work-item state ([+8] field per R26) BEFORE signaling, so the dispatcher's BST lookup succeeds

Sub-round detail

D2.SR1 — initial run, hook never fires (wrong LR filter)

Filter checked creator_lr ∈ [0x821CB15C, 0x821CB160] against ctx.lr at nt_create_event entry. But ctx.lr is the thunk wrapper return slot (0x824a9f6c), not the guest caller's post-bl PC. Confirmed via handle-audit created stack dump: frame 0 lr=0x824a9f6c, frame 1 lr=0x821cb15c. The guest caller's LR lives one frame up the PPC EABI back-chain.

Diagnosis classification: D (filter mismatch). Reading-error class #20 (new).

D2.SR2 — frame-1-LR fix; hook schedules, never fires

Refactored maybe_register_silph_autosignal to take (ctx, mem), walk back-chain via existing walk_guest_back_chain (1 step), match the saved LR. Hook now fires:

silph autosignal: scheduled handle=0x1078 caller_lr=0x821cb15c for cycle 10000 (now=0, delay=10000)

But no "firing" log appears, and tid=13 stays Blocked. Classification: D (drain site never reached).

D2.SR3 — diagnostic added; confirms drain site never visited

Added a one-shot info-level "tick (first visit, none due)" log inside fire_due_silph_autosignals when pending is non-empty but nothing due. Re-ran. The tick-diagnostic never fired either — proving the function isn't being called at all in --parallel mode.

Root cause: --parallel dispatches to run_execution_parallel (line 2928 of main.rs), which has its own outer loop at line 3186. My Phase D wiring only touched the lockstep path at line 2763. Classification: D (wrong code path wired).

D2.SR4 — parallel-path wiring added; hook fires; tid=13 unblocks

Added the same set_now_cycle_hint + fire_due_silph_autosignals calls inside the parallel outer loop, right after coord_pre_round (and under the same kernel_arc guard, so no extra locking). Re-built, re-ran.

Now all three log lines appear:

silph autosignal: scheduled handle=0x1078 caller_lr=0x821cb15c for cycle 16326202 (now=16316202, delay=10000)
silph autosignal: tick (first visit, none due) now=16316213 pending=1 first_deadline=16326202
silph autosignal: firing handle=0x1078 prev_signaled=Some(false) at cycle 16326209

now=16316202 at schedule time confirms set_now_cycle_hint is wired through correctly (the parallel path was simply never visited in SR2/SR3). Fire at cycle 16326209 = deadline 16326202 + 7-cycle scheduler granularity. Diagnostic classification: B (partial unwedge — new waits and cxx_throw downstream).

Code shape

POC is ~70 LOC across four files, all env-gated. Default off.

File Change Lines
crates/xenia-cpu/src/scheduler.rs GuestThread.start_entry/start_context fields; spawn() populates; current_thread_entry_and_ctx() helper +18
crates/xenia-kernel/src/state.rs AutoSignalPending struct; silph_autosignal_* fields; set_now_cycle_hint, maybe_register_silph_autosignal, fire_due_silph_autosignals methods +95
crates/xenia-kernel/src/exports.rs Hook in nt_create_event +3
crates/xenia-app/src/main.rs Fire-site wiring in lockstep loop (line 2788) and parallel loop (line 3215) +12

Tests stay green at 655/655.

Reading-error class #20 (new)

ctx.lr at kernel export entry ≠ guest caller's post-bl PC. When a guest bl calls an export thunk, the thunk-wrapper has its own frame between the guest caller and the export body. At export-body entry, ctx.lr holds the wrapper's return slot, not the guest caller's post-bl PC.

To match a specific guest call site by LR, the export must walk one step up the back-chain (walk_guest_back_chain(ctx.gpr[1], ctx.lr, mem, 2)) and use frames[1].lr.

SR1 burned one full sub-round on this. Detect early in future POCs by comparing ctx.lr against the handle-audit's created stack frame dump for a known-good event (e.g. one created from a labelled site).

Reading-error class #21 (new)

--parallel and lockstep have separate outer loops in main.rs. They share coord_pre_round (carved out exactly for this reason), but anything wired adjacent to that call site only takes effect on the path it's wired on. Lockstep is run_execution (line 2706, outer loop at 2763). Parallel is run_execution_parallel (line 2928, outer loop at 3186).

Per-round hooks added for a specific build mode must be wired in both paths. SR2/SR3 burned two sub-rounds on this.

Files modified + LR mapping (for follow-up sessions)

Wedge handle creation (confirmed by handle-audit dump):

created cycle=0 tid=13 lr=0x824a9f6c [src=NtCreateEvent thunk return]
created stack (6 frames):
   [ 0] fp=0x715a7a10 lr=0x824a9f6c   ← ctx.lr at nt_create_event
   [ 1] fp=0x715a7aa0 lr=0x821cb15c   ← guest caller's post-bl PC (filter on this)
   [ 2] fp=0x715a7bd0 lr=0x821cbae0   ← sub_821CBA08 frame
   [ 3] fp=0x715a7cd0 lr=0x821cc454   ← sub_821CC3F8 frame
   [ 4] fp=0x715a7d60 lr=0x821c4f18   ← sub_821C4EB0 frame (silph::UImpl@GamePart_Title)
   [ 5] fp=0x715a7e00 lr=0x82174a80   ← sub_821748F0 trampoline frame

Downstream cxx_throw stack (after auto-signal fires, tid=5 throws runtime_error):

L0 lr=0x82612b50  std::exception throw path
L1 lr=0x825f2444
L2 lr=0x824547e8
L3 lr=0x82451418
L4 lr=0x82450d08  ← sub_82450B60+0x1A8 (dispatcher, audit-059 R26)
L5 lr=0x82450b34
L6 lr=0x82450a50  ← sub_82450a50 (worker dispatch)

cxx_throw runtime_error decoded magic=0x19930520
cxx_throw BST ceil search candidate_key=0x828e2b2c match_found=false
cxx_throw lhs (not-registered instance) lhs=0x715a7af0

This confirms the dispatcher reached audit-049 territory (R26's sub_82450B60+0x1A8 PC 0x82450D08), looked up a runtime instance in its BST keyed by VA, and the instance was never registered. The auto-signal bypassed an upstream registration step the real signaler would have driven.

Recommendation

Ship the POC env-gated (default off; no behavior change unless opted in). The verdict-B success makes it a useful diagnostic flag for future audit-049 work: future investigations can set XENIA_SILPH_UI_AUTOSIGNAL_DELAY=10000 to skip the wedge and probe downstream behavior without first writing the proper signaler.

Long-term fix path remains the R33 silph_ui_synth.rs analogue: synthesize the missing signaler + its precondition state (BST instance registration at 0x715a7af0-equivalent, work-item state [+8] per R26). The auto-signal POC is not the final fix — it confirms diagnosis but doesn't honor the dispatcher's BST registry invariant.

Artifacts

  • poc-sr1.log, poc-sr1.stderr — initial run, filter mismatch (D)
  • poc-sr2.log, poc-sr2.stderr — frame-1-LR fix, no fire (D)
  • poc-sr3.log, poc-sr3.stderr — diagnostic added, no fire (D, parallel path unwired)
  • poc-sr4.log, poc-sr4.stderr — parallel-path wired, fires + partial unwedge (B)

All .log/.stderr files are .gitignored; this FINDINGS.md is the only artifact-side commit.