[AUDIT-059 R-D2] Phase D auto-signal POC confirms audit-049 wedge diagnosis

Hook NtCreateEvent for the silph::UImpl tid=13 chain (entry=0x821748F0,
start_context=0x4024a840, frame-1 LR=0x821CB15C inside sub_821CB030+0x128)
and auto-signal the resulting handle after XENIA_SILPH_UI_AUTOSIGNAL_DELAY
instructions. Env-gated; default off.

SR4 verdict B (partial unwedge):
- handle 0x1078 signal_attempts 0->1
- tid=13 Blocked(WaitAny[0x1078]) -> Ready pc=0x824a9108
- ExCreateThread 10 -> 12 (new silph::UImpl tid=14, worker tid=15)
- New downstream wedges 0x1084 + 0x1088
- cxx_throw runtime_error on tid=5 inside R26 dispatcher
  (BST not-registered instance lhs=0x715a7af0)
- VdSwap stays 1; no draws (POC is diagnostic, not final fix)

Confirms Phase C diagnosis end-to-end. The real signaler must (a) drive
NtSetEvent on the silph KEVENT AND (b) register the dispatcher's BST
instance upstream; this POC only does (a).

Reading-error class #20: ctx.lr at kernel export entry is the thunk
wrapper's return slot, NOT the guest caller's post-bl PC. Walk back-chain
1 step to get frames[1].lr.

Reading-error class #21: --parallel and lockstep have SEPARATE outer
loops in main.rs (run_execution_parallel line 2928 vs run_execution
line 2706). Per-round hooks must be wired in BOTH paths.

Files:
- crates/xenia-cpu/src/scheduler.rs: GuestThread.start_entry/start_context
  fields + spawn() population + current_thread_entry_and_ctx() helper
- crates/xenia-kernel/src/state.rs: AutoSignalPending struct, env-parsed
  silph_autosignal_delay, pending Vec, last_cycle_hint, set_now_cycle_hint,
  maybe_register_silph_autosignal (walks back-chain), fire_due_silph_autosignals
- crates/xenia-kernel/src/exports.rs: hook in nt_create_event
- crates/xenia-app/src/main.rs: fire-site + cycle hint in both outer loops
- audit-runs/audit-059-handle-disambiguation/round-D2-autosignal-poc/FINDINGS.md

Tests 655/655 green. Default behavior byte-identical when env unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
MechaCat02
2026-06-11 18:38:38 +02:00
parent 481591fdb2
commit db90ad0f7d
5 changed files with 334 additions and 0 deletions

View File

@@ -0,0 +1,144 @@
# Phase D — Audit-049 Auto-Signal POC — FINDINGS
**Branch**: `iterate-2C/silph-ui-spawn-trace` (extends Phase C `481591f`)
**Date**: 2026-06-11
**Sub-rounds**: D2.SR1 → D2.SR4 (4/4 used)
**Verdict**: **B — partial unwedge**
## Mission
Phase C diagnosed the audit-049 wedge as tid=13 (silph::UImpl@GamePart_Title) waiting INFINITE on a KEVENT created at `sub_821CB030+0x128` (`lr=0x821cb15c`, post-bl PC). The Phase D POC tests this diagnosis by hooking `NtCreateEvent` from that exact call site and auto-signaling the resulting handle after a configurable delay (`XENIA_SILPH_UI_AUTOSIGNAL_DELAY` instructions).
If tid=13 unblocks, the diagnosis is confirmed. If new wedges or new threads appear downstream, even better — that's actual game progression past the wedge.
## Result summary
| Symptom | SR2/SR3 baseline | SR4 (POC firing) |
|---|---|---|
| `silph autosignal: scheduled handle=0x1078 caller_lr=0x821cb15c` | yes (SR2/SR3) | yes |
| `silph autosignal: firing handle=0x1078` | NO | **yes (cycle 16326209)** |
| handle 0x1078 final | `signaled=false waiters=1 <NO_SIGNALS_DESPITE_WAITS>` | `signal_attempts=1 waiters=0` |
| tid=13 final state | `Blocked(WaitAny[0x1078])` | **`Ready` pc=0x824a9108** |
| tid=1 final state | `Blocked(WaitAny[0x1070])` thread-join | `Blocked(WaitAny[0x1070])` (tid=13 not yet exited) |
| ExCreateThread total | 10 | **12 (+tid=14, +tid=15)** |
| New downstream wedges | none past 0x1078 | **0x1084 (Event/Auto), 0x1088 (Event/Manual)** |
| `cxx_throw` runtime_error decoded | none | **yes, stack depth 6, top L0=0x82612b50 → L4=sub_82450B60+0x1A8 → L6=sub_82450a50** |
| VdSwap | 1 | 1 |
| gpu.interrupt.delivered{source=0} | 6393 | 4539 (different trajectory, no draws) |
**Conclusion**: tid=13 unwedged cleanly from the audit-049 wait, spawned two follow-on threads (tid=14 entry=`silph` ctx=`0x40929c00`, tid=15 a worker), and progressed deep enough into the silph::UImpl state machine to throw a `runtime_error` from sub_82450a50 → sub_82450B60+0x1A8 (the dispatcher cluster from round 26). The auto-signal **is not** the proper signaler — it lets tid=13 proceed but downstream state-machine invariants the missing real signaler would have established are not in place, so the dispatcher trips on a "not-registered instance" lookup.
This is a **clean confirmation** of the Phase C diagnosis: the wedge handle, the wait site, and the LR filter are all correct. The fix shape is:
- Either: synthesize the missing signaler properly (cluster-B silph_ui_synth.rs analogue from R33's deferred plan)
- Or: track what the auto-signal needed to write into the work-item state (`[+8]` field per R26) BEFORE signaling, so the dispatcher's BST lookup succeeds
## Sub-round detail
### D2.SR1 — initial run, hook never fires (wrong LR filter)
Filter checked `creator_lr ∈ [0x821CB15C, 0x821CB160]` against `ctx.lr` at `nt_create_event` entry. But `ctx.lr` is the **thunk wrapper return slot** (`0x824a9f6c`), not the guest caller's post-bl PC. Confirmed via handle-audit `created stack` dump: frame 0 lr=`0x824a9f6c`, frame 1 lr=`0x821cb15c`. The guest caller's LR lives one frame up the PPC EABI back-chain.
Diagnosis classification: **D (filter mismatch)**. Reading-error class #20 (new).
### D2.SR2 — frame-1-LR fix; hook schedules, never fires
Refactored `maybe_register_silph_autosignal` to take `(ctx, mem)`, walk back-chain via existing `walk_guest_back_chain` (1 step), match the saved LR. Hook now fires:
```
silph autosignal: scheduled handle=0x1078 caller_lr=0x821cb15c for cycle 10000 (now=0, delay=10000)
```
But no "firing" log appears, and tid=13 stays Blocked. Classification: **D (drain site never reached)**.
### D2.SR3 — diagnostic added; confirms drain site never visited
Added a one-shot info-level "tick (first visit, none due)" log inside `fire_due_silph_autosignals` when pending is non-empty but nothing due. Re-ran. **The tick-diagnostic never fired either** — proving the function isn't being called at all in `--parallel` mode.
Root cause: `--parallel` dispatches to `run_execution_parallel` (line 2928 of main.rs), which has its own outer loop at line 3186. My Phase D wiring only touched the lockstep path at line 2763. Classification: **D (wrong code path wired)**.
### D2.SR4 — parallel-path wiring added; hook fires; tid=13 unblocks
Added the same `set_now_cycle_hint` + `fire_due_silph_autosignals` calls inside the parallel outer loop, right after `coord_pre_round` (and under the same `kernel_arc` guard, so no extra locking). Re-built, re-ran.
Now all three log lines appear:
```
silph autosignal: scheduled handle=0x1078 caller_lr=0x821cb15c for cycle 16326202 (now=16316202, delay=10000)
silph autosignal: tick (first visit, none due) now=16316213 pending=1 first_deadline=16326202
silph autosignal: firing handle=0x1078 prev_signaled=Some(false) at cycle 16326209
```
`now=16316202` at schedule time confirms `set_now_cycle_hint` is wired through correctly (the parallel path was simply never visited in SR2/SR3). Fire at cycle 16326209 = deadline 16326202 + 7-cycle scheduler granularity. Diagnostic classification: **B (partial unwedge — new waits and cxx_throw downstream)**.
## Code shape
POC is ~70 LOC across four files, all env-gated. Default off.
| File | Change | Lines |
|---|---|---|
| `crates/xenia-cpu/src/scheduler.rs` | `GuestThread.start_entry/start_context` fields; `spawn()` populates; `current_thread_entry_and_ctx()` helper | +18 |
| `crates/xenia-kernel/src/state.rs` | `AutoSignalPending` struct; `silph_autosignal_*` fields; `set_now_cycle_hint`, `maybe_register_silph_autosignal`, `fire_due_silph_autosignals` methods | +95 |
| `crates/xenia-kernel/src/exports.rs` | Hook in `nt_create_event` | +3 |
| `crates/xenia-app/src/main.rs` | Fire-site wiring in lockstep loop (line 2788) **and** parallel loop (line 3215) | +12 |
Tests stay green at **655/655**.
## Reading-error class #20 (new)
**`ctx.lr` at kernel export entry ≠ guest caller's post-bl PC.** When a guest `bl` calls an export thunk, the thunk-wrapper has its own frame between the guest caller and the export body. At export-body entry, `ctx.lr` holds the *wrapper's* return slot, not the guest caller's post-bl PC.
To match a specific guest call site by LR, the export must walk one step up the back-chain (`walk_guest_back_chain(ctx.gpr[1], ctx.lr, mem, 2)`) and use `frames[1].lr`.
SR1 burned one full sub-round on this. Detect early in future POCs by comparing `ctx.lr` against the handle-audit's `created stack` frame dump for a known-good event (e.g. one created from a labelled site).
## Reading-error class #21 (new)
**`--parallel` and lockstep have separate outer loops in main.rs.** They share `coord_pre_round` (carved out exactly for this reason), but anything wired adjacent to that call site only takes effect on the path it's wired on. Lockstep is `run_execution` (line 2706, outer loop at 2763). Parallel is `run_execution_parallel` (line 2928, outer loop at 3186).
Per-round hooks added for a specific build mode must be wired in **both** paths. SR2/SR3 burned two sub-rounds on this.
## Files modified + LR mapping (for follow-up sessions)
**Wedge handle creation** (confirmed by handle-audit dump):
```
created cycle=0 tid=13 lr=0x824a9f6c [src=NtCreateEvent thunk return]
created stack (6 frames):
[ 0] fp=0x715a7a10 lr=0x824a9f6c ← ctx.lr at nt_create_event
[ 1] fp=0x715a7aa0 lr=0x821cb15c ← guest caller's post-bl PC (filter on this)
[ 2] fp=0x715a7bd0 lr=0x821cbae0 ← sub_821CBA08 frame
[ 3] fp=0x715a7cd0 lr=0x821cc454 ← sub_821CC3F8 frame
[ 4] fp=0x715a7d60 lr=0x821c4f18 ← sub_821C4EB0 frame (silph::UImpl@GamePart_Title)
[ 5] fp=0x715a7e00 lr=0x82174a80 ← sub_821748F0 trampoline frame
```
**Downstream cxx_throw stack** (after auto-signal fires, tid=5 throws runtime_error):
```
L0 lr=0x82612b50 std::exception throw path
L1 lr=0x825f2444
L2 lr=0x824547e8
L3 lr=0x82451418
L4 lr=0x82450d08 ← sub_82450B60+0x1A8 (dispatcher, audit-059 R26)
L5 lr=0x82450b34
L6 lr=0x82450a50 ← sub_82450a50 (worker dispatch)
cxx_throw runtime_error decoded magic=0x19930520
cxx_throw BST ceil search candidate_key=0x828e2b2c match_found=false
cxx_throw lhs (not-registered instance) lhs=0x715a7af0
```
This confirms the dispatcher reached audit-049 territory (R26's `sub_82450B60+0x1A8` PC `0x82450D08`), looked up a runtime instance in its BST keyed by VA, and the instance was never registered. **The auto-signal bypassed an upstream registration step** the real signaler would have driven.
## Recommendation
Ship the POC env-gated (default off; no behavior change unless opted in). The verdict-B success makes it a useful diagnostic flag for future audit-049 work: future investigations can set `XENIA_SILPH_UI_AUTOSIGNAL_DELAY=10000` to skip the wedge and probe downstream behavior without first writing the proper signaler.
Long-term fix path remains the R33 silph_ui_synth.rs analogue: synthesize the missing signaler + its precondition state (BST instance registration at `0x715a7af0`-equivalent, work-item state `[+8]` per R26). The auto-signal POC is **not** the final fix — it confirms diagnosis but doesn't honor the dispatcher's BST registry invariant.
## Artifacts
- `poc-sr1.log`, `poc-sr1.stderr` — initial run, filter mismatch (D)
- `poc-sr2.log`, `poc-sr2.stderr` — frame-1-LR fix, no fire (D)
- `poc-sr3.log`, `poc-sr3.stderr` — diagnostic added, no fire (D, parallel path unwired)
- `poc-sr4.log`, `poc-sr4.stderr` — parallel-path wired, **fires + partial unwedge (B)**
All `.log`/`.stderr` files are `.gitignore`d; this `FINDINGS.md` is the only artifact-side commit.

View File

@@ -2780,6 +2780,13 @@ fn run_execution(
RoundCtl::BreakOuter => break,
RoundCtl::Continue => {}
}
// ITERATE-2C Phase D — deposit the current instruction count so
// `nt_create_event` can compute absolute auto-signal deadlines,
// then drain any pending auto-signals whose deadline has passed.
// Both calls are no-ops when `XENIA_SILPH_UI_AUTOSIGNAL_DELAY`
// is unset (the pending queue stays empty).
kernel.set_now_cycle_hint(stats.instruction_count);
kernel.fire_due_silph_autosignals(stats.instruction_count);
dispatch_graphics_interrupts(
kernel,
mem,
@@ -3204,6 +3211,15 @@ fn run_execution_parallel(
}
let mut guard = pre_outcome.1;
// ITERATE-2C Phase D — same auto-signal hook as the lockstep
// path. Held under the same `kernel_arc` guard the rest of
// this prologue runs under, so no extra locking.
{
let s = stats_mtx.lock().expect("stats mutex poisoned");
guard.set_now_cycle_hint(s.instruction_count);
guard.fire_due_silph_autosignals(s.instruction_count);
}
// Iterate-2.BE — host-driven synchronous ISR dispatch.
// Runs under the kernel lock while workers are still parked
// at the phaser B2 barrier (the coordinator hasn't published

View File

@@ -117,6 +117,14 @@ pub struct GuestThread {
/// Axis 3 instruction budget. Decremented per retired step on this
/// thread; on zero, slot rotates within same-priority tier.
pub quantum_remaining: u32,
/// SpawnParams.entry — the BL target the trampoline jumped to.
/// Persisted so kernel exports can filter syscalls by spawning
/// chain (e.g. the silph UI auto-signal POC). 0 for the initial
/// thread (uses `install_initial_thread`, not `spawn`).
pub start_entry: u32,
/// SpawnParams.start_context — initial r3 at spawn. Persisted for
/// the same filtering reason as `start_entry`.
pub start_context: u32,
}
impl GuestThread {
@@ -136,6 +144,8 @@ impl GuestThread {
affinity_mask: 0xFF,
ideal_processor: None,
quantum_remaining: QUANTUM_DEFAULT,
start_entry: 0,
start_context: 0,
}
}
}
@@ -500,6 +510,17 @@ impl Scheduler {
self.current.expect("no current thread")
}
/// `(start_entry, start_context)` of the currently-running thread.
/// Returns None if there is no current thread or its ref is stale.
/// Used by `KernelState::maybe_register_silph_autosignal` to filter
/// `NtCreateEvent` calls by spawning chain.
pub fn current_thread_entry_and_ctx(&self) -> Option<(u32, u32)> {
let r = self.current?;
let slot = self.slots.get(r.hw_id as usize)?;
let t = slot.runqueue.get(r.idx as usize)?;
Some((t.start_entry, t.start_context))
}
// ----- Guest-thread lookup -----
/// Find the `ThreadRef` of the (non-Exited) thread with `tid`.
@@ -614,6 +635,8 @@ impl Scheduler {
t.priority = params.priority;
t.affinity_mask = mask;
t.ideal_processor = params.ideal_processor;
t.start_entry = params.entry;
t.start_context = params.start_context;
// M3.7 — populate the inter-thread reservation handle + slot id
// so the interpreter can route lwarx/stwcx through the table.
t.ctx.hw_id = slot_id;

View File

@@ -2001,6 +2001,9 @@ fn nt_create_event(ctx: &mut PpcContext, mem: &GuestMemory, state: &mut KernelSt
mem,
"NtCreateEvent",
);
// ITERATE-2C Phase D — audit-049 auto-signal POC. Env-gated; no-op
// when `XENIA_SILPH_UI_AUTOSIGNAL_DELAY` is unset.
state.maybe_register_silph_autosignal(handle, ctx, mem);
if handle_ptr != 0 {
mem.write_u32(handle_ptr, handle);
}

View File

@@ -325,6 +325,32 @@ pub struct KernelState {
pub silph_synth_handles: [Option<u32>; 4],
/// AUDIT-2.BF — `ThreadRef` cache for the 4 synthetic workers.
pub silph_synth_refs: [Option<xenia_cpu::ThreadRef>; 4],
/// ITERATE-2C Phase D — auto-signal delay for silph::UImpl
/// `NtCreateEvent` calls (see [`Self::maybe_register_silph_autosignal`]).
/// `None` = feature disabled; populated once from
/// `XENIA_SILPH_UI_AUTOSIGNAL_DELAY=<u64>` at construction.
pub silph_autosignal_delay: Option<u64>,
/// ITERATE-2C Phase D — pending auto-signal queue. Drained each
/// outer round by [`Self::fire_due_silph_autosignals`].
pub silph_autosignal_pending: Vec<AutoSignalPending>,
/// ITERATE-2C Phase D — most recent `stats.instruction_count`
/// deposited by the scheduler loop (see
/// [`Self::set_now_cycle_hint`]). Used by
/// [`Self::maybe_register_silph_autosignal`] to compute absolute
/// deadlines, since `nt_create_event` doesn't see `ExecStats`.
pub last_cycle_hint: u64,
/// ITERATE-2C Phase D — one-shot diagnostic latch. Flipped by
/// [`Self::fire_due_silph_autosignals`] on the first visit where
/// the pending queue is non-empty but no entry is due yet.
pub silph_autosignal_diag_logged: bool,
}
/// ITERATE-2C Phase D — one queued auto-signal. `deadline_cycle` is
/// absolute (cycle hint at register time + configured delay).
#[derive(Debug, Clone, Copy)]
pub struct AutoSignalPending {
pub handle: u32,
pub deadline_cycle: u64,
}
impl KernelState {
@@ -400,6 +426,12 @@ impl KernelState {
silph_synth_ctx: 0,
silph_synth_handles: [None; 4],
silph_synth_refs: [None; 4],
silph_autosignal_delay: std::env::var("XENIA_SILPH_UI_AUTOSIGNAL_DELAY")
.ok()
.and_then(|v| v.parse::<u64>().ok()),
silph_autosignal_pending: Vec::new(),
last_cycle_hint: 0,
silph_autosignal_diag_logged: false,
};
crate::exports::register_exports(&mut state);
crate::xam::register_exports(&mut state);
@@ -800,6 +832,122 @@ impl KernelState {
self.audit.record_wake(handle, entry);
}
/// ITERATE-2C Phase D — deposit the latest scheduler instruction
/// count so `nt_create_event` can compute absolute auto-signal
/// deadlines. Called once per outer round from the app's
/// `coord_pre_round` site. No-op when the feature env is unset.
pub fn set_now_cycle_hint(&mut self, now_cycle: u64) {
self.last_cycle_hint = now_cycle;
}
/// ITERATE-2C Phase D — register a freshly-allocated event for
/// auto-signal after the configured delay, **iff** the creating
/// thread matches the silph::UImpl tid=13 chain that wedges in
/// audit-049. Filter:
///
/// * Env `XENIA_SILPH_UI_AUTOSIGNAL_DELAY` set (= delay non-None)
/// * Frame-1 LR (the guest caller's post-bl PC, walked one step up
/// from the live thunk-wrapper frame) is in
/// `[0x821CB15C, 0x821CB160]` — this is the `NtCreateEvent` call
/// site inside `sub_821CB030+0x128`. The live `ctx.lr` is the
/// thunk wrapper's return slot (e.g. `0x824a9f6c`), so we walk
/// one back-chain step to reach the actual guest caller.
/// * Creating thread's `start_entry == 0x821748F0` (silph trampoline)
/// * Creating thread's `start_context == 0x4024a840`
///
/// On match, the handle is queued with `deadline = last_cycle_hint +
/// delay`. Drained by [`Self::fire_due_silph_autosignals`] from the
/// outer scheduler loop.
pub fn maybe_register_silph_autosignal(
&mut self,
handle: u32,
ctx: &PpcContext,
mem: &GuestMemory,
) {
let Some(delay) = self.silph_autosignal_delay else {
return;
};
let Some((entry, start_ctx)) = self.scheduler.current_thread_entry_and_ctx() else {
return;
};
if entry != 0x821748F0 || start_ctx != 0x4024_a840 {
return;
}
let frames = walk_guest_back_chain(ctx.gpr[1] as u32, ctx.lr as u32, mem, 2);
let caller_lr = match frames.get(1) {
Some((_, lr)) => *lr,
None => return,
};
if !(0x821CB15C..=0x821CB160).contains(&caller_lr) {
return;
}
let deadline = self.last_cycle_hint.saturating_add(delay);
self.silph_autosignal_pending
.push(AutoSignalPending { handle, deadline_cycle: deadline });
tracing::info!(
"silph autosignal: scheduled handle={:#x} caller_lr={:#x} for cycle {} (now={}, delay={})",
handle,
caller_lr,
deadline,
self.last_cycle_hint,
delay,
);
}
/// ITERATE-2C Phase D — drain pending entries whose deadline has
/// passed. Each fires by setting `Event { signaled = true }` and
/// invoking the existing `wake_eligible_waiters` to release blocked
/// waiters. No-op when the queue is empty (the common case).
pub fn fire_due_silph_autosignals(&mut self, now_cycle: u64) {
if self.silph_autosignal_pending.is_empty() {
return;
}
let any_due = self
.silph_autosignal_pending
.iter()
.any(|p| p.deadline_cycle <= now_cycle);
if !any_due {
// Diagnostic for the Phase D POC: log first time we visit
// with a non-empty queue but nothing due yet.
if !self.silph_autosignal_diag_logged {
self.silph_autosignal_diag_logged = true;
if let Some(first) = self.silph_autosignal_pending.first() {
tracing::info!(
"silph autosignal: tick (first visit, none due) now={} pending={} first_deadline={}",
now_cycle,
self.silph_autosignal_pending.len(),
first.deadline_cycle,
);
}
}
}
let mut i = 0;
while i < self.silph_autosignal_pending.len() {
if self.silph_autosignal_pending[i].deadline_cycle <= now_cycle {
let p = self.silph_autosignal_pending.swap_remove(i);
let prev = match self.objects.get_mut(&p.handle) {
Some(KernelObject::Event { signaled, .. }) => {
let was = *signaled;
*signaled = true;
Some(was)
}
_ => None,
};
tracing::info!(
"silph autosignal: firing handle={:#x} prev_signaled={:?} at cycle {}",
p.handle,
prev,
now_cycle,
);
self.audit_signal(p.handle, 0, "silph_autosignal", prev.unwrap_or(false) as u64);
crate::exports::wake_eligible_waiters(self, p.handle);
// do not advance i — swap_remove pulled a new entry into i
} else {
i += 1;
}
}
}
/// Diagnostic. If the live PC for HW slot `hw_id` is in
/// `self.ctor_probe_pcs`, emit a single `CTOR-PROBE` line with
/// the current cycle, tid, hw_id, sp, r3, lr, plus an 8-frame