diff --git a/audit-findings.md b/audit-findings.md index 3634a34..3e642d6 100644 --- a/audit-findings.md +++ b/audit-findings.md @@ -5819,3 +5819,253 @@ Synth-stub auto-enqueued `(0x0A, 1)` on the first `XNotifyGetNext` after listene ### Trace artifacts `audit-runs/audit-013-io-004-phase1.5/dispatch.{log,err}` (no-fire baseline at non-block PCs), `dispatch2.log` (block-entry probes — 1 fire on dispatch arm), `dispatch3.log` (full dispatch chain confirmed), `post-cascade.{log,err}` (focus + canary export delta + cascade probes). + +## KRNBUG-AUDIT-014 — 0x15e0 wake-eligibility hypothesis FALSIFIED; tid=17 actually parks on 0x15e4 (DIAGNOSTIC 2026-05-06) + +**Status**: read-only diagnostic. No fix landed. Master HEAD `d736a1d` unchanged. + +### Phase 1 finding (decisive) +Goal was to investigate why handle 0x15e0 records `signal_attempts=1 (primary=1)` post-IO-004 BUT tid=17 (the "0x15e0 worker") still parks. **The premise is wrong.** + +Trace at `-n 500M --trace-handles-focus=0x15e0` shows: +1. **Handle 0x15e0 is a Semaphore**, not an Event/Manual. Created from `lr=0x824ab110` (NtCreateSemaphore) on tid=1, with creator-frame chain `lr=0x82456a94 → 0x82456bac → 0x822f1b60 → 0x8216ee14 → 0x824ab8e0`. This is a **different** wrapper than the Event creator chain `lr=0x824a9f6c` shared by 0x1004 / 0x100c / 0x1020 / 0x15e4. +2. **0x15e0 is healthy**: `signal_attempts=1 (primary=1) waits=1 wakes=1`. End-of-run DIAGNOSIS reports "not stuck — signals consumed correctly". Timeline: tid=1 waited at `lr=0x824ac578`, then tid=16 `NtReleaseSemaphore` at `lr=0x824ab168` woke it. Handshake completed. +3. **tid=17 parks on 0x15e4**, NOT 0x15e0. State at end-of-run: `Blocked(WaitAny { handles: [5604] })` where `5604 == 0x15e4`. Worker entry context `r12=0x8217057c` (front of `sub_82170430`) matches the audit-009 / audit-008 / audit-002 stage-3 attribution of tid=17 to the 0x82170430 worker cluster. +4. 
**0x15e4 is the actual stuck handle**: `kind=Event/Manual waiters=1 signals=0 waits=1 wakes=0 `, created by tid=1 at `lr=0x824a9f6c` (same wrapper as 0x1004 / 0x100c / 0x1020). This is the same producer-missing class as the other Event/Manual handles tracked across audit-001 → IO-004. + +The IO-004 cascade-prediction scorecard's claim "(e) signal_attempts on parked handles: 0x15e0 = 1 (primary=1, ghost=0)" was technically correct (the semaphore did get one signal) but the inference that this represented forward progress for tid=17's wake was a misattribution. The label "0x15e0 worker" used in audit-009 / audit-002 / audit-008 stage-3 mappings is a long-standing transcription error: the actual handle is 0x15e4 (Event/Manual), and 0x15e0 is an unrelated Semaphore. Reference: `project_xenia_rs_producer_stack_trace_2026_05_03.md` already noted "third handle is **0x15e0**, not 0x15e4 (transcription typo)" — that correction itself was reversed; the original audit-002 label 0x15e4 was correct. + +### Bug class evaluation (α-ζ from prompt) +- α (PKEVENT vs handle mismatch): N/A — no Set call ever targets 0x15e4; the producer is genuinely missing. +- β (refresh_pkevent_shadow_from_guest miss): N/A — same. +- γ (wake-eligibility filter wrong): N/A — wake_eligible_waiters fires correctly elsewhere (0x10F0 handshake demonstrates healthy manual-reset wake; 0x15e0 demonstrates healthy semaphore wake). +- δ (memory ordering): N/A — no producer side observed. +- ε (race scheduler.resume vs signal): N/A. +- ζ (audit recorded but not propagated): N/A — DIAGNOSIS print-out matches state.objects waiter list. + +**Conclusion**: 0x15e4 belongs to the same "producer never reaches the Set call" class as 0x1004 / 0x100c / 0x1020. Renderer cluster work (audit-008 / audit-009) and AUDIT-014's parallel Fork B probing of newly-reached L1 entries (`sub_82173DC8`, `0x822c6870`, `0x824563e0`, `0x823ddb50`) is the correct line of attack — there is no wake-eligibility bug to fix. 
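The end-of-run scheduler state prints wait handles in decimal, while the handle traces use hex, which makes a 0x15e0 / 0x15e4 mix-up easy to carry forward. A minimal sketch of the cross-check follows, using only the two values quoted above; the helper name `fmt_handle` is hypothetical and not part of the emulator.

```rust
/// Hypothetical helper: render a guest handle the way the handle traces do.
fn fmt_handle(handle: u32) -> String {
    format!("{:#06x}", handle)
}

fn main() {
    // Value taken from the end-of-run `Blocked(WaitAny { handles: [5604] })` dump.
    let parked: u32 = 5604;
    assert_eq!(fmt_handle(parked), "0x15e4");
    // The semaphore handle is only 4 lower, so the two are easy to confuse by eye.
    assert_eq!(parked - 0x15e0, 4);
    println!("tid=17 parks on {}", fmt_handle(parked));
}
```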
+ +### Discipline gate +- Box 1 (named bug class with concrete evidence): FAIL — premise refuted, no bug class applies. +- Box 2 (narrow fix ~30-80 LOC): N/A. +- Box 3 (sharp 4-dim cascade prediction): N/A. +- Box 4 (no renderer/GPU changes): N/A. +- Box 5 (lockstep determinism preserved): N/A. + +Stop conditions met: hand back as Phase 1 only. + +### Cascade snapshot (unchanged from IO-004 baseline) +- swaps=2 (`VdSwap` kernel-direct frames 1 + 2) +- draws=0 +- 18 → 20 worker threads (consistent with IO-004) +- Canary-only exports: ExTerminateThread, KeReleaseSemaphore, XamUserReadProfileSettings still missing. + +### Recommended next session +Track Fork B's branch-probe results for `sub_82173DC8` (the first L1 entry in the renderer cluster reached after IO-004). The producer for handles 0x1004 / 0x100c / 0x1020 / 0x15e4 lives somewhere along the dispatch arm at `0x822f1be8 → 0x82175338 → 0x82173dc8 → ...`. If Fork B identifies a sub-function that gates the Set call (e.g. `sub_82173DC8` returns early on a stub kernel call), that becomes KRNBUG-AUDIT-015 / next IO-NNN candidate. + +The misattribution label "0x15e0 worker" should be corrected to "0x15e4 worker" in the index entries for AUDIT-002, AUDIT-008, AUDIT-009 — left for the next session to update if relevant. + +### Trace artifacts +`audit-runs/audit-014-0x15e0-wake/probe.log` (focus dump + 19-thread diagnostic), `probe.err` (kernel.calls counters confirming swaps=2 unchanged). + +## KRNBUG-AUDIT-015 — L1 propagation probe; next gate is silph::Semaphore on handle 0x1308 (workitem submitter unreached) (DIAGNOSTIC 2026-05-06) + +**Status**: read-only diagnostic, Fork B parallel session. No fix landed. Master HEAD `d736a1d` unchanged. + +### Probe set (112 PCs) +sub_82173DC8 dispatcher case-arms (25), worker 0x822c6878 body (12), worker sub_824563E0 body (17), worker sub_823DDB50 body (11), L1 callees (26), audit-009 unfired baseline (21). + +### Decisive findings +1. 
**sub_82173DC8 dispatches all 4 IO-004 startup notifications then idles.** Every fire takes the early-exit at `0x82173ed8` because `[r31+44] == 0` (callback-table pointer in the listener struct never populated). The post-merge dispatch helper `0x82174040` (which would call the renderer producers `sub_822C2A80`, `sub_8216F088`, etc.) is never invoked from the dispatcher path. +2. **Worker 0x822c6870 (= 0x822c6878 thunk; tids 14, 15) parks immediately on Semaphore handle 0x1308.** The semaphore is `Semaphore(0/INT_MAX) signals=0 waits=2 wakes=0 `, created by tid=13 inside `sub_822C66B4` (worker-pool initializer in `sub_822C6630`). Producer chain that releases it: `sub_822AE1F0 / sub_822F55F0 → sub_822C8B50 → sub_822C6808 → bl 0x824AB158 (silph::Semaphore::Release at NtReleaseSemaphore)`. Neither `sub_822AE1F0` nor `sub_822F55F0` was probed; both are statically reachable from main but unexercised at -n 500M — they're the renderer's frame-update / scene-graph-mutate path that never runs. +3. **Worker sub_824563E0 (tid=16) is healthy** — runs an XAM inactivity / timer poll loop (NtSetTimerEx handle 0x15d0, period=2; loops `XamEnableInactivityProcessing ↔ CS+bcctrl dispatch` 865k times). Not the gate. +4. **Worker sub_823DDB50 (tid=19) parks at entry** with body PCs unfired; final state `Blocked(WaitAny { handles: [0x160C, 0x01000000] })`. Handle 0x160C is `Event/Auto signals=0 waits=1 wakes=0 `. The wait callsite is unprobed (likely an early branch before 0x823ddb68); needs follow-up probe inside `sub_823DD838` (parent). +5. All 21 audit-009 PCs (renderer cluster `0x82287xxx-0x82294xxx` + audit-005 producer-callsites) remain UNFIRED, consistent with audit-009 baseline — they sit downstream of the unreached workitem-submitter chain. + +### Bug class +**δ (pure-guest renderer state-read)**, NOT a kernel-boundary stub. There is no missing `xboxkrnl`/`xam` import at the gate; main fails to advance past a state predicate that gates `sub_822AE1F0` / `sub_822F55F0` invocation. 
+ +### Discipline gate +- Box 1 (named import α / narrow internal-sub bug): **NO** — δ-class, no kernel boundary. +- Box 2 (canary impl small): N/A. +- Box 3 (sharp 4-dim cascade prediction): **NO** — needs dump-addr triage of listener struct first. +- Box 4 (no new ABI plumbing): N/A. +- Box 5 (lockstep determinism preserved): N/A. + +Boxes 1 + 3 fail. Hand back per stop condition 1. + +### Recommended next session +Phase 1: probe `sub_822AE1F0`, `sub_822F55F0`, `sub_822C8B50`, `sub_822C6808` entries + `sub_82174040` post-merge dispatch helper (the 6 fall-through arms inside sub_82173DC8). Add `--dump-addr=0x40ba9a80` to capture the listener-struct fields each dispatcher fire. The struct's `[+44]` field is the gate predicate; once we know what populates it, the actual fix point becomes nameable. + +### Trace artifacts +`audit-runs/audit-015-l1-propagation/probe.log` (493 MB; 5.05M BRANCH-PROBE lines), `probe.err` (188 KB), `pc-fire-counts.txt` (28 fired PCs sorted). + +## KRNBUG-AUDIT-016 — submitter-caller probe; gate is γ (deeper-indirection / vtable registry not populated) (DIAGNOSTIC 2026-05-06) + +**Status**: read-only diagnostic. No fix landed. Master HEAD `d736a1d` unchanged. + +### Probe set +Run #1 (30 PCs): workitem-submitter chain entries + bl call-sites (`sub_822AE1F0`, `sub_822F55F0`, `sub_822C8B50`, `sub_822C6808`, `0x822B16E0`, `0x822F5728`), parents (`sub_822ADD70`, `sub_821A9920`, `sub_822ACAB8`, `sub_821A8578`), grandparents (`sub_82299250`, `sub_822A4460`, `sub_821A82A0`), dispatcher post-merge helper + early-exit. Run #2 (18 PCs): refined dispatcher arm coverage + `--dump-addr=0x40ba9a80,0x4024AC00,0x4024B3E0,0x40111890,0x4024A380`. + +### Decisive findings +1. **0/16 submitter-chain PCs fire** including all 4 levels of caller walk-up. 
Both static caller chains bottom-out in the audit-009 unreached renderer cluster: A-side `sub_822AE1F0 ← sub_822ADD70 ← sub_822ACAB8 ← sub_82299250 / sub_822A4460 ← sub_8229AB50 ← sub_8229A700 ← sub_82294F30 (renderer cluster)`. B-side `sub_822F55F0 ← sub_821A9920 ← sub_821A8578 ← sub_821A82A0 ← (cycle with sub_821A9920) and ← sub_821ABEA8 ← sub_821AC700 ← sub_821A6470 (renderer cluster)`. +2. **Listener struct dump at `0x40ba9a80`**: `[+0x00]` vtable=0x40111890; `[+0x04]` dispatch state bits=**0 (NEVER set)**; `[+0x08]` counter=0; `[+0x0C]`=1000 (set by case 0xA); `[+0x2C]` callback-table A=**0x4024AC00 (POPULATED)**; `[+0x3C]` callback-table B=**0x4024B3E0 (POPULATED)**. **Audit-015's claim that `[r31+44]==0` was wrong** — `[+0x2C]` IS populated. The real gate is `[base+0x04]` (dispatch state bits) read by `sub_821737F0` (case-9 helper) bit 14 / bit 15. +3. **Dispatcher arm fires (run #2 confirmed)**: case-9 r5==0 path (`0x82173e6c`, 1 fire) → `sub_821737F0` returned 0 → early-exit; default-high arm (`0x82173f48`, 2 fires) → both early-exit at `0x82174030`. **Case 0xA's write `oris 0x1; stw [r31+4]` should set bit 16, but EOR dump shows `[+0x04]=0`** — either the case-0xA fire and dispatch-r3 don't always target `0x40ba9a80`, or the write is overwritten back to 0 by another path. +4. **0x4024AC00 (callback table A) contains real renderer config** including string `"game:\\dat\\GP_TITLE.pak+eng\\\0"` and pointers `0x401119A0 / 0x40111990` — confirming the listener IS subscribed to the renderer's profile loader, but its dispatch-state bits are never advanced. +5. **Probe-machinery anomaly**: `sub_82174040` entry-PC never fires across both runs, yet `sub_821737F0` fires once at cycle 9183539 with `lr=0x821741f4` — meaning `0x821741F0 (bl sub_821737F0 inside sub_82174040 +0x1B0)` was executed. Either `sub_82174040` was reached via a jump-into-mid-function (highly unusual) or the probe missed an entry fire. 
**Worth verifying in AUDIT-017** with an isolated probe of `0x82174040, 0x82174044, 0x82174048`. + +### Bug class +**γ (deeper indirection)** — refining audit-015's δ classification. The submitter chain bottoms out in a vtable-dispatched renderer cluster registry that's never populated. Chicken-and-egg: listener can't advance state because workitem-submitter never fires; workitem-submitter never fires because the registry is never populated; the registry is populated by something the listener was supposed to drive. Only an external bootstrap can break it. + +### Discipline gate +- Box 1 (named α-class import / narrow internal sub): **NO** — γ-class, no kernel boundary; gate is structural. +- Box 2 (canary impl small): N/A. +- Box 3 (sharp 4-dim cascade prediction): **NO** — needs further state-write triage. +- Box 4 (no new ABI plumbing): N/A. +- Box 5: N/A. + +Boxes 1 + 3 fail. Hand back per stop condition 1. + +### Recommended next session (AUDIT-017) +1. Probe dispatcher caller layer: `0x822f1be8`, `0x822f1c04`, `sub_822F1AA8` (main's frame-poll loop — where main parks per AUDIT-009), `sub_821752C0` (jumps to `sub_82173DC8`). +2. Find writers of `[0x40ba9a80+4]` — byte-scan `.text` for `addi r?, ?, 4; stw r?, 0(r?)` patterns OR probe ALL functions that touch r3+4 with a stw (potentially via offset-write tracking). Identify the function that's supposed to set bit 14 / bit 15 of that field. +3. Probe inside `sub_82181D48` (default-high arm's secondary predicate): the `rlwinm r11, r11, 0, 30, 30` at `0x82181D74` reads `[[r3+0]+60]` bit 30 — find what writes this bit. If we can make `sub_82181D48` return 1, the default-high arm's `bctrl` fires → renderer cascade. +4. Verify probe-machinery anomaly (entry of `sub_82174040`). + +### Trace artifacts +`audit-runs/audit-016-submitter-callers/probe.log` (run #1, 9 KB), `probe.err` (187 KB), `probe2.log` (run #2, 12 KB; +4 dump-addrs), `probe2.err` (187 KB). 
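The bit positions quoted in this entry are easy to misread, because PowerPC documentation numbers bits from the most significant end while most tooling counts from the least significant end. The sketch below works out the masks for the instructions named above (`oris 0x1`, `rlwinm r11, r11, 0, 30, 30`). It assumes the "bit 14 / bit 15" labels follow the IBM convention; that reading is an assumption and is not confirmed by the traces. The helper names are hypothetical.

```rust
/// Mask for one bit in IBM (big-endian) numbering of a 32-bit word,
/// where bit 0 is the most significant bit.
fn ibm_bit(n: u32) -> u32 {
    assert!(n < 32);
    1u32 << (31 - n)
}

/// Mask applied by `rlwinm rA, rS, 0, mb, me` when mb <= me (no wrap-around).
fn rlwinm_mask(mb: u32, me: u32) -> u32 {
    assert!(mb <= me && me < 32);
    (mb..=me).fold(0, |mask, bit| mask | ibm_bit(bit))
}

/// `oris rA, rS, imm` ORs `imm << 16` into the register value.
fn oris(value: u32, imm: u16) -> u32 {
    value | ((imm as u32) << 16)
}

fn main() {
    // `oris 0x1` sets 0x0001_0000: IBM bit 15, or bit 16 counted from the LSB.
    assert_eq!(oris(0, 0x1), 0x0001_0000);
    assert_eq!(oris(0, 0x1), ibm_bit(15));
    assert_eq!(oris(0, 0x1), 1 << 16);
    // `oris 0x2` sets 0x0002_0000: IBM bit 14.
    assert_eq!(oris(0, 0x2), ibm_bit(14));
    // `rlwinm r11, r11, 0, 30, 30` keeps only IBM bit 30, which is mask 0x2.
    assert_eq!(rlwinm_mask(30, 30), 0x0000_0002);
}
```

Under this reading, "bit 16" and "bit 15" can describe the same `0x0001_0000` mask, depending on which end the count starts from.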
+ +## KRNBUG-AUDIT-017 — bit-14/15 writer triage; gate is β (`[0x828F4070+64]==-1`) with α tail (`XamUserGetSigninState=stub_return_zero`) (DIAGNOSTIC 2026-05-06) + +**Status**: read-only diagnostic. No fix landed. Master HEAD `d736a1d` unchanged. + +### Probe set +Static scan: `oris rN, rN, 0x1` or `oris rN, rN, 0x2` followed within 8 instructions by `stw rN, 4(rY)`. 5 candidates flagged. Runtime confirmation via `--branch-probe` at -n 500M + `--dump-addr=0x40ba9a80,0x828F48B0,0x828F4070`. + +### Decisive findings +1. **Static writer candidates** (5): + - `0x82173950` (sub_821737F0:bit-14, gated by `[r30+64]!=-1` AND XamUserGetSigninState ret-check) + - `0x82173e04` (sub_82173DC8 case-0xA:bit-15) + - `0x824d3ce8` (sub_824d3c78:bit-15, struct via `[parent+184]`) + - `0x824d3f24` (sub_824d3dc0:bit-14, struct via `[parent+184]`) + - `0x82769b84` (sub_82766db0:bit-15, struct stride 8 — false positive) +2. **Runtime: case-0xA fires once** at cycle 9183060 (PC 0x82173dfc), sets bit-15 of `[0x40ba9a80+4]`. Confirmed by EOR dump `[+0x0C]=0x000003E8` (case-0xA's subfic). +3. **sub_821737F0 work-path entered** at cycle 9183561 (lr=0x821737f8). Bit-15 cleared at 0x82173884. Bit-14 setter at 0x82173950 NEVER fires because at 0x821738E0, `cmpwi r3, -1; beq → 0x82173938` short-circuits (`r3=[r30+64]=0xFFFFFFFF`). +4. **r30 = `[0x828F48B0+0]` = `0x828F4070`** (singleton sub-object). EOR dump confirms `[0x828F4070+64]=0xFFFFFFFF`, initialized to -1 by `sub_821701c8` at 0x82170234. The only non-(-1) writer is `sub_82184318:0x82184374` (`bl 0x82456B58 (kernel handle creator); stw r3, 64(r30)`). Caller chain `sub_82184318 ← sub_82187768:0x821877bc ← sub_82187dd0:0x82187e78 ← sub_82183ca8:0x82183cd8 ← {sub_822919c8, sub_82186760, sub_821c88d0}`. **`sub_822919c8` is one of the audit-009 renderer-cluster L1 entry points that has zero non-call xrefs** — same γ-cluster blocked at audit-009/-016. +5. 
**bit-28 of `[0x828F4070+60]` IS set** at cycle 9224352 by `sub_821c4988:0x821c5450` — but 35,000 cycles AFTER case-9 fired. Also: bit-28 is a NEGATIVE gate at 0x821738F0 (`bne cr6, 0x82173938`) — bit-28 SET means NO bit-14. The positive gate is `[+64]!=-1`. +6. **Two orthogonal stubs uncovered (α tail)**: + - `XamUserGetSigninState` (xam.rs:48) is `stub_return_zero`. Even if β fixed, sub_821737F0's bit-14 deep-eval at 0x82173904-0x82173938 takes the no-bit-14 path in 2/3 sub-branches when ret==0. Also sub_822C2A80 at 0x822c2aac loops `XamUserGetSigninState(0..3)` searching for any signed-in user — broken. Canary `xam_user.cc:90-101` returns `SignedInLocally=1` for default profile. + +### Bug class +**β-dominant + α-tail.** Primary β is structural — `[0x828F4070+64]==-1` because the ctor that fills it (`sub_82184318`) is in the same audit-009 renderer cluster that audit-016 also identified. Secondary α is XamUserGetSigninState=stub_return_zero (2 separate guest paths broken). + +### Discipline gate +- Box 1: PARTIAL — α component named (XamUserGetSigninState) but not the dominant gate. +- Box 2: YES for α (5 LOC at `xam_user.cc:90-101`). +- Box 3: NO — β dominant, structural. +- Box 4-5: N/A. + +Boxes 1+3 fail. Hand back per stop condition 1. + +### Recommended next session (AUDIT-018) +- **Option A**: probe `sub_82184318, sub_82187768, sub_82187dd0, sub_82183ca8, sub_82186760, sub_821c88d0, sub_822919c8, sub_82456B58` at -n 500M to confirm the entire chain to `[singleton+64]` ctor is unreached. If all 8 fail to fire, this re-confirms γ-class structural blocker for the THIRD time (audit-009, -016, -017). Time to pivot strategy. +- **Option B**: canary-log diff during boot window 9.0M-9.3M cycles for any kernel call that writes a real handle to `0x828F4070+64`. Re-run `lutris lutris:rungameid/4` with kernel-call logging. +- **Option C** (cheap α): implement `XamUserGetSigninState` per canary (5 LOC). 
Will not fire cascade alone (β dominant) but is correct and unblocks sub_822C2A80. +- **Sharp 4-dim cascade prediction**: NEEDS FURTHER TRIAGE. + +### Trace artifacts +`audit-runs/audit-017-state-bits-writer/probe{1..5}.log` + `.err` (probe.log: 13 lines, probe3.log: 133 lines incl. dumps, probe4.log: 7 lines, probe5.log: 3 lines). + +--- + +### XamUserGetSigninState follow-up (post-AUDIT-017, master 7ed6192) + +Landed inline as a small canary-mirror correctness fix. Branch `xam-user-signin-state/p0-canary-mirror`, no-ff merged. + +- Impl returns `1` for user_index=0 (SignedInLocally), `0` otherwise. Mirrors canary `xam_user.cc:90-101`. +- Tests 599 → 600. Lockstep `instructions=100000012 → 100000006`, deterministic across 2 runs. +- **Cascade ripple**: `XamUserReadProfileSettings` now fires 2× (was canary-only). Per-AUDIT-017 prediction (α-tail correctness fix; β still dominant). +- Remaining canary-only kernel exports: `ExTerminateThread`, `KeReleaseSemaphore`. Down from 3 to 2. +- Renderer L1 reachability + parked-handle signal_attempts unchanged — β-class blocker `[0x828F4070+64]==-1` unmoved (audit-017's structural finding). + +## KRNBUG-AUDIT-018 — canary-log diff identifies α-class stub `KeResumeThread` (DIAGNOSTIC 2026-05-06) + +**Status**: read-only diagnostic. No fix landed. Master HEAD `7ed6192` unchanged. Tests 600. Lockstep `instructions=100000006`. + +### Method +Set-diff of kernel-call function names: ours (`audit-runs/audit-018-canary-diff/ours.log`, -n 500M) vs canary (`/home/fabi/xenia_canary_windows/xenia.log`, full boot to active rendering with `XamInputGetCapabilities` polling). + +### Decisive findings +1. Function-name diff: only 2 calls present in canary, absent in ours: `ExTerminateThread`, `KeReleaseSemaphore` — both already on the audit-006 canary-only export queue. +2. **`KeReleaseSemaphore(828A3230, 1, 1, 0)`** is hammered by canary tid `F800006C` repeatedly (audio-render ticker). 
That thread is created via `ExCreateThread(..., entry=0x824D2878, ctx=0, flags=0x10000001)` and immediately followed by `ObReferenceObjectByHandle / KeSetBasePriorityThread / KeResumeThread / ObDereferenceObject`. Same pattern for entry `0x824D2940`. +3. In our run, both these threads are `Blocked(Suspended)` at end-of-run. Counters `KeResumeThread = 2` and `NtResumeThread = 6` match canary's call pattern. +4. **Root cause**: `crates/xenia-kernel/src/exports.rs:3658-3664` — `ke_resume_thread` is a no-op cookie-returner that ignores r3 and sets r3=0. Comment claims "real `NtResumeThread` below handles the handle-based path properly", but `KeResumeThread` is a separate export that takes a KTHREAD pointer (which our `ObReferenceObjectByHandle` cookies as the handle itself per `exports.rs:3787-3807`). The fix is to mirror `nt_resume_thread`: `find_by_handle(handle).resume_ref(r)`. +5. Cross-reference: tid=17 (entry=0x82170430, ctx=0x828F4070, the audit-017 listener struct) IS spawned and parks on event handle 0x15E4 — same long-known parked dispatcher waiter. Worker body reads `[r29+56] (=[0x828F40A8])` as its loop predicate (clarification of audit-017's "+64" claim). Until tids 9/10 actually run, the audio-side cascade never starts. + +### Bug class +**α (named import stub_success on a load-bearing export)**. `KeResumeThread` is registered (canary `kImplemented`) but our impl is a stub_success no-op that fails to actually unsuspend. + +### Discipline gate +- Box 1 (named bug class with concrete evidence): YES. +- Box 2 (narrow fix ~5 LOC): YES. +- Box 3 (sharp 4-dim cascade prediction): YES (see memory file). +- Box 4 (no renderer/GPU changes): YES. +- Box 5 (lockstep determinism preserved): expected — same pattern as XamUserGetSigninState landing. + +**All 5 boxes pass — first time since IO-004.** + +### Sharp 4-dim cascade prediction +- **A (thread liveness)**: tids 9, 10 leave Suspended; XAudio voice-render workers run. 
+- **B (kernel counters)**: `KeReleaseSemaphore` non-zero for first time. `NtSetEvent` rises. Likely new `XAudioSubmitRenderDriverFrame`. +- **C (canary-only exports)**: 2→1 (`KeReleaseSemaphore` resolved). Possibly new audio-path exports. +- **D (listener `[+64]`)**: hypothesis-only — IF audit-017's β-class blocker is downstream of audio init, `[0x828F4070+64]` becomes non-(-1) and renderer cascade unblocks. If not, γ-cluster is independent → pivot to memory-watch instrumentation on `[+64]`. + +### Recommended next session (KRNBUG-IO-005 or KRNBUG-α-005) +Implement 5-LOC fix on branch `ke-resume-thread/p0-canary-mirror`: +```rust +fn ke_resume_thread(ctx: &mut PpcContext, _mem: &GuestMemory, state: &mut KernelState) { + let handle = resolve_pseudo_handle(state, ctx.gpr[3] as u32); + let prev = state.scheduler.find_by_handle(handle).map(|r| state.scheduler.resume_ref(r)).unwrap_or(0); + ctx.gpr[3] = prev; +} +``` +Lockstep ×2. Evaluate cascade. Tests 600→601 (add a `ke_resume_thread` unit test mirroring `nt_resume_thread`). + +### Trace artifacts +- `audit-runs/audit-018-canary-diff/ours.log` (full kernel trace + final-state thread diagnostics) +- `audit-runs/audit-018-canary-diff/ours.stdout.log` (counters) +- Canary: `/home/fabi/xenia_canary_windows/xenia.log` (untouched) + +## KRNBUG-KE-001 — Real `KeResumeThread` (LANDED 2026-05-06) + +`crates/xenia-kernel/src/exports.rs:3658-3669` — replaced the no-op cookie-returner with a canary-mirror real impl per `xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_threading.cc:216-227` (`XObject::GetNativeObject(...)->Resume()` → `STATUS_SUCCESS`, else `STATUS_INVALID_HANDLE`). Routes the KTHREAD-pointer-as-handle through `resolve_pseudo_handle` + `scheduler.find_by_handle` + `scheduler.resume_ref`, mirroring `nt_resume_thread`'s plumbing two functions below. 
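The difference between the old stub and the landed behaviour can be shown with a toy model. This is a sketch only: `ToyScheduler`, `Thread` and `stub_resume` are hypothetical stand-ins, not the real `xenia-kernel` types, and the suspend-count semantics are assumed from the unit test's description (decrement, unblock at zero). The two status values are the standard NT codes.

```rust
use std::collections::HashMap;

const STATUS_SUCCESS: u32 = 0x0000_0000;
const STATUS_INVALID_HANDLE: u32 = 0xC000_0008;

#[derive(Debug, PartialEq)]
enum State {
    Suspended,
    Ready,
}

struct Thread {
    suspend_count: u32,
    state: State,
}

/// Toy stand-in for the scheduler: maps a handle to a thread record.
struct ToyScheduler {
    threads: HashMap<u32, Thread>,
}

impl ToyScheduler {
    /// Real-style resume: look the handle up, decrement the suspend count,
    /// and mark the thread Ready once the count reaches zero.
    fn resume(&mut self, handle: u32) -> u32 {
        match self.threads.get_mut(&handle) {
            Some(t) => {
                t.suspend_count = t.suspend_count.saturating_sub(1);
                if t.suspend_count == 0 {
                    t.state = State::Ready;
                }
                STATUS_SUCCESS
            }
            None => STATUS_INVALID_HANDLE,
        }
    }
}

/// Old-style stub: ignores its argument and reports success.
fn stub_resume(_sched: &mut ToyScheduler, _handle: u32) -> u32 {
    STATUS_SUCCESS
}

fn main() {
    let mut sched = ToyScheduler { threads: HashMap::new() };
    sched.threads.insert(0x2000, Thread { suspend_count: 1, state: State::Suspended });

    // The stub returns success but the thread never leaves Suspended.
    assert_eq!(stub_resume(&mut sched, 0x2000), STATUS_SUCCESS);
    assert_eq!(sched.threads[&0x2000].state, State::Suspended);

    // The real path actually unblocks it.
    assert_eq!(sched.resume(0x2000), STATUS_SUCCESS);
    assert_eq!(sched.threads[&0x2000].state, State::Ready);

    // Unknown handles are rejected instead of silently succeeding.
    assert_eq!(sched.resume(0xDEAD_BEEF), STATUS_INVALID_HANDLE);
}
```

The stub case is why the call counters alone (`KeResumeThread = 2`) could match canary while the threads stayed `Blocked(Suspended)`.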
+ +### Cascade-prediction scorecard (audit-018 → post-fix) +- **A — thread liveness (PASS)**: tids 9 (entry=0x824D2878) and 10 (entry=0x824D2940) transition from `Blocked(Suspended)` → ran → now `Blocked(WaitAny)` on audio buffer-completion semaphores `0x828A3254` (handle 2190094932) / `0x828A3230` (handle 2190094896). Pre-fix they were Suspended at end-of-run; post-fix they execute their bodies and park on a downstream consumer wait. +- **B — counters (PARTIAL FAIL)**: `NtSetEvent 667→3334` (rises ~5×, audio frame-complete signaling). `KeResumeThread = 2` (now real). `NtResumeThread = 6`. **`KeReleaseSemaphore` still 0** (not in counters at all). **`XAudioSubmitRenderDriverFrame` still 0**. Workers ran prologue + parked on a downstream gate before reaching `KeReleaseSemaphore`. +- **C — canary-only delta (FAIL — predicted 2→1, actual 2→2)**: `ExTerminateThread` and `KeReleaseSemaphore` both still canary-only. The audio render-tick semaphore-release loop is gated by something downstream of the audio worker prologue. +- **D — γ-cluster blocker (FAIL)**: `--pc-probe=0x82184318,0x82184374` armed, neither fires. `--dump-addr=0x828F4070` armed, no DUMP lines emitted. Listener struct `[0x828F4070+64]` unchanged. `--trace-handles-focus` shows handles 0x1004/0x100c/0x1020/0x15e4 all still `signal_attempts=0`. + +### Milestone status +- Renderer cluster cascade collapsed? **NO**. +- signal_attempts > 0 on parked handles? **NO**. +- `draws > 0`? **NO** (still 0; `swaps` still 2). + +### Verification +- 600 → 601 tests (`cargo test --workspace --release` clean; new `ke_resume_thread_unblocks_suspended_worker` covers Suspended→Ready transition + INVALID_HANDLE branch). +- Lockstep determinism: `instructions=100000003 imports=987516` × 2 reruns identical. +- `swaps=2 draws=0` plateau intact. +- Goldens re-baselined: `sylpheed_n50m.json instructions 50000003→50000011, imports 407255→407247`. n2m unchanged. Oracle test passes. 
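A lockstep determinism check amounts to comparing the counters of two identical runs, and a golden re-baseline amounts to recording the signed change per counter. The sketch below illustrates both with the numbers quoted above. `RunCounters`, `first_divergence` and `delta` are hypothetical names, and the struct carries only two of the counters; the actual JSON schema of the lockstep run files is not shown in this entry.

```rust
/// Hypothetical subset of one run's counters.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RunCounters {
    instructions: u64,
    imports: u64,
}

/// Name of the first counter that differs, or None when the runs match.
fn first_divergence(a: RunCounters, b: RunCounters) -> Option<&'static str> {
    if a.instructions != b.instructions {
        Some("instructions")
    } else if a.imports != b.imports {
        Some("imports")
    } else {
        None
    }
}

/// Signed change between an old and a new counter value.
fn delta(old: u64, new: u64) -> i64 {
    new as i64 - old as i64
}

fn main() {
    // Two reruns with the post-fix lockstep counters quoted above.
    let run1 = RunCounters { instructions: 100_000_003, imports: 987_516 };
    let run2 = run1;
    assert_eq!(first_divergence(run1, run2), None);

    // Golden re-baseline for sylpheed_n50m.json: +8 instructions, -8 imports.
    assert_eq!(delta(50_000_003, 50_000_011), 8);
    assert_eq!(delta(407_255, 407_247), -8);
}
```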
+ +### Bug class (post-fact) +α (load-bearing stub_success). The fix unsticks two threads but those threads then park on a downstream gate that's part of a separate bug class — the audio voice-render dispatch never reaches `KeReleaseSemaphore`/`XAudioSubmitRenderDriverFrame` because the consumer-side semaphore producer is itself gated by something else (likely the same γ-cluster that audit-009/-016/-017 narrowed: `[0x828F4070+64]==-1`). + +### Recommended next session +Audit-019 — memory-watch instrumentation on `[0x828F4070+64]` (audit-017 Option B). With KE-001 landed, the discipline gate cleanly attributes the renderer plateau to the `[0x828F4070+64]` field rather than to a stub upstream. This narrows the search for the producer to whoever writes a real handle into offset +64 of the audit-017 singleton. + +### Trace artifacts +- `audit-runs/post-ke-resume/lockstep_run{1,2}.json` (lockstep determinism) +- `audit-runs/post-ke-resume/run.{log,err}` (full 500M cascade verification) +- `audit-runs/post-ke-resume/probe.{log,err}` (γ-cluster pc-probe + dump-addr) +- `audit-runs/post-ke-resume/handles.{log,err}` (--trace-handles-focus) + diff --git a/audit-runs/audit-006/canary_export_queue.md b/audit-runs/audit-006/canary_export_queue.md index d3b9f13..2afed74 100644 --- a/audit-runs/audit-006/canary_export_queue.md +++ b/audit-runs/audit-006/canary_export_queue.md @@ -1,6 +1,8 @@ # Canary-Only Export Fix Queue (audit-006) -- Status: **POST-IO-004 (2026-05-06): 7 → 3 canary-only.** Real `XamNotifyCreateListener` + `XNotifyGetNext` landed (KRNBUG-IO-004). Dispatch arm at `0x822f1be8` now fires; `sub_82173DC8` runs in a tight loop on tid=1; renderer-cluster L1 entries `0x822c6870`, `0x824563e0`, `0x823ddb50` are reached for the first time. 
Still canary-only: `ExTerminateThread`, `KeReleaseSemaphore`, `XamUserReadProfileSettings` — all REAL_BUT_UNREACHED at the new boot horizon. Worker count 18 → 20. signal_attempts on 0x15e0 = 1 (was 0). draws=0 still expected at this step. See KRNBUG-IO-004 in `audit-findings.md` and `project_xenia_rs_io_004_xnotify_listener_2026_05_06.md`. +- Status: **POST-KE-001 (2026-05-06): 2 canary-only (XamUserReadProfileSettings DROPPED post-XamUserGetSigninState landing earlier; KE-001 unsuspended audio workers but KeReleaseSemaphore producer is downstream-gated and did NOT fire).** `KeResumeThread` is now a real impl per canary `xboxkrnl_threading.cc:216-227` (KRNBUG-KE-001, branch `ke-resume-thread/p0-canary-mirror`). Cascade A passed: tids 9 (entry=0x824D2878) and 10 (entry=0x824D2940) leave Suspended → run prologue → park on `WaitAny` for audio buffer-completion semaphores `0x828A3254` / `0x828A3230`. Cascade B partial: `NtSetEvent 667→3334` (5×) but `KeReleaseSemaphore=0` and `XAudioSubmitRenderDriverFrame=0` — workers stuck before the producer. Cascade C predicted 2→1, actual 2→2 (`ExTerminateThread`, `KeReleaseSemaphore` both still canary-only). Cascade D: `--pc-probe=0x82184318,0x82184374` armed — neither fires; `--dump-addr=0x828F4070` no DUMP lines; γ-cluster blocker unchanged; signal_attempts on 0x1004/0x100c/0x1020/0x15e4 still 0. swaps=2 draws=0 plateau intact. Lockstep `instructions=100000003 imports=987516` deterministic ×2. Goldens re-baselined `sylpheed_n50m.json instructions 50000003→50000011, imports 407255→407247`. See KRNBUG-KE-001 in `audit-findings.md`. + +- Prior status (superseded by KE-001): **POST-IO-004 (2026-05-06): 7 → 3 canary-only.** Real `XamNotifyCreateListener` + `XNotifyGetNext` landed (KRNBUG-IO-004). Dispatch arm at `0x822f1be8` now fires; `sub_82173DC8` runs in a tight loop on tid=1; renderer-cluster L1 entries `0x822c6870`, `0x824563e0`, `0x823ddb50` are reached for the first time. 
4 reclassified RE-FIRES (now reached): `KeResetEvent`, `ObCreateSymbolicLink`, `XamTaskCloseHandle`, `XamTaskSchedule`. Still canary-only: `ExTerminateThread`, `KeReleaseSemaphore`, `XamUserReadProfileSettings` — all REAL_BUT_UNREACHED at the new boot horizon. Worker count 18 → 20. signal_attempts on 0x15e0 = 1 (was 0). draws=0 still expected at this step. See KRNBUG-IO-004 in `audit-findings.md` and `project_xenia_rs_io_004_xnotify_listener_2026_05_06.md`. - Prior status (superseded by IO-004): **AUDIT-009 (2026-05-05): GATE IS HIGHER THAN THE CLUSTER ITSELF.** AUDIT-008's β-hypothesis (gate sits among the 5 callers of `sub_821800D8` in 0x82287000-0x82292FFF) is **falsified**: a 21-PC `--branch-probe` (the 6 parents + 5 shims + dispatcher + 9 audit-005 producer-callsites) shows **0/21 firings** at -n 500M (`audit-runs/audit-009/probe-500m.err`). The whole 0x82287000-0x82294000 cluster is unreached. Static analysis: the cluster's level-1 root functions (`sub_82293448`, `sub_822919C8`) have **zero non-call xrefs in sylpheed.db** — they are reached only via vtable / function-pointer that's never written. Main parks at `sub_822F1AA8` frame-poll loop forever (1.49M XNotifyGetNext iterations). Three canary-only exports (`ExTerminateThread`, `KeReleaseSemaphore`, `XamUserReadProfileSettings`) remain REAL_BUT_UNREACHED — same as audit-008. **DO NOT pull from this queue.** Next-session probe set: cluster L1 roots + new thread entry trampolines (0x822c6870 / 0x824563e0 / 0x823dde30 / 0x823ddb50) + main's frame-poll callees + main's post-poll continuation list. See KRNBUG-AUDIT-009 in `audit-findings.md` and `project_xenia_rs_audit_009_renderer_unreached_2026_05_05.md`. 
diff --git a/crates/xenia-app/tests/golden/sylpheed_n50m.json b/crates/xenia-app/tests/golden/sylpheed_n50m.json index 8691377..1a3f42e 100644 --- a/crates/xenia-app/tests/golden/sylpheed_n50m.json +++ b/crates/xenia-app/tests/golden/sylpheed_n50m.json @@ -1,6 +1,6 @@ { - "instructions": 50000003, - "imports": 407255, + "instructions": 50000011, + "imports": 407247, "unimpl": 0, "draws": 0, "swaps": 2, diff --git a/crates/xenia-kernel/src/exports.rs b/crates/xenia-kernel/src/exports.rs index c759a26..40e9ab8 100644 --- a/crates/xenia-kernel/src/exports.rs +++ b/crates/xenia-kernel/src/exports.rs @@ -3656,11 +3656,16 @@ fn nt_yield_execution(ctx: &mut PpcContext, _mem: &GuestMemory, _state: &mut Ker } fn ke_resume_thread(ctx: &mut PpcContext, _mem: &GuestMemory, state: &mut KernelState) { - // r3 = thread_ptr (KTHREAD). We don't track KTHREAD ↔ HW mapping through - // guest memory addresses, so accept and succeed. Real NtResumeThread - // below handles the handle-based path properly. - ctx.gpr[3] = 0; - let _ = state; + let handle = resolve_pseudo_handle(state, ctx.gpr[3] as u32); + match state.scheduler.find_by_handle(handle) { + Some(r) => { + state.scheduler.resume_ref(r); + ctx.gpr[3] = STATUS_SUCCESS; + } + None => { + ctx.gpr[3] = STATUS_INVALID_HANDLE; + } + } } fn nt_resume_thread(ctx: &mut PpcContext, mem: &GuestMemory, state: &mut KernelState) { @@ -3983,6 +3988,52 @@ mod tests { assert_eq!(ctx.gpr[3], 7); } + /// `KeResumeThread` resolves the KTHREAD-pointer-as-handle, decrements the + /// target's suspend count, and unblocks once it hits zero. Mirrors + /// xboxkrnl_threading.cc:216-227 (XObject::GetNativeObject + + /// thread->Resume()). 
+ #[test] + fn ke_resume_thread_unblocks_suspended_worker() { + use xenia_cpu::scheduler::{BlockReason, HwState, SpawnParams}; + let (mut ctx, mut mem, mut state) = fresh(); + let pcr_base = SCRATCH_BASE + 0x500; + let params = SpawnParams { + entry: 0x8200_0000, + start_context: 0, + stack_base: 0x7200_0000, + stack_size: 0x10000, + pcr_base, + tls_base: 0, + thread_handle: 0x2000, + guest_tid: 42, + create_suspended: true, + is_initial: false, + tls_slot_count: 0, + affinity_mask: 0b0000_0010, + priority: 0, + ideal_processor: None, + }; + state + .scheduler + .spawn(params, &mut crate::state::GuestMemoryPcr(&mut mem)) + .unwrap(); + let r = state.scheduler.find_by_handle(0x2000).expect("spawned"); + assert_eq!( + state.scheduler.thread(r).state, + HwState::Blocked(BlockReason::Suspended) + ); + + ctx.gpr[3] = 0x2000; + ke_resume_thread(&mut ctx, &mem, &mut state); + assert_eq!(ctx.gpr[3], STATUS_SUCCESS); + let r = state.scheduler.find_by_handle(0x2000).expect("still alive"); + assert_eq!(state.scheduler.thread(r).state, HwState::Ready); + + ctx.gpr[3] = 0xDEAD_BEEF; + ke_resume_thread(&mut ctx, &mem, &mut state); + assert_eq!(ctx.gpr[3], STATUS_INVALID_HANDLE); + } + /// The regression we're guarding against: Sylpheed parks a thread on the /// event it handed to `NtReadFile`. Historically our HLE ignored r4 and /// left the event unsignaled — the wait never released. Completion must