feat(kernel): KRNBUG-AUDIT-005 — --pc-probe extension + canary diff identifies XexCheckExecutablePrivilege stub cascade

Extends `--ctor-probe` machinery into `--pc-probe` (clap alias) with
the optional `PC@DISPATCHER:OFFSET` token form: on a hit, the helper
additionally logs `[disp+off]` — what the producer's
`lwz r3, OFFSET(r3)` is about to read. Reuses `parse_hex_u32`; both
flags share parser + storage.

Read-only diagnostic. Lockstep digest preserved (`run digest matches
golden` at -n 50M `--stable-digest`). 588 tests green.

Decisive findings (full deliverable in `audit-findings.md` /
`audit-runs/audit-005/`):

- Failure mode α confirmed for KRNBUG-AUDIT-004: all 9 producer call
  sites for handles 0x100c (5 sites) and 0x15e0 (4 sites) fire 0x at
  -n 500M. The producer code path is not reached.

- Set-diff of kernel-call sequences (canary.log oracle vs ours.log
  at -n 500M) identifies 11 exports canary calls and we don't:
  XGetAVPack, XeCryptSha, XeKeysConsolePrivateKeySign,
  ObCreateSymbolicLink, NtDeviceIoControlFile (×2),
  XamUserReadProfileSettings (×2), XamTaskSchedule, XamTaskCloseHandle,
  KeReleaseSemaphore (×268), KeResetEvent, ExTerminateThread (×2).

- XGetAVPack has exactly one caller (sub_824AB578 at 0x824AB5A0).
  The 4 instructions immediately preceding it are:
      addi r3, r0, 10            ; privilege bit 10
      bl   XexCheckExecutablePrivilege
      cmpli 0, r3, 0
      bc 12, eq, 0x824AB724      ; if r3==0, skip whole block

- exports.rs:193 registers XexCheckExecutablePrivilege as
  stub_return_zero. Always returning 0 -> guest takes the branch
  and skips the entire AV/crypto/save-data init block.

- The other call site (sub_824A9710 at 0x824A99A0) queries privilege
  11 with opposite polarity (bne) -> gates XamTaskSchedule on the
  privilege-NOT-set arm. With both stubs returning 0, the guest
  walks the wrong arm of every privilege-gated branch.

- This explains why the dispatcher fields read zero
  ([0x828F3D08+0x50]=0, [0x828F4070+0x24]=0 from AUDIT-004 dumps):
  the ctors run, but the producers that would populate those fields
  with a non-zero handle never execute.

Next session: replace XexCheckExecutablePrivilege stub with real
priv-bit lookup from XEX header. See audit-findings.md
KRNBUG-AUDIT-005 for the validation matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
MechaCat02
2026-05-04 18:06:22 +02:00
parent 6a070bedc6
commit 3e2fc1ec88
3 changed files with 245 additions and 18 deletions

View File

@@ -4622,3 +4622,194 @@ function 0x824AA1D8 is then called with handle=0). If none fire,
the producer chain is gated upstream (likely a feature flag, init
phase, or RPC handler that never fires). Either way, the next
diagnostic narrows the bug surface dramatically.
---
### KRNBUG-AUDIT-005 — `--pc-probe` extended syntax + canary kernel-call diff; `XexCheckExecutablePrivilege` stub gates init flow
**Status**: landed on master (no-ff merge of feature branch
`canary-diff-and-pc-consumer-probe/p0-priv-stub-cascade`). Diagnostic-
only, read-only, lockstep-preserved (`run digest matches golden` at
`-n 50M --stable-digest`).
**Tests**: 588 → **588** (unchanged; existing ctor-probe tests cover the
shared infrastructure).
**What landed (`crates/xenia-kernel/src/state.rs`):**
- `pub pc_probe_consumers: HashMap<u32, (u32, u32)>` field on
`KernelState` (default empty). Maps a probe PC to a
`(dispatcher_addr, offset)` pair; on hit the helper additionally
logs `[disp+off]` — what the producer's `lwz r3, OFFSET(r3)` is
about to read after `bl outer_getter` returns the dispatcher in r3.
- `fire_ctor_probe_if_match` extended to read+print the consumer
field when present. Pure load — does not mutate guest state.
**What landed (`crates/xenia-app/src/main.rs`):**
- `--pc-probe` clap alias on `--ctor-probe` (semantically clearer
name; both share parser/storage).
- Extended token syntax `PC@DISPATCHER:OFFSET` parsed via existing
`parse_hex_u32`. Plain `PC` form still works (backward compatible).
- `XENIA_PC_PROBE` env var as alias for `XENIA_CTOR_PROBE`.
**What landed (`audit-runs/audit-005/`):** one-shot diagnostic
artifacts — not part of the repo build:
- `canary.log` — copy of `/home/fabi/xenia_canary_windows/xenia.log` from a Lutris launch of Sylpheed; oracle for what should happen
- `ours.log` — our trace at `-n 500M` with the 9-PC probe + `probe_calls=trace` filter (838 MB, 5.6 M lines)
- `diff.py` — kernel-call sequence diff (set-diff + first-divergence window); deletable after the audit
- `probe-test-10m.log` — initial smoke test confirming probe wiring
**Reproduce:**
```bash
cargo run --release -p xenia-app -- \
--log-filter='probe_calls=trace,xenia=warn' \
exec sylpheed.iso \
--halt-on-deadlock \
--trace-handles-focus=0x1004,0x100c,0x15e0 \
--pc-probe=0x821802D8@0x828F3D08:80,0x821806E0@0x828F3D08:80,\
0x82180B28@0x828F3D08:80,0x82180EA0@0x828F3D08:80,\
0x82181254@0x828F3D08:80,0x8216F9D4@0x828F4070:36,\
0x8216FC08@0x828F4070:36,0x821700B8@0x828F4070:36,\
0x821700F4@0x828F4070:36 \
-n 500_000_000 \
2> audit-runs/audit-005/ours.log
python3 audit-runs/audit-005/diff.py --max 100 --window 30
```
**Decisive findings:**
1. **Failure mode (α) for KRNBUG-AUDIT-004 confirmed.** All 9
non-create-chain producer call sites for handles 0x100c
(5 sites at `0x821802D8 / 0x821806E0 / 0x82180B28 / 0x82180EA0 /
0x82181254`) and 0x15e0 (4 sites at `0x8216F9D4 / 0x8216FC08 /
0x821700B8 / 0x821700F4`) **fire 0×** at -n 500M
(`grep -c CTOR-PROBE ours.log == 0`). The producer code path is
not reached. Rules out failure mode (B: `lwz` reads zero) and (3:
wake function called with stale handle). The bug is upstream,
in the control-flow that should lead the guest to those producer
functions.
2. **Upstream control-flow divergence located: `XexCheckExecutablePrivilege`
stub returning 0.** Set-diff of kernel-call sequences across our
500M-instruction run vs canary's full Sylpheed boot
(`canary.log`, ~5.3K lines, post-`swaps=2` boot loop reached)
identifies **11 exports that canary calls and we don't**:
```
ExTerminateThread (×2)
KeReleaseSemaphore (×268) ← we use Nt* equivalents
KeResetEvent (×1)
NtDeviceIoControlFile (×2)
ObCreateSymbolicLink (×1)
XGetAVPack (×1) ← gated by priv-10 check
XamTaskCloseHandle (×1)
XamTaskSchedule (×1) ← AUDIT-002 producer candidate
XamUserReadProfileSettings (×2)
XeCryptSha (×1)
XeKeysConsolePrivateKeySign (×1)
```
`XGetAVPack` has exactly one caller (`xrefs` table): site
`0x824AB5A0` inside `sub_824AB578`. The 4 instructions immediately
preceding it are:
```
824ab58c addi r3, r0, 10 ; privilege bit 10
824ab590 addi r31, r0, 0
824ab594 bl 0x8284DEFC ; XexCheckExecutablePrivilege
824ab598 cmpli 0, r3, 0x0
824ab59c bc 12, eq, 0x824AB724 ; if r3==0, skip whole block
; (XGetAVPack + crypto + Nt writes)
```
Our impl `crates/xenia-kernel/src/exports.rs:193`:
```rust
state.register_export(Xboxkrnl, 0x0194, "XexCheckExecutablePrivilege",
stub_return_zero);
```
`stub_return_zero` returns r3=0 unconditionally → guest takes
the `bc 12, eq, 0x824AB724` branch and skips the entire
AV/crypto/save-data init block.
The OTHER call site (`sub_824A9710`, `0x824A99A0`) queries
privilege bit **11**:
```
824a999c addi r3, r0, 11
824a99a0 bl 0x8284DEFC ; XexCheckExecutablePrivilege(11)
824a99a4 cmpli 0, r3, 0
824a99a8 bc 4, eq, 0x824A9A60 ; bne — skip block if priv set
```
Different polarity (this one gates `XamTaskSchedule` etc. on
the **privilege-NOT-set** path). With both stubs returning 0,
the guest walks the wrong arm of *every* privilege-gated branch.
3. **Cascade reaches the parked-waiter handles.** Trace evidence:
our `probe_calls` log shows `lr=0x824A97E4` (a hit from the
error path inside `sub_824A9710` *after* `sub_824ABA98` returned
negative NTSTATUS). The canary log shows all 11 missing exports
firing in a single contiguous boot phase between `XexCheckExecutablePrivilege`
and the worker-thread spawn — i.e. the init phase that sets up
the dispatcher data structures is exactly the phase we skip.
This explains **why the dispatcher fields read zero** (AUDIT-004
dump: `[0x828F3D08+0x50] = 0`, `[0x828F4070+0x24] = 0`): the
ctors run (we counted those), but the *producers* that would
populate those fields with a non-zero handle never execute,
because the upstream init flow that registers them is gated
by the privilege checks.
4. **Note on the diff: canary's log is filtered.** Canary's config
has `log_high_frequency_kernel_calls = false`, which suppresses
most `Rtl*`, `Mm*`, `Ke*`-internal calls from the log. The
"called in OURS but not canary" set (23 entries, headed by
`NtWaitForSingleObjectEx ×1.5M`) is dominated by this filter
difference — it is **not** a bug surface. The directionally
meaningful side of the diff is "called in CANARY but not OURS"
(above): canary's log includes every low-frequency call, so any
absence on our side is a real divergence.
**Stop conditions check:**
- Canary itself does NOT stall at swaps=2 — it reaches a steady
frame loop with `XamInputGetCapabilities` polling, texture loads,
`KeReleaseSemaphore` ticks. The diff was informative.
- First divergence is dense early-CRT noise (~3 entries in), but
the meaningful divergence anchored to a concrete export
(`XGetAVPack`, deterministically gated by a one-line stub) was
recoverable via set-diff. Did not need to narrow scope further.
**Recommendation for next session (do not implement a fix here — this
is the read-only audit deliverable):**
Replace `stub_return_zero` for `XexCheckExecutablePrivilege` with a
real implementation. The XEX header's privilege bitmask is parsed
during XEX load (see `crates/xenia-xex/`); `KernelState` already
holds the loaded `image_base`. Implementation outline:
- Parse `XEX_HEADER_EXECUTION_INFO` / privilege bits at load time
into `KernelState` (or surface via `Vfs` already-loaded XEX
metadata).
- `xex_check_executable_privilege(priv_id) -> u32`:
return 1 if bit `priv_id` is set in the title's privilege bitmask,
else 0. Match canary's encoding (privilege IDs are 0..7F; canary
reads `PrivilegeFlags[i/8] & (1 << (i%8))` from the XEX execution
info).
Validation after the fix:
1. Re-run `audit-runs/audit-005/diff.py` — `XGetAVPack`,
`XamTaskSchedule`, `XeCryptSha`, etc. should appear in our
sequence and the divergence should advance several hundred
calls past the priv-check.
2. Re-run with the 9-PC probe armed at -n 500M — at minimum, the
ctor-probe firings change, and ideally one or more of the 9
producer sites starts firing.
3. If producer sites fire, dispatcher fields `[0x828F3D08+0x50]` /
`[0x828F4070+0x24]` become non-zero (use `--dump-addr`).
4. Lockstep golden `crates/xenia-app/tests/golden/sylpheed_n50m.json`
will likely change (`imports` count goes up, `swaps` may advance);
regenerate the golden under `--stable-digest` and treat that as
the new lockstep anchor.
If after the fix the producer is reached and dispatcher fields
populate, the parked-waiter deadlock should resolve — or surface
the next layer of bugs (e.g. signaling code reads non-zero handle
but `wake_eligible_waiters` fails).