feat(kernel): KRNBUG-AUDIT-005 — --pc-probe extension + canary diff identifies XexCheckExecutablePrivilege stub cascade
Extends `--ctor-probe` machinery into `--pc-probe` (clap alias) with
the optional `PC@DISPATCHER:OFFSET` token form: on a hit, the helper
additionally logs `[disp+off]` — what the producer's
`lwz r3, OFFSET(r3)` is about to read. Reuses `parse_hex_u32`; both
flags share parser + storage.
Read-only diagnostic. Lockstep digest preserved (`run digest matches
golden` at -n 50M `--stable-digest`). 588 tests green.
Decisive findings (full deliverable in `audit-findings.md` /
`audit-runs/audit-005/`):
- Failure mode α confirmed for KRNBUG-AUDIT-004: all 9 producer call
sites for handles 0x100c (5 sites) and 0x15e0 (4 sites) fire 0 times at
-n 500M. The producer code path is not reached.
- Set-diff of kernel-call sequences (canary.log oracle vs ours.log
at -n 500M) identifies 11 exports canary calls and we don't:
XGetAVPack, XeCryptSha, XeKeysConsolePrivateKeySign,
ObCreateSymbolicLink, NtDeviceIoControlFile (×2),
XamUserReadProfileSettings (×2), XamTaskSchedule, XamTaskCloseHandle,
KeReleaseSemaphore (×268), KeResetEvent, ExTerminateThread (×2).
- XGetAVPack has exactly one caller (sub_824AB578 at 0x824AB5A0).
The 4 instructions immediately preceding it are:
addi r3, r0, 10 ; privilege bit 10
bl XexCheckExecutablePrivilege
cmpli 0, r3, 0
bc 12, eq, 0x824AB724 ; if r3==0, skip whole block
- exports.rs:193 registers XexCheckExecutablePrivilege as
stub_return_zero. Always returning 0 -> guest takes the branch
and skips the entire AV/crypto/save-data init block.
- The other call site (sub_824A9710 at 0x824A99A0) queries privilege
11 with opposite polarity (bne) -> gates XamTaskSchedule on the
privilege-NOT-set arm. With the stub returning 0 at both call sites,
the guest always walks the privilege-not-set arm, which is wrong
wherever the title actually holds the privilege.
- This explains why the dispatcher fields read zero
([0x828F3D08+0x50]=0, [0x828F4070+0x24]=0 from AUDIT-004 dumps):
the ctors run, but the producers that would populate those fields
with a non-zero handle never execute.
Next session: replace XexCheckExecutablePrivilege stub with real
priv-bit lookup from XEX header. See audit-findings.md
KRNBUG-AUDIT-005 for the validation matrix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -4622,3 +4622,194 @@ function 0x824AA1D8 is then called with handle=0). If none fire,
the producer chain is gated upstream (likely a feature flag, init
phase, or RPC handler that never fires). Either way, the next
diagnostic narrows the bug surface dramatically.

---

### KRNBUG-AUDIT-005 — `--pc-probe` extended syntax + canary kernel-call diff; `XexCheckExecutablePrivilege` stub gates init flow

**Status**: landed on master (no-ff merge of feature branch
`canary-diff-and-pc-consumer-probe/p0-priv-stub-cascade`).
Diagnostic-only, read-only, lockstep-preserved (`run digest matches
golden` at `-n 50M --stable-digest`).

**Tests**: 588 → **588** (unchanged; existing ctor-probe tests cover the
shared infrastructure).

**What landed (`crates/xenia-kernel/src/state.rs`):**
- `pub pc_probe_consumers: HashMap<u32, (u32, u32)>` field on
`KernelState` (default empty). Maps a probe PC to a
`(dispatcher_addr, offset)` pair; on a hit the helper additionally
logs `[disp+off]`, i.e. what the producer's `lwz r3, OFFSET(r3)` is
about to read after `bl outer_getter` returns the dispatcher in r3.
- `fire_ctor_probe_if_match` extended to read and print the consumer
field when present. Pure load; does not mutate guest state.
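
The consumer-field read described above might look roughly like the following. This is only a sketch under stated assumptions: `GuestMem`, `read_u32_be`, and `consumer_field_line` are hypothetical names, the flat byte-slice memory view stands in for whatever memory API the emulator really uses, and the log format is invented for illustration. Only the `HashMap<u32, (u32, u32)>` shape is taken from the change itself.

```rust
use std::collections::HashMap;

/// Hypothetical read-only view of a guest memory window. The emulator's real
/// memory API is not shown in this audit.
struct GuestMem<'a> {
    base: u32,
    bytes: &'a [u8],
}

impl GuestMem<'_> {
    /// Big-endian 32-bit load (the guest PowerPC CPU is big-endian).
    /// Takes `&self` only, so it cannot mutate guest state.
    fn read_u32_be(&self, addr: u32) -> Option<u32> {
        let start = addr.checked_sub(self.base)? as usize;
        let word = self.bytes.get(start..start.checked_add(4)?)?;
        Some(u32::from_be_bytes([word[0], word[1], word[2], word[3]]))
    }
}

/// Returns an illustrative log line for a probe hit at `pc`, or `None` when
/// `pc` has no consumer entry. The line format is an assumption.
fn consumer_field_line(
    consumers: &HashMap<u32, (u32, u32)>,
    mem: &GuestMem,
    pc: u32,
) -> Option<String> {
    let &(disp, off) = consumers.get(&pc)?;
    let addr = disp.wrapping_add(off);
    Some(match mem.read_u32_be(addr) {
        Some(v) => format!("pc={pc:#010x} [{disp:#010x}+{off:#x}] = {v:#010x}"),
        None => format!("pc={pc:#010x} [{disp:#010x}+{off:#x}] = <unmapped>"),
    })
}
```

Keeping the lookup behind shared references is one way to make the "pure load" property visible in the signature rather than relying on convention.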

**What landed (`crates/xenia-app/src/main.rs`):**
- `--pc-probe` clap alias on `--ctor-probe` (semantically clearer
name; both share parser/storage).
- Extended token syntax `PC@DISPATCHER:OFFSET` parsed via existing
`parse_hex_u32`. Plain `PC` form still works (backward compatible).
- `XENIA_PC_PROBE` env var as alias for `XENIA_CTOR_PROBE`.
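
For illustration, the two token forms could be parsed along these lines. This is a sketch, not the code in `main.rs`: `ProbeToken`, `parse_probe_token`, `parse_probe_list`, and `parse_u32_token` are hypothetical names, and `parse_u32_token` is a stand-in for the repo's `parse_hex_u32`, whose exact behaviour is not shown here. The stand-in assumes `0x`-prefixed values are hex and unprefixed values are decimal, which is one reading consistent with the reproduce command passing `:80` and `:36` for fields at `+0x50` and `+0x24`.

```rust
/// One parsed probe token: the probe PC plus an optional consumer field.
#[derive(Debug, PartialEq)]
struct ProbeToken {
    pc: u32,
    /// `(dispatcher_addr, offset)` when the `PC@DISPATCHER:OFFSET` form is used.
    consumer: Option<(u32, u32)>,
}

/// Hypothetical number parser (assumption: `0x` prefix means hex, otherwise decimal).
fn parse_u32_token(s: &str) -> Result<u32, String> {
    let s = s.trim();
    let parsed = match s.strip_prefix("0x").or_else(|| s.strip_prefix("0X")) {
        Some(hex) => u32::from_str_radix(hex, 16),
        None => s.parse::<u32>(),
    };
    parsed.map_err(|e| format!("bad number `{s}`: {e}"))
}

/// Parses either `PC` or `PC@DISPATCHER:OFFSET`.
fn parse_probe_token(tok: &str) -> Result<ProbeToken, String> {
    match tok.split_once('@') {
        // Plain form: backward compatible with the original ctor probe.
        None => Ok(ProbeToken { pc: parse_u32_token(tok)?, consumer: None }),
        Some((pc, rest)) => {
            let (disp, off) = rest
                .split_once(':')
                .ok_or_else(|| format!("expected DISPATCHER:OFFSET in `{tok}`"))?;
            Ok(ProbeToken {
                pc: parse_u32_token(pc)?,
                consumer: Some((parse_u32_token(disp)?, parse_u32_token(off)?)),
            })
        }
    }
}

/// Parses a comma-separated list of tokens, failing on the first bad one.
fn parse_probe_list(arg: &str) -> Result<Vec<ProbeToken>, String> {
    arg.split(',')
        .filter(|t| !t.trim().is_empty())
        .map(parse_probe_token)
        .collect()
}
```

Returning `Option<(u32, u32)>` per token lets a single parser serve both flags, in line with the note that they share parser and storage.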

**What landed (`audit-runs/audit-005/`):** one-shot diagnostic
artifacts, not part of the repo build:
- `canary.log`: copy of `/home/fabi/xenia_canary_windows/xenia.log` from a Lutris launch of Sylpheed; oracle for what should happen
- `ours.log`: our trace at `-n 500M` with the 9-PC probe + `probe_calls=trace` filter (838 MB, 5.6 M lines)
- `diff.py`: kernel-call sequence diff (set-diff + first-divergence window); deletable after the audit
- `probe-test-10m.log`: initial smoke test confirming probe wiring

**Reproduce:**

```bash
cargo run --release -p xenia-app -- \
--log-filter='probe_calls=trace,xenia=warn' \
exec sylpheed.iso \
--halt-on-deadlock \
--trace-handles-focus=0x1004,0x100c,0x15e0 \
--pc-probe=0x821802D8@0x828F3D08:80,0x821806E0@0x828F3D08:80,\
0x82180B28@0x828F3D08:80,0x82180EA0@0x828F3D08:80,\
0x82181254@0x828F3D08:80,0x8216F9D4@0x828F4070:36,\
0x8216FC08@0x828F4070:36,0x821700B8@0x828F4070:36,\
0x821700F4@0x828F4070:36 \
-n 500_000_000 \
2> audit-runs/audit-005/ours.log

python3 audit-runs/audit-005/diff.py --max 100 --window 30
```

**Decisive findings:**

1. **Failure mode (α) for KRNBUG-AUDIT-004 confirmed.** All 9
non-create-chain producer call sites for handles 0x100c
(5 sites at `0x821802D8 / 0x821806E0 / 0x82180B28 / 0x82180EA0 /
0x82181254`) and 0x15e0 (4 sites at `0x8216F9D4 / 0x8216FC08 /
0x821700B8 / 0x821700F4`) **fire 0×** at -n 500M
(`grep -c CTOR-PROBE ours.log == 0`). The producer code path is
not reached. This rules out failure modes (β: `lwz` reads zero) and
(γ: wake function called with stale handle). The bug is upstream,
in the control flow that should lead the guest to those producer
functions.

2. **Upstream control-flow divergence located: `XexCheckExecutablePrivilege`
stub returning 0.** Set-diff of kernel-call sequences across our
500M-instruction run vs canary's full Sylpheed boot
(`canary.log`, ~5.3K lines, post-`swaps=2` boot loop reached)
identifies **11 exports that canary calls and we don't**:

```
ExTerminateThread            (×2)
KeReleaseSemaphore           (×268)  ← we use Nt* equivalents
KeResetEvent                 (×1)
NtDeviceIoControlFile        (×2)
ObCreateSymbolicLink         (×1)
XGetAVPack                   (×1)    ← gated by priv-10 check
XamTaskCloseHandle           (×1)
XamTaskSchedule              (×1)    ← AUDIT-002 producer candidate
XamUserReadProfileSettings   (×2)
XeCryptSha                   (×1)
XeKeysConsolePrivateKeySign  (×1)
```

`XGetAVPack` has exactly one caller (`xrefs` table): site
`0x824AB5A0` inside `sub_824AB578`. The 5 instructions immediately
preceding it are:

```
824ab58c  addi  r3, r0, 10          ; privilege bit 10
824ab590  addi  r31, r0, 0
824ab594  bl    0x8284DEFC          ; XexCheckExecutablePrivilege
824ab598  cmpli 0, r3, 0x0
824ab59c  bc    12, eq, 0x824AB724  ; if r3==0, skip whole block
                                    ; (XGetAVPack + crypto + Nt writes)
```

Our impl `crates/xenia-kernel/src/exports.rs:193`:
```rust
state.register_export(Xboxkrnl, 0x0194, "XexCheckExecutablePrivilege",
    stub_return_zero);
```
`stub_return_zero` returns r3=0 unconditionally → guest takes
the `bc 12, eq, 0x824AB724` branch and skips the entire
AV/crypto/save-data init block.

The OTHER call site (`sub_824A9710`, `0x824A99A0`) queries
privilege bit **11**:
```
824a999c  addi  r3, r0, 11
824a99a0  bl    0x8284DEFC          ; XexCheckExecutablePrivilege(11)
824a99a4  cmpli 0, r3, 0
824a99a8  bc    4, eq, 0x824A9A60   ; bne: skip block if priv set
```
Different polarity (this one gates `XamTaskSchedule` etc. on
the **privilege-NOT-set** path). With the stub returning 0 at both
call sites, the guest always walks the privilege-not-set arm, which
is wrong wherever the title actually holds the privilege.

3. **Cascade reaches the parked-waiter handles.** Trace evidence:
our `probe_calls` log shows `lr=0x824A97E4` (a hit from the
error path inside `sub_824A9710` *after* `sub_824ABA98` returned
negative NTSTATUS). The canary log shows all 11 missing exports
firing in a single contiguous boot phase between `XexCheckExecutablePrivilege`
and the worker-thread spawn, i.e. the init phase that sets up
the dispatcher data structures is exactly the phase we skip.
This explains **why the dispatcher fields read zero** (AUDIT-004
dump: `[0x828F3D08+0x50] = 0`, `[0x828F4070+0x24] = 0`): the
ctors run (we counted those), but the *producers* that would
populate those fields with a non-zero handle never execute,
because the upstream init flow that registers them is gated
by the privilege checks.

4. **Note on the diff: canary's log is filtered.** Canary's config
has `log_high_frequency_kernel_calls = false`, which suppresses
most `Rtl*`, `Mm*`, `Ke*`-internal calls from the log. The
"called in OURS but not canary" set (23 entries, headed by
`NtWaitForSingleObjectEx ×1.5M`) is dominated by this filter
difference; it is **not** a bug surface. The directionally
meaningful side of the diff is "called in CANARY but not OURS"
(above): canary's log includes every low-frequency call, so any
absence on our side is a real divergence.
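
The one-directional set-diff used in findings 2 and 4 can be sketched as follows. This is not `diff.py`: it is a hypothetical Rust rendering of the same idea, and it assumes export names have already been extracted from each log, since neither log format is documented here. `call_counts` and `missing_in_ours` are illustrative names.

```rust
use std::collections::BTreeMap;

/// Counts how often each export name appears in a call sequence.
fn call_counts<'a>(calls: impl IntoIterator<Item = &'a str>) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for name in calls {
        *counts.entry(name).or_insert(0) += 1;
    }
    counts
}

/// Exports the oracle calls at least once and we never call, with the
/// oracle's call count, sorted by name. Only this direction is reported:
/// the reverse direction is dominated by the oracle's log filter.
fn missing_in_ours<'a>(
    oracle: &BTreeMap<&'a str, usize>,
    ours: &BTreeMap<&'a str, usize>,
) -> Vec<(&'a str, usize)> {
    let mut missing = Vec::new();
    for (&name, &count) in oracle {
        if !ours.contains_key(&name) {
            missing.push((name, count));
        }
    }
    missing
}
```

A `BTreeMap` keeps the report sorted by export name, matching the alphabetical listing in finding 2.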

**Stop conditions check:**

- Canary itself does NOT stall at swaps=2: it reaches a steady
frame loop with `XamInputGetCapabilities` polling, texture loads,
`KeReleaseSemaphore` ticks. The diff was informative.
- First divergence is dense early-CRT noise (~3 entries in), but
the meaningful divergence anchored to a concrete export
(`XGetAVPack`, deterministically gated by a one-line stub) was
recoverable via set-diff. Did not need to narrow scope further.

**Recommendation for next session (do not implement a fix here; this
is the read-only audit deliverable):**

Replace `stub_return_zero` for `XexCheckExecutablePrivilege` with a
real implementation. The XEX header's privilege bitmask is parsed
during XEX load (see `crates/xenia-xex/`); `KernelState` already
holds the loaded `image_base`. Implementation outline:
- Parse `XEX_HEADER_EXECUTION_INFO` / privilege bits at load time
into `KernelState` (or surface via `Vfs` already-loaded XEX
metadata).
- `xex_check_executable_privilege(priv_id) -> u32`:
return 1 if bit `priv_id` is set in the title's privilege bitmask,
else 0. Match canary's encoding (privilege IDs are 0..7F; canary
reads `PrivilegeFlags[i/8] & (1 << (i%8))` from the XEX execution
info).
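
A minimal sketch of the lookup in the outline above, using the byte/bit indexing exactly as quoted there. That encoding has not been checked against canary's source as part of this audit, so treat it as an assumption. The `privilege_flags` parameter is also hypothetical: it stands for the title's privilege bitmask as raw bytes, however the loader ends up exposing it.

```rust
/// Returns 1 if privilege `priv_id` is set in `privilege_flags`, else 0.
/// Out-of-range IDs report "not set" instead of panicking.
/// Encoding assumption: bit `priv_id % 8` of byte `priv_id / 8`.
fn xex_check_executable_privilege(privilege_flags: &[u8], priv_id: u32) -> u32 {
    let byte_index = (priv_id / 8) as usize;
    let bit = 1u8 << (priv_id % 8);
    match privilege_flags.get(byte_index) {
        Some(&byte) if (byte & bit) != 0 => 1,
        _ => 0,
    }
}
```

Under this encoding, a title holding privilege 10 but not 11 would make the check return 1 at the first call site and 0 at the second, so each `bc` would follow the arm the title expects.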

Validation after the fix:
1. Re-run `audit-runs/audit-005/diff.py`: `XGetAVPack`,
`XamTaskSchedule`, `XeCryptSha`, etc. should appear in our
sequence and the divergence should advance several hundred
calls past the priv-check.
2. Re-run with the 9-PC probe armed at -n 500M: at minimum, the
ctor-probe firings change, and ideally one or more of the 9
producer sites starts firing.
3. If producer sites fire, dispatcher fields `[0x828F3D08+0x50]` /
`[0x828F4070+0x24]` become non-zero (use `--dump-addr`).
4. Lockstep golden `crates/xenia-app/tests/golden/sylpheed_n50m.json`
will likely change (`imports` count goes up, `swaps` may advance);
regenerate the golden under `--stable-digest` and treat that as
the new lockstep anchor.

If after the fix the producer is reached and dispatcher fields
populate, the parked-waiter deadlock should resolve, or surface
the next layer of bugs (e.g. signaling code reads non-zero handle
but `wake_eligible_waiters` fails).