feat(kernel): KRNBUG-AUDIT-007 — --branch-probe instrumentation; sub_824A9710 exit gate identified

Sister to --pc-probe / --ctor-probe but emits a single compact one-line
BRANCH-PROBE record per fire (pc, tid, hw, cycle, r3, lr, cr0/cr6 flags)
with no back-chain. Designed for tracing every conditional-branch fire
inside a candidate-gate function so the last PC reached before the
function epilogue identifies the exit branch.

Runtime trace at audit-runs/audit-007/sub_824A9710-trace.log decisively
identifies the priv-11 gate:

- Exit branch: 0x824a9944 (post bl sub_824ABD88 first call)
- Responsible kernel call: NtDeviceIoControlFile, FsCtlCode=0x74004
  (registered as stub_success at exports.rs:90)
- Mechanical chain: stub returns 0/SUCCESS without writing the OUT
  buffer; game reads [out_buf+8], finds zero, assigns hardcoded
  0xC0000034 (STATUS_OBJECT_NAME_NOT_FOUND) at
  sub_824ABD88:0x824abea8-ac, and exits via 0x824a9944's lt branch
  before reaching the priv-11 site at 0x824a99a0.

592→592 tests; lockstep instructions=100000010, swaps=2, draws=0
deterministic across reruns. Read-only diagnostic — no fix this session.
Next session: KRNBUG-IO-003 (real NtDeviceIoControlFile per canary
NullDevice::IoControl for FsCtlCodes 0x70000 + 0x74004).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MechaCat02
2026-05-04 21:35:10 +02:00
parent 79697ddf4e
commit c51f51f9cb
3 changed files with 229 additions and 0 deletions


@@ -5079,3 +5079,143 @@ the priv-11 gate**.
- `audit-runs/post-IO-002/canary_only.txt` — set-difference output (the 7-entry list)
- `audit-runs/post-IO-002/canary_exports.txt`, `ours_exports.txt` — sorted unique export names
---
## KRNBUG-AUDIT-007 — branch-probe instrumentation + sub_824A9710 exit-branch identification (2026-05-04)
### Outcome
**`--branch-probe` instrumentation landed (read-only diagnostic). Runtime trace decisively identified the priv-11 gate.**
- 592→592 tests; lockstep `instructions=100000010, swaps=2, draws=0` deterministic across reruns
(`audit-runs/audit-007/lock_post_branchprobe.json` ≡ `lock_post_branchprobe_run2.json`
≡ `audit-runs/post-IO-002/lock_n100m_run1.json`).
- Branch: `investigate-sub-824a9710/p0-branch-probe` — kept (instrumentation is reusable).
### Decisive runtime evidence
`audit-runs/audit-007/sub_824A9710-trace.log`:
```
BRANCH-PROBE pc=0x824a9710 tid=1 hw=0 cycle=5363003 r3=0x00000000 lr=0x824a9acc
BRANCH-PROBE pc=0x824a97e0 tid=1 hw=0 cycle=5369559 r3=0xc0000034 lr=0x824a9940
BRANCH-PROBE pc=0x824a9a98 tid=1 hw=0 cycle=5369562 r3=0x00000002 lr=0x824a97e4
```
The probe at `0x824a97e0` (the failure landing pad) captured `r3=0xC0000034`, `lr=0x824a9940` (= the
`cmpi 0,r3,0` PC after `bl sub_824ABD88` at `0x824a993c`). This pinpoints:
- **Exit branch**: `0x824a9944` (`bc 12, lt, 0x824A97E0`), taken because `r3 = 0xC0000034` is negative when compared as a signed 32-bit value.
- **Responsible bl**: `0x824a993c` → `sub_824ABD88` first call.
- **Status code**: `0xC0000034` = `STATUS_OBJECT_NAME_NOT_FOUND`.
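The `lt` outcome above follows from how the status is compared. A minimal sketch, assuming the `cmpi` at `0x824a9940` is a 32-bit signed compare against zero; the function name `takes_lt_branch` is illustrative and not from the codebase:

```rust
/// NTSTATUS value observed in r3 at the failure landing pad.
const STATUS_OBJECT_NAME_NOT_FOUND: u32 = 0xC000_0034;

/// Models `cmpi 0, r3, 0` followed by a branch on `lt`: the 32-bit
/// register value is reinterpreted as signed and compared with zero.
/// (Assumption: the compare is the 32-bit form.)
fn takes_lt_branch(r3: u32) -> bool {
    (r3 as i32) < 0
}
```

Any NTSTATUS with the top bit set (the `0xC...` error class included) takes the branch; `0` (SUCCESS) and small positive values do not.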
### Root-cause chain through sub_824ABD88
The function-detector's `end_address=0x824abe3c` for sub_824ABD88 was a truncation artifact;
the function actually runs to `0x824ac184`. Within that range the `0xC0000034` is **HARDCODED**
at `0x824abea8-0x824abeac`:
```
0x824abe90 bl NtDeviceIoControlFile (FsCtlCode=0x74004, out_buf=r1+160, out_len=16)
0x824abe94 cmpi 0, r3, 0
0x824abe98 bc 12, lt, 0x824abeb8 # if r3 < 0 → failure cleanup (NOT taken; stub returned 0 = success)
0x824abe9c ld r10, 168(r1) # load doubleword from [out_buf+8]
0x824abea0 cmpi cr6, 1, r10, 0 # 64-bit cmp r10 == 0
0x824abea4 bc 4, 4*cr6+eq, 0x824abeb0 # if NOT eq, skip the assignment
0x824abea8 addis r3, r0, 0xC000 # r3 = 0xC0000000
0x824abeac ori r3, r3, 0x34 # r3 = 0xC0000034 (STATUS_OBJECT_NAME_NOT_FOUND)
0x824abeb0 cmpi cr6, 0, r3, 0
0x824abeb4 bc 4, 4*cr6+lt, 0x824abecc # if NOT lt → success path; r3 < 0 → NOT taken
0x824abeb8 or r28, r3, r3 # save 0xC0000034
0x824abebc lwz r3, 96(r1)
0x824abec0 bl NtClose
0x824abec4 or r3, r28, r28 # restore failure status
0x824abec8 b 0x824abe34 # epilogue → return 0xC0000034
```
The game expects the upper 8 bytes of the 16-byte IOCTL response (the doubleword at
`[out_buf+8]`) to be non-zero. Our `NtDeviceIoControlFile` is registered as `stub_success` at
`crates/xenia-kernel/src/exports.rs:90`: it returns 0 (SUCCESS) but writes nothing
into the OUT buffer. The fresh stack frame holds zero at `[r1+168]`, so the `bc` at
`0x824abea4` is not taken and execution falls through to the hardcoded failure assignment.
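The guest-side check in the listing above can be modelled in a few lines. This is a sketch of the disassembly at `0x824abe94`..`0x824abeb4` only, not code from the repository; `gate_status` is a hypothetical name, and the OUT buffer is represented as a plain 16-byte array:

```rust
const STATUS_OBJECT_NAME_NOT_FOUND: u32 = 0xC000_0034;

/// Models the status that sub_824ABD88 carries past the IOCTL call.
/// `ioctl_status` is r3 after the call; `out_buf` is the 16-byte OUT buffer.
fn gate_status(ioctl_status: u32, out_buf: &[u8; 16]) -> u32 {
    // bc 12, lt: a negative status goes straight to failure cleanup.
    if (ioctl_status as i32) < 0 {
        return ioctl_status;
    }
    // ld r10, 168(r1): big-endian doubleword at [out_buf+8].
    let mut upper = [0u8; 8];
    upper.copy_from_slice(&out_buf[8..16]);
    let r10 = u64::from_be_bytes(upper);
    if r10 == 0 {
        // addis/ori pair: the hardcoded failure status.
        STATUS_OBJECT_NAME_NOT_FOUND
    } else {
        ioctl_status
    }
}
```

With the current stub (status 0, buffer untouched and therefore all zero) the model returns `0xC0000034`; setting any byte in `out_buf[8..16]` makes it return the original status.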
### Canary reference
`audit-runs/post-IO-002/canary.log` lines 1196-1209 show canary calling
`NtDeviceIoControlFile(handle, ..., FsCtlCode=0x74004, ..., out_buf, out_len=16)` and
receiving a populated 16-byte response (whose upper 8 bytes are non-zero). It then proceeds
through 17× NtWriteFile zero-fill, NtClose, NtCreateFile (Cache0\), NtQueryVolumeInformationFile
class=3, NtClose, and finally **`XexCheckExecutablePrivilege(0x0000000B)`**, the
priv-11 site that has never fired in our run. That call is immediately followed by
**`XamTaskSchedule(824A93C8, 828A28F0, ...)`**, the gate-pivot call from the
canary-only export hunt.
The IOCTL implementation in canary lives in `xenia-canary/src/xenia/vfs/devices/null_device.{h,cc}`
(`NullDevice::IoControl`) — the device's `IoControl` writes the structured payload
that the game-side check consumes.
### Next session: KRNBUG-IO-003
**Where:** `crates/xenia-kernel/src/exports.rs:90` — replace the
`stub_success` registration with a real `nt_device_io_control_file`.
**Minimum viable fix:** for FsCtlCode=0x74004, write any non-zero u64 at
`[out_buf+8]`. That alone clears the gate.
**Canary-faithful fix:** mirror `NullDevice::IoControl` for FsCtlCodes
`0x70000` (8-byte response, consumed at `sub_824ABD88:0x824abe3c` for a
log2/shift count) and `0x74004` (16-byte response, partition geometry).
Fall through to `STATUS_NOT_IMPLEMENTED` for unrecognized codes so future
divergences surface.
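A rough sketch of the minimum viable variant, under several assumptions: the function name, the slice-based signature, the `PLACEHOLDER_NONZERO` value, and the short-buffer check are all hypothetical and do not reflect the real export signature in `exports.rs` or canary's actual payload. The `0x70000` arm deliberately keeps today's stub behaviour because its payload is not specified in this log.

```rust
const STATUS_SUCCESS: u32 = 0x0000_0000;
const STATUS_NOT_IMPLEMENTED: u32 = 0xC000_0002;
const STATUS_BUFFER_TOO_SMALL: u32 = 0xC000_0023;

/// Placeholder only: any non-zero value satisfies the game-side check.
/// A canary-faithful value would come from NullDevice::IoControl.
const PLACEHOLDER_NONZERO: u64 = 1;

/// Hypothetical handler shape; `out_buf` stands in for guest memory.
fn nt_device_io_control_file_sketch(fs_ctl_code: u32, out_buf: &mut [u8]) -> u32 {
    match fs_ctl_code {
        0x74004 => {
            if out_buf.len() < 16 {
                // Assumption: reject short buffers rather than write partially.
                return STATUS_BUFFER_TOO_SMALL;
            }
            // Guest memory is big-endian; the game reads this with `ld`.
            out_buf[8..16].copy_from_slice(&PLACEHOLDER_NONZERO.to_be_bytes());
            STATUS_SUCCESS
        }
        // Unchanged from the current stub: succeed without writing.
        0x70000 => STATUS_SUCCESS,
        // Surface future divergences instead of silently succeeding.
        _ => STATUS_NOT_IMPLEMENTED,
    }
}
```

Returning `STATUS_NOT_IMPLEMENTED` for unknown codes trades a possible early failure for visibility, which matches the stated goal of letting new divergences show up in the trace.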
**Falsifiable cascade prediction:**
- `XexCheckExecutablePrivilege` count: **1 → 2** (priv=0xA + priv=0xB).
- `XamTaskSchedule` count: **0 → 1**.
- canary-only export count: **7 → ≤ 3**.
- Worker thread spawn at `ExCreateThread(entry=0x82181830, ctx=0x828F3D08)` —
the parked-handle 0x100c producer fires.
- `swaps=2 draws=0` plateau persists (renderer is multi-causal).
**Failure modes to watch for:**
- (α) Re-running `--branch-probe` should show a NEW exit branch in
`sub_824A9710` (one of `0x824a996c`, `0x824a9998`, `0x824a9a18`) if a downstream
helper has its own unimplemented dependency.
- (β) sub_824ABA98's analogous failure path (called at 0x824a9950, 0x824a9990)
may surface if its own kernel-call dependencies are stubs.
- (γ) `nt_write_file` against the synthesized empty-file Cache0 path needs to handle
the 17× zero-fill loop; if our implementation rejects writes to a zero-byte
file, the cascade stalls just past the IOCTL fix.
### Files added / modified (instrumentation only)
- `crates/xenia-kernel/src/state.rs` — added `branch_probe_pcs: HashSet<u32>`
field + `fire_branch_probe_if_match(hw_id)` method emitting a single compact
`BRANCH-PROBE` line per fire (pc, tid, hw, cycle, r3, lr, cr0/cr6). Sister to
`fire_ctor_probe_if_match`; no back-chain walk. ~40 LOC.
- `crates/xenia-app/src/main.rs` — `--branch-probe` CLI flag (env var
`XENIA_BRANCH_PROBE`), parser, and call in `worker_prologue`. ~30 LOC.
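The set-lookup-then-emit shape described in the list above could look roughly like the following. This is an illustrative sketch, not the code in `state.rs` or `main.rs`: `ProbeCtx`, `BranchProbes`, `parse`, and `line_if_match` are hypothetical names, and the cr0/cr6 fields are left out because their output format is not shown in the trace excerpt.

```rust
use std::collections::HashSet;
use std::num::ParseIntError;

/// Hypothetical snapshot of the values a probe line reports.
struct ProbeCtx { pc: u32, tid: u32, hw: u32, cycle: u64, r3: u32, lr: u32 }

/// Hypothetical holder for the requested probe PCs.
struct BranchProbes { pcs: HashSet<u32> }

impl BranchProbes {
    /// Parses a comma-separated list of hex PCs, with or without `0x`.
    fn parse(list: &str) -> Result<BranchProbes, ParseIntError> {
        let mut pcs = HashSet::new();
        for item in list.split(',') {
            let item = item.trim();
            let digits = item.strip_prefix("0x").unwrap_or(item);
            pcs.insert(u32::from_str_radix(digits, 16)?);
        }
        Ok(BranchProbes { pcs })
    }

    /// Returns one compact line when `ctx.pc` is in the request set.
    fn line_if_match(&self, ctx: &ProbeCtx) -> Option<String> {
        if !self.pcs.contains(&ctx.pc) {
            return None;
        }
        Some(format!(
            "BRANCH-PROBE pc={:#010x} tid={} hw={} cycle={} r3={:#010x} lr={:#010x}",
            ctx.pc, ctx.tid, ctx.hw, ctx.cycle, ctx.r3, ctx.lr
        ))
    }
}
```

Keeping the match a single `HashSet` lookup means the cost on non-matching PCs stays small, which matters if the call site ever moves to a hotter path.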
### Probe-machinery limitation
The probe fires only when the **block head** at the matched PC is dispatched —
mid-block PCs in the request set don't trigger because the prologue runs once
per block, not once per instruction. In this trace: function entries, failure
landing pads (`0x824a97e0`), and external-call return PCs (`0x824a9a98`) all
hit. Internal `bc` PCs (`0x824a9944`, `0x824a9958`, ...) were silent. The data
captured was sufficient — the failure landing PC + LR pair uniquely identified
the upstream branch — but if a future audit needs every-branch coverage, the
helper call would need to move from `worker_prologue` into the per-instruction
step loop (or a custom block-scan that flags branches matching the request
list).
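The difference between the two dispatch strategies can be illustrated with a toy model. The block boundaries used in the usage note are assumptions for illustration (the real block layout is not recorded in this log), and `Block`, `hits_block_head`, and `hits_per_instruction` are hypothetical names:

```rust
use std::collections::HashSet;

/// A basic block as a half-open range of 4-byte instructions.
struct Block { start: u32, end: u32 }

/// Current behaviour: the probe check runs once per block, at its head.
fn hits_block_head(blocks: &[Block], probes: &HashSet<u32>) -> Vec<u32> {
    blocks
        .iter()
        .map(|b| b.start)
        .filter(|pc| probes.contains(pc))
        .collect()
}

/// Every-branch coverage: the check runs for each instruction PC.
fn hits_per_instruction(blocks: &[Block], probes: &HashSet<u32>) -> Vec<u32> {
    let mut hits = Vec::new();
    for b in blocks {
        let mut pc = b.start;
        while pc < b.end {
            if probes.contains(&pc) {
                hits.push(pc);
            }
            pc += 4;
        }
    }
    hits
}
```

If a block is assumed to span `0x824a9940`..`0x824a9948`, a probe on `0x824a9944` is silent under the block-head model and fires under the per-instruction model, while a probe on a block head such as `0x824a97e0` fires in both.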
### Trace artifacts (re-runnable)
- `audit-runs/audit-007/sub_824A9710-trace.log` — 5 BRANCH-PROBE lines + thread diagnostics.
- `audit-runs/audit-007/sub_824A9710-trace.err` — full kernel-call trace + counter dump.
- `audit-runs/audit-007/lock_post_branchprobe.json`, `lock_post_branchprobe_run2.json` — lockstep digests.
Re-run command:
```
PROBE_LIST="0x824a9aa0,0x824a9128,0x824a9710,0x824a9778,0x824a9788,0x824a9790,0x824a97dc,0x824a97e0,0x824a9824,0x824a9828,0x824a9840,0x824a9850,0x824a985c,0x824a9870,0x824a9880,0x824a9888,0x824a9918,0x824a9944,0x824a9958,0x824a996c,0x824a9998,0x824a999c,0x824a99a0,0x824a99a8,0x824a9a10,0x824a9a18,0x824a9a60,0x824a9a78,0x824a9a98"
./target/release/xenia-rs exec sylpheed.iso --halt-on-deadlock \
--branch-probe="$PROBE_LIST" -n 500_000_000 \
> audit-runs/audit-007/sub_824A9710-trace.log \
2> audit-runs/audit-007/sub_824A9710-trace.err
```
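For post-processing the resulting trace log, a small parser along these lines may help. It is a sketch that assumes the `key=value` layout seen in the trace excerpt; `ProbeRecord` and `parse_probe_line` are hypothetical names, and unknown keys (for example the cr0/cr6 flags) are ignored rather than interpreted.

```rust
/// One parsed BRANCH-PROBE record.
#[derive(Debug, PartialEq)]
struct ProbeRecord { pc: u32, tid: u32, hw: u32, cycle: u64, r3: u32, lr: u32 }

/// Parses a `0x`-prefixed 32-bit hex value.
fn parse_hex(s: &str) -> Option<u32> {
    u32::from_str_radix(s.strip_prefix("0x")?, 16).ok()
}

/// Returns None for non-probe lines or lines missing a required field.
fn parse_probe_line(line: &str) -> Option<ProbeRecord> {
    let mut tokens = line.split_whitespace();
    if tokens.next()? != "BRANCH-PROBE" {
        return None;
    }
    let mut pc: Option<u32> = None;
    let mut tid: Option<u32> = None;
    let mut hw: Option<u32> = None;
    let mut cycle: Option<u64> = None;
    let mut r3: Option<u32> = None;
    let mut lr: Option<u32> = None;
    for tok in tokens {
        if let Some((key, val)) = tok.split_once('=') {
            match key {
                "pc" => pc = parse_hex(val),
                "tid" => tid = val.parse().ok(),
                "hw" => hw = val.parse().ok(),
                "cycle" => cycle = val.parse().ok(),
                "r3" => r3 = parse_hex(val),
                "lr" => lr = parse_hex(val),
                _ => {} // ignore fields this sketch does not model
            }
        }
    }
    Some(ProbeRecord { pc: pc?, tid: tid?, hw: hw?, cycle: cycle?, r3: r3?, lr: lr? })
}
```

Filtering the parsed records for `(r3 as i32) < 0` is one way to pick out failure landing pads such as the `0x824a97e0` hit without reading the log by hand.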