Files
xenia-rs/audit-runs/audit-006/canary_export_queue.md
MechaCat02 a1a7265f29 fix(kernel): KRNBUG-IO-003 — NtDeviceIoControlFile real impl mirroring NullDevice::IoControl
Replace the stub_success registration of NtDeviceIoControlFile at
exports.rs:90 with a real handler for FsCtlCodes 0x70000 (drive
geometry) and 0x74004 (partition info), mirroring xenia-canary
xboxkrnl_io.cc:645-678 + null_device.{h,cc}. The 16-byte 0x74004
response with cache_size=0xFF000 at OUT+8 is the gate that lets
sub_824ABD88 return SUCCESS and sub_824A9710 reach the priv-11
XexCheckExecutablePrivilege site identified by KRNBUG-AUDIT-007.

Stack args 9-10 (OutputBuffer, OutputBufferLength) read from the
caller's parameter save area at [sp+0x54] / [sp+0x5C] per the Xbox
360 PowerPC EABI (linkage area sp+0..sp+8, 8-quadword spill area
sp+0x14..sp+0x54, then stack args every 8 bytes). First HLE export
in the codebase to need 9+ args.

Cascade vs. KRNBUG-AUDIT-007 prediction (5/8 held):
- XexCheckExecutablePrivilege count 1 → 2 (priv=0xA + priv=0xB) ✓
- XamTaskSchedule count 0 → 1 ✓
- canary-only exports 7 → 3 (audit predicted ≤3) ✓
- 0x15e0 semaphore signal_attempts 0 → 1 (bonus)
- 0x100c worker spawn DID NOT fire (still UNCREATED) ✗
- 0x1004 signal_attempts unchanged ✗
- Worker spawn count unchanged at 19 ✗

Tests: 592 → 594. Lockstep deterministic at -n 100M (run1 ≡ run2 ≡
run3, byte-identical). instructions=100000010 → 100000019, imports
407417 → 987524 (+2.4×). swaps=2 draws=0 plateau persists.

sylpheed_n50m golden re-baselined instructions=50000004→50000003,
imports=407362→407255. sylpheed_n2m unchanged.

Still canary-only after this fix: ExTerminateThread,
KeReleaseSemaphore, XamUserReadProfileSettings. The next downstream
gate is somewhere past XamTaskSchedule's completion path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:00:12 +02:00

274 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Canary-Only Export Fix Queue (audit-006)
- Status: **PARTIAL CASCADE (2026-05-04, post-KRNBUG-IO-003). 7 → 3 canary-only exports.** `NtDeviceIoControlFile` real impl landed; the priv-11 query (`XexCheckExecutablePrivilege(0xB)`) and `XamTaskSchedule` now fire. **Reclassified (now firing on our side):** `KeResetEvent`, `ObCreateSymbolicLink`, `XamTaskCloseHandle`, `XamTaskSchedule`. **Bonus pickups:** `XeCryptSha`, `XeKeysConsolePrivateKeySign` (both 0→1 — were not on the canary-only list because they were already in `ours_exports` but unreached). **Still canary-only:** `ExTerminateThread`, `KeReleaseSemaphore`, `XamUserReadProfileSettings`. Worker thread spawn count unchanged at 19; handle 0x100c remains UNCREATED. See KRNBUG-IO-003 in `audit-findings.md` and `project_xenia_rs_io_003_ioctl_2026_05_04.md`.
- Prior status (now superseded): **SUPERSEDED by AUDIT-007 (2026-05-04). Real gate identified: `NtDeviceIoControlFile` (FsCtlCode=0x74004) is `stub_success` at `crates/xenia-kernel/src/exports.rs:90`. Game-side `sub_824ABD88:0x824abea8-ac` reads `[out_buf+8]` of the IOCTL response, finds zero (stub doesn't write OUT), assigns hardcoded `0xC0000034` (STATUS_OBJECT_NAME_NOT_FOUND); caller `sub_824A9710` exits at `0x824a9944` before priv-11. Tier 4 entries remain parked, classification unchanged (still REAL_BUT_UNREACHED), awaiting KRNBUG-IO-003. See `project_xenia_rs_audit_007_branch_probe_2026_05_04.md` for the runtime trace + decisive proof.**
- Prior status: **PARTIAL — KRNBUG-IO-002 landed, but predicted cascade did NOT fire (7 → 7). Tier 0 marked superseded; Tier 4 entries STILL parked. Re-audit needed to find the real upstream gate.**
- Pre-state: master HEAD `556a8c3`, exports diff captured 2026-05-04
- Post-IO-002 state: branch `xboxkrnl-vol-allocunit/p0-65536-cluster`, fresh 500 M trace at `audit-runs/post-IO-002/`. Canary-only kernel exports remain identical: `{ExTerminateThread, KeReleaseSemaphore, KeResetEvent, ObCreateSymbolicLink, XamTaskCloseHandle, XamTaskSchedule, XamUserReadProfileSettings}`.
- Inputs:
- `canary.log` (348720 B, identical to audit-005 oracle, canary build `9467c77f0`)
- `ours.log` (692 MB, 5.6 M trace lines, run at 17:2017:21 today, post-IO-001)
- Tooling: `diff.py` + plain `comm -23` set-difference on extracted call names
## Headline finding
**7/7 canary-only entries classify as REAL_BUT_UNREACHED or STUB_BUT_UNREACHED.**
Per the audit-006 spec stop condition ("if two-thirds of entries are
REAL_BUT_UNREACHED, the problem isn't stubs — it's an upstream gate"),
the next session should **NOT** pull a Tier-1 entry from this queue.
Instead, it should fix the gate.
The gate is **KRNBUG-IO-002**: our `nt_query_volume_information_file`
class-3 (FileFsSizeInformation) returns alloc_unit = 1 × 2048 = 2048,
but Sylpheed's `main(1, 0x10000, 0xFF000)` expects alloc_unit = 65536
(see `project_xenia_rs_io_nullfile_2026_05_04.md`).
Sylpheed's verifier `sub_824ABA98` rejects 2048, propagates failure to
`sub_824A9710`, which exits early before its `XexCheckExecutablePrivilege(0xB)`
call site. Canary fires the priv-11 query *and* the entire downstream
cluster (`XamTaskSchedule` → Cache0 callback thread → 0x100c worker spawn
→ display-init pump → profile-settings cascade); we fire none of it.
Direct evidence (telemetry):
- Our `XexCheckExecutablePrivilege` count = **1** (priv=0xA only).
Canary count = **2** (priv=0xA + priv=0xB).
- All 7 canary-only entries have ours-side count = **0** at -n 500M.
- Our trace ends with main thread (hw=0) parked on `XNotifyGetNext +
NtWaitForSingleObjectEx(0x10f4, lr=0x824ac578)` and hw=1 parked on
`NtWaitForMultipleObjectsEx(lr=0x824ab214) + cs=0x828f3e70` —
classic post-cache-recreate spin.
- The 44 `NtWriteFile` calls in ours.log (cache zero-fill) are followed by
more NtClose / NtCreateFile cycles, but `XexCheckExecutablePrivilege(0xB)`
never fires → priv-11 site in `sub_824A9710` is unreached.
- Memory's predicted `0xC000014F` does not yet appear in ours.log; first
cache-related error is `0xC0000034` (OBJECT_NAME_NOT_FOUND) from
`lr=0x824a97e4`. This still fits the gate hypothesis: the recreate path
is reached, completes its writes, re-opens, queries volume info, and
the *game-side* verifier rejects our reply silently (no kernel error).
---
## Tier 0 — upstream gate (SUPERSEDED 2026-05-04 — fix landed but cascade did NOT fire)
### KRNBUG-IO-002 — `nt_query_volume_information_file` block size — **LANDED, gate hypothesis FALSIFIED**
**Outcome:** the block-size literals at `exports.rs:1255-1256` were corrected
to canary's NullDevice values (`sectors_per_unit=0x80, bytes_per_sector=0x200`,
product `0x10000`). 591 → 592 tests, lockstep `instructions=100000010, swaps=2,
draws=0` deterministic across two reruns (`audit-runs/post-IO-002/lock_n100m_run{1,2}.json`).
sylpheed_n50m oracle still matches its existing golden (no observable change at -n 50M).
**However, the predicted cascade DID NOT fire.** Set-difference on a fresh
500 M trace (`audit-runs/post-IO-002/ours.log`) produces the **identical**
seven-entry canary-only set audit-006 captured pre-fix:
```
ExTerminateThread, KeReleaseSemaphore, KeResetEvent,
ObCreateSymbolicLink, XamTaskCloseHandle, XamTaskSchedule,
XamUserReadProfileSettings
```
`XexCheckExecutablePrivilege` count remains **1** (priv=0xA only, priv=0xB
unreached). `XamTaskSchedule` count remains **0**. Worker thread spawns
fell from 19 → 18 (within noise — single thread variance per call-site
breakdown: `lr=0x824ac5f0×15 + 0x824cd984×1 + 0x824d2e68×2`). The 16
NtQueryVolumeInformationFile call sites in `ours.log` all originate from
a single LR `0x82611f38` — meaning the `audit-006` premise that
`sub_824ABA98`/`sub_824A9710` consume the volume-info reply at the
priv-11 gate may be **incorrect**, or the gate consumes a *different*
information class entirely.
**Stop-condition triggered.** Per the IO-002 task brief, this session does
not pivot to a second fix. The fix is correct (it makes our reply
byte-identical to canary's NullDevice and survives every test we have);
it is just not load-bearing for the priv-11 gate. The branch landed as a
strict no-op at our current boot horizon — kept because it's correct and
unblocks no regression.
**Next-session next gate hypothesis (untested):**
- The audit-005 disasm of `sub_824ABA98` may have mis-attributed the consumer
of bytes_per_sector. The IO-001 trace decisively located the failure at
the `NtReadFile` inside `sub_824A9710`, not at any volume-info site.
Re-read the `sub_824A9710` disasm with that in mind.
- Volume-info LR `0x82611f38` is far downstream of the priv-10/priv-11
cluster (the calls *complete* successfully — they don't gate anything
visible). The actual gate may be `nt_query_information_file`,
`nt_query_full_attributes_file`, an FsCtl IOCTL, or a different
alloc-unit query path.
- Per AUDIT-005 instrumentation, the priv-11 site at `sub_824A9710` PC
cluster has **never fired** in any session. Probe `sub_824A9710` entry
with `--pc-probe` and trace which conditional exits the function before
the priv-11 query — that's the real gate.
---
### KRNBUG-IO-002 — `nt_query_volume_information_file` block size (original spec, kept for archaeology)
- **Where in our code:** `crates/xenia-kernel/src/exports.rs:1241-1269` (function
`nt_query_volume_information_file`).
- **Classification:** `REAL_BUT_BUGGY`. Registered at exports.rs:100, called
16× in ours.log (16× in canary.log too — call counts match), returns
`STATUS_SUCCESS`, but the FileFsSizeInformation payload is wrong.
- **Bug:** class=3 branch writes `(total=0x100000, free=0,
sectors_per_unit=1, bytes_per_sector=2048)`. Product = 2048 bytes per
cluster.
- **Canary reference:**
- Entry function `NtQueryVolumeInformationFile_entry` at
`xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_io_info.cc:323` (case
`XFileFsSizeInformation` at lines 355365). Canary delegates to per-device
methods on `file->device()`.
- `NullDevice` (the device backing `\Device\Harddisk0\Cache0`) returns
`sectors_per_allocation_unit() = 0x80` and `bytes_per_sector() = 0x200`
at `xenia-canary/src/xenia/vfs/devices/null_device.h:38-46`. Product =
65536, matching Sylpheed's expectation.
- Other device backings (HostPath, DiscImage, DiscZArchive) all return
`(1, 0x200)` = 512. Sylpheed's first volume query at this site is
against Cache0 (NullDevice), so the relevant value is 65536.
- **Fix sketch (minimum):** in the class=3 branch, change the two writes to
`mem.write_u32(info+16, 128); mem.write_u32(info+20, 512);` (and reduce
TotalAllocationUnits accordingly so the disk size remains plausible —
e.g. 0x10 units × 128 sectors × 512 bytes ≈ 1 MB, matching NullDevice).
Total diff ≤ 4 lines.
- **Fix sketch (proper, deferred until the cluster fires reliably):**
introduce a per-handle device-info lookup so HostPath / DiscImage paths
return their canary-correct values too. Skipped for now because Sylpheed
only queries Cache0 at this gate.
- **Expected observable post-fix:**
- `XexCheckExecutablePrivilege` count: 1 → 2.
- `XamTaskSchedule` count: 0 → 1 (callback=0x824A93C8, message=0x828A28F0).
- `kernel.calls{XamTaskSchedule}` finally non-zero — closes the
APUBUG-PRODUCER-001 / XAMBUG-PRODUCER-001 producer hunt that
falsified XAudio + XamTaskSchedule producer hypotheses.
- Spawn of the Cache0 callback thread (XThread::Execute thid 7 in
canary, our equivalent to come).
- Inside that thread: `StfsCreateDevice` (still undefined extern in
canary too — does not block) + `ObCreateSymbolicLink` +
`ExRegisterTitleTerminateNotification`.
- Back on main: `KeResetEvent(0x8287094C)`, `NtCreateEvent`,
`ExCreateThread(entry=0x82181830, ctx=0x828F3D08)` — and `0x82181830`
is the worker entry for **dispatcher 0x100c**, one of the four
parked-handle producers (per
`project_xenia_rs_producer_stack_trace_2026_05_03.md`). Spawning
that worker should advance handle 0x100c's `signal_attempts`
counter off zero.
- Eventually (further into the boot): `XamUserGetXUID`,
`XamUserReadProfileSettings`, `XamContentCreateEnumerator`,
`KeReleaseSemaphore` display-pump (268+ calls in canary at this
horizon).
- **Risk:** low. Two-line value change. NullDevice is the only device
Sylpheed asks about at this gate; other devices are not yet hit.
- **Effort:** trivial.
- **Dependencies:** none. Land directly.
- **Verification chain:** `cargo test -p xenia-kernel`,
then `cargo run --release -p xenia-app -- exec sylpheed.iso -n 500_000_000`
with kernel-call tracing on, then re-run audit-006's set-difference;
expect canary-only count to drop from 7 toward 0 as the cluster fires.
---
## Tier 4 — REAL_BUT_UNREACHED / STUB_BUT_UNREACHED — do not fix yet
These are downstream of Tier 0. Reachability is blocked on KRNBUG-IO-002
landing. After IO-002 lands, re-derive this list — most entries should
have moved off, and any survivors will be classifiable on real evidence.
| # | Export | Ordinal | Library | Our state | Canary impl | Canary calls (at horizon) | Cascade rank |
|---|--------|---------|---------|-----------|-------------|---------------------------|--------------|
| 1 | `XamTaskSchedule` | 0x01AF | xam | REAL_BUT_UNREACHED (`xam_task_schedule`, xam.rs:213) | `xam_task.cc:43-80` | 1 (gate-pivot call) | upstream-of-cluster — fires the entire post-IO-002 cascade |
| 2 | `XamTaskCloseHandle` | 0x01B1 | xam | STUB_BUT_UNREACHED (`stub_success`, xam.rs:33) | `xam_task.cc:83-93` (one-liner: `NtClose` + last-error) | 1 | low (cleanup after #1) |
| 3 | `KeResetEvent` | 0x8F | xboxkrnl | REAL_BUT_UNREACHED (`ke_reset_event`, exports.rs:3172) | `xboxkrnl_threading.cc:566` | 1 | medium — clears 0x8287094C right before ExCreateThread(0x82181830) on main |
| 4 | `ObCreateSymbolicLink` | 0x0103 | xboxkrnl | STUB_BUT_UNREACHED (`stub_success`, exports.rs:121) | `xboxkrnl_ob.cc:351` | 1 | low — Cache0-symlink registration; cosmetic for Sylpheed boot |
| 5 | `KeReleaseSemaphore` | 0x88 | xboxkrnl | REAL_BUT_UNREACHED (`ke_release_semaphore`, exports.rs:3280) | `xboxkrnl_threading.cc:724` | 268 | high (in volume) — display-init pump on the post-cluster main loop |
| 6 | `ExTerminateThread` | 0x19 | xboxkrnl | REAL_BUT_UNREACHED (`ex_terminate_thread`, exports.rs:312) | `xboxkrnl_threading.cc:173` | 2 | low — thread cleanup on Cache0 / profile threads |
| 7 | `XamUserReadProfileSettings` | 0x0219 | xam | REAL_BUT_UNREACHED (`xam_user_read_profile_settings`, xam.rs:327) | `xam_user.cc:329` | 2 | medium — gates the `XamUserGetXUID → profile load` flow far downstream |
**Why every entry above is Tier 4 (not Tier 1):**
- Each entry's first call in `canary.log` falls **after** line 1210
(`XamTaskSchedule(824A93C8, ...)`), which is the gate-pivot call.
- Our trace contains zero of any of the seven, despite running 500 M
instructions and reaching the post-cache-recreate horizon.
- Six of the seven are already real implementations. The two stubs
(`XamTaskCloseHandle`, `ObCreateSymbolicLink`) are minor cleanups; even
upgrading them would not move boot progress until #1 (`XamTaskSchedule`)
fires.
- Therefore: fixing any of these in isolation is wasted effort. They
should be re-classified after KRNBUG-IO-002 lands and the priv-11 /
Cache0 callback chain runs.
---
## Tier 1 / 2 / 3 — empty for this audit
No entry qualifies as Tier 1 or Tier 2 in the current state. The single
high-cascade fix worth pulling next is the Tier-0 gate (KRNBUG-IO-002),
which is **not itself a canary-only export** — it's a wrong-value bug in
an export both sides call, so the diff.py based set-difference doesn't
surface it. That is exactly why audit-006 was scoped this way: to confirm
the gate hypothesis from `project_xenia_rs_io_nullfile_2026_05_04.md`
before another implementation session is started.
---
## Cross-check vs IO-001 snapshot
IO-001 memory recorded these 7 still-canary-only exports:
> ExTerminateThread, KeReleaseSemaphore, KeResetEvent, ObCreateSymbolicLink,
> XamTaskCloseHandle, XamTaskSchedule, XamUserReadProfileSettings.
Audit-006 set-difference produces the **identical** 7, in 1:1
correspondence. No new canary-only export has appeared since IO-001
landed; no entry has moved off. Cascade is still parked at the same gate.
The `XeCryptSha`, `XeKeysConsolePrivateKeySign`, and `NtDeviceIoControlFile`
entries that IO-001 was credited with unblocking are confirmed: ours
calls them 1, 1, 2 times respectively (canary calls them 1, 1, 2 — exact
match). They are correctly off the canary-only list.
---
## Methodology notes
1. **"Cascade rank" definition:** estimated by where the export's first
canary call falls in the boot sequence and how many downstream code
paths depend on it. "high" = upstream-of-cluster (XamTaskSchedule).
"medium" = intermediate (KeResetEvent, profile cascade).
"low" = leaf cleanup or cosmetic (XamTaskCloseHandle, ObCreateSymbolicLink).
Rank only matters once Tier 0 is landed; until then everything is parked.
2. **Reachability oracle:** binary `grep -c "call=NAME"` against ours.log
at -n 500M. Zero counts are conclusive for "unreached" because tracing
is unconditional.
3. **Canary log freshness:** the log is from 17:34 (3 h before this
audit) but is byte-identical to audit-005's input — canary's behavior
is deterministic given the same ROM and the canary build header
(`canary_experimental@9467c77f0 on May 2 2026`) hasn't changed.
Re-running through Lutris is unnecessary.
4. **Gate confirmation:** memory predicted block-size mismatch as the
IO-002 blocker; this audit confirmed it by eliminating the alternative
(no Tier-1-eligible canary-only export exists in the current 7-entry
list). The 0xC000014F status memory predicted is not yet visible in
ours.log because the recreate path completes the writes — the
verifier inside `sub_824ABA98` rejects the volume-info reply at the
game level (no kernel error logged).
5. **What this queue is *not*:** a list of fixes to land. The audit-006
discipline was scoping; the discipline of subsequent sessions is to
re-run audit-006's diff after IO-002, then either close audit-006 (if
the cluster fires through and all 7 entries drop) or open audit-007
on whatever new canary-only set surfaces.
---
## Recommended next session
**KRNBUG-IO-002 (block-size fix), one-shot.** Two-line edit at
`crates/xenia-kernel/src/exports.rs:1255-1256`. Verify the cluster fires
by re-running audit-006's set-difference; expect 7 → 0 (or close to 0)
canary-only entries. If new entries surface in either direction, that's
audit-007's input.
**Do not** open this queue's Tier 4 entries before IO-002 closes. Their
classification is pending; their fix sketches will look very different
once they're observably called and their actual return values can be
compared to canary.