Files
xenia-rs/audit-runs/audit-006/canary_export_queue.md
MechaCat02 a1a7265f29 fix(kernel): KRNBUG-IO-003 — NtDeviceIoControlFile real impl mirroring NullDevice::IoControl
Replace the stub_success registration of NtDeviceIoControlFile at
exports.rs:90 with a real handler for FsCtlCodes 0x70000 (drive
geometry) and 0x74004 (partition info), mirroring xenia-canary
xboxkrnl_io.cc:645-678 + null_device.{h,cc}. The 16-byte 0x74004
response with cache_size=0xFF000 at OUT+8 is the gate that lets
sub_824ABD88 return SUCCESS and sub_824A9710 reach the priv-11
XexCheckExecutablePrivilege site identified by KRNBUG-AUDIT-007.

Stack args 9-10 (OutputBuffer, OutputBufferLength) read from the
caller's parameter save area at [sp+0x54] / [sp+0x5C] per the Xbox
360 PowerPC EABI (linkage area sp+0..sp+8, 8-quadword spill area
sp+0x14..sp+0x54, then stack args every 8 bytes). First HLE export
in the codebase to need 9+ args.

Cascade vs. KRNBUG-AUDIT-007 prediction (5/8 held):
- XexCheckExecutablePrivilege count 1 → 2 (priv=0xA + priv=0xB) ✓
- XamTaskSchedule count 0 → 1 ✓
- canary-only exports 7 → 3 (audit predicted ≤3) ✓
- 0x15e0 semaphore signal_attempts 0 → 1 (bonus)
- 0x100c worker spawn DID NOT fire (still UNCREATED) ✗
- 0x1004 signal_attempts unchanged ✗
- Worker spawn count unchanged at 19 ✗

Tests: 592 → 594. Lockstep deterministic at -n 100M (run1 ≡ run2 ≡
run3, byte-identical). instructions=100000010 → 100000019, imports
407417 → 987524 (+2.4×). swaps=2 draws=0 plateau persists.

sylpheed_n50m golden re-baselined instructions=50000004→50000003,
imports=407362→407255. sylpheed_n2m unchanged.

Still canary-only after this fix: ExTerminateThread,
KeReleaseSemaphore, XamUserReadProfileSettings. The next downstream
gate is somewhere past XamTaskSchedule's completion path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:00:12 +02:00

17 KiB
Raw Blame History

Canary-Only Export Fix Queue (audit-006)

  • Status: PARTIAL CASCADE (2026-05-04, post-KRNBUG-IO-003). 7 → 3 canary-only exports. NtDeviceIoControlFile real impl landed; the priv-11 query (XexCheckExecutablePrivilege(0xB)) and XamTaskSchedule now fire. Reclassified (now firing on our side): KeResetEvent, ObCreateSymbolicLink, XamTaskCloseHandle, XamTaskSchedule. Bonus pickups: XeCryptSha, XeKeysConsolePrivateKeySign (both 0→1 — were not on the canary-only list because they were already in ours_exports but unreached). Still canary-only: ExTerminateThread, KeReleaseSemaphore, XamUserReadProfileSettings. Worker thread spawn count unchanged at 19; handle 0x100c remains UNCREATED. See KRNBUG-IO-003 in audit-findings.md and project_xenia_rs_io_003_ioctl_2026_05_04.md.

  • Prior status (now superseded): SUPERSEDED by AUDIT-007 (2026-05-04). Real gate identified: NtDeviceIoControlFile (FsCtlCode=0x74004) is stub_success at crates/xenia-kernel/src/exports.rs:90. Game-side sub_824ABD88:0x824abea8-ac reads [out_buf+8] of the IOCTL response, finds zero (stub doesn't write OUT), assigns hardcoded 0xC0000034 (STATUS_OBJECT_NAME_NOT_FOUND); caller sub_824A9710 exits at 0x824a9944 before priv-11. Tier 4 entries remain parked, classification unchanged (still REAL_BUT_UNREACHED), awaiting KRNBUG-IO-003. See project_xenia_rs_audit_007_branch_probe_2026_05_04.md for the runtime trace + decisive proof.

  • Prior status: PARTIAL — KRNBUG-IO-002 landed, but predicted cascade did NOT fire (7 → 7). Tier 0 marked superseded; Tier 4 entries STILL parked. Re-audit needed to find the real upstream gate.

  • Pre-state: master HEAD 556a8c3, exports diff captured 2026-05-04

  • Post-IO-002 state: branch xboxkrnl-vol-allocunit/p0-65536-cluster, fresh 500 M trace at audit-runs/post-IO-002/. Canary-only kernel exports remain identical: {ExTerminateThread, KeReleaseSemaphore, KeResetEvent, ObCreateSymbolicLink, XamTaskCloseHandle, XamTaskSchedule, XamUserReadProfileSettings}.

  • Inputs:

    • canary.log (348720 B, identical to audit-005 oracle, canary build 9467c77f0)
    • ours.log (692 MB, 5.6 M trace lines, run at 17:2017:21 today, post-IO-001)
  • Tooling: diff.py + plain comm -23 set-difference on extracted call names

Headline finding

7/7 canary-only entries classify as REAL_BUT_UNREACHED or STUB_BUT_UNREACHED. Per the audit-006 spec stop condition ("if two-thirds of entries are REAL_BUT_UNREACHED, the problem isn't stubs — it's an upstream gate"), the next session should NOT pull a Tier-1 entry from this queue. Instead, it should fix the gate.

The gate is KRNBUG-IO-002: our nt_query_volume_information_file class-3 (FileFsSizeInformation) returns alloc_unit = 1 × 2048 = 2048, but Sylpheed's main(1, 0x10000, 0xFF000) expects alloc_unit = 65536 (see project_xenia_rs_io_nullfile_2026_05_04.md). Sylpheed's verifier sub_824ABA98 rejects 2048, propagates failure to sub_824A9710, which exits early before its XexCheckExecutablePrivilege(0xB) call site. Canary fires the priv-11 query and the entire downstream cluster (XamTaskSchedule → Cache0 callback thread → 0x100c worker spawn → display-init pump → profile-settings cascade); we fire none of it.

Direct evidence (telemetry):

  • Our XexCheckExecutablePrivilege count = 1 (priv=0xA only). Canary count = 2 (priv=0xA + priv=0xB).
  • All 7 canary-only entries have ours-side count = 0 at -n 500M.
  • Our trace ends with main thread (hw=0) parked on XNotifyGetNext + NtWaitForSingleObjectEx(0x10f4, lr=0x824ac578) and hw=1 parked on NtWaitForMultipleObjectsEx(lr=0x824ab214) + cs=0x828f3e70 — classic post-cache-recreate spin.
  • The 44 NtWriteFile calls in ours.log (cache zero-fill) are followed by more NtClose / NtCreateFile cycles, but XexCheckExecutablePrivilege(0xB) never fires → priv-11 site in sub_824A9710 is unreached.
  • Memory's predicted 0xC000014F does not yet appear in ours.log; first cache-related error is 0xC0000034 (OBJECT_NAME_NOT_FOUND) from lr=0x824a97e4. This still fits the gate hypothesis: the recreate path is reached, completes its writes, re-opens, queries volume info, and the game-side verifier rejects our reply silently (no kernel error).

Tier 0 — upstream gate (SUPERSEDED 2026-05-04 — fix landed but cascade did NOT fire)

KRNBUG-IO-002 — nt_query_volume_information_file block size — LANDED, gate hypothesis FALSIFIED

Outcome: the block-size literals at exports.rs:1255-1256 were corrected to canary's NullDevice values (sectors_per_unit=0x80, bytes_per_sector=0x200, product 0x10000). 591 → 592 tests, lockstep instructions=100000010, swaps=2, draws=0 deterministic across two reruns (audit-runs/post-IO-002/lock_n100m_run{1,2}.json). sylpheed_n50m oracle still matches its existing golden (no observable change at -n 50M).

However, the predicted cascade DID NOT fire. Set-difference on a fresh 500 M trace (audit-runs/post-IO-002/ours.log) produces the identical seven-entry canary-only set audit-006 captured pre-fix:

ExTerminateThread, KeReleaseSemaphore, KeResetEvent,
ObCreateSymbolicLink, XamTaskCloseHandle, XamTaskSchedule,
XamUserReadProfileSettings

XexCheckExecutablePrivilege count remains 1 (priv=0xA only, priv=0xB unreached). XamTaskSchedule count remains 0. Worker thread spawns fell from 19 → 18 (within noise — single thread variance per call-site breakdown: lr=0x824ac5f0×15 + 0x824cd984×1 + 0x824d2e68×2). The 16 NtQueryVolumeInformationFile call sites in ours.log all originate from a single LR 0x82611f38 — meaning the audit-006 premise that sub_824ABA98/sub_824A9710 consume the volume-info reply at the priv-11 gate may be incorrect, or the gate consumes a different information class entirely.

Stop-condition triggered. Per the IO-002 task brief, this session does not pivot to a second fix. The fix is correct (it makes our reply byte-identical to canary's NullDevice and survives every test we have); it is just not load-bearing for the priv-11 gate. The branch landed as a strict no-op at our current boot horizon — kept because it's correct and unblocks no regression.

Next-session next gate hypothesis (untested):

  • The audit-005 disasm of sub_824ABA98 may have mis-attributed the consumer of bytes_per_sector. The IO-001 trace decisively located the failure at the NtReadFile inside sub_824A9710, not at any volume-info site. Re-read the sub_824A9710 disasm with that in mind.
  • Volume-info LR 0x82611f38 is far downstream of the priv-10/priv-11 cluster (the calls complete successfully — they don't gate anything visible). The actual gate may be nt_query_information_file, nt_query_full_attributes_file, an FsCtl IOCTL, or a different alloc-unit query path.
  • Per AUDIT-005 instrumentation, the priv-11 site at sub_824A9710 PC cluster has never fired in any session. Probe sub_824A9710 entry with --pc-probe and trace which conditional exits the function before the priv-11 query — that's the real gate.

KRNBUG-IO-002 — nt_query_volume_information_file block size (original spec, kept for archaeology)

  • Where in our code: crates/xenia-kernel/src/exports.rs:1241-1269 (function nt_query_volume_information_file).
  • Classification: REAL_BUT_BUGGY. Registered at exports.rs:100, called 16× in ours.log (16× in canary.log too — call counts match), returns STATUS_SUCCESS, but the FileFsSizeInformation payload is wrong.
  • Bug: class=3 branch writes (total=0x100000, free=0, sectors_per_unit=1, bytes_per_sector=2048). Product = 2048 bytes per cluster.
  • Canary reference:
    • Entry function NtQueryVolumeInformationFile_entry at xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_io_info.cc:323 (case XFileFsSizeInformation at lines 355365). Canary delegates to per-device methods on file->device().
    • NullDevice (the device backing \Device\Harddisk0\Cache0) returns sectors_per_allocation_unit() = 0x80 and bytes_per_sector() = 0x200 at xenia-canary/src/xenia/vfs/devices/null_device.h:38-46. Product = 65536, matching Sylpheed's expectation.
    • Other device backings (HostPath, DiscImage, DiscZArchive) all return (1, 0x200) = 512. Sylpheed's first volume query at this site is against Cache0 (NullDevice), so the relevant value is 65536.
  • Fix sketch (minimum): in the class=3 branch, change the two writes to mem.write_u32(info+16, 128); mem.write_u32(info+20, 512); (and reduce TotalAllocationUnits accordingly so the disk size remains plausible — e.g. 0x10 units × 128 sectors × 512 bytes ≈ 1 MB, matching NullDevice). Total diff ≤ 4 lines.
  • Fix sketch (proper, deferred until the cluster fires reliably): introduce a per-handle device-info lookup so HostPath / DiscImage paths return their canary-correct values too. Skipped for now because Sylpheed only queries Cache0 at this gate.
  • Expected observable post-fix:
    • XexCheckExecutablePrivilege count: 1 → 2.
    • XamTaskSchedule count: 0 → 1 (callback=0x824A93C8, message=0x828A28F0).
    • kernel.calls{XamTaskSchedule} finally non-zero — closes the APUBUG-PRODUCER-001 / XAMBUG-PRODUCER-001 producer hunt that falsified XAudio + XamTaskSchedule producer hypotheses.
    • Spawn of the Cache0 callback thread (XThread::Execute thid 7 in canary, our equivalent to come).
    • Inside that thread: StfsCreateDevice (still undefined extern in canary too — does not block) + ObCreateSymbolicLink + ExRegisterTitleTerminateNotification.
    • Back on main: KeResetEvent(0x8287094C), NtCreateEvent, ExCreateThread(entry=0x82181830, ctx=0x828F3D08) — and 0x82181830 is the worker entry for dispatcher 0x100c, one of the four parked-handle producers (per project_xenia_rs_producer_stack_trace_2026_05_03.md). Spawning that worker should advance handle 0x100c's signal_attempts counter off zero.
    • Eventually (further into the boot): XamUserGetXUID, XamUserReadProfileSettings, XamContentCreateEnumerator, KeReleaseSemaphore display-pump (268+ calls in canary at this horizon).
  • Risk: low. Two-line value change. NullDevice is the only device Sylpheed asks about at this gate; other devices are not yet hit.
  • Effort: trivial.
  • Dependencies: none. Land directly.
  • Verification chain: cargo test -p xenia-kernel, then cargo run --release -p xenia-app -- exec sylpheed.iso -n 500_000_000 with kernel-call tracing on, then re-run audit-006's set-difference; expect canary-only count to drop from 7 toward 0 as the cluster fires.

Tier 4 — REAL_BUT_UNREACHED / STUB_BUT_UNREACHED — do not fix yet

These are downstream of Tier 0. Reachability is blocked on KRNBUG-IO-002 landing. After IO-002 lands, re-derive this list — most entries should have moved off, and any survivors will be classifiable on real evidence.

# Export Ordinal Library Our state Canary impl Canary calls (at horizon) Cascade rank
1 XamTaskSchedule 0x01AF xam REAL_BUT_UNREACHED (xam_task_schedule, xam.rs:213) xam_task.cc:43-80 1 (gate-pivot call) upstream-of-cluster — fires the entire post-IO-002 cascade
2 XamTaskCloseHandle 0x01B1 xam STUB_BUT_UNREACHED (stub_success, xam.rs:33) xam_task.cc:83-93 (one-liner: NtClose + last-error) 1 low (cleanup after #1)
3 KeResetEvent 0x8F xboxkrnl REAL_BUT_UNREACHED (ke_reset_event, exports.rs:3172) xboxkrnl_threading.cc:566 1 medium — clears 0x8287094C right before ExCreateThread(0x82181830) on main
4 ObCreateSymbolicLink 0x0103 xboxkrnl STUB_BUT_UNREACHED (stub_success, exports.rs:121) xboxkrnl_ob.cc:351 1 low — Cache0-symlink registration; cosmetic for Sylpheed boot
5 KeReleaseSemaphore 0x88 xboxkrnl REAL_BUT_UNREACHED (ke_release_semaphore, exports.rs:3280) xboxkrnl_threading.cc:724 268 high (in volume) — display-init pump on the post-cluster main loop
6 ExTerminateThread 0x19 xboxkrnl REAL_BUT_UNREACHED (ex_terminate_thread, exports.rs:312) xboxkrnl_threading.cc:173 2 low — thread cleanup on Cache0 / profile threads
7 XamUserReadProfileSettings 0x0219 xam REAL_BUT_UNREACHED (xam_user_read_profile_settings, xam.rs:327) xam_user.cc:329 2 medium — gates the XamUserGetXUID → profile load flow far downstream

Why every entry above is Tier 4 (not Tier 1):

  • Each entry's first call in canary.log falls after line 1210 (XamTaskSchedule(824A93C8, ...)), which is the gate-pivot call.
  • Our trace contains zero of any of the seven, despite running 500 M instructions and reaching the post-cache-recreate horizon.
  • Six of the seven are already real implementations. The two stubs (XamTaskCloseHandle, ObCreateSymbolicLink) are minor cleanups; even upgrading them would not move boot progress until #1 (XamTaskSchedule) fires.
  • Therefore: fixing any of these in isolation is wasted effort. They should be re-classified after KRNBUG-IO-002 lands and the priv-11 / Cache0 callback chain runs.

Tier 1 / 2 / 3 — empty for this audit

No entry qualifies as Tier 1 or Tier 2 in the current state. The single high-cascade fix worth pulling next is the Tier-0 gate (KRNBUG-IO-002), which is not itself a canary-only export — it's a wrong-value bug in an export both sides call, so the diff.py based set-difference doesn't surface it. That is exactly why audit-006 was scoped this way: to confirm the gate hypothesis from project_xenia_rs_io_nullfile_2026_05_04.md before another implementation session is started.


Cross-check vs IO-001 snapshot

IO-001 memory recorded these 7 still-canary-only exports:

ExTerminateThread, KeReleaseSemaphore, KeResetEvent, ObCreateSymbolicLink, XamTaskCloseHandle, XamTaskSchedule, XamUserReadProfileSettings.

Audit-006 set-difference produces the identical 7, in 1:1 correspondence. No new canary-only export has appeared since IO-001 landed; no entry has moved off. Cascade is still parked at the same gate.

The XeCryptSha, XeKeysConsolePrivateKeySign, and NtDeviceIoControlFile entries that IO-001 was credited with unblocking are confirmed: ours calls them 1, 1, 2 times respectively (canary calls them 1, 1, 2 — exact match). They are correctly off the canary-only list.


Methodology notes

  1. "Cascade rank" definition: estimated by where the export's first canary call falls in the boot sequence and how many downstream code paths depend on it. "high" = upstream-of-cluster (XamTaskSchedule). "medium" = intermediate (KeResetEvent, profile cascade). "low" = leaf cleanup or cosmetic (XamTaskCloseHandle, ObCreateSymbolicLink). Rank only matters once Tier 0 is landed; until then everything is parked.
  2. Reachability oracle: binary grep -c "call=NAME" against ours.log at -n 500M. Zero counts are conclusive for "unreached" because tracing is unconditional.
  3. Canary log freshness: the log is from 17:34 (3 h before this audit) but is byte-identical to audit-005's input — canary's behavior is deterministic given the same ROM and the canary build header (canary_experimental@9467c77f0 on May 2 2026) hasn't changed. Re-running through Lutris is unnecessary.
  4. Gate confirmation: memory predicted block-size mismatch as the IO-002 blocker; this audit confirmed it by eliminating the alternative (no Tier-1-eligible canary-only export exists in the current 7-entry list). The 0xC000014F status memory predicted is not yet visible in ours.log because the recreate path completes the writes — the verifier inside sub_824ABA98 rejects the volume-info reply at the game level (no kernel error logged).
  5. What this queue is not: a list of fixes to land. The audit-006 discipline was scoping; the discipline of subsequent sessions is to re-run audit-006's diff after IO-002, then either close audit-006 (if the cluster fires through and all 7 entries drop) or open audit-007 on whatever new canary-only set surfaces.

KRNBUG-IO-002 (block-size fix), one-shot. Two-line edit at crates/xenia-kernel/src/exports.rs:1255-1256. Verify the cluster fires by re-running audit-006's set-difference; expect 7 → 0 (or close to 0) canary-only entries. If new entries surface in either direction, that's audit-007's input.

Do not open this queue's Tier 4 entries before IO-002 closes. Their classification is pending; their fix sketches will look very different once they're observably called and their actual return values can be compared to canary.