xenia-rs/audit-runs/phase-c5-NtWriteFile/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Phase C+5 — investigation: `NtWriteFile` at idx=102068
## Divergence
| | canary | ours (pre-fix) |
|---|---|---|
| `payload.return_value` at idx=102068, tid=6→1 | `259 = 0x103` (`STATUS_PENDING`) | `0` (`STATUS_SUCCESS`) |
| `payload.status` | `0x00000103` | `0x00000000` |
| Surrounding context (idx 102060..102067): `RtlEnterCriticalSection`/`RtlLeaveCriticalSection`, then `NtWriteFile` | | |
| Game thread | tid=6 main | tid=1 main |
| Next 6 events (102069..102074) | three more `NtWriteFile` calls, ALL return `0x103` in canary | same calls, ours returns `0` |
## Step 1 — Event context at idx=102068
Both engines emit identical call sequences leading to this point:
```
102016 NtCreateFile (sync; both return 0)
102019 NtReadFile (both return 0)
102022 NtClose
102025 NtCreateFile (sync; both return 0)
102034 NtReadFile (both return 0)
102037 NtWriteFile (both return 0 - sync file)
102040 NtClose
102052 NtOpenFile (both return 0)
102055 NtDeviceIoControlFile (GET_DRIVE_GEOMETRY)
102058 NtDeviceIoControlFile (GET_PARTITION_INFO)
102067 NtWriteFile (canary returns 0x103, ours returns 0) ← divergence
```
The handle written at 102067 was opened by `NtOpenFile` at 102052. Per
canary log `audit-065-canary.log` line for `cache:\,...,00000003`, the
`open_options` passed by the game is `0x00000003` =
`FILE_DIRECTORY_FILE | FILE_WRITE_THROUGH`. **No `FILE_SYNCHRONOUS_IO_*`
bit** — file is async in canary.
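As a quick check of that decoding, here is a minimal sketch. The bit values are the standard NT create-option constants; the helper names `describe_options` and `has_sync_bit` are hypothetical and do not come from either engine.

```rust
// Standard NT create/open option bits.
const FILE_DIRECTORY_FILE: u32 = 0x0000_0001;
const FILE_WRITE_THROUGH: u32 = 0x0000_0002;
const FILE_SYNCHRONOUS_IO_ALERT: u32 = 0x0000_0010;
const FILE_SYNCHRONOUS_IO_NONALERT: u32 = 0x0000_0020;

// Only the four bits discussed in this note are listed.
const KNOWN_BITS: [(u32, &str); 4] = [
    (FILE_DIRECTORY_FILE, "FILE_DIRECTORY_FILE"),
    (FILE_WRITE_THROUGH, "FILE_WRITE_THROUGH"),
    (FILE_SYNCHRONOUS_IO_ALERT, "FILE_SYNCHRONOUS_IO_ALERT"),
    (FILE_SYNCHRONOUS_IO_NONALERT, "FILE_SYNCHRONOUS_IO_NONALERT"),
];

/// Hypothetical helper: names of the known bits set in `options`.
pub fn describe_options(options: u32) -> Vec<&'static str> {
    KNOWN_BITS
        .iter()
        .filter(|entry| (options & entry.0) != 0)
        .map(|entry| entry.1)
        .collect()
}

/// Hypothetical helper: true if either synchronous-IO bit is set.
pub fn has_sync_bit(options: u32) -> bool {
    (options & (FILE_SYNCHRONOUS_IO_ALERT | FILE_SYNCHRONOUS_IO_NONALERT)) != 0
}
```

Calling `describe_options(0x3)` lists only the directory and write-through flags, and `has_sync_bit(0x3)` is false, which is the async case described above.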
## Step 2 — Source-read both engines
### Canary `NtWriteFile_entry` (xboxkrnl_io.cc:304-389)
```cpp
// The write itself completes synchronously: the `if (true || ...)`
// short-circuit at line 327 always takes the sync path.
if (!file->is_synchronous()) {
  result = X_STATUS_PENDING;  // ← line 351-353
}
```
`is_synchronous_` is the bool stored on `XFile`, derived at open time
from `create_options & (FILE_SYNCHRONOUS_IO_ALERT | FILE_SYNCHRONOUS_IO_NONALERT)`
(xboxkrnl_io.cc:94-97 inside `NtCreateFile_entry`). `NtOpenFile_entry`
forwards `open_options` straight into `NtCreateFile_entry`'s
`create_options` slot (xboxkrnl_io.cc:118-122).
So canary's invariant: a file opened **without** bit 0x10 or 0x20 in
its `create_options` is async, and `NtWriteFile` on it returns
`STATUS_PENDING` after the synchronous write completes. The IO_STATUS_BLOCK
still records `STATUS_SUCCESS`; only the function-return value flips.
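The invariant can be modeled roughly as follows. This is a sketch of the behavior described above, not a copy of canary's code: `FileState`, `WriteOutcome`, and `write_outcome` are hypothetical names, and only the successful-write path is modeled.

```rust
const FILE_SYNCHRONOUS_IO_ALERT: u32 = 0x10;
const FILE_SYNCHRONOUS_IO_NONALERT: u32 = 0x20;
const STATUS_SUCCESS: u32 = 0x0000_0000;
const STATUS_PENDING: u32 = 0x0000_0103;

/// Hypothetical stand-in for the sync flag stored on the file object.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FileState {
    pub is_synchronous: bool,
}

impl FileState {
    /// Derive the flag at open time from the create/open options.
    pub fn from_create_options(create_options: u32) -> Self {
        let mask = FILE_SYNCHRONOUS_IO_ALERT | FILE_SYNCHRONOUS_IO_NONALERT;
        FileState {
            is_synchronous: (create_options & mask) != 0,
        }
    }
}

/// The two status values a caller can observe after a write.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WriteOutcome {
    pub return_value: u32,
    pub iosb_status: u32,
}

/// Assumes the write itself succeeded; error paths are not modeled.
pub fn write_outcome(file: FileState) -> WriteOutcome {
    WriteOutcome {
        // Only the function-return value depends on the sync flag.
        return_value: if file.is_synchronous {
            STATUS_SUCCESS
        } else {
            STATUS_PENDING
        },
        // The IO_STATUS_BLOCK records success either way.
        iosb_status: STATUS_SUCCESS,
    }
}
```

For `create_options = 0x3` this yields a return value of `0x103` with an IO_STATUS_BLOCK status of `0`, matching the canary column in the divergence table.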
### Ours `nt_write_file` (exports.rs:1484-1554)
Pre-fix: returns `STATUS_SUCCESS` unconditionally after a successful
write. The `KernelObject::File` enum does not track `is_synchronous`.
### Ours `nt_open_file` (exports.rs:1317-1335)
Pre-fix: reads `open_options` from `ctx.gpr[8]` (= r8). **This is the
wrong register.**
Canary's `NtOpenFile_entry` signature is
```cpp
dword_result_t NtOpenFile_entry(
    lpdword_t handle_out,                              // r3
    dword_t desired_access,                            // r4
    pointer_t<X_OBJECT_ATTRIBUTES> object_attributes,  // r5
    pointer_t<X_IO_STATUS_BLOCK> io_status_block,      // r6
    dword_t open_options);                             // r7
```
**5 args**, so per Xenia's `shim_utils::Param::LoadValue`
(util/shim_utils.h:158-167), the 5th dword (zero-based ordinal 4) arrives in
`r3 + ordinal_ = r7`.
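The register arithmetic can be sketched as below. `GuestContext`, `arg_register`, and `load_dword_arg` are hypothetical names for illustration; the sketch assumes integer arguments occupy r3 through r10 in order and does not model stack-passed arguments.

```rust
/// Hypothetical minimal guest context; only the GPR file is modeled.
pub struct GuestContext {
    pub gpr: [u64; 32],
}

/// GPR index for a zero-based integer argument ordinal, or `None`
/// once the eight argument registers r3..=r10 are exhausted.
pub fn arg_register(ordinal: usize) -> Option<usize> {
    if ordinal < 8 {
        Some(3 + ordinal)
    } else {
        None
    }
}

/// Load a 32-bit argument by ordinal (low 32 bits of the register).
pub fn load_dword_arg(ctx: &GuestContext, ordinal: usize) -> Option<u32> {
    arg_register(ordinal).map(|reg| ctx.gpr[reg] as u32)
}
```

With `open_options` as the fifth parameter, `arg_register(4)` gives register 7, so reading `gpr[8]` picks up whatever the caller last left in r8.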
Live capture (Phase C+5 debug log):
```
nt_open_file: r7=0x3 r8=0x800021 ← cache:\ probe
nt_open_file: r7=0x3 r8=0x800021
nt_open_file: r7=0x3 r8=0x4021
nt_open_file: r7=0x7 r8=0x4040
nt_open_file: r7=0x7 r8=0x4020
```
`r7=0x3` matches canary's logged value exactly. Our `r8` values
(`0x800021`, `0x4021`, `0x4020`, ...) are residuals from prior register
usage. Several of them happen to have the FILE_DIRECTORY_FILE bit (0x01)
set, which is why the AUDIT-053/054 hierarchical-create fix worked at all.
But the **0x20 bit (FILE_SYNCHRONOUS_IO_NONALERT)** was also frequently set
in the r8 residual data (including the `cache:\` probe's `0x800021`), making
the affected NtOpenFile-derived files appear synchronous in ours.
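The bit pattern in the capture can be tallied mechanically. The sketch below copies the five (r7, r8) pairs from the debug log; `CAPTURE` and `count_with` are hypothetical names introduced only for this illustration.

```rust
const FILE_DIRECTORY_FILE: u32 = 0x01;
// FILE_SYNCHRONOUS_IO_ALERT | FILE_SYNCHRONOUS_IO_NONALERT
const SYNC_MASK: u32 = 0x10 | 0x20;

/// (r7, r8) pairs copied from the Phase C+5 debug log.
pub const CAPTURE: [(u32, u32); 5] = [
    (0x3, 0x0080_0021),
    (0x3, 0x0080_0021),
    (0x3, 0x4021),
    (0x7, 0x4040),
    (0x7, 0x4020),
];

/// Count how many values have at least one bit of `mask` set.
pub fn count_with(values: impl Iterator<Item = u32>, mask: u32) -> usize {
    values.filter(|v| (v & mask) != 0).count()
}
```

Applied to this capture, no r7 sample carries a sync bit, while four of the five r8 samples do; the directory bit is present in all five r7 samples but only three of the r8 samples.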
## Step 3 — Classification
**Class (A) — Engine bug, two interlinked defects:**
1. **Wrong-register bug** in `nt_open_file`: reads `open_options` from
r8 instead of r7. Confirmed by canary-side ground truth (r7=0x3
matches canary log `cache:\,…,00000003`).
2. **Missing async/sync tracking** on `KernelObject::File`: even with
correct `open_options`, ours had no machinery to remember the
sync/async state and flip `NtWriteFile` returns.
Both defects must be fixed together to align Phase A's matched prefix
past idx=102068. Fixing only #2 (without #1) leaves the file marked
sync (because r8 has bit 0x20 from residual register usage), so
`NtWriteFile` returns `STATUS_SUCCESS` and the divergence persists —
which we observed in the first fix iteration (matched-prefix stayed
at 102068 after a Path-A fix that relied on r8).
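The interaction between the two defects can be illustrated with a small model of the `cache:\` open. This is a hypothetical sketch, not the code in `exports.rs`: `write_return` and its two flags are invented for the illustration, and the register values are the ones from the first line of the live capture.

```rust
// FILE_SYNCHRONOUS_IO_ALERT | FILE_SYNCHRONOUS_IO_NONALERT
const SYNC_MASK: u32 = 0x10 | 0x20;
const STATUS_SUCCESS: u32 = 0x0000_0000;
const STATUS_PENDING: u32 = 0x0000_0103;

/// Return value of a successful write on the `cache:\` handle, given
/// which of the two fixes is applied.
/// - `fix_register`: read `open_options` from r7 instead of r8 (defect 1).
/// - `track_sync`: remember sync/async state on the file (defect 2).
pub fn write_return(fix_register: bool, track_sync: bool) -> u32 {
    // Register values from the captured `cache:\` probe.
    let (r7, r8) = (0x3u32, 0x0080_0021u32);
    let open_options = if fix_register { r7 } else { r8 };
    // Without tracking, every file behaves as synchronous.
    let is_synchronous = !track_sync || (open_options & SYNC_MASK) != 0;
    if is_synchronous {
        STATUS_SUCCESS
    } else {
        STATUS_PENDING
    }
}
```

Of the four combinations, only `write_return(true, true)` produces `STATUS_PENDING`; tracking alone still sees bit 0x20 in the r8 residual and reports success.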
### Why this is class (A) not (α) canonicalization
Examining events 102069..102074 in canary: three more `NtWriteFile`
calls on the same handle, all returning `STATUS_PENDING`. Then at
idx=102132 canary calls `NtClose`; ours diverges at 102132 by calling
`IoDismountVolumeByFileHandle` instead. **The game branches on the
return value of these writes** — without aligning the return values,
our downstream code path stays divergent. Canonicalization would
mask this, not fix it.
### Tripstone #2 (reading-error #23) check
A "fix the upstream cause" change could in principle flip a CRT
branch. Empirically, after the fix:
- imports counter: 40452 → 40470 (game responded to the new return
values by issuing additional kernel calls — expected for async-IO
semantics).
- main matched prefix: 102068 → **102132 (+64)**. No regression.
- All sub-chains' matched prefixes unchanged.
Reading-error #23 risk DID NOT MATERIALIZE because the new return
values match canary's, so the CRT branches identically downstream.
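For reference, the matched-prefix figure used throughout can be computed as the length of the common prefix of the two event streams. The sketch below assumes a simplified event record; the `Event` struct and `matched_prefix` function are hypothetical and do not reflect the actual trace schema.

```rust
/// Hypothetical, simplified trace event.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub name: &'static str,
    pub return_value: u32,
}

/// Number of leading events that are identical in both traces.
/// The result is also the index of the first divergent event.
pub fn matched_prefix(canary: &[Event], ours: &[Event]) -> usize {
    canary
        .iter()
        .zip(ours.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

Because the comparison includes `return_value`, a call that matches by name but returns `0` instead of `0x103` ends the prefix at that event, which is how the `NtWriteFile` mismatch surfaces.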