handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions

@@ -0,0 +1,140 @@
# Phase C+5 — investigation: `NtWriteFile` at idx=102068
## Divergence
| | canary | ours (pre-fix) |
|---|---|---|
| `payload.return_value` at idx=102068, tid=6→1 | `259 = 0x103` (`STATUS_PENDING`) | `0` (`STATUS_SUCCESS`) |
| `payload.status` | `0x00000103` | `0x00000000` |
| Surrounding context (idx 102060..102067): `RtlEnterCriticalSection`/`RtlLeaveCriticalSection`, then `NtWriteFile` | | |
| Game thread | tid=6 main | tid=1 main |
| Next 6 events (102069..102074) | three more `NtWriteFile` calls, ALL return `0x103` in canary | same calls, ours returns `0` |
## Step 1 — Event context at idx=102068
Both engines emit identical call sequences leading to this point:
```
102016 NtCreateFile (sync; both return 0)
102019 NtReadFile (both return 0)
102022 NtClose
102025 NtCreateFile (sync; both return 0)
102034 NtReadFile (both return 0)
102037 NtWriteFile (both return 0 - sync file)
102040 NtClose
102052 NtOpenFile (both return 0)
102055 NtDeviceIoControlFile (GET_DRIVE_GEOMETRY)
102058 NtDeviceIoControlFile (GET_PARTITION_INFO)
102067 NtWriteFile (canary returns 0x103, ours returns 0) ← divergence
```
The handle written at 102067 was opened by `NtOpenFile` at 102052. Per
canary log `audit-065-canary.log` line for `cache:\,...,00000003`, the
`open_options` passed by the game is `0x00000003` =
`FILE_DIRECTORY_FILE | FILE_WRITE_THROUGH`. **No `FILE_SYNCHRONOUS_IO_*`
bit** — file is async in canary.
## Step 2 — Source-read both engines
### Canary `NtWriteFile_entry` (xboxkrnl_io.cc:304-389)
```cpp
// Write completes synchronously (the `if (true || ...)` short-circuit
// at line 327 always takes the sync path).
if (!file->is_synchronous()) {
  result = X_STATUS_PENDING;  // ← line 351-353
}
```
`is_synchronous_` is the bool stored on `XFile`, derived at open time
from `create_options & (FILE_SYNCHRONOUS_IO_ALERT | FILE_SYNCHRONOUS_IO_NONALERT)`
(xboxkrnl_io.cc:94-97 inside `NtCreateFile_entry`). `NtOpenFile_entry`
forwards `open_options` straight into `NtCreateFile_entry`'s
`create_options` slot (xboxkrnl_io.cc:118-122).
So canary's invariant: a file opened **without** bit 0x10 or 0x20 in
its `create_options` is async, and `NtWriteFile` on it returns
`STATUS_PENDING` after the synchronous write completes. The IO_STATUS_BLOCK
still records `STATUS_SUCCESS`; only the function-return value flips.
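
The invariant above can be sketched in Rust as follows. This is a minimal illustration, not the actual fix: `FileState`, `WriteOutcome`, and `complete_write` are hypothetical names, and only the flag and status values (0x10, 0x20, 0x103) come from the notes above.

```rust
// NT open/create option bits that mark a file as synchronous.
const FILE_SYNCHRONOUS_IO_ALERT: u32 = 0x10;
const FILE_SYNCHRONOUS_IO_NONALERT: u32 = 0x20;

const STATUS_SUCCESS: u32 = 0x0000_0000;
const STATUS_PENDING: u32 = 0x0000_0103;

/// Hypothetical per-file state; a stand-in for whatever the real
/// `KernelObject::File` would need to remember.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FileState {
    pub is_synchronous: bool,
}

impl FileState {
    /// Derive the sync flag once, at open time, from `create_options`.
    pub fn from_create_options(create_options: u32) -> Self {
        let sync_mask = FILE_SYNCHRONOUS_IO_ALERT | FILE_SYNCHRONOUS_IO_NONALERT;
        FileState {
            is_synchronous: (create_options & sync_mask) != 0,
        }
    }
}

/// What the guest observes after a write that already completed.
#[derive(Debug, PartialEq, Eq)]
pub struct WriteOutcome {
    /// Value returned from the export (the function-return value).
    pub return_value: u32,
    /// Status stored in the IO_STATUS_BLOCK.
    pub iosb_status: u32,
}

/// Only the return value flips for async files; the IOSB still records
/// success because the write itself finished synchronously.
pub fn complete_write(file: FileState) -> WriteOutcome {
    WriteOutcome {
        return_value: if file.is_synchronous {
            STATUS_SUCCESS
        } else {
            STATUS_PENDING
        },
        iosb_status: STATUS_SUCCESS,
    }
}
```

With `open_options = 0x3` neither sync bit is set, so under this sketch the write reports `STATUS_PENDING` as its return value while the IOSB status stays `STATUS_SUCCESS`.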
### Ours `nt_write_file` (exports.rs:1484-1554)
Pre-fix: returns `STATUS_SUCCESS` unconditionally after a successful
write. The `KernelObject::File` variant does not track `is_synchronous`.
### Ours `nt_open_file` (exports.rs:1317-1335)
Pre-fix: reads `open_options` from `ctx.gpr[8]` (= r8). **This is the
wrong register.**
Canary's `NtOpenFile_entry` signature is
```cpp
dword_result_t NtOpenFile_entry(
    lpdword_t handle_out,                               // r3
    dword_t desired_access,                             // r4
    pointer_t<X_OBJECT_ATTRIBUTES> object_attributes,   // r5
    pointer_t<X_IO_STATUS_BLOCK> io_status_block,       // r6
    dword_t open_options);                              // r7
```
**5 args**, so per Xenia's `shim_utils::Param::LoadValue`
(util/shim_utils.h:158-167), the 5th dword (zero-based `ordinal_` = 4)
arrives in `r3 + ordinal_` = r7.
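
The register arithmetic can be sketched as below. `GuestContext`, `arg_register`, and `dword_arg` are hypothetical names for illustration; the sketch assumes the usual PowerPC convention that the first eight integer arguments travel in r3..r10, and it does not model arguments that spill to the stack.

```rust
/// Hypothetical guest CPU context holding only the general-purpose registers.
pub struct GuestContext {
    pub gpr: [u64; 32],
}

/// Register index carrying the zero-based `ordinal`-th integer argument.
/// Returns `None` once arguments would no longer fit in r3..r10.
pub fn arg_register(ordinal: usize) -> Option<usize> {
    if ordinal < 8 {
        Some(3 + ordinal)
    } else {
        None
    }
}

/// Read the `ordinal`-th argument as a 32-bit dword (low half of the GPR).
pub fn dword_arg(ctx: &GuestContext, ordinal: usize) -> Option<u32> {
    arg_register(ordinal).map(|reg| ctx.gpr[reg] as u32)
}
```

For `NtOpenFile`, `open_options` is the fifth argument, so `arg_register(4)` yields register 7. Reading ordinal 5 instead lands on r8, which is the pre-fix behaviour.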
Live capture (Phase C+5 debug log):
```
nt_open_file: r7=0x3 r8=0x800021 ← cache:\ probe
nt_open_file: r7=0x3 r8=0x800021
nt_open_file: r7=0x3 r8=0x4021
nt_open_file: r7=0x7 r8=0x4040
nt_open_file: r7=0x7 r8=0x4020
```
`r7=0x3` matches canary's logged value exactly. Our `r8` values
(`0x800021`, `0x4021`, `0x4020`, ...) are residuals from prior register
usage. Some of them (`0x800021`, `0x4021`) happen to have the
FILE_DIRECTORY_FILE bit (0x01) set, which is why the AUDIT-053/054
hierarchical-create fix worked at all. But the
**0x20 bit (FILE_SYNCHRONOUS_IO_NONALERT)** was also frequently set in
r8 residual data, making NtOpenFile-derived files appear synchronous in
ours whenever that bit was present.
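
As a quick cross-check of the captured values, the sketch below applies the sync-bit mask to each `(r7, r8)` pair from the debug log. The `CAPTURES` table is copied from the log lines above; the helper names are hypothetical.

```rust
/// FILE_SYNCHRONOUS_IO_ALERT (0x10) | FILE_SYNCHRONOUS_IO_NONALERT (0x20).
const SYNC_MASK: u32 = 0x10 | 0x20;

/// (r7, r8) pairs copied from the Phase C+5 debug log.
pub const CAPTURES: [(u32, u32); 5] = [
    (0x3, 0x80_0021),
    (0x3, 0x80_0021),
    (0x3, 0x4021),
    (0x7, 0x4040),
    (0x7, 0x4020),
];

/// True if the value, interpreted as `open_options`, marks the file sync.
pub fn looks_synchronous(options: u32) -> bool {
    (options & SYNC_MASK) != 0
}

/// Count captures where r7 says "async" but the r8 residual says "sync",
/// i.e. where reading the wrong register flips the classification.
pub fn misclassified_as_sync() -> usize {
    CAPTURES
        .iter()
        .filter(|&&(r7, r8)| !looks_synchronous(r7) && looks_synchronous(r8))
        .count()
}
```

In this five-line sample none of the r7 values carries a sync bit, while four of the five r8 residuals do (`0x4040` is the exception).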
## Step 3 — Classification
**Class (A) — Engine bug, two interlinked defects:**
1. **Wrong-register bug** in `nt_open_file`: reads `open_options` from
r8 instead of r7. Confirmed by canary-side ground truth (r7=0x3
matches canary log `cache:\,…,00000003`).
2. **Missing async/sync tracking** on `KernelObject::File`: even with
correct `open_options`, ours had no machinery to remember the
sync/async state and flip `NtWriteFile` returns.
Both defects must be fixed together to align Phase A's matched prefix
past idx=102068. Fixing only #2 (without #1) leaves the file marked
sync (because r8 has bit 0x20 from residual register usage), so
`NtWriteFile` returns `STATUS_SUCCESS` and the divergence persists —
which we observed in the first fix iteration (matched-prefix stayed
at 102068 after a Path-A fix that relied on r8).
### Why this is class (A) not (α) canonicalization
In canary, events 102069..102074 contain three more `NtWriteFile`
calls on the same handle, all returning `STATUS_PENDING`. Then at
idx=102132 canary calls `NtClose`, whereas ours diverges at 102132 by
calling `IoDismountVolumeByFileHandle` instead. **The game branches on
the return value of these writes**: until the return values are
aligned, our downstream code path stays divergent. Canonicalization
would mask this, not fix it.
### Tripstone #2 (reading-error #23) check
A "fix the upstream cause" change could in principle flip a CRT
branch. Empirically, after the fix:
- imports counter: 40452 → 40470 (game responded to the new return
values by issuing additional kernel calls — expected for async-IO
semantics).
- main matched prefix: 102068 → **102132 (+64)**. No regression.
- All sub-chains' matched prefixes unchanged.
Reading-error #23 risk DID NOT MATERIALIZE because the new return
values match canary's, so the CRT branches identically downstream.
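
For readers unfamiliar with the metric, here is a minimal sketch of a matched-prefix computation. It assumes "matched prefix" means the number of leading events that compare equal between the two traces; the `(name, return_value)` event shape is hypothetical and much simpler than the real trace records.

```rust
/// Number of leading events that are identical in both traces.
/// The result is also the index of the first divergent event, if any.
pub fn matched_prefix<T: PartialEq>(reference: &[T], candidate: &[T]) -> usize {
    reference
        .iter()
        .zip(candidate.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

With events modelled as `(name, return_value)` tuples, a trace pair that differs only in the return value of the second event yields a matched prefix of 1, which is the same shape as the `0x103` versus `0` divergence described here.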