Files
xenia-rs/audit-runs/iterate-2F-vdswap-drain-fix/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

224 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Iterate 2.F — VdSwap drain fix (writer report)
**Date:** 2026-05-27. **LOC delta:** engine **+15 / -2** (1 file, 2 effective
numeric literal changes), canary **0**. **Tests:** xenia-gpu 149 PASS,
xenia-kernel 226 PASS, ZERO regressions.
## Headline
**FIX-PARTIAL-CASCADE.** VdSwap kernel.return latency drops **900.04 ms → 1.03 ms**
(~876× improvement, single-gate PASS). Determinism preserved across 3 cold runs.
But downstream cascade gates (b)/(c)/(d)/(e) are **unchanged** — the 900 ms
inline-drain was NOT the upstream timing gate for the iterate-2D 28 missing
(op, lr) tuples or the tid=13 wedge. After the fix, ours still wedges at the
same set of guest PCs (tid=1@0x824ac578, tid=13@0x824ac578); the wedge just
arrives ~840 ms earlier in wallclock.
## Mode detected
**Threaded** (M1.9 default, `crates/xenia-app/src/main.rs:1090-1096`). Both
the `Inline` and `Threaded` (worker-side) backends had a **900 ms internal
drain deadline**, so the same fix was applied to both call sites. The original
hypothesis (Inline path) was correct in spirit; in practice the same numeric
deadline lived on the Threaded worker (handle.rs:563) and that was the one
the test invocation hit. The CPU side's `recv_timeout(1s)` was the outer
wrapper; the worker's `Duration::from_millis(900)` was the actual ceiling.
## Patch
File: `xenia-rs/crates/xenia-gpu/src/handle.rs`
| site | line | before | after |
|------|------|--------|-------|
| Inline drain | 393 | `Duration::from_millis(900)` | `Duration::from_millis(1)` |
| Threaded worker drain | 563 | `Duration::from_millis(900)` | `Duration::from_millis(1)` |
Plus 12 LOC of inline comments documenting the iterate-2F intent. `git diff --stat`:
`crates/xenia-gpu/src/handle.rs | 19 +++++++++++++++++--`, **17 insertions / 2 deletions**,
under the 20-LOC hard cap.
`exports.rs:4218`'s call to `drain_to_current_wptr` was NOT modified
(prompt scope: avoid stripping the drain). The `GPUBUG-FETCH-PATCH-001`
slot-0 comment was NOT touched (out of scope).
## Cascade gate results
### (a) VdSwap kernel.return latency
| run | call host_ns | return host_ns | delta | status |
|-----|--------------|---------------|-------|--------|
| c23 baseline (pre-fix) | 489,685,332 | 1,389,721,914 | **900.04 ms** | baseline |
| i2f run-1 (-n 50M) | 522,924,748 | 523,952,196 | **1.03 ms** | **PASS** |
| i2f run-2 (-n 500M) | 571,370,654 | 572,397,252 | **1.03 ms** | **PASS** |
Target was <1 ms; landed at 1.03 ms. The remaining ~30 µs above the 1 ms
deadline is `is_ready`-loop overhead + `sync_with_mmio` + reply-channel
hop; not material vs canary's 6.6 µs since the CPU side proceeds immediately.
**Gate (a): PASS.**
### (b) Missing (op, lr) tuples (iterate-2D method)
IAT-thunk LR trace (`--lr-trace=0x8284DDDC,0x8284E49C,0x8284DF5C,0x8284E07C`,
90 s wallclock timeout):
| | events | distinct (op,lr) | digest |
|---|--:|--:|---|
| i2d baseline (pre-fix, 2026-05-21) | 153 | 19 | 21,448 B |
| i2f post-fix (2026-05-27) | 153 | 19 | 21,448 B (bit-identical content) |
Diff of sorted JSONL between baseline and i2f shows only sub-microsecond
guest-cycle jitter on individual lines (e.g. `cycle=6350123 vs 6350130`);
every (pc, tid, lr, r3, r4, r5, r6) tuple is identical. **28 missing-in-ours
tuples count: UNCHANGED at 28. Gate (b): FAIL.**
### (c) Thread set (entry_pc, start_ctx) tuples
Both c23 and i2f end-of-run dumps list the same 13 ours threads (tids 0-13).
No new thread spawned that wasn't there pre-fix. Notably, the post-swap
worker fan-out from `sub_825070F0` (which would spawn the four workers at
canary tids 15/27/28 etc.) does **not** fire in i2f either — the workers
still don't materialize. **Gate (c): FAIL** (no analog for canary tids 15/27/28).
### (d) Producer-rate at LR 0x824AB168
LR 0x824AB168 fires per i2f IAT trace: **90** (same as i2d baseline).
Canary baseline: 903. **Ratio: 90/903 = 9.97% UNCHANGED. Gate (d): FAIL.**
### (e) tid=1 wedge timestamp
`--halt-on-deadlock` -n 500M post-fix produces an end-of-run blocked-thread
dump structurally identical to c23's pre-fix dump:
| | tid=1 PC | tid=1 LR | tid=1 wait handle | tid=13 PC | tid=13 wait handle |
|---|---|---|---|---|---|
| c23 (pre-fix) | 0x824ac578 | 0x824ac578 | 0x12C8 (thread handle) | 0x824ac578 | 0x12D0 (event handle) |
| i2f (post-fix) | 0x824ac578 | 0x824ac578 | 0x1210 (thread handle, alloc-order shifted) | 0x824ac578 | 0x1218 (event handle, alloc-order shifted) |
Same wedge PC, same wait-class (single handle), only the handle numeric
ID shifts due to allocator order change (reading-error #28 absorbs this).
Wedge wallclock: ~810 ms (i2f) vs ~1,648 ms (c23) — the wedge arrives
**earlier** because the 900 ms VdSwap stall is gone, but it still
arrives. **Gate (e): NEUTRAL/PARTIAL** — wedge moved but is not absent.
Tripstone #40: this is a single-keystone "wedge timestamp" gate that
is moved but not eliminated — does not justify a single-keystone follow-up
claim.
## Determinism check (gate gate)
3 cold `check --stable-digest -n 50000000` runs against the ISO:
| run | instructions | imports | swaps | draws | unique_RTs |
|-----|-------------:|-------:|-----:|-----:|-----------:|
| 1 | 50,000,000 | 39,290 | 1 | 0 | 0 |
| 2 | 50,000,000 | 39,290 | 1 | 0 | 0 |
| 3 | 50,000,000 | 39,290 | 1 | 0 | 0 |
Bit-identical across 3 runs. Pre-fix c23 baseline had `imports=40,388` and
`swaps=1`; i2f has `imports=39,290` and `swaps=1`. The drop in imports is
the predictable consequence of the same 50M-instruction budget finishing
faster wallclock — fewer kernel-import calls fit in the budget because
each instruction now does less wait-time-skip. **NOT a regression** — the
swap count is preserved at 1, draws stays at 0 (Sylpheed's pre-existing
draws=0 limitation; out of scope).
Phase B image hash NOT measured (no phase_b_snapshot_dir flag set on
this run), but the patch does not touch any image-loading path.
## Confidence: did this fix the root cause?
**MEDIUM-LOW.** The patch decisively kills the 900 ms VdSwap stall — that
hypothesis (gate a) is no longer in dispute. But the predicted downstream
cascade (gates b/c/d/e) does NOT follow. Two implications:
1. The 900 ms inline-drain was a **real timing wart** but NOT the
upstream timing gate for the iterate-2D producer-rate divergence.
Removing it frees ~840 ms of tid=1 wall-time, yet the cascade
(workers spawn → producers fire → tid=13 wait satisfied) still
does not engage.
2. The real blocker is **downstream**: per Review A Step 1 (2026-05-27),
force-spawning the 4 workers under `--force-spawn-workers` makes
them fault on unmapped guest VA `0xBCE25640` at `[ctx+44]`.
That ctx-state-installer bug is unaffected by VdSwap drain
latency. Until the ctx for the post-swap workers is correctly
initialized, no amount of main-thread headroom causes those
workers to spawn naturally — the spawn path itself depends on
game-side state (the AUDIT-068 ANON_Class install epoch at
host_ns ≈ 9.4 s, per the canary trace) that ours never reaches.
The fix is **not** inert — it removes a real and substantial host-side
performance gate (a 900 ms blocking call per swap on the CPU thread is
indefensible vs canary's 6.6 µs). It just doesn't break the cascade
predicted by the iterate-2E framing. The framing was too optimistic.
## Tripstone audit
- **#28** (per-engine tid stability): handle.IDs allowed to shift between
c23 and i2f, wedge comparison done on PC + wait-class, not raw ID.
- **#39** (composite progression metric): the only metric improved is
VdSwap latency (a host-side property, not a guest-progression metric).
swaps stays at 1, draws at 0. **No claim of "progression"** is made.
- **#40** (single-keystone framing): explicitly checked. The single
keystone "VdSwap-inline-drain is the upstream blocker" is **FALSIFIED**
by the gate (b)/(c)/(d) failures. The fix is retained on its own merits
(VdSwap latency is a real wart) but does not unblock the cascade.
## Next iterate recommendation
**NOT** a single-keystone follow-up. Two parallel, independent angles:
1. **0xBCE25640 ctx-state installer** (HIGH confidence root cause for the
worker-spawn cascade). Per AUDIT-068 Session 4, the writer is guest
PPC code at `sub_824FD240+0x24` (PC `0x824FD264`); per AUDIT-068
Session 3, the install epoch is host_ns ≈ 9.4 s on canary, well after
ours's wedge at ~810 ms. The question is **what guest path leads to
sub_824FD240**, and which prior kernel-call sequence in [0, 9.4 s] on
canary is absent in ours. This is the natural successor to iterate-2D
§Step 3's 1.3 s upstream timing skew finding.
2. **VdSwap drain still has a small (~1 ms) host-side blocking call.**
Canary's VdSwap returns in 6.6 µs — three orders of magnitude faster.
The remaining gap is the `recv_timeout` + worker's `is_ready` loop
overhead. A follow-up could remove the `DrainFence` entirely in the
Threaded path (worker is already draining continuously in its own
loop; the synchronous fence is a vestigial belt-and-braces from M1.4).
~5-10 LOC. LOW priority — gate (a) is already PASS at the target
threshold.
The iterate-2F retention question (revert if FIX-INERT) is **NO** — keep
the patch. The 900 ms VdSwap stall was a real performance wart with
non-progression cascade consequences (it inflated host wallclock by
~2× without doing useful guest work). Keeping the fix lowers test
turnaround for downstream iterates investigating the real upstream
cause (the 0xBCE25640 chain).
## Artifacts
Under `xenia-rs/audit-runs/iterate-2F-vdswap-drain-fix/`:
- `ours-cold.jsonl` (118,149 events, 50M-instr run, phase-a log)
- `ours-cold-long.jsonl` (118,149 events, 500M-instr run — same wedge state)
- `ours-i2f-iat-trace.jsonl` (153 events, bit-identical to i2d baseline)
- `ours-i2f-halt.stderr.log` (post-fix run with deadlock dump active —
shows sound.p04 NtReadFile progress through 90s)
- `digest-{1,2,3}.json` (3× bit-identical `check --stable-digest`
determinism check)
- `writer-report.md` (this file)
## Cascade roll-up
| gate | description | result |
|------|-------------|--------|
| Patch LOC ≤ 20 | hard cap | PASS (15 LOC net) |
| Build clean | warnings only, no errors | PASS |
| xenia-gpu tests | no regression | PASS (149/149) |
| xenia-kernel tests | no regression | PASS (226/226) |
| Determinism | 3 cold runs bit-identical | PASS |
| (a) VdSwap latency <1 ms | 900 ms → 1.03 ms | **PASS** |
| (b) missing (op,lr) tuples <28 | 28 → 28 | **FAIL** |
| (c) ours analogs for canary tids 15/27/28 | 0 → 0 | **FAIL** |
| (d) producer-rate at 0x824AB168 >9.97% | 9.97% → 9.97% | **FAIL** |
| (e) tid=1 wedge moved/absent | wedge earlier, same PC | NEUTRAL |
**Outcome class: FIX-PARTIAL-CASCADE.** Single-gate fix lands cleanly,
broader cascade does not follow. Patch retained.