Files

MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-05 07:19:08 +02:00

11 KiB

Raw Blame History

Iterate 2.F — VdSwap drain fix (writer report)

Date: 2026-05-27. LOC delta: engine +15 / -2 (1 file, 2 effective numeric literal changes), canary 0. Tests: xenia-gpu 149 PASS, xenia-kernel 226 PASS, ZERO regressions.

Headline

FIX-PARTIAL-CASCADE. VdSwap kernel.return latency drops 900.04 ms → 1.03 ms (~876× improvement, single-gate PASS). Determinism preserved across 3 cold runs. But downstream cascade gates (b)/(c)/(d)/(e) are unchanged — the 900 ms inline-drain was NOT the upstream timing gate for the iterate-2D 28 missing (op, lr) tuples or the tid=13 wedge. After the fix, ours still wedges at the same set of guest PCs (tid=1@0x824ac578, tid=13@0x824ac578); the wedge just arrives ~840 ms earlier in wallclock.

Mode detected

Threaded (M1.9 default, crates/xenia-app/src/main.rs:1090-1096). Both the Inline and Threaded (worker-side) backends had a 900 ms internal drain deadline, so the same fix was applied to both call sites. The original hypothesis (Inline path) was correct in spirit; in practice the same numeric deadline lived on the Threaded worker (handle.rs:563) and that was the one the test invocation hit. The CPU side's recv_timeout(1s) was the outer wrapper; the worker's Duration::from_millis(900) was the actual ceiling.

Patch

File: xenia-rs/crates/xenia-gpu/src/handle.rs

site	line	before	after
Inline drain	393	`Duration::from_millis(900)`	`Duration::from_millis(1)`
Threaded worker drain	563	`Duration::from_millis(900)`	`Duration::from_millis(1)`

Plus 12 LOC of inline comments documenting the iterate-2F intent. git diff --stat: crates/xenia-gpu/src/handle.rs | 19 +++++++++++++++++--, 17 insertions / 2 deletions, under the 20-LOC hard cap.

exports.rs:4218's call to drain_to_current_wptr was NOT modified (prompt scope: avoid stripping the drain). The GPUBUG-FETCH-PATCH-001 slot-0 comment was NOT touched (out of scope).

Cascade gate results

(a) VdSwap kernel.return latency

run	call host_ns	return host_ns	delta	status
c23 baseline (pre-fix)	489,685,332	1,389,721,914	900.04 ms	baseline
i2f run-1 (-n 50M)	522,924,748	523,952,196	1.03 ms	PASS
i2f run-2 (-n 500M)	571,370,654	572,397,252	1.03 ms	PASS

Target was <1 ms; landed at 1.03 ms. The remaining ~30 µs above the 1 ms deadline is is_ready-loop overhead + sync_with_mmio + reply-channel hop; not material vs canary's 6.6 µs since the CPU side proceeds immediately. Gate (a): PASS.

(b) Missing (op, lr) tuples (iterate-2D method)

IAT-thunk LR trace (--lr-trace=0x8284DDDC,0x8284E49C,0x8284DF5C,0x8284E07C, 90 s wallclock timeout):

	events	distinct (op,lr)	digest
i2d baseline (pre-fix, 2026-05-21)	153	19	21,448 B
i2f post-fix (2026-05-27)	153	19	21,448 B (bit-identical content)

Diff of sorted JSONL between baseline and i2f shows only sub-microsecond guest-cycle jitter on individual lines (e.g. cycle=6350123 vs 6350130); every (pc, tid, lr, r3, r4, r5, r6) tuple is identical. 28 missing-in-ours tuples count: UNCHANGED at 28. Gate (b): FAIL.

(c) Thread set (entry_pc, start_ctx) tuples

Both c23 and i2f end-of-run dumps list the same 13 ours threads (tids 0-13). No new thread spawned that wasn't there pre-fix. Notably, the post-swap worker fan-out from sub_825070F0 (which would spawn the four workers at canary tids 15/27/28 etc.) does not fire in i2f either — the workers still don't materialize. Gate (c): FAIL (no analog for canary tids 15/27/28).

(d) Producer-rate at LR 0x824AB168

LR 0x824AB168 fires per i2f IAT trace: 90 (same as i2d baseline). Canary baseline: 903. Ratio: 90/903 = 9.97% UNCHANGED. Gate (d): FAIL.

(e) tid=1 wedge timestamp

--halt-on-deadlock -n 500M post-fix produces an end-of-run blocked-thread dump structurally identical to c23's pre-fix dump:

	tid=1 PC	tid=1 LR	tid=1 wait handle	tid=13 PC	tid=13 wait handle
c23 (pre-fix)	0x824ac578	0x824ac578	0x12C8 (thread handle)	0x824ac578	0x12D0 (event handle)
i2f (post-fix)	0x824ac578	0x824ac578	0x1210 (thread handle, alloc-order shifted)	0x824ac578	0x1218 (event handle, alloc-order shifted)

Same wedge PC, same wait-class (single handle), only the handle numeric ID shifts due to allocator order change (reading-error #28 absorbs this). Wedge wallclock: ~810 ms (i2f) vs ~1,648 ms (c23) — the wedge arrives earlier because the 900 ms VdSwap stall is gone, but it still arrives. Gate (e): NEUTRAL/PARTIAL — wedge moved but is not absent. Tripstone #40: this is a single-keystone "wedge timestamp" gate that is moved but not eliminated — does not justify a single-keystone follow-up claim.

Determinism check (gate gate)

3 cold check --stable-digest -n 50000000 runs against the ISO:

run	instructions	imports	swaps
1	50,000,000	39,290	1
2	50,000,000	39,290	1
3	50,000,000	39,290	1

Bit-identical across 3 runs. Pre-fix c23 baseline had imports=40,388 and swaps=1; i2f has imports=39,290 and swaps=1. The drop in imports is the predictable consequence of the same 50M-instruction budget finishing faster wallclock — fewer kernel-import calls fit in the budget because each instruction now does less wait-time-skip. NOT a regression — the swap count is preserved at 1, draws stays at 0 (Sylpheed's pre-existing draws=0 limitation; out of scope).

Phase B image hash NOT measured (no phase_b_snapshot_dir flag set on this run), but the patch does not touch any image-loading path.

Confidence: did this fix the root cause?

MEDIUM-LOW. The patch decisively kills the 900 ms VdSwap stall — that hypothesis (gate a) is no longer in dispute. But the predicted downstream cascade (gates b/c/d/e) does NOT follow. Two implications:

The 900 ms inline-drain was a real timing wart but NOT the upstream timing gate for the iterate-2D producer-rate divergence. Removing it frees ~840 ms of tid=1 wall-time, yet the cascade (workers spawn → producers fire → tid=13 wait satisfied) still does not engage.
The real blocker is downstream: per Review A Step 1 (2026-05-27), force-spawning the 4 workers under --force-spawn-workers makes them fault on unmapped guest VA 0xBCE25640 at [ctx+44]. That ctx-state-installer bug is unaffected by VdSwap drain latency. Until the ctx for the post-swap workers is correctly initialized, no amount of main-thread headroom causes those workers to spawn naturally — the spawn path itself depends on game-side state (the AUDIT-068 ANON_Class install epoch at host_ns ≈ 9.4 s, per the canary trace) that ours never reaches.

The fix is not inert — it removes a real and substantial host-side performance gate (a 900 ms blocking call per swap on the CPU thread is indefensible vs canary's 6.6 µs). It just doesn't break the cascade predicted by the iterate-2E framing. The framing was too optimistic.

Tripstone audit

#28 (per-engine tid stability): handle.IDs allowed to shift between c23 and i2f, wedge comparison done on PC + wait-class, not raw ID.
#39 (composite progression metric): the only metric improved is VdSwap latency (a host-side property, not a guest-progression metric). swaps stays at 1, draws at 0. No claim of "progression" is made.
#40 (single-keystone framing): explicitly checked. The single keystone "VdSwap-inline-drain is the upstream blocker" is FALSIFIED by the gate (b)/(c)/(d) failures. The fix is retained on its own merits (VdSwap latency is a real wart) but does not unblock the cascade.

Next iterate recommendation

NOT a single-keystone follow-up. Two parallel, independent angles:

0xBCE25640 ctx-state installer (HIGH confidence root cause for the worker-spawn cascade). Per AUDIT-068 Session 4, the writer is guest PPC code at sub_824FD240+0x24 (PC 0x824FD264); per AUDIT-068 Session 3, the install epoch is host_ns ≈ 9.4 s on canary, well after ours's wedge at ~810 ms. The question is what guest path leads to sub_824FD240, and which prior kernel-call sequence in [0, 9.4 s] on canary is absent in ours. This is the natural successor to iterate-2D §Step 3's 1.3 s upstream timing skew finding.
VdSwap drain still has a small (~1 ms) host-side blocking call. Canary's VdSwap returns in 6.6 µs — three orders of magnitude faster. The remaining gap is the recv_timeout + worker's is_ready loop overhead. A follow-up could remove the DrainFence entirely in the Threaded path (worker is already draining continuously in its own loop; the synchronous fence is a vestigial belt-and-braces from M1.4). ~5-10 LOC. LOW priority — gate (a) is already PASS at the target threshold.

The iterate-2F retention question (revert if FIX-INERT) is NO — keep the patch. The 900 ms VdSwap stall was a real performance wart with non-progression cascade consequences (it inflated host wallclock by ~2× without doing useful guest work). Keeping the fix lowers test turnaround for downstream iterates investigating the real upstream cause (the 0xBCE25640 chain).

Artifacts

Under xenia-rs/audit-runs/iterate-2F-vdswap-drain-fix/:

ours-cold.jsonl (118,149 events, 50M-instr run, phase-a log)
ours-cold-long.jsonl (118,149 events, 500M-instr run — same wedge state)
ours-i2f-iat-trace.jsonl (153 events, bit-identical to i2d baseline)
ours-i2f-halt.stderr.log (post-fix run with deadlock dump active — shows sound.p04 NtReadFile progress through 90s)
digest-{1,2,3}.json (3× bit-identical check --stable-digest determinism check)
writer-report.md (this file)

Cascade roll-up

gate	description	result
Patch LOC ≤ 20	hard cap	PASS (15 LOC net)
Build clean	warnings only, no errors	PASS
xenia-gpu tests	no regression	PASS (149/149)
xenia-kernel tests	no regression	PASS (226/226)
Determinism	3 cold runs bit-identical	PASS
(a) VdSwap latency <1 ms	900 ms → 1.03 ms	PASS
(b) missing (op,lr) tuples <28	28 → 28	FAIL
(c) ours analogs for canary tids 15/27/28	0 → 0	FAIL
(d) producer-rate at 0x824AB168 >9.97%	9.97% → 9.97%	FAIL
(e) tid=1 wedge moved/absent	wedge earlier, same PC	NEUTRAL

Outcome class: FIX-PARTIAL-CASCADE. Single-gate fix lands cleanly, broader cascade does not follow. Patch retained.

11 KiB Raw Blame History Unescape Escape