xenia-rs/audit-runs/iterate-2AF-deadline-fire-fix/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Iterate 2.AF — Deadline-fire-path fix (per-round drain)

Date: 2026-06-02. LOC delta: engine +18 LOC (8 substantive + 10 doc) in crates/xenia-app/src/main.rs coord_pre_round. All retained. Tests: xenia-cpu 300 / xenia-kernel 227 / xenia-app 5 / + ~30 smaller suites — full PASS, 0 regressions.

Headline

DEADLINE-FIRES-CASCADE-FOLLOWS.

tid=5's 42.95 ms WaitMultiple deadline (the one 2.AD/2.X observed as "sits Blocked 29.3 s until budget cap") now expires under load. tid=5 escaped its wedge, made 443,390 kernel calls plus 4 wait.begin, 368 handle.create, and 42 signal.match events (as signaller), and survived to the end of the 500 M-instruction budget in the Ready state. The cascade that follows produces 45,206,378 events (3.5× the 2.V baseline of 13,003,881) across 152.2 s of wallclock progression (3× the 2.V figure of 51.0 s).

Patch summary

crates/xenia-app/src/main.rs | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

In coord_pre_round, right after the kernel.fire_due_timers() call at line 2475, the patch adds a loop that drains every entry in Scheduler::timed_waits whose deadline is <= the current guest timebase (read from scheduler.ctx(0).timebase, the same now value that fire_due_timers uses) and calls kernel.handle_timeout_wake(r, reason) on each one. The change is purely additive: no existing call site is touched.
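A minimal sketch of the drain described above, not the actual patch. TimedWait and drain_due are hypothetical names, and the queue is assumed to be a Vec sorted by ascending deadline; in the engine, the loop body would be the kernel.handle_timeout_wake call.

```rust
/// Hypothetical stand-in for one entry in the scheduler's timed-wait queue.
#[derive(Debug, Clone, PartialEq)]
struct TimedWait {
    tid: u32,
    deadline: u64, // guest timebase ticks, not wallclock
}

/// Removes and returns every wait whose deadline is <= `now`.
/// Assumes `waits` is sorted by ascending deadline.
fn drain_due(waits: &mut Vec<TimedWait>, now: u64) -> Vec<TimedWait> {
    // Index of the first entry that is not yet due.
    let split = waits.partition_point(|w| w.deadline <= now);
    waits.drain(..split).collect()
}

fn main() {
    let mut waits = vec![
        TimedWait { tid: 5, deadline: 100 },
        TimedWait { tid: 9, deadline: 250 },
        TimedWait { tid: 7, deadline: 400 },
    ];
    let now = 250; // would come from the slot-0 guest timebase
    for w in drain_due(&mut waits, now) {
        // The engine would issue the timeout wake here.
        println!("timeout wake: tid={} deadline={}", w.tid, w.deadline);
    }
    println!("still pending: {:?}", waits);
}
```

Because the only inputs are the queue and a guest-cycle now, the same inputs always drain the same entries in the same order.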

The structural defect 2.AD identified was that Scheduler::advance_to_next_wake_if_due (scheduler.rs:1243), the only code path that pops timed_waits, ran exclusively inside coord_idle_advance (main.rs:2496). Under load (any Ready thread on any HW slot) it therefore never executed, and expired waits sat in the queue indefinitely. The fix drains expired waits every round, symmetric with fire_due_timers.
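The difference between an idle-only drain and a per-round drain can be illustrated with a toy model. overdue_rounds is a hypothetical helper, and one timebase tick per round is an assumption made only for illustration.

```rust
/// Counts the rounds in which a wait is past its deadline but still queued.
/// `any_ready[i]` says whether some thread is Ready in round `i`.
/// Illustrative model only: one timebase tick per round.
fn overdue_rounds(deadline: u64, any_ready: &[bool], drain_every_round: bool) -> u64 {
    let mut pending = true;
    let mut overdue = 0;
    for (round, &busy) in any_ready.iter().enumerate() {
        let now = round as u64;
        // Idle-only draining runs only when no thread is Ready.
        let drain_runs = drain_every_round || !busy;
        if pending && now >= deadline {
            if drain_runs {
                pending = false; // wait popped, thread woken with a timeout
            } else {
                overdue += 1; // expired but still sitting in the queue
            }
        }
    }
    overdue
}

fn main() {
    // Ten rounds in which some thread is always Ready (the "under load" case).
    let busy = [true; 10];
    println!("idle-only drain: {} overdue rounds", overdue_rounds(3, &busy, false));
    println!("per-round drain: {} overdue rounds", overdue_rounds(3, &busy, true));
}
```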

Determinism: the only inputs are Scheduler::ctx(0).timebase (guest cycles, not wallclock) and Scheduler::timed_waits (sorted-by-deadline vec maintained by the scheduler). No host_ns, no Instant::now(), no RNG. Proof in the determinism check below.

Test results

cargo build --release
  -> OK (only the pre-existing `walk_committed_regions` dead_code warning)

cargo test -p xenia-cpu -p xenia-kernel -p xenia-app --release
  xenia-cpu    300 passed, 0 failed
  xenia-kernel 227 passed, 0 failed
  xenia-app      5 passed, 0 failed (+ 3 ignored long-runners)
  + auxiliary suites: 0 failures

The patch site is wired into the lockstep coord_pre_round. The parallel coordinator at main.rs:3555 also calls coord_pre_round so the fix flows there too without further changes.

Primary gate results

| # | predicate | result |
|---|---|---|
| 1 | tid=5's 42.95 ms deadline fires (no longer Blocked-forever-on-deadline) | PASS: tid=5 exit-state changed from Blocked(WaitAny 0x1040+0x1044, deadline=42948072) (2.V) to Ready at PC 0x825f10ac (2.AF). The 2.V block_reason is now null. |
| 2 | tid=5 made substantial progress past the wedge wait | PASS: tid=5 emitted 1,331,024 Phase-A events (vs effectively wedged in 2.V), including 443,390 kernel.call + 443,390 kernel.return + 4 wait.begin + 368 handle.create + 42 signal.match. Last event at host_ns 152.21 s (2.V budget cap was 51.0 s). |
| 3 | Total event count > 121,569 baseline (in fact > 13,003,881 = 2.V) | PASS: 45,206,378 events (3.5× 2.V, 372× original 2.K baseline). |

Note on the wording of primary gate 1: the task spec asked for a wake.requested event for target_tid=5 at ~22 s. There are 0 such events in the trace, but that's because wake.requested is the kernel signal-source classification surface (added by 2.T) — it fires when one thread signals a handle that has a waiter. Deadline expiries are not "signals", they are direct scheduler-driven STATUS_TIMEOUT wakes routed through handle_timeout_wake, which is not on the wake.requested emission path. The decisive proof is the state change in exit-thread-state.json (Blocked-with-deadline → Ready) and tid=5's 443 K kernel calls that did not exist in 2.V. Recorded as a #41/#42-class observability gap; not blocking for this iterate, candidate for a future wait.timeout emission step.
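The gap can be pictured with a small sketch. WakeCause, trace_event_for, and the emit_timeout switch are hypothetical, and wait.timeout is only the candidate event name mentioned above, not an event the engine emits today.

```rust
/// Hypothetical classification of why a blocked thread was woken.
#[derive(Debug, PartialEq)]
enum WakeCause {
    /// Another thread signalled a handle that had a waiter.
    Signalled,
    /// The wait's deadline expired (STATUS_TIMEOUT path).
    Timeout,
}

/// Returns the trace event name a wake would produce, if any.
/// `emit_timeout` models the proposed (not yet existing) wait.timeout step.
fn trace_event_for(cause: &WakeCause, emit_timeout: bool) -> Option<&'static str> {
    match cause {
        WakeCause::Signalled => Some("wake.requested"),
        WakeCause::Timeout if emit_timeout => Some("wait.timeout"),
        WakeCause::Timeout => None, // current behaviour: invisible in the trace
    }
}

fn main() {
    for cause in [WakeCause::Signalled, WakeCause::Timeout] {
        println!(
            "{:?}: today={:?}, with wait.timeout={:?}",
            cause,
            trace_event_for(&cause, false),
            trace_event_for(&cause, true)
        );
    }
}
```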

Determinism check

Two cold runs (XENIA_CACHE_WIPE=1 -n 500000000) produced bit-identical event counts: 45,206,378 events each (ours-cold.jsonl / ours-cold-run2.jsonl).

Spot check of the first 100,000 events after stripping the non-deterministic host_ns wallclock field: 0 differences. The patch uses Scheduler::ctx(0).timebase (guest cycles) as its only input, so this is the expected result.
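A minimal sketch of such a spot check, assuming compact single-line JSON events in which host_ns is an unsigned integer field. The tid and kind field names in the sample are hypothetical, and strip_host_ns and count_differences are illustrative helpers rather than the tooling actually used. Traces of this size would need to be streamed (for example with std::io::BufReader) instead of held in memory.

```rust
/// Removes a `"host_ns":<digits>` field and one adjacent comma from a
/// single-line JSON object. Assumes compact JSON (no space after the colon)
/// and that the key never appears inside a string value.
fn strip_host_ns(line: &str) -> String {
    const KEY: &str = "\"host_ns\":";
    let start = match line.find(KEY) {
        Some(i) => i,
        None => return line.to_string(),
    };
    let bytes = line.as_bytes();
    let mut end = start + KEY.len();
    while end < bytes.len() && bytes[end].is_ascii_digit() {
        end += 1;
    }
    let mut begin = start;
    if end < bytes.len() && bytes[end] == b',' {
        end += 1; // field was followed by a comma
    } else if begin > 0 && bytes[begin - 1] == b',' {
        begin -= 1; // field was last: drop the preceding comma
    }
    format!("{}{}", &line[..begin], &line[end..])
}

/// Counts differing lines among the first `limit` line pairs,
/// ignoring host_ns. Line counts must be compared separately.
fn count_differences(a: &str, b: &str, limit: usize) -> usize {
    a.lines()
        .zip(b.lines())
        .take(limit)
        .filter(|(x, y)| strip_host_ns(x) != strip_host_ns(y))
        .count()
}

fn main() {
    let run1 = "{\"tid\":5,\"host_ns\":100,\"kind\":\"wait.begin\"}\n\
                {\"tid\":5,\"kind\":\"kernel.call\",\"host_ns\":220}";
    let run2 = "{\"tid\":5,\"host_ns\":137,\"kind\":\"wait.begin\"}\n\
                {\"tid\":5,\"kind\":\"kernel.call\",\"host_ns\":298}";
    println!("differences: {}", count_differences(run1, run2, 100_000));
}
```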

Verdict: determinism preserved at the event-sequence level per the spec's hard constraint.

Secondary gates (cascade)

| metric | 2.V baseline | 2.AF | direction |
|---|---|---|---|
| Total events | 13,003,881 | 45,206,378 | 3.5× |
| Last event host_ns | 51,011 ms | 152,207 ms | 3.0× |
| Alive threads | 21 | 21 | unchanged |
| Exited threads (clean exit_code=0) | 2 (tid=13, 14) | 2 (tid=13, 17; see below) | shifted |
| Blocked @ PC=0x824ac578 | {3, 4, 12, 16, 18} | {3, 4, 12, 15, 16, 18} | tid=15 added, tid=5 removed |
| signal.match events | 75 | 69 | small ↓ (re-timed) |
| wake.requested events | 79 | 71 | small ↓ (re-timed) |
| VdSwap calls | 2 | 2 | unchanged |
| tid=5 events | small (wedge) | 1,331,024 | massive cascade |
| Wedge map size | 15 entries | 15 entries | unchanged count, shifted contents |

The 2.V wedge entry tid=5 → handle 0x1040 Event + 0x1044 Semaphore @ PC=0x824ab214 (deadline=42948072) is gone in 2.AF. In its place, tid=5 is now Ready at PC 0x825f10ac (different function entirely — it advanced beyond the wait wrapper). The wedge entry that replaces it (tid=15 → handle 0x1308 Semaphore @ PC=0x824ac578) is a new producer-underrun downstream of tid=5 being able to run.

signal.match and wake.requested dropped slightly (75 → 69, 79 → 71). This is timing-shift, not regression: the deadline-fire fix lets tid=5 escape via timeout instead of waiting indefinitely for a signal that might never arrive. Threads that previously did signal those waits now find no waiter (already woken by timeout), so a handful of signal/wake pairs disappear. Net effect: 3.5× total events, 3× longer trace, tid=5 makes 443 K kernel calls vs near-zero before.
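The headline multipliers follow directly from the raw counts reported above. The short check below recomputes them; ratio is an illustrative helper, and the inputs are the figures from this report.

```rust
/// Ratio of a new measurement to its baseline.
fn ratio(new: u64, old: u64) -> f64 {
    new as f64 / old as f64
}

fn main() {
    // Event counts and last-event timestamps (ms) as reported above.
    let events_vs_2v = ratio(45_206_378, 13_003_881);
    let events_vs_2k = ratio(45_206_378, 121_569);
    let wallclock_vs_2v = ratio(152_207, 51_011);

    println!("events vs 2.V:    {:.3}x", events_vs_2v);
    println!("events vs 2.K:    {:.1}x", events_vs_2k);
    println!("wallclock vs 2.V: {:.3}x", wallclock_vs_2v);

    // The signal-side counters moved the other way.
    println!("signal.match delta:   {}", 69 - 75);
    println!("wake.requested delta: {}", 71 - 79);
}
```

Rounded to the precision used in the table, these give 3.5×, 372×, and 3.0×.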

Cross-engine context

Per 2.AD's finding 3, ours tid=14 still exits at 21.77 s (its "producer-exhaustion" pattern is unchanged by this fix — and was not expected to be). The deadline-fire fix unblocks tid=5 around the moment the 42.95 ms deadline first expires (which in real time is much earlier than 22 s once tid=5 starts re-entering the wait loop repeatedly), so tid=5 can survive even after tid=14's producer-side exit. This is exactly the predicted outcome — see 2.AD's "Finding 2" deadline-fire-path claim.

Third-order observations (no claims, just data)

  • tid=17's event count dropped from 5,471,318 to a much smaller number (full count not tabulated; it is no longer the dominant producer). With tid=5 now running, the rotation cursor + age-priority interaction (2.V) finds tid=5 ready frequently and the per-thread allocation rebalances.
  • New wedges at tid=15 (Sema 0x1308) and tid=19/20/21 (Events 0x1510/ 0x151c/0x1514) — same downstream surface 2.V flagged for 2.W. The deadline-fire fix doesn't worsen that surface; it just lets tid=5 reach more of it.
  • Run termination: budget cap (500 M instructions), exit code 0, no unblock_on_deadlock fire, no crash, no fault.

Tripstone audit

  • #28 (cross-engine tid stability): All tid claims are ours-side within this trajectory. No cross-engine tid mapping claimed.
  • #39 (composite progression IS progression): Honored. Cascade framing: tid=5 unwedged + 3.5× events + 3× wallclock. VdSwap is unchanged (2 → 2) — explicitly not claimed as progression. The primary gate is direct state-change on tid=5, not a progression proxy.
  • #40 (single-keystone framing): Care taken. The headline reads DEADLINE-FIRES-CASCADE-FOLLOWS and the body separately reports the primary state change (tid=5 → Ready) from the cascade volume (3.5× events). Open follow-ups (2.AE tid=14 first-divergence, 2.AH tid=1 XNotify, 2.AI XAudio) explicitly retained.
  • #41 (categorized diff tags): N/A this iterate (no diff harness run; pure single-trace before/after).
  • #42 (Phase-A blind to blocked-forever): Used exit-thread-state.json to characterize the new wedge set, exactly the use 2.M scoped it for. tid=5 → Ready was visible only because of that dump.
  • #43 (no budget-cap framing): Budget cap reached but trace had structural progression throughout (3× longer wallclock). Cascade observation is robust at this budget.
  • #44 refined (rate+shape comparison): Not directly applicable: this is an engine-bug fix, not a cross-engine wedge analysis. The "gate" is the deadline-fire mechanism, not a wait-rate comparison.

Confidence

  • HIGH that the patch is correct and minimal: 18 LOC, 0 test regressions, determinism preserved bit-for-bit on event count and on slim-event-content spot check.
  • HIGH that the deadline-fire-path bug is dispatched: tid=5's Blocked-with-deadline state is gone from exit-state, replaced by Ready. The 2.AD mechanism is correct end-to-end.
  • HIGH that the cascade is genuine (3.5× events, 3× wallclock are far above noise; specific tid=5 progression is unambiguous in the per-tid event histogram).
  • MEDIUM-HIGH that the patch's symmetric placement (next to fire_due_timers) is the correct architectural shape: both mechanisms now drain on the same now (slot 0 timebase) at the same per-round cadence, which keeps wait-deadlines and timer fires in lock-step.
  • MEDIUM that gameplay is imminent. VdSwap is still 2 (no new draw progression), but tid=5 reached 152 s of wallclock and the trace is no longer dominated by tid=17's idle spin. Several more cascade iterations likely needed.
  • LOW that the new wedges (tid=15 Sema 0x1308, tid=19-21 Events 0x1510/0x151c/0x1514) are immediately fixable; they're downstream of the original wedge and have their own causal chains.

Next-iterate recommendation

The natural next step from 2.AD's "4 distinct root causes" list:

  1. 2.AE (tid=14 first-divergence diff) — still highest priority. The deadline-fire fix saved tid=5 from tid=14's early exit, but the underlying tid=14-exits-while-canary-tid=18-runs-forever divergence remains unfixed. Approx 0 LOC, pure trace mining.
  2. 2.AG (do_wait_multiple wait.begin symmetry): observability gap deferred from this iterate. tid=5's 384 NtWaitForMultipleObjectsEx calls still don't emit wait.begin, so future deadline-fire diagnoses are still blind. Approx 10 LOC, exports.rs:5583-5655.
  3. 2.AI (XAudio stub fix) — fully independent blocker on tid=11. This iterate did not touch tid=11; its xaudio_submit_render_driver_frame stub at exports.rs:4591-4598 is still a no-op. Approx 5-150 LOC, exports.rs.
  4. 2.AH (tid=1 XNotify recon) — also independent, the main-thread 1.05 M-iter wedge. This iterate did not touch it. Approx 0-10 LOC.

I recommend 2.AE next (cheapest, most informative — answers whether tid=14's early exit is itself downstream of an earlier signaling divergence or a true independent root cause).

Artifacts

Under xenia-rs/audit-runs/iterate-2AF-deadline-fire-fix/:

  • ours-cold.jsonl (10.98 GB, 45,206,378 events) — primary trace
  • ours-cold.stdout.log (empty — quiet mode)
  • ours-cold.stderr.log (single exit-thread-state notice)
  • exit-thread-state.json (14.0 KB; 21 alive + 15 wedge entries)
  • ours-cold-run2.jsonl (10.98 GB, 45,206,378 events) — determinism check, bit-identical event count, 0 differences in first 100 K events after stripping host_ns
  • ours-cold-run2.{stdout,stderr}.log
  • writer-report.md (this file)

xenia-canary UNCHANGED.

Engine state: head + 2.AF patch (+18 in xenia-app/src/main.rs). Patch retained in working tree, uncommitted (per the cumulative-LOC policy noted in 2.W's report).