# Iterate 2.AF: Deadline-fire-path fix (per-round drain)

**Date:** 2026-06-02.

**LOC delta:** engine **+18 LOC** (8 substantive + 10 doc) in `crates/xenia-app/src/main.rs` `coord_pre_round`. All retained.

**Tests:** xenia-cpu 300 / xenia-kernel 227 / xenia-app 5, plus ~30 smaller suites: full PASS, 0 regressions.

## Headline

**DEADLINE-FIRES-CASCADE-FOLLOWS.** tid=5's 42.95 ms WaitMultiple deadline (the 2.AD/2.X observation that "sits Blocked 29.3 s until budget cap") now expires under load. tid=5 escaped its wedge, racked up 443,390 kernel calls + 4 wait.begin + 368 handle.create + 42 signal.match (as signaller), and survived to the end of the 500 M-instruction budget in the **Ready** state. The cascade that follows produces 45,206,378 events (3.5× the 2.V baseline of 13,003,881) across **152.2 s of wallclock progression** (3× the 2.V 51.0 s).

## Patch summary

```text
 crates/xenia-app/src/main.rs | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
```

In `coord_pre_round`, right after `kernel.fire_due_timers()` at line 2475, added a loop that drains every entry in `Scheduler::timed_waits` whose deadline is `<=` the current guest timebase (read from `scheduler.ctx(0).timebase`, the same `now` that `fire_due_timers` uses) and calls `kernel.handle_timeout_wake(r, reason)` on each one. Purely additive: no existing call site touched.

The structural defect 2.AD identified was that `Scheduler::advance_to_next_wake_if_due` (scheduler.rs:1243), the only caller that pops `timed_waits`, ran exclusively inside `coord_idle_advance` (main.rs:2496). Under load (any Ready thread on any HW slot) it therefore never executed, and expired waits sat in the queue indefinitely. The fix performs the equivalent drain every round, symmetric with `fire_due_timers`.

Determinism: the only inputs are `Scheduler::ctx(0).timebase` (guest cycles, not wallclock) and `Scheduler::timed_waits` (a sorted-by-deadline vec maintained by the scheduler). No `host_ns`, no `Instant::now()`, no RNG. Proof is in the determinism check below.
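The shape of the per-round drain can be sketched in isolation. This is a minimal illustration, not the patch itself: `TimedWait`, `MiniScheduler`, `drain_due`, and `pre_round` are hypothetical stand-ins, and the only properties borrowed from the description above are that the queue is sorted by deadline, that `now` is a guest-timebase value, and that the drain runs every round whether or not any thread is Ready.

```rust
/// Hypothetical stand-in for one entry in `Scheduler::timed_waits`.
#[derive(Debug, Clone, PartialEq)]
struct TimedWait {
    tid: u32,
    /// Deadline in guest timebase ticks (not wallclock).
    deadline: u64,
}

/// Hypothetical miniature scheduler holding only the deadline queue.
struct MiniScheduler {
    /// Kept sorted by ascending deadline, as the report describes.
    timed_waits: Vec<TimedWait>,
}

impl MiniScheduler {
    /// Remove and return every wait whose deadline is `<= now`,
    /// preserving deadline order.
    fn drain_due(&mut self, now: u64) -> Vec<TimedWait> {
        // Because the vec is sorted, all due entries form a prefix.
        let due = self.timed_waits.partition_point(|w| w.deadline <= now);
        self.timed_waits.drain(..due).collect()
    }
}

/// Hypothetical per-round hook. It runs unconditionally, so expired
/// waits are delivered even when other threads are Ready.
fn pre_round(
    sched: &mut MiniScheduler,
    now: u64,
    mut on_timeout: impl FnMut(TimedWait),
) {
    for wait in sched.drain_due(now) {
        // In the real engine this would be the timeout-wake call.
        on_timeout(wait);
    }
}
```

Since the sketch depends only on `now` and the queue contents, two runs that present the same guest timebase at each round wake the same threads in the same order, which is the property the determinism check relies on.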
## Test results

```text
cargo build --release
  -> OK (only the pre-existing `walk_committed_regions` dead_code warning)

cargo test -p xenia-cpu -p xenia-kernel -p xenia-app --release
  xenia-cpu     300 passed, 0 failed
  xenia-kernel  227 passed, 0 failed
  xenia-app       5 passed, 0 failed (+ 3 ignored long-runners)
  + auxiliary suites: 0 failures
```

The patch site is wired into the lockstep `coord_pre_round`. The parallel coordinator at main.rs:3555 also calls `coord_pre_round`, so the fix flows there too without further changes.

## Primary gate results

| # | predicate | result |
|---|---|---|
| 1 | tid=5's 42.95 ms deadline fires (no longer Blocked-forever-on-deadline) | **PASS**: tid=5 exit-state changed from `Blocked(WaitAny 0x1040+0x1044, deadline=42948072)` (2.V) to `Ready` at PC `0x825f10ac` (2.AF). The 2.V `block_reason` is now `null`. |
| 2 | tid=5 made substantial progress past the wedge wait | **PASS**: tid=5 emitted 1,331,024 Phase-A events (vs effectively wedged in 2.V), including 443,390 kernel.call + 443,390 kernel.return + 4 wait.begin + 368 handle.create + 42 signal.match. Last event at host_ns 152.21 s (2.V budget cap was 51.0 s). |
| 3 | Total event count > 121,569 baseline (in fact > 13,003,881 = 2.V) | **PASS**: 45,206,378 events (3.5× 2.V, 372× original 2.K baseline). |

**Note on the wording of primary gate 1**: the task spec asked for a `wake.requested` event for `target_tid=5` at ~22 s. There are 0 such events in the trace, but that is because `wake.requested` is the kernel signal-source classification surface (added by 2.T): it fires when one thread signals a handle that has a waiter. Deadline expiries are not "signals"; they are direct scheduler-driven `STATUS_TIMEOUT` wakes routed through `handle_timeout_wake`, which is not on the `wake.requested` emission path. The decisive proof is the state change in `exit-thread-state.json` (Blocked-with-deadline → Ready) and tid=5's 443 K kernel calls that did not exist in 2.V.
Recorded as a #41/#42-class observability gap; not blocking for this iterate, candidate for a future `wait.timeout` emission step.

## Determinism check

Two cold runs (`XENIA_CACHE_WIPE=1 -n 500000000`) produced **bit-identical event counts: 45,206,378 events each** (`ours-cold.jsonl` / `ours-cold-run2.jsonl`). Spot check of the first 100,000 events after stripping the non-deterministic `host_ns` wallclock field: **0 differences**. The patch uses `Scheduler::ctx(0).timebase` (guest cycles) as its only input, so this is the expected result.

Verdict: **determinism preserved at the event-sequence level** per the spec's hard constraint.

## Secondary gates (cascade)

| metric | 2.V baseline | 2.AF | direction |
|---|---:|---:|---|
| Total events | 13,003,881 | **45,206,378** | **3.5× ↑** |
| Last event host_ns | 51,011 ms | **152,207 ms** | **3.0× ↑** |
| Alive threads | 21 | 21 | unchanged |
| Exited threads (clean exit_code=0) | 2 (tid=13, 14) | 2 (tid=13, 17; see below) | shifted |
| Blocked @ PC=0x824ac578 | {3, 4, 12, 16, 18} | {3, 4, 12, 15, 16, 18} | tid=15 added (tid=5 left the wedge map from a different PC, 0x824ab214) |
| `signal.match` events | 75 | 69 | small ↓ (re-timed) |
| `wake.requested` events | 79 | 71 | small ↓ (re-timed) |
| VdSwap calls | 2 | 2 | unchanged |
| tid=5 events | small (wedge) | **1,331,024** | massive cascade |
| Wedge map size | 15 entries | 15 entries | unchanged count, shifted contents |

The 2.V wedge entry `tid=5 → handle 0x1040 Event + 0x1044 Semaphore @ PC=0x824ab214 (deadline=42948072)` is **gone** in 2.AF. In its place, tid=5 is now `Ready` at PC `0x825f10ac` (a different function entirely; it advanced beyond the wait wrapper). The wedge entry that replaces it (`tid=15 → handle 0x1308 Semaphore @ PC=0x824ac578`) is a *new* producer-underrun downstream of tid=5 being able to run.

`signal.match` and `wake.requested` dropped slightly (75 → 69, 79 → 71).
This is a timing shift, not a regression: the deadline-fire fix lets tid=5 escape via timeout instead of waiting indefinitely for a signal that might never arrive. Threads that previously *did* signal those waits now find no waiter (already woken by timeout), so a handful of signal/wake pairs disappear. Net effect: 3.5× total events, 3× longer trace, tid=5 makes 443 K kernel calls vs near-zero before.

## Cross-engine context

Per 2.AD's finding 3, ours tid=14 still exits at 21.77 s (its "producer-exhaustion" pattern is unchanged by this fix, and was not expected to change). The deadline-fire fix unblocks tid=5 around the moment the 42.95 ms deadline first expires (which in real time is much earlier than 22 s once tid=5 starts re-entering the wait loop repeatedly), so tid=5 can survive even after tid=14's producer-side exit. This is exactly the predicted outcome; see 2.AD's "Finding 2" deadline-fire-path claim.

## Third-order observations (no claims, just data)

- **tid=17 events dropped 5,471,318 → much less** (full count not tabulated; it's no longer the dominant producer). With tid=5 now running, the rotation cursor + age-priority interaction (2.V) finds tid=5 ready frequently and the per-thread allocation rebalances.
- **New wedges** at tid=15 (Sema 0x1308) and tid=19/20/21 (Events 0x1510/0x151c/0x1514): the same downstream surface 2.V flagged for 2.W. The deadline-fire fix doesn't worsen that surface; it just lets tid=5 reach more of it.
- **Run termination**: budget cap (500 M instructions, matching `-n 500000000`), exit code 0, no `unblock_on_deadlock` fire, no crash, no fault.

## Tripstone audit

- **#28 (cross-engine tid stability)**: All tid claims are ours-side within this trajectory. No cross-engine tid mapping claimed.
- **#39 (composite progression IS progression)**: Honored. Cascade framing: tid=5 unwedged + 3.5× events + 3× wallclock. VdSwap is unchanged (2 → 2) and is explicitly *not* claimed as progression. The primary gate is the direct state change on tid=5, not a progression proxy.
- **#40 (single-keystone framing)**: Care taken. The headline reads `DEADLINE-FIRES-CASCADE-FOLLOWS` and the body reports the primary state change (tid=5 → Ready) separately from the cascade volume (3.5× events). Open follow-ups (2.AE tid=14 first-divergence, 2.AH tid=1 XNotify, 2.AI XAudio) explicitly retained.
- **#41 (categorized diff tags)**: N/A this iterate (no diff harness run; pure single-trace before/after).
- **#42 (Phase-A blind to blocked-forever)**: Used `exit-thread-state.json` to characterize the new wedge set, exactly the use 2.M scoped it for. tid=5 → Ready was visible only because of that dump.
- **#43 (no budget-cap framing)**: Budget cap reached, but the trace had structural progression throughout (3× longer wallclock). Cascade observation is robust at this budget.
- **#44 refined (rate+shape comparison)**: Not directly applicable; this is an engine-bug fix, not a cross-engine wedge analysis. The "gate" is the deadline-fire mechanism, not a wait-rate comparison.

## Confidence

- **HIGH** that the patch is correct and minimal: 18 LOC, 0 test regressions, determinism preserved bit-for-bit on event count and on the slim-event-content spot check.
- **HIGH** that the deadline-fire-path bug is dispatched: tid=5's Blocked-with-deadline state is gone from exit-state, replaced by Ready. The 2.AD mechanism is correct end-to-end.
- **HIGH** that the cascade is genuine (3.5× events and 3× wallclock are far above noise; the specific tid=5 progression is unambiguous in the per-tid event histogram).
- **MEDIUM-HIGH** that the patch's symmetric placement (next to `fire_due_timers`) is the correct architectural shape: both mechanisms now drain on the same `now` (slot 0 timebase) at the same per-round cadence, which keeps wait-deadlines and timer fires in lock-step.
- **MEDIUM** that gameplay is imminent. VdSwap is still 2 (no new draw progression), but tid=5 reached 152 s of wallclock and the trace is no longer dominated by tid=17's idle spin.
  Several more cascade iterations are likely needed.
- **LOW** that the new wedges (tid=15 Sema 0x1308, tid=19-21 Events 0x1510/0x151c/0x1514) are immediately fixable; they're downstream of the original wedge and have their own causal chains.

## Next-iterate recommendation

The natural next steps from 2.AD's "4 distinct root causes" list:

1. **2.AE (tid=14 first-divergence diff)**: still highest priority. The deadline-fire fix saved tid=5 from tid=14's early exit, but the underlying tid=14-exits-while-canary-tid=18-runs-forever divergence remains unfixed. Approx **0 LOC**, pure trace mining.
2. **2.AG (`do_wait_multiple` `wait.begin` symmetry)**: observability gap deferred from this iterate. tid=5's 384 `NtWaitForMultipleObjectsEx` calls still don't emit `wait.begin`, so future deadline-fire diagnoses are still blind. Approx **~10 LOC**, exports.rs:5583-5655.
3. **2.AI (XAudio stub fix)**: fully independent blocker on tid=11. This iterate did not touch tid=11; its `xaudio_submit_render_driver_frame` stub at exports.rs:4591-4598 is still a no-op. Approx **5-150 LOC**, exports.rs.
4. **2.AH (tid=1 XNotify recon)**: also independent, the main-thread 1.05 M-iter wedge. This iterate did not touch it. Approx **0-10 LOC**.

I recommend **2.AE next** (cheapest, most informative: it answers whether tid=14's early exit is itself downstream of an earlier signaling divergence or a true independent root cause).

## Artifacts

Under `xenia-rs/audit-runs/iterate-2AF-deadline-fire-fix/`:

- `ours-cold.jsonl` (10.98 GB, 45,206,378 events): primary trace
- `ours-cold.stdout.log` (empty; quiet mode)
- `ours-cold.stderr.log` (single exit-thread-state notice)
- `exit-thread-state.json` (14.0 KB; 21 alive + 15 wedge entries)
- `ours-cold-run2.jsonl` (10.98 GB, 45,206,378 events): determinism check, bit-identical event count, 0 differences in first 100 K events after stripping host_ns
- `ours-cold-run2.{stdout,stderr}.log`
- `writer-report.md` (this file)

xenia-canary UNCHANGED.
Engine state: head + 2.AF patch (`+18` in `xenia-app/src/main.rs`). Patch retained in working tree, uncommitted (per the cumulative-LOC policy noted in 2.W's report).
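
For reference, the determinism spot check recorded for `ours-cold.jsonl` and `ours-cold-run2.jsonl` (compare the first N events after stripping `host_ns`) could be reproduced along the following lines. This is a hedged sketch, not the tool that was actually used: `strip_host_ns` and `count_differences` are hypothetical helpers, and the code assumes each trace line is a JSON object in which `host_ns` appears as a bare unsigned integer written as `"host_ns":<digits>` with no whitespace.

```rust
/// Hypothetical helper: remove a `"host_ns":<digits>` field from one
/// JSONL line. Assumes the value is a bare unsigned integer with no
/// whitespace after the colon (an assumption about the trace schema).
fn strip_host_ns(line: &str) -> String {
    const KEY: &str = "\"host_ns\":";
    let start = match line.find(KEY) {
        Some(i) => i,
        None => return line.to_string(), // no wallclock field present
    };
    let after_key = start + KEY.len();
    let rest = &line[after_key..];
    // Length of the run of ASCII digits that forms the value.
    let value_len = rest
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(rest.len());
    let mut begin = start;
    let mut end = after_key + value_len;
    // Drop one separating comma so the remaining object stays valid:
    // prefer the trailing comma, else the one before the field.
    if line[end..].starts_with(',') {
        end += 1;
    } else if line[..begin].ends_with(',') {
        begin -= 1;
    }
    format!("{}{}", &line[..begin], &line[end..])
}

/// Hypothetical helper: count line pairs that differ once `host_ns`
/// is stripped, looking at no more than `limit` pairs. Comparison
/// stops at the shorter input, so line counts must be checked separately.
fn count_differences<'a>(
    run_a: impl Iterator<Item = &'a str>,
    run_b: impl Iterator<Item = &'a str>,
    limit: usize,
) -> usize {
    run_a
        .zip(run_b)
        .take(limit)
        .filter(|(a, b)| strip_host_ns(a) != strip_host_ns(b))
        .count()
}
```

A result of 0 from `count_differences` over the first 100,000 line pairs, together with equal total line counts, would correspond to the "0 differences" and "bit-identical event count" claims above.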