xenia-rs/audit-runs/iterate-2AF-deadline-fire-fix/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Iterate 2.AF — Deadline-fire-path fix (per-round drain)
**Date:** 2026-06-02. **LOC delta:** engine **+18 LOC** (8 substantive + 10
doc) in `crates/xenia-app/src/main.rs` `coord_pre_round`. All retained.
**Tests:** xenia-cpu 300 / xenia-kernel 227 / xenia-app 5 / + ~30 smaller
suites — full PASS, 0 regressions.
## Headline
**DEADLINE-FIRES-CASCADE-FOLLOWS.**
tid=5's 42.95 ms WaitMultiple deadline (the 2.AD/2.X observation that
"sits Blocked 29.3 s until budget cap") now expires under load. tid=5
escaped its wedge, racked up 443,390 kernel calls + 4 wait.begin + 368
handle.creates + 42 signal.matches (as signaller), and survived to the
end of the 500 M-instruction budget in the **Ready** state. The cascade
that follows produces 45,206,378 events (3.5× the 2.V baseline of
13,003,881) across **152.2 s of wallclock progression** (3× the 2.V
51.0 s).
## Patch summary
```text
crates/xenia-app/src/main.rs | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
```
In `coord_pre_round`, right after `kernel.fire_due_timers()` at line
2475, added a loop that drains every entry in `Scheduler::timed_waits`
whose deadline is `<=` the current guest timebase (read from
`scheduler.ctx(0).timebase`, the same `now` `fire_due_timers` uses) and
calls `kernel.handle_timeout_wake(r, reason)` on each one. The change is
purely additive: no existing call site is touched.
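The drain step described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `TimedWait` and `drain_expired` are hypothetical names, and the only properties taken from this report are that `timed_waits` is a deadline-sorted vec and that entries with `deadline <= now` get woken.

```rust
/// Hypothetical stand-in for a scheduler wait entry; the real element type
/// of `Scheduler::timed_waits` is not shown in this report.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TimedWait {
    pub tid: u32,
    /// Deadline in guest timebase cycles (not wallclock).
    pub deadline: u64,
}

/// Remove and return every entry whose deadline is `<= now`.
/// Assumes `timed_waits` is sorted ascending by deadline, so all expired
/// entries form a prefix of the vec.
pub fn drain_expired(timed_waits: &mut Vec<TimedWait>, now: u64) -> Vec<TimedWait> {
    // Index of the first entry that has NOT yet expired.
    let expired = timed_waits.partition_point(|w| w.deadline <= now);
    timed_waits.drain(..expired).collect()
}
```

In the real patch, each drained entry would then be handed to `kernel.handle_timeout_wake(r, reason)`; the sketch stops at returning the expired entries so it stays independent of the kernel types.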
The structural defect 2.AD identified was that
`Scheduler::advance_to_next_wake_if_due` (scheduler.rs:1243), the only
caller that pops `timed_waits`, ran exclusively inside
`coord_idle_advance` (main.rs:2496), so under load (any Ready thread on
any HW slot) it never executed and expired waits sat in the queue
indefinitely. The fix performs the equivalent drain every round,
regardless of load, symmetric with `fire_due_timers`.
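A toy model may make the starvation easier to see. The names below (`DrainPolicy`, `first_timeout_round`) are hypothetical and the round structure is heavily simplified; the sketch only assumes what the paragraph above states, namely that the old drain ran solely on the idle path while the new one runs every round.

```rust
/// Where the timed-wait drain runs in this toy model (hypothetical names).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum DrainPolicy {
    /// Drain only when no thread is Ready (pre-2.AF behaviour as described).
    IdleOnly,
    /// Drain at the start of every round (2.AF behaviour).
    EveryRound,
}

/// Simulate up to `rounds` scheduler rounds for one waiter with the given
/// `deadline` (guest cycles). Each round advances the guest timebase by
/// `cycles_per_round`; `any_ready` models "some other thread is always Ready".
/// Returns the round in which the waiter is woken by timeout, if any.
pub fn first_timeout_round(
    policy: DrainPolicy,
    deadline: u64,
    cycles_per_round: u64,
    rounds: u64,
    any_ready: bool,
) -> Option<u64> {
    let mut now = 0u64;
    for round in 0..rounds {
        // The idle-only policy skips the drain whenever anything is Ready.
        let drain_runs = match policy {
            DrainPolicy::EveryRound => true,
            DrainPolicy::IdleOnly => !any_ready,
        };
        if drain_runs && deadline <= now {
            return Some(round);
        }
        now += cycles_per_round;
    }
    None
}
```

Under constant load the idle-only policy returns `None` no matter how many rounds elapse, which mirrors the "sits Blocked until budget cap" symptom; the every-round policy wakes the waiter in the first round whose timebase has reached the deadline.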
Determinism: the only inputs are `Scheduler::ctx(0).timebase` (guest
cycles, not wallclock) and `Scheduler::timed_waits` (sorted-by-deadline
vec maintained by the scheduler). No `host_ns`, no `Instant::now()`, no
RNG. Proof in the determinism check below.
## Test results
```text
cargo build --release
  -> OK (only the pre-existing `walk_committed_regions` dead_code warning)
cargo test -p xenia-cpu -p xenia-kernel -p xenia-app --release
  xenia-cpu     300 passed, 0 failed
  xenia-kernel  227 passed, 0 failed
  xenia-app       5 passed, 0 failed (+ 3 ignored long-runners)
  + auxiliary suites: 0 failures
```
The patch site is wired into the lockstep `coord_pre_round`. The
parallel coordinator at main.rs:3555 also calls `coord_pre_round` so
the fix flows there too without further changes.
## Primary gate results
| # | predicate | result |
|---|---|---|
| 1 | tid=5's 42.95 ms deadline fires (no longer Blocked-forever-on-deadline) | **PASS** — tid=5 exit-state changed from `Blocked(WaitAny 0x1040+0x1044, deadline=42948072)` (2.V) to `Ready` at PC `0x825f10ac` (2.AF). The 2.V `block_reason` is now `null`. |
| 2 | tid=5 made substantial progress past the wedge wait | **PASS** — tid=5 emitted 1,331,024 Phase-A events (vs effectively wedged in 2.V), including 443,390 kernel.call + 443,390 kernel.return + 4 wait.begin + 368 handle.create + 42 signal.match. Last event at host_ns 152.21 s (2.V budget cap was 51.0 s). |
| 3 | Total event count > 121,569 baseline (in fact > 13,003,881 = 2.V) | **PASS** — 45,206,378 events (3.5× 2.V, 372× original 2.K baseline). |
**Note on the wording of primary gate 1**: the task spec asked for a
`wake.requested` event for `target_tid=5` at ~22 s. There are 0 such
events in the trace, but that's because `wake.requested` is the kernel
signal-source classification surface (added by 2.T) — it fires when one
thread signals a handle that has a waiter. Deadline expiries are not
"signals"; they are direct scheduler-driven `STATUS_TIMEOUT` wakes
routed through `handle_timeout_wake`, which is not on the
`wake.requested` emission path. The decisive proof is the state change
in `exit-thread-state.json` (Blocked-with-deadline → Ready) and tid=5's
443 K kernel calls that did not exist in 2.V. Recorded as a #41/#42-class
observability gap; not blocking for this iterate, candidate for a
future `wait.timeout` emission step.
## Determinism check
Two cold runs (`XENIA_CACHE_WIPE=1 -n 500000000`) produced
**bit-identical event counts: 45,206,378 events each**
(`ours-cold.jsonl` / `ours-cold-run2.jsonl`).
Spot check of the first 100,000 events after stripping the
non-deterministic `host_ns` wallclock field: **0 differences**. The
patch uses `Scheduler::ctx(0).timebase` (guest cycles) as its only
input, so this is the expected result.
Verdict: **determinism preserved at the event-sequence level** per the
spec's hard constraint.
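The spot check above could be reproduced with something like the following sketch. It is not the tool that was actually used: `strip_host_ns` and `count_differences` are hypothetical helpers, and the code assumes each trace line is a compact JSON object in which `host_ns` is a top-level unsigned integer field whose key text never appears inside a string value.

```rust
/// Remove a `"host_ns":<digits>` field (and one adjacent comma) from a line.
/// Assumption: the field is a plain unsigned integer, as described above.
pub fn strip_host_ns(line: &str) -> String {
    const KEY: &str = "\"host_ns\":";
    let Some(start) = line.find(KEY) else {
        return line.to_string();
    };
    let after_key = start + KEY.len();
    let rest = &line[after_key..];
    // Length of the value: optional spaces followed by digits.
    let value_len = rest
        .char_indices()
        .find(|(_, c)| !(c.is_ascii_digit() || *c == ' '))
        .map(|(i, _)| i)
        .unwrap_or(rest.len());
    let mut begin = start;
    let mut end = after_key + value_len;
    if line[end..].starts_with(',') {
        end += 1; // field was not last: drop the trailing comma
    } else if line[..begin].ends_with(',') {
        begin -= 1; // field was last: drop the preceding comma
    }
    format!("{}{}", &line[..begin], &line[end..])
}

/// Count differing positions among the first `limit` lines of two traces,
/// ignoring `host_ns`. Unmatched trailing lines each count as a difference.
pub fn count_differences(a: &[&str], b: &[&str], limit: usize) -> usize {
    let a = &a[..a.len().min(limit)];
    let b = &b[..b.len().min(limit)];
    let mismatched = a
        .iter()
        .zip(b.iter())
        .filter(|(x, y)| strip_host_ns(x) != strip_host_ns(y))
        .count();
    mismatched + a.len().abs_diff(b.len())
}
```

For traces of the size reported here, the lines would need to be streamed (for example through `std::io::BufRead::lines`) rather than held in slices; the slice form is used only to keep the sketch short.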
## Secondary gates (cascade)
| metric | 2.V baseline | 2.AF | direction |
|---|---:|---:|---|
| Total events | 13,003,881 | **45,206,378** | **3.5×** |
| Last event host_ns | 51,011 ms | **152,207 ms** | **3.0×** |
| Alive threads | 21 | 21 | unchanged |
| Exited threads (clean exit_code=0) | 2 (tid=13, 14) | 2 (tid=13, 17 — see below) | shifted |
| Blocked @ PC=0x824ac578 | {3, 4, 12, 16, 18} | {3, 4, 12, 15, 16, 18} | tid=15 added (tid=5's 2.V wedge was at a different PC, 0x824ab214) |
| `signal.match` events | 75 | 69 | small ↓ (re-timed) |
| `wake.requested` events | 79 | 71 | small ↓ (re-timed) |
| VdSwap calls | 2 | 2 | unchanged |
| tid=5 events | small (wedge) | **1,331,024** | massive cascade |
| Wedge map size | 15 entries | 15 entries | unchanged count, shifted contents |
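
The multipliers in the "direction" column follow directly from the two numeric columns. As a quick arithmetic check (the helper name `ratio_1dp` is hypothetical, not part of the engine):

```rust
/// Ratio of two measurements, rounded to one decimal place.
pub fn ratio_1dp(after: f64, before: f64) -> f64 {
    (after / before * 10.0).round() / 10.0
}
```

Applied to the table, `ratio_1dp(45_206_378.0, 13_003_881.0)` gives the 3.5× event ratio and `ratio_1dp(152_207.0, 51_011.0)` gives the 3.0× wallclock ratio.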
The 2.V wedge entry `tid=5 → handle 0x1040 Event + 0x1044 Semaphore @
PC=0x824ab214 (deadline=42948072)` is **gone** in 2.AF. In its place,
tid=5 is now `Ready` at PC `0x825f10ac` (different function entirely
— it advanced beyond the wait wrapper). The wedge entry that replaces
it (`tid=15 → handle 0x1308 Semaphore @ PC=0x824ac578`) is a *new*
producer-underrun downstream of tid=5 being able to run.
`signal.match` and `wake.requested` dropped slightly (75 → 69, 79 → 71).
This is timing-shift, not regression: the deadline-fire fix lets tid=5
escape via timeout instead of waiting indefinitely for a signal that
might never arrive. Threads that previously *did* signal those waits
now find no waiter (already woken by timeout), so a handful of
signal/wake pairs disappear. Net effect: 3.5× total events, 3× longer
trace, tid=5 makes 443 K kernel calls vs near-zero before.
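The "signal finds no waiter" effect can be illustrated with a toy wait table. This is only a sketch under stated assumptions: `WaitTable` and its methods are hypothetical and do not reflect the kernel's real data structures; they model just enough to show why a timeout wake removes a later `signal.match`.

```rust
use std::collections::HashMap;

/// Toy wait table: handle -> tids currently waiting on it (hypothetical).
#[derive(Default)]
pub struct WaitTable {
    waiters: HashMap<u32, Vec<u32>>,
}

impl WaitTable {
    /// Register `tid` as a waiter on `handle`.
    pub fn wait(&mut self, handle: u32, tid: u32) {
        self.waiters.entry(handle).or_default().push(tid);
    }

    /// Timeout wake: the thread leaves every wait list it was on.
    pub fn timeout(&mut self, tid: u32) {
        for tids in self.waiters.values_mut() {
            tids.retain(|t| *t != tid);
        }
    }

    /// Signal a handle; returns how many waiters were matched and woken.
    pub fn signal(&mut self, handle: u32) -> usize {
        self.waiters.remove(&handle).map_or(0, |tids| tids.len())
    }
}
```

With tid=5 waiting on 0x1040 and 0x1044, a signal on either handle matches one waiter if it arrives before the deadline, and zero waiters if `timeout(5)` has already run. The second case is the one that accounts for the handful of missing signal/wake pairs.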
## Cross-engine context
Per 2.AD's finding 3, ours tid=14 still exits at 21.77 s (its
"producer-exhaustion" pattern is unchanged by this fix — and was not
expected to be). The deadline-fire fix unblocks tid=5 around the
moment the 42.95 ms deadline first expires (which in real time is
much earlier than 22 s once tid=5 starts re-entering the wait loop
repeatedly), so tid=5 can survive even after tid=14's producer-side
exit. This is exactly the predicted outcome — see 2.AD's "Finding 2"
deadline-fire-path claim.
## Third-order observations (no claims, just data)
- **tid=17 events dropped from 5,471,318 to a much smaller count** (exact
figure not tabulated; it's no longer the dominant producer). With tid=5 now
running, the rotation cursor + age-priority interaction (2.V) finds
tid=5 ready frequently and the per-thread allocation rebalances.
- **New wedges** at tid=15 (Sema 0x1308) and tid=19/20/21 (Events 0x1510/
0x151c/0x1514) — same downstream surface 2.V flagged for 2.W. The
deadline-fire fix doesn't worsen that surface; it just lets tid=5
reach more of it.
- **Run termination**: budget cap (500 M instructions), exit code 0,
no `unblock_on_deadlock` fire, no crash, no fault.
## Tripstone audit
- **#28 (cross-engine tid stability)**: All tid claims are ours-side
within this trajectory. No cross-engine tid mapping claimed.
- **#39 (composite progression IS progression)**: Honored. Cascade
framing: tid=5 unwedged + 3.5× events + 3× wallclock. VdSwap is
unchanged (2 → 2) — explicitly *not* claimed as progression. The
primary gate is direct state-change on tid=5, not a progression
proxy.
- **#40 (single-keystone framing)**: Care taken. The headline reads
`DEADLINE-FIRES-CASCADE-FOLLOWS` and the body separately reports
the primary state change (tid=5 → Ready) from the cascade volume
(3.5× events). Open follow-ups (2.AE tid=14 first-divergence, 2.AH
tid=1 XNotify, 2.AI XAudio) explicitly retained.
- **#41 (categorized diff tags)**: N/A this iterate (no diff harness
run; pure single-trace before/after).
- **#42 (Phase-A blind to blocked-forever)**: Used `exit-thread-state.json`
to characterize the new wedge set, which is exactly the use 2.M scoped it
for. tid=5 → Ready was visible only because of that dump.
- **#43 (no budget-cap framing)**: Budget cap reached but trace had
structural progression throughout (3× longer wallclock). Cascade
observation is robust at this budget.
- **#44 refined (rate+shape comparison)**: Not directly applicable;
this is an engine-bug fix, not a cross-engine wedge analysis. The "gate"
is the deadline-fire mechanism, not a wait-rate comparison.
## Confidence
- **HIGH** that the patch is correct and minimal: 18 LOC, 0 test
regressions, determinism preserved bit-for-bit on event count and
on slim-event-content spot check.
- **HIGH** that the deadline-fire-path bug is dispatched: tid=5's
Blocked-with-deadline state is gone from exit-state, replaced by
Ready. The 2.AD mechanism is correct end-to-end.
- **HIGH** that the cascade is genuine (3.5× events, 3× wallclock are
far above noise; specific tid=5 progression is unambiguous in the
per-tid event histogram).
- **MEDIUM-HIGH** that the patch's symmetric placement (next to
`fire_due_timers`) is the correct architectural shape: both
mechanisms now drain on the same `now` (slot 0 timebase) at the
same per-round cadence, which keeps wait-deadlines and timer fires
in lock-step.
- **MEDIUM** that gameplay is imminent. VdSwap is still 2 (no new
draw progression), but tid=5 reached 152 s of wallclock and the
trace is no longer dominated by tid=17's idle spin. Several more
cascade iterations likely needed.
- **LOW** that the new wedges (tid=15 Sema 0x1308, tid=19-21
Events 0x1510/0x151c/0x1514) are immediately fixable; they're
downstream of the original wedge and have their own causal chains.
## Next-iterate recommendation
The natural next step from 2.AD's "4 distinct root causes" list:
1. **2.AE (tid=14 first-divergence diff)** — still highest priority.
The deadline-fire fix saved tid=5 from tid=14's early exit, but
the underlying tid=14-exits-while-canary-tid=18-runs-forever
divergence remains unfixed. Approx **0 LOC**, pure trace mining.
2. **2.AG (`do_wait_multiple` `wait.begin` symmetry)**: the
observability gap deferred from this iterate. tid=5's 384
`NtWaitForMultipleObjectsEx` calls still don't emit `wait.begin`,
so future deadline-fire diagnoses are still blind. Approx
**~10 LOC**, exports.rs:5583-5655.
3. **2.AI (XAudio stub fix)** — fully independent blocker on tid=11.
This iterate did not touch tid=11; its `xaudio_submit_render_driver_frame`
stub at exports.rs:4591-4598 is still a no-op. Approx
**5-150 LOC**, exports.rs.
4. **2.AH (tid=1 XNotify recon)** — also independent, the main-thread
1.05 M-iter wedge. This iterate did not touch it. Approx **0-10 LOC**.
I recommend **2.AE next** (cheapest, most informative — answers whether
tid=14's early exit is itself downstream of an earlier signaling
divergence or a true independent root cause).
## Artifacts
Under `xenia-rs/audit-runs/iterate-2AF-deadline-fire-fix/`:
- `ours-cold.jsonl` (10.98 GB, 45,206,378 events) — primary trace
- `ours-cold.stdout.log` (empty — quiet mode)
- `ours-cold.stderr.log` (single exit-thread-state notice)
- `exit-thread-state.json` (14.0 KB; 21 alive + 15 wedge entries)
- `ours-cold-run2.jsonl` (10.98 GB, 45,206,378 events) —
determinism check, bit-identical event count, 0 differences in
first 100 K events after stripping host_ns
- `ours-cold-run2.{stdout,stderr}.log`
- `writer-report.md` (this file)
xenia-canary UNCHANGED.
Engine state: head + 2.AF patch (`+18` in `xenia-app/src/main.rs`).
Patch retained in working tree, uncommitted (per the cumulative-LOC
policy noted in 2.W's report).