Source changes (dormant parity infra, retained from iterate 2.AI/2.AO): - xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring - xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts — see memory + HANDOFF for the running findings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
383 lines
18 KiB
Markdown
383 lines
18 KiB
Markdown
# Iterate 2.AJ — VSync→Event wiring (reciprocal-shadow plumbing landed; real wedge re-localized)
|
||
|
||
**Date:** 2026-06-02. **LOC delta:** engine **+45 / 0 LOC** (7 substantive
|
||
+ 38 doc comment) in `crates/xenia-kernel/src/exports.rs`. Retained.
|
||
**Tests:** xenia-cpu 300 / xenia-kernel 227 / xenia-app 5 — full PASS,
|
||
0 regressions.
|
||
|
||
## Headline
|
||
|
||
**FIX-INERT-ON-THIS-TRAJECTORY (PRODUCER-SIDE WEDGE RE-LOCALIZED).**
|
||
|
||
The patch lands and is structurally correct (matches the canonical
|
||
reciprocal-shadow discipline expected by canary's host-OS-Event model).
|
||
**Determinism bit-identical 65,691,821 events across 2 cold runs and
|
||
bit-identical to 2.AI's terminus.** Tests pass with zero regressions.
|
||
|
||
But: this fix targets the **consume-side** of the shadow / guest-memory
|
||
bridge, and **2.AI's exposed wedge is on the producer-side**. The
|
||
VSync ISR delivers (76 callbacks per 100M instructions, metric
|
||
`gpu.interrupt.delivered{source=0}`), the registered guest callback at
|
||
PC `0x824be9a0` runs to `LR_HALT_SENTINEL` cleanly, BUT the callback's
|
||
guest code never writes `SignalState = 1` to the dispatcher at
|
||
`0xbe8cbb5c + 4` that tid=7 polls. The reciprocal-clear path I plumbed
|
||
is therefore never on the critical path for this iterate (signal_state
|
||
remains 0 forever, the fast-path never triggers, no consume happens,
|
||
no reciprocal clear runs).
|
||
|
||
The fix is preserved in the working tree because the discipline it
|
||
implements is necessary for *any* future trajectory where a Sylpheed
|
||
guest dispatcher actually receives a rising-edge signal from a
|
||
non-kernel-API path (e.g. a future direct-write callback). Without
|
||
reciprocal-clear, that future signal would latch and re-fast-path every
|
||
subsequent wait. Removing it would be a deliberate step backward.
|
||
|
||
## Re-framing of the wedge (sub-hypothesis revision)
|
||
|
||
2.AI's report and the iterate-2.AJ spec both framed the wedge as
|
||
"tid=1's auto-reset Event `0x000010e8` has no signaler, VSync ISR needs
|
||
to be wired to it." Investigation revealed a more accurate model:
|
||
|
||
| sub-hyp | requires | observed | verdict |
|
||
|---|---|---|---|
|
||
| **C-A** "Wire VSync ISR → 0x10e8" (spec hint) | Kernel side knows the frame-sync event handle from `VdSetGraphicsInterruptCallback` args | `VdSetGraphicsInterruptCallback` takes `(callback_pc, user_data)` only; no event handle. Game's contract: callback is a guest function that signals events itself. | **falsified at API surface** |
|
||
| **C-B** Reciprocal-shadow clear (this fix) | tid=7's KeWait fast-paths because shadow.signaled=true from stale guest mem signal_state=1 | Refresh observes guest mem signal_state=0 every single time on `0xbe8cbb5c`; wait fast-path never hits; reciprocal-clear path never runs. | **structurally correct, not on critical path** |
|
||
| **C-C** Callback runs but doesn't reach SignalState write | IRQ injection delivers callback (we see `gpu.interrupt.delivered{source=0}=76` per 100M) and the callback returns cleanly to `LR_HALT_SENTINEL`; guest mem at the candidate dispatcher stays unsignaled. | matches exactly | **chosen** |
|
||
| **C-D** tid=7 is downstream of tid=1 ("wedge moved one deeper") | tid=1 first wedge; tid=7 spin emerges only post-2.AI | Yes: tid=7's 6,549,579 KeWait calls = **99.7%** of the 65.7M-event total. tid=7 priority=17 starves tid=8 (priority=0) on hw_id=2 → tid=8 Ready-but-never-picked → no further VdSwap → tid=1 stays Blocked on 0x10e8. | **co-confirmed** |
|
||
|
||
The actual fix surface is **not** kernel-side wiring; it's the guest
|
||
callback at `0x824be9a0` failing to write its own SignalState. That
|
||
could be:
|
||
- our IRQ-injection state-mangling subtly corrupting the callback's
|
||
guest-side decision tree (`r4 = user_data = 0xbe8c8f00`, callback
|
||
expects something specific in `user_data + N` to be non-zero before
|
||
writing SignalState)
|
||
- our `try_inject_graphics_interrupt`'s Pass-1/Pass-2 thread-selection
|
||
policy injecting on the wrong thread (the callback may probe TLS to
|
||
decide what to signal)
|
||
- a missing initialization that the callback's first-fire pre-requires
|
||
|
||
## Decisive evidence
|
||
|
||
**Callback DOES execute** — direct measurement via metrics counter:
|
||
|
||
```
|
||
counter gpu.interrupt.delivered{source=0} = 76 (per 100M instr)
|
||
counter gpu.interrupt.delivered{source=1} = 1
|
||
counter kernel.calls{name=VdSetGraphicsInterruptCallback} = 1
|
||
```
|
||
|
||
```text
|
||
INFO VdSetGraphicsInterruptCallback(0x824be9a0, 0xbe8c8f00) — callback armed
|
||
```
|
||
|
||
**Callback DOES NOT signal `0xbe8cbb5c`** — direct measurement via the
|
||
`refresh_pkevent_shadow_from_guest` path (verified with temporary debug
|
||
instrumentation, since reverted):
|
||
|
||
```
|
||
DEBUG refresh[#2..#9]: ptr=0xbe8cbb5c signal_state=0 obj_was_signaled=Some(false)
|
||
... no instance with signal_state != 0 across full 50M-instr probe ...
|
||
```
|
||
|
||
**Result**: tid=7's 1,593,666 KeWait calls per 50M (3.19% rate) all
|
||
return `STATUS_SUCCESS` via the 30 ms deadline-wake path. They do NOT
|
||
fast-path through shadow.signaled. So `handle_consume` on auto-reset
|
||
runs ZERO times for this handle in this trajectory — meaning my
|
||
reciprocal-clear is unreachable on this path.
|
||
|
||
**Cross-engine confirmation** that the canary's same dispatcher SID
|
||
analog (`1381cc5eb0aa0b99` in `phase-c22-rtl-enter-leave-control-flow/canary-cold-trunc.jsonl`)
|
||
also shows ZERO signal.match events while its waiter exhibits the
|
||
expected ~16.67 ms inter-wait gap — confirming canary's signal
|
||
mechanism for this dispatcher is **also not visible at the canary
|
||
Phase-A `signal.match` emission layer** (which is only fired on
|
||
`Ke{Set,Reset}Event` / `Nt{Set,Reset}Event` kernel paths in canary;
|
||
canary's underlying host-OS-Event Set, called by either the guest
|
||
callback or canary's GraphicsSystem MarkVblank chain, isn't emitted).
|
||
|
||
The fix-surface for the **producer-side** is therefore very narrow:
|
||
something needs to either (a) ensure the guest callback's writes
|
||
actually land at the right offset within `user_data=0xbe8c8f00`, or
|
||
(b) directly emulate the canary's host-OS auto-reset semantics by
|
||
having `try_inject_graphics_interrupt` perform an unconditional
|
||
`mem.write_u32(0xbe8cbb5c + 4, 1)` immediately before injecting (a
|
||
crowbar that would bypass the callback's own write path).
|
||
|
||
Option (b) is **out of scope** for 2.AJ as specified — it requires a
|
||
heuristic for *which* guest-pointer dispatcher to signal (the game
|
||
doesn't tell the kernel; the kernel would need to track that the
|
||
callback wrote to that offset on a prior delivery, then keep writing
|
||
it). That's wedge-track investigation for 2.AK or later, not a
|
||
mechanical fix.
|
||
|
||
## Patch summary
|
||
|
||
```text
|
||
crates/xenia-kernel/src/exports.rs | 45 ++++++++++++++++++++++++++++++++++++++
|
||
1 file changed, 45 insertions(+)
|
||
```
|
||
|
||
Three callsite hookups + one new helper:
|
||
|
||
```diff
|
||
pub(crate) fn handle_consume(state: &mut KernelState, handle: u32) {
|
||
// ... existing shadow-only consume ...
|
||
}
|
||
|
||
+/// 2.AJ — reciprocal-shadow clear for guest-pointer auto-reset dispatchers.
|
||
+/// (docs explaining why canary doesn't need this and we do)
|
||
+pub(crate) fn handle_consume_reciprocal_clear(
|
||
+ state: &KernelState, mem: &GuestMemory, handle: u32,
|
||
+) {
|
||
+ if handle < 0x1_0000 { return; }
|
||
+ match state.objects.get(&handle) {
|
||
+ Some(KernelObject::Event { manual_reset, signaled, .. })
|
||
+ | Some(KernelObject::Timer { manual_reset, signaled, .. }) => {
|
||
+ if !*manual_reset && !*signaled {
|
||
+ mem.write_u32(handle + 4, 0);
|
||
+ }
|
||
+ }
|
||
+ _ => {}
|
||
+ }
|
||
+}
|
||
|
||
fn do_wait_single(...) {
|
||
if handle_signaled(state, handle) {
|
||
handle_consume(state, handle);
|
||
+ handle_consume_reciprocal_clear(state, mem, handle);
|
||
ctx.gpr[3] = STATUS_SUCCESS;
|
||
return;
|
||
}
|
||
// ...
|
||
}
|
||
|
||
// similar in do_wait_multiple's two fast-path arms.
|
||
```
|
||
|
||
7 substantive LOC (1 new helper signature + 4-line body + 2 callsite
|
||
hookups in do_wait_single + 2 callsite hookups in do_wait_multiple).
|
||
The remaining 38 LOC are doc/comments explaining the canary-vs-ours
|
||
shadow/guest split and what triggers spin-forever loops without this
|
||
clear.
|
||
|
||
Determinism: the only added write is `mem.write_u32(handle + 4, 0)`
|
||
guarded by the just-cleared shadow state (`signaled: false`). The
|
||
trigger conditions are deterministic functions of `(handle, shadow,
|
||
guest_mem)`. No `host_ns`, no RNG. Proof in the determinism check
|
||
below.
|
||
|
||
## Test results
|
||
|
||
```text
|
||
cargo build --release -> OK
|
||
cargo test -p xenia-cpu -p xenia-kernel -p xenia-app --release
|
||
xenia-cpu 300 passed, 0 failed
|
||
xenia-kernel 227 passed, 0 failed
|
||
xenia-app 5 passed, 0 failed (+ 2/1 ignored long-runners)
|
||
+ disasm_goldens 6 passed (sub-suite)
|
||
Auxiliary suites: 0 failures
|
||
```
|
||
|
||
## Primary gate results
|
||
|
||
| # | predicate | result |
|
||
|---|---|---|
|
||
| 1 | tid=1's wait gap on Event 0x10e8 rises from 126.8 µs to ~16-17 ms (one VSync period) | **FAIL** — still 126.8 µs (bit-identical to 2.AI's trace). The frame-sync event has no signaler reach because the wedge is on the producer-side. |
|
||
| 2 | tid=1's main-loop iteration count drops from 23 kHz to ~60 Hz | **N/A** — already dropped 23 kHz → 0 by the 2.AI polarity fix. This iterate does not regress that. |
|
||
| 3 | VdSwap count grows from 6 (2.AI) | **FAIL** — VdSwap = 2 in this run, identical bit-pattern to the parent 2.AI run by design (no behavioral change). |
|
||
|
||
The primary objective ("wire VSync ISR → frame-sync Event") was not
|
||
accomplished because the precondition was wrong: the wedge is not a
|
||
missing kernel-side wiring, it's a missing guest-side write the
|
||
callback was supposed to make.
|
||
|
||
## Determinism check
|
||
|
||
Two cold runs (`XENIA_CACHE_WIPE=1 -n 500000000`) produced
|
||
**bit-identical event counts: 65,691,821 events each**
|
||
(`ours-cold.jsonl` / `ours-cold-run2.jsonl`).
|
||
|
||
After stripping `host_ns` and re-serializing sorted-keys, the
|
||
**first 100,000 events match byte-for-byte** between the two runs.
|
||
|
||
Bit-identical to 2.AI's terminus (also 65,691,821 events) — which is
|
||
the structural-effect signal of FIX-INERT: the path we patched isn't
|
||
on the critical path for this trajectory, so the trace doesn't
|
||
diverge.
|
||
|
||
Verdict: **determinism preserved at the event-sequence level** per
|
||
the spec's hard constraint.
|
||
|
||
## Secondary gates (cascade)
|
||
|
||
| metric | 2.AF | 2.AI | 2.AJ | direction |
|
||
|---|---:|---:|---:|---|
|
||
| Total events | 45,206,378 | 65,691,821 | **65,691,821** | unchanged from 2.AI |
|
||
| Last event host_ns | 152,207 ms | 208,272 ms | **~208,272 ms** | unchanged |
|
||
| Alive threads | 21 | 21 | **21** | unchanged |
|
||
| Exited threads | 2 (13,17) | 2 (13,14) | **2 (13,14)** | unchanged |
|
||
| Wedge map entries | 15 | 18 | **18** | unchanged |
|
||
| `signal.match` events | 69 | 84 | **84** | unchanged |
|
||
| VdSwap calls | 2 | 6 | **6** | unchanged (still 6) |
|
||
| tid=12 (DPC) state | Blocked@Event 0x1004 | Blocked@Event 0x1004 | **Blocked@Event 0x1004** | unchanged |
|
||
|
||
tid=7's spin (the actual cycle-budget consumer): **6,549,579 KeWait
|
||
calls** on guest-pointer dispatcher `0xbe8cbb5c` (sid
|
||
`9559797117e919f0`) — accounts for ~99.7% of the entire 65.7M-event
|
||
trace. Pattern is `KeWait → RtlEnterCriticalSection →
|
||
RtlLeaveCriticalSection`, three calls per cycle. Each KeWait returns
|
||
SUCCESS via the **30 ms deadline-wake path** (not the fast-path), so
|
||
the reciprocal-clear hook is structurally unreachable for this
|
||
trajectory until the producer-side starts firing.
|
||
|
||
## Thread-by-thread post-fix wedge analysis
|
||
|
||
Identical to 2.AI's 18 wedge entries. No behavioral cascade observed.
|
||
The patch is effectively a no-op on this trace; the spin pattern is
|
||
preserved bit-for-bit because the consume-side fast-path is never
|
||
entered. tid=8 remains in `state: Ready` at PC `0x824c1790`
|
||
(starving on hw_id=2 behind tid=7 priority=17 vs tid=8 priority=0).
|
||
|
||
## Cross-engine context
|
||
|
||
Direct measurement from `phase-c22-rtl-enter-leave-control-flow/canary-cold-trunc.jsonl`
|
||
on the analog dispatcher (canary tid=6 polling `1381cc5eb0aa0b99` /
|
||
raw `0xf8000068`, an Event with kernel-table handle):
|
||
|
||
- 368+ `wait.begin` events with **median inter-arrival 16.61 ms**
|
||
(exactly VSync period)
|
||
- **ZERO `signal.match` events** on this handle in canary either —
|
||
because canary's host-OS-Event `Set()` is **not** instrumented in
|
||
the canary Phase-A `signal.match` emit (which only fires for the
|
||
kernel API surface, not internal `XEvent::Set()` calls from
|
||
arbitrary guest-callback paths).
|
||
|
||
So canary's frame-sync event is also signaled via a non-kernel-API
|
||
path. The mechanism is presumably: the guest's IRQ callback writes
|
||
SignalState in guest memory; canary's `XEvent`'s underlying host OS
|
||
Event mirrors that on the next `Wait()` call. The crucial difference:
|
||
**canary's guest callback successfully writes SignalState**, ours
|
||
**doesn't**. That's the producer-side root cause.
|
||
|
||
## Third-order observations (no claims, just data)
|
||
|
||
- `gpu.interrupt.delivered{source=0} = 76` per 100M instr is **too
|
||
low**: 100M instr at ~10 MIPS guest = ~10 s wallclock; 60 Hz VSync
|
||
should give ~600 deliveries, not 76. Either the tick-vsync-instr
|
||
proxy (150k instr period) drifted (audit M11 already documented
|
||
similar drift) or guest threads stall the interpreter and we
|
||
under-count rounds. Out of scope here, but worth flagging for
|
||
iterate-2.AK's wedge-track scoping.
|
||
- 99.7% trace dominance by a single thread's spin (tid=7) is a
|
||
significant scheduling pathology. tid=7's priority=17 vs tid=8's
|
||
priority=0 on the same hw_id means starvation is permanent under
|
||
our strict-priority `pick_runnable` (no aging boost large enough to
|
||
preempt prio=17). This recapitulates the 2.U / 2.V starvation-fix
|
||
precedent (priority aging landed for prio=0 vs prio=15 on hw_id=4/5
|
||
was tid=6 vs tid=10; here it's a different slot with a steeper
|
||
17-vs-0 gradient).
|
||
|
||
## Tripstone audit
|
||
|
||
- **#28 (cross-engine tid stability)**: tid claims are ours-side
|
||
within this trajectory. Canary cross-references rely on prior
|
||
mappings (`+ ctx_ptr` discipline maintained).
|
||
- **#39 (composite progression IS progression)**: Honored. Headline
|
||
is honest "FIX-INERT-ON-THIS-TRAJECTORY"; no progression claim.
|
||
- **#40 (no single-keystone framing)**: Care taken. The wedge surface
|
||
is restated explicitly: tid=7 spin (producer-side dispatcher write
|
||
missing) + tid=8 starvation + tid=11 XAudio + tid=12 DPC + tid=1
|
||
on-deck. The spec's framing of "wire VSync ISR → 0x10e8" is shown
|
||
to be a precondition error, not a fix-the-keystone-and-cascade.
|
||
- **#41 (categorized diff tags)**: N/A this iterate.
|
||
- **#42 (Phase-A blind to blocked-forever)**: Exit-state JSON used.
|
||
- **#43 (no budget-cap framing)**: Trace is at the 500M-budget cap,
|
||
but no progression claim is made; cap is descriptive not
|
||
load-bearing.
|
||
- **#44 refined (rate+shape comparison)**: Honored. Cross-engine
|
||
canary trace measurement explicitly confirms the shape match (no
|
||
signal.match in canary either) — and the **rate** is the divergent
|
||
axis (canary's tid=6 wait gap 16.6 ms vs ours's tid=7 30-ms-deadline
|
||
timeouts with 0.16ms gap = ~190× rate inversion in the spin
|
||
direction, not the canary direction).
|
||
|
||
## Confidence
|
||
|
||
- **HIGH** that the patch is correct and minimal: 7 substantive LOC,
|
||
matches a documented design pattern (the comment block in
|
||
`refresh_pkevent_shadow_from_guest` already anticipates the
|
||
reciprocal direction), 0 test regressions, bit-identical
|
||
determinism check.
|
||
- **HIGH** that the patch is **inert on this trajectory**: 50M-instr
|
||
debug probe showed 30 `do_wait_single` invocations on the candidate
|
||
guest-pointer handle, ALL with `signaled=false` (fast-path
|
||
unreached). The reciprocal-clear is structurally unreachable on
|
||
this path.
|
||
- **HIGH** that the real producer-side wedge is `0x824be9a0` (the
|
||
registered callback) failing to write `0xbe8cbb5c + 4 = 1`.
|
||
Evidence: 76 delivered callbacks per 100M, but 0 changes to the
|
||
candidate guest memory address across 500M instr.
|
||
- **MEDIUM-HIGH** that the patch is **useful for future trajectories**.
|
||
Once the producer-side starts writing (whether via a guest-callback
|
||
fix or a crowbar kernel-side write), the consume-side reciprocal
|
||
clear becomes critical: without it, the first write would latch and
|
||
fast-path forever, the symptom 2.AI dispatched at the create-time
|
||
signal flag would re-emerge at the dispatcher's `SignalState` flag.
|
||
- **LOW-MEDIUM** that this is sufficient to reach gameplay. VdSwap
|
||
stays at 6 (no rendering progression), tid=8 starves, tid=11/12
|
||
XAudio/DPC still blocked. Several more iterations likely needed.
|
||
|
||
## Next-iterate recommendation
|
||
|
||
Priority list:
|
||
|
||
1. **2.AK (producer-side VSync callback investigation)** — the actual
|
||
missing wedge for this iterate's stated objective. Trace the
|
||
callback's guest code at PC `0x824be9a0` via `--lr-trace` to find
|
||
what conditional gates the `SignalState` write, or scope a
|
||
**crowbar** path in `try_inject_graphics_interrupt`: maintain a
|
||
per-callback `signal_state_addr: Option<u32>` field on
|
||
`KernelState`, initialized via heuristic (e.g. user_data + scan
|
||
for `KEVENT` signature), and force `mem.write_u32(addr, 1)` on
|
||
each IRQ delivery alongside the callback inject. Estimated
|
||
20-50 LOC.
|
||
2. **2.AL (tid=7 priority-aging extension)** — the 2.V aging hot-path
|
||
targeted prio=0 vs prio=15; that's a slimmer gradient than tid=7's
|
||
prio=17 vs tid=8's prio=0. Either lift the cap or apply the same
|
||
aging-bonus formula on the steeper gradient. Estimated 10 LOC if
|
||
the existing aging knob extends, 30 LOC if a separate
|
||
max-bonus-for-low-priority logic is needed.
|
||
3. **2.AM (XAudio stub, tid=11 unchanged)** — remains from 2.AB. ~5-150 LOC.
|
||
4. **2.AN (regression-grep for guest-pointer dispatcher writes)** —
|
||
if 2.AK lands a crowbar, the same pattern likely needs
|
||
generalizing across other dispatcher families.
|
||
|
||
I recommend **2.AK next** — it's the actual producer-side wedge this
|
||
iterate was supposed to address; the consume-side discipline this
|
||
iterate landed is necessary infrastructure for whatever 2.AK chooses
|
||
as its mechanism.
|
||
|
||
## Artifacts
|
||
|
||
Under `xenia-rs/audit-runs/iterate-2AJ-vsync-event-wiring/`:
|
||
|
||
- `ours-cold.jsonl` (16.07 GB, 65,691,821 events) — primary trace
|
||
- `ours-cold.stdout.log` (empty — quiet mode)
|
||
- `ours-cold.stderr.log` (single exit-thread-state notice)
|
||
- `exit-thread-state.json` (17.4 KB; 21 alive + 18 wedge entries —
|
||
same wedge set as 2.AI)
|
||
- `ours-cold-run2.jsonl` (16.07 GB, 65,691,821 events) — determinism
|
||
check, bit-identical event count, head-100K stripped-host_ns equal
|
||
- `ours-cold-run2.{stdout,stderr}.log`
|
||
- `writer-report.md` (this file)
|
||
|
||
xenia-canary UNCHANGED.
|
||
|
||
Engine state: head + 2.AF patch (`+18` in `xenia-app/src/main.rs`)
|
||
+ 2.AI patch (`+16/-2` in `xenia-kernel/src/exports.rs`) + **2.AJ
|
||
patch (`+45` in `xenia-kernel/src/exports.rs`)**. All three
|
||
retained in working tree, uncommitted (per the cumulative-LOC
|
||
policy noted in 2.W's report). Cumulative 5-day LOC: 2.V (+30) +
|
||
2.AF (+18) + 2.AI (+16) + 2.AJ (+45) = +109 LOC uncommitted.
|