xenia-rs/audit-runs/scheduler-determinism-plan/approach-matrix.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Approach Tradeoff Matrix

Each approach is evaluated against the same criteria. The recommended approach is H' (manifest replay, scoped to RtlEnterCS), gated on the outcome of the Stage 0 spike.

Criteria

  • Eng LOC: estimated engine-source modification (ours + canary).
  • Tool LOC: estimated diff-tool / python tooling.
  • Test LOC: estimated tests.
  • Unblocks 104,607?: probability of advancing the main matched-prefix past the current cap.
  • Preserves ours digest: whether e1dfcb1559f987b35012a7f2dc6d93f5 (Phase A) and ea8d160e… (Phase B) remain unchanged in the default mode.
  • Preserves canary default: whether canary's default-mode (no new cvar) cold-run behavior is byte-identical.
  • Wine-constraint: whether the approach requires changing Wine itself (always: NO — out of scope).
  • Reading-error risk: which class of reading error this approach risks crossing.
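The "matched-prefix" metric behind the 104,607 criterion can be sketched as below. This is a minimal illustration, not the actual diff tool: the event tuples and the `matched_prefix_len` helper are hypothetical, and the real harness may normalize or absorb events before comparing.

```python
def matched_prefix_len(ours, canary):
    """Count the leading events that are identical in both streams."""
    n = 0
    for a, b in zip(ours, canary):
        if a != b:
            break  # first divergence: the prefix ends here
        n += 1
    return n

# Hypothetical event shape: (export name, guest thread id).
ours = [("RtlEnterCS", 6), ("RtlLeaveCS", 6), ("KeWait", 7)]
canary = [("RtlEnterCS", 6), ("RtlLeaveCS", 6), ("RtlEnterCS", 12)]

prefix = matched_prefix_len(ours, canary)
print(prefix)
```

Under this reading, an approach "unblocks 104,607" when the same comparison on the real streams returns a value above that cap.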

Matrix

| Approach | Eng LOC | Tool LOC | Test LOC | Unblocks 104,607? | Preserves ours digest | Preserves canary default | Reading-error risk | Verdict |
|---|---|---|---|---|---|---|---|---|
| A: cycle-counted clock in canary | ~200 (base/clock.cc) | 0 | ~50 | NO (Sylpheed: 2 KeQuerySystemTime calls) | yes | yes (cvar-gated) | #19 (wrong-target) | WRONG TARGET |
| B: single-thread cooperative canary | ~2000-3000 (xthread.cc, threading*.cc, processor.cc) | ~50 | ~300 | YES | yes | NO: fundamentally changes scheduling | #28 (rewrite-without-verify) | OVERSCOPED |
| C/H: manifest replay, broad (CS + wait) | ~600-700 | ~200 | ~200 | YES (with risk in wait-side semantics) | yes (default-off) | yes (cvar-off) | #23 (synthetic events) | 2nd choice |
| H': manifest replay, scoped to RtlEnterCS | ~450-500 | ~180 | ~150 | YES | yes (default-off) | yes (cvar-off) | #23 (bounded) | RECOMMENDED |
| D: diff-harness absorption extension | 0 | ~150 (diff_events.py) | ~50 | PARTIAL (10-100 idx) | yes | yes | #23 (FOLDS REAL GUEST CODE) | fallback only |
| E: A+D hybrid | ~200 | ~150 | ~100 | LOW (clock isn't the lever; D hits #23 wall) | yes | yes | #19 + #23 | band-aid |
| F: make ours preemptive | ~500 (scheduler.rs) | 0 | ~100 | UNKNOWN (no replay anchor) | NO: destabilizes cold digest | n/a | #28 (loses 23 phases of stabilization) | WRONG DIRECTION |
| Stage 0 spike: cycle-quantum preemption | ~80 (scheduler.rs) | 0 | ~40 | TBD by spike | TBD (default Fixed unchanged) | n/a | #19 (premature optimization if not validated) | GATE |
| spin-then-wait fix in ours | ~50 (exports.rs:2886) | 0 | ~30 | NO (wrong direction: adding spin makes contention less likely on ours's side) | yes | n/a | #28 (verified: would not help 104,607) | document, defer |

Detailed reasoning

Why H' over C/H (broad)

The broad variant (C/H) covers both RtlEnterCriticalSection and KeWaitForSingleObject. Phase 1 evidence shows:

  • 19,494 RtlEnter calls in Sylpheed's boot
  • 34 wait.begin events total

The CS surface is ~570× larger than the wait surface, so adding wait-side replay buys little. More importantly, wait-side replay has harder semantics: when canary's KeWaitForSingleObject fires on a TIMER (with a host-wallclock deadline), ours cannot replay it because ours has no wallclock to match against.
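The size comparison follows directly from the two Phase 1 counts. A quick check of the arithmetic, using only the numbers quoted above:

```python
rtl_enter_calls = 19_494   # RtlEnter calls in Sylpheed's boot (Phase 1)
wait_begin_events = 34     # wait.begin events total (Phase 1)

ratio = rtl_enter_calls / wait_begin_events
wait_share = wait_begin_events / (rtl_enter_calls + wait_begin_events)

print(f"CS surface is {ratio:.0f}x the wait surface")
print(f"wait events are {wait_share:.2%} of the combined surface")
```

The exact ratio is about 573, which the text rounds to ~570×.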

H' defers wait-side replay until evidence shows it's needed (backstop in plan.md §Backstop).

Why H' over B (single-thread canary)

B fundamentally changes the oracle. The oracle's stability across phases is a foundational invariant; modifying its scheduling layer introduces game-compatibility risk that we cannot fully test (only Sylpheed is in scope, but canary supports many titles). Its engine LOC is also roughly 4-6× that of H' (~2000-3000 vs ~450-500).

H' leaves the oracle's behavior unchanged in the default case. The contention emitter (Stage 1) is a passive observer; the manifest captures one canary cold run as canonical and ours replays it. Canary is not asked to be deterministic — it's asked to report its non-determinism.
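The record-then-replay split can be sketched as follows. Everything here is an assumption made for illustration: the JSONL record shape (`tid`, `enter_seq`), the `load_manifest` helper and the `ReplayOracle` class are hypothetical and do not describe the actual manifest format or the Stage 1 emitter.

```python
import json
from collections import defaultdict

def load_manifest(text):
    """Parse JSONL records such as {"tid": 6, "enter_seq": 3} into a set."""
    contended = set()
    for line in text.splitlines():
        if line.strip():
            rec = json.loads(line)
            contended.add((rec["tid"], rec["enter_seq"]))
    return contended

class ReplayOracle:
    """Answers: should this RtlEnterCS call be forced onto the contended path?"""

    def __init__(self, contended=None):
        self.contended = contended       # None models the default (no manifest)
        self.seq = defaultdict(int)      # per-thread count of Enter calls so far

    def should_force_contention(self, tid):
        n = self.seq[tid]
        self.seq[tid] += 1
        if self.contended is None:
            return False                 # default mode never forces anything
        return (tid, n) in self.contended

# One canary cold run reported contention on tid 6, second Enter call.
manifest = load_manifest('{"tid": 6, "enter_seq": 1}\n')
replay = ReplayOracle(manifest)
decisions = [replay.should_force_contention(6) for _ in range(3)]
print(decisions)
```

The point of the sketch is the direction of data flow: canary only reports where it contended, and the replaying side consults that report instead of making its own scheduling guess.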

Why H' over D (diff absorber extension)

The current C+21 absorber is already at the safe limit of reading-error #23. Extending the absorber to fold "post-wait nested Enter/Leave blocks" would hide REAL guest-code execution differences. The canary side's nested-Enter reads mutated memory and modifies state (lock_count, recursion_count) that affects subsequent events. Folding it at the diff layer means downstream divergences are misattributed.

D remains as a backstop (plan.md §Backstop item 2) for residual gaps post-Stage-3, with explicit reading-error annotation.

Why H' over F (make ours preemptive)

23 phases (C+1 through C+23) have stabilized ours's cold digest. Changing the default scheduler to preempt at fixed intervals would invalidate every prior baseline. Even if the new digest is stable, it severs continuity with the existing test infrastructure and audit-run archives.

H' preserves ours's default OrderMode::Fixed. The replay mode is opt-in via --scheduler-replay-manifest PATH. Default-mode digest is provably unchanged (Stage 3 validation #2).
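The opt-in gating can be modelled in a few lines. This is a Python stand-in for logic that would live in the Rust engine: `select_mode` is hypothetical, and `OrderMode.REPLAY` is an assumed name for the replay variant; only `OrderMode::Fixed` and the `--scheduler-replay-manifest PATH` flag come from the text above.

```python
import argparse
from enum import Enum

class OrderMode(Enum):
    FIXED = "fixed"    # existing default
    REPLAY = "replay"  # assumed name for the manifest-driven mode

def select_mode(argv):
    """Return (mode, manifest_path); Fixed unless the flag is present."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--scheduler-replay-manifest", metavar="PATH", default=None)
    args = parser.parse_args(argv)
    if args.scheduler_replay_manifest is None:
        return OrderMode.FIXED, None
    return OrderMode.REPLAY, args.scheduler_replay_manifest

print(select_mode([]))
print(select_mode(["--scheduler-replay-manifest", "manifest.jsonl"]))
```

Because the replay branch is reachable only when the flag is passed, a run without it takes the same path as before the change, which is the property the default-mode digest check is meant to confirm.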

Why Stage 0 first

Cost is 1 day, 80 LOC. If a tuned quantum advances the prefix past 104,607 with a stable digest, the manifest work (Stages 1-4, ~450-500 LOC, 3-5 sessions) is unnecessary. Even if Stage 0 doesn't fully unblock, the data informs the manifest design (e.g., "quantum=200 advances prefix by 800 events but stalls at 105,407" tells us the next divergence is a different class).
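A quantum sweep could be summarized along these lines. The sweep values and the `classify` labels are illustrative assumptions; the actual Stage 0 decision tree lives in plan.md and is not reproduced here.

```python
CAP = 104_607  # current matched-prefix cap

def classify(prefix_len, digest_stable, cap=CAP):
    """Summarize one Stage 0 run (labels are illustrative only)."""
    advance = prefix_len - cap
    if advance <= 0:
        return "no advance"
    if not digest_stable:
        return f"+{advance} events, digest unstable"
    return f"+{advance} events, stalls at {prefix_len:,}"

# Hypothetical results: quantum -> (matched prefix, digest stable?)
sweep = {100: (104_607, True), 200: (105_407, True), 400: (106_000, False)}
report = {q: classify(p, s) for q, (p, s) in sweep.items()}

for quantum, summary in report.items():
    print(f"quantum={quantum}: {summary}")
```

The quantum=200 row reproduces the example in the text: an 800-event advance from 104,607 lands at 105,407.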

Any plan that runs Stage 0 first dominates the same plan without it: the spike is cheap and its result is useful either way. Skipping it risks doing ~500 LOC of unnecessary work.

Why NOT spin-then-wait fix in ours

The 104,607 divergence is canary contending, ours NOT contending. Adding spin to ours would make ours's RtlEnterCS try harder to acquire without parking — which makes contention less likely on the ours side, the OPPOSITE of what we need. Documenting the spin asymmetry is valuable for future divergences in the opposite direction (where ours spuriously contends and canary doesn't), but it's not the lever for 104,607.
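The direction of the effect can be seen in a toy model. The assumption, stated loudly, is that a contender parks only when the lock holder needs longer than the contender's spin budget; the cycle counts are invented and nothing here reflects the real exports.rs logic.

```python
def parks(remaining_hold_cycles, spin_budget_cycles):
    """Toy rule: the contender parks only if the holder outlasts the spin."""
    return remaining_hold_cycles > spin_budget_cycles

# Invented remaining-hold times seen by four contending Enter calls.
holds = [5, 40, 120, 300]

def park_count(spin_budget_cycles):
    return sum(parks(h, spin_budget_cycles) for h in holds)

for spin in (0, 100, 500):
    print(f"spin={spin}: {park_count(spin)} of {len(holds)} calls park")
```

In this model a larger spin budget can only lower the number of parked (contended) calls, whereas closing the 104,607 gap requires ours to contend where it currently does not.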

Open tradeoffs (decisions deferred to user / Stage 0 outcome)

  • Stage 0 alone might suffice: if quantum=N produces a stable digest matching canary's behavior at 104,607, the plan collapses to a single 80-LOC change. Stage 0 decision tree is in plan.md.
  • Sister chain regression budget: -5 per sister. If exceeded post-Stage-3, scope manifest to tid=6 only initially, then iterate.
  • Wait-side replay (broad H): deferred unless sister chains (especially the tid=12→7 timeout class) need it. Backstop only.
  • Approach D extension as final band-aid: documented in backstop with explicit #23 annotation. Only land if Stages 0-4 leave residual divergence with no other path forward.