xenia-rs/audit-runs/scheduler-determinism-plan/plan.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Plan: Unblock the 104,607 Scheduler-Determinism Cap

Context

The Phase A matched-prefix between xenia-canary (oracle, C++) and xenia-rs ("ours", Rust) is structurally capped at 104,607 events on the main chain (canary tid=6 → ours tid=1). C+20 and C+22 escalated the divergence at idx 104,607 as class (A) scheduler-determinism — not a fixable bug in either engine, but a fundamental mismatch in scheduling philosophy.

The two engines are correct independently; their scheduling models cannot agree:

| dimension | canary (oracle) | ours |
|---|---|---|
| host-thread mapping | 1 host std::thread per XThread (xthread.cc:62/358/476) | single host thread, 6 cooperative HW slots (scheduler.rs:230-258) |
| who picks next runnable | host OS (Wine on Linux), non-deterministic | round_schedule over OrderMode::Fixed rotation cursor (scheduler.rs:710-740), deterministic |
| RtlEnterCriticalSection on contention | spins cs->header.absolute*256 times then xeKeWaitForSingleObject (xboxkrnl_rtl.cc:596-633) | parks immediately via BlockReason::CriticalSection (exports.rs:2927-2945), no spin |
| clock | wallclock-driven (Clock::QueryHostSystemTime, optional rdtsc) | fixed FILETIME 132_500_000_000_000_000 |
| determinism cvars | clock_no_scaling, clock_source_raw, ignore_thread_priorities, ignore_thread_affinities; none enable lockstep | already lockstep by default; XENIA_SCHED_ORDER=random opt-out |

The 104,607 divergence is the symptom: canary's tid=6 contends → blocks on shared dispatcher Event sid=75ae880ec432eb36 → another guest thread mutates protected state during the wait → post-acquire reads mutated value → nested-cleanup branch (E E L L). Ours's tid=1 runs monolithically, no other thread gets the CS first → fast-path acquire → reads pre-wait value → simple-release branch (E L NtClose). The C+18/C+21 floating absorbers already mask the observation-side jitter (the wait.begin event itself); the post-wait control-flow divergence is real guest code, not absorbable without crossing reading-error #23.
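The mechanism above can be reduced to a toy model. The sketch below is illustrative only: `post_acquire_path` and the `needs_cleanup` flag are hypothetical names, not guest or engine code. It shows why a single contended acquire is enough to change the call sequence that follows.

```python
def post_acquire_path(contended: bool) -> list:
    """Return the call sequence a thread emits after acquiring the CS.

    `needs_cleanup` is a hypothetical stand-in for the guest state that the
    critical section protects.
    """
    protected = {"needs_cleanup": False}

    if contended:
        # While this thread is blocked, the current owner mutates the
        # protected state before releasing the lock.
        protected["needs_cleanup"] = True

    # Post-acquire read: the branch depends on whether a mutation happened
    # during the wait.
    if protected["needs_cleanup"]:
        return ["E", "E", "L", "L"]   # nested-cleanup branch
    return ["E", "L", "NtClose"]      # simple-release branch


canary_like = post_acquire_path(contended=True)
ours_like = post_acquire_path(contended=False)
```

The two sequences differ from the first post-acquire call onward, which is why absorbing the wait.begin event alone cannot realign the chains.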

Sylpheed workload profile (probed from ours-cold.jsonl, 121,569 events):

  • 19,494 RtlEnterCriticalSection + 19,492 RtlLeaveCriticalSection calls (≈80% of all kernel.calls)
  • only 2 KeQuerySystemTime, 0 KeQueryPerformanceCounter, 0 KeDelayExecutionThread, 0 NtYieldExecution
  • 34 wait.begin events total, most with timeout_ns=-1 (indefinite)
  • 6 chains in total: main capped at 104,607; the 5 sisters capped at 11/32/4/41/16

Intended outcome: advance the main matched-prefix by ≥1,000 events (target ≥106,000) without destabilizing ours's cold digest e1dfcb1559f987b35012a7f2dc6d93f5 and without modifying canary's default behavior.

Stage 0 first — a cheap (≈80 LOC, 1 day) cycle-quantum preemption sweep to test whether scheduling shape alone unblocks the cap. If a tuned quantum advances main prefix past 104,607 with stable digest across 3 cold runs, the manifest work may be unnecessary.

Stages 1-4 (gated on Stage 0 outcome): a contention manifest. Canary emits a new event kind contention.observed on every RtlEnterCriticalSection with {site_sid, cs_ptr, contended}. A Python tool distills the trace into a per-(tid, tid_event_idx) manifest. Ours's rtl_enter_critical_section, in a new OrderMode::ContentionReplay mode, consults the manifest before its fast-path check; when an entry says contended, it parks via the existing BlockReason::CriticalSection machinery and lets the actual owner's RtlLeaveCriticalSection (also already wired) hand back the lock through the existing wake path. Stage 5 is a per-CS fallback kludge.

Why this approach over the alternatives:

| approach | LOC | unblocks 104,607? | preserves ours digest | preserves canary default | verdict |
|---|---|---|---|---|---|
| A: cycle clock in canary | ~200 in base/clock.cc | NO: workload is clock-light | n/a | yes | wrong target |
| B: single-thread cooperative canary | ~2000-3000 in kernel/xthread.cc, base/threading*.cc, cpu/processor.cc | yes | yes | NO: destabilizes oracle | overscoped, breaks oracle |
| C/H: contention manifest replay (broad: CS + wait) | ~600-700 | yes | yes (default-off) | yes (cvar-off) | second choice |
| H': manifest replay, scoped to RtlEnterCS | ~450-500 | yes | yes | yes | recommended |
| D: diff-harness absorption extension | ~200 in diff_events.py | partially: hits #23 wall, buys 10-100 idx | n/a | n/a | fallback only |
| E: A+D hybrid | ~400 | LOW | n/a | n/a | tactical band-aid |
| F: make ours preemptive | ~500 in scheduler.rs | maybe | NO: breaks e1dfcb15… | n/a | wrong direction |
| cycle-quantum spike | ~80 in scheduler.rs | TBD by spike | TBD by spike | n/a | Stage 0 gate |
| spin-then-wait CS fix in ours | ~50 in exports.rs:2886 | NO (canary contends, ours doesn't; adding spin to ours makes contention less likely, wrong direction) | yes | n/a | log finding, defer |

The fundamental insight is that scoping the replay to RtlEnterCriticalSection only sidesteps four traps in a broader design: (1) no tid translation needed since canary and ours each consume their own native-tid events; (2) no mid-instruction forced-yield primitive needed since the import dispatch is already a scheduling boundary; (3) no "mutating thread" field needed since the current owner does the mutation and the existing wake path handles it; (4) wait-side replay is deferred since only 34 wait.begin events exist in the whole boot.

Stages

Stage 0 — Cycle-Quantum Preemption Spike (~80 LOC, 1 day) [GATE]

Goal: cheap signal on whether scheduling shape alone is sufficient.

| Change | File | LOC |
|---|---|---|
| Add OrderMode::ScanQuantum { ticks: u32 } variant; in round_schedule or the step loop, force decrement_quantum on every Nth step | scheduler.rs:230-258 and scheduler.rs:710-740 | ~30 |
| Wire XENIA_SCHED_QUANTUM=<u32> env var → OrderMode::ScanQuantum | same | ~10 |
| Sweep harness: bash script running cold-vs-cold at quanta [10, 50, 200, 1000, 5000] | new under xenia-rs/audit-runs/stage0-quantum-sweep/ | ~40 (script + notes) |
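
As a rough illustration of what a forced quantum changes, the following Python sketch models a fixed-rotation scheduler with and without a step budget. It is a simplified model under stated assumptions, not the scheduler.rs implementation: threads are reduced to step counts, `run` is a hypothetical helper, and `quantum=None` stands in for the current run-until-done behaviour.

```python
from collections import deque
from typing import Optional


def run(threads: dict, quantum: Optional[int]) -> list:
    """Return the tid executed at each step.

    threads maps tid -> number of steps that thread needs.
    quantum=None lets each thread run until it finishes; an integer forces
    a rotation after that many consecutive steps.
    """
    if quantum is not None and quantum <= 0:
        raise ValueError("quantum must be positive")
    queue = deque(sorted(threads.items()))  # fixed rotation order by tid
    trace = []
    while queue:
        tid, remaining = queue.popleft()
        budget = remaining if quantum is None else min(quantum, remaining)
        trace.extend([tid] * budget)
        remaining -= budget
        if remaining > 0:
            queue.append((tid, remaining))  # preempted: back of the rotation
    return trace


monolithic = run({1: 4, 2: 2}, quantum=None)
interleaved = run({1: 4, 2: 2}, quantum=2)
```

Both modes stay deterministic for a given input; the quantum only changes where the interleaving points fall, which is the property the sweep is meant to probe.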

Validation:

  • Cold-vs-cold per quantum (XENIA_CACHE_WIPE=1, .iso path, canary --mute=true).
  • Record matched-prefix per quantum value in a sweep table.
  • Verify ours's digest stable × 3 cold runs at each candidate quantum.

Decision tree:

  • If a quantum value advances main prefix ≥ 105,500 AND ours's digest is stable × 3 at that quantum: land it behind a non-default OrderMode (keep Fixed as default so e1dfcb15… is preserved). Skip Stages 1-4. Document.
  • Else if some quantum partially helps (105,000-105,500) but digest is unstable: keep the variant available as a probe but proceed to Stage 1.
  • Else (no improvement): proceed to Stage 1 immediately.
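
The decision tree can be written down as a small function so that the sweep notes record an unambiguous verdict. This is a sketch: the function and label names are hypothetical, and combinations the plan does not list (for example a prefix ≥ 105,500 with an unstable digest) are assumed here to fall through to "proceed to Stage 1".

```python
LAND_THRESHOLD = 105_500   # from the decision tree above
PARTIAL_FLOOR = 105_000    # lower bound of the "partially helps" band


def stage0_decision(prefix: int, digest_stable_x3: bool) -> str:
    """Classify one quantum's result (matched-prefix, digest stable x3)."""
    if prefix >= LAND_THRESHOLD and digest_stable_x3:
        return "land"                 # land behind a non-default OrderMode
    if PARTIAL_FLOOR <= prefix <= LAND_THRESHOLD and not digest_stable_x3:
        return "probe-then-stage-1"   # keep the variant as a probe
    return "stage-1"                  # assumption: every other case proceeds


def sweep_verdict(results: dict) -> str:
    """Pick the strongest verdict across a {quantum: (prefix, stable)} sweep."""
    verdicts = {stage0_decision(p, s) for p, s in results.values()}
    for label in ("land", "probe-then-stage-1"):
        if label in verdicts:
            return label
    return "stage-1"
```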

Rollback: trivial — revert the variant; default OrderMode::Fixed is unchanged.

Stage 1 — Canary-Side Contention Emitter (~100 LOC, cvar-OFF byte-identical)

Goal: produce ground truth that "tid X contended on cs Y at its kernel-call ordinal N."

| File | Edit | LOC |
|---|---|---|
| xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_rtl.cc:596-633 (RtlEnterCriticalSection_entry) | Emit contention.observed with contended=false on spin-loop success (atomic_cas hits) and contended=true when control falls through to xeKeWaitForSingleObject | ~40 |
| xenia-canary/src/xenia/kernel/util/event_log.{h,cc} | New EmitContentionObserved(site_sid, cs_ptr, contended); cvar kernel_emit_contention=false default | ~30 |
| xenia-rs/audit-runs/phase-a-diff-harness/schema-v1.md | New §"contention.observed (v1.4, Phase D+0)" | ~10 |
| xenia-canary/src/xenia/kernel/util/event_log_test.cc | Round-trip test | ~20 |

Schema (minimum):

kind: "contention.observed"
tid: <guest tid of caller>
tid_event_idx: <per-tid ordinal>
payload: { "cs_ptr": <u32 hex>, "site_sid": <16-hex>, "contended": <bool> }

site_sid is the C+18 shared-global recipe semantic_id_shared_global(cs_ptr, KernelObjectType::CriticalSection) — both engines compute the same SID for the same CS pointer, so it's a valid cross-engine lookup key.
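
A minimal validator for the record shape above may help when writing the round-trip test. It is a sketch under assumptions: `validate_contention_event` is a hypothetical helper, and cs_ptr is assumed to be serialized as a `0x`-prefixed hex string, as in the manifest example later in this plan.

```python
import re

HEX16 = re.compile(r"[0-9a-f]{16}")          # site_sid: 16 lowercase hex digits
HEX_U32 = re.compile(r"0x[0-9a-fA-F]{1,8}")  # cs_ptr: assumed "0x..." u32


def validate_contention_event(event: dict) -> list:
    """Return the names of fields that violate the minimum schema."""
    errors = []
    if event.get("kind") != "contention.observed":
        errors.append("kind")
    for field in ("tid", "tid_event_idx"):
        value = event.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            errors.append(field)
    payload = event.get("payload")
    if not isinstance(payload, dict):
        return errors + ["payload"]
    cs_ptr = payload.get("cs_ptr")
    if not (isinstance(cs_ptr, str) and HEX_U32.fullmatch(cs_ptr)):
        errors.append("payload.cs_ptr")
    site_sid = payload.get("site_sid")
    if not (isinstance(site_sid, str) and HEX16.fullmatch(site_sid)):
        errors.append("payload.site_sid")
    if not isinstance(payload.get("contended"), bool):
        errors.append("payload.contended")
    return errors
```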

Validation:

  • Enable cvar, cold-run canary, verify ≥1 contended=true event near canary's tid=6 tid_event_idx ≈ 104,605.
  • Verify cold digest unchanged when kernel_emit_contention=false (default) — byte-identical to pre-Stage-1.

Rollback: cvar OFF by default; revert the 4 files.

Stage 2 — Manifest Builder (~150 LOC, pure Python)

Goal: distill canary jsonl into a replay-ready manifest.

| File | LOC |
|---|---|
| xenia-rs/tools/diff-events/build_contention_manifest.py (new) | ~120 |
| xenia-rs/tools/diff-events/test_build_manifest.py (new) | ~30 |

Manifest schema (contention_manifest.json):

{
  "version": 1,
  "source_canary_digest": "<sha256 of canary jsonl>",
  "entries": [
    { "tid": 6, "tid_event_idx": 104605, "site_sid": "75ae880ec432eb36",
      "cs_ptr": "0x82abc000", "contended": true }
  ]
}

Builder reads canary jsonl, filters kind == "contention.observed", keeps contended=true (Phase 1 evidence suggests <100 entries across the whole boot given the wait-light profile), sorts by (tid, tid_event_idx). Diff tool already keys events by (tid, tid_event_idx); this matches.
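
The filter-and-sort step described above could look roughly like the following. This is a sketch, not the planned build_contention_manifest.py: `build_manifest` is a hypothetical function, it takes the jsonl content as a string, and it hashes the UTF-8 encoding of that string (the real tool may prefer to hash the raw file bytes).

```python
import hashlib
import json


def build_manifest(jsonl_text: str) -> dict:
    """Distill canary jsonl into a manifest of contended=true entries."""
    entries = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("kind") != "contention.observed":
            continue
        payload = event.get("payload", {})
        if payload.get("contended") is not True:
            continue  # uncontended acquires carry no replay information
        entries.append({
            "tid": event["tid"],
            "tid_event_idx": event["tid_event_idx"],
            "site_sid": payload["site_sid"],
            "cs_ptr": payload["cs_ptr"],
            "contended": True,
        })
    # Same key the diff tool uses: (tid, tid_event_idx).
    entries.sort(key=lambda e: (e["tid"], e["tid_event_idx"]))
    digest = hashlib.sha256(jsonl_text.encode("utf-8")).hexdigest()
    return {"version": 1, "source_canary_digest": digest, "entries": entries}
```

Serializing the result with `json.dumps(manifest, indent=2)` yields a file in the shape of the schema above.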

Validation: round-trip — build from canary cold jsonl, count contended=true entries, eyeball-diff against C+22 jitter samples.

Stage 3 — Ours Replay Mode (~200 LOC + ~50 LOC tests)

Goal: ours's rtl_enter_critical_section consults the manifest before the fast-path check; forces park if the manifest says contended.

| File | Edit | LOC |
|---|---|---|
| xenia-rs/crates/xenia-cpu/src/scheduler.rs:230-258 | New OrderMode::ContentionReplay { manifest_path }; Scheduler carries Option<Arc<ContentionManifest>> | ~40 |
| xenia-rs/crates/xenia-kernel/src/contention_manifest.rs (new) | Loader, hashmap keyed on (tid, tid_event_idx), consume(tid, idx) -> Option<Entry> | ~80 |
| xenia-rs/crates/xenia-kernel/src/exports.rs:2886-2946 (rtl_enter_critical_section) | After computing current_tid, peek tid_event_idx; if manifest says contended at (tid, idx): (a) verify site_sid matches recomputed shared-global SID for cs_ptr, (b) check the CS in guest memory actually has a live non-self owner; if not, skip with a log warning (state-divergence not schedule-divergence), (c) emit a synthetic wait.begin (C+21 absorber will handle it), (d) push self onto cs_waiters[cs_ptr], (e) call park_current(BlockReason::CriticalSection(cs_ptr)). The existing wake path at lines 2972-2980 already hands us the lock when the owner releases. | ~50 |
| xenia-rs/crates/xenia-cpu/src/main.rs or equivalent CLI module | --scheduler-replay-manifest PATH flag | ~20 |
| xenia-rs/crates/xenia-kernel/src/contention_manifest.rs | Replay-mode unit tests | ~50 |

Critical subtlety: only force park when the CS in guest memory actually has a live different-tid owner at the replay point. If the CS is free, this is a state-divergence (mutation timing mismatch), not a schedule-divergence; replay must skip and log. Otherwise we'd park on a CS that no one will release → deadlock. Explicit branch in (b) above.
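
The guard order in steps (a) and (b) can be modelled in a few lines. The sketch below is a Python model of the decision only, not the Rust code in exports.rs: the class and function names are hypothetical, and treating a SID mismatch as "skip" is an assumption, since the plan says to verify the SID but does not say what happens on mismatch.

```python
from typing import Optional


class ContentionManifest:
    """Entries keyed on (tid, tid_event_idx); each is consumed at most once."""

    def __init__(self, entries: dict):
        self._entries = dict(entries)  # {(tid, idx): site_sid}

    def consume(self, tid: int, idx: int) -> Optional[str]:
        return self._entries.pop((tid, idx), None)


def replay_decision(manifest: ContentionManifest, tid: int, idx: int,
                    recomputed_sid: str, owner_tid: Optional[int]) -> str:
    """Decide what rtl_enter_critical_section should do at (tid, idx)."""
    site_sid = manifest.consume(tid, idx)
    if site_sid is None:
        return "fast-path"               # no manifest entry: normal behaviour
    if site_sid != recomputed_sid:
        return "skip-sid-mismatch"       # (a) failed; skipping is an assumption
    if owner_tid is None or owner_tid == tid:
        return "skip-state-divergence"   # (b) nobody would ever release the CS
    return "park"                        # (c)-(e): wait.begin, enqueue, park
```

Parking only in the last branch is what keeps a stale or mismatched manifest from deadlocking the replay run.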

Validation:

  1. Cold-vs-cold matched-prefix advances past 104,607 (target ≥106,000, the next major divergence boundary).
  2. Ours's digest, when --scheduler-replay-manifest is NOT passed, byte-identical to pre-Stage-3 e1dfcb1559f987b35012a7f2dc6d93f5.
  3. With manifest passed, replay-mode digest stable × 3 cold runs (a NEW digest, archived).
  4. Sister chains tid=4→11/7→2/12→7/14→9/15→10 each regress by at most 5 events.
  5. Phase B image_loaded_sha256 ea8d160e… unchanged.

Rollback criteria:

  • If prefix doesn't advance past 104,607: diagnose via RUST_LOG=trace on the replay-consume path; verify SID match against canary's emitted SID for the contended cs_ptr; check whether the CS was free at the replay point (the (b) skip-branch may be firing).
  • If digest unstable with replay: forced-park is non-deterministic. Inspect cs_waiters[cs_ptr] ordering, wake_ref selection at scheduler.rs (queue.remove(0) — FIFO, should be deterministic). Possible culprit: find_by_tid at exports.rs:2903 traverses HW slots in rotation_cursor order — pin or verify.
  • If sister chains regress >5: forced contention on tid=6 is changing other chains' progression. Initially keep manifest tid-scoped to tid=6 only; broaden iteratively if needed.

Stage 4 — Diff Tool Hookup (~30 LOC + tests)

Goal: register the new event kind.

| File | Edit |
|---|---|
| xenia-rs/tools/diff-events/diff_events.py:201-216 | Add ENGINE_LOCAL_KINDS = {"contention.observed"} set; advance per-tid pointer past these events without comparison |
| xenia-rs/tools/diff-events/test_diff_events.py | Test: contention.observed doesn't affect matched-prefix |

Rationale for engine-local: ours doesn't emit contention.observed (it consumes the manifest instead). Marking the kind engine-local keeps the matched-prefix definition unchanged and reversible. We can promote to a matched event later if both engines start emitting it.
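
A sketch of the pointer-advance rule for a single chain is shown below. It is not the diff_events.py code: `matched_prefix` is a hypothetical helper, and it compares whole event dicts for equality, whereas the real tool compares semantic fields and applies the floating absorbers.

```python
ENGINE_LOCAL_KINDS = {"contention.observed"}


def matched_prefix(canary: list, ours: list) -> int:
    """Count leading events that match, skipping engine-local kinds."""
    i = j = matched = 0
    while True:
        # Advance past engine-local events on either side without comparing.
        while i < len(canary) and canary[i]["kind"] in ENGINE_LOCAL_KINDS:
            i += 1
        while j < len(ours) and ours[j]["kind"] in ENGINE_LOCAL_KINDS:
            j += 1
        if i >= len(canary) or j >= len(ours):
            return matched
        if canary[i] != ours[j]:
            return matched
        matched += 1
        i += 1
        j += 1
```

With this rule, a canary trace captured with kernel_emit_contention=true yields the same matched-prefix as one captured without it.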

Validation: existing 30 diff tests pass; new test confirms engine-local handling.

Stage 5 — Per-CS Fallback (deferred, ~30 LOC if needed)

If Stage 3 succeeds on the 104,607 cap but a similar divergence appears soon after on a CS the manifest doesn't cover (e.g., because canary's emitter missed it), hardcode the SID 75ae880ec432eb36 (or whichever) to always force a one-round yield. Document as a known kludge. Don't generalize.

Critical files

Reused utilities

  • semantic_id_shared_global(pointer, object_type) (C+18 recipe) — both engines compute identical SIDs for the same CS pointer. Use as the cross-engine lookup key in the manifest.
  • Scheduler::park_current(BlockReason::CriticalSection(cs_ptr)) (scheduler.rs:808) — existing primitive; Stage 3 reuses without modification.
  • Scheduler::wake_ref(r) and the cs_waiters queue (exports.rs:2972-2980) — existing wake/transfer machinery handles the post-park resume without any new code.
  • C+21 wait.begin floating absorber (diff_events.py) — Stage 3's synthetic wait.begin emission is automatically absorbed.
  • C+18 shared-global SID emission (event_log.rs / event_log.cc) — Stage 1's site_sid field uses this directly.

Verification end-to-end

After Stage 0:

cd "/home/fabi/RE - Project Sylpheed"
for q in 10 50 200 1000 5000; do
  XENIA_CACHE_WIPE=1 XENIA_SCHED_QUANTUM=$q \
    xenia-rs/target/release/xrs-verify-stage0 "Project Sylpheed - Arc of Deception (USA, Europe) (En,Ja).iso" \
    --phase-a-event-log /tmp/ours-q$q.jsonl
done
# compare each /tmp/ours-q*.jsonl digest and matched-prefix vs canary baseline

After Stages 1-4:

# capture canary contention trace (one-time)
wine xenia-canary/build-cross/bin/Windows/Debug/xenia_canary.exe \
  --mute=true --kernel_emit_contention=true \
  --phase_a_event_log_path=/tmp/canary-contention.jsonl \
  "Project Sylpheed - Arc of Deception (USA, Europe) (En,Ja).iso"

# build manifest
python3 xenia-rs/tools/diff-events/build_contention_manifest.py \
  /tmp/canary-contention.jsonl > /tmp/contention.json

# replay in ours
XENIA_CACHE_WIPE=1 xenia-rs/target/release/xrs-replay \
  --scheduler-replay-manifest /tmp/contention.json \
  --phase-a-event-log /tmp/ours-replay.jsonl \
  "Project Sylpheed - Arc of Deception (USA, Europe) (En,Ja).iso"

# diff
python3 xenia-rs/tools/diff-events/diff_events.py \
  --canary /tmp/canary-contention.jsonl \
  --ours /tmp/ours-replay.jsonl
# expect main matched-prefix ≥ 106,000

Test suites:

cd xenia-rs && cargo test --release
# expect: kernel 204→≥210, workspace 291→≥305
cd xenia-canary && build-cross/bin/Linux/Debug/xenia-kernel-test
# expect: pass with kernel_emit_contention round-trip test
python3 xenia-rs/tools/diff-events/test_diff_events.py
# expect: 30→≥34 pass
python3 xenia-rs/tools/diff-events/test_build_manifest.py
# expect: pass

Risk register

| risk | likelihood | mitigation |
|---|---|---|
| Forced contention on tid=6 changes ordinal progression on later tids, breaking sister chain matches | high | Stage 3 validation #4 (-5 budget). If exceeded: keep manifest tid-scoped to tid=6 only initially; broaden iteratively per sister. |
| Synthetic park on a CS that's free in ours's memory → deadlock | medium | Stage 3 explicit (b) skip-branch + warn. Replay only parks when guest-memory shows a live different-tid owner. |
| Canary's spin loop usually succeeds → very few contended=true events to drive replay | medium | Stage 1 round-trip validation confirms ≥1 entry near 104,605. If empty: spin sizes are too large for Sylpheed's CS contention; emit on spin-loop entry too with contended=spin_count_exhausted. |
| Ours digest destabilizes under replay (forced park orderings non-deterministic) | medium | Inspect cs_waiters FIFO + find_by_tid HW-slot traversal order. Pin both to deterministic order if needed. |
| Reading-error #34 (loose .xex vs .iso) | known, covered | All validation uses .iso path. |
| Reading-error #28 (verify source before writing) | active | Stage 1 requires reading xboxkrnl_rtl.cc::RtlEnterCriticalSection end-to-end first; Stage 3 likewise for exports.rs:2886-2946. Already done in planning. |
| Manifest grows huge on other games | low | Per-game tool; Sylpheed wait-light. Document scope in plan-doc / memory. |
| Phase B image_loaded_sha256 ea8d160e… regression | low | Image load is Phase B; CS replay touches Phase A only. Verify in every cold run. |
| Game-compat: real Sylpheed depends on wallclock pacing | low | Workload profile: 2 KeQuerySystemTime calls. Clock not in scope. |
| The spin-then-wait asymmetry is itself a divergence | known, deferred | Don't add spin to ours under this plan; it would make ours's contention less likely, which is the wrong direction for 104,607. Log finding; defer to a separate phase. |

Backstop

If Stage 3 lands but matched-prefix advance is <500 events (i.e., we get past 104,607 but quickly hit a similar wall):

  1. Stage 5 per-CS hardcoded yield for the next blocking SID.
  2. Approach D extension — extend the diff-tool absorber to fold "post-wait nested Enter/Leave blocks" matched against a known pattern (one extra E-then-L cycle on the canary side with the same outer CS). ~150 LOC in diff_events.py. This crosses reading-error #23 in spirit but with a narrow heuristic; tag explicitly as a band-aid in schema v1.5.
  3. Broaden manifest to wait.begin (the deferred "H broad" variant): ~150 LOC. Only worth it if sister chains tid=12→7 or tid=14→9 are stuck on wait timing.

Acceptance criteria

  • Stage 0 spike completes with a decision (land / proceed).
  • Stages 1-4 (if needed) land in 3-5 sessions total.
  • Main matched-prefix ≥ 106,000 (≥1,393 events past current cap).
  • Ours's default-mode cold digest unchanged: e1dfcb1559f987b35012a7f2dc6d93f5 × 3.
  • Phase B image_loaded_sha256 unchanged: ea8d160e….
  • Canary default-mode (kernel_emit_contention=false) cold digest unchanged.
  • Replay-mode digest stable × 3 cold runs (new value, archived).
  • Sister chain regression ≤5 events per sister.
  • Test suite: kernel 204→≥210, workspace 291→≥305, diff-tool 30→≥34.
  • Memory entry + schema-v1.md v1.4 §"contention.observed" + audit-run dir.

Out of scope

  • GPU/audio determinism (separate subaudits).
  • Wine-level changes (unmodifiable host primitive).
  • Modifying ours's default scheduler (OrderMode::Fixed stays the default; replay is opt-in).
  • Modifying canary's default scheduling (single-thread cooperative canary deferred indefinitely — too risky to oracle).
  • Modifying ours's spin behavior on RtlEnterCS (logged, deferred — wrong direction for 104,607).
  • Adding wait.begin replay to manifest (deferred unless sisters need it; backstop only).
  • Clock determinism in canary (workload doesn't need it).

Open questions for the user

  1. Stage 0 as gate or parallel? Recommended as gate (1 day, cheap, may answer the whole question). Parallel risks duplicating effort.
  2. Sister-chain regression budget: -5 events per sister acceptable? Past escalations (C+17 tid=15→10 was -14, treated as D-NEW-3) suggest budget is tight.
  3. Canary cvar default: kernel_emit_contention=false (off by default) confirmed?
  4. Reading-error class: should this plan reserve class #35 for "scheduling-philosophy divergences", or extend #30 ("scheduling determinism")?
  5. Spin-then-wait scope: separate mini-phase or document-and-defer? Recommended defer.