# Plan: Unblock the 104,607 Scheduler-Determinism Cap

## Context

The Phase A matched-prefix between xenia-canary (oracle, C++) and xenia-rs ("ours", Rust) is structurally capped at **104,607 events** on the main chain (canary tid=6 → ours tid=1). C+20 and C+22 escalated the divergence at idx 104,607 as **class (A) scheduler-determinism** — not a fixable bug in either engine, but a fundamental mismatch in scheduling philosophy.

**The two engines are correct independently; their scheduling models cannot agree:**

| dimension | canary (oracle) | ours |
|---|---|---|
| host-thread mapping | 1 host std::thread per XThread (`xthread.cc:62/358/476`) | single host thread, 6 cooperative HW slots (`scheduler.rs:230-258`) |
| who picks next runnable | host OS (Wine on Linux) — non-deterministic | `round_schedule` over `OrderMode::Fixed` rotation cursor (`scheduler.rs:710-740`) — deterministic |
| RtlEnterCriticalSection on contention | spins `cs->header.absolute*256` times then `xeKeWaitForSingleObject` (`xboxkrnl_rtl.cc:596-633`) | parks immediately via `BlockReason::CriticalSection` (`exports.rs:2927-2945`) — **no spin** |
| clock | wallclock-driven (`Clock::QueryHostSystemTime`, optional rdtsc) | fixed FILETIME `132_500_000_000_000_000` |
| determinism cvars | `clock_no_scaling`, `clock_source_raw`, `ignore_thread_priorities`, `ignore_thread_affinities` — none enable lockstep | already lockstep by default; `XENIA_SCHED_ORDER=random` opt-out |

The 104,607 divergence is the symptom: canary's tid=6 contends → blocks on shared dispatcher Event `sid=75ae880ec432eb36` → another guest thread mutates protected state during the wait → post-acquire reads mutated value → nested-cleanup branch (`E E L L`). Ours's tid=1 runs monolithically, no other thread gets the CS first → fast-path acquire → reads pre-wait value → simple-release branch (`E L NtClose`).
The C+18/C+21 floating absorbers already mask the observation-side jitter (the `wait.begin` event itself); the post-wait *control-flow* divergence is real guest code, not absorbable without crossing reading-error #23.

**Sylpheed workload profile** (probed from `ours-cold.jsonl`, 121,569 events):

- 19,494 RtlEnterCriticalSection + 19,492 RtlLeaveCriticalSection calls (≈80% of all kernel.calls)
- only 2 KeQuerySystemTime, 0 KeQueryPerformanceCounter, 0 KeDelayExecutionThread, 0 NtYieldExecution
- 34 wait.begin events total, most with `timeout_ns=-1` (indefinite)
- 6 chains: main capped at 104,607, the 5 sisters capped at 11/32/4/41/16

**Intended outcome**: advance the main matched-prefix by ≥1,000 events (target ≥106,000) without destabilizing ours's cold digest `e1dfcb1559f987b35012a7f2dc6d93f5` and without modifying canary's default behavior.

## Recommended approach: Stage 0 spike → Targeted Contention-Replay Manifest

**Stage 0** first — a cheap (≈80 LOC, 1 day) cycle-quantum preemption sweep to test whether *scheduling shape alone* unblocks the cap. If a tuned quantum advances main prefix past 104,607 with stable digest across 3 cold runs, the manifest work may be unnecessary.

**Stages 1–4 (gated on Stage 0 outcome)** — a **contention manifest**: canary emits a new event kind `contention.observed` on every `RtlEnterCriticalSection` with `{site_sid, cs_ptr, contended}`. A Python tool distills the trace into a per-(tid, tid_event_idx) manifest. Ours's `rtl_enter_critical_section`, in a new `OrderMode::ContentionReplay` mode, consults the manifest before its fast-path check; when an entry says contended, it parks via the *existing* `BlockReason::CriticalSection` machinery and lets the actual owner's `RtlLeaveCriticalSection` (also already wired) hand back the lock through the existing wake path. Stage 5 is a per-CS fallback kludge.

**Why this approach over the alternatives:**

| approach | LOC | unblocks 104,607? | preserves ours digest | preserves canary default | verdict |
|---|---|---|---|---|---|
| A — cycle clock in canary | ~200 in `base/clock.cc` | NO — workload is clock-light | n/a | yes | wrong target |
| B — single-thread cooperative canary | ~2000-3000 in `kernel/xthread.cc`, `base/threading*.cc`, `cpu/processor.cc` | yes | yes | NO — destabilizes oracle | overscoped, breaks oracle |
| C/H — contention manifest replay (broad: CS + wait) | ~600-700 | yes | yes (default-off) | yes (cvar-off) | **second choice** |
| **H' — manifest replay, scoped to RtlEnterCS** | **~450-500** | **yes** | **yes** | **yes** | **recommended** |
| D — diff-harness absorption extension | ~200 in `diff_events.py` | partially — hits #23 wall, buys 10-100 idx | n/a | n/a | fallback only |
| E — A+D hybrid | ~400 | LOW | n/a | n/a | tactical band-aid |
| F — make ours preemptive | ~500 in `scheduler.rs` | maybe | NO — breaks `e1dfcb15…` | n/a | wrong direction |
| **cycle-quantum spike** | **~80 in `scheduler.rs`** | **TBD by spike** | **TBD by spike** | **n/a** | **Stage 0 gate** |
| spin-then-wait CS fix in ours | ~50 in `exports.rs:2886` | NO (canary contends/ours doesn't — adding spin to ours makes contention *less* likely, wrong direction) | yes | n/a | log finding, defer |

The fundamental insight is that **scoping the replay to RtlEnterCriticalSection only** sidesteps four traps in a broader design:

1. no tid translation needed since canary and ours each consume their own native-tid events;
2. no mid-instruction forced-yield primitive needed since the import dispatch is already a scheduling boundary;
3. no "mutating thread" field needed since the current owner does the mutation and the existing wake path handles it;
4. wait-side replay is deferred since only 34 wait.begin events exist in the whole boot.

## Stages

### Stage 0 — Cycle-Quantum Preemption Spike (~80 LOC, 1 day) **[GATE]**

**Goal**: cheap signal on whether scheduling shape alone is sufficient.

| Change | File | LOC |
|---|---|---|
| Add `OrderMode::ScanQuantum { ticks: u32 }` variant; in `round_schedule` or the step loop, force `decrement_quantum` on every Nth step | [scheduler.rs:230-258](xenia-rs/crates/xenia-cpu/src/scheduler.rs#L230-L258) and [scheduler.rs:710-740](xenia-rs/crates/xenia-cpu/src/scheduler.rs#L710-L740) | ~30 |
| Wire `XENIA_SCHED_QUANTUM=` env var → `OrderMode::ScanQuantum` | same | ~10 |
| Sweep harness: bash script running cold-vs-cold at quanta `[10, 50, 200, 1000, 5000]` | new under `xenia-rs/audit-runs/stage0-quantum-sweep/` | ~40 (script + notes) |

**Validation**:

- Cold-vs-cold per quantum (`XENIA_CACHE_WIPE=1`, `.iso` path, canary `--mute=true`).
- Record matched-prefix per quantum value in a sweep table.
- Verify ours's digest stable × 3 cold runs at each candidate quantum.

**Decision tree**:

- *If* a quantum value advances main prefix ≥ 105,500 AND ours's digest is stable × 3 at that quantum: land it behind a non-default `OrderMode` (keep `Fixed` as default so `e1dfcb15…` is preserved). Skip Stages 1-4. Document.
- *Else if* some quantum partially helps (105,000-105,500) but digest is unstable: keep the variant available as a probe but proceed to Stage 1.
- *Else* (no improvement): proceed to Stage 1 immediately.

**Rollback**: trivial — revert the variant; default `OrderMode::Fixed` is unchanged.

### Stage 1 — Canary-Side Contention Emitter (~100 LOC, cvar-OFF byte-identical)

**Goal**: produce ground truth that "tid X contended on cs Y at its kernel-call ordinal N."
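
A toy Python model of the classification the emitter is meant to capture, as a sketch only: the real change is C++ inside `RtlEnterCriticalSection_entry`. The names `observe_enter` and `try_acquire`, and the single post-spin attempt before declaring contention, are assumptions for illustration and not canary's actual code.

```python
def observe_enter(try_acquire, spin_count, tid, tid_event_idx, cs_ptr, site_sid):
    """Classify one enter attempt and build the event record (hypothetical helper).

    try_acquire: callable returning True when the lock was taken on this attempt.
    spin_count:  number of spin attempts before the slow path is considered.
    """
    contended = True
    # Spin attempts plus one final attempt (assumed) before the thread would wait.
    for _ in range(spin_count + 1):
        if try_acquire():
            contended = False
            break
    # contended=True models "control fell through to the wait".
    return {
        "kind": "contention.observed",
        "tid": tid,
        "tid_event_idx": tid_event_idx,
        "payload": {"cs_ptr": cs_ptr, "site_sid": site_sid, "contended": contended},
    }


# A free CS is acquired immediately; a held CS exhausts every attempt.
free_event = observe_enter(lambda: True, 4, 6, 100, "0x82abc000", "75ae880ec432eb36")
held_event = observe_enter(lambda: False, 4, 6, 101, "0x82abc000", "75ae880ec432eb36")
```

Only the `contended=true` case matters for replay; the `contended=false` records exist so the trace shows the emitter fired at all.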

| File | Edit | LOC |
|---|---|---|
| [xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_rtl.cc:596-633](xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_rtl.cc) (`RtlEnterCriticalSection_entry`) | Emit `contention.observed` with `contended=false` on spin-loop success (`atomic_cas` hits) and `contended=true` when control falls through to `xeKeWaitForSingleObject` | ~40 |
| `xenia-canary/src/xenia/kernel/util/event_log.{h,cc}` | New `EmitContentionObserved(site_sid, cs_ptr, contended)`; cvar `kernel_emit_contention=false` default | ~30 |
| `xenia-rs/audit-runs/phase-a-diff-harness/schema-v1.md` | New §"contention.observed (v1.4 — Phase D+0)" | ~10 |
| `xenia-canary/src/xenia/kernel/util/event_log_test.cc` | Round-trip test | ~20 |

**Schema (minimum)**:

```
kind: "contention.observed"
tid:
tid_event_idx:
payload: { "cs_ptr": , "site_sid": <16-hex>, "contended": }
```

`site_sid` is the **C+18 shared-global recipe** `semantic_id_shared_global(cs_ptr, KernelObjectType::CriticalSection)` — both engines compute the same SID for the same CS pointer, so it's a valid cross-engine lookup key.

**Validation**:

- Enable cvar, cold-run canary, verify ≥1 `contended=true` event near canary's tid=6 `tid_event_idx` ≈ 104,605.
- Verify cold digest unchanged when `kernel_emit_contention=false` (default) — byte-identical to pre-Stage-1.

**Rollback**: cvar OFF by default; revert the 4 files.

### Stage 2 — Manifest Builder (~150 LOC, pure Python)

**Goal**: distill canary jsonl into a replay-ready manifest.
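
A minimal sketch of that distillation, assuming one JSON object per jsonl line with `kind`, `tid`, `tid_event_idx` and a `payload` carrying `cs_ptr`, `site_sid` and `contended`. The function name `build_manifest` and the sample events are hypothetical; the planned `build_contention_manifest.py` may be structured differently.

```python
import json


def build_manifest(lines, source_digest=""):
    """Keep contended contention.observed events, sorted by (tid, tid_event_idx)."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace
        event = json.loads(line)
        if event.get("kind") != "contention.observed":
            continue
        payload = event.get("payload", {})
        if not payload.get("contended"):
            continue  # uncontended enters need no replay
        entries.append({
            "tid": event["tid"],
            "tid_event_idx": event["tid_event_idx"],
            "site_sid": payload["site_sid"],
            "cs_ptr": payload["cs_ptr"],
            "contended": True,
        })
    entries.sort(key=lambda e: (e["tid"], e["tid_event_idx"]))
    return {"version": 1, "source_canary_digest": source_digest, "entries": entries}


# Hypothetical three-event trace: one unrelated call, one uncontended, one contended.
sample_lines = [
    json.dumps({"kind": "kernel.call", "tid": 6, "tid_event_idx": 104604}),
    json.dumps({"kind": "contention.observed", "tid": 6, "tid_event_idx": 104600,
                "payload": {"cs_ptr": "0x82abc000", "site_sid": "75ae880ec432eb36",
                            "contended": False}}),
    json.dumps({"kind": "contention.observed", "tid": 6, "tid_event_idx": 104605,
                "payload": {"cs_ptr": "0x82abc000", "site_sid": "75ae880ec432eb36",
                            "contended": True}}),
]
manifest = build_manifest(sample_lines)
```

Sorting on the same `(tid, tid_event_idx)` key the diff tool uses keeps the manifest easy to eyeball against a diff report.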

| File | LOC |
|---|---|
| `xenia-rs/tools/diff-events/build_contention_manifest.py` (new) | ~120 |
| `xenia-rs/tools/diff-events/test_build_manifest.py` (new) | ~30 |

**Manifest schema** (`contention_manifest.json`):

```json
{
  "version": 1,
  "source_canary_digest": "",
  "entries": [
    {
      "tid": 6,
      "tid_event_idx": 104605,
      "site_sid": "75ae880ec432eb36",
      "cs_ptr": "0x82abc000",
      "contended": true
    }
  ]
}
```

Builder reads canary jsonl, filters `kind == "contention.observed"`, keeps `contended=true` (Phase 1 evidence suggests <100 entries across the whole boot given the wait-light profile), sorts by `(tid, tid_event_idx)`. Diff tool already keys events by `(tid, tid_event_idx)`; this matches.

**Validation**: round-trip — build from canary cold jsonl, count `contended=true` entries, eyeball-diff against C+22 jitter samples.

### Stage 3 — Ours Replay Mode (~200 LOC + ~50 LOC tests)

**Goal**: ours's `rtl_enter_critical_section` consults the manifest *before* the fast-path check; forces park if the manifest says contended.
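
A Python model of that consult-before-fast-path decision, as a sketch only since the real change is Rust in `rtl_enter_critical_section`. `ContentionManifest`, `replay_decision` and the three result strings are hypothetical names. Treating a SID mismatch as a skip, and refusing to park when the CS has no live different-tid owner, are assumptions made here so the model cannot park on a lock nobody will release.

```python
import logging

log = logging.getLogger("contention_replay")


class ContentionManifest:
    """Hypothetical in-memory manifest keyed on (tid, tid_event_idx)."""

    def __init__(self, entries):
        self._by_key = {(e["tid"], e["tid_event_idx"]): e for e in entries}

    def consume(self, tid, idx):
        # Each entry is used at most once.
        return self._by_key.pop((tid, idx), None)


def replay_decision(manifest, tid, idx, site_sid, owner_tid):
    """Return 'fast_path', 'skip' or 'park' for one enter call."""
    entry = manifest.consume(tid, idx)
    if entry is None or not entry["contended"]:
        return "fast_path"  # manifest has nothing to say here
    if entry["site_sid"] != site_sid:
        log.warning("site_sid mismatch at (%s, %s); skipping replay", tid, idx)
        return "skip"
    if owner_tid is None or owner_tid == tid:
        # CS is free or self-owned: parking would never be woken.
        log.warning("no live foreign owner at (%s, %s); skipping replay", tid, idx)
        return "skip"
    return "park"


SID = "75ae880ec432eb36"
entries = [{"tid": 6, "tid_event_idx": 104605, "site_sid": SID, "contended": True}]

m = ContentionManifest(entries)
first = replay_decision(m, 6, 104605, SID, owner_tid=9)    # foreign owner present
second = replay_decision(m, 6, 104605, SID, owner_tid=9)   # entry already consumed
free_cs = replay_decision(ContentionManifest(entries), 6, 104605, SID, owner_tid=None)
```

The model leaves out the synthetic `wait.begin` emission and the waiter-queue push; it only shows which of the three outcomes each input selects.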

| File | Edit | LOC |
|---|---|---|
| [xenia-rs/crates/xenia-cpu/src/scheduler.rs:230-258](xenia-rs/crates/xenia-cpu/src/scheduler.rs#L230-L258) | New `OrderMode::ContentionReplay { manifest_path }`; `Scheduler` carries `Option>` | ~40 |
| `xenia-rs/crates/xenia-kernel/src/contention_manifest.rs` (new) | Loader, hashmap keyed on `(tid, tid_event_idx)`, `consume(tid, idx) -> Option` | ~80 |
| [xenia-rs/crates/xenia-kernel/src/exports.rs:2886-2946](xenia-rs/crates/xenia-kernel/src/exports.rs#L2886-L2946) (`rtl_enter_critical_section`) | After computing `current_tid`, peek `tid_event_idx`; if manifest says contended at `(tid, idx)`: (a) verify `site_sid` matches recomputed shared-global SID for `cs_ptr`, (b) check the CS in guest memory actually has a live non-self owner — if not, skip with a log warning (state-divergence not schedule-divergence), (c) emit a synthetic `wait.begin` (C+21 absorber will handle it), (d) push self onto `cs_waiters[cs_ptr]`, (e) call `park_current(BlockReason::CriticalSection(cs_ptr))`. The existing wake path at lines 2972-2980 already hands us the lock when the owner releases. | ~50 |
| `xenia-rs/crates/xenia-cpu/src/main.rs` or equivalent CLI module | `--scheduler-replay-manifest PATH` flag | ~20 |
| `xenia-rs/crates/xenia-kernel/src/contention_manifest.rs` | Replay-mode unit tests | ~50 |

**Critical subtlety**: only force park when the CS in guest memory actually has a live different-tid owner at the replay point. If the CS is free, this is a state-divergence (mutation timing mismatch), not a schedule-divergence; replay must skip and log. Otherwise we'd park on a CS that no one will release → deadlock. Explicit branch in (b) above.

**Validation**:

1. Cold-vs-cold matched-prefix advances past 104,607 (target ≥106,000, the next major divergence boundary).
2. Ours's digest, when `--scheduler-replay-manifest` is NOT passed, byte-identical to pre-Stage-3 `e1dfcb1559f987b35012a7f2dc6d93f5`.
3. With manifest passed, replay-mode digest stable × 3 cold runs (a NEW digest, archived).
4. Sister chains tid=4→11/7→2/12→7/14→9/15→10 regress by at most 5 events each.
5. Phase B `image_loaded_sha256 ea8d160e…` unchanged.

**Rollback criteria**:

- If prefix doesn't advance past 104,607: diagnose via `RUST_LOG=trace` on the replay-consume path; verify SID match against canary's emitted SID for the contended cs_ptr; check whether the CS was free at the replay point (the (b) skip-branch may be firing).
- If digest unstable with replay: forced-park is non-deterministic. Inspect `cs_waiters[cs_ptr]` ordering, `wake_ref` selection at scheduler.rs (queue.remove(0) — FIFO, should be deterministic). Possible culprit: `find_by_tid` at exports.rs:2903 traverses HW slots in `rotation_cursor` order — pin or verify.
- If sister chains regress >5: forced contention on tid=6 is changing other chains' progression. Initially keep manifest tid-scoped to tid=6 only; broaden iteratively if needed.

### Stage 4 — Diff Tool Hookup (~30 LOC + tests)

**Goal**: register the new event kind.

| File | LOC |
|---|---|
| [xenia-rs/tools/diff-events/diff_events.py:201-216](xenia-rs/tools/diff-events/diff_events.py#L201-L216) | Add `ENGINE_LOCAL_KINDS = {"contention.observed"}` set; advance per-tid pointer past these events without comparison | ~20 |
| `xenia-rs/tools/diff-events/test_diff_events.py` | Test: `contention.observed` doesn't affect matched-prefix | ~10 |

**Rationale for engine-local**: ours doesn't emit `contention.observed` (it consumes the manifest instead). Marking the kind engine-local keeps the matched-prefix definition unchanged and reversible. We can promote to a matched event later if both engines start emitting it.

**Validation**: existing 30 diff tests pass; new test confirms engine-local handling.
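
A minimal sketch of the engine-local handling, assuming events are dicts with a `kind` field and that two events match when the dicts are equal. The real `diff_events.py` comparison (payload skip lists, absorbers) is richer; `matched_prefix` here is a hypothetical stand-in, not the tool's function.

```python
ENGINE_LOCAL_KINDS = {"contention.observed"}


def matched_prefix(canary_events, ours_events):
    """Count leading events that match, stepping over engine-local kinds."""
    i = j = matched = 0
    while True:
        # Advance each side past engine-local events without comparing them.
        while i < len(canary_events) and canary_events[i]["kind"] in ENGINE_LOCAL_KINDS:
            i += 1
        while j < len(ours_events) and ours_events[j]["kind"] in ENGINE_LOCAL_KINDS:
            j += 1
        if i >= len(canary_events) or j >= len(ours_events):
            return matched
        if canary_events[i] != ours_events[j]:
            return matched  # first real divergence
        matched += 1
        i += 1
        j += 1


enter = {"kind": "kernel.call", "name": "RtlEnterCriticalSection"}
leave = {"kind": "kernel.call", "name": "RtlLeaveCriticalSection"}
close = {"kind": "kernel.call", "name": "NtClose"}
observed = {"kind": "contention.observed", "payload": {"contended": True}}

canary = [enter, observed, leave, enter]
ours = [enter, leave, close]
prefix_with_observed = matched_prefix(canary, ours)
prefix_without_observed = matched_prefix([enter, leave, enter], ours)
```

Both calls stop at the same divergence, which is the property the new test in `test_diff_events.py` is meant to pin down.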

### Stage 5 — Per-CS Fallback (deferred, ~30 LOC if needed)

If Stage 3 succeeds on the 104,607 cap but a similar divergence appears soon after on a CS the manifest doesn't cover (e.g., because canary's emitter missed it), hardcode the SID `75ae880ec432eb36` (or whichever) to always force a one-round yield. Document as a known kludge. Don't generalize.

## Critical files

- [xenia-rs/crates/xenia-kernel/src/exports.rs:2886-2980](xenia-rs/crates/xenia-kernel/src/exports.rs#L2886-L2980) — `rtl_enter_critical_section`, `rtl_leave_critical_section`
- [xenia-rs/crates/xenia-cpu/src/scheduler.rs:230-258](xenia-rs/crates/xenia-cpu/src/scheduler.rs#L230-L258) — `OrderMode`
- [xenia-rs/crates/xenia-cpu/src/scheduler.rs:710-740](xenia-rs/crates/xenia-cpu/src/scheduler.rs#L710-L740) — `round_schedule`
- [xenia-rs/crates/xenia-cpu/src/scheduler.rs:808-852](xenia-rs/crates/xenia-cpu/src/scheduler.rs#L808-L852) — `park_current`, `wake_ref`
- [xenia-rs/crates/xenia-kernel/src/event_log.rs](xenia-rs/crates/xenia-kernel/src/event_log.rs) — `TID_COUNTERS`, `next_tid_idx`, `peek_tid_idx`, C+18 shared-global SID recipe
- [xenia-rs/tools/diff-events/diff_events.py](xenia-rs/tools/diff-events/diff_events.py) — `SKIP_PAYLOAD_FIELDS_BY_KIND`, `SHARED_GLOBAL_SID_MARKER`, C+18/C+21 absorb logic
- [xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_rtl.cc:596-633](xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_rtl.cc#L596-L633) — `RtlEnterCriticalSection_entry` (read before Stage 1 edit)
- `xenia-canary/src/xenia/kernel/util/event_log.{h,cc}` — Phase A emitter (add Stage 1 helper here)
- `xenia-rs/audit-runs/phase-a-diff-harness/schema-v1.md` — extend with v1.4 §"contention.observed"

## Reused utilities

- `semantic_id_shared_global(pointer, object_type)` (C+18 recipe) — both engines compute identical SIDs for the same CS pointer. Use as the cross-engine lookup key in the manifest.
- `Scheduler::park_current(BlockReason::CriticalSection(cs_ptr))` (scheduler.rs:808) — existing primitive; Stage 3 reuses without modification.
- `Scheduler::wake_ref(r)` and the `cs_waiters` queue (exports.rs:2972-2980) — existing wake/transfer machinery handles the post-park resume without any new code.
- C+21 `wait.begin` floating absorber (diff_events.py) — Stage 3's synthetic `wait.begin` emission is automatically absorbed.
- C+18 shared-global SID emission (event_log.rs / event_log.cc) — Stage 1's `site_sid` field uses this directly.

## Verification end-to-end

After Stage 0:

```bash
cd "/home/fabi/RE - Project Sylpheed"
for q in 10 50 200 1000 5000; do
  XENIA_CACHE_WIPE=1 XENIA_SCHED_QUANTUM=$q \
    xenia-rs/target/release/xrs-verify-stage0 "Project Sylpheed - Arc of Deception (USA, Europe) (En,Ja).iso" \
    --phase-a-event-log /tmp/ours-q$q.jsonl
done
# compare each /tmp/ours-q*.jsonl digest and matched-prefix vs canary baseline
```

After Stages 1-4:

```bash
# capture canary contention trace (one-time)
wine xenia-canary/build-cross/bin/Windows/Debug/xenia_canary.exe \
  --mute=true --kernel_emit_contention=true \
  --phase_a_event_log_path=/tmp/canary-contention.jsonl \
  "Project Sylpheed - Arc of Deception (USA, Europe) (En,Ja).iso"

# build manifest
python3 xenia-rs/tools/diff-events/build_contention_manifest.py \
  /tmp/canary-contention.jsonl > /tmp/contention.json

# replay in ours
XENIA_CACHE_WIPE=1 xenia-rs/target/release/xrs-replay \
  --scheduler-replay-manifest /tmp/contention.json \
  --phase-a-event-log /tmp/ours-replay.jsonl \
  "Project Sylpheed - Arc of Deception (USA, Europe) (En,Ja).iso"

# diff
python3 xenia-rs/tools/diff-events/diff_events.py \
  --canary /tmp/canary-contention.jsonl \
  --ours /tmp/ours-replay.jsonl
# expect main matched-prefix ≥ 106,000
```

Test suites:

```bash
# subshells keep the working directory at the project root between steps
(cd xenia-rs && cargo test --release)   # expect: kernel 204→≥210, workspace 291→≥305
(cd xenia-canary && build-cross/bin/Linux/Debug/xenia-kernel-test)   # expect: pass with kernel_emit_contention round-trip test
python3 xenia-rs/tools/diff-events/test_diff_events.py   # expect: 30→≥34 pass
python3 xenia-rs/tools/diff-events/test_build_manifest.py   # expect: pass
```

## Risk register

| risk | likelihood | mitigation |
|---|---|---|
| Forced contention on tid=6 changes ordinal progression on later tids, breaking sister chain matches | high | Stage 3 validation #4 (-5 budget). If exceeded: keep manifest tid-scoped to tid=6 only initially; broaden iteratively per sister. |
| Synthetic park on a CS that's free in ours's memory → deadlock | medium | Stage 3 explicit (b) skip-branch + warn. Replay only parks when guest-memory shows a live different-tid owner. |
| Canary's spin loop usually succeeds → very few `contended=true` events to drive replay | medium | Stage 1 round-trip validation confirms ≥1 entry near 104,605. If empty: spin sizes are too large for Sylpheed's CS contention; emit on spin-loop entry too with `contended=spin_count_exhausted`. |
| Ours digest destabilizes under replay (forced park orderings non-deterministic) | medium | Inspect `cs_waiters` FIFO + `find_by_tid` HW-slot traversal order. Pin both to deterministic order if needed. |
| Reading-error #34 (loose `.xex` vs `.iso`) | known, covered | All validation uses `.iso` path. |
| Reading-error #28 (verify source before writing) | active | Stage 1 requires reading `xboxkrnl_rtl.cc::RtlEnterCriticalSection` end-to-end first; Stage 3 likewise for `exports.rs:2886-2946`. Already done in planning. |
| Manifest grows huge on other games | low | Per-game tool; Sylpheed wait-light. Document scope in plan-doc / memory. |
| Phase B `image_loaded_sha256 ea8d160e…` regression | low | Image load is Phase B; CS replay touches Phase A only. Verify in every cold run. |
| Game-compat: real Sylpheed depends on wallclock pacing | low | Workload profile: 2 KeQuerySystemTime calls. Clock not in scope. |
| The spin-then-wait asymmetry is itself a divergence | known, deferred | Don't add spin to ours under this plan — it would make ours's contention *less* likely, which is the wrong direction for 104,607. Log finding; defer to a separate phase. |

## Backstop

If Stage 3 lands but matched-prefix advance is <500 events (i.e., we get past 104,607 but quickly hit a similar wall):

1. **Stage 5** per-CS hardcoded yield for the next blocking SID.
2. **Approach D extension** — extend the diff-tool absorber to fold "post-wait nested Enter/Leave blocks" matched against a known pattern (one extra `E`-then-`L` cycle on the canary side with the same outer CS). ~150 LOC in `diff_events.py`. This crosses reading-error #23 in spirit but with a narrow heuristic; tag explicitly as a band-aid in schema v1.5.
3. **Broaden manifest** to wait.begin (the deferred "H broad" variant): ~150 LOC. Only worth it if sister chains tid=12→7 or tid=14→9 are stuck on wait timing.

## Acceptance criteria

- Stage 0 spike completes with a decision (land / proceed).
- Stages 1-4 (if needed) land in 3-5 sessions total.
- Main matched-prefix ≥ 106,000 (≥1,393 events past current cap).
- Ours's default-mode cold digest unchanged: `e1dfcb1559f987b35012a7f2dc6d93f5` × 3.
- Phase B `image_loaded_sha256` unchanged: `ea8d160e…`.
- Canary default-mode (`kernel_emit_contention=false`) cold digest unchanged.
- Replay-mode digest stable × 3 cold runs (new value, archived).
- Sister chain regression ≤5 events per sister.
- Test suite: kernel 204→≥210, workspace 291→≥305, diff-tool 30→≥34.
- Memory entry + schema-v1.md v1.4 §"contention.observed" + audit-run dir.

## Out of scope

- GPU/audio determinism (separate subaudits).
- Wine-level changes (unmodifiable host primitive).
- Modifying ours's default scheduler (`OrderMode::Fixed` stays the default; replay is opt-in).
- Modifying canary's default scheduling (single-thread cooperative canary deferred indefinitely — too risky for the oracle).
- Modifying ours's spin behavior on RtlEnterCS (logged, deferred — wrong direction for 104,607).
- Adding wait.begin replay to manifest (deferred unless sisters need it; backstop only).
- Clock determinism in canary (workload doesn't need it).

## Open questions for the user

1. **Stage 0 as gate or parallel?** Recommended as gate (1 day, cheap, may answer the whole question). Parallel risks duplicating effort.
2. **Sister-chain regression budget**: -5 events per sister acceptable? Past escalations (C+17 tid=15→10 was -14, treated as D-NEW-3) suggest budget is tight.
3. **Canary cvar default**: `kernel_emit_contention=false` (off by default) confirmed?
4. **Reading-error class**: should this plan reserve class #35 for "scheduling-philosophy divergences", or extend #30 ("scheduling determinism")?
5. **Spin-then-wait scope**: separate mini-phase or document-and-defer? Recommended defer.