[iterate-2E] Extend coherent monotonic clock to lockstep (timebase-desync livelock fix)

Lockstep livelocked the scheduler the same way --parallel did before 0332d19. The kernel deadline arithmetic (`now_basis_at`) read the per-thread `ctx(hw_id).timebase`, but a parked/poll thread has `running_idx == None`, so `Scheduler::ctx()` returns `idle_ctx` (timebase 0). A poll thread (tid=7, a `KeWaitForSingleObject` loop with a 30ms relative timeout) computing its deadline via `parse_timeout` therefore read `now = 0` and registered `deadline = 0 + 3000 = 3000`, a constant ~7.78M units in the past. `coord_idle_advance` then re-armed that same constant 3000 deadline forever, pinning virtual time and starving every other thread's real future deadline.

Render-gate impact: the submitter (tid=6) re-enters a 16ms-timeout WaitForMultiple after its first jobs; that timeout never fired because vtime was pinned at 3000 and so never reached any real future deadline.

Fix (Option A, mirroring the parallel fix): drive the existing deterministic `Scheduler::global_clock` in lockstep too (floored up once per outer round to `stats.instruction_count`, a pure function of retired guest instructions, with no wall-clock input), and route `KernelState::now_basis_at` through `global_clock()` in BOTH modes. The new `Scheduler::advance_global_clock_to(now)` floor-up keeps the clock monotone alongside `advance_all_timebases_to`. Parallel behavior is unchanged (it already read `global_clock()`).

Verified (lockstep, 50M):
- DETERMINISM: two cold `check -n 5M` and two cold `-n 50M` runs byte-identical.
- LIVELOCK GONE: "advanced to deadline" went from 592,679 fires / 2 unique values / 562,084 pinned at 3000 -> 18,586 fires / 18,567 unique / 0 pinned, strictly increasing 5.4M -> 50M. Poll thread tid=7 now ends Blocked with a real future deadline Some(60002824) instead of spin-Ready on the past 3000.
- imports 1,790,936 -> 92,317 at 50M (the spin no longer burns import calls).

Cascade (lockstep, XENIA_CACHE_PERSIST=1, -n 200M): engine now runs to budget instead of hard-deadlocking. Hub enqueue (sub_82458068) 4x; submitter dequeue (sub_82458508) still 3x. The lost 4th-job HANDOFF (count/notify between sub_82458068's tail and the submitter queue) is a SEPARATE downstream gate, not the timebase. New gate: tid=5 (hub) Blocked INFINITE on event 0x1080 (job-4 completion); tid=6 (submitter) Ready, parked in WaitForMultiple (sub_824AB214), loop-top stops at cycle 6.23M. Draws still 0, VdSwap 1.

Golden re-baseline (same commit): sylpheed_n50m instructions 50000004 -> 50000007, imports 1790936 -> 92317 (swaps/draws/RTs/shaders/textures unchanged). sylpheed_n2m unchanged (the livelock sets in after 2M).

Suite 665/665 + oracle green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
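
The failure mode and the fix described in this message can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the emulator's code: `register_deadline`, `deadlines_with_idle_basis`, and `deadlines_with_global_clock` are hypothetical names, and the per-round retired counts are made-up inputs. Only the 3000-unit relative timeout is taken from the message.

```rust
/// Deadline a relative wait registers, given the "now" basis it read.
/// (Hypothetical helper; stands in for the arithmetic done after
/// `parse_timeout`.)
fn register_deadline(now_basis: u64, relative: u64) -> u64 {
    now_basis.saturating_add(relative)
}

/// Pre-fix lockstep behaviour for a parked thread: every round reads the
/// idle context's timebase (0), so the registered deadline never moves.
fn deadlines_with_idle_basis(rounds: usize, relative: u64) -> Vec<u64> {
    (0..rounds).map(|_| register_deadline(0, relative)).collect()
}

/// Post-fix behaviour: once per round the coherent clock is floored up to
/// the absolute retired-instruction count, and the wait reads that clock.
fn deadlines_with_global_clock(retired_per_round: &[u64], relative: u64) -> Vec<u64> {
    let mut retired_total = 0u64; // plays the role of stats.instruction_count
    let mut global_clock = 0u64;
    retired_per_round
        .iter()
        .map(|&n| {
            retired_total = retired_total.saturating_add(n);
            // Floor-up: the clock never moves backwards.
            global_clock = global_clock.max(retired_total);
            register_deadline(global_clock, relative)
        })
        .collect()
}
```

With a 3000-unit relative timeout, `deadlines_with_idle_basis(3, 3000)` returns the same deadline every round, whereas `deadlines_with_global_clock(&[1_000_000, 2_000_000, 500_000], 3000)` returns strictly increasing deadlines, so an idle-advance step always has a new future deadline to move to.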
2026-06-12 21:42:28 +02:00
parent 5aaadfec36
commit 7e2603a9e5
4 changed files with 72 additions and 33 deletions
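
The commit message distinguishes a delta-style advance (parallel writeback) from an absolute floor-up (lockstep outer loop). The sketch below shows why `max` is the natural choice when the caller passes an absolute counter and another source may already have pushed the clock further ahead. `ToyClock`, `advance_by`, `advance_to`, and `replay_floor_ups` are hypothetical names chosen for illustration; the real methods live on `Scheduler` and may differ in detail.

```rust
/// Hypothetical stand-in for the scheduler's coherent clock.
#[derive(Debug, Default)]
struct ToyClock {
    now: u64,
}

impl ToyClock {
    /// Delta-style advance: the caller passes how many instructions a block
    /// retired. Saturates instead of wrapping on overflow.
    fn advance_by(&mut self, retired_in_block: u64) {
        self.now = self.now.saturating_add(retired_in_block);
    }

    /// Absolute floor-up: the caller passes a monotonic counter. If the
    /// clock is already past it (for example after a fast-forward to a
    /// future deadline), the clock is left unchanged.
    fn advance_to(&mut self, absolute: u64) {
        self.now = self.now.max(absolute);
    }
}

/// Apply a sequence of absolute targets and record the clock after each.
fn replay_floor_ups(targets: &[u64]) -> Vec<u64> {
    let mut clock = ToyClock::default();
    targets
        .iter()
        .map(|&t| {
            clock.advance_to(t);
            clock.now
        })
        .collect()
}
```

For the targets 5_000, 60_000, 50_000, 70_000 the recorded values are 5_000, 60_000, 60_000, 70_000: the third target is below the current clock and is ignored, so the sequence stays monotone regardless of which source advanced the clock last.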
--- a/crates/xenia-app/src/main.rs
+++ b/crates/xenia-app/src/main.rs
@@ -2830,6 +2830,19 @@ fn run_execution(
        // Both calls are no-ops when `XENIA_SILPH_UI_AUTOSIGNAL_DELAY`
        // is unset (the pending queue stays empty).
        kernel.set_now_cycle_hint(stats.instruction_count);
        // Drive the coherent monotonic "now" the kernel deadline-arithmetic
        // reads (`KernelState::now_basis_at` -> `Scheduler::global_clock`)
        // from the deterministic retired-instruction count. Floored up (never
        // backwards). This is the LOCKSTEP analogue of the parallel writeback's
        // `advance_global_clock`: a parked/poll thread computing a relative
        // timeout via `parse_timeout` now reads a real, non-zero, monotone
        // basis instead of `idle_ctx`'s timebase-0, so its deadline lands in
        // the future and `coord_idle_advance` stops re-arming the constant
        // past deadline forever (the timebase-desync livelock / render-gate
        // root). Pure function of guest instructions -> bit-reproducible.
        kernel
            .scheduler
            .advance_global_clock_to(stats.instruction_count);
        kernel.fire_due_silph_autosignals(stats.instruction_count);
        dispatch_graphics_interrupts(
            kernel,
--- a/crates/xenia-app/tests/golden/sylpheed_n50m.json
+++ b/crates/xenia-app/tests/golden/sylpheed_n50m.json
@@ -1,6 +1,6 @@
 {
-  "instructions": 50000004,
+  "instructions": 50000007,
-  "imports": 1790936,
+  "imports": 92317,
  "unimpl": 0,
  "draws": 0,
  "swaps": 1,
--- a/crates/xenia-cpu/src/scheduler.rs
+++ b/crates/xenia-cpu/src/scheduler.rs
@@ -351,18 +351,27 @@ pub struct Scheduler {
    /// Sorted by deadline ascending. Scheduler wakes the first entry via
    /// `advance_to_next_wake` when a round finds nothing runnable.
    timed_waits: Vec<(u64, ThreadRef)>,
-    /// Parallel-mode coherent monotonic clock. In `--parallel`, workers
-    /// extract their `PpcContext` (leaving a zeroed timebase in the slot)
-    /// and step unlocked, so `ctx(hw_id).timebase` is NOT a coherent "now"
-    /// — a coordinator that reads it can see a stale/zero basis decoupled
-    /// from the deadline it just advanced to, re-arming the same constant
-    /// deadline forever (timebase-desync livelock). This field is the
-    /// single authoritative "now" the parallel coordinator and kernel
-    /// deadline-arithmetic read instead. Advanced by `advance_global_clock`
-    /// (per-block retired-instruction count) on each parallel writeback and
-    /// floored up by `advance_all_timebases_to`. LOCKSTEP never reads it
-    /// (gated by `KernelState::parallel_active`), so it has zero effect on
-    /// the deterministic lockstep trace.
+    /// Coherent monotonic "now" clock — the single authoritative basis the
+    /// kernel deadline-arithmetic (`KernelState::now_basis_at`) reads in
+    /// BOTH execution modes. Per-thread `ctx(hw_id).timebase` is NOT a
+    /// coherent "now":
+    ///   * In `--parallel`, workers extract their `PpcContext` (leaving a
+    ///     zeroed timebase in the slot) and step unlocked.
+    ///   * In **lockstep**, a parked/poll thread has `running_idx == None`,
+    ///     so `ctx()` returns `idle_ctx` (timebase 0); a `parse_timeout`
+    ///     reading that basis registers `deadline = 0 + relative`, a value
+    ///     permanently in the past, and `coord_idle_advance` re-arms that
+    ///     same constant deadline forever (timebase-desync livelock — the
+    ///     render-gate root: the submitter's 16ms re-wait never fires).
+    /// So a coordinator/parked thread reading per-thread timebase can see a
+    /// stale/zero basis decoupled from the deadline it just advanced to.
+    /// This field is that coherent basis instead. It is DETERMINISTIC: a
+    /// pure function of retired guest instructions (never wall-clock).
+    /// Advanced by `advance_global_clock` (per-block retired count on each
+    /// parallel writeback), `advance_global_clock_to` (floored up to the
+    /// deterministic per-round `stats.instruction_count` in lockstep), and
+    /// floored up by `advance_all_timebases_to`. Two cold lockstep runs
+    /// read identical values, so the lockstep trace stays bit-reproducible.
    global_clock: u64,
    /// Global count of TLS slots allocated — `spawn` pre-sizes new threads'
    /// `tls_values` to this.
@@ -1146,13 +1155,26 @@ impl Scheduler {
    /// Advance the parallel-mode coherent clock by `n` retired instructions.
    /// Called from the parallel worker writeback with the block's executed
-    /// count so "now" tracks aggregate guest progress. Never called in
-    /// lockstep (the clock stays 0 and unread there).
+    /// count so "now" tracks aggregate guest progress. Lockstep drives the
+    /// same clock through `advance_global_clock_to` instead.
    #[inline]
    pub fn advance_global_clock(&mut self, n: u64) {
        self.global_clock = self.global_clock.saturating_add(n);
    }
    /// Floor the coherent clock up to `now` (monotonic; never goes
    /// backwards). Used by the **lockstep** outer loop once per round to
    /// track the deterministic retired-instruction count
    /// (`stats.instruction_count`) as the single coherent "now". A plain
    /// floor-up rather than `saturating_add` because the lockstep caller
    /// passes an absolute monotonic counter (not a per-block delta), and
    /// because `advance_all_timebases_to` may already have pushed
    /// `global_clock` past the instruction count when fast-forwarding to a
    /// future deadline — clamping with `max` keeps both sources monotone.
    #[inline]
    pub fn advance_global_clock_to(&mut self, now: u64) {
        self.global_clock = self.global_clock.max(now);
    }
    /// Fast-forward the timebase to the earliest pending timed wait and
    /// wake that sleeper. Used when a round had no Ready threads and no
    /// timer fires closer than the earliest wait. Returns the woken
--- a/crates/xenia-kernel/src/state.rs
+++ b/crates/xenia-kernel/src/state.rs
@@ -1295,24 +1295,28 @@ impl KernelState {
        self.pending_timer_fires.first().map(|&(d, _)| d)
    }
-    /// Coherent "now" basis for deadline arithmetic, gated on execution mode.
+    /// Coherent "now" basis for deadline arithmetic — the scheduler's
+    /// single monotonic `global_clock`, in BOTH execution modes.
    ///
-    /// In **lockstep** (`parallel_active == false`) this returns exactly the
-    /// pre-existing per-thread `ctx(hw_id).timebase` each call site read
-    /// before, so the deterministic lockstep trace is byte-identical (no
-    /// golden re-baseline). In **parallel** (`parallel_active == true`) the
-    /// per-thread timebases are incoherent (workers extract/zero their slots
-    /// while stepping unlocked), so we return the scheduler's single
-    /// monotonic `global_clock` instead — the basis that breaks the
-    /// timebase-desync livelock. Callers pass the `hw_id` they would have
-    /// used for the lockstep `ctx()` read (slot 0 for coordinator-side
-    /// drains, the current thread's slot for in-guest waits).
-    pub fn now_basis_at(&self, hw_id: u8) -> u64 {
-        if self.parallel_active {
-            self.scheduler.global_clock()
-        } else {
-            self.scheduler.ctx(hw_id).timebase
-        }
+    /// Per-thread `ctx(hw_id).timebase` is NOT a sound "now" for deadline
+    /// arithmetic: in `--parallel` workers extract/zero their slots while
+    /// stepping unlocked, and in **lockstep** a parked/poll thread has
+    /// `running_idx == None` so `ctx()` returns `idle_ctx` (timebase 0).
+    /// Either way a `parse_timeout` reading the per-thread basis can see 0
+    /// (or a stale value) and register `deadline = 0 + relative`, a value
+    /// permanently in the past, which `coord_idle_advance` then re-arms
+    /// forever (the timebase-desync livelock; the render-gate root). The
+    /// `global_clock` is a deterministic function of retired guest
+    /// instructions (per-round `stats.instruction_count` floor-ups in
+    /// lockstep, per-block retired counts in parallel), so it is coherent,
+    /// monotonic, never zero after boot, and bit-reproducible across two
+    /// cold lockstep runs.
+    ///
+    /// The `hw_id` argument is retained for call-site clarity (which slot a
+    /// caller would conceptually be "asking about") but is no longer read —
+    /// the basis is global.
+    pub fn now_basis_at(&self, _hw_id: u8) -> u64 {
+        self.scheduler.global_clock()
    }
    /// Fire every timer whose deadline is `<= now` (derived from slot 0's