[iterate-2E] Extend coherent monotonic clock to lockstep (timebase-desync livelock fix)

Lockstep livelocked the scheduler the same way --parallel did before
0332d19: the kernel deadline-arithmetic (`now_basis_at`) read per-thread
`ctx(hw_id).timebase`, but a parked/poll thread has `running_idx == None`
so `Scheduler::ctx()` returns `idle_ctx` (timebase 0). A poll thread (tid=7,
a `KeWaitForSingleObject` loop with a 30ms relative timeout) computing its
deadline via `parse_timeout` therefore read `now = 0` and registered
`deadline = 0 + 3000 = 3000` — a constant ~7.78M units in the past.
`coord_idle_advance` then re-armed that same constant 3000 deadline forever,
pinning virtual time and starving every other thread's real future deadline.
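
A toy model of the failure described above, as a hedged sketch rather than the emulator's code: `advanced_to_deadlines`, its parameters, and the 1,000-instructions-per-round figure are all hypothetical. It only illustrates why a zero basis yields one constant deadline while a coherent basis yields strictly increasing ones.

```rust
/// Toy model of the poll thread's relative-timeout loop (all names hypothetical).
/// Returns the deadline the coordinator "advanced to" in each round.
fn advanced_to_deadlines(rounds: u64, relative: u64, coherent_basis: bool) -> Vec<u64> {
    let mut global_clock: u64 = 0;
    let mut fires = Vec::with_capacity(rounds as usize);
    for round in 1..=rounds {
        // Guest progress: floor the clock up to an illustrative
        // retired-instruction count (assumed 1,000 per round).
        global_clock = global_clock.max(round * 1_000);
        // The bug: a parked thread's basis came from an idle context
        // whose timebase is 0, not from the shared clock.
        let now = if coherent_basis { global_clock } else { 0 };
        let deadline = now.saturating_add(relative);
        // Idle-advance to the registered deadline. A deadline already in
        // the past cannot move the clock forward.
        global_clock = global_clock.max(deadline);
        fires.push(deadline);
    }
    fires
}
```

In this model, `advanced_to_deadlines(5, 3000, false)` records 3000 five times (the pinned case), while `advanced_to_deadlines(5, 3000, true)` records a strictly increasing sequence.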

Render-gate impact: the submitter (tid=6) re-enters a 16ms-timeout
WaitForMultiple after its first jobs; that timeout never fired because vtime
was pinned at 3000, so virtual time never reached real future deadlines.

Fix (Option A — mirror the parallel fix): drive the existing deterministic
`Scheduler::global_clock` in lockstep too (floored up once per outer round
to `stats.instruction_count`, a pure function of retired guest instructions —
no wall-clock), and route `KernelState::now_basis_at` through `global_clock()`
in BOTH modes. New `Scheduler::advance_global_clock_to(now)` floor-up keeps it
monotone alongside `advance_all_timebases_to`. Parallel behavior unchanged
(it already read `global_clock()`).
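
The two update rules can be compared in isolation. This is a minimal sketch assuming a standalone `CoherentClock` type (hypothetical; the real field lives on `Scheduler`), with `advance_by` standing in for the per-block delta update and `advance_to` for the absolute floor-up.

```rust
/// Hypothetical stand-in for the scheduler's coherent clock
/// (not the real `Scheduler`).
#[derive(Debug, Default)]
struct CoherentClock {
    now: u64,
}

impl CoherentClock {
    /// Parallel-style update: add a per-block delta of retired instructions.
    fn advance_by(&mut self, delta: u64) {
        self.now = self.now.saturating_add(delta);
    }

    /// Lockstep-style update: floor up to an absolute counter.
    /// A smaller or equal value is a no-op, so the clock never moves backwards.
    fn advance_to(&mut self, absolute: u64) {
        self.now = self.now.max(absolute);
    }

    /// Current value of the clock.
    fn now(&self) -> u64 {
        self.now
    }
}
```

Because `advance_to` takes the maximum, repeating it with the same instruction count is idempotent, and a fast-forward that already pushed the clock past the instruction count is preserved instead of being double-counted as it would be with an additive update.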

Verified (lockstep, 50M):
- DETERMINISM: two cold `check -n 5M` and two cold `-n 50M` runs byte-identical.
- LIVELOCK GONE: "advanced to deadline" went from 592,679 fires / 2 unique
  values / 562,084 pinned at 3000  ->  18,586 fires / 18,567 unique /
  0 pinned, strictly increasing 5.4M -> 50M. Poll thread tid=7 now ends
  Blocked with a real future deadline Some(60002824) instead of spin-Ready
  on the past 3000.
- imports 1,790,936 -> 92,317 at 50M (the spin no longer burns import calls).

Cascade (lockstep, XENIA_CACHE_PERSIST=1, -n 200M): engine now runs to budget
instead of hard-deadlocking. Hub enqueue (sub_82458068) 4x; submitter dequeue
(sub_82458508) still 3x — the lost 4th-job HANDOFF (count/notify between
sub_82458068's tail and the submitter queue) is a SEPARATE downstream gate,
not the timebase. New gate: tid=5 (hub) Blocked INFINITE on event 0x1080
(job-4 completion); tid=6 (submitter) Ready, parked in WaitForMultiple
(sub_824AB214), loop-top stops at cycle 6.23M. draws still 0, VdSwap 1.

Golden re-baseline (same commit): sylpheed_n50m
  instructions 50000004 -> 50000007, imports 1790936 -> 92317
  (swaps/draws/RTs/shaders/textures unchanged). sylpheed_n2m unchanged
  (the livelock sets in after 2M). Suite 665/665 + oracle green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-12 21:42:28 +02:00
parent 5aaadfec36
commit 7e2603a9e5
4 changed files with 72 additions and 33 deletions


@@ -2830,6 +2830,19 @@ fn run_execution(
 // Both calls are no-ops when `XENIA_SILPH_UI_AUTOSIGNAL_DELAY`
 // is unset (the pending queue stays empty).
 kernel.set_now_cycle_hint(stats.instruction_count);
+// Drive the coherent monotonic "now" the kernel deadline-arithmetic
+// reads (`KernelState::now_basis_at` -> `Scheduler::global_clock`)
+// from the deterministic retired-instruction count. Floored up (never
+// backwards). This is the LOCKSTEP analogue of the parallel writeback's
+// `advance_global_clock`: a parked/poll thread computing a relative
+// timeout via `parse_timeout` now reads a real, non-zero, monotone
+// basis instead of `idle_ctx`'s timebase-0, so its deadline lands in
+// the future and `coord_idle_advance` stops re-arming the constant
+// past deadline forever (the timebase-desync livelock / render-gate
+// root). Pure function of guest instructions -> bit-reproducible.
+kernel
+    .scheduler
+    .advance_global_clock_to(stats.instruction_count);
 kernel.fire_due_silph_autosignals(stats.instruction_count);
 dispatch_graphics_interrupts(
     kernel,


@@ -1,6 +1,6 @@
 {
-  "instructions": 50000004,
-  "imports": 1790936,
+  "instructions": 50000007,
+  "imports": 92317,
   "unimpl": 0,
   "draws": 0,
   "swaps": 1,


@@ -351,18 +351,27 @@ pub struct Scheduler {
     /// Sorted by deadline ascending. Scheduler wakes the first entry via
     /// `advance_to_next_wake` when a round finds nothing runnable.
     timed_waits: Vec<(u64, ThreadRef)>,
-    /// Parallel-mode coherent monotonic clock. In `--parallel`, workers
-    /// extract their `PpcContext` (leaving a zeroed timebase in the slot)
-    /// and step unlocked, so `ctx(hw_id).timebase` is NOT a coherent "now"
-    /// — a coordinator that reads it can see a stale/zero basis decoupled
-    /// from the deadline it just advanced to, re-arming the same constant
-    /// deadline forever (timebase-desync livelock). This field is the
-    /// single authoritative "now" the parallel coordinator and kernel
-    /// deadline-arithmetic read instead. Advanced by `advance_global_clock`
-    /// (per-block retired-instruction count) on each parallel writeback and
-    /// floored up by `advance_all_timebases_to`. LOCKSTEP never reads it
-    /// (gated by `KernelState::parallel_active`), so it has zero effect on
-    /// the deterministic lockstep trace.
+    /// Coherent monotonic "now" clock — the single authoritative basis the
+    /// kernel deadline-arithmetic (`KernelState::now_basis_at`) reads in
+    /// BOTH execution modes. Per-thread `ctx(hw_id).timebase` is NOT a
+    /// coherent "now":
+    /// * In `--parallel`, workers extract their `PpcContext` (leaving a
+    /// zeroed timebase in the slot) and step unlocked.
+    /// * In **lockstep**, a parked/poll thread has `running_idx == None`,
+    /// so `ctx()` returns `idle_ctx` (timebase 0); a `parse_timeout`
+    /// reading that basis registers `deadline = 0 + relative`, a value
+    /// permanently in the past, and `coord_idle_advance` re-arms that
+    /// same constant deadline forever (timebase-desync livelock — the
+    /// render-gate root: the submitter's 16ms re-wait never fires).
+    /// So a coordinator/parked thread reading per-thread timebase can see a
+    /// stale/zero basis decoupled from the deadline it just advanced to.
+    /// This field is that coherent basis instead. It is DETERMINISTIC: a
+    /// pure function of retired guest instructions (never wall-clock).
+    /// Advanced by `advance_global_clock` (per-block retired count on each
+    /// parallel writeback), `advance_global_clock_to` (floored up to the
+    /// deterministic per-round `stats.instruction_count` in lockstep), and
+    /// floored up by `advance_all_timebases_to`. Two cold lockstep runs
+    /// read identical values, so the lockstep trace stays bit-reproducible.
     global_clock: u64,
     /// Global count of TLS slots allocated — `spawn` pre-sizes new threads'
     /// `tls_values` to this.
@@ -1146,13 +1155,26 @@ impl Scheduler {
     /// Advance the parallel-mode coherent clock by `n` retired instructions.
     /// Called from the parallel worker writeback with the block's executed
-    /// count so "now" tracks aggregate guest progress. Never called in
-    /// lockstep (the clock stays 0 and unread there).
+    /// count so "now" tracks aggregate guest progress.
     #[inline]
     pub fn advance_global_clock(&mut self, n: u64) {
         self.global_clock = self.global_clock.saturating_add(n);
     }
+    /// Floor the coherent clock up to `now` (monotonic; never goes
+    /// backwards). Used by the **lockstep** outer loop once per round to
+    /// track the deterministic retired-instruction count
+    /// (`stats.instruction_count`) as the single coherent "now". A plain
+    /// floor-up rather than `saturating_add` because the lockstep caller
+    /// passes an absolute monotonic counter (not a per-block delta), and
+    /// because `advance_all_timebases_to` may already have pushed
+    /// `global_clock` past the instruction count when fast-forwarding to a
+    /// future deadline — clamping with `max` keeps both sources monotone.
+    #[inline]
+    pub fn advance_global_clock_to(&mut self, now: u64) {
+        self.global_clock = self.global_clock.max(now);
+    }
     /// Fast-forward the timebase to the earliest pending timed wait and
     /// wake that sleeper. Used when a round had no Ready threads and no
     /// timer fires closer than the earliest wait. Returns the woken


@@ -1295,24 +1295,28 @@ impl KernelState {
         self.pending_timer_fires.first().map(|&(d, _)| d)
     }
-    /// Coherent "now" basis for deadline arithmetic, gated on execution mode.
+    /// Coherent "now" basis for deadline arithmetic — the scheduler's
+    /// single monotonic `global_clock`, in BOTH execution modes.
     ///
-    /// In **lockstep** (`parallel_active == false`) this returns exactly the
-    /// pre-existing per-thread `ctx(hw_id).timebase` each call site read
-    /// before, so the deterministic lockstep trace is byte-identical (no
-    /// golden re-baseline). In **parallel** (`parallel_active == true`) the
-    /// per-thread timebases are incoherent (workers extract/zero their slots
-    /// while stepping unlocked), so we return the scheduler's single
-    /// monotonic `global_clock` instead — the basis that breaks the
-    /// timebase-desync livelock. Callers pass the `hw_id` they would have
-    /// used for the lockstep `ctx()` read (slot 0 for coordinator-side
-    /// drains, the current thread's slot for in-guest waits).
-    pub fn now_basis_at(&self, hw_id: u8) -> u64 {
-        if self.parallel_active {
-            self.scheduler.global_clock()
-        } else {
-            self.scheduler.ctx(hw_id).timebase
-        }
+    /// Per-thread `ctx(hw_id).timebase` is NOT a sound "now" for deadline
+    /// arithmetic: in `--parallel` workers extract/zero their slots while
+    /// stepping unlocked, and in **lockstep** a parked/poll thread has
+    /// `running_idx == None` so `ctx()` returns `idle_ctx` (timebase 0).
+    /// Either way a `parse_timeout` reading the per-thread basis can see 0
+    /// (or a stale value) and register `deadline = 0 + relative`, a value
+    /// permanently in the past, which `coord_idle_advance` then re-arms
+    /// forever (the timebase-desync livelock; the render-gate root). The
+    /// `global_clock` is a deterministic function of retired guest
+    /// instructions (per-round `stats.instruction_count` floor-ups in
+    /// lockstep, per-block retired counts in parallel), so it is coherent,
+    /// monotonic, never zero after boot, and bit-reproducible across two
+    /// cold lockstep runs.
+    ///
+    /// The `hw_id` argument is retained for call-site clarity (which slot a
+    /// caller would conceptually be "asking about") but is no longer read —
+    /// the basis is global.
+    pub fn now_basis_at(&self, _hw_id: u8) -> u64 {
+        self.scheduler.global_clock()
     }
     /// Fire every timer whose deadline is `<= now` (derived from slot 0's