[iterate-2E] Extend coherent monotonic clock to lockstep (timebase-desync livelock fix)

Lockstep livelocked the scheduler the same way --parallel did before
0332d19: the kernel deadline-arithmetic (`now_basis_at`) read per-thread
`ctx(hw_id).timebase`, but a parked/poll thread has `running_idx == None`
so `Scheduler::ctx()` returns `idle_ctx` (timebase 0). A poll thread (tid=7,
a `KeWaitForSingleObject` loop with a 30ms relative timeout) computing its
deadline via `parse_timeout` therefore read `now = 0` and registered
`deadline = 0 + 3000 = 3000` — a constant ~7.78M units in the past.
`coord_idle_advance` then re-armed that same constant 3000 deadline forever,
pinning virtual time and starving every other thread's real future deadline.
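
The stale-basis arithmetic above can be reproduced in miniature. This is a
simplified sketch, not the emulator's real code: `Sched`, `ctx_timebase`, and
`deadline_from` are hypothetical names, and the 7,783,000 "now" is an
illustrative value chosen to match the ~7.78M gap described above.

```rust
/// Hypothetical, simplified model of the stale-basis bug (not the real scheduler).
struct Sched {
    /// `Some(tb)` while a thread is running; `None` for a parked/poll thread.
    running_timebase: Option<u64>,
    /// Coherent "now" shared by every thread.
    global_clock: u64,
}

impl Sched {
    /// Per-thread view: with nothing running, this falls back to the idle
    /// context, whose timebase is 0.
    fn ctx_timebase(&self) -> u64 {
        self.running_timebase.unwrap_or(0)
    }
}

/// Turn a relative timeout into an absolute deadline on the given basis.
fn deadline_from(now: u64, relative: u64) -> u64 {
    now.saturating_add(relative)
}

fn main() {
    // Illustrative state: the poll thread is parked, virtual time is ~7.78M.
    let sched = Sched { running_timebase: None, global_clock: 7_783_000 };
    let relative = 3000;

    // Buggy basis: the parked thread reads 0, so the deadline is already past.
    let stale = deadline_from(sched.ctx_timebase(), relative);
    assert_eq!(stale, 3000);
    assert!(stale < sched.global_clock);

    // Coherent basis: the deadline lands in the future, so the wait can block.
    let coherent = deadline_from(sched.global_clock, relative);
    assert!(coherent > sched.global_clock);

    println!("stale deadline = {stale}, coherent deadline = {coherent}");
}
```

Because the stale deadline never changes, re-arming it cannot move virtual
time forward, which is the pinning behavior described above.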

Render-gate impact: the submitter (tid=6) re-enters a 16ms-timeout
WaitForMultiple after its first jobs; that timeout never fired because
virtual time was pinned at 3000 and never reached any real future deadline.

Fix (Option A — mirror the parallel fix): drive the existing deterministic
`Scheduler::global_clock` in lockstep too (floored up once per outer round
to `stats.instruction_count`, a pure function of retired guest instructions —
no wall-clock), and route `KernelState::now_basis_at` through `global_clock()`
in BOTH modes. New `Scheduler::advance_global_clock_to(now)` floor-up keeps it
monotone alongside `advance_all_timebases_to`. Parallel behavior unchanged
(it already read `global_clock()`).
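
The floor-up behavior can be sketched in isolation. This is a minimal model
under stated assumptions: `Clock` and `simulate` are hypothetical stand-ins,
and only the `max`-based update mirrors what `advance_global_clock_to` is
described as doing. The input sequence models an idle fast-forward to a future
deadline landing between two per-round instruction counts.

```rust
/// Hypothetical stand-in for the scheduler's coherent clock.
#[derive(Default)]
struct Clock {
    global_clock: u64,
}

impl Clock {
    /// Floor the clock up to an absolute value; it never moves backwards.
    fn advance_to(&mut self, now: u64) {
        self.global_clock = self.global_clock.max(now);
    }

    fn now(&self) -> u64 {
        self.global_clock
    }
}

/// Feed a sequence of absolute "now" candidates and record the clock after each.
fn simulate(updates: &[u64]) -> Vec<u64> {
    let mut clock = Clock::default();
    updates
        .iter()
        .map(|&u| {
            clock.advance_to(u);
            clock.now()
        })
        .collect()
}

fn main() {
    // Round 1 count, idle fast-forward to a deadline, round 2 count (still
    // behind the fast-forward), round 3 count.
    let trace = simulate(&[1_000, 5_000, 2_000, 6_000]);
    assert_eq!(trace, vec![1_000, 5_000, 5_000, 6_000]);

    // Monotone: each reading is >= the previous one.
    assert!(trace.windows(2).all(|w| w[0] <= w[1]));
    println!("{trace:?}");
}
```

Adding a per-round delta instead would double-count here, since the lockstep
caller passes an absolute counter rather than a per-block increment.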

Verified (lockstep, 50M):
- DETERMINISM: two cold `check -n 5M` and two cold `-n 50M` runs byte-identical.
- LIVELOCK GONE: "advanced to deadline" went from 592,679 fires / 2 unique
  values / 562,084 pinned at 3000  ->  18,586 fires / 18,567 unique /
  0 pinned, strictly increasing 5.4M -> 50M. Poll thread tid=7 now ends
  Blocked with a real future deadline Some(60002824) instead of spin-Ready
  on the past 3000.
- imports 1,790,936 -> 92,317 at 50M (the spin no longer burns import calls).

Cascade (lockstep, XENIA_CACHE_PERSIST=1, -n 200M): engine now runs to budget
instead of hard-deadlocking. Hub enqueue (sub_82458068) 4x; submitter dequeue
(sub_82458508) still 3x — the lost 4th-job HANDOFF (count/notify between
sub_82458068's tail and the submitter queue) is a SEPARATE downstream gate,
not the timebase. New gate: tid=5 (hub) Blocked INFINITE on event 0x1080
(job-4 completion); tid=6 (submitter) Ready, parked in WaitForMultiple
(sub_824AB214), loop-top stops at cycle 6.23M. draws still 0, VdSwap 1.

Golden re-baseline (same commit): sylpheed_n50m
  instructions 50000004 -> 50000007, imports 1790936 -> 92317
  (swaps/draws/RTs/shaders/textures unchanged). sylpheed_n2m unchanged
  (the livelock sets in only after 2M). Suite 665/665 + oracle green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-12 21:42:28 +02:00
parent 5aaadfec36
commit 7e2603a9e5
4 changed files with 72 additions and 33 deletions


@@ -351,18 +351,27 @@ pub struct Scheduler {
     /// Sorted by deadline ascending. Scheduler wakes the first entry via
     /// `advance_to_next_wake` when a round finds nothing runnable.
     timed_waits: Vec<(u64, ThreadRef)>,
-    /// Parallel-mode coherent monotonic clock. In `--parallel`, workers
-    /// extract their `PpcContext` (leaving a zeroed timebase in the slot)
-    /// and step unlocked, so `ctx(hw_id).timebase` is NOT a coherent "now"
-    /// — a coordinator that reads it can see a stale/zero basis decoupled
-    /// from the deadline it just advanced to, re-arming the same constant
-    /// deadline forever (timebase-desync livelock). This field is the
-    /// single authoritative "now" the parallel coordinator and kernel
-    /// deadline-arithmetic read instead. Advanced by `advance_global_clock`
-    /// (per-block retired-instruction count) on each parallel writeback and
-    /// floored up by `advance_all_timebases_to`. LOCKSTEP never reads it
-    /// (gated by `KernelState::parallel_active`), so it has zero effect on
-    /// the deterministic lockstep trace.
+    /// Coherent monotonic "now" clock — the single authoritative basis the
+    /// kernel deadline-arithmetic (`KernelState::now_basis_at`) reads in
+    /// BOTH execution modes. Per-thread `ctx(hw_id).timebase` is NOT a
+    /// coherent "now":
+    /// * In `--parallel`, workers extract their `PpcContext` (leaving a
+    /// zeroed timebase in the slot) and step unlocked.
+    /// * In **lockstep**, a parked/poll thread has `running_idx == None`,
+    /// so `ctx()` returns `idle_ctx` (timebase 0); a `parse_timeout`
+    /// reading that basis registers `deadline = 0 + relative`, a value
+    /// permanently in the past, and `coord_idle_advance` re-arms that
+    /// same constant deadline forever (timebase-desync livelock — the
+    /// render-gate root: the submitter's 16ms re-wait never fires).
+    /// So a coordinator/parked thread reading per-thread timebase can see a
+    /// stale/zero basis decoupled from the deadline it just advanced to.
+    /// This field is that coherent basis instead. It is DETERMINISTIC: a
+    /// pure function of retired guest instructions (never wall-clock).
+    /// Advanced by `advance_global_clock` (per-block retired count on each
+    /// parallel writeback), `advance_global_clock_to` (floored up to the
+    /// deterministic per-round `stats.instruction_count` in lockstep), and
+    /// floored up by `advance_all_timebases_to`. Two cold lockstep runs
+    /// read identical values, so the lockstep trace stays bit-reproducible.
     global_clock: u64,
     /// Global count of TLS slots allocated — `spawn` pre-sizes new threads'
     /// `tls_values` to this.
@@ -1146,13 +1155,26 @@ impl Scheduler {
     /// Advance the parallel-mode coherent clock by `n` retired instructions.
     /// Called from the parallel worker writeback with the block's executed
-    /// count so "now" tracks aggregate guest progress. Never called in
-    /// lockstep (the clock stays 0 and unread there).
+    /// count so "now" tracks aggregate guest progress.
     #[inline]
     pub fn advance_global_clock(&mut self, n: u64) {
         self.global_clock = self.global_clock.saturating_add(n);
     }
+    /// Floor the coherent clock up to `now` (monotonic; never goes
+    /// backwards). Used by the **lockstep** outer loop once per round to
+    /// track the deterministic retired-instruction count
+    /// (`stats.instruction_count`) as the single coherent "now". A plain
+    /// floor-up rather than `saturating_add` because the lockstep caller
+    /// passes an absolute monotonic counter (not a per-block delta), and
+    /// because `advance_all_timebases_to` may already have pushed
+    /// `global_clock` past the instruction count when fast-forwarding to a
+    /// future deadline — clamping with `max` keeps both sources monotone.
+    #[inline]
+    pub fn advance_global_clock_to(&mut self, now: u64) {
+        self.global_clock = self.global_clock.max(now);
+    }
     /// Fast-forward the timebase to the earliest pending timed wait and
     /// wake that sleeper. Used when a round had no Ready threads and no
     /// timer fires closer than the earliest wait. Returns the woken