[iterate-2E] Extend coherent monotonic clock to lockstep (timebase-desync livelock fix)

Lockstep livelocked the scheduler the same way --parallel did before
0332d19: the kernel deadline-arithmetic (`now_basis_at`) read per-thread
`ctx(hw_id).timebase`, but a parked/poll thread has `running_idx == None`
so `Scheduler::ctx()` returns `idle_ctx` (timebase 0). A poll thread (tid=7,
a `KeWaitForSingleObject` loop with a 30ms relative timeout) computing its
deadline via `parse_timeout` therefore read `now = 0` and registered
`deadline = 0 + 3000 = 3000` — a constant ~7.78M units in the past.
`coord_idle_advance` then re-armed that same constant 3000 deadline forever,
pinning virtual time and starving every other thread's real future deadline.
Render-gate impact: the submitter (tid=6) re-enters a 16ms-timeout
WaitForMultiple after its first jobs; that timeout never fired because
virtual time was pinned at 3000 and never reached any real future deadline.
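
Toy model of the failure described above, for readers who want the arithmetic spelled out. It is a sketch under stated assumptions, not the engine's `coord_idle_advance`: `submitter_wakes`, the starting `now` of 7,780,000, and the 1,600-unit submitter timeout (16ms at the same scale as 30ms = 3000) are all hypothetical. It only shows that a deadline computed from a zero basis always sorts first, so the submitter's real deadline is never selected.

```rust
const POLL_TIMEOUT: u64 = 3_000; // 30ms relative timeout, as in the message
const SUBMIT_TIMEOUT: u64 = 1_600; // 16ms at the same scale (assumed)

/// Counts how many of `rounds` idle advances wake the submitter.
/// `coherent_basis == false` models the bug: the poll thread reads basis 0.
fn submitter_wakes(rounds: u32, coherent_basis: bool) -> u32 {
    let mut now: u64 = 7_780_000; // illustrative "real" virtual time
    let basis = |now: u64| if coherent_basis { now } else { 0 };

    let mut poll_deadline = basis(now) + POLL_TIMEOUT;
    let mut submit_deadline = now + SUBMIT_TIMEOUT;
    let mut wakes = 0;

    for _ in 0..rounds {
        if poll_deadline <= submit_deadline {
            // Earliest wait belongs to the poll thread: wake it, it re-arms.
            now = now.max(poll_deadline);
            poll_deadline = basis(now) + POLL_TIMEOUT;
        } else {
            // Earliest wait belongs to the submitter: its timeout fires.
            now = now.max(submit_deadline);
            wakes += 1;
            submit_deadline = now + SUBMIT_TIMEOUT;
        }
    }
    wakes
}

fn main() {
    println!("stale basis:    {} submitter wakes", submitter_wakes(10, false));
    println!("coherent basis: {} submitter wakes", submitter_wakes(10, true));
}
```

With the stale basis the poll deadline is 3000 on every round, so the submitter is starved for as many rounds as the loop runs.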
Fix (Option A — mirror the parallel fix): drive the existing deterministic
`Scheduler::global_clock` in lockstep too (floored up once per outer round
to `stats.instruction_count`, a pure function of retired guest instructions —
no wall-clock), and route `KernelState::now_basis_at` through `global_clock()`
in BOTH modes. New `Scheduler::advance_global_clock_to(now)` floor-up keeps it
monotone alongside `advance_all_timebases_to`. Parallel behavior unchanged
(it already read `global_clock()`).
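
Sketch of the once-per-round floor-up, assuming a stand-alone `Clock` type and a hypothetical `run_rounds` driver (neither exists in the tree; the real fast-forward goes through `advance_all_timebases_to`, modelled here as the same `max`). It shows why an absolute counter is applied with `max`: a fast-forward can push the clock past the instruction count, and later rounds must not pull it back.

```rust
#[derive(Default)]
struct Clock {
    global_clock: u64,
}

impl Clock {
    /// Floor the clock up to `now`; never moves backwards.
    fn advance_global_clock_to(&mut self, now: u64) {
        self.global_clock = self.global_clock.max(now);
    }

    fn global_clock(&self) -> u64 {
        self.global_clock
    }
}

/// Hypothetical lockstep driver: one entry of `instruction_counts` per outer
/// round (absolute retired-instruction totals). `fast_forward` optionally
/// jumps the clock to a future deadline at the start of the given round.
/// Returns the clock reading at the end of each round.
fn run_rounds(instruction_counts: &[u64], fast_forward: Option<(usize, u64)>) -> Vec<u64> {
    let mut clock = Clock::default();
    let mut readings = Vec::with_capacity(instruction_counts.len());

    for (round, &count) in instruction_counts.iter().enumerate() {
        if let Some((at, deadline)) = fast_forward {
            if at == round {
                clock.advance_global_clock_to(deadline);
            }
        }
        // Once per round: floor up to the deterministic instruction count.
        clock.advance_global_clock_to(count);
        readings.push(clock.global_clock());
    }
    readings
}

fn main() {
    let readings = run_rounds(&[100, 200, 300, 400], Some((1, 350)));
    println!("{:?}", readings);
}
```

The readings depend only on the inputs, so two runs with the same instruction counts see the same clock values.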
Verified (lockstep, 50M):
- DETERMINISM: two cold `check -n 5M` and two cold `-n 50M` runs byte-identical.
- LIVELOCK GONE: "advanced to deadline" went from 592,679 fires / 2 unique
values / 562,084 pinned at 3000 -> 18,586 fires / 18,567 unique /
0 pinned, strictly increasing 5.4M -> 50M. Poll thread tid=7 now ends
Blocked with a real future deadline Some(60002824) instead of spin-Ready
on the past 3000.
- imports 1,790,936 -> 92,317 at 50M (the spin no longer burns import calls).
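
For reference, the fires / unique / pinned figures can be computed from the sequence of "advanced to deadline" values roughly as below. This is a hypothetical helper (`deadline_stats` is not in the tree, and extracting the values from the log is left out). It checks non-decreasing order rather than strict increase per fire, since 18,586 fires with 18,567 unique values implies a few repeats.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
struct DeadlineStats {
    fires: usize,         // total "advanced to deadline" events
    unique: usize,        // distinct deadline values
    pinned: usize,        // events equal to `pinned_value`
    non_decreasing: bool, // values never go backwards
}

/// Summarise a sequence of deadline values taken in firing order.
fn deadline_stats(deadlines: &[u64], pinned_value: u64) -> DeadlineStats {
    let unique = deadlines.iter().collect::<HashSet<_>>().len();
    let pinned = deadlines.iter().filter(|&&d| d == pinned_value).count();
    let non_decreasing = deadlines.windows(2).all(|w| w[0] <= w[1]);
    DeadlineStats {
        fires: deadlines.len(),
        unique,
        pinned,
        non_decreasing,
    }
}

fn main() {
    // Made-up sample data, only to exercise the helper.
    let livelocked = [3_000, 3_000, 5_000, 3_000];
    let healthy = [10, 20, 20, 30];
    println!("{:?}", deadline_stats(&livelocked, 3_000));
    println!("{:?}", deadline_stats(&healthy, 3_000));
}
```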
Cascade (lockstep, XENIA_CACHE_PERSIST=1, -n 200M): engine now runs to budget
instead of hard-deadlocking. Hub enqueue (sub_82458068) runs 4x; submitter
dequeue (sub_82458508) still runs 3x. The lost 4th-job HANDOFF (count/notify
between sub_82458068's tail and the submitter queue) is a SEPARATE downstream
gate, not the timebase. New gate: tid=5 (hub) Blocked INFINITE on event 0x1080
(job-4 completion); tid=6 (submitter) Ready, parked in WaitForMultiple
(sub_824AB214), loop-top stops at cycle 6.23M. Draws still 0, VdSwap 1.
Golden re-baseline (same commit): sylpheed_n50m
instructions 50000004 -> 50000007, imports 1790936 -> 92317
(swaps/draws/RTs/shaders/textures unchanged). sylpheed_n2m unchanged
(the livelock sets in only after 2M). Suite 665/665 + oracle green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -351,18 +351,27 @@ pub struct Scheduler {
     /// Sorted by deadline ascending. Scheduler wakes the first entry via
     /// `advance_to_next_wake` when a round finds nothing runnable.
     timed_waits: Vec<(u64, ThreadRef)>,
-    /// Parallel-mode coherent monotonic clock. In `--parallel`, workers
-    /// extract their `PpcContext` (leaving a zeroed timebase in the slot)
-    /// and step unlocked, so `ctx(hw_id).timebase` is NOT a coherent "now"
-    /// — a coordinator that reads it can see a stale/zero basis decoupled
-    /// from the deadline it just advanced to, re-arming the same constant
-    /// deadline forever (timebase-desync livelock). This field is the
-    /// single authoritative "now" the parallel coordinator and kernel
-    /// deadline-arithmetic read instead. Advanced by `advance_global_clock`
-    /// (per-block retired-instruction count) on each parallel writeback and
-    /// floored up by `advance_all_timebases_to`. LOCKSTEP never reads it
-    /// (gated by `KernelState::parallel_active`), so it has zero effect on
-    /// the deterministic lockstep trace.
+    /// Coherent monotonic "now" clock — the single authoritative basis the
+    /// kernel deadline-arithmetic (`KernelState::now_basis_at`) reads in
+    /// BOTH execution modes. Per-thread `ctx(hw_id).timebase` is NOT a
+    /// coherent "now":
+    /// * In `--parallel`, workers extract their `PpcContext` (leaving a
+    /// zeroed timebase in the slot) and step unlocked.
+    /// * In **lockstep**, a parked/poll thread has `running_idx == None`,
+    /// so `ctx()` returns `idle_ctx` (timebase 0); a `parse_timeout`
+    /// reading that basis registers `deadline = 0 + relative`, a value
+    /// permanently in the past, and `coord_idle_advance` re-arms that
+    /// same constant deadline forever (timebase-desync livelock — the
+    /// render-gate root: the submitter's 16ms re-wait never fires).
+    /// So a coordinator/parked thread reading per-thread timebase can see a
+    /// stale/zero basis decoupled from the deadline it just advanced to.
+    /// This field is that coherent basis instead. It is DETERMINISTIC: a
+    /// pure function of retired guest instructions (never wall-clock).
+    /// Advanced by `advance_global_clock` (per-block retired count on each
+    /// parallel writeback), `advance_global_clock_to` (floored up to the
+    /// deterministic per-round `stats.instruction_count` in lockstep), and
+    /// floored up by `advance_all_timebases_to`. Two cold lockstep runs
+    /// read identical values, so the lockstep trace stays bit-reproducible.
     global_clock: u64,
     /// Global count of TLS slots allocated — `spawn` pre-sizes new threads'
     /// `tls_values` to this.
@@ -1146,13 +1155,26 @@ impl Scheduler {
 
     /// Advance the parallel-mode coherent clock by `n` retired instructions.
     /// Called from the parallel worker writeback with the block's executed
-    /// count so "now" tracks aggregate guest progress. Never called in
-    /// lockstep (the clock stays 0 and unread there).
+    /// count so "now" tracks aggregate guest progress.
     #[inline]
     pub fn advance_global_clock(&mut self, n: u64) {
         self.global_clock = self.global_clock.saturating_add(n);
     }
 
+    /// Floor the coherent clock up to `now` (monotonic; never goes
+    /// backwards). Used by the **lockstep** outer loop once per round to
+    /// track the deterministic retired-instruction count
+    /// (`stats.instruction_count`) as the single coherent "now". A plain
+    /// floor-up rather than `saturating_add` because the lockstep caller
+    /// passes an absolute monotonic counter (not a per-block delta), and
+    /// because `advance_all_timebases_to` may already have pushed
+    /// `global_clock` past the instruction count when fast-forwarding to a
+    /// future deadline — clamping with `max` keeps both sources monotone.
+    #[inline]
+    pub fn advance_global_clock_to(&mut self, now: u64) {
+        self.global_clock = self.global_clock.max(now);
+    }
+
     /// Fast-forward the timebase to the earliest pending timed wait and
     /// wake that sleeper. Used when a round had no Ready threads and no
     /// timer fires closer than the earliest wait. Returns the woken