[Track 2] Parallel-scoped global clock fixes timebase-desync livelock
In --parallel mode a long run livelocked: the scheduler spun "advanced to deadline 3000 waking hw=2 idx=0" ~14k times in microseconds. Root cause: each guest thread owns ctx.timebase (+1/instr in step_block), and all kernel deadline arithmetic read Scheduler::ctx(hw_id).timebase as "now". But the parallel worker extracts its PpcContext via mem::replace(ctx_mut_ref, PpcContext::new()) — leaving a ZEROED timebase in the slot while it steps unlocked — and advance_all_timebases_to only walks runqueue (never idle_ctx). So the coordinator's coord_pre_round drain and a woken thread's parse_timeout could read a zeroed/stale basis decoupled from the deadline the scheduler just advanced to. The thread re-armed the same constant deadline forever; the global clock never moved. Fix: add a single monotonic Scheduler::global_clock, advanced by the per-block retired-instruction count on each parallel writeback and floored up by advance_all_timebases_to. Kernel deadline reads route through KernelState::now_basis_at(hw_id), which returns global_clock ONLY when parallel_active; lockstep keeps reading the exact pre-existing ctx(hw_id).timebase expression, so the deterministic lockstep trace is byte-identical (sylpheed_n50m golden unchanged, zero re-baseline). Verified: - 50M --parallel run completes (was: hung). Deadlines now strictly increasing 5.4M -> 49.1M (18097 unique of 18116; max repeat 2) vs pre-fix constant 3000 x ~14000. - sylpheed_n50m golden byte-identical via plain `check` (no persist). - Full suite 665/665 green. Note: an intermittent parallel hang/crash (~1-2/20 at -n 5M) is pre-existing (master 1/20, this build 2/20 — within noise) and distinct from the timebase livelock: it is a parallel-race class (e.g. the unsafe block_ptr deref in run_execution_parallel). Tracked separately; lockstep remains the recommendation for long runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -2138,7 +2138,7 @@ fn coord_pre_round(
|
|||||||
// is the guest-cycle timebase, not host_ns. This runs in `coord_pre_round`
|
// is the guest-cycle timebase, not host_ns. This runs in `coord_pre_round`
|
||||||
// which both the lockstep and parallel outer loops call every round.
|
// which both the lockstep and parallel outer loops call every round.
|
||||||
loop {
|
loop {
|
||||||
let now = kernel.scheduler.ctx(0).timebase;
|
let now = kernel.now_basis_at(0);
|
||||||
let Some((r, reason)) = kernel.scheduler.advance_to_next_wake_if_due(now)
|
let Some((r, reason)) = kernel.scheduler.advance_to_next_wake_if_due(now)
|
||||||
else {
|
else {
|
||||||
break;
|
break;
|
||||||
@@ -3146,6 +3146,16 @@ fn run_execution_parallel(
|
|||||||
.and_then(|t| guard.scheduler.find_by_tid(t))
|
.and_then(|t| guard.scheduler.find_by_tid(t))
|
||||||
.unwrap_or(thread_ref);
|
.unwrap_or(thread_ref);
|
||||||
*guard.scheduler.ctx_mut_ref(target_ref) = ctx_taken;
|
*guard.scheduler.ctx_mut_ref(target_ref) = ctx_taken;
|
||||||
|
// Advance the parallel-mode coherent clock by
|
||||||
|
// the instructions this block retired. This is
|
||||||
|
// the single authoritative "now" the kernel
|
||||||
|
// deadline-arithmetic reads in parallel mode
|
||||||
|
// (per-thread `ctx.timebase` is incoherent here
|
||||||
|
// because peers extract/zero their slots) —
|
||||||
|
// keeping it monotonic breaks the timebase-
|
||||||
|
// desync livelock where a woken thread re-armed
|
||||||
|
// the same constant deadline forever.
|
||||||
|
guard.scheduler.advance_global_clock(executed);
|
||||||
// worker_epilogue's exit_current path
|
// worker_epilogue's exit_current path
|
||||||
// expects scheduler.current to be set
|
// expects scheduler.current to be set
|
||||||
// to the running thread.
|
// to the running thread.
|
||||||
|
|||||||
@@ -351,6 +351,19 @@ pub struct Scheduler {
|
|||||||
/// Sorted by deadline ascending. Scheduler wakes the first entry via
|
/// Sorted by deadline ascending. Scheduler wakes the first entry via
|
||||||
/// `advance_to_next_wake` when a round finds nothing runnable.
|
/// `advance_to_next_wake` when a round finds nothing runnable.
|
||||||
timed_waits: Vec<(u64, ThreadRef)>,
|
timed_waits: Vec<(u64, ThreadRef)>,
|
||||||
|
/// Parallel-mode coherent monotonic clock. In `--parallel`, workers
|
||||||
|
/// extract their `PpcContext` (leaving a zeroed timebase in the slot)
|
||||||
|
/// and step unlocked, so `ctx(hw_id).timebase` is NOT a coherent "now"
|
||||||
|
/// — a coordinator that reads it can see a stale/zero basis decoupled
|
||||||
|
/// from the deadline it just advanced to, re-arming the same constant
|
||||||
|
/// deadline forever (timebase-desync livelock). This field is the
|
||||||
|
/// single authoritative "now" the parallel coordinator and kernel
|
||||||
|
/// deadline-arithmetic read instead. Advanced by `advance_global_clock`
|
||||||
|
/// (per-block retired-instruction count) on each parallel writeback and
|
||||||
|
/// floored up by `advance_all_timebases_to`. LOCKSTEP never reads it
|
||||||
|
/// (gated by `KernelState::parallel_active`), so it has zero effect on
|
||||||
|
/// the deterministic lockstep trace.
|
||||||
|
global_clock: u64,
|
||||||
/// Global count of TLS slots allocated — `spawn` pre-sizes new threads'
|
/// Global count of TLS slots allocated — `spawn` pre-sizes new threads'
|
||||||
/// `tls_values` to this.
|
/// `tls_values` to this.
|
||||||
tls_slot_count: usize,
|
tls_slot_count: usize,
|
||||||
@@ -389,6 +402,7 @@ impl Scheduler {
|
|||||||
order,
|
order,
|
||||||
rng_state,
|
rng_state,
|
||||||
timed_waits: Vec::new(),
|
timed_waits: Vec::new(),
|
||||||
|
global_clock: 0,
|
||||||
tls_slot_count: 0,
|
tls_slot_count: 0,
|
||||||
non_empty_runnable: 0,
|
non_empty_runnable: 0,
|
||||||
rotation_cursor: 0,
|
rotation_cursor: 0,
|
||||||
@@ -1114,6 +1128,29 @@ impl Scheduler {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
// Keep the parallel-mode coherent clock at least as far forward as
|
||||||
|
// any deadline we fast-forward to (idle/timer/wake advances). This
|
||||||
|
// only mutates the new `global_clock` field — lockstep never reads
|
||||||
|
// it — so it cannot perturb the deterministic lockstep trace.
|
||||||
|
self.global_clock = self.global_clock.max(deadline);
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Parallel-mode coherent "now" (see [`Self::global_clock`] field doc).
|
||||||
|
/// Read by the kernel deadline-arithmetic ONLY when
|
||||||
|
/// `KernelState::parallel_active`; lockstep keeps reading per-thread
|
||||||
|
/// `ctx(hw_id).timebase`.
|
||||||
|
#[inline]
|
||||||
|
pub fn global_clock(&self) -> u64 {
|
||||||
|
self.global_clock
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Advance the parallel-mode coherent clock by `n` retired instructions.
|
||||||
|
/// Called from the parallel worker writeback with the block's executed
|
||||||
|
/// count so "now" tracks aggregate guest progress. Never called in
|
||||||
|
/// lockstep (the clock stays 0 and unread there).
|
||||||
|
#[inline]
|
||||||
|
pub fn advance_global_clock(&mut self, n: u64) {
|
||||||
|
self.global_clock = self.global_clock.saturating_add(n);
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Fast-forward the timebase to the earliest pending timed wait and
|
/// Fast-forward the timebase to the earliest pending timed wait and
|
||||||
|
|||||||
@@ -2165,7 +2165,7 @@ fn nt_set_timer_ex(ctx: &mut PpcContext, mem: &GuestMemory, state: &mut KernelSt
|
|||||||
// timebase separately (immutable borrow) before any mutation of the
|
// timebase separately (immutable borrow) before any mutation of the
|
||||||
// object to keep the borrow-checker happy.
|
// object to keep the borrow-checker happy.
|
||||||
let hw_id = state.scheduler.current_hw_id().unwrap_or(0);
|
let hw_id = state.scheduler.current_hw_id().unwrap_or(0);
|
||||||
let now = state.scheduler.ctx(hw_id).timebase;
|
let now = state.now_basis_at(hw_id);
|
||||||
|
|
||||||
// Read signed i64 due_time (big-endian hi/lo — same pattern as
|
// Read signed i64 due_time (big-endian hi/lo — same pattern as
|
||||||
// parse_timeout). Negative = relative-from-now, positive = absolute
|
// parse_timeout). Negative = relative-from-now, positive = absolute
|
||||||
@@ -3589,7 +3589,7 @@ pub(crate) fn parse_timeout(state: &KernelState, timeout_ptr: u32, mem: &GuestMe
|
|||||||
return Some(Some(0)); // poll
|
return Some(Some(0)); // poll
|
||||||
}
|
}
|
||||||
let hw_id = state.scheduler.current_hw_id().unwrap_or(0);
|
let hw_id = state.scheduler.current_hw_id().unwrap_or(0);
|
||||||
let now = state.scheduler.ctx(hw_id).timebase;
|
let now = state.now_basis_at(hw_id);
|
||||||
// Negative = relative, positive = absolute wall-clock. Our timebase is a
|
// Negative = relative, positive = absolute wall-clock. Our timebase is a
|
||||||
// plain instruction counter, so we treat all timeouts as "time-units
|
// plain instruction counter, so we treat all timeouts as "time-units
|
||||||
// after now" regardless of sign, using the magnitude.
|
// after now" regardless of sign, using the magnitude.
|
||||||
|
|||||||
@@ -1251,6 +1251,26 @@ impl KernelState {
|
|||||||
self.pending_timer_fires.first().map(|&(d, _)| d)
|
self.pending_timer_fires.first().map(|&(d, _)| d)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Coherent "now" basis for deadline arithmetic, gated on execution mode.
|
||||||
|
///
|
||||||
|
/// In **lockstep** (`parallel_active == false`) this returns exactly the
|
||||||
|
/// pre-existing per-thread `ctx(hw_id).timebase` each call site read
|
||||||
|
/// before, so the deterministic lockstep trace is byte-identical (no
|
||||||
|
/// golden re-baseline). In **parallel** (`parallel_active == true`) the
|
||||||
|
/// per-thread timebases are incoherent (workers extract/zero their slots
|
||||||
|
/// while stepping unlocked), so we return the scheduler's single
|
||||||
|
/// monotonic `global_clock` instead — the basis that breaks the
|
||||||
|
/// timebase-desync livelock. Callers pass the `hw_id` they would have
|
||||||
|
/// used for the lockstep `ctx()` read (slot 0 for coordinator-side
|
||||||
|
/// drains, the current thread's slot for in-guest waits).
|
||||||
|
pub fn now_basis_at(&self, hw_id: u8) -> u64 {
|
||||||
|
if self.parallel_active {
|
||||||
|
self.scheduler.global_clock()
|
||||||
|
} else {
|
||||||
|
self.scheduler.ctx(hw_id).timebase
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
/// Fire every timer whose deadline is `<= now` (derived from slot 0's
|
/// Fire every timer whose deadline is `<= now` (derived from slot 0's
|
||||||
/// timebase, matching `parse_timeout`'s "current thread" fallback).
|
/// timebase, matching `parse_timeout`'s "current thread" fallback).
|
||||||
/// For each fire: mark the timer `signaled=true`, clear its
|
/// For each fire: mark the timer `signaled=true`, clear its
|
||||||
@@ -1259,7 +1279,7 @@ impl KernelState {
|
|||||||
/// fired — the caller uses this to decide whether the scheduler round
|
/// fired — the caller uses this to decide whether the scheduler round
|
||||||
/// needs a follow-up `advance_to_next_wake_if_due` step.
|
/// needs a follow-up `advance_to_next_wake_if_due` step.
|
||||||
pub fn fire_due_timers(&mut self) -> bool {
|
pub fn fire_due_timers(&mut self) -> bool {
|
||||||
let now = self.scheduler.ctx(0).timebase;
|
let now = self.now_basis_at(0);
|
||||||
let mut fired = false;
|
let mut fired = false;
|
||||||
loop {
|
loop {
|
||||||
let Some(&(deadline, handle)) = self.pending_timer_fires.first() else {
|
let Some(&(deadline, handle)) = self.pending_timer_fires.first() else {
|
||||||
|
|||||||
Reference in New Issue
Block a user