[Track 2] Parallel-scoped global clock fixes timebase-desync livelock

In --parallel mode a long run livelocked: the scheduler spun
"advanced to deadline 3000 waking hw=2 idx=0" ~14k times in microseconds.

Root cause: each guest thread owns ctx.timebase (+1/instr in step_block),
and all kernel deadline arithmetic read Scheduler::ctx(hw_id).timebase as
"now". But the parallel worker extracts its PpcContext via
mem::replace(ctx_mut_ref, PpcContext::new()), which leaves a ZEROED timebase
in the slot while it steps unlocked, and advance_all_timebases_to only walks
runqueue (never idle_ctx). So the coordinator's coord_pre_round drain and a
woken thread's parse_timeout could read a zeroed/stale basis decoupled from
the deadline the scheduler just advanced to. The thread re-armed the same
constant deadline forever; the global clock never moved.

Fix: add a single monotonic Scheduler::global_clock, advanced by the
per-block retired-instruction count on each parallel writeback and floored
up by advance_all_timebases_to. Kernel deadline reads route through
KernelState::now_basis_at(hw_id), which returns global_clock ONLY when
parallel_active; lockstep keeps reading the exact pre-existing
ctx(hw_id).timebase expression, so the deterministic lockstep trace is
byte-identical (sylpheed_n50m golden unchanged, zero re-baseline).

Verified:
- 50M --parallel run completes (was: hung). Deadlines now strictly
  increasing 5.4M -> 49.1M (18097 unique of 18116; max repeat 2) vs
  pre-fix constant 3000 x ~14000.
- sylpheed_n50m golden byte-identical via plain `check` (no persist).
- Full suite 665/665 green.

Note: an intermittent parallel hang/crash (~1-2/20 at -n 5M) is
pre-existing (master 1/20, this build 2/20, within noise) and distinct from
the timebase livelock: it is a parallel-race class (e.g. the unsafe
block_ptr deref in run_execution_parallel). Tracked separately; lockstep
remains the recommendation for long runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
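
The fix described in the message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: only the names `global_clock`, `advance_global_clock`, `advance_all_timebases_to`, `now_basis_at`, and `parallel_active` come from the message. The struct layouts, the `timebases` vector standing in for `ctx(hw_id).timebase`, and every function body are assumptions.

```rust
// Hypothetical, simplified shapes. Field layout and bodies are assumed;
// only the method and field names quoted in the commit message are real.

pub struct Scheduler {
    /// Monotonic clock used as "now" in parallel mode.
    pub global_clock: u64,
    /// Stand-in for the per-thread `ctx(hw_id).timebase` values.
    pub timebases: Vec<u64>,
}

impl Scheduler {
    pub fn new(hw_threads: usize) -> Self {
        Self { global_clock: 0, timebases: vec![0; hw_threads] }
    }

    /// Called on each parallel writeback with the number of instructions
    /// the block retired. The clock only ever moves forward.
    pub fn advance_global_clock(&mut self, executed: u64) {
        self.global_clock = self.global_clock.saturating_add(executed);
    }

    /// Raise every timebase to at least `deadline`, and floor the global
    /// clock up to the same value so it never lags the deadline.
    pub fn advance_all_timebases_to(&mut self, deadline: u64) {
        for tb in &mut self.timebases {
            *tb = (*tb).max(deadline);
        }
        self.global_clock = self.global_clock.max(deadline);
    }
}

pub struct KernelState {
    pub scheduler: Scheduler,
    pub parallel_active: bool,
}

impl KernelState {
    /// Basis for deadline arithmetic: the global clock in parallel mode,
    /// the per-thread timebase (unchanged behaviour) in lockstep mode.
    pub fn now_basis_at(&self, hw_id: usize) -> u64 {
        if self.parallel_active {
            self.scheduler.global_clock
        } else {
            self.scheduler.timebases[hw_id]
        }
    }
}
```

In this sketch, a thread that wakes at deadline 3000 and then retires a block sees a "now" above 3000 on its next read, so it cannot re-arm the same deadline. Lockstep callers still read their own timebase.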
@@ -2138,7 +2138,7 @@ fn coord_pre_round(
     // is the guest-cycle timebase, not host_ns. This runs in `coord_pre_round`
     // which both the lockstep and parallel outer loops call every round.
     loop {
-        let now = kernel.scheduler.ctx(0).timebase;
+        let now = kernel.now_basis_at(0);
         let Some((r, reason)) = kernel.scheduler.advance_to_next_wake_if_due(now)
         else {
             break;
@@ -3146,6 +3146,16 @@ fn run_execution_parallel(
         .and_then(|t| guard.scheduler.find_by_tid(t))
         .unwrap_or(thread_ref);
     *guard.scheduler.ctx_mut_ref(target_ref) = ctx_taken;
+    // Advance the parallel-mode coherent clock by
+    // the instructions this block retired. This is
+    // the single authoritative "now" the kernel
+    // deadline-arithmetic reads in parallel mode
+    // (per-thread `ctx.timebase` is incoherent here
+    // because peers extract/zero their slots) —
+    // keeping it monotonic breaks the timebase-
+    // desync livelock where a woken thread re-armed
+    // the same constant deadline forever.
+    guard.scheduler.advance_global_clock(executed);
     // worker_epilogue's exit_current path
     // expects scheduler.current to be set
     // to the running thread.
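
The "peers extract/zero their slots" hazard named in the comment above can be reproduced in isolation. This is a sketch under stated assumptions: `PpcContext` is reduced to a single `timebase` field, and `extract`, `step_block` (with an explicit instruction count), and `writeback` are hypothetical helpers written for illustration, not the emulator's real functions.

```rust
use std::mem;

/// Reduced stand-in for the emulator's context; only `timebase` is modelled.
#[derive(Debug, Default)]
pub struct PpcContext {
    pub timebase: u64,
}

impl PpcContext {
    pub fn new() -> Self {
        Self::default()
    }
}

/// Take the context out of its scheduler slot. The slot is left holding a
/// fresh context, so its timebase reads as 0 until writeback.
pub fn extract(slot: &mut PpcContext) -> PpcContext {
    mem::replace(slot, PpcContext::new())
}

/// Hypothetical block step: +1 timebase per instruction, returning the
/// retired-instruction count.
pub fn step_block(ctx: &mut PpcContext, instrs: u64) -> u64 {
    ctx.timebase += instrs;
    instrs
}

/// Put the stepped context back into its slot.
pub fn writeback(slot: &mut PpcContext, ctx: PpcContext) {
    *slot = ctx;
}
```

Between `extract` and `writeback`, anything that reads the slot's timebase sees 0 rather than the thread's real progress. That is the window in which a per-thread "now" is unusable as a deadline basis.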