The lockstep scheduler's pick_runnable is strict priority (max_by_key (priority, -idx)). On a cooperative single-host HW slot, a CPU-bound spinner that never blocks (the silph poll loop pinned by affinity to hw=5) wins pick_runnable every round forever, permanently starving a co-located peer (the submitter, tid6) that the spinner is actually waiting on. On real hardware those threads run on separate SMT contexts concurrently, so the spinner never starves the submitter; ours collapses them onto one slot with no anti-starvation, turning priority (or equal-priority index order) into permanent starvation.

The starved submitter never dequeued job-4 -> the worker-hub (tid5) blocked INFINITE on completion event 0x1080 -> silph (tid13) wedged on 0x1078 -> no vsync -> draws_seen=0, the publisher splash never renders.

(decrement_quantum's within-slot rotation is dead: begin_slot_visit unconditionally re-pick_runnable()s each round, discarding the rotated running_idx. The fix is therefore evaluated at pick time, not via that discarded rotation.)

Fix (Option A, bounded anti-starvation, deterministic):
- Add per-thread steps_starved counter to GuestThread.
- begin_slot_visit increments it for every Ready peer passed over this visit, resets it to 0 for the picked thread.
- pick_runnable selects by effective_priority: once steps_starved reaches STARVE_LIMIT (4096) the thread is lifted to i32::MAX and wins exactly one pick, then resets.

The genuinely higher-priority thread still wins ~4095/4096 visits -- the boost grants periodic forward progress only, it does NOT invert priority. Pure function of counter/priority/index -> deterministic (no wall-clock, no RNG).

Cascade (lockstep exec, XENIA_CACHE_PERSIST=1, -n 200M):
- submitter dequeue sub_82458508 now fires 4x (was 3x); the 4th job (buf 0x40baa2c0) is dequeued at cycle 6.15M.
- hub tid5 leaves Blocked(0x1080) -> now Ready (no more INFINITE wait).
- GPU packets 0 -> 116,101,363 (command stream now flowing).
- tid13 (silph::UImpl) advances past the old 0x1078 wedge to a NEW downstream wait (handle 0x10a0); 3 new threads spawn (tid14/15/16).
- draws_seen still 0 -> the splash's first draw is a NEW downstream gate, not this starvation.

Determinism: two cold lockstep `check -n 5M` runs byte-identical (full and stable digests). New n50m stable digest deterministic across two cold runs.

Golden re-baselined: instructions 50000007->50000003, imports 92317->90296 (trajectory shift from the changed pick order).

Tests: 666/666 (+1 test_anti_starvation_bounded_progress).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
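
The bounded anti-starvation pick described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `State` enum, the field layout of `GuestThread`, and the function signatures are assumptions made for the example. Only the names `steps_starved`, `STARVE_LIMIT`, `effective_priority`, `pick_runnable`, and `begin_slot_visit` come from the message.

```rust
use std::cmp::Reverse;

/// Hypothetical thread state; the real scheduler likely tracks more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Ready,
    Blocked,
}

/// Hypothetical, trimmed-down stand-in for the scheduler's GuestThread.
#[derive(Debug, Clone)]
struct GuestThread {
    priority: i32,
    state: State,
    /// Number of consecutive slot visits in which this Ready thread was passed over.
    steps_starved: u32,
}

const STARVE_LIMIT: u32 = 4096;

/// Priority used at pick time: a thread starved for STARVE_LIMIT visits
/// is lifted to i32::MAX; otherwise its own priority applies.
fn effective_priority(t: &GuestThread) -> i32 {
    if t.steps_starved >= STARVE_LIMIT {
        i32::MAX
    } else {
        t.priority
    }
}

/// Strict-priority pick over Ready threads; ties go to the lowest index.
/// Depends only on counter, priority, and index, so it is deterministic.
fn pick_runnable(threads: &[GuestThread]) -> Option<usize> {
    threads
        .iter()
        .enumerate()
        .filter(|(_, t)| t.state == State::Ready)
        .max_by_key(|(idx, t)| (effective_priority(t), Reverse(*idx)))
        .map(|(idx, _)| idx)
}

/// One slot visit: pick a thread, reset its counter, and bump the counter
/// of every other Ready thread that was passed over.
fn begin_slot_visit(threads: &mut [GuestThread]) -> Option<usize> {
    let picked = pick_runnable(threads)?;
    for (idx, t) in threads.iter_mut().enumerate() {
        if idx == picked {
            t.steps_starved = 0;
        } else if t.state == State::Ready {
            t.steps_starved += 1;
        }
    }
    Some(picked)
}
```

With a priority-10 spinner at index 0 and a priority-5 peer at index 1, both Ready, this sketch picks the spinner for the first 4096 visits and the peer on the next one, after which the pattern repeats. The exact ratio in the real scheduler may differ depending on where the counter is incremented and compared.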
Sylpheed regression goldens
These JSON files anchor xenia-rs check digest output for Project Sylpheed.
Files
| File | -n | Mode | Captures |
|---|---|---|---|
| sylpheed_n2m.json | 2_000_000 | full digest | early boot (swaps=0, no rendering) |
| sylpheed_n50m.json | 50_000_000 | stable-digest | first VdSwap pair (swaps=2 post-Phase-A) |
Stable-digest mode
sylpheed_n50m.json is captured with --stable-digest, which omits
timing-sensitive counters: packets (±2–8% lockstep noise from a GPU thread
race), resolves, interrupts_delivered, interrupts_dropped,
texture_decodes. The remaining fields are byte-identical across repeated
lockstep runs at a fixed -n.
sylpheed_n2m.json predates the stable-digest flag and uses full-digest
compare. It still works because at -n 2M the GPU pipeline has not produced any
packets yet — packets=0 is trivially deterministic.
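
As a rough illustration of what a stable-digest comparison amounts to, the sketch below drops the timing-sensitive counters listed above before comparing two digests. It assumes a digest can be modelled as a flat map from counter name to integer value; the `Digest` alias and both function names are hypothetical and are not taken from xenia-rs.

```rust
use std::collections::BTreeMap;

/// Counters the README lists as omitted under --stable-digest.
const TIMING_SENSITIVE: [&str; 5] = [
    "packets",
    "resolves",
    "interrupts_delivered",
    "interrupts_dropped",
    "texture_decodes",
];

/// Hypothetical digest model: counter name -> value.
type Digest = BTreeMap<String, u64>;

/// Returns a copy of `full` without the timing-sensitive counters.
fn stable_view(full: &Digest) -> Digest {
    full.iter()
        .filter(|(name, _)| !TIMING_SENSITIVE.contains(&name.as_str()))
        .map(|(name, value)| (name.clone(), *value))
        .collect()
}

/// Two digests match in stable mode if they agree on every remaining field.
fn stable_match(a: &Digest, b: &Digest) -> bool {
    stable_view(a) == stable_view(b)
}
```

Under this model, two runs that differ only in `packets` compare equal, while a difference in any field outside the omitted set is still reported as a mismatch.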
Circularity hazard
Per ORACBUG-001/002/003, these goldens were captured by running the same code
they validate. They detect regression from a known-good snapshot, not
correctness. When a planned fix intentionally moves the digest (e.g. a
shader fix landing draws > 0 for the first time), re-baseline the golden as
a separate commit and reference the audit ID in the message.
Re-baselining
    cargo build --release -p xenia-app
    target/release/xenia-rs check \
        "$SYLPHEED_ISO" \
        -n 50000000 \
        --stable-digest \
        --out crates/xenia-app/tests/golden/sylpheed_n50m.json
Running the goldens
    cargo test --release -p xenia-app --test sylpheed_oracles -- --ignored --nocapture
The tests are #[ignore]-gated because each run takes a few seconds, which is
unacceptable in the default cargo test cycle. The ISO path defaults to the
contributor's local ~/RE Project Sylpheed/Project Sylpheed*.iso and can be
overridden via SYLPHEED_ISO=/path/to/sylpheed.iso.
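
The override-then-default lookup described above could look roughly like the sketch below. It is an assumption about the shape of the logic, not the test harness's actual code: `resolve_iso` and `resolve_iso_from_env` are hypothetical names, and the directory scan stands in for whatever matching the real tests use for `Project Sylpheed*.iso`.

```rust
use std::env;
use std::fs;
use std::path::{Path, PathBuf};

/// Returns the override if one is given; otherwise the first file in
/// `default_dir` whose name looks like `Project Sylpheed*.iso`.
fn resolve_iso(override_path: Option<String>, default_dir: &Path) -> Option<PathBuf> {
    // An explicit, non-empty override always wins.
    if let Some(p) = override_path.filter(|p| !p.is_empty()) {
        return Some(PathBuf::from(p));
    }
    // Fall back to scanning the default directory; a missing directory yields None.
    let mut matches: Vec<PathBuf> = fs::read_dir(default_dir)
        .ok()?
        .filter_map(|entry| entry.ok())
        .map(|entry| entry.path())
        .filter(|path| {
            path.file_name()
                .and_then(|n| n.to_str())
                .map_or(false, |n| n.starts_with("Project Sylpheed") && n.ends_with(".iso"))
        })
        .collect();
    matches.sort(); // deterministic choice if several ISOs match
    matches.into_iter().next()
}

/// Convenience wrapper that reads SYLPHEED_ISO and HOME from the environment.
fn resolve_iso_from_env() -> Option<PathBuf> {
    let home = env::var("HOME").unwrap_or_default();
    let default_dir = Path::new(&home).join("RE Project Sylpheed");
    resolve_iso(env::var("SYLPHEED_ISO").ok(), &default_dir)
}
```

Keeping the environment reads in a thin wrapper leaves `resolve_iso` a pure function of its arguments, which makes the lookup easy to exercise without touching the process environment.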
n4b canonical-invocation regression anchor (deferred)
The audit's recommended next sprint also called for a sylpheed_n4b.json
golden capturing the canonical reference invocation
xenia-rs check sylpheed.iso -n 4_000_000_000 --parallel --reservations-table.
This is deferred because:
- The --parallel --reservations-table combination is empirically pathologically slow at -n 100M (>32 min per run per the audit memory). At -n 4B the run cost is many hours, not the single-session-friendly 5–15 min the original plan estimated.
- Each phase that intentionally moves rendering counters (C, D, E, F) would need a re-baseline of n4b, a significant time cost compounding over the sprint.
Once the renderer-unblock phases (C+D+E) land and draws > 0 is confirmed at
-n 100M lockstep, an n4b artifact may be captured one-shot and stored under
audit-runs/post-fix/ (not as a test golden) as a manual regression anchor for
the canonical invocation.