[iterate-3AL] Superblock dispatch: chain basic blocks per slot-visit (~1.6x boot-to-splash)

Replace the one-basic-block-per-slot-per-round lockstep dispatch with a
SUPERBLOCK runner: each slot-visit chains straight-line blocks through
their terminating branches up to a deterministic instruction budget,
amortizing the per-round (timebase/coord/round_schedule) and per-slot
(worker_prologue) dispatch tax over ~128 instructions instead of ~6.

Yield-points (end the chain, return to the round) are pure functions of
guest state, preserving the lockstep cross-thread interleaving correctness:
  - non-Continue step result (Yield/SystemCall/Trap/Unimpl/Halted);
    db16cyc Yield is the spin-wait producer hand-off.
  - sync-sensitive block: lwarx/ldarx/stwcx./stdcx. or sync/eieio/isync
    (new PpcOpcode::is_sync_sensitive, flagged on DecodedBlock at build).
  - MMIO touch: new GuestMemory::mmio_access_count() watermark, sampled
    per block, keeps GPU/register ordering at one-block granularity.
  - next PC leaves ordinary guest code (import thunk / halt sentinel /
    unmapped) -> hand to the full worker_prologue next round.
  - instruction budget reached.
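
The yield-point rules above can be read as a single predicate over
guest-visible state. A minimal sketch follows; `BlockOutcome`, `NextPc`
and `chain_must_stop` are hypothetical names used only for illustration
(in the patch the check is written inline in `run_superblock`), and the
`StepResult` enum here is a simplified stand-in for the interpreter's.

```rust
/// Simplified stand-in for the interpreter's step result (assumption).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum StepResult {
    Continue,
    Yield,
    SystemCall,
    Trap,
    Unimpl,
    Halted,
}

/// Hypothetical classification of where the next PC points.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum NextPc {
    GuestCode,
    ImportThunk,
    HaltSentinel,
    Unmapped,
}

/// Hypothetical snapshot taken right after one block has run.
#[derive(Clone, Copy, Debug)]
pub struct BlockOutcome {
    pub result: StepResult,
    pub sync_sensitive: bool,
    pub mmio_before: u64, // MMIO access counter sampled before the block
    pub mmio_after: u64,  // same counter sampled after the block
    pub total_executed: u64,
    pub next_pc: NextPc,
}

/// True when the chain has to end and control returns to the round.
/// Every input is guest state or an instruction count, never wall-clock.
pub fn chain_must_stop(o: &BlockOutcome, budget: u64) -> bool {
    o.result != StepResult::Continue
        || o.sync_sensitive
        || o.mmio_after != o.mmio_before
        || o.next_pc != NextPc::GuestCode
        || o.total_executed >= budget
}
```

Because the predicate reads nothing but these fields, two runs that reach
the same guest state stop the chain at the same block.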

Instruction-count/clock accounting stays exact: per-block cycle_count
deltas are summed and handed to worker_epilogue once (instruction_count +
decrement_quantum advance by the precise retired count). XENIA_SUPERBLOCK_BUDGET=1
reproduces the old one-block schedule byte-for-byte.
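
As a rough illustration of why the accounting stays exact, and why a
budget of 1 degenerates to the old schedule, the sketch below models a
slot-visit as a walk over per-block retired-instruction counts.
`run_visit` and `run_all` are hypothetical helpers, and the block sizes
passed to them are made-up inputs, not measurements.

```rust
/// One slot-visit: chain blocks until the running total reaches the
/// budget. `block_sizes` stands in for the per-block `cycle_count`
/// deltas. Returns (blocks consumed, instructions retired this visit).
pub fn run_visit(block_sizes: &[u64], budget: u64) -> (usize, u64) {
    let mut total: u64 = 0;
    let mut blocks = 0usize;
    for &size in block_sizes {
        total = total.saturating_add(size);
        blocks += 1;
        // The budget is checked after a block has run, so a visit can
        // overshoot it by part of one block but never splits a block.
        if total >= budget {
            break;
        }
    }
    (blocks, total)
}

/// Drain the whole block stream visit by visit.
/// Returns (number of visits, total instructions retired).
pub fn run_all(block_sizes: &[u64], budget: u64) -> (usize, u64) {
    let mut visits = 0usize;
    let mut retired = 0u64;
    let mut rest = block_sizes;
    while !rest.is_empty() {
        let (n, executed) = run_visit(rest, budget);
        rest = &rest[n..];
        visits += 1;
        retired += executed;
    }
    (visits, retired)
}
```

With blocks of 6, 5, 7, 6, 8 and 4 instructions, a budget of 1 takes six
visits of one block each, a budget of 12 takes three visits, and a budget
of 128 takes one; the retired total is 36 in every case.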

Budget tuned to 128 (env-overridable): boot progression stays healthy up
to 256, sharp cliff at ~384 (a boot producer/consumer handoff starves);
128 is 3x below the cliff. Also scale the inline-GPU per-round fairness
cap with the budget (flat 64 throttled GPU command processing 17x under
superblocks and collapsed the present loop).
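
The cap scaling described above can be sketched as follows. This is an
illustration only: `gpu_runs_for_round` and `gpu_runs_flat_cap` are
hypothetical names, and `HW_THREAD_COUNT = 6` is an assumption taken
from the `executed/6` rate mentioned in the code comment.

```rust
/// Assumed hardware thread count (see the `executed/6` comment).
pub const HW_THREAD_COUNT: u64 = 6;

/// GPU command-processor runs allowed for one round, with the fairness
/// cap tied to the superblock budget (never lower than the old 64).
pub fn gpu_runs_for_round(executed_this_round: u64, budget: u64) -> u64 {
    let gpu_cap = budget.max(64);
    (executed_this_round / HW_THREAD_COUNT).max(1).min(gpu_cap)
}

/// The pre-patch behaviour for comparison: a flat cap of 64.
pub fn gpu_runs_flat_cap(executed_this_round: u64) -> u64 {
    (executed_this_round / HW_THREAD_COUNT).max(1).min(64)
}
```

For a round in which all six slots retire a full 128-instruction
superblock (768 instructions), the flat cap allows 64 runs while the
scaled cap allows 128, so the GPU keeps the same per-instruction rate.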

PERF (check -n 100M --gpu-inline): 25.3 -> 42.7 MIPS (1.69x); 1B: 26.0 ->
41.4 MIPS (1.59x). Callgrind n=5M: host instructions 2.178B -> 1.507B
(-31%); worker_prologue -90%, coord_pre_round -91%, begin_slot_visit /
round_schedule_into / coord_post_round / update_timestamp_bundle each
~-90%; interpreter execute byte-identical (real work unchanged).

GATES: C1 boot progression 150M draws 7391/swaps 2164 (baseline 7415/2172),
1B draws 88547/swaps 29228 linear no stall, K8888 decode + RTs=2 intact.
C2 determinism: n50m stable digest byte-identical across fresh runs;
golden re-baselined intentionally (pacing-only deltas: imports 333453->243387,
draws 1274->1279). C3 milestone-1 render: texture_decodes/draws/swaps/
present cadence track baseline (3AJ fade-in pacing preserved). C4: 690
tests green (+2 sync_sensitive).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-19 22:31:54 +02:00
parent dc1320cd4b
commit acb29db444
6 changed files with 289 additions and 29 deletions


@@ -2326,8 +2326,19 @@ fn coord_post_round(
     let mut gpu_runs = (executed_this_round
         / xenia_cpu::scheduler::HW_THREAD_COUNT as u64)
         .max(1);
-    if gpu_runs > 64 {
-        gpu_runs = 64;
+    // Fairness cap on GPU commands drained per round. Must scale with the
+    // per-round instruction volume: with the superblock runner a single
+    // round legitimately retires up to ~SUPERBLOCK_INSTR_BUDGET per slot
+    // (vs ~6 for the old one-block path), so the rate `executed/6` is much
+    // higher and a flat cap of 64 throttled GPU command processing ~17×
+    // (packets 50279→1861 @50M), collapsing the present loop / splash.
+    // Cap at the budget so the GPU keeps pace with the CPU at the same
+    // per-instruction rate the one-block path had. The inner loop already
+    // early-breaks on `!gpu.is_ready`, so this only bounds a pathological
+    // backlog, never busy-spins.
+    let gpu_cap = superblock_budget().max(64);
+    if gpu_runs > gpu_cap {
+        gpu_runs = gpu_cap;
     }
     if let Some(gpu) = kernel.gpu.as_inline_mut() {
         gpu.sync_with_mmio();
@@ -2812,6 +2823,160 @@ fn worker_epilogue(
SlotOutcome::Continue
}
/// Hard cap on the number of guest instructions a single superblock
/// runner invocation executes before returning to the round-robin
/// scheduler. Bounds how coarse the lockstep interleaving can get: a
/// larger budget amortizes more per-round/per-slot tax (faster) but
/// runs one HW thread for longer between scheduler returns (coarser
/// cross-thread interleaving). 128 keeps a slot-visit ~21× longer
/// than the old single-block (~6 instr) granularity while still
/// returning to the round well inside a single 50k quantum. Purely an
/// instruction count → deterministic, schedule reproduces byte-identically.
///
/// Tuned empirically on the Sylpheed boot-to-splash workload (iterate-3AL):
/// budgets up to 256 keep boot progression healthy (draws /
/// swaps / packets track the one-block baseline), then a sharp cliff at
/// ~384 collapses the present loop (a producer/consumer boot handoff
/// starves when one slot runs too long without returning to the round).
/// 128 sits 3× below that cliff with ~1.65× boot-to-splash speedup — a
/// deliberately conservative pick (correctness over the last few %). The
/// `XENIA_SUPERBLOCK_BUDGET` env var overrides it for further tuning.
const SUPERBLOCK_INSTR_BUDGET: u64 = 128;
/// Effective superblock budget. Defaults to [`SUPERBLOCK_INSTR_BUDGET`];
/// `XENIA_SUPERBLOCK_BUDGET` overrides it (A/B tuning without a rebuild).
/// A budget of 1 reproduces the old one-block-per-slot-visit behaviour
/// (the chain always stops after the first block). Read once and cached.
fn superblock_budget() -> u64 {
use std::sync::OnceLock;
static BUDGET: OnceLock<u64> = OnceLock::new();
*BUDGET.get_or_init(|| {
std::env::var("XENIA_SUPERBLOCK_BUDGET")
.ok()
.and_then(|v| v.parse::<u64>().ok())
.filter(|&v| v >= 1)
.unwrap_or(SUPERBLOCK_INSTR_BUDGET)
})
}
/// Superblock runner (iterate-3AL). Executes a *chain* of basic blocks
/// for one slot-visit — following each block's terminating branch into
/// the next block — instead of a single block, amortizing the per-round
/// (timebase / coord / `round_schedule`) and per-slot (`worker_prologue`)
/// dispatch tax over up to [`SUPERBLOCK_INSTR_BUDGET`] guest instructions.
///
/// Determinism + cross-thread correctness: the chain ENDS (returns to the
/// round) at exactly the points where lockstep granularity matters, all
/// pure functions of guest state (never wall-clock):
/// - a non-`Continue` step result (Yield / SystemCall / Trap / Unimpl /
/// Halted) — `step_block` already bails on these; `Yield` in
/// particular is the db16cyc spin-wait hand-off that prevents a
/// spinner from starving its producer.
/// - the just-run block was `sync_sensitive` (reserved load/store or a
/// memory barrier) — the guest's own ordering points.
/// - the block touched MMIO (the `mem.mmio_access_count()` watermark
/// advanced) — GPU/register ordering vs other HW threads stays at the
/// same fine granularity as the old one-block path.
/// - the next PC leaves ordinary guest code: an import thunk, the halt
/// sentinel, or unmapped memory — those need the full `worker_prologue`
/// dispatch, so we stop and let the next round's prologue handle them.
/// - the instruction budget is reached.
///
/// Instruction-count / clock accounting stays exact: `executed` is summed
/// from the per-block `cycle_count` delta across every chained block and
/// handed to `worker_epilogue` once, which advances `stats.instruction_count`
/// and `decrement_quantum` by precisely the retired count — identical to
/// dispatching each block separately.
#[allow(clippy::too_many_arguments)]
fn run_superblock(
wc: &mut WorkerCtx,
kernel: &mut xenia_kernel::KernelState,
mem: &xenia_memory::GuestMemory,
debugger: &mut xenia_debugger::Debugger,
thunk_map: &HashMap<u32, (ModuleId, u16, String)>,
stats: &mut ExecStats,
tid: Option<u32>,
thread_ref: xenia_cpu::ThreadRef,
first_block_ptr: *const xenia_cpu::block_cache::DecodedBlock,
first_pc_before: u32,
) -> SlotOutcome {
use xenia_cpu::interpreter::{step_block, StepResult};
const LR_HALT: u32 = xenia_cpu::context::LR_HALT_SENTINEL as u32;
let budget = superblock_budget();
// Probe / mem-watch / debugger-hook modes need per-block-entry
// observability; in those modes never chain (run exactly one block,
// identical to the pre-superblock behaviour). The block-cache fast
// path is only entered when hooks/DB are off anyway, but a probe or
// mem-watch can be armed alongside it.
let chain_allowed = !kernel.any_probe_active() && !mem.has_mem_watch();
let mut block_ptr = first_block_ptr;
let mut pc_before = first_pc_before;
let mut total_executed: u64 = 0;
let (result, last_block_ptr, last_pc_before) = loop {
let cycle_before = kernel.scheduler.ctx_mut_ref(thread_ref).cycle_count;
let mmio_before = mem.mmio_access_count();
let block = unsafe { &*block_ptr };
let result = {
let ctx = kernel.scheduler.ctx_mut_ref(thread_ref);
step_block(ctx, mem, block)
};
let executed = kernel
.scheduler
.ctx_mut_ref(thread_ref)
.cycle_count
.saturating_sub(cycle_before);
total_executed = total_executed.saturating_add(executed);
// STOP conditions (any → end the superblock, hand to epilogue):
// non-Continue result (let the epilogue apply it), chaining
// disabled, a sync-sensitive block just ran, MMIO was touched,
// or the budget is spent.
if !chain_allowed
|| !matches!(result, StepResult::Continue)
|| block.sync_sensitive
|| mem.mmio_access_count() != mmio_before
|| total_executed >= budget
{
break (result, block_ptr, pc_before);
}
// Decide whether the NEXT PC is an ordinary guest block we can
// chain into. Anything else (thunk / halt sentinel / unmapped)
// needs the full prologue dispatch next round.
let next_pc = kernel.scheduler.ctx(wc.hw_id).pc;
if next_pc == LR_HALT
|| (kernel.pc_in_thunk_band(next_pc) && thunk_map.contains_key(&next_pc))
|| !mem.is_mapped(next_pc)
{
break (result, block_ptr, pc_before);
}
// Chain: build/fetch the next block. Re-borrows `wc.block_cache`,
// which invalidates the previous `block_ptr` — but we've already
// finished using it (only `sync_sensitive`/diagnostics were read,
// above), so the raw-pointer aliasing rule is respected.
pc_before = next_pc;
block_ptr = wc.block_cache.lookup_or_build(next_pc, mem) as *const _;
};
worker_epilogue(
wc,
kernel,
debugger,
stats,
tid,
thread_ref,
last_block_ptr,
last_pc_before,
result,
total_executed,
)
}
#[instrument(skip_all, fields(max = ?max_instructions, ips = ?ips_limit))]
fn run_execution(
mem: &xenia_memory::GuestMemory,
@@ -2825,8 +2990,6 @@ fn run_execution(
     halt_on_deadlock: bool,
     shutdown: Option<std::sync::Arc<std::sync::atomic::AtomicBool>>,
 ) -> ExecStats {
-    use xenia_cpu::interpreter::step_block;
     let mut stats = ExecStats::default();
     let _ = quiet; // retained for future per-kind suppression
@@ -2974,34 +3137,25 @@ fn run_execution(
         block_ptr,
         pc_before,
     } => {
-        // Block-cache step. The lockstep path keeps the
-        // kernel state borrowed straight through (single
-        // host thread, no contention). Step 03 of the
-        // M3 real-parallelism plan introduces a
-        // drop-and-reacquire window around `step_block`
-        // for the parallel branch.
-        let cycle_before = kernel.scheduler.ctx_mut_ref(thread_ref).cycle_count;
-        let block = unsafe { &*block_ptr };
-        let result = {
-            let ctx = kernel.scheduler.ctx_mut_ref(thread_ref);
-            step_block(ctx, mem, block)
-        };
-        let executed = kernel
-            .scheduler
-            .ctx_mut_ref(thread_ref)
-            .cycle_count
-            .saturating_sub(cycle_before);
-        match worker_epilogue(
+        // SUPERBLOCK runner (iterate-3AL). Instead of one
+        // basic block per slot-visit, chain straight-line
+        // blocks through their branches up to a deterministic
+        // instruction budget, yielding back to the round only
+        // at cross-thread synchronization points. Amortizes
+        // the per-round (timebase / coord / round_schedule)
+        // and per-slot (prologue) tax over up to ~128
+        // instructions instead of ~6. See `run_superblock`.
+        match run_superblock(
             wc,
             kernel,
+            mem,
             debugger,
+            thunk_map,
             &mut stats,
             tid,
             thread_ref,
             block_ptr,
             pc_before,
-            result,
-            executed,
         ) {
             SlotOutcome::Continue => continue,
             SlotOutcome::BreakOuter => break 'outer,


@@ -1,5 +1,5 @@
 {
-  "instructions": 2000005,
+  "instructions": 2000073,
   "imports": 5635,
   "unimpl": 0,
   "draws": 0,


@@ -1,9 +1,9 @@
 {
-  "instructions": 50000007,
-  "imports": 333453,
+  "instructions": 50000110,
+  "imports": 243387,
   "unimpl": 0,
-  "draws": 1274,
-  "swaps": 259,
+  "draws": 1279,
+  "swaps": 260,
   "unique_render_targets": 2,
   "shader_blobs_live": 6,
   "texture_cache_entries": 1