[iterate-3AL] Superblock dispatch: chain basic blocks per slot-visit (~1.6x boot-to-splash)

Replace the one-basic-block-per-slot-per-round lockstep dispatch with a
SUPERBLOCK runner: each slot-visit chains straight-line blocks through
their terminating branches up to a deterministic instruction budget,
amortizing the per-round (timebase/coord/round_schedule) and per-slot
(worker_prologue) dispatch tax over ~128 instructions instead of ~6.

Yield-points (end the chain, return to the round) are pure functions of
guest state, preserving the lockstep cross-thread interleaving
correctness:
- non-Continue step result (Yield/SystemCall/Trap/Unimpl/Halted);
  db16cyc Yield is the spin-wait producer hand-off.
- sync-sensitive block: lwarx/ldarx/stwcx./stdcx. or sync/eieio/isync
  (new PpcOpcode::is_sync_sensitive, flagged on DecodedBlock at build).
- MMIO touch: new GuestMemory::mmio_access_count() watermark, sampled
  per block, keeps GPU/register ordering at one-block granularity.
- next PC leaves ordinary guest code (import thunk / halt sentinel /
  unmapped) -> hand to the full worker_prologue next round.
- instruction budget reached.

Instruction-count/clock accounting stays exact: per-block cycle_count
deltas are summed and handed to worker_epilogue once (instruction_count
+ decrement_quantum advance by the precise retired count).

XENIA_SUPERBLOCK_BUDGET=1 reproduces the old one-block schedule
byte-for-byte.

Budget tuned to 128 (env-overridable): boot progression stays healthy up
to 256, sharp cliff at ~384 (a boot producer/consumer handoff starves);
128 is 3x below the cliff.

Also scale the inline-GPU per-round fairness cap with the budget (flat
64 throttled GPU command processing 17x under superblocks and collapsed
the present loop).

PERF (check -n 100M --gpu-inline): 25.3 -> 42.7 MIPS (1.69x);
1B: 26.0 -> 41.4 MIPS (1.59x).
Callgrind n=5M: host instructions 2.178B -> 1.507B (-31%);
worker_prologue -90%, coord_pre_round -91%, begin_slot_visit /
round_schedule_into / coord_post_round / update_timestamp_bundle each
~-90%; interpreter execute byte-identical (real work unchanged).

GATES:
C1 boot progression: 150M draws 7391/swaps 2164 (baseline 7415/2172),
   1B draws 88547/swaps 29228 linear no stall, K8888 decode + RTs=2
   intact.
C2 determinism: n50m stable digest byte-identical across fresh runs;
   golden re-baselined intentionally (pacing-only deltas: imports
   333453->243387, draws 1274->1279).
C3 milestone-1 render: texture_decodes/draws/swaps/present cadence
   track baseline (3AJ fade-in pacing preserved).
C4: 690 tests green (+2 sync_sensitive).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:31:54 +02:00
parent dc1320cd4b
commit acb29db444
6 changed files with 289 additions and 29 deletions
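
The stop conditions listed in the commit message can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the code in this commit: `Block`, `Step`, `Visit`, `plain`, and `run_chain` are hypothetical names, blocks are indexed by position rather than guest PC, and the MMIO watermark is reduced to a per-block flag.

```rust
/// Hypothetical stand-in for the interpreter's step result.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Step {
    Continue,
    Yield,
}

/// Toy basic block (assumed shape, not the real `DecodedBlock`).
#[derive(Clone, Copy)]
struct Block {
    instrs: u64,
    sync_sensitive: bool,
    touches_mmio: bool,
    result: Step,
    /// Next block index, or `None` when the next PC leaves ordinary
    /// guest code (import thunk / halt sentinel / unmapped).
    next: Option<usize>,
}

/// Outcome of one slot-visit.
#[derive(Debug, PartialEq)]
struct Visit {
    executed: u64,
    blocks_run: usize,
    resume_at: Option<usize>,
}

/// Convenience constructor for a block with no special flags.
fn plain(instrs: u64, next: Option<usize>) -> Block {
    Block {
        instrs,
        sync_sensitive: false,
        touches_mmio: false,
        result: Step::Continue,
        next,
    }
}

/// Chain blocks until a stop condition fires. Every condition depends
/// only on the blocks and the budget, never on wall-clock time.
fn run_chain(blocks: &[Block], start: usize, budget: u64) -> Visit {
    let mut cur = start;
    let mut executed = 0u64;
    let mut blocks_run = 0usize;
    loop {
        let b = blocks[cur];
        // Sum the retired count per block; it is reported once at the end.
        executed = executed.saturating_add(b.instrs);
        blocks_run += 1;
        let stop = b.result != Step::Continue
            || b.sync_sensitive
            || b.touches_mmio
            || executed >= budget;
        match (stop, b.next) {
            // Safe to chain: follow the branch into the next block.
            (false, Some(next)) => cur = next,
            // Stop condition hit, or the next PC is not chainable.
            _ => {
                return Visit {
                    executed,
                    blocks_run,
                    resume_at: b.next,
                }
            }
        }
    }
}
```

In this model the budget is checked after a block completes, so a visit can overshoot the budget by less than one block, and a budget of 1 always stops after the first block, which mirrors the one-block schedule the message describes.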
--- a/crates/xenia-app/src/main.rs
+++ b/crates/xenia-app/src/main.rs
@@ -2326,8 +2326,19 @@ fn coord_post_round(
    let mut gpu_runs = (executed_this_round
        / xenia_cpu::scheduler::HW_THREAD_COUNT as u64)
        .max(1);
-    if gpu_runs > 64 {
-        gpu_runs = 64;
+    // Fairness cap on GPU commands drained per round. Must scale with the
+    // per-round instruction volume: with the superblock runner a single
+    // round legitimately retires up to ~SUPERBLOCK_INSTR_BUDGET per slot
+    // (vs ~6 for the old one-block path), so the rate `executed/6` is much
+    // higher and a flat cap of 64 throttled GPU command processing ~17×
+    // (packets 50279→1861 @50M) — collapsing the present loop / splash.
+    // Cap at the budget so the GPU keeps pace with the CPU at the same
+    // per-instruction rate the one-block path had. The inner loop already
+    // early-breaks on `!gpu.is_ready`, so this only bounds a pathological
+    // backlog, never busy-spins.
+    let gpu_cap = superblock_budget().max(64);
+    if gpu_runs > gpu_cap {
+        gpu_runs = gpu_cap;
    }
    if let Some(gpu) = kernel.gpu.as_inline_mut() {
        gpu.sync_with_mmio();
@@ -2812,6 +2823,160 @@ fn worker_epilogue(
    SlotOutcome::Continue
 }

+/// Hard cap on the number of guest instructions a single superblock
+/// runner invocation executes before returning to the round-robin
+/// scheduler. Bounds how coarse the lockstep interleaving can get: a
+/// larger budget amortizes more per-round/per-slot tax (faster) but
+/// runs one HW thread for longer between scheduler returns (coarser
+/// cross-thread interleaving). 128 keeps a slot-visit ~21× longer
+/// than the old single-block (~6 instr) granularity while still
+/// returning to the round well inside a single 50k quantum. Purely an
+/// instruction count → deterministic, schedule reproduces byte-identically.
+///
+/// Tuned empirically on the Sylpheed boot-to-splash workload (iterate-3AL):
+/// budgets up to 256 keep boot progression healthy (draws / swaps /
+/// packets track the one-block baseline), then a sharp cliff at
+/// ~384 collapses the present loop (a producer/consumer boot handoff
+/// starves when one slot runs too long without returning to the round).
+/// 128 sits 3× below that cliff with ~1.65× boot-to-splash speedup — a
+/// deliberately conservative pick (correctness over the last few %). The
+/// `XENIA_SUPERBLOCK_BUDGET` env var overrides it for further tuning.
+const SUPERBLOCK_INSTR_BUDGET: u64 = 128;
+
+/// Effective superblock budget. Defaults to [`SUPERBLOCK_INSTR_BUDGET`];
+/// `XENIA_SUPERBLOCK_BUDGET` overrides it (A/B tuning without a rebuild).
+/// A budget of 1 reproduces the old one-block-per-slot-visit behaviour
+/// (the chain always stops after the first block). Read once and cached.
+fn superblock_budget() -> u64 {
+    use std::sync::OnceLock;
+    static BUDGET: OnceLock<u64> = OnceLock::new();
+    *BUDGET.get_or_init(|| {
+        std::env::var("XENIA_SUPERBLOCK_BUDGET")
+            .ok()
+            .and_then(|v| v.parse::<u64>().ok())
+            .filter(|&v| v >= 1)
+            .unwrap_or(SUPERBLOCK_INSTR_BUDGET)
+    })
+}
+
+/// Superblock runner (iterate-3AL). Executes a *chain* of basic blocks
+/// for one slot-visit — following each block's terminating branch into
+/// the next block — instead of a single block, amortizing the per-round
+/// (timebase / coord / `round_schedule`) and per-slot (`worker_prologue`)
+/// dispatch tax over up to [`SUPERBLOCK_INSTR_BUDGET`] guest instructions.
+///
+/// Determinism + cross-thread correctness: the chain ENDS (returns to the
+/// round) at exactly the points where lockstep granularity matters, all
+/// pure functions of guest state (never wall-clock):
+///   - a non-`Continue` step result (Yield / SystemCall / Trap / Unimpl /
+///     Halted) — `step_block` already bails on these; `Yield` in
+///     particular is the db16cyc spin-wait hand-off that prevents a
+///     spinner from starving its producer.
+///   - the just-run block was `sync_sensitive` (reserved load/store or a
+///     memory barrier) — the guest's own ordering points.
+///   - the block touched MMIO (the `mem.mmio_access_count()` watermark
+///     advanced) — GPU/register ordering vs other HW threads stays at the
+///     same fine granularity as the old one-block path.
+///   - the next PC leaves ordinary guest code: an import thunk, the halt
+///     sentinel, or unmapped memory — those need the full `worker_prologue`
+///     dispatch, so we stop and let the next round's prologue handle them.
+///   - the instruction budget is reached.
+///
+/// Instruction-count / clock accounting stays exact: `executed` is summed
+/// from the per-block `cycle_count` delta across every chained block and
+/// handed to `worker_epilogue` once, which advances `stats.instruction_count`
+/// and `decrement_quantum` by precisely the retired count — identical to
+/// dispatching each block separately.
+#[allow(clippy::too_many_arguments)]
+fn run_superblock(
+    wc: &mut WorkerCtx,
+    kernel: &mut xenia_kernel::KernelState,
+    mem: &xenia_memory::GuestMemory,
+    debugger: &mut xenia_debugger::Debugger,
+    thunk_map: &HashMap<u32, (ModuleId, u16, String)>,
+    stats: &mut ExecStats,
+    tid: Option<u32>,
+    thread_ref: xenia_cpu::ThreadRef,
+    first_block_ptr: *const xenia_cpu::block_cache::DecodedBlock,
+    first_pc_before: u32,
+) -> SlotOutcome {
+    use xenia_cpu::interpreter::{step_block, StepResult};
+    const LR_HALT: u32 = xenia_cpu::context::LR_HALT_SENTINEL as u32;
+
+    let budget = superblock_budget();
+
+    // Probe / mem-watch / debugger-hook modes need per-block-entry
+    // observability; in those modes never chain (run exactly one block,
+    // identical to the pre-superblock behaviour). The block-cache fast
+    // path is only entered when hooks/DB are off anyway, but a probe or
+    // mem-watch can be armed alongside it.
+    let chain_allowed = !kernel.any_probe_active() && !mem.has_mem_watch();
+
+    let mut block_ptr = first_block_ptr;
+    let mut pc_before = first_pc_before;
+    let mut total_executed: u64 = 0;
+
+    let (result, last_block_ptr, last_pc_before) = loop {
+        let cycle_before = kernel.scheduler.ctx_mut_ref(thread_ref).cycle_count;
+        let mmio_before = mem.mmio_access_count();
+        let block = unsafe { &*block_ptr };
+        let result = {
+            let ctx = kernel.scheduler.ctx_mut_ref(thread_ref);
+            step_block(ctx, mem, block)
+        };
+        let executed = kernel
+            .scheduler
+            .ctx_mut_ref(thread_ref)
+            .cycle_count
+            .saturating_sub(cycle_before);
+        total_executed = total_executed.saturating_add(executed);
+
+        // STOP conditions (any → end the superblock, hand to epilogue):
+        // non-Continue result (let the epilogue apply it), chaining
+        // disabled, a sync-sensitive block just ran, MMIO was touched,
+        // or the budget is spent.
+        if !chain_allowed
+            || !matches!(result, StepResult::Continue)
+            || block.sync_sensitive
+            || mem.mmio_access_count() != mmio_before
+            || total_executed >= budget
+        {
+            break (result, block_ptr, pc_before);
+        }
+
+        // Decide whether the NEXT PC is an ordinary guest block we can
+        // chain into. Anything else (thunk / halt sentinel / unmapped)
+        // needs the full prologue dispatch next round.
+        let next_pc = kernel.scheduler.ctx(wc.hw_id).pc;
+        if next_pc == LR_HALT
+            || (kernel.pc_in_thunk_band(next_pc) && thunk_map.contains_key(&next_pc))
+            || !mem.is_mapped(next_pc)
+        {
+            break (result, block_ptr, pc_before);
+        }
+
+        // Chain: build/fetch the next block. Re-borrows `wc.block_cache`,
+        // which invalidates the previous `block_ptr` — but we've already
+        // finished using it (only `sync_sensitive`/diagnostics were read,
+        // above), so the raw-pointer aliasing rule is respected.
+        pc_before = next_pc;
+        block_ptr = wc.block_cache.lookup_or_build(next_pc, mem) as *const _;
+    };
+
+    worker_epilogue(
+        wc,
+        kernel,
+        debugger,
+        stats,
+        tid,
+        thread_ref,
+        last_block_ptr,
+        last_pc_before,
+        result,
+        total_executed,
+    )
+}
+
 #[instrument(skip_all, fields(max = ?max_instructions, ips = ?ips_limit))]
 fn run_execution(
    mem: &xenia_memory::GuestMemory,
@@ -2825,8 +2990,6 @@ fn run_execution(
    halt_on_deadlock: bool,
    shutdown: Option<std::sync::Arc<std::sync::atomic::AtomicBool>>,
 ) -> ExecStats {
-    use xenia_cpu::interpreter::step_block;
-
    let mut stats = ExecStats::default();
    let _ = quiet; // retained for future per-kind suppression

@@ -2974,34 +3137,25 @@ fn run_execution(
                    block_ptr,
                    pc_before,
                } => {
-                    // Block-cache step. The lockstep path keeps the
-                    // kernel state borrowed straight through (single
-                    // host thread, no contention). Step 03 of the
-                    // M3 real-parallelism plan introduces a
-                    // drop-and-reacquire window around `step_block`
-                    // for the parallel branch.
-                    let cycle_before = kernel.scheduler.ctx_mut_ref(thread_ref).cycle_count;
-                    let block = unsafe { &*block_ptr };
-                    let result = {
-                        let ctx = kernel.scheduler.ctx_mut_ref(thread_ref);
-                        step_block(ctx, mem, block)
-                    };
-                    let executed = kernel
-                        .scheduler
-                        .ctx_mut_ref(thread_ref)
-                        .cycle_count
-                        .saturating_sub(cycle_before);
-                    match worker_epilogue(
+                    // SUPERBLOCK runner (iterate-3AL). Instead of one
+                    // basic block per slot-visit, chain straight-line
+                    // blocks through their branches up to a deterministic
+                    // instruction budget, yielding back to the round only
+                    // at cross-thread synchronization points. Amortizes
+                    // the per-round (timebase / coord / round_schedule)
+                    // and per-slot (prologue) tax over up to ~128
+                    // instructions instead of ~6. See `run_superblock`.
+                    match run_superblock(
                        wc,
                        kernel,
+                        mem,
                        debugger,
+                        thunk_map,
                        &mut stats,
                        tid,
                        thread_ref,
                        block_ptr,
                        pc_before,
-                        result,
-                        executed,
                    ) {
                        SlotOutcome::Continue => continue,
                        SlotOutcome::BreakOuter => break 'outer,
--- a/crates/xenia-app/tests/golden/sylpheed_n2m.json
+++ b/crates/xenia-app/tests/golden/sylpheed_n2m.json
@@ -1,5 +1,5 @@
 {
-  "instructions": 2000005,
+  "instructions": 2000073,
  "imports": 5635,
  "unimpl": 0,
  "draws": 0,
--- a/crates/xenia-app/tests/golden/sylpheed_n50m.json
+++ b/crates/xenia-app/tests/golden/sylpheed_n50m.json
@@ -1,9 +1,9 @@
 {
-  "instructions": 50000007,
-  "imports": 333453,
+  "instructions": 50000110,
+  "imports": 243387,
  "unimpl": 0,
-  "draws": 1274,
-  "swaps": 259,
+  "draws": 1279,
+  "swaps": 260,
  "unique_render_targets": 2,
  "shader_blobs_live": 6,
  "texture_cache_entries": 1
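
The budget override and the scaled GPU fairness cap from the `main.rs` hunks can be restated as two pure functions, which makes the arithmetic easy to check in isolation. This is a hedged sketch rather than the commit's code: `parse_budget`, `gpu_runs`, and `DEFAULT_BUDGET` are hypothetical names, and `HW_THREAD_COUNT = 6` is an assumption taken from the `executed/6` remark in the diff comment.

```rust
/// Assumed default, mirroring `SUPERBLOCK_INSTR_BUDGET` in the diff.
const DEFAULT_BUDGET: u64 = 128;
/// Assumed hardware-thread count (the diff comment divides by 6).
const HW_THREAD_COUNT: u64 = 6;

/// Resolve the effective budget from an optional raw env value.
/// Unparsable input and values below 1 fall back to the default.
fn parse_budget(raw: Option<&str>) -> u64 {
    raw.and_then(|v| v.parse::<u64>().ok())
        .filter(|&v| v >= 1)
        .unwrap_or(DEFAULT_BUDGET)
}

/// GPU commands to drain this round: one per `HW_THREAD_COUNT`
/// retired instructions, at least 1, capped at `max(budget, 64)`.
fn gpu_runs(executed_this_round: u64, budget: u64) -> u64 {
    let cap = budget.max(64);
    (executed_this_round / HW_THREAD_COUNT).max(1).min(cap)
}
```

Under these assumptions a round that retires 6 × 128 = 768 instructions yields `gpu_runs(768, 128) == 128`, whereas a budget of 1 keeps the old ceiling: `gpu_runs(768, 1) == 64`. The raw value could be supplied with `std::env::var("XENIA_SUPERBLOCK_BUDGET").ok().as_deref()`.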