[iterate-3AL] Superblock dispatch: chain basic blocks per slot-visit (~1.6x boot-to-splash)

Replace the one-basic-block-per-slot-per-round lockstep dispatch with a
SUPERBLOCK runner: each slot-visit chains straight-line blocks through
their terminating branches up to a deterministic instruction budget,
amortizing the per-round (timebase/coord/round_schedule) and per-slot
(worker_prologue) dispatch tax over ~128 instructions instead of ~6.

Yield-points (end the chain, return to the round) are pure functions of
guest state, preserving the lockstep cross-thread interleaving
correctness:
- non-Continue step result (Yield/SystemCall/Trap/Unimpl/Halted);
  db16cyc Yield is the spin-wait producer hand-off.
- sync-sensitive block: lwarx/ldarx/stwcx./stdcx. or sync/eieio/isync
  (new PpcOpcode::is_sync_sensitive, flagged on DecodedBlock at build).
- MMIO touch: new GuestMemory::mmio_access_count() watermark, sampled
  per block, keeps GPU/register ordering at one-block granularity.
- next PC leaves ordinary guest code (import thunk / halt sentinel /
  unmapped) -> hand to the full worker_prologue next round.
- instruction budget reached.

Instruction-count/clock accounting stays exact: per-block cycle_count
deltas are summed and handed to worker_epilogue once (instruction_count
+ decrement_quantum advance by the precise retired count).
XENIA_SUPERBLOCK_BUDGET=1 reproduces the old one-block schedule
byte-for-byte.

Budget tuned to 128 (env-overridable): boot progression stays healthy up
to 256, sharp cliff at ~384 (a boot producer/consumer handoff starves);
128 is 3x below the cliff. Also scale the inline-GPU per-round fairness
cap with the budget (flat 64 throttled GPU command processing 17x under
superblocks and collapsed the present loop).

PERF (check -n 100M --gpu-inline): 25.3 -> 42.7 MIPS (1.69x);
1B: 26.0 -> 41.4 MIPS (1.59x).
Callgrind n=5M: host instructions 2.178B -> 1.507B (-31%);
worker_prologue -90%, coord_pre_round -91%, begin_slot_visit /
round_schedule_into / coord_post_round / update_timestamp_bundle each
~-90%; interpreter execute byte-identical (real work unchanged).

GATES:
- C1 boot progression: 150M draws 7391/swaps 2164 (baseline 7415/2172),
  1B draws 88547/swaps 29228 linear no stall, K8888 decode + RTs=2
  intact.
- C2 determinism: n50m stable digest byte-identical across fresh runs;
  golden re-baselined intentionally (pacing-only deltas: imports
  333453->243387, draws 1274->1279).
- C3 milestone-1 render: texture_decodes/draws/swaps/present cadence
  track baseline (3AJ fade-in pacing preserved).
- C4: 690 tests green (+2 sync_sensitive).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:31:54 +02:00
parent dc1320cd4b
commit acb29db444
6 changed files with 289 additions and 29 deletions
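
The chaining rule described in the commit message can be sketched as the loop below. This is a minimal illustration under stated assumptions, not the runner from this commit: `ChainEnd`, `Block`, `ChainResult` and `run_superblock` are hypothetical names, the MMIO watermark is reduced to a per-block flag, and the order in which the yield conditions are tested is a guess.

```rust
use std::collections::HashMap;

/// Why a chain ended (hypothetical names mirroring the commit's yield-point list).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChainEnd {
    NonContinue,
    SyncSensitive,
    MmioTouched,
    LeftGuestCode,
    BudgetReached,
}

/// Simplified stand-in for a decoded block (assumption: the real
/// DecodedBlock stores decoded instructions, not these summary flags).
#[derive(Debug, Clone, Copy)]
struct Block {
    instr_count: u32,
    sync_sensitive: bool,
    touches_mmio: bool,
    continues: bool, // false models Yield/SystemCall/Trap/Unimpl/Halted
    next_pc: u32,
}

#[derive(Debug, PartialEq, Eq)]
struct ChainResult {
    retired: u32, // exact number of instructions retired in this slot-visit
    next_pc: u32, // where the slot resumes next round
    reason: ChainEnd,
}

/// Chain blocks from `start_pc` until a yield-point is hit.
/// A PC missing from `blocks` models "next PC leaves ordinary guest code".
fn run_superblock(blocks: &HashMap<u32, Block>, start_pc: u32, budget: u32) -> ChainResult {
    let mut pc = start_pc;
    let mut retired = 0u32;
    loop {
        let block = match blocks.get(&pc) {
            Some(b) => b,
            None => return ChainResult { retired, next_pc: pc, reason: ChainEnd::LeftGuestCode },
        };
        // "Execute" the block: only the retired count is modelled here.
        retired += block.instr_count;
        let reason = if !block.continues {
            Some(ChainEnd::NonContinue)
        } else if block.sync_sensitive {
            Some(ChainEnd::SyncSensitive)
        } else if block.touches_mmio {
            Some(ChainEnd::MmioTouched)
        } else if retired >= budget {
            Some(ChainEnd::BudgetReached)
        } else {
            None
        };
        if let Some(reason) = reason {
            return ChainResult { retired, next_pc: block.next_pc, reason };
        }
        pc = block.next_pc;
    }
}
```

Every condition depends only on the block and the running count, so the same inputs always produce the same chain. With a budget of 1 the budget check fires after the first block, which is how this sketch models the one-block-per-visit schedule that `XENIA_SUPERBLOCK_BUDGET=1` is said to reproduce.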
--- a/crates/xenia-cpu/src/block_cache.rs
+++ b/crates/xenia-cpu/src/block_cache.rs
@@ -79,6 +79,14 @@ pub struct DecodedBlock {
    /// a successful build (`MAX_BLOCK_INSTRS >= 1` and the build walk
    /// pushes the first decoded word unconditionally).
    pub instrs: Vec<DecodedInstr>,
+    /// True if this block contains a cross-thread synchronization point
+    /// (`PpcOpcode::is_sync_sensitive`: reserved load/store or a memory
+    /// barrier). Computed once at build time. The superblock runner ends
+    /// the run after executing a sync-sensitive block so the lockstep
+    /// interleaving stays fine-grained at exactly those points (preserving
+    /// the cross-thread ordering the 2E/2F/2J boot work depends on),
+    /// while chaining freely through ordinary straight-line blocks.
+    pub sync_sensitive: bool,
 }

 /// Per-slot status from a `lookup_or_build` probe. Internal only.
@@ -187,11 +195,13 @@ fn build_block(start_pc: u32, mem: &dyn MemoryAccess, page_version: u64) -> Deco
    let mut instrs: Vec<DecodedInstr> = Vec::with_capacity(8);
    let page_base = start_pc & GUEST_PAGE_MASK;
    let mut cur = start_pc;
+    let mut sync_sensitive = false;

    loop {
        let raw = mem.read_u32(cur);
        let decoded = decode(raw, cur);
        let terminates = decoded.opcode.terminates_block();
+        sync_sensitive |= decoded.opcode.is_sync_sensitive();
        instrs.push(decoded);

        if terminates {
@@ -215,6 +225,7 @@ fn build_block(start_pc: u32, mem: &dyn MemoryAccess, page_version: u64) -> Deco
        end_pc,
        page_version,
        instrs,
+        sync_sensitive,
    }
 }

@@ -335,6 +346,40 @@ mod tests {
        assert_eq!(b.end_pc, 0x110);
    }

+    #[test]
+    fn sync_sensitive_flag_set_for_barrier_block() {
+        // A block containing `sync` (0x7C0004AC) must flag sync_sensitive
+        // so the superblock runner ends the chain there (cross-thread
+        // ordering point). `sync` does NOT terminate a block, so it sits
+        // mid-block followed by straight-line code up to a terminator.
+        let mem = BlockTestMem::new();
+        mem.put(0x100, enc_addi(3, 3, 1));
+        mem.put(0x104, 0x7C00_04AC); // sync
+        mem.put(0x108, enc_addi(3, 3, 1));
+        mem.put(0x10C, enc_b_self()); // terminator
+        let mut bc = BlockCache::new();
+        let b = bc.lookup_or_build(0x100, &mem);
+        assert!(
+            b.sync_sensitive,
+            "block containing `sync` must flag sync_sensitive; decoded opcodes={:?}",
+            b.instrs.iter().map(|i| i.opcode).collect::<Vec<_>>()
+        );
+    }
+
+    #[test]
+    fn sync_sensitive_flag_clear_for_plain_block() {
+        // A straight-line ALU block with no reserved-op / barrier must
+        // NOT flag sync_sensitive (so the superblock runner is free to
+        // chain through it).
+        let mem = BlockTestMem::new();
+        mem.put(0x100, enc_addi(3, 3, 1));
+        mem.put(0x104, enc_addi(3, 3, 1));
+        mem.put(0x108, enc_b_self());
+        let mut bc = BlockCache::new();
+        let b = bc.lookup_or_build(0x100, &mem);
+        assert!(!b.sync_sensitive, "plain ALU block must not flag sync_sensitive");
+    }
+
    #[test]
    fn block_stops_at_page_boundary() {
        // Build from 0x1FFC. The next PC (0x2000) is in a different
--- a/crates/xenia-cpu/src/opcode.rs
+++ b/crates/xenia-cpu/src/opcode.rs
@@ -204,6 +204,34 @@ impl PpcOpcode {
        )
    }

+    /// Returns true if this opcode is a cross-thread synchronization
+    /// point at which the superblock runner MUST yield back to the
+    /// round-robin scheduler so the lockstep interleaving stays
+    /// fine-grained enough to preserve correct cross-thread ordering:
+    ///
+    ///   - reserved load/store (`lwarx`/`ldarx`/`stwcx.`/`stdcx.`): the
+    ///     atomic primitive other threads race on. Running past one
+    ///     without returning to the scheduler would let a single slot
+    ///     win/lose a reservation across many blocks before any peer
+    ///     observes it.
+    ///   - memory barriers (`sync`/`eieio`/`isync`): the guest explicitly
+    ///     demands a global ordering point here; honour it by ending the
+    ///     superblock so the scheduler re-interleaves.
+    ///
+    /// Purely a function of the opcode (no guest data), so the yield
+    /// decision is deterministic and the schedule reproduces byte-identically.
+    /// Note: `sc` (syscall) and traps already `terminates_block`, and
+    /// import-thunk / halt-sentinel PCs are handled by the per-block
+    /// prologue re-check in the superblock loop — they are not listed here.
+    #[inline]
+    pub fn is_sync_sensitive(&self) -> bool {
+        matches!(
+            self,
+            Self::lwarx | Self::ldarx | Self::stwcx | Self::stdcx
+                | Self::sync | Self::eieio | Self::isync
+        )
+    }
+
    pub fn name(&self) -> &'static str {
        match self {
            Self::Invalid => "invalid",