[iterate-3AL] Superblock dispatch: chain basic blocks per slot-visit (~1.6x boot-to-splash)

Replace the one-basic-block-per-slot-per-round lockstep dispatch with a
SUPERBLOCK runner: each slot-visit chains straight-line blocks through
their terminating branches up to a deterministic instruction budget,
amortizing the per-round (timebase/coord/round_schedule) and per-slot
(worker_prologue) dispatch tax over ~128 instructions instead of ~6.

Yield-points (end the chain, return to the round) are pure functions of
guest state, so the lockstep cross-thread interleaving stays deterministic
and correct:
  - non-Continue step result (Yield/SystemCall/Trap/Unimpl/Halted);
    db16cyc Yield is the spin-wait producer hand-off.
  - sync-sensitive block: lwarx/ldarx/stwcx./stdcx. or sync/eieio/isync
    (new PpcOpcode::is_sync_sensitive, flagged on DecodedBlock at build).
  - MMIO touch: new GuestMemory::mmio_access_count() watermark, sampled
    per block, keeps GPU/register ordering at one-block granularity.
  - next PC leaves ordinary guest code (import thunk / halt sentinel /
    unmapped) -> hand to the full worker_prologue next round.
  - instruction budget reached.
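
The yield-point list above can be sketched as a toy loop. This is a
minimal illustration, not the runner from this commit: `ToyBlock`,
`ChainEnd`, `plain` and `run_superblock` are hypothetical names, the order
of the checks is an assumption, and running off the end of the toy slice
stands in for "next PC leaves ordinary guest code".

```rust
/// Result of executing one basic block (simplified, hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum StepResult {
    Continue,
    Yield,
}

/// Why the chain ended (illustrative names, not taken from the commit).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ChainEnd {
    NonContinue,
    SyncSensitive,
    MmioTouch,
    LeftGuestCode,
    Budget,
}

/// Toy stand-in for a decoded block plus the effects of running it.
#[derive(Clone, Copy, Debug)]
struct ToyBlock {
    instrs: u32,
    sync_sensitive: bool,
    mmio_accesses: u64,
    result: StepResult,
    next_pc_ordinary: bool,
}

/// An ordinary straight-line block of `instrs` instructions.
fn plain(instrs: u32) -> ToyBlock {
    ToyBlock {
        instrs,
        sync_sensitive: false,
        mmio_accesses: 0,
        result: StepResult::Continue,
        next_pc_ordinary: true,
    }
}

/// Chains blocks until a yield-point; returns (retired instructions, reason).
fn run_superblock(blocks: &[ToyBlock], budget: u32) -> (u32, ChainEnd) {
    let mut retired = 0u32;
    let mut mmio_count = 0u64; // stand-in for an MMIO access counter
    for b in blocks {
        let mmio_before = mmio_count; // watermark sampled once per block
        mmio_count += b.mmio_accesses; // "execute" the block
        retired += b.instrs;
        if b.result != StepResult::Continue {
            return (retired, ChainEnd::NonContinue);
        }
        if b.sync_sensitive {
            return (retired, ChainEnd::SyncSensitive);
        }
        if mmio_count != mmio_before {
            return (retired, ChainEnd::MmioTouch);
        }
        if !b.next_pc_ordinary {
            return (retired, ChainEnd::LeftGuestCode);
        }
        if retired >= budget {
            return (retired, ChainEnd::Budget);
        }
    }
    // Assumption: an exhausted toy slice models leaving ordinary guest code.
    (retired, ChainEnd::LeftGuestCode)
}
```

In this sketch a budget of 1 ends every chain after its first block,
which is the one-block-per-visit behaviour of the old dispatch.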

Instruction-count/clock accounting stays exact: per-block cycle_count
deltas are summed and handed to worker_epilogue once (instruction_count +
decrement_quantum advance by the precise retired count). XENIA_SUPERBLOCK_BUDGET=1
reproduces the old one-block schedule byte-for-byte.
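
A small sketch of why a single epilogue call keeps the accounting exact,
assuming the epilogue only adds the retired count to the counters. The
`Counters` struct, its `quantum_remaining` field and `sum_block_deltas`
are hypothetical stand-ins, not the real worker_epilogue state.

```rust
/// Hypothetical per-thread counters; the real fields live elsewhere.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct Counters {
    instruction_count: u64,
    quantum_remaining: u64,
}

impl Counters {
    /// Toy epilogue: charge `retired` instructions to both counters.
    fn epilogue(&mut self, retired: u64) {
        self.instruction_count += retired;
        self.quantum_remaining = self.quantum_remaining.saturating_sub(retired);
    }
}

/// Sums per-block deltas of a monotonically increasing cycle counter,
/// given one sample before the first block and one after each block.
fn sum_block_deltas(cycle_samples: &[u64]) -> u64 {
    cycle_samples.windows(2).map(|w| w[1] - w[0]).sum()
}
```

Under that assumption, charging the summed delta once leaves the counters
in the same state as charging each block separately, so chaining changes
when the epilogue runs but not what it records.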

Budget tuned to 128 (env-overridable): boot progression stays healthy up
to 256, sharp cliff at ~384 (a boot producer/consumer handoff starves);
128 leaves a 3x margin below the cliff. Also scale the inline-GPU per-round
fairness cap with the budget (a flat cap of 64 throttled GPU command
processing 17x under superblocks and collapsed the present loop).
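
A hedged sketch of the env override and cap scaling. Only the variable
name XENIA_SUPERBLOCK_BUDGET, the default of 128 and the flat cap of 64
come from this message; the function names, the fallback on invalid
input, and the linear scaling rule are assumptions made for illustration.

```rust
const DEFAULT_BUDGET: u32 = 128;
const BASE_GPU_CAP: u32 = 64;

/// Parses a budget override; falls back to the default when the value is
/// missing, unparsable, or zero (assumed policy, not from the commit).
fn parse_budget(raw: Option<&str>) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|&b| b >= 1)
        .unwrap_or(DEFAULT_BUDGET)
}

/// Reads the override from the process environment.
fn budget_from_env() -> u32 {
    parse_budget(std::env::var("XENIA_SUPERBLOCK_BUDGET").ok().as_deref())
}

/// Assumed linear scaling: budget 1 keeps the old flat cap of 64.
fn gpu_cap_for_budget(budget: u32) -> u32 {
    BASE_GPU_CAP.saturating_mul(budget)
}
```

Linear scaling is chosen here only because it reduces to the old flat cap
when the budget is 1; the actual scaling rule is not stated in this
message.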

PERF (check -n 100M --gpu-inline): 25.3 -> 42.7 MIPS (1.69x); 1B: 26.0 ->
41.4 MIPS (1.59x). Callgrind n=5M: host instructions 2.178B -> 1.507B
(-31%); worker_prologue -90%, coord_pre_round -91%, begin_slot_visit /
round_schedule_into / coord_post_round / update_timestamp_bundle each
~-90%; interpreter execute byte-identical (real work unchanged).

GATES: C1 boot progression 150M draws 7391/swaps 2164 (baseline 7415/2172),
1B draws 88547/swaps 29228 linear no stall, K8888 decode + RTs=2 intact.
C2 determinism: n50m stable digest byte-identical across fresh runs;
golden re-baselined intentionally (pacing-only deltas: imports 333453->243387,
draws 1274->1279). C3 milestone-1 render: texture_decodes/draws/swaps/
present cadence track baseline (3AJ fade-in pacing preserved). C4: 690
tests green (+2 sync_sensitive).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-19 22:31:54 +02:00
parent dc1320cd4b
commit acb29db444
6 changed files with 289 additions and 29 deletions


@@ -79,6 +79,14 @@ pub struct DecodedBlock {
/// a successful build (`MAX_BLOCK_INSTRS >= 1` and the build walk
/// pushes the first decoded word unconditionally).
pub instrs: Vec<DecodedInstr>,
/// True if this block contains a cross-thread synchronization point
/// (`PpcOpcode::is_sync_sensitive`: reserved load/store or a memory
/// barrier). Computed once at build time. The superblock runner ends
/// the run after executing a sync-sensitive block so the lockstep
/// interleaving stays fine-grained at exactly those points (preserving
/// the cross-thread ordering the 2E/2F/2J boot work depends on),
/// while chaining freely through ordinary straight-line blocks.
pub sync_sensitive: bool,
}
/// Per-slot status from a `lookup_or_build` probe. Internal only.
@@ -187,11 +195,13 @@ fn build_block(start_pc: u32, mem: &dyn MemoryAccess, page_version: u64) -> Deco
let mut instrs: Vec<DecodedInstr> = Vec::with_capacity(8);
let page_base = start_pc & GUEST_PAGE_MASK;
let mut cur = start_pc;
let mut sync_sensitive = false;
loop {
let raw = mem.read_u32(cur);
let decoded = decode(raw, cur);
let terminates = decoded.opcode.terminates_block();
sync_sensitive |= decoded.opcode.is_sync_sensitive();
instrs.push(decoded);
if terminates {
@@ -215,6 +225,7 @@ fn build_block(start_pc: u32, mem: &dyn MemoryAccess, page_version: u64) -> Deco
end_pc,
page_version,
instrs,
sync_sensitive,
}
}
@@ -335,6 +346,40 @@ mod tests {
assert_eq!(b.end_pc, 0x110);
}
#[test]
fn sync_sensitive_flag_set_for_barrier_block() {
// A block containing `sync` (0x7C0004AC) must flag sync_sensitive
// so the superblock runner ends the chain there (cross-thread
// ordering point). `sync` does NOT terminate a block, so it sits
// mid-block followed by straight-line code up to a terminator.
let mem = BlockTestMem::new();
mem.put(0x100, enc_addi(3, 3, 1));
mem.put(0x104, 0x7C00_04AC); // sync
mem.put(0x108, enc_addi(3, 3, 1));
mem.put(0x10C, enc_b_self()); // terminator
let mut bc = BlockCache::new();
let b = bc.lookup_or_build(0x100, &mem);
assert!(
b.sync_sensitive,
"block containing `sync` must flag sync_sensitive; decoded opcodes={:?}",
b.instrs.iter().map(|i| i.opcode).collect::<Vec<_>>()
);
}
#[test]
fn sync_sensitive_flag_clear_for_plain_block() {
// A straight-line ALU block with no reserved-op / barrier must
// NOT flag sync_sensitive (so the superblock runner is free to
// chain through it).
let mem = BlockTestMem::new();
mem.put(0x100, enc_addi(3, 3, 1));
mem.put(0x104, enc_addi(3, 3, 1));
mem.put(0x108, enc_b_self());
let mut bc = BlockCache::new();
let b = bc.lookup_or_build(0x100, &mem);
assert!(!b.sync_sensitive, "plain ALU block must not flag sync_sensitive");
}
#[test]
fn block_stops_at_page_boundary() {
// Build from 0x1FFC. The next PC (0x2000) is in a different


@@ -204,6 +204,34 @@ impl PpcOpcode {
)
}
/// Returns true if this opcode is a cross-thread synchronization
/// point at which the superblock runner MUST yield back to the
/// round-robin scheduler so the lockstep interleaving stays
/// fine-grained enough to preserve correct cross-thread ordering:
///
/// - reserved load/store (`lwarx`/`ldarx`/`stwcx.`/`stdcx.`): the
/// atomic primitive other threads race on. Running past one
/// without returning to the scheduler would let a single slot
/// win/lose a reservation across many blocks before any peer
/// observes it.
/// - memory barriers (`sync`/`eieio`/`isync`): the guest explicitly
/// demands a global ordering point here; honour it by ending the
/// superblock so the scheduler re-interleaves.
///
/// Purely a function of the opcode (no guest data), so the yield
/// decision is deterministic and the schedule reproduces byte-identically.
/// Note: `sc` (syscall) and traps already terminate their block
/// (`terminates_block`), and import-thunk / halt-sentinel PCs are
/// handled by the per-block prologue re-check in the superblock loop,
/// so they are not listed here.
#[inline]
pub fn is_sync_sensitive(&self) -> bool {
matches!(
self,
Self::lwarx | Self::ldarx | Self::stwcx | Self::stdcx
| Self::sync | Self::eieio | Self::isync
)
}
pub fn name(&self) -> &'static str {
match self {
Self::Invalid => "invalid",