[iterate-3AL] Superblock dispatch: chain basic blocks per slot-visit (~1.6x boot-to-splash)

Replace the one-basic-block-per-slot-per-round lockstep dispatch with a
SUPERBLOCK runner: each slot-visit chains straight-line blocks through
their terminating branches up to a deterministic instruction budget,
amortizing the per-round (timebase/coord/round_schedule) and per-slot
(worker_prologue) dispatch tax over ~128 instructions instead of ~6.

Yield points (which end the chain and return to the round) are pure functions
of guest state, so lockstep cross-thread interleaving stays correct:
  - non-Continue step result (Yield/SystemCall/Trap/Unimpl/Halted);
    db16cyc Yield is the spin-wait producer hand-off.
  - sync-sensitive block: lwarx/ldarx/stwcx./stdcx. or sync/eieio/isync
    (new PpcOpcode::is_sync_sensitive, flagged on DecodedBlock at build).
  - MMIO touch: new GuestMemory::mmio_access_count() watermark, sampled
    per block, keeps GPU/register ordering at one-block granularity.
  - next PC leaves ordinary guest code (import thunk / halt sentinel /
    unmapped) -> hand to the full worker_prologue next round.
  - instruction budget reached.
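
A minimal sketch of how the yield-points above could combine in one
slot-visit. Everything here is a hypothetical toy model, not the code in
this commit: `Block`, `ChainEnd` and `run_superblock` are illustrative names,
the MMIO counter is a plain local standing in for
GuestMemory::mmio_access_count(), and both the order of the checks and
the choice to end the chain after (rather than before) a sync-sensitive
block are assumptions.

```rust
// Hypothetical model of one superblock slot-visit (not the emulator's types).

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum StepResult { Continue, Yield, SystemCall, Trap, Unimpl, Halted }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ChainEnd {
    NonContinue(StepResult),
    SyncSensitive,
    MmioTouch,
    LeftGuestCode,
    Budget,
}

/// Toy stand-in for a decoded basic block and the effects of running it.
struct Block {
    instrs: u64,              // instructions retired by this block
    sync_sensitive: bool,     // flagged at build time (lwarx/stwcx./sync/...)
    mmio_accesses: u64,       // MMIO accesses the block performs when run
    result: StepResult,       // step result of the block's last instruction
    next_is_guest_code: bool, // false for import thunk / halt sentinel / unmapped
}

/// Chains blocks in order until a yield-point fires.
/// Returns the reason the chain ended and the retired instruction count.
fn run_superblock(blocks: &[Block], budget: u64) -> (ChainEnd, u64) {
    let mut retired = 0u64;
    let mut mmio_count = 0u64; // models the monotonic MMIO watermark
    for b in blocks {
        let before = mmio_count;
        mmio_count += b.mmio_accesses; // "execute" the block
        retired += b.instrs;
        if b.result != StepResult::Continue {
            return (ChainEnd::NonContinue(b.result), retired);
        }
        if b.sync_sensitive {
            return (ChainEnd::SyncSensitive, retired);
        }
        if mmio_count != before {
            return (ChainEnd::MmioTouch, retired);
        }
        if !b.next_is_guest_code {
            return (ChainEnd::LeftGuestCode, retired);
        }
        if retired >= budget {
            return (ChainEnd::Budget, retired);
        }
    }
    // No further block to chain to: treated here as leaving guest code.
    (ChainEnd::LeftGuestCode, retired)
}
```

In this model a budget of 1 always ends the chain after the first block,
since every block retires at least one instruction.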

Instruction-count/clock accounting stays exact: per-block cycle_count
deltas are summed and handed to worker_epilogue once, so instruction_count
and decrement_quantum both advance by the precise retired count.
XENIA_SUPERBLOCK_BUDGET=1 reproduces the old one-block schedule byte-for-byte.
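
The accounting claim can be illustrated with a small sketch. The
`Accounting` struct, `epilogue` and `account_superblock` are hypothetical
stand-ins (the real worker_epilogue is not shown in this excerpt), and
decrement_quantum is modelled as a plain counter that advances, which is
an assumption about its semantics.

```rust
// Hypothetical model: sum per-block cycle_count deltas, apply them once.

#[derive(Debug, Default, PartialEq, Eq)]
struct Accounting {
    instruction_count: u64,
    decrement_quantum: u64,
}

/// Stand-in for the epilogue: both counters advance by the retired count.
fn epilogue(acc: &mut Accounting, retired: u64) {
    acc.instruction_count += retired;
    acc.decrement_quantum += retired;
}

/// `cycle_samples[i]` is the cycle_count observed at block boundary `i`
/// (assumed monotonically non-decreasing). Adjacent differences are the
/// per-block deltas; their sum is handed to the epilogue exactly once.
fn account_superblock(acc: &mut Accounting, cycle_samples: &[u64]) -> u64 {
    let retired: u64 = cycle_samples.windows(2).map(|w| w[1] - w[0]).sum();
    epilogue(acc, retired);
    retired
}
```

Calling `epilogue` once per block with each delta leaves the counters in
the same final state as this single call, which is why the totals stay
exact in the model.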

Budget tuned to 128 (env-overridable): boot progression stays healthy up
to 256, sharp cliff at ~384 (a boot producer/consumer handoff starves);
128 is 3x below the cliff. Also scale the inline-GPU per-round fairness
cap with the budget: the previous flat cap of 64 throttled GPU command
processing 17x under superblocks and collapsed the present loop.
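
One way the env override might be read, as a sketch only: the variable
name XENIA_SUPERBLOCK_BUDGET and the default of 128 come from this
message, while `parse_budget`, `superblock_budget` and the fallback to the
default on unparsable or zero values are assumptions.

```rust
// Hypothetical helper for the env-overridable budget.

const DEFAULT_SUPERBLOCK_BUDGET: u64 = 128;

/// Parses a raw override value. Missing, unparsable, or zero values fall
/// back to the default (the fallback policy is an assumption).
fn parse_budget(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&n| n >= 1)
        .unwrap_or(DEFAULT_SUPERBLOCK_BUDGET)
}

/// Reads the override from the process environment.
fn superblock_budget() -> u64 {
    parse_budget(std::env::var("XENIA_SUPERBLOCK_BUDGET").ok().as_deref())
}
```

Keeping the parsing separate from the environment lookup lets the policy
be tested without mutating process-wide state.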

PERF (check -n 100M --gpu-inline): 25.3 -> 42.7 MIPS (1.69x); 1B: 26.0 ->
41.4 MIPS (1.59x). Callgrind n=5M: host instructions 2.178B -> 1.507B
(-31%); worker_prologue -90%, coord_pre_round -91%, begin_slot_visit /
round_schedule_into / coord_post_round / update_timestamp_bundle each
~-90%; interpreter execute byte-identical (real work unchanged).

GATES: C1 boot progression 150M draws 7391/swaps 2164 (baseline 7415/2172),
1B draws 88547/swaps 29228 linear no stall, K8888 decode + RTs=2 intact.
C2 determinism: n50m stable digest byte-identical across fresh runs;
golden re-baselined intentionally (pacing-only deltas: imports 333453->243387,
draws 1274->1279). C3 milestone-1 render: texture_decodes/draws/swaps/
present cadence track baseline (3AJ fade-in pacing preserved). C4: 690
tests green (+2 sync_sensitive).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-19 22:31:54 +02:00
parent dc1320cd4b
commit acb29db444
6 changed files with 289 additions and 29 deletions


@@ -89,6 +89,14 @@ pub struct GuestMemory {
mem_watch_addrs: Vec<u32>,
/// Count of fires observed (for tests / hand-off telemetry).
mem_watch_count: AtomicU64,
/// Monotonic count of MMIO accesses (every scalar load/store that
/// resolves to a registered MMIO region bumps this by 1). A pure,
/// deterministic function of guest execution — the superblock runner
/// samples it before/after each block to detect an MMIO touch and
/// end the run there (so MMIO ordering vs other HW threads stays at
/// the same fine lockstep granularity as before). Relaxed because the
/// lockstep path is single-threaded and only needs monotonicity.
mmio_access_count: AtomicU64,
}
/// Greatest common bit-mask such that `(a & m) == (b & m)` for every bit
@@ -133,9 +141,26 @@ impl GuestMemory {
writes_total: AtomicU64::new(0),
mem_watch_addrs: Vec::new(),
mem_watch_count: AtomicU64::new(0),
mmio_access_count: AtomicU64::new(0),
})
}
/// Monotonic count of MMIO accesses since boot. Used by the superblock
/// runner to detect that a just-executed block touched MMIO (so it can
/// end the superblock there and keep MMIO ordering at lockstep
/// granularity). Deterministic function of guest execution.
#[inline]
pub fn mmio_access_count(&self) -> u64 {
self.mmio_access_count
.load(std::sync::atomic::Ordering::Relaxed)
}
#[inline]
fn bump_mmio_access(&self) {
self.mmio_access_count
.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
}
/// Current version watermark for the page containing `addr`. Bumped by
/// any write through `write_u8/16/32/64`. Not affected by MMIO writes
/// (those don't touch the backing texture memory).
@@ -488,6 +513,7 @@ impl MemoryAccess for GuestMemory {
// MMIO dispatch must come first — a byte read at an MMIO-mapped
// address should invoke the callback, not the backing memory.
if let Some(mmio) = self.find_mmio(addr) {
self.bump_mmio_access();
return (mmio.read_callback)(addr) as u8;
}
if !self.is_mapped(addr) { return 0; }
@@ -498,6 +524,7 @@ impl MemoryAccess for GuestMemory {
#[inline]
fn read_u16(&self, addr: u32) -> u16 {
if let Some(mmio) = self.find_mmio(addr) {
self.bump_mmio_access();
(mmio.read_callback)(addr) as u16
} else if !self.is_mapped(addr) {
0
@@ -510,6 +537,7 @@ impl MemoryAccess for GuestMemory {
#[inline]
fn read_u32(&self, addr: u32) -> u32 {
if let Some(mmio) = self.find_mmio(addr) {
self.bump_mmio_access();
(mmio.read_callback)(addr)
} else if !self.is_mapped(addr) {
0
@@ -522,6 +550,7 @@ impl MemoryAccess for GuestMemory {
#[inline]
fn read_u64(&self, addr: u32) -> u64 {
if let Some(mmio) = self.find_mmio(addr) {
self.bump_mmio_access();
let hi = (mmio.read_callback)(addr) as u64;
let lo = (mmio.read_callback)(addr.wrapping_add(4)) as u64;
(hi << 32) | lo
@@ -537,6 +566,7 @@ impl MemoryAccess for GuestMemory {
// MMIO dispatch first — a byte write at an MMIO-mapped address
// must invoke the callback, not the backing memory.
if let Some(mmio) = self.find_mmio(addr) {
self.bump_mmio_access();
(mmio.write_callback)(addr, val as u32);
return;
}
@@ -555,6 +585,7 @@ impl MemoryAccess for GuestMemory {
fn write_u16(&self, addr: u32, val: u16) {
if let Some(mmio) = self.find_mmio(addr) {
self.bump_mmio_access();
(mmio.write_callback)(addr, val as u32);
} else if !self.is_mapped(addr) {
} else {
@@ -577,6 +608,7 @@ impl MemoryAccess for GuestMemory {
fn write_u32(&self, addr: u32, val: u32) {
if let Some(mmio) = self.find_mmio(addr) {
self.bump_mmio_access();
(mmio.write_callback)(addr, val);
} else if !self.is_mapped(addr) {
} else {
@@ -596,6 +628,7 @@ impl MemoryAccess for GuestMemory {
fn write_u64(&self, addr: u32, val: u64) {
if let Some(mmio) = self.find_mmio(addr) {
self.bump_mmio_access();
(mmio.write_callback)(addr, (val >> 32) as u32);
(mmio.write_callback)(addr.wrapping_add(4), val as u32);
} else if !self.is_mapped(addr) {