Replace the one-basic-block-per-slot-per-round lockstep dispatch with a
SUPERBLOCK runner: each slot-visit chains straight-line blocks through
their terminating branches up to a deterministic instruction budget,
amortizing the per-round (timebase/coord/round_schedule) and per-slot
(worker_prologue) dispatch tax over ~128 instructions instead of ~6.
Yield-points (end the chain, return to the round) are pure functions of
guest state, preserving the lockstep cross-thread interleaving correctness:
- non-Continue step result (Yield/SystemCall/Trap/Unimpl/Halted);
db16cyc Yield is the spin-wait producer hand-off.
- sync-sensitive block: lwarx/ldarx/stwcx./stdcx. or sync/eieio/isync
(new PpcOpcode::is_sync_sensitive, flagged on DecodedBlock at build).
- MMIO touch: new GuestMemory::mmio_access_count() watermark, sampled
per block, keeps GPU/register ordering at one-block granularity.
- next PC leaves ordinary guest code (import thunk / halt sentinel /
unmapped) -> hand to the full worker_prologue next round.
- instruction budget reached.
Instruction-count/clock accounting stays exact: per-block cycle_count
deltas are summed and handed to worker_epilogue once (instruction_count +
decrement_quantum advance by the precise retired count). XENIA_SUPERBLOCK_BUDGET=1
reproduces the old one-block schedule byte-for-byte.
Budget tuned to 128 (env-overridable): boot progression stays healthy up
to 256, sharp cliff at ~384 (a boot producer/consumer handoff starves);
128 is 3x below the cliff. Also scale the inline-GPU per-round fairness
cap with the budget (flat 64 throttled GPU command processing 17x under
superblocks and collapsed the present loop).
PERF (check -n 100M --gpu-inline): 25.3 -> 42.7 MIPS (1.69x); 1B: 26.0 ->
41.4 MIPS (1.59x). Callgrind n=5M: host instructions 2.178B -> 1.507B
(-31%); worker_prologue -90%, coord_pre_round -91%, begin_slot_visit /
round_schedule_into / coord_post_round / update_timestamp_bundle each
~-90%; interpreter execute byte-identical (real work unchanged).
GATES: C1 boot progression 150M draws 7391/swaps 2164 (baseline 7415/2172),
1B draws 88547/swaps 29228 linear no stall, K8888 decode + RTs=2 intact.
C2 determinism: n50m stable digest byte-identical across fresh runs;
golden re-baselined intentionally (pacing-only deltas: imports 333453->243387,
draws 1274->1279). C3 milestone-1 render: texture_decodes/draws/swaps/
present cadence track baseline (3AJ fade-in pacing preserved). C4: 690
tests green (+2 sync_sensitive).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
11 lines
178 B
JSON
11 lines
178 B
JSON
{
|
|
"instructions": 2000073,
|
|
"imports": 5635,
|
|
"unimpl": 0,
|
|
"draws": 0,
|
|
"swaps": 0,
|
|
"unique_render_targets": 0,
|
|
"shader_blobs_live": 0,
|
|
"texture_cache_entries": 0
|
|
}
|