[iterate-3AK] Perf quick-wins: ~21% faster boot-to-splash (22→27 MIPS)

Profile-driven low-risk optimizations attacking the ~48% per-block /
per-round host-bookkeeping tax found by the callgrind profile. Measured
on the bounded headless workload `check -n 100000000 --gpu-inline`:
baseline ~4490 ms (22.3 MIPS) -> ~3700 ms (27.0 MIPS), +21%.
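
The headline numbers can be re-derived with a small sketch. It assumes `-n 100000000` is the retired-instruction budget (the flag's meaning is not stated here), and the helper names are illustrative, not part of the tree:

```rust
/// Million instructions per second for `instructions` retired in `elapsed_ms`.
fn mips(instructions: u64, elapsed_ms: f64) -> f64 {
    instructions as f64 / (elapsed_ms / 1000.0) / 1.0e6
}

/// Relative throughput gain (0.21 means +21% instructions per second).
fn throughput_gain(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms - 1.0
}

/// Relative wall-time reduction for the same fixed workload.
fn time_reduction(before_ms: f64, after_ms: f64) -> f64 {
    1.0 - after_ms / before_ms
}
```

With the rounded timings above, `throughput_gain(4490.0, 3700.0)` comes out near 0.21 while `time_reduction` is near 0.18, so the +21% figure is a throughput gain and the wall-time saving is the smaller number.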

Tier A (determinism-neutral; n50m golden byte-IDENTICAL, exit 0):
1. mem-watch write path: gate capture_mem_watch_old/check_mem_watch
   behind one has_mem_watch() predicted branch in write_u8/16/32/64 +
   write_bulk so the common (no-watch) store does no out-of-line call.
   check_mem_watch (4.8%) gone from the profile.
2. round-schedule alloc churn: add Scheduler::round_schedule_into, which
   fills a reusable [u8; HW_THREAD_COUNT] stack buffer, so the lockstep
   round loop no longer allocates and frees a Vec<u8> every round.
   Ordering and RNG advance are identical. __rust_alloc/__rust_dealloc
   gone from the profile.
3. probe-firing: hoist a single KernelState::any_probe_active() guard to
   worker_prologue so the four fire_*_if_match calls don't happen at all
   when no probe is configured (was 4x call overhead/visit). All four
   gone from the profile.
4. thunk-map hash: range-reject pc against the registered import-thunk
   address band (KernelState::pc_in_thunk_band, two int compares) before
   the thunk_map.get(&pc) HashMap lookup. hash_one (4.3%) gone.
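
Items 2 and 4 can be sketched in isolation as follows. This is a minimal illustration, not the code in this commit: `ThunkTable`, its fields, the ascending-order schedule, and `HW_THREAD_COUNT = 6` are all assumptions made for the example, and the real scheduler's ordering and RNG handling are not modelled.

```rust
use std::collections::HashMap;

/// Assumed value for illustration only; the real constant lives in the scheduler.
const HW_THREAD_COUNT: usize = 6;

/// Item 2 pattern: write the ready ids into a caller-owned buffer and
/// return how many were written, so the caller never allocates per round.
fn round_schedule_into(
    ready: &[bool; HW_THREAD_COUNT],
    out: &mut [u8; HW_THREAD_COUNT],
) -> usize {
    let mut n = 0;
    for (id, &is_ready) in ready.iter().enumerate() {
        if is_ready {
            out[n] = id as u8;
            n += 1;
        }
    }
    n
}

/// Item 4 pattern: a hypothetical thunk table that tracks the inclusive
/// [lo, hi] address band covering every registered key.
struct ThunkTable {
    map: HashMap<u32, &'static str>,
    band: Option<(u32, u32)>,
}

impl ThunkTable {
    fn new() -> Self {
        Self { map: HashMap::new(), band: None }
    }

    /// Register a thunk and widen the band so it always covers all keys.
    fn register(&mut self, pc: u32, name: &'static str) {
        self.map.insert(pc, name);
        self.band = Some(match self.band {
            None => (pc, pc),
            Some((lo, hi)) => (lo.min(pc), hi.max(pc)),
        });
    }

    /// Two integer compares; always false while the table is empty.
    fn pc_in_band(&self, pc: u32) -> bool {
        matches!(self.band, Some((lo, hi)) if lo <= pc && pc <= hi)
    }

    /// Range-reject first; only in-band PCs pay for the hash lookup.
    fn lookup(&self, pc: u32) -> Option<&'static str> {
        if !self.pc_in_band(pc) {
            return None;
        }
        self.map.get(&pc).copied()
    }
}
```

The range check is only a faithful no-op if the band is widened on every registration; an in-band PC that is not a thunk still falls through to the exact map lookup and returns `None`.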

Tier B (#5, time-granularity change — LANDED, no re-baseline needed):
5. update_timestamp_bundle: throttle to a 0.25 ms quantum (only re-write
   the KeTimeStampBundle when the deterministic clock advanced >= 2500
   units). Inclusive cost 8.65% -> 1.08%. The quantum is far below the
   1 ms granularity any guest deadline math needs (tick_count stays
   fresh; the hub gate is +66 ms; the fade-in is vsync-counter driven per
   3AH, not this bundle). VERIFIED: n50m stable digest BYTE-IDENTICAL to
   the existing golden (so no re-baseline), 150M boot reaches the splash
   (draws=7415, swaps=2172, gpu.texture.decode{K8888}=448, RTs=2 — all
   match the post-3AJ baseline), 688 tests green, release n50m oracle ok.
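
The throttle in item 5 amounts to a "rewrite only after a full quantum" check. A minimal sketch, assuming the deterministic clock counts 100 ns units (so 2500 units = 0.25 ms); `TimestampThrottle` and `should_update` are hypothetical names, not the ones used in update_timestamp_bundle:

```rust
/// 0.25 ms in clock units, assuming one unit is 100 ns.
const BUNDLE_QUANTUM: u64 = 2_500;

/// Hypothetical helper that remembers when the bundle was last rewritten.
struct TimestampThrottle {
    last_written: Option<u64>,
}

impl TimestampThrottle {
    fn new() -> Self {
        Self { last_written: None }
    }

    /// Returns true when the caller should rewrite the bundle at `now`.
    /// The first call always writes; later calls write only once the
    /// clock has advanced by at least one full quantum since the last write.
    fn should_update(&mut self, now: u64) -> bool {
        let due = match self.last_written {
            None => true,
            Some(prev) => now.saturating_sub(prev) >= BUNDLE_QUANTUM,
        };
        if due {
            self.last_written = Some(now);
        }
        due
    }
}
```

Because the decision depends only on the deterministic clock, the set of rounds that rewrite the bundle is the same on every run, which is consistent with the byte-identical digest reported above.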

Remaining headroom: interpreter::execute (13%), decrement_quantum (8%),
step_block (7%) are now the top self-costs — the structural superblock/
JIT lever is the next step for the larger gain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-19 22:05:53 +02:00
parent 9d24dd0eaa
commit dc1320cd4b
4 changed files with 156 additions and 26 deletions


@@ -2459,10 +2459,19 @@ fn worker_prologue(
// and println one record. Read-only; lockstep digest unaffected.
// Empty set is the common case → single `is_empty()` test inside
// the helper, no overhead on the hot path.
-kernel.fire_ctor_probe_if_match(hw_id, mem);
-kernel.fire_branch_probe_if_match(hw_id);
-kernel.fire_audit_pc_probe_if_match(hw_id, mem);
-kernel.fire_lr_trace_if_match(hw_id);
+// Perf (Tier-A #3): all four `fire_*_if_match` helpers early-return
+// on an empty registry, but paying 4× call overhead per slot-visit
+// (~3.2M visits boot-to-splash) is itself measurable. Gate the whole
+// group behind a single `any_probe_active()` predicted branch so the
+// common (no-probe) headless path never even makes the calls. When a
+// probe IS configured each helper still re-checks its own set, so
+// behaviour is identical either way.
+if kernel.any_probe_active() {
+    kernel.fire_ctor_probe_if_match(hw_id, mem);
+    kernel.fire_branch_probe_if_match(hw_id);
+    kernel.fire_audit_pc_probe_if_match(hw_id, mem);
+    kernel.fire_lr_trace_if_match(hw_id);
+}
if mem.has_mem_watch() {
let ctx = kernel.scheduler.ctx(hw_id);
@@ -2528,8 +2537,15 @@ fn worker_prologue(
return PrologueOutcome::Continue;
}
-// 2) Import thunk intercept.
-if let Some((module, ordinal, name)) = thunk_map.get(&pc) {
+// 2) Import thunk intercept. Perf (Tier-A #4): import thunks occupy a
+// small contiguous address band; the overwhelming majority of executing
+// PCs are ordinary guest code outside it. Range-reject against the band
+// (two integer compares) before paying the `thunk_map` hash. Faithful
+// no-op — any in-band PC still goes through the exact map lookup, and an
+// out-of-band PC can never be a registered thunk.
+if kernel.pc_in_thunk_band(pc)
+    && let Some((module, ordinal, name)) = thunk_map.get(&pc)
+{
let module = *module;
let ordinal_u32 = *ordinal as u32;
let thunk_pc = pc;
@@ -2854,6 +2870,10 @@ fn run_execution(
// re-decoding the same handful of pages 60×/s.
let mut isr_decode_cache = xenia_cpu::decoder::DecodeCache::new();
+// Tier-A perf #2: reusable buffer for `round_schedule_into` so the round
+// loop doesn't heap-allocate a `Vec<u8>` every iteration.
+let mut order_buf = [0u8; xenia_cpu::scheduler::HW_THREAD_COUNT];
'outer: loop {
// Per-round prologue: budget / shutdown / heartbeat / vsync /
// timers / audio-interrupt injection. Carved into
@@ -2908,10 +2928,12 @@ fn run_execution(
thunk_map,
);
-// Snapshot round schedule. `round_schedule` also advances rng state
-// when seeded; mutation is intentional.
+// Snapshot round schedule. `round_schedule_into` also advances rng
+// state when seeded; mutation is intentional. Perf (Tier-A #2): fill
+// a reusable stack array instead of allocating a fresh Vec per round.
kernel.scheduler.begin_round();
-let order = kernel.scheduler.round_schedule();
+let order_n = kernel.scheduler.round_schedule_into(&mut order_buf);
+let order = &order_buf[..order_n];
if order.is_empty() {
// No Ready threads — advance time to the earliest pending
@@ -2933,7 +2955,7 @@ fn run_execution(
// GPU when block dispatch engages.
let instrs_at_round_start = stats.instruction_count;
-for hw_id in order {
+for &hw_id in order {
let wc = &mut workers[hw_id as usize];
match worker_prologue(
wc,