[iterate-3AK] Perf quick-wins: ~21% faster boot-to-splash (22→27 MIPS)

Profile-driven low-risk optimizations attacking the ~48% per-block /
per-round host-bookkeeping tax found by the callgrind profile. Measured
on the bounded headless workload `check -n 100000000 --gpu-inline`:
baseline ~4490 ms (22.3 MIPS) -> ~3700 ms (27.0 MIPS), +21%.
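
As a sanity check on the figures above, the MIPS values and the gain follow directly from the instruction count and the two wall-clock times. A minimal sketch (the `mips` helper is illustrative only, not part of the codebase):

```rust
/// Million instructions per second for `instructions` retired in `elapsed_ms`.
/// Illustrative helper; not a function from the emulator.
fn mips(instructions: u64, elapsed_ms: f64) -> f64 {
    instructions as f64 / (elapsed_ms / 1000.0) / 1.0e6
}

fn main() {
    let n = 100_000_000u64; // `check -n 100000000`
    let before = mips(n, 4490.0);
    let after = mips(n, 3700.0);
    let gain_pct = (after / before - 1.0) * 100.0;
    // prints "22.3 -> 27.0 MIPS, +21%"
    println!("{before:.1} -> {after:.1} MIPS, +{gain_pct:.0}%");
}
```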

Tier A (determinism-neutral; n50m golden byte-IDENTICAL, exit 0):
1. mem-watch write path: gate capture_mem_watch_old/check_mem_watch
   behind one has_mem_watch() predicted branch in write_u8/16/32/64 +
   write_bulk so the common (no-watch) store does no out-of-line call.
   check_mem_watch (4.8%) gone from the profile.
2. round-schedule alloc churn: add Scheduler::round_schedule_into, which
   fills a reusable [u8; HW_THREAD_COUNT] stack buffer; the lockstep round
   loop no longer allocates and frees a Vec<u8> per round. Identical
   ordering/RNG-advance. __rust_alloc/__rust_dealloc gone from the profile.
3. probe-firing: hoist a single KernelState::any_probe_active() guard to
   worker_prologue so the four fire_*_if_match calls don't happen at all
   when no probe is configured (was 4x call overhead/visit). All four
   gone from the profile.
4. thunk-map hash: range-reject pc against the registered import-thunk
   address band (KernelState::pc_in_thunk_band, two int compares) before
   the thunk_map.get(&pc) HashMap lookup. hash_one (4.3%) gone.
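
The range-reject in item 4 can be sketched as below. This is a standalone illustration under assumptions, not the actual KernelState code: the `ThunkTable` type, its field names, the `u32` address width, and the way the band is grown at registration time are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the thunk-related part of the kernel state.
struct ThunkTable {
    map: HashMap<u32, &'static str>,
    band_lo: u32, // lowest registered thunk address
    band_hi: u32, // highest registered thunk address (inclusive)
}

impl ThunkTable {
    fn new() -> Self {
        // Empty band: lo > hi, so every pc is rejected until something registers.
        Self { map: HashMap::new(), band_lo: u32::MAX, band_hi: 0 }
    }

    fn register(&mut self, pc: u32, name: &'static str) {
        self.band_lo = self.band_lo.min(pc);
        self.band_hi = self.band_hi.max(pc);
        self.map.insert(pc, name);
    }

    /// Two integer compares; cheap enough to run on every block entry.
    #[inline]
    fn pc_in_band(&self, pc: u32) -> bool {
        pc >= self.band_lo && pc <= self.band_hi
    }

    /// Only pays for hashing when pc could possibly be a thunk.
    fn lookup(&self, pc: u32) -> Option<&'static str> {
        if !self.pc_in_band(pc) {
            return None;
        }
        self.map.get(&pc).copied()
    }
}

fn main() {
    let mut t = ThunkTable::new();
    t.register(0x8000_1000, "ImportA");
    t.register(0x8000_1010, "ImportB");
    assert_eq!(t.lookup(0x8200_0000), None); // outside the band: no hash
    assert_eq!(t.lookup(0x8000_1008), None); // inside the band, not a thunk
    assert_eq!(t.lookup(0x8000_1010), Some("ImportB"));
}
```

The band check is only a pre-filter: a pc inside the band still goes through the map, so results are the same as an unconditional lookup.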

Tier B (#5, time-granularity change — LANDED, no re-baseline needed):
5. update_timestamp_bundle: throttle to a 0.25 ms quantum (only re-write
   the KeTimeStampBundle when the deterministic clock advanced >= 2500
   units). Inclusive cost 8.65% -> 1.08%. The quantum is far below the
   1 ms granularity any guest deadline math needs (tick_count stays
   fresh; the hub gate is +66 ms; the fade-in is vsync-counter driven per
   3AH, not this bundle). VERIFIED: n50m stable digest BYTE-IDENTICAL to
   the existing golden (so no re-baseline), 150M boot reaches the splash
   (draws=7415, swaps=2172, gpu.texture.decode{K8888}=448, RTs=2 — all
   match the post-3AJ baseline), 688 tests green, release n50m oracle ok.
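
The throttle in item 5 amounts to remembering the clock value of the last bundle write and skipping the write until the clock has moved a full quantum. A minimal sketch, assuming a monotonic `u64` clock where 2500 units equal 0.25 ms (that is, 100 ns per unit); `TimestampThrottle` and its fields are hypothetical names, not the emulator's:

```rust
/// 2500 clock units = 0.25 ms, assuming one unit is 100 ns.
const QUANTUM_UNITS: u64 = 2_500;

/// Hypothetical helper tracking when the bundle was last rewritten.
struct TimestampThrottle {
    last_written: Option<u64>,
    writes: u64,
}

impl TimestampThrottle {
    fn new() -> Self {
        Self { last_written: None, writes: 0 }
    }

    /// Returns true when the caller should rewrite the timestamp bundle.
    fn maybe_update(&mut self, now: u64) -> bool {
        let due = match self.last_written {
            None => true, // first visit always writes
            Some(last) => now.saturating_sub(last) >= QUANTUM_UNITS,
        };
        if due {
            self.last_written = Some(now);
            self.writes += 1;
        }
        due
    }
}

fn main() {
    let mut t = TimestampThrottle::new();
    // 100 visits, 100 units apart: clock runs 0, 100, ..., 9900.
    for step in 0..100u64 {
        t.maybe_update(step * 100);
    }
    // Writes happen at 0, 2500, 5000 and 7500 only.
    assert_eq!(t.writes, 4);
}
```

Because the decision depends only on the deterministic clock, the same run always rewrites the bundle at the same points.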

Remaining headroom: interpreter::execute (13%), decrement_quantum (8%),
step_block (7%) are now the top self-costs — the structural superblock/
JIT lever is the next step for the larger gain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-19 22:05:53 +02:00
parent 9d24dd0eaa
commit dc1320cd4b
4 changed files with 156 additions and 26 deletions


@@ -795,31 +795,46 @@ impl Scheduler {
     /// the fast path — zero bits mean no slot has work and the caller
     /// falls through to `advance_to_next_wake`.
     pub fn round_schedule(&mut self) -> Vec<u8> {
+        let mut buf = [0u8; HW_THREAD_COUNT];
+        let n = self.round_schedule_into(&mut buf);
+        buf[..n].to_vec()
+    }
+    /// Allocation-free variant of [`Self::round_schedule`] (Tier-A perf #2).
+    /// Fills `buf` with the runnable slot ids and returns the count `n`; the
+    /// valid range is `buf[..n]`. The hot scheduler loop (lockstep +
+    /// parallel) calls this with a reusable stack array so it does not
+    /// `__rust_alloc`/`__rust_dealloc` a fresh `Vec` every round (~7 instr
+    /// apart at boot-to-splash → millions of churned allocations). Identical
+    /// ordering / RNG-advance semantics to `round_schedule`, so the schedule
+    /// — and thus the lockstep digest — is byte-for-byte unchanged.
+    pub fn round_schedule_into(&mut self, buf: &mut [u8; HW_THREAD_COUNT]) -> usize {
         if self.non_empty_runnable == 0 {
-            return Vec::new();
+            return 0;
         }
         let start = self.rotation_cursor as usize;
-        let mut out: Vec<u8> = Vec::with_capacity(HW_THREAD_COUNT);
+        let mut n = 0usize;
         for off in 0..HW_THREAD_COUNT {
             let i = (start + off) % HW_THREAD_COUNT;
             if self.non_empty_runnable & (1 << i) != 0 {
-                out.push(i as u8);
+                buf[n] = i as u8;
+                n += 1;
             }
         }
         // Seeded mode layers a deterministic shuffle on top of the
         // already-filtered list. Same spawn/wake sequence + same seed ⇒
         // same schedule (invariant preserved from pre-Axis-1).
         if let OrderMode::Seeded { .. } = self.order {
-            for i in (1..out.len()).rev() {
+            for i in (1..n).rev() {
                 self.rng_state ^= self.rng_state << 13;
                 self.rng_state ^= self.rng_state >> 7;
                 self.rng_state ^= self.rng_state << 17;
                 let j = (self.rng_state as usize) % (i + 1);
-                out.swap(i, j);
+                buf.swap(i, j);
             }
         }
         self.rotation_cursor = ((start + 1) % HW_THREAD_COUNT) as u8;
-        out
+        n
     }
     pub fn begin_round(&mut self) {