[iterate-3AK] Perf quick-wins: ~21% faster boot-to-splash (22→27 MIPS)
Profile-driven, low-risk optimizations targeting the ~48% per-block / per-round host-bookkeeping tax found by the callgrind profile. Measured on the bounded headless workload `check -n 100000000 --gpu-inline`: baseline ~4490 ms (22.3 MIPS) -> ~3700 ms (27.0 MIPS), +21%.

Tier A (determinism-neutral; n50m golden byte-IDENTICAL, exit 0):

1. mem-watch write path: gate capture_mem_watch_old/check_mem_watch behind one predicted has_mem_watch() branch in write_u8/16/32/64 + write_bulk, so the common (no-watch) store makes no out-of-line call. check_mem_watch (4.8%) is gone from the profile.
2. round-schedule alloc churn: add Scheduler::round_schedule_into, which fills a reusable [u8; HW_THREAD_COUNT] stack buffer; the lockstep round loop no longer allocates and frees (__rust_alloc/__rust_dealloc) a Vec<u8> per round. Identical ordering/RNG-advance. __rust_alloc/__rust_dealloc are gone from the profile.
3. probe-firing: hoist a single KernelState::any_probe_active() guard into worker_prologue so the four fire_*_if_match calls are skipped entirely when no probe is configured (was 4x call overhead per visit). All four are gone from the profile.
4. thunk-map hash: range-reject pc against the registered import-thunk address band (KernelState::pc_in_thunk_band, two integer compares) before the thunk_map.get(&pc) HashMap lookup. hash_one (4.3%) is gone from the profile.

Tier B (#5, time-granularity change; LANDED, no re-baseline needed):

5. update_timestamp_bundle: throttle to a 0.25 ms quantum (only rewrite the KeTimeStampBundle when the deterministic clock has advanced >= 2500 units). Inclusive cost 8.65% -> 1.08%. The quantum is far below the 1 ms granularity that any guest deadline math needs (tick_count stays fresh; the hub gate is +66 ms; the fade-in is vsync-counter driven per 3AH, not by this bundle).

VERIFIED: n50m stable digest BYTE-IDENTICAL to the existing golden (so no re-baseline); 150M boot reaches the splash (draws=7415, swaps=2172, gpu.texture.decode{K8888}=448, RTs=2; all match the post-3AJ baseline); 688 tests green; release n50m oracle ok.

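The Tier B throttle (item 5) can be pictured with a small sketch. It assumes the deterministic clock counts in 100 ns units, which is what "2500 units = 0.25 ms" implies. `TimestampThrottle`, `should_update`, and `last_written` are hypothetical names chosen for illustration; this is not the actual `update_timestamp_bundle` code.

```rust
/// Hypothetical sketch of a quantum-throttled timestamp refresh.
/// Assumption: the clock ticks in 100 ns units, so 2500 units = 0.25 ms.
pub const QUANTUM_UNITS: u64 = 2_500;

pub struct TimestampThrottle {
    /// Clock value at the last bundle write (hypothetical field).
    last_written: u64,
}

impl TimestampThrottle {
    /// Assumes the bundle was written once at construction time `now`.
    pub fn new(now: u64) -> Self {
        Self { last_written: now }
    }

    /// Returns true (and records `now`) only when at least one full
    /// quantum has elapsed since the last write; otherwise the caller
    /// skips the write and the previous bundle contents stay in place.
    pub fn should_update(&mut self, now: u64) -> bool {
        if now.wrapping_sub(self.last_written) >= QUANTUM_UNITS {
            self.last_written = now;
            true
        } else {
            false
        }
    }
}
```

Under these assumptions the guest-visible bundle is at most one quantum stale, which is the property the 1 ms deadline-granularity argument relies on.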
Remaining headroom: interpreter::execute (13%), decrement_quantum (8%), and step_block (7%) are now the top self-costs; the structural superblock/JIT lever is the next step for the larger gain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -795,31 +795,46 @@ impl Scheduler {
     /// the fast path — zero bits mean no slot has work and the caller
     /// falls through to `advance_to_next_wake`.
     pub fn round_schedule(&mut self) -> Vec<u8> {
+        let mut buf = [0u8; HW_THREAD_COUNT];
+        let n = self.round_schedule_into(&mut buf);
+        buf[..n].to_vec()
+    }
+
+    /// Allocation-free variant of [`Self::round_schedule`] (Tier-A perf #2).
+    /// Fills `buf` with the runnable slot ids and returns the count `n`; the
+    /// valid range is `buf[..n]`. The hot scheduler loop (lockstep +
+    /// parallel) calls this with a reusable stack array so it does not
+    /// `__rust_alloc`/`__rust_dealloc` a fresh `Vec` every round (~7 instr
+    /// apart at boot-to-splash → millions of churned allocations). Identical
+    /// ordering / RNG-advance semantics to `round_schedule`, so the schedule
+    /// — and thus the lockstep digest — is byte-for-byte unchanged.
+    pub fn round_schedule_into(&mut self, buf: &mut [u8; HW_THREAD_COUNT]) -> usize {
         if self.non_empty_runnable == 0 {
-            return Vec::new();
+            return 0;
         }
         let start = self.rotation_cursor as usize;
-        let mut out: Vec<u8> = Vec::with_capacity(HW_THREAD_COUNT);
+        let mut n = 0usize;
         for off in 0..HW_THREAD_COUNT {
             let i = (start + off) % HW_THREAD_COUNT;
             if self.non_empty_runnable & (1 << i) != 0 {
-                out.push(i as u8);
+                buf[n] = i as u8;
+                n += 1;
             }
         }
         // Seeded mode layers a deterministic shuffle on top of the
         // already-filtered list. Same spawn/wake sequence + same seed ⇒
         // same schedule (invariant preserved from pre-Axis-1).
         if let OrderMode::Seeded { .. } = self.order {
-            for i in (1..out.len()).rev() {
+            for i in (1..n).rev() {
                 self.rng_state ^= self.rng_state << 13;
                 self.rng_state ^= self.rng_state >> 7;
                 self.rng_state ^= self.rng_state << 17;
                 let j = (self.rng_state as usize) % (i + 1);
-                out.swap(i, j);
+                buf.swap(i, j);
             }
         }
         self.rotation_cursor = ((start + 1) % HW_THREAD_COUNT) as u8;
-        out
+        n
     }
 
     pub fn begin_round(&mut self) {
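A hedged sketch of how a caller might use the allocation-free variant from the diff above, with one stack buffer reused across rounds. `MiniScheduler` is a hypothetical stand-in that keeps only the rotation logic (no seeded shuffle), `run_rounds` is an illustrative helper rather than the real lockstep loop, and `HW_THREAD_COUNT = 6` is an assumed value.

```rust
/// Assumed value for illustration; the real constant lives with the scheduler.
const HW_THREAD_COUNT: usize = 6;

/// Hypothetical stand-in for `Scheduler`: rotation only, no seeded shuffle.
struct MiniScheduler {
    non_empty_runnable: u8,
    rotation_cursor: u8,
}

impl MiniScheduler {
    /// Same shape as the diff's `round_schedule_into`, minus the RNG step.
    fn round_schedule_into(&mut self, buf: &mut [u8; HW_THREAD_COUNT]) -> usize {
        if self.non_empty_runnable == 0 {
            return 0;
        }
        let start = self.rotation_cursor as usize;
        let mut n = 0usize;
        for off in 0..HW_THREAD_COUNT {
            let i = (start + off) % HW_THREAD_COUNT;
            if self.non_empty_runnable & (1 << i) != 0 {
                buf[n] = i as u8;
                n += 1;
            }
        }
        self.rotation_cursor = ((start + 1) % HW_THREAD_COUNT) as u8;
        n
    }
}

/// Runs `rounds` rounds with a single buffer declared outside the loop.
/// The returned Vec exists only so the visit order can be inspected here;
/// a real hot loop would dispatch each slot instead of recording it.
fn run_rounds(sched: &mut MiniScheduler, rounds: usize) -> Vec<u8> {
    let mut buf = [0u8; HW_THREAD_COUNT]; // created once, reused every round
    let mut visited = Vec::new();
    for _ in 0..rounds {
        let n = sched.round_schedule_into(&mut buf);
        visited.extend_from_slice(&buf[..n]); // only buf[..n] is valid
    }
    visited
}
```

The point of the design is that the per-round cost no longer includes a heap allocation and free: the buffer's storage is fixed-size and lives on the caller's stack, and the returned count marks which prefix is meaningful.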