[iterate-3AK] Perf quick-wins: ~21% faster boot-to-splash (22→27 MIPS)

Profile-driven low-risk optimizations attacking the ~48% per-block /
per-round host-bookkeeping tax found by the callgrind profile. Measured
on the bounded headless workload `check -n 100000000 --gpu-inline`:
baseline ~4490 ms (22.3 MIPS) -> ~3700 ms (27.0 MIPS), +21% throughput
(~18% less wall time).

Tier A (determinism-neutral; n50m golden byte-IDENTICAL, exit 0):
1. mem-watch write path: gate capture_mem_watch_old/check_mem_watch
   behind one has_mem_watch() predicted branch in write_u8/16/32/64 +
   write_bulk so the common (no-watch) store does no out-of-line call.
   check_mem_watch (4.8%) gone from the profile.
2. round-schedule alloc churn: add Scheduler::round_schedule_into, which
   fills a reusable [u8; HW_THREAD_COUNT] stack buffer, so the lockstep
   round loop no longer allocates and frees a Vec<u8> per round. Identical
   ordering/RNG-advance. __rust_alloc/__rust_dealloc gone from the profile.
3. probe-firing: hoist a single KernelState::any_probe_active() guard to
   worker_prologue so the four fire_*_if_match calls don't happen at all
   when no probe is configured (was 4x call overhead/visit). All four
   gone from the profile.
4. thunk-map hash: range-reject pc against the registered import-thunk
   address band (KernelState::pc_in_thunk_band, two int compares) before
   the thunk_map.get(&pc) HashMap lookup. hash_one (4.3%) gone.
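
The range reject in item 4 can be sketched as follows. This is a minimal
illustration, not the code from this commit: ThunkTable, register, lookup,
the band_lo/band_hi fields and the u32 payload are hypothetical names; only
pc_in_thunk_band and thunk_map appear in the message above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the thunk-related part of `KernelState`.
pub struct ThunkTable {
    thunk_map: HashMap<u32, u32>, // pc -> import ordinal (assumed payload)
    band_lo: u32,                 // lowest registered thunk address
    band_hi: u32,                 // highest registered thunk address
}

impl ThunkTable {
    pub fn new() -> Self {
        // Empty band (lo > hi): every pc is rejected until a thunk is registered.
        Self { thunk_map: HashMap::new(), band_lo: u32::MAX, band_hi: 0 }
    }

    /// Register a thunk and widen the band to cover its address.
    pub fn register(&mut self, pc: u32, ordinal: u32) {
        self.thunk_map.insert(pc, ordinal);
        self.band_lo = self.band_lo.min(pc);
        self.band_hi = self.band_hi.max(pc);
    }

    /// Two integer compares; cheap enough to run on every block entry.
    #[inline(always)]
    pub fn pc_in_thunk_band(&self, pc: u32) -> bool {
        pc >= self.band_lo && pc <= self.band_hi
    }

    /// Only hash `pc` when it can possibly be a thunk address.
    pub fn lookup(&self, pc: u32) -> Option<u32> {
        if !self.pc_in_thunk_band(pc) {
            return None;
        }
        self.thunk_map.get(&pc).copied()
    }
}
```

The band is a conservative filter: an in-band pc that is not a thunk still
falls through to the HashMap and gets None, so lookup results are unchanged.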

Tier B (#5, time-granularity change — LANDED, no re-baseline needed):
5. update_timestamp_bundle: throttle to a 0.25 ms quantum (only re-write
   the KeTimeStampBundle when the deterministic clock advanced >= 2500
   units). Inclusive cost 8.65% -> 1.08%. The quantum is far below the
   1 ms granularity any guest deadline math needs (tick_count stays
   fresh; the hub gate is +66 ms; the fade-in is vsync-counter driven per
   3AH, not this bundle). VERIFIED: n50m stable digest BYTE-IDENTICAL to
   the existing golden (so no re-baseline), 150M boot reaches the splash
   (draws=7415, swaps=2172, gpu.texture.decode{K8888}=448, RTs=2 — all
   match the post-3AJ baseline), 688 tests green, release n50m oracle ok.
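
A minimal sketch of the throttle in item 5, assuming the deterministic clock
counts 100 ns units (which is what 2500 units = 0.25 ms implies).
TimestampThrottle, should_update and TIMESTAMP_QUANTUM_UNITS are hypothetical
names; the real update_timestamp_bundle is not shown here.

```rust
/// 2500 units of an assumed 100 ns clock = 0.25 ms.
const TIMESTAMP_QUANTUM_UNITS: u64 = 2_500;

/// Remembers the clock value at which the bundle was last rewritten.
pub struct TimestampThrottle {
    last_written: Option<u64>,
}

impl TimestampThrottle {
    pub fn new() -> Self {
        Self { last_written: None }
    }

    /// Returns true when the caller should rewrite the bundle at `now`.
    /// The first call always updates; later calls update only once the
    /// clock has advanced by at least one quantum since the last write.
    pub fn should_update(&mut self, now: u64) -> bool {
        match self.last_written {
            // wrapping_sub avoids a debug-build overflow panic if `now`
            // were ever below `prev`; that case is treated as "update".
            Some(prev) if now.wrapping_sub(prev) < TIMESTAMP_QUANTUM_UNITS => false,
            _ => {
                self.last_written = Some(now);
                true
            }
        }
    }
}
```

With this shape the guest-visible bundle lags the clock by less than one
quantum (under 0.25 ms).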

Remaining headroom: interpreter::execute (13%), decrement_quantum (8%),
step_block (7%) are now the top self-costs — the structural superblock/
JIT lever is the next step for the larger gain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-19 22:05:53 +02:00
parent 9d24dd0eaa
commit dc1320cd4b
4 changed files with 156 additions and 26 deletions


@@ -357,7 +357,8 @@ impl GuestMemory {
/// from `GuestMemory` without a wider plumbing change).
pub fn write_bulk(&self, addr: u32, buf: &[u8]) {
let len = buf.len() as u32;
-let old_lane = self.capture_mem_watch_old(addr, len);
+let watch = self.has_mem_watch();
+let old_lane = if watch { self.capture_mem_watch_old(addr, len) } else { None };
let ptr = self.translate_virtual_mut(addr);
unsafe {
std::ptr::copy_nonoverlapping(buf.as_ptr(), ptr, buf.len());
@@ -374,7 +375,7 @@ impl GuestMemory {
// the page works.
self.bump_page_version(page * PAGE_SIZE);
}
-self.check_mem_watch(addr, len, old_lane);
+if watch { self.check_mem_watch(addr, len, old_lane); }
}
/// Check if a guest address has been allocated/committed. Acquire load
@@ -540,11 +541,16 @@ impl MemoryAccess for GuestMemory {
return;
}
if !self.is_mapped(addr) { return; }
-let old_lane = self.capture_mem_watch_old(addr, 1);
+// Perf (Tier-A #1): the mem-watch capture/report pair are out-of-line
+// calls; on the common (no-watch) path each was a real call that
+// immediately returned. Gate both behind one predicted branch so the
+// hot store does no call work unless a watch is actually armed.
+let watch = self.has_mem_watch();
+let old_lane = if watch { self.capture_mem_watch_old(addr, 1) } else { None };
let ptr = self.translate_virtual_mut(addr);
unsafe { *ptr = val };
self.bump_page_version(addr);
-self.check_mem_watch(addr, 1, old_lane);
+if watch { self.check_mem_watch(addr, 1, old_lane); }
}
fn write_u16(&self, addr: u32, val: u16) {
@@ -552,7 +558,8 @@ impl MemoryAccess for GuestMemory {
(mmio.write_callback)(addr, val as u32);
} else if !self.is_mapped(addr) {
} else {
-let old_lane = self.capture_mem_watch_old(addr, 2);
+let watch = self.has_mem_watch();
+let old_lane = if watch { self.capture_mem_watch_old(addr, 2) } else { None };
let ptr = self.translate_virtual_mut(addr);
unsafe {
std::ptr::copy_nonoverlapping(val.to_be_bytes().as_ptr(), ptr, 2);
@@ -564,7 +571,7 @@ impl MemoryAccess for GuestMemory {
if (addr & 0xFFF) >= (PAGE_SIZE - 1) {
self.bump_page_version(addr.wrapping_add(1));
}
-self.check_mem_watch(addr, 2, old_lane);
+if watch { self.check_mem_watch(addr, 2, old_lane); }
}
}
@@ -573,7 +580,8 @@ impl MemoryAccess for GuestMemory {
(mmio.write_callback)(addr, val);
} else if !self.is_mapped(addr) {
} else {
-let old_lane = self.capture_mem_watch_old(addr, 4);
+let watch = self.has_mem_watch();
+let old_lane = if watch { self.capture_mem_watch_old(addr, 4) } else { None };
let ptr = self.translate_virtual_mut(addr);
unsafe {
std::ptr::copy_nonoverlapping(val.to_be_bytes().as_ptr(), ptr, 4);
@@ -582,7 +590,7 @@ impl MemoryAccess for GuestMemory {
if (addr & 0xFFF) >= (PAGE_SIZE - 3) {
self.bump_page_version(addr.wrapping_add(3));
}
-self.check_mem_watch(addr, 4, old_lane);
+if watch { self.check_mem_watch(addr, 4, old_lane); }
}
}
@@ -592,7 +600,8 @@ impl MemoryAccess for GuestMemory {
(mmio.write_callback)(addr.wrapping_add(4), val as u32);
} else if !self.is_mapped(addr) {
} else {
-let old_lane = self.capture_mem_watch_old(addr, 8);
+let watch = self.has_mem_watch();
+let old_lane = if watch { self.capture_mem_watch_old(addr, 8) } else { None };
let ptr = self.translate_virtual_mut(addr);
unsafe {
std::ptr::copy_nonoverlapping(val.to_be_bytes().as_ptr(), ptr, 8);
@@ -601,7 +610,7 @@ impl MemoryAccess for GuestMemory {
if (addr & 0xFFF) >= (PAGE_SIZE - 7) {
self.bump_page_version(addr.wrapping_add(7));
}
-self.check_mem_watch(addr, 8, old_lane);
+if watch { self.check_mem_watch(addr, 8, old_lane); }
}
}
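
The gating pattern applied in the write_* hunks can be reduced to a
self-contained sketch. has_mem_watch is not defined in these hunks, so the
AtomicBool flag below is an assumption, and WatchedByte, arm_watch, reports
and value are hypothetical names. The one-byte "memory" stands in for
GuestMemory only to keep the example runnable.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, AtomicU8, Ordering};

/// Hypothetical one-byte "guest memory" with an optional write watch.
pub struct WatchedByte {
    watch_armed: AtomicBool, // assumed backing store for has_mem_watch()
    value: AtomicU8,
    reports: AtomicU32, // number of watch reports fired
}

impl WatchedByte {
    pub fn new() -> Self {
        Self {
            watch_armed: AtomicBool::new(false),
            value: AtomicU8::new(0),
            reports: AtomicU32::new(0),
        }
    }

    pub fn arm_watch(&self) {
        self.watch_armed.store(true, Ordering::Release);
    }

    /// Cheap inline predicate: one load and one branch on the hot path.
    #[inline(always)]
    fn has_mem_watch(&self) -> bool {
        self.watch_armed.load(Ordering::Acquire)
    }

    /// Out-of-line on purpose, to mirror the call the gate avoids.
    #[inline(never)]
    fn capture_mem_watch_old(&self) -> Option<u8> {
        Some(self.value.load(Ordering::Relaxed))
    }

    /// Out-of-line on purpose; reports only when the stored value changed.
    #[inline(never)]
    fn check_mem_watch(&self, old: Option<u8>) {
        if let Some(old) = old {
            if old != self.value.load(Ordering::Relaxed) {
                self.reports.fetch_add(1, Ordering::Relaxed);
            }
        }
    }

    /// Same shape as the gated stores: sample the flag once, then skip
    /// both helper calls when no watch is armed.
    pub fn write_u8(&self, val: u8) {
        let watch = self.has_mem_watch();
        let old = if watch { self.capture_mem_watch_old() } else { None };
        self.value.store(val, Ordering::Relaxed);
        if watch {
            self.check_mem_watch(old);
        }
    }

    pub fn reports(&self) -> u32 {
        self.reports.load(Ordering::Relaxed)
    }

    pub fn value(&self) -> u8 {
        self.value.load(Ordering::Relaxed)
    }
}
```

Sampling the flag once per store keeps the capture and the check paired: a
watch armed between the two steps cannot trigger a check with no captured
old value.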