Profile-driven, low-risk optimizations targeting the ~48% per-block /
per-round host-bookkeeping tax found by the callgrind profile. Measured
on the bounded headless workload `check -n 100000000 --gpu-inline`:
baseline ~4490 ms (22.3 MIPS) -> ~3700 ms (27.0 MIPS), +21% throughput.
Tier A (determinism-neutral; n50m golden byte-IDENTICAL, exit 0):
1. mem-watch write path: gate capture_mem_watch_old/check_mem_watch
behind one has_mem_watch() predicted branch in write_u8/16/32/64 +
write_bulk so the common (no-watch) store does no out-of-line call.
check_mem_watch (4.8%) gone from the profile.
2. round-schedule alloc churn: add Scheduler::round_schedule_into, which
fills a reusable [u8; HW_THREAD_COUNT] stack buffer; the lockstep round
loop no longer allocates and frees a Vec<u8> per round. Ordering and
RNG advance are identical. __rust_alloc/__rust_dealloc gone from the
profile.
3. probe-firing: hoist a single KernelState::any_probe_active() guard to
worker_prologue so the four fire_*_if_match calls don't happen at all
when no probe is configured (was 4x call overhead/visit). All four
gone from the profile.
4. thunk-map hash: range-reject pc against the registered import-thunk
address band (KernelState::pc_in_thunk_band, two int compares) before
the thunk_map.get(&pc) HashMap lookup. hash_one (4.3%) gone.
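Illustrative sketch of the Tier A patterns in items 2 and 4 (not the
actual code in this change): a cheap range check in front of the HashMap
lookup, and a schedule written into a caller-owned stack buffer instead
of a fresh Vec. Assumptions: the u32 pc type, the thunk_lo/thunk_hi
fields, register_thunk, lookup_thunk, the HW_THREAD_COUNT value, and the
LCG shuffle are all hypothetical stand-ins; only pc_in_thunk_band,
thunk_map, and round_schedule_into are names from this commit.

```rust
use std::collections::HashMap;

// Assumed value, for illustration only.
const HW_THREAD_COUNT: usize = 6;

struct KernelState {
    thunk_map: HashMap<u32, u32>, // pc -> thunk id (value type is assumed)
    thunk_lo: u32,                // lowest registered thunk address (hypothetical field)
    thunk_hi: u32,                // highest registered thunk address, inclusive (hypothetical field)
}

impl KernelState {
    fn new() -> Self {
        // lo > hi encodes an empty band, so every pc is rejected.
        KernelState { thunk_map: HashMap::new(), thunk_lo: u32::MAX, thunk_hi: 0 }
    }

    // Hypothetical helper: widen the band as thunks are registered.
    fn register_thunk(&mut self, pc: u32, id: u32) {
        self.thunk_lo = self.thunk_lo.min(pc);
        self.thunk_hi = self.thunk_hi.max(pc);
        self.thunk_map.insert(pc, id);
    }

    // Two integer compares; no hashing.
    #[inline]
    fn pc_in_thunk_band(&self, pc: u32) -> bool {
        pc >= self.thunk_lo && pc <= self.thunk_hi
    }

    // Hypothetical wrapper: only pcs inside the band pay for the hash lookup.
    fn lookup_thunk(&self, pc: u32) -> Option<u32> {
        if !self.pc_in_thunk_band(pc) {
            return None;
        }
        self.thunk_map.get(&pc).copied()
    }
}

struct Scheduler {
    rng: u32, // toy RNG state, standing in for the real scheduler RNG
}

impl Scheduler {
    // Fills the caller's buffer with a permutation of thread ids.
    // The shuffle is a placeholder; the point is that nothing is heap-allocated.
    fn round_schedule_into(&mut self, out: &mut [u8; HW_THREAD_COUNT]) {
        for (i, slot) in out.iter_mut().enumerate() {
            *slot = i as u8;
        }
        for i in (1..HW_THREAD_COUNT).rev() {
            self.rng = self.rng.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            let j = (self.rng >> 16) as usize % (i + 1);
            out.swap(i, j);
        }
    }
}
```

The round loop would declare one `[0u8; HW_THREAD_COUNT]` buffer outside
the loop and pass `&mut` to it each round. The band check can still let
through a pc that lies between two thunks; the HashMap lookup then
returns None, so the result matches the unguarded lookup.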
Tier B (#5, time-granularity change — LANDED, no re-baseline needed):
5. update_timestamp_bundle: throttle to a 0.25 ms quantum (only re-write
the KeTimeStampBundle when the deterministic clock advanced >= 2500
units). Inclusive cost 8.65% -> 1.08%. The quantum is far below the
1 ms granularity any guest deadline math needs (tick_count stays
fresh; the hub gate is +66 ms; the fade-in is vsync-counter driven per
3AH, not this bundle). VERIFIED: n50m stable digest BYTE-IDENTICAL to
the existing golden (so no re-baseline), 150M boot reaches the splash
(draws=7415, swaps=2172, gpu.texture.decode{K8888}=448, RTs=2 — all
match the post-3AJ baseline), 688 tests green, release n50m oracle ok.
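Illustrative sketch of the item 5 throttle (not the actual
update_timestamp_bundle): rewrite the bundle only once the clock has
advanced by at least one quantum since the last write. Assumptions: the
TimestampBundle type, its fields, and the update method are
hypothetical; the 2500-unit threshold is from this commit, and 2500
units = 0.25 ms implies a 100 ns clock unit.

```rust
// 2500 clock units per quantum; at 100 ns per unit this is 0.25 ms.
const BUNDLE_QUANTUM_UNITS: u64 = 2_500;

// Hypothetical stand-in for the state behind update_timestamp_bundle.
struct TimestampBundle {
    last_written: Option<u64>, // clock value at the last rewrite, None before the first
    writes: u64,               // number of rewrites performed, for illustration
}

impl TimestampBundle {
    fn new() -> Self {
        TimestampBundle { last_written: None, writes: 0 }
    }

    // Returns true when the bundle was rewritten on this call.
    fn update(&mut self, now: u64) -> bool {
        let due = match self.last_written {
            None => true, // the first call always writes
            // The clock is assumed monotonic; saturating_sub guards the sketch anyway.
            Some(prev) => now.saturating_sub(prev) >= BUNDLE_QUANTUM_UNITS,
        };
        if due {
            // The expensive guest-memory write would go here.
            self.last_written = Some(now);
            self.writes += 1;
        }
        due
    }
}
```

Because the decision depends only on the deterministic clock, the set of
rewrite points is the same on every run.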
Remaining headroom: interpreter::execute (13%), decrement_quantum (8%),
step_block (7%) are now the top self-costs — the structural superblock/
JIT lever is the next step for the larger gain.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>