[iterate-4A] Milestone-2: XMA audio decoder + RE tooling (dispatch recorder, analyzer vtable-fix, non-perturbing probes)

Milestone-2 (intro video dat/movie/ADV.wmv) audio path + major RE tooling.

XMA AUDIO (built, working, deterministic, tested):
- APU MMIO 0x7FEA0000 + 320x64B register-mapped context array; real XMACreateContext/Release
  (xma.rs); real FFmpeg xma2 decoder XMA_CONTEXT_DATA->S16BE PCM (xma_decode.rs, xma2_codec.rs,
  ffmpeg-sys-next). Decode runs synchronously on the CPU thread (deterministic, no host thread).
- Audio-worker scheduler fix (main.rs LR_HALT restore + scheduler.rs): the XAudio render-callback
  worker was wrongly exited after ~2 deliveries; now survives -> guest drives XMA decode (70 kicks).
- XAudioSubmitRenderDriverFrame made faithful. Golden sylpheed_n50m re-baselined; tests pass.
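
Illustration of the context-array addressing described above. This is a minimal sketch, not the emulator's code: the constants follow the 320 x 64B figure in this message, while `context_array_size`, `context_address`, `masked` and `PHYSICAL_MASK` are hypothetical names chosen for the example. It shows why a `& 0x1FFFFFFF` physical-style mask stops agreeing with the VA once the array lives in a `0x40000000`-region allocation.

```rust
// Sketch only: names are illustrative, not the emulator's API.
const XMA_CONTEXT_COUNT: u32 = 320;
const XMA_CONTEXT_SIZE: u32 = 64;
const PHYSICAL_MASK: u32 = 0x1FFF_FFFF;

/// Total size of the register-mapped context array in bytes.
fn context_array_size() -> u32 {
    XMA_CONTEXT_COUNT * XMA_CONTEXT_SIZE
}

/// Address of context `index` for a given array base, or `None` when the
/// index is out of range or the addition would overflow.
fn context_address(base: u32, index: u32) -> Option<u32> {
    if index >= XMA_CONTEXT_COUNT {
        return None;
    }
    base.checked_add(index * XMA_CONTEXT_SIZE)
}

/// What a physical-style mask does to an address.
fn masked(addr: u32) -> u32 {
    addr & PHYSICAL_MASK
}
```

For a base below `0x2000_0000` the masked and unmasked addresses coincide; for a base in the `0x40000000` region they differ, which is why the base advertised to the guest has to be the VA itself in a flat model.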

RE TOOLING:
- Runtime indirect-dispatch recorder (dispatch_rec.rs): records (call-site->target, r3, lr);
  env-gated XENIA_DISPATCH_REC, filters XENIA_DISPATCH_REC_TARGETS/_SITES; deterministic, observe-only.
- Repaired static analyzer (vtables.rs): vtable extraction silently fragmented vtables with
  non-function head slots (missed the XMV engine vtable). Fixed via vptr-write-anchoring -> engine
  fully typed (vtables 722->1150 on rebuild).
- Fixed probe HEISENBUG (main.rs run_superblock): --audit-pc-probe-hex/--mem-watch no longer disable
  superblock chaining; probes fire inside the chain loop -> scheduling identical armed-vs-unarmed,
  movie subsystem now observable. Fixed a --quiet bug swallowing armed trace reports.
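
A rough sketch of what an observe-only dispatch recorder of this kind can look like. It is not the contents of dispatch_rec.rs: the `DispatchRecorder` and `Edge` types, the choice to keep `r3`/`lr` from the first hit only, and the hit counter are all assumptions made for the example. The recorder is keyed by (call-site, target) in a `BTreeMap` so the dump order does not depend on insertion order.

```rust
use std::collections::BTreeMap;

/// One observed indirect-call edge. Keeping `r3`/`lr` from the first hit
/// only is an assumption of this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Edge {
    r3: u32,
    lr: u32,
    hits: u64,
}

struct DispatchRecorder {
    enabled: bool,
    /// Optional allow-list of targets (hypothetical stand-in for a filter).
    target_filter: Option<Vec<u32>>,
    edges: BTreeMap<(u32, u32), Edge>,
}

impl DispatchRecorder {
    fn new(enabled: bool, target_filter: Option<Vec<u32>>) -> Self {
        Self { enabled, target_filter, edges: BTreeMap::new() }
    }

    /// Record one indirect call. Does nothing when disabled or filtered out,
    /// and never touches guest state.
    fn record(&mut self, site: u32, target: u32, r3: u32, lr: u32) {
        if !self.enabled {
            return;
        }
        if let Some(filter) = &self.target_filter {
            if !filter.contains(&target) {
                return;
            }
        }
        self.edges
            .entry((site, target))
            .and_modify(|e| e.hits += 1)
            .or_insert(Edge { r3, lr, hits: 1 });
    }

    /// Render the table sorted by (site, target).
    fn dump(&self) -> Vec<String> {
        self.edges
            .iter()
            .map(|(&(site, target), e)| {
                format!(
                    "{site:#010x} -> {target:#010x} r3={:#010x} lr={:#010x} hits={}",
                    e.r3, e.lr, e.hits
                )
            })
            .collect()
    }
}
```

Sorting by key rather than by arrival order is one way to keep two runs of the same guest comparable with a plain text diff.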

VIDEO still doesn't play (B, guest-side): the XMV engine never issues begin-playback (sub_825076F0,
vtable 0x8200a1e8 slot21) -> never primes -> 2000ms timeout. Narrowed to the ARM2 engine-setup
wrappers; no honest our-side gate-fix (masking forbidden). See HANDOFF-iterate-4A-milestone2.md for
new-machine setup (incl. the FFmpeg apt deps + sylpheed.db regeneration) and continuation pointers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
MechaCat02
2026-06-21 21:38:19 +02:00
parent acb29db444
commit 23189b95af
19 changed files with 3106 additions and 46 deletions

@@ -415,6 +415,18 @@ fn main() -> Result<()> {
// metrics summary.
let _obs = observability::init(&config)?;
// Env-gated indirect-dispatch recorder (off by default). Resolve the env
// once here; a scope guard dumps the recorded (call_site -> target) table
// at end-of-run no matter how the run terminates.
xenia_cpu::dispatch_rec::install();
struct DispatchRecGuard;
impl Drop for DispatchRecGuard {
fn drop(&mut self) {
xenia_cpu::dispatch_rec::dump();
}
}
let _dispatch_rec_guard = DispatchRecGuard;
let result = match cli.command {
Commands::Disasm { path, count, at } => cmd_disasm(&path, count, at),
Commands::Exec {
@@ -1437,6 +1449,45 @@ fn cmd_exec_inner(
// atoms that live inside `kernel.gpu.mmio`.
mem.add_mmio_region(xenia_gpu::build_mmio_region(kernel.gpu.mmio()));
// apu stage 1 — reserve the 320-entry XMA context array and install the
// `0x7FEA0000` register aperture (mirrors canary's `XmaDecoder::Setup`).
//
// Physical placement: canary stores a *physical* address in
// `ContextArrayAddress` (reg 0x600) — `PhysicalHeap::GetPhysicalAddress`
// returns `va - heap_base` (== `va & 0x1FFFFFFF` for the physical heaps).
// Our memory model is FLAT: `translate_virtual` is a raw `membase + addr`
// with no separate physical-window mirror, and `translate_physical` masks
// `& 0x1FFFFFFF` — so the two only coincide for low (`< 0x2000_0000`) VAs.
// `heap_alloc` returns a `0x40000000`-region VA, so `va & 0x1FFFFFFF` would
// be 0 (disagreeing with the context pointers `XMACreateContext` hands out
// at `va + i*64`). The guest reads `ContextArrayAddress` and indexes it as
// `base + i*64`; for that to equal the pointers it dereferences, the base
// MUST equal the VA. So we advertise `va` itself — self-consistent in the
// flat model (the guest reaches every context through the same VA space).
// Stage 3's decoder will read the context structs via this VA directly
// (not via `translate_physical`). The 20480-byte buffer is page-committed
// by `heap_alloc`, so the guest never faults writing the 64-byte structs.
{
let array_size =
(xenia_apu::XMA_CONTEXT_COUNT as u32) * xenia_apu::XMA_CONTEXT_SIZE; // 320 * 64
match kernel.heap_alloc(array_size, &mem) {
Some(va) => {
let phys = va; // flat model: array base == VA (see note above)
kernel.xma.lock().unwrap().init(va, phys);
mem.add_mmio_region(xenia_apu::build_mmio_region(kernel.xma.clone()));
tracing::info!(
va = format_args!("{va:#010x}"),
phys = format_args!("{phys:#010x}"),
size = format_args!("{array_size:#x}"),
"xma: context array reserved + 0x7FEA0000 aperture installed"
);
}
None => {
tracing::error!("xma: failed to reserve context array (heap exhausted)");
}
}
}
// Install the initial guest thread on HW slot 0. The thread handle we
// hand the scheduler isn't visible to any guest API yet, but joiners
// (XThreadWait-style) will see it via `find_by_tid`.
@@ -2354,6 +2405,14 @@ fn coord_post_round(
let _ = gpu_runs;
}
// APU stage 3 — pump the XMA decoder on the CPU thread, same cadence as the
// inline GPU. Deterministic (no host thread / clock): for each context with
// a pending kick it runs one Work() pass, decoding the guest's XMA packets
// into PCM and writing it back into the output ring + context struct.
if let Ok(mut xma) = kernel.xma.try_lock() {
xma.decode_pending(mem);
}
if kernel.gpu.has_pending_interrupts() {
for pi in kernel.gpu.take_pending_interrupts() {
// Canary `ExecutePacketType3_INTERRUPT` dispatches the callback
@@ -2445,7 +2504,7 @@ fn worker_prologue(
stats: &mut ExecStats,
) -> PrologueOutcome {
use xenia_cpu::interpreter::{step_cached, StepResult};
-use xenia_cpu::scheduler::{HwState, INITIAL_GUEST_TID};
+use xenia_cpu::scheduler::{BlockReason, HwState, INITIAL_GUEST_TID};
use xenia_cpu::PpcOpcode;
const LR_HALT: u32 = xenia_cpu::context::LR_HALT_SENTINEL as u32;
@@ -2492,12 +2551,26 @@ fn worker_prologue(
// 1) Halt-sentinel check (per HW thread).
if pc == LR_HALT {
// iterate-4A: the async audio-callback injection (`try_inject_audio_callback`)
// sets `interrupts.saved`/`injected_ref` to the dedicated audio
// worker and runs REAL guest code (`sub_824D29F0`, which calls
// blocking kernel APIs) across MANY scheduler rounds before
// returning to `LR_HALT_SENTINEL`. The restore must fire only when
// the thread that *actually* reached the sentinel is the injected
// worker itself — i.e. the FULL `ThreadRef` (hw_id AND idx), which
// `scheduler.current` holds after `begin_slot_visit`. Matching on
// `hw_id` alone let ANY OTHER thread sharing that HW slot reach
// `LR_HALT` and consume the audio worker's `saved` slot; when the
// worker later truly returned, `saved` was already `None`, the
// guard failed, and control fell through to "marking exited" — the
// worker was removed and every subsequent audio callback dropped
// (`find_by_handle` skips Exited threads). The graphics ISR path is
// fully synchronous (`dispatch_graphics_interrupts` restores inline
// and never leaves `interrupts.saved` set across rounds), so this
// restore lifecycle is exclusive to audio and graphics is
// unaffected.
let injected_here = kernel.interrupts.saved.is_some()
-&& kernel
-.interrupts
-.injected_ref
-.map(|r| r.hw_id == hw_id)
-== Some(true);
+&& kernel.interrupts.injected_ref == kernel.scheduler.current;
if injected_here
&& let Some(saved) = kernel.interrupts.saved.take()
{
@@ -2509,17 +2582,64 @@ fn worker_prologue(
kernel.interrupts.delivered += 1;
let source = saved.source;
let mut restore_outcome = "ready";
-let current = kernel.scheduler.thread(target_ref).state.clone();
-if let HwState::ServicingIrq(reason) = current {
-kernel.scheduler.thread_mut(target_ref).state =
-HwState::Blocked(reason);
-restore_outcome = "reblocked";
// iterate-4A: the dedicated audio worker's canonical resting
// state is "parked on its synthetic handle, awaiting the next
// callback injection". The callback (`sub_824D29F0`) runs real
// guest code that can be flipped `ServicingIrq -> Ready` by an
// intervening `wake_ref` (a `KeSetEvent`/timeout targeting the
// worker as a waiter mid-callback). The old re-block heuristic
// only re-parked when the state was *still* `ServicingIrq`, so
// such a wake left the worker `Ready` — it then ran its thread
// entry to the `LR_HALT` sentinel, EXITED, and every subsequent
// callback dropped (`find_by_handle` skips Exited workers),
// wedging the intro-video audio→XMA pipeline. When this restore
// is an audio callback (`source == INTERRUPT_SOURCE_AUDIO`),
// re-park the worker UNCONDITIONALLY onto its synthetic
// park-handle so it survives to receive the next fire. (Graphics
// restores keep the `ServicingIrq`-only re-block: a graphics
// victim is a borrowed real thread, not a parked worker, and the
// old behavior there must stay byte-identical.)
if source == xenia_kernel::INTERRUPT_SOURCE_AUDIO {
let worker_handle =
kernel.scheduler.thread(target_ref).thread_handle;
let index = worker_handle.and_then(|h| {
kernel
.xaudio
.worker_handles
.iter()
.position(|wh| *wh == Some(h))
});
if let Some(index) = index {
let park = xenia_kernel::xaudio::synthetic_park_handle(index);
kernel.scheduler.thread_mut(target_ref).state =
HwState::Blocked(BlockReason::WaitAny {
handles: vec![park],
deadline: None,
});
restore_outcome = "reparked";
} else if let HwState::ServicingIrq(reason) =
kernel.scheduler.thread(target_ref).state.clone()
{
// Fallback (handle unresolved): preserve the legacy
// ServicingIrq-only re-block rather than leak the worker.
kernel.scheduler.thread_mut(target_ref).state =
HwState::Blocked(reason);
restore_outcome = "reblocked";
}
} else {
let current = kernel.scheduler.thread(target_ref).state.clone();
if let HwState::ServicingIrq(reason) = current {
kernel.scheduler.thread_mut(target_ref).state =
HwState::Blocked(reason);
restore_outcome = "reblocked";
}
}
tracing::debug!(
source,
hw_id,
outcome = restore_outcome,
-"graphics interrupt: callback returned"
+"interrupt: callback returned"
);
return PrologueOutcome::Continue;
}
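
The difference between the old slot-only match and the new full-reference match in the halt-sentinel restore can be sketched in isolation. The `ThreadRef` layout below (two `u8` fields) and both helper functions are hypothetical; only the field names `hw_id` and `idx` come from the comment in the hunk above.

```rust
/// Hypothetical stand-in for the scheduler's thread reference.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ThreadRef {
    hw_id: u8,
    idx: u8,
}

/// Old behaviour: any thread on the same HW slot counts as the injected one.
fn matches_by_slot(injected: Option<ThreadRef>, current: Option<ThreadRef>) -> bool {
    match (injected, current) {
        (Some(i), Some(c)) => i.hw_id == c.hw_id,
        _ => false,
    }
}

/// New behaviour: only the exact thread (hw_id AND idx) counts. The
/// `is_some` guard is an assumption of this sketch so that two `None`
/// values never compare as a match.
fn matches_exact(injected: Option<ThreadRef>, current: Option<ThreadRef>) -> bool {
    injected.is_some() && injected == current
}
```

With a worker at `{ hw_id: 2, idx: 1 }` and an unrelated thread at `{ hw_id: 2, idx: 0 }`, the slot-only check accepts the unrelated thread while the exact check rejects it.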
@@ -2905,12 +3025,55 @@ fn run_superblock(
let budget = superblock_budget();
// Probe / mem-watch / debugger-hook modes need per-block-entry
// observability; in those modes never chain (run exactly one block,
// identical to the pre-superblock behaviour). The block-cache fast
// path is only entered when hooks/DB are off anyway, but a probe or
// mem-watch can be armed alongside it.
let chain_allowed = !kernel.any_probe_active() && !mem.has_mem_watch();
// Heisenbug fix (toolkit audit, 2026-06-21): probes and mem-watch are
// OBSERVE-ONLY diagnostics and must NOT change guest scheduling. The
// previous implementation disabled superblock chaining whenever any
// probe / mem-watch was armed (so the per-block-entry observation in
// `worker_prologue` was reached for every block). But chaining is what
// determines thread interleaving, so arming a probe perturbed the
// schedule — it starved the movie/XMV subsystem so it never reached the
// video state, making the probe useless on exactly the code we most
// needed to observe (`XENIA_SUPERBLOCK_BUDGET=1` reproduces the same
// starvation, confirming chaining is the lever).
//
// The fix fires the SAME per-block-entry observation INSIDE the chain
// loop, at every chained block's entry PC (see `fire_block_entry_probes`
// below), so chaining — and therefore scheduling — is byte-identical
// whether or not a probe is armed. `chain_allowed` no longer depends on
// the probe/mem-watch state.
//
// `wants_hooks()` (the interactive debugger / breakpoint path) still
// forces the per-instruction path in `worker_prologue` and never reaches
// `run_superblock`, so the only remaining reason to never chain here is
// the explicit budget==1 reproduction request.
let chain_allowed = budget > 1;
// Per-block-entry diagnostic observation, replicating exactly what
// `worker_prologue` does at the first block of a slot visit:
// 1. the four `fire_*_if_match` probe helpers (read-only; each
// re-checks its own armed set against the live ctx PC), and
// 2. the mem-watch writer-context publish, so a watched store that
// fires mid-block is attributed to the CORRECT chained block's
// entry PC / LR (matching the single-block reporting granularity)
// instead of the stale superblock-entry PC.
// The closure is a pure function of the live scheduler context; the
// caller must ensure `ctx.pc` equals the block-entry PC before calling.
let probe_hw_id = wc.hw_id;
let fire_block_entry_probes =
|kernel: &mut xenia_kernel::KernelState, mem: &xenia_memory::GuestMemory| {
let hw_id = probe_hw_id;
if kernel.any_probe_active() {
kernel.fire_ctor_probe_if_match(hw_id, mem);
kernel.fire_branch_probe_if_match(hw_id);
kernel.fire_audit_pc_probe_if_match(hw_id, mem);
kernel.fire_lr_trace_if_match(hw_id);
}
if mem.has_mem_watch() {
let ctx = kernel.scheduler.ctx(hw_id);
let tid_w = kernel.scheduler.tid(hw_id).unwrap_or(0);
xenia_memory::set_writer_ctx(tid_w, ctx.pc, ctx.lr as u32);
}
};
let mut block_ptr = first_block_ptr;
let mut pc_before = first_pc_before;
@@ -2955,11 +3118,20 @@ fn run_superblock(
break (result, block_ptr, pc_before);
}
-// Chain: build/fetch the next block. Re-borrows `wc.block_cache`,
-// which invalidates the previous `block_ptr` — but we've already
-// finished using it (only `sync_sensitive`/diagnostics were read,
-// above), so the raw-pointer aliasing rule is respected.
// Chain into the next block. `ctx.pc` now equals `next_pc` (the
// chained block's entry), so fire the per-block-entry observation
// BEFORE stepping it — identical to what `worker_prologue` did at
// the first block. This keeps the probe firing at EVERY armed
// block-entry while leaving the chaining decision (and thus the
// schedule) untouched. The first block was already observed by the
// prologue, so we only observe the newly-chained blocks here.
pc_before = next_pc;
fire_block_entry_probes(kernel, mem);
// Build/fetch the next block. Re-borrows `wc.block_cache`, which
// invalidates the previous `block_ptr` — but we've already finished
// using it (only `sync_sensitive`/diagnostics were read, above), so
// the raw-pointer aliasing rule is respected.
block_ptr = wc.block_cache.lookup_or_build(next_pc, mem) as *const _;
};
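
The idea behind the probe fix, observing at every chained block entry without letting the observer influence the chaining decision, can be shown with a toy chain loop. Everything here is a simplified assumption: `run_chain`, the `next` successor function and the `observe` callback are hypothetical, and unlike the real code this sketch also observes the first block inside the loop.

```rust
/// Runs up to `budget` chained blocks starting at `entry`. `next` yields the
/// successor block's entry PC (or `None` to stop); `observe` is called at
/// each block entry and cannot affect which blocks are visited.
fn run_chain<N, O>(entry: u32, budget: usize, next: N, mut observe: O) -> Vec<u32>
where
    N: Fn(u32) -> Option<u32>,
    O: FnMut(u32),
{
    let mut visited = Vec::new();
    let mut pc = entry;
    for _ in 0..budget {
        // Observation happens before the block is "stepped".
        observe(pc);
        visited.push(pc);
        match next(pc) {
            Some(n) => pc = n,
            None => break,
        }
    }
    visited
}
```

Because the chaining decision depends only on `budget` and `next`, a run with a no-op observer and a run with an armed one visit the same sequence of blocks.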
@@ -2993,6 +3165,15 @@ fn run_execution(
let mut stats = ExecStats::default();
let _ = quiet; // retained for future per-kind suppression
// APU stage 3 — give the XMA decoder a stable pointer to the guest memory
// mapping `run_execution` runs against, so the kick MMIO write can run
// Work() synchronously (canary `!use_dedicated_xma_thread` semantics: the
// game observes the updated context the instant its kick store retires).
// `mem` outlives this call for both the headless and UI paths.
if let Ok(mut xma) = kernel.xma.lock() {
xma.set_memory(mem);
}
// `--halt-on-deadlock` CLI flag OR `XENIA_HALT_ON_DEADLOCK=1|true` env var:
// when the scheduler next hits a hard deadlock (every live HW thread
// blocked on a handle wait with no pending timer) we bail out with a
@@ -4093,10 +4274,18 @@ fn dump_thread_diagnostic(
),
}
}
-if quiet {
-return;
-}
use xenia_kernel::objects::KernelObject;
// Toolkit-audit fix (2026-06-21): only the ALWAYS-ON thread/waiter table
// is suppressed by `--quiet`. The explicitly-armed diagnostics below
// (`--trace-handles`, `--trace-handles-focus`, `--dump-addr`) are
// requested output — arming the flag IS the user asking for it — and
// were previously swallowed by the blanket `if quiet { return; }`, which
// made the documented headless `--quiet` invocation silently drop every
// handle/focus/dump report. They are each self-gated below (on
// `audit.enabled` / `!audit.focus.is_empty()` / `!dump_addrs.is_empty()`)
// so they only print when actually armed.
if !quiet {
println!("\n=== Thread diagnostics ===");
for (hw_id, slot) in kernel.scheduler.slots.iter().enumerate() {
if slot.runqueue.is_empty() {
@@ -4193,6 +4382,7 @@ fn dump_thread_diagnostic(
println!(" cs={:#010x} waiters(tid)={:?}", cs_ptr, tids);
}
}
} // end `if !quiet` (always-on thread/waiter table)
// Audit trails (only when --trace-handles flipped the flag). For each
// tracked handle, emit a compact block: kind, creator, and the bounded
@@ -4868,8 +5058,23 @@ fn cmd_dis(
// pointer-validity oracle; runs over .rdata + .data.
let function_starts: std::collections::BTreeSet<u32> =
func_analysis.functions.keys().copied().collect();
-let vtables = xenia_analysis::vtables::analyze(
-&pe_image, base, &sections, &function_starts,
// Anchor discovery: recover vtable bases from constructor vptr-write
// stores so a vtable with non-function head words (null / pure-virtual /
// unrecognised thunk slots) isn't fragmented away by the contiguity
// heuristic. (Fixes e.g. the XMV engine vtable 0x8200a908.)
let vptr_anchor_funcs: std::collections::BTreeMap<u32, (u32, bool)> = func_analysis
.functions
.iter()
.map(|(&s, fi)| (s, (fi.end, fi.is_saverestore)))
.collect();
let vptr_block_boundaries: std::collections::HashSet<u32> =
xref_result.labels.keys().copied().collect();
let vtable_anchors = xenia_analysis::vtables::scan_vptr_write_constants(
&pe_image, base, &vptr_anchor_funcs, &sections, &vptr_block_boundaries,
);
info!(vtable_anchors = vtable_anchors.len(), "vptr-write anchor scan complete");
let vtables = xenia_analysis::vtables::analyze_with_anchors(
&pe_image, base, &sections, &function_starts, &vtable_anchors,
);
let rtti_count = vtables.iter().filter(|v| v.rtti_present).count();
info!(

@@ -1,9 +1,9 @@
{
-"instructions": 50000110,
-"imports": 243387,
+"instructions": 50000200,
+"imports": 189264,
"unimpl": 0,
-"draws": 1279,
-"swaps": 260,
+"draws": 768,
+"swaps": 157,
"unique_render_targets": 2,
"shader_blobs_live": 6,
"texture_cache_entries": 1