[iterate-4A] Milestone-2: XMA audio decoder + RE tooling (dispatch recorder, analyzer vtable-fix, non-perturbing probes)

Milestone-2 (intro video dat/movie/ADV.wmv) audio path + major RE tooling.

XMA AUDIO (built, working, deterministic, tested):
- APU MMIO 0x7FEA0000 + 320x64B register-mapped context array; real XMACreateContext/Release (xma.rs); real FFmpeg xma2 decoder XMA_CONTEXT_DATA->S16BE PCM (xma_decode.rs, xma2_codec.rs, ffmpeg-sys-next). Decode runs synchronously on the CPU thread (deterministic, no host thread).
- Audio-worker scheduler fix (main.rs LR_HALT restore + scheduler.rs): the XAudio render-callback worker was wrongly exited after ~2 deliveries; now survives -> guest drives XMA decode (70 kicks).
- XAudioSubmitRenderDriverFrame made faithful.

Golden sylpheed_n50m re-baselined; tests pass.

RE TOOLING:
- Runtime indirect-dispatch recorder (dispatch_rec.rs): records (call-site->target, r3, lr); env-gated XENIA_DISPATCH_REC, filters XENIA_DISPATCH_REC_TARGETS/_SITES; deterministic, observe-only.
- Repaired static analyzer (vtables.rs): vtable extraction silently fragmented vtables with non-function head slots (missed the XMV engine vtable). Fixed via vptr-write-anchoring -> engine fully typed (vtables 722->1150 on rebuild).
- Fixed probe HEISENBUG (main.rs run_superblock): --audit-pc-probe-hex/--mem-watch no longer disable superblock chaining; probes fire inside the chain loop -> scheduling identical armed-vs-unarmed, movie subsystem now observable.
- Fixed a --quiet bug swallowing armed trace reports.

VIDEO still doesn't play (B, guest-side): the XMV engine never issues begin-playback (sub_825076F0, vtable 0x8200a1e8 slot21) -> never primes -> 2000ms timeout. Narrowed to the ARM2 engine-setup wrappers; no honest our-side gate-fix (masking forbidden).

See HANDOFF-iterate-4A-milestone2.md for new-machine setup (incl. the FFmpeg apt deps + sylpheed.db regeneration) and continuation pointers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 21:38:19 +02:00
parent acb29db444
commit 23189b95af
19 changed files with 3106 additions and 46 deletions
--- a/crates/xenia-app/src/main.rs
+++ b/crates/xenia-app/src/main.rs
@@ -415,6 +415,18 @@ fn main() -> Result<()> {
    // metrics summary.
    let _obs = observability::init(&config)?;

+    // Env-gated indirect-dispatch recorder (off by default). Resolve the env
+    // once here; a scope guard dumps the recorded (call_site -> target) table
+    // when `main` returns or unwinds (not on `process::exit` or an abort).
+    xenia_cpu::dispatch_rec::install();
+    struct DispatchRecGuard;
+    impl Drop for DispatchRecGuard {
+        fn drop(&mut self) {
+            xenia_cpu::dispatch_rec::dump();
+        }
+    }
+    let _dispatch_rec_guard = DispatchRecGuard;
+
    let result = match cli.command {
        Commands::Disasm { path, count, at } => cmd_disasm(&path, count, at),
        Commands::Exec {
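
Review note on the scope guard in the hunk above: a minimal, self-contained sketch of the dump-on-drop pattern. `DumpOnDrop`, `run_with_guard` and the shared `log` are hypothetical stand-ins for illustration, not the `dispatch_rec` API. It shows that the drop handler runs on an early error return as well as on the normal path. Rust does not run destructors after `std::process::exit` or under `panic = "abort"`, so a guard like this covers returns and unwinding only.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical guard: appends a "dump" entry to a shared log when dropped.
/// (The real guard calls a recorder's dump function instead.)
struct DumpOnDrop {
    log: Rc<RefCell<Vec<String>>>,
    label: &'static str,
}

impl Drop for DumpOnDrop {
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("dump:{}", self.label));
    }
}

/// Does some "work" while the guard is alive; optionally exits early with an
/// error. The guard is dropped on both paths, after the work entry is logged.
fn run_with_guard(fail: bool, log: Rc<RefCell<Vec<String>>>) -> Result<(), String> {
    let _guard = DumpOnDrop {
        log: log.clone(),
        label: "dispatch",
    };
    log.borrow_mut().push("work".to_string());
    if fail {
        return Err("early exit".to_string());
    }
    Ok(())
}
```

Binding the guard to a named `_guard` (rather than `_`) matters: `let _ = ...` would drop the value immediately instead of at the end of the scope.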
@@ -1437,6 +1449,45 @@ fn cmd_exec_inner(
    // atoms that live inside `kernel.gpu.mmio`.
    mem.add_mmio_region(xenia_gpu::build_mmio_region(kernel.gpu.mmio()));

+    // apu stage 1 — reserve the 320-entry XMA context array and install the
+    // `0x7FEA0000` register aperture (mirrors canary's `XmaDecoder::Setup`).
+    //
+    // Physical placement: canary stores a *physical* address in
+    // `ContextArrayAddress` (reg 0x600) — `PhysicalHeap::GetPhysicalAddress`
+    // returns `va - heap_base` (== `va & 0x1FFFFFFF` for the physical heaps).
+    // Our memory model is FLAT: `translate_virtual` is a raw `membase + addr`
+    // with no separate physical-window mirror, and `translate_physical` masks
+    // `& 0x1FFFFFFF` — so the two only coincide for low (`< 0x2000_0000`) VAs.
+    // `heap_alloc` returns a `0x40000000`-region VA, so `va & 0x1FFFFFFF` drops
+    // the region bits (disagreeing with the pointers `XMACreateContext` hands out
+    // at `va + i*64`). The guest reads `ContextArrayAddress` and indexes it as
+    // `base + i*64`; for that to equal the pointers it dereferences, the base
+    // MUST equal the VA. So we advertise `va` itself — self-consistent in the
+    // flat model (the guest reaches every context through the same VA space).
+    // Stage 3's decoder will read the context structs via this VA directly
+    // (not via `translate_physical`). The 20480-byte buffer is page-committed
+    // by `heap_alloc`, so the guest never faults writing the 64-byte structs.
+    {
+        let array_size =
+            (xenia_apu::XMA_CONTEXT_COUNT as u32) * xenia_apu::XMA_CONTEXT_SIZE; // 320 * 64
+        match kernel.heap_alloc(array_size, &mem) {
+            Some(va) => {
+                let phys = va; // flat model: array base == VA (see note above)
+                kernel.xma.lock().unwrap().init(va, phys);
+                mem.add_mmio_region(xenia_apu::build_mmio_region(kernel.xma.clone()));
+                tracing::info!(
+                    va = format_args!("{va:#010x}"),
+                    phys = format_args!("{phys:#010x}"),
+                    size = format_args!("{array_size:#x}"),
+                    "xma: context array reserved + 0x7FEA0000 aperture installed"
+                );
+            }
+            None => {
+                tracing::error!("xma: failed to reserve context array (heap exhausted)");
+            }
+        }
+    }
+
    // Install the initial guest thread on HW slot 0. The thread handle we
    // hand the scheduler isn't visible to any guest API yet, but joiners
    // (XThreadWait-style) will see it via `find_by_tid`.
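
Review note on the physical-placement comment in the hunk above: the address arithmetic can be checked with a small sketch. The constants are the values stated in the diff (320 contexts, 64 bytes each, mask `0x1FFFFFFF`); the function names and the sample addresses are hypothetical, chosen only to illustrate why the advertised base has to equal the VA in a flat model.

```rust
/// Values stated in the diff: 320 contexts of 64 bytes each.
const XMA_CONTEXT_COUNT: u32 = 320;
const XMA_CONTEXT_SIZE: u32 = 64;
/// Mask the comment attributes to the physical translation.
const PHYSICAL_MASK: u32 = 0x1FFF_FFFF;

/// Total size of the context array in bytes.
fn array_size() -> u32 {
    XMA_CONTEXT_COUNT * XMA_CONTEXT_SIZE
}

/// Address of context `index` when the array starts at `base`
/// (`None` if the index is out of range or the address overflows).
fn context_ptr(base: u32, index: u32) -> Option<u32> {
    if index >= XMA_CONTEXT_COUNT {
        return None;
    }
    base.checked_add(index * XMA_CONTEXT_SIZE)
}

/// What a masked "physical" view of a virtual address would be.
fn masked(va: u32) -> u32 {
    va & PHYSICAL_MASK
}

/// True when indexing from `advertised_base` yields the same pointers the
/// guest was handed for an array that really lives at `va`.
fn base_is_consistent(advertised_base: u32, va: u32) -> bool {
    (0..XMA_CONTEXT_COUNT).all(|i| context_ptr(advertised_base, i) == context_ptr(va, i))
}
```

For a sample VA of `0x4001_0000`, `masked` yields `0x0001_0000`, so a masked base would make the guest index a different region than the one it holds pointers into; advertising the VA itself keeps both views identical. For a VA below `0x2000_0000` the mask is a no-op and the two agree.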
@@ -2354,6 +2405,14 @@ fn coord_post_round(
        let _ = gpu_runs;
    }

+    // APU stage 3 — pump the XMA decoder on the CPU thread, same cadence as the
+    // inline GPU. Deterministic (no host thread / clock): for each context with
+    // a pending kick it runs one Work() pass, decoding the guest's XMA packets
+    // into PCM and writing it back into the output ring + context struct.
+    if let Ok(mut xma) = kernel.xma.try_lock() {
+        xma.decode_pending(mem);
+    }
+
    if kernel.gpu.has_pending_interrupts() {
        for pi in kernel.gpu.take_pending_interrupts() {
            // Canary `ExecutePacketType3_INTERRUPT` dispatches the callback
@@ -2445,7 +2504,7 @@ fn worker_prologue(
    stats: &mut ExecStats,
 ) -> PrologueOutcome {
    use xenia_cpu::interpreter::{step_cached, StepResult};
-    use xenia_cpu::scheduler::{HwState, INITIAL_GUEST_TID};
+    use xenia_cpu::scheduler::{BlockReason, HwState, INITIAL_GUEST_TID};
    use xenia_cpu::PpcOpcode;
    const LR_HALT: u32 = xenia_cpu::context::LR_HALT_SENTINEL as u32;

@@ -2492,12 +2551,26 @@ fn worker_prologue(

    // 1) Halt-sentinel check (per HW thread).
    if pc == LR_HALT {
+        // iterate-4A: the async audio-callback injection (`try_inject_audio_callback`)
+        // sets `interrupts.saved`/`injected_ref` to the dedicated audio
+        // worker and runs REAL guest code (`sub_824D29F0`, which calls
+        // blocking kernel APIs) across MANY scheduler rounds before
+        // returning to `LR_HALT_SENTINEL`. The restore must fire only when
+        // the thread that *actually* reached the sentinel is the injected
+        // worker itself — i.e. the FULL `ThreadRef` (hw_id AND idx), which
+        // `scheduler.current` holds after `begin_slot_visit`. Matching on
+        // `hw_id` alone let ANY OTHER thread sharing that HW slot reach
+        // `LR_HALT` and consume the audio worker's `saved` slot; when the
+        // worker later truly returned, `saved` was already `None`, the
+        // guard failed, and control fell through to "marking exited" — the
+        // worker was removed and every subsequent audio callback dropped
+        // (`find_by_handle` skips Exited threads). The graphics ISR path is
+        // fully synchronous (`dispatch_graphics_interrupts` restores inline
+        // and never leaves `interrupts.saved` set across rounds), so this
+        // restore lifecycle is exclusive to audio and graphics is
+        // unaffected.
        let injected_here = kernel.interrupts.saved.is_some()
-            && kernel
-                .interrupts
-                .injected_ref
-                .map(|r| r.hw_id == hw_id)
-                == Some(true);
+            && kernel.interrupts.injected_ref == kernel.scheduler.current;
        if injected_here
            && let Some(saved) = kernel.interrupts.saved.take()
        {
@@ -2509,17 +2582,64 @@ fn worker_prologue(
            kernel.interrupts.delivered += 1;
            let source = saved.source;
            let mut restore_outcome = "ready";
-            let current = kernel.scheduler.thread(target_ref).state.clone();
-            if let HwState::ServicingIrq(reason) = current {
-                kernel.scheduler.thread_mut(target_ref).state =
-                    HwState::Blocked(reason);
-                restore_outcome = "reblocked";
+
+            // iterate-4A: the dedicated audio worker's canonical resting
+            // state is "parked on its synthetic handle, awaiting the next
+            // callback injection". The callback (`sub_824D29F0`) runs real
+            // guest code that can be flipped `ServicingIrq -> Ready` by an
+            // intervening `wake_ref` (a `KeSetEvent`/timeout targeting the
+            // worker as a waiter mid-callback). The old re-block heuristic
+            // only re-parked when the state was *still* `ServicingIrq`, so
+            // such a wake left the worker `Ready` — it then ran its thread
+            // entry to the `LR_HALT` sentinel, EXITED, and every subsequent
+            // callback dropped (`find_by_handle` skips Exited workers),
+            // wedging the intro-video audio→XMA pipeline. When this restore
+            // is an audio callback (`source == INTERRUPT_SOURCE_AUDIO`),
+            // re-park the worker UNCONDITIONALLY onto its synthetic
+            // park-handle so it survives to receive the next fire. (Graphics
+            // restores keep the `ServicingIrq`-only re-block: a graphics
+            // victim is a borrowed real thread, not a parked worker, and the
+            // old behavior there must stay byte-identical.)
+            if source == xenia_kernel::INTERRUPT_SOURCE_AUDIO {
+                let worker_handle =
+                    kernel.scheduler.thread(target_ref).thread_handle;
+                let index = worker_handle.and_then(|h| {
+                    kernel
+                        .xaudio
+                        .worker_handles
+                        .iter()
+                        .position(|wh| *wh == Some(h))
+                });
+                if let Some(index) = index {
+                    let park = xenia_kernel::xaudio::synthetic_park_handle(index);
+                    kernel.scheduler.thread_mut(target_ref).state =
+                        HwState::Blocked(BlockReason::WaitAny {
+                            handles: vec![park],
+                            deadline: None,
+                        });
+                    restore_outcome = "reparked";
+                } else if let HwState::ServicingIrq(reason) =
+                    kernel.scheduler.thread(target_ref).state.clone()
+                {
+                    // Fallback (handle unresolved): preserve the legacy
+                    // ServicingIrq-only re-block rather than leak the worker.
+                    kernel.scheduler.thread_mut(target_ref).state =
+                        HwState::Blocked(reason);
+                    restore_outcome = "reblocked";
+                }
+            } else {
+                let current = kernel.scheduler.thread(target_ref).state.clone();
+                if let HwState::ServicingIrq(reason) = current {
+                    kernel.scheduler.thread_mut(target_ref).state =
+                        HwState::Blocked(reason);
+                    restore_outcome = "reblocked";
+                }
            }
            tracing::debug!(
                source,
                hw_id,
                outcome = restore_outcome,
-                "graphics interrupt: callback returned"
+                "interrupt: callback returned"
            );
            return PrologueOutcome::Continue;
        }
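
Review note on the `injected_here` change in the hunk above: a minimal sketch of why matching on the HW slot alone is too loose. `ThreadRef` here is a hypothetical two-field stand-in (the real type and its field widths are not shown in this diff), and the explicit `is_some()` check is an assumption made so the sketch is safe on its own; in the diff that role is played by `interrupts.saved.is_some()`.

```rust
/// Hypothetical stand-in for the scheduler's thread reference:
/// a HW slot id plus the thread's index within that slot.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ThreadRef {
    hw_id: u8,
    idx: u16,
}

/// Old predicate: any thread on the same HW slot counts as the injected one.
fn matches_hw_only(injected: Option<ThreadRef>, current: Option<ThreadRef>) -> bool {
    match (injected, current) {
        (Some(i), Some(c)) => i.hw_id == c.hw_id,
        _ => false,
    }
}

/// New predicate: the thread that reached the sentinel must be exactly the
/// injected worker (same slot AND same index).
fn matches_full(injected: Option<ThreadRef>, current: Option<ThreadRef>) -> bool {
    injected.is_some() && injected == current
}
```

With a worker at `(hw_id 2, idx 1)` and an unrelated thread at `(hw_id 2, idx 0)`, the old predicate accepts the unrelated thread and would let it consume the saved state, while the full comparison rejects it.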
@@ -2905,12 +3025,55 @@ fn run_superblock(

    let budget = superblock_budget();

-    // Probe / mem-watch / debugger-hook modes need per-block-entry
-    // observability; in those modes never chain (run exactly one block,
-    // identical to the pre-superblock behaviour). The block-cache fast
-    // path is only entered when hooks/DB are off anyway, but a probe or
-    // mem-watch can be armed alongside it.
-    let chain_allowed = !kernel.any_probe_active() && !mem.has_mem_watch();
+    // Heisenbug fix (toolkit audit, 2026-06-21): probes and mem-watch are
+    // OBSERVE-ONLY diagnostics and must NOT change guest scheduling. The
+    // previous implementation disabled superblock chaining whenever any
+    // probe / mem-watch was armed (so the per-block-entry observation in
+    // `worker_prologue` was reached for every block). But chaining is what
+    // determines thread interleaving, so arming a probe perturbed the
+    // schedule — it starved the movie/XMV subsystem so it never reached the
+    // video state, making the probe useless on exactly the code we most
+    // needed to observe (`XENIA_SUPERBLOCK_BUDGET=1` reproduces the same
+    // starvation, confirming chaining is the lever).
+    //
+    // The fix fires the SAME per-block-entry observation INSIDE the chain
+    // loop, at every chained block's entry PC (see `fire_block_entry_probes`
+    // below), so chaining — and therefore scheduling — is byte-identical
+    // whether or not a probe is armed. `chain_allowed` no longer depends on
+    // the probe/mem-watch state.
+    //
+    // `wants_hooks()` (the interactive debugger / breakpoint path) still
+    // forces the per-instruction path in `worker_prologue` and never reaches
+    // `run_superblock`, so the only remaining reason to never chain here is
+    // the explicit budget==1 reproduction request.
+    let chain_allowed = budget > 1;
+
+    // Per-block-entry diagnostic observation, replicating exactly what
+    // `worker_prologue` does at the first block of a slot visit:
+    //   1. the four `fire_*_if_match` probe helpers (read-only; each
+    //      re-checks its own armed set against the live ctx PC), and
+    //   2. the mem-watch writer-context publish, so a watched store that
+    //      fires mid-block is attributed to the CORRECT chained block's
+    //      entry PC / LR (matching the single-block reporting granularity)
+    //      instead of the stale superblock-entry PC.
+    // The closure is a pure function of the live scheduler context; the
+    // caller must ensure `ctx.pc` equals the block-entry PC before calling.
+    let probe_hw_id = wc.hw_id;
+    let fire_block_entry_probes =
+        |kernel: &mut xenia_kernel::KernelState, mem: &xenia_memory::GuestMemory| {
+            let hw_id = probe_hw_id;
+            if kernel.any_probe_active() {
+                kernel.fire_ctor_probe_if_match(hw_id, mem);
+                kernel.fire_branch_probe_if_match(hw_id);
+                kernel.fire_audit_pc_probe_if_match(hw_id, mem);
+                kernel.fire_lr_trace_if_match(hw_id);
+            }
+            if mem.has_mem_watch() {
+                let ctx = kernel.scheduler.ctx(hw_id);
+                let tid_w = kernel.scheduler.tid(hw_id).unwrap_or(0);
+                xenia_memory::set_writer_ctx(tid_w, ctx.pc, ctx.lr as u32);
+            }
+        };

    let mut block_ptr = first_block_ptr;
    let mut pc_before = first_pc_before;
@@ -2955,11 +3118,20 @@ fn run_superblock(
            break (result, block_ptr, pc_before);
        }

-        // Chain: build/fetch the next block. Re-borrows `wc.block_cache`,
-        // which invalidates the previous `block_ptr` — but we've already
-        // finished using it (only `sync_sensitive`/diagnostics were read,
-        // above), so the raw-pointer aliasing rule is respected.
+        // Chain into the next block. `ctx.pc` now equals `next_pc` (the
+        // chained block's entry), so fire the per-block-entry observation
+        // BEFORE stepping it — identical to what `worker_prologue` did at
+        // the first block. This keeps the probe firing at EVERY armed
+        // block-entry while leaving the chaining decision (and thus the
+        // schedule) untouched. The first block was already observed by the
+        // prologue, so we only observe the newly-chained blocks here.
        pc_before = next_pc;
+        fire_block_entry_probes(kernel, mem);
+
+        // Build/fetch the next block. Re-borrows `wc.block_cache`, which
+        // invalidates the previous `block_ptr` — but we've already finished
+        // using it (only `sync_sensitive`/diagnostics were read, above), so
+        // the raw-pointer aliasing rule is respected.
        block_ptr = wc.block_cache.lookup_or_build(next_pc, mem) as *const _;
    };
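
Review note on the chain-loop change in the hunk above: a toy model of the idea, under stated assumptions. Blocks are indices into a `next` table, `run_slot` is a hypothetical name, and the observer closure stands in for the probe helpers. The point it illustrates is that the observer is called at each chained block's entry but never influences whether chaining continues, so the executed sequence is the same with a no-op observer and with a recording one.

```rust
/// Runs one toy "slot visit": executes `start`, then follows `next` links
/// while chaining is allowed and the budget is not exhausted.
/// `observe` is called at every chained block's entry, before that block is
/// executed. The first block is assumed to be observed by the caller.
fn run_slot<F: FnMut(usize)>(
    next: &[Option<usize>],
    start: usize,
    budget: usize,
    mut observe: F,
) -> Vec<usize> {
    // Chaining depends only on the budget, never on the observer.
    let chain_allowed = budget > 1;
    let mut executed = vec![start];
    let mut pc = start;
    while chain_allowed && executed.len() < budget {
        match next[pc] {
            Some(n) => {
                pc = n;
                observe(pc); // observe-only: no effect on control flow
                executed.push(pc);
            }
            None => break,
        }
    }
    executed
}
```

Calling `run_slot(&next, 0, 8, |_| {})` and `run_slot(&next, 0, 8, |pc| seen.push(pc))` on the same table gives the same executed sequence; only `seen` differs. With a budget of 1 the loop never chains, which mirrors the reproduction knob described in the comment.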

@@ -2993,6 +3165,15 @@ fn run_execution(
    let mut stats = ExecStats::default();
    let _ = quiet; // retained for future per-kind suppression

+    // APU stage 3 — give the XMA decoder a stable pointer to the guest memory
+    // mapping `run_execution` runs against, so the kick MMIO write can run
+    // Work() synchronously (canary `!use_dedicated_xma_thread` semantics: the
+    // game observes the updated context the instant its kick store retires).
+    // `mem` outlives this call for both the headless and UI paths.
+    if let Ok(mut xma) = kernel.xma.lock() {
+        xma.set_memory(mem);
+    }
+
    // `--halt-on-deadlock` CLI flag OR `XENIA_HALT_ON_DEADLOCK=1|true` env var:
    // when the scheduler next hits a hard deadlock (every live HW thread
    // blocked on a handle wait with no pending timer) we bail out with a
@@ -4093,10 +4274,18 @@ fn dump_thread_diagnostic(
            ),
        }
    }
-    if quiet {
-        return;
-    }
    use xenia_kernel::objects::KernelObject;
+
+    // Toolkit-audit fix (2026-06-21): only the ALWAYS-ON thread/waiter table
+    // is suppressed by `--quiet`. The explicitly-armed diagnostics below
+    // (`--trace-handles`, `--trace-handles-focus`, `--dump-addr`) are
+    // requested output — arming the flag IS the user asking for it — and
+    // were previously swallowed by the blanket `if quiet { return; }`, which
+    // made the documented headless `--quiet` invocation silently drop every
+    // handle/focus/dump report. They are each self-gated below (on
+    // `audit.enabled` / `!audit.focus.is_empty()` / `!dump_addrs.is_empty()`)
+    // so they only print when actually armed.
+    if !quiet {
    println!("\n=== Thread diagnostics ===");
    for (hw_id, slot) in kernel.scheduler.slots.iter().enumerate() {
        if slot.runqueue.is_empty() {
@@ -4193,6 +4382,7 @@ fn dump_thread_diagnostic(
            println!("    cs={:#010x} waiters(tid)={:?}", cs_ptr, tids);
        }
    }
+    } // end `if !quiet` (always-on thread/waiter table)

    // Audit trails (only when --trace-handles flipped the flag). For each
    // tracked handle, emit a compact block: kind, creator, and the bounded
@@ -4868,8 +5058,23 @@ fn cmd_dis(
    // pointer-validity oracle; runs over .rdata + .data.
    let function_starts: std::collections::BTreeSet<u32> =
        func_analysis.functions.keys().copied().collect();
-    let vtables = xenia_analysis::vtables::analyze(
-        &pe_image, base, &sections, &function_starts,
+    // Anchor discovery: recover vtable bases from constructor vptr-write
+    // stores so a vtable with non-function head words (null / pure-virtual /
+    // unrecognised thunk slots) isn't fragmented away by the contiguity
+    // heuristic. (Fixes e.g. the XMV engine vtable 0x8200a908.)
+    let vptr_anchor_funcs: std::collections::BTreeMap<u32, (u32, bool)> = func_analysis
+        .functions
+        .iter()
+        .map(|(&s, fi)| (s, (fi.end, fi.is_saverestore)))
+        .collect();
+    let vptr_block_boundaries: std::collections::HashSet<u32> =
+        xref_result.labels.keys().copied().collect();
+    let vtable_anchors = xenia_analysis::vtables::scan_vptr_write_constants(
+        &pe_image, base, &vptr_anchor_funcs, &sections, &vptr_block_boundaries,
+    );
+    info!(vtable_anchors = vtable_anchors.len(), "vptr-write anchor scan complete");
+    let vtables = xenia_analysis::vtables::analyze_with_anchors(
+        &pe_image, base, &sections, &function_starts, &vtable_anchors,
    );
    let rtti_count = vtables.iter().filter(|v| v.rtti_present).count();
    info!(
--- a/crates/xenia-app/tests/golden/sylpheed_n50m.json
+++ b/crates/xenia-app/tests/golden/sylpheed_n50m.json
@@ -1,9 +1,9 @@
 {
-  "instructions": 50000110,
-  "imports": 243387,
+  "instructions": 50000200,
+  "imports": 189264,
  "unimpl": 0,
-  "draws": 1279,
-  "swaps": 260,
+  "draws": 768,
+  "swaps": 157,
  "unique_render_targets": 2,
  "shader_blobs_live": 6,
  "texture_cache_entries": 1