Merge dispatcher-probe-audit/p0-ctor-probe-and-struct-dump (KRNBUG-AUDIT-004)

This commit is contained in:
MechaCat02
2026-05-04 17:09:52 +02:00
5 changed files with 2601 additions and 0 deletions

View File

@@ -4457,3 +4457,168 @@ function arg (no constant-load), but the simple
3. Treat 0x42450b5c independently. AUDIT-002's hook missed it because
the parking site (PC=0x824cd4f4) isn't routed through `do_wait_single`.
Open KRNBUG-AUDIT-004 for that wait path.
---
### KRNBUG-AUDIT-004 — `--ctor-probe` PC hook + `--dump-addr` struct dump; producer-indirection layer identified; "8-instance pool" hypothesis falsified
**Status**: landed on master (no-ff merge of feature branch
`dispatcher-probe-audit/p0-ctor-probe-and-struct-dump`). Diagnostic-
only, read-only, lockstep-preserved (`instructions=100000002` at
`-n 100M --stable-digest`).
**Tests**: 586 → **588**.
**What landed (`crates/xenia-kernel/src/state.rs`):**
- `pub ctor_probe_pcs: HashSet<u32>` field on `KernelState` (default
empty).
- `pub fire_ctor_probe_if_match(hw_id, mem)` — fast-rejects when set
is empty; on match prints a one-shot record `CTOR-PROBE pc=...
tid=... hw=... cycle=... sp=... r3=... lr=...` plus an 8-frame
back-chain with saved-r31/r30 per frame. Pure read.
- `pub dump_addrs: Vec<u32>` field for end-of-run struct dumps.
- 2 unit tests: empty-set no-op, set-membership invariant.
**What landed (`crates/xenia-app/src/main.rs`):**
- `--ctor-probe=0x8217C850,0x82181750,...` CLI flag (and
`XENIA_CTOR_PROBE`). Parsed into `kernel.ctor_probe_pcs` at
`cmd_exec_inner` startup.
- `--dump-addr=0x828F3D08,0x828F4070,0x828F3EC0,...` CLI flag (and
`XENIA_DUMP_ADDR`). Each address gets a 128-byte hex+be32+ASCII
dump at end-of-run, after the per-handle FOCUS report.
- `worker_prologue` calls `fire_ctor_probe_if_match` after reading
`pc` and before any thunk-dispatch / step-block branch.
`dump_thread_diagnostic` consumes `kernel.dump_addrs`.
**Decisive findings (corrects KRNBUG-AUDIT-002/003):**
1. **The "8-instance pool" hypothesis for handle 0x1004 is FALSE.**
Probe ran at `-n 50M --halt-on-deadlock` with PCs
`[0x821783D8, 0x82181750, 0x821701C8]` (the per-instance ctors
for handles 0x1004 / 0x100c / 0x15e0 respectively). Each fired
**EXACTLY ONCE**:
```
CTOR-PROBE pc=0x821783d8 tid=1 hw=0 cycle=1401430 r3=0x828f3ec0 ← handle 0x1004
CTOR-PROBE pc=0x82181750 tid=1 hw=0 cycle=5363599 r3=0x828f3d08 ← handle 0x100c
CTOR-PROBE pc=0x821701c8 tid=1 hw=0 cycle=9203618 r3=0x828f4070 ← handle 0x15e0
```
Handle 0x1004 has a SINGLE dispatcher at **0x828F3EC0**, not 8
pool members. The earlier "called 8 times" claim came from
counting raw entries to the OUTER getter `sub_8217C850`, but
`sub_8217C850` is a Meyers-style singleton-getter — its inner
`bl 0x821783D8` (the per-instance ctor) is gated on a one-shot
init flag at `[0x828F48D8] bit 0`. Subsequent `sub_8217C850`
calls just return the existing slot pointer.
2. **The producer indirection layer IS the singleton-getter
itself.** Static byte-scans of `.rdata` and `.data` show 0 hits
for the dispatcher addresses 0x828F3D08 / 0x828F4070 — so no
registry table holds them. But the `xrefs` table for the OUTER
getters reveals:
```
sub_821800D8 (outer for 0x828F3D08, handle 0x100c): 6 callers
0x821802d8 (sub_82180158+0x180) ← non-create-chain
0x821806e0 (sub_821805C8+0x118) ← non-create-chain
0x82180b28 (sub_82180A10+0x118) ← non-create-chain
0x82180ea0 (sub_82180D90+0x110) ← non-create-chain
0x82181254 (sub_821810E0+0x174) ← non-create-chain
0x82181c54 (sub_82181C28+0x2C) ← create-chain ONLY
sub_8216F618 (outer for 0x828F4070, handle 0x15e0): 5 callers
0x8216f9d4 (sub_8216F818+0x1BC) ← non-create-chain
0x8216fc08 (sub_8216F9F0+0x218) ← non-create-chain
0x821700b8 (sub_8216FF70+0x148) ← non-create-chain
0x821700f4 (sub_821700E0+0x14) ← non-create-chain
0x821707f4 (sub_821707C0+0x34) ← create-chain ONLY
```
The non-create-chain consumers all share the **canonical
producer pattern**:
```
bl outer_singleton_getter ; r3 = dispatcher ptr
lwz r3, OFFSET(r3) ; r3 = an event handle / queue field
bl 0x824AA1D8 ; signal/wake function
```
For 0x100c the offset is 80 (= 0x50); for 0x15e0 the offset is
36 (= 0x24).
So **interpretation (2) of the audit charter is confirmed**:
producers reference the dispatchers via a function-call layer of
indirection, not through direct address materialization. The
xref-table audit in AUDIT-003 (which only catches direct
constant-loads of the dispatcher base) was **necessary but not
sufficient** — it correctly saw "no direct producer references"
but missed the singleton-getter indirection.
3. **Dispatcher struct layouts** (128-byte dumps at `-n 50M
--halt-on-deadlock`):
```
0x828F3D08 (handle 0x100c, per-instance ctor sub_82181750):
+0x00 = 0xFFFFFFFF ; queue head/tail sentinel
+0x28 = 0x00000007 ; capacity = 7
+0x2C = 0x01000000 ; init flag
+0x3C = 0xFFFFFFFF ; secondary sentinel
+0x48 = 0x00001010 ; thread_handle (worker thread)
+0x4C = 0x0000100C ; event_handle (= self handle 0x100c)
+0x50 = 0x00000000 ; producer reads this — currently 0
+0x70 = 0x00000001 ; refcount?
+0x74 = 0x828F3D08 ; self-pointer
0x828F4070 (handle 0x15e0, per-instance ctor sub_821701C8):
+0x00 = 0x01000000 ; init flag
+0x10 = 0xFFFFFFFF ; queue sentinel
+0x1C = 0x000015E4 ; sibling-handle (NOT in our parked
; set — possibly a thread handle)
+0x20 = 0x000015E0 ; event_handle (= self handle 0x15e0)
+0x24 = 0x00000000 ; producer reads this — currently 0
+0x40 = 0xFFFFFFFF ; secondary sentinel
0x828F3EC0 (handle 0x1004, per-instance ctor sub_821783D8):
+0x00 = 0x01000000 ; init flag
+0x10 = 0xFFFFFFFF ; queue sentinel
+0x20 = 0x40541BC0 ; heap pointer (sub-buffer #1)
+0x30 = 0x00000014 ; size 20
+0x34 = 0x0000002F ; size 47
+0x38 = 0x414F5F60 ; heap-range payload (or two halfwords)
+0x3C = 0x40211CA0 ; heap pointer (sub-buffer #2)
+0x44 = 0x405418C0 ; heap pointer (sub-buffer #3)
+0x50 = 0x40111840 ; heap pointer (sub-buffer #4)
+0x58 = 0xFFFFFFFF ; sentinel
+0x5C = 0xFFFFFFFF ; sentinel
+0x76 = 0x000012AC ; possibly thread id
+0x78 = 0x00001004 ; event_handle (= self handle 0x1004)
```
The 0x1004 dispatcher is **noticeably different**: it owns 4
guest-heap sub-buffers in 0x4xxxxxxx range, suggesting it
manages a more complex resource than the other two (which are
pure POD job queues). The +0x78 location of the event_handle
differs from 0x100c's +0x4C and 0x15e0's +0x20, so each
subsystem has its own struct layout (no shared base class).
**Reproduce:**
```bash
cargo run --release -p xenia-app -- exec 'sylpheed.iso' \
--halt-on-deadlock \
--trace-handles-focus=0x1004,0x100c,0x15e0 \
--ctor-probe=0x821783D8,0x82181750,0x821701C8 \
--dump-addr=0x828F3D08,0x828F4070,0x828F3EC0 \
-n 50000000
```
Trace files saved at:
- `audit-runs/audit-004/run-50m-probe.txt` (outer-getter probes)
- `audit-runs/audit-004/run-50m-probe-v2.txt` (inner-ctor probes — singleton hypothesis confirmed)
**Recommendation for next session (do not implement a fix):**
Hook the entry of each non-create-chain consumer site for handle
0x100c (5 sites: 0x821802d8, 0x821806e0, 0x82180b28, 0x82180ea0,
0x82181254) and for handle 0x15e0 (4 sites: 0x8216f9d4, 0x8216fc08,
0x821700b8, 0x821700f4) using `--ctor-probe=...`. If any fire, then
the producer DOES execute and the failure mode is in the wake/signal
chain (probably `lwz r3, OFFSET(r3)` reads zero — see dispatcher dump
[+0x50] = 0 for 0x100c, [+0x24] = 0 for 0x15e0 — and the wake
function 0x824AA1D8 is then called with handle=0). If none fire,
the producer chain is gated upstream (likely a feature flag, init
phase, or RPC handler that never fires). Either way, the next
diagnostic narrows the bug surface dramatically.

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -176,6 +176,26 @@ enum Commands {
/// `XENIA_XAUDIO_TICK=1`.
#[arg(long)]
xaudio_tick: bool,
/// Diagnostic. Comma-separated list of guest PCs at which the
/// worker prologue prints a one-shot `CTOR-PROBE` line capturing
/// live r3/lr/sp + an 8-frame back-chain. Hit-once-per-hit — for
/// the bridge ctor `sub_8217C850` called 8× from the
/// static-init driver, the probe fires 8×. Use to recover
/// pool-element `this` addresses that MSVC ctors fail to
/// preserve in r31. Read-only; lockstep digest unaffected.
/// Examples: `--ctor-probe=0x8217C850`,
/// `--ctor-probe=0x82181750,0x821701C8`.
#[arg(long)]
ctor_probe: Option<String>,
/// Diagnostic. Comma-separated list of guest addresses to dump
/// (64 bytes each, hex + u32 lanes) at end-of-run, after the
/// per-handle FOCUS report. Used to inspect the static
/// dispatcher / job-queue / pool struct layouts identified by
/// AUDIT-003 (e.g. `--dump-addr=0x828F3D08,0x828F4070`). The
/// dump captures a snapshot at the moment `dump_thread_diagnostic`
/// runs — typically when `--halt-on-deadlock` triggers.
#[arg(long)]
dump_addr: Option<String>,
},
/// Browse XISO disc image contents
Browse {
@@ -331,6 +351,8 @@ fn main() -> Result<()> {
reservations_table,
parallel,
xaudio_tick,
ctor_probe,
dump_addr,
} => cmd_exec(
&path,
max_instructions,
@@ -349,6 +371,8 @@ fn main() -> Result<()> {
reservations_table,
parallel,
xaudio_tick,
ctor_probe.as_deref(),
dump_addr.as_deref(),
),
Commands::Browse { path } => cmd_browse(&path),
Commands::Info { path } => cmd_info(&path),
@@ -550,6 +574,8 @@ fn cmd_exec(
reservations_table: bool,
parallel: bool,
xaudio_tick: bool,
ctor_probe: Option<&str>,
dump_addr: Option<&str>,
) -> Result<()> {
cmd_exec_inner(
path,
@@ -569,6 +595,8 @@ fn cmd_exec(
reservations_table,
parallel,
xaudio_tick,
ctor_probe,
dump_addr,
None,
None,
false,
@@ -607,6 +635,8 @@ fn cmd_check(
reservations_table,
parallel,
xaudio_tick,
None, // ctor_probe — diagnostic, never wanted on goldens
None, // dump_addr — same
out,
expect,
stable_digest,
@@ -631,6 +661,8 @@ fn cmd_exec_inner(
reservations_table: bool,
parallel: bool,
xaudio_tick: bool,
ctor_probe: Option<&str>,
dump_addr: Option<&str>,
digest_out: Option<&str>,
digest_expect: Option<&str>,
stable_digest: bool,
@@ -879,6 +911,89 @@ fn cmd_exec_inner(
}
}
// Diagnostic. Parse `--ctor-probe=0x8217C850,0x...` (or
// `XENIA_CTOR_PROBE=...`) into `kernel.ctor_probe_pcs`. The
// worker prologue checks this set on every step; on a hit it
// prints a single back-chain capture line. Empty set = no
// probes = no-op fast path.
let ctor_probe_combined: Option<String> = match (ctor_probe, std::env::var("XENIA_CTOR_PROBE").ok()) {
(Some(s), _) => Some(s.to_string()),
(None, Some(s)) if !s.is_empty() => Some(s),
_ => None,
};
if let Some(list) = ctor_probe_combined {
for token in list.split(',').map(str::trim).filter(|s| !s.is_empty()) {
let parsed = if let Some(hex) = token.strip_prefix("0x").or_else(|| token.strip_prefix("0X")) {
u32::from_str_radix(hex, 16)
} else {
token.parse::<u32>()
};
match parsed {
Ok(pc) => {
kernel.ctor_probe_pcs.insert(pc);
}
Err(_) => {
return Err(anyhow::anyhow!(
"invalid PC in --ctor-probe: {token:?}"
));
}
}
}
if !quiet && !kernel.ctor_probe_pcs.is_empty() {
let pcs: Vec<String> = kernel
.ctor_probe_pcs
.iter()
.map(|p| format!("{p:#010x}"))
.collect();
tracing::info!(
"ctor probes armed: {} ({})",
kernel.ctor_probe_pcs.len(),
pcs.join(", ")
);
}
}
// Diagnostic. Parse `--dump-addr=0x828F3D08,...` (or
// `XENIA_DUMP_ADDR=...`) into `kernel.dump_addrs`. The contents
// are dumped at end-of-run by `dump_thread_diagnostic`. Pure
// read; never mutates guest state.
let dump_addr_combined: Option<String> = match (dump_addr, std::env::var("XENIA_DUMP_ADDR").ok()) {
(Some(s), _) => Some(s.to_string()),
(None, Some(s)) if !s.is_empty() => Some(s),
_ => None,
};
if let Some(list) = dump_addr_combined {
for token in list.split(',').map(str::trim).filter(|s| !s.is_empty()) {
let parsed = if let Some(hex) = token.strip_prefix("0x").or_else(|| token.strip_prefix("0X")) {
u32::from_str_radix(hex, 16)
} else {
token.parse::<u32>()
};
match parsed {
Ok(addr) => {
kernel.dump_addrs.push(addr);
}
Err(_) => {
return Err(anyhow::anyhow!(
"invalid address in --dump-addr: {token:?}"
));
}
}
}
if !quiet && !kernel.dump_addrs.is_empty() {
let addrs: Vec<String> = kernel
.dump_addrs
.iter()
.map(|a| format!("{a:#010x}"))
.collect();
tracing::info!(
"dump addresses armed: {} ({})",
kernel.dump_addrs.len(),
addrs.join(", ")
);
}
}
// Install the GPU register aperture MMIO region on the guest memory so
// any `0x7FC8xxxx` access routes to our atomic mailbox. Matches canary's
// `graphics_system.cc:141-144`. The callbacks capture Arc clones of the
@@ -1812,6 +1927,13 @@ fn worker_prologue(
let pc = kernel.scheduler.ctx(hw_id).pc;
// 0) Diagnostic ctor-probe: if `pc` is in
// `kernel.ctor_probe_pcs`, capture live r3/lr/sp + back-chain
// and println one record. Read-only; lockstep digest unaffected.
// Empty set is the common case → single `is_empty()` test inside
// the helper, no overhead on the hot path.
kernel.fire_ctor_probe_if_match(hw_id, mem);
// 1) Halt-sentinel check (per HW thread).
if pc == LR_HALT {
let injected_here = kernel.interrupts.saved.is_some()
@@ -3405,6 +3527,56 @@ fn dump_thread_diagnostic(
}
}
}
// Diagnostic. `--dump-addr` content dump at end-of-run. Each
// address gets a 64-byte dump in three forms (raw bytes, u32
// big-endian lanes, ASCII guess). The lanes form is gold for
// pool-element / dispatcher struct reverse-engineering: at
// [base+0]=-1 means "empty queue" sentinel from AUDIT-003, while
// any image-range pointer `0x82xxxxxx` at a fixed offset hints
// at an embedded silph::Event* / vtable / next-link.
if !kernel.dump_addrs.is_empty() {
println!("\n=== Dump-addr ===");
const DUMP_BYTES: usize = 128;
for &addr in &kernel.dump_addrs {
println!(" addr={:#010x}", addr);
let mut bytes = [0u8; DUMP_BYTES];
for i in 0..DUMP_BYTES {
bytes[i] = mem.read_u8(addr.wrapping_add(i as u32));
}
for row in 0..(DUMP_BYTES / 16) {
let off = row * 16;
let hex: String = (0..16)
.map(|i| format!("{:02x}", bytes[off + i]))
.collect::<Vec<_>>()
.join(" ");
let words: String = (0..4)
.map(|i| {
let b = &bytes[off + i * 4..off + i * 4 + 4];
format!(
"{:08x}",
u32::from_be_bytes([b[0], b[1], b[2], b[3]])
)
})
.collect::<Vec<_>>()
.join(" ");
let ascii: String = (0..16)
.map(|i| {
let b = bytes[off + i];
if (0x20..=0x7E).contains(&b) {
b as char
} else {
'.'
}
})
.collect();
println!(
" +{:#04x}: {} | be32={} | {}",
off, hex, words, ascii,
);
}
}
}
}
#[allow(clippy::too_many_arguments)]

View File

@@ -179,6 +179,30 @@ pub struct KernelState {
/// backend (which lives on the worker thread under `--gpu-thread`).
pub ring_base: u32,
pub ring_size_dwords: u32,
/// Diagnostic. PCs at which the worker prologue fires a one-shot
/// stack/back-chain dump capturing live `r3` (= `this` in MSVC
/// PPC ctors), `lr` (= return site), and the cycle/tid that hit
/// the PC. Populated from `--ctor-probe=0x8217C850,0x...` /
/// `XENIA_CTOR_PROBE`. Empty by default → check is a single
/// `is_empty()` test, no extra cost on the unprobed hot path.
/// Read-only diagnostic — no guest state is mutated, so the
/// `sylpheed_n*m.json` lockstep digest is preserved.
///
/// **Why a per-PC probe instead of per-handle?** The MSVC ctors
/// at `sub_8217C850` (and friends) don't preserve `this` in r31
/// across the inner `bl` to `silph::Event::Construct`, so the
/// AUDIT-002 multi-frame back-chain at `NtCreateEvent` only
/// recovers stack-relative pointers — never the pool-element
/// `this`. Hooking the ctor's PRE-prologue PC captures r3 = this
/// before any save/restore can clobber it.
pub ctor_probe_pcs: std::collections::HashSet<u32>,
/// Diagnostic. Guest addresses to dump (64 bytes each, hex + u32
/// lanes) at end-of-run. Populated from `--dump-addr=0x828F3D08,
/// 0x828F4070`. Used to inspect static dispatcher / job-queue /
/// pool struct layouts identified by AUDIT-003. Read-only — the
/// dump is performed by `dump_thread_diagnostic`, never during
/// the hot interpreter loop, so lockstep determinism is unaffected.
pub dump_addrs: Vec<u32>,
}
impl KernelState {
@@ -230,6 +254,8 @@ impl KernelState {
ring_base: 0,
ring_size_dwords: 0,
parallel_active: false,
ctor_probe_pcs: std::collections::HashSet::new(),
dump_addrs: Vec::new(),
};
crate::exports::register_exports(&mut state);
crate::xam::register_exports(&mut state);
@@ -522,6 +548,48 @@ impl KernelState {
self.audit.record_wake(handle, entry);
}
/// Diagnostic. If the live PC for HW slot `hw_id` is in
/// `self.ctor_probe_pcs`, emit a single `CTOR-PROBE` line with
/// the current cycle, tid, hw_id, sp, r3, lr, plus an 8-frame
/// back-chain walk. Read-only — no guest state is mutated, so a
/// run with the probe set is byte-identical to one without (the
/// probe only adds println noise).
///
/// Intended call site: top of `worker_prologue`, after `pc` has
/// been read but before any thunk-dispatch / step-block branch.
/// Fires once per hit — if the same PC is reached again (e.g.
/// the bridge ctor sub_8217C850 called 8 times by the static-
/// init driver), it fires 8 times, which is exactly what we want
/// for pool-element identification.
pub fn fire_ctor_probe_if_match(&self, hw_id: u8, mem: &GuestMemory) {
if self.ctor_probe_pcs.is_empty() {
return;
}
let ctx = self.scheduler.ctx(hw_id);
let pc = ctx.pc;
if !self.ctor_probe_pcs.contains(&pc) {
return;
}
let tid = self.scheduler.tid(hw_id).unwrap_or(0);
let r3 = ctx.gpr[3] as u32;
let lr = ctx.lr as u32;
let sp = ctx.gpr[1] as u32;
let cycle = ctx.cycle_count;
let frames = walk_guest_back_chain(sp, lr, mem, 8);
println!(
"CTOR-PROBE pc={:#010x} tid={} hw={} cycle={} sp={:#010x} r3={:#010x} lr={:#010x}",
pc, tid, hw_id, cycle, sp, r3, lr,
);
for (i, (fp, frame_lr)) in frames.iter().enumerate() {
let saved_r31 = mem.read_u32(fp.wrapping_sub(12));
let saved_r30 = mem.read_u32(fp.wrapping_sub(16));
println!(
" CTOR-PROBE frame={} fp={:#010x} lr={:#010x} saved-r31={:#010x} saved-r30={:#010x}",
i, fp, frame_lr, saved_r31, saved_r30,
);
}
}
/// Read a TLS slot for the currently running HW thread.
pub fn tls_get(&self, index: u32) -> u64 {
self.scheduler.tls_get(index)
@@ -1231,6 +1299,38 @@ mod tests {
/// A NUL-terminated ASCII string is read up to `max`; non-printable
/// bytes mark the candidate as bogus (return empty string). The
/// `.?A` prefix gating in `read_class_at_this` then rejects them.
/// `fire_ctor_probe_if_match` only emits when `pc` matches a
/// configured PC. We assert it's a no-op on miss and a no-panic
/// on hit (the println goes to stdout; we just check the helper
/// reads the back-chain without faulting).
#[test]
fn fire_ctor_probe_if_match_no_op_on_empty_set() {
let mem = GuestMemory::new().expect("memory init");
let state = KernelState::new();
// No probes set → must be a no-op even when the scheduler
// ctx has whatever PC.
state.fire_ctor_probe_if_match(0, &mem);
assert!(state.ctor_probe_pcs.is_empty());
}
#[test]
fn fire_ctor_probe_if_match_only_fires_on_listed_pc() {
// We can't easily redirect stdout under cargo-test, so this
// test mostly verifies the membership check + that no panic
// occurs when frame walking encounters zero/sentinel pages.
// The empty-stack walk returns just `[(sp, lr)]`, exercising
// the loop body once safely.
let mem = GuestMemory::new().expect("memory init");
let mut state = KernelState::new();
state.ctor_probe_pcs.insert(0x8217_C850);
// The default PpcContext on slot 0 has pc=0 (idle sentinel),
// so the probe set membership test misses → no fire.
state.fire_ctor_probe_if_match(0, &mem);
// Sanity: an unrelated PC isn't claimed.
assert!(!state.ctor_probe_pcs.contains(&0x8200_0000));
assert!(state.ctor_probe_pcs.contains(&0x8217_C850));
}
#[test]
fn read_ascii_cstring_handles_termination_and_garbage() {
use xenia_memory::page_table::MemoryProtect;