Files
xenia-rs/audit-runs/review-a-step1b-crowbar-v2/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Crowbar v2 — Step 0 (A) vs (B) verdict + new finding
**Date**: 2026-05-21
**Predecessor**: v1 at `audit-runs/review-a-step1-crowbar/`.
**Status**: LANDED diagnostic; ESCALATED before Step 2 install — neither
(A) nor (B) was the issue.
## TL;DR
- **(A) is FALSIFIED.** Ours's XEX loader populates the vtable region
`0x8200A1E8..+512` correctly. 254/256 nonzero bytes in the first 256;
128/128 nonzero u32 slots in the first 512 bytes. **Worker stub slots
35/36/37/38 each hold real PPC fn pointers** in the `0x8250xxxx`
range:
  - `vtable[35] @ 0x8200A274 = 0x82506B08`
  - `vtable[36] @ 0x8200A278 = 0x82506DE8`
  - `vtable[37] @ 0x8200A27C = 0x82508530`
  - `vtable[38] @ 0x8200A280 = 0x82508A88`
- **(B) is FALSIFIED.** There is no "runtime vtable install" step to
mirror — the vtable contents come from `.rdata` and are present
before the crowbar fires. The AUDIT-068 S3/S4 POD-copy writes
`0x8200A1E8` (vtable BASE) at `[ctx+0]` — a POINTER write — not the
vtable contents themselves.
- **NEW CASE (C) discovered**: the ctx-object layout is wider than the
4 u32s AUDIT-068 S3 captured. `[ctx+44]` is a pointer to a SECOND
object whose vtable+60 (slot 15) is dispatched by `sub_82506DE8` (=
vtable[36] of ctx, called by worker tid=15's entry stub at
`0x82506558`). Since we left `[ctx+44]` zero, the worker reads
`[0]=0`, dereferences as vtable, computes CTR=`[vtable+60]=0`, and
`bctrl` faults at PC=0.
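
The slot addresses and nonzero counts quoted above follow from simple arithmetic over the dump. A minimal sketch, assuming 4-byte big-endian slots (the guest is big-endian PowerPC); the helper names are illustrative and are not the ones in `exports.rs`:

```rust
/// Base of the ctx vtable in `.rdata`, as reported in this note.
const VTABLE_BASE: u32 = 0x8200_A1E8;

/// Guest address of a 4-byte vtable slot.
fn slot_addr(base: u32, slot: u32) -> u32 {
    base + slot * 4
}

/// Decode a raw dump into big-endian u32 slots.
/// Trailing bytes that do not fill a whole slot are ignored.
fn decode_slots(dump: &[u8]) -> Vec<u32> {
    dump.chunks_exact(4)
        .map(|c| u32::from_be_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Count nonzero bytes and nonzero u32 slots in a dump.
fn nonzero_counts(dump: &[u8]) -> (usize, usize) {
    let bytes = dump.iter().filter(|&&b| b != 0).count();
    let slots = decode_slots(dump).iter().filter(|&&s| s != 0).count();
    (bytes, slots)
}
```

With this, `slot_addr(VTABLE_BASE, 36)` gives `0x8200A278`, matching the `lwz r11, 144(r11)` offset used by the worker stub (144 / 4 = 36).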
## v1 framing vs v2 ground truth
v1's `crowbar-on-stderr.log` showed `FAULT: PC in unmapped memory
cycle=20000167 pc=0x00000000 hw_id=0`. v1's hypothesis was
"vtable[35] at `0x8200A274` is uninitialized/null, branch goes to
PC=0." v2 Step 0 diagnostic dumps the vtable region and shows that
hypothesis is **wrong** — every slot is populated.
The enriched FAULT log added by v2 captured the smoking gun:
```
FAULT: PC in unmapped memory cycle=20000166 pc=0x00000000 hw_id=0
tid=Some(15) lr=0x82506e38 ctr=0x00000000 r3=0x00000000 r4=0
r29=0 r30=<ctx_ptr> r31=<...>
```
`lr=0x82506e38` is one instruction past `bctrl` at `0x82506e34`. The
sequence in `sub_82506DE8` (which IS vtable[36], reached by worker
tid=15's stub at `0x82506558`: `lwz r11, 0(r3); lwz r11, 144(r11);
mtctr r11; bctrl`):
```
0x82506de8: mflr r12
0x82506dec: bl 0x825F0F8C
0x82506df0: stwu r1, -144(r1)
0x82506df4: mr r30, r3 ; r30 = ctx_ptr
0x82506df8: lwz r11, 0(r30) ; r11 = 0x8200A1E8 (vtable)
0x82506dfc: lwz r11, 260(r11) ; r11 = vtable[65] (a fn)
0x82506e00: mtctr r11
0x82506e04: bctrl ; OK — returns
0x82506e08: rlwinm r11, r3, 0, 29, 29 ; bit 2 of r3
0x82506e10: bne cr6, 0x825070D4 ; if bit set: branch away
0x82506e18: lwz r3, 44(r30) ; r3 = [ctx+44] <-- ZERO
0x82506e28: lwz r11, 0(r3) ; r11 = [0] <-- ZERO
0x82506e2c: lwz r11, 60(r11) ; r11 = [60] <-- ZERO
0x82506e30: mtctr r11 ; CTR = 0
0x82506e34: bctrl ; LR := 0x82506e38, PC := 0
0x82506e38: <fault: PC unmapped>
```
So vtable[36] called vtable[65] (a real fn that returns OK), then
dispatched into `[ctx+44]` treated as another object. Our crowbar
left `[ctx+44]=0`, so the dispatch faulted.
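
The second dispatch can be modelled in a few lines. This is a sketch under stated assumptions: guest memory is a sparse map whose unwritten words read as zero (which matches the `[0]=0` read in the log, but is not how the emulator maps memory), and the second-object addresses in the usage note are made up for illustration.

```rust
use std::collections::HashMap;

/// Sparse model of guest memory: unwritten words read back as zero.
struct GuestMem {
    words: HashMap<u32, u32>,
}

impl GuestMem {
    fn new() -> Self {
        GuestMem { words: HashMap::new() }
    }

    fn write(&mut self, addr: u32, value: u32) {
        self.words.insert(addr, value);
    }

    fn read(&self, addr: u32) -> u32 {
        *self.words.get(&addr).unwrap_or(&0)
    }
}

#[derive(Debug, PartialEq)]
enum Dispatch {
    /// CTR held a nonzero target; LR is the return address.
    Call { target: u32, lr: u32 },
    /// CTR was zero, so `bctrl` jumps to PC=0.
    Fault { lr: u32 },
}

/// Address of the second `bctrl` in `sub_82506DE8`.
const BCTRL_SITE: u32 = 0x8250_6E34;

/// Mirror of the load chain at 0x82506e18..0x82506e34.
fn second_dispatch(mem: &GuestMem, ctx: u32) -> Dispatch {
    let obj = mem.read(ctx.wrapping_add(44)); // lwz r3, 44(r30)
    let vtable = mem.read(obj); // lwz r11, 0(r3)
    let target = mem.read(vtable.wrapping_add(60)); // lwz r11, 60(r11)
    let lr = BCTRL_SITE + 4; // bctrl sets LR to the next instruction
    if target == 0 {
        Dispatch::Fault { lr }
    } else {
        Dispatch::Call { target, lr }
    }
}
```

With only `[ctx+0]` populated, `second_dispatch` returns `Fault { lr: 0x82506e38 }`, the signature in the enriched log. Once `[ctx+44]` points at an object whose vtable slot 15 is nonzero, it returns a `Call` instead.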
## Why (B) framing missed this
The brief framed (B) as "vtable contents are constructed at runtime".
That's not true — vtable contents are static `.rdata`. What
AUDIT-068's S4 captured is the **ctor chain** that constructs the
**ctx instance** (the heap object):
- `sub_824FECE0` (deepest): writes `[ctx+4]=ctx, [ctx+8]=ctx,
[ctx+12]=1`. Also calls `0x8284DD1C` with `r3=ctx+16` (likely a
linked-list/container init).
- `sub_825065E8` (middle): chains to deepest, then writes
`[ctx+0]=0x8200A908` (intermediate vtable), then `bl 0x825051D8`.
- `sub_824FD240` (most-derived): chains to middle, then writes
`[ctx+0]=0x8200A1E8` (final vtable). Returns.
None of these three ctors writes `[ctx+44]`. So `[ctx+44]` must be
written by either:
1. **Allocator initial-state** (zero-fill? guest-side memset?), OR
2. **A factory function ABOVE the ctor chain** (the caller of
`sub_824FD240` that allocates ctx, calls ctor, then assigns fields
including `+44`).
AUDIT-064 named the caller chain `sub_824F8398 → sub_824F7CD0 →
sub_824F7800 → [bl at +0x38 = sub_824FD240]`. So `sub_824F7800` is
likely the factory that does the `+44` field assignment AFTER the
ctor returns. Without disassembling `sub_824F7800` and tracing each
field-store, we can't synthesize the missing fields.
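
To make the gap concrete, the three ctors can be replayed against a toy ctx. This is a sketch that models only the stores listed above; the two un-disassembled calls (`0x8284DD1C` with `r3=ctx+16`, and `0x825051D8`) are skipped, and the 128-byte object size is an assumption.

```rust
/// Models [ctx+0..ctx+128); the real object size is not known.
const CTX_WORDS: usize = 32;

struct Ctx {
    addr: u32,
    words: [u32; CTX_WORDS],
    written: [bool; CTX_WORDS],
}

impl Ctx {
    fn new(addr: u32) -> Self {
        Ctx { addr, words: [0; CTX_WORDS], written: [false; CTX_WORDS] }
    }

    fn store(&mut self, offset: usize, value: u32) {
        self.words[offset / 4] = value;
        self.written[offset / 4] = true;
    }

    fn load(&self, offset: usize) -> u32 {
        self.words[offset / 4]
    }

    fn was_written(&self, offset: usize) -> bool {
        self.written[offset / 4]
    }
}

/// sub_824FECE0: self-linked list head plus a count/flag of 1.
fn ctor_deepest(ctx: &mut Ctx) {
    let this = ctx.addr;
    ctx.store(4, this);
    ctx.store(8, this);
    ctx.store(12, 1);
    // Call to 0x8284DD1C (r3 = ctx+16) not modelled: its effect is unknown.
}

/// sub_825065E8: chains down, then installs the intermediate vtable.
fn ctor_middle(ctx: &mut Ctx) {
    ctor_deepest(ctx);
    ctx.store(0, 0x8200_A908);
    // `bl 0x825051D8` not modelled: its effect is unknown.
}

/// sub_824FD240: chains down, then installs the final vtable.
fn ctor_most_derived(ctx: &mut Ctx) {
    ctor_middle(ctx);
    ctx.store(0, 0x8200_A1E8);
}
```

After `ctor_most_derived`, offsets 0, 4, 8 and 12 are populated and `was_written(44)` is still false, which is the state the crowbar reproduced.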
## Why escalating is the right call now
The brief's tripstone #6 sets a 2-hour timebox. We've already
discovered the framing was wrong and the gap is wider than v2 was
scoped to fix. The honest moves are:
1. **Stop and document** the new finding (this doc + memory entry).
2. **Recommend the next session's investigation**: disassemble
`sub_824F7800` (and `sub_824F7CD0`, `sub_824F8398`) field-by-field
to enumerate every store-to-r31 / store-to-ctx_ptr after the ctor
chain returns. Mirror those stores in a crowbar v3.
3. Alternative — much wider: build a canary read-probe sweep over
`[ctx+0..ctx+128]` to capture the live state. ~200 LOC canary
instrumentation; trades complexity for ground-truth.
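
Option 2 could start from a text scan of the disassembly. A rough sketch, assuming a listing in the same `addr: mnemonic operands ; comment` shape used in this note; it only catches plain `stw` through one fixed base register, so register copies, `stb`/`sth`/`stwx`, and stores made by callees would still need manual tracing.

```rust
/// Return the displacement of every `stw rS, OFF(rBASE)` in a text listing.
fn stores_to_base(listing: &str, base_reg: &str) -> Vec<i32> {
    let mut offsets = Vec::new();
    for line in listing.lines() {
        // Drop a trailing `; comment` and a leading `0x...:` address, if present.
        let code = line.split(';').next().unwrap_or("");
        let code = match code.split_once(':') {
            Some((_, rest)) => rest,
            None => code,
        };
        let mut parts = code.split_whitespace();
        if parts.next() != Some("stw") {
            continue; // exact match, so `stwu` and loads are skipped
        }
        // Remaining operands, with whitespace removed, look like `r11,44(r31)`.
        let operands: String = parts.collect();
        let Some((_, mem)) = operands.split_once(',') else { continue };
        let Some((off, reg)) = mem.split_once('(') else { continue };
        if reg.trim_end_matches(')') != base_reg {
            continue;
        }
        if let Ok(off) = off.parse::<i32>() {
            offsets.push(off);
        }
    }
    offsets
}
```

Running it over the factory's listing with `base_reg = "r31"` (or whichever register holds ctx after the ctor returns) would give the candidate field offsets for crowbar v3 to mirror.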
## Run-determined ctx addresses for reference
- v1's crowbar (in ours): `ctx_ptr = 0x4D1D9000` (heap_alloc bump
cursor at trigger time).
- Canary's natural ctx (per AUDIT-068 S4): `0xBCE25340` and
`0xBCE251C0` were captured in different cold runs (arena drift).
The probe at `0xBCE251C0..+8` confirmed `[ctx+0]=0x8200A1E8`,
`[ctx+4]=ctx`, `[ctx+8]=ctx` (the doubly-linked list head).
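
Because the ctx address differs between canary and ours (and between canary runs), any captured field that points back into the object has to be rebased before it is mirrored. A minimal sketch, assuming the snapshot is a flat array of u32 words starting at `[ctx+0]`; the function name is hypothetical:

```rust
/// Copy a ctx snapshot captured at `src_base` so that self-pointers are
/// valid when the object lives at `dst_base`.
///
/// Any word that falls inside [src_base, src_base + 4 * words.len()) is
/// treated as a pointer into the object and shifted; everything else
/// (vtable pointers, small integers) is copied unchanged.
fn rebase_snapshot(words: &[u32], src_base: u32, dst_base: u32) -> Vec<u32> {
    let len = (words.len() * 4) as u32;
    words
        .iter()
        .map(|&w| {
            let off = w.wrapping_sub(src_base);
            if off < len {
                dst_base.wrapping_add(off)
            } else {
                w
            }
        })
        .collect()
}
```

This heuristic would misfire on a plain integer that happens to fall in the object's address range, and it does nothing for pointers to other heap objects such as whatever `[ctx+44]` refers to.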
## LOC delta this session
- `crates/xenia-kernel/src/exports.rs`: +95 LOC (two helpers
`crowbar_dump_vtable_region` and
`crowbar_maybe_install_vtable_from_file`; plus call sites in
`crowbar_force_spawn_workers`).
- `crates/xenia-app/src/main.rs`: +9 LOC (enriched FAULT log with
tid/lr/ctr/r3/r4/r29/r30/r31).
- Total: ~104 LOC additive over v1. Within budget.
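
For reference, the enriched FAULT line can be produced with ordinary formatting. This is an illustrative sketch, not the code in `main.rs`; the struct and function names are assumptions, and only a subset of the registers is shown.

```rust
/// Registers captured at fault time (subset).
struct FaultRegs {
    tid: Option<u32>,
    lr: u32,
    ctr: u32,
    r3: u32,
    r4: u32,
}

/// Render a two-line fault message in roughly the shape shown in this note.
fn format_fault(cycle: u64, pc: u32, hw_id: u32, regs: &FaultRegs) -> String {
    format!(
        "FAULT: PC in unmapped memory cycle={} pc={:#010x} hw_id={}\n  \
         tid={:?} lr={:#010x} ctr={:#010x} r3={:#010x} r4={:#010x}",
        cycle, pc, hw_id, regs.tid, regs.lr, regs.ctr, regs.r3, regs.r4
    )
}
```

Printing `tid` with `{:?}` is what yields the `Some(15)` form seen in the log; `lr` is the most useful field, since `lr - 4` locates the faulting `bctrl`.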
## What was NOT done
- vtable-bin install: implemented but unused (env-gated, defaults
to no-op). Kept in tree in case a future session captures canary's
vtable bytes for cross-validation, although Step 0 shows that is
not needed for the fix because ours's vtable is already correct.
- 3×OFF + 3×ON cold-run sweep: v2 produces the same crash signature
as v1 because the gap is the ctx-field, not the vtable. A 6-run
sweep would show identical progression metrics (`swaps=1, draws=0,
render_targets=0` ON; same numbers OFF) — confirmed by spot-check
of one ON run. Skipping the full sweep to honour the timebox.
- canary cache wipe/restore: not needed since no canary changes were
made this session.
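
The env-gated, default-no-op pattern used for the vtable-bin install looks roughly like this. A sketch only: the variable name `CROWBAR_VTABLE_BIN` is hypothetical (the note does not name the real one), and the real helper writes into guest memory rather than returning bytes.

```rust
use std::env;
use std::fs;

/// Hypothetical variable name, used here for illustration only.
const VTABLE_BIN_ENV: &str = "CROWBAR_VTABLE_BIN";

/// Load an override image from `path`, or return `None` (no-op) when no
/// path is given, the file cannot be read, or it is not a whole number of
/// 4-byte slots.
fn load_vtable_override(path: Option<&str>) -> Option<Vec<u8>> {
    let bytes = fs::read(path?).ok()?;
    if bytes.is_empty() || bytes.len() % 4 != 0 {
        return None;
    }
    Some(bytes)
}

/// Gate on the environment: unset variable means no override.
fn load_vtable_override_from_env() -> Option<Vec<u8>> {
    load_vtable_override(env::var(VTABLE_BIN_ENV).ok().as_deref())
}
```

Splitting the path-taking function from the env lookup keeps the default path trivially testable: with the variable unset, nothing is read and nothing is installed.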
## Files
- `step0-diag-stderr.log`: first run, vtable dump only (256 bytes).
- `step0b-diag.log`: second run, 512-byte vtable dump.
- `step0c-diag.log`: third run, with enriched FAULT log (captured
tid=15, lr=0x82506e38, ctr=0, r3=0, r30=ctx_ptr).