handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions
--- a/audit-runs/review-a-step1b-crowbar-v2/investigation.md
+++ b/audit-runs/review-a-step1b-crowbar-v2/investigation.md
@@ -0,0 +1,157 @@
+# Crowbar v2 — Step 0 (A) vs (B) verdict + new finding
+
+**Date**: 2026-05-21
+**Predecessor**: v1 at `audit-runs/review-a-step1-crowbar/`.
+**Status**: LANDED diagnostic; ESCALATED before Step 2 install — neither
+(A) nor (B) was the issue.
+
+## TL;DR
+
+- **(A) is FALSIFIED.** Ours's XEX loader populates the vtable region
+  `0x8200A1E8..+512` correctly. 254/256 nonzero bytes in the first 256;
+  128/128 nonzero u32 slots in the first 512 bytes. **Worker stub slots
+  35/36/37/38 each hold real PPC fn pointers** in the `0x8250xxxx`
+  range:
+  - `vtable[35] @ 0x8200A274 = 0x82506B08`
+  - `vtable[36] @ 0x8200A278 = 0x82506DE8`
+  - `vtable[37] @ 0x8200A27C = 0x82508530`
+  - `vtable[38] @ 0x8200A280 = 0x82508A88`
+- **(B) is FALSIFIED.** There is no "runtime vtable install" step to
+  mirror — the vtable contents come from `.rdata` and are present
+  before the crowbar fires. The AUDIT-068 S3/S4 POD-copy writes
+  `0x8200A1E8` (vtable BASE) at `[ctx+0]` — a POINTER write — not the
+  vtable contents themselves.
+- **NEW CASE (C) discovered**: the ctx-object layout is wider than the
+  4 u32s AUDIT-068 S3 captured. `[ctx+44]` is a pointer to a SECOND
+  object whose vtable+60 (slot 15) is dispatched by `sub_82506DE8` (=
+  vtable[36] of ctx, called by worker tid=15's entry stub at
+  `0x82506558`). Our crowbar left `[ctx+44]` zero, so the worker loads
+  `r3=[ctx+44]=0`, reads `[0]=0` as that object's vtable pointer, loads
+  CTR=`[0+60]=0`, and `bctrl` faults at PC=0.
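
The slot numbers quoted above follow from 4-byte slots on the 32-bit PPC guest. A minimal sketch of that arithmetic, assuming nothing beyond the addresses and displacements already listed; the helper names `slot_addr` and `slot_index` are hypothetical and do not exist in the tree:

```rust
/// Size of one vtable slot on the 32-bit PPC guest (one u32 fn pointer).
const SLOT_BYTES: u32 = 4;

/// Guest address of vtable slot `index`, given the vtable base address.
/// (Hypothetical helper, for illustration only.)
fn slot_addr(vtable_base: u32, index: u32) -> u32 {
    vtable_base.wrapping_add(index.wrapping_mul(SLOT_BYTES))
}

/// Slot index selected by the displacement of an `lwz rX, disp(rVtable)`.
/// Returns `None` when the displacement is not slot-aligned.
fn slot_index(disp: u32) -> Option<u32> {
    if disp % SLOT_BYTES == 0 {
        Some(disp / SLOT_BYTES)
    } else {
        None
    }
}
```

With base `0x8200A1E8`, `slot_addr(base, 35)` gives `0x8200A274`, and `slot_index(60)` gives `Some(15)`, matching the figures in the bullets.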
+
+## v1 framing vs v2 ground truth
+
+v1's `crowbar-on-stderr.log` showed `FAULT: PC in unmapped memory
+cycle=20000167 pc=0x00000000 hw_id=0`. v1's hypothesis was
+"vtable[35] at `0x8200A274` is uninitialized/null, branch goes to
+PC=0." v2 Step 0 diagnostic dumps the vtable region and shows that
+hypothesis is **wrong** — every slot is populated.
+
+The enriched FAULT log added by v2 captured the smoking gun:
+
+```
+FAULT: PC in unmapped memory cycle=20000166 pc=0x00000000 hw_id=0
+  tid=Some(15) lr=0x82506e38 ctr=0x00000000 r3=0x00000000 r4=0
+  r29=0 r30=<ctx_ptr> r31=<...>
+```
+
+`lr=0x82506e38` is one instruction past the `bctrl` at `0x82506e34`.
+Worker tid=15's stub at `0x82506558` reaches `sub_82506DE8` through
+vtable[36] (`lwz r11, 0(r3); lwz r11, 144(r11); mtctr r11; bctrl`).
+The relevant instructions in `sub_82506DE8` (listing abridged):
+
+```
+0x82506de8: mflr r12
+0x82506dec: bl   0x825F0F8C
+0x82506df0: stwu r1, -144(r1)
+0x82506df4: mr   r30, r3              ; r30 = ctx_ptr
+0x82506df8: lwz  r11, 0(r30)          ; r11 = 0x8200A1E8 (vtable)
+0x82506dfc: lwz  r11, 260(r11)        ; r11 = vtable[65] (a fn)
+0x82506e00: mtctr r11
+0x82506e04: bctrl                     ; OK — returns
+0x82506e08: rlwinm r11, r3, 0, 29, 29 ; bit 2 of r3
+0x82506e10: bne cr6, 0x825070D4       ; if bit set: branch away
+0x82506e18: lwz r3, 44(r30)           ; r3 = [ctx+44]    <-- ZERO
+0x82506e28: lwz r11, 0(r3)            ; r11 = [0]       <-- ZERO
+0x82506e2c: lwz r11, 60(r11)          ; r11 = [60]      <-- ZERO
+0x82506e30: mtctr r11                 ; CTR = 0
+0x82506e34: bctrl                     ; LR := 0x82506e38, PC := 0
+0x82506e38: <fault: PC unmapped>
+```
+
+So vtable[36] called vtable[65] (a real fn that returns OK), then
+dispatched into `[ctx+44]` treated as another object. Our crowbar
+left `[ctx+44]=0`, so the dispatch faulted.
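
The three loads ahead of the faulting `bctrl` can be modelled in a few lines. This is a sketch, not the emulator's memory API: `GuestMem` is a hypothetical sparse map in which unwritten words read back as zero, which matches the `[0]=0` reads reported above but is otherwise an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical sparse guest memory. Unwritten words read as zero
/// (an assumption standing in for the freshly allocated crowbar ctx).
struct GuestMem {
    words: HashMap<u32, u32>,
}

impl GuestMem {
    fn new() -> Self {
        GuestMem { words: HashMap::new() }
    }

    fn write_u32(&mut self, addr: u32, val: u32) {
        self.words.insert(addr, val);
    }

    fn read_u32(&self, addr: u32) -> u32 {
        self.words.get(&addr).copied().unwrap_or(0)
    }
}

/// Mirrors the loads before the second bctrl in sub_82506DE8:
///   lwz r3, 44(r30) ; lwz r11, 0(r3) ; lwz r11, 60(r11) ; mtctr r11
/// Returns the value that ends up in CTR.
fn second_dispatch_target(mem: &GuestMem, ctx: u32) -> u32 {
    let obj = mem.read_u32(ctx.wrapping_add(44)); // second object pointer
    let vtable = mem.read_u32(obj); // that object's vtable pointer
    mem.read_u32(vtable.wrapping_add(60)) // slot 15
}
```

If only `[ctx+0]` is written, `second_dispatch_target` returns 0, which is the PC=0 fault. It returns a nonzero target only once `[ctx+44]` points at an object whose vtable has slot 15 populated.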
+
+## Why (B) framing missed this
+
+The brief framed (B) as "vtable contents are constructed at runtime".
+That's not true — vtable contents are static `.rdata`. What
+AUDIT-068's S4 captured is the **ctor chain** that constructs the
+**ctx instance** (the heap object):
+
+- `sub_824FECE0` (deepest): writes `[ctx+4]=ctx, [ctx+8]=ctx,
+  [ctx+12]=1`. Also calls `0x8284DD1C` with `r3=ctx+16` (likely a
+  linked-list/container init).
+- `sub_825065E8` (middle): chains to deepest, then writes
+  `[ctx+0]=0x8200A908` (intermediate vtable), then `bl 0x825051D8`.
+- `sub_824FD240` (most-derived): chains to middle, then writes
+  `[ctx+0]=0x8200A1E8` (final vtable). Returns.
+
+None of these three ctors writes `[ctx+44]`. So `[ctx+44]` must be
+written by either:
+1. **Allocator initial-state** (zero-fill? guest-side memset?), OR
+2. **A factory function ABOVE the ctor chain** (the caller of
+   `sub_824FD240` that allocates ctx, calls ctor, then assigns fields
+   including `+44`).
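
Replaying just the stores listed for the three ctors makes the gap explicit. This sketch records only the direct field writes named above. It does not model the calls to `0x8284DD1C` (with `r3=ctx+16`) or `0x825051D8`, whose stores are not known from this session. `replay_ctor_chain` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Replays the direct ctx field stores of the ctor chain, deepest first.
/// Returns a map from byte offset within ctx to the last value written.
/// Callee side effects (0x8284DD1C, 0x825051D8) are NOT modelled.
fn replay_ctor_chain(ctx: u32) -> BTreeMap<u32, u32> {
    let stores: [(u32, u32); 5] = [
        (4, ctx),          // sub_824FECE0: [ctx+4] = ctx
        (8, ctx),          // sub_824FECE0: [ctx+8] = ctx
        (12, 1),           // sub_824FECE0: [ctx+12] = 1
        (0, 0x8200_A908),  // sub_825065E8: intermediate vtable
        (0, 0x8200_A1E8),  // sub_824FD240: final vtable (overwrites)
    ];
    let mut fields = BTreeMap::new();
    for &(offset, value) in stores.iter() {
        fields.insert(offset, value);
    }
    fields
}
```

The resulting map has keys 0, 4, 8 and 12 only, so offset 44 is never touched by these direct stores.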
+
+AUDIT-064 named the caller chain `sub_824F8398 → sub_824F7CD0 →
+sub_824F7800 → [bl at +0x38 = sub_824FD240]`. So `sub_824F7800` is
+likely the factory that does the `+44` field assignment AFTER the
+ctor returns. Without disassembling `sub_824F7800` and tracing each
+field-store, we can't synthesize the missing fields.
+
+## Why escalating is the right call now
+
+The brief's tripstone #6 sets a 2-hour timebox. We have already
+established that the framing was wrong and that the gap is wider than
+v2 was scoped to fix. The honest moves are:
+
+1. **Stop and document** the new finding (this doc + memory entry).
+2. **Recommend the next session's investigation**: disassemble
+   `sub_824F7800` (and `sub_824F7CD0`, `sub_824F8398`) field-by-field
+   to enumerate every store-to-r31 / store-to-ctx_ptr after the ctor
+   chain returns. Mirror those stores in a crowbar v3.
+3. Alternatively, and much wider in scope: build a canary read-probe
+   sweep over `[ctx+0..ctx+128]` to capture the live state. This costs
+   ~200 LOC of canary instrumentation and trades complexity for ground truth.
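
Once a probe sweep exists, comparing its output against ours could look like the sketch below. It assumes the window is captured as 32 u32 words and that a field counts as missing when the reference run holds a nonzero value and ours holds zero. Raw equality would not work, because heap pointers differ between runs (arena drift). `missing_fields` and `FIELDS` are hypothetical names.

```rust
/// Number of u32 fields in the probed window (assumed: 128 bytes / 4).
const FIELDS: usize = 32;

/// Lists (byte offset, reference value) for every field that is nonzero
/// in the reference snapshot but zero in ours. Pointer values are not
/// compared directly because they drift between cold runs.
fn missing_fields(reference: &[u32; FIELDS], ours: &[u32; FIELDS]) -> Vec<(u32, u32)> {
    reference
        .iter()
        .zip(ours.iter())
        .enumerate()
        .filter(|(_, (r, o))| **r != 0 && **o == 0)
        .map(|(i, (r, _))| ((i as u32) * 4, *r))
        .collect()
}
```

Each returned offset is a candidate store for a crowbar v3 to mirror. The values still need translating, since a reference-run heap pointer is not valid in ours.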
+
+## Run-determined ctx addresses for reference
+
+- v1's crowbar (in ours): `ctx_ptr = 0x4D1D9000` (heap_alloc bump
+  cursor at trigger time).
+- Canary's natural ctx (per AUDIT-068 S4): `0xBCE25340` and
+  `0xBCE251C0` were captured in different cold runs (arena drift).
+  The probe at `0xBCE251C0..+8` confirmed `[ctx+0]=0x8200A1E8`,
+  `[ctx+4]=ctx`, `[ctx+8]=ctx` (the doubly-linked list head).
+
+## LOC delta this session
+
+- `crates/xenia-kernel/src/exports.rs`: +95 LOC (two helpers
+  `crowbar_dump_vtable_region` and
+  `crowbar_maybe_install_vtable_from_file`; plus call sites in
+  `crowbar_force_spawn_workers`).
+- `crates/xenia-app/src/main.rs`: +9 LOC (enriched FAULT log with
+  tid/lr/ctr/r3/r4/r29/r30/r31).
+- Total: ~104 LOC additive over v1. Within budget.
+
+## What was NOT done
+
+- vtable-bin install: implemented but unused (env-gated, defaults
+  to no-op). Kept in tree in case a future session captures canary's
+  vtable bytes for cross-validation, although that is now known to be
+  unnecessary because ours's vtable is correct.
+- 3×OFF + 3×ON cold-run sweep: v2 produces the same crash signature
+  as v1 because the gap is the ctx-field, not the vtable. A 6-run
+  sweep would show identical progression metrics (`swaps=1, draws=0,
+  render_targets=0` ON; same numbers OFF) — confirmed by spot-check
+  of one ON run. Skipping the full sweep to honour the timebox.
+- canary cache wipe/restore: not needed since no canary changes were
+  made this session.
+
+## Files
+
+- `step0-diag-stderr.log`: first run, vtable dump only (256 bytes).
+- `step0b-diag.log`: second run, 512-byte vtable dump.
+- `step0c-diag.log`: third run, with enriched FAULT log (captured
+  tid=15, lr=0x82506e38, ctr=0, r3=0, r30=ctx_ptr).