handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions
--- a/audit-runs/review-a-step1c-crowbar-v3/investigation.md
+++ b/audit-runs/review-a-step1c-crowbar-v3/investigation.md
@@ -0,0 +1,199 @@
+# Crowbar v3 — ctx-state install verbatim
+
+**Date**: 2026-05-21
+**Predecessor**: v2 at `audit-runs/review-a-step1b-crowbar-v2/`.
+**Status**: LANDED. Hypothesis FALSIFIED: wedge is NOT crowbar-soluble at
+the ctx-state-only level. Case (D) needed (recursive secondary-object
+install). v3 produces same composite progression score as OFF baseline.
+
+## TL;DR
+
+- v2 found case (C): `[ctx+44]` is a secondary-object pointer.
+  vtable[36] reads it and dispatches through it.
+- v3 captured canary's **actual `[ctx+44]` value** = `0xBCE25640` (via
+  the `audit_68_host_mem_read_probe` cvar) along with the rest of the
+  64-byte ctx head, then installed that state verbatim in ours.
+- Worker tid=15 now passes the `[ctx+44]` load (loads `0xBCE25640`
+  into r3) but **`0xBCE25640` is unmapped in ours's address space**
+  (ours's allocator returns 0x4D1Dxxxx VAs; canary's xenon-arena VAs
+  in the `0xBCExxxxx` range have no equivalent in ours).
+- Reading `[0xBCE25640]` returns 0 → `CTR=0` → `bctrl` faults at PC=0
+  with `r3=0xbce25640` (was `r3=0x0` in v2 — confirming the install
+  worked, just deeper recursion needed).
+- 3x OFF / 3x ON runs deterministic: `swaps=1, draws=0,
+  unique_render_targets=0` identical. **Composite progression Δ = 0.**
+
+## Captured canary ctx state
+
+Canary cold run (90s, `--mute=true`), with cvars:
+
+```
+--audit_61_branch_probe_pcs=0x825070F0
+--audit_68_host_mem_read_probe=0xBCE251C0:8:1000000,0xBCE251C8:8:1000000,
+                               0xBCE251D0:8:1000000,0xBCE251D8:8:1000000,
+                               0xBCE251E0:8:1000000,0xBCE251E8:8:1000000,
+                               0xBCE251F0:8:1000000,0xBCE251F8:8:1000000
+```
+
+AUDIT-061-BR confirmed ctx_ptr=`0xBCE251C0` (per AUDIT-068 S3 expectation;
+no arena drift in this run). Read probe captured the install timeline:
+
+| host time | event |
+|--------:|-------|
+| 9.556 s | Install starts: `[ctx+0]=0x8200A1E8` (vtable), `[ctx+4]=ctx`, `[ctx+8]=ctx`, `[ctx+12]=1` (refcount), `[ctx+16]=0x01000000`, `[ctx+32]=0xFFFFFFFF` |
+| 9.571 s | `[ctx+44]=0xBCE25640` written, `[ctx+48]=0xBE568F00` written (looks float-ish) |
+| 9.754 s | Transient `[ctx+32]=1` and `[ctx+40]=0x30057018` writes that are cleared next probe tick — likely temporary scratch during a function call |
+| 9.755 s | Stable post-install state |
+
+Final ctx bytes (saved at `ctx-canary.bin`):
+
+```
+  +  0: 82 00 A1 E8 BC E2 51 C0 BC E2 51 C0 00 00 00 01   <- vptr / self / self / refcount
+  + 16: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+  + 32: FF FF FF FF 00 00 00 00 00 00 00 00 BC E2 56 40   <- ...sentinel... / [ctx+44]=0xBCE25640
+  + 48: BE 56 8F 00 00 00 00 00 00 00 00 00 00 00 00 00   <- [ctx+48]=0xBE568F00 (-0.21f?)
+```
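A quick way to sanity-check the annotations on the dump above is to decode the words programmatically. The sketch below is illustrative only: `be_word`, `word_as_f32`, and `ctx_head_from_dump` are hypothetical helper names, not functions from the repo. It assumes the image is read as big-endian 32-bit words, which matches how the dump is annotated.

```rust
/// Read the big-endian 32-bit word at `offset` in a captured ctx image.
/// Returns `None` if the word would run past the end of the image.
fn be_word(image: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?;
    let bytes = image.get(offset..end)?;
    Some(u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Reinterpret a raw word as an IEEE-754 single-precision float.
fn word_as_f32(word: u32) -> f32 {
    f32::from_bits(word)
}

/// Rebuild the 64-byte ctx head from the non-zero bytes in the hex dump.
/// (Hypothetical stand-in for reading `ctx-canary.bin`.)
fn ctx_head_from_dump() -> [u8; 64] {
    let mut image = [0u8; 64];
    image[0..4].copy_from_slice(&[0x82, 0x00, 0xA1, 0xE8]); // vptr
    image[4..8].copy_from_slice(&[0xBC, 0xE2, 0x51, 0xC0]); // self
    image[8..12].copy_from_slice(&[0xBC, 0xE2, 0x51, 0xC0]); // self
    image[12..16].copy_from_slice(&[0x00, 0x00, 0x00, 0x01]); // refcount
    image[16..20].copy_from_slice(&[0x01, 0x00, 0x00, 0x00]);
    image[32..36].copy_from_slice(&[0xFF, 0xFF, 0xFF, 0xFF]); // sentinel
    image[44..48].copy_from_slice(&[0xBC, 0xE2, 0x56, 0x40]); // [ctx+44]
    image[48..52].copy_from_slice(&[0xBE, 0x56, 0x8F, 0x00]); // [ctx+48]
    image
}
```

Decoding `[ctx+48]` this way gives roughly -0.2095, consistent with the "-0.21f?" guess in the dump. That only shows the bit pattern is a plausible float, not that the field actually holds one.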
+
+## Install path in ours
+
+v3 adds `crowbar_maybe_install_ctx_from_file()` (~63 LOC) that reads
+the binary at `$XENIA_CROWBAR_CTX_BIN` and writes the bytes via
+`mem.write_u8(ctx_ptr + i, byte)` — same pattern as v2's
+`crowbar_maybe_install_vtable_from_file()`. Plus ~12 LOC of comments
+and the call-site addition. ~75 LOC additive over v2.
+
+The 64-byte ctx file overwrites the v2 init at `+0/+4/+8/+12` with
+identical values (verified — they match), and fills `+16..+63` with
+the captured state.
+
+Post-install log confirms exact write:
+```
+CROWBAR: installed 64 bytes at ctx_ptr=0x4d1d9000
+CROWBAR: post-ctx-install ctx[+  0] (=0x4d1d9000) = 0x8200a1e8
+CROWBAR: post-ctx-install ctx[+ 32] (=0x4d1d9020) = 0xffffffff
+CROWBAR: post-ctx-install ctx[+ 44] (=0x4d1d902c) = 0xbce25640    <-- secondary obj ptr installed
+CROWBAR: post-ctx-install ctx[+ 48] (=0x4d1d9030) = 0xbe568f00
+```
+
+## The fault (v3)
+
+Identical fault PC, different r3 — that's the smoking gun:
+
+| | v1 (no ctx install) | v2 (init +0..+12 only) | v3 (full 64 bytes) |
+|-|-|-|-|
+| FAULT PC | 0 | 0 | 0 |
+| LR | 0x82506e38 | 0x82506e38 | 0x82506e38 |
+| CTR | 0 | 0 | 0 |
+| **r3** | (any) | **0x0** | **0xbce25640** |
+| r30 (ctx_ptr) | 0x4D1D9000 | 0x4D1D9000 | 0x4D1D9000 |
+| tid | 15 | 15 | 15 |
+
+The `lwz r11, 0(r3)` at PC `0x82506e28` (per v2's disasm) loads from
+`r3 = [ctx+44]`. In v2, `r3=0`, so reads `[0]=0`. In v3, `r3=0xBCE25640`,
+so reads `[0xBCE25640]`. Both reads return 0 because:
+
+- v2: address 0 reads as 0, whether page 0 is unmapped or mapped with a zero word.
+- v3: the page containing `0xBCE25640` is **definitely** unmapped in ours.
+
+Ours's heap is at `0..0x6FFFFFFF` (per `KernelState::heap_alloc`). The
+xenon physical-region VAs (`0xBC000000..0xC0000000`) never appear in
+ours's allocator namespace — `MmAllocatePhysicalMemoryEx` just calls
+`heap_alloc()` which returns low VAs.
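The range argument above can be written as a small classifier. This is a sketch under stated assumptions: the two ranges are the ones quoted in this note, while `VaClass` and `classify_va` are hypothetical names. Falling inside ours's heap range says nothing about whether a given page is actually mapped; the point is only that `0xBCxxxxxx` values can never come out of ours's allocator.

```rust
/// Upper bound of ours's heap range as quoted in the note (inclusive).
const OURS_HEAP_LAST: u32 = 0x6FFF_FFFF;
/// Xenon physical-region VA window as quoted in the note (end exclusive).
const PHYS_REGION_START: u32 = 0xBC00_0000;
const PHYS_REGION_END: u32 = 0xC000_0000;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VaClass {
    /// Inside the range ours's allocator can hand out.
    OursHeapRange,
    /// Inside the physical-region window that only canary populates.
    PhysicalRegion,
    /// Anything else (for example, image addresses such as the vtable).
    Other,
}

/// Classify a guest VA by the ranges above. Range membership only;
/// this does not consult any page table.
fn classify_va(va: u32) -> VaClass {
    if va <= OURS_HEAP_LAST {
        VaClass::OursHeapRange
    } else if (PHYS_REGION_START..PHYS_REGION_END).contains(&va) {
        VaClass::PhysicalRegion
    } else {
        VaClass::Other
    }
}
```

With these ranges, `0x4D1D9000` (ours's ctx_ptr) lands in the heap range and `0xBCE25640` lands in the physical-region window, which is the mismatch the fault exposes.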
+
+## Why this falsifies the v3 hypothesis
+
+The brief's hypothesis: "with the full ctx state pre-installed AND the
+4 workers spawned, ours produces `swaps≥2` or `draws≥1`."
+
+Outcome: ctx state IS installed, 4 workers ARE spawned and resumed,
+but the dispatch on the secondary object fails because the secondary
+object's VA isn't mappable.
+
+This is exactly **case (γ) → fault at new structural location** that
+the brief predicted. The new fault PC isn't actually new (still 0),
+but the new fault PRIMARY CAUSE is different: in v2 the cause was
+"ctx+44 not initialized"; in v3 it's "ctx+44 points to an unmapped VA."
+
+## Composite progression score
+
+Per brief's option 6 metric (excluding the matched_prefix term, which
+needs canary cross-comparison not available in `check` digests):
+
+```
+score = 1*swaps + 10*draws + 100*unique_render_targets
+```
+
+| Run | swaps | draws | unique_RT | score | instructions |
+|-|-:|-:|-:|-:|-:|
+| OFF-1 | 1 | 0 | 0 | **1** | 25,000,000 |
+| OFF-2 | 1 | 0 | 0 | **1** | 25,000,000 |
+| OFF-3 | 1 | 0 | 0 | **1** | 25,000,000 |
+| ON-1  | 1 | 0 | 0 | **1** | 20,000,167 (faulted) |
+| ON-2  | 1 | 0 | 0 | **1** | 20,000,167 (faulted) |
+| ON-3  | 1 | 0 | 0 | **1** | 20,000,167 (faulted) |
+
+**Δ = 0**. The instruction count dropped from 25,000,000 to 20,000,167 in
+ON runs because the fault halts the run at `instr=20000167`, 167
+instructions after the crowbar trigger (threshold=20M). This confirms the
+workers can't even complete one meaningful iteration before faulting.
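For reference, the score and Δ in the table can be reproduced with a few lines. This is a minimal sketch: `RunDigest` is a hypothetical struct holding only the three fields the formula uses, and taking Δ as the difference of mean scores is an assumption, since the note does not say how the three runs per arm are aggregated.

```rust
/// The three digest fields used by the composite score.
/// (Hypothetical struct; the real digest has more fields.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RunDigest {
    swaps: u64,
    draws: u64,
    unique_render_targets: u64,
}

/// score = 1*swaps + 10*draws + 100*unique_render_targets
/// (matched_prefix term excluded, as in the note).
fn composite_score(run: &RunDigest) -> u64 {
    run.swaps + 10 * run.draws + 100 * run.unique_render_targets
}

/// Mean score over a set of runs; `None` for an empty set.
fn mean_score(runs: &[RunDigest]) -> Option<f64> {
    if runs.is_empty() {
        return None;
    }
    let total: u64 = runs.iter().map(composite_score).sum();
    Some(total as f64 / runs.len() as f64)
}

/// Assumed aggregation: mean(ON) - mean(OFF).
fn progression_delta(off: &[RunDigest], on: &[RunDigest]) -> Option<f64> {
    Some(mean_score(on)? - mean_score(off)?)
}
```

Feeding in three OFF and three ON digests that all read `swaps=1, draws=0, unique_render_targets=0` gives a score of 1 per run and a delta of 0, matching the table.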
+
+## LOC delta
+
+- `crates/xenia-kernel/src/exports.rs`: +63 LOC (helper)
+  + 13 LOC (call-site comments + wire-up) = +76 LOC over v2.
+- `audit-runs/review-a-step1c-crowbar-v3/`: artifacts (ctx-canary.bin,
+  canary-probe-run1.log, off-{1,2,3}.json, on-{1,2,3}.json, this doc,
+  summary.md, re-validation.md, fix.diff).
+- No tests added: the helper is structurally identical to v2's
+  `crowbar_maybe_install_vtable_from_file`, which has no test (it's a
+  diagnostic, opt-in via env var).
+- canary instrumentation: **0 LOC** (reused existing
+  `audit_68_host_mem_read_probe` cvar).
+
+## What this confirms
+
+1. v2's case (C) framing is structurally correct: `[ctx+44]` IS a
+   secondary-object pointer that vtable[36] dispatches through.
+2. Cross-engine pointer-VA mismatch is real and non-trivial:
+   ours's allocator namespace doesn't include `0xBCxxxxxx` VAs.
+3. The wedge is **≥4-deep** (vtable + ctx primary + ctx secondary
+   pointer + secondary object's own vtable + fn-pointer slot). Crowbar
+   approach saturates without much deeper state capture.
+
+## What this does NOT confirm
+
+- That the actual canary VA `0xBCE25640` is the ONLY secondary object.
+  There may be more pointers in deeper ctx slots (we only captured 64
+  bytes; the full struct may be larger).
+- That installing the secondary object would suffice. The secondary
+  object likely has its own pointer fields (possibly the head node of a
+  linked list; it looks like a queue/work-list given the doubly-linked-list
+  pattern at `+4/+8` of the captured ctx head).
+
+## Recommendation
+
+**Stop the crowbar approach.** The wedge is structurally too deep
+for state synthesis to be cheaper than fixing the natural-activation
+gap. Per Q5 of the boot-state review (methodology-assessment.md): the
+matched-prefix metric is on the wrong thread, and the wedge is
+**inherently a thread-activation problem**, not a state-construction
+problem.
+
+Pivot recommendations (in order of cost):
+
+1. **AUDIT-069 follow-up** — the 25 vs 1 "other producers" gap from
+   Session 5 is more actionable than the worker-spawn gap. The XAudio
+   thread resume at canary 1.726 s is a candidate trigger that
+   produces 8-24 helpers ahead of the wedge.
+2. **Recursive ctx-state capture** (option β from brief) — write a
+   probe-graph tool that captures canary's pointer-reachable closure
+   from ctx_ptr (BFS via `audit_68_host_mem_read_probe`, follow each
+   pointer field that's in the BC arena, capture another 64 bytes,
+   repeat). Estimate: 200-400 LOC tooling + needs ours-side memory
+   allocator extension to map BC-arena VAs. High complexity vs gain.
+3. **Pointer-translation table** (option α) — map canary BC-VAs to
+   ours allocator-VAs on install. Needs canary-vs-ours linked allocator
+   walk; ~300 LOC.
+
+The natural-activation path (Step 2 of the boot-state roadmap) is
+likely cheaper than any of these crowbar extensions.
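To make the cost of options β and α more concrete, here is a rough sketch of the core loop such tooling would need. Everything in it is hypothetical: `capture_closure` and `assign_ours_vas` are made-up names, `read_block` stands in for whatever would actually drive `audit_68_host_mem_read_probe`, and the 64-byte block size and BC-arena bounds are taken from this note. It is not an implementation of either option, only the BFS and a naive VA assignment.

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};

const ARENA_START: u32 = 0xBC00_0000;
const ARENA_END: u32 = 0xC000_0000; // exclusive
const BLOCK_LEN: usize = 64;

fn in_arena(va: u32) -> bool {
    (ARENA_START..ARENA_END).contains(&va)
}

/// Option β sketch: breadth-first capture of 64-byte blocks reachable
/// from `root`. `read_block` returns `None` for unreadable addresses.
fn capture_closure<F>(
    root: u32,
    max_blocks: usize,
    mut read_block: F,
) -> BTreeMap<u32, [u8; BLOCK_LEN]>
where
    F: FnMut(u32) -> Option<[u8; BLOCK_LEN]>,
{
    let mut captured = BTreeMap::new();
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(root);
    queue.push_back(root);
    while let Some(va) = queue.pop_front() {
        if captured.len() >= max_blocks {
            break;
        }
        let Some(block) = read_block(va) else { continue };
        // Conservative scan: any aligned big-endian word that falls in the
        // arena is treated as a pointer, so float-like values are chased too.
        for word in block.chunks_exact(4) {
            let ptr = u32::from_be_bytes([word[0], word[1], word[2], word[3]]);
            if in_arena(ptr) && seen.insert(ptr) {
                queue.push_back(ptr);
            }
        }
        captured.insert(va, block);
    }
    captured
}

/// Option α sketch: give each captured block a slot in ours's VA space,
/// packed from `ours_base` in ascending canary-VA order.
fn assign_ours_vas(
    captured: &BTreeMap<u32, [u8; BLOCK_LEN]>,
    ours_base: u32,
) -> BTreeMap<u32, u32> {
    captured
        .keys()
        .enumerate()
        .map(|(slot, &canary_va)| {
            (canary_va, ours_base.wrapping_add((slot * BLOCK_LEN) as u32))
        })
        .collect()
}
```

Two caveats show up even in this toy version. The scan cannot tell pointers from data, so `[ctx+48]=0xBE568F00` would be queued as a candidate pointer because it falls inside the arena range. And the translation table only assigns addresses; rewriting the pointer fields inside each installed block is a separate pass that the sketch omits.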