xenia-rs/audit-runs/review-a-step1c-crowbar-v3/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Crowbar v3 — ctx-state install verbatim
**Date**: 2026-05-21
**Predecessor**: v2 at `audit-runs/review-a-step1b-crowbar-v2/`.
**Status**: LANDED. Hypothesis FALSIFIED: wedge is NOT crowbar-soluble at
the ctx-state-only level. Case (D) needed (recursive secondary-object
install). v3 produces same composite progression score as OFF baseline.
## TL;DR
- v2 found case (C): `[ctx+44]` is a secondary-object pointer.
vtable[36] reads it and dispatches through it.
- v3 captured canary's **actual `[ctx+44]` value** = `0xBCE25640` (via
the `audit_68_host_mem_read_probe` cvar) along with the rest of the
64-byte ctx head, then installed that state verbatim in ours.
- Worker tid=15 now passes the `[ctx+44]` load (loads `0xBCE25640`
into r3) but **`0xBCE25640` is unmapped in ours's address space**
(ours's allocator returns 0x4D1Dxxxx VAs; canary's xenon-arena VAs
in the `0xBCExxxxx` range have no equivalent in ours).
- Reading `[0xBCE25640]` returns 0 → `CTR=0` → `bctrl` faults at PC=0
with `r3=0xbce25640` (was `r3=0x0` in v2, confirming that the install
worked and that a deeper, recursive install is needed).
- 3x OFF / 3x ON runs deterministic: `swaps=1, draws=0,
unique_render_targets=0` identical. **Composite progression Δ = 0.**
## Captured canary ctx state
Canary cold run (90s, `--mute=true`), with cvars:
```
--audit_61_branch_probe_pcs=0x825070F0
--audit_68_host_mem_read_probe=0xBCE251C0:8:1000000,0xBCE251C8:8:1000000,
0xBCE251D0:8:1000000,0xBCE251D8:8:1000000,
0xBCE251E0:8:1000000,0xBCE251E8:8:1000000,
0xBCE251F0:8:1000000,0xBCE251F8:8:1000000
```
AUDIT-061-BR confirmed ctx_ptr=`0xBCE251C0` (per AUDIT-068 S3 expectation;
no arena drift in this run). Read probe captured the install timeline:
| host time | event |
|--------:|-------|
| 9.556 s | Install starts: `[ctx+0]=0x8200A1E8` (vtable), `[ctx+4]=ctx`, `[ctx+8]=ctx`, `[ctx+12]=1` (refcount), `[ctx+16]=0x01000000`, `[ctx+32]=0xFFFFFFFF` |
| 9.571 s | `[ctx+44]=0xBCE25640` written, `[ctx+48]=0xBE568F00` written (the latter looks float-ish) |
| 9.754 s | Transient `[ctx+32]=1` and `[ctx+40]=0x30057018` writes that are cleared by the next probe tick; likely temporary scratch during a function call |
| 9.755 s | Stable post-install state |
Final ctx bytes (saved at `ctx-canary.bin`):
```
+ 0: 82 00 A1 E8 BC E2 51 C0 BC E2 51 C0 00 00 00 01 <- vptr / self / self / refcount
+ 16: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 32: FF FF FF FF 00 00 00 00 00 00 00 00 BC E2 56 40 <- ...sentinel... / [ctx+44]=0xBCE25640
+ 48: BE 56 8F 00 00 00 00 00 00 00 00 00 00 00 00 00 <- [ctx+48]=0xBE568F00 (-0.21f?)
```
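The annotations on the dump can be checked mechanically. A minimal sketch, assuming only that the dump is big-endian guest memory; `CTX_CANARY`, `be_word` and `as_f32` are hypothetical names for illustration, not part of the crowbar code:

```rust
/// The 64-byte ctx head, transcribed from the dump above.
const CTX_CANARY: [u8; 64] = [
    0x82, 0x00, 0xA1, 0xE8, 0xBC, 0xE2, 0x51, 0xC0, 0xBC, 0xE2, 0x51, 0xC0, 0x00, 0x00, 0x00, 0x01,
    0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xBC, 0xE2, 0x56, 0x40,
    0xBE, 0x56, 0x8F, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
];

/// Read the big-endian 32-bit word at byte offset `off`.
fn be_word(buf: &[u8], off: usize) -> u32 {
    u32::from_be_bytes([buf[off], buf[off + 1], buf[off + 2], buf[off + 3]])
}

/// Reinterpret a raw 32-bit word as an IEEE-754 single-precision float.
fn as_f32(word: u32) -> f32 {
    f32::from_bits(word)
}
```

With these helpers, `be_word(&CTX_CANARY, 44)` yields `0xBCE25640`, and `as_f32(be_word(&CTX_CANARY, 48))` comes out at roughly -0.2095, consistent with the "-0.21f?" guess. That only shows the word is a plausible float, not that the field is one.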
## Install path in ours
v3 adds `crowbar_maybe_install_ctx_from_file()` (~63 LOC) that reads
the binary at `$XENIA_CROWBAR_CTX_BIN` and writes the bytes via
`mem.write_u8(ctx_ptr + i, byte)` — same pattern as v2's
`crowbar_maybe_install_vtable_from_file()`. Comments and the call-site
addition add ~13 LOC, for ~76 LOC additive over v2.
The 64-byte ctx file overwrites the v2 init at `+0/+4/+8/+12` with
identical values (verified — they match), and fills `+16..+63` with
the captured state.
Post-install log confirms exact write:
```
CROWBAR: installed 64 bytes at ctx_ptr=0x4d1d9000
CROWBAR: post-ctx-install ctx[+ 0] (=0x4d1d9000) = 0x8200a1e8
CROWBAR: post-ctx-install ctx[+ 32] (=0x4d1d9020) = 0xffffffff
CROWBAR: post-ctx-install ctx[+ 44] (=0x4d1d902c) = 0xbce25640 <-- secondary obj ptr installed
CROWBAR: post-ctx-install ctx[+ 48] (=0x4d1d9030) = 0xbe568f00
```
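The byte-wise install and read-back can be sketched as follows. This is an illustration, not the actual `crowbar_maybe_install_ctx_from_file()` body: the `GuestMem` trait, the `FlatMem` toy backing store and `install_ctx` are hypothetical stand-ins, and only the `write_u8(ctx_ptr + i, byte)` pattern is taken from the text.

```rust
/// Hypothetical stand-in for the emulator's guest-memory interface.
trait GuestMem {
    fn write_u8(&mut self, addr: u32, value: u8);
    fn read_u8(&self, addr: u32) -> u8;
}

/// Toy backing store: a flat buffer whose first byte sits at guest address `base`.
struct FlatMem {
    base: u32,
    bytes: Vec<u8>,
}

impl GuestMem for FlatMem {
    fn write_u8(&mut self, addr: u32, value: u8) {
        self.bytes[(addr - self.base) as usize] = value;
    }
    fn read_u8(&self, addr: u32) -> u8 {
        self.bytes[(addr - self.base) as usize]
    }
}

/// Copy `image` byte by byte to `ctx_ptr`, then read the region back as
/// big-endian 32-bit words so the caller can log or verify the install.
fn install_ctx<M: GuestMem>(mem: &mut M, ctx_ptr: u32, image: &[u8]) -> Vec<u32> {
    for (i, byte) in image.iter().enumerate() {
        mem.write_u8(ctx_ptr + i as u32, *byte);
    }
    (0..image.len() / 4)
        .map(|w| {
            let a = ctx_ptr + (w * 4) as u32;
            u32::from_be_bytes([
                mem.read_u8(a),
                mem.read_u8(a + 1),
                mem.read_u8(a + 2),
                mem.read_u8(a + 3),
            ])
        })
        .collect()
}
```

In this sketch the image would come from something like `std::fs::read` on the path in `$XENIA_CROWBAR_CTX_BIN`. Word index 11 of the returned vector corresponds to `ctx[+44]` in the log above.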
## The fault (v3)
Identical fault PC, different r3 — that's the smoking gun:
| | v1 (no ctx install) | v2 (init +0..+12 only) | v3 (full 64 bytes) |
|-|-|-|-|
| FAULT PC | 0 | 0 | 0 |
| LR | 0x82506e38 | 0x82506e38 | 0x82506e38 |
| CTR | 0 | 0 | 0 |
| **r3** | (any) | **0x0** | **0xbce25640** |
| r30 (ctx_ptr) | 0x4D1D9000 | 0x4D1D9000 | 0x4D1D9000 |
| tid | 15 | 15 | 15 |
The `lwz r11, 0(r3)` at PC `0x82506e28` (per v2's disasm) loads from
`r3 = [ctx+44]`. In v2, `r3=0`, so it reads `[0]`. In v3, `r3=0xBCE25640`,
so it reads `[0xBCE25640]`. Both reads return 0 because:
- v2: address 0 is either unmapped or mapped and holding 0; the read yields 0 either way.
- v3: the page containing `0xBCE25640` is **definitely** unmapped in ours.
Ours's heap is at `0..0x6FFFFFFF` (per `KernelState::heap_alloc`). The
xenon physical-region VAs (`0xBC000000..0xC0000000`) never appear in
ours's allocator namespace — `MmAllocatePhysicalMemoryEx` just calls
`heap_alloc()` which returns low VAs.
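The address-range argument can be written as a small classifier. A sketch using only the two ranges quoted above; `VaClass` and `classify_va` are hypothetical names, and the exclusive upper bound of the physical region is an assumption:

```rust
/// Last address of ours's heap range, per the text (`0..0x6FFFFFFF`).
const OURS_HEAP_LAST: u32 = 0x6FFF_FFFF;
/// Canary's xenon physical-region VAs, per the text
/// (`0xBC000000..0xC0000000`; the end is assumed to be exclusive).
const CANARY_PHYS_START: u32 = 0xBC00_0000;
const CANARY_PHYS_END: u32 = 0xC000_0000;

#[derive(Debug, PartialEq)]
enum VaClass {
    /// Inside the range ours's allocator hands out (not necessarily mapped).
    OursHeap,
    /// Inside canary's physical region; ours never allocates here.
    CanaryPhysical,
    /// Anything else, e.g. image addresses such as the vtable at 0x8200A1E8.
    Other,
}

fn classify_va(va: u32) -> VaClass {
    if va <= OURS_HEAP_LAST {
        VaClass::OursHeap
    } else if (CANARY_PHYS_START..CANARY_PHYS_END).contains(&va) {
        VaClass::CanaryPhysical
    } else {
        VaClass::Other
    }
}
```

`classify_va(0x4D1D9000)` is `OursHeap` and `classify_va(0xBCE25640)` is `CanaryPhysical`, which is why the installed `[ctx+44]` value cannot resolve to anything in ours.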
## Why this falsifies the v3 hypothesis
The brief's hypothesis: "with the full ctx state pre-installed AND the
4 workers spawned, ours produces `swaps≥2` or `draws≥1`."
Outcome: ctx state IS installed, 4 workers ARE spawned and resumed,
but the dispatch on the secondary object fails because the secondary
object's VA isn't mappable.
This is exactly **case (γ) → fault at new structural location** that
the brief predicted. The fault PC itself is unchanged (still 0), but
the primary cause of the fault is different: in v2 the cause was
"ctx+44 not initialized"; in v3 it's "ctx+44 points to an unmapped VA."
## Composite progression score
Per brief's option 6 metric (excluding the matched_prefix term, which
needs canary cross-comparison not available in `check` digests):
```
score = 1*swaps + 10*draws + 100*unique_render_targets
```
| Run | swaps | draws | unique_RT | score | instructions |
|-|-:|-:|-:|-:|-:|
| OFF-1 | 1 | 0 | 0 | **1** | 25,000,000 |
| OFF-2 | 1 | 0 | 0 | **1** | 25,000,000 |
| OFF-3 | 1 | 0 | 0 | **1** | 25,000,000 |
| ON-1 | 1 | 0 | 0 | **1** | 20,000,167 (faulted) |
| ON-2 | 1 | 0 | 0 | **1** | 20,000,167 (faulted) |
| ON-3 | 1 | 0 | 0 | **1** | 20,000,167 (faulted) |
**Δ = 0**. The instruction count dropped from 25,000,000 to 20,000,167
in ON runs because the fault halts the run at `instr=20000167`, 167
instructions after the crowbar trigger (threshold=20M). This confirms
that the workers cannot complete even one meaningful iteration before
faulting.
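The score and Δ are simple enough to recompute from the digests. A minimal sketch, assuming Δ is the difference of mean scores between ON and OFF runs (the brief's exact definition is not reproduced here); `RunDigest` and the function names are hypothetical:

```rust
/// The three counters the score uses, as reported per run in the table.
#[derive(Clone, Copy)]
struct RunDigest {
    swaps: u64,
    draws: u64,
    unique_render_targets: u64,
}

/// score = 1*swaps + 10*draws + 100*unique_render_targets
fn composite_score(d: RunDigest) -> u64 {
    d.swaps + 10 * d.draws + 100 * d.unique_render_targets
}

/// Mean score over a non-empty set of runs.
fn mean_score(runs: &[RunDigest]) -> f64 {
    let total: u64 = runs.iter().map(|r| composite_score(*r)).sum();
    total as f64 / runs.len() as f64
}

/// Assumed definition of the progression delta: mean(ON) - mean(OFF).
fn progression_delta(off: &[RunDigest], on: &[RunDigest]) -> f64 {
    mean_score(on) - mean_score(off)
}
```

All six runs have `swaps=1, draws=0, unique_render_targets=0`, so every score is 1 and the delta is 0 under any reasonable aggregation.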
## LOC delta
- `crates/xenia-kernel/src/exports.rs`: +63 LOC (helper)
+ 13 LOC (call-site comments + wire-up) = +76 LOC over v2.
- `audit-runs/review-a-step1c-crowbar-v3/`: artifacts (ctx-canary.bin,
canary-probe-run1.log, off-{1,2,3}.json, on-{1,2,3}.json, this doc,
summary.md, re-validation.md, fix.diff).
- No tests added: the helper is structurally identical to v2's
`crowbar_maybe_install_vtable_from_file`, which has no test (it's a
diagnostic, opt-in via env var).
- canary instrumentation: **0 LOC** (reused existing
`audit_68_host_mem_read_probe` cvar).
## What this confirms
1. v2's case (C) framing is structurally correct: `[ctx+44]` IS a
secondary-object pointer that vtable[36] dispatches through.
2. Cross-engine pointer-VA mismatch is real and non-trivial:
ours's allocator namespace doesn't include `0xBCxxxxxx` VAs.
3. The wedge is **≥4-deep** (vtable + ctx primary + ctx secondary
pointer + secondary object's own vtable + fn-pointer slot). The crowbar
approach cannot get further without much deeper state capture.
## What this does NOT confirm
- That the object at canary VA `0xBCE25640` is the ONLY secondary object.
There may be more pointers in deeper ctx slots (we only captured 64
bytes; the full struct may be larger).
- That installing the secondary object would suffice. The secondary
object likely has its own pointer fields (probably the head node of a
linked list; the doubly-linked-list pattern at +4/+8 suggests a
queue/work-list).
## Recommendation
**Stop the crowbar approach.** The wedge is structurally too deep
for state synthesis to be cheaper than fixing the natural-activation
gap. Per Q5 of the boot-state review (methodology-assessment.md): the
matched-prefix metric is on the wrong thread, and the wedge is
**inherently a thread-activation problem**, not a state-construction
problem.
Pivot recommendations (in order of cost):
1. **AUDIT-069 follow-up** — the 25 vs 1 "other producers" gap from
Session 5 is more actionable than the worker-spawn gap. The XAudio
thread resume at canary 1.726 s is a candidate trigger that
produces 8-24 helpers ahead of the wedge.
2. **Recursive ctx-state capture** (option β from brief) — write a
probe-graph tool that captures canary's pointer-reachable closure
from ctx_ptr (BFS via `audit_68_host_mem_read_probe`, follow each
pointer field that's in the BC arena, capture another 64 bytes,
repeat). Estimate: 200-400 LOC tooling + needs ours-side memory
allocator extension to map BC-arena VAs. High complexity vs gain.
3. **Pointer-translation table** (option α) — map canary BC-VAs to
ours allocator-VAs on install. Needs canary-vs-ours linked allocator
walk; ~300 LOC.
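The BFS in option β can be sketched as follows. This is a rough illustration under stated assumptions: `probe` is a hypothetical callback standing in for one `audit_68_host_mem_read_probe` capture of 64 bytes, the arena bounds reuse the physical-region range quoted earlier, and every aligned word that lands in the arena is treated as a pointer.

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};

/// Assumed bounds of the BC arena (the physical-region range quoted earlier).
const ARENA_START: u32 = 0xBC00_0000;
const ARENA_END: u32 = 0xC000_0000;

/// Breadth-first capture of the pointer-reachable closure from `root`.
/// `probe(addr)` returns the 64 bytes at `addr`, or `None` if unreadable.
/// At most `max_chunks` chunks are captured.
fn capture_closure<F>(root: u32, max_chunks: usize, mut probe: F) -> BTreeMap<u32, [u8; 64]>
where
    F: FnMut(u32) -> Option<[u8; 64]>,
{
    let mut captured = BTreeMap::new();
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(root);
    queue.push_back(root);

    while let Some(addr) = queue.pop_front() {
        if captured.len() >= max_chunks {
            break;
        }
        let chunk = match probe(addr) {
            Some(c) => c,
            None => continue, // unreadable address: skip, keep walking
        };
        // Treat every aligned big-endian word that lands in the arena as a
        // candidate pointer and enqueue it once.
        for word in chunk.chunks_exact(4) {
            let ptr = u32::from_be_bytes([word[0], word[1], word[2], word[3]]);
            if (ARENA_START..ARENA_END).contains(&ptr) && seen.insert(ptr) {
                queue.push_back(ptr);
            }
        }
        captured.insert(addr, chunk);
    }
    captured
}
```

One caveat the sketch makes visible: the range test alone cannot tell pointers from data. `0xBE568F00` at `[ctx+48]` falls inside the assumed arena bounds even though it looks like a float, so a real tool would need a filter or would over-capture.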
The natural-activation path (Step 2 of the boot-state roadmap) is
likely cheaper than any of these crowbar extensions.