xenia-rs/audit-runs/review-a-step1c-crowbar-v3/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Crowbar v3 — ctx-state install verbatim

Date: 2026-05-21

Predecessor: v2 at audit-runs/review-a-step1b-crowbar-v2/.

Status: LANDED. Hypothesis FALSIFIED: the wedge is NOT crowbar-soluble at the ctx-state-only level. Case (D) is needed (recursive secondary-object install). v3 produces the same composite progression score as the OFF baseline.

TL;DR

  • v2 found case (C): [ctx+44] is a secondary-object pointer. vtable[36] reads it and dispatches through it.
  • v3 captured canary's actual [ctx+44] value = 0xBCE25640 (via the audit_68_host_mem_read_probe cvar) along with the rest of the 64-byte ctx head, then installed that state verbatim in ours.
  • Worker tid=15 now passes the [ctx+44] load (loads 0xBCE25640 into r3) but 0xBCE25640 is unmapped in ours's address space (ours's allocator returns 0x4D1Dxxxx VAs; canary's xenon-arena VAs in the 0xBCExxxxx range have no equivalent in ours).
  • Reading [0xBCE25640] returns 0 → CTR=0 → bctrl faults at PC=0 with r3=0xbce25640 (was r3=0x0 in v2, which confirms the install worked; the capture just has to recurse deeper).
  • 3x OFF / 3x ON runs deterministic: swaps=1, draws=0, unique_render_targets=0 identical. Composite progression Δ = 0.

Captured canary ctx state

Canary cold run (90s, --mute=true), with cvars:

--audit_61_branch_probe_pcs=0x825070F0
--audit_68_host_mem_read_probe=0xBCE251C0:8:1000000,0xBCE251C8:8:1000000,
                               0xBCE251D0:8:1000000,0xBCE251D8:8:1000000,
                               0xBCE251E0:8:1000000,0xBCE251E8:8:1000000,
                               0xBCE251F0:8:1000000,0xBCE251F8:8:1000000
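
The probe list above covers the 64-byte ctx head as eight consecutive 8-byte reads. A minimal sketch of generating such a list follows; `probe_spec` is a hypothetical helper, and reading the three fields as `addr:width:limit` is an assumption inferred from the values, not taken from canary's source.

```rust
/// Build a comma-separated probe list shaped like the cvar value above.
/// ASSUMPTION: the fields are `addr:width:limit`; only the textual shape
/// is taken from the command line shown in this note.
pub fn probe_spec(base: u32, count: u32, width: u32, limit: u32) -> String {
    (0..count)
        .map(|i| {
            // Each probe starts `width` bytes after the previous one.
            let addr = base.wrapping_add(i.wrapping_mul(width));
            format!("0x{:08X}:{}:{}", addr, width, limit)
        })
        .collect::<Vec<_>>()
        .join(",")
}
```

Calling `probe_spec(0xBCE251C0, 8, 8, 1_000_000)` reproduces the eight entries from 0xBCE251C0 through 0xBCE251F8.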

AUDIT-061-BR confirmed ctx_ptr=0xBCE251C0 (per AUDIT-068 S3 expectation; no arena drift in this run). Read probe captured the install timeline:

| host_ns | event |
|---|---|
| 9.556 s | Install starts: [ctx+0]=0x8200A1E8 (vtable), [ctx+4]=ctx, [ctx+8]=ctx, [ctx+12]=1 (refcount), [ctx+16]=0x01000000, [ctx+32]=0xFFFFFFFF |
| 9.571 s | [ctx+44]=0xBCE25640 written, [ctx+48]=0xBE568F00 written (looks float-ish) |
| 9.754 s | Transient [ctx+32]=1 and [ctx+40]=0x30057018 writes that are cleared by the next probe tick; likely temporary scratch during a function call |
| 9.755 s | Stable post-install state |

Final ctx bytes (saved at ctx-canary.bin):

  +  0: 82 00 A1 E8 BC E2 51 C0 BC E2 51 C0 00 00 00 01   <- vptr / self / self / refcount
  + 16: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  + 32: FF FF FF FF 00 00 00 00 00 00 00 00 BC E2 56 40   <- ...sentinel... / [ctx+44]=0xBCE25640
  + 48: BE 56 8F 00 00 00 00 00 00 00 00 00 00 00 00 00   <- [ctx+48]=0xBE568F00 (-0.21f?)
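
The dump is big-endian, so each annotated field is a 4-byte big-endian word. The sketch below decodes the same 64 bytes; `CTX_CANARY` and `be_u32` are illustrative names, not identifiers from the repo.

```rust
/// The 64-byte ctx head, copied byte-for-byte from the dump above.
pub const CTX_CANARY: [u8; 64] = [
    0x82, 0x00, 0xA1, 0xE8, 0xBC, 0xE2, 0x51, 0xC0, 0xBC, 0xE2, 0x51, 0xC0, 0x00, 0x00, 0x00, 0x01,
    0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xBC, 0xE2, 0x56, 0x40,
    0xBE, 0x56, 0x8F, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
];

/// Read a big-endian u32 at `offset`, or None if it would run past the slice.
pub fn be_u32(bytes: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?;
    let w = bytes.get(offset..end)?;
    Some(u32::from_be_bytes([w[0], w[1], w[2], w[3]]))
}

/// Reinterpret the word at [ctx+48] as an IEEE-754 single, to check the
/// "float-ish" reading of 0xBE568F00.
pub fn ctx48_as_f32() -> Option<f32> {
    be_u32(&CTX_CANARY, 48).map(f32::from_bits)
}
```

Read as a float, 0xBE568F00 comes out at roughly -0.21, consistent with the annotation. The same word also lies numerically inside 0xBC000000..0xC0000000, so a range check alone cannot tell it apart from an arena pointer.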

Install path in ours

v3 adds crowbar_maybe_install_ctx_from_file() (~63 LOC), which reads the binary at $XENIA_CROWBAR_CTX_BIN and writes the bytes via mem.write_u8(ctx_ptr + i, byte). This is the same pattern as v2's crowbar_maybe_install_vtable_from_file(). Comments and the call-site addition account for another ~13 LOC, for ~76 LOC additive over v2.

The 64-byte ctx file overwrites the v2 init at +0/+4/+8/+12 with identical values (verified — they match), and fills +16..+63 with the captured state.
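
The byte-wise install pattern can be sketched as follows. This is not the actual crowbar_maybe_install_ctx_from_file(); the `GuestMem` trait, `SparseMem`, and the function names are hypothetical stand-ins, and only `write_u8` and the `XENIA_CROWBAR_CTX_BIN` variable come from the text above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the guest memory interface. Only `write_u8`
/// is named in this note; `read_u8` is added so the write can be verified.
pub trait GuestMem {
    fn write_u8(&mut self, addr: u32, value: u8);
    fn read_u8(&self, addr: u32) -> u8;
}

/// Toy sparse memory. ASSUMPTION: unwritten bytes read back as 0.
#[derive(Default)]
pub struct SparseMem {
    bytes: HashMap<u32, u8>,
}

impl GuestMem for SparseMem {
    fn write_u8(&mut self, addr: u32, value: u8) {
        self.bytes.insert(addr, value);
    }
    fn read_u8(&self, addr: u32) -> u8 {
        *self.bytes.get(&addr).unwrap_or(&0)
    }
}

/// Copy `image` to guest memory starting at `ctx_ptr`, one byte at a time.
/// Returns the number of bytes written.
pub fn install_ctx_bytes<M: GuestMem>(mem: &mut M, ctx_ptr: u32, image: &[u8]) -> usize {
    for (i, &byte) in image.iter().enumerate() {
        mem.write_u8(ctx_ptr.wrapping_add(i as u32), byte);
    }
    image.len()
}

/// Opt-in wrapper: does nothing unless the env var is set.
pub fn maybe_install_from_env<M: GuestMem>(
    mem: &mut M,
    ctx_ptr: u32,
) -> std::io::Result<Option<usize>> {
    match std::env::var_os("XENIA_CROWBAR_CTX_BIN") {
        None => Ok(None),
        Some(path) => {
            let image = std::fs::read(path)?;
            Ok(Some(install_ctx_bytes(mem, ctx_ptr, &image)))
        }
    }
}

/// Read back a big-endian u32 for a post-install check.
pub fn read_be_u32<M: GuestMem>(mem: &M, addr: u32) -> u32 {
    u32::from_be_bytes([0u32, 1, 2, 3].map(|i| mem.read_u8(addr.wrapping_add(i))))
}
```

Keeping the copy loop separate from the env-var lookup lets the install be exercised against an in-memory image without touching the filesystem.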

Post-install log confirms exact write:

CROWBAR: installed 64 bytes at ctx_ptr=0x4d1d9000
CROWBAR: post-ctx-install ctx[+  0] (=0x4d1d9000) = 0x8200a1e8
CROWBAR: post-ctx-install ctx[+ 32] (=0x4d1d9020) = 0xffffffff
CROWBAR: post-ctx-install ctx[+ 44] (=0x4d1d902c) = 0xbce25640    <-- secondary obj ptr installed
CROWBAR: post-ctx-install ctx[+ 48] (=0x4d1d9030) = 0xbe568f00

The fault (v3)

Identical fault PC, different r3 — that's the smoking gun:

| | v1 (no ctx install) | v2 (init +0..+12 only) | v3 (full 64 bytes) |
|---|---|---|---|
| FAULT PC | 0 | 0 | 0 |
| LR | 0x82506e38 | 0x82506e38 | 0x82506e38 |
| CTR | 0 | 0 | 0 |
| r3 | (any) | 0x0 | 0xbce25640 |
| r30 (ctx_ptr) | 0x4D1D9000 | 0x4D1D9000 | 0x4D1D9000 |
| tid | 15 | 15 | 15 |

The lwz r11, 0(r3) at PC 0x82506e28 (per v2's disasm) loads through r3, which holds the value of [ctx+44]. In v2, r3=0, so the load reads [0]. In v3, r3=0xBCE25640, so it reads [0xBCE25640]. Both reads return 0 because:

  • v2: page 0 may or may not be mapped in ours, but the value read there is 0 either way.
  • v3: the page containing 0xBCE25640 is unmapped in ours.

Ours's heap is at 0..0x6FFFFFFF (per KernelState::heap_alloc). The xenon physical-region VAs (0xBC000000..0xC0000000) never appear in ours's allocator namespace — MmAllocatePhysicalMemoryEx just calls heap_alloc() which returns low VAs.
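
The mismatch comes down to two disjoint address ranges. The sketch below classifies a VA against the bounds quoted above; the constant names and `classify_va` are illustrative, and treating 0x6FFFFFFF as an inclusive upper bound is an assumption.

```rust
/// Upper bound of ours's heap range as quoted above.
/// ASSUMPTION: the bound is inclusive.
pub const OURS_HEAP_LAST: u32 = 0x6FFF_FFFF;
/// Xenon physical-region VA range quoted above (start inclusive, end exclusive).
pub const PHYS_ARENA_START: u32 = 0xBC00_0000;
pub const PHYS_ARENA_END: u32 = 0xC000_0000;

#[derive(Debug, PartialEq, Eq)]
pub enum VaClass {
    /// Reachable through ours's heap allocator.
    OursHeap,
    /// Inside the physical-region range that only canary hands out.
    CanaryPhysicalArena,
    /// Anything else (for example image/code addresses such as the vtable).
    Other,
}

/// Classify a guest VA by the two ranges described in this section.
pub fn classify_va(va: u32) -> VaClass {
    if va <= OURS_HEAP_LAST {
        VaClass::OursHeap
    } else if (PHYS_ARENA_START..PHYS_ARENA_END).contains(&va) {
        VaClass::CanaryPhysicalArena
    } else {
        VaClass::Other
    }
}
```

Under these bounds, 0x4D1D9000 (ours's ctx_ptr) classifies as `OursHeap` and 0xBCE25640 (the installed secondary pointer) as `CanaryPhysicalArena`, which is why the installed value cannot be dereferenced in ours.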

Why this falsifies the v3 hypothesis

The brief's hypothesis: "with the full ctx state pre-installed AND the 4 workers spawned, ours produces swaps≥2 or draws≥1."

Outcome: ctx state IS installed, 4 workers ARE spawned and resumed, but the dispatch on the secondary object fails because the secondary object's VA isn't mappable.

This is exactly the case (γ) outcome the brief predicted: a fault at a new structural location. The fault PC itself isn't new (still 0), but the primary cause is. In v2 the cause was "ctx+44 not initialized"; in v3 it is "ctx+44 points to an unmapped VA."

Composite progression score

Per brief's option 6 metric (excluding the matched_prefix term, which needs canary cross-comparison not available in check digests):

score = 1*swaps + 10*draws + 100*unique_render_targets

| Run | swaps | draws | unique_RT | score | instructions |
|---|---|---|---|---|---|
| OFF-1 | 1 | 0 | 0 | 1 | 25,000,000 |
| OFF-2 | 1 | 0 | 0 | 1 | 25,000,000 |
| OFF-3 | 1 | 0 | 0 | 1 | 25,000,000 |
| ON-1 | 1 | 0 | 0 | 1 | 20,000,167 (faulted) |
| ON-2 | 1 | 0 | 0 | 1 | 20,000,167 (faulted) |
| ON-3 | 1 | 0 | 0 | 1 | 20,000,167 (faulted) |

Δ = 0. The instruction count dropped from 25M in the OFF runs to 20,000,167 in the ON runs because the fault halts the run early, ~167 instructions after the crowbar trigger (threshold=20M). This confirms the workers can't complete even one meaningful iteration before faulting.
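
The score and the OFF/ON comparison can be sketched as below. `RunDigest` and the function names are hypothetical; the weights are the ones in the formula above, with the matched_prefix term left out as described. Taking Δ as the difference of mean scores is an assumption about how the brief aggregates runs.

```rust
/// The three counters used by the composite score, one instance per run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RunDigest {
    pub swaps: u64,
    pub draws: u64,
    pub unique_render_targets: u64,
}

/// score = 1*swaps + 10*draws + 100*unique_render_targets
pub fn progression_score(d: RunDigest) -> u64 {
    d.swaps + 10 * d.draws + 100 * d.unique_render_targets
}

/// Mean score over a set of runs; None for an empty set.
pub fn mean_score(runs: &[RunDigest]) -> Option<f64> {
    if runs.is_empty() {
        return None;
    }
    let total: u64 = runs.iter().map(|r| progression_score(*r)).sum();
    Some(total as f64 / runs.len() as f64)
}

/// Delta = mean(ON) - mean(OFF). ASSUMPTION: the mean is the aggregate.
pub fn progression_delta(off: &[RunDigest], on: &[RunDigest]) -> Option<f64> {
    Some(mean_score(on)? - mean_score(off)?)
}
```

With all six runs at swaps=1, draws=0, unique_render_targets=0, every score is 1 and the delta is 0.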

LOC delta

  • crates/xenia-kernel/src/exports.rs: +63 LOC (helper) + 13 LOC (call-site comments + wire-up) = +76 LOC over v2.
  • audit-runs/review-a-step1c-crowbar-v3/: artifacts (ctx-canary.bin, canary-probe-run1.log, off-{1,2,3}.json, on-{1,2,3}.json, this doc, summary.md, re-validation.md, fix.diff).
  • No tests added: the helper is structurally identical to v2's crowbar_maybe_install_vtable_from_file, which has no test (it's a diagnostic, opt-in via env var).
  • canary instrumentation: 0 LOC (reused existing audit_68_host_mem_read_probe cvar).

What this confirms

  1. v2's case (C) framing is structurally correct: [ctx+44] IS a secondary-object pointer that vtable[36] dispatches through.
  2. Cross-engine pointer-VA mismatch is real and non-trivial: ours's allocator namespace doesn't include 0xBCxxxxxx VAs.
  3. The wedge is ≥4-deep (vtable + ctx primary + ctx secondary pointer + secondary object's own vtable + fn-pointer slot). Crowbar approach saturates without much deeper state capture.

What this does NOT confirm

  • That the actual canary VA 0xBCE25640 is the ONLY secondary object. There may be more pointers in deeper ctx slots (we only captured 64 bytes; the full struct may be larger).
  • That installing the secondary object would suffice. The secondary object likely has its own pointer fields (head node of a linked list — looks like a queue/work-list given the doubly-linked-list pattern at +4/+8).

Recommendation

Stop the crowbar approach. The wedge is structurally too deep for state synthesis to be cheaper than fixing the natural-activation gap. Per Q5 of the boot-state review (methodology-assessment.md): the matched-prefix metric is on the wrong thread, and the wedge is inherently a thread-activation problem, not a state-construction problem.

Pivot recommendations (in order of cost):

  1. AUDIT-069 follow-up — the 25 vs 1 "other producers" gap from Session 5 is more actionable than the worker-spawn gap. The XAudio thread resume at canary 1.726 s is a candidate trigger that produces 8-24 helpers ahead of the wedge.
  2. Recursive ctx-state capture (option β from brief) — write a probe-graph tool that captures canary's pointer-reachable closure from ctx_ptr (BFS via audit_68_host_mem_read_probe, follow each pointer field that's in the BC arena, capture another 64 bytes, repeat). Estimate: 200-400 LOC tooling + needs ours-side memory allocator extension to map BC-arena VAs. High complexity vs gain.
  3. Pointer-translation table (option α) — map canary BC-VAs to ours allocator-VAs on install. Needs canary-vs-ours linked allocator walk; ~300 LOC.
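
The pointer-reachable closure capture in option β can be sketched as a breadth-first walk. Everything here is hypothetical tooling: `capture_closure`, the 64-byte chunk size, and the `read_chunk` callback (standing in for a canary read probe) are assumptions, and the pointer test is a plain range check on aligned big-endian words.

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};

/// True if `value` falls in the BC arena range quoted earlier in this note.
pub fn in_bc_arena(value: u32) -> bool {
    (0xBC00_0000u32..0xC000_0000u32).contains(&value)
}

/// Breadth-first capture of fixed-size chunks reachable from `root`.
/// `read_chunk(addr)` returns the bytes at `addr`, or None if unreadable.
/// At most `max_chunks` chunks are captured.
pub fn capture_closure<F>(
    root: u32,
    max_chunks: usize,
    mut read_chunk: F,
) -> BTreeMap<u32, Vec<u8>>
where
    F: FnMut(u32) -> Option<Vec<u8>>,
{
    let mut captured = BTreeMap::new();
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(root);
    queue.push_back(root);

    while let Some(addr) = queue.pop_front() {
        if captured.len() >= max_chunks {
            break;
        }
        let bytes = match read_chunk(addr) {
            Some(b) => b,
            None => continue, // unreadable candidate: skip, keep walking
        };
        // Treat every aligned big-endian word in the arena range as a
        // candidate pointer. This is a heuristic with false positives.
        for w in bytes.chunks_exact(4) {
            let value = u32::from_be_bytes([w[0], w[1], w[2], w[3]]);
            if in_bc_arena(value) && seen.insert(value) {
                queue.push_back(value);
            }
        }
        captured.insert(addr, bytes);
    }
    captured
}
```

The range-check heuristic would also enqueue [ctx+48]=0xBE568F00, since that float-looking word lies inside the arena range, so a real tool would need the `max_chunks` cap or a stricter pointer test. Chunks are keyed by the pointer value followed, so pointers into the middle of an already captured chunk produce overlapping captures.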

The natural-activation path (Step 2 of the boot-state roadmap) is likely cheaper than any of these crowbar extensions.