Files
xenia-rs/audit-runs/iterate-2M-exit-state-dump/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

8.5 KiB

Iterate 2.M — Always-on structured exit-state dump (writer report)

Date: 2026-05-28. LOC delta: engine +143 (xenia-app main.rs +128, xenia-kernel event_log.rs +15). Tests: xenia-kernel 227/227 PASS + xenia-app 5/5 + 2 ignored + 1 ignored = ZERO regressions. Cascade: N/A — diagnostic, not investigation (tripstone #40).

Headline

STRUCTURED-EXIT-DUMP-LANDED. Every exec invocation now emits <phase-A-trace-dir>/exit-thread-state.json at exit time, regardless of --quiet. The dump contains every alive thread (tid, hw_id, idx, pc, lr, sp, priority, affinity, suspend_count, state) plus a wedge_map cross-referencing every blocked-forever wait into {waiter_tid, waiter_pc, handle, handle_type, signaler_tid_if_known, human summary}. Closes reading-error #42 — Phase-A JSONL is now never the sole source of exit-time ground truth.

Mode

Engine code change in xenia-rs/crates/:

  • xenia-kernel/src/event_log.rs:7-22, 48-53, 79-89 — record the Phase-A trace path passed to init() so the dump can derive a sibling path; expose pub fn output_path() -> Option<&'static Path>. ~15 LOC net.
  • xenia-app/src/main.rs:4460-4583 — new fn write_thread_state_dump( kernel: &KernelState) that builds JSON via serde_json from kernel.scheduler.slots[*].runqueue[*] + kernel.objects[h] and writes to <phase-A-dir>/exit-thread-state.json (CWD fallback when Phase-A is disabled). Always-on (no quiet gate). ~110 LOC body + 13 LOC docstring.
  • xenia-app/src/main.rs:2161-2164, 4525-4527 — wire the call into both post-run paths (headless cmd_exec_inner and run_with_ui), immediately after dump_thread_diagnostic. Existing plain-text diagnostic untouched.

Verification gate

Same invocation as 2.J/2.K with no extra flags:

XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec \
  -n 50000000 --quiet \
  --phase-a-event-log audit-runs/iterate-2M-exit-state-dump/ours-cold.jsonl \
  "<iso>"

Run completed EXIT=0. Stderr emitted (under --quiet):

exit-thread-state: wrote 13 thread(s), 10 wedge entr(ies) to \
  audit-runs/iterate-2M-exit-state-dump/exit-thread-state.json

Gate criteria — all PASS

criterion result
Dump emitted at <output-dir>/exit-thread-state.json without extra flags PASS
Contains all 13 alive threads (matches 2.K's plain-text dump count) PASS
5 blocked tids at PC 0x824ac578 present and tagged state=Blocked PASS (tid 1, 13, 4, 5, 3)
Wedge map cross-references handle → type → signaler_tid_if_known PASS (10 entries, all blocked-forever waits)
tid=1 → Thread(id=13) circular wait surfaced PASS (summary: "tid=1 → Thread(id=13)")
tid=8 → Semaphore(0/2^31-1) AUDIT-069 work-sem visible PASS (summary: "tid=8 → Semaphore(0/2147483647)")
tid=13 → Event(sig=false) signaler-unknown surfaced PASS (signaler_tid_if_known: null)
Existing === Final State === / === Thread diagnostics === / -- Handle waiter lists -- blocks preserved under non-quiet PASS (3 grep hits in non-quiet stdout)
Structured dump ALSO emits under non-quiet (idempotent w.r.t. quiet flag) PASS

Bit-for-bit match against 2.K's exit-diag-full.log

Each of the 8 blocked tids in 2.K's plain-text dump appears in 2.M's wedge_map/alive_threads with identical handle ids, identical handle types, identical PC/LR/SP values, identical waiter membership. Spot-check:

2.K plain-text line 2.M JSON
tid=1 ... handles: [4808] ... pc=0x824ac578 {"tid":1, "handle":"0x000012c8", "pc":"0x824ac578"} (4808=0x12c8)
tid=13 ... handles: [4816] ... pc=0x824ac578 {"tid":13, "handle":"0x000012d0", "pc":"0x824ac578"} (4816=0x12d0)
tid=8 ... handles: [4332, 4312] [{"handle":"0x000010ec"},{"handle":"0x000010d8"}] (4332=0x10ec, 4312=0x10d8)
tid=4 ... handles: [4136] {"tid":4, "handle":"0x00001028"} (4136=0x1028)
tid=5 ... handles: [4836] {"tid":5, "handle":"0x000012e4"} (4836=0x12e4)
tid=3 ... handles: [4128] {"tid":3, "handle":"0x00001020"} (4128=0x1020)
tid=8 ... 0x10d8 Semaphore(0/2147483647) {"type":"Semaphore","count":0,"max":2147483647}
0x12c8 Thread(id=13, exit=None) {"type":"Thread","thread_id":13,"exited":false}

Existing-mechanism

fn dump_thread_diagnostic (main.rs:3933-4453) produces the plain-text === Thread diagnostics === + -- Handle waiter lists -- block when !quiet. 2.K's exit-diag-full.log was a manual non-quiet re-run. 2.M extends by adding a sibling structured emitter that is always on; the existing plain-text path is unchanged (still off under --quiet, still emits identically under non-quiet).

Relationship: the plain-text dump remains the human-readable walk-the-log artifact; the new JSON is the machine-readable harness input. They produce the same content from the same KernelState snapshot; choosing JSON for the new sibling matches Phase-A JSONL's schema-versioned input style and is jq-friendly.

Test results

  • cargo build --release -p xenia-app — OK, 1 pre-existing unrelated warning (phase_b_snapshot.rs::walk_committed_regions dead_code).
  • cargo test --release -p xenia-kernel -p xenia-app235 passed, 0 failed (227 lib + 5 + 2 ignored + 1 ignored + 0 doc).

Use cases

  • Next iterate can jq '.wedge_map[] | select(.waiter_pc == "0x824ac578")' to get the wedge tid set in one line.
  • Cross-engine diff: pair canary's analogous exit-state JSON (TBD) with ours's via tools/diff-events-style diff to identify missing-thread (canary tids 15/27/28 = sub_825070F0 family) and missing-signaler (Event handles with waiters_tid≠[] and no producer in ours's trace).
  • No more 2.J-class misreadings: a Phase-A trace ending with kernel.return success at the matched-prefix tail will be immediately contradicted by exit-thread-state.json showing those same tids parked indefinitely. The reading-error #42 surface is closed at the output level.

Tripstone audit

  • #28 (cross-engine tid stability): JSON keys tids by raw integer, which is acceptable for ours-only intra-run reads. For cross-engine diffs against canary, downstream tooling must continue to key on (entry_pc, ctx_ptr) — that's a 2.M+1 concern, not a 2.M one. The dump preserves enough columns (hw_id, idx, pc, lr, sp, affinity_mask) for the consumer to do its own re-keying.
  • #39 (progression class): 2.M is methodology not progression. No cascade A/B/C/D claim made. Headline does NOT claim VdSwap/draw movement.
  • #40 (single-keystone framing): not applicable — diagnostic, not single-cause investigation.
  • #42 (Phase-A blind to blocked-forever waits): CLOSED at the output level by this iterate. Future investigations now have an always-on machine-readable wedge snapshot.

Confidence

  • HIGH that the dump emits on every exec run with no extra flags (verified empirically under --quiet AND non-quiet).
  • HIGH that content matches 2.K's plain-text dump bit-for-bit (every handle id, every PC, every waiter list line cross-checked).
  • HIGH that existing diagnostic mechanism is unbroken (plain-text still emits 3 sections under non-quiet, JSON also emits).
  • HIGH that ZERO test regressions (235/235 pass).

Artifacts

Under xenia-rs/audit-runs/iterate-2M-exit-state-dump/:

  • ours-cold.jsonl (Phase-A trace, 121,569 events, ~28MB, bit-equal to 2.J/2.K)
  • ours-cold.stdout.log (empty — quiet mode preserved)
  • ours-cold.stderr.log (single line: dump emission notice)
  • exit-thread-state.json (the new artifact, 9651 bytes, 13 threads + 10 wedge entries)
  • ours-cold-nonquiet.stdout.log / .stderr.log (regression check: existing plain-text diagnostic preserved)
  • writer-report.md (this file)

Patch:

  • xenia-rs/crates/xenia-kernel/src/event_log.rs (path tracker + accessor)
  • xenia-rs/crates/xenia-app/src/main.rs (dump function + 2 call sites)

Next iterate enabler

exit-thread-state.json is now a stable input for:

  1. Canary parity: add the analogous emitter to canary's exit path so cross-engine wedge-map diffs become trivial.
  2. Per-handle signaler hunt: for each wedge handle_type=Event, signaler_tid_if_known=null, walk Phase-A trace for canary's handle-equivalent (semantic_id) signal source — directly identifies which canary thread/path is missing in ours.
  3. Regression alarm: a CI step can refuse to merge if len(wedge_map) > N for the boot-replay scenario, preventing silent re-wedges.