xenia-rs/audit-runs/iterate-2M-exit-state-dump/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


# Iterate 2.M — Always-on structured exit-state dump (writer report)
**Date:** 2026-05-28. **LOC delta:** engine **+143** (xenia-app
main.rs **+128**, xenia-kernel event_log.rs **+15**). **Tests:**
xenia-kernel 227/227 PASS, xenia-app 5/5 PASS, 3 ignored (2 + 1), ZERO
regressions. **Cascade:** N/A (diagnostic, not investigation;
tripstone #40).
## Headline
**STRUCTURED-EXIT-DUMP-LANDED.** Every `exec` invocation now emits
`<phase-A-trace-dir>/exit-thread-state.json` at exit time, regardless
of `--quiet`. The dump contains every alive thread (tid, hw_id, idx,
pc, lr, sp, priority, affinity, suspend_count, state) plus a
`wedge_map` cross-referencing every blocked-forever wait into
{waiter_tid, waiter_pc, handle, handle_type, signaler_tid_if_known,
human summary}. Closes reading-error #42 — Phase-A JSONL is now never
the sole source of exit-time ground truth.
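
The record shape described above can be pictured roughly as follows. This is a minimal std-only sketch, not the code in `main.rs` (which builds its JSON via `serde_json`). The names `WedgeEntry`, `HandleKind`, `hex32`, and `summary` are hypothetical; field names and output formats are taken from the examples quoted in this report.

```rust
/// Hypothetical mirror of one `wedge_map` entry; field names follow the report.
#[derive(Debug, Clone, PartialEq)]
pub struct WedgeEntry {
    pub waiter_tid: u32,
    pub waiter_pc: u32,
    pub handle: u32,
    pub handle_type: HandleKind,
    pub signaler_tid_if_known: Option<u32>,
}

/// Hypothetical subset of the waitable object kinds that appear in the 2.M dump.
#[derive(Debug, Clone, PartialEq)]
pub enum HandleKind {
    Thread { thread_id: u32 },
    Semaphore { count: u32, max: u32 },
    Event { signaled: bool },
}

/// Render a handle or address as zero-padded hex, e.g. 4808 becomes "0x000012c8".
pub fn hex32(value: u32) -> String {
    // Width 10 includes the "0x" prefix, leaving eight hex digits.
    format!("{value:#010x}")
}

/// Build the human-readable summary, e.g. "tid=1 → Thread(id=13)".
pub fn summary(entry: &WedgeEntry) -> String {
    let target = match &entry.handle_type {
        HandleKind::Thread { thread_id } => format!("Thread(id={thread_id})"),
        HandleKind::Semaphore { count, max } => format!("Semaphore({count}/{max})"),
        HandleKind::Event { signaled } => format!("Event(sig={signaled})"),
    };
    format!("tid={} → {}", entry.waiter_tid, target)
}
```

Modelling `signaler_tid_if_known` as `Option<u32>` maps naturally onto the `null` seen in the dump for the signaler-unknown case.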
## Mode
Engine code change in `xenia-rs/crates/`:
- `xenia-kernel/src/event_log.rs:7-22, 48-53, 79-89` — record the
Phase-A trace path passed to `init()` so the dump can derive a
sibling path; expose `pub fn output_path() -> Option<&'static Path>`.
~15 LOC net.
- `xenia-app/src/main.rs:4460-4583` — new `fn write_thread_state_dump(
kernel: &KernelState)` that builds JSON via `serde_json` from
`kernel.scheduler.slots[*].runqueue[*]` + `kernel.objects[h]` and
writes to `<phase-A-dir>/exit-thread-state.json` (CWD fallback when
Phase-A is disabled). Always-on (no `quiet` gate). ~110 LOC body +
13 LOC docstring.
- `xenia-app/src/main.rs:2161-2164, 4525-4527` — wire the call into
both post-run paths (headless `cmd_exec_inner` and `run_with_ui`),
immediately after `dump_thread_diagnostic`. Existing plain-text
diagnostic untouched.
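
The path handling described in the first two bullets could look roughly like this. It is a sketch under assumptions: the report only states that `init()` records the trace path and that `output_path()` has the signature shown. The `OnceLock` storage, the simplified `init` signature, and the helper `dump_path` are hypothetical.

```rust
use std::path::{Path, PathBuf};
use std::sync::OnceLock;

/// Trace path remembered by `init()`; unset while Phase-A logging is disabled.
static OUTPUT_PATH: OnceLock<PathBuf> = OnceLock::new();

/// Hypothetical stand-in for the path-recording part of `event_log::init()`.
pub fn init(trace_path: &Path) {
    // The first call wins; later calls are ignored in this sketch.
    let _ = OUTPUT_PATH.set(trace_path.to_path_buf());
}

/// Accessor with the signature given in the report.
pub fn output_path() -> Option<&'static Path> {
    OUTPUT_PATH.get().map(PathBuf::as_path)
}

/// Derive the dump location: a sibling of the trace file, or a path relative
/// to the current working directory when no trace path is known.
pub fn dump_path(trace_path: Option<&Path>) -> PathBuf {
    let dir = trace_path.and_then(Path::parent).unwrap_or(Path::new(""));
    dir.join("exit-thread-state.json")
}
```

A caller would then write to `dump_path(output_path())`; with the verification-gate invocation this resolves to `audit-runs/iterate-2M-exit-state-dump/exit-thread-state.json`.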
## Verification gate
Same invocation as 2.J/2.K with **no extra flags**:
```
XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec \
-n 50000000 --quiet \
--phase-a-event-log audit-runs/iterate-2M-exit-state-dump/ours-cold.jsonl \
"<iso>"
```
Run completed `EXIT=0`. Stderr emitted (under `--quiet`):
```
exit-thread-state: wrote 13 thread(s), 10 wedge entr(ies) to \
audit-runs/iterate-2M-exit-state-dump/exit-thread-state.json
```
### Gate criteria — all PASS
| criterion | result |
|---|---|
| Dump emitted at `<output-dir>/exit-thread-state.json` without extra flags | **PASS** |
| Contains all 13 alive threads (matches 2.K's plain-text dump count) | **PASS** |
| 5 blocked tids at PC `0x824ac578` present and tagged `state=Blocked` | **PASS** (tid 1, 13, 4, 5, 3) |
| Wedge map cross-references handle → type → signaler_tid_if_known | **PASS** (10 entries, all blocked-forever waits) |
| tid=1 → Thread(id=13) circular wait surfaced | **PASS** (`summary: "tid=1 → Thread(id=13)"`) |
| tid=8 → Semaphore(0/2^31-1) AUDIT-069 work-sem visible | **PASS** (`summary: "tid=8 → Semaphore(0/2147483647)"`) |
| tid=13 → Event(sig=false) signaler-unknown surfaced | **PASS** (`signaler_tid_if_known: null`) |
| Existing `=== Final State ===` / `=== Thread diagnostics ===` / `-- Handle waiter lists --` blocks preserved under non-quiet | **PASS** (3 grep hits in non-quiet stdout) |
| Structured dump ALSO emits under non-quiet (independent of the quiet flag) | **PASS** |
### Field-for-field match against 2.K's exit-diag-full.log
Each of the 8 blocked tids in 2.K's plain-text dump appears in 2.M's
`wedge_map`/`alive_threads` with identical handle ids, identical
handle types, identical PC/LR/SP values, and identical waiter
membership. The only difference is notation: 2.K prints handles in
decimal, 2.M prints them as zero-padded hex.
Spot-check:
| 2.K plain-text line | 2.M JSON |
|---|---|
| `tid=1 ... handles: [4808] ... pc=0x824ac578` | `{"tid":1, "handle":"0x000012c8", "pc":"0x824ac578"}` (4808=0x12c8) |
| `tid=13 ... handles: [4816] ... pc=0x824ac578` | `{"tid":13, "handle":"0x000012d0", "pc":"0x824ac578"}` (4816=0x12d0) |
| `tid=8 ... handles: [4332, 4312]` | `[{"handle":"0x000010ec"},{"handle":"0x000010d8"}]` (4332=0x10ec, 4312=0x10d8) |
| `tid=4 ... handles: [4136]` | `{"tid":4, "handle":"0x00001028"}` (4136=0x1028) |
| `tid=5 ... handles: [4836]` | `{"tid":5, "handle":"0x000012e4"}` (4836=0x12e4) |
| `tid=3 ... handles: [4128]` | `{"tid":3, "handle":"0x00001020"}` (4128=0x1020) |
| `tid=8 ... 0x10d8 Semaphore(0/2147483647)` | `{"type":"Semaphore","count":0,"max":2147483647}` |
| `0x12c8 Thread(id=13, exit=None)` | `{"type":"Thread","thread_id":13,"exited":false}` |
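
The decimal-to-hex correspondences in the table can be re-checked mechanically. The helpers below are a hypothetical sketch (`parse_handle` and `same_handle` are not part of the engine) and assume the `0x`-prefixed hex notation shown in the JSON column.

```rust
/// Parse a dump-style handle string such as "0x000012c8" into its integer value.
/// Returns `None` when the "0x" prefix is missing or the digits are not valid hex.
pub fn parse_handle(text: &str) -> Option<u32> {
    let digits = text.strip_prefix("0x")?;
    u32::from_str_radix(digits, 16).ok()
}

/// True when a 2.K decimal handle and a 2.M hex handle name the same object.
pub fn same_handle(decimal: u32, hex: &str) -> bool {
    parse_handle(hex) == Some(decimal)
}
```

For example, `same_handle(4808, "0x000012c8")` holds for the tid=1 row, and the same check can be applied to every other row of the table.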
## Existing-mechanism
`fn dump_thread_diagnostic` (main.rs:3933-4453) produces the plain-text
`=== Thread diagnostics ===` + `-- Handle waiter lists --` block when
`!quiet`. 2.K's `exit-diag-full.log` was a manual non-quiet re-run.
2.M **extends** by adding a sibling structured emitter that is always
on; the existing plain-text path is **unchanged** (still off under
`--quiet`, still emits identically under non-quiet).
Relationship: the plain-text dump remains the human-readable
walk-the-log artifact; the new JSON is the machine-readable harness
input. They produce the same content from the same `KernelState`
snapshot; choosing JSON for the new sibling matches Phase-A JSONL's
schema-versioned input style and is `jq`-friendly.
## Test results
- `cargo build --release -p xenia-app`: OK, 1 pre-existing unrelated
warning (`phase_b_snapshot.rs::walk_committed_regions` dead_code).
- `cargo test --release -p xenia-kernel -p xenia-app`: **232 passed,
3 ignored, 0 failed** (227 lib + 5 passed; 2 + 1 ignored; 0 doc;
235 tests in total).
## Use cases
- **Next iterate** can `jq '.wedge_map[] | select(.waiter_pc ==
"0x824ac578")'` to get the wedge tid set in one line.
- **Cross-engine diff**: pair canary's analogous exit-state JSON (TBD)
with ours via a `tools/diff-events`-style diff to identify missing
threads (canary tids 15/27/28 = sub_825070F0 family) and missing
signalers (Event handles with `waiters_tid≠[]` and no producer in
our trace).
- **No more 2.J-class misreadings**: a Phase-A trace ending with
`kernel.return success` at the matched-prefix tail will be
immediately contradicted by `exit-thread-state.json` showing those
same tids parked indefinitely. The reading-error #42 surface is
closed at the output level.
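
The contradiction check in the last bullet can be sketched as a set intersection. This is illustrative only: `misread_tids` is a hypothetical helper, and it assumes the caller has already extracted (a) the tids whose final Phase-A event was `kernel.return success` and (b) the tids tagged `state=Blocked` in `exit-thread-state.json`.

```rust
use std::collections::BTreeSet;

/// Tids whose Phase-A tail looked healthy but which the exit dump reports as
/// blocked. A non-empty result means the trace alone would have been misread.
pub fn misread_tids(returned_ok: &[u32], blocked_at_exit: &[u32]) -> Vec<u32> {
    let blocked: BTreeSet<u32> = blocked_at_exit.iter().copied().collect();
    let mut hits: Vec<u32> = returned_ok
        .iter()
        .copied()
        .filter(|tid| blocked.contains(tid))
        .collect();
    // Sort and deduplicate so the result is stable regardless of input order.
    hits.sort_unstable();
    hits.dedup();
    hits
}
```

Any tid returned here is one where the Phase-A tail and the exit-time state disagree, which is the 2.J-class situation the dump is meant to expose.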
## Tripstone audit
- **#28** (cross-engine tid stability): JSON keys tids by raw integer,
which is acceptable for ours-only intra-run reads. For cross-engine
diffs against canary, downstream tooling must continue to key on
`(entry_pc, ctx_ptr)` — that's a 2.M+1 concern, not a 2.M one. The
dump preserves enough columns (`hw_id`, `idx`, `pc`, `lr`, `sp`,
`affinity_mask`) for the consumer to do its own re-keying.
- **#39** (progression class): 2.M is methodology, not progression. No
cascade A/B/C/D claim is made, and the headline does NOT claim
VdSwap/draw movement.
- **#40** (single-keystone framing): not applicable — diagnostic,
not single-cause investigation.
- **#42** (Phase-A blind to blocked-forever waits): **CLOSED** at the
output level by this iterate. Future investigations now have an
always-on machine-readable wedge snapshot.
## Confidence
- **HIGH** that the dump emits on every `exec` run with no extra flags
(verified empirically under `--quiet` AND non-quiet).
- **HIGH** that content matches 2.K's plain-text dump field-for-field
(every handle id, every PC, every waiter list line cross-checked).
- **HIGH** that existing diagnostic mechanism is unbroken (plain-text
still emits 3 sections under non-quiet, JSON also emits).
- **HIGH** that ZERO test regressions (232 passed, 3 ignored, 0 failed).
## Artifacts
Under `xenia-rs/audit-runs/iterate-2M-exit-state-dump/`:
- `ours-cold.jsonl` (Phase-A trace, 121,569 events, ~28MB, bit-equal
to 2.J/2.K)
- `ours-cold.stdout.log` (empty — quiet mode preserved)
- `ours-cold.stderr.log` (single line: dump emission notice)
- `exit-thread-state.json` (**the new artifact**, 9651 bytes, 13
threads + 10 wedge entries)
- `ours-cold-nonquiet.stdout.log` / `.stderr.log` (regression check:
existing plain-text diagnostic preserved)
- `writer-report.md` (this file)
Patch:
- `xenia-rs/crates/xenia-kernel/src/event_log.rs` (path tracker +
accessor)
- `xenia-rs/crates/xenia-app/src/main.rs` (dump function + 2 call
sites)
## Next iterate enabler
`exit-thread-state.json` is now a stable input for:
1. **Canary parity**: add the analogous emitter to canary's exit path
so cross-engine wedge-map diffs become trivial.
2. **Per-handle signaler hunt**: for each wedge `handle_type=Event,
signaler_tid_if_known=null`, walk Phase-A trace for canary's
handle-equivalent (semantic_id) signal source — directly identifies
which canary thread/path is missing in ours.
3. **Regression alarm**: a CI step can refuse to merge if
`len(wedge_map) > N` for the boot-replay scenario, preventing
silent re-wedges.
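
Items 2 and 3 reduce to a filter and a threshold check over the wedge map. The sketch below assumes the dump has already been deserialized into a minimal struct; `Wedge`, `unsignaled_events`, and `check_wedge_budget` are hypothetical names, and the budget `N` for the boot-replay scenario is left for the caller to choose.

```rust
/// Hypothetical minimal view of a `wedge_map` entry for harness-side checks.
#[derive(Debug, Clone, PartialEq)]
pub struct Wedge {
    pub waiter_tid: u32,
    pub handle: u32,
    pub handle_type: String,
    pub signaler_tid_if_known: Option<u32>,
}

/// Event waits with no known signaler: the starting set for the signaler hunt.
pub fn unsignaled_events(map: &[Wedge]) -> Vec<&Wedge> {
    map.iter()
        .filter(|w| w.handle_type == "Event" && w.signaler_tid_if_known.is_none())
        .collect()
}

/// Regression alarm: return an error when the wedge map exceeds `max_wedges`.
pub fn check_wedge_budget(map: &[Wedge], max_wedges: usize) -> Result<(), String> {
    if map.len() > max_wedges {
        Err(format!(
            "wedge_map has {} entries, budget is {}",
            map.len(),
            max_wedges
        ))
    } else {
        Ok(())
    }
}
```

A CI step could call `check_wedge_budget` and exit non-zero on `Err`, while `unsignaled_events` yields the handles to look up in canary's trace.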