handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO): - xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring - xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts — see memory + HANDOFF for the running findings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
186
audit-runs/iterate-2M-exit-state-dump/writer-report.md
Normal file
186
audit-runs/iterate-2M-exit-state-dump/writer-report.md
Normal file
@@ -0,0 +1,186 @@
|
||||
# Iterate 2.M — Always-on structured exit-state dump (writer report)
|
||||
|
||||
**Date:** 2026-05-28. **LOC delta:** engine **+143** (xenia-app
|
||||
main.rs **+128**, xenia-kernel event_log.rs **+15**). **Tests:**
|
||||
xenia-kernel 227/227 PASS + xenia-app 5/5 + 2 ignored + 1 ignored = ZERO
|
||||
regressions. **Cascade:** N/A — diagnostic, not investigation
|
||||
(tripstone #40).
|
||||
|
||||
## Headline
|
||||
|
||||
**STRUCTURED-EXIT-DUMP-LANDED.** Every `exec` invocation now emits
|
||||
`<phase-A-trace-dir>/exit-thread-state.json` at exit time, regardless
|
||||
of `--quiet`. The dump contains every alive thread (tid, hw_id, idx,
|
||||
pc, lr, sp, priority, affinity, suspend_count, state) plus a
|
||||
`wedge_map` cross-referencing every blocked-forever wait into
|
||||
{waiter_tid, waiter_pc, handle, handle_type, signaler_tid_if_known,
|
||||
human summary}. Closes reading-error #42 — Phase-A JSONL is now never
|
||||
the sole source of exit-time ground truth.
|
||||
|
||||
## Mode
|
||||
|
||||
Engine code change in `xenia-rs/crates/`:
|
||||
|
||||
- `xenia-kernel/src/event_log.rs:7-22, 48-53, 79-89` — record the
|
||||
Phase-A trace path passed to `init()` so the dump can derive a
|
||||
sibling path; expose `pub fn output_path() -> Option<&'static Path>`.
|
||||
~15 LOC net.
|
||||
- `xenia-app/src/main.rs:4460-4583` — new `fn write_thread_state_dump(
|
||||
kernel: &KernelState)` that builds JSON via `serde_json` from
|
||||
`kernel.scheduler.slots[*].runqueue[*]` + `kernel.objects[h]` and
|
||||
writes to `<phase-A-dir>/exit-thread-state.json` (CWD fallback when
|
||||
Phase-A is disabled). Always-on (no `quiet` gate). ~110 LOC body +
|
||||
13 LOC docstring.
|
||||
- `xenia-app/src/main.rs:2161-2164, 4525-4527` — wire the call into
|
||||
both post-run paths (headless `cmd_exec_inner` and `run_with_ui`),
|
||||
immediately after `dump_thread_diagnostic`. Existing plain-text
|
||||
diagnostic untouched.
|
||||
|
||||
## Verification gate
|
||||
|
||||
Same invocation as 2.J/2.K with **no extra flags**:
|
||||
|
||||
```
|
||||
XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec \
|
||||
-n 50000000 --quiet \
|
||||
--phase-a-event-log audit-runs/iterate-2M-exit-state-dump/ours-cold.jsonl \
|
||||
"<iso>"
|
||||
```
|
||||
|
||||
Run completed `EXIT=0`. Stderr emitted (under `--quiet`):
|
||||
|
||||
```
|
||||
exit-thread-state: wrote 13 thread(s), 10 wedge entr(ies) to \
|
||||
audit-runs/iterate-2M-exit-state-dump/exit-thread-state.json
|
||||
```
|
||||
|
||||
### Gate criteria — all PASS
|
||||
|
||||
| criterion | result |
|
||||
|---|---|
|
||||
| Dump emitted at `<output-dir>/exit-thread-state.json` without extra flags | **PASS** |
|
||||
| Contains all 13 alive threads (matches 2.K's plain-text dump count) | **PASS** |
|
||||
| 5 blocked tids at PC `0x824ac578` present and tagged `state=Blocked` | **PASS** (tid 1, 13, 4, 5, 3) |
|
||||
| Wedge map cross-references handle → type → signaler_tid_if_known | **PASS** (10 entries, all blocked-forever waits) |
|
||||
| tid=1 → Thread(id=13) circular wait surfaced | **PASS** (`summary: "tid=1 → Thread(id=13)"`) |
|
||||
| tid=8 → Semaphore(0/2^31-1) AUDIT-069 work-sem visible | **PASS** (`summary: "tid=8 → Semaphore(0/2147483647)"`) |
|
||||
| tid=13 → Event(sig=false) signaler-unknown surfaced | **PASS** (`signaler_tid_if_known: null`) |
|
||||
| Existing `=== Final State ===` / `=== Thread diagnostics ===` / `-- Handle waiter lists --` blocks preserved under non-quiet | **PASS** (3 grep hits in non-quiet stdout) |
|
||||
| Structured dump ALSO emits under non-quiet (idempotent w.r.t. quiet flag) | **PASS** |
|
||||
|
||||
### Bit-for-bit match against 2.K's exit-diag-full.log
|
||||
|
||||
Each of the 8 blocked tids in 2.K's plain-text dump appears in 2.M's
|
||||
`wedge_map`/`alive_threads` with identical handle ids, identical
|
||||
handle types, identical PC/LR/SP values, identical waiter membership.
|
||||
Spot-check:
|
||||
|
||||
| 2.K plain-text line | 2.M JSON |
|
||||
|---|---|
|
||||
| `tid=1 ... handles: [4808] ... pc=0x824ac578` | `{"tid":1, "handle":"0x000012c8", "pc":"0x824ac578"}` (4808=0x12c8) |
|
||||
| `tid=13 ... handles: [4816] ... pc=0x824ac578` | `{"tid":13, "handle":"0x000012d0", "pc":"0x824ac578"}` (4816=0x12d0) |
|
||||
| `tid=8 ... handles: [4332, 4312]` | `[{"handle":"0x000010ec"},{"handle":"0x000010d8"}]` (4332=0x10ec, 4312=0x10d8) |
|
||||
| `tid=4 ... handles: [4136]` | `{"tid":4, "handle":"0x00001028"}` (4136=0x1028) |
|
||||
| `tid=5 ... handles: [4836]` | `{"tid":5, "handle":"0x000012e4"}` (4836=0x12e4) |
|
||||
| `tid=3 ... handles: [4128]` | `{"tid":3, "handle":"0x00001020"}` (4128=0x1020) |
|
||||
| `tid=8 ... 0x10d8 Semaphore(0/2147483647)` | `{"type":"Semaphore","count":0,"max":2147483647}` |
|
||||
| `0x12c8 Thread(id=13, exit=None)` | `{"type":"Thread","thread_id":13,"exited":false}` |
|
||||
|
||||
## Existing-mechanism
|
||||
|
||||
`fn dump_thread_diagnostic` (main.rs:3933-4453) produces the plain-text
|
||||
`=== Thread diagnostics ===` + `-- Handle waiter lists --` block when
|
||||
`!quiet`. 2.K's `exit-diag-full.log` was a manual non-quiet re-run.
|
||||
2.M **extends** by adding a sibling structured emitter that is always
|
||||
on; the existing plain-text path is **unchanged** (still off under
|
||||
`--quiet`, still emits identically under non-quiet).
|
||||
|
||||
Relationship: the plain-text dump remains the human-readable
|
||||
walk-the-log artifact; the new JSON is the machine-readable harness
|
||||
input. They produce the same content from the same `KernelState`
|
||||
snapshot; choosing JSON for the new sibling matches Phase-A JSONL's
|
||||
schema-versioned input style and is `jq`-friendly.
|
||||
|
||||
## Test results
|
||||
|
||||
- `cargo build --release -p xenia-app` — OK, 1 pre-existing unrelated
|
||||
warning (`phase_b_snapshot.rs::walk_committed_regions` dead_code).
|
||||
- `cargo test --release -p xenia-kernel -p xenia-app` — **235 passed,
|
||||
0 failed** (227 lib + 5 + 2 ignored + 1 ignored + 0 doc).
|
||||
|
||||
## Use cases
|
||||
|
||||
- **Next iterate** can `jq '.wedge_map[] | select(.waiter_pc ==
|
||||
"0x824ac578")'` to get the wedge tid set in one line.
|
||||
- **Cross-engine diff**: pair canary's analogous exit-state JSON (TBD)
|
||||
with ours's via `tools/diff-events`-style diff to identify
|
||||
missing-thread (canary tids 15/27/28 = sub_825070F0 family) and
|
||||
missing-signaler (Event handles with `waiters_tid≠[]` and no
|
||||
producer in ours's trace).
|
||||
- **No more 2.J-class misreadings**: a Phase-A trace ending with
|
||||
`kernel.return success` at the matched-prefix tail will be
|
||||
immediately contradicted by `exit-thread-state.json` showing those
|
||||
same tids parked indefinitely. The reading-error #42 surface is
|
||||
closed at the output level.
|
||||
|
||||
## Tripstone audit
|
||||
|
||||
- **#28** (cross-engine tid stability): JSON keys tids by raw integer,
|
||||
which is acceptable for ours-only intra-run reads. For cross-engine
|
||||
diffs against canary, downstream tooling must continue to key on
|
||||
`(entry_pc, ctx_ptr)` — that's a 2.M+1 concern, not a 2.M one. The
|
||||
dump preserves enough columns (`hw_id`, `idx`, `pc`, `lr`, `sp`,
|
||||
`affinity_mask`) for the consumer to do its own re-keying.
|
||||
- **#39** (progression class): 2.M is methodology not progression. No
|
||||
cascade A/B/C/D claim made. Headline does NOT claim VdSwap/draw
|
||||
movement.
|
||||
- **#40** (single-keystone framing): not applicable — diagnostic,
|
||||
not single-cause investigation.
|
||||
- **#42** (Phase-A blind to blocked-forever waits): **CLOSED** at the
|
||||
output level by this iterate. Future investigations now have an
|
||||
always-on machine-readable wedge snapshot.
|
||||
|
||||
## Confidence
|
||||
|
||||
- **HIGH** that the dump emits on every `exec` run with no extra flags
|
||||
(verified empirically under `--quiet` AND non-quiet).
|
||||
- **HIGH** that content matches 2.K's plain-text dump bit-for-bit
|
||||
(every handle id, every PC, every waiter list line cross-checked).
|
||||
- **HIGH** that existing diagnostic mechanism is unbroken (plain-text
|
||||
still emits 3 sections under non-quiet, JSON also emits).
|
||||
- **HIGH** that ZERO test regressions (235/235 pass).
|
||||
|
||||
## Artifacts
|
||||
|
||||
Under `xenia-rs/audit-runs/iterate-2M-exit-state-dump/`:
|
||||
|
||||
- `ours-cold.jsonl` (Phase-A trace, 121,569 events, ~28MB, bit-equal
|
||||
to 2.J/2.K)
|
||||
- `ours-cold.stdout.log` (empty — quiet mode preserved)
|
||||
- `ours-cold.stderr.log` (single line: dump emission notice)
|
||||
- `exit-thread-state.json` (**the new artifact**, 9651 bytes, 13
|
||||
threads + 10 wedge entries)
|
||||
- `ours-cold-nonquiet.stdout.log` / `.stderr.log` (regression check:
|
||||
existing plain-text diagnostic preserved)
|
||||
- `writer-report.md` (this file)
|
||||
|
||||
Patch:
|
||||
|
||||
- `xenia-rs/crates/xenia-kernel/src/event_log.rs` (path tracker +
|
||||
accessor)
|
||||
- `xenia-rs/crates/xenia-app/src/main.rs` (dump function + 2 call
|
||||
sites)
|
||||
|
||||
## Next iterate enabler
|
||||
|
||||
`exit-thread-state.json` is now a stable input for:
|
||||
|
||||
1. **Canary parity**: add the analogous emitter to canary's exit path
|
||||
so cross-engine wedge-map diffs become trivial.
|
||||
2. **Per-handle signaler hunt**: for each wedge `handle_type=Event,
|
||||
signaler_tid_if_known=null`, walk Phase-A trace for canary's
|
||||
handle-equivalent (semantic_id) signal source — directly identifies
|
||||
which canary thread/path is missing in ours.
|
||||
3. **Regression alarm**: a CI step can refuse to merge if
|
||||
`len(wedge_map) > N` for the boot-replay scenario, preventing
|
||||
silent re-wedges.
|
||||
Reference in New Issue
Block a user