xenia-rs/audit-runs/iterate-2Q-signal-match/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Iterate 2.Q — `signal.match` instrumentation + wedge disambiguation (writer report)
**Date:** 2026-05-28.
**LOC delta:** engine **~114** (event_log.rs +57, exports.rs +53, plus 4
single-line emits in the 4 signal handlers), tooling **+1** (diff_events.py
ENGINE_LOCAL_KINDS extension). All retained, cvar-gated default-off via
existing `event_log::is_enabled()`.
**Tests:** 227 kernel + 19 path + 149 cpu + 300 main = full suite PASS, 0
regressions.
**Cascade:** N/A — observability class, no semantics changed.
## Headline
**NO-SIGNAL-TARGETS-WEDGE-HANDLES-AT-ALL.** Of the 5 wedge handles
{0x12c8, 0x12d0, 0x12e4, 0x1028, 0x1020}, **4 receive ZERO signals in the
entire run** (not just in the 1000-1010ms window — never), and the 5th
(0x1028, the AUDIT-069 work-semaphore) is signaled 7× by tid=5 with the
wakes working correctly each time (tid=4 wakes, processes, re-waits) —
the wedge there is producer-stop, not wake-failure. Across all 169
NtSetEvent / KeSetEvent / NtReleaseSemaphore / KeReleaseSemaphore calls
in the run, only 36 (21%) fire with any waiter parked on the targeted
handle. The 67 NtSetEvent calls from 2.P collapse to 12 `signal.match`
events — 55 of 67 NtSetEvent calls target handles with no waiter at
signal time. **The disambiguation supports a missing-producer /
unreached-signaler hypothesis over either pure-kernel-wait-bug or
pure-handle-lookup-bug.**
## Mode
Engine code change: pure observability emitter, no semantic change.
Cvar-gated via existing `event_log::is_enabled()`. ENGINE_LOCAL in the
diff tool — does not affect matched-prefix.
Invocation (identical to 2.J/2.K/2.M/2.N):
```
XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec \
-n 50000000 --quiet \
--phase-a-event-log audit-runs/iterate-2Q-signal-match/ours-cold.jsonl \
"../Project Sylpheed - Arc of Deception (USA, Europe) (En,Ja).iso"
```
Exit code 0. Output: `ours-cold.jsonl` (28.7 MB, 121,605 events — 121,569
baseline + 36 `signal.match`), `exit-thread-state.json` (9651 bytes,
bit-identical to 2.M/2.N), `ours-cold.stderr.log` (single 2.M emission
notice line), `ours-cold.stdout.log` (empty — quiet mode).
## Patch summary
| file | purpose | LOC added |
|---|---|---|
| `crates/xenia-kernel/src/event_log.rs` | new `emit_signal_match(tid, cycle, signal_call, target_handle, waiter_count, waiter_tids)` — emits `signal.match` schema-v1 event with FNV-1a SID lookup from registry; `null` when not registered | ~57 |
| `crates/xenia-kernel/src/exports.rs` | `snapshot_waiters_for_signal(state, handle)` (read-only ThreadRef→tid map over the per-object waiter list) + `emit_signal_match_if_waiters(state, name, handle)` shim (gathers tid + cycle, gates on `n > 0`) | ~53 |
| `crates/xenia-kernel/src/exports.rs` | 4 single-line emit calls (`ke_set_event`, `nt_set_event`, `ke_release_semaphore`, `nt_release_semaphore`), placed AFTER `audit_signal` and BEFORE `wake_eligible_waiters` so the snapshot reflects the pre-wake waiter set; `NtReleaseSemaphore` only on SUCCESS path (parity with `wake_eligible_waiters` skip on `STATUS_SEMAPHORE_LIMIT_EXCEEDED`) | 4 |
| `tools/diff-events/diff_events.py` | `ENGINE_LOCAL_KINDS += {"signal.match"}` so the new kind consumes a per-tid idx slot on the emitter side without alignment cost | 1 |
Total: engine ~114 LOC, tooling +1. This is over the 50-100 LOC target only
because of the snapshot helper, and it stays under the 150 LOC hard cap.
Schema-v1 event emitted:
```json
{"schema_version":1,"engine":"ours","kind":"signal.match",
"tid":<signaling tid>,"tid_event_idx":<idx>,"guest_cycle":0,
"host_ns":<ns>,"deterministic":true,
"payload":{"signal_call":"NtSetEvent"|"KeSetEvent"|
"NtReleaseSemaphore"|"KeReleaseSemaphore",
"target_handle":"0x<8hex>",
"target_sid":"<16hex>"|null,
"waiter_count":<n>,"waiter_tids":[<tid>,...]}}
```
Emit policy: skip when `waiter_count == 0` (per 2.Q scope: don't pollute
the trace with no-op-target signals).
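
The schema and emit policy above can be checked mechanically on the consumer side. The following is a minimal sketch, assuming one JSON object per trace line and the field names shown in the schema; `check_signal_match` and `check_line` are hypothetical helpers, not part of the repository tooling, and the rule that `waiter_count` equals the length of `waiter_tids` is an assumption rather than something the patch states.

```python
import json
import re

SIGNAL_CALLS = {"NtSetEvent", "KeSetEvent", "NtReleaseSemaphore", "KeReleaseSemaphore"}
HANDLE_RE = re.compile(r"0x[0-9a-fA-F]{8}")  # "0x<8hex>" per the schema
SID_RE = re.compile(r"[0-9a-fA-F]{16}")      # "<16hex>" per the schema


def check_signal_match(event: dict) -> list:
    """Return a list of problems; an empty list means the event looks well-formed."""
    if event.get("kind") != "signal.match":
        return ["not a signal.match event"]
    problems = []
    if event.get("schema_version") != 1:
        problems.append("unexpected schema_version")
    payload = event.get("payload", {})
    if payload.get("signal_call") not in SIGNAL_CALLS:
        problems.append("unknown signal_call")
    if not HANDLE_RE.fullmatch(str(payload.get("target_handle", ""))):
        problems.append("target_handle is not 0x + 8 hex digits")
    sid = payload.get("target_sid")
    if sid is not None and not SID_RE.fullmatch(str(sid)):
        problems.append("target_sid is neither null nor 16 hex digits")
    count = payload.get("waiter_count")
    if not isinstance(count, int) or count < 1:
        problems.append("emit policy violated: waiter_count < 1")
    # Assumption: waiter_count is the length of waiter_tids.
    if count != len(payload.get("waiter_tids", [])):
        problems.append("waiter_count does not match waiter_tids")
    return problems


def check_line(line: str) -> list:
    """Parse one JSONL line and validate it."""
    return check_signal_match(json.loads(line))
```

Running `check_line` over every `signal.match` line of a trace and collecting the non-empty results would flag any event that slipped past the `waiter_count == 0` skip.
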
## Test results
```
cargo build --release -> OK (1 pre-existing dead_code warning unrelated)
cargo test --release -> all suites PASS:
xenia-kernel 227 passed, 0 failed
xenia-cpu 149 passed, 0 failed
xenia-app 300 passed, 0 failed
+ 30+ smaller suites, 0 failures total
```
## Disambiguation result — 5 wedge handles in 1000-1010ms window
**Caveat:** the run terminates at host_ns=1.008 s on the 50M-instr budget
(per 2.K/2.M/2.N exit-state geometry), so the "1000-1010ms window" covers
only ~8 ms of trace time. **Zero `signal.match` events fire from 950 ms
onward (the last ~58 ms of the run).** I therefore also report counts for
the WHOLE run window [0, 1008ms]:
| wedge handle | object | hits in [1000-1010ms] | hits whole run | signaler tids | call types |
|---|---|---:|---:|---|---|
| `0x000012c8` | Thread(13) | 0 | **0** | — | — |
| `0x000012d0` | Event(sig=false) | 0 | **0** | — | — |
| `0x000012e4` | Event(sig=false) | 0 | **0** | — | — |
| `0x00001028` | Semaphore(0/2³¹-1) | 0 | **7** | {5} | NtReleaseSemaphore ×7 |
| `0x00001020` | Event(sig=false) | 0 | **0** | — | — |
**For 4 of 5 wedge handles: NO SIGNAL EVER FIRES on them in this
trajectory.** The corresponding signaler producers are never reached
in the 50M-instruction window. This is the *missing-producer /
unreached-signaler* class, consistent with AUDIT-049 (tid=1 stall) and the
full AUDIT-069 lineage.
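
The per-handle counts in the table can be reproduced from parsed events with a small tally. This is a sketch under stated assumptions: events are already decoded into dicts, `host_ns` is in nanoseconds (inferred from the field name), and `tally_hits` is a hypothetical helper rather than existing tooling.

```python
from collections import Counter

# The five wedge handles, formatted as the schema prints them.
WEDGE_HANDLES = (
    "0x000012c8", "0x000012d0", "0x000012e4", "0x00001028", "0x00001020",
)


def tally_hits(events, handles=WEDGE_HANDLES,
               window_ns=(1_000_000_000, 1_010_000_000)):
    """Count signal.match hits per handle as (hits in window, hits in whole run)."""
    in_window = Counter()
    whole_run = Counter()
    lo, hi = window_ns
    for event in events:
        if event.get("kind") != "signal.match":
            continue  # only signal.match events carry a target handle here
        handle = event["payload"]["target_handle"].lower()
        if handle not in handles:
            continue
        whole_run[handle] += 1
        if lo <= event.get("host_ns", -1) <= hi:
            in_window[handle] += 1
    # Counter returns 0 for handles that were never seen.
    return {h: (in_window[h], whole_run[h]) for h in handles}
```

A handle that maps to `(0, 0)` received no matched signal anywhere in the run, which is the case the table reports for four of the five handles.
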
**For wedge handle 0x1028** (the AUDIT-069 work-semaphore): tid=5 issues
7 successful `NtReleaseSemaphore` calls at host_ns ∈ {468, 479, 484,
510, 642, 656, 755} ms, **each correctly observing tid=4 as the parked
waiter**. Tid=4 wakes, runs briefly, and re-parks on the same handle
(verified via per-handle `wait.begin` interleave: 8 `wait.begin` events
by tid=4 on SID `ff49a138deff7643` alternating 1:1 with the 7 releases,
ending with a final `wait.begin` at host_ns=757 ms that is never
matched). After 755 ms tid=5 stops producing — there is no
wait-completion bug on this handle; the wedge here is a *producer-stop*
at host_ns=755 ms (which is itself ~250 ms before the trace-end cap),
not a wake-failure.
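
The wait/release interleave check described above amounts to merging two timestamp lists and testing for strict alternation. A minimal sketch, assuming the `wait.begin` and release timestamps for one handle have already been extracted; `check_interleave` is a hypothetical name.

```python
def check_interleave(waits_ns, releases_ns):
    """Merge wait and release timestamps for one handle.

    Returns (alternates, trailing_wait):
      alternates    - True if events strictly alternate, starting with a wait
      trailing_wait - True if the last event is a wait with no later release
    """
    merged = sorted(
        [(t, "wait") for t in waits_ns] + [(t, "release") for t in releases_ns]
    )
    kinds = [kind for _, kind in merged]
    alternates = all(a != b for a, b in zip(kinds, kinds[1:])) and (
        not kinds or kinds[0] == "wait"
    )
    trailing_wait = bool(kinds) and kinds[-1] == "wait"
    return alternates, trailing_wait


# Toy timestamps, not trace data: three waits around two releases.
print(check_interleave([1, 3, 5], [2, 4]))  # (True, True)
```

A result of `(True, True)` is the producer-stop signature: every release found a parked waiter, and the final wait is never answered.
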
## Wider signal/match coverage — pre-wake waiter density
To check the AUDIT-062 "signals target wrong slots" pattern more
broadly, I tallied `kernel.call` vs `signal.match` counts per signal
call (the gap = signals that fired with zero waiters parked on the
targeted handle):
| signal call | `kernel.call` count | `signal.match` count | % with any waiter |
|---|---:|---:|---:|
| NtSetEvent | 67 | 12 | **17.9%** |
| KeSetEvent | 2 | 2 | 100.0% |
| NtReleaseSemaphore | 99 | 21 | **21.2%** |
| KeReleaseSemaphore | 1 | 1 | 100.0% |
| **total** | **169** | **36** | **21.3%** |
**~82% of NtSetEvent and ~79% of NtReleaseSemaphore calls fire at a
handle with no parked waiter at signal time.** Without the AUDIT-062
SID-cross-check against canary, we can't yet say whether these are
wrong-slot misroutes or whether canary likewise fires on the same
handles with no waiter (both engines may legitimately set
manual-reset events ahead of any wait). But the pattern is consistent
with AUDIT-062's framing — *signals exist, just not landing on the
parked-waiter handles* — and inconsistent with a per-handle wake-engine
bug (where signals WOULD target the wedge handles but wakes wouldn't
fire).
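
The percentages in the coverage table follow directly from the two count columns. As a worked check, using only the counts reported above (`waiter_density` is a hypothetical helper name):

```python
# (kernel.call count, signal.match count) per signal call, from the table.
COUNTS = {
    "NtSetEvent": (67, 12),
    "KeSetEvent": (2, 2),
    "NtReleaseSemaphore": (99, 21),
    "KeReleaseSemaphore": (1, 1),
}


def waiter_density(counts):
    """Percent of calls that fired with at least one parked waiter."""
    rows = {
        name: round(100.0 * matched / calls, 1)
        for name, (calls, matched) in counts.items()
    }
    total_calls = sum(calls for calls, _ in counts.values())
    total_matched = sum(matched for _, matched in counts.values())
    rows["total"] = round(100.0 * total_matched / total_calls, 1)
    return rows


density = waiter_density(COUNTS)
# The no-waiter share is the complement of each entry.
no_waiter = {name: round(100.0 - pct, 1) for name, pct in density.items()}
```

The complement gives the no-waiter share: 82.1% for NtSetEvent, 78.8% for NtReleaseSemaphore, and 78.7% overall.
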
## Confidence + hypothesis support
- **HIGH** that the patch is correct and observability-only: 0 test
regressions; semantics of `wake_eligible_waiters` / `audit_signal`
unchanged; emit happens between them.
- **HIGH** that `signal.match` events fire as designed: 36 events
emitted, all with `waiter_count ≥ 1`; SID resolution works (most
carry non-null `target_sid`); `host_ns` and `tid` populate correctly.
- **HIGH** that 4 of the 5 wedge handles {0x12c8, 0x12d0, 0x12e4,
0x1020} receive zero signals in this trajectory: grepped exhaustively
by exact handle string — no hits.
- **HIGH** that wedge handle 0x1028 is NOT a kernel-wake bug: the
wake-rewait-wake interleave with tid=4 is clean for 7 cycles;
producer (tid=5) stops at 755 ms, not the wake plumbing.
- **MEDIUM-HIGH** that the global "missing-producer /
unreached-signaler" framing is the right hypothesis. The 80% no-waiter
signal rate is suggestive of AUDIT-062 wrong-slot fires, but proving it
requires the *canary* `signal.match` mirror (parallel cvar in
xenia-canary) + cross-engine SID diff.
- **NOT SUPPORTED** by this data: a pure kernel wait-completion bug on
the wedge handles (those handles never receive signals; we can't
observe a wake-failure that never gets a signal-trigger).
- **PARTIALLY SUPPORTED**: AUDIT-062-class handle-lookup-bug pattern
for the 80% no-waiter signal calls, but confirming it requires the
matching canary-side observation.
Per tripstone #40 (no single-keystone framing): this iterate does NOT
claim one hypothesis as THE answer. Several explanations remain live for
distinct subsets of signals: the 0x1028 producer-stop is one class, the
4 zero-signal wedge handles are another (possibly downstream of the
same root cause, possibly independent), and the 80% no-waiter signals
are a third.
## Tripstones audit
- **#28** (cross-engine tid stability): only intra-engine tids reported
(waiter_tids in `signal.match` are ours-side scheduler tids). No
cross-engine tid claims made.
- **#39** (composite progression): NO progression claim. VdSwap count
UNCHANGED, matched-prefix UNCHANGED (signal.match is ENGINE_LOCAL in
diff harness — verified via `ENGINE_LOCAL_KINDS` membership).
- **#40** (single-keystone framing): explicitly NOT claiming the
disambiguation is THE answer. The data isolates that **the wedge
handles are not signaled** but does not prove whether the broader
~80% no-waiter rate is wrong-slot routing vs benign pre-wait sets.
- **#41** (categorized diff tags): `signal.match` is ENGINE_LOCAL so
it doesn't affect the categorized harness output.
- **#42** (Phase-A blind to blocked-forever waits):
`exit-thread-state.json` auto-emitted bit-identical to 2.M/2.N (verified by
filesize match 9651 bytes + 13 alive threads + 10 wedge entries).
## Next-iterate recommendation
Two complementary directions, priority order:
**(1)** ~30-60 LOC canary-side `signal.match` mirror (same payload
shape, same cvar pattern). Run canary cold under the same
50M-equivalent budget and diff: enumerate the canary `signal.match`
events for the SIDs that the 5 ours-side wedge handles resolve to. If
canary fires signals on those SIDs, AUDIT-062 wrong-slot is confirmed.
If canary likewise fires zero on those SIDs, the missing-producer
framing is sealed and the next strategic blocker is the producer-chain
itself (sub_825070F0 fan-out per AUDIT-066/068/069).
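
The decision rule in direction (1) can be stated as a per-SID classification. This is a sketch only: the canary mirror does not exist yet, so the input shape (a collection of SIDs seen in canary `signal.match` events) and the name `classify_wedge_sids` are assumptions.

```python
def classify_wedge_sids(wedge_sids, canary_signaled_sids):
    """Map each ours-side wedge SID to the hypothesis a canary trace would support."""
    signaled = set(canary_signaled_sids)
    verdicts = {}
    for sid in wedge_sids:
        if sid is None:
            verdicts[sid] = "unresolved"        # handle had no registered SID
        elif sid in signaled:
            verdicts[sid] = "wrong-slot"        # canary signals it, ours never does
        else:
            verdicts[sid] = "missing-producer"  # neither engine signals it
    return verdicts
```

Classifying per SID rather than for the whole set keeps a mixed outcome visible, which matters because the wedge handles may not all share one cause.
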
**(2)** ~0-40 LOC ours-side investigation of the 0x1028 producer-stop
at host_ns=755 ms: tid=5 successfully issues 7 releases then stops.
Walk its post-release control flow (LR=0x824ac578 ⇒ within the wait
wrapper; tid=5 itself goes blocked at host_ns~770 ms on event 0x12e4
per the exit-state) — the stop is *because tid=5 itself wedges on a
different unsignaled event after the 7th cycle*. This is downstream of
the same wedge-graph but the "graph edge" is now explicit:
`tid=5 wedges on 0x12e4 ⇒ no more 0x1028 releases ⇒ tid=4 wedges on
0x1028`. So 0x12e4 is upstream of 0x1028 in the wedge graph; whoever
should signal 0x12e4 is the real producer-gap.
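
The wedge-graph edge described above can be walked programmatically: start from a blocked thread, look up the handle it waits on, then the thread expected to signal that handle, and repeat until no signaler is known. A minimal sketch using only the two edges stated in this report; `upstream_chain` and the two dict shapes are hypothetical.

```python
def upstream_chain(waits_on, signaled_by, start_tid):
    """Follow blocked tid -> awaited handle -> expected signaler tid.

    Stops when a thread is not blocked, when a handle has no known
    signaler (the producer gap), or when a cycle is detected.
    """
    chain = []
    seen = set()
    tid = start_tid
    while tid is not None and tid not in seen:
        seen.add(tid)
        handle = waits_on.get(tid)
        if handle is None:
            break  # this thread is runnable; the chain ends here
        chain.append((tid, handle))
        tid = signaled_by.get(handle)  # None means the producer is unknown
    return chain


waits_on = {4: 0x1028, 5: 0x12E4}  # tid -> handle it is parked on
signaled_by = {0x1028: 5}          # handle -> tid expected to signal it
chain = upstream_chain(waits_on, signaled_by, 4)
print([(tid, hex(handle)) for tid, handle in chain])
# [(4, '0x1028'), (5, '0x12e4')]
```

The last entry of the chain names the handle whose signaler is missing from `signaled_by`, here 0x12e4.
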
## Artifacts
Under `xenia-rs/audit-runs/iterate-2Q-signal-match/`:
- `ours-cold.jsonl` (28.7 MB, 121,605 events — 36 `signal.match` +
121,569 baseline events bit-equal to 2.N where ENGINE_LOCAL kinds
collapsed)
- `ours-cold.stdout.log` (empty — quiet mode)
- `ours-cold.stderr.log` (single 2.M emission notice line)
- `exit-thread-state.json` (9651 bytes; bit-identical to 2.M/2.N — 13
threads + 10 wedge entries)
- `writer-report.md` (this file)