Iterate 2.Q — signal.match instrumentation + wedge disambiguation (writer report)
Date: 2026-05-28.
LOC delta: engine ~114 (event_log.rs +57, exports.rs +53, plus 4 single-line
emits in 4 signal handlers), tooling +1 (diff_events.py ENGINE_LOCAL_KINDS
extension). All retained, cvar-gated default-off via existing
event_log::is_enabled().
Tests: 227 kernel + 19 path + 149 cpu + 300 main = full suite PASS, 0
regressions.
Cascade: N/A — observability class, no semantics changed.
Headline
NO-SIGNAL-TARGETS-WEDGE-HANDLES-AT-ALL. Of the 5 wedge handles
{0x12c8, 0x12d0, 0x12e4, 0x1028, 0x1020}, 4 receive ZERO signals in the
entire run (not just in the 1000-1010ms window — never), and the 5th
(0x1028, the AUDIT-069 work-semaphore) is signaled 7× by tid=5 with the
wakes working correctly each time (tid=4 wakes, processes, re-waits) —
the wedge there is producer-stop, not wake-failure. Across all 169
NtSetEvent / KeSetEvent / NtReleaseSemaphore / KeReleaseSemaphore calls
in the run, only 36 (21%) fire with any waiter parked on the targeted
handle. The 67 NtSetEvent calls from 2.P collapse to 12 signal.match
events — 55 of 67 NtSetEvent calls target handles with no waiter at
signal time. The disambiguation supports a missing-producer /
unreached-signaler hypothesis over either pure-kernel-wait-bug or
pure-handle-lookup-bug.
Mode
Engine code change: pure observability emitter, no semantic change.
Cvar-gated via existing event_log::is_enabled(). ENGINE_LOCAL in the
diff tool — does not affect matched-prefix.
Invocation (identical to 2.J/2.K/2.M/2.N):
XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec \
-n 50000000 --quiet \
--phase-a-event-log audit-runs/iterate-2Q-signal-match/ours-cold.jsonl \
"../Project Sylpheed - Arc of Deception (USA, Europe) (En,Ja).iso"
Exit code 0. Output: ours-cold.jsonl (28.7 MB, 121,605 events — 121,569
baseline + 36 signal.match), exit-thread-state.json (9651 bytes,
bit-identical to 2.M/2.N), ours-cold.stderr.log (single 2.M emission
notice line), ours-cold.stdout.log (empty — quiet mode).
Patch summary
| file | purpose | LOC added |
|---|---|---|
| crates/xenia-kernel/src/event_log.rs | new emit_signal_match(tid, cycle, signal_call, target_handle, waiter_count, waiter_tids): emits signal.match schema-v1 event with FNV-1a SID lookup from registry; null when not registered | ~57 |
| crates/xenia-kernel/src/exports.rs | snapshot_waiters_for_signal(state, handle) (read-only ThreadRef→tid map over the per-object waiter list) + emit_signal_match_if_waiters(state, name, handle) shim (gathers tid + cycle, gates on n > 0) | ~53 |
| crates/xenia-kernel/src/exports.rs | 4 single-line emit calls (ke_set_event, nt_set_event, ke_release_semaphore, nt_release_semaphore), placed AFTER audit_signal and BEFORE wake_eligible_waiters so the snapshot reflects the pre-wake waiter set; NtReleaseSemaphore only on SUCCESS path (parity with wake_eligible_waiters skip on STATUS_SEMAPHORE_LIMIT_EXCEEDED) | 4 |
| tools/diff-events/diff_events.py | ENGINE_LOCAL_KINDS += {"signal.match"} so the new kind consumes a per-tid idx slot on the emitter side without alignment cost | 1 |
Total engine ~114 LOC, tooling +1. Within the 50-100 target / 150 hard cap modulo the snapshot helper.
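To illustrate why an ENGINE_LOCAL kind leaves matched-prefix untouched, here is a minimal Python sketch. It is not the logic in diff_events.py: the function name matched_prefix_len and the kind-only comparison are assumptions made for illustration, and only the set name ENGINE_LOCAL_KINDS comes from the patch.

```python
# Hypothetical stand-in for the set extended in tools/diff-events/diff_events.py.
ENGINE_LOCAL_KINDS = {"signal.match"}


def matched_prefix_len(ours, theirs, engine_local=ENGINE_LOCAL_KINDS):
    """Length of the common prefix of event kinds after dropping engine-local kinds.

    Assumption: events are compared by kind only. A real diff harness would
    compare more fields; this only shows the filtering step.
    """
    a = [k for k in ours if k not in engine_local]
    b = [k for k in theirs if k not in engine_local]
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

With the filter in place, an extra signal.match on one side does not shorten the prefix; with an empty engine_local set it would break alignment at the first inserted event.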
Schema-v1 event emitted:
{"schema_version":1,"engine":"ours","kind":"signal.match",
"tid":<signaling tid>,"tid_event_idx":<idx>,"guest_cycle":0,
"host_ns":<ns>,"deterministic":true,
"payload":{"signal_call":"NtSetEvent"|"KeSetEvent"|
"NtReleaseSemaphore"|"KeReleaseSemaphore",
"target_handle":"0x<8hex>",
"target_sid":"<16hex>"|null,
"waiter_count":<n>,"waiter_tids":[<tid>,...]}}
Emit policy: skip when waiter_count == 0 (per 2.Q scope: don't pollute
the trace with no-op-target signals).
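The event shape and the skip-on-zero-waiters policy can be modeled in Python as follows. This is a sketch of the schema above, not the Rust emitter in event_log.rs: build_signal_match and its parameters are hypothetical, and the SID is passed in rather than computed (the registry lookup is not reproduced here).

```python
import json

VALID_CALLS = {
    "NtSetEvent",
    "KeSetEvent",
    "NtReleaseSemaphore",
    "KeReleaseSemaphore",
}


def build_signal_match(tid, tid_event_idx, host_ns, signal_call,
                       target_handle, waiter_tids, target_sid=None):
    """Return a schema-v1 signal.match dict, or None when no waiter is parked."""
    if signal_call not in VALID_CALLS:
        raise ValueError(f"unknown signal call: {signal_call}")
    waiters = list(waiter_tids)
    if not waiters:
        # Emit policy: skip signals that target a handle with zero waiters.
        return None
    return {
        "schema_version": 1,
        "engine": "ours",
        "kind": "signal.match",
        "tid": tid,
        "tid_event_idx": tid_event_idx,
        "guest_cycle": 0,  # the schema above shows a literal 0 here
        "host_ns": host_ns,
        "deterministic": True,
        "payload": {
            "signal_call": signal_call,
            "target_handle": f"0x{target_handle:08x}",  # "0x<8hex>"
            "target_sid": target_sid,  # 16-hex string, or None (JSON null)
            "waiter_count": len(waiters),
            "waiter_tids": waiters,
        },
    }
```

One JSONL line would then be json.dumps(event) for each non-None result.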
Test results
cargo build --release -> OK (1 pre-existing dead_code warning unrelated)
cargo test --release -> all suites PASS:
xenia-kernel 227 passed, 0 failed
xenia-cpu 149 passed, 0 failed
xenia-app 300 passed, 0 failed
+ 30+ smaller suites, 0 failures total
Disambiguation result — 5 wedge handles in 1000-1010ms window
Caveat: the run terminates at host_ns=1.008 s on the 50M-instr budget
(per 2.K/2.M/2.N exit-state geometry), so "1000-1010ms window" reduces
to only ~8 ms of trace time. Zero signal.match events fire in the
last 50 ms (≥ 950 ms). I therefore report counts for the WHOLE run
window [0, 1008ms]:
| wedge handle | object | hits in [1000-1010ms] | hits whole run | signaler tids | call types |
|---|---|---|---|---|---|
| 0x000012c8 | Thread(13) | 0 | 0 | - | - |
| 0x000012d0 | Event(sig=false) | 0 | 0 | - | - |
| 0x000012e4 | Event(sig=false) | 0 | 0 | - | - |
| 0x00001028 | Semaphore(0/2³¹-1) | 0 | 7 | {5} | NtReleaseSemaphore ×7 |
| 0x00001020 | Event(sig=false) | 0 | 0 | - | - |
For 4 of 5 wedge handles: NO SIGNAL EVER FIRES on them in this trajectory; the corresponding signaler producers are never reached in the 50M-instruction window. This is the missing-producer / unreached-signaler class, consistent with AUDIT-049 (tid=1 stall) and the full AUDIT-069 lineage.
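The per-handle counts in the table can be reproduced from the JSONL trace with a tally along these lines. This is a sketch, not the script used for this report: tally_wedge_hits is a hypothetical name, and host_ns is assumed to be in nanoseconds, as the field name suggests.

```python
import json

WEDGE_HANDLES = (
    "0x000012c8",
    "0x000012d0",
    "0x000012e4",
    "0x00001028",
    "0x00001020",
)


def tally_wedge_hits(lines, window_ns=None):
    """Count signal.match events per wedge handle.

    lines: iterable of JSONL strings (an open file works).
    window_ns: optional (lo, hi) half-open host_ns window; None = whole run.
    """
    hits = {h: 0 for h in WEDGE_HANDLES}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ev = json.loads(line)
        if ev.get("kind") != "signal.match":
            continue
        if window_ns is not None:
            lo, hi = window_ns
            if not (lo <= ev["host_ns"] < hi):
                continue
        handle = ev["payload"]["target_handle"].lower()
        if handle in hits:
            hits[handle] += 1
    return hits
```

Usage: open ours-cold.jsonl and call tally_wedge_hits(f) for the whole run, or pass window_ns=(1_000_000_000, 1_010_000_000) for the 1000-1010 ms window.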
For wedge handle 0x1028 (the AUDIT-069 work-semaphore): tid=5 issues
7 successful NtReleaseSemaphore calls at host_ns ∈ {468, 479, 484,
510, 642, 656, 755} ms, each correctly observing tid=4 as the parked
waiter. Tid=4 wakes, runs briefly, and re-parks on the same handle
(verified via per-handle wait.begin interleave: 8 wait.begin events
by tid=4 on SID ff49a138deff7643 alternating 1:1 with the 7 releases,
ending with a final wait.begin at host_ns=757 ms that is never
matched). After 755 ms tid=5 stops producing — there is no
wait-completion bug on this handle; the wedge here is a producer-stop
at host_ns=755 ms (which is itself ~250 ms before the trace-end cap),
not a wake-failure.
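The producer-stop versus wake-failure distinction rests on the shape of the per-handle interleave, which can be checked mechanically. The sketch below assumes the events for one handle have already been reduced to a chronological list of "wait" / "release" labels; classify_interleave and its return labels are hypothetical names, not part of the tooling.

```python
def classify_interleave(kinds):
    """Classify a chronological "wait"/"release" sequence for one handle.

    - strict alternation ending in "wait": every release woke the waiter,
      which re-parked, and the producer then went silent ("producer-stop").
    - strict alternation ending in "release": the waiter never re-parked
      after the last release ("no-rewait-after-release").
    - any repeated label: the 1:1 pairing is broken ("broken-alternation"),
      e.g. back-to-back releases with no intervening re-wait.
    """
    if not kinds:
        return "no-activity"
    for prev, cur in zip(kinds, kinds[1:]):
        if prev == cur:
            return "broken-alternation"
    if kinds[-1] == "wait":
        return "producer-stop"
    return "no-rewait-after-release"
```

For 0x1028 the observed shape is 8 wait.begin events alternating with 7 releases and ending on an unmatched wait, which this check classifies as producer-stop.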
Wider signal/match coverage — pre-wake waiter density
To check the AUDIT-062 "signals target wrong slots" pattern more
broadly, I tallied kernel.call vs signal.match counts per signal
call (the gap = signals that fired with zero waiters parked on the
targeted handle):
| signal call | kernel.call count | signal.match count | % with any waiter |
|---|---|---|---|
| NtSetEvent | 67 | 12 | 17.9% |
| KeSetEvent | 2 | 2 | 100.0% |
| NtReleaseSemaphore | 99 | 21 | 21.2% |
| KeReleaseSemaphore | 1 | 1 | 100.0% |
| total | 169 | 36 | 21.3% |
~82% of NtSetEvent and ~79% of NtReleaseSemaphore calls fire at a handle with no parked waiter at signal time. Without the AUDIT-062 SID-cross-check against canary, we can't yet say whether these are wrong-slot misroutes or whether canary likewise fires on the same handles with no waiter (both engines may legitimately set manual-reset events ahead of any wait). But the pattern is consistent with AUDIT-062's framing (signals exist, just not landing on the parked-waiter handles) and inconsistent with a per-handle wake-engine bug (where signals WOULD target the wedge handles but wakes wouldn't fire).
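The percentages follow directly from the two count columns. A short Python check, using only the counts from the table (the function name waiter_rates is illustrative):

```python
# (kernel.call count, signal.match count) per signal call, from the table.
COUNTS = {
    "NtSetEvent": (67, 12),
    "KeSetEvent": (2, 2),
    "NtReleaseSemaphore": (99, 21),
    "KeReleaseSemaphore": (1, 1),
}


def waiter_rates(counts):
    """Return {call: (% with any waiter, % with no waiter)}, plus a "total" row."""
    rows = dict(counts)
    rows["total"] = (
        sum(c for c, _ in counts.values()),
        sum(m for _, m in counts.values()),
    )
    return {
        name: (round(100.0 * m / c, 1), round(100.0 * (c - m) / c, 1))
        for name, (c, m) in rows.items()
    }


rates = waiter_rates(COUNTS)
```

This gives 17.9% / 82.1% for NtSetEvent, 21.2% / 78.8% for NtReleaseSemaphore, and 21.3% / 78.7% overall (36 of 169 with a waiter, 133 of 169 without).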
Confidence + hypothesis support
- HIGH that the patch is correct and observability-only: 0 test regressions; semantics of wake_eligible_waiters/audit_signal unchanged; emit happens between them.
- HIGH that signal.match events fire as designed: 36 events emitted, all with waiter_count ≥ 1; SID resolution works (most carry non-null target_sid); host_ns and tid populate correctly.
- HIGH that 4 of the 5 wedge handles {0x12c8, 0x12d0, 0x12e4, 0x1020} receive zero signals in this trajectory: grepped exhaustively by exact handle string, no hits.
- HIGH that wedge handle 0x1028 is NOT a kernel-wake bug: the wake-rewait-wake interleave with tid=4 is clean for 7 cycles; producer (tid=5) stops at 755 ms, not the wake plumbing.
- MEDIUM-HIGH that the global "missing-producer / unreached-signaler" framing is the right hypothesis. The 80% no-waiter signal rate is suggestive of AUDIT-062 wrong-slot fires, but proving it requires the canary signal.match mirror (parallel cvar in xenia-canary) + cross-engine SID diff.
- NOT SUPPORTED by this data: a pure kernel wait-completion bug on the wedge handles (those handles never receive signals; we can't observe a wake-failure that never gets a signal-trigger).
- PARTIALLY SUPPORTED: AUDIT-062-class handle-lookup-bug pattern for the 80% no-waiter signal calls, but only co-observable in canary to confirm.
Per tripstone #40 (no single-keystone framing): this iterate does NOT claim one hypothesis as THE answer. Three classes remain live for distinct subsets of signals: the 0x1028 producer-stop is one, the 4 zero-signal wedge handles are another (could be downstream of the same root cause or independent), and the 80% no-waiter signals are a third.
Tripstones audit
- #28 (cross-engine tid stability): only intra-engine tids reported (waiter_tids in signal.match are ours-side scheduler tids). No cross-engine tid claims made.
- #39 (composite progression): NO progression claim. VdSwap count UNCHANGED, matched-prefix UNCHANGED (signal.match is ENGINE_LOCAL in diff harness, verified via ENGINE_LOCAL_KINDS membership).
- #40 (single-keystone framing): explicitly NOT claiming the disambiguation is THE answer. The data isolates that the wedge handles are not signaled but does not prove whether the broader ~80% no-waiter rate is wrong-slot routing vs benign pre-wait sets.
- #41 (categorized diff tags): signal.match is ENGINE_LOCAL so it doesn't affect the categorized harness output.
- #42 (Phase-A blind to blocked-forever waits): exit-thread-state.json auto-emitted bit-identical to 2.M/2.N (verified by filesize match 9651 bytes + 13 alive threads + 10 wedge entries).
Next-iterate recommendation
Two complementary directions, priority order:
(1) ~30-60 LOC canary-side signal.match mirror (same payload
shape, same cvar pattern). Run canary cold under the same
50M-equivalent budget and diff: enumerate signal.match events for
each of the 5 wedge handles' canary-side SIDs. If canary fires
signals on the SIDs that ours's wedge handles resolve to, AUDIT-062
wrong-slot is confirmed. If canary likewise fires zero on those SIDs,
the missing-producer framing is sealed and the next strategic blocker
is the producer-chain itself (sub_825070F0 fan-out per
AUDIT-066/068/069).
(2) 0-40 LOC ours-side investigation of the 0x1028 producer-stop
at host_ns=755 ms: tid=5 successfully issues 7 releases then stops.
Walk its post-release control flow (LR=0x824ac578 ⇒ within the wait
wrapper; tid=5 itself goes blocked at host_ns=770 ms on event 0x12e4
per the exit-state) — the stop is because tid=5 itself wedges on a
different unsignaled event after the 7th cycle. This is downstream of
the same wedge-graph but the "graph edge" is now explicit:
tid=5 wedges on 0x12e4 ⇒ no more 0x1028 releases ⇒ tid=4 wedges on 0x1028. So 0x12e4 is upstream of 0x1028 in the wedge graph; whoever
should signal 0x12e4 is the real producer-gap.
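The wedge-graph edge in (2) can be walked mechanically: from a wedged handle, go to its producer thread, then to the handle that thread is itself blocked on, and repeat until no producer is known. The sketch below is illustrative only. find_root_gap and the two maps are hypothetical, populated with just the two edges stated above (tid=5 produces for 0x1028 and is blocked on 0x12e4; tid=4 is blocked on 0x1028).

```python
def find_root_gap(handle, producer_of, blocked_on):
    """Walk upstream from a wedged handle.

    producer_of: handle -> tid expected to signal it (known producers only).
    blocked_on:  tid -> handle that tid is currently blocked on.
    Returns (root_handle, path). root_handle is the first handle with no
    known producer, or whose producer is not blocked; None if a cycle is hit.
    """
    path = []
    while handle not in path:
        path.append(handle)
        tid = producer_of.get(handle)
        if tid is None or tid not in blocked_on:
            return handle, path
        handle = blocked_on[tid]
    return None, path  # dependency cycle among the listed handles


# Edges stated in this report; producers for the other handles are unknown.
PRODUCER_OF = {0x1028: 5}
BLOCKED_ON = {5: 0x12E4, 4: 0x1028}

root, path = find_root_gap(0x1028, PRODUCER_OF, BLOCKED_ON)
```

Here the walk goes 0x1028 then 0x12e4 and stops there, since no producer for 0x12e4 is known: that is the producer-gap the recommendation points at.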
Artifacts
Under xenia-rs/audit-runs/iterate-2Q-signal-match/:
- ours-cold.jsonl (28.7 MB, 121,605 events: 36 signal.match + 121,569 baseline events bit-equal to 2.N where ENGINE_LOCAL kinds collapsed)
- ours-cold.stdout.log (empty, quiet mode)
- ours-cold.stderr.log (single 2.M emission notice line)
- exit-thread-state.json (9651 bytes; bit-identical to 2.M/2.N, 13 threads + 10 wedge entries)
- writer-report.md (this file)