handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
233
audit-runs/iterate-2Q-signal-match/writer-report.md
Normal file
@@ -0,0 +1,233 @@
# Iterate 2.Q: `signal.match` instrumentation + wedge disambiguation (writer report)

**Date:** 2026-05-28.
**LOC delta:** engine **~114** (event_log.rs +57, exports.rs +53, plus 4
single-line emits, one in each of the 4 signal handlers), tooling **+1**
(diff_events.py ENGINE_LOCAL_KINDS extension). All retained, cvar-gated
default-off via existing `event_log::is_enabled()`.
**Tests:** 227 kernel + 19 path + 149 cpu + 300 main = full suite PASS, 0
regressions.
**Cascade:** N/A (observability class, no semantics changed).

## Headline

**NO-SIGNAL-TARGETS-WEDGE-HANDLES-AT-ALL.** Of the 5 wedge handles
{0x12c8, 0x12d0, 0x12e4, 0x1028, 0x1020}, **4 receive ZERO signals in the
entire run** (not just in the 1000-1010ms window: never), and the 5th
(0x1028, the AUDIT-069 work-semaphore) is signaled 7× by tid=5 with the
wakes working correctly each time (tid=4 wakes, processes, re-waits);
the wedge there is producer-stop, not wake-failure. Across all 169
NtSetEvent / KeSetEvent / NtReleaseSemaphore / KeReleaseSemaphore calls
in the run, only 36 (21%) fire with any waiter parked on the targeted
handle. The 67 NtSetEvent calls from 2.P collapse to 12 `signal.match`
events: 55 of 67 NtSetEvent calls target handles with no waiter at
signal time. **The disambiguation supports a missing-producer /
unreached-signaler hypothesis over either pure-kernel-wait-bug or
pure-handle-lookup-bug.**

## Mode

Engine code change: pure observability emitter, no semantic change.
Cvar-gated via existing `event_log::is_enabled()`. `signal.match` is
ENGINE_LOCAL in the diff tool, so it does not affect matched-prefix.
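
To make the ENGINE_LOCAL claim concrete, here is a minimal sketch of how an engine-local kind can be excluded before a matched-prefix comparison. It is not the actual `diff_events.py` logic: `comparable_kinds` and `matched_prefix_len` are hypothetical names, and comparing only the `kind` field of a single per-tid stream is an assumption made for illustration.

```python
# Hypothetical mirror of the diff tool's set. Only the "signal.match" member
# is confirmed by this report; the real set may contain other kinds.
ENGINE_LOCAL_KINDS = {"signal.match"}


def comparable_kinds(events):
    """Return the event kinds of one stream with engine-local kinds dropped."""
    return [e["kind"] for e in events if e["kind"] not in ENGINE_LOCAL_KINDS]


def matched_prefix_len(ours, canary):
    """Length of the common prefix of two streams after filtering.

    Assumption: both arguments are event lists for one comparable stream,
    and only the 'kind' field is compared.
    """
    n = 0
    for a, b in zip(comparable_kinds(ours), comparable_kinds(canary)):
        if a != b:
            break
        n += 1
    return n
```

Under this model, adding `signal.match` events to the ours-side stream cannot shorten or lengthen the matched prefix, which is the property the report relies on.
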

Invocation (identical to 2.J/2.K/2.M/2.N):

```
XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec \
    -n 50000000 --quiet \
    --phase-a-event-log audit-runs/iterate-2Q-signal-match/ours-cold.jsonl \
    "../Project Sylpheed - Arc of Deception (USA, Europe) (En,Ja).iso"
```

Exit code 0. Output: `ours-cold.jsonl` (28.7 MB, 121,605 events: 121,569
baseline + 36 `signal.match`), `exit-thread-state.json` (9651 bytes,
bit-identical to 2.M/2.N), `ours-cold.stderr.log` (single 2.M emission
notice line), `ours-cold.stdout.log` (empty, quiet mode).

## Patch summary

| file | purpose | LOC added |
|---|---|---|
| `crates/xenia-kernel/src/event_log.rs` | new `emit_signal_match(tid, cycle, signal_call, target_handle, waiter_count, waiter_tids)`: emits a `signal.match` schema-v1 event with FNV-1a SID lookup from the registry; `target_sid` is `null` when the handle is not registered | ~57 |
| `crates/xenia-kernel/src/exports.rs` | `snapshot_waiters_for_signal(state, handle)` (read-only ThreadRef→tid map over the per-object waiter list) + `emit_signal_match_if_waiters(state, name, handle)` shim (gathers tid + cycle, gates on `n > 0`) | ~53 |
| `crates/xenia-kernel/src/exports.rs` | 4 single-line emit calls (`ke_set_event`, `nt_set_event`, `ke_release_semaphore`, `nt_release_semaphore`), placed AFTER `audit_signal` and BEFORE `wake_eligible_waiters` so the snapshot reflects the pre-wake waiter set; `NtReleaseSemaphore` emits only on the SUCCESS path (parity with the `wake_eligible_waiters` skip on `STATUS_SEMAPHORE_LIMIT_EXCEEDED`) | 4 |
| `tools/diff-events/diff_events.py` | `ENGINE_LOCAL_KINDS += {"signal.match"}` so the new kind consumes a per-tid idx slot on the emitter side without alignment cost | 1 |

Total engine ~114 LOC, tooling +1. This is within the 150 LOC hard cap,
and within the 50-100 target if the snapshot helper is excluded.
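
For readers unfamiliar with the hash behind `target_sid`, the following is a minimal sketch of 64-bit FNV-1a using the standard offset basis and prime. Two things are assumptions here: the 64-bit width (inferred from the 16-hex-digit `target_sid` field) and the input bytes, since this report does not say what the registry actually hashes to derive a SID. `format_sid` is a hypothetical helper.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325  # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001B3              # standard FNV 64-bit prime


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # keep the result in 64 bits
    return h


def format_sid(key: bytes) -> str:
    """Render a hash as 16 lowercase hex digits, matching the payload's SID shape.

    Assumption: 'key' stands in for whatever bytes the registry really hashes.
    """
    return f"{fnv1a_64(key):016x}"
```

The XOR-then-multiply order is what distinguishes FNV-1a from FNV-1, which multiplies first.
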

Schema-v1 event emitted:

```json
{"schema_version":1,"engine":"ours","kind":"signal.match",
 "tid":<signaling tid>,"tid_event_idx":<idx>,"guest_cycle":0,
 "host_ns":<ns>,"deterministic":true,
 "payload":{"signal_call":"NtSetEvent"|"KeSetEvent"|
                          "NtReleaseSemaphore"|"KeReleaseSemaphore",
            "target_handle":"0x<8hex>",
            "target_sid":"<16hex>"|null,
            "waiter_count":<n>,"waiter_tids":[<tid>,...]}}
```

Emit policy: skip when `waiter_count == 0` (per 2.Q scope: don't pollute
the trace with no-op-target signals).
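
The per-handle counts reported below can be reproduced from the JSONL with a short tally. This is a sketch that assumes one JSON object per line and the payload field names shown in the schema above; `tally_signal_matches` is a hypothetical helper, not part of the repository tooling.

```python
import json
from collections import Counter


def tally_signal_matches(lines):
    """Count `signal.match` events per target handle.

    `lines` is any iterable of JSONL strings, for example an open file.
    Handles are normalized to integers so "0x00001028" and "0x1028" agree.
    """
    per_handle = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("kind") != "signal.match":
            continue
        handle = int(event["payload"]["target_handle"], 16)
        per_handle[handle] += 1
    return per_handle
```

A typical use would be `tally_signal_matches(open("ours-cold.jsonl"))` followed by a lookup of each wedge handle; a handle that never appears simply counts as 0.
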

## Test results

```
cargo build --release   -> OK (1 pre-existing dead_code warning unrelated)
cargo test --release    -> all suites PASS:
    xenia-kernel   227 passed, 0 failed
    xenia-cpu      149 passed, 0 failed
    xenia-app      300 passed, 0 failed
    + 30+ smaller suites, 0 failures total
```

## Disambiguation result: 5 wedge handles in 1000-1010ms window

**Caveat:** the run terminates at host_ns=1.008 s on the 50M-instr budget
(per 2.K/2.M/2.N exit-state geometry), so the "1000-1010ms window" reduces
to only ~8 ms of trace time. **Zero `signal.match` events fire at or after
950 ms (the last ~58 ms of the run).** I therefore also report counts for
the WHOLE run window [0, 1008ms]:

| wedge handle | object | hits in [1000-1010ms] | hits whole run | signaler tids | call types |
|---|---|---:|---:|---|---|
| `0x000012c8` | Thread(13) | 0 | **0** | - | - |
| `0x000012d0` | Event(sig=false) | 0 | **0** | - | - |
| `0x000012e4` | Event(sig=false) | 0 | **0** | - | - |
| `0x00001028` | Semaphore(0/2³¹-1) | 0 | **7** | {5} | NtReleaseSemaphore ×7 |
| `0x00001020` | Event(sig=false) | 0 | **0** | - | - |

**For 4 of 5 wedge handles: NO SIGNAL EVER FIRES on them in this
trajectory.** The corresponding signaler producers are never reached
in the 50M-instruction window. This is the *missing-producer /
unreached-signaler* class consistent with AUDIT-049 (tid=1 stall) and the
full AUDIT-069 lineage.

**For wedge handle 0x1028** (the AUDIT-069 work-semaphore): tid=5 issues
7 successful `NtReleaseSemaphore` calls at host_ns ∈ {468, 479, 484,
510, 642, 656, 755} ms, **each correctly observing tid=4 as the parked
waiter**. Tid=4 wakes, runs briefly, and re-parks on the same handle
(verified via per-handle `wait.begin` interleave: 8 `wait.begin` events
by tid=4 on SID `ff49a138deff7643` alternating 1:1 with the 7 releases,
ending with a final `wait.begin` at host_ns=757 ms that is never
matched). After 755 ms tid=5 stops producing. There is no
wait-completion bug on this handle; the wedge here is a *producer-stop*
at host_ns=755 ms (which is itself ~250 ms before the trace-end cap),
not a wake-failure.
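
The wake/re-wait interleave check described above can be expressed as a small timeline test. This is a sketch under assumptions: `interleave_report` is a hypothetical helper, and in the usage example only the seven release timestamps and the final 757 ms `wait.begin` come from this report. The other seven `wait.begin` timestamps are placeholders invented to show the shape of the check.

```python
def interleave_report(wait_begins_ms, releases_ms):
    """Check that waits and releases strictly alternate, starting with a wait.

    Returns (alternates, trailing_wait). trailing_wait is True when the last
    event on the timeline is a wait that no later release matches.
    Note: at equal timestamps a release sorts before a wait.
    """
    timeline = sorted(
        [(t, "release") for t in releases_ms] + [(t, "wait") for t in wait_begins_ms]
    )
    kinds = [kind for _, kind in timeline]
    starts_with_wait = bool(kinds) and kinds[0] == "wait"
    alternates = starts_with_wait and all(a != b for a, b in zip(kinds, kinds[1:]))
    trailing_wait = bool(kinds) and kinds[-1] == "wait"
    return alternates, trailing_wait


# Release times are from the report; wait times except 757 are placeholders.
releases = [468, 479, 484, 510, 642, 656, 755]
waits = [460, 469, 480, 485, 511, 643, 657, 757]
print(interleave_report(waits, releases))
```

A result of `(True, True)` corresponds to the pattern reported for 0x1028: every release is consumed by a parked waiter, and the last wait is left unmatched because the producer stopped.
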

## Wider signal/match coverage: pre-wake waiter density

To check the AUDIT-062 "signals target wrong slots" pattern more
broadly, I tallied `kernel.call` vs `signal.match` counts per signal
call (the gap = signals that fired with zero waiters parked on the
targeted handle):

| signal call | `kernel.call` count | `signal.match` count | % with any waiter |
|---|---:|---:|---:|
| NtSetEvent | 67 | 12 | **17.9%** |
| KeSetEvent | 2 | 2 | 100.0% |
| NtReleaseSemaphore | 99 | 21 | **21.2%** |
| KeReleaseSemaphore | 1 | 1 | 100.0% |
| **total** | **169** | **36** | **21.3%** |
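
The percentages in the table follow directly from the two count columns. As a quick arithmetic check, this sketch recomputes them from the counts above; `waiter_rate` is a hypothetical helper name.

```python
# (kernel.call count, signal.match count) per signal call, copied from the table.
COUNTS = {
    "NtSetEvent": (67, 12),
    "KeSetEvent": (2, 2),
    "NtReleaseSemaphore": (99, 21),
    "KeReleaseSemaphore": (1, 1),
}


def waiter_rate(calls, matches):
    """Percentage of signal calls that fired with at least one parked waiter."""
    return 100.0 * matches / calls


total_calls = sum(calls for calls, _ in COUNTS.values())
total_matches = sum(matches for _, matches in COUNTS.values())

for name, (calls, matches) in COUNTS.items():
    rate = waiter_rate(calls, matches)
    print(f"{name}: {rate:.1f}% with waiter, {100.0 - rate:.1f}% without")
print(f"total: {waiter_rate(total_calls, total_matches):.1f}% with waiter")
```

The complement of each rate is the no-waiter share: 55 of 67 NtSetEvent calls and 78 of 99 NtReleaseSemaphore calls.
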

**~82% of NtSetEvent and ~79% of NtReleaseSemaphore calls fire at a
handle with no parked waiter at signal time.** Without the AUDIT-062
SID-cross-check against canary, we can't yet say whether these are
wrong-slot misroutes or whether canary likewise fires on the same
handles with no waiter (both engines may legitimately set
manual-reset events ahead of any wait). But the pattern is consistent
with AUDIT-062's framing (*signals exist, just not landing on the
parked-waiter handles*) and inconsistent with a per-handle wake-engine
bug (where signals WOULD target the wedge handles but wakes wouldn't
fire).

## Confidence + hypothesis support

- **HIGH** that the patch is correct and observability-only: 0 test
  regressions; semantics of `wake_eligible_waiters` / `audit_signal`
  unchanged; emit happens between them.
- **HIGH** that `signal.match` events fire as designed: 36 events
  emitted, all with `waiter_count ≥ 1`; SID resolution works (most
  carry non-null `target_sid`); `host_ns` and `tid` populate correctly.
- **HIGH** that 4 of the 5 wedge handles {0x12c8, 0x12d0, 0x12e4,
  0x1020} receive zero signals in this trajectory: grepped exhaustively
  by exact handle string, no hits.
- **HIGH** that wedge handle 0x1028 is NOT a kernel-wake bug: the
  wake-rewait-wake interleave with tid=4 is clean for 7 cycles;
  the producer (tid=5) stops at 755 ms, not the wake plumbing.
- **MEDIUM-HIGH** that the global "missing-producer /
  unreached-signaler" framing is the right hypothesis. The ~80% no-waiter
  signal rate is suggestive of AUDIT-062 wrong-slot fires, but proving it
  requires the *canary* `signal.match` mirror (parallel cvar in
  xenia-canary) + cross-engine SID diff.
- **NOT SUPPORTED** by this data: a pure kernel wait-completion bug on
  the wedge handles (those handles never receive signals; we can't
  observe a wake-failure that never gets a signal-trigger).
- **PARTIALLY SUPPORTED**: AUDIT-062-class handle-lookup-bug pattern
  for the ~80% no-waiter signal calls, which can only be confirmed by
  co-observation in canary.

Per tripstone #40 (no single-keystone framing): this iterate does NOT
claim one hypothesis as THE answer. Several possibilities remain live for
distinct subsets of signals: the 0x1028 producer-stop is one class, the
4 zero-signal wedge handles are another (could be downstream of the
same root cause or independent), and the ~80% no-waiter signals are a
third.

## Tripstones audit

- **#28** (cross-engine tid stability): only intra-engine tids reported
  (waiter_tids in `signal.match` are ours-side scheduler tids). No
  cross-engine tid claims made.
- **#39** (composite progression): NO progression claim. VdSwap count
  UNCHANGED, matched-prefix UNCHANGED (signal.match is ENGINE_LOCAL in
  the diff harness, verified via `ENGINE_LOCAL_KINDS` membership).
- **#40** (single-keystone framing): explicitly NOT claiming the
  disambiguation is THE answer. The data isolates that **the wedge
  handles are not signaled** but does not prove whether the broader
  ~80% no-waiter rate is wrong-slot routing vs benign pre-wait sets.
- **#41** (categorized diff tags): `signal.match` is ENGINE_LOCAL so
  it doesn't affect the categorized harness output.
- **#42** (Phase-A blind to blocked-forever waits):
  `exit-thread-state.json` auto-emitted bit-identical to 2.M/2.N
  (verified by filesize match 9651 bytes + 13 alive threads + 10 wedge
  entries).

## Next-iterate recommendation

Two complementary directions, priority order:

**(1)** ~30-60 LOC canary-side `signal.match` mirror (same payload
shape, same cvar pattern). Run canary cold under the same
50M-equivalent budget and diff: enumerate `signal.match` events for
each of the 5 wedge handles' canary-side SIDs. If canary fires
signals on the SIDs that our wedge handles resolve to, AUDIT-062
wrong-slot is confirmed. If canary likewise fires zero on those SIDs,
the missing-producer framing is sealed and the next strategic blocker
is the producer-chain itself (sub_825070F0 fan-out per
AUDIT-066/068/069).

**(2)** ~0-40 LOC ours-side investigation of the 0x1028 producer-stop
at host_ns=755 ms: tid=5 successfully issues 7 releases then stops.
Walk its post-release control flow (LR=0x824ac578 ⇒ within the wait
wrapper; tid=5 itself goes blocked at host_ns~770 ms on event 0x12e4
per the exit-state). The stop is *because tid=5 itself wedges on a
different unsignaled event after the 7th cycle*. This is downstream of
the same wedge-graph but the "graph edge" is now explicit:
`tid=5 wedges on 0x12e4 ⇒ no more 0x1028 releases ⇒ tid=4 wedges on
0x1028`. So 0x12e4 is upstream of 0x1028 in the wedge graph; whoever
should signal 0x12e4 is the real producer-gap.
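
The wedge-graph edge described in (2) can be walked mechanically: follow "waits on" and "signaled by" links until a handle with no observed signaler is found. The sketch below encodes only the two edges stated in this report; `upstream_gap` and the two dictionaries are hypothetical names, and a real analysis would populate them from `exit-thread-state.json` and the `signal.match` events.

```python
# tid -> handle the thread is blocked on (from the exit-state, per this report).
WAITS_ON = {4: 0x1028, 5: 0x12E4}
# handle -> tid observed signaling it. 0x12e4 has no observed signaler.
SIGNALED_BY = {0x1028: 5}


def upstream_gap(tid, waits_on, signaled_by):
    """Follow the wait chain from `tid` to the first handle nobody signals.

    Returns that handle, or None if the chain ends at a runnable thread
    or loops back on itself (a wait cycle rather than a producer gap).
    """
    seen = set()
    while tid not in seen:
        seen.add(tid)
        handle = waits_on.get(tid)
        if handle is None:
            return None  # thread is not blocked, so the chain ends here
        producer = signaled_by.get(handle)
        if producer is None:
            return handle  # no known signaler: this is the producer gap
        tid = producer
    return None  # revisited a tid: the chain is a cycle


print(hex(upstream_gap(4, WAITS_ON, SIGNALED_BY)))
```

Starting from tid=4, the walk passes through tid=5 and stops at 0x12e4, matching the conclusion that 0x12e4 is upstream of 0x1028.
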

## Artifacts

Under `xenia-rs/audit-runs/iterate-2Q-signal-match/`:

- `ours-cold.jsonl` (28.7 MB, 121,605 events: 36 `signal.match` +
  121,569 baseline events bit-equal to 2.N where ENGINE_LOCAL kinds
  collapsed)
- `ours-cold.stdout.log` (empty, quiet mode)
- `ours-cold.stderr.log` (single 2.M emission notice line)
- `exit-thread-state.json` (9651 bytes; bit-identical to 2.M/2.N, 13
  threads + 10 wedge entries)
- `writer-report.md` (this file)