Files
xenia-rs/audit-runs/iterate-2T-wake-requested/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

320 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Iterate 2.T — `wake.requested` instrumentation in `wake_eligible_waiters` (writer report)
**Date:** 2026-05-28. **LOC delta:** engine **~95** (event_log.rs +55,
exports.rs ~40 across helpers + 2 in-loop snapshot/emit blocks), tooling
**+1** (diff_events.py `ENGINE_LOCAL_KINDS` extension). All retained,
cvar-gated default-off via existing `event_log::is_enabled()`.
**Tests:** 227 kernel + 19 path + 149 cpu + 300 main + ~30 smaller =
full suite PASS, 0 regressions.
**Cascade:** N/A — observability class, no semantics changed.
## Headline
**C-2B-SCHEDULER-PICK-BUG-IDENTIFIED.** The kernel-side wake primitive
(`wake_eligible_waiters`) **correctly transitions tid=6 from
Blocked → Ready EVERY TIME** a signal targets it. 5 `wake.requested`
events fire for `target_tid=6` on handle `0x000010b4` over [485, 764]ms,
**all with `transitioned=true, new_state="Ready", target_cpu=5`.** After
the 5th wake at 764.29 ms, tid=6 is Ready on CPU5 but never executes
again — last tid=6 event remains 663.60 ms — even though the trace
continues until 766.86 ms (tids 4, 5, 13 keep emitting events). The
exit-thread-state confirms tid=6 finishes in `Ready` on hw_id=5,
priority=0 — sharing CPU5 with **tid=10 at priority=15** (and tid=12
at priority=0). The `pick_runnable` rule in
`xenia-cpu/src/scheduler.rs:211-218` is strict-priority within a slot:
`max_by_key(|(i, t)| (t.priority, -(*i as i64)))`. tid=10's priority=15
deterministically beats tid=6's priority=0. **C-2a (kernel wake-call
bug) is FALSIFIED**; **C-2b (scheduler-pick-skipping Ready tid on CPU)
is CONFIRMED**, with the specific mechanism being strict-priority
starvation by a same-CPU peer (tid=10, priority=15).
## Mode
Pure observability emitter. No semantic change to `wake_eligible_waiters`
or any caller. Cvar-gated via existing `event_log::is_enabled()` and
implicitly skipped when zero waiters are eligible (the wake loop's
existing early-returns happen before the prior-state snapshot).
ENGINE_LOCAL in the diff tool — does not affect matched-prefix.
Invocation (identical to 2.Q):
```
XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec \
-n 50000000 --quiet \
--phase-a-event-log audit-runs/iterate-2T-wake-requested/ours-cold.jsonl \
"../Project Sylpheed - Arc of Deception (USA, Europe) (En,Ja).iso"
```
Exit code 0. Output: `ours-cold.jsonl` (28.7 MB, 121,641 events —
121,569 baseline + 36 `signal.match` + 36 `wake.requested`),
`exit-thread-state.json` (9651 bytes, bit-identical to 2.M/2.N/2.Q/2.S),
`ours-cold.stderr.log` (single 2.M emission-notice line),
`ours-cold.stdout.log` (empty — quiet mode).
## Patch summary
| file | purpose | LOC added |
|---|---|---:|
| `crates/xenia-kernel/src/event_log.rs` | new `emit_wake_requested(signaling_tid, cycle, target_tid, target_handle, wait_kind, transitioned, new_state, target_cpu)` — emits schema-v1 `wake.requested` event mirroring the `signal.match` shape from 2.Q | ~55 |
| `crates/xenia-kernel/src/exports.rs` | `wake_classify_state(HwState) -> (wait_kind, state_name)` + `wake_signaling_ctx(state) -> (tid, cycle)` + `emit_wake_requested_for(state, signaler, cycle, target, handle, prior_kind, prior_name)` shim (~40 LOC), plus 2 in-loop snapshot blocks inside `wake_eligible_waiters` (~14 LOC: manual-fan-out path + auto-winner path) capturing prior state BEFORE `set_wake_status_for_waitany` + `wake_ref` and emitting AFTER | ~40 |
| `crates/xenia-kernel/src/exports.rs` | `let (signaling_tid, signaling_cycle) = wake_signaling_ctx(state);` once at entry to `wake_eligible_waiters` (captures the signal-call caller's identity for the entire fan-out attribution) | 1 |
| `tools/diff-events/diff_events.py` | `ENGINE_LOCAL_KINDS += {"wake.requested"}` so the new kind consumes a per-tid idx slot on the emitter side without alignment cost (analog to 2.Q's `signal.match` addition) | 1 |
Total engine ~95 LOC, tooling +1. Within the 80-150 target / 200 hard
cap.
Schema-v1 event emitted:
```json
{"schema_version":1,"engine":"ours","kind":"wake.requested",
"tid":<signaling tid>,"tid_event_idx":<idx>,"guest_cycle":<cycle>,
"host_ns":<ns>,"deterministic":true,
"payload":{"target_tid":<woken tid>,"target_handle":"0x<8hex>",
"wait_kind":"WaitAny"|"WaitAll"|"WaitSingle"|"Other",
"transitioned":true|false,
"new_state":"Ready"|"StillBlocked"|"AlreadyReady"|"Exited"|"ServicingIrq"|"Idle"|"Other",
"target_cpu":<hw_id>|null}}
```
Emit policy: exactly ONE event per waiter the kernel touches (the wake
loop's early-returns naturally skip zero-eligible-waiter cases per
spec). Snapshot of prior state captured immediately BEFORE
`set_wake_status_for_waitany` + `wake_ref`; post-state read AFTER.
`transitioned` is the strict `prior=Blocked && post=Ready` predicate.
## Test results
```
cargo build --release -> OK (1 pre-existing dead_code warning unrelated)
cargo test --release -> all suites PASS:
xenia-kernel 227 passed, 0 failed
xenia-cpu 149 passed, 0 failed
xenia-app 300 passed, 0 failed
xenia-path 19 passed, 0 failed
+ 30+ smaller suites, 0 failures total
```
Baseline event count UNCHANGED at 121,569 (vs 2.J/2.K/2.Q/2.S);
`signal.match` count UNCHANGED at 36 (vs 2.Q/2.S); 36 NEW
`wake.requested` events emitted. Total 121,641 = 121,569 + 36 + 36.
## `wake.requested` summary (all 36 events)
| field | distribution |
|---|---|
| `transitioned` | true: 36 — **every wake successfully transitioned its target from Blocked → Ready** |
| `new_state` | "Ready": 36 — no `StillBlocked` / `AlreadyReady` / etc. |
| `wait_kind` | "WaitAny": 36 — all single-WaitAny parks (consistent with NtWaitForSingleObjectEx routing through `WaitAny{handles:[h]}` per `do_wait_single`) |
| signaler tids | tid=5:19, tid=1:9, tid=6:3, tid=2:1, tid=11:1, tid=9:1, tid=8:1, tid=13:1 |
| target tids | tid=5:11, tid=1:9, tid=4:7, **tid=6:5**, tid=8:2, tid=9:1, tid=10:1 |
**Every signal that targeted a parked waiter produced a successful
Blocked→Ready transition.** The kernel wake primitive plumbing is sound
for this trajectory.
## tid=6 deep-dive — the C-2 case
Five `wake.requested` events for `target_tid=6`, paired 1:1 with the
5 `signal.match` events on handle `0x000010b4` (same timestamps to ms
precision):
| ns (ms) | signaler | handle | transitioned | new_state | target_cpu | tid=6 next event |
|---:|---:|---|---|---|---:|---|
| 485.03 | tid=5 | 0x10b4 | true | Ready | 5 | (tid=6 runs through ~520 ms) |
| 485.14 | tid=5 | 0x10b4 | true | Ready | 5 | (back-to-back wake; coalesces) |
| 510.64 | tid=5 | 0x10b4 | true | Ready | 5 | (tid=6 runs through ~660 ms) |
| 659.00 | tid=5 | 0x10b4 | true | Ready | 5 | tid=6 last event @ 663.60 ms |
| **764.29** | **tid=5** | **0x10b4** | **true** | **Ready** | **5** | **NEVER — tid=6 emits no further events; trace continues to 766.86 ms** |
**The first 4 wakes are followed by tid=6 execution.** The 5th wake
(764.29 ms) transitions tid=6 Blocked→Ready on CPU5 — exactly as
expected — but tid=6 never gets picked by the scheduler before the
50M-instruction budget cuts off ~2.6 ms later. Other tids on the
trace continue emitting events past 764.29 ms:
| tid | last event ns (ms) | vs tid=6 quiescence |
|---:|---:|---|
| 1 | 760.12 | predecessor |
| 4 | 766.84 | **+103 ms after tid=6 last event; +2.55 ms after tid=6's final wake** |
| 5 | 766.86 | +103 ms after; the signaler still alive |
| 13 | 763.04 | +99 ms after |
So the trace is NOT truncated; tid=6 specifically is starved post-wake.
## CPU5 contention — the C-2b mechanism
Exit-thread-state CPU5 (hw_id=5) Ready set:
| tid | state | hw_id | affinity | priority | pc |
|---:|---|---:|---|---:|---|
| **6** | **Ready** | **5** | **0x20** | **0** | 0x824ab214 |
| **10** | **Ready** | **5** | **0x20** | **15** | 0x824d1404 |
| **12** | **Ready** | **5** | **0x20** | 0 | 0x824aa6a4 |
| 3 | Blocked | 5 | 0x20 | 0 | — |
Per `xenia-cpu/src/scheduler.rs:211-218`:
```rust
pub fn pick_runnable(&self) -> Option<usize> {
self.runqueue.iter().enumerate()
.filter(|(_, t)| matches!(t.state, HwState::Ready | HwState::ServicingIrq(_)))
.max_by_key(|(i, t)| (t.priority, -(*i as i64)))
.map(|(i, _)| i)
}
```
Strict-priority within a slot. tid=10 (`priority=15`) deterministically
beats tid=6 (`priority=0`) whenever tid=10 is also Ready. tid=10's
last trace event is 727.92 ms, but tid=10 is still Ready in the exit
state — meaning **tid=10 is executing past 727.92 ms in a path that
doesn't cross kernel boundaries** (no further `kernel.call`, `import.call`,
etc.), almost certainly a CPU-bound loop. That loop starves
tid=6 on CPU5 indefinitely.
This is the C-2b mechanism with concrete identification:
- **NOT a wake-call bug** (`wake.requested` fires 5× with
`transitioned=true, new_state=Ready` on tid=6) — falsifies C-2a.
- **IS a scheduler-pick-skip** but specifically a strict-priority
starvation: when a same-slot peer is at priority ≥1, lower-priority
Ready threads on that slot never run.
## Disambiguation result vs goal-spec
| outcome | gate predicate | result | conclusion |
|---|---|---|---|
| C-2A-KERNEL-WAKE-BUG | `wake.requested` for tid=6 absent OR `transitioned=false` | NO (5 events, all transitioned) | **FALSIFIED** |
| C-2B-SCHEDULER-PICK-BUG | `wake.requested` fires for tid=6 with `transitioned=true, new_state=Ready` AND tid=6 still doesn't execute | YES (5 events, last at 764.29 ms; tid=6 last event 663.60 ms; never runs after the 5th wake) | **CONFIRMED** |
| NEITHER-CLEAN-NEW-HYPOTHESIS | neither sub-hypothesis matches | NO | NOT TRIGGERED |
| RUN-FAILED | non-zero exit / crash / hang / no `wake.requested` events | NO (EXIT=0, 36 events) | NOT TRIGGERED |
**Result: C-2b CONFIRMED with concrete mechanism — strict-priority
starvation on CPU5 by tid=10 (priority=15) blocking tid=6 (priority=0).**
## Tripstone audit
- **#28** (cross-engine tid stability): No cross-engine tid claims. All
reported tids are ours-side scheduler tids; stable within this
trajectory (verified per-tid counts intact vs 2.Q/2.S baseline).
- **#39** (composite progression IS progression): NO progression claim.
VdSwap=1, draws=0, render_targets=0 — bit-identical to
2.J/2.K/2.Q/2.N/2.S. The C-2b confirmation is a *root-cause
identification*, not a progression event.
- **#40** (single-keystone framing): Care taken. The headline names
*strict-priority starvation* as the SPECIFIC mechanism behind C-2b
(not just "the scheduler is buggy"). The data isolates the priority
asymmetry (15 vs 0) and same-slot pinning (0x20=CPU5 only). Open
follow-ups are not collapsed: *why* is tid=10 priority=15 (canary
parity? game intent?), and *why* does the strict-priority scheduler
not implement starvation-avoidance (intentional simplification or
oversight)?
- **#41** (categorized diff tags): `wake.requested` is ENGINE_LOCAL in
the diff harness — doesn't affect matched-prefix.
- **#42** (Phase-A blind to blocked-forever waits): Used the always-on
`exit-thread-state.json` to confirm tid=6's final Ready state
post-trace-end (Phase-A alone would have shown only the 5 wakes
without the final unscheduled outcome).
- **#43** (no budget-cap framing): 2.S already established the 50M
budget reproduces the phenomenon; this iterate uses 50M and confirms
the C-2b mechanism is structural and reproducible in the smaller
budget. The 2.6 ms gap between the 5th wake (764.29) and tid=5/4's
final events (~766.86) is sufficient to show the scheduler had ample
opportunity to pick tid=6 and didn't.
## Comparison: 2.K → 2.Q → 2.S → 2.T
| gate | 2.K (500M, baseline) | 2.Q (50M + signal.match) | 2.S (500M + signal.match) | 2.T (50M + signal.match + wake.requested) |
|------|---:|---:|---:|---:|
| total events | 121,569 | 121,605 | 121,605 | **121,641** |
| baseline events | 121,569 | 121,569 | 121,569 | **121,569** |
| `signal.match` events | n/a | 36 | 36 | **36** |
| `wake.requested` events | n/a | n/a | n/a | **36** |
| `wake.requested` for tid=6 (transitioned=true) | n/a | n/a | n/a | **5** |
| exit-state size | 9651 | 9651 | 9651 | **9651** |
| tid=6 final state | Ready | Ready | Ready | **Ready** |
| Termination | budget hit | (50M) | budget hit (500M) | budget hit (50M) |
Baseline + signal.match counts bit-identical to 2.Q/2.S. New 36
wake.requested events 1:1 with the 36 signal.match events
(every successful signal that found a waiter resulted in a wake call
that transitioned the waiter — sanity check of the kernel wake plumbing
across all signaler tids and target handles).
## Confidence
- **HIGH** that the patch is correct and observability-only: 0 test
regressions; `wake_eligible_waiters` body unchanged except for the
snapshot reads and post-call emits.
- **HIGH** that `wake.requested` events fire as designed: 36 events
emitted, all with `transitioned=true && new_state=Ready` (which
matches the expected branch in the wake primitive — every waker
pulled off a non-empty waiter queue should transition Blocked→Ready).
- **HIGH** that tid=6 is woken 5× by tid=5 on handle 0x10b4 with
successful transition — grepped exhaustively by exact handle string
+ target_tid filter.
- **HIGH** that tid=6 emits no events after 663.60 ms despite the 5th
wake at 764.29 ms (per-tid scan).
- **HIGH** that exit-state thread geometry matches 2.M/2.N/2.Q/2.S
bit-identically (9651 bytes; tid=6 final Ready/hw_id=5/affinity=0x20).
- **HIGH** that C-2a (kernel wake-call bug) is FALSIFIED: the
wake-call IS issued AND IS transitioning tid=6 Blocked→Ready
correctly.
- **HIGH** that C-2b (scheduler-pick-skip) is CONFIRMED with strict-
priority starvation as the specific mechanism, given CPU5's exit-
state Ready set has tid=10 at priority=15 (vs tid=6 at priority=0)
pinned to affinity 0x20.
- **MEDIUM-HIGH** that tid=10 is in a CPU-bound non-kernel-crossing
loop after 727.92 ms (inferred from "still Ready at exit" + "no
trace events after 727.92 ms"). Could also be a tight kernel.call
loop where the trace happens to not have a corresponding emit — but
on CPU5 with priority advantage, tid=10 keeps the slot.
## Next-iterate recommendation
Three complementary directions, priority order:
**(1) 2.U — investigate why tid=10 is priority=15 / what tid=10 is
doing post-727.92ms (~0 LOC observability OR ~50 LOC PC-trace +
KeSetBasePriorityThread hook).** Pull from existing audit DB
(`sylpheed.db`) the basic block containing PC `0x824d1404` (tid=10's
final PC) — find the loop body. Compare with canary's analog tid to
see whether canary's same-PC thread also runs at priority=15. If canary
runs the same code at a different priority OR has a yield/sleep in the
loop, that's the divergence point. If canary is identical, then the
problem is **ours's scheduler lacks fair preemption** that canary's
host-OS-thread scheduler provides naturally (canary multiplexes guest
threads onto OS threads, getting OS-level fairness for free; ours's
cooperative-pick-runnable model needs explicit anti-starvation).
**(2) 2.V — add starvation-avoidance to `pick_runnable`** (~30-80 LOC
engine, semantic change). E.g., dynamic priority aging or a
quantum-expiration override that occasionally picks the highest-priority
*not-recently-run* thread. Risk: changes scheduling semantics —
matched-prefix and Phase-A trace counts will shift; needs careful
parity check vs canary. **NOT recommended as next** without first
establishing canary baseline (2.U).
**(3) 2.W — canary-side `wake.requested` mirror (~30-60 LOC C++)** for
cross-engine parity diff. Lower urgency than 2.U because the ours-side
data alone is sufficient to identify the scheduler-pick mechanism;
canary parity becomes relevant only if 2.U finds tid=10 is intentionally
high-priority and canary handles it via OS-level fairness.
**Recommended:** 2.U first (~30 min DB lookup + cross-engine check),
then 2.V or 2.W depending on what 2.U reveals.
## Artifacts
Under `xenia-rs/audit-runs/iterate-2T-wake-requested/`:
- `ours-cold.jsonl` (28.7 MB, 121,641 events = 121,569 baseline + 36
`signal.match` + 36 `wake.requested`)
- `ours-cold.stdout.log` (empty — quiet mode)
- `ours-cold.stderr.log` (single 2.M emission-notice line)
- `exit-thread-state.json` (9651 bytes; bit-identical to
2.M/2.N/2.Q/2.S — 13 threads + 10 wedge entries)
- `writer-report.md` (this file)
Engine HEAD `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` + uncommitted
2.Q signal.match patch + this iterate's 2.T wake.requested patch.
xenia-canary UNCHANGED.