xenia-rs/audit-runs/audit-069-wait-signal-producer/writer-report-v3.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# AUDIT-069 Session 3 — writer report v3
- Date: 2026-05-20
- xenia-rs HEAD: `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` (UNCHANGED from S1/S2)
- `git diff HEAD | sha256sum`: `ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357`
  (UNCHANGED at start AND end of S3)
- No canary instrumentation added this session.
- No ours source modifications. `--lr-trace` is a runtime flag (main.rs:233-243).
## Headline (HIGH confidence, direct measurement)
ours's tid=5 (= canary tid=10 by entry/ctx identity) fires the γ-signaler
family from the SAME guest LRs as canary, but **only 81 times where
canary fires 492 times (16%)**. This is NOT a "wrong-handle" bug; it is
a **producer-loop underrun**. The dispatch loop in `sub_82450A68` exits
early or starves; consumer threads then block on events that ours never
gets to signal.

S2's "the producer fires identically, just selects wrong handles" framing
is REFINED, not falsified: the producer reaches the wrappers via the
EXACT same call sites but completes ~6× fewer iterations (492 / 81 ≈ 6.1).
## Method
Read-only `--lr-trace=0x824AA2F0,0x824AAF50` on cold ours boot, 1.5B
instructions / 47 s wallclock (re-validated at 5B / 159 s: same 81
fires, same handle universe, same import_calls=39290, so no new work after
the producer's initial burst). JSONL output to `s3/ours-lr-trace.jsonl`.

Cross-engine paired against S1's `signal-probe-correlated.log` (canary
data, fresh 2026-05-20).
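
The per-LR tally for one thread can be sketched as below. This is a minimal sketch, not the tooling used in this session: the record keys `tid`, `pc` and `lr` are assumptions (the schema of `s3/ours-lr-trace.jsonl` is not shown in this report), and the sample records are invented for illustration.

```python
import json
from collections import Counter


def count_fires(lines, tid):
    """Count wrapper entries per caller LR for one guest thread."""
    fires = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace
        rec = json.loads(line)
        if rec["tid"] == tid:
            fires[rec["lr"]] += 1
    return fires


# Hypothetical sample records (field names and values are NOT from the real trace).
sample = [
    '{"tid": 5, "pc": "0x824AA2F0", "lr": "0x8245DA44"}',
    '{"tid": 5, "pc": "0x824AAF50", "lr": "0x8245DC5C"}',
    '{"tid": 5, "pc": "0x824AAF50", "lr": "0x8245DC5C"}',
    '{"tid": 7, "pc": "0x824AA2F0", "lr": "0x82001234"}',
]

fires = count_fires(sample, tid=5)
for lr, n in sorted(fires.items()):
    print(lr, n)
```

Against the real file, `lines` would be the open file handle, with the key names adjusted to whatever the trace actually emits.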
## Per-LR fire counts
| caller LR | symbol | wrapper PC | canary tid=10 | ours tid=5 | ratio |
|---|---|---|---:|---:|---:|
| 0x8245DA44 | γ-D-A (sub_8245D9D8) | 0x824AA2F0 | 23 | 5 | 22% |
| 0x8245DB08 | γ-D-B (sub_8245DA78) | 0x824AA2F0 | 8 | 1 | 12% |
| 0x8245DC5C | γ-DB40 (sub_8245DB40 NEW) | 0x824AAF50 | 461 | 75 | 16% |
| **TOTAL** | | | **492** | **81** | **16%** |

ours runs the same producer code, but the loop terminates early. S2's per-PC
fire-count table shows ours = 6/1/75 for the three γ-fns, so this S3
wrapper-entry data (5/1/75) agrees with S2 on γ-D-B and γ-DB40 and is one
fire lower on γ-D-A.
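
The ratio column and the overall underrun factor follow directly from the fire counts. A short check, using only the numbers in the table above:

```python
# (caller LR, canary tid=10 fires, ours tid=5 fires), copied from the table.
ROWS = [
    (0x8245DA44, 23, 5),
    (0x8245DB08, 8, 1),
    (0x8245DC5C, 461, 75),
]

canary_total = sum(canary for _, canary, _ in ROWS)
ours_total = sum(ours for _, _, ours in ROWS)

for lr, canary, ours in ROWS:
    print(f"{lr:#010x}  canary={canary:3d}  ours={ours:2d}  ratio={ours / canary:.1%}")

share = ours_total / canary_total   # fraction of canary's fires that ours reaches
factor = canary_total / ours_total  # how many times fewer iterations ours completes
print(f"TOTAL  canary={canary_total}  ours={ours_total}  "
      f"ratio={share:.1%}  underrun={factor:.1f}x")
```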
## Handle namespaces are incomparable by raw ID
- canary uses `XEvent::native_object()` pseudo-handles `F8000xxx` (high bit
  set, encodes a synthetic ID assigned by `XObject::GetNativeObject`).
- ours uses normal slot IDs `0x10xx` from the handle-slot allocator.

Comparison must therefore be by (a) **position in the per-LR sequence** and (b)
**call args** (r5, which carries the size or signal kind; r4 holds
buffer/ctx pointers whose absolute values differ between engines).
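
A position-aligned comparison of this kind can be sketched as follows. The record layout (`lr`, `handle`, `r5` keys) and the sample data are assumptions made for illustration; the point is that raw handle IDs are never compared, only the call argument at each position of each per-LR sequence.

```python
from collections import defaultdict
from itertools import zip_longest


def by_lr(records):
    """Group records into per-LR call sequences, preserving trace order."""
    seqs = defaultdict(list)
    for rec in records:
        seqs[rec["lr"]].append(rec)
    return seqs


def first_mismatch(ours, canary, key="r5"):
    """Map each LR to the first position where `key` differs or one side runs out.

    LRs whose sequences match at every position are omitted from the result.
    """
    ours_seqs, canary_seqs = by_lr(ours), by_lr(canary)
    result = {}
    for lr in sorted(set(ours_seqs) | set(canary_seqs)):
        pairs = zip_longest(ours_seqs.get(lr, []), canary_seqs.get(lr, []))
        for pos, (o, c) in enumerate(pairs):
            if o is None or c is None or o[key] != c[key]:
                result[lr] = pos
                break
    return result


# Hypothetical sequences: same args at each position, but ours stops one call sooner.
ours = [
    {"lr": 0x8245DC5C, "handle": 0x10E0, "r5": 0x800},
    {"lr": 0x8245DC5C, "handle": 0x10E0, "r5": 0x800},
]
canary = [
    {"lr": 0x8245DC5C, "handle": 0xF80000C8, "r5": 0x800},
    {"lr": 0x8245DC5C, "handle": 0xF80000C8, "r5": 0x800},
    {"lr": 0x8245DC5C, "handle": 0xF80000C8, "r5": 0x800},
]

mismatch = first_mismatch(ours, canary)
print({hex(lr): pos for lr, pos in mismatch.items()})
```

A mismatch position equal to the length of the shorter sequence indicates a count divergence rather than an argument divergence.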
## Position-0 args MATCH (HIGH confidence, direct measurement)
| LR | r5 (size / kind) | matches? |
|---|---|---|
| 0x8245DC5C | ours=0x800 / canary=0x800 | YES |
| 0x8245DA44 | ours=2 (Set) / canary=2 | YES |
| 0x8245DB08 | ours=2 / canary=2 | YES |

r4 (buffer/ctx pointers) DIFFER in absolute address (different memory
layouts) but are TYPE-shaped identically. The first invocation of each
signaler is structurally identical. The divergence is in COUNT of
subsequent loop iterations, not in handle-selection of position-0.

See `s3/handle-sequence-diff.md` for full position-aligned table.
## γ-DB40 signal-target distribution (the 461-vs-75 case)
| canary handle | count | ours handle | count |
|---|---:|---|---:|
| F80000C8 | 229 | 0x000010E0 | 69 |
| F80000DC | 79 | 0x00001040 | 1 |
| F8000078 | 71 | 0x0000105C | 1 |
| F80000BC | 39 | 0x00001098 | 1 |
| F800012C | 28 | 0x000010AC | 1 |
| F80000B4 | 7 | 0x000010D0 | 1 |
| F8000044 | 4 | 0x0000121C | 1 |

Shape: both have one dominant handle and a long tail, but the dominant
handle absorbs about half the signals in canary (229/461 = 50%) and nearly
all of them in ours (69/75 = 92%). ours's tail is truncated: only 7
distinct handles in γ-DB40 vs canary's 10+.

This is consistent with **the producer enqueues the same kinds of work
items but the upstream feeder under-fires**, so the dominant work-item
(ours handle `0x10E0`, paired with canary `F80000C8` by position) gets
some iterations, the next-most-common items get truncated to 1×, and the
long tail (canary's `F80000DC` 79× / `F8000078` 71×) is mostly missing.
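
The dominant-handle shares can be recomputed from the distribution table. The sketch below uses only the rows listed there plus the γ-DB40 totals from the per-LR table; the canary rows shown sum to less than 461, so the remainder belongs to handles the table does not list.

```python
# Per-handle signal counts copied from the distribution table (listed rows only).
CANARY = {0xF80000C8: 229, 0xF80000DC: 79, 0xF8000078: 71, 0xF80000BC: 39,
          0xF800012C: 28, 0xF80000B4: 7, 0xF8000044: 4}
OURS = {0x10E0: 69, 0x1040: 1, 0x105C: 1, 0x1098: 1,
        0x10AC: 1, 0x10D0: 1, 0x121C: 1}
CANARY_TOTAL, OURS_TOTAL = 461, 75  # γ-DB40 totals from the per-LR table


def dominant_share(counts, total):
    """Return the most-signaled handle and its share of all signals."""
    handle, count = max(counts.items(), key=lambda kv: kv[1])
    return handle, count / total


canary_top, canary_share = dominant_share(CANARY, CANARY_TOTAL)
ours_top, ours_share = dominant_share(OURS, OURS_TOTAL)

# Signals on handles that the table does not list.
canary_unlisted = CANARY_TOTAL - sum(CANARY.values())
ours_unlisted = OURS_TOTAL - sum(OURS.values())

print(f"canary: {canary_top:#x} {canary_share:.0%}, unlisted signals: {canary_unlisted}")
print(f"ours:   {ours_top:#x} {ours_share:.0%}, unlisted signals: {ours_unlisted}")
```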
## Wedge handle status (HIGH confidence)
AUDIT-062 archive recorded ours wedge handles `0x12AC` and `0x12B8` with
`<NO_SIGNALS_DESPITE_WAITS>` annotation in a deeper-boot run.

In S3's lr-trace: **handle 0x12AC count = 0, handle 0x12B8 count = 0**.
**No handle > 0x121C appears in tid=5's signal trace at all.**
Max handle observed in this run: 0x121C (cache:/aab216c3 NtCreateFile).

The wedge handles are NEVER allocated in this 5B-instruction run, because
boot terminates **before** the trajectory that would create them. The
producer fires 81 times, then tid=5 goes quiet; the import_call counter
freezes at 39,290; `--halt-on-deadlock` does NOT trigger (consumers wait
on existing events that were never the wedge in this run).

**This is a stronger statement than "the wedge handle is never signaled":
the wedge handle is never even CREATED, because the boot never reaches
the point of creating it.** ours's boot trajectory is truncated by the
producer underrun upstream.
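
The absence check is simple set arithmetic. The sketch below uses only the ours handles listed in the γ-DB40 distribution table, so it is a subset of the full tid=5 trace and illustrates the check rather than reproducing it.

```python
WEDGE_HANDLES = {0x12AC, 0x12B8}  # from the AUDIT-062 archive

# ours handles listed in the γ-DB40 distribution table. The full tid=5 trace
# also covers the other wrapper, so this is only the subset shown in this report.
OBSERVED = {0x10E0, 0x1040, 0x105C, 0x1098, 0x10AC, 0x10D0, 0x121C}

signaled_wedges = WEDGE_HANDLES & OBSERVED
max_observed = max(OBSERVED)
beyond_max = {h for h in WEDGE_HANDLES if h > max_observed}

print("wedge handles signaled:", sorted(hex(h) for h in signaled_wedges))
print("max observed handle:", hex(max_observed))
print("wedge handles above max:", sorted(hex(h) for h in beyond_max))
```

Both wedge handles sit above the highest handle seen, which matches the reading that they were never created rather than created and left unsignaled.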
## Classification: producer-loop underrun (HIGH confidence)
NOT a race (timing-dependent), NOT a wrong-handle bug (the args at
matching positions are structurally identical), NOT a
missing-kernel-handler bug (the signals that DO fire pass through
bit-equivalent wrappers).

It is **producer-loop underrun**: sub_82450A68's dispatch loop iterates
fewer times. One of the following holds:

1. The work queue (read from guest memory by sub_82450A68) is populated
   with fewer items by some upstream feeder.
2. The dispatch loop's exit condition trips early.
3. The thread blocks on a dispatcher event that never gets re-signaled.

Mechanism candidates (S4 to discriminate):

- **upstream feeder**: callers of sub_8244FEA8 (11 sites in DB); one
  enqueues less work in ours. Most likely the audio cluster
  (sub_8225EE20) or sub_82452DC0 (2 calls) given they relate to
  APUBUG-PRODUCER-001 territory.
- **dispatch loop exit**: the loop reads a flag from the dispatcher
  struct at `0x828F3B68 + offset`; a state divergence there exits early.
- **inner KeWait at 0x824AB240** (mentioned in S1 spawn-chain notes):
  if this wait times out / fails differently in ours, the loop exits.
## Reading-error registry
NO new reading-error class needed. This session confirms one existing
class and reframes one earlier finding:
- **#28 cross-engine tid label mismatch**: used correctly here
  (compared by entry/ctx, not by tid integer).
- **AUDIT-062 "wrong handles" framing** is a SYMPTOM of the producer
  underrun (fewer signals → some handles signaled, others starved),
  not a separate bug.
## Cascade
- **A** (capture ours per-PC signaler firings): PASS (137 records, 81 on tid=5).
- **B** (parallel canary sequence from S1): PASS (492 records on tid=10).
- **C** (first-mismatch identification): PASS — divergence is in iteration
count, not in handle-at-position-0. Position-0 args match structurally.
- **D** (race-vs-missing-signal classification): PASS — neither pure race
nor pure missing-signal. It is **producer-loop underrun** (boot doesn't
  reach the wedge-handle-creating subsystem).

Net 4/4 PASS.
## S4 recommendation (refined)
**Drop the "wrong-handles-from-γ-signaler" framing.** Focus upstream on
WHY tid=5's dispatch loop runs ~6× fewer iterations (81 vs 492 fires).
### Path A (RECOMMENDED, ~30 LOC ours-only diagnostic, no source mod)
Use `--lr-trace=0x82450A68` (the dispatch-loop body PC) plus the existing
`--branch-probe` to see WHERE in the loop body ours exits. If the loop has
a backward branch at offset X and ours's last fire is at offset Y < X, the
loop is exiting early. Pair with the inner `bl 0x824AB240` (KeWaitForMultipleObjects)
to see if the loop blocks on a wait that returns differently than canary.
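
The decision rule in Path A can be written down as a small classifier over the ordered PCs hit in the final loop iteration. This is a hedged sketch: the backward-branch PC is unknown until the loop body is disassembled, so the value used in the example is a made-up placeholder, and the category names are invented here.

```python
LOOP_ENTRY = 0x82450A68    # dispatch-loop body PC from this report
WAIT_CALL_PC = 0x824AB240  # target of the inner wait call from this report


def classify_last_iteration(fired_pcs, backward_branch_pc, wait_pc=WAIT_CALL_PC):
    """Classify how the final iteration ended, given the PCs it hit in order."""
    if not fired_pcs:
        return "never-entered"
    last = fired_pcs[-1]
    if last == wait_pc:
        # Last observed PC is the wait itself: the thread entered it and
        # no later loop PC fired.
        return "stuck-in-wait"
    if last < backward_branch_pc:
        # Last fire (offset Y) precedes the backward branch (offset X): early exit.
        return "early-exit"
    return "looped-back"


# Placeholder backward-branch PC; the real offset X is not known in this report.
HYPOTHETICAL_BACK_BRANCH = LOOP_ENTRY + 0x120

print(classify_last_iteration([LOOP_ENTRY, LOOP_ENTRY + 0x28], HYPOTHETICAL_BACK_BRANCH))
print(classify_last_iteration([LOOP_ENTRY, WAIT_CALL_PC], HYPOTHETICAL_BACK_BRANCH))
```

Distinguishing a wait that never returns from one that returns with a different status would additionally need the wait's return value, which this sketch does not model.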
### Path B (~80 LOC ours-only) — feeder-side capture
`--lr-trace=0x8244FEA8` on cold ours AND canary. The spawn-helper has 11
static call sites in the DB-derived caller list and fires 7× at runtime in
S2's ours run. Pair r3/r4 (the spawned thread's start_ctx args) with canary's
equivalent. ours may be missing one or more enqueues; a missing enqueue
would be the upstream root cause.
### Path C (~250 LOC, larger) — work-queue struct disassembly
Disassemble sub_82450A68 body, identify the work-queue struct it reads
from (likely at `[r29 + N]` where r29 = start_ctx 0x828F3B68 or a derived
pointer). Watch the struct with `--mem-watch` to identify the populator
(which fn writes the queue items). Trace that populator upstream.

LOC budget for S4: Path A ~30, Path B ~80, Path C ~250.

**Path A first**: it gives the precise exit condition (loop-body branch vs
inner-wait timeout) with no source modification and the smallest LOC budget.
## Discipline
- xenia-rs HEAD UNCHANGED (sha256 of `git diff HEAD` matches S1/S2 end).
- No source modifications.
- `--lr-trace` is read-only, lockstep-digest-unaffected (per state.rs:1463-1500).
- No canary run this session (S1's data is fresh).
- No canary cache to wipe (no canary run).
- ours runs cold (no cache pre-population).
## Artifacts
```
audit-runs/audit-069-wait-signal-producer/s3/
ours-lr-trace.jsonl (137 records, both PCs, all tids)
ours-lr-trace.stderr (run log + counters)
ours-lr-trace.stdout (empty under --quiet)
ours-lr-trace-824AA2F0.log (60 records, NtSetEvent wrapper)
ours-lr-trace-824AAF50.log (77 records, Ke wrapper)
ours-lr-trace-extended.{jsonl,stderr,stdout} (5B-instr re-validation: same 81 fires)
handle-sequence-diff.md (parallel comparison + first-mismatch table)
writer-report-v3.md (this file)
```
No fresh canary run was needed — S1's `signal-probe-correlated.log`
(154,187 lines) carries all canary signal-probe data.
## Summary of S1 → S2 → S3 progression
- **S1**: identified canary's tid=10 as the signaler; claimed ours lacks
this thread (FALSIFIED by S2).
- **S2**: spawn-chain runs identically on ours tid=5; refined to "wrong-
handle selection" downstream (REFINED by S3).
- **S3**: ours runs identical PC/LR chain but with ~6× fewer iterations
  (81 vs 492 fires). Loop underrun classification. Wedge handle never even
  gets created in ours's truncated boot trajectory.

The bug is **upstream of the γ-signaler**: in WHAT the dispatch loop
reads from the work queue, or in the loop's exit condition.