handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions

@@ -0,0 +1,229 @@
# AUDIT-069 Session 3 — writer report v3
Date: 2026-05-20
xenia-rs HEAD: `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` (UNCHANGED from S1/S2)
`git diff HEAD | sha256sum`: `ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357`
(UNCHANGED at start AND end of S3)
No canary instrumentation added this session.
No ours source modifications. `--lr-trace` is a runtime flag (main.rs:233-243).
## Headline (HIGH confidence, direct measurement)
ours's tid=5 (= canary tid=10 by entry/ctx identity) fires the γ-signaler
family from the SAME guest LRs as canary — but **only 81 times where
canary fires 492 times (16%)**. This is NOT a "wrong-handle" bug — it is
a **producer-loop underrun**. The dispatch loop in `sub_82450A68` exits
early or starves; consumer threads then block on events that ours never
gets to signal.
S2's "the producer fires identically, just selects wrong handles" framing
is REFINED, not falsified: the producer reaches the wrappers via the
EXACT same call sites but completes ~6× fewer iterations (81 vs 492 fires).
## Method
Read-only `--lr-trace=0x824AA2F0,0x824AAF50` on cold ours boot, 1.5B
instructions / 47 s wallclock (and re-validated at 5B / 159s — same 81
fires, same handle universe, same import_calls=39290 → no new work after
the producer's initial burst). JSONL output to s3/ours-lr-trace.jsonl.
Cross-engine paired against S1's `signal-probe-correlated.log` (canary
data, fresh 2026-05-20).
## Per-LR fire counts
| caller LR | symbol | wrapper PC | canary tid=10 | ours tid=5 | ratio |
|---|---|---|---:|---:|---:|
| 0x8245DA44 | γ-D-A (sub_8245D9D8) | 0x824AA2F0 | 23 | 5 | 22% |
| 0x8245DB08 | γ-D-B (sub_8245DA78) | 0x824AA2F0 | 8 | 1 | 12% |
| 0x8245DC5C | γ-DB40 (sub_8245DB40 NEW) | 0x824AAF50 | 461 | 75 | 16% |
| **TOTAL** | | | **492** | **81** | **16%** |
ours runs the same producer code, but the loop terminates early. S2's per-PC
fire-count table also shows ours = 6/1/75 for the three γ-fns — this S3
data agrees with S2 for the wrapper-entry side too.
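
The ratio column can be rechecked from the raw counts. A minimal sketch in Python, with the counts copied from the table above; the dictionary layout and function name are illustrative and not part of the xenia-rs tooling:

```python
# Fire counts copied from the per-LR table: caller LR -> (canary tid=10, ours tid=5).
FIRES = {
    0x8245DA44: (23, 5),    # γ-D-A
    0x8245DB08: (8, 1),     # γ-D-B
    0x8245DC5C: (461, 75),  # γ-DB40
}

def totals(fires):
    """Sum canary and ours fire counts across all caller LRs."""
    canary = sum(c for c, _ in fires.values())
    ours = sum(o for _, o in fires.values())
    return canary, ours

canary_total, ours_total = totals(FIRES)
for lr, (canary, ours) in FIRES.items():
    print(f"{lr:#010x}  canary={canary:4d}  ours={ours:3d}  ratio={ours / canary:.1%}")
print(f"TOTAL       canary={canary_total:4d}  ours={ours_total:3d}  "
      f"ratio={ours_total / canary_total:.1%}  "
      f"({canary_total / ours_total:.1f}x fewer)")
```

Printing one decimal place makes the per-LR spread visible, which the rounded percentages in the table partly hide.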
## Handle namespaces are incomparable by raw ID
- canary uses `XEvent::native_object()` pseudo-handles `F8000xxx` (high bit
set, encodes a synthetic ID assigned by `XObject::GetNativeObject`).
- ours uses normal slot IDs `0x10xx` from the handle-slot allocator.
Comparison must be by (a) **position in the per-LR sequence** and (b)
**call args** (r5 size / signal-kind value; r4 pointer shape).
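
A position-aligned comparison of this kind could look roughly like the sketch below. The record keys (`lr`, `handle`, `r5`) are assumed for illustration; the real JSONL schema of `--lr-trace` is not shown in this report and may differ. Raw handle IDs are deliberately never compared.

```python
from collections import defaultdict
from itertools import zip_longest

def by_lr(records):
    """Group records by caller LR, preserving trace order within each LR."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["lr"]].append(rec)
    return groups

def first_mismatch(ours, canary, key="r5"):
    """Return {lr: (position, ours_value, canary_value)} for the first position
    where the compared arg differs or one side runs out of records."""
    ours_g, canary_g = by_lr(ours), by_lr(canary)
    result = {}
    for lr in sorted(set(ours_g) | set(canary_g)):
        pairs = zip_longest(ours_g.get(lr, []), canary_g.get(lr, []))
        for pos, (o, c) in enumerate(pairs):
            o_val = o[key] if o is not None else None
            c_val = c[key] if c is not None else None
            if o_val != c_val:
                result[lr] = (pos, o_val, c_val)
                break
    return result

# Hypothetical two-record example: same args at position 0, ours stops first.
ours = [{"lr": 0x8245DC5C, "handle": 0x10E0, "r5": 0x800}]
canary = [{"lr": 0x8245DC5C, "handle": 0xF80000C8, "r5": 0x800},
          {"lr": 0x8245DC5C, "handle": 0xF80000C8, "r5": 0x800}]
print(first_mismatch(ours, canary))
```

A `None` on the ours side means ours ran out of records at that position, which is the count divergence rather than an argument divergence.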
## Position-0 args MATCH (HIGH confidence, direct measurement)
| LR | r5 (size / kind) | matches? |
|---|---|---|
| 0x8245DC5C | ours=0x800 / canary=0x800 | YES |
| 0x8245DA44 | ours=2 (Set) / canary=2 | YES |
| 0x8245DB08 | ours=2 / canary=2 | YES |
r4 (buffer/ctx pointers) DIFFER in absolute address (different memory
layouts) but TYPE-shaped identically. The first invocation of each
signaler is structurally identical. The divergence is in COUNT of
subsequent loop iterations, not in handle-selection of position-0.
See `s3/handle-sequence-diff.md` for full position-aligned table.
## γ-DB40 signal-target distribution (the 461-vs-75 case)
| canary handle | count | ours handle | count |
|---|---:|---|---:|
| F80000C8 | 229 | 0x000010E0 | 69 |
| F80000DC | 79 | 0x00001040 | 1 |
| F8000078 | 71 | 0x0000105C | 1 |
| F80000BC | 39 | 0x00001098 | 1 |
| F800012C | 28 | 0x000010AC | 1 |
| F80000B4 | 7 | 0x000010D0 | 1 |
| F8000044 | 4 | 0x0000121C | 1 |
Shape: both distributions have one dominant handle and a long tail, but the
dominant share differs (canary 229/461 = 50%, ours 69/75 = 92%). ours's tail
is truncated: only 7 distinct handles in γ-DB40 vs canary's 10+.
This is consistent with **the producer enqueues the same kinds of work
items but the upstream feeder under-fires**, so the dominant work-item
(ours `0x10E0`, paired with canary `F80000C8` by position) gets some iterations,
the next-most-common items get truncated to 1×, and the long tail
(canary's `F80000DC` 79× / `F8000078` 71×) is mostly missing.
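
The shape comparison can be reproduced from the table rows. A minimal sketch, assuming only the rows listed above; the handle strings are labels, not values to be compared across engines:

```python
# Per-handle signal counts copied from the γ-DB40 table (listed rows only).
CANARY = {"F80000C8": 229, "F80000DC": 79, "F8000078": 71, "F80000BC": 39,
          "F800012C": 28, "F80000B4": 7, "F8000044": 4}
OURS = {"0x10E0": 69, "0x1040": 1, "0x105C": 1, "0x1098": 1,
        "0x10AC": 1, "0x10D0": 1, "0x121C": 1}

def dominant_share(counts, total):
    """Return (handle, share of total) for the most-signaled handle."""
    handle = max(counts, key=counts.get)
    return handle, counts[handle] / total

# Totals come from the per-LR table. The canary rows listed here sum to 457
# of 461, so the remaining 4 signals go to handles not shown in the table.
print(dominant_share(CANARY, 461))
print(dominant_share(OURS, 75))
print("handles signaled exactly once in ours:",
      sum(1 for n in OURS.values() if n == 1))
```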
## Wedge handle status (HIGH confidence)
AUDIT-062 archive recorded ours wedge handles `0x12AC` and `0x12B8` with
`<NO_SIGNALS_DESPITE_WAITS>` annotation in a deeper-boot run.
In S3's lr-trace: **handle 0x12AC count = 0, handle 0x12B8 count = 0**.
**No handle > 0x121C appears in tid=5's signal trace at all.**
Max handle observed in this run: 0x121C (cache:/aab216c3 NtCreateFile).
The wedge handles are NEVER allocated in this 5B-instruction run, because
boot terminates **before** the trajectory that would create them. The
producer fires 81 times, then tid=5 goes quiet; the import_call counter
freezes at 39,290; `--halt-on-deadlock` does NOT trigger (consumers wait
on existing events that were never the wedge in this run).
**This is a stronger statement than "the wedge handle is never signaled":
the wedge handle is never even CREATED, because the boot never reaches
the point of creating it.** ours's boot trajectory is truncated by the
producer underrun upstream.
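
The trace-side half of this check can be sketched as follows. The input is assumed to be a flat list of signaled handle values already extracted from the lr-trace; the extraction step and the function name are hypothetical.

```python
WEDGE_HANDLES = {0x12AC, 0x12B8}  # wedge handles recorded in the AUDIT-062 archive

def wedge_status(signaled_handles, wedge=WEDGE_HANDLES):
    """Report the highest handle seen and any wedge handles that were signaled."""
    seen = set(signaled_handles)
    return {
        "max_handle": max(seen) if seen else None,
        "wedge_signaled": sorted(seen & wedge),
    }

# Example input: the seven ours handles from the γ-DB40 table.
status = wedge_status([0x10E0, 0x1040, 0x105C, 0x1098, 0x10AC, 0x10D0, 0x121C])
print({k: (hex(v) if isinstance(v, int) else v) for k, v in status.items()})
```

On its own this only shows that the wedge handles were not signaled. The stronger "never created" reading rests on the maximum handle observed in the run being below both wedge handle values.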
## Classification: producer-loop underrun (HIGH confidence)
NOT a race (timing-dependent), NOT a wrong-handle bug (the args at
matching positions are structurally identical), NOT a missing-kernel-
handler bug (the signals that DO fire pass through bit-equivalent
wrappers).
It is **producer-loop underrun**: sub_82450A68's dispatch loop iterates
fewer times. Either:
1. The work queue (read from guest memory by sub_82450A68) is populated
with fewer items by some upstream feeder.
2. The dispatch loop's exit condition trips early.
3. The thread blocks on a dispatcher event that never gets re-signaled.
Mechanism candidates (S4 to discriminate):
- **upstream feeder**: callers of sub_8244FEA8 (11 sites in DB) — one
enqueues less work in ours. Most likely the audio cluster
(sub_8225EE20) or sub_82452DC0 (2 calls) given they relate to APUBUG-
PRODUCER-001 territory.
- **dispatch loop exit**: the loop reads a flag from the dispatcher
struct at `0x828F3B68 + offset`; a state divergence there exits early.
- **inner KeWait at 0x824AB240** (mentioned in S1 spawn-chain notes):
if this wait times out / fails differently in ours, the loop exits.
## Reading-error registry
NO new reading-error class needed. This session confirms one existing
class and re-reads one earlier framing:
- **#28 cross-engine tid label mismatch** — used correctly here
(compared by entry/ctx, not by tid integer).
- **AUDIT-062 "wrong handles" framing** is a SYMPTOM of the producer
underrun (fewer signals → some handles signaled, others starved),
not a separate bug.
## Cascade
- **A** (capture ours per-PC signaler firings): PASS (137 records, 81 on tid=5).
- **B** (parallel canary sequence from S1): PASS (492 records on tid=10).
- **C** (first-mismatch identification): PASS — divergence is in iteration
count, not in handle-at-position-0. Position-0 args match structurally.
- **D** (race-vs-missing-signal classification): PASS — neither pure race
nor pure missing-signal. It is **producer-loop underrun** (boot doesn't
reach the wedge-handle-creating subsystem).
Net 4/4 PASS.
## S4 recommendation (refined)
**Drop the "wrong-handles-from-γ-signaler" framing.** Focus upstream on
WHY tid=5's dispatch loop runs ~6× fewer iterations (81 vs 492 fires).
### Path A (RECOMMENDED, ~30 LOC ours-only diagnostic, no source mod)
Use `--lr-trace=0x82450A68` (the dispatch-loop body PC) plus the existing
`--branch-probe` to see WHERE in the loop body ours exits. If the loop has
a backward branch at offset X and ours's last fire is at offset Y < X, the
loop is exiting early. Pair with the inner `bl 0x824AB240` (KeWaitForMultipleObjects)
to see if the loop blocks on a wait that returns differently than canary.
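
The decision rule for Path A can be written down as a small classifier. This is a sketch only: the offsets are placeholders to be filled in from the disassembly of `sub_82450A68`, and the function and label names are made up for illustration.

```python
def classify_last_iteration(last_offset, back_branch_offset, wait_call_offset):
    """Classify where the final observed loop iteration stopped.

    All offsets are byte offsets from the loop entry PC (hypothetical inputs).
    """
    if last_offset == wait_call_offset:
        # Last fire is the inner wait call: the wait result needs inspecting.
        return "stopped-at-inner-wait"
    if last_offset < back_branch_offset:
        # Never reached the backward branch: the loop exited early.
        return "early-exit-before-back-branch"
    return "reached-back-branch"

# Placeholder offsets, not measured values.
print(classify_last_iteration(last_offset=0x40, back_branch_offset=0x90,
                              wait_call_offset=0x60))
```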
### Path B (~80 LOC ours-only) — feeder-side capture
`--lr-trace=0x8244FEA8` on cold ours AND canary. The spawn-helper has 11
static call sites in the DB-derived caller list; at runtime it fires 7× in
S2's ours run. Pair r3/r4 (the spawned thread's start_ctx args) with canary's
equivalent. ours may be missing one or more enqueues; if so, the missing
enqueue is the upstream root cause.
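
One way to surface a missing enqueue is a multiset difference over the caller LRs of the spawn-helper. The sketch below compares by caller LR rather than by r3/r4 values, on the assumption that guest code addresses match across engines while pointer values may not; the LR values in the example are placeholders.

```python
from collections import Counter

def missing_enqueues(ours_lrs, canary_lrs):
    """Return {caller_lr: deficit} for LRs that fire more often on canary than on ours."""
    # Counter subtraction keeps only positive counts, i.e. canary-side surplus.
    return dict(Counter(canary_lrs) - Counter(ours_lrs))

# Placeholder caller LRs, not taken from a real trace.
ours_lrs = [0xA0, 0xA0, 0xB0]
canary_lrs = [0xA0, 0xA0, 0xB0, 0xB0, 0xC0]
print({hex(lr): n for lr, n in missing_enqueues(ours_lrs, canary_lrs).items()})
```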
### Path C (~250 LOC, larger) — work-queue struct disassembly
Disassemble sub_82450A68 body, identify the work-queue struct it reads
from (likely at `[r29 + N]` where r29 = start_ctx 0x828F3B68 or a derived
pointer). Watch the struct with `--mem-watch` to identify the populator
(which fn writes the queue items). Trace that populator upstream.
LOC budget for S4: Path A ~30, Path B ~80, Path C ~250.
**Path A first**: it gives the precise exit-condition (loop-body branch vs
inner-wait timeout) at the lowest cost (~30 LOC, no source modification).
## Discipline
- xenia-rs HEAD UNCHANGED (sha256 of `git diff HEAD` matches S1/S2 end).
- No source modifications.
- `--lr-trace` is read-only, lockstep-digest-unaffected (per state.rs:1463-1500).
- No canary run this session (S1's data is fresh).
- No canary cache to wipe (no canary run).
- ours runs cold (no cache pre-population).
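
The HEAD-unchanged check can be scripted so it runs at the start and end of a session. A minimal sketch using only the Python standard library; it assumes `git` is on the PATH and the same git configuration as the run that produced the recorded digest.

```python
import hashlib
import subprocess

# Digest recorded at the top of this report for `git diff HEAD | sha256sum`.
EXPECTED = "ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357"

def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 of a byte string."""
    return hashlib.sha256(data).hexdigest()

def working_tree_digest(repo_dir="."):
    """SHA-256 of the raw `git diff HEAD` output for the given repository."""
    diff = subprocess.run(["git", "diff", "HEAD"], cwd=repo_dir,
                          check=True, capture_output=True).stdout
    return sha256_hex(diff)

def tree_unchanged(repo_dir=".", expected=EXPECTED):
    """True if the working-tree diff still hashes to the recorded digest."""
    return working_tree_digest(repo_dir) == expected
```

Calling `tree_unchanged()` before and after a trace run gives the same guarantee as comparing the two `sha256sum` outputs by hand.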
## Artifacts
```
audit-runs/audit-069-wait-signal-producer/s3/
ours-lr-trace.jsonl (137 records, both PCs, all tids)
ours-lr-trace.stderr (run log + counters)
ours-lr-trace.stdout (empty under --quiet)
ours-lr-trace-824AA2F0.log (60 records, NtSetEvent wrapper)
ours-lr-trace-824AAF50.log (77 records, Ke wrapper)
ours-lr-trace-extended.{jsonl,stderr,stdout} (5B-instr re-validation: same 81 fires)
handle-sequence-diff.md (parallel comparison + first-mismatch table)
writer-report-v3.md (this file)
```
No fresh canary run was needed — S1's `signal-probe-correlated.log`
(154,187 lines) carries all canary signal-probe data.
## Summary of S1 → S2 → S3 progression
- **S1**: identified canary's tid=10 as the signaler; claimed ours lacks
this thread (FALSIFIED by S2).
- **S2**: spawn-chain runs identically on ours tid=5; refined to "wrong-
handle selection" downstream (REFINED by S3).
- **S3**: ours runs identical PC/LR chain but with ~6× fewer iterations.
Loop underrun classification. Wedge handle never even gets created in
ours's truncated boot trajectory.
The bug is **upstream of the γ-signaler**: in WHAT the dispatch loop
reads from the work queue, or in the loop's exit condition.