
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-05 07:19:08 +02:00


AUDIT-069 Session 3 — writer report v3

Date: 2026-05-20
xenia-rs HEAD: e6d43a23ac393004d2e5adf2f0395fd0b5e6448b (UNCHANGED from S1/S2)
git diff HEAD | sha256sum: ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357 (UNCHANGED at start AND end of S3)
No canary instrumentation added this session.
No ours source modifications.
--lr-trace is a runtime flag (main.rs:233-243).

Headline (HIGH confidence, direct measurement)

ours's tid=5 (= canary tid=10 by entry/ctx identity) fires the γ-signaler family from the SAME guest LRs as canary — but only 81 times where canary fires 492 times (16%). This is NOT a "wrong-handle" bug — it is a producer-loop underrun. The dispatch loop in sub_82450A68 exits early or starves; consumer threads then block on events that ours never gets to signal.

S2's "the producer fires identically, just selects wrong handles" framing is REFINED, not falsified: the producer reaches the wrappers via the EXACT same call sites but completes ~6× fewer iterations (81 vs 492).

Method

Read-only --lr-trace=0x824AA2F0,0x824AAF50 on cold ours boot, 1.5B instructions / 47 s wallclock (and re-validated at 5B / 159s — same 81 fires, same handle universe, same import_calls=39290 → no new work after the producer's initial burst). JSONL output to s3/ours-lr-trace.jsonl. Cross-engine paired against S1's signal-probe-correlated.log (canary data, fresh 2026-05-20).
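The per-thread tally behind the counts in this report can be sketched as follows. This is a minimal sketch, not the tooling actually used: the JSONL field names (`tid`, `pc`, `lr`) and their hex-string encoding are assumptions, since the real schema of s3/ours-lr-trace.jsonl is not shown here, and the sample records (including the LR 0x82000000) are synthetic.

```python
import json
from collections import Counter


def tally_fires(jsonl_lines, tid):
    """Count trace records per caller LR for a single guest thread.

    Assumes each line is a JSON object with "tid" and "lr" keys
    (hypothetical field names; adjust to the real trace schema).
    """
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        record = json.loads(line)
        if record["tid"] == tid:
            counts[record["lr"]] += 1
    return counts


# Synthetic sample, shaped like the assumed schema.
sample = [
    '{"tid": 5, "pc": "0x824AA2F0", "lr": "0x8245DA44"}',
    '{"tid": 5, "pc": "0x824AAF50", "lr": "0x8245DC5C"}',
    '{"tid": 5, "pc": "0x824AAF50", "lr": "0x8245DC5C"}',
    '{"tid": 7, "pc": "0x824AA2F0", "lr": "0x82000000"}',
]
fires = tally_fires(sample, tid=5)
for lr, count in sorted(fires.items()):
    print(lr, count)
```

Filtering by tid before counting matters because the trace file holds records from all threads (137 in total) while only 81 belong to tid=5.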

Per-LR fire counts

caller LR	symbol	wrapper PC	canary tid=10	ours tid=5	ratio
0x8245DA44	γ-D-A (sub_8245D9D8)	0x824AA2F0	23	5	22%
0x8245DB08	γ-D-B (sub_8245DA78)	0x824AA2F0	8	1	12%
0x8245DC5C	γ-DB40 (sub_8245DB40 NEW)	0x824AAF50	461	75	16%
TOTAL			492	81	16%
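The ratio column can be re-derived from the raw counts. The sketch below uses only the numbers in the table above; the variable and function names are illustrative.

```python
# (canary tid=10 fires, ours tid=5 fires) per caller LR, from the table above.
FIRES = {
    "0x8245DA44": (23, 5),
    "0x8245DB08": (8, 1),
    "0x8245DC5C": (461, 75),
}


def ratio_pct(canary, ours):
    """ours fire count as a percentage of canary's."""
    return 100.0 * ours / canary


canary_total = sum(c for c, _ in FIRES.values())
ours_total = sum(o for _, o in FIRES.values())

for lr, (canary, ours) in FIRES.items():
    print(f"{lr}: {ours}/{canary} = {ratio_pct(canary, ours):.1f}%")
print(f"TOTAL: {ours_total}/{canary_total} = "
      f"{ratio_pct(canary_total, ours_total):.1f}%")
print(f"shortfall factor: {canary_total / ours_total:.1f}x")
```

The aggregate shortfall is 492/81 ≈ 6.1, and the per-LR factors range from 4.6 (γ-D-A) to 8 (γ-D-B).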

ours runs the same producer code, but the loop terminates early. S2's per-PC fire-count table also shows ours = 6/1/75 for the three γ-fns — this S3 data agrees with S2 for the wrapper-entry side too.

Handle namespaces are incomparable by raw ID

canary uses XEvent::native_object() pseudo-handles F8000xxx (high bit set, encodes a synthetic ID assigned by XObject::GetNativeObject).
ours uses normal slot IDs 0x10xx from the handle-slot allocator.

Comparison must be by (a) position in the per-LR sequence and (b) call args (size r5, signal-kind r4).
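A position-aligned comparison of this kind might look like the sketch below. It is an illustration under assumptions: each sequence is taken to be the list of r5 values seen at one caller LR, in fire order, and raw handle IDs and r4 pointers are deliberately left out because they are not comparable across engines. The function name is hypothetical.

```python
from itertools import zip_longest


def first_mismatch(canary_seq, ours_seq):
    """Return (position, canary_value, ours_value) at the first divergence.

    A value of None on one side means that engine's sequence ended first,
    i.e. the divergence is in iteration count rather than in the args.
    Returns None when both sequences are identical.
    """
    pairs = zip_longest(canary_seq, ours_seq)
    for position, (canary_value, ours_value) in enumerate(pairs):
        if canary_value != ours_value:
            return position, canary_value, ours_value
    return None


# Illustrative r5 sequences for one LR: same args, but ours stops early.
canary_r5 = [2, 2, 2]
ours_r5 = [2]
print(first_mismatch(canary_r5, ours_r5))
```

A result whose third element is `None` separates "ours stopped firing" from "ours fired with different arguments", which is the distinction this session turns on.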

Position-0 args MATCH (HIGH confidence, direct measurement)

LR	r5 (size / kind)	matches?
0x8245DC5C	ours=0x800 / canary=0x800	YES
0x8245DA44	ours=2 (Set) / canary=2	YES
0x8245DB08	ours=2 / canary=2	YES

r4 (buffer/ctx pointers) DIFFER in absolute address (different memory layouts) but TYPE-shaped identically. The first invocation of each signaler is structurally identical. The divergence is in COUNT of subsequent loop iterations, not in handle-selection of position-0.

See s3/handle-sequence-diff.md for full position-aligned table.

γ-DB40 signal-target distribution (the 461-vs-75 case)

canary handle	count	ours handle	count
F80000C8	229	0x000010E0	69
F80000DC	79	0x00001040	1
F8000078	71	0x0000105C	1
F80000BC	39	0x00001098	1
F800012C	28	0x000010AC	1
F80000B4	7	0x000010D0	1
F8000044	4	0x0000121C	1

Shape: both have one dominant handle and a long tail, but the dominant handle absorbs about half of canary's signals (229/461 = 50%) and nearly all of ours's (69/75 = 92%). ours's tail is truncated: only 7 distinct handles in γ-DB40 vs canary's 10+.

This is consistent with the producer enqueuing the same kinds of work items while the upstream feeder under-fires: the dominant work item (handle 0x10E0 ≈ F80000C8 by position) gets some iterations, the next-most-common items are truncated to 1×, and the long tail (canary's F80000DC 79× / F8000078 71×) is mostly missing.
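The shape comparison can be checked numerically from the distribution table. The sketch below uses only the rows listed there; note that the seven canary rows sum to 457, so 4 of canary's 461 γ-DB40 signals go to handles the table does not list.

```python
# γ-DB40 signal targets, copied from the distribution table.
CANARY = {
    "F80000C8": 229, "F80000DC": 79, "F8000078": 71, "F80000BC": 39,
    "F800012C": 28, "F80000B4": 7, "F8000044": 4,
}
OURS = {
    "0x000010E0": 69, "0x00001040": 1, "0x0000105C": 1, "0x00001098": 1,
    "0x000010AC": 1, "0x000010D0": 1, "0x0000121C": 1,
}


def dominant_share(dist, total):
    """Return (handle, fraction of all signals) for the most-signaled handle."""
    handle, count = max(dist.items(), key=lambda item: item[1])
    return handle, count / total


def singletons(dist):
    """Handles that were signaled exactly once."""
    return [handle for handle, count in dist.items() if count == 1]


canary_top = dominant_share(CANARY, 461)  # 461 = canary's γ-DB40 total
ours_top = dominant_share(OURS, 75)       # 75 = ours's γ-DB40 total
print(canary_top, ours_top, len(singletons(OURS)))
```

Every non-dominant handle on the ours side is a singleton, whereas none of the listed canary handles is.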

Wedge handle status (HIGH confidence)

AUDIT-062 archive recorded ours wedge handles 0x12AC and 0x12B8 with <NO_SIGNALS_DESPITE_WAITS> annotation in a deeper-boot run.

In S3's lr-trace: handle 0x12AC count = 0, handle 0x12B8 count = 0. No handle above 0x121C appears in tid=5's signal trace at all.

Max handle observed in this run: 0x121C (cache:/aab216c3 NtCreateFile).

The wedge handles are NEVER allocated in this 5B-instruction run, because boot terminates before the trajectory that would create them. The producer fires 81 times, then tid=5 goes quiet; the import_call counter freezes at 39,290; --halt-on-deadlock does NOT trigger (consumers wait on existing events that were never the wedge in this run).

This is a stronger statement than "the wedge handle is never signaled": the wedge handle is never even CREATED, because the boot never reaches the point of creating it. ours's boot trajectory is truncated by the producer underrun upstream.
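The "never signaled" versus "never created" distinction can be expressed as a small check. This is a sketch under a stated assumption: it treats slot IDs as allocated in increasing order, so a handle above the highest observed ID is taken as not yet allocated. The report does not state the allocator's policy. The observed set here is only the γ-DB40 subset from the distribution table, not the full trace.

```python
WEDGE_HANDLES = {0x12AC, 0x12B8}  # from the AUDIT-062 archive
OBSERVED = {0x10E0, 0x1040, 0x105C, 0x1098, 0x10AC, 0x10D0, 0x121C}


def wedge_status(observed, wedge):
    """Classify wedge handles against the handles seen in a trace.

    "beyond_highest" relies on the (assumed) monotonic slot allocation:
    a handle above the highest observed ID was presumably never created.
    """
    highest = max(observed)
    return {
        "signaled": sorted(wedge & observed),
        "beyond_highest": sorted(h for h in wedge if h > highest),
        "highest_observed": highest,
    }


status = wedge_status(OBSERVED, WEDGE_HANDLES)
print({key: [hex(h) for h in value] if isinstance(value, list) else hex(value)
       for key, value in status.items()})
```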

Classification: producer-loop underrun (HIGH confidence)

NOT a race (timing-dependent), NOT a wrong-handle bug (the args at matching positions are structurally identical), NOT a missing-kernel-handler bug (the signals that DO fire pass through bit-equivalent wrappers).

It is producer-loop underrun: sub_82450A68's dispatch loop iterates fewer times. One of the following must hold:

The work queue (read from guest memory by sub_82450A68) is populated with fewer items by some upstream feeder.
The dispatch loop's exit condition trips early.
The thread blocks on a dispatcher event that never gets re-signaled.

Mechanism candidates (S4 to discriminate):

upstream feeder: callers of sub_8244FEA8 (11 sites in DB); one of them enqueues less work in ours. Most likely the audio cluster (sub_8225EE20) or sub_82452DC0 (2 calls), given they relate to APUBUG-PRODUCER-001 territory.
dispatch loop exit: the loop reads a flag from the dispatcher struct at 0x828F3B68 + offset; a state divergence there exits early.
inner KeWait at 0x824AB240 (mentioned in S1 spawn-chain notes): if this wait times out / fails differently in ours, the loop exits.
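A toy model shows why fire counts alone cannot separate the three candidates. Everything in it is hypothetical: it is not a reconstruction of sub_82450A68, and the parameters `exit_flag_at` and `wait_fails_at` are invented stand-ins for the exit-flag and inner-wait mechanisms. Only the counts 492 and 81 come from the measurements.

```python
from collections import deque


def run_dispatch(items, exit_flag_at=None, wait_fails_at=None):
    """Toy dispatch loop: one signal per dequeued work item.

    exit_flag_at:  iteration at which a (hypothetical) exit flag trips.
    wait_fails_at: iteration at which the (hypothetical) inner wait never
                   returns; modeled as leaving the loop, since a blocked
                   thread fires nothing further.
    Returns the number of signals fired.
    """
    queue = deque(items)
    fired = 0
    while queue:
        if exit_flag_at is not None and fired == exit_flag_at:
            break  # candidate: exit condition trips early
        if wait_fails_at is not None and fired == wait_fails_at:
            break  # candidate: thread blocks on a dispatcher event
        queue.popleft()
        fired += 1
    return fired


baseline = run_dispatch(range(492))                   # canary-like run
starved = run_dispatch(range(81))                     # feeder enqueues less
early_exit = run_dispatch(range(492), exit_flag_at=81)
blocked = run_dispatch(range(492), wait_fails_at=81)
print(baseline, starved, early_exit, blocked)
```

All three failure modes yield the same 81 signals, so discriminating them needs a look inside the loop (where it stops) or at the queue (how much was enqueued), not another fire count.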

Reading-error registry

NO new reading-error class needed. This session confirms one existing class:

#28 cross-engine tid label mismatch — used correctly here (compared by entry/ctx, not by tid integer).
AUDIT-062 "wrong handles" framing is a SYMPTOM of the producer underrun (fewer signals → some handles signaled, others starved), not a separate bug.

Cascade

A (capture ours per-PC signaler firings): PASS (137 records, 81 on tid=5).
B (parallel canary sequence from S1): PASS (492 records on tid=10).
C (first-mismatch identification): PASS — divergence is in iteration count, not in handle-at-position-0. Position-0 args match structurally.
D (race-vs-missing-signal classification): PASS — neither pure race nor pure missing-signal. It is producer-loop underrun (boot doesn't reach the wedge-handle-creating subsystem).

Net 4/4 PASS.

S4 recommendation (refined)

Drop the "wrong-handles-from-γ-signaler" framing. Focus upstream on WHY tid=5's dispatch loop runs ~6× fewer iterations (81 fires vs canary's 492).

Path A (RECOMMENDED, ~30 LOC ours-only diagnostic, no source mod)

Use --lr-trace=0x82450A68 (the dispatch-loop body PC) plus the existing --branch-probe to see WHERE in the loop body ours exits. If the loop has a backward branch at offset X and ours's last fire is at offset Y < X, the loop is exiting early. Pair with the inner bl 0x824AB240 (KeWaitForMultipleObjects) to see if the loop blocks on a wait that returns differently than canary.

Path B (~80 LOC ours-only) — feeder-side capture

--lr-trace=0x8244FEA8 on cold ours AND canary. The spawn-helper has 11 static call sites in the DB-derived caller list; at runtime it fired 7× in S2's ours run. Pair r3/r4 (the spawned thread's start_ctx args) with canary's equivalent. ours may be missing one or more enqueues, in which case the missing enqueue is the upstream root cause.
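Once both engines' spawn-helper traces exist, finding the missing enqueue reduces to a multiset difference over call sites. The sketch below assumes each trace has already been reduced to a list of caller labels; the labels (`site_a`, ...) are placeholders, not real call sites.

```python
from collections import Counter


def missing_enqueues(canary_callers, ours_callers):
    """Call sites that fire more often on canary than on ours.

    Counter subtraction keeps only positive differences, so sites where
    ours fires as often as (or more often than) canary drop out.
    """
    return Counter(canary_callers) - Counter(ours_callers)


# Placeholder call-site labels, for illustration only.
canary_callers = ["site_a", "site_b", "site_b", "site_c"]
ours_callers = ["site_a", "site_b"]
print(missing_enqueues(canary_callers, ours_callers))
```

Keying on the caller LR rather than on r3/r4 values avoids the cross-engine address mismatch; the args can then be compared structurally for the sites that remain.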

Path C (~250 LOC, larger) — work-queue struct disassembly

Disassemble sub_82450A68 body, identify the work-queue struct it reads from (likely at [r29 + N] where r29 = start_ctx 0x828F3B68 or a derived pointer). Watch the struct with --mem-watch to identify the populator (which fn writes the queue items). Trace that populator upstream.

LOC budget for S4: Path A ~30, Path B ~80, Path C ~250.

Path A first: it identifies the precise exit condition (loop-body branch vs inner-wait timeout) at the lowest LOC cost and with no source modification.

Discipline

xenia-rs HEAD UNCHANGED (sha256 of git diff HEAD matches S1/S2 end).
No source modifications.
--lr-trace is read-only, lockstep-digest-unaffected (per state.rs:1463-1500).
No canary run this session (S1's data is fresh).
No canary cache to wipe (no canary run).
ours runs cold (no cache pre-population).
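The HEAD-unchanged check can be reproduced with a short script. This is a sketch of the `git diff HEAD | sha256sum` invariant, not the procedure actually used; the function names are illustrative, and `working_tree_digest` assumes `git` is on PATH and that it is run inside the repository.

```python
import hashlib
import subprocess


def digest(diff_bytes):
    """SHA-256 hex digest of raw diff bytes (what sha256sum prints first)."""
    return hashlib.sha256(diff_bytes).hexdigest()


def working_tree_digest(repo_dir="."):
    """Digest of `git diff HEAD` output for the repository at repo_dir."""
    result = subprocess.run(
        ["git", "diff", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
    )
    return digest(result.stdout)


# A clean tree produces an empty diff, whose digest is the well-known
# SHA-256 of zero bytes.
print(digest(b""))
```

Recording `working_tree_digest()` at the start and end of a session and comparing the two strings gives the same guarantee as the sha256sum line in the header.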

Artifacts

audit-runs/audit-069-wait-signal-producer/s3/
  ours-lr-trace.jsonl              (137 records, both PCs, all tids)
  ours-lr-trace.stderr             (run log + counters)
  ours-lr-trace.stdout             (empty under --quiet)
  ours-lr-trace-824AA2F0.log       (60 records, NtSetEvent wrapper)
  ours-lr-trace-824AAF50.log       (77 records, Ke wrapper)
  ours-lr-trace-extended.{jsonl,stderr,stdout}  (5B-instr re-validation: same 81 fires)
  handle-sequence-diff.md          (parallel comparison + first-mismatch table)
  writer-report-v3.md              (this file)

No fresh canary run was needed — S1's signal-probe-correlated.log (154,187 lines) carries all canary signal-probe data.

Summary of S1 → S2 → S3 progression

S1: identified canary's tid=10 as the signaler; claimed ours lacks this thread (FALSIFIED by S2).
S2: spawn-chain runs identically on ours tid=5; refined to "wrong-handle selection" downstream (REFINED by S3).
S3: ours runs identical PC/LR chain but with ~6× fewer iterations (81 vs 492). Loop underrun classification. Wedge handle never even gets created in ours's truncated boot trajectory.

The bug is upstream of the γ-signaler: in WHAT the dispatch loop reads from the work queue, or in the loop's exit condition.
