MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-05 07:19:08 +02:00


AUDIT-069 Session 2 — writer report v2

Date: 2026-05-20
xenia-rs HEAD: e6d43a23ac393004d2e5adf2f0395fd0b5e6448b (UNCHANGED from S1)
git diff HEAD | sha256sum: ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357 (UNCHANGED from S1 end)
No canary instrumentation added this session.

Headline

S1's framing is FALSIFIED. ours does NOT lack a "canary-tid=10 equivalent" thread. The spawn chain executes identically:

main (ours tid=1) → sub_8244FEA8 → sub_8244FF50 → ExCreateThread(entry=0x82450A28, ctx=0x828F3B68) → ours tid=5 starts → sub_82450A28 (1×) → sub_82450A68 (1×) → γ-signaler family (sub_8245D9D8 6×, sub_8245DA78 1×, sub_8245DB40 75×)

This is bit-equivalent to canary's chain, modulo the tid label (canary calls it tid=10, ours calls it tid=5 — same entry, same ctx, same dispatch loop, same γ-signaler family fires from inside it).

The signaler spawn-chain is NOT the bug. S1's "the bug is at the thread-spawn layer" hypothesis is wrong.

Spawn chain (DB-derived, READ-ONLY DuckDB)

Fn	callers in DB	role
0x82450A28	1 ref-edge from 0x8244FFF8 (sub_8244FF50+0xA8)	thread entry (data ptr only)
0x8244FF50	1 call-edge from 0x8244FEE8 (sub_8244FEA8+0x40)	ExCreateThread caller
0x8244FEA8	11 call-edges (10 unique callers: sub_821A5150, sub_821CB968, sub_821CC2E8, sub_821D2850, sub_82237EC8, sub_8225EE20, sub_822E0350, sub_824528A8, sub_82452DC0 (2×), sub_8245E528)	spawn helper

Per-PC fire counts (ours-cold, 1.5B instr, fresh today)

PC	symbol	fires	tid
0x8244FEA8	sub_8244FEA8 (spawn helper)	7	1
0x8244FF50	sub_8244FF50 (ExCreateThread caller)	1	1
0x82450A28	sub_82450A28 (thread entry)	1	5
0x82450A68	sub_82450A68 (worker dispatch loop)	1	5
0x8245D9D8	γ-signaler D	6	5
0x8245DA78	γ-signaler D-B	1	5
0x8245DB40	γ-signaler D-NEW	75	5

Spawn event log confirms ExCreateThread: tid=5 handle=0x1050 entry=0x82450a28 start_ctx=0x828f3b68. Total kernel.calls{name=ExCreateThread} = 10.
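
A minimal sketch of how a per-PC fires table like the one above could be tallied from raw probe hits. The `Fire` record shape and the `tally_fires` name are hypothetical; the actual probe output format is not reproduced in this report.

```rust
use std::collections::BTreeMap;

/// One probe hit: the guest PC that fired and the host-side tid it fired on.
/// Hypothetical record shape, not the real probe format.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Fire {
    pub pc: u32,
    pub tid: u32,
}

/// Count hits per (pc, tid). A BTreeMap keeps the rows ordered by PC,
/// so iterating the result reads like the table.
pub fn tally_fires(fires: &[Fire]) -> BTreeMap<(u32, u32), u64> {
    let mut counts = BTreeMap::new();
    for f in fires {
        *counts.entry((f.pc, f.tid)).or_insert(0u64) += 1;
    }
    counts
}
```

Keying on (pc, tid) rather than pc alone keeps a PC that fires on two threads from being merged into one row.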

Comparison with canary (S1 data — fresh today, not stale)

metric	canary	ours
thread with entry=0x82450A28	tid=10	tid=5
start_ctx	0x828F3B68	0x828F3B68
γ-D family signaler firings	all on tid=10	all on tid=5
NtSetEvent fires from γ-D (via wrapper 0x824AA2F0)	confirmed	confirmed

The spawn chain and γ-signaler invocation match. The only divergence at the signaler call site is which handle gets signaled, not whether the signaler runs.

Divergence point (parent fires, child also fires)

NONE — every node in the spawn chain fires in ours. The S1-prescribed "first ancestor that fires while child does not" never materialises because the entire chain is reached identically.

The actual divergence is downstream of the spawn-chain — at the handle-selection step inside the γ-signaler family, per AUDIT-062's prior finding ("ours's γ-signalers signal WRONG handles — neighbors of the wedge handle, not the wedge itself").
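
The S1-prescribed search ("first ancestor that fires while child does not") can be sketched as a scan over adjacent chain nodes. This is an illustration only: `ChainNode` and `first_divergence` are hypothetical names, and the chain is treated as linear, which ignores the fan-out into the three γ-signaler PCs.

```rust
/// A node in the spawn chain with its observed fire count.
/// Hypothetical type for illustration.
pub struct ChainNode {
    pub pc: u32,
    pub fires: u64,
}

/// Return the index of the first node that fires while its successor does not.
/// `None` means no parent/child pair shows that pattern.
pub fn first_divergence(chain: &[ChainNode]) -> Option<usize> {
    chain
        .windows(2)
        .position(|w| w[0].fires > 0 && w[1].fires == 0)
}
```

Fed the ours-cold counts from the fires table (7, 1, 1, 1, then any of the signaler counts), the scan finds no such pair, which is the NONE result reported above.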

Gate condition

There is no gate that ours fails. The control flow reaches the γ-signaler and invokes the NtSetEvent wrapper (sub_824AA2F0) with bit-identical control flow. The argument to NtSetEvent (the handle) is the divergent term.

In the AUDIT-062 archive ours-ntset.jsonl, the γ-D signaler on ours tid=5 calls NtSetEvent on handles 0x103C, 0x1068, 0x106C, 0x1094, ... These are guest-side handle slots that the waiter is NOT waiting on.

Per S1, canary's wedge waiter (tid=17, tid=26) waits on F80000A4 and F8000110. Note that canary's handles are pseudo-handles (high-bit encoded), while ours's slot allocator hands out normal 0x10xx IDs — a known cross-engine handle convention mismatch already documented in AUDIT-019/043/062.

The semantic question is therefore: what does the producer compute as the "next handle to signal", and is the computation reading a different value of the bookkeeping struct in ours vs canary? This is the question AUDIT-062 hit and parked; it must be re-opened now that S1 has clarified the producer thread is reached identically.

ours-side analog status

The relevant kernel handlers are:

NtSetEvent: per the AUDIT-062 archive, ours (xenia-kernel/src/exports.rs) is bit-equivalent to canary in semantics (signals the event, schedules wakeup). Returns SUCCESS in both.
ExCreateThread: ours is bit-equivalent (the S2 spawn matches the canary trajectory in ctx, entry, and suspended flag).
xeKeWaitForSingleObject (wedge wait at 0x821CB1DC): ours's behaviour matches per prior AUDIT-049/065 work; the wait itself is fine. What remains broken is the signaler on tid=5, which does not pick the right handle.

Net: NO kernel handler bug. The divergence is guest-state computed inside the γ-signaler family at sub_8245D9D8 / sub_8245DA78 / sub_8245DB40 — i.e. data that lives in the queue/list dispatched by sub_82450A68.

Reading-error #28 reclassification

S1 inadvertently committed the same class of error documented as #28 in prior audit memory: "treating per-engine tid label numerically across engines without a tid-mapping translation." S1 used canary's "tid=10" verbatim and AUDIT-062's "tid=10: 0 fires" verbatim, concluding "ours's thread set lacks the canary-tid=10 equivalent." In reality the same guest thread exists on both, with renumbered host-side tid labels.

The correct cross-engine identity is (entry_pc, start_ctx), not the tid integer. S2 re-validates by entry=0x82450a28 ∧ ctx=0x828f3b68, which uniquely identifies the spawn on both engines.

Do NOT register a new reading-error #; this is the existing #28 surface.
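
A hedged sketch of the tid-mapping translation that reading-error #28 calls for: join the two engines' spawn logs on (entry_pc, start_ctx) and never on the tid integer. `ThreadKey`, `Spawn`, and `map_tids` are hypothetical names, and the sketch assumes each key is spawned at most once per engine, as is the case for the spawn validated here.

```rust
use std::collections::HashMap;

/// Cross-engine thread identity: entry PC plus start context.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ThreadKey {
    pub entry_pc: u32,
    pub start_ctx: u32,
}

/// One ExCreateThread record as seen on a single engine (hypothetical shape).
pub struct Spawn {
    pub tid: u32,
    pub key: ThreadKey,
}

/// Map each canary tid to the ours tid that shares its (entry_pc, start_ctx).
/// Threads present on only one engine are left out of the map.
pub fn map_tids(canary: &[Spawn], ours: &[Spawn]) -> HashMap<u32, u32> {
    let ours_by_key: HashMap<ThreadKey, u32> =
        ours.iter().map(|s| (s.key, s.tid)).collect();
    canary
        .iter()
        .filter_map(|s| ours_by_key.get(&s.key).map(|&t| (s.tid, t)))
        .collect()
}
```

With entry=0x82450A28 and ctx=0x828F3B68 on both sides, this would pair canary tid=10 with ours tid=5. If a key could be spawned more than once, the join would need an extra ordinal to stay unambiguous.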

Session 3 recommendation (refined)

Drop the spawn-chain investigation entirely. The producer thread runs.

Path A (RECOMMENDED, ~80 LOC ours-only): build a probe of the handle-passed-to-NtSetEvent on tid=5 (ours) inside the γ-signaler PCs, paired with the symmetric audit_69_event_signal_watch capture from S1 in canary. Compare the sequence of handle IDs per signaler invocation. The first mismatch identifies the guest-state divergence that drives wrong-handle selection.

Plumbing path: extend --lr-trace in ours (crates/xenia-app/src/main.rs:233-243) to capture an r3 snapshot at multiple PCs, matching canary's audit_69 wrapper-entry capture. The record fields already exist (M12 lr_trace lists pc/tid/hw/cycle/r3/r4/r5/r6/lr). Probe the 0x824AA2F0 and 0x824AAF50 entry PCs in ours.

Path B (~50 LOC diff-tool): extend the diff-events JSONL absorber to treat the canary→ours handle-ID mapping as a runtime-discovered alias when the underlying dispatcher pointer matches. Doesn't fix the bug, absorbs the symptom.

Path C (root-cause, larger): walk sub_82450A68 dispatch loop body disassembly + AUDIT-062 archive to identify which guest-memory struct holds the queue of "handles to signal." The wrong handles on ours mean this struct gets populated wrong somewhere upstream of tid=5's dispatch loop — likely from sub_8244FEA8's 7 fires (which call sites enqueue work, and what data is enqueued).

LOC budget for S3: Path A ~80, Path B ~50, Path C unknown (~200+).
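
Path A's comparison step can be sketched as follows. Raw handle values cannot be compared directly because the two engines use different handle conventions, so this sketch first renames each handle to the ordinal of its first appearance and then looks for the first position where the renamed sequences differ. That renaming is an assumption of this sketch, not something S1 or AUDIT-062 prescribes, and `canonicalize` / `first_mismatch` are hypothetical names.

```rust
use std::collections::HashMap;

/// Replace each raw handle with the ordinal of its first appearance,
/// so sequences from engines with different handle numbering are comparable.
pub fn canonicalize(handles: &[u32]) -> Vec<usize> {
    let mut seen: HashMap<u32, usize> = HashMap::new();
    handles
        .iter()
        .map(|&h| {
            let next = seen.len();
            *seen.entry(h).or_insert(next)
        })
        .collect()
}

/// Index of the first position where the canonicalized sequences differ.
/// A sequence that ends early counts as a mismatch at the shorter length.
pub fn first_mismatch(canary: &[u32], ours: &[u32]) -> Option<usize> {
    let (a, b) = (canonicalize(canary), canonicalize(ours));
    let common = a.len().min(b.len());
    (0..common)
        .find(|&i| a[i] != b[i])
        .or_else(|| (a.len() != b.len()).then_some(common))
}
```

The renaming only captures the repeat pattern of the sequence (which signal reuses which earlier handle). It cannot tell apart two runs that signal different objects in the same pattern, so a dispatcher-pointer check along the lines of Path B would still be needed to confirm a match.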

Cascade A/B/C/D

A (DB-derived spawn chain): PASS (11 call-edges into FEA8, 1 unique call-edge to FF50).
B (per-fn fire counts ours+canary): PASS (ours fresh, canary from S1 fresh).
C (divergence-point identification): N/A — no divergence in spawn chain; S1 framing falsified. Re-direction recommended.
D (kernel-handler bit-equivalence check): PASS (NtSetEvent / ExCreateThread per AUDIT-062 archive; no new kernel bug detected).

Net: 3/4 PASS, 1/4 N/A (because the postulated divergence wasn't there).

Discipline

xenia-rs HEAD UNCHANGED (sha256 of git diff HEAD matches S1 end).
No canary instrumentation added this session — S1's data is fresh.
ours-rs ran with --ctor-probe (read-only, lockstep-digest-unaffected flag already in main.rs:194).
No source modifications to ours.
No ours-rs cache exists on this host; no canary run was made, so there is no canary cache to wipe.

Artifacts

audit-runs/audit-069-wait-signal-producer/
  session-2-spawn-walk.log    (combined probe + DB queries + fires table)
  writer-report-v2.md         (this file)
  s2/ours-probe.stdout        (780 lines, 91 CTOR-PROBE records)
  s2/ours-probe.stderr        (241 lines, all spawn events + summary)

No fix-canary-v2.diff (no canary instrumentation added).
