xenia-rs/audit-runs/audit-069-wait-signal-producer/writer-report-v2.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
2026-06-05 07:19:08 +02:00

# AUDIT-069 Session 2 — writer report v2
Date: 2026-05-20
xenia-rs HEAD: `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` (UNCHANGED from S1)
`git diff HEAD | sha256sum`: `ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357` (UNCHANGED from S1 end)
No canary instrumentation added this session.
## Headline
**S1's framing is FALSIFIED.** ours does NOT lack a "canary-tid=10
equivalent" thread. The spawn chain executes identically:

    main (ours tid=1) → sub_8244FEA8 → sub_8244FF50
      → ExCreateThread(entry=0x82450A28, ctx=0x828F3B68)
      → ours tid=5 starts
      → sub_82450A28 (1×) → sub_82450A68 (1×)
        γ-signaler family (sub_8245D9D8 6×, sub_8245DA78 1×, sub_8245DB40 75×)

This is bit-equivalent to canary's chain, modulo the tid label
(canary calls it tid=10, ours calls it tid=5 — same entry, same ctx,
same dispatch loop, same γ-signaler family fires from inside it).
The signaler spawn-chain is NOT the bug. S1's "the bug is at the
thread-spawn layer" hypothesis is wrong.
## Spawn chain (DB-derived, READ-ONLY DuckDB)
| Fn | callers in DB | role |
|---|---|---|
| 0x82450A28 | 1 ref-edge from 0x8244FFF8 (sub_8244FF50+0xA8) | thread entry (data ptr only) |
| 0x8244FF50 | 1 call-edge from 0x8244FEE8 (sub_8244FEA8+0x40) | ExCreateThread caller |
| 0x8244FEA8 | 11 call-edges (10 unique callers: sub_821A5150, sub_821CB968, sub_821CC2E8, sub_821D2850, sub_82237EC8, sub_8225EE20, sub_822E0350, sub_824528A8, sub_82452DC0 (2×), sub_8245E528) | spawn helper |
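
The offsets and edge counts in the table can be re-derived by hand. The following is a minimal sketch of that arithmetic, using only the addresses and caller names listed in the table; the helper name `site_offset` and the variable names are illustrative, not part of the audit tooling.

```python
# Call-site addresses and the start of the function containing each one,
# copied from the spawn-chain table.
SITES = {
    0x8244FFF8: 0x8244FF50,  # ref-edge to thread entry 0x82450A28
    0x8244FEE8: 0x8244FEA8,  # call-edge to sub_8244FF50
}


def site_offset(site: int, func_start: int) -> int:
    """Byte offset of a call site from the start of its function."""
    return site - func_start


# Callers of sub_8244FEA8 as named in the table; sub_82452DC0 appears
# twice because the table marks it (2x).
FEA8_CALL_EDGES = [
    "sub_821A5150", "sub_821CB968", "sub_821CC2E8", "sub_821D2850",
    "sub_82237EC8", "sub_8225EE20", "sub_822E0350", "sub_824528A8",
    "sub_82452DC0", "sub_82452DC0", "sub_8245E528",
]

offsets = {hex(s): hex(site_offset(s, f)) for s, f in SITES.items()}
call_edges = len(FEA8_CALL_EDGES)           # one entry per call edge
unique_callers = len(set(FEA8_CALL_EDGES))  # distinct calling functions
print(offsets, call_edges, unique_callers)
```

Counting edges and distinct callers separately matters here because one function calls the spawn helper from two sites.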
## Per-PC fire counts (ours-cold, 1.5B instr, fresh today)
| PC | symbol | fires | tid |
|---|---|---|---|
| 0x8244FEA8 | sub_8244FEA8 (spawn helper) | 7 | 1 |
| 0x8244FF50 | sub_8244FF50 (ExCreateThread caller) | 1 | 1 |
| 0x82450A28 | sub_82450A28 (thread entry) | 1 | 5 |
| 0x82450A68 | sub_82450A68 (worker dispatch loop) | 1 | 5 |
| 0x8245D9D8 | γ-signaler D | 6 | 5 |
| 0x8245DA78 | γ-signaler D-B | 1 | 5 |
| 0x8245DB40 | γ-signaler D-NEW | 75 | 5 |
Spawn event log confirms `ExCreateThread: tid=5 handle=0x1050 entry=0x82450a28 start_ctx=0x828f3b68`.
Total `kernel.calls{name=ExCreateThread} = 10`.
## Comparison with canary (S1 data — fresh today, not stale)
| metric | canary | ours |
|---|---|---|
| thread with entry=0x82450A28 | tid=10 | tid=5 |
| start_ctx | 0x828F3B68 | 0x828F3B68 |
| γ-D family signaler firings | all on tid=10 | all on tid=5 |
| NtSetEvent fires from γ-D (via wrapper 0x824AA2F0) | confirmed | confirmed |
The spawn chain and γ-signaler invocation match. The only divergence at the
signaler call site is **which handle gets signaled**, not whether the
signaler runs.
## Divergence point (parent fires, child also fires)
NONE — every node in the spawn chain fires in ours. The S1-prescribed
"first ancestor that fires while child does not" never materialises because
the entire chain is reached identically.
The actual divergence is downstream of the spawn-chain — at the
**handle-selection** step inside the γ-signaler family, per AUDIT-062's
prior finding ("ours's γ-signalers signal WRONG handles — neighbors of the
wedge handle, not the wedge itself").
## Gate condition
There is no gate that ours fails. The control flow reaches the γ-signaler
and invokes the NtSetEvent wrapper (`sub_824AA2F0`) with bit-identical
control flow. The argument to NtSetEvent (the handle) is the
divergent term.
In the AUDIT-062 archive `ours-ntset.jsonl`, the γ-D signaler on ours tid=5
calls NtSetEvent on handles `0x103C`, `0x1068`, `0x106C`, `0x1094`, ...
These are guest-side handle slots that the *waiter* is NOT waiting on.
Per S1, canary's wedge waiters (tid=17, tid=26) wait on `0xF80000A4` and
`0xF8000110`. Note that canary's handles are *pseudo-handles* (high-bit
encoded), while ours's slot allocator hands out normal `0x10xx` IDs.
This is a known cross-engine handle convention mismatch already documented
in AUDIT-019/043/062.
The semantic question is therefore: **what does the producer compute as
the "next handle to signal", and is the computation reading
a different value of the bookkeeping struct in ours vs canary?**
This is the question AUDIT-062 hit and parked; it must be re-opened
now that S1 has clarified the producer thread is reached identically.
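
The handle-convention mismatch described in this section means raw handle integers cannot be compared across engines. As a rough sketch, a diff tool could first tag each handle by convention. This assumes "high-bit encoded" means bit 31 is set, which fits the two canary values quoted in this section but is not confirmed here as the engine's actual rule; `PSEUDO_HANDLE_BIT` and `handle_kind` are hypothetical names.

```python
# Assumption: a canary-style pseudo-handle is any 32-bit value with bit 31 set.
PSEUDO_HANDLE_BIT = 0x8000_0000


def handle_kind(handle: int) -> str:
    """Tag a 32-bit handle as 'pseudo' (high bit set) or 'slot' (plain ID)."""
    return "pseudo" if handle & PSEUDO_HANDLE_BIT else "slot"


# Handle values quoted in this section.
for h in (0xF80000A4, 0xF8000110, 0x103C, 0x1068):
    print(f"{h:#010x} -> {handle_kind(h)}")
```

A tag like this only says which convention a value follows; it does not say which ours handle corresponds to which canary handle.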
## ours-side analog status
The relevant kernel handlers are:
- `NtSetEvent`: per the AUDIT-062 archive, ours (`xenia-kernel/src/exports.rs`)
  is bit-equivalent to canary in semantics (signals the event, schedules wakeup).
  Returns SUCCESS in both.
- `ExCreateThread`: ours is bit-equivalent (the S2 spawn matches canary's
  trajectory in ctx, entry, and suspended flag).
- `xeKeWaitForSingleObject` (wedge wait at 0x821CB1DC): ours's behaviour
  matches per AUDIT-049/065 prior work. The WAIT itself is fine; what
  remains broken is the signaler picking the right handle on tid=5.
Net: NO kernel handler bug. The divergence is **guest-state computed
inside the γ-signaler family at sub_8245D9D8 / sub_8245DA78 /
sub_8245DB40** — i.e. data that lives in the queue/list dispatched
by sub_82450A68.
## Reading-error #28 reclassification
S1 inadvertently committed the same class of error documented as #28 in
prior audit memory: "treating per-engine tid label numerically across
engines without a tid-mapping translation." S1 used canary's "tid=10"
verbatim and AUDIT-062's "tid=10: 0 fires" verbatim, concluding "ours's
thread set lacks the canary-tid=10 equivalent." In reality the same
guest thread exists on both, with renumbered host-side tid labels.
The correct cross-engine identity is `(entry_pc, start_ctx)`, not the
tid integer. S2 re-validates by `entry=0x82450a28 ∧ ctx=0x828f3b68`,
which uniquely identifies the spawn on both engines.
Do NOT register a new reading-error #; this is the existing #28 surface.
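
The identity rule above can be sketched as a small tid-translation step. This is an illustrative sketch, not existing tooling: the `Spawn` record and `tid_map` function are hypothetical, and it assumes each `(entry, ctx)` pair occurs at most once per engine (a pair that repeats would need an extra discriminator such as spawn order).

```python
from typing import NamedTuple


class Spawn(NamedTuple):
    tid: int    # engine-local thread id label
    entry: int  # guest entry PC passed to ExCreateThread
    ctx: int    # guest start context pointer


def tid_map(canary: list[Spawn], ours: list[Spawn]) -> dict[int, int]:
    """Map canary tid -> ours tid by matching (entry, ctx), never the tid value."""
    ours_by_key = {(s.entry, s.ctx): s.tid for s in ours}
    return {
        s.tid: ours_by_key[(s.entry, s.ctx)]
        for s in canary
        if (s.entry, s.ctx) in ours_by_key
    }


# The one spawn this report identifies on both engines.
canary_spawns = [Spawn(tid=10, entry=0x82450A28, ctx=0x828F3B68)]
ours_spawns = [Spawn(tid=5, entry=0x82450A28, ctx=0x828F3B68)]
print(tid_map(canary_spawns, ours_spawns))
```

Any per-tid metric from one engine would be looked up through this map before being compared with the other engine's numbers.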
## Session 3 recommendation (refined)
Drop the spawn-chain investigation entirely. The producer thread runs.
**Path A (RECOMMENDED, ~80 LOC ours-only)**: build a probe of the
**handle-passed-to-NtSetEvent** on tid=5 (ours) inside the γ-signaler
PCs, paired with the symmetric `audit_69_event_signal_watch` capture
from S1 in canary. Compare the *sequence of handle IDs* per signaler
invocation. The first mismatch identifies the guest-state divergence
that drives wrong-handle selection.
Plumbing path: extend `--lr-trace` in ours (`crates/xenia-app/src/main.rs:233-243`)
to capture an `r3` snapshot at multiple PCs, matching canary's
audit_69 wrapper-entry capture. The register capture itself already exists
(M12 lr_trace lists pc/tid/hw/cycle/r3/r4/r5/r6/lr). Probe the `0x824AA2F0`
and `0x824AAF50` entry PCs in ours.
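
The sequence comparison in Path A could look roughly like the sketch below. Because the two engines number handles differently, it does not compare values directly. Instead it assumes a consistent one-to-one renaming between canary and ours handles, learned on first sighting, and reports the first index where that renaming breaks. The function name and the first-sighting alias rule are assumptions made for illustration; the inputs shown are synthetic, not captured data.

```python
from typing import Optional, Sequence


def first_divergence(canary: Sequence[int], ours: Sequence[int]) -> Optional[int]:
    """Index of the first signal whose handle breaks a one-to-one renaming.

    Returns None when both sequences agree up to a consistent renaming.
    """
    c2o: dict[int, int] = {}  # canary handle -> ours handle, set on first sighting
    o2c: dict[int, int] = {}  # ours handle -> canary handle, set on first sighting
    for i, (c, o) in enumerate(zip(canary, ours)):
        if c2o.setdefault(c, o) != o or o2c.setdefault(o, c) != c:
            return i
    if len(canary) != len(ours):
        # One engine signalled more often than the other.
        return min(len(canary), len(ours))
    return None


# Synthetic handle IDs: the third signal reuses canary handle 1,
# but ours signals a handle it has not used before.
print(first_divergence([1, 2, 1], [7, 8, 9]))
```

The returned index names the signaler invocation to inspect; the guest state feeding that invocation is where the wrong-handle selection would first be visible.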
**Path B (~50 LOC diff-tool)**: extend the diff-events JSONL absorber to
treat the canary→ours handle-ID mapping as a runtime-discovered alias
when the underlying dispatcher pointer matches. Doesn't fix the bug,
absorbs the symptom.
**Path C (root-cause, larger)**: walk sub_82450A68 dispatch loop body
disassembly + AUDIT-062 archive to identify which guest-memory struct
holds the queue of "handles to signal." The wrong handles on ours mean
this struct gets populated wrong somewhere upstream of tid=5's dispatch
loop — likely from sub_8244FEA8's 7 fires (which call sites enqueue
work, and what data is enqueued).
LOC budget for S3: Path A ~80, Path B ~50, Path C unknown (~200+).
## Cascade A/B/C/D
- **A** (DB-derived spawn chain): PASS (11 call-edges into sub_8244FEA8 from 10 unique callers, 1 call-edge to sub_8244FF50).
- **B** (per-fn fire counts ours+canary): PASS (ours fresh, canary from S1 fresh).
- **C** (divergence-point identification): N/A — no divergence in spawn chain;
S1 framing falsified. Re-direction recommended.
- **D** (kernel-handler bit-equivalence check): PASS (NtSetEvent / ExCreateThread
per AUDIT-062 archive; no new kernel bug detected).
Net: 3/4 PASS, 1/4 N/A (because the postulated divergence wasn't there).
## Discipline
- xenia-rs HEAD UNCHANGED (sha256 of `git diff HEAD` matches S1 end).
- No canary instrumentation added this session — S1's data is fresh.
- ours-rs ran with `--ctor-probe` (read-only, lockstep-digest-unaffected
flag already in main.rs:194).
- No source modifications to ours.
- No ours-rs cache on this host; no canary run, so no canary cache to wipe.
## Artifacts
```
audit-runs/audit-069-wait-signal-producer/
  session-2-spawn-walk.log   (combined probe + DB queries + fires table)
  writer-report-v2.md        (this file)
  s2/ours-probe.stdout       (780 lines, 91 CTOR-PROBE records)
  s2/ours-probe.stderr       (241 lines, all spawn events + summary)
```
No `fix-canary-v2.diff` (no canary instrumentation added).