handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions


@@ -0,0 +1,192 @@
# AUDIT-069 Session 2 — writer report v2
Date: 2026-05-20
xenia-rs HEAD: `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` (UNCHANGED from S1)
`git diff HEAD | sha256sum`: `ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357` (UNCHANGED from S1 end)
No canary instrumentation added this session.
## Headline
**S1's framing is FALSIFIED.** ours does NOT lack a "canary-tid=10
equivalent" thread. The spawn chain executes identically:

    main (ours tid=1) → sub_8244FEA8 → sub_8244FF50
      → ExCreateThread(entry=0x82450A28, ctx=0x828F3B68)
        → ours tid=5 starts
          → sub_82450A28 (1×) → sub_82450A68 (1×)
            → γ-signaler family (sub_8245D9D8 6×, sub_8245DA78 1×, sub_8245DB40 75×)

This is bit-equivalent to canary's chain, modulo the tid label
(canary calls it tid=10, ours calls it tid=5 — same entry, same ctx,
same dispatch loop, same γ-signaler family fires from inside it).
The signaler spawn-chain is NOT the bug. S1's "the bug is at the
thread-spawn layer" hypothesis is wrong.
## Spawn chain (DB-derived, READ-ONLY DuckDB)
| Fn | callers in DB | role |
|---|---|---|
| 0x82450A28 | 1 ref-edge from 0x8244FFF8 (sub_8244FF50+0xA8) | thread entry (data ptr only) |
| 0x8244FF50 | 1 call-edge from 0x8244FEE8 (sub_8244FEA8+0x40) | ExCreateThread caller |
| 0x8244FEA8 | 11 call-edges (10 unique callers: sub_821A5150, sub_821CB968, sub_821CC2E8, sub_821D2850, sub_82237EC8, sub_8225EE20, sub_822E0350, sub_824528A8, sub_82452DC0 (2×), sub_8245E528) | spawn helper |
## Per-PC fire counts (ours-cold, 1.5B instr, fresh today)
| PC | symbol | fires | tid |
|---|---|---|---|
| 0x8244FEA8 | sub_8244FEA8 (spawn helper) | 7 | 1 |
| 0x8244FF50 | sub_8244FF50 (ExCreateThread caller) | 1 | 1 |
| 0x82450A28 | sub_82450A28 (thread entry) | 1 | 5 |
| 0x82450A68 | sub_82450A68 (worker dispatch loop) | 1 | 5 |
| 0x8245D9D8 | γ-signaler D | 6 | 5 |
| 0x8245DA78 | γ-signaler D-B | 1 | 5 |
| 0x8245DB40 | γ-signaler D-NEW | 75 | 5 |
Spawn event log confirms `ExCreateThread: tid=5 handle=0x1050 entry=0x82450a28 start_ctx=0x828f3b68`.
Total `kernel.calls{name=ExCreateThread} = 10`.
## Comparison with canary (S1 data — fresh today, not stale)
| metric | canary | ours |
|---|---|---|
| thread with entry=0x82450A28 | tid=10 | tid=5 |
| start_ctx | 0x828F3B68 | 0x828F3B68 |
| γ-D family signaler firings | all on tid=10 | all on tid=5 |
| NtSetEvent fires from γ-D (via wrapper 0x824AA2F0) | confirmed | confirmed |
The spawn chain and γ-signaler invocation match. The only divergence at the
signaler call site is **which handle gets signaled**, not whether the
signaler runs.
## Divergence point (parent fires, child also fires)
NONE — every node in the spawn chain fires in ours. The S1-prescribed
"first ancestor that fires while child does not" never materialises because
the entire chain is reached identically.
The actual divergence is downstream of the spawn-chain — at the
**handle-selection** step inside the γ-signaler family, per AUDIT-062's
prior finding ("ours's γ-signalers signal WRONG handles — neighbors of the
wedge handle, not the wedge itself").
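
The S1 criterion ("first ancestor that fires while child does not") can be written down as a small check over the linear part of the chain. The following is a minimal sketch, not code from xenia-rs: `ChainNode`, `first_broken_edge`, and `ours_cold_chain` are hypothetical names, and the counts are copied from the per-PC table in this report.

```rust
/// One node of the spawn chain: guest PC plus its observed fire count.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ChainNode {
    pub pc: u32,
    pub fires: u64,
}

/// Walks the chain root-to-leaf and returns the first (parent, child) pair
/// where the parent fired at least once and the child never fired.
pub fn first_broken_edge(chain: &[ChainNode]) -> Option<(ChainNode, ChainNode)> {
    chain
        .windows(2)
        .map(|w| (w[0], w[1]))
        .find(|(parent, child)| parent.fires > 0 && child.fires == 0)
}

/// Linear part of the chain with the ours-cold counts from this report.
/// The γ-signaler family is left out because it fans out below sub_82450A68.
pub fn ours_cold_chain() -> Vec<ChainNode> {
    vec![
        ChainNode { pc: 0x8244_FEA8, fires: 7 }, // spawn helper
        ChainNode { pc: 0x8244_FF50, fires: 1 }, // ExCreateThread caller
        ChainNode { pc: 0x8245_0A28, fires: 1 }, // thread entry
        ChainNode { pc: 0x8245_0A68, fires: 1 }, // worker dispatch loop
    ]
}
```

With the counts above the check finds no broken edge, which is the "NONE" result of this section.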
## Gate condition
There is no gate that ours fails. The control flow reaches the γ-signaler
and invokes the NtSetEvent wrapper (`sub_824AA2F0`) with bit-identical
control flow. The argument to NtSetEvent (the handle) is the
divergent term.
In the AUDIT-062 archive (`ours-ntset.jsonl`), the γ-D signaler on ours tid=5
calls NtSetEvent on handles `0x103C`, `0x1068`, `0x106C`, `0x1094`, ...
These are guest-side handle slots that the *waiter* is NOT waiting on.
Per S1, canary's wedge waiters (tid=17, tid=26) wait on `0xF80000A4` and
`0xF8000110`. Note that canary's handles are *pseudo-handles* (high-bit
encoded), while ours's slot allocator hands out normal `0x10xx` IDs,
a known cross-engine handle-convention mismatch already documented
in AUDIT-019/043/062.
The semantic question is therefore: **what does the producer compute as
the "next handle to signal", and is the computation reading
a different value of the bookkeeping struct in ours vs canary?**
This is the question AUDIT-062 hit and parked; it must be re-opened
now that S2 has established that the producer thread is reached identically.
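
Because the two engines use different handle conventions, raw handle values should not be compared across them. A minimal sketch of a guard for that, under the assumption that "high-bit encoded" means only that bit 31 is set (the real encoding may reserve more bits); `HandleKind`, `classify_handle`, and `raw_comparable` are hypothetical names:

```rust
/// Which handle convention a raw 32-bit handle value appears to follow.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HandleKind {
    /// Top bit set, e.g. 0xF80000A4 (the convention seen on canary).
    HighBit,
    /// Small slot-style ID, e.g. 0x103C (the convention seen on ours).
    Slot,
}

/// ASSUMPTION: a handle is "high-bit encoded" iff bit 31 is set.
pub fn classify_handle(handle: u32) -> HandleKind {
    if handle & 0x8000_0000 != 0 {
        HandleKind::HighBit
    } else {
        HandleKind::Slot
    }
}

/// Raw numeric comparison is only meaningful when both handles follow
/// the same convention; otherwise a translation step is required first.
pub fn raw_comparable(a: u32, b: u32) -> bool {
    classify_handle(a) == classify_handle(b)
}
```

A diff tool could use `raw_comparable` to refuse a numeric comparison between, say, `0xF80000A4` and `0x103C` rather than report a spurious mismatch.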
## ours-side analog status
The relevant kernel handlers are:
- `NtSetEvent`: per the AUDIT-062 archive, ours (`xenia-kernel/src/exports.rs`)
  is semantically bit-equivalent to canary (signals the event, schedules wakeup).
  Returns SUCCESS in both.
- `ExCreateThread`: ours is bit-equivalent (the S2 spawn matches canary's
  trajectory on ctx, entry, and suspended flag).
- `xeKeWaitForSingleObject` (wedge wait at 0x821CB1DC): ours's behaviour
  matches per AUDIT-049/065 prior work; the WAIT itself is fine. What
  remains broken is which handle the signaler picks on tid=5.
Net: NO kernel handler bug. The divergence is **guest-state computed
inside the γ-signaler family at sub_8245D9D8 / sub_8245DA78 /
sub_8245DB40** — i.e. data that lives in the queue/list dispatched
by sub_82450A68.
## Reading-error #28 reclassification
S1 inadvertently committed the same class of error documented as #28 in
prior audit memory: "treating per-engine tid label numerically across
engines without a tid-mapping translation." S1 used canary's "tid=10"
verbatim and AUDIT-062's "tid=10: 0 fires" verbatim, concluding "ours's
thread set lacks the canary-tid=10 equivalent." In reality the same
guest thread exists on both, with renumbered host-side tid labels.
The correct cross-engine identity is `(entry_pc, start_ctx)`, not the
tid integer. S2 re-validates by `entry=0x82450a28 ∧ ctx=0x828f3b68`,
which uniquely identifies the spawn on both engines.
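
A tid-mapping translation keyed on `(entry_pc, start_ctx)` could look like the sketch below. It is illustrative only: `ThreadKey`, `Spawn`, and `map_tids` are hypothetical names, and skipping keys that occur more than once on either side is an added assumption, not something this audit needed.

```rust
use std::collections::HashMap;

/// Cross-engine thread identity. The tid integer is deliberately excluded.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ThreadKey {
    pub entry_pc: u32,
    pub start_ctx: u32,
}

/// One spawn event as seen by a single engine (hypothetical record shape).
#[derive(Debug, Clone, Copy)]
pub struct Spawn {
    pub tid: u32,
    pub key: ThreadKey,
}

/// Groups the tids observed for each thread identity.
fn index_by_key(spawns: &[Spawn]) -> HashMap<ThreadKey, Vec<u32>> {
    let mut by_key: HashMap<ThreadKey, Vec<u32>> = HashMap::new();
    for s in spawns {
        by_key.entry(s.key).or_default().push(s.tid);
    }
    by_key
}

/// Builds a canary-tid -> ours-tid map by joining spawns on ThreadKey.
/// ASSUMPTION: keys seen more than once on either side are ambiguous
/// and are left unmapped rather than guessed.
pub fn map_tids(canary: &[Spawn], ours: &[Spawn]) -> HashMap<u32, u32> {
    let canary_idx = index_by_key(canary);
    let ours_idx = index_by_key(ours);
    let mut mapping = HashMap::new();
    for (key, canary_tids) in &canary_idx {
        if let Some(ours_tids) = ours_idx.get(key) {
            if canary_tids.len() == 1 && ours_tids.len() == 1 {
                mapping.insert(canary_tids[0], ours_tids[0]);
            }
        }
    }
    mapping
}
```

Fed the spawn from this report (`entry=0x82450A28`, `ctx=0x828F3B68`), the map translates canary tid=10 to ours tid=5, so per-tid fire counts can be compared without reading the tid label numerically.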
Do NOT register a new reading-error #; this is the existing #28 surface.
## Session 3 recommendation (refined)
Drop the spawn-chain investigation entirely. The producer thread runs.
**Path A (RECOMMENDED, ~80 LOC ours-only)**: build a probe of the
**handle-passed-to-NtSetEvent** on tid=5 (ours) inside the γ-signaler
PCs, paired with the symmetric `audit_69_event_signal_watch` capture
from S1 in canary. Compare the *sequence of handle IDs* per signaler
invocation. The first mismatch identifies the guest-state divergence
that drives wrong-handle selection.
Plumbing path: extend `--lr-trace` in ours (`crates/xenia-app/src/main.rs:233-243`)
to capture an `r3` snapshot at multiple PCs, matching canary's
audit_69 wrapper-entry capture. The `r3` field itself already exists (M12
lr_trace lists pc/tid/hw/cycle/r3/r4/r5/r6/lr), so the remaining work is to
probe the `0x824AA2F0` and `0x824AAF50` entry PCs in ours.
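
One way to compare the two handle sequences is sketched below. It assumes raw values cannot be compared directly (the engines number handles differently), so it checks only whether the sequences agree up to a consistent one-to-one renaming. That comparison rule is an assumption of this sketch, not something S1 or S2 prescribes, and `first_handle_divergence` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Returns the index of the first signaler invocation at which the two
/// handle sequences stop agreeing under a one-to-one renaming, or the
/// length of the shorter sequence if one ends early. None means the
/// sequences have the same shape.
pub fn first_handle_divergence(canary: &[u32], ours: &[u32]) -> Option<usize> {
    let mut fwd: HashMap<u32, u32> = HashMap::new(); // canary handle -> ours handle
    let mut rev: HashMap<u32, u32> = HashMap::new(); // ours handle -> canary handle
    for (i, (&c, &o)) in canary.iter().zip(ours.iter()).enumerate() {
        // The first sighting of a handle records the pairing; later
        // sightings must agree with it in both directions.
        let mapped_ours = *fwd.entry(c).or_insert(o);
        let mapped_canary = *rev.entry(o).or_insert(c);
        if mapped_ours != o || mapped_canary != c {
            return Some(i);
        }
    }
    if canary.len() != ours.len() {
        return Some(canary.len().min(ours.len()));
    }
    None
}
```

This check has a blind spot: the first sighting of any handle always "matches", so a handle that is wrong the first time it appears is only caught if it later reappears inconsistently, or if an independent identity for the event object is available.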
**Path B (~50 LOC diff-tool)**: extend the diff-events JSONL absorber to
treat the canary→ours handle-ID mapping as a runtime-discovered alias
when the underlying dispatcher pointer matches. Doesn't fix the bug,
absorbs the symptom.
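
The alias rule for Path B might be sketched as follows. The record shape is an assumption: `HandleEvent`, its `dispatcher_ptr` field, and `AliasTable` are hypothetical and do not reflect the actual diff-events JSONL schema.

```rust
use std::collections::HashMap;

/// One handle-bearing event from either engine (hypothetical shape).
#[derive(Debug, Clone, Copy)]
pub struct HandleEvent {
    pub handle: u32,
    /// Guest address of the underlying dispatcher object.
    pub dispatcher_ptr: u32,
}

/// Runtime-discovered canary -> ours handle aliases.
#[derive(Debug, Default)]
pub struct AliasTable {
    canary_to_ours: HashMap<u32, u32>,
}

impl AliasTable {
    /// Returns true when the pair should be treated as the same object:
    /// the dispatcher pointers match and the handle pairing is either new
    /// (and gets recorded) or agrees with a previously recorded alias.
    pub fn absorb(&mut self, canary: HandleEvent, ours: HandleEvent) -> bool {
        if canary.dispatcher_ptr != ours.dispatcher_ptr {
            return false;
        }
        match self.canary_to_ours.get(&canary.handle) {
            Some(&known) => known == ours.handle,
            None => {
                self.canary_to_ours.insert(canary.handle, ours.handle);
                true
            }
        }
    }
}
```

A pair that fails `absorb` stays in the diff as a real mismatch, so the absorber hides only the numbering difference, not a wrong-object signal.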
**Path C (root-cause, larger)**: walk sub_82450A68 dispatch loop body
disassembly + AUDIT-062 archive to identify which guest-memory struct
holds the queue of "handles to signal." The wrong handles on ours mean
this struct gets populated wrong somewhere upstream of tid=5's dispatch
loop — likely from sub_8244FEA8's 7 fires (which call sites enqueue
work, and what data is enqueued).
LOC budget for S3: Path A ~80, Path B ~50, Path C unknown (~200+).
## Cascade A/B/C/D
- **A** (DB-derived spawn chain): PASS (11 call-edges into sub_8244FEA8, 1 unique call-edge into sub_8244FF50).
- **B** (per-fn fire counts ours+canary): PASS (ours fresh, canary from S1 fresh).
- **C** (divergence-point identification): N/A — no divergence in spawn chain;
S1 framing falsified. Re-direction recommended.
- **D** (kernel-handler bit-equivalence check): PASS (NtSetEvent / ExCreateThread
per AUDIT-062 archive; no new kernel bug detected).
Net: 3/4 PASS, 1/4 N/A (because the postulated divergence wasn't there).
## Discipline
- xenia-rs HEAD UNCHANGED (sha256 of `git diff HEAD` matches S1 end).
- No canary instrumentation added this session — S1's data is fresh.
- ours-rs ran with `--ctor-probe` (read-only, lockstep-digest-unaffected
flag already in main.rs:194).
- No source modifications to ours.
- No ours-rs cache exists on this host; no canary run was made, so there is no canary cache to wipe.
## Artifacts
```
audit-runs/audit-069-wait-signal-producer/
session-2-spawn-walk.log (combined probe + DB queries + fires table)
writer-report-v2.md (this file)
s2/ours-probe.stdout (780 lines, 91 CTOR-PROBE records)
s2/ours-probe.stderr (241 lines, all spawn events + summary)
```
No `fix-canary-v2.diff` (no canary instrumentation added).