handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
192
audit-runs/audit-069-wait-signal-producer/writer-report-v2.md
Normal file
@@ -0,0 +1,192 @@
# AUDIT-069 Session 2: writer report v2

Date: 2026-05-20
xenia-rs HEAD: `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` (UNCHANGED from S1)
`git diff HEAD | sha256sum`: `ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357` (UNCHANGED from S1 end)
No canary instrumentation added this session.

## Headline

**S1's framing is FALSIFIED.** ours does NOT lack a "canary-tid=10
equivalent" thread. The spawn chain executes identically:

    main (ours tid=1) → sub_8244FEA8 → sub_8244FF50
      → ExCreateThread(entry=0x82450A28, ctx=0x828F3B68)
      → ours tid=5 starts
      → sub_82450A28 (1×) → sub_82450A68 (1×)
      → γ-signaler family (sub_8245D9D8 6×, sub_8245DA78 1×, sub_8245DB40 75×)

This is bit-equivalent to canary's chain, modulo the tid label
(canary calls it tid=10, ours calls it tid=5; same entry, same ctx,
same dispatch loop, same γ-signaler family fires from inside it).

The signaler spawn-chain is NOT the bug. S1's "the bug is at the
thread-spawn layer" hypothesis is wrong.

## Spawn chain (DB-derived, READ-ONLY DuckDB)

| Fn | callers in DB | role |
|---|---|---|
| 0x82450A28 | 1 ref-edge from 0x8244FFF8 (sub_8244FF50+0xA8) | thread entry (data ptr only) |
| 0x8244FF50 | 1 call-edge from 0x8244FEE8 (sub_8244FEA8+0x40) | ExCreateThread caller |
| 0x8244FEA8 | 11 call-edges (10 unique callers: sub_821A5150, sub_821CB968, sub_821CC2E8, sub_821D2850, sub_82237EC8, sub_8225EE20, sub_822E0350, sub_824528A8, sub_82452DC0 (2×), sub_8245E528) | spawn helper |

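The caller tally for `sub_8244FEA8` can be re-derived from the caller list in the table row above. The following is a minimal sketch, not the DuckDB query used for the report: `FEA8_CALL_EDGES` and `tally_callers` are hypothetical names, and the addresses are taken from the `sub_XXXXXXXX` symbol names (function starts, not call-site PCs). Counted this way, the list gives 11 call-edges from 10 distinct caller functions.

```rust
use std::collections::BTreeMap;

/// Caller functions of sub_8244FEA8 as listed in the table, one entry per
/// call-edge. sub_82452DC0 appears twice because the table marks it "2x".
const FEA8_CALL_EDGES: [u32; 11] = [
    0x821A_5150, 0x821C_B968, 0x821C_C2E8, 0x821D_2850, 0x8223_7EC8,
    0x8225_EE20, 0x822E_0350, 0x8245_28A8, 0x8245_2DC0, 0x8245_2DC0,
    0x8245_E528,
];

/// Returns (total call-edges, distinct caller functions) for an edge list.
fn tally_callers(edges: &[u32]) -> (usize, usize) {
    let mut per_caller: BTreeMap<u32, usize> = BTreeMap::new();
    for &caller in edges {
        *per_caller.entry(caller).or_insert(0) += 1;
    }
    (edges.len(), per_caller.len())
}
```

Calling `tally_callers(&FEA8_CALL_EDGES)` returns `(11, 10)`.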
## Per-PC fire counts (ours-cold, 1.5B instr, fresh today)

| PC | symbol | fires | tid |
|---|---|---|---|
| 0x8244FEA8 | sub_8244FEA8 (spawn helper) | 7 | 1 |
| 0x8244FF50 | sub_8244FF50 (ExCreateThread caller) | 1 | 1 |
| 0x82450A28 | sub_82450A28 (thread entry) | 1 | 5 |
| 0x82450A68 | sub_82450A68 (worker dispatch loop) | 1 | 5 |
| 0x8245D9D8 | γ-signaler D | 6 | 5 |
| 0x8245DA78 | γ-signaler D-B | 1 | 5 |
| 0x8245DB40 | γ-signaler D-NEW | 75 | 5 |

Spawn event log confirms `ExCreateThread: tid=5 handle=0x1050 entry=0x82450a28 start_ctx=0x828f3b68`.
Total `kernel.calls{name=ExCreateThread} = 10`.

## Comparison with canary (S1 data: fresh today, not stale)

| metric | canary | ours |
|---|---|---|
| thread with entry=0x82450A28 | tid=10 | tid=5 |
| start_ctx | 0x828F3B68 | 0x828F3B68 |
| γ-D family signaler firings | all on tid=10 | all on tid=5 |
| NtSetEvent fires from γ-D (via wrapper 0x824AA2F0) | confirmed | confirmed |

The spawn chain and γ-signaler invocation match. The only divergence at the
signaler call site is **which handle gets signaled**, not whether the
signaler runs.

## Divergence point (parent fires, child also fires)

NONE: every node in the spawn chain fires in ours. The S1-prescribed
"first ancestor that fires while child does not" never materialises because
the entire chain is reached identically.

The actual divergence is downstream of the spawn-chain, at the
**handle-selection** step inside the γ-signaler family, per AUDIT-062's
prior finding ("ours's γ-signalers signal WRONG handles: neighbors of the
wedge handle, not the wedge itself").

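The "first ancestor that fires while child does not" search can be sketched as a walk over parent/child edges with the per-PC fire counts attached. This is an illustrative sketch only: `SPAWN_EDGES`, `ours_fires`, and `first_divergence` are hypothetical names, the fire counts are copied from the per-PC table, and treating the γ-signalers as direct children of the dispatch loop `sub_82450A68` is a simplifying assumption (the report only states that they fire from inside it).

```rust
use std::collections::HashMap;

/// Parent -> child edges of the spawn chain, ordered root to leaf.
/// ASSUMPTION: the three signalers are modelled as direct children of the
/// dispatch loop; the real call path may be indirect.
const SPAWN_EDGES: [(u32, u32); 6] = [
    (0x8244_FEA8, 0x8244_FF50), // spawn helper -> ExCreateThread caller
    (0x8244_FF50, 0x8245_0A28), // ExCreateThread caller -> thread entry
    (0x8245_0A28, 0x8245_0A68), // thread entry -> worker dispatch loop
    (0x8245_0A68, 0x8245_D9D8), // dispatch loop -> signaler D
    (0x8245_0A68, 0x8245_DA78), // dispatch loop -> signaler D-B
    (0x8245_0A68, 0x8245_DB40), // dispatch loop -> signaler D-NEW
];

/// Fire counts from the ours-cold run, as listed in the per-PC table.
fn ours_fires() -> HashMap<u32, u64> {
    HashMap::from([
        (0x8244_FEA8, 7),
        (0x8244_FF50, 1),
        (0x8245_0A28, 1),
        (0x8245_0A68, 1),
        (0x8245_D9D8, 6),
        (0x8245_DA78, 1),
        (0x8245_DB40, 75),
    ])
}

/// Returns the first (parent, child) edge where the parent fired at least
/// once and the child never fired. A PC missing from `fires` counts as 0.
fn first_divergence(edges: &[(u32, u32)], fires: &HashMap<u32, u64>) -> Option<(u32, u32)> {
    edges.iter().copied().find(|&(parent, child)| {
        let count = |pc: u32| fires.get(&pc).copied().unwrap_or(0);
        count(parent) > 0 && count(child) == 0
    })
}
```

With the counts from this session, `first_divergence(&SPAWN_EDGES, &ours_fires())` returns `None`, which is the "NONE" result stated above.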
## Gate condition

There is no gate that ours fails. The control flow reaches the γ-signaler
and invokes the NtSetEvent wrapper (`sub_824AA2F0`) with bit-identical
control flow. The argument to NtSetEvent (the handle) is the
divergent term.

In the AUDIT-062 archive ours-ntset.jsonl, the γ-D signaler on ours tid=5
calls NtSetEvent on handles `0x103C`, `0x1068`, `0x106C`, `0x1094`, ...
These are guest-side handle slots that the *waiter* is NOT waiting on.

Per S1, canary's wedge waiters (tid=17, tid=26) wait on `F80000A4` and
`F8000110`. Note that canary's handles are *pseudo-handles* (high-bit
encoded), while ours's slot allocator hands out normal `0x10xx` IDs,
a known cross-engine handle convention mismatch already documented
in AUDIT-019/043/062.

The semantic question is therefore: **what does the producer compute as
the "next handle to signal", and is the computation reading
a different value of the bookkeeping struct in ours vs canary?**
This is the question AUDIT-062 hit and parked; it must be re-opened
now that S2 has clarified the producer thread is reached identically.

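The handle-convention mismatch means raw handle values cannot be compared directly across engines. A minimal sketch of a guard for that follows. `HandleKind`, `classify_handle`, and `raw_comparable` are hypothetical names, and testing bit 31 is an assumption standing in for "high-bit encoded"; neither engine's actual handle-table layout is modelled here.

```rust
/// Which numbering convention a raw handle value appears to follow.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HandleKind {
    /// Top bit set, e.g. 0xF80000A4 (the canary-side values in this report).
    HighBitEncoded,
    /// Small slot-allocator ID, e.g. 0x103C (the ours-side values).
    SlotId,
}

/// ASSUMPTION: "high-bit encoded" is approximated by testing bit 31 only.
fn classify_handle(raw: u32) -> HandleKind {
    if raw & 0x8000_0000 != 0 {
        HandleKind::HighBitEncoded
    } else {
        HandleKind::SlotId
    }
}

/// Raw equality is only meaningful when both values use the same convention;
/// otherwise a translation step is needed before comparing.
fn raw_comparable(a: u32, b: u32) -> bool {
    classify_handle(a) == classify_handle(b)
}
```

A diff tool could use `raw_comparable` to refuse a naive equality check between, say, `0xF8000110` and `0x1068`, and route such pairs through an alias mapping instead.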
## ours-side analog status

The relevant kernel handlers are:

- `NtSetEvent`: ours (`xenia-kernel/src/exports.rs`) is, per the AUDIT-062
  archive, bit-equivalent to canary in semantics (signals the event,
  schedules wakeup). Returns SUCCESS in both.

- `ExCreateThread`: ours is bit-equivalent (S2 spawn matches canary trajectory
  ctx + entry + suspended flag).

- `xeKeWaitForSingleObject` (wedge wait at 0x821CB1DC): ours behaviour
  matches per AUDIT-049/065 prior work; the WAIT itself is fine, what
  remains broken is the signaler picking the right handle on tid=5.

Net: NO kernel handler bug. The divergence is **guest-state computed
inside the γ-signaler family at sub_8245D9D8 / sub_8245DA78 /
sub_8245DB40**, i.e. data that lives in the queue/list dispatched
by sub_82450A68.

## Reading-error #28 reclassification

S1 inadvertently committed the same class of error documented as #28 in
prior audit memory: "treating per-engine tid label numerically across
engines without a tid-mapping translation." S1 used canary's "tid=10"
verbatim and AUDIT-062's "tid=10: 0 fires" verbatim, concluding "ours's
thread set lacks the canary-tid=10 equivalent." In reality the same
guest thread exists on both, with renumbered host-side tid labels.

The correct cross-engine identity is `(entry_pc, start_ctx)`, not the
tid integer. S2 re-validates by `entry=0x82450a28 ∧ ctx=0x828f3b68`,
which uniquely identifies the spawn on both engines.

Do NOT register a new reading-error #; this is the existing #28 surface.

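The tid-mapping translation can be sketched as a join of the two engines' spawn logs on `(entry_pc, start_ctx)`. This is an illustrative sketch under stated assumptions: `GuestThreadKey`, `SpawnRecord`, and `map_tids` are hypothetical names rather than types from either codebase, and each key is assumed to occur at most once per engine (a repeated key would keep only the last ours-side tid).

```rust
use std::collections::HashMap;

/// Cross-engine thread identity: entry PC plus start context pointer.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct GuestThreadKey {
    entry_pc: u32,
    start_ctx: u32,
}

/// One ExCreateThread event as seen by a single engine.
struct SpawnRecord {
    tid: u32,
    key: GuestThreadKey,
}

/// Builds a canary-tid -> ours-tid map by matching spawns on their key.
/// Canary threads with no ours-side counterpart are left out of the map.
fn map_tids(canary: &[SpawnRecord], ours: &[SpawnRecord]) -> HashMap<u32, u32> {
    let ours_by_key: HashMap<GuestThreadKey, u32> =
        ours.iter().map(|s| (s.key, s.tid)).collect();
    canary
        .iter()
        .filter_map(|c| ours_by_key.get(&c.key).map(|&ours_tid| (c.tid, ours_tid)))
        .collect()
}
```

Fed the two spawn events from this report (canary tid=10 and ours tid=5, both with entry `0x82450A28` and ctx `0x828F3B68`), the map contains the single entry `10 -> 5`. A canary tid that is absent from the map is then a genuine "missing thread" candidate, rather than a labelling artifact.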
## Session 3 recommendation (refined)

Drop the spawn-chain investigation entirely. The producer thread runs.

**Path A (RECOMMENDED, ~80 LOC ours-only)**: build a probe of the
**handle-passed-to-NtSetEvent** on tid=5 (ours) inside the γ-signaler
PCs, paired with the symmetric `audit_69_event_signal_watch` capture
from S1 in canary. Compare the *sequence of handle IDs* per signaler
invocation. The first mismatch identifies the guest-state divergence
that drives wrong-handle selection.

Plumbing path: extend `--lr-trace` in ours (`crates/xenia-app/src/main.rs:233-243`)
to also capture an `r3` snapshot at multiple PCs, matching canary's
audit_69 wrapper-entry capture. The register capture already exists (M12
lr_trace lists pc/tid/hw/cycle/r3/r4/r5/r6/lr). Probe ours `0x824AA2F0` and
`0x824AAF50` entry PCs.

**Path B (~50 LOC diff-tool)**: extend the diff-events JSONL absorber to
treat the canary→ours handle-ID mapping as a runtime-discovered alias
when the underlying dispatcher pointer matches. Doesn't fix the bug,
absorbs the symptom.

**Path C (root-cause, larger)**: walk sub_82450A68 dispatch loop body
disassembly + AUDIT-062 archive to identify which guest-memory struct
holds the queue of "handles to signal." The wrong handles on ours mean
this struct gets populated wrong somewhere upstream of tid=5's dispatch
loop, likely from sub_8244FEA8's 7 fires (which call sites enqueue
work, and what data is enqueued).

LOC budget for S3: Path A ~80, Path B ~50, Path C unknown (~200+).

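Path A's sequence comparison can be sketched as follows. Because the two engines number handles differently, this sketch does not compare raw values. It rewrites each sequence as first-seen ordinals and compares those instead. That normalization is an assumption of this example, not a method taken from the audit, and `normalize` and `first_mismatch` are hypothetical names. It detects differences in the *pattern* of handle reuse only, so two runs that signal entirely different objects in the same pattern would still compare equal.

```rust
use std::collections::HashMap;

/// Rewrites a handle sequence as first-seen ordinals,
/// e.g. [A, B, A, C] becomes [0, 1, 0, 2].
fn normalize(handles: &[u32]) -> Vec<usize> {
    let mut ordinal: HashMap<u32, usize> = HashMap::new();
    handles
        .iter()
        .map(|&h| {
            let next = ordinal.len();
            *ordinal.entry(h).or_insert(next)
        })
        .collect()
}

/// Index of the first signal at which the two engines disagree in shape.
/// If one sequence is a strict prefix of the other, the mismatch is the
/// index where the shorter one ends. Returns None when the shapes match.
fn first_mismatch(canary: &[u32], ours: &[u32]) -> Option<usize> {
    let (c, o) = (normalize(canary), normalize(ours));
    let common = c.len().min(o.len());
    (0..common)
        .find(|&i| c[i] != o[i])
        .or_else(|| if c.len() != o.len() { Some(common) } else { None })
}
```

For example, with synthetic inputs, `first_mismatch(&[0xA, 0xB, 0xA], &[1, 2, 3])` returns `Some(2)`: the first engine re-signals its first handle at index 2 while the second signals a new one. The returned index is the signaler invocation to inspect for the guest-state divergence.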
## Cascade A/B/C/D

- **A** (DB-derived spawn chain): PASS (11 call-edges into FEA8, 1 unique call-edge into FF50).
- **B** (per-fn fire counts ours+canary): PASS (ours fresh, canary from S1 fresh).
- **C** (divergence-point identification): N/A, no divergence in spawn chain;
  S1 framing falsified. Re-direction recommended.
- **D** (kernel-handler bit-equivalence check): PASS (NtSetEvent / ExCreateThread
  per AUDIT-062 archive; no new kernel bug detected).

Net: 3/4 PASS, 1/4 N/A (because the postulated divergence wasn't there).

## Discipline

- xenia-rs HEAD UNCHANGED (sha256 of `git diff HEAD` matches S1 end).
- No canary instrumentation added this session; S1's data is fresh.
- ours-rs ran with `--ctor-probe` (read-only, lockstep-digest-unaffected
  flag already in main.rs:194).
- No source modifications to ours.
- ours-rs cache (none on this host); no canary run, no canary cache to wipe.

## Artifacts

```
audit-runs/audit-069-wait-signal-producer/
  session-2-spawn-walk.log   (combined probe + DB queries + fires table)
  writer-report-v2.md        (this file)
  s2/ours-probe.stdout       (780 lines, 91 CTOR-PROBE records)
  s2/ours-probe.stderr       (241 lines, all spawn events + summary)
```

No `fix-canary-v2.diff` (no canary instrumentation added).