handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions
--- a/audit-runs/review-a-boot-state/plan.md
+++ b/audit-runs/review-a-boot-state/plan.md
@@ -0,0 +1,333 @@
+# Review A — boot-state review and shortest-path roadmap
+
+**Session type**: PLAN-only.  No engine LOC changes; no canary
+instrumentation changes.  Read-only investigation across the
+existing audit chain artifacts.
+**Date**: 2026-05-21
+**Companion documents** (in this directory):
+- `canary-boot-trajectory.md` — canary's call chain from entry_point
+  to first gameplay draw, with wallclock timestamps.
+- `ours-wedge-localization.md` — precise where-ours-stops, in graph
+  terms.
+- `shortest-path-roadmap.md` — 3-5 step roadmap with expected
+  progression delta per step.
+- `methodology-assessment.md` — alternative metric proposal.
+
+This `plan.md` summarizes the five framing questions with answers
+backed by file:line citations.
+
+---
+
+## Q1 — What is "first draw" in canary's Sylpheed boot?
+
+**Two distinct "draws" must be disambiguated.**
+
+### Q1.a: First boot-init `VdSwap` (the swap=1 event)
+
+Canary's tid=6 (guest main) emits **one** `VdSwap` at ~9.5 s
+wallclock, immediately after the GPU subsystem init sequence
+`VdInitializeEngines → VdInitializeRingBuffer →
+VdEnableRingBufferRPtrWriteBack → VdSetGraphicsInterruptCallback →
+VdSetSystemCommandBufferGpuIdentifierAddress → VdGetSystemCommandBuffer`.
+This swap publishes the boot framebuffer and contains no draw packets.
+
+**Ours also reaches this swap** — visible in
+`phase-w-wedge-reattack/ours-postfix.jsonl` at idx 105283 (host_ns
+496,276,229).  This is what produces ours's `swaps=1` metric.
+
+Both engines reach this point.  **It is NOT the gate.**
+
+### Q1.b: First gameplay `VdSwap` (the swap≥2 / draws≥1 event)
+
+Canary's renderer tid=13 (entry `0x822F1EE0`, spawned suspended at
+1.671 s) wakes after the `sub_825070F0` worker fan-out at host_ns
+≈ 10.383 s and begins emitting `VdGetSystemCommandBuffer` /
+`VdSwap` pairs at ~150 fps.  Canary's tid=13 emits **12,092
+VdSwap calls in the 90-s window** (per
+`phase-nonmatch-investigation/canary-tid-profiles.md:21`).
+
+The first of these is the **first gameplay draw**, fired at ~10.7 s
+wallclock — about 1.2 s after the `sub_825070F0` fan-out triggers
+the worker cluster.
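+
+As a quick consistency check on the figures above, the sketch below
+recomputes the renderer's swap rate.  It assumes (this is not stated
+in the traces) that the 90-s window starts at process launch and that
+swaps are spread evenly after the first gameplay draw; the constant
+and function names are illustrative only.
+
+```python
+# Figures quoted in this section.
+SWAPS_IN_WINDOW = 12_092      # VdSwap calls on canary tid=13
+WINDOW_S = 90.0               # capture window, seconds
+FIRST_DRAW_S = 10.7           # approximate first gameplay draw
+
+def swap_rate(swaps: int, window_s: float, start_s: float = 0.0) -> float:
+    """Average swaps per second between start_s and the end of the window."""
+    active = window_s - start_s
+    if active <= 0:
+        raise ValueError("start_s must fall inside the window")
+    return swaps / active
+
+whole_window = swap_rate(SWAPS_IN_WINDOW, WINDOW_S)
+after_first_draw = swap_rate(SWAPS_IN_WINDOW, WINDOW_S, FIRST_DRAW_S)
+print(f"{whole_window:.1f} swaps/s over the whole window")
+print(f"{after_first_draw:.1f} swaps/s after the first gameplay draw")
+```
+
+Under those assumptions the post-first-draw rate lands close to the
+~150 fps figure, which is consistent with a loop that runs steadily
+once it starts.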
+
+**Pre-conditions canary establishes before this point** (per
+`canary-boot-trajectory.md`):
+
+1. Vtable `0x8200A1E8` of `ANON_Class_713383D7` installed at host_ns
+   ≈ 9.4-9.6 s via POD-copy at GUEST PC `sub_824FD240+0x24`
+   (per `project_audit_068_session4_2026_05_20`).
+2. Activation chain `sub_822F1AA8 → sub_82173990 → sub_821746B0 →
+   sub_82172BA0 → sub_821B55D8 → sub_824F8398 → sub_824F7CD0 →
+   sub_824F7800 → bctrl vtable[1] = sub_825070F0` fires on tid=6.
+3. `sub_825070F0` spawns 4 worker threads with entries
+   `0x82506528/58/88/B8` and shared ctx `0xBCE251C0`.
+4. Workers (canary tids 27/28/29) emit signals that unwedge the
+   `sub_821CB030` Event waits across the cache-file IO completion
+   chain.
+5. Renderer tid=13's body (entered earlier but blocked on a
+   tid=14/15 XAudio-coordinated event) unblocks; per-frame
+   `VdGetSystemCommandBuffer` / `VdSwap` loop begins.
+
+---
+
+## Q2 — What is ours's actual progress, and what's the wedge root cause?
+
+**Ours stops at the first wait in the activation chain.** Specifically:
+
+- **tid=1 (main)** wedged at `sub_82173990+0x2D4` (PC `0x824ac578` =
+  `do_wait_single`) on handle `0x12c8` = `Thread(id=13)` — waiting
+  for the renderer's thread handle to signal (which happens only when
+  tid=13 calls `ExTerminateThread`).
+- **tid=13 (renderer / cache-IO worker)** wedged at
+  `sub_821CB030+0x1B0` on handle `0x12d0` = `Event/Auto`, created by
+  itself via `NtCreateEvent` at `sub_821CB030+0x128`.  `signals=0,
+  wakes=0` — `<NO_SIGNALS_DESPITE_WAITS>`.
+- **`sub_825070F0` fires 0×** at any horizon probed.
+
+Citation: `phase-w-wedge-reattack/halt-on-deadlock-dump.txt` +
+`phase-w-wedge-reattack/current-state.md`.
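+
+The `<NO_SIGNALS_DESPITE_WAITS>` condition can be expressed as a
+simple filter over per-handle counters.  The sketch below is a
+minimal illustration, not the dump tool's actual logic: the
+`HandleStats` record, its field names, and the second sample handle
+are hypothetical, and the wait counts are placeholders rather than
+measured values.
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class HandleStats:
+    handle: int
+    kind: str      # e.g. "Event/Auto"
+    waits: int
+    signals: int
+    wakes: int
+
+def no_signals_despite_waits(stats):
+    """Return handles that were waited on but never signalled or woken."""
+    return [s for s in stats
+            if s.waits > 0 and s.signals == 0 and s.wakes == 0]
+
+# Illustrative records only; wait counts are placeholders.
+observed = [
+    HandleStats(0x12D0, "Event/Auto", waits=1, signals=0, wakes=0),
+    HandleStats(0x0001, "Event/Manual (hypothetical healthy handle)",
+                waits=3, signals=3, wakes=3),
+]
+for s in no_signals_despite_waits(observed):
+    print(f"{s.handle:#x} {s.kind}: <NO_SIGNALS_DESPITE_WAITS>")
+```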
+
+### Root cause (one structural level deeper than the wedge symptom)
+
+**Per AUDIT-069 Session 5 (the most recent measurement):**
+
+- Canary fires 414 `NtReleaseSemaphore` calls on the work-queue
+  semaphore in the 90-s window.
+- Ours fires 99 (24%).
+- Breakdown: Worker (382 vs 90), Main (7 vs 8), **Other producers
+  (25 vs 1)**.
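+
+The breakdown can be tabulated to confirm that the per-class counts
+sum to the quoted totals and to see which class is furthest from
+parity.  This is a sketch over the numbers quoted above; the class
+labels and helper names are illustrative.
+
+```python
+# NtReleaseSemaphore counts quoted above: (canary, ours) per producer class.
+RELEASES = {
+    "worker": (382, 90),
+    "main": (7, 8),
+    "other": (25, 1),
+}
+
+def totals(table):
+    """Sum the canary and ours columns across all producer classes."""
+    canary = sum(c for c, _ in table.values())
+    ours = sum(o for _, o in table.values())
+    return canary, ours
+
+def coverage(table):
+    """Fraction of canary's releases that ours reproduces, per class."""
+    return {name: ours / canary for name, (canary, ours) in table.items()}
+
+canary_total, ours_total = totals(RELEASES)
+print(canary_total, ours_total, f"{ours_total / canary_total:.0%}")
+for name, frac in sorted(coverage(RELEASES).items(), key=lambda kv: kv[1]):
+    print(f"{name:>6}: {frac:.0%}")
+```
+
+The per-class counts add up to 414 and 99, and the "other" class has
+the lowest relative coverage (1 of 25).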
+
+The "**other producers (25 vs 1)**" gap is the load-bearing
+discrepancy.  Canary records **24 more releases from other producer
+threads** on the work semaphore during bootstrap than ours does.
+These correspond to:
+
+1. The 4 `sub_825070F0` workers (canary tids 27/28/29 + 1) — absent
+   in ours.
+2. XAudio render threads (canary tids 14/15, spawned suspended in
+   both engines, **resumed only in canary**).
+3. The secondary spawn burst at 1.94-2.15 s (canary tids 18-25) —
+   8 helpers including file-IO and NtWaitForMultipleObjectsEx workers
+   — absent in ours.
+
+### The ONE structural issue
+
+> **Ours never reaches `sub_825070F0` because the activation chain
+> that calls it is downstream of tid=13's wedge; and tid=13's wedge
+> is downstream of the worker cluster activation; and the worker
+> cluster activation is `sub_825070F0`.  This is a self-referential
+> lock.**
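+
+The circular dependency can be made explicit by modelling the three
+conditions as a "blocked until" graph and searching it for a cycle.
+The node labels below are informal names chosen for this sketch, not
+identifiers from either engine.
+
+```python
+# Edges read "X is blocked until Y happens".
+BLOCKED_ON = {
+    "call sub_825070F0": ["tid=13 unwedged"],
+    "tid=13 unwedged": ["worker cluster active"],
+    "worker cluster active": ["call sub_825070F0"],
+}
+
+def find_cycle(graph, start):
+    """Depth-first search; return the first cycle reachable from start, else None."""
+    path, on_path = [], set()
+
+    def visit(node):
+        if node in on_path:
+            # Close the loop: slice from the first occurrence of node.
+            return path[path.index(node):] + [node]
+        path.append(node)
+        on_path.add(node)
+        for dep in graph.get(node, []):
+            found = visit(dep)
+            if found:
+                return found
+        path.pop()
+        on_path.remove(node)
+        return None
+
+    return visit(start)
+
+print(" -> ".join(find_cycle(BLOCKED_ON, "call sub_825070F0")))
+```
+
+Removing any one edge (for example, giving "tid=13 unwedged" a
+producer that does not depend on the worker cluster) makes
+`find_cycle` return `None`.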
+
+Canary breaks the lock because some part of the bootstrap
+*pre-activates* the producers (probably via XAudio thread resume at
+1.726 s, which then runs ahead, populates the work queue, signals
+events, etc.).  Ours never resumes the XAudio threads — they're
+spawned suspended and stay that way.
+
+**The single highest-leverage gap is the XAudio thread resume,**
+because (a) it happens early (1.726 s in canary; ours's wedge sets
+in around 1.4 s, so the resume should happen before the wedge),
+(b) it activates the dominant event producers, and (c) AUDIT-069
+S5's "other producers 25 vs 1" finding implicates exactly this class
+of thread.
+
+---
+
+## Q3 — Shortest-path-to-first-draw roadmap
+
+Five steps (full detail in `shortest-path-roadmap.md`):
+
+- **Step 1 (~80-150 LOC, ours-side)**: add `--force-spawn-workers`
+  cvar that crowbars `sub_825070F0` activation by directly spawning
+  the 4 worker threads with the right ctx after `VdInitializeRingBuffer`
+  returns.  Tests "are the workers functionally correct if activated"
+  and "does activating them unwedge sub_821CB030."
+- **Step 2 (~0 LOC)**: with Step 1 active, mine the canary jsonl for
+  the kernel-call sequence on tid=6 in the wallclock window [9.4 s,
+  9.6 s] (the install epoch).  Identify what guest call triggers
+  `sub_824FD240+0x24`'s POD-copy of the vtable in canary.
+- **Step 3 (~10-500 LOC, depending on what Step 2 finds)**: mirror
+  that trigger in ours — likely a missing kernel-import return value
+  or a missing post-condition that the trigger inspects.
+- **Step 4 (~0 LOC; remove crowbar)**: re-test ours without
+  `--force-spawn-workers`.  Verify natural bootstrap reaches
+  `sub_825070F0` activation.
+- **Step 5 (~0-50 LOC)**: measure renderer-thread VdSwap rate over 90 s
+  wallclock; target ±30% of canary's 12,092 calls.
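+
+Step 2's window query can be sketched as a small filter over jsonl
+lines.  The record schema here is an assumption: the keys `tid`,
+`host_ns` (taken to be nanoseconds since start), `idx`, and `name`,
+as well as the sample records, are hypothetical and may not match the
+real trace format.
+
+```python
+import json
+
+def events_in_window(lines, tid, start_s, end_s):
+    """Yield parsed events for one thread with a timestamp in [start_s, end_s]."""
+    for line in lines:
+        line = line.strip()
+        if not line:
+            continue
+        event = json.loads(line)
+        t = event["host_ns"] / 1e9      # assumed unit: nanoseconds
+        if event["tid"] == tid and start_s <= t <= end_s:
+            yield event
+
+# Hypothetical sample records, not taken from any real trace.
+sample = [
+    '{"idx": 1, "tid": 6, "host_ns": 9300000000, "name": "example_call_a"}',
+    '{"idx": 2, "tid": 6, "host_ns": 9450000000, "name": "example_call_b"}',
+    '{"idx": 3, "tid": 13, "host_ns": 9500000000, "name": "example_call_c"}',
+    '{"idx": 4, "tid": 6, "host_ns": 9700000000, "name": "example_call_d"}',
+]
+for ev in events_in_window(sample, tid=6, start_s=9.4, end_s=9.6):
+    print(ev["idx"], ev["name"])
+```
+
+Because the function only iterates over lines, an open file handle
+can be passed in place of the list.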
+
+Expected delta:
+
+| After step | `swaps` | `draws` | `unique_render_targets` |
+|---|---:|---:|---:|
+| Pre | 1 | 0 | 0 |
+| Step 1 (crowbar) | 2+ | 1+ | 1+ |
+| Step 4 (decrowbar) | 2+ | 1+ | 1+ |
+| Step 5 (parity) | 100+ | 100+ | 1-5 |
+
+---
+
+## Q4 — What's NOT on the shortest path
+
+Explicitly deferred (full rationale in `shortest-path-roadmap.md`):
+
+- **Audio (host-audio-* / XAudio implementation)** — even though
+  XAudio thread resume MAY be the trigger from Q2, ours's existing
+  XAudio shim is sufficient for the workers to bootstrap if they
+  receive the right kernel-call sequence.  Full XAudio
+  implementation is beyond first-draw scope.
+- **HID** — Sylpheed's intro/title screens are auto-advance; no
+  input needed.
+- **XAM content / save games** — not on first-draw path.
+- **Scheduler determinism work** (Phase D Stages 0-4 and beyond) —
+  null result; the wedge is upstream of contention scheduling.
+  Close or indefinitely defer.
+- **Diff-tool canonicalization** (Phase C+N for N > 25) — saturated
+  on matched-prefix without progression; halt this work class until
+  Step 4 lands and the workload re-baselines.
+- **AUDIT-068 host-side install probes** — superseded by AUDIT-068
+  Session 4 finding (writer is GUEST PC, not host).  The followup
+  question is what *triggers* the guest code path, which Step 2
+  addresses through cheaper means.
+
+---
+
+## Q5 — Methodology assessment
+
+**Current methodology relied on matched-prefix as a progression
+proxy.  This assumption is now empirically falsified**: +2,960
+events of matched-prefix advancement produced 0 units of progression
+(`swaps=1, draws=0` across 25+ iterates).
+
+### Proposed alternative metric
+
+**Option 6 (composite `progression_score`)**:
+
+```
+progression_score = 1 * swaps + 10 * draws + 100 * unique_render_targets
+                  + 0.001 * matched_prefix
+```
+
+The score gives a continuous gradient and weights wedge-solving
+progress far above canonicalization.  Adding it to `digest.json`
+requires ~10 LOC.
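+
+A minimal sketch of the score as a function, comparing a
+canonicalization-only iterate against a minimal progression iterate.
+The function name mirrors the formula; the `matched_prefix=0`
+baseline is a placeholder so that only the deltas matter, and the
+"first draw" inputs are a hypothetical outcome, not a measurement.
+
+```python
+def progression_score(swaps, draws, unique_render_targets, matched_prefix):
+    """Composite score using the weights from the formula above."""
+    return (1 * swaps
+            + 10 * draws
+            + 100 * unique_render_targets
+            + 0.001 * matched_prefix)
+
+# Pre-roadmap state from the Q3 table (swaps=1, draws=0, targets=0).
+baseline = progression_score(1, 0, 0, 0)
+
+# Canonicalization only: +2,960 matched-prefix events, nothing else moves.
+canon_only = progression_score(1, 0, 0, 2_960)
+
+# Hypothetical minimal progression: second swap, first draw, first target.
+first_draw = progression_score(2, 1, 1, 0)
+
+print(f"canonicalization-only delta: {canon_only - baseline:.2f}")
+print(f"first-gameplay-draw delta:   {first_draw - baseline:.2f}")
+```
+
+With these weights, the +2,960 matched-prefix advance moves the score
+by less than 3, while a single gameplay draw with one render target
+moves it by more than 100.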
+
+Discipline: tag every iterate as either
+"**canonicalization only — no progression**" or
+"**progression**".  Cap at 5 consecutive canonicalization-only
+iterates before mandatory pivot to wedge-attack work.
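+
+The cap can be checked mechanically from the tag history.  This
+sketch reads the rule as "once 5 consecutive canonicalization-only
+iterates have accumulated, the next iterate must be wedge-attack
+work"; that reading, the shortened tag strings, and the helper names
+are assumptions.
+
+```python
+CANON = "canonicalization-only"   # shortened form of the tag above
+PROGRESSION = "progression"
+MAX_CANON_STREAK = 5
+
+def trailing_canon_streak(tags):
+    """Count consecutive canonicalization-only iterates at the end of the history."""
+    streak = 0
+    for tag in reversed(tags):
+        if tag != CANON:
+            break
+        streak += 1
+    return streak
+
+def must_pivot(tags, cap=MAX_CANON_STREAK):
+    """True once the cap is reached, i.e. the next iterate must attack the wedge."""
+    return trailing_canon_streak(tags) >= cap
+
+history = [PROGRESSION] + [CANON] * 4
+print(must_pivot(history))             # four in a row
+print(must_pivot(history + [CANON]))   # five in a row
+```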
+
+### New reading-error #39
+
+> **#39 (matched-prefix as progression proxy)**: matched-prefix
+> measures engine-to-engine divergence point, NOT game-to-game
+> functional gap.  When the wedge is on a different thread than the
+> matched-prefix anchor thread, advancing matched-prefix is
+> orthogonal to unwedging.  Future audits MUST distinguish "ours's
+> tid-X diverges from canary's tid-Y" from "ours's tid-X is *blocked
+> because tid-Z is wedged*", and target the wedge directly when
+> present.
+
+---
+
+## Counterintuitive findings (anti-anchoring)
+
+Per Tripstones in the task brief:
+
+### 1. Both engines reach `swaps=1`; ours is NOT behind on the boot swap.
+
+The shared boot-init `VdSwap` fires in both.  Ours's `swaps=1` metric
+is achieved at the same point where canary achieves it.  The
+divergence is NOT "ours can't do the first swap"; it's "ours can't do
+the SECOND through Nth swap (the gameplay loop)".
+
+### 2. Tripstone 4 verified: canary does reach gameplay draws, ours does not.
+
+`canary-jitter-1.jsonl` shows 12,092 VdSwap calls on canary tid=13 in
+90 s wallclock, which places it definitively in the gameplay rendering
+loop, not pre-first-draw.  Ours's counterpart to canary tid=13 emits
+~80 events total before wedging, definitively before gameplay starts.
+The "both engines pre-first-draw" hypothesis is FALSE.
+
+### 3. The matched-prefix metric is on the WRONG thread.
+
+Matched-prefix tracks tid=6 (canary) vs tid=1 (ours), the main
+threads.  But the wedge is on **the renderer thread (tid=13 in both
+engines)**.  Tid=1's matched-prefix can advance 105,128 events
+without ever touching the wedge.
+
+### 4. The "boot-state-machine" framing is misleading.
+
+There's no monolithic boot state machine.  There are ~28 threads in
+canary, each running its own lifecycle, communicating via shared
+kernel objects.  The bottleneck isn't a state transition; it's a
+THREAD ACTIVATION GAP.
+
+### 5. AUDIT-069 Session 5's "other producers 25 vs 1" is the key forensic discovery, more than AUDIT-068's vtable install epoch.
+
+The vtable install IS interesting but it's downstream of the producer
+gap.  Producers must be running to populate the work queue, which
+gets the worker to do its thing, which signals the wedge, which lets
+the activation chain continue, which calls `sub_824FD240+0x24`,
+which writes the vtable.  Fixing the vtable install in isolation
+(e.g., via a host-side mem-write hack) doesn't help if no producer
+is feeding work to the workers.
+
+---
+
+## Cascade prediction confidence
+
+- A — canary boot trajectory characterized: **DONE, HIGH** (canary-jitter-1.jsonl provides direct evidence).
+- B — ours's wedge root-cause localized deeper than "sub_821CB030 waits": **DONE, MEDIUM-HIGH** (AUDIT-069 S5 "other producers 25 vs 1" finding).
+- C — shortest-path roadmap with ≤5 steps: **DONE, MEDIUM** (5 steps; Step 1 confidence ~60%).
+- D — alternative metric proposed: **DONE, HIGH** (Option 6 composite, plus reading-error #39).
+
+---
+
+## Open questions / known unknowns
+
+1. **What is the bootstrap trigger for canary's `sub_824FD240+0x24`?**
+   Roadmap Step 2 addresses this.  It could be answered in <1 session of
+   canary jsonl analysis.
+2. **Does Step 1's crowbar produce a clean wedge-unblock, or does it
+   reveal additional unmodelled state in the ctx object?** Empirical;
+   testable in one session.
+3. **Are canary's XAudio threads (tids 14/15) the actual missing
+   producer, or are they downstream of the same trigger?** Worth a
+   targeted probe before Step 1; ~50 LOC ours-side to log
+   NtResumeThread on the XAudio entry PCs.
+4. **Will the AUDIT-067 "vtable install is host-side" finding
+   resurface?** No — AUDIT-068 S4 falsified this; the writer is
+   GUEST PC `sub_824FD240+0x24`.  The "host-side" framing was a
+   mis-read of the POD-copy semantics (reading-error #36).
+
+---
+
+## Recommended next action
+
+**Dispatch a "progression iterate" implementing Step 1 of the
+roadmap** (`--force-spawn-workers` crowbar, ~80-150 LOC ours-side).
+This is a high-variance, high-reward iterate; expected outcome is
+either `swaps ≥ 2, draws ≥ 1` (success — wedge structurally
+isolated to thread activation) or an informative failure mode (e.g.,
+worker faults at first vtable bctrl indicating additional state
+needed in ctx object).  Time-box: 1 session, max 2h.
+
+If Step 1 succeeds in ANY way (even if draws stays 0), the next
+iterate is Step 2 (kernel-call sequence mining in canary-jitter-1.jsonl).
+This step has minimal risk and uses existing tooling.
+
+If Step 1 fails completely (panic / segfault unrecoverable), revert
+the crowbar and reframe: the wedge may be in ours's kernel-handler
+implementations themselves, not just bootstrap activation.  At that
+point a deeper Path β engine investigation is unavoidable.
+
+---
+
+## Memory hygiene note
+
+This review is read-only.  xenia-rs HEAD unchanged.  canary HEAD
+unchanged.  sylpheed.db unchanged.  No new artifacts beyond this
+directory.
+
+After dispatching Step 1, future memory entries should adopt the
+new `progression_score` + tagging discipline outlined in
+`methodology-assessment.md`.