handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
333
audit-runs/review-a-boot-state/plan.md
Normal file
@@ -0,0 +1,333 @@
# Review A: boot-state review and shortest-path roadmap

**Session type**: PLAN-only. No engine LOC changes; no canary
instrumentation changes. Read-only investigation across the
existing audit chain artifacts.

**Date**: 2026-05-21

**Companion documents** (in this directory):

- `canary-boot-trajectory.md`: canary's call chain from entry_point
  to first gameplay draw, with wallclock timestamps.
- `ours-wedge-localization.md`: precisely where ours stops, in graph
  terms.
- `shortest-path-roadmap.md`: 3-5 step roadmap with expected
  progression delta per step.
- `methodology-assessment.md`: alternative metric proposal.

This `plan.md` summarizes the five framing questions with answers
backed by file:line citations.

---

## Q1: What is "first draw" in canary's Sylpheed boot?

**Two distinct "draws" must be disambiguated.**

### Q1.a: First boot-init `VdSwap` (the swap=1 event)

Canary's tid=6 (guest main) emits **one** `VdSwap` at ~9.5 s
wallclock, immediately after the GPU subsystem init sequence
`VdInitializeEngines → VdInitializeRingBuffer →
VdEnableRingBufferRPtrWriteBack → VdSetGraphicsInterruptCallback →
VdSetSystemCommandBufferGpuIdentifierAddress → VdGetSystemCommandBuffer`.
This swap publishes the boot framebuffer and contains no draw packets.

**Ours also reaches this swap**: it is visible in
`phase-w-wedge-reattack/ours-postfix.jsonl` at idx 105283 (host_ns
496,276,229) and is what produces ours's `swaps=1` metric.

Both engines reach this point. **It is NOT the gate.**

### Q1.b: First gameplay `VdSwap` (the swap≥2 / draws≥1 event)

Canary's renderer tid=13 (entry `0x822F1EE0`, spawned suspended at
1.671 s) wakes after the `sub_825070F0` worker fan-out at host_ns
≈ 10.383 s and begins emitting `VdGetSystemCommandBuffer` /
`VdSwap` pairs at ~150 fps. Canary's tid=13 emits **12,092
VdSwap calls in the 90-s window** (per
`phase-nonmatch-investigation/canary-tid-profiles.md:21`).

The first of these is the **first gameplay draw**, fired at ~10.7 s
wallclock, about 1.2 s after the `sub_825070F0` fan-out triggers
the worker cluster.
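
As a quick sanity check on the "~150 fps" figure, the quoted numbers can be run through a small calculation. This is only a sketch: the helper name `swap_rate` is invented here, and it assumes the 12,092 calls are spread over the part of the 90-s window that follows the ~10.7 s first gameplay draw.

```python
# Figures quoted in the text above.
TOTAL_SWAPS = 12_092          # canary tid=13 VdSwap calls
WINDOW_S = 90.0               # capture window, seconds
FIRST_GAMEPLAY_SWAP_S = 10.7  # approximate time of the first gameplay draw


def swap_rate(total_swaps: int, window_s: float, start_s: float = 0.0) -> float:
    """Average swaps per second over the part of the window after start_s."""
    active = window_s - start_s
    if active <= 0:
        raise ValueError("start_s must fall inside the window")
    return total_swaps / active


# Averaged over the whole capture: about 134 swaps per second.
whole_window = swap_rate(TOTAL_SWAPS, WINDOW_S)
# Averaged over the time the render loop could have been running:
# about 152 swaps per second, consistent with "~150 fps".
active_only = swap_rate(TOTAL_SWAPS, WINDOW_S, FIRST_GAMEPLAY_SWAP_S)

print(f"whole window: {whole_window:.1f}/s, active only: {active_only:.1f}/s")
```

The gap between the two averages is just the ~10.7 s of boot time during which tid=13 emits no swaps.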

**Pre-conditions canary establishes before this point** (per
`canary-boot-trajectory.md`):

1. Vtable `0x8200A1E8` of `ANON_Class_713383D7` installed at host_ns
   ≈ 9.4-9.6 s via POD-copy at GUEST PC `sub_824FD240+0x24`
   (per `project_audit_068_session4_2026_05_20`).
2. Activation chain `sub_822F1AA8 → sub_82173990 → sub_821746B0 →
   sub_82172BA0 → sub_821B55D8 → sub_824F8398 → sub_824F7CD0 →
   sub_824F7800 → bctrl vtable[1] = sub_825070F0` fires on tid=6.
3. `sub_825070F0` spawns 4 worker threads with entries
   `0x82506528/58/88/B8` and shared ctx `0xBCE251C0`.
4. Workers (canary tids 27/28/29) emit signals that unwedge the
   `sub_821CB030` Event waits across the cache-file IO completion
   chain.
5. Renderer tid=13's body (entered earlier but blocked on a
   tid=14/15 XAudio-coordinated event) unblocks; per-frame
   `VdGetSystemCommandBuffer` / `VdSwap` loop begins.

---

## Q2: What is ours's actual progress, and what's the wedge root cause?

**Ours stops at the first wait in the activation chain.** Specifically:

- **tid=1 (main)** is wedged at `sub_82173990+0x2D4` (PC `0x824ac578` =
  `do_wait_single`) on handle `0x12c8` = `Thread(id=13)`. It is waiting
  for the renderer's thread handle to signal, which happens only when
  tid=13 calls `ExTerminateThread`.
- **tid=13 (renderer / cache-IO worker)** is wedged at
  `sub_821CB030+0x1B0` on handle `0x12d0` = `Event/Auto`, which it
  created itself via `NtCreateEvent` at `sub_821CB030+0x128`. `signals=0,
  wakes=0`: `<NO_SIGNALS_DESPITE_WAITS>`.
- **`sub_825070F0` fires 0×** at every horizon probed.

Citation: `phase-w-wedge-reattack/halt-on-deadlock-dump.txt` +
`phase-w-wedge-reattack/current-state.md`.

### Root cause (at one structural level deeper than the wedge symptom)

**Per AUDIT-069 Session 5 (the most recent measurement):**

- Canary fires 414 `NtReleaseSemaphore` calls on the work-queue
  semaphore in the 90-s window.
- Ours fires 99 (24%).
- Breakdown: Worker (382 vs 90), Main (7 vs 8), **Other producers
  (25 vs 1)**.
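
The breakdown can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the list; the dictionary keys and the helper name `release_ratios` are illustrative and do not come from the audit tooling.

```python
# NtReleaseSemaphore counts on the work-queue semaphore (90-s window),
# copied from the breakdown above as (canary, ours) per producer class.
CANARY = {"worker": 382, "main": 7, "other": 25}
OURS = {"worker": 90, "main": 8, "other": 1}


def release_ratios(ours: dict, canary: dict) -> dict:
    """Return ours/canary as a fraction per producer class, plus the total."""
    ratios = {name: ours[name] / canary[name] for name in canary}
    ratios["total"] = sum(ours.values()) / sum(canary.values())
    return ratios


# The per-class counts add up to the quoted totals (414 and 99).
assert sum(CANARY.values()) == 414
assert sum(OURS.values()) == 99

ratios = release_ratios(OURS, CANARY)
for name, value in ratios.items():
    print(f"{name:>6}: {value:6.1%}")
```

Run as written, this shows the worker class tracking the overall 24% ratio, the main thread at rough parity, and the "other" class at 4%, which is why that row stands out.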

The "**other producers (25 vs 1)**" gap is the load-bearing
discrepancy. Canary's other producers issue **24 more releases** of
the work semaphore during bootstrap than ours's do. The missing
releases correspond to:

1. The 4 `sub_825070F0` workers (canary tids 27/28/29 + 1), absent
   in ours.
2. XAudio render threads (canary tids 14/15, spawned suspended in
   both engines, **resumed only in canary**).
3. The secondary spawn burst at 1.94-2.15 s (canary tids 18-25):
   8 helpers, including file-IO and NtWaitForMultipleObjectsEx workers,
   absent in ours.

### The ONE structural issue

> **Ours never reaches `sub_825070F0` because the activation chain
> that calls it is downstream of tid=13's wedge; and tid=13's wedge
> is downstream of the worker cluster activation; and the worker
> cluster activation is `sub_825070F0`. This is a self-referential
> lock.**

Canary breaks the lock because some part of the bootstrap
*pre-activates* the producers (probably via XAudio thread resume at
1.726 s, which then runs ahead, populates the work queue, signals
events, etc.). Ours never resumes the XAudio threads: they are
spawned suspended and stay that way.

**The single highest-leverage gap is the XAudio thread resume,**
because (a) it happens early (1.726 s in canary vs. ours's wedge,
which sets in around 1.4 s; i.e. the resume should happen before the
wedge), (b) it activates the dominant event producers, and (c) AUDIT-069
S5's "other producers 25 vs 1" finding implicates exactly this class
of thread.
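
The self-referential lock described above can be pictured as a cycle in a "waits on" graph. The following is a minimal sketch, not a model of the engine: the node labels paraphrase the text, each node is assumed to wait on a single prerequisite, and the canary variant encodes the hypothesis (early producer activation), not a measured fact.

```python
from typing import Dict, List, Optional


def find_cycle(waits_on: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow 'X waits on Y' edges from start; return the cycle if one is hit."""
    path: List[str] = []
    position: Dict[str, int] = {}
    node: Optional[str] = start
    while node is not None:
        if node in position:
            # We came back to a node already on the path: that is the cycle.
            return path[position[node]:]
        position[node] = len(path)
        path.append(node)
        node = waits_on.get(node)  # None when the node has no prerequisite
    return None


# Ours: the dependency chain closes on itself.
OURS_WAITS = {
    "sub_825070F0 call": "activation chain progress",
    "activation chain progress": "tid=13 unwedged",
    "tid=13 unwedged": "worker cluster signals",
    "worker cluster signals": "sub_825070F0 call",
}

# Canary (hypothesis): tid=13 is released by producers that are already
# running, so the chain terminates instead of looping.
CANARY_HYPOTHESIS = dict(OURS_WAITS)
CANARY_HYPOTHESIS["tid=13 unwedged"] = "early producer signals"

print(find_cycle(OURS_WAITS, "tid=13 unwedged"))
print(find_cycle(CANARY_HYPOTHESIS, "tid=13 unwedged"))
```

In this picture, any fix has to replace one edge of the cycle with a prerequisite that does not itself depend on `sub_825070F0`.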

---

## Q3: Shortest-path-to-first-draw roadmap

Five steps (full detail in `shortest-path-roadmap.md`):

- **Step 1 (~80-150 LOC, ours-side)**: add a `--force-spawn-workers`
  cvar that crowbars `sub_825070F0` activation by directly spawning
  the 4 worker threads with the right ctx after `VdInitializeRingBuffer`
  returns. Tests "are the workers functionally correct if activated"
  and "does activating them unwedge sub_821CB030."
- **Step 2 (~0 LOC)**: with Step 1 active, mine the canary jsonl for
  the kernel-call sequence on tid=6 in the wallclock window [9.4 s,
  9.6 s] (the install epoch). Identify what guest call triggers
  `sub_824FD240+0x24`'s POD-copy of the vtable in canary.
- **Step 3 (~10-500 LOC, depending on what Step 2 finds)**: mirror
  that trigger in ours. The likely candidates are a missing
  kernel-import return value or a missing post-condition that the
  trigger inspects.
- **Step 4 (~0 LOC; remove crowbar)**: re-test ours without
  `--force-spawn-workers`. Verify natural bootstrap reaches
  `sub_825070F0` activation.
- **Step 5 (~0-50 LOC)**: measure renderer-thread VdSwap rate over 90 s
  wallclock; target ±30% of canary's 12,092 calls.
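
Step 2's mining pass could look roughly like the sketch below. The trace schema is an assumption: it supposes one JSON object per line with `tid`, `host_ns` (nanoseconds), and `name` fields, which may not match the real canary jsonl, and the sample records are synthetic placeholders, not trace data.

```python
import json


def calls_in_window(lines, tid, start_s, end_s):
    """Return call names for one thread inside [start_s, end_s] seconds.

    Assumes each non-empty line is a JSON object with hypothetical
    fields 'tid', 'host_ns' (nanoseconds) and 'name'.
    """
    names = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        t_s = event["host_ns"] / 1e9
        if event["tid"] == tid and start_s <= t_s <= end_s:
            names.append(event["name"])
    return names


# Synthetic records for illustration only.
sample = [
    '{"tid": 6, "host_ns": 9390000000, "name": "call_before_window"}',
    '{"tid": 6, "host_ns": 9450000000, "name": "call_a"}',
    '{"tid": 13, "host_ns": 9500000000, "name": "other_thread_call"}',
    '{"tid": 6, "host_ns": 9580000000, "name": "call_b"}',
    '{"tid": 6, "host_ns": 9610000000, "name": "call_after_window"}',
]

print(calls_in_window(sample, tid=6, start_s=9.4, end_s=9.6))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of `sample` without loading the whole trace into memory.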

Expected delta:

| After step | `swaps` | `draws` | `unique_render_targets` |
|---|---:|---:|---:|
| Pre | 1 | 0 | 0 |
| Step 1 (crowbar) | 2+ | 1+ | 1+ |
| Step 4 (decrowbar) | 2+ | 1+ | 1+ |
| Step 5 (parity) | 100+ | 100+ | 1-5 |
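
The Step 5 target of ±30% of canary's 12,092 calls translates into a concrete acceptance band. A minimal sketch, with the helper names `tolerance_band` and `within_tolerance` invented for illustration:

```python
CANARY_SWAPS = 12_092  # canary tid=13 VdSwap calls in the 90-s window


def tolerance_band(target: float, tol: float = 0.30):
    """Return the (low, high) bounds of target ± tol."""
    return target * (1 - tol), target * (1 + tol)


def within_tolerance(measured: int, target: float = CANARY_SWAPS,
                     tol: float = 0.30) -> bool:
    """True when measured falls inside target ± tol (inclusive)."""
    low, high = tolerance_band(target, tol)
    return low <= measured <= high


low, high = tolerance_band(CANARY_SWAPS)
print(f"accept between {low:.1f} and {high:.1f} swaps")
```

With the quoted figure, the band runs from roughly 8,464 to 15,720 swaps, so the current `swaps=1` is far outside it and anything in the thousands would already be close.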

---

## Q4: What's NOT on the shortest path

Explicitly deferred (full rationale in `shortest-path-roadmap.md`):

- **Audio (host-audio-* / XAudio implementation)**: even though
  XAudio thread resume MAY be the trigger from Q2, ours's existing
  XAudio shim is sufficient for the workers to bootstrap if they
  receive the right kernel-call sequence. Full XAudio
  implementation is beyond first-draw scope.
- **HID**: Sylpheed's intro/title screens are auto-advance; no
  input needed.
- **XAM content / save games**: not on first-draw path.
- **Scheduler determinism work** (Phase D Stages 0-4 and beyond):
  null result; the wedge is upstream of contention scheduling.
  Close or indefinitely defer.
- **Diff-tool canonicalization** (Phase C+N for N > 25): saturated
  on matched-prefix without progression; halt this work class until
  Step 4 lands and the workload re-baselines.
- **AUDIT-068 host-side install probes**: superseded by AUDIT-068
  Session 4 finding (writer is GUEST PC, not host). The followup
  question is what *triggers* the guest code path, which Step 2
  addresses through cheaper means.

---

## Q5: Methodology assessment

**Current methodology relied on matched-prefix as a progression
proxy. This assumption is now empirically falsified**: +2,960
events of matched-prefix advancement produced 0 units of progression
(`swaps=1, draws=0` across 25+ iterates).

### Proposed alternative metric

**Option 6 (composite `progression_score`)**:

```
progression_score = 1 * swaps + 10 * draws + 100 * unique_render_targets
                  + 0.001 * matched_prefix
```

The score gives a continuous gradient and is honest about the
priority of wedge-solving over canonicalization. Adding it to
`digest.json` requires ~10 LOC.
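
To make the weighting concrete, here is a small sketch of the formula as a function. The function name and argument order are illustrative; the example inputs are the +2,960-event matched-prefix advancement quoted above and the lower bounds of the "Step 1 (crowbar)" row from Q3.

```python
def progression_score(swaps: int, draws: int, unique_render_targets: int,
                      matched_prefix: int) -> float:
    """Composite score from the formula above."""
    return (1 * swaps
            + 10 * draws
            + 100 * unique_render_targets
            + 0.001 * matched_prefix)


baseline = progression_score(swaps=1, draws=0, unique_render_targets=0,
                             matched_prefix=0)

# Gain from 2,960 extra matched-prefix events with no other change.
prefix_gain = progression_score(1, 0, 0, 2_960) - baseline

# Gain from reaching the Step 1 lower bounds (swaps=2, draws=1, 1 target).
crowbar_gain = progression_score(2, 1, 1, 0) - baseline

print(f"prefix-only gain: {prefix_gain:.2f}, crowbar gain: {crowbar_gain:.0f}")
```

Under these weights the entire matched-prefix advancement is worth about 3 points, while a single successful crowbar iterate is worth 111, which is the gradient the metric is meant to express.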

Discipline: tag every iterate as either
"**canonicalization only, no progression**" or
"**progression**". Cap at 5 consecutive canonicalization-only
iterates before mandatory pivot to wedge-attack work.
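
The cap could be enforced mechanically from the iterate history. The sketch below is one reading of the rule, namely that a pivot becomes mandatory once the trailing run of canonicalization-only iterates reaches the cap; the tag constants are shortened stand-ins for the two labels above.

```python
CANON = "canonicalization-only"
PROGRESSION = "progression"


def trailing_canon_run(tags) -> int:
    """Count consecutive canonicalization-only tags at the end of the history."""
    run = 0
    for tag in reversed(tags):
        if tag != CANON:
            break
        run += 1
    return run


def must_pivot(tags, cap: int = 5) -> bool:
    """True when the next iterate has to be wedge-attack work."""
    return trailing_canon_run(tags) >= cap


history = [PROGRESSION] + [CANON] * 4
print(must_pivot(history))            # four in a row: still allowed
print(must_pivot(history + [CANON]))  # five in a row: pivot required
```

A single iterate tagged as progression resets the run, so the check only ever looks at the tail of the history.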

### New reading-error #39

> **#39 (matched-prefix as progression proxy)**: matched-prefix
> measures engine-to-engine divergence point, NOT game-to-game
> functional gap. When the wedge is on a different thread than the
> matched-prefix anchor thread, advancing matched-prefix is
> orthogonal to unwedging. Future audits MUST distinguish "ours's
> tid-X diverges from canary's tid-Y" from "ours's tid-X is *blocked
> because tid-Z is wedged*", and target the wedge directly when
> present.

---

## Counterintuitive findings (anti-anchoring)

Per Tripstones in the task brief:

### 1. Both engines reach `swaps=1`; ours is NOT behind on the boot swap.

The shared boot-init `VdSwap` fires in both. Ours's `swaps=1` metric
is "achieved, just at the same point canary also did it". The
divergence is NOT "ours can't do the first swap"; it's "ours can't do
the SECOND through Nth swap (the gameplay loop)".

### 2. Tripstone 4 verified: canary does reach gameplay draws, ours does not.

`canary-jitter-1.jsonl` shows 12,092 VdSwap calls on canary tid=13 in
90 s wallclock, which places it definitively in the gameplay
rendering loop, not pre-first-draw. Ours's analogue of canary tid=13
emits ~80 events total before wedging, definitively before gameplay
starts. The "both engines pre-first-draw" hypothesis is FALSE.

### 3. The matched-prefix metric is on the WRONG thread.

Matched-prefix tracks tid=6 (canary) vs tid=1 (ours), the main
threads. But the wedge sits on the renderer thread, **tid=13 in both
engines**. Tid=1's matched-prefix can advance 105,128 events
without ever touching the wedge.
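
A toy version of the metric shows why this happens. The sketch assumes matched-prefix is the length of the common prefix of two per-thread event streams; the event labels are synthetic placeholders and `matched_prefix_len` is a hypothetical helper, not the diff tool's code.

```python
def matched_prefix_len(ours, canary) -> int:
    """Length of the longest common prefix of two event sequences."""
    count = 0
    for a, b in zip(ours, canary):
        if a != b:
            break
        count += 1
    return count


# Synthetic main-thread streams (ours tid=1 vs canary tid=6).
main_ours = ["init", "spawn_renderer", "wait_renderer"]
main_canary = ["init", "spawn_renderer", "wait_renderer", "activate_workers"]

# Synthetic renderer streams (tid=13 in both engines).
renderer_ours = ["create_event", "wait_event"]               # wedged here
renderer_canary = ["create_event", "wait_event", "swap", "swap"]

# The metric only ever sees the main-thread pair.
print(matched_prefix_len(main_ours, main_canary))
# The renderer pair, where the wedge lives, is a separate comparison.
print(matched_prefix_len(renderer_ours, renderer_canary))
```

The first number can keep growing through canonicalization work on the main-thread streams while the renderer stream stays stuck at its wait, because the renderer stream is never an input to the first comparison.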

### 4. The "boot-state-machine" framing is misleading.

There's no monolithic boot state machine. There are ~28 threads in
canary, each running its own lifecycle, communicating via shared
kernel objects. The bottleneck isn't a state transition; it's a
THREAD ACTIVATION GAP.

### 5. AUDIT-069 Session 5's "other producers 25 vs 1" is the key forensic discovery, more than AUDIT-068's vtable install epoch.

The vtable install IS interesting but it's downstream of the producer
gap. Producers must be running to populate the work queue, which
lets the workers run, which signals the wedged event, which lets
the activation chain continue, which calls `sub_824FD240+0x24`,
which writes the vtable. Fixing the vtable install in isolation
(e.g., via a host-side mem-write hack) doesn't help if no producer
is feeding work to the workers.

---

## Cascade prediction confidence

- A: canary boot trajectory characterized: **DONE, HIGH** (canary-jitter-1.jsonl provides direct evidence).
- B: ours's wedge root-cause localized deeper than "sub_821CB030 waits": **DONE, MEDIUM-HIGH** (AUDIT-069 S5 "other producers 25 vs 1" finding).
- C: shortest-path roadmap with ≤5 steps: **DONE, MEDIUM** (5 steps; Step 1 confidence ~60%).
- D: alternative metric proposed: **DONE, HIGH** (Option 6 composite, plus reading-error #39).

---

## Open questions / known unknowns

1. **What is the bootstrap trigger for canary's `sub_824FD240+0x24`?**
   Roadmap Step 2 addresses. Could be answered in <1 session of
   canary jsonl analysis.
2. **Does Step 1's crowbar produce a clean wedge-unblock, or does it
   reveal additional unmodelled state in the ctx object?** Empirical;
   testable in one session.
3. **Are canary's XAudio threads (tids 14/15) the actual missing
   producer, or are they downstream of the same trigger?** Worth a
   targeted probe before Step 1; ~50 LOC ours-side to log
   NtResumeThread on the XAudio entry PCs.
4. **Will the AUDIT-067 "vtable install is host-side" finding
   resurface?** No. AUDIT-068 S4 falsified this; the writer is
   GUEST PC `sub_824FD240+0x24`. The "host-side" framing was a
   mis-read of the POD-copy semantics (reading-error #36).
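
For open question 3, once thread-create and resume events are logged, the check for "spawned suspended and never resumed" is a simple set difference. The sketch below assumes a hypothetical event record with `kind`, `tid`, and `suspended` fields; the sample records are synthetic and reuse canary's tid numbering purely for illustration.

```python
def never_resumed(events):
    """Return sorted tids that were created suspended and never resumed.

    Each event is assumed to be a dict with a 'kind' of 'create' or
    'resume', a 'tid', and (for creates) an optional 'suspended' flag.
    """
    pending = set()
    for event in events:
        if event["kind"] == "create" and event.get("suspended"):
            pending.add(event["tid"])
        elif event["kind"] == "resume":
            pending.discard(event["tid"])
    return sorted(pending)


# Synthetic logs: every suspended thread is resumed in one engine,
# while the XAudio-like threads stay suspended in the other.
canary_like = [
    {"kind": "create", "tid": 13, "suspended": True},
    {"kind": "create", "tid": 14, "suspended": True},
    {"kind": "create", "tid": 15, "suspended": True},
    {"kind": "resume", "tid": 14},
    {"kind": "resume", "tid": 15},
    {"kind": "resume", "tid": 13},
]
ours_like = [
    {"kind": "create", "tid": 13, "suspended": True},
    {"kind": "create", "tid": 14, "suspended": True},
    {"kind": "create", "tid": 15, "suspended": True},
    {"kind": "resume", "tid": 13},
]

print(never_resumed(canary_like))
print(never_resumed(ours_like))
```

If the real probe output for ours lists the XAudio tids here while canary's list is empty, that supports treating the resume as the missing trigger; if both lists are empty, the gap lies elsewhere.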

---

## Recommended next action

**Dispatch a "progression iterate" implementing Step 1 of the
roadmap** (`--force-spawn-workers` crowbar, ~80-150 LOC ours-side).
This is a high-variance, high-reward iterate; expected outcome is
either `swaps ≥ 2, draws ≥ 1` (success: wedge structurally
isolated to thread activation) or an informative failure mode (e.g.,
worker faults at first vtable bctrl indicating additional state
needed in ctx object). Time-box: 1 session, max 2h.

If Step 1 succeeds in ANY way (even if draws stays 0), the next
iterate is Step 2 (kernel-call sequence mining in canary-jitter-1.jsonl).
This step has minimal risk and uses existing tooling.

If Step 1 fails completely (panic / segfault unrecoverable), revert
the crowbar and reframe: the wedge may be in ours's kernel-handler
implementations themselves, not just bootstrap activation. At that
point a deeper Path β engine investigation is unavoidable.

---

## Memory hygiene note

This review is read-only. xenia-rs HEAD unchanged. canary HEAD
unchanged. sylpheed.db unchanged. No new artifacts beyond this
directory.

After dispatching Step 1, future memory entries should adopt the
new `progression_score` + tagging discipline outlined in
`methodology-assessment.md`.