xenia-rs/audit-runs/review-a-boot-state/methodology-assessment.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Methodology assessment

The matched-prefix metric: load-bearing or load-shedding?

Across 25+ iterates (audits 049 through 069; Phase C+1 through C+25; Phase D Stages 0-4 plus D-extension; Phase W; Phase host-audio-*), matched-prefix on the main thread (canary tid=6 ⇄ ours tid=1) advanced:

| Phase | Matched-prefix | Δ |
|---|---|---|
| Phase B baseline (pre-C+1) | ~102,168 | |
| Phase D D-extension landing | 104,607 → 105,046 | +439 |
| Phase W (VdInitializeEngines fix) | 105,046 → 105,112 | +66 |
| Phase C+25 (MmGetPhysicalAddress canon) | 105,112 → 105,128 | +16 |

| Phase | swaps | draws | unique_render_targets |
|---|---|---|---|
| Phase B baseline | 1 | 0 | 0 |
| Phase W | 1 | 0 | 0 |
| Phase C+25 | 1 | 0 | 0 |

The two metrics are decoupled. Matched-prefix is moving along ENGINE-internal divergences (kernel-call return values, thread IDs, heap arena base addresses). The progression metric is gated by boot-state activation, which lives one or more layers above the diff points.

Why the decoupling happened

Three reading-errors compound:

  1. #23 (cooperative-vs-preemptive scheduling jitter): canary's default-scheduling produces different intra-thread event ordering than ours's coroutine scheduler. Diff-tool absorbers (C+18, C+21, D-extension) correctly hide this jitter — but they hide real bootstrap-time divergences too. Phase W explicitly noted: "If ours's worker fails to enqueue something canary's worker awaits, we'd never see the gap because the matched-prefix isn't on the worker tid in the first place."
  2. #30 (per-tid PC SID drift): shared-global SIDs work for process-global dispatchers (e.g., the work-queue semaphore at handle 0xF800003C in canary). But the wedge handle 0x12d0 uses a per-tid create-site SID that does NOT match across engines. So even when the same logical event exists in both engines, the diff harness sees a SID mismatch and either absorbs the event or reports a spurious divergence.
  3. #38 (cross-spawn producer paths): static reachability (the sylpheed.db xrefs table) misses producer paths that cross thread-spawn boundaries. The result.md from Phase Non-match shows canary's tid=14 (XAudio voice-mask poll) communicates with downstream code via a path that has no static bl edge — it crosses via guest kernel APIs.

Alternative metric proposals

Option 1 — draws ≥ 1 (sharp gate)

Pros: directly measures the target. Boolean. Reproducible. Cons: gives no signal during iteration — every iterate before the breakthrough is draws = 0. Loss function is non-smooth.

Option 2 — swaps ≥ 2 (relaxed first-frame gate)

Pros: still sharp; one bit looser than draws. Distinguishes boot-init-only swap (swaps=1) from at-least-one-rendered-frame (swaps≥2). Cons: same non-smooth loss. Achievable in principle by a crowbar without solving the underlying bug.

Option 3 — Renderer-thread liveness: events_emitted_by_renderer_thread ≥ N

Compute: events emitted on the thread spawned at entry 0x822F1EE0 in any 90-s wallclock window. Canary: 594,000. Ours: ~0.

Pros: smooth-ish (event count can move slowly). Directly measures "is the renderer running." Bypasses the diff-tool jitter problem because it's a per-engine internal count. Cons: requires a non-trivial 90-s wallclock run (not 50M instr ceiling). Could be gamed by a crowbar that resumes the renderer without unblocking the wedge.
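A minimal sketch of how the Option 3 count could be computed, assuming the renderer thread's events have already been extracted as a sorted list of wallclock timestamps in seconds. That input shape is an assumption for illustration, not the harness's actual trace schema, and `max_events_in_window` is a hypothetical helper name. It returns the largest event count seen in any 90-s window.

```python
def max_events_in_window(timestamps, window_s=90.0):
    """Largest number of events inside any window shorter than window_s.

    `timestamps` must be sorted ascending (wallclock seconds).
    The input format is an assumption, not the real trace schema.
    """
    best = 0
    left = 0
    for right, t in enumerate(timestamps):
        # Advance the left edge until the span is strictly below window_s.
        while t - timestamps[left] >= window_s:
            left += 1
        best = max(best, right - left + 1)
    return best
```

Because the count is taken per engine, it can be compared against N without going through the diff tool at all.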

Option 4 — Worker-thread census: count(threads_with_events ≥ 10k) ≥ 6

Compute: how many tids in ours emit ≥10k events over 90 s wallclock. Canary at 90 s: the 12 tids 1/2/4/6/9/10/11/12/13/14/15/16 meet this, as do the post-10s workers 21/27/28/29. Ours at 50M instr: 5 tids.

Pros: directly measures the AUDIT-057 thread-gap. Smooth metric: each unwedged thread adds 1 to the count. Cons: requires 90-s wallclock runs; ours can't reach this without solving the wedge first, so it has the same prerequisite as Option 3.
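The census in Option 4 could be sketched as follows, assuming the trace has been reduced to one tid per emitted event. The function names and the input shape are hypothetical; only the 10k-event threshold and the six-thread gate come from the option's definition.

```python
from collections import Counter


def thread_census(event_tids, min_events=10_000):
    """Return the sorted tids that emitted at least `min_events` events.

    `event_tids` is an iterable with one tid per event (assumed input shape).
    """
    counts = Counter(event_tids)
    return sorted(tid for tid, n in counts.items() if n >= min_events)


def census_gate(event_tids, min_events=10_000, min_threads=6):
    """True when enough threads are live to satisfy the Option 4 gate."""
    return len(thread_census(event_tids, min_events)) >= min_threads
```

Returning the tid list rather than only the count makes it easy to see which threads are still missing relative to canary.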

Option 5 — worker_semaphore_release_count (AUDIT-069 S5)

Compute: how many NtReleaseSemaphore calls on the work semaphore (handle 0xF800003C in canary, equivalent in ours) over 90 s wallclock. Canary: 414. Ours: 99 (24%).

Pros: pinpoints the under-production directly. Mechanically measurable. Already instrumented in canary (audit_70_semaphore_release_watch). Cons: same wallclock requirement; same gameability.
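As a rough illustration of the Option 5 measurement, the sketch below counts NtReleaseSemaphore events for one handle in a JSONL trace and expresses ours as a fraction of canary. The field names `export` and `handle` are assumptions made for this example; the real trace records may be keyed differently.

```python
import json


def count_semaphore_releases(jsonl_lines, handle):
    """Count NtReleaseSemaphore events that target `handle`.

    Each line is assumed to be a JSON object with hypothetical
    "export" and "handle" fields; blank lines are skipped.
    """
    n = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("export") == "NtReleaseSemaphore" and event.get("handle") == handle:
            n += 1
    return n


def release_ratio(ours, canary):
    """Ours as a fraction of canary; 0.0 when canary has no releases."""
    return ours / canary if canary else 0.0
```

With the counts quoted above, `release_ratio(99, 414)` is about 0.239, which rounds to the 24% figure. The handle is passed per engine because the work semaphore's handle value in ours need not equal canary's 0xF800003C.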

Option 6 — composite: progression_score

Define:

progression_score = 1 * swaps + 10 * draws + 100 * unique_render_targets
                  + 0.001 * matched_prefix

This recovers signal during iteration (matched-prefix moves) without pretending it's progression. The 1000:1 ratio between the swaps weight (1) and the matched_prefix weight (0.001) matches the bug-class severity.

Pros: continuous gradient over both wedge-solving and canonicalization work. Honest about which is more important. Cons: arbitrary weights. Composite metrics drift in meaning.
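The composite can be written directly from the definition above. This is a sketch: the function name mirrors the proposed field, and the argument names are assumed to correspond to existing digest fields.

```python
def progression_score(swaps, draws, unique_render_targets, matched_prefix):
    """Composite score using the weights from the Option 6 definition."""
    return (1 * swaps
            + 10 * draws
            + 100 * unique_render_targets
            + 0.001 * matched_prefix)


# Worked example using the figures from the tables earlier in this review.
baseline = progression_score(1, 0, 0, 102_168)   # Phase B baseline (approximate)
latest = progression_score(1, 0, 0, 105_128)     # Phase C+25
delta = latest - baseline                        # 0.001 * 2,960, about 2.96
```

Under these weights the entire matched-prefix advance from Phase B to Phase C+25 moves the score by about 2.96, which is less than the 10 points a single draw would contribute.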

Recommendation

Adopt Option 6 (composite progression_score) as the primary methodology metric, with Option 2 (swaps ≥ 2) as a hard secondary gate: swaps ≥ 2 is what matters; everything else is fitness.

Concrete proposal:

  1. The digest.json output gains a progression_score field computed from the existing fields (zero new instrumentation).
  2. Every iterate must report Δprogression_score in its re-validation.md.
  3. Iterates that only move matched_prefix (i.e., Δprogression_score = 0.001 × Δmatched_prefix) MUST be tagged in their memory entry as "canonicalization only — no progression" and counted against a budget: max 5 consecutive iterates in this class before mandatory pivot to wedge-attack work.
  4. Audits that move swaps or draws (the high-weight terms) are tagged "progression" and given priority for resource allocation.

This methodology change costs ~10 LOC in the digest output and imposes a discipline cap of 5 canonicalization-only audits between progression attempts.
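The tagging rule and the five-iterate budget from the proposal could be enforced mechanically along these lines. This is a sketch under assumptions: the tag strings are shortened placeholders, the function names are hypothetical, and treating unique_render_targets as a high-weight term alongside swaps and draws is an assumption not spelled out in the proposal.

```python
PROGRESSION = "progression"
CANON_ONLY = "canonicalization only"  # shortened placeholder tag
MAX_CANON_ONLY_STREAK = 5             # budget from the proposal


def classify_iterate(d_swaps, d_draws, d_unique_render_targets=0):
    """Tag an iterate by whether any high-weight term increased."""
    if d_swaps > 0 or d_draws > 0 or d_unique_render_targets > 0:
        return PROGRESSION
    return CANON_ONLY


def pivot_required(tags, budget=MAX_CANON_ONLY_STREAK):
    """True once the trailing run of canonicalization-only tags hits the budget."""
    streak = 0
    for tag in reversed(tags):
        if tag == PROGRESSION:
            break
        streak += 1
    return streak >= budget
```

Only the trailing streak is counted, so a single progression iterate resets the budget.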

Falsification of the matched-prefix-as-proxy belief

Phase C through C+25 explicitly assumed that matched-prefix is a proxy for progression. This assumption is now empirically falsified:

+2,960 events of matched-prefix advancement produced exactly ZERO units of progression.

Reading-error #39 (newly registered by this review):

#39 (matched-prefix as progression proxy): matched-prefix measures engine-to-engine divergence point, not game-to-game functional gap. When the wedge is on a different thread than the matched-prefix anchor thread, advancing matched-prefix is orthogonal to unwedging. Future audits MUST distinguish "ours's tid-X main thread diverges from canary's tid-Y" from "ours's tid-X main thread is blocked because tid-Z is wedged", and target the wedge directly when present.

What "progression discipline" looks like in practice

For the next 3 iterates:

  • Iterate N+1: Step 1 of shortest-path-roadmap (crowbar). No diff-tool work. Target: swaps ≥ 2.
  • Iterate N+2: Step 2 of roadmap (trigger ID via canary jsonl analysis). No engine LOC. Target: identification of the missing kernel call(s).
  • Iterate N+3: Step 3 of roadmap (mirror the trigger). Target: ours unblocks without the crowbar.

Each iterate must produce a progression_score delta report. If 3 iterates in a row produce Δprogression_score ≤ ε (where ε = 0.001 × 500 = 0.5, the score contribution of 500 matched-prefix events), the methodology should be re-reviewed before continuing. That outcome would mean even the crowbar approach failed and a deeper rethink is needed.
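The stall condition can be checked with a few lines, assuming the per-iterate Δprogression_score values are kept in chronological order. The helper name and the list representation are hypothetical.

```python
EPSILON = 0.001 * 500  # 0.5: the score contribution of 500 matched-prefix events


def needs_re_review(deltas, window=3, epsilon=EPSILON):
    """True when the last `window` deltas are all at or below epsilon.

    `deltas` holds one progression_score delta per iterate, oldest first.
    """
    if len(deltas) < window:
        return False
    return all(d <= epsilon for d in deltas[-window:])
```

For example, `needs_re_review([0.016, 0.066, 0.439])` is true, while any iterate that adds a swap (delta of at least 1) within the last three clears the condition.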

Closing note

The user's instinct in calling this strategic pause and review was correct. The matched-prefix-only chain was producing real canonicalization work but had ceased producing progression. The roadmap above is one principled attempt at breaking the cycle; if it fails, the next-level fallback is to formally accept Sylpheed's boot-state as currently unreachable in ours and pivot to a different title for the methodology demonstration.