Source changes (dormant parity infra, retained from iterate 2.AI/2.AO): - xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring - xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts — see memory + HANDOFF for the running findings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Methodology assessment
The matched-prefix metric: load-bearing or load-shedding?
Across 25+ iterates (audits 049 through 069; Phase C+1 through C+25; Phase D Stages 0-4 plus D-extension; Phase W; Phase host-audio-*), matched-prefix on the main thread (canary tid=6 ⇄ ours tid=1) advanced:
| Phase | Matched-prefix | Δ |
|---|---|---|
| Phase B baseline (pre-C+1) | ~102,168 | — |
| Phase D D-extension landing | 104,607 → 105,046 | +439 |
| Phase W (VdInitializeEngines fix) | 105,046 → 105,112 | +66 |
| Phase C+25 (MmGetPhysicalAddress canon) | 105,112 → 105,128 | +16 |

The progression metrics over the same phases did not move:

| Phase | swaps | draws | unique_render_targets |
|---|---|---|---|
| Phase B baseline | 1 | 0 | 0 |
| Phase W | 1 | 0 | 0 |
| Phase C+25 | 1 | 0 | 0 |
The two metrics are decoupled. Matched-prefix is moving along ENGINE-internal divergences (kernel-call return values, thread IDs, heap arena base addresses). The progression metric is gated by boot-state activation, which lives one or more layers above the diff points.
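To make the decoupling concrete, here is a minimal sketch of what a matched-prefix count measures, assuming each engine's main-thread trace has already been reduced to a list of comparable event keys. The function name, the event names, and the list representation are illustrative assumptions; the real diff harness applies absorbers and SID canonicalization that this sketch leaves out.

```python
from itertools import takewhile

def matched_prefix(canary_events, ours_events):
    """Count the leading positions at which both event streams agree."""
    return sum(
        1
        for _ in takewhile(
            lambda pair: pair[0] == pair[1],
            zip(canary_events, ours_events),
        )
    )

# Hypothetical, already-canonicalized event keys for one thread per engine.
canary = ["NtCreateEvent", "KeWaitForSingleObject", "NtReleaseSemaphore", "VdSwap"]
ours = ["NtCreateEvent", "KeWaitForSingleObject", "NtSetEvent", "VdSwap"]

print(matched_prefix(canary, ours))  # 2
```

The count stops at the first disagreement on the anchor thread and carries no information about other threads, which is why it can advance while swaps and draws stay flat.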
Why the decoupling happened
Three reading-errors compound:
- #23 (cooperative-vs-preemptive scheduling jitter): canary's default-scheduling produces different intra-thread event ordering than ours's coroutine scheduler. Diff-tool absorbers (C+18, C+21, D-extension) correctly hide this jitter — but they hide real bootstrap-time divergences too. Phase W explicitly noted: "If ours's worker fails to enqueue something canary's worker awaits, we'd never see the gap because the matched-prefix isn't on the worker tid in the first place."
- #30 (per-tid PC SID drift): shared-global SIDs work for process-global dispatchers (e.g., the work-queue semaphore at handle `0xF800003C` in canary). But the wedge handle `0x12d0` uses a per-tid create-site SID that does NOT match across engines. So even when the same logical event exists in both engines, the diff harness reports SID mismatch and absorbs OR diverges incorrectly.
- #38 (cross-spawn producer paths): static reachability (the `sylpheed.db` `xrefs` table) misses producer paths that cross thread-spawn boundaries. The result.md from Phase Non-match shows canary's tid=14 (XAudio voice-mask poll) communicates with downstream code via a path that has no static `bl` edge; it crosses via guest kernel APIs.
Alternative metric proposals
Option 1 — draws ≥ 1 (sharp gate)
Pros: directly measures the target. Boolean. Reproducible.
Cons: gives no signal during iteration — every iterate before the
breakthrough is draws = 0. Loss function is non-smooth.
Option 2 — swaps ≥ 2 (relaxed first-frame gate)
Pros: still sharp; one bit looser than draws. Distinguishes
boot-init-only swap (swaps=1) from at-least-one-rendered-frame
(swaps≥2).
Cons: same non-smooth loss. Achievable in principle by a crowbar
without solving the underlying bug.
Option 3 — Renderer-thread liveness: events_emitted_by_renderer_thread ≥ N
Compute: events emitted on the thread spawned at entry 0x822F1EE0
in any 90-s wallclock window. Canary: 594,000. Ours: ~0.
Pros: smooth-ish (event count can move slowly). Directly measures "is the renderer running." Bypasses the diff-tool jitter problem because it's a per-engine internal count.
Cons: requires a non-trivial 90-s wallclock run (not the 50M-instruction ceiling). Could be gamed by a crowbar that resumes the renderer without unblocking the wedge.
Option 4 — Worker-thread census: count(threads_with_events ≥ 10k) ≥ 6
Compute: how many tids in ours emit ≥10k events over 90 s wallclock. Canary at 90 s: 12 tids meet this (tids 1/2/4/6/9/10/11/12/13/14/15/16 plus the post-10s workers 21/27/28/29). Ours at 50M instr: 5 tids.
Pros: directly measures the AUDIT-057 thread-gap. Smooth metric: each unwedged thread adds 1 to the count.
Cons: requires 90-s wallclock runs; ours can't reach this without solving the wedge first, so it's prerequisite-equivalent to Option 3.
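The census in Option 4 can be sketched as below, assuming a trace is available as `(tid, event_name)` pairs. That tuple layout and the function name are assumptions for illustration; the actual trace schema is not described here.

```python
from collections import Counter

def thread_census(events, min_events=10_000):
    """Return the sorted tids that emitted at least min_events events."""
    per_tid = Counter(tid for tid, _name in events)
    return sorted(tid for tid, n in per_tid.items() if n >= min_events)

# Synthetic trace: tids 1 and 2 clear the 10k threshold, tid 3 falls one short.
events = [(1, "e")] * 12_000 + [(2, "e")] * 10_000 + [(3, "e")] * 9_999

live = thread_census(events)
print(live, len(live) >= 6)  # [1, 2] False
```

Because the result is a list of tids rather than a bare count, the same output also shows which threads are still missing, not only how many.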
Option 5 — worker_semaphore_release_count (AUDIT-069 S5)
Compute: how many NtReleaseSemaphore calls on the work semaphore
(handle 0xF800003C in canary, equivalent in ours) over 90 s
wallclock. Canary: 414. Ours: 99 (24%).
Pros: pinpoints the under-production directly. Mechanically measurable. Already instrumented in canary (audit_70_semaphore_release_watch).
Cons: same wallclock requirement; same gameability.
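A hedged sketch of the Option 5 count, assuming kernel-call records are dicts with `call` and `handle` fields. Those field names, the helper name, and the second handle value are hypothetical; only `0xF800003C` and the counts 99 and 414 come from the text above.

```python
WORK_SEMAPHORE = 0xF800003C  # canary's handle per the text; ours's equivalent is not given
OTHER_HANDLE = 0x1234        # hypothetical unrelated handle, for illustration only

def count_releases(records, handle):
    """Count NtReleaseSemaphore calls that target the given handle."""
    return sum(
        1
        for r in records
        if r["call"] == "NtReleaseSemaphore" and r["handle"] == handle
    )

# Synthetic records standing in for a 90 s trace.
records = [
    {"call": "NtReleaseSemaphore", "handle": WORK_SEMAPHORE},
    {"call": "NtReleaseSemaphore", "handle": OTHER_HANDLE},
    {"call": "NtSetEvent", "handle": WORK_SEMAPHORE},
    {"call": "NtReleaseSemaphore", "handle": WORK_SEMAPHORE},
]
print(count_releases(records, WORK_SEMAPHORE))  # 2

# Ratio of the reported counts: ours 99 vs canary 414.
print(f"{99 / 414:.1%}")  # 23.9%
```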
Option 6 — composite: progression_score
Define:
progression_score = 1 * swaps + 10 * draws + 100 * unique_render_targets
+ 0.001 * matched_prefix
This recovers signal during iteration (matched-prefix moves) without pretending it's progression. The 1000:1 weight ratio between swaps and matched_prefix matches the bug-class severity.
Pros: continuous gradient over both wedge-solving and canonicalization work. Honest about which is more important.
Cons: arbitrary weights. Composite metrics drift in meaning.
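As a worked example, the composite can be evaluated on the table values reported earlier (Phase W and Phase C+25). The function signature is an assumption; the weights are the ones defined above.

```python
def progression_score(swaps, draws, unique_render_targets, matched_prefix):
    """Composite score using the weights from the Option 6 definition."""
    return (
        1 * swaps
        + 10 * draws
        + 100 * unique_render_targets
        + 0.001 * matched_prefix
    )

phase_w = progression_score(1, 0, 0, 105_112)
phase_c25 = progression_score(1, 0, 0, 105_128)

print(round(phase_w, 3), round(phase_c25, 3), round(phase_c25 - phase_w, 3))
# 106.112 106.128 0.016
```

The +16 matched-prefix events of Phase C+25 move the score by 0.016, while a single draw would add 10, the equivalent of 10,000 matched-prefix events.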
Recommendation
Adopt Option 6 (composite progression_score) as the primary
methodology metric, with a hard secondary gate of "Option 2
(swaps ≥ 2) is what matters; everything else is fitness."
Concrete proposal:
- The `digest.json` output gains a `progression_score` field computed from the existing fields (zero new instrumentation).
- Every iterate must report Δprogression_score in its re-validation.md.
- Iterates that only move `matched_prefix` (i.e., Δprogression_score = (small) × Δmatched_prefix) MUST be tagged in their memory entry as "canonicalization only — no progression" and counted against a budget: max 5 consecutive iterates in this class before mandatory pivot to wedge-attack work.
- Audits that move `swaps` or `draws` (the high-weight terms) are tagged "progression" and given priority for resource allocation.
This methodology change costs ~10 LOC in the digest output and imposes a discipline cap of 5 canonicalization-only audits between progression attempts.
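The tagging rule and the five-iterate budget could be enforced mechanically along these lines. This is a sketch under assumptions: the function names and the shortened tag strings are hypothetical, and reading "max 5 consecutive" as "pivot once the trailing streak reaches 5" is one interpretation of the proposal.

```python
MAX_CANONICALIZATION_STREAK = 5  # budget stated in the proposal

def classify(delta_swaps, delta_draws):
    """Tag an iterate by whether it moved a high-weight term."""
    if delta_swaps > 0 or delta_draws > 0:
        return "progression"
    return "canonicalization only"

def must_pivot(tags, cap=MAX_CANONICALIZATION_STREAK):
    """True once the trailing run of canonicalization-only iterates reaches cap."""
    streak = 0
    for tag in reversed(tags):
        if tag != "canonicalization only":
            break
        streak += 1
    return streak >= cap

history = [classify(0, 0) for _ in range(4)]
print(must_pivot(history))  # False

history.append(classify(0, 0))
print(must_pivot(history))  # True

history.append(classify(1, 0))  # an iterate that moved swaps resets the streak
print(must_pivot(history))  # False
```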
Falsification of the matched-prefix-as-proxy belief
Phase C through C+25 explicitly assumed that matched-prefix is a proxy for progression. This assumption is now empirically falsified:
+2,960 events of matched-prefix advancement produced exactly ZERO units of progression.
Reading-error #39 (newly registered by this review):
#39 (matched-prefix as progression proxy): matched-prefix measures engine-to-engine divergence point, not game-to-game functional gap. When the wedge is on a different thread than the matched-prefix anchor thread, advancing matched-prefix is orthogonal to unwedging. Future audits MUST distinguish "ours's tid-X main thread diverges from canary's tid-Y" from "ours's tid-X main thread is blocked because tid-Z is wedged", and target the wedge directly when present.
What "progression discipline" looks like in practice
For the next 3 iterates:
- Iterate N+1: Step 1 of shortest-path-roadmap (crowbar). No diff-tool work. Target: `swaps ≥ 2`.
- Iterate N+2: Step 2 of roadmap (trigger ID via canary jsonl analysis). No engine LOC. Target: identification of the missing kernel call(s).
- Iterate N+3: Step 3 of roadmap (mirror the trigger). Target: ours unblocks without the crowbar.
Each iterate must produce a progression_score delta report. If
3 iterates in a row produce Δprogression_score ≤ ε (where
ε = 0.001 × 500 = 0.5, i.e. the score contribution of 500
matched-prefix events), the methodology should be re-reviewed
before continuing; this would mean even the crowbar approach
failed and a deeper rethink is needed.
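The stall condition can be checked with a few lines, assuming the per-iterate score deltas are kept in a list, oldest first. The function name and the list representation are assumptions; the threshold and the window of three come from the paragraph above.

```python
EPSILON = 0.001 * 500  # 0.5: the score contribution of 500 matched-prefix events

def needs_rereview(deltas, window=3, epsilon=EPSILON):
    """True if each of the last `window` score deltas is at or below epsilon."""
    recent = deltas[-window:]
    return len(recent) == window and all(d <= epsilon for d in recent)

# Deltas equal to 0.001 x (16, 66, 439): matched-prefix-only movement.
print(needs_rereview([0.016, 0.066, 0.439]))  # True

# One iterate that gained a swap (+1.0) breaks the stall.
print(needs_rereview([0.016, 1.0, 0.2]))  # False
```

Requiring a full window before triggering means the check stays silent until three reports exist, so the first two iterates cannot trip it on their own.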
Closing note
The user's instinct in calling this strategic pause and review was correct. The matched-prefix-only chain was producing real canonicalization work but had ceased producing progression. The roadmap above is one principled attempt at breaking the cycle; if it fails, the next-level fallback is to formally accept Sylpheed's boot-state as currently unreachable in ours and pivot to a different title for the methodology demonstration.