xenia-rs/audit-runs/review-a-boot-state/methodology-assessment.md

# Methodology assessment

## The matched-prefix metric: load-bearing or load-shedding?

Across 25+ iterates (audits 049 through 069; Phase C+1 through C+25;
Phase D Stages 0-4 plus D-extension; Phase W; Phase host-audio-*),
matched-prefix on the main thread (canary tid=6 ⇄ ours tid=1)
advanced:

| Phase | Matched-prefix | Δ |
|---|---:|---:|
| Phase B baseline (pre-C+1) | ~102,168 | — |
| Phase D D-extension landing | 104,607 → 105,046 | +439 |
| Phase W (VdInitializeEngines fix) | 105,046 → 105,112 | +66 |
| Phase C+25 (MmGetPhysicalAddress canon) | 105,112 → 105,128 | +16 |

Over the same period, the progression counters did not move:

| Phase | `swaps` | `draws` | `unique_render_targets` |
|---|---:|---:|---:|
| Phase B baseline | 1 | 0 | 0 |
| Phase W | 1 | 0 | 0 |
| Phase C+25 | 1 | 0 | 0 |

**The two metrics are decoupled.**  Matched-prefix is moving along
ENGINE-internal divergences (kernel-call return values, thread IDs,
heap arena base addresses).  The progression metric is gated by
boot-state activation, which lives one or more layers above the diff
points.
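
The arithmetic behind that claim can be checked directly from the two tables. The following is a minimal sketch: the numbers are copied from the tables above, the baseline is the approximate value reported there, and the function names are illustrative rather than part of any existing tool.

```python
# Matched-prefix value at the end of each phase, copied from the first table.
MATCHED_PREFIX = [
    ("Phase B baseline (pre-C+1)", 102_168),  # reported as approximate (~)
    ("Phase D D-extension landing", 105_046),
    ("Phase W", 105_112),
    ("Phase C+25", 105_128),
]

# (swaps, draws, unique_render_targets) per phase, copied from the second table.
PROGRESSION = {
    "Phase B baseline": (1, 0, 0),
    "Phase W": (1, 0, 0),
    "Phase C+25": (1, 0, 0),
}


def total_prefix_advance(milestones):
    """Difference between the last and the first matched-prefix value."""
    return milestones[-1][1] - milestones[0][1]


def progression_moved(progression):
    """True if any progression counter differs between any two phases."""
    return len(set(progression.values())) > 1


print(total_prefix_advance(MATCHED_PREFIX))  # 2960
print(progression_moved(PROGRESSION))        # False
```

The first number moves while the second check stays false, which is the decoupling in its simplest form.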

## Why the decoupling happened

Three reading-errors compound:

1. **#23 (cooperative-vs-preemptive scheduling jitter)**: canary's
   default-scheduling produces different *intra-thread* event ordering
   than ours's coroutine scheduler.  Diff-tool absorbers (C+18, C+21,
   D-extension) correctly hide this jitter — but they hide *real
   bootstrap-time divergences too*.  Phase W explicitly noted: "If
   ours's worker fails to enqueue something canary's worker awaits,
   we'd never see the gap because the matched-prefix isn't on the
   worker tid in the first place."
2. **#30 (per-tid PC SID drift)**: shared-global SIDs work for
   process-global dispatchers (e.g., the work-queue semaphore at
   handle `0xF800003C` in canary).  But the wedge handle `0x12d0`
   uses a per-tid create-site SID that does NOT match across engines.
   So even when the same logical event exists in both engines, the
   diff harness sees a SID mismatch and either absorbs the event or
   reports a divergence, incorrectly in both cases.
3. **#38 (cross-spawn producer paths)**: static reachability (the
   sylpheed.db `xrefs` table) misses producer paths that cross
   thread-spawn boundaries.  The result.md from Phase Non-match shows
   canary's tid=14 (XAudio voice-mask poll) communicates with
   downstream code via a path that has no static `bl` edge — it
   crosses via guest kernel APIs.

## Alternative metric proposals

### Option 1 — `draws ≥ 1` (sharp gate)

**Pros**: directly measures the target.  Boolean.  Reproducible.
**Cons**: gives no signal during iteration — every iterate before the
breakthrough is `draws = 0`.  Loss function is non-smooth.

### Option 2 — `swaps ≥ 2` (relaxed first-frame gate)

**Pros**: still sharp; one bit looser than draws.  Distinguishes
boot-init-only swap (`swaps=1`) from at-least-one-rendered-frame
(`swaps≥2`).
**Cons**: same non-smooth loss.  Achievable in principle by a crowbar
without solving the underlying bug.

### Option 3 — Renderer-thread liveness: `events_emitted_by_renderer_thread ≥ N`

Compute: events emitted on the thread spawned at entry `0x822F1EE0`
in any 90-s wallclock window.  Canary: 594,000.  Ours: ~0.

**Pros**: smooth-ish (event count can move slowly).  Directly measures
"is the renderer running."  Bypasses the diff-tool jitter problem
because it's a per-engine internal count.
**Cons**: requires a non-trivial 90-s wallclock run (rather than a run
capped at the 50M-instruction ceiling).  Could be gamed by a crowbar
that resumes the renderer without unblocking the wedge.
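
A minimal sketch of how this count could be computed from an event log follows. The record layout is an assumption made for illustration: a `thread_create` record with `new_tid` and `entry` fields, and a `tid` field on every event. None of these names are taken from the actual trace format, and the events are assumed to be pre-filtered to one 90-s window.

```python
RENDERER_ENTRY = 0x822F1EE0  # entry address quoted in the text above


def renderer_event_count(events, entry=RENDERER_ENTRY):
    """Count events emitted by threads that were spawned at `entry`.

    `events` must be a list (it is scanned twice) of dicts using a
    hypothetical schema: "kind", "tid", and, on thread-creation records,
    "new_tid" and "entry".
    """
    # First pass: find the tids of threads created at the renderer entry.
    renderer_tids = {
        e["new_tid"]
        for e in events
        if e.get("kind") == "thread_create" and e.get("entry") == entry
    }
    # Second pass: count events emitted on those tids.
    return sum(1 for e in events if e.get("tid") in renderer_tids)


sample = [
    {"kind": "thread_create", "tid": 1, "new_tid": 7, "entry": 0x822F1EE0},
    {"kind": "call", "tid": 7},
    {"kind": "call", "tid": 7},
    {"kind": "call", "tid": 1},
]
print(renderer_event_count(sample))  # 2
```

The creation record itself is attributed to the spawning thread, so it is not counted as renderer activity.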

### Option 4 — Worker-thread census: `count(threads_with_events ≥ 10k) ≥ 6`

Compute: how many tids in ours emit ≥10k events over 90 s wallclock.
Canary at 90 s: 12 tids meet this (tids 1/2/4/6/9/10/11/12/13/14/15/16),
plus the post-10s workers 21/27/28/29.  Ours at 50M instructions: 5 tids.

**Pros**: directly measures the AUDIT-057 thread-gap.  Smooth metric:
each unwedged thread adds 1 to the count.
**Cons**: requires 90-s wallclock runs, which ours can't reach
without solving the wedge first, so it has the same prerequisite as
Option 3.
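
The census itself is a per-tid histogram followed by a threshold. Here is a sketch, assuming each event record is a dict with a `tid` field (a hypothetical layout, not the real trace schema); the 10k and 6 thresholds come from the option title.

```python
from collections import Counter


def busy_thread_count(events, min_events=10_000):
    """Number of distinct tids that emitted at least `min_events` events."""
    per_tid = Counter(e["tid"] for e in events)
    return sum(1 for n in per_tid.values() if n >= min_events)


def census_gate(events, min_events=10_000, min_threads=6):
    """True when enough threads are busy to pass the Option 4 gate."""
    return busy_thread_count(events, min_events) >= min_threads


# Tiny synthetic log with lowered thresholds so the example stays small.
log = [{"tid": 1}] * 3 + [{"tid": 2}] * 3 + [{"tid": 3}]
print(busy_thread_count(log, min_events=3))            # 2
print(census_gate(log, min_events=3, min_threads=2))   # True
```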

### Option 5 — `worker_semaphore_release_count` (AUDIT-069 S5)

Compute: how many `NtReleaseSemaphore` calls on the work semaphore
(handle `0xF800003C` in canary, equivalent in ours) over 90 s
wallclock.  Canary: 414.  Ours: 99 (24%).

**Pros**: pinpoints the under-production directly.  Mechanically
measurable.  Already instrumented in canary (`audit_70_semaphore_release_watch`).
**Cons**: same wallclock requirement; same gameability.

### Option 6 — composite: `progression_score`

Define:

```
progression_score = 1 * swaps + 10 * draws + 100 * unique_render_targets
                  + 0.001 * matched_prefix
```

This recovers signal during iteration (matched-prefix moves)
without pretending it's progression.  The 1000:1 ratio between the
`swaps` weight and the `matched_prefix` weight matches the bug-class
severity.

**Pros**: continuous gradient over both wedge-solving and
canonicalization work.  Honest about which is more important.
**Cons**: arbitrary weights.  Composite metrics drift in meaning.
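
To make the weighting concrete, here is a small sketch that evaluates the formula on the Phase W and Phase C+25 values from the tables above. The digest is modelled as a plain dict keyed by the four field names; that layout, and the `first_frame` iterate, are assumptions for illustration.

```python
# Weights copied from the formula above.
WEIGHTS = {
    "swaps": 1.0,
    "draws": 10.0,
    "unique_render_targets": 100.0,
    "matched_prefix": 0.001,
}


def progression_score(digest):
    """Weighted sum over the four digest fields; missing fields count as 0."""
    return sum(w * digest.get(name, 0) for name, w in WEIGHTS.items())


def score_delta(before, after):
    """Change in progression_score between two digests."""
    return progression_score(after) - progression_score(before)


phase_w = {"swaps": 1, "draws": 0, "unique_render_targets": 0,
           "matched_prefix": 105_112}
phase_c25 = {**phase_w, "matched_prefix": 105_128}
first_frame = {**phase_c25, "swaps": 2}  # hypothetical future iterate

print(round(score_delta(phase_w, phase_c25), 3))      # 0.016
print(round(score_delta(phase_c25, first_frame), 3))  # 1.0
```

Note that the absolute score is dominated by the `matched_prefix` term (about 105 at current values), so it is the delta between iterates, not the raw score, that carries the signal.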

## Recommendation

**Adopt Option 6 (composite progression_score) as the primary
methodology metric**, with Option 2 (`swaps ≥ 2`) as a hard secondary
gate: `swaps ≥ 2` is what matters; everything else is fitness.

Concrete proposal:

1. The `digest.json` output gains a `progression_score` field
   computed from the existing fields (zero new instrumentation).
2. Every iterate must report Δprogression_score in its
   re-validation.md.
3. Iterates that only move `matched_prefix` (i.e., Δprogression_score
   = 0.001 × Δmatched_prefix) MUST be tagged in their memory entry
   as "**canonicalization only — no progression**" and counted
   against a *budget*: max 5 consecutive iterates in this class
   before mandatory pivot to wedge-attack work.
4. Audits that move `swaps` or `draws` (the high-weight terms) are
   tagged "**progression**" and given priority for resource
   allocation.

This methodology change costs ~10 LOC in the digest output and
imposes a discipline cap of 5 canonicalization-only audits between
progression attempts.
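
The tagging rule and the budget could be enforced mechanically along these lines. This is a sketch only: the short tag strings, the dict layout of a digest, and the choice to count `unique_render_targets` among the high-weight terms are assumptions, not existing conventions.

```python
HIGH_WEIGHT_FIELDS = ("swaps", "draws", "unique_render_targets")
PROGRESSION = "progression"
CANON_ONLY = "canonicalization-only"
NO_CHANGE = "no-change"


def classify(before, after):
    """Tag an iterate by which digest fields moved between two runs."""
    if any(after[f] != before[f] for f in HIGH_WEIGHT_FIELDS):
        return PROGRESSION
    if after["matched_prefix"] != before["matched_prefix"]:
        return CANON_ONLY
    return NO_CHANGE


def pivot_required(tags, budget=5):
    """True once the most recent `budget` iterates are all canonicalization-only."""
    run = 0
    for tag in reversed(tags):
        if tag != CANON_ONLY:
            break
        run += 1
    return run >= budget


w = {"swaps": 1, "draws": 0, "unique_render_targets": 0, "matched_prefix": 105_112}
c25 = {**w, "matched_prefix": 105_128}
print(classify(w, c25))                    # canonicalization-only
print(pivot_required([CANON_ONLY] * 5))    # True
```

A single progression-tagged iterate resets the run, so the budget only bites on an unbroken streak.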

## Falsification of the matched-prefix-as-proxy belief

Phase C through C+25 explicitly assumed that matched-prefix is a
**proxy** for progression.  This assumption is now empirically
falsified:

> +2,960 events of matched-prefix advancement produced exactly
> ZERO units of progression.

Reading-error #39 (newly registered by this review):

> **#39 (matched-prefix as progression proxy)**: matched-prefix
> measures *engine-to-engine divergence point*, not *game-to-game
> functional gap*.  When the wedge is on a different thread than the
> matched-prefix anchor thread, advancing matched-prefix is orthogonal
> to unwedging.  Future audits MUST distinguish "ours's tid-X main
> thread diverges from canary's tid-Y" from "ours's tid-X main thread
> is *blocked because tid-Z is wedged*", and target the wedge directly
> when present.

## What "progression discipline" looks like in practice

For the next 3 iterates:

- Iterate N+1: **Step 1 of shortest-path-roadmap** (crowbar).  No
  diff-tool work.  Target: `swaps ≥ 2`.
- Iterate N+2: **Step 2 of roadmap** (trigger ID via canary jsonl
  analysis).  No engine LOC.  Target: identification of the missing
  kernel call(s).
- Iterate N+3: **Step 3 of roadmap** (mirror the trigger).  Target:
  ours unblocks without the crowbar.

Each iterate must produce a `progression_score` delta report.  If
3 iterates in a row produce Δprogression_score ≤ ε (where
ε = 0.001 × 500 = 0.5), the methodology should be re-reviewed
before continuing: this would mean even the crowbar approach
failed and a deeper rethink is needed.
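
The stopping rule above reduces to a check over the most recent deltas. Here is a minimal sketch, assuming the per-iterate Δprogression_score values are kept in a chronological list; the function name is illustrative.

```python
EPSILON = 0.001 * 500  # 0.5: the score contribution of 500 matched-prefix events


def needs_re_review(deltas, window=3, epsilon=EPSILON):
    """True if the last `window` score deltas are all at or below `epsilon`."""
    recent = deltas[-window:]
    return len(recent) == window and all(d <= epsilon for d in recent)


print(needs_re_review([0.439, 0.066, 0.016]))  # True
print(needs_re_review([0.016, 1.0, 0.0]))      # False
```

The first call uses the three matched-prefix gains from the first table scaled by 0.001; under this rule, that sequence alone would already have triggered a re-review.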

## Closing note

The user's instinct in calling for this strategic pause and review was
correct.  The matched-prefix-only chain was producing real
canonicalization work but had ceased producing progression.  The
roadmap above is one principled attempt at breaking the cycle; if it
fails, the next-level fallback is to formally accept Sylpheed's
boot-state as currently unreachable in ours and pivot to a different
title for the methodology demonstration.