handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions


@@ -0,0 +1,193 @@
# Methodology assessment
## The matched-prefix metric: load-bearing or load-shedding?
Across 25+ iterates (audits 049 through 069; Phase C+1 through C+25;
Phase D Stages 0-4 plus D-extension; Phase W; Phase host-audio-*),
matched-prefix on the main thread (canary tid=6 ⇄ ours tid=1)
advanced:
| Phase | Matched-prefix | Δ |
|---|---:|---:|
| Phase B baseline (pre-C+1) | ~102,168 | — |
| Phase D D-extension landing | 104,607 → 105,046 | +439 |
| Phase W (VdInitializeEngines fix) | 105,046 → 105,112 | +66 |
| Phase C+25 (MmGetPhysicalAddress canon) | 105,112 → 105,128 | +16 |

Over the same period, the progression metrics did not move:

| Phase | `swaps` | `draws` | `unique_render_targets` |
|---|---:|---:|---:|
| Phase B baseline | 1 | 0 | 0 |
| Phase W | 1 | 0 | 0 |
| Phase C+25 | 1 | 0 | 0 |
**The two metrics are decoupled.** Matched-prefix is moving along
ENGINE-internal divergences (kernel-call return values, thread IDs,
heap arena base addresses). The progression metric is gated by
boot-state activation, which lives one or more layers above the diff
points.
## Why the decoupling happened
Three reading-errors compound:
1. **#23 (cooperative-vs-preemptive scheduling jitter)**: canary's
default-scheduling produces different *intra-thread* event ordering
than ours's coroutine scheduler. Diff-tool absorbers (C+18, C+21,
D-extension) correctly hide this jitter — but they hide *real
bootstrap-time divergences too*. Phase W explicitly noted: "If
ours's worker fails to enqueue something canary's worker awaits,
we'd never see the gap because the matched-prefix isn't on the
worker tid in the first place."
2. **#30 (per-tid PC SID drift)**: shared-global SIDs work for
process-global dispatchers (e.g., the work-queue semaphore at
handle `0xF800003C` in canary). But the wedge handle `0x12d0`
uses a per-tid create-site SID that does NOT match across engines.
So even when the same logical event exists in both engines, the
diff harness reports a SID mismatch and incorrectly either absorbs
it or diverges on it.
3. **#38 (cross-spawn producer paths)**: static reachability (the
sylpheed.db `xrefs` table) misses producer paths that cross
thread-spawn boundaries. The result.md from Phase Non-match shows
canary's tid=14 (XAudio voice-mask poll) communicates with
downstream code via a path that has no static `bl` edge — it
crosses via guest kernel APIs.
## Alternative metric proposals
### Option 1 — `draws ≥ 1` (sharp gate)
**Pros**: directly measures the target. Boolean. Reproducible.
**Cons**: gives no signal during iteration — every iterate before the
breakthrough is `draws = 0`. Loss function is non-smooth.
### Option 2 — `swaps ≥ 2` (relaxed first-frame gate)
**Pros**: still sharp; one bit looser than draws. Distinguishes
boot-init-only swap (`swaps=1`) from at-least-one-rendered-frame
(`swaps≥2`).
**Cons**: same non-smooth loss. Achievable in principle by a crowbar
without solving the underlying bug.
### Option 3 — Renderer-thread liveness: `events_emitted_by_renderer_thread ≥ N`
Compute: events emitted on the thread spawned at entry `0x822F1EE0`
in any 90-s wallclock window. Canary: 594,000. Ours: ~0.
**Pros**: smooth-ish (event count can move slowly). Directly measures
"is the renderer running." Bypasses the diff-tool jitter problem
because it's a per-engine internal count.
**Cons**: requires a non-trivial 90-s wallclock run (not the
50M-instruction ceiling). Could be gamed by a crowbar that resumes the renderer
without unblocking the wedge.
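
A minimal sketch of how the Option 3 count could be computed, assuming the trace is available as `(tid, wallclock_seconds)` pairs and the renderer tid is already known. The function name, the tuple schema, and the half-open window convention are illustrative assumptions, not part of the existing tooling.

```python
from typing import Iterable, Tuple


def max_events_in_window(
    events: Iterable[Tuple[int, float]],  # (tid, wallclock seconds); hypothetical schema
    renderer_tid: int,
    window_s: float = 90.0,
) -> int:
    """Largest number of events emitted by `renderer_tid` within any
    span shorter than `window_s` seconds."""
    stamps = sorted(t for tid, t in events if tid == renderer_tid)
    best = 0
    left = 0
    for right, t in enumerate(stamps):
        # Advance the left edge until the span from stamps[left] to t is < window_s.
        while t - stamps[left] >= window_s:
            left += 1
        best = max(best, right - left + 1)
    return best


sample = [(7, 0.0), (7, 10.0), (3, 1.0), (7, 95.0), (7, 99.0)]
renderer_events = max_events_in_window(sample, renderer_tid=7)
```

Because the count is computed per engine, it does not depend on the diff tool's absorbers; the gate is then simply `renderer_events >= N` for whatever `N` is chosen.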
### Option 4 — Worker-thread census: `count(threads_with_events ≥ 10k) ≥ 6`
Compute: how many tids in ours emit ≥10k events over 90 s wallclock.
Canary at 90 s: 12 tids meet this (tids 1/2/4/6/9/10/11/12/13/14/15/16),
plus the post-10s workers 21/27/28/29. Ours at 50M instr: 5 tids.
**Pros**: directly measures the AUDIT-057 thread-gap. Smooth metric:
each unwedged thread adds 1 to the count.
**Cons**: requires 90-s wallclock runs. Ours can't reach this
without solving the wedge first, so it has the same prerequisite as
Option 3.
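
The census can be sketched as follows, assuming the trace has already been reduced to one tid per emitted event. `worker_thread_census` and its parameters are hypothetical names used only for illustration.

```python
from collections import Counter
from typing import Iterable


def worker_thread_census(event_tids: Iterable[int], min_events: int = 10_000) -> int:
    """Number of distinct tids that emitted at least `min_events` events."""
    per_tid = Counter(event_tids)
    return sum(1 for n in per_tid.values() if n >= min_events)


def census_gate(event_tids: Iterable[int], min_threads: int = 6) -> bool:
    """Option 4 gate: at least `min_threads` tids cross the 10k-event threshold."""
    return worker_thread_census(event_tids) >= min_threads


# Toy trace with a lowered threshold so the example stays small.
toy = [1] * 3 + [2] * 5 + [3] * 2
busy_threads = worker_thread_census(toy, min_events=3)
```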
### Option 5 — `worker_semaphore_release_count` (AUDIT-069 S5)
Compute: how many `NtReleaseSemaphore` calls on the work semaphore
(handle `0xF800003C` in canary, equivalent in ours) over 90 s
wallclock. Canary: 414. Ours: 99 (24%).
**Pros**: pinpoints the under-production directly. Mechanically
measurable. Already instrumented in canary (audit_70_semaphore_release_watch).
**Cons**: same wallclock requirement; same gameability.
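
As an illustration of the comparison, the sketch below counts release calls on one handle and expresses ours as a fraction of canary. It assumes each trace event is a dict with `call` and `handle` keys; that schema and the function names are assumptions, and the handle for ours must be supplied by the caller because only the canary handle is known here.

```python
from typing import Any, Iterable, Mapping

CANARY_WORK_SEMAPHORE = 0xF800003C  # canary handle quoted above


def count_semaphore_releases(
    events: Iterable[Mapping[str, Any]],  # hypothetical event schema
    handle: int,
    call_name: str = "NtReleaseSemaphore",
) -> int:
    """Count `call_name` events that target `handle`."""
    return sum(
        1 for e in events
        if e.get("call") == call_name and e.get("handle") == handle
    )


def release_ratio(ours: int, canary: int) -> float:
    """Ours as a fraction of canary; 0.0 when canary recorded no releases."""
    return ours / canary if canary else 0.0


trace = [
    {"call": "NtReleaseSemaphore", "handle": CANARY_WORK_SEMAPHORE},
    {"call": "NtReleaseSemaphore", "handle": 0x12D0},
    {"call": "NtSetEvent", "handle": CANARY_WORK_SEMAPHORE},
    {"call": "NtReleaseSemaphore", "handle": CANARY_WORK_SEMAPHORE},
]
releases = count_semaphore_releases(trace, CANARY_WORK_SEMAPHORE)
percent = round(100 * release_ratio(99, 414))  # the counts reported above
```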
### Option 6 — composite: `progression_score`
Define:
```
progression_score = 1 * swaps + 10 * draws + 100 * unique_render_targets
+ 0.001 * matched_prefix
```
This recovers signal during iteration (matched-prefix moves)
without pretending it's progression. The 1000:1 weight ratio
between `swaps` and `matched_prefix` matches the bug-class severity.
**Pros**: continuous gradient over both wedge-solving and
canonicalization work. Honest about which is more important.
**Cons**: arbitrary weights. Composite metrics drift in meaning.
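
A minimal sketch of the composite, evaluated on the Phase W and Phase C+25 rows from the tables above. It assumes the digest exposes the four fields under exactly these key names; `matched_prefix` as a digest key is an assumption.

```python
from typing import Mapping

# Weights copied from the progression_score definition.
WEIGHTS = {
    "swaps": 1.0,
    "draws": 10.0,
    "unique_render_targets": 100.0,
    "matched_prefix": 0.001,
}


def progression_score(digest: Mapping[str, float]) -> float:
    """Weighted sum of the digest fields; missing fields count as 0."""
    return sum(w * digest.get(field, 0) for field, w in WEIGHTS.items())


phase_w = {"swaps": 1, "draws": 0, "unique_render_targets": 0, "matched_prefix": 105_112}
phase_c25 = {"swaps": 1, "draws": 0, "unique_render_targets": 0, "matched_prefix": 105_128}

delta = progression_score(phase_c25) - progression_score(phase_w)
```

With these inputs the 16-event matched-prefix advance moves the score by about 0.016, whereas a single additional swap would move it by 1.0.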
## Recommendation
**Adopt Option 6 (composite progression_score) as the primary
methodology metric**, with Option 2 (`swaps ≥ 2`) as a hard secondary
gate: that gate is what matters; everything else is fitness.
Concrete proposal:
1. The `digest.json` output gains a `progression_score` field
computed from the existing fields (zero new instrumentation).
2. Every iterate must report Δprogression_score in its
re-validation.md.
3. Iterates that only move `matched_prefix` (i.e., Δprogression_score
= 0.001 × Δmatched_prefix) MUST be tagged in their memory entry
as "**canonicalization only — no progression**" and counted
against a *budget*: max 5 consecutive iterates in this class
before mandatory pivot to wedge-attack work.
4. Audits that move `swaps` or `draws` (the high-weight terms) are
tagged "**progression**" and given priority for resource
allocation.
This methodology change costs ~10 LOC in the digest output and
imposes a discipline cap of 5 canonicalization-only audits between
progression attempts.
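
The tagging rule and the budget could be enforced mechanically along these lines. This is a sketch under assumptions: the short tag strings, the function names, and the choice to treat an iterate that moves nothing at all as canonicalization-only are illustrative, not an existing convention.

```python
from typing import Sequence

PROGRESSION = "progression"
CANON_ONLY = "canonicalization-only"  # shorthand for the "no progression" tag


def classify_iterate(d_swaps: int, d_draws: int) -> str:
    """Tag an iterate by whether it moved a high-weight term.

    Assumption: an iterate that moves neither swaps nor draws falls into
    the canonicalization-only class, whatever else it changed.
    """
    return PROGRESSION if (d_swaps > 0 or d_draws > 0) else CANON_ONLY


def pivot_required(tags: Sequence[str], budget: int = 5) -> bool:
    """True once the trailing run of canonicalization-only iterates hits the budget."""
    run = 0
    for tag in reversed(tags):
        if tag != CANON_ONLY:
            break
        run += 1
    return run >= budget


history = [classify_iterate(0, 0) for _ in range(5)]
must_pivot = pivot_required(history)
```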
## Falsification of the matched-prefix-as-proxy belief
Phase C through C+25 explicitly assumed that matched-prefix is a
**proxy** for progression. This assumption is now empirically
falsified:
> +2,960 events of matched-prefix advancement produced exactly
> ZERO units of progression.
Reading-error #39 (newly registered by this review):
> **#39 (matched-prefix as progression proxy)**: matched-prefix
> measures *engine-to-engine divergence point*, not *game-to-game
> functional gap*. When the wedge is on a different thread than the
> matched-prefix anchor thread, advancing matched-prefix is orthogonal
> to unwedging. Future audits MUST distinguish "ours's tid-X main
> thread diverges from canary's tid-Y" from "ours's tid-X main thread
> is *blocked because tid-Z is wedged*", and target the wedge directly
> when present.
## What "progression discipline" looks like in practice
For the next 3 iterates:
- Iterate N+1: **Step 1 of shortest-path-roadmap** (crowbar). No
diff-tool work. Target: `swaps ≥ 2`.
- Iterate N+2: **Step 2 of roadmap** (trigger ID via canary jsonl
analysis). No engine LOC. Target: identification of the missing
kernel call(s).
- Iterate N+3: **Step 3 of roadmap** (mirror the trigger). Target:
ours unblocks without the crowbar.
Each iterate must produce a `progression_score` delta report. If
3 iterates in a row produce Δprogression_score ≤ ε (where
ε = 0.001 × 500 = 0.5, the score contribution of a 500-event
matched-prefix advance), the methodology should be reviewed
again before continuing. That outcome would mean even the crowbar approach
failed and a deeper rethink is needed.
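
The stall rule reduces to a check over the most recent score deltas. A minimal sketch, with `needs_methodology_review` as a hypothetical helper name:

```python
from typing import Sequence

EPSILON = 0.001 * 500  # 0.5: score contribution of a 500-event matched-prefix advance


def needs_methodology_review(
    deltas: Sequence[float],
    run_length: int = 3,
    epsilon: float = EPSILON,
) -> bool:
    """True when the last `run_length` iterates each moved the score by <= epsilon."""
    recent = deltas[-run_length:]
    return len(recent) == run_length and all(d <= epsilon for d in recent)


stalled = needs_methodology_review([0.016, 0.0, 0.4])
```

A single iterate that reaches `swaps ≥ 2` contributes a delta of at least 1.0 and therefore resets the run.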
## Closing note
The user's instinct in calling for this strategic pause and review was
correct. The matched-prefix-only chain was producing real
canonicalization work but had ceased producing progression. The
roadmap above is one principled attempt at breaking the cycle; if it
fails, the next-level fallback is to formally accept Sylpheed's
boot-state as currently unreachable in ours and pivot to a different
title for the methodology demonstration.