xenia-rs/audit-runs/phase-c23-scheduler-determinism-plan/recommendation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Recommendation — Phase C+23
## Top-line: STAY WITH THE BAND-AID
After reading the source of both engines, characterizing the jitter
shape of 4 archived canary cold runs, and reviewing Phase D's H'/H
broad outcomes, the recommended approach is **(ζ) stay with the band-aid**.
The 104,607 cap that originally motivated this track is already
unblocked at the diff-tool layer (Phase D D-extension absorber,
2026-05-18). The next divergence at idx 105,046 is
`VdInitializeEngines.return_value` — a VD-subsystem engine bug, NOT
a scheduling-determinism recurrence. The cost-benefit of pursuing
γ/β/α is no longer compelling because the immediate symptom is
resolved and no structural follow-on cap has appeared.
## Rationale
### 1. The original target is already unblocked.
| metric | pre-C+20 (C+19) | post-C+21 | post-Phase-D D-extension | now |
|---|---|---|---|---|
| Main matched-prefix | 104,606 | 104,607 | **105,046** | 105,046 |
| Sister chains | 11/32/3/41/16 | 11/32/3/41/16 | 11/32/4/41/16 | unchanged |
| Cap class at head | (B) contention | (A) state-mutation | (engine) VD | (engine) VD |

The matched-prefix advanced **+440** since C+19 through diff-tool work
that did NOT touch the engines. The cap class at the head is no longer
scheduling.
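
For readers new to the metric, here is a minimal sketch of how a matched-prefix and its head divergence can be computed from two event streams. The event tuples and function names are hypothetical illustrations, not the diff tool's actual schema or API, and the absorbers are deliberately left out.

```python
def matched_prefix(canary_events, ours_events):
    """Length of the longest common prefix of two event lists."""
    n = 0
    for c, o in zip(canary_events, ours_events):
        if c != o:
            break
        n += 1
    return n


def first_divergence(canary_events, ours_events):
    """Return (idx, canary_event, ours_event) at the head divergence, or None."""
    n = matched_prefix(canary_events, ours_events)
    if n == len(canary_events) == len(ours_events):
        return None  # streams are identical end to end
    c = canary_events[n] if n < len(canary_events) else None
    o = ours_events[n] if n < len(ours_events) else None
    return n, c, o


# Hypothetical two-event streams: (export name, return_value).
canary = [("NtCreateEvent", 0), ("VdInitializeEngines", 1)]
ours = [("NtCreateEvent", 0), ("VdInitializeEngines", 0)]

print(matched_prefix(canary, ours))
print(first_divergence(canary, ours))
```

With 0-based indices, a matched-prefix of N means the head divergence sits at idx N, which is how a prefix of 105,046 and a divergence at idx 105,046 describe the same point.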
### 2. Phase D Stages 1-4 already built the structural infrastructure.
Phase D Stage 1 (canary contention emitter), Stage 2 (manifest builder),
Stage 3 (ours `OrderMode::ContentionReplay` + manifest loader), and
Stage 4 (diff-tool engine-local kinds) ALL LANDED. The engine code is
in tree. What's missing is *coverage of the right contention events*:
the 104,607 divergence was upstream of canary's first
`contention.observed=true` emit (idx 104,664), so the manifest could
not target the right call site.
This means: if we pursue γ (broaden replay to more event classes),
the entry cost is not "start from scratch" but "extend an existing
manifest layer." However, the LOC budget for γ is still ~600 across
both engines, and there is **no proven future cap** that this would
unblock.
### 3. The empirical jitter range is small and fully absorbable.
From `jitter-profile.md`: 4 canary cold samples show 3 distinct
shapes around the contention window. The C+21 absorber + Phase D
D-extension already canonicalize ALL 3 shapes to the same matched
form. Even N=5 or N=10 fresh canary colds would be expected to land in one of these
3 shapes, likely with the same absorber outcome.
The SID core (`a25a16a4f6f547aa`, `2a70efeeed4f4fb6`,
`72a4170012353517`) is consistent across cold runs (±20% counts), and
the shared-global SID recipe (C+18) recomputes them deterministically.
The transient "top-2" SIDs (which change per-cold) all flow through
the shared-global absorber.
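
The "consistent across cold runs (±20% counts)" check can be sketched as below. This assumes the tolerance is measured against the per-SID mean across runs, which the source does not specify; the counts and the transient SID labels are made up for illustration, and only the three core SIDs come from the text.

```python
from statistics import mean


def within_tolerance(counts, tolerance=0.20):
    """True if every count lies within +/- tolerance of the mean (assumed baseline)."""
    avg = mean(counts)
    if avg == 0:
        return all(c == 0 for c in counts)
    return all(abs(c - avg) <= tolerance * avg for c in counts)


def stable_sids(per_run_counts, tolerance=0.20):
    """per_run_counts: one {sid: count} dict per cold run. Missing SIDs count as 0."""
    all_sids = set().union(*per_run_counts)
    return {
        sid
        for sid in all_sids
        if within_tolerance([run.get(sid, 0) for run in per_run_counts], tolerance)
    }


# Hypothetical counts for three cold runs; "transient-*" are placeholder SIDs.
runs = [
    {"a25a16a4f6f547aa": 100, "2a70efeeed4f4fb6": 40, "72a4170012353517": 60, "transient-1": 5},
    {"a25a16a4f6f547aa": 110, "2a70efeeed4f4fb6": 44, "72a4170012353517": 55},
    {"a25a16a4f6f547aa": 95, "2a70efeeed4f4fb6": 38, "72a4170012353517": 62, "transient-2": 7},
]

print(sorted(stable_sids(runs)))
```

SIDs that appear in only one cold run fail the check because their count is 0 elsewhere, which matches the split between the stable core and the per-cold transient SIDs described above.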
### 4. Canary cannot be made deterministic without invalidating it.
The host-thread-per-XThread model is what makes canary the *oracle*.
Replacing it (α / β) would require:
- Reworking ~2000-3000 LOC of canary base+kernel.
- Re-validating against the broader canary test corpus (other games).
- Accepting a real risk of breaking Sylpheed-unrelated game-compat.
Approach γ (record-and-replay) avoids touching canary's scheduling
philosophy but requires ours to consume a multi-million-entry trace,
with engineering and runtime cost that should be matched to a *proven*
future scheduling cap.
### 5. The Phase B image hash and ours digest are stable.
`image_loaded_sha256 ea8d160e…` is UNCHANGED, and ours' default digest is
stable across 3 cold runs. There is no signal of latent divergence in the
pre-Phase-A surfaces that would benefit from scheduling alignment.
## What to keep
1. **Phase D Stages 1-4 infrastructure** stays in tree. The cvar
`kernel_emit_contention` defaults to `false`, and the manifest is opt-in via
`XENIA_CONTENTION_MANIFEST_PATH`. Future phases can use them.
2. **All absorbers** (C+18, C+21, D-extension) stay; they are correct
and narrow.
3. **The Stage 0 `OrderMode::ScanQuantum`** stays as a debug knob,
documented as null-result.
## What to defer
1. Approach γ (broader scheduling-trace replay) — defer until a
future cap demonstrably scheduling-related appears.
2. Approach β / α (deterministic preemption / cooperative canary) —
defer indefinitely.
## What to do next
The next phase is **C+24** (or whichever number comes next naturally), targeting
the head divergence at idx 105,046: `VdInitializeEngines.return_value`
(canary=1 ours=0). This is a regular engine bug investigation, ~5-50
LOC.
## Fallback: γ trigger criteria
If a future phase finds a NEW scheduling-determinism cap (defined as:
two consecutive divergences whose root cause is contention/wakeup-
ordering across ≥2 guest threads, NOT a guest-code bug or kernel
emit-completeness gap), then revisit γ. The criteria:
- The new cap is ≥1,000 events long.
- The C+21 / D-extension absorbers cannot fold it within their
current cap (32 pairs).
- Empirical jitter sampling (≥3 canary colds) confirms structural
shape divergence, not just SID identity drift.
If all three hold, γ is justified. Estimated ~600 LOC across 4-5
sessions.
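
The trigger can be written down as a single predicate, which may help keep a future decision mechanical. This is a sketch only: the field names are hypothetical, and `pairs_needed_to_fold` is an assumed way of measuring "cannot fold within the current cap".

```python
from dataclasses import dataclass

ABSORBER_PAIR_CAP = 32   # current C+21 / D-extension cap, from the criteria above
MIN_CAP_EVENTS = 1_000   # minimum length of the new cap
MIN_CANARY_COLDS = 3     # minimum jitter samples


@dataclass(frozen=True)
class CapObservation:
    # All field names are illustrative, not an existing schema.
    consecutive_scheduling_divergences: int
    cap_length_events: int
    pairs_needed_to_fold: int
    canary_colds_sampled: int
    structural_shape_divergence: bool


def gamma_justified(obs: CapObservation) -> bool:
    """True only if the cap definition and all three criteria hold."""
    is_scheduling_cap = obs.consecutive_scheduling_divergences >= 2
    long_enough = obs.cap_length_events >= MIN_CAP_EVENTS
    absorber_cannot_fold = obs.pairs_needed_to_fold > ABSORBER_PAIR_CAP
    jitter_confirmed = (
        obs.canary_colds_sampled >= MIN_CANARY_COLDS
        and obs.structural_shape_divergence
    )
    return is_scheduling_cap and long_enough and absorber_cannot_fold and jitter_confirmed


print(gamma_justified(CapObservation(2, 1_500, 40, 3, True)))
print(gamma_justified(CapObservation(2, 1_500, 32, 3, True)))
```

The second call returns `False` because a cap that folds within 32 pairs is still absorbable and therefore does not justify γ.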
## What this recommendation is NOT
- It is NOT "no scheduling work was useful." Stages 1-4 + D-extension
produced the matched-prefix advance from 104,606 → 105,046 (+440).
- It is NOT "the absorbers are perfect forever." They are explicit
band-aids in the spirit of reading-error #23, annotated in schema-v1.md
v1.5.
- It is NOT "ours and canary are bit-aligned in contention regions."
They are *measurably* aligned (matched-prefix) but not *structurally*
aligned (the underlying guest events still differ; the absorber
folds the difference).
## Multi-session budget if we proceed (γ scenario only)
Sessions estimated 4-5. NOT scheduled now.
| stage | LOC | est session |
|---|---|---|
| γ-Stage 1: extend canary trace to wake/park/yield | ~150 | 1 |
| γ-Stage 2: extend manifest builder | ~80 | 0.5 |
| γ-Stage 3: generalized replayer in ours | ~250 | 2 |
| γ-Stage 4: diff-tool integration | ~50 | 0.5 |
| γ-Stage 5: validation + sister budgets | n/a | 1 |
| **total** | **~530** | **~5** |
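
A quick arithmetic check of the table, using only the figures listed in it (the `n/a` LOC entry for Stage 5 is treated as 0, which is an assumption):

```python
# (stage, LOC, estimated sessions), copied from the table above.
stages = [
    ("γ-Stage 1: extend canary trace to wake/park/yield", 150, 1.0),
    ("γ-Stage 2: extend manifest builder", 80, 0.5),
    ("γ-Stage 3: generalized replayer in ours", 250, 2.0),
    ("γ-Stage 4: diff-tool integration", 50, 0.5),
    ("γ-Stage 5: validation + sister budgets", 0, 1.0),  # LOC is n/a, counted as 0
]

total_loc = sum(loc for _, loc, _ in stages)
total_sessions = sum(sessions for _, _, sessions in stages)

print(total_loc)       # 530
print(total_sessions)  # 5.0
```

The per-stage figures sum to ~530 LOC, a little under the ~600 quoted elsewhere in this recommendation; the source does not itemize the difference.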
## Acceptance for THIS session (planning-only)
- [x] Planning artifacts in `audit-runs/phase-c23-scheduler-determinism-plan/`.
- [x] Engine sources UNCHANGED (verified by file listing — only
documentation + 1 python probe written).
- [x] Diff tool UNCHANGED.
- [x] Memory entry to be written next.
- [x] Recommendation justified against C+21 band-aid + breadth of
contention regions + multi-session budget.