xenia-rs/audit-runs/phase-c23-scheduler-determinism-plan/recommendation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Recommendation — Phase C+23

Top-line: STAY WITH THE BAND-AID

After reading the source of both engines, characterizing the jitter shape of 4 archived canary cold runs, and reviewing Phase D's H'/H broad outcomes, the recommended approach is (ζ): stay with the band-aid.

The 104,607 cap that originally motivated this track is already unblocked at the diff-tool layer (Phase D D-extension absorber, 2026-05-18). The next divergence at idx 105,046 is VdInitializeEngines.return_value — a VD-subsystem engine bug, NOT a scheduling-determinism recurrence. The cost-benefit of pursuing γ/β/α is no longer compelling because the immediate symptom is resolved and no structural follow-on cap has appeared.

Rationale

1. The original target is already unblocked.

| metric | pre-C+20 (C+19) | post-C+21 | post-Phase-D D-extension | now |
|---|---|---|---|---|
| Main matched-prefix | 104,606 | 104,607 | 105,046 | 105,046 |
| Sister chains | 11/32/3/41/16 | 11/32/3/41/16 | 11/32/4/41/16 | unchanged |
| Cap class at head | (B) contention | (A) state-mutation | (engine) VD | (engine) VD |

The matched-prefix advanced +440 since C+19 through diff-tool work that did NOT touch the engines. The cap class at the head is no longer scheduling.
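To make the metric concrete, here is a minimal sketch of how a matched-prefix length could be computed, assuming each trace is an ordered sequence of comparable event records. The function name `matched_prefix_len` and the toy event tuples are hypothetical illustrations, not the diff tool's actual API or data.

```python
from itertools import takewhile


def matched_prefix_len(canary_events, ours_events):
    """Number of leading positions at which both event streams agree."""
    pairs = zip(canary_events, ours_events)
    return sum(1 for _ in takewhile(lambda p: p[0] == p[1], pairs))


# Toy streams with hypothetical event records, not real trace data.
canary = [("NtCreateEvent", 0), ("KeSetEvent", 1), ("VdInitializeEngines", 1)]
ours = [("NtCreateEvent", 0), ("KeSetEvent", 1), ("VdInitializeEngines", 0)]

# Both streams agree on the first two events; index 2 is the head divergence.
head_idx = matched_prefix_len(canary, ours)

# Advance quoted in the table: post-D-extension prefix minus the C+19 prefix.
advance = 105_046 - 104_606
```

Under this reading the matched-prefix length is also the 0-based index of the first divergence, which is consistent with the head divergence being reported at idx 105,046.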

2. Phase D Stages 1-4 already built the structural infrastructure.

Phase D Stage 1 (canary contention emitter), Stage 2 (manifest builder), Stage 3 (ours OrderMode::ContentionReplay + manifest loader), and Stage 4 (diff-tool engine-local kinds) ALL LANDED. The engine code is in tree. What's missing is coverage of the right contention events: the 104,607 divergence was upstream of canary's first contention.observed=true emit (idx 104,664), so the manifest could not target the right call site.
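The coverage gap can be sketched as a simple index comparison. This assumes, as a simplification, that a manifest can only steer call sites at or after canary's first contention emit. The helper names and the `contention_observed` dictionary key are hypothetical stand-ins for the real trace schema.

```python
def first_contention_idx(events):
    """Index of the first event flagged contention-observed, or None."""
    for idx, event in enumerate(events):
        if event.get("contention_observed"):
            return idx
    return None


def manifest_can_target(divergence_idx, first_emit_idx):
    """Assumed rule: a manifest only covers indices at or after the first emit."""
    return first_emit_idx is not None and divergence_idx >= first_emit_idx


# Indices quoted in the text: divergence at 104,607, first emit at 104,664.
covered = manifest_can_target(104_607, 104_664)
gap = 104_664 - 104_607
```

With the quoted indices, the divergence sits 57 events before the first emit, so `covered` is false under this rule.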

This means: if we pursue γ (broaden replay to more event classes), the entry cost is not "start from scratch" but "extend an existing manifest layer." However, the LOC budget for γ is still ~600 across both engines, and there is no proven future cap that this would unblock.

3. The empirical jitter range is small and fully absorbable.

From jitter-profile.md: 4 canary cold samples show 3 distinct shapes around the contention window. The C+21 absorber + Phase D D-extension already canonicalize ALL 3 shapes to the same matched form. We expect even N=5 or N=10 fresh canary colds to land in one of these 3 shapes, likely with the same absorber outcome, although only 4 samples back this expectation.

The SID core (a25a16a4f6f547aa, 2a70efeeed4f4fb6, 72a4170012353517) is consistent across cold runs (±20% counts), and the shared-global SID recipe (C+18) recomputes them deterministically. The transient "top-2" SIDs (which change per-cold) all flow through the shared-global absorber.
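One way to read the "±20% counts" claim is as a tolerance check on per-run counts for each SID. The sketch below assumes the tolerance is measured against the mean across runs, which the document does not specify. The count lists are hypothetical, since no raw counts are given here.

```python
def counts_consistent(counts, tolerance=0.20):
    """True if every run's count lies within +/- tolerance of the mean count."""
    mean = sum(counts) / len(counts)
    return all(abs(c - mean) <= tolerance * mean for c in counts)


# Hypothetical per-cold-run counts; real values would come from jitter-profile.md.
core_sid_counts = [100, 90, 115, 105]      # stable across 4 colds
transient_sid_counts = [100, 0, 3, 0]      # appears mostly in one cold

core_ok = counts_consistent(core_sid_counts)
transient_ok = counts_consistent(transient_sid_counts)
```

A core SID passes the check, while a transient "top-2" SID fails it and is left to the shared-global absorber.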

4. Canary cannot be made deterministic without invalidating it.

The host-thread-per-XThread model is what makes canary the oracle. Replacing it (α / β) would require:

  • Reworking ~2000-3000 LOC of canary base+kernel.
  • Re-validating against the broader canary test corpus (other games).
  • Accepting a real risk of breaking Sylpheed-unrelated game-compat.

Approach γ (record-and-replay) avoids touching canary's scheduling philosophy but requires ours to consume a multi-million-entry trace. That engineering and runtime cost is only worth paying against a proven future scheduling cap.

5. The Phase B image hash and ours digest are stable.

image_loaded_sha256 ea8d160e… UNCHANGED. Ours default digest stable × 3 cold runs. There is no signal of latent divergence in the pre-Phase-A surfaces that would benefit from scheduling alignment.

What to keep

  1. Phase D Stages 1-4 infrastructure stays in tree. Cvar kernel_emit_contention=false default-off; XENIA_CONTENTION_MANIFEST_PATH opt-in. Future phases can use them.
  2. All absorbers (C+18, C+21, D-extension) stay; they are correct and narrow.
  3. The Stage 0 OrderMode::ScanQuantum stays as a debug knob, documented as null-result.

What to defer

  1. Approach γ (broader scheduling-trace replay) — defer until a future cap demonstrably scheduling-related appears.
  2. Approach β / α (deterministic preemption / cooperative canary) — defer indefinitely.

What to do next

The next phase is C+24 (or whatever the natural next number) on the head divergence at idx 105,046: VdInitializeEngines.return_value (canary=1 ours=0). This is a regular engine bug investigation, ~5-50 LOC.

Fallback: γ trigger criteria

If a future phase finds a NEW scheduling-determinism cap (defined as: two consecutive divergences whose root cause is contention/wakeup-ordering across ≥2 guest threads, NOT a guest-code bug or kernel emit-completeness gap), then revisit γ. The criteria:

  • The new cap is ≥1,000 events long.
  • The C+21 / D-extension absorbers cannot fold it within their current cap (32 pairs).
  • Empirical jitter sampling (≥3 canary colds) confirms structural shape divergence, not just SID identity drift.

If all three hold, γ is justified. Estimated ~600 LOC across 4-5 sessions.
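The three criteria can be written as a single predicate. This is a sketch only: the `CapCandidate` fields are hypothetical names, and "cannot fold it within the cap" is interpreted here as needing more than 32 pairs.

```python
from dataclasses import dataclass

MIN_CAP_EVENTS = 1_000     # criterion 1: cap is at least this many events long
ABSORBER_PAIR_CAP = 32     # criterion 2: current C+21 / D-extension pair cap
MIN_CANARY_COLDS = 3       # criterion 3: minimum jitter samples


@dataclass
class CapCandidate:
    length_events: int
    pairs_to_fold: int                 # pairs an absorber would need to fold
    canary_colds_sampled: int
    structural_shape_divergence: bool  # True if shapes differ, not just SID identity


def gamma_justified(cap: CapCandidate) -> bool:
    """All three trigger criteria must hold before revisiting γ."""
    long_enough = cap.length_events >= MIN_CAP_EVENTS
    absorber_exhausted = cap.pairs_to_fold > ABSORBER_PAIR_CAP
    shape_confirmed = (
        cap.canary_colds_sampled >= MIN_CANARY_COLDS
        and cap.structural_shape_divergence
    )
    return long_enough and absorber_exhausted and shape_confirmed
```

For example, `gamma_justified(CapCandidate(1_500, 40, 3, True))` holds, while dropping the pair count to 20 leaves the cap within absorber reach and the predicate false.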

What this recommendation is NOT

  • It is NOT "no scheduling work was useful." Stages 1-4 + D-extension produced the matched-prefix advance from 104,606 → 105,046 (+440).
  • It is NOT "the absorbers are perfect forever." They are explicit band-aids in the spirit of reading-error #23, annotated in schema-v1.md v1.5.
  • It is NOT "ours and canary are bit-aligned in contention regions." They are measurably aligned (matched-prefix) but not structurally aligned (the underlying guest events still differ; the absorber folds the difference).

Multi-session budget if we proceed (γ scenario only)

Sessions estimated 4-5. NOT scheduled now.

| stage | LOC est | session |
|---|---|---|
| γ-Stage 1: extend canary trace to wake/park/yield | ~150 | 1 |
| γ-Stage 2: extend manifest builder | ~80 | 0.5 |
| γ-Stage 3: generalized replayer in ours | ~250 | 2 |
| γ-Stage 4: diff-tool integration | ~50 | 0.5 |
| γ-Stage 5: validation + sister budgets | n/a | 1 |
| total | ~530 | ~5 |
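As a quick cross-check of the totals, the per-stage estimates can be summed directly. The stage labels and numbers are copied from the budget above; the tuple layout is just an illustrative choice.

```python
# (stage, LOC estimate or None for "n/a", sessions) as listed in the budget.
stages = [
    ("γ-Stage 1", 150, 1.0),
    ("γ-Stage 2", 80, 0.5),
    ("γ-Stage 3", 250, 2.0),
    ("γ-Stage 4", 50, 0.5),
    ("γ-Stage 5", None, 1.0),
]

# Stage 5 has no LOC estimate, so it only contributes session time.
total_loc = sum(loc for _, loc, _ in stages if loc is not None)
total_sessions = sum(sessions for _, _, sessions in stages)
```

The stages sum to 530 LOC over 5 sessions, slightly under the ~600 LOC figure quoted in the prose.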

Acceptance for THIS session (planning-only)

  • Planning artifacts in audit-runs/phase-c23-scheduler-determinism-plan/.
  • Engine sources UNCHANGED (verified by file listing — only documentation + 1 python probe written).
  • Diff tool UNCHANGED.
  • Memory entry to be written next.
  • Recommendation justified against C+21 band-aid + breadth of contention regions + multi-session budget.