handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,154 @@
# Recommendation: Phase C+23

## Top-line: STAY WITH THE BAND-AID

After reading the source of both engines, characterizing the jitter
shape of 4 archived canary cold runs, and reviewing Phase D's H'/H broad
outcomes, the recommended approach is **(ζ) stay with the band-aid**.

The 104,607 cap that originally motivated this track is already
unblocked at the diff-tool layer (Phase D D-extension absorber,
2026-05-18). The next divergence, at idx 105,046, is
`VdInitializeEngines.return_value`: a VD-subsystem engine bug, NOT
a scheduling-determinism recurrence. The cost-benefit of pursuing
γ/β/α is no longer compelling because the immediate symptom is
resolved and no structural follow-on cap has appeared.

## Rationale

### 1. The original target is already unblocked.

| metric | pre-C+20 (C+19) | post-C+21 | post-Phase-D D-extension | now |
|---|---|---|---|---|
| Main matched-prefix | 104,606 | 104,607 | **105,046** | 105,046 |
| Sister chains | 11/32/3/41/16 | 11/32/3/41/16 | 11/32/4/41/16 | unchanged |
| Cap class at head | (B) contention | (A) state-mutation | (engine) VD | (engine) VD |

The matched-prefix advanced **+440** since C+19 (104,606 → 105,046)
through diff-tool work that did NOT touch the engines. The cap class at
the head is no longer scheduling.

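For readers unfamiliar with the metric, here is a minimal sketch of how a matched-prefix length can be computed over two event streams. The `matched_prefix` function and the toy `(call, return_value)` events are hypothetical illustrations; the actual diff tool also applies absorbers, which this sketch omits.

```python
from itertools import takewhile


def matched_prefix(canary, ours):
    """Count leading events that are identical in both streams.

    The result is also the index of the first divergence (or the length
    of the shorter stream if no divergence is found).
    """
    pairs = zip(canary, ours)
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], pairs))


# Hypothetical toy streams, NOT real trace data. The last event mirrors the
# shape of the head divergence described in the note (canary=1, ours=0).
canary = [("NtCreateEvent", 0), ("KeSetEvent", 1), ("VdInitializeEngines", 1)]
ours = [("NtCreateEvent", 0), ("KeSetEvent", 1), ("VdInitializeEngines", 0)]

print(matched_prefix(canary, ours))  # 2
print(105_046 - 104_606)             # 440, the advance quoted in the table
```

In this toy case the first two events match and the third differs, so the matched prefix is 2 and the head divergence sits at index 2.
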
### 2. Phase D Stages 1-4 already built the structural infrastructure.

Phase D Stage 1 (canary contention emitter), Stage 2 (manifest builder),
Stage 3 (ours `OrderMode::ContentionReplay` + manifest loader), and
Stage 4 (diff-tool engine-local kinds) have ALL LANDED. The engine code
is in tree. What's missing is *coverage of the right contention events*:
the 104,607 divergence was upstream of canary's first
`contention.observed=true` emit (idx 104,664), so the manifest could
not target the right call site.

This means that if we pursue γ (broaden replay to more event classes),
the entry cost is not "start from scratch" but "extend an existing
manifest layer." However, the LOC budget for γ is still ~600 across
both engines, and there is **no proven future cap** that this would
unblock.

### 3. The empirical jitter range is small and fully absorbable.

From `jitter-profile.md`: 4 canary cold samples show 3 distinct
shapes around the contention window. The C+21 absorber + Phase D
D-extension already canonicalize ALL 3 shapes to the same matched
form. Even N=5 or N=10 fresh canary colds would land in one of these
3 shapes (likely with the same absorber outcome).

The SID core (`a25a16a4f6f547aa`, `2a70efeeed4f4fb6`,
`72a4170012353517`) is consistent across cold runs (counts within
±20%), and the shared-global SID recipe (C+18) recomputes them
deterministically. The transient "top-2" SIDs (which change per cold
run) all flow through the shared-global absorber.

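One way to make the "±20% counts" check concrete is sketched below. It assumes the tolerance is measured against the per-SID mean across cold runs, which is an assumption and not necessarily how `jitter-profile.md` defines it. All counts are made-up placeholders, and `transient-example` is a hypothetical stand-in for a per-cold "top-2" SID.

```python
def within_tolerance(counts, tol=0.20):
    """True if every count lies within +/- tol of the mean across runs.

    Assumption: tolerance is relative to the mean; the note does not say.
    """
    mean = sum(counts) / len(counts)
    return all(abs(c - mean) <= tol * mean for c in counts)


# Hypothetical per-cold-run hit counts (4 runs each). NOT measured data.
hypothetical_counts = {
    "a25a16a4f6f547aa": [100, 110, 95, 105],
    "2a70efeeed4f4fb6": [40, 44, 38, 42],
    "72a4170012353517": [10, 12, 9, 11],
    "transient-example": [50, 0, 3, 0],
}

stable = {sid: within_tolerance(c) for sid, c in hypothetical_counts.items()}
for sid, ok in stable.items():
    print(sid, "core-stable" if ok else "transient")
```

With these placeholder numbers the three core SIDs pass the check and the transient one fails, which is the split the paragraph above describes.
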
### 4. Canary cannot be made deterministic without invalidating it.

The host-thread-per-XThread model is what makes canary the *oracle*.
Replacing it (α / β) would require:

- Reworking ~2000-3000 LOC of canary base+kernel.
- Re-validating against the broader canary test corpus (other games).
- Accepting a real risk of breaking compatibility for games other than
  Sylpheed.

Approach γ (record-and-replay) avoids touching canary's scheduling
philosophy but requires ours to consume a multi-million-entry trace,
with engineering and runtime cost that should be matched to a *proven*
future scheduling cap.

### 5. The Phase B image hash and ours digest are stable.

`image_loaded_sha256 ea8d160e…` is UNCHANGED. Ours default digest is
stable across 3 cold runs. There is no signal of latent divergence in
the pre-Phase-A surfaces that would benefit from scheduling alignment.

## What to keep

1. **Phase D Stages 1-4 infrastructure** stays in tree. The cvar
   `kernel_emit_contention` defaults to `false` (off);
   `XENIA_CONTENTION_MANIFEST_PATH` is opt-in. Future phases can use them.
2. **All absorbers** (C+18, C+21, D-extension) stay; they are correct
   and narrow.
3. **The Stage 0 `OrderMode::ScanQuantum`** stays as a debug knob,
   documented as null-result.

## What to defer

1. Approach γ (broader scheduling-trace replay): defer until a future
   cap appears that is demonstrably scheduling-related.
2. Approach β / α (deterministic preemption / cooperative canary):
   defer indefinitely.

## What to do next

The next phase is **C+24** (or whatever the natural next number is),
targeting the head divergence at idx 105,046:
`VdInitializeEngines.return_value` (canary=1 ours=0). This is a regular
engine bug investigation, ~5-50 LOC.

## Fallback: γ trigger criteria

If a future phase finds a NEW scheduling-determinism cap (defined as:
two consecutive divergences whose root cause is contention/wakeup-ordering
across ≥2 guest threads, NOT a guest-code bug or kernel
emit-completeness gap), then revisit γ. The criteria:

- The new cap is ≥1,000 events long.
- The C+21 / D-extension absorbers cannot fold it within their
  current cap (32 pairs).
- Empirical jitter sampling (≥3 canary colds) confirms structural
  shape divergence, not just SID identity drift.

If all three hold, γ is justified. Estimated ~600 LOC across 4-5
sessions.

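The three criteria above can be written as a single predicate, sketched here so the trigger is unambiguous when a future cap is triaged. The `CapObservation` type, its field names, and `gamma_justified` are hypothetical. Reading "cannot fold it within their current cap" as "would need more than 32 pairs" is an interpretation, not something the absorber code is known to expose.

```python
from dataclasses import dataclass

ABSORBER_PAIR_CAP = 32    # current absorber cap quoted in the criteria
MIN_CAP_EVENTS = 1_000    # minimum cap length quoted in the criteria
MIN_CANARY_COLDS = 3      # minimum jitter samples quoted in the criteria


@dataclass
class CapObservation:
    """Hypothetical summary of a newly found scheduling-determinism cap."""
    cap_length_events: int             # length of the new cap, in events
    absorber_pairs_needed: int         # pairs an absorber would need to fold it
    canary_cold_samples: int           # canary cold runs sampled for jitter
    structural_shape_divergence: bool  # shapes differ, not only SID identities


def gamma_justified(obs: CapObservation) -> bool:
    """Return True only if all three trigger criteria hold."""
    long_enough = obs.cap_length_events >= MIN_CAP_EVENTS
    not_foldable = obs.absorber_pairs_needed > ABSORBER_PAIR_CAP
    confirmed = (
        obs.canary_cold_samples >= MIN_CANARY_COLDS
        and obs.structural_shape_divergence
    )
    return long_enough and not_foldable and confirmed


# Hypothetical observations, for illustration only.
print(gamma_justified(CapObservation(1_500, 40, 3, True)))  # True
print(gamma_justified(CapObservation(1_500, 20, 3, True)))  # False: absorbers can fold it
```

Note that the predicate deliberately says nothing about the precondition (two consecutive divergences rooted in contention/wakeup ordering across ≥2 guest threads); that classification still has to be made by hand before the criteria are evaluated.
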
## What this recommendation is NOT

- It is NOT "no scheduling work was useful." Stages 1-4 + D-extension
  produced the matched-prefix advance from 104,606 → 105,046 (+440).
- It is NOT "the absorbers are perfect forever." They are explicit
  band-aids in the spirit of reading-error #23, annotated in
  schema-v1.md v1.5.
- It is NOT "ours and canary are bit-aligned in contention regions."
  They are *measurably* aligned (matched-prefix) but not *structurally*
  aligned (the underlying guest events still differ; the absorber
  folds the difference).

## Multi-session budget if we proceed (γ scenario only)

Estimated at 4-5 sessions. NOT scheduled now.

| stage | LOC | est. sessions |
|---|---|---|
| γ-Stage 1: extend canary trace to wake/park/yield | ~150 | 1 |
| γ-Stage 2: extend manifest builder | ~80 | 0.5 |
| γ-Stage 3: generalized replayer in ours | ~250 | 2 |
| γ-Stage 4: diff-tool integration | ~50 | 0.5 |
| γ-Stage 5: validation + sister budgets | n/a | 1 |
| **total** | **~530** | **~5** |

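As a quick arithmetic check on the totals row, the per-stage figures from the table above can be summed directly. This is only a sanity check on the table; the `stages` list is a transcription of it, with the "n/a" LOC entry represented as `None`.

```python
# (stage, estimated LOC or None for "n/a", estimated sessions), from the table.
stages = [
    ("γ-Stage 1: extend canary trace to wake/park/yield", 150, 1.0),
    ("γ-Stage 2: extend manifest builder", 80, 0.5),
    ("γ-Stage 3: generalized replayer in ours", 250, 2.0),
    ("γ-Stage 4: diff-tool integration", 50, 0.5),
    ("γ-Stage 5: validation + sister budgets", None, 1.0),
]

# Stages with no LOC estimate are skipped in the LOC total.
total_loc = sum(loc for _, loc, _ in stages if loc is not None)
total_sessions = sum(sessions for _, _, sessions in stages)

print(total_loc)       # 530
print(total_sessions)  # 5.0
```

The stage breakdown sums to ~530 LOC and 5 sessions, slightly under the rounder "~600 LOC across 4-5 sessions" figure quoted elsewhere in this note.
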
## Acceptance for THIS session (planning-only)

- [x] Planning artifacts in `audit-runs/phase-c23-scheduler-determinism-plan/`.
- [x] Engine sources UNCHANGED (verified by file listing: only
  documentation + 1 python probe written).
- [x] Diff tool UNCHANGED.
- [x] Memory entry to be written next.
- [x] Recommendation justified against C+21 band-aid + breadth of
  contention regions + multi-session budget.