
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-05 07:19:08 +02:00

5.0 KiB


Phase C+22 — ESCALATION (2026-05-18)

Decision: ESCALATE

C+22's target divergence sits at canary tid=6→1 idx=104,607, where canary emits import.call RtlEnterCriticalSection (an extra nested Enter) and ours emits import.call RtlLeaveCriticalSection. It is classified as (A), a downstream effect of scheduler determinism plus post-wait state mutation, the same class that C+20 escalated. C+21's wait.begin floating-absorb correctly removed the visible wait.begin jitter event (verified: floating_wait (c/o) = 2/0 engaged on this chain in the fresh c22 sample). However, canary's guest code takes a post-wait branch because shared state was mutated during the wait, and that branch cannot be papered over at the diff layer without committing reading-error #23 (matching genuinely different guest behavior).
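For readers unfamiliar with the "matched prefix" figure, the following is a minimal sketch of how a first-divergence index between two event streams can be computed. It is not the logic of diff_events.py, which also applies absorbers; the event shape (dicts with hypothetical `kind` and `name` keys) and the function name are assumptions made for illustration.

```python
from itertools import zip_longest

def matched_prefix_len(canary_events, ours_events,
                       key=lambda e: (e["kind"], e["name"])):
    """Return how many leading events agree between the two streams.

    `kind` and `name` are hypothetical field names; a real trace may
    use a different schema.
    """
    sentinel = object()
    n = 0
    for c, o in zip_longest(canary_events, ours_events, fillvalue=sentinel):
        # Stop when either stream ends or the comparable parts differ.
        if c is sentinel or o is sentinel or key(c) != key(o):
            break
        n += 1
    return n

# Toy streams mirroring the Enter-vs-Leave shape of the divergence.
canary = [
    {"kind": "import.call", "name": "RtlEnterCriticalSection"},
    {"kind": "import.call", "name": "RtlEnterCriticalSection"},
]
ours = [
    {"kind": "import.call", "name": "RtlEnterCriticalSection"},
    {"kind": "import.call", "name": "RtlLeaveCriticalSection"},
]
prefix = matched_prefix_len(canary, ours)
```

In this toy input the streams agree on one event and diverge at index 1, which is the same shape as the idx=104,607 divergence at a much smaller scale.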

What was done

Backed up both canary cache locations.
Wiped both canary caches + ours's cache.
Cold-ran ours (50M instructions, against the .iso).
Cold-ran canary (90s timeout, against the .iso).
Truncated canary log keeping all tids (first 250k events per tid) so the C+18/C+21 cross-tid shared-global heuristic has the multi-tid evidence it needs.
Ran diff_events.py with full multi-tid map.
Verified main matched prefix = 104,607 (matches C+21).
Verified sister chains unchanged: 11/32/3/41/16.
Verified C+21 floating-absorb engaged: floating_create (c/o) = 1/0, floating_wait (c/o) = 2/0 on main chain.
Restored canary caches.
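The per-tid truncation step can be sketched as follows. This is an illustrative version only: it assumes the canary log is JSONL with a `tid` field on every event, which this note does not confirm, and `truncate_per_tid` is a hypothetical helper name.

```python
import json
from collections import Counter

def truncate_per_tid(lines, limit=250_000):
    """Yield JSONL lines, keeping at most `limit` events for each tid.

    Assumes one JSON object per line with a `tid` field (an assumption
    about the log schema).
    """
    seen = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tid = json.loads(line).get("tid")
        if seen[tid] < limit:
            seen[tid] += 1
            yield line

# Tiny demonstration with a limit of 2 events per tid.
sample = [
    '{"tid": 1, "i": 0}',
    '{"tid": 2, "i": 0}',
    '{"tid": 1, "i": 1}',
    '{"tid": 1, "i": 2}',
]
kept = list(truncate_per_tid(sample, limit=2))
```

Capping per tid, rather than truncating the file at a global event count, keeps low-volume threads in the sample, which is what a cross-tid heuristic needs.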

Discovered along the way:

Reading-error class #34 (NEW): cold-run determinism depends on input path form. The .xex and .iso paths produce different boot trajectories. All cold-vs-cold runs MUST use the .iso path. Documented in investigation.md §"Methodology note".
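A repeat-run determinism check of this kind can be sketched by hashing each run with timestamps removed. The field name `ts` and the JSONL layout are assumptions, and because the sketch re-serializes each event it tests structural equality rather than strict byte identity.

```python
import hashlib
import json

def digest_without_timestamps(lines, ts_fields=("ts",)):
    """Return a SHA-256 hex digest of a JSONL trace, ignoring timestamps.

    `ts_fields` names the keys to drop; "ts" is a hypothetical field name.
    """
    h = hashlib.sha256()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        for field in ts_fields:
            event.pop(field, None)
        # sort_keys gives a canonical serialization for hashing.
        h.update(json.dumps(event, sort_keys=True).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()

run_a = ['{"ts": 1, "tid": 1, "name": "RtlEnterCriticalSection"}']
run_b = ['{"ts": 99, "tid": 1, "name": "RtlEnterCriticalSection"}']
run_c = ['{"ts": 1, "tid": 1, "name": "RtlLeaveCriticalSection"}']
same = digest_without_timestamps(run_a) == digest_without_timestamps(run_b)
differs = digest_without_timestamps(run_a) != digest_without_timestamps(run_c)
```

Comparing two .iso cold runs this way, and then an .iso run against an .xex run, would be one way to reproduce the path-form dependence described above.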

What was NOT done

No engine source changed (per ESCALATE classification).
No diff-tool changes (the existing C+18/C+21 absorbers already work correctly for this region; over-absorbing the post-wait Enter/Leave block would cross into matching genuinely different guest behavior).
A Phase A emitter additive for the cs_ptr arg was considered but deferred: it is not needed to establish the escalation decision and would only refine the cause-of-branch story, which the C+20 analysis already establishes.
D-NEW-2 NOT touched (explicitly out of scope per prompt).

Why we can't fix this in C+22's authorized scope

The C+22 prompt authorizes modifications to:

crates/xenia-kernel/src/exports.rs (rtl_enter_critical_section, rtl_leave_critical_section, related CS state)
crates/xenia-kernel/src/state.rs if CS state model needs adjustment
tools/diff-events/diff_events.py if a new race pattern is identified
Tests, Phase A emitter additive if needed, documentation

But explicitly forbids:

Refactor scheduler / thread-model
Refactor CS primitives broadly
Touch GPU/audio/HID
Land deferred items
Fix D-NEW-2 in this session

The actual root cause is scheduler determinism — ours's single-stepping scheduler runs tid=1 monolithically through this region, denying other tids the opportunity to claim the shared CS that's contended in canary. The fix requires either:

1. Reworking ours's scheduler to interleave threads at finer granularity (multi-thousand-LOC refactor; NOT AUTHORIZED).
2. Recording canary's scheduling trace and replaying it in ours (new subsystem; NOT AUTHORIZED).
3. Adding wait.begin emission to ours's RtlEnter park path AND re-architecting the CS contention model so that, when ours DOES contend, it produces canary-symmetric state mutations. This is only a partial fix and would not help in this case, because ours takes the fast path here and never parks.
4. Modifying Sylpheed guest code (out of scope, and it defeats the parity goal).

None of (1)-(4) fit C+22's authorized scope. Escalation is the correct decision.
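To illustrate why scheduling granularity alone can change whether a critical section is ever contended, here is a toy round-robin model. It is not ours's scheduler or canary's; the op names, the `quantum` parameter, and the single non-recursive lock are all simplifying assumptions.

```python
def run(programs, quantum):
    """Toy round-robin scheduler over one shared critical section.

    Each program is a list of ops: "enter", "work", or "leave".
    Returns the number of distinct Enter attempts that found the
    section held by another thread. The model is deliberately
    simplified: one lock, no recursion count.
    """
    pcs = [0] * len(programs)
    owner = None
    contended = set()
    while any(pc < len(p) for pc, p in zip(pcs, programs)):
        for tid, prog in enumerate(programs):
            for _ in range(quantum):
                if pcs[tid] >= len(prog):
                    break
                op = prog[pcs[tid]]
                if op == "enter":
                    if owner is not None and owner != tid:
                        contended.add((tid, pcs[tid]))
                        break  # park: give up the rest of this slice
                    owner = tid
                elif op == "leave":
                    owner = None
                pcs[tid] += 1
    return len(contended)

prog = ["enter", "work", "work", "leave"]
monolithic = run([prog, prog], quantum=4)   # each thread runs start to finish
interleaved = run([prog, prog], quantum=1)  # threads alternate op by op
```

With a quantum that covers the whole program, the first thread leaves the section before the second ever tries to enter, so no contention is observed. With a quantum of one op, the second thread's Enter finds the section held. The guest code is identical in both runs; only the interleaving differs.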

Recommended next-target sequence

C+23 = D-NEW-2 (independent ε-class fix on a different sister chain). KeWaitForSingleObject timeout_ns sign/scale asymmetry. Out of scope for C+22 per prompt; in scope for C+23.
C+24 = D-NEW-3 (canary tid=14→9 idx=41: XAudioGetVoiceCategoryVolumeChangeMask vs ours's RtlEnterCriticalSection). Likely a missing/stubbed XAudio export.
Parallel scheduler-determinism track: a dedicated multi-session refactor to attack the C+20/C+22 family at the root. Scope per C+20: per-CS-pointer "expected contention" inference from canary logs + scheduler driver + diff-tool "scheduling-trace replay" event class.
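As background for the C+23 item: this note does not spell out the exact sign/scale asymmetry, so the sketch below shows only the general NT kernel convention for wait timeouts, which are expressed in 100 ns units, with negative values meaning a relative interval and positive values an absolute time. The function name and the `now_100ns` parameter are hypothetical and are not taken from exports.rs.

```python
def nt_timeout_to_relative_ns(raw_100ns, now_100ns):
    """Convert an NT-style timeout to a non-negative relative wait in ns.

    raw_100ns: timeout in 100 ns ticks; negative means relative,
               positive means an absolute deadline.
    now_100ns: current time in the same absolute tick base
               (hypothetical parameter for this sketch).
    """
    if raw_100ns < 0:
        # Relative interval: flip the sign, then scale ticks to ns.
        return -raw_100ns * 100
    # Absolute deadline: remaining time, clamped at zero if already past.
    return max(raw_100ns - now_100ns, 0) * 100

one_ms = nt_timeout_to_relative_ns(-10_000, now_100ns=0)
remaining = nt_timeout_to_relative_ns(500, now_100ns=200)
expired = nt_timeout_to_relative_ns(100, now_100ns=200)
```

A logging path that skips either the sign flip or the factor of 100 would report a timeout differing in sign or scale from one that applies both, which is one plausible way such an asymmetry could arise.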

Confidence

Classification confidence: HIGH (95%+). Verified by multi-sample canary cold runs showing structurally identical EE-LL nested pattern across all 4 samples; C+21 absorber engaged exactly as predicted; mechanism (post-wait state-mutation branch) consistent with C+20's analysis.
Escalation correctness: HIGH (95%+). No authorized modification within C+22's scope can fix this; reading-error #23 explicitly applies if we over-absorb in the diff tool.
Reading-error #34 discovery: HIGH (verified by repeat experiment — 2 ours-cold runs against .iso byte-identical modulo timestamps; identical to C+19 archive).
