xenia-rs/audit-runs/phase-d-stage3/result.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase D Stage 3 (+ Stage 4) — Contention Replay: Result

Date: 2026-05-18

Outcome: LANDED. Stage 3 ours-side contention-replay infrastructure + Stage 4 diff-tool engine-local kind. Default-mode digest byte-identical to pre-Stage-3 baseline. Replay-mode digest stable × 3. Sister chains preserved. Main matched-prefix at 104,607, unchanged from baseline. Stage 3 lands as infrastructure; it does not unblock the cap on its own because the 104,607 divergence is upstream of any contention.observed event in the canary trace.

Engine source change

| file | LOC | purpose |
|---|---|---|
| xenia-rs/crates/xenia-kernel/src/contention_manifest.rs | new file, 280 | manifest loader + (tid, idx) → Entry HashMap behind Mutex + consume_at_peek(tid, peek_idx) with per-tid emit-count translation back to canary's idx space + 12 unit tests |
| xenia-rs/crates/xenia-kernel/src/lib.rs | +1 | pub mod contention_manifest; |
| xenia-rs/crates/xenia-kernel/src/event_log.rs | +35 | object_type::CRITICAL_SECTION = 0x0C enum value + emit_contention_observed(tid, guest_cycle, cs_ptr, contended) helper (mirrors canary's EmitContentionObserved) |
| xenia-rs/crates/xenia-kernel/src/state.rs | +20 | KernelState.contention_manifest: Option<Arc<ContentionManifest>> field + install_contention_manifest() setter |
| xenia-rs/crates/xenia-kernel/src/exports.rs | +75 | rtl_enter_critical_section body consults the manifest via consume_at_peek, emits contention.observed on hit, falls through to natural code (conservative deadlock-safe mode); aggressive mode behind XENIA_CONTENTION_AGGRESSIVE=1 env var (no-go per probe; see below) |
| xenia-rs/crates/xenia-app/src/main.rs | +30 | XENIA_CONTENTION_MANIFEST_PATH env-var loader + tracing log on success/failure |
| Engine total | ~440 | additive across 6 files |

| diff-tool file | LOC | purpose (Stage 4 bundled) |
|---|---|---|
| xenia-rs/tools/diff-events/diff_events.py | +30 | new ENGINE_LOCAL_KINDS = {"contention.observed"} set + skip branches in diff_one_tid so per-tid pointer advances past these events on EITHER side without comparison |
| xenia-rs/tools/diff-events/test_diff_events.py | +65 | 2 new tests covering both-sides and one-sided engine-local skip |
| xenia-rs/tools/diff-events/build_contention_manifest.py | +45 | added --tid-map CANARY=OURS,... translation so the manifest stores ours-side tids (so Stage-3 lookup keys on the consumer's native current_tid) |
| xenia-rs/tools/diff-events/test_build_manifest.py | +50 | 2 new tests covering tid-map translation + unmapped-tid drop |
| Tooling total | ~190 | additive |
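The --tid-map translation and unmapped-tid drop described for build_contention_manifest.py can be illustrated with a minimal sketch. This is not the tool's actual code: the function names `parse_tid_map` and `translate_entries`, the entry field names (`tid`, `idx`, `cs_ptr`), and the unmapped tid 99 are assumptions made for illustration.

```python
def parse_tid_map(spec):
    """Parse a 'CANARY=OURS,...' string into {canary_tid: ours_tid}."""
    mapping = {}
    for pair in spec.split(","):
        canary, ours = pair.split("=")
        mapping[int(canary)] = int(ours)
    return mapping


def translate_entries(entries, tid_map):
    """Rewrite each entry's tid to the ours-side tid; drop unmapped tids."""
    translated = []
    for entry in entries:
        ours_tid = tid_map.get(entry["tid"])
        if ours_tid is None:
            continue  # no ours-side consumer for this canary tid
        translated.append({**entry, "tid": ours_tid})
    return translated


tid_map = parse_tid_map("6=1,7=2,4=11,12=7,14=9,15=10")
entries = [
    {"tid": 6, "idx": 102788, "cs_ptr": "0xbc65c890"},
    {"tid": 99, "idx": 5, "cs_ptr": "0xbc65c890"},  # hypothetical unmapped tid
]
translated = translate_entries(entries, tid_map)
```

Storing ours-side tids at build time keeps the engine-side lookup free of any mapping step: the consumer can key directly on its own current tid.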

Tests

  • cargo test -p xenia-kernel --lib: 216 PASS (was 213, +3 new tests in contention_manifest: consume_at_peek_translates_idx, consume_at_peek_miss_does_not_bump_emit_count, consume_at_peek_per_tid_independent — plus 9 earlier ones covering load, consume, peek, version-check, missing-fields-error, cs_ptr-parse).
  • cargo test -p xenia-cpu --lib: 300 PASS (unchanged from Stage 0).
  • python3 test_build_manifest.py: 11 PASS (was 9, +2 for tid-map).
  • python3 test_diff_events.py: all pre-existing PASS + 2 new (test_engine_local_contention_observed_skipped_both_sides, test_engine_local_one_sided_contention_observed).
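The engine-local skip that the two new diff tests exercise can be sketched as follows. This is a simplified stand-in, not the real diff_one_tid: the function name `matched_prefix`, the `kind`/`name` field names, and whole-dict equality as the comparison rule are assumptions for illustration.

```python
ENGINE_LOCAL_KINDS = {"contention.observed"}


def matched_prefix(canary_events, ours_events):
    """Count comparable events that match before the first divergence.

    Engine-local events are stepped over on either side without comparison.
    """
    i = j = matched = 0
    while True:
        # Advance each per-tid pointer past engine-local events.
        while i < len(canary_events) and canary_events[i]["kind"] in ENGINE_LOCAL_KINDS:
            i += 1
        while j < len(ours_events) and ours_events[j]["kind"] in ENGINE_LOCAL_KINDS:
            j += 1
        if i >= len(canary_events) or j >= len(ours_events):
            return matched
        if canary_events[i] != ours_events[j]:
            return matched
        matched += 1
        i += 1
        j += 1


enter = {"kind": "import.call", "name": "RtlEnter"}
leave = {"kind": "import.call", "name": "RtlLeave"}
ret = {"kind": "kernel.return"}
contention = {"kind": "contention.observed"}

one_sided = matched_prefix([enter, contention, ret], [enter, ret])
both_sides = matched_prefix([enter, contention, ret], [enter, contention, ret])
diverged = matched_prefix([enter, ret, enter], [enter, ret, leave])
```

Because skipped events never count toward the matched total, a trace with replayed contention events and one without them report the same prefix as long as the comparable events agree.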

Cold-run validation

Gate 1: default-mode digest byte-identical to Stage 0 baseline

Without XENIA_CONTENTION_MANIFEST_PATH set:

$ XENIA_CACHE_WIPE=1 xenia-rs-stage3 exec --phase-a-event-log ours.jsonl \
    -n 50000000 --quiet Sylpheed.iso
$ det_digest.py ours.jsonl
det_fields_md5 = ba5b5e0795ccb32966a49d3b2917a30d   <-- same as Stage 0 baseline
total_events  = 121569

✓ Stage 3's manifest-check fast-path costs ≈ one Option::as_ref().and_then(…) per rtl_enter_critical_section call. Default-mode behavior preserved bit-for-bit. Phase B image_loaded_sha256 = ea8d160e… UNCHANGED.
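For readers unfamiliar with this kind of gate, here is a minimal sketch of how a deterministic-fields digest can be computed. det_digest.py itself is not shown in this note, so everything here is an assumption: the function name, the canonical-JSON encoding, and the choice of `host_ns` as the only field excluded from the hash.

```python
import hashlib
import json

# Assumed: host wall-clock timestamps are the only non-deterministic field.
NONDETERMINISTIC_FIELDS = {"host_ns"}


def det_fields_md5(events):
    """MD5 over the deterministic fields of each event, in order."""
    digest = hashlib.md5()
    for event in events:
        stable = {k: v for k, v in event.items() if k not in NONDETERMINISTIC_FIELDS}
        line = json.dumps(stable, sort_keys=True, separators=(",", ":"))
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


run_a = [
    {"tid": 1, "idx": 0, "kind": "import.call", "host_ns": 100},
    {"tid": 1, "idx": 1, "kind": "kernel.return", "host_ns": 250},
]
# Same guest-visible events, different host timing.
run_b = [dict(event, host_ns=event["host_ns"] + 7) for event in run_a]
# One extra emitted event changes the byte sequence being hashed.
run_c = run_a + [{"tid": 1, "idx": 2, "kind": "contention.observed", "host_ns": 300}]
```

Under these assumptions, two cold runs that differ only in host timing hash identically, while any extra emitted event (as in replay mode) produces a different digest.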

Gate 2: replay-mode digest stable × 3

With manifest installed (807 entries → 284 entries after tid-map filter):

| run | digest | total_events |
|---|---|---|
| 1 | 1d7c6b4592d024405cd9d86eb79f5307 | 121571 |
| 2 | 1d7c6b4592d024405cd9d86eb79f5307 | 121571 |
| 3 | 1d7c6b4592d024405cd9d86eb79f5307 | 121571 |

✓ Bit-stable × 3 under replay. New digest is expected: the 2 contention.observed emits shift per-tid idx values by +2 starting at idx 102,788, producing a provably different (but deterministic) byte sequence.

Gate 3: replay-mode matched-prefix vs canary cvar-ON

$ python3 diff_events.py \
    --canary canary-cvaron-trunc.jsonl \
    --ours stage3-replay.jsonl \
    --tid-map 6=1,7=2,4=11,12=7,14=9,15=10
| canary_tid | ours_tid | matched | first_divergence_at |
|---|---|---|---|
| 4 | 11 | 11    | — |
| 6 | 1  | 104607| 104607 |
| 7 | 2  | 32    | — |
| 12| 7  | 4     | 4 |
| 14| 9  | 41    | 41 |
| 15| 10 | 16    | — |

✓ Main matched-prefix 104,607 — same as pre-Phase-D baseline. Sister chains preserved (11/32/4/41/16, identical to C+22 baseline). Stage 3 does not break the prefix; nor does it advance it.

Why the cap isn't unblocked

The 104,607 divergence is at canary's tid=6 idx 104,610 (nested RtlEnter) vs ours's tid=1 idx 104,608 (RtlLeave). Both engines completed the outer RtlEnter at idx 104,608/104,606 with return_value=0 and no contention.observed event. The first canary contention is at idx 104,664 — AFTER the cap divergence, on a DEEPER RtlEnter call further into the same control-flow branch ours never enters.

In other words: the cap is upstream of any contention. Replaying canary's contention.observed events at 102,788 and 104,664 happens either too early (102,788 — way before the cap) or too late (104,664 — ours diverged at 104,607 and never reaches that ordinal in the same logical position). The manifest mechanism is correctly aligned to canary's contention events; the cap simply isn't a contention event.

The 104,610 nested RtlEnter in canary vs the RtlLeave in ours is guest-code-driven: same PPC code, same outer-Enter return value (0), different next-call decision. Likely cause: some other guest memory state (not the CS struct itself) has diverged between canary and ours upstream of idx 104,610, and the guest's branch decision reads that state. That's a state-divergence root cause, NOT scheduling-determinism.

Aggressive mode probe (XENIA_CONTENTION_AGGRESSIVE=1)

Tested behind an env-var gate to confirm: forcing the park unconditionally when the manifest hits (even when CS is free in guest memory) regresses the trace catastrophically.

  • main matched-prefix: 102,789 (-1,818 vs baseline)
  • ours_total events: 1,019,208 (vs 121,569 in default mode; ballooned roughly 8×)
  • Sister chains: 4 of 5 entirely absent (tid=11/7/9/10 produced zero events)
  • Cause: tid=1 force-parks on a free CS at idx 102,788. No peer touches the CS during the wait. Scheduler::unblock_on_deadlock eventually recovers with owner=0, but downstream guest state is now corrupted (the RtlEnter returned with owner unset). The other tids never reach the spawn point because tid=1 was supposed to drive their setup.

Aggressive mode is gated off by default (XENIA_CONTENTION_AGGRESSIVE=1 explicit opt-in). Conservative mode is the landed default.

What the manifest did observe

Per the run log (debug level):

manifest cs_ptr cross-engine divergence at tid=1 idx=102788: manifest 0xbc65c890, ours 0x40544890 (allocator ε)
manifest hit at tid=1 idx=102788 cs=0x40544890 but CS is free/self-owned (owner=0); replay skipped (state-divergence, not schedule-divergence)
manifest cs_ptr cross-engine divergence at tid=1 idx=104664: manifest 0xbc65c890, ours 0x828f39d0 (allocator ε)
manifest hit at tid=1 idx=104664 cs=0x828f39d0 but CS is free/self-owned (owner=0); replay skipped (state-divergence, not schedule-divergence)

Two hits fired, both at the right ordinal. Both fell into the conservative "skip" branch because:

  1. The cs_ptr canary recorded (0xbc65c890) and the cs_ptr ours sees (0x40544890 / 0x828f39d0) differ. This is the AUDIT-043 allocator ε divergence — we don't gate on it (we trust the (tid, idx) alignment), but it's logged.
  2. In ours's guest memory the CS owner is 0 (free) — no peer is holding the lock. Force-parking here would deadlock; the conservative branch skips and falls through to the natural fast-path.
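The two conditions above can be sketched as a small decision function. This is an illustration under assumptions, not the code in exports.rs: the name `on_manifest_hit`, the string return values, and the behavior when a peer actually holds the CS (treated here as "park") are hypothetical.

```python
def on_manifest_hit(manifest_cs_ptr, ours_cs_ptr, owner_tid, current_tid,
                    aggressive=False):
    """Decide what to do when the manifest reports a hit at (tid, idx).

    owner_tid == 0 means the critical section is free in guest memory.
    Returns (decision, notes) where decision is "skip" or "park".
    """
    notes = []
    if manifest_cs_ptr != ours_cs_ptr:
        # Logged only; the (tid, idx) alignment is trusted, so this never gates.
        notes.append("cs_ptr cross-engine divergence")
    if aggressive:
        # Aggressive mode parks unconditionally, even on a free CS.
        return "park", notes
    if owner_tid == 0 or owner_tid == current_tid:
        # Nobody would ever release the lock: parking here would deadlock.
        return "skip", notes
    # Assumed: a peer really holds the CS, so parking is safe.
    return "park", notes


conservative = on_manifest_hit(0xBC65C890, 0x40544890, owner_tid=0, current_tid=1)
forced = on_manifest_hit(0xBC65C890, 0x40544890, owner_tid=0, current_tid=1,
                         aggressive=True)
```

With the values from the first log line, the conservative call yields "skip" with one divergence note, while the aggressive call yields "park" on a free CS, which is the case the probe showed to be harmful.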

Reading-error class

No new class. Existing protocols:

  • #28 verify source first — read rtl_enter_critical_section, event_log.rs, KernelState end-to-end before editing.
  • #32 canary contention jitter — handled at the manifest layer (per-tid + idx key, no ordinal hardcoding).
  • AUDIT-043 allocator ε — manifest cs_ptr (canary heap) ≠ ours cs_ptr (different heap); the diff tool handles this for kernel.return via per-tid allocator ordinal canonicalization. The Stage-3 manifest doesn't have that translation but doesn't need it: matching on (tid, idx) is sufficient because both engines emit the RtlEnter at the same per-tid ordinal within the matched prefix.
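A minimal sketch of a lookup keyed on (tid, idx), with one-shot consumption behind a lock, may make the last point concrete. It is not a port of contention_manifest.rs: the class shape and entry fields are assumptions, and the per-tid emit-count translation is deliberately left out because its exact arithmetic is not spelled out in this note.

```python
import threading


class ContentionManifest:
    """Hypothetical Python analogue of a (tid, idx) -> entry manifest."""

    def __init__(self, entries):
        self._lock = threading.Lock()
        self._entries = {(e["tid"], e["idx"]): e for e in entries}

    def consume_at_peek(self, tid, peek_idx):
        """Return and remove the entry for (tid, peek_idx), or None on a miss.

        cs_ptr is intentionally not part of the key: the two engines use
        different heaps, so only the per-tid ordinal is comparable.
        """
        with self._lock:
            return self._entries.pop((tid, peek_idx), None)


manifest = ContentionManifest([
    {"tid": 1, "idx": 102788, "cs_ptr": 0xBC65C890},
    {"tid": 1, "idx": 104664, "cs_ptr": 0xBC65C890},
])
first = manifest.consume_at_peek(1, 102788)
again = manifest.consume_at_peek(1, 102788)   # already consumed
other_tid = manifest.consume_at_peek(2, 104664)  # keys are per-tid
second = manifest.consume_at_peek(1, 104664)
```

Keying on the ordinal rather than the pointer is what lets the lookup hit even though ours sees 0x40544890 and 0x828f39d0 where canary recorded 0xbc65c890.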

Artifacts

  • contention_manifest.json — 284 entries (tid-map applied)
  • /tmp/stage3-conservative-r{1,2,3}.jsonl — replay-mode cold runs (digest 1d7c6b45… × 3)
  • /tmp/stage3-default-r1.jsonl — default-mode cold run (digest ba5b5e07…)
  • /tmp/stage3-aggressive-r1.jsonl — aggressive-mode probe (-1,818 prefix; do NOT use as baseline)

Post-landing divergence forensics (root-cause clue)

Trace inspection of the divergent region in BOTH engines:

Ours's tid=1 idx 104,604..104,613:

import.call RtlEnter  → kernel.return  →
import.call RtlLeave  → kernel.return  →
import.call NtClose   → handle.destroy {SID: f02c5bda6f21992e, raw: 0x1068, prior_refcount: 1}

Canary's tid=6 idx 104,607..104,622:

import.call RtlEnter  → kernel.return  →
import.call RtlEnter  → kernel.return  (NESTED)
import.call RtlLeave  → kernel.return  (inner)
import.call RtlLeave  → kernel.return  (outer)
import.call NtClose   → handle.destroy on the SAME logical Event

The handle being closed is an Event (object_type=1). Its lineage in ours's trace:

| idx | host_ns | event |
|---|---|---|
| 104,387 | 515.2ms | handle.create Event 0x1068 |
| 104,572 | 516.4ms | wait.begin tid=1 on 0x1068, timeout_ns=-1 (indefinite) |
| 104,612 | 519.7ms | handle.destroy 0x1068 |

The Event is signaled between the wait.begin and the destroy. Scanning ours's trace for the signaler during the wait window (516.4-519.7ms) found: tid=5 calls NtSetEvent at host_ns 519.3ms, about 2.9ms into the wait and less than 0.4ms before tid=1 wakes and proceeds to the Enter/Leave/Close sequence (the handle.destroy lands at 519.7ms).
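The signaler scan can be reproduced with a short filter over the event log. A minimal sketch, assuming each event carries `tid`, `kind`, `name`, and `host_ns` (in nanoseconds); these field names and the sample events are illustrative, not the trace's confirmed schema.

```python
def find_signalers(events, start_ns, end_ns, waiter_tid):
    """Return NtSetEvent calls by other tids inside [start_ns, end_ns]."""
    return [
        event for event in events
        if event.get("kind") == "import.call"
        and event.get("name") == "NtSetEvent"
        and event.get("tid") != waiter_tid
        and start_ns <= event["host_ns"] <= end_ns
    ]


MS = 1_000_000  # nanoseconds per millisecond
sample = [
    {"tid": 1, "kind": "wait.begin", "host_ns": 516_400_000},
    {"tid": 5, "kind": "import.call", "name": "NtSetEvent", "host_ns": 519_300_000},
    {"tid": 1, "kind": "handle.destroy", "host_ns": 519_700_000},
    # Outside the wait window, so it must not be reported.
    {"tid": 5, "kind": "import.call", "name": "NtSetEvent", "host_ns": 520_100_000},
]
hits = find_signalers(sample, 516_400_000, 519_700_000, waiter_tid=1)
lead_ms = (519_700_000 - hits[0]["host_ns"]) / MS
```

Running the same filter over canary's trace for the equivalent window would show whether its peers do more work before the waiter wakes.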

So the divergence picture:

  1. tid=1 creates Event E (notification primitive).
  2. tid=1 blocks on E.
  3. tid=5 (or another peer) signals E via NtSetEvent.
  4. tid=1 wakes, acquires CS, checks some queue/list protected by the CS, optionally does nested cleanup work, releases CS, closes E.
  5. The "optionally does nested work" branch fires only in canary because canary's peer tids have produced more work items by the time tid=1 wakes. Ours's peer tids produce fewer items in the same wall-time window.

Conclusion: the 104,607 cap is a workload-interleaving / state-accumulation divergence, not a scheduling-determinism one. Stage 3's contention replay doesn't address it because the contention canary observes (at idx 104,664) is INSIDE the nested-cleanup branch ours never enters. This is C+22's "state-mutation-during-wait" hypothesis confirmed at the trace level: peer tids in canary mutate more shared state during the wait window than peer tids in ours do.

Decision

Stage 3 + 4 LAND as infrastructure. The 104,607 cap is NOT unblocked by this work alone because the divergence is upstream of any contention.

Next steps (in priority order):

  1. Stage 5 — per-CS hardcoded yield. The plan's fallback: hardcode a yield (NOT a park) at a specific CS pointer / call site near idx 104,610 in ours to shift the scheduling enough that some other tid runs and mutates state, hopefully advancing the prefix. ~30 LOC, narrow scope.
  2. State-divergence root-cause investigation — disassemble guest code at the call site of idx 104,610 to identify what state the guest is reading to decide nested-Enter vs Leave. Likely some shared variable/refcount mutated by another tid in canary but not in ours.
  3. D-extension — extend the diff tool's wait.begin absorber to also fold the post-acquire E E L L nested-cleanup block when followed by the matching E L NtClose pattern. The plan tags this as a band-aid crossing reading-error #23.

Phase B image hash

image_loaded_sha256 = ea8d160e… — UNCHANGED.

Next session

Stage 5 OR state-divergence investigation, per the user's call.