Phase D Stage 3 (+ Stage 4) — Contention Replay: Result
Date: 2026-05-18
Outcome: LANDED. Stage 3 ours-side contention-replay infrastructure +
Stage 4 diff-tool engine-local kind. Default-mode digest byte-identical
to pre-Stage-3 baseline. Replay-mode digest stable × 3. Sister chains
preserved. Main matched-prefix at 104,607 — unchanged from baseline.
Stage 3 lands as infrastructure; it does not unblock the cap on its own
because the 104,607 divergence is upstream of any
contention.observed event in the canary trace.
Engine source change
| file | LOC | purpose |
|---|---|---|
| xenia-rs/crates/xenia-kernel/src/contention_manifest.rs | new file, 280 | manifest loader + (tid, idx) → Entry HashMap behind Mutex + consume_at_peek(tid, peek_idx) with per-tid emit-count translation back to canary's idx space + 12 unit tests |
| xenia-rs/crates/xenia-kernel/src/lib.rs | +1 | pub mod contention_manifest; |
| xenia-rs/crates/xenia-kernel/src/event_log.rs | +35 | object_type::CRITICAL_SECTION = 0x0C enum value + emit_contention_observed(tid, guest_cycle, cs_ptr, contended) helper (mirrors canary's EmitContentionObserved) |
| xenia-rs/crates/xenia-kernel/src/state.rs | +20 | KernelState.contention_manifest: Option<Arc<ContentionManifest>> field + install_contention_manifest() setter |
| xenia-rs/crates/xenia-kernel/src/exports.rs | +75 | rtl_enter_critical_section body consults the manifest via consume_at_peek, emits contention.observed on hit, falls through to natural code (conservative deadlock-safe mode); aggressive mode behind XENIA_CONTENTION_AGGRESSIVE=1 env var (no-go per probe — see below) |
| xenia-rs/crates/xenia-app/src/main.rs | +30 | XENIA_CONTENTION_MANIFEST_PATH env-var loader + tracing log on success/failure |
| Engine total | ~440 | additive across 6 files |
| diff-tool file | LOC | purpose (Stage 4 bundled) |
|---|---|---|
| xenia-rs/tools/diff-events/diff_events.py | +30 | new ENGINE_LOCAL_KINDS = {"contention.observed"} set + skip branches in diff_one_tid so per-tid pointer advances past these events on EITHER side without comparison |
| xenia-rs/tools/diff-events/test_diff_events.py | +65 | 2 new tests covering both-sides and one-sided engine-local skip |
| xenia-rs/tools/diff-events/build_contention_manifest.py | +45 | added --tid-map CANARY=OURS,... translation so the manifest stores ours-side tids (so Stage-3 lookup keys on the consumer's native current_tid) |
| xenia-rs/tools/diff-events/test_build_manifest.py | +50 | 2 new tests covering tid-map translation + unmapped-tid drop |
| Tooling total | ~190 | additive |
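The --tid-map translation described for build_contention_manifest.py can be sketched as follows. This is a minimal illustration, not the tool's code: the entry field names (tid, idx, cs_ptr) and both helper names are assumptions, and the tid=3 entry is a made-up example of an unmapped thread.

```python
def parse_tid_map(spec):
    """Parse 'CANARY=OURS,...' into {canary_tid: ours_tid}."""
    mapping = {}
    for pair in spec.split(","):
        canary, ours = pair.split("=")
        mapping[int(canary)] = int(ours)
    return mapping


def translate_entries(entries, tid_map):
    """Rewrite each entry's tid to the ours-side tid; drop unmapped tids."""
    translated = []
    for entry in entries:
        ours_tid = tid_map.get(entry["tid"])
        if ours_tid is None:
            continue  # no ours-side counterpart, so the entry is dropped
        translated.append({**entry, "tid": ours_tid})
    return translated


tid_map = parse_tid_map("6=1,7=2,4=11,12=7,14=9,15=10")
entries = [
    {"tid": 6, "idx": 102788, "cs_ptr": "0xbc65c890"},
    {"tid": 3, "idx": 17, "cs_ptr": "0xbc65c890"},  # hypothetical unmapped tid
]
translated = translate_entries(entries, tid_map)
```

Storing ours-side tids at build time keeps the engine lookup a plain key match on the consumer's own current_tid, at the cost of rebuilding the manifest whenever the tid mapping changes.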
Tests
- cargo test -p xenia-kernel --lib: 216 PASS (was 213; +3 new tests in contention_manifest: consume_at_peek_translates_idx, consume_at_peek_miss_does_not_bump_emit_count, consume_at_peek_per_tid_independent; plus 9 earlier ones covering load, consume, peek, version-check, missing-fields-error, cs_ptr-parse).
- cargo test -p xenia-cpu --lib: 300 PASS (unchanged from Stage 0).
- python3 test_build_manifest.py: 11 PASS (was 9, +2 for tid-map).
- python3 test_diff_events.py: all pre-existing PASS + 2 new (test_engine_local_contention_observed_skipped_both_sides, test_engine_local_one_sided_contention_observed).
Cold-run validation
Gate 1: default-mode digest byte-identical to Stage 0 baseline
Without XENIA_CONTENTION_MANIFEST_PATH set:
$ XENIA_CACHE_WIPE=1 xenia-rs-stage3 exec --phase-a-event-log ours.jsonl \
-n 50000000 --quiet Sylpheed.iso
$ det_digest.py ours.jsonl
det_fields_md5 = ba5b5e0795ccb32966a49d3b2917a30d <-- same as Stage 0 baseline
total_events = 121569
✓ Stage 3's manifest-check fast-path costs ≈ one
Option::as_ref().and_then(…) per rtl_enter_critical_section call.
Default-mode behavior preserved bit-for-bit. Phase B
image_loaded_sha256 = ea8d160e… UNCHANGED.
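The digest check in this gate depends on hashing only the fields that should be identical across runs. A rough sketch of that idea follows. It is not the actual det_digest.py: the excluded-field set is an assumption (host_ns is the only timing field named in this report), and the JSON canonicalization is one reasonable choice among several.

```python
import hashlib
import json

# Assumption: fields that legitimately vary between runs and are left out.
NONDETERMINISTIC_FIELDS = {"host_ns"}


def det_fields_md5(lines):
    """Return (md5 hex digest, event count) over deterministic fields only."""
    digest = hashlib.md5()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        det = {k: v for k, v in event.items() if k not in NONDETERMINISTIC_FIELDS}
        # sort_keys makes the encoding independent of key order in the log
        encoded = json.dumps(det, sort_keys=True, separators=(",", ":"))
        digest.update(encoded.encode("utf-8"))
        digest.update(b"\n")
        total += 1
    return digest.hexdigest(), total


run_a = ['{"tid": 1, "idx": 0, "kind": "import.call", "host_ns": 100}']
run_b = ['{"idx": 0, "kind": "import.call", "tid": 1, "host_ns": 250}']
run_c = ['{"tid": 1, "idx": 0, "kind": "kernel.return", "host_ns": 100}']
digest_a, total_a = det_fields_md5(run_a)
digest_b, total_b = det_fields_md5(run_b)
digest_c, total_c = det_fields_md5(run_c)
```

Under these assumptions run_a and run_b hash identically despite different host timestamps, while run_c differs because a deterministic field changed.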
Gate 2: replay-mode digest stable × 3
With manifest installed (807 entries → 284 entries after tid-map filter):
| run | digest | total_events |
|---|---|---|
| 1 | 1d7c6b4592d024405cd9d86eb79f5307 | 121571 |
| 2 | 1d7c6b4592d024405cd9d86eb79f5307 | 121571 |
| 3 | 1d7c6b4592d024405cd9d86eb79f5307 | 121571 |
✓ Bit-stable × 3 under replay. A new digest is expected: the 2 contention.observed emits add two events to tid=1's stream (121,569 → 121,571 total), shifting later per-tid idx values by up to +2 from idx 102,788 onward and producing a different but deterministic byte sequence.
Gate 3: replay-mode matched-prefix vs canary cvar-ON
$ python3 diff_events.py \
--canary canary-cvaron-trunc.jsonl \
--ours stage3-replay.jsonl \
--tid-map 6=1,7=2,4=11,12=7,14=9,15=10
| canary_tid | ours_tid | matched | first_divergence_at |
|---|---|---|---|
| 4 | 11 | 11 | — |
| 6 | 1 | 104607| 104607 |
| 7 | 2 | 32 | — |
| 12| 7 | 4 | 4 |
| 14| 9 | 41 | 41 |
| 15| 10 | 16 | — |
✓ Main matched-prefix 104,607 — same as pre-Phase-D baseline. Sister chains preserved (11/32/4/41/16, identical to C+22 baseline). Stage 3 does not break the prefix; nor does it advance it.
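The prefix survives the two extra replay events because the diff tool advances past engine-local kinds on either side without comparing them. A simplified sketch of that skip logic follows. It is not the real diff_one_tid: comparing by kind only and counting only compared events toward the prefix are both assumptions made to keep the example short.

```python
ENGINE_LOCAL_KINDS = {"contention.observed"}


def matched_prefix(canary, ours):
    """Count leading events that match, skipping engine-local kinds."""
    i = j = matched = 0
    while True:
        # Advance each side past engine-local events without comparing them.
        while i < len(canary) and canary[i]["kind"] in ENGINE_LOCAL_KINDS:
            i += 1
        while j < len(ours) and ours[j]["kind"] in ENGINE_LOCAL_KINDS:
            j += 1
        if i >= len(canary) or j >= len(ours):
            break
        if canary[i]["kind"] != ours[j]["kind"]:
            break  # first divergence
        matched += 1
        i += 1
        j += 1
    return matched


canary = [
    {"kind": "import.call"},
    {"kind": "contention.observed"},
    {"kind": "kernel.return"},
    {"kind": "import.call"},
]
ours = [
    {"kind": "import.call"},
    {"kind": "kernel.return"},
    {"kind": "handle.destroy"},
]
one_sided = matched_prefix(canary, ours)
both_sides = matched_prefix(
    canary[:3], [{"kind": "contention.observed"}] + ours[:2]
)
```

In this toy input the one-sided contention.observed does not end the prefix; the divergence is reported only at the later import.call versus handle.destroy mismatch.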
Why the cap isn't unblocked
The 104,607 divergence is at canary's tid=6 idx 104,610 (nested RtlEnter)
vs ours's tid=1 idx 104,608 (RtlLeave). Both engines completed the outer
RtlEnter at idx 104,608/104,606 with return_value=0 and no
contention.observed event. The first canary contention is at idx
104,664 — AFTER the cap divergence, on a DEEPER RtlEnter call further
into the same control-flow branch ours never enters.
In other words: the cap is upstream of any contention. Replaying canary's contention.observed events at 102,788 and 104,664 happens either too early (102,788 — way before the cap) or too late (104,664 — ours diverged at 104,607 and never reaches that ordinal in the same logical position). The manifest mechanism is correctly aligned to canary's contention events; the cap simply isn't a contention event.
The 104,610 nested RtlEnter in canary vs the RtlLeave in ours is guest-code-driven: same PPC code, same outer-Enter return value (0), different next-call decision. Likely cause: some other guest memory state (not the CS struct itself) has diverged between canary and ours upstream of idx 104,610, and the guest's branch decision reads that state. That's a state-divergence root cause, NOT scheduling-determinism.
Aggressive mode probe (XENIA_CONTENTION_AGGRESSIVE=1)
Tested behind an env-var gate to confirm: forcing the park unconditionally when the manifest hits (even when CS is free in guest memory) regresses the trace catastrophically.
- main matched-prefix: 102,789 (-1,818 vs baseline)
- ours_total events: 1,019,208 (vs. 121,569 in default mode; roughly 8.4× larger)
- Sister chains: 4 of 5 entirely absent (tid=11/7/9/10 produced zero events)
- Cause: tid=1 force-parks on a free CS at idx 102,788. No peer touches the CS during the wait. Scheduler::unblock_on_deadlock eventually recovers with owner=0, but downstream guest state is now corrupted (the RtlEnter returned with owner unset). The other tids never reach the spawn point because tid=1 was supposed to drive their setup.
Aggressive mode is gated off by default (XENIA_CONTENTION_AGGRESSIVE=1
explicit opt-in). Conservative mode is the landed default.
What the manifest did observe
Per the run log (debug level):
manifest cs_ptr cross-engine divergence at tid=1 idx=102788: manifest 0xbc65c890, ours 0x40544890 (allocator ε)
manifest hit at tid=1 idx=102788 cs=0x40544890 but CS is free/self-owned (owner=0); replay skipped (state-divergence, not schedule-divergence)
manifest cs_ptr cross-engine divergence at tid=1 idx=104664: manifest 0xbc65c890, ours 0x828f39d0 (allocator ε)
manifest hit at tid=1 idx=104664 cs=0x828f39d0 but CS is free/self-owned (owner=0); replay skipped (state-divergence, not schedule-divergence)
Two hits fired, both at the right ordinal. Both fell into the conservative "skip" branch because:
- The cs_ptr canary recorded (0xbc65c890) and the cs_ptr ours sees (0x40544890/0x828f39d0) differ. This is the AUDIT-043 allocator ε divergence; we don't gate on it (we trust the (tid, idx) alignment), but it's logged.
- In ours's guest memory the CS owner is 0 (free): no peer is holding the lock. Force-parking here would deadlock; the conservative branch skips and falls through to the natural fast-path.
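The decision path behind those two log lines can be modelled in a few lines. The landed code is Rust in exports.rs; this Python model is only an illustration, and on_rtl_enter, Entry, the returned tuple shape, and treating owner as a tid with 0 meaning free are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    cs_ptr: int  # pointer canary recorded; ours will usually differ


def on_rtl_enter(manifest, tid, idx, cs_ptr, owner, aggressive=False):
    """Return (emit_contention_observed, action, notes) for one RtlEnter."""
    entry = manifest.pop((tid, idx), None)  # consume: each entry fires once
    if entry is None:
        return False, "natural", []
    notes = []
    if entry.cs_ptr != cs_ptr:
        # Allocator divergence is logged but never used as a gate.
        notes.append("cs_ptr divergence")
    if aggressive:
        return True, "force_park", notes
    if owner in (0, tid):
        notes.append("CS free/self-owned; replay skipped")
    # Conservative mode always falls through to the natural path.
    return True, "natural", notes


manifest = {
    (1, 102788): Entry(0xBC65C890),
    (1, 104664): Entry(0xBC65C890),
}
first = on_rtl_enter(manifest, 1, 102788, 0x40544890, owner=0)
repeat = on_rtl_enter(manifest, 1, 102788, 0x40544890, owner=0)
forced = on_rtl_enter(manifest, 1, 104664, 0x828F39D0, owner=0, aggressive=True)
```

In this model a hit always emits the event, which is consistent with the +2 total_events seen under replay, while only the aggressive flag can change the locking behavior.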
Reading-error class
No new class. Existing protocols:
- #28 verify source first: read rtl_enter_critical_section, event_log.rs, KernelState end-to-end before editing.
- #32 canary contention jitter: handled at the manifest layer (per-tid + idx key, no ordinal hardcoding).
- AUDIT-043 allocator ε: manifest cs_ptr (canary heap) ≠ ours cs_ptr (different heap); the diff tool handles this for kernel.return via per-tid allocator ordinal canonicalization. The Stage-3 manifest doesn't have that translation but doesn't need it: matching on (tid, idx) is sufficient because both engines emit the RtlEnter at the same per-tid ordinal within the matched prefix.
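For readers unfamiliar with ordinal canonicalization, one common way to do it is to replace each pointer with the order in which it first appears in a tid's stream. The sketch below shows that generic technique; the diff tool's actual scheme is not shown in this report, the function name is hypothetical, and the second address in each list is invented for illustration.

```python
def canonicalize_pointers(pointers):
    """Replace each pointer with the ordinal of its first appearance."""
    ordinals = {}
    canonical = []
    for ptr in pointers:
        if ptr not in ordinals:
            ordinals[ptr] = len(ordinals)
        canonical.append(ordinals[ptr])
    return canonical


# One tid's pointer sequence in each engine: different heaps, same shape.
canary_ptrs = [0xBC65C890, 0xBC65C8B0, 0xBC65C890]
ours_ptrs = [0x40544890, 0x405448B0, 0x40544890]
canary_canonical = canonicalize_pointers(canary_ptrs)
ours_canonical = canonicalize_pointers(ours_ptrs)
```

Both sequences reduce to the same ordinals, so raw addresses from different heaps stop registering as divergences while a genuinely different allocation pattern still would.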
Artifacts
- contention_manifest.json: 284 entries (tid-map applied)
- /tmp/stage3-conservative-r{1,2,3}.jsonl: replay-mode cold runs (digest 1d7c6b45… × 3)
- /tmp/stage3-default-r1.jsonl: default-mode cold run (digest ba5b5e07…)
- /tmp/stage3-aggressive-r1.jsonl: aggressive-mode probe (-1,818 prefix; do NOT use as baseline)
Post-landing divergence forensics (root-cause clue)
Trace inspection of the divergent region in BOTH engines:
Ours's tid=1 idx 104,604..104,613:
import.call RtlEnter → kernel.return →
import.call RtlLeave → kernel.return →
import.call NtClose → handle.destroy {SID: f02c5bda6f21992e, raw: 0x1068, prior_refcount: 1}
Canary's tid=6 idx 104,607..104,622:
import.call RtlEnter → kernel.return →
import.call RtlEnter → kernel.return (NESTED)
import.call RtlLeave → kernel.return (inner)
import.call RtlLeave → kernel.return (outer)
import.call NtClose → handle.destroy on the SAME logical Event
The handle being closed is an Event (object_type=1). Its lineage in ours's trace:
| idx | host_ns | event |
|---|---|---|
| 104,387 | 515.2ms | handle.create Event 0x1068 |
| 104,572 | 516.4ms | wait.begin tid=1 on 0x1068, timeout_ns=-1 (indefinite) |
| 104,612 | 519.7ms | handle.destroy 0x1068 |
The Event is signaled between the wait.begin and the destroy. Scanning
ours's trace for the signaler during the wait window (516.4-519.7ms)
found: tid=5 calls NtSetEvent at host_ns 519.3ms, about 2.9ms into the
wait. tid=1 then wakes and runs the Enter/Leave/Close sequence, reaching
handle.destroy about 0.4ms later (519.7ms).
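The signaler scan amounts to a time-window filter over the event log. A minimal sketch follows, assuming a hypothetical event schema (kind, name, handle, tid, host_ns in nanoseconds); the real trace's field names and whether the handle argument is logged on import.call are not confirmed here.

```python
def find_signalers(events, handle, wait_start_ns, wait_end_ns, waiter_tid):
    """Return NtSetEvent calls on `handle` by other tids inside the window."""
    return [
        e
        for e in events
        if e.get("kind") == "import.call"
        and e.get("name") == "NtSetEvent"
        and e.get("handle") == handle
        and e["tid"] != waiter_tid
        and wait_start_ns <= e["host_ns"] <= wait_end_ns
    ]


events = [
    {"kind": "wait.begin", "tid": 1, "handle": 0x1068, "host_ns": 516_400_000},
    {"kind": "import.call", "name": "NtSetEvent", "tid": 5,
     "handle": 0x1068, "host_ns": 519_300_000},
    # Outside the window, so it is ignored.
    {"kind": "import.call", "name": "NtSetEvent", "tid": 7,
     "handle": 0x1068, "host_ns": 530_000_000},
    {"kind": "handle.destroy", "tid": 1, "handle": 0x1068, "host_ns": 519_700_000},
]
signalers = find_signalers(events, 0x1068, 516_400_000, 519_700_000, waiter_tid=1)
signaler_tids = [e["tid"] for e in signalers]
```

Running the same filter over both engines' traces, with each side's own window bounds, would show which peers were active while the waiter was blocked.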
So the divergence picture:
- tid=1 creates Event E (notification primitive).
- tid=1 blocks on E.
- tid=5 (or another peer) signals E via NtSetEvent.
- tid=1 wakes, acquires CS, checks some queue/list protected by the CS, optionally does nested cleanup work, releases CS, closes E.
- The "optionally does nested work" branch fires only in canary because canary's peer tids have produced more work items by the time tid=1 wakes. Ours's peer tids produce fewer items in the same wall-time window.
Conclusion: the 104,607 cap is a workload-interleaving / state-accumulation divergence, not a scheduling-determinism one. Stage 3's contention replay doesn't address it because the contention canary observes (at idx 104,664) is INSIDE the nested-cleanup branch ours never enters. This is C+22's "state-mutation-during-wait" hypothesis confirmed at the trace level: peer tids in canary mutate more shared state during the wait window than peer tids in ours do.
Decision
Stage 3 + 4 LAND as infrastructure. The 104,607 cap is NOT unblocked by this work alone because the divergence is upstream of any contention.
Next steps (in priority order):
- Stage 5 — per-CS hardcoded yield. The plan's fallback: hardcode a yield (NOT a park) at a specific CS pointer / call site near idx 104,610 in ours to shift the scheduling enough that some other tid runs and mutates state, hopefully advancing the prefix. ~30 LOC, narrow scope.
- State-divergence root-cause investigation — disassemble guest code at the call site of idx 104,610 to identify what state the guest is reading to decide nested-Enter vs Leave. Likely some shared variable/refcount mutated by another tid in canary but not in ours.
- D-extension: extend the diff tool's wait.begin absorber to also fold the post-acquire E E L L nested-cleanup block when followed by the matching E L NtClose pattern. The plan tags this as a band-aid crossing reading-error #23.
Phase B image hash
image_loaded_sha256 = ea8d160e… — UNCHANGED.
Next session
Stage 5 OR state-divergence investigation, per the user's call.