xenia-rs/audit-runs/phase-c20-rtl-enter-cs-wait/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase C+20 investigation — RtlEnterCriticalSection wait.begin (2026-05-14)

Framing verification (reading-error #28 discipline)

Canary's RtlEnterCriticalSection — xboxkrnl_rtl.cc:596-633

void RtlEnterCriticalSection_entry(pointer_t<X_RTL_CRITICAL_SECTION> cs) {
  if (!cs.guest_address()) { ... return; }
  CriticalSectionPrefetchW(&cs->lock_count);
  uint32_t cur_thread = XThread::GetCurrentThread()->guest_object();
  uint32_t spin_count = cs->header.absolute * 256;

  if (cs->owning_thread == cur_thread) {           // RECURSIVE FAST PATH
    xe::atomic_inc(&cs->lock_count);
    cs->recursion_count++;
    return;
  }

  // Spin loop
  while (spin_count--) {
    if (xe::atomic_cas(-1, 0, &cs->lock_count)) {  // UNCONTENDED FAST PATH
      cs->owning_thread = cur_thread;
      cs->recursion_count = 1;
      return;
    }
  }

  if (xe::atomic_inc(&cs->lock_count) != 0) {      // CONTENDED SLOW PATH
    // Create a full waiter.
    xeKeWaitForSingleObject(reinterpret_cast<void*>(cs.host_address()), 8, 0, 0,
                            nullptr);
  }
  assert_true(cs->owning_thread == 0);
  cs->owning_thread = cur_thread;
  cs->recursion_count = 1;
}

Canary emits wait.begin only on the contended slow path, through the xeKeWaitForSingleObject call. The wait handle is the CS struct pointer. xeKeWaitForSingleObject resolves it via XObject::GetNativeObject, which lazily wraps the embedded DISPATCHER_HEADER (first 12 bytes of the CS struct) as an XEvent. The SID 75ae880ec432eb36 (object_type=1, raw_handle=0xf8000044) seen at canary tid=9 idx=295 IS this Event, synthesized on first contention.
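The three paths can be modelled in a few lines of Rust to make the lock_count arithmetic explicit. This is a minimal sketch, not canary's code: CsModel, EnterPath and enter are hypothetical names, the struct layout is simplified, and the sketch assumes xe::atomic_inc returns the incremented value (so a free CS at -1 increments to 0 and is not contended). The contended case only reports that a wait would start; it does not block.

```rust
use std::sync::atomic::{AtomicI32, Ordering};

/// Which branch of the enter routine a call took (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum EnterPath {
    Recursive,
    Uncontended,
    Contended, // the only path that would reach the wait, and so wait.begin
}

/// Simplified stand-in for the guest critical section (not the real layout).
pub struct CsModel {
    pub lock_count: AtomicI32, // -1 means free
    pub recursion_count: u32,
    pub owning_thread: u32, // 0 means unowned
    pub spin_count: u32,
}

impl CsModel {
    pub fn new(spin_count: u32) -> Self {
        CsModel {
            lock_count: AtomicI32::new(-1),
            recursion_count: 0,
            owning_thread: 0,
            spin_count,
        }
    }
}

/// Classify one enter attempt without blocking.
pub fn enter(cs: &mut CsModel, cur_thread: u32) -> EnterPath {
    if cs.owning_thread == cur_thread {
        cs.lock_count.fetch_add(1, Ordering::SeqCst);
        cs.recursion_count += 1;
        return EnterPath::Recursive;
    }
    for _ in 0..cs.spin_count {
        let swapped = cs
            .lock_count
            .compare_exchange(-1, 0, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok();
        if swapped {
            cs.owning_thread = cur_thread;
            cs.recursion_count = 1;
            return EnterPath::Uncontended;
        }
    }
    // fetch_add returns the old value; add 1 to mirror an "incremented value" return.
    if cs.lock_count.fetch_add(1, Ordering::SeqCst) + 1 != 0 {
        return EnterPath::Contended;
    }
    cs.owning_thread = cur_thread;
    cs.recursion_count = 1;
    EnterPath::Uncontended
}
```

Calling enter twice from the same thread id and then once from a different one walks the uncontended, recursive and contended paths in that order.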

xeKeWaitForSingleObject emit point — xboxkrnl_threading.cc:969-991

uint32_t xeKeWaitForSingleObject(void* object_ptr, uint32_t wait_reason, ...) {
  auto object = XObject::GetNativeObject<XObject>(kernel_state(), object_ptr);
  if (!object) { assert_always(); return X_STATUS_ABANDONED_WAIT_0; }

  if (phase_a::IsEnabled()) {
    uint64_t sid = 0;
    if (!object->handles().empty()) {
      sid = phase_a::LookupHandleSemanticId(object->handles()[0]);
    }
    int64_t timeout_ns = timeout_ptr ? (*timeout_ptr * 100) : -1;
    phase_a::EmitWaitBegin(&sid, 1, timeout_ns, alertable != 0, false);
  }

  X_STATUS result = object->Wait(...);
  ...
}

Confirms: wait.begin fires only when the slow path is taken.
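The timeout encoding in the emit point above is small enough to restate in Rust. A sketch only: timeout_ns is a hypothetical helper, and the saturating multiply is a defensive choice of this sketch, whereas the quoted canary code uses a plain multiplication. The raw value is in 100 ns units and its sign is carried through unchanged; a missing timeout pointer maps to -1.

```rust
/// Convert an optional raw timeout (100 ns units) to the nanosecond value
/// carried by wait.begin. `None` stands for a null timeout pointer.
pub fn timeout_ns(raw_100ns: Option<i64>) -> i64 {
    match raw_100ns {
        // Sign is preserved: a negative raw value stays negative.
        Some(t) => t.saturating_mul(100),
        None => -1,
    }
}
```

The timeout=-1 seen in the divergent wait.begin is consistent with the nullptr timeout that RtlEnterCriticalSection passes.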

Ours's rtl_enter_critical_section — exports.rs:2886-2946

Has three branches:

  1. owner == 0 || !owner_is_live → claim uncontended.
  2. owner == current_tid → recursive bump.
  3. otherwise → park current thread on cs_waiters via state.scheduler.park_current(BlockReason::CriticalSection(cs_ptr)).

The park path does NOT emit wait.begin. It is semantically symmetric to canary's slow path, but it produces no schema event.
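The branch order described above can be sketched as follows. This is an illustration under assumptions, not the code in exports.rs: Cs, Outcome and enter_cs are hypothetical names, the liveness check is passed in as a closure, and parking is reduced to pushing the tid onto a waiter list.

```rust
/// Result of one enter attempt in this simplified model.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Claimed,
    RecursiveBump,
    Parked,
}

/// Hypothetical, reduced view of the critical section state.
pub struct Cs {
    pub owner: u32, // 0 means unowned
    pub recursion: u32,
}

pub fn enter_cs(
    cs: &mut Cs,
    current_tid: u32,
    owner_is_live: impl Fn(u32) -> bool,
    waiters: &mut Vec<u32>,
) -> Outcome {
    if cs.owner == 0 || !owner_is_live(cs.owner) {
        // Branch 1: free, or owned by a thread that no longer exists.
        cs.owner = current_tid;
        cs.recursion = 1;
        Outcome::Claimed
    } else if cs.owner == current_tid {
        // Branch 2: recursive entry by the current owner.
        cs.recursion += 1;
        Outcome::RecursiveBump
    } else {
        // Branch 3: contended; the real code parks on the scheduler here.
        waiters.push(current_tid);
        Outcome::Parked
    }
}
```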

Divergent event observed (fresh canary cold + fresh ours cold)

[104604] ours+canary import.call RtlEnterCriticalSection
[104605] ours+canary kernel.call  RtlEnterCriticalSection
[104606] CANARY   wait.begin sid=75ae880ec432eb36 timeout=-1 wait_type=any
[104606] OURS     kernel.return RtlEnterCriticalSection rv=0
[104607] CANARY   kernel.return RtlEnterCriticalSection rv=0
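For reference, the matched-prefix metric that flags this event amounts to counting the common prefix of the two event streams. A minimal sketch, assuming events compare by kind and detail only; Event, ev and matched_prefix are hypothetical names, not the diff tool's API.

```rust
/// Reduced trace event: just a kind and a free-form detail string.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub kind: String,
    pub detail: String,
}

/// Convenience constructor for examples.
pub fn ev(kind: &str, detail: &str) -> Event {
    Event { kind: kind.to_string(), detail: detail.to_string() }
}

/// Number of leading events that are identical in both streams.
/// The returned value is also the offset of the first divergence.
pub fn matched_prefix(ours: &[Event], canary: &[Event]) -> usize {
    ours.iter()
        .zip(canary.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

Fed the five lines above as two short streams, the prefix ends at the third event, where canary has wait.begin and ours has kernel.return.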

Classification

This is class (B), a real contention difference. It is NOT (A) always-wait and NOT (C) emit gap.

Evidence:

  1. Canary's RtlEnterCriticalSection source code provably only emits wait.begin in the contended branch. The earlier two RtlEnterCriticalSection sequences (canary tid=6 idx=104,598-600 and idx=104,608-610) BOTH fast-path (no wait.begin) — proving canary's path is conditional on contention.

  2. SID 75ae880ec432eb36 appears 15 times in canary, on 4 different tids (tid=6/9/10/18). Always with object_type=1 (Event). All 15 are wait.begin (or 1 handle.create first-touch). This is a shared CS used across the title's thread pool.

  3. At canary's idx 104,604, the CS is contended because tid=9 is simultaneously doing cache-file work (NtCreateFile cache:\69d8e45ce534ffea.tmp at canary tid=9 idx=305) that almost certainly enters the same CS first. Canary's host_ns gap between idx 104,603 (RtlLeave) and idx 104,604 (RtlEnter) is 268.2 ms, during which thousands of other-tid events fire.

  4. At ours's idx 104,604, only tid=1 and tid=5 are active in a 1ms window around the call. tid=5 is in MmFreePhysicalMemory — not touching this CS. Ours's gap between idx 104,603→104,604 is 7.6 μs. Effectively single-threaded.

  5. Ours has no other live thread holding this CS — fast path is the correct semantic result for ours's scheduling.
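The "shared CS" observation in item 2 comes down to counting, per SID, how many wait records exist and how many distinct tids produced them. A sketch of that tally, assuming the trace has already been reduced to (tid, sid) pairs; WaitRecord, SidUsage, summarize and is_shared are hypothetical names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One wait-related record from a trace, reduced to the fields used here.
pub struct WaitRecord {
    pub tid: u32,
    pub sid: u64,
}

/// Per-SID usage: how often it appears and on which threads.
pub struct SidUsage {
    pub count: usize,
    pub tids: BTreeSet<u32>,
}

/// Group records by SID.
pub fn summarize(records: &[WaitRecord]) -> BTreeMap<u64, SidUsage> {
    let mut out: BTreeMap<u64, SidUsage> = BTreeMap::new();
    for r in records {
        let entry = out
            .entry(r.sid)
            .or_insert_with(|| SidUsage { count: 0, tids: BTreeSet::new() });
        entry.count += 1;
        entry.tids.insert(r.tid);
    }
    out
}

/// A SID waited on from more than one thread is a shared-dispatcher candidate.
pub fn is_shared(usage: &SidUsage) -> bool {
    usage.tids.len() > 1
}
```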

Why this is scheduler determinism

The contention pattern emerges from the interleaving of multiple guest threads racing on a shared CS. To make ours produce the same event sequence as canary at this idx, we would need:

  • tid=9 (or another holder) to be currently inside its critical section block when tid=1 reaches idx 104,604.
  • That requires ours to schedule tid=9 ahead of (or concurrently with) tid=1's RtlEnter, exactly as canary's host scheduler did.
  • Ours's deterministic single-stepping scheduler runs tid=1 near-monolithically through this region — tid=9 has no opportunity to claim the CS before tid=1 fast-paths through.

This is the canonical signature of cross-thread scheduling asymmetry. Fixing it requires either:

(i) Reworking ours's scheduler to interleave threads at finer granularity matching canary's preemption points — substantial refactor of xenia-cpu::scheduler.

(ii) Recording a "scheduling trace" from canary (which thread holds which CS at which guest_cycle) and replaying it in ours — new subsystem.

(iii) Forcing ours to spin-wait briefly at every RtlEnter so other tids get a chance to claim the CS — extremely fragile, no guarantee of matching canary's exact interleave.

None of these are scoped for a single phase-C iteration. The prompt's authorized scope explicitly says:

You may NOT refactor thread scheduling (escalation: scheduler determinism is a separate session).

Escalation: if classification is (B) and scheduler determinism is required, escalate cleanly — don't push through.

Decision: ESCALATE + diff-tool TODO

C+20 produces no engine change. The classification, supporting evidence, and recommended escalation path are recorded for a future "scheduler-determinism" milestone.

Additional diff-tool action (NOT executed in C+20 per scope): the diff tool should be taught to absorb cross-tid race-window wait.begin events on shared CS dispatchers (analogous to C+18's shared-global SID floating-absorb for handle.create). The divergence at idx 104,606 is a strict sub-case of class #30 (scheduling-determinism observation artifact). A follow-up phase (C+20.5 or part of the scheduler-determinism track) should:

  1. Detect wait.begin events whose SID matches the pattern of canary jitter-1's 75ae880ec432eb36 (multi-tid usage, type=1 Event, first-touched by GetNativeObject from an RtlEnter slow path).
  2. Mark as "scheduling-jitter-window" and floating-absorb in the diff walk so matched-prefix doesn't anchor to it.

This would reveal the true next divergence beyond the jitter cloud.
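One possible shape for the floating-absorb walk is sketched below. It is an assumption-laden illustration, not the diff tool's implementation: Ev and first_divergence are hypothetical names, events are reduced to "wait.begin with a SID" or "anything else", and the set of jitter SIDs is taken as already detected by step 1.

```rust
use std::collections::HashSet;

/// Reduced event type for the walk.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Ev {
    WaitBegin { sid: u64 },
    Other(String),
}

/// Walk both streams in lockstep. A wait.begin whose SID is in `jitter` is
/// skipped on whichever side it appears instead of anchoring the divergence.
/// Returns the (ours, canary) positions of the first non-absorbed mismatch.
pub fn first_divergence(
    ours: &[Ev],
    canary: &[Ev],
    jitter: &HashSet<u64>,
) -> Option<(usize, usize)> {
    let absorbable = |e: &Ev| matches!(e, Ev::WaitBegin { sid } if jitter.contains(sid));
    let (mut i, mut j) = (0, 0);
    while i < ours.len() && j < canary.len() {
        if ours[i] == canary[j] {
            i += 1;
            j += 1;
        } else if absorbable(&canary[j]) {
            j += 1;
        } else if absorbable(&ours[i]) {
            i += 1;
        } else {
            return Some((i, j));
        }
    }
    // Trailing jitter events on either side are absorbed as well.
    while j < canary.len() && absorbable(&canary[j]) {
        j += 1;
    }
    while i < ours.len() && absorbable(&ours[i]) {
        i += 1;
    }
    if i == ours.len() && j == canary.len() {
        None
    } else {
        Some((i, j))
    }
}
```

With an empty jitter set the function degenerates to a plain first-mismatch search, which makes the effect of absorption easy to compare on the same pair of logs.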

Risk of "partial" fixes considered

Could we just always emit wait.begin in ours's rtl_enter_critical_section?

No — would produce phantom wait.begin events on the fast path where canary correctly emits none. Would regress at the very next RtlEnterCriticalSection that ours fast-paths (e.g., ours idx 104,598 where canary also fast-paths). Net effect: shifts the divergence elsewhere, doesn't fix it.

Could we wire wait.begin into ours's park_current(CriticalSection)?

Yes — this would be semantically symmetric to canary and is a small patch (~25 LOC). But it would NOT fix the divergence at idx 104,606, because ours doesn't park at this call site at all. The patch would be inert until a different test case exposes a path where ours does park on a CS. Useful prophylactic, but not the C+20 target.
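A rough picture of that prophylactic patch, under stated assumptions: Scheduler, TraceEvent and park_on_cs are hypothetical stand-ins for ours's scheduler and trace emitter, the SID for the CS is assumed to be available at the park site, and the timeout is fixed at -1 to mirror the infinite wait seen in canary's event.

```rust
/// Reduced schema event for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum TraceEvent {
    WaitBegin { sid: u64, timeout_ns: i64 },
}

/// Hypothetical scheduler model: a parked list plus an in-memory trace.
pub struct Scheduler {
    pub parked: Vec<(u32, u32)>, // (tid, cs_ptr)
    pub trace: Vec<TraceEvent>,
}

impl Scheduler {
    pub fn new() -> Self {
        Scheduler { parked: Vec::new(), trace: Vec::new() }
    }

    /// Park `tid` on the critical section at `cs_ptr`, emitting wait.begin
    /// first so the event precedes the block, as on canary's slow path.
    pub fn park_on_cs(&mut self, tid: u32, cs_ptr: u32, sid: u64) {
        self.trace.push(TraceEvent::WaitBegin { sid, timeout_ns: -1 });
        self.parked.push((tid, cs_ptr));
    }
}
```

Because the event is emitted only inside the park routine, fast-path entries leave the trace untouched, which is the property the always-emit variant lacks.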

Could we remove the owner_is_live shortcut?

The !owner_is_live heuristic in ours treats owner != 0 && find_by_tid(owner).is_none() as "free". At idx 104,604, this is not the triggered branch — the CS is genuinely uncontended (owner == 0 on the first probe), so removing it doesn't change behavior here.

Reading-error class #31 (documented per prompt) + #32 (NEW)

#31 Stale-canary-jsonl trap — always re-run canary fresh for cold-vs-cold measurements. The prompt established this.

#32 (NEW) Canary itself is non-deterministic across cold runs in contention-dependent regions. Cross-checking the 3 fresh canary jitter jsonls at tid=6 idx 104,595-104,612 confirms canary is structurally non-deterministic here:

| jitter | idx 104,606 event |
|---|---|
| 1 | wait.begin sid=75ae880ec432eb36 |
| 2 | kernel.return RtlEnterCriticalSection (fast path, no wait!) |
| 3 | kernel.call RtlLeaveCriticalSection (sequence shifted; the wait.begin shifted to idx 104,603 with sid=a25a16a4f6f547aa) |

jitter-2's behavior at idx 104,606 is bit-identical to ours. jitter-3 has the wait.begin at a different idx with a different SID — proving the contention pattern is host-scheduler-dependent in canary itself.

This means:

  1. The prompt's framing ("canary emits wait.begin, ours emits kernel.return") was based on ONE jitter sample (jitter-1). It is not a stable structural property of canary.
  2. Matched-prefix as a cross-engine metric is unreliable in regions where canary's contention is host-scheduler-driven.
  3. There is NO real engine bug to fix here. Ours's behavior matches canary jitter-2 at idx 104,606 verbatim.

Reading-error class #32: assuming canary determinism from ONE cold run. At least 2-3 cold samples are needed to distinguish "real divergence" from "scheduler-driven jitter window".
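The multi-sample check can be written down as a small decision rule. A sketch, assuming each cold run has been reduced to the event string observed at the index in question; Verdict and classify are hypothetical names and the two-sample minimum is the lower bound suggested above.

```rust
/// Outcome of comparing ours against several canary cold runs at one index.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    StableMatch,     // canary samples agree and ours matches them
    JitterWindow,    // canary samples disagree among themselves
    RealDivergence,  // canary samples agree, ours differs
    NeedMoreSamples, // fewer than two canary samples
}

pub fn classify(ours: &str, canary_samples: &[&str]) -> Verdict {
    if canary_samples.len() < 2 {
        return Verdict::NeedMoreSamples;
    }
    let first = canary_samples[0];
    let all_agree = canary_samples.iter().all(|s| *s == first);
    if !all_agree {
        Verdict::JitterWindow
    } else if first == ours {
        Verdict::StableMatch
    } else {
        Verdict::RealDivergence
    }
}
```

Applied to the three jitter samples at idx 104,606, the samples disagree with each other, so the rule reports a jitter window regardless of what ours emitted.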

Cascade outcome

  • A=verify canary's RtlEnterCriticalSection impl: PASS.
  • B=classify (A/B/C): PASS — (B), real contention.
  • C=land fix (or clean escalation): ESCALATION (per prompt authorized scope).
  • D=main matched-prefix > 104,606: N/A (no code change).

Recommendation for next session

C+20-escalation = open a parallel scheduler-determinism track:

  1. Add a per-CS-pointer "expected contention" inference from canary logs.
  2. Drive ours's scheduler to preempt tid=1 at each RtlEnter site where canary's matched call exhibits a wait.begin.
  3. Verify diff-tool absorbs as a structured "scheduling-trace replay" event class.
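Step 1 could start from something as simple as scanning a single thread's canary stream for RtlEnterCriticalSection calls that see a wait.begin before they return. The sketch below assumes a pre-filtered, per-tid log and non-nested calls; LogEv and contended_enter_sites are hypothetical names, and mapping each site back to a CS pointer is left out.

```rust
/// Reduced per-thread log event.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LogEv {
    KernelCall(String),
    WaitBegin,
    KernelReturn(String),
}

/// Positions of RtlEnterCriticalSection kernel.call events that are followed
/// by a wait.begin before the matching kernel.return (contended entries).
pub fn contended_enter_sites(log: &[LogEv]) -> Vec<usize> {
    let mut sites = Vec::new();
    let mut open: Option<usize> = None; // index of the call currently in flight
    for (idx, ev) in log.iter().enumerate() {
        match ev {
            LogEv::KernelCall(name) if name == "RtlEnterCriticalSection" => {
                open = Some(idx);
            }
            LogEv::WaitBegin => {
                // Record the call once; take() prevents double counting.
                if let Some(call_idx) = open.take() {
                    sites.push(call_idx);
                }
            }
            LogEv::KernelReturn(name) if name == "RtlEnterCriticalSection" => {
                open = None;
            }
            _ => {}
        }
    }
    sites
}
```

Given the non-determinism recorded under class #32, the resulting site list is only meaningful per canary sample, so any replay built on it would have to pick one sample as the reference.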

In parallel, address D-NEW-2 (KeWaitForSingleObject timeout_ns sign/scale asymmetry on tid=12→7 idx=3) — a small ε-class encoding fix that's independent of scheduler determinism.

Also worth landing as a small prophylactic patch (NOT in C+20): wire wait.begin into ours's rtl_enter_critical_section park path so that whenever the slow path IS triggered, ours emits the schema event. Defer until first such case manifests.