Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):

- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

# Investigation Notes: Scheduler-Determinism Plan (2026-05-18)

Source citations and probe results from the Phase-1 investigation. All claims here are verified against source or runtime data; speculation is flagged.

## 1. Canary threading & scheduling model

**Verdict**: 1-host-thread-per-XThread; scheduling delegated to host OS (Wine on Linux). No internal scheduler.

- Each guest `XThread` owns a host `xe::threading::Thread` (`xenia-canary/src/xenia/kernel/xthread.h:476`).
- POSIX backend: pthread per XThread (`xenia-canary/src/xenia/base/threading_posix.cc`).
- TLS bridge: `thread_local XThread* current_xthread_tls_` (`xthread.cc:105`). `XThread::TryGetCurrentThread()` returns null when called outside a guest thread (C+15-α robustness fix for the boot-time emitter).
- Tid assignment: `thread_id_(++next_xthread_id_)` in ctor (`xthread.cc:62`).
- KPCR per XThread, allocated at `pcr_address_` (`xthread.h:506`); contains scheduler-like state mirroring real Xenon KPRCB.
- `CheckQuantumAndDecay()` (`xthread.h:437`) fires roughly every 20 ms via `KernelState`'s timer. It simulates Xenon priority decay but does NOT preempt; it runs on whichever host thread the host OS schedules.

**No internal scheduler.** No `lockstep`, `deterministic`, or `replay` cvar exists (grep confirmed across `xenia-canary/src/xenia/`).

## 2. Canary clock infrastructure

**Verdict**: wallclock-driven (rdtsc or platform API). Optional scaling, no full deterministic mode.

- Canonical class `xe::Clock` (`base/clock.h:30`).
- `Clock::QueryHostTickCount()` (`base/clock.cc:128`): rdtsc on x64 if `clock_source_raw=true`, else platform API.
- `Clock::QueryGuestSystemTime()` (`clock.h:82`): host time adjusted by `guest_time_scalar_`.
- `KeQuerySystemTime_entry` (`xboxkrnl_threading.cc:459`): declared `void`, writes via OUT pointer; reads `Clock::QueryGuestSystemTime()`. (C+1 verified parity with our void-export framing.)
- `KeWaitForSingleObject_entry` (`xboxkrnl_threading.cc:1003`): reads `*timeout_ptr` as i64×100 → ns (C+23 verified that ours computes the same value).
- Cvars: `clock_no_scaling` (`base/clock.cc:24`), `clock_source_raw` (`base/clock.cc:28`). Neither makes the clock deterministic across Wine runs; wallclock drift is irreducible.

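The `i64×100 → ns` conversion noted for `KeWaitForSingleObject_entry` can be illustrated with a minimal Python sketch. The helper name is hypothetical, and mapping a null timeout pointer to the `-1` sentinel is an assumption based on the `timeout_ns=-1` (indefinite) values seen in the trace, not something read from either engine's source.

```python
from typing import Optional

NS_PER_TICK = 100      # one NT timeout tick is 100 ns
INDEFINITE_NS = -1     # assumed sentinel for "no timeout pointer supplied"


def timeout_ticks_to_ns(raw_ticks: Optional[int]) -> int:
    """Scale a signed 64-bit tick count to nanoseconds (hypothetical helper).

    None stands in for a null timeout pointer and maps to the sentinel.
    The sign is preserved, so a relative (negative) timeout stays negative.
    """
    if raw_ticks is None:
        return INDEFINITE_NS
    return raw_ticks * NS_PER_TICK


ten_ms_relative = timeout_ticks_to_ns(-100_000)   # 100_000 ticks = 10 ms
no_timeout = timeout_ticks_to_ns(None)
```

Because every scaled value is a multiple of 100, a real timeout can never collide with the `-1` sentinel in this sketch.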
## 3. Canary wait primitives

**Verdict**: `xe::threading::Wait` → `pthread_cond_timedwait` (POSIX) / `WaitForMultipleObjects` (Win32).

- `xeKeWaitForSingleObject()` (`xboxkrnl_threading.cc:969`) → `XObject::Wait()` → `xe::threading::Wait()` → host primitive.
- Whether contention happens is purely host-OS-scheduler-driven. Reading-error #32 from C+20 documents this: 3 fresh canary cold runs at tid=6 idx 104,606 showed different patterns (no wait.begin / wait.begin contended / offset-shifted).

## 4. Canary RtlEnterCriticalSection: spin-then-wait (DISCOVERED)

[xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_rtl.cc:596-633](../../../xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_rtl.cc), `RtlEnterCriticalSection_entry` (condensed):

```c
uint32_t spin_count = cs->header.absolute * 256;  // game-supplied spin count
// Recursive acquire: this thread already owns the lock.
if (cs->owning_thread == cur_thread) { recursion++; return; }
// Fast path: spin, trying to swing lock_count from -1 (free) to 0 (held).
while (spin_count--) {
  if (xe::atomic_cas(-1, 0, &cs->lock_count)) {
    /* acquired via spin: claim ownership and skip the slow path */
    cs->owning_thread = cur_thread; cs->recursion_count = 1;
    return;
  }
}
// Spin budget exhausted: register as a waiter, block if the lock is held.
if (xe::atomic_inc(&cs->lock_count) != 0) {
  xeKeWaitForSingleObject(...);  // slow path
}
cs->owning_thread = cur_thread; cs->recursion_count = 1;
```

**Implication**: under low contention, spin succeeds and no `wait.begin` is emitted. Under high contention, spin fails and `wait.begin` fires. Whether spin succeeds depends on host-OS timing, so the outcome is non-deterministic across Wine runs.

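The timing dependence described in the implication can be made concrete with a toy model. This is a minimal sketch, not canary's code: `enter_outcome` and `free_after_polls` are hypothetical names, and the model assumes the lock is held by another thread when the attempt starts.

```python
from typing import Optional

SPIN_MULTIPLIER = 256  # from `cs->header.absolute * 256` in the excerpt


def enter_outcome(header_absolute: int, free_after_polls: Optional[int]) -> str:
    """Classify one acquisition attempt against a lock held by a peer.

    free_after_polls is the number of failed polls before the holder
    releases the lock (None = it is never released during the spin).
    """
    spin_budget = header_absolute * SPIN_MULTIPLIER
    if free_after_polls is not None and free_after_polls < spin_budget:
        return "spin"  # a poll inside the loop succeeds; no wait.begin
    return "wait"      # budget exhausted; slow path emits wait.begin


early_release = enter_outcome(4, 500)    # budget 1024, released in time
late_release = enter_outcome(4, 5000)    # budget 1024, released too late
zero_spin = enter_outcome(0, 0)          # zero spin count never spins
```

The same guest code path yields `"spin"` or `"wait"` depending only on `free_after_polls`, which on a real host is set by OS scheduling. A zero spin count always falls through to the slow path in this model.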
## 5. Ours threading & scheduling

**Verdict**: single host thread; 6 cooperative HW slots; deterministic by construction.

- `xenia-rs/crates/xenia-cpu/src/scheduler.rs`:
  - `OrderMode { Fixed, Seeded { seed } }` (lines 230-258).
  - `round_schedule()` (lines 710-740): returns slot-id vector; advances `rotation_cursor` by 1.
  - `park_current(BlockReason)` (line 808).
  - `wake_ref(ThreadRef)` (line 831).
- M3 optional `--parallel` mode (6 workers + coordinator, 7-party phaser) exists but is not default.

**Determinism foundation**: 23 phases of stabilization went into the `e1dfcb15…` cold digest, which reproduces across 3 runs.

## 6. Ours RtlEnterCriticalSection: NO spin

[xenia-rs/crates/xenia-kernel/src/exports.rs:2886-2946](../../../xenia-rs/crates/xenia-kernel/src/exports.rs), `rtl_enter_critical_section`:

```rust
let owner = mem.read_u32(cs_ptr + CS_OFFS_OWNING_THREAD);
let owner_is_live = owner != 0 && state.scheduler.find_by_tid(owner).is_some();
if owner == 0 || !owner_is_live {
    /* claim immediately: write owning_thread, lock_count=0, recursion=1 */
    return;
}
if owner == current_tid { /* recursive lock: increment counts */ return; }
// Truly contended against a live peer: park IMMEDIATELY (no spin).
state.cs_waiters.entry(cs_ptr).or_default().push(current_ref);
state.scheduler.park_current(BlockReason::CriticalSection(cs_ptr));
```

**Asymmetry summary**: canary spins up to 256×N times (N = the CS's `header.absolute`) before parking; ours parks immediately. Under the cooperative scheduler, our tid=1 runs monolithically until it parks, so no other thread has a chance to acquire the CS first. Hence at 104,607 the CS is free when tid=1 tries, whereas in canary it was held by another thread that had been scheduled in between.

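The asymmetry can be reproduced in a few lines with a toy single-CS replay. This is an illustrative sketch only: the schedule lists, the three-op programs, and `contended_enters` are hypothetical and model neither engine's scheduler in detail.

```python
def contended_enters(schedule, programs):
    """Replay `schedule` (a list of tids, one op per entry) over a single CS.

    Returns the tids that reached an `enter` while another thread held the
    CS, i.e. the attempts that would take the slow path.
    """
    owner = None
    pc = {tid: 0 for tid in programs}   # next-op index per thread
    contended = []
    for tid in schedule:
        ops = programs[tid]
        if pc[tid] >= len(ops):
            continue                    # thread already finished
        op = ops[pc[tid]]
        if op == "enter":
            if owner is not None and owner != tid:
                contended.append(tid)   # blocked: retries on its next turn
                continue
            owner = tid
        elif op == "leave" and owner == tid:
            owner = None
        pc[tid] += 1
    return contended


PROGRAMS = {1: ["enter", "work", "leave"], 2: ["enter", "work", "leave"]}

# Cooperative: each thread runs all of its ops before the next one starts.
cooperative = contended_enters([1, 1, 1, 2, 2, 2], PROGRAMS)
# Preemptive: thread 2 is scheduled while thread 1 still holds the CS.
preemptive = contended_enters([1, 2, 1, 1, 2, 2, 2], PROGRAMS)
```

With run-to-completion ordering no `enter` is ever contended, while a single mid-section switch is enough to produce a contended `enter` for thread 2.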
## 7. Ours clock infrastructure

**Verdict**: fixed FILETIME constant. No wallclock dependency in the hot path.

- `KeQuerySystemTime` returns `132_500_000_000_000_000` (a FILETIME in mid-November 2020) via OUT-ptr (`exports.rs:628`).
- `KeQueryInterruptTime` returns `0x0000_0001_0000_0000` (`exports.rs:504`).
- `event_log.rs` uses `Instant::now()` for the observability `host_ns` field, which is non-deterministic but not consumed by the matched-prefix metric.

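As a sanity check on the fixed `KeQuerySystemTime` value, the constant can be decoded as a FILETIME (100 ns ticks since 1601-01-01 UTC). The helper name below is hypothetical; only the constant comes from these notes.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)
FIXED_SYSTEM_TIME = 132_500_000_000_000_000  # value returned by ours


def filetime_to_utc(ticks: int) -> datetime:
    """Convert 100 ns FILETIME ticks to a timezone-aware UTC datetime."""
    # 10 ticks per microsecond; integer division drops sub-microsecond ticks.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


decoded = filetime_to_utc(FIXED_SYSTEM_TIME)
print(decoded.date())  # 2020-11-16
```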
## 8. Sylpheed workload profile (probe)

Ran on `xenia-rs/audit-runs/phase-c22-rtl-enter-leave-control-flow/ours-cold.jsonl` (121,569 events):

| event | count | notes |
|---|---|---|
| RtlEnterCriticalSection (kernel.call) | 19,494 | ≈80% of all kernel.calls |
| RtlLeaveCriticalSection (kernel.call) | 19,492 | matches Enter (off-by-2 from boot edge) |
| NtClose | 160 | |
| NtCreateEvent | 103 | |
| NtReleaseSemaphore | 99 | |
| NtQueryInformationFile | 93 | |
| NtWaitForMultipleObjectsEx | 92 | |
| KeWaitForSingleObject | 5 | |
| KeWaitForMultipleObjects | 1 | |
| **KeQuerySystemTime** | **2** | clock-light workload |
| KeQueryPerformanceFrequency | 6 | |
| KeQueryPerformanceCounter | 0 | |
| KeQueryInterruptTime | 0 | |
| KeDelayExecutionThread | 0 | |
| NtYieldExecution | 0 | |
| wait.begin events (all kinds) | 34 | most with `timeout_ns=-1` (indefinite) |

**Implications**:
- Sylpheed is CS-dominated. Stage-1 emitter on RtlEnterCS captures the dominant signal.
- Sylpheed barely touches the clock. Approach A (cycle clock in canary) addresses ≈2 events out of 121,569. Wrong target.
- Wait surface is small (34 events). Wait-side replay is low-value; scope to CS only.

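A probe of this kind can be sketched as a per-export tally over the JSONL stream. The field names `kind` and `name` and the inline sample records are assumptions for illustration; the real event schema may differ, and the counts in the table above come from the actual trace, not from this sketch.

```python
import json
from collections import Counter
from typing import Iterable


def count_kernel_calls(lines: Iterable[str]) -> Counter:
    """Tally kernel.call events by export name from JSONL text lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines
        event = json.loads(line)
        if event.get("kind") == "kernel.call":     # assumed field name
            counts[event.get("name", "<unnamed>")] += 1
    return counts


# Hypothetical sample records standing in for a trace file.
sample = [
    '{"kind": "kernel.call", "name": "RtlEnterCriticalSection", "tid": 1}',
    '{"kind": "kernel.call", "name": "RtlLeaveCriticalSection", "tid": 1}',
    '{"kind": "wait.begin", "tid": 1, "timeout_ns": -1}',
    '{"kind": "kernel.call", "name": "RtlEnterCriticalSection", "tid": 2}',
]
counts = count_kernel_calls(sample)
```

Against a real trace, the same function could be fed an open file handle, since a text file iterates line by line.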
## 9. The 104,607 divergence (re-verified)

From C+22 memory + jitter jsonl re-analysis:

| sample | tid=6 events 104,604..104,615 (import.call only) |
|---|---|
| c21 archived | E E L L |
| canary jitter-1 | E (wait.begin slow path) E L L |
| canary jitter-2 | E E L L |
| canary jitter-3 | (shifted) E E L L |
| fresh c22 | E (wait.begin slow path) E L L |

All canary samples have the EXTRA nested RtlEnterCriticalSection (second `E` before the final `L L`). Ours never does; it goes `E L NtClose`. Structural divergence post-absorber-engagement.

Shared dispatcher: canary's wait.begin has `handles_semantic_ids=['75ae880ec432eb36']`. This is the CS embedded Event dispatcher, lazy-wrapped by `XObject::GetNativeObject`. Same SID computed via C+18 shared-global recipe in both engines.

## 10. Cvar inventory (canary side)

Grep across `xenia-canary/src/xenia/` for `DEFINE_bool|DEFINE_int|DEFINE_uint|DEFINE_string`:

- `clock_no_scaling` (`base/clock.cc:24`)
- `clock_source_raw` (`base/clock.cc:28`)
- `ignore_thread_priorities` (`kernel/xthread.cc:30`)
- `ignore_thread_affinities` (`kernel/xthread.cc:33`)
- `stack_size_multiplier_hack` (`kernel/xthread.cc:37`)
- `main_xthread_stack_size_multiplier_hack` (`kernel/xthread.cc:39`)
- `phase_a_event_log_path` (`cpu/cpu_flags.cc:84`): Phase A trace gate
- `phase_a_event_log_mem_writes` (`cpu/cpu_flags.cc:88`): reserved, not wired
- `phase_b_snapshot_dir` (`cpu/cpu_flags.cc:94`): Phase B image snapshot
- `phase_b_snapshot_and_exit` (`cpu/cpu_flags.cc:100`)

No `lockstep`, `deterministic`, `replay`, `single_thread`, or `cooperative` cvars exist. **No built-in deterministic mode.**

## 11. Diff-tool absorber state (post-C+21)

`xenia-rs/tools/diff-events/diff_events.py` (767 LOC):

- `collect_shared_global_sids()`: pre-pass union of (a) recipe-matching SIDs (C+18) and (b) cross-tid usage heuristic: any SID used by handle.create OR wait.begin on ≥2 distinct tids.
- `is_shared_global_wait_begin()`: classifies a wait.begin as floating if any handle_sid is in the shared-global set.
- `diff_one_tid()`: floating-absorbs `handle.create` (C+18) and `wait.begin` (C+21) on kind mismatches.
- `SKIP_PAYLOAD_FIELDS_BY_KIND`: skips engine-local fields per kind.

**Reading-error #23 boundary**: absorbing the post-wait Enter/Leave block (canary's extra `E` then `L` at 104,610-104,615) would be folding real guest behavior, not transient observation. The plan's Stage 3 instead makes ours produce the same observation by forcing ours into the same contended state.

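The cross-tid usage heuristic, part (b) of the pre-pass, can be sketched as follows. This is not the code in `diff_events.py`: the function names, the `kind` / `tid` / `sids` field names, and the sample events are hypothetical, and the recipe-matching part (a) is left out.

```python
from collections import defaultdict

TRACKED_KINDS = ("handle.create", "wait.begin")


def cross_tid_shared_sids(events, min_tids=2):
    """Return SIDs touched by tracked event kinds on >= min_tids distinct tids."""
    tids_by_sid = defaultdict(set)
    for ev in events:
        if ev.get("kind") not in TRACKED_KINDS:
            continue
        for sid in ev.get("sids", ()):          # assumed payload field
            tids_by_sid[sid].add(ev["tid"])
    return {sid for sid, tids in tids_by_sid.items() if len(tids) >= min_tids}


def wait_begin_is_floating(ev, shared_sids):
    """A wait.begin floats if any of its SIDs is in the shared-global set."""
    return ev.get("kind") == "wait.begin" and any(
        sid in shared_sids for sid in ev.get("sids", ())
    )


# Hypothetical events: SID "aa" is used on two tids, SID "bb" on one.
events = [
    {"kind": "handle.create", "tid": 6, "sids": ["aa"]},
    {"kind": "wait.begin", "tid": 9, "sids": ["aa"]},
    {"kind": "wait.begin", "tid": 6, "sids": ["bb"]},
]
shared = cross_tid_shared_sids(events)
```

In this sample only the wait on `"aa"` would be eligible for floating absorption; the wait on `"bb"` stays position-sensitive.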
## 12. Tid-chain mapping (stable per memory baseline)

| canary | ours |
|---|---|
| 6 | 1 |
| 4 | 11 |
| 7 | 2 |
| 12 | 7 |
| 14 | 9 |
| 15 | 10 |

This is a *display* convention for cross-engine alignment in diff reports. In the wire format, each engine emits its native tid. The manifest in Stage 2-3 keys on the source-side native tid; no translation is needed since each side consumes events it produced.

## 13. Methodology rules in force

- **Reading-error #28** (verify source first): applied. Both engines' RtlEnterCS implementations were read before designing.
- **Reading-error #32** (canary non-deterministic in contention regions): characterized. 3 jitter samples documented.
- **Reading-error #33** (canary cache lives in binary-dir under wine): not relevant here.
- **Reading-error #34** (use `.iso` not loose `.xex`): apply in all validation runs.
- **Cold-vs-cold protocol**: canary `--mute=true`, ours `XENIA_CACHE_WIPE=1`.
- **Stop hook rename**: rename background binaries before any backgrounded run (e.g. `xrs-verify-stage0`, `xrs-replay`).

## 14. Confidence calibration

| claim | source-verified | probe-verified | confidence |
|---|---|---|---|
| Canary spins, ours doesn't | yes (xboxkrnl_rtl.cc:613 + exports.rs:2927) | n/a (static) | high |
| Sylpheed clock-light | n/a | yes (kernel.call counts) | high |
| 104,607 divergence is structural | yes (C+22 mech) | yes (5 canary samples consistent) | high |
| C+18 shared-global SID is cross-engine identical | yes (event_log.rs + event_log.cc) | implicit (matched in diff reports) | high |
| Canary has no deterministic mode cvar | yes (grep) | n/a | high |
| Stage-0 quantum spike may unblock | no (untested) | no | medium |
| Stage-3 manifest replay unblocks | no | no | medium-high (mechanism sound, integration risk) |
| Sister chain regression ≤5 acceptable | n/a | n/a | open question for user |

## Open unknowns (deferred to implementation)

1. The exact `cs_ptr` of the contended CS at canary tid=6 idx 104,608 is not directly emitted by the current schema (the `wait.begin` payload carries SID but not the raw pointer). Stage 1's `cs_ptr` field plugs this gap.
2. Does Sylpheed initialize the contended CS with `RtlInitializeCriticalSectionAndSpinCount(spin_count > 0)` or just `RtlInitializeCriticalSection` (zero spin)? This affects whether canary's spin path can succeed at this site. Probe by reading the CS's `header.absolute` field during a canary run.
3. The dispatcher Event's first-toucher tid differs across cold runs (canary tid=9 in one, others in others). Is it stable enough across cold runs of the SAME canary binary to be a reliable replay anchor? Stage 1 round-trip validation will reveal.
4. Does the M3 `--parallel` mode in ours reproduce the same divergence pattern? Untested. Out of scope for this plan but worth a future probe.