handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,30 @@
|
||||
# Phase C+22 broad-impact (2026-05-18)
|
||||
|
||||
## Resolved
|
||||
- None (escalation, no engine change).
|
||||
|
||||
## Advanced
|
||||
- None on main chain (C+21 baseline 104,607 preserved).
|
||||
- None on sister chains (11/32/3/41/16 all preserved).
|
||||
|
||||
## Persisted
|
||||
- The scheduler-determinism class persists: this is the same root
  cause that C+20 escalated. Multiple downstream effects of the same
  asymmetry are now identified: idx 104,608 (wait.begin, absorbed by
  C+21) and idx 104,610 (post-wait nested-Enter branch, NOT
  absorbable per reading-error #23).
|
||||
|
||||
## NEW
|
||||
- Reading-error class **#34**: cold-run determinism depends on
|
||||
input path form. Running ours against the loose `default.xex`
|
||||
rather than the parent `.iso` produces a different boot
|
||||
trajectory (40× more imports, 1.6M unimpl warnings, different
|
||||
thread-create sequence). All cold-vs-cold protocol runs MUST
|
||||
use the `.iso` path. ✓ documented in `investigation.md` and
|
||||
`cold-vs-cold-result.md`.
|
||||
|
||||
## Permanent infrastructure contribution
|
||||
- None (no engine, diff-tool, schema, or emitter changes).
|
||||
|
||||
## Tests
|
||||
- 204 (unchanged from C+19/C+21).
|
||||
@@ -0,0 +1,95 @@
|
||||
# Phase C+22 cold-vs-cold result (2026-05-18)
|
||||
|
||||
## Outcome: NO engine change. ESCALATE.
|
||||
|
||||
Verified C+21 absorber engaged correctly on the fresh cold-vs-cold
|
||||
measurement; main matched-prefix is stable at **104,607**
|
||||
(no change from C+21 baseline). Divergence at canary idx 104,610
|
||||
(post-absorb) vs ours idx 104,607 classified as **(A) ours
|
||||
fast-paths while canary contends → state mutated during the wait
|
||||
→ different post-acquire branch**. Same root cause as C+20.
|
||||
|
||||
## Matched-prefix table (vs C+21 baseline)
|
||||
|
||||
| chain                          | C+21    | C+22 (fresh c22) | delta |
|--------------------------------|---------|------------------|-------|
| canary tid=6 → ours tid=1 main | 104,607 | **104,607**      | 0     |
| canary tid=4 → ours tid=11     | 11      | 11               | 0     |
| canary tid=7 → ours tid=2      | 32      | 32               | 0     |
| canary tid=12 → ours tid=7     | 3       | 3                | 0     |
| canary tid=14 → ours tid=9     | 41      | 41               | 0     |
| canary tid=15 → ours tid=10    | 16      | 16               | 0     |
|
||||
|
||||
## Floating-event absorption counts (fresh c22)
|
||||
|
||||
| chain                          | floating_create (c/o) | floating_wait (c/o) |
|--------------------------------|-----------------------|---------------------|
| canary tid=6 → ours tid=1 main | 1 / 0                 | 2 / 0               |
| canary tid=15 → ours tid=10    | 0 / 1                 | 0 / 0               |
| others                         | 0 / 0                 | 0 / 0               |
|
||||
|
||||
C+18 absorber engaged on main chain (1 canary handle.create
|
||||
floated) and on tid=15→10 (1 ours handle.create floated).
|
||||
C+21 absorber engaged on main chain (2 canary wait.begin events
|
||||
floated — the contention slow-path emitted them in this run).
|
||||
|
||||
## Cold-stable invariants
|
||||
|
||||
- **ours-cold byte-identical to C+19 archive**: 121,569 events
|
||||
match post-normalization (host_ns/guest_cycle excluded).
|
||||
Reading-error class #34 (cold-run determinism depends on input
|
||||
path form) discovered + documented; ALL future cold runs must
|
||||
use `.iso` not `.xex`.
|
||||
- **Phase B image hash**: `ea8d160e9369328a5b922258a92113efb8d7ce3e1a5c12cc521e375985c91c18`,
  unchanged (no engine source change).
|
||||
- **Engine source**: UNCHANGED in both ours and canary. No new
|
||||
exports, no new flags, no diff-tool changes.
|
||||
|
||||
## Verification of NOT-jitter
|
||||
|
||||
Pattern in canary's tid=6 post-loop region (idx 104,604-104,615
|
||||
`import.call` events only):
|
||||
|
||||
| sample          | events                         |
|-----------------|--------------------------------|
| c21 archived    | E E L L                        |
| canary jitter-1 | E (wait.begin slow path) E L L |
| canary jitter-2 | E E L L                        |
| canary jitter-3 | (shifted) E E L L              |
| fresh c22       | E (wait.begin slow path) E L L |
|
||||
|
||||
All canary samples take the EXTRA nested RtlEnter after the
|
||||
post-loop `E` at 104,604. Ours never does — it goes `E L NtClose`.
|
||||
This is a real guest-code branch divergence, NOT diff-tool jitter.
|
||||
|
||||
## Cascade outcome
|
||||
|
||||
- A=verify divergence is NOT jitter: PASS
|
||||
- B=classify (A/B/C/D): PASS — **(A) scheduler-determinism +
|
||||
post-wait state-mutation effect**
|
||||
- C=land fix or escalate cleanly: ESCALATION (per prompt
|
||||
authorized fallback)
|
||||
- D=main matched-prefix > 104,607: N/A (no engine change)
|
||||
|
||||
## Files
|
||||
|
||||
- `investigation.md` — detailed framing, mechanism, rejected
|
||||
fixes, reading-error #34 documentation, next-targets.
|
||||
- `diff-cold-vs-cold.md` — full Phase A diff report (fresh c22
|
||||
cold-vs-cold).
|
||||
- `canary-binary-cache-pre-wipe.tar.gz` — pre-wipe canary
|
||||
binary-dir cache backup (restored post-run).
|
||||
- `canary-xdg-cache-pre-wipe.tar.gz` — pre-wipe canary XDG
|
||||
cache backup (restored post-run).
|
||||
- `ours-cold-stdout.log` / `canary-cold-stdout.log` — run logs.
|
||||
|
||||
## Next-target recommendation
|
||||
|
||||
**C+23 = D-NEW-2** (`KeWaitForSingleObject` `timeout_ns`
|
||||
sign/scale asymmetry on canary tid=12 → ours tid=7 idx=3):
|
||||
canary=-30000000 vs ours=429466729600. Independent of
|
||||
scheduler-determinism. Out of scope for C+22 per the C+22
|
||||
prompt's explicit "You may NOT ... Fix D-NEW-2 in this
|
||||
session." Likely ~20-40 LOC in `ke_wait_for_single_object`'s
|
||||
timeout-pointer dereference.
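
The two logged `timeout_ns` values are arithmetically related, which may help scope the C+23 fix. The sketch below is only an observation on the numbers in this report. It assumes the guest timeout is a signed 64-bit count of 100 ns ticks (the NT convention, where a negative value means a relative wait) and that ours keeps only the low 32 bits, read as unsigned, before scaling. Neither assumption has been checked against `ke_wait_for_single_object`, and both helper names are hypothetical.

```python
# Hypothetical reconstruction of the D-NEW-2 numbers; not taken from engine source.
TICK_NS = 100  # assumed: guest timeouts are expressed in 100 ns ticks


def timeout_ns_signed(ticks: int) -> int:
    """Scale a signed tick count to nanoseconds (reproduces canary's value)."""
    return ticks * TICK_NS


def timeout_ns_truncated_u32(ticks: int) -> int:
    """Scale only the low 32 bits, read as unsigned (reproduces ours's value)."""
    return (ticks & 0xFFFF_FFFF) * TICK_NS


guest_ticks = -300_000  # a 30 ms relative wait under the assumed convention

print(timeout_ns_signed(guest_ticks))         # -30000000
print(timeout_ns_truncated_u32(guest_ticks))  # 429466729600
```

If this reading holds, the asymmetry is a sign-extension (or 32-bit read) issue rather than a unit-scale issue, since both sides apply the same factor of 100.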
|
||||
@@ -0,0 +1,135 @@
|
||||
# Phase A diff report
|
||||
|
||||
**This report is the output of Phase A's diff harness. Divergences
|
||||
shown here are INPUT for Phase B (first-divergence localization),
|
||||
not findings of Phase A.** Phase A's job is to make the harness
|
||||
itself correct, not to analyze what it surfaces.
|
||||
|
||||
## Summary
|
||||
|
||||
| canary_tid | ours_tid | matched | canary_total | ours_total | first_divergence_at | floating_create (c/o) | floating_wait (c/o) |
|---|---|---|---|---|---|---|---|
| 4 | 11 | 11 | 136165 | 11 | — | 0/0 | 0/0 |
| 6 | 1 | 104607 | 250000 | 108507 | 104607 | 1/0 | 2/0 |
| 7 | 2 | 32 | 32 | 33 | — | 0/0 | 0/0 |
| 12 | 7 | 3 | 25314 | 5 | 3 | 0/0 | 0/0 |
| 14 | 9 | 41 | 250000 | 77 | 41 | 0/0 | 0/0 |
| 15 | 10 | 16 | 250000 | 17 | — | 0/1 | 0/0 |
|
||||
|
||||
*`floating_create (c/o)` counts shared-global `handle.create` events absorbed by Phase C+18 cross-tid SID matching. `floating_wait (c/o)` counts `wait.begin` events on shared-global dispatchers absorbed by Phase C+21 (scheduling-jitter window — canary's contention slow path may fire while ours fast-paths or vice versa). See schema-v1.md §"Shared-global SIDs" and §"Wait-begin floating absorb".*
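
The footnote above describes absorption only in words. As a toy model, the matched-prefix walk with floating absorption can be sketched as below. This is not the logic of `diff_events.py`: the `shared_global` flag, the `key` comparison on kind and export name, and the function names are all assumptions made for illustration.

```python
from collections import Counter

FLOATING_KINDS = {"handle.create", "wait.begin"}


def key(event):
    # Compare on kind and export name only; SIDs legitimately differ per engine.
    return (event["kind"], event.get("name"))


def is_floating(event):
    # "shared_global" is a hypothetical pre-computed flag for this sketch.
    return event["kind"] in FLOATING_KINDS and event.get("shared_global", False)


def matched_prefix(canary, ours):
    """Walk both streams, skipping floating events on either side."""
    i = j = matched = 0
    floated = Counter()
    while i < len(canary) and j < len(ours):
        if key(canary[i]) == key(ours[j]):
            i, j, matched = i + 1, j + 1, matched + 1
        elif is_floating(canary[i]):
            floated[("canary", canary[i]["kind"])] += 1
            i += 1
        elif is_floating(ours[j]):
            floated[("ours", ours[j]["kind"])] += 1
            j += 1
        else:
            break  # genuine divergence: neither side can be absorbed
    return matched, i, j, floated


canary = [
    {"kind": "import.call", "name": "RtlEnterCriticalSection"},
    {"kind": "handle.create", "shared_global": True},
    {"kind": "wait.begin", "shared_global": True},
    {"kind": "wait.begin", "shared_global": True},
    {"kind": "import.call", "name": "RtlEnterCriticalSection"},
]
ours = [
    {"kind": "import.call", "name": "RtlEnterCriticalSection"},
    {"kind": "import.call", "name": "RtlLeaveCriticalSection"},
]
matched, canary_idx, ours_idx, floated = matched_prefix(canary, ours)
```

In this toy input the walk stops with the canary index three ahead of the ours index, mirroring the 104,610 vs 104,607 offset in the table: the three floated events are skipped, but the nested Enter is a non-floating event and ends the prefix.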
|
||||
|
||||
## canary_tid=4 → ours_tid=11
|
||||
|
||||
No divergence within the 11 compared events (canary has 136165, ours has 11).
|
||||
|
||||
## canary_tid=6 → ours_tid=1
|
||||
|
||||
First divergence at `tid_event_idx=104607`: payload.ord: canary=293 ours=304
|
||||
|
||||
**Pre-context (last 5 matching events):**
|
||||
```
|
||||
canary: [104604] kernel.call RtlLeaveCriticalSection
|
||||
ours: [104602] kernel.call RtlLeaveCriticalSection
|
||||
canary: [104605] kernel.return RtlLeaveCriticalSection
|
||||
ours: [104603] kernel.return RtlLeaveCriticalSection
|
||||
canary: [104606] import.call RtlEnterCriticalSection
|
||||
ours: [104604] import.call RtlEnterCriticalSection
|
||||
canary: [104607] kernel.call RtlEnterCriticalSection
|
||||
ours: [104605] kernel.call RtlEnterCriticalSection
|
||||
canary: [104609] kernel.return RtlEnterCriticalSection
|
||||
ours: [104606] kernel.return RtlEnterCriticalSection
|
||||
```
|
||||
|
||||
**Divergent event:**
|
||||
```
|
||||
canary: [104610] import.call RtlEnterCriticalSection
|
||||
ours: [104607] import.call RtlLeaveCriticalSection
|
||||
```
|
||||
|
||||
**Next event after the divergence (if any):**
|
||||
```
|
||||
canary: [104611] kernel.call RtlEnterCriticalSection
|
||||
ours: [104608] kernel.call RtlLeaveCriticalSection
|
||||
```
|
||||
|
||||
**Raw events (JSON):**
|
||||
```json
|
||||
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1500126500, "kind": "import.call", "payload": {"module": "xboxkrnl.exe", "name": "RtlEnterCriticalSection", "ord": 293}, "schema_version": 1, "tid": 6, "tid_event_idx": 104610}
|
||||
{"deterministic": true, "engine": "ours", "guest_cycle": 5517276, "host_ns": 488870591, "kind": "import.call", "payload": {"module": "xboxkrnl.exe", "name": "RtlLeaveCriticalSection", "ord": 304}, "schema_version": 1, "tid": 1, "tid_event_idx": 104607}
|
||||
```
|
||||
|
||||
## canary_tid=7 → ours_tid=2
|
||||
|
||||
No divergence within the 32 compared events (canary has 32, ours has 33).
|
||||
|
||||
## canary_tid=12 → ours_tid=7
|
||||
|
||||
First divergence at `tid_event_idx=3`: payload.timeout_ns: canary=-30000000 ours=429466729600
|
||||
|
||||
**Pre-context (last 5 matching events):**
|
||||
```
|
||||
canary: [0] import.call KeWaitForSingleObject
|
||||
ours: [0] import.call KeWaitForSingleObject
|
||||
canary: [1] kernel.call KeWaitForSingleObject
|
||||
ours: [1] kernel.call KeWaitForSingleObject
|
||||
canary: [2] handle.create sid=c49d8f0ab90401ea
|
||||
ours: [2] handle.create sid=6e3d96c5a52bf429
|
||||
```
|
||||
|
||||
**Divergent event:**
|
||||
```
|
||||
canary: [3] wait.begin {'handles_semantic_ids': ['c49d8f0ab90401ea'], 'timeout_ns': -30000000, 'alertable': False, 'wait_type': 'any'}
|
||||
ours: [3] wait.begin {'handles_semantic_ids': ['6e3d96c5a52bf429'], 'timeout_ns': 429466729600, 'alertable': False, 'wait_type': 'any'}
|
||||
```
|
||||
|
||||
**Next event after the divergence (if any):**
|
||||
```
|
||||
canary: [4] kernel.return KeWaitForSingleObject
|
||||
ours: [4] kernel.return KeWaitForSingleObject
|
||||
```
|
||||
|
||||
**Raw events (JSON):**
|
||||
```json
|
||||
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1585723800, "kind": "wait.begin", "payload": {"alertable": false, "handles_semantic_ids": ["c49d8f0ab90401ea"], "timeout_ns": -30000000, "wait_type": "any"}, "schema_version": 1, "tid": 12, "tid_event_idx": 3}
|
||||
{"deterministic": true, "engine": "ours", "guest_cycle": 0, "host_ns": 498180107, "kind": "wait.begin", "payload": {"alertable": false, "handles_semantic_ids": ["6e3d96c5a52bf429"], "timeout_ns": 429466729600, "wait_type": "any"}, "schema_version": 1, "tid": 7, "tid_event_idx": 3}
|
||||
```
|
||||
|
||||
## canary_tid=14 → ours_tid=9
|
||||
|
||||
First divergence at `tid_event_idx=41`: payload.ord: canary=503 ours=293
|
||||
|
||||
**Pre-context (last 5 matching events):**
|
||||
```
|
||||
canary: [36] kernel.call KeReleaseSpinLockFromRaisedIrql
|
||||
ours: [36] kernel.call KeReleaseSpinLockFromRaisedIrql
|
||||
canary: [37] kernel.return KeReleaseSpinLockFromRaisedIrql
|
||||
ours: [37] kernel.return KeReleaseSpinLockFromRaisedIrql
|
||||
canary: [38] import.call KfLowerIrql
|
||||
ours: [38] import.call KfLowerIrql
|
||||
canary: [39] kernel.call KfLowerIrql
|
||||
ours: [39] kernel.call KfLowerIrql
|
||||
canary: [40] kernel.return KfLowerIrql
|
||||
ours: [40] kernel.return KfLowerIrql
|
||||
```
|
||||
|
||||
**Divergent event:**
|
||||
```
|
||||
canary: [41] import.call XAudioGetVoiceCategoryVolumeChangeMask
|
||||
ours: [41] import.call RtlEnterCriticalSection
|
||||
```
|
||||
|
||||
**Next event after the divergence (if any):**
|
||||
```
|
||||
canary: [42] kernel.call XAudioGetVoiceCategoryVolumeChangeMask
|
||||
ours: [42] kernel.call RtlEnterCriticalSection
|
||||
```
|
||||
|
||||
**Raw events (JSON):**
|
||||
```json
|
||||
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1795772400, "kind": "import.call", "payload": {"module": "xboxkrnl.exe", "name": "XAudioGetVoiceCategoryVolumeChangeMask", "ord": 503}, "schema_version": 1, "tid": 14, "tid_event_idx": 41}
|
||||
{"deterministic": true, "engine": "ours", "guest_cycle": 417, "host_ns": 1628175228, "kind": "import.call", "payload": {"module": "xboxkrnl.exe", "name": "RtlEnterCriticalSection", "ord": 293}, "schema_version": 1, "tid": 9, "tid_event_idx": 41}
|
||||
```
|
||||
|
||||
## canary_tid=15 → ours_tid=10
|
||||
|
||||
No divergence within the 16 compared events (canary has 250000, ours has 17).
|
||||
122
audit-runs/phase-c22-rtl-enter-leave-control-flow/escalation.md
Normal file
@@ -0,0 +1,122 @@
|
||||
# Phase C+22 — ESCALATION (2026-05-18)
|
||||
|
||||
## Decision: ESCALATE
|
||||
|
||||
C+22's target divergence at canary tid=6→1 idx=104,607 (canary
|
||||
`import.call RtlEnterCriticalSection` extra nested-Enter vs
|
||||
ours `import.call RtlLeaveCriticalSection`) is classified as
|
||||
**(A) scheduler-determinism + post-wait state-mutation downstream
|
||||
effect** — the same class C+20 escalated. C+21's wait.begin
|
||||
floating-absorb correctly removed the visible wait.begin jitter
|
||||
event (verified `floating_wait (c/o) = 2/0` engaged on this
|
||||
chain in the fresh c22 sample), but the *post-wait branch* in
|
||||
canary's guest code, taken because shared state was mutated
|
||||
during the wait, cannot be papered over at the diff layer
|
||||
without crossing reading-error #23 (matching genuinely different
|
||||
guest behavior).
|
||||
|
||||
## What was done
|
||||
|
||||
1. Backed up both canary cache locations.
|
||||
2. Wiped both canary caches + ours's cache.
|
||||
3. Cold-ran ours (50M instructions, against the `.iso`).
|
||||
4. Cold-ran canary (90s timeout, against the `.iso`).
|
||||
5. Truncated canary log keeping all tids (first 250k events per
|
||||
tid) so the C+18/C+21 cross-tid shared-global heuristic has
|
||||
the multi-tid evidence it needs.
|
||||
6. Ran `diff_events.py` with full multi-tid map.
|
||||
7. Verified main matched prefix = 104,607 (matches C+21).
|
||||
8. Verified sister chains unchanged: 11/32/3/41/16.
|
||||
9. Verified C+21 floating-absorb engaged: `floating_create (c/o)
|
||||
= 1/0`, `floating_wait (c/o) = 2/0` on main chain.
|
||||
10. Restored canary caches.
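
Step 5 above (truncate while keeping every tid) can be sketched as a streaming filter over the JSONL log. This is a minimal illustration under the assumption that each line is one JSON event carrying a `tid` field, as in the raw events shown in the diff report; the function name and the default limit parameter are hypothetical, not the tooling actually used.

```python
import json
from collections import defaultdict


def truncate_per_tid(lines, limit=250_000):
    """Yield at most `limit` events for each tid, preserving stream order."""
    seen = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the log
        tid = json.loads(line)["tid"]
        if seen[tid] < limit:
            seen[tid] += 1
            yield line


sample = [
    '{"tid": 6, "kind": "import.call"}',
    '{"tid": 6, "kind": "kernel.call"}',
    '{"tid": 4, "kind": "import.call"}',
    '{"tid": 6, "kind": "kernel.return"}',
]
kept = list(truncate_per_tid(sample, limit=2))
```

Counting per tid rather than truncating the file as a whole is what keeps low-volume threads in the sample, which the cross-tid shared-global heuristic depends on.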
|
||||
|
||||
Discovered along the way:
|
||||
- **Reading-error class #34** (NEW): cold-run determinism
|
||||
depends on input path form. The `.xex` and `.iso` paths
|
||||
produce different boot trajectories. All cold-vs-cold runs
|
||||
MUST use the `.iso` path. Documented in
|
||||
`investigation.md` §"Methodology note".
|
||||
|
||||
## What was NOT done
|
||||
|
||||
- No engine source changed (per ESCALATE classification).
|
||||
- No diff-tool changes (the existing C+18/C+21 absorbers
|
||||
already work correctly for this region; over-absorbing the
|
||||
post-wait Enter/Leave block would cross into matching
|
||||
genuinely different guest behavior).
|
||||
- Phase A emitter additive for `cs_ptr` arg considered but
|
||||
deferred — not needed to establish the escalation decision;
|
||||
would only refine the cause-of-branch story which is already
|
||||
established by the C+20 analysis.
|
||||
- D-NEW-2 NOT touched (explicitly out of scope per prompt).
|
||||
|
||||
## Why we can't fix this in C+22's authorized scope
|
||||
|
||||
The C+22 prompt authorizes modifications to:
|
||||
- `crates/xenia-kernel/src/exports.rs` (rtl_enter_critical_section,
|
||||
rtl_leave_critical_section, related CS state)
|
||||
- `crates/xenia-kernel/src/state.rs` if CS state model needs
|
||||
adjustment
|
||||
- `tools/diff-events/diff_events.py` if a new race pattern is
|
||||
identified
|
||||
- Tests, Phase A emitter additive if needed, documentation
|
||||
|
||||
But explicitly forbids:
|
||||
- Refactor scheduler / thread-model
|
||||
- Refactor CS primitives broadly
|
||||
- Touch GPU/audio/HID
|
||||
- Land deferred items
|
||||
- Fix D-NEW-2 in this session
|
||||
|
||||
The actual root cause is **scheduler determinism** — ours's
|
||||
single-stepping scheduler runs tid=1 monolithically through this
|
||||
region, denying other tids the opportunity to claim the shared
|
||||
CS that's contended in canary. The fix requires either:
|
||||
|
||||
1. Reworking ours's scheduler to interleave threads at finer
|
||||
granularity (multi-thousand-LOC refactor — NOT AUTHORIZED).
|
||||
2. Recording canary's scheduling trace and replaying it in ours
|
||||
(new subsystem — NOT AUTHORIZED).
|
||||
3. Adding wait.begin emission to ours's RtlEnter park path AND
|
||||
re-architecting the CS contention model so that, when ours
|
||||
DOES contend, it produces canary-symmetric state mutations
|
||||
— partial; would not fix this case because ours fast-paths
|
||||
here, never parks.
|
||||
4. Modifying Sylpheed guest code (out of scope and defeats
|
||||
parity goal).
|
||||
|
||||
None of (1)-(4) fit C+22's authorized scope. **Escalation is the
|
||||
correct decision.**
|
||||
|
||||
## Recommended next-target sequence
|
||||
|
||||
1. **C+23 = D-NEW-2** (independent ε-class fix on a different
|
||||
sister chain). `KeWaitForSingleObject` `timeout_ns`
|
||||
sign/scale asymmetry. Out of scope for C+22 per prompt; in
|
||||
scope for C+23.
|
||||
2. **C+24 = D-NEW-3** (canary tid=14→9 idx=41:
|
||||
`XAudioGetVoiceCategoryVolumeChangeMask` vs ours's
|
||||
`RtlEnterCriticalSection`). Likely a missing/stubbed
|
||||
XAudio export.
|
||||
3. **Parallel scheduler-determinism track**: a dedicated multi-
|
||||
session refactor to attack the C+20/C+22 family at the root.
|
||||
Scope per C+20: per-CS-pointer "expected contention"
|
||||
inference from canary logs + scheduler driver + diff-tool
|
||||
"scheduling-trace replay" event class.
|
||||
|
||||
## Confidence
|
||||
|
||||
- Classification confidence: HIGH (95%+). Verified by
|
||||
multi-sample canary cold runs showing structurally identical
|
||||
EE-LL nested pattern across all 4 samples; C+21 absorber
|
||||
engaged exactly as predicted; mechanism (post-wait
|
||||
state-mutation branch) consistent with C+20's analysis.
|
||||
|
||||
- Escalation correctness: HIGH (95%+). No authorized
|
||||
modification within C+22's scope can fix this; reading-error
|
||||
#23 explicitly applies if we over-absorb in the diff tool.
|
||||
|
||||
- Reading-error #34 discovery: HIGH (verified by repeat
|
||||
experiment — 2 ours-cold runs against `.iso` byte-identical
|
||||
modulo timestamps; identical to C+19 archive).
|
||||
@@ -0,0 +1,262 @@
|
||||
# Phase C+22 investigation — RtlEnter/RtlLeave post-wait control-flow divergence (2026-05-18)
|
||||
|
||||
## TL;DR
|
||||
|
||||
**ESCALATE.** The divergence at tid=6→1 idx=104,607 (canary
|
||||
`import.call RtlEnterCriticalSection` vs ours `import.call
|
||||
RtlLeaveCriticalSection`) is a downstream effect of the **same
|
||||
scheduler-determinism asymmetry** that C+20 escalated. C+21's
|
||||
floating-absorb correctly removes the visible `wait.begin` jitter
|
||||
event from the diff (`floating_wait (c/o) = 2/0` engaged on this
|
||||
chain in the fresh c22 sample), but the **post-wait guest-code
|
||||
branch** taken in canary because shared state was mutated during
|
||||
the wait is NOT an observation artifact — it's a structural
|
||||
behavioral consequence of scheduler interleaving and cannot be
|
||||
papered over at the diff layer without falsely matching genuinely
|
||||
different guest behavior.
|
||||
|
||||
## Verification: NOT jitter
|
||||
|
||||
Per reading-error #32 discipline, sampled 4 canary cold streams
plus 1 fresh ours cold. The Enter/Leave PATTERN in the post-wait
region is structurally consistent across all canary samples:
|
||||
|
||||
| sample       | events 104,604-104,615 (tid=6, import.call only) |
|--------------|--------------------------------------------------|
| c21 archived | E E L L (nested pair after acquire)              |
| jitter-1     | E (wait.begin slow-path) E L L                   |
| jitter-2     | E E L L (same as c21)                            |
| jitter-3     | (index-shifted +3) E E L L                       |
| fresh c22    | E (wait.begin slow-path) E L L                   |
|
||||
|
||||
All canary samples take an EXTRA nested RtlEnter after the post-
|
||||
loop `E` at 104,604. Ours never does — it goes `E L NtClose`.
|
||||
|
||||
The two canary jitter shapes (with vs without the wait.begin
|
||||
emission inside the first E pair) are the C+21 absorption target;
|
||||
both shapes converge to the same post-wait nested-Enter behavior.
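
The pattern rows in the table above can be derived mechanically from an event stream. The sketch below is one hedged way to do it, assuming events shaped like the raw JSON in the diff report (`kind`, `tid_event_idx`, `payload.name`); the helper names are hypothetical. Because the emitter does not log a CS pointer, the nesting count treats every critical section as the same lock.

```python
SYMBOLS = {"RtlEnterCriticalSection": "E", "RtlLeaveCriticalSection": "L"}


def enter_leave_pattern(events, lo, hi):
    """Letters for import.call events with lo <= tid_event_idx <= hi."""
    out = []
    for ev in events:
        if ev["kind"] != "import.call" or not lo <= ev["tid_event_idx"] <= hi:
            continue
        name = ev["payload"]["name"]
        out.append(SYMBOLS.get(name, name))  # keep other exports by name
    return out


def max_nesting(pattern):
    """Deepest Enter depth reached; cannot tell distinct CS objects apart."""
    depth = deepest = 0
    for sym in pattern:
        if sym == "E":
            depth += 1
            deepest = max(deepest, depth)
        elif sym == "L":
            depth = max(depth - 1, 0)
    return deepest


canary_shape = ["E", "E", "L", "L"]
ours_shape = ["E", "L", "NtClose"]
```

A nesting depth of 2 for the canary shape versus 1 for the ours shape is the structural difference the table records, independent of whether the `wait.begin` slow path fired.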
|
||||
|
||||
## Mechanism (classification: A + B-via-A)
|
||||
|
||||
C+21 absorption confirmed working — the diff harness correctly
|
||||
folds the wait.begin and handle.create events on shared-global
|
||||
dispatcher `sid=75ae880ec432eb36 / raw=0xf8000034` (an Event
|
||||
dispatcher used cross-tid) into the matched prefix:
|
||||
|
||||
```
|
||||
fresh c22 floating_create (c/o) = 1/0
|
||||
fresh c22 floating_wait (c/o) = 2/0
|
||||
```
|
||||
|
||||
Result: matched prefix advances to 104,607 (canary stream
|
||||
internally at idx 104,610 after C+21 unfolds the 3 absorbed
|
||||
events).
|
||||
|
||||
The remaining divergence is:
|
||||
|
||||
```
|
||||
canary [104,610] import.call RtlEnterCriticalSection (nested 2nd acquire)
|
||||
ours [104,607] import.call RtlLeaveCriticalSection (release first acquire)
|
||||
```
|
||||
|
||||
This is NOT a "ghost" event. It's a real divergence in **guest
|
||||
control flow** at the same logical execution point.
|
||||
|
||||
### Why it happens
|
||||
|
||||
Sylpheed's guest code at this PC, after the post-loop CS acquire,
|
||||
reads a state value (e.g. a queue pointer, a reference count, an
|
||||
event-signaled flag) protected by that CS. Based on the value, it
|
||||
either:
|
||||
|
||||
- (canary's path): re-enters a nested CS to drain or clean up
|
||||
additional state, then releases both levels.
|
||||
- (ours's path): proceeds directly to release the outer CS and
|
||||
close the Event handle.
|
||||
|
||||
In canary's contended scenario, while tid=6 was blocked on the
|
||||
shared dispatcher at 104,608 (the embedded `DISPATCHER_HEADER` of
|
||||
the CS object — its `wait.begin` was on `sid=75ae880ec432eb36`,
|
||||
the canary's first-toucher SID for this Event), **another guest
|
||||
thread held the CS and may have mutated the protected state**.
|
||||
When tid=6 resumes and the slow-path RtlEnter completes
|
||||
acquisition, the state value that the post-acquire branch reads
|
||||
has changed, and the branch takes the nested-cleanup path.
|
||||
|
||||
In ours, tid=1 never blocked here. No other thread had a chance
|
||||
to mutate the protected state during a wait window. The state
|
||||
value the branch reads is the pre-wait value, and the branch
|
||||
takes the simple-release path.
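
The two paragraphs above can be condensed into a toy model. Everything in it is hypothetical: the `pending` counter stands in for whatever CS-protected value the guest actually reads, and the single `contended` flag stands in for the host scheduler's interleaving. It only illustrates why the same guest code yields two different call sequences.

```python
def post_loop_sequence(contended: bool):
    """Return the import.call shape for the post-loop region in this toy model."""
    pending = 0    # hypothetical CS-protected state read after the acquire
    calls = ["E"]  # post-loop RtlEnterCriticalSection
    if contended:
        # While the acquiring thread is parked, the holder mutates the state.
        pending += 1
    if pending:
        calls += ["E", "L", "L"]   # nested cleanup, then release both levels
    else:
        calls += ["L", "NtClose"]  # simple release, then close the Event
    return calls


canary_like = post_loop_sequence(contended=True)
ours_like = post_loop_sequence(contended=False)
```

The branch input differs only because a wait window existed, so the model has no knob on the diff side that could make the two outputs equal.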
|
||||
|
||||
This is the same downstream effect that the C+20 escalation
|
||||
analysis predicted: *"That requires ours to schedule tid=9 ahead
|
||||
of (or concurrently with) tid=1's RtlEnter, exactly as canary's
|
||||
host scheduler did. Ours's deterministic single-stepping
|
||||
scheduler runs tid=1 near-monolithically through this region —
|
||||
tid=9 has no opportunity to claim the CS before tid=1 fast-paths
|
||||
through."*
|
||||
|
||||
The classification is class A in the C+22 prompt taxonomy:
|
||||
**ours's RtlEnter takes a fast path (uncontended) that canary's
|
||||
contended path doesn't — same root cause as C+20.**
|
||||
|
||||
### Why this can't be absorbed in the diff tool (reading-error #23 risk)
|
||||
|
||||
Unlike the wait.begin event itself (which is a transient
|
||||
observation directly correlated to scheduling), the
|
||||
post-divergence Enter / Leave sequence corresponds to **distinct
|
||||
guest code paths**. Folding canary's extra RtlEnter at idx
|
||||
104,610 + matching RtlLeave at 104,613 into the matched prefix
|
||||
would require the diff tool to over-absorb a 6-event block per
|
||||
contention occurrence, regardless of whether ours's code path
|
||||
ACTUALLY corresponds to canary's contended path. This crosses
|
||||
the line from "scheduling-jitter mitigation" to "matching
|
||||
genuinely different guest behavior" — reading-error #23 in
|
||||
action.
|
||||
|
||||
The C+21 absorb is justified because the wait.begin event is
|
||||
guaranteed to be a no-op observation if/when it fires (canary's
|
||||
xeKeWaitForSingleObject is the slow path that the fast path
|
||||
trivially skips). The post-wait Enter / Leave block is the
|
||||
opposite: real work, real guest code execution.
|
||||
|
||||
## Engine-side fixes considered and rejected
|
||||
|
||||
### (i) Wire wait.begin into ours's `rtl_enter_critical_section` park path
|
||||
Symmetric to canary, but does NOT fix the divergence at idx
|
||||
104,607 because ours doesn't park here at all. The patch would
|
||||
be inert in this case; the divergence persists. Useful
|
||||
prophylactic but not the C+22 target.
|
||||
|
||||
### (ii) Force ours to spin-wait briefly at every RtlEnter to give other tids a chance to claim the CS
|
||||
Extremely fragile, no guarantee of matching canary's exact
|
||||
interleave. Likely shifts divergence elsewhere without resolving
|
||||
it.
|
||||
|
||||
### (iii) Implement deterministic CS-priority scheduling where any other tid that has a pending wait on the same CS gets to run before the current tid's fast-path
|
||||
Would change ours's scheduler semantics broadly. Multi-thousand-
|
||||
LOC scope. Explicitly NOT authorized per the C+22 prompt:
|
||||
|
||||
> You may NOT (without escalating): Refactor scheduler /
|
||||
> thread-model.
|
||||
|
||||
### (iv) Record canary's contention trace and replay it in ours ("scheduling-trace replay")
|
||||
A new subsystem; recorded under C+20 escalation already.
|
||||
|
||||
### (v) Modify Sylpheed's guest code at the post-loop branch to force the simple-release path
|
||||
Would require modifying guest binary — outside scope and
|
||||
defeats the parity goal.
|
||||
|
||||
### (vi) Add a no-op `cs_ptr` Phase A emitter additive for diagnosis
|
||||
About 30 LOC in each engine, plus a canary recompile. Zero cost when the cvar is off.
|
||||
Would allow future investigation to distinguish whether
|
||||
canary's nested RtlEnter at 104,610 is on the SAME CS pointer
|
||||
(recursive bump) or a DIFFERENT CS (nested cleanup lock).
|
||||
Deferred — not needed for the escalation decision because the
|
||||
mechanism (post-wait state mutation) is already established by
|
||||
the C+20 analysis; the additional `cs_ptr` data would only
|
||||
refine the cause-of-branch story.
|
||||
|
||||
## Cascade outcome (per C+22 prompt)
|
||||
|
||||
- A=verify divergence is NOT jitter: PASS (4 canary cold samples
|
||||
agree on EE-LL nested pattern; C+21 absorber engaged
|
||||
`floating_wait (c/o) = 2/0` and matched prefix is 104,607
|
||||
exactly).
|
||||
- B=classify (A/B/C/D): PASS — **(A) ours's RtlEnter fast-paths
|
||||
while canary's contends → downstream state mutation during the
|
||||
wait → different post-acquire branch in guest code.**
|
||||
- C=land fix or escalate cleanly: ESCALATION (per C+22 prompt
|
||||
authorized fallback).
|
||||
- D=main matched-prefix > 104,607: N/A (no engine change).
|
||||
|
||||
## Cold-vs-cold gate matrix (escalation-mode)
|
||||
|
||||
| gate                            | result                     |
|---------------------------------|----------------------------|
| ours-cold byte-identical to c19 | YES (121,569 events match) |
| Main matched-prefix             | 104,607 (= C+21)           |
| Sister chains                   | 11/32/3/41/16 ✓            |
| Phase B `image_loaded_sha256`   | unchanged ✓                |
| Engine source                   | UNCHANGED                  |
| C+21 absorber engagement        | 1/0 + 2/0 (fired)          |
|
||||
|
||||
## Per-chain delta vs C+21 baseline
|
||||
|
||||
NONE. All chains identical to C+21:
|
||||
|
||||
| chain                          | C+21    | C+22 (this) | delta |
|--------------------------------|---------|-------------|-------|
| canary tid=6 → ours tid=1 main | 104,607 | 104,607     | 0     |
| canary tid=4 → ours tid=11     | 11      | 11          | 0     |
| canary tid=7 → ours tid=2      | 32      | 32          | 0     |
| canary tid=12 → ours tid=7     | 3       | 3           | 0     |
| canary tid=14 → ours tid=9     | 41      | 41          | 0     |
| canary tid=15 → ours tid=10    | 16      | 16          | 0     |
|
||||
|
||||
## Methodology note — reading-error class #34
|
||||
|
||||
**#34 (NEW): cold-run determinism depends on input path form.**
|
||||
Running ours against `default.xex` directly (extracted file)
|
||||
produces a different boot trajectory than running against the
|
||||
parent `.iso` containing it. The C+19 / C+21 baselines used the
|
||||
`.iso` path; the `.xex` direct path yields 40x more imports and
|
||||
1.6M unimpl warnings (CPU stuck/looping in a probe that doesn't
|
||||
fire on the iso). All cold-vs-cold protocol entries MUST use
|
||||
the iso path. Reproduces deterministically: ours-cold against
|
||||
`.iso` is byte-identical to the c19 archived ours-cold modulo
|
||||
host_ns/guest_cycle fields (verified 121,569 events all match
|
||||
post-normalization).
|
||||
|
||||
Likely cause: the iso path triggers
`xenia_vfs::disc_image::DiscImageDevice::open` at main.rs:1397-1400,
mounting a full disc VFS at `d:\` / `\Device\Cdrom0\`. The bare-xex
path skips this and leaves the VFS unmounted for most disc-prefixed
opens, causing different boot-validator branches.
|
||||
|
||||
This affects ALL future cold-vs-cold protocol runs — always
|
||||
pass the .iso path, not the loose .xex.
|
||||
|
||||
## Recommendation for next sessions
|
||||
|
||||
This is the SECOND C-series session (after C+20) classified as
|
||||
scheduler-determinism in the post-loop RtlEnter region near idx
|
||||
104,607. The pattern is stable and well-understood. Recommended
|
||||
next-target sequence:
|
||||
|
||||
1. **C+23 = D-NEW-2** (`KeWaitForSingleObject` `timeout_ns`
|
||||
sign/scale asymmetry on tid=12→7 idx=3): canary=-30000000
|
||||
vs ours=429466729600. Small ε-class encoding fix in
|
||||
`ke_wait_for_single_object`'s timeout-pointer dereference.
|
||||
Independent of scheduler determinism. Out of scope for C+22
|
||||
per prompt's explicit "You may NOT ... Fix D-NEW-2 in this
|
||||
session."
|
||||
|
||||
2. **C+24 = D-NEW-3** (canary tid=14 → ours tid=9 idx=41:
|
||||
canary calls `XAudioGetVoiceCategoryVolumeChangeMask` while
|
||||
ours calls `RtlEnterCriticalSection`). Pre-context shows
|
||||
identical KeReleaseSpinLockFromRaisedIrql + KfLowerIrql
|
||||
pair; the next branch picks completely different exports.
|
||||
Likely a missing/stubbed XAudio export in ours that, when
|
||||
absent, causes a fallback to a different code path.
|
||||
|
||||
3. **Open the parallel scheduler-determinism track** to attack
|
||||
the C+20 / C+22 family at the root. Estimated multi-session
|
||||
refactor; per prompt this is "a separate session."
|
||||
|
||||
## Files
|
||||
|
||||
- `diff-cold-vs-cold.md` — full diff report.
|
||||
- `cold-vs-cold-result.md` — matched-prefix table + gates.
|
||||
- `canary-binary-cache-pre-wipe.tar.gz` — pre-wipe oracle backup.
|
||||
- `canary-xdg-cache-pre-wipe.tar.gz` — pre-wipe XDG oracle.
|
||||
- `escalation.md` — this document's TL;DR + recommended next.
|
||||
@@ -0,0 +1,80 @@
|
||||
# Phase C+22 re-validation (2026-05-18)
|
||||
|
||||
## Protocol followed
|
||||
|
||||
Cold-vs-cold per reading-error #31 + #32 + #33 + the new #34.
|
||||
|
||||
1. ✓ Backed up both canary cache locations
|
||||
(`xenia-canary/build-cross/bin/Windows/Debug/cache/` and
|
||||
`~/.local/share/Xenia/cache/`) to tarballs.
|
||||
2. ✓ Wiped both canary caches + ours's
|
||||
(`~/.local/share/xenia-rs/cache/`).
|
||||
3. ✓ Cold-ran ours (50M instructions) against the `.iso`
|
||||
path — NOT the loose `default.xex` (per new
|
||||
reading-error #34).
|
||||
4. ✓ Cold-ran canary (90s timeout) against the `.iso` path
|
||||
with `--mute=true`.
|
||||
5. ✓ Truncated canary log to first 250k events per tid
|
||||
(keeping ALL tids, NOT only tid=6 — needed for the C+18 /
|
||||
C+21 cross-tid shared-global heuristic).
|
||||
6. ✓ Ran `diff_events.py` with full tid map
|
||||
`6=1,7=2,4=11,12=7,14=9,15=10`.
|
||||
7. ✓ Restored both canary cache backups.
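
The tid map passed in step 6 is a comma-separated list of `canary=ours` pairs. A minimal parser for that string form might look like the sketch below; it is an illustration only, since the actual argument handling inside `diff_events.py` is not shown in these notes, and the function name is hypothetical.

```python
def parse_tid_map(spec: str) -> dict[int, int]:
    """Parse 'canary=ours,canary=ours,...' into {canary_tid: ours_tid}."""
    mapping = {}
    for pair in spec.split(","):
        canary_tid, ours_tid = pair.split("=")
        mapping[int(canary_tid)] = int(ours_tid)
    # Guard against two canary threads being paired with one ours thread.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("two canary tids map to the same ours tid")
    return mapping


tid_map = parse_tid_map("6=1,7=2,4=11,12=7,14=9,15=10")
```

The resulting dictionary has one entry per chain in the matched-prefix tables, keyed by the canary tid.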
|
||||
|
||||
## Determinism check
|
||||
|
||||
- ours-cold byte-identical to C+19 archive
  (`audit-runs/phase-c19-NtDuplicateObject-handle-create/ours-cold.jsonl`):
  121,569 events match post-normalization (host_ns and
  guest_cycle excluded). Reproduces deterministically.
- No new engine source changes, so digest unchanged. Phase B
  `image_loaded_sha256 = ea8d160e9369328a5b922258a92113efb8d7ce3e1a5c12cc521e375985c91c18`
  preserved.
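
The "post-normalization" comparison used for the determinism check can be sketched as follows. This is an assumed procedure rather than the script that produced the 121,569 figure: it drops the two volatile fields named above, re-serializes each event with sorted keys, and reports the first index at which the two runs differ. The function names are hypothetical.

```python
import json

VOLATILE_FIELDS = ("host_ns", "guest_cycle")  # fields the notes exclude


def normalize(line: str) -> str:
    """Drop volatile fields and serialize with a stable key order."""
    event = json.loads(line)
    for field in VOLATILE_FIELDS:
        event.pop(field, None)
    return json.dumps(event, sort_keys=True)


def first_mismatch(run_a, run_b):
    """Index of the first differing normalized event, or None if identical."""
    a = [normalize(l) for l in run_a if l.strip()]
    b = [normalize(l) for l in run_b if l.strip()]
    for idx, (ea, eb) in enumerate(zip(a, b)):
        if ea != eb:
            return idx
    if len(a) != len(b):
        return min(len(a), len(b))  # one run is a strict prefix of the other
    return None


run_a = [
    '{"kind": "import.call", "host_ns": 1, "guest_cycle": 5, "tid": 1}',
    '{"kind": "kernel.call", "host_ns": 2, "guest_cycle": 6, "tid": 1}',
]
run_b = [
    '{"tid": 1, "kind": "import.call", "host_ns": 9, "guest_cycle": 0}',
    '{"tid": 1, "kind": "kernel.return", "host_ns": 3, "guest_cycle": 7}',
]
```

Sorting keys before comparing means the check is insensitive to field order in the emitter output, so only a change in event content or event count registers as a mismatch.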
|
||||
|
||||
## Gate matrix (escalation mode)
|
||||
|
||||
| gate                                        | result            |
|---------------------------------------------|-------------------|
| Engine source unchanged                     | PASS              |
| Diff-tool source unchanged                  | PASS              |
| Phase A schema version 1 unchanged          | PASS              |
| ours-cold byte-equivalent to C+19 archive   | PASS (121,569 ev) |
| Main matched-prefix preserved at C+21       | PASS (104,607)    |
| Sister chains preserved (11/32/3/41/16)     | PASS              |
| C+21 absorber engaged (validation)          | PASS (2/0 wait)   |
| C+18 absorber engaged (validation)          | PASS (1/0 create) |
| Phase B image hash preserved                | PASS              |
| Canary caches restored                      | PASS              |
| Reading-error #34 documented + reproducible | PASS              |
| Tests count (no change)                     | 204 (unchanged)   |
|
||||
|
||||
## What changed in the diff-tool report compared to C+21 baseline
|
||||
|
||||
NOTHING substantive. The numbers are identical:
|
||||
|
||||
| chain                      | C+21 (jitter-1) | C+22 (fresh c22) |
|----------------------------|-----------------|------------------|
| tid=6→1                    | 104,607         | 104,607          |
| tid=4→11                   | 11              | 11               |
| tid=7→2                    | 32              | 32               |
| tid=12→7                   | 3               | 3                |
| tid=14→9                   | 41              | 41               |
| tid=15→10                  | 16              | 16               |
| main floating_create (c/o) | 0 / 0           | 1 / 0            |
| main floating_wait (c/o)   | 1 / 0           | 2 / 0            |
|
||||
|
||||
The floating counts vary across canary cold samples (different
|
||||
host-scheduler interleavings emit the wait.begin in different
|
||||
counts and at different indices) but the matched-prefix is
|
||||
constant — this is the C+21 fix working as designed.
|
||||
|
||||
## Outcome
|
||||
|
||||
- C+22 = ESCALATION (no engine change).
|
||||
- Cold-vs-cold environment is healthy and reproducible.
|
||||
- C+21 absorber works correctly under varying contention.
|
||||
- Reading-error #34 added as a methodology guard for future
|
||||
cold-vs-cold runs.
|
||||
- Next-target list updated: C+23 = D-NEW-2 (independent
|
||||
ε-class), C+24 = D-NEW-3 (XAudio), parallel scheduler-
|
||||
determinism track.
|
||||