xenia-rs/audit-runs/phase-c17-keWait-native-object/cold-vs-cold-result.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Phase C+17 cold-vs-cold result (2026-05-14)
## Matched-prefix table
| canary_tid | ours_tid | C+16 | C+17 | delta | first_divergence_at | kind |
|------------|----------|---------|---------|-----------|---------------------|-----------------------------------------------------|
| 6 | 1 | 102,171 | 102,553 | **+382** | 102,553 | `NtDuplicateObject` no `handle.create` (NEW-1) |
| 4 | 11 | 8 | 11 | **+3** | — | no divergence in 11 events (ours stalls) |
| 7 | 2 | 30 | 32 | **+2** | — | no divergence in 32 events |
| 12 | 7 | 2 | 3 | **+1** | 3 | `timeout_ns` differs in `wait.begin` (NEW-2) |
| 14 | 9 | 2 | 41 | **+39** | 41 | unrelated `XAudioGetVoiceCategoryVolumeChangeMask` |
| 15 | 10 | 16 | 2 | **-14** | 2 | ordering: ours emits `handle.create` on first thread-touch of shared dispatcher (NEW-3) |

**Main chain advanced +382** (D-2/D-3/D-4 root cause resolved). 4 of 5 sister
chains advanced. The tid=15→10 chain regressed by 14 events due to a
cross-thread-caching ordering side-effect (see broad-impact.md / NEW-3). The
underlying state model is aligned by the same fix, so the regression is
"observation-side": canary's `GetNativeObject` is process-global, so the
adoption happens on whichever thread touches the dispatcher first, and the
`handle.create` event can land on a different thread in each trace.
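The table's `C+17` and `first_divergence_at` columns can be read as the output of a per-chain prefix comparison. The following is a minimal sketch of that comparison, not the actual diff tool: the `Event` and `ChainDiff` types and the `diff_chain` function are hypothetical names, and real events carry more fields (including SIDs, which the diff tool skips by policy).

```rust
/// Hypothetical, simplified trace event: only the fields this sketch compares.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Event {
    kind: String, // e.g. "import.call", "handle.create"
    name: String, // e.g. "NtDuplicateObject"; empty when not applicable
}

/// Hypothetical result of comparing one canary chain with its paired chain in ours.
#[derive(Debug, PartialEq, Eq)]
struct ChainDiff {
    /// Number of leading events that are equal on both sides.
    matched_prefix: usize,
    /// Zero-based index of the first unequal pair, or None if the shorter
    /// chain ended before any mismatch (e.g. one side stalled).
    first_divergence_at: Option<usize>,
}

fn diff_chain(canary: &[Event], ours: &[Event]) -> ChainDiff {
    let matched_prefix = canary
        .iter()
        .zip(ours.iter())
        .take_while(|(c, o)| c == o)
        .count();
    let shorter = canary.len().min(ours.len());
    let first_divergence_at = if matched_prefix < shorter {
        Some(matched_prefix)
    } else {
        None
    };
    ChainDiff { matched_prefix, first_divergence_at }
}

/// Convenience constructor for building test chains.
fn ev(kind: &str, name: &str) -> Event {
    Event { kind: kind.to_string(), name: name.to_string() }
}
```

With zero-based indices, the matched-prefix count and the first divergent index coincide, which matches rows such as 102,553 / 102,553 in the table; a chain with no mismatch reports only its matched length.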
## New first divergence on main (idx=102,553)
```
canary: [102551] import.call NtDuplicateObject
ours: [102551] import.call NtDuplicateObject
canary: [102552] kernel.call NtDuplicateObject
ours: [102552] kernel.call NtDuplicateObject
canary: [102553] handle.create sid=df686b147b291902 (object_type=1)
ours: [102553] kernel.return NtDuplicateObject
canary: [102554] kernel.return NtDuplicateObject
ours: [102554] import.call RtlEnterCriticalSection
```
Canary's `NtDuplicateObject_entry` calls `ObjectTable::DuplicateHandle`, which
fires `AddHandle` for the new slot and emits `handle.create`. Our
`nt_duplicate_object` short-circuits via handle aliasing (AUDIT-062's
`dup_id=source_id` design) and does NOT emit a new `handle.create`. This is
**D-NEW-1 HIGH**, the first C+18 target.
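To make the event-stream difference concrete, here is a toy model of the two duplication strategies. It is a sketch under stated assumptions, not either emulator's code: `ToyTable`, its methods, and the handle increment of 4 are hypothetical, and the model tracks only which trace events appear.

```rust
use std::collections::HashMap;

/// Toy handle table that logs trace events as strings (hypothetical model).
struct ToyTable {
    next_handle: u32,
    slots: HashMap<u32, u64>, // handle -> object id
    log: Vec<String>,
}

impl ToyTable {
    fn new() -> Self {
        ToyTable { next_handle: 0x1000, slots: HashMap::new(), log: Vec::new() }
    }

    /// Allocates a new slot and emits `handle.create`.
    fn add_handle(&mut self, object: u64) -> u32 {
        let handle = self.next_handle;
        self.next_handle += 4; // assumed stride, for illustration only
        self.slots.insert(handle, object);
        self.log.push("handle.create".to_string());
        handle
    }

    /// Aliasing strategy: the duplicate is the source handle itself,
    /// so no slot is allocated and no `handle.create` is logged.
    fn duplicate_aliasing(&mut self, source: u32) -> Option<u32> {
        self.log.push("kernel.call NtDuplicateObject".to_string());
        let result = self.slots.contains_key(&source).then_some(source);
        self.log.push("kernel.return NtDuplicateObject".to_string());
        result
    }

    /// New-slot strategy: the duplicate gets its own slot, so
    /// `handle.create` appears between the call and the return.
    fn duplicate_new_slot(&mut self, source: u32) -> Option<u32> {
        self.log.push("kernel.call NtDuplicateObject".to_string());
        let object = self.slots.get(&source).copied();
        let result = object.map(|o| self.add_handle(o));
        self.log.push("kernel.return NtDuplicateObject".to_string());
        result
    }
}
```

In this model the new-slot strategy reproduces the canary ordering shown above (`kernel.call`, `handle.create`, `kernel.return`), while the aliasing strategy reproduces ours.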
## Acceptance gates
- **Gate 1 (default-off digest)**: PASS — 3× reproducible at
`e1dfcb1559f987b35012a7f2dc6d93f5` (unchanged from C+13/C+15-α/C+16
baseline). The fix is observation-only at the digest level; the new
shadow-handle refcount entries do not feed back into guest behavior
inside the 50M-instruction window.
- **Gate 2 (cvar-on emit)**: PASS — ours 121,544 events (was 121,537 in
C+16, +7 from new lazy `handle.create` emits in the main chain
bring-up); canary 3,059,463 events in ~90s. Both JSONL parse cleanly.
- **Gate 3 (diff tool)**: PASS — diff tool produces 6-chain report with
the new SID-skip semantics for `wait.begin.handles_semantic_ids`.
- **Gate 4 (cold-vs-cold)**: PASS — main matched prefix advances
102,171 → 102,553 (+382). 4 of 5 sister chains advance; 1 minor
regression on tid=15→10 (NEW-3, observation-side).
- **Gate 5 (build clean)**: PASS — `cargo build --release` clean
(1 pre-existing dead_code warning unrelated).
- **Gate 6 (tests)**: PASS — 186 → 191 (added 5 new lifecycle tests for
`ensure_dispatcher_object`; all pass + entire workspace green).
- **Gate 7 (Phase B image hash)**: PASS — `image_loaded_sha256` =
`ea8d160e9369328a5b922258a92113efb8d7ce3e1a5c12cc521e375985c91c18`
(unchanged).
- **Gate 8 (event-log determinism)**: PASS — `handle.create` event
stream (post-strip of `host_ns`) is bit-identical across 3 cold
runs: md5 `0bd91b4c61dea52d72859e7d9c3541ba`.
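The Gate 8 check (strip `host_ns`, then compare digests across cold runs) can be sketched as follows. This is an illustration only: the `HandleCreate` field layout is hypothetical, and `DefaultHasher` from the standard library stands in for md5, which is not in `std`, so the digests it produces are not comparable to the one quoted above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical shape of one `handle.create` event.
#[allow(dead_code)]
struct HandleCreate {
    host_ns: u64, // host wall-clock stamp; differs on every run
    tid: u32,
    sid: u64,
    object_type: u32,
}

/// Digest of an event stream with `host_ns` left out, so that only
/// guest-visible content contributes.
fn stream_digest(events: &[HandleCreate]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for e in events {
        (e.tid, e.sid, e.object_type).hash(&mut hasher);
    }
    hasher.finish()
}

/// True when every run yields the same stripped digest.
fn is_deterministic(runs: &[Vec<HandleCreate>]) -> bool {
    let mut digests = runs.iter().map(|r| stream_digest(r));
    match digests.next() {
        Some(first) => digests.all(|d| d == first),
        None => true,
    }
}
```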
## Sister-chain analysis
None of the 5 sister chains now stops at "wait.begin with SID=0":
- tid=4→11: was `KeWaitForMultipleObjects` at idx=8 with empty SIDs;
now goes 11 events deep with NO divergence (ours stalls, but for
reasons unrelated to D-2/D-3/D-4).
- tid=7→2: was `KeWaitForSingleObject` at idx=30 with SID=0; now 32
events with NO divergence.
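- tid=12→7: was at idx=2 with SID=0; now idx=3. The `handle.create`
matches (SID skipped per diff-tool policy), and the divergence is now a
`timeout_ns` mismatch (-30000000 vs 429466729600). Note that
-30000000 = -300000 × 100 and 429466729600 = (2^32 - 300000) × 100, so
both values carry the same low 32 bits. This is consistent with one side
decoding the timeout as signed and the other as unsigned, rather than
with a game-side wait-quantum mismatch; this reading is unverified.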
- tid=14→9: was at idx=2 with SID=0; now idx=41 — reached a real
`XAudioGetVoiceCategoryVolumeChangeMask` divergence (sister-chain
audio export the boot doesn't reach in ours).
- tid=15→10: was at idx=16 (no divergence in 16 events); now idx=2
diverges because ours emits `handle.create` on this thread's first
touch of a globally-shared semaphore dispatcher at `0x828a3230`,
while canary emitted it earlier on another thread. Observation-side
ordering issue; underlying state model is the same. NEW-3 below.
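The two `timeout_ns` values in the tid=12→7 bullet are worth checking arithmetically. The sketch below assumes that `timeout_ns` is derived from a tick count in 100 ns units (an assumption, not something the trace states) and shows that both observed values fall out of the same 32-bit pattern, depending on whether it is widened as signed or unsigned. The function names are hypothetical.

```rust
/// Nanoseconds per tick, assuming 100 ns ticks (assumption for this sketch).
const NS_PER_TICK: i64 = 100;

/// Widen the low 32 bits as a signed value, then scale to nanoseconds.
fn ns_from_signed_low32(ticks_low32: u32) -> i64 {
    (ticks_low32 as i32 as i64) * NS_PER_TICK
}

/// Widen the low 32 bits as an unsigned value, then scale to nanoseconds.
fn ns_from_unsigned_low32(ticks_low32: u32) -> i64 {
    (ticks_low32 as i64) * NS_PER_TICK
}

/// The bit pattern of -300000 as a 32-bit word (0xFFFB6C20).
fn raw_minus_300000() -> u32 {
    (-300_000i32) as u32
}
```

Feeding `raw_minus_300000()` to the signed path yields -30000000 and to the unsigned path yields 429466729600, the exact pair reported by the diff. If the assumption holds, the divergence would be a decoding difference (sign extension of the timeout) rather than a different guest-supplied value, which would be a candidate to check before treating NEW-2 as game-side.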
## Refcount leak risk audit
The fix initializes `state.handle_refcount[ptr]` to 1 for each first-touch
shadow. Three concerns and their mitigations:
1. **Leak risk**: no code path currently destroys these shadows
(`ensure_dispatcher_object` adoptions). Canary's design has the same
property — `GetNativeObject`-synthesized `XObject`s survive until
process exit. No leak relative to canary's behavior.
2. **Double-bump risk**: the early-return guard at the top of
`ensure_dispatcher_object` (`state.objects.contains_key(&ptr)`)
ensures the refcount entry is initialized exactly once per pointer.
Test `ensure_dispatcher_object_is_idempotent_on_repeated_touch`
verifies this.
3. **Refcount underflow risk**: if a future change wires
`handle.destroy` on shadow removal (e.g., when `NtClose` is
somehow called on a guest dispatcher pointer), the refcount must
not underflow. The `or_insert(1)` form preserves any pre-existing
refcount (e.g., if the same pointer was previously allocated via
`alloc_handle_for`, though that's impossible since `next_handle`
starts at `0x1000` and pointers live above `0x1_0000`).
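The double-bump guard and the `or_insert(1)` behavior described in items 2 and 3 can be sketched with a toy state. This is not the real `ensure_dispatcher_object`: `ToyState`, its field types, the `kind` parameter, and the boolean return value are hypothetical, chosen only to show the idempotence and refcount-preservation properties.

```rust
use std::collections::HashMap;

/// Toy kernel state (hypothetical; field names mirror those in the text).
#[derive(Default)]
struct ToyState {
    objects: HashMap<u32, &'static str>, // guest pointer -> shadow kind
    handle_refcount: HashMap<u32, u32>,  // guest pointer -> refcount
    events: Vec<String>,
}

impl ToyState {
    /// Adopts `ptr` as a shadow object on first touch.
    /// Returns true only when this call performed the adoption.
    fn ensure_dispatcher_object(&mut self, ptr: u32, kind: &'static str) -> bool {
        // Early-return guard: repeated touches do nothing.
        if self.objects.contains_key(&ptr) {
            return false;
        }
        self.objects.insert(ptr, kind);
        // or_insert(1) writes 1 only if no refcount exists yet,
        // so a pre-existing count is left untouched.
        self.handle_refcount.entry(ptr).or_insert(1);
        self.events.push(format!("handle.create ptr={ptr:#x}"));
        true
    }
}
```

In this model a second touch of the same pointer neither emits a second `handle.create` nor changes the refcount, and a refcount that was already present before adoption keeps its value.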