handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions


@@ -0,0 +1,158 @@
# Phase C+22 — Payload-field canonicalization for host-heap-derived guest VAs
**Date:** 2026-05-26
**Mode:** WRITE — diff-tool only. No engine source changes.
**Status:** LANDED. Main matched-prefix 105,128 → 105,138 (+10).
## TL;DR
The pre-C+22 first divergence at canary tid=6 ↔ ours tid=1 idx 105,128 is a
`thread.create.ctx_ptr` mismatch:
```
canary: thread.create {parent_tid=6, entry_pc=0x824cd458, ctx_ptr=0xbe56bb3c, ...}
ours: thread.create {parent_tid=1, entry_pc=0x824cd458, ctx_ptr=0x42453b3c, ...}
```
- `parent_tid` was ALREADY skipped via `SKIP_PAYLOAD_FIELDS_BY_KIND["thread.create"]`
(line 245 of `diff_events.py`, in place since C+15-α). The task framing
assumed it needed new canonicalization, which was a misreading; tests now
pin the existing behavior so it doesn't regress.
- `ctx_ptr` IS the actual divergence at this index. Canary's value, `0xbe56bb3c`,
is in the BC physical heap; ours, `0x42453b3c`, is in the unified user heap.
This is the same AUDIT-043 ε class as C+2's `MmAllocatePhysicalMemoryEx`.
## Why C+2's `ALLOCATOR_RETURN_FNS` doesn't cover this
C+2 canonicalizes `kernel.return.return_value` for a known set of host-
allocator-returning exports. `ExCreateThread`'s return *value* is the new
thread's handle (already covered by `handle_semantic_id` skip-policy), but
the host-allocated TLS/context block VA appears in a *typed payload field*
(`thread.create.ctx_ptr`) — a side channel C+2 doesn't see.
## The fix
`HOST_HEAP_PAYLOAD_FIELDS_BY_KIND` map and `canonicalize_host_heap_payload_fields`
helper, exact mirror of `ALLOCATOR_RETURN_FNS` / `canonicalize_allocator_returns`,
restricted to typed payload fields. Initial set:
```python
HOST_HEAP_PAYLOAD_FIELDS_BY_KIND = {
    "thread.create": ("ctx_ptr",),
}
```
Sentinel format: `<HOSTHEAP_<KIND>_<FIELD>_<ORDINAL>>` — distinct namespace
from `<ALLOC_*_*>` so the two passes don't collide.
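
A minimal sketch of how such a canonicalizer could look, for illustration only; it is not the code in `diff_events.py`. The event shape (`tid`, `kind`, `payload`), the counter keying per `(tid, kind, field)`, and the way `<KIND>` and `<FIELD>` are rendered (upper-cased, dots replaced by underscores) are all assumptions.

```python
from collections import defaultdict

HOST_HEAP_PAYLOAD_FIELDS_BY_KIND = {
    "thread.create": ("ctx_ptr",),
}


def canonicalize_host_heap_payload_fields(events):
    """Return copies of events with host-heap payload VAs replaced by sentinels.

    Assumed event shape: {"tid": int, "kind": str, "payload": dict}.
    """
    # Assumption: one ordinal counter per (tid, kind, field).
    next_ordinal = defaultdict(int)
    out = []
    for event in events:
        fields = HOST_HEAP_PAYLOAD_FIELDS_BY_KIND.get(event["kind"], ())
        payload = dict(event.get("payload", {}))
        for field in fields:
            value = payload.get(field)
            if not isinstance(value, str):
                # None / missing: leave untouched and do not consume an ordinal.
                continue
            key = (event["tid"], event["kind"], field)
            ordinal = next_ordinal[key]
            next_ordinal[key] += 1
            # Assumed rendering of <KIND> and <FIELD> in the sentinel.
            kind_tag = event["kind"].upper().replace(".", "_")
            payload[field] = f"<HOSTHEAP_{kind_tag}_{field.upper()}_{ordinal}>"
        out.append({**event, "payload": payload})
    return out


# The two raw VAs from the TL;DR collapse to the same sentinel.
canary = [{"tid": 6, "kind": "thread.create",
           "payload": {"entry_pc": "0x824cd458", "ctx_ptr": "0xbe56bb3c"}}]
ours = [{"tid": 1, "kind": "thread.create",
         "payload": {"entry_pc": "0x824cd458", "ctx_ptr": "0x42453b3c"}}]
canary_canon = canonicalize_host_heap_payload_fields(canary)
ours_canon = canonicalize_host_heap_payload_fields(ours)
```

Because each engine's stream keeps its own counters, the sentinels agree only when both engines emit the same number of `thread.create` events in the same order, which is what keeps a count skew visible as a divergence.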
## Strict fields preserved (THE tripstone)
`thread.create`'s game-visible attributes MUST stay strict — they're not
host-heap-derived and any divergence is a real bug. Tests verify each:
| field | canary | ours | strict? |
|---|---|---|---|
| `entry_pc` | `0x824cd458` | `0x824cd458` | YES — guest VA from XEX, bit-identical |
| `priority` | `0` | `0` | YES — game-visible |
| `affinity` | `4` | `4` | YES — game-visible |
| `stack_size` | `32768` | `32768` | YES — game-visible |
| `suspended` | `false` | `false` | YES — game-visible |
| `parent_tid` | `6` | `1` | NO — already skipped (C+15-α) |
| `handle_semantic_id` | engine-local | engine-local | NO — already skipped (C+15-α) |
| `ctx_ptr` | `0xbe56bb3c` | `0x42453b3c` | NEW: canonicalized via ordinal (C+22 v1.7) |

5 negative tests in `test_diff_events.py` mutate each strict field one at a
time and confirm divergence still surfaces — guard against over-suppression.
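
The one-field-at-a-time mutation pattern can be sketched as below. `diverging_fields`, `SKIPPED_FIELDS`, and the mutated values are hypothetical stand-ins chosen for illustration; the real tests exercise `compare_event` in `diff_events.py`, whose signature is not shown in these notes.

```python
import copy

# Hypothetical stand-in for the skip policy on thread.create payloads.
SKIPPED_FIELDS = {"parent_tid", "handle_semantic_id"}
STRICT_FIELDS = ("entry_pc", "priority", "affinity", "stack_size", "suspended")


def diverging_fields(canary_payload, ours_payload):
    """Return sorted names of non-skipped fields whose values differ."""
    keys = (set(canary_payload) | set(ours_payload)) - SKIPPED_FIELDS
    return sorted(k for k in keys if canary_payload.get(k) != ours_payload.get(k))


# Baseline payload using the values from the table above; ctx_ptr is shown
# as an already-canonicalized placeholder string.
BASE = {
    "entry_pc": "0x824cd458", "priority": 0, "affinity": 4,
    "stack_size": 32768, "suspended": False,
    "parent_tid": 6, "ctx_ptr": "<canonicalized>",
}

# Arbitrary replacement values, one per strict field.
MUTATIONS = {
    "entry_pc": "0x824cd45c", "priority": 1, "affinity": 1,
    "stack_size": 65536, "suspended": True,
}


def check_strict_fields_still_diverge():
    """Mutate each strict field alone and require exactly that divergence."""
    for field in STRICT_FIELDS:
        mutated = copy.deepcopy(BASE)
        mutated[field] = MUTATIONS[field]
        assert diverging_fields(BASE, mutated) == [field], field


check_strict_fields_still_diverge()
```

Requiring the result to equal `[field]` exactly, rather than merely being non-empty, also catches a comparator that reports spurious divergences on untouched fields.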
## Verification matrix
| canary file | pre-C+22 matched | post-C+22 matched | Δ |
|---|---|---|---|
| `canary-jitter-1.jsonl` (4.4 GB, 476,943 events on tid=6) | 105,128 | **105,138** | **+10** |
| `canary-jitter-2.jsonl` (3.5 GB, 441,027 events on tid=6) | 105,128 | **105,138** | **+10** |
| `canary-jitter-3.jsonl` (3.7 GB, 445,578 events on tid=6) | 105,128 | **105,138** | **+10** |

All three jitter runs advance to the SAME new divergence: idx 105,138,
`kernel.return VdQueryVideoFlags`:
```
canary: payload.return_value = 3 (status "0x00000003")
ours: payload.return_value = 0 (status "0x00000000")
```
This is a genuine Vd subsystem divergence (UNRELATED to canonicalization)
and out of C+22's scope; it surfaces correctly as a real first divergence.
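
For readers unfamiliar with the "matched prefix" metric in the table, a minimal sketch follows. It uses plain `!=` on already-canonicalized events as a stand-in for the tool's real per-event comparison, and the toy event dicts are illustrative, not taken from the logs.

```python
def matched_prefix_length(canary_events, ours_events):
    """Count leading event pairs that compare equal.

    The returned count is also the index of the first divergence
    (or the length of the shorter stream if one simply ends first).
    """
    matched = 0
    for canary_event, ours_event in zip(canary_events, ours_events):
        if canary_event != ours_event:
            break
        matched += 1
    return matched


# Toy streams: identical for two events, then the return values differ.
canary_stream = [
    {"kind": "kernel.call", "fn": "VdQueryVideoFlags"},
    {"kind": "thread.create", "ctx_ptr": "<canonicalized>"},
    {"kind": "kernel.return", "fn": "VdQueryVideoFlags", "return_value": 3},
]
ours_stream = [
    {"kind": "kernel.call", "fn": "VdQueryVideoFlags"},
    {"kind": "thread.create", "ctx_ptr": "<canonicalized>"},
    {"kind": "kernel.return", "fn": "VdQueryVideoFlags", "return_value": 0},
]
first_divergence = matched_prefix_length(canary_stream, ours_stream)
```

In this toy case `first_divergence` is 2: canonicalizing an earlier field moves the metric forward only until the next genuinely different event.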
## Tests
8 new tests in `test_diff_events.py`:
1. `test_thread_create_ctx_ptr_in_host_heap_set` — registration sanity.
2. `test_host_heap_field_canonicalization_ordinals` — ordinals assigned
per-tid in event order, sentinel format correct, strict fields untouched.
3. `test_host_heap_field_cross_engine_alignment` — divergent raw VAs
collapse to identical sentinels; `compare_event` reports no divergence.
4. `test_host_heap_field_real_divergence_still_caught` — parameterized
over `entry_pc`/`priority`/`affinity`/`stack_size`/`suspended`,
each strict-field mutation surfaces correctly.
5. `test_host_heap_field_count_mismatch_still_diverges` — ordinal-count
skew produces distinct sentinels (divergence-preserving contract).
6. `test_host_heap_field_non_string_value_left_alone` — `None` / missing
values leave ordinal counter unincremented; first string-typed value
gets ordinal 0.
7. `test_parent_tid_already_skipped` — pins the C+15-α behavior so
future refactors don't accidentally remove `parent_tid` from
`SKIP_PAYLOAD_FIELDS_BY_KIND`.
8. (covered in #2) Strict-field preservation as positive assertion.
Total: previous 33 tests + 8 new = **41 tests, all PASS**.
## Files touched
- `xenia-rs/tools/diff-events/diff_events.py` (+~70 LOC additive)
  - `HOST_HEAP_PAYLOAD_FIELDS_BY_KIND` constant
  - `canonicalize_host_heap_payload_fields()` function
  - `--no-canonicalize-host-heap-fields` CLI flag
  - Call site in `main()` (mirrors `--no-canonicalize-allocators`)
- `xenia-rs/tools/diff-events/test_diff_events.py` (+~290 LOC tests)
- `xenia-rs/audit-runs/phase-a-diff-harness/schema-v1.md` (+~110 LOC)
  - New §"Host-heap payload-field canonicalization (v1.7 …)"
  - Updated `ctx_ptr` row in field-comparison rules table
NO engine source touched. xenia-rs HEAD unchanged. Phase B
`image_loaded_sha256` ε class boundary unchanged.
## Backward compatibility
- Wire format unchanged (`schema_version = 1`).
- Pre-C+22 event logs whose `thread.create.ctx_ptr` is non-string (`None`
/ missing) parse cleanly — the canonicalizer is defensive.
- Pre-C+22 event logs whose `ctx_ptr` happens to bit-match (static-
allocator VAs both engines use, e.g. `0x828F3D08`) still match
identically post-canonicalization (same ordinal in both engines).
- `--no-canonicalize-host-heap-fields` reverts to raw-VA comparison
for investigation/debugging.
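
One plausible way to wire an opt-out flag like this with `argparse` is sketched here. Only the flag name comes from these notes; `build_parser`, `should_canonicalize`, and the help text are hypothetical, and the real call site in `main()` may differ.

```python
import argparse


def build_parser():
    """Build a parser carrying only the host-heap opt-out flag."""
    parser = argparse.ArgumentParser(prog="diff_events.py")
    parser.add_argument(
        "--no-canonicalize-host-heap-fields",
        action="store_true",
        help="compare raw host-heap VAs instead of ordinal sentinels",
    )
    return parser


def should_canonicalize(argv):
    """Canonicalization is on by default and off when the flag is given."""
    args = build_parser().parse_args(argv)
    # argparse maps the dashes in the option name to underscores.
    return not args.no_canonicalize_host_heap_fields
```

Keeping the default as "canonicalize" means existing invocations pick up the new pass without changes, while the flag restores the raw-VA comparison described above.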
## Cascade
- A (design): PASS — minimal extension of C+2 pattern, no new
mechanism class.
- B (implement + test): PASS — 8 new tests, 41 total PASS.
- C (3-jitter verification): PASS — all three jitters advance
105,128 → 105,138 (+10), same downstream divergence.
- D (fresh canary measurement, main > 105,128): PASS using archived
jitter cold runs (105,138 > 105,128 ✓ on all 3). A fresh canary
cold run was NOT initiated this session; the 3-jitter archived
set is the protocol-honored substitute when canary is wedged or the
build is slow (per phase-c25-mm-allocator-family precedent).
## Next divergence (C+23 candidate)
`kernel.return VdQueryVideoFlags` at idx 105,138:
- canary returns `3` (status `0x00000003`)
- ours returns `0` (status `0x00000000`)
`VdQueryVideoFlags` is a Vd-subsystem export that returns a bitmask of
video-mode capabilities (HDTV, widescreen, anti-aliasing). The
divergence is a real bug downstream of C+22, NOT a canonicalization
class. C+23+ scope.
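
As a starting point for C+23, the two return values can be compared bit by bit. The sketch below only reports which bit positions differ; it deliberately assigns no meaning to them, since these notes do not say which capability each bit encodes.

```python
def differing_bits(canary_value, ours_value):
    """Return the bit positions set in exactly one of the two values."""
    delta = canary_value ^ ours_value
    return [bit for bit in range(delta.bit_length()) if (delta >> bit) & 1]


print(differing_bits(3, 0))  # [0, 1]
```

Canary sets bits 0 and 1 and ours sets neither, so the investigation needs to account for two flags rather than one.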