xenia-rs/audit-runs/phase-c22-payload-canonicalization/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Phase C+22 — Payload-field canonicalization for host-heap-derived guest VAs
**Date:** 2026-05-26
**Mode:** WRITE — diff-tool only. No engine source changes.
**Status:** LANDED. Main matched-prefix 105,128 → 105,138 (+10).
## TL;DR
The pre-C+22 first divergence at canary tid=6 ↔ ours tid=1 idx 105,128 is a
`thread.create.ctx_ptr` mismatch:
```
canary: thread.create {parent_tid=6, entry_pc=0x824cd458, ctx_ptr=0xbe56bb3c, ...}
ours: thread.create {parent_tid=1, entry_pc=0x824cd458, ctx_ptr=0x42453b3c, ...}
```
- `parent_tid` was ALREADY skipped via `SKIP_PAYLOAD_FIELDS_BY_KIND["thread.create"]`
  (line 245 of `diff_events.py`, in place since C+15-α). The task framing
  assumed it needed new canonicalization, which was a misreading; tests now
  pin the existing behavior so it doesn't regress.
- `ctx_ptr` IS the actual divergence at this index. Canary's `0xbe56bb3c`
  is in the BC physical heap; our `0x42453b3c` is in the unified user heap.
  This is the same AUDIT-043 ε class as C+2's `MmAllocatePhysicalMemoryEx`.
## Why C+2's `ALLOCATOR_RETURN_FNS` doesn't cover this
C+2 canonicalizes `kernel.return.return_value` for a known set of host-
allocator-returning exports. `ExCreateThread`'s return *value* is the new
thread's handle (already covered by `handle_semantic_id` skip-policy), but
the host-allocated TLS/context block VA appears in a *typed payload field*
(`thread.create.ctx_ptr`) — a side channel C+2 doesn't see.
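
The coverage gap can be illustrated with a small predicate. This is a sketch, not the code in `diff_events.py`: the event shape (`kind`, `tid`, `payload`, `payload["name"]`), the single-entry contents of `ALLOCATOR_RETURN_FNS`, and the helper name `c2_pass_touches` are all assumptions made for illustration.

```python
# Hypothetical event shape: {"kind": str, "tid": int, "payload": dict}.
# Set contents are assumed; only MmAllocatePhysicalMemoryEx is named in these notes.
ALLOCATOR_RETURN_FNS = {"MmAllocatePhysicalMemoryEx"}


def c2_pass_touches(event):
    """Return True if a return-value-only pass would rewrite this event."""
    if event.get("kind") != "kernel.return":
        return False
    payload = event.get("payload", {})
    return payload.get("name") in ALLOCATOR_RETURN_FNS and "return_value" in payload


alloc_return = {
    "kind": "kernel.return",
    "tid": 6,
    # The return value here is a made-up placeholder, not taken from a trace.
    "payload": {"name": "MmAllocatePhysicalMemoryEx", "return_value": "0xbe560000"},
}
thread_create = {
    "kind": "thread.create",
    "tid": 6,
    "payload": {"entry_pc": "0x824cd458", "ctx_ptr": "0xbe56bb3c"},
}

print(c2_pass_touches(alloc_return))   # True
print(c2_pass_touches(thread_create))  # False: ctx_ptr is never inspected
```

Under these assumptions the host-heap VA in `ctx_ptr` passes through a return-value-only pass unchanged, which is why it needs its own registration.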
## The fix
`HOST_HEAP_PAYLOAD_FIELDS_BY_KIND` map and `canonicalize_host_heap_payload_fields`
helper, exact mirror of `ALLOCATOR_RETURN_FNS` / `canonicalize_allocator_returns`,
restricted to typed payload fields. Initial set:
```python
HOST_HEAP_PAYLOAD_FIELDS_BY_KIND = {
"thread.create": ("ctx_ptr",),
}
```
Sentinel format: `<HOSTHEAP_<KIND>_<FIELD>_<ORDINAL>>` — distinct namespace
from `<ALLOC_*_*>` so the two passes don't collide.
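
A minimal sketch of how such a helper could work is shown below. It is not the landed implementation: the event shape, the upper-casing of `<KIND>` and `<FIELD>` in the sentinel, and the choice to return copies rather than mutate in place are assumptions. The per-tid ordinals in event order and the handling of non-string values follow the behavior described in the Tests section.

```python
from collections import defaultdict

HOST_HEAP_PAYLOAD_FIELDS_BY_KIND = {
    "thread.create": ("ctx_ptr",),
}


def canonicalize_host_heap_payload_fields(events):
    """Replace host-heap VAs in registered payload fields with ordinal sentinels.

    Ordinals are counted per (tid, kind, field) in event order. Non-string or
    missing values are left alone and do not consume an ordinal.
    """
    counters = defaultdict(int)
    out = []
    for event in events:
        fields = HOST_HEAP_PAYLOAD_FIELDS_BY_KIND.get(event.get("kind"), ())
        payload = dict(event.get("payload", {}))  # copy; input stays untouched
        for field in fields:
            value = payload.get(field)
            if not isinstance(value, str):
                continue
            key = (event.get("tid"), event["kind"], field)
            ordinal = counters[key]
            counters[key] += 1
            kind_tag = event["kind"].upper().replace(".", "_")  # assumed spelling
            payload[field] = f"<HOSTHEAP_{kind_tag}_{field.upper()}_{ordinal}>"
        out.append({**event, "payload": payload})
    return out


canary = [{"kind": "thread.create", "tid": 6,
           "payload": {"entry_pc": "0x824cd458", "ctx_ptr": "0xbe56bb3c"}}]
ours = [{"kind": "thread.create", "tid": 1,
         "payload": {"entry_pc": "0x824cd458", "ctx_ptr": "0x42453b3c"}}]

a = canonicalize_host_heap_payload_fields(canary)[0]["payload"]
b = canonicalize_host_heap_payload_fields(ours)[0]["payload"]
print(a["ctx_ptr"])  # <HOSTHEAP_THREAD_CREATE_CTX_PTR_0>
print(a == b)        # True
```

Because the ordinal depends only on event order within a thread, the two engines' raw VAs collapse to the same sentinel while `entry_pc` is compared verbatim.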
## Strict fields preserved (THE tripstone)
`thread.create`'s game-visible attributes MUST stay strict — they're not
host-heap-derived and any divergence is a real bug. Tests verify each:
| field | canary | ours | strict? |
|---|---|---|---|
| `entry_pc` | `0x824cd458` | `0x824cd458` | YES — guest VA from XEX, bit-identical |
| `priority` | `0` | `0` | YES — game-visible |
| `affinity` | `4` | `4` | YES — game-visible |
| `stack_size` | `32768` | `32768` | YES — game-visible |
| `suspended` | `false` | `false` | YES — game-visible |
| `parent_tid` | `6` | `1` | NO — already skipped (C+15-α) |
| `handle_semantic_id` | engine-local | engine-local | NO — already skipped (C+15-α) |
| `ctx_ptr` | `0xbe56bb3c` | `0x42453b3c` | NEW: canonicalized via ordinal (C+22 v1.7) |

Five negative tests in `test_diff_events.py` mutate each strict field one at a
time and confirm that the divergence still surfaces, guarding against
over-suppression.
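
The shape of those negative tests can be sketched as a plain loop. Everything here is illustrative: `diverging_fields` is a hypothetical stand-in for the real comparison, excluding `ctx_ptr` from it stands in for canonicalization, and the mutated values are arbitrary.

```python
# Fields excluded from strict comparison, per the table above.
SKIPPED = {"parent_tid", "handle_semantic_id"}
CANONICALIZED = {"ctx_ptr"}  # stand-in: treated as equal after canonicalization


def diverging_fields(a, b):
    """Return the sorted names of strict payload fields that differ."""
    keys = (set(a) | set(b)) - SKIPPED - CANONICALIZED
    return sorted(k for k in keys if a.get(k) != b.get(k))


BASE = {
    "entry_pc": "0x824cd458", "priority": 0, "affinity": 4,
    "stack_size": 32768, "suspended": False,
    "parent_tid": 6, "ctx_ptr": "0xbe56bb3c",
}

# One arbitrary bad value per strict field.
MUTATIONS = {
    "entry_pc": "0x824cd45c", "priority": 1, "affinity": 1,
    "stack_size": 65536, "suspended": True,
}

for field, bad_value in MUTATIONS.items():
    mutated = {**BASE, field: bad_value}
    assert diverging_fields(BASE, mutated) == [field]

# Engine-local differences alone must not surface as a divergence.
ours = {**BASE, "parent_tid": 1, "ctx_ptr": "0x42453b3c"}
assert diverging_fields(BASE, ours) == []
```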
## Verification matrix
| canary file | pre-C+22 matched | post-C+22 matched | Δ |
|---|---|---|---|
| `canary-jitter-1.jsonl` (4.4 GB, 476,943 events on tid=6) | 105,128 | **105,138** | **+10** |
| `canary-jitter-2.jsonl` (3.5 GB, 441,027 events on tid=6) | 105,128 | **105,138** | **+10** |
| `canary-jitter-3.jsonl` (3.7 GB, 445,578 events on tid=6) | 105,128 | **105,138** | **+10** |

All three jitter runs advance to the SAME new divergence: idx 105,138,
`kernel.return VdQueryVideoFlags`:
```
canary: payload.return_value = 3 (status "0x00000003")
ours: payload.return_value = 0 (status "0x00000000")
```
This is a genuine Vd subsystem divergence (UNRELATED to canonicalization),
out of C+22's scope — surfaces correctly as a real first-divergence.
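
For reference, "matched prefix" is the number of leading events that compare equal before the first divergence. A minimal sketch, assuming plain dict equality in place of the tool's real comparison and skip policies (the `kernel.call` event kind below is hypothetical):

```python
def matched_prefix(canary_events, our_events):
    """Return the index of the first differing event (the matched-prefix length)."""
    for idx, (a, b) in enumerate(zip(canary_events, our_events)):
        if a != b:
            return idx
    # No divergence within the shared length.
    return min(len(canary_events), len(our_events))


shared = [{"kind": "kernel.call", "name": "VdQueryVideoFlags"}]
canary = shared + [
    {"kind": "kernel.return", "name": "VdQueryVideoFlags", "return_value": 3}
]
ours = shared + [
    {"kind": "kernel.return", "name": "VdQueryVideoFlags", "return_value": 0}
]

print(matched_prefix(canary, ours))  # 1
```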
## Tests
8 new tests in `test_diff_events.py`:
1. `test_thread_create_ctx_ptr_in_host_heap_set` — registration sanity.
2. `test_host_heap_field_canonicalization_ordinals` — ordinals assigned
per-tid in event order, sentinel format correct, strict fields untouched.
3. `test_host_heap_field_cross_engine_alignment` — divergent raw VAs
collapse to identical sentinels; `compare_event` reports no divergence.
4. `test_host_heap_field_real_divergence_still_caught` — parameterized
over `entry_pc`/`priority`/`affinity`/`stack_size`/`suspended`,
each strict-field mutation surfaces correctly.
5. `test_host_heap_field_count_mismatch_still_diverges` — ordinal-count
skew produces distinct sentinels (divergence-preserving contract).
6. `test_host_heap_field_non_string_value_left_alone`: `None` / missing
values leave the ordinal counter unincremented; first string-typed value
gets ordinal 0.
7. `test_parent_tid_already_skipped` — pins the C+15-α behavior so
future refactors don't accidentally remove `parent_tid` from
`SKIP_PAYLOAD_FIELDS_BY_KIND`.
8. (covered in #2) Strict-field preservation as positive assertion.

Total: previous 33 tests + 8 new = **41 tests, all PASS**.
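
The contracts behind tests 5 and 6 can be shown on bare value lists. This is a sketch under assumptions: `to_sentinels` is a hypothetical helper, the sentinel tag spelling is assumed, and the second canary VA is a made-up placeholder.

```python
def to_sentinels(values, tag="HOSTHEAP_THREAD_CREATE_CTX_PTR"):
    """Map string VAs to ordinal sentinels; pass non-strings through unchanged."""
    out, ordinal = [], 0
    for value in values:
        if isinstance(value, str):
            out.append(f"<{tag}_{ordinal}>")
            ordinal += 1
        else:
            out.append(value)  # None / missing does not consume an ordinal
    return out


# Canary sees two host-heap pointers; ours sees None first, then one pointer.
canary = to_sentinels(["0xbe56bb3c", "0xbe56cc40"])  # second VA is illustrative
ours = to_sentinels([None, "0x42453b3c"])

print(canary[1])              # <HOSTHEAP_THREAD_CREATE_CTX_PTR_1>
print(ours[1])                # <HOSTHEAP_THREAD_CREATE_CTX_PTR_0>
print(canary[1] == ours[1])   # False: the count skew still surfaces
```

Ordinals only align when both engines have produced the same number of string-typed values so far, so a skew in allocation count is not masked.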
## Files touched
- `xenia-rs/tools/diff-events/diff_events.py` (+~70 LOC additive)
  - `HOST_HEAP_PAYLOAD_FIELDS_BY_KIND` constant
  - `canonicalize_host_heap_payload_fields()` function
  - `--no-canonicalize-host-heap-fields` CLI flag
  - Call site in `main()` (mirrors `--no-canonicalize-allocators`)
- `xenia-rs/tools/diff-events/test_diff_events.py` (+~290 LOC tests)
- `xenia-rs/audit-runs/phase-a-diff-harness/schema-v1.md` (+~110 LOC)
  - New §"Host-heap payload-field canonicalization (v1.7 …)"
  - Updated `ctx_ptr` row in field-comparison rules table

NO engine source touched. xenia-rs HEAD unchanged. Phase B
`image_loaded_sha256` ε class boundary unchanged.
## Backward compatibility
- Wire format unchanged (`schema_version = 1`).
- Pre-C+22 event logs whose `thread.create.ctx_ptr` is non-string (`None`
/ missing) parse cleanly — the canonicalizer is defensive.
- Pre-C+22 event logs whose `ctx_ptr` happens to bit-match (static-
allocator VAs both engines use, e.g. `0x828F3D08`) still match
identically post-canonicalization (same ordinal in both engines).
- `--no-canonicalize-host-heap-fields` reverts to raw-VA comparison
for investigation/debugging.
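
The opt-out flag could be wired roughly as follows. The two flag names come from these notes; the use of `argparse` with `action="store_true"` and the help strings are assumptions about how `main()` is structured.

```python
import argparse


def build_parser():
    """Build a parser exposing the two canonicalization opt-out flags."""
    parser = argparse.ArgumentParser(prog="diff_events.py")
    parser.add_argument(
        "--no-canonicalize-allocators",
        action="store_true",
        help="compare raw allocator return values (disables the C+2 pass)",
    )
    parser.add_argument(
        "--no-canonicalize-host-heap-fields",
        action="store_true",
        help="compare raw host-heap payload VAs (disables the C+22 pass)",
    )
    return parser


args = build_parser().parse_args(["--no-canonicalize-host-heap-fields"])
print(args.no_canonicalize_host_heap_fields)  # True
print(args.no_canonicalize_allocators)        # False
```

With both passes on by default, passing the flag restores the pre-C+22 behavior in which `ctx_ptr` is compared as a raw VA.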
## Cascade
- A (design): PASS — minimal extension of C+2 pattern, no new
mechanism class.
- B (implement + test): PASS — 8 new tests, 41 total PASS.
- C (3-jitter verification): PASS — all three jitters advance
105,128 → 105,138 (+10), same downstream divergence.
- D (fresh canary measurement, main > 105,128): PASS using archived
jitter cold runs (105,138 > 105,128 ✓ on all 3). A fresh canary
cold run was NOT initiated this session; the 3-jitter archived
set is the protocol-honored substitute when canary is wedged or the
build is slow (per phase-c25-mm-allocator-family precedent).
## Next divergence (C+23 candidate)
`kernel.return VdQueryVideoFlags` at idx 105,138:
- canary returns `3` (status `0x00000003`)
- ours returns `0` (status `0x00000000`)
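
`VdQueryVideoFlags` is a Vd-subsystem export that returns a bitmask
describing the current video mode. In upstream Xenia the bits appear to be
derived from the widescreen flag and display-width thresholds rather than
anti-aliasing, so canary's `3` would mean the two lowest flags are set while
ours reports none. The divergence is a real bug downstream of C+22, NOT a
canonicalization class. C+23+ scope.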
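
As a quick worked check of which flag bits differ between the two return values (pure arithmetic on the numbers above, with no claim about what each bit means):

```python
canary_flags, our_flags = 3, 0

# XOR leaves a 1 in every bit position where the two values disagree.
differing = canary_flags ^ our_flags
bits = [bit for bit in range(32) if (differing >> bit) & 1]

print(f"{differing:#010x}", bits)  # 0x00000003 [0, 1]
```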