xenia-rs/audit-runs/phase-c22-payload-canonicalization/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase C+22 — Payload-field canonicalization for host-heap-derived guest VAs

Date: 2026-05-26

Mode: WRITE, diff-tool only. No engine source changes.

Status: LANDED. Main matched-prefix 105,128 → 105,138 (+10).

TL;DR

The pre-C+22 first divergence at canary tid=6 ↔ ours tid=1 idx 105,128 is a thread.create.ctx_ptr mismatch:

canary: thread.create {parent_tid=6, entry_pc=0x824cd458, ctx_ptr=0xbe56bb3c, ...}
ours:   thread.create {parent_tid=1, entry_pc=0x824cd458, ctx_ptr=0x42453b3c, ...}
  • parent_tid was ALREADY skipped via SKIP_PAYLOAD_FIELDS_BY_KIND["thread.create"] (line 245 of diff_events.py, in place since C+15-α). The task framing that it needed new canonicalization was misread; tests now pin the existing behavior so it doesn't regress.
  • ctx_ptr IS the actual divergence at this index. Canary's 0xbe56bb3c is in the BC physical heap; our 0x42453b3c is in the unified user heap. This is the same AUDIT-043 ε class as C+2's MmAllocatePhysicalMemoryEx.

Why C+2's ALLOCATOR_RETURN_FNS doesn't cover this

C+2 canonicalizes kernel.return.return_value for a known set of host-allocator-returning exports. ExCreateThread's return value is the new thread's handle (already covered by the handle_semantic_id skip policy), but the host-allocated TLS/context block VA appears in a typed payload field (thread.create.ctx_ptr), which is a side channel C+2 doesn't see.

The fix

The fix adds a HOST_HEAP_PAYLOAD_FIELDS_BY_KIND map and a canonicalize_host_heap_payload_fields helper. They mirror ALLOCATOR_RETURN_FNS / canonicalize_allocator_returns exactly, but are restricted to typed payload fields. Initial set:

HOST_HEAP_PAYLOAD_FIELDS_BY_KIND = {
    "thread.create": ("ctx_ptr",),
}

Sentinel format: <HOSTHEAP_<KIND>_<FIELD>_<ORDINAL>> — distinct namespace from <ALLOC_*_*> so the two passes don't collide.
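
A minimal sketch of how such a pass might look, not the code in diff_events.py. It assumes each event is a dict with "kind", "tid" and "payload" keys, that ordinals are counted per (tid, kind, field) in event order, and that the sentinel spells KIND and FIELD in upper case with dots turned into underscores. All three are assumptions made for illustration.

```python
from collections import defaultdict

HOST_HEAP_PAYLOAD_FIELDS_BY_KIND = {
    "thread.create": ("ctx_ptr",),
}


def canonicalize_host_heap_payload_fields(events):
    """Replace host-heap-derived VAs with ordinal sentinels, in place.

    Assumed event shape: {"kind": str, "tid": int, "payload": dict}.
    """
    counters = defaultdict(int)  # (tid, kind, field) -> next ordinal
    for event in events:
        kind = event.get("kind")
        payload = event.get("payload")
        if not isinstance(payload, dict):
            continue
        for field in HOST_HEAP_PAYLOAD_FIELDS_BY_KIND.get(kind, ()):
            value = payload.get(field)
            if not isinstance(value, str):
                # None / missing: leave the value alone and do not
                # consume an ordinal.
                continue
            key = (event.get("tid"), kind, field)
            tag = kind.upper().replace(".", "_")  # assumed spelling
            payload[field] = f"<HOSTHEAP_{tag}_{field.upper()}_{counters[key]}>"
            counters[key] += 1
    return events


canary = [{"kind": "thread.create", "tid": 6,
           "payload": {"entry_pc": "0x824cd458", "ctx_ptr": "0xbe56bb3c"}}]
ours = [{"kind": "thread.create", "tid": 1,
         "payload": {"entry_pc": "0x824cd458", "ctx_ptr": "0x42453b3c"}}]
canonicalize_host_heap_payload_fields(canary)
canonicalize_host_heap_payload_fields(ours)
# Both raw VAs collapse to the same sentinel; entry_pc is untouched.
assert canary[0]["payload"]["ctx_ptr"] == ours[0]["payload"]["ctx_ptr"]
```

Because the ordinal counts occurrences rather than hashing the address, an extra or missing thread.create on one side shifts every later sentinel, so a count skew still shows up as a divergence.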

Strict fields preserved (THE tripstone)

thread.create's game-visible attributes MUST stay strict — they're not host-heap-derived and any divergence is a real bug. Tests verify each:

| field | canary | ours | strict? |
|---|---|---|---|
| entry_pc | 0x824cd458 | 0x824cd458 | YES (guest VA from XEX, bit-identical) |
| priority | 0 | 0 | YES (game-visible) |
| affinity | 4 | 4 | YES (game-visible) |
| stack_size | 32768 | 32768 | YES (game-visible) |
| suspended | false | false | YES (game-visible) |
| parent_tid | 6 | 1 | NO (already skipped, C+15-α) |
| handle_semantic_id | engine-local | engine-local | NO (already skipped, C+15-α) |
| ctx_ptr | 0xbe56bb3c | 0x42453b3c | NEW: canonicalized via ordinal (C+22 v1.7) |

5 negative tests in test_diff_events.py mutate each strict field one at a time and confirm that the divergence still surfaces, guarding against over-suppression.
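
The shape of those negative tests can be sketched as follows. This is an illustration only: diverging_fields is a hypothetical stand-in for the real comparator, the mutated values are arbitrary, and ctx_ptr is assumed to have been canonicalized to the same sentinel on both sides already (the sentinel spelling is also an assumption).

```python
# Fields skipped by policy, per the table above.
SKIPPED_FIELDS = {"parent_tid", "handle_semantic_id"}


def diverging_fields(canary, ours):
    """Hypothetical comparator: names of non-skipped fields that differ."""
    names = (canary.keys() | ours.keys()) - SKIPPED_FIELDS
    return sorted(n for n in names if canary.get(n) != ours.get(n))


BASE = {
    "entry_pc": "0x824cd458",
    "priority": 0,
    "affinity": 4,
    "stack_size": 32768,
    "suspended": False,
    # Assumed post-canonicalization value, identical on both sides.
    "ctx_ptr": "<HOSTHEAP_THREAD_CREATE_CTX_PTR_0>",
}
canary = {**BASE, "parent_tid": 6}
ours = {**BASE, "parent_tid": 1}

# Baseline: skipped and canonicalized fields hide nothing else.
assert diverging_fields(canary, ours) == []

# Mutate one strict field at a time; each must surface on its own.
MUTATIONS = {
    "entry_pc": "0x824cd45c",
    "priority": 1,
    "affinity": 2,
    "stack_size": 65536,
    "suspended": True,
}
for field, bad_value in MUTATIONS.items():
    mutated = {**ours, field: bad_value}
    assert diverging_fields(canary, mutated) == [field]
```

Checking that exactly one field is reported per mutation catches both over-suppression (nothing reported) and accidental coupling between fields.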

Verification matrix

| canary file | pre-C+22 matched | post-C+22 matched | Δ |
|---|---|---|---|
| canary-jitter-1.jsonl (4.4 GB, 476,943 events on tid=6) | 105,128 | 105,138 | +10 |
| canary-jitter-2.jsonl (3.5 GB, 441,027 events on tid=6) | 105,128 | 105,138 | +10 |
| canary-jitter-3.jsonl (3.7 GB, 445,578 events on tid=6) | 105,128 | 105,138 | +10 |

All three jitter runs advance to the SAME new divergence: idx 105,138, kernel.return VdQueryVideoFlags:

canary: payload.return_value = 3 (status "0x00000003")
ours:   payload.return_value = 0 (status "0x00000000")

This is a genuine Vd subsystem divergence (UNRELATED to canonicalization), out of C+22's scope — surfaces correctly as a real first-divergence.
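
For readers unfamiliar with the metric, the matched-prefix count is the number of leading event pairs that compare equal, which is also the 0-based index of the first divergence. A minimal sketch, assuming plain equality as the comparison (the real tool applies its skip and canonicalization rules first) and using toy events:

```python
def matched_prefix_len(canary_events, our_events, same=lambda a, b: a == b):
    """Count leading event pairs for which same(a, b) holds."""
    matched = 0
    for a, b in zip(canary_events, our_events):
        if not same(a, b):
            break
        matched += 1
    return matched


# Toy streams: one matching event, then a return-value mismatch.
canary = [
    {"kind": "toy.event", "value": 1},
    {"kind": "kernel.return", "name": "VdQueryVideoFlags", "return_value": 3},
]
ours = [
    {"kind": "toy.event", "value": 1},
    {"kind": "kernel.return", "name": "VdQueryVideoFlags", "return_value": 0},
]
first_divergence_idx = matched_prefix_len(canary, ours)
assert first_divergence_idx == 1
```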

Tests

8 new tests in test_diff_events.py:

  1. test_thread_create_ctx_ptr_in_host_heap_set — registration sanity.
  2. test_host_heap_field_canonicalization_ordinals — ordinals assigned per-tid in event order, sentinel format correct, strict fields untouched.
  3. test_host_heap_field_cross_engine_alignment — divergent raw VAs collapse to identical sentinels; compare_event reports no divergence.
  4. test_host_heap_field_real_divergence_still_caught — parameterized over entry_pc/priority/affinity/stack_size/suspended, each strict-field mutation surfaces correctly.
  5. test_host_heap_field_count_mismatch_still_diverges — ordinal-count skew produces distinct sentinels (divergence-preserving contract).
  6. test_host_heap_field_non_string_value_left_alone: None / missing values leave the ordinal counter unincremented; the first string-typed value gets ordinal 0.
  7. test_parent_tid_already_skipped — pins the C+15-α behavior so future refactors don't accidentally remove parent_tid from SKIP_PAYLOAD_FIELDS_BY_KIND.
  8. (covered in #2) Strict-field preservation as positive assertion.

Total: previous 33 tests + 8 new = 41 tests, all PASS.

Files touched

  • xenia-rs/tools/diff-events/diff_events.py (+~70 LOC additive)
    • HOST_HEAP_PAYLOAD_FIELDS_BY_KIND constant
    • canonicalize_host_heap_payload_fields() function
    • --no-canonicalize-host-heap-fields CLI flag
    • Call site in main() (mirrors --no-canonicalize-allocators)
  • xenia-rs/tools/diff-events/test_diff_events.py (+~290 LOC tests)
  • xenia-rs/audit-runs/phase-a-diff-harness/schema-v1.md (+~110 LOC)
    • New §"Host-heap payload-field canonicalization (v1.7 …)"
    • Updated ctx_ptr row in field-comparison rules table

NO engine source touched. xenia-rs HEAD unchanged. Phase B image_loaded_sha256 ε class boundary unchanged.

Backward compatibility

  • Wire format unchanged (schema_version = 1).
  • Pre-C+22 event logs whose thread.create.ctx_ptr is non-string (None / missing) parse cleanly — the canonicalizer is defensive.
  • Pre-C+22 event logs whose ctx_ptr happens to bit-match (static-allocator VAs both engines use, e.g. 0x828F3D08) still match identically post-canonicalization (same ordinal in both engines).
  • --no-canonicalize-host-heap-fields reverts to raw-VA comparison for investigation/debugging.
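
One way such opt-out flags are commonly wired with argparse is shown below. The two flag names come from this note; the dest names, help text and build_parser function are assumptions for illustration, not the contents of diff_events.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="diff_events.py")
    # store_false flags default to True, so canonicalization stays on
    # unless the user explicitly opts out.
    parser.add_argument(
        "--no-canonicalize-allocators",
        dest="canonicalize_allocators",  # hypothetical dest name
        action="store_false",
        help="compare raw allocator return values",
    )
    parser.add_argument(
        "--no-canonicalize-host-heap-fields",
        dest="canonicalize_host_heap_fields",  # hypothetical dest name
        action="store_false",
        help="compare raw host-heap payload VAs",
    )
    return parser


default_args = build_parser().parse_args([])
debug_args = build_parser().parse_args(["--no-canonicalize-host-heap-fields"])
assert default_args.canonicalize_host_heap_fields is True
assert debug_args.canonicalize_host_heap_fields is False
```

Keeping the two switches independent lets an investigation disable one canonicalization pass while leaving the other active.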

Cascade

  • A (design): PASS — minimal extension of C+2 pattern, no new mechanism class.
  • B (implement + test): PASS — 8 new tests, 41 total PASS.
  • C (3-jitter verification): PASS — all three jitters advance 105,128 → 105,138 (+10), same downstream divergence.
  • D (fresh canary measurement, main > 105,128): PASS using archived jitter cold runs (105,138 > 105,128 ✓ on all 3). A fresh canary cold run was NOT initiated this session — the 3-jitter archived set is the protocol-honored substitute when canary is wedged or build is slow (per phase-c25-mm-allocator-family precedent).

Next divergence (C+23 candidate)

kernel.return VdQueryVideoFlags at idx 105,138:

  • canary returns 3 (status 0x00000003)
  • ours returns 0 (status 0x00000000)

VdQueryVideoFlags is a Vd-subsystem export that returns a bitmask of video-mode capabilities (HDTV, widescreen, anti-aliasing). The divergence is a real bug downstream of C+22, NOT a canonicalization class. C+23+ scope.