xenia-rs/audit-runs/phase-a-diff-harness/schema-v1.md

# Phase A — Event Log Schema v1

**Status:** frozen for Phase A and Phase B. Adding a new event kind requires a `schema_version` bump and a coordinated update in both engines + the diff tool.

## Wire format

JSONL — one JSON object per line, UTF-8, `\n`-terminated. Both engines emit the same byte format.

The **first line** of every event-log file MUST be a `schema_version` event:

```json
{"schema_version":1,"engine":"canary","kind":"schema_version","tid":0,"tid_event_idx":0,"guest_cycle":0,"host_ns":0,"deterministic":true,"payload":{"version":1,"emitter_build":"<commit-or-build-id>"}}
```

The diff tool refuses to parse a file whose first event is not `schema_version` with version `1`.
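
The header check can be sketched as below. This is a minimal illustration and not the diff tool's actual code: the function name `check_header` and the sample `emitter_build` value are hypothetical.

```python
import json

def check_header(first_line: str) -> dict:
    """Parse the first JSONL line and reject anything but a v1 schema header."""
    event = json.loads(first_line)
    if event.get("kind") != "schema_version":
        raise ValueError("first event is not schema_version")
    if event.get("payload", {}).get("version") != 1:
        raise ValueError("unsupported schema version")
    return event

# Sample header line; "abc123" stands in for a real commit or build id.
HEADER_LINE = (
    '{"schema_version":1,"engine":"canary","kind":"schema_version",'
    '"tid":0,"tid_event_idx":0,"guest_cycle":0,"host_ns":0,'
    '"deterministic":true,"payload":{"version":1,"emitter_build":"abc123"}}'
)
header = check_header(HEADER_LINE)
```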

## Common fields (every event)

| Field | Type | Notes |
|---|---|---|
| `schema_version` | int | always `1` in this phase |
| `engine` | string | `"canary"` or `"ours"` |
| `kind` | string | one of the v1 kinds below |
| `tid` | int | guest thread id of the calling thread (host TID never logged) |
| `tid_event_idx` | int | **per-tid monotonic, starts at 0** — the diff key |
| `guest_cycle` | int | per-engine monotonic guest-instruction count; `0` if the engine cannot supply one (see "Cycle source" below). NOT used by the diff tool for correctness — `tid_event_idx` is the canonical key. |
| `host_ns` | int | host monotonic-clock ns since process start; debug only, never compared by diff |
| `deterministic` | bool | `false` if any payload field is derived from host time / raw allocator address / RNG / etc. Diff tool skip-compares non-deterministic fields. |
| `payload` | object | kind-specific (see below) |

## Cycle source notes

- **canary**: the PPC `tb` (timebase) register can be read from the PPCContext passed into shim handlers. If a hook is on a path that does not have access to a PPCContext (e.g. a host-side handle-table destructor), the emitter MUST set `guest_cycle = 0` and leave `deterministic = false` on the payload-side metadata. The diff tool ignores `guest_cycle` for ordering — `tid_event_idx` is canonical.
- **ours**: `scheduler.thread(current_ref()).ctx.timebase` (already maintained per guest thread).

## Per-tid event index

Both engines maintain a per-tid monotonic counter starting at `0`. Each event claims the counter's current value and then bumps it (fetch-then-increment). The claim happens **before** the event is serialized, so the first event for tid `N` has `tid_event_idx = 0`.

The `schema_version` event is special: it is emitted by the writer thread (typically the boot thread before any guest code has run) with `tid = 0` and `tid_event_idx = 0`. The actual guest thread `0` does not exist; the diff tool treats `tid = 0` as the schema header only.
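
A minimal sketch of such a per-tid counter, assuming fetch-then-increment semantics so that the first index handed out for each tid is `0`. The class and method names are hypothetical, and a real emitter would also need thread-safe access.

```python
from collections import defaultdict

class TidCounters:
    """Hands out per-tid event indices starting at 0 (single-threaded sketch)."""

    def __init__(self):
        self._next = defaultdict(int)  # tid -> next index to hand out

    def claim(self, tid: int) -> int:
        idx = self._next[tid]      # value used as tid_event_idx
        self._next[tid] = idx + 1  # bump for the next event on this tid
        return idx

counters = TidCounters()
first = counters.claim(6)
second = counters.claim(6)
```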

## Handle semantic ID

Canary and ours produce guest handles in different ranges (canary: `0xF8xxxxxx` region; ours: bump-allocated `0x4, 0x8, 0xC, …`). Raw handle IDs are unsuitable as a cross-engine identity. Instead, both engines compute a stable **handle semantic ID** at handle creation time using **FNV-1a 64-bit** over a fixed-format byte string. FNV-1a is used (not SHA256) because both engines can implement it in <10 lines with no dependency, and the diff tool only needs a deterministic identity hash — not a crypto property.

```
input_bytes = le_u32(create_site_pc) ‖ le_u32(creating_tid) ‖ le_u64(tid_event_idx_at_creation) ‖ le_u32(object_type)
hash = 0xCBF29CE484222325
for each byte b in input_bytes:
    hash = (hash XOR b) * 0x100000001B3   mod 2^64
handle_semantic_id = format("{:016x}", hash)
```

Both engines MUST emit the lowercase 16-hex-char form. The `create_site_pc` is the guest PC at the call site of the kernel call that created the handle: in canary, `PPCContext::lr - 4` (the `bl` to the import stub); in ours, the equivalent return address from the syscall dispatcher.
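
The recipe above translates directly into a few lines of Python. This is a sketch for illustration (for example, for a diff-tool unit test); the function names are hypothetical and the sample inputs are made up.

```python
import struct

FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF

def fnv1a_64(data: bytes) -> int:
    """FNV-1a 64-bit: XOR each byte in, then multiply by the prime mod 2^64."""
    h = FNV_OFFSET_BASIS
    for b in data:
        h = ((h ^ b) * FNV_PRIME) & MASK_64
    return h

def handle_semantic_id(create_site_pc: int, creating_tid: int,
                       tid_event_idx: int, object_type: int) -> str:
    # "<IIQI" packs le_u32, le_u32, le_u64, le_u32 with no padding (20 bytes).
    data = struct.pack("<IIQI", create_site_pc, creating_tid,
                       tid_event_idx, object_type)
    return f"{fnv1a_64(data):016x}"

# Illustrative inputs: a File handle (0x06) created by tid 1 at event index 12.
sid = handle_semantic_id(0x82001234, 1, 12, 0x06)
```

Changing only `creating_tid` changes the result, which is why the same logical handle gets different SIDs when the two engines assign different guest tids.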

**Object type codes** (v1 — both engines agree):

| Code | Type |
|---|---|
| `0x00` | Unknown |
| `0x01` | Event |
| `0x02` | Mutant |
| `0x03` | Semaphore |
| `0x04` | Timer |
| `0x05` | Thread |
| `0x06` | File |
| `0x07` | IoCompletion |
| `0x08` | Module |
| `0x09` | EnumState |
| `0x0A` | Section |
| `0x0B` | Notification |

All subsequent events that reference a handle emit BOTH `handle_semantic_id` (the diff key) and `raw_handle_id` (engine-local, never compared).

## Event kinds (v1)

### `schema_version`
Header event. `payload = {"version": 1, "emitter_build": "<string>"}`.

### `thread.create`
Emitted by the **parent** thread at the kernel call that creates the new thread.
```json
"payload": {
  "handle_semantic_id": "0123456789abcdef",
  "parent_tid": 1,
  "entry_pc": "0x82001234",
  "ctx_ptr": "0xbce25340",
  "priority": 0,
  "affinity": 1,
  "stack_size": 65536,
  "suspended": false
}
```

### `thread.exit`
Emitted by the **exiting** thread (last event before tid disappears).
```json
"payload": {"exit_code": 0}
```

### `thread.suspend` / `thread.resume`
```json
"payload": {"target_tid": 13}
```

### `kernel.call`
Emit at handler entry, **before** any side effects.
```json
"payload": {
  "name": "NtCreateFile",
  "args": {"file_handle_ptr": "0x70000010", "desired_access": "0x80100080", "obj_attr_ptr": "0x70000020", ...},
  "args_resolved": {"path": "\\Device\\Cdrom0\\dat\\movie\\opening.bik"}
}
```
- Numeric args use `0x`-prefixed hex strings if pointer-typed; ints stay as ints.
- `args_resolved` is a best-effort dereference (strings, struct dumps, buffer summaries). Optional.

### `kernel.return`
Emit at handler exit, **after** all side effects committed.
```json
"payload": {
  "name": "NtCreateFile",
  "return_value": 0,
  "status": "0x00000000",
  "side_effects": [
    {"kind": "handle.create", "handle_semantic_id": "...", "object_type": 6, "raw_handle_id": "0x40"}
  ]
}
```
The `side_effects` array MAY duplicate events also emitted as standalone (`handle.create`). The diff tool treats both as authoritative; duplicates do not cause divergence.

### `handle.create`
For host-side creates not tied to a kernel call (rare).
```json
"payload": {
  "handle_semantic_id": "0123456789abcdef",
  "object_type": 1,
  "object_name": null,
  "raw_handle_id": "0xf8000048"
}
```

### `handle.destroy`
```json
"payload": {
  "handle_semantic_id": "0123456789abcdef",
  "raw_handle_id": "0xf8000048",
  "prior_refcount": 1
}
```

### `wait.begin`
```json
"payload": {
  "handles_semantic_ids": ["0123...", "abcd..."],
  "timeout_ns": -1,
  "alertable": false,
  "wait_type": "any"
}
```
`timeout_ns = -1` means INFINITE. `wait_type` is `"any"` or `"all"`.

### `wait.end`
```json
"payload": {
  "status": "0x00000000",
  "woken_by_semantic_id": "0123456789abcdef",
  "wait_duration_cycles": 12345
}
```
`wait_duration_cycles` is `deterministic = false` (host scheduling affects it). `woken_by_semantic_id` is null on timeout / alerted.

### `mem.write`
**OPT-IN — gated by a separate cvar (`phase_a_event_log_mem_writes`, default false).** In Phase A this kind is reserved; emitters MAY ship a TODO stub. Schema:
```json
"payload": {
  "guest_addr": "0x82000000",
  "value": "0x12345678",
  "size": 4,
  "source": "guest_jit"
}
```

### `vfs.open` / `vfs.read` / `vfs.close`
File-IO events, separate from `kernel.call` so the diff tool can match on canonical path:
```json
"payload": {"canonical_path": "\\Device\\Cdrom0\\dat\\movie\\opening.bik", "raw_handle_id": "0x40", "handle_semantic_id": "..."}
```

### `import.call`
Emitted at the syscall dispatcher (ours) or the import-stub JIT trap (canary), one per imported function invocation, **before** the implementing `kernel.call`.
```json
"payload": {
  "module": "xboxkrnl.exe",
  "ord": 257,
  "name": "NtCreateFile"
}
```

## Diff-tool field-comparison rules

| Field | Rule |
|---|---|
| `engine` | skipped (always differs) |
| `host_ns` | skipped (host-clock) |
| `guest_cycle` | skipped (engines disagree on absolute count; diff uses `tid_event_idx`) |
| `raw_handle_id` | skipped (engines use different handle namespaces) |
| `handle_semantic_id` | **C+15-α: skipped** (engine-local — see below) |
| `handles_semantic_ids` (wait.begin) | **C+15-α: skipped** (same reason) |
| `parent_tid` (thread.create) | **C+15-α: skipped** (engine-local guest tids) |
| `ctx_ptr` (thread.create) | **C+22 v1.7: per-(tid, kind, field) ordinal sentinel** (`<HOSTHEAP_thread.create_ctx_ptr_N>`) — host-heap-derived VA, AUDIT-043 ε class |
| `woken_by_semantic_id` (wait.end) | **C+15-α: skipped** (engine-local SID) |
| `deterministic` (event-level field) | skipped (metadata) |
| Any payload field listed under a non-deterministic kind | skipped where flagged |
| All other payload fields | strict equality |
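
A rough sketch of how these rules could be applied when comparing two aligned events. It is deliberately simplified and is an assumption, not the diff tool's code: the per-kind skip lists are flattened into one set, the `ctx_ptr` sentinel rewrite is not handled, and nested entries such as `side_effects` are compared as-is. All names are hypothetical.

```python
# Payload fields the table marks as skipped (flattened across kinds here).
SKIPPED_PAYLOAD_FIELDS = {
    "raw_handle_id",
    "handle_semantic_id",
    "handles_semantic_ids",
    "parent_tid",
    "woken_by_semantic_id",
}

def comparable_view(event: dict) -> tuple:
    """Keep only what is compared strictly: the kind and non-skipped payload fields."""
    payload = {k: v for k, v in event["payload"].items()
               if k not in SKIPPED_PAYLOAD_FIELDS}
    return event["kind"], payload

def events_match(canary_event: dict, ours_event: dict) -> bool:
    # Top-level engine, host_ns, guest_cycle and deterministic are never consulted.
    return comparable_view(canary_event) == comparable_view(ours_event)
```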

### Phase C+15-α note on `handle_semantic_id`

The SID computation includes `creating_tid` as input, but guest TIDs differ
between engines (canary's tid=6 maps to ours's tid=1 on the main chain).
Both engines compute SIDs **using their own local tids**, so the same logical
handle gets two different SIDs across engines. The diff tool skip-compares
SID fields and relies on `tid_event_idx + object_type` for alignment.

A future schema v2 could canonicalize SIDs via the diff tool's tid map and
restore strict comparison. For v1.1 the simpler skip-policy suffices.

## Shared-global SIDs (v1.2 — added in Phase C+18)

A subset of guest kernel dispatcher objects (`KEVENT`, `KSEMAPHORE`,
`KTIMER`, `KMUTANT`) are **process-global**: they live in
statically-initialized or pre-allocated guest memory and are touched
by MULTIPLE guest threads during boot. Examples include the XAudio
voice-volume change-mask semaphore at `0x828a3230` in Sylpheed.

Canary's `XObject::GetNativeObject` (`src/xenia/kernel/xobject.cc:397-483`)
and ours's `ensure_dispatcher_object` (`crates/xenia-kernel/src/exports.rs:4363`)
**lazy-wrap** these dispatchers on **first guest-thread touch**: the
first `KeWait*` invocation that passes the raw kernel-object pointer
synthesizes the `XObject` wrapper, stamps the `X_DISPATCH_HEADER` with
the `kXObjSignature` marker (`'X','E','N','\0' = 0x58454E00`), stashes
the handle, and emits `handle.create`. Subsequent touches find the
marker and short-circuit without emit (per-pointer idempotent).

### The first-toucher race

**Which** guest thread wins the "first toucher" race is
**timing-dependent**:
- Canary and ours have different host schedulers, JIT throughput, and
  guest-thread bootstrap ordering.
- Even within the same engine across runs the first-toucher can
  differ — but each engine produces a deterministic per-run total
  ordering, so cold-vs-cold reproducibility holds.

The per-thread SID recipe `semantic_id(create_site_pc, creating_tid,
tid_event_idx_at_creation, object_type)` (v1) depends on BOTH
`creating_tid` and `tid_event_idx_at_creation`, so:
- Same dispatcher → DIFFERENT SIDs in each engine (race-dependent).
- `handle.create` for the same object lands on different per-tid
  streams in canary vs ours.

The C+17 fix made ours emit `handle.create` for these synthesized
shadows, but the C+17 D-NEW-3 regression on tid=15→10 was
exactly the first-toucher race: ours's tid=10 was the first toucher
locally; canary's tid=15 was NOT the first toucher in its run — some
other canary tid had already adopted `0x828a3230`. ours's tid=10
emitted an "extra" `handle.create` that canary's tid=15 lacked, and
the diff tool flagged a kind mismatch at idx=2.

### The C+18 fix: deterministic SID recipe

Process-global dispatchers use a **second** SID recipe that is
scheduling-invariant. Both engines now use:

```
SHARED_GLOBAL_SID_MARKER = 0xC01AB005  (fixed sentinel, both engines)

input_bytes =
    le_u32(SHARED_GLOBAL_SID_MARKER)   // 4 bytes — "create_site_pc" slot
  ‖ le_u32(0)                          // 4 bytes — "creating_tid" slot
  ‖ le_u64(pointer)                    // 8 bytes — "tid_event_idx" slot
  ‖ le_u32(object_type)                // 4 bytes

hash = FNV-1a-64(input_bytes)
shared_global_sid = format("{:016x}", hash)
```

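The marker `0xC01AB005` is outside any plausible guest-PC range
(PPC text 0x82000000-0x82FFFFFF; XEX header 0x3001xxxx; heap
0x4xxxxxxx), so a shared-global hash input can never equal the input of
a regular per-thread SID (which uses a real guest PC as
`create_site_pc`). The two recipes can therefore only produce the same
SID through a 64-bit FNV-1a hash collision.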

Both engines compute the SAME SID for the same dispatcher pointer
regardless of:
- which guest thread is the first toucher,
- the `tid_event_idx_at_creation`,
- the per-engine scheduling order.
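
A minimal Python sketch of the shared-global recipe, reusing the same FNV-1a routine as the per-thread recipe. The function names are hypothetical; the example pointer is the Sylpheed semaphore address mentioned earlier.

```python
import struct

SHARED_GLOBAL_SID_MARKER = 0xC01AB005
FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3

def fnv1a_64(data: bytes) -> int:
    h = FNV_OFFSET_BASIS
    for b in data:
        h = ((h ^ b) * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def shared_global_sid(pointer: int, object_type: int) -> str:
    # Marker fills the create_site_pc slot, 0 fills the creating_tid slot,
    # and the dispatcher pointer fills the tid_event_idx slot.
    data = struct.pack("<IIQI", SHARED_GLOBAL_SID_MARKER, 0, pointer, object_type)
    return f"{fnv1a_64(data):016x}"

# Semaphore (0x03) at guest address 0x828a3230.
sid = shared_global_sid(0x828A3230, 0x03)
```

No thread id or event index appears among the inputs, so the result depends only on the dispatcher pointer and the object type.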

### Which call sites use which recipe

| Call site                                              | SID recipe        |
|--------------------------------------------------------|-------------------|
| `KernelState::alloc_handle_for` (ours)                 | per-thread        |
| `ObjectTable::AddHandle` direct (canary)               | per-thread        |
| `ensure_dispatcher_object` (ours)                      | **shared-global** |
| `XObject::GetNativeObject` synthesized (canary)        | **shared-global** |

Regular per-thread `handle.create` events (file open, thread create,
named-event create, etc.) keep the v1 per-thread recipe. The
shared-global recipe is restricted to lazy-wrap synthesis.

### Diff tool: cross-tid floating `handle.create` matching

The diff tool pre-pass collects all shared-global SIDs in either
engine's stream. A `handle.create` event is detected as shared-global
by recomputing the deterministic SID from its `(raw_handle_id,
object_type)` payload and comparing against the event's
`handle_semantic_id`. Regular per-thread SIDs cannot match this check
by construction.
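
The recompute-and-compare check could look like the following sketch. The helper names are hypothetical, and it assumes `raw_handle_id` carries the dispatcher pointer as a `0x`-prefixed hex string.

```python
import struct

SHARED_GLOBAL_SID_MARKER = 0xC01AB005

def fnv1a_64(data: bytes) -> int:
    h = 0xCBF29CE484222325
    for b in data:
        h = ((h ^ b) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h

def shared_global_sid(pointer: int, object_type: int) -> str:
    data = struct.pack("<IIQI", SHARED_GLOBAL_SID_MARKER, 0, pointer, object_type)
    return f"{fnv1a_64(data):016x}"

def is_shared_global_create(event: dict) -> bool:
    """True if the event's SID equals the recipe output for its own payload."""
    if event["kind"] != "handle.create":
        return False
    payload = event["payload"]
    pointer = int(payload["raw_handle_id"], 16)  # accepts the "0x" prefix
    expected = shared_global_sid(pointer, payload["object_type"])
    return payload["handle_semantic_id"] == expected

def collect_recipe_sids(events) -> set:
    """Pre-pass: gather every SID that passes the recipe check."""
    return {e["payload"]["handle_semantic_id"]
            for e in events if is_shared_global_create(e)}
```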

When per-tid alignment finds a kind mismatch and one side has a
shared-global `handle.create` whose SID is in the floating set:
- The diff tool advances ONLY that side's stream pointer past the
  floating event.
- Re-compare at the same canonical position.
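
The absorb step can be sketched as a two-pointer walk over one tid pair. This is an illustrative simplification, not the diff tool's implementation: it compares only `kind` (the real comparison also checks payloads), and the function name is hypothetical.

```python
def walk_with_floating(canary, ours, floating_sids):
    """Two-pointer walk over one tid pair; returns (matched, skipped, divergence)."""

    def is_floating_create(event):
        return (event["kind"] == "handle.create"
                and event["payload"].get("handle_semantic_id") in floating_sids)

    ic = io = matched = 0
    skipped = {"canary": 0, "ours": 0}
    while ic < len(canary) and io < len(ours):
        c, o = canary[ic], ours[io]
        if c["kind"] == o["kind"]:
            matched += 1
            ic += 1
            io += 1
        elif is_floating_create(c):
            ic += 1                  # advance ONLY the canary pointer
            skipped["canary"] += 1
        elif is_floating_create(o):
            io += 1                  # advance ONLY the ours pointer
            skipped["ours"] += 1
        else:
            return matched, skipped, (ic, io)  # genuine divergence
    return matched, skipped, None
```

With an empty floating set the two absorb branches never fire, and the walk reports the first kind mismatch directly.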

The diff report's summary table shows a `floating_skipped (c/o)`
column for visibility — counts of absorbed events per side.

### Index relaxation

The C+18 fix relaxes the legacy diff-tool rule that requires
`canary.tid_event_idx == ours.tid_event_idx` for matching events.
With floating absorption, the per-tid indices can drift by 1 between
the two sides — but the `kind` and `payload` comparisons remain
strict. The raw indices are still preserved on the events themselves
(useful for debugging and report context).

### Backward compatibility

- Wire format unchanged. `schema_version` is still `1`.
- Pre-C+18 event logs (no shared-global SIDs in the stream) trigger
  the legacy code path automatically — the floating set is empty.
- The marker constant `0xC01AB005` MUST be exactly this value in both
  engines and the diff tool. Tests in both engines plus
  `tools/diff-events/test_diff_events.py` lock it in.

## Wait-begin floating absorb (v1.3 — added in Phase C+21)

### Motivation

Canary's `RtlEnterCriticalSection` (and its symmetric counterparts —
`KeWaitForSingleObject` invoked on a process-global dispatcher,
mutex/semaphore contended-acquire paths) emits `wait.begin` **only on
the contended slow path**. The fast path (uncontended atomic-CAS, or
recursive bump) emits NO `wait.begin` and only the `kernel.call` →
`kernel.return` pair. Which path is taken depends on whether ANOTHER
guest thread is currently holding the dispatcher when the wait is
attempted — i.e. it is **host-scheduler-driven**, varying across cold
runs of the same engine.

Reading-error class **#32** (documented in C+20's
`investigation.md`) captures this: cross-checking 3 fresh canary cold
runs at canary tid=6 idx 104,606 showed:
- jitter-1: `wait.begin sid=75ae880ec432eb36` (contended)
- jitter-2: `kernel.return` (fast — matches ours)
- jitter-3: offset-shifted wait.begin at a different idx with a
  different SID

The matched-prefix metric is unreliable inside such regions if the
diff tool treats wait.begin events as strictly positional.

### The fix

A `wait.begin` event is **floating** if at least one of its
`payload.handles_semantic_ids` references a shared-global SID
(see §"Shared-global SIDs"). During the per-tid two-pointer walk:

- If one side has a floating `wait.begin` and the other has a
  different kind at the same canonical position, advance ONLY the
  wait.begin side's pointer and re-compare.

`wait_type=all` waits are floating as long as ANY single handle in
the set is shared-global — the entire wait's blocking behavior is
timing-dependent if even one of its handles is on a process-global
dispatcher.

### Shared-global SID detection (extended in C+21)

The diff tool's `collect_shared_global_sids` pre-pass now unions
TWO sources:

1. **Recipe-matching `handle.create` events** (Phase C+18 — direct).
   This catches ours's `ensure_dispatcher_object` output where
   `raw_handle_id == ptr` (the recipe-input pointer).

2. **Cross-tid usage heuristic** (Phase C+21 — indirect). Any SID
   referenced via `handle.create` OR `wait.begin` on **two or more
   distinct guest tids** in EITHER engine is treated as shared-global.

The cross-tid heuristic exists because canary's
`EmitHandleCreateSharedGlobal` (`event_log.cc:435`) emits the SID
computed from the dispatcher VA but stashes
`object->handle()` (a handle-table slot in the `0xF8xxxxxx`
region) as `raw_handle_id`. Those two values DIFFER, so canary's
shared-global `handle.create` events are NOT recipe-recognizable
from their payload alone. Multi-tid SID usage is a robust
observational signal: per-thread SIDs by construction stay on the
single creating tid (their hash inputs include `creating_tid`),
so any cross-tid SID usage indicates a process-global dispatcher.
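
A sketch of the cross-tid usage heuristic for a single engine's stream; the result for both engines would be unioned with the recipe-matched set. The function name is hypothetical and the event dicts carry only the fields the heuristic reads.

```python
from collections import defaultdict

def cross_tid_sids(events) -> set:
    """SIDs referenced by handle.create or wait.begin on two or more distinct tids."""
    tids_by_sid = defaultdict(set)
    for event in events:
        payload = event["payload"]
        if event["kind"] == "handle.create":
            sids = [payload.get("handle_semantic_id")]
        elif event["kind"] == "wait.begin":
            sids = payload.get("handles_semantic_ids", [])
        else:
            continue
        for sid in sids:
            if sid:
                tids_by_sid[sid].add(event["tid"])
    return {sid for sid, tids in tids_by_sid.items() if len(tids) >= 2}

def shared_global_set(canary_events, ours_events, recipe_sids=frozenset()) -> set:
    # Union of the direct (recipe) source and the indirect (cross-tid) source.
    return set(recipe_sids) | cross_tid_sids(canary_events) | cross_tid_sids(ours_events)
```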

### Risk of over-absorption (and why it's bounded)

The cross-tid heuristic could in principle mis-classify a per-thread
SID that one thread creates and another thread waits on — a
legitimate cross-thread synchronization pattern. The floating-absorb,
however, only fires on a **kind mismatch** at the canonical position.
Per-thread waits that match strictly on both sides advance normally
without any absorb. The heuristic only loosens alignment when one
side is missing a `handle.create` or `wait.begin` — exactly the
scheduling-jitter window the C+21 fix targets.

### Diff-tool report changes

The summary table's `floating_skipped (c/o)` column is split into
two columns:

- `floating_create (c/o)` — C+18 `handle.create` absorptions.
- `floating_wait (c/o)` — C+21 `wait.begin` absorptions.

Both columns are per-side and observation-only; counts may legitimately
be non-zero in a clean run.

### Backward compatibility

- Wire format unchanged. `schema_version` is still `1`.
- Pre-C+21 event logs (no `wait.begin` events that reference
  shared-global SIDs) trigger no new behavior — the wait absorption
  branches are inert.
- The C+18 floating-create logic is unchanged; the C+21 fix is
  strictly additive.
- Engine source is UNCHANGED in C+21 — the fix is in the diff tool
  only.

## contention.observed (v1.4 — added in Phase D Stage 1, 2026-05-18)

### Motivation

The matched-prefix cap at 104,607 events arises because canary's tid=6
contends on a critical section (CS) while ours's tid=1 fast-paths through
the same call (Phase C+22). Schedules diverge for
host-OS reasons, so neither engine is "wrong" — but matched-prefix
stalls. Phase D's H' approach makes ours's `rtl_enter_critical_section`
*replay* canary's contention by consulting a per-call manifest built
from canary's contention trace.

Stage 1 (this section) introduces the canary-side **emitter** for that
manifest: a new event kind `contention.observed` that fires from
`RtlEnterCriticalSection_entry` (`xboxkrnl_rtl.cc:596-633`) just before
the call falls through to `xeKeWaitForSingleObject` after spin-loop
exhaustion. Cvar-gated (`kernel_emit_contention`, default false) so
default canary behavior is byte-identical.

### Event shape

```json
{
  "schema_version": 1,
  "engine": "canary",
  "kind": "contention.observed",
  "tid": <guest tid of caller>,
  "tid_event_idx": <per-tid ordinal — consumes one slot>,
  "guest_cycle": 0,
  "host_ns": <emit timestamp>,
  "deterministic": true,
  "payload": {
    "cs_ptr": "0xHHHHHHHH",        // guest VA of the RTL_CRITICAL_SECTION
    "site_sid": "HHHHHHHHHHHHHHHH", // shared-global SID (see below)
    "contended": true              // always true at v1.4 (uncontended is implicit)
  }
}
```

`site_sid` is computed via the **C+18 shared-global SID recipe**:

```
site_sid = FNV-1a-64 over
  ( kSharedGlobalSidMarker [u32 LE]    // 0xC01AB005
  , 0 [u32 LE]                          // creating_tid (unused)
  , cs_ptr as u64 [u64 LE]              // pointer-as-idx
  , kObjCriticalSection [u32 LE]        // 0x0C, new in v1.4
  )
```

Both engines compute the same SID for the same CS pointer. The
object-type code `kObjCriticalSection = 0x0C` is the new ObjectType value
introduced for this kind; it does NOT correspond to a real XObject
(a CS lives as a guest-memory struct, not a handle-tabled object).

### When emitted (canary)

In `RtlEnterCriticalSection_entry`:

1. Recursive-lock fast path (already own lock) → **NO emit** (not contention).
2. Spin-loop succeeds (`atomic_cas` flips `lock_count` from -1 → 0) → **NO emit** (fast acquire).
3. Spin-loop exhausted **AND** `atomic_inc(&cs->lock_count) != 0` → **EMIT** with `contended=true`, then `xeKeWaitForSingleObject`.
4. Spin-loop exhausted **AND** `atomic_inc(...) == 0` (CS became free between spin and inc) → **NO emit** (we won the race after spin).

The emit point sits **between** the non-zero `atomic_inc` result and the
`xeKeWaitForSingleObject` call, so the new event always precedes the
existing `wait.begin` event in the per-tid ordinal.
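
The four cases reduce to a small decision function. The sketch below is a Python restatement for illustration only (the canary code is C++); the function and parameter names are hypothetical, and `inc_result` stands for the value returned by `atomic_inc(&cs->lock_count)` once the spin loop is exhausted.

```python
def should_emit_contention(cvar_enabled: bool, already_owner: bool,
                           spin_acquired: bool, inc_result: int) -> bool:
    """Decide whether contention.observed fires for one RtlEnterCriticalSection call."""
    if not cvar_enabled:
        return False       # kernel_emit_contention is off: never emit
    if already_owner:
        return False       # case 1: recursive lock, not contention
    if spin_acquired:
        return False       # case 2: spin loop won the lock
    return inc_result != 0  # case 3 emits; case 4 (== 0) does not
```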

### When emitted (ours, Stage 3 — pending)

Stage 3 will add a symmetric emit in `rtl_enter_critical_section`
(`xenia-rs/crates/xenia-kernel/src/exports.rs:2886-2946`) at the
forced-park branch driven by the manifest. This keeps per-tid ordinals
aligned across engines after replay.

### Diff-tool treatment (Stage 4 — pending)

`contention.observed` will be added to `ENGINE_LOCAL_KINDS` in
`diff_events.py`: the per-tid pointer advances past these events on
either side without comparison. This keeps matched-prefix counts
unchanged when ONE side emits the event (Stage 1's canary-only world)
or when BOTH emit at the same ordinal (Stage 3's parity world).

### Cvar default + byte-identity

`kernel_emit_contention=false` by default. With cvar=false, the helper
`phase_a::EmitContentionObserved` short-circuits at the cvar check
before any `IsEnabled()` lookup. The pre-Stage-1 canary code path is
preserved byte-for-byte; cvar-OFF cold runs produce zero
`contention.observed` events (validated on the Stage 1 cold run:
0 occurrences in a 4.4 GB / 18.6 M event trace).

## Nested-CS-cleanup absorber (v1.5 — added in Phase D D-extension, 2026-05-18)

### Status
**Band-aid.** Explicit annotation: this absorber CROSSES the reading-error
#23 boundary in spirit. It folds real guest control-flow divergence at
the diff-tool layer. It exists because the underlying root cause —
producer-throughput divergence under the cooperative-vs-preemptive
scheduling mismatch (see Phase D forensics) — is **explicitly out of
scope** for the H' plan: fixing it in ours's engine would require
preempting the cooperative scheduler, which invalidates 23 phases of
digest stability. The absorber is the practical compromise.

### Trigger shape

The absorber fires ONLY at a kind mismatch of:
- canary[ic] = `import.call` with `payload.name == "RtlEnterCriticalSection"`
- ours[io]   = `import.call` with `payload.name == "RtlLeaveCriticalSection"`

For any other kind mismatch, the absorber is silent. This narrowness is
intentional: real engine divergences appear in other shapes and must
still surface.

### Behavior

When the trigger pattern matches, canary's stream is scanned for one or
more balanced `[Enter-block, Leave-block]` pairs immediately following
the trigger position:

- An Enter-block is 3 consecutive events:
  `import.call RtlEnterCriticalSection → kernel.call RtlEnterCriticalSection → kernel.return RtlEnterCriticalSection`.
- A Leave-block is 3 consecutive events with `RtlLeaveCriticalSection`.

The absorber consumes pairs greedily up to a cap of
`_NESTED_CS_PAIR_CAP = 32` pairs (empirically, Sylpheed's worst-case is
~10-15 pairs at the 104,607 cap). After consuming each pair, it checks whether canary's next
event has the SAME `kind` AND same `payload.name` as ours[io]. The first
convergence wins; canary's pointer is advanced past the absorbed pairs.

If no convergence is found within the cap, the absorber returns None
and the divergence falls through to normal reporting.
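
The scan described above can be sketched as follows. This is an illustrative reconstruction from the prose, not the code in `diff_events.py`: the function names are hypothetical, and it assumes the first Enter-block starts at the trigger event `canary[ic]` itself.

```python
NESTED_CS_PAIR_CAP = 32
BLOCK_KINDS = ("import.call", "kernel.call", "kernel.return")
ENTER = "RtlEnterCriticalSection"
LEAVE = "RtlLeaveCriticalSection"

def _is_block(events, start, name):
    """True if events[start:start+3] is import.call, kernel.call, kernel.return for name."""
    block = events[start:start + 3]
    return len(block) == 3 and all(
        ev["kind"] == kind and ev["payload"].get("name") == name
        for ev, kind in zip(block, BLOCK_KINDS))

def absorb_nested_cs(canary, ic, ours, io):
    """Return the canary index to resume at, or None if the absorber stays silent."""
    c, o = canary[ic], ours[io]
    if not (c["kind"] == "import.call" and o["kind"] == "import.call"
            and c["payload"].get("name") == ENTER
            and o["payload"].get("name") == LEAVE):
        return None  # not the trigger shape
    pos = ic
    for _ in range(NESTED_CS_PAIR_CAP):
        if not (_is_block(canary, pos, ENTER) and _is_block(canary, pos + 3, LEAVE)):
            return None  # unbalanced pair: fall through to normal reporting
        pos += 6
        if pos < len(canary):
            nxt = canary[pos]
            if (nxt["kind"] == o["kind"]
                    and nxt["payload"].get("name") == o["payload"].get("name")):
                return pos  # first convergence wins
    return None  # cap reached without convergence
```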

### Why this is safe (within #23's spirit)

1. The absorption only happens when canary's stream re-aligns with
   ours's stream past the nested block. If it doesn't re-align, the
   real divergence is reported.
2. The nested-block shape matches a specific PPC pattern: the consumer
   thread in canary acquires a CS, calls a helper that iterates a
   tree/registry, takes the nested-CS-enter path for each item, and
   releases the outer CS. Ours's tree is shorter so it skips this.
   The net effect on guest state is bounded: ours has fewer items
   processed in this iteration, but the EVENT stream past the
   absorption resumes the same logical operation.
3. The Phase B `image_loaded_sha256` is the foundational invariant.
   It's unaffected by this absorber (no engine source change).

### Why this is NOT safe in the general sense

- Diverging downstream state IS lost: ours's tree has fewer entries
  than canary's after the absorbed block. Subsequent ours operations
  that touch the tree will behave differently. Other absorbers / fixes
  will be needed if those state-differences manifest later.
- A future engine bug that produces a spuriously nested Enter+Leave
  pair could be falsely absorbed. Mitigation: the absorber requires
  canary's post-block stream to re-align with ours's; spurious nested
  pairs without re-alignment fall through to normal divergence
  reporting.

### Empirical result (Sylpheed 104,607 cap)

Pre-absorber (post-Stage-3+4): main matched-prefix = 104,607 (cap).
Post-absorber: main matched-prefix = **105,046 (+439 events)**.

The next divergence is at idx 105,046 on `VdInitializeEngines.return_value`
(canary=1, ours=0) — an unrelated engine bug in the video subsystem,
NOT a recurrence of the cap pattern. Sister chains preserved
(11/32/4/41/16).

### Tests

Three unit tests in `test_diff_events.py`:
- `test_nested_cs_cleanup_block_absorbed_when_convergent` — folds one nested pair
- `test_nested_cs_cleanup_NOT_absorbed_when_followup_diverges` — confirms re-alignment requirement
- `test_nested_cs_cleanup_NOT_absorbed_when_canary_has_no_followup` — negative case

## sema.release (v1.6 — added in AUDIT-069 Session 6, 2026-05-21)

### Motivation

AUDIT-069 Sessions 1-5 established that ours under-produces semaphore
releases by ~80% on the work-semaphore vs canary (`99 vs 414` in S5,
refined in S6 to `83 vs 414` apples-to-apples on the work semaphore
alone). The measurement infrastructure was a one-off cvar
(`audit_70_semaphore_release_watch`, hand-built per-handle log lines)
plus an ours-side `--lr-trace` capture at the wrapper-entry PC. Future
AUDIT-070+ sessions and any general regression triage need this metric
to be diff-visible without bespoke cvars per investigation.

`sema.release` lifts the AUDIT-070 cvar's signal into the Phase A
schema as a **symmetric** event kind in both engines.

### Event shape

```json
{
  "schema_version": 1,
  "engine": "canary",
  "kind": "sema.release",
  "tid": <guest tid of caller>,
  "tid_event_idx": <per-tid ordinal — consumes one slot>,
  "guest_cycle": <PPC timebase>,
  "host_ns": <emit timestamp>,
  "deterministic": true,
  "payload": {
    "handle_semantic_id": "HHHHHHHHHHHHHHHH",  // shared-global SID for the work-sem
    "raw_handle_id": "0xHHHHHHHH",             // engine-local
    "release_count": 1,                         // games typically release 1
    "previous_count": 0,                        // semaphore count BEFORE release
    "caller_pc": "0xHHHHHHHH"                   // guest LR at release time
  }
}
```

### SID recipe

The work-semaphore in Sylpheed (canary handle `0xF800003C`, ours
handle `0x1044`) is a **process-global dispatcher** in the C+18 sense:
it lives in pre-allocated guest memory and is touched by multiple
guest threads (main, worker, cache-thread, other producers). Its
`handle_semantic_id` SHOULD use the **shared-global recipe**
(`ComputeSharedGlobalSemanticId(dispatcher_ptr, kObjSemaphore=0x03)`)
so canary and ours produce the same SID for the same guest dispatcher.

Per-thread semaphores (rare in Sylpheed) MAY use the v1 per-thread
recipe; the diff tool does NOT compare SIDs for `sema.release` (the
kind is engine-local positionally — see below).

### Why engine-local

Per AUDIT-069 H3 and S6's first-N=20 measurement, the cadence and
ordinal interleaving of releases between the worker, main, and
cache-thread are **timing-dependent**: the first 20 releases match
perfectly across engines, but worker tid diverges at canary ord=83
when the cache-thread's first release fires (which ours never
emits because ours's cache-thread wedges at `sub_821CB030+0x1AC`).
Strict positional alignment would always trip on this known
divergence.

`sema.release` is therefore in `ENGINE_LOCAL_KINDS` in the diff tool
(alongside `contention.observed`): both engines emit, but the diff
tool advances past these events on either side without alignment.
The **count** is surfaced in the report's "Counted engine-local
kinds" summary table (per-tid + total per engine) so cadence
regressions are diff-visible at-a-glance.
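
One way to picture the skip-and-count behavior is a pre-filter over each engine's stream. This is a sketch under assumptions: the real tool may skip these events inline during the walk, and the membership of `COUNTED_ENGINE_LOCAL_KINDS` beyond `sema.release` is not stated here, so only that kind is listed. The function name is hypothetical.

```python
from collections import Counter

ENGINE_LOCAL_KINDS = {"contention.observed", "sema.release"}
COUNTED_ENGINE_LOCAL_KINDS = {"sema.release"}  # assumed membership

def strip_engine_local(events):
    """Drop engine-local events from alignment and count the counted kinds per tid."""
    kept = []
    counts = Counter()
    for event in events:
        kind = event["kind"]
        if kind in ENGINE_LOCAL_KINDS:
            if kind in COUNTED_ENGINE_LOCAL_KINDS:
                counts[(event["tid"], kind)] += 1
            continue  # never compared positionally
        kept.append(event)
    return kept, counts
```

Summing the per-tid counts for each engine gives the totals shown in the report, so a cadence gap such as `83 vs 414` stays visible even though the events themselves are skipped.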

### Emit points (planned, NOT yet wired)

- **Canary**: extend `audit_70_semaphore_release_watch` to call
  `phase_a::EmitSemaRelease(handle, count, prev_count)` from
  `NtReleaseSemaphore_entry` + `xeKeReleaseSemaphore`. Cvar gating
  remains the existing `audit_70_semaphore_release_watch` (or a new
  `phase_a_event_log_sema_releases=false` for finer control).
- **Ours**: emit `sema.release` from `nt_release_semaphore` in
  `crates/xenia-kernel/src/exports.rs` and from
  `KSemaphore::release` (kernel-mode equivalent). Default-off via a
  runtime flag; default cold runs must remain digest-stable.

Both engines MUST emit at handler entry (not wrapper-internal) so the
event count corresponds 1:1 to guest `NtReleaseSemaphore` invocations,
matching the canary cvar's existing semantics.

### Status

- **Diff tool**: support landed (this session, v1.6). `sema.release`
  in `ENGINE_LOCAL_KINDS` + `COUNTED_ENGINE_LOCAL_KINDS`; counts
  surfaced in report summary; 3 new tests in `test_diff_events.py`.
- **Canary emit**: NOT YET WIRED. Planned for AUDIT-070+ when the
  root cause investigation requires it. Existing cvar
  `audit_70_semaphore_release_watch` continues to emit non-schema
  log lines (used by S5/S6 captures).
- **Ours emit**: NOT YET WIRED. See above.

### Backward compatibility

- Wire format unchanged. `schema_version` is still `1`.
- Pre-v1.6 event logs (no `sema.release` events) trigger no new
  behavior — the engine-local skip branches are inert; the
  "Counted engine-local kinds" report section is suppressed when
  no counted-kind events exist.
- Diff tool changes are purely additive: existing engine binaries
  diff identically pre- and post-v1.6.

## Host-heap payload-field canonicalization (v1.7 — added in Phase C+22, 2026-05-26)

### Motivation

C+2 (`ALLOCATOR_RETURN_FNS`) canonicalizes `kernel.return.return_value`
for a known set of host-allocator-returning exports
(`MmAllocatePhysicalMemoryEx`, `RtlAllocateHeap`, …). That covers
the case where the allocated VA appears as the function's *return*
value. But the same allocator-drift class (AUDIT-043 ε:
canary's BC physical heap `0xBCxxxxxx` vs ours's unified user heap
`0x4xxxxxxx`) ALSO surfaces inside **typed event payloads** of
non-allocator exports — most notably the `thread.create.ctx_ptr`
field, which holds the address of the host-allocated TLS/context block that
`ExCreateThread` passes to the new guest thread in r3.

Empirical surface (C+22 cold-vs-cold idx 105,128 on the Sylpheed
audio-stack worker `ExCreateThread(entry_pc=0x824cd458)`):

| field | canary | ours |
|---|---|---|
| `ctx_ptr` | `0xbe56bb3c` (BC physical heap) | `0x42453b3c` (unified user heap) |
| `entry_pc` | `0x824cd458` | `0x824cd458` (bit-identical — game code) |
| `priority` | `0` | `0` |
| `affinity` | `4` | `4` |
| `stack_size` | `32768` | `32768` |
| `suspended` | `false` | `false` |

The C+2 `ALLOCATOR_RETURN_FNS` mechanism doesn't help here because
`ExCreateThread`'s return value is the new thread's *handle*
(canary's `0xF8xxxxxx` vs ours's `0x4, 0x8, …`), which is already
covered by the `handle_semantic_id` skip-policy. The host-heap-allocated
context block is instead a side-channel field inside the
`thread.create` event payload.

### The fix

`HOST_HEAP_PAYLOAD_FIELDS_BY_KIND` maps event kind → tuple of
payload field names. Each listed field's value (expected to be a
`0x`-prefixed hex string) is rewritten to a per-(tid, kind, field)
ordinal sentinel `<HOSTHEAP_<KIND>_<FIELD>_<ORDINAL>>` BEFORE
payload comparison. The mechanism mirrors
`canonicalize_allocator_returns` exactly, restricted to typed
payload fields.

Initial set (v1.7):

```python
HOST_HEAP_PAYLOAD_FIELDS_BY_KIND = {
    "thread.create": ("ctx_ptr",),
}
```
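
A minimal sketch of the rewrite pass described above, under stated assumptions: the function name `canonicalize_host_heap_fields`, the exact spelling of the `<KIND>` and `<FIELD>` parts of the sentinel (uppercased, dots replaced by underscores), and the copy-instead-of-mutate behavior are illustrative choices, not taken from the diff tool's source.

```python
import copy

HOST_HEAP_PAYLOAD_FIELDS_BY_KIND = {
    "thread.create": ("ctx_ptr",),
}


def canonicalize_host_heap_fields(events):
    """Return a copy of `events` with listed payload fields replaced by ordinal sentinels."""
    next_ordinal = {}  # (tid, kind, field) -> next unused ordinal
    out = []
    for ev in events:
        ev = copy.deepcopy(ev)  # leave the caller's events unmodified
        payload = ev.get("payload", {})
        for field in HOST_HEAP_PAYLOAD_FIELDS_BY_KIND.get(ev["kind"], ()):
            value = payload.get(field)
            if not isinstance(value, str):
                continue  # non-string or missing: untouched, no ordinal consumed
            key = (ev["tid"], ev["kind"], field)
            ordinal = next_ordinal.get(key, 0)
            next_ordinal[key] = ordinal + 1
            # Assumed sentinel spelling, e.g. <HOSTHEAP_THREAD_CREATE_CTX_PTR_0>.
            tag = ev["kind"].replace(".", "_").upper()
            payload[field] = f"<HOSTHEAP_{tag}_{field.upper()}_{ordinal}>"
        out.append(ev)
    return out
```

Applied to the two `thread.create` events from the table in "Motivation", both `ctx_ptr` values collapse to the same ordinal-0 sentinel while `entry_pc` and the other strict fields pass through unchanged.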

### Strict-field preservation

For each canonicalized event kind, the **strict** fields (game-visible
attributes that MUST match across engines) are untouched. For
`thread.create` these are:

- `entry_pc` — guest VA of the new thread's entry function, bit-
  identical in both engines because both engines load the same XEX
  and the entry comes from guest code.
- `priority`, `affinity`, `stack_size`, `suspended` — game-visible
  thread attributes the guest passes to `ExCreateThread`.

Skip-policy fields (`handle_semantic_id`, `parent_tid`) continue
to be skipped via `SKIP_PAYLOAD_FIELDS_BY_KIND` (unchanged from
C+15-α — see "Diff-tool field-comparison rules" above).

### Why `parent_tid` does NOT need new canonicalization

Per the C+15-α skip-policy table, `parent_tid` is already in
`SKIP_PAYLOAD_FIELDS_BY_KIND["thread.create"]`. The diff tool
pairs guest TIDs at the chain level (`--tid-map` or
`auto_tid_map`), and the per-event `parent_tid` is engine-local
(canary tid=6 vs ours tid=1 for the same logical "main thread"
chain). Skipping is sufficient — no ordinal sentinel needed.

Could a future schema v2 canonicalize `parent_tid` via the tid
map? Yes, but a mismatch would then surface as a *map gap* rather
than as the clearer per-tid alignment failure that is already
visible at chain boundaries. The v1.x skip-policy is the
simpler choice; tests pin the existing behavior so it doesn't
regress.
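
To make the skip-policy concrete, here is a hedged sketch of a payload comparison that ignores skipped fields. The map name and the two skipped `thread.create` fields come from this document; the helper `payload_mismatches` and the example `handle_semantic_id` values are hypothetical, and the real map may list more kinds.

```python
# Assumed minimal contents; only the thread.create entry is described in this document.
SKIP_PAYLOAD_FIELDS_BY_KIND = {
    "thread.create": ("handle_semantic_id", "parent_tid"),
}

_MISSING = object()  # distinguishes an absent field from an explicit None


def payload_mismatches(kind, canary_payload, ours_payload):
    """Return the sorted names of non-skipped fields whose values differ."""
    skipped = set(SKIP_PAYLOAD_FIELDS_BY_KIND.get(kind, ()))
    fields = (set(canary_payload) | set(ours_payload)) - skipped
    return sorted(
        f for f in fields
        if canary_payload.get(f, _MISSING) != ours_payload.get(f, _MISSING)
    )


# parent_tid differs (canary tid=6 vs ours tid=1) but is skipped, so no mismatch.
canary = {"entry_pc": "0x824cd458", "priority": 0, "parent_tid": 6}
ours = {"entry_pc": "0x824cd458", "priority": 0, "parent_tid": 1}
assert payload_mismatches("thread.create", canary, ours) == []
```

A strict field such as `priority` is still compared, so changing it on one side makes the helper report it.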

### Ordinal-count contract

As with `ALLOCATOR_RETURN_FNS`: if one engine emits MORE
`thread.create` events on a given tid than the other, ordinals
drift and the next typed event surfaces a divergence against
whatever the other side has at that position. Ordinal-count
mismatch IS a behavioral divergence — the canonicalization
preserves divergence detection, only collapsing
host-allocator-VA noise.
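
The drift can be seen in a small sketch that compares two already-canonicalized streams for one tid position by position. The tuple representation, the helper `first_divergence`, and the sentinel spelling are assumptions for illustration; the real tool's alignment logic may differ.

```python
def first_divergence(canary_events, ours_events):
    """Return the first per-tid position where the streams differ, else None."""
    for idx, (c, o) in enumerate(zip(canary_events, ours_events)):
        if c != o:
            return idx
    if len(canary_events) != len(ours_events):
        return min(len(canary_events), len(ours_events))
    return None


# Canary emits one extra thread.create on this tid (events as (kind, payload)).
canary = [
    ("thread.create", {"ctx_ptr": "<HOSTHEAP_THREAD_CREATE_CTX_PTR_0>"}),
    ("thread.create", {"ctx_ptr": "<HOSTHEAP_THREAD_CREATE_CTX_PTR_1>"}),
    ("wait.begin", {}),
]
ours = [
    ("thread.create", {"ctx_ptr": "<HOSTHEAP_THREAD_CREATE_CTX_PTR_0>"}),
    ("wait.begin", {}),
]
divergence_idx = first_divergence(canary, ours)
```

Here the extra event is not hidden by canonicalization: position 1 holds an ordinal-1 `thread.create` on one side and a `wait.begin` on the other.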

### Defensive value handling

If `ctx_ptr` is not a string (`None`, an int, or missing, as in
pre-C+22 event logs whose emitter omits the field), the canonicalizer
leaves it untouched and does NOT consume an ordinal. The next
string-typed value takes the next unused ordinal (0 if no string
value preceded it). This keeps pre-v1.7 logs diffable without
forcing an emitter retrofit.
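
The ordinal bookkeeping alone can be sketched as follows, for a single (tid, kind, field) key. The helper name `assign_ordinals` and the `MISSING` marker are hypothetical; the sketch only models which values consume an ordinal.

```python
MISSING = object()  # stands in for a payload that omits the field entirely


def assign_ordinals(values):
    """Map each raw field value to its ordinal, or None when it is left untouched."""
    ordinals = []
    next_ordinal = 0
    for value in values:
        if isinstance(value, str):
            ordinals.append(next_ordinal)
            next_ordinal += 1
        else:
            ordinals.append(None)  # non-string: untouched, no ordinal consumed
    return ordinals


result = assign_ordinals([None, 7, MISSING, "0xbe56bb3c", "0x42453b3c"])
```

The three non-string values are skipped, so the first string-typed value still receives ordinal 0.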

### Backward compatibility

- Wire format unchanged. `schema_version` is still `1`.
- Pre-C+22 event logs whose `thread.create.ctx_ptr` happens to
  bit-match (e.g. static-allocator addresses like `0x828F3D08`
  that BOTH engines use for the pre-XEX kernel-state ctxs)
  still match strictly via the ordinal sentinel — they get the
  same ordinal in both engines.
- The `--no-canonicalize-host-heap-fields` CLI flag disables the
  pass (reverts to raw-VA comparison), mirroring the existing
  `--no-canonicalize-allocators`. Used by gate tests and
  investigation reruns.
- Engine source is UNCHANGED in C+22 — the fix is in the diff
  tool only.
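
One way the two opt-out flags could be declared is sketched below with `argparse`. Only the flag names come from this document; the parser layout, `dest` names, and help strings are assumptions, and the real diff tool may wire them differently.

```python
import argparse


def build_parser():
    """Build a parser exposing both canonicalization opt-outs (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="event-log diff (sketch)")
    parser.add_argument(
        "--no-canonicalize-host-heap-fields",
        dest="canonicalize_host_heap_fields",
        action="store_false",  # the pass is on unless the flag is given
        help="compare host-heap payload fields as raw VAs",
    )
    parser.add_argument(
        "--no-canonicalize-allocators",
        dest="canonicalize_allocators",
        action="store_false",
        help="compare allocator return values as raw VAs",
    )
    return parser


default_args = build_parser().parse_args([])
raw_args = build_parser().parse_args(["--no-canonicalize-host-heap-fields"])
```

Using `store_false` keeps both passes enabled by default, so existing invocations are unaffected and only gate tests or investigation reruns need to pass a flag.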

### Extension shape

The map shape `kind -> (field, …)` is intentionally minimal:
each entry is one event kind plus the fields on it that hold
host-heap VAs. Future entries could include e.g.
`thread.create.tls_ptr` (if such a field is added to the schema)
or a hypothetical `vfs.mmap.host_ptr`. Strict-field policy
remains: any field NOT listed here is compared bit-identically.

## Forward compatibility

Phase A's original schema-v1 declared 13 sections (16 distinct kind strings);
Phase A wired 4 of them. Phase C+15-α wired an additional 5 (`handle.create`,
`handle.destroy`, `thread.create`, `thread.exit`, `wait.begin`). The remaining
seven (`wait.end`, `thread.suspend`, `thread.resume`, `mem.write`, `vfs.open`,
`vfs.read`, `vfs.close`) are declared but unwired; wiring them is an additive
change at schema v1.1+.

A future schema v2 may break wire format (e.g. canonical SIDs, structured args).
Both engines pin `schema_version = 1` in this phase; the diff tool refuses to
mix v1 and v2 inputs.
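
The version gate can be sketched as a check on the first line of each input. The `kind` and `payload.version` layout follows the `schema_version` event shown under "Wire format"; the helper names `header_version` and `check_inputs` and the error messages are hypothetical.

```python
import json

SUPPORTED_VERSION = 1


def header_version(first_line):
    """Parse the first JSONL line and return its declared schema version."""
    event = json.loads(first_line)
    if event.get("kind") != "schema_version":
        raise ValueError("first event is not schema_version")
    return event["payload"]["version"]


def check_inputs(canary_first_line, ours_first_line):
    """Refuse mixed or unsupported schema versions; return the shared version."""
    canary_version = header_version(canary_first_line)
    ours_version = header_version(ours_first_line)
    if canary_version != ours_version:
        raise ValueError(f"mixed schema versions: {canary_version} vs {ours_version}")
    if canary_version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported schema version: {canary_version}")
    return canary_version
```

Checking the header before reading any further events means a v2 file is rejected up front instead of producing misleading payload mismatches.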