# Step 2: Natural install-trigger sequence and ours divergence point

**Date:** 2026-05-21
**Mode:** PLAN-only (investigation; no engine LOC changes).
**Sources:** `canary-jitter-1.jsonl` (4.4 GB, 18.7M events) and
`phase-w-wedge-reattack/ours-postfix.jsonl` (28 MB, 121,569 events).

## TL;DR

The Step 2 plan's framing, "identify the canary tid=6 kernel-call sequence in the install window [9.4s, 9.6s]", **cannot be applied because ours never advances past host_ns ≈ 1.73s.** Ours's tid=1 wedges roughly 8 seconds before the install epoch.

The reframed question, "what canary-tid=6 sequence between the matched-prefix wedge point and the install epoch fails in ours?", resolves to a **single root cause one level upstream of the wedge**:

> Canary's spawned cache-loader worker (canary tid=17, entry
> `0x821748F0`) executes ~4140 events and calls `ExTerminateThread`
> at host_ns = 2.092s, taking 154ms. Ours's analog (ours tid=13)
> executes 435 events, **never reaches its second wait iteration**,
> and wedges at its FIRST `NtWaitForSingleObjectEx` (no signaler ever
> fires). **Ours's tid=13 takes a different guest-code branch from
> the first wait onward: it calls `NtReleaseSemaphore` instead of
> `NtSetEvent` between `NtCreateEvent` and `NtWaitForSingleObjectEx`,
> so the event it then waits on is unsignaled.**

This is a **branch divergence inside guest code `sub_821CB030`'s body**, NOT a missing kernel call in ours and, on current evidence, not a wrong return value from ours's kernel (return values remain unverified; see Cause attribution).

## Step 0 outcome: install epoch reachable on canary, not on ours

| Source | Observed | Install-epoch coverage |
|---|---|---|
| canary tid=6 events in [9.0s..11.0s] | 16,175 kernel.calls captured | install epoch + worker-spawn covered ✓ |
| ours tid=1 events | 1.728s (last event before wedge) | install epoch is at ~9.5s, **8s in the future** |

Ours physically cannot reach 9.4s; tid=1 blocks on tid=13's thread-handle at host_ns=1.728s, and all other tids subsequently block too (see `phase-w-wedge-reattack/halt-on-deadlock-dump.txt`).
Therefore the canary "kernel-call sequence ours doesn't make in the install window" question is degenerate: ours makes **none** of canary's 16,175 calls in that window because ours stops emitting at host_ns=1.73s.

The substantive Step 2 question reframes to: **"What does canary do between matched-prefix idx ~108,476 (= ours's last events) and the install epoch?"**

Answer: it RUNS the worker tid=17 to completion, which causes the join-wait on the main thread (canary tid=6, the analog of ours tid=1) to return, after which tid=6 iterates `sub_822F1AA8`'s main loop further and eventually triggers `sub_824FD240` and `sub_825070F0`. Everything hinges on tid=17 completing.

## Step 1 outcome: canary tid=6 spawns sub_821748F0 at host_ns=1.935s

Exact anchor:

```
canary tid=6  host_ns=1935433700  idx=108476
  ExCreateThread(entry=0x821748f0, ctx=0xbc365620, stack=524288, susp=T)
  → handle.create raw=0xf80000a0 hsid=3bd922fbb385c2c9

canary tid=6  host_ns=1937223600  idx=108498
  NtResumeThread
  NtWaitForSingleObjectEx handles=[3bd922fbb385c2c9] timeout=-1
  → wait.begin

canary tid=6  host_ns=2092000000  idx=108499  (155 ms later)
  kernel.return NtWaitForSingleObjectEx rv=0 status=0x00000000
```

The wait IS infinite (timeout_ns=-1), yet it returns in 155ms because the worker terminates (canary tid=17's last call is `ExTerminateThread` at host_ns=2.0918s).

Ours's mirror:

```
ours tid=1  host_ns=1727479660  idx=108481
  ExCreateThread(entry=0x821748f0, ctx=0x4024d640, stack=0, susp=T)
  → handle.create raw=0x000012c8 hsid=8a25e09a8a739c1b

ours tid=1  host_ns=1727611893  idx=108505
  wait.begin handles=[8a25e09a8a739c1b] timeout=-1

ours tid=1  host_ns=1727614433  idx=108506
  kernel.return NtWaitForSingleObjectEx rv=0
  ← but this is just the return record from the entry probe, NOT actual unblock
```

(Note: `ours-postfix.jsonl` schema emits the entry-probe `kernel.return` even on an infinite wait, because the probe wraps the wait wrapper.
Per `halt-on-deadlock-dump.txt`, tid=1 is in fact still `Blocked` on handle `0x000012c8` = Thread(id=13) at deadlock-detection time.)

The spawn parameters look identical in shape (same entry PC; ctx and stack are run-specific). **Spawn semantics match.**

## Step 2 outcome: canary tid=17 vs ours tid=13 kernel-call differential

Lifetimes:

| Metric | canary tid=17 | ours tid=13 |
|---|---|---|
| first event | host_ns=1.9378s | host_ns=1.7276s |
| last event | host_ns=2.0918s | host_ns=1.7307s |
| duration | **154 ms** | **3 ms** |
| total events | 4140 | 435 |
| kernel.call count | 1351 | 142 |
| terminates? | yes via `ExTerminateThread` | no (wedged on wait) |

Per-call differential (top entries by absolute difference; Δ = canary − ours):

| kernel.call | canary tid=17 | ours tid=13 | Δ |
|---|---:|---:|---:|
| RtlEnterCriticalSection | 607 | 58 | +549 |
| RtlLeaveCriticalSection | 607 | 58 | +549 |
| NtClose | 19 | 2 | +17 |
| NtCreateEvent | 18 | 3 | +15 |
| NtDuplicateObject | 16 | 2 | +14 |
| RtlInitAnsiString | 11 | 1 | +10 |
| NtWaitForSingleObjectEx | 11 | 2 | +9 |
| RtlInitializeCriticalSectionAndSpinCount | 15 | 6 | +9 |
| NtQueryFullAttributesFile | 9 | 1 | +8 |
| NtReleaseSemaphore | 9 | 1 | +8 |
| RtlNtStatusToDosError | 9 | 1 | +8 |
| NtSetEvent | 8 | 1 | +7 |
| KeTlsSetValue | 2 | 0 | +2 |
| NtCreateFile | 2 | 0 | +2 |
| ExCreateThread | 1 | 0 | +1 |
| ExTerminateThread | 1 | 0 | +1 |
| KeTlsGetValue | 1 | 0 | +1 |
| KeQueryPerformanceFrequency | 0 | 1 | -1 |

**Set-difference of unique kernel-call names**: every API that ours tid=13 calls is also called by canary tid=17, with one exception: `KeQueryPerformanceFrequency`, which canary calls outside this window. **No kernel API is missing from ours's implementation that canary uses.** All of these APIs already work in ours (they are called successfully on tid=5, tid=1, or tid=10 elsewhere in the same run). The differential isn't "ours fails to implement a kernel call"; it's "ours executes roughly 10× fewer iterations of the same loop body" (142 vs 1351 kernel calls).
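A per-call differential like the one above can be computed with a few lines of Python. The sketch below is illustrative only: the record fields `tid`, `kind`, and `name` are assumed names for the JSONL schema (this report does not spell the schema out), and the inline sample records are made up so the example is self-contained.

```python
import json
from collections import Counter


def call_counts(jsonl_lines, tid):
    """Count kernel.call records per API name for one thread id."""
    counts = Counter()
    for line in jsonl_lines:
        rec = json.loads(line)
        # Assumed schema: {"tid": int, "kind": str, "name": str}
        if rec.get("tid") == tid and rec.get("kind") == "kernel.call":
            counts[rec["name"]] += 1
    return counts


def differential(canary, ours):
    """Return (name, canary_n, ours_n, delta) rows, largest |delta| first."""
    names = set(canary) | set(ours)
    rows = [(n, canary[n], ours[n], canary[n] - ours[n]) for n in names]
    return sorted(rows, key=lambda r: (-abs(r[3]), r[0]))


# Made-up sample records; real input would be streamed from the two trace files.
canary_sample = [
    '{"tid": 17, "kind": "kernel.call", "name": "NtSetEvent"}',
    '{"tid": 17, "kind": "kernel.call", "name": "NtSetEvent"}',
    '{"tid": 17, "kind": "kernel.call", "name": "NtReleaseSemaphore"}',
    '{"tid": 17, "kind": "kernel.return", "name": "NtSetEvent"}',
    '{"tid": 6, "kind": "kernel.call", "name": "NtClose"}',
]
ours_sample = [
    '{"tid": 13, "kind": "kernel.call", "name": "NtReleaseSemaphore"}',
    '{"tid": 13, "kind": "kernel.call", "name": "KeQueryPerformanceFrequency"}',
]

rows = differential(call_counts(canary_sample, 17), call_counts(ours_sample, 13))
# Names that only the "ours" thread calls (the set-difference check).
only_ours = sorted(name for name, c, o, _ in rows if c == 0 and o > 0)
for row in rows:
    print(row)
```

Sorting by absolute delta with the name as a tie-breaker keeps the output deterministic, which matters when two runs of the script are themselves diffed.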
## The control-flow divergence (the root cause)

Canary tid=17, idx 339-356, the FIRST wait pattern:

```
idx=339  NtCreateEvent
idx=340  handle.create raw=0xf80000b8 hsid=1070523eb111c6ea object_type=1 (Event)
idx=343  NtDuplicateObject → handle.create at idx=344
idx=347  NtSetEvent                  ← THE EVENT IS SIGNALED BEFORE THE WAIT
idx=350  NtClose → handle.destroy at idx=351
idx=354  NtWaitForSingleObjectEx
idx=355  wait.begin handles=[1070523eb111c6ea] timeout=-1
idx=356  kernel.return rv=0          ← wait completes in 23µs because event was signaled
```

Ours tid=13, idx 175-434, the analog wait pattern:

```
idx=175  NtCreateEvent
idx=177  handle.create raw=0x000012d0 hsid=d5e23609d3948568 object_type=1 (Event)
         … 240 RtlEnterCriticalSection / RtlLeaveCriticalSection ops in between …
idx=419  NtDuplicateObject → handle.create at idx=420
idx=429  NtReleaseSemaphore          ← DIFFERENT API (semaphore, not event-set)
idx=432  NtWaitForSingleObjectEx
idx=433  wait.begin handles=[d5e23609d3948568] timeout=-1
idx=434  kernel.return rv=0 (entry probe only; actual wait blocks forever)
⏸ WEDGE: event d5e23609d3948568 is never signaled.
```

The key observation: **between `NtCreateEvent` and the corresponding `NtWaitForSingleObjectEx`**, canary calls `NtSetEvent` to signal the very event it is about to wait on (idiomatic self-signaled wait-pump barrier). Ours **skips the NtSetEvent**, calls `NtReleaseSemaphore` instead, and then blocks on the unsignaled event.

This is a **guest-code branch divergence** inside the helper hierarchy `sub_821CB030 → sub_821CBA08 → sub_821CC3F8 → sub_821C4EB0` (per `sub_82173990.md` chain). The branch predicate is some state read between `NtCreateEvent` and the call site of `NtSetEvent` / `NtReleaseSemaphore`.

## Step 3/4: Why does the predicate differ between engines?

The deep root: this exact divergence pattern is what AUDIT-069 S5 already found through a different lens:

> **AUDIT-069 S5**: "Other producers: canary 25 vs ours 1."
> Canary has 24 additional thread sources releasing the work semaphore that
> ours doesn't have.

Combining S5 with this Step 2 finding:

1. Ours's tid=13 emits ONLY 1 NtReleaseSemaphore before wedging (consistent with the 1 "other producer" S5 measured).
2. Canary's tid=17 emits 9 NtReleaseSemaphore + 8 NtSetEvent before reaching ExTerminateThread. Each release/set comes from a different cache-load iteration.
3. The iteration count is gated by the loop body completing each iteration. Each iteration begins by waiting on an event that must be PRE-SIGNALED to advance. In canary, the event gets pre-signaled (NtSetEvent before NtWait). In ours, the same code path takes the "release semaphore + wait on event signaled by external" branch instead of the "set event + wait on event" branch.

**The state read by the predicate at the branch differs.** What state? Without disassembling `sub_821CB030`/`sub_821CBA08` and binding the branch PC to the guest memory location the predicate reads, we cannot say definitively. Candidate state sources:

- A bit/flag in the ctx (`0x4024d640` in ours vs `0xbc365620` in canary; different addresses but same shape). Could be uninitialized in ours due to ANON_Class vtable install at `sub_824FD240+0x24` not having fired (AUDIT-068 S4). But that vtable install fires much later (host_ns=9.4s in canary), so this is unlikely.
- The result of a prior `NtQueryFullAttributesFile` call. Canary tid=17 calls this 9× before reaching ExTerminateThread; ours tid=13 calls it 1× before wedging. The file being queried is in the `cache:\` filesystem (per `sub_82173990.md` chain).
- A guest-memory shared CS-protected pointer set by another tid (canary tids 4/10/14 do 38+90+38 signal events in the [1.9..2.1s] window; in ours, tids 4/5/14 are STILL working in [0..1.73s] but their output is shifted to ours's tid=5, which per AUDIT-069 S5 matches canary's tid=10 producer count almost exactly, at 90 NtReleaseSemaphore each).
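The difference between the two branches described in item 3 can be modelled in a few lines. This is a toy model using Python's `threading` primitives, not guest code: the function name, the `takes_set_event_branch` flag (standing in for the unknown predicate), and the finite timeout are all assumptions made for illustration.

```python
import threading


def run_wait_iteration(takes_set_event_branch: bool, timeout_s: float = 0.05) -> bool:
    """Model one loop iteration; return True if the wait was satisfied."""
    event = threading.Event()          # stands in for NtCreateEvent
    work_sem = threading.Semaphore(0)  # stands in for the work semaphore

    if takes_set_event_branch:
        event.set()                    # NtSetEvent path: event is pre-signaled
    else:
        work_sem.release()             # NtReleaseSemaphore path: event stays unsignaled

    # The traced wait uses timeout=-1 (infinite). A finite timeout is used here
    # only so the blocked case can be observed instead of hanging the script.
    return event.wait(timeout=timeout_s)


canary_like = run_wait_iteration(takes_set_event_branch=True)
ours_like = run_wait_iteration(takes_set_event_branch=False)
print(canary_like, ours_like)
```

In the model, the semaphore branch can only make progress if some other thread sets the event, which matches the "no signaler ever fires" observation for ours tid=13.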
## Cause attribution

Per the Step 5 framework:

1. **Missing ours implementation?** NO. Every kernel API canary tid=17 calls is also implemented in ours and works (verified by other tids using them successfully).
2. **Incorrect return value in ours?** UNLIKELY but unverified. Phase A schema doesn't capture args/return values for most calls; `args_resolved={}` is empty for nearly every call in this window.
3. **Missing side effect in ours?** POSSIBLY. If `NtQueryFullAttributesFile` or `NtCreateFile` on `cache:\...` behaves slightly differently in ours (e.g., succeeds when canary fails, or vice versa), the resulting branch could diverge.
4. **Upstream state divergence (most likely)**: a guest-memory value read by a predicate inside `sub_821CB030`/`sub_821CBA08` differs between engines. The CS-heavy stretch earlier in this tid (the ~240 enter/leave ops between idx 177 and idx 423) processes some data structure, the result of which selects the branch.

**Best single guess (MEDIUM confidence)**: a `NtQueryFullAttributesFile` on a `cache:\` path returns a different value in ours than in canary (file present vs not, size mismatch, or attrib mismatch). The branch chooses "we need to recompute the cache item" (NtReleaseSemaphore path) instead of "cache item is ready, signal event and proceed" (NtSetEvent path).

## Disjoint-gap count

**ONE gap**: the predicate divergence inside `sub_821CB030`'s body.

However, the predicate divergence likely has a **complex upstream cause** that involves either filesystem state or guest-memory state initialized by another tid that ALSO has the same kind of subtle drift. So:

- **disjoint divergence sites in this trajectory**: 1 (control-flow branch in sub_821CB030 chain).
- **disjoint hypothesized causes**: 2-3 (file attribute return value, shared-memory state from tid=10/5 dispatch worker, or an upstream vtable-install bypass).

This is **NOT** the "50+ disjoint missing kernel patterns" failure mode predicted in tripstone 7.
It's a single branch divergence with multiple candidate first-causes. A methodology pivot to Option C (critical-path sweep) is **NOT** indicated; a targeted iterate per candidate first-cause IS indicated.

## Recommended next concrete action

**Iterate plan, ordered by minimum LOC + maximum signal**:

### Iterate Step 2.A: branch-probe inside sub_821CB030 body (~50-80 LOC ours + ~50 LOC canary)

Use existing `audit_61_branch_probe_pcs` to pin the divergent branch inside `sub_821CB030` / `sub_821CBA08` / `sub_821CC3F8`. Specifically probe every `bne`/`beq` PC inside these guest fns that has reachable `bl NtSetEvent` on one branch and `bl NtReleaseSemaphore` on the other. Use sylpheed.db cross-references to enumerate `bl 0x824AA2F0` (NtSetEvent wrapper) and `bl 0x824AB158` (NtReleaseSemaphore wrapper) call sites in these fns. Capture both engines, diff branch-counts. The first divergent branch is the answer.

### Iterate Step 2.B: args/return-value capture for the 9 NtQueryFullAttributesFile calls on canary tid=17 (~30 LOC canary)

Extend `audit_61` or write a dedicated probe to log `r3` at entry (filename buffer) and `r3` at return (the NTSTATUS; the PowerPC calling convention returns values in `r3`, not `r0`) for every `NtQueryFullAttributesFile` call inside this 154-ms window. Compare against ours's 1 call. If file-attribute return values differ on a shared file, that's the trigger.

### Iterate Step 2.C: guest-memory read-watch on the ctx struct (~20 LOC, reuses AUDIT-068 S3 read-probe)

Use `audit_68_host_mem_read_probe` to sample the worker ctx (`0xbc365620` in canary / `0x4024d640` in ours) at ~1ms cadence in the window [1.7..2.1s]. Identify whether a flag/byte in the ctx differs at the predicate-read time. This pinpoints the actual read location if Step 2.A's branch-probe doesn't immediately reveal the predicate source.

## Tripstones honored

- **#28**: verified canary's actual behavior by reading the jsonl directly; the AUDIT-069 S5 framing is corroborated, not assumed.
- **#32**: contention regions may jitter; the CS enter/leave counts are NOT identical between the two workers (canary tid=17: 607 enter/leave pairs; ours tid=13: 58). The differential here may include scheduling-determinism noise. Mitigation: cross-validate with a 2nd cold canary run if Step 2.A doesn't immediately converge.
- **#39**: matched-prefix did NOT drive this; first-draw progression is the goal.
- **#5 of plan tripstones**: AUDIT-069 S5 "25 producers" finding IS downstream of Step 2's identified branch divergence. The 25 producers correspond to canary tid=17's loop iterations that ours tid=13 doesn't reach.

## Cascade

- A (acquire canary install-epoch event log): ✓ HIGH (16,175 kernel calls captured cleanly in [9..11s] window).
- B (identify install-trigger sequence in canary): ✓ HIGH (canary tid=6 spawns sub_821748F0 at host_ns=1.935s, join-wait returns at 2.092s). The "install trigger" is not a single kernel call but the **completion of worker tid=17**, which causes the join wait to release tid=6 into the rest of the main-loop dispatch.
- C (identify where ours diverges from canary): ✓ HIGH (ours tid=13 wedges 3ms into its lifetime, vs canary tid=17 running 154ms; first kernel-call sequence divergence at the NtSetEvent vs NtReleaseSemaphore branch).
- D (attribute the divergence to a specific cause): MEDIUM (3 candidate root causes; need iterate 2.A/2.B/2.C to disambiguate).
- E (produce Δ-gap count + roadmap): ✓ HIGH (1 divergence site; 3 candidate first-causes; ~50-200 LOC iterate plan).

## Honest assessment

- The wedge framing established by AUDIT-049 through AUDIT-069 holds.
- Step 2 narrows the trigger from "the install epoch at 9.4s" down to "the worker tid=13's first wait at 1.73s", moving the point of interest about 7.7 s earlier in the timeline.
- The 25-producer finding from AUDIT-069 S5 IS a consequence of the Step 2 branch divergence: each missing iteration of canary tid=17's load loop is a missing "other producer" signal.
- The fix is NOT to mirror canary's kernel calls; ours implements them correctly. The fix is to find why ours's `sub_821CB030` predicate evaluates differently.
- Confidence that the fix is a single guest-state correction (file-attribute mismatch, ctx-field uninitialized, or shared-memory flag race): MEDIUM.

## Artifacts produced this session

All under `xenia-rs/audit-runs/review-a-step2-natural-trigger/`:

- `extract_canary_install_window.py`: scanner for canary in [9..11s].
- `extract_canary_tid6_pre_install.py`: scanner for tid=6 [1.5..11s].
- `extract_canary_worker_tid.py`: locates spawn worker by hsid.
- `extract_canary_tid17_full.py`: tid=17 timeline + diff vs ours tid=13.
- `extract_ours_tid1_full.py`: ours tid=1 timeline.
- `extract_ours_tid13_final.py`: ours tid=13 timeline.
- `find_signaler.py`: finds canary tid=17 wait signalers.
- `ours_signal_counts.py`: ours per-tid signal counts.
- `canary-tid6-install-window.csv`: 32,383 events.
- `canary-tid6-install-window.summary`: kernel.call frequencies.
- `canary-tid6-from-anchor.csv`: 139,202 events.
- `canary-tid17-worker-timeline.csv`: 4140 events.
- `ours-tid13-full-timeline.csv`: 435 events.
- `ours-tid1-final-150.csv`: last 150 events on ours tid=1.
- `ours-tid1-summary`: kernel.call frequencies.
- `canary-tid17-waits.csv`: 29 wait.begin events with handle binding.
- `differential-canary-tid17-vs-ours-tid13.txt`: full call-name diff.
- `step2-report.md`: this report.

**LOC delta in this session**: 0 to xenia-rs/canary engines; 0 to sylpheed.db; ~600 LOC analysis scripts under audit-runs/.
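For Iterate Step 2.A, the "first divergent branch" comparison could be added to the analysis scripts along these lines. This is a minimal sketch under stated assumptions: the output format of `audit_61_branch_probe_pcs` is not described in this report, so the `(pc, taken)` tuples, the function name `first_divergence`, and the placeholder PCs below are hypothetical.

```python
from itertools import zip_longest


def first_divergence(canary_trace, ours_trace):
    """Return (index, canary_entry, ours_entry) at the first mismatch, else None.

    Each trace is an ordered list of (branch_pc, taken) tuples for one thread.
    A trace that ends early counts as a divergence (its entry is None).
    """
    for i, (c, o) in enumerate(zip_longest(canary_trace, ours_trace)):
        if c != o:
            return i, c, o
    return None


# Placeholder PCs, not real addresses inside sub_821CB030.
canary_trace = [(0x1000, True), (0x1010, False), (0x1020, True)]
ours_trace = [(0x1000, True), (0x1010, False), (0x1020, False), (0x1030, True)]

result = first_divergence(canary_trace, ours_trace)
print(result)
```

Comparing ordered traces instead of aggregate branch counts has the advantage of locating the earliest mismatch directly, although it is more sensitive to the scheduling jitter noted under tripstone #32.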