
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-05 07:19:08 +02:00

Step 2 — Natural install-trigger sequence and ours divergence point

Date: 2026-05-21
Mode: PLAN-only (investigation; no engine LOC changes).
Sources: canary-jitter-1.jsonl (4.4 GB, 18.7M events) and phase-w-wedge-reattack/ours-postfix.jsonl (28 MB, 121,569 events).

TL;DR

The Step 2 plan's framing, "identify the canary tid=6 kernel-call sequence in the install window [9.4s, 9.6s]", cannot be applied because ours never advances past host_ns ≈ 1.73s. Ours's tid=1 wedges nearly 8 seconds before the install epoch. The reframed question, "what canary-tid=6 sequence between the matched-prefix wedge point and the install epoch fails in ours?", resolves to a single root cause one level upstream of the wedge:

Canary's spawned cache-loader worker (canary tid=17, entry 0x821748F0) executes 4140 events over 154ms and ends by calling ExTerminateThread at host_ns = 2.092s. Ours's analog (ours tid=13) executes 435 events, never reaches its second wait iteration, and wedges at its FIRST NtWaitForSingleObjectEx because no signaler ever fires. The two workers diverge just before that first wait: between NtCreateEvent and NtWaitForSingleObjectEx, ours's tid=13 calls NtReleaseSemaphore where canary calls NtSetEvent, so the event ours then waits on is unsignaled.

This is a branch divergence inside guest code sub_821CB030's body, NOT a missing kernel call in ours and NOT a wrong return value from ours's kernel.

Step 0 outcome — install epoch reachable on canary, not on ours

Source	Observed	Install epoch status
canary tid=6 events in [9.0s..11.0s]	16,175 kernel.calls captured	install epoch + worker-spawn covered ✓
ours tid=1 events	1.728s (last event before wedge)	not reached; install epoch is at ~9.5s, nearly 8s later

Ours physically cannot reach 9.4s: tid=1 blocks on tid=13's thread-handle at host_ns=1.728s, and all other tids subsequently block too (see phase-w-wedge-reattack/halt-on-deadlock-dump.txt). Therefore the canary "kernel-call sequence ours doesn't make in the install window" question is degenerate: ours makes none of canary's 16,175 calls in that window because ours stops emitting at host_ns=1.73s.
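The "stops emitting at host_ns=1.73s" claim can be rechecked with one streaming pass that records the last timestamp per thread. A minimal sketch, assuming each JSONL line is an object with integer `tid` and `host_ns` fields; those field names are assumptions and the real Phase A schema may differ. The sample timestamps for tid=1 are the anchors quoted in this report; the tid=13 value is rounded from its 1.7307s last event.

```python
import io
import json


def last_event_per_tid(lines):
    """Return {tid: largest host_ns seen} from an iterable of JSONL lines.

    Assumes every record has integer `tid` and `host_ns` keys
    (hypothetical field names, not confirmed against the real schema).
    """
    last = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        tid, ns = rec["tid"], rec["host_ns"]
        if ns > last.get(tid, -1):
            last[tid] = ns
    return last


# In-memory stand-in for a trace file (structure is illustrative only).
sample = io.StringIO(
    '{"tid": 1, "host_ns": 1727479660}\n'
    '{"tid": 1, "host_ns": 1727614433}\n'
    '{"tid": 13, "host_ns": 1730700000}\n'
)
last = last_event_per_tid(sample)

install_epoch_ns = 9_400_000_000  # lower bound of the install window
stalled = sorted(t for t, ns in last.items() if ns < install_epoch_ns)
```

Because the function only iterates over lines, an open file handle can be passed in place of `sample`, so the 4.4 GB canary trace would not need to fit in memory.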

The substantive Step 2 question reframes to: "What does canary do between matched-prefix idx ~108,476 (= ours's last events) and the install epoch?" Answer: it RUNS the worker tid=17 to completion, which causes the join-wait on the spawning thread (canary tid=6; ours's analog is tid=1) to return, after which tid=6 iterates sub_822F1AA8's main loop further and eventually triggers sub_824FD240 and sub_825070F0. Everything hinges on tid=17 completing.

Step 1 outcome — canary tid=6 spawns sub_821748F0 at host_ns=1.935s

Exact anchor:

canary tid=6  host_ns=1935433700  idx=108476
  ExCreateThread(entry=0x821748f0, ctx=0xbc365620, stack=524288, susp=T)
  → handle.create raw=0xf80000a0 hsid=3bd922fbb385c2c9
canary tid=6  host_ns=1937223600  idx=108498
  NtResumeThread
  NtWaitForSingleObjectEx handles=[3bd922fbb385c2c9] timeout=-1
  → wait.begin
canary tid=6  host_ns=2092000000  idx=108499  (155 ms later)
  kernel.return NtWaitForSingleObjectEx rv=0 status=0x00000000

The wait IS infinite (timeout_ns=-1) — yet it returns in 155ms because the worker terminates (canary tid=17's last call is ExTerminateThread at host_ns=2.0918s).
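The 155 ms figure follows directly from the two host_ns anchors above. A small worked check, using only values quoted in the trace excerpt:

```python
NS_PER_MS = 1_000_000

wait_begin_ns = 1_937_223_600   # canary tid=6, idx=108498 (wait.begin)
wait_return_ns = 2_092_000_000  # canary tid=6, idx=108499 (kernel.return)

# Elapsed wall time of the join-wait, in milliseconds.
wait_ms = (wait_return_ns - wait_begin_ns) / NS_PER_MS
```

The result is about 154.8 ms, which matches the worker's 154 ms lifetime and confirms the join-wait is released by the worker's termination rather than by a timeout.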

Ours's mirror:

ours tid=1  host_ns=1727479660  idx=108481
  ExCreateThread(entry=0x821748f0, ctx=0x4024d640, stack=0, susp=T)
  → handle.create raw=0x000012c8 hsid=8a25e09a8a739c1b
ours tid=1  host_ns=1727611893  idx=108505
  wait.begin handles=[8a25e09a8a739c1b] timeout=-1
ours tid=1  host_ns=1727614433  idx=108506
  kernel.return NtWaitForSingleObjectEx rv=0  ← but this is just the
  return record from the entry probe, NOT actual unblock

(Note: ours-postfix.jsonl schema emits the entry-probe kernel.return even on an infinite wait, because the probe wraps the wait wrapper. Per halt-on-deadlock-dump.txt, tid=1 is in fact still Blocked on handle 0x000012c8 = Thread(id=13) at deadlock-detection time.)

The spawn parameters look identical in shape (same entry PC; ctx and stack are run-specific). Spawn semantics match.

Step 2 outcome — canary tid=17 vs ours tid=13 kernel-call differential

Lifetimes:

	canary tid=17	ours tid=13
first event	host_ns=1.9378s	host_ns=1.7276s
last event	host_ns=2.0918s	host_ns=1.7307s
duration	154 ms	3 ms
total events	4140	435
kernel.call count	1351	142
terminates?	yes via ExTerminateThread	no (wedged on wait)

Per-call differential (top entries by |canary − ours|):

kernel.call	canary tid=17	ours tid=13	Δ
RtlEnterCriticalSection	607	58	+549
RtlLeaveCriticalSection	607	58	+549
NtClose	19	2	+17
NtCreateEvent	18	3	+15
NtDuplicateObject	16	2	+14
RtlInitAnsiString	11	1	+10
NtWaitForSingleObjectEx	11	2	+9
RtlInitializeCriticalSectionAndSpinCount	15	6	+9
NtQueryFullAttributesFile	9	1	+8
NtReleaseSemaphore	9	1	+8
RtlNtStatusToDosError	9	1	+8
NtSetEvent	8	1	+7
KeTlsSetValue	2	0	+2
NtCreateFile	2	0	+2
ExCreateThread	1	0	+1
ExTerminateThread	1	0	+1
KeTlsGetValue	1	0	+1
KeQueryPerformanceFrequency	0	1	-1

Set-difference of unique kernel-call names: every API ours tid=13 calls is also called by canary tid=17, with one exception: KeQueryPerformanceFrequency, which canary calls outside this window. No kernel API that canary uses here is missing from ours's implementation. All of these APIs already work in ours (they are called successfully on tid=5, tid=1, or tid=10 elsewhere in the same run).

The differential isn't "ours fails to implement a kernel call"; it's "ours executes roughly 10× fewer iterations of the same loop body" (142 vs 1351 kernel calls, 1 vs 9 NtQueryFullAttributesFile calls).
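The per-call table and the set-difference can be reproduced from per-thread call counts. A minimal sketch using `collections.Counter`, with the counts copied from the table above; the function name `call_differential` is illustrative and is not taken from the session's scripts.

```python
from collections import Counter

# Kernel-call counts per thread, copied from the differential table.
canary_tid17 = Counter({
    "RtlEnterCriticalSection": 607, "RtlLeaveCriticalSection": 607,
    "NtClose": 19, "NtCreateEvent": 18, "NtDuplicateObject": 16,
    "RtlInitAnsiString": 11, "NtWaitForSingleObjectEx": 11,
    "RtlInitializeCriticalSectionAndSpinCount": 15,
    "NtQueryFullAttributesFile": 9, "NtReleaseSemaphore": 9,
    "RtlNtStatusToDosError": 9, "NtSetEvent": 8, "KeTlsSetValue": 2,
    "NtCreateFile": 2, "ExCreateThread": 1, "ExTerminateThread": 1,
    "KeTlsGetValue": 1,
})
ours_tid13 = Counter({
    "RtlEnterCriticalSection": 58, "RtlLeaveCriticalSection": 58,
    "NtClose": 2, "NtCreateEvent": 3, "NtDuplicateObject": 2,
    "RtlInitAnsiString": 1, "NtWaitForSingleObjectEx": 2,
    "RtlInitializeCriticalSectionAndSpinCount": 6,
    "NtQueryFullAttributesFile": 1, "NtReleaseSemaphore": 1,
    "RtlNtStatusToDosError": 1, "NtSetEvent": 1,
    "KeQueryPerformanceFrequency": 1,
})


def call_differential(a, b):
    """Rows of (name, count_a, count_b, delta), largest |delta| first."""
    names = set(a) | set(b)
    rows = [(n, a[n], b[n], a[n] - b[n]) for n in names]
    return sorted(rows, key=lambda r: (-abs(r[3]), r[0]))


rows = call_differential(canary_tid17, ours_tid13)
only_ours = sorted(set(ours_tid13) - set(canary_tid17))
only_canary = sorted(set(canary_tid17) - set(ours_tid13))
```

`only_ours` contains just KeQueryPerformanceFrequency, while `only_canary` lists the calls ours never reaches before wedging. That supports the reading that ours stops early rather than lacking an API.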

The control-flow divergence (the root cause)

Canary tid=17, idx 339-356 — the FIRST wait pattern:

idx=339 NtCreateEvent
idx=340 handle.create raw=0xf80000b8 hsid=1070523eb111c6ea object_type=1 (Event)
idx=343 NtDuplicateObject  → handle.create at idx=344
idx=347 NtSetEvent             ← THE EVENT IS SIGNALED BEFORE THE WAIT
idx=350 NtClose                → handle.destroy at idx=351
idx=354 NtWaitForSingleObjectEx
idx=355 wait.begin handles=[1070523eb111c6ea] timeout=-1
idx=356 kernel.return rv=0     ← wait completes in 23µs because event was signaled

Ours tid=13, idx 175-434 — the analog wait pattern:

idx=175 NtCreateEvent
idx=177 handle.create raw=0x000012d0 hsid=d5e23609d3948568 object_type=1 (Event)
        … 240 RtlEnterCriticalSection / RtlLeaveCriticalSection ops in between …
idx=419 NtDuplicateObject  → handle.create at idx=420
idx=429 NtReleaseSemaphore     ← DIFFERENT API — semaphore, not event-set
idx=432 NtWaitForSingleObjectEx
idx=433 wait.begin handles=[d5e23609d3948568] timeout=-1
idx=434 kernel.return rv=0   (entry probe only; actual wait blocks forever)
        ⏸ WEDGE — event d5e23609d3948568 is never signaled.

The key observation: between NtCreateEvent and the corresponding NtWaitForSingleObjectEx, canary calls NtSetEvent to signal the very event it is about to wait on (idiomatic self-signaled wait-pump barrier). Ours skips the NtSetEvent, calls NtReleaseSemaphore instead, and then blocks on the unsignaled event.
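The pre-signal check can be expressed as a scan over one thread's ordered kernel-call names. A minimal sketch, assuming the calls are already extracted as a list of names; it matches by call name only, which is a simplification, since a stricter check would bind NtSetEvent to the waited event through its hsid. The two input lists are the call sequences from the listings above, with the critical-section calls omitted.

```python
def calls_between_create_and_wait(calls):
    """Return the call names between the first NtCreateEvent and the first
    NtWaitForSingleObjectEx that follows it, or None if no such pair exists."""
    try:
        start = calls.index("NtCreateEvent")
        end = calls.index("NtWaitForSingleObjectEx", start + 1)
    except ValueError:
        return None
    return calls[start + 1:end]


def is_presignaled(calls):
    """True if NtSetEvent appears between the create and the wait.

    Name-based heuristic only: it does not verify that NtSetEvent targets
    the same event object the thread later waits on.
    """
    between = calls_between_create_and_wait(calls)
    return between is not None and "NtSetEvent" in between


canary_tid17 = ["NtCreateEvent", "NtDuplicateObject", "NtSetEvent",
                "NtClose", "NtWaitForSingleObjectEx"]
ours_tid13 = ["NtCreateEvent", "NtDuplicateObject", "NtReleaseSemaphore",
              "NtWaitForSingleObjectEx"]

canary_ok = is_presignaled(canary_tid17)
ours_ok = is_presignaled(ours_tid13)
```

On these inputs canary is classified as pre-signaled and ours is not, which mirrors the wedge: ours enters an infinite wait on an event nobody has set.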

This is a guest-code branch divergence inside the helper hierarchy sub_821CB030 → sub_821CBA08 → sub_821CC3F8 → sub_821C4EB0 (per sub_82173990.md chain). The branch predicate is some state read between NtCreateEvent and the call site of NtSetEvent / NtReleaseSemaphore.

Step 3/4 — Why does the predicate differ between engines?

The deep root: this exact divergence pattern is what AUDIT-069 S5 already found through a different lens:

AUDIT-069 S5: "Other producers: canary 25 vs ours 1." Canary has 24 additional thread sources releasing the work semaphore that ours doesn't have.

Combining S5 with this Step 2 finding:

Ours's tid=13 emits ONLY 1 NtReleaseSemaphore before wedging (consistent with the 1 "other producer" S5 measured).
Canary's tid=17 emits 9 NtReleaseSemaphore + 8 NtSetEvent before reaching ExTerminateThread. Each release/set comes from a different cache-load iteration.
The iteration count is gated by the loop body completing each iteration. Each iteration begins by waiting on an event that must be PRE-SIGNALED to advance.

In canary, the event gets pre-signaled (NtSetEvent before NtWait). In ours, the same code path takes the "release semaphore + wait on event signaled by external" branch instead of the "set event + wait on event" branch. The state read by the predicate at the branch differs.

What state? Without disassembling sub_821CB030/sub_821CBA08 and binding the branch PC to the guest memory location the predicate reads, we cannot say definitively. Candidate state sources:

A bit/flag in the ctx (0x4024d640 in ours vs 0xbc365620 in canary — different addresses but same shape). Could be uninitialized in ours due to ANON_Class vtable install at sub_824FD240+0x24 not having fired (AUDIT-068 S4). But that vtable install fires much later (host_ns=9.4s in canary), so this is unlikely.
The result of a prior NtQueryFullAttributesFile call. Canary tid=17 calls this 9× before reaching ExTerminateThread; ours tid=13 calls it 1× before wedging. The file being queried is in the cache:\ filesystem (per sub_82173990.md chain).
A guest-memory shared CS-protected pointer set by another tid (canary tids 4/10/14 do 38+90+38 signal events in the [1.9..2.1s] window; in ours, tids 4/5/14 are STILL working in [0..1.73s] but their output is shifted to ours's tid=5, which per AUDIT-069 S5 matches canary's tid=10 producer count almost exactly — 90 NtReleaseSemaphore each).

Cause attribution

Per the Step 5 framework:

Missing ours implementation? NO. Every kernel API canary tid=17 calls is also implemented in ours and works (verified by other tids using them successfully).
Incorrect return value in ours? UNLIKELY but unverified. Phase A schema doesn't capture args/return values for most calls; args_resolved={} is empty for nearly every call in this window.
Missing side effect in ours? POSSIBLY. If NtQueryFullAttributesFile or NtCreateFile on cache:\<hash>\... has a slightly different behavior in ours (e.g., succeeds when canary fails, or vice-versa), the resulting branch could diverge.
Upstream state divergence (most likely): a guest-memory value read by a predicate inside sub_821CB030/sub_821CBA08 differs between engines. The earlier-in-this-tid CS-blob (240+ enter/leave ops between idx 177 and idx 423) processes some data structure, the result of which selects the branch.

Best single guess (MEDIUM confidence): an NtQueryFullAttributesFile on a cache:\<hash>\<filename> path returns a different value in ours than in canary (file present vs not, size mismatch, or attrib mismatch). The branch chooses "we need to recompute the cache item" (NtReleaseSemaphore path) instead of "cache item is ready, signal event and proceed" (NtSetEvent path).

Disjoint-gap count

ONE gap — the predicate divergence inside sub_821CB030's body. However, the predicate divergence likely has a complex upstream cause that involves either filesystem state or guest-memory state initialized by another tid that ALSO has the same kind of subtle drift. So:

disjoint divergence sites in this trajectory: 1 (control-flow branch in sub_821CB030 chain).
disjoint hypothesized causes: 2-3 (file attribute return value, shared-memory state from tid=10/5 dispatch worker, or vtable install bypass at upstream).

This is NOT the "50+ disjoint missing kernel patterns" failure mode predicted in tripstone 7. It's a single branch divergence with multiple candidate first-causes. Methodology pivot to Option C (critical-path sweep) is NOT indicated; targeted iterate per candidate first-cause IS indicated.

Recommended next concrete action

Iterate plan, ordered by minimum LOC + maximum signal:

Iterate Step 2.A — branch-probe inside sub_821CB030 body (~50-80 LOC ours + ~50 LOC canary)

Use existing audit_61_branch_probe_pcs to pin the divergent branch inside sub_821CB030 / sub_821CBA08 / sub_821CC3F8. Specifically probe every bne/beq PC inside these guest fns that has reachable bl NtSetEvent on one branch and bl NtReleaseSemaphore on the other. Use sylpheed.db cross-references to enumerate bl 0x824AA2F0 (NtSetEvent wrapper) and bl 0x824AB158 (NtReleaseSemaphore wrapper) call sites in these fns.

Capture both engines, diff branch-counts. The first divergent branch is the answer.
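One way to locate the first divergent branch, assuming the probe can emit an ordered per-thread list of (pc, taken) records. That output shape is an assumption about audit_61_branch_probe_pcs, and the PCs below are placeholders, not addresses from sylpheed.db.

```python
from itertools import zip_longest


def first_divergence(canary_branches, ours_branches):
    """Return (index, canary_record, ours_record) at the first position where
    the two ordered branch traces differ, or None if they are identical.

    A missing record (one trace is shorter) shows up as None.
    """
    pairs = zip_longest(canary_branches, ours_branches)
    for i, (c, o) in enumerate(pairs):
        if c != o:
            return i, c, o
    return None


# Hypothetical branch records: (guest PC, branch taken?).
canary = [(0x10, True), (0x20, False), (0x30, True)]
ours = [(0x10, True), (0x20, True)]

hit = first_divergence(canary, ours)
```

Here `hit` points at index 1, where the same PC is taken in one engine and not in the other. If aggregate branch counts are captured instead of an ordered trace, the same idea applies to per-PC counters, at the cost of losing the ordering that identifies the first divergence.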

Iterate Step 2.B — args/return-value capture for the 9 NtQueryFullAttributesFile calls on canary tid=17 (~30 LOC canary)

Extend audit_61 or write a dedicated probe to log, for every NtQueryFullAttributesFile call inside this 154-ms window, r3 on entry (the first argument, which leads to the filename) and r3 on return (the NTSTATUS; the PowerPC calling convention returns values in r3, not r0). Compare against ours's 1 call. If file-attribute return values differ on a shared file, that's the trigger.
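Once both engines log a path and a status per call, the comparison is a join on the path. A minimal sketch, assuming the probe output can be reduced to (path, NTSTATUS) pairs; the paths below are invented placeholders, while the two status codes are the standard STATUS_SUCCESS and STATUS_OBJECT_NAME_NOT_FOUND values.

```python
STATUS_SUCCESS = 0x00000000
STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034


def status_mismatches(canary_calls, ours_calls):
    """Return [(path, canary_status, ours_status)] for paths queried by both
    engines whose NTSTATUS differs. Only the first canary result per path
    is used as the reference."""
    reference = {}
    for path, status in canary_calls:
        reference.setdefault(path, status)
    return [
        (path, reference[path], status)
        for path, status in ours_calls
        if path in reference and reference[path] != status
    ]


# Hypothetical probe output; real paths would come from the captured args.
canary_calls = [
    ("cache:\\example\\a.bin", STATUS_SUCCESS),
    ("cache:\\example\\b.bin", STATUS_SUCCESS),
]
ours_calls = [("cache:\\example\\a.bin", STATUS_OBJECT_NAME_NOT_FOUND)]

mismatches = status_mismatches(canary_calls, ours_calls)
```

A non-empty result on a path both engines query would support the file-attribute hypothesis. An empty result would push attention toward the ctx or shared-memory candidates.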

Iterate Step 2.C — guest-memory read-watch on the ctx struct (~20 LOC, reuses AUDIT-068 S3 read-probe)

Use audit_68_host_mem_read_probe to sample the worker ctx (0xbc365620 in canary / 0x4024d640 in ours) at ~1ms cadence in the window [1.7..2.1s]. Identify whether a flag/byte in the ctx differs at the predicate-read time. This pinpoints the actual read location if Step 2.A's branch-probe doesn't immediately reveal the predicate source.
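Comparing ctx snapshots comes down to a byte-offset diff. A minimal sketch, assuming each sample is available as a raw `bytes` dump of the ctx struct taken at the same logical point in both engines; the dump format and the sample bytes are hypothetical. Embedded guest pointers are expected to differ, since the two engines place the ctx at different addresses, so those offsets would need to be filtered out by hand.

```python
def differing_offsets(canary_ctx: bytes, ours_ctx: bytes):
    """Return [(offset, canary_byte, ours_byte)] for every byte that differs
    within the common prefix of the two snapshots."""
    n = min(len(canary_ctx), len(ours_ctx))
    return [
        (off, canary_ctx[off], ours_ctx[off])
        for off in range(n)
        if canary_ctx[off] != ours_ctx[off]
    ]


# Hypothetical 8-byte snapshots; offset 5 models a flag set only in canary.
canary_snapshot = bytes([0x00, 0x00, 0x00, 0x10, 0x00, 0x01, 0x00, 0x00])
ours_snapshot = bytes([0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00])

diff = differing_offsets(canary_snapshot, ours_snapshot)
```

Running the diff on each pair of samples in the [1.7..2.1s] window would show when a differing byte first appears, which narrows down which write sets the predicate's input.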

Tripstones honored

#28: verified canary's actual behavior by reading the jsonl directly; the AUDIT-069 S5 framing is corroborated, not assumed.
#32: contention regions may jitter; the CS enter/leave counts differ sharply between engines (607 pairs on canary tid=17 vs 58 on ours tid=13), and part of that differential may be scheduling-determinism noise. Mitigation: cross-validate with 2nd cold canary run if Step 2.A doesn't immediately converge.
#39: matched-prefix did NOT drive this; first-draw progression is the goal.
#5 of plan tripstones: AUDIT-069 S5 "25 producers" finding IS downstream of Step 2's identified branch divergence. The 25 producers correspond to canary tid=17's loop iterations that ours tid=13 doesn't reach.

Cascade

A (acquire canary install-epoch event log): ✓ HIGH (16,175 kernel calls captured cleanly in [9..11s] window).
B (identify install-trigger sequence in canary): ✓ HIGH (canary tid=6 spawns sub_821748F0 at host_ns=1.935s, join-wait returns at 2.092s). The "install trigger" is not a single kernel call but the completion of worker tid=17, which causes the join wait to release tid=6 into the rest of the main-loop dispatch.
C (identify where ours diverges from canary): ✓ HIGH (ours tid=13 wedges 3ms into its lifetime, vs canary tid=17 running 154ms; first kernel-call sequence divergence at the NtSetEvent vs NtReleaseSemaphore branch).
D (attribute the divergence to a specific cause): MEDIUM (3 candidate root causes; need iterate 2.A/2.B/2.C to disambiguate).
E (produce Δ-gap count + roadmap): ✓ HIGH (1 divergence site; 3 candidate first-causes; ~50-200 LOC iterate plan).

Honest assessment

The wedge framing established by AUDIT-049 .. AUDIT-069 holds.
Step 2 moves the trigger from "the install epoch at 9.4s" back to "the worker tid=13's first wait at 1.73s", nearly 8 seconds earlier in the run and localized to a single wait call.
The 25-producer finding from AUDIT-069 S5 IS a consequence of the Step 2 branch divergence: each missing iteration of canary tid=17's load loop is a missing "other producer" signal.
The fix is NOT to mirror canary's kernel calls; ours implements them correctly. The fix is to find why ours's sub_821CB030 predicate evaluates differently.
Confidence that the fix is a single guest-state correction (file-attribute mismatch, ctx-field uninitialized, or shared-memory flag race): MEDIUM.

Artifacts produced this session

All under xenia-rs/audit-runs/review-a-step2-natural-trigger/:

extract_canary_install_window.py — scanner for canary in [9..11s].
extract_canary_tid6_pre_install.py — scanner for tid=6 [1.5..11s].
extract_canary_worker_tid.py — locates spawn worker by hsid.
extract_canary_tid17_full.py — tid=17 timeline + diff vs ours tid=13.
extract_ours_tid1_full.py — ours tid=1 timeline.
extract_ours_tid13_final.py — ours tid=13 timeline.
find_signaler.py — finds canary tid=17 wait signalers.
ours_signal_counts.py — ours per-tid signal counts.
canary-tid6-install-window.csv — 32,383 events.
canary-tid6-install-window.summary — kernel.call frequencies.
canary-tid6-from-anchor.csv — 139,202 events.
canary-tid17-worker-timeline.csv — 4140 events.
ours-tid13-full-timeline.csv — 435 events.
ours-tid1-final-150.csv — last 150 events on ours tid=1.
ours-tid1-summary — kernel.call frequencies.
canary-tid17-waits.csv — 29 wait.begin events with handle binding.
differential-canary-tid17-vs-ours-tid13.txt — full call-name diff.
step2-report.md — this report.

LOC delta in this session: 0 to xenia-rs/canary engines; 0 to sylpheed.db; ~600 LOC analysis scripts under audit-runs/.
