MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-05 07:19:08 +02:00


AUDIT-069 Session 4 — writer report v4

Date: 2026-05-20
xenia-rs HEAD: e6d43a23ac393004d2e5adf2f0395fd0b5e6448b (UNCHANGED from S1/S2/S3)
git diff HEAD | sha256sum: ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357 (UNCHANGED at start AND end of S4)
No ours source modifications. No canary instrumentation added. Canary audit_61_branch_probe_pcs cvar used (pre-existing from S1). Canary cache restored from /tmp/canary-cache-bak-audit-068.

Headline (HIGH confidence — direct per-iteration measurement)

S3's "producer-loop underrun" framing pointed in the right direction but mis-located the divergence. Neither engine ever takes the exit-branch in sub_82450A68 (PC=0x82450B50, the LR=epilog path). Both engines' dispatch threads stay in the loop indefinitely (no deadlock; just waiting).

The actual divergence is in the return value of the NtWaitForMultipleObjectsEx call at PC=0x82450B44:

Ours: r3 = 0x00000001 in 91/91 captures (100%) — semaphore acquired.
Canary: r3 = 0x00000102 in 3/4 captures (75%) — WAIT_TIMEOUT.

The two handles being waited on are:

handle[0] = NtCreateEvent at [r31+88] — the STOP event (signal → exit).
handle[1] = NtCreateSemaphore(InitialCount=0, MaximumCount=0x7FFFFFFF) at [r31+92] — the WORK semaphore (signal → process work).

Both created by sub_8244FF50 (spawn helper) BEFORE ExCreateThread. mem-watch confirms handle slots in ours: 0x104C (event) / 0x1050 (semaphore) at run-1; absolute IDs drift across runs but the slot layout is invariant.

This is NOT an exit-branch divergence, NOT loop-underrun in the literal sense — it is a SEMAPHORE-STATE divergence. In ours the work-semaphore count is non-zero at every wait entry (so the wait always returns immediately with success); in canary the count is zero at most wait entries (so the wait times out per the 16ms relative timeout).
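The semaphore-state reading above can be illustrated with a toy single-consumer model. This is a minimal sketch, not either engine's scheduler: `simulate_waits`, its parameters, and the release schedules used with it are all hypothetical, chosen only to show how release timing alone decides whether a wait returns 0x1 or 0x102.

```python
WAIT_OBJECT_1 = 0x1    # handle[1] (work semaphore) acquired
WAIT_TIMEOUT = 0x102   # relative timeout elapsed with no signal


def simulate_waits(release_times_ms, work_ms, n_waits, timeout_ms=16):
    """Return the r3 value of each wait in a toy one-consumer model.

    release_times_ms: sorted times (ms) at which a producer adds one token.
    work_ms: time the consumer spends between a wait returning and re-waiting.
    All values are illustrative assumptions, not measurements.
    """
    now = 0
    count = 0                      # semaphore count, InitialCount=0
    pending = list(release_times_ms)
    results = []
    for _ in range(n_waits):
        # Credit every release that landed before this wait entry.
        while pending and pending[0] <= now:
            pending.pop(0)
            count += 1
        if count > 0:
            # Non-zero count at wait entry: return immediately with success.
            count -= 1
            results.append(WAIT_OBJECT_1)
        elif pending and pending[0] <= now + timeout_ms:
            # A release arrives inside the window: wake then, consume the token.
            now = pending.pop(0)
            results.append(WAIT_OBJECT_1)
        else:
            # Nothing released within the window: timer wake.
            now += timeout_ms
            results.append(WAIT_TIMEOUT)
        now += work_ms
    return results
```

With a producer that releases every millisecond the model returns 0x1 on every wait (the pattern seen in ours); with only two releases 100 ms apart it returns 0x1 once and then 0x102 (the pattern seen in canary).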

Method (READ-ONLY, no source mod)

Disassembled sub_82450A68 body (80 instructions) via xenia-rs disasm --at 0x82450A68 -n 200. Saved to s4/sub_82450A68-disasm.txt.
Identified loop topology: prolog → wait-#1 → body (with inner search over 5-slot table at [r31+112..212]) → dispatch (bl 0x82450B68 → γ-signaler family) → re-wait → back-edge OR exit.
Ran ours-cold with --branch-probe= on 14 BB-entry PCs covering all loop-body paths. Captured 696 records over ~80s wallclock / 91 loop iterations.
Ran canary-cold (cache wiped → restored from /tmp/canary-cache-bak-audit-068) with same audit_61_branch_probe_pcs cvar set. Canary process faulted in vkd3d-proton at ~10s wallclock; captured 35 records / 4 loop iterations. Sufficient to surface the r3 distribution.
Used --mem-watch=0x828F3BC0,0x828F3BC4 to identify which ours handle IDs live in slots [r31+88] and [r31+92]. Then disassembled sub_8244FF50 to confirm event-vs-semaphore types via the import jumps (NtCreateEvent at 0x824A9F18, NtCreateSemaphore at 0x824AB0C0).
Cross-checked ours's kernel handlers (nt_wait_for_multiple_objects_ex, do_wait_multiple, handle_consume, nt_release_semaphore, try_release_semaphore, wake_eligible_waiters) — code looks correct in isolation; the divergence is NOT in these handlers directly.

Per-PC iteration counts

| PC | path | ours fires | canary fires | note |
|---|---|---|---|---|
| 0x82450AA4 | first-iter entry | 1 | 1 | both entered once |
| 0x82450AAC | back-edge target | 91 | 4 | canary crashed early |
| 0x82450AC0 | flag@212==0 → r4=5 | 2 | 0 | rare path |
| 0x82450AC8 | flag@212!=0 → search | 90 | 4 | dominant |
| 0x82450AE4 | inner-search continue | 72 | 17 | |
| 0x82450AF4 | search-exhausted | 8 | 3 | no candidate found |
| 0x82450AF8 | candidate-found | 82 | 1 | |
| 0x82450B04 | budget skip | 81 | 0 | |
| 0x82450B10 | budget refresh | 8 | 0 | |
| 0x82450B28 | dispatch entry | 74 | 1 | bl 0x82450B68 |
| 0x82450B34 | re-wait entry | 92 | 4 | |
| 0x82450B50 | EXIT (epilog) | 0 | 0 | never reached |

r3 at back-edge (the divergence signal)

| r3 at back-edge | ours | canary |
|---|---|---|
| r3=0x1 | 91/91 (100%) | 1/4 (25%) |
| r3=0x102 (TIMEOUT) | 0/91 (0%) | 3/4 (75%) |
| r3=0x0 (handle[0] signaled) | 0/91 | 0/4 |
| r3=other | 0/91 | 0/4 |

This is the per-iteration measurement the user's framing predicted. The matching iterations show different r3 values at the SAME PC. The "load feeding the predicate" is, however, NOT a guest-memory load — it is the kernel-side return of NtWaitForMultipleObjectsEx. The divergent KERNEL STATE is the work-semaphore count.
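The r3 tally at the back-edge can be reproduced with a few lines of post-processing. The sketch below assumes the probe records have already been reduced to `(pc, r3)` tuples; that record shape and the helper name `r3_distribution` are assumptions for illustration, not the actual --branch-probe output format.

```python
from collections import Counter

BACK_EDGE_PC = 0x82450AAC  # back-edge target of the wait loop


def r3_distribution(records, pc=BACK_EDGE_PC):
    """Map each r3 value seen at `pc` to (count, fraction of captures).

    records: iterable of (pc, r3) tuples (hypothetical, pre-parsed form).
    """
    hits = Counter(r3 for rec_pc, r3 in records if rec_pc == pc)
    total = sum(hits.values())
    if total == 0:
        return {}
    return {r3: (n, n / total) for r3, n in hits.items()}


# Records rebuilt from the counts reported above, not from the raw traces.
ours = [(BACK_EDGE_PC, 0x1)] * 91
canary = [(BACK_EDGE_PC, 0x1)] + [(BACK_EDGE_PC, 0x102)] * 3
```

Comparing `r3_distribution(ours)` with `r3_distribution(canary)` keeps the comparison at the same PC in both engines, which is the point of the per-iteration framing.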

Wait wrapper chain (disasm-derived)

sub_824AB240:
  li r7, 0          ; alertable = 0
  b 0x824AB190      ; tail-jump

sub_824AB190(r3=numObj, r4=&handles, r5=WaitMode, r6=Timeout(=16 ms), r7=Alertable):
  ...
  bl 0x824ACA88     ; converts r4=16 ms → LARGE_INTEGER -160000 (relative 100-ns ticks)
  ...
  bl 0x8284E08C     ; NtWaitForMultipleObjectsEx (ord 254, import @ VA 0x8284E08C)
  ; returns NTSTATUS in r3:
  ;   0      = WAIT_OBJECT_0   = handle[0] (stop event) signaled
  ;   1      = WAIT_OBJECT_0+1 = handle[1] (work semaphore) acquired (atomically decrements count by 1)
  ;   0x102  = WAIT_TIMEOUT    = 16 ms elapsed with no signal

sub_82450A68 branches on this:

cmplwi cr6, r3, 0; beq cr6, 0xB50 → r3 == 0 → EXIT (stop event signaled)
cmplwi cr6, r3, 0; bne cr6, 0xAAC → r3 != 0 (including 0x102) → CONTINUE
- r3 == 1 → at least one work-item is available → run the inner table search
- r3 == 0x102 → just a 16ms timer wake; inner search will likely find no candidate and the loop just re-waits

In canary's brief 4-iteration captured window, only iteration-0 had real work (r3=1); iterations 1-3 were timer-wakes (r3=0x102). In ours's 91-iteration window, all back-edges saw r3=1: someone has released the semaphore at least once between each consume.
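The branch logic described above reduces to a single comparison against zero, with the distinction between 0x1 and 0x102 only mattering inside the loop body. A minimal sketch, assuming only the three return values observed or documented in this report (function and constant names are illustrative):

```python
STATUS_WAIT_0 = 0x000    # handle[0] (stop event) signaled
STATUS_WAIT_1 = 0x001    # handle[1] (work semaphore) acquired
STATUS_TIMEOUT = 0x102   # 16 ms elapsed with no signal


def takes_exit_branch(r3):
    """Mirror of the guest test: only r3 == 0 leaves the loop."""
    return r3 == STATUS_WAIT_0


def describe_wake(r3):
    """Human-readable reason for a wait return, per the disasm notes."""
    if r3 == STATUS_WAIT_0:
        return "stop event signaled"
    if r3 == STATUS_WAIT_1:
        return "work available"
    if r3 == STATUS_TIMEOUT:
        return "timer wake"
    return "other"
```

Both 0x1 and 0x102 take the same back-edge, so a branch trace alone cannot separate them; the r3 value has to be captured alongside the PC.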

Handle slot identification (HIGH confidence)

Via --mem-watch=0x828F3BC0,0x828F3BC4:

MEM-WATCH addr=0x828f3bc0 old=0x00000000 new=0x0000104c
   store_addr=0x828f3bc0 store_len=4 tid=1 pc=0x8244ffb0 lr=0x8244ffb0
MEM-WATCH addr=0x828f3bc4 old=0x00000000 new=0x00001050
   store_addr=0x828f3bc4 store_len=4 tid=1 pc=0x8244ffcc lr=0x8244ffcc

Static disasm of writer PCs:

0x8244FFAC: bl 0x824A9F18    ; NtCreateEvent wrapper
0x8244FFB0: stw r3, 88(r30)  ; handle[0] = event = ours 0x104C
0x8244FFC8: bl 0x824AB0C0    ; NtCreateSemaphore wrapper (r4=0=Initial, r5=0x7FFFFFFF=Max)
0x8244FFCC: stw r3, 92(r30)  ; handle[1] = semaphore = ours 0x1050
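The MEM-WATCH records shown above are regular enough to parse mechanically. The following is a rough sketch that relies only on the line shape visible in those two records; the helper name and the consistency check are illustrative additions, not part of the tooling.

```python
import re

# Matches the first line of each record as printed in this report.
MEM_WATCH_RE = re.compile(
    r"MEM-WATCH addr=(0x[0-9a-fA-F]+) old=(0x[0-9a-fA-F]+) new=(0x[0-9a-fA-F]+)"
)


def parse_mem_watch(text):
    """Return {watched address: last value written} from MEM-WATCH output."""
    slots = {}
    for match in MEM_WATCH_RE.finditer(text):
        addr, _old, new = (int(group, 16) for group in match.groups())
        slots[addr] = new
    return slots


SAMPLE = """\
MEM-WATCH addr=0x828f3bc0 old=0x00000000 new=0x0000104c
   store_addr=0x828f3bc0 store_len=4 tid=1 pc=0x8244ffb0 lr=0x8244ffb0
MEM-WATCH addr=0x828f3bc4 old=0x00000000 new=0x00001050
   store_addr=0x828f3bc4 store_len=4 tid=1 pc=0x8244ffcc lr=0x8244ffcc
"""

slots = parse_mem_watch(SAMPLE)

# If the two addresses really are offsets 88 and 92 of one object,
# both must imply the same base address.
base_from_event = 0x828F3BC0 - 88
base_from_semaphore = 0x828F3BC4 - 92
```

Both subtractions give the same base, which is consistent with the two watched words being adjacent fields of a single dispatcher object.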

The semaphore is created with InitialCount=0. So if no one ever calls NtReleaseSemaphore on it, the wait will only ever return STATUS_TIMEOUT. Canary's behavior (mostly 0x102, occasionally 0x1) matches this: producers release the semaphore ~1× per ~16ms.

Ours's behavior (always 0x1) means producers release the semaphore FASTER THAN the consumer drains it.

NtReleaseSemaphore call graph (xrefs to wrapper sub_824AB158)

Wrapper sub_824AB158 calls NtReleaseSemaphore (ord 243, import @ VA 0x8284E07C). Called from 26 sites across 18 functions (counting each of the grouped ×6 and ×3 sites individually):

0x822c6770 fn=0x822c6748
0x822c6848 fn=0x822c6808
0x822c95c4 .. 0x822c9718 fn=0x822c8b50 (×6 inline call sites)
0x822f23e8 fn=0x822f2328
0x823dd7f8 fn=0x823dd770
0x823dda3c fn=0x823dd838
0x823df008..1b4 fn=0x823de4b8 (×3)
0x823df604 fn=0x823df320
0x82450310 fn=0x82450218   ← dispatcher-module enqueuer (callers: sub_82452DC0 ×2)
0x824504c4 fn=0x824503A0   ← dispatcher-module enqueuer (callers: sub_82452690, sub_8245E1D8)
0x82450cdc fn=0x82450b68   ← THE DISPATCH FUNCTION itself (self-release)
0x82450d28 fn=0x82450b68   ← THE DISPATCH FUNCTION itself (self-release)
0x82456b48 fn=0x824569c0 (jump form)
0x82458020 fn=0x82457fe0
0x824584c8 fn=0x82458468
0x82459424 fn=0x824591c0
0x8245ab6c fn=0x8245aaf0
0x8245ac6c fn=0x8245abd8
0x8245ade0 fn=0x8245ad00

Critical observation: the dispatch function sub_82450B68 contains TWO release sites (at 0x82450CDC and 0x82450D28). Each successful dispatch run can release the semaphore again. If both branches release +1 token, and the wait consumes only -1 per iteration, the count would drift up. This is consistent with the "ours over-released" hypothesis.

Some sub_82450B68 branches release the semaphore via lwz r3, 92(r27) which is handle[1] of the dispatcher itself. So the dispatch function re-fills its own pipe.

Hypothesis (MEDIUM-HIGH confidence)

The semaphore is being over-released in ours relative to the consumer's drain rate. Either:
(a) control flow inside sub_82450B68 diverges in how it decides whether to fire the self-release, so ours takes a branch that releases when canary's doesn't (this is the dual of S3's question: which sub-branches differ?), OR
(b) ours's parse_timeout divides the 16 ms relative timeout by 100 (exports.rs:4495, magnitude.max(1) / 100), turning a 16 ms wall-clock timeout (160,000 ticks of 100 ns) into 1,600 emulator-ticks. This may differentially interact with how often the semaphore gets a release between wait entries.
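The arithmetic behind the timeout theory is easy to check by hand. The sketch below only mirrors the two expressions quoted in this report (the millisecond to 100 ns conversion and `magnitude.max(1) / 100`); it is not the real parse_timeout, and integer division is an assumption about how that expression evaluates.

```python
TICKS_PER_MS = 10_000  # 1 ms = 10,000 ticks of 100 ns


def ms_to_relative_ticks(ms):
    """Relative timeouts are encoded as negative 100 ns tick counts."""
    return -(ms * TICKS_PER_MS)


def scaled_timeout(raw_ticks):
    """Mirror of the quoted `magnitude.max(1) / 100` (assumed integer division)."""
    magnitude = abs(raw_ticks)
    return max(magnitude, 1) // 100


raw = ms_to_relative_ticks(16)   # value the guest passes for a 16 ms wait
scaled = scaled_timeout(raw)     # value the emulator would wait on
```

Here `raw` is -160000 and `scaled` is 1600, matching the figures above. Whether 1,600 emulator-ticks is shorter or longer than 16 ms of guest progress depends on the emulator's tick rate, which this report does not establish.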

The exit-branch-at-matching-iteration framing from the user's task spec does NOT apply here: there IS no exit-branch divergence (both never exit). The divergence is in the wait return value, which has no proximate guest-memory load. The "load feeding the predicate" is a kernel-state read (the semaphore count) performed inside the kernel import handler itself.

Most-recent kernel calls (tid=5 in ours, from S3 lr-trace data + S4 cross-check)

Most-recent kernel calls before each wait at PC=0x82450B44 (re-wait site), on ours tid=5:

NtReleaseSemaphore(handle=0x1050, count=1) via wrapper sub_824AB158, lr=0x82450CDC OR lr=0x82450D28 (both inside sub_82450B68 dispatch body) — self-release in the dispatch tail.
KeSetEvent(handle=0x10xx) via wrapper sub_824AA2F0 OR sub_824AAF50 — γ-signaler family fires (the audit's original signaler PCs from S1/S3).
KeQueryPerformanceCounter-like via sub_824AA830 — used in budget refresh path.

In canary, the equivalent sequence per S1's signal-probe-correlated.log (180s window) is similar (γ-signalers fire 492× on tid=10), but the SELF-RELEASE rate matters more — that determines whether the consumer keeps seeing a non-zero semaphore.

S5 recommendation (refined)

The right next step is NOT to walk further upstream in the γ-signaler chain (S3's lead). It is to measure the per-branch flow inside sub_82450B68 itself — find which of its many branches release the semaphore and how that branch is selected.

Path A (RECOMMENDED, ~0 LOC, read-only)

--branch-probe covering sub_82450B68 body (PCs 0x82450B68 .. 0x82451238, the dispatch body). Want to capture:

Frequency at the two release sites 0x82450CDC and 0x82450D28 (per-call cumulative count on tid=5).
Frequency at the OTHER exit sites in sub_82450B68 (e.g. the early return at 0x82450EE8 which does NOT release).

If ours's release-rate at CDC/D28 is significantly higher than canary's, that confirms (a). If similar, then (b) becomes the next theory.
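One way to make the "significantly higher" comparison concrete is to normalise release-site hits by the number of re-wait entries in each engine. This is a sketch under stated assumptions: the per-PC fire counts are taken to be available as a dict, and the 2.0 ratio threshold is an arbitrary placeholder, not a value derived from any measurement.

```python
RELEASE_SITES = (0x82450CDC, 0x82450D28)  # self-release sites in sub_82450B68
REWAIT_PC = 0x82450B34                    # re-wait entry in sub_82450A68


def releases_per_wait(fires):
    """Self-releases per re-wait entry, or None if no waits were captured.

    fires: {pc: fire count} for one engine (hypothetical input shape).
    """
    waits = fires.get(REWAIT_PC, 0)
    if waits == 0:
        return None
    return sum(fires.get(pc, 0) for pc in RELEASE_SITES) / waits


def favours_hypothesis_a(ours, canary, ratio=2.0):
    """True if ours self-releases at least `ratio` times as often as canary."""
    rate_ours = releases_per_wait(ours)
    rate_canary = releases_per_wait(canary)
    if rate_ours is None or rate_canary is None:
        return None  # not enough data to compare
    if rate_canary == 0:
        return rate_ours > 0
    return rate_ours / rate_canary >= ratio
```

Normalising by wait count matters because the canary window is far shorter than ours, so raw hit counts at the release sites are not directly comparable.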

Path B (~80 LOC ours-side probe, requires source instrumentation)

Confirm whether the magnitude/100 scaling in xenia_kernel::exports::parse_timeout actually causes the divergence. --branch-probe cannot be used here: parse_timeout is Rust host code, not guest code, so this path requires source instrumentation. Mid-priority.

Path C (~30 LOC canary diagnostic)

Add canary cvar audit_69_semaphore_count_probe = VA that emits the post-Set count for the semaphore at native VA matching ours's [r31+92]'s underlying X_KSEMAPHORE. Compare per-iteration count progression canary-vs-ours.

LOC budget for S5: Path A = 0, Path B = ~80, Path C = ~30.

Path A first — narrows the divergence to specific sub_82450B68 sub-branch behavior at zero LOC cost.

Cascade

A (disasm sub_82450A68): PASS (HIGH) — 80-instruction body, 3 BB-paths, 12 BB-entries identified.
B (ours per-iteration loop-branch trace): PASS (HIGH) — 91 back-edge captures, all r3=0x1.
C (canary same trace): PARTIAL (MEDIUM) — canary crashed at 4 iterations in vkd3d-proton on exit; 4 captures sufficient to surface r3=0x102 dominance, but not a long-window comparison.
D (identify divergent load): PARTIAL (MEDIUM) — no guest-memory load is the proximate cause; the divergence is in the kernel-side semaphore-count state. The "load" is conceptually inside do_wait_multiple's read of KernelObject::Semaphore.count.

Net 2/4 PASS-HIGH, 2/4 PARTIAL-MEDIUM. Methodology learned: when both engines stay in a loop, "which branch did ours take differently" is the WRONG question — ask "what's different at the SAME branch."

Confidence flags (summary)

| finding | confidence |
|---|---|
| Both engines never take exit-branch (B50) | HIGH |
| ours back-edge r3=1 always (91/91) | HIGH |
| canary back-edge r3=0x102 mostly (3/4) | HIGH |
| handle[1] is NtCreateSemaphore w/ InitialCount=0 | HIGH |
| handle[0] is NtCreateEvent | HIGH |
| Divergence is kernel-side semaphore-count state | MEDIUM-HIGH |
| sub_82450B68 self-release over-fires in ours | MEDIUM |
| parse_timeout /100 scaling is contributing | LOW-MEDIUM |

Discipline

xenia-rs HEAD e6d43a23ac393004d2e5adf2f0395fd0b5e6448b UNCHANGED (sha256 of git diff HEAD matches S1/S2/S3 end at session start AND end).
READ-ONLY ours. No source mod. --branch-probe / --lr-trace / --mem-watch / --trace-handles-focus are runtime read-only flags documented as "lockstep digest unaffected" (state.rs comments).
Canary audit_61_branch_probe_pcs cvar enabled with our PC set; set back to "" at session end. Verified.
Canary mute = true set during run, restored to false at session end.
Canary cache wiped before cold canary run, restored from /tmp/canary-cache-bak-audit-068 at session end.

Artifacts

audit-runs/audit-069-wait-signal-producer/s4/
  sub_82450A68-disasm.txt          (80 ins disasm: sub_82450A68 entry + body)
  ours-loop-branch-trace.stdout    (696 BRANCH-PROBE records, ours-cold)
  ours-loop-branch-trace.stderr    (empty under --quiet)
  canary-loop-branch-trace.stdout  (1074 lines, 35 AUDIT-061-BR records)
  canary-loop-branch-trace.stderr  (89 lines, wine/vkd3d setup + final fault)
  ours-mem-watch.stderr            (2 MEM-WATCH records identifying handle slots)
  ours-mem-watch.stdout            (empty)
  ours-signaler.jsonl              (95 lr-trace records on wrapper PCs)
  ours-handles.{stdout,stderr}     (probe for handle dump; --halt-on-deadlock didn't trigger)
  ours-trace-handles-summary.log   (21 lines: focus startup + 8 ExCreateThread spawns)
  divergence-analysis.md           (per-iter table, hypothesis, S5 leads)
  writer-report-v4.md              (this file)

No canary instrumentation diff this session. No fix-canary-s4.diff.

Summary of S1 → S2 → S3 → S4 arc

S1 (2026-05-20 AM): identified canary tid=10 as the signaler; claimed ours lacks this thread (FALSIFIED by S2).
S2 (2026-05-20 noon): spawn-chain runs identically on ours tid=5; refined to "wrong-handle selection" downstream (REFINED by S3).
S3 (2026-05-20 PM): ours runs identical PC/LR chain but with ~5× fewer iterations. Producer-loop underrun classification. Wedge handle never even created in ours's truncated boot.
S4 (2026-05-20 evening): per-iteration branch-probe shows NEITHER engine ever exits the loop. Divergence is in NtWaitForMultipleObjectsEx return: ours r3=1 always (semaphore acquired), canary r3=0x102 mostly (timeout). Root cause is semaphore-count state divergence — ours's work-semaphore is over-released relative to consume rate, OR ours's timeout never fires before signal. Hypothesis: divergence inside sub_82450B68 dispatch body's self-release logic.

The S5 question is no longer "which earlier kernel call differs"; it is "which sub-branch of sub_82450B68 releases the semaphore in ours but not in canary?" The tool for answering it is a read-only branch-probe on sub_82450B68 body PCs.
