xenia-rs/audit-runs/audit-069-wait-signal-producer/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


AUDIT-069 Session 1 — wait-signal producer identification

Date: 2026-05-20
Status: LANDED (signaler tid + caller fns identified; AUDIT-066 circular framing FALSIFIED)

Headline

The wait at sub_821CB030+0x1AC (PC 0x821CB1DC), the canonical AUDIT-049/065 wedge wait, fires in canary on two tids (worker tid=17 and cache-loader tid=26). Both wedge waits are signaled by tid=10, a worker thread spawned EARLY (via sub_8244FF50 → ExCreateThread(entry=sub_82450A28)), NOT by any of the four workers spawned by sub_825070F0. This refutes AUDIT-066's circular framing ("γ-signaler running inside the 4 workers spawned by sub_825070F0"): the actual signaler reaches the production phase WITHOUT depending on sub_825070F0 firing.

Step 1 — wait site capture (canary)

Probe: --audit_61_branch_probe_pcs=0x821CB1DC --mute=true, 180s cold.

| tid | r3 (handle) | r4 (timeout) | r5 (wait_mode) | r6 (ctx) | r31 (stack) | lr |
|---|---|---|---|---|---|---|
| 17 | F80000A4 | FFFFFFFF | 0 (auto) | BC65CEC0 | 7064FA70 | 0x821CB1D0 |
| 26 | F8000110 | FFFFFFFF | 0 (auto) | BC667F80 | 708FF990 | 0x821CB1D0 |

Two distinct fires (one per logical caller). Both have r4=INFINITE timeout, matching the dossier. The lr=0x821CB1D0 is sub_821CB030+0x1A0, the instruction AFTER the bl-wait, which is consistent with the branch probe firing at the basic-block entry following the wait call's return.
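The offsets quoted above are plain address arithmetic and can be re-checked offline. A minimal sketch, using only the addresses given in this report; the helper name `fn_offset` is made up for illustration:

```python
def fn_offset(fn_start: int, addr: int) -> int:
    """Byte offset of addr from the start of its containing function."""
    if addr < fn_start:
        raise ValueError(f"{addr:#x} precedes function start {fn_start:#x}")
    return addr - fn_start


# Values quoted in this report.
SUB_821CB030 = 0x821CB030  # function start
PROBE_PC = 0x821CB1DC      # branch-probe PC
PROBE_LR = 0x821CB1D0      # lr captured by the probe

if __name__ == "__main__":
    print(f"probe pc = sub_821CB030+0x{fn_offset(SUB_821CB030, PROBE_PC):X}")
    print(f"lr       = sub_821CB030+0x{fn_offset(SUB_821CB030, PROBE_LR):X}")
```

The first line should report +0x1AC and the second +0x1A0, matching the text.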

Handle drift across cold runs is real: Step 1 vs Step 3 vs Step 4 trajectories produced wait handles {F80000A0,F8000108} / {F80000A0,F8000108} / {F80000A4,F8000110}. Per-run handles are still deterministic; the absolute ID is not.

Important framing correction: The brief expected "~16 fires" per AUDIT-065. This was already partly retracted by AUDIT-066 (which observed that thid=17 "terminates via ExTerminateThread(0) WITHOUT ever calling Wait inside its cache loop"). Step 1 confirms AUDIT-066's correction: the wait at +0x1AC fires ~2× per boot (one for the work-queue load that ANON_Class_713383D7 work goes through; one for the cache-loader sister-flow). Not 16. The wait is the WORK-QUEUE wait, not a per-cache-file IO wait.

Confidence: HIGH (probe fired, r3/r4/r5 match expected wait-call ABI, two distinct logical fires reproducible across cold runs).

Step 2 — instrumentation (canary, ~280 LOC additive)

New audit_69_* cvars + slowpath module:

  • cpu_flags.{h,cc} (+23/+48 LOC, of which ~30 LOC are mine vs cumulative):
    • --audit_69_event_signal_watch (CSV of guest handle IDs, max 4)
    • --audit_69_event_signal_native_ptr (CSV of guest VAs, max 4)
    • --audit_69_log_all_sets (bool — log EVERY XEvent::Set/Pulse fire)
  • xenia-kernel/audit_69_event_signal_watch.h (51 LOC) — fwd decls, hot-path inline wrapper (single relaxed atomic load + branch).
  • xenia-kernel/audit_69_event_signal_watch.cc (193 LOC) — lazy parse + UINT32_MAX sentinel + XThread::TryGetCurrentThread() for lr/tid capture. Mirrors AUDIT-068's static-init gate pattern.
  • xenia-kernel/xevent.cc (+9 LOC) — hook at XEvent::Set and XEvent::Pulse (the deepest convergence of Ke/Nt set + pulse paths).

Reading-error registration: XThread::GetCurrentThread() asserts on host threads; first iteration used it and crashed. Fixed by switching to TryGetCurrentThread(). (Same lesson as AUDIT-067's bool-vs-pointer asymmetry but in a different fn.)

Cumulative cross-run canary additions retained in tree (AUDIT-061/067/068/069).

Step 3 — correlated capture

Run: cold, 180s, --mute=true --audit_61_branch_probe_pcs=0x821CB1DC,0x824AA2F0,0x824AAF50 --audit_69_log_all_sets=true.

Volume: 122,165 log lines (Step 3) / 155,627 lines (Step 4 with wrapper probes).

Wait fires (Step 4): 2 (tid=17, tid=26, as in Step 1 but with handle drift to F80000A4/F8000110).

Signals on wedge handles (Step 4):

| wedge handle (waited on) | wait tid | signal fires | signal lr | signaling fn | signal tid |
|---|---|---|---|---|---|
| 0xF80000A4 | 17 | 1 | 0x824AA304 | sub_824AA2F0 (NtSetEvent wrapper) | 10 |
| 0xF8000110 | 26 | 100 | 0x824AAFC8 | sub_824AAF50 (a generic event-set-with-arg wrapper) | 10 |

The 100 fires on F8000110 are repeats — auto-reset events fire on first signal; the rest are no-ops. Volume reflects how often the work-queue processes items targeting this synchronizer.
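The correlation behind this table is a join of wait fires and Set/Pulse fires on the handle value, done within a single run because handles drift between runs. A minimal sketch of that join; the tuple layouts are an assumption for illustration and not the canary log format:

```python
def signals_on_wait_handles(waits, sets):
    """Summarise Set/Pulse events that target a handle some thread waited on.

    waits: iterable of (wait_tid, handle)          -- assumed layout
    sets:  iterable of (signal_tid, handle, lr)    -- assumed layout
    Returns {handle: {"wait_tid", "fires", "signal_tids", "lrs"}}.
    """
    summary = {
        handle: {"wait_tid": tid, "fires": 0, "signal_tids": set(), "lrs": set()}
        for tid, handle in waits
    }
    for signal_tid, handle, lr in sets:
        row = summary.get(handle)
        if row is None:
            continue  # signal on a handle that is not in the wait list
        row["fires"] += 1
        row["signal_tids"].add(signal_tid)
        row["lrs"].add(lr)
    return summary


if __name__ == "__main__":
    # Synthetic input shaped like the Step 4 numbers in this report.
    waits = [(17, 0xF80000A4), (26, 0xF8000110)]
    sets = [(10, 0xF80000A4, 0x824AA304)] + [(10, 0xF8000110, 0x824AAFC8)] * 100
    for handle, row in signals_on_wait_handles(waits, sets).items():
        print(f"{handle:#010x} wait_tid={row['wait_tid']} fires={row['fires']} "
              f"signal_tids={sorted(row['signal_tids'])}")
```

Signals on handles outside the wait list are dropped, which keeps the output small even when log_all_sets is enabled.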

Step 4 — signaler-fn resolution (sylpheed.db cross-check)

Wrapper-entry probe data for these two NtSet wrappers, filtered to tid=10:

| wrapper | lr-of-caller | caller fn | tid=10 fire count |
|---|---|---|---|
| sub_824AA2F0 (NtSetEvent wrapper) | 0x8245DA44 | sub_8245D9D8 (γ-signaler D-A per AUDIT-062) | 23 |
| sub_824AA2F0 (NtSetEvent wrapper) | 0x8245DB08 | sub_8245DA78 (γ-signaler D-B per AUDIT-062) | 8 |
| sub_824AAF50 (Ke-style wrapper) | 0x8245DC5C | sub_8245DB40 (NEW, not previously named) | 461 |

sub_824AAF50's disassembly needs follow-up, but lr=0x824AAFC8 = sub_824AAF50+0x78 is consistent with a bl xeKeSetEvent followed by a status check in an N-arg helper. The wrapper takes (handle, ptr, size), and the event it signals internally has a different handle from the input.

Containing-fn cross-check (sylpheed.db):

  • sub_8245D9D8 and sub_8245DA78 are in the worker cluster (0x82450000-0x8245C000). Per AUDIT-062: both are γ-signaler-D family, hot from worker-side, missed by AUDIT-059/060 enumeration.
  • sub_8245DB40 is in the same cluster; callers are sub_824528A8+0x54 and sub_8245EE50+0x20 (both worker-cluster internal).
  • All three are reached from tid=10's body fn sub_82450A68, the trampoline body for the entry sub_82450A28 (which ExCreateThread registers via sub_8244FF50).
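Resolving an lr to its containing function is a predecessor lookup over sorted function starts. A minimal sketch with `bisect`; the start list holds only the functions named in this report, which is an assumption made for the example (a real lookup would load every function start from the symbol database, otherwise an address in an unlisted function resolves to the wrong neighbour):

```python
import bisect

# Function starts named in this report (incomplete on purpose; see lead-in).
FN_STARTS = sorted([0x8245D9D8, 0x8245DA78, 0x8245DB40, 0x824AA2F0, 0x824AAF50])


def containing_fn(addr, starts=FN_STARTS):
    """Return (fn_start, offset) for the last start <= addr, or None."""
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None  # addr lies below every known function start
    return starts[i], addr - starts[i]


if __name__ == "__main__":
    # lr values taken from the Step 3 and Step 4 tables.
    for lr in (0x8245DA44, 0x8245DB08, 0x8245DC5C, 0x824AA304, 0x824AAFC8):
        start, off = containing_fn(lr)
        print(f"lr={lr:#010x} -> sub_{start:08X}+0x{off:X}")
```

With these inputs the lookup reproduces the caller-fn column, including sub_8245D9D8+0x6C and sub_824AAF50+0x78.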

tid=10 caller chain (canary):

sub_8244FEA8       (caller of sub_8244FF50; itself called from 11 sites)
  → sub_8244FF50   (spawner — calls ExCreateThread w/ entry=sub_82450A28)
                    → sub_82450A28  (thread-entry trampoline:
                                     KeSetThreadPriority(-2, 3); bl sub_82450A68)
                       → sub_82450A68  (worker dispatch loop)
                         → ... γ-signalers D / DA78 / DB40

sub_82450A28 is referenced as a data pointer at 0x8244FFF8 (inside sub_8244FF50). No call edges to it — it's purely a thread-entry data constant passed to ExCreateThread.

Step 5 — ours cross-reference

All identified signaler fns (sub_8245D9D8, sub_8245DA78, sub_8245DB40, sub_824AA2F0, sub_824AAF50, sub_82450A28, sub_8244FF50) are GAME (XEX) code — not kernel-imports. In ours these execute under the JIT, with no host-side analog to compare. The relevant question is whether the trajectory in ours REACHES these PCs.

Direct evidence from prior runs:

AUDIT-062 ours --lr-trace=0x824aa2f0 trace (ours-ntset.jsonl, 136 fires across cold boot up to deadlock):

  • tid=6: 82 NtSet fires
  • tid=1: 28 fires
  • tid=5: 22 fires
  • tid=8: 2 fires
  • tid=13: 2 fires
  • tid=10: 0 fires
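A per-tid tally like the one above can be rebuilt from the trace with a few lines. A minimal sketch, assuming each JSONL line is an object with an integer `tid` field; the real ours-ntset.jsonl schema is not shown in this report, so the field name is hypothetical:

```python
import json
from collections import Counter


def fires_per_tid(lines):
    """Count trace records per tid; blank lines are skipped."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line)["tid"]] += 1  # "tid" is an assumed field name
    return counts


if __name__ == "__main__":
    # Synthetic lines shaped like the distribution listed in this report.
    shape = {6: 82, 1: 28, 5: 22, 8: 2, 13: 2}
    lines = [json.dumps({"tid": t}) for t, n in shape.items() for _ in range(n)]
    counts = fires_per_tid(lines)
    print(sum(counts.values()), counts[10])
```

A `Counter` returns 0 for a tid that never appears, so the tid=10 check needs no special casing.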

ours NEVER spawns the canary-equivalent of tid=10 (the sub_8244FF50/sub_82450A28/sub_82450A68 worker). This is consistent with AUDIT-057's "thread-gap" finding: ours has fewer threads than canary.

Within ours, the γ-signalers DO fire — but on tid=5 (calling sub_824AA2F0 from lr=0x8245DA44 = sub_8245D9D8+0x6C) per AUDIT-062's ours-ntset.jsonl:line 1. AUDIT-062 already established these signal WRONG handles in ours (neighbors of 0x12AC are signaled; the wedge handle itself is not).

Conclusion: ours's signaler PCs exist and run, but on the wrong tids (no tid=10 equivalent), and target the wrong handles. The PRODUCER → SIGNALER chain in ours is structurally broken at the thread-spawn layer, not the kernel-import layer.

Confidence (Step 5): MEDIUM-HIGH for the chain identification (data is internally consistent and matches AUDIT-062's prior independent capture). LOW on the ours-side resolution mechanism (this audit did not re-run ours; cross-ref is read-only against prior dumps which may be stale relative to current ours HEAD e6d43a23…).

AUDIT-066 framing refutation

AUDIT-066 stated:

the producer-side signal for THAT event comes from a γ-signaler running inside the 4 workers spawned by sub_825070F0 — per AUDIT-063's static-reachability survey of NtSet wrapper callers.

This is falsified by AUDIT-069 Step 3+4 evidence:

  1. The signaler runs on tid=10, spawned by sub_8244FF50 via ExCreateThread(entry=sub_82450A28). This is NOT one of sub_825070F0's 4 workers.
  2. sub_8244FF50's caller chain does NOT require ANON_Class_713383D7's vtable to be installed; it does NOT require sub_825070F0 to fire.
  3. The circular-bootstrap concern AUDIT-066 raised ("workers can't signal until they spawn; they can't spawn until the wedge clears") was structurally correct framing IF the signaler were inside the sub_825070F0 4-worker family. Since the actual signaler is tid=10 (independently spawned), the circle is broken — the signaler IS reachable without the wedge clearing.

Reading-error class #37: static-reachability surveys (AUDIT-063 walked 12 hops from sub_82452DC0 to NtSet wrapper callers) are scoped to a particular caller chain; they miss alternative producer paths reached via unrelated thread-spawn sites. Always probe at the runtime SIGNAL site to confirm which exact caller fired, not just which static path could fire.

Cascade outcome

  • A (capture wait site PC + r3=handle in canary): PASS. PC 0x821CB1DC, r3 captures the handle on first fire reproducibly.
  • B (capture signal fires on the wait targets): PASS. 1 fire on F80000A4 (wedge handle 1), 100 fires on F8000110 (wedge handle 2).
  • C (resolve signaling fn + immediate caller fn): PASS. sub_824AA2F0 is called from sub_8245D9D8 / sub_8245DA78 (γ-signaler D family); sub_824AAF50 is called from sub_8245DB40 (new). All on tid=10.
  • D (ours-side cross-ref): PARTIAL. tid=10 IS missing in ours per existing AUDIT-062 data; γ-signalers DO fire but on wrong tids. Did not re-run ours in this session (per task discipline; cross-ref read-only against prior dumps).

Net 3/4 PASS, 1/4 PARTIAL.

Discipline

  • xenia-rs HEAD e6d43a23ac393004d2e5adf2f0395fd0b5e6448b UNCHANGED. git diff HEAD | sha256sum at session start = ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357 and at session end IDENTICAL.
  • Canary patch is purely additive, cvar-gated default-off, UINT32_MAX sentinel + std::call_once parse pattern (per AUDIT-068 discipline).
  • Every canary run used --mute=true.
  • Cache wiped before each cold run (5 cold runs total: Step 1 90s, Step 1 180s rerun, Step 3 with handle watch, Step 3 with log_all_sets, Step 4 with wrapper probes). Each cache moved to /tmp/_audit_069_step* before next cold run.
  • Cache restoration from /tmp/canary-cache-bak-audit-068 deferred to session end (done after this report).
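The unchanged-tree check in the first bullet can be scripted so that the start and end digests are compared mechanically. A minimal sketch that hashes the bytes `git diff HEAD` writes to stdout, which are the same bytes the `sha256sum` pipe would see; the function names are made up for illustration:

```python
import hashlib
import subprocess


def digest(data: bytes) -> str:
    """Hex SHA-256 of a byte string."""
    return hashlib.sha256(data).hexdigest()


def working_tree_digest(repo_dir: str) -> str:
    """SHA-256 of `git diff HEAD` output for the repository at repo_dir."""
    out = subprocess.run(
        ["git", "diff", "HEAD"],
        cwd=repo_dir,
        check=True,           # raise if git exits non-zero
        capture_output=True,  # collect stdout as bytes
    ).stdout
    return digest(out)
```

Calling `working_tree_digest` once at session start and once at session end, then comparing the two strings, gives the same guarantee as the manual check.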

Artifacts

xenia-rs/audit-runs/audit-069-wait-signal-producer/
  step1-wait-probe.log               (90s baseline; 2 wait fires)
  step1-wait-probe.stdout
  step1-wait-probe-180s.log          (180s rerun; 2 wait fires)
  step1-wait-probe-180s.stdout
  step3-signal-probe.log             (180s; first signal-watch test;
                                      handles drifted, partial correlation)
  step3-signal-probe.stdout
  step3-correlated.log               (180s; log_all_sets; 120k signal fires)
  step3-correlated.stdout
  step4-wrapper-callers.log          (180s; log_all_sets + wrapper entries;
                                      155k events; correlated lr-to-caller)
  step4-wrapper-callers.stdout
  fix-canary.diff                    (cumulative canary diff vs 6de80dffe)
  writer-report.md                   (this file)

Session 2 recommendation

Two paths, both <100 LOC ours-side:

Path 1 (ours read-only probe + targeted root-cause): re-run ours with --ctor-probe=0x82450A28 (the canary-tid=10 entry) — confirm it never fires. Then --ctor-probe=0x8244FF50 (the spawner). If sub_8244FF50 also never fires, walk up its 11 callers in sylpheed.db — likely one of them gates on a flag/event that's not set in ours's early-boot trajectory.

Path 2 (canary additional capture): probe canary's tid=10 spawn sequence in detail. Add audit_69_thread_spawn_watch cvar that logs every ExCreateThread call with (entry_pc, ctx, suspend_flag, caller_lr). ~40 LOC. Compare to ours's spawn list — find which call goes unfired in ours.
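The comparison step in Path 2 reduces to a set difference keyed on entry_pc. A minimal sketch, assuming both engines can emit spawn records with the four fields named above; the `Spawn` type and the sample values other than 0x82450A28 are placeholders, not captured data:

```python
from typing import NamedTuple


class Spawn(NamedTuple):
    entry_pc: int
    ctx: int
    suspend_flag: bool
    caller_lr: int


def unfired_entries(canary, ours):
    """entry_pc values spawned in canary but never in ours, in first-seen order."""
    seen_in_ours = {s.entry_pc for s in ours}
    missing = []
    for s in canary:
        if s.entry_pc not in seen_in_ours and s.entry_pc not in missing:
            missing.append(s.entry_pc)
    return missing


if __name__ == "__main__":
    # Placeholder records: only entry_pc=0x82450A28 comes from this report.
    canary = [Spawn(0x82450A28, 0, False, 0), Spawn(0x82000100, 0, False, 0)]
    ours = [Spawn(0x82000100, 0, False, 0)]
    print([f"{pc:#010x}" for pc in unfired_entries(canary, ours)])
```

Keying on entry_pc rather than on tid avoids depending on thread numbering, which differs between the two engines.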

Both paths are cheaper than continuing on the wedge directly. Path 1 is preferred: it stays on the ours side which is the failing engine.

Predicted Session 2 cascade:

  • A (find sub_82450A28's first-non-fire ancestor in ours): 75-85%
  • B (identify the missing precondition for that ancestor): 50-60%
  • C (fix LOC in ours ≤ 50): 30-40%
  • D (draws>0): 15-25% (single wedge unlock)