Files
xenia-rs/docs/functions/sub_82173990.md
MechaCat02 ad45873a1b ITERATE-2.V: scheduler priority aging closes 18-day AUDIT-049 wedge
Priority aging in xenia-cpu/scheduler.rs:pick_runnable
(effective_priority = base + age_bonus(now_round - last_run_round),
capped at +31, AGING_ROUNDS_PER_BONUS=1). Strict-priority was parking
priority=0 threads behind CPU-bound priority=15 audio mixer
(sub_824D1328 guest spinwait at PC=0x824d1404 on CPU5). Aging
eventually picks the starved thread, breaking the producer-consumer
cycle that caused 5-tid wedge at PC=0x824ac578 since AUDIT-049 (10 May).

Cascade observed: tid=13 clean exit; events 121K -> 13M (107x); last
host_ns 767ms -> 51,011ms (66x); 8 new threads spawn; VdSwap 1 -> 2.

Complete two-day iterate sequence (2026-05-27 -> 2026-05-28):
- 2.F: VdSwap drain timeout 900ms -> 1ms (xenia-gpu/handle.rs); 876x
       perf win on VdSwap kernel callback
- 2.H: vA0000000 physical heap bucket added (state.rs, exports.rs);
       ctx_ptrs now in 0xA0000000-0xBFFFFFFF range matching canary
- 2.L: Phase-A diff harness categorized [return_value mismatch],
       [status mismatch], [args_resolved.path mismatch] tags
       (tools/diff-events/diff_events.py); closes reading-error #41
       (silent test-harness state leak invalidating trace diffs)
- 2.M: always-on exit-thread-state.json sibling to Phase-A JSONL
       (event_log.rs + xenia-app/main.rs); closes reading-error #42
       (Phase-A blind to blocked-forever waits)
- 2.Q: signal.match kernel instrumentation in NtSetEvent /
       NtReleaseSemaphore / KeSetEvent / KeReleaseSemaphore
       (exports.rs); emits target_handle + waiter_count + waiter_tids
- 2.T: wake.requested kernel instrumentation in wake_eligible_waiters
       (exports.rs); emits target_tid + transition + new_state
- 2.V: scheduler priority aging (xenia-cpu/scheduler.rs) [keystone]

Plus accumulated WIP from earlier May (contention_manifest,
phase_b_snapshot, xam/xaudio enhancements, analysis db, xex loader,
xenia-app main loop, etc.). Audit-runs/ artifacts remain untracked
per project convention.

Tests: 300 xenia-cpu / 227 xenia-kernel / 5 xenia-app / 19 xenia-path
/ 30+ smaller suites -- all PASS, 0 regressions. Determinism preserved
(2x cold runs bit-identical at 13,003,881 events post-2.V).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 07:27:26 +02:00

14 KiB
Raw Blame History

address, classification, confidence, last_audit, aliases
address classification confidence last_audit aliases
0x82173990 normal_callee high 066
tid=1 join wait site (the wedge PC 0x82173C60)
synchronous task-spawn + join helper

sub_82173990 — synchronous task-spawn-and-join helper

Synopsis

Tid=1 (main) one-shot helper that builds a stack-resident task descriptor, calls sub_821746B0 to allocate+initialize a 24-byte task record (which encapsulates a sync object created by sub_82172370), and waits INFINITE on that sync object. The wait at PC 0x82173C60 bl 0x824AA330 (= sub_82173990+0x2D0) is the AUDIT-049/AUDIT-064 wedge site — both canary and ours enter the wait, but only canary's wait completes. The wait is on the thread handle of the worker spawned by sub_821746B0 (XThread or KE_THREAD), released when that worker calls ExTerminateThread. Function is called exactly 1× per boot in both engines (entry probe fires once, all body PCs fire once each).

Evidence

Disasm-anatomy (size 768B / 192 insns @ 0x82173990..0x82173C8C):

0x82173990  mflr/prologue (256-byte frame)
0x821739B0  bl 0x8216E7E8   ; first string-init helper (r4=r11+6244)
0x821739CC  bl 0x82448AA0   ; cr0=.G. → arg!=0 path
0x821739F0  bl 0x82448BC8   ; returns string-table entry → r28
0x82173A38  bl 0x8216F218   ; internal copy
0x82173A68  bl 0x821835E0   ; → r25 (ID/result); cr6-tests below
0x82173A78  bne cr6, 0x82173A84   ; skip if r25==28
0x82173A88  beq cr6, 0x82173BC0   ; skip if r25==0
…
0x82173B98  bl 0x82453910        ; signaler candidate (AUDIT-049 column)
0x82173BC0  convergence label (string-table clean-up + dispatch)
0x82173BE4  bl 0x824B2188   ; tid=1's outer-channel pump
0x82173C34  bl 0x821746B0   ; allocates 24-byte task record, sub_82172370 fills [r29+4]
0x82173C38  mr r30, r3      ; r30 = task struct
0x82173C48  bl 0x824AA5C8   ; status query → r3 (canary r3=1 → [r31+80]=0x103 STATUS_PENDING)
0x82173C54  bne cr6, 0x82173C64  ; guard: only wait if r11==0x103 (STATUS_PENDING)
0x82173C5C  lwz r3, 4(r30)       ; r3 = task->sync_handle = [struct+4]
0x82173C60  bl 0x824AA330        ; KeWaitForSingleObject INFINITE  ← THE WEDGE
0x82173C70  bl 0x82174AF8        ; post-wait task completion (sub_82174AF8 runs post-state transition)
0x82173C88  epilogue (b 0x825F0FC4)

Canary run (AUDIT-065, 180s wallclock, --audit_61_branch_probe_pcs)

All 17 probed PCs fire exactly 1× each on tid=F8000008 (= canary main / mapped from tid=6):

PC lr r3 r4 tid meaning
0x82173990 0x822F1B50 BCCC4A80 701CF8C0 6 entry; lr=post-bctrl of sub_822F1AA8
0x821739CC 0x00000001 0x820A17A8 6 cr0=.G. — cmplwi r28,0 post-strcmp != 0 path
0x821739F0 BCCC4A64 BCCC49FC 6 r28 populated; cr6=..E (==)
0x82173A38 701CF860 701CF840 6 inner copy call entry
0x82173A68 BDE996FF BDE98F14 6 r25=0xBDE996FF (returned ID), cr6=..E
0x82173A78 0x0000001C BDE98F14 6 bne 0x82173A84r25 != 28 taken
0x82173BC0 0x82173A6C 0x1C 6 convergence (post-bne over alloc); beq 0x82173B14-skipped
0x82173BE4 0x82173BD4 BE568F00 0x00000005 6 bl 0x824B2188 entry
0x82173C34 0x82173C1C BCCC4A80 0x00000000 6 bl 0x821746B0 entry — calls task-alloc
0x82173C38 0x82173C38 BC365700 701CF6E0 6 r3=task_struct, cr6=.G.
0x82173C48 0x82173C38 F8000094 701CF800 6 post bl 0x824AA5C8; r3=F8000094 ?? actually this is r3 at BB entry post-bl
0x82173C54 0x82173C4C 0x00000001 0x30000000 6 cmplwi r11, 0x103 — value 0x103 sets cr6=..E (eq) per actual r11; BUT cr6=..E means !lt!gt eq — wait was entered
0x82173C60 0x82173C4C F8000094 FFFFFFFF 6 wait entry — r3 = thread handle
0x82173C64 0x82173C64 0 0x1 6 post-wait — wait completed!
0x82173C70 0x82173C64 BCCC4A80 BC365700 6 bl 0x82174AF8 (cleanup)
0x82173C88 0x82173C88 701CF840 6 epilogue

Wait duration: ~445 log lines between PC 0x82173C60 (entry) and 0x82173C64 (post-wait).

Inside the wait window: K> F8000094 XThread::Execute thid 17 (handle=F8000094, 'XThread01F4 (F8000094)', native=000001F4) — i.e. F8000094 IS a thread handle. The thread loads cache files (cache:\aab216c3\5\ee70e0a, cache:\87719002\c\dba806e, cache:\87719002\c\ec0a96e, cache:\87719002\a\60fcb85, etc) and spawns child workers (ExCreateThread(..., 824AFF88, 821C4AD0, BCA44C00, ...) and others). The wait completes immediately after d> F8000094 ExTerminateThread(00000000).

Ours run (AUDIT-065, -n 500M instructions)

Only 6 of 24 BB-entry probed PCs fire (ours's branch probe fires only at BB-entry):

PC lr r3 cycle meaning
0x82173990 0x822F1B50 0x40ba9a80 6,172,194 entry; lr= bctrl in sub_822F1AA8
0x821739CC 0x00000001 6,172,686 non-zero arg path
0x821739F0 0x40ba9a64 6,173,074 str-init complete
0x82173A68 0x41d7e6ff 9,174,034 r3=str-table-entry (the AUDIT-049 wait inside sub_82452DC0 already happened HERE — note the cycle gap from 6.17M to 9.17M means tid=1 has been blocked in sub_82172370/etc; actually this is the bl 0x821835E0 post-return)
0x82173BC0 0x82173A6C 0x1C 9,175,368 convergence (r25=0x41d7e6ff != 28 not equal to 28; checks ≠0 also nonzero, fall thru to skip block @ 0x82173B14)
0x82173C38 0x82173C38 0x4024a640 9,178,243 post bl 0x821746B0 — r3=0x4024a640 (task struct, ALSO start_ctx of tid=13)

No probe fires beyond 0x82173C38. The next BB-entry probe was 0x82173C64 (post-wait). Mid-block PCs 0x82173C48/C54/C60/C70 don't fire in ours's branch-probe (per AUDIT-046 reading-error #13). The fact that 0x82173C64 does NOT fire confirms: ours's tid=1 wedges between 0x82173C38 and 0x82173C64 — at the wait at 0x82173C60.

End-of-run thread state (ours --trace-handles dump)

handle=0x000012a4 Thread(id=13, exit=None) waiters(tid)=[1]
handle=0x000012ac kind=Event/Auto waiters=1 signals=0 waits=1 wakes=0 <NO_SIGNALS_DESPITE_WAITS>
   [ 0] cycle=0 tid=13 lr=0x824ac578 src=do_wait_single
handle=0x000012b8 kind=Event/Auto waiters=1 signals=0 waits=1 wakes=0 <NO_SIGNALS_DESPITE_WAITS>
   [ 0] cycle=0 tid=5 lr=0x824ac578 src=do_wait_single

tid=13 (handle 0x12A4, exit=None) is alive but stuck on event 0x12AC inside sub_821CB030 (cache file IO completion event). tid=5 is one of the workers parked on its own idle event 0x12B8. tid=1 join-waits tid=13 → tid=13 waits 0x12AC → 0x12AC needs workers → workers parked.

Activation

Direct bl from sub_82175330+0x4 via tail-jump (post-bctrl of vtable[0] dispatched at sub_822F1AA8+0xA4). One static caller sub_82175330 per sylpheed.db.

Called exactly 1× per boot on tid=1 in both engines.

Static graph

  • Direct callers (sylpheed.db xrefs.source_func):
    • sub_82175330+0x4 via b 0x82173990 (tail-jump from the vtable thunk).
  • Direct callees of interest:
    • bl 0x8216E7E8 at +0x20 — string-table helper (used twice).
    • bl 0x82448AA0 at +0x38, +0x48, +0x88, +0xC4, +0x168, +0x260 — string-table lookup.
    • bl 0x824AA7A0 at +0x4C — string-helper.
    • bl 0x82448BC8 at +0x5C — internal lookup.
    • bl 0x82448C50 at +0x78, +0x98, +0x178, +0x308 — string convert.
    • bl 0x8216F218 at +0xA8, +0x188 — copy / string ops.
    • bl 0x8217C850 at +0xAC — query.
    • bl 0x82178E50 at +0xB8 — query.
    • bl 0x821835E0 at +0xD8 — returns ID into r25 (key gate).
    • bl 0x824AA830 at +0xFC, +0x148 — kernel helper.
    • bl 0x822C69C8 at +0x104, +0x134 — task-helper.
    • bl 0x822DE650 at +0x118 — helper.
    • bl 0x822F2328 at +0x124, +0x240 — helper (calls inside outer sub_822F1AA8 too).
    • bl 0x822DE858 at +0x13C — helper.
    • bl 0x822F28C0 at +0x144, +0x25C — helper.
    • bl 0x82674028 at +0x15C — kernel-debug printf? format.
    • bl 0x82150EF8 at +0x1A4 — heap alloc 28-byte struct.
    • bl 0x824523E8 at +0x1FC — task-helper.
    • bl 0x82453910 at +0x208signaler/notify (AUDIT-049 column).
    • bl 0x821506B8 at +0x224 — heap free.
    • bl 0x8216E790 at +0x22C, +0x2EC, +0x2F4 — string-cleanup.
    • bl 0x824B2188 at +0x254 — tid=1's outer-channel pump.
    • bl 0x824482D0 at +0x288 — format.
    • bl 0x821746B0 at +0x2A4task allocator + worker spawn (the key call).
    • bl 0x824AA5C8 at +0x2B8 — status query (returns r3 → checked vs 0x103 STATUS_PENDING).
    • bl 0x824AA330 at +0x2D0KeWaitForSingleObject INFINITE — THE WEDGE PC.
    • bl 0x82174AF8 at +0x2E0 — post-wait task-cleanup/state-transition.
    • b 0x825F0FC4 at +0x2FC — epilogue tail-jump.

Audit log

  • AUDIT-066 (2026-05-12)source-review only (READ-ONLY). AUDIT-065's "host-side F8000048 IO completion thread" inference falsified by canary source review. F8000048 is a guest XThread thid=10, spawned by main at canary-run.stdout:1331 via ExCreateThread(...,824AFF88, 82450A28, 828F3B68, 0) — entry 0x82450A28 is a Sylpheed-internal worker thread, not host infrastructure. Canary's only host helper thread is "Kernel Dispatch" (xenia-canary/src/xenia/kernel/kernel_state.cc:524-549) which services CompleteOverlappedDeferred for XAM UI/content, not file IO. Canary's NtReadFile/NtReadFileScatter/NtWriteFile (xboxkrnl_io.cc:125-389) are synchronous and signal the supplied event handle inline via ev->Set(0, false) after the sync read. Ours's signal_io_completion_event (exports.rs:1156-1169) is the bit-equivalent. No "host-side IO completion signal" gap exists in ours's IO handlers. The wait at this fn's +0x2D0 (PC 0x82173C60) is on the thread handle for the worker spawned via bl 0x821746B0 (= tid=13 in ours / thid=17 in canary), released by ExTerminateThread per AUDIT-065 — confirmed correct framing. AUDIT-066 conclusion: brief's proposed fix locus (xenia-kernel/src/exports.rs IO handlers) is wrong; the bug is upstream worker-cluster bootstrap (AUDIT-057/063/064 chain). No code change of any size in exports.rs would unwedge tid=13. [confirmed: brief premise falsified]
  • AUDIT-065 (2026-05-12) — full disasm + 17-PC probe in BOTH engines. Canary's tid=1 (= F8000008, internally tid=6) reaches PC 0x82173C60 exactly once, waits on r3=0xF8000094 (= XThread thid 17's thread handle), and the wait completes when that worker reaches ExTerminateThread(0). Worker runs synchronous cache file IO (cache:\aab216c3\5\ee70e0a, cache:\87719002\c\... etc) and spawns child workers via ExCreateThread(... 824AFF88, 821C4AD0, BCA44C00 ...) before terminating. Ours's tid=1 reaches PC 0x82173C38 (post bl 0x821746B0, r3=0x4024a640 = ours's task struct = ours's tid=13's start_ctx) and stalls before 0x82173C64 — i.e. inside the wait at 0x82173C60. Ours's tid=13 (created by bl 0x821746B0's subroutine, entry 0x821748F0) DOES open the same cache files (cache:/aab216c3/5, cache:/aab216c3) but BLOCKS inside sub_821CB030+0x1AC on event 0x12AC (NO_SIGNALS_DESPITE_WAITS). So ExTerminateThread is never reached on ours's tid=13 → tid=1's wait on 0x12A4 never completes. Refines the wedge from "thread-join on tid=13" to a precise mechanism: the wait at +0x2D0 is structurally a synchronous task-join (canary worker exits in ~1ms; ours's worker is permanently stuck downstream). sub_82173990 body itself is clean — every probed PC except the wait completion matches canary's behavior; the divergence is entirely in what happens inside sub_821746B0's spawned worker (tid=13's body in sub_821748F0sub_821C4EB0sub_821CC3F8sub_821CBA08sub_821CB030). Same AUDIT-049 island, now framed as: how does canary's worker get its 0x12AC-equivalent event signaled fast enough that the worker can call ExTerminateThread? [confirmed]

Open questions

  • What signals canary's tid=17 cache-IO completion event (the 0x12AC-equivalent inside its sub_821CB030)? Probe canary's NtSetEvent/KeSetEvent thunks (0x8284DF5C/0x82490018) filtered on tid=17's r3-handle in the lr-window around the corresponding tid=17 wait. Compare against ours's empty 0x12AC.signals count.
  • The cache file open flow looks identical in both engines for cache:\aab216c3\… paths — confirming AUDIT-054's VFS layout fix landed correctly. The divergence is purely in the producer-side signaling.
  • Both engines pass through PC 0x82173B98 (the sub_82453910 "signaler" candidate) on the r25!=0 && r25!=28 path — but only if the AUDIT-046 "5/5 iter" path lands inside the block [0x82173B14, 0x82173BC0]. Ours's BB-entry probe shows 0x82173BC0 fires but NOT 0x82173B14/B40/B84/B98/BA0/BA8 — meaning ours's beq at 0x82173B14 is taken (r5==0) and we skip the entire block. Canary's BR list also shows direct jump from 0x82173A780x82173BC0 (line 1998→1999) — so canary also skips the block at 0x82173B14. Block is dead in both engines at this horizon. sub_82453910 is NOT the relevant signaler at this call site.
  • r3 at 0x82173C48 shows F8000094 in canary at probe time, but that's the post-bl 0x824AA5C8 PC (mid-block; the captured value is whatever r3 carried at that instant — likely the handle being queried, not the return). Worth a follow-up probe to confirm the status-query target.

Cross-references

  • Caller: sub_822F1AA8 (via thunk sub_82175330, vtable[0] of *(0x828E1F08)).
  • Callees of interest:
    • sub_821746B0 — task allocator + worker spawn (no dossier yet — recommend creating one).
    • sub_82172370 — sync object creator (no dossier yet).
    • sub_82174AF8 — post-wait cleanup (no dossier yet).
  • Worker-side wedge: sub_821CB030 — fires inside the worker spawned via sub_821746B0 from sub_82173990's bl at +0x2A4.
  • Audits: 049 (original tid=1 stall localization), 064 (full activation chain to sub_825070F0), 065 (this).
  • Artifacts: audit-runs/audit-065-sub82173990-wait-site/{sub_82173990.disasm,canary.log,canary-run.stdout,ours.log,ours-stdout.log,summary.md}.