Files
xenia-rs/docs/functions/sub_82173990.md
MechaCat02 ad45873a1b ITERATE-2.V: scheduler priority aging closes 18-day AUDIT-049 wedge
Priority aging in xenia-cpu/scheduler.rs:pick_runnable
(effective_priority = base + age_bonus(now_round - last_run_round),
capped at +31, AGING_ROUNDS_PER_BONUS=1). Strict-priority was parking
priority=0 threads behind CPU-bound priority=15 audio mixer
(sub_824D1328 guest spinwait at PC=0x824d1404 on CPU5). Aging
eventually picks the starved thread, breaking the producer-consumer
cycle that caused 5-tid wedge at PC=0x824ac578 since AUDIT-049 (10 May).

Cascade observed: tid=13 clean exit; events 121K -> 13M (107x); last
host_ns 767ms -> 51,011ms (66x); 8 new threads spawn; VdSwap 1 -> 2.

Complete two-day iterate sequence (2026-05-27 -> 2026-05-28):
- 2.F: VdSwap drain timeout 900ms -> 1ms (xenia-gpu/handle.rs); 876x
       perf win on VdSwap kernel callback
- 2.H: vA0000000 physical heap bucket added (state.rs, exports.rs);
       ctx_ptrs now in 0xA0000000-0xBFFFFFFF range matching canary
- 2.L: Phase-A diff harness categorized [return_value mismatch],
       [status mismatch], [args_resolved.path mismatch] tags
       (tools/diff-events/diff_events.py); closes reading-error #41
       (silent test-harness state leak invalidating trace diffs)
- 2.M: always-on exit-thread-state.json sibling to Phase-A JSONL
       (event_log.rs + xenia-app/main.rs); closes reading-error #42
       (Phase-A blind to blocked-forever waits)
- 2.Q: signal.match kernel instrumentation in NtSetEvent /
       NtReleaseSemaphore / KeSetEvent / KeReleaseSemaphore
       (exports.rs); emits target_handle + waiter_count + waiter_tids
- 2.T: wake.requested kernel instrumentation in wake_eligible_waiters
       (exports.rs); emits target_tid + transition + new_state
- 2.V: scheduler priority aging (xenia-cpu/scheduler.rs) [keystone]

Plus accumulated WIP from earlier May (contention_manifest,
phase_b_snapshot, xam/xaudio enhancements, analysis db, xex loader,
xenia-app main loop, etc.). Audit-runs/ artifacts remain untracked
per project convention.

Tests: 300 xenia-cpu / 227 xenia-kernel / 5 xenia-app / 19 xenia-path
/ 30+ smaller suites -- all PASS, 0 regressions. Determinism preserved
(2x cold runs bit-identical at 13,003,881 events post-2.V).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 07:27:26 +02:00

168 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
address: 0x82173990
classification: normal_callee
confidence: high
last_audit: 066
aliases:
- "tid=1 join wait site (the wedge PC 0x82173C60)"
- "synchronous task-spawn + join helper"
---
# sub_82173990 — synchronous task-spawn-and-join helper
## Synopsis
Tid=1 (main) one-shot helper that builds a stack-resident task descriptor, calls
`sub_821746B0` to allocate+initialize a 24-byte task record (which encapsulates
a sync object created by `sub_82172370`), and waits INFINITE on that sync
object. The wait at PC **`0x82173C60 bl 0x824AA330`** (= `sub_82173990+0x2D0`)
is the AUDIT-049/AUDIT-064 wedge site — both canary and ours enter the wait,
but only canary's wait completes. The wait is on the **thread handle** of the
worker spawned by `sub_821746B0` (XThread or KE_THREAD), released when that
worker calls `ExTerminateThread`. Function is called exactly **1× per boot** in
both engines (entry probe fires once, all body PCs fire once each).
## Evidence
Disasm-anatomy (size 768B / 192 insns @ `0x82173990..0x82173C8C`):
```
0x82173990 mflr/prologue (256-byte frame)
0x821739B0 bl 0x8216E7E8 ; first string-init helper (r4=r11+6244)
0x821739CC bl 0x82448AA0 ; cr0=.G. → arg!=0 path
0x821739F0 bl 0x82448BC8 ; returns string-table entry → r28
0x82173A38 bl 0x8216F218 ; internal copy
0x82173A68 bl 0x821835E0 ; → r25 (ID/result); cr6-tests below
0x82173A78 bne cr6, 0x82173A84 ; skip if r25==28
0x82173A88 beq cr6, 0x82173BC0 ; skip if r25==0
0x82173B98 bl 0x82453910 ; signaler candidate (AUDIT-049 column)
0x82173BC0 convergence label (string-table clean-up + dispatch)
0x82173BE4 bl 0x824B2188 ; tid=1's outer-channel pump
0x82173C34 bl 0x821746B0 ; allocates 24-byte task record, sub_82172370 fills [r29+4]
0x82173C38 mr r30, r3 ; r30 = task struct
0x82173C48 bl 0x824AA5C8 ; status query → r3 (canary r3=1 → [r31+80]=0x103 STATUS_PENDING)
0x82173C54 bne cr6, 0x82173C64 ; guard: only wait if r11==0x103 (STATUS_PENDING)
0x82173C5C lwz r3, 4(r30) ; r3 = task->sync_handle = [struct+4]
0x82173C60 bl 0x824AA330 ; KeWaitForSingleObject INFINITE ← THE WEDGE
0x82173C70 bl 0x82174AF8 ; post-wait task completion (sub_82174AF8 runs post-state transition)
0x82173C88 epilogue (b 0x825F0FC4)
```
### Canary run (AUDIT-065, 180s wallclock, --audit_61_branch_probe_pcs)
All 17 probed PCs fire **exactly 1× each on tid=F8000008 (= canary main / mapped from `tid=6`)**:
| PC | lr | r3 | r4 | tid | meaning |
|---|---|---|---|---|---|
| `0x82173990` | `0x822F1B50` | `BCCC4A80` | `701CF8C0` | 6 | entry; lr=post-bctrl of sub_822F1AA8 |
| `0x821739CC` | — | `0x00000001` | `0x820A17A8` | 6 | cr0=.G. — `cmplwi r28,0` post-strcmp != 0 path |
| `0x821739F0` | — | `BCCC4A64` | `BCCC49FC` | 6 | r28 populated; cr6=..E (==) |
| `0x82173A38` | — | `701CF860` | `701CF840` | 6 | inner copy call entry |
| `0x82173A68` | — | `BDE996FF` | `BDE98F14` | 6 | r25=0xBDE996FF (returned ID), cr6=..E |
| `0x82173A78` | — | `0x0000001C` | `BDE98F14` | 6 | `bne 0x82173A84``r25 != 28` taken |
| `0x82173BC0` | `0x82173A6C` | `0x1C` | … | 6 | convergence (post-bne over alloc); `beq 0x82173B14`-skipped |
| `0x82173BE4` | `0x82173BD4` | `BE568F00` | `0x00000005` | 6 | bl 0x824B2188 entry |
| `0x82173C34` | `0x82173C1C` | `BCCC4A80` | `0x00000000` | 6 | bl 0x821746B0 entry — calls task-alloc |
| `0x82173C38` | `0x82173C38` | `BC365700` | `701CF6E0` | 6 | r3=task_struct, cr6=.G. |
| `0x82173C48` | `0x82173C38` | `F8000094` | `701CF800` | 6 | post bl 0x824AA5C8; r3=F8000094 ?? actually this is r3 at BB entry post-bl |
| `0x82173C54` | `0x82173C4C` | `0x00000001` | `0x30000000` | 6 | cmplwi r11, 0x103 — value 0x103 sets cr6=..E (eq) per actual r11; BUT cr6=..E means !lt!gt eq — wait was entered |
| `0x82173C60` | `0x82173C4C` | **`F8000094`** | `FFFFFFFF` | 6 | **wait entry — r3 = thread handle** |
| `0x82173C64` | `0x82173C64` | `0` | `0x1` | 6 | post-wait — wait completed! |
| `0x82173C70` | `0x82173C64` | `BCCC4A80` | `BC365700` | 6 | bl 0x82174AF8 (cleanup) |
| `0x82173C88` | `0x82173C88` | `701CF840` | … | 6 | epilogue |
Wait duration: ~445 log lines between PC `0x82173C60` (entry) and `0x82173C64` (post-wait).
Inside the wait window: `K> F8000094 XThread::Execute thid 17 (handle=F8000094, 'XThread01F4 (F8000094)', native=000001F4)` — i.e. F8000094 IS a thread handle. The thread loads cache files (`cache:\aab216c3\5\ee70e0a`, `cache:\87719002\c\dba806e`, `cache:\87719002\c\ec0a96e`, `cache:\87719002\a\60fcb85`, etc) and spawns child workers (`ExCreateThread(..., 824AFF88, 821C4AD0, BCA44C00, ...)` and others). The wait completes immediately after `d> F8000094 ExTerminateThread(00000000)`.
### Ours run (AUDIT-065, -n 500M instructions)
Only 6 of 24 BB-entry probed PCs fire (ours's branch probe fires only at BB-entry):
| PC | lr | r3 | cycle | meaning |
|---|---|---|---|---|
| `0x82173990` | `0x822F1B50` | `0x40ba9a80` | 6,172,194 | entry; lr= bctrl in sub_822F1AA8 |
| `0x821739CC` | — | `0x00000001` | 6,172,686 | non-zero arg path |
| `0x821739F0` | — | `0x40ba9a64` | 6,173,074 | str-init complete |
| `0x82173A68` | — | `0x41d7e6ff` | 9,174,034 | r3=str-table-entry (the AUDIT-049 wait inside sub_82452DC0 already happened HERE — note the cycle gap from 6.17M to 9.17M means tid=1 has been blocked in sub_82172370/etc; actually this is the `bl 0x821835E0` post-return) |
| `0x82173BC0` | `0x82173A6C` | `0x1C` | 9,175,368 | convergence (r25=0x41d7e6ff != 28 not equal to 28; checks ≠0 also nonzero, fall thru to skip block @ 0x82173B14) |
| `0x82173C38` | `0x82173C38` | `0x4024a640` | 9,178,243 | post `bl 0x821746B0` — r3=`0x4024a640` (task struct, ALSO start_ctx of tid=13) |
**No probe fires beyond `0x82173C38`**. The next BB-entry probe was `0x82173C64` (post-wait). Mid-block PCs `0x82173C48/C54/C60/C70` don't fire in ours's branch-probe (per AUDIT-046 reading-error #13). The fact that `0x82173C64` does NOT fire confirms: **ours's tid=1 wedges between `0x82173C38` and `0x82173C64`** — at the wait at `0x82173C60`.
### End-of-run thread state (ours --trace-handles dump)
```
handle=0x000012a4 Thread(id=13, exit=None) waiters(tid)=[1]
handle=0x000012ac kind=Event/Auto waiters=1 signals=0 waits=1 wakes=0 <NO_SIGNALS_DESPITE_WAITS>
[ 0] cycle=0 tid=13 lr=0x824ac578 src=do_wait_single
handle=0x000012b8 kind=Event/Auto waiters=1 signals=0 waits=1 wakes=0 <NO_SIGNALS_DESPITE_WAITS>
[ 0] cycle=0 tid=5 lr=0x824ac578 src=do_wait_single
```
tid=13 (handle 0x12A4, exit=None) is alive but stuck on event 0x12AC inside sub_821CB030 (cache file IO completion event). tid=5 is one of the workers parked on its own idle event 0x12B8. tid=1 join-waits tid=13 → tid=13 waits 0x12AC → 0x12AC needs workers → workers parked.
## Activation
Direct `bl` from `sub_82175330+0x4` via tail-jump (post-bctrl of vtable[0] dispatched at `sub_822F1AA8+0xA4`). One static caller `sub_82175330` per `sylpheed.db`.
Called exactly **1× per boot** on tid=1 in both engines.
## Static graph
- Direct callers (sylpheed.db `xrefs.source_func`):
- `sub_82175330+0x4` via `b 0x82173990` (tail-jump from the vtable thunk).
- Direct callees of interest:
- `bl 0x8216E7E8` at `+0x20` — string-table helper (used twice).
- `bl 0x82448AA0` at `+0x38`, `+0x48`, `+0x88`, `+0xC4`, `+0x168`, `+0x260` — string-table lookup.
- `bl 0x824AA7A0` at `+0x4C` — string-helper.
- `bl 0x82448BC8` at `+0x5C` — internal lookup.
- `bl 0x82448C50` at `+0x78`, `+0x98`, `+0x178`, `+0x308` — string convert.
- `bl 0x8216F218` at `+0xA8`, `+0x188` — copy / string ops.
- `bl 0x8217C850` at `+0xAC` — query.
- `bl 0x82178E50` at `+0xB8` — query.
- `bl 0x821835E0` at `+0xD8` — returns ID into `r25` (key gate).
- `bl 0x824AA830` at `+0xFC`, `+0x148` — kernel helper.
- `bl 0x822C69C8` at `+0x104`, `+0x134` — task-helper.
- `bl 0x822DE650` at `+0x118` — helper.
- `bl 0x822F2328` at `+0x124`, `+0x240` — helper (calls inside outer sub_822F1AA8 too).
- `bl 0x822DE858` at `+0x13C` — helper.
- `bl 0x822F28C0` at `+0x144`, `+0x25C` — helper.
- `bl 0x82674028` at `+0x15C` — kernel-debug printf? format.
- `bl 0x82150EF8` at `+0x1A4` — heap alloc 28-byte struct.
- `bl 0x824523E8` at `+0x1FC` — task-helper.
- `bl 0x82453910` at `+0x208`**signaler/notify (AUDIT-049 column)**.
- `bl 0x821506B8` at `+0x224` — heap free.
- `bl 0x8216E790` at `+0x22C`, `+0x2EC`, `+0x2F4` — string-cleanup.
- `bl 0x824B2188` at `+0x254` — tid=1's outer-channel pump.
- `bl 0x824482D0` at `+0x288` — format.
- **`bl 0x821746B0` at `+0x2A4`** — **task allocator + worker spawn (the key call).**
- `bl 0x824AA5C8` at `+0x2B8` — status query (returns `r3` → checked vs `0x103 STATUS_PENDING`).
- **`bl 0x824AA330` at `+0x2D0`** — **`KeWaitForSingleObject` INFINITE — THE WEDGE PC.**
- `bl 0x82174AF8` at `+0x2E0` — post-wait task-cleanup/state-transition.
- `b 0x825F0FC4` at `+0x2FC` — epilogue tail-jump.
## Audit log
- **AUDIT-066 (2026-05-12)** — **source-review only (READ-ONLY)**. AUDIT-065's "host-side `F8000048` IO completion thread" inference falsified by canary source review. `F8000048` is a **guest XThread thid=10**, spawned by main at canary-run.stdout:1331 via `ExCreateThread(...,824AFF88, 82450A28, 828F3B68, 0)` — entry `0x82450A28` is a Sylpheed-internal worker thread, not host infrastructure. Canary's only host helper thread is "Kernel Dispatch" (`xenia-canary/src/xenia/kernel/kernel_state.cc:524-549`) which services `CompleteOverlappedDeferred` for XAM UI/content, not file IO. Canary's `NtReadFile`/`NtReadFileScatter`/`NtWriteFile` (`xboxkrnl_io.cc:125-389`) are synchronous and signal the supplied event handle inline via `ev->Set(0, false)` after the sync read. Ours's `signal_io_completion_event` (`exports.rs:1156-1169`) is the bit-equivalent. **No "host-side IO completion signal" gap exists** in ours's IO handlers. The wait at this fn's `+0x2D0` (PC `0x82173C60`) is on the thread handle for the worker spawned via `bl 0x821746B0` (= tid=13 in ours / thid=17 in canary), released by `ExTerminateThread` per AUDIT-065 — confirmed correct framing. AUDIT-066 conclusion: brief's proposed fix locus (`xenia-kernel/src/exports.rs` IO handlers) is wrong; the bug is upstream worker-cluster bootstrap (AUDIT-057/063/064 chain). No code change of any size in `exports.rs` would unwedge tid=13. [confirmed: brief premise falsified]
- **AUDIT-065 (2026-05-12)** — full disasm + 17-PC probe in BOTH engines. **Canary's tid=1 (= F8000008, internally tid=6) reaches PC `0x82173C60` exactly once, waits on r3=`0xF8000094` (= XThread thid 17's thread handle), and the wait completes when that worker reaches `ExTerminateThread(0)`**. Worker runs synchronous cache file IO (`cache:\aab216c3\5\ee70e0a`, `cache:\87719002\c\...` etc) and spawns child workers via `ExCreateThread(... 824AFF88, 821C4AD0, BCA44C00 ...)` before terminating. **Ours's tid=1 reaches PC `0x82173C38` (post `bl 0x821746B0`, r3=`0x4024a640` = ours's task struct = ours's tid=13's start_ctx) and stalls before `0x82173C64`** — i.e. inside the wait at `0x82173C60`. Ours's tid=13 (created by `bl 0x821746B0`'s subroutine, entry `0x821748F0`) DOES open the same cache files (`cache:/aab216c3/5`, `cache:/aab216c3`) but BLOCKS inside `sub_821CB030+0x1AC` on event `0x12AC` (NO_SIGNALS_DESPITE_WAITS). So `ExTerminateThread` is never reached on ours's tid=13 → tid=1's wait on `0x12A4` never completes. **Refines the wedge from "thread-join on tid=13" to a precise mechanism**: the wait at `+0x2D0` is structurally a synchronous task-join (canary worker exits in ~1ms; ours's worker is permanently stuck downstream). **`sub_82173990` body itself is clean** — every probed PC except the wait completion matches canary's behavior; the divergence is entirely in what happens inside `sub_821746B0`'s spawned worker (tid=13's body in `sub_821748F0``sub_821C4EB0``sub_821CC3F8``sub_821CBA08``sub_821CB030`). Same AUDIT-049 island, now framed as: **how does canary's worker get its `0x12AC`-equivalent event signaled fast enough that the worker can call `ExTerminateThread`?** [confirmed]
## Open questions
- **What signals canary's tid=17 cache-IO completion event (the `0x12AC`-equivalent inside its `sub_821CB030`)?** Probe canary's `NtSetEvent`/`KeSetEvent` thunks (`0x8284DF5C`/`0x82490018`) filtered on tid=17's `r3`-handle in the lr-window around the corresponding tid=17 wait. Compare against ours's empty `0x12AC.signals` count.
- The cache file open flow looks identical in both engines for `cache:\aab216c3\…` paths — confirming AUDIT-054's VFS layout fix landed correctly. The divergence is purely in the producer-side signaling.
- Both engines pass through PC `0x82173B98` (the `sub_82453910` "signaler" candidate) on the `r25!=0 && r25!=28` path — but only if the AUDIT-046 "5/5 iter" path lands inside the block `[0x82173B14, 0x82173BC0]`. Ours's BB-entry probe shows `0x82173BC0` fires but NOT `0x82173B14/B40/B84/B98/BA0/BA8` — meaning ours's beq at `0x82173B14` is **taken** (r5==0) and we skip the entire block. Canary's BR list also shows direct jump from `0x82173A78``0x82173BC0` (line 1998→1999) — **so canary also skips the block at 0x82173B14**. Block is dead in both engines at this horizon. `sub_82453910` is NOT the relevant signaler at this call site.
- `r3` at `0x82173C48` shows `F8000094` in canary at probe time, but that's the **post-`bl 0x824AA5C8` PC** (mid-block; the captured value is whatever r3 carried at that instant — likely the handle being queried, not the return). Worth a follow-up probe to confirm the status-query target.
## Cross-references
- Caller: [sub_822F1AA8](sub_822F1AA8.md) (via thunk sub_82175330, vtable[0] of `*(0x828E1F08)`).
- Callees of interest:
- `sub_821746B0` — task allocator + worker spawn (no dossier yet — recommend creating one).
- `sub_82172370` — sync object creator (no dossier yet).
- `sub_82174AF8` — post-wait cleanup (no dossier yet).
- Worker-side wedge: [sub_821CB030](sub_821CB030.md) — fires inside the worker spawned via sub_821746B0 from sub_82173990's `bl` at `+0x2A4`.
- Audits: 049 (original tid=1 stall localization), 064 (full activation chain to sub_825070F0), 065 (this).
- Artifacts: `audit-runs/audit-065-sub82173990-wait-site/{sub_82173990.disasm,canary.log,canary-run.stdout,ours.log,ours-stdout.log,summary.md}`.