ITERATE-2.V: scheduler priority aging closes 18-day AUDIT-049 wedge

Priority aging in xenia-cpu/scheduler.rs:pick_runnable
(effective_priority = base + age_bonus(now_round - last_run_round),
capped at +31, AGING_ROUNDS_PER_BONUS=1). Strict-priority was parking
priority=0 threads behind CPU-bound priority=15 audio mixer
(sub_824D1328 guest spinwait at PC=0x824d1404 on CPU5). Aging
eventually picks the starved thread, breaking the producer-consumer
cycle that caused 5-tid wedge at PC=0x824ac578 since AUDIT-049 (10 May).

Cascade observed: tid=13 clean exit; events 121K -> 13M (107x); last
host_ns 767ms -> 51,011ms (66x); 8 new threads spawn; VdSwap 1 -> 2.

Complete two-day iterate sequence (2026-05-27 -> 2026-05-28):
- 2.F: VdSwap drain timeout 900ms -> 1ms (xenia-gpu/handle.rs); 876x
       perf win on VdSwap kernel callback
- 2.H: vA0000000 physical heap bucket added (state.rs, exports.rs);
       ctx_ptrs now in 0xA0000000-0xBFFFFFFF range matching canary
- 2.L: Phase-A diff harness categorized [return_value mismatch],
       [status mismatch], [args_resolved.path mismatch] tags
       (tools/diff-events/diff_events.py); closes reading-error #41
       (silent test-harness state leak invalidating trace diffs)
- 2.M: always-on exit-thread-state.json sibling to Phase-A JSONL
       (event_log.rs + xenia-app/main.rs); closes reading-error #42
       (Phase-A blind to blocked-forever waits)
- 2.Q: signal.match kernel instrumentation in NtSetEvent /
       NtReleaseSemaphore / KeSetEvent / KeReleaseSemaphore
       (exports.rs); emits target_handle + waiter_count + waiter_tids
- 2.T: wake.requested kernel instrumentation in wake_eligible_waiters
       (exports.rs); emits target_tid + transition + new_state
- 2.V: scheduler priority aging (xenia-cpu/scheduler.rs) [keystone]

Plus accumulated WIP from earlier May (contention_manifest,
phase_b_snapshot, xam/xaudio enhancements, analysis db, xex loader,
xenia-app main loop, etc.). Audit-runs/ artifacts remain untracked
per project convention.

Tests: 300 xenia-cpu / 227 xenia-kernel / 5 xenia-app / 19 xenia-path
/ 30+ smaller suites -- all PASS, 0 regressions. Determinism preserved
(2x cold runs bit-identical at 13,003,881 events post-2.V).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
MechaCat02
2026-05-29 07:27:26 +02:00
parent e6d43a23ac
commit ad45873a1b
50 changed files with 14389 additions and 506 deletions

View File

@@ -0,0 +1,167 @@
---
address: 0x82173990
classification: normal_callee
confidence: high
last_audit: 066
aliases:
- "tid=1 join wait site (the wedge PC 0x82173C60)"
- "synchronous task-spawn + join helper"
---
# sub_82173990 — synchronous task-spawn-and-join helper
## Synopsis
Tid=1 (main) one-shot helper that builds a stack-resident task descriptor, calls
`sub_821746B0` to allocate+initialize a 24-byte task record (which encapsulates
a sync object created by `sub_82172370`), and waits INFINITE on that sync
object. The wait at PC **`0x82173C60 bl 0x824AA330`** (= `sub_82173990+0x2D0`)
is the AUDIT-049/AUDIT-064 wedge site — both canary and ours enter the wait,
but only canary's wait completes. The wait is on the **thread handle** of the
worker spawned by `sub_821746B0` (XThread or KE_THREAD), released when that
worker calls `ExTerminateThread`. Function is called exactly **1× per boot** in
both engines (entry probe fires once, all body PCs fire once each).
## Evidence
Disasm-anatomy (size 768B / 192 insns @ `0x82173990..0x82173C8C`):
```
0x82173990 mflr/prologue (256-byte frame)
0x821739B0 bl 0x8216E7E8 ; first string-init helper (r4=r11+6244)
0x821739CC bl 0x82448AA0 ; cr0=.G. → arg!=0 path
0x821739F0 bl 0x82448BC8 ; returns string-table entry → r28
0x82173A38 bl 0x8216F218 ; internal copy
0x82173A68 bl 0x821835E0 ; → r25 (ID/result); cr6-tests below
0x82173A78 bne cr6, 0x82173A84 ; skip if r25==28
0x82173A88 beq cr6, 0x82173BC0 ; skip if r25==0
0x82173B98 bl 0x82453910 ; signaler candidate (AUDIT-049 column)
0x82173BC0 convergence label (string-table clean-up + dispatch)
0x82173BE4 bl 0x824B2188 ; tid=1's outer-channel pump
0x82173C34 bl 0x821746B0 ; allocates 24-byte task record, sub_82172370 fills [r29+4]
0x82173C38 mr r30, r3 ; r30 = task struct
0x82173C48 bl 0x824AA5C8 ; status query → r3 (canary r3=1 → [r31+80]=0x103 STATUS_PENDING)
0x82173C54 bne cr6, 0x82173C64 ; guard: only wait if r11==0x103 (STATUS_PENDING)
0x82173C5C lwz r3, 4(r30) ; r3 = task->sync_handle = [struct+4]
0x82173C60 bl 0x824AA330 ; KeWaitForSingleObject INFINITE ← THE WEDGE
0x82173C70 bl 0x82174AF8 ; post-wait task completion (sub_82174AF8 runs post-state transition)
0x82173C88 epilogue (b 0x825F0FC4)
```
### Canary run (AUDIT-065, 180s wallclock, --audit_61_branch_probe_pcs)
All 17 probed PCs fire **exactly 1× each on tid=F8000008 (= canary main / mapped from `tid=6`)**:
| PC | lr | r3 | r4 | tid | meaning |
|---|---|---|---|---|---|
| `0x82173990` | `0x822F1B50` | `BCCC4A80` | `701CF8C0` | 6 | entry; lr=post-bctrl of sub_822F1AA8 |
| `0x821739CC` | — | `0x00000001` | `0x820A17A8` | 6 | cr0=.G. — `cmplwi r28,0` post-strcmp != 0 path |
| `0x821739F0` | — | `BCCC4A64` | `BCCC49FC` | 6 | r28 populated; cr6=..E (==) |
| `0x82173A38` | — | `701CF860` | `701CF840` | 6 | inner copy call entry |
| `0x82173A68` | — | `BDE996FF` | `BDE98F14` | 6 | r25=0xBDE996FF (returned ID), cr6=..E |
| `0x82173A78` | — | `0x0000001C` | `BDE98F14` | 6 | `bne 0x82173A84``r25 != 28` taken |
| `0x82173BC0` | `0x82173A6C` | `0x1C` | … | 6 | convergence (post-bne over alloc); `beq 0x82173B14`-skipped |
| `0x82173BE4` | `0x82173BD4` | `BE568F00` | `0x00000005` | 6 | bl 0x824B2188 entry |
| `0x82173C34` | `0x82173C1C` | `BCCC4A80` | `0x00000000` | 6 | bl 0x821746B0 entry — calls task-alloc |
| `0x82173C38` | `0x82173C38` | `BC365700` | `701CF6E0` | 6 | r3=task_struct, cr6=.G. |
| `0x82173C48` | `0x82173C38` | `F8000094` | `701CF800` | 6 | post bl 0x824AA5C8; r3=F8000094 ?? actually this is r3 at BB entry post-bl |
| `0x82173C54` | `0x82173C4C` | `0x00000001` | `0x30000000` | 6 | cmplwi r11, 0x103 — value 0x103 sets cr6=..E (eq) per actual r11; BUT cr6=..E means !lt!gt eq — wait was entered |
| `0x82173C60` | `0x82173C4C` | **`F8000094`** | `FFFFFFFF` | 6 | **wait entry — r3 = thread handle** |
| `0x82173C64` | `0x82173C64` | `0` | `0x1` | 6 | post-wait — wait completed! |
| `0x82173C70` | `0x82173C64` | `BCCC4A80` | `BC365700` | 6 | bl 0x82174AF8 (cleanup) |
| `0x82173C88` | `0x82173C88` | `701CF840` | … | 6 | epilogue |
Wait duration: ~445 log lines between PC `0x82173C60` (entry) and `0x82173C64` (post-wait).
Inside the wait window: `K> F8000094 XThread::Execute thid 17 (handle=F8000094, 'XThread01F4 (F8000094)', native=000001F4)` — i.e. F8000094 IS a thread handle. The thread loads cache files (`cache:\aab216c3\5\ee70e0a`, `cache:\87719002\c\dba806e`, `cache:\87719002\c\ec0a96e`, `cache:\87719002\a\60fcb85`, etc) and spawns child workers (`ExCreateThread(..., 824AFF88, 821C4AD0, BCA44C00, ...)` and others). The wait completes immediately after `d> F8000094 ExTerminateThread(00000000)`.
### Ours run (AUDIT-065, -n 500M instructions)
Only 6 of 24 BB-entry probed PCs fire (ours's branch probe fires only at BB-entry):
| PC | lr | r3 | cycle | meaning |
|---|---|---|---|---|
| `0x82173990` | `0x822F1B50` | `0x40ba9a80` | 6,172,194 | entry; lr= bctrl in sub_822F1AA8 |
| `0x821739CC` | — | `0x00000001` | 6,172,686 | non-zero arg path |
| `0x821739F0` | — | `0x40ba9a64` | 6,173,074 | str-init complete |
| `0x82173A68` | — | `0x41d7e6ff` | 9,174,034 | r3=str-table-entry (the AUDIT-049 wait inside sub_82452DC0 already happened HERE — note the cycle gap from 6.17M to 9.17M means tid=1 has been blocked in sub_82172370/etc; actually this is the `bl 0x821835E0` post-return) |
| `0x82173BC0` | `0x82173A6C` | `0x1C` | 9,175,368 | convergence (r25=0x41d7e6ff != 28 not equal to 28; checks ≠0 also nonzero, fall thru to skip block @ 0x82173B14) |
| `0x82173C38` | `0x82173C38` | `0x4024a640` | 9,178,243 | post `bl 0x821746B0` — r3=`0x4024a640` (task struct, ALSO start_ctx of tid=13) |
**No probe fires beyond `0x82173C38`**. The next BB-entry probe was `0x82173C64` (post-wait). Mid-block PCs `0x82173C48/C54/C60/C70` don't fire in ours's branch-probe (per AUDIT-046 reading-error #13). The fact that `0x82173C64` does NOT fire confirms: **ours's tid=1 wedges between `0x82173C38` and `0x82173C64`** — at the wait at `0x82173C60`.
### End-of-run thread state (ours --trace-handles dump)
```
handle=0x000012a4 Thread(id=13, exit=None) waiters(tid)=[1]
handle=0x000012ac kind=Event/Auto waiters=1 signals=0 waits=1 wakes=0 <NO_SIGNALS_DESPITE_WAITS>
[ 0] cycle=0 tid=13 lr=0x824ac578 src=do_wait_single
handle=0x000012b8 kind=Event/Auto waiters=1 signals=0 waits=1 wakes=0 <NO_SIGNALS_DESPITE_WAITS>
[ 0] cycle=0 tid=5 lr=0x824ac578 src=do_wait_single
```
tid=13 (handle 0x12A4, exit=None) is alive but stuck on event 0x12AC inside sub_821CB030 (cache file IO completion event). tid=5 is one of the workers parked on its own idle event 0x12B8. tid=1 join-waits tid=13 → tid=13 waits 0x12AC → 0x12AC needs workers → workers parked.
## Activation
Direct `bl` from `sub_82175330+0x4` via tail-jump (post-bctrl of vtable[0] dispatched at `sub_822F1AA8+0xA4`). One static caller `sub_82175330` per `sylpheed.db`.
Called exactly **1× per boot** on tid=1 in both engines.
## Static graph
- Direct callers (sylpheed.db `xrefs.source_func`):
- `sub_82175330+0x4` via `b 0x82173990` (tail-jump from the vtable thunk).
- Direct callees of interest:
- `bl 0x8216E7E8` at `+0x20` — string-table helper (used twice).
- `bl 0x82448AA0` at `+0x38`, `+0x48`, `+0x88`, `+0xC4`, `+0x168`, `+0x260` — string-table lookup.
- `bl 0x824AA7A0` at `+0x4C` — string-helper.
- `bl 0x82448BC8` at `+0x5C` — internal lookup.
- `bl 0x82448C50` at `+0x78`, `+0x98`, `+0x178`, `+0x308` — string convert.
- `bl 0x8216F218` at `+0xA8`, `+0x188` — copy / string ops.
- `bl 0x8217C850` at `+0xAC` — query.
- `bl 0x82178E50` at `+0xB8` — query.
- `bl 0x821835E0` at `+0xD8` — returns ID into `r25` (key gate).
- `bl 0x824AA830` at `+0xFC`, `+0x148` — kernel helper.
- `bl 0x822C69C8` at `+0x104`, `+0x134` — task-helper.
- `bl 0x822DE650` at `+0x118` — helper.
- `bl 0x822F2328` at `+0x124`, `+0x240` — helper (calls inside outer sub_822F1AA8 too).
- `bl 0x822DE858` at `+0x13C` — helper.
- `bl 0x822F28C0` at `+0x144`, `+0x25C` — helper.
- `bl 0x82674028` at `+0x15C` — kernel-debug printf? format.
- `bl 0x82150EF8` at `+0x1A4` — heap alloc 28-byte struct.
- `bl 0x824523E8` at `+0x1FC` — task-helper.
- `bl 0x82453910` at `+0x208`**signaler/notify (AUDIT-049 column)**.
- `bl 0x821506B8` at `+0x224` — heap free.
- `bl 0x8216E790` at `+0x22C`, `+0x2EC`, `+0x2F4` — string-cleanup.
- `bl 0x824B2188` at `+0x254` — tid=1's outer-channel pump.
- `bl 0x824482D0` at `+0x288` — format.
- **`bl 0x821746B0` at `+0x2A4`** — **task allocator + worker spawn (the key call).**
- `bl 0x824AA5C8` at `+0x2B8` — status query (returns `r3` → checked vs `0x103 STATUS_PENDING`).
- **`bl 0x824AA330` at `+0x2D0`** — **`KeWaitForSingleObject` INFINITE — THE WEDGE PC.**
- `bl 0x82174AF8` at `+0x2E0` — post-wait task-cleanup/state-transition.
- `b 0x825F0FC4` at `+0x2FC` — epilogue tail-jump.
## Audit log
- **AUDIT-066 (2026-05-12)** — **source-review only (READ-ONLY)**. AUDIT-065's "host-side `F8000048` IO completion thread" inference falsified by canary source review. `F8000048` is a **guest XThread thid=10**, spawned by main at canary-run.stdout:1331 via `ExCreateThread(...,824AFF88, 82450A28, 828F3B68, 0)` — entry `0x82450A28` is a Sylpheed-internal worker thread, not host infrastructure. Canary's only host helper thread is "Kernel Dispatch" (`xenia-canary/src/xenia/kernel/kernel_state.cc:524-549`) which services `CompleteOverlappedDeferred` for XAM UI/content, not file IO. Canary's `NtReadFile`/`NtReadFileScatter`/`NtWriteFile` (`xboxkrnl_io.cc:125-389`) are synchronous and signal the supplied event handle inline via `ev->Set(0, false)` after the sync read. Ours's `signal_io_completion_event` (`exports.rs:1156-1169`) is the bit-equivalent. **No "host-side IO completion signal" gap exists** in ours's IO handlers. The wait at this fn's `+0x2D0` (PC `0x82173C60`) is on the thread handle for the worker spawned via `bl 0x821746B0` (= tid=13 in ours / thid=17 in canary), released by `ExTerminateThread` per AUDIT-065 — confirmed correct framing. AUDIT-066 conclusion: brief's proposed fix locus (`xenia-kernel/src/exports.rs` IO handlers) is wrong; the bug is upstream worker-cluster bootstrap (AUDIT-057/063/064 chain). No code change of any size in `exports.rs` would unwedge tid=13. [confirmed: brief premise falsified]
- **AUDIT-065 (2026-05-12)** — full disasm + 17-PC probe in BOTH engines. **Canary's tid=1 (= F8000008, internally tid=6) reaches PC `0x82173C60` exactly once, waits on r3=`0xF8000094` (= XThread thid 17's thread handle), and the wait completes when that worker reaches `ExTerminateThread(0)`**. Worker runs synchronous cache file IO (`cache:\aab216c3\5\ee70e0a`, `cache:\87719002\c\...` etc) and spawns child workers via `ExCreateThread(... 824AFF88, 821C4AD0, BCA44C00 ...)` before terminating. **Ours's tid=1 reaches PC `0x82173C38` (post `bl 0x821746B0`, r3=`0x4024a640` = ours's task struct = ours's tid=13's start_ctx) and stalls before `0x82173C64`** — i.e. inside the wait at `0x82173C60`. Ours's tid=13 (created by `bl 0x821746B0`'s subroutine, entry `0x821748F0`) DOES open the same cache files (`cache:/aab216c3/5`, `cache:/aab216c3`) but BLOCKS inside `sub_821CB030+0x1AC` on event `0x12AC` (NO_SIGNALS_DESPITE_WAITS). So `ExTerminateThread` is never reached on ours's tid=13 → tid=1's wait on `0x12A4` never completes. **Refines the wedge from "thread-join on tid=13" to a precise mechanism**: the wait at `+0x2D0` is structurally a synchronous task-join (canary worker exits in ~1ms; ours's worker is permanently stuck downstream). **`sub_82173990` body itself is clean** — every probed PC except the wait completion matches canary's behavior; the divergence is entirely in what happens inside `sub_821746B0`'s spawned worker (tid=13's body in `sub_821748F0``sub_821C4EB0``sub_821CC3F8``sub_821CBA08``sub_821CB030`). Same AUDIT-049 island, now framed as: **how does canary's worker get its `0x12AC`-equivalent event signaled fast enough that the worker can call `ExTerminateThread`?** [confirmed]
## Open questions
- **What signals canary's tid=17 cache-IO completion event (the `0x12AC`-equivalent inside its `sub_821CB030`)?** Probe canary's `NtSetEvent`/`KeSetEvent` thunks (`0x8284DF5C`/`0x82490018`) filtered on tid=17's `r3`-handle in the lr-window around the corresponding tid=17 wait. Compare against ours's empty `0x12AC.signals` count.
- The cache file open flow looks identical in both engines for `cache:\aab216c3\…` paths — confirming AUDIT-054's VFS layout fix landed correctly. The divergence is purely in the producer-side signaling.
- Both engines pass through PC `0x82173B98` (the `sub_82453910` "signaler" candidate) on the `r25!=0 && r25!=28` path — but only if the AUDIT-046 "5/5 iter" path lands inside the block `[0x82173B14, 0x82173BC0]`. Ours's BB-entry probe shows `0x82173BC0` fires but NOT `0x82173B14/B40/B84/B98/BA0/BA8` — meaning ours's beq at `0x82173B14` is **taken** (r5==0) and we skip the entire block. Canary's BR list also shows direct jump from `0x82173A78``0x82173BC0` (line 1998→1999) — **so canary also skips the block at 0x82173B14**. Block is dead in both engines at this horizon. `sub_82453910` is NOT the relevant signaler at this call site.
- `r3` at `0x82173C48` shows `F8000094` in canary at probe time, but that's the **post-`bl 0x824AA5C8` PC** (mid-block; the captured value is whatever r3 carried at that instant — likely the handle being queried, not the return). Worth a follow-up probe to confirm the status-query target.
## Cross-references
- Caller: [sub_822F1AA8](sub_822F1AA8.md) (via thunk sub_82175330, vtable[0] of `*(0x828E1F08)`).
- Callees of interest:
- `sub_821746B0` — task allocator + worker spawn (no dossier yet — recommend creating one).
- `sub_82172370` — sync object creator (no dossier yet).
- `sub_82174AF8` — post-wait cleanup (no dossier yet).
- Worker-side wedge: [sub_821CB030](sub_821CB030.md) — fires inside the worker spawned via sub_821746B0 from sub_82173990's `bl` at `+0x2A4`.
- Audits: 049 (original tid=1 stall localization), 064 (full activation chain to sub_825070F0), 065 (this).
- Artifacts: `audit-runs/audit-065-sub82173990-wait-site/{sub_82173990.disasm,canary.log,canary-run.stdout,ours.log,ours-stdout.log,summary.md}`.