xenia-rs/audit-runs/audit-059-handle-disambiguation/round-A8-ours-822F1AA8-trace/FINDINGS.md
MechaCat02 52c30d82a7 [AUDIT-059 R-A] Phase A backward-trace: divergence is sub_822F1AA8 loop exit, not factory/registry
Round-37 anchor reframe: both engines install the SAME static .rdata vtable
0x820A183C at [0x828E1F08]. Instance VAs differ only because of ε-class
allocator divergence (audit-043). vtable bytes byte-identical; the user
prompt's "factory/registry" framing was falsified.

Phase A walkthrough (rounds A1..A8):
- A.1 canary --audit_jit_prolog_pc=0x821741C8: tid=6, r3=0xBCCC4A80 (= inner
  sub-object of [0x828E1F08]'s singleton), LR=0x822F1D5C (return-from-bctrl
  inside sub_822F1AA8)
- A.2 found tid=6 spawn site sub_821746B0 at PC 0x82174824 spawning
  entry=sub_821748F0 ctx=BC365700/BC366DA0. sub_822F1AA8 ALSO spawns a
  second thread (entry=sub_822F1EE0 ctx=BCE24A40) at PC 0x822F1B08
- A.3 sub_822F1AA8 has 2 call sites, both in sub_8216EA68 (whose sole caller
  is sub_824AB748 = entry_point)
- A.4 ours mirror probe: sub_821746B0 enters, [0x828E2B14] gate passes,
  ExCreateThread fires returning handle 0x1070 (= tid=13). Ours' tid=13
  IS the same logical thread as canary's spawned silph initializer
- A.5 canary --audit_jit_prolog_pc=0x821749C0: fires only 2× on short-lived
  tid=17, tid=26 (the spawned initializers — NOT tid=6)
- A.6 canary --audit_jit_prolog_pc=0x822F1AA8: fires 1× on tid=6 with
  r3=0xBCE24A40 LR=0x8216EE14 (the second sub_822F1AA8 call site)
- A.7 canary --audit_jit_prolog_pc=0x824AB748 (entry_point): fires on
  tid=00000006. CONFIRMS canary's tid=6 = canary's main thread.

Verdict: identical call chain entry_point → sub_8216EA68 → sub_822F1AA8 in
both engines; same controller (ε-divergent VA, byte-identical fields).
Canary's main thread stays in sub_822F1AA8's dispatcher loop firing
sub_821741C8 ~1678×/30s. Ours' main thread exits the loop and thread-joins
on the spawned initializer (tid=13), which is itself wedged on handle 0x1078
forever.

Loop exit is gated by bit 28 of [r30+0] (the controller's flag word). Same
value 0x21 at function entry in both engines. Some code between entry and
loop check sets bit 28 in ours but not in canary. Mem-watch on 0x40d09a40
shows zero guest stores in ours' 50M parallel run — setter is either a
kernel-side store, computed alias, or probe-quantum-elided JIT store.

Phase B classification: Class 3a (state-divergence on controller object).
The vtable is the same; the controller's bit 28 evolves differently during
sub_822F1AA8 setup. Class 4 (synthesis) is now less attractive since we
correctly reach the dispatcher with the right inputs — we just exit too
soon.

Phase C will need either JIT instrumentation to identify the bit-28 setter,
or a kernel-side hook to clear bit 28 on entry to the loop check site.

Findings notes:
- round-A4b-ours-spawn-gate/FINDINGS.md (spawn topology + tid mapping)
- round-A8-ours-822F1AA8-trace/FINDINGS.md (full loop structure + bit-28 gate)

New reading-error class #18: probe-output anchor misframing (singleton[VA]=X
vtable=Y was misread as "Y is canary-only vtable" when Y is the same
.rdata vtable in both engines).

Branch: iterate-2C/silph-ui-spawn-trace off master @ 229b46c.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-11 17:02:20 +02:00

# Phase A synthesis — canary tid=6 IS the main thread; the wedge is sub_822F1AA8's loop exit
## Top-line finding
**Canary's `tid=6` is canary's main thread.** Confirmed by probing `entry_point`
(`sub_824AB748`) with `--audit_jit_prolog_pc=0x824AB748`: fires 1× on
`tid=00000006` with `lr=BCBCBCBC` (= OS-initial / no caller). Ours numbers
its main thread `tid=1`. Same logical thread; different label.

Therefore "tid=6 fires sub_821741C8 471×" (round 33) means **the main thread**
loops inside `sub_822F1AA8` firing `sub_821741C8` ~1678×/30s in canary. In
ours, the main thread (tid=1) runs `sub_822F1AA8` ONCE, exits the loop, and
proceeds to thread-join on the spawned init thread (handle 0x1070 = tid=13),
which is itself blocked forever on handle 0x1078.
## Call chain (identical in both engines, different runtime behavior)
```
entry_point (sub_824AB748)
├─ sub_824ACB38      CRT-driven fnptr-array iterator (audit-050 region)
├─ ...
└─ sub_8216EA68      Many local calls including:
   ├─ ExCreateThread(entry=sub_8217F0F8 ...)         ; sibling thread
   ├─ sub_822F1AA8(controller=...)                   ; FIRST call (PC 0x8216ECCC)
   └─ sub_822F1AA8(controller=0xBCE24A40 canary /    ; SECOND call (PC 0x8216EE10)
                              0x40d09a40 ours)       ;   ↑ this is the loop
```
The SECOND call is what runs the dispatcher loop. Its LR = 0x8216EE14.
Confirmed in both engines.
## sub_822F1AA8 loop structure
```
0x822F1AA8: entry, r30 = r3 (controller)
0x822F1AEC-0x822F1B08: ExCreateThread(entry=sub_822F1EE0, ctx=r30) → r29 = handle
0x822F1B30-0x822F1B34: bl 0x824AA8B0(r3=r29) ; ?
0x822F1B38-0x822F1B4C: first bctrl → vtable[+0] of [0x828E1F08]
0x822F1B50-0x822F1B74: setup, bl 0x824AA330 INFINITE wait on [r22+32]
0x822F1B80-0x822F1BA8: post-wait setup; [r30+0] |= 0x2
0x822F1BB0-0x822F1BBC: TOP-OF-LOOP CHECK: if [r30+0] & 0x10000000 → goto 0x822F1E10 (exit)
0x822F1BCC..0x822F1DEC: loop body (includes the vtable[+8] bctrl → sub_821741C8 at PC 0x822F1D58)
0x822F1DEC-0x822F1DFC: bl 0x824AA330 INFINITE wait on [r23+0]
0x822F1E00-0x822F1E0C: END-OF-ITERATION CHECK: if [r30+0] & 0x10000000 == 0 → goto 0x822F1BCC (re-loop)
0x822F1E10-0x822F1E18: EXIT: [r30+0] |= 0x02000000 (set MSB-6 = LSB-25)
0x822F1E1C-0x822F1E24: release something via bl 0x824AA2F0
0x822F1E28-0x822F1E30: bl 0x824AA330 INFINITE on [r30+28] = SPAWNED THREAD HANDLE (thread join!)
0x822F1E40: bl 0x824AA3E0
0x822F1E44-0x822F1E5C: final cleanup: vtable[+24] bctrl on [0x828E1F08]
0x822F1E60-0x822F1E78: [r30+0] = 0, then [r30+0] |= 1; bl 0x824567E0
0x822F1E7C-0x822F1E88: epilogue
```
**Loop exit gate**: `[r30+0] & 0x10000000` (bit 28 in LSB-0 numbering, bit 3
in PowerPC MSB-0 numbering). Set → exit. Both the top-of-loop check
(0x822F1BBC) and the end-of-iteration check (0x822F1E0C) gate on the same bit.
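The control flow in the listing above can be modelled in a few lines. The following Rust sketch is not taken from xenia-rs: the constant names, `run_dispatcher`, and its two closure hooks are hypothetical stand-ins (the first for whatever runs between function entry and the top-of-loop check, the second for one loop iteration). It only illustrates how a single bit decides between zero iterations and a long-running loop.

```rust
/// Loop-exit gate: bit 28 in LSB-0 numbering (bit 3 in PowerPC MSB-0 numbering).
const EXIT_REQUEST: u32 = 0x1000_0000;
/// Flag ORed in after the first wait (the `[r30+0] |= 0x2` step).
const SETUP_DONE: u32 = 0x2;
/// Flag ORed in on the exit path (the `[r30+0] |= 0x02000000` step).
const EXITING: u32 = 0x0200_0000;

/// Convert an LSB-0 bit index of a 32-bit word to PowerPC MSB-0 numbering.
fn msb_index(lsb_index: u32) -> u32 {
    31 - lsb_index
}

/// Model of the dispatcher loop. `pre_loop` and `body` are hypothetical hooks:
/// `pre_loop` may modify the flag word before the first check, `body` receives
/// the iteration number and the flag word and returns the new flag word.
/// Returns (iterations executed, final flag word).
fn run_dispatcher(
    entry_flags: u32,
    pre_loop: impl FnOnce(u32) -> u32,
    mut body: impl FnMut(u32, u32) -> u32,
) -> (u32, u32) {
    let mut flags = pre_loop(entry_flags) | SETUP_DONE;
    let mut iterations = 0u32;
    // Top-of-loop check: skip the loop entirely if the gate is already set.
    if flags & EXIT_REQUEST == 0 {
        loop {
            iterations += 1;
            flags = body(iterations, flags);
            // End-of-iteration check: re-loop only while the gate is clear.
            if flags & EXIT_REQUEST != 0 {
                break;
            }
        }
    }
    flags |= EXITING;
    (iterations, flags)
}
```

With an entry value of `0x21`, a `pre_loop` hook that leaves bit 28 clear and a body that sets it on its third pass yields 3 iterations. A `pre_loop` hook that sets the bit yields 0 iterations, which is the shape of the behavior observed in ours.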
## What's different between engines
| Engine | [r30+0] at entry | Loop iterations | Exits sub_822F1AA8? |
|--------|------------------|------------------|----------------------|
| canary | 0x21 (per probe) | ~1678+ in 30s | NO (stays in loop) |
| ours | 0x21 (per probe) | 0 (probes show none of the loop-body PCs fire after entry) | YES (exits quickly) |

Both engines have `[r30+0]=0x21` at entry, so bit 28 is NOT set. After the `ori
r11, r11, 0x2` at 0x822F1B90, both should have `[r30+0]=0x23`. Bit 28 is still
not set.

So **some code sets bit 28 on [r30+0] between sub_822F1AA8 entry and the
loop check** in ours but not in canary.
Mem-watch on 0x40d09a40 (ours' controller VA) shows **zero guest writes** in
my 50M-instruction parallel run. Possible reasons:
- The setter writes from kernel/runtime code that mem-watch doesn't capture
(kernel-host store, not guest JIT store)
- The setter writes via a computed alias (different VA but same backing)
- The bit IS set via a probe-quantum-elided JIT store
## Phase B classification
**Class 3a — state-divergence on the controller object**. The vtable
identity is the same (round-37 confirmed `0x820A183C` in both). The
controller object's bit 28 of `[+0]` evolves differently during the setup
between sub_822F1AA8 entry and the loop check.

Class 4 (synthesis) is now LESS attractive: ours' main thread DOES reach
sub_822F1AA8 with the right controller. We don't need to spawn the
dispatcher — we need to PREVENT the main thread from exiting the loop.
## Pragmatic next step — JIT instrumentation to find bit-28 setter
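Most direct diagnostic: add a JIT hook in xenia-cpu that, for guest stores
executed at PCs in the range [0x822F1AA8, 0x822F1E10), captures the guest PC
and the written value whenever the store would set bit 28 at any address.
This identifies the exact PC that sets the loop-exit bit, provided the setter
is one of sub_822F1AA8's own instructions; stores issued by its callees fall
outside this range.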
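One detail such a hook would need to handle is that the bit can be set by a store narrower than a word: the guest is big-endian, so bit 28 of the word at `[r30+0]` lives in the byte at offset 0. The sketch below shows one way the "would this store set bit 28" predicate could look. `StoreEvent` and both functions are hypothetical and do not correspond to any existing xenia-cpu type; the sketch also watches a single known word rather than every address.

```rust
/// PC range of sub_822F1AA8 up to its exit label, as given in the text.
const PC_START: u32 = 0x822F_1AA8;
const PC_END: u32 = 0x822F_1E10; // exclusive
const EXIT_REQUEST: u32 = 0x1000_0000;

/// Hypothetical description of one guest store seen by the hook.
struct StoreEvent {
    /// Guest PC of the store instruction.
    pc: u32,
    /// Byte offset of the store inside the watched 32-bit word (0..=3).
    offset: usize,
    /// Store width in bytes: 1 (stb), 2 (sth) or 4 (stw).
    width: usize,
    /// Value being stored, right-aligned.
    value: u32,
}

/// Merge a store into the old word, honouring big-endian byte order.
/// Returns None for unsupported widths or stores that straddle the word.
fn apply_store_be(old_word: u32, ev: &StoreEvent) -> Option<u32> {
    if !matches!(ev.width, 1 | 2 | 4) || ev.offset > 3 || ev.offset + ev.width > 4 {
        return None;
    }
    let mut bytes = old_word.to_be_bytes();
    let src = ev.value.to_be_bytes();
    // The low `width` bytes of the value land at `offset` in the word.
    bytes[ev.offset..ev.offset + ev.width].copy_from_slice(&src[4 - ev.width..]);
    Some(u32::from_be_bytes(bytes))
}

/// True when the store comes from the watched PC range and flips bit 28 from 0 to 1.
fn sets_exit_bit(old_word: u32, ev: &StoreEvent) -> bool {
    if !(PC_START..PC_END).contains(&ev.pc) {
        return false;
    }
    match apply_store_be(old_word, ev) {
        Some(new_word) => old_word & EXIT_REQUEST == 0 && new_word & EXIT_REQUEST != 0,
        None => false,
    }
}
```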
Alternative: extend `--mem-watch` to also capture kernel-side stores by
hooking the GuestMemory write path at the kernel-state level.

Even simpler: add a one-shot `--bit-watch=ADDR:MASK` cvar that fires when
the value at ADDR has any bit in MASK transition from 0→1, regardless of
who wrote it. This is the cleanest diagnostic for this exact pattern.
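A minimal sketch of what the proposed cvar could do is shown below. The `--bit-watch` flag does not exist yet, so everything here is an assumption: the `ADDR:MASK` syntax is parsed as two hexadecimal numbers, and `BitWatch`, `OneShot`, and the function names are hypothetical. How the current value gets sampled (polling, write-path hook) is left open.

```rust
/// Parsed form of a hypothetical `--bit-watch=ADDR:MASK` argument.
#[derive(Debug, PartialEq)]
struct BitWatch {
    addr: u32,
    mask: u32,
}

/// Parse a hexadecimal number with an optional `0x` / `0X` prefix.
fn parse_hex(s: &str) -> Option<u32> {
    let digits = s
        .strip_prefix("0x")
        .or_else(|| s.strip_prefix("0X"))
        .unwrap_or(s);
    u32::from_str_radix(digits, 16).ok()
}

/// Parse `ADDR:MASK`; returns None if the separator or either number is invalid.
fn parse_bit_watch(arg: &str) -> Option<BitWatch> {
    let (addr, mask) = arg.split_once(':')?;
    Some(BitWatch {
        addr: parse_hex(addr)?,
        mask: parse_hex(mask)?,
    })
}

/// Bits inside the mask that were 0 in `old` and are 1 in `new`.
fn rising_bits(watch: &BitWatch, old: u32, new: u32) -> u32 {
    !old & new & watch.mask
}

/// One-shot watcher: remembers the last observed value and fires at most once.
struct OneShot {
    watch: BitWatch,
    last: u32,
    fired: bool,
}

impl OneShot {
    /// Feed the current value; returns the risen bits the first time any watched bit goes 0→1.
    fn observe(&mut self, current: u32) -> Option<u32> {
        let rose = rising_bits(&self.watch, self.last, current);
        self.last = current;
        if rose != 0 && !self.fired {
            self.fired = true;
            Some(rose)
        } else {
            None
        }
    }
}
```

For the case in this note the argument would be `0x40d09a40:0x10000000`: the transition `0x21 → 0x23` does not fire, while `0x23 → 0x10000023` does.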
## Fix shape (when bit-28 setter is identified)
If the bit-28 setter is inside the vtable[+0] dispatch chain at 0x822F1B4C
(target sub_82173990), then the root cause might be a state-init issue in the
kernel/runtime.

If the bit-28 setter is inside the inner wait or one of the kernel calls
(`bl 0x824AA8B0`, `bl 0x824AA330`), the root cause might be a missing event
signal or a wrong handle-state evolution.

If we can't identify the setter cleanly, the synthesis fallback is to
**inject a kernel-side hook that clears bit 28 of [r30+0] on every entry to
sub_822F1AA8's bit-check site (0x822F1BB0)**. Crude but should keep the
main thread in the loop.
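The fallback hook amounts to a read-modify-write on a big-endian word. The sketch below models guest memory as a plain byte slice; `clear_exit_bit_at_check` and its parameters are hypothetical and not an existing xenia-rs API. It is deliberately non-atomic, so a real hook would need to consider races with other guest threads that touch the flag word.

```rust
/// PC of the top-of-loop bit check, as given in the text.
const CHECK_SITE_PC: u32 = 0x822F_1BB0;
const EXIT_REQUEST: u32 = 0x1000_0000;

/// If `pc` is the check site and bit 28 of the big-endian word at `offset`
/// is set, clear it in place. Returns true only when a bit was cleared.
fn clear_exit_bit_at_check(pc: u32, mem: &mut [u8], offset: usize) -> bool {
    if pc != CHECK_SITE_PC {
        return false;
    }
    let end = match offset.checked_add(4) {
        Some(end) => end,
        None => return false,
    };
    // Out-of-range offsets are ignored rather than panicking.
    let slot = match mem.get_mut(offset..end) {
        Some(slot) => slot,
        None => return false,
    };
    let word = u32::from_be_bytes([slot[0], slot[1], slot[2], slot[3]]);
    if word & EXIT_REQUEST == 0 {
        return false;
    }
    slot.copy_from_slice(&(word & !EXIT_REQUEST).to_be_bytes());
    true
}
```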
## Why this is a clearer wedge picture than rounds 22-33
Rounds 22-33 chased the audit-049 wedge from various angles. The diagnoses
landed on different layers:
- R22: "wrong cluster targeted" (cluster A vs B)
- R26-30: "state-machine progression bug"
- R32-33: "pool 3 starvation; bootstrap walk-back"

This round establishes the simplest possible framing:
> **Canary's main thread loops forever in a dispatcher; ours' main thread
> exits the loop after one setup phase. The exit is gated by a single bit
> on the controller's flag word.**

If bit 28 of `[controller+0]` were kept clear, the expectation is that ours'
main thread would stay in the loop, `sub_821741C8` would dispatch, signals
would flow, tid=13 would complete, and draws would happen.