Commit Graph

6 Commits

Author SHA1 Message Date
MechaCat02
38d8871e8d M6: addr_mode column on xrefs + extended store/load classes
Adds finer-grained addressing-mode classification to every data xref row
plus new dispatch for instruction families not previously emitted:
- New `xrefs.addr_mode VARCHAR NULL` column. NULL for control-flow edges
  (call / ind_call / j / br); one of d_form / lis_addi / lis_ori /
  multiword / x_form_indexed / x_form_byterev / atomic / dcbz for data
  edges. Index idx_xrefs_addr_mode.
- New `xenia_analysis::xref::AddrMode` enum + Xref::addr_mode field.
- Opcode 46/47 (lmw/stmw) expand to one xref per slot — D-form multi-word
  load/store now resolves all (32-rS) consecutive addresses.
- Opcode 31 X-form dispatch — stwx/stbx/sthx/stwux/stbux/sthux/stdx/stdux,
  lwzx/lbzx/lhzx/lhax/lwzux/lbzux/lhzux/lhaux/ldx/ldux,
  stwcx./stdcx. (atomic),
  stwbrx/sthbrx/lwbrx/lhbrx (byte-reverse),
  dcbz (cache-line clear).
- X-form rows are emitted ONLY when both rA and rB resolve to known
  constants (rare but present); the dominant runtime-indexed pattern
  remains correctly skipped.

Sylpheed yield (regen on master + merge):
- 442 newly-detected x_form_indexed reads (lwzx/lhzx into static tables).
- 40 newly-detected atomic writes (stwcx./stdcx. with resolvable address).
- 28,834 lis_addi refs, 18,485 d_form reads, 3,288 d_form writes — every
  pre-existing data row now tagged.
- 0 multiword / dcbz / byterev (these instructions exist but aren't on
  lis+addi-tracked code paths).

Tests 633→636 (+3 xref unit tests covering AddrMode tag uniqueness,
data-edge addr_mode round-trip, control-edge None invariant). Schema
golden updated (xrefs gains addr_mode column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:38:47 +02:00
MechaCat02
ab4fe211e5 M5+M7: indirect-dispatch reachability + .rdata string detection
Two MEDIUM milestones bundled (both opportunistic per plan; both small).

## M5 — indirect-dispatch reachability

- `xenia_analysis::indirect`: per-basic-block register tracker over each
  detected function. Recognises the canonical static-vtable pattern
  `lis+addi → lwz off(rA) → mtctr → bcctrl` where rA holds a known M3
  vtable address. Emits one `Xref { kind: IndirectCall }` per resolvable
  bcctrl site.
- PowerPC ABI awareness: `bl`-style calls clobber volatile r0..r12 + ctr
  but preserve non-volatile r13..r31, so a vtable pointer parked in r30/r31
  before a call survives.
- Label-based basic-block boundaries kill register state — bounds
  false-positive risk for jump-IN paths.
- New `XrefKind::IndirectCall` variant (DB tag `'ind_call'`).
- New SQL view `v_indirect_reachability_from_entry` — strict superset of
  `v_reachability_from_entry`, taking `ind_call` edges in the BFS.

Sylpheed yield: 0 edges detected. The binary's 1,001 static lis+addi
references into vtables are nearly all constructor-side vptr writes, not
dispatches; real method dispatch goes through `this->vptr` which requires
alias analysis we explicitly don't do. Documented in SCHEMA.md as the
expected limitation. Three unit tests cover the synthetic-correctness path.

## M7 — string / constant-pool detection

- `xenia_analysis::strings`: scans `.rdata` for runs of ≥ 6 printable
  ASCII bytes (NUL-terminated) and ≥ 6 UTF-16LE code units (basic-plane
  printable ASCII, NUL u16 terminator).
- New `strings(address PK, encoding, length, content)` table + encoding index.
- Implicit cross-ref via existing `xrefs.kind='ref'` rows whose target
  matches a strings.address.

Sylpheed yield: 6,311 ASCII strings (including embedded HLSL shader source
and AS_CB_SURFACE_SWIZZLE_* assertion strings). 9,132 lis+addi sites
cross-reference detected strings — names source PCs near each string in
one query. Four unit tests cover encoding detection, NUL termination, and
short-run rejection.

Tests 626→633 (+3 indirect, +4 strings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:22:50 +02:00
MechaCat02
4ff08f6116 M4: class-aware probe tokens via M3 vtable+method tables
CLI extension only — no schema change. Adds symbolic resolution for
--pc-probe / --branch-probe / --ctor-probe tokens:
- `0xADDR` / `2186674160` — numeric (current behavior, no DB load).
- `Class::method` — joins classes × methods × demangled_names.
- `Class::*` — joins classes × methods (all slots).
- `function_name` — falls back to functions.name for free functions /
  saverestore stubs / labels.

New `xenia_analysis::lookup::resolve_probe_token(db_path, token)` opens the
DB read-only ONLY when a token is non-numeric, so legacy numeric flows pay
no IO. New `--probe-db PATH` flag (or `XENIA_PROBE_DB` env / default
`sylpheed.db` next to the .iso) selects the DB.

Symbolic resolution happens BEFORE any guest exec, so it cannot affect the
lockstep digest. Verified deterministic across two reruns at -n 2M
(instructions=2000005 identical).

End-to-end smoke test on Sylpheed: `--pc-probe='ANON_Class_6B674251::*'`
resolves to all 45 method PCs of that anonymous class (matching the
methods-table row count for that vtable).

Tests 621→626 (+5 lookup unit tests covering numeric passthrough,
symbolic-without-DB error, Class::method resolution, Class::* expansion,
and functions.name fallback).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:22:21 +02:00
MechaCat02
1d6c51fbf8 M3: vtable scan + MSVC RTTI walk + 3 new tables
Adds detection of statically-allocated MSVC vtables in .rdata/.data:
- New `xenia_analysis::vtables` walks read-only sections looking for runs of
  ≥3 contiguous big-endian u32 values where each value lands on a known
  function start (from M1's corrected functions table). 2-slot runs are
  rejected to keep false-positive rate down.
- For each candidate the MSVC RTTI walk vtable[-1] → CompleteObjectLocator
  → TypeDescriptor → mangled name is attempted; on success the demangled
  class name is recorded along with a best-effort RTTIClassHierarchyDescriptor
  walk to fill base_classes_json. On failure (RTTI stripped — common for
  shipped game binaries) the class is named ANON_Class_<fnv1a-hash> keyed
  by sorted method-PC list, so identical vtables collapse to one entry.
- DB: new tables `vtables`, `methods`, `classes` with indices on
  function_address and rtti_present. `write_analysis_results` takes a
  `&[Vtable]` slice; `write_disasm` (back-compat) passes empty.
- cmd_dis wires the scan after xref analysis using
  `func_analysis.functions.keys()` as the function-start oracle.

Validation on Sylpheed (RTTI stripped, as expected): 722 vtables / 499
unique classes / 5571 methods. Sanity invariant: every methods.function_address
joins to functions.address (0 broken refs). Largest vtable: 131 slots.

Tests 617→621 (+4 vtable unit tests covering 3-slot detect, 2-slot reject,
synth name stability, and synth name divergence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:17:45 +02:00
MechaCat02
89f5f7e4a9 M2: MSVC C++ demangler + demangled_names DB table
Adds an MSVC name-demangling layer in front of M3's vtable / RTTI work:
- New `xenia_analysis::demangle` wraps the `msvc-demangler` crate (a Rust
  port of LLVM's `MicrosoftDemangle.cpp`). `demangle()` short-circuits on
  non-mangled inputs (`?` prefix check); `demangle_or_raw()` always returns
  a record (raw passthrough on parse failure).
- Heuristic split of the formatted demangled string into structured fields
  `(namespace_path, class_name, method_name, params_signature)`. Top-level
  paren / template-bracket aware, so `a::b<c::d>::e` and signatures with
  templated arg types parse correctly.
- DB: new `demangled_names(address, mangled, raw_demangled, namespace_path,
  class_name, method_name, params_signature)` with indices on address /
  class_name / method_name. Populated from any label whose name starts with
  `?` plus any import name that happens to be mangled.

For Sylpheed (a fully stripped binary) this table is empty out-of-the-box;
the layer's value lands in M3, which will append rows for every RTTI
TypeDescriptor name found in `.rdata`.

Tests 610→617 (+7 demangler unit tests covering early-out, raw fallback,
member function form, RTTI form, qname split, paren-template safety, and
top-level `::` splitting).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:02:21 +02:00
MechaCat02
70120465a3 M1: parse .pdata RUNTIME_FUNCTION; cross-validate function boundaries
Adds an authoritative function-boundary source from the linker:
- New `xenia_xex::pdata` parses .pdata 8-byte entries (BeginAddress + packed
  prolog/length/flags). Bit layout per Microsoft PE32 PowerPC spec: prolog in
  bits 0..7, function_length in bits 8..29, flags in 30..31.
- `func::analyze_with_pdata` unions pdata BeginAddresses into the candidate
  set, attaches `pdata_validated`/`pdata_length` to each `FuncInfo`, and trims
  any function whose `end` overlaps the next start (catches mis-merge where
  one row spanned two prologues — the audit-031 sub_824D23B0/sub_824D29F0
  case).
- DB: extends `functions` with `pdata_validated BOOLEAN`, `pdata_length BIGINT`;
  new table `pdata_entries`; index on pdata_validated.
- New `crates/xenia-analysis/SCHEMA.md` documents M1 layer + forward work.

Validation on Sylpheed: 25481 functions (was 12156) / 23073 pdata_validated /
0 orphans / 0 mis-merges. Audit-031 mis-merge resolved: sub_824D29F0 now has
its own row with `pdata_length=280` (70 dwords); sub_824D23B0 now correctly
ends at 0x824D2878 (`pdata_length=1224` matches prologue walk).

Tests 605→610. New 5-test pdata unit suite covers bit layout + sentinel +
out-of-range filtering + real-world layout round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 19:44:02 +02:00