xenia-rs/crates/xenia-analysis/SCHEMA.md
MechaCat02 ab4fe211e5 M5+M7: indirect-dispatch reachability + .rdata string detection
Two MEDIUM milestones bundled (both opportunistic per plan; both small).

## M5 — indirect-dispatch reachability

- `xenia_analysis::indirect`: per-basic-block register tracker over each
  detected function. Recognises the canonical static-vtable pattern
  `lis+addi → lwz off(rA) → mtctr → bcctrl` where rA holds a known M3
  vtable address. Emits one `Xref { kind: IndirectCall }` per resolvable
  bcctrl site.
- PowerPC ABI awareness: `bl`-style calls clobber volatile r0..r12 + ctr
  but preserve non-volatile r13..r31, so a vtable pointer parked in r30/r31
  before a call survives.
- Label-based basic-block boundaries kill register state — bounds
  false-positive risk for jump-IN paths.
- New `XrefKind::IndirectCall` variant (DB tag `'ind_call'`).
- New SQL view `v_indirect_reachability_from_entry` — strict superset of
  `v_reachability_from_entry`, taking `ind_call` edges in the BFS.

Sylpheed yield: 0 edges detected. The binary's 1,001 static lis+addi
references into vtables are nearly all constructor-side vptr writes, not
dispatches; real method dispatch goes through `this->vptr` which requires
alias analysis we explicitly don't do. Documented in SCHEMA.md as the
expected limitation. Three unit tests cover the synthetic-correctness path.

## M7 — string / constant-pool detection

- `xenia_analysis::strings`: scans `.rdata` for runs of ≥ 6 printable
  ASCII bytes (NUL-terminated) and ≥ 6 UTF-16LE code units (basic-plane
  printable ASCII, NUL u16 terminator).
- New `strings(address PK, encoding, length, content)` table + encoding index.
- Implicit cross-ref via existing `xrefs.kind='ref'` rows whose target
  matches a strings.address.

Sylpheed yield: 6,311 ASCII strings (including embedded HLSL shader source
and AS_CB_SURFACE_SWIZZLE_* assertion strings). 9,132 lis+addi sites
cross-reference detected strings — names source PCs near each string in
one query. Four unit tests cover encoding detection, NUL termination, and
short-run rejection.

Tests 626→633 (+3 indirect, +4 strings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:22:50 +02:00


xenia-analysis schema reference

Authoritative documentation for the DuckDB tables and SQL views produced by xenia-rs dis --db sylpheed.db. Track schema changes here alongside any update to the db_schema_golden test fixture.

The base + disasm tables (metadata, sections, imports, functions, labels, instructions, xrefs, opt-in exec_trace / import_calls / branch_trace) are documented inline in the src/db.rs doc comment. This file collects the layered analysis additions and forward-work notes.


Layer M1 — .pdata boundary correction (landed)

Schema additions

  • functions.pdata_validated BOOLEAN NOT NULL — true when the row's address matches a RUNTIME_FUNCTION.BeginAddress from .pdata. Linker ground truth.
  • functions.pdata_length BIGINT NULL — function_length (bytes) from the matching pdata entry; NULL when the row is prologue-only.
  • New table pdata_entries(begin_address BIGINT PRIMARY KEY, end_address BIGINT, function_length BIGINT, prolog_length BIGINT, flags BIGINT) — every parsed .pdata RUNTIME_FUNCTION entry (raw, before any merge with prologue analysis).
  • Index idx_functions_pdata_validated on functions(pdata_validated).

What this layer does

  • Parses .pdata 8-byte RUNTIME_FUNCTION entries (PowerPC PE32 layout): word 0 BeginAddress (absolute VA), word 1 packed {prolog_length:8, function_length:22, flags:2}, both big-endian.
  • Unions pdata BeginAddress values into the function-candidate set fed to the prologue walker, so functions our prologue heuristic missed still get rows.
  • When pdata supplies a longer function_length than the prologue walk found, extends end_address to the pdata-implied end (catches mis-split where the walker stopped at an early blr).
  • After the walker, performs a forward pass that trims function.end to the next start when they overlap (catches mis-merge where one row spanned two prologues — the audit-031 sub_824D23B0 / sub_824D29F0 case).
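The decode and trim steps above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the names PdataEntry, decode_pdata_entry and trim_overlaps are hypothetical, the bit order inside word 1 (prolog_length in the low 8 bits, function_length in the next 22, flags in the top 2) is an assumption, and function ranges are assumed to be half-open (start, end) pairs.

```rust
/// One decoded `.pdata` entry. Field names mirror the `pdata_entries` columns.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PdataEntry {
    pub begin_address: u32,
    pub prolog_length: u32,
    pub function_length: u32,
    pub flags: u32,
}

/// Decode one 8-byte RUNTIME_FUNCTION entry made of two big-endian words.
/// ASSUMPTION: word 1 packs prolog_length in bits 0..8, function_length in
/// bits 8..30 and flags in bits 30..32. Values are returned raw (no scaling).
pub fn decode_pdata_entry(raw: &[u8; 8]) -> PdataEntry {
    let word0 = u32::from_be_bytes([raw[0], raw[1], raw[2], raw[3]]);
    let word1 = u32::from_be_bytes([raw[4], raw[5], raw[6], raw[7]]);
    PdataEntry {
        begin_address: word0,
        prolog_length: word1 & 0xFF,
        function_length: (word1 >> 8) & 0x3F_FFFF,
        flags: word1 >> 30,
    }
}

/// Forward pass over (start, end) ranges: sort by start, then trim each end
/// to the next start whenever two ranges overlap.
pub fn trim_overlaps(ranges: &mut [(u32, u32)]) {
    ranges.sort_by_key(|r| r.0);
    for i in 0..ranges.len().saturating_sub(1) {
        let next_start = ranges[i + 1].0;
        if ranges[i].1 > next_start {
            ranges[i].1 = next_start;
        }
    }
}
```

With the audit-031 addresses, a row starting at 0x824D23B0 that originally ran past 0x824D29F0 would be cut back to end exactly where the second function begins.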

What this layer does NOT do

  • Does not adjust prolog-derived frame_size / saved_gprs from .pdata's prolog_length field — those remain prologue-only inferences.
  • Does not classify functions further than the existing is_leaf / is_saverestore columns. Class membership is M3.
  • Does not detect functions whose entries are missing from BOTH .pdata and the bl-target scan (extremely rare; would require executable-byte linear sweep).

Reference docs

  • Microsoft PE32+ exception data spec for PowerPC RUNTIME_FUNCTION.
  • xenia-canary src/xenia/cpu/xex_module.cc:1570-1587 — canary's reference parser (extracts BeginAddress only; we additionally decode word 1).

Validation queries

-- All pdata entries found
SELECT COUNT(*) FROM pdata_entries;            -- ~23073 for Sylpheed
-- Functions cross-validated against pdata
SELECT COUNT(*) FROM functions WHERE pdata_validated;
-- Functions detected ONLY by prologue (orphans of pdata)
SELECT COUNT(*) FROM functions WHERE NOT pdata_validated;
-- Pdata orphans NOT yet in functions (should be 0 after this layer)
SELECT COUNT(*) FROM pdata_entries p
LEFT JOIN functions f ON f.address = p.begin_address
WHERE f.address IS NULL;
-- Audit-031 mis-merge resolved: 0x824D29F0 should have its own row
SELECT name FROM functions WHERE address = 2186095088;  -- 0x824D29F0

Layer M2 — MSVC C++ name demangler (landed)

Schema additions

  • New table demangled_names(address BIGINT NULL, mangled VARCHAR NOT NULL, raw_demangled VARCHAR NOT NULL, namespace_path VARCHAR NULL, class_name VARCHAR NULL, method_name VARCHAR NULL, params_signature VARCHAR NULL).
  • Indices on address, class_name, method_name.

What this layer does

  • Wraps msvc_demangler::demangle (a Rust port of LLVM's MicrosoftDemangle.cpp) and splits the formatted output into structured fields via a heuristic top-level parser (handles templates and nested parens correctly).
  • Populates demangled_names from any label whose name starts with ? plus any import name that happens to be mangled (defensive — typical kernel imports use C names).
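The "heuristic top-level parser" idea can be illustrated with a small splitter that only breaks on :: separators at nesting depth 0. This is a sketch under assumptions: split_top_level and split_fields are hypothetical names, the input is taken to be just the qualified-name portion of the demangled string (no return type or calling convention), and operator names containing < or > (such as operator<) are not handled.

```rust
/// Split a qualified name on `::` separators that sit at nesting depth 0,
/// leaving separators inside `<...>` or `(...)` untouched.
pub fn split_top_level(name: &str) -> Vec<&str> {
    let bytes = name.as_bytes();
    let mut parts = Vec::new();
    let mut depth = 0usize;
    let mut start = 0usize;
    let mut i = 0usize;
    while i < bytes.len() {
        match bytes[i] {
            b'<' | b'(' => depth += 1,
            b'>' | b')' => depth = depth.saturating_sub(1),
            b':' if depth == 0 && bytes.get(i + 1) == Some(&b':') => {
                parts.push(&name[start..i]);
                i += 2; // skip both colons
                start = i;
                continue;
            }
            _ => {}
        }
        i += 1;
    }
    parts.push(&name[start..]);
    parts
}

/// Map the split parts onto (namespace_path, class_name, method_name).
/// The last part is the method, the one before it the class, and anything
/// earlier is joined back into the namespace path.
pub fn split_fields(qualified: &str) -> (Option<String>, Option<String>, String) {
    let mut parts = split_top_level(qualified);
    let method = parts.pop().unwrap_or("").to_string();
    let class = parts.pop().map(str::to_string);
    let namespace = if parts.is_empty() {
        None
    } else {
        Some(parts.join("::"))
    };
    (namespace, class, method)
}
```

A free function such as a plain C name yields no namespace and no class, which corresponds to the NULL structured fields described above.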

What this layer does NOT do

  • Does not parse the AST returned by msvc_demangler::parse — uses the formatted string and a heuristic split. Adequate for typical class member functions and RTTI strings; exotic template / lambda forms still get raw_demangled populated but may have NULL structured fields.
  • Does not yet ingest RTTI strings discovered in .rdata — that's M3's job; M3 will append rows to this table at the addresses where it finds RTTI TypeDescriptors.

Reference docs

  • msvc-demangler crate (https://docs.rs/msvc-demangler/0.11).
  • LLVM MicrosoftDemangle.cpp (the parser this crate ports).

Layer M3 — Vtable + RTTI detection (landed)

Schema additions

  • vtables(address PK, length, col_address NULL, class_name, rtti_present, base_classes_json NULL) — every detected static vtable.
  • methods(vtable_address, slot, function_address, mangled_name NULL, demangled_name NULL, PRIMARY KEY (vtable_address, slot)) — one row per method slot.
  • classes(name PK, vtable_address, rtti_present, base_classes_json NULL) — deduped by class name (first-detected vtable wins).
  • Indices: methods.function_address, classes.rtti_present.

What this layer does

  • Walks .rdata and .data looking for runs of ≥3 consecutive 4-byte BE values where each value is a known function start (from M1's corrected functions table). Candidates with only one or two methods are intentionally rejected to control the false-positive rate.
  • Attempts the MSVC RTTI walk vtable[-1] → CompleteObjectLocator → TypeDescriptor for each candidate. When successful, the demangled class ClassName string fills class_name and a best-effort RTTIClassHierarchyDescriptor walk fills base_classes_json (JSON array of base class names).
  • Falls back to ANON_Class_<8-hex> keyed by FNV-1a hash of the sorted method-PC tuple when RTTI is absent (typical for shipped game binaries). Identical vtables across the binary (multiple instances) collapse to the same anonymous name.
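The run scan and the anonymous-name fallback described above might look roughly like this. It is a sketch, not the crate's implementation: find_vtable_runs and anon_class_name are hypothetical names, the section is assumed to be 4-byte aligned at base, the hash is assumed to be 32-bit FNV-1a fed with each sorted method PC as big-endian bytes, and the uppercase hex formatting is a guess. The RTTI walk is omitted.

```rust
use std::collections::HashSet;

/// Scan section bytes (loaded at VA `base`) for runs of at least `min_len`
/// consecutive big-endian words that are all known function starts.
/// Returns (vtable_address, method_pcs) per run.
pub fn find_vtable_runs(
    data: &[u8],
    base: u32,
    function_starts: &HashSet<u32>,
    min_len: usize,
) -> Vec<(u32, Vec<u32>)> {
    let mut out = Vec::new();
    let mut run: Vec<u32> = Vec::new();
    let mut run_start = 0u32;
    for (i, chunk) in data.chunks_exact(4).enumerate() {
        let word = u32::from_be_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        if function_starts.contains(&word) {
            if run.is_empty() {
                run_start = base.wrapping_add((i as u32) * 4);
            }
            run.push(word);
        } else {
            // Run ended: keep it only if it is long enough.
            if run.len() >= min_len {
                out.push((run_start, std::mem::take(&mut run)));
            }
            run.clear();
        }
    }
    if run.len() >= min_len {
        out.push((run_start, run));
    }
    out
}

/// Name for a vtable without RTTI. ASSUMPTION: 32-bit FNV-1a over the
/// big-endian bytes of the sorted method PCs.
pub fn anon_class_name(method_pcs: &[u32]) -> String {
    let mut sorted = method_pcs.to_vec();
    sorted.sort_unstable();
    let mut hash: u32 = 0x811C_9DC5; // FNV-1a 32-bit offset basis
    for pc in sorted {
        for byte in pc.to_be_bytes() {
            hash ^= u32::from(byte);
            hash = hash.wrapping_mul(0x0100_0193); // FNV 32-bit prime
        }
    }
    format!("ANON_Class_{:08X}", hash)
}
```

Because the PCs are sorted before hashing, two vtables with the same method set get the same anonymous name regardless of slot order, which is one way to read "sorted method-PC tuple".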

What this layer does NOT do

  • Vtables built at runtime in heap-allocated memory (e.g. by ctors copying static templates) are out of scope — only static .rdata/.data content.
  • Multiple-inheritance "extra" vftables (one per base subobject) are detected as independent vtables with no link between them.
  • Inheritance-tree walking beyond RTTIClassHierarchyDescriptor's direct base list is not attempted.

Reference docs

  • openrce.org "Reversing Microsoft Visual C++" — RTTI layout articles (CompleteObjectLocator at vtable[-1]; TypeDescriptor at COL+0xC; mangled name at TD+0x8).

Layer M4 — Class-aware probe targeting (landed)

CLI extension only — no schema changes. The probe-token grammar adds three symbolic forms on top of the existing 0xADDR literal:

  • Class::method — joins classes × methods × demangled_names to find every PC whose vtable belongs to that class and whose demangled method_name matches.
  • Class::* — joins classes × methods to find every method PC of that class.
  • function_name — falls back to functions.name lookup for free functions / saverestore stubs / labels.

Numeric tokens never touch the DB (preserves zero-IO fast path; lockstep digest unaffected). Symbolic tokens require the DuckDB at --probe-db PATH or XENIA_PROBE_DB; default is sylpheed.db next to the .iso when present.

Resolution happens BEFORE guest exec begins, so it cannot affect the lockstep digest.

See crates/xenia-analysis/src/lookup.rs.
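As an illustration of the token grammar, the classification step could be sketched like this. ProbeToken and classify_probe_token are hypothetical names, not the API in lookup.rs, and splitting at the last :: (so that any namespace prefix stays with the class part) is an assumption. The sketch only classifies tokens; the DB joins are not shown.

```rust
/// The four probe-token forms described above.
#[derive(Debug, PartialEq, Eq)]
pub enum ProbeToken {
    /// `0xADDR` literal: resolved without touching the DB.
    Address(u32),
    /// `Class::method`
    ClassMethod { class: String, method: String },
    /// `Class::*`
    ClassAll { class: String },
    /// Anything else: looked up as a function name.
    FunctionName(String),
}

impl ProbeToken {
    /// Only symbolic tokens need the DuckDB file.
    pub fn needs_db(&self) -> bool {
        !matches!(self, ProbeToken::Address(_))
    }
}

pub fn classify_probe_token(token: &str) -> ProbeToken {
    // Numeric form first, so literals never reach the symbolic path.
    if let Some(hex) = token.strip_prefix("0x").or_else(|| token.strip_prefix("0X")) {
        if let Ok(addr) = u32::from_str_radix(hex, 16) {
            return ProbeToken::Address(addr);
        }
    }
    // ASSUMPTION: split at the last `::` so the class part keeps any prefix.
    if let Some((class, method)) = token.rsplit_once("::") {
        if !class.is_empty() && !method.is_empty() {
            return if method == "*" {
                ProbeToken::ClassAll { class: class.to_string() }
            } else {
                ProbeToken::ClassMethod {
                    class: class.to_string(),
                    method: method.to_string(),
                }
            };
        }
    }
    ProbeToken::FunctionName(token.to_string())
}
```

Checking needs_db before opening any database is one way to keep the zero-IO fast path for purely numeric probe lists.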


Layer M5 — Indirect-dispatch reachability (landed)

Schema additions

  • New value 'ind_call' in the xrefs.kind set.
  • New SQL view v_indirect_reachability_from_entry — strict superset of v_reachability_from_entry, taking ind_call edges in the BFS.

What this layer does

  • Walks each FuncAnalysis.functions entry with a per-basic-block register tracker. Recognises the canonical static-vtable pattern: lis+addi → lwz off(rA) → mtctr → bcctrl, where rA ends up holding a known vtable's start address from M3.
  • Honours the PowerPC ABI: bl-style calls (op 18 / 16 with LK=1) clobber volatile r0..r12 + ctr but preserve non-volatile r13..r31, so a vtable pointer parked in r30/r31 before a call survives.
  • Treats every M3 loc_* label as a basic-block boundary (kills register state) so jump-IN paths cannot induce false positives.
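A toy version of the tracker makes the three rules above concrete. Everything here is a sketch under assumptions: Insn is a hypothetical pre-decoded instruction form (the real layer works on decoded PowerPC instructions), vtables maps a vtable start address to its method PCs as M3 would supply them, register indices are assumed to be below 32, and the returned pairs are (instruction index, resolved target).

```rust
use std::collections::HashMap;

/// Hypothetical simplified instruction stream for one function.
#[derive(Debug, Clone, Copy)]
pub enum Insn {
    Label,                                   // basic-block boundary
    Lis { rd: usize, imm: i16 },             // rd = imm << 16
    Addi { rd: usize, ra: usize, imm: i16 }, // rd = ra + sign-extended imm
    Lwz { rd: usize, ra: usize, off: i16 },  // rd = mem32[ra + off]
    Mtctr { rs: usize },                     // ctr = rs
    Bl,                                      // direct call
    Bcctrl,                                  // indirect call through ctr
    Other { writes: Option<usize> },         // anything else
}

#[derive(Clone, Copy)]
enum Val {
    Unknown,
    Const(u32),  // known constant, e.g. a vtable address
    Target(u32), // method PC loaded from a known vtable slot
}

/// Calls clobber volatile r0..r12 and ctr; r13..r31 survive.
fn clobber_call(regs: &mut [Val; 32], ctr: &mut Val) {
    for r in regs.iter_mut().take(13) {
        *r = Val::Unknown;
    }
    *ctr = Val::Unknown;
}

pub fn find_indirect_calls(
    insns: &[Insn],
    vtables: &HashMap<u32, Vec<u32>>,
) -> Vec<(usize, u32)> {
    let mut regs = [Val::Unknown; 32];
    let mut ctr = Val::Unknown;
    let mut edges = Vec::new();
    for (idx, insn) in insns.iter().enumerate() {
        match *insn {
            // A label may be a jump-in point, so forget everything.
            Insn::Label => {
                regs = [Val::Unknown; 32];
                ctr = Val::Unknown;
            }
            Insn::Lis { rd, imm } => regs[rd] = Val::Const((imm as u16 as u32) << 16),
            Insn::Addi { rd, ra, imm } => {
                // For addi, ra == 0 means the literal zero (the `li` form).
                let base = if ra == 0 { Val::Const(0) } else { regs[ra] };
                regs[rd] = match base {
                    Val::Const(b) => Val::Const(b.wrapping_add(imm as i32 as u32)),
                    _ => Val::Unknown,
                };
            }
            Insn::Lwz { rd, ra, off } => {
                regs[rd] = match regs[ra] {
                    Val::Const(vt) if off >= 0 && off % 4 == 0 => vtables
                        .get(&vt)
                        .and_then(|methods| methods.get(off as usize / 4))
                        .map_or(Val::Unknown, |&pc| Val::Target(pc)),
                    _ => Val::Unknown,
                };
            }
            Insn::Mtctr { rs } => ctr = regs[rs],
            Insn::Bcctrl => {
                if let Val::Target(pc) = ctr {
                    edges.push((idx, pc));
                }
                clobber_call(&mut regs, &mut ctr);
            }
            Insn::Bl => clobber_call(&mut regs, &mut ctr),
            Insn::Other { writes } => {
                if let Some(rd) = writes {
                    regs[rd] = Val::Unknown;
                }
            }
        }
    }
    edges
}
```

In this model a vtable address parked in r30 survives an intervening Bl and still resolves, while the same address held in r11, or separated from the dispatch by a Label, does not.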

What this layer does NOT do (and observed impact)

  • Vtable pointer loaded from a this-pointer field (lwz r_vt, off(rA) where rA = this) — by far the dominant pattern in real C++ — is unresolvable without alias / points-to analysis.
  • On Sylpheed: the layer detects 0 edges. The binary's 1,001 lis+addi references into vtables are mostly constructor-side vptr writes (stw rVtable, vptr_offset(this)), not direct dispatches. The renderer hunt's audit-009 cluster therefore needs a future M5.5 with this-flow tracking before this layer surfaces it.

Reference docs

  • IBM PowerPC ABI: register-save convention (volatile r0..r12 + ctr, non-volatile r13..r31).

Layer M7 — String / constant-pool detection (landed)

Schema additions

  • New table strings(address PK, encoding, length, content).
  • Index idx_strings_encoding.

What this layer does

  • Scans .rdata for runs of length ≥ 6 of printable ASCII bytes followed by a NUL terminator.
  • Scans .rdata for UTF-16LE runs of length ≥ 6 code units (printable-ASCII basic plane only) followed by a u16 NUL terminator.
  • Cross-reference is implicit: existing xrefs.kind='ref' rows whose target exactly equals a strings.address identify the referencing PCs. SQL: SELECT s.content, x.source FROM xrefs x JOIN strings s ON s.address = x.target WHERE x.kind='ref'.
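The two scans can be sketched as below. The names FoundString, scan_ascii and scan_utf16le are hypothetical, the encoding tags "ascii" and "utf16le" are placeholders, and two details are assumptions rather than facts from the source: "printable" is taken to mean 0x20..=0x7E plus tab, LF and CR, and UTF-16LE candidates are only tried at 2-byte-aligned offsets.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct FoundString {
    pub address: u32,
    pub encoding: &'static str,
    pub length: usize, // bytes for ASCII, code units for UTF-16LE
    pub content: String,
}

/// ASSUMPTION: printable = 0x20..=0x7E plus tab, LF and CR.
fn is_printable(b: u8) -> bool {
    (0x20..=0x7E).contains(&b) || matches!(b, b'\t' | b'\n' | b'\r')
}

/// Runs of >= `min_len` printable bytes that are followed by a NUL byte.
pub fn scan_ascii(data: &[u8], base: u32, min_len: usize) -> Vec<FoundString> {
    let mut out = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let run = data[start..].iter().take_while(|&&b| is_printable(b)).count();
        let end = start + run;
        if run >= min_len && data.get(end) == Some(&0) {
            out.push(FoundString {
                address: base.wrapping_add(start as u32),
                encoding: "ascii",
                length: run,
                content: String::from_utf8_lossy(&data[start..end]).into_owned(),
            });
        }
        start = end + 1; // skip the byte that ended the run
    }
    out
}

/// Runs of >= `min_len` code units of the form (printable, 0x00) that are
/// followed by a 16-bit NUL.
pub fn scan_utf16le(data: &[u8], base: u32, min_len: usize) -> Vec<FoundString> {
    let mut out = Vec::new();
    let mut start = 0;
    while start + 1 < data.len() {
        let mut end = start;
        while end + 1 < data.len() && is_printable(data[end]) && data[end + 1] == 0 {
            end += 2;
        }
        let units = (end - start) / 2;
        if units >= min_len && data.get(end..end + 2) == Some(&[0u8, 0][..]) {
            out.push(FoundString {
                address: base.wrapping_add(start as u32),
                encoding: "utf16le",
                length: units,
                content: data[start..end].iter().step_by(2).map(|&b| b as char).collect(),
            });
        }
        start = end + 2; // skip the unit that ended the run
    }
    out
}
```

A run that reaches the end of the section without a terminator is dropped, which matches the NUL-termination requirement in both bullets.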

What this layer does NOT do

  • No UTF-8 multibyte sequences, and no non-ASCII basic-plane characters in either encoding.
  • No .data scan (read-only-section bias).
  • No multi-byte CJK encodings — Japanese text in localised builds may be represented in shift_jis / utf-8 with non-printable bytes that this scanner skips.

Sylpheed yield

  • 6,311 ASCII strings (including full embedded HLSL shader source).
  • 0 UTF-16LE strings (binary uses ASCII / native CJK encoding).
  • 9,132 lis+addi sites cross-reference the detected strings, so a single query names the source PCs that reference each string.

Forward work (M6, M8–M12, not yet landed)

  • M6 — extended xrefs.kind='write' for indexed/byte-reverse/multiword/VMX/DCBZ/atomic stores with addr_mode column.
  • M8 — dispatch-table heuristics beyond vtables (e.g. function-pointer arrays in .data).
  • M9 — __CxxFrameHandler exception scope-table parsing.
  • M10 — .tls section / TLS slot tracking.
  • M11 — __xc_a / __xc_z static-initializer driver detection.
  • M12 — comparative-PC-trace mode for canary diff (runtime side, not analyzer).