Files
xenia-rs/crates/xenia-analysis/SCHEMA.md
MechaCat02 ab4fe211e5 M5+M7: indirect-dispatch reachability + .rdata string detection
Two MEDIUM milestones bundled (both opportunistic per plan; both small).

## M5 — indirect-dispatch reachability

- `xenia_analysis::indirect`: per-basic-block register tracker over each
  detected function. Recognises the canonical static-vtable pattern
  `lis+addi → lwz off(rA) → mtctr → bcctrl` where rA holds a known M3
  vtable address. Emits one `Xref { kind: IndirectCall }` per resolvable
  bcctrl site.
- PowerPC ABI awareness: `bl`-style calls clobber volatile r0..r12 + ctr
  but preserve non-volatile r13..r31, so a vtable pointer parked in r30/r31
  before a call survives.
- Label-based basic-block boundaries kill register state — bounds
  false-positive risk for jump-IN paths.
- New `XrefKind::IndirectCall` variant (DB tag `'ind_call'`).
- New SQL view `v_indirect_reachability_from_entry` — strict superset of
  `v_reachability_from_entry`, taking `ind_call` edges in the BFS.

Sylpheed yield: 0 edges detected. The binary's 1,001 static lis+addi
references into vtables are nearly all constructor-side vptr writes, not
dispatches; real method dispatch goes through `this->vptr` which requires
alias analysis we explicitly don't do. Documented in SCHEMA.md as the
expected limitation. Three unit tests cover the synthetic-correctness path.

## M7 — string / constant-pool detection

- `xenia_analysis::strings`: scans `.rdata` for runs of ≥ 6 printable
  ASCII bytes (NUL-terminated) and ≥ 6 UTF-16LE code units (basic-plane
  printable ASCII, NUL u16 terminator).
- New `strings(address PK, encoding, length, content)` table + encoding index.
- Implicit cross-ref via existing `xrefs.kind='ref'` rows whose target
  matches a strings.address.

Sylpheed yield: 6,311 ASCII strings (including embedded HLSL shader source
and AS_CB_SURFACE_SWIZZLE_* assertion strings). 9,132 lis+addi sites
cross-reference detected strings — names source PCs near each string in
one query. Four unit tests cover encoding detection, NUL termination, and
short-run rejection.

Tests 626→633 (+3 indirect, +4 strings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:22:50 +02:00

238 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# `xenia-analysis` schema reference
Authoritative documentation for the DuckDB tables and SQL views produced by
`xenia-rs dis --db sylpheed.db`. Track schema changes here alongside any
update to the `db_schema_golden` test fixture.
The base + disasm tables (`metadata`, `sections`, `imports`, `functions`,
`labels`, `instructions`, `xrefs`, opt-in `exec_trace` / `import_calls` /
`branch_trace`) are documented inline in `src/db.rs` doc comment. This file
collects layered analysis additions and forward-work notes.
---
## Layer M1 — `.pdata` boundary correction (landed)
### Schema additions
- `functions.pdata_validated BOOLEAN NOT NULL``true` when the row's
`address` matches a `RUNTIME_FUNCTION.BeginAddress` from `.pdata`. Linker
ground truth.
- `functions.pdata_length BIGINT NULL``function_length` (bytes) from the
matching pdata entry; `NULL` when the row is prologue-only.
- New table `pdata_entries(begin_address BIGINT PRIMARY KEY, end_address
BIGINT, function_length BIGINT, prolog_length BIGINT, flags BIGINT)` — every
parsed `.pdata` `RUNTIME_FUNCTION` entry (raw, before any merge with
prologue analysis).
- Index `idx_functions_pdata_validated` on `functions(pdata_validated)`.
### What this layer does
- Parses `.pdata` 8-byte `RUNTIME_FUNCTION` entries (PowerPC PE32 layout):
word 0 `BeginAddress` (absolute VA), word 1 packed
`{prolog_length:8, function_length:22, flags:2}`, both big-endian.
- Unions pdata `BeginAddress` values into the function-candidate set fed to
the prologue walker, so functions our prologue heuristic missed still get
rows.
- When pdata supplies a longer `function_length` than the prologue walk
found, extends `end_address` to the pdata-implied end (catches mis-split
where the walker stopped at an early `blr`).
- After the walker, performs a forward pass that trims `function.end` to the
next start when they overlap (catches mis-merge where one row spanned two
prologues — the audit-031 `sub_824D23B0` / `sub_824D29F0` case).
### What this layer does NOT do
- Does not adjust prolog-derived `frame_size` / `saved_gprs` from `.pdata`'s
`prolog_length` field — those remain prologue-only inferences.
- Does not classify functions further than the existing `is_leaf` /
`is_saverestore` columns. Class membership is M3.
- Does not detect functions whose entries are missing from BOTH `.pdata`
and the bl-target scan (extremely rare; would require executable-byte
linear sweep).
### Reference docs
- Microsoft PE32+ exception data spec for PowerPC RUNTIME_FUNCTION.
- xenia-canary `src/xenia/cpu/xex_module.cc:1570-1587` — canary's reference
parser (extracts `BeginAddress` only; we additionally decode word 1).
### Validation queries
```sql
-- All pdata entries found
SELECT COUNT(*) FROM pdata_entries; -- ~23073 for Sylpheed
-- Functions cross-validated against pdata
SELECT COUNT(*) FROM functions WHERE pdata_validated;
-- Functions detected ONLY by prologue (orphans of pdata)
SELECT COUNT(*) FROM functions WHERE NOT pdata_validated;
-- Pdata orphans NOT yet in functions (should be 0 after this layer)
SELECT COUNT(*) FROM pdata_entries p
LEFT JOIN functions f ON f.address = p.begin_address
WHERE f.address IS NULL;
-- Audit-031 mis-merge resolved: 0x824D29F0 should have its own row
SELECT name FROM functions WHERE address = 2186674160; -- 0x824D29F0
```
---
## Layer M2 — MSVC C++ name demangler (landed)
### Schema additions
- New table `demangled_names(address BIGINT NULL, mangled VARCHAR NOT NULL,
raw_demangled VARCHAR NOT NULL, namespace_path VARCHAR NULL,
class_name VARCHAR NULL, method_name VARCHAR NULL,
params_signature VARCHAR NULL)`.
- Indices on `address`, `class_name`, `method_name`.
### What this layer does
- Wraps `msvc_demangler::demangle` (a Rust port of LLVM's
`MicrosoftDemangle.cpp`) and splits the formatted output into structured
fields via a heuristic top-level parser (handles templates and nested parens
correctly).
- Populates `demangled_names` from any label whose name starts with `?` plus
any import name that happens to be mangled (defensive — typical kernel
imports use C names).
### What this layer does NOT do
- Does not parse the AST returned by `msvc_demangler::parse` — uses the formatted
string and a heuristic split. Adequate for typical class member functions
and RTTI strings; exotic template / lambda forms still get `raw_demangled`
populated but may have NULL structured fields.
- Does not yet ingest RTTI strings discovered in `.rdata` — that's M3's job;
M3 will append rows to this table at the addresses where it finds RTTI
TypeDescriptors.
### Reference docs
- `msvc-demangler` crate (`https://docs.rs/msvc-demangler/0.11`).
- LLVM `MicrosoftDemangle.cpp` (the parser this crate ports).
## Layer M3 — Vtable + RTTI detection (landed)
### Schema additions
- `vtables(address PK, length, col_address NULL, class_name, rtti_present,
base_classes_json NULL)` — every detected static vtable.
- `methods(vtable_address, slot, function_address, mangled_name NULL,
demangled_name NULL, PRIMARY KEY (vtable_address, slot))` — one row per
method slot.
- `classes(name PK, vtable_address, rtti_present, base_classes_json NULL)` —
deduped by class name (first-detected vtable wins).
- Indices: `methods.function_address`, `classes.rtti_present`.
### What this layer does
- Walks `.rdata` and `.data` looking for runs of ≥3 consecutive 4-byte BE
values where each value is a known function start (from M1's corrected
`functions` table). Single-2-method vtables are intentionally rejected to
control false-positive rate.
- Attempts the MSVC RTTI walk `vtable[-1] → CompleteObjectLocator → TypeDescriptor`
for each candidate. When successful, the demangled `class ClassName`
string fills `class_name` and a best-effort
`RTTIClassHierarchyDescriptor` walk fills `base_classes_json` (JSON array
of base class names).
- Falls back to `ANON_Class_<8-hex>` keyed by FNV-1a hash of the sorted
method-PC tuple when RTTI is absent (typical for shipped game binaries).
Identical vtables across the binary (multiple instances) collapse to the
same anonymous name.
### What this layer does NOT do
- Vtables built at runtime in heap-allocated memory (e.g. by ctors copying
static templates) are out of scope — only static `.rdata`/`.data` content.
- Multiple-inheritance "extra" vftables (one per base subobject) are detected
as independent vtables with no link between them.
- Inheritance-tree walking beyond `RTTIClassHierarchyDescriptor`'s direct
base list is not attempted.
### Reference docs
- openrce.org "Reversing Microsoft Visual C++" — RTTI layout articles
(CompleteObjectLocator at vtable[-1]; TypeDescriptor at COL+0xC; mangled
name at TD+0x8).
## Layer M4 — Class-aware probe targeting (landed)
CLI extension only — no schema changes. The probe-token grammar adds three
symbolic forms on top of the existing `0xADDR` literal:
- `Class::method` — joins `classes` × `methods` × `demangled_names` to find
every PC whose vtable belongs to that class and whose demangled
`method_name` matches.
- `Class::*` — joins `classes` × `methods` to find every method PC of that
class.
- `function_name` — falls back to `functions.name` lookup for free functions
/ saverestore stubs / labels.
Numeric tokens never touch the DB (preserves zero-IO fast path; lockstep
digest unaffected). Symbolic tokens require the DuckDB at `--probe-db PATH`
or `XENIA_PROBE_DB`; default is `sylpheed.db` next to the .iso when present.
Resolution happens BEFORE guest exec begins, so it cannot affect the
lockstep digest.
See `crates/xenia-analysis/src/lookup.rs`.
---
## Layer M5 — Indirect-dispatch reachability (landed)
### Schema additions
- New value `'ind_call'` in the `xrefs.kind` set.
- New SQL view `v_indirect_reachability_from_entry` — strict superset of
`v_reachability_from_entry`, taking `ind_call` edges in the BFS.
### What this layer does
- Walks each `FuncAnalysis.functions` entry with a per-basic-block register
tracker. Recognises the canonical static-vtable pattern:
`lis+addi → lwz off(rA) → mtctr → bcctrl`, where `rA` ends up holding a
known vtable's start address from M3.
- Honours the PowerPC ABI: `bl`-style calls (op 18 / 16 with LK=1) clobber
volatile r0..r12 + ctr but preserve non-volatile r13..r31, so a vtable
pointer parked in r30/r31 before a call survives.
- Treats every M3 `loc_*` label as a basic-block boundary (kills register
state) so jump-IN paths cannot induce false positives.
### What this layer does NOT do (and observed impact)
- Vtable pointer loaded from a `this`-pointer field
(`lwz r_vt, off(rA)` where `rA = this`) — by far the dominant pattern in
real C++ — is unresolvable without alias / points-to analysis.
- On Sylpheed: the layer detects 0 edges. The binary's 1,001 lis+addi
references into vtables are mostly constructor-side **vptr writes**
(`stw rVtable, vptr_offset(this)`), not direct dispatches. The renderer
hunt's audit-009 cluster therefore needs a future M5.5 with `this`-flow
tracking before this layer surfaces it.
### Reference docs
- IBM PowerPC ABI: register-save convention (volatile r0..r12 + ctr,
non-volatile r13..r31).
## Layer M7 — String / constant-pool detection (landed)
### Schema additions
- New table `strings(address PK, encoding, length, content)`.
- Index `idx_strings_encoding`.
### What this layer does
- Scans `.rdata` for runs of length ≥ 6 of printable ASCII bytes followed by
a NUL terminator.
- Scans `.rdata` for UTF-16LE runs of length ≥ 6 code units (printable-ASCII
basic plane only) followed by a u16 NUL terminator.
- Cross-reference is implicit: existing `xrefs.kind='ref'` rows whose
`target` falls in `strings.address`'s exact match set name the referencing
PCs. SQL: `SELECT s.content, x.source FROM xrefs x JOIN strings s
ON s.address = x.target WHERE x.kind='ref'`.
### What this layer does NOT do
- No UTF-8 multibyte / non-ASCII basic plane in either encoding.
- No `.data` scan (read-only-section bias).
- No multi-byte CJK encodings — Japanese text in localised builds may be
represented in shift_jis / utf-8 with non-printable bytes that this
scanner skips.
### Sylpheed yield
- 6,311 ASCII strings (including full embedded HLSL shader source).
- 0 UTF-16LE strings (binary uses ASCII / native CJK encoding).
- 9,132 lis+addi sites cross-reference into the detected strings — names
the source PCs that reference each string.
## Forward work (M6, M8M12, not yet landed)
- **M6** — extended `xrefs.kind='write'` for indexed/byte-reverse/multiword/VMX/DCBZ/atomic stores with `addr_mode` column.
- **M8** — dispatch-table heuristics beyond vtables (e.g. function-pointer arrays in `.data`).
- **M9** — `__CxxFrameHandler` exception scope-table parsing.
- **M10** — `.tls` section / TLS slot tracking.
- **M11** — `__xc_a` / `__xc_z` static-initializer driver detection.
- **M12** — comparative-PC-trace mode for canary diff (runtime side, not analyzer).