xenia-rs/crates/xenia-analysis/SCHEMA.md
MechaCat02 38d8871e8d M6: addr_mode column on xrefs + extended store/load classes
Adds finer-grained addressing-mode classification to every data xref row
plus new dispatch for instruction families not previously emitted:
- New `xrefs.addr_mode VARCHAR NULL` column. NULL for control-flow edges
  (call / ind_call / j / br); one of d_form / lis_addi / lis_ori /
  multiword / x_form_indexed / x_form_byterev / atomic / dcbz for data
  edges. Index idx_xrefs_addr_mode.
- New `xenia_analysis::xref::AddrMode` enum + Xref::addr_mode field.
- Opcode 46/47 (lmw/stmw) expand to one xref per slot — D-form multi-word
  load/store now resolves all (32-rS) consecutive addresses.
- Opcode 31 X-form dispatch — stwx/stbx/sthx/stwux/stbux/sthux/stdx/stdux,
  lwzx/lbzx/lhzx/lhax/lwzux/lbzux/lhzux/lhaux/ldx/ldux,
  stwcx./stdcx. (atomic),
  stwbrx/sthbrx/lwbrx/lhbrx (byte-reverse),
  dcbz (cache-line clear).
- X-form rows are emitted ONLY when both rA and rB resolve to known
  constants (rare but present); the dominant runtime-indexed pattern
  remains correctly skipped.

Sylpheed yield (regen on master + merge):
- 442 newly-detected x_form_indexed reads (lwzx/lhzx into static tables).
- 40 newly-detected atomic writes (stwcx./stdcx. with resolvable address).
- 28,834 lis_addi refs, 18,485 d_form reads, 3,288 d_form writes — every
  pre-existing data row now tagged.
- 0 multiword / dcbz / byterev (these instructions exist but aren't on
  lis+addi-tracked code paths).

Tests 633→636 (+3 xref unit tests covering AddrMode tag uniqueness,
data-edge addr_mode round-trip, control-edge None invariant). Schema
golden updated (xrefs gains addr_mode column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:38:47 +02:00

# `xenia-analysis` schema reference
Authoritative documentation for the DuckDB tables and SQL views produced by
`xenia-rs dis --db sylpheed.db`. Track schema changes here alongside any
update to the `db_schema_golden` test fixture.
The base + disasm tables (`metadata`, `sections`, `imports`, `functions`,
`labels`, `instructions`, `xrefs`, opt-in `exec_trace` / `import_calls` /
`branch_trace`) are documented inline in the `src/db.rs` doc comments. This file
collects layered analysis additions and forward-work notes.
---
## Layer M1 — `.pdata` boundary correction (landed)
### Schema additions
- `functions.pdata_validated BOOLEAN NOT NULL`: `true` when the row's
`address` matches a `RUNTIME_FUNCTION.BeginAddress` from `.pdata`. Linker
ground truth.
- `functions.pdata_length BIGINT NULL`: `function_length` (bytes) from the
matching pdata entry; `NULL` when the row is prologue-only.
- New table `pdata_entries(begin_address BIGINT PRIMARY KEY, end_address
BIGINT, function_length BIGINT, prolog_length BIGINT, flags BIGINT)` — every
parsed `.pdata` `RUNTIME_FUNCTION` entry (raw, before any merge with
prologue analysis).
- Index `idx_functions_pdata_validated` on `functions(pdata_validated)`.
### What this layer does
- Parses `.pdata` 8-byte `RUNTIME_FUNCTION` entries (PowerPC PE32 layout):
word 0 `BeginAddress` (absolute VA), word 1 packed
`{prolog_length:8, function_length:22, flags:2}`, both big-endian.
- Unions pdata `BeginAddress` values into the function-candidate set fed to
the prologue walker, so functions our prologue heuristic missed still get
rows.
- When pdata supplies a longer `function_length` than the prologue walk
found, extends `end_address` to the pdata-implied end (catches mis-split
where the walker stopped at an early `blr`).
- After the walker, performs a forward pass that trims `function.end` to the
next start when they overlap (catches mis-merge where one row spanned two
prologues — the audit-031 `sub_824D23B0` / `sub_824D29F0` case).
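The word-1 unpacking and the overlap-trimming pass described above can be sketched as follows. This is a minimal illustration, not the crate's code: `PdataEntry`, `decode_pdata_entry`, and `trim_overlaps` are hypothetical names, and the bit order (fields packed from the least significant bit, in the order listed) is an assumption that should be checked against the real parser.

```rust
/// Decoded `.pdata` entry (hypothetical struct, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub struct PdataEntry {
    pub begin_address: u32,
    pub prolog_length: u32,
    pub function_length: u32,
    pub flags: u32,
}

/// Decode one 8-byte entry: two big-endian words.
/// ASSUMPTION: word 1 packs prolog_length in bits 0..8,
/// function_length in bits 8..30 and flags in bits 30..32.
pub fn decode_pdata_entry(raw: [u8; 8]) -> PdataEntry {
    let begin_address = u32::from_be_bytes([raw[0], raw[1], raw[2], raw[3]]);
    let packed = u32::from_be_bytes([raw[4], raw[5], raw[6], raw[7]]);
    PdataEntry {
        begin_address,
        prolog_length: packed & 0xFF,
        function_length: (packed >> 8) & 0x3F_FFFF,
        flags: (packed >> 30) & 0x3,
    }
}

/// Forward pass over `(start, end)` ranges: sort by start, then clamp each
/// end to the next start when the two ranges overlap.
pub fn trim_overlaps(ranges: &mut [(u32, u32)]) {
    ranges.sort_by_key(|r| r.0);
    for i in 0..ranges.len().saturating_sub(1) {
        let next_start = ranges[i + 1].0;
        if ranges[i].1 > next_start {
            ranges[i].1 = next_start;
        }
    }
}
```

Under these assumptions, a row starting at `0x824D23B0` that originally ran past `0x824D29F0` is clamped to end where the second prologue begins, which is the shape of the audit-031 fix.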
### What this layer does NOT do
- Does not adjust prolog-derived `frame_size` / `saved_gprs` from `.pdata`'s
`prolog_length` field — those remain prologue-only inferences.
- Does not classify functions further than the existing `is_leaf` /
`is_saverestore` columns. Class membership is M3.
- Does not detect functions whose entries are missing from BOTH `.pdata`
and the bl-target scan (extremely rare; would require executable-byte
linear sweep).
### Reference docs
- Microsoft PE32+ exception data spec for PowerPC RUNTIME_FUNCTION.
- xenia-canary `src/xenia/cpu/xex_module.cc:1570-1587` — canary's reference
parser (extracts `BeginAddress` only; we additionally decode word 1).
### Validation queries
```sql
-- All pdata entries found
SELECT COUNT(*) FROM pdata_entries; -- ~23073 for Sylpheed
-- Functions cross-validated against pdata
SELECT COUNT(*) FROM functions WHERE pdata_validated;
-- Functions detected ONLY by prologue (orphans of pdata)
SELECT COUNT(*) FROM functions WHERE NOT pdata_validated;
-- Pdata orphans NOT yet in functions (should be 0 after this layer)
SELECT COUNT(*) FROM pdata_entries p
LEFT JOIN functions f ON f.address = p.begin_address
WHERE f.address IS NULL;
-- Audit-031 mis-merge resolved: 0x824D29F0 should have its own row
SELECT name FROM functions WHERE address = 2186095088; -- 0x824D29F0
```
---
## Layer M2 — MSVC C++ name demangler (landed)
### Schema additions
- New table `demangled_names(address BIGINT NULL, mangled VARCHAR NOT NULL,
raw_demangled VARCHAR NOT NULL, namespace_path VARCHAR NULL,
class_name VARCHAR NULL, method_name VARCHAR NULL,
params_signature VARCHAR NULL)`.
- Indices on `address`, `class_name`, `method_name`.
### What this layer does
- Wraps `msvc_demangler::demangle` (a Rust port of LLVM's
`MicrosoftDemangle.cpp`) and splits the formatted output into structured
fields via a heuristic top-level parser (handles templates and nested parens
correctly).
- Populates `demangled_names` from any label whose name starts with `?` plus
any import name that happens to be mangled (defensive — typical kernel
imports use C names).
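To illustrate what "top-level" means for the heuristic split, the sketch below splits a qualified name on `::` while ignoring separators nested inside template angle brackets or parentheses. `split_top_level_scope` is a hypothetical helper, not the crate's parser; it assumes balanced brackets and would miscount on names such as `operator<`, which the real parser would need to special-case.

```rust
/// Split a demangled qualified name on `::`, but only at bracket depth 0,
/// so `Foo<a::b>` stays in one piece. Illustrative only.
pub fn split_top_level_scope(qualified: &str) -> Vec<&str> {
    let bytes = qualified.as_bytes();
    let mut parts = Vec::new();
    let mut depth = 0usize; // nesting level of `<...>` and `(...)`
    let mut start = 0usize; // byte offset where the current part begins
    let mut i = 0usize;
    while i < bytes.len() {
        match bytes[i] {
            b'<' | b'(' => depth += 1,
            b'>' | b')' => depth = depth.saturating_sub(1),
            b':' if depth == 0 && bytes.get(i + 1) == Some(&b':') => {
                // `:` is ASCII, so slicing at these offsets is always valid.
                parts.push(&qualified[start..i]);
                i += 2;
                start = i;
                continue;
            }
            _ => {}
        }
        i += 1;
    }
    parts.push(&qualified[start..]);
    parts
}
```

With a split like this, the last part would map to `method_name`, the second-to-last to `class_name`, and the remainder to `namespace_path`; that mapping is an assumption about how the structured columns are filled.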
### What this layer does NOT do
- Does not parse the AST returned by `msvc_demangler::parse` — uses the formatted
string and a heuristic split. Adequate for typical class member functions
and RTTI strings; exotic template / lambda forms still get `raw_demangled`
populated but may have NULL structured fields.
- Does not yet ingest RTTI strings discovered in `.rdata` — that's M3's job;
M3 will append rows to this table at the addresses where it finds RTTI
TypeDescriptors.
### Reference docs
- `msvc-demangler` crate (`https://docs.rs/msvc-demangler/0.11`).
- LLVM `MicrosoftDemangle.cpp` (the parser this crate ports).
## Layer M3 — Vtable + RTTI detection (landed)
### Schema additions
- `vtables(address PK, length, col_address NULL, class_name, rtti_present,
base_classes_json NULL)` — every detected static vtable.
- `methods(vtable_address, slot, function_address, mangled_name NULL,
demangled_name NULL, PRIMARY KEY (vtable_address, slot))` — one row per
method slot.
- `classes(name PK, vtable_address, rtti_present, base_classes_json NULL)` —
deduped by class name (first-detected vtable wins).
- Indices: `methods.function_address`, `classes.rtti_present`.
### What this layer does
- Walks `.rdata` and `.data` looking for runs of ≥3 consecutive 4-byte BE
values where each value is a known function start (from M1's corrected
`functions` table). One- and two-method vtables are intentionally rejected to
control the false-positive rate.
- Attempts the MSVC RTTI walk `vtable[-1] → CompleteObjectLocator → TypeDescriptor`
for each candidate. When successful, the demangled `class ClassName`
string fills `class_name` and a best-effort
`RTTIClassHierarchyDescriptor` walk fills `base_classes_json` (JSON array
of base class names).
- Falls back to `ANON_Class_<8-hex>` keyed by FNV-1a hash of the sorted
method-PC tuple when RTTI is absent (typical for shipped game binaries).
Identical vtables across the binary (multiple instances) collapse to the
same anonymous name.
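The run detection and the anonymous-name fallback can be sketched as below. Both functions are hypothetical: the 32-bit FNV-1a width is inferred from the `<8-hex>` suffix, and the byte order fed to the hash and the hex case of the suffix are assumptions.

```rust
use std::collections::HashSet;

/// Return `(address, slot_count)` for each run of at least 3 consecutive
/// big-endian u32 words that are all known function starts.
pub fn find_vtable_runs(data: &[u8], base: u32, starts: &HashSet<u32>) -> Vec<(u32, usize)> {
    let mut runs = Vec::new();
    let mut run_start: Option<usize> = None; // word index where the run began
    let words = data.len() / 4;
    // Iterate one past the end so a run touching the end is flushed too.
    for w in 0..=words {
        let is_fn = w < words && {
            let o = w * 4;
            let v = u32::from_be_bytes([data[o], data[o + 1], data[o + 2], data[o + 3]]);
            starts.contains(&v)
        };
        match (is_fn, run_start) {
            (true, None) => run_start = Some(w),
            (false, Some(s)) => {
                if w - s >= 3 {
                    runs.push((base + (s as u32) * 4, w - s));
                }
                run_start = None;
            }
            _ => {}
        }
    }
    runs
}

/// Anonymous class name from the sorted method PCs.
/// ASSUMPTION: 32-bit FNV-1a over each PC's big-endian bytes, upper-case hex.
pub fn anon_class_name(methods: &[u32]) -> String {
    let mut sorted = methods.to_vec();
    sorted.sort_unstable();
    let mut hash: u32 = 0x811C_9DC5; // FNV-1a 32-bit offset basis
    for pc in sorted {
        for byte in pc.to_be_bytes() {
            hash ^= u32::from(byte);
            hash = hash.wrapping_mul(0x0100_0193); // FNV-1a 32-bit prime
        }
    }
    format!("ANON_Class_{:08X}", hash)
}
```

Because the PCs are sorted before hashing, two vtables holding the same set of method addresses get the same anonymous name, which matches the collapsing behaviour described above.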
### What this layer does NOT do
- Vtables built at runtime in heap-allocated memory (e.g. by ctors copying
static templates) are out of scope — only static `.rdata`/`.data` content.
- Multiple-inheritance "extra" vftables (one per base subobject) are detected
as independent vtables with no link between them.
- Inheritance-tree walking beyond `RTTIClassHierarchyDescriptor`'s direct
base list is not attempted.
### Reference docs
- openrce.org "Reversing Microsoft Visual C++" — RTTI layout articles
(CompleteObjectLocator at vtable[-1]; TypeDescriptor at COL+0xC; mangled
name at TD+0x8).
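Using the offsets listed above, the pointer chase from a vtable to its mangled type name might look like the following. `read_u32_be` and `rtti_mangled_name_va` are hypothetical helpers; the sketch assumes a flat image slice mapped at `base` and big-endian 32-bit pointers, and it performs none of the signature or section checks a real detector would need.

```rust
/// Read a big-endian u32 at virtual address `va` from an image mapped at `base`.
fn read_u32_be(image: &[u8], base: u32, va: u32) -> Option<u32> {
    let off = va.checked_sub(base)? as usize;
    let bytes = image.get(off..off.checked_add(4)?)?;
    Some(u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Follow vtable[-1] -> CompleteObjectLocator -> TypeDescriptor and return
/// the VA of the mangled name (TD+0x8), or None if any read is out of range.
pub fn rtti_mangled_name_va(image: &[u8], base: u32, vtable_va: u32) -> Option<u32> {
    // The COL pointer sits in the word just before the first method slot.
    let col = read_u32_be(image, base, vtable_va.checked_sub(4)?)?;
    // The TypeDescriptor pointer is at COL+0xC.
    let td = read_u32_be(image, base, col.checked_add(0xC)?)?;
    // The NUL-terminated mangled name starts at TD+0x8.
    td.checked_add(0x8)
}
```

A `None` result here simply means the candidate has no walkable RTTI, which is the case where the `ANON_Class_` fallback applies.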
## Layer M4 — Class-aware probe targeting (landed)
CLI extension only — no schema changes. The probe-token grammar adds three
symbolic forms on top of the existing `0xADDR` literal:
- `Class::method` — joins `classes` × `methods` × `demangled_names` to find
every PC whose vtable belongs to that class and whose demangled
`method_name` matches.
- `Class::*` — joins `classes` × `methods` to find every method PC of that
class.
- `function_name` — falls back to `functions.name` lookup for free functions
/ saverestore stubs / labels.
Numeric tokens never touch the DB (preserves zero-IO fast path; lockstep
digest unaffected). Symbolic tokens require the DuckDB at `--probe-db PATH`
or `XENIA_PROBE_DB`; default is `sylpheed.db` next to the .iso when present.
Resolution happens BEFORE guest exec begins, so it cannot affect the
lockstep digest.
See `crates/xenia-analysis/src/lookup.rs`.
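A rough sketch of how the four token forms could be told apart before any DB lookup is shown below. `ProbeToken` and `parse_probe_token` are hypothetical names and are not taken from `lookup.rs`; the precedence (hex literal first, then `::` forms, then a bare name) is an assumption.

```rust
/// Classified probe token (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum ProbeToken {
    /// `0xADDR` literal: resolved without touching the DB.
    Address(u32),
    /// `Class::method`
    ClassMethod { class: String, method: String },
    /// `Class::*`
    ClassAll(String),
    /// Bare `function_name`
    FunctionName(String),
}

pub fn parse_probe_token(tok: &str) -> ProbeToken {
    // Numeric tokens first, so the zero-IO path never needs a DB handle.
    if let Some(hex) = tok.strip_prefix("0x").or_else(|| tok.strip_prefix("0X")) {
        if let Ok(addr) = u32::from_str_radix(hex, 16) {
            return ProbeToken::Address(addr);
        }
    }
    // Split on the LAST `::` so a namespaced class stays intact.
    if let Some((class, method)) = tok.rsplit_once("::") {
        if method == "*" {
            return ProbeToken::ClassAll(class.to_string());
        }
        return ProbeToken::ClassMethod {
            class: class.to_string(),
            method: method.to_string(),
        };
    }
    ProbeToken::FunctionName(tok.to_string())
}
```

Only the three symbolic variants would then go on to query `classes`, `methods`, `demangled_names`, or `functions`; that resolution step is not shown.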
---
## Layer M5 — Indirect-dispatch reachability (landed)
### Schema additions
- New value `'ind_call'` in the `xrefs.kind` set.
- New SQL view `v_indirect_reachability_from_entry`: a strict superset of
`v_reachability_from_entry` whose BFS also follows `ind_call` edges.
### What this layer does
- Walks each `FuncAnalysis.functions` entry with a per-basic-block register
tracker. Recognises the canonical static-vtable pattern:
`lis+addi → lwz off(rA) → mtctr → bcctrl`, where `rA` ends up holding a
known vtable's start address from M3.
- Honours the PowerPC ABI: `bl`-style calls (op 18 / 16 with LK=1) clobber
volatile r0..r12 + ctr but preserve non-volatile r13..r31, so a vtable
pointer parked in r30/r31 before a call survives.
- Treats every M3 `loc_*` label as a basic-block boundary (kills register
state) so jump-IN paths cannot induce false positives.
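The clobber and boundary rules above can be illustrated with a small constant tracker. `RegTracker` and its methods are hypothetical stand-ins for the real per-basic-block tracker; only the `lis` / `addi` / `mtctr` transfers needed for the vtable pattern are modelled, and addresses are treated as 32-bit.

```rust
/// Per-basic-block constant tracker (illustrative stand-in).
#[derive(Debug, Clone)]
pub struct RegTracker {
    pub gpr: [Option<u32>; 32], // None = value unknown
    pub ctr: Option<u32>,
}

impl RegTracker {
    pub fn new() -> Self {
        RegTracker { gpr: [None; 32], ctr: None }
    }

    /// `lis rD, imm`: load the immediate into the upper half-word.
    pub fn lis(&mut self, rd: usize, imm: u16) {
        self.gpr[rd] = Some(u32::from(imm) << 16);
    }

    /// `addi rD, rA, simm`: rA = 0 means the literal 0, not register r0.
    pub fn addi(&mut self, rd: usize, ra: usize, simm: i16) {
        let delta = simm as i32 as u32; // sign-extend the displacement
        self.gpr[rd] = if ra == 0 {
            Some(delta)
        } else {
            self.gpr[ra].map(|v| v.wrapping_add(delta))
        };
    }

    /// `mtctr rS`: copy the tracked value (or unknown) into ctr.
    pub fn mtctr(&mut self, rs: usize) {
        self.ctr = self.gpr[rs];
    }

    /// A `bl`-style call clobbers volatile r0..r12 and ctr; r13..r31 survive.
    pub fn on_call(&mut self) {
        for r in 0..=12 {
            self.gpr[r] = None;
        }
        self.ctr = None;
    }

    /// A label starts a new basic block: forget everything.
    pub fn on_block_boundary(&mut self) {
        *self = RegTracker::new();
    }
}
```

In this model a vtable address built in r31 is still known after an intervening call, while one built in r3 is not, and nothing survives a label.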
### What this layer does NOT do (and observed impact)
- Vtable pointer loaded from a `this`-pointer field
(`lwz r_vt, off(rA)` where `rA = this`) — by far the dominant pattern in
real C++ — is unresolvable without alias / points-to analysis.
- On Sylpheed: the layer detects 0 edges. The binary's 1,001 lis+addi
references into vtables are mostly constructor-side **vptr writes**
(`stw rVtable, vptr_offset(this)`), not direct dispatches. The renderer
hunt's audit-009 cluster therefore needs a future M5.5 with `this`-flow
tracking before this layer surfaces it.
### Reference docs
- IBM PowerPC ABI: register-save convention (volatile r0..r12 + ctr,
non-volatile r13..r31).
## Layer M7 — String / constant-pool detection (landed)
### Schema additions
- New table `strings(address PK, encoding, length, content)`.
- Index `idx_strings_encoding`.
### What this layer does
- Scans `.rdata` for runs of length ≥ 6 of printable ASCII bytes followed by
a NUL terminator.
- Scans `.rdata` for UTF-16LE runs of length ≥ 6 code units (printable-ASCII
basic plane only) followed by a u16 NUL terminator.
- Cross-referencing is implicit: any existing `xrefs` row with `kind='ref'`
whose `target` exactly equals a `strings.address` identifies a referencing
PC. SQL: `SELECT s.content, x.source FROM xrefs x JOIN strings s
ON s.address = x.target WHERE x.kind='ref'`.
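The ASCII half of the scan can be sketched as follows. `scan_ascii_strings` is a hypothetical function, and treating exactly `0x20..=0x7E` as printable is an assumption; the real scanner may also accept tab or newline bytes, which embedded shader source would contain.

```rust
/// Return `(address, content)` for every run of at least `min_len` printable
/// ASCII bytes that is immediately followed by a NUL terminator.
pub fn scan_ascii_strings(data: &[u8], base: u32, min_len: usize) -> Vec<(u32, String)> {
    let mut out = Vec::new();
    let mut start = 0usize; // offset where the current printable run began
    for (i, &b) in data.iter().enumerate() {
        if (0x20..=0x7E).contains(&b) {
            continue; // still inside a printable run
        }
        // The run data[start..i] ended at a non-printable byte; keep it only
        // if that byte is the NUL terminator and the run is long enough.
        if b == 0 && i - start >= min_len {
            let text = String::from_utf8_lossy(&data[start..i]).into_owned();
            out.push((base + start as u32, text));
        }
        start = i + 1;
    }
    // A run that reaches the end of the buffer has no terminator and is dropped.
    out
}
```

Calling it with `min_len = 6` mirrors the threshold stated above; the returned addresses are the values that `xrefs.target` would be matched against.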
### What this layer does NOT do
- No UTF-8 multibyte / non-ASCII basic plane in either encoding.
- No `.data` scan (read-only-section bias).
- No multi-byte CJK encodings — Japanese text in localised builds may be
represented in shift_jis / utf-8 with non-printable bytes that this
scanner skips.
### Sylpheed yield
- 6,311 ASCII strings (including full embedded HLSL shader source).
- 0 UTF-16LE strings (binary uses ASCII / native CJK encoding).
- 9,132 lis+addi sites reference the detected strings, which identifies the
source PCs that use each string.
## Layer M6 — Extended store-class xrefs + `addr_mode` column (landed)
### Schema additions
- `xrefs.addr_mode VARCHAR NULL` — sub-classifies how the source instruction
computes its target. NULL for control-flow edges (call / ind_call / j /
br); one of the following tags for data edges:
- `d_form` — standard signed-16 displacement (lwz/stw/lfs/stfs/etc.)
- `lis_addi` — address materialised via `lis + addi` register tracking
- `lis_ori` — address materialised via `lis + ori`
- `multiword` — `lmw / stmw` (one xref per slot; up to 32-rS slots)
- `x_form_indexed` — `stwx / stbx / sthx / stwux / stbux / sthux / stdx /
stdux / lwzx / lbzx / lhzx / lhax / lwzux / lbzux / lhzux / lhaux / ldx /
ldux` — emitted only when both rA and rB are tracked constants
- `x_form_byterev` — `stwbrx / sthbrx / lwbrx / lhbrx`
- `atomic` — `stwcx. / stdcx.` reservation-conditional stores
- `dcbz` — cache-line clear (32-byte zero at rA+rB)
- Index `idx_xrefs_addr_mode`.
### What this layer does
- Tags every existing data xref with its addressing mode (`d_form` for the
bulk; `lis_addi` / `lis_ori` for the lift-and-add cases that produce
DataRef rows).
- Adds new dispatch for opcode 47 (`stmw`) and 46 (`lmw`), expanding to
per-slot DataWrite / DataRead rows.
- Adds new dispatch for opcode 31 X-form: loads, stores, atomic,
byte-reverse, dcbz. X-form rows are emitted ONLY when both rA and rB resolve
to known constants (otherwise the address is runtime-dependent and we skip).
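The tagging and expansion rules can be sketched as below. This `AddrMode` is an illustrative stand-in for `xenia_analysis::xref::AddrMode`, whose actual variant names are not documented here; `classify_x_form`, `multiword_slots`, and `x_form_target` are hypothetical helpers, and the real dispatch works on extended opcodes rather than mnemonic strings.

```rust
/// Illustrative stand-in for the addressing-mode enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AddrMode {
    DForm, LisAddi, LisOri, Multiword, XFormIndexed, XFormByterev, Atomic, Dcbz,
}

impl AddrMode {
    /// Tag string as stored in `xrefs.addr_mode`.
    pub fn as_tag(self) -> &'static str {
        match self {
            AddrMode::DForm => "d_form",
            AddrMode::LisAddi => "lis_addi",
            AddrMode::LisOri => "lis_ori",
            AddrMode::Multiword => "multiword",
            AddrMode::XFormIndexed => "x_form_indexed",
            AddrMode::XFormByterev => "x_form_byterev",
            AddrMode::Atomic => "atomic",
            AddrMode::Dcbz => "dcbz",
        }
    }
}

/// Map an opcode-31 mnemonic to its tag (mnemonic matching is a simplification).
pub fn classify_x_form(mnemonic: &str) -> Option<AddrMode> {
    match mnemonic {
        "stwcx." | "stdcx." => Some(AddrMode::Atomic),
        "stwbrx" | "sthbrx" | "lwbrx" | "lhbrx" => Some(AddrMode::XFormByterev),
        "dcbz" => Some(AddrMode::Dcbz),
        "stwx" | "stbx" | "sthx" | "stwux" | "stbux" | "sthux" | "stdx" | "stdux"
        | "lwzx" | "lbzx" | "lhzx" | "lhax" | "lwzux" | "lbzux" | "lhzux" | "lhaux"
        | "ldx" | "ldux" => Some(AddrMode::XFormIndexed),
        _ => None,
    }
}

/// `lmw` / `stmw` touch one 4-byte slot per register from `first_reg` to r31.
pub fn multiword_slots(ea: u32, first_reg: u8) -> Vec<u32> {
    let count = 32u32.saturating_sub(u32::from(first_reg));
    (0..count).map(|i| ea.wrapping_add(4 * i)).collect()
}

/// An X-form target is known only when BOTH operands are tracked constants.
pub fn x_form_target(ra: Option<u32>, rb: Option<u32>) -> Option<u32> {
    Some(ra?.wrapping_add(rb?))
}
```

For example, an `stmw` whose first register is r29 yields three rows (r29, r30, r31), and an X-form access with an untracked rB yields no row at all, which is the skip behaviour described above.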
### What this layer does NOT do
- VMX / VMX128 vector stores (opcode 31 with vector XO codes) are not
emitted — they always have register-indexed addresses that the
lis+addi tracker can't usually resolve, and detecting them adds noise
without improving target resolution.
- The dominant runtime-indexed `stwx` pattern (rA = base, rB = runtime index)
is not resolved, by design; mem-watch covers the runtime side per VERIFY-B.
### Sylpheed yield
- 28,834 `lis_addi` refs, 18,485 `d_form` reads, 3,288 `d_form` writes —
the existing baseline now properly tagged.
- **442 newly-detected `x_form_indexed` reads** — primarily lwzx/lhzx
reads from in-table dispatch (each pair (rA,rB) resolved statically).
- **40 newly-detected `atomic` writes** — every `stwcx.` site with a
resolvable address; useful for reservation-table audits.
- 9 `lis_ori` refs.
- 0 multiword / dcbz / byterev — these instructions exist in the binary
but are not in lis+addi-tracked code paths.
## Forward work (M8-M12, not yet landed)
- **M8** — dispatch-table heuristics beyond vtables (e.g. function-pointer arrays in `.data`).
- **M9** — `__CxxFrameHandler` exception scope-table parsing.
- **M10** — `.tls` section / TLS slot tracking.
- **M11** — `__xc_a` / `__xc_z` static-initializer driver detection.
- **M12** — comparative-PC-trace mode for canary diff (runtime side, not analyzer).