# `xenia-analysis` schema reference

Authoritative documentation for the DuckDB tables and SQL views produced by `xenia-rs dis --db sylpheed.db`. Track schema changes here alongside any update to the `db_schema_golden` test fixture.

The base + disasm tables (`metadata`, `sections`, `imports`, `functions`, `labels`, `instructions`, `xrefs`, opt-in `exec_trace` / `import_calls` / `branch_trace`) are documented inline in the `src/db.rs` doc comment. This file collects layered analysis additions and forward-work notes.

---

## Layer M1 — `.pdata` boundary correction (landed)

### Schema additions

- `functions.pdata_validated BOOLEAN NOT NULL` — `true` when the row's `address` matches a `RUNTIME_FUNCTION.BeginAddress` from `.pdata`. Linker ground truth.
- `functions.pdata_length BIGINT NULL` — `function_length` (bytes) from the matching pdata entry; `NULL` when the row is prologue-only.
- New table `pdata_entries(begin_address BIGINT PRIMARY KEY, end_address BIGINT, function_length BIGINT, prolog_length BIGINT, flags BIGINT)` — every parsed `.pdata` `RUNTIME_FUNCTION` entry (raw, before any merge with prologue analysis).
- Index `idx_functions_pdata_validated` on `functions(pdata_validated)`.

### What this layer does

- Parses `.pdata` 8-byte `RUNTIME_FUNCTION` entries (PowerPC PE32 layout): word 0 `BeginAddress` (absolute VA), word 1 packed `{prolog_length:8, function_length:22, flags:2}`, both big-endian.
- Unions pdata `BeginAddress` values into the function-candidate set fed to the prologue walker, so functions our prologue heuristic missed still get rows.
- When pdata supplies a longer `function_length` than the prologue walk found, extends `end_address` to the pdata-implied end (catches mis-split where the walker stopped at an early `blr`).
- After the walker, performs a forward pass that trims `function.end` to the next start when they overlap (catches mis-merge where one row spanned two prologues — the audit-031 `sub_824D23B0` / `sub_824D29F0` case).
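The word-1 unpacking described above can be sketched as follows. This is a minimal illustration, not the analyzer's actual parser: the `PdataEntry` struct and `decode_pdata_entry` function are hypothetical names, and the sketch assumes the three fields are packed starting from the least-significant bit of word 1 (`prolog_length` lowest, `flags` highest). It returns the raw field values and leaves any unit conversion to the caller.

```rust
/// One decoded `.pdata` entry. Field names mirror the `pdata_entries` columns;
/// the struct itself is a hypothetical illustration.
#[derive(Debug, PartialEq, Eq)]
struct PdataEntry {
    begin_address: u32,
    prolog_length: u32,
    function_length: u32,
    flags: u32,
}

/// Decode one 8-byte entry made of two big-endian 32-bit words.
/// ASSUMPTION: word 1 is packed from the least-significant bit upward:
/// prolog_length in bits 0..8, function_length in bits 8..30, flags in bits 30..32.
fn decode_pdata_entry(raw: &[u8; 8]) -> PdataEntry {
    let word0 = u32::from_be_bytes([raw[0], raw[1], raw[2], raw[3]]);
    let word1 = u32::from_be_bytes([raw[4], raw[5], raw[6], raw[7]]);
    PdataEntry {
        begin_address: word0,
        prolog_length: word1 & 0xFF,              // low 8 bits
        function_length: (word1 >> 8) & 0x3F_FFFF, // next 22 bits
        flags: word1 >> 30,                        // top 2 bits
    }
}
```

Under that bit-order assumption, the bytes `82 4D 29 F0 40 01 90 04` decode to `begin_address = 0x824D29F0`, `prolog_length = 4`, `function_length = 400`, `flags = 1`.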
### What this layer does NOT do

- Does not adjust prolog-derived `frame_size` / `saved_gprs` from `.pdata`'s `prolog_length` field — those remain prologue-only inferences.
- Does not classify functions further than the existing `is_leaf` / `is_saverestore` columns. Class membership is M3.
- Does not detect functions whose entries are missing from BOTH `.pdata` and the bl-target scan (extremely rare; would require executable-byte linear sweep).

### Reference docs

- Microsoft PE32+ exception data spec for PowerPC RUNTIME_FUNCTION.
- xenia-canary `src/xenia/cpu/xex_module.cc:1570-1587` — canary's reference parser (extracts `BeginAddress` only; we additionally decode word 1).

### Validation queries

```sql
-- All pdata entries found
SELECT COUNT(*) FROM pdata_entries;  -- ~23073 for Sylpheed

-- Functions cross-validated against pdata
SELECT COUNT(*) FROM functions WHERE pdata_validated;

-- Functions detected ONLY by prologue (orphans of pdata)
SELECT COUNT(*) FROM functions WHERE NOT pdata_validated;

-- Pdata orphans NOT yet in functions (should be 0 after this layer)
SELECT COUNT(*) FROM pdata_entries p
LEFT JOIN functions f ON f.address = p.begin_address
WHERE f.address IS NULL;

-- Audit-031 mis-merge resolved: 0x824D29F0 should have its own row
SELECT name FROM functions WHERE address = 2186095088;  -- 0x824D29F0
```

---

## Layer M2 — MSVC C++ name demangler (landed)

### Schema additions

- New table `demangled_names(address BIGINT NULL, mangled VARCHAR NOT NULL, raw_demangled VARCHAR NOT NULL, namespace_path VARCHAR NULL, class_name VARCHAR NULL, method_name VARCHAR NULL, params_signature VARCHAR NULL)`.
- Indices on `address`, `class_name`, `method_name`.

### What this layer does

- Wraps `msvc_demangler::demangle` (a Rust port of LLVM's `MicrosoftDemangle.cpp`) and splits the formatted output into structured fields via a heuristic top-level parser (handles templates and nested parens correctly).
- Populates `demangled_names` from any label whose name starts with `?` plus any import name that happens to be mangled (defensive — typical kernel imports use C names).

### What this layer does NOT do

- Does not parse the AST returned by `msvc_demangler::parse` — uses the formatted string and a heuristic split. Adequate for typical class member functions and RTTI strings; exotic template / lambda forms still get `raw_demangled` populated but may have NULL structured fields.
- Does not yet ingest RTTI strings discovered in `.rdata` — that's M3's job; M3 will append rows to this table at the addresses where it finds RTTI TypeDescriptors.

### Reference docs

- `msvc-demangler` crate (`https://docs.rs/msvc-demangler/0.11`).
- LLVM `MicrosoftDemangle.cpp` (the parser this crate ports).

## Layer M3 — Vtable + RTTI detection (landed)

### Schema additions

- `vtables(address PK, length, col_address NULL, class_name, rtti_present, base_classes_json NULL)` — every detected static vtable.
- `methods(vtable_address, slot, function_address, mangled_name NULL, demangled_name NULL, PRIMARY KEY (vtable_address, slot))` — one row per method slot.
- `classes(name PK, vtable_address, rtti_present, base_classes_json NULL)` — deduped by class name (first-detected vtable wins).
- Indices: `methods.function_address`, `classes.rtti_present`.

### What this layer does

- Walks `.rdata` and `.data` looking for runs of ≥3 consecutive 4-byte BE values where each value is a known function start (from M1's corrected `functions` table). One- and two-method vtables are intentionally rejected to control the false-positive rate.
- Attempts the MSVC RTTI walk `vtable[-1] → CompleteObjectLocator → TypeDescriptor` for each candidate. When successful, the demangled `class ClassName` string fills `class_name` and a best-effort `RTTIClassHierarchyDescriptor` walk fills `base_classes_json` (JSON array of base class names).
- Falls back to `ANON_Class_<8-hex>` keyed by FNV-1a hash of the sorted method-PC tuple when RTTI is absent (typical for shipped game binaries). Identical vtables across the binary (multiple instances) collapse to the same anonymous name.

### What this layer does NOT do

- Vtables built at runtime in heap-allocated memory (e.g. by ctors copying static templates) are out of scope — only static `.rdata`/`.data` content.
- Multiple-inheritance "extra" vftables (one per base subobject) are detected as independent vtables with no link between them.
- Inheritance-tree walking beyond `RTTIClassHierarchyDescriptor`'s direct base list is not attempted.

### Reference docs

- openrce.org "Reversing Microsoft Visual C++" — RTTI layout articles (CompleteObjectLocator at vtable[-1]; TypeDescriptor at COL+0xC; mangled name at TD+0x8).

## Layer M4 — Class-aware probe targeting (landed)

CLI extension only — no schema changes. The probe-token grammar adds three symbolic forms on top of the existing `0xADDR` literal:

- `Class::method` — joins `classes` × `methods` × `demangled_names` to find every PC whose vtable belongs to that class and whose demangled `method_name` matches.
- `Class::*` — joins `classes` × `methods` to find every method PC of that class.
- `function_name` — falls back to `functions.name` lookup for free functions / saverestore stubs / labels.

Numeric tokens never touch the DB (preserves zero-IO fast path; lockstep digest unaffected). Symbolic tokens require the DuckDB at `--probe-db PATH` or `XENIA_PROBE_DB`; default is `sylpheed.db` next to the .iso when present. Resolution happens BEFORE guest exec begins, so it cannot affect the lockstep digest. See `crates/xenia-analysis/src/lookup.rs`.

---

## Layer M5 — Indirect-dispatch reachability (landed)

### Schema additions

- New value `'ind_call'` in the `xrefs.kind` set.
- New SQL view `v_indirect_reachability_from_entry` — strict superset of `v_reachability_from_entry`, taking `ind_call` edges in the BFS.
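The superset relationship between the two views can be modelled as one breadth-first search that optionally admits `ind_call` edges. The sketch below is a Rust model of that traversal, not the SQL view definition. The `Edge` struct and `reachable` function are hypothetical, and it assumes the direct view follows edges of kind `'call'` (the doc does not name that kind).

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// A call edge; `kind` mirrors the `xrefs.kind` column.
/// ASSUMPTION: direct calls are stored with kind "call".
struct Edge {
    source: u32,
    target: u32,
    kind: &'static str,
}

/// Breadth-first reachability from `entry`, following only edges
/// whose kind appears in `kinds`.
fn reachable(edges: &[Edge], entry: u32, kinds: &[&str]) -> BTreeSet<u32> {
    // Build an adjacency list restricted to the admitted edge kinds.
    let mut adj: HashMap<u32, Vec<u32>> = HashMap::new();
    for e in edges.iter().filter(|e| kinds.contains(&e.kind)) {
        adj.entry(e.source).or_default().push(e.target);
    }
    let mut seen = BTreeSet::from([entry]);
    let mut queue = VecDeque::from([entry]);
    while let Some(pc) = queue.pop_front() {
        for &next in adj.get(&pc).into_iter().flatten() {
            // `insert` returns true only for first visits.
            if seen.insert(next) {
                queue.push_back(next);
            }
        }
    }
    seen
}
```

Calling `reachable(&edges, entry, &["call"])` models the direct view, and `reachable(&edges, entry, &["call", "ind_call"])` models the indirect one. Because the second call only adds edges, its result always contains the first.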
### What this layer does

- Walks each `FuncAnalysis.functions` entry with a per-basic-block register tracker. Recognises the canonical static-vtable pattern: `lis+addi → lwz off(rA) → mtctr → bcctrl`, where `rA` ends up holding a known vtable's start address from M3.
- Honours the PowerPC ABI: `bl`-style calls (op 18 / 16 with LK=1) clobber volatile r0..r12 + ctr but preserve non-volatile r13..r31, so a vtable pointer parked in r30/r31 before a call survives.
- Treats every M3 `loc_*` label as a basic-block boundary (kills register state) so jump-IN paths cannot induce false positives.

### What this layer does NOT do (and observed impact)

- Vtable pointer loaded from a `this`-pointer field (`lwz r_vt, off(rA)` where `rA = this`) — by far the dominant pattern in real C++ — is unresolvable without alias / points-to analysis.
- On Sylpheed: the layer detects 0 edges. The binary's 1,001 lis+addi references into vtables are mostly constructor-side **vptr writes** (`stw rVtable, vptr_offset(this)`), not direct dispatches. The renderer hunt's audit-009 cluster therefore needs a future M5.5 with `this`-flow tracking before this layer surfaces it.

### Reference docs

- IBM PowerPC ABI: register-save convention (volatile r0..r12 + ctr, non-volatile r13..r31).

## Layer M7 — String / constant-pool detection (landed)

### Schema additions

- New table `strings(address PK, encoding, length, content)`.
- Index `idx_strings_encoding`.

### What this layer does

- Scans `.rdata` for runs of length ≥ 6 of printable ASCII bytes followed by a NUL terminator.
- Scans `.rdata` for UTF-16LE runs of length ≥ 6 code units (printable-ASCII basic plane only) followed by a u16 NUL terminator.
- Cross-reference is implicit: existing `xrefs.kind='ref'` rows whose `target` exactly matches a `strings.address` identify the referencing PCs. SQL: `SELECT s.content, x.source FROM xrefs x JOIN strings s ON s.address = x.target WHERE x.kind='ref'`.
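The ASCII half of the scan above can be sketched as a single pass over the section bytes. This is an illustration under stated assumptions, not the analyzer's code: `scan_ascii` is a hypothetical name, "printable" is assumed to mean bytes `0x20..=0x7E` (the real scanner may also accept tabs or newlines), and the function returns section-relative offsets rather than guest addresses.

```rust
/// Minimum run length, taken from the M7 description.
const MIN_LEN: usize = 6;

/// Return (offset, content) for each run of at least MIN_LEN printable
/// ASCII bytes that is immediately followed by a NUL terminator.
/// ASSUMPTION: "printable" means 0x20..=0x7E.
fn scan_ascii(section: &[u8]) -> Vec<(usize, String)> {
    let mut found = Vec::new();
    let mut start = 0; // offset where the current candidate run begins
    for (i, &b) in section.iter().enumerate() {
        if (0x20..=0x7E).contains(&b) {
            continue; // still inside a candidate run
        }
        // A NUL closes the run; keep it only if it is long enough.
        if b == 0 && i - start >= MIN_LEN {
            let text = String::from_utf8_lossy(&section[start..i]).into_owned();
            found.push((start, text));
        }
        // Any non-printable byte (NUL or otherwise) ends the run.
        start = i + 1;
    }
    found
}
```

A run that reaches the end of the section without a terminator is dropped, which matches the "followed by a NUL terminator" rule. Adding the section's base address to each offset would give a value comparable to `strings.address`.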
### What this layer does NOT do

- No UTF-8 multibyte / non-ASCII basic plane in either encoding.
- No `.data` scan (read-only-section bias).
- No multi-byte CJK encodings — Japanese text in localised builds may be represented in shift_jis / utf-8 with non-printable bytes that this scanner skips.

### Sylpheed yield

- 6,311 ASCII strings (including full embedded HLSL shader source).
- 0 UTF-16LE strings (binary uses ASCII / native CJK encoding).
- 9,132 lis+addi sites cross-reference into the detected strings, which names the source PCs that reference each string.

## Forward work (M6, M8–M12, not yet landed)

- **M6** — extended `xrefs.kind='write'` for indexed/byte-reverse/multiword/VMX/DCBZ/atomic stores with `addr_mode` column.
- **M8** — dispatch-table heuristics beyond vtables (e.g. function-pointer arrays in `.data`).
- **M9** — `__CxxFrameHandler` exception scope-table parsing.
- **M10** — `.tls` section / TLS slot tracking.
- **M11** — `__xc_a` / `__xc_z` static-initializer driver detection.
- **M12** — comparative-PC-trace mode for canary diff (runtime side, not analyzer).