xenia-rs/crates/xenia-analysis/SCHEMA.md
MechaCat02 1d6c51fbf8 M3: vtable scan + MSVC RTTI walk + 3 new tables
Adds detection of statically-allocated MSVC vtables in .rdata/.data:
- New `xenia_analysis::vtables` walks read-only sections looking for runs of
  ≥3 contiguous big-endian u32 values where each value lands on a known
  function start (from M1's corrected functions table). 2-slot runs are
  rejected to keep false-positive rate down.
- For each candidate the MSVC RTTI walk vtable[-1] → CompleteObjectLocator
  → TypeDescriptor → mangled name is attempted; on success the demangled
  class name is recorded along with a best-effort RTTIClassHierarchyDescriptor
  walk to fill base_classes_json. On failure (RTTI stripped — common for
  shipped game binaries) the class is named ANON_Class_<fnv1a-hash> keyed
  by sorted method-PC list, so identical vtables collapse to one entry.
- DB: new tables `vtables`, `methods`, `classes` with indices on
  function_address and rtti_present. `write_analysis_results` takes a
  `&[Vtable]` slice; `write_disasm` (back-compat) passes empty.
- cmd_dis wires the scan after xref analysis using
  `func_analysis.functions.keys()` as the function-start oracle.

Validation on Sylpheed (RTTI stripped, as expected): 722 vtables / 499
unique classes / 5571 methods. Sanity invariant: every methods.function_address
joins to functions.address (0 broken refs). Largest vtable: 131 slots.

Tests 617→621 (+4 vtable unit tests covering 3-slot detect, 2-slot reject,
synth name stability, and synth name divergence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:17:45 +02:00


xenia-analysis schema reference

Authoritative documentation for the DuckDB tables and SQL views produced by xenia-rs dis --db sylpheed.db. Track schema changes here alongside any update to the db_schema_golden test fixture.

The base + disasm tables (metadata, sections, imports, functions, labels, instructions, xrefs, and the opt-in exec_trace / import_calls / branch_trace) are documented inline in the src/db.rs doc comment. This file collects the layered analysis additions and forward-work notes.


Layer M1 — .pdata boundary correction (landed)

Schema additions

  • functions.pdata_validated BOOLEAN NOT NULL: true when the row's address matches a RUNTIME_FUNCTION.BeginAddress from .pdata. Linker ground truth.
  • functions.pdata_length BIGINT NULL: function_length (bytes) from the matching pdata entry; NULL when the row is prologue-only.
  • New table pdata_entries(begin_address BIGINT PRIMARY KEY, end_address BIGINT, function_length BIGINT, prolog_length BIGINT, flags BIGINT): every parsed .pdata RUNTIME_FUNCTION entry (raw, before any merge with prologue analysis).
  • Index idx_functions_pdata_validated on functions(pdata_validated).

What this layer does

  • Parses .pdata 8-byte RUNTIME_FUNCTION entries (PowerPC PE32 layout): word 0 BeginAddress (absolute VA), word 1 packed {prolog_length:8, function_length:22, flags:2}, both big-endian.
  • Unions pdata BeginAddress values into the function-candidate set fed to the prologue walker, so functions our prologue heuristic missed still get rows.
  • When pdata supplies a longer function_length than the prologue walk found, extends end_address to the pdata-implied end (catches mis-split where the walker stopped at an early blr).
  • After the walker, performs a forward pass that trims function.end to the next start when they overlap (catches mis-merge where one row spanned two prologues — the audit-031 sub_824D23B0 / sub_824D29F0 case).
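
The entry decode and the forward trim pass described above can be sketched roughly as follows. This is an illustration, not the code in this crate: the names PdataEntry, decode_pdata_entry and trim_overlaps are hypothetical, and the sketch assumes the three bitfields of word 1 are packed LSB-first in the order listed (prolog_length in the low 8 bits). That bit order should be checked against the real parser before relying on it.

```rust
/// One decoded `.pdata` entry (hypothetical type; field names follow the
/// `pdata_entries` columns).
#[derive(Debug, PartialEq)]
struct PdataEntry {
    begin_address: u32,
    prolog_length: u32,
    function_length: u32,
    flags: u32,
}

/// Decode one 8-byte entry made of two big-endian u32 words.
/// ASSUMPTION: word 1 packs its fields LSB-first as
/// {prolog_length:8, function_length:22, flags:2}.
fn decode_pdata_entry(raw: &[u8; 8]) -> PdataEntry {
    let begin_address = u32::from_be_bytes([raw[0], raw[1], raw[2], raw[3]]);
    let word1 = u32::from_be_bytes([raw[4], raw[5], raw[6], raw[7]]);
    PdataEntry {
        begin_address,
        prolog_length: word1 & 0xFF,
        function_length: (word1 >> 8) & 0x3F_FFFF,
        flags: word1 >> 30,
    }
}

/// Forward pass: clamp each function's end to the next function's start
/// when the two overlap. `funcs` holds (start, end) pairs sorted by start.
fn trim_overlaps(funcs: &mut [(u32, u32)]) {
    for i in 0..funcs.len().saturating_sub(1) {
        let next_start = funcs[i + 1].0;
        if funcs[i].1 > next_start {
            funcs[i].1 = next_start;
        }
    }
}
```

With the audit-031 pair as input, a row starting at 0x824D23B0 whose end runs past 0x824D29F0 would be clamped to end at 0x824D29F0, leaving the second function with its own row.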

What this layer does NOT do

  • Does not adjust prolog-derived frame_size / saved_gprs from .pdata's prolog_length field — those remain prologue-only inferences.
  • Does not classify functions further than the existing is_leaf / is_saverestore columns. Class membership is M3.
  • Does not detect functions whose entries are missing from BOTH .pdata and the bl-target scan (extremely rare; would require executable-byte linear sweep).

Reference docs

  • Microsoft PE32+ exception data spec for PowerPC RUNTIME_FUNCTION.
  • xenia-canary src/xenia/cpu/xex_module.cc:1570-1587 — canary's reference parser (extracts BeginAddress only; we additionally decode word 1).

Validation queries

-- All pdata entries found
SELECT COUNT(*) FROM pdata_entries;            -- ~23073 for Sylpheed
-- Functions cross-validated against pdata
SELECT COUNT(*) FROM functions WHERE pdata_validated;
-- Functions detected ONLY by prologue (orphans of pdata)
SELECT COUNT(*) FROM functions WHERE NOT pdata_validated;
-- Pdata orphans NOT yet in functions (should be 0 after this layer)
SELECT COUNT(*) FROM pdata_entries p
LEFT JOIN functions f ON f.address = p.begin_address
WHERE f.address IS NULL;
-- Audit-031 mis-merge resolved: 0x824D29F0 should have its own row
SELECT name FROM functions WHERE address = 2186095088;  -- 0x824D29F0

Layer M2 — MSVC C++ name demangler (landed)

Schema additions

  • New table demangled_names(address BIGINT NULL, mangled VARCHAR NOT NULL, raw_demangled VARCHAR NOT NULL, namespace_path VARCHAR NULL, class_name VARCHAR NULL, method_name VARCHAR NULL, params_signature VARCHAR NULL).
  • Indices on address, class_name, method_name.

What this layer does

  • Wraps msvc_demangler::demangle (a Rust port of LLVM's MicrosoftDemangle.cpp) and splits the formatted output into structured fields via a heuristic top-level parser (handles templates and nested parens correctly).
  • Populates demangled_names from any label whose name starts with ? plus any import name that happens to be mangled (defensive — typical kernel imports use C names).
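
A heuristic top-level split of the kind mentioned above might look like the sketch below. It is not the crate's parser: split_top_level is a hypothetical name, and the sketch only tracks nesting depth for angle brackets and parentheses so that "::" inside template arguments or parameter lists is not treated as a separator.

```rust
/// Split a demangled qualified name on `::` separators that sit outside any
/// `<...>` or `(...)` nesting. Hypothetical helper for illustration only.
fn split_top_level(name: &str) -> Vec<&str> {
    let bytes = name.as_bytes();
    let mut parts = Vec::new();
    let mut depth = 0usize; // current nesting depth of <> and ()
    let mut start = 0usize; // byte offset where the current part begins
    let mut i = 0usize;
    while i < bytes.len() {
        match bytes[i] {
            b'<' | b'(' => depth += 1,
            b'>' | b')' => depth = depth.saturating_sub(1),
            b':' if depth == 0 && bytes.get(i + 1) == Some(&b':') => {
                parts.push(&name[start..i]);
                i += 1; // step over the second colon
                start = i + 1;
            }
            _ => {}
        }
        i += 1;
    }
    parts.push(&name[start..]);
    parts
}
```

For a name such as ns::Foo<a::b>::bar(std::string), the last part would feed method_name and params_signature, the part before it class_name, and anything earlier namespace_path. Names containing operator<, operator> or operator-> would confuse this simple depth counter, which is one reason a heuristic split can leave the structured fields NULL.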

What this layer does NOT do

  • Does not parse the AST returned by msvc_demangler::parse — uses the formatted string and a heuristic split. Adequate for typical class member functions and RTTI strings; exotic template / lambda forms still get raw_demangled populated but may have NULL structured fields.
  • Does not yet ingest RTTI strings discovered in .rdata — that's M3's job; M3 will append rows to this table at the addresses where it finds RTTI TypeDescriptors.

Reference docs

  • msvc-demangler crate (https://docs.rs/msvc-demangler/0.11).
  • LLVM MicrosoftDemangle.cpp (the parser this crate ports).

Layer M3 — Vtable + RTTI detection (landed)

Schema additions

  • vtables(address PK, length, col_address NULL, class_name, rtti_present, base_classes_json NULL) — every detected static vtable.
  • methods(vtable_address, slot, function_address, mangled_name NULL, demangled_name NULL, PRIMARY KEY (vtable_address, slot)) — one row per method slot.
  • classes(name PK, vtable_address, rtti_present, base_classes_json NULL) — deduped by class name (first-detected vtable wins).
  • Indices: methods.function_address, classes.rtti_present.

What this layer does

  • Walks .rdata and .data looking for runs of ≥3 consecutive 4-byte big-endian values where each value is a known function start (from M1's corrected functions table). Runs of only two slots are intentionally rejected to control the false-positive rate.
  • Attempts the MSVC RTTI walk vtable[-1] → CompleteObjectLocator → TypeDescriptor for each candidate. On success, the demangled class name fills class_name and a best-effort RTTIClassHierarchyDescriptor walk fills base_classes_json (a JSON array of base class names).
  • Falls back to ANON_Class_<8-hex> keyed by FNV-1a hash of the sorted method-PC tuple when RTTI is absent (typical for shipped game binaries). Identical vtables across the binary (multiple instances) collapse to the same anonymous name.
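
The run detection and the anonymous-name fallback described above can be sketched as follows. The function names are hypothetical and several details are assumptions rather than facts about xenia_analysis::vtables: the scan steps in 4-byte strides from the start of the section, each PC is hashed as 4 big-endian bytes, the hash is the 32-bit FNV-1a variant, and the hex digits are lowercase.

```rust
use std::collections::HashSet;

/// Find runs of at least `min_slots` consecutive big-endian u32 values that
/// all land on known function starts. Returns (byte offset, slot values).
fn scan_vtables(
    section: &[u8],
    starts: &HashSet<u32>,
    min_slots: usize,
) -> Vec<(usize, Vec<u32>)> {
    let mut found = Vec::new();
    let mut run: Vec<u32> = Vec::new();
    let mut run_start = 0usize;
    for (idx, chunk) in section.chunks_exact(4).enumerate() {
        let value = u32::from_be_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        if starts.contains(&value) {
            if run.is_empty() {
                run_start = idx * 4;
            }
            run.push(value);
        } else if !run.is_empty() {
            // The run just ended: keep it only if it is long enough.
            if run.len() >= min_slots {
                found.push((run_start, std::mem::take(&mut run)));
            } else {
                run.clear();
            }
        }
    }
    if run.len() >= min_slots {
        found.push((run_start, run));
    }
    found
}

/// Fallback name for a vtable without RTTI: 32-bit FNV-1a over the sorted
/// method PCs. ASSUMPTION: byte order and hex casing are illustrative.
fn anon_class_name(slots: &[u32]) -> String {
    let mut sorted = slots.to_vec();
    sorted.sort_unstable();
    let mut hash: u32 = 0x811C_9DC5; // FNV-1a 32-bit offset basis
    for pc in sorted.iter() {
        for byte in pc.to_be_bytes().iter() {
            hash ^= u32::from(*byte);
            hash = hash.wrapping_mul(0x0100_0193); // FNV-1a 32-bit prime
        }
    }
    format!("ANON_Class_{:08x}", hash)
}
```

Because the PCs are sorted before hashing, two tables with the same set of methods get the same name regardless of slot order. The sketch makes no attempt to split two tables that sit back to back with no non-function word between them.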

What this layer does NOT do

  • Vtables built at runtime in heap-allocated memory (e.g. by ctors copying static templates) are out of scope — only static .rdata/.data content.
  • Multiple-inheritance "extra" vftables (one per base subobject) are detected as independent vtables with no link between them.
  • Inheritance-tree walking beyond RTTIClassHierarchyDescriptor's direct base list is not attempted.

Reference docs

  • openrce.org "Reversing Microsoft Visual C++" — RTTI layout articles (CompleteObjectLocator at vtable[-1]; TypeDescriptor at COL+0xC; mangled name at TD+0x8).
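
Using only the offsets cited above (CompleteObjectLocator pointer at vtable[-1], TypeDescriptor pointer at COL+0xC, mangled name at TD+0x8), the pointer chase can be sketched like this. The Image type and rtti_mangled_name are hypothetical stand-ins for whatever memory view the analyzer really uses. The sketch assumes 32-bit absolute VAs stored big-endian, and adds a ".?A" prefix check as an extra sanity filter that the source does not mention.

```rust
/// Hypothetical flat view of mapped image bytes: `bytes[0]` lives at VA `base`.
struct Image<'a> {
    base: u32,
    bytes: &'a [u8],
}

impl<'a> Image<'a> {
    /// Bytes from `va` to the end of the view, or None if `va` is unmapped.
    fn slice_from(&self, va: u32) -> Option<&'a [u8]> {
        let off = va.checked_sub(self.base)? as usize;
        self.bytes.get(off..)
    }

    fn read_u32_be(&self, va: u32) -> Option<u32> {
        let b = self.slice_from(va)?.get(..4)?;
        Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
    }

    /// NUL-terminated string at `va`, if it is mapped and valid UTF-8.
    fn read_cstr(&self, va: u32) -> Option<&'a str> {
        let tail = self.slice_from(va)?;
        let len = tail.iter().position(|&c| c == 0)?;
        std::str::from_utf8(&tail[..len]).ok()
    }
}

/// Follow vtable[-1] -> COL -> TypeDescriptor -> mangled name.
/// Returns None as soon as any pointer leaves the mapped view.
fn rtti_mangled_name<'a>(img: &Image<'a>, vtable_va: u32) -> Option<&'a str> {
    let col = img.read_u32_be(vtable_va.checked_sub(4)?)?;
    let type_descriptor = img.read_u32_be(col.checked_add(0xC)?)?;
    let name = img.read_cstr(type_descriptor.checked_add(0x8)?)?;
    // Extra sanity check (an assumption of this sketch): MSVC type
    // descriptor names begin with ".?A", e.g. ".?AVFoo@@".
    if name.starts_with(".?A") {
        Some(name)
    } else {
        None
    }
}
```

A None result corresponds to the RTTI-absent path, where the vtable would get its ANON_Class_ fallback name instead.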

Layer M4 — Class-aware probe targeting (planned)

CLI extension only — no schema changes. --pc-probe=Class::method and --pc-probe-class=ClassName resolve via M3's tables. See crates/xenia-analysis/src/lookup.rs (when landed).


Forward work (M5–M12, not yet landed)

  • M5: indirect-dispatch reachability via vtable+CTR dataflow.
  • M6: extended xrefs.kind='write' for indexed/byte-reverse/multiword/VMX/DCBZ/atomic stores with addr_mode column.
  • M7: .rdata ASCII / UTF-16 string pool detection cross-referenced with PCs.
  • M8: dispatch-table heuristics beyond vtables (e.g. function-pointer arrays in .data).
  • M9: __CxxFrameHandler exception scope-table parsing.
  • M10: .tls section / TLS slot tracking.
  • M11: __xc_a / __xc_z static-initializer driver detection.
  • M12: comparative-PC-trace mode for canary diff (runtime side, not analyzer).