Adds an MSVC name-demangling layer in front of M3's vtable / RTTI work:

- New `xenia_analysis::demangle` wraps the `msvc-demangler` crate (a Rust port of LLVM's `MicrosoftDemangle.cpp`). `demangle()` short-circuits on non-mangled inputs (`?` prefix check); `demangle_or_raw()` always returns a record (raw passthrough on parse failure).
- Heuristic split of the formatted demangled string into structured fields `(namespace_path, class_name, method_name, params_signature)`. Top-level paren / template-bracket aware, so `a::b<c::d>::e` and signatures with templated arg types parse correctly.
- DB: new `demangled_names(address, mangled, raw_demangled, namespace_path, class_name, method_name, params_signature)` with indices on address / class_name / method_name. Populated from any label whose name starts with `?` plus any import name that happens to be mangled.

For Sylpheed (a fully stripped binary) this table is empty out-of-the-box; the layer's value lands in M3, which will append rows for every RTTI TypeDescriptor name found in `.rdata`.

Tests 610→617 (+7 demangler unit tests covering early-out, raw fallback, member function form, RTTI form, qname split, paren-template safety, and top-level `::` splitting).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

130 lines
5.9 KiB
Markdown
# `xenia-analysis` schema reference

Authoritative documentation for the DuckDB tables and SQL views produced by
`xenia-rs dis --db sylpheed.db`. Track schema changes here alongside any
update to the `db_schema_golden` test fixture.

The base + disasm tables (`metadata`, `sections`, `imports`, `functions`,
`labels`, `instructions`, `xrefs`, opt-in `exec_trace` / `import_calls` /
`branch_trace`) are documented inline in the `src/db.rs` doc comments. This file
collects layered analysis additions and forward-work notes.

---

## Layer M1 — `.pdata` boundary correction (landed)

### Schema additions

- `functions.pdata_validated BOOLEAN NOT NULL` — `true` when the row's
  `address` matches a `RUNTIME_FUNCTION.BeginAddress` from `.pdata`. Linker
  ground truth.
- `functions.pdata_length BIGINT NULL` — `function_length` (bytes) from the
  matching pdata entry; `NULL` when the row is prologue-only.
- New table `pdata_entries(begin_address BIGINT PRIMARY KEY, end_address
  BIGINT, function_length BIGINT, prolog_length BIGINT, flags BIGINT)` — every
  parsed `.pdata` `RUNTIME_FUNCTION` entry (raw, before any merge with
  prologue analysis).
- Index `idx_functions_pdata_validated` on `functions(pdata_validated)`.

### What this layer does

- Parses `.pdata` 8-byte `RUNTIME_FUNCTION` entries (PowerPC PE32 layout):
  word 0 `BeginAddress` (absolute VA), word 1 packed
  `{prolog_length:8, function_length:22, flags:2}`, both big-endian.
- Unions pdata `BeginAddress` values into the function-candidate set fed to
  the prologue walker, so functions our prologue heuristic missed still get
  rows.
- When pdata supplies a longer `function_length` than the prologue walk
  found, extends `end_address` to the pdata-implied end (catches mis-split
  where the walker stopped at an early `blr`).
- After the walker, performs a forward pass that trims `function.end` to the
  next start when they overlap (catches mis-merge where one row spanned two
  prologues — the audit-031 `sub_824D23B0` / `sub_824D29F0` case).

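
As a rough illustration of the entry decoding and the forward trimming pass described above, the sketch below uses only the standard library. It is not the crate's actual code: `PdataEntry`, `decode_pdata_entry`, `FuncRange`, and `trim_overlaps` are hypothetical names, and the bit order of the packed word (fields laid out from the least-significant bit, in the order listed above) is an assumption that should be checked against the real parser.

```rust
/// One decoded `.pdata` entry. Field names mirror the `pdata_entries`
/// columns; the struct itself is hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub struct PdataEntry {
    pub begin_address: u32,
    pub prolog_length: u32,
    pub function_length: u32,
    pub flags: u32,
}

/// Decode one 8-byte entry made of two big-endian 32-bit words.
/// ASSUMPTION: word 1 is packed LSB-first as
/// prolog_length (8 bits), function_length (22 bits), flags (2 bits).
pub fn decode_pdata_entry(raw: [u8; 8]) -> PdataEntry {
    let begin_address = u32::from_be_bytes([raw[0], raw[1], raw[2], raw[3]]);
    let packed = u32::from_be_bytes([raw[4], raw[5], raw[6], raw[7]]);
    PdataEntry {
        begin_address,
        prolog_length: packed & 0xFF,
        function_length: (packed >> 8) & 0x3F_FFFF,
        flags: (packed >> 30) & 0x3,
    }
}

/// Half-open `[start, end)` address range of one function row (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FuncRange {
    pub start: u32,
    pub end: u32,
}

/// Forward pass: sort by start address, then clamp each range's end to the
/// start of the next range whenever the two overlap.
pub fn trim_overlaps(funcs: &mut [FuncRange]) {
    funcs.sort_by_key(|f| f.start);
    for i in 1..funcs.len() {
        let next_start = funcs[i].start;
        let prev = &mut funcs[i - 1];
        if prev.end > next_start {
            prev.end = next_start;
        }
    }
}
```

For example, if one row claimed `0x824D23B0..0x824D2A40` (an invented end address) and a second row started at `0x824D29F0`, `trim_overlaps` would shorten the first row to end at `0x824D29F0`.
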
### What this layer does NOT do

- Does not adjust prolog-derived `frame_size` / `saved_gprs` from `.pdata`'s
  `prolog_length` field — those remain prologue-only inferences.
- Does not classify functions further than the existing `is_leaf` /
  `is_saverestore` columns. Class membership is M3.
- Does not detect functions whose entries are missing from BOTH `.pdata`
  and the bl-target scan (extremely rare; would require executable-byte
  linear sweep).

### Reference docs

- Microsoft PE32+ exception data spec for PowerPC RUNTIME_FUNCTION.
- xenia-canary `src/xenia/cpu/xex_module.cc:1570-1587` — canary's reference
  parser (extracts `BeginAddress` only; we additionally decode word 1).

### Validation queries

```sql
-- All pdata entries found
SELECT COUNT(*) FROM pdata_entries; -- ~23073 for Sylpheed
-- Functions cross-validated against pdata
SELECT COUNT(*) FROM functions WHERE pdata_validated;
-- Functions detected ONLY by prologue (orphans of pdata)
SELECT COUNT(*) FROM functions WHERE NOT pdata_validated;
-- Pdata orphans NOT yet in functions (should be 0 after this layer)
SELECT COUNT(*) FROM pdata_entries p
LEFT JOIN functions f ON f.address = p.begin_address
WHERE f.address IS NULL;
-- Audit-031 mis-merge resolved: 0x824D29F0 should have its own row
SELECT name FROM functions WHERE address = 2186095088; -- 0x824D29F0
```

---

## Layer M2 — MSVC C++ name demangler (landed)

### Schema additions

- New table `demangled_names(address BIGINT NULL, mangled VARCHAR NOT NULL,
  raw_demangled VARCHAR NOT NULL, namespace_path VARCHAR NULL,
  class_name VARCHAR NULL, method_name VARCHAR NULL,
  params_signature VARCHAR NULL)`.
- Indices on `address`, `class_name`, `method_name`.

### What this layer does

- Wraps `msvc_demangler::demangle` (a Rust port of LLVM's
  `MicrosoftDemangle.cpp`) and splits the formatted output into structured
  fields via a heuristic top-level parser (handles templates and nested parens
  correctly).
- Populates `demangled_names` from any label whose name starts with `?` plus
  any import name that happens to be mangled (defensive — typical kernel
  imports use C names).

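
A minimal sketch of the kind of top-level split described above, using only the standard library. It does not call `msvc-demangler`; it assumes the input is already the qualified-name portion of a demangled string (no return type, calling convention, or access specifier). The names `is_mangled`, `split_top_level`, `split_fields`, and `NameParts` are hypothetical and are not the crate's API.

```rust
/// Cheap early-out: MSVC-mangled names start with `?`.
pub fn is_mangled(name: &str) -> bool {
    name.starts_with('?')
}

/// Split on `::` separators that are not nested inside `<...>` or `(...)`.
/// Limitation of this sketch: names such as `operator<` or `operator->`
/// would confuse the bracket counter.
pub fn split_top_level(qname: &str) -> Vec<&str> {
    let bytes = qname.as_bytes();
    let mut parts = Vec::new();
    let (mut depth, mut start, mut i) = (0usize, 0usize, 0usize);
    while i < bytes.len() {
        match bytes[i] {
            b'<' | b'(' => depth += 1,
            b'>' | b')' => depth = depth.saturating_sub(1),
            b':' if depth == 0 && bytes.get(i + 1) == Some(&b':') => {
                parts.push(&qname[start..i]);
                i += 2;
                start = i;
                continue;
            }
            _ => {}
        }
        i += 1;
    }
    parts.push(&qname[start..]);
    parts
}

/// Structured fields, mirroring the nullable `demangled_names` columns.
#[derive(Debug, PartialEq, Eq)]
pub struct NameParts {
    pub namespace_path: Option<String>,
    pub class_name: Option<String>,
    pub method_name: String,
    pub params_signature: Option<String>,
}

/// Peel off the parameter list at the first `(` outside template brackets,
/// then treat the last name component as the method, the one before it as
/// the class, and anything earlier as the namespace path.
pub fn split_fields(s: &str) -> NameParts {
    let mut depth = 0usize;
    let mut paren = None;
    for (i, b) in s.bytes().enumerate() {
        match b {
            b'<' => depth += 1,
            b'>' => depth = depth.saturating_sub(1),
            b'(' if depth == 0 => {
                paren = Some(i);
                break;
            }
            _ => {}
        }
    }
    let (name, params_signature) = match paren {
        Some(i) => (&s[..i], Some(s[i..].to_string())),
        None => (s, None),
    };
    let mut parts = split_top_level(name);
    let method_name = parts.pop().unwrap_or("").to_string();
    let class_name = parts.pop().map(str::to_string);
    let namespace_path = if parts.is_empty() {
        None
    } else {
        Some(parts.join("::"))
    };
    NameParts { namespace_path, class_name, method_name, params_signature }
}
```

With this sketch, `split_top_level("a::b<c::d>::e")` yields the three components `a`, `b<c::d>`, and `e`, because the inner `::` sits at bracket depth 1 and is skipped.
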
### What this layer does NOT do

- Does not parse the AST returned by `msvc_demangler::parse` — uses the formatted
  string and a heuristic split. Adequate for typical class member functions
  and RTTI strings; exotic template / lambda forms still get `raw_demangled`
  populated but may have NULL structured fields.
- Does not yet ingest RTTI strings discovered in `.rdata` — that's M3's job;
  M3 will append rows to this table at the addresses where it finds RTTI
  TypeDescriptors.

### Reference docs

- `msvc-demangler` crate (`https://docs.rs/msvc-demangler/0.11`).
- LLVM `MicrosoftDemangle.cpp` (the parser this crate ports).

## Layer M3 — Vtable + RTTI detection (planned)

Adds `vtables`, `methods`, `classes` tables. Heuristic vtable scan over
`.rdata` + `.data`, optional MSVC RTTI `CompleteObjectLocator → TypeDescriptor`
walk, anonymous-class fallback when RTTI is stripped. See
`crates/xenia-analysis/src/vtables.rs` (when landed).

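
Because this layer is only planned, the following is a speculative sketch of one small piece: locating candidate RTTI TypeDescriptor name strings in a section's raw bytes. It relies on the MSVC convention that such names begin with `.?AV` (classes) or `.?AU` (structs) and are NUL-terminated. The function name `find_rtti_names` is hypothetical, the returned offsets are relative to the buffer rather than virtual addresses, and none of this reflects how `vtables.rs` will actually be written.

```rust
/// Scan a byte buffer (e.g. the contents of `.rdata`) for NUL-terminated,
/// printable-ASCII strings that start with `.?AV` or `.?AU`.
/// Returns `(offset_in_buffer, name)` pairs in ascending offset order.
pub fn find_rtti_names(rdata: &[u8]) -> Vec<(usize, String)> {
    let mut hits = Vec::new();
    let mut i = 0usize;
    while i < rdata.len() {
        let tail = &rdata[i..];
        if tail.starts_with(b".?AV") || tail.starts_with(b".?AU") {
            // Only accept the candidate if a terminating NUL exists and
            // every byte before it is printable, non-space ASCII.
            if let Some(len) = tail.iter().position(|&b| b == 0) {
                let name = &tail[..len];
                if name.iter().all(|b| b.is_ascii_graphic()) {
                    hits.push((i, String::from_utf8_lossy(name).into_owned()));
                    i += len + 1; // resume after the terminator
                    continue;
                }
            }
        }
        i += 1;
    }
    hits
}
```

Each hit would still need to be turned into a mangled name that the M2 demangler accepts and mapped back to a virtual address before it could be appended to `demangled_names`; both steps are left out here.
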
## Layer M4 — Class-aware probe targeting (planned)

CLI extension only — no schema changes. `--pc-probe=Class::method` and
`--pc-probe-class=ClassName` resolve via M3's tables. See
`crates/xenia-analysis/src/lookup.rs` (when landed).

---

## Forward work (M5–M12, not yet landed)

- **M5** — indirect-dispatch reachability via vtable+CTR dataflow.
- **M6** — extended `xrefs.kind='write'` for indexed/byte-reverse/multiword/VMX/DCBZ/atomic stores with `addr_mode` column.
- **M7** — `.rdata` ASCII / UTF-16 string pool detection cross-referenced with PCs.
- **M8** — dispatch-table heuristics beyond vtables (e.g. function-pointer arrays in `.data`).
- **M9** — `__CxxFrameHandler` exception scope-table parsing.
- **M10** — `.tls` section / TLS slot tracking.
- **M11** — `__xc_a` / `__xc_z` static-initializer driver detection.
- **M12** — comparative-PC-trace mode for canary diff (runtime side, not analyzer).