# `xenia-analysis` schema reference

Authoritative documentation for the DuckDB tables and SQL views produced by `xenia-rs dis --db sylpheed.db`. Track schema changes here alongside any update to the `db_schema_golden` test fixture.

The base + disasm tables (`metadata`, `sections`, `imports`, `functions`, `labels`, `instructions`, `xrefs`, opt-in `exec_trace` / `import_calls` / `branch_trace`) are documented inline in the `src/db.rs` doc comments. This file collects layered analysis additions and forward-work notes.

---

## Layer M1 — `.pdata` boundary correction (landed)

### Schema additions

- `functions.pdata_validated BOOLEAN NOT NULL` — `true` when the row's `address` matches a `RUNTIME_FUNCTION.BeginAddress` from `.pdata`. Linker ground truth.
- `functions.pdata_length BIGINT NULL` — `function_length` (bytes) from the matching pdata entry; `NULL` when the row is prologue-only.
- New table `pdata_entries(begin_address BIGINT PRIMARY KEY, end_address BIGINT, function_length BIGINT, prolog_length BIGINT, flags BIGINT)` — every parsed `.pdata` `RUNTIME_FUNCTION` entry (raw, before any merge with prologue analysis).
- Index `idx_functions_pdata_validated` on `functions(pdata_validated)`.

### What this layer does

- Parses `.pdata` 8-byte `RUNTIME_FUNCTION` entries (PowerPC PE32 layout): word 0 `BeginAddress` (absolute VA), word 1 packed `{prolog_length:8, function_length:22, flags:2}`, both big-endian.
- Unions pdata `BeginAddress` values into the function-candidate set fed to the prologue walker, so functions our prologue heuristic missed still get rows.
- When pdata supplies a longer `function_length` than the prologue walk found, extends `end_address` to the pdata-implied end (catches mis-split where the walker stopped at an early `blr`).
- After the walker, performs a forward pass that trims `function.end` to the next start when they overlap (catches mis-merge where one row spanned two prologues — the audit-031 `sub_824D23B0` / `sub_824D29F0` case).

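
The word-1 unpacking described above can be sketched as follows. This is a minimal illustration, not the code in the crate: `PdataEntry` and `decode_pdata_entry` are hypothetical names, and the bit order (fields packed from the least-significant bit upward) is an assumption chosen to agree with the M9 note that the exception flag is bit 31 of word 1. The sketch returns the raw `function_length` field and does not show any unit conversion.

```rust
/// One decoded `.pdata` entry. Field names follow the `pdata_entries` columns;
/// the struct itself is a hypothetical illustration.
#[derive(Debug, PartialEq, Eq)]
pub struct PdataEntry {
    pub begin_address: u32,
    pub prolog_length: u32,
    pub function_length: u32,
    pub flags: u32,
}

/// Decode one 8-byte entry made of two big-endian 32-bit words.
/// ASSUMPTION: word 1 packs its fields from the least-significant bit upward:
/// prolog_length = bits 0..=7, function_length = bits 8..=29, flags = bits 30..=31.
pub fn decode_pdata_entry(raw: [u8; 8]) -> PdataEntry {
    let word0 = u32::from_be_bytes([raw[0], raw[1], raw[2], raw[3]]);
    let word1 = u32::from_be_bytes([raw[4], raw[5], raw[6], raw[7]]);
    PdataEntry {
        begin_address: word0,
        prolog_length: word1 & 0xFF,
        function_length: (word1 >> 8) & 0x3F_FFFF,
        flags: word1 >> 30,
    }
}
```

Under that assumed layout, the bytes `82 4D 29 F0 80 00 12 04` decode to `begin_address = 0x824D29F0`, `prolog_length = 4`, `function_length = 18` and `flags = 2` (high bit set).
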
### What this layer does NOT do

- Does not adjust prologue-derived `frame_size` / `saved_gprs` from `.pdata`'s `prolog_length` field — those remain prologue-only inferences.
- Does not classify functions further than the existing `is_leaf` / `is_saverestore` columns. Class membership is M3.
- Does not detect functions whose entries are missing from BOTH `.pdata` and the bl-target scan (extremely rare; would require executable-byte linear sweep).

### Reference docs

- Microsoft PE32+ exception data spec for PowerPC RUNTIME_FUNCTION.
- xenia-canary `src/xenia/cpu/xex_module.cc:1570-1587` — canary's reference parser (extracts `BeginAddress` only; we additionally decode word 1).

### Validation queries

```sql
-- All pdata entries found
SELECT COUNT(*) FROM pdata_entries;  -- ~23073 for Sylpheed

-- Functions cross-validated against pdata
SELECT COUNT(*) FROM functions WHERE pdata_validated;

-- Functions detected ONLY by prologue (orphans of pdata)
SELECT COUNT(*) FROM functions WHERE NOT pdata_validated;

-- Pdata orphans NOT yet in functions (should be 0 after this layer)
SELECT COUNT(*) FROM pdata_entries p
LEFT JOIN functions f ON f.address = p.begin_address
WHERE f.address IS NULL;

-- Audit-031 mis-merge resolved: 0x824D29F0 should have its own row
SELECT name FROM functions WHERE address = 2186095088;  -- 0x824D29F0
```

---

## Layer M2 — MSVC C++ name demangler (landed)

### Schema additions

- New table `demangled_names(address BIGINT NULL, mangled VARCHAR NOT NULL, raw_demangled VARCHAR NOT NULL, namespace_path VARCHAR NULL, class_name VARCHAR NULL, method_name VARCHAR NULL, params_signature VARCHAR NULL)`.
- Indices on `address`, `class_name`, `method_name`.

### What this layer does

- Wraps `msvc_demangler::demangle` (a Rust port of LLVM's `MicrosoftDemangle.cpp`) and splits the formatted output into structured fields via a heuristic top-level parser (handles templates and nested parens correctly).
- Populates `demangled_names` from any label whose name starts with `?` plus any import name that happens to be mangled (defensive — typical kernel imports use C names).

### What this layer does NOT do

- Does not parse the AST returned by `msvc_demangler::parse` — uses the formatted string and a heuristic split. Adequate for typical class member functions and RTTI strings; exotic template / lambda forms still get `raw_demangled` populated but may have NULL structured fields.
- Does not yet ingest RTTI strings discovered in `.rdata` — that's M3's job; M3 will append rows to this table at the addresses where it finds RTTI TypeDescriptors.

### Reference docs

- `msvc-demangler` crate (`https://docs.rs/msvc-demangler/0.11`).
- LLVM `MicrosoftDemangle.cpp` (the parser this crate ports).

## Layer M3 — Vtable + RTTI detection (landed)

### Schema additions

- `vtables(address PK, length, col_address NULL, class_name, rtti_present, base_classes_json NULL)` — every detected static vtable.
- `methods(vtable_address, slot, function_address, mangled_name NULL, demangled_name NULL, PRIMARY KEY (vtable_address, slot))` — one row per method slot.
- `classes(name PK, vtable_address, rtti_present, base_classes_json NULL)` — deduped by class name (first-detected vtable wins).
- Indices: `methods.function_address`, `classes.rtti_present`.

### What this layer does

- Walks `.rdata` and `.data` looking for runs of ≥3 consecutive 4-byte BE values where each value is a known function start (from M1's corrected `functions` table). Vtables with fewer than three methods are intentionally rejected to control the false-positive rate.
- Attempts the MSVC RTTI walk `vtable[-1] → CompleteObjectLocator → TypeDescriptor` for each candidate. When successful, the demangled `class ClassName` string fills `class_name` and a best-effort `RTTIClassHierarchyDescriptor` walk fills `base_classes_json` (JSON array of base class names).
- Falls back to `ANON_Class_<8-hex>` keyed by FNV-1a hash of the sorted method-PC tuple when RTTI is absent (typical for shipped game binaries). Identical vtables across the binary (multiple instances) collapse to the same anonymous name.

### What this layer does NOT do

- Vtables built at runtime in heap-allocated memory (e.g. by ctors copying static templates) are out of scope — only static `.rdata`/`.data` content.
- Multiple-inheritance "extra" vftables (one per base subobject) are detected as independent vtables with no link between them.
- Inheritance-tree walking beyond `RTTIClassHierarchyDescriptor`'s direct base list is not attempted.

### Reference docs

- openrce.org "Reversing Microsoft Visual C++" — RTTI layout articles (CompleteObjectLocator at vtable[-1]; TypeDescriptor at COL+0xC; mangled name at TD+0x8).

## Layer M4 — Class-aware probe targeting (landed)

CLI extension only — no schema changes. The probe-token grammar adds three symbolic forms on top of the existing `0xADDR` literal:

- `Class::method` — joins `classes` × `methods` × `demangled_names` to find every PC whose vtable belongs to that class and whose demangled `method_name` matches.
- `Class::*` — joins `classes` × `methods` to find every method PC of that class.
- `function_name` — falls back to `functions.name` lookup for free functions / saverestore stubs / labels.

Numeric tokens never touch the DB (preserves zero-IO fast path; lockstep digest unaffected). Symbolic tokens require the DuckDB at `--probe-db PATH` or `XENIA_PROBE_DB`; default is `sylpheed.db` next to the .iso when present. Resolution happens BEFORE guest exec begins, so it cannot affect the lockstep digest.

See `crates/xenia-analysis/src/lookup.rs`.

---

## Layer M5 — Indirect-dispatch reachability (landed)

### Schema additions

- New value `'ind_call'` in the `xrefs.kind` set.
- New SQL view `v_indirect_reachability_from_entry` — strict superset of `v_reachability_from_entry`, taking `ind_call` edges in the BFS.

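
The relationship between the two views can be sketched as a breadth-first search that differs only in which edge kinds it follows. This is an illustrative model rather than the view's SQL: the `Edge` struct and `reachable` function are hypothetical, and edges are assumed to be already reduced to function-level `(source, target, kind)` triples.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Function-level edge; a hypothetical reduction of an `xrefs` row.
pub struct Edge {
    pub source: u32,
    pub target: u32,
    pub kind: &'static str,
}

/// Breadth-first reachability from `entry`, following only edges whose
/// `kind` appears in `kinds`. The entry itself is always in the result.
pub fn reachable(entry: u32, edges: &[Edge], kinds: &[&str]) -> HashSet<u32> {
    // Build an adjacency list restricted to the allowed edge kinds.
    let mut adjacency: HashMap<u32, Vec<u32>> = HashMap::new();
    for e in edges.iter().filter(|e| kinds.contains(&e.kind)) {
        adjacency.entry(e.source).or_default().push(e.target);
    }
    let mut seen = HashSet::from([entry]);
    let mut queue = VecDeque::from([entry]);
    while let Some(node) = queue.pop_front() {
        for &next in adjacency.get(&node).into_iter().flatten() {
            // `insert` returns true only the first time a node is seen.
            if seen.insert(next) {
                queue.push_back(next);
            }
        }
    }
    seen
}
```

Calling `reachable` with `["call", "j", "br"]` models the base view, and adding `"ind_call"` to the list models the indirect one. Because the second edge set contains the first, its result can only grow, which is the superset property stated above.
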
### What this layer does

- Walks each `FuncAnalysis.functions` entry with a per-basic-block register tracker. Recognises the canonical static-vtable pattern: `lis+addi → lwz off(rA) → mtctr → bcctrl`, where `rA` ends up holding a known vtable's start address from M3.
- Honours the PowerPC ABI: `bl`-style calls (op 18 / 16 with LK=1) clobber volatile r0..r12 + ctr but preserve non-volatile r13..r31, so a vtable pointer parked in r30/r31 before a call survives.
- Treats every M3 `loc_*` label as a basic-block boundary (kills register state) so jump-IN paths cannot induce false positives.

### What this layer does NOT do (and observed impact)

- Vtable pointer loaded from a `this`-pointer field (`lwz r_vt, off(rA)` where `rA = this`) — by far the dominant pattern in real C++ — is unresolvable without alias / points-to analysis.
- On Sylpheed: the layer detects 0 edges. The binary's 1,001 lis+addi references into vtables are mostly constructor-side **vptr writes** (`stw rVtable, vptr_offset(this)`), not direct dispatches. The renderer hunt's audit-009 cluster therefore needs a future M5.5 with `this`-flow tracking before this layer surfaces it.

### Reference docs

- IBM PowerPC ABI: register-save convention (volatile r0..r12 + ctr, non-volatile r13..r31).

## Layer M7 — String / constant-pool detection (landed)

### Schema additions

- New table `strings(address PK, encoding, length, content)`.
- Index `idx_strings_encoding`.

### What this layer does

- Scans `.rdata` for runs of length ≥ 6 of printable ASCII bytes followed by a NUL terminator.
- Scans `.rdata` for UTF-16LE runs of length ≥ 6 code units (printable-ASCII basic plane only) followed by a u16 NUL terminator.
- Cross-reference is implicit: existing `xrefs.kind='ref'` rows whose `target` falls in `strings.address`'s exact match set name the referencing PCs. SQL: `SELECT s.content, x.source FROM xrefs x JOIN strings s ON s.address = x.target WHERE x.kind='ref'`.

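
The ASCII pass described above can be sketched as a single linear scan. This is a minimal illustration under stated assumptions, not the crate's scanner: `scan_ascii_strings` is a hypothetical name, "printable" is assumed to mean bytes `0x20..=0x7E`, and offsets are relative to the start of the section rather than absolute VAs.

```rust
/// Find NUL-terminated runs of at least `min_len` printable ASCII bytes.
/// Returns (offset, text) pairs; offsets are relative to `section`.
/// ASSUMPTION: "printable" means 0x20..=0x7E; a real scanner may also
/// admit tab or newline.
pub fn scan_ascii_strings(section: &[u8], min_len: usize) -> Vec<(usize, String)> {
    let mut found = Vec::new();
    let mut start = 0; // first byte of the current candidate run
    for (i, &byte) in section.iter().enumerate() {
        if (0x20..=0x7E).contains(&byte) {
            continue; // still inside a candidate run
        }
        // The run ends here; it only counts when the terminator is NUL.
        if byte == 0 && i - start >= min_len {
            let text = String::from_utf8_lossy(&section[start..i]).into_owned();
            found.push((start, text));
        }
        start = i + 1;
    }
    found
}
```

A run that ends at any non-NUL byte, or at the end of the section without a terminator, is dropped, which matches the "followed by a NUL terminator" requirement.
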
### What this layer does NOT do

- No UTF-8 multibyte / non-ASCII basic plane in either encoding.
- No `.data` scan (read-only-section bias).
- No multi-byte CJK encodings — Japanese text in localised builds may be represented in shift_jis / utf-8 with non-printable bytes that this scanner skips.

### Sylpheed yield

- 6,311 ASCII strings (including full embedded HLSL shader source).
- 0 UTF-16LE strings (binary uses ASCII / native CJK encoding).
- 9,132 lis+addi sites cross-reference into the detected strings, which names the source PCs that reference each string.

## Layer M6 — Extended store-class xrefs + `addr_mode` column (landed)

### Schema additions

- `xrefs.addr_mode VARCHAR NULL` — sub-classifies how the source instruction computes its target. NULL for control-flow edges (call / ind_call / j / br); one of the following tags for data edges:
  - `d_form` — standard signed-16 displacement (lwz/stw/lfs/stfs/etc.)
  - `lis_addi` — address materialised via `lis + addi` register tracking
  - `lis_ori` — address materialised via `lis + ori`
  - `multiword` — `lmw / stmw` (one xref per slot; up to 32-rS slots)
  - `x_form_indexed` — `stwx / stbx / sthx / stwux / stbux / sthux / stdx / stdux / lwzx / lbzx / lhzx / lhax / lwzux / lbzux / lhzux / lhaux / ldx / ldux` — emitted only when both rA and rB are tracked constants
  - `x_form_byterev` — `stwbrx / sthbrx / lwbrx / lhbrx`
  - `atomic` — `stwcx. / stdcx.` reservation-conditional stores
  - `dcbz` — cache-line clear (32-byte zero at rA+rB)
- Index `idx_xrefs_addr_mode`.

### What this layer does

- Tags every existing data xref with its addressing mode (`d_form` for the bulk; `lis_addi` / `lis_ori` for the lift-and-add cases that produce DataRef rows).
- Adds new dispatch for opcode 47 (`stmw`) and 46 (`lmw`), expanding to per-slot DataWrite / DataRead rows.
- Adds new dispatch for opcode 31 X-form: stores, atomic, byte-reverse, dcbz.
  X-form rows are emitted ONLY when both rA and rB resolve to known constants (otherwise the address is runtime-dependent and we skip).

### What this layer does NOT do

- VMX / VMX128 vector stores (opcode 31 with vector XO codes) are not emitted — they always have register-indexed addresses that the lis+addi tracker can't usually resolve, and detecting them adds noise without improving target resolution.
- The dominant runtime `stwx` pattern (rA = base, rB = runtime index) is not resolved — by design; mem-watch covers the runtime side per VERIFY-B.

### Sylpheed yield

- 28,834 `lis_addi` refs, 18,485 `d_form` reads, 3,288 `d_form` writes — the existing baseline now properly tagged.
- **442 newly-detected `x_form_indexed` reads** — primarily lwzx/lhzx reads from in-table dispatch (each pair (rA,rB) resolved statically).
- **40 newly-detected `atomic` writes** — every `stwcx.` site with a resolvable address; useful for reservation-table audits.
- 9 `lis_ori` refs.
- 0 multiword / dcbz / byterev — these instructions exist in the binary but are not in lis+addi-tracked code paths.

## Layer M8 + M11 — Function-pointer arrays beyond vtables (landed)

### Schema additions

- New table `function_pointer_arrays(address PK, length, kind)` where `kind` is `'vtable'` (M3 re-emit), `'dispatch_table'` (M8), or `'static_init'` (M11).
- New table `function_pointer_array_entries(array_address, slot, function_address, PRIMARY KEY (array_address, slot))` — one row per slot of every detected array (vtable + non-vtable).
- Indices on `function_pointer_arrays.kind` and `function_pointer_array_entries.function_address`.

### What this layer does

- Walks `.rdata` (only — `.data` produces too many false positives) for runs of ≥ 2 consecutive 4-byte BE values where each value is a known function entry from M1's `functions` table.
- Skips runs whose start matches an M3 vtable head — those are re-emitted in this table with `kind='vtable'` for unified queries but not re-classified.
- Heuristically classifies non-vtable runs:
  - `static_init` (M11): every entry's first instruction is `mfspr r12, LR` AND the next is `stwu r1, -N(r1)` with `N ≤ 0x80` (or a save-stub `bl`). Mirrors the typical C++ static-initialiser prologue.
  - `dispatch_table` (M8): everything else.

### What this layer does NOT do

- Does not parse symbol-table-bracketed regions like `__xc_a` / `__xc_z` / `__xi_a` / `__xi_z` directly — Sylpheed's symbol table is stripped.
- Does not chain multi-segment static-init drivers; future M11.5 could walk the entry-point's static-init driver call chain to surface ground-truth ctor PCs.
- 2-slot runs in `.rdata` may be false positives where two struct fields happen to alias function VAs; downstream queries should use a length filter (`WHERE length >= 3`) when high precision matters.

### Sylpheed yield

- 722 vtables (M3 re-emit) + 388 dispatch_tables = 1,110 arrays in `function_pointer_arrays`.
- 0 static_init detected — Sylpheed's ctors don't all match the conservative prologue heuristic. Lengths concentrate at 2 slots (typical of switch-case jump tables).

## Layer M9 — `has_eh` from `.pdata` exception flag (landed)

### Schema additions

- `functions.has_eh BOOLEAN NOT NULL` — true when `.pdata`'s exception-handler-present bit (bit 31 of word 1, the high bit) is set.
- Index `idx_functions_has_eh`.

### What this layer does

- Derived directly from M1's already-parsed `pdata.flags` bit field (no new parsing). The bit was always available in `pdata_entries.flags`; this layer surfaces it as a first-class column on `functions`.

### What this layer does NOT do

- Does not parse the actual `__CxxFrameHandler` / `__C_specific_handler` scope-table records that the exception bit gates. Walking those tables would let us name try/catch ranges and per-state cleanup actions, but is out of scope for a derive-only milestone.

### Sylpheed yield

- 2,975 of 23,073 pdata-validated functions have `has_eh=true` (12.9%) — plausible MSVC C++ EH coverage rate.
  Largest EH function: 26,328 bytes (`sub_823518F0`).

## Layer M10 — `.tls` section / TLS directory (landed)

### Schema additions

- New table `tls_info(raw_data_start, raw_data_end, index_address, callback_array, zero_fill_size, characteristics)` — at most one row (the IMAGE_TLS_DIRECTORY32).
- New table `tls_callbacks(slot PK, address)` — one row per resolved TLS callback function.

### What this layer does

- Reads the first 24 bytes of the `.tls` section as an `IMAGE_TLS_DIRECTORY32` and walks the zero-terminated callback array.
- All addresses stored as absolute VAs.

### What this layer does NOT do

- Does not parse the raw TLS template content (the variable initialiser block); just records its start/end VAs.

### Sylpheed yield

- 0 rows — Sylpheed has no `.tls` section. Infrastructure ready for any binary that uses `__declspec(thread)` storage.

## Layer M12 — `--lr-trace` runtime canary-diff harness (landed)

### Runtime additions (no DB)

- New CLI flag `--lr-trace=PC[,PC,...]` on `exec` — comma-separated PCs to capture as JSONL records on every fire. Symbolic tokens (`Class::method`) resolve via M4's lookup against `--probe-db`. Settable via `XENIA_LR_TRACE`.
- New CLI flag `--lr-trace-out=PATH` — writes JSONL to a file (one record per line). Stdout when omitted. Settable via `XENIA_LR_TRACE_OUT`.
- New kernel state fields `lr_trace_pcs: HashSet<…>` + `lr_trace_writer: Option<…>` and helper `KernelState::fire_lr_trace_if_match(hw_id)` invoked from the per-instruction probe slot.

### JSONL record fields

`pc, tid, hw, cycle, r3, r4, r5, r6, lr` — superset of what xenia-canary's `--log_lr_on_pc` patch emits, with a cycle counter added for cross-run reproducibility.

### What this layer does NOT do

- Does not capture VMX / FP register state (only GPRs r3..r6).
- Does not buffer / batch records — one `write_all` per fire. For high-frequency probes (e.g. tight loops at >1M fires/sec), redirect to a file and use an SSD.

### Determinism

Lockstep digest unaffected: probe firing happens after the per-instr hooks for ctor/branch probes and only emits side-channel output. Verified end-of-session: `check sylpheed.iso --stable-digest -n 2M` ×2 produced byte-identical digests (`instructions=2000005`).

---

## Layer M5.5 — `this`-flow indirect-dispatch resolution (landed)

### Schema additions

- New table `vptr_writes(writer_pc, vtable_address, vptr_offset, writer_function)` — every detected `stw rVtable, vptr_off(rThis)` site.
- New table `indirect_dispatch_sites(dispatch_pc PK, vptr_offset, slot, candidate_count)` — one row per resolved dispatch.
- New table `indirect_dispatch_candidates(dispatch_pc, vtable_address, method_address)` — one row per (dispatch × candidate vtable). Joined to existing `xrefs.kind='ind_call'` edges (one ind_call row per candidate).
- New indices on `vptr_writes.vtable_address`, `vptr_writes.vptr_offset`, `indirect_dispatch_candidates.method_address`, `indirect_dispatch_candidates.vtable_address`, `indirect_dispatch_sites(vptr_offset, slot)`.

### What this layer does (class-membership inference)

1. **Phase 1 — vptr-write scan**: walk every function with the lis+addi tracker; whenever `stw rA, off(rB)` writes a known M3 vtable address, record `(vtable_addr, vptr_offset, writer_pc)`.
2. **Phase 2 — invert**: build `vtables_by_offset[vptr_off] = {V}` for the set of vtables ever written at that offset.
3. **Phase 3 — dispatch detection**: walk back ≤16 instructions from each `bcctrl`/`bctr LK=1`, find the canonical `lwz vt, off(this); lwz fn, slot*4(vt); mtctr fn` chain. Extract `(vptr_off, slot)`. Bail on register clobber, branch, or label boundary.
4. **Phase 4 — emit**: for each `(dispatch_pc, vptr_off, slot)`, emit one `xrefs.kind='ind_call'` row per candidate vtable that has a matching slot. Multi-candidate rows are an over-approximation.

### What this layer does NOT do

- No alias resolution at multi-candidate sites — emits one edge per matching vtable.
  Downstream queries should filter `indirect_dispatch_sites WHERE candidate_count=1` for high-confidence edges.
- No flow-sensitive analysis: register state is killed at every label (basic-block boundary) and at `bl`/`bcl` calls (volatile r0..r12 + ctr). We do NOT propagate values across calls in the chain-walker.
- No tracking of vptr writes via X-form indexed (`stwx`), VMX, or multiword stores. Only D-form `stw rA, off(rB)`.
- Does not synthesise vptr writes for inlined / elided constructors. If a class never has a writer at offset `vptr_off`, dispatches through that offset find no candidates.

### Sylpheed yield

- 567 vptr writes covering 214 distinct vtables (~30% of M3's 722).
- 29 distinct vptr offsets used; offset 0 dominates (501/567 = 88%, single-inheritance).
- **6,842 dispatch sites resolved**: 97 single-candidate (high-confidence) + 6,745 multi-candidate (over-approximation).
- 687,963 `ind_call` xref rows total.
- **2,746 newly-reachable functions** via the M5 BFS view (`v_indirect_reachability_from_entry`) compared to call/j/br alone.
- Audit-009 cluster (renderer plateau): functions newly visible include `0x823BC9E0`, `0x823BC290`, `0x823BC5A0`, `0x823BB158`, `0x823BB1E0`, `0x823BCAF0`, `0x823BC4C8` — actionable starting points for the cluster's reachability hunt.

### Reference docs

- IBM PowerPC ABI (volatile/non-volatile register partition).
- Itanium C++ ABI on vtable layout (offset-from-`this` model adapted by MSVC for Win32 PPC).

## Layer M9.5 — `__CxxFrameHandler` scope-table parsing (landed)

### Schema additions

- New table `eh_funcinfo(address PK, magic, max_state, p_unwind_map, n_try_blocks, p_try_block_map, n_ip_map_entries, p_ip_to_state_map, p_es_type_list, eh_flags)`.
- New table `eh_unwind_map(funcinfo_address, state_index, to_state, action_pc, PRIMARY KEY (funcinfo_address, state_index))`.
- New table `eh_try_blocks(funcinfo_address, try_index, try_low, try_high, catch_high, n_catches, p_handler_array, PRIMARY KEY (funcinfo_address, try_index))`.

### What this layer does

- Magic-scans `.rdata` for the documented MSVC FuncInfo signatures (0x19930520 / 0x19930521 / 0x19930522), reading 4-byte BE values on 4-byte alignment.
- Sanity-checks `max_state` ≤ 10,000, `n_try_blocks` ≤ 1,000, all internal pointers landing in valid sections.
- Walks `pUnwindMap` (8-byte UnwindMapEntry) and `pTryBlockMap` (20-byte TryBlockMapEntry) into one row each.

### What this layer does NOT do

- Does not associate FuncInfo records with their owning function via the `bl __CxxFrameHandler` registration site — joins to `functions` by best-effort PC-range queries. A future M9.6 can chase the registration to make the link explicit.
- Does not parse `pHandlerArray` (per-try-block catch type info).

### Sylpheed yield

- 2,588 FuncInfo records (all version 0x19930522).
- 10,019 unwind-map entries.
- 315 try-blocks across the binary.

## Layer M11.5 — Static-init driver chain detection (landed)

### Schema additions

- Reuses existing `function_pointer_arrays` table — drivers' arrays are emitted with `kind='static_init'`, replacing M11's prologue-heuristic output where the structurally-grounded pattern fires.

### What this layer does

- Walks every detected function looking for the canonical `_initterm`-style loop: `lwz cursor; mtctr; bcctrl; addi cursor, cursor, 4` bounded by a comparison against another constant register.
- Extracts `(array_start, array_end)` from the cursor's initial constant value and the end-comparand register.
- Reads the array, validates each entry against `func_analysis.functions`, and emits the array as `static_init`.

### What this layer does NOT do

- Doesn't handle drivers with multiple back-to-back trampoline loops.
- Doesn't follow `_initterm_e` return-status semantics — both `_initterm` and `_initterm_e` match if the loop body matches.

### Sylpheed yield

- 0 drivers detected. Sylpheed's static-init structure does not match the canonical CRT loop pattern; the binary likely calls ctors via another mechanism (inline at the entry point, or via a different driver shape). Infrastructure ready for any binary with the documented MSVC pattern.

## Layer VMX — Vector-store xrefs (M6 follow-up, landed)

Extends the M6 X-form opcode-31 dispatch in `xref.rs` with AltiVec/VMX vector loads and stores. New entries (XO codes):

- `lvx` (103), `lvxl` (359), `lvebx` (7), `lvehx` (39), `lvewx` (71) — `addr_mode='x_form_indexed'`, `kind='read'`.
- `stvx` (231), `stvxl` (487), `stvebx` (135), `stvehx` (167), `stvewx` (199) — `addr_mode='x_form_indexed'`, `kind='write'`.

Same constraint as M6: rows emitted only when both `rA` and `rB` resolve to known constants (rare but useful).

### Sylpheed yield

- 110 `stvx` writes newly resolved.

## Layer SJIS+UTF-8 — Localised-string detection (M7 follow-up, landed)

Extends `xenia_analysis::strings::analyze` with two additional scanners.

### Shift_JIS detection

Per JIS X 0208: lead byte ∈ [0x81, 0x9F] ∪ [0xE0, 0xEF]; trail byte ∈ [0x40, 0x7E] ∪ [0x80, 0xFC]. Single-byte ASCII and JIS half-width katakana (0xA1..=0xDF) are passed through. At least one multi-byte pair must be present (so we don't double-count pure ASCII). SJIS bytes are rendered as `\\xHH` escapes in the `content` column for diagnostic readability — full SJIS→UTF-8 decoding is a future enhancement.

### UTF-8 detection

Validates 2-byte (`110xxxxx 10xxxxxx`) and 3-byte (`1110xxxx 10xxxxxx 10xxxxxx`) sequences plus printable ASCII. Skips 4-byte (supplementary plane) which is rare in game text.

### Sylpheed yield

- 790 Shift_JIS strings (Japanese debug + UI text, including `[WARNING] ノードに割り当てるエフェクトIDの指定がない ノードデータが見つからない` style mission strings).
- 39 UTF-8 strings.
- 6,311 ASCII strings (unchanged from M7).

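
The Shift_JIS acceptance rule described in this layer can be sketched directly from the byte ranges given above. The function names are hypothetical, and the sketch assumes that "single-byte ASCII" means printable bytes `0x20..=0x7E`; it only classifies a candidate run and does not render the `content` column.

```rust
/// True when `b` can start a two-byte Shift_JIS character.
pub fn is_sjis_lead(b: u8) -> bool {
    matches!(b, 0x81..=0x9F | 0xE0..=0xEF)
}

/// True when `b` can finish a two-byte Shift_JIS character.
pub fn is_sjis_trail(b: u8) -> bool {
    matches!(b, 0x40..=0x7E | 0x80..=0xFC)
}

/// Accept a run made only of printable ASCII, half-width katakana and
/// well-formed two-byte pairs, with at least one pair present.
/// ASSUMPTION: printable ASCII is 0x20..=0x7E.
pub fn looks_like_sjis(run: &[u8]) -> bool {
    let mut pairs = 0;
    let mut i = 0;
    while i < run.len() {
        let b = run[i];
        if (0x20..=0x7E).contains(&b) || (0xA1..=0xDF).contains(&b) {
            i += 1; // single-byte character, passed through
        } else if is_sjis_lead(b) && i + 1 < run.len() && is_sjis_trail(run[i + 1]) {
            pairs += 1; // well-formed double-byte character
            i += 2;
        } else {
            return false; // stray lead byte, bad trail byte, or control byte
        }
    }
    pairs > 0
}
```

Requiring `pairs > 0` is what keeps pure-ASCII runs out of the Shift_JIS count, so they are reported only once by the M7 ASCII pass.
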
## Forward work (not yet landed)

- **M9.6** — link `eh_funcinfo` records back to their owning functions via `bl __CxxFrameHandler` registration sites + per-try-block `pHandlerArray` parsing.
- **M11.6** — relax M11.5 to detect non-canonical static-init driver shapes (`_initterm_e` with status return, custom drivers).
- Full SJIS → UTF-8 decoding in the `strings.content` column.
- VMX128 (opcode 4) vector-store xrefs — separate encoding space, low ROI; document if Sylpheed's renderer cluster uses it.