M9.5 + M11.5 + VMX + SJIS/UTF-8: close the post-M5.5 deferred set

Closes the four remaining deferred follow-up items in one bundle.
All four are smaller-scope and additive; lockstep determinism
unaffected (analyzer-only changes).

## M9.5 — __CxxFrameHandler scope-table parsing

- New `xenia_analysis::eh_scope` module. Magic-scans .rdata for the
  three documented MSVC FuncInfo signatures (0x19930520/21/22) on
  4-byte alignment. Each match is parsed as the documented struct
  (BE u32 fields), with sanity caps on max_state / n_try_blocks /
  pointer validity.
- Walks pUnwindMap (UnwindMapEntry, 8 bytes) and pTryBlockMap
  (TryBlockMapEntry, 20 bytes), emitting one row per entry.
- New tables eh_funcinfo, eh_unwind_map, eh_try_blocks.
- Sylpheed yield: 2,588 FuncInfo (all version 0x19930522) /
  10,019 unwind entries / 315 try-blocks.

## M11.5 — Static-init driver chain detection

- New `xenia_analysis::static_init` module. Walks every function
  looking for the canonical _initterm loop: lwz cursor; mtctr;
  bcctrl; addi cursor, cursor, 4 bounded by a compare against another
  constant register. Extracts (array_start, array_end) and reads
  the array.
- Reuses `function_pointer_arrays` table — drivers' arrays land with
  kind='static_init' (replacing M11's prologue-heuristic output where
  the structurally-grounded pattern fires).
- Sylpheed yield: 0 drivers detected; the binary's static-init
  structure does not match the canonical CRT loop. The infrastructure
  is in place; a future M11.6 can relax the pattern.

## VMX vector-store xrefs (M6 follow-up)

- Adds AltiVec/VMX X-form load/store XOs to the M6 opcode-31
  dispatch: lvx/lvxl/lvebx/lvehx/lvewx (reads) and
  stvx/stvxl/stvebx/stvehx/stvewx (writes), all addr_mode=
  'x_form_indexed'. Static resolution still requires both rA and rB
  constant.
- Sylpheed yield: 110 newly-detected stvx writes.

## Shift_JIS + UTF-8 localised-string detection (M7 follow-up)

- Extends `xenia_analysis::strings::analyze` with scan_shift_jis (JIS
  X 0208 lead/trail byte ranges + half-width katakana pass-through)
  and scan_utf8 (2- and 3-byte sequences). At least one multi-byte
  unit required so pure-ASCII strings aren't double-counted.
- SJIS bytes rendered as \xHH escapes for diagnostic readability;
  full SJIS→UTF-8 decoding deferred.
- Sylpheed yield: 790 Shift_JIS strings (Japanese debug + UI text)
  + 39 UTF-8.

## Tests

- +2 EH (parses_minimal_funcinfo_v0, rejects_bogus_max_state)
- +2 static_init (detects_canonical_initterm_loop, rejects_function_without_pattern)
- +2 strings (detects_shift_jis_string, detects_utf8_multibyte_string)

Tests 649→655 (+6 unit tests). DB schema golden + write_analysis_results
signature updated for new EH parameter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MechaCat02
2026-05-10 00:36:53 +02:00
parent b03192c772
commit e428ce33aa
9 changed files with 1159 additions and 14 deletions

@@ -457,11 +457,114 @@ byte-identical digests (`instructions=2000005`).
- Itanium C++ ABI on vtable layout (offset-from-`this` model adapted
by MSVC for Win32 PPC).
## Layer M9.5 — `__CxxFrameHandler` scope-table parsing (landed)
### Schema additions
- New table `eh_funcinfo(address PK, magic, max_state, p_unwind_map,
n_try_blocks, p_try_block_map, n_ip_map_entries, p_ip_to_state_map,
p_es_type_list, eh_flags)`.
- New table `eh_unwind_map(funcinfo_address, state_index, to_state, action_pc,
PRIMARY KEY (funcinfo_address, state_index))`.
- New table `eh_try_blocks(funcinfo_address, try_index, try_low, try_high,
catch_high, n_catches, p_handler_array,
PRIMARY KEY (funcinfo_address, try_index))`.
### What this layer does
- Magic-scans `.rdata` for the documented MSVC FuncInfo signatures
(0x19930520 / 0x19930521 / 0x19930522), reading 4-byte BE values
on 4-byte alignment.
- Sanity-checks `max_state` ≤ 10,000, `n_try_blocks` ≤ 1,000, all
internal pointers landing in valid sections.
- Walks `pUnwindMap` (8-byte UnwindMapEntry) and `pTryBlockMap`
(20-byte TryBlockMapEntry), emitting one row per entry.
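
The scan-and-sanity-check steps above can be sketched roughly as follows. This is a minimal illustration, not the `eh_scope` implementation: `FuncInfoHeader`, `be_u32` and `scan_funcinfo` are hypothetical names, only the first five header fields are read, and the pointer-validity check is left out because it needs the section table.

```rust
/// Hypothetical view of the leading FuncInfo fields; names mirror the
/// `eh_funcinfo` columns but the struct itself is illustrative only.
#[derive(Debug, PartialEq)]
struct FuncInfoHeader {
    offset: usize,
    magic: u32,
    max_state: u32,
    p_unwind_map: u32,
    n_try_blocks: u32,
    p_try_block_map: u32,
}

const MAGICS: [u32; 3] = [0x1993_0520, 0x1993_0521, 0x1993_0522];
const MAX_STATE_CAP: u32 = 10_000;
const MAX_TRY_BLOCKS_CAP: u32 = 1_000;

/// Read one big-endian u32, or None if the read would run off the buffer.
fn be_u32(data: &[u8], off: usize) -> Option<u32> {
    let bytes = data.get(off..off.checked_add(4)?)?;
    Some(u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Scan a `.rdata`-like buffer on 4-byte alignment for FuncInfo magics and
/// keep only the candidates that pass the two count caps.
fn scan_funcinfo(rdata: &[u8]) -> Vec<FuncInfoHeader> {
    let mut found = Vec::new();
    for offset in (0..rdata.len()).step_by(4) {
        let word = |index: usize| be_u32(rdata, offset + index * 4);
        let fields = (word(0), word(1), word(2), word(3), word(4));
        let (magic, max_state, p_unwind_map, n_try_blocks, p_try_block_map) = match fields {
            (Some(a), Some(b), Some(c), Some(d), Some(e)) => (a, b, c, d, e),
            _ => break, // fewer than five words remain, so no later offset fits either
        };
        if !MAGICS.contains(&magic) {
            continue;
        }
        if max_state > MAX_STATE_CAP || n_try_blocks > MAX_TRY_BLOCKS_CAP {
            continue; // implausible counts: treat as a false-positive magic match
        }
        found.push(FuncInfoHeader {
            offset,
            magic,
            max_state,
            p_unwind_map,
            n_try_blocks,
            p_try_block_map,
        });
    }
    found
}
```

The caps are what keep a bare magic match from producing rows: any 4-byte value that happens to equal one of the signatures is dropped unless the fields after it look like plausible counts.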
### What this layer does NOT do
- Does not associate FuncInfo records with their owning function via
the `bl __CxxFrameHandler` registration site — joins to `functions`
by best-effort PC-range queries. A future M9.6 can chase the
registration to make the link explicit.
- Does not parse `pHandlerArray` (per-try-block catch type info).
### Sylpheed yield
- 2,588 FuncInfo records (all version 0x19930522).
- 10,019 unwind-map entries.
- 315 try-blocks across the binary.
## Layer M11.5 — Static-init driver chain detection (landed)
### Schema additions
- Reuses existing `function_pointer_arrays` table — drivers' arrays are
emitted with `kind='static_init'`, replacing M11's prologue-heuristic
output where the structurally-grounded pattern fires.
### What this layer does
- Walks every detected function looking for the canonical `_initterm`-
style loop: `lwz cursor; mtctr; bcctrl; addi cursor, cursor, 4`
bounded by a comparison against another constant register.
- Extracts `(array_start, array_end)` from the cursor's initial
constant value and the end-comparand register.
- Reads the array, validates each entry against
`func_analysis.functions`, and emits the array as `static_init`.
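
The final read-and-validate step can be sketched as below, assuming the loop matcher has already produced `(array_start, array_end)`. The function name, the flat `image`/`image_base` view of the binary, and the choice to skip null slots (as the CRT `_initterm` loop does) are assumptions for illustration, not the module's actual interface.

```rust
use std::collections::BTreeSet;

/// Read big-endian u32 entries in `[array_start, array_end)` and accept the
/// array only if every non-null entry is a known function start.
/// `image` is assumed to be a flat mapping that begins at `image_base`.
fn read_init_array(
    image: &[u8],
    image_base: u32,
    array_start: u32,
    array_end: u32,
    functions: &BTreeSet<u32>,
) -> Option<Vec<u32>> {
    // The range must be non-empty and a whole number of 4-byte slots.
    if array_end <= array_start || (array_end - array_start) % 4 != 0 {
        return None;
    }
    let mut entries = Vec::new();
    let mut addr = array_start;
    while addr < array_end {
        let off = addr.checked_sub(image_base)? as usize;
        let bytes = image.get(off..off + 4)?;
        let target = u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
        if target != 0 {
            // One unknown target is enough to reject the whole candidate array.
            if !functions.contains(&target) {
                return None;
            }
            entries.push(target);
        }
        addr += 4;
    }
    Some(entries)
}
```

Rejecting the whole array on the first unknown target is deliberately strict: it is what would make this output more trustworthy than a prologue heuristic.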
### What this layer does NOT do
- Doesn't handle drivers with multiple back-to-back trampoline loops.
- Doesn't follow `_initterm_e` return-status semantics — both
`_initterm` and `_initterm_e` match if the loop body matches.
### Sylpheed yield
- 0 drivers detected. Sylpheed's static-init structure does not match
the canonical CRT loop pattern; the binary likely calls ctors via
another mechanism (inline at the entry point, or via a different
driver shape). Infrastructure ready for any binary with the
documented MSVC pattern.
## Layer VMX — Vector-store xrefs (M6 follow-up, landed)
Extends the M6 X-form opcode-31 dispatch in `xref.rs` with AltiVec/VMX
vector loads and stores. New entries (XO codes):
- `lvx` (103), `lvxl` (359), `lvebx` (7), `lvehx` (39), `lvewx` (71)
— `addr_mode='x_form_indexed'`, `kind='read'`.
- `stvx` (231), `stvxl` (487), `stvebx` (135), `stvehx` (167),
`stvewx` (199) — `addr_mode='x_form_indexed'`, `kind='write'`.
Same constraint as M6: rows emitted only when both `rA` and `rB`
resolve to known constants (rare but useful).
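
For illustration, the XO dispatch listed above amounts to a lookup on the 10-bit extended-opcode field of an opcode-31 instruction. The sketch below is not the code in `xref.rs`; `XrefKind` and `classify_vmx_xform` are hypothetical names, and the constant-resolution of `rA`/`rB` is out of scope here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum XrefKind {
    Read,
    Write,
}

/// Classify a 32-bit PowerPC instruction word as one of the AltiVec/VMX
/// X-form loads or stores, returning its mnemonic and xref kind.
fn classify_vmx_xform(insn: u32) -> Option<(&'static str, XrefKind)> {
    // The primary opcode lives in the top 6 bits; X-form VMX memory ops use 31.
    if insn >> 26 != 31 {
        return None;
    }
    // The extended opcode (XO) is the 10-bit field just above the low bit.
    let xo = (insn >> 1) & 0x3FF;
    match xo {
        7 => Some(("lvebx", XrefKind::Read)),
        39 => Some(("lvehx", XrefKind::Read)),
        71 => Some(("lvewx", XrefKind::Read)),
        103 => Some(("lvx", XrefKind::Read)),
        359 => Some(("lvxl", XrefKind::Read)),
        135 => Some(("stvebx", XrefKind::Write)),
        167 => Some(("stvehx", XrefKind::Write)),
        199 => Some(("stvewx", XrefKind::Write)),
        231 => Some(("stvx", XrefKind::Write)),
        487 => Some(("stvxl", XrefKind::Write)),
        _ => None,
    }
}
```

An `stvx v0, r3, r4` word, built as `(31 << 26) | (3 << 16) | (4 << 11) | (231 << 1)`, classifies as a write; any other XO under opcode 31 falls through to `None`.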
### Sylpheed yield
- 110 `stvx` writes newly resolved.
## Layer SJIS+UTF-8 — Localised-string detection (M7 follow-up, landed)
Extends `xenia_analysis::strings::analyze` with two additional scanners.
### Shift_JIS detection
Per JIS X 0208: lead byte ∈ [0x81, 0x9F] ∪ [0xE0, 0xEF];
trail byte ∈ [0x40, 0x7E] ∪ [0x80, 0xFC]. Single-byte ASCII and JIS
half-width katakana (0xA1..=0xDF) are passed through. At least one
multi-byte pair must be present (so we don't double-count pure ASCII).
SJIS bytes are rendered as `\xHH` escapes in the `content` column for
diagnostic readability; full SJIS→UTF-8 decoding is a future
enhancement.
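
A minimal sketch of the byte-range check and escape rendering described above. The function names, the printable-ASCII range (0x20..=0x7E) and the upper-case hex digits are assumptions; the real `scan_shift_jis` may differ in all three.

```rust
fn is_sjis_lead(b: u8) -> bool {
    matches!(b, 0x81..=0x9F | 0xE0..=0xEF)
}

fn is_sjis_trail(b: u8) -> bool {
    matches!(b, 0x40..=0x7E | 0x80..=0xFC)
}

/// Validate a candidate byte run as Shift_JIS and render it for diagnostics:
/// ASCII stays literal, every non-ASCII byte becomes a `\xHH` escape.
/// Returns None for invalid runs and for runs with no double-byte pair.
fn render_shift_jis(bytes: &[u8]) -> Option<String> {
    let mut out = String::new();
    let mut pairs = 0usize;
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if is_sjis_lead(b) {
            // A lead byte must be followed by a valid trail byte.
            let trail = *bytes.get(i + 1)?;
            if !is_sjis_trail(trail) {
                return None;
            }
            out.push_str(&format!("\\x{:02X}\\x{:02X}", b, trail));
            pairs += 1;
            i += 2;
        } else if (0x20..=0x7E).contains(&b) {
            out.push(b as char); // printable ASCII passes through unchanged
            i += 1;
        } else if (0xA1..=0xDF).contains(&b) {
            out.push_str(&format!("\\x{:02X}", b)); // half-width katakana
            i += 1;
        } else {
            return None;
        }
    }
    // Require at least one double-byte pair so pure ASCII is not double-counted.
    if pairs > 0 {
        Some(out)
    } else {
        None
    }
}
```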
### UTF-8 detection
Validates 2-byte (`110xxxxx 10xxxxxx`) and 3-byte
(`1110xxxx 10xxxxxx 10xxxxxx`) sequences plus printable ASCII. Skips
4-byte sequences (supplementary planes), which are rare in game text.
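
The bit-pattern validation can be sketched as follows. This is an illustration under stated assumptions rather than the actual `scan_utf8`: the function names are hypothetical, printable ASCII is taken as 0x20..=0x7E, and overlong encodings and surrogates are not rejected, since only the lead and continuation bit patterns are checked.

```rust
fn is_continuation(b: u8) -> bool {
    (b & 0xC0) == 0x80 // 10xxxxxx
}

/// Return true if `bytes` consists solely of printable ASCII plus well-formed
/// 2- and 3-byte UTF-8 sequences, with at least one multi-byte sequence.
fn looks_like_utf8(bytes: &[u8]) -> bool {
    let mut multibyte = 0usize;
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        let len = if (0x20..=0x7E).contains(&b) {
            1
        } else if (b & 0xE0) == 0xC0 {
            2 // 110xxxxx
        } else if (b & 0xF0) == 0xE0 {
            3 // 1110xxxx
        } else {
            return false; // control bytes, stray continuations, 4-byte leads
        };
        if i + len > bytes.len() {
            return false; // truncated sequence
        }
        if !bytes[i + 1..i + len].iter().all(|&c| is_continuation(c)) {
            return false;
        }
        if len > 1 {
            multibyte += 1;
        }
        i += len;
    }
    // Pure ASCII is left to the existing ASCII scanner.
    multibyte > 0
}
```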
### Sylpheed yield
- 790 Shift_JIS strings (Japanese debug + UI text, including
`[WARNING] ードに割り当てるエフェクトIDの指定がない ノードデータが見つからない` style mission strings).
- 39 UTF-8 strings.
- 6,311 ASCII strings (unchanged from M7).
## Forward work (not yet landed)
- **M9.5** — full `__CxxFrameHandler` scope-table parsing (try/catch
range names, per-state cleanup actions).
- **M11.5** — walk the static-initialiser driver call chain from the
entry point to surface ground-truth ctor PCs.
- VMX/VMX128 vector-store xref emission (M6 follow-up).
- UTF-8 / shift_jis localised-string detection in `.rdata` (M7 follow-up).
- **M9.6** — link `eh_funcinfo` records back to their owning functions
via `bl __CxxFrameHandler` registration sites + per-try-block
`pHandlerArray` parsing.
- **M11.6** — relax M11.5 to detect non-canonical static-init driver
shapes (`_initterm_e` with status return, custom drivers).
- Full SJIS → UTF-8 decoding in the `strings.content` column.
- VMX128 (opcode 4) vector-store xrefs — separate encoding space, low
ROI; document if Sylpheed's renderer cluster uses it.