feat(vcr-ra): frame-slot dead-store elimination behind SYNTH_STACK_FWD (flag-off) (#242) #515
Merged
…D (flag-off) (#242)

The optimized ARM path lowers wasm locals frame-resident (`local.get`→`ldr [sp]`, `local.set`→`str [sp]`). On the silicon hot path most of that traffic is spurious: gale #209 measured flat_flight's spills as fitting in R0-R8 once value-allocated. `forward_stack_reloads` (already built, behind SYNTH_STACK_FWD) turns a reload whose value is still in a register into a register move, but leaves the now-dead `str` behind.

This adds the missing completion: `eliminate_dead_frame_stores` removes a `str rX,[sp,#N]` whose slot is overwritten before any read. `eliminate_dead_stores` cannot catch these: a store defines no register, so its register-def liveness is vacuous; slot liveness is what's needed.

SOUNDNESS: overwrite-only by design. A store is proven dead ONLY by a later store to the SAME immediate slot with no intervening read or aliasing op. Reaching the end of the function does NOT count (the "abandoned at frame teardown" case needs epilogue reasoning we deliberately do not do; default is KEEP). The forward scan stops, keeping the store, at the first of: a `ldr` from the slot, ANY sp-relative register-offset access, ANY sp-relative SUB-WORD access (ldrb/ldrh/strb/…, which could touch bytes inside the slot), Push/Pop, a call (a callee may read an outgoing stack-arg slot), an SP redefinition, or any op `reg_effect` does not model (branch/label). Rests on the load-bearing assumption that spill slots are word-aligned and non-overlapping (documented at the pass). 8 unit tests pin each boundary (overwrite-removed, read-kept, distinct-slots, call/SP-change/branch/sub-word/reg-offset all keep); the sub-word and overwrite-at-end cases were verified failing before their guards (non-vacuous).

GATED DEFAULT-OFF. Both passes sit under `SYNTH_STACK_FWD`; off ⇒ byte-identical (no flag-off code path changes; frozen gate green). The shipped behaviour of this PR is therefore UNCHANGED: the win is the flip's payoff, a separate silicon-gated step.

New CI oracle `frame_slot_dce_differential.py` EXECUTES the flag-ON build under unicorn and asserts flight_algo's return matches wasmtime across several sensor inputs (incl. the 0x07FDF307 frozen-anchor value), with linear memory seeded as wasmtime's; it also asserts the flag-off `.text` is byte-identical across builds.

Flip payoff (measured, objdump, NOT shipped here): flat_flight `flight_algo` sp-relative traffic 20→7 (9 reloads forwarded, 4 dead stores removed), 139→135 instructions.

VCR-RA / epic #242. Behaviour frozen on every shipped path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
avrabe added a commit that referenced this pull request, Jun 26, 2026
…t-on (#242) (#516)

The two paired frame-traffic passes shipped flag-off in #514/#515, `forward_stack_reloads` (a `local.set; local.get` reload becomes `mov rY,rX` when rX still holds the value) and `eliminate_dead_frame_stores` (the now-dead `str rX,[sp,#N]` whose slot is overwritten-before-read is removed), go DEFAULT-ON. Escape hatch: `SYNTH_NO_STACK_FWD=1` restores the frame-resident bytes. Same gated path as the cmp→select (v0.13.0) and local-promotion (v0.14.0) flips.

The win lands on the SHIPPED `--relocatable` path (the post-passes run on the direct selector's output, which is what gale ships): flight_seam 774→738, flight_seam_flat 910→878; control_step unchanged (no spurious slot reuse); signed_div_const + all RV32 unchanged (ARM-only; verified m4 and m7 identically, RISC-V byte-identical).

RESULTS bit-identical, proven on every frozen anchor: control_step 0x00210A55 (control_step_differential.py 13/13), flat AND inlined flight_algo 0x07FDF307 (flight_seam_differential.py MATCH both). The execution differential now runs BOTH the default and the opt-out against wasmtime (a default flip is only safe if the shipped path AND its rollback both match) and asserts the two emit different bytes (flip engaged). Broad oracle: `cargo test --workspace` green under the new default (the wast/spec suite is compile-only, so gale's G474RE silicon is the broad-execution check). Clean-room verified (6/6 claims, independent harness).

GATING:
- Frozen goldens re-frozen (flight_seam, flight_seam_flat); control_step + RV32 untouched. New `frozen_fixtures_stack_fwd_escape_hatch_restores_old_bytes` asserts `SYNTH_NO_STACK_FWD=1` restores the pre-flip bytes byte-for-byte: the rollback proof and a tripwire.
- CI oracle updated to test the default + the opt-out.

This is an instruction/memory-op proxy win (flight_algo sp-traffic 20→7, 139→135 insns); the measured CYCLE number is gale's G474RE, confirmed post-ship per the cmp→select silicon-gate-waiver precedent.

const-CSE (SYNTH_CONST_CSE) stays flag-off: its alias-eviction prerequisite is open and it is inert on flat_flight.

VCR-RA / epic #242.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
What
Completes the stack-reload-forwarding lever: after `forward_stack_reloads` (already built, behind `SYNTH_STACK_FWD`) turns a frame reload into a register move, the original `str rX,[sp,#N]` is left behind as a dead store. This adds `eliminate_dead_frame_stores` to remove it.

This is part of the "optimize down toward native parity" push: the frame-resident lowering (`local.get`→`ldr [sp]`, `local.set`→`str [sp]`) is the dominant spurious traffic gale measured on silicon (#209: flat_flight's spills fit in R0-R8 once value-allocated). `eliminate_dead_stores` can't catch these: a `str` defines no register, so its register-def liveness is vacuous; slot liveness is what's needed.
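For readers unfamiliar with the forwarding step that leaves these dead stores behind, here is a minimal sketch on a toy instruction list. It is not the pass itself: the `Op` record, the `"alu"` kind, and the name `forward_stack_reloads_sketch` are hypothetical, and the real pass works on the compiler's own IR. It only illustrates the idea that a reload becomes a `mov` while the stored register still holds the slot's value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Op:
    # Hypothetical toy IR, not the project's real instruction type.
    kind: str                   # "str", "ldr", "mov", "alu"; anything else is unmodelled
    reg: Optional[str] = None   # str: source register; ldr/mov/alu: destination register
    slot: Optional[int] = None  # immediate sp offset for str/ldr
    src: Optional[str] = None   # mov only: source register


def forward_stack_reloads_sketch(ops: List[Op]) -> List[Op]:
    holder: Dict[int, str] = {}  # slot -> register known to hold that slot's value
    out: List[Op] = []
    for op in ops:
        if op.kind == "str":
            holder[op.slot] = op.reg
        elif op.kind == "ldr" and op.slot in holder:
            # The value is still in a register: reload becomes a register move.
            op = Op("mov", reg=op.reg, src=holder[op.slot])
        if op.kind in ("ldr", "mov", "alu"):
            # The destination is overwritten, so forget slots it was holding.
            holder = {s: r for s, r in holder.items() if r != op.reg}
            if op.kind == "ldr":
                holder[op.slot] = op.reg  # a real reload also establishes a fact
        elif op.kind != "str":
            holder.clear()  # unmodelled op (call, branch, ...): drop all facts
        out.append(op)
    return out


example = [Op("str", reg="r0", slot=4), Op("alu", reg="r1"), Op("ldr", reg="r2", slot=4)]
forwarded = forward_stack_reloads_sketch(example)
```

In `forwarded`, the final `ldr` has become a `mov r2, r0`, and the `str` to slot 4 is still present. That leftover store is what the new pass targets.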
Soundness — overwrite-only by design
A store is proven dead only by a later store to the same immediate slot with no intervening read or aliasing op. Reaching the function end does not count (the "abandoned at frame teardown" case needs epilogue reasoning we deliberately skip; the default is KEEP). The forward scan stops and keeps the store at the first of:
- a `ldr` from the slot,
- any sp-relative register-offset access,
- any sp-relative sub-word access (`ldrb`/`ldrh`/`strb`/…, which could touch bytes inside the slot),
- `Push`/`Pop`, a call (callee may read an outgoing stack-arg slot),
- an SP redefinition, or any op `reg_effect` does not model (branch/label).

Rests on the documented load-bearing assumption that spill slots are word-aligned and non-overlapping. 8 unit tests pin each boundary; the sub-word and overwrite-at-end tests were verified to fail before their guards were added (non-vacuous), closing the #483/#496/#507-class hole an advisor flagged before it could reach the flip.
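The overwrite-only rule above can be sketched as a forward scan over a toy instruction list. This is an illustration under stated assumptions, not the shipped pass: the `Op` record, the op-kind strings, and the `_sketch` function names are hypothetical, and `TRANSPARENT` stands in for whatever the real `reg_effect` models as safe to scan past.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Op:
    # Hypothetical toy IR, not the project's real instruction type.
    kind: str                   # "str"/"ldr" are word accesses at [sp, #slot]
    reg: Optional[str] = None
    slot: Optional[int] = None  # immediate sp offset; assumed word-aligned, non-overlapping


# Op kinds the scan may step over. Everything else (sub-word access, register-offset
# access, push/pop, call, sp change, branch, label, ...) stops the scan and KEEPS the store.
TRANSPARENT = {"alu", "mov"}


def store_is_dead_sketch(ops: List[Op], i: int) -> bool:
    slot = ops[i].slot
    for later in ops[i + 1:]:
        if later.kind == "str":
            if later.slot == slot:
                return True   # overwritten before any read: provably dead
            continue          # distinct slot: relies on the non-overlap assumption
        if later.kind == "ldr":
            if later.slot == slot:
                return False  # the stored value is read: keep
            continue
        if later.kind in TRANSPARENT:
            continue
        return False          # unmodelled or aliasing op: conservative keep
    return False              # reached function end: default is KEEP


def eliminate_dead_frame_stores_sketch(ops: List[Op]) -> List[Op]:
    return [
        op for i, op in enumerate(ops)
        if not (op.kind == "str" and store_is_dead_sketch(ops, i))
    ]


overwritten = [Op("str", "r0", 8), Op("alu", "r1"), Op("str", "r1", 8), Op("ldr", "r2", 8)]
cleaned = eliminate_dead_frame_stores_sketch(overwritten)
```

Here `cleaned` drops only the first `str`; the second is kept because a `ldr` from slot 8 follows it. A chain of dead stores to one slot is still safe to remove in a single sweep, since each is justified by a later store and the last store in the chain is never removed by this rule.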
Gating — DEFAULT-OFF, shipped behaviour unchanged
Both passes sit under `SYNTH_STACK_FWD`; off ⇒ byte-identical (no flag-off code path changes; frozen gate green). The shipped behaviour of this PR is unchanged: the win below is the flip's payoff, a separate silicon-gated step.
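The gating contract can be illustrated with a small sketch. How the compiler actually reads `SYNTH_STACK_FWD` is not shown in this PR text, so treating it as an environment variable that is on only when set to `1` is an assumption, and `stack_fwd_enabled` and `run_frame_passes` are hypothetical names.

```python
import os
from typing import Callable, List, Mapping, Sequence

Pass = Callable[[List[str]], List[str]]


def stack_fwd_enabled(env: Mapping[str, str]) -> bool:
    # Assumption: the flag counts as on only when set to exactly "1".
    return env.get("SYNTH_STACK_FWD") == "1"


def run_frame_passes(
    ops: Sequence[str],
    passes: Sequence[Pass],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    result = list(ops)
    if not stack_fwd_enabled(env):
        return result  # flag off: output is the input, untouched
    for frame_pass in passes:
        result = frame_pass(result)
    return result


# Stand-in pass for illustration only: drops the first instruction.
drop_first: Pass = lambda ops: ops[1:]
program = ["str r0,[sp,#8]", "str r1,[sp,#8]"]
off = run_frame_passes(program, [drop_first], env={})
on = run_frame_passes(program, [drop_first], env={"SYNTH_STACK_FWD": "1"})
```

With the flag unset, `off` equals the input exactly, which is the property the frozen gate checks at the byte level.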
New CI oracle `frame_slot_dce_differential.py` EXECUTES the flag-ON build under unicorn and asserts `flight_algo`'s return matches wasmtime across several sensor inputs (incl. the `0x07FDF307` frozen-anchor value), with linear memory seeded as wasmtime's; it also asserts the flag-off `.text` is byte-identical across builds.

Flip payoff (measured via objdump, NOT moved by this PR)

flat_flight (`flight_algo`):
- sp-relative traffic: 20→7 (9 reloads forwarded, 4 dead stores removed)
- instructions: 139→135

VCR-RA / epic #242. Behaviour frozen on every shipped path.