● explainer · v0.1 · 2026-04-27
The Compression Falsification Ladder
We took eight increasingly ambitious claims about compressing Mixture-of-Experts language models and tried, under pre-registered rules, to kill every one of them. Five died, one survived in a deliberately modest form, two remain open. Every empirical verdict on this page comes from a single model family — OLMoE-1B-7B — and we claim no transfer beyond it. This page explains the protocol that did the killing — kill-fast rules, 5 gates, 4 traps, Lean kill theorems — top to bottom.
across G₀–G₇ + R6, R6.5
(G₃, q≈16–32)
(G₄ task-cond, G₇, R7 G_8)
(7 sorry)
The Problem
Compression research selects for survivors, not truth
A familiar pattern in compression papers: try a method on several models, report the configuration that worked, frame the result as a discovery. Nothing in that pipeline distinguishes a real mechanism from a lucky draw. Four failure modes do most of the damage — composite examples, not citations of specific papers:
- Cherry-picking. Five models were tested; one was reported.
- Metric shopping. Perplexity looked bad, so the paper switched to accuracy.
- Hidden fine-tuning. “Post-training” quietly included three epochs of recovery training.
- No negative controls. Nobody checked whether a random method scores just as well.
The common thread is that the decision rule was chosen after seeing the results. The Compression Falsification Ladder inverts this: every rule is locked before the experiment runs, and the protocol is built to kill ideas as fast and as cheaply as possible. If you can’t kill it, it might actually be good.
The Ladder
Eight rungs, climbed by surviving
The ladder is a sequence of compression claims, G₀ through G₇, ordered by ambition. Each rung fixes a family of allowed transformations (what you may do to the weights), an invariant (what must be preserved), a kill threshold (the pre-registered number at which the claim dies), and a non-vacuity constraint (so a rung can’t “pass” by claiming nothing). The bottom rungs ask whether free symmetries of the network — orthogonal regauges, Tucker rotations — hide compressible structure. The middle rungs aggregate weights along the MoE router’s actual routing distribution. The top rungs bet on task-conditional circuits and pairwise router statistics.
Climbing happens by surviving honest tests, not by declaring victory. Every rung ends in a one-shot verdict — KILL, OPEN, or CONSTRUCTIVE — and a killed rung is a published result, not a discarded one. Where the ladder stands today:
| Rung | Claim | Verdict | Evidence |
|---|---|---|---|
| G0 | Free orthogonal regauge | KILLED | r_eff ≥ 9 (layer), ≥ 32 (expert); residual ≥ 88.9% / 96.9% |
| G1 | Reclassified — spectrum diagnostic only | RECLASSIFIED | Spectrum-reshape diagnostic only; G₁.0' (quantize round-trip) KILLED — 0/24 vs_orth pass at 0.50 gate (g1_0_prime_gauge_quant_r2, decision: kill) |
| G2 | Free Tucker on layer × expert × hidden | KILLED | r_eff ≥ 256; worst observed 262.773; residual ≥ 99.6% |
| G3 | Per-route weight aggregation | CONSTRUCTIVE | MODEST CONSTRUCTIVE — r_eff_eff min = 3.89, mean ≈ 5.6; q≈16–32 captures 58–80% energy |
| G4 | Task-conditional rank reduction | KILLED | KILLED — uniform rank-(16,24) killed (r2: 8/12, r3: 9/12, below 10/12 threshold); task-conditional r4 killed (wikitext fails at all ranks); full 16-layer deploy r6 killed (humaneval +9.90% > 8% gate, gsm8k +60.9%, wikitext +110.5%). |
| G5 | LRH-implied compression | KILLED | KILLED AT LRH PROXY — 13/16 layers LRH_IMPLAUSIBLE_AFTER_DEFLATION |
| G6 | Single-router rank reduction | KILLED | Route-flip rate above tolerance at all ranks |
| G7 | Pairwise router certificate (gated) | OPEN | HOT_SET_DISPENSED — C_pair r_eff=4.64, 153 dominant pairs; top-100 coverage 0.72 < 0.95 threshold |
| G10 | Contractive scan accumulation | KILLED | KILLED 2026-05-02 — RW1.x replacement ladder blocked: best oracle RW1.4.8 H=0.686533 > 0.5 held-out gate; train→eval generalization gap is the load-bearing block (not α≈0.984 replay). Decision: KILL_RW1X_BLOCKS_RW2_RW3 (cross-check/preregistry/rwkv7_replacement_ladder_rw1_rw3/decision.json). Lean foundation: 20 theorems, 0 sorry (RWKVStability.lean). |
| G10.5 | Smooth (RWKV) + Sparse (top-k routing) attention split | REFINE | DIAGNOSTIC-CONSTRUCTIVE 2026-05-02: hybrid output rel L2 0.213 vs RWKV-only 0.788 (3.7× improvement). Sparse mass mean 0.59 — confirms G_10's failure was wrong target (RWKV trying to fit full A instead of A_smooth). Routing-stability gate failed (Jaccard 0.23 < 0.5) because routing is dynamic content-addressed, not static — methodology issue, not theory issue. |
The shape of the scoreboard is the finding — with the standing caveat that every verdict is specific to OLMoE-1B-7B under this protocol. Free regauging (G₀) and cross-mode Tucker reshaping (G₂) die by closed-form argument over measured OLMoE spectra: once the singular values are on the table, no rotation in the allowed family can concentrate enough energy to meet the budget. Learned non-orthogonal gauges (G₁) beat a zero-cost random orthogonal rotation by only 2–3%, so the rung was reclassified as a spectrum diagnostic rather than promoted as a method. The one constructive survivor, G₃, aggregates expert weights along the routing distribution and is deliberately labelled modest: 58–80% of energy captured at q≈16–32, blocked below that by a measured effective-rank floor (minimum 3.89). G₄ splits: the uniform variant is killed while the task-conditional variant remains open — a split a single headline number would have hidden.
The 5 Gates
Five all-or-nothing checkpoints
A compression claim that survives its rung-specific probe must still clear five independent gates. They are deliberately conjunctive — pass four and fail one, and the idea is dead. Each gate pins a precise metric and explicitly bans the looser proxy that usually sneaks past review:
- Bytes. Does the compressed model actually occupy fewer bytes (
global_delta_bytes < 0)? Parameter counts and FLOPs are disallowed — same-precision residuals and dtype overheads make both lie. - Activation Error. Is the activation-weighted relative error $|(W_{\text{eff}} - W’)X|F / |W{\text{eff}} X|_F$ small on real data? The spectral norm of $W - W’$ is disallowed: it ignores the data distribution.
- Route Stability. Does the MoE router still select the same top-k experts (flip rate ≤ 0.10)? Averaged top-1 accuracy is disallowed — averaging smooths over per-token expert flips.
- Task Conditioning. Does the worst task survive, not just the favourite one? Single-task perplexity is disallowed.
- Loss Gate. Does end-to-end NLL on held-out tasks stay within 5%? Perplexity on a training shard is disallowed — that tests memorization.
A worked example — one candidate on one model, illustrative rather than a survey: on OLMoE-1B-7B, a uniform low-rank candidate clears Bytes (−22.4 MB), Activation Error (0.082), Route Stability (0.071), and the Loss Gate (3.4% ΔNLL) — and dies at Gate 4, where the worst-task error reaches 0.241. One number, one kill. The cost of conjunctive gates is false negatives: a sound method that misses one gate by noise dies anyway. That is the intended direction of error.
The 4 Traps
Trap cells: lie detectors aimed at the experimenter
Gates test the method. Trap cells test us. Every experiment carries four embedded controls whose expected outcomes are declared in advance; a trap that fires “wrong” invalidates the run, not the idea.
- Random Basis Trap. Replace the optimized basis with a random orthogonal one. Random should lose badly. At G₁ it didn’t: random was fractionally better than the learned gauge (ratio 1.013, 24/24 cells failing the vs-orth gate) — which is precisely how G₁ got reclassified.
- Task Swap Trap. Apply the code-fitted basis to text and vice versa. If the basis is genuinely task-conditional, the swap must fail — and at G₄ it failed in 12/12 cells. That is the outcome a real task signal predicts; consistency with the prediction is all a trap can certify.
- Off-Projection Trap. Route data through the subspace the basis claims to ignore. Anything short of catastrophe means the basis isn’t ignoring what it claims. At G₄ r6 this inflated wikitext NLL by +647% — the trap fired correctly.
- Trivial Pass Trap. Feed the uncompressed model through the full pipeline. Identity must score zero error; anything else is a pipeline bug. It held at 0% across every run.
Beneath the traps sits the τ floor: the best error achieved by B random baselines on N held-out points. A method PASSes only if it beats τ by a pre-registered relative margin. The τ floor itself had to earn trust — three early rounds at N=100, B=1 were too noisy to decide anything and forced a redesign; the current configuration (N=1000, B=10) is stable.
Pre-Registered Rules
Lock the rules, then play — and kill fast
Before any experiment runs, the claim, metric, sample count, random seeds, and decision threshold are written into tau.json and sealed with a SHA-256 fingerprint. The experiment script verifies the seal before executing: edited file → broken seal → the experiment refuses to start. The seal is tamper-evident, not tamper-proof — a determined experimenter could edit and re-seal, and the record of that would live in version history, not in the hash. What it eliminates is the quieter failure: thresholds that drift after seeing data, with nobody able to say when.
The kill-fast protocol then minimizes the cost of every death:
- Analytical pre-filter (5 min) — byte budget, spectrum floor, random floor, with pencil and paper. If any check fails, kill immediately. No GPU needed.
- Hardest-cell-first (≈4 min vs ≈60) — run the worst-case cell (Layer 3 + wikitext) first. If it fails, stop.
- One-shot decisions — no “inconclusive”. The decision rule is locked; ambiguity is a protocol failure, not an invitation to keep trying.
When a rung dies, the death is archived: a machine-readable decision.json, a human-readable postmortem.md, a trap_results.json proving the controls fired correctly, and an updated theorem tracker.
Lean Kill Theorems
When a kill is analytical, it gets machine-checked
Some rungs die empirically. Others die analytically — and an analytical kill is only as trustworthy as its algebra. So every analytical kill on the ladder is paired with a Lean 4 kill theorem, machine-checked against Mathlib. The G₂ kill, for instance, is not “we ran Tucker and it looked bad”: olmoe_tucker_cross_mode_no_compression proves that no cross-mode Tucker reshape of OLMoE’s (layer × expert × hidden) tensor can compress past the budget. The proof is conditional in exactly one place: the measured singular values enter as named axioms rather than being smuggled into the algebra. A wrong measurement would void the kill; the algebra downstream of it cannot be wrong. The G₀ kill rests on invertible_gauge_preserves_rank_family; the gate metrics themselves are formalized, down to byte-count closed forms (static_student_bytes_lt_dense) and route-stability margins (route_flip_zero_above_margin).
The tracker is honest about its frontier: 152 theorems are proved end-to-end, 7 remain sorry placeholders, and 5 measured-data axioms are declared rather than hidden. The status grid regenerates from the Lean source on every commit, so the website cannot claim more than the proof assistant accepts.
How to Verify
No number on this page is hand-typed
Every number above renders from _data/site_data.json, which is regenerated by scripts/generate_site_data.py from three sources: the lecture notes, the Lean files under LeanMining/NeuralGeometry/, and the raw probe outputs in cross-check/results/*.json. Each rung card links the exact probe script and backing JSON that produced its verdict. The Lean development rebuilds with lake build LeanMining.NeuralGeometry.
If you want to stress the protocol itself, the playground lets you pick a rung, move the rank threshold and τ margin, and watch the verdict change — probe → trap check → τ comparison → one-shot verdict.