explainer · v0.1 · 2026-04-27

The Compression Falsification Ladder

We took eight increasingly ambitious claims about compressing Mixture-of-Experts language models and tried, under pre-registered rules, to kill every one of them. Five died, one survived in a deliberately modest form, two remain open. Every empirical verdict on this page comes from a single model family — OLMoE-1B-7B — and we claim no transfer beyond it. This page explains the protocol that did the killing — kill-fast rules, 5 gates, 4 traps, Lean kill theorems — top to bottom.

7
Killed rungs
across G₀–G₇ + R6, R6.5
1
Modest constructive
(G₃, q≈16–32)
3
Open / gated
(G₄ task-cond, G₇, R7 G_8)
152 / 159
Lean theorems proved
(7 sorry)

The Problem

Compression research selects for survivors, not truth

A familiar pattern in compression papers: try a method on several models, report the configuration that worked, frame the result as a discovery. Nothing in that pipeline distinguishes a real mechanism from a lucky draw. Four failure modes do most of the damage — composite examples, not citations of specific papers:

The common thread is that the decision rule was chosen after seeing the results. The Compression Falsification Ladder inverts this: every rule is locked before the experiment runs, and the protocol is built to kill ideas as fast and as cheaply as possible. If you can’t kill it, it might actually be good.

The Ladder

Eight rungs, climbed by surviving

The ladder is a sequence of compression claims, G₀ through G₇, ordered by ambition. Each rung fixes a family of allowed transformations (what you may do to the weights), an invariant (what must be preserved), a kill threshold (the pre-registered number at which the claim dies), and a non-vacuity constraint (so a rung can’t “pass” by claiming nothing). The bottom rungs ask whether free symmetries of the network — orthogonal regauges, Tucker rotations — hide compressible structure. The middle rungs aggregate weights along the MoE router’s actual routing distribution. The top rungs bet on task-conditional circuits and pairwise router statistics.

Climbing happens by surviving honest tests, not by declaring victory. Every rung ends in a one-shot verdict — KILL, OPEN, or CONSTRUCTIVE — and a killed rung is a published result, not a discarded one. Where the ladder stands today:

RungClaimVerdictEvidence
G0 Free orthogonal regauge KILLED r_eff ≥ 9 (layer), ≥ 32 (expert); residual ≥ 88.9% / 96.9%
G1 Reclassified — spectrum diagnostic only RECLASSIFIED Spectrum-reshape diagnostic only; G₁.0' (quantize round-trip) KILLED — 0/24 vs_orth pass at 0.50 gate (g1_0_prime_gauge_quant_r2, decision: kill)
G2 Free Tucker on layer × expert × hidden KILLED r_eff ≥ 256; worst observed 262.773; residual ≥ 99.6%
G3 Per-route weight aggregation CONSTRUCTIVE MODEST CONSTRUCTIVE — r_eff_eff min = 3.89, mean ≈ 5.6; q≈16–32 captures 58–80% energy
G4 Task-conditional rank reduction KILLED KILLED — uniform rank-(16,24) killed (r2: 8/12, r3: 9/12, below 10/12 threshold); task-conditional r4 killed (wikitext fails at all ranks); full 16-layer deploy r6 killed (humaneval +9.90% > 8% gate, gsm8k +60.9%, wikitext +110.5%).
G5 LRH-implied compression KILLED KILLED AT LRH PROXY — 13/16 layers LRH_IMPLAUSIBLE_AFTER_DEFLATION
G6 Single-router rank reduction KILLED Route-flip rate above tolerance at all ranks
G7 Pairwise router certificate (gated) OPEN HOT_SET_DISPENSED — C_pair r_eff=4.64, 153 dominant pairs; top-100 coverage 0.72 < 0.95 threshold
G10 Contractive scan accumulation KILLED KILLED 2026-05-02 — RW1.x replacement ladder blocked: best oracle RW1.4.8 H=0.686533 > 0.5 held-out gate; train→eval generalization gap is the load-bearing block (not α≈0.984 replay). Decision: KILL_RW1X_BLOCKS_RW2_RW3 (cross-check/preregistry/rwkv7_replacement_ladder_rw1_rw3/decision.json). Lean foundation: 20 theorems, 0 sorry (RWKVStability.lean).
G10.5 Smooth (RWKV) + Sparse (top-k routing) attention split REFINE DIAGNOSTIC-CONSTRUCTIVE 2026-05-02: hybrid output rel L2 0.213 vs RWKV-only 0.788 (3.7× improvement). Sparse mass mean 0.59 — confirms G_10's failure was wrong target (RWKV trying to fit full A instead of A_smooth). Routing-stability gate failed (Jaccard 0.23 < 0.5) because routing is dynamic content-addressed, not static — methodology issue, not theory issue.

The shape of the scoreboard is the finding — with the standing caveat that every verdict is specific to OLMoE-1B-7B under this protocol. Free regauging (G₀) and cross-mode Tucker reshaping (G₂) die by closed-form argument over measured OLMoE spectra: once the singular values are on the table, no rotation in the allowed family can concentrate enough energy to meet the budget. Learned non-orthogonal gauges (G₁) beat a zero-cost random orthogonal rotation by only 2–3%, so the rung was reclassified as a spectrum diagnostic rather than promoted as a method. The one constructive survivor, G₃, aggregates expert weights along the routing distribution and is deliberately labelled modest: 58–80% of energy captured at q≈16–32, blocked below that by a measured effective-rank floor (minimum 3.89). G₄ splits: the uniform variant is killed while the task-conditional variant remains open — a split a single headline number would have hidden.

Browse every rung, its thresholds, and its probe →

The 5 Gates

Five all-or-nothing checkpoints

A compression claim that survives its rung-specific probe must still clear five independent gates. They are deliberately conjunctive — pass four and fail one, and the idea is dead. Each gate pins a precise metric and explicitly bans the looser proxy that usually sneaks past review:

  1. Bytes. Does the compressed model actually occupy fewer bytes (global_delta_bytes < 0)? Parameter counts and FLOPs are disallowed — same-precision residuals and dtype overheads make both lie.
  2. Activation Error. Is the activation-weighted relative error $|(W_{\text{eff}} - W’)X|F / |W{\text{eff}} X|_F$ small on real data? The spectral norm of $W - W’$ is disallowed: it ignores the data distribution.
  3. Route Stability. Does the MoE router still select the same top-k experts (flip rate ≤ 0.10)? Averaged top-1 accuracy is disallowed — averaging smooths over per-token expert flips.
  4. Task Conditioning. Does the worst task survive, not just the favourite one? Single-task perplexity is disallowed.
  5. Loss Gate. Does end-to-end NLL on held-out tasks stay within 5%? Perplexity on a training shard is disallowed — that tests memorization.

A worked example — one candidate on one model, illustrative rather than a survey: on OLMoE-1B-7B, a uniform low-rank candidate clears Bytes (−22.4 MB), Activation Error (0.082), Route Stability (0.071), and the Loss Gate (3.4% ΔNLL) — and dies at Gate 4, where the worst-task error reaches 0.241. One number, one kill. The cost of conjunctive gates is false negatives: a sound method that misses one gate by noise dies anyway. That is the intended direction of error.

Read the full gate definitions and try the FAIL toggles →

The 4 Traps

Trap cells: lie detectors aimed at the experimenter

Gates test the method. Trap cells test us. Every experiment carries four embedded controls whose expected outcomes are declared in advance; a trap that fires “wrong” invalidates the run, not the idea.

Beneath the traps sits the τ floor: the best error achieved by B random baselines on N held-out points. A method PASSes only if it beats τ by a pre-registered relative margin. The τ floor itself had to earn trust — three early rounds at N=100, B=1 were too noisy to decide anything and forced a redesign; the current configuration (N=1000, B=10) is stable.

Flip the trap cards and drive the τ calculator →

Pre-Registered Rules

Lock the rules, then play — and kill fast

Before any experiment runs, the claim, metric, sample count, random seeds, and decision threshold are written into tau.json and sealed with a SHA-256 fingerprint. The experiment script verifies the seal before executing: edited file → broken seal → the experiment refuses to start. The seal is tamper-evident, not tamper-proof — a determined experimenter could edit and re-seal, and the record of that would live in version history, not in the hash. What it eliminates is the quieter failure: thresholds that drift after seeing data, with nobody able to say when.

The kill-fast protocol then minimizes the cost of every death:

  1. Analytical pre-filter (5 min) — byte budget, spectrum floor, random floor, with pencil and paper. If any check fails, kill immediately. No GPU needed.
  2. Hardest-cell-first (≈4 min vs ≈60) — run the worst-case cell (Layer 3 + wikitext) first. If it fails, stop.
  3. One-shot decisions — no “inconclusive”. The decision rule is locked; ambiguity is a protocol failure, not an invitation to keep trying.

When a rung dies, the death is archived: a machine-readable decision.json, a human-readable postmortem.md, a trap_results.json proving the controls fired correctly, and an updated theorem tracker.

See the seal demo and the full pre-registration template →

Lean Kill Theorems

When a kill is analytical, it gets machine-checked

Some rungs die empirically. Others die analytically — and an analytical kill is only as trustworthy as its algebra. So every analytical kill on the ladder is paired with a Lean 4 kill theorem, machine-checked against Mathlib. The G₂ kill, for instance, is not “we ran Tucker and it looked bad”: olmoe_tucker_cross_mode_no_compression proves that no cross-mode Tucker reshape of OLMoE’s (layer × expert × hidden) tensor can compress past the budget. The proof is conditional in exactly one place: the measured singular values enter as named axioms rather than being smuggled into the algebra. A wrong measurement would void the kill; the algebra downstream of it cannot be wrong. The G₀ kill rests on invertible_gauge_preserves_rank_family; the gate metrics themselves are formalized, down to byte-count closed forms (static_student_bytes_lt_dense) and route-stability margins (route_flip_zero_above_margin).

The tracker is honest about its frontier: 152 theorems are proved end-to-end, 7 remain sorry placeholders, and 5 measured-data axioms are declared rather than hidden. The status grid regenerates from the Lean source on every commit, so the website cannot claim more than the proof assistant accepts.

Inspect the live theorem grid →

How to Verify

No number on this page is hand-typed

Every number above renders from _data/site_data.json, which is regenerated by scripts/generate_site_data.py from three sources: the lecture notes, the Lean files under LeanMining/NeuralGeometry/, and the raw probe outputs in cross-check/results/*.json. Each rung card links the exact probe script and backing JSON that produced its verdict. The Lean development rebuilds with lake build LeanMining.NeuralGeometry.

If you want to stress the protocol itself, the playground lets you pick a rung, move the rank threshold and τ margin, and watch the verdict change — probe → trap check → τ comparison → one-shot verdict.

Go deeper