The 5 Gates
Five checkpoints. All-or-nothing.
Before any compression idea is accepted, it must pass five independent tests. Think of these as security checkpoints at an airport: skip any one and something dangerous gets through.
1. Bytes
Does the compressed model actually use fewer bytes?
Analogy. Is the suitcase actually smaller?
Metric. global_delta_bytes < 0
Lean. static_student_bytes_lt_dense
OLMoE. -22.4 MB (PASS)
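A minimal sketch of how the Bytes metric might be computed, assuming `global_delta_bytes` means the total stored bytes of the compressed model minus those of the dense model, counting every tensor the compressed model ships. The tensor names, shapes, and the `DTYPE_BYTES` table are hypothetical and only illustrate the bookkeeping.

```python
from math import prod

# Bytes per element for a few common dtypes (illustrative, not exhaustive).
DTYPE_BYTES = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

def total_bytes(tensors):
    """Sum storage over a dict of name -> (shape, dtype)."""
    return sum(prod(shape) * DTYPE_BYTES[dtype] for shape, dtype in tensors.values())

def global_delta_bytes(dense, student):
    """Assumed definition: student bytes minus dense bytes; negative means smaller."""
    return total_bytes(student) - total_bytes(dense)

# Hypothetical layer: one dense bf16 matrix vs. a low-rank pair plus an int8 residual.
dense = {"W": ((2048, 1024), "bfloat16")}
student = {
    "A": ((2048, 64), "bfloat16"),
    "B": ((64, 1024), "bfloat16"),
    "residual": ((2048, 1024), "int8"),
}

delta = global_delta_bytes(dense, student)
passes_bytes_gate = delta < 0
```

In this toy layer the residual is counted along with the factors, and the delta is still negative. Storing the same residual in bfloat16 instead would push the delta above zero, which is the case this gate exists to catch.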
2. Activation Error
Does the compressed model produce similar outputs on real data?
Analogy. Can you still find your clothes?
Metric. ‖(W_eff − W')X‖_F / ‖W_eff X‖_F
Lean. activation_weighted_error_le_frobenius
OLMoE. 0.082 relative (PASS)
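The relative activation error can be sketched directly from the formula. This version uses plain Python lists to stay dependency-free (in practice an array library would be the natural choice); the helper names and the toy matrices are assumptions, and the columns of `x` are taken to be tokens.

```python
from math import sqrt

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def frobenius(m):
    """Frobenius norm: square root of the sum of squared entries."""
    return sqrt(sum(v * v for row in m for v in row))

def activation_error(w_eff, w_prime, x):
    """Relative error ||(W_eff - W') X||_F / ||W_eff X||_F."""
    diff = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(w_eff, w_prime)]
    return frobenius(matmul(diff, x)) / frobenius(matmul(w_eff, x))

# Toy 2x2 weights and a 2x2 activation batch (each column is one token).
w_eff = [[1.0, 0.0], [0.0, 1.0]]
w_prime = [[1.0, 0.0], [0.0, 0.5]]
x = [[3.0, 0.0], [4.0, 0.0]]

err = activation_error(w_eff, w_prime, x)
```

The sketch does not guard against a zero denominator, which would occur if `W_eff X` were identically zero on the calibration batch.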
3. Route Stability
Does the router still pick the right experts?
Analogy. Do the zippers still work?
Metric. route_flip_rate ≤ 0.10
Lean. route_flip_zero_above_margin
OLMoE. 0.071 flip rate (PASS)
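One way to read `route_flip_rate` is as the fraction of tokens whose selected expert set changes after compression. The sketch below assumes that reading. The source does not say how many experts per token the flip is measured over, so `k` is left as a parameter, and the logits are made up.

```python
def top_k(logits, k):
    """Indices of the k largest router logits, as a set (order ignored)."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])

def route_flip_rate(dense_logits, student_logits, k=1):
    """Fraction of tokens whose selected expert set differs between the two routers."""
    flips = sum(
        top_k(d, k) != top_k(s, k) for d, s in zip(dense_logits, student_logits)
    )
    return flips / len(dense_logits)

# Hypothetical router logits: 4 tokens, 3 experts.
dense_logits = [[2.0, 1.0, 0.0], [0.1, 0.3, 0.2], [1.0, 5.0, 2.0], [0.0, 0.1, 0.9]]
student_logits = [[1.8, 1.1, 0.1], [0.1, 0.2, 0.3], [1.2, 4.7, 2.1], [0.1, 0.2, 0.8]]

rate = route_flip_rate(dense_logits, student_logits)
passes_route_gate = rate <= 0.10
```

Only the second token flips here, and it is the one whose top two dense logits are closest together. One flip in four tokens already exceeds the 0.10 threshold.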
4. Task Conditioning
Does it work on all tasks, not just one?
Analogy. Does the packing work for business trips and vacations?
Metric. max_t ‖(W_eff,t − W'_t)X_t‖_F / ‖W_eff,t X_t‖_F
Lean. task_union_basis_contains_task_basis
OLMoE. 0.241 relative (worst task) (FAIL)
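The `max_t` in this metric means the gate reports the worst task, not the average. A short sketch, assuming the per-task relative errors have already been computed; the task names and values are made up for illustration and are not measurements.

```python
def worst_task(per_task_error):
    """Return (task, error) for the task with the largest relative error."""
    task = max(per_task_error, key=per_task_error.get)
    return task, per_task_error[task]

# Made-up per-task relative activation errors, for illustration only.
per_task_error = {"code": 0.06, "math": 0.21, "chat": 0.05, "summarization": 0.08}

task, gate_value = worst_task(per_task_error)
mean_error = sum(per_task_error.values()) / len(per_task_error)
```

With these numbers the mean is about 0.10 while the worst task sits at 0.21, so a gate based on the mean would hide the task that is actually being damaged.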
5. Loss Gate
Does the full model still work end-to-end?
Analogy. Can you actually leave the house with it?
Metric. Δ NLL on held-out tasks ≤ 5%
Lean. one_bit_compounding_exceeds_gate
OLMoE. 0.034 Δ NLL fraction (PASS)
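A sketch of the Loss Gate check, assuming "Δ NLL fraction" means the relative increase in negative log-likelihood of the compressed model over the dense one on the same held-out data. The function names and NLL values are hypothetical.

```python
def delta_nll_fraction(nll_dense, nll_student):
    """Relative NLL increase of the compressed model over the dense model."""
    return (nll_student - nll_dense) / nll_dense

def passes_loss_gate(nll_dense, nll_student, max_fraction=0.05):
    """True when the relative NLL increase stays within the 5% budget."""
    return delta_nll_fraction(nll_dense, nll_student) <= max_fraction

# Hypothetical mean per-token NLL on a held-out set.
nll_dense, nll_student = 2.50, 2.60
frac = delta_nll_fraction(nll_dense, nll_student)
ok = passes_loss_gate(nll_dense, nll_student)
```

Under this reading, the reported OLMoE value of 0.034 corresponds to a 3.4% relative increase, inside the 5% budget.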
Why disallowed metrics fail
Each gate has a precise metric. A looser proxy can pass an idea that the honest test would reject. Common pitfalls:
| Gate | Disallowed substitute | Why it lies |
|---|---|---|
| 1. Bytes | parameter count, FLOPs | Neither measures storage: a scheme can keep residual tensors at the same precision, and the same parameter count can need more bytes at a wider dtype. |
| 2. Activation Error | spectral norm of $W - W'$ | Spectral norm ignores the data distribution; activation-weighted error penalizes the directions that matter on $X$. |
| 3. Route Stability | averaged top-1 accuracy | Averaging smooths over the per-token expert flips that destroy task-conditional behavior. |
| 4. Task Conditioning | single-task perplexity | Lets you tune to one task while quietly killing another. |
| 5. Loss Gate | perplexity on a training shard | Tests memorization, not the deployed task. |
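The Activation Error row can be made concrete with a small worked example. The sketch below restricts the error matrix `E = W - W'` to be diagonal, so its spectral norm is simply the largest absolute diagonal entry. The activations and error values are made up, and the helper names are hypothetical.

```python
from math import sqrt

def spectral_norm_diag(diag):
    """Spectral norm of a diagonal matrix: its largest absolute diagonal entry."""
    return max(abs(d) for d in diag)

def activation_weighted_norm_diag(diag, x):
    """||E X||_F for diagonal E, where row i of X holds feature i across tokens."""
    return sqrt(sum((d * v) ** 2 for d, row in zip(diag, x) for v in row))

# Activations live entirely on feature 0; feature 1 is never excited.
x = [[1.0, 2.0, 2.0], [0.0, 0.0, 0.0]]

error_on_dead_feature = [0.0, 0.5]  # E perturbs only feature 1
error_on_live_feature = [0.5, 0.0]  # E perturbs only feature 0

same_spectral = (
    spectral_norm_diag(error_on_dead_feature)
    == spectral_norm_diag(error_on_live_feature)
)
dead = activation_weighted_norm_diag(error_on_dead_feature, x)
live = activation_weighted_norm_diag(error_on_live_feature, x)
```

Both perturbations have the same spectral norm, yet one changes nothing on this data and the other changes every token. Only the activation-weighted measure tells them apart.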
Real OLMoE-1B-7B numbers
| Gate | Metric | Value | Verdict |
|---|---|---|---|
| 1. Bytes | global_delta_bytes < 0 | -22.4 MB | PASS |
| 2. Activation Error | ‖(W_eff − W')X‖_F / ‖W_eff X‖_F | 0.082 relative | PASS |
| 3. Route Stability | route_flip_rate ≤ 0.10 | 0.071 flip rate | PASS |
| 4. Task Conditioning | max_t ‖(W_eff,t − W'_t)X_t‖_F / ‖W_eff,t X_t‖_F | 0.241 relative (worst task) | FAIL |
| 5. Loss Gate | Δ NLL on held-out tasks ≤ 5% | 0.034 Δ NLL fraction | PASS |
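Because the gates are all-or-nothing, these five verdicts collapse to a single decision. A minimal sketch of that rule, with the verdicts copied from the OLMoE-1B-7B numbers above; the function name is hypothetical.

```python
# Verdicts as reported for OLMoE-1B-7B, in gate order.
verdicts = [
    ("1. Bytes", "PASS"),
    ("2. Activation Error", "PASS"),
    ("3. Route Stability", "PASS"),
    ("4. Task Conditioning", "FAIL"),
    ("5. Loss Gate", "PASS"),
]

def first_failure(verdicts):
    """Return the first gate that did not pass, or None if every gate passed."""
    for gate, verdict in verdicts:
        if verdict != "PASS":
            return gate
    return None

killed_at = first_failure(verdicts)
accepted = killed_at is None
```

Four passes out of five still yield a rejection: the idea is killed at Task Conditioning.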