The 5 Gates

Five checkpoints. All-or-nothing.

Before any compression idea is accepted, it must pass five independent tests. Think of these as security checkpoints at an airport: skip any one and something dangerous gets through.

1. Bytes

Does the compressed model actually use fewer bytes?

Analogy. Is the suitcase actually smaller?

Metric. global_delta_bytes < 0

Lean. static_student_bytes_lt_dense

OLMoE. -22.4 MB — PASS
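
The byte check above can be sketched as a plain accounting exercise. This is a minimal illustration, not the project's implementation: the `DTYPE_BYTES` table, the helper names, and the toy tensor shapes are all assumptions; only the `global_delta_bytes < 0` criterion comes from the text.

```python
# Hypothetical bytes-per-element table; extend for other storage formats.
DTYPE_BYTES = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def tensor_bytes(num_elements, dtype):
    """Storage for one tensor: element count times bytes per element."""
    return num_elements * DTYPE_BYTES[dtype]

def global_delta_bytes(dense, student):
    """Student bytes minus dense bytes, summed over every stored tensor.

    Each argument is a list of (num_elements, dtype) pairs. The student list
    should include everything that ships: factors, residuals, indices.
    """
    dense_total = sum(tensor_bytes(n, d) for n, d in dense)
    student_total = sum(tensor_bytes(n, d) for n, d in student)
    return student_total - dense_total

def passes_bytes_gate(dense, student):
    """Gate 1 criterion from the text: global_delta_bytes < 0."""
    return global_delta_bytes(dense, student) < 0

# Toy case: a 1024x1024 bf16 matrix replaced by two rank-64 bf16 factors.
dense = [(1024 * 1024, "bf16")]
low_rank = [(1024 * 64, "bf16"), (64 * 1024, "bf16")]
# Same factors, but shipped together with a full-size fp32 residual.
with_residual = low_rank + [(1024 * 1024, "fp32")]

print(global_delta_bytes(dense, low_rank))       # negative: smaller on disk
print(global_delta_bytes(dense, with_residual))  # positive: larger on disk
```

The second case has the same low-rank factors as the first and still fails, because the residual it carries costs more bytes than the original matrix.
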

2. Activation Error

Does the compressed model produce similar outputs on real data?

Analogy. Can you still find your clothes?

Metric. ‖(W_eff − W')X‖_F / ‖W_eff X‖_F

Lean. activation_weighted_error_le_frobenius

OLMoE. 0.082 relative — PASS
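
The ratio in the Metric line can be computed directly from the two weight matrices and a batch of activations. The sketch below uses plain Python lists so it runs without dependencies; the helper names and the 2x2 toy matrices are assumptions chosen for illustration, with columns of `X` standing in for calibration samples.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def subtract(A, B):
    """Elementwise A - B for matrices of the same shape."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def frobenius(M):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in M for v in row))

def relative_activation_error(W_eff, W_prime, X):
    """||(W_eff - W') X||_F / ||W_eff X||_F, as in the Metric line."""
    numerator = frobenius(matmul(subtract(W_eff, W_prime), X))
    denominator = frobenius(matmul(W_eff, X))
    return numerator / denominator

# Toy matrices (hypothetical): W' halves the second diagonal entry.
W_eff = [[2.0, 0.0], [0.0, 1.0]]
W_prime = [[2.0, 0.0], [0.0, 0.5]]
X = [[3.0, 0.0], [0.0, 4.0]]  # each column is one activation sample

print(relative_activation_error(W_eff, W_prime, X))
```

Because `X` appears in both numerator and denominator, the same weight perturbation can score differently on different data, which is the point of measuring on real activations.
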

3. Route Stability

Does the router still pick the right experts?

Analogy. Do the zippers still work?

Metric. route_flip_rate ≤ 0.10

Lean. route_flip_zero_above_margin

OLMoE. 0.071 flip rate — PASS
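
One plausible way to measure `route_flip_rate` is sketched below. The text does not define a flip, so this assumes a token counts as flipped when its set of top-k experts differs between the dense and compressed routers; the function names and the toy logits are hypothetical.

```python
def top_k_experts(logits, k):
    """Indices of the k largest router logits, as an order-free set."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return frozenset(ranked[:k])

def route_flip_rate(dense_logits, student_logits, k=1):
    """Fraction of tokens whose top-k expert set changed (assumed definition)."""
    flips = sum(
        top_k_experts(d, k) != top_k_experts(s, k)
        for d, s in zip(dense_logits, student_logits)
    )
    return flips / len(dense_logits)

# Four tokens, three experts (hypothetical router logits).
dense_logits = [
    [2.0, 1.0, 0.0],
    [0.1, 0.2, 0.0],  # narrow margin between experts 0 and 1
    [1.0, 3.0, 2.0],
    [0.0, 0.5, 0.4],
]
student_logits = [
    [1.9, 1.1, 0.0],
    [0.2, 0.1, 0.0],  # the narrow-margin token now prefers expert 0
    [1.0, 2.8, 2.1],
    [0.0, 0.45, 0.42],
]

rate = route_flip_rate(dense_logits, student_logits, k=1)
print(rate, rate <= 0.10)
```

In this toy batch the only token that flips is the one whose top two logits were closest, so a small perturbation was enough to reorder them.
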

4. Task Conditioning

Does it work on all tasks, not just one?

Analogy. Does the packing work for business trips and vacations?

Metric. max_t ‖(W_eff,t − W'_t)X_t‖_F / ‖W_eff,t X_t‖_F

Lean. task_union_basis_contains_task_basis

OLMoE. 0.241 relative (worst task) — FAIL
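
The `max_t` in the Metric line matters because an average can hide one bad task. The sketch below assumes the per-task relative errors have already been computed; the task names, the error values, and the 0.10 threshold are all hypothetical, since the text does not state the gate's cutoff.

```python
def worst_task(errors_by_task):
    """Return (task, error) for the task with the largest relative error."""
    task = max(errors_by_task, key=errors_by_task.get)
    return task, errors_by_task[task]

def passes_task_gate(errors_by_task, threshold):
    """Pass only if the worst task stays within the threshold."""
    _, worst_error = worst_task(errors_by_task)
    return worst_error <= threshold

# Hypothetical per-task relative activation errors.
errors = {"code": 0.02, "math": 0.03, "chat": 0.20}
THRESHOLD = 0.10  # hypothetical cutoff, not taken from the text

mean_error = sum(errors.values()) / len(errors)
print(mean_error <= THRESHOLD)              # the average looks acceptable
print(passes_task_gate(errors, THRESHOLD))  # the worst task does not
```
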

5. Loss Gate

Does the full model still work end-to-end?

Analogy. Can you actually leave the house with it?

Metric. Δ NLL on held-out tasks ≤ 5%

Lean. one_bit_compounding_exceeds_gate

OLMoE. 0.034 Δ NLL fraction — PASS
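
A minimal sketch of the loss check follows. It assumes the "Δ NLL fraction" is the relative increase in mean negative log-likelihood over the dense model on held-out tokens, which is one reading of the Metric line rather than a confirmed definition; the log-probabilities are made up.

```python
def mean_nll(token_log_probs):
    """Mean negative log-likelihood from per-token log-probabilities."""
    return -sum(token_log_probs) / len(token_log_probs)

def delta_nll_fraction(dense_log_probs, student_log_probs):
    """Relative NLL increase of the student over the dense model (assumed)."""
    dense = mean_nll(dense_log_probs)
    student = mean_nll(student_log_probs)
    return (student - dense) / dense

def passes_loss_gate(dense_log_probs, student_log_probs, limit=0.05):
    """Gate 5 criterion from the text: Δ NLL no more than 5%."""
    return delta_nll_fraction(dense_log_probs, student_log_probs) <= limit

# Hypothetical held-out log-probabilities for four tokens.
dense_lp = [-2.0, -2.0, -2.0, -2.0]
student_lp = [-2.0, -2.1, -2.0, -2.1]

print(delta_nll_fraction(dense_lp, student_lp))
print(passes_loss_gate(dense_lp, student_lp))
```

The inputs should come from held-out data; feeding in a training shard would turn this into the disallowed proxy described in the next section.
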

Why disallowed metrics fail

Each gate has a precise metric. Looser proxies let bad ideas slip past a gate that an honest test would have closed. Common pitfalls:

| Gate | Disallowed substitute | Why it lies |
| --- | --- | --- |
| 1. Bytes | parameter count, FLOPs | Neither counts stored bytes: a scheme can ship extra residual tensors at the same precision, and the same parameter count can sit in a wider dtype. |
| 2. Activation Error | spectral norm of $W - W'$ | Spectral norm ignores the data distribution; activation-weighted error penalizes the directions that matter on $X$. |
| 3. Route Stability | averaged top-1 accuracy | An average smooths over per-token expert flips that destroy task-conditional behavior. |
| 4. Task Conditioning | single-task perplexity | Lets you tune to one task while quietly killing another. |
| 5. Loss Gate | perplexity on a training shard | Tests memorization, not the deployed task. |
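
The Gate 2 row can be made concrete with a toy case in which two candidate compressions have the same spectral norm of $W - W'$ but opposite activation-weighted errors. The sketch restricts itself to diagonal matrices, where the spectral norm is simply the largest absolute diagonal entry; all names and values are hypothetical.

```python
import math

def spectral_norm_diag(d):
    """Spectral norm of diag(d): the largest absolute diagonal entry."""
    return max(abs(v) for v in d)

def scale_rows(d, X):
    """Compute diag(d) @ X by scaling row i of X by d[i]."""
    return [[di * x for x in row] for di, row in zip(d, X)]

def frobenius(M):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in M for v in row))

def activation_error(w, w_prime, X):
    """||(W - W') X||_F / ||W X||_F for diagonal W = diag(w), W' = diag(w_prime)."""
    delta = [a - b for a, b in zip(w, w_prime)]
    return frobenius(scale_rows(delta, X)) / frobenius(scale_rows(w, X))

w = [1.0, 1.0]
drop_first = [0.0, 1.0]   # candidate A zeroes direction 0
drop_second = [1.0, 0.0]  # candidate B zeroes direction 1
X = [[3.0, 4.0], [0.0, 0.0]]  # activations live entirely in direction 0

for name, cand in [("A", drop_first), ("B", drop_second)]:
    delta = [a - b for a, b in zip(w, cand)]
    print(name, spectral_norm_diag(delta), activation_error(w, cand, X))
```

Both candidates perturb the weights by the same spectral norm, yet on this data one destroys every activation and the other changes nothing.
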

Real OLMoE-1B-7B numbers

| Gate | Metric | Value | Verdict |
| --- | --- | --- | --- |
| 1. Bytes | `global_delta_bytes < 0` | -22.4 MB | PASS |
| 2. Activation Error | `‖(W_eff − W')X‖_F / ‖W_eff X‖_F` | 0.082 relative | PASS |
| 3. Route Stability | `route_flip_rate ≤ 0.10` | 0.071 flip rate | PASS |
| 4. Task Conditioning | `max_t ‖(W_eff,t − W'_t)X_t‖_F / ‖W_eff,t X_t‖_F` | 0.241 relative (worst task) | FAIL |
| 5. Loss Gate | `Δ NLL on held-out tasks ≤ 5%` | 0.034 Δ NLL fraction | PASS |
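
Read under the all-or-nothing rule, the table implies the idea is rejected at Gate 4 even though four of five gates pass. The sketch below encodes that rule; it copies the verdicts from the table rather than recomputing them, because the thresholds for Gates 2 and 4 are not stated in the text. The function names are hypothetical.

```python
# Verdicts copied from the table; True means PASS.
GATE_VERDICTS = [
    ("Bytes", True),
    ("Activation Error", True),
    ("Route Stability", True),
    ("Task Conditioning", False),
    ("Loss Gate", True),
]

def first_failing_gate(verdicts):
    """Return the name of the first gate that fails, or None if all pass."""
    for name, passed in verdicts:
        if not passed:
            return name
    return None

def accepted(verdicts):
    """All-or-nothing: accept only when no gate fails."""
    return first_failing_gate(verdicts) is None

print(accepted(GATE_VERDICTS), first_failing_gate(GATE_VERDICTS))
```
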
