The eval framework that drives every quant we ship.
The regression check on a fresh quant has to test the workload, not the cheapest thing to run. A 50-sample GSM8K pass is a quick gut-check that math still works. It doesn't catch a quant that quietly forgot how to emit valid function-call syntax, or one that lost the long-range KV precision that makes retrieval through 12k tokens work. By the time a user notices, the quant is already in their cache.
You catch the regression with the eval that tests the workload. Not the one that's cheap to run.
So every optiq quant goes through a two-stage suite. A fast smoketest for triage, then a full benchmark run for the headline number. Six benchmarks roll up into one Capability Score on the model card. Sandboxed HumanEval execution. Auto-resolved KL reference. All from one CLI command.
Two stages: a smoketest and the full benchmarks
Quants pass through two checkpoints. The first is fast and triages; the second is slow and decides what we ship.
| Stage | Time / model | What it answers | Tasks |
|---|---|---|---|
| Smoketest | ~5 min | Did the convert work? Are we close to the reference distribution? | KL on 64 prompts × 256 tokens · GSM8K-50 (chat-templated, thinking off) |
| Benchmarks | ~2 h | How much capability did we keep across the workloads users actually run? | MMLU-1k 5-shot · GSM8K-1k · IFEval (full) · BFCL-V3 simple (200) · HumanEval (164) · HashHop (25 × 4 hops) |
The smoketest is the gate. A quant that fails it doesn't get the full benchmarks. The benchmark numbers are what end up on the model card.
The smoketest: KL + GSM8K-50
KL divergence between two language models, computed token-by-token over a small batch of held-out prompts, is a cheap signal that works well in practice. The reference is the highest-fidelity version of the model that fits on the box. The candidate is the OptIQ quant. We compute KL(reference ‖ candidate) per token, average across 64 prompts × 256 tokens, and report mean + p95.
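The per-token computation described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the optiq implementation: the function names `log_softmax`, `token_kl`, and `kl_summary` are hypothetical, the inputs are assumed to be raw next-token logits for the same positions from both models, and the p95 uses the nearest-rank convention, which the post does not specify.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(ref_logits, cand_logits):
    """KL(reference || candidate) for a single token position, in nats."""
    ref_logp = log_softmax(ref_logits)
    cand_logp = log_softmax(cand_logits)
    return sum(math.exp(r) * (r - c) for r, c in zip(ref_logp, cand_logp))


def kl_summary(ref_positions, cand_positions):
    """Mean and nearest-rank p95 of per-token KL.

    Each argument is a flat list of logit vectors, one per token position
    (for the smoketest that would be 64 prompts x 256 tokens, flattened).
    """
    kls = sorted(token_kl(r, c) for r, c in zip(ref_positions, cand_positions))
    mean = sum(kls) / len(kls)
    rank = max(1, math.ceil(0.95 * len(kls)))  # nearest-rank percentile
    return mean, kls[rank - 1]
```

Reporting the p95 next to the mean is what exposes a quant that matches the reference on most tokens but diverges badly on a few.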
The auto-resolver picks the reference automatically:
    # Pick the highest-fidelity reference that fits on the box.
    bf16_gb = hf_repo_size_gb(strip_quant_suffix(model_id))
    avail_gb = psutil.virtual_memory().available / 1024**3
    if bf16_gb <= 0.70 * avail_gb:
        return "bf16", strip_quant_suffix(model_id)
    # bf16 doesn't fit; fall back to the uniform-4-bit MLX baseline.
    return "uniform_4bit", uniform_4bit_repo(model_id)
The smoketest sweep raised two flags that the GSM8K-50 number alone wouldn't have surfaced: KL came out 20× higher on both the Gemma-4 26B-A4B MoE and the Gemma-4 31B dense model than on Qwen3.5-27B at comparable size, even though both posted healthy GSM8K-50 numbers. KL catches calibration regressions that single-task accuracy misses.
The smoketest is also how we decide when to invest two hours of compute in the full benchmark run. If a fresh convert fails KL, the bit allocation went bad and rerunning the full suite is wasted machine time.
The benchmarks: six metrics
The benchmark suite is what ends up on the model card. Each task targets a different capability slice, so a quant that quietly breaks one of them can't hide behind the others.
- MMLU: 5-shot, stratified across the 57 subjects, 1000 samples. Encyclopedic knowledge after instruction-tuning. The bf16 anchor.
- GSM8K: 1000 samples, 3-shot CoT, chat-templated, enable_thinking=False for reasoning models. Multi-step arithmetic.
- IFEval: full Google IFEval set with all 25+ constraint verifiers. Measures whether the model can follow detailed format / length / capitalization / inclusion-exclusion instructions. We report strict (the standard, harder metric).
- BFCL-V3 simple: 200 single-turn function-calls with AST equivalence scoring. Whether the model can emit a syntactically valid call and pick the right tool from a small candidate set.
- HumanEval: all 164 problems, sandboxed Python execution, pass@1 only.
- HashHop Long-Context Evaluation: 25 instances at each of hops ∈ {1, 2, 3, 4} (100 total) at ~12 k context. Multi-hop key→value retrieval through a chain of N hash assignments. The model has to walk the chain through the long context and surface the terminal hash.
Run from the CLI as a single task:
optiq eval ./optiq_mixed --task all --score
Each individual task is also addressable (--task mmlu, --task ifeval, --task hashhop, etc.) for when you only need one number.
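To make the AST equivalence scoring mentioned for BFCL-V3 simple concrete, here is a rough sketch of the idea using Python's standard `ast` module. It is an illustration under simplifying assumptions, not the BFCL scorer or the optiq one: the helper names `parse_call` and `calls_match` are hypothetical, only keyword arguments with literal values are handled, and the real benchmark allows more than one acceptable value per argument.

```python
import ast


def parse_call(source):
    """Return (function_name, kwargs) for a single call expression, else None."""
    try:
        node = ast.parse(source.strip(), mode="eval").body
    except SyntaxError:
        return None  # not syntactically valid
    if not isinstance(node, ast.Call) or node.args:
        return None  # this sketch only accepts keyword-argument calls
    try:
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except (ValueError, TypeError, SyntaxError):
        return None  # an argument was not a plain literal
    if None in kwargs:
        return None  # **kwargs unpacking is out of scope here
    return ast.unparse(node.func), kwargs


def calls_match(predicted, expected):
    """True when both strings parse to the same function and arguments."""
    parsed = parse_call(predicted)
    return parsed is not None and parsed == parse_call(expected)
```

Comparing parsed structure instead of raw strings means argument order and quoting style don't count against the model, while a missing parenthesis or a wrong tool name still does.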
Sandboxing HumanEval
HumanEval requires actually executing the model's generated Python against a unit-test harness. Doing that on the user's machine with no isolation is a footgun. A model that emits os.system("rm -rf …") ruins someone's afternoon. The sandbox helper falls through three tiers:
- apple/container: when present, runs each candidate inside a fresh container with no network, no filesystem mount outside /tmp, and a wall-clock timeout. Hardest isolation, slowest start.
- sandbox-exec: macOS native, when /usr/bin/sandbox-exec is available. Subprocess with a tight seatbelt profile (no network, deny file-write outside /tmp). Fast.
- subprocess + rlimit: universal fallback. Spawn a Python child with RLIMIT_AS, RLIMIT_CPU, RLIMIT_FSIZE caps and a process-group timeout. No filesystem isolation; exists so the eval doesn't simply fail to run on Linux CI.
The helper picks the strictest tier available at runtime. Reported pass@1 is identical across tiers because the test harness is deterministic. Only the blast radius of malicious code changes.
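The last tier, subprocess + rlimit, is the easiest one to sketch. What follows is a minimal illustration assuming a Linux or other POSIX host, not the optiq helper itself: the function name `run_with_rlimits` and all default limits are hypothetical, and `code` is assumed to be the candidate solution concatenated with its unit tests.

```python
import os
import resource
import signal
import subprocess
import sys


def run_with_rlimits(code, timeout_s=10.0, mem_bytes=2 << 30, cpu_s=5,
                     fsize_bytes=1 << 20):
    """Run `code` in a child Python under rlimit caps. Returns (passed, detail)."""

    def apply_limits():
        # Runs in the child just before exec. Values are clamped to the
        # existing hard limit because a process cannot raise its own hard limit.
        for kind, value in ((resource.RLIMIT_AS, mem_bytes),
                            (resource.RLIMIT_CPU, cpu_s),
                            (resource.RLIMIT_FSIZE, fsize_bytes)):
            _, hard = resource.getrlimit(kind)
            if hard != resource.RLIM_INFINITY:
                value = min(value, hard)
            resource.setrlimit(kind, (value, value))

    proc = subprocess.Popen(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True,
        preexec_fn=apply_limits,
        start_new_session=True,  # child leads its own process group
    )
    try:
        _, err = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)  # kill the whole group
        except ProcessLookupError:
            pass  # group already gone
        proc.communicate()
        return False, "timeout"
    return proc.returncode == 0, err
```

As the tier description says, this caps memory, CPU time, and file size but gives no filesystem or network isolation, so it limits runaway code more than hostile code.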
HashHop Long-Context Evaluation
The first five benchmarks are short-context. MMLU prompts are a few hundred tokens. BFCL function-call prompts are similar. None of them push the KV cache out past 2k tokens, so a quant that silently lost long-context attention precision can still post strong numbers across all five.
HashHop is the one that catches it. Each instance is a dictionary of N hash assignments shaped like h0 = h1, h1 = h2, … h(N-1) = 'hN', scattered among thousands of unrelated chains and serialized into one large prompt. The model gets the starting hash and has to walk the chain through ~12k tokens of context, surface the terminal 16-character hash, and emit nothing else. The chance of guessing a 16-character alphabetic hash at random is effectively zero (52^-16), so accuracy maps cleanly to how reliably the model is using its KV cache to retrieve the right key at each hop.
We sample 25 instances at each of hops ∈ {1, 2, 3, 4}, 100 instances total. Easy hops (1) catch coarse retrieval breakage; deep hops (4) stress compounding attention error across many heads × many tokens.
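A toy generator makes the instance structure easier to picture. This is a sketch under assumptions, not the optiq task code: the names `make_instance`, `walk_chain`, and `is_correct` are hypothetical, the one-assignment-per-line serialization is a guess at the prompt layout, and the number of distractor chains needed to reach ~12k tokens depends on the tokenizer.

```python
import random
import string


def random_hash(rng, length=16):
    """A 16-character alphabetic hash drawn from the 52 ASCII letters."""
    return "".join(rng.choice(string.ascii_letters) for _ in range(length))


def make_instance(hops, distractor_chains, seed=0):
    """Return (prompt, start_hash, terminal_hash) for one instance."""
    rng = random.Random(seed)
    # Chain 0 is the target; the others are unrelated distractors.
    chains = [[random_hash(rng) for _ in range(hops + 1)]
              for _ in range(distractor_chains + 1)]
    assignments = [f"{src} = {dst}"
                   for chain in chains
                   for src, dst in zip(chain, chain[1:])]
    rng.shuffle(assignments)  # scatter the target chain among the rest
    return "\n".join(assignments), chains[0][0], chains[0][-1]


def walk_chain(prompt, start, hops):
    """Ground-truth solver: follow `hops` assignments starting at `start`."""
    table = dict(line.split(" = ") for line in prompt.splitlines())
    current = start
    for _ in range(hops):
        current = table[current]
    return current


def is_correct(model_output, terminal):
    """Exact match on the terminal hash, ignoring surrounding whitespace."""
    return model_output.strip() == terminal
```

Because every link of the target chain sits at a random position in the prompt, each extra hop is another independent retrieval the model has to get exactly right.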
Why this matters specifically for mixed-precision: uniform 4-bit quantization of weights erodes long-range attention precision because small numerical errors compound across many heads × many tokens. Per-layer mixed precision (some layers 4-bit, sensitive ones 8-bit) preserves the layers retrieval depends on, so HashHop deltas vs uniform 4-bit are typically the largest single delta in the benchmark suite. On gemma-4-31B it's +22 pp over uniform 4-bit; on the 26B-A4B MoE it's +11 pp.
The Capability Score
Six percentages are hard to compare side-by-side. We want one number that answers "which quant is more capable on average?" And we want a formula the reader can audit, not a hidden value judgement dressed up as math.
The simplest one that meets that bar:
Capability_Score = mean(MMLU, GSM8K, IFEval, BFCL, HumanEval, HashHop)
We tried a weighted formula first. Something like MMLU + 0.3 × IFEval + 0.5 × BFCL − 5 × disk_GB. It looked clever. It also embedded our quality/disk tradeoff in a way users can't see, and it could turn a +1 pp capability win into a "loss" if the disk grew by half a gigabyte. That's a recommendation, not a measurement.
So we stripped it down. The six benchmarks each get an equal vote. disk_gb is reported next to the score as an unweighted second axis, and the reader picks their own tradeoff. If you're optimizing for an 8 GB Mac, smaller wins. If you're on a 64 GB Studio, larger probably wins. The score doesn't pretend to know.
Three properties worth flagging. (1) GSM8K and MMLU both vote, because in practice they disagree often enough on quants that letting both vote catches regressions either one alone would miss. (2) HumanEval votes, which means a quant that breaks code generation can't hide behind strong instruction-following. (3) HashHop votes, which means a quant that holds up at 2k context but breaks at 12k can't claim parity by averaging only short-context numbers.
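Written out as code, the score is just an unweighted mean with disk size carried alongside it. The snippet below is an illustrative sketch: `capability_score` is a hypothetical helper name, and the numbers in the example are made-up placeholders, not measured results.

```python
from statistics import fmean

BENCHMARKS = ("MMLU", "GSM8K", "IFEval", "BFCL", "HumanEval", "HashHop")


def capability_score(scores):
    """Unweighted mean of the six benchmark percentages."""
    missing = [name for name in BENCHMARKS if name not in scores]
    if missing:
        raise KeyError(f"missing benchmark scores: {missing}")
    return fmean(scores[name] for name in BENCHMARKS)


# Placeholder numbers for illustration only; not real results.
example = {"MMLU": 70.0, "GSM8K": 80.0, "IFEval": 75.0,
           "BFCL": 85.0, "HumanEval": 60.0, "HashHop": 50.0}
report = {"capability_score": capability_score(example), "disk_gb": 5.0}
```

Refusing to compute a score when a benchmark is missing keeps a five-task average from being reported as if it were the six-task number.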
Picking the KL reference
One technical note that took us a few iterations to get right.
For models that fit in RAM (everything ≤ ~10 B at bf16 on a 36 GB Mac), the KL reference is unambiguous: it's the bf16 model itself. For 27 B+, bf16 doesn't fit, and you need a substitute reference that's still strictly higher-fidelity than the candidate. The community's uniform-4-bit MLX publish of the same model is exactly that: same architecture and weights modulo quantization noise, just at uniform 4-bit (no per-layer mixed precision).
The auto-resolver picks bf16 when the bf16 repo fits within 70% of available RAM, and falls back to uniform 4-bit otherwise. The fallback decision was originally driven by a crude params × 2 bytes size estimate, which under-counted gemma-4-26B-A4B's MoE expert tensors and tried to load 110 GB of bf16 into 36 GB of RAM. Now we call HfApi.model_info() and sum the actual safetensors shard sizes. The size is exact and the OOM is gone.
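The shard-summing step might look roughly like the following. The post names `hf_repo_size_gb` and `HfApi.model_info()`, but the body here is an assumption about how they fit together, and `safetensors_size_gb` is a hypothetical helper split out so the arithmetic can be checked offline. The sketch relies on `model_info(..., files_metadata=True)` populating each sibling's `rfilename` and `size`.

```python
from types import SimpleNamespace


def safetensors_size_gb(siblings):
    """Sum the sizes of all .safetensors files in a repo listing, in GiB."""
    total_bytes = sum(s.size or 0 for s in siblings
                      if s.rfilename.endswith(".safetensors"))
    return total_bytes / 1024**3


def hf_repo_size_gb(repo_id):
    """Actual on-disk weight size of a Hugging Face repo (needs network)."""
    # Imported lazily so the pure helper above works without huggingface_hub.
    from huggingface_hub import HfApi

    info = HfApi().model_info(repo_id, files_metadata=True)
    return safetensors_size_gb(info.siblings)


# Offline check with stand-in file entries; names and sizes are fake.
fake_listing = [
    SimpleNamespace(rfilename="model-00001-of-00002.safetensors", size=1024**3),
    SimpleNamespace(rfilename="model-00002-of-00002.safetensors", size=1024**3),
    SimpleNamespace(rfilename="config.json", size=1234),
]
fake_size_gb = safetensors_size_gb(fake_listing)
```

Reading sizes from the file listing sidesteps the question of how many parameters are active per token, which is where a parameter-count estimate goes wrong for MoE checkpoints.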
Reproducing
Everything in this post runs from the CLI. No special setup beyond pip install mlx-optiq:
    # Fast smoketest (KL + GSM8K-50)
    optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task smoketest

    # Full benchmarks (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop + Score)
    optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task all --score

    # Single tasks if you only need one number
    optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task bfcl
    optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task humaneval
    optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task hashhop

    # Custom reference for KL (skip auto-resolver)
    optiq eval ./my-quant --task kl --reference-model Qwen/Qwen3.5-9B --reference-mode bf16
Every task above is callable on its own. Pick the one you need with optiq eval --task <name>.
— the mlx-optiq team