A family of OptIQ-4bit quants. Ready to load.
Each model is a standard MLX checkpoint. Load it with mlx_lm.load(...), no special runtime. Sensitivity-driven mixed-precision quantization recovers what uniform 4-bit drops, especially at the smaller end where every layer counts.
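To make "sensitivity-driven mixed-precision" concrete, here is a minimal sketch of one way such an allocation could work: start every layer at 4-bit, then upgrade the layers with the highest sensitivity per parameter to 8-bit until a size budget is used up. This is an illustration only, not the mlx-optiq algorithm. The function name, the layer names, the sensitivity scores, and the budget are all hypothetical.

```python
def allocate_bits(layers, budget_total_bits):
    """Greedy 4/8-bit allocation under a total size budget (illustrative only).

    layers: list of (name, n_params, sensitivity), where sensitivity is a
    hypothetical score for how much quality the layer loses at 4-bit.
    """
    bits = {name: 4 for name, _, _ in layers}
    used = sum(4 * n for _, n, _ in layers)

    # Rank by sensitivity per parameter: the most quality recovered per extra bit spent.
    ranked = sorted(layers, key=lambda layer: layer[2] / layer[1], reverse=True)
    for name, n_params, _ in ranked:
        extra = 4 * n_params  # cost of moving this layer from 4-bit to 8-bit
        if used + extra <= budget_total_bits:
            bits[name] = 8
            used += extra

    total_params = sum(n for _, n, _ in layers)
    return bits, used / total_params


# Hypothetical toy model: (layer name, parameter count, sensitivity score).
layers = [
    ("embed", 100, 0.9),
    ("attn.0", 50, 0.6),
    ("mlp.0", 250, 0.1),
    ("lm_head", 100, 0.2),
]
bits, avg_bits = allocate_bits(layers, budget_total_bits=2600)
print(bits, avg_bits)
```

In this toy run the two layers with the highest sensitivity per parameter get 8 bits and the bulk of the weights stay at 4, which is why a "4-bit" mixed-precision quant averages somewhat more than 4 bits per weight.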
Quants built with mlx-optiq have been downloaded 140,000+ times in the last month across Hugging Face, between our mlx-community line and developers publishing their own OptIQ quants.
Diffusion LLM: OptIQ's first non-autoregressive family.
A new category. Diffusion language models don't decode left-to-right; they iteratively un-mask a block of tokens over a handful of denoising steps. The family has two members at opposite extremes: Google's DiffusionGemma-26B-A4B (block-diffusion, 128-expert MoE, image-text) at the frontier, and dhara-250m (tri-mode: AR + diffusion + self-speculation), a 250M model small enough to fine-tune on a laptop. OptIQ ships native, dependency-free decoders for both. For DiffusionGemma the quant lands a higher Capability Score than the published 4-bit quant, on a smaller artifact; for dhara, the benchmarks show the 4-bit quant matching bf16.
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs published 4-bit |
|---|---|---|---|---|---|
| diffusiongemma-26B-A4B-it-OptiQ-4bit | 51,562 MB | 14,000 MB | 3.7× | 59.90 | +0.07 |
| dhara-250m-OptiQ-4bit | 460 MB | 170 MB | 2.7× | 8.54 | ≈ bf16 |
mlx-lm/mlx-vlm; OptIQ ships a vendored, dependency-free decoder for it (mlx-optiq ≥ 0.2.3). Text + image generation and LoRA fine-tuning (a diffusion-native denoising loss) both work; optiq serve runs it with the fast confidence-threshold sampler by default (4.6–5× faster than the model's default). MTP / speculative drafting and KV-cache quant don't apply — the model is non-autoregressive and the parallel canvas un-masking is the native analog.
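The idea behind a confidence-threshold sampler can be sketched with a toy loop: at each denoising step, every masked position whose prediction clears a threshold is committed in parallel, so easy tokens land early and the step count falls well below the token count. This is a simplified illustration, not the decoder OptIQ ships. `toy_denoiser`, its confidence formula, and the threshold value are hypothetical stand-ins for a real model forward pass.

```python
MASK = "<mask>"
TARGET = ["the", "cat", "sat", "on", "the", "mat"]


def toy_denoiser(canvas):
    """Hypothetical stand-in for one forward pass: (token, confidence %) per position."""
    n_known = sum(tok != MASK for tok in canvas)
    preds = []
    for i, tok in enumerate(TARGET):
        # Toy rule: confidence rises with revealed context; earlier positions are easier.
        conf = min(100, 50 + 10 * n_known + 8 * (len(TARGET) - i))
        preds.append((tok, conf))
    return preds


def unmask(canvas, threshold=90, max_steps=10):
    """Commit every masked position whose confidence clears the threshold, step by step."""
    canvas = list(canvas)
    steps = 0
    while MASK in canvas and steps < max_steps:
        preds = toy_denoiser(canvas)
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        accept = [i for i in masked if preds[i][1] >= threshold]
        if not accept:
            # Always make progress: commit the single most confident position.
            accept = [max(masked, key=lambda i: preds[i][1])]
        for i in accept:
            canvas[i] = preds[i][0]
        steps += 1
    return canvas, steps


result = unmask([MASK] * 6)
print(result)
```

With these toy numbers, six tokens are filled in three steps, two per step.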
optiq serve, KV-quant — works. The recommended decode mode is self-speculation (--mtp): it drafts a block and AR-verifies it (two forwards per round), so output is identical to autoregressive decode but faster (~1.4× on M3 Max decoding greedily, several tokens per round). The model is overhead-bound, so 4-bit and bf16 decode at the same speed — the quant's win is size, not speed. Benchmark scores sit at the 250M floor and are preserved intact by quantization.
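Why draft-and-verify output is identical to autoregressive decode can be shown with a toy sketch: the verifier always emits the token greedy AR decoding would have chosen, and a draft only decides how many of those tokens are accepted per round. This is an illustration under simplifying assumptions, not the OptIQ implementation. `ar_next` and `draft_block` are hypothetical stand-ins, and a real verifier would score the whole draft in one batched forward pass rather than a Python loop.

```python
def ar_next(prefix):
    """Hypothetical greedy AR step: a deterministic function of the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 7


def draft_block(prefix, k):
    """Hypothetical drafter: proposes k tokens at once and is deliberately wrong sometimes."""
    ctx, out = list(prefix), []
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 4 == 3:  # injected drafting error
            tok = (tok + 1) % 7
        out.append(tok)
        ctx.append(tok)
    return out


def speculative_decode(prompt, n_tokens, k=4):
    seq, rounds = list(prompt), 0
    while len(seq) - len(prompt) < n_tokens:
        draft = draft_block(seq, k)  # forward 1: draft a block
        rounds += 1
        for tok in draft:            # forward 2: AR verification of the block
            expected = ar_next(seq)
            seq.append(expected)     # always emit the AR token
            if tok != expected:
                break                # first mismatch: keep the correction, drop the rest
    return seq[len(prompt):len(prompt) + n_tokens], rounds


def ar_decode(prompt, n_tokens):
    """Plain greedy autoregressive decode, one token per step, for comparison."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(ar_next(seq))
    return seq[len(prompt):]


tokens, rounds = speculative_decode([1, 2], n_tokens=12)
print(tokens == ar_decode([1, 2], 12), rounds)
```

In this toy run, 12 tokens are produced in 4 rounds instead of 12 single-token steps, and the output matches plain greedy decoding exactly.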
Nemotron 3: NVIDIA's Mamba-attention hybrid.
NVIDIA's Nemotron 3 Nano interleaves Mamba2 state-space blocks with a handful of full-attention layers, and the larger model adds a 128-expert sparse MoE. The 30B-A3B (≈3 B active per token) is the standout: OptIQ assigns per-layer 4/8-bit across the fused routed experts and clears uniform 4-bit by a full +2.0 Capability Score, winning or tying all six benchmarks. The dense 4B is a smaller, tighter win.
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4 |
|---|---|---|---|---|---|
| NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit | 63,236 MB | 21,043 MB | 3.0× | 69.15 | +2.02 |
| NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit | 7,947 MB | 2,938 MB | 2.7× | 63.60 | +0.24 |
kv_config.json covering just those attention layers (three at 4-bit, one at 8-bit). Point optiq serve --kv-config kv_config.json at it. optiq kv-cache gained NemotronH support in v0.1.5 — earlier versions raised ZeroDivisionError on this architecture.
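As a rough back-of-the-envelope check on what a "three at 4-bit, one at 8-bit" split buys, the sketch below computes the average KV precision and the reduction against a 16-bit cache. It assumes the four attention layers hold equally sized KV caches and ignores per-group scales and biases, so real savings will be somewhat lower. The function name is hypothetical, and this is not the kv_config.json schema.

```python
def kv_average_bits(layer_bits):
    """Average bits per KV element, assuming equally sized per-layer caches."""
    return sum(layer_bits) / len(layer_bits)


layer_bits = [4, 4, 4, 8]   # three attention layers at 4-bit, one at 8-bit
avg = kv_average_bits(layer_bits)
reduction = 16 / avg        # relative to a 16-bit KV cache
print(avg, reduction)
```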
MiniCPM5: a 1B that punches above its weight.
OpenBMB's 1.08B-parameter Llama-architecture base, fully Apache-2.0. Hybrid-reasoning chat template with an enable_thinking flag. In non-thinking mode (the OptIQ benchmark recipe) it posts 52% MMLU, 65% IFEval, and 58% HumanEval on a model that weighs less than a gigabyte on disk. The OptIQ-4bit quant beats stock uniform-4 by 12 points on HumanEval and rescues HashHop from a 0% floor — same sensitivity-aware allocation story as the larger families.
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4 |
|---|---|---|---|---|---|
| MiniCPM5-1B-OptiQ-4bit | 2,062 MB | 875 MB | 2.4× | 30.28 | +4.44 |
<think> reasoning mode. Pass chat_template_kwargs={"enable_thinking": true} to wake it up; expect substantially higher math/tool scores in that mode. OptIQ's benchmark recipe forces thinking off for cross-family comparability, so the table reflects fast-assistant performance.
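A minimal sketch of where the flag might sit in a request body follows. It assumes the server accepts an OpenAI-style chat-completions JSON payload with a `chat_template_kwargs` field, which this page does not confirm. The repo name and the prompt are illustrative. In Python the value is `True`; it serializes to the JSON `true` shown above.

```python
import json

# Assumed request shape: OpenAI-style chat completions body (not confirmed by this page).
payload = {
    "model": "mlx-community/MiniCPM5-1B-OptiQ-4bit",  # illustrative repo name
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "chat_template_kwargs": {"enable_thinking": True},  # Python True -> JSON true
}

body = json.dumps(payload)
print(body)
```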
Gemma-4: Google's instruct series.
Two small dense (e2b, e4b), the new 12 B (the unified text+vision Gemma-4, now with image input), and two large (31 B dense, 26 B-A4B sparse-MoE). Mixed-precision recovery is dramatic — gemma-4-e4b posts a +13.6-point Capability Score gain over uniform 4-bit, and the 12 B adds +6.4. Pair e4b or 31B with their matching -assistant-bf16 drafter for speculative decoding. All five also take image input through a bf16 vision sidecar; see the vision guide.
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4 |
|---|---|---|---|---|---|
| gemma-4-e2b-it-OptiQ-4bit | 9,216 MB | 4,098 MB | 2.2× | 53.21 | +2.12 |
| gemma-4-e4b-it-OptiQ-4bit | 14,336 MB | 6,231 MB | 2.3× | 65.84 | +13.57 |
| gemma-4-12B-it-OptiQ-4bit | 22,811 MB | 8,449 MB | 2.7× | 68.23 | +6.40 |
| gemma-4-31B-it-OptiQ-4bit | 63,488 MB | 21,328 MB | 3.0× | 79.69 | +3.47 |
| gemma-4-26B-A4B-it-OptiQ-4bit | 53,248 MB | 16,813 MB | 3.2× | 72.68 | +3.06 |
kv_config.json from a real sensitivity-analysis pass. Point optiq serve --kv-config kv_config.json at it. The runtime fills in for upstream mlx-lm's RotatingKVCache.to_quantized (which raises NotImplementedError in v0.1.2 and earlier) via optiq.runtime.kv.RotatingQuantizedKVCache, plus a small SDPA dispatch patch for Gemma-4's KV-sharing layers. The model still loads fine without the config (stock fp16 KV).
Qwen3.6: frontier-class reasoning.
Qwen3.6 in two configurations: a dense 27 B and a 256-expert MoE with 3 B active per token. Both quantized with the same mlx-optiq pass; both ship a bundled MTP head for ~1.4× decode via optiq serve --mtp. Both beat uniform 4-bit on the six-metric Capability Score, and both take image input via a bf16 vision sidecar (vision guide).
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4 |
|---|---|---|---|---|---|
| Qwen3.6-27B-OptiQ-4bit | 57,344 MB | 17,876 MB | 3.2× | 82.96 | +0.46 |
| Qwen3.6-35B-A3B-OptiQ-4bit | 73,728 MB | 22,679 MB | 3.3× | 76.78 | +1.12 |
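The derived columns in these tables can be reproduced from the raw ones. The sketch below recomputes the compression ratio and the uniform 4-bit score implied by the Δ column, using the two Qwen3.6 rows. It assumes Δ is simply the OptIQ score minus the uniform-4 score, which the column header suggests but the page does not spell out.

```python
# (model, bf16 MB, mlx-optiq MB, Capability Score, delta vs uniform-4), copied from the table
rows = [
    ("Qwen3.6-27B-OptiQ-4bit", 57344, 17876, 82.96, 0.46),
    ("Qwen3.6-35B-A3B-OptiQ-4bit", 73728, 22679, 76.78, 1.12),
]


def derive(bf16_mb, quant_mb, score, delta):
    """Return (compression ratio, implied uniform-4 score) for one table row."""
    compression = round(bf16_mb / quant_mb, 1)
    implied_uniform4 = round(score - delta, 2)  # assumes delta = score - uniform-4 score
    return compression, implied_uniform4


derived = {name: derive(*values) for name, *values in rows}
for name, (compression, baseline) in derived.items():
    print(f"{name}: {compression}x, implied uniform-4 score {baseline}")
```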
Qwen3.5: the daily-driver series.
From 0.8 B for prompt-rewriters and toy agents up to 27 B for serious reasoning, plus a 35 B-A3B sparse MoE. All quantized with the same mlx-optiq pass; all ship a bundled MTP head for speculative decoding via optiq serve --mtp; all beat uniform 4-bit on the six-metric Capability Score. Every Qwen3.5 size also takes image input via a bf16 vision sidecar (vision guide).
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4 |
|---|---|---|---|---|---|
| Qwen3.5-0.8B-OptiQ-4bit | 1,229 MB | 620 MB | 2.0× | 36.00 | +4.27 |
| Qwen3.5-2B-OptiQ-4bit | 3,072 MB | 1,463 MB | 2.1× | 47.66 | +2.12 |
| Qwen3.5-4B-OptiQ-4bit | 8,192 MB | 3,118 MB | 2.6× | 65.76 | +1.90 |
| Qwen3.5-9B-OptiQ-4bit | 18,432 MB | 6,772 MB | 2.7× | 66.77 | +0.19 |
| Qwen3.5-27B-OptiQ-4bit | 57,344 MB | 17,788 MB | 3.2× | 79.05 | +0.17 |
| Qwen3.5-35B-A3B-OptiQ-4bit | 73,728 MB | 21,603 MB | 3.4× | 74.17 | +0.42 |
optiq serve --mtp. Drop to 4 B for laptops with less RAM, step up to 27 B if you have headroom.
One snippet. Any model on this page.
Every OptIQ quant follows the same load contract. Swap the repo name; the rest stays.
```python
from mlx_lm import load, generate

# Pick any model on this page. Stock mlx-lm, no special loader needed.
model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

# Build a chat-formatted prompt string using the tokenizer's chat template.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization in 3 sentences."}],
    tokenize=False,
    add_generation_prompt=True,
)

out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)
```