mlx-optiq
Pre-built quants · Hugging Face

A family of OptIQ-4bit quants. Ready to load.

Each model is a standard MLX checkpoint. Load it with mlx_lm.load(...), no special runtime. The one exception is DiffusionGemma, which needs the decoder bundled with mlx-optiq. Sensitivity-driven mixed-precision quantization recovers what uniform 4-bit drops, especially at the smaller end where every layer counts.

Quants built with mlx-optiq were downloaded 140,000+ times on Hugging Face in the last month, counting both our mlx-community line and OptIQ quants published by other developers.

01 Diffusion LLM family · added Jun 13, 2026

Diffusion LLM: OptIQ's first non-autoregressive family.

A new category. Diffusion language models don't decode left-to-right; they iteratively un-mask a block of tokens over a handful of denoising steps. The family has two members at opposite extremes: Google's DiffusionGemma-26B-A4B (block-diffusion, 128-expert MoE, image-text) at the frontier, and dhara-250m (tri-mode: AR + diffusion + self-speculation), a 250M model small enough to fine-tune on a laptop. OptIQ ships native, dependency-free decoders for both. For DiffusionGemma it lands a slightly higher Capability Score than the published 4-bit quant (+0.07) on a smaller artifact; for dhara, the benchmarks show 4-bit scoring on par with bf16.

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs published 4-bit |
|---|---|---|---|---|---|
| diffusiongemma-26B-A4B-it-OptiQ-4bit | 51,562 MB | 14,000 MB | 3.7× | 59.90 | +0.07 |
| dhara-250m-OptiQ-4bit | 460 MB | 170 MB | 2.7× | 8.54 | ≈ bf16 |
Diffusion specifics: DiffusionGemma is not loadable by stock mlx-lm/mlx-vlm; OptIQ ships a vendored, dependency-free decoder for it (mlx-optiq ≥ 0.2.3). Text + image generation and LoRA fine-tuning (with a diffusion-native denoising loss) both work. optiq serve runs it with the fast confidence-threshold sampler by default (4.6–5× faster than the model's default sampler). MTP / speculative drafting and KV-cache quantization don't apply: the model is non-autoregressive, and parallel canvas un-masking is its native analog.
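To make the un-masking idea concrete, here is a minimal toy sketch of a confidence-threshold loop over one block. It is not OptIQ's decoder: `toy_denoiser`, the `threshold` value, and the "always commit the most confident position" fallback are assumptions made for illustration, and the "model" just emits random tokens and confidences.

```python
import random

MASK = -1  # sentinel for a position that has not been un-masked yet


def toy_denoiser(canvas, vocab_size, rng):
    """Hypothetical stand-in for one model forward pass.

    Returns a (token, confidence) guess for every position in the block.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in canvas]


def unmask_block(block_len=8, vocab_size=100, threshold=0.7, seed=0):
    """Fill a fully masked block by committing confident positions each step."""
    rng = random.Random(seed)
    canvas = [MASK] * block_len
    steps = 0
    while MASK in canvas:
        preds = toy_denoiser(canvas, vocab_size, rng)
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        # Commit every masked position whose confidence clears the threshold.
        accept = [i for i in masked if preds[i][1] >= threshold]
        if not accept:
            # Guarantee progress: commit the single most confident position.
            accept = [max(masked, key=lambda i: preds[i][1])]
        for i in accept:
            canvas[i] = preds[i][0]
        steps += 1
    return canvas, steps


canvas, steps = unmask_block()
print(f"filled {len(canvas)} positions in {steps} denoising steps")
```

Because several positions can be committed in one step, the number of forward passes is at most the block length and usually smaller, which is where the parallel speedup over token-by-token decoding comes from.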
dhara-250m specifics: A tri-mode tiny model built as a base to fine-tune, like Gemma-270M. Its mlx-native port registers with mlx-lm, so the whole OptIQ pipeline (convert, LoRA, eval, optiq serve, KV-quant) works. The recommended decode mode is self-speculation (--mtp): it drafts a block and AR-verifies it (two forwards per round), so output is identical to autoregressive decode but faster (~1.4× on M3 Max with greedy decoding, several tokens per round). The model is overhead-bound, so 4-bit and bf16 decode at the same speed; the quant's win is size, not speed. Benchmark scores sit at the 250M floor and are preserved intact by quantization.
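The claim that draft-then-verify output matches plain autoregressive decode can be checked on a toy example. The sketch below is an illustration under stated assumptions, not dhara's implementation: `target_next` is a made-up deterministic "model", `draft_block` is a made-up drafter that is deliberately wrong at every third position, and verification is looped here where a real model would use one batched forward.

```python
VOCAB = 50


def target_next(prefix):
    """Toy deterministic stand-in for one greedy autoregressive step."""
    return (31 * sum(prefix) + len(prefix)) % VOCAB


def draft_block(prefix, k):
    """Toy drafter: agrees with the target except at every third position."""
    ctx, out = list(prefix), []
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 3 == 0:  # inject a deliberate drafting error
            tok = (tok + 1) % VOCAB
        out.append(tok)
        ctx.append(tok)
    return out


def ar_decode(prompt, n):
    """Reference: plain greedy decode, one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def self_spec_decode(prompt, n, k=4):
    """Draft k tokens, keep the verified prefix, fix the first miss."""
    seq, rounds = list(prompt), 0
    while len(seq) - len(prompt) < n:
        draft = draft_block(seq, k)      # forward 1: draft a block
        for tok in draft:                # forward 2: verify the block
            expected = target_next(seq)
            seq.append(expected)         # always keep the verified token
            if tok != expected:
                break                    # drop the rest of the draft
        rounds += 1
    return seq[len(prompt):len(prompt) + n], rounds


prompt = [1, 2, 3]
spec_out, rounds = self_spec_decode(prompt, 12)
print(spec_out == ar_decode(prompt, 12), rounds)
```

Every token appended is the one the verifier would have produced itself, so the output cannot diverge from autoregressive decode; the drafter only affects how many tokens land per round.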
Diffusion LLM getting-started guide →

02 Nemotron 3 family · added Jun 3, 2026

Nemotron 3: NVIDIA's Mamba-attention hybrid.

NVIDIA's Nemotron 3 Nano interleaves Mamba2 state-space blocks with a handful of full-attention layers, and the larger model adds a 128-expert sparse MoE. The 30B-A3B (≈3 B active per token) is the standout: OptIQ assigns per-layer 4/8-bit across the fused routed experts and clears uniform 4-bit by +2.0 Capability Score points, winning or tying all six benchmarks. The dense 4B is a smaller, tighter win.

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4 |
|---|---|---|---|---|---|
| NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit | 63,236 MB | 21,043 MB | 3.0× | 69.15 | +2.02 |
| NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit | 7,947 MB | 2,938 MB | 2.7× | 63.60 | +0.24 |
Hybrid KV cache: Only the four full-attention layers carry a KV cache; the Mamba2 blocks keep recurrent state instead. Each repo ships a kv_config.json covering just those attention layers (three at 4-bit, one at 8-bit). Point optiq serve --kv-config kv_config.json at it. optiq kv-cache gained NemotronH support in v0.1.5; earlier versions raised ZeroDivisionError on this architecture.
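As a rough worked example of what "three at 4-bit, one at 8-bit" buys, the sketch below compares that allocation with fp16 across the four attention layers. The head count, head dimension, and context length are hypothetical placeholders (they cancel out of the ratio), the per-group scale and offset overhead of real quantized caches is ignored, and which layer gets 8-bit is not stated in the text, so the list order is arbitrary.

```python
def mean_kv_bits(bits_per_layer):
    """Average bits per cached element across the KV-carrying layers."""
    return sum(bits_per_layer) / len(bits_per_layer)


def kv_bytes(bits_per_layer, tokens, kv_heads, head_dim):
    """Bytes for keys + values, ignoring quantization scales and offsets."""
    elems_per_token = 2 * kv_heads * head_dim  # one key and one value vector
    return sum(tokens * elems_per_token * bits / 8 for bits in bits_per_layer)


mixed_bits = [4, 4, 4, 8]   # three layers at 4-bit, one at 8-bit
fp16_bits = [16, 16, 16, 16]

# Hypothetical shape values, chosen only to make the arithmetic concrete.
tokens, kv_heads, head_dim = 65_536, 8, 128

mixed = kv_bytes(mixed_bits, tokens, kv_heads, head_dim)
full = kv_bytes(fp16_bits, tokens, kv_heads, head_dim)
print(mean_kv_bits(mixed_bits), full / mixed)
```

The average works out to 5 bits per element, so under these simplifications the mixed cache is about 3.2× smaller than fp16.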
Nemotron 3 getting-started guide →

03 MiniCPM5 family · added May 28, 2026

MiniCPM5: a 1B that punches above its weight.

OpenBMB's 1.08B-parameter Llama-architecture base, fully Apache-2.0. Hybrid-reasoning chat template with an enable_thinking flag. In non-thinking mode (the OptIQ benchmark recipe) it posts 52% MMLU, 65% IFEval, and 58% HumanEval on a model that weighs less than a gigabyte on disk. The OptIQ-4bit quant beats stock uniform-4 by 12 points on HumanEval and rescues HashHop from a 0% floor — same sensitivity-aware allocation story as the larger families.

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4 |
|---|---|---|---|---|---|
| MiniCPM5-1B-OptiQ-4bit | 2,062 MB | 875 MB | 2.4× | 30.28 | +4.44 |
Heads up: MiniCPM5 ships a hybrid <think> reasoning mode. Pass chat_template_kwargs={"enable_thinking": true} (True when written in Python) to turn it on; expect substantially higher math/tool scores in that mode. OptIQ's benchmark recipe forces thinking off for cross-family comparability, so the table reflects fast-assistant performance.
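One way to pass the flag is in the JSON body of a chat request. The sketch below only builds and serializes such a body; that the server accepts an OpenAI-style `messages` payload with a top-level `chat_template_kwargs` field is an assumption, as is the `mlx-community/` repo id. `build_chat_request` is a hypothetical helper, not part of mlx-optiq.

```python
import json


def build_chat_request(model, user_text, enable_thinking):
    """Hypothetical helper: assemble a chat request body with the thinking flag."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        # Forwarded to the chat template; field name taken from the text above.
        "chat_template_kwargs": {"enable_thinking": bool(enable_thinking)},
    }


payload = build_chat_request(
    "mlx-community/MiniCPM5-1B-OptiQ-4bit",  # assumed repo id
    "What is 17 * 23?",
    enable_thinking=True,
)
body = json.dumps(payload)
print(body)
```

Python's `True` serializes to JSON `true`, which is why the flag appears in lowercase when written as a request body.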
MiniCPM5 getting-started guide →

04 Gemma-4 family · added Apr 25, 2026

Gemma-4: Google's instruct series.

Two small dense models (e2b, e4b), the new 12 B (the unified text+vision Gemma-4, now with image input), and two large models (31 B dense, 26 B-A4B sparse-MoE). Mixed-precision recovery is dramatic: gemma-4-e4b posts a +13.6-point Capability Score gain over uniform 4-bit, and the 12 B adds +6.4. Pair e4b or 31B with their matching -assistant-bf16 drafter for speculative decoding. All five also take image input through a bf16 vision sidecar; see the vision guide.

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4 |
|---|---|---|---|---|---|
| gemma-4-e2b-it-OptiQ-4bit | 9,216 MB | 4,098 MB | 2.2× | 53.21 | +2.12 |
| gemma-4-e4b-it-OptiQ-4bit | 14,336 MB | 6,231 MB | 2.3× | 65.84 | +13.57 |
| gemma-4-12B-it-OptiQ-4bit | 22,811 MB | 8,449 MB | 2.7× | 68.23 | +6.40 |
| gemma-4-31B-it-OptiQ-4bit | 63,488 MB | 21,328 MB | 3.0× | 79.69 | +3.47 |
| gemma-4-26B-A4B-it-OptiQ-4bit | 53,248 MB | 16,813 MB | 3.2× | 72.68 | +3.06 |
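The Compression column appears to be the bf16 size divided by the mlx-optiq size, rounded to one decimal. That reading is an inference from the numbers rather than something the page states; the short check below reproduces the Gemma-4 rows under that assumption.

```python
# (bf16 MB, mlx-optiq MB) copied from the Gemma-4 table.
GEMMA4_SIZES_MB = {
    "gemma-4-e2b-it-OptiQ-4bit": (9216, 4098),
    "gemma-4-e4b-it-OptiQ-4bit": (14336, 6231),
    "gemma-4-12B-it-OptiQ-4bit": (22811, 8449),
    "gemma-4-31B-it-OptiQ-4bit": (63488, 21328),
    "gemma-4-26B-A4B-it-OptiQ-4bit": (53248, 16813),
}


def compression(bf16_mb, quant_mb):
    """Size ratio of the bf16 checkpoint to the quantized one."""
    return bf16_mb / quant_mb


ratios = {
    name: round(compression(bf16, quant), 1)
    for name, (bf16, quant) in GEMMA4_SIZES_MB.items()
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

The ratios stay well under the 4× a pure 4-bit quant would give relative to 16-bit weights, which is consistent with some layers being kept at higher precision.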
Mixed-precision KV now works on Gemma-4 (v0.1.3+): Each Gemma-4 repo above ships a recommended kv_config.json from a real sensitivity-analysis pass. Point optiq serve --kv-config kv_config.json at it. Upstream mlx-lm's RotatingKVCache.to_quantized raises NotImplementedError, which is what blocked this in v0.1.2 and earlier; the runtime now fills in with optiq.runtime.kv.RotatingQuantizedKVCache, plus a small SDPA dispatch patch for Gemma-4's KV-sharing layers. The model still loads fine without the config (stock fp16 KV).
Gemma-4 getting-started guide →

05 Qwen3.6 family · added earlier in April 2026

Qwen3.6: frontier-class reasoning.

Qwen3.6 in two configurations: a dense 27 B and a 256-expert MoE with 3 B active per token. Both quantized with the same mlx-optiq pass; both ship a bundled MTP head for ~1.4× decode via optiq serve --mtp. Both beat uniform 4-bit on the six-metric Capability Score, and both take image input via a bf16 vision sidecar (vision guide).

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4 |
|---|---|---|---|---|---|
| Qwen3.6-27B-OptiQ-4bit | 57,344 MB | 17,876 MB | 3.2× | 82.96 | +0.46 |
| Qwen3.6-35B-A3B-OptiQ-4bit | 73,728 MB | 22,679 MB | 3.3× | 76.78 | +1.12 |
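For readers who want to see how the two score columns relate, here is a small sketch. It assumes the Capability Score is an unweighted mean of the six benchmarks (the page says only "6-benchmark mean") and that the Δ column is the OptIQ score minus the uniform-4 score. The per-benchmark numbers in `example_scores` are made up purely to exercise the function; only the two Qwen3.6 rows come from the table.

```python
from statistics import mean

BENCHMARKS = ("MMLU", "GSM8K", "IFEval", "BFCL", "HumanEval", "HashHop")


def capability_score(scores):
    """Unweighted mean over the six benchmarks (assumed weighting)."""
    return mean(scores[name] for name in BENCHMARKS)


def implied_baseline(optiq_score, delta):
    """Uniform-4 score implied by an OptIQ score and its delta column."""
    return round(optiq_score - delta, 2)


# Hypothetical per-benchmark scores, NOT measured results.
example_scores = {
    "MMLU": 80.0, "GSM8K": 90.0, "IFEval": 85.0,
    "BFCL": 70.0, "HumanEval": 75.0, "HashHop": 80.0,
}
print(capability_score(example_scores))

# Rows from the Qwen3.6 table: (Capability Score, delta vs uniform-4).
print(implied_baseline(82.96, 0.46))  # Qwen3.6-27B
print(implied_baseline(76.78, 1.12))  # Qwen3.6-35B-A3B
```

Under these assumptions the implied uniform-4 baselines are 82.50 for the 27B and 75.66 for the 35B-A3B.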
Qwen3.6 getting-started guide →

06 Qwen3.5 family · the founding lineup

Qwen3.5: the daily-driver series.

From 0.8 B for prompt-rewriters and toy agents up to 27 B for serious reasoning, plus a 35 B-A3B sparse MoE. All quantized with the same mlx-optiq pass; all ship a bundled MTP head for speculative decoding via optiq serve --mtp; all beat uniform 4-bit on the six-metric Capability Score. Every Qwen3.5 size also takes image input via a bf16 vision sidecar (vision guide).

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
| Model | bf16 size | mlx-optiq size | Compression | Capability Score | Δ vs uniform-4 |
|---|---|---|---|---|---|
| Qwen3.5-0.8B-OptiQ-4bit | 1,229 MB | 620 MB | 2.0× | 36.00 | +4.27 |
| Qwen3.5-2B-OptiQ-4bit | 3,072 MB | 1,463 MB | 2.1× | 47.66 | +2.12 |
| Qwen3.5-4B-OptiQ-4bit | 8,192 MB | 3,118 MB | 2.6× | 65.76 | +1.90 |
| Qwen3.5-9B-OptiQ-4bit | 18,432 MB | 6,772 MB | 2.7× | 66.77 | +0.19 |
| Qwen3.5-27B-OptiQ-4bit | 57,344 MB | 17,788 MB | 3.2× | 79.05 | +0.17 |
| Qwen3.5-35B-A3B-OptiQ-4bit | 73,728 MB | 21,603 MB | 3.4× | 74.17 | +0.42 |
Recommended starting point: For most users on a 36 GB Mac, the Qwen3.5-9B quant is the default. Strongest Capability-per-GB and runs at full 64 k context with mixed-precision KV. Bundled MTP head delivers ~1.4× decode via optiq serve --mtp. Drop to 4 B for laptops with less RAM, step up to 27 B if you have headroom.
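The "drop down or step up" advice can be sketched as a lookup over the table. `best_fit` is a hypothetical helper, and the weight budget is a number you choose yourself: it covers weights only, so leave room for the KV cache, the OS, and other apps rather than passing total RAM.

```python
# (mlx-optiq size in MB, Capability Score) copied from the Qwen3.5 table.
QWEN35 = {
    "Qwen3.5-0.8B-OptiQ-4bit": (620, 36.00),
    "Qwen3.5-2B-OptiQ-4bit": (1463, 47.66),
    "Qwen3.5-4B-OptiQ-4bit": (3118, 65.76),
    "Qwen3.5-9B-OptiQ-4bit": (6772, 66.77),
    "Qwen3.5-27B-OptiQ-4bit": (17788, 79.05),
    "Qwen3.5-35B-A3B-OptiQ-4bit": (21603, 74.17),
}


def best_fit(weight_budget_mb, table=QWEN35):
    """Highest-scoring quant whose weights fit the budget, or None."""
    fitting = {
        name: score
        for name, (size_mb, score) in table.items()
        if size_mb <= weight_budget_mb
    }
    return max(fitting, key=fitting.get) if fitting else None


for budget in (4_000, 8_000, 24_000):
    print(budget, best_fit(budget))
```

Ranking by score rather than size matters at the top end: with a 24,000 MB budget both the 27B and the 35B-A3B fit, and the smaller 27B has the higher Capability Score.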
Qwen3.5 getting-started guide →

07 Loading any of them

One snippet. Any model on this page.

Every OptIQ quant follows the same load contract. Swap the repo name; the rest stays.

load_any.py · python
from mlx_lm import load, generate

# Pick any repo on this page. Stock mlx-lm, no special loader needed.
# Exception: DiffusionGemma needs the decoder bundled with mlx-optiq.
model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization in 3 sentences."}],
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)
Per-family notes: Each model family has slightly different recommended sampling defaults and chat templates. See the Diffusion LLM, Nemotron 3, MiniCPM5, Gemma-4, Qwen3.6 and Qwen3.5 getting-started guides.