MLX-quantized LLMs for Apple Silicon: the OptIQ-4bit model family

01 Diffusion LLM family · added Jun 13, 2026

Diffusion LLM: OptIQ's first non-autoregressive family.

A new category. Diffusion language models don't decode left-to-right; they iteratively un-mask a block of tokens over a handful of denoising steps. The family has two members at opposite extremes: Google's DiffusionGemma-26B-A4B (block-diffusion, 128-expert MoE, image-text) at the frontier, and dhara-250m (tri-mode: AR + diffusion + self-speculation), a 250M model small enough to fine-tune on a laptop. OptIQ ships native, dependency-free decoders for both. For DiffusionGemma it lands a higher Capability Score (+0.07 over the published 4-bit) on a smaller artifact; for dhara, the 4-bit quant scores on par with bf16.

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)

Model	bf16 size	mlx-optiq size	Compression	Capability Score	Δ vs published 4-bit
diffusiongemma-26B-A4B-it-OptiQ-4bit	51,562 MB	14,000 MB	3.7×	59.90	+0.07
dhara-250m-OptiQ-4bit	460 MB	170 MB	2.7×	8.54	≈ bf16
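As a quick sanity check, the Compression column can be reproduced from the two size columns. The sketch below assumes the ratio is simply bf16 size divided by quantized size, rounded to one decimal; the helper name `compression` is illustrative and not part of mlx-optiq.

```python
# Sizes in MB, copied from the table above: (bf16 size, mlx-optiq size).
rows = {
    "diffusiongemma-26B-A4B-it-OptiQ-4bit": (51_562, 14_000),
    "dhara-250m-OptiQ-4bit": (460, 170),
}

def compression(bf16_mb: float, quant_mb: float) -> float:
    """Compression ratio, assumed to be bf16 size divided by quantized size."""
    return bf16_mb / quant_mb

# Round to one decimal, matching the precision used in the table.
ratios = {name: round(compression(*sizes), 1) for name, sizes in rows.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

The same calculation applies to every table on this page.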

Diffusion specifics: DiffusionGemma cannot be loaded by stock mlx-lm or mlx-vlm, so OptIQ ships a vendored, dependency-free decoder for it (mlx-optiq ≥ 0.2.3). Text + image generation and LoRA fine-tuning (with a diffusion-native denoising loss) both work. optiq serve runs it with the fast confidence-threshold sampler by default (4.6–5× faster than the model's default sampler). MTP / speculative drafting and KV-cache quantization don't apply: the model is non-autoregressive, and parallel canvas un-masking is its native analog.
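To make confidence-threshold un-masking concrete, here is a toy sketch in plain Python. It is not OptIQ's decoder: the "denoiser" is a hard-coded stand-in for a model forward pass, and the names `toy_denoiser`, `unmask`, `BASE_CONF`, and the 0.75 threshold are all made up for illustration. The point is only the control flow: each step commits every masked position whose confidence clears the threshold, so several tokens can land per step.

```python
MASK = None
TARGET = ["diffusion", "models", "fill", "in", "masked", "tokens", "in", "parallel"]
# Made-up per-position base confidences for the toy denoiser.
BASE_CONF = [0.9, 0.5, 0.7, 0.3, 0.8, 0.4, 0.6, 0.2]

def toy_denoiser(canvas):
    """Stand-in for one forward pass: returns (token, confidence) per position.
    Confidence rises as more of the canvas is revealed (more context)."""
    revealed = sum(tok is not MASK for tok in canvas)
    return [
        (TARGET[i], min(1.0, BASE_CONF[i] + 0.15 * revealed))
        for i in range(len(canvas))
    ]

def unmask(length, threshold=0.75, max_steps=32):
    """Iteratively commit masked positions whose confidence clears the threshold."""
    canvas = [MASK] * length
    steps = 0
    while MASK in canvas and steps < max_steps:
        preds = toy_denoiser(canvas)
        masked = [i for i, tok in enumerate(canvas) if tok is MASK]
        accept = [i for i in masked if preds[i][1] >= threshold]
        if not accept:
            # Guarantee progress: commit the single most confident position.
            accept = [max(masked, key=lambda i: preds[i][1])]
        for i in accept:
            canvas[i] = preds[i][0]
        steps += 1
    return canvas, steps

canvas, steps = unmask(len(TARGET))
print(" ".join(canvas), f"({steps} steps for {len(TARGET)} tokens)")
```

In this toy the eight-token canvas fills in fewer steps than it has tokens, which is the effect a confidence-threshold sampler is after: easy positions are committed together instead of one at a time.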

dhara-250m specifics: A tri-mode tiny model built as a base to fine-tune, like Gemma-270M. Its mlx-native port registers with mlx-lm, so the whole OptIQ pipeline (convert, LoRA, eval, optiq serve, KV-quant) works. The recommended decode mode is self-speculation (--mtp): each round drafts a block and AR-verifies it (two forward passes per round), so the output is identical to autoregressive decoding but arrives faster (~1.4× on an M3 Max with greedy decoding, accepting several tokens per round). The model is overhead-bound, so 4-bit and bf16 decode at the same speed; the quant's win is size, not speed. Benchmark scores sit at the 250M floor and are preserved by quantization.
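The draft-then-verify loop described above can be sketched with toy functions. This is an illustration of the general technique under greedy decoding, not dhara's or OptIQ's implementation: `target_next` and `draft_block` are hypothetical stand-ins for the AR pass and the block draft, and the drafter is rigged to make an occasional mistake so that the verification step has something to reject.

```python
def target_next(tokens):
    """Toy greedy 'AR model': the next token is a deterministic function of the prefix."""
    return (sum(tokens) * 31 + len(tokens)) % 50

def draft_block(tokens, k):
    """Toy drafter: agrees with the target except at every 4th position."""
    out = list(tokens)
    block = []
    for _ in range(k):
        nxt = target_next(out)
        if len(out) % 4 == 0:
            nxt = (nxt + 1) % 50  # deliberate draft error
        block.append(nxt)
        out.append(nxt)
    return block

def ar_decode(prompt, n):
    """Plain greedy autoregressive decoding: one token per forward pass."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]

def speculative_decode(prompt, n, k=4):
    """Draft k tokens, keep the verified prefix, fix the first mismatch."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n:
        block = draft_block(out, k)  # forward 1: draft a block
        accepted = []
        # Forward 2: verify. A real model scores all k positions in one pass;
        # the toy recomputes the target token per position instead.
        for tok in block:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong draft token
                break
            accepted.append(tok)
        out.extend(accepted)
        rounds += 1
    return out[len(prompt):len(prompt) + n], rounds

tokens_ar = ar_decode([1, 2, 3], 12)
tokens_spec, rounds = speculative_decode([1, 2, 3], 12)
print(tokens_spec == tokens_ar, f"{rounds} rounds for 12 tokens")
```

Because every emitted token is either a verified draft token or the target's own choice, the output matches plain greedy decoding exactly; the gain comes from needing fewer rounds than tokens.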

Diffusion LLM getting-started guide →

02 Nemotron 3 family · added Jun 3, 2026

Nemotron 3: NVIDIA's Mamba-attention hybrid.

NVIDIA's Nemotron 3 Nano interleaves Mamba2 state-space blocks with a handful of full-attention layers, and the larger model adds a 128-expert sparse MoE. The 30B-A3B (≈3 B active per token) is the standout: OptIQ assigns per-layer 4/8-bit precision across the fused routed experts and beats uniform 4-bit by +2.0 Capability Score, winning or tying all six benchmarks. The dense 4B is a smaller, tighter win.

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)

Model	bf16 size	mlx-optiq size	Compression	Capability Score	Δ vs uniform-4
NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit	63,236 MB	21,043 MB	3.0×	69.15	+2.02
NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit	7,947 MB	2,938 MB	2.7×	63.60	+0.24
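For readers new to mixed precision, the idea of assigning 4-bit or 8-bit per layer can be pictured as a budgeted allocation problem. The sketch below is a generic greedy heuristic, not OptIQ's actual algorithm: the layer names, sensitivity values, sizes, and the 4.5 bits-per-weight budget are all hypothetical, and `allocate_bits` is an illustrative helper.

```python
def allocate_bits(sensitivity, sizes, budget_bits_per_weight):
    """Start every layer at 4-bit, then promote the most sensitive layers
    to 8-bit while the average bits per weight stays within budget."""
    bits = {name: 4 for name in sensitivity}
    total = sum(sizes.values())
    used = 4 * total
    # Rank by sensitivity per parameter so small, fragile layers go first.
    ranked = sorted(sensitivity, key=lambda n: sensitivity[n] / sizes[n], reverse=True)
    for name in ranked:
        extra = 4 * sizes[name]  # cost of going from 4-bit to 8-bit
        if (used + extra) / total <= budget_bits_per_weight:
            bits[name] = 8
            used += extra
    return bits, used / total

# Hypothetical layers: quality sensitivity and relative parameter count.
sensitivity = {"attn.0": 0.9, "mlp.0": 0.2, "attn.1": 0.6, "mlp.1": 0.1}
sizes = {"attn.0": 10, "mlp.0": 40, "attn.1": 10, "mlp.1": 40}

bits, avg = allocate_bits(sensitivity, sizes, budget_bits_per_weight=4.5)
print(bits, f"average {avg} bits/weight")
```

With these made-up numbers only the most fragile layer is promoted, which keeps the average close to 4 bits while protecting the layer that hurts most when squeezed.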

Hybrid KV cache: Only the four full-attention layers carry a KV cache; the Mamba2 blocks keep recurrent state instead. Each repo ships a kv_config.json covering just those attention layers (three at 4-bit, one at 8-bit); pass it with optiq serve --kv-config kv_config.json. optiq kv-cache gained NemotronH support in v0.1.5; earlier versions raised ZeroDivisionError on this architecture.
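A rough back-of-the-envelope for what that split buys, assuming all four attention layers have the same KV shape and ignoring the per-group scale and bias overhead that quantized caches carry (both are simplifying assumptions, so treat the result as an upper bound):

```python
FP16_BITS = 16
# Three attention layers at 4-bit, one at 8-bit, as stated in the text.
layer_bits = [4, 4, 4, 8]

avg_bits = sum(layer_bits) / len(layer_bits)
reduction = FP16_BITS / avg_bits

print(f"average KV precision: {avg_bits} bits, about {reduction:.1f}x smaller than fp16")
```

The Mamba2 blocks are unaffected either way, since their recurrent state does not grow with context length.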

Nemotron 3 getting-started guide →

03 MiniCPM5 family · added May 28, 2026

MiniCPM5: a 1B that punches above its weight.

OpenBMB's 1.08B-parameter Llama-architecture base, fully Apache-2.0. Hybrid-reasoning chat template with an enable_thinking flag. In non-thinking mode (the OptIQ benchmark recipe) it posts 52% MMLU, 65% IFEval, and 58% HumanEval on a model that weighs less than a gigabyte on disk. The OptIQ-4bit quant beats stock uniform-4 by 12 points on HumanEval and rescues HashHop from a 0% floor — same sensitivity-aware allocation story as the larger families.

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)

Model	bf16 size	mlx-optiq size	Compression	Capability Score	Δ vs uniform-4
MiniCPM5-1B-OptiQ-4bit	2,062 MB	875 MB	2.4×	30.28	+4.44

Heads up: MiniCPM5 ships a hybrid <think> reasoning mode. Pass chat_template_kwargs={"enable_thinking": true} (True when calling from Python) to turn it on; expect substantially higher math/tool scores in that mode. OptIQ's benchmark recipe forces thinking off for cross-family comparability, so the table reflects fast-assistant performance.
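As an illustration of where that flag goes, the snippet below builds a chat request body with the standard library only. The overall request shape (an OpenAI-style chat payload) and the repo id are assumptions made for the example; only the chat_template_kwargs field and the enable_thinking flag come from the text above. Nothing is sent over the network.

```python
import json

# Assumed OpenAI-style chat payload; the repo id is illustrative.
payload = {
    "model": "mlx-community/MiniCPM5-1B-OptiQ-4bit",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    # Python's True serializes to JSON's lowercase true.
    "chat_template_kwargs": {"enable_thinking": True},
}

body = json.dumps(payload)
print(body)
```

Omitting chat_template_kwargs, or setting the flag to False, corresponds to the non-thinking mode used for the benchmark table.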

MiniCPM5 getting-started guide →

04 Gemma-4 family · added Apr 25, 2026

Gemma-4: Google's instruct series.

Five models: two small dense (e2b, e4b), the new 12 B (the unified text+vision Gemma-4, now with image input), and two large (31 B dense, 26 B-A4B sparse MoE). Mixed-precision recovery is dramatic: gemma-4-e4b posts a +13.6-point Capability Score gain over uniform 4-bit, and the 12 B adds +6.4. Pair e4b or 31B with its matching -assistant-bf16 drafter for speculative decoding. All five also take image input through a bf16 vision sidecar; see the vision guide.

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)

Model	bf16 size	mlx-optiq size	Compression	Capability Score	Δ vs uniform-4
gemma-4-e2b-it-OptiQ-4bit	9,216 MB	4,098 MB	2.2×	53.21	+2.12
gemma-4-e4b-it-OptiQ-4bit	14,336 MB	6,231 MB	2.3×	65.84	+13.57
gemma-4-12B-it-OptiQ-4bit	22,811 MB	8,449 MB	2.7×	68.23	+6.40
gemma-4-31B-it-OptiQ-4bit	63,488 MB	21,328 MB	3.0×	79.69	+3.47
gemma-4-26B-A4B-it-OptiQ-4bit	53,248 MB	16,813 MB	3.2×	72.68	+3.06

Mixed-precision KV now works on Gemma-4 (v0.1.3+): Each Gemma-4 repo above ships a recommended kv_config.json from a real sensitivity-analysis pass; pass it with optiq serve --kv-config kv_config.json. Upstream mlx-lm's RotatingKVCache.to_quantized raises NotImplementedError, which is why mixed-precision KV failed in v0.1.2 and earlier. The runtime now fills that gap with optiq.runtime.kv.RotatingQuantizedKVCache, plus a small SDPA dispatch patch for Gemma-4's KV-sharing layers. The model still loads fine without the config (stock fp16 KV).

Gemma-4 getting-started guide →

05 Qwen3.6 family · added earlier in April 2026

Qwen3.6: frontier-class reasoning.

Qwen3.6 in two configurations: a dense 27 B and a 256-expert MoE with 3 B active per token. Both quantized with the same mlx-optiq pass; both ship a bundled MTP head for ~1.4× decode via optiq serve --mtp. Both beat uniform 4-bit on the six-metric Capability Score, and both take image input via a bf16 vision sidecar (vision guide).

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)

Model	bf16 size	mlx-optiq size	Compression	Capability Score	Δ vs uniform-4
Qwen3.6-27B-OptiQ-4bit	57,344 MB	17,876 MB	3.2×	82.96	+0.46
Qwen3.6-35B-A3B-OptiQ-4bit	73,728 MB	22,679 MB	3.3×	76.78	+1.12

Qwen3.6 getting-started guide →

06 Qwen3.5 family · the founding lineup

Qwen3.5: the daily-driver series.

From 0.8 B for prompt-rewriters and toy agents up to 27 B for serious reasoning, plus a 35 B-A3B sparse MoE. All quantized with the same mlx-optiq pass; all ship a bundled MTP head for speculative decoding via optiq serve --mtp; all beat uniform 4-bit on the six-metric Capability Score. Every Qwen3.5 size also takes image input via a bf16 vision sidecar (vision guide).

Capability Score · 6-benchmark mean (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)

Model	bf16 size	mlx-optiq size	Compression	Capability Score	Δ vs uniform-4
Qwen3.5-0.8B-OptiQ-4bit	1,229 MB	620 MB	2.0×	36.00	+4.27
Qwen3.5-2B-OptiQ-4bit	3,072 MB	1,463 MB	2.1×	47.66	+2.12
Qwen3.5-4B-OptiQ-4bit	8,192 MB	3,118 MB	2.6×	65.76	+1.90
Qwen3.5-9B-OptiQ-4bit	18,432 MB	6,772 MB	2.7×	66.77	+0.19
Qwen3.5-27B-OptiQ-4bit	57,344 MB	17,788 MB	3.2×	79.05	+0.17
Qwen3.5-35B-A3B-OptiQ-4bit	73,728 MB	21,603 MB	3.4×	74.17	+0.42

Recommended starting point: For most users on a 36 GB Mac, the Qwen3.5-9B quant is the default. It balances capability against footprint (66.77 Capability Score in 6,772 MB) and runs at the full 64 k context with mixed-precision KV. The bundled MTP head delivers ~1.4× decode via optiq serve --mtp. Drop to 4 B on laptops with less RAM, or step up to 27 B if you have headroom.

Qwen3.5 getting-started guide →

07 Loading any of them

One snippet. Any model on this page.

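Every OptIQ quant that loads through mlx-lm follows the same load contract. Swap the repo name; the rest stays. The one exception on this page is DiffusionGemma, which needs the vendored decoder in mlx-optiq ≥ 0.2.3.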

load_any.py · python

from mlx_lm import load, generate

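# Pick any mlx-lm-loadable repo on this page. Stock mlx-lm, no special loader needed.
# Exception: DiffusionGemma requires the vendored decoder in mlx-optiq >= 0.2.3.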
model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization in 3 sentences."}],
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)

Per-family notes: Each model family has slightly different recommended sampling defaults and chat templates. See the Diffusion LLM, Nemotron 3, MiniCPM5, Gemma-4, Qwen3.6 and Qwen3.5 getting-started guides.

A family of OptIQ-4bit quants. Ready to load.
