# mlx-optiq

> mlx-optiq is a Python toolkit for running large language models entirely on Apple Silicon. One PyPI package gives you: (1) sensitivity-driven mixed-precision weight quantization that beats uniform 4-bit at the same size; (2) per-layer mixed-precision KV cache for long-context decode; (3) sensitivity-aware LoRA fine-tuning with PEFT-compatible output; (4) hot-swappable mounted LoRA adapters; (5) a dual-protocol inference server speaking both the OpenAI `/v1/chat/completions` API and the Anthropic `/v1/messages` API from the same process; (6) a two-stage evaluation harness (smoketest + full benchmark suite with HumanEval running in a layered sandbox); (7) image+text (VLM) input on the Gemma-4 and Qwen3.5/3.6 families via a vendored vision tower kept at bf16 in a sidecar, with no mlx-vlm runtime dependency.

## Install

```
pip install mlx-optiq
```

Requirements: macOS 14+, Apple Silicon (M1/M2/M3/M4), Python 3.11+.

Optional extras: `mlx-optiq[convert]` (psutil for RAM precheck), `mlx-optiq[eval]` (datasets for MMLU/GSM8K/IFEval/BFCL/HumanEval/HashHop), `mlx-optiq[serve]` (uvicorn/fastapi), `mlx-optiq[all]`.

## Pre-built models on Hugging Face

All pre-built quants live under the `mlx-community` organization on HF. They load with stock `mlx_lm.load(...)`. No special runtime. Every OptIQ quant beats stock uniform 4-bit on the six-metric Capability Score (mean of MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop).

### Nemotron 3 family (NVIDIA Mamba2 + attention hybrid; 30B adds a 128-expert MoE)

- `mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit`: 20.6 GB · Capability Score **69.2** (+2.0 vs uniform-4)
- `mlx-community/NVIDIA-Nemotron-3-Nano-4B-OptiQ-4bit`: 2.9 GB · Capability Score **63.6** (+0.2 vs uniform-4)

### MiniCPM5 family (Llama-arch dense, hybrid reasoning)

- `mlx-community/MiniCPM5-1B-OptiQ-4bit`: 0.9 GB · Capability Score **30.3** (+4.4 vs uniform-4)

### Qwen3.5 family (dense + 1 sparse MoE)

- `mlx-community/Qwen3.5-0.8B-OptiQ-4bit`: 0.6 GB · Capability Score **36.0** (+4.3 vs uniform-4)
- `mlx-community/Qwen3.5-2B-OptiQ-4bit`: 1.4 GB · Capability Score **47.7** (+2.1 vs uniform-4)
- `mlx-community/Qwen3.5-4B-OptiQ-4bit`: 3.0 GB · Capability Score **65.8** (+1.9 vs uniform-4)
- `mlx-community/Qwen3.5-9B-OptiQ-4bit`: 6.6 GB · Capability Score **66.8** (+0.2 vs uniform-4)
- `mlx-community/Qwen3.5-27B-OptiQ-4bit`: 17.4 GB · Capability Score **79.0** (+0.2 vs uniform-4)
- `mlx-community/Qwen3.5-35B-A3B-OptiQ-4bit`: 21.1 GB · Capability Score **74.2** (+0.4 vs uniform-4)

### Qwen3.6 family

- `mlx-community/Qwen3.6-27B-OptiQ-4bit`: 17.5 GB · Capability Score **83.0** (+0.5 vs uniform-4)
- `mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit`: 22.1 GB · Capability Score **76.8** (+1.1 vs uniform-4)

### Gemma-4 family (instruct)

- `mlx-community/gemma-4-e2b-it-OptiQ-4bit`: 4.0 GB · Capability Score **53.2** (+2.1 vs uniform-4)
- `mlx-community/gemma-4-e4b-it-OptiQ-4bit`: 6.1 GB · Capability Score **65.8** (+13.6 vs uniform-4)
- `mlx-community/gemma-4-12B-it-OptiQ-4bit`: 8.3 GB · Capability Score **68.2** (+6.4 vs uniform-4) · unified text+vision Gemma-4, with image input
- `mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit`: 16.4 GB · Capability Score **72.7** (+3.1 vs uniform-4)
- `mlx-community/gemma-4-31B-it-OptiQ-4bit`: 20.8 GB · Capability Score **79.7** (+3.5 vs uniform-4)

All Qwen3.5 / Qwen3.6 quants ship a bundled MTP head (`mtp.safetensors`) for ~1.4× decode via `optiq serve --mtp`. Gemma-4 quants pair with the matching [`mlx-community/-it-assistant-bf16`](https://huggingface.co/mlx-community) drafter via `optiq serve --drafter `. Gemma-4 and Qwen3.5/3.6 quants that carry an `optiq_vision.safetensors` sidecar also take image input (see Vision section below).

## Loading any pre-built quant

```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization."}],
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)
```

For Qwen3.5/3.6 reasoning models, pass `enable_thinking=False` to skip the `...` channel for faster (slightly less accurate) output.

## Streaming generation

```python
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)
for response in stream_generate(model, tok, prompt="...", max_tokens=200, sampler=sampler):
    print(response.text, end="", flush=True)
```

## Quantizing your own model

```bash
# auto-routes between bf16 and uniform_4bit reference based on RAM
optiq convert Qwen/Qwen3.5-9B \
  --target-bpw 4.5 \
  --candidate-bits 4,8 \
  --reference auto \
  -o ./optiq_output/Qwen3.5-9B
```

Three reference modes:

- `bf16` (gold standard, requires bf16 in RAM, ~2 × params in GB)
- `uniform_4bit` (for big models, builds 4-bit baseline + streams bf16 layer-by-layer from disk)
- `auto` (default; picks bf16 if it fits, else uniform_4bit)

The output is a standard MLX checkpoint with per-layer bit assignments stored in metadata. It loads anywhere stock `mlx-lm` loads.
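
As a rough illustration of the `auto` rule above (bf16 if it fits, else `uniform_4bit`), here is a minimal sketch. The function name `pick_reference` and the `headroom_gb` margin are hypothetical and are not part of the mlx-optiq API; only the "~2 × params in GB" estimate comes from the text.

```python
def pick_reference(n_params_billion: float, free_ram_gb: float,
                   headroom_gb: float = 4.0) -> str:
    """Return 'bf16' when a bf16 copy should fit in RAM, else 'uniform_4bit'.

    bf16 stores 2 bytes per parameter, so the copy needs about
    2 * params (in billions) GB. headroom_gb is an assumed safety margin.
    """
    bf16_gb = 2.0 * n_params_billion
    if bf16_gb + headroom_gb <= free_ram_gb:
        return "bf16"
    return "uniform_4bit"


# A 9B model on a 36 GB machine: 18 GB + margin fits.
small = pick_reference(9, 36)
# A 27B model on the same machine: 54 GB does not fit.
large = pick_reference(27, 36)
print(small, large)
```

The real precheck may measure available memory differently (the `convert` extra pulls in psutil for this), so treat the threshold here as a placeholder.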

## Mixed-precision KV-cache serving

One-time sensitivity pass, then serve:

```bash
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 5.0 --candidate-bits 4,8 \
  -o ./kv/qwen35_9b

optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --kv-config ./kv/qwen35_9b/kv_config.json \
  --port 8080
```

Delivers +31% to +62% decode speedup at 64k context on Qwen3.5 4B/9B vs fp16 KV.

KV-quant works on all OptIQ-published families including Gemma-4. The sliding-window cache (Gemma-4's `RotatingKVCache`) is handled by `optiq.runtime.kv.RotatingQuantizedKVCache`, installed automatically when `--kv-bits` or `--kv-config` is set. Bundled `kv_config.json` files ship in each Gemma-4 repo from v0.1.3 onward; on `optiq <= 0.1.2` the path still raises `NotImplementedError: RotatingKVCache Quantization NYI`, so upgrade with `pip install -U mlx-optiq` to land the fix.

## OpenAI- and Anthropic-compatible API

`optiq serve` exposes BOTH endpoints from the same process:

- OpenAI: `/v1/chat/completions` (streaming SSE)
- Anthropic: `/v1/messages` (streaming SSE), works with Claude Code and the official `anthropic` Python SDK

```python
# OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")
resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

# Anthropic client (same server)
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8080", api_key="not-used")
resp = client.messages.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.content[0].text)
```

Claude Code via env vars (one-line setup):

```bash
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="not-used"
claude  # now driven by your local quant
```

## Vision (image input) on Gemma-4 and Qwen3.5/3.6

As of v0.2.0, mlx-optiq answers image+text prompts on the Gemma-4 and Qwen3.5/3.6 families. The language tower is OptIQ-quantized and decoded by mlx-lm; the vision tower is vendored into mlx-optiq (no mlx-vlm runtime dependency) and kept at bf16 in a sidecar named `optiq_vision.safetensors`. Because mlx-lm selects weights with `glob("model*.safetensors")`, it never sees the sidecar, so the same published repo loads text-only under stock mlx-lm and full image+text under OptIQ. Vision stays bf16 (int4 vision degrades OCR and fine detail); only the language tower is quantized. The vendored path reproduces mlx-vlm's vision tower and projection outputs to a maximum absolute difference of zero.

`optiq serve` and `optiq lab` turn on image support automatically when the model carries the sidecar. Send an OpenAI-style `image_url` content part:

```bash
optiq serve --model mlx-community/gemma-4-e2b-it-OptiQ-4bit
# POST messages with content: [{"type":"image_url","image_url":{"url":"data:image/png;base64,..."}},
#                              {"type":"text","text":"What is in this image?"}]
```

Python API: `OptiqEngine.generate(prompt, images=["cat.jpg"])` or a full `messages=` list with `image_url` parts. Attach a vision sidecar to an existing quant with `optiq.vlm.build_vision_sidecar(base, quant_dir)`. The vision path runs only when a request carries an image, so MTP, LoRA, KV-quant, and text generation are unchanged.

## Sensitivity-aware LoRA fine-tuning

```bash
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --data ./my_training_data \
  --max-seq-length 1400 \
  --rank 8 --rank-scaling by_bits \
  --num-layers 16 --iters 1000 \
  -o ./my_adapter

optiq lora info ./my_adapter
```

`--rank-scaling by_bits` gives 8-bit mlx-optiq layers 2× the adapter rank of 4-bit layers at the same total parameter budget. The same layers mlx-optiq kept at 8-bit during quantization also get more adapter capacity.
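
To make the `by_bits` idea concrete, here is a small sketch of one way to split a fixed rank budget so that 8-bit layers get twice the rank of 4-bit layers. It assumes every adapted layer has the same shape (so adapter parameters scale linearly with rank); the function `ranks_by_bits` is hypothetical and is not mlx-optiq's implementation.

```python
def ranks_by_bits(layer_bits: list[int], base_rank: int) -> list[int]:
    """Distribute base_rank * n_layers of total rank in proportion to bit-width.

    Assumption: all layers share one shape, so total adapter parameters
    are proportional to the sum of ranks.
    """
    lowest = min(layer_bits)
    weights = [b / lowest for b in layer_bits]      # 4-bit -> 1.0, 8-bit -> 2.0
    budget = base_rank * len(layer_bits)            # what constant scaling would use
    unit = budget / sum(weights)                    # rank given to a lowest-bit layer
    return [max(1, round(unit * w)) for w in weights]


# Two 4-bit layers and one 8-bit layer with --rank 8:
ranks = ranks_by_bits([4, 4, 8], base_rank=8)
print(ranks, sum(ranks))
```

With other layer mixes the rounding step can shift the total by a rank or two, so an actual implementation would need a rule for absorbing that remainder.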

Output is PEFT-compatible (`adapter_config.json` + `adapters.safetensors`) plus an mlx-optiq sidecar (`optiq_lora_config.json`) recording the per-layer rank distribution.

Data format is JSONL (one example per line, `{"text": "..."}` or `{"messages": [...]}`), the same format `mlx_lm.lora` uses.

### Empirical training-ceiling map (M3 Max 36 GB, default config)

| Model | Max seq len | Peak mem |
|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB |
| Qwen3.5-2B | 2,400 | 19.3 GB |
| Qwen3.5-4B | 1,600 | 24.8 GB |
| Qwen3.5-9B | 1,400 | 25.4 GB |
| Qwen3.5-27B / Qwen3.6-27B | 512 | 27.7 GB |
| gemma-4-26B-A4B | 512 | 27.6 GB |
| Qwen3.5-35B-A3B / Qwen3.6-35B-A3B | 128 | 25.3 GB |
| gemma-4-31B-it | 32 | 21.4 GB |

Two distinct failure modes when pushing past these:

- **Memory cliff** (~27-28 GB): macOS uses compressed memory, throughput drops 9-30%
- **MTLResource cliff** (independent of bytes): Apple GPUs cap at 499 K simultaneously bound resources. Qwen3.5-2B at sequence length T=3,200 hits a hard `kIOGPUCommandBufferCallbackErrorOutOfMemory` even at 22 GB peak. Don't extrapolate "more GB headroom" → "longer T".

## Hot-swap mounted LoRA adapters

`optiq serve --adapter` accepts a single adapter (HF repo id or local path) per process. For multi-adapter, in-process hot-swap, use the Python primitive:

```python
from mlx_lm import load, generate
from optiq.adapters.mount import (
    prepare_model_for_mounted_lora,
    mount_adapter_on_model,
    AdapterActivation,
)

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prepare_model_for_mounted_lora(model)
mount_adapter_on_model(model, "agent-A", "./adapter_a")
mount_adapter_on_model(model, "agent-B", "./adapter_b")

p = "hi"  # placeholder prompt shared by both adapters

with AdapterActivation("agent-A"):
    out_a = generate(model, tok, prompt=p, max_tokens=100)
with AdapterActivation("agent-B"):
    out_b = generate(model, tok, prompt=p, max_tokens=100)
```

Memory: ~50 MB per extra adapter vs ~5 GB per full base copy.
The active adapter is held in a `ContextVar`, so concurrent asyncio tasks or threads with different active adapters don't step on each other.

## Evaluation

```bash
# Fast triage (KL + GSM8K-50, ~5 min on 27B)
optiq eval ./optiq_mixed --task smoketest

# Full benchmark suite (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop)
optiq eval ./optiq_mixed --task all --score --output-json ./bench.json

# Single tasks
optiq eval ./optiq_mixed --task humaneval  # 164 problems, sandboxed code execution
optiq eval ./optiq_mixed --task bfcl       # 200 simple function-call questions
optiq eval ./optiq_mixed --task ifeval     # full 540-prompt IFEval
optiq eval ./optiq_mixed --task hashhop    # 25×4 multi-hop key/value retrieval, ~12k ctx
```

The `--score` flag (under `--task all`) computes a Capability Score = unweighted mean of MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop.

HumanEval runs in a layered sandbox (apple/container → sandbox-exec → subprocess + rlimit), so untrusted model-generated code can't escape.

HashHop Long-Context Evaluation checks retrieval through `N` chained key→value hash lookups in a ~12 k token context.

KL evaluator auto-resolves the reference: bf16 if it fits in RAM (per-shard sizes via `HfApi.model_info`), else the mlx-community uniform-4-bit baseline.
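
For readers who want to recompute the Capability Score by hand, the unweighted mean described above can be sketched as follows. The per-task numbers are made up for illustration, and the dictionary layout is an assumption; it is not the schema of the `--output-json` file.

```python
from statistics import fmean

TASKS = ("mmlu", "gsm8k", "ifeval", "bfcl", "humaneval", "hashhop")


def capability_score(scores: dict[str, float]) -> float:
    """Unweighted mean over the six benchmark scores (each on a 0-100 scale)."""
    missing = [t for t in TASKS if t not in scores]
    if missing:
        raise KeyError(f"missing task scores: {missing}")
    return fmean(scores[t] for t in TASKS)


# Illustrative values only, not measured results.
example = {
    "mmlu": 70.0, "gsm8k": 80.0, "ifeval": 60.0,
    "bfcl": 75.0, "humaneval": 65.0, "hashhop": 70.0,
}
score = capability_score(example)
print(score)
```

Because the mean is unweighted, a large swing on a single task moves the score by one sixth of that swing.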

## CLI reference

- `optiq convert MODEL [--target-bpw 4.5] [--candidate-bits 4,8] [--reference auto|bf16|uniform_4bit] [--calibration-mix optiq|PATH] [--n-floor-per-block 2] [-o PATH]`
- `optiq kv-cache MODEL [--target-bits 5.0] [--candidate-bits 4,8] [--n-samples 5] [--seq-len 512] [-o PATH]`
- `optiq lora train MODEL --data PATH [--rank 8] [--scale 20.0] [--rank-scaling by_bits|constant|by_kl] [--num-layers 16] [--max-seq-length 1600] [--iters 1000] [-o PATH]`
- `optiq lora info ADAPTER_PATH`
- `optiq serve --model MODEL [--kv-config PATH | --kv-bits 4|8] [--adapter PATH-OR-REPO] [--anthropic/--no-anthropic] [-- mlx_lm.server flags]`
- `optiq eval MODEL_PATH --task smoketest|all|kl|mmlu|gsm8k|gsm8k-50|ifeval|bfcl|humaneval|hashhop [--score] [--kv-config PATH | --kv-bits 4|8] [--reference-mode auto|bf16|uniform_4bit] [--output-json PATH]`
- `optiq benchmark MODEL_PATH [--baseline UNIFORM_PATH] [--n-samples 50]`
- `optiq latency MODEL_PATH [--calibrate]`
- `optiq --version`

## How sensitivity works (algorithm)

For each `(layer L, candidate bits b)`:

1. Forward-pass calibration data with all weights at reference precision; record output logits.
2. Replace just L's weight with a simulate-quantized copy at b bits (round-trip quantize→dequantize).
3. Forward-pass the same calibration data; record perturbed logits.
4. Compute KL divergence between reference and perturbed logits, averaged over samples.
5. Restore L; move to next layer.

Then a greedy knapsack assigns bits: start every layer at the lowest candidate bit-width, then repeatedly upgrade the layer with the largest KL reduction per added bit until the average BPW reaches the target. `lm_head`, `embed_tokens`, and the first/last attention blocks are protected at 8-bit by default.

Calibration uses the bundled `optiq.jsonl` mix: 40 hand-curated samples across prose (5), reasoning (6), code (6), agent loops (8), function-calling (7), and constraint-bearing instructions (8).
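
The KL step and the greedy allocation described in this section can be sketched in plain Python. This is a simplified illustration under stated assumptions: the KL table is supplied by hand instead of being measured with model forward passes, the protected 8-bit layers are not modeled, and an upgrade is skipped if it would push the average above the target. The names `kl_from_logits` and `greedy_allocate` are hypothetical, not mlx-optiq functions.

```python
import math


def kl_from_logits(ref: list[float], pert: list[float]) -> float:
    """KL(softmax(ref) || softmax(pert)) for one token position."""
    def log_softmax(x):
        m = max(x)
        lse = m + math.log(sum(math.exp(v - m) for v in x))
        return [v - lse for v in x]
    lr, lp = log_softmax(ref), log_softmax(pert)
    return sum(math.exp(a) * (a - b) for a, b in zip(lr, lp))


def greedy_allocate(kl, sizes, candidate_bits, target_bpw):
    """kl[layer][bits] -> measured KL; sizes[layer] -> parameter count."""
    bits = sorted(candidate_bits)
    level = {name: 0 for name in kl}              # index into bits, start lowest
    total = sum(sizes.values())
    budget = target_bpw * total                   # total weight bits allowed
    used = sum(bits[0] * sizes[n] for n in kl)
    while True:
        best, best_gain = None, 0.0
        for n in kl:
            i = level[n]
            if i + 1 >= len(bits):
                continue                          # already at the highest bit-width
            extra = (bits[i + 1] - bits[i]) * sizes[n]
            if used + extra > budget:
                continue                          # upgrade would exceed the target
            gain = (kl[n][bits[i]] - kl[n][bits[i + 1]]) / extra
            if gain > best_gain:
                best, best_gain = n, gain
        if best is None:
            break
        i = level[best]
        used += (bits[i + 1] - bits[i]) * sizes[best]
        level[best] = i + 1
    return {n: bits[level[n]] for n in kl}, used / total


# Toy table: four equal-size layers, layer "a" is the most sensitive at 4-bit.
kl = {"a": {4: 0.50, 8: 0.01}, "b": {4: 0.10, 8: 0.01},
      "c": {4: 0.30, 8: 0.01}, "d": {4: 0.05, 8: 0.01}}
sizes = {"a": 1, "b": 1, "c": 1, "d": 1}
assignment, avg_bpw = greedy_allocate(kl, sizes, [4, 8], target_bpw=5.0)
print(assignment, avg_bpw)
```

In the toy table only one upgrade fits inside the 5.0 BPW budget, and it goes to the layer whose 4-bit KL is largest.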
Chat samples are rendered through the target model's `tokenizer.apply_chat_template` before tokenization, so the activated subspace matches production. Pass `--calibration-mix /path/to/your.jsonl` to override; rebuild the default mix with `python scripts/build_calibration.py`.

## Site map

- https://mlx-optiq.com/: overview
- https://mlx-optiq.com/models: all 16 pre-built quants
- https://mlx-optiq.com/docs/: documentation index
- https://mlx-optiq.com/docs/install: installation
- https://mlx-optiq.com/docs/quants: using pre-built quants
- https://mlx-optiq.com/docs/sensitivity: methodology
- https://mlx-optiq.com/docs/nemotron3: Nemotron 3 family guide (NVIDIA Mamba2 + attention hybrid)
- https://mlx-optiq.com/docs/qwen3.5: Qwen3.5 family guide
- https://mlx-optiq.com/docs/qwen3.6: Qwen3.6 family guide
- https://mlx-optiq.com/docs/gemma-4: Gemma-4 family guide
- https://mlx-optiq.com/docs/finetune: LoRA fine-tuning
- https://mlx-optiq.com/docs/serve: KV-quant serving
- https://mlx-optiq.com/docs/vision: image input (image+text) on Gemma-4 via the bf16 optiq_vision sidecar
- https://mlx-optiq.com/docs/mtp: MTP speculative decoding guide + bench grid
- https://mlx-optiq.com/docs/faq: FAQ for the most common queries (install, quantize Qwen3.5/3.6/Gemma-4, KV cache, LoRA, Claude Code, MTP, pre-built quants)
- https://mlx-optiq.com/docs/cli: CLI reference
- https://mlx-optiq.com/docs/lab/: OptIQ Lab, local web UI overview (install, structure, HF token)
- https://mlx-optiq.com/docs/lab/chat: Lab chat, local tools, self-healing tool calls, chat-with-files RAG with citations, Canvas HTML rendering
- https://mlx-optiq.com/docs/lab/arena: Model Arena, compare two models side by side with tokens/sec
- https://mlx-optiq.com/docs/lab/hub: Hub, browse published OptIQ quants, search HF, load with one click
- https://mlx-optiq.com/docs/lab/quantize: Lab quantize wizard (sensitivity + knapsack + convert)
- https://mlx-optiq.com/docs/lab/finetune: Lab sensitivity-aware LoRA fine-tuning
- https://mlx-optiq.com/docs/lab/dataset: Lab dataset builder, 12 templates to JSONL
- https://mlx-optiq.com/docs/integrations/: OpenAI-compatible serving for coding agents (matrix overview)
- https://mlx-optiq.com/docs/integrations/claude-code: Claude Code via /v1/messages
- https://mlx-optiq.com/docs/integrations/codex: Codex via /v1/responses (the deprecated-2026 Chat Completions path is not used)
- https://mlx-optiq.com/docs/integrations/opencode: OpenCode CLI via /v1/chat/completions
- https://mlx-optiq.com/docs/integrations/openclaw: OpenClaw (Unsloth's mlx-friendly Claude-style runner) via /v1/messages
- https://mlx-optiq.com/docs/integrations/hermes-agent: Hermes Agent via /v1/chat/completions
- https://mlx-optiq.com/blog/: engineering posts and research
- https://mlx-optiq.com/blog/vision-support: image+text on Gemma-4 with no mlx-vlm dependency; one bf16 sidecar loads text-only under mlx-lm or full image+text under OptIQ; vendored vision tower validated bit-exact
- https://mlx-optiq.com/blog/humanizer-stacked-lora: stacked SFT + DPO LoRAs on MiniCPM5-1B-OptiQ-4bit match human writing on RADAR (P(AI) 0.51 → 0.37, 100% gap closed); built on OptIQ 0.1.4's --mount-adapter (textbook SFT → DPO continuation) and per-request adapter stacking ("sft+dpo" syntax)
- https://mlx-optiq.com/blog/gemma-spec-decoding: first MLX port of Google's Gemma-4 -assistant drafter, 1.18x decode geomean on E4B with 31% acceptance (γ=1 greedy), wired into OptIQ Lab Server as "Spec drafter"
- https://mlx-optiq.com/blog/lab-chat-tools: OptIQ Lab chat with local tools (web search, sandboxed Python with inline matplotlib, terminal), 25-turn loop, Stop button, dedup, tool-call healer for six malformed shapes
- https://mlx-optiq.com/blog/tight-ram-kv-quant: 4-bit KV cache that actually saves memory (34% below fp16 at 32k) via streaming converter + FlashAttention orchestration
- https://mlx-optiq.com/blog/mtp-on-apple-silicon: three fixes to get MTP speculative decoding running at 1.4x on a 24 GB M4 Mac
- https://mlx-optiq.com/blog/eval-framework: two-stage eval (smoketest + benchmarks) + Capability Score
- https://mlx-optiq.com/blog/calibration-mix: six-domain calibration mix methodology
- https://mlx-optiq.com/blog/gemma-4-support: Gemma-4 family launch (e2b/e4b/26B-A4B/31B), +32 pp recovery on e4b
- https://mlx-optiq.com/blog/turboquant-postmortem: postmortem on the rotated-space KV experiment we built but didn't ship
- https://mlx-optiq.com/blog/sensitivity-aware-lora: LoRA fine-tuning with rank scaled by per-layer bit assignment
- https://mlx-optiq.com/blog/not-all-layers-are-equal: research foundation, per-layer sensitivity for weights and KV cache

## Distribution

- PyPI: https://pypi.org/project/mlx-optiq/
- Hugging Face quants: https://huggingface.co/mlx-community