Vision (image input)

As of v0.2.0, mlx-optiq answers image and text prompts on the Gemma-4 and Qwen3.5 / Qwen3.6 families. The language tower is still OptIQ mixed-precision quantized and decoded by mlx-lm; the vision tower is vendored into mlx-optiq (no mlx-vlm runtime dependency) and kept at bf16 in a sidecar that rides alongside the quantized weights.

At a glance: Vision support is on the Gemma-4 family (e2b, e4b, 12B, 26B-A4B, 31B) and the Qwen3.5 / Qwen3.6 family. The vision/audio towers stay at bf16; only the language tower is quantized. Audio (speech) input is not supported.

One artifact, two ways to load it

OptIQ stores the vision and audio towers, at bf16, in a sidecar file named optiq_vision.safetensors next to the quantized language shards. mlx-lm selects its weights with glob("model*.safetensors"), so it never matches the sidecar. The result is a single published repo that loads two ways:

Loader	Reads	You get
stock `mlx-lm`	model*.safetensors	Text-only model (sidecar ignored)
OptIQ	model*.safetensors + optiq_vision.safetensors	Full image + text

There is no separate vision build. Vision stays at bf16 because 4-bit vision degrades OCR and fine detail; the language tower, where almost all of the size lives, is still fully quantized.
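As a rough illustration of why the sidecar is invisible to a stock loader, the sketch below applies the same `model*.safetensors` pattern to a list of file names. The shard names and the helper `split_repo_files` are hypothetical, chosen only for the example; the pattern and the sidecar name come from the text above.

```python
from fnmatch import fnmatchcase

SIDECAR_NAME = "optiq_vision.safetensors"  # sidecar file name from the text above
SHARD_PATTERN = "model*.safetensors"       # pattern the text attributes to mlx-lm


def split_repo_files(filenames):
    """Return (language shards matched by the pattern, whether the sidecar is present).

    Hypothetical helper for illustration; not part of mlx-lm or mlx-optiq.
    """
    shards = sorted(f for f in filenames if fnmatchcase(f, SHARD_PATTERN))
    has_sidecar = SIDECAR_NAME in filenames
    return shards, has_sidecar


# Hypothetical repo listing: two language shards plus the vision sidecar.
files = [
    "config.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
    "optiq_vision.safetensors",
]
shards, has_sidecar = split_repo_files(files)
print(shards)       # only the two model-* shards
print(has_sidecar)  # True
```

Because the sidecar name does not start with `model`, the pattern never picks it up, which is what lets one repo serve both loaders.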

Supported models

Pre-built quants take image input today, across three vision architectures. Every preprocessing step and vision tower is vendored from mlx-vlm and reproduces its outputs to a maximum absolute difference of zero.

Family	Models with image input	Vision tower
Gemma-4	e2b, e4b, 12B, 26B-A4B, 31B	SigLIP tower (e2b/e4b/26B/31B); encoder-free unified backbone (12B)
Qwen3.5	0.8B, 2B, 4B, 9B, 27B, 35B-A3B	Qwen3-VL encoder
Qwen3.6	27B, 35B-A3B	Qwen3-VL encoder

Nemotron 3 and MiniCPM5 are text-only and carry no sidecar. Audio input is not supported on any model.

Serving images

When a model carries the sidecar, optiq serve turns on image support automatically. Send an OpenAI-style image_url content part (a data URL or an http(s) URL):

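# Terminal 1: start the server (it keeps running in the foreground)
optiq serve --model mlx-community/gemma-4-e2b-it-OptiQ-4bit

# Terminal 2: send an image part followed by a text part
curl http://127.0.0.1:8080/v1/chat/completions \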
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":[
        {"type":"image_url","image_url":{"url":"data:image/png;base64,..."}},
        {"type":"text","text":"What is in this image?"}]}]}'

Text-only requests are unchanged: the vision path only runs when a request actually carries an image, so MTP speculation, mounted LoRA adapters, KV-cache quantization, and plain text generation all behave exactly as without it.
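For clients that build the request in code, here is a minimal sketch of assembling the same JSON body, with the image embedded as a base64 data URL. The helper name `image_chat_payload` and the placeholder bytes are assumptions made for the example; the message shape mirrors the curl request above.

```python
import base64
import json


def image_chat_payload(image_bytes, question, mime="image/png"):
    """Build a chat-completions body with one image part and one text part.

    Hypothetical helper; the field layout follows the curl example above.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": question},
                ],
            }
        ]
    }


# Placeholder bytes stand in for a real PNG file read from disk.
payload = image_chat_payload(b"placeholder image bytes", "What is in this image?")
body = json.dumps(payload)  # string to POST to /v1/chat/completions
```

In real use, `image_bytes` would come from something like `open("cat.png", "rb").read()`, and `body` would be sent with any HTTP client.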

In the Lab

The Lab's Chat tab takes image uploads directly. Run optiq lab --model <sidecar-equipped quant>, open Chat, click attach, drop in a picture, and ask a question.

Figure: OptIQ Lab analyzing an uploaded image of shapes (gemma-4-e2b at 4-bit, reading the image in the Lab's Chat tab).

Python API

The engine takes images= (paths, URLs, data URLs, or PIL images) or a full messages= list with image_url parts:

from mlx_lm import load
from optiq.runtime.engine import OptiqEngine

model, tok = load("mlx-community/gemma-4-e2b-it-OptiQ-4bit")
eng = OptiqEngine.from_loaded(model, tok, "mlx-community/gemma-4-e2b-it-OptiQ-4bit")

st = eng.generate("What is in this image?", images=["cat.jpg"], max_tokens=128)
print(st.text)

Adding the sidecar to a quant

If you have an existing OptIQ language quant and the bf16 base it came from, attach a vision sidecar with one call. It extracts the bf16 vision and audio towers, writes optiq_vision.safetensors into the quant directory, and restores the multimodal config keys:

from optiq.vlm import build_vision_sidecar

build_vision_sidecar(
    base="google/gemma-4-e2b-it",        # bf16 base with the towers
    quant_dir="./gemma-4-e2b-it-OptiQ-4bit",  # existing OptIQ language quant
)

How it works

The vision front-end preprocesses the pixels, runs the vendored vision tower, projects the result into the language model's hidden space, and scatters those soft tokens into the text-embedding sequence at the image-placeholder positions. The merged embeddings go to mlx-lm's language model through its input_embeddings hook, and decode proceeds with the same quantized weights, KV cache, and sampler as text. mlx-optiq resolves the right front-end per model_type: gemma4 (SigLIP tower), gemma4_unified (the encoder-free 12B), and qwen3_5 (Qwen3-VL tower).
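The scatter step can be pictured with a small sketch. Plain Python lists stand in for the embedding arrays, and the placeholder token id `99` and the function name `merge_image_embeddings` are hypothetical. This is not mlx-optiq's actual code, only an illustration of replacing placeholder rows with projected vision features.

```python
def merge_image_embeddings(token_ids, text_embeds, image_embeds, image_token_id):
    """Overwrite the rows at image-placeholder positions with vision soft tokens.

    token_ids:    list[int], one id per sequence position
    text_embeds:  list of rows, one text embedding per position
    image_embeds: list of rows, one projected vision feature per placeholder
    """
    positions = [i for i, tok in enumerate(token_ids) if tok == image_token_id]
    if len(positions) != len(image_embeds):
        raise ValueError(
            f"{len(positions)} placeholders but {len(image_embeds)} image embeddings"
        )
    merged = [list(row) for row in text_embeds]  # copy so the input is not mutated
    for pos, vec in zip(positions, image_embeds):
        merged[pos] = list(vec)
    return merged


IMAGE_TOKEN_ID = 99  # hypothetical placeholder id for this example
token_ids = [5, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 7]
text_embeds = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.4, 0.4]]
image_embeds = [[1.0, 2.0], [3.0, 4.0]]

merged = merge_image_embeddings(token_ids, text_embeds, image_embeds, IMAGE_TOKEN_ID)
```

The merged sequence has the same length as the token sequence, so the language model can consume it in place of its own embedding lookup.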

The vendoring is validated against mlx-vlm tensor for tensor: feeding mlx-vlm's own pixel values through mlx-optiq's preprocessing and vision tower reproduces its outputs to a maximum absolute difference of zero, on every architecture.
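The metric itself is simple. A minimal sketch follows, assuming both outputs have been flattened to equal-length lists of floats; `max_abs_diff` is a hypothetical helper, not the project's test harness.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


# Identical outputs give exactly zero; any drift shows up as a positive number.
same = max_abs_diff([0.25, -1.5, 3.0], [0.25, -1.5, 3.0])
drift = max_abs_diff([0.25, -1.5, 3.0], [0.25, -1.0, 3.0])
```

A result of exactly zero, rather than merely a small tolerance, means the two implementations produced identical values for that input.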

Each tower needs the backbone to treat its visual tokens correctly. For the SigLIP towers, the tokens are self-contained, so the standard causal decode is enough. One detail: gemma4_text rescales the incoming embeddings by embed_scale, so the vision features are pre-divided to compensate. The unified Gemma-4 12B has no separate tower and was trained to attend bidirectionally over the image span, so OptIQ makes that span bidirectional with a one-shot mask wrapper (text and decode stay causal). Qwen's tower already carries 2D rotary positions internally, so its visual tokens arrive position-aware and need nothing special from the backbone.
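Two of these adjustments are easy to sketch in isolation. The code below is an illustration under stated assumptions, not mlx-optiq's implementation: `image_span_mask` builds a boolean attention mask that is causal everywhere except inside one image span, and `prescale_features` divides vision features by a scale so that a later multiplication by the same scale restores them. Both function names, the half-open span convention, and the numeric values are hypothetical.

```python
def image_span_mask(seq_len, span_start, span_end):
    """Boolean mask where mask[i][j] is True if query i may attend to key j.

    Causal by default (j <= i); positions inside the half-open image span
    [span_start, span_end) may also attend to each other in both directions.
    """
    def in_span(k):
        return span_start <= k < span_end

    return [
        [j <= i or (in_span(i) and in_span(j)) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def prescale_features(features, embed_scale):
    """Divide features by embed_scale so a later multiply by it cancels out."""
    return [x / embed_scale for x in features]


# Sequence of 5 positions with an image span covering positions 1 and 2.
mask = image_span_mask(seq_len=5, span_start=1, span_end=3)

# The backbone's rescale (multiply by embed_scale) undoes the pre-division.
scale = 4.0
restored = [x * scale for x in prescale_features([2.0, 8.0], scale)]
```

In the mask, position 1 can see position 2 even though it comes later, while text positions before and after the span keep ordinary causal visibility.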

See also: The release write-up, "mlx-optiq can see". Model lists and sampling defaults are in the Gemma-4 and Qwen3.5 family guides.