Apriel-1.5-15B-Thinker — MLX 3-bit (Apple Silicon)
- Format: MLX (Mac, Apple Silicon)
- Quantization: 3-bit (balanced footprint ↔ quality)
- Base: ServiceNow-AI/Apriel-1.5-15B-Thinker
- Architecture: Pixtral-style LLaVA (vision encoder → 2-layer projector → decoder)
This repository provides a 3-bit MLX build of Apriel-1.5-15B-Thinker for on-device multimodal inference on Apple Silicon. In side-by-side tests, the 3-bit variant often:
- uses significantly less RAM than 6-bit,
- decodes faster, and
- tends to produce more direct answers (less “thinking out loud”) at low temperature.
If RAM allows, we also suggest trying 4-bit/5-bit/6-bit variants (guidance below) for tasks that demand more fidelity.
Explore other Apriel MLX variants under the mlx-community namespace on the Hub.
🔎 Upstream → MLX summary
Apriel-1.5-15B-Thinker is a multimodal reasoning VLM built via depth upscaling, two-stage multimodal continual pretraining, and SFT with explicit reasoning traces (math, coding, science, tool-use).
This MLX release converts the upstream checkpoint with 3-bit quantization for a smaller memory footprint and faster startup on macOS.
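For reference, a reconversion sketch: mlx-vlm ships a convert entry point (`python -m mlx_vlm.convert`), and the call below mirrors it. The exact function signature and the `q_group_size` default are assumptions that vary across mlx-vlm versions; check `python -m mlx_vlm.convert --help` before relying on them.

```python
# Hypothetical reconversion sketch — mlx-vlm's convert API changes between
# releases, so names and defaults here are assumptions, not guarantees.
from mlx_vlm import convert

convert(
    hf_path="ServiceNow-AI/Apriel-1.5-15B-Thinker",  # upstream checkpoint
    mlx_path="./apriel-1.5-15b-thinker-3bit",        # local output directory
    quantize=True,
    q_bits=3,         # 3-bit weights, matching this repo
    q_group_size=64,  # assumed MLX default group size
)
```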
📦 Contents
- `config.json` (MLX config for the Pixtral-style VLM)
- `mlx_model*.safetensors` (3-bit shards)
- `tokenizer.json`, `tokenizer_config.json`
- `processor_config.json` / `image_processor.json`
- `model_index.json` and metadata
🚀 Quickstart (CLI)
Single image caption
```bash
python -m mlx_vlm.generate \
  --model mlx-community/Apriel-1.5-15b-Thinker-3bit-MLX \
  --image /path/to/image.jpg \
  --prompt "Describe this image in two concise sentences." \
  --max-tokens 128 --temperature 0.0 --device mps --seed 0
```
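The same call from Python, as a minimal sketch following the load / apply_chat_template / generate pattern from mlx-vlm's README. The sampling keyword names (`max_tokens`, `temperature`) have shifted between mlx-vlm releases, so adjust for your installed version.

```python
# Minimal Python-API sketch of the CLI call above. Keyword names for
# sampling are assumptions that may differ by mlx-vlm version.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Apriel-1.5-15b-Thinker-3bit-MLX"
model, processor = load(model_path)
config = load_config(model_path)

images = ["/path/to/image.jpg"]
prompt = apply_chat_template(
    processor, config,
    "Describe this image in two concise sentences.",
    num_images=len(images),
)
output = generate(model, processor, prompt, images,
                  max_tokens=128, temperature=0.0)
print(output)
```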
🔀 Model Family Comparison (2-bit → 6-bit)
TL;DR: Start with 3-bit for the best size↔quality trade-off. If you need finer OCR/diagram detail and have RAM, step up to 4-bit/5-bit. Use 6-bit only when you have headroom and you explicitly instruct concision.
📊 Quick Comparison
| Variant | 🧠 Peak RAM* | ⚡ Speed (rel.) | 🗣️ Output Style (typical) | ✅ Best For | ⚠️ Watch Out For |
|---|---|---|---|---|---|
| 2-bit | ~7–8 GB | 🔥🔥🔥🔥 | Shortest, most lossy | Minimal RAM demos, quick triage | Detail loss on OCR/dense charts; more omissions |
| 3-bit | ~9–10 GB | 🔥🔥🔥🔥 | Direct, concise | Default on M1/M2/M3; day-to-day use | May miss tiny text; keep prompts precise |
| 4-bit | ~11–12.5 GB | 🔥🔥🔥 | More detail retained | Docs/UIs with small text; charts | Slightly slower; still quantization artifacts |
| 5-bit | ~13–14 GB | 🔥🔥☆ | Higher fidelity | Heavier document/diagram tasks | Needs more RAM; occasional verbose answers |
| 6-bit | ~14.5–16 GB | 🔥🔥 | Highest MLX fidelity | Max quality under quantization | Can “think aloud”; add a “be concise” instruction |
*Indicative for a ~15B VLM under MLX; exact numbers vary with device, image size, and context length.
🧪 Example (COCO 000000039769.jpg — “two cats on a pink couch”)
| Variant | ⏱️ Prompt TPS | ⏱️ Gen TPS | 📈 Peak RAM | 📝 Notes |
|---|---|---|---|---|
| 3-bit | ~79 tok/s | ~9.79 tok/s | ~9.57 GB | Direct answer; minimal “reasoning” leakage |
| 6-bit | ~78 tok/s | ~6.50 tok/s | ~14.81 GB | Sometimes prints “Here are my reasoning steps…” |
Settings: `--temperature 0.0 --max-tokens 100 --device mps`. Results vary by Mac model and image resolution; the trend is consistent.
🧭 Choosing the Right Precision
- I just want it to work on my Mac: 👉 3-bit
- Tiny fonts / invoices / UI text matter: 👉 4-bit, then 5-bit if RAM allows
- I need every drop of quality and have ≥16 GB free: 👉 6-bit (add “Answer directly; do not include reasoning.”)
- I have very little RAM: 👉 2-bit (expect noticeable quality loss)
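The same guidance as a toy picker function; the thresholds come from the indicative peak-RAM column in the comparison table above and are ballpark figures, not guarantees.

```python
# Toy variant picker using the indicative RAM thresholds from the
# comparison table; real headroom depends on image size and context.
def pick_variant(free_ram_gb: float, needs_small_text: bool = False) -> str:
    if needs_small_text:
        if free_ram_gb >= 13:
            return "5-bit"
        if free_ram_gb >= 11:
            return "4-bit"
    if free_ram_gb >= 16:
        return "6-bit"  # remember the anti-CoT instruction
    if free_ram_gb >= 9:
        return "3-bit"  # recommended default
    return "2-bit"      # expect noticeable quality loss

print(pick_variant(10))                         # -> "3-bit"
print(pick_variant(12, needs_small_text=True))  # -> "4-bit"
```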
⚙️ Suggested Settings (per variant)
| Variant | Max Tokens | Temp | Seed | Notes |
|---|---|---|---|---|
| 2-bit | 64–96 | 0.0 | 0 | Keep short; single image; expect omissions |
| 3-bit | 96–128 | 0.0 | 0 | Great default; concise prompts help |
| 4-bit | 128–192 | 0.0–0.2 | 0 | Better small-text recall; watch RAM |
| 5-bit | 128–256 | 0.0–0.2 | 0 | Best OCR fidelity short of 6-bit |
| 6-bit | 128–256 | 0.0 | 0 | Add anti-CoT phrasing (see below) |
Anti-CoT prompt add-on (any bit-width):
“Answer directly. Do not include your reasoning steps.”
(Optional) Add a stop string if your stack supports it (e.g., stop at `"\nHere are my reasoning steps:"`); a post-processing fallback is sketched below.
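If your stack lacks stop-string support, a trivial post-processing fallback works too. The marker below is the phrasing quoted above; adjust it if your outputs differ.

```python
# Fallback when stop strings aren't supported: truncate the output at the
# reasoning-preamble marker quoted above. Pure string handling, no deps.
STOP_MARKER = "\nHere are my reasoning steps:"

def strip_reasoning(text: str) -> str:
    """Return text up to the marker; unchanged if the marker is absent."""
    idx = text.find(STOP_MARKER)
    return text[:idx] if idx != -1 else text

print(strip_reasoning("Two cats on a pink couch.\nHere are my reasoning steps: ..."))
# -> "Two cats on a pink couch."
```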
🛠️ One-liners (swap model IDs)
```bash
# 2-bit
python -m mlx_vlm.generate --model <2bit-repo> --image img.jpg --prompt "Describe this image." \
  --max-tokens 96 --temperature 0.0 --device mps --seed 0

# 3-bit (recommended default)
python -m mlx_vlm.generate --model <3bit-repo> --image img.jpg --prompt "Describe this image in two sentences." \
  --max-tokens 128 --temperature 0.0 --device mps --seed 0

# 4-bit
python -m mlx_vlm.generate --model <4bit-repo> --image img.jpg --prompt "Summarize the document and read key totals." \
  --max-tokens 160 --temperature 0.1 --device mps --seed 0

# 5-bit
python -m mlx_vlm.generate --model <5bit-repo> --image img.jpg --prompt "Extract the fields (date, total, vendor) from this invoice." \
  --max-tokens 192 --temperature 0.1 --device mps --seed 0

# 6-bit
python -m mlx_vlm.generate --model <6bit-repo> --image img.jpg \
  --prompt "Answer directly. Do not include your reasoning steps.\n\nDescribe this image clearly." \
  --max-tokens 192 --temperature 0.0 --device mps --seed 0
```
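To compare variants on one image in a single run, the same CLI can be driven from Python; repo IDs stay placeholders, exactly as in the one-liners above.

```python
# Sweep the CLI above across variants on one image. The flags are exactly
# the ones shown in the one-liners; repo IDs are placeholders.
import subprocess

VARIANTS = {
    "<2bit-repo>": "96",
    "<3bit-repo>": "128",
}

for repo, max_tokens in VARIANTS.items():
    subprocess.run(
        ["python", "-m", "mlx_vlm.generate",
         "--model", repo, "--image", "img.jpg",
         "--prompt", "Describe this image.",
         "--max-tokens", max_tokens, "--temperature", "0.0",
         "--device", "mps", "--seed", "0"],
        check=True,
    )
```

Running the variants back to back makes it easy to eyeball speed and verbosity differences side by side.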