# MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10
NVFP4 quantization of cerebras/MiniMax-M2.5-REAP-139B-A10B for NVIDIA DGX Spark (GB10).
The base model is a Cerebras REAP (Router-weighted Expert Activation Pruning) variant of MiniMaxAI/MiniMax-M2.5. REAP uniformly prunes experts from 256 → 154 (40% pruning), reducing total parameters from 230B to 139B while maintaining near-identical performance. This is the more aggressively pruned sibling of the 172B (25%) variant.
## Model Details

| Detail | Value |
|---|---|
| Base Model | cerebras/MiniMax-M2.5-REAP-139B-A10B |
| Original Model | MiniMaxAI/MiniMax-M2.5 (230B) |
| Architecture | MiniMaxM2ForCausalLM (MoE, 154 experts, 8 active per token) |
| Total Parameters | 139B |
| Active Parameters | 10B per token |
| Hidden Layers | 62 |
| Quantization | NVFP4 (4-bit floating point), all layers including self_attn |
| Format | compressed-tensors (safetensors), 17 shards |
| Size on Disk | 75 GB |
| Context Length | 196,608 tokens (~192K) |
| License | Modified MIT (inherited from Cerebras REAP) |
## Why 139B over 172B?

| | 172B REAP | 139B REAP |
|---|---|---|
| Expert pruning | 25% (256 → 192) | 40% (256 → 154) |
| NVFP4 size | 93 GB | 75 GB |
| Single Spark fit | Tight (max ~65K ctx) | Comfortable (~90K+ ctx headroom) |
| Cerebras eval loss | Baseline | ~0.5% degradation |
The 139B variant trades a minimal quality loss for significantly more memory headroom on a single DGX Spark. With 75 GB of model weights vs 93 GB, you gain roughly 18 GB for KV cache, which translates to substantially more context or more concurrent sessions.
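The headroom claim can be sanity-checked with back-of-envelope arithmetic. Only the 62-layer count and the FP8 KV cache (1 byte/element) come from this card; the KV-head count and head dimension below are illustrative assumptions, not confirmed config values:

```python
# Rough estimate of extra context afforded by the ~18 GB freed vs the 172B quant.
# ASSUMPTIONS: kv_heads=8 and head_dim=128 are illustrative guesses;
# only layers=62 and the FP8 KV cache (1 byte/elem) come from this card.
layers = 62
kv_heads = 8          # assumed, not from the model config
head_dim = 128        # assumed, not from the model config
fp8_bytes = 1         # --kv-cache-dtype fp8

# K and V each store kv_heads * head_dim values per layer, per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * fp8_bytes
extra_tokens = (18 * 1024**3) // kv_bytes_per_token

print(kv_bytes_per_token)  # bytes of KV cache per token
print(extra_tokens)        # extra tokens of context from ~18 GB
```

Under these assumptions each token costs on the order of 120 KiB of KV cache, so 18 GB buys roughly 150K additional tokens of context (or the equivalent in concurrent sessions), consistent with the "comfortable headroom" claim above.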
## Performance (Single NVIDIA DGX Spark, GB10, 128 GB)

TODO: Benchmarks pending; the model was just quantized. Results will be added once llama-benchy runs complete.

Expected: similar to or slightly faster than the 172B NVFP4 (27–29 tok/s) due to the smaller model footprint.
## Quantization Details

- Method: Post-training quantization via LLM Compressor (`llmcompressor` 0.10.0)
- Scheme: NVFP4 (compressed-tensors format)
- Calibration Dataset: `HuggingFaceH4/ultrachat_200k` (`train_sft` split)
- Calibration Samples: 64
- Max Sequence Length: 2048 tokens
- Ignore List: `lm_head`, `model.embed_tokens`, `re:.*block_sparse_moe\.gate$`
- Environment: `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`
- Hardware Used: NVIDIA DGX Spark (CPU offloading + 300 GB swap)
- Total Quantization Time: 4.7 hours (281 minutes)
  - Model load: 50 seconds (27 BF16 shards into CPU RAM; this is the llmcompressor load, not vLLM inference)
  - Calibration forward passes + weight calibration (28,892 weights): ~2+ hours (swap-dominated)
  - Compression: 28,892 iterations in ~60 minutes (highly variable, 1–16 it/s, due to swap I/O)
  - Save: 17 shards to disk
- Bottleneck: swap I/O throughout (260 GB model on 128 GB RAM + 300 GB swap)
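The settings above correspond roughly to an LLM Compressor `oneshot` call along these lines. This is a sketch reconstructed from the listed parameters, not the actual `quantize-minimax-139b.py`; the dataset alias and argument names follow llmcompressor's published quantization examples and may need adjustment for your version:

```python
# Sketch of the quantization recipe implied by the settings above.
# Reconstructed from this card's parameters; NOT the actual quantize script.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",                     # quantize every Linear, attention included
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "model.embed_tokens",
        "re:.*block_sparse_moe\\.gate$",  # keep MoE router gates in high precision
    ],
)

oneshot(
    model="/workspace/input_model",       # the dequantized BF16 checkpoint
    dataset="ultrachat_200k",
    splits={"calibration": "train_sft[:64]"},
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=64,
    output_dir="/workspace/output_model",
)
```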
## Quantization Pipeline

The source model on HuggingFace is labeled BF16 but actually contains `float8_e4m3fn` weights with `weight_scale_inv` blocks of shape `[128, 128]`. A dequantization step was required before NVFP4 quantization:

- Download: `cerebras/MiniMax-M2.5-REAP-139B-A10B` (131 GB, 27 shards, FP8)
- Dequant FP8 → BF16: block-wise dequantization (multiply by `scale_inv`), output 260 GB / 27 shards
- Quantize BF16 → NVFP4: LLM Compressor oneshot with GB10-optimized ignore list
- Output: 75 GB / 17 shards (compressed-tensors format)
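The block-wise dequant step can be sketched as follows. NumPy has no `float8_e4m3fn` dtype, so a float32 array stands in for the FP8 weights here (the real pipeline would cast via torch); the multiply-by-`scale_inv` semantics and the 128×128 block shape follow the description above:

```python
import numpy as np

BLOCK = 128  # each weight_scale_inv entry covers one [128, 128] weight block

def dequant_blockwise(w_fp8: np.ndarray, scale_inv: np.ndarray) -> np.ndarray:
    """Expand one scale per 128x128 block to per-element and multiply."""
    rows, cols = w_fp8.shape
    # Tile each block scale across its 128x128 region, then trim to the weight shape.
    s = np.repeat(np.repeat(scale_inv, BLOCK, axis=0), BLOCK, axis=1)
    return w_fp8.astype(np.float32) * s[:rows, :cols]

# Demo: a 256x256 "FP8" weight (float32 stand-in) with a 2x2 grid of block scales.
w = np.ones((256, 256), dtype=np.float32)
scale_inv = np.array([[2.0, 3.0], [4.0, 5.0]], dtype=np.float32)
w_bf16 = dequant_blockwise(w, scale_inv)
print(w_bf16[0, 0], w_bf16[200, 200])  # 2.0 5.0
```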
## Key Advantage Over Conservative Quants

This quantization covers all Linear layers, including self_attn (q/k/v projections). Conservative approaches (e.g., lukealonso's NVFP4) leave attention in BF16, wasting ~47% of per-token bandwidth. This is safe here because NVFP4 calibration handles the attention weight distributions well on this architecture.
## Container Setup for Quantization

```bash
# Image: avarok/dgx-vllm-nvfp4-kernel:v23 (has llmcompressor + deps)
# Override entrypoint since default launches vLLM server
docker run -d --name minimax-139b-quant \
  --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-BF16-real:/workspace/input_model \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:/workspace/output_model \
  -v /opt/huggingface/models/quantize-minimax-139b.py:/workspace/quantize.py \
  -e LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1 \
  --entrypoint bash \
  avarok/dgx-vllm-nvfp4-kernel:v23 \
  -c "pip install --upgrade transformers && python /workspace/quantize.py"
```

Important: the `--entrypoint bash` override is required because the default entrypoint launches vLLM. The `pip install --upgrade transformers` step is needed because the image ships an older transformers that doesn't support the MiniMax M2 architecture.
## Swap Configuration

The 260 GB BF16 model exceeds the 128 GB of physical RAM. A 300 GB swap file was created:

```bash
sudo fallocate -l 300G /opt/huggingface/swapfile
sudo chmod 600 /opt/huggingface/swapfile
sudo mkswap /opt/huggingface/swapfile
sudo swapon /opt/huggingface/swapfile
```

This causes significant I/O stalls during compression (speed drops from 16 it/s to 1 it/s when paging), but the process completes successfully.
## Running on a Single DGX Spark

Docker image: `avarok/dgx-vllm-nvfp4-kernel:v23` (vLLM 0.16.0-rc2, CUDA 13.0, SM 12.1)

Download the model:

```bash
huggingface-cli download saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10 \
  --local-dir /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4
```

Launch:

```bash
docker run -d --name minimax-139b --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4:/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -p 8000:8000 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e MODEL=/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -e PORT=8000 \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.93 \
  -e "VLLM_EXTRA_ARGS=--trust-remote-code --kv-cache-dtype fp8 --attention-backend flashinfer --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think" \
  avarok/dgx-vllm-nvfp4-kernel:v23
```

Note: with 75 GB of model weights (vs 93 GB for the 172B), you can likely push `MAX_MODEL_LEN` higher; 131072 should be achievable. Benchmark results will confirm exact limits.
Test it:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.5-REAP-139B-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "min_p": 0.01,
    "max_tokens": 512
  }'
```
## Environment Variables

| Variable | Why |
|---|---|
| `VLLM_NVFP4_GEMM_BACKEND=marlin` | Use Marlin kernels for FP4 GEMM (FlashInfer JIT fails on Spark SM121a) |
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Required for Marlin backend activation |
| `VLLM_USE_FLASHINFER_MOE_FP4=0` | Disable FlashInfer for MoE FP4 (JIT ninja build crashes) |
| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | Atomic adds for Marlin (stability on GB10) |
| `GPU_MEMORY_UTIL=0.93` | 0.95 OOMs on Spark; 0.93 is the safe max |
| `--kv-cache-dtype fp8` | FP8 KV cache saves memory, enables larger context |
| `--attention-backend flashinfer` | FlashInfer for attention (not MoE) works fine |
## Recommended Sampling Parameters

```json
{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}
```
## Comparison: Our Quants vs Others

| Model | Quant Coverage | Size | Attention | tok/s (single Spark) |
|---|---|---|---|---|
| Ours — 139B REAP NVFP4 | All Linear incl. attn | 75 GB | Quantized | TBD |
| Ours — 172B REAP NVFP4 | All Linear incl. attn | 93 GB | Quantized | 28 tok/s |
| lukealonso — 139B NVFP4 | Expert MLPs only | 79 GB | BF16 (bottleneck) | ~16 tok/s |
## Related Models
- saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 — Our 172B REAP NVFP4 (93GB, 28 tok/s)
- saricles/Qwen3-Next-80B-A3B-Coder-NVFP4-GB10 — Qwen3 Coder NVFP4 (62 tok/s)
- cerebras/MiniMax-M2.5-REAP-139B-A10B — Source FP8 model
- cerebras/MiniMax-M2.5-REAP-172B-A10B — 172B FP8 variant
## Acknowledgments
- Base model by MiniMax
- REAP sparse-inference pruning by Cerebras (paper)
- Quantization tooling by vLLM / LLM Compressor
- Quantized by saricles on NVIDIA DGX Spark