MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10

NVFP4 quantization of cerebras/MiniMax-M2.5-REAP-139B-A10B for NVIDIA DGX Spark (GB10).

The base model is a Cerebras REAP (Router-weighted Expert Activation Pruning) variant of MiniMaxAI/MiniMax-M2.5. REAP uniformly prunes experts from 256 → 154 (40% pruning), reducing total parameters from 230B to 139B while maintaining near-identical performance. This is the more aggressively pruned sibling of the 172B (25%) variant.

Model Details

| Field | Value |
|---|---|
| Base Model | cerebras/MiniMax-M2.5-REAP-139B-A10B |
| Original Model | MiniMaxAI/MiniMax-M2.5 (230B) |
| Architecture | MiniMaxM2ForCausalLM (MoE, 154 experts, 8 active per token) |
| Total Parameters | 139B |
| Active Parameters | 10B per token |
| Hidden Layers | 62 |
| Quantization | NVFP4 (4-bit floating point), all layers including self_attn |
| Format | compressed-tensors (safetensors), 17 shards |
| Size on Disk | 75 GB |
| Context Length | 196,608 tokens (~192K) |
| License | Modified MIT (inherited from Cerebras REAP) |
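As a rough sanity check, the 75 GB on-disk size is consistent with NVFP4's packing: ~4 bits per weight plus one FP8 scale per 16-element block. The constants below describe the NVFP4 format in general; exact overhead varies with the ignore list, which keeps lm_head and the embeddings in higher precision.

```python
# Back-of-envelope estimate of NVFP4 on-disk size for 139B parameters.
TOTAL_PARAMS = 139e9
BITS_PER_WEIGHT = 4          # NVFP4 stores E2M1 4-bit values
SCALE_BYTES_PER_BLOCK = 1    # one FP8 (E4M3) scale...
BLOCK = 16                   # ...per 16-element micro-block

bytes_per_param = BITS_PER_WEIGHT / 8 + SCALE_BYTES_PER_BLOCK / BLOCK
est_gib = TOTAL_PARAMS * bytes_per_param / 1024**3
print(f"~{est_gib:.0f} GiB")  # same ballpark as the 75 GB shipped
```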

Why 139B over 172B?

| | 172B REAP | 139B REAP |
|---|---|---|
| Expert pruning | 25% (256 → 192) | 40% (256 → 154) |
| NVFP4 size | 93 GB | 75 GB |
| Single Spark fit | Tight (max ~65K ctx) | Comfortable (~90K+ ctx headroom) |
| Cerebras eval loss | Baseline | ~0.5% degradation |

The 139B variant trades minimal quality for significantly more memory headroom on a single DGX Spark. At 75 GB of model weights versus 93 GB, you free up ~18 GB for KV cache, which translates to substantially more context or more concurrent sessions.
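How much extra context does ~18 GB of freed memory buy? A back-of-envelope sketch: per-token KV-cache size is two tensors (K and V) per layer. The kv_heads and head_dim values below are illustrative placeholders, not the real MiniMax M2.5 config; substitute values from the model's config.json.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 1) -> int:
    """Per-token KV-cache footprint across all layers.

    bytes_per_elem=1 corresponds to the fp8 KV cache used in the launch
    command below; use 2 for bf16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# 62 layers from the model card; kv_heads/head_dim are PLACEHOLDERS.
per_token = kv_bytes_per_token(layers=62, kv_heads=8, head_dim=128)
extra_tokens = (18 * 1024**3) // per_token
print(per_token, extra_tokens)
```

Under these assumed head dimensions, 18 GB buys on the order of 100K+ extra tokens of fp8 KV cache.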

Performance (Single NVIDIA DGX Spark — GB10, 128 GB)

TODO: Benchmark pending — model just quantized. Will update with llama-benchy results.

Expected: similar or slightly faster than 172B NVFP4 (27–29 tok/s) due to smaller model footprint.

Quantization Details

  • Method: Post-training quantization via LLM Compressor (llmcompressor 0.10.0)
  • Scheme: NVFP4 (compressed-tensors format)
  • Calibration Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
  • Calibration Samples: 64
  • Max Sequence Length: 2048 tokens
  • Ignore List: lm_head, model.embed_tokens, re:.*block_sparse_moe\.gate$
  • Environment: LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1
  • Hardware Used: NVIDIA DGX Spark (CPU offloading + 300GB swap)
  • Total Quantization Time: 4.7 hours (281 minutes)
    • Quant pipeline model load: 50 seconds (27 BF16 shards into CPU RAM — this is llmcompressor load, NOT vLLM inference)
  • Calibration forward passes + weight calibration (28,892 weights): over 2 hours (swap-dominated)
    • Model compression: 28,892 iterations in ~60 minutes (highly variable 1–16 it/s due to swap I/O)
    • Model save: 17 shards to disk
    • Bottleneck: swap I/O throughout (260GB model on 128GB RAM + 300GB swap)
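The settings above can be expressed as an LLM Compressor recipe. This is a hedged sketch reconstructed from the listed parameters, not the exact quantize.py that was run:

```yaml
# Sketch of an LLM Compressor recipe matching the settings above.
# Reconstructed from the listed parameters; the actual script may differ.
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ['Linear']
      scheme: 'NVFP4'
      ignore:
        - 'lm_head'
        - 'model.embed_tokens'
        - 're:.*block_sparse_moe\.gate$'
```

This recipe would be passed to llmcompressor's oneshot entry point along with the calibration dataset (HuggingFaceH4/ultrachat_200k, 64 samples, max sequence length 2048).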

Quantization Pipeline

The source model on HuggingFace is labeled BF16 but actually ships float8_e4m3fn weights with per-block weight_scale_inv scales (block size 128 × 128). A dequantization step was therefore required before NVFP4 quantization:

  1. Download: cerebras/MiniMax-M2.5-REAP-139B-A10B (131GB, 27 shards — FP8)
  2. Dequant FP8 → BF16: Block-wise dequantization (multiply by scale_inv), output 260GB / 27 shards
  3. Quantize BF16 → NVFP4: LLM Compressor oneshot with GB10-optimized ignore list
  4. Output: 75GB / 17 shards (compressed-tensors format)
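Step 2 is simple block-wise dequantization: each 128 × 128 tile of the weight matrix is multiplied by its scalar from scale_inv. A minimal NumPy sketch of the math (the real pipeline operates on torch tensors shard by shard; NumPy has neither float8_e4m3fn nor bf16, so float32 stands in here):

```python
import numpy as np

def dequant_blockwise(w_fp8: np.ndarray, scale_inv: np.ndarray,
                      block: int = 128) -> np.ndarray:
    """Dequantize a 2-D FP8 weight matrix.

    scale_inv has shape [ceil(M/block), ceil(N/block)]; each scalar is
    broadcast over its [block, block] tile, then trimmed to [M, N].
    """
    m, n = w_fp8.shape
    scales = np.repeat(np.repeat(scale_inv, block, axis=0),
                       block, axis=1)[:m, :n]
    return w_fp8.astype(np.float32) * scales
```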

Key Advantage Over Conservative Quants

This quant covers all Linear layers, including the self_attn q/k/v projections. Conservative approaches (e.g., lukealonso's NVFP4) leave attention in BF16, wasting ~47% of per-token bandwidth. Quantizing attention is safe here because NVFP4 calibration handles the attention weight distributions well on this architecture.

Container Setup for Quantization

# Image: avarok/dgx-vllm-nvfp4-kernel:v23 (has llmcompressor + deps)
# Override entrypoint since default launches vLLM server

docker run -d --name minimax-139b-quant \
  --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-BF16-real:/workspace/input_model \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:/workspace/output_model \
  -v /opt/huggingface/models/quantize-minimax-139b.py:/workspace/quantize.py \
  -e LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1 \
  --entrypoint bash \
  avarok/dgx-vllm-nvfp4-kernel:v23 \
  -c "pip install --upgrade transformers && python /workspace/quantize.py"

Important: The --entrypoint bash override is required because the default entrypoint launches vLLM. The pip install --upgrade transformers is needed because the image ships an older transformers that doesn't support MiniMax M2 architecture.

Swap Configuration

The 260GB BF16 model exceeds 128GB physical RAM. A 300GB swap file was created:

sudo fallocate -l 300G /opt/huggingface/swapfile
sudo chmod 600 /opt/huggingface/swapfile
sudo mkswap /opt/huggingface/swapfile
sudo swapon /opt/huggingface/swapfile

This causes significant I/O stalls during compression (speed drops from 16 it/s to 1 it/s when paging), but the process completes successfully.

Running on a Single DGX Spark

Docker image: avarok/dgx-vllm-nvfp4-kernel:v23 (vLLM 0.16.0-rc2, CUDA 13.0, SM 12.1)

Download the model:

huggingface-cli download saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10 \
  --local-dir /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4

Launch:

docker run -d --name minimax-139b --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4:/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -p 8000:8000 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e MODEL=/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -e PORT=8000 \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.93 \
  -e "VLLM_EXTRA_ARGS=--trust-remote-code --kv-cache-dtype fp8 --attention-backend flashinfer --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think" \
  avarok/dgx-vllm-nvfp4-kernel:v23

Note: With 75 GB of model weights (vs 93 GB for the 172B), there is far more KV-cache headroom than the 172B's ~65K ceiling; MAX_MODEL_LEN=131072 should be achievable. Benchmark results will confirm the exact limit.

Test it:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.5-REAP-139B-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "min_p": 0.01,
    "max_tokens": 512
  }'
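The same request from Python, using only the standard library. The endpoint, model name, and sampling parameters match the curl example above; no request is sent until you call send_chat with the server running.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    # Sampling values follow the MiniMax-recommended defaults.
    return {
        "model": "MiniMax-M2.5-REAP-139B-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,
        "min_p": 0.01,
        "max_tokens": 512,
    }

def send_chat(prompt: str) -> str:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Once the container is up: print(send_chat("Hello!"))
```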

Environment Variables

| Variable | Why |
|---|---|
| VLLM_NVFP4_GEMM_BACKEND=marlin | Use Marlin kernels for FP4 GEMM (FlashInfer JIT fails on Spark SM121a) |
| VLLM_TEST_FORCE_FP8_MARLIN=1 | Required for Marlin backend activation |
| VLLM_USE_FLASHINFER_MOE_FP4=0 | Disable FlashInfer for MoE FP4 (JIT ninja build crashes) |
| VLLM_MARLIN_USE_ATOMIC_ADD=1 | Atomic adds for Marlin (stability on GB10) |
| GPU_MEMORY_UTIL=0.93 | 0.95 OOMs on Spark; 0.93 is the safe max |
| --kv-cache-dtype fp8 | FP8 KV cache saves memory, enables larger context |
| --attention-backend flashinfer | FlashInfer for attention (not MoE); works fine |

Recommended Sampling Parameters

Per MiniMax documentation:

{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}

Comparison: Our Quants vs Others

| Model | Quant | Size | Attention | tok/s (single Spark) |
|---|---|---|---|---|
| Ours — 139B REAP | NVFP4, all Linear incl. attn | 75 GB | Quantized | TBD |
| Ours — 172B REAP | NVFP4, all Linear incl. attn | 93 GB | Quantized | 28 tok/s |
| lukealonso — 139B | NVFP4, expert MLPs only | 79 GB | BF16 (bottleneck) | ~16 tok/s |
