MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10

NVFP4 quantization of cerebras/MiniMax-M2.5-REAP-139B-A10B for NVIDIA DGX Spark (GB10).

The base model is a Cerebras REAP (Router-weighted Expert Activation Pruning) variant of MiniMaxAI/MiniMax-M2.5. REAP uniformly prunes experts from 256 → 154 (40% pruning), reducing total parameters from 230B to 139B while maintaining near-identical performance. This is the more aggressively pruned sibling of the 172B (25%) variant.

Model Details

| Field | Value |
|---|---|
| Base Model | cerebras/MiniMax-M2.5-REAP-139B-A10B |
| Original Model | MiniMaxAI/MiniMax-M2.5 (230B) |
| Architecture | MiniMaxM2ForCausalLM (MoE, 154 experts, 8 active per token) |
| Total Parameters | 139B |
| Active Parameters | 10B per token |
| Hidden Layers | 62 |
| Quantization | NVFP4 (4-bit floating point), all layers including self_attn |
| Format | compressed-tensors (safetensors), 17 shards |
| Size on Disk | 75 GB |
| Context Length | 196,608 tokens (~192K) |
| License | Modified MIT (inherited from Cerebras REAP) |
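As a rough sanity check, the 75 GB on-disk size is consistent with NVFP4's packing: ~4 bits per weight plus one FP8 scale per 16-element block. The constants below describe the NVFP4 format in general; exact overhead varies with the ignore list, which keeps lm_head and the embeddings in higher precision.

```python
# Back-of-envelope estimate of NVFP4 on-disk size for 139B parameters.
TOTAL_PARAMS = 139e9
BITS_PER_WEIGHT = 4          # NVFP4 stores E2M1 4-bit values
SCALE_BYTES_PER_BLOCK = 1    # one FP8 (E4M3) scale...
BLOCK = 16                   # ...per 16-element micro-block

bytes_per_param = BITS_PER_WEIGHT / 8 + SCALE_BYTES_PER_BLOCK / BLOCK
est_gib = TOTAL_PARAMS * bytes_per_param / 1024**3
print(f"~{est_gib:.0f} GiB")  # same ballpark as the 75 GB shipped
```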

Why 139B over 172B?

| | 172B REAP | 139B REAP |
|---|---|---|
| Expert pruning | 25% (256 → 192) | 40% (256 → 154) |
| NVFP4 size | 93 GB | 75 GB |
| Single Spark fit | Tight (max ~65K ctx) | Comfortable (~90K+ ctx headroom) |
| Cerebras eval loss | Baseline | ~0.5% degradation |

The 139B variant trades minimal quality for significantly more memory headroom on a single DGX Spark. At 75 GB of model weights versus 93 GB, you free up ~18 GB for KV cache, which translates to substantially more context or more concurrent sessions.
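How much extra context does ~18 GB of freed memory buy? A back-of-envelope sketch: per-token KV-cache size is two tensors (K and V) per layer. The kv_heads and head_dim values below are illustrative placeholders, not the real MiniMax M2.5 config; substitute values from the model's config.json.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 1) -> int:
    """Per-token KV-cache footprint across all layers.

    bytes_per_elem=1 corresponds to the fp8 KV cache used in the launch
    command below; use 2 for bf16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# 62 layers from the model card; kv_heads/head_dim are PLACEHOLDERS.
per_token = kv_bytes_per_token(layers=62, kv_heads=8, head_dim=128)
extra_tokens = (18 * 1024**3) // per_token
print(per_token, extra_tokens)
```

Under these assumed head dimensions, 18 GB buys on the order of 100K+ extra tokens of fp8 KV cache.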

Performance (Single NVIDIA DGX Spark — GB10, 128 GB)

TODO: Benchmark pending — model just quantized. Will update with llama-benchy results.

Expected: similar or slightly faster than 172B NVFP4 (27–29 tok/s) due to smaller model footprint.

Quantization Details

  • Method: Post-training quantization via LLM Compressor (llmcompressor 0.10.0)
  • Scheme: NVFP4 (compressed-tensors format)
  • Calibration Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
  • Calibration Samples: 64
  • Max Sequence Length: 2048 tokens
  • Ignore List: lm_head, model.embed_tokens, re:.*block_sparse_moe\.gate$
  • Environment: LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1
  • Hardware Used: NVIDIA DGX Spark (CPU offloading + 300GB swap)
  • Total Quantization Time: 4.7 hours (281 minutes)
    • Quant pipeline model load: 50 seconds (27 BF16 shards into CPU RAM — this is llmcompressor load, NOT vLLM inference)
  • Calibration forward passes + weight calibration (28,892 weights): over 2 hours (swap-dominated)
    • Model compression: 28,892 iterations in ~60 minutes (highly variable 1–16 it/s due to swap I/O)
    • Model save: 17 shards to disk
    • Bottleneck: swap I/O throughout (260GB model on 128GB RAM + 300GB swap)
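The settings above can be expressed as an LLM Compressor recipe. This is a hedged sketch reconstructed from the listed parameters, not the exact quantize.py that was run:

```yaml
# Sketch of an LLM Compressor recipe matching the settings above.
# Reconstructed from the listed parameters; the actual script may differ.
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ['Linear']
      scheme: 'NVFP4'
      ignore:
        - 'lm_head'
        - 'model.embed_tokens'
        - 're:.*block_sparse_moe\.gate$'
```

This recipe would be passed to llmcompressor's oneshot entry point along with the calibration dataset (HuggingFaceH4/ultrachat_200k, 64 samples, max sequence length 2048).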

Quantization Pipeline

The source model on HuggingFace is labeled BF16 but actually ships float8_e4m3fn weights with per-block weight_scale_inv scales (block size 128 × 128). A dequantization step was therefore required before NVFP4 quantization:

  1. Download: cerebras/MiniMax-M2.5-REAP-139B-A10B (131GB, 27 shards — FP8)
  2. Dequant FP8 → BF16: Block-wise dequantization (multiply by scale_inv), output 260GB / 27 shards
  3. Quantize BF16 → NVFP4: LLM Compressor oneshot with GB10-optimized ignore list
  4. Output: 75GB / 17 shards (compressed-tensors format)
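Step 2 is simple block-wise dequantization: each 128 × 128 tile of the weight matrix is multiplied by its scalar from scale_inv. A minimal NumPy sketch of the math (the real pipeline operates on torch tensors shard by shard; NumPy has neither float8_e4m3fn nor bf16, so float32 stands in here):

```python
import numpy as np

def dequant_blockwise(w_fp8: np.ndarray, scale_inv: np.ndarray,
                      block: int = 128) -> np.ndarray:
    """Dequantize a 2-D FP8 weight matrix.

    scale_inv has shape [ceil(M/block), ceil(N/block)]; each scalar is
    broadcast over its [block, block] tile, then trimmed to [M, N].
    """
    m, n = w_fp8.shape
    scales = np.repeat(np.repeat(scale_inv, block, axis=0),
                       block, axis=1)[:m, :n]
    return w_fp8.astype(np.float32) * scales
```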

Key Advantage Over Conservative Quants

This quant covers all Linear layers, including the self_attn q/k/v projections. Conservative approaches (e.g., lukealonso's NVFP4) leave attention in BF16, wasting ~47% of per-token bandwidth. Quantizing attention is safe here because NVFP4 calibration handles the attention weight distributions well on this architecture.

Container Setup for Quantization

# Image: avarok/dgx-vllm-nvfp4-kernel:v23 (has llmcompressor + deps)
# Override entrypoint since default launches vLLM server

docker run -d --name minimax-139b-quant \
  --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-BF16-real:/workspace/input_model \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:/workspace/output_model \
  -v /opt/huggingface/models/quantize-minimax-139b.py:/workspace/quantize.py \
  -e LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1 \
  --entrypoint bash \
  avarok/dgx-vllm-nvfp4-kernel:v23 \
  -c "pip install --upgrade transformers && python /workspace/quantize.py"

Important: The --entrypoint bash override is required because the default entrypoint launches vLLM. The pip install --upgrade transformers is needed because the image ships an older transformers that doesn't support MiniMax M2 architecture.

Swap Configuration

The 260GB BF16 model exceeds 128GB physical RAM. A 300GB swap file was created:

sudo fallocate -l 300G /opt/huggingface/swapfile
sudo chmod 600 /opt/huggingface/swapfile
sudo mkswap /opt/huggingface/swapfile
sudo swapon /opt/huggingface/swapfile

This causes significant I/O stalls during compression (speed drops from 16 it/s to 1 it/s when paging), but the process completes successfully.

Running on a Single DGX Spark

Docker image: avarok/dgx-vllm-nvfp4-kernel:v23 (vLLM 0.16.0-rc2, CUDA 13.0, SM 12.1)

Download the model:

huggingface-cli download saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10 \
  --local-dir /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4

Launch:

docker run -d --name minimax-139b --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4:/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -p 8000:8000 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e MODEL=/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -e PORT=8000 \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.93 \
  -e "VLLM_EXTRA_ARGS=--trust-remote-code --kv-cache-dtype fp8 --attention-backend flashinfer --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think" \
  avarok/dgx-vllm-nvfp4-kernel:v23

Note: With 75 GB of model weights (vs 93 GB for the 172B), there is far more KV-cache headroom than the 172B's ~65K ceiling; MAX_MODEL_LEN=131072 should be achievable. Benchmark results will confirm the exact limit.

Test it:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.5-REAP-139B-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "min_p": 0.01,
    "max_tokens": 512
  }'
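The same request from Python, using only the standard library. The endpoint, model name, and sampling parameters match the curl example above; no request is sent until you call send_chat with the server running.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    # Sampling values follow the MiniMax-recommended defaults.
    return {
        "model": "MiniMax-M2.5-REAP-139B-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,
        "min_p": 0.01,
        "max_tokens": 512,
    }

def send_chat(prompt: str) -> str:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Once the container is up: print(send_chat("Hello!"))
```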

Environment Variables

| Variable | Why |
|---|---|
| VLLM_NVFP4_GEMM_BACKEND=marlin | Use Marlin kernels for FP4 GEMM (FlashInfer JIT fails on Spark SM121a) |
| VLLM_TEST_FORCE_FP8_MARLIN=1 | Required for Marlin backend activation |
| VLLM_USE_FLASHINFER_MOE_FP4=0 | Disable FlashInfer for MoE FP4 (JIT ninja build crashes) |
| VLLM_MARLIN_USE_ATOMIC_ADD=1 | Atomic adds for Marlin (stability on GB10) |
| GPU_MEMORY_UTIL=0.93 | 0.95 OOMs on Spark; 0.93 is the safe max |
| --kv-cache-dtype fp8 | FP8 KV cache saves memory, enables larger context |
| --attention-backend flashinfer | FlashInfer for attention (not MoE); works fine |

Recommended Sampling Parameters

Per MiniMax documentation:

{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}

Comparison: Our Quants vs Others

| Model | Quant | Size | Attention | tok/s (single Spark) |
|---|---|---|---|---|
| Ours — 139B REAP | NVFP4, all Linear incl. attn | 75 GB | Quantized | TBD |
| Ours — 172B REAP | NVFP4, all Linear incl. attn | 93 GB | Quantized | 28 tok/s |
| lukealonso — 139B | NVFP4, expert MLPs only | 79 GB | BF16 (bottleneck) | ~16 tok/s |
