How to use mlx-community/Kimi-K2.6-MoE-Smart-Quant with MLX:

```python
# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("mlx-community/Kimi-K2.6-MoE-Smart-Quant")
config = load_config("mlx-community/Kimi-K2.6-MoE-Smart-Quant")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)
```
Kimi-K2.6-MoE-Smart-Quant (MLX)
MoE-aware mixed-precision quantization of moonshotai/Kimi-K2.6 for Apple Silicon.
Quantization Strategy
Unlike uniform quantization, this conversion allocates bits per component, tuned for the MoE + MLA architecture:
| Component | Bits | Rationale |
|---|---|---|
| Routed experts (384 SwitchLinear) | 4-bit | Only 8/384 fire per token; very tolerant of low-bit quantization |
| Shared expert (always active) | 6-bit | Every-token path, needs precision |
| MLA value projections (v_a/v_b) | 8-bit | Most sensitive attention weights |
| MLA other projections (q_a/q_b/kv_a/kv_b/o) | 6-bit | Latent compression layer |
| lm_head + embed_tokens | 8-bit | Output quality |
| First/last 3 decoder layers | 6-bit | Boundary layer sensitivity |
| Gate/router | unquantized | Tiny params, routing-critical |
| Vision encoder | unquantized | Preserved via mlx-vlm |
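The allocation in the table above can be read as a rule that maps each weight path to a bit width. The following is a minimal sketch of such a rule, not the conversion script used for this model. The module names (`v_a_proj`, `kv_a_proj`, `shared_experts`, `gate`, `vision_tower`, and so on) and the function `bits_for` are hypothetical, and the precedence between rows (for example, boundary layers raising routed experts to 6-bit) is an assumption, since the table does not specify it.

```python
import re
from typing import Optional

# Hypothetical leaf-module names; real paths depend on the model implementation.
VALUE_PROJ = {"v_a_proj", "v_b_proj"}
OTHER_MLA_PROJ = {"q_a_proj", "q_b_proj", "kv_a_proj", "kv_b_proj", "o_proj"}
BOUNDARY = 3  # first/last N decoder layers kept at 6-bit or higher


def bits_for(path: str, num_layers: int) -> Optional[int]:
    """Return the bit width for a weight path, or None to leave it unquantized."""
    parts = path.split(".")

    # Router and vision encoder stay unquantized.
    if "vision_tower" in parts or "gate" in parts:
        return None
    # Embeddings and output head.
    if "lm_head" in parts or "embed_tokens" in parts:
        return 8
    # MLA value projections.
    if VALUE_PROJ & set(parts):
        return 8

    if "shared_experts" in parts or OTHER_MLA_PROJ & set(parts):
        bits = 6
    else:
        bits = 4  # routed experts, plus anything the table does not list (assumption)

    # Assumed precedence: boundary layers raise lower widths to at least 6-bit.
    match = re.search(r"layers\.(\d+)\.", path)
    if match:
        idx = int(match.group(1))
        if idx < BOUNDARY or idx >= num_layers - BOUNDARY:
            bits = max(bits, 6)
    return bits
```

Matching on whole path components rather than substrings matters here: a substring test for `v_a` would also match `kv_a` and place the latent projections in the wrong tier.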
Effective average: ~4.5 bpw, giving near-6-bit quality at near-4-bit size.
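An effective bits-per-weight figure is a parameter-weighted average of the per-component widths. The sketch below shows that arithmetic only; the helper `effective_bpw` and the parameter shares are hypothetical placeholders, not measurements of this model.

```python
from typing import Iterable, Tuple


def effective_bpw(components: Iterable[Tuple[float, float]]) -> float:
    """Parameter-weighted average bit width over (param_count, bits) pairs."""
    components = list(components)
    total_params = sum(n for n, _ in components)
    total_bits = sum(n * b for n, b in components)
    return total_bits / total_params


# Hypothetical parameter shares (fractions of the total), NOT taken from the model.
shares = [
    (0.96, 4),  # routed experts
    (0.03, 6),  # shared expert, other MLA projections, boundary layers
    (0.01, 8),  # value projections, lm_head, embed_tokens
]
print(round(effective_bpw(shares), 2))
```

Because the routed experts hold most of the parameters, the average stays close to their width. Group-wise quantization also stores per-group scales and biases, which adds to the nominal widths; this sketch ignores that overhead.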
Model Details
- Base model: Kimi-K2.6 (1T params, 32B active, 384 experts)
- Architecture: MoE + MLA (kimi_k25)
- Context: 256K tokens
- Modality: Vision + Language (VLM)
- Converted with: mlx-vlm 0.4.2
Usage
Hardware Requirements
- Single node: M3/M4 Ultra 192GB+ (fits in ~150GB)
- Distributed: 2x M3 Ultra via JACCL/RDMA for headroom
Weights uploading; conversion in progress.