Kimi-K2.6-MoE-Smart-Quant (MLX)

MoE-aware mixed-precision quantization of moonshotai/Kimi-K2.6 for Apple Silicon.

Quantization Strategy

Unlike uniform quantization, this model uses a per-component bit allocation tuned for the MoE + MLA architecture:

Component	Bits	Rationale
Routed experts (384 SwitchLinear)	4-bit	Only 8 of 384 fire per token, so they tolerate low-bit quantization well
Shared expert (always active)	6-bit	Every-token path, needs precision
MLA value projections (v_a/v_b)	8-bit	Most sensitive attention weights
MLA other projections (q_a/q_b/kv_a/kv_b/o)	6-bit	Latent compression layer
lm_head + embed_tokens	8-bit	Output quality
First/last 3 decoder layers	6-bit	Boundary layer sensitivity
Gate/router	unquantized	Tiny params, routing-critical
Vision encoder	unquantized	Preserved via mlx-vlm
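
The allocation in the table can be sketched as a lookup from a weight's module path to a bit width. This is an illustration only, not the conversion script used for this model: the path patterns (`shared_experts`, `self_attn.v_a`, `mlp.gate`, `vision`), the layer count of 61, and the precedence between overlapping rules (attention and shared-expert rules win over the boundary-layer rule) are all assumptions made for the example.

```python
import re
from typing import Optional

NUM_LAYERS = 61      # assumption for illustration; read it from the model config
BOUNDARY_LAYERS = 3  # "first/last 3 decoder layers" from the table


def bits_for(path: str, num_layers: int = NUM_LAYERS) -> Optional[int]:
    """Return the bit width for a weight path, or None to leave it unquantized.

    All path patterns below are hypothetical and must be adapted to the
    actual parameter names in the checkpoint.
    """
    # Router and vision encoder stay in their original precision.
    if "vision" in path or path.endswith("mlp.gate"):
        return None
    # Embeddings and output head.
    if "lm_head" in path or "embed_tokens" in path:
        return 8
    # MLA value projections (v_a / v_b).
    if re.search(r"self_attn\.v_[ab]", path):
        return 8
    # Remaining MLA projections (q_a, q_b, kv_a, kv_b, o).
    if "self_attn." in path:
        return 6
    # Shared expert runs on every token.
    if "shared_experts" in path:
        return 6
    # Everything else in the first/last few decoder layers.
    match = re.search(r"layers\.(\d+)\.", path)
    if match:
        idx = int(match.group(1))
        if idx < BOUNDARY_LAYERS or idx >= num_layers - BOUNDARY_LAYERS:
            return 6
    # Default: routed experts.
    return 4
```

A conversion tool that accepts a per-layer callback could consult a function like this for each module; that wiring is not shown here and is not confirmed by this card.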

Effective average: ~4.5 bits per weight (bpw), aiming for near-6-bit quality at near-4-bit size.
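
An effective average like this is a parameter-weighted mean of the per-component bit widths. The sketch below shows the arithmetic only. The parameter shares are made-up placeholders, not measurements of this model, and the per-group overhead (one 16-bit scale and one 16-bit bias per group of 64 weights) is an assumption about the quantization format rather than something this card states.

```python
from typing import Iterable, Tuple


def effective_bpw(
    groups: Iterable[Tuple[float, int]],
    group_size: int = 64,       # assumed quantization group size
    scale_bias_bits: int = 32,  # assumed: 16-bit scale + 16-bit bias per group
) -> float:
    """Parameter-weighted average bits per weight.

    `groups` holds (share_or_count, bits) pairs for the quantized components.
    Unquantized components are left out of this sketch.
    """
    groups = list(groups)
    total = sum(share for share, _ in groups)
    overhead = scale_bias_bits / group_size  # extra bits per weight
    total_bits = sum(share * (bits + overhead) for share, bits in groups)
    return total_bits / total


# Placeholder shares for illustration only; they do not describe this model.
shares = [
    (0.95, 4),  # routed experts
    (0.03, 6),  # shared expert, most MLA projections, boundary layers
    (0.02, 8),  # value projections, lm_head, embed_tokens
]

print(f"{effective_bpw(shares):.2f} bpw with group overhead")
print(f"{effective_bpw(shares, scale_bias_bits=0):.2f} bpw without it")
```

Because routed experts hold most of the parameters in an MoE model, the average stays close to the expert bit width even when every other component is kept at 6 or 8 bits.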

Status: weights are still uploading; conversion is in progress.
