Discussion on Quantization Performance: Per-Channel vs Per-Group in Qwen3-32B-MLX-8bit

#2 · opened by amingHW

Hi Hugging Face team and community,

I recently made an interesting observation while quantizing models for deployment. In my benchmark tests, per-group quantization consistently outperforms per-channel quantization, even when comparing 4-bit per-group to 8-bit per-channel weights. This seems counterintuitive, since per-channel is often considered more precise due to finer-grained scaling.
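To make the terminology concrete, here is a minimal NumPy sketch of the two schemes as I understand them: per-channel uses one scale per output row of the weight matrix, while per-group uses one scale per block of `group_size` consecutive weights within each row. The matrix size, group size of 64, and MSE metric are illustrative choices, not the actual Qwen3-32B weights or my full benchmark.

```python
# Illustrative comparison of per-channel vs per-group symmetric quantization
# on a random weight matrix (not the actual Qwen3-32B weights or benchmark).
import numpy as np

def quantize_per_channel(w, bits):
    # One scale per output row (channel): coarse along the input dimension.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.round(w / scale).clip(-qmax - 1, qmax)
    return q * scale

def quantize_per_group(w, bits, group_size=64):
    # One scale per block of `group_size` weights within each row.
    qmax = 2 ** (bits - 1) - 1
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    scale = np.abs(g).max(axis=2, keepdims=True) / qmax
    q = np.round(g / scale).clip(-qmax - 1, qmax)
    return (q * scale).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)

for name, w_hat in [
    ("8-bit per-channel", quantize_per_channel(w, 8)),
    ("4-bit per-group",   quantize_per_group(w, 4)),
    ("8-bit per-group",   quantize_per_group(w, 8)),
]:
    print(name, "MSE:", np.mean((w - w_hat) ** 2))
```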

Specifically, I noticed that Qwen3-32B-MLX-8bit uses 8-bit per-group quantization (not per-channel); see the config-reading sketch after the questions below. This makes me wonder:

  1. Is there something unique about the Qwen3-32B architecture that makes per-group quantization more suitable? For example, are there specific weight distributions, layer structures, or training techniques that favor group-wise scaling?
  2. Could this be related to the MLX framework's optimization? Does per-group work better with certain hardware or software constraints?
  3. Has the team experimented with per-channel for this model? If so, what were the results?
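For reference, MLX-converted checkpoints typically record their quantization settings in `config.json`. Assuming the usual mlx-community layout (a `"quantization"` entry with `group_size` and `bits`, which is an assumption on my part rather than something confirmed for this specific repo), the scheme can be checked like this:

```python
# Sketch: reading the quantization settings of an MLX-converted checkpoint.
# The "quantization" key layout is assumed from typical mlx-community repos,
# not verified against Qwen3-32B-MLX-8bit itself.
import json

with open("Qwen3-32B-MLX-8bit/config.json") as f:
    config = json.load(f)

quant = config.get("quantization", {})
print("group_size:", quant.get("group_size"), "bits:", quant.get("bits"))
```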

My own findings suggest that per-group (even at lower bit widths) preserves accuracy better than per-channel, which aligns with Qwen3-32B-MLX-8bit's design choice. I'd appreciate any insights into why this might be the case, whether it's model-specific or a general trend for large transformers.
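If it helps, a rough way to probe the per-group side of this in MLX itself is sketched below. It assumes `mlx.core.quantize` / `mlx.core.dequantize` with the `(weights, scales, biases, group_size, bits)` interface of recent MLX releases, and it only measures weight reconstruction error on a random matrix, not end-task accuracy.

```python
# Rough sketch: reconstruction error of MLX per-group quantization at
# different bit widths and group sizes. Uses a random matrix, so this is
# only a weight-level proxy, not the benchmark referenced above.
import mlx.core as mx

w = mx.random.normal((1024, 1024))

for bits in (8, 4):
    for group_size in (32, 64, 128):
        w_q, scales, biases = mx.quantize(w, group_size=group_size, bits=bits)
        w_hat = mx.dequantize(w_q, scales, biases,
                              group_size=group_size, bits=bits)
        mse = mx.mean((w - w_hat) ** 2).item()
        print(f"bits={bits} group_size={group_size} mse={mse:.3e}")
```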

Thanks for your time!
