Discussion on Quantization Performance: Per-Channel vs Per-Group in Qwen3-32B-MLX-8bit
Hi Hugging Face team and community,
I recently encountered an interesting observation while quantizing models for deployment. In my benchmark tests, per-group quantization consistently outperforms per-channel quantization, even when comparing 4-bit per-group to 8-bit per-channel weights. The bit-width result surprised me: I expected the extra precision of 8 bits to outweigh the finer-grained scaling that per-group provides.
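For context, here is a minimal, framework-agnostic sketch of the kind of comparison I mean. It measures raw weight-reconstruction error under symmetric round-to-nearest quantization; the matrix shape, the synthetic outlier pattern, and the group size of 64 are my own assumptions, and my actual benchmarks looked at downstream accuracy rather than RMSE.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_dequantize(w, bits, group_size=None):
    """Symmetric round-to-nearest quantization, then dequantize back to float.

    group_size=None -> per-channel (one scale per output row);
    otherwise one scale per contiguous group of `group_size` weights in a row.
    """
    rows, cols = w.shape
    if group_size is None:
        blocks = w.reshape(rows, 1, cols)                      # one block per row
    else:
        blocks = w.reshape(rows, cols // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(blocks).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1e-8, scale)                  # avoid divide-by-zero
    q = np.clip(np.round(blocks / scale), -qmax - 1, qmax)
    return (q * scale).reshape(rows, cols)

# Heavy-tailed synthetic weights with a few large outlier columns,
# loosely mimicking what transformer weight matrices often look like.
w = rng.standard_normal((128, 4096)).astype(np.float32)
outlier_cols = rng.choice(4096, size=8, replace=False)
w[:, outlier_cols] *= 20.0

for label, bits, gs in [("8-bit per-channel", 8, None),
                        ("8-bit per-group (g=64)", 8, 64),
                        ("4-bit per-group (g=64)", 4, 64)]:
    err = np.sqrt(np.mean((w - quantize_dequantize(w, bits, gs)) ** 2))
    print(f"{label:24s} RMSE = {err:.5f}")
```

At equal bit width, per-group scaling adapts to a smaller local range, so its average error comes out lower in a toy setup like this. What I am really asking about is why the gap stays so large on real checkpoints, to the point where it can even override the bit-width ordering in downstream accuracy.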
Specifically, I noticed that Qwen3-32B-MLX-8bit uses 8-bit per-group quantization (not per-channel). This makes me wonder:
1. Is there something unique about the Qwen3-32B architecture that makes per-group quantization more suitable? For example, are there specific weight distributions, layer structures, or training techniques that favor group-wise scaling?
2. Could this be related to the MLX framework's optimization? Does per-group work better with certain hardware or software constraints? (A small sketch of what I mean is after this list.)
3. Has the team experimented with per-channel for this model? If so, what were the results?
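For question 2, this is roughly how I poke at the group-wise scheme inside MLX itself. I believe `mx.quantize` / `mx.dequantize` are the relevant calls, but the group sizes, the random stand-in weight, and the RMSE metric below are my assumptions, not anything taken from the Qwen3-32B-MLX-8bit release.

```python
import mlx.core as mx

# Stand-in weight matrix; not an actual Qwen3 layer.
w = mx.random.normal((4096, 4096))

for bits in (4, 8):
    for group_size in (32, 64, 128):
        # Quantize group-wise, then reconstruct and measure the error.
        w_q, scales, biases = mx.quantize(w, group_size=group_size, bits=bits)
        w_hat = mx.dequantize(w_q, scales, biases, group_size=group_size, bits=bits)
        rmse = mx.sqrt(mx.mean((w - w_hat) ** 2)).item()
        print(f"bits={bits} group_size={group_size:3d} RMSE={rmse:.6f}")
```

If the published checkpoint simply uses the MLX default group size, that alone might explain the per-group choice, independent of anything Qwen3-specific, which is part of what I am trying to confirm.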
My own findings suggest that per-group (even at lower bit widths) preserves accuracy better than per-channel, which aligns with Qwen3-32B-MLX-8bit's design choice. I'd appreciate any insights into why this might be the case—whether it’s model-specific or a general trend for large transformers.
Thanks for your time!