Discussion on Quantization Performance: Per-Channel vs Per-Group in Qwen3-32B-MLX-8bit
Hi Hugging Face team and community,
I recently encountered an interesting observation while quantizing models for deployment. In my benchmark tests, per-group quantization consistently outperforms per-channel quantization, even when comparing 4-bit per-group to 8-bit per-channel weights. The bit-width result surprised me: I expected the extra precision of 8 bits to outweigh the finer-grained scaling that per-group provides.
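For context, here is a minimal, framework-agnostic sketch of the kind of comparison I mean. It measures raw weight-reconstruction error under symmetric round-to-nearest quantization; the matrix shape, the synthetic outlier pattern, and the group size of 64 are my own assumptions, and my actual benchmarks looked at downstream accuracy rather than RMSE.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_dequantize(w, bits, group_size=None):
    """Symmetric round-to-nearest quantization, then dequantize back to float.

    group_size=None -> per-channel (one scale per output row);
    otherwise one scale per contiguous group of `group_size` weights in a row.
    """
    rows, cols = w.shape
    if group_size is None:
        blocks = w.reshape(rows, 1, cols)                      # one block per row
    else:
        blocks = w.reshape(rows, cols // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(blocks).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1e-8, scale)                  # avoid divide-by-zero
    q = np.clip(np.round(blocks / scale), -qmax - 1, qmax)
    return (q * scale).reshape(rows, cols)

# Heavy-tailed synthetic weights with a few large outlier columns,
# loosely mimicking what transformer weight matrices often look like.
w = rng.standard_normal((128, 4096)).astype(np.float32)
outlier_cols = rng.choice(4096, size=8, replace=False)
w[:, outlier_cols] *= 20.0

for label, bits, gs in [("8-bit per-channel", 8, None),
                        ("8-bit per-group (g=64)", 8, 64),
                        ("4-bit per-group (g=64)", 4, 64)]:
    err = np.sqrt(np.mean((w - quantize_dequantize(w, bits, gs)) ** 2))
    print(f"{label:24s} RMSE = {err:.5f}")
```

At equal bit width, per-group scaling adapts to a smaller local range, so its average error comes out lower in a toy setup like this. What I am really asking about is why the gap stays so large on real checkpoints, to the point where it can even override the bit-width ordering in downstream accuracy.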
Specifically, I noticed that Qwen3-32B-MLX-8bit uses 8-bit per-group quantization (not per-channel). This makes me wonder:
1. Is there something unique about the Qwen3-32B architecture that makes per-group quantization more suitable? For example, are there specific weight distributions, layer structures, or training techniques that favor group-wise scaling?
2. Could this be related to the MLX framework's optimization? Does per-group work better with certain hardware or software constraints? (A small sketch of what I mean is after this list.)
3. Has the team experimented with per-channel for this model? If so, what were the results?
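For question 2, this is roughly how I poke at the group-wise scheme inside MLX itself. I believe `mx.quantize` / `mx.dequantize` are the relevant calls, but the group sizes, the random stand-in weight, and the RMSE metric below are my assumptions, not anything taken from the Qwen3-32B-MLX-8bit release.

```python
import mlx.core as mx

# Stand-in weight matrix; not an actual Qwen3 layer.
w = mx.random.normal((4096, 4096))

for bits in (4, 8):
    for group_size in (32, 64, 128):
        # Quantize group-wise, then reconstruct and measure the error.
        w_q, scales, biases = mx.quantize(w, group_size=group_size, bits=bits)
        w_hat = mx.dequantize(w_q, scales, biases, group_size=group_size, bits=bits)
        rmse = mx.sqrt(mx.mean((w - w_hat) ** 2)).item()
        print(f"bits={bits} group_size={group_size:3d} RMSE={rmse:.6f}")
```

If the published checkpoint simply uses the MLX default group size, that alone might explain the per-group choice, independent of anything Qwen3-specific, which is part of what I am trying to confirm.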
My own findings suggest that per-group (even at lower bit widths) preserves accuracy better than per-channel, which aligns with Qwen3-32B-MLX-8bit's design choice. I'd appreciate any insights into why this might be the case—whether it’s model-specific or a general trend for large transformers.
Thanks for your time!