IQ4_NL
"The 14 heads of Qwen2.5-0.5B doesn't allow for any of the other 4-bit quants to be made"
Doesn't IQ4_NL work even with irregular shapes? If yes, its perplexity is better than q4_0 for the same bpw.
"The 14 heads of Qwen2.5-0.5B doesn't allow for any of the other 4-bit quants to be made"
Doesn't IQ4_NL work even with irregular shapes? If yes, its perplexity is better than q4_0 for the same bpw.
Sorry, somehow missed this.
The perplexity (or any other probabilistic measure) doesn't really matter for speculative decoding. To see why, consider this simple example of a model with a vocabulary of three tokens, A, B, and C:
- If the draft model predicts `[0.6, 0.3, 0.1]` for the next token, then even with a large amount of noise added the draft model will still predict token A as the most likely next token (the wide margin makes the argmax robust, which is the same idea behind hinge loss).
- If the draft model predicts `[0.4, 0.4, 0.2]` for the next token, then even with a moderate amount of noise added this won't really be useful for (single-sequence) speculation anyway (though it might be for beam-search speculation).
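The noise-robustness argument can be sketched numerically (a hypothetical simulation, not llama.cpp code; `argmax_flip_rate` and its parameters are made up for illustration): perturb the draft distribution with random noise and measure how often the most likely token changes.

```python
import random

def argmax_flip_rate(probs, noise, trials=10000, seed=0):
    """Fraction of trials in which uniform noise of the given
    magnitude changes which token has the highest probability."""
    rng = random.Random(seed)
    base = max(range(len(probs)), key=lambda i: probs[i])
    flips = 0
    for _ in range(trials):
        noisy = [p + rng.uniform(-noise, noise) for p in probs]
        if max(range(len(noisy)), key=lambda i: noisy[i]) != base:
            flips += 1
    return flips / trials

# Wide margin: the argmax never changes, since the 0.3 gap
# exceeds the worst-case +/-0.1 perturbation on each entry.
print(argmax_flip_rate([0.6, 0.3, 0.1], noise=0.1))
# Near-tie: even small noise flips the argmax roughly half the time,
# so the prediction was never reliable to begin with.
print(argmax_flip_rate([0.4, 0.4, 0.2], noise=0.1))
```

In the first case quantization noise is harmless; in the second the draft token was already a coin flip, so a better quant wouldn't have rescued it either.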
So even though there are quants with better perplexity than Q4_0, it's very doubtful that this would make any noticeable difference to the actual acceptance rate.