Thanks for sharing your work!
Heya I replied to you over on your reddit post: https://www.reddit.com/r/LocalLLaMA/comments/1qbm7f4/gemma_3_1b_qat_q4_0_gguf_without_imatrix_and/
Just wanted to thank you for sharing your procedures as well — it's always great to have more quantizers cooking up new recipes!
Cheers!
@ubergarm I have added perplexity measurements to the model card. It looks good as far as I can tell. In addition to wiki.test.raw, I used another dataset based on TIGER-Lab/StructEval because that is actually the direction in which I intend to use this quantization myself later. Even with this dataset, it performs very well in terms of perplexity compared to the other tested quantizations.
The numbers look almost too good for my taste... I hope I didn't miscalculate somewhere 😬
Sweet job comparing perplexity values and showing your commands for reproducibility!
The QAT models can be funny in that quantized versions may sometimes show lower "better" perplexity than the original bf16. KLD can also help in this case, but it is more annoying to compute as it requires storing logits from the baseline bf16 and using them in the comparisons. Also, given QAT involves some extra training, it is possible the process is improving quality not only by restricting weight values to a given set, but also simply because the model gets some more time in the oven in general (compared to the non-QAT bf16).
I have an old (possibly out of date at this point in terms of exact syntax) example of KLD here in my quant cookers guide: https://github.com/ikawrakow/ik_llama.cpp/discussions/434
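To illustrate the idea behind the KLD comparison (this is a toy sketch, not the actual ik_llama.cpp tooling), you can think of it as: store the baseline bf16 logits per token position, then for the quant compute the mean KL divergence between the two softmax distributions:

```python
import math

def softmax(logits):
    # numerically stable softmax over one token position
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mean_kld(base_logits, quant_logits):
    """Mean KL(base || quant) over token positions, in nats.

    base_logits / quant_logits: list of per-position logit vectors,
    e.g. as dumped from the bf16 baseline and the quantized model.
    """
    total = 0.0
    for b, q in zip(base_logits, quant_logits):
        p = softmax(b)
        r = softmax(q)
        total += sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)
    return total / len(base_logits)

# identical logits give a KLD of exactly 0; any divergence is positive
base = [[2.0, 1.0, 0.1]]
assert abs(mean_kld(base, base)) < 1e-12
```

A KLD near zero means the quant's output distribution matches the bf16 baseline closely, which sidesteps the "quant scores better PPL than bf16" oddity you can hit with QAT models.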
I was kinda surprised the original google Q4_0 was using f16 for the token embedding and not bf16, since their ranges differ, but given they trained it I assume they confirmed there was no clipping.
Finally if you're into discord, check out https://huggingface.co/BeaverAI where a lot of fine tuners and quantizers hang out!
> The numbers look almost too good for my taste...
Perplexity is a single dimension and does not always indicate performance in all situations; it is also not directly comparable across different models or quantization procedures (e.g. different imatrix methodologies and corpora).
I mainly use it to compare quants of the same model, prepared with exactly the same methodology, relative to each other and to the original full-size bf16.
You can look at the graphs I've made for over a dozen models on the model card of each of my repos to get an idea. Also @Thireus has a cool front-end GUI for hyper-custom quant mixes that attempts to predict final perplexity from measurements, essentially modeling the curve of perplexity vs model size: https://gguf0.thireus.com/quant_assign.html
Finally, if your perplexity test corpus is the exact same one you used as the imatrix dataset, that is likely "benchmaxxing" and will give artificially low numbers.
Anyway, great job again exploring the quantization options and paying attention to all the nuance! Cheers!