Llama 3.2 3B: fraQtl KV Cache Optimized
KV cache optimized with fraQtl: up to 3.5x less KV cache memory during inference.
Note: The model file size is the same as the original (~6.4GB). The optimization modifies the V projection weights so that, at inference time, the KV cache uses up to 3.5x less GPU memory. This requires the fraQtl runtime fork, which is shipping in 1-2 weeks. The savings happen at runtime, not at download.
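To give a rough sense of scale for the runtime savings, the sketch below estimates the size of an uncompressed KV cache and what an up-to-3.5x reduction would mean at a given context length. The architecture defaults (28 layers, 8 KV heads, head dimension 128) and the 16-bit cache are assumptions, not figures from this card; check them against the base model's `config.json`. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len, batch_size=1, num_layers=28, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate bytes held by the K and V tensors across all layers.

    Defaults are assumed values for a Llama-3.2-3B-style model with a
    16-bit cache; they are not taken from this model card.
    """
    # Factor 2 covers one K and one V vector per KV head, per layer, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


baseline = kv_cache_bytes(seq_len=8192)
best_case = baseline / 3.5  # "up to 3.5x less" is an upper bound on savings

print(f"baseline KV cache:  {baseline / 2**30:.3f} GiB")   # 0.875 GiB
print(f"best-case KV cache: {best_case / 2**30:.3f} GiB")  # 0.250 GiB
```

Under these assumptions the cache grows linearly with context length and batch size, which is why the savings matter most for long prompts and batched serving.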
| Metric | Value |
|---|---|
| Original | meta-llama/Llama-3.2-3B |
| File size | Same as original (~6.4GB) |
| KV cache memory | up to 3.5x less at runtime (requires fraQtl runtime fork, shipping in 1-2 weeks) |
| PPL before | 14.3943 |
| PPL after | 14.8613 |
| Delta | +0.467 (weight-level) |
| Config | k=32, INT3 |
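To put the perplexity rows in proportion, the short calculation below derives the absolute delta, the relative increase, and the equivalent change in mean per-token loss. It uses only the two PPL values from the table; the loss conversion relies on the standard definition of perplexity as the exponential of the mean negative log-likelihood. The variable names are illustrative.

```python
import math

ppl_before = 14.3943  # from the table above
ppl_after = 14.8613   # from the table above

delta = ppl_after - ppl_before   # absolute change in perplexity
relative = delta / ppl_before    # fractional change relative to the original

# Since PPL = exp(mean NLL), the log of the ratio is the increase
# in mean per-token negative log-likelihood, in nats.
nll_increase = math.log(ppl_after / ppl_before)

print(f"delta: +{delta:.3f}")
print(f"relative increase: {relative:.1%}")
print(f"mean NLL increase: {nll_increase:.4f} nats/token")
```

The relative increase works out to roughly 3.2%, which is the cost paid at the weight level for the k=32, INT3 configuration.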
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("fraQtl/Llama-3.2-3B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Llama-3.2-3B-compressed")
# With the fraQtl runtime fork, the KV cache uses up to 3.5x less memory during inference.
Runtime Compression
Runtime KV cache compression with significantly improved quality is shipping in 1-2 weeks (Path B llama.cpp fork). Contact us for early access.