Llama 3.2 3B: fraQtl KV Cache Optimized

KV cache optimized with fraQtl: up to 3.5× less KV cache memory during inference.

Note: The model file size is the same as the original (~6.4GB). The optimization modifies the V projection weights so that, at inference time, the KV cache uses up to 3.5× less GPU memory. This requires the fraQtl runtime fork, which is shipping in 1-2 weeks. The savings happen at runtime, not at download.
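To put the runtime saving in concrete terms, here is a rough sketch of the arithmetic. It assumes the standard Llama 3.2 3B attention shape (28 layers, 8 KV heads, head dimension 128) and an FP16 baseline cache. Those values are not stated in this card, so check them against the checkpoint's config.json. The names `AttentionShape` and `kv_cache_bytes` are illustrative helpers, not part of fraQtl or transformers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Attention dimensions that determine KV cache size."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape: AttentionShape, seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Size in bytes of an uncompressed KV cache (keys plus values)."""
    # Factor of 2: one key vector and one value vector per token, layer and KV head.
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * bytes_per_value)
    return per_token * seq_len * batch_size


# Assumed values (not from this card): verify against the model's config.json.
LLAMA_3_2_3B = AttentionShape(num_layers=28, num_kv_heads=8, head_dim=128)

# "Up to 3.5x" is the card's stated best case, so best_case is a lower bound.
STATED_MAX_REDUCTION = 3.5

GIB = 1024 ** 3
baseline = kv_cache_bytes(LLAMA_3_2_3B, seq_len=8192)
best_case = baseline / STATED_MAX_REDUCTION

print(f"FP16 KV cache at 8192 tokens: {baseline / GIB:.3f} GiB")
print(f"Best case with fraQtl runtime: {best_case / GIB:.3f} GiB")
```

Under these assumptions, a single 8192-token sequence needs about 0.875 GiB of FP16 KV cache, and the stated best case would bring that down to about 0.25 GiB. The cache grows linearly with sequence length and batch size, so the absolute saving scales the same way.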

| Metric | Value |
|---|---|
| Original | meta-llama/Llama-3.2-3B |
| File size | Same as original (~6.4GB) |
| KV cache memory | Up to 3.5× less at runtime (requires fraQtl runtime fork, shipping in 1-2 weeks) |
| PPL before | 14.3943 |
| PPL after | 14.8613 |
| Delta | +0.467 (weight-level) |
| Config | k=32, INT3 |
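For readers who want the perplexity change in relative terms, the sketch below works it out from the two PPL values reported above. The `perplexity` helper uses the standard definition (exponential of the mean negative log-likelihood per token). The card does not state which dataset or context length produced these numbers, so the helper is only illustrative and both function names are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_increase(before, after):
    """Fractional change going from `before` to `after`."""
    return (after - before) / before


# Values reported in this model card.
ppl_before = 14.3943
ppl_after = 14.8613

delta = ppl_after - ppl_before
print(f"absolute delta: {delta:+.3f}")
print(f"relative increase: {relative_increase(ppl_before, ppl_after):.2%}")
```

This reproduces the +0.467 delta and shows it corresponds to a relative perplexity increase of roughly 3.2%.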

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/Llama-3.2-3B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Llama-3.2-3B-compressed")
# With the fraQtl runtime fork, the KV cache uses up to 3.5x less memory during inference.

Runtime Compression

Runtime KV cache compression with significantly improved quality is shipping in 1-2 weeks (Path B llama.cpp fork). Contact us for early access.

Format: Safetensors. Model size: 3B params. Tensor type: F16.