Llama 3.2 3B — fraQtl KV Cache Optimized

KV cache optimized with fraQtl: up to 3.5× less KV cache memory during inference (requires the fraQtl runtime fork).

Note: The model file size is the same as the original (~6.4GB). The optimization modifies the V projection weights so that, at inference time, the KV cache uses up to 3.5× less GPU memory. This requires the fraQtl runtime fork, which is shipping in 1-2 weeks. The savings happen at runtime, not at download.

Metric	Value
Original	meta-llama/Llama-3.2-3B
File size	Same as original (~6.4GB)
KV cache memory	Up to 3.5× less at runtime (requires the fraQtl runtime fork, shipping in 1-2 weeks)
PPL before	14.3943
PPL after	14.8613
Delta	+0.467 (weight-level)
Config	k=32, INT3
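
To give a sense of what the "up to 3.5×" figure could mean in absolute terms, here is a rough back-of-the-envelope sketch. It is not part of fraQtl. The layer count, KV head count, and head dimension are assumptions based on the published Llama 3.2 3B configuration (check them against the model's config.json), the baseline assumes an uncompressed FP16 cache, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Attention geometry needed to size a KV cache."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


# Assumed values for meta-llama/Llama-3.2-3B; verify against config.json.
LLAMA_3_2_3B = AttentionShape(num_layers=28, num_kv_heads=8, head_dim=128)


def kv_cache_bytes(shape, seq_len, batch_size=1, bytes_per_value=2):
    """Uncompressed cache size: one K and one V vector per layer, KV head, and token."""
    values_per_token = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim
    return values_per_token * bytes_per_value * seq_len * batch_size


def best_case_compressed_bytes(baseline_bytes, reduction=3.5):
    """Best case only: the card states 'up to' 3.5x, so real savings may be smaller."""
    return baseline_bytes / reduction


if __name__ == "__main__":
    for seq_len in (2048, 8192, 32768):
        base = kv_cache_bytes(LLAMA_3_2_3B, seq_len)
        best = best_case_compressed_bytes(base)
        print(f"{seq_len} tokens: {base / 2**20:.1f} MiB FP16, "
              f"{best / 2**20:.1f} MiB at 3.5x")
```

Under these assumptions, an 8192-token context at batch size 1 needs about 896 MiB of FP16 KV cache, which would drop to about 256 MiB in the best case. The cache grows linearly with context length and batch size, so the absolute savings scale the same way.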

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/Llama-3.2-3B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Llama-3.2-3B-compressed")
# When run with the fraQtl runtime fork (not yet released), the KV cache uses up to 3.5× less memory during inference.

Runtime Compression

Runtime KV cache compression with significantly improved quality is shipping in 1-2 weeks (Path B llama.cpp fork). Contact us for early access.

Safetensors
Model size	3B params
Tensor type	F16
