Ternary-quantized version of openai/whisper-large-v3, produced with ternary-quant.

This model demonstrates ternary-quant's component-aware workflow for audio/speech models: the decoder is ternary-quantized, while the audio encoder is kept in FP16 to preserve transcription quality. It is a HuggingFace-native PTQ artifact rather than a GGUF deployment artifact.
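As background on what ternary quantization does to a weight tensor, here is a minimal sketch of plain threshold-based ternarization (an illustration only; this is not the tritplane3 scheme or the actual ternary-quant implementation, and `ternarize` is a hypothetical helper):

```python
import numpy as np

def ternarize(w, threshold_factor=0.7):
    """Map a float weight tensor to {-1, 0, +1} plus a per-tensor scale.

    Values whose magnitude falls below a fraction of the mean |w|
    are zeroed; the rest keep only their sign.
    """
    delta = threshold_factor * np.abs(w).mean()  # zeroing threshold
    t = np.zeros_like(w, dtype=np.int8)
    t[w > delta] = 1
    t[w < -delta] = -1
    mask = t != 0
    # Per-tensor scale: mean magnitude of the surviving weights
    alpha = float(np.abs(w[mask]).mean()) if mask.any() else 0.0
    return t, alpha

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
t, alpha = ternarize(w)
w_hat = alpha * t  # dequantized approximation of w
```

Each weight then costs roughly 1.6 bits of storage instead of 16, which is why only the decoder being quantized here caps the overall compression well below that limit.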
| Metric | Value |
|---|---|
| Scheme | tritplane3 (3-plane progressive ternary) |
| Components quantized | decoder (320 linear layers) |
| Audio encoder | Kept in FP16 (preserving audio understanding quality) |
| Stored size | 943.7 MB |
| FP16 size | 1677.7 MB |
| Compression ratio | 1.8x |
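The compression ratio follows directly from the two sizes in the table above:

```python
fp16_mb = 1677.7    # FP16 size from the table
stored_mb = 943.7   # stored (quantized) size from the table
ratio = fp16_mb / stored_mb
print(f"{ratio:.1f}x")  # 1.8x
```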
```python
import librosa
import torch

from ternary_quant.inference import load_ternary_model

model, processor = load_ternary_model(
    "AsadIsmail/whisper-large-v3-ternary",
    runtime_mode="cached",
    device="cpu",
)

# Important: cast to float32 to match the encoder conv1d dtype
model = model.float()

# Transcribe audio
audio, sr = librosa.load("audio.mp3", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = {k: v.to("cpu").float() for k, v in inputs.items()}

with torch.no_grad():
    predicted_ids = model.generate(**inputs)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
To reproduce the quantization:

```bash
pip install ternary-quant

ternary-quant quantize-broad openai/whisper-large-v3 \
    --output ./whisper-large-v3-ternary \
    --components decoder \
    --scheme tritplane3 --dtype float16 --eval
```
**Base model:** openai/whisper-large-v3