Wan 2.2 I2V 14B — INT8 W8A8 Quantized

INT8 tensorwise quantized versions of Wan2.2-I2V-A14B for fast inference on NVIDIA GPUs with INT8 tensor cores.

Models

  • High noise expert (early steps): wan2.2_i2v_high_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors (~14 GB)
  • Low noise expert (late steps): wan2.2_i2v_low_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors (~14 GB)
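
If you prefer to fetch both files with huggingface_hub rather than the web UI, a minimal sketch (repo id and filenames as listed above; adjust local_dir to match your ComfyUI install):

```python
from huggingface_hub import hf_hub_download

# Pull both experts straight into the ComfyUI model folder.
repo = "berryber09/Wan2.2-I2V-14B-INT8-W8A8"
for fname in (
    "wan2.2_i2v_high_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors",
    "wan2.2_i2v_low_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors",
):
    hf_hub_download(repo_id=repo, filename=fname,
                    local_dir="ComfyUI/models/diffusion_models")
```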

Quantization Details

  • Method: W8A8 (8-bit weights, 8-bit activations) with learned tensorwise scales
  • Format: Mixed precision — attention/norm layers kept in FP16, linear layers quantized to INT8
  • Quality: Near-lossless compared to the FP16 original
  • Speed: ~2x faster than FP16/BF16 on RTX 3090 (Ampere) with torch.compile
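
For intuition, the sketch below simulates a tensorwise W8A8 linear layer in plain PyTorch. It is not the ComfyUI-Flux2-INT8 implementation: the shipped checkpoints use learned scales calibrated offline and run the matmul on INT8 tensor cores, whereas this sketch uses simple absmax scales and FP32 arithmetic purely to show the numerics.

```python
import torch

def tensorwise_int8_quantize(t: torch.Tensor):
    # One scale for the whole tensor ("tensorwise"). The shipped models use
    # learned scales; absmax/127 here is only illustrative.
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_linear_sim(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # W8A8: activations and weights are both quantized to INT8; hardware
    # accumulates in INT32 on tensor cores. Here FP32 matmul stands in.
    x_q, x_s = tensorwise_int8_quantize(x)
    w_q, w_s = tensorwise_int8_quantize(weight)
    acc = x_q.float() @ w_q.float().t()
    return acc * (x_s * w_s)

# Rough quality check on a random linear layer
x = torch.randn(8, 1024)
w = torch.randn(4096, 1024)
err = (w8a8_linear_sim(x, w) - x @ w.t()).abs().mean()
print(f"mean abs error vs. full precision: {err.item():.4f}")
```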

Requirements

  • ComfyUI
  • ComfyUI-Flux2-INT8 custom node (provides the INT8 W8A8 loader)
  • NVIDIA GPU with INT8 tensor cores (RTX 20-series or newer)
  • PyTorch with CUDA support
  • Triton (for torch.compile acceleration)
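
A quick sanity check of these requirements (a sketch, not part of the custom node; the function name is just for illustration):

```python
import torch

def check_int8_environment():
    assert torch.cuda.is_available(), "CUDA build of PyTorch is required"
    major, minor = torch.cuda.get_device_capability(0)
    # RTX 20-series (Turing, sm_75) and newer expose INT8 tensor cores.
    assert (major, minor) >= (7, 5), f"GPU is sm_{major}{minor}; need sm_75 or newer"
    try:
        import triton  # noqa: F401  # used by torch.compile's inductor backend
    except ImportError:
        print("Triton not found: torch.compile acceleration will be unavailable")
    print(f"OK: {torch.cuda.get_device_name(0)} (sm_{major}{minor}), torch {torch.__version__}")

check_int8_environment()
```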

Usage in ComfyUI

  1. Place both safetensors files in ComfyUI/models/diffusion_models/
  2. Use the "Load Diffusion Model INT8 (W8A8)" node to load each expert
  3. Wan 2.2 I2V workflows use both experts — high noise for early steps, low noise for late steps
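
To confirm the files downloaded correctly and see which tensors are INT8 versus FP16, you can read the safetensors header without loading any weights (the exact dtype breakdown depends on the checkpoint; the path below assumes step 1 above):

```python
import json, struct

def list_tensor_dtypes(path):
    # safetensors layout: 8-byte little-endian header length, then a JSON header.
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    counts = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue
        counts[info["dtype"]] = counts.get(info["dtype"], 0) + 1
    print(counts)  # e.g. {"I8": ..., "F16": ...} depending on the checkpoint

list_tensor_dtypes(
    "ComfyUI/models/diffusion_models/"
    "wan2.2_i2v_high_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors"
)
```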

Performance (RTX 3090, 24 GB)

  • Model weights: ~8.9 GB on GPU (partially offloaded with --normalvram)
  • Peak VRAM during sampling: ~23 GB (normal for 14B model + activations)
  • Compatible with: LoRA (bypass mode), sage attention, torch.compile
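
The peak-VRAM figure can be checked on your own hardware with PyTorch's allocator counters (a sketch only; this tracks PyTorch allocations, so nvidia-smi may report slightly more):

```python
import torch

# Reset the peak counter before queuing a generation, then read it afterwards.
torch.cuda.reset_peak_memory_stats()
# ... run the sampling workflow here ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated VRAM: {peak_gib:.1f} GiB")
```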

INT8 is the fastest practical quantization for Ampere GPUs. FP8 requires Ada/Hopper.

Credits
