Wan 2.2 I2V 14B – INT8 W8A8 Quantized
INT8 tensorwise quantized versions of Wan2.2-I2V-A14B for fast inference on NVIDIA GPUs with INT8 tensor cores.
Models
- High noise expert (early steps): wan2.2_i2v_high_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors (~14 GB)
- Low noise expert (late steps): wan2.2_i2v_low_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors (~14 GB)
Quantization Details
- Method: W8A8 (8-bit weights, 8-bit activations) with learned tensorwise scales
- Format: Mixed precision – attention/norm layers kept in FP16, linear layers quantized to INT8
- Quality: Near-lossless relative to the FP16 original
- Speed: ~2x faster than FP16/BF16 on RTX 3090 (Ampere) with torch.compile
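To make the scheme concrete, here is a minimal numpy sketch of tensorwise W8A8 (not the actual quantization code, which lives in dxqb/OneTrainer): one absmax-derived scale per tensor (the repo uses learned scales instead), an integer matmul with an INT32 accumulator, then a single rescale of the result.

```python
import numpy as np

def quantize_tensorwise_int8(t: np.ndarray):
    # One scale for the entire tensor ("tensorwise"): absmax mapped to 127.
    # The real models use learned scales; absmax is the simplest stand-in.
    scale = np.abs(t).max() / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # W8A8: quantize both weights and activations, multiply in the
    # integer domain, then rescale the INT32 accumulator once.
    qx, sx = quantize_tensorwise_int8(x)
    qw, sw = quantize_tensorwise_int8(w)
    acc = qx.astype(np.int32) @ qw.astype(np.int32).T
    return acc.astype(np.float32) * (sx * sw)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations
w = rng.standard_normal((128, 64)).astype(np.float32)  # linear weight
y_int8 = int8_linear(x, w)
y_fp = x @ w.T
rel_err = np.abs(y_int8 - y_fp).mean() / np.abs(y_fp).mean()
```

On real hardware the INT32-accumulated matmul maps onto INT8 tensor-core instructions, which is where the speedup over FP16 comes from; attention and norm layers stay in FP16 because they are more sensitive to quantization error.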
Requirements
- ComfyUI
- ComfyUI-Flux2-INT8 custom node (provides the INT8 W8A8 loader)
- NVIDIA GPU with INT8 tensor cores (RTX 20-series or newer)
- PyTorch with CUDA support
- Triton (for torch.compile acceleration)
Usage in ComfyUI
- Place both safetensors files in ComfyUI/models/diffusion_models/
- Use the "Load Diffusion Model INT8 (W8A8)" node to load them
- Wan 2.2 I2V workflows use both experts – high noise for early steps, low noise for late steps
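The two-expert routing can be sketched as a simple noise-level switch. This is an illustration, not ComfyUI's implementation, and the boundary value below is hypothetical – real workflows pick the switch point empirically.

```python
def pick_expert(sigma: float, boundary: float = 0.875) -> str:
    """Route a denoising step to one of the two experts by noise level.

    `boundary` is a hypothetical threshold for illustration only.
    """
    return "high_noise" if sigma >= boundary else "low_noise"

# Illustrative linearly decreasing noise schedule: early steps (high sigma)
# go to the high-noise expert, late steps to the low-noise expert.
sigmas = [1.0 - i / 20 for i in range(20)]
experts = [pick_expert(s) for s in sigmas]
```

In practice the sampler loads the high-noise checkpoint for the first portion of the schedule and swaps to the low-noise checkpoint for the remainder, which is why both files are required.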
Performance (RTX 3090, 24 GB)
- Model weights: ~8.9 GB on GPU (partially offloaded with --normalvram)
- Peak VRAM during sampling: ~23 GB (normal for a 14B model plus activations)
- Compatible with: LoRA (bypass mode), sage attention, torch.compile
INT8 is the fastest practical quantization for Ampere GPUs. FP8 requires Ada/Hopper.
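The Ampere-vs-Ada distinction comes down to CUDA compute capability. A small helper makes the cutoffs explicit (INT8 tensor cores arrived with Turing, SM 7.5; FP8 needs Ada Lovelace, SM 8.9, or Hopper, SM 9.0):

```python
def supports_int8_tensor_cores(major: int, minor: int) -> bool:
    # INT8 tensor cores shipped with Turing (SM 7.5, RTX 20-series).
    return (major, minor) >= (7, 5)

def supports_fp8(major: int, minor: int) -> bool:
    # FP8 tensor cores require Ada Lovelace (SM 8.9) or Hopper (SM 9.0).
    return (major, minor) >= (8, 9)

# RTX 3090 is Ampere (SM 8.6): INT8 yes, FP8 no.
print(supports_int8_tensor_cores(8, 6), supports_fp8(8, 6))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair for the active GPU, so you can run these checks against your own card.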
Credits
- Wan-AI for the original Wan 2.2 models
- dxqb/OneTrainer for the INT8 quantization code
- bertbobson/ComfyUI-Flux2-INT8 for the ComfyUI integration
Model tree for berryber09/Wan2.2-I2V-14B-INT8-W8A8
- Base model: Wan-AI/Wan2.2-I2V-A14B