Wan 2.2 I2V 14B — INT8 W8A8 Quantized

INT8 tensorwise quantized versions of Wan2.2-I2V-A14B for fast inference on NVIDIA GPUs with INT8 tensor cores.

Models

  • High noise expert (early steps): wan2.2_i2v_high_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors (~14 GB)
  • Low noise expert (late steps): wan2.2_i2v_low_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors (~14 GB)
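
If you prefer to fetch both files with huggingface_hub rather than the web UI, a minimal sketch (repo id and filenames as listed above; adjust local_dir to match your ComfyUI install):

```python
from huggingface_hub import hf_hub_download

# Pull both experts straight into the ComfyUI model folder.
repo = "berryber09/Wan2.2-I2V-14B-INT8-W8A8"
for fname in (
    "wan2.2_i2v_high_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors",
    "wan2.2_i2v_low_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors",
):
    hf_hub_download(repo_id=repo, filename=fname,
                    local_dir="ComfyUI/models/diffusion_models")
```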

Quantization Details

  • Method: W8A8 (8-bit weights, 8-bit activations) with learned tensorwise scales
  • Format: Mixed precision — attention/norm layers kept in FP16, linear layers quantized to INT8
  • Quality: Near-lossless compared to the FP16 original
  • Speed: ~2x faster than FP16/BF16 on RTX 3090 (Ampere) with torch.compile
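
For intuition, the sketch below simulates a tensorwise W8A8 linear layer in plain PyTorch. It is not the ComfyUI-Flux2-INT8 implementation: the shipped checkpoints use learned scales calibrated offline and run the matmul on INT8 tensor cores, whereas this sketch uses simple absmax scales and FP32 arithmetic purely to show the numerics.

```python
import torch

def tensorwise_int8_quantize(t: torch.Tensor):
    # One scale for the whole tensor ("tensorwise"). The shipped models use
    # learned scales; absmax/127 here is only illustrative.
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_linear_sim(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # W8A8: activations and weights are both quantized to INT8; hardware
    # accumulates in INT32 on tensor cores. Here FP32 matmul stands in.
    x_q, x_s = tensorwise_int8_quantize(x)
    w_q, w_s = tensorwise_int8_quantize(weight)
    acc = x_q.float() @ w_q.float().t()
    return acc * (x_s * w_s)

# Rough quality check on a random linear layer
x = torch.randn(8, 1024)
w = torch.randn(4096, 1024)
err = (w8a8_linear_sim(x, w) - x @ w.t()).abs().mean()
print(f"mean abs error vs. full precision: {err.item():.4f}")
```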

Requirements

  • ComfyUI
  • ComfyUI-Flux2-INT8 custom node (provides the INT8 W8A8 loader)
  • NVIDIA GPU with INT8 tensor cores (RTX 20-series or newer)
  • PyTorch with CUDA support
  • Triton (for torch.compile acceleration)
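
A quick sanity check of these requirements (a sketch, not part of the custom node; the function name is just for illustration):

```python
import torch

def check_int8_environment():
    assert torch.cuda.is_available(), "CUDA build of PyTorch is required"
    major, minor = torch.cuda.get_device_capability(0)
    # RTX 20-series (Turing, sm_75) and newer expose INT8 tensor cores.
    assert (major, minor) >= (7, 5), f"GPU is sm_{major}{minor}; need sm_75 or newer"
    try:
        import triton  # noqa: F401  # used by torch.compile's inductor backend
    except ImportError:
        print("Triton not found: torch.compile acceleration will be unavailable")
    print(f"OK: {torch.cuda.get_device_name(0)} (sm_{major}{minor}), torch {torch.__version__}")

check_int8_environment()
```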

Usage in ComfyUI

  1. Place both safetensors files in ComfyUI/models/diffusion_models/
  2. Use the "Load Diffusion Model INT8 (W8A8)" node to load each expert
  3. Wan 2.2 I2V workflows use both experts — high noise for early steps, low noise for late steps
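
To confirm the files downloaded correctly and see which tensors are INT8 versus FP16, you can read the safetensors header without loading any weights (the exact dtype breakdown depends on the checkpoint; the path below assumes step 1 above):

```python
import json, struct

def list_tensor_dtypes(path):
    # safetensors layout: 8-byte little-endian header length, then a JSON header.
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    counts = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue
        counts[info["dtype"]] = counts.get(info["dtype"], 0) + 1
    print(counts)  # e.g. {"I8": ..., "F16": ...} depending on the checkpoint

list_tensor_dtypes(
    "ComfyUI/models/diffusion_models/"
    "wan2.2_i2v_high_noise_14B_fp16_learned_int8mixed_tensorwise.safetensors"
)
```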

Performance (RTX 3090, 24 GB)

  • Model weights: ~8.9 GB on GPU (partially offloaded with --normalvram)
  • Peak VRAM during sampling: ~23 GB (normal for 14B model + activations)
  • Compatible with: LoRA (bypass mode), sage attention, torch.compile
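
The peak-VRAM figure can be checked on your own hardware with PyTorch's allocator counters (a sketch only; this tracks PyTorch allocations, so nvidia-smi may report slightly more):

```python
import torch

# Reset the peak counter before queuing a generation, then read it afterwards.
torch.cuda.reset_peak_memory_stats()
# ... run the sampling workflow here ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated VRAM: {peak_gib:.1f} GiB")
```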

INT8 is the fastest practical quantization for Ampere GPUs. FP8 requires Ada/Hopper.

Credits
