25% faster generation with Flash SDPA
Tested on Blackwell (RTX 5080), 25% faster than native SDPA:
┌───────────────────────┬────────────┬──────────┐
│ Backend               │ Total time │ Per step │
├───────────────────────┼────────────┼──────────┤
│ Native SDPA (default) │ 208.49s    │ ~4.17s   │
├───────────────────────┼────────────┼──────────┤
│ Flash SDPA            │ 156.67s    │ ~3.13s   │
└───────────────────────┴────────────┴──────────┘
Flash SDPA is ~25% faster, saving about 52 seconds on this 50-step 1088×1920 generation.
Use this code:
import torch
from diffusers import ZImagePipeline
from diffusers.models.attention_dispatch import attention_backend

# Load the pipeline
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
pipe.enable_model_cpu_offload()

# Generate image
prompt = "Two young Asian women stand close together against a backdrop of a plain gray textured wall, possibly an indoor carpeted floor. The woman on the left has long, curly hair, wears a navy blue sweater with cream-colored ruffles on the left sleeve, a white stand-up collar shirt underneath, and white trousers; she wears small gold earrings"
negative_prompt = ""  # Optional; useful for suppressing unwanted content

# Run attention through the Flash SDPA backend
with attention_backend("_native_flash"):
    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=1920,
        width=1088,
        cfg_normalization=False,
        num_inference_steps=50,
        guidance_scale=4,
        generator=torch.Generator("cuda").manual_seed(42),
    ).images[0]

image.save("example.png")
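For reference, here is a minimal sketch of how the two backends could be timed against each other. It assumes the pipe and prompt defined above; the backend names "native" (the default SDPA path) and "_native_flash" come from diffusers' attention dispatcher, and the exact set of available names may vary between diffusers versions:

import time

import torch
from diffusers.models.attention_dispatch import attention_backend

def time_backend(backend_name, steps=50):
    # Synchronize first so the timer measures only this generation
    torch.cuda.synchronize()
    start = time.perf_counter()
    with attention_backend(backend_name):
        pipe(
            prompt=prompt,
            height=1920,
            width=1088,
            num_inference_steps=steps,
            guidance_scale=4,
            generator=torch.Generator("cuda").manual_seed(42),
        )
    torch.cuda.synchronize()
    total = time.perf_counter() - start
    print(f"{backend_name}: {total:.2f}s total, ~{total / steps:.2f}s per step")

# Compare the default backend against Flash SDPA on the same seed
for name in ("native", "_native_flash"):
    time_backend(name)

Note that run order can matter (warm-up, kernel caching), so averaging a few runs per backend gives steadier numbers than a single pass.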
4080S + Flash Attention 2.83 in ComfyUI: a 1k×2k image takes 130 seconds. TAT
edited: I just realized this is a discussion about the turbo model; I thought it was about z-image-base :P
How do you guys tolerate this speed... I went downstairs to buy a pack of cigarettes and encountered nine people who had XXXXed "that man", and when I came back, the progress bar wasn't even at the bottom.