arxiv:2605.18739

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

Published on May 18 · Submitted by Wei Huang on May 19
#3 Paper of the day

Abstract

AI-generated summary: LongLive-2.0 presents an NVFP4-based parallel infrastructure for long video generation that addresses training and inference bottlenecks through sequence-parallel autoregressive training and diffusion model tuning.

We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
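To make the NVFP4 precision mentioned in the abstract more concrete, here is a minimal sketch of block-scaled 4-bit "fake quantization" in plain Python. It is not the LongLive-2.0 implementation. It assumes the commonly described NVFP4 layout (E2M1 values in blocks of 16 sharing one scale) and simplifies it: the per-block scale is kept as an ordinary Python float rather than being encoded in FP8 with an additional per-tensor scale, and ties are rounded toward the smaller magnitude. The names `FP4_GRID`, `BLOCK`, `quantize_block`, and `fake_quantize` are hypothetical and chosen only for illustration.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Assumed block size: 16 consecutive values share one scale.
BLOCK = 16


def quantize_block(block):
    """Fake-quantize one block: scale into FP4 range, snap to grid, rescale.

    Returns (dequantized_values, scale). The scale stays a Python float here,
    which is a simplification of the real format.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    # Map the largest magnitude in the block onto the largest FP4 value.
    scale = amax / FP4_GRID[-1]
    out = []
    for v in block:
        mag = abs(v) / scale
        # Nearest grid point; min() keeps the first (smaller) value on ties.
        q = min(FP4_GRID, key=lambda g: abs(g - mag))
        out.append((q if v >= 0 else -q) * scale)
    return out, scale


def fake_quantize(values):
    """Apply block-wise fake quantization to a flat list of floats."""
    out, scales = [], []
    for start in range(0, len(values), BLOCK):
        q, s = quantize_block(values[start:start + BLOCK])
        out.extend(q)
        scales.append(s)
    return out, scales


if __name__ == "__main__":
    data = [i * 0.37 - 3.0 for i in range(32)]
    deq, scales = fake_quantize(data)
    worst = max(abs(a - b) for a, b in zip(data, deq))
    print(f"blocks={len(scales)} worst_abs_error={worst:.4f}")
```

Because each block carries its own scale, a single outlier only coarsens the 16 values around it rather than the whole tensor, which is the usual motivation for small-block scaling in 4-bit formats.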

Community

Paper submitter

Balanced sequence parallelism in LongLive-2.0 is a clever way to fuse teacher forcing with SP execution, but how sensitive are the training gains to the exact chunk size and to the clean-history vs. noisy-target pairing when real long-form content has irregular shot lengths? The arxivlens breakdown (https://arxivlens.com/PaperView/Details/longlive-2-0-an-nvfp4-parallel-infrastructure-for-long-video-generation-9770-66c951dd) helped me parse the method details, especially how the SP-aware chunked VAE encoding and per-rank memory reductions interact with KV cache quantization. I would love to see a small ablation that varies the shot-length distribution, or a test on datasets with more abrupt scene changes, to see whether the gains persist.



