Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model • Paper • 2502.10248 • Published Feb 14, 2025 • 57
Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models • Paper • 2601.20354 • Published 5 days ago • 107
MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning • Paper • 2601.01568 • Published 29 days ago • 1
Klear: Unified Multi-Task Audio-Video Joint Generation • Paper • 2601.04151 • Published 26 days ago • 16
HeartMuLa: A Family of Open Sourced Music Foundation Models • Paper • 2601.10547 • Published 18 days ago • 41
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head • Paper • 2601.07832 • Published 21 days ago • 51
LTX-2: Efficient Joint Audio-Visual Foundation Model • Paper • 2601.03233 • Published 27 days ago • 144
LTX-2 • Collection • LTX-2 base models and accompanying LoRAs and IC-LoRAs • 13 items • Updated 4 days ago • 47
HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming • Paper • 2512.21338 • Published Dec 24, 2025 • 22
SiD-DiT • Collection • Collection of Distilled Flow Matching Models with Score Identity Distillation • 17 items • Updated Nov 29, 2025 • 1
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling • Paper • 2512.14614 • Published Dec 16, 2025 • 71