JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
Under the JavisVerse project, we introduce JavisDiT++, a concise yet powerful DiT model to generate semantically and temporally aligned sounding videos with textual conditions.
📰 News
- [2026.02.26] 🔥🔥 We have upgraded JavisDiT to JavisDiT++, both of which are accepted at ICLR 2026.
- [2025.12.26] 🎉 JavisDiT and JavisGPT are integrated into the JavisVerse project. We hope to contribute to the Joint Audio-Video Intelligence Symphony (Javis) in the community.
- [2025.12.26] 🎉 We released JavisGPT, a unified multi-modal LLM for sounding-video comprehension and generation. For more details, refer to this repo.
- [2025.08.11] 🔥 We released the data and code for JAVG evaluation. For more details, refer to eval/javisbench/README.md.
- [2025.04.15] 🔥 We released the data preparation and model training instructions. You can train JavisDiT with your own dataset!
- [2025.04.07] 🔥 We released the inference code and a preview model of JavisDiT-v0.1 on HuggingFace.
- [2025.04.03] We released the repository of JavisDiT. Code, model, and data are coming soon.
Brief Introduction
JavisDiT++ addresses the key bottlenecks of Joint Audio-Video Generation (JAVG) from a unified perspective of modeling and optimization.
- We model JAVG via joint self-attention to enable dense inter-modal interaction, with a modality-specific MoE (MS-MoE) design to refine intra-modal representations.
- We propose a temporally aligned rotary position encoding (TA-RoPE) scheme to ensure explicit and fine-grained audio-video token synchronization.
- We devise the AV-DPO technique to consistently improve audio-video quality and synchronization by aligning generation with human preferences.
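To make the second point concrete, here is a minimal sketch of the idea behind temporally aligned rotary position encoding: audio and video tokens are indexed by absolute time rather than by per-modality token index, so tokens at the same timestamp receive identical rotary phases even when the two streams have different frame rates. The function names, dimensions, and frame rates below are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def rope_phases(timestamps, dim, base=10000.0):
    """Rotary phase angles for tokens at the given absolute timestamps.

    Because phases depend only on time (not on token index or modality),
    audio and video tokens at the same timestamp rotate identically.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # (dim/2,)
    return np.outer(timestamps, inv_freq)             # (n_tokens, dim/2)

def apply_rope(x, phases):
    """Rotate consecutive channel pairs of x by the given phase angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(phases), np.sin(phases)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy setup: video latents at 4 tokens/sec, audio latents at 8 tokens/sec,
# both covering the same 1-second clip (rates chosen for illustration).
t_video = np.arange(4) / 4.0   # [0.0, 0.25, 0.5, 0.75]
t_audio = np.arange(8) / 8.0   # [0.0, 0.125, ..., 0.875]
dim = 8
q_video = apply_rope(np.ones((4, dim)), rope_phases(t_video, dim))
q_audio = apply_rope(np.ones((8, dim)), rope_phases(t_audio, dim))

# Tokens sharing an absolute time get the same rotation: video token 2 and
# audio token 4 both sit at t = 0.5 s.
assert np.allclose(q_video[2], q_audio[4])
```

Under standard per-modality RoPE, video token 2 and audio token 4 would get different phases (index 2 vs. index 4); time-based indexing is what gives attention an explicit, fine-grained synchronization signal.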
We hope to set a new standard for the JAVG community. For more technical details, please refer to our paper and GitHub repo.
Citation
If you find JavisDiT++ useful in your project, please cite:
@inproceedings{liu2026javisdit++,
title = {JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation},
author = {Liu, Kai and Zheng, Yanhao and Wang, Kai and Wu, Shengqiong and Zhang, Rongjunchen and Luo, Jiebo and Hatzinakos, Dimitrios and Liu, Ziwei and Fei, Hao and Chua, Tat-Seng},
booktitle = {The Fourteenth International Conference on Learning Representations},
year = {2026},
}