---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- de
- fr
- yue
pipeline_tag: image-to-video
tags:
- text-to-video
- image-text-to-video
- text-to-audio
- text-to-audio-video
- image-to-audio-video
- image-text-to-audio-video
- multimodal
---
# daVinci-MagiHuman
### Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
This repository contains the weights for **daVinci-MagiHuman**, introduced in the [paper](https://huggingface.co/papers/2603.21986).
SII-GAIR & Sand.ai
[GitHub](https://github.com/GAIR-NLP/daVinci-MagiHuman) | [arXiv](https://arxiv.org/abs/2603.21986) | [Hugging Face Space](https://huggingface.co/spaces/SII-GAIR/daVinci-MagiHuman) | [Hugging Face Model](https://huggingface.co/GAIR/daVinci-MagiHuman) | [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) | [Python](https://www.python.org/) | [PyTorch](https://pytorch.org/)
## Highlights
- **Single-Stream Transformer** — A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.
- **Exceptional Human-Centric Quality** — Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.
- **Multilingual** — Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French.
- **Blazing Fast Inference** — Generates a 5-second 256p video in **2 seconds** and a 5-second 1080p video in **38 seconds** on a single H100 GPU.
- **State-of-the-Art Results** — Achieves **80.0%** win rate vs Ovi 1.1 and **60.9%** vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons.
- **Fully Open Source** — We release the complete model stack: base model, distilled model, super-resolution model, and inference code.
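
The single-stream idea in the first highlight can be illustrated with a toy sketch. This is not the released model or its inference code: the token values, the 4-dimensional embeddings, the single attention head, and the identity query/key/value projections are all assumptions chosen to keep the example small. It only shows that once text, video, and audio tokens are concatenated into one sequence, plain self-attention lets every token attend to every modality without a separate cross-attention module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V projections.

    Returns the attended outputs and the attention weight matrix.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Scaled dot-product score of this query against every token in the stream.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        # Weighted sum of all tokens (values are the tokens themselves here).
        mixed = [sum(row[j] * tokens[j][d] for j in range(len(tokens)))
                 for d in range(dim)]
        outputs.append(mixed)
        weights.append(row)
    return outputs, weights


# Hypothetical toy embeddings; a real model would use learned, high-dimensional ones.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
video_tokens = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0], [0.0, 0.0, 1.0, 0.0]]
audio_tokens = [[0.0, 0.0, 0.0, 1.0]]

# Single stream: one concatenated sequence instead of per-modality branches.
stream = text_tokens + video_tokens + audio_tokens
outputs, weights = self_attention(stream)

# Each row of `weights` spans all six tokens, so text, video, and audio mix freely.
for index, row in enumerate(weights):
    print(index, [round(w, 3) for w in row])
```

In a multi-stream design, each modality would typically have its own branch plus cross-attention layers to exchange information. Here the only structural choice is the order in which the modalities are concatenated.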
## Architecture