
[논문리뷰] Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

Link: open the paper PDF directly


Authors: Hanwen Sun, Hansi Teng, Ethan Chern, et al. (SII-GAIR & Sand.ai). Appendix A lists the full author roster and identifies Yue Cao and Pengfei Liu as Project Leaders.

Keywords: Audio-Video Generation, Generative Foundation Model, Single-Stream Transformer, Human-Centric Generation, Fast Inference, Multilingual.

Key Terms & Definitions:

  • Single-Stream Transformer : A model architecture that processes text, video, and audio tokens within a unified token sequence using self-attention only, avoiding separate cross-attention or fusion modules.
  • Latent-Space Super-Resolution : A two-stage inference technique where a base model generates low-resolution video/audio latents, and a dedicated super-resolution stage refines the video in latent space for higher output resolutions.
  • Turbo VAE Decoder : A lightweight, re-trained VAE decoder used at inference time to substantially reduce decoding overhead, which is critical for both the base generator and super-resolution pipeline.
  • Word Error Rate (WER) : A standard metric for speech recognition accuracy and speech intelligibility, computed as the word-level edit distance between the hypothesis and the reference transcript divided by the reference length; a lower percentage indicates better accuracy.
  • Human-Centric Generation : The model's particular strength in generating expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization, focusing on human subjects.
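The WER metric used in the evaluation can be computed with a standard word-level edit-distance routine; the following is a minimal self-contained sketch (not the exact tooling the paper used, which likely pairs an ASR model with a scorer such as this):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = min edits turning the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # (mis)match
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of 1/6 ≈ 16.7%
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

So a reported WER of 14.60% means roughly one word in seven of the generated speech is transcribed incorrectly relative to the intended script.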

Motivation & Problem Statement: Current open-source audio-video generation models struggle to combine strong generation quality, multilingual support, and inference efficiency within a simple and scalable architecture. Existing models often rely on complex multi-stream designs with separate pathways and fusion blocks for different modalities, an architectural complexity that complicates optimization and hinders further research and community development.

Methodology & Key Results: The paper proposes daVinci-MagiHuman, an audio-video generative foundation model built around a Single-Stream Transformer architecture. This 40-layer, 15B-parameter Transformer employs a sandwich layout in which the initial and final layers use modality-specific parameters, while the middle layers share parameters for deep multimodal fusion. It features Timestep-Free Denoising and Per-Head Gating for numerical stability and representational capacity. For efficient inference, it uses Latent-Space Super-Resolution to refine low-resolution outputs, a Turbo VAE Decoder for reduced decoding overhead, Full-Graph Compilation (MagiCompiler) for a ~1.2x speedup on H100, and Distillation (DMD-2) to preserve strong generation quality with only 8 denoising steps and no CFG.
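The sandwich layout can be illustrated with a toy sketch. Everything below (dimensions, token counts, the `layer()` stand-in for a Transformer block) is an illustrative assumption, not the paper's implementation; the point is only the data flow: modality-specific early layers, one concatenated token sequence through shared middle layers, then modality-specific final layers.

```python
import numpy as np

D = 8                          # toy hidden size (the real model is 15B params)
N_TXT, N_VID, N_AUD = 3, 5, 4  # toy token counts per modality

rng = np.random.default_rng(0)

def make_w():
    return rng.standard_normal((D, D)) * 0.1

def layer(x, w):
    """Stand-in for one Transformer block: a single nonlinear map."""
    return np.tanh(x @ w)

# Early ("bread") layers: modality-specific parameters, one stream each.
txt = layer(rng.standard_normal((N_TXT, D)), make_w())
vid = layer(rng.standard_normal((N_VID, D)), make_w())
aud = layer(rng.standard_normal((N_AUD, D)), make_w())

# Middle ("filling") layers: ONE unified token sequence with shared
# parameters — joint self-attention-style fusion, no cross-attention modules.
x = np.concatenate([txt, vid, aud], axis=0)  # shape (N_TXT+N_VID+N_AUD, D)
for w_shared in [make_w() for _ in range(4)]:
    x = layer(x, w_shared)

# Final layers: split the sequence back and apply modality-specific heads.
txt_out, vid_out, aud_out = np.split(x, [N_TXT, N_TXT + N_VID], axis=0)
```

The appeal of this design is that the fused middle section is just an ordinary Transformer over one sequence, which keeps optimization, compilation, and distillation pipelines simple.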

In quantitative evaluation, daVinci-MagiHuman achieves the highest Visual Quality (4.80) and Text Alignment (4.18) scores and the lowest Word Error Rate (WER) of 14.60% for speech intelligibility, significantly outperforming Ovi 1.1 (WER 40.45%) and LTX 2.3 (WER 19.23%). Inference on a single H100 GPU allows generation of a 5-second 256p video in 2 seconds, and a 5-second 1080p video in 38.4 seconds. Human evaluation further shows a preference for daVinci-MagiHuman, with win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2,000 comparisons.
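To put the latency numbers in perspective, the real-time factor (generation time divided by clip duration) follows directly from the figures above:

```python
# Latency figures from the paper: a 5-second clip takes 2 s at 256p
# and 38.4 s at 1080p on a single H100 GPU.
clip_seconds = 5.0
rtf_256p = 2.0 / clip_seconds    # real-time factor; < 1 is faster than real time
rtf_1080p = 38.4 / clip_seconds

print(rtf_256p, rtf_1080p)  # ≈ 0.4 and ≈ 7.68
```

In other words, the 256p base model runs ~2.5x faster than real time, which is what makes the latency-sensitive interactive applications mentioned in the conclusion plausible, while full 1080p output remains an offline, ~7.7x-slower-than-real-time process.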

Conclusion & Impact: daVinci-MagiHuman provides a robust open-source foundation for audio-video generation by integrating architectural simplicity with high human-centric generation quality, broad multilingual support, and fast inference. Its single-stream design simplifies development and optimization, making it more accessible for future research and community contributions. The model's superior performance in quality, intelligibility, and inference speed, along with its full open-source release, establishes a practical and extensible platform for advancing the field of audio-video generation, particularly for latency-sensitive interactive applications and diverse linguistic contexts.

Important Figures:

  • Figure 1 (https://arxiv.org/html/2603.21986v1/x1.png): Examples of generated videos
  • Figure 2 (https://arxiv.org/html/2603.21986v1/figures/arch.png): Overall architecture
  • Figure 3 (https://arxiv.org/html/2603.21986v1/x2.png): Human evaluation results



⚠️ Notice: this review was generated with AI.
