#Video Diffusion Models

39 posts

[Paper Review] LongE2V: Long-Horizon Event-based Video Reconstruction, Prediction, and Frame Interpolation with Video Diffusion Models

This paper proposes LongE2V to address the performance limits and task-specific fragmentation of existing event-based vision models.

#Review #Event-based Vision #Video Diffusion Models #Video Reconstruction #Long-horizon Prediction #Frame Interpolation #Autoregressive Unrolling

July 9, 2026

[Paper Review] Flex-Forcing: Towards a Unified Autoregressive and Bidirectional Video Diffusion Model

Existing video generation models are split into two separate paradigms, bidirectional diffusion and autoregressive models, and each comes with its own pronounced trade-offs.

#Review #Video Diffusion Models #Autoregressive Generation #Bidirectional Generation #Flexible Chunking #Denoising Timesteps #KV Caching #Any-order Editing

July 7, 2026

[Paper Review] The Surprising Effectiveness of Video Diffusion Models for Hand Motion Reconstruction

This paper sets out to resolve a severe bottleneck facing existing egocentric 4D hand motion reconstruction methods. Existing approaches either rely on image-based detectors or use temporal modules trained on limited data, so their performance degrades under heavy occlusion.

#Review #Video Diffusion Models #Hand Motion Reconstruction #Egocentric Video #4D Reconstruction #Embodied AI #Occlusion Reasoning

June 29, 2026

[Paper Review] DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

This paper addresses the problem that existing Subject-driven Video Generation (S2V) models focus on in-domain fidelity within a fixed domain but lack flexibility and editing capability in cross-domain settings where style or domain attributes change.

#Review #Subject-driven Video Generation #Open Domain #Domain-MoT #DualRoPE #Cross-Pair Consistent Loss #Video Diffusion Models

June 24, 2026

[Paper Review] FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

This paper aims to overcome the geometric limitations of the volume-based 3D Gaussian representations that existing feedforward scene generation models output.

#Review #3D Scene Generation #Triangle Splatting #Video Diffusion Models #Differentiable Rendering #Feedforward Latent Decoding #Surface Reconstruction

June 23, 2026

[Paper Review] RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

This paper aims to reduce the excessive inference latency and computational cost that the quadratic complexity of 3D spatiotemporal attention causes in existing video generation models.

#Review #Video Diffusion Models #Diffusion Transformers #Training-Free Acceleration #Asynchronous Scheduling #Latent Trajectory Projection #Spatiotemporal Coherence

June 14, 2026

[Paper Review] FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

This paper addresses the temporal collapse of videos that arises in autoregressive video diffusion models because long-term context is difficult to maintain.

#Review #Video Diffusion Models #Memory Consolidation #Autoregressive Generation #Temporal Consistency #Long-term Dependency

June 9, 2026

[Paper Review] Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

This paper proposes Flash-GRPO, a single-step training framework for efficient alignment of video diffusion models.

#Review #Video Diffusion Models #Group Relative Policy Optimization #Reinforcement Learning #Single-step Training #Iso-temporal Grouping #Temporal Gradient Rectification #Alignment

May 17, 2026

[Paper Review] AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

This paper proposes AnyFlow to address a structural limitation of existing consistency-distillation-based models: they are tied to fixed NFE budgets, so performance actually degrades as the number of sampling steps increases.

#Review #Video Diffusion Models #Flow Map #Any-Step Distillation #On-Policy Distillation #Test-Time Scaling #Backward Simulation #Causal Video Generation

May 13, 2026

[Paper Review] UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Existing video generation research takes a fragmented approach, training a separate model for each problem setting (e.g., Text-to-Video, Inverse Rendering), which restricts each model to a fixed input-output mapping and leaves cross-modal correlations unexploited.

#Review #Video Diffusion Models #Multimodal Video Generation #Intrinsic Decomposition #Diffusion Priors #Stochastic Condition Masking #Decoupled Gated LoRA #Cross-Modal Self-Attention

May 3, 2026

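[Paper Review] VOID: Video Object and Interaction Deletion

This work builds a counterfactual generative model that reflects physical causality, based on the CogVideoX diffusion model. The model first learns how physical dynamics change before and after object removal using Kubric and HUMOTO, then uses a VLM to infer the affected regions of the video in real time and generate a Quadmask, which clearly constrains the model's generation range.

#Review #Video Object Removal #Counterfactual Reasoning #Video Diffusion Models #Interaction-Aware Masking #Vision-Language Models

April 2, 2026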

[Paper Review] Generative World Renderer

This paper aims to resolve the data bottleneck that arises when extending generative inverse and forward rendering to real-world (in-the-wild) settings.

#Review #Generative World Renderer #Inverse Rendering #G-buffer #Dataset Construction #Video Diffusion Models #VLM-based Evaluation

April 2, 2026

[Paper Review] EgoSim: Egocentric World Simulator for Embodied Interaction Generation

This paper addresses two problems of existing egocentric world simulators: a lack of 3D-grounded spatial consistency and inadequate world state updates in response to dynamic interactions.

#Review #Egocentric World Simulator #Updatable 3D State #Embodied Interaction Generation #Video Diffusion Models #Scalable Data Pipeline

April 2, 2026

[Paper Review] VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Large-scale video diffusion models deliver excellent visual quality but are vulnerable to 3D/4D consistency problems such as unstable camera trajectories and geometric drift.

#Review #Video Diffusion Models #Geometric Consistency #Reinforcement Learning #Latent Geometry Model #4D Reconstruction #Group Relative Policy Optimization

March 31, 2026

[Paper Review] PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Autoregressive video diffusion models have advanced considerably in recent years, but they face several major constraints when generating long videos.

#Review #Autoregressive Video Generation #KV Cache Management #Long Context Inference #Video Diffusion Models #Temporal Consistency #Spatiotemporal Compression #RoPE Adjustment #Dynamic Context Selection

March 29, 2026

[Paper Review] RealMaster: Lifting Rendered Scenes into Photorealistic Video

State-of-the-art video generation models produce highly photorealistic imagery, but they are limited in precisely controlling generated content to match specific scene requirements. In addition, because they lack explicit geometry, 3D consistency is hard to guarantee.

#Review #Sim-to-Real Translation #Photorealistic Video Generation #Video Diffusion Models #Structural Precision #Global Semantic Transformation #IC-LoRA #Temporal Consistency

March 24, 2026

[Paper Review] MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Video diffusion models are evolving from generators of merely plausible clips into world simulators that stay consistent under camera motion, revisits, and intervention.

#Review #Spatial Memory #World Models #Video Diffusion Models #Hybrid Memory #Controllable Video Generation #Long-horizon Consistency #Patch-and-Compose

March 18, 2026

[Paper Review] DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

Large-scale diffusion models have transformed video synthesis, but precise control over multi-subject identity and multi-granularity motion remains a major challenge.

#Review #Video Diffusion Models #Video Customization #Motion Control #Reinforcement Learning #Multi-Subject #Omni-Motion #Latent Identity #DiT

March 12, 2026

[Paper Review] ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

This work tackles the fundamental problem of synthesizing physically plausible articulated human-object interaction (HOI) without 3D/4D supervision. Its main goal is to overcome the unrealistic interactions that arise because existing zero-shot methods are confined to rigid-object manipulation and lack explicit 4D geometric reasoning.

#Review #Human-Object Interaction (HOI) #4D Reconstruction #Articulated Objects #Video Diffusion Models #Inverse Rendering #Zero-shot Learning #Motion Synthesis #3D Gaussians

March 4, 2026

[Paper Review] Solaris: Building a Multiplayer Video World Model in Minecraft

The goal is to overcome the limitations of existing single-agent video world models and build a multi-agent video world model (Solaris) that can simulate consistent multi-view observations in complex 3D environments such as Minecraft.

#Review #Multi-agent World Models #Video Diffusion Models #Minecraft #Self Forcing #Checkpointed Self Forcing #Multi-view Consistency #Data Collection #Embodied AI

February 25, 2026

[Paper Review] World Action Models are Zero-shot Policies

This paper addresses a limitation of Vision-Language-Action (VLA) models: poor generalization to unseen physical motions in new environments.

#Review #World Action Models #Video Diffusion Models #Zero-shot Generalization #Cross-embodiment Transfer #Real-time Control #Robotics #Foundation Models #Flow Matching

February 18, 2026

[Paper Review] Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

This paper addresses the KV-cache memory problem, a major bottleneck of autoregressive video generation models.

#Review #Auto-Regressive Video Generation #KV-Cache Quantization #Memory Optimization #Long Video Generation #Video Diffusion Models #Semantic-Aware Smoothing #Progressive Residual Quantization

February 4, 2026

[Paper Review] SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

This work aims to reduce the high computational latency caused by the long input sequences of video Diffusion Transformers, and to overcome the limited sparsity or excessive training overhead of existing sparse attention methods.

#Review #Video Diffusion Models #Sparse Attention #Linear Attention #Computational Efficiency #Transformer Tuning #Video Generation #LoRA #Gating Mechanism

January 25, 2026

[Paper Review] Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models

This paper addresses the weak text-prompt controllability of Diffusion Transformer (DiT)-based Image-to-Video (I2V) models.

#Review #Video Diffusion Models #Image-to-Video Generation #Diffusion Transformers (DiT) #Controllability #Semantic Alignment #Focal Guidance #Prompt Adherence

January 14, 2026

[Paper Review] Yume-1.5: A Text-Controlled Interactive World Generation Model

This paper aims to generate realistic, interactive, and continuous virtual worlds in real time by overcoming limitations of existing video diffusion models such as large parameter counts, long inference steps, rapidly growing historical context, and a lack of text-based control.

#Review #Interactive World Generation #Video Diffusion Models #Text-to-Video #Image-to-Video #Real-time Generation #Temporal-Spatial-Channel Modeling #Self-Forcing

December 29, 2025

[Paper Review] WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion

This paper tackles the fundamental problem of generating long-range, geometrically consistent novel-view video from a single image.

#Review #Novel View Synthesis #3D Geometry Propagation #Video Diffusion Models #Gaussian Splatting #Autoregressive Generation #Spatio-Temporal Noise #Geometric Consistency

December 22, 2025

[Paper Review] EgoX: Egocentric Video Generation from a Single Exocentric Video

This work aims to generate realistic and consistent first-person (egocentric) video from a single third-person (exocentric) video input.

#Review #Egocentric Video Generation #Exocentric-to-Egocentric #Video Diffusion Models #3D Scene Reconstruction #Geometry-Guided Attention #View Synthesis #Camera Pose Estimation #LoRA Adaptation

December 14, 2025

[Paper Review] Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

This work aims to overcome a limitation of existing efficient streaming video generation methods, which over-rely on static initial tokens and therefore suffer from degraded motion dynamics and a 'frame copying' problem. It targets video generation that maintains both high visual fidelity and strong motion dynamics in real time.

#Review #Streaming Video Generation #Video Diffusion Models #Distribution Matching Distillation #Reinforcement Learning #Autoregressive Models #Attention Sink #Real-time

December 4, 2025

[Paper Review] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

This paper aims to resolve three core limitations of existing autoregressive video diffusion models.

#Review #Autoregressive Video Generation #Rotary Positional Embedding #Infinite Video Generation #Action Control #Cinematic Transitions #Video Diffusion Models #KV Cache

December 1, 2025

[Paper Review] Loomis Painter: Reconstructing the Painting Process

This paper aims to generate a realistic and consistent step-by-step painting process for any input image by resolving the temporal discontinuity, structural inconsistency, and poor generalization across artistic media that existing generative models suffer from.

#Review #Painting Process Generation #Video Diffusion Models #Media Transfer #Reverse Painting #Dataset Curation #Perceptual Distance Profile #Artistic Workflow #Generative AI

November 23, 2025

[Paper Review] Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation

The core goal of this paper is to generate high-quality 3D and 4D scenes from a single image or video input without real-world multi-view data.

#Review #Generative AI #3D Scene Reconstruction #Video Diffusion Models #Self-Distillation #3D Gaussian Splatting #Dynamic 4D Generation #Monocular Input

September 24, 2025

[Paper Review] WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

This work addresses the limits in controllability, spatiotemporal consistency, and geometric fidelity that existing video diffusion models (VDMs) face on 3D/4D tasks.

#Review #Video Diffusion Models #3D/4D Generation #Training-Free Guidance #Camera Trajectory Control #Novel View Synthesis #Geometric Consistency #Inference-Time Optimization

September 19, 2025

[Paper Review] Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation

To address the scarcity of 3D data, this work aims to improve the generalization of 3D generation models by leveraging commonsense priors obtained from large-scale video data.

#Review #3D Generation #Video Diffusion Models #Spatial Consistency #Semantic Knowledge #Multi-view Synthesis #Large-scale Dataset #Image-to-3D #Text-to-3D

September 1, 2025

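[Paper Review] ObjFiller-3D: Consistent Multi-view 3D Inpainting via Video Diffusion Models

This work addresses the cross-view inconsistency, blurry textures, and spatial discontinuities that arise because existing 3D inpainting methods rely on multi-view 2D image inpainting. To overcome these issues, it leverages the spatiotemporal consistency of video diffusion models, targeting high-quality, consistent 3D object completion and editing.

#Review #3D Inpainting #Multi-view Consistency #Video Diffusion Models #3D Object Completion #Generative Models #LoRA #3D Gaussian Splatting

August 27, 2025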

[Paper Review] ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

This paper aims to reduce the manual effort and error accumulation in the inbetweening and colorization stages, which are key bottlenecks of the traditional cartoon production pipeline.

#Review #Cartoon Generation #Video Diffusion Models #DiT #Post-Keyframing #Low-Rank Adaptation #Sparse Control #Generative AI #Animation

August 15, 2025

[Paper Review] Rethinking Visual Intelligence: Insights from Video Pretraining

Despite the success of Large Language Models (LLMs), limitations in compositional understanding, sample efficiency, and general-purpose problem solving persist in the visual domain.

#Review #Video Diffusion Models #Visual Intelligence #Pretraining #Foundation Models #Low-resource Learning #Inductive Biases #Visual Reasoning #Image-to-Image Tasks

October 29, 2025

[Paper Review] Point Prompting: Counterfactual Tracking with Video Diffusion Models

This paper explores whether pretrained video diffusion models can perform point tracking in a zero-shot manner, without additional training.

#Review #Video Diffusion Models #Point Tracking #Zero-Shot Learning #Counterfactual Modeling #Visual Prompting #SDEdit #Negative Prompting #Object Permanence

October 16, 2025

[Paper Review] VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

This paper proposes a new task, arbitrary spatio-temporal video completion, in which a video is generated from patches that the user places at arbitrary spatial and temporal locations.

#Review #Video Completion #Spatio-Temporal Control #In-Context Conditioning #Video Diffusion Models #RoPE Interpolation #VAE #Unified Framework #Video Generation

October 10, 2025

[Paper Review] SViM3D: Stable Video Material Diffusion for Single Image 3D Generation

This paper presents a framework that predicts multi-view-consistent PBR (Physically Based Rendering) materials (albedo, roughness, metallic, surface normals) from a single image, addressing a long-standing challenge in single-image inverse rendering.

#Review #Single Image 3D Reconstruction #Material Prediction #Video Diffusion Models #Physically Based Rendering (PBR) #Inverse Rendering #Novel View Synthesis #Camera Control #Latent Diffusion

October 10, 2025