#Multimodal Generation

19 posts

[Paper Review] MultiRef-Compass: Towards Comprehensive Evaluation of Multi-Reference-to-Audio-Video Generation

This study starts from the problem that existing video generation benchmarks concentrate on single-reference tasks and therefore fail to adequately evaluate the complex multi-reference generation capabilities that real-world content production requires.

#Review #MultiRef-Compass #MR2AV #Multimodal Generation #Reference Consistency #Instruction Following #Benchmark #MLLM-as-a-Judge

July 16, 2026

[Paper Review] Concurrent Image Understanding and Generation: Self-Correcting Coupled Markov Jump Processes

Existing multimodal generation systems produce text and images independently or asynchronously, which leads to severe contradictions between modalities that cannot be corrected after the fact.

#Review #Masked Diffusion Models #Multimodal Generation #Coupled Markov Jump Processes #Self-Correction #Remasking #Visual Reasoning

July 16, 2026

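[Paper Review] Unified Audio Intelligence Without Regressing on Text Intelligence

This paper addresses the problem that existing LLMs augmented with multimodal capabilities such as audio and vision suffer severe regressions in text reasoning and knowledge processing. In particular, recent multimodal models, despite their strong generation capabilities, show a noticeable drop on reasoning benchmarks relative to their original base models.

#Review #Audio-Text LLM #Mixture-of-Experts (MoE) #Multimodal Generation #Cascade RL #Audio Intelligence

July 6, 2026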

[Paper Review] Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

This paper proposes Data2Story to address hallucination and the lack of data transparency in data journalism.

#Review #Data Journalism #Multi-Agent System #Evidence-Grounded #Multimodal Generation #Verifiability #Auditability

June 15, 2026

[Paper Review] Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

This paper addresses the performance degradation that existing multimodal generation models suffer when handling complex multi-image instructions.

#Review #Multimodal Generation #Interleaved Instructions #Object Binding #Transformer #Multimodal Image Editing #Scalable Data Engine

May 12, 2026

[Paper Review] STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

This paper aims to resolve the structural fragmentation of generation mechanisms in existing unified multimodal models.

#Review #Multimodal Generation #Normalizing Flows #Autoregressive Transformers #Pretzel Architecture #Unified Modeling #Visual Understanding

May 10, 2026

[Paper Review] MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

The core problem this paper tackles is how to combine the strong semantic reasoning of MLLMs with the high-quality image generation of diffusion models while maximizing training efficiency.

#Review #Multimodal Generation #Vision-Language Model #Latent Embeddings #Diffusion Model #Representation Alignment #Unified Framework

April 22, 2026

[Paper Review] FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

This paper aims to address the problem that existing multimodal generation relies on language-model-centric pipelines, which limits the ability of vision to reason and generate on its own.

#Review #Multimodal Generation #Flow Matching #Visual Prompts #Image-in Image-out #Visual Instruction Following #VisPrompt-5M #VP-Bench

April 8, 2026

[Paper Review] JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

The paper aims to address the shortcomings of existing open-source joint audio-video generation (JAVG) models relative to commercial models (e.g., Veo3) in generation quality, temporal synchronization, and human preference alignment.

#Review #Joint Audio-Video Generation #Diffusion Transformer #Modality-specific Mixture-of-Experts #Temporal-Aligned ROPE #Direct Preference Optimization #Multimodal Generation #Text-to-AV

February 25, 2026

[Paper Review] Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

This paper aims to integrate 3D facial animation generation into omni-modal large language models (OLLMs), enabling natural and expressive multimodal output in response to text and speech input.

#Review #Omni-modal LLMs #3D Facial Animation #Speech-to-Face Generation #Token-as-Query Gated Fusion (TQGF) #Discrete Speech Units #ARKit-52 Blendshapes #InstructEx Dataset #Multimodal Generation

February 11, 2026

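[Paper Review] TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

The paper aims to develop TUNA, a native unified multimodal model (UMM) that performs multimodal understanding and generation tasks seamlessly within a single framework. It seeks to overcome the limitations caused by the decoupled or biased visual representations of existing UMMs and to build a unified continuous visual representation space that is effective for both understanding and generation.

#Review #Unified Multimodal Models #Visual Representation #VAE #Flow Matching #Multimodal Understanding #Multimodal Generation #Image Editing #State-of-the-Art

December 1, 2025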

[Paper Review] Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

This paper aims to develop Uni-MoE-2.0-Omni, an efficient omnimodal large model that unifies multimodal understanding, reasoning, and generation through a language-centric approach.

#Review #Omnimodal Large Models #Mixture-of-Experts (MoE) #Language-Centric AI #Multimodal Understanding #Multimodal Generation #Progressive Training #Omni-Modality 3D RoPE

November 17, 2025

[Paper Review] Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

This paper addresses the fundamental mismatch between multimodal understanding (I2T) and generation (T2I) and explores whether the two can genuinely benefit each other rather than merely coexist. The authors present a single, fundamental objective function that unifies the two tasks, aiming to improve the performance of multimodal systems in a mutually complementary way.

#Review #Multimodal Understanding #Multimodal Generation #Unified Models #Auto-Encoder #Reinforcement Learning #Image-to-Text #Text-to-Image #Reconstruction Fidelity

September 12, 2025

[Paper Review] MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation

This paper aims to build an interactive digital human video generation system that responds to diverse input signals in real time while maintaining low latency and high visual consistency. It seeks to overcome the limitations of existing approaches, including high latency, high computational cost, and limited controllability.

#Review #Multimodal Generation #Digital Human Synthesis #Real-time Video Generation #Autoregressive LLM #Diffusion Models #Deep Compression Autoencoder #Exposure Bias Mitigation #Streaming Inference

August 28, 2025

[Paper Review] EgoTwin: Dreaming Body and View in First Person

This paper explores an uncharted area of egocentric video generation. In particular, it introduces a new task: jointly generating egocentric video and human motion so that the camera wearer's motion and viewpoint are consistent and causally linked.

#Review #Egocentric Video Generation #Human Motion Synthesis #Diffusion Transformers #Multimodal Generation #Viewpoint Alignment #Causal Interplay #First-Person Vision

August 25, 2025

[Paper Review] Uniform Discrete Diffusion with Metric Path for Video Generation

This paper aims to close the performance gap between discrete-space video generation models and the continuous-space video generation models they have lagged behind.

#Review #Discrete Diffusion #Video Generation #Metric Path #Long Video Generation #Asynchronous Scheduling #Text-to-Video #Multimodal Generation

October 29, 2025

[Paper Review] DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation

The main goal is to evaluate whether current multimodal generation models can generate content effectively from text input written in diverse English dialects, and to address the performance degradation that dialect speakers experience.

#Review #Multimodal Generation #Dialect Robustness #Text-to-Image #Text-to-Video #Benchmarking #Diffusion Models #Text Encoder Tuning #Low-Resource Dialects

October 17, 2025

[Paper Review] OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows

This paper aims to overcome two fundamental limitations: the strictly sequential generation of autoregressive (AR) models and the fixed-length generation of diffusion models.

#Review #Non-Autoregressive #Multimodal Generation #Edit Flows #Flow Matching #Interleaved Generation #Text-to-Image Synthesis #Unified Models

October 8, 2025

[Paper Review] Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

This paper aims to resolve the modality-specific fragmentation of existing medical AI models and to develop a general-purpose medical AI agent capable of unified generation across medical images (radiology, pathology) and clinical reports.

#Review #Discrete Diffusion Models #Multimodal Large Language Models (MLLMs) #Medical Image Generation #Medical Report Generation #Multimodal Generation #Medical AI #Cross-modal Alignment

October 8, 2025