#Unified Multimodal Models

16개의 포스트

[논문리뷰] Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

본 논문은 UMMs에서 수행된 텍스트 기반 지식 편집(Knowledge Editing)이 이미지 생성 과정으로 적절히 전이되는지 검증하고자 합니다 . 기존의 텍스트 도메인 지식 편집 기법들은 텍스트 출력에서는 높은 성공률을 보이지만, 동일한 수정이 시각적 생성 결과로까지 일관되게 이어지는지는 불명확합니다.

#Review #Unified Multimodal Models #Knowledge Editing #Cross-Modal Transfer #Visual Generation #UniKE #Reasoning-Augmented

2026년 6월 3일

[논문리뷰] Representation Forcing for Bottleneck-Free Unified Multimodal Models

본 논문은 기존 UMM이 frozen VAE에 의존하여 발생하는 structural bottleneck 문제를 해결하기 위해 Representation Forcing (RF)을 제안한다 .

#Review #Unified Multimodal Models #Representation Forcing #Pixel-space Diffusion #Vector Quantization #End-to-End Learning #Bottleneck-Free #Mixture-of-Transformers

2026년 5월 31일

[논문리뷰] Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

본 연구는 UMM 학습 시 이해와 생성 작업 간에 발생하는 아키텍처적 충돌과 이로 인한 성능 트레이드오프 문제를 해결하고자 한다. 기존의 다중 작업 학습(Multi-task learning)은 복잡한 파이프라인과 데이터 균형 조정 기법을 필요로 하며, 종종 한 작업의 성능 향상이 다른 작업의 저하를 초래하는 한계가 있다.

#Review #Unified Multimodal Models #Intelligent Image Editing #Instruction Tuning #Data Synthesis #Multi-task Learning #Reasoning-intensive

2026년 5월 20일

[논문리뷰] Semantic Generative Tuning for Unified Multimodal Models

본 논문은 현대 UMM들이 이해와 생성이라는 두 핵심 과업을 분리된 최적화 경로로 학습함으로써 발생하는 표현적 불일치(Representational misalignment) 문제를 해결하고자 합니다.

#Review #Unified Multimodal Models #Generative Tuning #Image Segmentation #Multimodal Alignment #Semantic Proxy #Representation Learning

2026년 5월 19일

[논문리뷰] Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

본 논문은 최신 UMM이 이해와 생성 기능을 한 모델 내에 통합했음에도 불구하고, 실제로는 두 구성 요소가 상호작용 없이 분리된(Decoupled) 구조로 설계되어 성능 극대화에 한계가 있다는 문제를 지적합니다.

#Review #Unified Multimodal Models #Understanding-Oriented Post-Training #Generation Synergy #Flow Matching #Semantic Supervision #MetaQuery

2026년 5월 10일

[논문리뷰] InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

통합 멀티모달 모델(UMM)이 강한 의미론적 이해와 강력한 생성 능력 사이에서 겪는 본질적인 상충 관계를 해결하고자 합니다. 이 논문은 InternVL-U 라는 경량의 4B 매개변수 UMM을 제안하여, 이해, 추론, 생성, 편집 능력을 하나의 통합 프레임워크 내에서 민주화하는 것을 목표로 합니다.

#Review #Unified Multimodal Models #Multimodal Large Language Model #Image Generation #Image Editing #Chain-of-Thought #Data Synthesis #Low-parameter Models

2026년 3월 10일

[논문리뷰] UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

이 논문은 통합 멀티모달 모델에서 생성(generation) 능력이 이해(understanding) 능력을 향상시키는지, 그리고 언제, 어떤 방식으로 향상시키는지 에 대한 불확실성을 해결하고자 합니다.

#Review #Unified Multimodal Models #Multimodal Understanding #Generation-to-Understanding #Benchmark #Vision-Language Models #Generate-then-Answer #Model Evaluation

2026년 3월 3일

[논문리뷰] Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

본 논문은 기존 AI 시스템이 언어적/추상적 영역에서 강세를 보이지만, 풍부한 표현과 사전 지식, 특히 명시적인 시각적 세계 모델링의 부족으로 인해 물리적/공간적 지능 분야에서는 인간에 비해 뒤처지는 문제를 해결하고자 합니다.

#Review #Multimodal AI #World Models #Visual Generation #Chain-of-Thought (CoT)#Multimodal Reasoning #Unified Multimodal Models #Spatial-Physical Reasoning

2026년 1월 27일

[논문리뷰] UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

본 연구는 통합 멀티모달 모델(UMMs)이 입력 이해는 뛰어나지만, 그 이해를 고품질 생성으로 변환하는 데 어려움을 겪는 현상인 'Conduction Aphasia' 문제를 해결하는 것을 목표로 합니다.

#Review #Unified Multimodal Models #Self-Supervised Learning #Text-to-Image Generation #Multi-Agent Framework #Cognitive Pattern Reconstruction #Cycle-Consistency #Conduction Aphasia

2026년 1월 6일

[논문리뷰] TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

논문은 멀티모달 이해와 생성 태스크를 단일 프레임워크 내에서 원활하게 수행하는 TUNA라는 네이티브 통합 멀티모달 모델(UMM) 을 개발하는 것을 목표로 합니다. 기존 UMM의 분리된 또는 편향된 시각 표현 방식 으로 인한 한계를 극복하고, 이해와 생성 모두에 효과적인 통합된 연속 시각 표현 공간 을 구축하고자 합니다.

#Review #Unified Multimodal Models #Visual Representation #VAE #Flow Matching #Multimodal Understanding #Multimodal Generation #Image Editing #State-of-the-Art

2025년 12월 1일

[논문리뷰] Architecture Decoupling Is Not All You Need For Unified Multimodal Model

본 논문은 통합 멀티모달 모델(UMM)에서 시각 생성 및 이해 태스크 간의 내재된 충돌을 완화하면서도 모델 아키텍처 디커플링에 과도하게 의존하지 않고 성능을 향상시키는 것을 목표로 합니다. 과도한 디커플링이 통합 모델의 상호작용적 추론 능력과 지식 전이 능력을 저해하는 문제를 해결하고자 합니다.

#Review #Unified Multimodal Models #Architecture Decoupling #Cross-Modal Attention #Attention Interaction Alignment (AIA) Loss #Task Conflicts #Image Generation #Image Understanding

2025년 11월 30일

[논문리뷰] Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

본 논문은 통합 멀티모달 모델(UMMs)에서 '이해' 능력이 '생성' 과정에 실제로 정보를 제공하고 안내하는지 여부를 조사합니다.

#Review #Unified Multimodal Models #Understanding-Generation Gap #Reasoning #Knowledge Transfer #Chain-of-Thought #Self-Training #Synthetic Data #Evaluation Framework

2025년 11월 25일

[논문리뷰] ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

본 논문은 기존 통합 멀티모달 모델(UMM) 평가 방식이 텍스트 및 이미지 이해/생성 능력을 개별적으로 측정하여 모달리티 간 상호 추론 능력 을 간과하는 문제를 제기합니다.

#Review #Multimodal AI #Benchmarking #Cross-Modal Reasoning #Omnimodal Generation #Visual Generation #Verbal Generation #Unified Multimodal Models

2025년 11월 9일

[논문리뷰] Reconstruction Alignment Improves Unified Multimodal Models

논문은 통합 멀티모달 모델(UMM)이 이미지-텍스트 쌍으로 훈련될 때 캡션의 희소성으로 인해 미세한 시각적 디테일을 놓치고, 이해와 생성 간의 정렬이 불완전하다는 문제를 해결하고자 합니다.

#Review #Unified Multimodal Models #Image Generation #Image Editing #Post-training #Self-supervised Learning #Reconstruction Alignment #Visual Embeddings

2025년 9월 10일

[논문리뷰] LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

본 논문은 기존의 선도적인 통합 멀티모달 모델(UMM)들이 상당한 계산 자원과 학습 비용을 요구한다는 문제에 주목합니다.

#Review #Unified Multimodal Models #Double Fusion #Lightweight AI #Text-to-Image Generation #Image Editing #Model Architecture #Efficient Training #Cross-modal Interaction

2025년 10월 28일

[논문리뷰] SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

본 논문은 Unified Multimodal Models ( UMMs )이 이미지 이해 능력에 비해 이미지 생성 능력에서 현저한 격차를 보이는 문제에 주목합니다. 모델이 사용자 지침에 따라 이미지를 정확하게 이해하더라도, 동일한 텍스트 프롬프트로부터 충실한 이미지를 생성하지 못하는 역설을 해결하고자 합니다.

#Review #Unified Multimodal Models #Self-Rewarding #Text-to-Image Generation #Image Understanding #Post-Training #Global-Local Reward #Compositional Reasoning

2025년 10월 15일