#Unified Models

11 posts

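[Paper Review] LatentUMM: Dual Latent Alignment for Unified Multimodal Models

This paper proposes LatentUMM to address the representation mismatch across modalities that affects existing multimodal models. Existing approaches learn the features of each modality in independent latent spaces, which leads to degraded performance on cross-modal tasks and insufficient alignment.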

#Review #Multimodal Learning #Latent Alignment #Unified Models #Representation Learning #Cross-modal Representation

May 24, 2026

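[Paper Review] LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

This paper proposes LatentUM, which unifies visual understanding and generation in a shared semantic latent space. Its core method, MBAQ, is designed to preserve the VLM's output distribution, quantizing visual features into discrete tokens oriented toward understanding rather than reconstruction.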

#Review #Unified Models #Cross-Modal Reasoning #Semantic Latent Space #MBAQ #Mixture-of-Modal Experts

April 2, 2026

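[Paper Review] UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

This paper points out that existing unified multimodal models operate in a single pass, producing outputs without iterative refinement. It aims to overcome this limitation on multimodal tasks that require multi-step reasoning and self-correction, such as complex spatial composition, multi-object interaction, and evolving instructions.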

#Review #Multimodal AI #Chain-of-Thought #Test-time Scaling #Unified Models #Iterative Reasoning #Image Generation #Visual Reasoning #Self-Correction

February 17, 2026

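[Paper Review] Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

This paper seeks to overcome the limitations that the low-dimensional latent space of conventional variational autoencoders (VAEs) can impose on large-scale text-to-image (T2I) generation models.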

#Review #Text-to-Image Generation #Diffusion Models #Representation Autoencoder #Latent Space #Large-Scale Models #Unified Models #Noise Scheduling

January 22, 2026

[Paper Review] ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

This paper aims to address the difficulty that recent unified video models, despite being equipped with powerful vision-language models (VLMs), have with reason-informed visual editing.

#Review #Video Editing #Reasoning #Unified Models #Self-Reflective Learning #Vision-Language Models (VLMs) #Diffusion Models #RVE-Bench

December 11, 2025

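[Paper Review] GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models

This paper aims to develop a benchmark for evaluating the generative reasoning ability of unified multimodal models (UMMs). It seeks to overcome the limitation that existing benchmarks evaluate only discriminative understanding or unconstrained generation, and to comprehensively measure geometric generative reasoning, which fuses language understanding with precise visual generation.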

#Review #Multimodal AI #Generative Reasoning #Geometric Construction #Benchmark #GeoGebra #Code-based Evaluation #Unified Models

November 16, 2025

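[Paper Review] RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

This paper points out that existing benchmarks evaluate the understanding and generation capabilities of unified multimodal models separately, and aims to answer the fundamental question of whether architectural unification actually produces synergy between these capabilities.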

#Review #Unified Models #Multimodal AI #Benchmark #Capability Synergy #Visual Understanding #Image Generation #Dual-Evaluation Protocol

September 30, 2025

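[Paper Review] Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

The main goal is to address the substantial computational overhead caused by the iterative processes of diffusion denoising and autoregressive decoding in unified multimodal models. The paper proposes Hyper-Bagel, a unified acceleration framework, to speed up both multimodal understanding and generation tasks while preserving the high-quality outputs of the original model.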

#Review #Multimodal AI #Acceleration Framework #Speculative Decoding #Diffusion Distillation #Unified Models #Text-to-Image Generation #Image Editing #Computational Efficiency

September 24, 2025

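[Paper Review] Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

This paper addresses the fundamental mismatch between multimodal understanding (I2T) and generation (T2I), and explores whether the two can truly benefit each other rather than merely coexist. The authors present a single, fundamental objective function that unifies the two tasks, aiming to improve the performance of multimodal systems in a mutually complementary way.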

#Review #Multimodal Understanding #Multimodal Generation #Unified Models #Auto-Encoder #Reinforcement Learning #Image-to-Text #Text-to-Image #Reconstruction Fidelity

September 12, 2025

[Paper Review] Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark

This paper aims to address the limitations of existing benchmarks in evaluating the actual interaction between the generation and understanding capabilities of unified multimodal models.

#Review #Multimodal AI #Unified Models #Benchmark #Generation #Understanding #Reasoning #Evaluation #Cross-modal Synergy

October 16, 2025

[Paper Review] OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows

This paper aims to overcome two fundamental limitations: the strictly sequential generation of autoregressive (AR) models and the fixed-length generation of diffusion models.

#Review #Non-Autoregressive #Multimodal Generation #Edit Flows #Flow Matching #Interleaved Generation #Text-to-Image Synthesis #Unified Models

October 8, 2025