#Image Understanding

11개의 포스트

[논문리뷰] Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Large Language Models (LLMs)는 Chain-of-Thought prompting과 같은 확장된 추론을 통해 상당한 발전을 이루었지만, 이를 Multi-modal Large Language Models (MLLMs)로 확장하는 것은 여전히 큰 도전 과제입니다.

#Review #Visual Reasoning #Image Understanding #Video Understanding #Multi-Agent System #Reinforcement Learning #Self-Evolving

2026년 3월 23일

[논문리뷰] UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations

본 연구는 기존 통합 멀티모달 모델의 한계를 해결하고자 합니다. 특히, 이산적인 시각 토크나이저 사용으로 인한 세부 의미 정보 손실 문제와, 연속적인 고차원 시각 표현을 직접 모델링할 때 발생하는 학습 불안정성 및 느린 수렴 문제를 극복하는 것을 목표로 합니다.

#Review #Unified Multimodal Model #Image Generation #Image Understanding #Semantic Compression #Continuous Representation #Diffusion Model #Transformer #Image Editing

2026년 3월 11일

[논문리뷰] OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

본 논문은 이미지 이해(understanding)와 생성(generation) 모두에 활용될 수 있는 단일하고 통합된 시각적 표현을 학습하는 고급 비전 인코더인 OpenVision 3 를 제안합니다.

#Review #Unified Visual Encoder #Image Understanding #Image Generation #VAE #Vision Transformer #Multimodal Learning #Reconstruction #Contrastive Learning

2026년 1월 22일

[논문리뷰] UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

의료 영상 이해(semantic abstraction)와 생성(pixel-level reconstruction)이라는 근본적으로 상충하는 목표를 기존 파라미터 공유 방식의 단일 모델에서 통합할 때 발생하는 성능 저하 문제를 해결하고자 합니다.

#Review #Chest X-Ray #Medical Foundation Model #Autoregressive Model #Diffusion Model #Multimodal Learning #Image Understanding #Image Generation #Cross-Modal Attention

2026년 1월 20일

[논문리뷰] OneThinker: All-in-one Reasoning Model for Image and Video

기존 MLLM(Multimodal Large Language Models)이 단일 태스크나 단일 모달리티(이미지 또는 비디오)에 국한되는 한계를 넘어, 이미지와 비디오 이해를 아우르는 다양한 시각 태스크를 동시에 처리할 수 있는 범용적인 추론 모델 인 'All-in-one multimodal reasoning generalist' 를 개발하는 것을 목표로 합니다.

#Review #Multimodal LLMs #Reinforcement Learning #Visual Reasoning #Generalist Model #Image Understanding #Video Understanding #Multitask Learning #EMA-GRPO

2025년 12월 3일

[논문리뷰] Architecture Decoupling Is Not All You Need For Unified Multimodal Model

본 논문은 통합 멀티모달 모델(UMM)에서 시각 생성 및 이해 태스크 간의 내재된 충돌을 완화하면서도 모델 아키텍처 디커플링에 과도하게 의존하지 않고 성능을 향상시키는 것을 목표로 합니다. 과도한 디커플링이 통합 모델의 상호작용적 추론 능력과 지식 전이 능력을 저해하는 문제를 해결하고자 합니다.

#Review #Unified Multimodal Models #Architecture Decoupling #Cross-Modal Attention #Attention Interaction Alignment (AIA) Loss #Task Conflicts #Image Generation #Image Understanding

2025년 11월 30일

[논문리뷰] Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

본 논문은 기존 멀티모달 Masked Diffusion Model (MDM)의 한계를 극복하고, 이미지 이해, 객체 접지, 이미지 편집, 고해상도(1024px) 텍스트-투-이미지 생성 등 광범위한 멀티모달 태스크를 단일 프레임워크 내에서 처리할 수 있는 통합 MDM 인 Lavida-O를 제안하는 것을 목표로 합니다.

#Review #Multimodal AI #Masked Diffusion Models #Image Understanding #Image Generation #Image Editing #Object Grounding #ElasticMoT #Self-reflection

2025년 9월 25일

[논문리뷰] SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

본 논문은 Unified Multimodal Models ( UMMs )이 이미지 이해 능력에 비해 이미지 생성 능력에서 현저한 격차를 보이는 문제에 주목합니다. 모델이 사용자 지침에 따라 이미지를 정확하게 이해하더라도, 동일한 텍스트 프롬프트로부터 충실한 이미지를 생성하지 못하는 역설을 해결하고자 합니다.

#Review #Unified Multimodal Models #Self-Rewarding #Text-to-Image Generation #Image Understanding #Post-Training #Global-Local Reward #Compositional Reasoning

2025년 10월 15일

[논문리뷰] Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

카메라 중심의 장면 이해와 생성을 별개의 문제로 다루던 기존 방식의 한계를 극복하고, 이를 단일 멀티모달 모델 로 통합하는 것을 목표로 합니다.

#Review #Unified Multimodal Model #Camera-Centric #Image Understanding #Image Generation #Spatial Reasoning #Camera Parameters #Instruction Tuning #Multimodal Spatial Intelligence

2025년 10월 13일

[논문리뷰] Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

기존 autoregressive 시각 모델에서 이산 잠재 공간 토크나이저 의 양자화 오류가 의미 표현력과 시각-언어 이해 능력을 저해하는 문제를 해결하고자 합니다.

#Review #Unified Vision-Language Model #Continuous Tokenizer #Autoregressive Generation #Image Understanding #Image Generation #Multimodal AI #In-context Editing

2025년 10월 9일

[논문리뷰] Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

본 논문은 다양한 양상의 데이터(텍스트, 이미지)를 처리할 수 있는 옴니(Omni) 형태의 멀티모달 생성 및 이해 모델 인 Lumina-DiMOO를 제안합니다.

#Review #Multi-modal LLM #Discrete Diffusion #Image Generation #Image Understanding #Omni-modal #Interactive Retouching #Generative AI #Reinforcement Learning

2025년 10월 9일