#Multimodal Reasoning

62 posts

[Paper Review] S1-Omni: A Unified Multimodal Reasoning Model for Scientific Understanding, Prediction, and Generation

This paper addresses the fragmentation of existing AI for Science (AI4S) research into domain-specific models, tool-augmented LLMs, and scientific language models.

#Review #AI4S #Multimodal Reasoning #Scientific Modeling #Foundation Model #S1-Omni #Knowledge Alignment

July 19, 2026

[Paper Review] Accurate, Interdisciplinary and Transparent Structure-property Understanding with Deep Native Structural Reasoning

This study addresses the limits in representational power and reasoning that existing AI systems face when interpreting the complex relationships between scientific structures (proteins, chemical compounds, inorganic crystals, and so on) and their properties.

#Review #Foundation Model #Structure-property Relationship #Multimodal Reasoning #Scientific AI #Chain-of-thought #Native Structural Reasoning

July 8, 2026

[Paper Review] ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

This paper addresses a limitation of existing evaluation schemes for medical MLLMs: they judge only the accuracy of the final answer and fail to identify the root cause of hallucinations.

#Review #Medical MLLM #Hallucination Diagnosis #Chain-of-Thought #Multimodal Reasoning #Stage-wise Evaluation #Stage-Replacement Intervention

June 14, 2026

[Paper Review] WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

This paper proposes WorldBench to overcome the failure of existing multimodal benchmarks to adequately measure a model's actual reasoning ability. Many existing benchmarks are biased toward specific domains or lack visual diversity, which tends to overestimate the real problem-solving ability of VLMs.

#Review #Multimodal Reasoning #Benchmark #Vision-Language Model #Visual Diversity #Inference #Evaluation #LLM

June 7, 2026

[Paper Review] GenClaw: Code-Driven Agentic Image Generation

This paper addresses the limited controllability and reasoning ability of existing end-to-end image generation models. These models go through repeated "black-box" trial and error by rewriting prompts, and often fail to precisely control complex spatial relationships or text layouts.

#Review #Agentic Image Generation #Code-Driven #SVG #Multimodal Reasoning #Layered Representation #Controllable Generation

May 28, 2026

[Paper Review] From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

This paper addresses the lack of concept-faithful grounding in recent promptable segmentation models, which rely too heavily on salient visual cues and therefore produce precise masks even for semantically invalid prompts.

#Review #Promptable Segmentation #Counterfactual Evaluation #Semantic Grounding #Visual Hallucination #Multimodal Reasoning #Open-Vocabulary Segmentation

May 13, 2026

[Paper Review] InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

This paper points out that existing multimodal agent benchmarks treat visual evidence merely as the endpoint of an answer, overlooking the role that visual information plays in steering the search path during real information seeking.

#Review #Multimodal Agent #Interleaved Search #Visual Evidence #Agentic Search Benchmark #Multimodal Reasoning #Open-web Search

May 10, 2026

[Paper Review] Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

This paper addresses the Visual Signal Dilution problem that autoregressive LVLMs suffer during long-context generation.

#Review #Large Vision-Language Models #Visual Signal Dilution #Persistent Visual Memory #Autoregressive Generation #Multimodal Reasoning #Bottleneck Adapter

May 4, 2026

[Paper Review] OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

This paper proposes OpenWorldLib to resolve the conceptual ambiguity of world models and to establish a standardized definition and a unified framework.

#Review #World Models #Unified Inference Framework #Multimodal Reasoning #Vision-Language-Action #3D Generation #Interactive Video Generation

April 6, 2026

[Paper Review] PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

This paper argues that existing video understanding benchmarks are mostly solvable from a single moment of the video or focus too heavily on logical structure, which makes it hard to evaluate a model's actual visual reasoning ability.

#Review #Video Benchmark #Multimodal Reasoning #Perception-Centric #Long-Horizon #Test-Time Thinking

April 1, 2026

[Paper Review] When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

Recent multimodal large language models (MLLMs) have shown strong performance on reasoning tasks, but this progress relies mainly on high-quality annotated data or teacher-model distillation, which is costly and hard to scale.

#Review #Unsupervised Self-Evolution #Multimodal Reasoning #Consistency-Based Reward #Judge Modulation #Group Relative Policy Optimization (GRPO) #Policy Updates #Mathematical Reasoning #Large Language Models

March 25, 2026

[Paper Review] ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

This paper proposes ProactiveBench, built by restructuring seven datasets, to evaluate whether MLLMs have "proactiveness", the ability to first ask the user for simple help on difficult visual tasks, and analyzes 22 MLLMs.

#Review #MLLM #Benchmark #Proactiveness #Reinforcement Learning #Multimodal Reasoning #Human-AI Interaction

March 22, 2026

[Paper Review] From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

This paper analyzes and optimizes the mechanism of the cold-start initialization stage of multimodal large reasoning models (MLRMs), with the goal of improving their multimodal reasoning performance and visual grounding ability.

#Review #Multimodal Reasoning #Cold-Start Initialization #Attention Mechanism #Visual Grounding #Large Multimodal Models (LMMs) #Reinforcement Learning (RLHF) #Data Synthesis #Visual Attention Score (VAS)

March 9, 2026

[Paper Review] MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

This paper aims to address the absence of a standardized benchmark for evaluating the diverse multi-image reasoning abilities of multimodal large language models (MLLMs) in real-life scenarios.

#Review #Multimodal Reasoning #Multi-Image Analysis #Real-life Scenarios #Benchmark #MLLMs Evaluation #Chain-of-Thought #Reasoning Types

March 2, 2026

[Paper Review] From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

This paper addresses the fundamental limitations of existing self-training frameworks for LMMs (Large Multimodal Models): a lack of interpretable diagnostics and a lack of visual diversity.

#Review #Large Multimodal Models #Iterative Training #Diagnostic-Driven Learning #Reinforcement Learning #Multimodal Reasoning #Data Generation #Agent Systems

February 26, 2026

[Paper Review] DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

The paper aims to overcome the limited diversity, coverage, and generalization of existing multimodal RLVR (Reinforcement Learning with Verifiable Rewards) training datasets.

#Review #Multimodal Reasoning #Mathematical Dataset #RLVR #Data Curation #Visual Diversity #K12 Mathematics #Large Multimodal Models

February 22, 2026

[Paper Review] BrowseComp-V^3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

The paper aims to develop a new, verifiable benchmark that comprehensively evaluates the deep-search capability of multimodal web browsing agents by addressing the limited task complexity, information searchability, and evaluation dimensions of existing benchmarks.

#Review #Multimodal LLMs #Web Browsing Agents #Deep Search #Benchmark #Tool Use #Process Evaluation #Multimodal Reasoning #Open-world QA

February 16, 2026

[Paper Review] Thinking with Drafting: Optical Decompression via Logical Reconstruction

This paper aims to resolve the "precision paradox" that multimodal large language models (MLLMs) encounter in complex reasoning tasks over visual inputs.

#Review #Multimodal Reasoning #Visual Algebra #Domain-Specific Language #Optical Decompression #Logical Reconstruction #Bar Model #MLLMs #Verification

February 12, 2026

[Paper Review] When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

This paper aims to expose and address the undetected safety risks that arise as large image editing models adopt a new paradigm in which visual prompts convey user intent.

#Review #Vision-Centric Jailbreak Attack #Image Editing Models #Safety Benchmark #IESBench #Multimodal Reasoning #Adversarial Attack #Defense Mechanism

February 11, 2026

[Paper Review] Chain of Mindset: Reasoning with Adaptive Cognitive Modes

The paper addresses the limitation that the fixed, single-mindset reasoning of existing LLMs (large language models) cannot meet the heterogeneous cognitive demands that arise at different stages of problem solving. It aims to raise the problem-solving ability of LLMs by flexibly orchestrating mindsets that adapt to each stage.

#Review #Adaptive Reasoning #Cognitive Modes #Large Language Models (LLMs) #Agentic AI #Multimodal Reasoning #Mindset Orchestration #Contextual Filtering #Training-free Framework

February 10, 2026

[Paper Review] Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

This paper aims to analyze and resolve the response length bias that mainstream algorithms such as GRPO and GSPO exhibit during Reinforcement Learning with Verifiable Rewards (RLVR) training.

#Review #Reinforcement Learning with Verifiable Rewards #LLMs #Policy Optimization #Response Length Bias #Sequence-level Clipping #Length-Unbiased Optimization #Multimodal Reasoning

February 5, 2026

[Paper Review] AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

This paper proposes a benchmark that overcomes the limitations of existing VLM (Vision-Language Model) evaluation and comprehensively assesses adaptive multimodal reasoning ability.

#Review #Multimodal Reasoning #Adaptive Learning #Vision-Language Models (VLMs) #Benchmarking #Mode Selection #Tool Learning #Reasoning Process Evaluation #Matthews Correlation Coefficient (MCC)

February 3, 2026

[Paper Review] UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

This paper addresses the limitations of existing unified multimodal models on image synthesis tasks that require complex reasoning and world knowledge.

#Review #Multimodal Reasoning #Image Generation #Image Editing #World Knowledge #Self-Reflection #Unified Framework #Text-to-Image

February 2, 2026

[Paper Review] Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation

The goal is to address the limitations of existing text-to-image (T2I) models: static behavior, failure to grasp implicit user intent, and insufficient capacity for complex knowledge-based reasoning.

#Review #Agentic Text-to-Image #Multimodal Reasoning #Cognitive Search #Knowledge-Driven Generation #Image Generation Benchmarks #Complex User Intent

February 2, 2026

[Paper Review] MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

This paper aims to overcome the limitation that open-source multimodal models lag behind proprietary systems because of a shortage of high-quality reasoning data.

#Review #Multimodal Reasoning #Data-centric AI #Chain-of-Thought #Large Language Models #Visual Question Answering #STEM Reasoning #Dataset #Fine-tuning

January 29, 2026

[Paper Review] Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

This paper addresses the problem that existing AI systems, while strong in linguistic and abstract domains, lag behind humans in physical and spatial intelligence because they lack rich representations and prior knowledge, in particular explicit visual world modeling.

#Review #Multimodal AI #World Models #Visual Generation #Chain-of-Thought (CoT) #Multimodal Reasoning #Unified Multimodal Models #Spatial-Physical Reasoning

January 27, 2026

[Paper Review] Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility

The paper addresses the scarcity of multimodal data for scientific reasoning and the tendency of existing text-to-image (T2I) models to generate images that are visually plausible but scientifically inaccurate.

#Review #Scientific Image Synthesis #Multimodal Reasoning #Text-to-Image #Benchmarking #Programmatic Synthesis #Large Multimodal Models #Synthetic Data

January 26, 2026

[Paper Review] Agentic Very Long Video Understanding

This paper aims to tackle the challenge of very long video understanding required by always-on personal AI assistants.

#Review #Long-Horizon Video Understanding #Agentic AI #Entity Graph #Multimodal Reasoning #Video Question Answering #EgoLifeQA #Retrieval Augmented Generation

January 26, 2026

[Paper Review] XR: Cross-Modal Agents for Composed Image Retrieval

The goal is to overcome the limitations of the existing similarity-based paradigm for Composed Image Retrieval (CIR) in the AI era and to improve the cross-modal reasoning needed to integrate a reference image with textual modifications.

#Review #Composed Image Retrieval #Cross-Modal Agents #Multimodal Reasoning #Training-free Framework #Information Retrieval #Agentic AI #Progressive Retrieval

January 21, 2026

[Paper Review] DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

The paper aims to address the limits of text-centric reasoning in current Multimodal Large Language Models (MLLMs) and their inefficiency on complex long-horizon, vision-centric tasks, and to establish a new "generative multimodal reasoning" paradigm built on diffusion models.

#Review #Multimodal Reasoning #Diffusion Models #Image-to-Image Generation #Vision-centric AI #Generative AI #Spatial Planning #Constraint Satisfaction

January 1, 2026

[Paper Review] Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking

This paper addresses complex reasoning problems in which text-only reasoning models struggle to grasp implicit spatial and geometric relationships.

#Review #Multimodal Reasoning #Visual Thinking #Reinforcement Learning #Code Generation #Geometric Reasoning #Adaptive Reward Mechanism #Problem Solving

December 31, 2025

[Paper Review] See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

This paper aims to address the problems that large vision-language models (VLMs) miss fine-grained visual evidence, generalize poorly across domains, and incur high cost at inference time.

#Review #Multimodal Reasoning #Vision-Language Models (VLMs) #Perceptual Shaping #KL-Divergence #Chart Understanding #Data Augmentation #Reinforcement Learning (RL) #GRPO

December 28, 2025

[Paper Review] LongVideoAgent: Multi-Agent Reasoning with Long Videos

This paper aims to address the information loss from compression, the limited toolsets, and the lack of fine-grained temporal reasoning that existing MLLMs (Multimodal Large Language Models) exhibit on long videos.

#Review #Multi-Agent System #Long Video Understanding #Video Question Answering #Reinforcement Learning #Large Language Models #Temporal Grounding #Multimodal Reasoning #Tool-Augmented AI

December 23, 2025

[Paper Review] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

This paper addresses the lack of evaluation frameworks for reward models (RMs) for omni models, which process interleaved sequences of images and text.

#Review #Reward Models #Multimodal LLMs #Benchmark #Text-to-Image Generation #Image Editing #Interleaved Generation #Multimodal Reasoning #MLLM-as-a-judge

December 18, 2025

[Paper Review] A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

This paper addresses the limitation that existing end-to-end affordance prediction models, in which high-level reasoning and low-level grounding are tightly coupled, struggle to generalize to novel objects and complex instructions.

#Review #Affordance Prediction #Zero-Shot Learning #Agentic AI #Foundation Models #Multimodal Reasoning #Visual Grounding #Image Generation #Robotics

December 16, 2025

[Paper Review] Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning

This paper addresses the difficulty of applying reinforcement learning (RL)-based Vision-Language Models (VLMs) to specialized domains (e.g., chemistry, earth science) and of sustaining self-evolving learning there, a difficulty caused by data scarcity and reward hacking.

#Review #Vision-Language Models #Reinforcement Learning #Self-Evolving Learning #Data-Scarce Domains #Context-First Learning #Reward Hacking Mitigation #Multimodal Reasoning #Curriculum Learning

December 8, 2025

[Paper Review] Qwen3-VL Technical Report

The report aims to develop Qwen3-VL, the most capable Vision-Language Model (VLM) in the Qwen series to date, and to achieve strong performance across a broad range of multimodal benchmarks.

#Review #Vision-Language Model #Multimodal Reasoning #Long-Context #Interleaved Data #Mixture-of-Experts #DeepStack #Agentic AI

December 3, 2025

[Paper Review] Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

The goal is to overcome the limitation that current MLLMs (Multimodal Large Language Models) solve every problem de novo and so repeat the same errors in visual attention and logical reasoning.

#Review #Multimodal LLMs #Semantic Memory #Agentic Learning #Error Attribution #Visual Reasoning #Long-term Memory #Grow-and-Refine #Multimodal Reasoning

November 27, 2025

[Paper Review] Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

The goal is to address the problem that existing VLMs, confined to discrete text-based reasoning, struggle with tasks that require fine-grained visual understanding, such as spatial reasoning and geometric perception.

#Review #Vision-Language Models (VLMs) #Chain-of-Thought (CoT) #Continuous Visual Tokens #Multimodal Reasoning #Perceptual Grounding #Visual Thinking #Dense Prediction

November 24, 2025

[Paper Review] OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

The paper aims to overcome the limits on scalability research in multimodal reasoning caused by the absence of transparent, reproducible data curation and training strategies.

#Review #Multimodal Reasoning #Large Multimodal Models #Supervised Fine-tuning #Reinforcement Learning #Data Curation #Open-source #Multimodal Benchmarks

November 23, 2025

[Paper Review] VisPlay: Self-Evolving Vision-Language Models from Images

This paper aims to autonomously improve the reasoning ability of Vision-Language Models (VLMs) from large-scale unstructured image data, without human annotation or task-specific heuristics. It seeks to overcome the cost and scalability limits of existing reinforcement learning (RL) approaches.

#Review #Self-Evolving #Vision-Language Models #Reinforcement Learning #Self-Play #Unlabeled Data #Multimodal Reasoning #Group Relative Policy Optimization #Hallucination Mitigation

November 19, 2025

[Paper Review] Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

This paper addresses the absence of a comprehensive benchmark for systematically evaluating the reasoning ability of video models, in particular reasoning through video generation.

#Review #Video Models #Spatial Reasoning #Maze Solving #Video Generation #Benchmark #Supervised Fine-tuning #Test-Time Scaling #Multimodal Reasoning

November 19, 2025

[Paper Review] REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

This paper addresses the problem that existing text-based self-reflection mechanisms are limited in handling rich, dynamic visual information and therefore degrade performance on long-form video understanding tasks.

#Review #Multimodal Reasoning #Long-Form Video Understanding #Self-Reflection #Reinforcement Learning #Tool-Augmented MLLMs #Visual Rethinking #Video Question Answering #Causal Attribution

November 18, 2025

[Paper Review] MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning

This study aims to overcome the difficulties that multimodal large language models (MLLMs) face on reasoning tasks such as complex mathematical problem solving. In particular, it addresses the limitation that relying on static, teacher-model-derived datasets restricts a model's ability to adapt to new problems and to generalize robustly.

#Review #Multimodal Reasoning #Mathematical Problem Solving #Self-Evolving #Iterative Fine-Tuning #Reward Models #Reflection #Large Language Models (LLMs)

November 12, 2025

[Paper Review] Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

The paper aims to resolve the hallucinations that arise when large vision-language models (LVLMs) underuse visual information and rely too heavily on textual priors. In doing so, it seeks to strengthen the model's visual grounding and promote more balanced multimodal reasoning.

#Review #Hallucination Mitigation #Large Vision-Language Models #Textual Embeddings #Multimodal Reasoning #Attention Mechanism #Visual Grounding #Modality Imbalance

November 9, 2025

[Paper Review] DeepEyesV2: Toward Agentic Multimodal Model

This paper aims to build an agentic multimodal model that goes beyond simply understanding text and images: one that actively invokes external tools such as code execution environments and web search, and integrates those tool operations seamlessly into its reasoning process.

#Review #Agentic AI #Multimodal Models #Tool Use #Reinforcement Learning #Supervised Fine-tuning #Multimodal Reasoning #Web Search #Code Execution

November 9, 2025

[Paper Review] Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

The paper seeks to overcome the limitations of the existing "Thinking with Text" and "Thinking with Images" paradigms, namely the constraints of static images and the separation of modalities.

#Review #Video Generation #Multimodal Reasoning #Temporal Understanding #Spatial Reasoning #Foundation Models #AI Benchmarking #In-Context Learning #Self-Consistency

November 9, 2025

[Paper Review] MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

This paper addresses two major limitations that hinder the progress of large multimodal reasoning models.

#Review #Multimodal Reasoning #Reinforcement Learning #Variance-Aware Sampling #Gradient Vanishing #Data Curation #Chain-of-Thought #GRPO

September 26, 2025

[Paper Review] MAPO: Mixed Advantage Policy Optimization

This study aims to resolve the "advantage reversion" and "advantage mirror" problems found in existing reinforcement learning (RL) methods for improving the reasoning performance of foundation models, in particular Group Relative Policy Optimization (GRPO).

#Review #Reinforcement Learning #Foundation Models #Policy Optimization #Advantage Function #Trajectory Certainty #Multimodal Reasoning #GRPO

September 24, 2025

[Paper Review] AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

The paper addresses the weakness of language models (LLMs) in auditory commonsense and reasoning when they work from text alone, without audio input. To close this gap, it presents the AuditoryBench++ benchmark for evaluating auditory knowledge and aims to develop AIR-CoT, a method in which the LLM reasons by "imagining" auditory information.

#Review #Auditory Knowledge #Large Language Models #Multimodal Reasoning #Benchmark #Chain-of-Thought #Auditory Imagination #Text-only Reasoning

September 23, 2025

[Paper Review] MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

The paper aims to advance the fields of multimodal machine learning and LLMs through the MARS2 2025 Challenge.

#Review #Multimodal Reasoning #Large Language Models (LLMs) #Multimodal Large Language Models (MLLMs) #Visual Grounding #Visual Question Answering #Advertisement Video Analysis #Real-world Scenarios #Challenge Benchmark

September 18, 2025

[Paper Review] Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

This paper aims to overcome the limits of multimodal reasoning, a fundamental challenge in artificial intelligence. In particular, it seeks to close the gap between the visual and textual modalities and secure robust reasoning performance in scientific multimodal scenarios, where even state-of-the-art models such as GPT-o3 struggle to integrate visual information.

#Review #Multimodal Reasoning #Science AI #Caption-assisted Reasoning #SeePhys Challenge #Large Language Models #Visual Question Answering #Physics Problems #Cross-modal Alignment

September 17, 2025

[Paper Review] D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning

The paper aims to solve the problem of understanding and detecting implicit, culturally sensitive dark humor in online memes. To address the shortage of existing resources and methods, it presents a comprehensive framework that identifies the presence, target category, and intensity of dark humor in multimodal content.

#Review #Dark Humor Detection #Multimodal Reasoning #Vision-Language Models (VLMs) #Iterative Reasoning Refinement #Meme Analysis #Content Moderation #Cross-Modal Attention #Dataset Annotation

September 9, 2025

[Paper Review] LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

This paper challenges the conventional assumption that critic models merely evaluate responses and cannot also serve as strong policy models with generation ability. The ultimate goal is to develop a single multimodal model that excels at both evaluation and generation, using reinforcement learning (RL) on preference-based critic data.

#Review #Vision-Language Models (VLMs) #Critic Models #Policy Models #Reinforcement Learning (RL) #Self-Criticism #Multimodal Reasoning #Preference Learning #Generative Models

September 3, 2025

[Paper Review] InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

This paper addresses the problem of integrating multimodal reasoning with precise action generation so that robots can operate effectively in real environments.

#Review #Vision-Language-Action (VLA) #Instruction Tuning #Multimodal Reasoning #Robotic Manipulation #Catastrophic Forgetting #Mixture-of-Experts (MoE) #Flow Matching

August 5, 2025

[Paper Review] SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs

The paper aims to overcome the inability of text-only large language models (LLMs) to process visual information directly, so that they can perform multimodal reasoning efficiently and cost-effectively.

#Review #Multimodal Reasoning #Text-only LLM #Agentic AI #Information Flow #VQA #Structured Intermediate Representation #Decoupled Architecture #Tool Use

October 30, 2025

[Paper Review] VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

This paper aims to address two major limitations of the multimodal reward models (RMs) used for post-training visual generative models.

#Review #Video Reward Models #Multimodal Reasoning #Thinking-with-Image #Visual Reasoning #Reinforcement Learning #Chain-of-Thought #Context Management

October 17, 2025

[Paper Review] MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

This paper aims to address the difficulties that large language models (LLMs) face on mathematical problems, such as geometry, that inherently depend on visual aids.

#Review #Multimodal Reasoning #Visual Chain-of-Thought (VCoT) #Large Multimodal Models (LMMs) #Geometric Reasoning #Diagram Generation #Dataset #Benchmark

October 17, 2025

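[Paper Review] ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping

The goal is to resolve an imbalance in multimodal large reasoning models (MLRMs): they reason excessively, and therefore inefficiently, on easy problems, yet explore too little on hard problems and miss the answer. The paper presents ARES, an adaptive reasoning framework that dynamically allocates exploration effort according to problem difficulty, to improve both the efficiency and the performance of MLRMs.

#Review #Multimodal Reasoning #Adaptive Learning #Reinforcement Learning #Entropy Shaping #Difficulty-Aware #Chain-of-Thought #Token-Level Analysis

October 13, 2025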

[Paper Review] Factuality Matters: When Image Generation and Editing Meet Structured Visuals

This study addresses the limitations of recent visual generative models in generating and editing structured visuals such as charts, diagrams, and mathematical figures. Such visuals demand factual accuracy, which in turn requires composition planning, text rendering, and multimodal reasoning, and the study notes that this area has lacked systematic investigation.

#Review #Structured Visuals #Image Generation #Image Editing #Multimodal Reasoning #Factual Fidelity #Chain-of-Thought #Evaluation Benchmark #Diffusion Models

October 7, 2025

[Paper Review] Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned

This paper seeks to extend process reward models (PRMs), which improve the reasoning reliability of large language models (LLMs), to the domain of vision-language models (VLMs).

#Review #Vision-Language Models (VLMs) #Process Reward Models (PRMs) #Multimodal Reasoning #Test-Time Scaling (TTS) #Process Supervision #Dataset Construction #Perception Errors #MCTS

October 2, 2025

[Paper Review] More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

This paper explores the dual nature of reasoning in Vision-Language Models (VLMs): it strengthens logical inference but can impair perceptual grounding on basic visual questions, leading to recognition failures.

#Review #Vision-Language Models #Multimodal Reasoning #Reasoning #Visual Forgetting #Perceptual Grounding #Reinforcement Learning #Policy Optimization #Visual Anchors

October 1, 2025