#Multimodal LLMs

85개의 포스트

[논문리뷰] MentalThink: Shaping Thoughts in Mental SVG World

본 논문은 기존의 언어 중심 Multimodal CoT가 가진 시각적 접지(Visual Grounding)의 취약성과 할루시네이션(Hallucination) 문제를 해결하고자 합니다.

#Review #Multimodal LLMs #Spatial Reasoning #Scalable Vector Graphics #Chain-of-Thought #Reinforcement Learning #Mental Imagery

2026년 7월 7일

[논문리뷰] OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

본 논문은 실시간 환경에서 활동하는 멀티모달 에이전트가 단편적인 현재 시점의 정보가 아닌, 시간 흐름에 따른 공간적 구조를 지속적으로 유지하고 추론해야 한다는 도전 과제를 해결하고자 합니다.

#Review #Multimodal LLMs #Streaming Spatial Intelligence #Egocentric Video #Hierarchical Benchmark #Spatiotemporal Reasoning #Allocentric Mapping

2026년 6월 3일

[논문리뷰] HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

본 연구는 기존 VQA 벤치마크들이 주로 서구권의 데이터나 단순한 합성 차트에 편향되어 있어, 일본의 공식 행정 문서와 같이 복잡한 레이아웃과 높은 Domain-Specific 지식을 요구하는 자료에 대한 평가가 부족하다는 점을 해결하고자 합니다.

#Review #VQA #Japanese #Document AI #Multimodal LLMs #Chart Understanding #Table Reasoning #Benchmark

2026년 6월 1일

[논문리뷰] Toward Native Multimodal Modeling: A Roadmap

본 논문은 기존 Large Language Models (LLMs)이 텍스트 전용 인터페이스에 근본적으로 제한되어 실제 세계의 풍부한 센서리 신호(sensory signals)를 통한 그라운딩(grounding)이 부족하다는 문제의식에서 출발합니다.

#Review #Native Multimodal Modeling #Cross-modal Fusion #Transformer Architectures #Multimodal LLMs #M2M Symmetric Modeling #Mid-Fusion #Early-Fusion

2026년 5월 25일

[논문리뷰] OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

본 논문은 Omni-modal Large Language Models(MLLMs)의 발전에도 불구하고, 실제 환경에서의 Proactive 스트리밍 이해 능력을 정밀하게 평가할 수 있는 표준화된 벤치마크가 부재하다는 문제점을 해결하고자 합니다 .

#Review #Omni-proactive streaming #Video understanding #Benchmark #Multimodal LLMs #Audio-visual perception #Long-horizon evaluation

2026년 5월 21일

[논문리뷰] Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

본 논문은 최신 <strong>Multimodal Large Language Models (MLLMs)</strong>가 객체 인식이나 장면 묘사와 같은 표면적 시각 인지에서는 뛰어난 성과를 보이나, 인간의 핵심 인지 능력인 visuo-cognitive 및 visuospatial reasoning 역량은 여전히 부족하다는 문제의식에서 출발합니다.

#Review #Multimodal LLMs #Visuospatial Reasoning #Fluid Intelligence #Mental Transformation #ART Taxonomy #Cognitive Benchmark

2026년 4월 21일

[논문리뷰] Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

본 논문은 Multimodal Large Language Models (MLLMs) 가 텍스트를 이미지 형태로 처리할 때 발생하는 '모달리티 갭(modality gap)'을 체계적으로 진단하고 해결하는 것을 목표로 합니다.

#Review #Multimodal LLMs #Modality Gap #Visual Text Understanding #Error Analysis #Self-Distillation #Text-to-Image Conversion #Reasoning Collapse

2026년 3월 10일

[논문리뷰] PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

현재 명시적 지시에만 반응하는 GUI 에이전트 의 한계를 극복하고, 사용자의 암묵적인 의도를 연속적인 시각 입력(스크린샷)으로부터 예측 하여 시기적절한 추천을 제공하는 능동형(Proactive) AI 비서 를 개발하는 것을 목표로 합니다.

#Review #Proactive Agents #GUI Automation #Intent Recommendation #Multimodal LLMs #Benchmark #Memory-aware Framework #Human-Computer Interaction

2026년 3월 9일

[논문리뷰] MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models

다중 모달리티 대규모 언어 모델(MLLMs)에서 채널별 스무딩 양자화(channel-wise smoothing quantization) 기법이 시각 및 텍스트 토큰 활성화의 큰 차이로 인해 실패하는 문제를 해결하는 것이 목표입니다.

#Review #Multimodal LLMs #Post-Training Quantization #Modality-Aware Smoothing #Cross-Modal Compensation #Quantization #Model Compression #SVD-based Whitening

2026년 3월 5일

[논문리뷰] RIVER: A Real-Time Interaction Benchmark for Video LLMs

대부분의 Multimodal Large Language Models (MLLMs)이 오프라인 패러다임으로 작동하여 실시간 상호작용 능력이 부족하다는 문제를 해결하고자 합니다.

#Review #Multimodal LLMs #Real-time Interaction #Video Understanding #Benchmark #Temporal Reasoning #Long-term Memory #Proactive Response

2026년 3월 4일

[논문리뷰] Phi-4-reasoning-vision-15B Technical Report

본 논문은 추론 능력, 효율성, 학습 데이터 요구사항의 균형을 맞춘 소형 오픈소스 멀티모달 추론 모델인 Phi-4-reasoning-vision-15B 를 개발하는 것을 목표로 합니다.

#Review #Multimodal LLMs #Efficient AI #Reasoning Models #Vision-Language Models #Data Curation #Mid-Fusion #High-Resolution Vision #Small Language Models

2026년 3월 4일

[논문리뷰] MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

본 연구는 기존의 텍스트 중심 안전성 평가와 레드팀 활동의 한계를 극복하고, 멀티모달 LLM의 정렬(alignment)이 오디오, 이미지, 비디오 입력에 대해 일반화되는지 체계적으로 테스트하기 위한 통합 플랫폼 을 제공하는 것을 목표로 합니다. 특히, 모달리티 전환이 다중 턴 공격에 미치는 영향을 규명하고자 합니다.

#Review #Multimodal LLMs #Safety Evaluation #Red Teaming #Adversarial Attacks #Modality Switching #LLM Alignment #Compliance #ASR

2026년 3월 4일

[논문리뷰] MediX-R1: Open Ended Medical Reinforcement Learning

본 논문은 의료 멀티모달 대규모 언어 모델(MLLM)이 다지선다형 질문을 넘어 임상적으로 근거한 자유 형식 답변 을 생성하도록 하는 오픈엔드 의료 강화 학습(RL) 프레임워크인 MediX-R1 을 제안합니다.

#Review #Reinforcement Learning #Multimodal LLMs #Medical AI #Composite Reward #LLM-as-a-Judge #Open-ended Generation #Medical Imaging

2026년 2월 26일

[논문리뷰] Imagination Helps Visual Reasoning, But Not Yet in Latent Space

본 논문은 Multimodal Large Language Models (MLLMs)에서 잠재 공간(latent space)을 활용한 시각적 추론(Latent Visual Reasoning, LVR)의 효과와 내재된 메커니즘을 심층적으로 분석하고, 그 한계를 극복하기 위한 대안적인 접근 방식을 제시하는 것을 목표로 합니다.

#Review #Visual Reasoning #Latent Space #Causal Mediation Analysis #Multimodal LLMs #Textual Imagination #Model Interpretation #Latent Tokens

2026년 2월 26일

[논문리뷰] BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models

기존 로봇 조작 벤치마크가 주로 단일 팔 조작에 국한되어 양팔 조작에 필수적인 공간-시간적 조정, 동적 역할 할당, 자가 충돌 방지 등의 복잡성을 포착하지 못하는 문제를 해결하는 것이 목표입니다.

#Review #Bimanual Manipulation #MLLMs #Robotics Benchmark #Spatial Reasoning #Action Planning #End-Effector Control #Embodied AI #Multimodal LLMs

2026년 2월 18일

[논문리뷰] BrowseComp-V^3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

기존 벤치마크의 제한적인 태스크 복잡도, 정보 검색 가능성, 평가 차원의 문제를 해결하여 멀티모달 웹 브라우징 에이전트의 심층 검색 역량을 포괄적으로 평가할 수 있는 새롭고 검증 가능한 벤치마크를 개발하는 것을 목표로 합니다.

#Review #Multimodal LLMs #Web Browsing Agents #Deep Search #Benchmark #Tool Use #Process Evaluation #Multimodal Reasoning #Open-world QA

2026년 2월 16일

[논문리뷰] Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

본 논문은 기존 MLLM(Multimodal Large Language Models)이 정적이고 내부적인 지식에 의존하여 비디오를 이해하는 한계를 극복하고, 동적이고 새로운 컨텍스트에서 시연(demonstration)을 통해 학습하고 적응하는 능력을 평가하는 새로운 태스크인 Demo-driven Video In-Context Learning 을 제안합니다.

#Review #Video Understanding #In-Context Learning #Procedural Knowledge #Multimodal LLMs #Benchmark #Direct Preference Optimization #Demonstration Selection

2026년 2월 9일

[논문리뷰] Reinforced Attention Learning

본 논문은 기존 RL 기반 LLM 후처리 방식이 MLLM에서 시각적 추론을 위한 '생성할 내용'에만 초점을 맞추어 제한적인 성능 향상을 보이거나 심지어 성능을 저하시키는 문제를 해결하고자 합니다.

#Review #Reinforcement Learning #Multimodal LLMs #Attention Mechanisms #Policy Gradient #Knowledge Distillation #Visual Grounding #Post-training

2026년 2월 5일

[논문리뷰] CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

본 논문은 텍스트 기반 LLM의 선형적인 컨텍스트 길이 증가와 그에 따른 계산 비용 문제로 인한 코드 이해의 비효율성을 해결하고자 합니다.

#Review #Vision Language Models #Code Understanding #Visual Code Representation #Code Compression #Computational Efficiency #Multimodal LLMs #Software Engineering

2026년 2월 3일

[논문리뷰] Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

현재 Text-to-Image (T2I) 모델들이 복잡한 공간 관계(공간 인식, 추론, 상호작용) 처리에서 실패하는 한계를 해결하고, 기존의 짧고 정보 밀도가 낮은 프롬프트 기반 벤치마크의 부적합성을 극복하는 것을 목표로 합니다.

#Review #Text-to-Image Models #Spatial Intelligence #Benchmark #Evaluation #Prompt Engineering #Multimodal LLMs #Fine-tuning #Spatial Reasoning

2026년 1월 29일

[논문리뷰] GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection

본 논문은 이미지-텍스트 쌍에서 풍자(sarcasm)를 효과적으로 탐지하기 위해 기존 방법론의 한계를 극복하는 것을 목표로 합니다.

#Review #Multimodal Sarcasm Detection #Large Language Models #Multimodal LLMs #Discrepancy Modeling #Image Captioning #Gated Fusion #Semantic Incongruity

2026년 1월 28일

[논문리뷰] AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

본 논문은 멀티모달 대규모 언어 모델(MLLM)의 시각적 추론 능력을 향상시키기 위해, 적응적이며 다단계적인 도구 활용 능력 을 개발하는 것을 목표로 합니다. 기존 MLLM이 새로운 도구나 작업에 직면했을 때 도구를 유연하게 사용하고 조정하는 데 어려움을 겪는 문제를 해결하고자 합니다.

#Review #Multimodal LLMs #Tool Orchestration #Visual Reasoning #Reinforcement Learning #Adaptive Learning #Generalization #Tool Use

2026년 1월 27일

[논문리뷰] AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

본 논문은 기존 벤치마크들이 다루지 못했던 시간-가변 오디오-비주얼 신호의 인간 문화적 맥락 이해 능력 을 평가하기 위해, MLLM(Multimodal Large Language Model) 의 맥락적, 문화적 지식 및 사고 능력 을 진단하는 새로운 벤치마크인 AVMeme Exam 을 제시합니다.

#Review #Multimodal LLMs #Benchmark #Cultural Understanding #Contextual Inference #Audio-Visual Memes #Multilingual #Q&A Evaluation

2026년 1월 27일

[논문리뷰] VIOLA: Towards Video In-Context Learning with Minimal Annotations

본 논문은 레이블링된 데이터가 부족한 새로운 비디오 도메인에서 Multimodal Large Language Models (MLLMs) 의 일반화 능력을 향상시키는 것을 목표로 합니다.

#Review #Video In-Context Learning #Minimal Annotation #Active Learning #Pseudo-Labeling #Multimodal LLMs #Density-Uncertainty Sampling #Confidence-Aware Retrieval #Low-Resource Adaptation

2026년 1월 22일

[논문리뷰] SAMTok: Representing Any Mask with Two Words

본 논문은 픽셀 단위의 멀티모달 대규모 언어 모델(MLLMs)이 복잡한 인코더, 전용 디코더, 비호환적인 훈련 목표로 인해 확장성 문제를 겪는 점을 해결하고자 합니다.

#Review #Mask Tokenization #Multimodal LLMs #Pixel-wise Vision-Language #Reinforcement Learning #Segmentation Anything Model #Discrete Representation

2026년 1월 22일

[논문리뷰] FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

기존 벤치마크들이 주로 회고적 이해에 초점을 맞추는 한계를 해결하기 위해, 오디오-비주얼 환경에서 멀티모달 대규모 언어 모델(MLLM)의 미래 사건 예측 능력 을 평가하는 것을 목표로 합니다. 특히, 모델이 교차 모달 인과 및 시간 추론 을 수행하고 내부 지식을 활용하여 미래 이벤트를 예측하는 능력을 평가하고자 합니다.

#Review #Multimodal LLMs #Future Forecasting #Audio-Visual Reasoning #Benchmark #Instruction Tuning #Omni-Modal #Causal Reasoning

2026년 1월 20일

[논문리뷰] Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey

본 논문은 LLM 기반의 소프트웨어 엔지니어링 이슈 해결(Issue Resolution) 분야에 대한 최초의 체계적인 종합 조사를 제공하는 것을 목표로 합니다. 특히 SWE-bench 와 같은 벤치마크에 의해 촉진된 자율 코딩 에이전트의 발전을 분석하고, 이 분야의 핵심 도전 과제와 미래 연구 방향을 제시하고자 합니다.

#Review #LLM-based Issue Resolution #Software Engineering #Autonomous Agents #Code Generation #Benchmarking #Reinforcement Learning #Supervised Fine-tuning #Multimodal LLMs

2026년 1월 20일

[논문리뷰] DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

본 연구는 고품질의 중국어 이미지-텍스트 데이터 의 부족으로 인해 지연되었던 중국어 비전-언어 사전 훈련(VLP) 연구의 발전을 목표로 합니다. 최신 웹 데이터를 기반으로 한 대규모 고품질 중국어 크로스모달 데이터셋인 DanQing 을 구축하고, 이를 통해 중국어 VLP 모델의 성능을 향상시키는 것이 주된 목적입니다.

#Review #Vision-Language Pre-training #Chinese Dataset #Data Filtering #Cross-modal Retrieval #Zero-shot Classification #Multimodal LLMs #SigLIP

2026년 1월 15일

[논문리뷰] A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

본 논문은 GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, Seedream 4.5 등 7개 최신 AI 모델의 안전성을 종합적이고 다차원적으로 평가하는 것을 목표로 합니다.

#Review #AI Safety #Large Language Models #Multimodal LLMs #Benchmark Evaluation #Adversarial Robustness #Multilingual Evaluation #Regulatory Compliance #Image Generation Safety

2026년 1월 15일

[논문리뷰] Ministral 3

본 연구는 컴퓨팅 및 메모리 제약이 있는 환경 을 위한 효율적인 매개변수 효율적(parameter-efficient) 밀집 언어 모델 인 Ministral 3 시리즈를 개발하는 것을 목표로 합니다.

#Review #Large Language Models #Model Distillation #Pruning #Parameter-Efficient AI #Multimodal LLMs #Instruction Tuning #Reinforcement Learning from Human Feedback #Open-Source AI

2026년 1월 13일

[논문리뷰] Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

본 논문은 기존 비디오 질의응답 벤치마크의 한계, 즉 폐쇄된 증거 설정과 텍스트 기반 검색에 의존하는 문제점을 해결하고자 합니다.

#Review #Video Question Answering #Open-domain Search #Multimodal LLMs #Agentic AI #Benchmark #Video Understanding #Multi-hop Reasoning

2026년 1월 12일

[논문리뷰] BabyVision: Visual Reasoning Beyond Language

최신 멀티모달 대규모 언어 모델(MLLMs)이 고수준의 지식 기반 과제에서는 탁월하지만, 3세 아동도 쉽게 해결하는 기본적인 시각적 추론 과제에서 실패하는 근본적인 문제를 해결하고자 합니다.

#Review #Multimodal LLMs #Visual Reasoning #Benchmark #Early Vision #Spatial Perception #Visual Tracking #Pattern Recognition #Generative Models

2026년 1월 12일

[논문리뷰] CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

기존 Multimodal Large Language Models (MLLMs) 이 시각적 수학 문제 해결에서 낮은 정확도와 일관성 없는 추론을 보이는 문제를 해결하는 것이 목표입니다. 특히, 시각적 정보 추출 후 이 정보가 추론 과정에 충실히 통합되고 활용되는지를 보장하지 못하는 한계를 극복하고자 합니다.

#Review #Multimodal LLMs #Visual Reasoning #Mathematical Problem Solving #Knowledge Internalization #Reinforcement Learning #Cognitive-Inspired AI #Perception-Reasoning Alignment

2026년 1월 6일

[논문리뷰] Video-BrowseComp: Benchmarking Agentic Video Research on Open Web

본 논문은 기존 벤치마크들이 텍스트 및 정적 멀티모달 정보 탐색에 초점을 맞추고 동적인 웹 비디오 콘텐츠를 간과하는 문제점을 해결하고자 합니다.

#Review #Agentic AI #Video Understanding #Web Browsing #Benchmark #Multimodal LLMs #Temporal Grounding #Cross-Source Reasoning #Information Seeking

2025년 12월 29일

[논문리뷰] OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

기존 옴니모달 대규모 언어 모델(OmniLLMs) 이 겪는 미세한 크로스모달 이해(fine-grained cross-modal understanding) 및 멀티모달 정렬(multimodal alignment) 의 한계를 해결하는 것을 목표로 합니다.

#Review #Omnimodal Understanding #Audio-Guided Perception #Active Learning Agents #Cross-Modal Alignment #Tool-Use #Video Understanding #Multimodal LLMs

2025년 12월 29일

[논문리뷰] SpatialTree: How Spatial Abilities Branch Out in MLLMs

멀티모달 대규모 언어 모델(MLLM) 내에서 공간 능력의 계층적 구조가 제대로 이해되지 않고 단편적으로 연구되는 문제를 해결하는 것을 목표로 합니다.

#Review #Spatial Intelligence #Multimodal LLMs #Cognitive Hierarchy #Benchmark #Reinforcement Learning #Supervised Fine-tuning #Spatial Reasoning

2025년 12월 23일

[논문리뷰] 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

본 논문은 기존 MLLM이 3D 구조와 시간적 역학(4D)을 추론하는 능력이 부족하며, 특히 4D 인지 및 시간적 이해 가 약하다는 문제를 해결하고자 합니다.

#Review #Multimodal LLMs #4D Understanding #Perceptual Distillation #Region-level VQA #Video Question Answering #Temporal Perception #Depth Perception

2025년 12월 21일

[논문리뷰] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

본 논문은 이미지와 텍스트가 혼합된 시퀀스를 처리하는 옴니 모델(Omni Models)을 위한 보상 모델(Reward Models, RMs)의 부족한 평가 프레임워크를 해결하고자 합니다.

#Review #Reward Models #Multimodal LLMs #Benchmark #Text-to-Image Generation #Image Editing #Interleaved Generation #Multimodal Reasoning #MLLM-as-a-judge

2025년 12월 18일

[논문리뷰] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

이 논문은 음성 양식이 LLM(Large Language Model) 에 직접 통합될 때 음성-텍스트 번역(ST) 품질이 향상되는지, 아니면 기존의 계단식(cascaded) 또는 직접(direct) 모델 이 여전히 더 효과적인 솔루션인지 평가합니다.

#Review #Speech-to-Text Translation #Multimodal LLMs #Speech Foundation Models #Cascaded Systems #Benchmarking #Speech Modality Integration #Robustness #Evaluation Metrics

2025년 12월 18일

[논문리뷰] Step-GUI Technical Report

논문은 GUI 자동화 분야에서 고품질 훈련 데이터를 효율적이고 신뢰성 있게 확보하는 근본적인 문제를 해결하고자 합니다. 또한, 이종 기기 간의 표준화된 인터페이스를 구축하여 사용자 개인 정보를 보호하고, 실제 일상적인 사용 패턴에 기반한 평가 벤치마크를 통해 에이전트의 실용성을 검증하는 것을 목표로 합니다.

#Review #GUI Automation #Self-Evolving Pipeline #Reinforcement Learning #Multimodal LLMs #Privacy-Preserving AI #Human-Computer Interaction #Model Context Protocol #Benchmarking

2025년 12월 17일

[논문리뷰] Thinking with Images via Self-Calling Agent

본 논문은 희소한 고품질 추론 데이터로 인해 강화 학습을 통한 MLLM의 Interleaved Multimodal Chain-of-Thought (iMCoT) 최적화가 어렵다는 문제점을 해결하고자 합니다.

#Review #Multimodal LLMs #Self-Calling Chain-of-Thought #Reinforcement Learning #Visual Reasoning #Agentic AI #Tool Calling #Group Relative Policy Optimization

2025년 12월 11일

[논문리뷰] OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

본 논문은 멀티모달 대규모 언어 모델(MLLM)의 안전성 정렬을 우회하는 탈옥(jailbreak) 공격 에 대한 통합적인 벤치마크 및 툴박스 를 구축하는 것을 목표로 합니다. 기존 벤치마크가 가진 제한적인 공격 시나리오, 표준화되지 않은 방어 평가, 재현 가능한 툴박스 부재와 같은 한계를 극복하고자 합니다.

#Review #Multimodal LLMs #Jailbreak Attack #Attack-Defense Evaluation #Benchmark #Safety Alignment #Vulnerability Analysis #Risk Taxonomy #Evaluation Metrics

2025년 12월 8일

[논문리뷰] Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

본 논문은 비디오 이해 태스크에서 멀티모달 LLM(MLLM)이 생성하는 설명문의 시각적 객체 및 시간적 행동 환각 문제를 공동으로 완화하는 것을 목표로 합니다.

#Review #Multimodal LLMs #Video Understanding #Hallucination Mitigation #Object Hallucination #Action Hallucination #Contrastive Learning #Self-Augmentation #Tracklet-Phrase Alignment

2025년 12월 4일

[논문리뷰] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

본 논문은 기존 멀티모달 보상 모델(Reward Models, RMs)이 겪는 환각, 약한 시각적 접지(visual grounding), 그리고 검증을 위한 도구 사용 능력 부족 문제를 해결하는 것을 목표로 합니다.

#Review #Multimodal Reward Models #Agentic AI #Tool Use #Reinforcement Learning #Visual Reasoning #Multimodal LLMs #Instruction Following #Evaluation Benchmarks

2025년 12월 4일

[논문리뷰] OneThinker: All-in-one Reasoning Model for Image and Video

기존 MLLM(Multimodal Large Language Models)이 단일 태스크나 단일 모달리티(이미지 또는 비디오)에 국한되는 한계를 넘어, 이미지와 비디오 이해를 아우르는 다양한 시각 태스크를 동시에 처리할 수 있는 범용적인 추론 모델 인 'All-in-one multimodal reasoning generalist' 를 개발하는 것을 목표로 합니다.

#Review #Multimodal LLMs #Reinforcement Learning #Visual Reasoning #Generalist Model #Image Understanding #Video Understanding #Multitask Learning #EMA-GRPO

2025년 12월 3일

[논문리뷰] PAI-Bench: A Comprehensive Benchmark For Physical AI

현재 다중 모달 대규모 언어 모델( MLLM )과 비디오 생성 모델( VGM )이 실제 물리적 역학을 인지하고 예측하는 능력을 충분히 지원하는지 이해하는 데 한계가 있습니다.

#Review #Physical AI #Benchmark #Video Generation #Conditional Video Generation #Video Understanding #Multimodal LLMs #Physical Plausibility #Embodied Reasoning

2025년 12월 2일

[논문리뷰] LongVT: Incentivizing 'Thinking with Long Videos' via Native Tool Calling

논문은 대규모 멀티모달 모델(LMMs)이 장시간 비디오(hours-long)에서 증거가 희박하고 시간적으로 분산된 정보를 처리할 때 발생하는 환각 현상과 부정확한 추론 문제를 해결하고자 합니다.

#Review #Long Video Understanding #Multimodal LLMs #Tool Calling #Reinforcement Learning #Chain-of-Thought #Temporal Grounding #Video Question Answering

2025년 12월 1일

[논문리뷰] SO-Bench: A Structural Output Evaluation of Multimodal LLMs

본 논문은 멀티모달 대규모 언어 모델(MLLMs)이 시각적 입력으로부터 스키마 기반 정보를 추출하고 추론하여 구조화된 출력을 생성하는 능력에 대한 체계적인 벤치마크가 부재하다는 문제를 해결하고자 합니다.

#Review #Multimodal LLMs #Structural Output #Information Extraction #JSON Schema #SO-Bench #Visual Reasoning #Supervised Fine-tuning #Reinforcement Learning

2025년 11월 30일

[논문리뷰] Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

현재 MLLM(Multimodal Large Language Models) 이 각 문제를 de novo 방식으로 해결하며 시각적 주의 집중 및 논리적 추론 오류를 반복하는 한계를 극복하는 것이 목표입니다.

#Review #Multimodal LLMs #Semantic Memory #Agentic Learning #Error Attribution #Visual Reasoning #Long-term Memory #Grow-and-Refine #Multimodal Reasoning

2025년 11월 27일

[논문리뷰] GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

본 연구는 기존 에이전트 시각 추론 모델들이 주로 이미지 조작 도구에 집중하여 일반적인 목적으로 확장하기 어려운 한계를 해결하고자 합니다.

#Review #Geolocalization #Agentic Models #Visual Reasoning #Web-Augmented #Multimodal LLMs #Reinforcement Learning #Tool Use #GeoBench

2025년 11월 23일

[논문리뷰] Step-Audio-R1 Technical Report

오디오 언어 모델이 추론 과정을 거치면 성능이 저하되는 기존의 문제, 즉 '텍스트 대리 추론' 현상을 해결하고, 오디오 도메인에서 진정한 추론 능력을 성공적으로 활성화하는 것을 목표로 합니다. 이는 오디오 인텔리전스에 대한 심층적 사고의 이점을 입증하고자 합니다.

#Review #Audio Reasoning #Multimodal LLMs #Modality-Grounded Reasoning Distillation (MGRD)#Chain-of-Thought #Reinforcement Learning #Audio Understanding #Self-Distillation

2025년 11월 20일

[논문리뷰] VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models

본 논문은 기존 비디오 이상 탐지(VAD) 방법들이 놓치던 이상 행동의 깊은 인과 관계 및 객체 간 상호작용 을 이해하는 한계를 극복하고자 합니다. 궁극적으로 비디오 내 이상 현상에 대한 자세한 해석과 의미론적 이해 를 제공하는 것을 목표로 합니다.

#Review #Video Anomaly Understanding #Large Language Models #Causal Reasoning #Relation-Aware #Keyframe Sampling #Multimodal LLMs #Scene Graphs

2025년 11월 10일

[논문리뷰] SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

멀티모달 대규모 언어 모델(MLLM)이 비디오에서 시공간 추론을 수행하는 데 어려움을 겪는 문제를 해결하는 것을 목표로 합니다.

#Review #Spatial Reasoning #Video Understanding #Simulated Data #Instruction Tuning #Multimodal LLMs #Sim-to-Real Transfer #AI2-THOR

2025년 11월 9일

[논문리뷰] Cambrian-S: Towards Spatial Supersensing in Video

본 논문은 현재 멀티모달 대규모 언어 모델(MLLM)이 비디오를 단편적인 프레임으로 처리하고 공간 구조를 제대로 이해하지 못하며, 언어적 기억에 과도하게 의존하는 한계를 지적합니다.

#Review #Spatial Supersensing #Video Understanding #Multimodal LLMs #Predictive Sensing #Memory Management #Event Segmentation #VSI-SUPER #Instruction Tuning

2025년 11월 9일

[논문리뷰] Benchmark Designers Should 'Train on the Test Set' to Expose Exploitable Non-Visual Shortcuts

이 논문은 Multimodal Large Language Model (MLLM)이 시각적 이해 없이 비시각적 단축키(편향, 언어적 선험지식, 피상적인 패턴)를 악용하여 멀티모달 벤치마크에서 높은 점수를 얻는 문제를 해결하고자 합니다.

#Review #Multimodal LLMs #Benchmark Design #Non-Visual Shortcuts #Test-Set Stress-Test #Bias Mitigation #Model Evaluation #Benchmark Robustness

2025년 11월 9일

[논문리뷰] MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

기존 멀티모달 벤치마크들이 텍스트 기반 추론을 과도하게 강조하거나 시각 중심의 인지적 행동을 체계적으로 포착하지 못하여 MLLM의 인지 능력을 불충분하게 평가하는 한계를 해결하는 것을 목표로 합니다. 시각 기반 추론에 중점을 둔 새로운 벤치마크 MME-CC 를 도입하여 MLLM의 인지 능력을 심층적으로 평가하고자 합니다.

#Review #Multimodal LLMs #Benchmark #Cognitive Capacity #Visual Reasoning #MLLM Evaluation #Error Analysis #Chain-of-Thought

2025년 11월 9일

[논문리뷰] ChartM^3: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension

본 연구는 기존 멀티모달 대규모 언어 모델(MLLM)이 실제 복잡한 차트 이해 작업에서 겪는 한계(제한된 차트 유형 및 복잡성, 낮은 질문 복잡성, 해석력 부족 등)를 해결하고자 합니다.

#Review #Chart Comprehension #Visual Reasoning #Data Generation #Code-Driven Pipeline #Multimodal LLMs #Retrieval-Augmented Generation #Reinforcement Learning #Synthetic Data

2025년 11월 9일

[논문리뷰] Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

대규모 멀티모달 모델(LMM)이 이미지 인코더에서 생성되는 막대한 수의 시각 토큰으로 인해 겪는 심각한 추론 비효율성 문제를 해결하는 것이 주된 목표입니다.

#Review #Large Multimodal Models #Visual Token Compression #Token Pruning #Benchmark #Efficiency #Inference Latency #Multimodal LLMs

2025년 11월 9일

[논문리뷰] TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

본 연구는 기존 벤치마크들이 OpenAI o3 와 같은 최신 MLLM의 'thinking-with-images' (이미지로 사고하기) 능력, 즉 이미지 조작 도구를 활용한 문제 해결 능력을 충분히 포착하지 못하는 문제를 해결하고자 합니다.

#Review #Multimodal LLMs #Agentic Reasoning #Thinking-with-Images #Visual Reasoning Benchmark #Tool Use #Image Manipulation #Fine-tuning

2025년 11월 9일

[논문리뷰] Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

본 논문은 최신 Multimodal Large Language Models (MLLMs) 의 3D 공간 추론 능력을 평가하고 향상시키는 것을 목표로 합니다.

#Review #Multimodal LLMs #Spatial Reasoning #Viewpoint Learning #Two-Stage Fine-tuning #3D Consistency #Viewpoint-100K #Reinforcement Learning

2025년 11월 9일

[논문리뷰] Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning

본 논문은 MLLM(Multimodal Large Language Model) 기반 embodied agent 가 시각적 백도어 공격에 취약함을 지적하고, 이 문제를 해결하고자 합니다.

#Review #Visual Backdoor Attacks #MLLM Embodied Agents #Contrastive Trigger Learning #Policy Manipulation #Adversarial AI #Embodied AI Security #Multimodal LLMs

2025년 11월 9일

[논문리뷰] VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

본 연구는 기존 비디오 벤치마크들이 장거리 이동 및 다일(multi-day) 활동과 같은 거시적 규모의 지리 공간-시간적 시나리오 를 충분히 다루지 못한다는 한계를 지적하며, MLLM(Multimodal Large Language Models)의 확장된 지리 공간 및 시간적 이해 능력 을 평가하는 새로운 벤치마크 VIR-Bench를 제시합니다.

#Review #Multimodal LLMs #Video Understanding #Geospatial Reasoning #Temporal Reasoning #Travel Itinerary Reconstruction #Benchmark #Agent System #VLOG

2025년 9월 24일

[논문리뷰] TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

이 논문은 비디오 시간적 접지(temporal grounding) 작업에서 멀티모달 대규모 언어 모델(MLLMs) 의 효율성을 개선하는 것을 목표로 합니다. 기존 강화 학습( RL ) 방법론, 특히 GRPO 가 큰 시간 검색 공간에서 비효율적인 탐색과 불안정한 정책 업데이트를 겪는 문제를 해결하고자 합니다.

#Review #Video LLMs #Temporal Grounding #Reinforcement Learning #Off-policy Learning #Reward Shaping #Chain-of-Thought #Multimodal LLMs

2025년 9월 23일

[논문리뷰] OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning

본 논문은 기존 MLLM 기반 Embodied 시스템의 Geometric Adaptability Gap (다양한 공간 요구사항에 대한 3D 정보 부족)과 Embodiment Constraint Gap (실제 로봇의 물리적 제약 무시)이라는 두 가지 핵심 한계를 해결하고자 합니다.

#Review #Embodied AI #Multimodal LLMs #3D Grounding #Task-Adaptive Reasoning #Embodiment-Aware Planning #Robotics #Spatial Reasoning

2025년 9월 12일

[논문리뷰] Visual Representation Alignment for Multimodal Large Language Models

본 논문은 시각적 지시 튜닝으로 훈련된 다중 모달 대규모 언어 모델(MLLM) 이 객체 카운팅이나 공간 추론과 같은 시각 중심 작업에서 제한적인 성능을 보이는 문제를 해결하고자 합니다.

#Review #Multimodal LLMs #Visual Representation Alignment #Foundation Models #Regularization #Fine-grained Visual Understanding #Spatial Reasoning #Object Counting #Vision-Language Models

2025년 9월 10일

[논문리뷰] Reinforced Visual Perception with Tools

본 논문은 멀티모달 대규모 언어 모델(LLM)이 복잡한 시각적 추론 문제를 해결하고 외부 시각 도구를 효과적으로 활용하는 능력을 강화하는 것을 목표로 합니다. 기존 지도 학습(SFT) 기반 접근 방식의 한계인 고비용 데이터 생성, 섬세한 데이터 필터링 필요성, 그리고 제한된 일반화 능력을 극복하고자 합니다.

#Review #Visual Reasoning #Multimodal LLMs #Reinforcement Learning #Tool Usage #Perception-heavy Benchmarks #GRPO #Vision Tools

2025년 9월 9일

[논문리뷰] Kwai Keye-VL 1.5 Technical Report

본 논문은 동적이고 정보 밀도가 높은 비디오 콘텐츠 이해에서 발생하는 공간 해상도와 시간 범위 간의 트레이드오프 문제를 해결하고, 기존 모델들이 비디오 이해에서 겪는 한계를 극복하는 것을 목표로 합니다.

#Review #Multimodal LLMs #Video Understanding #Slow-Fast Encoding #Long Context #Chain-of-Thought #Reinforcement Learning #Human Alignment #Native-Resolution Vision Encoder

2025년 9월 3일

[논문리뷰] Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

본 연구는 텍스트-이미지(T2I) 생성 시 다중 속성 및 모호한 프롬프트 처리 능력의 한계 를 극복하고자 합니다.

#Review #Text-to-Image Generation #Reinforcement Learning #Chain of Thought #Multimodal LLMs #Stage-Aware Rewards #Semantic Reasoning #Generative AI

2025년 8월 26일

[논문리뷰] Ovis2.5 Technical Report

Ovis2.5는 이전 Ovis 버전의 한계, 특히 고정 해상도 이미지 처리와 선형 사고 체인(CoT) 기반 추론의 문제를 해결하고자 합니다.

#Review #Multimodal LLMs #Native Resolution Vision #Deep Reasoning #Chart Analysis #OCR #Visual Grounding #Training Efficiency #Preference Optimization

2025년 8월 19일

[논문리뷰] Has GPT-5 Achieved Spatial Intelligence? An Empirical Study

이 연구는 최신 MLLM(Multi-modal Large Language Model) , 특히 GPT-5 가 인공 일반 지능(AGI)의 핵심 역량인 공간 이해 및 추론 능력을 얼마나 달성했는지 실증적으로 평가하는 것을 목표로 합니다.

#Review #Spatial Intelligence #Multimodal LLMs #Benchmark Evaluation #GPT-5 #Cognitive AI #AGI

2025년 8월 19일

[논문리뷰] Thyme: Think Beyond Images

본 논문은 기존의 '이미지로 생각하기' 방식의 멀티모달 대규모 언어 모델(MLLM) 이 가진 이미지 조작 기능의 제한성과 논리적 추론 능력의 한계를 극복하는 것을 목표로 합니다.

#Review #Multimodal LLMs #Code Generation #Image Processing #Reinforcement Learning #Supervised Fine-Tuning #Visual Reasoning #Sandbox

2025년 8월 18일

[논문리뷰] Controlling Multimodal LLMs via Reward-guided Decoding

본 논문은 MLLM(Multimodal Large Language Models)이 다양한 사용자 요구에 맞춰 동작을 조절할 수 있도록, 추론 과정에서 세밀한 제어 를 가능하게 하는 것을 목표로 합니다.

#Review #Multimodal LLMs #Reward Models #Guided Decoding #Visual Grounding #Hallucination Mitigation #Object Precision #Object Recall #Inference-time Control

2025년 8월 18일

[논문리뷰] HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

본 논문은 인간 중심 시나리오에서 MLLM(Multimodal Large Language Models) 의 심층적인 이해 및 공감적, 상황 인지적 응답 능력을 평가하기 위한 세분화된 평가 프레임워크의 부족 문제 를 해결하고자 합니다.

#Review #Multimodal LLMs #Human-Centered AI #Empathy #Context-Awareness #MLLM Benchmark #Reinforcement Learning #Reasoning

2025년 8월 15일

[논문리뷰] Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?

이 논문은 현재 문서 검색 증강 생성(RAG) 시스템 의 평가 벤치마크가 실제 세계의 복잡성과 한계를 제대로 반영하지 못하는 문제점을 해결하고자 합니다.

#Review #Retrieval-Augmented Generation #Multimodal LLMs #Benchmark Evaluation #Document Understanding #Multi-hop Reasoning #Information Retrieval #Evaluation Dataset

2025년 8월 8일

[논문리뷰] STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

기존 오디오 벤치마크가 텍스트로 쉽게 표현 가능한 의미론적 내용에 치중하여 미세한 지각 추론 능력을 간과하는 문제를 해결하는 것을 목표로 합니다.

#Review #Audio Intelligence #Spatio-Temporal Reasoning #4D Audio #Benchmark #Large Audio-Language Models #Perceptual Reasoning #Multimodal LLMs

2025년 10월 29일

[논문리뷰] RoboOmni: Proactive Robot Manipulation in Omni-modal Context

본 논문은 기존 로봇 조작 모델이 명시적인 지시에 의존하며 실제 환경에서 인간의 의도를 능동적으로 파악하는 데 한계가 있다는 문제를 해결합니다.

#Review #Robotic Manipulation #Multimodal LLMs #Vision-Language-Action #Proactive AI #Omni-modal Learning #Intent Recognition #Contextual Instructions

2025년 10월 29일

[논문리뷰] Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs

Multimodal Large Language Models (MLLMs)가 복잡한 시각적 계획과 상상력을 요구하는 시나리오에서 겪는 어려움을 해결하고, MLLM에 내부 시각적 스크래치패드(visual scratchpad) 를 부여하여 시각적 사고(visual thought) 를 통해 멀티모달 추론 능력을 향상시키는 것을 목표로 합니다.

#Review #Multimodal LLMs #Visual Reasoning #Latent Space #Sketch Generation #Visual Thinking #Autoregressive Generation #Interpretability

2025년 10월 29일

[논문리뷰] Directional Reasoning Injection for Fine-Tuning MLLMs

논문은 멀티모달 대규모 언어 모델(MLLM)의 추론 능력이 텍스트 전용 LLM에 비해 현저히 떨어진다는 문제에 주목합니다. 대규모 멀티모달 추론 데이터셋이나 강화 학습 없이도, 텍스트 전용 추론 전문가 모델 의 추론 지식을 비추론 멀티모달 LLM 으로 효율적으로 전이하는 경량화된 방법을 개발하는 것을 목표로 합니다.

#Review #Multimodal LLMs #Reasoning Transfer #Gradient-based Fine-tuning #Model Merging #Parameter-Efficient Learning #Supervised Fine-tuning #Directional Prior

2025년 10월 23일

[논문리뷰] DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents

본 논문은 Multimodal Large Language Models (MLLMs)의 다중 작업 지도 미세 조정(SFT)에서 최적의 데이터 혼합 전략을 찾아 성능을 극대화하는 문제를 해결합니다. 특히, 모바일 폰 에이전트(MPA)의 다양한 기능을 동시에 처리하는 MLLM의 효율성을 향상시키는 것을 목표로 합니다.

#Review #Multimodal LLMs #Fine-tuning #Data Mixing Optimization #Mobile Phone Agents #Downstream Task Prediction #Benchmark #Neural Networks

2025년 10월 23일

[논문리뷰] MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

기존 MLLM 평가 벤치마크가 주로 단일 턴 질의응답과 비디오 내용의 사실적 인지에만 초점을 맞춘 한계를 해결합니다.

#Review #Multimodal LLMs #Video Understanding #Benchmark #Multi-Turn Dialogues #Perceptivity #Interactivity #Evaluation

2025년 10월 22일

[논문리뷰] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

본 논문은 기존 MLLM 이 전체적인 이해에는 뛰어나지만, 복잡한 장면의 미세한 디테일과 객체 간의 복잡한 관계를 파악하는 데 한계가 있음을 지적합니다.

#Review #Multimodal LLMs #Region Understanding #Contextual Pixel Understanding #RoI-aligned Feature Replay #Compositional Reasoning #GAR-Bench #Zero-shot Video Understanding

2025년 10월 22일

[논문리뷰] MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

본 논문은 사용자 인터페이스(UI) 디자인 평가 과정에서 발생하는 리소스 제약을 해결하기 위해 Multimodal Large Language Models (MLLMs) 이 인간의 UI 인식과 선호도를 얼마나 정확하게 예측할 수 있는지 벤치마킹하는 것을 목표로 합니다.

#Review #Multimodal LLMs #UI Evaluation #Human Perception #Benchmarking #UX Research #MLLM-as-a-Judge #Cognitive Factors #Pairwise Comparison

2025년 10월 15일

[논문리뷰] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

현재 Multimodal Large Language Models (MLLM) 은 복잡한 실제 문제 해결에 필수적인 긴 추론 체인(long-chain reflective reasoning) 및 반복적 사고(iterative thinking) 능력에서 한계를 보입니다.

#Review #Multimodal LLMs #Reflective Reasoning #Long-Chain Reasoning #Benchmark #Policy Optimization #Data Generation #Reinforcement Learning #Backtracking

2025년 10월 10일

[논문리뷰] Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks

본 연구는 기존 벤치마크들이 산점도(scatterplot) 관련 태스크를 충분히 다루지 못하여 AI 모델의 성능을 평가하는 데 한계가 있다는 문제점을 해결하고자 합니다.

#Review #Scatterplot Analysis #AI Benchmarking #Multimodal LLMs #Synthetic Data Generation #Cluster Detection #Outlier Detection #Data Visualization #Prompt Engineering

2025년 10월 8일

[논문리뷰] Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

본 연구는 AI-생성 비디오에서 인간이 인지하는 '딥페이크 흔적'을 식별하고 그 이유를 근거 있게 설명할 수 있는가에 대한 문제를 해결하고자 합니다.

#Review #AI-Generated Videos #Deepfake Detection #Multimodal LLMs #Human Perception #Video Generation Evaluation #Spatiotemporal Annotation #Reward Modeling

2025년 10월 1일