#Large Multimodal Models

14 posts

[Paper Review] Large Multimodal Models as General In-Context Classifiers

This paper revisits the prevailing view that large multimodal models (LMMs) underperform contrastively trained vision-language models (VLMs) on image classification, and examines how much in-context learning (ICL) can improve LMMs' classification ability.

#Review #Large Multimodal Models #In-Context Learning #Image Classification #Open-World Classification #Zero-Shot Learning #Vision-Language Models #CLIP

March 5, 2026

[Paper Review] From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

This paper addresses two fundamental limitations of existing self-training frameworks for large multimodal models (LMMs): a lack of interpretable diagnostics and insufficient visual diversity.

#Review #Large Multimodal Models #Iterative Training #Diagnostic-Driven Learning #Reinforcement Learning #Multimodal Reasoning #Data Generation #Agent Systems

February 26, 2026

[Paper Review] DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

The paper aims to overcome the limited diversity, coverage, and generalization of existing multimodal RLVR (Reinforcement Learning with Verifiable Rewards) training datasets.

#Review #Multimodal Reasoning #Mathematical Dataset #RLVR #Data Curation #Visual Diversity #K12 Mathematics #Large Multimodal Models

February 22, 2026

[Paper Review] Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility

The paper addresses the scarcity of multimodal data for scientific reasoning and the tendency of existing Text-to-Image (T2I) models to generate images that are visually plausible but scientifically inaccurate.

#Review #Scientific Image Synthesis #Multimodal Reasoning #Text-to-Image #Benchmarking #Programmatic Synthesis #Large Multimodal Models #Synthetic Data

January 26, 2026

[Paper Review] MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

This paper points out that existing research-agent benchmarks focus on text-only or short-form multimodal question answering, which limits their ability to evaluate end-to-end report generation grounded in multimodal evidence.

#Review #Multimodal Deep Research #Research Agents #Benchmark #Evaluation Framework #Retrieval-Augmented Generation #Large Multimodal Models #Visual Grounding #Citation Analysis

January 21, 2026

[Paper Review] Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

This paper addresses reasoning that depends on fine-grained evidence in text-rich videos, in particular the hallucinations and errors of existing single-pass video QA models.

#Review #Video Reasoning #Large Multimodal Models #Reinforcement Learning #Visual Rumination #Text-Rich Video #Video Question Answering #Iterative Perception

November 23, 2025

[Paper Review] OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

The paper aims to overcome the limits that the absence of transparent, reproducible data curation and training strategies places on scalability research in multimodal reasoning.

#Review #Multimodal Reasoning #Large Multimodal Models #Supervised Fine-tuning #Reinforcement Learning #Data Curation #Open-source #Multimodal Benchmarks

November 23, 2025

[Paper Review] V-Thinker: Interactive Thinking with Images

This paper addresses the problem of large multimodal models (LMMs) drifting away from visual information during long reasoning chains and hallucinating as a result.

#Review #Large Multimodal Models #Interactive Reasoning #Vision-Centric Thinking #Reinforcement Learning #Data Synthesis #Visual Tools #Curriculum Learning #Multimodal AI

November 9, 2025

[Paper Review] Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

The main goal is to address the severe inference inefficiency that large multimodal models (LMMs) suffer because of the very large number of visual tokens produced by the image encoder.

#Review #Large Multimodal Models #Visual Token Compression #Token Pruning #Benchmark #Efficiency #Inference Latency #Multimodal LLMs

November 9, 2025

[Paper Review] Morae: Proactively Pausing UI Agents for User Choices

This paper addresses the problem that existing UI agents complete tasks automatically without offering blind and low-vision (BLV) users a choice at important decision points, which undermines user agency.

#Review #UI Agents #Accessibility #Human-Agent Interaction #Mixed-Initiative AI #Large Multimodal Models #Proactive AI #User Choice #Blind and Low-Vision Users

September 1, 2025

[Paper Review] Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability

This paper addresses the problem that large multimodal models (LMMs) passively accept faulty inputs, which leads to incorrect reasoning.

#Review #Large Multimodal Models #Input Scrutiny #Error Detection #Faulty Inputs #Evaluation Framework #Modality Preference #Cross-Modal Inconsistency

August 8, 2025

[Paper Review] SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Noting that existing video benchmarks focus on general scenarios and simple reasoning, and therefore fall short in evaluating the advanced cognitive abilities of recent large multimodal models (LMMs), the paper aims to build SciVideoBench, a rigorous benchmark for comprehensively evaluating complex video reasoning in scientific domains.

#Review #Video Reasoning #Multimodal AI #Scientific Research #Large Multimodal Models #Benchmark #Quantitative Reasoning #Domain Knowledge #Visual Grounding

October 10, 2025

[Paper Review] Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

The paper addresses the problem that existing video reasoning models provide only text-based reasoning and fail to indicate when and where the key evidence appears.

#Review #Video Reasoning #Spatio-Temporal Grounding #Large Multimodal Models #Reinforcement Learning #Chain-of-Thought #Visual Evidence #Dataset Curation

October 24, 2025

[Paper Review] KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints

The paper aims to address the static and limited knowledge of large multimodal models (LMMs) and to mitigate the catastrophic forgetting that occurs when new knowledge is injected.

#Review #Knowledge Injection #Large Multimodal Models #Catastrophic Forgetting #Data Augmentation #Parameter-Efficient Fine-Tuning #Null Space #Continual Learning

October 23, 2025