#Visual Question Answering

26 posts

[Paper Review] Personal AI Agent for Camera Roll VQA

This study addresses the limitations of a VQA setting in which a conversational AI retrieves photos from, and answers questions over, a user's entire personal Camera Roll.

#Review #Personal AI Agent #Camera Roll #Visual Question Answering #Long-horizon Memory #Hierarchical Memory #Multimodal LLM #Agentic Workflow

June 4, 2026

[Paper Review] MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding

This paper addresses the failure of general-purpose Multimodal Large Language Models (MLLMs) to properly interpret the complexity and domain specificity of mechanical engineering drawings.

#Review #Multimodal Large Language Models #Mechanical Drawing Understanding #Visual Question Answering #Spatial Reasoning #Reinforcement Learning #Domain-Specialized Benchmark

June 4, 2026

[Paper Review] Brain-IT-VQA: From Brain Signals to Answers

This paper addresses the limited performance and the difficulty of neuroscientific interpretation in existing fMRI-based visual reconstruction and VQA research.

#Review #fMRI #Visual Question Answering #Brain Decoding #Vision-Language Models #Brain-IT #NSD-VQA

June 1, 2026

[Paper Review] Less Detail, Better Answers: Degradation-Driven Prompting for VQA

This paper addresses the phenomenon in which state-of-the-art Vision-Language Models (VLMs) paradoxically hallucinate or make reasoning errors on high-resolution images because of unnecessary visual noise.

#Review #Vision-Language Models #Visual Question Answering #Degradation-Driven Prompting #Agentic Perception #Structural Bottleneck

April 6, 2026

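[Paper Review] Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

This paper aims to overcome the limitations of the autoregressive architecture that existing multimodal large language models (MLLMs) mainly rely on, and to explore a new probabilistic modeling alternative that unifies understanding and generation across text, speech, and images.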

#Review #Multimodal AI #Discrete Diffusion Models #Masked Language Modeling #Unified Generative Models #Any-to-Any #Speech-to-Image #Visual Question Answering

March 10, 2026

[Paper Review] When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

The goal is to clarify whether reinforcement learning (RL) in medical Vision-Language Models (VLMs) improves visual reasoning or mainly reinforces behaviors already induced by Supervised Fine-tuning (SFT).

#Review #Medical VLMs #Reinforcement Learning #Supervised Fine-tuning #Visual Question Answering #Multi-modality #Reasoning Capacity #MedMNIST

March 2, 2026

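[Paper Review] LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

This paper addresses problems faced by existing reasoning systems based on diffusion language models (dLLMs), including task specificity, unstable RL training, and a lack of training signal.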

#Review #Multimodal Diffusion Models #Reasoning #Reinforcement Learning #Supervised Finetuning #Visual Question Answering #Image Editing #Object Grounding #Policy Gradient

February 16, 2026

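[Paper Review] Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

This paper addresses two problems of existing multimodal deep-research MLLMs: the hit-rate problem (search-engine noise and instability) and limited reasoning depth and search breadth.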

#Review #Multimodal Large Language Models #Deep Research #Agentic AI #Tool Use #Visual Question Answering #Reinforcement Learning #Multi-scale Search

February 2, 2026

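[Paper Review] Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

This paper addresses the problem that existing multimodal large language model (MLLM) benchmarks are either not visual-search-centric or rely on overly idealized search scenarios, and therefore fail to accurately evaluate a model's real visual and textual search capabilities.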

#Review #Multimodal Large Language Models #Visual Question Answering #Deep Research #Benchmark #Visual Search #Textual Search #Cropped Search #Evaluation

February 2, 2026

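[Paper Review] Toward Cognitive Supersensing in Multimodal Large Language Model

This paper aims to improve the limited performance of multimodal large language models (MLLMs) on complex cognitive problems that require abstract visual information and visual memory. It seeks to close the cognitive gap by giving MLLMs a visual reasoning mechanism analogous to the human visuospatial sketchpad and visual imagery.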

#Review #Multimodal Large Language Models #Cognitive Reasoning #Visual Imagery #Latent Representations #Reinforcement Learning #Visual Question Answering #Benchmark

February 2, 2026

[Paper Review] MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

This paper aims to overcome the limitation that open-source multimodal models lag behind proprietary systems because of a shortage of high-quality reasoning data.

#Review #Multimodal Reasoning #Data-centric AI #Chain-of-Thought #Large Language Models #Visual Question Answering #STEM Reasoning #Dataset #Fine-tuning

January 29, 2026

[Paper Review] UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture

This study addresses the difficulty Multimodal Large Language Models (MLLMs) have in understanding perceptual-level image properties such as aesthetics, quality, structure, and texture.

#Review #Perceptual Understanding #Image Aesthetics #Image Quality #Image Structure #Image Texture #MLLM Benchmark #Visual Question Answering #Reward Model

December 28, 2025

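[Paper Review] Jina-VLM: Small Multilingual Vision Language Model

This study aims to address two major challenges that hinder the practical deployment of VLMs: first, overcoming the degradation in multilingual performance that occurs during vision adaptation; second, improving accessibility by reducing the high computational cost of training and deploying high-quality VLMs.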

#Review #Vision-Language Model #Multilingual VLM #Small VLM #Visual Question Answering #Attention Pooling #Image Tiling #SigLIP #Qwen

December 3, 2025

[Paper Review] Scaling Spatial Intelligence with Multimodal Foundation Models

This study aims to address the lack of spatial intelligence in state-of-the-art multimodal foundation models and to explore, through the SenseNova-SI family of models, how large-scale data scaling can effectively cultivate spatial intelligence.

#Review #Spatial Intelligence #Multimodal Foundation Models #Data Scaling #Perspective-taking #Visual Question Answering #Emergent Capabilities #Embodied AI #Benchmark Evaluation

November 20, 2025

[Paper Review] VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery

This study aims to develop a Multimodal Large Language Model (MLLM) agent capable of expert-level reasoning about ancient Greek pottery.

#Review #Multimodal Large Language Models #Visual Question Answering #Reinforcement Learning #Cultural Heritage #Ancient Greek Pottery #Supervised Fine-Tuning #Benchmark

September 23, 2025

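[Paper Review] MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

The goal is to address the performance trade-off that existing unified multimodal LLMs face between visual understanding and generation, particularly the degradation on text-rich benchmarks.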

#Review #Multimodal LLM #Hybrid Tokenizer #Text-to-Image Generation #Visual Question Answering #Autoregressive Model #Diffusion Decoder #Unified Architecture #Model Scaling

September 22, 2025

[Paper Review] MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

This paper aims to advance the fields of multimodal machine learning and LLMs through the MARS2 2025 Challenge.

#Review #Multimodal Reasoning #Large Language Models (LLMs) #Multimodal Large Language Models (MLLMs) #Visual Grounding #Visual Question Answering #Advertisement Video Analysis #Real-world Scenarios #Challenge Benchmark

September 18, 2025

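[Paper Review] GenExam: A Multidisciplinary Text-to-Image Exam

This work addresses the shortcoming that existing text-to-image (T2I) benchmarks lean toward general world knowledge or concept illustration and fall short of rigorous drawing-exam evaluation.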

#Review #Text-to-Image Generation #Multidisciplinary #Benchmark #Evaluation #AGI #Reasoning #Scoring System #Visual Question Answering

September 18, 2025

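[Paper Review] Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

This paper aims to overcome the limitations of multimodal reasoning, a fundamental challenge in artificial intelligence. In particular, in scientific multimodal scenarios where even state-of-the-art models such as GPT-o3 struggle to integrate visual information, it seeks to bridge the gap between the visual and textual modalities and achieve robust reasoning performance.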

#Review #Multimodal Reasoning #Science AI #Caption-assisted Reasoning #SeePhys Challenge #Large Language Models #Visual Question Answering #Physics Problems #Cross-modal Alignment

September 17, 2025

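[Paper Review] Measuring Epistemic Humility in Multimodal Large Language Models

This paper aims to address the hallucination problem of multimodal large language models (MLLMs) and, in particular, to present a new benchmark that measures epistemic humility: a model's ability to admit that it does not know, rather than asserting incorrect information with confidence, when it is uncertain.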

#Review #Multimodal Large Language Models #Hallucination #Epistemic Humility #Benchmark #False-Option Rejection #Visual Question Answering #Scene Graph

September 16, 2025

[Paper Review] WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning

This paper aims to evaluate the real-world applicability of the symbolic music analysis and reasoning capabilities of Multimodal Large Language Models (MLLMs).

#Review #Multimodal Large Language Models #Symbolic Music Reasoning #Music Score Analysis #Benchmarking #Visual Question Answering #In-the-Wild Data #Music Theory

September 8, 2025

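[Paper Review] 'Does the cafe entrance look accessible? Where is the door?' Towards Geospatial AI Agents for Visual Inquiries

This paper points out that existing map systems, because they rely on structured GIS data, are limited in answering visual-spatial queries (e.g., 'Does the cafe entrance look accessible?', 'Where is the door and what does it look like?').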

#Review #Geospatial AI #Multimodal AI Agents #Visual Question Answering #Accessibility #Street View Imagery #Spatial Reasoning #Human-Computer Interaction

August 22, 2025

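[Paper Review] Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling

This study addresses the problem that existing vision-language models (VLMs) are constrained in parameter scale, lack robust self-correction, and perform poorly on document-based tasks that require long visual contexts and complex reasoning.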

#Review #Visual Document Understanding #Visual Question Answering #Multi-Agent System #Test-Time Scaling #Self-Correction #Mixed Reward Modeling #Large Language Models

August 8, 2025

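[Paper Review] Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

This paper aims to address problems in the knowledge-based visual question answering (KB-VQA) task that arise from the insufficient quality of multimodal queries and the insufficient relevance of retrieved results.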

#Review #Visual Question Answering #Retrieval-Augmented Generation #Multimodal AI #Reinforcement Learning #Knowledge Base #Tool Learning #Information Filtering

October 21, 2025

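[Paper Review] DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

To overcome the limitations existing MLLMs face in knowledge-intensive visual question answering (VQA), such as missing information, stale data, and inefficient search queries, the goal is to enable multimodal LLMs to perform on-demand, multi-turn web search and to dynamically generate and refine queries for both image and text search tools.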

#Review #Multimodal LLM #Web Search #Visual Question Answering #Reinforcement Learning #Image Cropping #Self-Correction #Tool Use

October 15, 2025

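[Paper Review] VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

Motivated by the observation that current vision-language model (VLM) benchmarks overestimate visual understanding in dense, high-resolution scenes, the goal is to present a new VQA benchmark that accurately evaluates a model's fine-grained visual understanding and complex reasoning.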

#Review #Visual Question Answering #Multimodal Models #Dense Scenes #Fine-Grained Perception #Benchmark #Error Analysis #Counting #OCR

October 1, 2025