#Visual Question Answering

24 posts

[Paper Review] Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

[Paper Review] When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

[Paper Review] Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

[Paper Review] Toward Cognitive Supersensing in Multimodal Large Language Model

[Paper Review] UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture

[Paper Review] Jina-VLM: Small Multilingual Vision Language Model

[Paper Review] Scaling Spatial Intelligence with Multimodal Foundation Models

[Paper Review] MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

[Paper Review] Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

[Paper Review] Measuring Epistemic Humility in Multimodal Large Language Models

[Paper Review] 'Does the cafe entrance look accessible? Where is the door?' Towards Geospatial AI Agents for Visual Inquiries

[Paper Review] Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling

[Paper Review] DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

[Paper Review] Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

[Paper Review] VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
