#Multimodal LLMs

83 posts

[Paper Review] HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

[Paper Review] Toward Native Multimodal Modeling: A Roadmap

[Paper Review] OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

[Paper Review] Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

[Paper Review] Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

[Paper Review] PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

[Paper Review] MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models

[Paper Review] MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

[Paper Review] Imagination Helps Visual Reasoning, But Not Yet in Latent Space

[Paper Review] BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models

[Paper Review] BrowseComp-V^3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

[Paper Review] Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

[Paper Review] Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

[Paper Review] AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

[Paper Review] AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

[Paper Review] FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

[Paper Review] Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey

[Paper Review] DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

[Paper Review] A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

[Paper Review] CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

[Paper Review] OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

[Paper Review] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

[Paper Review] Step-GUI Technical Report

[Paper Review] OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

[Paper Review] Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

[Paper Review] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

[Paper Review] OneThinker: All-in-one Reasoning Model for Image and Video

[Paper Review] LongVT: Incentivizing 'Thinking with Long Videos' via Native Tool Calling

[Paper Review] SO-Bench: A Structural Output Evaluation of Multimodal LLMs

[Paper Review] Step-Audio-R1 Technical Report

[Paper Review] VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models

[Paper Review] Benchmark Designers Should 'Train on the Test Set' to Expose Exploitable Non-Visual Shortcuts

[Paper Review] MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

[Paper Review] ChartM^3: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension

[Paper Review] TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

[Paper Review] VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

[Paper Review] TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

[Paper Review] OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning

[Paper Review] Visual Representation Alignment for Multimodal Large Language Models

[Paper Review] Reinforced Visual Perception with Tools

[Paper Review] Kwai Keye-VL 1.5 Technical Report

[Paper Review] HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

[Paper Review] Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs

[Paper Review] MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

[Paper Review] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

[Paper Review] Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks

[Paper Review] Directional Reasoning Injection for Fine-Tuning MLLMs

[Paper Review] DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents

[Paper Review] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
