[Paper Review] ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
A detailed review of the paper 'ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning', posted on arXiv by Shizhu He.
#Review #Multimodal Large Language Models (MLLMs) #Input-side Adaptation #Contextual Bandit #Cost-Aware Policy Optimization (CAPO) #Visual Budgeting #Efficient Inference #Temporal Reasoning
March 30, 2026
[Paper Review] Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
A detailed review of the paper 'Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning', posted on arXiv.
#Review #Multimodal Large Language Models (MLLMs) #Spatial Reasoning #Textual Representation #Allocentric Context #Egocentric Video #Prompting Methods #VSI-Bench #OST-Bench
March 25, 2026
[Paper Review] UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
A detailed review of the paper 'UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience', posted on arXiv by Yiming Gao.
#Review #GUI Agent #Self-Evolving Learning #Rejection Fine-Tuning (RFT) #Group Relative Self-Distillation (GRSD) #Credit Assignment #Sparse Rewards #Mobile Automation #Multimodal Large Language Models (MLLMs)
March 25, 2026
[Paper Review] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
A detailed review of the paper 'Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding', posted on arXiv.
#Review #Video Generation Models #3D Priors #Scene Understanding #Spatial Reasoning #Multimodal Large Language Models (MLLMs) #Latent World Simulator #Adaptive Gated Fusion #Generative AI
March 19, 2026
[Paper Review] Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
A detailed review of the paper 'Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding', posted on arXiv by Junnan Dong.
#Review #Multimodal Large Language Models (MLLMs) #Discrete Symbols #Cognitive Mismatch #Symbol Understanding #Benchmark #Recognition-Reasoning Inversion #Human Cognition
March 19, 2026
[Paper Review] Video-CoE: Reinforcing Video Event Prediction via Chain of Events
A detailed review of the paper 'Video-CoE: Reinforcing Video Event Prediction via Chain of Events', posted on arXiv.
#Review #Video Event Prediction (VEP) #Multimodal Large Language Models (MLLMs) #Chain of Events (CoE) #Logical Reasoning #Visual Grounding #Reinforcement Learning (RL) #Supervised Fine-Tuning (SFT)
March 18, 2026
[Paper Review] Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models
A detailed review of the paper 'Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models', posted on arXiv by Song Dai.
#Review #Multimodal Large Language Models (MLLMs) #Video-SFT #Temporal Trap #Spatial Understanding #Temporal Budget #Hybrid-Frame Strategy #Negative Transfer
March 18, 2026
[Paper Review] CodePercept: Code-Grounded Visual STEM Perception for MLLMs
A detailed review of the paper 'CodePercept: Code-Grounded Visual STEM Perception for MLLMs', posted on arXiv.
#Review #Multimodal Large Language Models (MLLMs) #STEM Visual Reasoning #Code-Grounded Perception #Image-to-Code Translation #Data Generation #Benchmark #Reinforcement Learning #Matplotlib
March 11, 2026
[Paper Review] Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions
A detailed review of the paper 'Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions', posted on arXiv.
#Review #Video Understanding #Multimodal Large Language Models (MLLMs) #Instruction Tuning #Data Curation #Attribute-Structured Data #Quality Verification #Temporal Grounding #Video Captioning
February 15, 2026
[Paper Review] Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
A detailed review of the paper 'Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models', posted on arXiv by Hanzhen Zhao.
#Review #Multimodal Large Language Models (MLLMs) #Modality Gap #Subspace Alignment #Unpaired Data #Representation Learning #Pretraining #Geometric Alignment
February 9, 2026
[Paper Review] Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
A detailed review of the paper 'Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding', posted on arXiv by Runtao Liu.
#Review #Multimodal Large Language Models (MLLMs) #Visual Degradation #Robustness #Reasoning Chains #Supervised Fine-Tuning (SFT) #Reinforcement Learning (RL) #Degradation-Aware Reasoning #Interpretability
December 21, 2025
[Paper Review] Exploring MLLM-Diffusion Information Transfer with MetaCanvas
A detailed review of the paper 'Exploring MLLM-Diffusion Information Transfer with MetaCanvas', posted on arXiv.
#Review #Multimodal Large Language Models (MLLMs) #Diffusion Models #Image Generation #Video Generation #Image Editing #Video Editing #Latent Space Planning #Canvas Tokens #Information Transfer
December 14, 2025
[Paper Review] IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
A detailed review of the paper 'IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting', posted on arXiv.
#Review #Multimodal Large Language Models (MLLMs) #Infrared Image Understanding #Benchmark Dataset #Visual Question Answering (VQA) #Generative Visual Prompting (GenViP) #Domain Adaptation #Image-to-Image Translation
December 10, 2025
[Paper Review] Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
A detailed review of the paper 'Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs', posted on arXiv.
#Review #Multimodal Large Language Models (MLLMs) #Cross-Modal Consistency #Reasoning Inconsistency #OCR Performance #Modality Gap #Benchmarking #Render Equivalence
December 9, 2025
[Paper Review] COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence
A detailed review of the paper 'COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence', posted on arXiv by Jiawei Sheng.
#Review #Multimodal Large Language Models (MLLMs) #Spatial Reasoning #Perception Enhancement #Auxiliary Modalities #Adaptive Interleaved Reasoning #Reinforcement Learning #Chain-of-Thought
December 7, 2025
[Paper Review] DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
A detailed review of the paper 'DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation', posted on arXiv by Ziyu Guo.
#Review #Text-to-Image Generation #Chain-of-Thought (CoT) #Multimodal Large Language Models (MLLMs) #Visual Planning #Rare Concept Generation #Drafting #Classifier-Free Guidance (CFG) #Image Refinement
December 4, 2025
[Paper Review] Adversarial Confusion Attack: Disrupting Multimodal Large Language Models
A detailed review of the paper 'Adversarial Confusion Attack: Disrupting Multimodal Large Language Models', posted on arXiv by Artur Janicki.
#Review #Adversarial Attack #Multimodal Large Language Models (MLLMs) #Entropy Maximization #Confusion Attack #Black-box Transfer #PGD #AI Agent Safety
December 3, 2025
[Paper Review] Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models
A detailed review of the paper 'Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models', posted on arXiv.
#Review #Multimodal Large Language Models (MLLMs) #Token Pruning #Graph-Structured Pruning (GSP) #Query-Conditioned Semantic Pruning (QCSP) #Determinantal Point Processes (DPP) #Model Efficiency #Visual Redundancy
December 1, 2025
[Paper Review] Monet: Reasoning in Latent Visual Space Beyond Images and Language
A detailed review of the paper 'Monet: Reasoning in Latent Visual Space Beyond Images and Language', posted on arXiv by Pengfei Wan.
#Review #Latent Visual Reasoning #Multimodal Large Language Models (MLLMs) #Supervised Fine-tuning (SFT) #Reinforcement Learning (RL) #Visual-latent Policy Optimization (VLPO) #Chain-of-Thought (CoT) #Abstract Visual Thinking
November 26, 2025
[Paper Review] MedSAM3: Delving into Segment Anything with Medical Concepts
A detailed review of the paper 'MedSAM3: Delving into Segment Anything with Medical Concepts', posted on arXiv by Yi Lu.
#Review #Medical Image Segmentation #Segment Anything Model (SAM) #Promptable Concept Segmentation (PCS) #Multimodal Large Language Models (MLLMs) #Agentic AI #Domain Adaptation #Text-guided Segmentation
November 25, 2025
[Paper Review] AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
A detailed review of the paper 'AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models', posted on arXiv by Zhen Li.
#Review #3D Embodied Reasoning #Multimodal Large Language Models (MLLMs) #Chain-of-Thought (CoT) #Affordance Grounding #Motion Estimation #View Synthesis #Active Perception
November 13, 2025
[Paper Review] MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
A detailed review of the paper 'MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs', posted on arXiv.
#Review #Multimodal Large Language Models (MLLMs) #Multi-Video Understanding #Evaluation Benchmark #Video Perception #Video Reasoning #Sports Analytics #Autonomous Driving
November 10, 2025
[Paper Review] When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs
A detailed review of the paper 'When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs', posted on arXiv by Haotian Wang.
#Review #Multimodal Large Language Models (MLLMs) #Modality Following #Unimodal Uncertainty #Modality Preference #Conflict Resolution #Internal Mechanism #Entropy #Controllable Dataset
November 9, 2025
[Paper Review] Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
A detailed review of the paper 'Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence', posted on arXiv.
#Review #Video Reasoning #Multimodal Large Language Models (MLLMs) #Reinforcement Learning (RLVR) #Evidence Grounding #Multi-step Reasoning #Frame Retrieval #Dataset Construction #Progressive Learning
October 24, 2025
[Paper Review] ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
A detailed review of the paper 'ARGenSeg: Image Segmentation with Autoregressive Image Generation Model', posted on arXiv.
#Review #Image Segmentation #Autoregressive Generation #Multimodal Large Language Models (MLLMs) #Visual Understanding #VQ-VAE #Multi-scale Prediction #Referring Expression Segmentation #Image Generation
October 24, 2025
[Paper Review] ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
A detailed review of the paper 'ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution', posted on arXiv.
#Review #Multimodal Large Language Models (MLLMs) #Dynamic Resolution #Token Compression #Semantic Awareness #Visual Consistency Learning (ViCO) #Visual Resolution Router (ViR) #Inference Optimization
October 15, 2025
[Paper Review] ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
A detailed review of the paper 'ExpVid: A Benchmark for Experiment Video Understanding & Reasoning', posted on arXiv.
#Review #Experiment Video Understanding #Multimodal Large Language Models (MLLMs) #Scientific Reasoning #Benchmark #Wet-Lab Experiments #Procedural Understanding #Fine-grained Perception #Video QA
October 15, 2025
[Paper Review] PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
A detailed review of the paper 'PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs', posted on arXiv by Xu Zheng.
#Review #Multimodal Large Language Models (MLLMs) #Physical Tool Understanding #Benchmarking #Embodied AI #Visual Question Answering (VQA) #Tool Affordances #Reasoning
October 13, 2025
[Paper Review] Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
A detailed review of the paper 'Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs', posted on arXiv by Jingyi Liao.
#Review #Multimodal Large Language Models (MLLMs) #Visual Reference Tokens (VRTs) #Dense Prediction #Referring Expression Comprehension (REC) #Open-Vocabulary Detection (OVD) #Image Captioning #Unified Architecture #Autoregressive Generation
October 9, 2025
[Paper Review] EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark
A detailed review of the paper 'EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark', posted on arXiv by Tianwen Qian.
#Review #Egocentric Vision #Nighttime Conditions #Visual Question Answering (VQA) #Day-Night Alignment #Multimodal Large Language Models (MLLMs) #Depth Estimation #Correspondence Retrieval #Benchmark
October 8, 2025
[Paper Review] Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
A detailed review of the paper 'Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation', posted on arXiv.
#Review #Discrete Diffusion Models #Multimodal Large Language Models (MLLMs) #Medical Image Generation #Medical Report Generation #Multimodal Generation #Medical AI #Cross-modal Alignment
October 8, 2025
[Paper Review] CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding
A detailed review of the paper 'CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding', posted on arXiv.
#Review #Multimodal Large Language Models (MLLMs) #Radiology Report Generation (RRG) #Medical Hallucinations #Contrastive Decoding #Training-free Inference #Clinical AI #Visual Question Answering (VQA)
October 8, 2025
[Paper Review] Self-Improvement in Multimodal Large Language Models: A Survey
A detailed review of the paper 'Self-Improvement in Multimodal Large Language Models: A Survey', posted on arXiv by Yapeng Tian.
#Review #Multimodal Large Language Models (MLLMs) #Self-Improvement #Data Collection #Data Organization #Model Optimization #Survey #Reinforcement Learning #Direct Preference Optimization
October 6, 2025
[Paper Review] GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning
A detailed review of the paper 'GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning', posted on arXiv by Hou Pong Chan.
#Review #Multimodal Large Language Models (MLLMs) #Geometric Reasoning #Visual Perception #Reinforcement Learning (RL) #Two-stage Training #GeoPQA Benchmark #Perceptual Bottleneck
September 23, 2025
[Paper Review] MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
A detailed review of the paper 'MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook', posted on arXiv by Bowen Zhou.
#Review #Multimodal Reasoning #Large Language Models (LLMs) #Multimodal Large Language Models (MLLMs) #Visual Grounding #Visual Question Answering #Advertisement Video Analysis #Real-world Scenarios #Challenge Benchmark
September 18, 2025
[Paper Review] R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
A detailed review of the paper 'R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning', posted on arXiv by Han Hu.
#Review #Multimodal Large Language Models (MLLMs) #Auto-Thinking #Reinforcement Learning (RL) #Bi-mode Annealing #Bi-mode Policy Optimization (BPO) #General-Purpose AI #Reasoning #Efficiency
September 1, 2025
[Paper Review] OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
A detailed review of the paper 'OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation', posted on arXiv by Jiaqi Yang.
#Review #Video Avatar Generation #Cognitive Simulation #Multimodal Large Language Models (MLLMs) #Diffusion Transformers (DiT) #Multimodal Fusion #Human Motion Synthesis #Contextual Animation
August 27, 2025
[Paper Review] MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models
A detailed review of the paper 'MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models', posted on arXiv by Zhihan Zhou.
#Review #Multimodal Large Language Models (MLLMs) #Math Reasoning #Real-World Benchmark #Visual Perception #Robustness #K-12 Education #Dataset
August 14, 2025
[Paper Review] Reinforcement Learning in Vision: A Survey
A detailed review of the paper 'Reinforcement Learning in Vision: A Survey', posted on arXiv by Qingwei Meng.
#Review #Reinforcement Learning (RL) #Computer Vision (CV) #Multimodal Large Language Models (MLLMs) #Visual Generation #Vision-Language-Action (VLA) Models #Policy Optimization #Reward Modeling
August 12, 2025
[Paper Review] LaTCoder: Converting Webpage Design to Code with Layout-as-Thought
A detailed review of the paper 'LaTCoder: Converting Webpage Design to Code with Layout-as-Thought', posted on arXiv by Tianpeng Lv.
#Review #Design-to-Code #Webpage Generation #Multimodal Large Language Models (MLLMs) #Layout Preservation #Chain-of-Thought (CoT) #UI Automation #Code Generation
August 7, 2025