#Spatial Reasoning

59 posts

[Paper Review] MentalThink: Shaping Thoughts in Mental SVG World

This paper aims to address the weak visual grounding and hallucination problems of existing language-centric multimodal CoT.

#Review #Multimodal LLMs #Spatial Reasoning #Scalable Vector Graphics #Chain-of-Thought #Reinforcement Learning #Mental Imagery

July 7, 2026

[Paper Review] Thinking with Visual Grounding

This paper addresses the problem that the reasoning of existing vision-language models (VLMs) focuses on linguistic logic but does not specify the image regions that ground that logic, which makes the reasoning hard to verify.

#Review #Visually Grounded Thinking #Vision-Language Models #Reinforcement Learning #Visual Grounding #SAM3 #Spatial Reasoning

June 18, 2026

[Paper Review] AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

This paper points out that Multimodal Foundation Models (MFMs) have fundamental limitations in reasoning about the 3D space of the physical world.

#Review #AlloSpatial #Spatial Reasoning #Allocentric Cognitive Mapping #World2Mind #Spatial Reasoning Harness #Foundation Models #Reinforcement Learning

June 14, 2026

[Paper Review] SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

This paper points out that existing static VQA and simulator-dependent benchmarks are limited in evaluating the dynamic spatial reasoning ability of multimodal agents in real environments. Most prior work relies on privileged state information or uses interfaces tied to a specific environment, which makes it difficult to measure general spatial intelligence.

#Review #Spatial Reasoning #Multimodal Agents #Interactive Benchmark #Egocentric Vision #POMDP #Spatial Intelligence

June 8, 2026

[Paper Review] Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Existing VLMs are confined to the observed images, so they struggle to infer unseen layouts or to maintain spatial consistency across viewpoint changes. In limited egocentric observation settings in particular, there is a strong need to understand the scene from alternative viewpoints, but current models cannot actively do so.

#Review #Vision-Language Models #Spatial Reasoning #World Simulator #Reinforcement Learning #View Consistency #Agentic Reasoning

June 7, 2026

[Paper Review] SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

This paper questions whether the numerical outputs that VLMs produce in embodied environments (e.g., action magnitude, spatial coordinate) are actually grounded in real spatial information.

#Review #Vision-Language Models #Spatial Numerical Understanding #Spatial Exploration #Spatial Reasoning #Metric Grounding #Num2Space #Space2Num

June 7, 2026

[Paper Review] MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding

This paper aims to address the failure of general-purpose Multimodal Large Language Models (MLLMs) to properly interpret the complexity and domain specificity of mechanical engineering drawings.

#Review #Multimodal Large Language Models #Mechanical Drawing Understanding #Visual Question Answering #Spatial Reasoning #Reinforcement Learning #Domain-Specialized Benchmark

June 4, 2026

[Paper Review] Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

This paper identifies the lack of a framework for systematically training and evaluating Wide-Baseline Matching, an ability that MLLMs need in order to perform complex spatial reasoning in physical environments.

#Review #Multimodal Large Language Models #Spatial Reasoning #Wide-Baseline Matching #Reinforcement Learning #Curriculum Learning #Vision-Language Benchmarks

June 3, 2026

[Paper Review] Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

This paper points out that existing spatial reasoning benchmarks rely on the unrealistic assumption that visual observations are always sufficient and reliable.

#Review #Vision-Language Models #Spatial Reasoning #Observational Uncertainty #Abstention #Occlusion #Perspective Ambiguity #Embodied AI

May 31, 2026

[Paper Review] AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

This work addresses the 'spatial blindness' and 'modality mismatch' problems that arise when existing VLM agents carry out long-horizon spatial tasks.

#Review #VLM Agents #Visual Skill Memory #Reinforcement Learning #Reward Shaping #Spatial Reasoning #Self-Evolving

May 18, 2026

[Paper Review] Unlocking Dense Metric Depth Estimation in VLMs

This paper starts from the core problem that existing VLMs excel at 2D tasks but remain limited in 3D understanding. Prior work either distills knowledge from external 3D expert models or trains only on text, which leaves precise geometric information lacking and lets errors accumulate.

#Review #Vision-Language Models #Dense Metric Depth Estimation #3D Geometry #Unified Supervision #Spatial Reasoning

May 17, 2026

[Paper Review] PanoWorld: Towards Spatial Supersensing in 360° Panorama World

Existing MLLMs rely on a perspective-image paradigm that resembles the human field of view, which limits their ability to understand 360° environments.

#Review #Multimodal Large Language Models #Panorama #Equirectangular Projection #Spatial Reasoning #Spatial Supersensing #Instruction Tuning

May 14, 2026

[Paper Review] SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

This paper aims to address the cost of data annotation and the limitations of model-consensus-based training in learning 3D spatial reasoning.

#Review #Spatial Reasoning #Self-Evolution #Vision-Language Models #Deterministic Geometric Environment #Reinforcement Learning

April 15, 2026

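[Paper Review] Token Warping Helps MLLMs Look from Nearby Viewpoints

This paper proposes a Token Warping framework that uses tokens as the unit of transformation, and shows that Backward Token Warping in particular performs best in terms of stability and semantic consistency. It demonstrates that MLLM tokens are robust to positional noise and, on that basis, applies a token-based back-projection technique for viewpoint changes.

#Review #Multimodal Large Language Models #Token Warping #Viewpoint-Aware Reasoning #Spatial Reasoning #Mental Imagery

April 5, 2026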

[Paper Review] Make Geometry Matter for Spatial Reasoning

Through extensive training, recent VLMs have improved their general visual understanding, but they still show limitations in Spatial Reasoning, that is, understanding object relations and motion in 3D space.

#Review #Vision-Language Models #Spatial Reasoning #Geometry Tokens #Token Masking #Gated Routing

March 30, 2026

[Paper Review] Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

Existing Multimodal Large Language Models (MLLMs) are overly anchored to 2D visual signals and fail to build structured abstractions of 3D environments, so they struggle with 3D spatial reasoning.

#Review #Multimodal Large Language Models (MLLMs) #Spatial Reasoning #Textual Representation #Allocentric Context #Egocentric Video #Prompting Methods #VSI-Bench #OST-Bench

March 25, 2026

[Paper Review] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Recent Multimodal Large Language Models (MLLMs) show impressive semantic capability, but suffer from 'spatial blindness' in fine-grained geometric reasoning and physical dynamics.

#Review #Video Generation Models #3D Priors #Scene Understanding #Spatial Reasoning #Multimodal Large Language Models (MLLMs) #Latent World Simulator #Adaptive Gated Fusion #Generative AI

March 19, 2026

[Paper Review] Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

The core goal of this paper is to present Holi-Spatial, a pipeline that automatically converts raw video streams into large-scale holistic 3D spatial intelligence data without manual intervention.

#Review #3D Spatial Intelligence #Video Stream Processing #Automated Data Curation #3D Gaussian Splatting (3DGS) #Vision-Language Models (VLMs) #Open-Vocabulary Segmentation #Spatial Reasoning #Multimodal Datasets

March 9, 2026

[Paper Review] Utonia: Toward One Encoder for All Point Clouds

The core goal of this paper is to process point clouds from diverse domains, including remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and video-lifted point clouds, in a unified way with a single encoder.

#Review #Point Clouds #Self-supervised Learning #Multi-domain Learning #Foundation Model #Point Transformer #Representation Learning #Robotics #Spatial Reasoning

March 3, 2026

[Paper Review] JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

The paper aims to address the difficulty that existing 2D-centric AV-LLMs, which rely on RGB video and mono audio, have with sound source localization and spatial reasoning in 3D environments.

#Review #3D Audio-Visual Learning #Spatial Grounding #Spatial Reasoning #Large Language Models (LLMs) #Ambisonics #RGB-D #Simulated Environments #Neural Intensity Vector

February 25, 2026

[Paper Review] Learning Situated Awareness in the Real World

This paper aims to address the problem that existing multimodal foundation model (MFM) benchmarks focus only on environment-centric spatial relations and overlook observer-centric situated awareness, which depends on the agent's viewpoint, pose, and motion.

#Review #Situated Awareness #Egocentric Vision #Spatial Reasoning #Multimodal Foundation Models #Video Understanding #Benchmark #Real-world Data

February 18, 2026

[Paper Review] BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models

The goal is to address the fact that existing robot manipulation benchmarks are mostly limited to single-arm manipulation and fail to capture the complexities essential to bimanual manipulation, such as spatio-temporal coordination, dynamic role assignment, and self-collision avoidance.

#Review #Bimanual Manipulation #MLLMs #Robotics Benchmark #Spatial Reasoning #Action Planning #End-Effector Control #Embodied AI #Multimodal LLMs

February 18, 2026

[Paper Review] Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

The paper aims to address the failure of current Text-to-Image (T2I) models to handle complex spatial relations (spatial perception, reasoning, and interaction), and to overcome the inadequacy of existing benchmarks built on short, low-information-density prompts.

#Review #Text-to-Image Models #Spatial Intelligence #Benchmark #Evaluation #Prompt Engineering #Multimodal LLMs #Fine-tuning #Spatial Reasoning

January 29, 2026

[Paper Review] World Craft: Agentic Framework to Create Visualizable Worlds via Text

This paper aims to let non-experts without programming skills easily create executable and visualizable AI Town environments from text descriptions.

#Review #Generative Agents #AI Town #LLM #Environment Creation #Multi-agent System #Spatial Reasoning #Text-to-World #Reverse Synthesis

January 27, 2026

[Paper Review] Think3D: Thinking with Space for Spatial Reasoning

The paper aims to address the limitations of existing Vision-Language Models (VLMs) in achieving genuine 3D spatial reasoning beyond 2D perception and in building consistent spatial representations.

#Review #Spatial Reasoning #3D Reconstruction #VLM Agents #Tool Calling #Reinforcement Learning #Novel View Synthesis #Iterative Exploration

January 20, 2026

[Paper Review] SpatialTree: How Spatial Abilities Branch Out in MLLMs

The paper aims to address the problem that the hierarchical structure of spatial abilities in multimodal large language models (MLLMs) is poorly understood and studied only in a fragmented way.

#Review #Spatial Intelligence #Multimodal LLMs #Cognitive Hierarchy #Benchmark #Reinforcement Learning #Supervised Fine-tuning #Spatial Reasoning

December 23, 2025

[Paper Review] N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models

This work aims to address the limitation that existing multimodal models rely on 2D images and therefore lack 3D spatial understanding.

#Review #3D Grounding #Spatial Reasoning #Vision-Language Models #Depth Estimation #3D Object Detection #Chain-of-Thought #Data Generation #Multimodal AI

December 18, 2025

[Paper Review] MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

This paper aims to address the absence of a comprehensive benchmark for evaluating video-based spatial intelligence, which MLLMs (Multi-modal Large Language Models) need in order to serve as general-purpose assistants in physical environments.

#Review #Video-Based Spatial Intelligence #MLLM Benchmark #Spatial Reasoning #Multi-Modal Learning #Perception #Planning #Prediction #Cross-Video Reasoning #Human-AI Gap

December 17, 2025

[Paper Review] From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

This paper introduces the concept of MiSI (Microscopic Spatial Intelligence), the ability to perceive and reason about the spatial relations of entities too small to see (atoms, molecules), and aims to evaluate the potential of Vision-Language Models (VLMs) in this domain.

#Review #Vision-Language Models #Microscopic Spatial Intelligence #Molecular Structures #Benchmarking #PDBbind Dataset #Spatial Reasoning #Drug Discovery

December 11, 2025

[Paper Review] COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

This work aims to address the difficulty existing MLLMs have with 3D spatial reasoning and understanding object attributes. It explores whether a single unified MLLM can intrinsically enhance its spatial perception and achieve stronger spatial intelligence through adaptive interleaved reasoning.

#Review #Multimodal Large Language Models (MLLMs) #Spatial Reasoning #Perception Enhancement #Auxiliary Modalities #Adaptive Interleaved Reasoning #Reinforcement Learning #Chain-of-Thought

December 7, 2025

[Paper Review] SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization

The paper aims to address the poor navigation performance of existing LVLM (Large Vision-Language Model) based VLN (Vision-Language Navigation) agents, which is caused by perception, reasoning, and planning errors.

#Review #Vision-Language Navigation #Large Vision-Language Models #Visual Prompt #Reinforcement Fine-Tuning #Policy Optimization #Embodied AI #Spatial Reasoning #Perception Errors

December 4, 2025

[Paper Review] SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

This paper aims to enable vision-language models (VLMs) to acquire the precise spatial reasoning ability required for real-world robotics applications.

#Review #Spatial Reasoning #Vision Language Models #Reinforcement Learning #Tool Augmentation #Robotics #Multi-Tool Use #Embodied AI

December 3, 2025

[Paper Review] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

This paper explores whether video generation models can exhibit human-like visuospatial intelligence using only visual data (video context).

#Review #Video Generation #Spatial Reasoning #Visuospatial Intelligence #Diffusion Models #Context-Guided Generation #Scene Navigation #Object Grounding #Out-of-Domain Generalization

December 2, 2025

[Paper Review] Geometrically-Constrained Agent for Spatial Reasoning

This paper aims to address the semantic-to-geometric gap that Vision Language Models (VLMs) face in spatial reasoning.

#Review #Spatial Reasoning #Vision Language Models (VLMs) #Geometric Constraints #Agentic AI #Tool Integration #Semantic-to-Geometric Gap #Task Formalization

November 30, 2025

[Paper Review] Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

This paper addresses the absence of a comprehensive benchmark for systematically evaluating the reasoning abilities of video models, in particular reasoning through video generation.

#Review #Video Models #Spatial Reasoning #Maze Solving #Video Generation #Benchmark #Supervised Fine-tuning #Test-Time Scaling #Multimodal Reasoning

November 19, 2025

[Paper Review] Error-Driven Scene Editing for 3D Grounding in Large Language Models

This paper aims to address the failure of current 3D-LLMs to accurately ground language in the visual and spatial elements of 3D environments.

#Review #3D Grounding #3D-LLMs #Scene Editing #Counterfactual Augmentation #Error-Driven Learning #Spatial Reasoning #Visual Grounding

November 18, 2025

[Paper Review] Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries

This work explores whether RL post-training can extend the intrinsic reasoning boundaries of existing VLMs, particularly on vision-centric spatial reasoning tasks. To this end, it introduces Ariadne, a framework with precisely controllable difficulty, to systematically probe VLM reasoning behavior and push its limits.

#Review #Vision-Language Models (VLMs) #Reinforcement Learning (RL) #Spatial Reasoning #Controllable Framework #RLVR #GRPO #Maze Navigation #Generalization Boundaries

November 10, 2025

[Paper Review] Visual Spatial Tuning

This paper aims to address the limited ability of existing Vision-Language Models (VLMs) to capture spatial relations from visual information.

#Review #Vision-Language Models #Spatial Reasoning #Spatial Perception #Dataset Creation #Reinforcement Learning #Visuospatial AI #Robotics

November 9, 2025

[Paper Review] Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

The paper aims to overcome the limitations of static images and the separation of modalities in the existing 'Thinking with Text' and 'Thinking with Images' paradigms.

#Review #Video Generation #Multimodal Reasoning #Temporal Understanding #Spatial Reasoning #Foundation Models #AI Benchmarking #In-Context Learning #Self-Consistency

November 9, 2025

[Paper Review] SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

The paper aims to address the difficulty multimodal large language models (MLLMs) have with spatio-temporal reasoning in video.

#Review #Spatial Reasoning #Video Understanding #Simulated Data #Instruction Tuning #Multimodal LLMs #Sim-to-Real Transfer #AI2-THOR

November 9, 2025

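[Paper Review] RiddleBench: A New Generative Reasoning Benchmark for LLMs

The paper aims to address the limitations of existing benchmarks in evaluating the flexible, multifaceted reasoning abilities of large language models (LLMs), such as logical deduction, spatial awareness, and constraint satisfaction, which are core components of human intelligence.

#Review #LLM Reasoning #Generative AI #Benchmark #Logical Deduction #Spatial Reasoning #Constraint Satisfaction #Hallucination Cascade #Self-Correction

November 9, 2025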

[Paper Review] LTD-Bench: Evaluating Large Language Models by Letting Them Draw

The paper aims to address the problem that current LLM evaluation relies on abstract numerical scores that mask fundamental limitations in spatial reasoning and fail to give an intuitive understanding of model capabilities.

#Review #LLM Evaluation #Spatial Reasoning #Benchmark #Generative AI #Visual Perception #Spatial Imagination #Code Generation

November 9, 2025

[Paper Review] Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

This paper aims to evaluate and improve the 3D spatial reasoning ability of state-of-the-art Multimodal Large Language Models (MLLMs).

#Review #Multimodal LLMs #Spatial Reasoning #Viewpoint Learning #Two-Stage Fine-tuning #3D Consistency #Viewpoint-100K #Reinforcement Learning

November 9, 2025

[Paper Review] Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

The paper aims to address the lack of spatial understanding in large vision-language models (LVLMs).

#Review #Self-supervised learning #Reinforcement Learning #Spatial Understanding #Vision-Language Models #Pretext Tasks #RGB-D Images #Spatial Reasoning

November 9, 2025

[Paper Review] Visual Jigsaw Post-Training Improves MLLMs

This paper aims to address the problem that the text-centric post-training paradigm of existing MLLMs (Multimodal Large Language Models) undervalues fine-grained understanding of visual signals.

#Review #MLLMs #Post-training #Self-supervised Learning #Visual Understanding #Jigsaw Puzzles #RLVR #Multimodal Perception #Spatial Reasoning

September 30, 2025

[Paper Review] MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

The paper aims to automatically generate realistic, task-relevant 3D tabletop scenes for robotic manipulation tasks. It seeks to overcome the inefficiency and low realism of existing manual or random scene generation and to bridge the large gap between high-level task instructions and 3D scene layouts.

#Review #3D Scene Generation #Robotic Manipulation #Large Language Models #Spatial Reasoning #Dataset #Direct Preference Optimization #Tabletop Scene

September 29, 2025

[Paper Review] PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era

This paper aims to unlock the potential of omnidirectional vision, which has lagged behind conventional pinhole vision in research, and to address key problems such as data bottlenecks, limited model capabilities, and application gaps, in order to achieve comprehensive environmental perception in the Embodied AI era.

#Review #Omnidirectional Vision #Embodied AI #Panoramic Perception #Multi-modal Learning #Dataset Development #Robot Navigation #Spatial Reasoning #System Architecture

September 18, 2025

[Paper Review] 3D Aware Region Prompted Vision Language Model

This paper proposes SR-3D, a 3D-aware Vision-Language Model (VLM) that connects single-view 2D images and multi-view 3D data through a shared visual token space, with the goal of providing flexible and accurate spatial reasoning in complex 3D scenes.

#Review #3D Vision #Vision-Language Models #Spatial Reasoning #Region Prompting #Multi-view Learning #Depth Estimation #Unified Representation #Generative AI

September 17, 2025

[Paper Review] OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning

This paper aims to address two core limitations of existing MLLM-based embodied systems: the Geometric Adaptability Gap (insufficient 3D information for diverse spatial requirements) and the Embodiment Constraint Gap (disregard for the physical constraints of real robots).

#Review #Embodied AI #Multimodal LLMs #3D Grounding #Task-Adaptive Reasoning #Embodiment-Aware Planning #Robotics #Spatial Reasoning

September 12, 2025

[Paper Review] Visual Representation Alignment for Multimodal Large Language Models

This paper aims to address the limited performance of multimodal large language models (MLLMs) trained with visual instruction tuning on vision-centric tasks such as object counting and spatial reasoning.

#Review #Multimodal LLMs #Visual Representation Alignment #Foundation Models #Regularization #Fine-grained Visual Understanding #Spatial Reasoning #Object Counting #Vision-Language Models

September 10, 2025

[Paper Review] 'Does the cafe entrance look accessible? Where is the door?' Towards Geospatial AI Agents for Visual Inquiries

This paper points out that existing map systems rely on structured GIS data and are therefore limited in answering visual-spatial queries (e.g., 'Does the cafe entrance look accessible?', 'Where is the door and what does it look like?').

#Review #Geospatial AI #Multimodal AI Agents #Visual Question Answering #Accessibility #Street View Imagery #Spatial Reasoning #Human-Computer Interaction

August 22, 2025

[Paper Review] RynnEC: Bringing MLLMs into Embodied World

The core goal of this paper is to overcome the limited foundational visual perception ability that has kept existing Multi-modal Large Language Models (MLLMs) from understanding the real physical world.

#Review #Multi-modal Large Language Models #Embodied AI #Embodied Cognition #Video Understanding #Instance Segmentation #Spatial Reasoning #Robotics

August 21, 2025

[Paper Review] Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents

This paper aims to address overfitting in reinforcement learning (RL) models, which prevents visuomotor agents from acquiring behaviors that generalize across diverse environments.

#Review #Reinforcement Learning #Multi-Task Learning #Visuomotor Agents #Spatial Reasoning #Generalization #Minecraft #Cross-View Goal Specification #Automated Task Synthesis

August 2, 2025

[Paper Review] Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

This work aims to comprehensively evaluate how ready state-of-the-art video generation models, Veo-3 in particular, are to serve as zero-shot reasoners in complex visual reasoning scenarios.

#Review #Video Generation Models #Zero-Shot Reasoning #Visual Reasoning #MME-COF Benchmark #Chain-of-Frame Reasoning #Temporal Coherence #Spatial Reasoning

October 31, 2025

[Paper Review] Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

This paper aims to systematically review the state of research on bringing human-like multimodal spatial reasoning to large models (MLLMs) and to present open benchmarks for advancing the field.

#Review #Multimodal Large Language Models #Spatial Reasoning #Survey #Benchmarks #3D Vision #Embodied AI #Vision-Language Navigation

October 30, 2025

[Paper Review] Reasoning in Space via Grounding in the World

The goal is to address the problem that existing 3D LLMs cannot seamlessly integrate 3D visual grounding and spatial reasoning because they lack a unified 3D representation and depend on external modules. The work explores how an LLM can perform natural and effective grounding in an autoregressive manner to improve its spatial reasoning.

#Review #3D Visual Grounding #Spatial Reasoning #Large Language Models (LLMs) #Chain-of-Thought (CoT) #Hybrid Representation #Multi-modal LLMs #Point Clouds

October 16, 2025

[Paper Review] Detect Anything via Next Point Prediction

This paper aims to address problems in MLLM (Multimodal Large Language Model) based object detection, such as low recall, duplicate predictions, and coordinate misalignment, and to achieve zero-shot object perception performance on par with or better than existing regression-based models.

#Review #Multimodal Large Language Models #Object Detection #Coordinate Prediction #Reinforcement Learning #Supervised Fine-tuning #Visual Perception #Zero-shot Learning #Spatial Reasoning

October 15, 2025

[Paper Review] Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

The paper aims to overcome the limitation of prior approaches that treat camera-centric scene understanding and generation as separate problems, and to unify them in a single multimodal model.

#Review #Unified Multimodal Model #Camera-Centric #Image Understanding #Image Generation #Spatial Reasoning #Camera Parameters #Instruction Tuning #Multimodal Spatial Intelligence

October 13, 2025

[Paper Review] SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

This paper aims to overcome the limitations of existing spatial reasoning models, which rely on indoor 3D scans and manual annotation and overfit to individual scenes, and to advance All-Scale Visual Spatial Reasoning across every scale from mm to km.

#Review #Spatial Reasoning #Multi-Scale Vision #MLLM #Dataset #Scale Experts #Reinforcement Learning #Computer Vision #Robotics

October 13, 2025