#Spatial Understanding

6개의 포스트

[논문리뷰] Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Multimodal Large Language Models (MLLMs)는 Vision과 Language를 연결하는 데 상당한 발전을 이루었지만, 공간 이해와 시점 인지(viewpoint-aware) 추론 능력은 여전히 부족합니다.

#Review #Vision-Language Models #3D Reasoning #Language-based Localization #Spatial Understanding #Situation Modeling #Global Layout Reconstruction #Monocular Video

2026년 3월 19일

[논문리뷰] Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

최근 MLLMs 는 비디오-기반 Supervised Fine-tuning (Video-SFT) 을 통해 시각적 이해 능력을 크게 발전시켜왔습니다. 그러나 Video-SFT 가 시각적 능력의 미세한 진화, 특히 공간적 이해와 시간적 이해 사이의 균형에 미치는 영향은 아직 제대로 연구되지 않았습니다.

#Review #Multimodal Large Language Models (MLLMs)#Video-SFT #Temporal Trap #Spatial Understanding #Temporal Budget #Hybrid-Frame Strategy #Negative Transfer

2026년 3월 18일

[논문리뷰] Enhancing Spatial Understanding in Image Generation via Reward Modeling

본 연구는 복잡한 공간 관계가 포함된 텍스트 프롬프트에서 현재 Text-to-Image(T2I) 모델 이 직면하는 한계를 해결하고, 생성된 이미지의 공간적 정확도를 향상시키는 것을 목표로 합니다.

#Review #Image Generation #Reward Modeling #Spatial Understanding #Reinforcement Learning #Visual Language Models #Text-to-Image #Preference Learning

2026년 3월 1일

[논문리뷰] MiMo-Embodied: X-Embodied Foundation Model Technical Report

이 논문은 자율 주행(Autonomous Driving)과 인공지능(Embodied AI) 두 가지 핵심 도메인을 단일 모델 로 통합하는 최초의 오픈소스 크로스-엠바디드 파운데이션 모델(MiMo-Embodied) 을 개발하는 것을 목표로 합니다.

#Review #Vision-Language Model (VLM)#Embodied AI #Autonomous Driving #Foundation Model #Multimodal Learning #Task Planning #Affordance Prediction #Spatial Understanding #Reinforcement Learning

2025년 11월 20일

[논문리뷰] Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

대규모 시각-언어 모델(LVLM)의 공간 이해 능력 부족 이라는 한계를 해결하는 것을 목표로 합니다.

#Review #Self-supervised learning #Reinforcement Learning #Spatial Understanding #Vision-Language Models #Pretext Tasks #RGB-D Images #Spatial Reasoning

2025년 11월 9일

[논문리뷰] LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation

본 논문은 복잡한 실제 시나리오를 시뮬레이션하는 고충실도 3D 가상 환경 을 생성하는 데 초점을 맞추어, sim-to-real 격차 를 줄이고 풍부한 데이터를 효율적으로 수집하는 것을 목표로 합니다.

#Review #Multimodal LLM #3D World Generation #Unreal Engine 5 #Procedural Content Generation #Interactive Environments #Sim-to-Real #Spatial Understanding #Multimodal Input

2025년 9월 8일