#RGB-D

3 posts

[Paper Review] Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

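This paper proposes TAB, a framework that breaks the 3D visual grounding (3D-VG) task into three stages: 'Think' (reasoning), 'Act' (tool invocation), and 'Build' (reconstruction). Rather than running a fixed pipeline, TAB lets a VLM agent actively invoke visual tools, guided by a specialized 3D-VG skill blueprint, to track the target and generate its mask.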

#Review #3D Visual Grounding #Vision-Language Models #Agentic Framework #RGB-D #Zero-Shot #Geometric Reconstruction

April 1, 2026

[Paper Review] JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

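This paper targets a weakness of existing 2D-centric AV-LLMs: because they rely on RGB video and mono audio, they struggle to localize sound sources and to reason about space in 3D environments.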

#Review #3D Audio-Visual Learning #Spatial Grounding #Spatial Reasoning #Large Language Models (LLMs) #Ambisonics #RGB-D #Simulated Environments #Neural Intensity Vector

February 25, 2026

[Paper Review] LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry

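This paper aims to resolve the latency and error accumulation of traditional modular navigation pipelines, and to overcome the limitation of existing end-to-end approaches that still depend on explicit localization.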

#Review #Autonomous Navigation #End-to-end Learning #Localization Grounded #Visual Geometry #Metric-aware Perception #Diffusion Policy #RGB-D

December 22, 2025