#Active Perception

7 posts

[Paper Review] Native Active Perception as Reasoning for Omni-Modal Understanding

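This paper addresses the compute and performance limits of existing passive Long Video Understanding models. Prior work either processes the entire video uniformly or relies on a global pre-scan, so it suffers from a persistent bottleneck: computational cost grows linearly with video length.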

#Review #Omni-modal Understanding #Active Perception #POMDP #Agentic Reasoning #Test-time Scaling #TAURA #Reinforcement Learning

June 17, 2026

[Paper Review] ActiveMimic: Egocentric Video Pretraining with Active Perception

This paper identifies the absence of active perception information as the core cause of the performance degradation that occurs when large-scale egocentric human video is used for robot learning.

#Review #Robot Manipulation #Egocentric Human Video #Active Perception #Robot Pretraining #Unified Action Representation

June 14, 2026

[Paper Review] OmniGAIA: Towards Native Omni-Modal AI Agents

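This work aims to develop and evaluate native omni-modal AI agents that go beyond current multimodal LLMs, which are limited to bimodal interaction. Like human intelligence, such agents perceive and reason in an integrated way across video, audio, and image modalities, and they use external tools.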

#Review #Omni-modal AI #Multi-modal Agents #Tool-Integrated Reasoning #Benchmark #Event Graph #Active Perception #Trajectory Synthesis #DPO

February 26, 2026

[Paper Review] SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

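This paper addresses the problem that the fixed inference pipeline of Vision-Language-Action (VLA) models accumulates errors in uncertain situations, such as perceptual ambiguity or multimodal action distributions.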

#Review #Vision-Language-Action Models #Self-Uncertainty Estimation #Adaptive Inference #Active Perception #Action Decoding #Visual Attention #Robotic Manipulation

February 10, 2026

[Paper Review] EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

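This paper aims to address three problems in deploying humanoid robots in real environments: their inherent instability, the difficulty of integrating perception, locomotion, and manipulation under partial information, and robust sub-task transitions in dynamic environments.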

#Review #Humanoid Robots #Vision-Language Models #Task Planning #Egocentric Control #Mobile Manipulation #Active Perception #Human-Robot Interaction #Real-World Deployment

February 4, 2026

[Paper Review] Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection

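This paper aims to overcome the Visual Question Answering (VQA) limitations of existing Vision-Language Models (VLMs), which are confined to static images, by training agents with ambulatory vision capabilities to actively select more informative viewpoints.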

#Review #Active Perception #Vision-Language Models (VLMs) #Embodied AI #View Selection #Reinforcement Learning (RL) #Supervised Fine-Tuning (SFT) #Visual Question Answering (VQA) #3D Environments

December 15, 2025

[Paper Review] AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

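This paper proposes a new task, Fine-grained 3D Embodied Reasoning: given a natural-language instruction in a 3D environment, identify an object's interactable elements (affordance elements) and predict a structured triplet consisting of each element's 3D mask, motion type, and motion axis direction.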

#Review #3D Embodied Reasoning #Multimodal Large Language Models (MLLMs) #Chain-of-Thought (CoT) #Affordance Grounding #Motion Estimation #View Synthesis #Active Perception

November 13, 2025