#Action Prediction

5 posts

[Paper Review] RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

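This paper addresses the problem that modern VLA models rely on statistical shortcuts, either visual or between instructions and actions, rather than genuine semantic understanding during training. The authors point out that existing robot learning benchmarks use only simple instructions and therefore fail to verify a model's true semantic reasoning ability.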

#Review #Vision-Language-Action Models #Embodied AI #Semantic Grounding #Action Prediction #Robotics Benchmark #Instruction-following

June 1, 2026

[Paper Review] CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

The vision of intelligent agents automating complex desktop workflows has been held back by a shortage of continuous, high-quality human demonstration videos.

#Review #Computer-Use Agents #Video Demonstrations #Human Annotation #Desktop Applications #Visual Grounding #Action Prediction #Multi-layered Reasoning #Foundation Action Models

March 25, 2026

[Paper Review] VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

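This paper aims to overcome the limited generalization ability of existing VLA models in robot manipulation and to enable robust adaptation to new tasks, objects, and environments. In particular, it explores whether large-scale video generation models can be applied to robot manipulation to build a generalizable VLA manipulator.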

#Review #Robot Manipulation #Video Generation Models #Vision-Language-Action (VLA) #Diffusion Transformer #Generalization #Action Prediction #Visual Imagination

December 8, 2025

[Paper Review] GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents

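This paper aims to address three major gaps in research on desktop computer-using agents (CUAs): a shortage of real-world CUA tasks, the absence of automated data collection and annotation pipelines, and the lack of a unified benchmark.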

#Review #Computer-Using Agents #GUI Grounding #Screen Parsing #Action Prediction #Desktop Automation #Dataset #Benchmark #Multimodal Learning #LLM-augmented Data

November 9, 2025

[Paper Review] Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

This paper aims to overcome the limitation that existing VLA (Vision-Language-Action) models either handle vision generation and action prediction separately or rely on external experts.

#Review #Vision-Language-Action (VLA) #Diffusion Models #Discrete Denoising #Multimodal Learning #Robotics #Embodied AI #Joint Generation #Action Prediction

November 9, 2025