#GUI Agents

30 posts

[Paper Review] KnowAct-GUIClaw: Know Deeply, Act Perfectly, Personal GUI Assistant with Self-Evolving Memory and Skill

This paper aims to resolve the structural limitations that existing OpenClaw-family agents face when automating complex tasks in GUI environments. Existing approaches lack cross-platform compatibility and have no mechanism for improving performance through continual learning, which makes it hard for them to adapt to diverse device environments.

#Review #GUI Agents #Personal Assistant #Self-Evolving Memory #Skill Library #Cross-Platform Interaction #POMDP #Task Decomposition

July 15, 2026

[Paper Review] Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

This paper raises the problem that modern MLLMs (Multimodal Large Language Models) excel at recognizing superficial visual cues, as in VideoQA, but fall short at acquiring deep procedural knowledge from video tutorials and generalizing it to complex downstream tasks.

#Review #VideoQA #Video-Guided Agent #Keyframe Extraction #In-Context Learning #GUI Agents #Procedural Knowledge #Temporal Reasoning

June 29, 2026

[Paper Review] GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

This paper points out that existing evaluations of Computer-Use agents confound the difference between the GUI and CLI interaction modalities with model capability, task environment, and the agent's control ability.

#Review #GUI Agents #CLI Agents #Computer-Use #Skill-Mediated #Execution Bottlenecks #Benchmark #Action Space #Visual Grounding

June 25, 2026

[Paper Review] Joint Agent Memory and Exploration Learning via Novelty Signals

This paper addresses the inability of LLM-based agents to explore open-ended environments efficiently. As the history of interaction with the environment grows longer, existing agents face enormous computational costs and memory storage problems from retaining the full history.

#Review #Agent Memory #Exploration #Novelty Signals #GUI Agents #Latency #Token Efficiency #Latent Memory

June 1, 2026

[Paper Review] CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

Existing GUI agents have made considerable progress on web navigation and simple OS tasks, but their ability to handle specialized creative workflows such as sophisticated media post-production has barely been tested.

#Review #GUI Agents #Media Post-Production #Benchmark #Multimodal #Long-Horizon #Grounding #Vibe Cutting

May 20, 2026

[Paper Review] OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

This paper aims to address the limitation that existing GUI agent benchmarks consist mainly of static screenshots and therefore cannot evaluate the dynamic audio and video processing capabilities required in real-time environments.

#Review #GUI Agents #Multimodal Benchmark #Smartphone Environments #Temporal Reasoning #Auditory Processing #Action Grounding

May 19, 2026

[Paper Review] MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

This paper identifies the problem that current GUI agents struggle to maintain task state across interface changes when carrying out long-horizon tasks.

#Review #GUI Agents #Multimodal Memory #Long-Horizon #Memory Control #MLLM #Working Memory #Episodic Memory

May 18, 2026

[Paper Review] PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

This study aims to resolve the fundamental problem that the 'region-tolerant' paradigm on which existing GUI agents mainly rely fails on precise geometric construction tasks.

#Review #GUI Agents #Geometric Reasoning #Precision-Sensitive #Dependency-Structured Planning #Pixel-Grounded Supervised Tuning #Reinforcement Learning #Semantic-Execution Gap

May 17, 2026

[Paper Review] MMSkills: Towards Multimodal Skills for General Visual Agents

This paper addresses the lack of Multimodal Procedural Knowledge that visual agents need to make successful decisions in complex environments.

#Review #Multimodal Agents #Procedural Knowledge #Visual Grounding #Branch Loading #GUI Agents #Skill Representation

May 17, 2026

[Paper Review] Orchard: An Open-Source Agentic Modeling Framework

This paper points out that in agentic modeling research, tight coupling between infrastructure and training techniques limits reproducibility and scalability. In existing work, the agent harness and training stack are strongly coupled to environment management, making reuse across different domains or environments difficult.

#Review #Agentic Modeling #Kubernetes-native #Orchard Env #Balanced Adaptive Rollout #Credit-assignment SFT #SWE-bench #GUI Agents #Tool-calling

May 14, 2026

[Paper Review] AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

This paper addresses the problem that current GUI agent evaluation focuses on simple visual element matching and therefore fails to measure the ability to understand complex state changes and GUI dynamics in real digital environments.

#Review #GUI Agents #Multi-Modal Benchmarking #Functional Understanding #Interaction Outcome Prediction #Vision-Language Models #Hierarchical Decomposition

April 28, 2026

[Paper Review] CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

Multimodal Agentic Pipelines have recently been transforming Human-Computer Interaction, but most focus on Short-Horizon or General-Purpose Applications, and Long-Horizon Automation, particularly in Healthcare, remains largely unexplored.

#Review #Multi-Agent Framework #Healthcare Automation #Long-Horizon Tasks #Actor-Critic #Tool Grounding #Dual-Memory #CareFlow #GUI Agents

March 25, 2026

[Paper Review] GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

This paper addresses the problem that existing open-source GUI agents lag behind commercial systems on long-horizon navigation tasks.

#Review #GUI Agents #Reinforcement Learning #Supervised Fine-tuning #Visual Grounding #Long-Horizon Tasks #Partial Verifiability #KL Regularization #Data Curation

February 25, 2026

[Paper Review] Computer-Using World Model

This paper aims to solve problems that arise in complex software environments because agents lack the ability to reason about the consequences of their actions.

#Review #World Model #GUI Agents #Desktop Automation #Reinforcement Learning #Large Language Models #Visual State Realization #Textual State Transition

February 19, 2026

[Paper Review] Code2World: A GUI World Model via Renderable Code Generation

This paper addresses the lack of visual fidelity and fine-grained structural controllability in existing text-based and pixel-based GUI world models. It aims to build a GUI world model with high visual fidelity and precise structural control by predicting the next state of the user interface (UI) through renderable code generation.

#Review #GUI World Model #Renderable Code Generation #Vision-Language Model #Reinforcement Learning #HTML Synthesis #UI Prediction #GUI Agents

February 10, 2026

[Paper Review] Continual GUI Agents

This study defines Continual GUI Agents, a new task in which a GUI (Graphical User Interface) agent learns continually, without performance degradation, in dynamic digital environments (shifting data distributions) such as new domains or resolution changes.

#Review #Continual Learning #GUI Agents #Reinforcement Learning #Grounding #Domain Adaptation #Resolution Adaptation #Reward Shaping #Human-Computer Interaction

February 1, 2026

[Paper Review] MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

This study addresses practical deployment problems of existing GUI agents: insufficient user interaction, the limits of UI-only operation, impractical deployment architectures, and fragility in dynamic environments.

#Review #GUI Agents #Foundation Models #Reinforcement Learning #Device-Cloud Collaboration #Mobile Navigation #Tool Augmentation #User Interaction

December 28, 2025

[Paper Review] VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

The paper points out that existing GUI grounding benchmarks suffer from limitations such as insufficient data, narrow domain coverage, a focus on a single platform, and excessive demands for expert knowledge.

#Review #GUI Grounding #Multi-Platform #Benchmark #MLLM #Hierarchical Evaluation #Human-in-the-Loop Annotation #GUI Agents #Multilingual Dataset

December 18, 2025

[Paper Review] GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

This study addresses the difficulty GUI (Graphical User Interface) agents have in obtaining the comprehensive environment information needed to perform complex screen navigation tasks in real environments.

#Review #GUI Agents #Screen Navigation #Reinforcement Learning #Multi-Turn RL #Simulation #Supervised Fine-tuning #Generalization

December 2, 2025

[Paper Review] HiconAgent: History Context-aware Policy Optimization for GUI Agents

The paper studies how GUI (Graphical User Interface) agents performing sequential navigation tasks can use past context effectively and efficiently, without excessive computational overhead or distraction from unnecessary information.

#Review #GUI Agents #Reinforcement Learning #Context-aware #History Compression #Policy Optimization #Multimodal LLM #Dynamic Sampling

December 1, 2025

[Paper Review] HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration

This paper addresses the problem that autonomous GUI (Graphical User Interface) agents produce inaccurate or overconfident predictions that lead to task failure.

#Review #GUI Grounding #Uncertainty Calibration #Reinforcement Learning #Confidence Estimation #Brier Score #GUI Agents #Visual-Language Models

November 9, 2025

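[Paper Review] MobiAgent: A Systematic Framework for Customizable Mobile Agents

This paper addresses the accuracy and efficiency problems of real-world task execution faced by GUI-based mobile agents, including low task completion rates, slow response times, and poor handling of unexpected situations. In particular, it aims to overcome the limitations of existing models and provide a systematic framework for customizable mobile agents.

#Review #Mobile Agents #GUI Agents #Vision-Language Models #Agent Acceleration #Benchmarking #Reinforcement Learning #Data Collection

September 3, 2025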

[Paper Review] FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Existing GUI agent benchmarks lack game diversity and the ability to evaluate completion of full storylines, and they have not properly addressed the 'observation-behavior gap', the problem of an agent remembering and using information it observed earlier.

#Review #GUI Agents #Adventure Games #Benchmark #Full Story Arc #Observation-Behavior Gap #LLMs #Automated Evaluation

September 3, 2025

[Paper Review] CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning

The paper aims to resolve the persistent trade-off between long-horizon planning and precise fine-grained execution, a core challenge for GUI (Graphical User Interface)-based autonomous agents.

#Review #GUI Agents #Reinforcement Learning #Planner-Executor Architecture #Decoupled Training #Large Vision-Language Models #Specialization #Generalization #Computer Use Agent

August 28, 2025

[Paper Review] UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

This paper addresses three limitations of existing GUI agent training and inference methods: the reasoning design dilemma (P1), ineffective rewards (P2), and visual noise on high-resolution displays (P3).

#Review #GUI Agents #Reinforcement Learning #Grounding #MLLMs #Reward Function #Resampling #Visual Noise Reduction

August 11, 2025

[Paper Review] UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

This paper seeks to overcome a limitation of prior work on GUI grounding, which overlooked how the diversity and quality of natural-language instructions affect model performance. It aims to reduce the 23.3% error rate found in instructions and to exploit instruction diversity at inference time for a relative performance gain of up to 76%.

#Review #GUI Grounding #Natural Language Instructions #Multi-Perspective Reasoning #Supervised Fine-Tuning (SFT) #Reinforcement Learning (RL) #Policy Collapse Mitigation #GUI Agents

October 27, 2025

[Paper Review] VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos

This study addresses the difficulty of obtaining the large-scale, manually annotated interaction data needed to train GUI (Graphical User Interface) agents.

#Review #GUI Agents #Video Pretraining #Inverse Dynamics #Action Recognition #Computer Use Automation #Data Synthesis #Multimodal Learning

October 23, 2025

[Paper Review] ColorAgent: Building A Robust, Personalized, and Interactive OS Agent

As human-OS interaction shifts from command-based interfaces to interaction with AI agents, this paper aims to build ColorAgent, a robust, personalized, and interactive OS agent that follows user instructions accurately and faithfully reflects user intent.

#Review #OS Agent #Reinforcement Learning #Multi-agent Systems #Personalization #Proactive Interaction #GUI Agents #Self-Evolving Training

October 23, 2025

[Paper Review] GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

This paper aims to solve the inefficiency that arises when Vision-Language Model (VLM)-based GUI agents process sequences of high-resolution screenshots and long-horizon tasks.

#Review #GUI Agents #KV Cache Compression #Spatio-Temporal Awareness #Vision-Language Models #Efficiency #Attention Sparsity #QR Decomposition

October 2, 2025

[Paper Review] Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

This paper addresses the challenges of developing on-device GUI agents, which require low latency, strong privacy guarantees, and robust operation under limited connectivity.

#Review #GUI Agents #On-Device AI #Multimodal LLM #GUI Grounding #GUI Navigation #Reinforcement Learning #Supervised Fine-tuning #Synthetic Data

October 1, 2025