#Interactive Benchmarking

3 posts

[Paper Review] AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

This paper points out that, in complex real-world environments, LLM agents lack the ability to effectively make and revise plans under Dual Constraints that are revealed through Progressive Disclosure.

#Review #Large Language Model Agents #Adaptive Planning #Dual Constraints #Progressive Disclosure #Interactive Benchmarking #Constraint-based Planning

June 4, 2026

[Paper Review] CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

This paper proposes CausaLab to address the 'Causal parrot' problem, in which existing causal reasoning benchmarks rely on memorized knowledge rather than evaluating an LLM's genuine causal reasoning.

#Review #Causal Discovery #LLM Agents #Structural Causal Models #Interactive Benchmarking #Scientific Discovery #Mechanism Recovery

May 28, 2026

[Paper Review] KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

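This paper starts from the problem that current mobile agent benchmarks fail to properly reflect real service environments, where an agent must understand users' personalized requirements or make proactive decisions.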

#Review #Mobile Agent #Personalization #Proactive Assistance #Interactive Benchmarking #User Simulation #GUI Automation

April 9, 2026