[논문리뷰] ResearchGym: Evaluating Language Model Agents on Real-World AI ResearchArman Cohan이 arXiv에 게시한 'ResearchGym: Evaluating Language Model Agents on Real-World AI Research' 논문에 대한 자세한 리뷰입니다.#Review#LLM Agents#AI Research#Benchmark#Closed-loop Research#Agent Evaluation#Reproducibility#Real-world Tasks2026년 2월 17일댓글 수 로딩 중
[논문리뷰] LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context GrowtharXiv에 게시된 'LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth' 논문에 대한 자세한 리뷰입니다.#Review#Large Language Models#Language Agents#Long Context#Context Rot#Benchmarking#Context Management#Tool Use#Agent Evaluation#Dynamic Environments2026년 2월 9일댓글 수 로딩 중
[논문리뷰] ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using AgentsarXiv에 게시된 'ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents' 논문에 대한 자세한 리뷰입니다.#Review#Process Reward Models#Tool-using Agents#Benchmark#Reinforcement Learning#Large Language Models#Reward-guided Search#Agent Evaluation#Step-level Rewards2026년 1월 20일댓글 수 로딩 중
[논문리뷰] EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commercearXiv에 게시된 'EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce' 논문에 대한 자세한 리뷰입니다.#Review#E-commerce#Foundation Agents#LLM Agents#Benchmark#Agent Evaluation#Tool Use#Multi-step Reasoning#Real-world Scenarios2025년 12월 9일댓글 수 로딩 중
[논문리뷰] Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge GraphsZeyi Liao이 arXiv에 게시한 'Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs' 논문에 대한 자세한 리뷰입니다.#Review#Agent Evaluation#Task Generation#Knowledge Graphs#Multimodal AI#Web Interaction#Document Comprehension#LLM-driven Agents2025년 10월 7일댓글 수 로딩 중
[논문리뷰] ARE: Scaling Up Agent Environments and EvaluationsMatteo Bettini이 arXiv에 게시한 'ARE: Scaling Up Agent Environments and Evaluations' 논문에 대한 자세한 리뷰입니다.#Review#Agent Environments#Agent Evaluation#LLM Agents#Asynchronous Systems#Reinforcement Learning#Tool Use#Multi-agent Collaboration#Benchmark2025년 9월 23일댓글 수 로딩 중
[논문리뷰] MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol ServersPrathyusha Jwalapuram이 arXiv에 게시한 'MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers' 논문에 대한 자세한 리뷰입니다.#Review#Large Language Models#Benchmarking#Model Context Protocol#Tool Use#Real-World Applications#Agent Evaluation#Long Context#Unknown Tools2025년 8월 21일댓글 수 로딩 중
[논문리뷰] OpenCUA: Open Foundations for Computer-Use AgentsTianbao Xie이 arXiv에 게시한 'OpenCUA: Open Foundations for Computer-Use Agents' 논문에 대한 자세한 리뷰입니다.#Review#Computer-Use Agents#Vision-Language Models#Chain-of-Thought Reasoning#Large-scale Dataset#Open-source Framework#Desktop Automation#Agent Evaluation2025년 8월 13일댓글 수 로딩 중