#Performance Evaluation

6 posts

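[Paper Review] Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

This paper addresses the methodological problem of evaluating whether LLM-based agents deliver practical capability in complex industrial environments. Existing benchmarks either rely on oversimplified tasks or fail to adequately measure the privacy protection and multi-step execution abilities that real-world practice requires.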

#Review #Agentic AI #Industry 4.0 #Benchmarking #Privacy-preserving #Multi-agent systems #Performance Evaluation #AssetOpsBench

May 13, 2026

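[Paper Review] VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics

This paper proposes VenusBench-Mobile, which comprises 10 user-intent-centered categories, 149 tasks, and 80 environment variations. To analyze the causes of agent failures in fine detail, it introduces the PUDAM capability taxonomy and classifies each task's difficulty into four levels (Level 1-4).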

#Review #Mobile GUI Agents #User-Centric Benchmark #Capability Diagnostics #Human-Computer Interaction #Performance Evaluation #Robustness

April 8, 2026

[Paper Review] Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions

This paper seeks to resolve the uncertainty about how prevalent defects, or "smells," are in Model Context Protocol (MCP) tool descriptions and what impact they have.

#Review #Model Context Protocol #AI Agents #Tool Descriptions #Software Smells #Prompt Engineering #Foundation Models #Performance Evaluation #Ablation Study

February 25, 2026

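[Paper Review] Discovering Hidden Gems in Model Repositories

This paper aims to efficiently discover "hidden gem" models in large model repositories, that is, models that are little known to users but perform exceptionally well. In particular, it seeks to determine whether the current concentration of model usage is the result of efficient market selection or whether superior models are simply being overlooked.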

#Review #Model Discovery #Hidden Gems #Sequential Halving #Multi-Armed Bandit #Model Repositories #Large Language Models #Performance Evaluation

January 29, 2026

[Paper Review] Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games

This paper aims to evaluate how well OpenAI's ChatGPT Atlas agent interacts with web environments, specifically through web-based games.

#Review #Web Agent #Large Language Models #Multimodal AI #Browser Automation #Game AI #ChatGPT Atlas #Performance Evaluation #Human-Computer Interaction

October 31, 2025

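[Paper Review] U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking

Although thousands of U-Net variants have been proposed for medical image segmentation, no benchmark evaluates their performance and practicality comprehensively, with statistical rigor, and with efficiency taken into account. The goal of this paper is to fill that gap.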

#Review #U-Net #Medical Image Segmentation #Benchmarking #Performance Evaluation #Efficiency Metrics #Zero-shot Generalization #U-Score

October 9, 2025