#Tool-use

8 posts

[Paper Review] Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

This paper aims to address the structural fragility and low execution success rates that agents built on Compact Language Models exhibit in complex MCP tool-use environments.

#Review #Tool-use #Compact Language Models #Inference-time Evolution #Executable Workflow #MCP-Bench #LLM Agents #Evolutionary Search

June 11, 2026

[Paper Review] Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

This paper proposes Agent libOS to address the ambiguous security boundaries of existing LLM agent frameworks and the lack of infrastructure for long-running agents.

#Review #LLM Agents #Library OS #Runtime Security #Capability-based Security #Object Memory #Tool-use #System Architecture

June 3, 2026

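[Paper Review] A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

This paper aims to address two problems with existing tool-use agent benchmarks: the severe saturation that results from their reliance on fixed scenarios, and the high, labor-intensive cost of building them.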

#Review #Agent Benchmarks #Tool-use #Task Synthesis #Coverage #Difficulty #Adaptive Contrastive n-gram Model

June 1, 2026

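[Paper Review] GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

This paper addresses the limitation that existing tool-use benchmarks fail to capture the complexity of real-world productivity workflows. Current benchmarks rely mainly on AI-generated queries or fictional tools, and are confined to short-horizon, closed-ended tasks.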

#Review #Autonomous LLM Agents #Agent Evaluation #General AI Assistant #Tool-use #Workflow Management

April 19, 2026

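[Paper Review] SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

This paper aims to evaluate and improve the ability of LLM agents to perform multi-step reasoning with domain-specific tools in complex scientific workflows. It seeks to address the limitation that existing benchmarks focus on static question answering and therefore fail to reflect agents' interactive tool-use ability.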

#Review #LLM Agents #Tool-use #Scientific Reasoning #Benchmarking #Interactive Environment #Data Synthesis #Error Recovery #Multi-step Tasks

February 15, 2026

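[Paper Review] CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

This work aims to address the problem that existing LLM agent benchmarks focus only on task completion under idealized settings while overlooking reliability, consistency, and limit-awareness in real-world environments.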

#Review #LLM Agents #Benchmarks #Tool-use #Consistency #Uncertainty Handling #Hallucination #In-car Assistant #Policy Adherence

February 5, 2026

[Paper Review] DocDancer: Towards Agentic Document-Grounded Information Seeking

This study aims to address the inefficient tool use of existing DocQA (Document Question Answering) agents and their dependence on closed-source models.

#Review #Agentic AI #Document Question Answering #Tool-use #Information Seeking #Synthetic Data Generation #Long-context Understanding #Multimodal Documents

January 8, 2026

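[Paper Review] LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

This paper points out that existing tool-use benchmarks are limited to simulated or small-scale MCP (Model Context Protocol) servers and therefore fail to reflect real-world, large-scale, dynamic environments.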

#Review #LLM Agent #Tool-use #MCP #Benchmark #Large-scale #Real-world tasks #Automated Evaluation #Meta-tool-learning

August 6, 2025