#Ill-Structured Problems

1 post

[Paper Review] DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks

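This paper proposes DeepResearch Arena, a new benchmark for evaluating the real-world research abilities of large language model (LLM)-based research agents. It is designed to overcome two limitations of existing benchmarks: the risk of data leakage and unrealistic evaluation methods.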

#Review #LLM Evaluation #Research Agents #Benchmark #Multi-Agent System #Seminar-Grounded Tasks #Data Leakage Prevention #Ill-Structured Problems

September 5, 2025