#LLM Evaluation

59 posts

[Paper Review] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

[Paper Review] RubricBench: Aligning Model-Generated Rubrics with Human Standards

[Paper Review] LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

[Paper Review] Implicit Intelligence: Evaluating Agents on What Users Don't Say

[Paper Review] EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

[Paper Review] Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

[Paper Review] Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles

[Paper Review] DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

[Paper Review] KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

[Paper Review] EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

[Paper Review] COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs

[Paper Review] LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics

[Paper Review] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

[Paper Review] IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages

[Paper Review] From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

[Paper Review] DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

[Paper Review] LiveTradeBench: Seeking Real-World Alpha with Large Language Models

[Paper Review] FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

[Paper Review] DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context

[Paper Review] On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

[Paper Review] DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks

[Paper Review] A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

[Paper Review] AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

[Paper Review] mSCoRe: a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning

[Paper Review] From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models

[Paper Review] Are Today's LLMs Ready to Explain Well-Being Concepts?

[Paper Review] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

[Paper Review] C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

[Paper Review] AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

[Paper Review] RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models

[Paper Review] RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

[Paper Review] BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

[Paper Review] BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

[Paper Review] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

[Paper Review] MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning

[Paper Review] Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark