#Benchmarking

101 posts

[Paper Review] MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

[Paper Review] OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

[Paper Review] EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

[Paper Review] SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

[Paper Review] SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

[Paper Review] Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

[Paper Review] Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

[Paper Review] SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

[Paper Review] FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

[Paper Review] Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams

[Paper Review] $OneMillion-Bench: How Far are Language Agents from Human Experts?

[Paper Review] MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

[Paper Review] NanoKnow: How to Know What Your Language Model Knows

[Paper Review] DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

[Paper Review] Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation

[Paper Review] SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

[Paper Review] FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

[Paper Review] LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth

[Paper Review] OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

[Paper Review] WideSeek: Advancing Wide Research via Multi-Agent Scaling

[Paper Review] AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

[Paper Review] WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora

[Paper Review] AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

[Paper Review] Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility

[Paper Review] DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

[Paper Review] PRiSM: Benchmarking Phone Realization in Speech Models

[Paper Review] MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

[Paper Review] Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey

[Paper Review] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

[Paper Review] ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

[Paper Review] A^3-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation

[Paper Review] Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

[Paper Review] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

[Paper Review] Step-GUI Technical Report

[Paper Review] V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

[Paper Review] From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

[Paper Review] DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

[Paper Review] Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights

[Paper Review] From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images

[Paper Review] TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

[Paper Review] DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

[Paper Review] Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

[Paper Review] DigiData: Training and Evaluating General-Purpose Mobile Control Agents

[Paper Review] TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models

[Paper Review] RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies

[Paper Review] Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

[Paper Review] UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

[Paper Review] Instruction-Following Evaluation in Function Calling for Large Language Models

[Paper Review] Logics-Parsing Technical Report

[Paper Review] MobiAgent: A Systematic Framework for Customizable Mobile Agents

[Paper Review] Benchmarking Optimizers for Large Language Model Pretraining

[Paper Review] MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

[Paper Review] Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents

[Paper Review] OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

[Paper Review] I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations

[Paper Review] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

[Paper Review] AgroBench: Vision-Language Model Benchmark in Agriculture

[Paper Review] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

[Paper Review] InteractComp: Evaluating Search Agents With Ambiguous Queries

[Paper Review] RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

[Paper Review] DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation

[Paper Review] MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

[Paper Review] PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs

[Paper Review] BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

[Paper Review] NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

[Paper Review] U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking

[Paper Review] Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

[Paper Review] VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

[Paper Review] BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software
