#Benchmark

266 posts

[Paper Review] MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

[Paper Review] HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

[Paper Review] LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

[Paper Review] AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

[Paper Review] SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

[Paper Review] Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

[Paper Review] TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

[Paper Review] TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

[Paper Review] OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

[Paper Review] CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

[Paper Review] MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

[Paper Review] ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes

[Paper Review] Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

[Paper Review] Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

[Paper Review] PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

[Paper Review] WorldMark: A Unified Benchmark Suite for Interactive Video World Models

[Paper Review] MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

[Paper Review] Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

[Paper Review] VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

[Paper Review] OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

[Paper Review] GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

[Paper Review] Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

[Paper Review] DeonticBench: A Benchmark for Reasoning over Rules

[Paper Review] Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

[Paper Review] SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

[Paper Review] FileGram: Grounding Agent Personalization in File-System Behavioral Traces

[Paper Review] ClawArena: Benchmarking AI Agents in Evolving Information Environments

[Paper Review] AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

[Paper Review] VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

[Paper Review] AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation

[Paper Review] ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

[Paper Review] GEditBench v2: A Human-Aligned Benchmark for General Image Editing

[Paper Review] VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

[Paper Review] Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

[Paper Review] SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

[Paper Review] Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

[Paper Review] CodePercept: Code-Grounded Visual STEM Perception for MLLMs

[Paper Review] MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

[Paper Review] Do What I Say: A Spoken Prompt Dataset for Instruction-Following

[Paper Review] PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

[Paper Review] Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

[Paper Review] RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

[Paper Review] AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

[Paper Review] SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

[Paper Review] UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

[Paper Review] DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

[Paper Review] OmniGAIA: Towards Native Omni-Modal AI Agents

[Paper Review] LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

[Paper Review] A Very Big Video Reasoning Suite

[Paper Review] MAEB: Massive Audio Embedding Benchmark

[Paper Review] Learning Situated Awareness in the Real World

[Paper Review] ResearchGym: Evaluating Language Model Agents on Real-World AI Research

[Paper Review] BrowseComp-V^3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

[Paper Review] GENIUS: Generative Fluid Intelligence Evaluation Suite

[Paper Review] EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

[Paper Review] GISA: A Benchmark for General Information-Seeking Assistant

[Paper Review] Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

[Paper Review] RISE-Video: Can Video Generators Decode Implicit World Rules?

[Paper Review] HY3D-Bench: Generation of 3D Assets

[Paper Review] Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles

[Paper Review] Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

[Paper Review] Toward Cognitive Supersensing in Multimodal Large Language Model

[Paper Review] TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

[Paper Review] Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

[Paper Review] AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

[Paper Review] AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

[Paper Review] VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

[Paper Review] Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

[Paper Review] MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

[Paper Review] FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

[Paper Review] AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems

[Paper Review] DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

[Paper Review] EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

[Paper Review] VL-LN Bench: Towards Long-horizon Goal-oriented Navigation with Active Dialogs

[Paper Review] TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

[Paper Review] T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

[Paper Review] GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

[Paper Review] VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

[Paper Review] NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

[Paper Review] EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce

[Paper Review] OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

[Paper Review] EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing

[Paper Review] StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

[Paper Review] IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages

[Paper Review] OralGPT-Omni: A Versatile Dental Multimodal Large Language Model

[Paper Review] DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection

[Paper Review] Target-Bench: Can World Models Achieve Mapless Path Planning with Semantic Targets?

[Paper Review] AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning

[Paper Review] Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs

[Paper Review] ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

[Paper Review] GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models

[Paper Review] MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique

[Paper Review] GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents

[Paper Review] MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

[Paper Review] VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

[Paper Review] RiddleBench: A New Generative Reasoning Benchmark for LLMs

[Paper Review] UniREditBench: A Unified Reasoning-based Image Editing Benchmark

[Paper Review] RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

[Paper Review] EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

[Paper Review] Does FLUX Already Know How to Perform Physically Plausible Image Composition?

[Paper Review] BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

[Paper Review] VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

[Paper Review] OpenGVL - Benchmarking Visual Temporal Progress for Data Curation

[Paper Review] SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

[Paper Review] AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

[Paper Review] ARE: Scaling Up Agent Environments and Evaluations

[Paper Review] Measuring Epistemic Humility in Multimodal Large Language Models

[Paper Review] CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China

[Paper Review] LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

[Paper Review] FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

[Paper Review] MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

[Paper Review] Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

[Paper Review] DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks

[Paper Review] FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

[Paper Review] CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

[Paper Review] SpotEdit: Evaluating Visually-Guided Image Editing Methods

[Paper Review] AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

[Paper Review] MultiRef: Controllable Image Generation with Multiple Visual References

[Paper Review] MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence

[Paper Review] VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

[Paper Review] WideSearch: Benchmarking Agentic Broad Info-Seeking

[Paper Review] VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding

[Paper Review] The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

[Paper Review] AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

[Paper Review] VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

[Paper Review] PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding

[Paper Review] Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

[Paper Review] ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs

[Paper Review] MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model

[Paper Review] LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

[Paper Review] Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math

[Paper Review] ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

[Paper Review] MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval

[Paper Review] UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG

[Paper Review] SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

[Paper Review] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

[Paper Review] MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

[Paper Review] EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

[Paper Review] SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

[Paper Review] DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents

[Paper Review] UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

[Paper Review] PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

[Paper Review] PICABench: How Far Are We from Physically Realistic Image Editing?

[Paper Review] BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses

[Paper Review] Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap

[Paper Review] VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

[Paper Review] MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use