본문으로 건너뛰기

#Benchmark

217개의 포스트

[논문리뷰] Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

댓글 수 로딩 중

[논문리뷰] SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

댓글 수 로딩 중

[논문리뷰] ClawArena: Benchmarking AI Agents in Evolving Information Environments

댓글 수 로딩 중

[논문리뷰] AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

댓글 수 로딩 중

[논문리뷰] AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation

댓글 수 로딩 중

[논문리뷰] MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

댓글 수 로딩 중

[논문리뷰] EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

댓글 수 로딩 중

[논문리뷰] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

댓글 수 로딩 중

[논문리뷰] VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding

댓글 수 로딩 중