#Evaluation Metrics

35 posts

[Paper Review] FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

[Paper Review] DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

[Paper Review] MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

[Paper Review] MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

[Paper Review] On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

[Paper Review] DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

[Paper Review] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

[Paper Review] OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

[Paper Review] Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights

[Paper Review] M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark

[Paper Review] Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey

[Paper Review] Instruction-Following Evaluation in Function Calling for Large Language Models

[Paper Review] Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems

[Paper Review] Why Language Models Hallucinate

[Paper Review] T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables

[Paper Review] SpotEdit: Evaluating Visually-Guided Image Editing Methods

[Paper Review] Advances in Speech Separation: Techniques, Challenges, and Future Trends

[Paper Review] WideSearch: Benchmarking Agentic Broad Info-Seeking

[Paper Review] ReviewerToo: Should AI Join The Program Committee? A Look At The Future of Peer Review

[Paper Review] Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

[Paper Review] Revisiting Modeling and Evaluation Approaches in Speech Emotion Recognition: Considering Subjectivity of Annotators and Ambiguity of Emotions

[Paper Review] PICABench: How Far Are We from Physically Realistic Image Editing?

[Paper Review] BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses
