#LLM-as-Judge

4 posts

[Paper Review] FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

[Paper Review] MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

[Paper Review] EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge

[Paper Review] DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle