[논문리뷰] When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine ValidityJohn P Dickerson이 arXiv에 게시한 'When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity' 논문에 대한 자세한 리뷰입니다.#Review#LLM Judge#Benchmark Evaluation#Validity#Reliability#Psychometrics#Factor Analysis#Schema Adherence#ELO Ranking2025년 9월 26일댓글 수 로딩 중