[Paper Review] When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity
A detailed review of the paper 'When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity', posted to arXiv by John P Dickerson.
#Review #LLM Judge #Benchmark Evaluation #Validity #Reliability #Psychometrics #Factor Analysis #Schema Adherence #ELO Ranking
September 26, 2025
[Paper Review] Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
A detailed review of the paper 'Neither Valid nor Reliable? Investigating the Use of LLMs as Judges', posted to arXiv by Golnoosh Farnadi.
#Review #LLMs as Judges #NLG Evaluation #Measurement Theory #Validity #Reliability #Evaluation Bias #Scalability #Responsible AI
August 26, 2025