#LLM-as-a-Judge

21 posts

[Paper Review] Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

This paper points out that existing reward evaluation methods used in LLM post-training fall short at unifying heterogeneous evaluation criteria.

#Review #Reward Modeling #Agent Skills #LLM-as-a-Judge #Reinforcement Learning #Instruction Following #Evidence-based Evaluation

June 8, 2026

[Paper Review] When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

This paper identifies fundamental problems that arise when optimizing prompts for multi-objective LLM judges, which must weigh several evaluation criteria at once.

#Review #LLM-as-a-Judge #Prompt Optimization #Textual Gradient #Multi-Objective Optimization #Gradient Dilution #Instruction Interference

June 7, 2026

[Paper Review] Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

This paper addresses the problem that modern ASR systems are locked into a single-pass design and therefore cannot effectively resolve meaning-critical errors in situations that, as in human communication, call for repeated confirmation and correction.

#Review #Interactive ASR #Agentic Correction #Semantic Evaluation #S2ER #Human-AI Alignment #LLM-as-a-Judge

June 7, 2026

[Paper Review] Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

This study was conducted to address the opacity of reward hacking in rubric-based RL. In real-world settings, the quality of the model's responses and the evaluator's latent biases are entangled, which makes it difficult to pinpoint when reward hacking emerges or to isolate its cause to a single factor.

#Review #Reinforcement Learning #Reward Hacking #LLM-as-a-Judge #Alignment #Policy Gradient #Evaluation

June 3, 2026

[Paper Review] Unsupervised Process Reward Models

This paper aims to address the high cost and limited scalability of the step-level annotations from human experts that existing PRM training requires.

#Review #Unsupervised Learning #Process Reward Models #Reinforcement Learning #Reasoning #Test-time Scaling #LLM-as-a-Judge

May 21, 2026

[Paper Review] Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models

Protecting privacy in text data is essential in modern NLP, but there is no clear standard for quantifying it.

#Review #privacy evaluation #knowledge distillation #de-identification #LLM-as-a-Judge #textual privacy

March 31, 2026

[Paper Review] Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution.

#Review #Multimodal AI Agents #Web-agent Benchmark #Egocentric Video #Visual Grounding #Online Evaluation #LLM-as-a-Judge #Perception-Action Alignment

March 24, 2026

[Paper Review] Specificity-aware reinforcement learning for fine-grained open-world classification

This paper aims to address the tendency of large multimodal models (LMMs) to produce overly generic predictions when performing fine-grained classification in open-world settings. The main research goal is to improve the specificity of predictions without compromising the model's correctness.

#Review #Open-World Classification #Fine-Grained Classification #Reinforcement Learning #LMMs #Specificity-Aware Reward #GRPO #LLM-as-a-Judge #Cross-Domain Generalization

March 4, 2026

[Paper Review] MediX-R1: Open Ended Medical Reinforcement Learning

This paper proposes MediX-R1, an open-ended medical reinforcement learning (RL) framework that enables medical multimodal large language models (MLLMs) to go beyond multiple-choice questions and generate clinically grounded free-form answers.

#Review #Reinforcement Learning #Multimodal LLMs #Medical AI #Composite Reward #LLM-as-a-Judge #Open-ended Generation #Medical Imaging

February 26, 2026

[Paper Review] DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

This paper introduces DeepSearchQA, a new benchmark for evaluating the ability of AI agents to produce comprehensive answer lists in complex multi-step information-seeking tasks.

#Review #AI Agents #Deep Research #Benchmark #Information Retrieval #Comprehensiveness #Multi-step Reasoning #Evaluation #LLM-as-a-Judge

January 29, 2026

[Paper Review] SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

This paper aims to address the inefficiency and unreliability of task-completion verification, a major bottleneck of agentic reinforcement learning (Agentic RL) for developing autonomous agents on complex GUI tasks.

#Review #Agentic RL #Self-Verifying Agents #GUI Automation #Evidence Curation #LLM-as-a-Judge #Reward Shaping #AndroidLab

December 29, 2025

[Paper Review] Are We on the Right Way to Assessing LLM-as-a-Judge?

This paper aims to address the bias, inconsistency, and scalability problems that arise because current methodologies for assessing LLM-as-a-Judge rely too heavily on human annotation.

#Review #LLM-as-a-Judge #Evaluation Metrics #Consistency #Robustness #Positional Bias #Transitivity #Situational Preference #Multi-agent Systems

December 21, 2025

[Paper Review] ViDiC: Video Difference Captioning

This paper proposes a new task, Video Difference Captioning (ViDiC), which involves understanding and describing visual differences between dynamic video sequences.

#Review #Video Difference Captioning #Multimodal Large Language Models #Video Understanding #Comparative Reasoning #Evaluation Benchmark #LLM-as-a-Judge #ViDiC-1K

December 3, 2025

[Paper Review] TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

This paper aims to resolve the core inconsistency problems that arise in the LLM-as-a-judge evaluation framework.

#Review #LLM-as-a-Judge #Evaluation Frameworks #Inconsistency Reduction #Probabilistic Scoring #Transitivity #Information Loss #Perplexity #Large Language Models

September 26, 2025

[Paper Review] CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects

This paper aims to address the "reality gap" problem faced by existing LLM-based code review (CR) benchmarks.

#Review #Code Review #LLMs #Benchmark #Python Projects #End-to-End Evaluation #Context-Awareness #Software Engineering #LLM-as-a-Judge

September 23, 2025

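[Paper Review] Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge

This paper aims to address the difficulties of evaluating personalized recommender systems in long-form audio domains such as podcasts (exposure bias, and the high cost and constraints of A/B testing). In particular, it tackles the core problem of lacking a scalable, reliable, and interpretable evaluation methodology at the pre-deployment model selection stage.

#Review #Podcast Recommendation #LLM-as-a-Judge #Offline Evaluation #User Profiling #Recommender Systems #Natural Language Processing

August 20, 2025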

[Paper Review] Are Today's LLMs Ready to Explain Well-Being Concepts?

This study aims to systematically evaluate whether large language models (LLMs) are ready to explain well-being concepts accurately and in a way suited to different audiences (the general public and domain experts). In particular, it analyzes the limitations of existing LLMs and explores whether fine-tuning can improve explanation quality.

#Review #Large Language Models #Well-being Concepts #LLM Evaluation #Principle-Guided Evaluation #LLM-as-a-Judge #Supervised Fine-Tuning (SFT) #Direct Preference Optimization (DPO) #Explanation Generation

August 8, 2025

[Paper Review] Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges

This paper aims to address the bias problems of the current "LLM-as-a-judge" paradigm in large language model (LLM)-based dialogue evaluation, along with the excessive computational overhead it incurs at inference time.

#Review #Multi-Turn Dialogue Evaluation #LLM-as-a-Judge #Multi-Judge Aggregation #Preference Learning #Dialogue Quality Assessment #Maximum Likelihood Estimation #Computational Efficiency

August 4, 2025

[Paper Review] Understanding DeepResearch via Reports

This paper focuses on the multifaceted problem of evaluating DeepResearch agents, which carry out knowledge-intensive research tasks.

#Review #DeepResearch Agents #LLM-as-a-Judge #Report Evaluation #Agentic AI #Factuality #Redundancy #Research Automation #Benchmark

October 13, 2025

[Paper Review] CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

This paper aims to establish a self-evolution paradigm in which large language model (LLM)-based agents autonomously improve their capabilities through inter-agent interaction, without external supervision.

#Review #Multi-Agent Systems #LLM Agents #Self-Evolution #Reinforcement Learning #Interaction Rewards #LLM-as-a-Judge #Decentralized Learning

October 10, 2025

[Paper Review] Unified Reinforcement and Imitation Learning for Vision-Language Models

To address the inefficiency of large Vision-Language Models (VLMs), this paper proposes Unified Reinforcement and Imitation Learning (RIL), an efficient training algorithm for building strong, lightweight VLMs even in resource-constrained environments.

#Review #Vision-Language Models #Reinforcement Learning #Imitation Learning #Model Distillation #Lightweight VLMs #LLM-as-a-Judge #Multimodal Learning

October 23, 2025