#Rubric-Augmented Verification

1 post

[Paper Review] C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

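This paper reframes rubric generation and rubric-based verification as a cooperative yet critical communication process. The proposed method, C2, first synthesizes helpful and misleading rubrics, using the verifier's confidence as the criterion, and then uses these pairs to train the generator with DPO and the verifier with GRPO.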
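The pairing step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `RubricCandidate` class, the `margin` threshold, and the rule "label a rubric by how much it changes the verifier's confidence in the correct preference" are all assumptions made for illustration. The sketch only covers building DPO-style preference pairs for the generator; the GRPO training of the verifier is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RubricCandidate:
    """Hypothetical record for one generated rubric (names are illustrative)."""
    prompt: str
    rubric: str
    conf_without: float  # verifier's probability of the correct preference, no rubric
    conf_with: float     # same probability when the rubric is provided


def label_rubric(c: RubricCandidate, margin: float = 0.1) -> Optional[str]:
    """Assumed rule: a rubric is helpful if it raises verifier confidence by at
    least `margin`, misleading if it lowers it by at least `margin`."""
    delta = c.conf_with - c.conf_without
    if delta >= margin:
        return "helpful"
    if delta <= -margin:
        return "misleading"
    return None  # ambiguous rubrics are dropped


def build_dpo_pairs(candidates: List[RubricCandidate],
                    margin: float = 0.1) -> List[Dict[str, str]]:
    """Pair every helpful rubric (chosen) with every misleading rubric
    (rejected) that shares the same prompt."""
    by_prompt: Dict[str, Dict[str, List[str]]] = {}
    for c in candidates:
        label = label_rubric(c, margin)
        if label is None:
            continue
        groups = by_prompt.setdefault(c.prompt, {"helpful": [], "misleading": []})
        groups[label].append(c.rubric)

    pairs = []
    for prompt, groups in by_prompt.items():
        for chosen in groups["helpful"]:
            for rejected in groups["misleading"]:
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting dictionaries follow the common `prompt` / `chosen` / `rejected` layout used for preference data, so under these assumptions they could be passed to a DPO trainer of one's choice.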

#Review #Reward Modeling #Reinforcement Learning from Human Feedback (RLHF) #Rubric-Augmented Verification #Binary Preferences #Cooperative Communication

April 16, 2026