#Advantage Shaping

1 post

[Paper Review] No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

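This paper addresses a problem in existing reinforcement learning with verifiable rewards (RLVR) methods, GRPO in particular: they ignore "zero-variance prompts", prompts for which every rollout response receives the same reward, and in doing so discard valuable learning signal and waste the cost of the rollouts.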
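To make the problem concrete, here is a minimal sketch of the group-relative advantage normalization commonly associated with GRPO, in which each reward is standardized within its rollout group. The function names and the `eps` stabilizer are assumptions chosen for illustration; this shows only the baseline behavior the summary describes, not the paper's proposed entropy-guided shaping.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group (GRPO-style).

    `eps` is an assumed small constant that avoids division by zero.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def is_zero_variance(rewards):
    """True when every rollout for the prompt received the same reward."""
    r = np.asarray(rewards, dtype=float)
    return bool(np.all(r == r[0]))

# Mixed outcomes: correct rollouts get positive advantage, incorrect get negative.
mixed = group_relative_advantages([1, 0, 1, 0])

# Zero-variance prompt: every reward equals the group mean, so every
# advantage is exactly zero and the prompt contributes no gradient signal.
all_correct = group_relative_advantages([1, 1, 1, 1])
all_wrong = group_relative_advantages([0, 0, 0, 0])
```

Under this normalization, a prompt whose rollouts are all correct or all wrong yields a zero advantage vector, so the compute spent generating those rollouts produces no policy update.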

#Review #LLM Reinforcement Learning #Zero-Variance Prompts #Advantage Shaping #Entropy-Guided #Math Reasoning #RLVR #Group Relative Policy Optimization

September 29, 2025