[논문리뷰] BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement LearningarXiv에 게시된 'BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning' 논문에 대한 자세한 리뷰입니다.#Review#LLM Reinforcement Learning#Trust Region#Policy Optimization#Ratio Clipping#f-divergence#Entropy Regularization#Exploration#BandPO2026년 3월 8일댓글 수 로딩 중
[논문리뷰] No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage ShapingarXiv에 게시된 'No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping' 논문에 대한 자세한 리뷰입니다.#Review#LLM Reinforcement Learning#Zero-Variance Prompts#Advantage Shaping#Entropy-Guided#Math Reasoning#RLVR#Group Relative Policy Optimization2025년 9월 29일댓글 수 로딩 중