[Paper Review] On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

The reasoning capability of Large Language Models (LLMs) has advanced substantially through techniques such as Reinforcement Learning with Verifiable Rewards (RLVR).

#Review #RLVR #LLM Reasoning #Log Probability Difference #Directional Updates #Test-Time Extrapolation #Advantage Reweighting #Sparse Updates

March 23, 2026