[논문리뷰] VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM TrainingarXiv에 게시된 'VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training' 논문에 대한 자세한 리뷰입니다.#Review#Off-Policy RL#LLM Training#Importance Sampling#Variance Reduction#Variational Optimization#Policy Gradient#Sequence-Level Optimization#Reinforcement Learning2026년 2월 22일댓글 수 로딩 중
[논문리뷰] Stabilizing Reinforcement Learning with LLMs: Formulation and PracticesarXiv에 게시된 'Stabilizing Reinforcement Learning with LLMs: Formulation and Practices' 논문에 대한 자세한 리뷰입니다.#Review#Reinforcement Learning (RL)#Large Language Models (LLMs)#Policy Gradient#REINFORCE#Mixture-of-Experts (MoE)#Training Stability#Importance Sampling#Routing Replay#Off-policy Learning2025년 12월 1일댓글 수 로딩 중
[논문리뷰] ASPO: Asymmetric Importance Sampling Policy OptimizationXiu Li이 arXiv에 게시한 'ASPO: Asymmetric Importance Sampling Policy Optimization' 논문에 대한 자세한 리뷰입니다.#Review#Reinforcement Learning#Large Language Models#Importance Sampling#Policy Optimization#PPO-Clip#Outcome-Supervised RL#Token Weighting#GRPO2025년 10월 8일댓글 수 로딩 중