[논문리뷰] SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy OptimizationarXiv에 게시된 'SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization' 논문에 대한 자세한 리뷰입니다.#Review#LLM#Reinforcement Learning#Soft-Thinking#Gumbel Reparameterization#Policy Optimization#Chain-of-Thought (CoT)#GRPO2025년 11월 10일댓글 수 로딩 중