#RLVR

44 posts

[Paper Review] From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

[Paper Review] DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

[Paper Review] You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

[Paper Review] Video Models Can Reason with Verifiable Rewards

[Paper Review] CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

[Paper Review] Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

[Paper Review] HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

[Paper Review] ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

[Paper Review] Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

[Paper Review] How Far Can Unsupervised RLVR Scale LLM Training?

[Paper Review] Heterogeneous Agent Collaborative Reinforcement Learning

[Paper Review] Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

[Paper Review] Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

[Paper Review] Rectifying LLM Thought from Lens of Optimization

[Paper Review] Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries

[Paper Review] Quantile Advantage Estimation for Entropy-Safe Reasoning

[Paper Review] No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

[Paper Review] Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

[Paper Review] Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values

[Paper Review] olmOCR 2: Unit Test Rewards for Document OCR
