#Policy Optimization

94 posts

[Paper Review] Not only where, But when: Temporal Scheduling for RLVR

[Paper Review] Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

[Paper Review] CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

[Paper Review] KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

[Paper Review] UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

[Paper Review] ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

[Paper Review] Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

[Paper Review] FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

[Paper Review] CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

[Paper Review] Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

[Paper Review] BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

[Paper Review] Heterogeneous Agent Collaborative Reinforcement Learning

[Paper Review] InfoPO: Information-Driven Policy Optimization for User-Centric Agents

[Paper Review] ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

[Paper Review] DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

[Paper Review] STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

[Paper Review] On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models

[Paper Review] F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

[Paper Review] Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

[Paper Review] Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

[Paper Review] LatentMem: Customizing Latent Memory for Multi-Agent Systems

[Paper Review] Self-Hinting Language Models Enhance Reinforcement Learning

[Paper Review] Rethinking the Trust Region in LLM Reinforcement Learning

[Paper Review] Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning

[Paper Review] Reinforcement Learning via Self-Distillation

[Paper Review] Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

[Paper Review] Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

[Paper Review] From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

[Paper Review] SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization

[Paper Review] HiconAgent: History Context-aware Policy Optimization for GUI Agents

[Paper Review] Soft Adaptive Policy Optimization

[Paper Review] Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

[Paper Review] VIDEOP2R: Video Understanding from Perception to Reasoning

[Paper Review] Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning

[Paper Review] SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization

[Paper Review] π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

[Paper Review] VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

[Paper Review] Tree Search for LLM Agent Reinforcement Learning

[Paper Review] From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature

[Paper Review] The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

[Paper Review] Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

[Paper Review] TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

[Paper Review] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

[Paper Review] Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

[Paper Review] Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models

[Paper Review] Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization

[Paper Review] InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

[Paper Review] FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

[Paper Review] Agentic Entropy-Balanced Policy Optimization

[Paper Review] Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

[Paper Review] Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks

[Paper Review] Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

[Paper Review] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

[Paper Review] A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning

[Paper Review] BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

[Paper Review] Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

[Paper Review] More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models