#On-policy Distillation

12 posts

[Paper Review] DOPD: Dual On-policy Distillation

This paper aims to resolve the Privilege Illusion problem that arises when privileged information is injected in on-policy distillation (OPD).

#Review #On-policy Distillation #Privileged Information #Privilege Illusion #Advantage-aware #Dual Distillation #Large Language Model #Vision-Language Model

June 30, 2026

[Paper Review] AsyncOPD: How Stale Can On-Policy Distillation Be?

This paper argues for an asynchronous training pipeline to address the on-policy systems bottleneck that OPD faces in LLM post-training. Conventional synchronous training keeps the learner waiting until rollout generation finishes, which lowers hardware utilization.

#Review #On-policy Distillation #Asynchronous RL #Reverse KL #Staleness #Teacher Cache #Multi-sample MC #Large Language Model

June 29, 2026

[Paper Review] Qwen-Image-2.0-RL Technical Report

This work sets out to narrow the gap between generation quality and instruction-following ability in the Qwen-Image-2.0 diffusion model, and to secure consistent performance on complex editing tasks.

#Review #RLHF #On-policy Distillation #Diffusion Models #Reward Modeling #Flow Matching #GRPO #Qwen-Image-Bench

June 28, 2026

[Paper Review] Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

This study investigates which geometric properties distinguish OPD from ordinary Supervised Fine-tuning (SFT), and why OPD shows sparse update patterns similar to those of RLVR (Reinforcement Learning with Verifiable Rewards).

#Review #On-policy Distillation #Parameter Sparsity #Model Geometry #Subnetwork Masking #LLM Post-training #Optimizer Dynamics

June 14, 2026

[Paper Review] Trajectory-Refined Distillation

This paper aims to solve the Prefix Failure problem, a structural weakness of OPD, which is widely used in post-training of modern LLMs. Prior work tried to address it by modifying token-level loss functions or reweighting specific tokens, but these approaches cannot correct the root cause of a failed trajectory.

#Review #On-policy Distillation #Prefix Failure #Trajectory-Refined Distillation #Large Language Models #Self-distillation #Policy Gradient #Alignment

June 8, 2026

[Paper Review] On the Geometry of On-Policy Distillation

This paper defines its core problem as follows: although OPD shares characteristics of both SFT and RLVR, its concrete training dynamics in parameter space have not been properly characterized.

#Review #On-policy Distillation #Parameter-space Geometry #Subspace Locking #SFT #RLVR #Large Language Models

June 8, 2026

[Paper Review] Trust-Region Behavior Blending for On-Policy Distillation

This paper aims to resolve the training instability and low-quality data generation that occur in the early stage of OPD. In conventional OPD, when the student model generates low-quality trajectories early in training, the teacher model's supervision ends up concentrated on inefficient regions.

#Review #On-policy Distillation #Trust Region #Knowledge Distillation #Language Model Alignment #Annealed Warmup #Behavior Policy

May 31, 2026

[Paper Review] Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

This paper addresses a limitation of existing Selective OPD methods, which use only token uncertainty (entropy) or teacher-student disagreement (divergence) as the token selection criterion.

#Review #On-policy Distillation #Knowledge Distillation #Token Teachability #Selective OPD #Teacher-Student Compatibility

May 31, 2026

[Paper Review] Less is More: Early Stopping Rollout for On-Policy Distillation

This paper addresses the Off-policy Teacher Decay problem that arises in existing OPD methods.

#Review #On-policy Distillation #Knowledge Distillation #Language Models #Early Stopping Rollout #Off-policy Teacher Decay #Cascading Alignment #Sub-mode Commitment

May 27, 2026

[Paper Review] The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

This paper aims to address the limitations of reward extrapolation that arise during On-policy Distillation of LLMs.

#Review #On-policy Distillation #Reward Extrapolation #Structured Output #Format Adherence #Importance Sampling #LLM

May 13, 2026

[Paper Review] HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

This paper proposes two core methods: a Mixture-of-Transformers (MoT) architecture for modality-adaptive computation, and Visual Latent Tokens that strengthen the vision-language connection. To improve visual perception, it adopts the HY-ViT 2.0 encoder and designs an iterative post-training paradigm that uses high-quality embodied data.

#Review #Embodied Foundation Models #Mixture-of-Transformers #Visual Latent Tokens #On-policy Distillation #Chain-of-Thought #Real-world Agents

April 9, 2026

[Paper Review] Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

On-policy Distillation (OPD) is attractive for post-training Large Language Models (LLMs) because it uses teacher feedback on student-generated rollouts.

#Review #On-policy Distillation #LLM Post-training #Sampled-token OPD #Variance Reduction #Local Support Matching #Truncated Reverse-KL #Top-p Rollout Sampling #Special Token Masking

March 26, 2026