#Reward Hacking

18 posts

[Paper Review] Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

[Paper Review] UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

[Paper Review] Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

[Paper Review] Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

[Paper Review] Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

[Paper Review] GARDO: Reinforcing Diffusion Models without Reward Hacking

[Paper Review] Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models

[Paper Review] MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning

[Paper Review] RewardDance: Reward Scaling in Visual Generation

[Paper Review] Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

[Paper Review] Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models

[Paper Review] IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards
