#Alignment

7 posts

[Paper Review] Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

A detailed review of the paper 'Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges', posted on arXiv.

#Review #Reward Hacking #Alignment #RLHF #Proxy Compression Hypothesis #Emergent Misalignment #Large Models #Scalable Oversight

April 22, 2026

[Paper Review] FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

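This paper proposes Sol-RL, a two-stage framework that decouples exploration from optimization. In the first stage, highly optimized NVFP4 inference quickly generates a large pool of candidates, ranks them by relative reward, and selects a contrastive subset of the top- and bottom-ranked samples.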

#Review #Diffusion Models #Reinforcement Learning #FP4 Quantization #Rollout Scaling #Alignment #Efficiency #Two-stage Framework

April 8, 2026

[Paper Review] The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

A detailed review of the paper 'The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models', posted on arXiv by Jack Lindsey.

#Review #Language Models #Persona Control #Activation Steering #Persona Drift #Alignment #Post-training #Interpretability #Safety

January 19, 2026

[Paper Review] Lost in the Noise: How Reasoning Models Fail with Contextual Distractors

A detailed review of the paper 'Lost in the Noise: How Reasoning Models Fail with Contextual Distractors', posted on arXiv.

#Review #Robustness #Contextual Distractors #RAG #Reasoning Models #Alignment #Tool Use #NoisyBench #Rationale-Aware Reward #Inverse Scaling

January 12, 2026

[Paper Review] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

A detailed review of the paper 'Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing', posted on arXiv by Yu Xu.

#Review #In-Context Image Generation #Image Editing #Multimodal Models #Chain-of-Thought #Structured Reasoning #Reinforcement Learning #Alignment #Diffusion Models

January 8, 2026

[Paper Review] AI & Human Co-Improvement for Safer Co-Superintelligence

A detailed review of the paper 'AI & Human Co-Improvement for Safer Co-Superintelligence', posted on arXiv.

#Review #AI Safety #Superintelligence #Human-AI Collaboration #Self-Improving AI #Co-Improvement #Alignment #AI Research Agents

December 7, 2025

[Paper Review] Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

A detailed review of the paper 'Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails', posted on arXiv by Xinyuan Liu.

#Review #LLM Agents #Alignment #Self-Evolution #Behavioral Drift #Reinforcement Learning #Multi-Agent Systems #Alignment Tipping Process

October 7, 2025