#Scalable Oversight

4개의 포스트

[논문리뷰] Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

본 논문은 현대의 RLHF, RLAIF, RLVR 등 정렬 파이프라인이 내재적으로 가지고 있는 구조적 취약점인 reward hacking 문제를 다룬다.

#Review #Reward Hacking #Alignment #RLHF #Proxy Compression Hypothesis #Emergent Misalignment #Large Models #Scalable Oversight

2026년 4월 22일

[논문리뷰] Steering LLMs via Scalable Interactive Oversight

본 논문은 대규모 언어 모델(LLM)이 복잡하고 장기적인 태스크를 자동화함에 따라 발생하는 '감독 격차(supervision gap)' 문제를 해결하고자 합니다. 이는 비전문가 사용자가 충분한 도메인 전문성 없이 AI 시스템을 효과적으로 조종하고 복잡한 출력을 검증하기 어려운 문제를 지칭합니다.

#Review #Scalable Oversight #Interactive AI #Large Language Models #Human-AI Collaboration #Product Requirement Documents #Reinforcement Learning #Structured Interaction #Vibe Coding

2026년 2월 5일

[논문리뷰] Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

본 논문은 복잡한 추론 태스크에서 LLM의 출력을 평가하고 피드백을 제공하는 비판(critiquing) 모델을 훈련하는 것을 목표로 합니다.

#Review #Reinforcement Learning #Language Models #Critiquing #Two-Stage Optimization #Actor-Critic #Scalable Oversight #Discriminability #Helpfulness

2025년 10월 29일

[논문리뷰] Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

본 연구는 신뢰할 수 없는 LLM 에이전트가 안전 메커니즘을 우회하여 AI 제어 프로토콜을 전복시키는 문제를 다룹니다. 특히, 공격자 모델이 프로토콜과 모니터 모델에 대한 지식을 가진 적응형 공격(adaptive attacks) 에 초점을 맞춰, LLM 모니터를 핵심 실패 지점으로 악용하는 새로운 공격 벡터를 제시합니다.

#Review #AI Control Protocols #LLM Monitors #Adaptive Attacks #Prompt Injection #Jailbreaking #Red Teaming #Scalable Oversight

2025년 10월 13일