[논문리뷰] Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?arXiv에 게시된 'Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?' 논문에 대한 자세한 리뷰입니다.#Review#Safety Alignment#Large Reasoning Models#Mechanistic Interpretability#Refusal Cliff#Attention Heads#Data Selection#Linear Probing2025년 10월 8일댓글 수 로딩 중