#Safety Injection

1 post

[Paper Review] Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

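Building on prior work showing that safety alignment in large language models (LLMs) is mediated by specific directions in the model's internal representations and can therefore be bypassed, this paper proposes a new method that does the opposite: it strengthens safety alignment.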

#Review #LLM Safety #Alignment Amplification #Rank-One Update #Mechanistic Interpretability #Weight Steering #Jailbreak Robustness #Fine-tuning-free #Safety Injection

August 29, 2025