#White-box Intervention

1 post

[Paper Review] Steered LLM Activations are Non-Surjective

This study addresses a fundamental question: can the changes in a model's internal behavior induced by activation steering be reproduced identically through actual text prompts?

#Review #Activation Steering #Surjectivity #LLM Interpretability #Prompt-Reachability #White-box Intervention #AI Safety

May 17, 2026