#Uncertainty Dynamics

1 post

[Paper Review] What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

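This paper aims to efficiently detect jailbreak attacks, which threaten the safety of large language models (LLMs), by analyzing the model's internal activation states (internal representations). Prior work relies mainly on prompt-level filtering or external classifiers, and so overlooks the semantic changes that occur inside the model.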

#Review #Jailbreak Detection #Large Language Models #Predictive Entropy #Logit Lens #Intermediate Layers #Adversarial Robustness #Uncertainty Dynamics

June 24, 2026