#MLP Neurons

2 posts

[Paper Review] Targeted Neuron Modulation via Contrastive Pair Search

LLMs are instruction-tuned to refuse harmful requests, but the mechanistic basis of this safety behavior remains unclear.

#Review #Neuron Modulation #Contrastive Neuron Attribution #Refusal Mechanisms #Alignment Fine-tuning #Mechanistic Interpretability #Behavioral Steering #MLP Neurons

May 18, 2026

[Paper Review] Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

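This paper proposes a pipeline that uses templated prompts to extract neurons that respond to specific entities, then validates them through causal intervention. As a first step, it identifies Entity Cells by ranking neurons that activate consistently across multiple prompts.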

#Review #Mechanistic Interpretability #LLM #Entity Cells #Factual Recall #Causal Intervention #MLP Neurons #Canonicalization

April 2, 2026