#Mechanistic Interpretability

24 posts

[Paper Review] Convergent Evolution: How Different Language Models Learn Similar Number Representations

A detailed review of the paper 'Convergent Evolution: How Different Language Models Learn Similar Number Representations', published on arXiv.

#Review #Language Models #Mechanistic Interpretability #Fourier Features #Convergent Evolution #Modular Arithmetic #Representation Learning

April 22, 2026

[Paper Review] Structural Graph Probing of Vision-Language Models

A detailed review of the paper 'Structural Graph Probing of Vision-Language Models', published on arXiv.

#Review #Vision-Language Models #Neural Topology #Mechanistic Interpretability #Neuron Correlation #Graph Neural Networks #Causal Intervention

April 9, 2026

[Paper Review] Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

This paper proposes a pipeline that uses templated prompts to extract neurons that respond to specific entities and then validates them through causal intervention. As a first step, it identifies Entity Cells by ranking the neurons that activate consistently across multiple prompts.

#Review #Mechanistic Interpretability #LLM #Entity Cells #Factual Recall #Causal Intervention #MLP Neurons #Canonicalization

April 2, 2026

[Paper Review] Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering

A detailed review of the paper 'Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering', published on arXiv.

#Review #Audio-Language Models (LALMs) #Text Dominance #Mechanistic Interpretability #Attention Heads #Activation Steering #Multimodal Grounding #Inference-time Intervention

March 10, 2026

[Paper Review] Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

A detailed review of the paper 'Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?', posted to arXiv by Ivan Oseledets.

#Review #Sparse Autoencoders #Interpretability #Neural Network Internals #Evaluation Baselines #Feature Decomposition #LLMs #Mechanistic Interpretability

February 17, 2026

[Paper Review] Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

A detailed review of the paper 'Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs', posted to arXiv by Lecheng Yan.

#Review #RLVR #LLMs #Mechanistic Interpretability #Memorization Shortcuts #Data Contamination #Anchor-Adapter Circuit #Path Patching #Logit Lens

January 19, 2026

[Paper Review] Reasoning Models Generate Societies of Thought

A detailed review of the paper 'Reasoning Models Generate Societies of Thought', posted to arXiv by James Evans.

#Review #Reasoning Models #Large Language Models (LLMs) #Multi-Agent Systems #Society of Thought #Mechanistic Interpretability #Reinforcement Learning #Cognitive Diversity #Conversational AI

January 18, 2026

[Paper Review] Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

A detailed review of the paper 'Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process', published on arXiv.

#Review #LLM Reasoning #Mechanistic Interpretability #Sparse Autoencoders (SAEs) #Activation Steering #Unsupervised Learning #Reasoning Behaviors #Chain-of-Thought #Feature Disentanglement

December 31, 2025

[Paper Review] In-Context Representation Hijacking

A detailed review of the paper 'In-Context Representation Hijacking', posted to arXiv by yossig.

#Review #LLM Jailbreak #In-Context Learning #Representation Hijacking #Mechanistic Interpretability #LLM Safety #Adversarial Attack #Semantic Shift

December 3, 2025

[Paper Review] The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models

A detailed review of the paper 'The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models', published on arXiv.

#Review #Analogical Reasoning #Large Language Models #Mechanistic Interpretability #Proportional Analogies #Story Analogies #Structural Alignment #Attention Knockout #Patchscopes

December 2, 2025

[Paper Review] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

A detailed review of the paper 'Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs', posted to arXiv by Bohyung Han.

#Review #Video Large Language Models #VideoQA #Mechanistic Interpretability #Attention Knockout #Temporal Reasoning #Information Flow #Model Interpretability #Logit Lens

October 27, 2025

[Paper Review] Emergence of Linear Truth Encodings in Language Models

A detailed review of the paper 'Emergence of Linear Truth Encodings in Language Models', posted to arXiv by Alberto Bietti.

#Review #Language Models #Truth Encoding #Linear Subspaces #Mechanistic Interpretability #Transformer Models #Learning Dynamics #Truth Co-occurrence Hypothesis #Hallucinations

October 24, 2025

[Paper Review] Large Language Models Do NOT Really Know What They Don't Know

A detailed review of the paper 'Large Language Models Do NOT Really Know What They Don't Know', published on arXiv.

#Review #LLMs #Hallucination Detection #Mechanistic Interpretability #Internal States #Knowledge Recall #Refusal Tuning #Factual Associations #Associated Hallucinations

October 17, 2025

[Paper Review] ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall

A detailed review of the paper 'ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall', posted to arXiv by Jiaqi Tang.

#Review #Knowledge Editing #LLMs #Multi-hop Reasoning #Mechanistic Interpretability #Neuron-level Attribution #Factual Recall #Transformer Networks

October 13, 2025

[Paper Review] Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

A detailed review of the paper 'Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?', published on arXiv.

#Review #Safety Alignment #Large Reasoning Models #Mechanistic Interpretability #Refusal Cliff #Attention Heads #Data Selection #Linear Probing

October 8, 2025

[Paper Review] Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context

A detailed review of the paper 'Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context', published on arXiv.

#Review #Language Models #In-Context Learning #Entity Binding #Mechanistic Interpretability #Causal Abstraction #Long-Context Reasoning #Positional Encoding #Information Retrieval

October 8, 2025

[Paper Review] Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

A detailed review of the paper 'Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models', posted to arXiv by Jacobo Azcona.

#Review #LLM Hallucinations #Mechanistic Interpretability #Distributional Semantics Tracing (DST) #Dual-Process Theory #Semantic Drift #Commitment Layer #Faithfulness Score

October 8, 2025

[Paper Review] OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

A detailed review of the paper 'OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features', posted to arXiv by Elena Tutubalina.

#Review #Sparse Autoencoders #Mechanistic Interpretability #Feature Disentanglement #Orthogonality #LLM Features #Feature Absorption #Feature Composition

October 6, 2025

[Paper Review] Eliciting Secret Knowledge from Language Models

A detailed review of the paper 'Eliciting Secret Knowledge from Language Models', posted to arXiv by Neel Nanda.

#Review #Language Models #Secret Elicitation #Mechanistic Interpretability #Black-box Methods #White-box Methods #AI Auditing #Model Organisms #Prefill Attacks

October 2, 2025

[Paper Review] Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

A detailed review of the paper 'Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training', published on arXiv.

#Review #Mechanistic Interpretability #Attention Heads #Post-Training #Supervised Fine-Tuning (SFT) #Reinforcement Learning (RL) #Circuit Analysis #Reasoning Models #Transformer Architecture

October 1, 2025

[Paper Review] Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

A detailed review of the paper 'Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection', posted to arXiv by Bernard Ghanem.

#Review #LLM Safety #Alignment Amplification #Rank-One Update #Mechanistic Interpretability #Weight Steering #Jailbreak Robustness #Fine-tuning-free #Safety Injection

August 29, 2025

[Paper Review] Beyond Transcription: Mechanistic Interpretability in ASR

A detailed review of the paper 'Beyond Transcription: Mechanistic Interpretability in ASR', posted to arXiv by Aviv Shamsian.

#Review #ASR #Mechanistic Interpretability #Logit Lens #Linear Probing #Activation Patching #Hallucinations #Repetitions #Encoder-Decoder

August 28, 2025

[Paper Review] CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection

A detailed review of the paper 'CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection', posted to arXiv by Adriano Koshiyama.

#Review #Sparse Autoencoders #LLM Steering #Feature Selection #Correlation Analysis #AI Safety #Bias Mitigation #Mechanistic Interpretability

August 20, 2025

[Paper Review] BiasGym: Fantastic Biases and How to Find (and Remove) Them

A detailed review of the paper 'BiasGym: Fantastic Biases and How to Find (and Remove) Them', posted to arXiv by Arnav Arora.

#Review #Bias Mitigation #LLMs #Mechanistic Interpretability #Fine-tuning #Attention Steering #Stereotype Analysis #Safety Alignment

August 13, 2025