#Model Interpretability

8 posts

[Paper Review] LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

This paper points out that data contamination arising during RL post-training undermines both the reliability of model evaluation and generalization performance. Existing detection methods have relied mainly on output-level signals such as token likelihood and entropy.

#Review #Data Contamination #Reinforcement Learning #Membership Inference Attack #Representation Geometry #Representation Dynamics #Model Interpretability

May 28, 2026

[Paper Review] AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

This paper aims to understand and interpret the complex internal representations of audio-processing models, specifically Whisper and HuBERT, using Sparse AutoEncoders (SAEs).

#Review #Sparse Autoencoders (SAEs) #Audio Representation Learning #Model Interpretability #Whisper #HuBERT #Feature Steering #EEG Correlation #Audio Analysis

February 8, 2026

[Paper Review] No Global Plan in Chain-of-Thought: Uncover the Latent Planning Horizon of LLMs

This study aims to characterize the latent planning ability, or latent planning horizon, that Large Language Models (LLMs) exhibit during Chain-of-Thought (CoT) reasoning.

#Review #Chain-of-Thought #LLM Planning #Probing Methods #Uncertainty Estimation #Reasoning Dynamics #Model Interpretability

February 3, 2026

[Paper Review] CRISP: Persistent Concept Unlearning via Sparse Autoencoders

This paper aims to permanently remove unwanted or harmful knowledge from large language models (LLMs), a goal it calls Persistent Concept Unlearning, while preserving the model's general utility and generation quality.

#Review #Concept Unlearning #Sparse Autoencoders (SAEs) #LLMs #Parameter-Efficient Fine-Tuning #Model Interpretability #Safety-Critical AI #Feature Suppression #WMDP Benchmark

August 25, 2025

[Paper Review] VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

This paper addresses the lack of interpretability of the vision-language alignment mechanism in Vision-Language Models (VLMs).

#Review #Vision-Language Models (VLMs) #Model Interpretability #Sparse Autoencoder (SAE) #Multi-modal Alignment #Concept Learning #Hallucination Elimination #Zero-shot Classification

October 29, 2025

[Paper Review] Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain

This paper focuses on the problem that, although large language models (LLMs) show human-level language ability, it remains unclear which specific computational modules model syntactic structure.

#Review #Large Language Models #Syntactic Structure #Human Brain #Frequency Tagging #Neuroscience #Model Interpretability #Representational Similarity Analysis #Intracranial EEG

October 16, 2025

[Paper Review] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

This paper aims to uncover the mechanism by which Video Large Language Models (VideoLLMs) internally extract and propagate video-text information (spatiotemporal inputs) to perform temporal reasoning in video question answering (VideoQA) tasks.

#Review #Video Large Language Models #VideoQA #Mechanistic Interpretability #Attention Knockout #Temporal Reasoning #Information Flow #Model Interpretability #Logit Lens

October 27, 2025

[Paper Review] The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain

This paper points out a limitation of existing Transformer models: they do not provide generalization of Chain-of-Thought (CoT) reasoning or a micro-level interpretation of brain function.

#Review #Large Language Models #Brain-Inspired AI #Graph Neural Networks #Hebbian Learning #Scale-Free Networks #Model Interpretability #Transformer Architecture

October 1, 2025