#Sparse Autoencoders

15 posts

[Paper Review] SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

This paper points out a limitation of latent-space defense techniques that use SAEs: they may fail to fully control behavior.

#Review #Sparse Autoencoders #Intervention #Post-Intervention Recovery #Constrained Optimization #Interpretability #Safety #Residual Stream

June 17, 2026

[Paper Review] Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

This paper aims to address the non-reproducibility of SAE features, that is, why SAEs trained with independent random seeds learn different feature sets.

#Review #Sparse Autoencoders #Feature Stability #Mechanistic Interpretability #Seed Dependence #Subspace Analysis #Functional Asymmetry

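June 15, 2026

[Paper Review] Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

This work addresses the problem that the internal behavior of TTS language models remains a "black box", which makes fine-grained control of specific speech attributes difficult. Existing speech models must either retrain the entire model or rely on prompt engineering to achieve a particular style or speaker conversion, which limits both the precision and the efficiency of control.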

#Review #Sparse Autoencoders #Text-to-Speech #Mechanistic Interpretability #Latent Space #Controllable Generation

June 9, 2026

[Paper Review] Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

This paper aims to address the out-of-distribution (OOD) performance degradation that occurs when large vision-language models such as CLIP are fine-tuned for downstream tasks.

#Review #CLIP #Sparse Autoencoders #Robust Fine-tuning #Interpretability #Representational Drift #Computer Vision

May 17, 2026

[Paper Review] WriteSAE: Sparse Autoencoders for Recurrent State

This paper addresses the matrix cache write problem in state-space and hybrid recurrent language models, which existing residual SAEs could not handle.

#Review #Sparse Autoencoders #State-Space Models #Recurrent Neural Networks #Mechanistic Interpretability #Cache-Patching #WriteSAE

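May 13, 2026

[Paper Review] Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

This paper aims to address the problem that recent Vision-Language Models (VLMs) are extremely vulnerable to adversarial attacks, while existing detection methods cannot cope with strong attacks or data distribution shifts in real deployment settings.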

#Review #Vision-Language Models #Adversarial Attack Detection #Sparse Autoencoders #Plug-and-Play #Robustness #Out-of-Domain Generalization

May 10, 2026

[Paper Review] Automatic Image-Level Morphological Trait Annotation for Organismal Images

This paper proposes a modular automatic annotation pipeline that combines Sparse Autoencoders (SAEs) with Multimodal Large Language Models (MLLMs). It first trains an SAE on features extracted by a DINOv2 backbone to identify neurons that respond to spatially distinct morphological parts.

#Review #Sparse Autoencoders #Morphological Trait Annotation #Multimodal Large Language Models #Fine-grained Visual Recognition #Biological Foundation Models

April 2, 2026

[Paper Review] Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

This paper aims to systematically evaluate whether Sparse Autoencoders (SAEs) actually learn meaningful features when they decompose neural network activations into interpretable sparse features.

#Review #Sparse Autoencoders #Interpretability #Neural Network Internals #Evaluation Baselines #Feature Decomposition #LLMs #Mechanistic Interpretability

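February 17, 2026

[Paper Review] Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

Data diversity matters in the post-training of large language models (LLMs), yet existing text-based or general embedding-based diversity metrics fail to capture task-relevant features. This paper aims to address that problem.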

#Review #Data Synthesis #LLMs #Feature Space #Sparse Autoencoders #Diversity Metrics #Post-Training #Instruction Tuning #Feature Activation Coverage

February 15, 2026

[Paper Review] On the Evidentiary Limits of Membership Inference for Copyright Auditing

This paper investigates whether Membership Inference Attacks (MIAs) can serve as reliable technical evidence in copyright audits of Large Language Model (LLM) training data.

#Review #Membership Inference Attacks #Copyright Auditing #Large Language Models #Adversarial Robustness #Paraphrasing #Sparse Autoencoders #Semantic Preservation #LLM Security

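January 20, 2026

[Paper Review] BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain

This paper proposes BrainExplore, an automated framework for discovering and interpreting visual concept representations in the human brain at scale. It aims to overcome the limitations of prior fMRI studies (small scale, manual analysis, and dependence on specific brain regions) and to automatically identify fine-grained, interpretable brain activity patterns across a vast space of visual concepts.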

#Review #fMRI #Brain Mapping #Visual Representation #Interpretability #Sparse Autoencoders #Vision-Language Models #Unsupervised Learning #Neuroscience

December 10, 2025

[Paper Review] Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story

This paper aims to identify the text-level determinants of Intrinsic Dimension (ID), an important tool in the analysis of modern LLMs.

#Review #Intrinsic Dimension #LLMs #Text Complexity #Sparse Autoencoders #Text Semantics #Genre Analysis #Embedding Space #Text Generation

November 23, 2025

[Paper Review] CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection

This paper aims to address the limitations of existing Sparse Autoencoder (SAE)-based LLM steering methods, which require either large contrastive datasets or extensive activation storage.

#Review #Sparse Autoencoders #LLM Steering #Feature Selection #Correlation Analysis #AI Safety #Bias Mitigation #Mechanistic Interpretability

August 20, 2025

[Paper Review] Memory Retrieval and Consolidation in Large Language Models through Function Tokens

This paper aims to address the lack of understanding of how memory retrieval and memory consolidation mechanisms operate within large language models (LLMs).

#Review #Large Language Models #LLM Interpretability #Function Tokens #Memory Retrieval #Memory Consolidation #Sparse Autoencoders #Pre-training

October 10, 2025

[Paper Review] OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

This paper aims to resolve the feature absorption and feature composition problems that existing Sparse Autoencoders (SAEs) suffer from, thereby improving the interpretability and atomicity of the features extracted from LLM internal activations.

#Review #Sparse Autoencoders #Mechanistic Interpretability #Feature Disentanglement #Orthogonality #LLM Features #Feature Absorption #Feature Composition

October 6, 2025