#Mechanistic Interpretability

32 posts

[Paper Review] Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

[Paper Review] The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models

[Paper Review] Measuring the Depth of LLM Unlearning via Activation Patching

[Paper Review] Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

[Paper Review] Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

[Paper Review] Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering

[Paper Review] Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

[Paper Review] Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

[Paper Review] Reasoning Models Generate Societies of Thought

[Paper Review] Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

[Paper Review] The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models

[Paper Review] Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

[Paper Review] Beyond Transcription: Mechanistic Interpretability in ASR

[Paper Review] CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection

[Paper Review] BiasGym: Fantastic Biases and How to Find (and Remove) Them

[Paper Review] Large Language Models Do NOT Really Know What They Don't Know

[Paper Review] Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context

[Paper Review] OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

[Paper Review] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

[Paper Review] Emergence of Linear Truth Encodings in Language Models

[Paper Review] Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training
