#Mechanistic Interpretability

24 posts

[Paper Review] Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

[Paper Review] Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

[Paper Review] Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

[Paper Review] CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection
