Interpretability of Deep Neural Networks with Sparse Autoencoders
AnaLuPinho
About This Presentation
Overview of Keynote and follow-up Tutorial on Sparse Autoencoders by Jack Lindsey at the Cognitive Computational Neuroscience Conference 2024
Slide Content
Sparse Autoencoders (SAEs)
Overview of Keynote and Tutorial at CCN2024
by Jack Lindsey from Anthropic Interpretability Team
Ana Luísa Pinho
12th of September, 2024
YouTube link of the Keynote
“Recent advances in interpretability of deep neural network models”:
https://www.youtube.com/watch?v=-xmLMWC2YlI (start → 46m37s)
Why study interpretability of LLMs?
Explain their flawed behavior
For the neuroscience crowd: LLMs (and deep neural networks in general) provide good models of complex behavior
What kinds of abstractions do LLMs represent internally, and how are they represented?
The Linear Representation Hypothesis
Language models represent many semantically meaningful concepts linearly in their activations.
Lots of empirical evidence supports this hypothesis
Example: “King – Man + Woman = Queen” (word embeddings learned with Skip-gram; Mikolov et al., 2013)
Adapted from Jack Lindsey’s talk, CCN2024
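A quick way to reproduce this classic result, assuming the gensim package is installed (the model name below is gensim's pre-trained word2vec download, not something from the talk, and it is a sizeable download on first use):

```python
# Word-vector arithmetic illustrating the linear representation idea.
import gensim.downloader as api

# Pre-trained word2vec embeddings (Mikolov et al.); downloaded on first use.
wv = api.load("word2vec-google-news-300")

# "king" - "man" + "woman" ≈ "queen"
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected to rank "queen" first
```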
How can models represent numerous concepts linearly without causing interference?
Superposition
Elhage et al., 2022
An N-dimensional space can linearly represent many more than N variables, with minimal interference, as long as they are sparsely active
Useful for representing features that are only occasionally relevant in a given context
Adapted from Jack Lindsey’s talk, CCN2024
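A minimal numeric sketch of this idea (illustrative numbers only, not from the talk): with random, nearly orthogonal directions, far more features than dimensions can be stored and read back with little interference, provided only a few are active at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_feats = 100, 1000      # far more features than dimensions

# Assign each feature a random (nearly orthogonal) unit direction.
W = rng.normal(size=(n_feats, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse activity: only a handful of features are "on" at a time.
active = rng.choice(n_feats, size=5, replace=False)
values = rng.uniform(1.0, 2.0, size=5)

# Superposed representation: one 100-d vector encoding 5 of the 1000 features.
x = values @ W[active]

# Read each active feature back out with a dot product against its direction.
readout = W[active] @ x
print("true values:", np.round(values, 2))
print("readout:    ", np.round(readout, 2))  # close to the true values → small interference
```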
From Jack Lindsey’s talk, CCN2024
Sparse Autoencoder
via dictionary learning with an L1 penalty
Adapted from Jack Lindsey’s talk, CCN2024
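A minimal PyTorch sketch of this setup (layer sizes and the l1_coeff value are illustrative assumptions, not the configuration used in the talk): an overcomplete linear encoder/decoder trained to reconstruct activations, with an L1 penalty pushing feature activations toward sparsity.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary: d_dict >> d_model, with sparse feature activations."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # features f = ReLU(W_enc x + b_enc)
        self.decoder = nn.Linear(d_dict, d_model)   # reconstruction x_hat = W_dec f + b_dec

    def forward(self, x):
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()   # reconstruction error
    sparsity = f.abs().sum(dim=-1).mean()           # L1 penalty on feature activations
    return recon + l1_coeff * sparsity
```

Real setups typically also constrain the decoder columns (e.g., to unit norm) so the L1 term cannot be shrunk simply by rescaling features; the sketch keeps only the core reconstruction-plus-sparsity objective.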
SAEs often learn features that correspond to interpretable concepts
Example: a feature that activates on Arabic text
Example: a feature that activates on DNA nucleotide sequences
SAEs trained on activations of a 1-layer transformer model with a 512-neuron MLP (Bricken et al., 2023)
Trained sparse autoencoders:
of varying sizes (1M, 4M, 34M features)
on activity from the middle layer of Claude 3 Sonnet
Link to blog post: https://transformer-circuits.pub/2024/scaling-monosemanticity
Adapted from Jack Lindsey’s talk, CCN2024
Check the original talk for more amusing examples, such as features for errors in code or for tailored behavior in response to praise.
Adapted from Jack Lindsey’s talk, CCN2024
From Jack Lindsey’s talk, CCN2024
Specificity Results
Feature interpretations are not perfect, but reliable
Features are more interpretable than the activations of the model's individual neurons.
As the size of the autoencoder increases, the interpretability of the features improves.
Sensitivity tests: ρ_Spearman = 0.4 between predicted and true activations
Feature interpretations are not wrong, but coarser than expected
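As a rough sketch of how such a sensitivity score can be computed (the arrays below are synthetic stand-ins, not data from the study):

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-ins: activations predicted from a feature's natural-language
# interpretation vs. the feature's true activations on the same text samples.
rng = np.random.default_rng(0)
true_acts = rng.exponential(scale=1.0, size=200)
predicted_acts = true_acts + rng.normal(scale=1.0, size=200)

rho, p_value = spearmanr(predicted_acts, true_acts)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```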
Assessing Breadth of Feature Coverage
The larger the autoencoder, the more comprehensive its coverage.
It identifies the most frequent concepts first, and then increasingly specific ones.
Tutorial: Python Colab Notebook
Setup
Build a toy model and visualize its activations
Train a SAE and visualize its encoding and decoding weights (a minimal sketch follows this list)
Address dead features when training SAEs
SAEs trained on activations from two 1-layer transformers
Validate SAE sparsity
Validate SAE reconstruction loss
Find the highest-activating tokens
Other feature quality checks
Feature steering in the model
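A minimal sketch of the training and validation steps listed above, reusing the SparseAutoencoder and sae_loss from the earlier sketch; all hyperparameters and variable names are illustrative assumptions, not the notebook's actual values:

```python
import torch

# Hypothetical inputs: `sae` is a SparseAutoencoder and `activations` is a
# (n_samples, d_model) tensor of MLP activations collected from the toy model.
def train_sae(sae, activations, n_steps=1000, batch_size=256, lr=1e-3, l1_coeff=1e-3):
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(n_steps):
        idx = torch.randint(0, activations.shape[0], (batch_size,))
        x = activations[idx]
        x_hat, f = sae(x)
        loss = sae_loss(x, x_hat, f, l1_coeff)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

@torch.no_grad()
def sae_diagnostics(sae, activations):
    x_hat, f = sae(activations)
    l0 = (f > 0).float().sum(dim=-1).mean()          # avg. active features per sample (sparsity)
    recon_mse = (activations - x_hat).pow(2).mean()  # reconstruction loss
    dead = (f > 0).sum(dim=0) == 0                   # features that never fire → candidates for resampling
    return {"L0": l0.item(), "recon_mse": recon_mse.item(), "n_dead": int(dead.sum())}
```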