Interpretability of Deep Neural Networks with Sparse Autoencoders

Ana Luísa Pinho · 17 slides · Sep 17, 2024

About This Presentation

Overview of Keynote and follow-up Tutorial on Sparse Autoencoders by Jack Lindsey at the Cognitive Computational Neuroscience Conference 2024


Slide Content

Sparse Autoencoders (SAEs)
Overview of Keynote and Tutorial at CCN2024
by Jack Lindsey from Anthropic's Interpretability Team
Ana Luísa Pinho
12th of September, 2024
YouTube link to the Keynote
“Recent advances in interpretability of deep neural network models”:
https://www.youtube.com/watch?v=-xmLMWC2YlI (start → 46m37s)


Why study interpretability of LLMs?
Explain their flawed behavior
For the neuroscience crowd: LLMs (or Deep Neural Networks in general) provide good models for complex behavior
What kinds of abstractions do LLMs represent internally, and how are they represented?

The Linear Representation Hypothesis
Language models represent many semantically meaningful concepts linearly in their activations.
There is a lot of empirical evidence supporting this hypothesis.
Example: “King – Man + Woman = Queen” (word embeddings learned with Skip-gram; Mikolov et al., 2013)

Adapted from Jack Lindsey’s talk, CCN2024
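As a quick illustration (not from the slides), the analogy above can be reproduced with gensim's pretrained word2vec vectors; the model name below is the standard Google News Skip-gram release, and the download is large, so treat this as a sketch rather than part of the talk.

```python
# Minimal sketch: reproducing the "King - Man + Woman ~ Queen" analogy
# with pretrained Skip-gram (word2vec) vectors via gensim.
import gensim.downloader as api

# Large (~1.6 GB) download of the Google News word2vec vectors (Mikolov et al., 2013).
vectors = api.load("word2vec-google-news-300")

# king - man + woman: the nearest neighbour should be "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```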
How can models represent numerous concepts linearly
without causing interference?

Superposition (Elhage et al., 2022)
An N-dimensional space can linearly represent many more than N variables, with minimal interference, as long as they are sparsely active.
This is useful for representing features that are only occasionally relevant in a given context.
Adapted from Jack Lindsey’s talk, CCN2024
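A small numerical illustration of superposition (my own, not from the talk): pack far more features than dimensions into a space using random, nearly-orthogonal directions, and check that a linear readout of a sparse activation pattern suffers little interference.

```python
# Minimal sketch of superposition: many sparse features packed into N < M dimensions.
import numpy as np

rng = np.random.default_rng(0)
N, M = 256, 2048                     # 256 dimensions, 2048 features
# Random unit vectors are nearly orthogonal in high dimensions.
dirs = rng.standard_normal((M, N))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# A sparse activation pattern: only a handful of features are active at once.
active = rng.choice(M, size=5, replace=False)
x = dirs[active].sum(axis=0)         # superposed N-dimensional activation

# Linear readout of every feature via dot products.
readout = dirs @ x
inactive = np.delete(readout, active)
print("readout of the 5 active features:", np.round(readout[active], 2))   # each close to 1
print("mean |interference| on the other features:",
      np.round(np.abs(inactive).mean(), 3))                                # close to 0
```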

From Jack Lindsey’s talk, CCN2024
Sparse Autoencoder
via Dictionary Learning with an L1 penalty
Adapted from Jack Lindsey’s talk, CCN2024
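A minimal PyTorch sketch of this kind of sparse autoencoder, assuming made-up sizes (512-dimensional activations, 4096 dictionary features) rather than the talk's exact setup: a ReLU encoder, a linear decoder whose weight columns act as dictionary directions, and a loss combining reconstruction error with an L1 penalty on the feature activations.

```python
# Minimal sketch of a sparse autoencoder (SAE) trained by dictionary learning
# with an L1 sparsity penalty on the hidden feature activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)    # activations -> features
        self.decoder = nn.Linear(d_features, d_model)    # features -> reconstruction

    def forward(self, x):
        f = torch.relu(self.encoder(x))                   # sparse, non-negative features
        x_hat = self.decoder(f)
        return x_hat, f

# Hypothetical sizes: 512-dim MLP activations, 4096 dictionary features.
sae = SparseAutoencoder(d_model=512, d_features=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                           # sparsity/reconstruction trade-off

def training_step(acts):                                   # acts: (batch, 512) model activations
    x_hat, f = sae(acts)
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```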

SAEs often learn features that correspond to interpretable concepts
Examples: a feature that activates on Arabic text; a feature that activates on DNA-nucleotide sequences
SAEs trained on activations of a 1-layer transformer model with a 512-neuron MLP (Bricken, T., et al., 2023)

Trained sparse autoencoders of varying sizes (1M, 4M, 34M features) on activity from the middle layer of Claude 3 Sonnet
Link to blog post: https://transformer-circuits.pub/2024/scaling-monosemanticity
Adapted from Jack Lindsey’s talk, CCN2024

See the original talk for more amusing examples, such as features for errors in code or for behavior tailored in response to praise.
Adapted from Jack Lindsey’s talk, CCN2024

From Jack Lindsey’s talk, CCN2024


Specificity Results
Feature interpretations are not perfect, but reliable:
Features are more interpretable than the activations of the model's individual neurons.
As the size of the autoencoder increases, the interpretability of the features improves.
Sensitivity tests: Spearman's ρ = 0.4 between predicted and true activations
Feature interpretations are not wrong, but coarser than expected.
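For reference (my own sketch, with synthetic numbers rather than the talk's data), a sensitivity score of this kind can be computed as a Spearman rank correlation between interpretation-based predictions and the true feature activations.

```python
# Sketch of the kind of sensitivity check reported above: rank-correlate
# activations predicted from a feature's interpretation with its true activations.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
true_acts = rng.random(200)                           # hypothetical true feature activations
predicted_acts = true_acts + 0.8 * rng.random(200)    # hypothetical interpretation-based predictions

rho, p_value = spearmanr(predicted_acts, true_acts)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.1e})")
```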


Assessing Breadth of Feature Coverage
The larger the autoencoder, the more comprehensive its coverage.
It identifies the more frequent concepts first, and then increasingly specific ones.

Tutorial: Python Colab Notebook
Setup
Build a toy model and visualize its activations
Train an SAE and visualize its encoding and decoding weights
Address dead features when training SAEs
SAEs trained on activations from two 1-layer transformers
Validate SAE sparsity (see the first sketch after this list)
Validate SAE reconstruction loss
Find the highest-activating tokens
Other feature quality checks
Feature steering in the model (see the second sketch after this list)
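A sketch of what the sparsity, reconstruction, and dead-feature checks above amount to, reusing the hypothetical SparseAutoencoder from the earlier sketch (this is not the notebook's actual code).

```python
# Sketch of the SAE sanity checks listed above: sparsity (L0), reconstruction
# loss, and dead-feature detection. Assumes `sae` and the shapes from the
# earlier SparseAutoencoder sketch.
import torch

@torch.no_grad()
def validate_sae(sae, acts):                       # acts: (n_tokens, d_model)
    x_hat, f = sae(acts)

    # Sparsity: average number of non-zero features per token (L0 norm).
    l0 = (f > 0).float().sum(dim=-1).mean().item()

    # Reconstruction quality: mean squared error between input and output.
    mse = ((x_hat - acts) ** 2).mean().item()

    # Dead features: dictionary entries that never activate on this batch.
    dead = int((f.max(dim=0).values == 0).sum().item())

    return {"L0": l0, "MSE": mse, "dead_features": dead}

# Example usage with random stand-in activations.
acts = torch.randn(10_000, 512)
print(validate_sae(sae, acts))
```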
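And a rough sketch of feature steering, using a generic PyTorch forward hook rather than the notebook's API; the model, layer, and feature index below are placeholders.

```python
# Sketch of feature steering: push the model's internal activations along one
# SAE feature's decoder direction during the forward pass. The hook-based
# mechanism is a generic PyTorch pattern, not the notebook's exact API.
import torch

def steer_with_feature(layer, sae, feature_idx, scale=5.0):
    """Register a forward hook that adds `scale` times the chosen feature's
    decoder direction to `layer`'s output activations."""
    direction = sae.decoder.weight[:, feature_idx].detach()    # (d_model,)

    def hook(module, inputs, output):
        return output + scale * direction                       # broadcast over tokens

    return layer.register_forward_hook(hook)

# Hypothetical usage: steer, run the model, then remove the hook.
# handle = steer_with_feature(model.blocks[0].mlp, sae, feature_idx=123)
# steered_logits = model(tokens)
# handle.remove()
```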