Interpretability of Deep Neural Networks with Sparse Autoencoders
AnaLuPinho
About This Presentation
Overview of Keynote and follow-up Tutorial on Sparse Autoencoders by Jack Lindsey at the Cognitive Computational Neuroscience Conference 2024
Slide Content
Sparse Autoencoders (SAEs)
Overview of Keynote and Tutorial at CCN2024
by Jack Lindsey from Anthropic Interpretability Team
Ana Luísa Pinho
12th of September, 2024
YouTube link of the Keynote
“Recent advances in interpretability of deep neural network models”:
https://www.youtube.com/watch?v=-xmLMWC2YlI (start → 46m37s)
Why study interpretability of LLMs?
Explain their flawed behavior
For the neuroscience crowd: LLMs (and deep neural networks in general) provide good models of complex behavior
What kinds of abstractions do LLMs represent internally, and how are they represented?
The Linear Representation Hypothesis
Language models represent many semantically meaningful concepts linearly in their activations.
Lots of empirical evidence supports this hypothesis
Example: “King – Man + Woman = Queen” (word embeddings learned with Skip-gram; Mikolov et al., 2013)
Adapted from Jack Lindsey’s talk, CCN2024
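A quick way to reproduce this classic result, assuming the gensim package is installed (the model name below is gensim's pre-trained word2vec download, not something from the talk, and it is a sizeable download on first use):

```python
# Word-vector arithmetic illustrating the linear representation idea.
import gensim.downloader as api

# Pre-trained word2vec embeddings (Mikolov et al.); downloaded on first use.
wv = api.load("word2vec-google-news-300")

# "king" - "man" + "woman" ≈ "queen"
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected to rank "queen" first
```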
How can models represent numerous concepts linearly without causing interference?
Superposition
Elhage et al., 2022
An N-dimensional space can linearly represent many more than N variables, with minimal interference, as long as they are sparsely active
Useful for representing features that are only occasionally relevant in a given context
Adapted from Jack Lindsey’s talk, CCN2024
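A minimal numeric sketch of this idea (illustrative numbers only, not from the talk): with random, nearly orthogonal directions, far more features than dimensions can be stored and read back with little interference, provided only a few are active at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_feats = 100, 1000      # far more features than dimensions

# Assign each feature a random (nearly orthogonal) unit direction.
W = rng.normal(size=(n_feats, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse activity: only a handful of features are "on" at a time.
active = rng.choice(n_feats, size=5, replace=False)
values = rng.uniform(1.0, 2.0, size=5)

# Superposed representation: one 100-d vector encoding 5 of the 1000 features.
x = values @ W[active]

# Read each active feature back out with a dot product against its direction.
readout = W[active] @ x
print("true values:", np.round(values, 2))
print("readout:    ", np.round(readout, 2))  # close to the true values → small interference
```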
From Jack Lindsey’s talk, CCN2024
Sparse Autoencoder
via dictionary learning with an L1 penalty
Adapted from Jack Lindsey’s talk, CCN2024
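A minimal PyTorch sketch of this setup (layer sizes and the l1_coeff value are illustrative assumptions, not the configuration used in the talk): an overcomplete linear encoder/decoder trained to reconstruct activations, with an L1 penalty pushing feature activations toward sparsity.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary: d_dict >> d_model, with sparse feature activations."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # features f = ReLU(W_enc x + b_enc)
        self.decoder = nn.Linear(d_dict, d_model)   # reconstruction x_hat = W_dec f + b_dec

    def forward(self, x):
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()   # reconstruction error
    sparsity = f.abs().sum(dim=-1).mean()           # L1 penalty on feature activations
    return recon + l1_coeff * sparsity
```

Real setups typically also constrain the decoder columns (e.g., to unit norm) so the L1 term cannot be shrunk simply by rescaling features; the sketch keeps only the core reconstruction-plus-sparsity objective.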
SAEs often learn features that correspond to interpretable concepts
Example: a feature that activates on Arabic text
Example: a feature that activates on DNA nucleotide sequences
SAEs trained on activations of a 1-layer transformer model with a 512-neuron MLP (Bricken et al., 2023)
Trained sparse autoencoders:
of varying sizes (1M, 4M, 34M features)
on activity from the middle layer of Claude 3 Sonnet
Link to blog post: https://transformer-circuits.pub/2024/scaling-monosemanticity
Adapted from Jack Lindsey’s talk, CCN2024
Check the original talk for more amusing examples, such as features for errors in code or for tailored behavior in response to praise.
Adapted from Jack Lindsey’s talk, CCN2024
From Jack Lindsey’s talk, CCN2024
Specificity Results
Feature interpretations are not perfect, but reliable
Features are more interpretable than the activations of the model's individual neurons.
As the size of the autoencoder increases, the interpretability of the features improves.
Sensitivity tests: ρ_Spearman = 0.4 between predicted and true activations
Feature interpretations are not wrong, but coarser than expected
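As a rough sketch of how such a sensitivity score can be computed (the arrays below are synthetic stand-ins, not data from the study):

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-ins: activations predicted from a feature's natural-language
# interpretation vs. the feature's true activations on the same text samples.
rng = np.random.default_rng(0)
true_acts = rng.exponential(scale=1.0, size=200)
predicted_acts = true_acts + rng.normal(scale=1.0, size=200)

rho, p_value = spearmanr(predicted_acts, true_acts)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```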
Assessing Breadth of Feature Coverage
The larger the autoencoder, the more comprehensive its coverage.
It identifies the most frequent concepts first, and then increasingly specific ones.
Tutorial: Python Colab Notebook
Setup
Build a toy model and visualize its activations
Train a SAE and visualize its encoding and decoding weights (a minimal sketch follows this list)
Address dead features when training SAEs
SAEs trained on activations from two 1-layer transformers
Validate SAE sparsity
Validate SAE reconstruction loss
Find the highest-activating tokens
Other feature quality checks
Feature steering in the model
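A minimal sketch of the training and validation steps listed above, reusing the SparseAutoencoder and sae_loss from the earlier sketch; all hyperparameters and variable names are illustrative assumptions, not the notebook's actual values:

```python
import torch

# Hypothetical inputs: `sae` is a SparseAutoencoder and `activations` is a
# (n_samples, d_model) tensor of MLP activations collected from the toy model.
def train_sae(sae, activations, n_steps=1000, batch_size=256, lr=1e-3, l1_coeff=1e-3):
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(n_steps):
        idx = torch.randint(0, activations.shape[0], (batch_size,))
        x = activations[idx]
        x_hat, f = sae(x)
        loss = sae_loss(x, x_hat, f, l1_coeff)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

@torch.no_grad()
def sae_diagnostics(sae, activations):
    x_hat, f = sae(activations)
    l0 = (f > 0).float().sum(dim=-1).mean()          # avg. active features per sample (sparsity)
    recon_mse = (activations - x_hat).pow(2).mean()  # reconstruction loss
    dead = (f > 0).sum(dim=0) == 0                   # features that never fire → candidates for resampling
    return {"L0": l0.item(), "recon_mse": recon_mse.item(), "n_dead": int(dead.sum())}
```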