Yann's talk at SFI, about self-supervised learning.
Added: Jun 17, 2024
Slides: 66 pages
Towards Machines that can Understand, Reason & Plan
Santa Fe Institute workshop: AI and the Barrier of Meaning 2
Yann LeCun
Courant Institute & Center for Data Science, NYU
Meta – Fundamental AI Research
Generated with Make-A-Scene
Santa Fe Institute
2023-04-24
Y. LeCun
Machine Learning sucks! (compared to humans and animals)
Supervised learning (SL) requires large numbers of labeled samples.
Reinforcement learning (RL) requires insane amounts of trials.
SL/RL-trained ML systems:
are specialized and brittle
make “stupid” mistakes
do not reason or plan
Animals and humans:
Can learn new tasks very quickly.
Understand how the world works
Can reason and plan
Humans and animals have common sense; current machines don’t
Machine Learning sucks! (plain ML/DL, at least)
Machine Learning systems (most of them anyway)
Have a constant number of computational steps between input and output.
Do not reason.
Cannot plan.
Humans and some animals
Understand how the world works.
Can predict the consequences of their actions.
Can perform chains of reasoning with an unlimited number of steps.
Can plan complex tasks by decomposing them into sequences of subtasks.
Self-Supervised Learning has taken over the world
For understanding & generation of images, audio, text...
Self-Supervised Learning = Learning to Fill in the Blanks
time or space →
Reconstruct the input or Predict missing parts of the input.
Denoising Auto-Encoders
BERT [Devlin 2018], RoBERTa [Ott 2019]
[Figure: denoising auto-encoder. Labels: y, ŷ, ỹ, corruption, Encoder, learned representation, Decoder, cost C(y,ỹ).]
Corrupted input: “This is a [...] of text extracted [...] a large set of [...] articles”
Original text: “This is a piece of text extracted from a large set of news articles”
Figures: Alfredo Canziani
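The corruption step of a denoising auto-encoder can be sketched as random token masking. This is a minimal illustration only: the `corrupt` helper, the `[...]` placeholder (borrowed from the slide's example), and the masking probability are assumptions, not the tokenization or masking scheme of BERT or RoBERTa.

```python
import random

MASK = "[...]"  # placeholder taken from the slide's example; real models use a special token

def corrupt(tokens, mask_prob=0.3, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted sequence and a dict {position: original token},
    which is what the decoder would be trained to predict.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "This is a piece of text extracted from a large set of news articles".split()
corrupted, targets = corrupt(tokens)
print(" ".join(corrupted))
```

The training cost would then compare the decoder's output with the original tokens at the masked positions only.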
Generative AI Systems
Producing images, video, text
Auto-Regressive Generative Architectures
Outputs one “token” after another
Tokens may represent words, image patches, speech segments...
[Figure: an Encoder followed by a Stochastic Predictor. First step: the context x[t-3] x[t-2] x[t-1] x[t] (the prompt) yields the predicted token x[t+1]. Next step: the context x[t-2] x[t-1] x[t] x[t+1] yields x[t+2].]
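The token-by-token loop can be sketched with a toy stochastic predictor. Everything here is hypothetical: the `BIGRAMS` table stands in for a trained encoder/predictor, and it conditions only on the last token, whereas a real model conditions on the whole context.

```python
import random

# Hypothetical next-token distributions, keyed by the last token of the context.
BIGRAMS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def predict_next(context, rng):
    """Sample the next token from the predictor's output distribution."""
    dist = BIGRAMS[context[-1]]
    tokens, probs = zip(*dist.items())
    return rng.choices(tokens, weights=probs, k=1)[0]

def generate(prompt, max_tokens=10, seed=0):
    """Auto-regressive generation: each predicted token is appended to the context."""
    rng = random.Random(seed)
    seq = list(prompt)
    for _ in range(max_tokens):
        tok = predict_next(seq, rng)
        seq.append(tok)
        if tok == "</s>":
            break
    return seq

sample = generate(["<s>"])
print(sample)
```

Note that the amount of computation per generated token is fixed, which is the property the later slides criticize.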
Auto-Regressive Large Language Models (AR-LLMs)
Outputs one text token after another
Tokens may represent words or subwords
Encoder/predictor is a transformer architecture
With billions of parameters: typically from 1B to 500B
Training data: 1 to 2 trillion tokens
LLMs for dialog/text generation:
BlenderBot, Galactica, LLaMA (FAIR), Alpaca (Stanford), LaMDA/Bard (Google), Chinchilla (DeepMind), ChatGPT (OpenAI), GPT-4 ??…
Performance is amazing … but … they make stupid mistakes
Factual errors, logical errors, inconsistency, limited reasoning, toxicity...
LLMs have no knowledge of the underlying reality
They have no common sense & they can’t plan their answer
LLaMA & LLaMA-I 65B: code generation
[Figure: code generation samples from LLaMA 65B and from LLaMA-I 65B, the version fine-tuned to follow instructions.]
LLaMA 65B: fluent hallucination
[Figure: sample output; the human-provided prompt is in bold, followed by the generated text.]
What are Auto-Regressive LLMs Good For?
Auto-Regressive LLMs are good for
Writing assistance, first draft generation, stylistic polishing.
Code writing assistance
What they are not good for:
Producing factual and consistent answers (hallucinations!)
Taking into account recent information (posterior to the last training)
Behaving properly (they mimic behaviors from the training set)
Reasoning, planning, math
Using “tools”, such as search engines, calculators, database queries…
We are easily fooled by their fluency.
But they don’t know how the world works.
Unpopular Opinion about AR-LLMs
Auto-Regressive LLMs are doomed.
They cannot be made factual, non-toxic, etc.
They are not controllable
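Let e be the probability that any produced token takes us outside of the set of correct answers.
Probability that an answer of length n is correct (assuming token errors are independent):
P(correct) = (1-e)^n
The probability of staying correct decays exponentially with n: the generated sequence diverges from the set of correct answers.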
It’s not fixable (without a major redesign).
[Figure: tree of all possible token sequences, containing the tree of “correct” answers.]
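To get a feel for the numbers, the formula on this slide can be evaluated directly. The per-token error rate used below (1%) is an arbitrary illustrative value, and the calculation assumes that errors are independent across tokens.

```python
def p_correct(e: float, n: int) -> float:
    """P(correct) = (1 - e)^n, assuming independent per-token errors."""
    return (1.0 - e) ** n

# Illustrative error rate of 1% per token (an assumption, not a measured value).
for n in (10, 100, 1000):
    print(n, p_correct(0.01, n))
```

Under this assumption, even a small per-token error rate leaves little chance that a long answer stays inside the set of correct answers.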
Auto-Regressive Generative Models Suck!
AR-LLMs
Have a constant number of computational steps between input and output. Weak representational power.
Do not really reason. Do not really plan.
Humans and many animals
Understand how the world works.
Can predict the consequences of their actions.
Can perform chains of reasoning with an unlimited number of steps.
Can plan complex tasks by decomposing them into sequences of subtasks.
Limitations of LLMs
Auto-Regressive LLMs (at best) approximate the functions of the Wernicke and Broca areas in the brain.
What about the pre-frontal cortex?
arXiv:2206.10498, arXiv:2301.06627
How do humans and animals learn so quickly?
Not supervised. Not reinforced. At least not much.
How could machines learn like animals and humans?
[Figure (Emmanuel Dupoux): timeline of when infants acquire concepts, by age in months (0 to 14). Row labels: Perception, Production, Physics, Actions, Objects, Social, Communication. Milestones shown: stability, support; gravity, inertia, conservation of momentum; object permanence; solidity, rigidity; shape constancy; natural kind categories; face tracking; biological motion; rational, goal-directed actions; emotional contagion; false perceptual beliefs; helping vs hindering; proto-imitation; pointing; crawling; walking.]
How can babies learn how the world works?
How can teenagers learn to drive with 20h of practice?
How do Human and Animal Babies Learn?
How do they learn how the world works?
Largely by observation, with remarkably little interaction (initially).
They accumulate enormous amounts of background knowledge
About the structure of the world, like intuitive physics.
Perhaps common sense emerges from this knowledge?
Photos courtesy of
Emmanuel Dupoux
Three challenges for AI & Machine Learning
1. Learning representations and predictive models of the world
Supervised and reinforcement learning require too many samples/trials
Self-supervised learning / learning dependencies / to fill in the blanks
learning to represent the world in a non task-specific way
Learning predictive models for planning and control
2. Learning to reason, like Daniel Kahneman’s “System 2”
Beyond feed-forward, System 1 subconscious computation.
Making reasoning compatible with learning.
Reasoning and planning as energy minimization.
3. Learning to plan complex action sequences
Learning hierarchical representations of action plans
Towards Autonomous AI Systems
That can learn, reason, plan
“A path towards autonomous machine intelligence”
https://openreview.net/forum?id=BZ5a1r-kVsf
Technical talk:
search “Yann LeCun Berkeley” on YouTube
Modular Architecture for Autonomous AI
Configurator
Configures other modules for task
Perception
Estimates state of the world
World Model
Predicts future world states
Cost
Compute “discomfort”
Actor
Find optimal action sequences
Short-Term Memory
Stores state-cost episodes
[Figure: block diagram of the architecture. Blocks: configurator, Perception, World Model, Actor, Critic, Intrinsic cost, Cost, Short-term memory; input: percept; output: action.]
Modular Architecture for Autonomous AI
There is a long history of cognitive architectures in AI
See [Langley AAAI 2017] for a review.
CAPS (Thibadeau 83), SOAR (Laird 87), ACT-R (Anderson 93), Prodigy (Veloso 95), EPIC (Kieras and Meyer 97), CLARION (Sun and Zhang 2004), ICARUS (Langley 09)…
Mode-1 Perception-Action Cycle
Perception module s[0]=Enc(x)
Extract representation of the world
Policy module A(s[0])
Computes an action reactively
Cost module C(s[0])
Computes cost of state
Optionally:
World Model Pred(s,a)
Predicts future state
Stores states and costs in short-term memory
[Figure: s[0] feeds the Actor A(s), which outputs the action a[0]; Pred(s,a) predicts s[1]; the costs C(s[0]) and C(s[1]) are computed on the two states.]
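One pass through the Mode-1 cycle can be sketched as follows. All modules are hypothetical stand-ins (random linear maps and a quadratic cost), chosen only so the data flow s[0]=Enc(x), A(s[0]), C(s[0]), Pred(s,a) is runnable; they are not the networks used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(4, 8))   # Enc: observation (dim 8) -> state (dim 4)
W_pol = rng.normal(size=(2, 4))   # A: state (dim 4) -> action (dim 2)
A_dyn = 0.9 * np.eye(4)           # Pred: state transition (assumed linear)
B_dyn = rng.normal(size=(4, 2))   # Pred: effect of the action on the state
goal = np.zeros(4)                # hypothetical target state for the cost

def enc(x):
    return W_enc @ x

def policy(s):
    return np.tanh(W_pol @ s)

def cost(s):
    return float(np.sum((s - goal) ** 2))

def pred(s, a):
    return A_dyn @ s + B_dyn @ a

def mode1_step(x, memory):
    s0 = enc(x)                       # perception: extract state representation
    a0 = policy(s0)                   # policy: compute the action reactively
    memory.append((s0, cost(s0)))     # store state and cost in short-term memory
    s1 = pred(s0, a0)                 # optional: predict the next state
    memory.append((s1, cost(s1)))
    return a0

memory = []
a0 = mode1_step(rng.normal(size=8), memory)
print(a0, [c for _, c in memory])
```

No optimization happens here: the action is a single feed-forward computation, which is what distinguishes Mode-1 from Mode-2.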
Mode-2 Perception-Planning-Action Cycle
Akin to classical Model-Predictive Control (MPC)
Actor proposes an action sequence
World Model predicts outcome
Actor optimizes action sequence to minimize cost
e.g. using gradient descent, dynamic programming, MC tree search…
Actor sends first action(s) to effectors
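[Figure: the world model Pred(s,a) is unrolled from s[0] through s[t], s[t+1], ..., s[T-1] to s[T]. The Actor supplies the actions a[0], ..., a[t], a[t+1], ..., a[T-1], and a cost C(s[0]), ..., C(s[t]), C(s[t+1]), ..., C(s[T-1]), C(s[T]) is computed at each state.]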
[Henaff et al ICLR 19],[Hafner et al. ICML 19],[Chaplot et al. ICML 21],[Escontrela CoRL 22],...
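The planning loop can be sketched with gradient descent on the action sequence, one of the optimizers named on the slide. The world model here is a hypothetical linear system with a quadratic state cost, which keeps the gradient easy to write by hand; the matrices, horizon, and step size are all illustrative assumptions.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # hypothetical world model: s' = A s + B a
B = np.array([[0.0], [0.1]])

def rollout(s0, actions):
    """World model predicts the outcome of an action sequence."""
    states = [s0]
    for a in actions:
        states.append(A @ states[-1] + B @ a)
    return states

def total_cost(s0, actions):
    """Sum of C(s[t]) = ||s[t]||^2 over the predicted states s[1..T]."""
    return sum(float(s @ s) for s in rollout(s0, actions)[1:])

def cost_gradient(s0, actions):
    """Gradient of total_cost w.r.t. each action, by backpropagation through time."""
    states = rollout(s0, actions)
    grads = np.zeros_like(actions)
    lam = np.zeros_like(s0)               # d(cost)/d(state), propagated backward
    for t in reversed(range(len(actions))):
        lam = 2.0 * states[t + 1] + A.T @ lam
        grads[t] = B.T @ lam
    return grads

def plan(s0, horizon=20, steps=200, lr=0.05):
    """Actor optimizes the action sequence to minimize the predicted cost."""
    actions = np.zeros((horizon, 1))
    for _ in range(steps):
        actions -= lr * cost_gradient(s0, actions)
    return actions

s0 = np.array([1.0, 0.0])
actions = plan(s0)
first_action = actions[0]                 # only the first action goes to the effectors
print(total_cost(s0, np.zeros((20, 1))), total_cost(s0, actions))
```

As in classical MPC, the plan would be recomputed from the newly observed state after the first action is executed.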
Compiling Mode-2 into Mode-1
Akin to Amortized Inference
System performs Mode-2 cycle to get optimal action sequence.
Optimal actions used as targets to train the policy module A(s)
Policy module can be used for Mode-1 or to initialize Mode-2.
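[Figure: the Mode-2 rollout through the world model M(s,a) from s[0] through s[t] and s[t+1]. The policy outputs A(s[0]), A(s[t]), A(s[t+1]) are compared with the planned actions a[0], a[t], ..., a[T-1] through distance modules D; costs C(s[0]), C(s[t]), C(s[t+1]), C(s[T]) are computed on the states.]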
[Henaff et al. ICLR 2019] [Schrittwieser et al. MuZero 2020]
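The compilation step amounts to supervised learning with planner outputs as targets. In this minimal sketch the world model is a hypothetical 1-D system, the Mode-2 planner is replaced by a brute-force search over candidate actions, and the policy A(s) is a single linear weight fitted by least squares; none of these choices come from the talk.

```python
import numpy as np

def pred(s, a):
    return s + a                           # hypothetical 1-D world model

def cost(s, a):
    return pred(s, a) ** 2 + 0.1 * a ** 2  # cost of the predicted state plus action effort

def plan(s, candidates=np.linspace(-3.0, 3.0, 601)):
    """Mode-2 stand-in: search for the action with the lowest predicted cost."""
    return candidates[np.argmin(cost(s, candidates))]

rng = np.random.default_rng(0)
states = rng.uniform(-2.0, 2.0, size=200)
targets = np.array([plan(s) for s in states])   # optimal actions used as training targets

# Fit the policy A(s) = w * s by least squares on (state, planned action) pairs.
w = float(states @ targets / (states @ states))

def policy(s):
    return w * s

print(w, policy(1.0), plan(1.0))
```

Once trained, `policy` gives an action in one step (Mode-1), or a starting point for the planner's search (Mode-2).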
Cost Module
Intrinsic Cost (IC)
Immutable cost modules.
Hard-wired drives.
Trainable Cost (TC)
Trainable
Predicts future values of IC
Equivalent to a critic in RL
Implements subgoals
Configurable
All are differentiable
[Figure: the state s feeds the Intrinsic Cost (IC) modules IC1(s), IC2(s), ..., ICk(s) and the Trainable Cost / Critic (TC) modules TC1(s), TC2(s), ..., TCl(s).]
Building & Training
the World Model
Energy-Based Models
Joint-Embedding Architecture
Self-Supervised Learning = Learning to Fill in the Blanks
time or space →
Reconstruct the input or Predict missing parts of the input.
The world is stochastic
Training a system to make a single prediction makes it predict the average of all plausible predictions
Blurry predictions!
[Figure: a deterministic function G(x) maps x to a prediction of y; a divergence measure C compares the prediction with the observed y.]
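The averaging effect is easy to reproduce numerically. In this sketch, a hypothetical world produces y = -1 or y = +1 with equal probability for the same input, and a deterministic predictor (a single constant g, trained with squared error as the divergence measure) is fitted by gradient descent; the data and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two equally plausible outcomes for the same input x.
y = rng.choice([-1.0, 1.0], size=10_000)

g = 0.8                      # initial prediction, close to one of the two modes
for _ in range(200):
    # Gradient of the mean squared error mean((g - y)^2) with respect to g.
    g -= 0.1 * np.mean(2.0 * (g - y))

print(g, y.mean())
```

The fitted prediction ends up near 0, the average of the two outcomes, even though 0 itself is never observed. In pixel space this averaging shows up as blur.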
How do we represent uncertainty in the predictions?
The world is only partially predictable
How can a predictive model represent multiple predictions?
Probabilistic models are intractable in continuous domains.
Generative Models must predict every detail of the world
My solution: Joint-Embedding Predictive Architecture
[Henaff, Canziani, LeCun ICLR 2019] [Mathieu, Couprie, LeCun ICLR 2016]
Architectures: Generative vs Joint Embedding
Generative: predicts y (with all the details, including irrelevant ones)
Joint Embedding: predicts an abstract representation of y
a) Generative Architecture
Examples: VAE, MAE...
b) Joint Embedding Architecture
Joint Embedding Architectures
Computes abstract representations for x and y
Tries to make them equal or predictable from each other.
a) Joint Embedding Architecture (JEA)
Examples: Siamese Net, PIRL, MoCo, SimCLR, Barlow Twins, VICReg, ...
Architecture for the world model: JEPA
JEPA: Joint Embedding Predictive Architecture.
x: observed past and present
y: future
a: action
z: latent variable (unknown)
D( ): prediction cost
C( ): surrogate cost
JEPA predicts a representation of the future Sy from a representation of the past and present Sx
Energy-Based
Models
Capture dependencies
through an energy function
Energy-Based Models: Implicit function
The only way to formalize & understand all model types
Gives low energy to compatible pairs of x and y
Gives higher energy to incompatible pairs
[Figure: fill-in-the-blanks illustration (time or space →) and the energy landscape of F(x,y) plotted over x and y.]
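A toy energy function makes the "implicit function" idea concrete: instead of mapping x to a single y, the model scores pairs, and inference searches for the y values with the lowest energy. The energy below (low on the unit circle) and the grid search are hypothetical choices for illustration.

```python
import numpy as np

def energy(x, y):
    """Hypothetical energy F(x, y): low when (x, y) lies on the unit circle."""
    return (x ** 2 + y ** 2 - 1.0) ** 2

def infer(x, candidates=np.linspace(-1.5, 1.5, 3001), tol=1e-6):
    """Return every candidate y whose energy is within tol of the minimum."""
    e = energy(x, candidates)
    return candidates[e <= e.min() + tol]

ys = infer(0.6)
print(ys)
```

For x = 0.6 the search returns two compatible answers, one near -0.8 and one near +0.8, something a deterministic function y = G(x) cannot represent.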
Training Energy-Based Models: Collapse Prevention
A flexible energy surface can take any shape.
We need a loss function that shapes the energy surface so that:
Data points have low energies
Points outside the regions of high data density have higher energies.
[Figure panels: Collapse!, Contrastive Method, Regularized Methods]
EBM Training: two categories of methods
Contrastive methods
Push down on energy of training samples
Pull up on energy of suitably-generated contrastive samples
Scales very badly with dimension
Regularized Methods
Regularizer minimizes the volume of space that can take low energy
[Figure panels: Contrastive Method and Regularized Method, each plotted over x and y, showing the low energy region, the training samples, and the contrastive samples.]
Recommendations:
Abandon generative models
in favor of joint-embedding architectures
Abandon probabilistic models
in favor of energy-based models
Abandon contrastive methods
in favor of regularized methods
Abandon Reinforcement Learning
in favor of model-predictive control
Use RL only when planning doesn’t yield the predicted outcome, to adjust the world model or the critic.
Regularized Methods
for joint embedding
architectures
This is the cool stuff!
Push down on the energy of
compatible sample pairs
Maximize the information capacity
of representations
Training a JEPA non-contrastively
Four terms in the cost
Maximize information content in representation of x
Maximize information content in representation of y
Minimize prediction error
Minimize information content of latent variable z
VICReg: Variance, Invariance, Covariance Regularization
Variance:
Maintains variance of components of representations
Covariance:
Decorrelates components of representations (drives the off-diagonal terms of their covariance matrix toward zero)
Invariance:
Minimizes prediction error.
Barlow Twins [Zbontar et al. arXiv:2103.03230], VICReg [Bardes, Ponce, LeCun arXiv:2105.04906, ICLR 2022],
VICRegL [Bardes et al. NeurIPS 2022], MCR2 [Yu et al. NeurIPS 2020], [Ma, Tsao, Shum, 2022]
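The three terms can be sketched in NumPy as follows. This is a simplified illustration of the variance, invariance, and covariance terms as described on the slide, not the reference implementation; the function names, the term weights, and the `eps` constant are assumptions.

```python
import numpy as np

def off_diagonal(m):
    """Return the off-diagonal entries of a square matrix as a flat array."""
    return m[~np.eye(m.shape[0], dtype=bool)]

def vicreg_loss(zx, zy, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """zx, zy: (batch, dim) embeddings of the two branches."""
    n, d = zx.shape

    # Invariance: prediction error between the two embeddings.
    invariance = np.mean((zx - zy) ** 2)

    def variance(z):
        # Hinge that keeps the std of every component above 1.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance(z):
        # Penalize off-diagonal covariance terms to decorrelate components.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        return np.sum(off_diagonal(cov) ** 2) / d

    return (sim_w * invariance
            + var_w * (variance(zx) + variance(zy))
            + cov_w * (covariance(zx) + covariance(zy)))

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
collapsed = np.zeros((256, 8))
print(vicreg_loss(z, z), vicreg_loss(collapsed, collapsed))
```

The collapsed embeddings have zero prediction error, yet the variance term gives them a much higher loss than the spread-out ones, which is how the criterion prevents collapse without contrastive samples.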
VICReg: expander makes variables pairwise independent
[Mialon, Balestriero, LeCun arxiv:2209.14905]
VC criterion can be used for source separation / ICA
SSL-Pretrained Joint Embedding for Image Recognition
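[Figure, first panel: JEA pretrained with VICReg. Inputs x and y pass through the feature extractors FeX(x) and FeX(y) to give hx and hy, then through the projectors Proj(hx) and Proj(hy) into the costs.]
[Figure, second panel: training a supervised linear head. x passes through FeX(x) to give hx, followed by a linear classifier trained with cross entropy against the label (“polar bear”).]
[Other labels in the figure: d=2048, d=8192, ConvNext ConvNet.]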
VICReg: Results with linear head and semi-supervised.
VICReg: Results with transfer tasks.
VICRegL: local matching latent variable for segmentation
Latent variable optimization:
Finds a pairing between local feature vectors of the two images
[Bardes, Ponce, LeCun, NeurIPS 2022, arXiv:2210.01571]
Image-JEPA: uses masking, transformer, EMA weights
“SSL from images with a JEPA”
M. Assran et al. arXiv:2301.08243
Jointly embeds a context and a number of neighboring patches.
Uses predictors
Uses only masking
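The "EMA weights" in the slide title refer to a target encoder whose parameters are an exponential moving average of the trained encoder's parameters. A minimal sketch of that update, with hypothetical parameter dictionaries and an illustrative momentum value:

```python
import numpy as np

def ema_update(target, online, momentum=0.996):
    """Move each target parameter a small step toward the online parameter."""
    return {k: momentum * target[k] + (1.0 - momentum) * online[k] for k in target}

# Hypothetical parameters: the online encoder is trained by gradient descent,
# the target encoder is only ever updated through the moving average.
online = {"w": np.ones(3)}
target = {"w": np.zeros(3)}
target = ema_update(target, online, momentum=0.9)
print(target["w"])
```

Because the target changes slowly, it provides stable representations for the predictor to match.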
I-JEPA Results
Training is fast
Non-generative method seems to beat reconstruction-based methods (MAE)
I-JEPA Results on ImageNet
JEPA better than generative architecture on pixels.
Closing the gap with methods that use data augmentation
[Figure: results chart comparing methods with only masking (no data augmentation) and methods with data augmentation (similar to SimCLR).]
I-JEPA Results on ImageNet with 1% training
JEPA better than generative architecture on pixels.
Closing the gap with methods that use data augmentation
[Figure: results chart comparing methods with only masking and methods with data augmentation.]
Hierarchical JEPA
for
Hierarchical Planning
Control, planning, and policy learning.
Multi time-scale Predictions
Low-level representations can only predict in the short term.
Too many details
Prediction is hard
Higher-level representations can predict in the longer term.
Fewer details.
Prediction is easier
[Figure labels: JEPA-1, JEPA-2]
MC-JEPA: Motion & Content JEPA
Simultaneous SSL for
Image recognition
Motion estimation
Trained on
ImageNet 1k
Various video datasets
Uses VCReg to prevent
collapse
ConvNext-T backbone
MC-JEPA: Motion & Content JEPA
Motion estimation architecture uses a top-down hierarchical predictor that “warps” feature maps.
MC-JEPA: Optical Flow Estimation Results
Hierarchical Planning with Uncertainty
Hierarchical world model
Hierarchical planning
An action at level k specifies an objective for level k-1
Predictions at higher levels are more abstract and longer-range.
This type of planning/reasoning by minimizing a cost w.r.t. “action” variables is what’s missing from current architectures
Including LLMs, multimodal systems, learning robots,...
[Figure: three-level hierarchical world model. Labels: encoders Enc0(x), Enc1(x), Enc2(s[0]); initial states s0 initial, s1 initial, s2 initial; predictors Pred0, Pred1, Pred2; states s0, s1, s2; actions a0, a1; latent variables z1, z2; costs C(s0,a1), C(s1,a2), C(s2).]
Steps towards Autonomous AI Systems
Self-Supervised Learning
To learn representations of the world
To learn predictive models of the world
Handling uncertainty in predictions
Joint-embedding predictive architectures
Energy-Based Model framework
Learning world models from observation
Like animals and human babies?
Reasoning and planning
That is compatible with gradient-based learning
No symbols, no logic → vectors & continuous functions
Towards Human-Level Machine Intelligence
Self-Supervised Learning
Learning models of the world from observation
Learning to Reason and Plan
By learning to predict the consequences of actions
By being driven by objectives / costs
Will machines become more intelligent than humans? Yes, but not tomorrow.
Will machines have emotions, consciousness, moral sense? Almost certainly, yes.
Will they want to take over the world? No!
Conclusions
Common sense is a collection of world models
H-JEPA trained with SSL can learn hierarchical world models
Understanding is prediction using world models
Better mental models lead to better understanding
Reasoning/planning is using models to achieve goals
Designing intrinsic cost functions to drive learning
Cost functions for SSL drive the system to learn relevant representations.
Intrinsic cost functions for inference drive the behavior of the system
How to design & train costs to get the system to behave properly?
A Single, Configurable World Model Engine
What is the Configurator?
The configurator configures the agent for a deliberate (“conscious”) task.
Configures all other modules for the task at hand
Primes the perception module
Provides executive control
Sets subgoals
Configures the world model for the task.
There is a single world model engine
The system can only perform one “conscious” task at a time
Consciousness is a consequence of the single-world-model limitation
Questions
Can we get machines to learn like humans and animals?
SSL, H-JEPA, Energy-Based Models, new mathematics?
Will machines eventually reach human-level intelligence (HLAI)? YES!
We hear a lot about “Artificial General Intelligence”
But there is no such thing as general intelligence
Intelligence is always specialized, including human intelligence.
We should talk about rat-level, cat-level, or human-level AI (HLAI).
We will have machines with super-human intelligence
Yes, but it won’t happen tomorrow
Machines will not “take over the world”
This is a projection of human nature on machines
Intelligence is not correlated with a desire to dominate, even in humans.