Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
sergeykarayev
43 slides
Jan 29, 2021
About This Presentation
Review lecture of the fundamentals of Deep Learning.
Size: 4.81 MB
Language: en
Added: Jan 29, 2021
Slides: 43 pages
Slide Content
Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel Deep Learning Fundamentals
Full Stack Deep Learning - UC Berkeley Spring 2021 There is a lot here
•We assume this is mostly review for most of you.
•If not, try to go through http://neuralnetworksanddeeplearning.com
2
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
3
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
4
Full Stack Deep Learning - UC Berkeley Spring 2021 Single (Biological) Neuron
5
[image source: cs231n.stanford.edu]
Full Stack Deep Learning - UC Berkeley Spring 2021 Single (Artificial) Neuron
6
(figure: the weighted sum of inputs is passed through an activation function g)
[image source: cs231n.stanford.edu]
Full Stack Deep Learning - UC Berkeley Spring 2021 Common Activation Functions
7
[source: MIT 6.S191 introtodeeplearning.com]
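A minimal sketch (not from the slides) of a single artificial neuron y = g(w·x + b), with a few of the common activation choices for g shown above; the input, weight, and bias values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, g=np.tanh):
    # weighted sum of the inputs, then a nonlinearity g
    return g(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.3])   # weights (illustrative values)
b = 0.2                          # bias
print(neuron(x, w, b, sigmoid), neuron(x, w, b, relu), neuron(x, w, b, np.tanh))
```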
Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Network
8
Notation: the input x is mapped through successive layer activations z(1), z(2), z(3).
Choice of w determines the function from x --> y
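A minimal sketch of that notation in code: the input x is pushed through successive layers to produce z(1), z(2), z(3), and the weights (and biases) determine the resulting function. The layer sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]  # input dim, two hidden layers, output dim (illustrative)
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    z = x
    for W, b in zip(Ws, bs):
        z = np.tanh(W @ z + b)   # z(1), z(2), z(3) in turn
    return z

print(forward(np.array([0.5, -1.0, 2.0])))
```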
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
9
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
10
Full Stack Deep Learning - UC Berkeley Spring 2021 What Functions Can a Neural Net Represent?
11
[images source: neuralnetworksanddeeplearning.com]
Does there exist a choice for w to
make this work?
Full Stack Deep Learning - UC Berkeley Spring 2021 Universal Function Approximation Theorem
•In words: given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x).
12
Cybenko (1989) "Approximation by Superpositions of a Sigmoidal Function"
Hornik (1991) “Approximation Capabilities of Multilayer Feedforward Networks”
Leshno and Schocken (1991) "Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function"
Full Stack Deep Learning - UC Berkeley Spring 2021 Universal Function Approximation
13
Explore the idea interactively at
http://neuralnetworksanddeeplearning.com/chap4.html
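A minimal sketch of the same idea in code (assuming PyTorch, which the lecture mentions later): a network with a single hidden layer is trained to closely approximate a continuous target, here f(x) = sin(3x) on [-1, 1]; the target function and hidden width are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 256).unsqueeze(1)
y = torch.sin(3 * x)                          # the continuous function to approximate

net = nn.Sequential(nn.Linear(1, 64), nn.Sigmoid(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(2000):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())   # small residual error, given enough hidden units and steps
```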
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
14
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
15
Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
•Supervised Learning
•Unsupervised Learning
•Reinforcement Learning
•[also transfer learning, imitation learning, meta-learning, …]
16
Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
Unsupervised Learning
➔Unlabeled data X
➔Learn X
➔Generate fakes, insights
Supervised Learning
➔Labeled data X and Y
➔Learn X -> Y
➔Make Predictions
Reinforcement Learning
➔Learn how to take Actions in
an Environment
Commercially Viable
Today
"This product does what it is
supposed to. I always keep three
of these in my kitchen just in
case ever I need a replacement
cord."
"Hey Siri"
cat
Diagram courtesy of Shayne Miel
Up Next
Full Stack Deep Learning - UC Berkeley Spring 2021 Supervised Learning: X → Y
•Ex 1: Image Recognition
•X = pixel values
•Y = one hot vector encoding category
18
x y x y
[figure sources: https://en.wiktionary.org/wiki/cat; https://www.guidedogs.org/]
Full Stack Deep Learning - UC Berkeley Spring 2021 Supervised Learning: X → Y
•Ex 2: Speech Recognition
•X = sequence of pressure readings
•Y = sequence of one-hot encodings of words
19
[source: vitecinc.com]
Full Stack Deep Learning - UC Berkeley Spring 2021 Supervised Learning: X → Y
•Ex 3: Machine Translation
•X = sequence of one-hot encodings of words in first language
•Y = sequence of one-hot encodings of words in second language
20
Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
Unsupervised Learning
➔Unlabeled data X
➔Learn X
➔Generate fakes, insights
Supervised Learning
➔Labeled data X and Y
➔Learn X -> Y
➔Make Predictions
Reinforcement Learning
➔Learn how to take Actions in
an Environment
"This product does what it is
supposed to. I always keep three
of these in my kitchen just in
case ever I need a replacement
cord."
"Hey Siri"
cat
Diagram courtesy of Shayne Miel
Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 1: predict next character (charRNN)
22
[source: Tommy Mullaney]
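A minimal sketch (assuming PyTorch; the text and model sizes are made up) of the next-character setup: the input is a character sequence and the target is the same sequence shifted by one position, trained with cross-entropy over the character vocabulary.

```python
import torch
import torch.nn as nn

text = "hello deep learning"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

x, y = ids[:-1].unsqueeze(0), ids[1:].unsqueeze(0)   # predict the next character

embed = nn.Embedding(len(vocab), 16)
rnn = nn.GRU(16, 32, batch_first=True)
head = nn.Linear(32, len(vocab))

h, _ = rnn(embed(x))                                 # hidden state at each position
loss = nn.functional.cross_entropy(head(h).reshape(-1, len(vocab)), y.reshape(-1))
loss.backward()
print(loss.item())
```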
Full Stack Deep Learning - UC Berkeley Spring 2021 Example of next character prediction
23
[Radford et al, 2017]
This product does what it is
supposed to. I always keep
three of these in my
kitchen just in case ever I
need a replacement cord.
Great little item. Hard to
put on the crib without
some kind of
embellishment. My guess
is just like the screw kind
of attachment I had.
Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 2: predict nearby words (word2vec)
24
[Mikolov et al, 2013]
Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 3: predict next pixel (pixelCNN)
25
[van den Oord et al, 2016]
Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 4: Variational Autoencoder (VAE): X -> Z -> Xhat
26
[Kingma and Welling, 2014] [Figure: Kevin Frans]
(noise is introduced at the latent code Z)
Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 5: predict X that is indistinguishable from real X (GAN)
27
[Goodfellow et al, 2014] [Figure source: https://www.slideshare.net/xavigiro]
Full Stack Deep Learning - UC Berkeley Spring 2021 Example of Unsupervised Learning in Action
28
[Salimans, Goodfellow, Zaremba, Cheung, Radford & Chen, NIPS 2016]
Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
Unsupervised Learning
➔Unlabeled data X
➔Learn X
➔Generate fakes, insights
Supervised Learning
➔Labeled data X and Y
➔Learn X -> Y
➔Make Predictions
Reinforcement Learning
➔Learn how to take Actions in
an Environment
"This product does what it is
supposed to. I always keep three
of these in my kitchen just in
case ever I need a replacement
cord."
"Hey Siri"
cat
Diagram courtesy of Shayne Miel
Full Stack Deep Learning - UC Berkeley Spring 2021 Reinforcement Learning
•Learn to act to maximize reward
30
Full Stack Deep Learning - UC Berkeley Spring 2021 Reinforcement Learning Examples
31
Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
•Supervised Learning: learn X --> Y
•Unsupervised Learning: learn X
•Reinforcement Learning: learn to interact with environment
x_t -> a_t, x_{t+1} -> a_{t+1}, …
•[also transfer learning, imitation learning, meta-learning, …]
32
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
33
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
34
Full Stack Deep Learning - UC Berkeley Spring 2021 Linear Regression
•Why this line?
•“Best fit through data”
•Formally, minimize squared error: min_{w,b} Σ_i (w x_i + b − y_i)^2
•More generally, minimize loss L: min_{w,b} Σ_i L(f(x_i; w, b), y_i)
35
Empirical Risk Minimization = minimize loss L as measured (empirically) on the data
[Figure source: Wikipedia]
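A minimal sketch of empirical risk minimization for this linear regression case: pick the line that minimizes the average squared error measured on the data (the synthetic data below is only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=100)   # noisy line

X = np.column_stack([x, np.ones_like(x)])        # features plus intercept column
w, b = np.linalg.lstsq(X, y, rcond=None)[0]      # least-squares fit

risk = np.mean((X @ np.array([w, b]) - y) ** 2)  # empirical risk (MSE) on the data
print(w, b, risk)
```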
Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Net Regression
•Find w, b parameters that optimize loss:
•Typical losses:
•Mean Squared Error (MSE)
•Huber loss (= more robust to outliers)
36
[Figure source: Wikipedia]
Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Net Classification
•Find w, b parameters that optimize loss:
•Typical loss: cross-entropy
37
[figure source: Ritchie Ng]
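A minimal sketch (assuming PyTorch) of the classification setup: a small network outputs one score per class, and the cross-entropy loss compares those scores with integer class labels; the data and layer sizes are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4)                 # 8 examples, 4 input features
y = torch.randint(0, 3, (8,))         # integer labels for 3 classes

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()       # softmax + negative log-likelihood

loss = loss_fn(model(x), y)
loss.backward()                       # gradients of the loss w.r.t. w, b
print(loss.item())
```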
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
38
Full Stack Deep Learning - UC Berkeley Spring 2021 Optimizing the Loss
•Our goal: find w, b that minimize Σ_i L(f(x_i; w, b), y_i)
39
Full Stack Deep Learning - UC Berkeley Spring 2021 How to Improve One Parameter?
•Update w_i ← w_i − α ∂L/∂w_i (step one parameter against its partial derivative, scaled by the learning rate α)
40
Full Stack Deep Learning - UC Berkeley Spring 2021 How to Improve All Parameters?
•Gradient Descent (aka Steepest Descent): update all parameters at once, w ← w − α ∇_w L(w)
41
[figure source: neuralnetworksanddeeplearning.com (Nielsen)]
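A minimal sketch of the update rule on a toy two-parameter loss (chosen only for illustration): each step moves every parameter against its partial derivative, scaled by the learning rate α.

```python
import numpy as np

def loss(w):
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2   # deliberately poorly conditioned

def grad(w):
    return np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])

w, alpha = np.zeros(2), 0.05
for step in range(200):
    w = w - alpha * grad(w)           # w <- w - alpha * dL/dw

print(w, loss(w))                     # approaches the minimizer (3, -1)
```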
Full Stack Deep Learning - UC Berkeley Spring 2021 Conditioning
42
[figure source: neuralnetworksanddeeplearning.com (Nielsen)]
Full Stack Deep Learning - UC Berkeley Spring 2021 Conditioning
•Initialization (more later)
•Normalization
•Batch norm, weight norm, layer norm, … (more later)
•Second order methods:
•Exact:
•Newton’s method
•Natural gradient
•Approximate second order methods:
•Adagrad, Adam, Momentum
43
Full Stack Deep Learning - UC Berkeley Spring 2021 Sampling Schemes for Gradient Descent
•Gradient Descent
•Stochastic Gradient Descent
•Compute each gradient step on just a subset (“batch”) of data
•→ less compute per step
•→ noisier gradient estimate per step
•Overall: faster progress per amount of compute
44
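A minimal sketch of the stochastic variant: each update uses the gradient computed on a small random batch instead of the full dataset, trading a noisier step for far less compute per step (synthetic data, made-up hyperparameters).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10_000)
y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=10_000)

w, b, alpha, batch_size = 0.0, 0.0, 0.1, 32
for step in range(2000):
    idx = rng.integers(0, len(x), size=batch_size)   # sample a batch
    err = (w * x[idx] + b) - y[idx]
    w -= alpha * np.mean(2 * err * x[idx])           # gradient of the batch MSE w.r.t. w
    b -= alpha * np.mean(2 * err)                    # gradient of the batch MSE w.r.t. b

print(w, b)   # close to the true slope 2.0 and intercept 0.5
```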
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
45
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
46
Full Stack Deep Learning - UC Berkeley Spring 2021 Our Status
•We have reduced learning to optimizing a loss function: min_{w,b} Σ_i L(f(x_i; w, b), y_i)
•We can do this by (stochastic) gradient descent, which iterates w ← w − α ∇_w L
•How do we efficiently compute the gradients?
47
Full Stack Deep Learning - UC Berkeley Spring 2021 Gradients are just derivatives
•Derivatives tables:
48
■But a neural net f is never one of those?
■No problem: CHAIN RULE:
If f(x) = g(h(x)), then f'(x) = g'(h(x)) · h'(x)
[source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]
Full Stack Deep Learning - UC Berkeley Spring 2021 Automatic Differentiation
•Automatic differentiation software
•e.g. PyTorch, TensorFlow, Theano, Chainer, etc.
•Only need to program the function f(x,w).
•Software automatically computes all derivatives
•This is typically done by caching intermediate values during the forward pass of f, then running a backward pass = "backpropagation"
49
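A minimal sketch (assuming PyTorch) of what this looks like in practice: only the forward function is written by hand; backward() replays the cached forward computation in reverse to fill in the gradients.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([0.5, 3.0])

f = torch.tanh(w @ x).pow(2)   # any composition of differentiable operations
f.backward()                   # backward pass applies the chain rule automatically

print(w.grad)                  # gradient of f with respect to w
```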
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
50
Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Net Architectures
•“simplest”: sequence of fully connected layers:
51
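A minimal sketch (assuming PyTorch; the layer widths are arbitrary) of that simplest architecture, a stack of fully connected layers with nonlinearities in between.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # fully connected layer + activation
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),               # output layer, e.g. 10 class scores
)

x = torch.randn(32, 784)              # a batch of 32 flattened 28x28 images
print(mlp(x).shape)                   # torch.Size([32, 10])
```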
Full Stack Deep Learning - UC Berkeley Spring 2021 NN Architecture Considerations
•Data efficiency:
•Extremely large networks can represent anything (see "universal function approximation theorem") but might also need an extremely large amount of data to latch onto the right thing
•→ Encode prior knowledge into the architecture (see the sketch after this list), e.g.:
•Computer vision: Convolutional Networks = spatial translation invariance
•Sequence processing (e.g. NLP): Recurrent Networks = temporal invariance
•Optimization landscape / conditioning:
•Depth over Width, Skip connections, Batch / Weight / Layer Normalization
•Computational / Parameter efficiency
•Factorized convolutions
•Strided convolutions
52
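A minimal sketch (assuming PyTorch; sizes are illustrative) of encoding prior knowledge in the architecture: convolutions share weights across spatial positions (translation invariance), and a strided convolution downsamples cheaply.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # strided conv halves resolution
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),
)

x = torch.randn(8, 3, 32, 32)   # batch of 8 RGB 32x32 images
print(cnn(x).shape)             # torch.Size([8, 10])
```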
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
53
Full Stack Deep Learning - UC Berkeley Spring 2021 CUDA
•Lastly, why did the deep learning explosion really kick off around 2013?
•Bigger datasets are one part of the story
•Good libraries for matrix computations on GPUs
•Why are GPUs crucial?
•All NN computations are just matrix multiplications, which parallelize well across the many cores of a GPU.
54
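A minimal sketch (assuming a PyTorch build with CUDA available) of the point above: the same matrix multiplication runs unchanged on CPU or GPU, and the GPU spreads it across thousands of cores.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

c = a @ b                      # dense matmul, the core operation inside every layer
print(device, c.shape)
```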
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
55