Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021


About This Presentation

A review lecture on the fundamentals of deep learning.


Slide Content

Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel - Deep Learning Fundamentals

Full Stack Deep Learning - UC Berkeley Spring 2021 There is a lot here
•We assume this is mostly review for most of you.
•If not, try to go through http://neuralnetworksanddeeplearning.com
2

Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
3

Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
4

Full Stack Deep Learning - UC Berkeley Spring 2021 Single (Biological) Neuron
5
[image source: cs231n.stanford.edu]

Full Stack Deep Learning - UC Berkeley Spring 2021 Single (Artificial) Neuron
6
[image source: cs231n.stanford.edu]

Full Stack Deep Learning - UC Berkeley Spring 2021 Common Activation Functions
7
[source: MIT 6.S191 introtodeeplearning.com]
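
A minimal numpy sketch (not from the slides) of the artificial neuron above: the output is g(w·x + b), where g is one of the common activation functions; the specific weights and inputs are made up for illustration.

    import numpy as np

    # common activation functions g
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):
        return np.maximum(0.0, z)

    def neuron(x, w, b, g):
        # single artificial neuron: weighted sum of inputs plus bias, passed through g
        return g(np.dot(w, x) + b)

    x = np.array([0.5, -1.0, 2.0])   # inputs
    w = np.array([0.1, 0.3, -0.2])   # weights
    b = 0.05                         # bias
    print(neuron(x, w, b, sigmoid), neuron(x, w, b, relu), neuron(x, w, b, np.tanh))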

Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Network
8
[diagram: input x passes through successive layers z^(1), z^(2), z^(3)]
Notation: the choice of weights w determines the function from x --> y
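
A sketch of the forward pass this notation describes: the input x is transformed layer by layer into z^(1), z^(2), z^(3), and the choice of weights w (and biases b) fully determines the resulting function from x to y. The layer sizes and activation here are arbitrary choices.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    rng = np.random.default_rng(0)
    sizes = [4, 8, 8, 3]   # input dim, two hidden layers, output dim (made up)
    Ws = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(m) for m in sizes[1:]]

    def forward(x):
        z = x
        for W, b in zip(Ws, bs):   # z^(1), z^(2), z^(3): successive layer outputs
            z = relu(W @ z + b)
        return z                   # y, determined entirely by the choice of w

    print(forward(rng.normal(size=4)))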

Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
9

Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
10

Full Stack Deep Learning - UC Berkeley Spring 2021 What Functions Can a Neural Net Represent?
11
[images source: neuralnetworksanddeeplearning.com]
Does there exist a choice for w to
make this work?

Full Stack Deep Learning - UC Berkeley Spring 2021 Universal Function Approximation Theorem
•In words: given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x).
12
Cybenko (1989) “Approximations by superpositions of sigmoidal functions”
Hornik (1991) “Approximation Capabilities of Multilayer Feedforward Networks”
Leshno and Schocken (1991) “Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”

Full Stack Deep Learning - UC Berkeley Spring 2021 Universal Function Approximation
13
Explore the idea interactively at
http://neuralnetworksanddeeplearning.com/chap4.html
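
A hedged PyTorch sketch of the same idea: a 2-layer network with enough hidden units, trained by gradient descent, closely approximates a continuous target function. The target f(x) = sin(3x), the 128 hidden units, and the optimizer settings are illustrative choices, not from the lecture.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.linspace(-2, 2, 256).unsqueeze(1)
    y = torch.sin(3 * x)   # the continuous function to approximate

    net = nn.Sequential(nn.Linear(1, 128), nn.Tanh(), nn.Linear(128, 1))  # 2-layer net
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)

    for step in range(2000):
        loss = ((net(x) - y) ** 2).mean()   # squared error against f(x)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(loss.item())   # small: the hidden units jointly approximate sin(3x)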

Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
14

Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
15

Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
•Supervised Learning
•Unsupervised Learning
•Reinforcement Learning
•[also transfer learning, imitation learning, meta-learning, …]
16

Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
Unsupervised Learning
➔Unlabeled data X
➔Learn X
➔Generate fakes, insights
Supervised Learning
➔Labeled data X and Y
➔Learn X -> Y
➔Make Predictions
Reinforcement Learning
➔Learn how to take Actions in
an Environment
Commercially Viable Today
"This product does what it is
supposed to. I always keep three
of these in my kitchen just in
case ever I need a replacement
cord."
"Hey Siri"
cat
Diagram courtesy of Shayne Miel
Up Next

Full Stack Deep Learning - UC Berkeley Spring 2021 Supervised Learning: X --> Y
•Ex 1: Image Recognition
•X = pixel values
•Y = one hot vector encoding category
18
x y x y
[figure sources: https://en.wiktionary.org/wiki/cat; https://www.guidedogs.org/]
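
A small sketch of the "one hot vector encoding category" mentioned above; the class list is hypothetical.

    import numpy as np

    classes = ["cat", "dog", "other"]   # hypothetical label set

    def one_hot(label, classes):
        y = np.zeros(len(classes))
        y[classes.index(label)] = 1.0
        return y

    print(one_hot("cat", classes))   # [1. 0. 0.]
    print(one_hot("dog", classes))   # [0. 1. 0.]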

Full Stack Deep Learning - UC Berkeley Spring 2021 Supervised Learning: X --> Y
•Ex 2: Speech Recognition
•X = sequence of pressure readings
•Y = sequence of one-hot encodings of words
19
[source: vitecinc.com]

Full Stack Deep Learning - UC Berkeley Spring 2021 Supervised Learning: X --> Y
•Ex 3: Machine Translation
•X = sequence of one-hot encodings of words in first language
•Y = sequence of one-hot encodings of words in second language
20

Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
Unsupervised Learning
➔Unlabeled data X
➔Learn X
➔Generate fakes, insights
Supervised Learning
➔Labeled data X and Y
➔Learn X -> Y
➔Make Predictions
Reinforcement Learning
➔Learn how to take Actions in
an Environment
"This product does what it is
supposed to. I always keep three
of these in my kitchen just in
case ever I need a replacement
cord."
"Hey Siri"
cat
Diagram courtesy of Shayne Miel

Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 1: predict next character (charRNN)
22
[source: Tommy Mullaney]

Full Stack Deep Learning - UC Berkeley Spring 2021 Example of next character prediction
23
[Radford et al, 2017]
This product does what it is
supposed to. I always keep
three of these in my
kitchen just in case ever I
need a replacement cord.
Great little item. Hard to
put on the crib without
some kind of
embellishment. My guess
is just like the screw kind
of attachment I had.

Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 2: predict nearby words (word2vec)
24
[Mikolov et al, 2013]

Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 3: predict next pixel (pixelCNN)
25
[van den Oord et al, 2016]

Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 4: Variational Autoencoder (VAE): X -> Z -> Xhat
26
[Kingma and Welling, 2014] [Figure: Kevin Frans]
(noise is introduced at the latent code Z)

Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 5: predict X that is indistinguishable from real X (GAN)
27
[Goodfellow et al, 2015] [Figure source: https://www.slideshare.net/xavigiro]

Full Stack Deep Learning - UC Berkeley Spring 2021 Example of Unsupervised Learning in Action
28
[Salimans, Goodfellow, Zaremba, Cheung, Radford & Chen, NIPS 2016]

Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
Unsupervised Learning
➔Unlabeled data X
➔Learn X
➔Generate fakes, insights
Supervised Learning
➔Labeled data X and Y
➔Learn X -> Y
➔Make Predictions
Reinforcement Learning
➔Learn how to take Actions in
an Environment
"This product does what it is
supposed to. I always keep three
of these in my kitchen just in
case ever I need a replacement
cord."
"Hey Siri"
cat
Diagram courtesy of Shayne Miel

Full Stack Deep Learning - UC Berkeley Spring 2021 Reinforcement Learning
•Learn to act to maximize reward
30

Full Stack Deep Learning - UC Berkeley Spring 2021 Reinforcement Learning Examples
31

Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
•Supervised Learning: learn X --> Y
•Unsupervised Learning: learn X
•Reinforcement Learning: learn to interact with environment

x_t -> a_t, x_{t+1} -> a_{t+1}, …
•[also transfer learning, imitation learning, meta-learning, …]
32
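
A schematic sketch of the interaction pattern x_t -> a_t, x_{t+1} -> a_{t+1}, … described above. The environment and the hand-coded policy here are toy placeholders (in reinforcement learning the policy is what gets learned).

    import random

    class ToyEnv:
        # toy environment: the state is a number; moving it toward 0 is rewarded
        def reset(self):
            self.x = random.uniform(-5, 5)
            return self.x
        def step(self, a):   # a in {-1, +1}
            reward = 1.0 if abs(self.x + a) < abs(self.x) else -1.0
            self.x += a
            return self.x, reward, abs(self.x) < 0.5

    def policy(x):           # hand-coded stand-in for a learned policy
        return -1 if x > 0 else +1

    env, total = ToyEnv(), 0.0
    x = env.reset()
    for t in range(100):     # x_t -> a_t -> x_{t+1} -> a_{t+1} -> ...
        a = policy(x)
        x, r, done = env.step(a)
        total += r
        if done:
            break
    print("return:", total)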

Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
33

Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
34

Full Stack Deep Learning - UC Berkeley Spring 2021 Linear Regression
•Why this line?
•“Best fit through data”
•Formally, minimize the squared error: min over w, b of sum_i (y_i - (w x_i + b))^2
•More generally, minimize a loss L: min over w, b of sum_i L(f(x_i; w, b), y_i)
35
Empirical Risk Minimization = minimize loss L as measured (empirically) on the data
[Figure source: Wikipedia]

Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Net Regression
•Find parameters w, b that minimize the loss on the training data
•Typical losses:
•Mean Squared Error (MSE)
•Huber loss (= more robust to outliers)
36
[Figure source: Wikipedia]
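
A short sketch of the two regression losses named above, with the Huber loss written out explicitly to show why it is more robust to outliers; the numbers are arbitrary.

    import torch

    pred   = torch.tensor([2.5, 0.0, 2.0, 8.0])
    target = torch.tensor([3.0, -0.5, 2.0, 0.0])   # the last point is an "outlier"

    mse = ((pred - target) ** 2).mean()            # Mean Squared Error

    def huber(pred, target, delta=1.0):
        # quadratic near zero, linear for large errors -> less sensitive to outliers
        err = (pred - target).abs()
        return torch.where(err <= delta,
                           0.5 * err ** 2,
                           delta * (err - 0.5 * delta)).mean()

    print(mse.item(), huber(pred, target).item())  # Huber penalizes the outlier far less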

Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Net Classification
•Find parameters w, b that minimize the loss on the training data
•Typical loss: cross-entropy
37
[figure source: Ritchie Ng]
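
A sketch of the cross-entropy loss for classification using PyTorch's built-in nn.CrossEntropyLoss, which takes raw class scores (logits) and integer labels; the shapes and values are illustrative.

    import torch
    import torch.nn as nn

    logits = torch.tensor([[2.0, 0.5, -1.0],     # raw scores for 3 classes, 2 examples
                           [0.1, 0.2,  3.0]])
    labels = torch.tensor([0, 2])                # correct class indices

    loss = nn.CrossEntropyLoss()(logits, labels) # softmax + negative log-likelihood
    print(loss.item())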

Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
38

Full Stack Deep Learning - UC Berkeley Spring 2021 Optimizing the Loss
•Our goal: find w, b that minimize the loss L(w, b) on the training data
39

Full Stack Deep Learning - UC Berkeley Spring 2021 How to Improve One Parameter?
•Update it in the direction that decreases the loss: w_j <- w_j - alpha * dL/dw_j, where alpha is a small step size (the learning rate)
40

Full Stack Deep Learning - UC Berkeley Spring 2021 How to Improve All Parameters?
•Gradient Descent (aka Steepest Descent): update every parameter at once, w <- w - alpha * grad_w L
41
[figure source: neuralnetworksanddeeplearning.com (Nielsen)]
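
A from-scratch numpy sketch of the update above, applied to the squared-error loss from the linear-regression slide; the data, learning rate, and step count are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=100)
    y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=100)   # noisy points around a line

    w, b, alpha = 0.0, 0.0, 0.1
    for step in range(500):
        pred = w * x + b
        dw = 2 * np.mean((pred - y) * x)   # gradient of mean squared error w.r.t. w
        db = 2 * np.mean(pred - y)         # ... and w.r.t. b
        w -= alpha * dw                    # step each parameter against its gradient
        b -= alpha * db

    print(w, b)   # close to the true slope 2.0 and intercept 0.5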

Full Stack Deep Learning - UC Berkeley Spring 2021 Conditioning
42
[figure source: neuralnetworksanddeeplearning.com (Nielsen)]

Full Stack Deep Learning - UC Berkeley Spring 2021 Conditioning
•Initialization (more later)
•Normalization
•Batch norm, weight norm, layer norm, … (more later)
•Second order methods:
•Exact:
•Newton’s method
•Natural gradient
•Approximate second order methods:
•Adagrad, Adam, Momentum
43
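
The approximate methods listed above ship as drop-in optimizers in PyTorch; a hedged sketch of swapping plain gradient descent for momentum or Adam (the model and batch below are placeholders).

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                          # placeholder model
    x, y = torch.randn(32, 10), torch.randn(32, 1)    # placeholder batch

    opt = torch.optim.SGD(model.parameters(), lr=1e-2)                   # plain gradient descent
    # opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)   # + momentum
    # opt = torch.optim.Adam(model.parameters(), lr=1e-3)                # adaptive per-parameter steps

    for step in range(100):                           # the loop is identical either way
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(loss.item())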

Full Stack Deep Learning - UC Berkeley Spring 2021 Sampling Schemes for Gradient Descent
•Gradient Descent
•Stochastic Gradient Descent
•Compute each gradient step on just a subset (“batch”) of data
•--> less compute per step
•--> noisier gradient estimate per step
•Overall: faster progress per amount of compute
44
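
A sketch of the sampling scheme: each step estimates the gradient on a randomly drawn "batch" instead of the full dataset, so steps are cheaper but noisier; dataset and batch sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=10_000)
    y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=10_000)

    w, b, alpha, batch_size = 0.0, 0.0, 0.1, 64
    for step in range(1000):
        idx = rng.integers(0, len(x), size=batch_size)   # random subset ("batch")
        xb, yb = x[idx], y[idx]
        pred = w * xb + b
        dw = 2 * np.mean((pred - yb) * xb)   # noisy but cheap gradient estimate
        db = 2 * np.mean(pred - yb)
        w -= alpha * dw
        b -= alpha * db

    print(w, b)   # still converges near 2.0 and 0.5, at far less compute per step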

Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
45

Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
46

Full Stack Deep Learning - UC Berkeley Spring 2021 Our Status
•We have reduced learning to optimizing a loss L(w, b) over the training data.
•We can do this with (stochastic) gradient descent, which iterates w <- w - alpha * grad_w L.
•How do we efficiently compute the gradients?
47

Full Stack Deep Learning - UC Berkeley Spring 2021 Gradients are just derivatives
•Derivatives tables:
48
■But a neural net f is never one of those?
■No problem: CHAIN RULE:
If f(x) = g(h(x)),
then f'(x) = g'(h(x)) · h'(x)
[source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]
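
A tiny worked instance of the chain rule above: for f(w) = sigmoid(w * x), the derivative is sigmoid'(w * x) * x, checked here against a finite-difference estimate; the function is chosen only for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x, w = 1.5, 0.3

    # chain rule: d/dw sigmoid(w*x) = sigmoid'(w*x) * x, with sigmoid' = sigmoid * (1 - sigmoid)
    s = sigmoid(w * x)
    grad_chain = s * (1 - s) * x

    eps = 1e-6   # numerical check
    grad_numeric = (sigmoid((w + eps) * x) - sigmoid((w - eps) * x)) / (2 * eps)
    print(grad_chain, grad_numeric)   # agree to ~6 decimal places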

Full Stack Deep Learning - UC Berkeley Spring 2021 Automatic Differentiation
•Automatic differentiation software
•e.g. PyTorch, TensorFlow, Theano, Chainer, etc.
•You only need to program the function f(x, w).
•The software automatically computes all derivatives.
•This is typically done by caching information during the forward computation pass of f and then doing a backward pass (“backpropagation”).
49
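
A minimal PyTorch autograd sketch of the point above: you only write the forward function, and backward() caches the forward pass and backpropagates to fill in the gradients automatically. The toy function is the same sigmoid(w * x) used in the chain-rule example.

    import torch

    x = torch.tensor(1.5)
    w = torch.tensor(0.3, requires_grad=True)

    f = torch.sigmoid(w * x)   # forward pass: just program f(x, w)
    f.backward()               # backward pass: derivatives computed automatically

    print(w.grad)              # matches the hand-derived chain-rule value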

Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
50

Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Net Architectures
•“simplest”: sequence of fully connected layers:
51
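
A PyTorch sketch of the "simplest" architecture above: a plain sequence of fully connected (Linear) layers with nonlinearities between them; the layer sizes are placeholders.

    import torch.nn as nn

    mlp = nn.Sequential(
        nn.Linear(784, 256), nn.ReLU(),   # fully connected layer + activation
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 10),               # output layer (e.g. 10 class scores)
    )
    print(mlp)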

Full Stack Deep Learning - UC Berkeley Spring 2021 NN Architecture Considerations
•Data efficiency:
•Extremely large networks can represent anything (see the universal function approximation theorem) but may also need an extremely large amount of data to latch onto the right thing
•--> Encode prior knowledge into the architecture (see the sketch after this slide), e.g.:
•Computer vision: Convolutional Networks = spatial translation invariance
•Sequence processing (e.g. NLP): Recurrent Networks = temporal invariance
•Optimization landscape / conditioning:
•Depth over Width, Skip connections, Batch / Weight / Layer Normalization
•Computational / Parameter efficiency
•Factorized convolutions
•Strided convolutions
52
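
A hedged sketch of two of the considerations above: a convolutional layer encodes spatial translation invariance by sharing weights across positions, and a stride-2 convolution halves the spatial resolution to save downstream compute. The layer sizes are invented for illustration.

    import torch
    import torch.nn as nn

    convnet = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),   # weights shared across positions
        nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # stride 2: halves height and width
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 10),
    )
    x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
    print(convnet(x).shape)         # torch.Size([1, 10])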

Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
53

Full Stack Deep Learning - UC Berkeley Spring 2021 CUDA
•Lastly, why did the deep learning explosion really kick off around 2013?
•Bigger datasets are one part of the story
•Good libraries for matrix computations on GPUs
•Why are GPUs crucial?
•All NN computations are just matrix multiplications, which are well
parallelized onto the many cores of a GPU.
54
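
A sketch of the point about GPUs: the same matrix multiplication runs unchanged on CPU or GPU in PyTorch, and only the device the tensors live on changes (this assumes a CUDA-capable GPU; otherwise it falls back to CPU).

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)
    c = a @ b                        # parallelized across the GPU's many cores
    print(device, c.shape)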

Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
55