Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
sergeykarayev
43 slides
Jan 29, 2021
About This Presentation
Review lecture of the fundamentals of Deep Learning.
Size: 4.81 MB
Language: en
Added: Jan 29, 2021
Slides: 43 pages
Slide Content
Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel Deep Learning Fundamentals
Full Stack Deep Learning - UC Berkeley Spring 2021 There is a lot here
•We assume this is mostly review for most of you.
•If not, try to go through http://neuralnetworksanddeeplearning.com
2
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
3
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
4
Full Stack Deep Learning - UC Berkeley Spring 2021 Single (Biological) Neuron
5
[image source: cs231n.stanford.edu]
Full Stack Deep Learning - UC Berkeley Spring 2021 Single (Artificial) Neuron
6
(figure: the weighted sum of inputs is passed through an activation function g)
[image source: cs231n.stanford.edu]
Full Stack Deep Learning - UC Berkeley Spring 2021 Common Activation Functions
7
[source: MIT 6.S191 introtodeeplearning.com]
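A minimal sketch (not from the slides) of a single artificial neuron y = g(w·x + b), with a few of the common activation choices for g shown above; the input, weight, and bias values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, g=np.tanh):
    # weighted sum of the inputs, then a nonlinearity g
    return g(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.3])   # weights (illustrative values)
b = 0.2                          # bias
print(neuron(x, w, b, sigmoid), neuron(x, w, b, relu), neuron(x, w, b, np.tanh))
```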
Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Network
8
Notation: the input x is mapped through successive layer activations z(1), z(2), z(3).
Choice of w determines the function from x --> y
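A minimal sketch of that notation in code: the input x is pushed through successive layers to produce z(1), z(2), z(3), and the weights (and biases) determine the resulting function. The layer sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]  # input dim, two hidden layers, output dim (illustrative)
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    z = x
    for W, b in zip(Ws, bs):
        z = np.tanh(W @ z + b)   # z(1), z(2), z(3) in turn
    return z

print(forward(np.array([0.5, -1.0, 2.0])))
```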
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
9
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
10
Full Stack Deep Learning - UC Berkeley Spring 2021 What Functions Can a Neural Net Represent?
11
[images source: neuralnetworksanddeeplearning.com]
Does there exist a choice for w to
make this work?
Full Stack Deep Learning - UC Berkeley Spring 2021 Universal Function Approximation Theorem
•In words: given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x).
12
Cybenko (1989) "Approximation by Superpositions of a Sigmoidal Function"
Hornik (1991) “Approximation Capabilities of Multilayer Feedforward Networks”
Leshno and Schocken (1991) "Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function"
Full Stack Deep Learning - UC Berkeley Spring 2021 Universal Function Approximation
13
Explore the idea interactively at
http://neuralnetworksanddeeplearning.com/chap4.html
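A minimal sketch of the same idea in code (assuming PyTorch, which the lecture mentions later): a network with a single hidden layer is trained to closely approximate a continuous target, here f(x) = sin(3x) on [-1, 1]; the target function and hidden width are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 256).unsqueeze(1)
y = torch.sin(3 * x)                          # the continuous function to approximate

net = nn.Sequential(nn.Linear(1, 64), nn.Sigmoid(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(2000):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())   # small residual error, given enough hidden units and steps
```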
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
14
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
15
Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
•Supervised Learning
•Unsupervised Learning
•Reinforcement Learning
•[also transfer learning, imitation learning, meta-learning, …]
16
Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
Unsupervised Learning
➔Unlabeled data X
➔Learn X
➔Generate fakes, insights
Supervised Learning
➔Labeled data X and Y
➔Learn X -> Y
➔Make Predictions
Reinforcement Learning
➔Learn how to take Actions in
an Environment
Commercially Viable
Today
"This product does what it is
supposed to. I always keep three
of these in my kitchen just in
case ever I need a replacement
cord."
"Hey Siri"
cat
Diagram courtesy of Shayne Miel
Up Next
Full Stack Deep Learning - UC Berkeley Spring 2021 Supervised Learning: X → Y
•Ex 1: Image Recognition
•X = pixel values
•Y = one hot vector encoding category
18
x y x y
[figure sources: https://en.wiktionary.org/wiki/cat; https://www.guidedogs.org/]
Full Stack Deep Learning - UC Berkeley Spring 2021 Supervised Learning: X → Y
•Ex 2: Speech Recognition
•X = sequence of pressure readings
•Y = sequence of one-hot encodings of words
19
[source: vitecinc.com]
Full Stack Deep Learning - UC Berkeley Spring 2021 Supervised Learning: X → Y
•Ex 3: Machine Translation
•X = sequence of one-hot encodings of words in first language
•Y = sequence of one-hot encodings of words in second language
20
Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
Unsupervised Learning
➔Unlabeled data X
➔Learn X
➔Generate fakes, insights
Supervised Learning
➔Labeled data X and Y
➔Learn X -> Y
➔Make Predictions
Reinforcement Learning
➔Learn how to take Actions in
an Environment
"This product does what it is
supposed to. I always keep three
of these in my kitchen just in
case ever I need a replacement
cord."
"Hey Siri"
cat
Diagram courtesy of Shayne Miel
Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 1: predict next character (charRNN)
22
[source: Tommy Mullaney]
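A minimal sketch (assuming PyTorch; the text and model sizes are made up) of the next-character setup: the input is a character sequence and the target is the same sequence shifted by one position, trained with cross-entropy over the character vocabulary.

```python
import torch
import torch.nn as nn

text = "hello deep learning"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

x, y = ids[:-1].unsqueeze(0), ids[1:].unsqueeze(0)   # predict the next character

embed = nn.Embedding(len(vocab), 16)
rnn = nn.GRU(16, 32, batch_first=True)
head = nn.Linear(32, len(vocab))

h, _ = rnn(embed(x))                                 # hidden state at each position
loss = nn.functional.cross_entropy(head(h).reshape(-1, len(vocab)), y.reshape(-1))
loss.backward()
print(loss.item())
```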
Full Stack Deep Learning - UC Berkeley Spring 2021 Example of next character prediction
23
[Radford et al, 2017]
This product does what it is
supposed to. I always keep
three of these in my
kitchen just in case ever I
need a replacement cord.
Great little item. Hard to
put on the crib without
some kind of
embellishment. My guess
is just like the screw kind
of attachment I had.
Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 2: predict nearby words (word2vec)
24
[Mikolov et al, 2013]
Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 3: predict next pixel (pixelCNN)
25
[van den Oord et al, 2016]
Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 4: Variational Autoencoder (VAE): X -> Z -> Xhat
26
[Kingma and Welling, 2014] [Figure: Kevin Frans]
(noise is introduced at the latent code Z)
Full Stack Deep Learning - UC Berkeley Spring 2021 Unsupervised Learning: X
•Ex. 5: predict X that is indistinguishable from real X (GAN)
27
[Goodfellow et al, 2014] [Figure source: https://www.slideshare.net/xavigiro]
Full Stack Deep Learning - UC Berkeley Spring 2021 Example of Unsupervised Learning in Action
28
[Salimans, Goodfellow, Zaremba, Cheung, Radford & Chen, NIPS 2016]
Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
Unsupervised Learning
➔Unlabeled data X
➔Learn X
➔Generate fakes, insights
Supervised Learning
➔Labeled data X and Y
➔Learn X -> Y
➔Make Predictions
Reinforcement Learning
➔Learn how to take Actions in
an Environment
"This product does what it is
supposed to. I always keep three
of these in my kitchen just in
case ever I need a replacement
cord."
"Hey Siri"
cat
Diagram courtesy of Shayne Miel
Full Stack Deep Learning - UC Berkeley Spring 2021 Reinforcement Learning
•Learn to act to maximize reward
30
Full Stack Deep Learning - UC Berkeley Spring 2021 Reinforcement Learning Examples
31
Full Stack Deep Learning - UC Berkeley Spring 2021 Types of Learning Problems
•Supervised Learning: learn X --> Y
•Unsupervised Learning: learn X
•Reinforcement Learning: learn to interact with environment
x_t -> a_t, x_{t+1} -> a_{t+1}, …
•[also transfer learning, imitation learning, meta-learning, …]
32
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
33
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
34
Full Stack Deep Learning - UC Berkeley Spring 2021 Linear Regression
•Why this line?
•“Best fit through data”
•Formally, minimize squared error: min_{w,b} Σ_i (w x_i + b − y_i)^2
•More generally, minimize loss L: min_{w,b} Σ_i L(f(x_i; w, b), y_i)
35
Empirical Risk Minimization = minimize loss L as measured (empirically) on the data
[Figure source: Wikipedia]
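A minimal sketch of empirical risk minimization for this linear regression case: pick the line that minimizes the average squared error measured on the data (the synthetic data below is only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=100)   # noisy line

X = np.column_stack([x, np.ones_like(x)])        # features plus intercept column
w, b = np.linalg.lstsq(X, y, rcond=None)[0]      # least-squares fit

risk = np.mean((X @ np.array([w, b]) - y) ** 2)  # empirical risk (MSE) on the data
print(w, b, risk)
```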
Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Net Regression
•Find w, b parameters that optimize loss:
•Typical losses:
•Mean Squared Error (MSE)
•Huber loss (= more robust to outliers)
36
[Figure source: Wikipedia]
Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Net Classification
•Find w, b parameters that optimize loss:
•Typical loss: cross-entropy
37
[figure source: Ritchie Ng]
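A minimal sketch (assuming PyTorch) of the classification setup: a small network outputs one score per class, and the cross-entropy loss compares those scores with integer class labels; the data and layer sizes are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4)                 # 8 examples, 4 input features
y = torch.randint(0, 3, (8,))         # integer labels for 3 classes

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()       # softmax + negative log-likelihood

loss = loss_fn(model(x), y)
loss.backward()                       # gradients of the loss w.r.t. w, b
print(loss.item())
```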
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
38
Full Stack Deep Learning - UC Berkeley Spring 2021 Optimizing the Loss
•Our goal: find w, b that minimize Σ_i L(f(x_i; w, b), y_i)
39
Full Stack Deep Learning - UC Berkeley Spring 2021 How to Improve One Parameter?
•Update w_i ← w_i − α ∂L/∂w_i (step one parameter against its partial derivative, scaled by the learning rate α)
40
Full Stack Deep Learning - UC Berkeley Spring 2021 How to Improve All Parameters?
•Gradient Descent (aka Steepest Descent): update all parameters at once, w ← w − α ∇_w L(w)
41
[figure source: neuralnetworksanddeeplearning.com (Nielsen)]
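A minimal sketch of the update rule on a toy two-parameter loss (chosen only for illustration): each step moves every parameter against its partial derivative, scaled by the learning rate α.

```python
import numpy as np

def loss(w):
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2   # deliberately poorly conditioned

def grad(w):
    return np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])

w, alpha = np.zeros(2), 0.05
for step in range(200):
    w = w - alpha * grad(w)           # w <- w - alpha * dL/dw

print(w, loss(w))                     # approaches the minimizer (3, -1)
```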
Full Stack Deep Learning - UC Berkeley Spring 2021 Conditioning
42
[figure source: neuralnetworksanddeeplearning.com (Nielsen)]
Full Stack Deep Learning - UC Berkeley Spring 2021 Conditioning
•Initialization (more later)
•Normalization
•Batch norm, weight norm, layer norm, … (more later)
•Second order methods:
•Exact:
•Newton’s method
•Natural gradient
•Approximate second order methods:
•Adagrad, Adam, Momentum
43
Full Stack Deep Learning - UC Berkeley Spring 2021 Sampling Schemes for Gradient Descent
•Gradient Descent
•Stochastic Gradient Descent
•Compute each gradient step on just a subset (“batch”) of data
•→ less compute per step
•→ noisier gradient estimate per step
•Overall: faster progress per amount of compute
44
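A minimal sketch of the stochastic variant: each update uses the gradient computed on a small random batch instead of the full dataset, trading a noisier step for far less compute per step (synthetic data, made-up hyperparameters).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10_000)
y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=10_000)

w, b, alpha, batch_size = 0.0, 0.0, 0.1, 32
for step in range(2000):
    idx = rng.integers(0, len(x), size=batch_size)   # sample a batch
    err = (w * x[idx] + b) - y[idx]
    w -= alpha * np.mean(2 * err * x[idx])           # gradient of the batch MSE w.r.t. w
    b -= alpha * np.mean(2 * err)                    # gradient of the batch MSE w.r.t. b

print(w, b)   # close to the true slope 2.0 and intercept 0.5
```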
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
45
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
46
Full Stack Deep Learning - UC Berkeley Spring 2021 Our Status
•We have reduced learning to optimizing a loss function: min_{w,b} Σ_i L(f(x_i; w, b), y_i)
•We can do this by (stochastic) gradient descent, which iterates w ← w − α ∇_w L
•How do we efficiently compute the gradients?
47
Full Stack Deep Learning - UC Berkeley Spring 2021 Gradients are just derivatives
•Derivatives tables:
48
■But a neural net f is never one of those?
■No problem: CHAIN RULE:
If f(x) = g(h(x)), then f'(x) = g'(h(x)) · h'(x)
[source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]
Full Stack Deep Learning - UC Berkeley Spring 2021 Automatic Differentiation
•Automatic differentiation software
•e.g. PyTorch, TensorFlow, Theano, Chainer, etc.
•Only need to program the function f(x,w).
•Software automatically computes all derivatives
•This is typically done by caching intermediate values during the forward pass of f, then running a backward pass = "backpropagation"
49
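A minimal sketch (assuming PyTorch) of what this looks like in practice: only the forward function is written by hand; backward() replays the cached forward computation in reverse to fill in the gradients.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([0.5, 3.0])

f = torch.tanh(w @ x).pow(2)   # any composition of differentiable operations
f.backward()                   # backward pass applies the chain rule automatically

print(w.grad)                  # gradient of f with respect to w
```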
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
50
Full Stack Deep Learning - UC Berkeley Spring 2021 Neural Net Architectures
•“simplest”: sequence of fully connected layers:
51
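A minimal sketch (assuming PyTorch; the layer widths are arbitrary) of that simplest architecture, a stack of fully connected layers with nonlinearities in between.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # fully connected layer + activation
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),               # output layer, e.g. 10 class scores
)

x = torch.randn(32, 784)              # a batch of 32 flattened 28x28 images
print(mlp(x).shape)                   # torch.Size([32, 10])
```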
Full Stack Deep Learning - UC Berkeley Spring 2021 NN Architecture Considerations
•Data efficiency:
•Extremely large networks can represent anything (see "universal function approximation theorem") but might also need an extremely large amount of data to latch onto the right thing
•→ Encode prior knowledge into the architecture (see the sketch after this list), e.g.:
•Computer vision: Convolutional Networks = spatial translation invariance
•Sequence processing (e.g. NLP): Recurrent Networks = temporal invariance
•Optimization landscape / conditioning:
•Depth over Width, Skip connections, Batch / Weight / Layer Normalization
•Computational / Parameter efficiency
•Factorized convolutions
•Strided convolutions
52
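A minimal sketch (assuming PyTorch; sizes are illustrative) of encoding prior knowledge in the architecture: convolutions share weights across spatial positions (translation invariance), and a strided convolution downsamples cheaply.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # strided conv halves resolution
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),
)

x = torch.randn(8, 3, 32, 32)   # batch of 8 RGB 32x32 images
print(cnn(x).shape)             # torch.Size([8, 10])
```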
Full Stack Deep Learning - UC Berkeley Spring 2021 Outline
•Neural Networks
•Universality
•Learning Problems
•Empirical Risk Minimization / Loss Functions
•Gradient Descent
•Backpropagation / Automatic Differentiation
•Architectural Considerations (deep / conv / rnn)
•CUDA / Cores of Compute
53
Full Stack Deep Learning - UC Berkeley Spring 2021 CUDA
•Lastly, why did the deep learning explosion really kick off around 2013?
•Bigger datasets are one part of the story
•Good libraries for matrix computations on GPUs
•Why are GPUs crucial?
•All NN computations are just matrix multiplications, which parallelize well across the many cores of a GPU.
54
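A minimal sketch (assuming a PyTorch build with CUDA available) of the point above: the same matrix multiplication runs unchanged on CPU or GPU, and the GPU spreads it across thousands of cores.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

c = a @ b                      # dense matmul, the core operation inside every layer
print(device, c.shape)
```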
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
55