Gradient-Based Learning Applied to Document Recognition
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
Proceedings of the IEEE, 86(11):2278-2324, November 1998
Size: 2.06 MB
Language: en
Added: Jun 22, 2019
Slides: 24 pages
Slide Content
Gradient-Based Learning Applied to Document Recognition
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
Proceedings of the IEEE, 86(11):2278-2324, November 1998
01
LeNet
Speaker: Chia-Jung Ni
1980s: CNN proposed
1998: LeNet
2012: AlexNet
2015: VGGNet
2015: GoogleNet
2016: ResNet
2017: DenseNet
03
History of Representative CNN models
• The first time back-propagation was used to update model parameters.
• The first time GPUs were used to accelerate computations.
• Why local connectivity? (what)
  • Spatial correlation is local
  • Reduce # of parameters
04
Three key ideas : Local Receptive Fields (1/3)
Example. WLOG
- 1000x1000 image
- 3x3 filter (kernel)
- Fully connected: 10^6 + 1 params / hidden unit
- Locally connected: 3^2 + 1 params / hidden unit
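A minimal Python sketch of this parameter count (the helper name is our own):

```python
# Each hidden unit has one weight per input pixel it sees, plus one bias.
def params_per_hidden_unit(receptive_field_pixels: int) -> int:
    return receptive_field_pixels + 1

print(params_per_hidden_unit(1000 * 1000))  # fully connected: 10^6 + 1
print(params_per_hidden_unit(3 * 3))        # local 3x3 field: 3^2 + 1 = 10
```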
• Why weight sharing? (where)
  • Statistics are similar at different locations
  • Reduce # of parameters
05
Three key ideas : Shared Weights (2/3)
Example. WLOG
- # input units (neurons) = 7
- # hidden units = 3
- Without sharing: 3∗3 + 3 = 12 params. With sharing: 3∗1 + 3 = 6 params.
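A minimal numpy sketch of the shared-weight case (stride 2 is our assumption, so that 7 inputs give 3 hidden units):

```python
import numpy as np

x = np.arange(7.0)               # 7 input units
w = np.array([0.2, 0.5, 0.3])    # ONE 3-tap filter shared by all hidden units
b = np.zeros(3)                  # per-unit biases (3 of them, as on the slide)

# Each hidden unit applies the SAME weights to a different 3-unit window.
hidden = np.array([x[i:i+3] @ w for i in (0, 2, 4)]) + b

# Shared: 3 weights + 3 biases = 6 params (vs 3*3 + 3 = 12 without sharing).
print(hidden)
```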
• Why sub-sampling? (size)
  • Sub-sampling the pixels does not change the object
  • Reduce memory consumption
06
Three key ideas : Sub-sampling (3/3)
Input (4x4):
1 2 2 0
1 2 3 2
3 1 3 2
0 2 0 2

Max-pooling (2x2):
2 3
3 3

Avg-pooling (2x2):
1.5 1.75
1.5 1.75
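A minimal numpy sketch reproducing the 2x2 pooling above (non-overlapping windows, i.e. stride 2, which matches the output sizes):

```python
import numpy as np

x = np.array([[1, 2, 2, 0],
              [1, 2, 3, 2],
              [3, 1, 3, 2],
              [0, 2, 0, 2]], dtype=float)

# Group the 4x4 input into four non-overlapping 2x2 blocks.
blocks = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)

print(blocks.max(axis=-1))   # [[2. 3.] [3. 3.]]       max-pooling
print(blocks.mean(axis=-1))  # [[1.5 1.75] [1.5 1.75]] avg-pooling
```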
• Architecture of LeNet-5
  • Two sets of convolutional and average pooling layers
  • Followed by a flattening convolutional layer
  • Then two fully-connected layers and finally a softmax classifier
07
Model Architecture
• Similar to the idea of an activation function
• The feature maps of all of the first 6 layers (C1, S2, C3, S4, C5, F6) are passed
through this nonlinear scaled hyperbolic tangent function
08
Model Architecture – Squashing Function
f(a) = A·tanh(S·a), where A = 1.7159 and S = 2/3.
With this choice of parameters, the equalities f(1) = 1 and f(-1) = -1 are satisfied.
[Figure: plot of f(a) together with its first and second derivatives f'(a) and f''(a).]
Some details
- Symmetric functions yield faster convergence, although learning can be slow if the weights are too large or too small.
- The absolute value of the 2nd derivative of f(a) is a maximum at +1 and -1, which also improves convergence toward the end of the learning session.
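A minimal Python check of the squashing function and its chosen constants:

```python
import numpy as np

A, S = 1.7159, 2.0 / 3.0

def f(a):
    """LeNet-5 squashing function: a scaled hyperbolic tangent."""
    return A * np.tanh(S * a)

print(f(1.0), f(-1.0))  # ~= 1 and -1, by the choice of A and S
```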
09
Model Architecture – 1st layer (1/7)
• Trainable params
  = (weights * input map channels + bias) * output map channels
  = (5*5*1 + 1) * 6 = 156
• Connections
  = (weights * input map channels + bias) * output map channels * output map size
  = (5*5*1 + 1) * 6 * (28*28) = 122,304
H_out = floor((H_in − K + 2P) / S) + 1
W_out = floor((W_in − K + 2P) / S) + 1
(K: kernel size, P: padding, S: stride)
Convolution layer 1 (C1)
with 6 feature maps or filters
having size 5×5, a stride of one,
and ‘same’ padding!
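A quick Python check of the C1 numbers using the output-size formula above (variable names are ours):

```python
import math

# C1: 6 filters of size 5x5 over a 1-channel 28x28 input, stride 1,
# 'same' padding (P = (K - 1) / 2 = 2).
K, P, S = 5, 2, 1
C_in, C_out = 1, 6
H_in = 28

H_out = math.floor((H_in - K + 2 * P) / S) + 1   # 28: 'same' preserves size

params = (K * K * C_in + 1) * C_out              # (5*5*1 + 1) * 6 = 156
connections = params * H_out * H_out             # 156 * 28 * 28 = 122,304
print(H_out, params, connections)
```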
15
Model Architecture – Output layer (7/7)
Output layer
with Euclidean Radial Basis Function (RBF) units
The output of each RBF unit y_i is computed as follows:
y_i = Σ_j (x_j − w_ij)²
Loss Function
The mean squared error (MSE) criterion is used to measure the discrepancy.
The output of a particular RBF can be interpreted as a penalty term measuring the fit
between the input pattern and a model of the class associated with the RBF. In
probabilistic terms, the RBF output can be interpreted as the unnormalized negative
log-likelihood of a Gaussian distribution in the space of configurations of layer F6.
E(W) = (1/P) Σ_p y_Dp(Z^p, W),
where y_Dp is the output of the Dp-th RBF unit, that is, the one that corresponds to the correct class of input pattern Z^p.
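A minimal numpy sketch of the RBF outputs and this per-pattern criterion (random values, purely illustrative; the paper fixes the RBF parameters to ±1 codes):

```python
import numpy as np

rng = np.random.default_rng(0)
F6, CLASSES = 84, 10                             # layer widths from LeNet-5

W = rng.choice([-1.0, 1.0], size=(CLASSES, F6))  # RBF centers (+-1 codes)
x = rng.standard_normal(F6)                      # an F6 activation vector

y = ((x - W) ** 2).sum(axis=1)  # y_i = sum_j (x_j - w_ij)^2, one per class

D_p = 3                         # assumed correct class of this pattern
loss = y[D_p]                   # per-pattern term: output of the correct RBF
print(y.argmin(), loss)
```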
Implementation – Define LeNet-5 Model & Evaluate
16
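The code on this slide does not survive the extraction; below is a minimal Keras sketch of the architecture described on slide 7 (tanh activations, average pooling, and a softmax output; the optimizer and loss are our assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

# LeNet-5-style model: two conv + average-pooling stages, a flattening
# convolution (C5), one fully-connected layer (F6), and a softmax classifier.
model = tf.keras.Sequential([
    layers.Conv2D(6, 5, padding='same', activation='tanh',
                  input_shape=(28, 28, 1)),       # C1
    layers.AveragePooling2D(2),                   # S2
    layers.Conv2D(16, 5, activation='tanh'),      # C3
    layers.AveragePooling2D(2),                   # S4
    layers.Conv2D(120, 5, activation='tanh'),     # C5 (flattening conv)
    layers.Flatten(),
    layers.Dense(84, activation='tanh'),          # F6
    layers.Dense(10, activation='softmax'),       # output
])

model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```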
Implementation – Visualize the Training Process
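The original plot is likewise lost; here is a minimal matplotlib sketch of how the training history of the `model` above could be visualized (MNIST is an assumed stand-in for the slide's dataset):

```python
import matplotlib.pyplot as plt
import tensorflow as tf

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
x_tr, x_te = x_tr[..., None] / 255.0, x_te[..., None] / 255.0

history = model.fit(x_tr, y_tr, epochs=5, validation_data=(x_te, y_te))

# Plot the loss and accuracy curves over epochs.
for key in ('loss', 'val_loss', 'accuracy', 'val_accuracy'):
    plt.plot(history.history[key], label=key)
plt.xlabel('epoch')
plt.legend()
plt.show()
```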
17
Thank you for listening.
18
Appendix 1. Common to zero pad the border
Example. WLOG
- input 7x7
- 3x3 filter, applied with stride 1
- pad with 1 pixel border => what is the output? 7x7 (the spatial size is preserved)
In general, it is common to see CONV layers with
stride 1, filters of size FxF, and zero-padding of (F-1)/2
(this will preserve size spatially).
•F = 3 => zero pad with 1
•F = 5 => zero pad with 2
H_out = floor((H_in − K + 2P) / S) + 1
W_out = floor((W_in − K + 2P) / S) + 1
(K: kernel size, P: padding, S: stride)
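A small Python check of the appendix example using this formula (the helper is our own):

```python
import math

def conv_out(n: int, k: int, p: int, s: int) -> int:
    """Spatial output size of a convolution: floor((n - k + 2p) / s) + 1."""
    return math.floor((n - k + 2 * p) / s) + 1

print(conv_out(7, 3, 1, 1))   # 7: padding 1 preserves the 7x7 input
print(conv_out(7, 3, 0, 1))   # 5: without padding the map shrinks
print(conv_out(28, 5, 2, 1))  # 28: F=5 => pad (F-1)/2 = 2 preserves size
```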
19
Appendix 2. Sub-Sampling vs. Pooling
• Sub-sampling is simply average-pooling with learnable weights (a coefficient w and a bias b) per feature map.
• Sub-sampling is therefore a generalization of average-pooling.
Input (4x4):
1 2 2 0
1 2 3 2
3 1 3 2
0 2 0 2

Avg-pooling (2x2):
1.5 1.75
1.5 1.75

Sub-sampling (2x2): each avg-pooled value is further transformed as w·x + b, where w and b ∈ ℝ are learnable.
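A minimal numpy sketch of sub-sampling as learnable average-pooling (the w, b values here are arbitrary; with w = 1, b = 0 it reduces to plain avg-pooling):

```python
import numpy as np

x = np.array([[1, 2, 2, 0],
              [1, 2, 3, 2],
              [3, 1, 3, 2],
              [0, 2, 0, 2]], dtype=float)

# 2x2 average-pooling over non-overlapping blocks.
avg = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4).mean(-1)

w, b = 0.5, 0.1          # learnable per-feature-map coefficient and bias
print(avg)               # [[1.5  1.75] [1.5  1.75]]
print(w * avg + b)       # sub-sampling output: w * avg + b
```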