LeNet-5


About This Presentation

Gradient-Based Learning Applied to Document Recognition
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
Proceedings of the IEEE, 86(11):2278–2324, November 1998


Slide Content

Gradient-Based Learning Applied to Document Recognition
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
Proceedings of the IEEE, 86(11):2278–2324, November 1998

LeNet
Speaker: Chia-Jung Ni

Outline
• History of Representative CNN models
• Three key ideas for CNN
  • Local Receptive Fields
  • Shared Weights
  • Sub-sampling
• Model Architecture
• Implementation
  • Keras
Slide: https://drive.google.com/file/d/12YWNNbqB-_JHl0CrNEl6loINBJoGHgE3/view?usp=sharing
Code: https://drive.google.com/file/d/1wDcDgoF8VSj29ab-cXsN82Q1pxdBiaUx/view?usp=sharing

History of Representative CNN models
1980s: CNN first proposed
1998: LeNet – the first to use back-propagation (gradient-based learning) to update model params
2012: AlexNet – the first to use GPUs to accelerate computations
2015: VGGNet
2015: GoogleNet
2016: ResNet
2017: DenseNet

Three key ideas: Local Receptive Fields (1/3)
• Why local connectivity? (what)
  • Spatial correlation is local
  • Reduces # of parameters
Example. WLOG
- 1000x1000 image
- 3x3 filter (kernel)
Fully connected: 10^6 + 1 params per hidden unit. With a 3x3 local receptive field: 3^2 + 1 = 10 params per hidden unit.
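A quick sanity check of these counts in Python (a sketch; the fully-connected case assumes one hidden unit wired to every pixel):

    # Fully connected: one weight per pixel of the 1000x1000 image, plus a bias.
    full_params = 1000 * 1000 + 1      # 10^6 + 1 params per hidden unit

    # Local receptive field: each hidden unit sees only a 3x3 patch.
    local_params = 3 * 3 + 1           # 3^2 + 1 = 10 params per hidden unit

    print(full_params, local_params)   # 1000001 10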

Three key ideas: Shared Weights (2/3)
• Why weight sharing? (where)
  • Image statistics are similar at different locations (a feature useful in one place is useful elsewhere)
  • Reduces # of parameters
Example. WLOG
- # input units (neurons) = 7
- # hidden units = 3, each seeing a window of 3 inputs
Without sharing: 3*3 + 3 = 12 params. With sharing: 3*1 + 3 = 6 params.
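The same counting for this 1-D example, as a sketch (one bias per hidden unit in both cases):

    window, hidden = 3, 3

    # Without sharing: each hidden unit has its own 3 weights.
    no_sharing = hidden * window + hidden    # 3*3 + 3 = 12 params

    # With sharing: one set of 3 weights reused by all hidden units.
    with_sharing = window + hidden           # 3*1 + 3 = 6 params

    print(no_sharing, with_sharing)          # 12 6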

Three key ideas: Sub-sampling (3/3)
• Why sub-sampling? (size)
  • Sub-sampling the pixels will not change the object
  • Reduces memory consumption
Example. Input (4x4), pooled over 2x2 windows with stride 2:
1 2 2 0      Max-pooling:   2 3      Avg-pooling:   1.5  1.75
1 2 3 2                     3 3                     1.5  1.75
3 1 3 2
0 2 0 2
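A minimal NumPy sketch reproducing both poolings on this 4x4 example:

    import numpy as np

    x = np.array([[1, 2, 2, 0],
                  [1, 2, 3, 2],
                  [3, 1, 3, 2],
                  [0, 2, 0, 2]], dtype=float)

    # Group the map into non-overlapping 2x2 blocks (stride 2).
    blocks = x.reshape(2, 2, 2, 2).swapaxes(1, 2)

    print(blocks.max(axis=(2, 3)))    # max-pooling: [[2. 3.] [3. 3.]]
    print(blocks.mean(axis=(2, 3)))   # avg-pooling: [[1.5 1.75] [1.5 1.75]]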

Model Architecture
• Architecture of LeNet-5
  • Two sets of convolutional and average pooling layers
  • Followed by a flattening convolutional layer
  • Then two fully-connected layers and finally a softmax classifier

Model Architecture – Squashing Function
• Similar to the idea of an activation function
• All feature maps of the first 6 layers (C1, S2, C3, S4, C5, F6) are passed through this nonlinear scaled hyperbolic tangent function:
  f(a) = A tanh(S a), where A = 1.7159 and S = 2/3.
With this choice of params, the equalities f(1) = 1 and f(-1) = -1 are satisfied.
[Figure: plots of f(a), f'(a), and f''(a).]
Some details:
- Symmetric functions yield faster convergence, although learning can be slow if the weights are too large or too small.
- The absolute value of the second derivative of f(a) is a maximum at +1 and -1, which also improves convergence toward the end of the learning session.
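A one-line check that these constants indeed give f(±1) = ±1:

    import numpy as np

    A, S = 1.7159, 2.0 / 3.0
    f = lambda a: A * np.tanh(S * a)

    print(f(1.0), f(-1.0))   # ~1.0 and ~-1.0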

Model Architecture – 1st layer (1/7)
Convolution layer 1 (C1) with 6 feature maps or filters having size 5×5, a stride of one, and 'valid' padding (P = 0, per the architecture table: 32x32 in, 28x28 out).
• Trainable params
  = (weights * input map channels + bias) * output map channels
  = (5*5*1 + 1) * 6 = 156
• Connections
  = (weights * input map channels + bias) * output map channels * output map size
  = (5*5*1 + 1) * 6 * (28*28) = 122,304
Output size: W_l = (W_{l-1} - F + 2P)/S + 1 and H_l = (H_{l-1} - F + 2P)/S + 1.
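The output-size formula and the two counts, as a small helper (names are my own):

    def out_size(w_in, f, p, s):
        """W_l = (W_{l-1} - F + 2P) / S + 1, floored."""
        return (w_in - f + 2 * p) // s + 1

    print(out_size(32, 5, 0, 1))          # C1: 32 -> 28
    print((5 * 5 * 1 + 1) * 6)            # trainable params: 156
    print((5 * 5 * 1 + 1) * 6 * 28 * 28)  # connections: 122304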

Model Architecture – 2nd layer (2/7)
Subsampling layer 2 (S2) with a filter size 2×2, a stride of two, and 'valid' padding.
• Trainable params
  = (weight + bias) * output map channels
  = (1 + 1) * 6 = 12
• Connections
  = (kernel size + bias) * output map channels * output map size
  = (2*2 + 1) * 6 * (14*14) = 5,880

Model Architecture – 3rd layer (3/7)
Convolution layer 3 (C3) with 16 feature maps having size 5×5, a stride of one, and 'valid' padding.
Based on the consideration of computation costs, the 16 maps are not all connected to all 6 S2 maps:
• First 6 feature maps are connected to 3 contiguous input maps
• Second 6 feature maps are connected to 4 contiguous input maps
• Next 3 feature maps are connected to 4 discontinuous input maps
• Last 1 feature map is connected to all 6 input maps
• Trainable params
  = Σ over groups [ (weights * input map channels + bias) * output map channels ]
  = (5*5*3 + 1) * 6 + (5*5*4 + 1) * 6 + (5*5*4 + 1) * 3 + (5*5*6 + 1) * 1 = 456 + 606 + 303 + 151 = 1,516
• Connections
  = trainable params * output map size
  = 1,516 * (10*10) = 151,600
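Replicating the grouped C3 count in Python (each tuple is (input maps per group, feature maps in group); an illustrative sketch):

    groups = [(3, 6), (4, 6), (4, 3), (6, 1)]
    params = sum((5 * 5 * c + 1) * n for c, n in groups)

    print(params)            # 1516 trainable params
    print(params * 10 * 10)  # 151600 connections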

Model Architecture – 4th layer (4/7)
Subsampling layer 4 (S4) with a filter size 2×2, a stride of two, and 'valid' padding.
• Trainable params
  = (weight + bias) * output map channels
  = (1 + 1) * 16 = 32
• Connections
  = (kernel size + bias) * output map channels * output map size
  = (2*2 + 1) * 16 * (5*5) = 2,000

Model Architecture – 5th layer (5/7)
Convolution layer 5 (C5) with 120 feature maps or filters having size 5×5, a stride of one, and 'valid' padding.
• Trainable params
  = (weights * input map channels + bias) * output map channels
  = (5*5*16 + 1) * 120 = 48,120
• Connections
  = (weights * input map channels + bias) * output map channels * output map size
  = (5*5*16 + 1) * 120 * (1*1) = 48,120

Model Architecture – 6th layer (6/7)
Fully-connected layer (F6) with 84 neuron units.
• Trainable params
  = (weights + bias) * output units
  = (120 + 1) * 84 = 10,164
• Connections
  = (weights + bias) * output units
  = (120 + 1) * 84 = 10,164

Model Architecture – Output layer (7/7)
Output layer with Euclidean Radial Basis Function (RBF) units. The output of each RBF unit y_i is computed as:
  y_i = Σ_j (x_j - w_ij)²
The output of a particular RBF can be interpreted as a penalty term measuring the fit between the input pattern and a model of the class associated with the RBF. In probabilistic terms, the RBF output can be interpreted as the unnormalized negative log-likelihood of a Gaussian distribution in the space of configurations of layer F6.
Loss function: mean squared error (MSE) is used to measure the discrepancy:
  E(W) = (1/P) Σ_p y_{D_p}(Z^p, W),
where y_{D_p} is the output of the D_p-th RBF unit, i.e., the one that corresponds to the correct class of input pattern Z^p.

Model Architecture (LeNet-5)
Notation: W, H = feature map size; F = filter size (kernel size); S = stride; P = padding.

Layer  | Type        | # Channels | Map Size | Filter (F) | Stride (S) | Padding (P) | Activation
Input  | Image       | 1          | 32x32    | -          | -          | -           | -
1      | Convolution | 6          | 28x28    | 5x5        | 1          | 0           | tanh
2      | Avg-Pooling | 6          | 14x14    | 2x2        | 2          | 0           | tanh
3      | Convolution | 16         | 10x10    | 5x5        | 1          | 0           | tanh
4      | Avg-Pooling | 16         | 5x5      | 2x2        | 2          | 0           | tanh
5      | Convolution | 120        | 1x1      | 5x5        | 1          | 0           | tanh
6      | FC          | -          | 84       | -          | -          | -           | tanh
Output | FC          | -          | 10       | -          | -          | -           | RBF
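Tracing the table's feature-map sizes with the formula W_l = (W_{l-1} - F + 2P)/S + 1 (layer specs taken from the table above):

    size = 32
    for name, f, p, s in [("C1", 5, 0, 1), ("S2", 2, 0, 2),
                          ("C3", 5, 0, 1), ("S4", 2, 0, 2),
                          ("C5", 5, 0, 1)]:
        size = (size - f + 2 * p) // s + 1
        print(name, size)   # 28, 14, 10, 5, 1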

Implementation – Download Data Set & Normalize
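The code on this slide is only a screenshot; a minimal sketch of what this step typically looks like with tf.keras and MNIST (padding to 32x32 matches the LeNet-5 input size; variable names are my own):

    import numpy as np
    from tensorflow import keras

    # Load MNIST (28x28 grayscale digits) and scale pixels to [0, 1].
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    x_train = x_train.astype("float32") / 255.0
    x_test = x_test.astype("float32") / 255.0

    # Zero-pad 28x28 -> 32x32 and add a channel axis: (N, 32, 32, 1).
    x_train = np.pad(x_train, ((0, 0), (2, 2), (2, 2)))[..., np.newaxis]
    x_test = np.pad(x_test, ((0, 0), (2, 2), (2, 2)))[..., np.newaxis]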

Implementation – Define LeNet-5 Model
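Again the slide is a screenshot; a sketch of a LeNet-5 definition in Keras. Note two common deviations from the paper: Conv2D connects every C3 map to all 6 S2 maps (2,416 params instead of 1,516), and the RBF output layer is replaced by a softmax:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(32, 32, 1)),
        layers.Conv2D(6, kernel_size=5, activation="tanh"),    # C1: 28x28x6
        layers.AveragePooling2D(pool_size=2, strides=2),       # S2: 14x14x6
        layers.Conv2D(16, kernel_size=5, activation="tanh"),   # C3: 10x10x16
        layers.AveragePooling2D(pool_size=2, strides=2),       # S4: 5x5x16
        layers.Conv2D(120, kernel_size=5, activation="tanh"),  # C5: 1x1x120
        layers.Flatten(),
        layers.Dense(84, activation="tanh"),                   # F6
        layers.Dense(10, activation="softmax"),                # output
    ])
    model.summary()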

Implementation – Define LeNet-5 Model & Evaluate
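Continuing the sketch above, training and evaluation might look like this (hyperparameters are illustrative, not taken from the slide):

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Train, holding out 10% of the training set for validation.
    history = model.fit(x_train, y_train, batch_size=128,
                        epochs=10, validation_split=0.1)

    test_loss, test_acc = model.evaluate(x_test, y_test)
    print("test accuracy:", test_acc)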

Implementation – Visualize the Training Process
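A sketch of plotting the Keras History object returned by fit() in the previous step:

    import matplotlib.pyplot as plt

    # Accuracy curves for the training and validation sets, per epoch.
    plt.plot(history.history["accuracy"], label="train")
    plt.plot(history.history["val_accuracy"], label="validation")
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()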

Thanks for listening.

Appendix 1. Common to zero-pad the border
Example. WLOG
- input 7x7
- 3x3 filter, applied with stride 1
- pad with 1 pixel border => what is the output?
Answer: (7 - 3 + 2*1)/1 + 1 = 7, so the output is again 7x7.
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2, which preserves the size spatially:
• F = 3 => zero-pad with 1
• F = 5 => zero-pad with 2
Output size: W_out = (W_in - F + 2P)/S + 1 and H_out = (H_in - F + 2P)/S + 1.
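Checking both padding rules with the output-size formula:

    def out_size(w_in, f, p, s=1):
        return (w_in - f + 2 * p) // s + 1

    print(out_size(7, 3, 1))  # 7: F=3 with pad (3-1)/2 = 1 preserves size
    print(out_size(7, 5, 2))  # 7: F=5 with pad (5-1)/2 = 2 preserves size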

Appendix 2. Sub-sampling vs. Pooling
• Sub-sampling is simply average-pooling with learnable weights per feature map.
• Sub-sampling is a generalization of average-pooling.
Example. Input (4x4), 2x2 windows with stride 2:
1 2 2 0      Avg-pooling:   1.5  1.75      Sub-sampling:   w*1.5+b  w*1.75+b
1 2 3 2                     1.5  1.75                      w*1.5+b  w*1.75+b
3 1 3 2
0 2 0 2
, where w and b ∈ ℝ are trainable scalars (one pair per feature map).
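A sketch of this as a custom Keras layer (my illustration, not the paper's code): average pooling followed by a trainable per-channel scale w and bias b. With w = 1 and b = 0 it reduces to plain average pooling, which is the "generalization" claim above.

    from tensorflow import keras
    from tensorflow.keras import layers

    class SubSampling2D(layers.Layer):
        """Average pooling times a trainable coefficient plus a trainable
        bias, one (w, b) pair per feature map."""

        def __init__(self, pool_size=2, **kwargs):
            super().__init__(**kwargs)
            self.pool = layers.AveragePooling2D(pool_size)

        def build(self, input_shape):
            channels = input_shape[-1]
            self.w = self.add_weight(shape=(channels,), initializer="ones")
            self.b = self.add_weight(shape=(channels,), initializer="zeros")

        def call(self, x):
            return self.pool(x) * self.w + self.b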

Appendix 3. Radial Basis Function (RBF) units
The 84 F6 outputs x_1, ..., x_84 are fully connected to the 10 RBF units y_1, ..., y_10 through the parameter matrix W = (w_ij) ∈ ℝ^{10×84}, with each output computed as y_i = Σ_{j=1}^{84} (x_j - w_ij)².
Details:
1) x_j ∈ ℝ is the output of the j-th F6 unit, passed through the squashing function x_j = f(a_j) = A tanh(S a_j), ∀ j = 1, ..., 84.
2) The parameters {w_ij | i = 1, ..., 10; j = 1, ..., 84} are fixed rather than learned (in the paper they are set to -1 or +1 according to stylized bitmaps of the target characters).
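A NumPy sketch of this computation (the random ±1 codes stand in for the paper's fixed, hand-designed bitmaps):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.choice([-1.0, 1.0], size=(10, 84))  # fixed parameter codes
    x = np.tanh(rng.normal(size=84))            # stand-in for F6 outputs

    # y_i = sum_j (x_j - w_ij)^2 for each of the 10 RBF units.
    y = ((x - W) ** 2).sum(axis=1)
    print(y.argmin())  # predicted class = unit with the smallest penalty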