Gradient-Based Learning Applied to Document Recognition
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
Proceedings of the IEEE, 86(11):2278-2324, November 1998
Size: 2.06 MB
Language: en
Added: Jun 22, 2019
Slides: 24 pages
Slide Content
Gradient-Based Learning Applied to Document Recognition
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
Proceedings of the IEEE, 86(11):2278-2324, November 1998
01
LeNet
Speaker: Chia-Jung Ni
1980s: CNN proposed
1998: LeNet
2012: AlexNet
2015: VGGNet
2015: GoogleNet
2016: ResNet
2017: DenseNet
03
History of Representative CNN models
• The first time back-propagation was used to update model parameters.
• The first time GPUs were used to accelerate computations.
• Why local connectivity? (what)
  • Spatial correlation is local
  • Reduce # of parameters
04
Three key ideas : Local Receptive Fields (1/3)
Example. WLOG
- 1000x1000 image
- 3x3 filter (kernel)
- Fully connected: 10^6 + 1 params / hidden unit
- Locally connected: 3^2 + 1 params / hidden unit
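A minimal Python sketch of this parameter count (the helper name is our own):

```python
# Each hidden unit has one weight per input pixel it sees, plus one bias.
def params_per_hidden_unit(receptive_field_pixels: int) -> int:
    return receptive_field_pixels + 1

print(params_per_hidden_unit(1000 * 1000))  # fully connected: 10^6 + 1
print(params_per_hidden_unit(3 * 3))        # local 3x3 field: 3^2 + 1 = 10
```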
• Why weight sharing? (where)
  • Statistics are similar at different locations
  • Reduce # of parameters
05
Three key ideas : Shared Weights (2/3)
Example. WLOG
- # input units (neurons) = 7
- # hidden units = 3
- Without sharing: 3∗3 + 3 = 12 params. With sharing: 3∗1 + 3 = 6 params.
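A minimal numpy sketch of the shared-weight case (stride 2 is our assumption, so that 7 inputs give 3 hidden units):

```python
import numpy as np

x = np.arange(7.0)               # 7 input units
w = np.array([0.2, 0.5, 0.3])    # ONE 3-tap filter shared by all hidden units
b = np.zeros(3)                  # per-unit biases (3 of them, as on the slide)

# Each hidden unit applies the SAME weights to a different 3-unit window.
hidden = np.array([x[i:i+3] @ w for i in (0, 2, 4)]) + b

# Shared: 3 weights + 3 biases = 6 params (vs 3*3 + 3 = 12 without sharing).
print(hidden)
```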
• Why sub-sampling? (size)
  • Sub-sampling the pixels does not change the object
  • Reduce memory consumption
06
Three key ideas : Sub-sampling (3/3)
Input (4x4):
1 2 2 0
1 2 3 2
3 1 3 2
0 2 0 2

Max-pooling (2x2):
2 3
3 3

Avg-pooling (2x2):
1.5 1.75
1.5 1.75
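A minimal numpy sketch reproducing the 2x2 pooling above (non-overlapping windows, i.e. stride 2, which matches the output sizes):

```python
import numpy as np

x = np.array([[1, 2, 2, 0],
              [1, 2, 3, 2],
              [3, 1, 3, 2],
              [0, 2, 0, 2]], dtype=float)

# Group the 4x4 input into four non-overlapping 2x2 blocks.
blocks = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)

print(blocks.max(axis=-1))   # [[2. 3.] [3. 3.]]       max-pooling
print(blocks.mean(axis=-1))  # [[1.5 1.75] [1.5 1.75]] avg-pooling
```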
• Architecture of LeNet-5
  • Two sets of convolutional and average pooling layers
  • Followed by a flattening convolutional layer
  • Then two fully-connected layers and finally a softmax classifier
07
Model Architecture
• Similar to the idea of an activation function
• The feature maps of all of the first 6 layers (C1, S2, C3, S4, C5, F6) are passed
through this nonlinear scaled hyperbolic tangent function
08
Model Architecture – Squashing Function
f(a) = A·tanh(S·a), where A = 1.7159 and S = 2/3.
With this choice of parameters, the equalities f(1) = 1 and f(-1) = -1 are satisfied.
[Figure: plot of f(a) together with its first and second derivatives f'(a) and f''(a).]
Some details
- Symmetric functions yield faster convergence, although learning can be slow if the weights are too large or too small.
- The absolute value of the 2nd derivative of f(a) is a maximum at +1 and -1, which also improves convergence toward the end of the learning session.
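A minimal Python check of the squashing function and its chosen constants:

```python
import numpy as np

A, S = 1.7159, 2.0 / 3.0

def f(a):
    """LeNet-5 squashing function: a scaled hyperbolic tangent."""
    return A * np.tanh(S * a)

print(f(1.0), f(-1.0))  # ~= 1 and -1, by the choice of A and S
```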
09
Model Architecture – 1st layer (1/7)
• Trainable params
  = (weights * input map channels + bias) * output map channels
  = (5*5*1 + 1) * 6 = 156
• Connections
  = (weights * input map channels + bias) * output map channels * output map size
  = (5*5*1 + 1) * 6 * (28*28) = 122,304
H_out = floor((H_in − K + 2P) / S) + 1
W_out = floor((W_in − K + 2P) / S) + 1
(K: kernel size, P: padding, S: stride)
Convolution layer 1 (C1)
with 6 feature maps or filters
having size 5×5, a stride of one,
and ‘same’ padding!
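A quick Python check of the C1 numbers using the output-size formula above (variable names are ours):

```python
import math

# C1: 6 filters of size 5x5 over a 1-channel 28x28 input, stride 1,
# 'same' padding (P = (K - 1) / 2 = 2).
K, P, S = 5, 2, 1
C_in, C_out = 1, 6
H_in = 28

H_out = math.floor((H_in - K + 2 * P) / S) + 1   # 28: 'same' preserves size

params = (K * K * C_in + 1) * C_out              # (5*5*1 + 1) * 6 = 156
connections = params * H_out * H_out             # 156 * 28 * 28 = 122,304
print(H_out, params, connections)
```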
15
Model Architecture – Output layer (7/7)
Output layer
with Euclidean Radial Basis Function (RBF) units
The output of each RBF unit y_i is computed as follows:
y_i = Σ_j (x_j − w_ij)²
Loss Function
The mean squared error (MSE) criterion is used to measure the discrepancy.
The output of a particular RBF can be interpreted as a penalty term measuring the fit
between the input pattern and a model of the class associated with the RBF. In
probabilistic terms, the RBF output can be interpreted as the unnormalized negative
log-likelihood of a Gaussian distribution in the space of configurations of layer F6.
E(W) = (1/P) Σ_p y_Dp(Z^p, W),
where y_Dp is the output of the Dp-th RBF unit, that is, the one that corresponds to the correct class of input pattern Z^p.
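A minimal numpy sketch of the RBF outputs and this per-pattern criterion (random values, purely illustrative; the paper fixes the RBF parameters to ±1 codes):

```python
import numpy as np

rng = np.random.default_rng(0)
F6, CLASSES = 84, 10                             # layer widths from LeNet-5

W = rng.choice([-1.0, 1.0], size=(CLASSES, F6))  # RBF centers (+-1 codes)
x = rng.standard_normal(F6)                      # an F6 activation vector

y = ((x - W) ** 2).sum(axis=1)  # y_i = sum_j (x_j - w_ij)^2, one per class

D_p = 3                         # assumed correct class of this pattern
loss = y[D_p]                   # per-pattern term: output of the correct RBF
print(y.argmin(), loss)
```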
Implementation – Define LeNet-5 Model & Evaluate
16
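The code on this slide does not survive the extraction; below is a minimal Keras sketch of the architecture described on slide 7 (tanh activations, average pooling, and a softmax output; the optimizer and loss are our assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

# LeNet-5-style model: two conv + average-pooling stages, a flattening
# convolution (C5), one fully-connected layer (F6), and a softmax classifier.
model = tf.keras.Sequential([
    layers.Conv2D(6, 5, padding='same', activation='tanh',
                  input_shape=(28, 28, 1)),       # C1
    layers.AveragePooling2D(2),                   # S2
    layers.Conv2D(16, 5, activation='tanh'),      # C3
    layers.AveragePooling2D(2),                   # S4
    layers.Conv2D(120, 5, activation='tanh'),     # C5 (flattening conv)
    layers.Flatten(),
    layers.Dense(84, activation='tanh'),          # F6
    layers.Dense(10, activation='softmax'),       # output
])

model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```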
Implementation – Visualize the Training Process
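The original plot is likewise lost; here is a minimal matplotlib sketch of how the training history of the `model` above could be visualized (MNIST is an assumed stand-in for the slide's dataset):

```python
import matplotlib.pyplot as plt
import tensorflow as tf

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
x_tr, x_te = x_tr[..., None] / 255.0, x_te[..., None] / 255.0

history = model.fit(x_tr, y_tr, epochs=5, validation_data=(x_te, y_te))

# Plot the loss and accuracy curves over epochs.
for key in ('loss', 'val_loss', 'accuracy', 'val_accuracy'):
    plt.plot(history.history[key], label=key)
plt.xlabel('epoch')
plt.legend()
plt.show()
```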
17
Thank you for listening.
18
Appendix 1. Common to zero pad the border
Example. WLOG
- input 7x7
- 3x3 filter, applied with stride 1
- pad with 1 pixel border => what is the output? 7x7 (the spatial size is preserved)
In general, it is common to see CONV layers with
stride 1, filters of size FxF, and zero-padding of (F-1)/2
(this will preserve size spatially).
•F = 3 => zero pad with 1
•F = 5 => zero pad with 2
H_out = floor((H_in − K + 2P) / S) + 1
W_out = floor((W_in − K + 2P) / S) + 1
(K: kernel size, P: padding, S: stride)
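A small Python check of the appendix example using this formula (the helper is our own):

```python
import math

def conv_out(n: int, k: int, p: int, s: int) -> int:
    """Spatial output size of a convolution: floor((n - k + 2p) / s) + 1."""
    return math.floor((n - k + 2 * p) / s) + 1

print(conv_out(7, 3, 1, 1))   # 7: padding 1 preserves the 7x7 input
print(conv_out(7, 3, 0, 1))   # 5: without padding the map shrinks
print(conv_out(28, 5, 2, 1))  # 28: F=5 => pad (F-1)/2 = 2 preserves size
```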
19
Appendix 2. Sub-Sampling vs. Pooling
• Sub-sampling is simply average-pooling with learnable weights (a coefficient w and a bias b) per feature map.
• Sub-sampling is therefore a generalization of average-pooling.
Input (4x4):
1 2 2 0
1 2 3 2
3 1 3 2
0 2 0 2

Avg-pooling (2x2):
1.5 1.75
1.5 1.75

Sub-sampling (2x2): each avg-pooled value is further transformed as w·x + b, where w and b ∈ ℝ are learnable.
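A minimal numpy sketch of sub-sampling as learnable average-pooling (the w, b values here are arbitrary; with w = 1, b = 0 it reduces to plain avg-pooling):

```python
import numpy as np

x = np.array([[1, 2, 2, 0],
              [1, 2, 3, 2],
              [3, 1, 3, 2],
              [0, 2, 0, 2]], dtype=float)

# 2x2 average-pooling over non-overlapping blocks.
avg = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4).mean(-1)

w, b = 0.5, 0.1          # learnable per-feature-map coefficient and bias
print(avg)               # [[1.5  1.75] [1.5  1.75]]
print(w * avg + b)       # sub-sampling output: w * avg + b
```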