Optimization (v6).pdf


About This Presentation

Optimization principles


Slide Content

Optimization
李宏毅
Hung-yi Lee

Last time …
[Figure: shallow vs. deep networks of small / medium / large size trying to fit a target function f*; can the function space eventually cover f*? What is the difference? Questions 1, 2, 3.]
Optimization: Is it possible to find f* in the function space?

Optimization
Network: f_θ(x)
Training data: {(x^1, ŷ^1), (x^2, ŷ^2), ..., (x^N, ŷ^N)}
L(θ) = Σ_{n=1}^{N} ‖ f_θ(x^n) − ŷ^n ‖
θ* = arg min_θ L(θ)
Optimization ≠ Learning
In Deep Learning, L(θ) is not convex.
Non-convex optimization is NP-hard.
Why can we solve the problem by gradient descent?

Loss of Deep Learning is not convex
There are at least exponentially many global minima for a neural net.
Why? Permuting the neurons in one layer (and their attached weights) does not change the loss.
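
A quick way to see this permutation symmetry is to shuffle the hidden units of a small network and check that the output (and hence the loss) is unchanged. This is an illustrative sketch, not from the slides; the sizes and names are made up.

```python
import numpy as np

# Toy one-hidden-layer network; permuting its hidden units (columns of W1,
# entries of b1, rows of W2 together) must leave the output unchanged.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                     # 5 inputs, 3 features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2 = rng.normal(size=(4, 2))

def forward(W1, b1, W2):
    h = np.maximum(0, x @ W1 + b1)              # ReLU hidden layer
    return h @ W2

perm = rng.permutation(4)                       # shuffle the 4 hidden units
out_original = forward(W1, b1, W2)
out_permuted = forward(W1[:, perm], b1[perm], W2[perm, :])
print(np.allclose(out_original, out_permuted))  # True: same function, same loss
```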

Non-convex ≠ Difficult
[Figure: three example loss curves L(θ).]
Gradient descent is not guaranteed to find the optimal solution.

Outline
Review: Hessian
Deep Linear Model
Deep Non-linear Model
Conjecture about Deep Learning
Empirical Observation about Error Surface

Hessian Matrix:
When Gradient is Zero
Some examples in this part are from:
https://www.math.upenn.edu/~kazdan/312F12/Notes/max-min-notesJan09/max-min.pdf

Training stops ….
• People believe training gets stuck because the parameters are near a critical point (a point where the gradient is zero): a local minimum? How about a saddle point?
http://www.deeplearningbook.org/contents/optimization.html

When Gradient is Zero
f(θ) = f(θ0) + (θ − θ0)^T g + 1/2 (θ − θ0)^T H (θ − θ0) + ⋯
The gradient g is a vector:  g_i = ∂f(θ0)/∂θ_i
The Hessian H is a matrix:   H_ij = ∂²f(θ0)/∂θ_i ∂θ_j = ∂²f(θ0)/∂θ_j ∂θ_i = H_ji   (H is symmetric)

Hessian
f(θ) = f(θ0) + (θ − θ0)^T g + 1/2 (θ − θ0)^T H (θ − θ0) + ⋯
H determines the curvature.
Source of image: http://www.deeplearningbook.org/contents/numerical.html

Hessian
f(θ) ≈ f(θ0) + (θ − θ0)^T g + 1/2 (θ − θ0)^T H (θ − θ0)
Newton’s method: find the point where ∂f(θ)/∂θ = 0.
∂/∂θ [(θ − θ0)^T g] = g
∂/∂θ [1/2 (θ − θ0)^T H (θ − θ0)] = H (θ − θ0)

Hessian
f(θ) ≈ f(θ0) + (θ − θ0)^T g + 1/2 (θ − θ0)^T H (θ − θ0)
Newton’s method:
∂f(θ)/∂θ ≈ g + H (θ − θ0) = 0
H (θ − θ0) = −g
θ − θ0 = −H^{−1} g
θ = θ0 − H^{−1} g       vs. gradient descent: θ = θ0 − η g
Newton’s method changes the direction and determines the step size.
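
As a sketch of this comparison (assuming a simple quadratic objective of my own choosing, not notation from the slides), one Newton step lands exactly on the critical point, while one gradient-descent step with a fixed learning rate only moves part of the way:

```python
import numpy as np

# Quadratic f(θ) = 1/2 θᵀHθ + bᵀθ, so the gradient is Hθ + b and the Hessian is H.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])                  # positive definite Hessian (assumed)
b = np.array([1.0, -1.0])

def grad(theta):
    return H @ theta + b

theta0 = np.array([3.0, -2.0])
newton = theta0 - np.linalg.solve(H, grad(theta0))   # θ0 − H⁻¹ g
gd     = theta0 - 0.1 * grad(theta0)                 # θ0 − η g with η = 0.1

print("Newton step:", newton, "-> gradient:", grad(newton))  # gradient ≈ 0 in one step
print("GD step    :", gd,     "-> gradient:", grad(gd))      # gradient still nonzero
```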

Hessian
Newton’s method: if f(θ) is a quadratic function, the critical point is obtained in one step.
What is the problem?
[Figure: Newton’s method intuition]
Source of image: https://math.stackexchange.com/questions/609680/newtons-method-intuition
Not suitable for Deep Learning.

Hessian
f(θ) = f(θ0) + (θ − θ0)^T g + 1/2 (θ − θ0)^T H (θ − θ0) + ⋯
At a critical point (g = 0):
f(θ) ≈ f(θ0) + 1/2 (θ − θ0)^T H (θ − θ0)
H tells us the properties of critical points.
Source of image: http://www.offconvex.org/2016/03/22/saddlepoints/

Review: Linear Algebra
http://speech.ee.ntu.edu.tw/~tlkagk/courses/LA_2016/Lecture/eigen.pdf
• If Av = λv (v is a vector, λ is a scalar):
  • v is an eigenvector of A (excluding the zero vector)
  • λ is the eigenvalue of A that corresponds to v
• A must be square.

Review: Positive/Negative Definite
• Let A be a symmetric n×n matrix.
• For every non-zero vector x (x ≠ 0):
  positive definite:       xᵀAx > 0   ⟺  all eigenvalues are positive
  positive semi-definite:  xᵀAx ≥ 0   ⟺  all eigenvalues are non-negative
  negative definite:       xᵀAx < 0   ⟺  all eigenvalues are negative
  negative semi-definite:  xᵀAx ≤ 0   ⟺  all eigenvalues are non-positive
Example: the identity matrix [[1, 0], [0, 1]] is positive definite.

Hessian
At a critical point:  f(θ) ≈ f(θ0) + 1/2 (θ − θ0)^T H (θ − θ0)
• H is positive definite (xᵀHx > 0 for every x ≠ 0; all eigenvalues are positive)
  → around θ0, f(θ) > f(θ0): local minimum.
• H is negative definite (xᵀHx < 0; all eigenvalues are negative)
  → around θ0, f(θ) < f(θ0): local maximum.
• xᵀHx ≥ 0?  xᵀHx ≤ 0?  (semi-definite cases)
• Sometimes xᵀHx > 0 and sometimes xᵀHx < 0 (mixed-sign eigenvalues)
  → saddle point.
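
A minimal sketch of this classification rule (the helper name and tolerance are my own choices): compute the eigenvalues of H at a critical point and read off their signs.

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from its (symmetric) Hessian H."""
    eig = np.linalg.eigvalsh(H)          # real eigenvalues of a symmetric matrix
    if np.any(np.abs(eig) < tol):
        return "degenerate (some eigenvalue is ~0, the second-order test is inconclusive)"
    if np.all(eig > 0):
        return "local minimum (H positive definite)"
    if np.all(eig < 0):
        return "local maximum (H negative definite)"
    return "saddle point (mixed-sign eigenvalues)"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 6.0]])))   # local minimum
print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, 6.0]])))  # saddle point
```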

Hessian
At a critical point:  f(θ) ≈ f(θ0) + 1/2 (θ − θ0)^T H (θ − θ0)
If v is a unit eigenvector of H (Hv = λv, ‖v‖ = 1), then
  vᵀHv = λ vᵀv = λ
so the change of f along v is governed by the sign of λ.
Because H is an n×n symmetric matrix, it has eigenvectors v1, v2, …, vn that form an orthonormal basis.
Writing x = c1 v1 + c2 v2 (and ignoring the 1/2 for simplicity):
  xᵀHx = (c1 v1 + c2 v2)ᵀ H (c1 v1 + c2 v2) = c1² λ1 + c2² λ2
because v1 and v2 are orthogonal.

Hessian
At a critical point:  f(θ) ≈ f(θ0) + 1/2 (θ − θ0)^T H (θ − θ0)
Because H is an n×n symmetric matrix, it has eigenvectors v1, v2, …, vn that form an orthonormal basis.
For any direction u = c1 v1 + c2 v2 + ⋯ + cn vn (a unit vector from θ0):
  uᵀHu = c1² λ1 + c2² λ2 + ⋯ + cn² λn
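
The identity above is easy to check numerically; this sketch (my own example, not from the slides) builds a random symmetric "Hessian" and verifies that uᵀHu = Σ cᵢ² λᵢ.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
H = (A + A.T) / 2                       # a random symmetric matrix standing in for a Hessian

lam, V = np.linalg.eigh(H)              # columns of V are orthonormal eigenvectors
u = rng.normal(size=4)
c = V.T @ u                             # coordinates of u in the eigenbasis

print(u @ H @ u, np.sum(c**2 * lam))    # the two values agree
```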

Examples
f(x, y) = x² + 3y²
∂f/∂x = 2x,  ∂f/∂y = 6y   →  critical point at x = 0, y = 0
∂²f/∂x∂x = 2,  ∂²f/∂x∂y = 0,  ∂²f/∂y∂x = 0,  ∂²f/∂y∂y = 6
H = [[2, 0], [0, 6]]  →  positive definite  →  local minimum

Examples
f(x, y) = −x² + 3y²
∂f/∂x = −2x,  ∂f/∂y = 6y   →  critical point at x = 0, y = 0
∂²f/∂x∂x = −2,  ∂²f/∂x∂y = 0,  ∂²f/∂y∂x = 0,  ∂²f/∂y∂y = 6
H = [[−2, 0], [0, 6]]  →  mixed-sign eigenvalues  →  saddle point
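
The two worked examples can be reproduced symbolically. A small sketch using SymPy (my own tooling choice, not part of the slides):

```python
import sympy as sp

x, y = sp.symbols("x y")
for f in (x**2 + 3*y**2, -x**2 + 3*y**2):
    grad = [sp.diff(f, v) for v in (x, y)]            # both gradients vanish at (0, 0)
    H = sp.hessian(f, (x, y))
    print(f, grad, H.subs({x: 0, y: 0}).eigenvals())
# x**2 + 3*y**2  -> eigenvalues 2 and 6: positive definite, local minimum
# -x**2 + 3*y**2 -> eigenvalues -2 and 6: mixed signs, saddle point
```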

Degenerate
• A degenerate Hessian has at least one zero eigenvalue.
f(x, y) = x² + y⁴
∂f/∂x = 2x,  ∂f/∂y = 4y³
∂²f/∂x∂x = 2,  ∂²f/∂x∂y = 0,  ∂²f/∂y∂x = 0,  ∂²f/∂y∂y = 12y²
At x = y = 0:  H = [[2, 0], [0, 0]]

Degenerate
• A degenerate Hessian has at least one zero eigenvalue.
f(x, y) = x² + y⁴:  at x = y = 0,  g = [0, 0],  H = [[2, 0], [0, 0]]
g(x, y) = x² − y⁴:  at x = y = 0,  g = [0, 0],  H = [[2, 0], [0, 0]]
No difference: the gradient and the Hessian are identical, even though the first function has a minimum at the origin and the second has a saddle.

Degenerate
f(x, y) = −x⁴ − y⁴
∂f/∂x = −4x³,  ∂f/∂y = −4y³
∂²f/∂x∂x = −12x²,  ∂²f/∂x∂y = 0,  ∂²f/∂y∂x = 0,  ∂²f/∂y∂y = −12y²
At x = y = 0:  g = [0, 0],  H = [[0, 0], [0, 0]]
h(x, y) = 0 also has g = [0, 0] and H = [[0, 0], [0, 0]] at x = y = 0: again the second-order test cannot tell the two apart.

Degenerate
http://homepages.math.uic.edu/~julius/monkeysaddle.html
f(x, y) = x³ − 3xy²   (monkey saddle)
∂f/∂x = 3x² − 3y²,  ∂f/∂y = −6xy
∂²f/∂x∂x = 6x,  ∂²f/∂x∂y = −6y,  ∂²f/∂y∂x = −6y,  ∂²f/∂y∂y = −6x
At the origin the gradient and the Hessian are both zero (cf. the previous examples).

Training stuck ≠ Zero Gradient
• People believe training gets stuck because the parameters are around a critical point !!!
http://www.deeplearningbook.org/contents/optimization.html

Training stuck ≠ Zero Gradient
http://videolectures.net/deeplearning2015_bengio_theoretical_motivations/
Approach a saddle point, and then escape.

Deep Linear Network

https://arxiv.org/abs/1412.6544
[Figure: a linear network with one weight per layer, x → w1 → w2 → y, trained on a single example with x = 1, ŷ = 1.]

L = (ŷ − w1 w2 x)² = (1 − w1 w2)²
∂L/∂w1 = 2 (1 − w1 w2) (−w2)
∂L/∂w2 = 2 (1 − w1 w2) (−w1)
∂²L/∂w1² = 2 w2²
∂²L/∂w2² = 2 w1²
∂²L/∂w1∂w2 = ∂²L/∂w2∂w1 = −2 + 4 w1 w2
The probability of getting stuck at the saddle point is almost zero; it is easy to escape.
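
A sketch checking the toy example above symbolically (SymPy is my own tooling choice): at the origin the gradient vanishes and the Hessian has eigenvalues ±2, so w1 = w2 = 0 is a non-degenerate saddle point.

```python
import sympy as sp

w1, w2 = sp.symbols("w1 w2")
L = (1 - w1 * w2)**2                        # the loss from the slide, with x = ŷ = 1

grad = sp.Matrix([sp.diff(L, w1), sp.diff(L, w2)])
H = sp.hessian(L, (w1, w2))

print(grad.subs({w1: 0, w2: 0}).T)          # [0, 0]: the origin is a critical point
print(H.subs({w1: 0, w2: 0}))               # Matrix([[0, -2], [-2, 0]])
print(H.subs({w1: 0, w2: 0}).eigenvals())   # eigenvalues +2 and -2: a saddle, easy to escape
```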

2-hidden layers
[Figure: x → w1 → w2 → w3 → y, with x = 1, ŷ = 1.]
L = (1 − w1 w2 w3)²
∂L/∂w1 = 2 (1 − w1 w2 w3) (−w2 w3)
∂L/∂w2 = 2 (1 − w1 w2 w3) (−w1 w3)
∂L/∂w3 = 2 (1 − w1 w2 w3) (−w1 w2)
∂²L/∂w1² = 2 (w2 w3)²
∂²L/∂w2∂w1 = −2 w3 + 4 w1 w2 w3²
At w1 = w2 = w3 = 0:
  H = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]   →  so flat
At w1 = w2 = 0, w3 = k:
  H = [[0, −2k, 0], [−2k, 0, 0], [0, 0, 0]]   →  saddle point
w1 w2 w3 = 1 gives the global minima.
All minima are global, but some critical points are “bad”.
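
Extending the same symbolic check to K weights (a sketch, not from the slides) shows why the origin becomes "so flat": for K ≥ 3 every first and second derivative vanishes there, so the critical point is degenerate.

```python
import sympy as sp

for K in (2, 3, 5):
    w = sp.symbols(f"w1:{K + 1}")                       # symbols w1 … wK
    L = (1 - sp.Mul(*w))**2                             # L = (1 − w1 w2 ⋯ wK)²
    H0 = sp.hessian(L, w).subs({wi: 0 for wi in w})     # Hessian at the origin
    print(K, H0.eigenvals())
# K = 2: eigenvalues +2 and -2 (an ordinary saddle)
# K = 3, 5: all eigenvalues are 0 (degenerate critical point, extremely flat)
```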

10 hidden layers
[Figure: the loss of a 10-hidden-layer linear network plotted along two directions θ1 and θ2 from the origin.]

Demo

Deep Linear Network
y = W_K W_{K−1} ⋯ W_2 W_1 x
L = Σ_{n=1}^{N} ‖ W_K W_{K−1} ⋯ W_1 x^n − ŷ^n ‖²
Hidden layer size ≥ input dim, output dim.
More than two hidden layers can produce saddle points without negative eigenvalues.

Reference
• Kenji Kawaguchi, “Deep Learning without Poor Local Minima”, NIPS, 2016
• Haihao Lu, Kenji Kawaguchi, “Depth Creates No Bad Local Minima”, arXiv, 2017
• Thomas Laurent, James von Brecht, “Deep linear neural networks with arbitrary loss: All local minima are global”, arXiv, 2017
• Maher Nouiehed, Meisam Razaviyayn, “Learning Deep Models: Critical Points and Local Openness”, arXiv, 2018

Non-linear Deep Network
Does it have local minima?
Proving that something does not exist is hard; proving that it exists is relatively easy.
(Thanks to classmate 曾子家 for spotting a typo on the slides.)

Even a Simple Task can be Difficult

ReLU has local minima
[Figure: two small one-hidden-unit ReLU networks (weights +1, biases such as 0, −3 and 0, −4, −7) fitted to the data points (-1.3) (1,-3) (3,0) (4,1) (5,2).]
This ReLU network has local minima.

“Blind Spot” of ReLU
[Figure: a ReLU network whose hidden units all output 0 for the inputs x, so the output y and every weight gradient are 0.]
Gradient is zero.
It is pretty easy to make this happen ……
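
A hypothetical numeric illustration of the blind spot (the network shape, data, and bias values are invented for this example): when every pre-activation is negative, all ReLUs output zero and the gradient with respect to the first-layer weights is exactly zero, even though the loss is far from minimal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # toy inputs
y = rng.normal(size=100)                # toy targets

W1 = rng.normal(size=(3, 8))
b1 = np.full(8, -100.0)                 # absurdly negative biases: every ReLU is dead
W2 = rng.normal(size=8)

z = X @ W1 + b1                         # pre-activations (all negative here)
h = np.maximum(0, z)                    # ReLU outputs: all zeros
err = h @ W2 - y                        # prediction error (constant prediction 0)

# Back-propagated gradient for W1 (up to a constant factor): the ReLU derivative
# 1[z > 0] is all zeros, so the whole gradient vanishes.
grad_W1 = X.T @ (np.outer(err, W2) * (z > 0))
print(np.abs(grad_W1).max())            # 0.0 -> training is stuck from the start
```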

“Blind Spot” of ReLU
• MNIST, Adam, 1M updates
Consider your initialization.

Considering Data
[Figure: a “label generator” network with k hidden ReLU neurons (weights v1, v2, …, vk, each unit computing a ReLU of v_i^T x) produces the label ŷ; the network to be trained has n hidden ReLU neurons with weights w1, w2, …, wn; the inputs x are drawn from N(0, 1).]
If n ≥ k and w_i = v_i, we obtain the global minimum.
The numbers k and n matter.

Considering Data
No local minima for n ≥ k + 2.

Reference
• Grzegorz Swirszcz, Wojciech Marian Czarnecki, Razvan Pascanu, “Local minima in training of neural networks”, arXiv, 2016
• Itay Safran, Ohad Shamir, “Spurious Local Minima are Common in Two-Layer ReLU Neural Networks”, arXiv, 2017
• Yi Zhou, Yingbin Liang, “Critical Points of Neural Networks: Analytical Forms and Landscape Properties”, arXiv, 2017
• Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah, “Failures of Gradient-Based Deep Learning”, arXiv, 2017
The theory should look like: “Under some conditions (initialization, data, ……), we can find the global optimum.”

Conjecture about Deep Learning
Almost all local minima have a loss very similar to that of the global optimum, and hence finding a local minimum is good enough.

Analyzing Hessian
• When we meet a critical point, it can be a saddle point or a local minimum.
• Analyzing H: if the network has N parameters, H has eigenvectors v1, v2, v3, …, vN with eigenvalues λ1, λ2, λ3, …, λN.
• We assume each λ has probability 1/2 (?) of being positive and 1/2 (?) of being negative.

Analyzing Hessian
• If N = 1 (λ1): 1/2 local minima (+), 1/2 local maxima (−); a saddle point is almost impossible.
• If N = 2 (λ1, λ2): 1/4 local minima (+ +), 1/4 local maxima (− −), 1/2 saddle points (+ −, − +).
• If N = 10: 1/1024 local minima, 1/1024 local maxima; almost every critical point is a saddle point.
When a network is very large, it is almost impossible to meet a local minimum.
Saddle points are what you need to worry about.
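
Under the independence assumption above, the counting behind these fractions is simply (1/2)^N; a short computation makes the scale concrete.

```python
# Probability that a critical point is a local minimum if each of the N
# eigenvalues is independently positive with probability 1/2 (the slide's assumption).
for N in (1, 2, 10, 100, 1_000_000):
    print(N, 0.5 ** N)
# 1 -> 0.5, 2 -> 0.25, 10 -> ~0.001, 100 -> ~7.9e-31, 1_000_000 -> underflows to 0.0
```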

Error v.s. Eigenvalues
Source of image: http://proceedings.mlr.press/v70/pennington17a/pennington17a.pdf
Instead of assuming each eigenvalue has probability 1/2 (?) of being negative, let that probability be p, where p is related to the error: the larger the error, the larger p.

Guess about Error Surface
https://stats385.github.io/assets/lectures/Understanding_and_improving_deep_learning_with_random_matrix_theory.pdf
[Figure: a guess about the error surface, with critical points labelled global minima, local minima (good enough), and saddle points.]

Training Error v.s. Eigenvalues
[Figure: training error plotted against the “degree of local minima”, i.e. the portion of positive eigenvalues of the Hessian; 1 − “degree of local minima” is the portion of negative eigenvalues.]

[Figure: empirical measurements compared with a theoretical prediction of the form ∝ (·)^{3/2}, plotted against 1 − “degree of local minima” (the portion of negative eigenvalues).]

Spin Glass v.s. Deep Learning
• Deep learning is the same as the spin glass model under 7 assumptions.
[Figure: spin glass model vs. network.]

More Theory
• If the size of the network is large enough, we can find the global optimum by gradient descent.
• Independent of initialization.

Reference
• Razvan Pascanu, Yann N. Dauphin, Surya Ganguli, Yoshua Bengio, “On the saddle point problem for non-convex optimization”, arXiv, 2014
• Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, NIPS, 2014
• Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun, “The Loss Surfaces of Multilayer Networks”, PMLR, 2015
• Jeffrey Pennington, Yasaman Bahri, “Geometry of Neural Network Loss Surfaces via Random Matrix Theory”, PMLR, 2017
• Benjamin D. Haeffele, Rene Vidal, “Global Optimality in Neural Network Training”, CVPR, 2017

What does the Error
Surface look like?

Error Surface
[Figure: an error surface L plotted over two weights w1 and w2.]

Profile
[Figure: the loss evaluated along the straight line from the initialization θ0 through the solution θ*, extended past it to θ0 + 2(θ* − θ0).]
Local minima are rare?

Profile

Profile
[Figure: loss profiles along the line between two random starting points and along the line between two “solutions”, θ0 and θ*.]
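
A sketch of how such a profile can be computed (in the spirit of the Goodfellow et al. reference listed below; loss_fn, theta_a, and theta_b are placeholders, not names from the slides):

```python
import numpy as np

def loss_profile(loss_fn, theta_a, theta_b, alphas=np.linspace(-0.5, 1.5, 101)):
    """Evaluate the loss along the line θ(α) = (1 − α)·theta_a + α·theta_b."""
    return alphas, np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])

# Toy usage with a quadratic stand-in "loss" so the snippet runs on its own:
loss_fn = lambda th: float(np.sum((th - 1.0) ** 2))
alphas, losses = loss_profile(loss_fn, np.zeros(5), np.full(5, 2.0))
print(losses.min())                      # the lowest loss sits between the two endpoints
```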

Profile - LSTM

Training Processing
6-layer CNN on CIFAR-10, different initializations.
Different initializations / different strategies usually lead to similar loss (there are some exceptions).

Training Processing
•Different strategies (the same initialization)

8% disagreement

Training Processing
When do the trajectories part ways?
Different training strategies end in different basins.
http://mypaper.pchome.com.tw/ccschoolgeo/post/1311484084

Training Processing
• Training strategies make a difference at all stages of training.

Larger basin for Adam

Batch Normalization

Skip Connection

Reference
• Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe, “Qualitatively characterizing neural network optimization problems”, ICLR, 2015
• Daniel Jiwoong Im, Michael Tao, Kristin Branson, “An Empirical Analysis of Deep Network Loss Surfaces”, arXiv, 2016
• Qianli Liao, Tomaso Poggio, “Theory II: Landscape of the Empirical Risk in Deep Learning”, arXiv, 2017
• Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein, “Visualizing the Loss Landscape of Neural Nets”, arXiv, 2017

Concluding Remarks

Concluding Remarks
• A deep linear network is not convex, but all of its local minima are global minima.
  • There are saddle points that are hard to escape.
• Deep (non-linear) networks do have local minima.
• We need more theory in the future.
• Conjecture:
  • When training a larger network, it is rare to meet a local minimum.
  • All local minima are almost as good as the global one.
• We can try to understand the error surface by visualization.
  • The error surface is not as complex as imagined.