Optimization (v6).pdf


About This Presentation

Optimization principles


Slide Content

Optimization
李宏毅
Hung-yi Lee

Last time …
[Figure: shallow vs. deep networks of small / medium / large size trying to fit a target function f*; can the function space eventually cover f*? What is the difference? Questions 1, 2, 3.]
Optimization: Is it possible to find f* in the function space?

Optimization
Network: f_θ(x)
Training data: {(x^1, ŷ^1), (x^2, ŷ^2), ..., (x^N, ŷ^N)}
L(θ) = Σ_{n=1}^{N} ‖ f_θ(x^n) − ŷ^n ‖
θ* = arg min_θ L(θ)
Optimization ≠ Learning
In Deep Learning, L(θ) is not convex.
Non-convex optimization is NP-hard.
Why can we solve the problem by gradient descent?

Loss of Deep Learning is not convex
There are at least exponentially many global minima for a neural net.
Why? Permuting the neurons in one layer (and their attached weights) does not change the loss.
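
A quick way to see this permutation symmetry is to shuffle the hidden units of a small network and check that the output (and hence the loss) is unchanged. This is an illustrative sketch, not from the slides; the sizes and names are made up.

```python
import numpy as np

# Toy one-hidden-layer network; permuting its hidden units (columns of W1,
# entries of b1, rows of W2 together) must leave the output unchanged.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                     # 5 inputs, 3 features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2 = rng.normal(size=(4, 2))

def forward(W1, b1, W2):
    h = np.maximum(0, x @ W1 + b1)              # ReLU hidden layer
    return h @ W2

perm = rng.permutation(4)                       # shuffle the 4 hidden units
out_original = forward(W1, b1, W2)
out_permuted = forward(W1[:, perm], b1[perm], W2[perm, :])
print(np.allclose(out_original, out_permuted))  # True: same function, same loss
```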

Non-convex ≠ Difficult
[Figure: three example loss curves L(θ).]
Gradient descent is not guaranteed to find the optimal solution.

Outline
Review: Hessian
Deep Linear Model
Deep Non-linear Model
Conjecture about Deep Learning
Empirical Observation about Error Surface

Hessian Matrix:
When Gradient is Zero
Some examples in this part are from:
https://www.math.upenn.edu/~kazdan/312F12/Notes/max-min-notesJan09/max-min.pdf

Training stops ….
• People believe training gets stuck because the parameters are near a critical point (a point where the gradient is zero): a local minimum? How about a saddle point?
http://www.deeplearningbook.org/contents/optimization.html

When Gradient is Zero
f(θ) = f(θ0) + (θ − θ0)^T g + 1/2 (θ − θ0)^T H (θ − θ0) + ⋯
The gradient g is a vector:  g_i = ∂f(θ0)/∂θ_i
The Hessian H is a matrix:   H_ij = ∂²f(θ0)/∂θ_i ∂θ_j = ∂²f(θ0)/∂θ_j ∂θ_i = H_ji   (H is symmetric)

Hessian
f(θ) = f(θ0) + (θ − θ0)^T g + 1/2 (θ − θ0)^T H (θ − θ0) + ⋯
H determines the curvature.
Source of image: http://www.deeplearningbook.org/contents/numerical.html

Hessian
f(θ) ≈ f(θ0) + (θ − θ0)^T g + 1/2 (θ − θ0)^T H (θ − θ0)
Newton’s method: find the point where ∂f(θ)/∂θ = 0.
∂/∂θ [(θ − θ0)^T g] = g
∂/∂θ [1/2 (θ − θ0)^T H (θ − θ0)] = H (θ − θ0)

Hessian
f(θ) ≈ f(θ0) + (θ − θ0)^T g + 1/2 (θ − θ0)^T H (θ − θ0)
Newton’s method:
∂f(θ)/∂θ ≈ g + H (θ − θ0) = 0
H (θ − θ0) = −g
θ − θ0 = −H^{−1} g
θ = θ0 − H^{−1} g       vs. gradient descent: θ = θ0 − η g
Newton’s method changes the direction and determines the step size.
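
As a sketch of this comparison (assuming a simple quadratic objective of my own choosing, not notation from the slides), one Newton step lands exactly on the critical point, while one gradient-descent step with a fixed learning rate only moves part of the way:

```python
import numpy as np

# Quadratic f(θ) = 1/2 θᵀHθ + bᵀθ, so the gradient is Hθ + b and the Hessian is H.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])                  # positive definite Hessian (assumed)
b = np.array([1.0, -1.0])

def grad(theta):
    return H @ theta + b

theta0 = np.array([3.0, -2.0])
newton = theta0 - np.linalg.solve(H, grad(theta0))   # θ0 − H⁻¹ g
gd     = theta0 - 0.1 * grad(theta0)                 # θ0 − η g with η = 0.1

print("Newton step:", newton, "-> gradient:", grad(newton))  # gradient ≈ 0 in one step
print("GD step    :", gd,     "-> gradient:", grad(gd))      # gradient still nonzero
```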

Hessian
Newton’s method: if f(θ) is a quadratic function, the critical point is obtained in one step.
What is the problem?
[Figure: Newton’s method intuition]
Source of image: https://math.stackexchange.com/questions/609680/newtons-method-intuition
Not suitable for Deep Learning.

Hessian
f(θ) = f(θ0) + (θ − θ0)^T g + 1/2 (θ − θ0)^T H (θ − θ0) + ⋯
At a critical point (g = 0):
f(θ) ≈ f(θ0) + 1/2 (θ − θ0)^T H (θ − θ0)
H tells us the properties of critical points.
Source of image: http://www.offconvex.org/2016/03/22/saddlepoints/

Review: Linear Algebra
http://speech.ee.ntu.edu.tw/~tlkagk/courses/LA_2016/Lecture/eigen.pdf
• If Av = λv (v is a vector, λ is a scalar):
  • v is an eigenvector of A (excluding the zero vector)
  • λ is the eigenvalue of A that corresponds to v
• A must be square.

Review: Positive/Negative Definite
• Let A be a symmetric n×n matrix.
• For every non-zero vector x (x ≠ 0):
  positive definite:       xᵀAx > 0   ⟺  all eigenvalues are positive
  positive semi-definite:  xᵀAx ≥ 0   ⟺  all eigenvalues are non-negative
  negative definite:       xᵀAx < 0   ⟺  all eigenvalues are negative
  negative semi-definite:  xᵀAx ≤ 0   ⟺  all eigenvalues are non-positive
Example: the identity matrix [[1, 0], [0, 1]] is positive definite.

Hessian
At a critical point:  f(θ) ≈ f(θ0) + 1/2 (θ − θ0)^T H (θ − θ0)
• H is positive definite (xᵀHx > 0 for every x ≠ 0; all eigenvalues are positive)
  → around θ0, f(θ) > f(θ0): local minimum.
• H is negative definite (xᵀHx < 0; all eigenvalues are negative)
  → around θ0, f(θ) < f(θ0): local maximum.
• xᵀHx ≥ 0?  xᵀHx ≤ 0?  (semi-definite cases)
• Sometimes xᵀHx > 0 and sometimes xᵀHx < 0 (mixed-sign eigenvalues)
  → saddle point.
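
A minimal sketch of this classification rule (the helper name and tolerance are my own choices): compute the eigenvalues of H at a critical point and read off their signs.

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from its (symmetric) Hessian H."""
    eig = np.linalg.eigvalsh(H)          # real eigenvalues of a symmetric matrix
    if np.any(np.abs(eig) < tol):
        return "degenerate (some eigenvalue is ~0, the second-order test is inconclusive)"
    if np.all(eig > 0):
        return "local minimum (H positive definite)"
    if np.all(eig < 0):
        return "local maximum (H negative definite)"
    return "saddle point (mixed-sign eigenvalues)"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 6.0]])))   # local minimum
print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, 6.0]])))  # saddle point
```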

Hessian
At a critical point:  f(θ) ≈ f(θ0) + 1/2 (θ − θ0)^T H (θ − θ0)
If v is a unit eigenvector of H (Hv = λv, ‖v‖ = 1), then
  vᵀHv = λ vᵀv = λ
so the change of f along v is governed by the sign of λ.
Because H is an n×n symmetric matrix, it has eigenvectors v1, v2, …, vn that form an orthonormal basis.
Writing x = c1 v1 + c2 v2 (and ignoring the 1/2 for simplicity):
  xᵀHx = (c1 v1 + c2 v2)ᵀ H (c1 v1 + c2 v2) = c1² λ1 + c2² λ2
because v1 and v2 are orthogonal.

Hessian
At a critical point:  f(θ) ≈ f(θ0) + 1/2 (θ − θ0)^T H (θ − θ0)
Because H is an n×n symmetric matrix, it has eigenvectors v1, v2, …, vn that form an orthonormal basis.
For any direction u = c1 v1 + c2 v2 + ⋯ + cn vn (a unit vector from θ0):
  uᵀHu = c1² λ1 + c2² λ2 + ⋯ + cn² λn
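
The identity above is easy to check numerically; this sketch (my own example, not from the slides) builds a random symmetric "Hessian" and verifies that uᵀHu = Σ cᵢ² λᵢ.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
H = (A + A.T) / 2                       # a random symmetric matrix standing in for a Hessian

lam, V = np.linalg.eigh(H)              # columns of V are orthonormal eigenvectors
u = rng.normal(size=4)
c = V.T @ u                             # coordinates of u in the eigenbasis

print(u @ H @ u, np.sum(c**2 * lam))    # the two values agree
```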

Examples
f(x, y) = x² + 3y²
∂f/∂x = 2x,  ∂f/∂y = 6y   →  critical point at x = 0, y = 0
∂²f/∂x∂x = 2,  ∂²f/∂x∂y = 0,  ∂²f/∂y∂x = 0,  ∂²f/∂y∂y = 6
H = [[2, 0], [0, 6]]  →  positive definite  →  local minimum

Examples
f(x, y) = −x² + 3y²
∂f/∂x = −2x,  ∂f/∂y = 6y   →  critical point at x = 0, y = 0
∂²f/∂x∂x = −2,  ∂²f/∂x∂y = 0,  ∂²f/∂y∂x = 0,  ∂²f/∂y∂y = 6
H = [[−2, 0], [0, 6]]  →  mixed-sign eigenvalues  →  saddle point
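
The two worked examples can be reproduced symbolically. A small sketch using SymPy (my own tooling choice, not part of the slides):

```python
import sympy as sp

x, y = sp.symbols("x y")
for f in (x**2 + 3*y**2, -x**2 + 3*y**2):
    grad = [sp.diff(f, v) for v in (x, y)]            # both gradients vanish at (0, 0)
    H = sp.hessian(f, (x, y))
    print(f, grad, H.subs({x: 0, y: 0}).eigenvals())
# x**2 + 3*y**2  -> eigenvalues 2 and 6: positive definite, local minimum
# -x**2 + 3*y**2 -> eigenvalues -2 and 6: mixed signs, saddle point
```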

Degenerate
• A degenerate Hessian has at least one zero eigenvalue.
f(x, y) = x² + y⁴
∂f/∂x = 2x,  ∂f/∂y = 4y³
∂²f/∂x∂x = 2,  ∂²f/∂x∂y = 0,  ∂²f/∂y∂x = 0,  ∂²f/∂y∂y = 12y²
At x = y = 0:  H = [[2, 0], [0, 0]]

Degenerate
• A degenerate Hessian has at least one zero eigenvalue.
f(x, y) = x² + y⁴:  at x = y = 0,  g = [0, 0],  H = [[2, 0], [0, 0]]
g(x, y) = x² − y⁴:  at x = y = 0,  g = [0, 0],  H = [[2, 0], [0, 0]]
No difference: the gradient and the Hessian are identical, even though the first function has a minimum at the origin and the second has a saddle.

Degenerate
f(x, y) = −x⁴ − y⁴
∂f/∂x = −4x³,  ∂f/∂y = −4y³
∂²f/∂x∂x = −12x²,  ∂²f/∂x∂y = 0,  ∂²f/∂y∂x = 0,  ∂²f/∂y∂y = −12y²
At x = y = 0:  g = [0, 0],  H = [[0, 0], [0, 0]]
h(x, y) = 0 also has g = [0, 0] and H = [[0, 0], [0, 0]] at x = y = 0: again the second-order test cannot tell the two apart.

Degenerate
http://homepages.math.uic.edu/~julius/monkeysaddle.html
f(x, y) = x³ − 3xy²   (monkey saddle)
∂f/∂x = 3x² − 3y²,  ∂f/∂y = −6xy
∂²f/∂x∂x = 6x,  ∂²f/∂x∂y = −6y,  ∂²f/∂y∂x = −6y,  ∂²f/∂y∂y = −6x
At the origin the gradient and the Hessian are both zero (cf. the previous examples).

Training stuck ≠ Zero Gradient
• People believe training gets stuck because the parameters are around a critical point !!!
http://www.deeplearningbook.org/contents/optimization.html

Training stuck ≠ Zero Gradient
http://videolectures.net/deeplearning2015_bengio_theoretical_motivations/
Approach a saddle point, and then escape.

Deep Linear Network

https://arxiv.org/abs/1412.6544
[Figure: a linear network with one weight per layer, x → w1 → w2 → y, trained on a single example with x = 1, ŷ = 1.]

L = (ŷ − w1 w2 x)² = (1 − w1 w2)²
∂L/∂w1 = 2 (1 − w1 w2) (−w2)
∂L/∂w2 = 2 (1 − w1 w2) (−w1)
∂²L/∂w1² = 2 w2²
∂²L/∂w2² = 2 w1²
∂²L/∂w1∂w2 = ∂²L/∂w2∂w1 = −2 + 4 w1 w2
The probability of getting stuck at the saddle point is almost zero; it is easy to escape.
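
A sketch checking the toy example above symbolically (SymPy is my own tooling choice): at the origin the gradient vanishes and the Hessian has eigenvalues ±2, so w1 = w2 = 0 is a non-degenerate saddle point.

```python
import sympy as sp

w1, w2 = sp.symbols("w1 w2")
L = (1 - w1 * w2)**2                        # the loss from the slide, with x = ŷ = 1

grad = sp.Matrix([sp.diff(L, w1), sp.diff(L, w2)])
H = sp.hessian(L, (w1, w2))

print(grad.subs({w1: 0, w2: 0}).T)          # [0, 0]: the origin is a critical point
print(H.subs({w1: 0, w2: 0}))               # Matrix([[0, -2], [-2, 0]])
print(H.subs({w1: 0, w2: 0}).eigenvals())   # eigenvalues +2 and -2: a saddle, easy to escape
```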

2-hidden layers
[Figure: x → w1 → w2 → w3 → y, with x = 1, ŷ = 1.]
L = (1 − w1 w2 w3)²
∂L/∂w1 = 2 (1 − w1 w2 w3) (−w2 w3)
∂L/∂w2 = 2 (1 − w1 w2 w3) (−w1 w3)
∂L/∂w3 = 2 (1 − w1 w2 w3) (−w1 w2)
∂²L/∂w1² = 2 (w2 w3)²
∂²L/∂w2∂w1 = −2 w3 + 4 w1 w2 w3²
At w1 = w2 = w3 = 0:
  H = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]   →  so flat
At w1 = w2 = 0, w3 = k:
  H = [[0, −2k, 0], [−2k, 0, 0], [0, 0, 0]]   →  saddle point
w1 w2 w3 = 1 gives the global minima.
All minima are global, but some critical points are “bad”.
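
Extending the same symbolic check to K weights (a sketch, not from the slides) shows why the origin becomes "so flat": for K ≥ 3 every first and second derivative vanishes there, so the critical point is degenerate.

```python
import sympy as sp

for K in (2, 3, 5):
    w = sp.symbols(f"w1:{K + 1}")                       # symbols w1 … wK
    L = (1 - sp.Mul(*w))**2                             # L = (1 − w1 w2 ⋯ wK)²
    H0 = sp.hessian(L, w).subs({wi: 0 for wi in w})     # Hessian at the origin
    print(K, H0.eigenvals())
# K = 2: eigenvalues +2 and -2 (an ordinary saddle)
# K = 3, 5: all eigenvalues are 0 (degenerate critical point, extremely flat)
```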

10 hidden layers
[Figure: the loss of a 10-hidden-layer linear network plotted along two directions θ1 and θ2 from the origin.]

Demo

Deep Linear Network
y = W_K W_{K−1} ⋯ W_2 W_1 x
L = Σ_{n=1}^{N} ‖ W_K W_{K−1} ⋯ W_1 x^n − ŷ^n ‖²
Hidden layer size ≥ input dim, output dim.
More than two hidden layers can produce saddle points without negative eigenvalues.

Reference
• Kenji Kawaguchi, “Deep Learning without Poor Local Minima”, NIPS, 2016
• Haihao Lu, Kenji Kawaguchi, “Depth Creates No Bad Local Minima”, arXiv, 2017
• Thomas Laurent, James von Brecht, “Deep linear neural networks with arbitrary loss: All local minima are global”, arXiv, 2017
• Maher Nouiehed, Meisam Razaviyayn, “Learning Deep Models: Critical Points and Local Openness”, arXiv, 2018

Non-linear Deep Network
Does it have local minima?
Proving that something does not exist is hard; proving that it exists is relatively easy.
(Thanks to classmate 曾子家 for spotting a typo on the slides.)

Even a Simple Task can be Difficult

ReLU has local minima
[Figure: two small one-hidden-unit ReLU networks (weights +1, biases such as 0, −3 and 0, −4, −7) fitted to the data points (-1.3) (1,-3) (3,0) (4,1) (5,2).]
This ReLU network has local minima.

“Blind Spot” of ReLU
[Figure: a ReLU network whose hidden units all output 0 for the inputs x, so the output y and every weight gradient are 0.]
Gradient is zero.
It is pretty easy to make this happen ……
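
A hypothetical numeric illustration of the blind spot (the network shape, data, and bias values are invented for this example): when every pre-activation is negative, all ReLUs output zero and the gradient with respect to the first-layer weights is exactly zero, even though the loss is far from minimal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # toy inputs
y = rng.normal(size=100)                # toy targets

W1 = rng.normal(size=(3, 8))
b1 = np.full(8, -100.0)                 # absurdly negative biases: every ReLU is dead
W2 = rng.normal(size=8)

z = X @ W1 + b1                         # pre-activations (all negative here)
h = np.maximum(0, z)                    # ReLU outputs: all zeros
err = h @ W2 - y                        # prediction error (constant prediction 0)

# Back-propagated gradient for W1 (up to a constant factor): the ReLU derivative
# 1[z > 0] is all zeros, so the whole gradient vanishes.
grad_W1 = X.T @ (np.outer(err, W2) * (z > 0))
print(np.abs(grad_W1).max())            # 0.0 -> training is stuck from the start
```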

“Blind Spot” of ReLU
• MNIST, Adam, 1M updates
Consider your initialization.

Considering Data
[Figure: a “label generator” network with k hidden ReLU neurons (weights v1, v2, …, vk, each unit computing a ReLU of v_i^T x) produces the label ŷ; the network to be trained has n hidden ReLU neurons with weights w1, w2, …, wn; the inputs x are drawn from N(0, 1).]
If n ≥ k and w_i = v_i, we obtain the global minimum.
The numbers k and n matter.

Considering Data
No local minima for n ≥ k + 2.

Reference
• Grzegorz Swirszcz, Wojciech Marian Czarnecki, Razvan Pascanu, “Local minima in training of neural networks”, arXiv, 2016
• Itay Safran, Ohad Shamir, “Spurious Local Minima are Common in Two-Layer ReLU Neural Networks”, arXiv, 2017
• Yi Zhou, Yingbin Liang, “Critical Points of Neural Networks: Analytical Forms and Landscape Properties”, arXiv, 2017
• Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah, “Failures of Gradient-Based Deep Learning”, arXiv, 2017
The theory should look like: “Under some conditions (initialization, data, ……), we can find the global optimum.”

Conjecture about Deep Learning
Almost all local minima have a loss very similar to that of the global optimum, and hence finding a local minimum is good enough.

Analyzing Hessian
• When we meet a critical point, it can be a saddle point or a local minimum.
• Analyzing H: if the network has N parameters, H has eigenvectors v1, v2, v3, …, vN with eigenvalues λ1, λ2, λ3, …, λN.
• We assume each λ has probability 1/2 (?) of being positive and 1/2 (?) of being negative.

Analyzing Hessian
• If N = 1 (λ1): 1/2 local minima (+), 1/2 local maxima (−); a saddle point is almost impossible.
• If N = 2 (λ1, λ2): 1/4 local minima (+ +), 1/4 local maxima (− −), 1/2 saddle points (+ −, − +).
• If N = 10: 1/1024 local minima, 1/1024 local maxima; almost every critical point is a saddle point.
When a network is very large, it is almost impossible to meet a local minimum.
Saddle points are what you need to worry about.
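
Under the independence assumption above, the counting behind these fractions is simply (1/2)^N; a short computation makes the scale concrete.

```python
# Probability that a critical point is a local minimum if each of the N
# eigenvalues is independently positive with probability 1/2 (the slide's assumption).
for N in (1, 2, 10, 100, 1_000_000):
    print(N, 0.5 ** N)
# 1 -> 0.5, 2 -> 0.25, 10 -> ~0.001, 100 -> ~7.9e-31, 1_000_000 -> underflows to 0.0
```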

Error v.s. Eigenvalues
Source of image: http://proceedings.mlr.press/v70/pennington17a/pennington17a.pdf
Instead of assuming each eigenvalue has probability 1/2 (?) of being negative, let that probability be p, where p is related to the error: the larger the error, the larger p.

Guess about Error Surface
https://stats385.github.io/assets/lectures/Understanding_and_improving_deep_learning_with_random_matrix_theory.pdf
[Figure: a guess about the error surface, with critical points labelled global minima, local minima (good enough), and saddle points.]

Training Error v.s. Eigenvalues
[Figure: training error plotted against the “degree of local minima”, i.e. the portion of positive eigenvalues of the Hessian; 1 − “degree of local minima” is the portion of negative eigenvalues.]

[Figure: empirical measurements compared with a theoretical prediction of the form ∝ (·)^{3/2}, plotted against 1 − “degree of local minima” (the portion of negative eigenvalues).]

Spin Glass v.s. Deep Learning
• Deep learning is the same as the spin glass model under 7 assumptions.
[Figure: spin glass model vs. network.]

More Theory
• If the size of the network is large enough, we can find the global optimum by gradient descent.
• Independent of initialization.

Reference
• Razvan Pascanu, Yann N. Dauphin, Surya Ganguli, Yoshua Bengio, “On the saddle point problem for non-convex optimization”, arXiv, 2014
• Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, NIPS, 2014
• Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun, “The Loss Surfaces of Multilayer Networks”, PMLR, 2015
• Jeffrey Pennington, Yasaman Bahri, “Geometry of Neural Network Loss Surfaces via Random Matrix Theory”, PMLR, 2017
• Benjamin D. Haeffele, Rene Vidal, “Global Optimality in Neural Network Training”, CVPR, 2017

What does the Error
Surface look like?

Error Surface
[Figure: an error surface L plotted over two weights w1 and w2.]

Profile
[Figure: the loss evaluated along the straight line from the initialization θ0 through the solution θ*, extended past it to θ0 + 2(θ* − θ0).]
Local minima are rare?

Profile

Profile
[Figure: loss profiles along the line between two random starting points and along the line between two “solutions”, θ0 and θ*.]
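
A sketch of how such a profile can be computed (in the spirit of the Goodfellow et al. reference listed below; loss_fn, theta_a, and theta_b are placeholders, not names from the slides):

```python
import numpy as np

def loss_profile(loss_fn, theta_a, theta_b, alphas=np.linspace(-0.5, 1.5, 101)):
    """Evaluate the loss along the line θ(α) = (1 − α)·theta_a + α·theta_b."""
    return alphas, np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])

# Toy usage with a quadratic stand-in "loss" so the snippet runs on its own:
loss_fn = lambda th: float(np.sum((th - 1.0) ** 2))
alphas, losses = loss_profile(loss_fn, np.zeros(5), np.full(5, 2.0))
print(losses.min())                      # the lowest loss sits between the two endpoints
```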

Profile - LSTM

Training Processing
6-layer CNN on CIFAR-10, different initializations.
Different initializations / different strategies usually lead to similar loss (there are some exceptions).

Training Processing
•Different strategies (the same initialization)

8% disagreement

Training Processing
When do the trajectories part ways?
Different training strategies end in different basins.
http://mypaper.pchome.com.tw/ccschoolgeo/post/1311484084

Training Processing
• Training strategies make a difference at all stages of training.

Larger basin for Adam

Batch Normalization

Skip Connection

Reference
• Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe, “Qualitatively characterizing neural network optimization problems”, ICLR, 2015
• Daniel Jiwoong Im, Michael Tao, Kristin Branson, “An Empirical Analysis of Deep Network Loss Surfaces”, arXiv, 2016
• Qianli Liao, Tomaso Poggio, “Theory II: Landscape of the Empirical Risk in Deep Learning”, arXiv, 2017
• Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein, “Visualizing the Loss Landscape of Neural Nets”, arXiv, 2017

Concluding Remarks

Concluding Remarks
• A deep linear network is not convex, but all of its local minima are global minima.
  • There are saddle points that are hard to escape.
• Deep (non-linear) networks do have local minima.
• We need more theory in the future.
• Conjecture:
  • When training a larger network, it is rare to meet a local minimum.
  • All local minima are almost as good as the global one.
• We can try to understand the error surface by visualization.
  • The error surface is not as complex as imagined.