What We Learn…
3.1 Challenges in Neural Network Optimization – saddle points and plateaus
3.2 Non-convex optimization intuition
3.3 Overview of optimization algorithms
3.4 Momentum based algorithms
3.5 Algorithms with Adaptive Learning Rates
Approach of Gradient Descent
gd(eta, grad)
new = old - eta * grad
● First-order gradient descent algorithms use the first-order derivative to obtain both the value (magnitude) and the direction of the update.
def gd(eta, f_grad):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x)          # step against the gradient
        results.append(float(x))
    return results

gd(eta=0.2, f_grad=f_grad)            # f_grad: gradient function of the objective (see below)
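An illustrative way to call it, assuming the simple objective f(x) = x² with gradient 2x (the objective is not specified on the slide):

def f_grad(x):
    return 2 * x                      # gradient of the assumed objective f(x) = x**2

gd(eta=0.2, f_grad=f_grad)
# returns [10.0, 6.0, 3.6, 2.16, ...]: x shrinks geometrically toward the minimum at 0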
Stochastic Gradient Descent
● In deep learning, the objective function is the average of the loss functions over all examples in the training dataset.
● Given a training dataset of n examples, let f_i(x) be the loss function with respect to the training example of index i, where x is the parameter vector.
● The objective function: f(x) = (1/n) Σ_{i=1}^{n} f_i(x)
● The gradient of the objective function at x: ∇f(x) = (1/n) Σ_{i=1}^{n} ∇f_i(x)
● Update x as: x ← x − η ∇f_i(x), where the index i is drawn uniformly at random at each iteration.
● The computational cost of each iteration is O(1), independent of n (see the sketch below).
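A minimal sketch of this per-example update; the quadratic losses and names below are illustrative assumptions, not taken from the slide:

import random

def sgd_step(x, eta, grad_fns):
    # Sample one example uniformly and step against its gradient: x <- x - eta * grad f_i(x)
    i = random.randrange(len(grad_fns))      # cost per iteration is O(1), independent of n
    return x - eta * grad_fns[i](x)

# Illustrative per-example losses f_i(x) = (x - y_i)**2 / 2, so grad f_i(x) = x - y_i
ys = [1.0, 2.0, 3.0]
grad_fns = [lambda x, y=y: x - y for y in ys]
x = 0.0
for _ in range(200):
    x = sgd_step(x, eta=0.05, grad_fns=grad_fns)
# x fluctuates around 2.0, the minimizer of the average loss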
SGD with Constant Learning Rate
● The trajectory of the variables in stochastic gradient descent is much noisier than in gradient descent. This is due to the stochastic nature of the gradient: even near the minimum, uncertainty is injected by the instantaneous gradient η∇f_i(x).
import numpy as np

def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate a noisy stochastic gradient by adding Gaussian noise
    g1 += np.random.normal(0.0, 1, (1,))
    g2 += np.random.normal(0.0, 1, (1,))
    eta_t = eta * lr()   # eta (base learning rate) and lr (schedule) are defined globally
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)
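A minimal way to run the function above; the objective f(x1, x2) = x1² + 2·x2², the constant schedule, and the starting point are illustrative assumptions:

eta = 0.1

def lr():
    return 1                      # constant learning rate schedule

def f_grad(x1, x2):
    return 2 * x1, 4 * x2         # gradient of f(x1, x2) = x1**2 + 2 * x2**2

x1, x2, s1, s2 = -5.0, -2.0, 0, 0
for _ in range(50):
    x1, x2, s1, s2 = sgd(x1, x2, s1, s2, f_grad)
# (x1, x2) drifts toward (0, 0) but keeps jittering because of the injected gradient noise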
Minibatch Stochastic Gradient Descent
●In each iteration, we first randomly sample a minibatch B consisting of a fixed number
of training examples.
●We then compute the derivative (gradient) of the average loss on the minibatch with
regard to the model parameters.
●Finally, we multiply the gradient by a predetermined positive value η and subtract the
resulting term from the current parameter values.
● |B| represents the number of examples in each minibatch (the batch size) and η denotes the learning rate.
Minibatch Stochastic Gradient Descent Algorithm
● The gradient at time t is calculated as
  g_{t,t−1} = ∂_w (1/|B_t|) Σ_{i∈B_t} f(x_i, w_{t−1}) = (1/|B_t|) Σ_{i∈B_t} h_{i,t−1}
● |B| represents the number of examples in each minibatch (the batch size) and η denotes the learning rate.
● The parameters are updated as w_t ← w_{t−1} − η · g_{t,t−1}.
● h_{i,t−1} = ∂_w f(x_i, w_{t−1}) is the stochastic gradient for sample i, computed using the weights updated at time t−1.
SGD Algorithm
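A minimal sketch of one minibatch SGD step consistent with the update above, using a linear regression squared loss as a stand-in; the data layout, loss, and names are illustrative assumptions:

import numpy as np

def minibatch_sgd_step(w, X, y, eta, batch_size):
    # Sample a minibatch B_t of indices without replacement
    idx = np.random.choice(len(X), batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # g_t = average gradient of the squared loss over the minibatch
    grad = Xb.T @ (Xb @ w - yb) / batch_size
    # w_t = w_{t-1} - eta * g_t
    return w - eta * grad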
Momentum
Leaky Average in Minibatch SGD
● Replace the gradient computation by a "leaky average" for better variance reduction, with β ∈ (0, 1):
  v_t = β · v_{t−1} + g_{t,t−1}
● This effectively replaces the instantaneous gradient by one that has been averaged over multiple past gradients.
● v is called momentum. It accumulates past gradients.
● A large β amounts to a long-range average, whereas a small β amounts to only a slight correction relative to a plain gradient method.
● The new gradient replacement no longer points in the direction of steepest descent on a particular instance, but rather in the direction of a weighted average of past gradients.
Momentum Algorithm
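A minimal sketch of the momentum update described above, v_t = β·v_{t−1} + g_t followed by x_t = x_{t−1} − η·v_t (function and variable names are illustrative):

def momentum_step(x, v, grad, eta, beta):
    v = beta * v + grad          # leaky average (momentum) of past gradients
    x = x - eta * v              # step along the accumulated direction
    return x, v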
Momentum Method Example
●Consider a moderately distorted ellipsoid objective
● f has its minimum at (0, 0). This function is very flat in the x1 direction.
● For eta = 0.4, without momentum:
● The gradient in the x2 direction oscillates much more than in the horizontal x1 direction.
● For eta = 0.6, without momentum:
● Convergence in the x1 direction improves, but the overall solution quality is worse since the iterates diverge in the x2 direction.
●Apply momentum for eta = 0.6
● For beta = 0.5:
● Converges well, with fewer oscillations and larger steps in the x1 direction.
● For beta = 0.25:
● Convergence is reduced: more oscillations, with larger magnitude (a runnable sketch follows).
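A runnable sketch of this experiment. The concrete objective f(x1, x2) = 0.1·x1² + 2·x2² is an assumption borrowed from the Dive into Deep Learning example; the slide itself only states that the ellipsoid is flat in the x1 direction:

def f_grad_2d(x1, x2):
    # gradient of the assumed objective f(x1, x2) = 0.1 * x1**2 + 2 * x2**2
    return 0.2 * x1, 4 * x2

def momentum_2d(x1, x2, v1, v2, eta=0.6, beta=0.5):
    g1, g2 = f_grad_2d(x1, x2)
    v1 = beta * v1 + g1           # accumulate past gradients per coordinate
    v2 = beta * v2 + g2
    return x1 - eta * v1, x2 - eta * v2, v1, v2

x1, x2, v1, v2 = -5.0, -2.0, 0.0, 0.0
for _ in range(20):
    x1, x2, v1, v2 = momentum_2d(x1, x2, v1, v2)
# With beta = 0.5 the iterates settle near (0, 0); plain gradient descent with eta = 0.6
# would diverge along x2, and beta = 0.25 still converges but with larger oscillations.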
Adagrad Algorithm
● Maintain a variable s_t to accumulate the past gradient variance: s_t = s_{t−1} + g_t² (element-wise square).
● Update: x_t = x_{t−1} − (η / √(s_t + ε)) · g_t (see the sketch below).
● Operations are applied coordinate-wise: √(1/v) has entries √(1/v_i) and u · v has entries u_i · v_i. η is the learning rate and ε is an additive constant that ensures that we do not divide by 0.
● Initialize s_0 = 0.
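A minimal per-coordinate sketch of these updates (function and parameter names are illustrative):

import numpy as np

def adagrad_step(x, s, grad, eta=0.1, eps=1e-6):
    s = s + grad ** 2                       # s_t = s_{t-1} + g_t**2 (element-wise)
    x = x - eta * grad / np.sqrt(s + eps)   # per-coordinate scaled step
    return x, s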
RMSProp Algorithm
● Use a leaky average to accumulate the past gradient variance:
  s_t = γ · s_{t−1} + (1 − γ) · g_t²
● Parameter γ > 0.
● Update: x_t = x_{t−1} − (η / √(s_t + ε)) · g_t (sketched below).
● The constant ε > 0 is set to 10⁻⁶ to ensure that we do not suffer from division by zero or overly large step sizes.
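A minimal sketch of the RMSProp update (names are illustrative):

import numpy as np

def rmsprop_step(x, s, grad, eta=0.01, gamma=0.9, eps=1e-6):
    s = gamma * s + (1 - gamma) * grad ** 2   # leaky average of squared gradients
    x = x - eta * grad / np.sqrt(s + eps)     # per-coordinate scaled step
    return x, s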
Review of techniques learned so far
1. Stochastic gradient descent
   ○ more effective than gradient descent when solving optimization problems, e.g., due to its inherent resilience to redundant data.
2. Minibatch stochastic gradient descent
   ○ affords significant additional efficiency arising from vectorization, using larger sets of observations in one minibatch. This is the key to efficient multi-machine, multi-GPU and overall parallel processing.
3. Momentum
   ○ added a mechanism for aggregating a history of past gradients to accelerate convergence.
4. Adagrad
   ○ used per-coordinate scaling to allow for a computationally efficient preconditioner.
5. RMSProp
   ○ decoupled per-coordinate scaling from a learning rate adjustment.
Adam
● Adam combines all of these techniques into one efficient learning algorithm.
● It is one of the more robust and effective optimization algorithms to use in deep learning.
● Adam can diverge due to poor variance control (a disadvantage).
● Adam uses exponentially weighted moving averages (also known as leaky averaging) to obtain an estimate of both the momentum and the second moment of the gradient.
Adam Algorithm
● State variables (leaky averages of the gradient and of its element-wise square):
  v_t = β1 · v_{t−1} + (1 − β1) · g_t
  s_t = β2 · s_{t−1} + (1 − β2) · g_t²
● β1 and β2 are nonnegative weighting parameters. Common choices for them are β1 = 0.9 and β2 = 0.999. That is, the variance estimate moves much more slowly than the momentum term.
● Initialize v_0 = s_0 = 0.
● Normalize the state variables to correct the initialization bias:
  v̂_t = v_t / (1 − β1^t),  ŝ_t = s_t / (1 − β2^t)
Adam Algorithm
● Rescale the gradient: g′_t = η · v̂_t / (√(ŝ_t) + ε)
● Compute updates: x_t = x_{t−1} − g′_t (see the sketch below).
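A minimal sketch combining the state updates, bias correction, and rescaled step (names are illustrative; t starts at 1):

import numpy as np

def adam_step(x, v, s, grad, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-6):
    v = beta1 * v + (1 - beta1) * grad            # momentum estimate v_t
    s = beta2 * s + (1 - beta2) * grad ** 2       # second-moment estimate s_t
    v_hat = v / (1 - beta1 ** t)                  # bias-corrected estimates
    s_hat = s / (1 - beta2 ** t)
    x = x - eta * v_hat / (np.sqrt(s_hat) + eps)  # rescaled gradient step
    return x, v, s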
Question
1. Compute the values of (w1, w2) that minimize the error. Compute the minimum possible value of the error.
2. What will be the value of (w1, w2) at time (t + 1) if standard gradient descent is used?
3. What will be the value of (w1, w2) at time (t + 1) if momentum is used?
4. What will be the value of (w1, w2) at time (t + 1) if RMSProp is used?
5. What will be the value of (w1, w2) at time (t + 1) if Adam is used?
Solution
Reference Textbook: Dive into Deep Learning
● Chapter 12 (online version)