Deep Neural Networks Introduction

ratnababum, Sep 19, 2024

About This Presentation

DNN


Slide Content

Deep Neural Network
Module 3

Optimization

What We Learn
3.1 Challenges in neural network optimization: saddle points and plateaus
3.2 Non-convex optimization intuition
3.3 Overview of optimization algorithms
3.4 Momentum-based algorithms
3.5 Algorithms with adaptive learning rates

Optimization Algorithms

Optimization Algorithm
● Optimization algorithms train deep learning models.
● Optimization algorithms are the tools that allow us to
○ continually update the model parameters
○ so as to minimize the value of the loss function, as evaluated on the training set.
● In optimization, a loss function is often referred to as the objective function of the optimization problem.
● By tradition and convention, most optimization algorithms are concerned with minimization.

Optimization

Why Optimization Algorithms?
● The performance of the optimization algorithm directly affects the model's training efficiency.
● Understanding the principles of different optimization algorithms and the role of their hyperparameters enables us to tune the hyperparameters in a targeted manner and improve the performance of deep learning models.
● The goal of optimization is to reduce the training error. The goal of deep learning is to reduce the generalization error, which also requires reducing overfitting.

Optimization Challenges in Deep Learning
●Local minima
●Saddle points
●Vanishing gradients

Local Minima
● For any objective function f(x), if the value of f(x) at x is smaller than the values of f(x) at any other point in the vicinity of x, then f(x) could be a local minimum.
● If the value of f(x) at x is the minimum of the objective function over the entire domain, then f(x) is the global minimum.
● In minibatch stochastic gradient descent, the natural variation of gradients over minibatches is able to dislodge the parameters from local minima.

Finding the Minimum of a Function

Derivatives of a Function

Saddle points
● A saddle point is any location where all gradients of a function vanish, but which is neither a global nor a local minimum.
● E.g., f(x, y) = x^2 − y^2
○ saddle point at (0, 0)
○ maximum w.r.t. y
○ minimum w.r.t. x

Derivatives at a Saddle (Inflection) Point

Functions of Multiple Variables

Gradient of a Scalar Function

Gradient of a Scalar Function with Multiple Variables

Hessian

Solution of Unconstrained Minimization
● Solve for x where the gradient equals zero: ∇f(x) = 0.
● Compute the Hessian matrix ∇²f(x) at the candidate solution and verify:
○ Local minimum: the eigenvalues of the Hessian matrix are all positive.
○ Local maximum: the eigenvalues of the Hessian matrix are all negative.
○ Saddle point: the eigenvalues of the Hessian matrix at the zero-gradient position are mixed, both negative and positive.
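
As a quick numerical illustration of the eigenvalue test (a minimal sketch, not from the slides), consider the saddle example f(x, y) = x^2 − y^2, whose Hessian is constant:

import numpy as np

# Hessian of f(x, y) = x^2 - y^2 is diag(2, -2) everywhere
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)  # [-2.  2.] -> mixed signs, so (0, 0) is a saddle point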


Vanishing gradients
● Consider the function f(x) = tanh(x).
● f′(x) = 1 − tanh²(x)
○ f′(4) ≈ 0.0013
● The gradient of f is close to zero for inputs of large magnitude.
● Vanishing gradients can cause optimization to stall.
○ Reparameterization of the problem helps.
○ So does good initialization of the parameters.
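
To check the quoted number (a small sketch, not from the deck):

import math

def tanh_grad(x):
    # f'(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

print(tanh_grad(0.0))  # 1.0: healthy gradient near the origin
print(tanh_grad(4.0))  # ~0.0013: the gradient has all but vanished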

Gradient Descent

How to Find the Global Minimum?

Finding the Global Minimum Iteratively

Approach of Gradient Descent

Approach of Gradient Descent

gd(eta, grad):
    new = old - eta * grad

Approach of Gradient Descent
● First-order gradient descent algorithms use the first-order derivative to obtain the magnitude and direction of each update.

def gd(eta, f_grad):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x)
        results.append(float(x))
    return results

f_grad = lambda x: 2 * x  # assumed example objective f(x) = x^2
gd(0.2, f_grad)

Effect of Learning Rate on Gradient Descent

Learning Rate
● The role of the learning rate is to moderate the degree to which the weights are changed at each step.
● The learning rate η is set by the algorithm designer.
● If the learning rate is too small, x will update very slowly, requiring more iterations to get a better solution.
● If the learning rate is too large, the solution oscillates and, in the worst case, might even diverge.

Learning rate and Gradient Descent
gd(0.05, f_grad)    gd(1.1, f_grad)

Slow learning: if we pick eta too small, we make little progress.
Oscillations: if we pick eta too large, the solution oscillates and in the worst case it might diverge.

Stochastic Gradient Descent

Stochastic Gradient Descent
● In deep learning, the objective function is the average of the loss functions over the examples in the training dataset.
● Given a training dataset of n examples, let f_i(x) be the loss function with respect to the training example of index i, where x is the parameter vector.
● The objective function: f(x) = (1/n) · Σ_{i=1..n} f_i(x)
● The gradient of the objective function at x: ∇f(x) = (1/n) · Σ_{i=1..n} ∇f_i(x)
● Update x by sampling an index i uniformly at random: x ← x − η∇f_i(x)
● The computational cost of each iteration is O(1), independent of n.

SGD with Constant Learning Rate
● The trajectory of the variables in stochastic gradient descent is much noisier than in gradient descent. This is due to the stochastic nature of the gradient: even near the minimum, uncertainty is injected by the instantaneous gradient η∇f_i(x).

import numpy as np

def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # simulate a noisy gradient
    g1 += np.random.normal(0.0, 1, (1,))
    g2 += np.random.normal(0.0, 1, (1,))
    eta_t = eta * lr()  # eta and the schedule lr() are defined globally
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)

Dynamic Learning Rate

Dynamic Learning Rate
● Replace η with a time-dependent learning rate η(t).
○ This adds to the complexity of controlling convergence of an optimization algorithm.
● A few basic strategies adjust η over time (a sketch of the first one follows below):
1. Piecewise constant
a. Decrease the learning rate, e.g., whenever progress in optimization stalls.
b. This is a common strategy for training deep networks.
2. Exponential decay
a. Leads to premature stopping before the algorithm has converged.
3. Polynomial decay, e.g., with α = 0.5.
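
As referenced above, a piecewise-constant schedule might look like this (the breakpoints and rates are assumed, for illustration only):

def piecewise_lr(t):
    # assumed breakpoints: cut the rate by 10x after steps 1000 and 2000
    if t < 1000:
        return 0.1
    elif t < 2000:
        return 0.01
    return 0.001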


Exponential Decay
● The variance in the parameters is significantly reduced.
● However, the rate decays so quickly that the algorithm fails to converge at all.

import math

t = 0  # global step counter

def exponential_lr():
    global t
    t += 1
    return math.exp(-0.1 * t)

Polynomial Decay
● Use a polynomial decay:
○ The learning rate decays with the inverse square root of the number of steps.
○ Convergence gets better after only 50 steps.

t = 0  # global step counter

def polynomial_lr():
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

Review
● Gradient descent
○ Uses the full dataset to compute gradients and to update parameters, one pass at a time.
○ Gradient descent is not particularly data efficient whenever the data is very similar.
● Stochastic gradient descent
○ Processes one observation at a time to make progress.
○ Stochastic gradient descent is not particularly computationally efficient, since CPUs and GPUs cannot exploit the full power of vectorization.
○ For noisy gradients, the choice of the learning rate is critical.
■ If we decrease it too rapidly, convergence stalls.
■ If we are too lenient, we fail to converge to a good enough solution since noise keeps driving us away from optimality.
● Minibatch SGD
○ Accelerates computation, i.e., offers better computational efficiency.
○ Averaging gradients reduces the amount of variance.

Minibatch Stochastic Gradient Descent

Minibatch Stochastic Gradient Descent
● In each iteration, we first randomly sample a minibatch B consisting of a fixed number of training examples.
● We then compute the derivative (gradient) of the average loss on the minibatch with regard to the model parameters.
● Finally, we multiply the gradient by a predetermined positive value η and subtract the resulting term from the current parameter values.
● |B| represents the number of examples in each minibatch (the batch size) and η denotes the learning rate.

Minibatch Stochastic Gradient Descent Algorithm
● The gradient at time t is calculated as
g_t = (1/|B|) · Σ_{i ∈ B_t} ∇f_i(x_{t−1})
● |B| represents the number of examples in each minibatch (the batch size) and η denotes the learning rate.
● ∇f_i(x_{t−1}) is the stochastic gradient for sample i, computed using the weights as updated at time t − 1.
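
A minimal NumPy sketch of one such update (minibatch_sgd_step, data, and grad_fn are illustrative names, not from the deck):

import numpy as np

def minibatch_sgd_step(x, data, batch_size, eta, grad_fn):
    # sample a minibatch B_t of fixed size uniformly at random
    idx = np.random.choice(len(data), size=batch_size, replace=False)
    # average the per-example gradients over the minibatch
    g = np.mean([grad_fn(x, data[i]) for i in idx], axis=0)
    # step against the averaged gradient
    return x - eta * g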

SGD Algorithm

Momentum

Leaky Average in Minibatch SGD
● Replace the gradient computation by a "leaky average" for better variance reduction, with β ∈ (0, 1):
v_t = β·v_{t−1} + g_t
● This effectively replaces the instantaneous gradient by one that has been averaged over multiple past gradients.
● v is called momentum. It accumulates past gradients.
● A large β amounts to a long-range average, while a small β amounts to only a slight correction relative to a plain gradient method.
● The new gradient replacement no longer points in the direction of steepest descent on a particular instance, but rather in the direction of a weighted average of past gradients. (A code sketch follows below.)
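
As referenced above, a minimal sketch of the momentum update (variable names are mine):

def momentum_step(x, v, grad, eta, beta):
    # leaky average of past gradients (the momentum state v)
    v = beta * v + grad
    # step along the averaged direction instead of the raw gradient
    x = x - eta * v
    return x, v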

Momentum Algorithm

Momentum Method Example
● Consider a moderately distorted ellipsoid objective, f(x1, x2) = 0.1·x1² + 2·x2² (the objective implied by the gradients used later in the RMSProp example).
● f has its minimum at (0, 0). This function is very flat in the x1 direction.
● For eta = 0.4, without momentum: the gradient oscillates far more in the x2 direction than in the horizontal x1 direction.
● For eta = 0.6, without momentum: convergence in the x1 direction improves, but the overall solution quality diverges.

Momentum Method Example
● Apply momentum to the same ellipsoid objective with eta = 0.6.
● For beta = 0.5: converges well, with fewer oscillations and larger steps in the x1 direction.
● For beta = 0.25: the effect is weaker; convergence is worse, with more oscillations of larger magnitude.

Momentum method: Summary
● Momentum replaces gradients with a leaky average over past gradients. This accelerates convergence significantly.
● Momentum prevents the stalling of the optimization process that is much more likely to occur for stochastic gradient descent.
● The effective number of gradients aggregated is given by 1/(1 − β), due to the exponentiated downweighting of past data.
● Implementation is quite straightforward, but it requires us to store an additional state vector (the momentum v).

Adagrad

Adagrad
● Useful for problems with infrequently occurring (sparse) features.
● Adagrad uses an aggregate of the squares of previously observed gradients.

Adagrad Algorithm
● Use a state variable s_t to accumulate past gradient variance:
s_t = s_{t−1} + g_t²
x_t = x_{t−1} − (η / √(s_t + ε)) · g_t
● Operations are applied coordinate-wise: √(1/v) has entries √(1/v_i), and u · v has entries u_i·v_i. η is the learning rate and ε is an additive constant that ensures we do not divide by 0.
● Initialize s_0 = 0.
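
A minimal per-coordinate sketch of this update (adagrad_step is an illustrative name):

import numpy as np

def adagrad_step(x, s, grad, eta, eps=1e-6):
    # accumulate squared gradients, coordinate-wise
    s = s + grad ** 2
    # per-coordinate scaling: coordinates with large past gradients take smaller steps
    x = x - eta * grad / np.sqrt(s + eps)
    return x, s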

Adagrad: Summary
● Adagrad decreases the learning rate dynamically on a per-coordinate basis.
● It uses the magnitude of the gradient as a means of adjusting how quickly progress is achieved: coordinates with large gradients are compensated with a smaller learning rate.
● If the optimization problem has a rather uneven structure, Adagrad can help mitigate the distortion.
● Adagrad is particularly effective for sparse features, where the learning rate needs to decrease more slowly for infrequently occurring terms.
● On deep learning problems, Adagrad can sometimes be too aggressive in reducing learning rates.

RMSProp

RMSProp
● Adagrad uses a learning rate that decreases at the predefined schedule of effectively O(t^(−1/2)).
● The RMSProp algorithm decouples rate scheduling from coordinate-adaptive learning rates. This is essential for non-convex optimization.

RMSProp Algorithm
● Use a leaky average to accumulate past gradient variance:
s_t = γ·s_{t−1} + (1 − γ)·g_t²
x_t = x_{t−1} − (η / √(s_t + ε)) · g_t
● Parameter γ (gamma) > 0.
● The constant ε > 0 is set to 10^(−6) to ensure that we do not suffer from division by zero or overly large step sizes.


RMSProp Example

import math

gamma, eta = 0.9, 0.4  # assumed hyperparameter values; not shown on this slide

def rmsprop_2d(x1, x2, s1, s2):
    # gradients of the ellipsoid objective f(x1, x2) = 0.1*x1^2 + 2*x2^2
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2
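
A possible driver loop for the example (the start point is assumed; the original slide shows the resulting trajectory):

# assumed start point and zero accumulator state
x1, x2, s1, s2 = -5.0, -2.0, 0.0, 0.0
for step in range(20):
    x1, x2, s1, s2 = rmsprop_2d(x1, x2, s1, s2)
print(x1, x2)  # approaches the minimum of f at (0, 0)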

RMSProp: Summary
● RMSProp is very similar to Adagrad insofar as both use the square of the gradient to scale coefficients.
● RMSProp shares with momentum the leaky averaging. However, RMSProp uses the technique to adjust the coefficient-wise preconditioner.
● The learning rate still needs to be scheduled by the experimenter in practice.
● The coefficient γ (gamma) determines how long the history is when adjusting the per-coordinate scale.

Adam

Review of techniques learned so far
1. Stochastic gradient descent
○ more effective than gradient descent when solving optimization problems, e.g., due to its inherent resilience to redundant data.
2. Minibatch stochastic gradient descent
○ affords significant additional efficiency arising from vectorization, using larger sets of observations in one minibatch. This is the key to efficient multi-machine, multi-GPU and overall parallel processing.
3. Momentum
○ added a mechanism for aggregating a history of past gradients to accelerate convergence.
4. Adagrad
○ used per-coordinate scaling to allow for a computationally efficient preconditioner.
5. RMSProp
○ decoupled per-coordinate scaling from the learning rate adjustment.

Adam
● Adam combines all of these techniques into one efficient learning algorithm.
● It is one of the more robust and effective optimization algorithms to use in deep learning.
● Disadvantage: Adam can diverge due to poor variance control.
● Adam uses exponentially weighted moving averages (also known as leaky averaging) to obtain an estimate of both the momentum and the second moment of the gradient.

Adam Algorithm
● State variables (leaky averages of the gradient and of its square):
v_t = β1·v_{t−1} + (1 − β1)·g_t
s_t = β2·s_{t−1} + (1 − β2)·g_t²
● β1 and β2 are nonnegative weighting parameters. Common choices for them are β1 = 0.9 and β2 = 0.999. That is, the variance estimate moves much more slowly than the momentum term.
● Initialize v_0 = s_0 = 0.
● Normalize the state variables (bias correction):
v̂_t = v_t / (1 − β1^t),  ŝ_t = s_t / (1 − β2^t)

Adam Algorithm
● Rescale the gradient: g′_t = η·v̂_t / (√ŝ_t + ε)
● Compute the update: x_t = x_{t−1} − g′_t
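
Putting the pieces together, a minimal per-parameter Adam step might look like this (a sketch with the constants quoted above, not code from the deck):

import numpy as np

def adam_step(x, v, s, grad, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-6):
    # leaky averages of the gradient (momentum) and its square (second moment)
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    # bias correction for the zero initialization of v and s; t is the 1-based step count
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # rescaled gradient and parameter update
    x = x - eta * v_hat / (np.sqrt(s_hat) + eps)
    return x, v, s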


Adam: Summary
● Adam combines features of many optimization algorithms into a fairly robust update rule.
● Adam uses bias correction to adjust for a slow startup when estimating the momentum and the second moment.
● For gradients with significant variance, we may encounter issues with convergence. These can be amended by using larger minibatches or by switching to an improved estimate for the state variables. The Yogi algorithm offers such an alternative.

Learning Rate: Summary
1. Adjusting the learning rate is often just as important as the actual algorithm.
2. The magnitude of the learning rate matters. If it is too large, optimization diverges; if it is too small, training takes too long or we end up with a suboptimal result. The momentum algorithm helps here.
3. The rate of decay is just as important. If the learning rate remains large, we may simply end up bouncing around the minimum and thus not reach optimality. We want the rate to decay, but probably more slowly than O(t^(−1/2)).
4. Initialization pertains both to how the parameters are set initially and to how they evolve initially. This is known as warmup, i.e., how rapidly we start moving towards the solution initially.

Numerical Problems

Question with Solution

Question
1. Compute the value of (w1, w2) that minimizes the error. Compute the minimum possible value of the error.
2. What will be the value of (w1, w2) at time (t + 1) if standard gradient descent is used?
3. What will be the value of (w1, w2) at time (t + 1) if momentum is used?
4. What will be the value of (w1, w2) at time (t + 1) if RMSProp is used?
5. What will be the value of (w1, w2) at time (t + 1) if Adam is used?

Solution

Reference Textbook: Dive into Deep Learning
● Chapter 12 (online version)

Next Session:
Regularization