18-20 Regularization, Bias Variance Tradeoff, L2 Regularization, Early Stopping
Slide Content
1/84
Deep Learning : Lecture 6
Regularization: Bias Variance Tradeoff, L2 regularization, Early stopping, Dataset augmentation, Parameter sharing and tying, Injecting noise at input, Ensemble methods, Dropout
2/84
Acknowledgements
Chapter 7, Deep Learning book
Ali Ghodsi's Video Lectures on Regularization (Lecture 2.1)
"Dropout: A Simple Way to Prevent Neural Networks from Overfitting"
CS6910: Deep Learning Course by Prof. Mitesh M. Khapra, IIT Madras, India
3/84
Module 8.1 : Bias and Variance
4/84
We will begin with a quick overview of bias, variance and the trade-off between them.
5/84
[Figure: "Simple" and "Complex" fits. The points were drawn from a sinusoidal function (the true f(x)).]
Let us consider the problem of fitting a curve through a given set of points.
We consider two models:
Simple (degree: 1): $y = \hat{f}(x) = w_1 x + w_0$
Complex (degree: 25): $y = \hat{f}(x) = \sum_{i=1}^{25} w_i x^i + w_0$
Note that in both cases we are making an assumption about how $y$ is related to $x$. We have no idea about the true relation $f(x)$.
The training data consists of 100 points.
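The sketch below (mine, not from the slides) fits both models to synthetic data of this kind with NumPy's least-squares polynomial fit. The sine curve, the noise level, and the use of np.polyfit are all assumptions for illustration; the degree-25 fit is deliberately ill-conditioned, which is exactly what gives it the capacity to chase the noise.

```python
# Minimal sketch: a degree-1 ("simple") vs degree-25 ("complex") fit to
# 100 noisy samples of an assumed true function f(x) = sin(x).
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))        # 100 training points
y = np.sin(x) + rng.normal(0, 0.3, size=x.shape)   # y = f(x) + noise

w_simple = np.polyfit(x, y, deg=1)    # y_hat = w1*x + w0
w_complex = np.polyfit(x, y, deg=25)  # y_hat = sum_{i=1..25} wi*x^i + w0

# The complex model fits the training points far more closely.
print("simple  train MSE:", np.mean((np.polyval(w_simple, x) - y) ** 2))
print("complex train MSE:", np.mean((np.polyval(w_complex, x) - y) ** 2))
```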
6/84
[Figure: "Simple" and "Complex" fits on different samples. The points were drawn from a sinusoidal function (the true f(x)).]
We sample 25 points from the training data and train a simple and a complex model.
We repeat the process k times to train multiple models (each model sees a different sample of the training data).
We make a few observations from these plots.
7/84
Simple models trained on different samples of the data do not differ much from each other.
However, they are very far from the true sinusoidal curve (underfitting).
On the other hand, complex models trained on different samples of the data are very different from each other (high variance).
8/84
[Figure: Green line: average value of $\hat{f}(x)$ for the simple model. Blue curve: average value of $\hat{f}(x)$ for the complex model. Red curve: true model $f(x)$.]
Let $f(x)$ be the true model (sinusoidal in this case) and $\hat{f}(x)$ be our estimate of the model (simple or complex, in this case). Then,
$$\mathrm{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$$
$E[\hat{f}(x)]$ is the average (or expected) value of the model.
We can see that for the simple model the average value (green line) is very far from the true value $f(x)$ (sinusoidal function).
Mathematically, this means that the simple model has a high bias.
On the other hand, the complex model has a low bias.
9/84
We now define,
$$\mathrm{Variance}(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$$
(standard definition from statistics)
Roughly speaking, it tells us how much the different $\hat{f}(x)$'s (trained on different samples of the data) differ from each other.
It is clear that the simple model has a low variance whereas the complex model has a high variance.
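Both quantities can be estimated empirically, exactly as in the plots above: train many models, each on a fresh sample, then average their predictions. A minimal sketch (assuming the same sinusoidal setup as before; degree 9 stands in for the slides' degree 25 to keep the least-squares fit numerically tame):

```python
# Estimate Bias(f_hat(x)) = E[f_hat(x)] - f(x) and
# Variance(f_hat(x)) = E[(f_hat(x) - E[f_hat(x)])^2] over k training samples.
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                              # the (normally unknown) true f
x_grid = np.linspace(0, 2 * np.pi, 50)  # where we evaluate the fits

def predictions(degree, k=200, n=25, noise=0.3):
    preds = np.empty((k, x_grid.size))
    for j in range(k):                  # each model sees a different sample
        x = rng.uniform(0, 2 * np.pi, n)
        y = f(x) + rng.normal(0, noise, n)
        preds[j] = np.polyval(np.polyfit(x, y, degree), x_grid)
    return preds

for degree in (1, 9):
    p = predictions(degree)
    bias = p.mean(axis=0) - f(x_grid)   # E[f_hat(x)] - f(x), per x
    var = p.var(axis=0)                 # spread of the k fits, per x
    print(f"degree {degree}: mean |bias| = {np.abs(bias).mean():.3f}, "
          f"mean variance = {var.mean():.3f}")
```

Degree 1 should show the larger bias and degree 9 the larger variance, matching the informal summary on the next slide.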
10/84
In summary (informally):
Simple model: high bias, low variance
Complex model: low bias, high variance
There is always a trade-off between the bias and variance.
Both bias and variance contribute to the mean square error. Let us see how.
11/84
Module 8.2 : Train error vs Test error
12/84
We can show that
$$E[(y - \hat{f}(x))^2] = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2 \;(\text{irreducible error})$$
(See proof here.)
Consider a new point $(x, y)$ which was not seen during training.
If we use the model $\hat{f}(x)$ to predict the value of $y$, then the mean square error is given by
$$E[(y - \hat{f}(x))^2]$$
(the average square error in predicting $y$ for many such unseen points)
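The decomposition can be checked numerically. A minimal sketch (my own; it assumes the sinusoidal setup from earlier, a degree-3 fit, and a single test input $x_0$, and it knows $\sigma$ only because we simulate the data ourselves): draw many training sets, fit each, predict at $x_0$, and compare the mean square error against $\mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$.

```python
# Numerical check of E[(y - f_hat(x))^2] = Bias^2 + Variance + sigma^2.
import numpy as np

rng = np.random.default_rng(1)
f, sigma, x0 = np.sin, 0.3, np.pi / 4        # true f, noise std, test input

preds = []
for _ in range(2000):                        # many independent training sets
    x = rng.uniform(0, 2 * np.pi, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    preds.append(np.polyval(np.polyfit(x, y, 3), x0))
preds = np.array(preds)

y0 = f(x0) + rng.normal(0, sigma, preds.size)   # fresh test observations
mse = np.mean((y0 - preds) ** 2)
bias2 = (preds.mean() - f(x0)) ** 2             # squared bias at x0
var = preds.var()                               # variance of f_hat at x0
print(f"MSE = {mse:.4f}  vs  Bias^2 + Var + sigma^2 = {bias2 + var + sigma**2:.4f}")
```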
13/84
[Figure: error vs. model complexity. High bias on the left, high variance on the right; the sweet spot (the perfect tradeoff, the ideal model complexity) lies in between.]
$$E[(y - \hat{f}(x))^2] = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2 \;(\text{irreducible error})$$
The parameters of $\hat{f}(x)$ (all the $w_i$'s) are trained using a training set $\{(x_i, y_i)\}_{i=1}^{n}$.
However, at test time we are interested in evaluating the model on a validation (unseen) set which was not used for training.
This gives rise to the following two entities of interest:
train_err (say, mean square error)
test_err (say, mean square error)
Typically these errors exhibit the trend shown in the adjacent figure.
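The U-shaped test curve is easy to reproduce. A minimal sketch (mine, not from the slides; complexity is proxied by polynomial degree on the usual synthetic sine data):

```python
# train_err keeps falling as complexity grows; test_err falls, then rises.
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(0, 2 * np.pi, n)
    return x, np.sin(x) + rng.normal(0, 0.3, n)

x_tr, y_tr = make_data(100)   # n training points
x_te, y_te = make_data(100)   # m held-out (validation) points

for degree in (1, 3, 5, 9, 15):
    w = np.polyfit(x_tr, y_tr, degree)
    tr = np.mean((y_tr - np.polyval(w, x_tr)) ** 2)   # train_err
    te = np.mean((y_te - np.polyval(w, x_te)) ** 2)   # test_err
    print(f"degree {degree:2d}: train_err = {tr:.3f}   test_err = {te:.3f}")
```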
14/84
Intuitions developed so far: let there be $n$ training points and $m$ test (validation) points.
$$\mathrm{train}_{err} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 \qquad \mathrm{test}_{err} = \frac{1}{m}\sum_{i=n+1}^{n+m} (y_i - \hat{f}(x_i))^2$$
As the model complexity increases, train_err becomes overly optimistic and gives us a wrong picture of how close $\hat{f}$ is to $f$.
The validation error gives the real picture of how close $\hat{f}$ is to $f$.
We will now concretize this intuition mathematically and eventually show how to account for the optimism in the training error.
15/84
Let $D = \{x_i, y_i\}_{i=1}^{m+n}$. Then for any point $(x, y)$ we have
$$y_i = f(x_i) + \varepsilon_i$$
which means that $y_i$ is related to $x_i$ by some true function $f$, but there is also some noise $\varepsilon$ in the relation.
For simplicity, we assume $\varepsilon \sim N(0, \sigma^2)$, and of course we do not know $f$.
Further, we use $\hat{f}$ to approximate $f$ and estimate its parameters using the training set $T \subseteq D$ such that
$$y_i = \hat{f}(x_i)$$
We are interested in knowing $E[(\hat{f}(x_i) - f(x_i))^2]$, but we cannot estimate this directly because we do not know $f$.
We will see how to estimate this empirically using the observation $y_i$ and the prediction $\hat{y}_i$.
17/84
We will take a small detour to understand how to empirically estimate an expectation, and then return to our derivation.
18/84
Suppose we have observed the goals scored ($z$) in $k$ matches as $z_1 = 2$, $z_2 = 1$, $z_3 = 0$, ..., $z_k = 2$.
Now we can empirically estimate $E[z]$, i.e. the expected number of goals scored, as
$$E[z] = \frac{1}{k}\sum_{i=1}^{k} z_i$$
Analogy with our derivation: we have a certain number of observations $y_i$ and predictions $\hat{y}_i$, using which we can estimate
$$E[(\hat{y}_i - y_i)^2] = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$
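In code the idea is a one-liner; the tiny sketch below just makes the analogy concrete (the middle match scores and the $y_i$, $\hat{y}_i$ values are made up for illustration, since the slide elides them):

```python
# The sample mean is the empirical estimate of an expectation.
import numpy as np

z = np.array([2, 1, 0, 3, 2])        # goals scored in k = 5 matches (illustrative)
print("E[z] ~=", z.mean())           # (1/k) * sum_i z_i

# Same recipe for the squared prediction error:
y = np.array([1.0, 2.0, 0.5])        # hypothetical observations y_i
y_hat = np.array([1.2, 1.7, 0.4])    # hypothetical predictions y_hat_i
print("E[(y_hat - y)^2] ~=", np.mean((y_hat - y) ** 2))
```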
19/84
... returning to our derivation
20/84
$$E[(\hat{f}(x_i) - f(x_i))^2] = E[(\hat{y}_i - y_i)^2] - E[\varepsilon_i^2] + 2\,E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]$$
We can empirically evaluate the R.H.S. using training observations or test observations.
Case 1: Using test observations
$$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} (\hat{y}_i - y_i)^2}_{\text{empirical estimation of error}} - \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} \varepsilon_i^2}_{\text{small constant}} + \underbrace{2\,E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\,\mathrm{covariance}(\varepsilon_i,\;\hat{f}(x_i) - f(x_i))}$$
$\because \mathrm{covariance}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[X(Y - \mu_Y)] \;(\text{if } \mu_X = E[X] = 0) = E[XY - \mu_Y X] = E[XY] - \mu_Y E[X] = E[XY]$
21/84
$$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} (\hat{y}_i - y_i)^2}_{\text{empirical estimation of error}} - \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} \varepsilon_i^2}_{\text{small constant}} + \underbrace{2\,E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\,\mathrm{covariance}(\varepsilon_i,\;\hat{f}(x_i) - f(x_i))}$$
None of the test observations participated in the estimation of $\hat{f}(x)$ [the parameters of $\hat{f}(x)$ were estimated only using training data].
$$\Rightarrow \varepsilon \perp (\hat{f}(x_i) - f(x_i))$$
$$\Rightarrow E[\varepsilon_i(\hat{f}(x_i) - f(x_i))] = E[\varepsilon_i]\,E[\hat{f}(x_i) - f(x_i)] = 0 \cdot E[\hat{f}(x_i) - f(x_i)] = 0$$
$$\Rightarrow \text{true error} = \text{empirical test error} + \text{small constant}$$
Hence, we should always use a validation set (independent of the training set) to estimate the error.
22/84
Case 2: Using training observations
$$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}_{\text{empirical estimation of error}} - \underbrace{\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i^2}_{\text{small constant}} + \underbrace{2\,E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\,\mathrm{covariance}(\varepsilon_i,\;\hat{f}(x_i) - f(x_i))}$$
Now $\varepsilon \not\perp \hat{f}(x)$, because $\varepsilon$ was used for estimating the parameters of $\hat{f}(x)$
$$\Rightarrow E[\varepsilon_i(\hat{f}(x_i) - f(x_i))] \neq E[\varepsilon_i]\,E[\hat{f}(x_i) - f(x_i)] \neq 0$$
Hence, the empirical train error is smaller than the true error and does not give a true picture of the error.
But how is this related to model complexity? Let us see.
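Both cases can be seen in a short simulation. A minimal sketch (mine; sinusoidal data as before, a degree-9 fit): since we generate the data ourselves we know $f$ and $\sigma$, so we can compute the true error directly and compare it with the train-side and test-side empirical estimates, each corrected by the "small constant" $\sigma^2$.

```python
# Train error underestimates the true error; test error tracks it.
import numpy as np

rng = np.random.default_rng(3)
f, sigma, degree = np.sin, 0.3, 9

x_tr = rng.uniform(0, 2 * np.pi, 50)
y_tr = f(x_tr) + rng.normal(0, sigma, 50)   # this noise IS used in fitting
x_te = rng.uniform(0, 2 * np.pi, 50)
y_te = f(x_te) + rng.normal(0, sigma, 50)   # this noise is NOT used in fitting

w = np.polyfit(x_tr, y_tr, degree)
x_all = np.r_[x_tr, x_te]
true_err = np.mean((np.polyval(w, x_all) - f(x_all)) ** 2)  # knowable only in simulation
train_est = np.mean((y_tr - np.polyval(w, x_tr)) ** 2) - sigma**2
test_est = np.mean((y_te - np.polyval(w, x_te)) ** 2) - sigma**2

print(f"true error                    : {true_err:.3f}")
print(f"train estimate (optimistic)   : {train_est:.3f}")
print(f"test estimate (close to true) : {test_est:.3f}")
```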
23/84
Module 8.3 : True error and Model complexity
Using Stein's Lemma (and some trickery) we can show that

$$\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(\hat{f}(x_i) - f(x_i)) = \frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat{f}(x_i)}{\partial y_i}$$

When will $\frac{\partial \hat{f}(x_i)}{\partial y_i}$ be high? When a small change in the observation causes a large change in the estimation ($\hat{f}$)

Can you link this to model complexity? Yes, indeed a complex model will be more sensitive to changes in observations whereas a simple model will be less sensitive to changes in observations

Hence, we can say that

$$\text{true error} = \text{empirical train error} + \text{small constant} + \Omega(\text{model complexity})$$
Let us verify that indeed a complex model is more sensitive to minor changes in the data. We have fitted a simple and a complex model to some given data. We now change one of these data points. The simple model does not change much compared to the complex model.
Hence while training, instead of minimizing the training error $\mathscr{L}_{train}(\theta)$ we should minimize

$$\min_{w.r.t.\ \theta}\ \mathscr{L}_{train}(\theta) + \Omega(\theta) = \mathscr{L}(\theta)$$

where $\Omega(\theta)$ would be high for complex models and small for simple models

$\Omega(\theta)$ acts as an approximation to $\frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat{f}(x_i)}{\partial y_i}$

This is the basis for all regularization methods

We can show that $l_1$ regularization, $l_2$ regularization, early stopping and injecting noise in input are all instances of this form of regularization.
[Figure: error vs. model complexity, with high bias on the simple end, high variance on the complex end, and a sweet spot in between]

$\Omega(\theta)$ should ensure that the model has reasonable complexity, acting as a proxy for $\frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat{f}(x_i)}{\partial y_i}$
Why do we care about this bias variance tradeoff and model complexity?

Deep neural networks are highly complex models: many parameters, many non-linearities.

It is easy for them to overfit and drive training error to 0.

Hence we need some form of regularization.
Module 8.4 : $l_2$ regularization
Different forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
For $l_2$ regularization we have,

$$\widetilde{\mathscr{L}}(w) = \mathscr{L}(w) + \frac{\alpha}{2}\|w\|^2$$

For SGD (or its variants), we are interested in

$$\nabla \widetilde{\mathscr{L}}(w) = \nabla \mathscr{L}(w) + \alpha w$$

Update rule:

$$w_{t+1} = w_t - \eta \nabla \mathscr{L}(w_t) - \eta \alpha w_t$$

Requires a very small modification to the code

Let us see the geometric interpretation of this
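The update rule translates almost directly into code. A minimal NumPy sketch follows; the quadratic loss, the step size and the value of $\alpha$ are illustrative assumptions, not from the lecture:

```python
import numpy as np

# A minimal sketch of SGD with l2 regularization (weight decay):
# w_{t+1} = w_t - eta * grad L(w_t) - eta * alpha * w_t
def sgd_l2_step(w, grad_L, eta=0.1, alpha=0.01):
    return w - eta * grad_L(w) - eta * alpha * w

# Example: quadratic loss L(w) = ||w - w_star||^2 / 2, so grad L(w) = w - w_star
w_star = np.array([1.0, -2.0])
grad_L = lambda w: w - w_star

w = np.zeros(2)
for _ in range(1000):
    w = sgd_l2_step(w, grad_L)
print(w)  # converges to a point shrunk towards 0 relative to w_star
```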
Assume $w^*$ is the optimal solution for $\mathscr{L}(w)$ [not $\widetilde{\mathscr{L}}(w)$], i.e. the solution in the absence of regularization ($w^*$ optimal $\Rightarrow \nabla \mathscr{L}(w^*) = 0$)

Consider $u = w - w^*$. Using the Taylor series approximation (up to 2nd order):

$$\mathscr{L}(w^* + u) = \mathscr{L}(w^*) + u^T \nabla \mathscr{L}(w^*) + \frac{1}{2} u^T H u$$

$$\mathscr{L}(w) = \mathscr{L}(w^*) + (w - w^*)^T \nabla \mathscr{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$$

$$= \mathscr{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*) \qquad (\because \nabla \mathscr{L}(w^*) = 0)$$

$$\nabla \mathscr{L}(w) = \nabla \mathscr{L}(w^*) + H(w - w^*) = H(w - w^*)$$

Now,

$$\nabla \widetilde{\mathscr{L}}(w) = \nabla \mathscr{L}(w) + \alpha w = H(w - w^*) + \alpha w$$
Let $\widetilde{w}$ be the optimal solution for $\widetilde{\mathscr{L}}(w)$ [i.e. the regularized loss]

$$\because \nabla \widetilde{\mathscr{L}}(\widetilde{w}) = 0$$

$$H(\widetilde{w} - w^*) + \alpha \widetilde{w} = 0 \Rightarrow (H + \alpha I)\widetilde{w} = H w^*$$

$$\Rightarrow \widetilde{w} = (H + \alpha I)^{-1} H w^*$$

Notice that if $\alpha \to 0$ then $\widetilde{w} \to w^*$ [no regularization]

But we are interested in the case when $\alpha \neq 0$. Let us analyse that case.
If $H$ is symmetric positive semi-definite, then $H = Q \Lambda Q^T$ [$Q$ is orthogonal: $Q Q^T = Q^T Q = I$]

$$\begin{aligned}
\widetilde{w} &= (H + \alpha I)^{-1} H w^* \\
&= (Q \Lambda Q^T + \alpha I)^{-1} Q \Lambda Q^T w^* \\
&= (Q \Lambda Q^T + \alpha Q I Q^T)^{-1} Q \Lambda Q^T w^* \\
&= [Q(\Lambda + \alpha I) Q^T]^{-1} Q \Lambda Q^T w^* \\
&= Q^{T^{-1}} (\Lambda + \alpha I)^{-1} Q^{-1} Q \Lambda Q^T w^* \\
&= Q (\Lambda + \alpha I)^{-1} \Lambda Q^T w^* \qquad (\because Q^{T^{-1}} = Q) \\
\widetilde{w} &= Q D Q^T w^*
\end{aligned}$$

where $D = (\Lambda + \alpha I)^{-1} \Lambda$ is a diagonal matrix, which we will see in more detail soon
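Since each step of this chain is plain linear algebra, the identity is easy to sanity-check numerically. A small NumPy sketch with a random PSD $H$; the sizes and the value of $\alpha$ are arbitrary:

```python
import numpy as np

# Sketch: verify that (H + alpha*I)^{-1} H w* equals Q D Q^T w*
# with D = (Lambda + alpha*I)^{-1} Lambda, for a random symmetric PSD H.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A @ A.T                      # symmetric positive semi-definite
alpha = 0.5
w_star = rng.standard_normal(4)

lam, Q = np.linalg.eigh(H)       # H = Q diag(lam) Q^T
D = np.diag(lam / (lam + alpha))

w_tilde_direct = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)
w_tilde_evd = Q @ D @ Q.T @ w_star
print(np.allclose(w_tilde_direct, w_tilde_evd))  # True
```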
Each element of $Q^T w^*$ gets scaled by $\frac{\lambda_i}{\lambda_i + \alpha}$ before it is rotated back by $Q$

If $\lambda_i \gg \alpha$ then $\frac{\lambda_i}{\lambda_i + \alpha} \approx 1$; if $\lambda_i \ll \alpha$ then $\frac{\lambda_i}{\lambda_i + \alpha} \approx 0$

Thus only significant directions (larger eigenvalues) will be retained.

$$\text{Effective parameters} = \sum_{i=1}^{n} \frac{\lambda_i}{\lambda_i + \alpha} < n$$
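A two-line computation makes the "effective parameters" count concrete; the eigenvalues below are made up purely for illustration:

```python
import numpy as np

# Sketch: effective number of parameters under l2 regularization,
# computed from the eigenvalues of the Hessian H.
lam = np.array([10.0, 5.0, 0.5, 0.01])   # illustrative eigenvalues
alpha = 1.0
effective = np.sum(lam / (lam + alpha))
print(effective)   # about 2.1 out of 4: small-eigenvalue directions barely count
```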
The weight vector ($w^*$) is getting rotated to ($\widetilde{w}$)

All of its elements are shrinking, but some are shrinking more than the others

This ensures that only important features are given high weights
Module 8.5 : Dataset augmentation

Different forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
[Figure: a digit image with label = 2 — the given training data]

We exploit the fact that certain transformations to the image do not change the label of the image.

[Figure: the same image, still with label = 2, after each transformation: rotated by 20°, rotated by 65°, shifted vertically, shifted horizontally, blurred, changed some pixels]

[augmented data = created using some knowledge of the task]
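As a sketch of such label-preserving transformations, assuming 2-D grayscale arrays and SciPy's `ndimage` module; the exact transformations and parameters are illustrative:

```python
import numpy as np
from scipy import ndimage

# Sketch of label-preserving augmentations for image classification:
# every output keeps the label of the input image.
def augment(img, rng):
    out = [img]
    out.append(ndimage.rotate(img, angle=20, reshape=False))   # rotate by 20 degrees
    out.append(ndimage.shift(img, shift=(2, 0)))               # shift vertically
    out.append(ndimage.shift(img, shift=(0, 2)))               # shift horizontally
    out.append(ndimage.gaussian_filter(img, sigma=1.0))        # blur
    noisy = img.copy()
    idx = rng.integers(0, img.size, size=10)                   # change some pixels
    noisy.flat[idx] = rng.random(10)
    out.append(noisy)
    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))
augmented = augment(img, rng)   # 6 images, all sharing the original label
```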
Typically, more data = better learning

Works well for image classification / object recognition tasks

Also shown to work well for speech

For some tasks it may not be clear how to generate such data
Module 8.6 : Parameter Sharing and tying
Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
Parameter Sharing: used in CNNs. The same filter is applied at different positions of the image; equivalently, the same weight matrix acts on different input neurons.

Parameter Tying: typically used in autoencoders ($x \rightarrow h(x) \rightarrow \hat{x}$). The encoder and decoder weights are tied.
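A minimal sketch of tying, assuming a one-hidden-layer autoencoder; the sigmoid non-linearity and the layer sizes are illustrative choices:

```python
import numpy as np

# Parameter tying in an autoencoder: the decoder reuses the transpose of the
# encoder weight matrix W instead of learning a separate matrix.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 3
W = rng.standard_normal((n_hidden, n_in)) * 0.1   # shared by encoder and decoder
b = np.zeros(n_hidden)
c = np.zeros(n_in)

x = rng.standard_normal(n_in)
h = sigmoid(W @ x + b)        # encoder: h(x)
x_hat = W.T @ h + c           # decoder uses W^T (tied weights) to produce x_hat
```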
Module 8.7 : Adding Noise to the inputs
Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
[Diagram: $x \rightarrow \widetilde{x} \rightarrow h(\widetilde{x}) \rightarrow \hat{x}$, where $P(\widetilde{x}\,|\,x)$ is the noise process]

We saw this in the Autoencoder lecture (the denoising setup)

We can show that for a simple input-output neural network, adding Gaussian noise to the input is equivalent to weight decay ($L_2$ regularisation)

Can be viewed as data augmentation
Consider inputs $x_1 + \varepsilon_1,\ x_2 + \varepsilon_2,\ \ldots,\ x_n + \varepsilon_n$ with

$$\varepsilon \sim \mathcal{N}(0, \sigma^2), \qquad \widetilde{x}_i = x_i + \varepsilon_i$$

$$\hat{y} = \sum_{i=1}^{n} w_i x_i, \qquad \widetilde{y} = \sum_{i=1}^{n} w_i \widetilde{x}_i = \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} w_i \varepsilon_i = \hat{y} + \sum_{i=1}^{n} w_i \varepsilon_i$$

We are interested in $E[(\widetilde{y} - y)^2]$:

$$\begin{aligned}
E\big[(\widetilde{y} - y)^2\big] &= E\Big[\Big(\hat{y} + \sum_{i=1}^{n} w_i \varepsilon_i - y\Big)^2\Big] \\
&= E\Big[\Big((\hat{y} - y) + \sum_{i=1}^{n} w_i \varepsilon_i\Big)^2\Big] \\
&= E\big[(\hat{y} - y)^2\big] + E\Big[2(\hat{y} - y)\sum_{i=1}^{n} w_i \varepsilon_i\Big] + E\Big[\Big(\sum_{i=1}^{n} w_i \varepsilon_i\Big)^2\Big] \\
&= E\big[(\hat{y} - y)^2\big] + 0 + E\Big[\sum_{i=1}^{n} w_i^2 \varepsilon_i^2\Big] \\
&\qquad (\because \varepsilon_i \text{ is independent of } \varepsilon_j \text{ and } \varepsilon_i \text{ is independent of } (\hat{y} - y)) \\
&= E\big[(\hat{y} - y)^2\big] + \sigma^2 \sum_{i=1}^{n} w_i^2 \qquad (\text{same as } L_2 \text{ norm penalty})
\end{aligned}$$
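The final line is easy to confirm by simulation. A Monte Carlo sketch (the dimensions, weights and noise level are arbitrary):

```python
import numpy as np

# Sketch: check that Gaussian input noise inflates the expected squared error
# by sigma^2 * sum(w_i^2), i.e. the L2 penalty term derived above.
rng = np.random.default_rng(0)
n, sigma = 5, 0.3
w = rng.standard_normal(n)
x = rng.standard_normal(n)
y = 1.7                                 # some fixed target
y_hat = w @ x

trials = 200_000
eps = rng.normal(0.0, sigma, size=(trials, n))
y_tilde = (x + eps) @ w                 # noisy-input predictions
mc_error = np.mean((y_tilde - y) ** 2)
predicted = (y_hat - y) ** 2 + sigma ** 2 * np.sum(w ** 2)
print(mc_error, predicted)              # the two should be close
```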
Module 8.8 : Adding Noise to the outputs
Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
Hard targets: $0\ 0\ 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0$

minimize: $-\sum_{i=0}^{9} p_i \log q_i$

true distribution: $p = \{0, 0, 1, 0, 0, 0, 0, 0, 0, 0\}$, estimated distribution: $q$

Intuition: do not trust the true labels, they may be noisy. Instead, use soft targets.
Soft targets: $\frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ 1-\varepsilon\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}$, where $\varepsilon$ is a small positive constant

minimize: $-\sum_{i=0}^{9} p_i \log q_i$

true distribution + noise: $p = \left\{\frac{\varepsilon}{9}, \frac{\varepsilon}{9}, 1-\varepsilon, \frac{\varepsilon}{9}, \ldots\right\}$, estimated distribution: $q$
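This soft-target construction (often called label smoothing) is a few lines of code; `eps = 0.1` below is just an illustrative choice:

```python
import numpy as np

# Sketch of soft targets for a 10-class problem: the true class keeps
# probability 1 - eps, and eps is spread evenly over the other 9 classes.
def soft_targets(true_class, num_classes=10, eps=0.1):
    p = np.full(num_classes, eps / (num_classes - 1))
    p[true_class] = 1.0 - eps
    return p

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p = soft_targets(true_class=2)      # [eps/9, eps/9, 1-eps, eps/9, ...]
q = np.full(10, 0.1)                # some estimated distribution
print(cross_entropy(p, q))
```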
Module 8.9 : Early stopping
Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
[Figure: training error and validation error vs. steps; training error keeps falling while validation error starts rising, so training stops at step $k$ and the model stored at step $k - p$ is returned]

Track the validation error

Have a patience parameter $p$

If you are at step $k$ and there was no improvement in validation error in the previous $p$ steps, then stop training and return the model stored at step $k - p$

Basically, stop the training early before it drives the training error to 0 and blows up the validation error
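The stopping logic itself is small. A sketch, where `train_step` and `validation_error` are placeholders for the actual training and evaluation routines:

```python
# Sketch of early stopping with a patience parameter p: keep the best model
# seen so far and stop once p steps pass without a validation improvement.
def train_with_early_stopping(model, train_step, validation_error, p=5, max_steps=10_000):
    best_err, best_model, steps_since_best = float("inf"), None, 0
    for k in range(max_steps):
        model = train_step(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_model, steps_since_best = err, model, 0
        else:
            steps_since_best += 1
        if steps_since_best >= p:   # no improvement in the previous p steps
            break
    return best_model               # the model stored p steps before stopping
```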
Very effective and the most widely used form of regularization

Can be used even with other regularizers (such as $l_2$)

How does it act as a regularizer? We will first see an intuitive explanation and then a mathematical analysis
Recall that the update rule in SGD is

$$w_{t+1} = w_t - \eta \nabla w_t = w_0 - \eta \sum_{i=1}^{t} \nabla w_i$$

Let $\tau$ be the maximum value of $\nabla w_i$; then

$$|w_{t+1} - w_0| \leq \eta\, t\, |\tau|$$

Thus, $\eta t$ controls how far $w_t$ can go from the initial $w_0$; in other words, it controls the space of exploration
We will now see a mathematical analysis of this
Recall that the Taylor series approximation for $\mathscr{L}(w)$ is

$$\mathscr{L}(w) = \mathscr{L}(w^*) + (w - w^*)^T \nabla \mathscr{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$$

$$= \mathscr{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*) \qquad [\,w^* \text{ is optimal, so } \nabla \mathscr{L}(w^*) \text{ is } 0\,]$$

$$\nabla \mathscr{L}(w) = H(w - w^*)$$

Now the SGD update rule is:

$$w_t = w_{t-1} - \eta \nabla \mathscr{L}(w_{t-1}) = w_{t-1} - \eta H (w_{t-1} - w^*) = (I - \eta H) w_{t-1} + \eta H w^*$$
$$w_t = (I - \eta H) w_{t-1} + \eta H w^*$$

Using the EVD of $H$ as $H = Q \Lambda Q^T$, we get:

$$w_t = (I - \eta Q \Lambda Q^T) w_{t-1} + \eta Q \Lambda Q^T w^*$$

If we start with $w_0 = 0$ then we can show that (see Appendix)

$$w_t = Q[\,I - (I - \eta \Lambda)^t\,] Q^T w^*$$

Compare this with the expression we had for the optimum $\widetilde{w}$ with $L_2$ regularization:

$$\widetilde{w} = Q[\,I - (\Lambda + \alpha I)^{-1} \alpha\,] Q^T w^*$$

We observe that $w_t = \widetilde{w}$ if we choose $\eta$, $t$ and $\alpha$ such that

$$(I - \eta \Lambda)^t = (\Lambda + \alpha I)^{-1} \alpha$$
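The closed form for $w_t$ can be checked numerically against the iteration. A NumPy sketch with a random quadratic (the step size and step count are arbitrary, chosen small enough for stability):

```python
import numpy as np

# Sketch: iterate w_t = (I - eta*H) w_{t-1} + eta*H*w_star from w_0 = 0 and
# check it matches the closed form Q [I - (I - eta*Lambda)^t] Q^T w_star.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
H = A @ A.T                      # symmetric PSD Hessian
w_star = rng.standard_normal(3)
eta, t_steps = 0.01, 50

w = np.zeros(3)
for _ in range(t_steps):
    w = (np.eye(3) - eta * H) @ w + eta * H @ w_star

lam, Q = np.linalg.eigh(H)
closed_form = Q @ (np.eye(3) - np.diag((1 - eta * lam) ** t_steps)) @ Q.T @ w_star
print(np.allclose(w, closed_form))  # True
```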
Things to remember:

Early stopping only allows $t$ updates to the parameters.

If a parameter $w$ corresponds to a dimension which is important for the loss $\mathscr{L}(\theta)$, then $\frac{\partial \mathscr{L}(\theta)}{\partial w}$ will be large

However, if a parameter is not important ($\frac{\partial \mathscr{L}(\theta)}{\partial w}$ is small), then its updates will be small and the parameter will not be able to grow large in $t$ steps

Early stopping will thus effectively shrink the parameters corresponding to less important directions (same as weight decay).
Module 8.10 : Ensemble methods

Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
[Figure: a Logistic Regression ($y_{lr}$), an SVM ($y_{svm}$) and a Naive Bayes classifier ($y_{nb}$) each read the inputs $x_1, x_2, x_3, x_4$; their outputs are combined into $y_{final}$]

Combine the output of different models to reduce generalization error

The models can correspond to different classifiers

It could be different instances of the same classifier trained with:
different hyperparameters
different features
different samples of the training data
[Figure: three Logistic Regression instances $y_{lr_1}$, $y_{lr_2}$, $y_{lr_3}$ combined into $y_{final}$]

Each model is trained with a different sample of the data (sampling with replacement)

Bagging: form an ensemble using different instances of the same classifier

From a given dataset, construct multiple training sets by sampling with replacement ($T_1, T_2, \ldots, T_k$)

Train the $i^{th}$ instance of the classifier using training set $T_i$
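The bagging recipe above takes only a few lines of NumPy (the data, the least-squares base learner, and the value of $k$ are assumptions for illustration; the slide's figure uses logistic regression classifiers): each $T_i$ is drawn by sampling indices with replacement, one instance of the same model is trained per $T_i$, and the predictions are averaged.

import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration): y = x.w + noise
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=100)

k = 10          # number of bagged models
models = []
for _ in range(k):
    # Construct T_i by sampling with replacement (same size as the dataset)
    idx = rng.integers(0, len(X), size=len(X))
    w_i, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    models.append(w_i)

# The ensemble prediction is the average of the individual predictions
x_test = rng.normal(size=3)
y_hat = np.mean([x_test @ w_i for w_i in models])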
67/84
When would bagging work?
Consider a set of $k$ LR (logistic regression) models
Suppose that each model makes an error $\varepsilon_i$ on a test example
Let $\varepsilon_i$ be drawn from a zero-mean multivariate normal distribution with
$Variance = E[\varepsilon_i^2] = V$
$Covariance = E[\varepsilon_i \varepsilon_j] = C$
The error made by the average prediction of all the models is $\frac{1}{k}\sum_i \varepsilon_i$
The expected squared error is:
$$\begin{aligned}
mse &= E\Big[\Big(\frac{1}{k}\sum_i \varepsilon_i\Big)^2\Big] \\
&= \frac{1}{k^2}\, E\Big[\sum_i \sum_{i=j} \varepsilon_i \varepsilon_j + \sum_i \sum_{i \neq j} \varepsilon_i \varepsilon_j\Big] \\
&= \frac{1}{k^2}\, E\Big[\sum_i \varepsilon_i^2 + \sum_i \sum_{i \neq j} \varepsilon_i \varepsilon_j\Big] \\
&= \frac{1}{k^2} \Big(\sum_i E[\varepsilon_i^2] + \sum_i \sum_{i \neq j} E[\varepsilon_i \varepsilon_j]\Big) \\
&= \frac{1}{k^2} \big(kV + k(k-1)C\big) \\
&= \frac{1}{k} V + \frac{k-1}{k} C
\end{aligned}$$
68/84
$$mse = \frac{1}{k} V + \frac{k-1}{k} C$$
When would bagging work?
If the errors of the models are perfectly correlated, then $V = C$ and $mse = V$ [bagging does not help: the mse of the ensemble is as bad as that of the individual models]
If the errors of the models are independent or uncorrelated, then $C = 0$ and the mse of the ensemble reduces to $\frac{1}{k}V$
On average, the ensemble will therefore perform at least as well as its individual members
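The identity $mse = \frac{1}{k}V + \frac{k-1}{k}C$ is easy to sanity-check numerically. Here is a minimal Monte Carlo sketch (the values of $k$, $V$, and $C$ are assumptions for illustration): draw correlated errors with variance $V$ and pairwise covariance $C$, average them, and compare the empirical mse against the formula.

import numpy as np

rng = np.random.default_rng(0)
k, V, C = 10, 1.0, 0.3   # assumed values for illustration

# Covariance matrix with V on the diagonal and C off the diagonal
Sigma = C * np.ones((k, k)) + (V - C) * np.eye(k)
eps = rng.multivariate_normal(np.zeros(k), Sigma, size=200_000)

mse_empirical = np.mean(eps.mean(axis=1) ** 2)
mse_formula = V / k + (k - 1) / k * C
print(mse_empirical, mse_formula)  # both close to 0.37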
70/84
Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter sharing and tying
Adding noise to the inputs
Adding noise to the outputs
Early stopping
Ensemble methods
Dropout
71/84
Typically, model averaging (a bagging ensemble) always helps
But training several large neural networks for an ensemble is prohibitively expensive
Option 1: Train several neural networks having different architectures (obviously expensive)
Option 2: Train multiple instances of the same network using different training samples (again expensive)
Even if we manage to train with Option 1 or Option 2, combining several models at test time is infeasible in real-time applications
72/84
Dropout is a technique which addresses both these issues
Effectively, it allows training several neural networks without any significant computational overhead
It also gives an efficient approximate way of combining exponentially many different neural networks
73/84
Dropout refers to dropping out units
Temporarily remove a node and all its incoming/outgoing connections, resulting in a thinned network
Each node is retained with a fixed probability (typically $p = 0.5$ for hidden nodes and $p = 0.8$ for visible nodes)
74/84
Suppose a neural network has $n$ nodes
Using the dropout idea, each node can be retained or dropped
For example, in the above case we drop 5 nodes to get a thinned network
Given a total of $n$ nodes, what is the total number of thinned networks that can be formed? $2^n$
Of course, this is prohibitively large and we cannot possibly train so many networks
Trick: (1) Share the weights across all the networks (2) Sample a different network for each training instance
Let us see how
75/84
We initialize all the parameters (weights) of the network and start training
For the first training instance (or mini-batch), we apply dropout, resulting in a thinned network
We compute the loss and backpropagate
Which parameters will we update? Only those which are active
76/84
For the second training instance (or mini-batch), we again apply dropout, resulting in a different thinned network
We again compute the loss and backpropagate to the active weights
If a weight was active for both training instances, it has received two updates by now
If a weight was active for only one of the training instances, it has received only one update by now
Each thinned network gets trained rarely (or even never), but the parameter sharing ensures that no model has untrained or poorly trained parameters
(A training-time sketch follows below.)
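Here is a minimal sketch of the training-time procedure described in the last few slides (the layer sizes, the ReLU activation, and all names are assumptions for illustration, not the lecture's code): a fresh binary mask per mini-batch samples a different thinned network, while all thinned networks share the same underlying weights.

import numpy as np

rng = np.random.default_rng(0)

# Shared weights of one hidden layer; every thinned network reuses these
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))
p = 0.5  # retention probability for hidden nodes

def forward_train(x):
    h = np.maximum(0, x @ W1)            # ReLU hidden layer
    mask = (rng.random(h.shape) < p)     # retain each node with probability p
    return h * mask, mask                # dropped units output 0

x = rng.normal(size=(16, 4))             # one mini-batch
h_thin, mask = forward_train(x)
y = h_thin @ W2
# During backprop, gradients flow only through units where mask is 1,
# so only the weights of the sampled thinned network get updated.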
77/84
[Figure: a unit that is present with probability $p$ at training time, with outgoing weights $w_1, w_2, w_3, w_4$; at test time it is always present, with the weights scaled to $pw_1, pw_2, pw_3, pw_4$]
What happens at test time? It is impossible to aggregate the outputs of $2^n$ thinned networks
Instead, we use the full neural network and scale the output of each node by the fraction of times it was on during training
(See the sketch after this slide.)
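Continuing the sketch above, test-time inference under this scheme might look as follows (again an assumed setup, not the lecture's code). As an aside, many modern implementations instead scale by $1/p$ at training time ("inverted dropout"), which leaves the test-time forward pass unchanged.

import numpy as np

def forward_test(x, W1, W2, p=0.5):
    """Full network, no mask; hidden outputs scaled by the retention prob. p."""
    h = np.maximum(0, x @ W1)   # ReLU hidden layer, every node present
    return (p * h) @ W2         # scaling approximates averaging the 2^n thinned nets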
78/84
Dropout essentially applies a masking noise to the hidden units
It prevents the hidden units from co-adapting
Essentially, a hidden unit cannot rely too much on other units, as they may get dropped out at any time
Each hidden unit has to learn to be more robust to these random dropouts
79/84
[Figure: a network with a highlighted hidden unit $h_i$]
Here is an example of how dropout helps in ensuring redundancy and robustness
Suppose $h_i$ learns to detect a face by firing on detecting a nose
Dropping $h_i$ then corresponds to erasing the information that a nose exists
The model should then learn another $h_i$ which redundantly encodes the presence of a nose
Or the model should learn to detect the face using other features
80/84
Recap
$l_2$ regularization
Dataset augmentation
Parameter sharing and tying
Adding noise to the inputs
Adding noise to the outputs
Early stopping
Ensemble methods
Dropout
81/84
Appendix
82/84
To prove: the two equations below are equivalent
$$w_t = (I - \varepsilon Q \Lambda Q^T)\, w_{t-1} + \varepsilon Q \Lambda Q^T w^*$$
$$w_t = Q\,[I - (I - \varepsilon \Lambda)^t]\, Q^T w^*$$
Proof by induction.
Base case: $t = 1$ and $w_0 = 0$.
$w_1$ according to the first equation:
$$w_1 = (I - \varepsilon Q \Lambda Q^T)\, w_0 + \varepsilon Q \Lambda Q^T w^* = \varepsilon Q \Lambda Q^T w^*$$
$w_1$ according to the second equation:
$$w_1 = Q\,(I - (I - \varepsilon \Lambda)^1)\, Q^T w^* = \varepsilon Q \Lambda Q^T w^*$$
83/84
Induction step: assume the two equations are equivalent for the $t^{th}$ step, i.e.
$$w_t = (I - \varepsilon Q \Lambda Q^T)\, w_{t-1} + \varepsilon Q \Lambda Q^T w^* = Q\,[I - (I - \varepsilon \Lambda)^t]\, Q^T w^*$$
Proof that this will hold for the $(t+1)^{th}$ step:
$$\begin{aligned}
w_{t+1} &= (I - \varepsilon Q \Lambda Q^T)\, w_t + \varepsilon Q \Lambda Q^T w^* \\
&\quad (\text{using } w_t = Q\,[I - (I - \varepsilon \Lambda)^t]\, Q^T w^*) \\
&= (I - \varepsilon Q \Lambda Q^T)\, Q\,(I - (I - \varepsilon \Lambda)^t)\, Q^T w^* + \varepsilon Q \Lambda Q^T w^* \\
&\quad (\text{opening the bracket}) \\
&= Q\,(I - (I - \varepsilon \Lambda)^t)\, Q^T w^* - \varepsilon Q \Lambda Q^T Q\,(I - (I - \varepsilon \Lambda)^t)\, Q^T w^* + \varepsilon Q \Lambda Q^T w^* \\
&\quad (\text{using } Q^T Q = I \text{ and factoring out } Q \text{ on the left and } Q^T w^* \text{ on the right}) \\
&= Q \big[(I - (I - \varepsilon \Lambda)^t) - \varepsilon \Lambda\,(I - (I - \varepsilon \Lambda)^t) + \varepsilon \Lambda\big]\, Q^T w^* \\
&= Q \big[I - (I - \varepsilon \Lambda)^t (I - \varepsilon \Lambda)\big]\, Q^T w^* \\
&= Q \big[I - (I - \varepsilon \Lambda)^{t+1}\big]\, Q^T w^*
\end{aligned}$$
which is the second equation for $t + 1$, completing the induction.
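The equivalence just proved is also easy to verify numerically. Below is a minimal sketch (the dimensions, $\varepsilon$, $t$, and the randomly generated Hessian are assumptions for illustration): iterate the recurrence and compare the result against the closed form.

import numpy as np

rng = np.random.default_rng(0)

# Random symmetric PSD "Hessian" H = Q Lambda Q^T and minimizer w*
A = rng.normal(size=(5, 5))
H = A @ A.T
lam, Q = np.linalg.eigh(H)          # eigendecomposition: H = Q diag(lam) Q^T
w_star = rng.normal(size=5)
eps, t = 0.01, 50

# Recurrence: w_t = (I - eps*H) w_{t-1} + eps*H w*
w = np.zeros(5)
for _ in range(t):
    w = w - eps * H @ (w - w_star)

# Closed form: w_t = Q [I - (I - eps*Lambda)^t] Q^T w*
w_closed = Q @ np.diag(1 - (1 - eps * lam) ** t) @ Q.T @ w_star
print(np.allclose(w, w_closed))     # True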