18-20 Regularization, Bias Variance Tradeoff, L2 Regularization, Early Stopping
Slide Content
1/84
Deep Learning : Lecture 6
Regularization: Bias Variance Tradeoff, L2 regularization, Early stopping, Dataset augmentation, Parameter sharing and tying, Injecting noise at input, Ensemble methods, Dropout
2/84
Acknowledgements
Chapter 7, Deep Learning book
Ali Ghodsi's Video Lectures on Regularization (Lecture 2.1)
"Dropout: A Simple Way to Prevent Neural Networks from Overfitting"
CS6910: Deep Learning Course by Prof. Mitesh M. Khapra, IIT Madras, India
3/84
Module 8.1 : Bias and Variance
4/84
We will begin with a quick overview of bias, variance and the trade-off between them.
5/84
[Figure: "Simple" and "Complex" fits. The points were drawn from a sinusoidal function (the true f(x)).]
Let us consider the problem of fitting a curve through a given set of points.
We consider two models:
Simple (degree: 1): $y = \hat{f}(x) = w_1 x + w_0$
Complex (degree: 25): $y = \hat{f}(x) = \sum_{i=1}^{25} w_i x^i + w_0$
Note that in both cases we are making an assumption about how $y$ is related to $x$. We have no idea about the true relation $f(x)$.
The training data consists of 100 points.
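The sketch below (mine, not from the slides) fits both models to synthetic data of this kind with NumPy's least-squares polynomial fit. The sine curve, the noise level, and the use of np.polyfit are all assumptions for illustration; the degree-25 fit is deliberately ill-conditioned, which is exactly what gives it the capacity to chase the noise.

```python
# Minimal sketch: a degree-1 ("simple") vs degree-25 ("complex") fit to
# 100 noisy samples of an assumed true function f(x) = sin(x).
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))        # 100 training points
y = np.sin(x) + rng.normal(0, 0.3, size=x.shape)   # y = f(x) + noise

w_simple = np.polyfit(x, y, deg=1)    # y_hat = w1*x + w0
w_complex = np.polyfit(x, y, deg=25)  # y_hat = sum_{i=1..25} wi*x^i + w0

# The complex model fits the training points far more closely.
print("simple  train MSE:", np.mean((np.polyval(w_simple, x) - y) ** 2))
print("complex train MSE:", np.mean((np.polyval(w_complex, x) - y) ** 2))
```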
6/84
[Figure: "Simple" and "Complex" fits on different samples. The points were drawn from a sinusoidal function (the true f(x)).]
We sample 25 points from the training data and train a simple and a complex model.
We repeat the process k times to train multiple models (each model sees a different sample of the training data).
We make a few observations from these plots.
7/84
Simple models trained on different samples of the data do not differ much from each other.
However, they are very far from the true sinusoidal curve (underfitting).
On the other hand, complex models trained on different samples of the data are very different from each other (high variance).
8/84
[Figure: Green line: average value of $\hat{f}(x)$ for the simple model. Blue curve: average value of $\hat{f}(x)$ for the complex model. Red curve: true model $f(x)$.]
Let $f(x)$ be the true model (sinusoidal in this case) and $\hat{f}(x)$ be our estimate of the model (simple or complex, in this case). Then,
$$\mathrm{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$$
$E[\hat{f}(x)]$ is the average (or expected) value of the model.
We can see that for the simple model the average value (green line) is very far from the true value $f(x)$ (sinusoidal function).
Mathematically, this means that the simple model has a high bias.
On the other hand, the complex model has a low bias.
9/84
We now define,
$$\mathrm{Variance}(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$$
(standard definition from statistics)
Roughly speaking, it tells us how much the different $\hat{f}(x)$'s (trained on different samples of the data) differ from each other.
It is clear that the simple model has a low variance whereas the complex model has a high variance.
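Both quantities can be estimated empirically, exactly as in the plots above: train many models, each on a fresh sample, then average their predictions. A minimal sketch (assuming the same sinusoidal setup as before; degree 9 stands in for the slides' degree 25 to keep the least-squares fit numerically tame):

```python
# Estimate Bias(f_hat(x)) = E[f_hat(x)] - f(x) and
# Variance(f_hat(x)) = E[(f_hat(x) - E[f_hat(x)])^2] over k training samples.
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                              # the (normally unknown) true f
x_grid = np.linspace(0, 2 * np.pi, 50)  # where we evaluate the fits

def predictions(degree, k=200, n=25, noise=0.3):
    preds = np.empty((k, x_grid.size))
    for j in range(k):                  # each model sees a different sample
        x = rng.uniform(0, 2 * np.pi, n)
        y = f(x) + rng.normal(0, noise, n)
        preds[j] = np.polyval(np.polyfit(x, y, degree), x_grid)
    return preds

for degree in (1, 9):
    p = predictions(degree)
    bias = p.mean(axis=0) - f(x_grid)   # E[f_hat(x)] - f(x), per x
    var = p.var(axis=0)                 # spread of the k fits, per x
    print(f"degree {degree}: mean |bias| = {np.abs(bias).mean():.3f}, "
          f"mean variance = {var.mean():.3f}")
```

Degree 1 should show the larger bias and degree 9 the larger variance, matching the informal summary on the next slide.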
10/84
In summary (informally):
Simple model: high bias, low variance
Complex model: low bias, high variance
There is always a trade-off between the bias and variance.
Both bias and variance contribute to the mean square error. Let us see how.
11/84
Module 8.2 : Train error vs Test error
12/84
We can show that
$$E[(y - \hat{f}(x))^2] = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2 \;(\text{irreducible error})$$
(See proof here.)
Consider a new point $(x, y)$ which was not seen during training.
If we use the model $\hat{f}(x)$ to predict the value of $y$, then the mean square error is given by
$$E[(y - \hat{f}(x))^2]$$
(the average square error in predicting $y$ for many such unseen points)
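The decomposition can be checked numerically. A minimal sketch (my own; it assumes the sinusoidal setup from earlier, a degree-3 fit, and a single test input $x_0$, and it knows $\sigma$ only because we simulate the data ourselves): draw many training sets, fit each, predict at $x_0$, and compare the mean square error against $\mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$.

```python
# Numerical check of E[(y - f_hat(x))^2] = Bias^2 + Variance + sigma^2.
import numpy as np

rng = np.random.default_rng(1)
f, sigma, x0 = np.sin, 0.3, np.pi / 4        # true f, noise std, test input

preds = []
for _ in range(2000):                        # many independent training sets
    x = rng.uniform(0, 2 * np.pi, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    preds.append(np.polyval(np.polyfit(x, y, 3), x0))
preds = np.array(preds)

y0 = f(x0) + rng.normal(0, sigma, preds.size)   # fresh test observations
mse = np.mean((y0 - preds) ** 2)
bias2 = (preds.mean() - f(x0)) ** 2             # squared bias at x0
var = preds.var()                               # variance of f_hat at x0
print(f"MSE = {mse:.4f}  vs  Bias^2 + Var + sigma^2 = {bias2 + var + sigma**2:.4f}")
```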
13/84
[Figure: error vs. model complexity. High bias on the left, high variance on the right; the sweet spot (the perfect tradeoff, the ideal model complexity) lies in between.]
$$E[(y - \hat{f}(x))^2] = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2 \;(\text{irreducible error})$$
The parameters of $\hat{f}(x)$ (all the $w_i$'s) are trained using a training set $\{(x_i, y_i)\}_{i=1}^{n}$.
However, at test time we are interested in evaluating the model on a validation (unseen) set which was not used for training.
This gives rise to the following two entities of interest:
train_err (say, mean square error)
test_err (say, mean square error)
Typically these errors exhibit the trend shown in the adjacent figure.
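The U-shaped test curve is easy to reproduce. A minimal sketch (mine, not from the slides; complexity is proxied by polynomial degree on the usual synthetic sine data):

```python
# train_err keeps falling as complexity grows; test_err falls, then rises.
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(0, 2 * np.pi, n)
    return x, np.sin(x) + rng.normal(0, 0.3, n)

x_tr, y_tr = make_data(100)   # n training points
x_te, y_te = make_data(100)   # m held-out (validation) points

for degree in (1, 3, 5, 9, 15):
    w = np.polyfit(x_tr, y_tr, degree)
    tr = np.mean((y_tr - np.polyval(w, x_tr)) ** 2)   # train_err
    te = np.mean((y_te - np.polyval(w, x_te)) ** 2)   # test_err
    print(f"degree {degree:2d}: train_err = {tr:.3f}   test_err = {te:.3f}")
```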
14/84
Intuitions developed so far: let there be $n$ training points and $m$ test (validation) points.
$$\mathrm{train}_{err} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 \qquad \mathrm{test}_{err} = \frac{1}{m}\sum_{i=n+1}^{n+m} (y_i - \hat{f}(x_i))^2$$
As the model complexity increases, train_err becomes overly optimistic and gives us a wrong picture of how close $\hat{f}$ is to $f$.
The validation error gives the real picture of how close $\hat{f}$ is to $f$.
We will now concretize this intuition mathematically and eventually show how to account for the optimism in the training error.
15/84
Let $D = \{x_i, y_i\}_{i=1}^{m+n}$. Then for any point $(x, y)$ we have
$$y_i = f(x_i) + \varepsilon_i$$
which means that $y_i$ is related to $x_i$ by some true function $f$, but there is also some noise $\varepsilon$ in the relation.
For simplicity, we assume $\varepsilon \sim N(0, \sigma^2)$, and of course we do not know $f$.
Further, we use $\hat{f}$ to approximate $f$ and estimate its parameters using the training set $T \subseteq D$ such that
$$y_i = \hat{f}(x_i)$$
We are interested in knowing $E[(\hat{f}(x_i) - f(x_i))^2]$, but we cannot estimate this directly because we do not know $f$.
We will see how to estimate this empirically using the observation $y_i$ and the prediction $\hat{y}_i$.
17/84
We will take a small detour to understand how to empirically estimate an expectation, and then return to our derivation.
18/84
Suppose we have observed the goals scored ($z$) in $k$ matches as $z_1 = 2$, $z_2 = 1$, $z_3 = 0$, ..., $z_k = 2$.
Now we can empirically estimate $E[z]$, i.e. the expected number of goals scored, as
$$E[z] = \frac{1}{k}\sum_{i=1}^{k} z_i$$
Analogy with our derivation: we have a certain number of observations $y_i$ and predictions $\hat{y}_i$, using which we can estimate
$$E[(\hat{y}_i - y_i)^2] = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$
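In code the idea is a one-liner; the tiny sketch below just makes the analogy concrete (the middle match scores and the $y_i$, $\hat{y}_i$ values are made up for illustration, since the slide elides them):

```python
# The sample mean is the empirical estimate of an expectation.
import numpy as np

z = np.array([2, 1, 0, 3, 2])        # goals scored in k = 5 matches (illustrative)
print("E[z] ~=", z.mean())           # (1/k) * sum_i z_i

# Same recipe for the squared prediction error:
y = np.array([1.0, 2.0, 0.5])        # hypothetical observations y_i
y_hat = np.array([1.2, 1.7, 0.4])    # hypothetical predictions y_hat_i
print("E[(y_hat - y)^2] ~=", np.mean((y_hat - y) ** 2))
```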
19/84
... returning to our derivation
20/84
$$E[(\hat{f}(x_i) - f(x_i))^2] = E[(\hat{y}_i - y_i)^2] - E[\varepsilon_i^2] + 2\,E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]$$
We can empirically evaluate the R.H.S. using training observations or test observations.
Case 1: Using test observations
$$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} (\hat{y}_i - y_i)^2}_{\text{empirical estimation of error}} - \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} \varepsilon_i^2}_{\text{small constant}} + \underbrace{2\,E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\,\mathrm{covariance}(\varepsilon_i,\;\hat{f}(x_i) - f(x_i))}$$
$\because \mathrm{covariance}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[X(Y - \mu_Y)] \;(\text{if } \mu_X = E[X] = 0) = E[XY - \mu_Y X] = E[XY] - \mu_Y E[X] = E[XY]$
21/84
$$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} (\hat{y}_i - y_i)^2}_{\text{empirical estimation of error}} - \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} \varepsilon_i^2}_{\text{small constant}} + \underbrace{2\,E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\,\mathrm{covariance}(\varepsilon_i,\;\hat{f}(x_i) - f(x_i))}$$
None of the test observations participated in the estimation of $\hat{f}(x)$ [the parameters of $\hat{f}(x)$ were estimated only using training data].
$$\Rightarrow \varepsilon \perp (\hat{f}(x_i) - f(x_i))$$
$$\Rightarrow E[\varepsilon_i(\hat{f}(x_i) - f(x_i))] = E[\varepsilon_i]\,E[\hat{f}(x_i) - f(x_i)] = 0 \cdot E[\hat{f}(x_i) - f(x_i)] = 0$$
$$\Rightarrow \text{true error} = \text{empirical test error} + \text{small constant}$$
Hence, we should always use a validation set (independent of the training set) to estimate the error.
22/84
Case 2: Using training observations
$$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}_{\text{empirical estimation of error}} - \underbrace{\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i^2}_{\text{small constant}} + \underbrace{2\,E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\,\mathrm{covariance}(\varepsilon_i,\;\hat{f}(x_i) - f(x_i))}$$
Now $\varepsilon \not\perp \hat{f}(x)$, because $\varepsilon$ was used for estimating the parameters of $\hat{f}(x)$
$$\Rightarrow E[\varepsilon_i(\hat{f}(x_i) - f(x_i))] \neq E[\varepsilon_i]\,E[\hat{f}(x_i) - f(x_i)] \neq 0$$
Hence, the empirical train error is smaller than the true error and does not give a true picture of the error.
But how is this related to model complexity? Let us see.
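Both cases can be seen in a short simulation. A minimal sketch (mine; sinusoidal data as before, a degree-9 fit): since we generate the data ourselves we know $f$ and $\sigma$, so we can compute the true error directly and compare it with the train-side and test-side empirical estimates, each corrected by the "small constant" $\sigma^2$.

```python
# Train error underestimates the true error; test error tracks it.
import numpy as np

rng = np.random.default_rng(3)
f, sigma, degree = np.sin, 0.3, 9

x_tr = rng.uniform(0, 2 * np.pi, 50)
y_tr = f(x_tr) + rng.normal(0, sigma, 50)   # this noise IS used in fitting
x_te = rng.uniform(0, 2 * np.pi, 50)
y_te = f(x_te) + rng.normal(0, sigma, 50)   # this noise is NOT used in fitting

w = np.polyfit(x_tr, y_tr, degree)
x_all = np.r_[x_tr, x_te]
true_err = np.mean((np.polyval(w, x_all) - f(x_all)) ** 2)  # knowable only in simulation
train_est = np.mean((y_tr - np.polyval(w, x_tr)) ** 2) - sigma**2
test_est = np.mean((y_te - np.polyval(w, x_te)) ** 2) - sigma**2

print(f"true error                    : {true_err:.3f}")
print(f"train estimate (optimistic)   : {train_est:.3f}")
print(f"test estimate (close to true) : {test_est:.3f}")
```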
23/84
Module 8.3 : True error and Model complexity
Using Stein's Lemma (and some trickery) we can show that

$$\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(\hat{f}(x_i) - f(x_i)) = \frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat{f}(x_i)}{\partial y_i}$$

When will $\frac{\partial \hat{f}(x_i)}{\partial y_i}$ be high? When a small change in the observation causes a large change in the estimation ($\hat{f}$)

Can you link this to model complexity? Yes, indeed a complex model will be more sensitive to changes in observations whereas a simple model will be less sensitive to changes in observations

Hence, we can say that

$$\text{true error} = \text{empirical train error} + \text{small constant} + \Omega(\text{model complexity})$$
Let us verify that indeed a complex model is more sensitive to minor changes in the data. We have fitted a simple and a complex model to some given data. We now change one of these data points. The simple model does not change much compared to the complex model.
Hence while training, instead of minimizing the training error $\mathscr{L}_{train}(\theta)$ we should minimize

$$\min_{w.r.t.\ \theta}\ \mathscr{L}_{train}(\theta) + \Omega(\theta) = \mathscr{L}(\theta)$$

where $\Omega(\theta)$ would be high for complex models and small for simple models

$\Omega(\theta)$ acts as an approximation to $\frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat{f}(x_i)}{\partial y_i}$

This is the basis for all regularization methods

We can show that $l_1$ regularization, $l_2$ regularization, early stopping and injecting noise in input are all instances of this form of regularization.
[Figure: error vs. model complexity, with high bias on the simple end, high variance on the complex end, and a sweet spot in between]

$\Omega(\theta)$ should ensure that the model has reasonable complexity, acting as a proxy for $\frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat{f}(x_i)}{\partial y_i}$
Why do we care about this bias variance tradeoff and model complexity?

Deep neural networks are highly complex models: many parameters, many non-linearities.

It is easy for them to overfit and drive training error to 0.

Hence we need some form of regularization.
Module 8.4 : $l_2$ regularization
Different forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
For $l_2$ regularization we have,

$$\widetilde{\mathscr{L}}(w) = \mathscr{L}(w) + \frac{\alpha}{2}\|w\|^2$$

For SGD (or its variants), we are interested in

$$\nabla \widetilde{\mathscr{L}}(w) = \nabla \mathscr{L}(w) + \alpha w$$

Update rule:

$$w_{t+1} = w_t - \eta \nabla \mathscr{L}(w_t) - \eta \alpha w_t$$

Requires a very small modification to the code

Let us see the geometric interpretation of this
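The update rule translates almost directly into code. A minimal NumPy sketch follows; the quadratic loss, the step size and the value of $\alpha$ are illustrative assumptions, not from the lecture:

```python
import numpy as np

# A minimal sketch of SGD with l2 regularization (weight decay):
# w_{t+1} = w_t - eta * grad L(w_t) - eta * alpha * w_t
def sgd_l2_step(w, grad_L, eta=0.1, alpha=0.01):
    return w - eta * grad_L(w) - eta * alpha * w

# Example: quadratic loss L(w) = ||w - w_star||^2 / 2, so grad L(w) = w - w_star
w_star = np.array([1.0, -2.0])
grad_L = lambda w: w - w_star

w = np.zeros(2)
for _ in range(1000):
    w = sgd_l2_step(w, grad_L)
print(w)  # converges to a point shrunk towards 0 relative to w_star
```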
Assume $w^*$ is the optimal solution for $\mathscr{L}(w)$ [not $\widetilde{\mathscr{L}}(w)$], i.e. the solution in the absence of regularization ($w^*$ optimal $\Rightarrow \nabla \mathscr{L}(w^*) = 0$)

Consider $u = w - w^*$. Using the Taylor series approximation (up to 2nd order):

$$\mathscr{L}(w^* + u) = \mathscr{L}(w^*) + u^T \nabla \mathscr{L}(w^*) + \frac{1}{2} u^T H u$$

$$\mathscr{L}(w) = \mathscr{L}(w^*) + (w - w^*)^T \nabla \mathscr{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$$

$$= \mathscr{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*) \qquad (\because \nabla \mathscr{L}(w^*) = 0)$$

$$\nabla \mathscr{L}(w) = \nabla \mathscr{L}(w^*) + H(w - w^*) = H(w - w^*)$$

Now,

$$\nabla \widetilde{\mathscr{L}}(w) = \nabla \mathscr{L}(w) + \alpha w = H(w - w^*) + \alpha w$$
Let $\widetilde{w}$ be the optimal solution for $\widetilde{\mathscr{L}}(w)$ [i.e. the regularized loss]

$$\because \nabla \widetilde{\mathscr{L}}(\widetilde{w}) = 0$$

$$H(\widetilde{w} - w^*) + \alpha \widetilde{w} = 0 \Rightarrow (H + \alpha I)\widetilde{w} = H w^*$$

$$\Rightarrow \widetilde{w} = (H + \alpha I)^{-1} H w^*$$

Notice that if $\alpha \to 0$ then $\widetilde{w} \to w^*$ [no regularization]

But we are interested in the case when $\alpha \neq 0$. Let us analyse that case.
If $H$ is symmetric positive semi-definite, then $H = Q \Lambda Q^T$ [$Q$ is orthogonal: $Q Q^T = Q^T Q = I$]

$$\begin{aligned}
\widetilde{w} &= (H + \alpha I)^{-1} H w^* \\
&= (Q \Lambda Q^T + \alpha I)^{-1} Q \Lambda Q^T w^* \\
&= (Q \Lambda Q^T + \alpha Q I Q^T)^{-1} Q \Lambda Q^T w^* \\
&= [Q(\Lambda + \alpha I) Q^T]^{-1} Q \Lambda Q^T w^* \\
&= Q^{T^{-1}} (\Lambda + \alpha I)^{-1} Q^{-1} Q \Lambda Q^T w^* \\
&= Q (\Lambda + \alpha I)^{-1} \Lambda Q^T w^* \qquad (\because Q^{T^{-1}} = Q) \\
\widetilde{w} &= Q D Q^T w^*
\end{aligned}$$

where $D = (\Lambda + \alpha I)^{-1} \Lambda$ is a diagonal matrix, which we will see in more detail soon
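Since each step of this chain is plain linear algebra, the identity is easy to sanity-check numerically. A small NumPy sketch with a random PSD $H$; the sizes and the value of $\alpha$ are arbitrary:

```python
import numpy as np

# Sketch: verify that (H + alpha*I)^{-1} H w* equals Q D Q^T w*
# with D = (Lambda + alpha*I)^{-1} Lambda, for a random symmetric PSD H.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A @ A.T                      # symmetric positive semi-definite
alpha = 0.5
w_star = rng.standard_normal(4)

lam, Q = np.linalg.eigh(H)       # H = Q diag(lam) Q^T
D = np.diag(lam / (lam + alpha))

w_tilde_direct = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)
w_tilde_evd = Q @ D @ Q.T @ w_star
print(np.allclose(w_tilde_direct, w_tilde_evd))  # True
```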
Each element of $Q^T w^*$ gets scaled by $\frac{\lambda_i}{\lambda_i + \alpha}$ before it is rotated back by $Q$

If $\lambda_i \gg \alpha$ then $\frac{\lambda_i}{\lambda_i + \alpha} \approx 1$; if $\lambda_i \ll \alpha$ then $\frac{\lambda_i}{\lambda_i + \alpha} \approx 0$

Thus only significant directions (larger eigenvalues) will be retained.

$$\text{Effective parameters} = \sum_{i=1}^{n} \frac{\lambda_i}{\lambda_i + \alpha} < n$$
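A two-line computation makes the "effective parameters" count concrete; the eigenvalues below are made up purely for illustration:

```python
import numpy as np

# Sketch: effective number of parameters under l2 regularization,
# computed from the eigenvalues of the Hessian H.
lam = np.array([10.0, 5.0, 0.5, 0.01])   # illustrative eigenvalues
alpha = 1.0
effective = np.sum(lam / (lam + alpha))
print(effective)   # about 2.1 out of 4: small-eigenvalue directions barely count
```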
The weight vector ($w^*$) is getting rotated to ($\widetilde{w}$)

All of its elements are shrinking, but some are shrinking more than the others

This ensures that only important features are given high weights
Module 8.5 : Dataset augmentation

Different forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
[Figure: a digit image with label = 2 — the given training data]

We exploit the fact that certain transformations to the image do not change the label of the image.

[Figure: the same image, still with label = 2, after each transformation: rotated by 20°, rotated by 65°, shifted vertically, shifted horizontally, blurred, changed some pixels]

[augmented data = created using some knowledge of the task]
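As a sketch of such label-preserving transformations, assuming 2-D grayscale arrays and SciPy's `ndimage` module; the exact transformations and parameters are illustrative:

```python
import numpy as np
from scipy import ndimage

# Sketch of label-preserving augmentations for image classification:
# every output keeps the label of the input image.
def augment(img, rng):
    out = [img]
    out.append(ndimage.rotate(img, angle=20, reshape=False))   # rotate by 20 degrees
    out.append(ndimage.shift(img, shift=(2, 0)))               # shift vertically
    out.append(ndimage.shift(img, shift=(0, 2)))               # shift horizontally
    out.append(ndimage.gaussian_filter(img, sigma=1.0))        # blur
    noisy = img.copy()
    idx = rng.integers(0, img.size, size=10)                   # change some pixels
    noisy.flat[idx] = rng.random(10)
    out.append(noisy)
    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))
augmented = augment(img, rng)   # 6 images, all sharing the original label
```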
Typically, more data = better learning

Works well for image classification / object recognition tasks

Also shown to work well for speech

For some tasks it may not be clear how to generate such data
Module 8.6 : Parameter Sharing and tying
Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
Parameter Sharing: used in CNNs. The same filter is applied at different positions of the image; equivalently, the same weight matrix acts on different input neurons.

Parameter Tying: typically used in autoencoders ($x \rightarrow h(x) \rightarrow \hat{x}$). The encoder and decoder weights are tied.
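A minimal sketch of tying, assuming a one-hidden-layer autoencoder; the sigmoid non-linearity and the layer sizes are illustrative choices:

```python
import numpy as np

# Parameter tying in an autoencoder: the decoder reuses the transpose of the
# encoder weight matrix W instead of learning a separate matrix.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 3
W = rng.standard_normal((n_hidden, n_in)) * 0.1   # shared by encoder and decoder
b = np.zeros(n_hidden)
c = np.zeros(n_in)

x = rng.standard_normal(n_in)
h = sigmoid(W @ x + b)        # encoder: h(x)
x_hat = W.T @ h + c           # decoder uses W^T (tied weights) to produce x_hat
```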
Module 8.7 : Adding Noise to the inputs
Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
[Diagram: $x \rightarrow \widetilde{x} \rightarrow h(\widetilde{x}) \rightarrow \hat{x}$, where $P(\widetilde{x}\,|\,x)$ is the noise process]

We saw this in the Autoencoder lecture (the denoising setup)

We can show that for a simple input-output neural network, adding Gaussian noise to the input is equivalent to weight decay ($L_2$ regularisation)

Can be viewed as data augmentation
Consider inputs $x_1 + \varepsilon_1,\ x_2 + \varepsilon_2,\ \ldots,\ x_n + \varepsilon_n$ with

$$\varepsilon \sim \mathcal{N}(0, \sigma^2), \qquad \widetilde{x}_i = x_i + \varepsilon_i$$

$$\hat{y} = \sum_{i=1}^{n} w_i x_i, \qquad \widetilde{y} = \sum_{i=1}^{n} w_i \widetilde{x}_i = \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} w_i \varepsilon_i = \hat{y} + \sum_{i=1}^{n} w_i \varepsilon_i$$

We are interested in $E[(\widetilde{y} - y)^2]$:

$$\begin{aligned}
E\big[(\widetilde{y} - y)^2\big] &= E\Big[\Big(\hat{y} + \sum_{i=1}^{n} w_i \varepsilon_i - y\Big)^2\Big] \\
&= E\Big[\Big((\hat{y} - y) + \sum_{i=1}^{n} w_i \varepsilon_i\Big)^2\Big] \\
&= E\big[(\hat{y} - y)^2\big] + E\Big[2(\hat{y} - y)\sum_{i=1}^{n} w_i \varepsilon_i\Big] + E\Big[\Big(\sum_{i=1}^{n} w_i \varepsilon_i\Big)^2\Big] \\
&= E\big[(\hat{y} - y)^2\big] + 0 + E\Big[\sum_{i=1}^{n} w_i^2 \varepsilon_i^2\Big] \\
&\qquad (\because \varepsilon_i \text{ is independent of } \varepsilon_j \text{ and } \varepsilon_i \text{ is independent of } (\hat{y} - y)) \\
&= E\big[(\hat{y} - y)^2\big] + \sigma^2 \sum_{i=1}^{n} w_i^2 \qquad (\text{same as } L_2 \text{ norm penalty})
\end{aligned}$$
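The final line is easy to confirm by simulation. A Monte Carlo sketch (the dimensions, weights and noise level are arbitrary):

```python
import numpy as np

# Sketch: check that Gaussian input noise inflates the expected squared error
# by sigma^2 * sum(w_i^2), i.e. the L2 penalty term derived above.
rng = np.random.default_rng(0)
n, sigma = 5, 0.3
w = rng.standard_normal(n)
x = rng.standard_normal(n)
y = 1.7                                 # some fixed target
y_hat = w @ x

trials = 200_000
eps = rng.normal(0.0, sigma, size=(trials, n))
y_tilde = (x + eps) @ w                 # noisy-input predictions
mc_error = np.mean((y_tilde - y) ** 2)
predicted = (y_hat - y) ** 2 + sigma ** 2 * np.sum(w ** 2)
print(mc_error, predicted)              # the two should be close
```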
Module 8.8 : Adding Noise to the outputs
Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
Hard targets: $0\ 0\ 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0$

minimize: $-\sum_{i=0}^{9} p_i \log q_i$

true distribution: $p = \{0, 0, 1, 0, 0, 0, 0, 0, 0, 0\}$, estimated distribution: $q$

Intuition: do not trust the true labels, they may be noisy. Instead, use soft targets.
Soft targets: $\frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ 1-\varepsilon\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}$, where $\varepsilon$ is a small positive constant

minimize: $-\sum_{i=0}^{9} p_i \log q_i$

true distribution + noise: $p = \left\{\frac{\varepsilon}{9}, \frac{\varepsilon}{9}, 1-\varepsilon, \frac{\varepsilon}{9}, \ldots\right\}$, estimated distribution: $q$
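This soft-target construction (often called label smoothing) is a few lines of code; `eps = 0.1` below is just an illustrative choice:

```python
import numpy as np

# Sketch of soft targets for a 10-class problem: the true class keeps
# probability 1 - eps, and eps is spread evenly over the other 9 classes.
def soft_targets(true_class, num_classes=10, eps=0.1):
    p = np.full(num_classes, eps / (num_classes - 1))
    p[true_class] = 1.0 - eps
    return p

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p = soft_targets(true_class=2)      # [eps/9, eps/9, 1-eps, eps/9, ...]
q = np.full(10, 0.1)                # some estimated distribution
print(cross_entropy(p, q))
```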
Module 8.9 : Early stopping
Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
[Figure: training error and validation error vs. steps; training error keeps falling while validation error starts rising, so training stops at step $k$ and the model stored at step $k - p$ is returned]

Track the validation error

Have a patience parameter $p$

If you are at step $k$ and there was no improvement in validation error in the previous $p$ steps, then stop training and return the model stored at step $k - p$

Basically, stop the training early before it drives the training error to 0 and blows up the validation error
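The stopping logic itself is small. A sketch, where `train_step` and `validation_error` are placeholders for the actual training and evaluation routines:

```python
# Sketch of early stopping with a patience parameter p: keep the best model
# seen so far and stop once p steps pass without a validation improvement.
def train_with_early_stopping(model, train_step, validation_error, p=5, max_steps=10_000):
    best_err, best_model, steps_since_best = float("inf"), None, 0
    for k in range(max_steps):
        model = train_step(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_model, steps_since_best = err, model, 0
        else:
            steps_since_best += 1
        if steps_since_best >= p:   # no improvement in the previous p steps
            break
    return best_model               # the model stored p steps before stopping
```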
Very effective and the most widely used form of regularization

Can be used even with other regularizers (such as $l_2$)

How does it act as a regularizer? We will first see an intuitive explanation and then a mathematical analysis
Recall that the update rule in SGD is

$$w_{t+1} = w_t - \eta \nabla w_t = w_0 - \eta \sum_{i=1}^{t} \nabla w_i$$

Let $\tau$ be the maximum value of $\nabla w_i$; then

$$|w_{t+1} - w_0| \leq \eta\, t\, |\tau|$$

Thus, $\eta t$ controls how far $w_t$ can go from the initial $w_0$; in other words, it controls the space of exploration
We will now see a mathematical analysis of this
Recall that the Taylor series approximation for $\mathscr{L}(w)$ is

$$\mathscr{L}(w) = \mathscr{L}(w^*) + (w - w^*)^T \nabla \mathscr{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$$

$$= \mathscr{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*) \qquad [\,w^* \text{ is optimal, so } \nabla \mathscr{L}(w^*) \text{ is } 0\,]$$

$$\nabla \mathscr{L}(w) = H(w - w^*)$$

Now the SGD update rule is:

$$w_t = w_{t-1} - \eta \nabla \mathscr{L}(w_{t-1}) = w_{t-1} - \eta H (w_{t-1} - w^*) = (I - \eta H) w_{t-1} + \eta H w^*$$
$$w_t = (I - \eta H) w_{t-1} + \eta H w^*$$

Using the EVD of $H$ as $H = Q \Lambda Q^T$, we get:

$$w_t = (I - \eta Q \Lambda Q^T) w_{t-1} + \eta Q \Lambda Q^T w^*$$

If we start with $w_0 = 0$ then we can show that (see Appendix)

$$w_t = Q[\,I - (I - \eta \Lambda)^t\,] Q^T w^*$$

Compare this with the expression we had for the optimum $\widetilde{w}$ with $L_2$ regularization:

$$\widetilde{w} = Q[\,I - (\Lambda + \alpha I)^{-1} \alpha\,] Q^T w^*$$

We observe that $w_t = \widetilde{w}$ if we choose $\eta$, $t$ and $\alpha$ such that

$$(I - \eta \Lambda)^t = (\Lambda + \alpha I)^{-1} \alpha$$
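The closed form for $w_t$ can be checked numerically against the iteration. A NumPy sketch with a random quadratic (the step size and step count are arbitrary, chosen small enough for stability):

```python
import numpy as np

# Sketch: iterate w_t = (I - eta*H) w_{t-1} + eta*H*w_star from w_0 = 0 and
# check it matches the closed form Q [I - (I - eta*Lambda)^t] Q^T w_star.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
H = A @ A.T                      # symmetric PSD Hessian
w_star = rng.standard_normal(3)
eta, t_steps = 0.01, 50

w = np.zeros(3)
for _ in range(t_steps):
    w = (np.eye(3) - eta * H) @ w + eta * H @ w_star

lam, Q = np.linalg.eigh(H)
closed_form = Q @ (np.eye(3) - np.diag((1 - eta * lam) ** t_steps)) @ Q.T @ w_star
print(np.allclose(w, closed_form))  # True
```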
Things to remember:

Early stopping only allows $t$ updates to the parameters.

If a parameter $w$ corresponds to a dimension which is important for the loss $\mathscr{L}(\theta)$, then $\frac{\partial \mathscr{L}(\theta)}{\partial w}$ will be large

However, if a parameter is not important ($\frac{\partial \mathscr{L}(\theta)}{\partial w}$ is small), then its updates will be small and the parameter will not be able to grow large in $t$ steps

Early stopping will thus effectively shrink the parameters corresponding to less important directions (same as weight decay).
Module 8.10 : Ensemble methods

Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
[Figure: a Logistic Regression ($y_{lr}$), an SVM ($y_{svm}$) and a Naive Bayes classifier ($y_{nb}$) each read the inputs $x_1, x_2, x_3, x_4$; their outputs are combined into $y_{final}$]

Combine the output of different models to reduce generalization error

The models can correspond to different classifiers

It could be different instances of the same classifier trained with:
different hyperparameters
different features
different samples of the training data
[Figure: three Logistic Regression instances $y_{lr_1}$, $y_{lr_2}$, $y_{lr_3}$ combined into $y_{final}$]

Each model is trained with a different sample of the data (sampling with replacement)

Bagging: form an ensemble using different instances of the same classifier

From a given dataset, construct multiple training sets by sampling with replacement ($T_1, T_2, \ldots, T_k$)

Train the $i^{th}$ instance of the classifier using training set $T_i$
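The bagging recipe above takes only a few lines of NumPy (the data, the least-squares base learner, and the value of $k$ are assumptions for illustration; the slide's figure uses logistic regression classifiers): each $T_i$ is drawn by sampling indices with replacement, one instance of the same model is trained per $T_i$, and the predictions are averaged.

import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration): y = x.w + noise
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=100)

k = 10          # number of bagged models
models = []
for _ in range(k):
    # Construct T_i by sampling with replacement (same size as the dataset)
    idx = rng.integers(0, len(X), size=len(X))
    w_i, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    models.append(w_i)

# The ensemble prediction is the average of the individual predictions
x_test = rng.normal(size=3)
y_hat = np.mean([x_test @ w_i for w_i in models])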
67/84
When would bagging work?
Consider a set of $k$ LR (logistic regression) models
Suppose that each model makes an error $\varepsilon_i$ on a test example
Let $\varepsilon_i$ be drawn from a zero-mean multivariate normal distribution with
$Variance = E[\varepsilon_i^2] = V$
$Covariance = E[\varepsilon_i \varepsilon_j] = C$
The error made by the average prediction of all the models is $\frac{1}{k}\sum_i \varepsilon_i$
The expected squared error is:
$$\begin{aligned}
mse &= E\Big[\Big(\frac{1}{k}\sum_i \varepsilon_i\Big)^2\Big] \\
&= \frac{1}{k^2}\, E\Big[\sum_i \sum_{i=j} \varepsilon_i \varepsilon_j + \sum_i \sum_{i \neq j} \varepsilon_i \varepsilon_j\Big] \\
&= \frac{1}{k^2}\, E\Big[\sum_i \varepsilon_i^2 + \sum_i \sum_{i \neq j} \varepsilon_i \varepsilon_j\Big] \\
&= \frac{1}{k^2} \Big(\sum_i E[\varepsilon_i^2] + \sum_i \sum_{i \neq j} E[\varepsilon_i \varepsilon_j]\Big) \\
&= \frac{1}{k^2} \big(kV + k(k-1)C\big) \\
&= \frac{1}{k} V + \frac{k-1}{k} C
\end{aligned}$$
68/84
$$mse = \frac{1}{k} V + \frac{k-1}{k} C$$
When would bagging work?
If the errors of the models are perfectly correlated, then $V = C$ and $mse = V$ [bagging does not help: the mse of the ensemble is as bad as that of the individual models]
If the errors of the models are independent or uncorrelated, then $C = 0$ and the mse of the ensemble reduces to $\frac{1}{k}V$
On average, the ensemble will therefore perform at least as well as its individual members
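The identity $mse = \frac{1}{k}V + \frac{k-1}{k}C$ is easy to sanity-check numerically. Here is a minimal Monte Carlo sketch (the values of $k$, $V$, and $C$ are assumptions for illustration): draw correlated errors with variance $V$ and pairwise covariance $C$, average them, and compare the empirical mse against the formula.

import numpy as np

rng = np.random.default_rng(0)
k, V, C = 10, 1.0, 0.3   # assumed values for illustration

# Covariance matrix with V on the diagonal and C off the diagonal
Sigma = C * np.ones((k, k)) + (V - C) * np.eye(k)
eps = rng.multivariate_normal(np.zeros(k), Sigma, size=200_000)

mse_empirical = np.mean(eps.mean(axis=1) ** 2)
mse_formula = V / k + (k - 1) / k * C
print(mse_empirical, mse_formula)  # both close to 0.37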
70/84
Other forms of regularization
$l_2$ regularization
Dataset augmentation
Parameter sharing and tying
Adding noise to the inputs
Adding noise to the outputs
Early stopping
Ensemble methods
Dropout
71/84
Typically, model averaging (a bagging ensemble) always helps
But training several large neural networks for an ensemble is prohibitively expensive
Option 1: Train several neural networks having different architectures (obviously expensive)
Option 2: Train multiple instances of the same network using different training samples (again expensive)
Even if we manage to train with Option 1 or Option 2, combining several models at test time is infeasible in real-time applications
72/84
Dropout is a technique which addresses both these issues
Effectively, it allows training several neural networks without any significant computational overhead
It also gives an efficient approximate way of combining exponentially many different neural networks
73/84
Dropout refers to dropping out units
Temporarily remove a node and all its incoming/outgoing connections, resulting in a thinned network
Each node is retained with a fixed probability (typically $p = 0.5$ for hidden nodes and $p = 0.8$ for visible nodes)
74/84
Suppose a neural network has $n$ nodes
Using the dropout idea, each node can be retained or dropped
For example, in the above case we drop 5 nodes to get a thinned network
Given a total of $n$ nodes, what is the total number of thinned networks that can be formed? $2^n$
Of course, this is prohibitively large and we cannot possibly train so many networks
Trick: (1) Share the weights across all the networks (2) Sample a different network for each training instance
Let us see how
75/84
We initialize all the parameters (weights) of the network and start training
For the first training instance (or mini-batch), we apply dropout, resulting in a thinned network
We compute the loss and backpropagate
Which parameters will we update? Only those which are active
76/84
For the second training instance (or mini-batch), we again apply dropout, resulting in a different thinned network
We again compute the loss and backpropagate to the active weights
If a weight was active for both training instances, it has received two updates by now
If a weight was active for only one of the training instances, it has received only one update by now
Each thinned network gets trained rarely (or even never), but the parameter sharing ensures that no model has untrained or poorly trained parameters
(A training-time sketch follows below.)
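Here is a minimal sketch of the training-time procedure described in the last few slides (the layer sizes, the ReLU activation, and all names are assumptions for illustration, not the lecture's code): a fresh binary mask per mini-batch samples a different thinned network, while all thinned networks share the same underlying weights.

import numpy as np

rng = np.random.default_rng(0)

# Shared weights of one hidden layer; every thinned network reuses these
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))
p = 0.5  # retention probability for hidden nodes

def forward_train(x):
    h = np.maximum(0, x @ W1)            # ReLU hidden layer
    mask = (rng.random(h.shape) < p)     # retain each node with probability p
    return h * mask, mask                # dropped units output 0

x = rng.normal(size=(16, 4))             # one mini-batch
h_thin, mask = forward_train(x)
y = h_thin @ W2
# During backprop, gradients flow only through units where mask is 1,
# so only the weights of the sampled thinned network get updated.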
77/84
[Figure: a unit that is present with probability $p$ at training time, with outgoing weights $w_1, w_2, w_3, w_4$; at test time it is always present, with the weights scaled to $pw_1, pw_2, pw_3, pw_4$]
What happens at test time? It is impossible to aggregate the outputs of $2^n$ thinned networks
Instead, we use the full neural network and scale the output of each node by the fraction of times it was on during training
(See the sketch after this slide.)
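Continuing the sketch above, test-time inference under this scheme might look as follows (again an assumed setup, not the lecture's code). As an aside, many modern implementations instead scale by $1/p$ at training time ("inverted dropout"), which leaves the test-time forward pass unchanged.

import numpy as np

def forward_test(x, W1, W2, p=0.5):
    """Full network, no mask; hidden outputs scaled by the retention prob. p."""
    h = np.maximum(0, x @ W1)   # ReLU hidden layer, every node present
    return (p * h) @ W2         # scaling approximates averaging the 2^n thinned nets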
78/84
Dropout essentially applies a masking noise to the hidden units
It prevents the hidden units from co-adapting
Essentially, a hidden unit cannot rely too much on other units, as they may get dropped out at any time
Each hidden unit has to learn to be more robust to these random dropouts
79/84
[Figure: a network with a highlighted hidden unit $h_i$]
Here is an example of how dropout helps in ensuring redundancy and robustness
Suppose $h_i$ learns to detect a face by firing on detecting a nose
Dropping $h_i$ then corresponds to erasing the information that a nose exists
The model should then learn another $h_i$ which redundantly encodes the presence of a nose
Or the model should learn to detect the face using other features
80/84
Recap
$l_2$ regularization
Dataset augmentation
Parameter sharing and tying
Adding noise to the inputs
Adding noise to the outputs
Early stopping
Ensemble methods
Dropout
81/84
Appendix
82/84
To prove: the two equations below are equivalent
$$w_t = (I - \varepsilon Q \Lambda Q^T)\, w_{t-1} + \varepsilon Q \Lambda Q^T w^*$$
$$w_t = Q\,[I - (I - \varepsilon \Lambda)^t]\, Q^T w^*$$
Proof by induction.
Base case: $t = 1$ and $w_0 = 0$.
$w_1$ according to the first equation:
$$w_1 = (I - \varepsilon Q \Lambda Q^T)\, w_0 + \varepsilon Q \Lambda Q^T w^* = \varepsilon Q \Lambda Q^T w^*$$
$w_1$ according to the second equation:
$$w_1 = Q\,(I - (I - \varepsilon \Lambda)^1)\, Q^T w^* = \varepsilon Q \Lambda Q^T w^*$$
83/84
Induction step: assume the two equations are equivalent for the $t^{th}$ step, i.e.
$$w_t = (I - \varepsilon Q \Lambda Q^T)\, w_{t-1} + \varepsilon Q \Lambda Q^T w^* = Q\,[I - (I - \varepsilon \Lambda)^t]\, Q^T w^*$$
Proof that this will hold for the $(t+1)^{th}$ step:
$$\begin{aligned}
w_{t+1} &= (I - \varepsilon Q \Lambda Q^T)\, w_t + \varepsilon Q \Lambda Q^T w^* \\
&\quad (\text{using } w_t = Q\,[I - (I - \varepsilon \Lambda)^t]\, Q^T w^*) \\
&= (I - \varepsilon Q \Lambda Q^T)\, Q\,(I - (I - \varepsilon \Lambda)^t)\, Q^T w^* + \varepsilon Q \Lambda Q^T w^* \\
&\quad (\text{opening the bracket}) \\
&= Q\,(I - (I - \varepsilon \Lambda)^t)\, Q^T w^* - \varepsilon Q \Lambda Q^T Q\,(I - (I - \varepsilon \Lambda)^t)\, Q^T w^* + \varepsilon Q \Lambda Q^T w^* \\
&\quad (\text{using } Q^T Q = I \text{ and factoring out } Q \text{ on the left and } Q^T w^* \text{ on the right}) \\
&= Q \big[(I - (I - \varepsilon \Lambda)^t) - \varepsilon \Lambda\,(I - (I - \varepsilon \Lambda)^t) + \varepsilon \Lambda\big]\, Q^T w^* \\
&= Q \big[I - (I - \varepsilon \Lambda)^t (I - \varepsilon \Lambda)\big]\, Q^T w^* \\
&= Q \big[I - (I - \varepsilon \Lambda)^{t+1}\big]\, Q^T w^*
\end{aligned}$$
which is the second equation for $t + 1$, completing the induction.
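The equivalence just proved is also easy to verify numerically. Below is a minimal sketch (the dimensions, $\varepsilon$, $t$, and the randomly generated Hessian are assumptions for illustration): iterate the recurrence and compare the result against the closed form.

import numpy as np

rng = np.random.default_rng(0)

# Random symmetric PSD "Hessian" H = Q Lambda Q^T and minimizer w*
A = rng.normal(size=(5, 5))
H = A @ A.T
lam, Q = np.linalg.eigh(H)          # eigendecomposition: H = Q diag(lam) Q^T
w_star = rng.normal(size=5)
eps, t = 0.01, 50

# Recurrence: w_t = (I - eps*H) w_{t-1} + eps*H w*
w = np.zeros(5)
for _ in range(t):
    w = w - eps * H @ (w - w_star)

# Closed form: w_t = Q [I - (I - eps*Lambda)^t] Q^T w*
w_closed = Q @ np.diag(1 - (1 - eps * lam) ** t) @ Q.T @ w_star
print(np.allclose(w, w_closed))     # True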