18-20 Regularization, Bias Variance Tradeoff, L2 Regularization, Early Stopping

AbhasAbhirup · Apr 29, 2024

About This Presentation

Regularization


Slide Content

1/84
Deep Learning : Lecture 6
Regularization: Bias-Variance Tradeoff, L2 regularization, Early stopping, Dataset augmentation, Parameter sharing and tying, Injecting noise at input, Ensemble methods, Dropout

2/84
Acknowledgements
Chapter 7, Deep Learning book
Ali Ghodsi's Video Lectures on Regularization (Lecture 2.1)
"Dropout: A Simple Way to Prevent Neural Networks from Overfitting"
CS6910: Deep Learning Course by Prof. Mitesh M. Khapra, IIT Madras, India

3/84
Module 8.1 : Bias and Variance

4/84
We will begin with a quick overview of bias, variance and the trade-off between them.

5/84
[Figure: "Simple" vs. "Complex" fits; the points were drawn from a sinusoidal function (the true f(x))]
Let us consider the problem of fitting a curve through a given set of points
We consider two models:
Simple (degree 1): $y = \hat{f}(x) = w_1 x + w_0$
Complex (degree 25): $y = \hat{f}(x) = \sum_{i=1}^{25} w_i x^i + w_0$
Note that in both cases we are making an assumption about how $y$ is related to $x$. We have no idea about the true relation $f(x)$
The training data consists of 100 points
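This setup is easy to reproduce in code. Below is a minimal sketch (not part of the original slides) that draws 100 noisy points from a sinusoid and fits the simple (degree-1) and complex (degree-25) models by least squares; the noise level and random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # The true relation (unknown in practice): a sinusoid
    return np.sin(2 * np.pi * x)

n = 100                                   # training data: 100 points
x = rng.uniform(0, 1, n)
y = f_true(x) + rng.normal(0, 0.3, n)     # noise std 0.3 is an assumed value

# Simple model: y = w1*x + w0 (degree 1)
w_simple = np.polyfit(x, y, deg=1)
# Complex model: y = sum_{i=1}^{25} wi*x^i + w0 (degree 25; may warn about conditioning)
w_complex = np.polyfit(x, y, deg=25)

x_grid = np.linspace(0, 1, 200)           # where we will inspect the fits
y_simple = np.polyval(w_simple, x_grid)
y_complex = np.polyval(w_complex, x_grid)
```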

6/84
[Figure: "Simple" vs. "Complex" fits; the points were drawn from a sinusoidal function (the true f(x))]
We sample 25 points from the training data and train a simple and a complex model
We repeat the process k times to train multiple models (each model sees a different sample of the training data)
We make a few observations from these plots
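A sketch of this resampling experiment, continuing the snippet above (k = 50 is an assumed choice):

```python
k = 50                                    # number of resampled fits (assumed)
idx = np.arange(n)

preds_simple, preds_complex = [], []
for _ in range(k):
    s = rng.choice(idx, size=25, replace=False)   # each model sees a different sample
    preds_simple.append(np.polyval(np.polyfit(x[s], y[s], deg=1), x_grid))
    # degree 25 on 25 points is rank-deficient; polyfit falls back to least squares
    preds_complex.append(np.polyval(np.polyfit(x[s], y[s], deg=25), x_grid))

preds_simple = np.array(preds_simple)     # shape (k, len(x_grid))
preds_complex = np.array(preds_complex)
```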

7/84
Simple models trained on different samples of the data do not differ much from each other
However they are very far from the true sinusoidal curve (underfitting)
On the other hand, complex models trained on different samples of the data are very different from each other (high variance)

8/84
[Figure: Green line: average value of $\hat{f}(x)$ for the simple model; Blue curve: average value of $\hat{f}(x)$ for the complex model; Red curve: true model $f(x)$]
Let $f(x)$ be the true model (sinusoidal in this case) and $\hat{f}(x)$ be our estimate of the model (simple or complex, in this case); then,
$\text{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$
$E[\hat{f}(x)]$ is the average (or expected) value of the model
We can see that for the simple model the average value (green line) is very far from the true value $f(x)$ (sinusoidal function)
Mathematically, this means that the simple model has a high bias
On the other hand, the complex model has a low bias
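The definition above suggests a direct empirical estimate: average the k fitted curves and compare the average with the true f(x). A sketch reusing the arrays from the resampling snippet:

```python
# E[f_hat(x)] approximated by the mean over the k resampled fits
avg_simple = preds_simple.mean(axis=0)      # the "green line"
avg_complex = preds_complex.mean(axis=0)    # the "blue curve"

# Bias(f_hat(x)) = E[f_hat(x)] - f(x), evaluated pointwise on the grid
bias_simple = avg_simple - f_true(x_grid)
bias_complex = avg_complex - f_true(x_grid)

print("mean |bias|, simple :", np.abs(bias_simple).mean())   # large: high bias
print("mean |bias|, complex:", np.abs(bias_complex).mean())  # small: low bias
```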

9/84
We now define,
$\text{Variance}(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$
(Standard definition from statistics)
Roughly speaking, it tells us how much the different $\hat{f}(x)$'s (trained on different samples of the data) differ from each other
It is clear that the simple model has a low variance whereas the complex model has a high variance
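Continuing from the bias snippet above, the same arrays give the empirical variance, i.e. the spread of the k fits around their own average:

```python
# Variance(f_hat(x)) = E[(f_hat(x) - E[f_hat(x)])^2], estimated over the k fits
var_simple = ((preds_simple - avg_simple) ** 2).mean(axis=0)
var_complex = ((preds_complex - avg_complex) ** 2).mean(axis=0)

print("mean variance, simple :", var_simple.mean())    # small: low variance
print("mean variance, complex:", var_complex.mean())   # large: high variance
```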

10/84
In summary (informally):
Simple model: high bias, low variance
Complex model: low bias, high variance
There is always a trade-off between the bias and variance
Both bias and variance contribute to the mean square error. Let us see how

11/84
Module 8.2 : Train error vs Test error

12/84
We can show that
$E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2 \;(\text{irreducible error})$
(See proof here)
Consider a new point $(x, y)$ which was not seen during training
If we use the model $\hat{f}(x)$ to predict the value of $y$, then the mean square error is given by
$E[(y - \hat{f}(x))^2]$
(the average square error in predicting $y$ for many such unseen points)
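This decomposition can be checked numerically. A sketch, continuing the earlier snippets, that compares the mean square error at one unseen point with bias² + variance + σ² (the agreement is only approximate since k is finite):

```python
j = len(x_grid) // 2                       # an unseen test location x0
x0 = x_grid[j]
sigma = 0.3                                # the noise std assumed when generating y

for name, preds in [("simple", preds_simple), ("complex", preds_complex)]:
    f_hat = preds[:, j]                    # k predictions at x0
    bias2 = (f_hat.mean() - f_true(x0)) ** 2
    var = f_hat.var()
    y_new = f_true(x0) + rng.normal(0, sigma, k)   # fresh noisy observations at x0
    mse = ((y_new - f_hat) ** 2).mean()
    print(f"{name}: mse={mse:.3f}  bias^2+var+sigma^2={bias2 + var + sigma**2:.3f}")
```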

13/84
[Figure: error vs. model complexity; high bias at low complexity, high variance at high complexity, with a sweet spot in between marking the perfect tradeoff, i.e. the ideal model complexity]
$E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2 \;(\text{irreducible error})$
The parameters of $\hat{f}(x)$ (all the $w_i$'s) are trained using a training set $\{(x_i, y_i)\}_{i=1}^{n}$
However, at test time we are interested in evaluating the model on a validation (unseen) set which was not used for training
This gives rise to the following two entities of interest:
train_err (say, mean square error)
test_err (say, mean square error)
Typically these errors exhibit the trend shown in the figure above

14/84
Intuitions developed so far
Let there be $n$ training points and $m$ test (validation) points
$\text{train\_err} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$
$\text{test\_err} = \frac{1}{m}\sum_{i=n+1}^{n+m} (y_i - \hat{f}(x_i))^2$
As the model complexity increases, train_err becomes overly optimistic and gives us a wrong picture of how close $\hat{f}$ is to $f$
The validation error gives the real picture of how close $\hat{f}$ is to $f$
We will concretize this intuition mathematically now and eventually show how to account for the optimism in the training error
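These two formulas translate directly into code. A sketch (the split sizes and degrees are assumed choices) that sweeps model complexity; train_err typically keeps falling while test_err eventually turns back up:

```python
# n = 100 training points and m = 50 held-out (validation) points
x_all = rng.uniform(0, 1, 150)
y_all = f_true(x_all) + rng.normal(0, 0.3, 150)
x_tr, y_tr = x_all[:100], y_all[:100]
x_te, y_te = x_all[100:], y_all[100:]

for deg in [1, 3, 5, 10, 15, 20, 25]:
    w = np.polyfit(x_tr, y_tr, deg=deg)
    train_err = ((y_tr - np.polyval(w, x_tr)) ** 2).mean()
    test_err = ((y_te - np.polyval(w, x_te)) ** 2).mean()
    print(f"degree {deg:2d}: train_err={train_err:.4f}  test_err={test_err:.4f}")
```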

15/84
Let $D = \{(x_i, y_i)\}_{i=1}^{m+n}$; then for any point $(x, y)$ we have
$y_i = f(x_i) + \varepsilon_i$
which means that $y_i$ is related to $x_i$ by some true function $f$, but there is also some noise $\varepsilon$ in the relation
For simplicity, we assume
$\varepsilon \sim \mathcal{N}(0, \sigma^2)$
and of course we do not know $f$
Further, we use $\hat{f}$ to approximate $f$ and estimate the parameters using $T \subseteq D$ such that
$y_i = \hat{f}(x_i)$
We are interested in knowing
$E[(\hat{f}(x_i) - f(x_i))^2]$
but we cannot estimate this directly because we do not know $f$
We will see how to estimate this empirically using the observations $y_i$ & predictions $\hat{y}_i$

16/84
$E[(\hat{y}_i - y_i)^2]$
$= E[(\hat{f}(x_i) - f(x_i) - \varepsilon_i)^2]$  (using $y_i = f(x_i) + \varepsilon_i$)
$= E[(\hat{f}(x_i) - f(x_i))^2 - 2\varepsilon_i(\hat{f}(x_i) - f(x_i)) + \varepsilon_i^2]$
$= E[(\hat{f}(x_i) - f(x_i))^2] - 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))] + E[\varepsilon_i^2]$
$\Rightarrow E[(\hat{f}(x_i) - f(x_i))^2] = E[(\hat{y}_i - y_i)^2] - E[\varepsilon_i^2] + 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]$

17/84
We will take a small detour to understand how to empirically estimate an expectation, and then return to our derivation

18/84
Suppose we have observed the goals scored ($z$) in $k$ matches as
$z_1 = 2,\; z_2 = 1,\; z_3 = 0,\; \ldots,\; z_k = 2$
Now we can empirically estimate $E[z]$, i.e. the expected number of goals scored, as
$E[z] = \frac{1}{k}\sum_{i=1}^{k} z_i$
Analogy with our derivation: we have a certain number of observations $y_i$ & predictions $\hat{y}_i$ using which we can estimate
$E[(\hat{y}_i - y_i)^2] = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2$
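In code, the empirical estimate is just a sample mean. A tiny sketch (the scores elided by "..." on the slide are filled with illustrative values here):

```python
z = np.array([2, 1, 0, 3, 1, 2])   # goals scored in k matches (values illustrative)
E_z = z.mean()                     # empirical estimate: (1/k) * sum_i z_i
print("estimated E[z] =", E_z)
```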

19/84
... returning to our derivation

20/84
$E[(\hat{f}(x_i) - f(x_i))^2] = E[(\hat{y}_i - y_i)^2] - E[\varepsilon_i^2] + 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]$
We can empirically evaluate the R.H.S. using training observations or test observations
Case 1: Using test observations
$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} (\hat{y}_i - y_i)^2}_{\text{empirical estimate of error}} - \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} \varepsilon_i^2}_{\text{small constant}} + \underbrace{2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\,\text{covariance}(\varepsilon_i,\; \hat{f}(x_i) - f(x_i))}$
The last term equals a covariance because, for the zero-mean noise, $\text{covariance}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[X(Y - \mu_Y)]$ (if $\mu_X = E[X] = 0$) $= E[XY] - \mu_Y E[X] = E[XY]$

21/84
$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} (\hat{y}_i - y_i)^2}_{\text{empirical estimate of error}} - \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} \varepsilon_i^2}_{\text{small constant}} + \underbrace{2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\,\text{covariance}(\varepsilon_i,\; \hat{f}(x_i) - f(x_i))}$
None of the test observations participated in the estimation of $\hat{f}(x)$ [the parameters of $\hat{f}(x)$ were estimated only using training data]
$\Rightarrow \varepsilon \perp (\hat{f}(x_i) - f(x_i))$
$\Rightarrow E[\varepsilon_i(\hat{f}(x_i) - f(x_i))] = E[\varepsilon_i]\, E[\hat{f}(x_i) - f(x_i)] = 0 \cdot E[\hat{f}(x_i) - f(x_i)] = 0$
$\Rightarrow$ true error = empirical test error + small constant
Hence, we should always use a validation set (independent of the training set) to estimate the error

22/84
Case 2: Using training observations

$$\underbrace{\mathbb{E}\big[(\hat f(x_i)-f(x_i))^2\big]}_{\text{true error}} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}(\hat y_i-y_i)^2}_{\text{empirical estimation of error}} - \underbrace{\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2}_{\text{small constant}} + \underbrace{2\,\mathbb{E}\big[\varepsilon_i\,(\hat f(x_i)-f(x_i))\big]}_{=\,\operatorname{covariance}(\varepsilon_i,\;\hat f(x_i)-f(x_i))}$$

Now $\varepsilon \not\perp \hat f(x)$, because $\varepsilon$ was used for estimating the parameters of $\hat f(x)$:

$$\Rightarrow \mathbb{E}\big[\varepsilon_i\,(\hat f(x_i)-f(x_i))\big] \neq \mathbb{E}[\varepsilon_i]\,\mathbb{E}\big[\hat f(x_i)-f(x_i)\big] \neq 0$$

Hence, the empirical train error is smaller than the true error and does not give a true picture of the error.

But how is this related to model complexity? Let us see.
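Before moving on, the contrast between the two cases is easy to check numerically. Below is a minimal numpy sketch (not from the lecture; the sine target, noise level, and polynomial degree are illustrative choices): the empirical train error understates the true error, while the test error tracks it up to the small noise constant.

```python
# Case 1 vs Case 2: train error is optimistic, test error is honest.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)            # the true function f(x)
sigma = 0.3                                    # noise standard deviation

x_train = rng.uniform(0, 1, 50)
y_train = f(x_train) + rng.normal(0, sigma, 50)
x_test = rng.uniform(0, 1, 50)
y_test = f(x_test) + rng.normal(0, sigma, 50)

f_hat = np.poly1d(np.polyfit(x_train, y_train, deg=9))  # a complex model

train_err = np.mean((f_hat(x_train) - y_train) ** 2)    # optimistic (Case 2)
test_err = np.mean((f_hat(x_test) - y_test) ** 2)       # ~ true error + sigma^2 (Case 1)
true_err = np.mean((f_hat(x_test) - f(x_test)) ** 2)

print(f"train {train_err:.3f}  test {test_err:.3f}  true {true_err:.3f}")
```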
Module 8.3 : True error and Model complexity
Using Stein's Lemma (and some trickery) we can show that

$$\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\,(\hat f(x_i)-f(x_i)) \;=\; \frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat f(x_i)}{\partial y_i}$$

When will $\frac{\partial \hat f(x_i)}{\partial y_i}$ be high? When a small change in the observation causes a large change in the estimation ($\hat f$).

Can you link this to model complexity? Yes: a complex model will be more sensitive to changes in observations, whereas a simple model will be less sensitive to changes in observations.

Hence, we can say that

$$\text{true error} = \text{empirical train error} + \text{small constant} + \Omega(\text{model complexity})$$
Let us verify that indeed a complex model is more sensitive to minor changes in the data. We have fitted a simple and a complex model for some given data. We now change one of these data points. The simple model does not change much, as compared to the complex model.
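This sensitivity, $\frac{\partial \hat f(x_i)}{\partial y_i}$, can also be estimated directly by a finite difference: refit the model after nudging a single observation $y_i$ and see how much $\hat f(x_i)$ moves. A minimal numpy sketch (not from the lecture; the data and polynomial degrees are illustrative):

```python
# Estimate d f_hat(x_i) / d y_i by finite differences for a simple
# (degree-1) vs a complex (degree-9) polynomial model.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 20)
i, h = 10, 1e-4                                # perturb observation i by h

for deg in (1, 9):
    f_hat = np.poly1d(np.polyfit(x, y, deg))
    y2 = y.copy()
    y2[i] += h                                 # small change in one observation
    f_hat2 = np.poly1d(np.polyfit(x, y2, deg))
    sens = (f_hat2(x[i]) - f_hat(x[i])) / h    # ~ d f_hat(x_i) / d y_i
    print(f"degree {deg}: sensitivity at x_{i} = {sens:.3f}")
# The complex fit reacts far more strongly to the same small change.
```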
Hence, while training, instead of minimizing the training error $\mathcal{L}_{train}(\theta)$ we should minimize

$$\min_{\theta}\; \mathcal{L}_{train}(\theta) + \Omega(\theta) = \mathcal{L}(\theta)$$

where $\Omega(\theta)$ would be high for complex models and small for simple models. $\Omega(\theta)$ acts as an approximation to $\frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat f(x_i)}{\partial y_i}$.

This is the basis for all regularization methods. We can show that $l_1$ regularization, $l_2$ regularization, early stopping and injecting noise in the input are all instances of this form of regularization. A small sketch of the regularized objective follows.
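Here is the promised sketch of the regularized objective in numpy (the function names are illustrative, not from the lecture), with the $l_2$ penalty as one concrete choice of $\Omega$:

```python
# Regularized training objective: L(theta) = L_train(theta) + Omega(theta).
import numpy as np

def l_train(w, X, y):
    return np.mean((X @ w - y) ** 2)           # empirical training error

def omega(w, lam=0.1):
    return 0.5 * lam * np.sum(w ** 2)          # l2 penalty: one choice of Omega

def regularized_loss(w, X, y):
    return l_train(w, X, y) + omega(w)         # what we actually minimize
```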
[Figure: error vs. model complexity; high bias on the left, high variance on the right, and a sweet spot in between.] $\Omega(\theta)$ should ensure that the model has reasonable complexity (it approximates $\frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat f(x_i)}{\partial y_i}$).
Why do we care about this bias-variance tradeoff and model complexity? Deep neural networks are highly complex models, with many parameters and many non-linearities. It is easy for them to overfit and drive the training error to 0. Hence we need some form of regularization.
Different forms of regularization:
$l_2$ regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
Module 8.4 : $l_2$ regularization
For $l_2$ regularization we have

$$\widetilde{\mathcal{L}}(w) = \mathcal{L}(w) + \frac{\lambda}{2}\lVert w\rVert^2$$

For SGD (or its variants), we are interested in

$$\nabla \widetilde{\mathcal{L}}(w) = \nabla\mathcal{L}(w) + \lambda w$$

Update rule:

$$w_{t+1} = w_t - \eta\,\nabla\mathcal{L}(w_t) - \eta\lambda\, w_t$$

This requires a very small modification to the code (a minimal sketch follows). Let us see the geometric interpretation of this.
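The update rule above translates almost directly into code. A minimal numpy sketch (the linear-regression gradient is a stand-in, not from the lecture):

```python
# One SGD step on the regularized loss L(w) + (lam/2) * ||w||^2:
# w_{t+1} = w_t - eta * grad_L(w_t) - eta * lam * w_t
import numpy as np

def sgd_l2_step(w, grad_l, eta=0.1, lam=0.01):
    return w - eta * (grad_l + lam * w)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = np.zeros(5)
for _ in range(1000):
    grad_l = 2 * X.T @ (X @ w - y) / len(y)    # gradient of the squared loss
    w = sgd_l2_step(w, grad_l)
print(w)                                       # shrunk towards 0 relative to OLS
```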
Assume $w^*$ is the optimal solution for $\mathcal{L}(w)$ [not $\widetilde{\mathcal{L}}(w)$], i.e. the solution in the absence of regularization ($w^*$ optimal $\Rightarrow \nabla\mathcal{L}(w^*) = 0$).

Consider $u = w - w^*$. Using a Taylor series approximation (up to 2nd order):

$$\mathcal{L}(w^* + u) = \mathcal{L}(w^*) + u^T\,\nabla\mathcal{L}(w^*) + \frac{1}{2}u^T H u$$

$$\mathcal{L}(w) = \mathcal{L}(w^*) + (w - w^*)^T\,\nabla\mathcal{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$$

$$= \mathcal{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*) \qquad (\because \nabla\mathcal{L}(w^*) = 0)$$

$$\nabla\mathcal{L}(w) = \nabla\mathcal{L}(w^*) + H(w - w^*) = H(w - w^*)$$

Now,

$$\nabla\widetilde{\mathcal{L}}(w) = \nabla\mathcal{L}(w) + \lambda w = H(w - w^*) + \lambda w$$
Let $\widetilde{w}$ be the optimal solution for $\widetilde{\mathcal{L}}(w)$ [i.e. the regularized loss].

$$\because \nabla\widetilde{\mathcal{L}}(\widetilde{w}) = 0$$

$$H(\widetilde{w} - w^*) + \lambda\widetilde{w} = 0 \;\Rightarrow\; (H + \lambda I)\,\widetilde{w} = H w^* \;\Rightarrow\; \widetilde{w} = (H + \lambda I)^{-1} H w^*$$

Notice that if $\lambda \to 0$ then $\widetilde{w} \to w^*$ [no regularization]. But we are interested in the case when $\lambda \neq 0$; let us analyse that case.
35/84
If H is symmetric Positive Semi Denite
H=QQ
T
[Qis orthogonal,QQ
T
=Q
T
Q=I]
ew= (H+I)
1
Hw

= (QQ
T
+I)
1
QQ
T
w

= (QQ
T
+QIQ
T
)
1
QQ
T
w

= [Q( +I)Q
T
]
1
QQ
T
w

=Q
T
1
( +I)
1
Q
1
QQ
T
w

=Q( +I)
1
Q
T
w

(*Q
T
1
=Q)ew=QDQ
T
w

whereD= ( +I)
1
, is a diagonal matrix which we will see in more detail
soon
$$\widetilde{w} = Q(\Lambda + \lambda I)^{-1}\Lambda\, Q^T w^* = QDQ^T w^*$$

$$(\Lambda + \lambda I)^{-1} = \begin{bmatrix} \frac{1}{\lambda_1+\lambda} & & & \\ & \frac{1}{\lambda_2+\lambda} & & \\ & & \ddots & \\ & & & \frac{1}{\lambda_n+\lambda} \end{bmatrix} \qquad D = (\Lambda + \lambda I)^{-1}\Lambda = \begin{bmatrix} \frac{\lambda_1}{\lambda_1+\lambda} & & & \\ & \frac{\lambda_2}{\lambda_2+\lambda} & & \\ & & \ddots & \\ & & & \frac{\lambda_n}{\lambda_n+\lambda} \end{bmatrix}$$

So what is happening here? $w^*$ first gets rotated by $Q^T$ to give $Q^T w^*$. However, if $\lambda = 0$ then $Q$ rotates $Q^T w^*$ back to give $w^*$. If $\lambda \neq 0$, then let us see what $D$ looks like. So what is happening now?
$$\widetilde{w} = Q(\Lambda + \lambda I)^{-1}\Lambda\, Q^T w^* = QDQ^T w^*$$

Each element $i$ of $Q^T w^*$ gets scaled by $\frac{\lambda_i}{\lambda_i + \lambda}$ before it is rotated back by $Q$:

if $\lambda_i \gg \lambda$ then $\frac{\lambda_i}{\lambda_i + \lambda} \approx 1$; if $\lambda_i \ll \lambda$ then $\frac{\lambda_i}{\lambda_i + \lambda} \approx 0$.

Thus only significant directions (larger eigenvalues) will be retained.

$$\text{Effective parameters} = \sum_{i=1}^{n}\frac{\lambda_i}{\lambda_i + \lambda} < n$$
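As a sanity check on this algebra, a minimal numpy sketch (not from the lecture; the random PSD matrix is an illustrative stand-in for the Hessian) verifies the closed form $\widetilde{w} = QDQ^T w^* = (H+\lambda I)^{-1}Hw^*$ and computes the effective number of parameters:

```python
# Verify w_tilde = Q D Q^T w* against (H + lam I)^{-1} H w*, and count
# the effective parameters sum_i lambda_i / (lambda_i + lam).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A.T @ A                                    # a symmetric PSD "Hessian"
w_star = rng.normal(size=5)
lam = 0.5

eigvals, Q = np.linalg.eigh(H)                 # H = Q diag(eigvals) Q^T
D = np.diag(eigvals / (eigvals + lam))
w_tilde = Q @ D @ Q.T @ w_star
w_direct = np.linalg.solve(H + lam * np.eye(5), H @ w_star)
assert np.allclose(w_tilde, w_direct)

print("effective parameters:", np.sum(eigvals / (eigvals + lam)), "< 5")
```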
The weight vector $w^*$ is getting rotated to $\widetilde{w}$. All of its elements are shrinking, but some are shrinking more than others. This ensures that only important features are given high weights.
Module 8.5 : Dataset augmentation
label = 2 [given training data]

We exploit the fact that certain transformations to the image do not change the label of the image.

label = 2: rotated by 20°, rotated by 65°, shifted vertically, shifted horizontally, blurred, changed some pixels [augmented data = created using some knowledge of the task]
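These transformations are straightforward to script. A minimal sketch using scipy.ndimage (the 28x28 random array stands in for an actual digit image; the perturbation rates are illustrative choices, not from the lecture):

```python
# Label-preserving augmentations: rotate, shift, blur, change some pixels.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
img = rng.random((28, 28))                     # stand-in for an image of a "2"

augmented = [
    ndimage.rotate(img, 20, reshape=False),    # rotated by 20 degrees
    ndimage.rotate(img, 65, reshape=False),    # rotated by 65 degrees
    ndimage.shift(img, (3, 0)),                # shifted vertically
    ndimage.shift(img, (0, 3)),                # shifted horizontally
    ndimage.gaussian_filter(img, sigma=1.0),   # blurred
    np.where(rng.random(img.shape) < 0.02,     # changed some pixels
             rng.random(img.shape), img),
]
# every augmented image keeps the original label ("2")
```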
Typically, more data = better learning. This works well for image classification / object recognition tasks, and has also been shown to work well for speech. For some tasks it may not be clear how to generate such data.
Module 8.6 : Parameter Sharing and tying
Parameter sharing: used in CNNs. The same filter is applied at different positions of the image, or the same weight matrix acts on different input neurons.

Parameter tying: typically used in autoencoders ($x \to h(x) \to \hat x$). The encoder and decoder weights are tied.
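Weight tying is easiest to see in code. A minimal numpy sketch of a tied-weight autoencoder (the shapes and initialization are illustrative, not from the lecture): the decoder reuses $W^T$, so encoder and decoder share a single parameter matrix.

```python
# Tied-weight autoencoder: x -> h(x) -> x_hat, where the decoder uses W^T.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(32, 784))      # single shared weight matrix
b_enc, b_dec = np.zeros(32), np.zeros(784)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(x):
    h = sigmoid(W @ x + b_enc)                 # encoder: h(x)
    return sigmoid(W.T @ h + b_dec)            # decoder: x_hat, weights tied

x_hat = autoencode(rng.random(784))
```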
Module 8.7 : Adding Noise to the inputs
$x \to \tilde x \to h(\tilde x) \to \hat x$, where $P(\tilde x \mid x)$ is the noise process.

We saw this in the autoencoder. We can show that for a simple input-output neural network, adding Gaussian noise to the input is equivalent to weight decay ($L_2$ regularisation). It can also be viewed as data augmentation.
Noisy inputs: $x_1+\varepsilon_1,\; x_2+\varepsilon_2,\; \dots,\; x_k+\varepsilon_k,\; \dots,\; x_n+\varepsilon_n$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$.

$$\tilde x_i = x_i + \varepsilon_i \qquad \hat y = \sum_{i=1}^{n} w_i x_i \qquad \tilde y = \sum_{i=1}^{n} w_i\tilde x_i = \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} w_i\varepsilon_i = \hat y + \sum_{i=1}^{n} w_i\varepsilon_i$$

We are interested in $\mathbb{E}\big[(\tilde y - y)^2\big]$:

$$\mathbb{E}\big[(\tilde y - y)^2\big] = \mathbb{E}\Big[\Big(\hat y + \sum_{i=1}^{n} w_i\varepsilon_i - y\Big)^2\Big] = \mathbb{E}\Big[\Big((\hat y - y) + \sum_{i=1}^{n} w_i\varepsilon_i\Big)^2\Big]$$
$$= \mathbb{E}\big[(\hat y - y)^2\big] + \mathbb{E}\Big[2(\hat y - y)\sum_{i=1}^{n} w_i\varepsilon_i\Big] + \mathbb{E}\Big[\Big(\sum_{i=1}^{n} w_i\varepsilon_i\Big)^2\Big]$$
$$= \mathbb{E}\big[(\hat y - y)^2\big] + 0 + \mathbb{E}\Big[\sum_{i=1}^{n} w_i^2\varepsilon_i^2\Big] \qquad (\because \varepsilon_i \text{ is independent of } \varepsilon_j \text{ and of } (\hat y - y))$$
$$= \mathbb{E}\big[(\hat y - y)^2\big] + \sigma^2\sum_{i=1}^{n} w_i^2 \qquad (\text{same as the } L_2 \text{ norm penalty})$$
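The identity can be confirmed by Monte Carlo. A minimal numpy sketch (not from the lecture; the fixed inputs and target are illustrative):

```python
# Check E[(y_tilde - y)^2] ~= (y_hat - y)^2 + sigma^2 * sum(w_i^2).
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 10, 0.5
w = rng.normal(size=n)
x = rng.normal(size=n)
y = w @ x + 0.3                                # some fixed target
y_hat = w @ x

eps = rng.normal(0, sigma, size=(200_000, n))  # fresh input noise per trial
y_tilde = y_hat + eps @ w                      # prediction on noisy inputs
lhs = np.mean((y_tilde - y) ** 2)
rhs = (y_hat - y) ** 2 + sigma**2 * np.sum(w**2)
print(f"Monte Carlo {lhs:.4f}  vs  closed form {rhs:.4f}")
```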
Module 8.8 : Adding Noise to the outputs
Hard targets: $0\;0\;1\;0\;0\;0\;0\;0\;0\;0$

minimize: $-\sum_{i=0}^{9} p_i \log q_i$

true distribution: $p = \{0, 0, 1, 0, 0, 0, 0, 0, 0, 0\}$; estimated distribution: $q$

Intuition: do not trust the true labels, they may be noisy. Instead, use soft targets.
Soft targets: $\frac{\varepsilon}{9}\;\frac{\varepsilon}{9}\;1-\varepsilon\;\frac{\varepsilon}{9}\;\frac{\varepsilon}{9}\;\frac{\varepsilon}{9}\;\frac{\varepsilon}{9}\;\frac{\varepsilon}{9}\;\frac{\varepsilon}{9}\;\frac{\varepsilon}{9}$, where $\varepsilon$ is a small positive constant.

minimize: $-\sum_{i=0}^{9} p_i \log q_i$

true distribution + noise: $p = \{\frac{\varepsilon}{9}, \frac{\varepsilon}{9}, 1-\varepsilon, \frac{\varepsilon}{9}, \dots\}$; estimated distribution: $q$
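This soft-target construction is what is commonly called label smoothing. A minimal numpy sketch (the function names are illustrative, not from the lecture):

```python
# Build soft targets {eps/(K-1), ..., 1-eps, ...} and plug them into
# the cross-entropy -sum_i p_i log q_i.
import numpy as np

def soft_targets(true_class, num_classes=10, eps=0.1):
    p = np.full(num_classes, eps / (num_classes - 1))
    p[true_class] = 1.0 - eps                  # most mass on the true class
    return p

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p = soft_targets(true_class=2)                 # 1-eps at index 2, eps/9 elsewhere
q = np.full(10, 0.1)                           # some estimated distribution q
print(p, cross_entropy(p, q))
```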
Module 8.9 : Early stopping
[Figure: training and validation error vs. training steps; the validation error stops improving around step $k-p$, training is stopped at step $k$, and the model from step $k-p$ is returned.]

Track the validation error. Have a patience parameter $p$: if you are at step $k$ and there was no improvement in the validation error in the previous $p$ steps, then stop training and return the model stored at step $k-p$.

Basically, stop the training early, before it drives the training error to 0 and blows up the validation error.
. Deep Learning : Lecture 6

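A minimal sketch of this patience rule in Python; train_step, validation_error, model, and max_steps are placeholders standing in for whatever training setup is used.

    import copy

    patience = 5                        # p: steps to wait for an improvement
    best_err, best_model, wait = float("inf"), None, 0

    for step in range(max_steps):
        train_step(model)               # one update step (placeholder)
        err = validation_error(model)   # track the validation error
        if err < best_err:              # improvement: store this model
            best_err, best_model, wait = err, copy.deepcopy(model), 0
        else:
            wait += 1
            if wait >= patience:        # no improvement in the previous p steps
                break                   # stop training early...
    model = best_model                  # ...and return the model stored at step k - p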

57/84

[Figure: the same training/validation error plot as on the previous slide]

Very effective and the most widely used form of regularization.
Can be used even with other regularizers (such as $l_2$).
How does it act as a regularizer? We will first see an intuitive explanation and then a mathematical analysis.


58/84

[Figure: the same training/validation error plot as before]

Recall that the update rule in SGD is
$w_{t+1} = w_t - \eta \nabla w_t = w_0 - \eta \sum_{i=1}^{t} \nabla w_i$

Let $\tau$ be the maximum value of $\nabla w_i$; then
$|w_{t+1} - w_0| \leq t\eta\,|\tau|$

Thus, $t\eta$ controls how far $w_t$ can go from the initial $w_0$.
In other words, it controls the space of exploration.

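A quick numeric check of this bound on a toy quadratic loss (the loss, learning rate, and step count are arbitrary illustrative choices): after $t$ steps, $|w_t - w_0|$ never exceeds $t\eta|\tau|$, where $\tau$ bounds the gradient magnitudes seen.

    import numpy as np

    eta, t, w0 = 0.1, 50, 0.0
    w, grads = w0, []
    for _ in range(t):
        g = 2 * (w - 3.0)                 # gradient of L(w) = (w - 3)^2
        grads.append(abs(g))
        w -= eta * g                      # SGD update
    tau = max(grads)                      # maximum gradient magnitude seen
    print(abs(w - w0) <= t * eta * tau)   # True: t*eta bounds the exploration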

59/84
We will now see a mathematical analysis of this

60/84

Recall that the Taylor series approximation for $L(w)$ is

$L(w) = L(w^*) + (w - w^*)^T \nabla L(w^*) + \frac{1}{2} (w - w^*)^T H (w - w^*)$
$\phantom{L(w)} = L(w^*) + \frac{1}{2} (w - w^*)^T H (w - w^*)$   [$w^*$ is optimal, so $\nabla L(w^*)$ is $0$]

$\nabla(L(w)) = H(w - w^*)$

Now the SGD update rule is:

$w_t = w_{t-1} - \varepsilon \nabla L(w_{t-1}) = w_{t-1} - \varepsilon H (w_{t-1} - w^*) = (I - \varepsilon H) w_{t-1} + \varepsilon H w^*$

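A small NumPy sketch of this recursion (the Hessian, optimum, and step size are illustrative values): iterating $w_t = (I - \varepsilon H) w_{t-1} + \varepsilon H w^*$ from $w_0 = 0$ drives the iterates toward $w^*$.

    import numpy as np

    H = np.diag([3.0, 1.0])             # a positive-definite Hessian
    w_star = np.array([2.0, -1.0])      # the optimum w*
    eps = 0.1                           # learning rate
    w = np.zeros(2)                     # start at w0 = 0
    for t in range(200):
        w = (np.eye(2) - eps * H) @ w + eps * H @ w_star
    print(np.allclose(w, w_star))       # True: the iterates converge to w*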

61/84

$w_t = (I - \varepsilon H) w_{t-1} + \varepsilon H w^*$

Using the EVD of $H$ as $H = Q \Lambda Q^T$, we get:
$w_t = (I - \varepsilon Q \Lambda Q^T) w_{t-1} + \varepsilon Q \Lambda Q^T w^*$

If we start with $w_0 = 0$ then we can show that (see Appendix)
$w_t = Q[I - (I - \varepsilon\Lambda)^t] Q^T w^*$

Compare this with the expression we had for the optimum $\tilde{w}$ with $L_2$ regularization:
$\tilde{w} = Q[I - (\Lambda + \lambda I)^{-1}\lambda] Q^T w^*$

We observe that $w_t = \tilde{w}$ if we choose $\varepsilon$, $t$ and $\lambda$ such that
$(I - \varepsilon\Lambda)^t = (\Lambda + \lambda I)^{-1}\lambda$

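As a sanity check (an added step, not on the slides), the matching condition can be read one eigenvalue $\lambda_i$ of $H$ at a time; to first order, assuming $\varepsilon\lambda_i \ll 1$ and $\lambda_i \ll \lambda$,

$(1 - \varepsilon\lambda_i)^t = \frac{\lambda}{\lambda_i + \lambda} \;\Rightarrow\; 1 - t\varepsilon\lambda_i \approx 1 - \frac{\lambda_i}{\lambda} \;\Rightarrow\; \lambda \approx \frac{1}{\varepsilon t}$

That is, the number of steps times the learning rate plays the role of the inverse of the $L_2$ coefficient: training longer corresponds to weaker regularization.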

62/84

Things to remember:
Early stopping only allows $t$ updates to the parameters.
If a parameter $w$ corresponds to a dimension which is important for the loss $L(\theta)$, then $\frac{\partial L(\theta)}{\partial w}$ will be large.
However, if a parameter is not important ($\frac{\partial L(\theta)}{\partial w}$ is small), then its updates will be small and the parameter will not be able to grow large in $t$ steps.
Early stopping will thus effectively shrink the parameters corresponding to less important directions (same as weight decay).


63/84
Module 8.10 : Ensemble methods

64/84 Other forms of regularization

$l_2$ regularization
Dataset augmentation
Parameter sharing and tying
Adding noise to the inputs
Adding noise to the outputs
Early stopping
Ensemble methods
Dropout


65/84

[Figure: inputs $x_1, \dots, x_4$ feed a Logistic Regression model ($y_{lr}$), an SVM ($y_{svm}$) and a Naive Bayes model ($y_{nb}$); their outputs are combined into $y_{final}$]

Combine the output of different models to reduce generalization error.
The models can correspond to different classifiers.
It could be different instances of the same classifier trained with:
  different hyperparameters
  different features
  different samples of the training data


66/84

[Figure: three Logistic Regression models ($y_{lr1}$, $y_{lr2}$, $y_{lr3}$) whose outputs are combined into $y_{final}$]

Each model is trained with a different sample of the data (sampling with replacement).
Bagging: form an ensemble using different instances of the same classifier.
From a given dataset, construct multiple training sets by sampling with replacement ($T_1, T_2, \dots, T_k$).
Train the $i$-th instance of the classifier using training set $T_i$.

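A minimal NumPy sketch of this recipe; the base classifier is left abstract (fit and predict are placeholders), and the toy dataset and $k$ are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 4))            # toy dataset
    y = rng.integers(0, 2, 100)
    k = 3                                        # number of ensemble members

    models = []
    for i in range(k):
        idx = rng.integers(0, len(X), len(X))    # sample with replacement -> T_i
        models.append(fit(X[idx], y[idx]))       # train the i-th instance on T_i

    # at test time, combine the k outputs, e.g. by majority vote:
    # y_final = majority([predict(m, x) for m in models])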

67/84

When would bagging work?
Consider a set of $k$ LR models.
Suppose that each model makes an error $\varepsilon_i$ on a test example.
Let $\varepsilon_i$ be drawn from a zero-mean multivariate normal distribution, with
$Variance = E[\varepsilon_i^2] = V$
$Covariance = E[\varepsilon_i \varepsilon_j] = C$

The error made by the average prediction of all the models is $\frac{1}{k} \sum_i \varepsilon_i$.

The expected squared error is:
$mse = E\left[ \left( \frac{1}{k} \sum_i \varepsilon_i \right)^2 \right]$
$= \frac{1}{k^2} E\left[ \sum_i \sum_{i=j} \varepsilon_i \varepsilon_j + \sum_i \sum_{i \neq j} \varepsilon_i \varepsilon_j \right]$
$= \frac{1}{k^2} E\left[ \sum_i \varepsilon_i^2 + \sum_i \sum_{i \neq j} \varepsilon_i \varepsilon_j \right]$
$= \frac{1}{k^2} \left( \sum_i E[\varepsilon_i^2] + \sum_i \sum_{i \neq j} E[\varepsilon_i \varepsilon_j] \right)$
$= \frac{1}{k^2} \left( kV + k(k-1)C \right)$
$= \frac{1}{k} V + \frac{k-1}{k} C$


68/84

$mse = \frac{1}{k} V + \frac{k-1}{k} C$

When would bagging work?
If the errors of the models are perfectly correlated, then $V = C$ and $mse = V$ [bagging does not help: the mse of the ensemble is as bad as that of the individual models].
If the errors of the models are independent or uncorrelated, then $C = 0$ and the mse of the ensemble reduces to $\frac{1}{k} V$.
On average, the ensemble will perform at least as well as its individual members.

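A quick Monte Carlo sketch checking this formula ($V$, $C$, and $k$ are arbitrary illustrative values): correlated errors are drawn from a zero-mean multivariate normal, and the empirical mse of the averaged error is compared with $\frac{1}{k}V + \frac{k-1}{k}C$.

    import numpy as np

    k, V, C = 10, 1.0, 0.3
    cov = np.full((k, k), C) + (V - C) * np.eye(k)   # E[e_i^2]=V, E[e_i e_j]=C
    rng = np.random.default_rng(0)
    errs = rng.multivariate_normal(np.zeros(k), cov, size=100_000)
    mse_empirical = np.mean(errs.mean(axis=1) ** 2)
    mse_theory = V / k + (k - 1) / k * C
    print(mse_empirical, mse_theory)                 # the two agree closely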

69/84
Module 8.11 : Dropout

70/84 Other forms of regularization

$l_2$ regularization
Dataset augmentation
Parameter sharing and tying
Adding noise to the inputs
Adding noise to the outputs
Early stopping
Ensemble methods
Dropout


71/84

Model averaging (bagging ensembles) typically helps.
Training several large neural networks to build such an ensemble, however, is prohibitively expensive.
Option 1: Train several neural networks having different architectures (obviously expensive).
Option 2: Train multiple instances of the same network using different training samples (again expensive).
Even if we manage to train with Option 1 or Option 2, combining several models at test time is infeasible in real-time applications.


72/84

Dropout is a technique which addresses both these issues.
Effectively, it allows training several neural networks without any significant computational overhead.
It also gives an efficient approximate way of combining exponentially many different neural networks.


73/84

Dropout refers to dropping out units.
Temporarily remove a node and all its incoming/outgoing connections, resulting in a thinned network.
Each node is retained with a fixed probability (typically $p = 0.5$ for hidden nodes and $p = 0.8$ for visible nodes).


74/84

Suppose a neural network has $n$ nodes.
Using the dropout idea, each node can be retained or dropped.
For example, in the above case we drop 5 nodes to get a thinned network.
Given a total of $n$ nodes, what is the total number of thinned networks that can be formed? $2^n$
Of course, this is prohibitively large, and we cannot possibly train so many networks.
Trick: (1) Share the weights across all the networks. (2) Sample a different network for each training instance.
Let us see how.


75/84

We initialize all the parameters (weights) of the network and start training.
For the first training instance (or mini-batch), we apply dropout, resulting in a thinned network.
We compute the loss and backpropagate.
Which parameters will we update? Only those which are active.


76/84

For the second training instance (or mini-batch), we again apply dropout, resulting in a different thinned network.
We again compute the loss and backpropagate to the active weights.
If a weight was active for both training instances, it would have received two updates by now.
If a weight was active for only one of the training instances, it would have received only one update by now.
Each thinned network gets trained rarely (or even never), but the parameter sharing ensures that no model has untrained or poorly trained parameters.

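A sketch of this weight-sharing trick on a single linear layer (the sizes, learning rate, and squared-error loss are illustrative): one shared weight matrix, a fresh dropout mask per mini-batch, and gradients that touch only the currently active units.

    import numpy as np

    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((8, 4))   # weights shared by all thinned networks
    p, lr = 0.5, 0.1                        # retention probability, learning rate

    for step in range(100):                 # one mini-batch per step
        x, y = rng.standard_normal(8), rng.standard_normal(4)
        mask = rng.random(8) < p            # sample a thinned network
        h = x * mask                        # dropped units contribute nothing
        err = h @ W - y                     # error of a squared-error loss
        grad = np.outer(h, err)             # rows for dropped units are zero...
        W -= lr * grad                      # ...so only active weights get updated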

77/84

[Figure: at training time a unit is present with probability $p$ and has outgoing weights $w_1, w_2, w_3, w_4$; at test time the unit is always present and the weights are scaled to $pw_1, pw_2, pw_3, pw_4$]

What happens at test time?
It is impossible to aggregate the outputs of $2^n$ thinned networks.
Instead, we use the full neural network and scale the output of each node by the fraction of times it was on during training.

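A minimal NumPy sketch of this train/test asymmetry ($p$ and the layer size are illustrative; note that this mirrors the formulation above, whereas modern libraries usually implement "inverted" dropout, which rescales at training time instead).

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.5                                  # retention probability

    def forward_train(h):
        mask = rng.random(h.shape) < p       # each node retained with probability p
        return h * mask                      # dropped nodes output 0

    def forward_test(h):
        return p * h                         # full network, outputs scaled by p

    h = rng.standard_normal(8)               # activations of a hidden layer
    print(forward_train(h))                  # one random thinned network
    print(forward_test(h))                   # matches E[forward_train(h)]

Scaling by $p$ at test time matches the expected training-time output of each node, which is what makes the full network a cheap approximation to averaging the thinned ones.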

78/84

Dropout essentially applies a masking noise to the hidden units.
It prevents hidden units from co-adapting.
Essentially, a hidden unit cannot rely too much on other units, as they may get dropped out at any time.
Each hidden unit has to learn to be more robust to these random dropouts.


79/84

[Figure: a network with one hidden unit $h_i$ highlighted]

Here is an example of how dropout helps in ensuring redundancy and robustness.
Suppose $h_i$ learns to detect a face by firing on detecting a nose.
Dropping $h_i$ then corresponds to erasing the information that a nose exists.
The model should then learn another $h_i$ which redundantly encodes the presence of a nose.
Or the model should learn to detect the face using other features.


80/84
Recap
$l_2$ regularization
Dataset augmentation
Parameter sharing and tying
Adding noise to the inputs
Adding noise to the outputs
Early stopping
Ensemble methods
Dropout

81/84
Appendix

82/84

To prove: the two equations below are equivalent.
$w_t = (I - \varepsilon Q \Lambda Q^T) w_{t-1} + \varepsilon Q \Lambda Q^T w^*$
$w_t = Q[I - (I - \varepsilon\Lambda)^t] Q^T w^*$

Proof by induction.
Base case: $t = 1$ and $w_0 = 0$.

$w_1$ according to the first equation:
$w_1 = (I - \varepsilon Q \Lambda Q^T) w_0 + \varepsilon Q \Lambda Q^T w^* = \varepsilon Q \Lambda Q^T w^*$

$w_1$ according to the second equation:
$w_1 = Q(I - (I - \varepsilon\Lambda)^1) Q^T w^* = Q(\varepsilon\Lambda) Q^T w^* = \varepsilon Q \Lambda Q^T w^*$


83/84

Induction step: let the two equations be equivalent for the $t$-th step:
$w_t = (I - \varepsilon Q \Lambda Q^T) w_{t-1} + \varepsilon Q \Lambda Q^T w^* = Q[I - (I - \varepsilon\Lambda)^t] Q^T w^*$

Proof that this will hold for the $(t+1)$-th step:
$w_{t+1} = (I - \varepsilon Q \Lambda Q^T) w_t + \varepsilon Q \Lambda Q^T w^*$
(using $w_t = Q[I - (I - \varepsilon\Lambda)^t] Q^T w^*$)
$= (I - \varepsilon Q \Lambda Q^T)\, Q(I - (I - \varepsilon\Lambda)^t) Q^T w^* + \varepsilon Q \Lambda Q^T w^*$
(opening the bracket)
$= Q(I - (I - \varepsilon\Lambda)^t) Q^T w^* - \varepsilon Q \Lambda Q^T Q (I - (I - \varepsilon\Lambda)^t) Q^T w^* + \varepsilon Q \Lambda Q^T w^*$


84/84 Continuing

$w_{t+1} = Q(I - (I - \varepsilon\Lambda)^t) Q^T w^* - \varepsilon Q \Lambda Q^T Q (I - (I - \varepsilon\Lambda)^t) Q^T w^* + \varepsilon Q \Lambda Q^T w^*$
$= Q(I - (I - \varepsilon\Lambda)^t) Q^T w^* - \varepsilon Q \Lambda (I - (I - \varepsilon\Lambda)^t) Q^T w^* + \varepsilon Q \Lambda Q^T w^*$   ($\because Q^T Q = I$)
$= Q\big[ (I - (I - \varepsilon\Lambda)^t) - \varepsilon\Lambda (I - (I - \varepsilon\Lambda)^t) + \varepsilon\Lambda \big] Q^T w^*$
$= Q\big[ I - (I - \varepsilon\Lambda)^t + \varepsilon\Lambda (I - \varepsilon\Lambda)^t \big] Q^T w^*$
$= Q\big[ I - (I - \varepsilon\Lambda)^t (I - \varepsilon\Lambda) \big] Q^T w^*$
$= Q\big[ I - (I - \varepsilon\Lambda)^{t+1} \big] Q^T w^*$

Hence, proved!
