Regularization in deep learning


About This Presentation

Presented to the Vietnam Japan AI Community on 2019-05-26.
The presentation summarizes what I've learned about regularization in deep learning.
Disclaimer: this presentation was given at a community event, so it was not thoroughly reviewed or revised.


Slide Content

Regularization in Deep Learning. Kien Le. Vietnam Japan AI Community, 2019-05-26.

Model Fitting Introduction

Model (Function) Fitting: How well a model performs on the training and evaluation datasets defines its characteristics. Underfit: poor on training, very poor on evaluation. Overfit: very good on training, poor on evaluation. Good fit: good on training, good on evaluation.

Model Fitting – Visualization Variations of model fitting [1]

Bias-Variance. Prediction errors [2] decompose into (Bias)² + Variance.

Bias-Variance. Bias: represents the extent to which the average prediction over all data sets differs from the desired regression function. Variance: represents the extent to which the model is sensitive to the particular choice of data set. (A small estimation sketch follows.)
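A minimal sketch, not from the slides, of how these two quantities can be estimated: the same model class is refit to many independently sampled training sets, then bias² and variance are measured against the target function. The sine target, noise level, and polynomial degrees are illustrative assumptions.

```python
# Hedged sketch: estimate bias^2 and variance by refitting the same model class
# to many independently sampled training sets.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)             # the "desired regression function"

def fit_one_dataset(degree, n_points=20):
    x = rng.uniform(0.0, 1.0, n_points)
    y = true_f(x) + rng.normal(0.0, 0.3, n_points)
    return np.polyfit(x, y, degree)           # coefficients of one fitted model

x_test = np.linspace(0.0, 1.0, 100)
for degree in (1, 9):                         # low-capacity vs high-capacity model
    preds = np.array([np.polyval(fit_one_dataset(degree), x_test) for _ in range(200)])
    avg_pred = preds.mean(axis=0)
    bias2 = np.mean((avg_pred - true_f(x_test)) ** 2)   # average prediction vs target
    variance = np.mean(preds.var(axis=0))                # sensitivity to the dataset
    print(f"degree={degree}: bias^2={bias2:.3f}, variance={variance:.3f}")
```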

Quiz: Model Fitting and the Bias-Variance Relationship. For underfit, overfit, and good fit: is the bias high or low? Is the variance high or low?

Quiz - Answer (illustrated by fitting a function to a dataset): underfit has high bias and low variance; overfit has low bias and high variance; a good fit has low bias and low variance.

Regularization Introduction

Counter Underfit: What causes underfit? The model capacity is too small to fit the training dataset, let alone generalize to new data: high bias, low variance. Solution: increase the capacity of the model, e.g. increase the number of layers or the number of neurons in each layer (see the sketch below). Result: lower bias, moving from underfit toward a good fit.
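A minimal Keras sketch of "increase layers and neurons", assuming a flattened 784-dimensional input and 10 classes (as in MNIST); the layer sizes are illustrative, not from the slides.

```python
# Hedged sketch: counter underfitting by adding layers/neurons (more capacity).
from tensorflow import keras
from tensorflow.keras import layers

small_model = keras.Sequential([              # likely to underfit
    layers.Dense(8, activation="relu", input_shape=(784,)),
    layers.Dense(10, activation="softmax"),
])

bigger_model = keras.Sequential([             # more layers and neurons -> lower bias
    layers.Dense(256, activation="relu", input_shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
```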

Counter Underfit It’s so simple, just turn it into an overfit model! 

Counter Overfit: What causes overfit? The model capacity is so big that it adapts too well to the training samples and is unable to generalize to new, unseen samples: low bias, high variance. Solution: regularization. But how?

Regularization Definition Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. [4]

Regularization Techniques Early Stopping, L1/L2, Batch Norm, Dropout

Regularization Techniques: Early Stopping, L1/L2, Batch Norm, Dropout, Data Augmentation, Layer Norm, Weight Norm.

Early Stopping: There is a point during training a large neural net when the model stops generalizing and only learns the statistical noise in the training dataset. Solution: stop training whenever the generalization error starts to increase.

Early Stopping

Early Stopping. Pros: very simple; highly recommended for all training runs, alongside other techniques; the Keras implementation (a callback applied during training, https://keras.io/callbacks/) has an option to save/restore the best weights. Cons: may not work well on its own. A usage sketch follows.
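A hedged sketch of the Keras callback mentioned on the slide; the monitored metric, patience value, and the commented-out model/data names are illustrative assumptions.

```python
# Hedged sketch: early stopping via the Keras callback API.
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the generalization error
    patience=5,                  # tolerate a few bad epochs before stopping
    restore_best_weights=True,   # roll back to the best weights seen so far
)

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=100,
#           callbacks=[early_stop])
```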

L1/L2 Regularization: L2 adds the "squared magnitude" of the coefficients as a penalty term to the loss function. L1 adds the "absolute value of the magnitude" of the coefficients as a penalty term. Weight penalties -> smaller weights -> simpler model -> less overfit. (A Keras sketch follows.)
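A hedged sketch of attaching these penalties to layers with Keras weight regularizers; the penalty strengths and layer sizes are illustrative assumptions.

```python
# Hedged sketch: L1/L2 weight penalties added to the loss via kernel_regularizer.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,),
                 kernel_regularizer=regularizers.l2(1e-4)),   # adds lambda * sum(w^2) to the loss
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l1(1e-5)),   # adds lambda * sum(|w|) to the loss
    layers.Dense(10, activation="softmax"),
])
```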

L1/L2 Regularization: Regularization works on the assumption that smaller weights generate a simpler model and thus help avoid overfitting. [5] Why?

L1/L2 Comparison Robustness Sparsity

Robustness (Against Outliers): L1 > L2. The loss from outliers increases quadratically in L2 but only linearly in L1, so L2 spends more effort fitting the outliers and is therefore less robust. (A small numeric illustration follows.)
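A tiny numeric illustration, not from the slides, of how a single outlier's contribution to the loss grows under L1 versus L2; the residual values are arbitrary.

```python
# Hedged illustration: one outlier's contribution under L1 vs L2.
residuals = [1.0, 2.0, 10.0]                 # the last one is an outlier
l1_terms = [abs(r) for r in residuals]       # grows linearly: 1, 2, 10
l2_terms = [r ** 2 for r in residuals]       # grows quadratically: 1, 4, 100
print(l1_terms, l2_terms)
```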

Sparsity: L1 > L2. L1 zeros out coefficients, which leads to a sparse model, so L1 can be used for feature (coefficient) selection: unimportant features end up with zero coefficients. L2 instead produces small values for almost all coefficients. E.g., when applying L1/L2 to a layer with 4 weights, the results might look like L1: 0.8, 0, 1, 0 versus L2: 0.3, 0.1, 0.3, 0.2.

Sparsity ([3]): With an L1 penalty the gradient of |w| is constant (+1 or -1), so a weight is pushed by a fixed amount each step and reaches exactly zero (w1: 5 -> 0 in 10 steps). With an L2 penalty the gradient shrinks as the weight shrinks, so the weight only approaches zero (w2: 5 -> 0 in a very large number of steps). A sketch follows.
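A hedged sketch of this argument: gradient descent on an L1 penalty |w| versus an L2 penalty of the form 0.5*w² (the 0.5 factor, learning rate, and starting weight are illustrative assumptions).

```python
# Hedged sketch: why L1 drives a weight to exactly zero while L2 only shrinks it.
lr = 0.5
w1 = 5.0   # weight penalized with L1
w2 = 5.0   # weight penalized with L2 (penalty 0.5 * w^2)
for step in range(10):
    w1 -= lr * (1.0 if w1 > 0 else -1.0)   # d|w|/dw is constant: subtract a fixed amount
    w2 -= lr * w2                          # d(0.5*w^2)/dw = w: the step shrinks with w
print(w1, w2)  # w1 == 0.0 after 10 steps; w2 is small but still nonzero
```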

L1/L2 Regularization Fun Fact: What does “L” in L1/L2 stand for?

Batch Norm Original Paper Title: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [6] Internal Covariate Shift: The change in the distribution of network activations due to the change in network parameters during training.

Internal Covariate Shift (More): The distribution of each layer's inputs changes during training as the parameters of the previous layers change, so the layers need to continuously adapt to the new distribution. Problems: slower training; hard to use a big learning rate.

Batch Norm Algorithm: Batch Norm fixes the means and variances of layer inputs (computed over the batch axis) to reduce internal covariate shift. A sketch of the computation follows.
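A hedged NumPy sketch of the training-time computation described in [6]; the epsilon value and array names are illustrative assumptions.

```python
# Hedged sketch: batch normalization over the batch axis at training time.
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)                      # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta              # learned scale (gamma) and shift (beta)

x = np.random.randn(32, 4) * 3.0 + 7.0       # a mini-batch with shifted statistics
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```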

Batch Norm Regularization Effect: Each hidden unit is effectively multiplied by a random value (the batch statistics) at each training step, which adds noise to the training process and forces layers to become robust to a lot of variation in their inputs; a form of data augmentation.

Batch Norm Recap. Pros: networks train faster; allows higher learning rates; makes weights easier to initialize; makes more activation functions viable; regularizes by forcing layers to be more robust to noise (may replace Dropout). Cons: not good for online learning; not good for RNN/LSTM; different calculation between train and test. Related techniques: Layer Norm, Weight Norm.

Dropout. How it works: randomly selected neurons are ignored during each training step; dropped neurons have no effect on the next layers and are not updated in the backward pass. (A sketch follows.) Questions: what is the idea? Why does dropout help reduce overfitting?
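A hedged sketch, not the slides' code, of applying a dropout mask to a layer's activations during one training step; the drop rate is an illustrative choice.

```python
# Hedged sketch: dropout mask applied to hidden activations at training time.
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, drop_rate=0.5):
    """Randomly zero out units and scale the survivors ("inverted dropout"),
    so the expected activation is unchanged and inference needs no rescaling."""
    keep_prob = 1.0 - drop_rate
    mask = rng.random(activations.shape) < keep_prob   # which neurons survive this step
    return activations * mask / keep_prob              # dropped units contribute nothing

h = np.ones((2, 6))                  # pretend hidden-layer activations
print(dropout_train(h))
```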

Ensemble Models - Bagging. How it works: train multiple models on different subsets of the data, then combine them into a final model. Characteristics: each sub-model is trained separately; each sub-model is normally overfit; the combination of those overfit models produces a less overfit model overall.

Ensemble Models Averaging multiple models to create a final model with low variance

Dropout - Ensemble Models for DNNs: Can we apply bagging to neural networks? It is computationally prohibitive. Dropout aims to solve this problem by providing a way to combine many models at a practical computational cost.

Dropout: Removing units from the base model effectively creates a subnetwork. All those subnetworks are trained implicitly, with all parameters shared (different from bagging). At prediction time, all learned units are activated, which approximately averages all the trained subnetworks. (See the train/test sketch below.)
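A hedged sketch of the train/test difference described above, following the classic formulation in [8] where survivors are not rescaled during training: at test time all units are active and the activations are scaled by the keep probability so the expected output matches training. The keep probability and activation values are illustrative.

```python
# Hedged sketch: one random subnetwork at training time vs the scaled full network at test time.
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5
h = np.ones(8)                        # hidden activations of the "base" network

# Training: each step uses one randomly chosen subnetwork.
mask = rng.random(h.shape) < keep_prob
h_train = h * mask

# Testing: all units on, scaled by keep_prob, approximating an average over subnetworks.
h_test = h * keep_prob
print(h_train, h_test)
```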

Dropout – Regularization Effect: Each hidden unit is multiplied by a random value at each training step, which adds noise to the training process; similar to Batch Norm.

Regularization Summary. Two types of regularization: model optimization (reduce the model complexity) and data augmentation (increase the effective size of the training data). Exercise: categorize the techniques we have learned into model optimization vs. data augmentation.

Demo Batch Norm, Dropout

Notes: MNIST dataset. To create an overfit scenario, reduce the dataset size (60K -> 1K) and create a complex (but not so good) model. Techniques to try: Early Stopping, Dropout, Batch Norm. (A hedged sketch of such a setup follows.) Link: https://drive.google.com/drive/u/0/folders/14A6n8bdrJHmgUcaopv66g8p0y0ot6hSr
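A hedged Keras sketch of the demo setup described above (not the code from the linked folder): shrink MNIST to 1K samples, build an over-sized model, and apply the three techniques from the talk. Layer sizes, dropout rate, epochs, and patience are illustrative assumptions.

```python
# Hedged sketch: deliberately overfit-prone MNIST setup regularized with
# Batch Norm, Dropout, and Early Stopping.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, y_train = x_train[:1000] / 255.0, y_train[:1000]   # 60K -> 1K to force overfitting
x_test = x_test / 255.0

model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),      # Batch Norm
    layers.Dropout(0.5),              # Dropout
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=50,
    callbacks=[keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)],  # Early Stopping
)
```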

Key Takeaways Keywords: Overfit, Underfit, Bias, Variance Regularization Techniques: Dropout, Batch-Norm, Early Stopping

References
[1] https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76
[2] C. M. Bishop, Pattern Recognition and Machine Learning.
[3] https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
[4] Goodfellow et al., Deep Learning.
[5] https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2
[6] Sergey Ioffe et al., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
[7] https://towardsdatascience.com/batch-normalization-8a2e585775c9
[8] Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
[9] https://machinelearningmastery.com/train-neural-networks-with-noise-to-reduce-overfitting/
[10] Opitz et al., Popular Ensemble Methods: An Empirical Study.