Regularization in deep learning


About This Presentation

Presented to the Vietnam Japan AI Community on 2019-05-26.
The presentation summarizes what I've learned about regularization in deep learning.
Disclaimer: this presentation was given at a community event, so it was not thoroughly reviewed or revised.


Slide Content

Regularization in Deep Learning. Kien Le. Vietnam Japan AI Community, 2019-05-26.

Model Fitting Introduction

Model (Function) Fitting: How well a model performs on the training and evaluation datasets defines its characteristics. Underfit: poor on training, very poor on evaluation. Overfit: very good on training, poor on evaluation. Good fit: good on training, good on evaluation.

Model Fitting – Visualization Variations of model fitting [1]

Bias-Variance. Prediction errors [2] decompose into (Bias)² + Variance.

Bias-Variance. Bias: represents the extent to which the average prediction over all data sets differs from the desired regression function. Variance: represents the extent to which the model is sensitive to the particular choice of data set. (A small estimation sketch follows.)
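A minimal sketch, not from the slides, of how these two quantities can be estimated: the same model class is refit to many independently sampled training sets, then bias² and variance are measured against the target function. The sine target, noise level, and polynomial degrees are illustrative assumptions.

```python
# Hedged sketch: estimate bias^2 and variance by refitting the same model class
# to many independently sampled training sets.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)             # the "desired regression function"

def fit_one_dataset(degree, n_points=20):
    x = rng.uniform(0.0, 1.0, n_points)
    y = true_f(x) + rng.normal(0.0, 0.3, n_points)
    return np.polyfit(x, y, degree)           # coefficients of one fitted model

x_test = np.linspace(0.0, 1.0, 100)
for degree in (1, 9):                         # low-capacity vs high-capacity model
    preds = np.array([np.polyval(fit_one_dataset(degree), x_test) for _ in range(200)])
    avg_pred = preds.mean(axis=0)
    bias2 = np.mean((avg_pred - true_f(x_test)) ** 2)   # average prediction vs target
    variance = np.mean(preds.var(axis=0))                # sensitivity to the dataset
    print(f"degree={degree}: bias^2={bias2:.3f}, variance={variance:.3f}")
```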

Quiz: Model Fitting and the Bias-Variance Relationship. For underfit, overfit, and good fit: is the bias high or low? Is the variance high or low?

Quiz - Answer (illustrated by fitting a function to a dataset): underfit has high bias and low variance; overfit has low bias and high variance; a good fit has low bias and low variance.

Regularization Introduction

Counter Underfit: What causes underfit? The model capacity is too small to fit the training dataset, let alone generalize to new data: high bias, low variance. Solution: increase the capacity of the model, e.g. increase the number of layers or the number of neurons in each layer (see the sketch below). Result: lower bias, moving from underfit toward a good fit.
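A minimal Keras sketch of "increase layers and neurons", assuming a flattened 784-dimensional input and 10 classes (as in MNIST); the layer sizes are illustrative, not from the slides.

```python
# Hedged sketch: counter underfitting by adding layers/neurons (more capacity).
from tensorflow import keras
from tensorflow.keras import layers

small_model = keras.Sequential([              # likely to underfit
    layers.Dense(8, activation="relu", input_shape=(784,)),
    layers.Dense(10, activation="softmax"),
])

bigger_model = keras.Sequential([             # more layers and neurons -> lower bias
    layers.Dense(256, activation="relu", input_shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
```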

Counter Underfit It’s so simple, just turn it into an overfit model! 

Counter Overfit: What causes overfit? The model capacity is so big that it adapts too well to the training samples and is unable to generalize to new, unseen samples: low bias, high variance. Solution: regularization. But how?

Regularization Definition Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. [4]

Regularization Techniques Early Stopping, L1/L2, Batch Norm, Dropout

Regularization Techniques: Early Stopping, L1/L2, Batch Norm, Dropout, Data Augmentation, Layer Norm, Weight Norm.

Early Stopping: There is a point during training a large neural net when the model stops generalizing and only learns the statistical noise in the training dataset. Solution: stop training whenever the generalization error starts to increase.

Early Stopping

Early Stopping. Pros: very simple; highly recommended for all training runs, alongside other techniques; the Keras implementation (a callback applied during training, https://keras.io/callbacks/) has an option to save/restore the best weights. Cons: may not work well on its own. A usage sketch follows.
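A hedged sketch of the Keras callback mentioned on the slide; the monitored metric, patience value, and the commented-out model/data names are illustrative assumptions.

```python
# Hedged sketch: early stopping via the Keras callback API.
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the generalization error
    patience=5,                  # tolerate a few bad epochs before stopping
    restore_best_weights=True,   # roll back to the best weights seen so far
)

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=100,
#           callbacks=[early_stop])
```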

L1/L2 Regularization: L2 adds the "squared magnitude" of the coefficients as a penalty term to the loss function. L1 adds the "absolute value of the magnitude" of the coefficients as a penalty term. Weight penalties -> smaller weights -> simpler model -> less overfit. (A Keras sketch follows.)
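A hedged sketch of attaching these penalties to layers with Keras weight regularizers; the penalty strengths and layer sizes are illustrative assumptions.

```python
# Hedged sketch: L1/L2 weight penalties added to the loss via kernel_regularizer.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,),
                 kernel_regularizer=regularizers.l2(1e-4)),   # adds lambda * sum(w^2) to the loss
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l1(1e-5)),   # adds lambda * sum(|w|) to the loss
    layers.Dense(10, activation="softmax"),
])
```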

L1/L2 Regularization: Regularization works on the assumption that smaller weights generate a simpler model and thus help avoid overfitting. [5] Why?

L1/L2 Comparison Robustness Sparsity

Robustness (Against Outliers): L1 > L2. The loss from outliers increases quadratically in L2 but only linearly in L1, so L2 spends more effort fitting the outliers and is therefore less robust. (A small numeric illustration follows.)
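A tiny numeric illustration, not from the slides, of how a single outlier's contribution to the loss grows under L1 versus L2; the residual values are arbitrary.

```python
# Hedged illustration: one outlier's contribution under L1 vs L2.
residuals = [1.0, 2.0, 10.0]                 # the last one is an outlier
l1_terms = [abs(r) for r in residuals]       # grows linearly: 1, 2, 10
l2_terms = [r ** 2 for r in residuals]       # grows quadratically: 1, 4, 100
print(l1_terms, l2_terms)
```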

Sparsity: L1 > L2. L1 zeros out coefficients, which leads to a sparse model, so L1 can be used for feature (coefficient) selection: unimportant features end up with zero coefficients. L2 instead produces small values for almost all coefficients. E.g., when applying L1/L2 to a layer with 4 weights, the results might look like L1: 0.8, 0, 1, 0 versus L2: 0.3, 0.1, 0.3, 0.2.

Sparsity ([3]): With an L1 penalty the gradient of |w| is constant (+1 or -1), so a weight is pushed by a fixed amount each step and reaches exactly zero (w1: 5 -> 0 in 10 steps). With an L2 penalty the gradient shrinks as the weight shrinks, so the weight only approaches zero (w2: 5 -> 0 in a very large number of steps). A sketch follows.
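A hedged sketch of this argument: gradient descent on an L1 penalty |w| versus an L2 penalty of the form 0.5*w² (the 0.5 factor, learning rate, and starting weight are illustrative assumptions).

```python
# Hedged sketch: why L1 drives a weight to exactly zero while L2 only shrinks it.
lr = 0.5
w1 = 5.0   # weight penalized with L1
w2 = 5.0   # weight penalized with L2 (penalty 0.5 * w^2)
for step in range(10):
    w1 -= lr * (1.0 if w1 > 0 else -1.0)   # d|w|/dw is constant: subtract a fixed amount
    w2 -= lr * w2                          # d(0.5*w^2)/dw = w: the step shrinks with w
print(w1, w2)  # w1 == 0.0 after 10 steps; w2 is small but still nonzero
```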

L1/L2 Regularization Fun Fact: What does “L” in L1/L2 stand for?

Batch Norm Original Paper Title: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [6] Internal Covariate Shift: The change in the distribution of network activations due to the change in network parameters during training.

Internal Covariate Shift (More): The distribution of each layer's inputs changes during training as the parameters of the previous layers change, so the layers need to continuously adapt to the new distribution. Problems: slower training; hard to use a big learning rate.

Batch Norm Algorithm: Batch Norm fixes the means and variances of layer inputs (computed over the batch axis) to reduce internal covariate shift. A sketch of the computation follows.
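A hedged NumPy sketch of the training-time computation described in [6]; the epsilon value and array names are illustrative assumptions.

```python
# Hedged sketch: batch normalization over the batch axis at training time.
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)                      # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta              # learned scale (gamma) and shift (beta)

x = np.random.randn(32, 4) * 3.0 + 7.0       # a mini-batch with shifted statistics
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```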

Batch Norm Regularization Effect: Each hidden unit is effectively multiplied by a random value (the batch statistics) at each training step, which adds noise to the training process and forces layers to become robust to a lot of variation in their inputs; a form of data augmentation.

Batch Norm Recap. Pros: networks train faster; allows higher learning rates; makes weights easier to initialize; makes more activation functions viable; regularizes by forcing layers to be more robust to noise (may replace Dropout). Cons: not good for online learning; not good for RNN/LSTM; different calculation between train and test. Related techniques: Layer Norm, Weight Norm.

Dropout. How it works: randomly selected neurons are ignored during each training step; dropped neurons have no effect on the next layers and are not updated in the backward pass. (A sketch follows.) Questions: what is the idea? Why does dropout help reduce overfitting?
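A hedged sketch, not the slides' code, of applying a dropout mask to a layer's activations during one training step; the drop rate is an illustrative choice.

```python
# Hedged sketch: dropout mask applied to hidden activations at training time.
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, drop_rate=0.5):
    """Randomly zero out units and scale the survivors ("inverted dropout"),
    so the expected activation is unchanged and inference needs no rescaling."""
    keep_prob = 1.0 - drop_rate
    mask = rng.random(activations.shape) < keep_prob   # which neurons survive this step
    return activations * mask / keep_prob              # dropped units contribute nothing

h = np.ones((2, 6))                  # pretend hidden-layer activations
print(dropout_train(h))
```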

Ensemble Models - Bagging. How it works: train multiple models on different subsets of the data, then combine them into a final model. Characteristics: each sub-model is trained separately; each sub-model is normally overfit; the combination of those overfit models produces a less overfit model overall.

Ensemble Models Averaging multiple models to create a final model with low variance

Dropout - Ensemble Models for DNNs: Can we apply bagging to neural networks? It is computationally prohibitive. Dropout aims to solve this problem by providing a way to combine many models at a practical computational cost.

Dropout: Removing units from the base model effectively creates a subnetwork. All those subnetworks are trained implicitly, with all parameters shared (different from bagging). At prediction time, all learned units are activated, which approximately averages all the trained subnetworks. (See the train/test sketch below.)
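A hedged sketch of the train/test difference described above, following the classic formulation in [8] where survivors are not rescaled during training: at test time all units are active and the activations are scaled by the keep probability so the expected output matches training. The keep probability and activation values are illustrative.

```python
# Hedged sketch: one random subnetwork at training time vs the scaled full network at test time.
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5
h = np.ones(8)                        # hidden activations of the "base" network

# Training: each step uses one randomly chosen subnetwork.
mask = rng.random(h.shape) < keep_prob
h_train = h * mask

# Testing: all units on, scaled by keep_prob, approximating an average over subnetworks.
h_test = h * keep_prob
print(h_train, h_test)
```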

Dropout – Regularization Effect: Each hidden unit is multiplied by a random value at each training step, which adds noise to the training process; similar to Batch Norm.

Regularization Summary. Two types of regularization: model optimization (reduce the model complexity) and data augmentation (increase the effective size of the training data). Exercise: categorize the techniques we have learned into model optimization vs. data augmentation.

Demo Batch Norm, Dropout

Notes: MNIST dataset. To create an overfit scenario, reduce the dataset size (60K -> 1K) and create a complex (but not so good) model. Techniques to try: Early Stopping, Dropout, Batch Norm. (A hedged sketch of such a setup follows.) Link: https://drive.google.com/drive/u/0/folders/14A6n8bdrJHmgUcaopv66g8p0y0ot6hSr
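A hedged Keras sketch of the demo setup described above (not the code from the linked folder): shrink MNIST to 1K samples, build an over-sized model, and apply the three techniques from the talk. Layer sizes, dropout rate, epochs, and patience are illustrative assumptions.

```python
# Hedged sketch: deliberately overfit-prone MNIST setup regularized with
# Batch Norm, Dropout, and Early Stopping.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, y_train = x_train[:1000] / 255.0, y_train[:1000]   # 60K -> 1K to force overfitting
x_test = x_test / 255.0

model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),      # Batch Norm
    layers.Dropout(0.5),              # Dropout
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=50,
    callbacks=[keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)],  # Early Stopping
)
```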

Key Takeaways Keywords: Overfit, Underfit, Bias, Variance Regularization Techniques: Dropout, Batch-Norm, Early Stopping

References
[1] https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76
[2] C. M. Bishop, Pattern Recognition and Machine Learning.
[3] https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
[4] Goodfellow et al., Deep Learning.
[5] https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2
[6] Sergey Ioffe et al., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
[7] https://towardsdatascience.com/batch-normalization-8a2e585775c9
[8] Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
[9] https://machinelearningmastery.com/train-neural-networks-with-noise-to-reduce-overfitting/
[10] Opitz et al., Popular Ensemble Methods: An Empirical Study.