Explanation of Autoencoder to Variational Autoencoder


About This Presentation

Autoencoder to Variational Autoencoder notes


Slide Content

From Autoencoder to Variational Autoencoder
Hao Dong
Peking University
1

•Vanilla Autoencoder
•Denoising Autoencoder
•Sparse Autoencoder
•Contractive Autoencoder
•Stacked Autoencoder
•Variational Autoencoder (VAE)
2
From Autoencoder to Variational Autoencoder
Feature Representation
Distribution Representation
Video: https://www.youtube.com/watch?v=xH1mBw3tb_c&list=PLe5rNUydzV9QHe8VDStpU0o8Yp63OecdW&index=4&pbjreload=10

•Vanilla Autoencoder
•Denoising Autoencoder
•Sparse Autoencoder
•Contractive Autoencoder
•Stacked Autoencoder
•Variational Autoencoder (VAE)
3

4
Vanilla Autoencoder
•What is it?
Reconstruct high-dimensional data using a neural network model with a narrow bottleneck layer.
The bottleneck layer captures the compressed latent code, so a nice by-product is dimension reduction.
The low-dimensional representation can be used as the representation of the data in various applications, e.g., image retrieval, data compression …

Latent code: the compressed low-dimensional representation of the input data
5
Vanilla Autoencoder
•How it works?
[Figure: Input → Encoder (X → Z) → latent code → Decoder/Generator (Z → X) → Reconstructed Input]
Ideally the input and the reconstruction are identical.
The encoder network is for dimension reduction, just like PCA.

6
Vanilla Autoencoder
•Training
[Figure: input layer x₁…x₆ → hidden layer z₁…z₄ → output layer x̂₁…x̂₆; Encoder on the left half, Decoder on the right half]
Given N data samples:
ℒ = (1/N) ∑_{i=1}^{N} ‖x⁽ⁱ⁾ − x̂⁽ⁱ⁾‖²
•The hidden units are usually fewer than the inputs
•Dimension reduction --- representation learning
The distance between two data points can be measured by the Mean Squared Error (MSE):
ℒ = (1/M) ∑_{j=1}^{M} (x_j − x̂_j)², where M is the number of variables
•It is trying to learn an approximation to the identity function, so that the input is "compressed" into the latent features, discovering interesting structure in the data.
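A minimal sketch of this setup in PyTorch; the layer sizes, optimizer settings, and the random stand-in batch are illustrative assumptions rather than choices taken from the slides:

```python
import torch
from torch import nn

# Encoder compresses the 784 inputs to a narrow bottleneck; the decoder reconstructs them.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784), nn.Sigmoid())

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
mse = nn.MSELoss()                     # (1/M) * sum_j (x_j - x_hat_j)^2, averaged over the batch

x = torch.rand(128, 784)               # stand-in for a batch of flattened MNIST images
for step in range(100):
    z = encoder(x)                     # X -> Z (dimension reduction)
    x_hat = decoder(z)                 # Z -> X (reconstruction)
    loss = mse(x_hat, x)               # ideally the input and the reconstruction are identical
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```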

7
Vanilla Autoencoder
•Testing/Inferencing
[Figure: input layer x₁…x₆ → hidden layer z₁…z₄ (the extracted features); only the Encoder is used]
•An autoencoder is an unsupervised learning method if we consider the latent code as the "output".
•An autoencoder is also a self-supervised (self-taught) learning method, a type of supervised learning where the training labels are determined by the input data.
•Word2Vec (from the RNN lecture) is another unsupervised, self-taught learning example.
Autoencoder for the MNIST dataset (28×28×1, 784 pixels)

8
Vanilla Autoencoder
•Example:
•Compress MNIST (28x28x1) to the latent code with only 2 variables
The reconstruction is lossy.

9
Vanilla Autoencoder
•Power of Latent Representation
•t-SNE visualization on MNIST: PCA vs. Autoencoder
PCA vs. Autoencoder (winner)
2006 Science paper by Hinton and Salakhutdinov

10
Vanilla Autoencoder
•Discussion
The hidden layer is overcomplete if it is larger than the input layer

11
Vanilla Autoencoder
•Discussion
•The hidden layer is overcomplete if it is larger than the input layer
•No compression
•No guarantee that the hidden units extract meaningful features

•Vanilla Autoencoder
•Denoising Autoencoder
•Sparse Autoencoder
•Contractive Autoencoder
•Stacked Autoencoder
•Variational Autoencoder (VAE)
12

13
Denoising Autoencoder (DAE)
•Why?
•Avoid overfitting
•Learn robust representations

14
Denoising Autoencoder
•Architecture
[Figure: corrupted input layer x₁…x₆ → hidden layer z₁…z₄ → output layer x̂₁…x̂₆; Encoder → Decoder]
Applying dropout between the input and the first hidden layer
•Improves the robustness
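A sketch of the corruption step, assuming an encoder/decoder pair like the one in the earlier sketch; the dropout rate is an illustrative assumption:

```python
import torch
from torch import nn

input_dropout = nn.Dropout(p=0.3)      # corruption applied to the input layer (rate is assumed)

def denoising_loss(encoder, decoder, x, mse=nn.MSELoss()):
    x_noisy = input_dropout(x)         # dropout between the input and the first hidden layer
    x_hat = decoder(encoder(x_noisy))
    return mse(x_hat, x)               # reconstruct the *clean* input from the corrupted one
```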

15
Denoising Autoencoder
•Feature Visualization
Visualizing the learned features
[Figure: input layer x₁…x₆ → hidden layer z₁…z₄; each hidden neuron's incoming weights are reshaped into an image patch for visualization]
One neuron == one feature extractor

16
Denoising Autoencoder
•Denoising Autoencoder & Dropout
The denoising autoencoder was proposed in 2008, 4 years before the dropout paper (Hinton et al., 2012).
A denoising autoencoder can be seen as applying dropout between the input and the first layer.
A denoising autoencoder can be seen as one type of data augmentation on the input.

•Vanilla Autoencoder
•Denoising Autoencoder
•Sparse Autoencoder
•Contractive Autoencoder
•Stacked Autoencoder
•Variational Autoencoder (VAE)
17

18
Sparse Autoencoder
•Why?
•Even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure by imposing other constraints on the network.
•In particular, if we impose a "sparsity" constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.
[Figure: input layer → hidden layer with sigmoid activations (Encoder); example unit outputs: 0.02 "inactive", 0.97 "active", 0.01 "inactive", 0.98 "active"]

19
Sparse Autoencoder
•Recap: KL Divergence
Smaller==Closer

20
Sparse Autoencoder
•Sparsity Regularization
[Figure: input layer → hidden layer with sigmoid activations (Encoder); example unit outputs: 0.02 "inactive", 0.97 "active", 0.01 "inactive", 0.98 "active"]
Given N data samples (the batch size) and the Sigmoid activation function, the active ratio of a neuron j is:
ρ̂_j = (1/N) ∑_{i=1}^{N} a_j⁽ⁱ⁾
To make the output "sparse", we would like to enforce the following constraint, where ρ is a "sparsity parameter", such as 0.2 (20% of the neurons):
ρ̂_j = ρ
The penalty term is as follows, where s is the number of activation outputs:
ℒ_s = ∑_{j=1}^{s} KL(ρ ‖ ρ̂_j) = ∑_{j=1}^{s} ( ρ log(ρ/ρ̂_j) + (1−ρ) log((1−ρ)/(1−ρ̂_j)) )
The total loss:
ℒ_total = ℒ_MSE + λ ℒ_s
The number of hidden units can be greater than the number of input variables.
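A sketch of this penalty, assuming sigmoid hidden activations; ρ (`rho`) and the weight λ are illustrative values:

```python
import torch

def sparsity_penalty(hidden, rho=0.2, eps=1e-8):
    # hidden: (batch, s) sigmoid activations; rho_hat_j is the mean activation of unit j
    rho_hat = hidden.mean(dim=0)
    kl = rho * torch.log(rho / (rho_hat + eps)) \
        + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat + eps))
    return kl.sum()                    # L_s = sum_j KL(rho || rho_hat_j)

# Total loss, with an assumed weight lambda = 1e-3:
# loss = mse(x_hat, x) + 1e-3 * sparsity_penalty(hidden)
```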

21
Sparse Autoencoder
•Sparsity Regularization: smaller ρ == more sparse
[Figure: autoencoders on the MNIST dataset — input images, autoencoder reconstructions, and sparse autoencoder reconstructions]

22
Sparse Autoencoder
•Different regularization losses ℒ_s on the hidden activation output

Method   | Hidden Activation | Reconstruction Activation | Loss Function
Method 1 | Sigmoid           | Sigmoid                   | ℒ_total = ℒ_MSE + ℒ_s
Method 2 | ReLU              | Softplus                  | ℒ_total = ℒ_MSE + …

23
Sparse Autoencoder
•Sparse Autoencoder vs. Denoising Autoencoder
Feature Extractors of the Sparse Autoencoder vs. Feature Extractors of the Denoising Autoencoder

24
Sparse Autoencoder
•Autoencoder vs. Denoising Autoencoder vs. Sparse Autoencoder
[Figure: autoencoders on the MNIST dataset — input images and reconstructions from the autoencoder, the sparse autoencoder, and the denoising autoencoder]

•Vanilla Autoencoder
•Denoising Autoencoder
•Sparse Autoencoder
•Contractive Autoencoder
•Stacked Autoencoder
•Variational Autoencoder (VAE)
25

26
Contractive Autoencoder
•Why?
•The Denoising Autoencoder and the Sparse Autoencoder overcome the overcomplete problem via the input layer and the hidden layer, respectively.
•Could we add an explicit term in the loss to avoid uninteresting features?
We want features that ONLY reflect variations observed in the training set.
https://www.youtube.com/watch?v=79sYlJ8Cvlc

27
Contractive Autoencoder
•How
•Penalize the representation for being too sensitive to the input
•Improve the robustness to small perturbations
•Measure the sensitivity by the Frobenius norm of the Jacobian matrix of the encoder activations

Worked example (input x, output z):
z = f(x):  z₁ = x₁ + x₂,  z₂ = 2x₁
J_f = [ ∂z₁/∂x₁  ∂z₁/∂x₂ ; ∂z₂/∂x₁  ∂z₂/∂x₂ ] = [ 1  1 ; 2  0 ]
x = f⁻¹(z):  x₁ = z₂/2,  x₂ = z₁ − z₂/2
J_{f⁻¹} = [ ∂x₁/∂z₁  ∂x₁/∂z₂ ; ∂x₂/∂z₁  ∂x₂/∂z₂ ] = [ 0  1/2 ; 1  −1/2 ]
J_f · J_{f⁻¹} = I
28
Contractive Autoencoder
•Recap: Jacobian Matrix

29
Contractive Autoencoder
•Jacobian Matrix

30
Contractive Autoencoder
•New Loss
reconstruction term + new regularization term (the Frobenius norm of the encoder's Jacobian)
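A sketch of this regularizer for a single sigmoid encoder layer, using the closed form of the Jacobian of h = sigmoid(Wx + b); the weight λ and the variable names are assumptions of this sketch:

```python
import torch

def contractive_penalty(h, W):
    # h: (batch, hidden) sigmoid activations; W: (hidden, input) encoder weight matrix.
    # For h = sigmoid(W x + b), dh_j/dx_i = h_j (1 - h_j) W_ji, so
    # ||J_f(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2.
    dh_sq = (h * (1 - h)) ** 2         # (batch, hidden)
    w_sq = (W ** 2).sum(dim=1)         # (hidden,)
    return (dh_sq * w_sq).sum(dim=1).mean()

# Total loss: loss = reconstruction_loss + 1e-3 * contractive_penalty(h, encoder_weight)
```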

31
Contractive Autoencoder
•vs. Denoising Autoencoder
•Advantages
•CAE can better model the distribution of raw data
•Disadvantages
•DAE is easier to implement
•CAE needs second-order optimization (conjugate gradient, LBFGS)

•Vanilla Autoencoder
•Denoising Autoencoder
•Sparse Autoencoder
•Contractive Autoencoder
•Stacked Autoencoder
•Variational Autoencoder (VAE)
32

33
Stacked Autoencoder
•Start from an Autoencoder: Learn Features From the Input
[Figure: input x₁…x₆ → hidden 1 → output x̂₁…x̂₆; Encoder → Decoder; unsupervised training]
The feature extractor for the input data.
Red indicates the trainable weights; black indicates the fixed/non-trainable weights.

34
Stacked Autoencoder
•2nd Stage: Learn 2nd-Level Features From the 1st-Level Features
[Figure: input → hidden 1 (frozen) → hidden 2 → output; Encoder → Encoder → Decoder; unsupervised training]
The feature extractor for the first feature extractor.
Red indicates the trainable weights; black indicates the fixed/non-trainable weights.

35
Stacked Autoencoder
•3rd Stage: Learn 3rd-Level Features From the 2nd-Level Features
[Figure: input → hidden 1 → hidden 2 (both frozen) → hidden 3 → output; Encoder → Encoder → Encoder → Decoder; unsupervised training]
The feature extractor for the second feature extractor.
Red indicates the trainable weights; black indicates the fixed/non-trainable weights.

36
Stacked Autoencoder
•4th Stage: Learn 4th-Level Features From the 3rd-Level Features
[Figure: input → hidden 1 → hidden 2 → hidden 3 (all frozen) → hidden 4 → output; Encoder ×4 → Decoder; unsupervised training]
The feature extractor for the third feature extractor.
Red indicates the trainable weights; black indicates the fixed/non-trainable weights.

37
Stacked Autoencoder
•Use the Learned Feature Extractors for Downstream Tasks
[Figure: input → hidden 1 → hidden 2 → hidden 3 → hidden 4 (all frozen) → classification output; supervised training]
Learn to classify the input data by using the labels and the high-level features.
Red indicates the trainable weights; black indicates the fixed/non-trainable weights.

38
Stacked Autoencoder
•Fine-tuning
[Figure: input → hidden 1 → hidden 2 → hidden 3 → hidden 4 → classification output; all weights trainable; supervised training]
Fine-tune the entire model for classification.
Red indicates the trainable weights; black indicates the fixed/non-trainable weights.
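A sketch of this greedy layer-wise procedure in PyTorch; the layer widths, activations, optimizer, and the random stand-in data are illustrative assumptions:

```python
import torch
from torch import nn

sizes = [784, 256, 128, 64, 32]                        # widths are assumptions, not from the slides
encoders = [nn.Sequential(nn.Linear(i, o), nn.Sigmoid()) for i, o in zip(sizes, sizes[1:])]
mse = nn.MSELoss()

def pretrain_stage(level, data, steps=100, lr=1e-3):
    """Train encoder `level` (plus a throwaway decoder) on features from the frozen lower stages."""
    enc = encoders[level]
    dec = nn.Sequential(nn.Linear(sizes[level + 1], sizes[level]), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    with torch.no_grad():                              # lower stages are fixed ("black lines")
        feats = data
        for frozen in encoders[:level]:
            feats = frozen(feats)
    for _ in range(steps):                             # only this stage is trainable ("red lines")
        loss = mse(dec(enc(feats)), feats)
        opt.zero_grad()
        loss.backward()
        opt.step()

data = torch.rand(256, 784)                            # stand-in for flattened MNIST images
for level in range(len(encoders)):
    pretrain_stage(level, data)

# Downstream task: stack the pretrained encoders, add a classifier head, then fine-tune end to end.
classifier = nn.Sequential(*encoders, nn.Linear(sizes[-1], 10))
```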

39
Stacked Autoencoder
•Discussion
•Advantages
•…
•Disadvantages
•…

•Vanilla Autoencoder
•Denoising Autoencoder
•Sparse Autoencoder
•Contractive Autoencoder
•Stacked Autoencoder
•Variational Autoencoder (VAE)
•From Neural Network Perspective
•From Probability Model Perspective
40

41
Before we start
•Question?
•Are the previous autoencoders generative models?
•Recap: We want to learn a probability distribution P(x) over x
oGeneration (sampling): x_new ~ P(x)
(NO: the compressed latent codes of autoencoders do not follow a prior distribution, so an autoencoder cannot learn to represent the data distribution)
oDensity estimation: P(x) is high if x looks like real data
NO
oUnsupervised representation learning:
Discovering the underlying structure of the data distribution (e.g., ears, nose, eyes …)
(YES: autoencoders learn the feature representation)

•Vanilla Autoencoder
•Denoising Autoencoder
•Sparse Autoencoder
•Contractive Autoencoder
•Stacked Autoencoder
•Variational Autoencoder (VAE)
•From Neural Network Perspective
•From Probability Model Perspective
42

43
Variational Autoencoder
•How to perform generation (sampling)?
[Figure: an autoencoder (input layer → hidden layer → output layer; Encoder + Decoder) next to a decoder-only model whose latent input is drawn from N(0, 1)]
Can the hidden output follow a prior distribution, e.g., the Normal distribution?
The decoder (generator) maps N(0, 1) to the data space:
P(x) = ∑_z P(x|z) P(z)
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
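A sketch of that sampling step, assuming a trained decoder like the ones sketched earlier; the latent dimension and sample count are illustrative assumptions:

```python
import torch

# Draw latent codes from the prior N(0, 1) and map them to data space with the decoder:
# sample z ~ P(z), then decode, which corresponds to P(x) = sum_z P(x|z) P(z).
z = torch.randn(16, 16)                # 16 samples of a 16-dimensional latent code
with torch.no_grad():
    x_generated = decoder(z)
```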

44
Variational Autoencoder
•Quick Overview
[Figure: bidirectional mapping between the latent space, where z ~ N(0, 1), and the data space]
ℒ_total = ℒ_MSE + ℒ_KL
P(x|z): generation (decoder)
Q(z|x): inference (encoder)
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013

45
Variational Autoencoder
•The neural net perspective
•A variational autoencoder consists of an encoder, a decoder, and a loss function
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013

46
Variational Autoencoder
•Encoder, Decoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013

47
Variational Autoencoder
•Loss function
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
[Annotations: regularization term; the reconstruction term can be represented by the MSE]

48
Variational Autoencoder
•Why KL(Q||P) and not KL(P||Q)?
•Which direction of the KL divergence to use?
•Some applications require an approximation that usually places high probability anywhere that the true distribution places high probability: the left one
•VAE requires an approximation that rarely places high probability anywhere that the true distribution places low probability: the right one
If:

49
Variational Autoencoder
•Reparameterization Trick
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
[Figure: input x₁…x₆ → encoder hidden units h₁…h₆ → predicted means μ₁…μ₄ and predicted standard deviations σ₁…σ₄ → resampled latent variables z₁…z₄, with z_i ~ N(μ_i, σ_i) → decoder → reconstruction x̂₁…x̂₆]
1. Encode the input
2. Predict the means
3. Predict the standard deviations
4. Use the predicted means and standard deviations to sample the new latent variables individually
5. Reconstruct the input
The latent variables are independent.

50
Variational Autoencoder
•Reparameterization Trick
•z ~ N(μ, σ) is not differentiable
•To make sampling z differentiable:
•z = μ + σ · ϵ, where ϵ ~ N(0, 1)
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
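A sketch of the trick, assuming the encoder predicts the mean and the log-variance (a common parameterization, assumed here rather than stated on the slides):

```python
import torch

def reparameterize(mu, log_var):
    # z ~ N(mu, sigma) is not differentiable w.r.t. mu and sigma, but
    # z = mu + sigma * eps with eps ~ N(0, 1) is: the randomness is moved into eps.
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(sigma)
    return mu + sigma * eps
```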

51
Variational Autoencoder
•Reparameterization Trick
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013

52
Variational Autoencoder
•Loss function
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013

53
Variational Autoencoder
•Where is ‘variational’?
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013

•Vanilla Autoencoder
•Denoising Autoencoder
•Sparse Autoencoder
•Contractive Autoencoder
•Stacked Autoencoder
•Variational Autoencoder (VAE)
•From Neural Network Perspective
•From Probability Model Perspective
54

55
Variational Autoencoder
•Problem Definition
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
Goal: Given X = {x₁, x₂, x₃, …, x_n}, find P(X) to represent X
How: It is difficult to directly model P(X), so alternatively we can write
P(X) = ∫_z P(X|z) P(z) dz
where P(z) = N(0, 1) is a prior/known distribution,
i.e., sample X via z.

56
Variational Autoencoder
•The probability model perspective
•P(X) is hard to model
•Alternatively, we learn the joint distribution of X and Z
Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2013
P(X) = ∫_z P(X|z) P(z) dz
P(X) = ∫_z P(X, z) dz
P(X, z) = P(z) P(X|z)

57
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Assumption

58
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Assumption

59
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Monte Carlo?
•n might need to be extremely large before we have an accurate estimate of P(X)
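A toy sketch of this naive Monte Carlo estimate, assuming a Gaussian observation model P(x|z) = N(x; decoder(z), σ²I) around the decoder output; the observation model, the decoder, and all sizes are assumptions made only to illustrate why n must be so large:

```python
import torch

def log_p_x_monte_carlo(x, decoder, n=10_000, latent_dim=16, sigma=1.0):
    """Estimate log P(x) = log E_{z ~ N(0,1)}[P(x|z)] by averaging over n prior samples."""
    z = torch.randn(n, latent_dim)
    with torch.no_grad():
        x_hat = decoder(z)                                                # (n, D)
    log_p_x_given_z = -0.5 * ((x_hat - x) ** 2).sum(dim=1) / sigma**2     # up to a constant
    # Log-mean-exp over the samples; most z contribute almost nothing to the average,
    # which is why n must be extremely large before the estimate becomes accurate.
    return torch.logsumexp(log_p_x_given_z, dim=0) - torch.log(torch.tensor(float(n)))
```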

60
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Monte Carlo?
•Pixel difference is different from perceptual difference

61
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Monte Carlo?
•VAE alters the sampling procedure

62
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Recap: Variational Inference
•VI turns inference into optimization
[Figure labels: ideal, approximation]
P(z|x) = P(x, z) / P(x) ∝ P(x, z)

63
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Variational Inference
•VI turns inference into optimization
parameter distribution

64
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Setting up the objective
•Maximize P(X)
•Set Q(z) to be an arbitrary distribution
P(z|X) = P(X|z) P(z) / P(X)   (Bayes' rule)
Goal: maximize log P(X)

65
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Setting up the objective
[Figure: the objective annotated with the encoder, the ideal posterior, the reconstruction/decoder term, and the KLD term; the ideal posterior is difficult to compute, so "Goal: maximize this (log P(X))" becomes "Goal: optimize this"]
ℒ_total = ℒ_MSE + ℒ_KL
P(x|z): generation
Q(z|x): inference

66
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Setting up the objective: ELBO
[Figure labels: ideal, encoder; the bracketed term is −ELBO]
P(z|X) = P(X, z) / P(X)

67
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Setting up the objective : ELBO

68
Variational Autoencoder
•Recap: The KL Divergence Loss
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
KL( N(μ, σ²) ‖ N(0, 1) )
= ∫ N(μ, σ²) log [ N(μ, σ²) / N(0, 1) ] dx
= ∫ (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} · log{ [ (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} ] / [ (1/√(2π)) e^{−x²/2} ] } dx
= ∫ (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} · log[ (1/√(σ²)) e^{( x² − (x−μ)²/σ² )/2} ] dx
= (1/2) ∫ (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} [ −log σ² + x² − (x−μ)²/σ² ] dx
= (1/2) ( −log σ² + μ² + σ² − 1 )

69
Variational Autoencoder
•Recap: The KL Divergence Loss
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
KL( N(μ, σ²) ‖ N(0, 1) )
= (1/2) ∫ (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} [ −log σ² + x² − (x−μ)²/σ² ] dx
= (1/2) ( −log σ² + μ² + σ² − 1 )
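The closed form above as a small helper, assuming the encoder outputs the log-variance (an assumption of this sketch):

```python
import torch

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * ( -log sigma^2 + mu^2 + sigma^2 - 1 ), per dimension
    return 0.5 * torch.sum(-log_var + mu**2 + torch.exp(log_var) - 1, dim=-1)
```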

70
Variational Autoencoder
•Recap: The KL Divergence Loss
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013

71
Variational Autoencoder
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
•Optimizing the objective
[Figure: the objective annotated with the encoder, the ideal posterior, the reconstruction term, and the KLD term; expectations are taken over the dataset]
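Putting the pieces together, a sketch of one way to optimize this objective end to end in PyTorch; the layer sizes, the MSE reconstruction term, and the random stand-in batch are illustrative assumptions:

```python
import torch
from torch import nn

latent_dim = 16
enc = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
to_mu, to_log_var = nn.Linear(256, latent_dim), nn.Linear(256, latent_dim)
dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())

params = (list(enc.parameters()) + list(to_mu.parameters())
          + list(to_log_var.parameters()) + list(dec.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(128, 784)                               # stand-in for a batch from the dataset
for step in range(100):
    h = enc(x)
    mu, log_var = to_mu(h), to_log_var(h)              # Q(z|x), the encoder
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization trick
    x_hat = dec(z)                                     # P(x|z), the decoder
    recon = ((x_hat - x) ** 2).sum(dim=1).mean()       # reconstruction term (MSE)
    kld = 0.5 * torch.sum(-log_var + mu**2 + torch.exp(log_var) - 1, dim=1).mean()
    loss = recon + kld                                 # minimizing this maximizes the ELBO
    opt.zero_grad()
    loss.backward()
    opt.step()
```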

72
Variational Autoencoder
•VAE is a Generative Model
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013
Q(z|X) is not N(0, 1)
Can we input N(0, 1) to the decoder for sampling?
YES: the goal of the KL term is to make Q(z|X) close to N(0, 1)

73
Variational Autoencoder
•VAE vs. Autoencoder
•VAE : distribution representation, p(z|x) is a distribution
•AE: feature representation, h = E(x) is deterministic
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013

74
Variational Autoencoder
•Challenges
•Low quality images
•…
Auto-Encoding Variational Bayes. DiederikP. Kingma, Max Welling. ICLR 2013

75
Summary: Take Home Message
•Autoencoders learn data representations in an unsupervised/self-supervised way.
•Autoencoders learn data representations but cannot model the data distribution P(X).
•Unlike the vanilla autoencoder, in the sparse autoencoder the number of hidden units can be greater than the number of input variables.
•VAE
•…
•…
•…
•…
•…
•…

Thanks
76