Reducing the dimensionality of data with neural networks

ssuser77b8c6 · 20 slides · Dec 18, 2016

About This Presentation

This is a presentation introducing the paper "Reducing the Dimensionality of Data with Neural Networks".


Slide Content

Reducing the Dimensionality of Data with Neural Networks @St_Hakky. Geoffrey E. Hinton and R. R. Salakhutdinov (2006-07-28). "Reducing the Dimensionality of Data with Neural Networks". Science 313 (5786).

Dimensionality Reduction: Dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data.

Principal Components Analysis: PCA (Principal Components Analysis) is a simple and widely used method. It finds the directions of greatest variance in the data set and represents each data point by its coordinates along each of these directions.
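
To make the PCA baseline concrete, here is a minimal NumPy sketch (not from the slides; the function name pca and the choice of n_components are illustrative):

import numpy as np

def pca(X, n_components):
    """Project data onto the directions of greatest variance.

    X: array of shape (n_samples, n_features).
    Returns the low-dimensional codes and the principal directions.
    """
    # Center the data so directions are measured from the mean.
    X_centered = X - X.mean(axis=0)

    # SVD of the centered data gives the principal directions in Vt.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]        # (n_components, n_features)

    # Each point is represented by its coordinates along these directions.
    codes = X_centered @ components.T     # (n_samples, n_components)
    return codes, components

# Example: reduce 784-dimensional vectors to 6 dimensions and reconstruct.
X = np.random.rand(1000, 784)
codes, components = pca(X, n_components=6)
reconstruction = codes @ components + X.mean(axis=0)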

“Encoder” and “Decoder” Network: The paper describes a nonlinear generalization of PCA (the autoencoder). It uses an adaptive, multilayer “encoder” network to transform the high-dimensional data into a low-dimensional code, and a similar “decoder” network to recover the data from the code.

AutoEncoder: [Diagram: Input → Encoder → Code → Decoder → Output]

AutoEncoder: [Diagram: input layer → hidden layer → output layer; the hidden layer performs the dimensionality reduction, mapping the input data to the reconstructed data]

How to train the AutoEncoder: ・Start with random weights in the two networks. ・The networks are trained by minimizing the discrepancy between the original data and its reconstruction. ・Gradients are obtained by the chain rule, back-propagating the error from the decoder network to the encoder network.
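
A minimal NumPy sketch of this training procedure, assuming a single-hidden-layer autoencoder with logistic units and plain gradient descent (the layer sizes, learning rate, and epoch count are illustrative, not the paper's):

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1,000 samples of 784-dimensional inputs in [0, 1].
X = rng.random((1000, 784))

# Encoder and decoder weights, starting from small random values.
n_in, n_code = 784, 30
W_enc = rng.normal(0, 0.01, (n_in, n_code))
b_enc = np.zeros(n_code)
W_dec = rng.normal(0, 0.01, (n_code, n_in))
b_dec = np.zeros(n_in)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.1

for epoch in range(20):
    # Encoder: high-dimensional data -> low-dimensional code.
    code = sigmoid(X @ W_enc + b_enc)
    # Decoder: code -> reconstruction of the data.
    recon = sigmoid(code @ W_dec + b_dec)

    # Discrepancy between the original data and its reconstruction.
    err = recon - X
    loss = (err ** 2).mean()
    print(f"epoch {epoch}: reconstruction error {loss:.5f}")

    # Back-propagate the error from the decoder to the encoder (chain rule).
    d_recon = 2 * err / X.size * recon * (1 - recon)
    dW_dec = code.T @ d_recon
    db_dec = d_recon.sum(axis=0)

    d_code = d_recon @ W_dec.T * code * (1 - code)
    dW_enc = X.T @ d_code
    db_enc = d_code.sum(axis=0)

    # Gradient descent step on all weights.
    W_dec -= lr * dW_dec; b_dec -= lr * db_dec
    W_enc -= lr * dW_enc; b_enc -= lr * db_enc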

It is difficult to optimize multilayer autoencoders: It is hard to optimize the weights in nonlinear autoencoders that have multiple hidden layers (2–4). With large initial weights, autoencoders typically find poor local minima; with small initial weights, the gradients in the early layers are tiny, making it infeasible to train autoencoders with many hidden layers. If the initial weights are close to a good solution, gradient descent works well, but finding such initial weights is very difficult.

Pretraining: The paper introduces a “pretraining” procedure for binary data, generalizes it to real-valued data, and shows that it works well for a variety of data sets.

Restricted Boltzmann Machine (RBM): The input data correspond to the “visible” units of the RBM and the feature detectors correspond to the “hidden” units. A joint configuration (v, h) of the visible and hidden units has an energy given by E(v, h) = − Σ_{i∈pixels} b_i v_i − Σ_{j∈features} b_j h_j − Σ_{i,j} v_i h_j w_ij (1). The network assigns a probability to every possible input via this energy function.
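
A small NumPy sketch of this energy function, using randomly chosen parameters purely for illustration (the helper name rbm_energy is mine, not the paper's):

import numpy as np

def rbm_energy(v, h, W, b_vis, b_hid):
    """Energy of a joint configuration (v, h) of a binary RBM:
    E(v, h) = - sum_i b_i v_i - sum_j b_j h_j - sum_{i,j} v_i h_j w_ij
    """
    return -(b_vis @ v) - (b_hid @ h) - (v @ W @ h)

# Tiny example: 6 visible units and 3 hidden units with random parameters.
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (6, 3))
b_vis = np.zeros(6)
b_hid = np.zeros(3)
v = rng.integers(0, 2, 6)
h = rng.integers(0, 2, 3)

print(rbm_energy(v, h, W, b_vis, b_hid))

# The RBM assigns probability p(v, h) proportional to exp(-E(v, h)),
# so lower-energy configurations are more probable.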

Pretraining consists of learning a stack of RBMs: ・The first layer of feature detectors becomes the visible units for learning the next RBM. ・This layer-by-layer learning can be repeated as many times as desired.
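
A sketch of this layer-by-layer pretraining in NumPy, using a single step of contrastive divergence (CD-1) to train each RBM; the layer sizes, learning rate, and epoch count below are illustrative choices, not the paper's settings:

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """Train a single binary RBM with one-step contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)

    for _ in range(epochs):
        # Positive phase: activate hidden feature detectors from the data.
        p_h = sigmoid(data @ W + b_hid)
        h = (rng.random(p_h.shape) < p_h).astype(float)

        # Negative phase: reconstruct visibles, then re-activate hiddens.
        p_v = sigmoid(h @ W.T + b_vis)
        p_h_recon = sigmoid(p_v @ W + b_hid)

        # CD-1 parameter updates.
        W += lr * (data.T @ p_h - p_v.T @ p_h_recon) / len(data)
        b_vis += lr * (data - p_v).mean(axis=0)
        b_hid += lr * (p_h - p_h_recon).mean(axis=0)

    return W, b_vis, b_hid, sigmoid(data @ W + b_hid)

# Greedy layer-by-layer stacking: the hidden activations of one RBM
# become the "visible" data for learning the next RBM in the stack.
layer_sizes = [784, 400, 200, 100]
data = rng.random((1000, layer_sizes[0]))
stack = []
for n_hidden in layer_sizes[1:]:
    W, b_vis, b_hid, data = train_rbm(data, n_hidden)
    stack.append((W, b_vis, b_hid))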

Experiment (2-A): Used autoencoder network: a (28×28)-400-200-100-50-25-6 encoder with a mirrored 6-25-50-100-200-400-(28×28) decoder. The function of the layers: the six units in the code layer were linear and all the other units were logistic. Data: the network was trained on 20,000 images and tested on 10,000 new images. Observed results: the autoencoder discovered how to convert each 784-pixel image into six real numbers that allow almost perfect reconstruction.
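
As a rough illustration of how a pretrained RBM stack is unrolled into such an encoder-decoder (the same weights reused, transposed, in the decoder, with the code layer kept linear and all other layers logistic), here is a NumPy sketch; fine-tuning by backpropagation is omitted, and the random weights and helper name unroll_and_reconstruct are mine:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def unroll_and_reconstruct(stack, x):
    """Unroll a pretrained RBM stack into an encoder-decoder and reconstruct x.

    `stack` is a list of (W, b_vis, b_hid) tuples, e.g. as produced by the
    layer-by-layer pretraining sketch above; `x` has shape (n_samples, n_pixels).
    """
    # Encoder: every layer is logistic except the final (code) layer,
    # which is linear, as described for the (28x28)-400-200-100-50-25-6 network.
    h = x
    for i, (W, _, b_hid) in enumerate(stack):
        z = h @ W + b_hid
        h = z if i == len(stack) - 1 else sigmoid(z)
    code = h                                  # low-dimensional code

    # Decoder: the same weights, transposed, applied in reverse order.
    recon = code
    for W, b_vis, _ in reversed(stack):
        recon = sigmoid(recon @ W.T + b_vis)
    return code, recon

# Demonstration with random (not pretrained) weights and the 2-A layer sizes.
rng = np.random.default_rng(0)
sizes = [784, 400, 200, 100, 50, 25, 6]
stack = [(rng.normal(0, 0.01, (m, n)), np.zeros(m), np.zeros(n))
         for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.random((5, 784))
code, recon = unroll_and_reconstruct(stack, x)
print(code.shape, recon.shape)                # (5, 6) (5, 784)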

Experiment (2-A): (1) Random samples of curves from the test data set; (2) reconstructions produced by the six-dimensional deep autoencoder; (3) reconstructions by logistic PCA using six components; (4) reconstructions by logistic PCA using 18 components; (5) standard PCA using 18 components. The average squared error per image for the last four rows is 1.44, 7.64, 2.45, 5.90.

Experiment (2-B): Used autoencoder network: a 784-1000-500-250-30 encoder with a mirrored 30-250-500-1000-784 decoder. The function of the layers: the 30 units in the code layer were linear and all the other units were logistic. Data: the network was trained on 60,000 images and tested on 10,000 new images.

Experiment (2-B), MNIST: (1) A random test image from each class; (2) reconstructions by the 30-dimensional autoencoder; (3) reconstructions by 30-dimensional logistic PCA; (4) reconstructions by standard PCA. The average squared errors for the last three rows are 3.00, 8.01, and 13.87.

Experiment (2-B): A two-dimensional autoencoder produced a better visualization of the data than did the first two principal components. (A) The two-dimensional codes for 500 digits of each class produced by taking the first two principal components of all 60,000 training images; (B) the two-dimensional codes found by a 784-1000-500-250-2 autoencoder.

Experiment (2-C): Data: the Olivetti face data set. Used autoencoder network: a 625-2000-1000-500-30 encoder with a mirrored 30-500-1000-2000-625 decoder. The function of the layers: the 30 units in the code layer were linear and all the other units were logistic. Observed results: the autoencoder clearly outperformed PCA.

Experiment (2-C): (1) Random samples from the test data set; (2) reconstructions by the 30-dimensional autoencoder; (3) reconstructions by 30-dimensional PCA. The average squared errors are 126 and 135.

Conclusion: It has been obvious since the 1980s that backpropagation through deep autoencoders would be very effective for nonlinear dimensionality reduction, provided that computers were fast enough, data sets were big enough, and the initial weights were close enough to a good solution.

Conclusion: Autoencoders give mappings in both directions between the data and code spaces, and they can be applied to very large data sets because both the pretraining and the fine-tuning scale linearly in time and space with the number of training cases.