PR 103: t-SNE

TaeohKim4 854 views 49 slides Sep 16, 2018
Slide 1
Slide 1 of 49
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49

About This Presentation

Tensorflow Korea 논문읽기 모임 PR12의 103번째 발표는 t-SNE입니다.


Slide Content

Visualizing Data usingt-SNE

Credits
▷Hyeongmin Lee, MVPLAB, Yonsei Univ
▷https://www.slideshare.net/ssuser06e0c5/visualizing-data-using-tsne-73621033

t-SNE:
Student TDistributed-Stochastic Neighbor Embedding
▷Nonlinear Dimension Reduction for Visualization (2-D or 3-D)
▷Advance Version of SNE (G. Hinton, NIPS 2003)
▷Gradient-based Machine Learning Algorithm

Dimension Reduction

RealWorld Data = Very High Dimension
= 3145728 Dimension per Sample (ProGAN)

Manifold Hypothesis –Dimension Reduction
Ref) PR-010, PR-101
Slide from H.Lee (MVPLAB)

History of Dimension Reduction
Slide from H.Lee (MVPLAB)
Linear
▷Principal Component Analysis (1901)
Non-Linear
▷Multidimentional Scaling (1964)
▷Sammon Mapping (1969)
▷IsoMap (2000)
▷Locally Linear Embedding (2000)
▷Stochasitic Neighbor Embedding (2002)

Swiss Roll Data
Slide from H.Lee (MVPLAB)

IsoMap
Slide from H.Lee (MVPLAB)

Locally Linear Embedding
Slide from H.Lee (MVPLAB)

Problem?
Good at Local Representation = Poor at Global Representation
Good at Swiss Roll = Poor at Real Data

Stochastic Neighbor Embedding (SNE)

UpdateLow-DimensionalMapping
by Considering Pairwise Relations in High-Dimension
Iterative Update
Cost Function
Label
Prediction

Distance
Similarity
�
�|�=
??????

�
�−�
�
2
2??????
�
2
σ
�≠�
??????

�
�−�
�
2
2??????
�
2
??????

Distance
Similarity
??????
??????
�
�|�=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2

Distance
Similarity
??????
�
�|�=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2
??????

??????
????????????
�
�|�
�
�|�
�
�|�

??????
????????????
�
�|�
�
�|�
�
�|�

??????=��(�|�=෍
�

�
�
�|�log
�
�|�
�
�|�
??????
??????
for Every Data
????????????
??????�
�
=2෍
�
(�
�|�−�
�|�+�
�|�−�
�|�)(�
�−�
�)
�
�|��
�|�

??????=෍
�

�
�
�|�log
�
�|�
�
�|�
KL-Divergenceis Asymmetric
If High-D becomes Smaller
Low-D should Smaller
For Equal Cost

Appendix A: Gradient of SNE
??????=෍
�

�
�
�|�log
�
�|�
�
�|��
�|�=
??????
−��−��
2
σ
�≠�
??????
−�
�−�
�
2
??????
12…i…N
1
2

i 0

N
????????????
??????�
�
=−෍
�
�
�|�log�
�|�−෍
�
�
�|�log�
�|�
??????σ
�
�
�|�log�
�|�
??????�
�
=෍
�
�
�|�
??????log�
�|�
??????�
�
=෍
�
�
�|�(
??????log�
�|�??????
??????�
�

??????log??????
??????�
�
)
=෍
�
�
�|�(
1
�
�|�??????
??????�
�|�??????
??????�
�

1
??????
????????????
??????�
�
)=෍
�
�
�|�(
1
??????
−�
�−�
�
2??????
−�
�−�
�
2
??????−
1
??????
????????????
??????�
�
)
=෍
�
�
�|�??????−෍
�
1
??????

�≠�
�
�|�??????
−�
�−�
�
2
??????=−2෍
�
(�
�−�
�)(�
�|�−�
�|�)
�
�|�??????=??????
−�
�−�
�
2
??????�
�|�??????
??????�
�
=??????=−2(�
�−�
�)

t-Distributed SNE

Problem of SNE t-SNE
▷Hard to Optimize Symmetric Probability
▷Crowding Problem Student t-Distribution

SNE Symmetric SNE t-SNE
Prob. In
High-D
�
�|�=
??????

�
�−�
�
2
2??????
�
2
σ
�≠�
??????

�
�−�
�
2
2??????
�
2
�
��=
??????

�
�−�
�
2
2??????
2
σ
�≠�
??????

�
�−�
�
2
2??????
2
�
��=
�
�|�+�
�|�
2??????
Prob. In
Low-D
�
�|�=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2 �
��=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2 �
��=
1+�
�−�
�
2
−1
σ
�≠�
1+�
�−�
�
2−1
Cost
Function
??????=෍
�

�
�
�|�log
�
�|�
�
�|�
??????=෍
�

�
�
��log
�
��
�
��
Gradient of
Cost
Function
2෍
�
(�
�|�−�
�|�+�
�|�−�
�|�)(�
�−�
�) 4෍
�
(�
��−�
��)(�
�−�
�) 4෍
�
�
��−�
���
�−�
�1+�
�−�
�
2
−1

SNE t-SNE
▷Hard to Optimize Symmetric Probability (Simpler Gradient)
�
��=
??????

�
�−�
�
2
2??????
2
σ
�≠�
??????

��−��
2
2??????
2
�
��=
??????
−��−��
2
σ
�≠�
??????
−��−��
2
Single Scale
All Other Pairs

SNE t-SNE
▷Hard to Optimize Symmetric Probability (Simpler Gradient)
�
��=
??????

�
�−�
�
2
2??????
2
σ
�≠�
??????

��−��
2
2??????
2
�
��=
??????
−��−��
2
σ
�≠�
??????
−��−��
2
Single Scale
All Other Pairs
Outlier = Very Small �
��= No Contribution to the Cost

SNE t-SNE
▷Hard to Optimize Symmetric Probability (Simpler Gradient)
�
��=
??????

�
�−�
�
2
2??????
2
σ
�≠�
??????

��−��
2
2??????
2
�
��=
??????
−��−��
2
σ
�≠�
??????
−��−��
2
All Other Pairs
�
��=
�
�|�+�
�|�
2??????
Ensures that σ
��
��>
1
2??????
for all data, contributes to the cost

SNE Symmetric SNE t-SNE
Prob. In
High-D
�
�|�=
??????

�
�−�
�
2
2??????
�
2
σ
�≠�
??????

�
�−�
�
2
2??????
�
2
�
��=
??????

�
�−�
�
2
2??????
2
σ
�≠�
??????

�
�−�
�
2
2??????
2
�
��=
�
�|�+�
�|�
2??????
Prob. In
Low-D
�
�|�=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2 �
��=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2 �
��=
1+�
�−�
�
2
−1
σ
�≠�
1+�
�−�
�
2−1
Cost
Function
??????=෍
�

�
�
�|�log
�
�|�
�
�|�
??????=෍
�

�
�
��log
�
��
�
��
Gradient of
Cost
Function
2෍
�
(�
�|�−�
�|�+�
�|�−�
�|�)(�
�−�
�) 4෍
�
(�
��−�
��)(�
�−�
�) 4෍
�
�
��−�
���
�−�
�1+�
�−�
�
2
−1

SNE t-SNE
▷Crowding Problem Student t-Distribution
Slide from H.Lee (MVPLAB)
Solution?
▷Close Points Closer
▷Moderate Points More Far Away

SNE t-SNE
▷Crowding Problem Student t-Distribution
Student t-Distribution in Low-Dimension

SNE Symmetric SNE t-SNE
Prob. In
High-D
�
�|�=
??????

�
�−�
�
2
2??????
�
2
σ
�≠�
??????

�
�−�
�
2
2??????
�
2
�
��=
??????

�
�−�
�
2
2??????
2
σ
�≠�
??????

�
�−�
�
2
2??????
2
�
��=
�
�|�+�
�|�
2??????
Prob. In
Low-D
�
�|�=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2 �
��=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2 �
��=
1+�
�−�
�
2
−1
σ
�≠�
1+�
�−�
�
2−1
Cost
Function
??????=෍
�

�
�
�|�log
�
�|�
�
�|�
??????=෍
�

�
�
��log
�
��
�
��
Gradient of
Cost
Function
2෍
�
(�
�|�−�
�|�+�
�|�−�
�|�)(�
�−�
�) 4෍
�
(�
��−�
��)(�
�−�
�) 4෍
�
�
��−�
���
�−�
�1+�
�−�
�
2
−1

SNE t-SNE
▷Crowding Problem Student t-Distribution
Student t-Distribution in Low-Dimension
This High-Dimension Data

SNE t-SNE
▷Crowding Problem Student t-Distribution
Student t-Distribution in Low-Dimension
This High-Dimension Data
Loses its Probability
Closer

SNE t-SNE
▷Crowding Problem Student t-Distribution
Student t-Distribution in Low-Dimension
This High-Dimension Data

SNE t-SNE
▷Crowding Problem Student t-Distribution
Student t-Distribution in Low-Dimension
This High-Dimension Data
Gains its Probability
Morefaraway

High-D Low-D �
���
��(�
��−�
��)(�
�−�
�) Gradient
Large Large 1 1 0 Large 0
Small Small 0 0 0 Small 0
Small Large 0 1 -1 Large Large
Attraction
Large Small 1 0 1 Small Small
Repulsion
Small Replusion

Adding Slight Repulsion (Uniform Dist. in �
��)
Often Not the Case
Low-D Initialized by Gaussian

High-D Low-D �
���
��(�
��−�
��) (�
�−�
�) 1+�
�−�
�
2
−1
Gradient
Large Large 1 1 0 Large Small 0
Small Small 0 0 0 Small Large 0
Small Large 0 1 -1 Large Small Attraction
Large Small 1 0 1 Small Large Repulsion
Strong Replusion

SNE Symmetric SNE t-SNE
Prob. In
High-D
�
�|�=
??????

�
�−�
�
2
2??????
�
2
σ
�≠�
??????

�
�−�
�
2
2??????
�
2
�
��=
??????

�
�−�
�
2
2??????
2
σ
�≠�
??????

�
�−�
�
2
2??????
2
�
��=
�
�|�+�
�|�
2??????
Prob. In
Low-D
�
�|�=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2 �
��=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2 �
��=
1+�
�−�
�
2
−1
σ
�≠�
1+�
�−�
�
2−1
Cost
Function
??????=෍
�

�
�
�|�log
�
�|�
�
�|�
??????=෍
�

�
�
��log
�
��
�
��
Gradient of
Cost
Function
2෍
�
(�
�|�−�
�|�+�
�|�−�
�|�)(�
�−�
�) 4෍
�
(�
��−�
��)(�
�−�
�) 4෍
�
�
��−�
���
�−�
�1+�
�−�
�
2
−1

Effects of t-Distribution
Close Points Closer

Results & Add-On

Slide from H.Lee (MVPLAB)

SNE Symmetric SNE t-SNE
Prob. In
High-D
�
�|�=
??????

�
�−�
�
2
2??????
�
2
σ
�≠�
??????

�
�−�
�
2
2??????
�
2
�
��=
??????

�
�−�
�
2
2??????
2
σ
�≠�
??????

�
�−�
�
2
2??????
2
�
��=
�
�|�+�
�|�
2??????
Prob. In
Low-D
�
�|�=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2 �
��=
??????
−�
�−�
�
2
σ
�≠�
??????
−�
�−�
�
2 �
��=
1+�
�−�
�
2
−1
σ
�≠�
1+�
�−�
�
2−1
Cost
Function
??????=෍
�

�
�
�|�log
�
�|�
�
�|�
??????=෍
�

�
�
��log
�
��
�
��
Gradient of
Cost
Function
2෍
�
(�
�|�−�
�|�+�
�|�−�
�|�)(�
�−�
�) 4෍
�
(�
��−�
��)(�
�−�
�) 4෍
�
�
��−�
���
�−�
�1+�
�−�
�
2
−1

Set ??????
�
2
Calculate �??????��(�
�)�??????���
�=�??????��????????????�????????????�?
�
�|�=
??????

�
�−�
�
2
2??????
�
2
σ
�≠�
??????

�
�−�
�
2
2??????
�
2
�??????���
�=2
−σ
�
??????
�|�log2??????
�|�
Hyper-parameter: Perplexity
Paper Suggests 5~50
Perplexity = Smoothed Measure of the Effective Number of Neighbors
Perplexity = Balancing between Local and Global Aspects of Data

Reference
▷How to use t-SNE Effectively, https://distill.pub/2016/misread-tsne/
▷Automatic Selection of t-SNE Perplexity, ICML17 AutoML Workshop
??????�??????��=2��(�|�+
log??????�??????��
??????

Thank You
Tags