An introduction to t-SNE against the background of dimension reduction
Size: 4.96 MB
Language: en
Added: Jan 29, 2014
Slides: 20 pages
Slide Content
Visualization using t-SNE. Yan Xu, Jun 7, 2013
Dimension Reduction Overview. A taxonomy: dimension reduction methods are linear (PCA) or nonlinear, and parametric (LDA) or non-parametric; the non-parametric methods split into global (ISOMAP, MDS) and local (LLE, SNE) approaches. The SNE family: MDS → SNE (2002, local structure + probabilities) → symmetric SNE and UNI-SNE (2007, attacking the crowding problem) → t-SNE (t-distributed Stochastic Neighbor Embedding, 2008, a more stable and faster solution with an easier implementation) → Barnes-Hut-SNE (2013, O(N²) → O(N log N)).
MDS (Multi-Dimensional Scaling) arranges the low-dimensional points so as to minimize the discrepancy between the pairwise distances in the original space and the pairwise distances in the low-D space.
Sammon mapping, a variant of MDS, weights each pairwise error by the inverse of the high-D distance: $C = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\left(d_{ij} - \|y_i - y_j\|\right)^2}{d_{ij}}$, where $d_{ij}$ is the high-D distance and $\|y_i - y_j\|$ the low-D distance. It puts too much emphasis on getting very small distances exactly right. It is slow to optimize and gets stuck in a different local optimum each time. Global to local?
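A minimal numpy sketch of both cost functions, for illustration only; the distance matrices D_high and D_low are assumed to be precomputed symmetric matrices with zero diagonals:

    import numpy as np

    def mds_stress(D_high, D_low):
        # Plain MDS: sum of squared discrepancies between high-D and low-D distances.
        iu = np.triu_indices_from(D_high, k=1)      # use each pair once
        return np.sum((D_high[iu] - D_low[iu]) ** 2)

    def sammon_stress(D_high, D_low):
        # Sammon mapping: each squared error is divided by the high-D distance,
        # so getting small distances right dominates the cost.
        iu = np.triu_indices_from(D_high, k=1)
        d, e = D_high[iu], D_low[iu]
        return np.sum((d - e) ** 2 / d) / np.sum(d)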
LLE (Locally Linear Embedding): the idea is to make the local configurations of points in the low-dimensional space resemble the local configurations in the high-dimensional space, i.e. to find maps that preserve local geometry. First reconstruct each point from its neighbors with weights $w_{ij}$; then, holding those weights fixed, find the $y$ that minimize the cost $\Phi(Y) = \sum_i \|y_i - \sum_j w_{ij} y_j\|^2$, subject to the constraint that the $y$ have unit variance in each dimension. A usage sketch follows below.
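A minimal usage sketch with scikit-learn's implementation; the data and parameter values are illustrative placeholders:

    import numpy as np
    from sklearn.manifold import LocallyLinearEmbedding

    X = np.random.rand(500, 20)                      # placeholder high-D data
    lle = LocallyLinearEmbedding(n_neighbors=10,     # size of each local neighborhood
                                 n_components=2)     # target dimensionality
    Y = lle.fit_transform(X)                         # low-D embedding, shape (500, 2)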
A probabilistic version of local MDS: Stochastic Neighbor Embedding (SNE). It is more important to get local distances right than non-local ones, and SNE has a probabilistic way of deciding whether a pairwise distance is "local": convert each high-dimensional similarity into the probability that one data point will pick the other data point as its neighbor. The probability of picking j given i in high D is $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$; the probability of picking j given i in low D is $q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$.
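A small numpy sketch of these conditional probabilities; it assumes the per-point radii sigmas are already given (how to pick them is the subject of the next slide):

    import numpy as np

    def conditional_probs(X, sigmas):
        # X: (n, d) data; sigmas: (n,) per-point Gaussian radii.
        D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
        P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
        np.fill_diagonal(P, 0.0)                                    # a point never picks itself
        return P / P.sum(axis=1, keepdims=True)                     # row i holds p_{j|i}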
Picking the radius of the Gaussian that is used to compute the p's: we need to use different radii in different parts of the space so that we keep the effective number of neighbors about constant. A big radius leads to a high entropy for the distribution over neighbors of i; a small radius leads to a low entropy. So decide what entropy you want and then find the radius that produces that entropy. It's easier to specify the perplexity: $\mathrm{Perp}(P_i) = 2^{H(P_i)}$, where $H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$.
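A sketch of the usual per-point binary search for the radius that matches a target perplexity; the bounds, tolerance-free fixed iteration count, and helper name are illustrative choices:

    import numpy as np

    def sigma_for_perplexity(dists_i, target_perp, n_iter=50):
        # dists_i: squared distances from point i to all other points (i excluded).
        lo, hi = 1e-10, 1e10
        for _ in range(n_iter):
            sigma = 0.5 * (lo + hi)
            p = np.exp(-dists_i / (2.0 * sigma ** 2))
            p /= p.sum()
            entropy = -np.sum(p * np.log2(p + 1e-12))
            if 2.0 ** entropy > target_perp:
                hi = sigma        # too many effective neighbors: shrink the radius
            else:
                lo = sigma        # too few effective neighbors: grow the radius
        return sigma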
The cost function for a low-dimensional representation is a sum of KL divergences: $C = \sum_i \mathrm{KL}(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$. Gradient descent: $\frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)(y_i - y_j)$. Gradient update with a momentum term: $\mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)} + \eta \frac{\partial C}{\partial \mathcal{Y}} + \alpha(t)\left(\mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)}\right)$, where $\eta$ is the learning rate and $\alpha(t)$ the momentum.
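A numpy sketch of this SNE gradient, assuming the conditional probability matrices P and Q (row i holding p_{j|i} and q_{j|i}) are already computed:

    import numpy as np

    def sne_gradient(Y, P, Q):
        # Y: (n, 2) map points; P, Q: (n, n) conditional probabilities.
        M = P - Q + P.T - Q.T                    # (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})
        diff = Y[:, None, :] - Y[None, :, :]     # pairwise differences y_i - y_j
        return 2.0 * np.sum(M[:, :, None] * diff, axis=1)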
Simpler (symmetric) version of SNE: turn the conditional probabilities into pairwise probabilities, $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$, and compare them with a single pairwise distribution $q_{ij}$ in the low-dimensional space.
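In code the symmetrization is a one-liner; P_cond is assumed to hold the conditional probabilities p_{j|i} in its rows:

    import numpy as np

    def symmetrize(P_cond):
        n = P_cond.shape[0]
        return (P_cond + P_cond.T) / (2.0 * n)   # p_ij = (p_{j|i} + p_{i|j}) / 2n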
MNIST database of handwritten digits (28×28 images). Problem?
Why SNE does not have gaps between classes. Crowding problem: the area available to accommodate moderately distant datapoints is not nearly large enough compared with the area available to accommodate nearby datapoints. A uniform background model (UNI-SNE), in which $q_{ij}$ can never fall below $\frac{2\rho}{n(n-1)}$ for background mixing proportion $\rho$, eliminates this effect and allows gaps between classes to appear.
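One way to write that background model, as a sketch of the idea rather than the exact parameterization of Cook et al. (2007): the low-dimensional similarity is mixed with a uniform distribution over all pairs,

    q_{ij} = (1-\rho)\,\frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_k - y_l\|^2)} + \rho\,\frac{2}{n(n-1)},

so even for very distant map points $q_{ij}$ stays above $2\rho/(n(n-1))$, which gives such pairs a slight repulsion whenever $p_{ij}$ is essentially zero.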
From UNI-SNE to t-SNE. High dimension: convert distances into probabilities using a Gaussian distribution. Low dimension: convert distances into probabilities using a distribution with much heavier tails than a Gaussian, the Student's t-distribution $f(t) = \frac{\Gamma\left(\frac{V+1}{2}\right)}{\sqrt{V\pi}\,\Gamma\left(\frac{V}{2}\right)}\left(1 + \frac{t^2}{V}\right)^{-\frac{V+1}{2}}$, where V is the number of degrees of freedom; t-SNE uses V = 1. (Figure: standard normal density vs. t-distribution with V = 1.)
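With one degree of freedom the heavy-tailed low-dimensional similarity reduces to $(1 + \|y_i - y_j\|^2)^{-1}$; a numpy sketch of the resulting q_ij:

    import numpy as np

    def tsne_low_dim_affinities(Y):
        # Heavy-tailed (Student t, V = 1) similarities between map points.
        D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        num = 1.0 / (1.0 + D)                    # (1 + ||y_i - y_j||^2)^(-1)
        np.fill_diagonal(num, 0.0)
        return num / num.sum()                   # q_ij, normalized over all pairs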
Compare t-SNE with SNE and UNI-SNE. (Figure.)
Optimization method for t-SNE: minimize the KL divergence $C = \mathrm{KL}(P \| Q)$ between the joint distributions by gradient descent; the gradient has the simple form $\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}$.
Optimization method for t-SNE. Tricks: (1) keep the momentum term small until the map points have become moderately well organized; (2) use the adaptive learning rate described by Jacobs (1988), which gradually increases the learning rate in directions where the gradient is stable; (3) early compression: force map points to stay close together at the start of the optimization; (4) early exaggeration: multiply all the p_ij's by 4 in the initial stages of the optimization. A condensed sketch of this schedule follows below.
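A condensed sketch of that optimization schedule; it omits early compression and the adaptive learning rate, and the hyperparameter values (learning rate, exaggeration length, momentum switch) are common defaults rather than anything prescribed by the slides:

    import numpy as np

    def tsne_optimize(P, n_iter=1000, eta=100.0):
        # P: (n, n) symmetrized high-dimensional affinities p_ij.
        n = P.shape[0]
        Y = 1e-4 * np.random.randn(n, 2)             # small random initial map
        update = np.zeros_like(Y)
        for t in range(n_iter):
            P_t = P * 4.0 if t < 100 else P          # early exaggeration, then plain p_ij
            momentum = 0.5 if t < 250 else 0.8       # keep momentum small at first
            diff = Y[:, None, :] - Y[None, :, :]     # pairwise y_i - y_j
            num = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))
            np.fill_diagonal(num, 0.0)
            Q = num / num.sum()                      # heavy-tailed low-D affinities q_ij
            grad = 4.0 * np.sum(((P_t - Q) * num)[:, :, None] * diff, axis=1)
            update = momentum * update - eta * grad  # gradient step with momentum
            Y += update
        return Y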
6000 MNIST digits embedded by Isomap, Locally Linear Embedding, Sammon mapping, and t-SNE. (Figure: four embeddings.)
t-SNE vs. diffusion maps. Diffusion distance: the distance between two points measured through the transition probabilities of a t-step random walk over the data graph, so points connected by many short, high-probability paths are considered close. Diffusion maps: embed the data using the leading non-trivial eigenvectors of the random-walk transition matrix, scaled by their eigenvalues, so that Euclidean distance in the embedding approximates the diffusion distance.
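A compact numpy sketch of the diffusion-maps construction; the Gaussian kernel width eps and the number of diffusion steps t are illustrative choices:

    import numpy as np

    def diffusion_map(X, n_components=2, eps=1.0, t=1):
        # Gaussian kernel on pairwise squared distances.
        D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-D / eps)
        P = K / K.sum(axis=1, keepdims=True)         # row-stochastic transition matrix
        vals, vecs = np.linalg.eig(P)
        order = np.argsort(-vals.real)               # sort eigenpairs by eigenvalue
        vals, vecs = vals.real[order], vecs.real[:, order]
        # Skip the trivial constant eigenvector (eigenvalue 1); scale by lambda^t.
        return vecs[:, 1:n_components + 1] * (vals[1:n_components + 1] ** t)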
Weaknesses: it is unclear how t-SNE performs on general dimensionality reduction tasks; the relatively local nature of t-SNE makes it sensitive to the curse of the intrinsic dimensionality of the data; and it is not guaranteed to converge to a global optimum of its cost function.