Visualization using tSNE


An introduction to tSNE in the context of dimension reduction

Published in: Technology, Education

  • Perplexity is 2 raised to the power of the entropy of the distribution. It measures uncertainty and, in this case, can be interpreted as a smooth measure of the effective number of neighbors.
  • The KL divergence of Q from P measures the information lost when Q is used to approximate P. In the early stages of the optimization, Gaussian noise is added to the map points after each iteration. Gradually reducing the variance of this noise performs a type of simulated annealing that helps the optimization escape from poor local minima of the cost function. This requires sensible choices of the initial amount of Gaussian noise and the rate at which it decays. These choices interact with the amount of momentum and the step size used in the gradient descent, so the optimization is run several times on a data set to find appropriate values for the parameters.
  • When xi is an outlier, all of its pairwise distances are large, so pij is very small for all j. The location of yi then has little effect on the cost function, and this point is not well determined by the positions of the other map points. Points are pulled towards each other if the p’s are bigger than the q’s and repelled if the q’s are bigger than the p’s.
  • Crowding: if we want to model the small distances accurately in the map, most of the points at a moderate distance have to be placed much too far away in the 2D map. Each of these points then feels a small attractive force, and the very large number of such forces crushes the datapoints together in the center of the map, preventing gaps between classes. UNI-SNE: for datapoints far apart in the high-D space, q will always be larger than p, leading to a slight repulsion. Optimization of UNI-SNE is tedious: optimizing the UNI-SNE cost function directly does not work, because two map points that are far apart get almost all of their q’s from the uniform background. When p is large, no
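  • A heavy-tailed distribution in the low-D map allows a moderate distance in the high-D space to be faithfully modeled by a much larger distance in the map. This eliminates the unwanted attractive forces between map points that represent moderately dissimilar datapoints.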
  • UNI-SNE: the repulsion is only strong when the pairwise distance between the points in the low-D map is already large. The strength of the repulsion between dissimilar points is proportional to their pairwise distance in the low-D map, so dissimilar points may move too far away from each other. tSNE introduces long-range forces in the low-D map that can pull back together two similar points that get separated early on in the optimization.
  • Compared with Sammon mapping: tSNE has a soft border between the local and global structure, and it determines the local neighborhood size for each datapoint separately, based on the local density of the data. Isomap: susceptible to short-circuiting (connecting the wrong points because k is too large, which leads to a drastically different low-D visualization), and it models large geodesic distances rather than small ones. Weakness of LLE: it is easy to cheat. The only thing that prevents all datapoints from collapsing onto a single point is a constraint on the covariance of the low-D representation. In practice, this constraint is often satisfied by placing most of the map points near the center of the map and using a few widely scattered points to keep that variance. LLE and Isomap, as neighbor-graph methods, are not capable of visualizing data that consists of two or more separated submanifolds: they lose the relative similarities of the separate components.
  • tSNE is now mostly used for visualization. It is not readily applicable to reducing data to d > 3 dimensions because of the heavy tails: in high-dimensional spaces, the heavy tails comprise a relatively large portion of the probability mass, which can lead to data representations that do not preserve the local structure of the data. Perplexity defines the neighborhood, and we end up with a different low-D layout if this parameter is not set well. The method needs several optimization parameters, but the same choice of optimization parameters can be used for a variety of different visualization tasks, so it is relatively stable.

    1. Visualization using tSNE. Yan Xu, Jun 7, 2013
    2. Dimension Reduction Overview. Taxonomy: parametric (LDA) vs. nonparametric; nonparametric methods are linear (PCA) or nonlinear; nonlinear methods are global (ISOMAP, MDS) or local (LLE, SNE). Timeline leading to tSNE (t-distributed Stochastic Neighbor Embedding): MDS -> SNE (local + probability, 2002) -> sym SNE (easier implementation) -> UNI-SNE (crowding problem, 2007) -> tSNE (more stable and faster solution, 2008) -> Barnes-Hut-SNE (O(N^2) -> O(N log N), 2013).
    3. MDS: Multi-Dimensional Scaling. • Multi-Dimensional Scaling arranges the low-dimensional points so as to minimize the discrepancy between the pairwise distances in the original space and the pairwise distances in the low-D space: $\mathrm{Cost} = \sum_{i<j} (d_{ij} - \hat{d}_{ij})^2$, with $d_{ij} = \|x_i - x_j\|^2$ and $\hat{d}_{ij} = \|y_i - y_j\|^2$.
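A minimal sketch of the MDS cost on slide 3, in plain Python. The helper names (`sq_dist`, `mds_cost`) and the toy points are illustrative assumptions, not part of the slides; distances are squared, following the slide's definition of d_ij.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mds_cost(X, Y):
    """Sum over pairs i < j of (d_ij - d_hat_ij)^2, with squared distances."""
    n = len(X)
    return sum(
        (sq_dist(X[i], X[j]) - sq_dist(Y[i], Y[j])) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


# Toy data: three points that already lie in a 2D plane of 3D space.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
Y_good = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # keeps every pairwise distance
Y_bad = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]    # collapses all points together

print(mds_cost(X, Y_good), mds_cost(X, Y_bad))
```

A map that reproduces every pairwise distance has zero cost, and the collapsed map is penalized most on the largest distances, which is why this cost favors global structure.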
    4. Sammon mapping (from MDS): $\mathrm{Cost} = \sum_{ij} \left( \frac{\|x_i - x_j\| - \|y_i - y_j\|}{\|x_i - x_j\|} \right)^2$, where $\|x_i - x_j\|$ is the high-D distance and $\|y_i - y_j\|$ is the low-D distance. It puts too much emphasis on getting very small distances exactly right. It’s slow to optimize and also gets stuck in different local optima each time. Global to local?
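To see why the Sammon cost on slide 4 emphasizes very small distances, the sketch below evaluates a single term of the sum for the same absolute error on a small and on a large high-D distance. The function name `sammon_term` and the numbers are hypothetical, chosen only for illustration.

```python
def sammon_term(d_high, d_low):
    """One term of the Sammon cost: relative error between distances, squared."""
    return ((d_high - d_low) / d_high) ** 2


# The same absolute error of 0.1, once on a small and once on a large distance.
small = sammon_term(0.2, 0.3)
large = sammon_term(10.0, 10.1)

print(small, large)
```

Because each error is divided by the high-D distance, the small pair dominates the cost by several orders of magnitude.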
    5. Maps that preserve local geometry: LLE (Locally Linear Embedding). The idea is to make the local configurations of points in the low-dimensional space resemble the local configurations in the high-dimensional space. Find the weights that minimize $\mathrm{Cost} = \sum_i \| x_i - \sum_{j \in N(i)} w_{ij} x_j \|^2$ subject to $\sum_{j \in N(i)} w_{ij} = 1$. Then, with fixed weights, find the y that minimize $\mathrm{Cost} = \sum_i \| y_i - \sum_{j \in N(i)} w_{ij} y_j \|^2$ subject to the constraint that the y have unit variance on each dimension.
    6. A probabilistic version of local MDS: Stochastic Neighbor Embedding (SNE). • It is more important to get local distances right than non-local ones. • Stochastic neighbor embedding has a probabilistic way of deciding if a pairwise distance is “local”. • Convert each high-dimensional similarity into the probability that one data point will pick the other data point as its neighbor. Probability of picking j given i in high D: $p_{j|i} = \frac{e^{-\|x_i - x_j\|^2 / 2\sigma_i^2}}{\sum_{k \neq i} e^{-\|x_i - x_k\|^2 / 2\sigma_i^2}}$. Probability of picking j given i in low D: $q_{j|i} = \frac{e^{-\|y_i - y_j\|^2}}{\sum_{k \neq i} e^{-\|y_i - y_k\|^2}}$.
    7. Picking the radius of the Gaussian that is used to compute the p’s. • We need to use different radii in different parts of the space so that we keep the effective number of neighbors about constant. • A big radius leads to a high entropy for the distribution over neighbors of i. A small radius leads to a low entropy. • So decide what entropy you want and then find the radius that produces that entropy. • It’s easier to specify perplexity: $p_{j|i} = \frac{e^{-\|x_i - x_j\|^2 / 2\sigma_i^2}}{\sum_{k \neq i} e^{-\|x_i - x_k\|^2 / 2\sigma_i^2}}$
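The radius search on slide 7 can be sketched as a bisection on sigma, using the definition from the notes (perplexity is 2 to the power of the entropy). This is a minimal sketch under stated assumptions: the function names, the search bounds, the iteration count, and the toy distances are all illustrative, and the bisection relies on perplexity growing with sigma.

```python
import math


def neighbor_probs(sq_dists, sigma):
    """p_{j|i} for one point i, given its squared distances to the other points."""
    d_min = min(sq_dists)  # shift for numerical stability; it cancels in the ratio
    w = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(w)
    return [v / total for v in w]


def perplexity(p):
    """2 raised to the Shannon entropy (in bits) of the distribution p."""
    entropy = -sum(v * math.log2(v) for v in p if v > 0.0)
    return 2.0 ** entropy


def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Bisection in log space: a larger sigma gives a larger perplexity."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if perplexity(neighbor_probs(sq_dists, mid)) > target:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)


# Squared distances from one point to its five neighbors (toy values).
sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0]
sigma = sigma_for_perplexity(sq_dists, target=3.0)
print(sigma, perplexity(neighbor_probs(sq_dists, sigma)))
```

Running the search once per datapoint gives each point its own sigma, so points in dense regions get a small radius and points in sparse regions a large one.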
    8. The cost function for a low-dimensional representation: $\mathrm{Cost} = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$. Gradient descent: $\frac{\partial C}{\partial y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$. The gradient update uses a learning rate and a momentum term.
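A minimal sketch of the SNE cost and gradient from slide 8, with one update step. The update formula itself did not survive in the transcript, so `momentum_step` uses the standard gradient-descent-with-momentum form as an assumption. All names, the single shared sigma, the learning rate, and the toy points are illustrative.

```python
import math


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def cond_probs(points, two_sigma_sq):
    """Row i holds the conditional probabilities of picking j given i."""
    n = len(points)
    rows = []
    for i in range(n):
        w = [0.0 if j == i else math.exp(-sq_dist(points[i], points[j]) / two_sigma_sq)
             for j in range(n)]
        total = sum(w)
        rows.append([v / total for v in w])
    return rows


def sne_cost(P, Y):
    """Sum over i of KL(P_i || Q_i), with q_{j|i} built from exp(-||y_i - y_j||^2)."""
    Q = cond_probs(Y, 1.0)
    n = len(Y)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n) if i != j)


def sne_grad(P, Y):
    """dC/dy_i = 2 * sum_j (p_j|i - q_j|i + p_i|j - q_i|j) * (y_i - y_j)."""
    Q = cond_probs(Y, 1.0)
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c = 2.0 * (P[i][j] - Q[i][j] + P[j][i] - Q[j][i])
            for d in range(dim):
                grad[i][d] += c * (Y[i][d] - Y[j][d])
    return grad


def momentum_step(Y, Y_prev, grad, lr, momentum):
    """Assumed update: y <- y - lr * grad + momentum * (y - y_prev)."""
    return [[Y[i][d] - lr * grad[i][d] + momentum * (Y[i][d] - Y_prev[i][d])
             for d in range(len(Y[0]))] for i in range(len(Y))]


X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
P = cond_probs(X, 2.0)  # 2 * sigma^2 with sigma = 1 for every point (simplification)
Y0 = [(0.0, 0.0), (0.5, 0.1), (-0.2, 0.4), (0.3, -0.3)]
Y1 = momentum_step(Y0, Y0, sne_grad(P, Y0), lr=0.01, momentum=0.5)
print(sne_cost(P, Y0), sne_cost(P, Y1))
```

The sign of each coefficient matches the notes: where the p’s exceed the q’s, a descent step pulls the two map points together, and where the q’s exceed the p’s it pushes them apart.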
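    9. Simpler version SNE: turning conditional probabilities into pairwise probabilities. Pairwise Gaussian: $p_{ij} = \frac{e^{-\|x_i - x_j\|^2 / 2\sigma^2}}{\sum_{k \neq l} e^{-\|x_k - x_l\|^2 / 2\sigma^2}}$. Symmetrized conditionals: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$, which gives $\sum_j p_{ij} > \frac{1}{2n}$. $\mathrm{Cost} = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$, with gradient $\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)$.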
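The two definitions of p_ij on slide 9 behave differently for an outlier, which the sketch below illustrates on toy data. The function names, the shared sigma, and the points are assumptions made for the example.

```python
import math


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def gaussian_weights(X, sigma):
    n = len(X)
    return [[0.0 if i == j else math.exp(-sq_dist(X[i], X[j]) / (2.0 * sigma ** 2))
             for j in range(n)] for i in range(n)]


def pairwise_gaussian(X, sigma):
    """p_ij from a single Gaussian, normalized over all pairs k != l."""
    w = gaussian_weights(X, sigma)
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]


def symmetrized(X, sigma):
    """p_ij = (p_{j|i} + p_{i|j}) / 2n, built from row-normalized conditionals."""
    n = len(X)
    w = gaussian_weights(X, sigma)
    cond = [[v / sum(row) for v in row] for row in w]
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]


# Three nearby points and one outlier (index 3).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (6.0, 6.0)]
P_pair = pairwise_gaussian(X, sigma=1.0)
P_sym = symmetrized(X, sigma=1.0)

print("outlier mass, pairwise Gaussian:", sum(P_pair[3]))
print("outlier mass, symmetrized:     ", sum(P_sym[3]))
```

With the pairwise Gaussian the outlier contributes almost nothing to the cost, so its map position is poorly determined; the symmetrized form guarantees every point a share of at least 1/2n.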
    10. MNIST Database of handwritten digits: 28×28 images. Problem?
    11. Why SNE does not have gaps between classes. Crowding problem: the area accommodating moderately distant datapoints is not large enough compared with the area accommodating nearby datapoints. A uniform background model (UNI-SNE) eliminates this effect and allows gaps between classes to appear: $q_{ij}$ can never fall below $\frac{2\rho}{n(n-1)}$, where $\rho$ is the mixing proportion of the uniform background.
    12. From UNI-SNE to t-SNE. High dimension: convert distances into probabilities using a Gaussian distribution. Low dimension: convert distances into probabilities using a probability distribution that has much heavier tails than a Gaussian, the Student’s t-distribution ($\nu$: the number of degrees of freedom). [Figure: standard normal distribution vs. t-distribution with $\nu = 1$.] $q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$
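A rough illustration of the heavy tails on slide 12: for a given unnormalized similarity s, the sketch computes the map distance at which a Gaussian kernel exp(-d^2) and the Student t kernel (1 + d^2)^-1 produce that similarity. The function names and the sample values of s are assumptions, and normalization over all pairs is deliberately ignored.

```python
import math


def gaussian_map_distance(s):
    """Distance d at which exp(-d^2) equals s."""
    return math.sqrt(-math.log(s))


def student_map_distance(s):
    """Distance d at which (1 + d^2)^-1 equals s."""
    return math.sqrt(1.0 / s - 1.0)


for s in (0.5, 0.1, 0.01):
    print(s, gaussian_map_distance(s), student_map_distance(s))
```

For the same similarity, the t kernel always places the pair farther apart, and the gap widens as the similarity shrinks. This is the extra room that moderately distant datapoints need in the map.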
    13. Compare tSNE with SNE and UNI-SNE. [Figure: comparison plot.]
    14. Optimization method for tSNE: $p_{j|i} = \frac{e^{-\|x_i - x_j\|^2 / 2\sigma_i^2}}{\sum_{k \neq i} e^{-\|x_i - x_k\|^2 / 2\sigma_i^2}}$, $q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$
    15. Optimization method for tSNE. Tricks: 1. Keep the momentum term small until the map points have become moderately well organized. 2. Use the adaptive learning rate scheme described by Jacobs (1988), which gradually increases the learning rate in directions where the gradient is stable. 3. Early compression: force map points to stay close together at the start of the optimization. 4. Early exaggeration: multiply all the $p_{ij}$’s by 4 in the initial stages of the optimization.
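Two of the tricks above (a small momentum at first, and early exaggeration by a factor of 4) can be sketched in a toy tSNE loop. The gradient used here, 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1, is the one from the t-SNE paper and is not recoverable from the slide text. The schedule (50 exaggerated iterations, momentum 0.5 then 0.8), the learning rate, the fixed sigma, and the deterministic initial layout are all assumptions chosen for a six-point example, and the adaptive learning rate and early compression are left out.

```python
import math


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def joint_p(X, sigma=1.0):
    """p_ij = (p_{j|i} + p_{i|j}) / 2n, with one fixed sigma (a simplification)."""
    n = len(X)
    cond = []
    for i in range(n):
        w = [0.0 if j == i else math.exp(-sq_dist(X[i], X[j]) / (2.0 * sigma ** 2))
             for j in range(n)]
        total = sum(w)
        cond.append([v / total for v in w])
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]


def joint_q(Y):
    """q_ij from the Student t kernel (1 + ||y_i - y_j||^2)^-1."""
    n = len(Y)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j])) for j in range(n)]
         for i in range(n)]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]


def kl_cost(P, Y):
    Q = joint_q(Y)
    n = len(Y)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n) if i != j and P[i][j] > 0.0)


def tsne_grad(P, Y):
    Q = joint_q(Y)
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            c = 4.0 * (P[i][j] - Q[i][j]) / (1.0 + sq_dist(Y[i], Y[j]))
            for d in range(dim):
                grad[i][d] += c * (Y[i][d] - Y[j][d])
    return grad


def run_tsne(P, Y0, iters=300, lr=0.1, exaggeration=4.0,
             stop_exaggeration=50, switch_momentum=100):
    Y = [list(y) for y in Y0]
    V = [[0.0] * len(Y[0]) for _ in Y]  # velocity carried by the momentum term
    for t in range(iters):
        scale = exaggeration if t < stop_exaggeration else 1.0  # early exaggeration
        momentum = 0.5 if t < switch_momentum else 0.8          # small momentum first
        G = tsne_grad([[scale * p for p in row] for row in P], Y)
        for i in range(len(Y)):
            for d in range(len(Y[0])):
                V[i][d] = momentum * V[i][d] - lr * G[i][d]
                Y[i][d] += V[i][d]
    return Y


# Two well-separated clusters of three points each in 3D.
X = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0),
     (5.0, 5.0, 5.0), (5.5, 5.0, 5.0), (5.0, 5.5, 5.0)]
P = joint_p(X)
Y0 = [(0.01 * math.cos(i), 0.01 * math.sin(i)) for i in range(len(X))]
Y = run_tsne(P, Y0)
print("KL before:", kl_cost(P, Y0), "KL after:", kl_cost(P, Y))
```

Exaggerating the p’s makes the q’s far too small to match them, so the early iterations mainly pull similar points into tight groups, which leaves empty space for the groups to move apart later.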
    16. 6000 MNIST digits. [Figure: four embeddings, labeled Isomap, Sammon mapping, t-SNE, and Locally Linear Embedding.]
    17. tSNE vs Diffusion maps. Diffusion distance: $p_{ij}^{(1)} = e^{-\|x_i - x_j\|^2}$. Diffusion maps: $p_{ij}^{(t)} = \sum_{k=1}^{n} p_{ik}^{(t-1)} p_{kj}^{(t-1)}$
    18. Weakness: 1. It’s unclear how t-SNE performs on general dimensionality reduction tasks; 2. The relatively local nature of t-SNE makes it sensitive to the curse of the intrinsic dimensionality of the data; 3. It’s not guaranteed to converge to a global optimum of its cost function.
    19. References: t-SNE homepage; Advanced Machine Learning, Lecture 11: Non-linear Dimensionality Reduction. Plugin ad, tSNE in Farsight: splot = new SNEPlotWindow(this); splot->setPerplexity(perplexity); splot->setModels(table, selection); splot->show();