Visualizing Data using t-SNE
Credits
▷ Hyeongmin Lee, MVPLAB, Yonsei Univ
▷ https://www.slideshare.net/ssuser06e0c5/visualizing-data-using-tsne-73621033
t-SNE:
t-Distributed Stochastic Neighbor Embedding
▷ Nonlinear Dimension Reduction for Visualization (2-D or 3-D)
▷ Advanced Version of SNE (G. Hinton & S. Roweis, NIPS 2002)
▷ Gradient-based Machine Learning Algorithm
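Before the details, a minimal usage sketch with scikit-learn's `TSNE` (the random matrix is only a stand-in for real high-dimensional features):

```python
# Minimal t-SNE usage sketch: map n_samples x n_features data to 2-D.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(500, 128)   # 500 samples, 128-D stand-in features
Y = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)
print(Y.shape)                                 # (500, 2): one map point per sample
```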
Dimension Reduction
Real-World Data = Very High Dimension
= 3,145,728 Dimensions per Sample (a 1024×1024×3 ProGAN image)
Manifold Hypothesis – Dimension Reduction
Ref) PR-010, PR-101
Slide from H.Lee (MVPLAB)
History of Dimension Reduction
Slide from H.Lee (MVPLAB)
Linear
▷ Principal Component Analysis (1901)
Non-Linear
▷ Multidimensional Scaling (1964)
▷ Sammon Mapping (1969)
▷ IsoMap (2000)
▷ Locally Linear Embedding (2000)
▷ Stochastic Neighbor Embedding (2002)
Swiss Roll Data
Slide from H.Lee (MVPLAB)
IsoMap
Slide from H.Lee (MVPLAB)
Locally Linear Embedding
Slide from H.Lee (MVPLAB)
Problem?
Good at Local Representation = Poor at Global Representation
Good at Swiss Roll = Poor at Real Data
Stochastic Neighbor Embedding (SNE)
Update Low-Dimensional Mapping
by Considering Pairwise Relations in High-Dimension
Iterative Update with a Cost Function
(Figure: the high-D pairwise similarities act as the label, the low-D similarities as the prediction.)
Distance → Similarity (high dimension, centered on point $i$):

$$p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
Distance → Similarity (low-dimensional map, centered on point $i$):

$$q_{j|i} = \frac{\exp\!\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\!\left(-\|y_i - y_k\|^2\right)}$$
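The two definitions above drop straight into NumPy; a direct, unoptimized sketch (the function names and the O(n²) broadcasting are my own choices, not from the slides):

```python
import numpy as np

def conditional_similarities(X, sigma):
    """p_{j|i}: rows of X are data points; sigma is a length-n vector of bandwidths."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)  # squared pairwise distances
    logits = -d2 / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)                      # exclude k = i
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)                # row i holds p_{.|i}

def low_dim_similarities(Y):
    """q_{j|i}: same construction in the map, with a fixed bandwidth."""
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    E = np.exp(-d2)
    np.fill_diagonal(E, 0.0)
    return E / E.sum(axis=1, keepdims=True)
```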
Cost: for every data point $i$, compare the two distributions with a KL divergence:

$$C = \sum_i KL(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$
$$\frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)\left(y_i - y_j\right)$$
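Given matrices `P` and `Q` with the convention `P[i, j]` = $p_{j|i}$ (as in the sketch above), the gradient and a plain descent step look like this; the original paper's optimizer also adds momentum, which is omitted here:

```python
def sne_gradient(P, Q, Y):
    """dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)."""
    M = (P - Q) + (P - Q).T                 # p_{j|i}-q_{j|i} + p_{i|j}-q_{i|j}
    diff = Y[:, None, :] - Y[None, :, :]    # (n, n, d): y_i - y_j
    return 2.0 * (M[:, :, None] * diff).sum(axis=1)

# one plain gradient-descent step on the map:
# Y -= 0.1 * sne_gradient(P, low_dim_similarities(Y), Y)
```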
$$C = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

KL-Divergence is asymmetric: if the high-D similarity $p_{j|i}$ becomes smaller, the low-D similarity $q_{j|i}$ must become smaller as well for equal cost. A large $p_{j|i}$ modeled by a small $q_{j|i}$ is penalized heavily, while the reverse mismatch is cheap, so SNE mainly preserves local structure.
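A tiny numeric illustration of the asymmetry, with one pair's similarity mismatched in each direction:

```python
import numpy as np

p, q = 0.8, 0.1
print(p * np.log(p / q))   # ~ 1.66: large p modeled by small q -> heavy penalty
print(q * np.log(q / p))   # ~ -0.21: the reverse mismatch barely costs anything
```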
Appendix A: Gradient of SNE

$$C = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}, \qquad q_{j|i} = \frac{e^{-\|y_i - y_j\|^2}}{\sum_{k \neq i} e^{-\|y_i - y_k\|^2}} = \frac{e^{-\|y_i - y_j\|^2}}{Z}$$

(Figure: an $N \times N$ grid of the pairwise terms, with zeros on the diagonal; only row $i$ and column $i$ depend on $y_i$.)

So the $y_i$-dependent part of the cost gives

$$\frac{\partial C}{\partial y_i} = \frac{\partial}{\partial y_i} \Big( -\sum_j p_{j|i} \log q_{j|i} - \sum_j p_{i|j} \log q_{i|j} \Big)$$

For the first sum:

$$\frac{\partial \sum_j p_{j|i} \log q_{j|i}}{\partial y_i} = \sum_j p_{j|i} \frac{\partial \log q_{j|i}}{\partial y_i} = \sum_j p_{j|i} \left( \frac{\partial \log (q_{j|i} Z)}{\partial y_i} - \frac{\partial \log Z}{\partial y_i} \right) = \sum_j p_{j|i} \left( \frac{1}{q_{j|i} Z} \frac{\partial (q_{j|i} Z)}{\partial y_i} - \frac{1}{Z} \frac{\partial Z}{\partial y_i} \right)$$

Since $q_{j|i} Z = e^{-\|y_i - y_j\|^2}$, its derivative is $\frac{\partial (q_{j|i} Z)}{\partial y_i} = e^{-\|y_i - y_j\|^2} A_j$ with $A_j = -2(y_i - y_j)$, so the first term reduces to $\sum_j p_{j|i} A_j$. For the second term, $\frac{1}{Z} \frac{\partial Z}{\partial y_i} = \frac{1}{Z} \sum_{k \neq i} e^{-\|y_i - y_k\|^2} A_k = \sum_k q_{k|i} A_k$, and with $\sum_j p_{j|i} = 1$ the two pieces combine into

$$\sum_j A_j \left(p_{j|i} - q_{j|i}\right) = -2 \sum_j (y_i - y_j)(p_{j|i} - q_{j|i})$$

(The $\sum_j p_{i|j} \log q_{i|j}$ part is handled the same way; summing both contributions gives the gradient $2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$ quoted above.)
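A quick sanity check on the derivation: compare the analytic gradient with central finite differences (a sketch reusing `sne_gradient` and `low_dim_similarities` from earlier):

```python
def sne_cost(P, Y):
    """C = sum_ij p_{j|i} log(p_{j|i} / q_{j|i}) over off-diagonal entries."""
    Q = low_dim_similarities(Y)
    mask = ~np.eye(len(Y), dtype=bool)
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def check_gradient(P, Y, i=0, eps=1e-5):
    """Central finite differences on y_i versus the analytic formula."""
    analytic = sne_gradient(P, low_dim_similarities(Y), Y)[i]
    numeric = np.zeros_like(Y[i])
    for k in range(Y.shape[1]):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[i, k] += eps
        Ym[i, k] -= eps
        numeric[k] = (sne_cost(P, Yp) - sne_cost(P, Ym)) / (2 * eps)
    return analytic, numeric   # the two vectors should agree closely
```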
t-Distributed SNE
Problems of SNE → Fixes in t-SNE
▷ Hard to Optimize → Symmetric Probability
▷ Crowding Problem → Student t-Distribution
SNE vs. Symmetric SNE vs. t-SNE

Prob. in High-D:
▷ SNE: $p_{j|i} = e^{-\|x_i - x_j\|^2/2\sigma_i^2} \big/ \sum_{k \neq i} e^{-\|x_i - x_k\|^2/2\sigma_i^2}$
▷ Symmetric SNE: $p_{ij} = e^{-\|x_i - x_j\|^2/2\sigma^2} \big/ \sum_{k \neq l} e^{-\|x_k - x_l\|^2/2\sigma^2}$
▷ t-SNE: $p_{ij} = \left(p_{j|i} + p_{i|j}\right) / 2n$

Prob. in Low-D:
▷ SNE: $q_{j|i} = e^{-\|y_i - y_j\|^2} \big/ \sum_{k \neq i} e^{-\|y_i - y_k\|^2}$
▷ Symmetric SNE: $q_{ij} = e^{-\|y_i - y_j\|^2} \big/ \sum_{k \neq l} e^{-\|y_k - y_l\|^2}$
▷ t-SNE: $q_{ij} = \left(1 + \|y_i - y_j\|^2\right)^{-1} \big/ \sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}$

Cost Function:
▷ SNE: $C = \sum_i \sum_j p_{j|i} \log \left(p_{j|i} / q_{j|i}\right)$
▷ Symmetric SNE, t-SNE: $C = \sum_i \sum_j p_{ij} \log \left(p_{ij} / q_{ij}\right)$

Gradient of Cost Function:
▷ SNE: $2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)(y_i - y_j)$
▷ Symmetric SNE: $4 \sum_j \left(p_{ij} - q_{ij}\right)(y_i - y_j)$
▷ t-SNE: $4 \sum_j \left(p_{ij} - q_{ij}\right)(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}$
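As a concrete companion to the t-SNE column, a NumPy sketch of the joint $q_{ij}$ and the gradient (names are mine, derived directly from the formulas above):

```python
def tsne_q(Y):
    """Joint q_ij with the Student-t kernel (1 + ||y_i - y_j||^2)^-1, normalized over all pairs."""
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    W = 1.0 / (1.0 + d2)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W                    # q_ij and the unnormalized weights

def tsne_gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1."""
    Q, W = tsne_q(Y)
    diff = Y[:, None, :] - Y[None, :, :]
    return 4.0 * (((P - Q) * W)[:, :, None] * diff).sum(axis=1)
```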
SNE → t-SNE
▷ Hard to Optimize → Symmetric Probability (Simpler Gradient)

$$p_{ij} = \frac{e^{-\|x_i - x_j\|^2/2\sigma^2}}{\sum_{k \neq l} e^{-\|x_k - x_l\|^2/2\sigma^2}} \qquad q_{ij} = \frac{e^{-\|y_i - y_j\|^2}}{\sum_{k \neq l} e^{-\|y_k - y_l\|^2}}$$

Single scale $\sigma^2$ in high-D; both denominators sum over all other pairs $k \neq l$.
But an outlier $x_i$ has very small $p_{ij}$ for every $j$, so it makes almost no contribution to the cost and its map position is poorly determined.
The symmetrized definition fixes this:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

ensures that $\sum_j p_{ij} > \frac{1}{2n}$ for all data points, so every point contributes to the cost.
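In code, the symmetrized joint distribution is one line on top of the conditional matrix from the earlier sketch:

```python
def joint_p(X, sigma):
    Pc = conditional_similarities(X, sigma)   # rows: p_{j|i}
    n = Pc.shape[0]
    return (Pc + Pc.T) / (2.0 * n)            # p_ij = (p_{j|i} + p_{i|j}) / 2n
```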
SNE → t-SNE
▷ Crowding Problem → Student t-Distribution
Slide from H.Lee (MVPLAB)
Solution?
▷ Close Points → Closer
▷ Moderately Distant Points → Farther Away
Use a Student t-Distribution in the Low Dimension
With a Student t-distribution in the low dimension:
▷ High-dimensional data at close range loses probability under the heavy-tailed $q_{ij}$ → must move closer to compensate
▷ High-dimensional data at moderate range gains probability → must move farther away
| High-D | Low-D | $p_{ij}$ | $q_{ij}$ | $(p_{ij} - q_{ij})$ | $(y_i - y_j)$ | Gradient |
|--------|-------|----------|----------|---------------------|---------------|----------|
| Large  | Large | 1        | 1        | 0                   | Large         | 0        |
| Small  | Small | 0        | 0        | 0                   | Small         | 0        |
| Small  | Large | 0        | 1        | −1                  | Large         | Large (Attraction) |
| Large  | Small | 1        | 0        | 1                   | Small         | Small (Repulsion)  |

→ only a small repulsion
Adding slight repulsion (a uniform distribution mixed into $q_{ij}$) is one workaround, but that case is often not hit in practice, since the low-D map is initialized from a Gaussian.
| High-D | Low-D | $p_{ij}$ | $q_{ij}$ | $(p_{ij} - q_{ij})$ | $(y_i - y_j)$ | $(1 + \lVert y_i - y_j \rVert^2)^{-1}$ | Gradient |
|--------|-------|----------|----------|---------------------|---------------|-----------------------------------------|----------|
| Large  | Large | 1        | 1        | 0                   | Large         | Small                                   | 0        |
| Small  | Small | 0        | 0        | 0                   | Small         | Large                                   | 0        |
| Small  | Large | 0        | 1        | −1                  | Large         | Small                                   | Attraction |
| Large  | Small | 1        | 0        | 1                   | Small         | Large                                   | Repulsion (Strong) |
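A two-line numeric look at the extra $(1 + \|y_i - y_j\|^2)^{-1}$ factor: it is near 1 for close map points (so the mismatch term acts at full strength) and vanishes for distant ones:

```python
for d2 in (0.01, 1.0, 100.0):        # squared map distance ||y_i - y_j||^2
    print(d2, 1.0 / (1.0 + d2))      # ~1 when close (strong push), ~0 when far
```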
Effects of t-Distribution
Close Points → Closer
Results & Add-On
Slide from H.Lee (MVPLAB)
Set $\sigma_i^2$ → Calculate $Perp(P_i)$ → Is $Perp(P_i)$ equal to the target Perplexity? (repeat, e.g. by binary search over $\sigma_i$)

$$p_{j|i} = \frac{e^{-\|x_i - x_j\|^2/2\sigma_i^2}}{\sum_{k \neq i} e^{-\|x_i - x_k\|^2/2\sigma_i^2}} \qquad Perp(P_i) = 2^{-\sum_j p_{j|i} \log_2 p_{j|i}}$$

Hyper-parameter: Perplexity
▷ The paper suggests 5–50
▷ Perplexity = a smoothed measure of the effective number of neighbors
▷ Perplexity = balancing between local and global aspects of the data
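The loop "set $\sigma_i^2$ → compute $Perp(P_i)$ → compare with the target" is usually realized as a per-point binary search; a sketch under that assumption (the helper name is mine):

```python
def sigma_for_perplexity(dist2_i, target=30.0, tol=1e-4, n_iter=50):
    """Binary-search sigma_i so that Perp(P_i) = 2^H(P_i) matches the target.
    dist2_i: squared distances from point i to all *other* points."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-dist2_i / (2.0 * sigma ** 2))
        p /= p.sum()
        perp = 2.0 ** (-(p * np.log2(p + 1e-12)).sum())   # 2^entropy
        if abs(perp - target) < tol:
            break
        if perp > target:       # too many effective neighbors -> shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma
```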
Reference
▷ How to Use t-SNE Effectively, https://distill.pub/2016/misread-tsne/
▷ Automatic Selection of t-SNE Perplexity, ICML 2017 AutoML Workshop:

$$S(Perp) = 2\,KL(P \,\|\, Q) + \log(n)\,\frac{Perp}{n}$$
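The criterion $S(Perp)$ can be evaluated with scikit-learn, whose fitted `TSNE` exposes the final KL divergence as `kl_divergence_`; a sketch of scoring a few candidate perplexities:

```python
from sklearn.manifold import TSNE

def select_perplexity(X, candidates=(5, 10, 20, 30, 40, 50), random_state=0):
    """Score S(Perp) = 2*KL(P||Q) + log(n)*Perp/n and return the best candidate."""
    n = X.shape[0]
    scores = {}
    for perp in candidates:
        tsne = TSNE(n_components=2, perplexity=perp, random_state=random_state)
        tsne.fit(X)
        scores[perp] = 2.0 * tsne.kl_divergence_ + np.log(n) * perp / n
    return min(scores, key=scores.get), scores
```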
Thank You