Visualizing Data using t-SNE
Credits
▷ Hyeongmin Lee, MVPLAB, Yonsei Univ
▷ https://www.slideshare.net/ssuser06e0c5/visualizing-data-using-tsne-73621033
t-SNE:
t-Distributed Stochastic Neighbor Embedding
▷ Nonlinear Dimension Reduction for Visualization (2-D or 3-D)
▷ Advanced Version of SNE (G. Hinton & S. Roweis, NIPS 2002)
▷ Gradient-based Machine Learning Algorithm
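Before the details, a minimal usage sketch with scikit-learn's `TSNE` (the random matrix is only a stand-in for real high-dimensional features):

```python
# Minimal t-SNE usage sketch: map n_samples x n_features data to 2-D.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(500, 128)   # 500 samples, 128-D stand-in features
Y = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)
print(Y.shape)                                 # (500, 2): one map point per sample
```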
Dimension Reduction
Real-World Data = Very High Dimension
= 3,145,728 Dimensions per Sample (a 1024×1024×3 ProGAN image)
Manifold Hypothesis – Dimension Reduction
Ref) PR-010, PR-101
Slide from H.Lee (MVPLAB)
History of Dimension Reduction
Slide from H.Lee (MVPLAB)
Linear
▷ Principal Component Analysis (1901)
Non-Linear
▷ Multidimensional Scaling (1964)
▷ Sammon Mapping (1969)
▷ IsoMap (2000)
▷ Locally Linear Embedding (2000)
▷ Stochastic Neighbor Embedding (2002)
Swiss Roll Data
Slide from H.Lee (MVPLAB)
IsoMap
Slide from H.Lee (MVPLAB)
Locally Linear Embedding
Slide from H.Lee (MVPLAB)
Problem?
Good at Local Representation = Poor at Global Representation
Good at Swiss Roll = Poor at Real Data
Stochastic Neighbor Embedding (SNE)
Update Low-Dimensional Mapping
by Considering Pairwise Relations in High-Dimension
Iterative Update with a Cost Function
(Figure: the high-D pairwise similarities act as the label, the low-D similarities as the prediction.)
Distance → Similarity (high dimension, centered on point $i$):

$$p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
Distance → Similarity (low-dimensional map, centered on point $i$):

$$q_{j|i} = \frac{\exp\!\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\!\left(-\|y_i - y_k\|^2\right)}$$
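The two definitions above drop straight into NumPy; a direct, unoptimized sketch (the function names and the O(n²) broadcasting are my own choices, not from the slides):

```python
import numpy as np

def conditional_similarities(X, sigma):
    """p_{j|i}: rows of X are data points; sigma is a length-n vector of bandwidths."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)  # squared pairwise distances
    logits = -d2 / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)                      # exclude k = i
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)                # row i holds p_{.|i}

def low_dim_similarities(Y):
    """q_{j|i}: same construction in the map, with a fixed bandwidth."""
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    E = np.exp(-d2)
    np.fill_diagonal(E, 0.0)
    return E / E.sum(axis=1, keepdims=True)
```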
Cost: for every data point $i$, compare the two distributions with a KL divergence:

$$C = \sum_i KL(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$
$$\frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)\left(y_i - y_j\right)$$
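Given matrices `P` and `Q` with the convention `P[i, j]` = $p_{j|i}$ (as in the sketch above), the gradient and a plain descent step look like this; the original paper's optimizer also adds momentum, which is omitted here:

```python
def sne_gradient(P, Q, Y):
    """dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)."""
    M = (P - Q) + (P - Q).T                 # p_{j|i}-q_{j|i} + p_{i|j}-q_{i|j}
    diff = Y[:, None, :] - Y[None, :, :]    # (n, n, d): y_i - y_j
    return 2.0 * (M[:, :, None] * diff).sum(axis=1)

# one plain gradient-descent step on the map:
# Y -= 0.1 * sne_gradient(P, low_dim_similarities(Y), Y)
```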
$$C = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

KL-Divergence is asymmetric: if the high-D similarity $p_{j|i}$ becomes smaller, the low-D similarity $q_{j|i}$ must become smaller as well for equal cost. A large $p_{j|i}$ modeled by a small $q_{j|i}$ is penalized heavily, while the reverse mismatch is cheap, so SNE mainly preserves local structure.
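A tiny numeric illustration of the asymmetry, with one pair's similarity mismatched in each direction:

```python
import numpy as np

p, q = 0.8, 0.1
print(p * np.log(p / q))   # ~ 1.66: large p modeled by small q -> heavy penalty
print(q * np.log(q / p))   # ~ -0.21: the reverse mismatch barely costs anything
```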
Appendix A: Gradient of SNE

$$C = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}, \qquad q_{j|i} = \frac{e^{-\|y_i - y_j\|^2}}{\sum_{k \neq i} e^{-\|y_i - y_k\|^2}} = \frac{e^{-\|y_i - y_j\|^2}}{Z}$$

(Figure: an $N \times N$ grid of the pairwise terms, with zeros on the diagonal; only row $i$ and column $i$ depend on $y_i$.)

So the $y_i$-dependent part of the cost gives

$$\frac{\partial C}{\partial y_i} = \frac{\partial}{\partial y_i} \Big( -\sum_j p_{j|i} \log q_{j|i} - \sum_j p_{i|j} \log q_{i|j} \Big)$$

For the first sum:

$$\frac{\partial \sum_j p_{j|i} \log q_{j|i}}{\partial y_i} = \sum_j p_{j|i} \frac{\partial \log q_{j|i}}{\partial y_i} = \sum_j p_{j|i} \left( \frac{\partial \log (q_{j|i} Z)}{\partial y_i} - \frac{\partial \log Z}{\partial y_i} \right) = \sum_j p_{j|i} \left( \frac{1}{q_{j|i} Z} \frac{\partial (q_{j|i} Z)}{\partial y_i} - \frac{1}{Z} \frac{\partial Z}{\partial y_i} \right)$$

Since $q_{j|i} Z = e^{-\|y_i - y_j\|^2}$, its derivative is $\frac{\partial (q_{j|i} Z)}{\partial y_i} = e^{-\|y_i - y_j\|^2} A_j$ with $A_j = -2(y_i - y_j)$, so the first term reduces to $\sum_j p_{j|i} A_j$. For the second term, $\frac{1}{Z} \frac{\partial Z}{\partial y_i} = \frac{1}{Z} \sum_{k \neq i} e^{-\|y_i - y_k\|^2} A_k = \sum_k q_{k|i} A_k$, and with $\sum_j p_{j|i} = 1$ the two pieces combine into

$$\sum_j A_j \left(p_{j|i} - q_{j|i}\right) = -2 \sum_j (y_i - y_j)(p_{j|i} - q_{j|i})$$

(The $\sum_j p_{i|j} \log q_{i|j}$ part is handled the same way; summing both contributions gives the gradient $2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$ quoted above.)
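A quick sanity check on the derivation: compare the analytic gradient with central finite differences (a sketch reusing `sne_gradient` and `low_dim_similarities` from earlier):

```python
def sne_cost(P, Y):
    """C = sum_ij p_{j|i} log(p_{j|i} / q_{j|i}) over off-diagonal entries."""
    Q = low_dim_similarities(Y)
    mask = ~np.eye(len(Y), dtype=bool)
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def check_gradient(P, Y, i=0, eps=1e-5):
    """Central finite differences on y_i versus the analytic formula."""
    analytic = sne_gradient(P, low_dim_similarities(Y), Y)[i]
    numeric = np.zeros_like(Y[i])
    for k in range(Y.shape[1]):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[i, k] += eps
        Ym[i, k] -= eps
        numeric[k] = (sne_cost(P, Yp) - sne_cost(P, Ym)) / (2 * eps)
    return analytic, numeric   # the two vectors should agree closely
```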
t-Distributed SNE
Problems of SNE → Fixes in t-SNE
▷ Hard to Optimize → Symmetric Probability
▷ Crowding Problem → Student t-Distribution
SNE vs. Symmetric SNE vs. t-SNE

Prob. in High-D:
▷ SNE: $p_{j|i} = e^{-\|x_i - x_j\|^2/2\sigma_i^2} \big/ \sum_{k \neq i} e^{-\|x_i - x_k\|^2/2\sigma_i^2}$
▷ Symmetric SNE: $p_{ij} = e^{-\|x_i - x_j\|^2/2\sigma^2} \big/ \sum_{k \neq l} e^{-\|x_k - x_l\|^2/2\sigma^2}$
▷ t-SNE: $p_{ij} = \left(p_{j|i} + p_{i|j}\right) / 2n$

Prob. in Low-D:
▷ SNE: $q_{j|i} = e^{-\|y_i - y_j\|^2} \big/ \sum_{k \neq i} e^{-\|y_i - y_k\|^2}$
▷ Symmetric SNE: $q_{ij} = e^{-\|y_i - y_j\|^2} \big/ \sum_{k \neq l} e^{-\|y_k - y_l\|^2}$
▷ t-SNE: $q_{ij} = \left(1 + \|y_i - y_j\|^2\right)^{-1} \big/ \sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}$

Cost Function:
▷ SNE: $C = \sum_i \sum_j p_{j|i} \log \left(p_{j|i} / q_{j|i}\right)$
▷ Symmetric SNE, t-SNE: $C = \sum_i \sum_j p_{ij} \log \left(p_{ij} / q_{ij}\right)$

Gradient of Cost Function:
▷ SNE: $2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)(y_i - y_j)$
▷ Symmetric SNE: $4 \sum_j \left(p_{ij} - q_{ij}\right)(y_i - y_j)$
▷ t-SNE: $4 \sum_j \left(p_{ij} - q_{ij}\right)(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}$
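As a concrete companion to the t-SNE column, a NumPy sketch of the joint $q_{ij}$ and the gradient (names are mine, derived directly from the formulas above):

```python
def tsne_q(Y):
    """Joint q_ij with the Student-t kernel (1 + ||y_i - y_j||^2)^-1, normalized over all pairs."""
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    W = 1.0 / (1.0 + d2)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W                    # q_ij and the unnormalized weights

def tsne_gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1."""
    Q, W = tsne_q(Y)
    diff = Y[:, None, :] - Y[None, :, :]
    return 4.0 * (((P - Q) * W)[:, :, None] * diff).sum(axis=1)
```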
SNE → t-SNE
▷ Hard to Optimize → Symmetric Probability (Simpler Gradient)

$$p_{ij} = \frac{e^{-\|x_i - x_j\|^2/2\sigma^2}}{\sum_{k \neq l} e^{-\|x_k - x_l\|^2/2\sigma^2}} \qquad q_{ij} = \frac{e^{-\|y_i - y_j\|^2}}{\sum_{k \neq l} e^{-\|y_k - y_l\|^2}}$$

Single scale $\sigma^2$ in high-D; both denominators sum over all other pairs $k \neq l$.
But an outlier $x_i$ has very small $p_{ij}$ for every $j$, so it makes almost no contribution to the cost and its map position is poorly determined.
The symmetrized definition fixes this:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

ensures that $\sum_j p_{ij} > \frac{1}{2n}$ for all data points, so every point contributes to the cost.
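In code, the symmetrized joint distribution is one line on top of the conditional matrix from the earlier sketch:

```python
def joint_p(X, sigma):
    Pc = conditional_similarities(X, sigma)   # rows: p_{j|i}
    n = Pc.shape[0]
    return (Pc + Pc.T) / (2.0 * n)            # p_ij = (p_{j|i} + p_{i|j}) / 2n
```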
SNE → t-SNE
▷ Crowding Problem → Student t-Distribution
Slide from H.Lee (MVPLAB)
Solution?
▷ Close Points → Closer
▷ Moderately Distant Points → Farther Away
Use a Student t-Distribution in the Low Dimension
With a Student t-distribution in the low dimension:
▷ High-dimensional data at close range loses probability under the heavy-tailed $q_{ij}$ → must move closer to compensate
▷ High-dimensional data at moderate range gains probability → must move farther away
| High-D | Low-D | $p_{ij}$ | $q_{ij}$ | $(p_{ij} - q_{ij})$ | $(y_i - y_j)$ | Gradient |
|--------|-------|----------|----------|---------------------|---------------|----------|
| Large  | Large | 1        | 1        | 0                   | Large         | 0        |
| Small  | Small | 0        | 0        | 0                   | Small         | 0        |
| Small  | Large | 0        | 1        | −1                  | Large         | Large (Attraction) |
| Large  | Small | 1        | 0        | 1                   | Small         | Small (Repulsion)  |

→ only a small repulsion
Adding slight repulsion (a uniform distribution mixed into $q_{ij}$) is one workaround, but that case is often not hit in practice, since the low-D map is initialized from a Gaussian.
| High-D | Low-D | $p_{ij}$ | $q_{ij}$ | $(p_{ij} - q_{ij})$ | $(y_i - y_j)$ | $(1 + \lVert y_i - y_j \rVert^2)^{-1}$ | Gradient |
|--------|-------|----------|----------|---------------------|---------------|-----------------------------------------|----------|
| Large  | Large | 1        | 1        | 0                   | Large         | Small                                   | 0        |
| Small  | Small | 0        | 0        | 0                   | Small         | Large                                   | 0        |
| Small  | Large | 0        | 1        | −1                  | Large         | Small                                   | Attraction |
| Large  | Small | 1        | 0        | 1                   | Small         | Large                                   | Repulsion (Strong) |
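A two-line numeric look at the extra $(1 + \|y_i - y_j\|^2)^{-1}$ factor: it is near 1 for close map points (so the mismatch term acts at full strength) and vanishes for distant ones:

```python
for d2 in (0.01, 1.0, 100.0):        # squared map distance ||y_i - y_j||^2
    print(d2, 1.0 / (1.0 + d2))      # ~1 when close (strong push), ~0 when far
```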
Effects of t-Distribution
Close Points → Closer
Results & Add-On
Slide from H.Lee (MVPLAB)
Set $\sigma_i^2$ → Calculate $Perp(P_i)$ → Is $Perp(P_i)$ equal to the target Perplexity? (repeat, e.g. by binary search over $\sigma_i$)

$$p_{j|i} = \frac{e^{-\|x_i - x_j\|^2/2\sigma_i^2}}{\sum_{k \neq i} e^{-\|x_i - x_k\|^2/2\sigma_i^2}} \qquad Perp(P_i) = 2^{-\sum_j p_{j|i} \log_2 p_{j|i}}$$

Hyper-parameter: Perplexity
▷ The paper suggests 5–50
▷ Perplexity = a smoothed measure of the effective number of neighbors
▷ Perplexity = balancing between local and global aspects of the data
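The loop "set $\sigma_i^2$ → compute $Perp(P_i)$ → compare with the target" is usually realized as a per-point binary search; a sketch under that assumption (the helper name is mine):

```python
def sigma_for_perplexity(dist2_i, target=30.0, tol=1e-4, n_iter=50):
    """Binary-search sigma_i so that Perp(P_i) = 2^H(P_i) matches the target.
    dist2_i: squared distances from point i to all *other* points."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-dist2_i / (2.0 * sigma ** 2))
        p /= p.sum()
        perp = 2.0 ** (-(p * np.log2(p + 1e-12)).sum())   # 2^entropy
        if abs(perp - target) < tol:
            break
        if perp > target:       # too many effective neighbors -> shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma
```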
Reference
▷ How to Use t-SNE Effectively, https://distill.pub/2016/misread-tsne/
▷ Automatic Selection of t-SNE Perplexity, ICML 2017 AutoML Workshop:

$$S(Perp) = 2\,KL(P \,\|\, Q) + \log(n)\,\frac{Perp}{n}$$
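The criterion $S(Perp)$ can be evaluated with scikit-learn, whose fitted `TSNE` exposes the final KL divergence as `kl_divergence_`; a sketch of scoring a few candidate perplexities:

```python
from sklearn.manifold import TSNE

def select_perplexity(X, candidates=(5, 10, 20, 30, 40, 50), random_state=0):
    """Score S(Perp) = 2*KL(P||Q) + log(n)*Perp/n and return the best candidate."""
    n = X.shape[0]
    scores = {}
    for perp in candidates:
        tsne = TSNE(n_components=2, perplexity=perp, random_state=random_state)
        tsne.fit(X)
        scores[perp] = 2.0 * tsne.kl_divergence_ + np.log(n) * perp / n
    return min(scores, key=scores.get), scores
```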
Thank You