2. Singular Value Decomposition
• Singular Value Decomposition (SVD) is also called Spectral Decomposition
• Instead of using two coordinates (𝒙, 𝒚) to describe point locations, let's use only one coordinate 𝒛
• A point's position is its location along the vector 𝒗𝟏, the first right singular vector
• How to choose 𝒗𝟏? Minimize the reconstruction error
3. Singular Value Decomposition
• Goal: Minimize the sum of reconstruction errors:

$$\sum_{i=1}^{N} \sum_{j=1}^{D} \left( x_{ij} - z_{ij} \right)^2$$

• where $x_{ij}$ are the "old" and $z_{ij}$ are the "new" coordinates
• SVD gives the 'best' axis to project on:
  • 'best' = minimizing the reconstruction errors
  • In other words, the axis with minimum reconstruction error (a NumPy sketch follows)
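To make this concrete, here is a minimal NumPy sketch (the data is synthetic and illustrative): it projects a 2-D point cloud onto the first right singular vector 𝒗𝟏 and measures the sum of squared reconstruction errors, which for 2-D data equals the squared second singular value.

```python
import numpy as np

# Toy 2-D point cloud: points near a line, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0]]) + 0.1 * rng.normal(size=(100, 2))

# Full SVD; the rows of Vt are the right singular vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                        # the 'best' axis to project on

Z = X @ v1                        # one "new" coordinate z per point
X_reconstructed = np.outer(Z, v1)

error = np.sum((X - X_reconstructed) ** 2)
print(f"sum of squared reconstruction errors: {error:.4f}")
print(f"squared second singular value:        {S[1]**2:.4f}")  # equal
```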
5. Singular Value Decomposition
• A = U Σ Vᵀ (full decomposition)
• B = U Σ Vᵀ, keeping only the top singular values in Σ (the rest set to zero)
• B is the best approximation of A (a sketch follows)
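A minimal NumPy sketch of this truncation (the matrix and the rank k are illustrative):

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [3.0, 3.0, 3.0, 0.0, 0.0],
              [4.0, 4.0, 4.0, 0.0, 0.0],
              [5.0, 5.0, 5.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 4.0, 4.0],
              [0.0, 0.0, 0.0, 5.0, 5.0],
              [0.0, 1.0, 0.0, 2.0, 2.0]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                    # number of singular values to retain
B = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]   # rank-k approximation of A

print(np.linalg.norm(A - B, "fro"))      # the approximation error
```

By the Eckart-Young theorem, no other rank-k matrix has a smaller Frobenius-norm error, which is the sense in which B is "best".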
How Many Singular Values Should We Retain?
• A useful rule of thumb is to retain enough singular values to make up 90% of the energy in Σ.
• That is, the sum of the squares of the retained singular values should be at least 90% of the sum of the squares of all the singular values.
• Example: the total energy is (12.4)² + (9.5)² + (1.3)² = 245.70, while the retained energy is (12.4)² + (9.5)² = 244.01. We have retained over 99% of the energy. However, were we to eliminate the second singular value, 9.5, the retained energy would be only (12.4)²/245.70, or about 63%. (The code sketch below automates this check.)
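The rule is easy to automate; here is a short sketch (the function name is my own) using the singular values from the example above:

```python
import numpy as np

def num_values_to_retain(singular_values, energy_threshold=0.90):
    """Smallest k whose top-k singular values hold >= the threshold fraction of energy."""
    energies = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energies) / energies.sum()
    return int(np.searchsorted(cumulative, energy_threshold) + 1)

sigma = [12.4, 9.5, 1.3]
print(num_values_to_retain(sigma))  # -> 2: (12.4)^2 + (9.5)^2 is ~99% of 245.70
```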
6. Relation to Eigen-decomposition
• SVD gives us:
  • A = U Σ Vᵀ
• Eigen-decomposition:
  • A = X Λ Xᵀ
  • A is symmetric
  • U, V, X are orthonormal (UᵀU = I)
  • Λ, Σ are diagonal
• Now let's calculate:
  • AAᵀ = UΣVᵀ(UΣVᵀ)ᵀ = UΣVᵀ(VΣᵀUᵀ) = UΣΣᵀUᵀ
  • AᵀA = VΣᵀUᵀ(UΣVᵀ) = VΣᵀΣVᵀ
• Both products have the eigen-decomposition form X Λ² Xᵀ: the columns of U are the eigenvectors of AAᵀ, the columns of V are the eigenvectors of AᵀA, and the eigenvalues are the squared singular values.
• This shows how to compute the SVD using eigenvalue decomposition (a NumPy sketch follows)!
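As a sketch of that recipe (assuming A has full column rank, so every singular value is nonzero; the names are illustrative): eigen-decompose AᵀA to get V and ΣᵀΣ, then recover U = AVΣ⁻¹.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))              # any full-column-rank matrix works

# Eigen-decompose the symmetric matrix A^T A = V (Σ^T Σ) V^T.
eigvals, V = np.linalg.eigh(A.T @ A)

# eigh returns eigenvalues in ascending order; SVD convention is descending.
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

sigma = np.sqrt(eigvals)                 # singular values = sqrt of eigenvalues of A^T A
U = (A @ V) / sigma                      # valid when all singular values are nonzero

print(np.allclose(A, U @ np.diag(sigma) @ V.T))  # True: A = U Σ V^T
```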
9. Why do we need dimensionality reduction?
• You may need to visualize the data for non-technical board members, who are probably not familiar with terms like cosine similarity.
• You may be given a constraint, such as preserving 80% of the energy in the data.
• You may need to reduce both the data you have and any new data as it arrives; which method would you choose?
10. Non-Linear Dimensionality Reduction
• Given a low-dimensional surface embedded non-linearly in a high-dimensional space. Such a surface is called a manifold.
• A good way to represent data points is by their low-dimensional coordinates.
• The low-dimensional representation of the data should capture information about high-dimensional pairwise distances.
• Non-linear dimensionality reduction is also called manifold learning.
• Idea: recover the low-dimensional surface (see the sketch after this list).
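To illustrate, here is a sketch using scikit-learn's Isomap, one concrete manifold-learning method (the slide does not commit to a specific algorithm), to "unroll" the classic swiss-roll surface:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A 2-D surface rolled up non-linearly into 3-D: a classic manifold example.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Recover 2-D coordinates that approximately preserve pairwise distances
# measured *along the surface* (geodesic distances), not straight-line distances.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(X.shape, "->", embedding.shape)  # (1000, 3) -> (1000, 2)
```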
14. Stochastic Neighbor Embedding (SNE)
• High-dimensional space:

$$P_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

• Low-dimensional space (2-D), with $\sigma = \frac{1}{\sqrt{2\pi}}$:

$$Q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)}$$

• Minimization function:

$$\min_{y_i,\,y_j} KL(P\|Q) = \sum_j P_{j|i} \log \frac{P_{j|i}}{Q_{j|i}}$$
1. A large $P_{j|i}$ modeled by a low $Q_{j|i}$ → high cost.
2. A small $P_{j|i}$ modeled by a high $Q_{j|i}$ → low cost.

1. SNE is not symmetric, whereas t-SNE is symmetric.
2. Symmetry makes t-SNE fast (a NumPy sketch of the SNE quantities follows).
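A minimal NumPy sketch of these quantities (using a single fixed σ for all points for simplicity; real SNE tunes a per-point σᵢ via a perplexity parameter):

```python
import numpy as np

def conditional_probs(points, sigma=1.0):
    """P_{j|i}: Gaussian neighbor probabilities, one row per point i."""
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)          # exclude k = i from the sum
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                  # high-dimensional points x_i
Y = rng.normal(size=(50, 2))                   # low-dimensional map points y_i

P = conditional_probs(X, sigma=1.0)
Q = conditional_probs(Y, sigma=np.sqrt(0.5))   # 2*sigma^2 = 1, so exp(-||y_i - y_j||^2)

# KL(P || Q): the SNE cost to be minimized over the y_i (e.g., by gradient descent).
mask = ~np.eye(len(X), dtype=bool)             # skip the zero diagonal entries
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(kl)
```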
15. t-Distributed Stochastic Neighbor Embedding (t-SNE)
• High-dimensional space:

$$P_{ij} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma^2\right)}$$

• Low-dimensional space (2-D):

$$Q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq i} \left(1 + \|y_i - y_k\|^2\right)^{-1}}$$
1. The t-distribution has heavier tails, so more points from the high-dimensional space can be embedded faithfully in the low dimension.
2. There are some heuristics underlying t-SNE.
3. It develops an intuition for what's going on in the high-dimensional data.
4. It finds structure where other dimensionality-reduction algorithms cannot (a usage sketch follows).
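As a usage sketch with scikit-learn's TSNE implementation (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional handwritten-digit images, embedded into a 2-D map.
digits = load_digits()

tsne = TSNE(n_components=2,    # low-dimensional space (2-D)
            perplexity=30.0,   # effective neighborhood size; determines each sigma_i
            random_state=0)
Y = tsne.fit_transform(digits.data)

print(digits.data.shape, "->", Y.shape)  # (1797, 64) -> (1797, 2)
```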