2. Singular Value Decomposition
• Singular Value Decomposition (SVD) is also called Spectral Decomposition
• Instead of using two coordinates (𝒙, 𝒚) to describe point locations, let's use only one coordinate 𝒛
• A point's position is its location along the vector 𝒗𝟏, the first right singular vector
• How to choose 𝒗𝟏? Minimize the reconstruction error
3. Singular Value Decomposition
• Goal: Minimize the sum of reconstruction errors:

$$\sum_{i=1}^{N} \sum_{j=1}^{D} \left( x_{ij} - z_{ij} \right)^2$$

• where $x_{ij}$ are the "old" and $z_{ij}$ are the "new" coordinates
• SVD gives the 'best' axis to project on:
  • 'best' = minimizing the reconstruction errors
  • In other words, the axis with minimum reconstruction error (a NumPy sketch follows)
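To make this concrete, here is a minimal NumPy sketch (the data is synthetic and illustrative): it projects a 2-D point cloud onto the first right singular vector 𝒗𝟏 and measures the sum of squared reconstruction errors, which for 2-D data equals the squared second singular value.

```python
import numpy as np

# Toy 2-D point cloud: points near a line, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0]]) + 0.1 * rng.normal(size=(100, 2))

# Full SVD; the rows of Vt are the right singular vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                        # the 'best' axis to project on

Z = X @ v1                        # one "new" coordinate z per point
X_reconstructed = np.outer(Z, v1)

error = np.sum((X - X_reconstructed) ** 2)
print(f"sum of squared reconstruction errors: {error:.4f}")
print(f"squared second singular value:        {S[1]**2:.4f}")  # equal
```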
5. Singular Value Decomposition
• A = U Σ Vᵀ (full decomposition)
• B = U Σ Vᵀ, keeping only the top singular values in Σ (the rest set to zero)
• B is the best approximation of A (a sketch follows)
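A minimal NumPy sketch of this truncation (the matrix and the rank k are illustrative):

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [3.0, 3.0, 3.0, 0.0, 0.0],
              [4.0, 4.0, 4.0, 0.0, 0.0],
              [5.0, 5.0, 5.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 4.0, 4.0],
              [0.0, 0.0, 0.0, 5.0, 5.0],
              [0.0, 1.0, 0.0, 2.0, 2.0]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                    # number of singular values to retain
B = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]   # rank-k approximation of A

print(np.linalg.norm(A - B, "fro"))      # the approximation error
```

By the Eckart-Young theorem, no other rank-k matrix has a smaller Frobenius-norm error, which is the sense in which B is "best".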
How Many Singular Values Should We Retain?
• A useful rule of thumb is to retain enough singular values to make up 90% of the energy in Σ.
• That is, the sum of the squares of the retained singular values should be at least 90% of the sum of the squares of all the singular values.
• Example: the total energy is (12.4)² + (9.5)² + (1.3)² = 245.70, while the retained energy is (12.4)² + (9.5)² = 244.01. We have retained over 99% of the energy. However, were we to eliminate the second singular value, 9.5, the retained energy would be only (12.4)²/245.70, or about 63%. (The code sketch below automates this check.)
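The rule is easy to automate; here is a short sketch (the function name is my own) using the singular values from the example above:

```python
import numpy as np

def num_values_to_retain(singular_values, energy_threshold=0.90):
    """Smallest k whose top-k singular values hold >= the threshold fraction of energy."""
    energies = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energies) / energies.sum()
    return int(np.searchsorted(cumulative, energy_threshold) + 1)

sigma = [12.4, 9.5, 1.3]
print(num_values_to_retain(sigma))  # -> 2: (12.4)^2 + (9.5)^2 is ~99% of 245.70
```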
6. Relation to Eigen-decomposition
• SVD gives us:
  • A = U Σ Vᵀ
• Eigen-decomposition:
  • A = X Λ Xᵀ
  • A is symmetric
  • U, V, X are orthonormal (UᵀU = I)
  • Λ, Σ are diagonal
• Now let's calculate:
  • AAᵀ = UΣVᵀ(UΣVᵀ)ᵀ = UΣVᵀ(VΣᵀUᵀ) = UΣΣᵀUᵀ
  • AᵀA = VΣᵀUᵀ(UΣVᵀ) = VΣᵀΣVᵀ
• Both products have the eigen-decomposition form X Λ² Xᵀ: the columns of U are the eigenvectors of AAᵀ, the columns of V are the eigenvectors of AᵀA, and the eigenvalues are the squared singular values.
• This shows how to compute the SVD using eigenvalue decomposition (a NumPy sketch follows)!
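As a sketch of that recipe (assuming A has full column rank, so every singular value is nonzero; the names are illustrative): eigen-decompose AᵀA to get V and ΣᵀΣ, then recover U = AVΣ⁻¹.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))              # any full-column-rank matrix works

# Eigen-decompose the symmetric matrix A^T A = V (Σ^T Σ) V^T.
eigvals, V = np.linalg.eigh(A.T @ A)

# eigh returns eigenvalues in ascending order; SVD convention is descending.
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

sigma = np.sqrt(eigvals)                 # singular values = sqrt of eigenvalues of A^T A
U = (A @ V) / sigma                      # valid when all singular values are nonzero

print(np.allclose(A, U @ np.diag(sigma) @ V.T))  # True: A = U Σ V^T
```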
9. Why do we need dimensionality reduction?
• You may need to visualize the data for non-technical board members, who are probably not familiar with terms like cosine similarity.
• You may be given a constraint, such as preserving 80% of the energy in the data.
• You may need to reduce both the data you have and any new data as it arrives; which method would you choose?
10. Non-Linear Dimensionality Reduction
• Given a low-dimensional surface embedded non-linearly in a high-dimensional space. Such a surface is called a manifold.
• A good way to represent data points is by their low-dimensional coordinates.
• The low-dimensional representation of the data should capture information about high-dimensional pairwise distances.
• Non-linear dimensionality reduction is also called manifold learning.
• Idea: recover the low-dimensional surface (see the sketch after this list).
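To illustrate, here is a sketch using scikit-learn's Isomap, one concrete manifold-learning method (the slide does not commit to a specific algorithm), to "unroll" the classic swiss-roll surface:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A 2-D surface rolled up non-linearly into 3-D: a classic manifold example.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Recover 2-D coordinates that approximately preserve pairwise distances
# measured *along the surface* (geodesic distances), not straight-line distances.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(X.shape, "->", embedding.shape)  # (1000, 3) -> (1000, 2)
```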
14. Stochastic Neighbor Embedding (SNE)
• High-dimensional space:

$$P_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

• Low-dimensional space (2-D), with $\sigma = \frac{1}{\sqrt{2\pi}}$:

$$Q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)}$$

• Minimization function:

$$\min_{y_i,\,y_j} KL(P\|Q) = \sum_j P_{j|i} \log \frac{P_{j|i}}{Q_{j|i}}$$
1. A large $P_{j|i}$ modeled by a low $Q_{j|i}$ → high cost.
2. A small $P_{j|i}$ modeled by a high $Q_{j|i}$ → low cost.

1. SNE is not symmetric, whereas t-SNE is symmetric.
2. Symmetry makes t-SNE fast (a NumPy sketch of the SNE quantities follows).
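A minimal NumPy sketch of these quantities (using a single fixed σ for all points for simplicity; real SNE tunes a per-point σᵢ via a perplexity parameter):

```python
import numpy as np

def conditional_probs(points, sigma=1.0):
    """P_{j|i}: Gaussian neighbor probabilities, one row per point i."""
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)          # exclude k = i from the sum
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                  # high-dimensional points x_i
Y = rng.normal(size=(50, 2))                   # low-dimensional map points y_i

P = conditional_probs(X, sigma=1.0)
Q = conditional_probs(Y, sigma=np.sqrt(0.5))   # 2*sigma^2 = 1, so exp(-||y_i - y_j||^2)

# KL(P || Q): the SNE cost to be minimized over the y_i (e.g., by gradient descent).
mask = ~np.eye(len(X), dtype=bool)             # skip the zero diagonal entries
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(kl)
```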
15. t-Distributed Stochastic Neighbor Embedding (t-SNE)
• High-dimensional space:

$$P_{ij} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma^2\right)}$$

• Low-dimensional space (2-D):

$$Q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq i} \left(1 + \|y_i - y_k\|^2\right)^{-1}}$$
1. The t-distribution has heavier tails, so more points from the high-dimensional space can be embedded faithfully in the low dimension.
2. There are some heuristics underlying t-SNE.
3. It develops an intuition for what's going on in the high-dimensional data.
4. It finds structure where other dimensionality-reduction algorithms cannot (a usage sketch follows).
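As a usage sketch with scikit-learn's TSNE implementation (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional handwritten-digit images, embedded into a 2-D map.
digits = load_digits()

tsne = TSNE(n_components=2,    # low-dimensional space (2-D)
            perplexity=30.0,   # effective neighborhood size; determines each sigma_i
            random_state=0)
Y = tsne.fit_transform(digits.data)

print(digits.data.shape, "->", Y.shape)  # (1797, 64) -> (1797, 2)
```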