This is documentation from a study meeting in our lab.
The book is "Hands-On Machine Learning with Scikit-Learn and TensorFlow", and this covers Chapter 8 (Dimensionality Reduction).
3. Why should we think about this topic?
• Machine Learning problems often involve thousands or even millions of features for each training instance.
• Problems caused by the Curse of Dimensionality:
• Training becomes extremely slow
• It becomes much harder to find a good solution
• For example, we often get much information from data visualization, but this is difficult in high dimensionality.
• Much more training data is needed
4. Reducing the number of features
• Fortunately, it is often possible to reduce the number of features considerably.
• If we can reduce the dimensionality without losing the information needed for the task:
• Training becomes faster
• It becomes much easier to find a good solution
• Less training data is needed to solve the task
5. MNIST: Example of Reducing Dimension
For the classification task, many pixels are utterly unimportant: pixels on the image borders are almost always white, so we can drop those dimensions without losing information.
Moreover, two neighboring pixels are often highly correlated; if we merge them into a single pixel, we will not lose much information.
6. Reducing Dimension for Visualization
Dimensionality reduction is also extremely useful for data visualization.
1. Can you understand what is going on in this data? (42 dimensions)
2. Reducing the number of dimensions down to two makes it possible to plot a high-dimensional training set on a graph.
7. Contents
• The Curse of Dimensionality
• Main Approaches for Dimensionality Reduction
• Projection
• Manifold Learning
• PCA
• Preserving the Variance
• Principal Components
• Projecting Down to d Dimensions
• Using Scikit-Learn
• Explained Variance Ratio
• Choosing the Right Number of Dimensions
• PCA for Compression
• Incremental PCA
• Randomized PCA
• Kernel PCA
• Selecting a Kernel and Tuning Hyperparameters
• LLE
• Other Dimensionality Reduction Techniques
• MDS
• SOM
• Isomap
• t-SNE
8. Contents (the same outline, repeated as a section divider; next: The Curse of Dimensionality)
9. The Curse of Dimensionality
Even a basic 4D hypercube is incredibly hard to picture, let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space. We live in three dimensions, so our intuition fails us when we try to imagine a high-dimensional space.
11. Example of our intuition failing
• Let's think about picking a random point in a unit square (a 1 × 1 square).
• There is only about a 0.4% chance of it being located less than 0.001 from a border.
• What happens in a 10,000-dimensional unit hypercube?
• This probability is greater than 99.999999%.
• Most points in a high-dimensional hypercube are very close to the border.
• This is quite counterintuitive.
12. Example of our intuition failing
• Let's think about another example.
• Pick two points randomly in a unit square.
• The distance between these two points will be, on average, roughly 0.52.
• But what about two points picked randomly in a 1,000,000-dimensional hypercube?
• The average distance, believe it or not, will be about 408.25! (See the sketch below.)
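A rough Monte Carlo check of both examples is sketched below (not from the slides; sample sizes are kept small so it runs quickly, so the estimates are approximate):

    import numpy as np

    rng = np.random.default_rng(42)

    def border_prob(dim, n=100_000, eps=1e-3):
        # Fraction of random points lying within eps of some border of the unit hypercube.
        pts = rng.random((n, dim))
        return ((pts < eps) | (pts > 1 - eps)).any(axis=1).mean()

    def avg_distance(dim, n=1_000):
        # Average distance between two random points in the unit hypercube.
        a, b = rng.random((n, dim)), rng.random((n, dim))
        return np.linalg.norm(a - b, axis=1).mean()

    print(border_prob(2))                # about 0.004 (0.4% in the unit square)
    print(border_prob(10_000, n=1_000))  # about 1.0 (almost every point is near a border)
    print(avg_distance(2))               # about 0.52
    print(avg_distance(1_000_000, n=5))  # about 408.25, i.e. roughly sqrt(1e6 / 6)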
13. Need more data in High Dimension
• These examples mean that a new instance will likely be far away from any training instance.
• This makes predictions much less reliable than in
lower dimensions, since they will be based on much
larger extrapolations.
• In short, the more dimensions the training set
has, the greater the risk of overfitting it.
• So, we need more data.
14. The solution to the Curse of Dimensionality
• In theory, increase the size of the training set to reach a sufficient density of training instances.
• Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions.
15. Contents (section divider; next: Main Approaches for Dimensionality Reduction)
16. Main Approaches for Dimensionality Reduction
• There are two main approaches to reducing dimensionality:
• Projection
• Manifold Learning
17. Projection
• In most problems, training instances are not spread out uniformly across all dimensions.
• As discussed earlier for MNIST.
• In the figure, a 3D dataset is represented by circles; all the training instances actually lie within (or close to) a much lower-dimensional subspace.
18. 3D → 2D
The 3D dataset is represented by the circles. If we project every training instance perpendicularly onto the lower-dimensional subspace, we get a new 2D dataset. We have just reduced the dataset's dimensionality from 3D to 2D! The axes correspond to the new features z1 and z2.
19. Projection is not always the best approach
In many cases the subspace may twist and turn, as in the famous Swiss roll toy dataset.
20. Projection is not always the best approach
Simply projecting the Swiss roll onto a plane (e.g., by dropping x3) would squash its different layers together.
21. Projection is not always the best approach
Rather than dropping x3 and squashing the layers together, what you really want is to unroll the Swiss roll and obtain a flat 2D dataset (a sketch of generating this dataset follows).
More examples: https://goo.gl/7ILsqR
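A minimal sketch of generating the Swiss roll toy dataset shown in these figures (the parameters are illustrative):

    from sklearn.datasets import make_swiss_roll

    X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
    # X has shape (1000, 3). Dropping x3 (keeping X[:, :2]) squashes the layers together,
    # whereas "unrolling" corresponds to plotting t (the position along the roll) against X[:, 1].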
22. Manifold
• What is a Manifold?
• A d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane.
• A 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space.
Example of a 2D manifold: the Swiss roll (d = 2 and n = 3)
23. Manifold Learning
• What is Manifold Learning?
• Modeling the manifold on which the training instances lie.
• It relies on the manifold assumption (also called the manifold hypothesis):
• Most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.
24. Once again, the MNIST example
• Handwritten digit images have some similarities:
• Connected lines
• White borders
• They are more or less centered, etc.
• These constraints tend to squeeze the dataset into a lower-dimensional manifold.
25. Manifold assumption
• The manifold assumption is often accompanied by another implicit assumption:
• The task at hand will be simpler if expressed in the lower-dimensional space of the manifold.
In the figure, the dataset can be split into two classes: in the original 3D space the decision boundary would be fairly complex, but in the unrolled 2D manifold space the decision boundary is a simple straight line.
26. This assumption does not always hold
For example, if the decision boundary is simply x1 = 5, it looks very simple in the original 3D space, but it looks more complex in the unrolled manifold.
27. Only the data knows the best way of Dimensionality Reduction
Reducing the dimensionality of your training set before training a model will definitely speed up training, but it may not always lead to a better or simpler solution: it all depends on the dataset.
28. Contents (section divider; next: PCA)
29. Principal Component Analysis (PCA)
• PCA is the most popular dimensionality reduction algorithm.
• PCA has two steps:
1. It identifies the hyperplane that lies closest to the data.
2. It projects the data onto it.
30. Preserving the Variance
Before projecting the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane. The figure shows the projection of the dataset onto each of several candidate axes. If you select the axis that preserves the maximum variance, it will most likely lose less information.
31. Another way to choose the axis
• Another way is to choose the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis.
32. Contents (section divider; next: Principal Components)
33. PCA identifies the axis
PCA identifies the axis that accounts for the
largest amount of variance.
It also finds a second axis that is orthogonal to the
first one and accounts for the largest amount of
remaining variance.
34. Principal Components
The unit vector that defines the i-th axis is called the i-th principal component (PC): the 1st PC defines the 1st axis, the 2nd PC defines the 2nd axis, and so on.
So how can you find the principal components of a training set?
35. Singular Value Decomposition (SVD)
SVD can decompose the training set matrix X into the matrix product of three matrices, U Σ V^T, where V^T contains all the principal components.
Python code for SVD (sketch below):
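A sketch of the SVD code the slide refers to, along the lines of the book's example (X is assumed to be the training set as a NumPy array):

    import numpy as np

    X_centered = X - X.mean(axis=0)        # PCA assumes the data is centered
    U, s, Vt = np.linalg.svd(X_centered)   # decompose the centered training set
    c1 = Vt.T[:, 0]                        # 1st principal component
    c2 = Vt.T[:, 1]                        # 2nd principal component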
36. Contents (section divider; next: Projecting Down to d Dimensions)
37. Projecting Down to 𝑑 Dimensions
You can reduce the dimensionality down to 𝑑
dimensions by projecting it onto the
hyperplane defined by the first 𝒅 principal
components.
38. Projecting Down to d Dimensions
W_d is defined as the matrix containing the first d principal components.
To project the training set onto the hyperplane, you can simply compute the matrix multiplication X_d-proj = X W_d.
The following Python code projects the training set onto the plane defined by the first two principal components (sketch below):
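A sketch of that projection code, continuing from the SVD snippet above (X_centered and Vt come from there):

    W2 = Vt.T[:, :2]            # W_d with d = 2: the first two principal components
    X2D = X_centered.dot(W2)    # X_d-proj = X W_d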
39. Contents (section divider; next: Using Scikit-Learn)
40. Using Scikit-Learn
Scikit-Learn's PCA class implements PCA using SVD decomposition, just like we did before. (Note: it automatically takes care of centering the data.)
After fitting the PCA transformer, you can access the principal components via the components_ variable (sketch below).
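A minimal sketch of the Scikit-Learn usage described above (X is assumed to be the training set):

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)
    X2D = pca.fit_transform(X)   # centering is handled automatically
    print(pca.components_)       # one row per principal component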
41. Contents (section divider; next: Explained Variance Ratio)
42. Explained Variance Ratio
The explained variance ratio indicates the proportion of the dataset's variance that lies along the axis of each principal component. For example, 84.2% of the dataset's variance lies along the first axis, and 14.6% lies along the second axis.
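Continuing the sketch above, the ratios are exposed as an attribute; the example values correspond to the 84.2% / 14.6% figures quoted on this slide:

    print(pca.explained_variance_ratio_)   # e.g. roughly [0.842, 0.146] for the book's 3D dataset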
43. Contents (section divider; next: Choosing the Right Number of Dimensions)
44. Choosing the Right Number of
Dimensions
• Generally, it is preferable to choose the
number of dimensions that add up to a
sufficiently large portion of the variance
(e.g., 95%).
47. Sample Code
Computing PCA without reducing dimensionality, then computing the minimum number of dimensions required to preserve 95% of the training set's variance (sketch below).
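A sketch of the sample code described above (X_train is assumed to be the training set):

    import numpy as np
    from sklearn.decomposition import PCA

    pca = PCA()                    # PCA without reducing dimensionality
    pca.fit(X_train)
    cumsum = np.cumsum(pca.explained_variance_ratio_)
    d = np.argmax(cumsum >= 0.95) + 1   # minimum number of dimensions preserving 95% of the variance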
48. Sample Code
There is a much better way: you can set n_components to a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve (sketch below).
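A sketch of the float-valued n_components usage:

    pca = PCA(n_components=0.95)            # keep enough components to preserve 95% of the variance
    X_reduced = pca.fit_transform(X_train)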
49. Plot the explained variance
Another option is to plot the explained variance as a function of the number of dimensions (sketch below). There is usually an elbow, where the explained variance stops growing fast. You can think of the elbow point as the intrinsic dimensionality of the dataset.
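A sketch of plotting the explained variance curve (cumsum comes from the earlier snippet):

    import matplotlib.pyplot as plt

    plt.plot(cumsum)
    plt.xlabel("Number of dimensions")
    plt.ylabel("Cumulative explained variance")
    plt.show()   # look for the elbow where the curve stops growing fast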
50. Contents (section divider; next: PCA for Compression)
51. PCA for Compression
Example: applying PCA to the MNIST dataset while preserving 95% of its variance. Each instance will have just over 150 features, instead of the original 784 features, so the dataset is now less than 20% of its original size! Obviously, after dimensionality reduction the training set takes up much less space. This is a reasonable compression ratio, and it can speed up a classification algorithm tremendously.
52. Decompress the reduced dataset
You can also decompress the reduced dataset back to the original number of dimensions by applying the inverse transformation of the PCA projection: X_recovered = X_d-proj W_d^T (sketch below).
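A sketch of compressing MNIST and decompressing it again (154 components matches the "just over 150 features" mentioned above):

    pca = PCA(n_components=154)
    X_reduced = pca.fit_transform(X_train)          # compress 784 features down to 154
    X_recovered = pca.inverse_transform(X_reduced)  # decompress back to 784 features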
53. Contents (section divider; next: Incremental PCA)
54. Incremental PCA (IPCA)
• One problem with the preceding implementation of PCA:
• It requires the whole training set to fit in memory in order to run the SVD.
• Fortunately, Incremental PCA (IPCA) algorithms have been developed:
• Split the training set into mini-batches
• Feed an IPCA algorithm one mini-batch at a time (sketch below)
• This is useful for large training sets, and also to apply PCA online.
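A sketch of Incremental PCA on mini-batches (n_batches and the 154 components are illustrative):

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    n_batches = 100
    inc_pca = IncrementalPCA(n_components=154)
    for X_batch in np.array_split(X_train, n_batches):
        inc_pca.partial_fit(X_batch)     # feed one mini-batch at a time
    X_reduced = inc_pca.transform(X_train)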
58. Another Sample Code
NumPy's memmap class allows you to manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs in memory, when it needs it (sketch below).
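A sketch of combining memmap with Incremental PCA (filename, m and n are placeholders for a dataset previously saved to disk as a float32 binary file):

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))

    batch_size = m // 100
    inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
    inc_pca.fit(X_mm)    # only the chunks it needs are loaded into memory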
59. Contents (section divider; next: Randomized PCA)
60. Randomized PCA (RPCA)
This is a stochastic algorithm that quickly finds an approximation of the first d principal components (sketch below).
Computational complexity:
• PCA: O(m × n²) + O(n³)
• RPCA: O(m × d²) + O(d³)
It is dramatically faster than full SVD when d is much smaller than n.
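A sketch of Randomized PCA in Scikit-Learn (the randomized solver is requested via svd_solver):

    from sklearn.decomposition import PCA

    rnd_pca = PCA(n_components=154, svd_solver="randomized")
    X_reduced = rnd_pca.fit_transform(X_train)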
61. Contents (section divider; next: Kernel PCA)
62. Kernel Trick
• Kernel Trick
• A mathematical technique that implicitly maps
instances into a very high-dimensional space
• A linear decision boundary in the high-dimensional
feature space corresponds to a complex nonlinear
decision boundary in the original space.
63. Kernel PCA (kPCA)
The kernel trick makes it possible to perform complex nonlinear projections for dimensionality reduction; this is called Kernel PCA (kPCA). It is often good at preserving clusters of instances after projection (sketch below).
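A sketch of Kernel PCA with an RBF kernel (the gamma value is illustrative):

    from sklearn.decomposition import KernelPCA

    rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
    X_reduced = rbf_pca.fit_transform(X)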
65. Selecting a Kernel and Tuning Hyperparameters
• As kPCA is an unsupervised learning algorithm:
• There is no obvious performance measure to select the best kernel and hyperparameters.
• However, dimensionality reduction is often a preparation step for a supervised learning task.
66. Grid Search
You can simply use grid search to select the kernel and hyperparameters that lead to the best performance on that supervised task. The best kernel and hyperparameters are then available (via best_params_). A sketch of such a pipeline follows.
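A sketch of such a grid-search pipeline (the parameter ranges are illustrative; X and y are assumed to be a labeled training set):

    import numpy as np
    from sklearn.decomposition import KernelPCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    clf = Pipeline([
        ("kpca", KernelPCA(n_components=2)),    # nonlinear dimensionality reduction
        ("log_reg", LogisticRegression()),      # supervised classifier on top
    ])
    param_grid = [{
        "kpca__gamma": np.linspace(0.03, 0.05, 10),
        "kpca__kernel": ["rbf", "sigmoid"],
    }]
    grid_search = GridSearchCV(clf, param_grid, cv=3)
    grid_search.fit(X, y)
    print(grid_search.best_params_)   # best kernel and gamma for the end-to-end task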
67. Selecting a Kernel and Tuning Hyperparameters with the Lowest Reconstruction Error
• Another approach is to select the kernel and hyperparameters that yield the lowest reconstruction error.
• This time the selection is entirely unsupervised.
• However, reconstruction is not as easy as with linear PCA.
68. Example: Reconstruction is not easy
Figure panels: the original Swiss roll 3D dataset; the resulting 2D dataset after kPCA is applied using an RBF kernel; the dataset mapped to an infinite-dimensional feature space by the kernel trick; the reconstruction pre-image back in the original space.
69. Example: Reconstruction is not easy
In the diagram, the reconstruction by kernel PCA lives in the feature space; we would like to calculate the reconstruction error from this reconstructed point.
70. Example: Reconstruction is not easy
Since the feature space is infinite-dimensional, we cannot compute the reconstructed point, and therefore we cannot compute the true reconstruction error.
71. Example: Reconstruction is not easy
Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This is called the reconstruction pre-image.
72. Example: Reconstruction is not easy
Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error. (See the sketch below.)
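A sketch of measuring the reconstruction pre-image error (the gamma value is illustrative; fit_inverse_transform=True trains the pre-image mapping):

    from sklearn.decomposition import KernelPCA
    from sklearn.metrics import mean_squared_error

    rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                        fit_inverse_transform=True)
    X_reduced = rbf_pca.fit_transform(X)
    X_preimage = rbf_pca.inverse_transform(X_reduced)
    print(mean_squared_error(X, X_preimage))   # choose the kernel/hyperparameters minimizing this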
73. Contents (section divider; next: LLE)
74. Locally Linear Embedding (LLE)
• LLE is a powerful nonlinear dimensionality reduction technique.
• It is a Manifold Learning technique that does not rely on projections like the previous algorithms.
• This makes it good at unrolling twisted manifolds,
• especially when there is not too much noise (sketch below).
Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.
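A sketch of LLE with Scikit-Learn (n_neighbors=10 is illustrative; X could be the Swiss roll dataset from earlier):

    from sklearn.manifold import LocallyLinearEmbedding

    lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
    X_reduced = lle.fit_transform(X)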
77. How LLE works
1. First, the algorithm identifies the k closest neighbors of each training instance x^(i).
• Find the weights w_i,j such that the squared distance between x^(i) and ∑_{j=1}^{m} w_i,j x^(j) is as small as possible.
• w_i,j = 0 if x^(j) is not one of the k closest neighbors of x^(i).
2. Then, it tries to reconstruct x^(i) as a linear function of these neighbors.
78. How LLE works
(The same two steps, repeated with step 1 highlighted: identify the k closest neighbors of each training instance x^(i) and find the weights w_i,j.)
79. How LLE works
• Here is the first step of LLE in detail.
• The first step of LLE is the constrained optimization problem described in Equation 8-4 (reconstructed below), where W is the weight matrix containing all the weights w_i,j.
• The second constraint simply normalizes the weights for each training instance x^(i).
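The constrained optimization problem of the first step, as given in the book (Equation 8-4), written in LaTeX:

    \hat{W} = \underset{W}{\operatorname{argmin}} \sum_{i=1}^{m} \Bigl\| x^{(i)} - \sum_{j=1}^{m} w_{i,j}\, x^{(j)} \Bigr\|^2
    \text{subject to } \begin{cases} w_{i,j} = 0 & \text{if } x^{(j)} \text{ is not one of the } k \text{ closest neighbors of } x^{(i)} \\ \sum_{j=1}^{m} w_{i,j} = 1 & \text{for } i = 1, 2, \dots, m \end{cases}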
80. How LLE works
(The same two steps again, now with step 2 highlighted: reconstruct x^(i) as a linear function of these neighbors.)
81. How LLE works
• Here is the second step of LLE in detail.
• The second step is to map the training instances into a d-dimensional space (where d < n) while preserving these local relationships.
• If z^(i) is the image of x^(i) in this d-dimensional space, then we want the squared distance between z^(i) and ∑_{j=1}^{m} w_i,j z^(j) to be as small as possible.
Note that Z is the matrix containing all the z^(i) (see Equation 8-5 below).
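The unconstrained optimization of the second step, as given in the book (Equation 8-5); the weights found in the first step are kept fixed:

    \hat{Z} = \underset{Z}{\operatorname{argmin}} \sum_{i=1}^{m} \Bigl\| z^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\, z^{(j)} \Bigr\|^2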
83. How LLE works
The second step is the reverse of the first: instead of keeping the instances fixed and finding the optimal weights, we keep the weights fixed and find the optimal positions of the instances' images in the low-dimensional space.
84. Contents (section divider; next: Other Dimensionality Reduction Techniques)
87. Isomap
• First, it creates a graph by connecting each instance to its nearest neighbors.
• Then it reduces dimensionality while trying to preserve the geodesic distances between the instances.
88. t-Distributed Stochastic Neighbor Embedding (t-SNE)
• t-SNE reduces dimensionality while trying to keep similar instances close and dissimilar instances apart.
• It is mostly used for visualization.
Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9.Nov (2008): 2579-2605.
Notes:
We will discuss the curse of dimensionality and get a sense of what goes on in high-dimensional space. Then, we will present the two main approaches to dimensionality reduction (projection and Manifold Learning), and we will go through three of the most popular dimensionality reduction techniques: PCA, Kernel PCA, and LLE.
let alone → to say nothing of
ellipsoid → an ellipse-like solid
How can two points be so far apart when they both lie within the same unit hypercube?
Before we dive into specific dimensionality reduction algorithms, let's take a look at the two main approaches to reducing dimensionality: projection and Manifold Learning.
Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space. Now if we project every training instance perpendicularly onto this subspace (as represented by the short lines connecting the instances to the plane), we get the new 2D dataset shown in Figure 8-3. Ta-da! We have just reduced the dataset's dimensionality from 3D to 2D. Note that the axes correspond to new features z1 and z2 (the coordinates of the projections on the plane).
However, projection is not always the best approach to dimensionality reduction.
squash → to crush flat
This assumption is very often empirically observed.
If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits.
In other words, the degrees of freedom available to you if you try to create a digit image are dramatically lower than the degrees of freedom you would have if you were allowed to generate any image you wanted.
implicit → implied, not stated directly
As you can see, the projection onto the solid line preserves the maximum variance, while the projection onto the dotted line preserves very little variance, and the projection onto the dashed line preserves an intermediate amount of variance.
Singular Value Decomposition (SVD): V contains an orthonormal basis (its columns are the principal components).
Another very useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ variable.
For example, let’s look at the explained variance ratios of the first two components of the 3D dataset represented in Figure 8-2:
Instead of arbitrarily choosing the number of dimensions to reduce down to, it is preferable to choose the number of dimensions that adds up to a sufficiently large portion of the variance. (Unless, of course, you are reducing dimensionality for data visualization; in that case you will generally want to reduce the dimensionality down to 2 or 3.)
You could then set n_components=d and run PCA again.
Instead of specifying the number of principal components you want to preserve, you can set n_components to be a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve.
In this case, you can see that reducing the dimensionality down to about 100 dimensions wouldn't lose too much explained variance.
intrinsic → inherent, belonging naturally
Of course this won’t give you back the original data, since the projection lost a bit of information (within the 5% variance that was dropped), but it will likely be quite close to the original data.
Figure 8-9 shows a few digits from the original training set (on the left), and the corresponding digits after compression and decompression. You can see that there is a slight image quality loss, but the digits are still mostly intact.
For example, the following code creates a two-step pipeline, first reducing dimensionality to two dimensions using kPCA, then applying Logistic Regression for classification. Then it uses GridSearchCV to find the best kernel and gamma value for kPCA in order to get the best classification accuracy at the end of the pipeline.
Notice that if we could invert the linear PCA step for a given instance in the reduced space, the reconstructed point would lie in feature space, not in the original space (e.g., like the one represented by an x in the diagram). Since the feature space is infinite-dimensional, we cannot compute the reconstructed point, and therefore we cannot compute the true reconstruction error. Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This is called the reconstruction pre-image. Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.
LLE works in the following steps:
First, measure how each training instance linearly relates to its closest neighbors.
Then, look for a low-dimensional representation of the training set where these local relationships are best preserved.
It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space (e.g., to visualize the MNIST images in 2D).