Dimensionality Reduction

By Saad Elbeleidy

Agenda
● Curse Of Dimensionality
● Why Reduce Dimensions
● Types of Dimensionality Reduction
○ Feature Selection
○ Feature Extraction
● Application Specific Methods

Curse Of Dimensionality
The collection of issues that arise when dealing
with high dimensional data.
More data is good.
More detailed data (dimensions) might not be.

Data Representation
[Table: columns Index, Label, Shape, Size, Color, Feature 4; rows indexed 0 to 4.
More rows means more data. More columns means more detail.]

Just one more feature
Adding just one more feature can multiply the amount of data you
need: to keep the same sampling density, the number of samples
grows exponentially with the number of features.
In the age of Big Data, we still do not have enough data to
account for the curse of dimensionality.
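A small worked example of this growth, as a sketch only. It assumes each feature is split into a fixed number of bins and that we want at least one sample per grid cell; the function name `samples_needed` and the choice of 10 bins are illustrative assumptions, not part of the original slides.

```python
def samples_needed(bins_per_feature: int, n_features: int) -> int:
    """Number of grid cells when every feature is split into the same number of bins."""
    return bins_per_feature ** n_features


# With 10 bins per feature, every extra feature multiplies the requirement by 10.
for n_features in range(1, 6):
    print(n_features, samples_needed(10, n_features))
```

Going from 3 to 4 features here raises the requirement from 1,000 to 10,000 cells.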

Data Examples
● HD Images
○ Should we represent all pixels? Are they all useful?
● Video
○ Often only what changes between frames is useful
● Text
○ Keywords vs all words
● Time series
○ Does every second matter?

Why Reduce Dimensions
1. Computation
Fewer dimensions allow us to compute models
more efficiently.
2. Visualization
It is difficult to visualize more than 3 dimensions.
3. Remove Useless Information

Dimensionality Reduction
Dimensionality reduction aims to map the data from the original
dimension space to a lower dimension space while minimizing
(relevant) information loss.

Types of Dimensionality Reduction
Feature Selection: select a subset of the available features.
Feature Extraction: generate synthetic features that represent the
available features.

Feature Selection
Select some features from the originally available
set.
1. Filter
2. Wrapper
3. Embedded

Process
1. Start with all features
2. Select a subset
3. Learn on the subset (train model)
4. Measure performance

Filter
[Diagram: All Features -> (filter to best subset) -> Subset -> Learning -> Performance]

Wrapper

Embedded

Comparison
1. Filter
a. Pros: Low computation time and robust to overfitting
b. Cons: May select redundant features, results often not as good, greedy
c. Used in pre-processing
2. Wrapper
a. Pros: Takes the learning algorithm into account, so the selected features better fit the data
b. Cons: Potentially high computation time, prone to overfitting, greedy
3. Embedded
a. Pros: Combines advantages of both Filter and Wrapper
b. Cons: Computation time

Example Methods
1. Filter
a. Mutual Information
2. Wrapper
a. Recursive Feature Elimination
3. Embedded
a. LASSO
b. LARS

Mutual Information
Score each feature by its mutual information with the target
class, then keep the highest-scoring features.
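A minimal sketch of this filter for discrete features, using only the standard library. The toy data and the names `mutual_information`, `features` and `ranked` are hypothetical; continuous features would need binning or a different estimator.

```python
from collections import Counter
from math import log2


def mutual_information(feature, target):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(feature)
    joint_counts = Counter(zip(feature, target))
    feature_counts = Counter(feature)
    target_counts = Counter(target)
    mi = 0.0
    for (x, y), count in joint_counts.items():
        p_xy = count / n
        p_x = feature_counts[x] / n
        p_y = target_counts[y] / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi


# Hypothetical toy data: one feature copies the target, one is independent of it.
target = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "informative": [0, 0, 1, 1, 0, 0, 1, 1],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],
}

scores = {name: mutual_information(values, target) for name, values in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # best feature first
print(scores, ranked)
```

The informative feature scores 1 bit and the independent one scores 0, so a filter keeping the top feature would keep `informative`.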

Recursive Feature Elimination
Select features by recursively considering smaller and smaller sets
of features.
1. Train estimator and obtain importance of each feature
2. Prune least important k features
3. Repeat until only n features remain
Parameters: n features to keep, k features to drop per iteration
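The loop above can be sketched with NumPy as follows. This is an illustration under assumptions, not a reference implementation: the estimator is plain least squares, importance is taken to be the absolute coefficient (which presumes features on comparable scales), and the names `recursive_feature_elimination`, `n_keep` and `k_drop` are made up for this example.

```python
import numpy as np


def recursive_feature_elimination(X, y, n_keep, k_drop=1):
    """Return the column indices that survive repeated pruning."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # 1. Train the estimator and read off an importance per feature.
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        importance = np.abs(coef)
        # 2. Prune the k least important features (never going below n_keep).
        n_drop = min(k_drop, len(remaining) - n_keep)
        weakest = set(np.argsort(importance)[:n_drop].tolist())
        remaining = [f for i, f in enumerate(remaining) if i not in weakest]
        # 3. Repeat with the smaller feature set.
    return remaining


# Hypothetical data: only columns 0 and 3 influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

selected = recursive_feature_elimination(X, y, n_keep=2, k_drop=1)
print(selected)
```

A larger `k_drop` trades accuracy of the ranking for fewer model fits.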

Norms Review
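l1 norm: Manhattan distance, the sum of absolute values, $\|x\|_1 = \sum_i |x_i|$
l2 norm: Euclidean distance, $\|x\|_2 = \sqrt{\sum_i x_i^2}$
lp norm (p >= 1): $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
l0 "norm": number of non-zero entries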
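A short sketch of these definitions in plain Python. The function names `l0_norm` and `lp_norm` are illustrative choices.

```python
def l0_norm(x):
    """Number of non-zero entries."""
    return sum(1 for value in x if value != 0)


def lp_norm(x, p):
    """General lp norm for p >= 1; p=1 is Manhattan, p=2 is Euclidean."""
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(value) ** p for value in x) ** (1.0 / p)


x = [3, 0, -4]
print(l0_norm(x))     # 2 non-zero entries
print(lp_norm(x, 1))  # |3| + |0| + |-4| = 7
print(lp_norm(x, 2))  # sqrt(9 + 0 + 16) = 5
```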

Least Absolute Shrinkage & Selection Operator (LASSO)
A linear model that estimates sparse coefficients.
Uses l1 norm regularization.
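To show how the l1 penalty produces exact zeros, here is a minimal coordinate descent sketch. It assumes the objective (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 with no intercept; the names `soft_threshold`, `lasso_coordinate_descent` and `alpha`, and the toy data, are assumptions for this example rather than anything from the slides.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 one coefficient at a time."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    column_scale = (X ** 2).sum(axis=0) / n_samples
    for _ in range(n_iters):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n_samples
            w[j] = soft_threshold(rho, alpha) / column_scale[j]
    return w


# Hypothetical data with orthogonal columns; ordinary least squares gives [2.0, 0.1].
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([2.0, 0.1, -2.0, -0.1])

w = lasso_coordinate_descent(X, y, alpha=0.1)
print(w)  # the weak second coefficient is driven exactly to zero
```

The strong coefficient is shrunk (from 2.0 to about 1.8) while the weak one is removed entirely, which is what makes LASSO usable as a feature selector.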

Elastic Net
A linear regression model trained with both l1 and l2 norm
regularization.
Elastic-net is useful when there are multiple features which are
correlated with one another. Lasso is likely to pick one of these at
random, while elastic-net is likely to pick both.

Regression Comparison
OLS: no regularization
LASSO: l1 norm regularization
Ridge: l2 norm regularization
Elastic Net: l1 and l2 norm regularization

Least Angle Regression (LARS)
A regression algorithm for high-dimensional data.
1. Start with all coefficients equal to zero.
2. Find the predictor most correlated with the response, say $x_{j_1}$.
3. Take the largest step possible in the direction of this predictor.
4. When some other predictor, say $x_{j_2}$, has as much correlation
with the current residual, proceed in a direction equiangular
between the two predictors.

Feature Extraction
Generate synthetic (read: made up) features that
represent the available ones.
Methods to cover:
1. PCA
2. t-SNE
3. Spectral Embedding
4. LDA

Principal Component Analysis
Decomposes a dataset into a set of successive orthogonal
components that explain a maximum amount of variance.
Demo: http://setosa.io/ev/principal-component-analysis
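A minimal PCA sketch with NumPy, computing the components from the SVD of the centered data. The function name `pca` and the synthetic data (points scattered along the direction (1, 2)) are assumptions made for illustration.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal components.

    Returns the projected data, the component directions (as rows),
    and the variance explained by each kept component.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthogonal directions ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = singular_values ** 2 / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance[:n_components]


# Hypothetical 2-D data that mostly varies along the direction (1, 2).
rng = np.random.default_rng(0)
t = rng.normal(size=300)
X = np.column_stack([t, 2.0 * t + 0.05 * rng.normal(size=300)])

projected, components, variance = pca(X, n_components=1)
print(components[0], variance)
```

The first component comes out (up to sign) close to (1, 2) normalized, so one dimension keeps almost all of the variance.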

Singular Value Decomposition
Factorization of a matrix A into the product of three matrices,
$A = U D V^T$.
The columns of U and V are orthonormal and the matrix D is
diagonal with positive real entries.

Truncated singular value decomposition approximates a rank d
matrix by a rank r matrix, with r < d.
We can take a list of d linearly independent vectors and approximate
each of them as a linear combination of r vectors.
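A sketch of the rank r approximation using `numpy.linalg.svd`. The helper name `low_rank_approximation` and the example matrix are assumptions for illustration.

```python
import numpy as np


def low_rank_approximation(A, r):
    """Keep only the r largest singular values and their singular vectors."""
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] @ np.diag(d[:r]) @ Vt[:r, :]


# Hypothetical matrix that is almost rank 1: the 6.1 entry breaks exact proportionality.
A = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [3.0, 6.0, 9.0],
])

A1 = low_rank_approximation(A, r=1)
singular_values = np.linalg.svd(A, compute_uv=False)
print(np.round(A1, 3))
print(np.linalg.norm(A - A1), singular_values[1:])
```

The Frobenius error of the approximation equals the root of the sum of squares of the discarded singular values, so small trailing singular values mean little information is lost.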

t-Distributed Stochastic Neighbor Embedding (t-SNE)
1. Computes probabilities that are proportional to the similarity
of objects.
2. Uses the probabilities to learn a d dimensional map that
reflects the similarities as well as possible.
3. Does this by minimizing the Kullback–Leibler divergence
between the two similarity distributions.
Demo: https://distill.pub/2016/misread-tsne/

t-Distributed Stochastic Neighbor Embedding (t-SNE)
1. Conditional Probabilities
2. Joint Probabilities
3. Minimization
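For reference, the three steps correspond to the following quantities in the standard t-SNE formulation (van der Maaten and Hinton, 2008). The symbols $x_i$ (original points), $y_i$ (map points), $\sigma_i$ (per-point bandwidth) and $n$ (number of points) are notation assumed here, not taken from the slide.

```latex
% 1. Conditional probabilities in the original space
p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                    {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}

% 2. Symmetrized joint probabilities, and map similarities with a Student-t kernel
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% 3. Cost minimized by gradient descent over the map points
C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```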

Spectral Embedding
Example of nonlinear dimensionality reduction.
1. Weighted Graph Construction
Transform the raw input data into graph representation using
affinity (adjacency) matrix representation.
2. Graph Laplacian Construction
3. Partial Eigenvalue Decomposition
Eigenvalue decomposition is performed on the graph Laplacian.
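The three steps can be sketched with NumPy as below. Several choices here are assumptions: a Gaussian affinity with a hypothetical `gamma` parameter, the unnormalized Laplacian L = D - W, and a full eigendecomposition that is then sliced (a real implementation would compute only the few eigenvectors it needs).

```python
import numpy as np


def spectral_embedding(X, n_components, gamma=1.0):
    """Embed the rows of X using eigenvectors of a graph Laplacian."""
    # 1. Weighted graph construction: Gaussian affinity between every pair of points.
    squared_distances = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-gamma * squared_distances)
    np.fill_diagonal(W, 0.0)
    # 2. Graph Laplacian construction (unnormalized): L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    # 3. Eigendecomposition; eigh returns eigenvalues in ascending order.
    _, eigenvectors = np.linalg.eigh(L)
    # Skip the first eigenvector, which is constant for a connected graph.
    return eigenvectors[:, 1:n_components + 1]


# Hypothetical 1-D data forming two groups.
X = np.array([[0.0], [0.1], [0.2], [2.0], [2.1], [2.2]])
embedding = spectral_embedding(X, n_components=1)
print(embedding[:, 0])
```

In this example the sign of the single embedding coordinate separates the two groups.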

Linear Discriminant Analysis
Use classifier coefficients to calculate a projection of the data onto k
dimensions.
Ensure that k is at most C - 1. (C = number of classes)
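A sketch of the simplest case, two classes, where C - 1 = 1 and the projection is a single direction (Fisher's linear discriminant). The function name `fisher_lda_direction` and the synthetic data are assumptions; the multi-class case needs a generalized eigenproblem that is not shown here.

```python
import numpy as np


def fisher_lda_direction(X, y):
    """Unit vector that best separates class 0 from class 1 (two-class case only)."""
    X0, X1 = X[y == 0], X[y == 1]
    mean0, mean1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: spread of each class around its own mean.
    within_scatter = (X0 - mean0).T @ (X0 - mean0) + (X1 - mean1).T @ (X1 - mean1)
    direction = np.linalg.solve(within_scatter, mean1 - mean0)
    return direction / np.linalg.norm(direction)


# Hypothetical data: two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
    rng.normal([4.0, 4.0], 1.0, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

w = fisher_lda_direction(X, y)
projected = X @ w  # 2-D data reduced to k = 1 dimension
threshold = (projected[y == 0].mean() + projected[y == 1].mean()) / 2
accuracy = ((projected > threshold) == (y == 1)).mean()
print(w, accuracy)
```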

Application Specific Methods
Simple yet effective methods of dimensionality
reduction in:
● Computer Vision
● Natural Language Processing
● Time Series

Computer Vision
Applications:
● Facial Recognition
● Self Driving Cars
● Optical Character Recognition (OCR)
Data: matrix of pixels’ (RGB or Grayscale) values

Pooling
Pooling algorithm: max, average, other statistical metrics
Filter: patch to apply the pooling algorithm to
Stride: the step distance to the next patch
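A plain Python sketch of pooling over a grayscale image stored as a list of rows. The names `pool2d`, `filter_size`, `stride` and `pool_fn` mirror the terms above but are otherwise illustrative, and patches that would run past the image edge are simply skipped (no padding).

```python
def pool2d(image, filter_size, stride, pool_fn=max):
    """Apply pool_fn to each filter_size x filter_size patch, moving by stride."""
    n_rows, n_cols = len(image), len(image[0])
    pooled = []
    for top in range(0, n_rows - filter_size + 1, stride):
        pooled_row = []
        for left in range(0, n_cols - filter_size + 1, stride):
            patch = [
                image[top + i][left + j]
                for i in range(filter_size)
                for j in range(filter_size)
            ]
            pooled_row.append(pool_fn(patch))
        pooled.append(pooled_row)
    return pooled


def average(values):
    return sum(values) / len(values)


image = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

max_pooled = pool2d(image, filter_size=2, stride=2)                    # [[6, 8], [3, 4]]
avg_pooled = pool2d(image, filter_size=2, stride=2, pool_fn=average)   # [[3.75, 5.25], [2.0, 2.0]]
print(max_pooled, avg_pooled)
```

A 2x2 filter with stride 2 turns the 4x4 image into a 2x2 one, a fourfold reduction in values.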

Natural Language Processing
Applications:
● Question Answering
● Summarization
● Spam Filtering
Data: vector representations of words in a string

Removing Words
Stopwords are common terms that do not add significance
relative to the machine learning application.
Ex. the, a, an, he, she, because, not
Stopword removal is very common in NLP applications but
selecting stopwords may be difficult.
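A minimal sketch using the example stopwords listed above. Whitespace tokenization, lowercasing and the function name `remove_stopwords` are simplifying assumptions.

```python
STOPWORDS = {"the", "a", "an", "he", "she", "because", "not"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop any token in the stopword set."""
    return [token for token in text.lower().split() if token not in stopwords]


tokens = remove_stopwords("She did not like the movie because the plot was slow")
print(tokens)  # ['did', 'like', 'movie', 'plot', 'was', 'slow']
```

Eleven tokens shrink to six, but dropping "not" also flips the apparent sentiment, which shows why the choice of stopwords depends on the application.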

Combining Words
Stemming
Reducing words to their stem/root word.
Examples:
● presumably -> presum
● multiply -> multipli
● crying -> cri
Lemmatization
Similar to stemming but includes parts of speech.
Examples:
● hike_verb
● hike_noun
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

Time Series
Applications:
● Financial Forecasting
● Medical Diagnosis
● Server Log Analysis
Data: table where most features are a time unit.

Time Scale Reduction
If modeling annual performance, there is no need to use per-second data.
Reduction algorithms (applied to each coarser time window):
● Start (first value in the window)
● Average
● Min/Max
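A plain Python sketch of these reductions over fixed-size windows. The function name `downsample`, the `method` strings, and the reading of "Start" as "first value in each window" are assumptions made for this example.

```python
def downsample(series, window, method="average"):
    """Collapse each run of `window` consecutive values into a single value."""
    reducers = {
        "start": lambda chunk: chunk[0],
        "average": lambda chunk: sum(chunk) / len(chunk),
        "min": min,
        "max": max,
    }
    reduce_chunk = reducers[method]
    return [
        reduce_chunk(series[i:i + window])
        for i in range(0, len(series), window)
    ]


per_second = [2, 4, 6, 8, 1, 3, 5, 7]

print(downsample(per_second, window=4, method="start"))    # [2, 1]
print(downsample(per_second, window=4, method="average"))  # [5.0, 4.0]
print(downsample(per_second, window=4, method="min"))      # [2, 1]
print(downsample(per_second, window=4, method="max"))      # [8, 7]
```

The last window is shorter than the others when the series length is not a multiple of `window`.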

Other Methods
Granger Causality
Forecastable Component Analysis (ForeCA)

