1. “The Curse of Dimensionality”
Amit Praseed
October 9, 2019
2. Are all Features Equally Important?
F1  F2  F3  F4  F5  F6   T
 7  18   7  11  22  1    B
 1   5   1  18  36  0    B
 0  15   0   2   4  1    B
 7   5   7  12  24  0    A
 1  15   1  12  24  1    B
 3  20   3   6  12  2    B
 0   5   0  18  36  1    B
 7  10   7  12  24  0.5  A
10   8  10  20  40  1    A
 9  20   9  17  34  0.5  A
3. Do we really need so many Dimensions?
Dimensionality Reduction Techniques are broadly classified into two categories:
Feature Selection: These techniques select a subset of dimensions.
Filter Methods: Evaluate the importance of each feature one by one.
Wrapper Methods: Evaluate different subsets of features and test their performance on a classifier to select the best subset.
Embedded Methods: Certain classification algorithms, such as Decision Trees, automatically select the best subset of features to model the data.
Feature Extraction: These techniques transform the data to a lower-dimensional space with minimal loss of information.
e.g., Principal Component Analysis (PCA)
4. Filter Methods
Filter methods inspect each independent variable individually, or occasionally evaluate a single independent variable against the dependent variable (which is the class label in classification).
Advantages:
Fast and simple.
Works well for simple applications.
Disadvantages:
Does not consider relationships between variables.
Filter methods are general and not tied to any particular classifier, so there is no guarantee that the reduced data will perform well on a given classifier.
5. Eliminate Dimensions based on Variance
F1  F2  F3  F4  F5  F6   T
 7  18   7  11  22  1    B
 1   5   1  18  36  0    B
 0  15   0   2   4  1    B
 7   5   7  12  24  0    A
 1  15   1  12  24  1    B
 3  20   3   6  12  2    B
 0   5   0  18  36  1    B
 7  10   7  12  24  0.5  A
10   8  10  20  40  1    A
 9  20   9  17  34  0.5  A
Variance measures the spread of a random variable around its mean value:

Var(X) = E[(X - \mu)^2]

Dimensions with low values of variance can be eliminated with minimal loss of information.
In this case, F6 has very low variance compared to the other dimensions and hence can be ignored.
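A minimal sketch of this filter in Python, using the table above; the 1.0 threshold is an assumed cut-off chosen for this example, not a standard value:

```python
# A minimal sketch of variance-based filtering on the table above.
# The threshold is an assumption for this example, not a standard.
import numpy as np

# Rows are samples; columns are the features F1..F6.
X = np.array([
    [7, 18, 7, 11, 22, 1.0],
    [1, 5, 1, 18, 36, 0.0],
    [0, 15, 0, 2, 4, 1.0],
    [7, 5, 7, 12, 24, 0.0],
    [1, 15, 1, 12, 24, 1.0],
    [3, 20, 3, 6, 12, 2.0],
    [0, 5, 0, 18, 36, 1.0],
    [7, 10, 7, 12, 24, 0.5],
    [10, 8, 10, 20, 40, 1.0],
    [9, 20, 9, 17, 34, 0.5],
])

variances = X.var(axis=0)  # per-feature variance; F6 is ~0.31, the rest are far larger
keep = variances > 1.0     # assumed threshold
X_reduced = X[:, keep]     # drops F6, keeping F1..F5
print(variances)
```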
6. Do all of the Features Influence the Output?
F1  F2  F3  F4  F5  T
 7  18   7  11  22  A
 1   5   1  18  36  B
 0  15   0   2   4  B
 7   5   7  12  24  B
 1  15   1  12  24  A
 3  20   3   6  12  B
 0   5   0  18  36  B
 7  10   7  12  24  B
10   8  10  20  40  A
 9  20   9  17  34  A
A feature can be regarded as irrelevant if it is conditionally independent of the class labels.
How to identify irrelevant features?
Pearson's Correlation between features and the output variable
Point-Biserial Correlation between a binary (nominal) variable and a numeric variable
Cramér's V between two nominal variables
Advantage: Simple, and works well for certain datasets.
Drawback: Does not consider relationships between variables.
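A minimal sketch of one of these measures, assuming SciPy's pointbiserialr and encoding the class labels A/B from the table above as 1/0:

```python
# A minimal sketch of relevance scoring with the point-biserial
# correlation (numeric feature vs. binary class), using SciPy.
import numpy as np
from scipy.stats import pointbiserialr

# Features F1..F5 and class labels T from the table above.
X = np.array([
    [7, 18, 7, 11, 22], [1, 5, 1, 18, 36], [0, 15, 0, 2, 4],
    [7, 5, 7, 12, 24], [1, 15, 1, 12, 24], [3, 20, 3, 6, 12],
    [0, 5, 0, 18, 36], [7, 10, 7, 12, 24], [10, 8, 10, 20, 40],
    [9, 20, 9, 17, 34],
], dtype=float)
y = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1, 1])  # A -> 1, B -> 0

for j in range(X.shape[1]):
    r, p = pointbiserialr(y, X[:, j])
    print(f"F{j + 1}: r = {r:+.2f} (p = {p:.2f})")
# Features whose |r| is close to zero are candidates for removal.
```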
7. Are all the Variables Independent?
F1  F2  F3  F4  F5  T
 7  18   7  11  22  A
 1   5   1  18  36  B
 0  15   0   2   4  B
 7   5   7  12  24  B
 1  15   1  12  24  A
 3  20   3   6  12  B
 0   5   0  18  36  B
 7  10   7  12  24  B
10   8  10  20  40  A
 9  20   9  17  34  A
A feature can be removed from the feature set if it provides no more information than is already provided.
How to identify redundant features?
Correlation between features:

Corr(x, y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_x)^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \mu_y)^2}}
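A minimal sketch of this idea: compute the feature-feature correlation matrix and drop one feature from each highly correlated pair. In the table above F3 duplicates F1 and F5 = 2 × F4, so both pairs show perfect correlation; the 0.95 cut-off is an assumed threshold.

```python
# A minimal sketch of redundancy removal via pairwise feature correlation.
# The 0.95 cut-off is an assumed threshold, not a standard value.
import numpy as np

X = np.array([
    [7, 18, 7, 11, 22], [1, 5, 1, 18, 36], [0, 15, 0, 2, 4],
    [7, 5, 7, 12, 24], [1, 15, 1, 12, 24], [3, 20, 3, 6, 12],
    [0, 5, 0, 18, 36], [7, 10, 7, 12, 24], [10, 8, 10, 20, 40],
    [9, 20, 9, 17, 34],
], dtype=float)

corr = np.corrcoef(X.T)  # np.corrcoef wants variables in rows
redundant = set()
for i in range(corr.shape[0]):
    if i in redundant:
        continue
    for j in range(i + 1, corr.shape[0]):
        if abs(corr[i, j]) > 0.95:
            redundant.add(j)  # keep feature i, drop feature j

keep = [j for j in range(X.shape[1]) if j not in redundant]
X_reduced = X[:, keep]        # keeps F1, F2, F4 here
```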
8. Wrapper Methods
Unlike filter methods, wrapper methods also take relationships between variables into account.
A wrapper method selects a subset of features and feeds the reduced data to a classifier. The accuracy of the classifier on the reduced data serves as a heuristic value.
The wrapper tries different subsets until it finds the feature subset that yields an optimal value of the heuristic; a brute-force sketch follows.
This feature subset, i.e. the reduced feature space, is then optimal for that classifier.
Wrapper methods are considerably more complex and time-consuming than filter methods, but they perform better.
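A minimal sketch of this loop, assuming scikit-learn and a small k-NN model as the wrapped classifier; it brute-forces every subset, which is only feasible for a handful of features (see the next slide):

```python
# A minimal wrapper sketch: score every feature subset by cross-validated
# accuracy and keep the best. Exhaustive, so usable only for few features.
from itertools import combinations
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_subset(X, y, cv=2):
    best_score, best_features = -np.inf, None
    for k in range(1, X.shape[1] + 1):
        for subset in combinations(range(X.shape[1]), k):
            clf = KNeighborsClassifier(n_neighbors=3)  # assumed classifier
            score = cross_val_score(clf, X[:, list(subset)], y, cv=cv).mean()
            if score > best_score:
                best_score, best_features = score, subset
    return best_features, best_score
```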
9. Wait... Subset Selection is NP-Complete!!!
Selecting the optimal subset of features is an NP-Complete problem, so approximations are usually used.
Simple approximations include limiting the maximum number of features or iterations.
Typical heuristic search algorithms such as Hill Climbing, Steepest-Ascent Hill Climbing, Simulated Annealing, etc. are used.
Wrapper methods usually fall into one of the following categories (a greedy forward-selection sketch follows the list):
Forward Selection
Backward Elimination
Recursive Selection/Elimination
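A minimal sketch of greedy forward selection under the same assumptions as before (scikit-learn, k-NN as the wrapped classifier); it adds one feature per round and stops when no addition improves the score, trading optimality for polynomial cost:

```python
# A minimal greedy forward-selection sketch: grow the subset one feature
# at a time, keeping an addition only if cross-validated accuracy improves.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, cv=2):
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = []
        for f in remaining:
            clf = KNeighborsClassifier(n_neighbors=3)  # assumed classifier
            trial = selected + [f]
            scores.append((cross_val_score(clf, X[:, trial], y, cv=cv).mean(), f))
        score, f = max(scores)       # best single addition this round
        if score <= best_score:      # no candidate improves: stop early
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score
```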
10. Feature Extraction
Feature selection involves a loss of data, because entire features are discarded.
Even though care is taken to remove dimensions that are unlikely to contribute much to data mining, it is still desirable to retain all of the input data in some form.
So, how can we retain all of the input data and still reduce the number of dimensions?
Feature Extraction maps data from a higher-dimensional feature space to a lower-dimensional feature space with minimal loss of information.
The most common feature extraction technique is Principal Component Analysis (PCA).
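A minimal sketch using scikit-learn's PCA on the 10 × 5 table from the earlier slides; n_components = 2 is an arbitrary choice for illustration:

```python
# A minimal PCA sketch: project the 5 features onto 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([
    [7, 18, 7, 11, 22], [1, 5, 1, 18, 36], [0, 15, 0, 2, 4],
    [7, 5, 7, 12, 24], [1, 15, 1, 12, 24], [3, 20, 3, 6, 12],
    [0, 5, 0, 18, 36], [7, 10, 7, 12, 24], [10, 8, 10, 20, 40],
    [9, 20, 9, 17, 34],
], dtype=float)

pca = PCA(n_components=2)              # number of components is a choice
X_reduced = pca.fit_transform(X)       # each new feature mixes all old ones
print(pca.explained_variance_ratio_)   # fraction of variance each component keeps
```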
11. The Idea behind PCA
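The original slides illustrate the idea graphically; in code, it amounts to: center the data, find the directions of maximum variance (the eigenvectors of the covariance matrix), and project onto the strongest ones. A minimal from-scratch sketch:

```python
# The idea behind PCA as code: the principal components are the
# eigenvectors of the data's covariance matrix, ordered by eigenvalue.
import numpy as np

def pca_project(X, k):
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]        # strongest directions first
    components = eigvecs[:, order[:k]]       # top-k principal directions
    return Xc @ components                   # coordinates in the new basis
```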