Analysis of Feature Selection Algorithms
Branch and Bound | Beam Search Algorithm
Parinda Rajapaksha, UCSC
ROAD MAP 
• Motivation 
• Introduction 
• Analysis 
– Algorithm 
– Pseudo Code 
– Illustration of examples 
• Applications 
• Observations and Recommendations 
• Comparison between two algorithms 
• References 
SECTION 1
Branch and Bound Algorithm
MOTIVATION
• Optimal feature selection (subset selection) is difficult because of its computational complexity
• All subsets of a given cardinality have to be evaluated to find the optimal set of features among a large set of measurements
• Exhaustive search is impractical even for relatively small problems
– Finding 2 features from a 10-feature set already generates 45 possible combinations.
• Over the years, this challenge has motivated work on speeding up the search process in the arena of feature selection
INTRODUCTION
• As a solution, the Branch and Bound (B&B) algorithm was developed by Narendra and Fukunaga in 1977
• It introduced heuristic measures that help identify parts of the search space that can be left unexplored without missing the optimal solution
• Guaranteed to find the optimal feature subset without evaluating all possible subsets
• B&B is an exponential search method
• Assumes that the feature selection criterion is monotonic
INTRODUCTION Monotonicity Property
• For two given feature subsets (X, Y) and a feature selection criterion function (J):
X ⊂ Y => J(X) < J(Y)     Ex: X = {2,4} ⊂ Y = {2,4,5}
• It ensures that the values of the leaf nodes of a branch cannot be better than the current bound
• Allows creating short cuts in the search tree representing the feature set optimization process
• Reduces the number of nodes and branches of the search tree that have to be explored
INTRODUCTION Monotonicity Property
Feature set => {x1, x2, x3, x4, … xn}
For the nested chain {x1} ⊂ {x1, x2} ⊂ … ⊂ {x1, x2, x3, … xn}, the criterion values satisfy:
J(x1) < J(x1, x2) < J(x1, x2, x3) < … < J(x1, x2, x3, … xn)
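To make the property concrete, here is a minimal Python sketch (the per-feature scores are assumed for illustration, not taken from the slides): if J simply sums non-negative per-feature scores, adding a feature can never decrease J, so the criterion is monotonic.

```python
# Hypothetical per-feature scores (assumed for illustration only).
scores = {1: 0.9, 2: 0.4, 3: 0.8, 4: 0.5, 5: 0.7}

def J(subset):
    """A monotonic criterion: the sum of non-negative per-feature scores."""
    return sum(scores[f] for f in subset)

X = {2, 4}
Y = {2, 4, 5}
assert X < Y        # X is a proper subset of Y
assert J(X) < J(Y)  # monotonicity: J(X) = 0.9 < J(Y) = 1.6
```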
ANALYSIS
• Start from the full set of features and remove features using a depth-first strategy
• The monotonicity property must be satisfied to apply the algorithm
• Branching is the process of constructing the tree
• At each tree level, a limited number of sub-trees is generated by deleting one feature from the parent node's feature set
• Bounding is the process of finding the optimal feature set by traversing the constructed tree
ANALYSIS Algorithm
1. Construct an ordered tree satisfying the monotonicity property
Let Xj be the set of features obtained by removing j features y1, y2, … yj from the set Y of all features:
Xj = Y \ {y1, y2, … yj}
The monotonicity condition assumes that, for the nested feature subsets
Xj ⊂ … ⊂ X2 ⊂ X1
the criterion function J fulfills
J(Xj) < … < J(X2) < J(X1)
ANALYSIS Algorithm
2. Traverse the tree from right to left in a depth-first search pattern
• If the criterion value at a given node is less than the current bound (the value of the most recent best subset),
all its successors will also have values below the bound
3. Pruning
• Whenever the criterion value J(Xm) at some internal node is found to be lower than the current bound, the monotonicity condition implies the whole sub-tree may be cut off, and many computations may be omitted
• B&B builds a tree containing all possible r-element subsets of the n-feature set, but searches only some of them
ANALYSIS Pseudo Code
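The pseudo-code slide was an image and did not survive extraction. As a stand-in, here is a minimal Python sketch of the search just described: right-to-left depth-first traversal, removal of features in increasing order, and monotonicity-based pruning. It assumes a criterion function J that maps a frozenset of feature indices to a value; it is an illustrative reconstruction, not the exact pseudo code of Narendra and Fukunaga.

```python
from math import inf

def branch_and_bound(n, r, J):
    """Find the best r-feature subset of features {1..n}.

    Assumes J is monotonic: J(X) <= J(Y) whenever X is a subset of Y.
    Returns (best_value, best_subset).
    """
    best_value, best_subset = -inf, None

    def search(subset, last_removed):
        nonlocal best_value, best_subset
        value = J(subset)
        if len(subset) == r:            # leaf: a candidate target subset
            if value > best_value:
                best_value, best_subset = value, subset
            return
        if value <= best_value:         # bound test: by monotonicity, no
            return                      # descendant can beat the bound
        removals_left = len(subset) - 1 - r
        # Remove features in increasing order (avoids duplicate subtrees);
        # iterating in reverse explores the rightmost branch first, which
        # sets an initial bound cheaply. Skip removals that would leave too
        # few larger features for the branch to stay complete.
        for f in sorted((x for x in subset if x > last_removed), reverse=True):
            if sum(1 for x in subset if x > f) >= removals_left:
                search(subset - {f}, f)

    search(frozenset(range(1, n + 1)), 0)
    return best_value, best_subset
```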
ANALYSIS Tree Properties
• Root of the tree represents the set of all features (n) and leaves represent target subsets (r) of features
• At each tree level, a limited number of sub-trees is generated by deleting one feature from the parent node's feature set

Ex: root { X1,X2,X3 } (all features, n); removing X1, X2, or X3 along the edges gives the leaves { X2,X3 }, { X1,X3 }, { X1,X2 } (target subsets, r)
ANALYSIS Tree Properties
• In practice, features are only allowed to be removed in increasing order. This removes unnecessary repetitions in the calculation; therefore the tree is not symmetrical.

Ex: from root { X1,X2,X3,X4 }, removing X1 and then X2 reaches { X3,X4 }. Removing X2 and then X1 would repeat the same subset { X3,X4 }, but since X2, X1 is not in increasing order, that branch is never generated.
ANALYSIS Tree Properties
• Number of leaf nodes in the tree = nCr
• Number of levels = n – r
• Ex: 3 features reduced to 2 features (root { X1,X2,X3 }, leaves { X2,X3 }, { X1,X3 }, { X1,X2 }):
No of leaf nodes = 3C2 = 3
No of levels = 3 – 2 = 1
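Both formulas can be checked directly with Python's standard library; a quick sketch covering this example and the two worked examples that follow:

```python
from math import comb

def tree_shape(n, r):
    """Leaf count (nCr) and level count (n - r) of the B&B search tree."""
    return comb(n, r), n - r

print(tree_shape(3, 2))   # (3, 1)   -> 3 leaves, 1 level
print(tree_shape(5, 2))   # (10, 3)  -> the 5-to-2 example below
print(tree_shape(10, 6))  # (210, 4) -> Example 2
```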
EXAMPLE
How to reduce 5 features to 2 features using the B&B algorithm?
{1, 2, 3, 4, 5} → { ?, ? }
Finding the best 2 features from the full set of features
EXAMPLE Branching Step 1
• Identify the tree properties:
– No of levels = 5 – 2 = 3 (5 → 4 → 3 → 2)
– No of leaf nodes = 5C2 = 10
– Choose a criterion function J(X)
EXAMPLE Branching Step 2
Level 0: {1,2,3,4,5}
Level 1: removing feature 1, 2, or 3 gives {2,3,4,5}, {1,3,4,5}, {1,2,4,5}
Note: If feature 4 or 5 were removed from the initial state, the tree would not become a complete tree; there would be no features left to remove (in increasing order) at the next levels.
EXAMPLE Branching Step 3
Level 2:
from {2,3,4,5}, removing 2, 3, or 4 gives {3,4,5}, {2,4,5}, {2,3,5}
from {1,3,4,5}, removing 3 or 4 gives {1,4,5}, {1,3,5}
from {1,2,4,5}, removing 4 gives {1,2,5}
EXAMPLE Branching Step 4
Level 3 (leaves):
from {3,4,5}, removing 3, 4, or 5 gives {4,5}, {3,5}, {3,4}
from {2,4,5}, removing 4 or 5 gives {2,5}, {2,4}
from {2,3,5}, removing 5 gives {2,3}
from {1,4,5}, removing 4 or 5 gives {1,5}, {1,4}
from {1,3,5}, removing 5 gives {1,3}
from {1,2,5}, removing 5 gives {1,2}
EXAMPLE Criterion Values
• Assume the criterion function J(X) gives the following results, which satisfy the monotonicity property:
Root: J({1,2,3,4,5}) = 15
Level 1: {2,3,4,5} = 10, {1,3,4,5} = 12, {1,2,4,5} = 11
Level 2: {3,4,5} = 6, {2,4,5} = 7, {2,3,5} = 8, {1,4,5} = 8, {1,3,5} = 10, {1,2,5} = 9
Level 3: {4,5} = 3, {3,5} = 4, {3,4} = 5, {2,5} = 5, {2,4} = 6, {2,3} = 7, {1,5} = 6, {1,4} = 7, {1,3} = 9, {1,2} = 8
EXAMPLE Backtracking
• Calculate the criterion values using the J(X) function (values are assumed)
• Set the rightmost leaf value as the bound; this branch has the minimum number of child nodes and edges: 15 → 11 → 9 → 8, reaching the leaf {1,2}
Current V = 8, Bound = 8 (Set Bound)
EXAMPLE Backtracking
• Backtrack along the tree (depth-first) while Current Node Value ≥ Bound
• Update the bound when backtracking reaches a leaf node: the next explored leaf {1,3} has value 9 ≥ 8
Current V = 9, Bound = 9 (Update Bound)
EXAMPLE Backtracking
• If Current Node Value ≤ Bound, discard the branches below (prune)
• The bound does not change: the internal node {1,4,5} with value 8 ≤ 9 is cut off
Current V = 8, Bound = 9
EXAMPLE Backtracking
• Repeat the previous steps: the internal node {2,3,5} with value 8 ≤ 9 is also pruned
Current V = 8, Bound = 9
EXAMPLE Backtracking
• The remaining internal nodes {2,4,5} (7) and {3,4,5} (6) fall below the bound and are pruned as well (Current V = 6, Bound = 9)
• Maximum bound in leaf nodes = 9
• Optimal feature subset = {1,3}
• Note that some subsets in L3 can be omitted without calculating their criterion values
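The whole worked example can be replayed with the branch_and_bound sketch from the Analysis section by encoding the assumed criterion values in a lookup table (a sketch; the dictionary below simply transcribes the numbers assumed above):

```python
# Assumed criterion values from the example, keyed by feature subset.
J_VALUES = {
    frozenset({1, 2, 3, 4, 5}): 15,
    frozenset({2, 3, 4, 5}): 10, frozenset({1, 3, 4, 5}): 12, frozenset({1, 2, 4, 5}): 11,
    frozenset({3, 4, 5}): 6, frozenset({2, 4, 5}): 7, frozenset({2, 3, 5}): 8,
    frozenset({1, 4, 5}): 8, frozenset({1, 3, 5}): 10, frozenset({1, 2, 5}): 9,
    frozenset({4, 5}): 3, frozenset({3, 5}): 4, frozenset({3, 4}): 5,
    frozenset({2, 5}): 5, frozenset({2, 4}): 6, frozenset({2, 3}): 7,
    frozenset({1, 5}): 6, frozenset({1, 4}): 7, frozenset({1, 3}): 9, frozenset({1, 2}): 8,
}

value, subset = branch_and_bound(n=5, r=2, J=J_VALUES.__getitem__)
print(value, sorted(subset))  # -> 9 [1, 3], matching the worked example
```

Note that the pruned internal nodes ({1,4,5}, {2,3,5}, {2,4,5}, {3,4,5}) are still evaluated once each for the bound test; what is saved is the evaluation of the leaves beneath them.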
EXAMPLE 2
Reduce 10 features to 6 features (n = 10, r = 6):
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10} → { ?, ?, ?, ?, ?, ? }
No of levels = 10 – 6 = 4
No of leaf nodes = 10C6 = 210
EXAMPLE 2 Reduce 10 features to 6
[Tree diagram: 4 levels of single-feature removals, ending in 210 leaf nodes]
APPLICATIONS B & B
• Evaluation of Feature Selection Techniques for Analysis of Functional MRI and EEG
– This paper compares the performance of classical sequential methods and the B&B algorithm when applied to functional Magnetic Resonance Images (MRI) and intracranial EEG to classify pathological events
– They used 12 features for the MRI data and 14 features for the EEG data
– The results of this work contradict the claim, made in several sources, that the B&B algorithm is an optimal search algorithm for feature selection
APPLICATIONS B & B
– The algorithm fails to create subsets with better classification accuracy in this application
[Figure: Classification accuracy as a function of subset size for the MRI data]
OBSERVATIONS & RECOMMENDATIONS
• Every B&B algorithm requires additional computations
– Not only the target subsets of r features, but also their supersets (of up to n features) have to be evaluated
• It does not guarantee that enough sub-trees will be cut off to keep the total number of criterion computations lower than in exhaustive search
• In the worst case, the criterion function is computed in every tree node
– The same as exhaustive search
OBSERVATIONS & RECOMMENDATIONS
• Criterion value computation is usually slower near the root
– The evaluated feature subsets are larger: J(X1, X2, … Xn)
• Sub-tree cut-offs are less frequent near the root
– The higher criterion values arising from larger subsets are compared to the bound, which is updated at the leaves
• The B&B algorithm usually spends most of its time on tedious, less promising evaluation of tree nodes in the levels closer to the root
• This effect is to be expected, especially for r << n
SECTION 2
Beam Search Algorithm
INTRODUCTION
• Beam search is a heuristic method for solving combinatorial optimization problems
• It is similar to breadth-first search, as it progresses level by level
• Only the most promising nodes at each level of the search tree are selected for further branching, while the remaining nodes are pruned off permanently
• Beam search was first used in the artificial intelligence community for speech recognition and image understanding problems
• The running time is polynomial in the problem size
ANALYSIS Algorithm 
a) Compute the classifier performance using each of the n features 
individually (n 1-tuples) 
b) Select the best K (beam-width) features based on a pre-defined 
selection criterion among these 1-tuples 
c) Add a new feature to each of these K features, forming K(n−1) 
2-tuples of features. The tuple-size t is equal to 2 at this stage 
d) Evaluate the performance of each of these t-tuples. Of these, 
select the best K, based on classification performance. 
ANALYSIS Algorithm
e) Form all possible (t + 1)-tuples by appending these K t-tuples with other features (not already in that t-tuple)
f) Repeat steps d) to e) until the stopping criterion is met; the tuple size at this stage is m
g) The best K m-tuples are the result of beam search
ANALYSIS Pseudo Code
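This pseudo-code slide was also an image lost in extraction. Here is a minimal Python sketch of steps a) to g), assuming an evaluate function that scores a feature subset (e.g., cross-validated classifier accuracy); unlike the tree on the slides, the candidate set below removes duplicate subsets automatically:

```python
def beam_search(n, m, K, evaluate):
    """Greedy beam search over feature subsets of {1..n}.

    Grows tuples one feature at a time, keeping only the best K subsets
    at each level, until tuples of size m are reached.
    evaluate maps a frozenset of feature indices to a score (higher is better).
    Returns the best K m-tuples as (score, sorted_features) pairs, best first.
    """
    features = range(1, n + 1)
    # Steps a)-b): score all n 1-tuples and keep the best K.
    beam = sorted((frozenset({f}) for f in features),
                  key=evaluate, reverse=True)[:K]
    # Steps c)-f): extend every kept tuple by one unused feature per level.
    for _ in range(m - 1):
        candidates = {s | {f} for s in beam for f in features if f not in s}
        beam = sorted(candidates, key=evaluate, reverse=True)[:K]
    # Step g): the best K m-tuples.
    return [(evaluate(s), sorted(s)) for s in beam]
```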
ANALYSIS Easy 5 Steps
1. Start with the empty set (no features) and evaluate the value of each feature individually
– Values can be calculated using a criterion function or by evaluating classifier performance
2. Choose a value for the beam width (K)
– K defines the number of subsets to be carried to the next level
3. Carry the best K subsets to the next level
– A cut-off value can be checked before the selection of the best subsets
4. Add a new feature (previously not used) to each of these selected feature subsets
5. Repeat the process until the tree reaches the target subset size
– Or a stopping criterion can be defined to terminate the process
EXAMPLE
How to reduce 5 features to 3 features using the Beam Search algorithm?
{1, 2, 3, 4, 5} → { ?, ?, ? }
Finding the best 3 features from the full set of features
EXAMPLE Step 1
• Start with the empty subset { } and evaluate each feature individually (assume the values as follows):
{1} = 30, {2} = 14, {3} = 28, {4} = 16, {5} = 25
EXAMPLE Step 2
• Select the best K (beam width) features based on a pre-defined selection criterion (assume K = 3)
• Best features: { 1 } (30), { 3 } (28), { 5 } (25)
EXAMPLE Step 3
• Add a new feature to each of these selected features. Order is not important; duplications cannot be avoided.
From { 1 }: adding 2, 3, 4, 5 gives values 31, 35, 60, 39
From { 3 }: adding 1, 2, 4, 5 gives values 30, 55, 40, 50
From { 5 }: adding 1, 2, 3, 4 gives values 34, 35, 34, 48
EXAMPLE Step 4
• Choose the best K performing subsets among the new feature sets
• Best features: { 1,4 } (60), { 3,2 } (55), { 3,5 } (50)
EXAMPLE Step 5
• Carry the best K performing subsets to the next level by adding the rest of the available features:
From { 1,4 }: adding 2, 3, 5 gives values 45, 40, 70
From { 3,2 }: adding 1, 4, 5 gives values 56, 58, 88
From { 3,5 }: adding 1, 2, 4 gives values 67, 62, 75
EXAMPLE Step 6
• The tree has reached 3 features, which is the target subset size; the maximum value gives the best feature set
• Best features: { 1,4,5 } (70), { 3,2,5 } (88), { 3,5,4 } (75)
• The maximum, { 3,2,5 } with value 88, is the best feature subset
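Replaying this example exactly with the beam_search sketch above is awkward, because the slide's illustrative scores give duplicated subsets different values (e.g., {1,3} is scored 35 under { 1 } but 30 under { 3 }), which a set-valued criterion cannot reproduce. Instead, here is a hypothetical criterion (invented for illustration; only the five singleton scores come from Step 1, and the bonus of 14 is chosen so that {1,4} scores 60 as above) showing the calling pattern:

```python
# Singleton scores from Step 1; the interaction bonus is invented.
BASE = {1: 30, 2: 14, 3: 28, 4: 16, 5: 25}

def evaluate(subset):
    score = sum(BASE[f] for f in subset)
    if {1, 4} <= subset:  # assumed synergy between features 1 and 4
        score += 14
    return score

print(beam_search(n=5, m=3, K=3, evaluate=evaluate))
# -> [(88, [1, 3, 4]), (85, [1, 4, 5]), (83, [1, 3, 5])]
```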
APPLICATIONS Beam Search
• Beam Search for Feature Selection in Automatic SVM Defect Classification
– In this paper, beam search is implemented with a support vector machine (SVM) classifier to select the candidate subsets
– Improvements have been proposed to the beam search algorithm for feature selection, and the modified version is called Smart Beam Search (SBS)
– Each defect in the data set is described by a high-dimensional feature vector consisting of about 100 features
APPLICATIONS Beam Search
– The data set comprises about 3000 images with 13 defect classes; results are presented for beam widths K = 2 and K = 5
– The SBS feature selection approach reduced the dimensionality of the feature space and increased the classifier performance
[Figure: Overall accuracy using features selected by SBS]
OBSERVATIONS & RECOMMENDATIONS
• There is no backtracking, since the intent of this technique is to search quickly
• Therefore, beam search methods are not guaranteed to find an optimal solution and cannot recover from wrong decisions
• Duplications cannot be avoided in the tree
• If a node leading to the optimal solution is discarded during the search process, there is no way to reach that optimal solution afterwards
• The beam width parameter K is fixed before the search starts
• A wider beam allows greater safety, but it increases the computational cost
COMPARISON

Branch and Bound | Beam Search
Follows a depth-first strategy | Similar to breadth-first search
Guaranteed to find the optimal feature subset | Not guaranteed to find the optimal feature subset
An exponential search | Polynomial running time in the problem size
Backtracking needed to prune unnecessary subsets | No backtracking process needed
Needs additional computations to backtrack after constructing the tree | No additional computations needed after constructing the tree
Must fulfill the monotonicity property | No need to consider the monotonicity property
Duplicate subsets are omitted | Duplications cannot be avoided
REFERENCES
1. Narendra, P. M., & Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9), 917-922.
2. Somol, P., Pudil, P., & Kittler, J. (2004). Fast branch & bound algorithms for optimal feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(7), 900-912.
3. Burrell, L., Smart, O., Georgoulas, G. K., Marsh, E., & Vachtsevanos, G. J. (2007, June). Evaluation of feature selection techniques for analysis of functional MRI and EEG. In DMIN (pp. 256-262).
4. Gupta, P., Doermann, D., & DeMenthon, D. (2002). Beam search for feature selection in automatic SVM defect classification. In Proceedings of the 16th International Conference on Pattern Recognition (Vol. 2, pp. 212-215). IEEE.
5. Dashti, M. T., & Wijs, A. J. (2007). Pruning state spaces with extended beam search. In Automated Technology for Verification and Analysis (pp. 543-552). Springer Berlin Heidelberg.
6. Valente, J., & Alves, R. A. (2005). Filtered and recovering beam search algorithms for the early/tardy scheduling problem with no idle time. Computers & Industrial Engineering, 48(2), 363-375.