The document describes a proposed approach called SHSEL for hierarchical feature selection in machine learning. SHSEL exploits the hierarchical structure of feature spaces, where more specific features imply more general ones. It initially selects ranges of similar features in each branch based on relevance similarity. It then prunes the set further by selecting only the most relevant remaining features. The authors evaluate SHSEL on real and synthetic datasets compared to other feature selection methods, finding it achieves comparable or improved accuracy while significantly reducing the feature space.
Feature selection in hierarchical feature spaces using similarity-based hierarchical feature selection (SHSEL)
1. Feature Selection in Hierarchical Feature Spaces
10/12/2014 Petar Ristoski, Heiko Paulheim
2. Motivation: Linked Open Data as Background
Knowledge
• Linked Open Data is a method for publishing interlinked
datasets using machine-interpretable semantics
• Started in 2007
• A collection of ~1,000 datasets
– Various domains, e.g. general knowledge, government data, …
– Using semantic web standards (HTTP, RDF, SPARQL)
• Free of charge
• Machine-processable
• Sophisticated tool stacks
3. Motivation: Linked Open Data as Background
Knowledge
4. Example: the Auto MPG Dataset
• A well-known UCI dataset
– Goal: predict fuel consumption of cars
• Hypothesis: background knowledge → more accurate predictions
• Used background knowledge:
– Entity types and categories from DBpedia (a structured version of Wikipedia)
• Results: M5Rules down to almost half the prediction error
– i.e. on average, we are wrong by 1.6 instead of 2.9 MPG
Attribute set                          Linear Regression     M5Rules
                                       RMSE     RE           RMSE     RE
original                               3.359    0.118        2.859    0.088
original + direct types                3.334    0.117        2.835    0.091
original + categories                  4.474    0.144        2.926    0.090
original + direct types + categories   2.551    0.088        1.574    0.042
5. Drawbacks
• The generated feature sets are rather large
– e.g. for a dataset of 300 instances, up to 5,000 features may be generated
from a single source
• Increased complexity and runtime
• Overfitting on overly specific features
6. Linked Open Data is Backed by Ontologies
[Figure: LOD graph excerpt and ontology excerpt]
8. Problem Statement
• Each instance is an n-dimensional binary feature vector (v1, v2, …, vn),
where vi ∈ {0,1} for all 1 ≤ i ≤ n
• Feature space: V = {v1, v2, …, vn}
• The hierarchical relation between two features vi and vj is denoted as
vi < vj, where vi is more specific than vj
• For all hierarchical features, the following implication holds:
vi < vj → (vi = 1 → vj = 1)
• Transitivity holds between hierarchical features:
vi < vj ∧ vj < vk → vi < vk
• The problem of feature selection can be defined as finding a projection
of V to V′, where V′ ⊆ V and p(V′) ≥ p(V), with the performance function
p: P(V) → [0,1]
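The following is a minimal Python sketch of this setup, using a hypothetical
three-level type hierarchy (the feature names are illustrative, not from the
slides); it encodes the relation vi < vj as parent links and checks the
implication above:

    # Minimal sketch of a hierarchical binary feature space (hypothetical names).
    # parent[v] maps a more specific feature to its more general one (v < parent[v]).
    parent = {
        "Basketball_Player": "Athlete",
        "Baseball_Player": "Athlete",
        "Athlete": "Person",
    }

    def ancestors(v):
        """All features more general than v, via transitivity of <."""
        out = []
        while v in parent:
            v = parent[v]
            out.append(v)
        return out

    def is_consistent(x):
        """Check the implication v_i < v_j -> (v_i = 1 -> v_j = 1)."""
        return all(x.get(a, 0) == 1
                   for v, val in x.items() if val == 1
                   for a in ancestors(v))

    # An instance mentioning a basketball player also has all generalizations set.
    x = {"Basketball_Player": 1, "Athlete": 1, "Person": 1}
    print(is_consistent(x))  # True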
9. Hierarchical Feature Space: Example
• "Josh Donaldson is the best 3rd baseman in the American League."
• "LeBron James NOT ranked #1 after newly released list of Top NBA players"
• "Two things are infinite: the universe and human stupidity; and I'm not sure
about the universe." ― Albert Einstein
• "Nineteen-year-old figure skater Yuzuru Hanyu, who won a gold medal in the
Sochi Olympics, is among the 684 peo... http://bit.ly/1kb6W5y"
• "In his weekly address, President Barack Obama discusses expanding
opportunity for hard-working Americans: http://ofa.bo/ccH"
• "Barack Obama cracks jokes at Vladimir Putin's expense http://dlvr.it/5Z7JCR"
• "I spotted the Lance Armstrong case in 2006 when everyone thought he was
God, and now this case catches my attention."
10. Hierarchical Feature Space: Example
[Diagram: dbpedia:Josh_Donaldson, mentioned in "Josh Donaldson is the best 3rd
baseman in the American League.", has type dbpedia-owl:Baseball_Player;
dbpedia:LeBron_James, mentioned in "LeBron James NOT ranked #1 after newly
released list of Top NBA players", has type dbpedia-owl:Basketball_Player;
both types are subclasses of dbpedia-owl:Athlete]
14. Standard Feature Selection
• Wrapper methods
– Computationally expensive
• Filter methods
– Several techniques for scoring the relevance of the features
• Information Gain
• χ2
• Information Gain Ratio
• Gini Index
– Often similar results
17. TSEL Feature Selection
• Tree-based feature selection (Jeong et al.)
– Select most representative and most effective feature from each branch
of the hierarchy
• lift(f, C) = P(f|C) / P(C)
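Taking the lift formula as written on the slide, here is a small Python sketch
of the score; the counts below are hypothetical:

    # Lift of a binary feature f with respect to class C, as on the slide:
    # lift = P(f|C) / P(C).
    def lift(f_and_c, c_count, n):
        p_f_given_c = f_and_c / c_count   # P(f|C)
        p_c = c_count / n                 # P(C)
        return p_f_given_c / p_c

    # 1,000 instances, 400 in class C, feature f present in 300 of them.
    print(lift(f_and_c=300, c_count=400, n=1000))  # 1.875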
18. Bottom-Up Hill-Climbing Feature Selection
• Bottom-up hill climbing search algorithm to find an optimal subset of
concepts for document representation (Wang et al.)
f = (1 + (α − n)/α) ∗ β ∗ Σi∈D |Dci|, where Dci ⊆ DKNNi and β > 0
19. Greedy Top-Down Feature Selection
• Greedy based top-down search strategy for feature selection (Lu et al.)
– Select the most effective nodes from different levels of the hierarchy
21. Hierarchical Feature Selection Approach
(SHSEL)
• Exploit the hierarchical structure of the feature space
• Hierarchical relation : vi < vj → (vi = 1 → vj = 1)
• Relevance similarity:
– Relevance (Blum et al.) : A feature vi is relevant to a target class C if
there exists a pair of examples A and B in the instance space such that
A and B differ only in their assignment to vi and C(A) ≠ C(B)
• Two features vi and vj have similar relevance if:
1 − |R(vi) − R(vj)| ≥ t, where t ∈ [0,1]
• Goal: Identify features with similar relevance, and select the most
valuable abstract features, without losing predictive power
22. Hierarchical Feature Selection Approach
(SHSEL)
• Initial Selection
– Identify and filter out ranges of nodes with similar relevance in each
branch of the hierarchy
• Pruning
– Select only the most relevant features from the previously reduced set
23. Initial SHSEL Feature Selection
1. Identify ranges of nodes with similar relevance in each branch:
– Information gain: s(vi, vj) = 1 − |IG(vi) − IG(vj)|
– Correlation: s(vi, vj) = Correlation(vi, vj)
– Example: s(vi, vj) = 1 − |0.45 − 0.5| = 0.95 with t = 0.9, so s > t
2. If the similarity is greater than a user-specified threshold t, remove
the more specific feature, based on the hierarchical relation
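A compact Python sketch of this initial selection step, assuming relevance
scores (here information gain) are precomputed and the hierarchy is given as
child-to-parent links; the feature names and scores are illustrative:

    # Initial SHSEL selection (sketch): relevance scores are assumed
    # precomputed in `ig`; `parent` maps each feature to its generalization.
    def initial_shsel(features, parent, ig, t=0.9):
        selected = set(features)
        for child, par in parent.items():
            if child in selected and par in selected:
                s = 1 - abs(ig[child] - ig[par])   # relevance similarity
                if s >= t:
                    selected.discard(child)        # drop the more specific feature
        return selected

    ig = {"Basketball_Player": 0.45, "Athlete": 0.5, "Person": 0.1}
    parent = {"Basketball_Player": "Athlete", "Athlete": "Person"}
    print(initial_shsel(ig.keys(), parent, ig))
    # s(Basketball_Player, Athlete) = 1 - |0.45 - 0.5| = 0.95 > 0.9, so
    # Basketball_Player is removed; Athlete and Person remain.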
24. Post SHSEL Feature Selection
• Select the features with the highest relevance on each path
– either above a user-specified threshold,
– or above the average relevance of the path
– Example: a feature with IG(vi) = 0.2 is dropped when the path average is
AVG(Sp) = 0.25
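A Python sketch of this pruning step using the path-average variant; the paths
and relevance scores below are hypothetical:

    # Pruning step (sketch): on each root-to-leaf path, keep only features
    # whose relevance is at least the average relevance of that path.
    def prune_shsel(paths, ig):
        kept = set()
        for path in paths:
            avg = sum(ig[v] for v in path) / len(path)
            kept |= {v for v in path if ig[v] >= avg}
        return kept

    ig = {"Person": 0.1, "Athlete": 0.5, "Politician": 0.2}
    paths = [["Person", "Athlete"], ["Person", "Politician"]]
    print(prune_shsel(paths, ig))
    # Path averages are 0.3 and 0.15: Athlete (0.5) and Politician (0.2)
    # survive their paths, Person (0.1) is dropped from both.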
26. Evaluation
• We use 5 real-world datasets and 6 synthetically generated datasets
• Classification methods:
– Naïve Bayes
– k-Nearest Neighbors (k=3)
– Support Vector Machine (polynomial kernel function)
• No parameter optimization
27. Evaluation: Real World Datasets
Name               Feature source         #Instances  Class labels                        #Features
Sports Tweets T    DBpedia Direct Types   1,179       positive (523); negative (656)      4,082
Sports Tweets C    DBpedia Categories     1,179       positive (523); negative (656)      10,883
Cities             DBpedia Direct Types   212         high (67); medium (106); low (39)   727
NY Daily Headings  DBpedia Direct Types   1,016       positive (580); negative (436)      5,145
StumbleUpon        DMOZ Categories        3,020       positive (1,370); negative (1,650)  3,976
• Hierarchical features are generated from DBpedia (structured version of
Wikipedia)
– The text is annotated with concepts using DBpedia Spotlight
• The feature generation is independent of the class labels, and it is unbiased
towards any of the feature selection approaches
28. Evaluation: Synthetic Datasets
Name #Instances Class Labels #Features
S-D2-B2 1,000 positive(500); negative(500) 1,201
S-D2-B5 1,000 positive(500); negative(500) 1,021
S-D2-B10 1,000 positive(500); negative(500) 961
S-D4-B2 1,000 positive(500); negative(500) 2,101
S-D4-B4 1,000 positive(500); negative(500) 1,741
S-D4-B10 1,000 positive(500); negative(500) 1,621
• Generate the middle layer using a polynomial function
• Generate the hierarchy upwards and downwards following the
hierarchical feature implication and transitivity rule
• The depth and branching factor are controlled with parameters D
and B
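The slide gives only this recipe, so the Python sketch below is an
assumption-laden illustration: the particular polynomial, the OR-based upward
propagation, and the random downward sampling are all hypothetical choices
that merely respect the implication and transitivity rules:

    import random

    # Hypothetical synthetic generator (details are not on the slide).
    def make_instance(b):
        # Middle layer: b binary features; the class is a toy polynomial of them.
        mid = [random.randint(0, 1) for _ in range(b)]
        label = int(sum((i + 1) * v for i, v in enumerate(mid)) ** 2 % 2)
        # Upwards: a parent is 1 if any child is 1 (keeps v_i = 1 -> v_j = 1).
        up = [max(mid[i], mid[(i + 1) % b]) for i in range(b)]
        # Downwards: a child may be 1 only if its parent is 1.
        down = [v & random.randint(0, 1) for v in mid]
        return mid + up + down, label

    X, y = zip(*(make_instance(5) for _ in range(1000)))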
31. Evaluation: Approach
• Testing all approaches using three classification methods
– Naïve Bayes, KNN, and SVM
• Metrics for performance evaluation (see the sketch below)
– Accuracy: Acc(V′) = (correctly classified instances using V′) / (total number of instances)
– Feature space compression: c(V′) = 1 − |V′| / |V|
– Harmonic mean: H = 2 ∗ Acc(V′) ∗ c(V′) / (Acc(V′) + c(V′))
• Results calculated using stratified 10-fold cross validation
– Feature selection is performed inside each fold
• Parameter optimization for each feature selection strategy
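The three metrics are straightforward to compute; this Python sketch uses
hypothetical numbers (90% accuracy after keeping 100 of the 727 Cities
features):

    # Evaluation metrics from the slide: accuracy is taken as given, plus
    # feature space compression and the harmonic mean of the two.
    def compression(n_selected, n_total):
        return 1 - n_selected / n_total          # c(V') = 1 - |V'| / |V|

    def harmonic_mean(acc, c):
        return 2 * acc * c / (acc + c)           # H = 2 * Acc * c / (Acc + c)

    c = compression(100, 727)                    # ~0.862
    print(round(harmonic_mean(0.9, c), 3))       # ~0.881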
32. Evaluation: SHSEL IG
• Classification accuracy when using different relevance similarity thresholds
on the Cities dataset
[Chart: accuracy, feature space compression, and harmonic mean (0–100%)
plotted against the relevance similarity threshold]
33. Evaluation: Classification Accuracy (NB)
[Charts: classification accuracy (0–100%) of original, initialSHSEL IG,
initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG,
HillClimbing, and GreedyTopDown on the real-world datasets (Sports Tweets T,
Sports Tweets C, StumbleUpon, Cities, NY Daily Headings) and on the synthetic
datasets (S_D2_B2, S_D2_B5, S_D2_B10, S_D4_B2, S_D4_B5, S_D4_B10)]
34. Evaluation: Feature Space Compression (NB)
[Charts: feature space compression (0–100%) achieved by initialSHSEL IG,
initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG,
HillClimbing, and GreedyTopDown on the same real-world and synthetic datasets]
35. Evaluation: Harmonic Mean (NB)
[Charts: harmonic mean of accuracy and compression (0–100%) for the same
approaches on the real-world and synthetic datasets]
36. Conclusion & Outlook
• Contribution
– An approach that exploits feature hierarchies for feature selection in
combination with standard relevance metrics
– The evaluation shows that the approach outperforms standard feature
selection techniques as well as other hierarchy-based approaches
• Future Work
– Conduct further experiments
• E.g. text mining, bioinformatics
– Feature Selection in unsupervised learning
• E.g. clustering, outlier detection
• Laplacian Score
37. Feature Selection in Hierarchical Feature Spaces
10/12/2014 Petar Ristoski, Heiko Paulheim