The document describes a proposed approach called SHSEL for hierarchical feature selection in machine learning. SHSEL exploits the hierarchical structure of feature spaces, where more specific features imply more general ones. It initially selects ranges of similar features in each branch based on relevance similarity. It then prunes the set further by selecting only the most relevant remaining features. The authors evaluate SHSEL on real and synthetic datasets compared to other feature selection methods, finding it achieves comparable or improved accuracy while significantly reducing the feature space.
Feature selection in hierarchical feature spaces using similarity-based hierarchical feature selection (SHSEL)
1. Feature Selection in Hierarchical Feature Spaces
10/12/2014 Petar Ristoski, Heiko Paulheim
2. Motivation: Linked Open Data as Background
Knowledge
• Linked Open Data is a method for publishing interlinked
datasets using machine-interpretable semantics
• Started in 2007
• A collection of ~1,000 datasets
– Various domains, e.g. general knowledge, government data, …
– Using semantic web standards (HTTP, RDF, SPARQL)
• Free of charge
• Machine-processable
• Sophisticated tool stacks
3. Motivation: Linked Open Data as Background
Knowledge
4. Example: the Auto MPG Dataset
• A well-known UCI dataset
– Goal: predict fuel consumption of cars
• Hypothesis: background knowledge → more accurate predictions
• Used background knowledge:
– Entity types and categories from DBpedia (a structured version of Wikipedia)
• Results: M5Rules down to almost half the prediction error
– i.e. on average, we are wrong by 1.6 instead of 2.9 MPG
Attribute set                          Linear Regression     M5Rules
                                       RMSE     RE           RMSE     RE
original                               3.359    0.118        2.859    0.088
original + direct types                3.334    0.117        2.835    0.091
original + categories                  4.474    0.144        2.926    0.090
original + direct types + categories   2.551    0.088        1.574    0.042
5. Drawbacks
• The generated feature sets are rather large
– e.g. for a dataset of 300 instances, up to 5,000 features may be generated
from a single source
• Increased complexity and runtime
• Overfitting on overly specific features
6. Linked Open Data is Backed by Ontologies
[Figure: LOD graph excerpt and ontology excerpt]
8. Problem Statement
• Each instance is an n-dimensional binary feature vector (v1, v2, …, vn),
where vi ∈ {0,1} for all 1 ≤ i ≤ n
• Feature space: V = {v1, v2, …, vn}
• The hierarchical relation between two features vi and vj is denoted as
vi < vj, where vi is more specific than vj
• For all hierarchical features, the following implication holds:
vi < vj → (vi = 1 → vj = 1)
• Transitivity holds between hierarchical features:
vi < vj ∧ vj < vk → vi < vk
• The problem of feature selection can be defined as finding a projection
of V to V′, where V′ ⊆ V and p(V′) ≥ p(V), with the performance function
p: P(V) → [0,1]
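The following is a minimal Python sketch of this setup, using a hypothetical
three-level type hierarchy (the feature names are illustrative, not from the
slides); it encodes the relation vi < vj as parent links and checks the
implication above:

    # Minimal sketch of a hierarchical binary feature space (hypothetical names).
    # parent[v] maps a more specific feature to its more general one (v < parent[v]).
    parent = {
        "Basketball_Player": "Athlete",
        "Baseball_Player": "Athlete",
        "Athlete": "Person",
    }

    def ancestors(v):
        """All features more general than v, via transitivity of <."""
        out = []
        while v in parent:
            v = parent[v]
            out.append(v)
        return out

    def is_consistent(x):
        """Check the implication v_i < v_j -> (v_i = 1 -> v_j = 1)."""
        return all(x.get(a, 0) == 1
                   for v, val in x.items() if val == 1
                   for a in ancestors(v))

    # An instance mentioning a basketball player also has all generalizations set.
    x = {"Basketball_Player": 1, "Athlete": 1, "Person": 1}
    print(is_consistent(x))  # True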
9. Hierarchical Feature Space: Example
• "Josh Donaldson is the best 3rd baseman in the American League."
• "LeBron James NOT ranked #1 after newly released list of Top NBA players"
• "Two things are infinite: the universe and human stupidity; and I'm not sure
about the universe." ― Albert Einstein
• "Nineteen-year-old figure skater Yuzuru Hanyu, who won a gold medal in the
Sochi Olympics, is among the 684 peo... http://bit.ly/1kb6W5y"
• "In his weekly address, President Barack Obama discusses expanding
opportunity for hard-working Americans: http://ofa.bo/ccH"
• "Barack Obama cracks jokes at Vladimir Putin's expense http://dlvr.it/5Z7JCR"
• "I spotted the Lance Armstrong case in 2006 when everyone thought he was
God, and now this case catches my attention."
10. Hierarchical Feature Space: Example
[Diagram: dbpedia:Josh_Donaldson, mentioned in "Josh Donaldson is the best 3rd
baseman in the American League.", has type dbpedia-owl:Baseball_Player;
dbpedia:LeBron_James, mentioned in "LeBron James NOT ranked #1 after newly
released list of Top NBA players", has type dbpedia-owl:Basketball_Player;
both types are subclasses of dbpedia-owl:Athlete]
14. Standard Feature Selection
• Wrapper methods
– Computationally expensive
• Filter methods
– Several techniques for scoring the relevance of the features
• Information Gain
• χ2
• Information Gain Ratio
• Gini Index
– Often similar results
17. TSEL Feature Selection
• Tree-based feature selection (Jeong et al.)
– Select most representative and most effective feature from each branch
of the hierarchy
• lift(f, C) = P(f|C) / P(C)
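Taking the lift formula as written on the slide, here is a small Python sketch
of the score; the counts below are hypothetical:

    # Lift of a binary feature f with respect to class C, as on the slide:
    # lift = P(f|C) / P(C).
    def lift(f_and_c, c_count, n):
        p_f_given_c = f_and_c / c_count   # P(f|C)
        p_c = c_count / n                 # P(C)
        return p_f_given_c / p_c

    # 1,000 instances, 400 in class C, feature f present in 300 of them.
    print(lift(f_and_c=300, c_count=400, n=1000))  # 1.875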
18. Bottom-Up Hill-Climbing Feature Selection
• Bottom-up hill climbing search algorithm to find an optimal subset of
concepts for document representation (Wang et al.)
f = (1 + (α − n)/α) ∗ β ∗ Σi∈D |Dci|, where Dci ⊆ DKNNi and β > 0
19. Greedy Top-Down Feature Selection
• Greedy based top-down search strategy for feature selection (Lu et al.)
– Select the most effective nodes from different levels of the hierarchy
21. Hierarchical Feature Selection Approach
(SHSEL)
• Exploit the hierarchical structure of the feature space
• Hierarchical relation : vi < vj → (vi = 1 → vj = 1)
• Relevance similarity:
– Relevance (Blum et al.) : A feature vi is relevant to a target class C if
there exists a pair of examples A and B in the instance space such that
A and B differ only in their assignment to vi and C(A) ≠ C(B)
• Two features vi and vj have similar relevance if:
1 − |R(vi) − R(vj)| ≥ t, where t ∈ [0,1]
• Goal: Identify features with similar relevance, and select the most
valuable abstract features, without losing predictive power
22. Hierarchical Feature Selection Approach
(SHSEL)
• Initial Selection
– Identify and filter out ranges of nodes with similar relevance in each
branch of the hierarchy
• Pruning
– Select only the most relevant features from the previously reduced set
23. Initial SHSEL Feature Selection
1. Identify ranges of nodes with similar relevance in each branch:
– Information gain: s(vi, vj) = 1 − |IG(vi) − IG(vj)|
– Correlation: s(vi, vj) = Correlation(vi, vj)
– Example: s(vi, vj) = 1 − |0.45 − 0.5| = 0.95 with t = 0.9, so s > t
2. If the similarity is greater than a user-specified threshold t, remove
the more specific feature, based on the hierarchical relation
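A compact Python sketch of this initial selection step, assuming relevance
scores (here information gain) are precomputed and the hierarchy is given as
child-to-parent links; the feature names and scores are illustrative:

    # Initial SHSEL selection (sketch): relevance scores are assumed
    # precomputed in `ig`; `parent` maps each feature to its generalization.
    def initial_shsel(features, parent, ig, t=0.9):
        selected = set(features)
        for child, par in parent.items():
            if child in selected and par in selected:
                s = 1 - abs(ig[child] - ig[par])   # relevance similarity
                if s >= t:
                    selected.discard(child)        # drop the more specific feature
        return selected

    ig = {"Basketball_Player": 0.45, "Athlete": 0.5, "Person": 0.1}
    parent = {"Basketball_Player": "Athlete", "Athlete": "Person"}
    print(initial_shsel(ig.keys(), parent, ig))
    # s(Basketball_Player, Athlete) = 1 - |0.45 - 0.5| = 0.95 > 0.9, so
    # Basketball_Player is removed; Athlete and Person remain.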
24. Post SHSEL Feature Selection
• Select the features with the highest relevance on each path
– either above a user-specified threshold,
– or above the average relevance of the path
– Example: a feature with IG(vi) = 0.2 is dropped when the path average is
AVG(Sp) = 0.25
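A Python sketch of this pruning step using the path-average variant; the paths
and relevance scores below are hypothetical:

    # Pruning step (sketch): on each root-to-leaf path, keep only features
    # whose relevance is at least the average relevance of that path.
    def prune_shsel(paths, ig):
        kept = set()
        for path in paths:
            avg = sum(ig[v] for v in path) / len(path)
            kept |= {v for v in path if ig[v] >= avg}
        return kept

    ig = {"Person": 0.1, "Athlete": 0.5, "Politician": 0.2}
    paths = [["Person", "Athlete"], ["Person", "Politician"]]
    print(prune_shsel(paths, ig))
    # Path averages are 0.3 and 0.15: Athlete (0.5) and Politician (0.2)
    # survive their paths, Person (0.1) is dropped from both.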
26. Evaluation
• We use 5 real-world datasets and 6 synthetically generated datasets
• Classification methods:
– Naïve Bayes
– k-Nearest Neighbors (k=3)
– Support Vector Machine (polynomial kernel function)
• No parameter optimization
27. Evaluation: Real World Datasets
Name               Feature source         #Instances  Class labels                        #Features
Sports Tweets T    DBpedia Direct Types   1,179       positive (523); negative (656)      4,082
Sports Tweets C    DBpedia Categories     1,179       positive (523); negative (656)      10,883
Cities             DBpedia Direct Types   212         high (67); medium (106); low (39)   727
NY Daily Headings  DBpedia Direct Types   1,016       positive (580); negative (436)      5,145
StumbleUpon        DMOZ Categories        3,020       positive (1,370); negative (1,650)  3,976
• Hierarchical features are generated from DBpedia (structured version of
Wikipedia)
– The text is annotated with concepts using DBpedia Spotlight
• The feature generation is independent of the class labels, and it is unbiased
towards any of the feature selection approaches
28. Evaluation: Synthetic Datasets
Name #Instances Class Labels #Features
S-D2-B2 1,000 positive(500); negative(500) 1,201
S-D2-B5 1,000 positive(500); negative(500) 1,021
S-D2-B10 1,000 positive(500); negative(500) 961
S-D4-B2 1,000 positive(500); negative(500) 2,101
S-D4-B4 1,000 positive(500); negative(500) 1,741
S-D4-B10 1,000 positive(500); negative(500) 1,621
• Generate the middle layer using a polynomial function
• Generate the hierarchy upwards and downwards following the
hierarchical feature implication and transitivity rule
• The depth and branching factor are controlled with parameters D
and B
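The slide gives only this recipe, so the Python sketch below is an
assumption-laden illustration: the particular polynomial, the OR-based upward
propagation, and the random downward sampling are all hypothetical choices
that merely respect the implication and transitivity rules:

    import random

    # Hypothetical synthetic generator (details are not on the slide).
    def make_instance(b):
        # Middle layer: b binary features; the class is a toy polynomial of them.
        mid = [random.randint(0, 1) for _ in range(b)]
        label = int(sum((i + 1) * v for i, v in enumerate(mid)) ** 2 % 2)
        # Upwards: a parent is 1 if any child is 1 (keeps v_i = 1 -> v_j = 1).
        up = [max(mid[i], mid[(i + 1) % b]) for i in range(b)]
        # Downwards: a child may be 1 only if its parent is 1.
        down = [v & random.randint(0, 1) for v in mid]
        return mid + up + down, label

    X, y = zip(*(make_instance(5) for _ in range(1000)))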
31. Evaluation: Approach
• Testing all approaches using three classification methods
– Naïve Bayes, KNN, and SVM
• Metrics for performance evaluation (see the sketch below)
– Accuracy: Acc(V′) = (correctly classified instances using V′) / (total number of instances)
– Feature space compression: c(V′) = 1 − |V′| / |V|
– Harmonic mean: H = 2 ∗ Acc(V′) ∗ c(V′) / (Acc(V′) + c(V′))
• Results calculated using stratified 10-fold cross validation
– Feature selection is performed inside each fold
• Parameter optimization for each feature selection strategy
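The three metrics are straightforward to compute; this Python sketch uses
hypothetical numbers (90% accuracy after keeping 100 of the 727 Cities
features):

    # Evaluation metrics from the slide: accuracy is taken as given, plus
    # feature space compression and the harmonic mean of the two.
    def compression(n_selected, n_total):
        return 1 - n_selected / n_total          # c(V') = 1 - |V'| / |V|

    def harmonic_mean(acc, c):
        return 2 * acc * c / (acc + c)           # H = 2 * Acc * c / (Acc + c)

    c = compression(100, 727)                    # ~0.862
    print(round(harmonic_mean(0.9, c), 3))       # ~0.881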
32. Evaluation: SHSEL IG
• Classification accuracy when using different relevance similarity thresholds
on the Cities dataset
[Chart: accuracy, feature space compression, and harmonic mean (0–100%)
plotted against the relevance similarity threshold]
33. Evaluation: Classification Accuracy (NB)
[Charts: classification accuracy (0–100%) of original, initialSHSEL IG,
initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG,
HillClimbing, and GreedyTopDown on the real-world datasets (Sports Tweets T,
Sports Tweets C, StumbleUpon, Cities, NY Daily Headings) and on the synthetic
datasets (S_D2_B2, S_D2_B5, S_D2_B10, S_D4_B2, S_D4_B5, S_D4_B10)]
34. Evaluation: Feature Space Compression (NB)
[Charts: feature space compression (0–100%) achieved by initialSHSEL IG,
initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG,
HillClimbing, and GreedyTopDown on the same real-world and synthetic datasets]
35. Evaluation: Harmonic Mean (NB)
[Charts: harmonic mean of accuracy and compression (0–100%) for the same
approaches on the real-world and synthetic datasets]
36. Conclusion & Outlook
• Contribution
– An approach that exploits feature hierarchies for feature selection in
combination with standard relevance metrics
– The evaluation shows that the approach outperforms standard feature
selection techniques as well as other hierarchy-based approaches
• Future Work
– Conduct further experiments
• E.g. text mining, bioinformatics
– Feature Selection in unsupervised learning
• E.g. clustering, outlier detection
• Laplacian Score
37. Feature Selection in Hierarchical Feature Spaces
10/12/2014 Petar Ristoski, Heiko Paulheim