Feature Selection in 
Hierarchical Feature Spaces 
10/12/2014 Petar Ristoski, Heiko Paulheim
Motivation: Linked Open Data as Background 
Knowledge 
• Linked Open Data is a method for publishing interlinked 
datasets using machine interpretable semantics 
• Started 2007 
• A collection of ~1,000 datasets 
– Various domains, e.g. general knowledge, government data, … 
– Using semantic web standards (HTTP, RDF, SPARQL) 
• Free of charge 
• Machine processable 
• Sophisticated tool stacks 
Example: the Auto MPG Dataset 
• A well-known UCI dataset 
– Goal: predict fuel consumption of cars 
• Hypothesis: background knowledge → more accurate predictions 
• Used background knowledge: 
– Entity types and categories from DBpedia (=Wikipedia) 
• Results: M5Rules down to almost half the prediction error 
– i.e. on average, we are wrong by 1.6 instead of 2.9 MPG 
Attribute set                          Linear Regression    M5Rules
                                       RMSE     RE          RMSE     RE
original                               3.359    0.118       2.859    0.088
original + direct types                3.334    0.117       2.835    0.091
original + categories                  4.474    0.144       2.926    0.090
original + direct types + categories   2.551    0.088       1.574    0.042
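The "almost half" claim can be checked from the M5Rules RMSE values in the table above. A minimal sketch; the variable names are illustrative and not part of the original slides:

```python
# M5Rules RMSE values taken from the table above
rmse_original = 2.859
rmse_enriched = 1.574  # original + direct types + categories

ratio = rmse_enriched / rmse_original  # share of the error that remains
reduction = 1.0 - ratio                # share of the error that is removed
print(f"remaining error: {ratio:.1%}, reduction: {reduction:.1%}")
# remaining error: 55.1%, reduction: 44.9%
```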
Drawbacks 
• The generated feature sets are rather large 
– e.g. for a dataset of 300 instances, up to 5,000 features may be generated
from one source
• Increased complexity and runtime
• Overfitting on overly specific features
Linked Open Data is Backed by Ontologies 
[Figure: LOD graph excerpt and ontology excerpt]
HIERARCHICAL FEATURE 
SPACE 
Problem Statement 
• Each instance is an n-dimensional binary feature vector (v1,v2,…,vn),
where vi ∈ {0,1} for all 1 ≤ i ≤ n
• Feature space: V={v1,v2,…, vn} 
• Hierarchic relation between two features vi and vj can be denoted as 
vi < vj, where vi is more specific than vj 
• For all hierarchical features, the following implication holds: 
vi < vj → (vi = 1 → vj = 1) 
• Transitivity between hierarchical features exists: 
vi < vj ˄ vj < vk → vi < vk 
• The problem of feature selection can be defined as finding a
projection of V to V’, where V’ ⊆ V and p(V’) ≥ p(V), where p is a
performance function:
$p: P(V) \to [0,1]$
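The implication and transitivity rules above can be checked mechanically on a single instance. A minimal sketch, assuming a toy single-parent hierarchy; the feature names, the `PARENT` mapping and both helper functions are hypothetical and only illustrate the definitions:

```python
# Hypothetical toy hierarchy: child -> parent (the child is more specific).
# Assumption: every feature has at most one direct parent.
PARENT = {
    "BasketballPlayer": "Athlete",
    "BaseballPlayer": "Athlete",
    "Athlete": "Person",
}

def ancestors(feature, parent=PARENT):
    """Return all more general features, following transitivity."""
    result = []
    while feature in parent:
        feature = parent[feature]
        result.append(feature)
    return result

def is_consistent(instance, parent=PARENT):
    """Check vi < vj -> (vi = 1 -> vj = 1) for every feature set to 1."""
    return all(
        instance.get(anc, 0) == 1
        for feat, value in instance.items() if value == 1
        for anc in ancestors(feat, parent)
    )

ok = {"BasketballPlayer": 1, "BaseballPlayer": 0, "Athlete": 1, "Person": 1}
bad = {"BasketballPlayer": 1, "Athlete": 0, "Person": 1}
print(is_consistent(ok))   # True
print(is_consistent(bad))  # False
```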
Hierarchical Feature Space: Example 
• Josh Donaldson is the best 3rd baseman in the American League.
• LeBron James NOT ranked #1 after newly released list of Top NBA players
• “Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.”―Albert Einstein
• Nineteen-year-old figure skater Yuzuru Hanyu, who won a gold medal in the Sochi Olympics, is among the 684 peo... http://bit.ly/1kb6W5y
• In his weekly address, President Barack Obama discusses expanding opportunity for hard-working Americans: http://ofa.bo/ccH
• Barack Obama cracks jokes at Vladimir Putin's expense http://dlvr.it/5Z7JCR
• I spotted the Lance Armstrong case in 2006 when everyone thought he was God, and now this case catches my attention.
Hierarchical Feature Space: Example 
• LeBron James NOT ranked #1 after newly released list of Top NBA players
– Entity: dbpedia:LeBron_James, type: dbpedia-owl:Basketball_Player
• Josh Donaldson is the best 3rd baseman in the American League.
– Entity: dbpedia:Josh_Donaldson, type: dbpedia-owl:Baseball_Player
• Common, more general type of both: dbpedia-owl:Athlete
Hierarchical Feature Space 
• Linked Open Data 
– DBpedia, YAGO, Biperpedia, Google Knowledge Graph 
• Lexical Databases
– WordNet, DANTE 
• Domain specific ontologies, taxonomies and vocabularies 
– Bioinformatics: Gene Ontology (GO), Entrez 
– Drugs: the Drug Ontology 
– E-commerce: GoodRelations 
RELATED APPROACHES 
Standard Feature Selection 
• Wrapper methods 
– Computationally expensive 
• Filter methods 
– Several techniques for scoring the relevance of the features 
• Information Gain 
• χ²
• Information Gain Ratio 
• Gini Index 
– Often similar results 
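As an illustration of a filter score, information gain for a discrete feature can be computed as the class entropy minus the weighted class entropy after splitting on the feature. A minimal sketch using only the standard library; the function names are illustrative and this is the textbook definition, not necessarily the exact implementation used in the experiments:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Class entropy minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

labels = ["pos", "pos", "neg", "neg"]
print(information_gain([1, 1, 0, 0], labels))  # 1.0 (feature separates the classes)
print(information_gain([1, 0, 1, 0], labels))  # 0.0 (feature carries no information)
```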
Optimal Feature Selection 
Standard Feature Selection: Information Gain 
TSEL Feature Selection 
• Tree-based feature selection (Jeong et al.) 
– Select most representative and most effective feature from each branch 
of the hierarchy 
• $lift = \frac{P(f \mid C)}{P(C)}$
Bottom-Up Hill-Climbing Feature Selection 
• Bottom-up hill climbing search algorithm to find an optimal subset of 
concepts for document representation (Wang et al.) 
$f = 1 + \frac{\alpha - n}{\alpha} \cdot \beta \cdot \sum_{i \in D} D_{c_i}, \quad D_{c_i} \subseteq D_{KNN_i} \text{ and } \beta > 0$
Greedy Top-Down Feature Selection 
• Greedy based top-down search strategy for feature selection (Lu et al.) 
– Select the most effective nodes from different levels of the hierarchy 
PROPOSED APPROACH 
Hierarchical Feature Selection Approach 
(SHSEL) 
• Exploit the hierarchical structure of the feature space 
• Hierarchical relation : vi < vj → (vi = 1 → vj = 1) 
• Relevance similarity: 
– Relevance (Blum et al.) : A feature vi is relevant to a target class C if 
there exists a pair of examples A and B in the instance space such that 
A and B differ only in their assignment to vi and C(A) ≠ C(B) 
• Two features vi and vj have similar relevance if:
$1 - |R(v_i) - R(v_j)| \geq t, \quad t \in [0,1]$
• Goal: Identify features with similar relevance, and select the most 
valuable abstract features, without losing predictive power 
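The relevance definition quoted above can be made concrete on a small sample. A minimal sketch, assuming instances are tuples of binary values; note that it only inspects a finite sample, whereas the definition ranges over the whole instance space, and the function name is hypothetical:

```python
def is_relevant(instances, labels, i):
    """True if two examples differ only in feature i and have different classes."""
    for a, class_a in zip(instances, labels):
        for b, class_b in zip(instances, labels):
            differs_only_in_i = a[i] != b[i] and all(
                x == y for k, (x, y) in enumerate(zip(a, b)) if k != i
            )
            if differs_only_in_i and class_a != class_b:
                return True
    return False

# Toy sample with two binary features
instances = [(0, 0), (1, 0), (0, 1)]
labels = ["neg", "pos", "neg"]
print(is_relevant(instances, labels, 0))  # True: (0,0) vs (1,0) changes the class
print(is_relevant(instances, labels, 1))  # False: (0,0) vs (0,1) keeps the class
```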
Hierarchical Feature Selection Approach 
(SHSEL) 
• Initial Selection 
– Identify and filter out ranges of nodes with similar relevance in each 
branch of the hierarchy 
• Pruning 
– Select only the most relevant features from the previously reduced set 
Initial SHSEL Feature Selection 
1. Identify range of nodes with similar relevance in each branch:
– Information gain: $s(v_i, v_j) = 1 - |IG(v_i) - IG(v_j)|$
– Correlation: $s(v_i, v_j) = Correlation(v_i, v_j)$
– Example with t = 0.9: $s(v_i, v_j) = 1 - |0.45 - 0.5| = 0.95$, so s > t
2. If the similarity is greater than a user specified threshold, remove 
the more specific feature, based on the hierarchical relation 
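The two steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a single-parent hierarchy given as a hypothetical `parent` mapping, uses the information-gain-based similarity, and compares each feature only with its direct parent in one pass (how chains of removed nodes are handled is not specified on the slide).

```python
def initial_selection(parent, relevance, t):
    """Drop each more specific feature whose relevance is similar to its parent's.

    parent:    dict child -> parent (assumed single-parent hierarchy)
    relevance: dict feature -> relevance score, e.g. information gain
    t:         user-specified similarity threshold in [0, 1]
    """
    removed = set()
    for child, par in parent.items():
        similarity = 1.0 - abs(relevance[child] - relevance[par])
        if similarity > t:  # step 2: similarity greater than the threshold
            removed.add(child)
    return set(relevance) - removed

# Hypothetical toy values; 0.45 and 0.5 mirror the example on the slide
parent = {"BasketballPlayer": "Athlete", "BaseballPlayer": "Athlete"}
relevance = {"BasketballPlayer": 0.45, "BaseballPlayer": 0.1, "Athlete": 0.5}
print(sorted(initial_selection(parent, relevance, t=0.9)))
# ['Athlete', 'BaseballPlayer']
```

With t = 0.9, BasketballPlayer is removed because its similarity to Athlete is 0.95, while BaseballPlayer (similarity 0.6) is kept.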
Post SHSEL Feature Selection 
• Select the features with the highest relevance on each path 
– user specified threshold 
– select features with relevance above path average relevance 
– Example: IG(vi) = 0.2 is below the path average AVG(Sp) = 0.25
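The path-average variant of the pruning step can be sketched as follows. A minimal illustration under stated assumptions: paths are given explicitly as lists of feature names, features at or above the path average are kept, and a feature that lies on several paths is kept if it qualifies on any of them. The function name and the toy values are hypothetical.

```python
def prune(paths, relevance):
    """Keep, per path, the features whose relevance reaches the path average."""
    selected = set()
    for path in paths:
        average = sum(relevance[f] for f in path) / len(path)
        selected.update(f for f in path if relevance[f] >= average)
    return selected

# Toy path whose average relevance is 0.25, as in the example above
relevance = {"Person": 0.1, "Athlete": 0.45, "BaseballPlayer": 0.2}
paths = [["Person", "Athlete", "BaseballPlayer"]]
print(prune(paths, relevance))  # {'Athlete'}
```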
EVALUATION 
Evaluation 
• We use 5 real-world datasets and 6 synthetically generated datasets 
• Classification methods: 
– Naïve Bayes 
– k-Nearest Neighbors (k=3) 
– Support Vector Machine (polynomial kernel function) 
• No parameter optimization
Evaluation: Real World Datasets 
Name               Features               #Instances  Class Labels                       #Features
Sports Tweets T    DBpedia Direct Types   1,179       positive(523); negative(656)       4,082
Sports Tweets C    DBpedia Categories     1,179       positive(523); negative(656)       10,883
Cities             DBpedia Direct Types   212         high(67); medium(106); low(39)     727
NY Daily Headings  DBpedia Direct Types   1,016       positive(580); negative(436)       5,145
StumbleUpon        DMOZ Categories        3,020       positive(1,370); negative(1,650)   3,976
• Hierarchical features are generated from DBpedia (structured version of 
Wikipedia) 
– The text is annotated with concepts using DBpedia Spotlight 
• The feature generation is independent of the class labels, and it is unbiased 
towards any of the feature selection approaches 
Evaluation: Synthetic Datasets 
Name      #Instances  Class Labels                   #Features
S-D2-B2   1,000       positive(500); negative(500)   1,201
S-D2-B5   1,000       positive(500); negative(500)   1,021
S-D2-B10  1,000       positive(500); negative(500)   961
S-D4-B2   1,000       positive(500); negative(500)   2,101
S-D4-B5   1,000       positive(500); negative(500)   1,741
S-D4-B10  1,000       positive(500); negative(500)   1,621
• Generate the middle layer using a polynomial function
• Generate the hierarchy upwards and downwards following the 
hierarchical feature implication and transitivity rule 
• The depth and branching factor are controlled with parameters D 
and B 
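The slides do not spell out the generation procedure, so the following is only a sketch of how a hierarchy could be grown from a given layer while respecting the implication rule: a parent is 1 whenever any of its children is 1, and a child may be 1 only if its parent is 1. The function names, the probability `p` and the grouping of siblings are assumptions made for illustration.

```python
import random

def generate_upwards(children_values, branching):
    """Group `branching` siblings under one parent; the parent is 1 if any child is 1."""
    return [
        int(any(children_values[k:k + branching]))
        for k in range(0, len(children_values), branching)
    ]

def generate_downwards(parent_values, branching, rng, p=0.5):
    """Give each parent `branching` children; a child can be 1 only if its parent is 1."""
    return [
        int(parent == 1 and rng.random() < p)
        for parent in parent_values
        for _ in range(branching)
    ]

middle = [1, 0, 0, 0]                      # one instance's middle-layer values
upper = generate_upwards(middle, 2)        # [1, 0]
lower = generate_downwards(middle, 2, random.Random(0))
print(upper, lower)
```

Applying `generate_upwards` repeatedly adds more general levels, and applying `generate_downwards` repeatedly adds more specific ones; the number of repetitions and the `branching` argument would play the roles of D and B.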
Evaluation: Synthetic Datasets 
• Depth = 1 & Branching = 2
[Figure: example hierarchy with binary feature values at each node]
Evaluation: Approach 
• Testing all approaches using three classification methods
– Naïve Bayes, KNN and SVM
• Metrics for performance evaluation
– Accuracy: $Acc(V') = \frac{\text{Correctly Classified Instances}(V')}{\text{Total Number of Instances}}$
– Feature Space Compression: $c(V') = 1 - \frac{|V'|}{|V|}$
– Harmonic Mean: $H = 2 \cdot \frac{Acc(V') \cdot c(V')}{Acc(V') + c(V')}$
• Results calculated using stratified 10-fold cross validation 
– Feature selection is performed inside each fold 
• Parameter optimization for each feature selection strategy 
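The three metrics above are straightforward to compute. A minimal sketch with hypothetical input numbers (they are not results from the evaluation); the function names are illustrative:

```python
def accuracy(correct, total):
    """Share of correctly classified instances."""
    return correct / total

def compression(n_selected, n_original):
    """Feature space compression: 1 - |V'| / |V|."""
    return 1.0 - n_selected / n_original

def harmonic_mean(acc, comp):
    """Harmonic mean of accuracy and compression (0 if both are 0)."""
    return 2.0 * acc * comp / (acc + comp) if acc + comp > 0 else 0.0

# Hypothetical example: 80 of 100 instances correct, 25 of 100 features kept
acc = accuracy(80, 100)        # 0.8
comp = compression(25, 100)    # 0.75
print(round(harmonic_mean(acc, comp), 4))  # 0.7742
```

The harmonic mean rewards selections that keep accuracy high while discarding many features; a selection that keeps every feature has compression 0 and therefore H = 0.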
Evaluation: SHSEL IG
• Classification accuracy when using different relevance similarity thresholds
on the Cities dataset
[Figure: accuracy, compression and harmonic mean (y-axis, 0% to 100%) plotted against the relevance similarity threshold (x-axis)]
Evaluation: Classification Accuracy (NB)
[Figure: two bar charts of classification accuracy (0% to 100%) per dataset.
Real-world datasets: Sports Tweets T, Sports Tweets C, StumbleUpon, Cities, NY Daily Headings.
Synthetic datasets: S_D2_B2, S_D2_B5, S_D2_B10, S_D4_B2, S_D4_B5, S_D4_B10.
Series: original, initialSHSEL IG, initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, GreedyTopDown]
Evaluation: Feature Space Compression (NB)
[Figure: two bar charts of feature space compression (0% to 100%) per dataset.
Real-world datasets: Sports Tweets T, Sports Tweets C, StumbleUpon, Cities, NY Daily Headings.
Synthetic datasets: S_D2_B2, S_D2_B5, S_D2_B10, S_D4_B2, S_D4_B5, S_D4_B10.
Series: initialSHSEL IG, initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, GreedyTopDown]
Evaluation: Harmonic Mean (NB)
[Figure: two bar charts of the harmonic mean of accuracy and compression (0% to 100%) per dataset.
Real-world datasets: Sports Tweets T, Sports Tweets C, StumbleUpon, Cities, NY Daily Headings.
Synthetic datasets: S_D2_B2, S_D2_B5, S_D2_B10, S_D4_B2, S_D4_B5, S_D4_B10.
Series: initialSHSEL IG, initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, GreedyTopDown]
Conclusion & Outlook 
• Contribution 
– An approach that exploits hierarchies for feature selection in 
combination with standard metrics 
– The evaluation shows that the approach outperforms standard feature 
selection techniques, and other approaches using hierarchies 
• Future Work 
– Conduct further experiments 
• E.g. text mining, bioinformatics 
– Feature Selection in unsupervised learning 
• E.g. clustering, outlier detection 
• Laplacian Score 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 

Recently uploaded (20)

Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 

Feature selection in hierarchical feature spaces using similarity-based hierarchical feature selection (SHSEL)

  • 1. Feature Selection in Hierarchical Feature Spaces 10/12/2014 Petar Ristoski, Heiko Paulheim
  • 2. Motivation: Linked Open Data as Background Knowledge
      • Linked Open Data is a method for publishing interlinked datasets using machine interpretable semantics
      • Started 2007
      • A collection of ~1,000 datasets
        – Various domains, e.g. general knowledge, government data, …
        – Using semantic web standards (HTTP, RDF, SPARQL)
      • Free of charge
      • Machine processable
      • Sophisticated tool stacks
  • 3. Motivation: Linked Open Data as Background Knowledge
  • 4. Example: the Auto MPG Dataset
      • A well-known UCI dataset
        – Goal: predict fuel consumption of cars
      • Hypothesis: background knowledge → more accurate predictions
      • Used background knowledge:
        – Entity types and categories from DBpedia (=Wikipedia)
      • Results: M5Rules down to almost half the prediction error
        – i.e. on average, we are wrong by 1.6 instead of 2.9 MPG

      Attribute set                           Linear Regression    M5Rules
                                              RMSE     RE          RMSE     RE
      original                                3.359    0.118       2.859    0.088
      original + direct types                 3.334    0.117       2.835    0.091
      original + categories                   4.474    0.144       2.926    0.090
      original + direct types + categories    2.551    0.088       1.574    0.042
  • 5. Drawbacks
      • The generated feature sets are rather large
        – e.g. for a dataset of 300 instances, it may generate up to 5,000 features from one source
      • Increased complexity and runtime
      • Overfitting on overly specific features
  • 6. Linked Open Data is Backed by Ontologies [Figures: LOD Graph Excerpt, Ontology Excerpt]
  • 7. HIERARCHICAL FEATURE SPACE
  • 8. Problem Statement
      • Each instance is an n-dimensional binary feature vector $(v_1, v_2, \ldots, v_n)$, where $v_i \in \{0,1\}$ for all $1 \le i \le n$
      • Feature space: $V = \{v_1, v_2, \ldots, v_n\}$
      • A hierarchical relation between two features $v_i$ and $v_j$ is denoted $v_i < v_j$, where $v_i$ is more specific than $v_j$
      • For all hierarchical features, the following implication holds: $v_i < v_j \rightarrow (v_i = 1 \rightarrow v_j = 1)$
      • Hierarchical relations are transitive: $v_i < v_j \wedge v_j < v_k \rightarrow v_i < v_k$
      • The problem of feature selection can be defined as finding a projection of $V$ to $V'$, where $V' \subseteq V$ and $p(V') \ge p(V)$, where $p$ is a performance function: $p: P(V) \rightarrow [0,1]$
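The implication and transitivity rules of the problem statement can be illustrated with a minimal Python sketch. The toy hierarchy below is hypothetical: the `Athlete` and player types echo the DBpedia example in this deck, while `Person`, the `PARENTS` dictionary, and all function names are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy hierarchy: feature -> set of direct, more general features.
PARENTS: Dict[str, Set[str]] = {
    "BasketballPlayer": {"Athlete"},
    "BaseballPlayer": {"Athlete"},
    "Athlete": {"Person"},
    "Person": set(),
}


def ancestors(feature: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All features v_j with feature < v_j (transitive closure of the relation)."""
    seen: Set[str] = set()
    stack = list(parents.get(feature, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def propagate(active: Iterable[str], parents: Dict[str, Set[str]]) -> Dict[str, int]:
    """Binary vector implied by the active features: v_i = 1 forces v_j = 1 for v_i < v_j."""
    vector = {feature: 0 for feature in parents}
    for feature in active:
        vector[feature] = 1
        for ancestor in ancestors(feature, parents):
            vector[ancestor] = 1
    return vector


def respects_hierarchy(vector: Dict[str, int], parents: Dict[str, Set[str]]) -> bool:
    """Check that every set feature also has all of its ancestors set."""
    return all(
        vector.get(ancestor, 0) == 1
        for feature, value in vector.items()
        if value == 1
        for ancestor in ancestors(feature, parents)
    )


vector = propagate({"BasketballPlayer"}, PARENTS)
print(vector)
print(respects_hierarchy(vector, PARENTS))
```

Storing only direct parents and deriving the closure on demand keeps the hierarchy small; a vector that sets `BasketballPlayer` without `Athlete` would fail the check.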
  • 9. Hierarchical Feature Space: Example
      • Josh Donaldson is the best 3rd baseman in the American League.
      • LeBron James NOT ranked #1 after newly released list of Top NBA players
      • “Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.”―Albert Einstein
      • Nineteen-year-old figure skater Yuzuru Hanyu, who won a gold medal in the Sochi Olympics, is among the 684 peo... http://bit.ly/1kb6W5y
      • In his weekly address, President Barack Obama discusses expanding opportunity for hard-working Americans: http://ofa.bo/ccH
      • Barack Obama cracks jokes at Vladimir Putin's expense http://dlvr.it/5Z7JCR
      • I spotted the Lance Armstrong case in 2006 when everyone thought he was God, and now this case catches my attention.
  • 10. Hierarchical Feature Space: Example
      • Josh Donaldson is the best 3rd baseman in the American League.
        – dbpedia:Josh_Donaldson → dbpedia-owl:Baseball_Player → dbpedia-owl:Athlete
      • LeBron James NOT ranked #1 after newly released list of Top NBA players
        – dbpedia:LeBron_James → dbpedia-owl:Basketball_Player → dbpedia-owl:Athlete
  • 11. Hierarchical Feature Space: Example
  • 12. Hierarchical Feature Space
      • Linked Open Data
        – DBpedia, YAGO, Biperpedia, Google Knowledge Graph
      • Lexical Databases
        – WordNet, DANTE
      • Domain-specific ontologies, taxonomies and vocabularies
        – Bioinformatics: Gene Ontology (GO), Entrez
        – Drugs: the Drug Ontology
        – E-commerce: GoodRelations
  • 13. RELATED APPROACHES
  • 14. Standard Feature Selection
      • Wrapper methods
        – Computationally expensive
      • Filter methods
        – Several techniques for scoring the relevance of the features
          • Information Gain
          • χ2
          • Information Gain Ratio
          • Gini Index
        – Often similar results
  • 15. Optimal Feature Selection
  • 16. Standard Feature Selection: Information Gain
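Information gain, one of the filter scores listed under standard feature selection, can be sketched as follows. This uses the textbook definition (class entropy minus the conditional entropy given the feature) and is not taken from the slides; the function names and the toy data are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature: Sequence[int], labels: Sequence[Hashable]) -> float:
    """Entropy of the labels minus the entropy remaining after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [label for x, label in zip(feature, labels) if x == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: a binary feature that separates the classes perfectly,
# and one that carries no information about the class.
labels = ["pos", "pos", "neg", "neg"]
print(information_gain([1, 1, 0, 0], labels))
print(information_gain([1, 0, 1, 0], labels))
```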
  • 17. TSEL Feature Selection
      • Tree-based feature selection (Jeong et al.)
        – Select most representative and most effective feature from each branch of the hierarchy
      • $\mathit{lift} = \frac{P(f \mid C)}{P(C)}$
  • 18. Bottom-Up Hill-Climbing Feature Selection
      • Bottom-up hill climbing search algorithm to find an optimal subset of concepts for document representation (Wang et al.)
      • $f = \left(1 + \frac{\alpha - n}{\alpha} \cdot \beta\right) \cdot \sum_{i \in D} |D_{c_i}|$, with $D_{c_i} \subseteq D_{KNN_i}$ and $\beta > 0$
  • 19. Greedy Top-Down Feature Selection
      • Greedy based top-down search strategy for feature selection (Lu et al.)
        – Select the most effective nodes from different levels of the hierarchy
  • 20. PROPOSED APPROACH
  • 21. Hierarchical Feature Selection Approach (SHSEL)
      • Exploit the hierarchical structure of the feature space
      • Hierarchical relation: $v_i < v_j \rightarrow (v_i = 1 \rightarrow v_j = 1)$
      • Relevance similarity:
        – Relevance (Blum et al.): A feature $v_i$ is relevant to a target class $C$ if there exists a pair of examples $A$ and $B$ in the instance space such that $A$ and $B$ differ only in their assignment to $v_i$ and $C(A) \ne C(B)$
        – Two features $v_i$ and $v_j$ have similar relevance if $1 - |R(v_i) - R(v_j)| \ge t$, with $t \in [0,1]$
      • Goal: Identify features with similar relevance, and select the most valuable abstract features, without losing predictive power
  • 22. Hierarchical Feature Selection Approach (SHSEL)
      • Initial Selection
        – Identify and filter out ranges of nodes with similar relevance in each branch of the hierarchy
      • Pruning
        – Select only the most relevant features from the previously reduced set
  • 23. Initial SHSEL Feature Selection
      1. Identify range of nodes with similar relevance in each branch:
        – Information gain: $s(v_i, v_j) = 1 - |IG(v_i) - IG(v_j)|$
        – Correlation: $s(v_i, v_j) = \mathit{Correlation}(v_i, v_j)$
        – Example with $t = 0.9$: $s = 1 - |0.5 - 0.45| = 0.95$, so $s > t$
      2. If the similarity is greater than a user specified threshold, remove the more specific feature, based on the hierarchical relation
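The initial selection step can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it assumes a tree in which every feature has at most one direct parent, precomputed information gain values, and the information-gain similarity with the strict `s > t` test. The relevance values 0.5 and 0.45 and the threshold 0.9 come from the slide example; the third value and all names are hypothetical.

```python
from typing import Dict, Optional, Set


def ig_similarity(relevance_a: float, relevance_b: float) -> float:
    """Relevance similarity s = 1 - |IG(a) - IG(b)|."""
    return 1.0 - abs(relevance_a - relevance_b)


def initial_selection(
    relevance: Dict[str, float],
    parent: Dict[str, Optional[str]],
    threshold: float,
) -> Set[str]:
    """Drop every feature whose relevance is similar to that of its direct parent.

    Only the more specific feature (the child) is ever removed, so each child
    can be compared against its original parent.
    """
    kept = set(relevance)
    for feature, direct_parent in parent.items():
        if direct_parent is None:
            continue  # the root has no more general feature to compare with
        if ig_similarity(relevance[feature], relevance[direct_parent]) > threshold:
            kept.discard(feature)
    return kept


# Hypothetical information gain values on a toy hierarchy.
relevance = {"Athlete": 0.5, "BasketballPlayer": 0.45, "BaseballPlayer": 0.2}
parent = {"Athlete": None, "BasketballPlayer": "Athlete", "BaseballPlayer": "Athlete"}
print(initial_selection(relevance, parent, threshold=0.9))
```

Here `BasketballPlayer` is dropped because its similarity to `Athlete` is about 0.95, while `BaseballPlayer` stays because its similarity is only about 0.7.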
  • 24. Post SHSEL Feature Selection
      • Select the features with the highest relevance on each path
        – user specified threshold
        – select features with relevance above path average relevance
      • Example: $IG(v_i) = 0.2$, $\mathrm{AVG}(S_p) = 0.25$ (below the path average)
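The path-average variant of the pruning step can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: it assumes a single-parent tree, removes a feature as soon as it falls below the average relevance of any leaf-to-root path it lies on, and uses hypothetical names and data. The values 0.2 and 0.25 mirror the slide example; the root's 0.3 is chosen so that the path average comes out at 0.25.

```python
from typing import Dict, List, Optional, Set


def path_to_root(leaf: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Features on the path from a leaf up to the root, leaf first."""
    path = [leaf]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def prune(relevance: Dict[str, float], parent: Dict[str, Optional[str]]) -> Set[str]:
    """Keep only features whose relevance is not below the average of their path."""
    inner_nodes = {p for p in parent.values() if p is not None}
    leaves = [feature for feature in parent if feature not in inner_nodes]
    removed: Set[str] = set()
    for leaf in leaves:
        path = path_to_root(leaf, parent)
        average = sum(relevance[feature] for feature in path) / len(path)
        removed.update(feature for feature in path if relevance[feature] < average)
    return set(relevance) - removed


# Hypothetical two-node path: IG = 0.2 for the leaf, 0.3 for the root, average 0.25.
relevance = {"root": 0.3, "leaf": 0.2}
parent = {"root": None, "leaf": "root"}
print(prune(relevance, parent))
```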
  • 25. EVALUATION
  • 26. Evaluation
      • We use 5 real-world datasets and 6 synthetically generated datasets
      • Classification methods:
        – Naïve Bayes
        – k-Nearest Neighbors (k=3)
        – Support Vector Machine (polynomial kernel function)
      → No parameter optimization
  • 27. Evaluation: Real World Datasets

      Name                 Features                #Instances   Class Labels                        #Features
      Sports Tweets T      DBpedia Direct Types    1,179        positive(523); negative(656)        4,082
      Sports Tweets C      DBpedia Categories      1,179        positive(523); negative(656)        10,883
      Cities               DBpedia Direct Types    212          high(67); medium(106); low(39)      727
      NY Daily Headings    DBpedia Direct Types    1,016        positive(580); negative(436)        5,145
      StumbleUpon          DMOZ Categories         3,020        positive(1,370); negative(1,650)    3,976

      • Hierarchical features are generated from DBpedia (structured version of Wikipedia)
        – The text is annotated with concepts using DBpedia Spotlight
      • The feature generation is independent of the class labels, and it is unbiased towards any of the feature selection approaches
  • 28. Evaluation: Synthetic Datasets

      Name        #Instances   Class Labels                     #Features
      S-D2-B2     1,000        positive(500); negative(500)     1,201
      S-D2-B5     1,000        positive(500); negative(500)     1,021
      S-D2-B10    1,000        positive(500); negative(500)     961
      S-D4-B2     1,000        positive(500); negative(500)     2,101
      S-D4-B5     1,000        positive(500); negative(500)     1,741
      S-D4-B10    1,000        positive(500); negative(500)     1,621

      • Generate the middle layer using a polynomial function
      • Generate the hierarchy upwards and downwards following the hierarchical feature implication and transitivity rule
      • The depth and branching factor are controlled with parameters D and B
  • 29. Evaluation: Synthetic Datasets
      • Depth = 1 & Branching = 2
      [Figure: example synthetic hierarchy with binary feature values]
  • 30. Evaluation: Synthetic Datasets (repeats slide 28)
  • 31. Evaluation: Approach
      • Testing all approaches using three classification methods
        – Naïve Bayes, KNN and SVM
      • Metrics for performance evaluation
        – Accuracy: $Acc(V') = \frac{\text{Correctly Classified Instances}(V')}{\text{Total Number of Instances}}$
        – Feature Space Compression: $c(V') = 1 - \frac{|V'|}{|V|}$
        – Harmonic Mean: $H = \frac{2 \cdot Acc(V') \cdot c(V')}{Acc(V') + c(V')}$
      • Results calculated using stratified 10-fold cross validation
        – Feature selection is performed inside each fold
      • Parameter optimization for each feature selection strategy
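The three evaluation metrics are simple to compute once the counts are known. The sketch below is a direct transcription of the formulas on this slide; the function names and the example counts are hypothetical and are not results from the evaluation.

```python
def accuracy(correct: int, total: int) -> float:
    """Acc(V'): share of correctly classified instances using the selected features."""
    return correct / total


def compression(n_selected: int, n_original: int) -> float:
    """c(V') = 1 - |V'| / |V|: share of the original features that was removed."""
    return 1.0 - n_selected / n_original


def harmonic_mean(acc: float, comp: float) -> float:
    """H = 2 * Acc * c / (Acc + c); defined as 0 when both inputs are 0."""
    if acc + comp == 0:
        return 0.0
    return 2.0 * acc * comp / (acc + comp)


# Hypothetical counts: 80 of 100 instances correct, 250 of 1,000 features kept.
acc = accuracy(80, 100)
comp = compression(250, 1000)
print(acc, comp, harmonic_mean(acc, comp))
```

The harmonic mean rewards a selection only when both accuracy and compression are high: keeping every feature gives a compression of 0 and therefore H = 0, regardless of accuracy.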
  • 32. Evaluation: SHSEL IG
      • Classification accuracy when using different relevance similarity threshold on the cities dataset
      [Figure: Accuracy, Compression and H. Mean (0–100%) plotted against the Relevance Similarity Threshold]
  • 33. Evaluation: Classification Accuracy (NB)
      [Figure: two bar charts of accuracy (0–100%), one over the real-world datasets (Sports Tweets T, Sports Tweets C, StumbleUpon, Cities, NY Daily Headings) and one over the synthetic datasets (S_D2_B2, S_D2_B5, S_D2_B10, S_D4_B2, S_D4_B5, S_D4_B10). Series: original, initialSHSEL IG, initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, GreedyTopDown]
  • 34. Evaluation: Feature Space Compression (NB)
      [Figure: two bar charts of feature space compression (0–100%), one over the real-world datasets (Sports Tweets T, Sports Tweets C, StumbleUpon, Cities, NY Daily Headings) and one over the synthetic datasets (S_D2_B2, S_D2_B5, S_D2_B10, S_D4_B2, S_D4_B5, S_D4_B10). Series: initialSHSEL IG, initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, GreedyTopDown]
  • 35. Evaluation: Harmonic Mean (NB)
      [Figure: two bar charts of the harmonic mean (0–100%), one over the real-world datasets (Sports Tweets T, Sports Tweets C, StumbleUpon, Cities, NY Daily Headings) and one over the synthetic datasets (S_D2_B2, S_D2_B5, S_D2_B10, S_D4_B2, S_D4_B5, S_D4_B10). Series: initialSHSEL IG, initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, GreedyTopDown]
  • 36. Conclusion & Outlook
      • Contribution
        – An approach that exploits hierarchies for feature selection in combination with standard metrics
        – The evaluation shows that the approach outperforms standard feature selection techniques, and other approaches using hierarchies
      • Future Work
        – Conduct further experiments
          • E.g. text mining, bioinformatics
        – Feature Selection in unsupervised learning
          • E.g. clustering, outlier detection
          • Laplacian Score
  • 37. Feature Selection in Hierarchical Feature Spaces