An Ensemble Approach for Entity Type
Prediction over Linked Data
Guangyuan Piao, John G. Breslin
Insight Centre for Data Analytics @NUI Galway, Ireland
Unit for Social Software
The Data Challenge at the 5th Joint International Semantic Technology Conference (JIST 2015)
Yichang, China, 11/11/2015
Contents
• Introduction of the Data Challenge
• Overall Approach
• Results
Introduction of the Data Challenge
• the main task of the challenge is to predict labels of
entities/resources in Zhishi.me1
• 1,897 entity URLs are provided, and 1,397 of them come with
label information. Information related to the entities:
• abstracts of entities
• infobox properties
• external links
• related pages
• 13 participating teams
1. http://zhishi.apexlab.org/
Overall Approach
• features for predicting entity types
1. all distinct properties of entities in the dataset
2. semantic similarities between the entity and all labels (e.g., insect,
novel, etc.)
3. a bag of Named Entities (NEs) created from all abstracts of entities in
the dataset
• feature selection
• in total, there were 1,888 features
• filter out irrelevant features using the GainRatioAttributeEval method in
Weka1 (1,888 → 458 features)
• prediction strategy
• Random Forest as the classification method (using 100 trees); a sketch of
the pipeline follows this slide
1. http://www.cs.waikato.ac.nz/ml/weka/
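The slides describe a Weka-based workflow (GainRatioAttributeEval for filtering, then a 100-tree Random Forest). Below is a rough, analogous sketch in Python/scikit-learn, not the authors' code: the toy entities, the label set, and the use of mutual information as a stand-in for Weka's gain-ratio filter are all assumptions for illustration.

```python
# Minimal sketch (assumptions throughout): binary property features,
# a filter-style feature selector, and a 100-tree Random Forest,
# roughly mirroring the Weka pipeline described on the slide.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline

# Hypothetical training data: each entity is a dict of its distinct
# properties (1 = the entity has the property); labels are the targets.
entities = [
    {"infobox:author": 1, "infobox:genre": 1},
    {"infobox:author": 1, "infobox:publisher": 1},
    {"infobox:family": 1, "infobox:order": 1},
    {"infobox:order": 1, "infobox:habitat": 1},
]
labels = ["novel", "novel", "insect", "insect"]

pipeline = make_pipeline(
    DictVectorizer(sparse=True),                 # 0/1 property features
    # Stand-in for Weka's GainRatioAttributeEval; the actual run
    # reduced 1,888 features to 458.
    SelectKBest(mutual_info_classif, k=3),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
pipeline.fit(entities, labels)
print(pipeline.predict([{"infobox:genre": 1, "infobox:author": 1}]))
```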
Overall Approach - Features
1. all distinct properties of entities in the dataset
• the value of the property is 1 if the entity has the property, 0 if not
2. semantic similarities between the entity and all labels (e.g.,
insect, novel, etc.); a plausible form of this feature is sketched after this slide
• RESIM(ei, ej)1: a measure for calculating the semantic similarity
between two entities in the context of a Linked Data graph
• |lu|: the total # of entities of label lu
1. Computing the Semantic Similarity of Resources in DBpedia for Recommendation Purposes, Piao et al., JIST 2015
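The formula for this feature appears only as an image on the original slide. From the definitions above (RESIM between two entities, |lu| as the number of entities carrying label lu), a natural reconstruction is the average RESIM between the target entity and the entities carrying each label; this exact form is an assumption, not a quote from the slides or the cited paper.

```latex
% Hedged reconstruction (assumption): similarity feature between an entity
% e_i and a label l_u, averaged over all entities e_j annotated with l_u.
\[
  \mathrm{sim}(e_i, l_u) \;=\; \frac{1}{|l_u|} \sum_{e_j \in l_u} \mathrm{RESIM}(e_i, e_j)
\]
```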
Overall Approach - Features
3. a bag of Named Entities (NEs) created from all abstracts
of entities in the dataset
• entities that appear at the beginning of an abstract and that appear
frequently in it can receive higher weights (a possible weighting is
sketched after this slide)
• pipeline: abstract → Stanford NER1 → segmented NEs
• nei: a NE that appears more than 10 times
• pos(nei, a): the position of nei in abstract a
• n: the # of NEs in a
1. http://nlp.stanford.edu/software/CRF-NER.shtml
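The weighting formula itself is only an image on the original slide. Given the definitions of pos(nei, a) and n, one weighting consistent with the description (earlier and more frequent NEs receive higher weights) is shown below; the freq term and this exact combination are assumptions, not the authors' published formula.

```latex
% Hedged reconstruction (assumption): weight of named entity ne_i in
% abstract a, growing with its frequency and shrinking with its position.
\[
  w(ne_i, a) \;=\; \mathrm{freq}(ne_i, a) \cdot \left(1 - \frac{\mathrm{pos}(ne_i, a)}{n}\right)
\]
```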
Results
• performance on the provided training set (10-fold cross-validation);
see the sketch below
Classifier     Precision  Recall  F-score
Decision Tree  0.942      0.942   0.942
SVM            0.920      0.910   0.912
Random Forest  0.970      0.969   0.969
Stacking       0.949      0.948   0.948
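The slide reports 10-fold cross-validation scores for four classifiers; the experiments themselves were run in Weka. A rough scikit-learn equivalent of that comparison is sketched below; the random feature matrix, the label vector, and the particular stacking configuration are placeholders, not the authors' setup.

```python
# Minimal sketch (assumptions throughout): compare four classifiers with
# 10-fold cross-validation, as in the table above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 458)).astype(float)  # placeholder 0/1 features
y = rng.integers(0, 5, size=200)                        # placeholder labels

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Stacking": StackingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(random_state=0)),
            ("svm", SVC()),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1_weighted")
    print(f"{name}: mean weighted F1 = {scores.mean():.3f}")
```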
• performance on the provided test set (4th among 13 teams)
Team      Precision  Recall    F-score
4PKUICL   0.985439   0.987461  0.986449
1CBrain   0.983633   0.985069  0.98435
3FRDC_ML  0.978096   0.977348  0.977722
6pgy      0.977442   0.977247  0.977344