An Ensemble Approach for Entity Type
Prediction over Linked Data
Guangyuan Piao, John G. Breslin
Insight Centre for Data Analytics @NUI Galway, Ireland
Unit for Social Software
The Data Challenge at the 5th Joint International Semantic Technology Conference (JIST 2015)
Yichang, China, 11/11/2015
Contents
• Introduction of the Data Challenge
• Overall Approach
• Results
Introduction of the Data Challenge
• the main task of the challenge is to predict labels of
entities/resources in Zhishi.me1
• 1,897 entity URLs are provided, and 1,397 of them come with
label information. Information related to the entities:
• abstracts of entities
• infobox properties
• external links
• related pages
• 13 participating teams
1. http://zhishi.apexlab.org/
Overall Approach
• features for predicting entity types
1. all distinct properties of entities in the dataset
2. semantic similarities between the entity and all labels (e.g., insect,
novel, etc.)
3. a bag of Named Entities (NEs) created from all abstracts of entities in
the dataset
• feature selection
• in total, there were 1,888 features
• filter out irrelevant features using the GainRatioAttributeEval method in
Weka1 (1,888 → 458 features)
• prediction strategy
• Random Forest as the classification method (using 100 trees); a sketch of
the pipeline follows this slide
1. http://www.cs.waikato.ac.nz/ml/weka/
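The slides describe a Weka-based workflow (GainRatioAttributeEval for filtering, then a 100-tree Random Forest). Below is a rough, analogous sketch in Python/scikit-learn, not the authors' code: the toy entities, the label set, and the use of mutual information as a stand-in for Weka's gain-ratio filter are all assumptions for illustration.

```python
# Minimal sketch (assumptions throughout): binary property features,
# a filter-style feature selector, and a 100-tree Random Forest,
# roughly mirroring the Weka pipeline described on the slide.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline

# Hypothetical training data: each entity is a dict of its distinct
# properties (1 = the entity has the property); labels are the targets.
entities = [
    {"infobox:author": 1, "infobox:genre": 1},
    {"infobox:author": 1, "infobox:publisher": 1},
    {"infobox:family": 1, "infobox:order": 1},
    {"infobox:order": 1, "infobox:habitat": 1},
]
labels = ["novel", "novel", "insect", "insect"]

pipeline = make_pipeline(
    DictVectorizer(sparse=True),                 # 0/1 property features
    # Stand-in for Weka's GainRatioAttributeEval; the actual run
    # reduced 1,888 features to 458.
    SelectKBest(mutual_info_classif, k=3),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
pipeline.fit(entities, labels)
print(pipeline.predict([{"infobox:genre": 1, "infobox:author": 1}]))
```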
Overall Approach - Features
1. all distinct properties of entities in the dataset
• the value of the property is 1 if the entity has the property, 0 if not
2. semantic similarities between the entity and all labels (e.g.,
insect, novel, etc.); a plausible form of this feature is sketched after this slide
• RESIM(ei, ej)1: a measure for calculating the semantic similarity
between two entities in the context of a Linked Data graph
• |lu|: the total # of entities of label lu
1. Computing the Semantic Similarity of Resources in DBpedia for Recommendation Purposes, Piao et al., JIST 2015
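The formula for this feature appears only as an image on the original slide. From the definitions above (RESIM between two entities, |lu| as the number of entities carrying label lu), a natural reconstruction is the average RESIM between the target entity and the entities carrying each label; this exact form is an assumption, not a quote from the slides or the cited paper.

```latex
% Hedged reconstruction (assumption): similarity feature between an entity
% e_i and a label l_u, averaged over all entities e_j annotated with l_u.
\[
  \mathrm{sim}(e_i, l_u) \;=\; \frac{1}{|l_u|} \sum_{e_j \in l_u} \mathrm{RESIM}(e_i, e_j)
\]
```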
Overall Approach - Features
3. a bag of Named Entities (NEs) created from all abstracts
of entities in the dataset
• entities that appear at the beginning of an abstract and that appear
frequently in it can receive higher weights (a possible weighting is
sketched after this slide)
• pipeline: abstract → Stanford NER1 → segmented NEs
• nei: a NE that appears more than 10 times
• pos(nei, a): the position of nei in abstract a
• n: the # of NEs in a
1. http://nlp.stanford.edu/software/CRF-NER.shtml
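The weighting formula itself is only an image on the original slide. Given the definitions of pos(nei, a) and n, one weighting consistent with the description (earlier and more frequent NEs receive higher weights) is shown below; the freq term and this exact combination are assumptions, not the authors' published formula.

```latex
% Hedged reconstruction (assumption): weight of named entity ne_i in
% abstract a, growing with its frequency and shrinking with its position.
\[
  w(ne_i, a) \;=\; \mathrm{freq}(ne_i, a) \cdot \left(1 - \frac{\mathrm{pos}(ne_i, a)}{n}\right)
\]
```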
Results
• performance on the provided training set (10-fold cross-validation);
see the sketch below
Classifier     Precision  Recall  F-score
Decision Tree  0.942      0.942   0.942
SVM            0.920      0.910   0.912
Random Forest  0.970      0.969   0.969
Stacking       0.949      0.948   0.948
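The slide reports 10-fold cross-validation scores for four classifiers; the experiments themselves were run in Weka. A rough scikit-learn equivalent of that comparison is sketched below; the random feature matrix, the label vector, and the particular stacking configuration are placeholders, not the authors' setup.

```python
# Minimal sketch (assumptions throughout): compare four classifiers with
# 10-fold cross-validation, as in the table above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 458)).astype(float)  # placeholder 0/1 features
y = rng.integers(0, 5, size=200)                        # placeholder labels

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Stacking": StackingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(random_state=0)),
            ("svm", SVC()),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1_weighted")
    print(f"{name}: mean weighted F1 = {scores.mean():.3f}")
```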
• performance on the provided test set (4th among 13 teams)
Team      Precision  Recall    F-score
4PKUICL   0.985439   0.987461  0.986449
1CBrain   0.983633   0.985069  0.98435
3FRDC_ML  0.978096   0.977348  0.977722
6pgy      0.977442   0.977247  0.977344