SlideShare a Scribd company logo
An Ensemble Approach for Entity Type
Prediction over Linked Data
Guangyuan Piao, John G. Breslin
Insight Centre for Data Analytics @NUI Galway, Ireland
Unit for Social Software
The Data Challenge at 5th Joint International Semantic Technology Conference
Yichang, China, 11/11/2015
Contents
• Introduction of the Data Challenge
• Overall Approach
• Results
2
• the main task of the challenge is to predict labels of
entities/resources in Zhishi.me1
• 1,897 entity URLs are provided and 1,397 of them are provided
with label information. Information related to the entities:
• abstracts of entities
• infobox properties
• external links
• related pages
• 13 participated teams
3
Introduction of the Data Challenge
1. http://zhishi.apexlab.org/
• features for predicting entity types
1. all distinct properties of entities in the dataset
2. semantic similarities between the entity and all labels (i.e., insect,
novel etc.)
3. a bag of Named Entities (NEs) created from all abstracts of entities in
the dataset
• feature selection
• in total, there were 1,888 features
• filter out irrelevant features using GainRatioAttributeEval method in
Weka1 (1,888  458 features)
• prediction strategy
• Random Forest as the classification method (using 100 trees)
4
Overall Approach
1. http://www.cs.waikato.ac.nz/ml/weka/
1. all distinct properties of entities in the dataset
• the value of the property is 1 if the entity has the property, 0 if not
1. semantic similarities between the entity and all labels (i.e.,
insect , novel etc.)
• RESIM(ei, ej)1: a measure for calculating the semantic similarity
between two entities in the context of a Linked Data graph
• |lu|: the total # of entities of label lu
5
Overall Approach - Features
1. Computing the Semantic Similarity of Resources in DBpedia for Recommendation Purposes, Piao et al., JIST2015
3. a bag of Named Entities (NEs) created from all abstracts
of entities in the dataset
• entities appeared at the beginning of an abstract and appeared
frequently in the abstract can have higher weights.
6
Overall Approach - Features
abstract Stanford NER1
segmented NEs
• nei : a NE appeared more than 10 times
• pos(nei, a): the position of nei in a
• n: the # of NEs in a
1. http://nlp.stanford.edu/software/CRF-NER.shtml
• performance on the provided training set (10-fold cross-
validation)
Classifier Precision Recall F-score
Decision Tree 0.942 0.942 0.942
SVM 0.920 0.910 0.912
Random Forest 0.970 0.969 0.969
Stacking 0.949 0.948 0.948
7
Results
• performance on the provided test set (4th among 13 teams)
Team Precision Recall F-score
4PKUICL 0.985439 0.987461 0.986449
1CBrain 0.983633 0.985069 0.98435
3FRDC_ML 0.978096 0.977348 0.977722
6pgy 0.977442 0.977247 0.977344
JIST2015-data challenge

More Related Content

What's hot

Real Time Competitive Marketing Intelligence
Real Time Competitive Marketing IntelligenceReal Time Competitive Marketing Intelligence
Real Time Competitive Marketing Intelligence
feiwin
 
Mining Product Reputations On the Web
Mining Product Reputations On the WebMining Product Reputations On the Web
Mining Product Reputations On the Web
feiwin
 
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality Assessment
Maribel Acosta Deibe
 

What's hot (20)

Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...
 
Domain Ontology Usage Analysis Framework (OUSAF)
Domain Ontology Usage Analysis Framework (OUSAF)Domain Ontology Usage Analysis Framework (OUSAF)
Domain Ontology Usage Analysis Framework (OUSAF)
 
Information retrieval 10 vector and probabilistic models
Information retrieval 10 vector and probabilistic modelsInformation retrieval 10 vector and probabilistic models
Information retrieval 10 vector and probabilistic models
 
PhD Defense Slides
PhD Defense SlidesPhD Defense Slides
PhD Defense Slides
 
Zero-shot Image Recognition Using Relational Matching, Adaptation and Calibra...
Zero-shot Image Recognition Using Relational Matching, Adaptation and Calibra...Zero-shot Image Recognition Using Relational Matching, Adaptation and Calibra...
Zero-shot Image Recognition Using Relational Matching, Adaptation and Calibra...
 
Naive Bayes | Statistics
Naive Bayes | StatisticsNaive Bayes | Statistics
Naive Bayes | Statistics
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?
 
Linked Data Quality Assessment: A Survey
Linked Data Quality Assessment: A SurveyLinked Data Quality Assessment: A Survey
Linked Data Quality Assessment: A Survey
 
Linked Data Quality Assessment – daQ and Luzzu
Linked Data Quality Assessment – daQ and LuzzuLinked Data Quality Assessment – daQ and Luzzu
Linked Data Quality Assessment – daQ and Luzzu
 
Machine learning module 2
Machine learning module 2Machine learning module 2
Machine learning module 2
 
Real Time Competitive Marketing Intelligence
Real Time Competitive Marketing IntelligenceReal Time Competitive Marketing Intelligence
Real Time Competitive Marketing Intelligence
 
Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?
 
Mining Product Reputations On the Web
Mining Product Reputations On the WebMining Product Reputations On the Web
Mining Product Reputations On the Web
 
Machine Learning Introduction
Machine Learning IntroductionMachine Learning Introduction
Machine Learning Introduction
 
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
 
Sybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal PresentationSybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal Presentation
 
Crystallization classification semisupervised
Crystallization classification semisupervisedCrystallization classification semisupervised
Crystallization classification semisupervised
 
Mappings Validation
Mappings ValidationMappings Validation
Mappings Validation
 
From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?
 
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality Assessment
 

Similar to JIST2015-data challenge

Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic DatabaseTowards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Hilmar Lapp
 
Searching the Stuff of Life - BioSolr: Presented by Matt Pearce & Alan Woodwa...
Searching the Stuff of Life - BioSolr: Presented by Matt Pearce & Alan Woodwa...Searching the Stuff of Life - BioSolr: Presented by Matt Pearce & Alan Woodwa...
Searching the Stuff of Life - BioSolr: Presented by Matt Pearce & Alan Woodwa...
Lucidworks
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
Artificial Intelligence Institute at UofSC
 

Similar to JIST2015-data challenge (20)

Epistemic networks for Epistemic Commitments
Epistemic networks for Epistemic CommitmentsEpistemic networks for Epistemic Commitments
Epistemic networks for Epistemic Commitments
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
 
Towards Incidental Collaboratories; Research Data Services
Towards Incidental Collaboratories; Research Data ServicesTowards Incidental Collaboratories; Research Data Services
Towards Incidental Collaboratories; Research Data Services
 
Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...
 
Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...
 
Searching for Interestingness in Wikipedia and Yahoo! Answers
Searching for Interestingness in Wikipedia and Yahoo! AnswersSearching for Interestingness in Wikipedia and Yahoo! Answers
Searching for Interestingness in Wikipedia and Yahoo! Answers
 
Creating an Urban Legend: A System for Electrophysiology Data Management and ...
Creating an Urban Legend: A System for Electrophysiology Data Management and ...Creating an Urban Legend: A System for Electrophysiology Data Management and ...
Creating an Urban Legend: A System for Electrophysiology Data Management and ...
 
Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic DatabaseTowards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database
 
GARNet workshop on Integrating Large Data into Plant Science
GARNet workshop on Integrating Large Data into Plant ScienceGARNet workshop on Integrating Large Data into Plant Science
GARNet workshop on Integrating Large Data into Plant Science
 
Metadata Analyser: measuring metadata quality
Metadata Analyser: measuring metadata qualityMetadata Analyser: measuring metadata quality
Metadata Analyser: measuring metadata quality
 
Searching the Stuff of Life - BioSolr: Presented by Matt Pearce & Alan Woodwa...
Searching the Stuff of Life - BioSolr: Presented by Matt Pearce & Alan Woodwa...Searching the Stuff of Life - BioSolr: Presented by Matt Pearce & Alan Woodwa...
Searching the Stuff of Life - BioSolr: Presented by Matt Pearce & Alan Woodwa...
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
IRJET-Classifying Mined Online Discussion Data for Reflective Thinking based ...
IRJET-Classifying Mined Online Discussion Data for Reflective Thinking based ...IRJET-Classifying Mined Online Discussion Data for Reflective Thinking based ...
IRJET-Classifying Mined Online Discussion Data for Reflective Thinking based ...
 
The Genopolis Microarray database
The Genopolis Microarray databaseThe Genopolis Microarray database
The Genopolis Microarray database
 
01 Nature of Biology
01 Nature of Biology01 Nature of Biology
01 Nature of Biology
 
AELA: An Adaptive Entity Linking Approach
AELA: An Adaptive Entity Linking ApproachAELA: An Adaptive Entity Linking Approach
AELA: An Adaptive Entity Linking Approach
 
Studying archives of online behavior
Studying archives of online behaviorStudying archives of online behavior
Studying archives of online behavior
 
Chapter-OBDD.pptx
Chapter-OBDD.pptxChapter-OBDD.pptx
Chapter-OBDD.pptx
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSIS
 
Martone grethe
Martone gretheMartone grethe
Martone grethe
 

More from GUANGYUAN PIAO

A Study of the Similarities of Entity Embeddings Learned from Different Aspec...
A Study of the Similarities of Entity Embeddings Learned from Different Aspec...A Study of the Similarities of Entity Embeddings Learned from Different Aspec...
A Study of the Similarities of Entity Embeddings Learned from Different Aspec...
GUANGYUAN PIAO
 
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...
GUANGYUAN PIAO
 
ECIR2017-Inferring User Interests for Passive Users on Twitter by Leveraging ...
ECIR2017-Inferring User Interests for Passive Users on Twitter by Leveraging ...ECIR2017-Inferring User Interests for Passive Users on Twitter by Leveraging ...
ECIR2017-Inferring User Interests for Passive Users on Twitter by Leveraging ...
GUANGYUAN PIAO
 
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...
GUANGYUAN PIAO
 

More from GUANGYUAN PIAO (18)

Env2Vec: Accelerating VNF Testing with Deep Learning
Env2Vec: Accelerating VNF Testing with Deep LearningEnv2Vec: Accelerating VNF Testing with Deep Learning
Env2Vec: Accelerating VNF Testing with Deep Learning
 
Domain-Aware Sentiment Classification with GRUs and CNNs
Domain-Aware Sentiment Classification with GRUs and CNNsDomain-Aware Sentiment Classification with GRUs and CNNs
Domain-Aware Sentiment Classification with GRUs and CNNs
 
A Study of the Similarities of Entity Embeddings Learned from Different Aspec...
A Study of the Similarities of Entity Embeddings Learned from Different Aspec...A Study of the Similarities of Entity Embeddings Learned from Different Aspec...
A Study of the Similarities of Entity Embeddings Learned from Different Aspec...
 
Retweet Prediction with Attention-based Deep Neural Network
Retweet Prediction with Attention-based Deep Neural NetworkRetweet Prediction with Attention-based Deep Neural Network
Retweet Prediction with Attention-based Deep Neural Network
 
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...
WISE2017 - Factorization Machines Leveraging Lightweight Linked Open Data-ena...
 
Hypertext2017-Leveraging Followee List Memberships for Inferring User Interes...
Hypertext2017-Leveraging Followee List Memberships for Inferring User Interes...Hypertext2017-Leveraging Followee List Memberships for Inferring User Interes...
Hypertext2017-Leveraging Followee List Memberships for Inferring User Interes...
 
ECIR2017-Inferring User Interests for Passive Users on Twitter by Leveraging ...
ECIR2017-Inferring User Interests for Passive Users on Twitter by Leveraging ...ECIR2017-Inferring User Interests for Passive Users on Twitter by Leveraging ...
ECIR2017-Inferring User Interests for Passive Users on Twitter by Leveraging ...
 
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ...
 
SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...
SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...
SEMANTiCS2016 - Exploring Dynamics and Semantics of User Interests for User ...
 
UMAP2016EA - Analyzing MOOC Entries of Professionals on LinkedIn for User Mod...
UMAP2016EA - Analyzing MOOC Entries of Professionals on LinkedIn for User Mod...UMAP2016EA - Analyzing MOOC Entries of Professionals on LinkedIn for User Mod...
UMAP2016EA - Analyzing MOOC Entries of Professionals on LinkedIn for User Mod...
 
UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...
UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...
UMAP2016 - Analyzing Aggregated Semantics-enabled User Modeling on Google+ an...
 
SAC2016-Measuring Semantic Distance for Linked Open Data-enabled Recommender ...
SAC2016-Measuring Semantic Distance for Linked Open Data-enabled Recommender ...SAC2016-Measuring Semantic Distance for Linked Open Data-enabled Recommender ...
SAC2016-Measuring Semantic Distance for Linked Open Data-enabled Recommender ...
 
Analyzing User Modeling on Twitter for Personalized News Recommendations
Analyzing User Modeling on Twitter for Personalized News RecommendationsAnalyzing User Modeling on Twitter for Personalized News Recommendations
Analyzing User Modeling on Twitter for Personalized News Recommendations
 
RDFa Basics
RDFa BasicsRDFa Basics
RDFa Basics
 
Owl 2.0 Overview
Owl 2.0 OverviewOwl 2.0 Overview
Owl 2.0 Overview
 
OWL 2.0 Primer Part01
OWL 2.0 Primer Part01OWL 2.0 Primer Part01
OWL 2.0 Primer Part01
 
OWL2.0 Primer Part02
OWL2.0 Primer Part02OWL2.0 Primer Part02
OWL2.0 Primer Part02
 
Hdd industry
Hdd industryHdd industry
Hdd industry
 

Recently uploaded

Paint shop management system project report.pdf
Paint shop management system project report.pdfPaint shop management system project report.pdf
Paint shop management system project report.pdf
Kamal Acharya
 
Digital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdfDigital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdf
AbrahamGadissa
 
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdfDR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DrGurudutt
 
School management system project report.pdf
School management system project report.pdfSchool management system project report.pdf
School management system project report.pdf
Kamal Acharya
 
Teachers record management system project report..pdf
Teachers record management system project report..pdfTeachers record management system project report..pdf
Teachers record management system project report..pdf
Kamal Acharya
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdf
Kamal Acharya
 

Recently uploaded (20)

Paint shop management system project report.pdf
Paint shop management system project report.pdfPaint shop management system project report.pdf
Paint shop management system project report.pdf
 
Digital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdfDigital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdf
 
Lect 2 - Design of slender column-2.pptx
Lect 2 - Design of slender column-2.pptxLect 2 - Design of slender column-2.pptx
Lect 2 - Design of slender column-2.pptx
 
Research Methodolgy & Intellectual Property Rights Series 2
Research Methodolgy & Intellectual Property Rights Series 2Research Methodolgy & Intellectual Property Rights Series 2
Research Methodolgy & Intellectual Property Rights Series 2
 
1. Henrich Triangle Safety and Fire Presentation
1. Henrich Triangle Safety and Fire Presentation1. Henrich Triangle Safety and Fire Presentation
1. Henrich Triangle Safety and Fire Presentation
 
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
 
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdfDR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
DR PROF ING GURUDUTT SAHNI WIKIPEDIA.pdf
 
School management system project report.pdf
School management system project report.pdfSchool management system project report.pdf
School management system project report.pdf
 
Teachers record management system project report..pdf
Teachers record management system project report..pdfTeachers record management system project report..pdf
Teachers record management system project report..pdf
 
2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge
 
Supermarket billing system project report..pdf
Supermarket billing system project report..pdfSupermarket billing system project report..pdf
Supermarket billing system project report..pdf
 
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and ClusteringKIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
 
"United Nations Park" Site Visit Report.
"United Nations Park" Site  Visit Report."United Nations Park" Site  Visit Report.
"United Nations Park" Site Visit Report.
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdf
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
 
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
 
Dairy management system project report..pdf
Dairy management system project report..pdfDairy management system project report..pdf
Dairy management system project report..pdf
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdf
 
ENERGY STORAGE DEVICES INTRODUCTION UNIT-I
ENERGY STORAGE DEVICES  INTRODUCTION UNIT-IENERGY STORAGE DEVICES  INTRODUCTION UNIT-I
ENERGY STORAGE DEVICES INTRODUCTION UNIT-I
 

JIST2015-data challenge

  • 1. An Ensemble Approach for Entity Type Prediction over Linked Data Guangyuan Piao, John G. Breslin Insight Centre for Data Analytics @NUI Galway, Ireland Unit for Social Software The Data Challenge at 5th Joint International Semantic Technology Conference Yichang, China, 11/11/2015
  • 2. Contents • Introduction of the Data Challenge • Overall Approach • Results 2
  • 3. • the main task of the challenge is to predict labels of entities/resources in Zhishi.me1 • 1,897 entity URLs are provided and 1,397 of them are provided with label information. Information related to the entities: • abstracts of entities • infobox properties • external links • related pages • 13 participated teams 3 Introduction of the Data Challenge 1. http://zhishi.apexlab.org/
  • 4. • features for predicting entity types 1. all distinct properties of entities in the dataset 2. semantic similarities between the entity and all labels (i.e., insect, novel etc.) 3. a bag of Named Entities (NEs) created from all abstracts of entities in the dataset • feature selection • in total, there were 1,888 features • filter out irrelevant features using GainRatioAttributeEval method in Weka1 (1,888  458 features) • prediction strategy • Random Forest as the classification method (using 100 trees) 4 Overall Approach 1. http://www.cs.waikato.ac.nz/ml/weka/
  • 5. 1. all distinct properties of entities in the dataset • the value of the property is 1 if the entity has the property, 0 if not 1. semantic similarities between the entity and all labels (i.e., insect , novel etc.) • RESIM(ei, ej)1: a measure for calculating the semantic similarity between two entities in the context of a Linked Data graph • |lu|: the total # of entities of label lu 5 Overall Approach - Features 1. Computing the Semantic Similarity of Resources in DBpedia for Recommendation Purposes, Piao et al., JIST2015
  • 6. 3. a bag of Named Entities (NEs) created from all abstracts of entities in the dataset • entities appeared at the beginning of an abstract and appeared frequently in the abstract can have higher weights. 6 Overall Approach - Features abstract Stanford NER1 segmented NEs • nei : a NE appeared more than 10 times • pos(nei, a): the position of nei in a • n: the # of NEs in a 1. http://nlp.stanford.edu/software/CRF-NER.shtml
  • 7. • performance on the provided training set (10-fold cross- validation) Classifier Precision Recall F-score Decision Tree 0.942 0.942 0.942 SVM 0.920 0.910 0.912 Random Forest 0.970 0.969 0.969 Stacking 0.949 0.948 0.948 7 Results • performance on the provided test set (4th among 13 teams) Team Precision Recall F-score 4PKUICL 0.985439 0.987461 0.986449 1CBrain 0.983633 0.985069 0.98435 3FRDC_ML 0.978096 0.977348 0.977722 6pgy 0.977442 0.977247 0.977344