Using Linked Data to Mine
RDF from Wikipedia’s Tables
http://emunoz.org/wikitables
Emir Muñoz
Fujitsu (Ireland) Limited
National University of Ireland Galway
Joint work with A. Hogan and A. Mileo
WSDM 2014 @ New York City, February 24-28
Emir M. - WSDM, New York City, USA, 27th February, 2014 2
MOTIVATION
(1/10)
MOTIVATION
The tables embedded in Wikipedia articles contain rich,
semi-structured encyclopaedic content
… BUT we cannot query all that content…
A query example:
(2/10)
Wikipedia tables or tables in the body are ignored
[Borrowed from Entity Linking tutorial]
Results at
25-02-2014
First result
Second result
10
Airlines
Third result
19
Airlines
• Same query in SPARQL over DBpedia
MOTIVATION
SELECT ?p ?o WHERE
{ <http://dbpedia.org/resource/Airbus_A380> ?p ?o . }
FAIL
(7/10)
No evidence of A380
• We perform automatic fact extraction (as RDF)
from Wikipedia tables using KBs
MOTIVATION
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
(10/10)
• As far as we know, DBpedia and YAGO
ignore tables in the article body
– Mainly focused on info-boxes
• Languages such as R2RML can express
custom mappings from relational database
tables to RDF
– Each row as a subject, each column as a
predicate and each cell as an object
– Needs a mapping definition
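The row/column/cell reading described above can be sketched in a few lines of Python. This is only an illustration of the idea, not R2RML itself: the function name `rows_to_triples`, the `http://example.org/` base IRI, and the sample row are hypothetical and do not come from the talk.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

def rows_to_triples(rows: List[Dict[str, str]], key: str,
                    base: str = "http://example.org/") -> List[Triple]:
    """Naive table-to-RDF reading: row -> subject, column -> predicate, cell -> object."""
    triples = []
    for row in rows:
        # The key column identifies the row and becomes the subject IRI.
        subject = f"<{base}resource/{row[key].replace(' ', '_')}>"
        for column, cell in row.items():
            if column == key or cell == "":
                continue  # skip the key column itself and empty cells
            predicate = f"<{base}property/{column}>"
            triples.append((subject, predicate, f'"{cell}"'))
    return triples

# Hypothetical sample row, for illustration only.
players = [{"name": "David de Gea", "position": "Goalkeeper"}]
for triple in rows_to_triples(players, key="name"):
    print(" ".join(triple), ".")
```

The hard-coded choice of key column and predicate names plays the role of the mapping definition: someone has to supply it by hand, which is the limitation the slide points out.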
EXTRACTING RDF FROM TABLES (1/4)
• [Limaye et al. 2010; Mulwad et al. 2010 & 2013]
presented approaches using an in-house KB and
small datasets for validation
– Entity recognition/disambiguation
– Determine types for each column
– Determine relationships between columns
• We focus on Wikipedia tables, running our
algorithms over the entire corpus with
“row-centric” features for Machine
Learning models
EXTRACTING RDF FROM TABLES (2/4)
EXTRACTING RDF FROM TABLES
• Extraction of two types of relationships
– Between the main entity and cells in the same column,
e.g., “Manchester United F.C.” and “David de Gea”
– Between entities in different columns but same row
(3/4)
dbp:currentClub
dbp:position
EXTRACTING RDF FROM TABLES (4/4)
• Wikipedia dump from February 13th 2013
• Table taxonomy
WIKITABLES SURVEY (1/2)
1.14 million tables
• Table model
– Input: a source of tables (a set of tables)
• E.g., a Wikipedia article
• Each table in the source is modeled as
a matrix
• We normalize the tables and convert
each HTML table into a matrix
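One way to picture the conversion from an HTML table to a matrix is the following Python sketch, which uses only the standard library. The normalization rules here are assumptions made for illustration (repeating a cell value across its `colspan`, ignoring `rowspan`, padding short rows with empty strings); the slides do not spell out the exact procedure, and the names `TableToMatrix` and `html_table_to_matrix` are hypothetical.

```python
from html.parser import HTMLParser

class TableToMatrix(HTMLParser):
    """Collect the cells of one HTML table into a list of rows."""

    def __init__(self):
        super().__init__()
        self.matrix = []   # finished rows
        self._row = None   # row currently being built
        self._cell = None  # text fragments of the current cell
        self._span = 1     # colspan of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            self._span = int(dict(attrs).get("colspan") or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())
            # Assumption: repeat the value across every spanned column.
            self._row.extend([text] * self._span)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.matrix.append(self._row)
            self._row = None

def html_table_to_matrix(html):
    parser = TableToMatrix()
    parser.feed(html)
    parser.close()
    width = max((len(row) for row in parser.matrix), default=0)
    # Pad short rows so that every row has the same number of columns.
    return [row + [""] * (width - len(row)) for row in parser.matrix]

# Hypothetical input, for illustration only.
html = """<table>
<tr><th>Player</th><th>Position</th></tr>
<tr><td colspan="2">Squad</td></tr>
<tr><td><a href="/wiki/David_de_Gea">David de Gea</a></td></tr>
</table>"""
matrix = html_table_to_matrix(html)
```

After this step every row has the same width, so cells can be addressed by (row, column) regardless of how irregular the original markup was.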
WIKITABLES SURVEY (2/2)
• To extract RDF from Wikitables we rely on
a reference knowledge base
– DBpedia, version 3.8
MINING RDF FROM WIKITABLES
Extract links in the cells
Mapping links to DBpedia
Lookups on DBpedia to find
relationships between entities
in the same row
Candidate
relationships
Wikipedia
table
(1/6)
• We aim to discover:
– Relations between entities on the same row
– Relations between entities in the table and the
protagonist of the article
• Map the links inside the cells to RDF
resources
• Get candidate relationships from the KB
MINING RDF FROM WIKITABLES
SELECT DISTINCT ?p1 ?p2
WHERE { { <e1> ?p1 <e2> } UNION { <e2> ?p2 <e1> } }
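The pairwise lookup described above can be sketched in Python as follows. The sketch only builds the query strings and does not contact any endpoint. The helper names (`wiki_link_to_dbpedia`, `candidate_query`, `row_queries`) are hypothetical, and the link-to-IRI mapping is a simplifying assumption (take the page title after `/wiki/`, drop any fragment); the mapping used in the actual system may differ.

```python
from itertools import combinations

DBPEDIA = "http://dbpedia.org/resource/"

def wiki_link_to_dbpedia(href):
    """Map an internal link such as '/wiki/David_de_Gea' to a DBpedia resource IRI.

    Assumption: the IRI is the page title appended to the DBpedia namespace.
    """
    title = href.split("/wiki/", 1)[1].split("#", 1)[0]
    return DBPEDIA + title.replace(" ", "_")

def candidate_query(e1, e2):
    """Build a well-formed query asking for predicates in both directions."""
    return (
        "SELECT DISTINCT ?p1 ?p2 WHERE { "
        f"{{ <{e1}> ?p1 <{e2}> }} UNION {{ <{e2}> ?p2 <{e1}> }} }}"
    )

def row_queries(row_links):
    """One query per unordered pair of entities linked from the same row."""
    entities = [wiki_link_to_dbpedia(href) for href in row_links]
    return [candidate_query(a, b) for a, b in combinations(entities, 2)]

# Hypothetical row with two linked cells.
queries = row_queries(["/wiki/David_de_Gea", "/wiki/Manchester_United_F.C."])
print(queries[0])
```

Because each query covers both directions through the UNION, one query per unordered pair is enough; a row with k linked entities yields k(k-1)/2 queries.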
(2/6)
• We detected some weak relationships
• … We need more filtering for relationships
MINING RDF FROM WIKITABLES
dbp:currentClub
dbp:youthClubs
(3/6)
• Features at different levels used to train
Machine Learning models
• Article features (e.g., # of tables)
• Table features (e.g., #rows, #columns, ratios)
• Cell features (e.g., # of entities, string length, has
format)
• Column features (e.g., # of entities, # of unique
entities)
• Predicate/Column features (e.g., string similarity, # of
rows where relation holds)
• Predicate features (e.g., triple count, count unique)
• Triple features (e.g., is the table from article or body)
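A few of the features listed above can be computed as in the following Python sketch. The cell representation (a dict with `text` and `entities`), the function names, and the particular ratio are assumptions chosen for illustration; the slides name the feature families but not their exact definitions.

```python
def table_features(matrix):
    """Table-level features: number of rows, number of columns, and one ratio."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    return {
        "rows": n_rows,
        "columns": n_cols,
        # Assumption: rows/columns is one of the unspecified "ratios".
        "row_column_ratio": n_rows / n_cols if n_cols else 0.0,
    }

def column_features(matrix, col):
    """Column-level features: total and distinct entities linked in the column."""
    entities = [e for row in matrix for e in row[col]["entities"]]
    return {"entities": len(entities), "unique_entities": len(set(entities))}

def cell_features(cell):
    """Cell-level features: number of linked entities and length of the text."""
    return {"entities": len(cell["entities"]), "string_length": len(cell["text"])}

# Hypothetical two-row table; each cell keeps its text and linked entities.
table = [
    [{"text": "David de Gea", "entities": ["dbr:David_de_Gea"]},
     {"text": "Goalkeeper", "entities": ["dbr:Goalkeeper"]}],
    [{"text": "Player B", "entities": ["dbr:Player_B"]},
     {"text": "Goalkeeper", "entities": ["dbr:Goalkeeper"]}],
]
print(table_features(table))
print(column_features(table, 1))
print(cell_features(table[0][0]))
```

In a pipeline like the one described, such dictionaries would be merged per candidate triple into one feature vector before being passed to a classifier.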
MINING RDF FROM WIKITABLES (4/6)
• The experimentation set-up
– Wikipedia dump from February 2013
– DBpedia dump version 3.8
– 8 machines (ca. 2005) with 4GB of RAM,
2.2GHz single-core processors
• After 12 days we got 34.9 million unique
triples not in DBpedia
• We manually annotated a sample of 750
triples to train the ML models
MINING RDF FROM WIKITABLES (5/6)
MINING RDF FROM WIKITABLES (6/6)
            Bagging DT   Simple Logistic   SVM
accuracy    78.1%        78.53%            72.6%
precision   81.5%        79.62%            72.4%
recall      77.4%        79.01%            75.8%
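For readers unfamiliar with the three measures, the sketch below shows the standard definitions for a binary correct/incorrect labelling of triples. The counts are made up for illustration and are not taken from the experiments; whether the reported figures were computed exactly this way (for example, per class or averaged) is not stated in the slides.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all triples labelled correctly
        "precision": tp / (tp + fp),     # share of accepted triples that are correct
        "recall": tp / (tp + fn),        # share of correct triples that are accepted
    }

# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=35, fn=15)
print(metrics)
```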
• In this work we aimed to
– Interpret the semantics of tables using KBs
– Enrich KBs with new facts mined from tables
• With the best model we got 7.9 million
unique novel triples
• We still don’t
– Consider literals/string values in the cells
– Exploit domain/range of predicates
– Test other KBs like Freebase and YAGO
CONCLUSION
• Most of the related papers use some
knowledge base, such as DBpedia
– They can benefit from new RDF triples
extracted from Wikipedia tables
• We can use the similarity proposed in
Knowledge-based graph document modeling, by
Schuhmacher and Ponzetto, to improve the
relation extraction
• And use the paper Trust, but Verify: Predicting
Contribution Quality for Knowledge Base Construction
and Curation, by Chun How Tan et al., to assess the
quality of the output triples
CONTRAST WITH OTHER PAPERS
Thank you!
Emir Muñoz
SVM our third best model 
http://emunoz.org/wikitables
