Extending DBpedia (LOD) using WikiTables
Emir Muñoz
Unit for Reasoning and Querying
emir.munoz@deri.org
Linked Open Data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
October 12, 2012 -- E. Muñoz
Linked Open Data
• DBpedia, an export of Wikipedia’s structured data
DBpedia provides an RDF version of all Wikipedia structured data (infoboxes)
But there is not yet an RDF version of the regular Wikipedia tables (wikitables)
Tables as a source of LOD
http://en.wikipedia.org/wiki/Dublin
• Caption as another row
• Column headers represent types of information
• The values represent instances of those types
http://en.wikipedia.org/wiki/Galway
• Infoboxes (attribute-value pairs)
Tables are inherently concise as well as information rich
Reasoning over Wikipedia Tables
http://en.wikipedia.org/wiki/Dublin
Recovering Table Semantics …
Dublin is twinned with the following places:
Reasoning over Wikipedia Tables
http://en.wikipedia.org/wiki/Dublin
Entity annotation for cells, mappings to DBpedia resources

dbpedia.org/property/city                    dbpedia.org/property/nation                        dbpedia.org/property/since (xsd:integer)
dbpedia.org/resource/San_Jose,_California    dbpedia.org/resource/United_States
dbpedia.org/resource/Liverpool               dbpedia.org/resource/United_Kingdom
dbpedia.org/resource/Matsue,_Shimane         dbpedia.org/resource/Japan
dbpedia.org/resource/Barcelona               dbpedia.org/resource/Spain
dbpedia.org/resource/Beijing                 dbpedia.org/resource/People's_Republic_of_China
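The cell annotation step can be sketched as follows. This is a minimal illustration, not the pipeline used in the talk: the function name `wiki_title_to_dbpedia` is hypothetical, and it assumes that a cell's Wikipedia link title maps to a DBpedia resource by replacing spaces with underscores.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def wiki_title_to_dbpedia(title: str) -> str:
    """Map a Wikipedia page title to a candidate DBpedia resource URI.

    Assumption: the DBpedia local name equals the Wikipedia title with
    spaces replaced by underscores (redirects are not handled here).
    """
    name = title.strip().replace(" ", "_")
    # Leave commas, apostrophes and parentheses unescaped, as in the slides.
    return DBPEDIA_RESOURCE + quote(name, safe="_,'()")


if __name__ == "__main__":
    for cell in ["San Jose, California", "Liverpool", "People's Republic of China"]:
        print(wiki_title_to_dbpedia(cell))
```

A candidate URI produced this way still has to be checked against DBpedia, since the title may be a redirect or a disambiguation page.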
Reasoning over Wikipedia Tables
http://en.wikipedia.org/wiki/Dublin
Extracting relations between the city and nation columns:
• dbpedia.org/ontology/country
• dbpedia.org/property/subdivisionName
• is dbpedia.org/ontology/country of
Reasoning over Wikipedia Tables
• <http://dbpedia.org/resource/San_Jose,_California>
<http://dbpedia.org/property/subdivisionName>
<http://dbpedia.org/resource/United_States> .
• <http://dbpedia.org/resource/San_Jose,_California>
<http://dbpedia.org/ontology/country>
<http://dbpedia.org/resource/United_States> .
• <http://dbpedia.org/resource/Liverpool>
<http://dbpedia.org/property/subdivisionName>
<http://dbpedia.org/resource/United_Kingdom> .
• <http://dbpedia.org/resource/Liverpool>
<http://dbpedia.org/ontology/country>
<http://dbpedia.org/resource/United_Kingdom> .
• <http://dbpedia.org/resource/Matsue,_Shimane>
<http://dbpedia.org/property/subdivisionName>
<http://dbpedia.org/resource/Japan> .
• <http://dbpedia.org/resource/Matsue,_Shimane>
<http://dbpedia.org/ontology/country>
<http://dbpedia.org/resource/Japan> .
• <http://dbpedia.org/resource/Barcelona>
<http://dbpedia.org/property/subdivisionName>
<http://dbpedia.org/resource/Spain> .
• <http://dbpedia.org/resource/Barcelona>
<http://dbpedia.org/ontology/country>
<http://dbpedia.org/resource/Spain> .
• <http://dbpedia.org/resource/Beijing>
<http://dbpedia.org/property/subdivisionName>
<http://dbpedia.org/resource/People's_Republic_of_China> .
• <http://dbpedia.org/resource/Beijing>
<http://dbpedia.org/ontology/country>
<http://dbpedia.org/resource/People's_Republic_of_China> .
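Triples like the ones above could be serialized with a few lines of code. The sketch below is illustrative only: `to_ntriples` is a hypothetical helper, and it assumes every subject, predicate and object is already a full IRI (no literals, no escaping).

```python
def to_ntriples(rows, predicates):
    """Emit one N-Triples line per (row, predicate) combination.

    rows: iterable of (subject_iri, object_iri) pairs taken from one table row.
    predicates: candidate predicate IRIs found between the two columns.
    Assumption: all values are plain IRIs that need no escaping.
    """
    lines = []
    for subject, obj in rows:
        for predicate in predicates:
            lines.append(f"<{subject}> <{predicate}> <{obj}> .")
    return lines


if __name__ == "__main__":
    rows = [
        ("http://dbpedia.org/resource/San_Jose,_California",
         "http://dbpedia.org/resource/United_States"),
        ("http://dbpedia.org/resource/Liverpool",
         "http://dbpedia.org/resource/United_Kingdom"),
    ]
    predicates = [
        "http://dbpedia.org/property/subdivisionName",
        "http://dbpedia.org/ontology/country",
    ]
    print("\n".join(to_ntriples(rows, predicates)))
```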
Reasoning over Wikipedia Tables
• Let’s analyze these cases …
• Liverpool
• Matsue
• Beijing
Not that simple…
• Web tables usually don’t have explicit semantics
by themselves.
• Main issues:
– Complex tables with spans
– Captions inside the table as another row
– Not well-formed tables (i.e., not a matrix)
– We need filters (e.g., min 2 columns, 2 rows)
• We are extracting relations at row level and
between the main entity and the table resources
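The filter mentioned above (well-formed matrix, at least 2 columns and 2 rows) might look like the sketch below. The name `is_well_formed` and the list-of-rows representation are assumptions; whether the header row counts toward the minimum is not stated in the slides, so here every row counts.

```python
def is_well_formed(table, min_rows=2, min_cols=2):
    """Return True if `table` is a rectangular matrix of the minimum size.

    table: list of rows, each row a list of cell strings (spans already
    expanded). Assumption: the header row, if any, is counted as a row.
    """
    if len(table) < min_rows:
        return False
    width = len(table[0])
    if width < min_cols:
        return False
    # A matrix requires every row to have the same number of cells.
    return all(len(row) == width for row in table)


if __name__ == "__main__":
    ok = [["City", "Nation"], ["Liverpool", "United Kingdom"]]
    ragged = [["City", "Nation"], ["Liverpool"]]
    print(is_well_formed(ok), is_well_formed(ragged))
```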
Parsing: Extracting Tables
http://en.wikipedia.org/wiki/People%27s_Republic_of_China
First step: parsing Wiki format
• Caption as another row
• Table split
• Rowspans with pictures
Parsing: Extracting Tables
• Problems with parsing the cell’s content
http://en.wikipedia.org/wiki/Danny_Kaye
Parsing: Extracting Tables
• Same-page links
• Many different formats
• Anchor text vs. content text
http://en.wikipedia.org/wiki/List_of_animated_television_series_of_the_1990s
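The anchor-text versus link-target distinction can be illustrated with a small wikitext link parser. This is a sketch under assumptions: it handles only the `[[Target|anchor]]` and `[[Target]]` forms, treats targets starting with `#` as same-page links, and the name `extract_links` is hypothetical. Templates, files and nested markup are ignored.

```python
import re

# [[Target]] or [[Target|anchor text]]; no nested brackets handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(cell: str):
    """Return the internal links found in one table cell of wikitext."""
    links = []
    for match in WIKILINK.finditer(cell):
        target = match.group(1).strip()
        # Without an explicit anchor, the displayed text is the target.
        anchor = (match.group(2) or target).strip()
        links.append({
            "target": target,
            "anchor": anchor,
            "same_page": target.startswith("#"),
        })
    return links


if __name__ == "__main__":
    cell = "[[The Simpsons]] ([[Fox Broadcasting Company|Fox]]), see [[#1990|1990]]"
    for link in extract_links(cell):
        print(link)
```

Entity lookup would normally use `target` rather than `anchor`, since the anchor text may be an abbreviation of the page title.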
Extracting Relations
A table containing tables
http://en.wikipedia.org/wiki/AFC_Ajax
Extracting Relations
• Also relations between the main entity and the entities in the table
http://en.wikipedia.org/wiki/AFC_Ajax (16 players)
dbpedia.org/resource/AFC_Ajax
14 dbpedia.org/ontology/team
14 dbpedia.org/property/clubs
11 dbpedia.org/property/currentclub
3 dbpedia.org/property/youthclubs
In his DBpedia page there is no mention of AFC Ajax
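Frequencies like the ones above could be obtained by counting, per predicate, how many known triples connect a table entity with the main entity. The sketch below is an assumption about how such counting might be done, not the method from the talk; `count_relations` and the `ex:`/`dbo:`/`dbp:` identifiers in the example are placeholders.

```python
from collections import Counter


def count_relations(triples, main_entity, table_entities):
    """Count predicates that link a table entity with the main entity.

    triples: iterable of (subject, predicate, object) tuples already
    retrieved from the knowledge base. Duplicates are counted once.
    Assumption: both directions of a link count for the predicate.
    """
    counts = Counter()
    for subject, predicate, obj in set(triples):
        forward = subject in table_entities and obj == main_entity
        backward = subject == main_entity and obj in table_entities
        if forward or backward:
            counts[predicate] += 1
    return counts


if __name__ == "__main__":
    triples = [
        ("ex:Player_A", "dbo:team", "ex:Club"),
        ("ex:Player_B", "dbo:team", "ex:Club"),
        ("ex:Player_A", "dbp:clubs", "ex:Club"),
        ("ex:Player_A", "dbo:birthPlace", "ex:City"),
    ]
    print(count_relations(triples, "ex:Club", {"ex:Player_A", "ex:Player_B"}))
```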
dbpedia.org/resource/Christian_Eriksen
Disambiguation page
dbpedia.org/resource/Ajax
http://en.wikipedia.org/wiki/AFC_Ajax
Our Dataset
• enwiki dump from 2012-09-03 02:17:37
• 8.6 GB of Wikipedia pages that comprise
– 10,531,986 documents (HTML pages)
– Only 413,256 HTML pages contain tables
– 2,989,098 tables
– 905,929 tables after the filter
• 27.7% of all tables
– 0.46 tables per page (or 2.15 discarding pages
without tables)
Methodology
Ranking of Relationships
• The current ranking function is naïve
http://en.wikipedia.org/wiki/AFC_Ajax
16 players
freq   relationship                        score
14     dbpedia.org/ontology/team           0.875
14     dbpedia.org/property/clubs          0.875
11     dbpedia.org/property/currentclub    0.6875
3      dbpedia.org/property/youthclubs     0.1875
$$\mathit{score} = \frac{f_{rel}}{n_{rows}}$$
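As a worked example of the naïve score, the snippet below divides each relationship frequency by the number of rows (16 players) and sorts the result. The function name `rank_relations` is hypothetical; the frequencies are the ones listed on the slide.

```python
def rank_relations(freqs, n_rows):
    """Score each relationship as frequency / number of rows, best first."""
    scored = [(rel, freq / n_rows) for rel, freq in freqs.items()]
    # sorted() is stable, so ties keep their input order.
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    freqs = {
        "dbpedia.org/ontology/team": 14,
        "dbpedia.org/property/clubs": 14,
        "dbpedia.org/property/currentclub": 11,
        "dbpedia.org/property/youthclubs": 3,
    }
    for rel, score in rank_relations(freqs, n_rows=16):
        print(f"{score:.4f}  {rel}")
```

Nothing in this function bounds the result, so a frequency larger than the row count yields a score above 1.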
Ranking of Relationships
• In these cases the function does not work well, and score ∉ [0,1]
http://en.wikipedia.org/wiki/Danny_Kaye
Ongoing Work and Challenges
• Improve the ranking function for relations.
• Store the 5.5M DBpedia (transitive) redirects
locally (to reduce lookup time).
• Statistical analysis of Wikipedia tables
– Number of columns, rows
– Headers, Captions
– External and internal links
• The next big challenge is the evaluation.
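Resolving transitive redirects from a local copy could be sketched as below. This assumes the redirects are loaded into an in-memory dictionary from source URI to target URI; `resolve_redirect`, the `max_hops` guard and the `ex:` identifiers are illustrative, not part of the talk.

```python
def resolve_redirect(uri, redirects, max_hops=10):
    """Follow redirects transitively until a non-redirect URI is reached.

    redirects: dict mapping a redirecting URI to its target URI.
    The `seen` set and `max_hops` guard against cycles in the data.
    """
    seen = set()
    while uri in redirects and uri not in seen and len(seen) < max_hops:
        seen.add(uri)
        uri = redirects[uri]
    return uri


if __name__ == "__main__":
    redirects = {"ex:A": "ex:B", "ex:B": "ex:C"}
    print(resolve_redirect("ex:A", redirects))
    print(resolve_redirect("ex:C", redirects))
```

Resolving redirects before querying avoids missing relations that are only attached to the final resource.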
What’s next?
• Some ideas in mind:
– Use the extracted relations to classify WikiTables
– Define a similarity function for WikiTables
English Italian
What’s next?
http://en.wikipedia.org/wiki/Electronegativity
What does this number mean?
There is no reference for those numbers here!
What’s next?
http://en.wikipedia.org/wiki/Electronegativity
http://en.wikipedia.org/wiki/Chlorine
Chlorous acid is a chlorite
http://dbpedia.org/page/Chlorous_acid
Open problems
• Handle multiple entities in the same cell
• Improve the ranking function
• Handle redirects before querying DBpedia
• How to evaluate the outcome
Thanks!
Q & A
