1. Extending DBpedia (LOD) using WikiTables
Emir Muñoz
Unit for Reasoning and Querying
emir.munoz@deri.org
2. Linked Open Data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
October 12, 2012 -- E. Muñoz
4. Linked Open Data
• DBpedia, an export of Wikipedia’s structured data
DBpedia provides an RDF version of all Wikipedia structured data (infoboxes)
But not yet a version of all normal Wikipedia tables, or wikitables
5. Tables as a source of LOD
http://en.wikipedia.org/wiki/Dublin
Caption as another row
Column headers represent types of information
The values represent instances of those types
http://en.wikipedia.org/wiki/Galway
Infoboxes (attribute-value pairs)
Tables are inherently concise as well as information-rich
6. Reasoning over Wikipedia Tables
http://en.wikipedia.org/wiki/Dublin
Recovering Table Semantics …
Dublin is twinned with the following places:
7. Reasoning over Wikipedia Tables
dbpedia.org/resource/San_Jose,_California
dbpedia.org/resource/Liverpool
dbpedia.org/resource/Matsue,_Shimane
dbpedia.org/resource/Barcelona
dbpedia.org/resource/Beijing
dbpedia.org/resource/United_States
dbpedia.org/resource/United_Kingdom
dbpedia.org/resource/Japan
dbpedia.org/resource/Spain
dbpedia.org/resource/People’s_Republic_of_China
dbpedia.org/property/city dbpedia.org/property/nation dbpedia.org/property/since
http://en.wikipedia.org/wiki/Dublin
Entity annotation for cells, mappings to DBpedia resources
(xsd:integer)
8. Reasoning over Wikipedia Tables
dbpedia.org/resource/San_Jose,_California
dbpedia.org/resource/Liverpool
dbpedia.org/resource/Matsue,_Shimane
dbpedia.org/resource/Barcelona
dbpedia.org/resource/Beijing
dbpedia.org/resource/United_States
dbpedia.org/resource/United_Kingdom
dbpedia.org/resource/Japan
dbpedia.org/resource/Spain
dbpedia.org/resource/People’s_Republic_of_China
(xsd:integer)
dbpedia.org/property/city dbpedia.org/property/nation dbpedia.org/property/since
dbpedia.org/ontology/country
dbpedia.org/property/subdivisionName
is dbpedia.org/ontology/country of
http://en.wikipedia.org/wiki/Dublin
Extracting relations
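A minimal sketch of how candidate relations between two annotated cells of the same row could be looked up. It only builds a SPARQL query string; the function name and the idea of asking for both directions in one query are assumptions for illustration, not the method used in this work.

```python
def build_relation_query(subject_uri: str, object_uri: str) -> str:
    """Build a SPARQL query listing predicates that link two resources.

    ?forward binds predicates of the form  subject -> object,
    ?inverse binds predicates of the form  object -> subject
    (the "is ... of" case on the slide).
    """
    return (
        "SELECT DISTINCT ?forward ?inverse WHERE {\n"
        f"  {{ <{subject_uri}> ?forward <{object_uri}> }}\n"
        "  UNION\n"
        f"  {{ <{object_uri}> ?inverse <{subject_uri}> }}\n"
        "}"
    )


# Two cells from one row of the twin-cities table.
city = "http://dbpedia.org/resource/Liverpool"
nation = "http://dbpedia.org/resource/United_Kingdom"
query = build_relation_query(city, nation)
print(query)
```

The resulting string would be sent to a SPARQL endpoint holding DBpedia data; how the query is executed is left out of the sketch.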
11. Reasoning over Wikipedia Tables
• Let’s analyze these cases …
• Liverpool
• Matsue
• Beijing
12. Not that simple…
• Web tables usually don’t have explicit semantics
by themselves.
• Main issues:
– Complex tables with spans
– Captions inside the table as another row
– Not well-formed tables (i.e., not a matrix)
– We need filters (e.g., at least 2 columns and 2 rows)
• We are extracting relations at row level and
between the main entity and the table resources
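The filters listed above can be sketched as follows, assuming a table has already been parsed into a list of rows, each a list of cell strings. The names `is_matrix` and `passes_filter` are hypothetical, and spans are assumed to be unresolved, so a table with spans shows up as ragged rows.

```python
from typing import List

Table = List[List[str]]


def is_matrix(table: Table) -> bool:
    """True when every row has the same number of cells (no ragged rows)."""
    return len({len(row) for row in table}) == 1


def passes_filter(table: Table, min_rows: int = 2, min_cols: int = 2) -> bool:
    """Keep only well-formed tables with at least min_rows x min_cols cells."""
    if not table or not is_matrix(table):
        return False
    return len(table) >= min_rows and len(table[0]) >= min_cols


twin_cities = [
    ["City", "Nation", "Since"],
    ["Liverpool", "United Kingdom", "1997"],
]
ragged = [["City", "Nation"], ["Liverpool"]]
single_column = [["City"], ["Liverpool"]]

print(passes_filter(twin_cities))    # well-formed, 2 rows x 3 columns
print(passes_filter(ragged))         # rows differ in length
print(passes_filter(single_column))  # fewer than 2 columns
```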
14. Parsing: Extracting Tables
• Problems with parsing the cell’s content
http://en.wikipedia.org/wiki/Danny_Kaye
16. Parsing: Extracting Tables
Same-page link
Many different formats
Anchor text vs. content text
http://en.wikipedia.org/wiki/List_of_animated_television_series_of_the_1990s
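To illustrate the distinction between anchor text, content text and same-page links, here is a small sketch using Python's standard `html.parser`. The class and function names are hypothetical, the sample cell is made up, and mapping a `/wiki/Title` link to `http://dbpedia.org/resource/Title` is a simplifying assumption that ignores redirects.

```python
from html.parser import HTMLParser


class CellLinkParser(HTMLParser):
    """Collect the full cell text and every (href, anchor text) pair."""

    def __init__(self):
        super().__init__()
        self.content = []   # all text in the cell
        self.links = []     # (href, anchor_text) per <a> element
        self._href = None   # href of the <a> currently open, if any
        self._anchor = []   # text seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        self.content.append(data)
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None


def annotate_cell(cell_html):
    """Return (content_text, [(anchor_text, candidate DBpedia URI), ...])."""
    parser = CellLinkParser()
    parser.feed(cell_html)
    resources = [
        (text, "http://dbpedia.org/resource/" + href[len("/wiki/"):])
        for href, text in parser.links
        if href.startswith("/wiki/")   # skips same-page links such as "#cite_note-3"
    ]
    return "".join(parser.content).strip(), resources


cell = '<i><a href="/wiki/The_Simpsons">The Simpsons</a></i> (1989) <a href="#cite_note-3">[3]</a>'
content_text, resources = annotate_cell(cell)
print(content_text)
print(resources)
```

The content text keeps everything in the cell, while only the wiki links yield candidate resources; the footnote link is dropped because it points inside the same page.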
18. Extracting Relations
• Also relations between the main entity and
the entities in the table
dbpedia.org/resource/AFC_Ajax
14 dbpedia.org/ontology/team
14 dbpedia.org/property/clubs
11 dbpedia.org/property/currentclub
3 dbpedia.org/property/youthclubs
On his DBpedia page there is no mention of AFC Ajax
http://en.wikipedia.org/wiki/AFC_Ajax
16 players
20. Our Dataset
• enwiki dump from 2012-09-03 02:17:37
• 8.6 GB of Wikipedia pages that comprise
– 10,531,986 documents (HTML pages)
– Only 413,256 HTML pages contain tables
– 2,989,098 tables
– 905,929 tables after the filter
• 27.7% of all tables
– 0.46 tables per page (or 2.15 discarding pages
without tables)
22. Ranking of Relationships
• The current ranking function is naïve
http://en.wikipedia.org/wiki/AFC_Ajax
16 players

freq  relationship                      score
14    dbpedia.org/ontology/team         0.875
14    dbpedia.org/property/clubs        0.875
11    dbpedia.org/property/currentclub  0.6875
3     dbpedia.org/property/youthclubs   0.1875

$score = \frac{f_{rel}}{n_{rows}}$
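The naive score is simply the frequency of a relation divided by the number of rows. A short sketch that reproduces the AFC Ajax numbers; the function name and the per-row input layout are assumptions, and the rows are synthetic, built only to match the frequencies on the slide.

```python
from collections import Counter


def rank_relations(row_relations, n_rows):
    """Score each relation as frequency / n_rows, highest frequency first."""
    freq = Counter(rel for rels in row_relations for rel in rels)
    ranked = [(rel, f, f / n_rows) for rel, f in freq.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Synthetic rows: one list of relations found per player row (16 rows in total).
rows = []
for i in range(16):
    rels = []
    if i < 14:
        rels += ["dbpedia.org/ontology/team", "dbpedia.org/property/clubs"]
    if i < 11:
        rels.append("dbpedia.org/property/currentclub")
    if i < 3:
        rels.append("dbpedia.org/property/youthclubs")
    rows.append(rels)

ranking = rank_relations(rows, n_rows=len(rows))
for rel, freq, score in ranking:
    print(freq, rel, score)
```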
23. Ranking of Relationships
• In these cases it does not work well, and score ∉ [0,1]
http://en.wikipedia.org/wiki/Danny_Kaye
24. Ongoing Work and Challenges
• Improve the ranking function for relations.
• Store the 5.5M DBpedia (transitive) redirects
locally (to reduce lookup time).
• Statistical analysis of Wikipedia tables
– Number of columns, rows
– Headers, Captions
– External and internal links
• The next big challenge is the evaluation.
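One possible shape for the local redirect store is an in-memory dictionary from a redirecting URI to its target, followed transitively. This is only a sketch: the function name, the hop limit and the sample entries are hypothetical, and loading the 5.5M redirects from disk is left out.

```python
def resolve_redirect(uri, redirects, max_hops=10):
    """Follow redirects transitively; stop at a non-redirect, a cycle, or max_hops."""
    seen = {uri}
    for _ in range(max_hops):
        target = redirects.get(uri)
        if target is None or target in seen:
            return uri
        seen.add(target)
        uri = target
    return uri


# Hypothetical entries: a two-step redirect chain and a cycle.
redirects = {
    "dbpedia.org/resource/PRC": "dbpedia.org/resource/People's_Republic_of_China",
    "dbpedia.org/resource/People's_Republic_of_China": "dbpedia.org/resource/China",
    "dbpedia.org/resource/A": "dbpedia.org/resource/B",
    "dbpedia.org/resource/B": "dbpedia.org/resource/A",
}

resolved = resolve_redirect("dbpedia.org/resource/PRC", redirects)
unchanged = resolve_redirect("dbpedia.org/resource/Liverpool", redirects)
cyclic = resolve_redirect("dbpedia.org/resource/A", redirects)
print(resolved, unchanged, cyclic)
```

Resolving each cell's URI this way before querying avoids one remote lookup per redirect; the cycle check keeps a malformed redirect chain from looping forever.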
25. What’s next?
• Some ideas in mind:
– Use the extracted relations to classify WikiTables
– Define a similarity function for WikiTables
English Italian
26. What’s next?
http://en.wikipedia.org/wiki/Electronegativity
What does this number mean?
There is no reference to those numbers here!
27. What’s next?
http://en.wikipedia.org/wiki/Electronegativity
http://en.wikipedia.org/wiki/Chlorine
Chlorous acid is a chlorite
http://dbpedia.org/page/Chlorous_acid
28. Open problems
• Handle multiple entities in the same cell
• Improve the ranking function
• Handle redirects before querying DBpedia
• How to evaluate the outcome
Thanks!
Q & A
Emir Muñoz
Unit for Reasoning and Querying
emir.munoz@deri.org