WikiTables DERI Talk
Transcript

  • 1. Extending DBpedia (LOD) using WikiTables Emir Muñoz Unit for Reasoning and Querying emir.munoz@deri.org
  • 2. Linked Open Data: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ October 12, 2012 -- E. Muñoz
  • 3. Linked Open Data
    • DBpedia, an export of Wikipedia’s structured data
    • DBpedia provides an RDF version of all Wikipedia structured data (infoboxes)
  • 4. Linked Open Data
    • DBpedia, an export of Wikipedia’s structured data
    • DBpedia provides an RDF version of all Wikipedia structured data (infoboxes)
    • But not yet a version of all normal Wikipedia tables, or wikitables
  • 5. Tables as a source of LOD
    • Tables are inherently concise as well as information rich
    • Infoboxes (attr-value)
    • Column headers represent types of information
    • The values represent instances of those types
    • Caption as another row
    • Examples: http://en.wikipedia.org/wiki/Dublin, http://en.wikipedia.org/wiki/Galway
  • 6. Reasoning over Wikipedia Tables
    • Recovering table semantics
    • Example: "…Dublin is twinned with the following places:" (http://en.wikipedia.org/wiki/Dublin)
  • 7. Reasoning over Wikipedia Tables
    • Entity annotation for cells, mappings to DBpedia resources (http://en.wikipedia.org/wiki/Dublin)
    • Columns: dbpedia.org/property/city | dbpedia.org/property/nation | dbpedia.org/property/since (xsd:integer)
    • Annotated cells (city | nation):
      – dbpedia.org/resource/San_Jose,_California | dbpedia.org/resource/United_States
      – dbpedia.org/resource/Liverpool | dbpedia.org/resource/United_Kingdom
      – dbpedia.org/resource/Matsue,_Shimane | dbpedia.org/resource/Japan
      – dbpedia.org/resource/Barcelona | dbpedia.org/resource/Spain
      – dbpedia.org/resource/Beijing | dbpedia.org/resource/People’s_Republic_of_China
  • 8. Reasoning over Wikipedia Tables
    • Extracting relations (http://en.wikipedia.org/wiki/Dublin)
    • Candidate relations between the city and nation columns: dbpedia.org/ontology/country, dbpedia.org/property/subdivisionName
    • Columns: dbpedia.org/property/city | dbpedia.org/property/nation | dbpedia.org/property/since (xsd:integer)
    • Annotated cells (city | nation):
      – dbpedia.org/resource/San_Jose,_California | dbpedia.org/resource/United_States
      – dbpedia.org/resource/Liverpool | dbpedia.org/resource/United_Kingdom
      – dbpedia.org/resource/Matsue,_Shimane | dbpedia.org/resource/Japan
      – dbpedia.org/resource/Barcelona | dbpedia.org/resource/Spain
      – dbpedia.org/resource/Beijing | dbpedia.org/resource/People’s_Republic_of_China
    • Callout: "is dbpedia.org/ontology/country of"
  • 9. Reasoning over Wikipedia Tables
    • <http://dbpedia.org/resource/San_Jose,_California> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/United_States> .
    • <http://dbpedia.org/resource/San_Jose,_California> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/United_States> .
    • <http://dbpedia.org/resource/Liverpool> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/United_Kingdom> .
    • <http://dbpedia.org/resource/Liverpool> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/United_Kingdom> .
    • <http://dbpedia.org/resource/Matsue,_Shimane> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/Japan> .
    • <http://dbpedia.org/resource/Matsue,_Shimane> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Japan> .
    • <http://dbpedia.org/resource/Barcelona> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/Spain> .
    • <http://dbpedia.org/resource/Barcelona> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Spain> .
    • <http://dbpedia.org/resource/Beijing> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/Peoples_Republic_of_China> .
    • <http://dbpedia.org/resource/Beijing> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Peoples_Republic_of_China> .
  • 10. Reasoning over Wikipedia Tables (same ten triples as slide 9)
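The triples on slides 9 and 10 can be reproduced mechanically from the annotated (city, nation) cell pairs and the two candidate predicates. The following is a minimal sketch under that assumption; the function name `to_ntriples` and the way rows and predicates are passed in are hypothetical and not taken from the talk.

```python
# Sketch only: emit one N-Triples line per (row, candidate predicate) pair.
# `to_ntriples`, `rows` and `predicates` are illustrative names, not the author's code.
DBPEDIA = "http://dbpedia.org/"

def to_ntriples(rows, predicates):
    """Return N-Triples lines linking each subject cell to its object cell."""
    lines = []
    for subject, obj in rows:
        for predicate in predicates:
            lines.append(
                f"<{DBPEDIA}resource/{subject}> "
                f"<{DBPEDIA}{predicate}> "
                f"<{DBPEDIA}resource/{obj}> ."
            )
    return lines

# (city, nation) pairs as annotated on the Dublin twin-cities table.
rows = [
    ("San_Jose,_California", "United_States"),
    ("Liverpool", "United_Kingdom"),
    ("Matsue,_Shimane", "Japan"),
    ("Barcelona", "Spain"),
    ("Beijing", "Peoples_Republic_of_China"),
]
predicates = ["property/subdivisionName", "ontology/country"]

triples = to_ntriples(rows, predicates)
for line in triples:
    print(line)
```

Five rows and two predicates give the ten statements listed on the slide. Whether every generated statement is actually correct is a separate question, which is what the next slide examines.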
  • 11. Reasoning over Wikipedia Tables
    • Let’s analyze these cases …
    • Liverpool
    • Matsue
    • Beijing
  • 12. Not that simple…
    • Web tables usually don’t have explicit semantics by themselves.
    • Main issues:
      – Complex tables with spans
      – Captions inside the table as another row
      – Not well-formed tables (i.e., not a matrix)
      – We need filters (e.g., min 2 columns, 2 rows)
    • We are extracting relations at row level and between the main entity and the table resources
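The filter mentioned on this slide (well-formed matrix, at least 2 columns and 2 rows) could look roughly like the sketch below. It assumes a table has already been parsed into a list of rows, each a list of cell strings; the function names and the decision to count a header row toward the row minimum are assumptions, since the slide does not say.

```python
# Sketch only: a table is a list of rows, each row a list of cell strings.
# Function names and the treatment of header rows are assumptions.

def is_well_formed(table):
    """True when the table is non-empty and every row has the same number of cells."""
    return bool(table) and len({len(row) for row in table}) == 1

def passes_filter(table, min_rows=2, min_cols=2):
    """Keep only matrix-shaped tables with at least min_rows rows and min_cols columns."""
    return (
        is_well_formed(table)
        and len(table) >= min_rows
        and len(table[0]) >= min_cols
    )

good = [["City", "Nation"], ["Liverpool", "United Kingdom"]]
ragged = [["City", "Nation", "Since"], ["Liverpool", "United Kingdom"]]
too_narrow = [["City"], ["Liverpool"]]

print(passes_filter(good), passes_filter(ragged), passes_filter(too_narrow))
```

Tables with rowspans or colspans would have to be expanded into a full matrix before a check like this, otherwise they are rejected as ragged.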
  • 13. Parsing: Extracting Tables
    • First step: parsing Wiki format
    • Example: http://en.wikipedia.org/wiki/People%27s_Republic_of_China
      – Caption as another row
      – Rowspans
      – Table split with pictures
  • 14. Parsing: Extracting Tables
    • Problems with parsing the cell’s content (http://en.wikipedia.org/wiki/Danny_Kaye)
  • 15. Parsing: Extracting Tables
    • Problems with parsing the cell’s content (http://en.wikipedia.org/wiki/Danny_Kaye)
  • 16. Parsing: Extracting Tables
    • Example: http://en.wikipedia.org/wiki/List_of_animated_television_series_of_the_1990s
      – Same page link
      – Many different formats
      – Anchor text vs. content text
  • 17. Extracting Relations
    • A table containing tables (http://en.wikipedia.org/wiki/AFC_Ajax)
  • 18. Extracting Relations
    • Also relations between the main entity and the entities in the table (http://en.wikipedia.org/wiki/AFC_Ajax)
    • Main entity: dbpedia.org/resource/AFC_Ajax, 16 players
      – 14 dbpedia.org/ontology/team
      – 14 dbpedia.org/property/clubs
      – 11 dbpedia.org/property/currentclub
      – 3 dbpedia.org/property/youthclubs
    • In his DBpedia page there is no mention of AFC Ajax
  • 19. dbpedia.org/resource/Christian_Eriksen
    • http://en.wikipedia.org/wiki/AFC_Ajax
    • Disambiguation page: dbpedia.org/resource/Ajax
  • 20. Our Dataset
    • enwiki dump from 2012-09-03 02:17:37
    • 8.6 GB of Wikipedia pages that comprise
      – 10,531,986 documents (HTML pages)
      – Only 413,256 HTML pages contain tables
      – 2,989,098 tables
      – 905,929 tables after the filter
        • 27.7% of all tables
      – 0.46 tables per page (or 2.15 discarding pages without tables)
  • 21. Methodology
  • 22. Ranking of Relationships
    • The current ranking function is naïve: $\mathit{score} = \frac{f_{rel}}{n_{rows}}$
    • Example: http://en.wikipedia.org/wiki/AFC_Ajax, 16 players
      freq | relationship | score
      14 | dbpedia.org/ontology/team | 0.875
      14 | dbpedia.org/property/clubs | 0.875
      11 | dbpedia.org/property/currentclub | 0.6875
      3 | dbpedia.org/property/youthclubs | 0.1875
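The naïve score on this slide is simply the number of times a relation was found divided by the number of rows. A minimal sketch follows, assuming the relations found for each row are available as a list; `rank_relations` and the synthetic 16-row input are illustrative constructions that reproduce the frequencies on the slide, not the author's code or data.

```python
from collections import Counter

# Sketch only: `relations_per_row` holds, for each table row, the relations
# found between the main entity and the entities in that row.
def rank_relations(relations_per_row):
    """Score each relation as frequency / number of rows, highest first."""
    n_rows = len(relations_per_row)
    freq = Counter(rel for row in relations_per_row for rel in row)
    scored = [(rel, count / n_rows) for rel, count in freq.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))

# Synthetic rows built to match the slide's counts (14, 14, 11 and 3 out of 16).
relations_per_row = []
for i in range(16):
    found = []
    if i < 14:
        found += ["dbpedia.org/ontology/team", "dbpedia.org/property/clubs"]
    if i < 11:
        found.append("dbpedia.org/property/currentclub")
    if i < 3:
        found.append("dbpedia.org/property/youthclubs")
    relations_per_row.append(found)

ranking = rank_relations(relations_per_row)
for relation, score in ranking:
    print(f"{score:.4f}  {relation}")
```

Ties are broken alphabetically here only to make the ordering deterministic; the slide does not specify a tie-breaking rule.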
  • 23. Ranking of Relationships
    • For these cases the function does not work well, and $\mathit{score} \notin [0, 1]$ (http://en.wikipedia.org/wiki/Danny_Kaye)
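The slide does not say why the score leaves [0, 1] for this table. One plausible cause, given that multiple entities in the same cell are listed among the open problems, is that a relation is found more than once in a single row. The sketch below illustrates that assumption and a bounded variant that counts each relation at most once per row; both function names are hypothetical, and the variant is not a fix proposed in the talk.

```python
from collections import Counter

# Sketch only: each inner list holds the relations found in one table row.
def naive_score(relations_per_row):
    """Occurrences / rows: exceeds 1 when a relation repeats within a row."""
    n_rows = len(relations_per_row)
    freq = Counter(rel for row in relations_per_row for rel in row)
    return {rel: count / n_rows for rel, count in freq.items()}

def row_level_score(relations_per_row):
    """Rows containing the relation / rows: always within [0, 1]."""
    n_rows = len(relations_per_row)
    freq = Counter(rel for row in relations_per_row for rel in set(row))
    return {rel: count / n_rows for rel, count in freq.items()}

# Hypothetical two-row table whose cells each mention several entities.
rows = [["p", "p"], ["p", "p", "p"]]
print(naive_score(rows))      # {'p': 2.5}
print(row_level_score(rows))  # {'p': 1.0}
```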
  • 24. Ongoing Work and Challenges
    • Improve the ranking function for relations.
    • Store the 5.5M DBpedia (transitive) redirects locally (optimizing time).
    • Statistical analysis of Wikipedia tables
      – Number of columns, rows
      – Headers, captions
      – External and internal links
    • The next big challenge is the evaluation.
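Storing the redirects locally means a resource name can be resolved to its final target without a remote query. A minimal sketch follows, assuming the redirects fit in an in-memory dictionary mapping each redirecting name to its immediate target; the function name, the hop limit, and the example entries are all hypothetical and do not reflect actual DBpedia data.

```python
# Sketch only: `redirects` maps a resource name to its immediate redirect target.
def resolve_redirect(resource, redirects, max_hops=10):
    """Follow redirects transitively, stopping on cycles or after max_hops."""
    seen = set()
    while resource in redirects and resource not in seen and len(seen) < max_hops:
        seen.add(resource)
        resource = redirects[resource]
    return resource

# Hypothetical entries, for illustration only.
redirects = {
    "Example_Old_Name": "Example_Newer_Name",
    "Example_Newer_Name": "Example_Final_Name",
}

print(resolve_redirect("Example_Old_Name", redirects))
print(resolve_redirect("Example_Final_Name", redirects))
```

Resolving the chains once up front, so that every key points directly at its final target, would turn each lookup into a single dictionary access.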
  • 25. What’s next?
    • Some ideas in mind:
      – Use the extracted relations to classify WikiTables
      – Define a similarity function for WikiTables
    • (Figure labels: English, Italian)
  • 26. What’s next?
    • Example: http://en.wikipedia.org/wiki/Electronegativity
      – What does this number mean?
      – Here there is no reference to those numbers!
  • 27. What’s next?
    • http://dbpedia.org/page/Chlorous_acid
    • http://en.wikipedia.org/wiki/Electronegativity
    • Chlorous acid is a chlorite
    • http://en.wikipedia.org/wiki/Chlorine
  • 28. Open problems
    • Handle multiple entities in the same cell
    • Improve the ranking function
    • Handle redirects before querying DBpedia
    • How to evaluate the outcome
    Thanks! Q&A
    Emir Muñoz, Unit for Reasoning and Querying, emir.munoz@deri.org