
WikiTables DERI Talk

Published in: Education
Transcript of "WikiTables DERI Talk"

  1. Extending DBpedia (LOD) using WikiTables. Emir Muñoz, Unit for Reasoning and Querying, emir.munoz@deri.org
  2. Linked Open Data. Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ October 12, 2012 -- E. Muñoz
  3. Linked Open Data
     • DBpedia, an export of Wikipedia’s structured data
     DBpedia provides an RDF version of all Wikipedia structured data (infoboxes).
  4. Linked Open Data
     • DBpedia, an export of Wikipedia’s structured data
     DBpedia provides an RDF version of all Wikipedia structured data (infoboxes), but not yet a version of all normal Wikipedia tables, or wikitables.
  5. Tables as a source of LOD
     Tables are inherently concise as well as information rich.
     • Infoboxes (attr-value)
     • Column headers represent types of information
     • The values represent instances of those types
     • Caption as another row
     http://en.wikipedia.org/wiki/Dublin http://en.wikipedia.org/wiki/Galway
  6. Reasoning over Wikipedia Tables: Recovering Table Semantics
     “… Dublin is twinned with the following places:” http://en.wikipedia.org/wiki/Dublin
  7. Reasoning over Wikipedia Tables
     Entity annotation for cells, mappings to DBpedia resources. http://en.wikipedia.org/wiki/Dublin
     dbpedia.org/property/city                    dbpedia.org/property/nation
     dbpedia.org/resource/San_Jose,_California    dbpedia.org/resource/United_States
     dbpedia.org/resource/Liverpool               dbpedia.org/resource/United_Kingdom
     dbpedia.org/resource/Matsue,_Shimane         dbpedia.org/resource/Japan
     dbpedia.org/resource/Barcelona               dbpedia.org/resource/Spain
     dbpedia.org/resource/Beijing                 dbpedia.org/resource/People’s_Republic_of_China
     Third column: dbpedia.org/property/since (xsd:integer)
  8. Reasoning over Wikipedia Tables
     Extracting relations: dbpedia.org/ontology/country, dbpedia.org/property/subdivisionName. http://en.wikipedia.org/wiki/Dublin
     dbpedia.org/property/city                    dbpedia.org/property/nation
     dbpedia.org/resource/San_Jose,_California    dbpedia.org/resource/United_States
     dbpedia.org/resource/Liverpool               dbpedia.org/resource/United_Kingdom
     dbpedia.org/resource/Matsue,_Shimane         dbpedia.org/resource/Japan
     dbpedia.org/resource/Barcelona               dbpedia.org/resource/Spain
     dbpedia.org/resource/Beijing                 dbpedia.org/resource/People’s_Republic_of_China
     Third column: dbpedia.org/property/since (xsd:integer)
     is dbpedia.org/ontology/country of
  9. Reasoning over Wikipedia Tables
     • <http://dbpedia.org/resource/San_Jose,_California> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/United_States> .
     • <http://dbpedia.org/resource/San_Jose,_California> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/United_States> .
     • <http://dbpedia.org/resource/Liverpool> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/United_Kingdom> .
     • <http://dbpedia.org/resource/Liverpool> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/United_Kingdom> .
     • <http://dbpedia.org/resource/Matsue,_Shimane> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/Japan> .
     • <http://dbpedia.org/resource/Matsue,_Shimane> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Japan> .
     • <http://dbpedia.org/resource/Barcelona> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/Spain> .
     • <http://dbpedia.org/resource/Barcelona> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Spain> .
     • <http://dbpedia.org/resource/Beijing> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/Peoples_Republic_of_China> .
     • <http://dbpedia.org/resource/Beijing> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Peoples_Republic_of_China> .
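A minimal sketch of how one annotated table row could be serialized into triples like those on slide 9. The helper names `to_ntriple` and `row_triples` are hypothetical and not part of the talk; the sketch assumes the cell annotations are already available as DBpedia resource names.

```python
DBPEDIA = "http://dbpedia.org/"


def to_ntriple(subject: str, predicate: str, obj: str) -> str:
    """Format three full IRIs as one N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def row_triples(city: str, nation: str, predicates: list) -> list:
    """Emit one triple per candidate predicate for a (city, nation) row.

    `city` and `nation` are resource names such as "Liverpool";
    `predicates` are paths below dbpedia.org such as "ontology/country".
    """
    subject = DBPEDIA + "resource/" + city
    obj = DBPEDIA + "resource/" + nation
    return [to_ntriple(subject, DBPEDIA + p, obj) for p in predicates]


lines = row_triples(
    "Liverpool",
    "United_Kingdom",
    ["property/subdivisionName", "ontology/country"],
)
for line in lines:
    print(line)
```

The two printed lines correspond to the Liverpool entries in the list above.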
  11. Reasoning over Wikipedia Tables
      • Let’s analyze these cases …
      • Liverpool
      • Matsue
      • Beijing
  12. Not that simple…
      • Web tables usually don’t have explicit semantics by themselves.
      • Main issues:
        – Complex tables with spans
        – Captions inside the table as another row
        – Tables that are not well formed (i.e., not a matrix)
        – We need filters (e.g., min 2 columns, 2 rows)
      • We extract relations at row level, and between the main entity and the table resources.
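The filter mentioned on slide 12 can be sketched as follows. This is an illustration under assumptions, not the talk's implementation: a table is taken to be a list of rows, each a list of cell strings, and the header row is assumed to count toward the row minimum. The names `is_well_formed` and `passes_filter` are hypothetical.

```python
def is_well_formed(table):
    """True if every row has the same number of cells (the table is a matrix)."""
    return len({len(row) for row in table}) == 1


def passes_filter(table, min_rows=2, min_cols=2):
    """Keep only matrix-shaped tables with at least min_rows x min_cols cells."""
    if len(table) < min_rows:
        return False
    if not is_well_formed(table):
        return False
    return len(table[0]) >= min_cols


twin_cities = [
    ["City", "Nation"],
    ["Liverpool", "United Kingdom"],
    ["Barcelona", "Spain"],
]
ragged = [["City", "Nation"], ["Liverpool"]]

print(passes_filter(twin_cities))  # True
print(passes_filter(ragged))       # False
```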
  13. Parsing: Extracting Tables
      First step: parsing Wiki format. http://en.wikipedia.org/wiki/People%27s_Republic_of_China
      • Caption as another row
      • Rowspans
      • Table split with pictures
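One way to handle the rowspan issue is to copy a spanning cell into every row it covers, so the result is a plain matrix. The sketch below assumes each parsed cell arrives as a `(text, rowspan)` pair; that input shape and the name `expand_rowspans` are assumptions made for illustration, and column spans are not handled.

```python
def expand_rowspans(rows):
    """Copy each spanning cell down into the rows it covers.

    rows: list of rows, each a list of (text, rowspan) pairs.
    Returns a list of rows of plain text.
    """
    grid = []
    pending = {}  # column index -> (text, rows still to fill)
    for row in rows:
        out = []
        cells = iter(row)
        col = 0
        while True:
            if col in pending:
                # This column is covered by a cell from an earlier row.
                text, remaining = pending[col]
                out.append(text)
                if remaining > 1:
                    pending[col] = (text, remaining - 1)
                else:
                    del pending[col]
            else:
                cell = next(cells, None)
                if cell is None:
                    break
                text, span = cell
                out.append(text)
                if span > 1:
                    pending[col] = (text, span - 1)
            col += 1
        grid.append(out)
    return grid


raw = [
    [("Year", 1), ("Title", 1)],
    [("1990", 2), ("A", 1)],
    [("B", 1)],
]
print(expand_rowspans(raw))
```

After expansion, every row has the same number of cells, so a matrix-shape check can be applied to the result.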
  14. Parsing: Extracting Tables
      • Problems with parsing the cell’s content. http://en.wikipedia.org/wiki/Danny_Kaye
  16. Parsing: Extracting Tables
      • Same page link
      • Many different formats
      • Anchor text vs. content text
      http://en.wikipedia.org/wiki/List_of_animated_television_series_of_the_1990s
  17. Extracting Relations
      A table containing tables. http://en.wikipedia.org/wiki/AFC_Ajax
  18. Extracting Relations
      • Also relations between the main entity and the entities in the table
      http://en.wikipedia.org/wiki/AFC_Ajax (16 players), main entity dbpedia.org/resource/AFC_Ajax
      14  dbpedia.org/ontology/team
      14  dbpedia.org/property/clubs
      11  dbpedia.org/property/currentclub
      3   dbpedia.org/property/youthclubs
      One player’s DBpedia page makes no mention of AFC Ajax.
  19. dbpedia.org/resource/Christian_Eriksen http://en.wikipedia.org/wiki/AFC_Ajax
      Disambiguation page: dbpedia.org/resource/Ajax
  20. Our Dataset
      • enwiki dump from 2012-09-03 02:17:37
      • 8.6 GB of Wikipedia pages that comprise
        – 10,531,986 documents (HTML pages)
        – Only 413,256 HTML pages contain tables
        – 2,989,098 tables
        – 905,929 tables after the filter (27.7% of all tables)
        – 0.46 tables per page (or 2.15 discarding pages without tables)
  21. Methodology
  22. Ranking of Relationships
      • The current ranking function is naïve: $\mathit{score} = \frac{f_{rel}}{n_{rows}}$
      http://en.wikipedia.org/wiki/AFC_Ajax (16 players)
      freq   relationship                        score
      14     dbpedia.org/ontology/team           0.875
      14     dbpedia.org/property/clubs          0.875
      11     dbpedia.org/property/currentclub    0.6875
      3      dbpedia.org/property/youthclubs     0.1875
  23. Ranking of Relationships
      • For cases like this one the function does not work well, and $\mathit{score} \notin [0,1]$. http://en.wikipedia.org/wiki/Danny_Kaye
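The naive score from slide 22 is the frequency of a relation divided by the number of rows. A minimal sketch follows, using the AFC Ajax frequencies from the slides; the function name `score_relations` and the tie-breaking by relation name are assumptions made for illustration.

```python
def score_relations(freq, n_rows):
    """Score each relation as freq / n_rows, highest score first.

    freq: dict mapping a relation name to how often it was observed.
    Ties are broken alphabetically by relation name (an arbitrary choice).
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    scored = [(rel, f / n_rows) for rel, f in freq.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


ajax_freq = {
    "dbpedia.org/ontology/team": 14,
    "dbpedia.org/property/clubs": 14,
    "dbpedia.org/property/currentclub": 11,
    "dbpedia.org/property/youthclubs": 3,
}
for rel, score in score_relations(ajax_freq, n_rows=16):
    print(rel, score)
```

Nothing in the formula bounds the frequency by the row count, so a relation counted more often than there are rows (for example 5 occurrences over 4 rows) scores 1.25, outside [0,1].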
  24. Ongoing Work and Challenges
      • Improve the ranking function for relations.
      • Store the 5.5M DBpedia (transitive) redirects locally (to save time).
      • Statistical analysis of Wikipedia tables
        – Number of columns, rows
        – Headers, captions
        – External and internal links
      • The next big challenge is the evaluation.
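Resolving transitive redirects from a local store could look like the sketch below. It assumes the redirects have been loaded into an in-memory dict from resource to redirect target; the name `resolve_redirect`, the hop limit, and the placeholder resource names are hypothetical.

```python
def resolve_redirect(resource, redirects, max_hops=10):
    """Follow a redirect chain until a resource has no further redirect.

    redirects: dict mapping a resource to its direct redirect target.
    If the chain loops back on itself, the starting resource is
    returned unchanged. After max_hops, the last resource reached
    is returned.
    """
    seen = {resource}
    current = resource
    for _ in range(max_hops):
        target = redirects.get(current)
        if target is None:
            return current
        if target in seen:
            return resource  # cycle detected: give up on this chain
        seen.add(target)
        current = target
    return current


# Placeholder names, not real DBpedia redirects.
redirects = {"A": "B", "B": "C", "X": "Y", "Y": "X"}
print(resolve_redirect("A", redirects))  # C
print(resolve_redirect("C", redirects))  # C
print(resolve_redirect("X", redirects))  # X (cycle)
```

Resolving chains once and storing the final target for each resource would turn every later lookup into a single dict access.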
  25. What’s next?
      • Some ideas in mind:
        – Use the extracted relations to classify WikiTables
        – Define a similarity function for WikiTables
      (Figure panels: English, Italian)
  26. What’s next? http://en.wikipedia.org/wiki/Electronegativity
      What does this number mean? There is no reference to those numbers here!
  27. What’s next? http://dbpedia.org/page/Chlorous_acid http://en.wikipedia.org/wiki/Electronegativity
      Chlorous acid is a chlorite. http://en.wikipedia.org/wiki/Chlorine
  28. Open problems
      • Handle multiple entities in the same cell
      • Improve the ranking function
      • Handle redirects before querying DBpedia
      • How to evaluate the outcome
      Thanks! Q&A
      Emir Muñoz, Unit for Reasoning and Querying, emir.munoz@deri.org