
DBpedia Framework - BBC Talk


  1. 1. Georgi Kobilarov, Chris Bizer, Christian Becker, Freie Universität Berlin
  2. 2. Hello again <ul><li>Georgi Kobilarov </li></ul><ul><li>Researcher at Freie Universität Berlin </li></ul><ul><li>DBpedia Development Lead </li></ul>
  3. 3. Agenda <ul><li>Status Quo </li></ul><ul><li>Technical Overview </li></ul><ul><li>Challenges </li></ul><ul><li>Outlook </li></ul>
  4. 4. <ul><li>How to extract Wikipedia data </li></ul><ul><li>and how to not do it </li></ul>
  5. 5. <ul><li>Lessons learned </li></ul>
  6. 6. Title, Description, Languages, Web Links, Categorization, Domain-specific Data, Images, Infoboxes
  7. 8. <ul><li><http://dbpedia.org/resource/Hewlett-Packard> </li></ul><ul><li>rdfs:label “Hewlett-Packard” </li></ul><ul><li>p:foundation dbpedia:Palo_Alto </li></ul><ul><li>p:keypeople dbpedia:Bill_Hewlett </li></ul><ul><li>p:keypeople dbpedia:David_Packard </li></ul><ul><li>p:keypeople dbpedia:Mark_V._Hurd </li></ul><ul><li>p:industry dbpedia:Computer_Systems </li></ul><ul><li>p:industry dbpedia:Computer_software </li></ul><ul><li>p:revenue 104300000000 $ </li></ul><ul><li>p:netincome 7300000000 $ </li></ul><ul><li>p:employees 156000 </li></ul><ul><li>p:slogan “Invent” </li></ul>
  8. 9. Problems <ul><li>Poor abstract extraction </li></ul><ul><li>Property synonyms </li></ul><ul><li>Redirects </li></ul><ul><li>Missing class hierarchy </li></ul><ul><li>Range validation </li></ul>
  9. 10. Property Synonyms
  10. 11. Redirects <ul><li>Florida located_in USA </li></ul><ul><li>California located_in United_States </li></ul><ul><li>USA redirects_to United_States </li></ul>
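
The redirect problem on this slide can be sketched in a few lines: two triples point at different names for the same country until redirects are resolved. This is a minimal illustration, not the framework's code; the function name `resolve_redirects` and the tuple representation of triples are assumptions made for the example.

```python
def resolve_redirects(triples, redirects):
    """Rewrite subjects and objects of triples to their canonical page names.

    `redirects` maps a redirecting page name to its target (hypothetical
    in-memory representation; the real data comes from Wikipedia).
    """
    def canonical(name):
        seen = set()
        # Follow redirect chains, stopping if a cycle is detected.
        while name in redirects and name not in seen:
            seen.add(name)
            name = redirects[name]
        return name

    return [(canonical(s), p, canonical(o)) for s, p, o in triples]


# The three statements from the slide.
triples = [
    ("Florida", "located_in", "USA"),
    ("California", "located_in", "United_States"),
]
redirects = {"USA": "United_States"}

resolved = resolve_redirects(triples, redirects)
```

After resolution both triples share the object `United_States`, so a query for everything located in the United States finds Florida as well as California.
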
  11. 12. Class Hierarchy <ul><li>„Select all PEOPLE born in …“ </li></ul>
  12. 13. Range Validation <ul><li>dbpedia:Google </li></ul><ul><li>keyperson Eric Schmidt </li></ul><ul><li>keyperson Sergey Brin </li></ul><ul><li>keyperson Larry Page </li></ul><ul><li>keyperson CEO </li></ul><ul><li>keyperson Chairman </li></ul>
  13. 14. Range Validation
  14. 15. <ul><li>Technical Overview </li></ul>
  15. 16. And how does it work? <ul><li>Extraction Framework </li></ul><ul><li>(and a lot of regular expressions) </li></ul>
  16. 17. Extraction Framework <ul><li>Open Source </li></ul><ul><li>http://dbpedia.svn.sourceforge.net </li></ul><ul><li>implemented in PHP </li></ul>
  17. 18. Extraction Framework <ul><li>Data Input ( PageCollections ) </li></ul><ul><li>DatabaseWikipedia </li></ul><ul><li>LiveWikipedia </li></ul>
  18. 19. Extraction Framework <ul><li>Data Processing ( Extractors ) </li></ul><ul><li>InfoboxExtractor </li></ul><ul><li>LabelExtractor </li></ul><ul><li>CategoryExtractor </li></ul><ul><li>RedirectExtractor </li></ul><ul><li>GeoExtractor </li></ul>
  19. 20. Extraction Framework <ul><li>Data Output ( Destinations ) </li></ul><ul><li>SimpleDumpDestination (stdout) </li></ul><ul><li>NTripleDumpDestination </li></ul>
  20. 21. Extraction Framework <ul><li>Tie things together </li></ul><ul><li>Extraction Manager </li></ul><ul><li>Extraction Jobs </li></ul>
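
The four slides above describe a pipeline: page collections supply input, extractors turn pages into triples, destinations receive the output, and a manager runs jobs that tie the three together. The following Python sketch shows one way those roles could fit. The actual framework is written in PHP, so every class body, method name, and signature here is an assumption made for illustration, not the framework's API.

```python
class InMemoryPageCollection:
    """Stand-in for a page collection such as DatabaseWikipedia or LiveWikipedia."""

    def __init__(self, pages):
        self.pages = pages  # maps page title -> wiki source text

    def get_source(self, title):
        return self.pages[title]


class LabelExtractor:
    """Produces one rdfs:label triple per page, derived from the title."""

    def extract(self, title, source):
        return [(title, "rdfs:label", title.replace("_", " "))]


class ListDestination:
    """Collects triples in a list instead of writing a dump file."""

    def __init__(self):
        self.triples = []

    def accept(self, triples):
        self.triples.extend(triples)


class ExtractionJob:
    """Bundles the input, the extractors to run, and the output."""

    def __init__(self, collection, titles, extractors, destination):
        self.collection = collection
        self.titles = titles
        self.extractors = extractors
        self.destination = destination


class ExtractionManager:
    """Runs every extractor of a job over every page of the job."""

    def execute(self, job):
        for title in job.titles:
            source = job.collection.get_source(title)
            for extractor in job.extractors:
                job.destination.accept(extractor.extract(title, source))


collection = InMemoryPageCollection({"Palo_Alto": "{{Infobox ...}}"})
destination = ListDestination()
job = ExtractionJob(collection, ["Palo_Alto"], [LabelExtractor()], destination)
ExtractionManager().execute(job)
```

Keeping input, processing, and output behind separate interfaces means a new extractor or a new output format can be added without touching the other two.
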
  21. 22. DBpedia Dataset <ul><li>Provided as RDF Dumps </li></ul><ul><li>Updated every 3 months </li></ul><ul><li>Hosted by OpenLink Software </li></ul><ul><li>Available as Linked Data </li></ul>
  22. 23. SPARQL Endpoint <ul><li>http://dbpedia.org/sparql </li></ul>
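
As a rough sketch of how a client might address the endpoint, the code below builds a GET request URL that carries a SPARQL query in the standard `query` parameter. The query itself and the `p:` prefix URI (`http://dbpedia.org/property/`) are assumptions based on the Hewlett-Packard example earlier in the talk; no request is sent here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://dbpedia.org/sparql"

# Hypothetical query: list the key people of Hewlett-Packard.
QUERY = """
PREFIX p: <http://dbpedia.org/property/>
SELECT ?person WHERE {
  <http://dbpedia.org/resource/Hewlett-Packard> p:keypeople ?person .
}
"""


def build_query_url(query, endpoint=ENDPOINT):
    """Return a GET URL with the query percent-encoded in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_query_url(QUERY)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting URL can be fetched with any HTTP client; the result format is usually selected with an `Accept` header.
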
  23. 24. Linked Data <ul><li>Use URIs as names for things. </li></ul><ul><li>Use HTTP URIs so that people can look up those names. </li></ul><ul><li>When someone looks up a URI, provide useful information. </li></ul><ul><li>Include links to other URIs, so that they can discover more things. </li></ul>
  24. 25. HTTP URIs <ul><li>Information Resources: http://dbpedia.org/page/Bristol, HTTP GET -> 200 OK </li></ul><ul><li>Non-Information Resources: http://dbpedia.org/resource/Bristol, HTTP GET -> 303 See Other, pointing to http://dbpedia.org/page/Bristol or http://dbpedia.org/data/Bristol -> 200 OK </li></ul>
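
The URI scheme on this slide can be modelled as a small function. This is a simplified model of the behaviour the slide describes, not the server's implementation; in particular, the rule that an RDF `Accept` header selects `/data/` and anything else selects `/page/` is an assumption made for the sketch.

```python
BASE = "http://dbpedia.org"


def dereference(uri, accept="text/html"):
    """Return (status, location) for a GET request on a DBpedia-style URI."""
    resource_prefix = BASE + "/resource/"
    if uri.startswith(resource_prefix):
        # A non-information resource (the thing itself) cannot be returned
        # directly, so the client is redirected to a document about it.
        name = uri[len(resource_prefix):]
        if "application/rdf+xml" in accept:
            return 303, BASE + "/data/" + name
        return 303, BASE + "/page/" + name
    if uri.startswith(BASE + "/page/") or uri.startswith(BASE + "/data/"):
        # Information resources (documents) are served directly.
        return 200, uri
    return 404, None


html_redirect = dereference("http://dbpedia.org/resource/Bristol")
rdf_redirect = dereference("http://dbpedia.org/resource/Bristol", "application/rdf+xml")
page = dereference("http://dbpedia.org/page/Bristol")
```
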
  25. 26. How to get started <ul><li>Documentation http://wiki.dbpedia.org/Documentation </li></ul><ul><li>Source Code </li></ul><ul><li>start.php </li></ul>
  26. 27. Next Tasks <ul><li>Improve Extractors </li></ul><ul><li>Cleaner Abstracts </li></ul><ul><li>Include Redirects into Extraction Process </li></ul><ul><li>Fix more Extraction Bugs </li></ul><ul><li> http://sourceforge.net/projects/dbpedia/ </li></ul><ul><li>Provide Live Update Service </li></ul>
  27. 28. Infobox Extraction <ul><li>One script to rule them all </li></ul><ul><li>Not sufficient </li></ul>
  28. 29. <ul><li>Next Challenges </li></ul>
  29. 30. Next challenges <ul><li>Higher Data Quality + Ontologies </li></ul><ul><li>Consistency Checks </li></ul><ul><li>Augmentation </li></ul><ul><li>Live Updates </li></ul>
  30. 31. Live Updates <ul><li>Wikipedia Update Stream </li></ul><ul><li>Extraction Cluster </li></ul><ul><li>Named Graphs </li></ul>
  31. 32. Augmentation <ul><li>Enrich DBpedia with data from: </li></ul><ul><li>1. other languages </li></ul><ul><li>2. external datasets </li></ul>
  32. 33. Consistency Checks <ul><li>German Wikipedia says Berlin's population is X </li></ul><ul><li>Italian Wikipedia says it's Y </li></ul>
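
One possible form of such a check is to compare the values that different language editions give for the same property and to flag the entity when they disagree by more than a tolerance. The function name, the 1% default tolerance, and the numbers below are all invented for the illustration; they are not real population figures.

```python
def conflicting(values_by_language, tolerance=0.01):
    """Return True if the values differ by more than `tolerance` (relative)."""
    values = list(values_by_language.values())
    if len(values) < 2:
        return False
    spread = max(values) - min(values)
    scale = max(abs(v) for v in values)
    return scale > 0 and spread / scale > tolerance


# Made-up values standing in for X and Y on the slide.
needs_review = conflicting({"de": 3_400_000, "it": 3_500_000})
close_enough = conflicting({"de": 1000, "it": 1005})
```

A flagged entity would still need a human or a further rule to decide which value is right; the check only detects the disagreement.
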
  33. 34. Data Quality <ul><li>We need humans </li></ul>
  34. 35. <ul><li>The Vision </li></ul>
  35. 36. Semantic Web <ul><li>Users shouldn’t care </li></ul>
  36. 37. Semantic Web <ul><li>Users shouldn’t have to care </li></ul><ul><li>(del.icio.us lesson) </li></ul>
  37. 38. DBpedia Extraction: Wikipedia -> DBpedia Extraction Framework -> Triple Store
  38. 39. Freebase Extraction: Wikipedia -> Extraction -> Metaweb Graph Store
  39. 40. <ul><li>What is the </li></ul><ul><li>Wikipedia for Data? </li></ul>
  40. 41. <ul><li>Wikipedia is the </li></ul><ul><li>Wikipedia for Data </li></ul>
  41. 43. Crowd Sourced Extraction <ul><li>Where's the user benefit? </li></ul>
  42. 44. Users <ul><li>Mashup Developer </li></ul>
  43. 45. <ul><li>Benefit </li></ul>
  44. 46. <ul><li>Outlook </li></ul>
  45. 47. Infobox Extraction <ul><li>We need a new approach </li></ul><ul><li>Break it down into smaller pieces </li></ul>
  46. 48. Step 1: Create an ontology <ul><li>Five domains: </li></ul><ul><li>people, places, organisations, </li></ul><ul><li>events, works </li></ul>
  47. 49. People <ul><li>Actor </li></ul><ul><li>Athlete </li></ul><ul><li>Journalist </li></ul><ul><li>MusicalArtist </li></ul><ul><li>Politician </li></ul><ul><li>Scientist </li></ul><ul><li>Writer </li></ul>
  48. 50. Places <ul><li>Airport </li></ul><ul><li>City </li></ul><ul><li>Country </li></ul><ul><li>Island </li></ul><ul><li>Mountain </li></ul><ul><li>River </li></ul>
  49. 51. Organisations <ul><li>Band </li></ul><ul><li>Company </li></ul><ul><li>Educational Institution </li></ul><ul><li>Radio Station </li></ul><ul><li>Sports Team </li></ul>
  50. 52. Event <ul><li>Convention </li></ul><ul><li>Military Conflict </li></ul><ul><li>Music Event </li></ul><ul><li>Sport Event </li></ul>
  51. 53. Work <ul><li>Book </li></ul><ul><li>Broadcast </li></ul><ul><li>Film </li></ul><ul><li>Software </li></ul><ul><li>Television </li></ul>
  52. 54. Step 2: Template Mapping <ul><li>Infobox Cricketer </li></ul><ul><li>Infobox Historic Cricketer </li></ul><ul><li>Infobox Recent Cricketer </li></ul><ul><li>Infobox Old Cricketer </li></ul><ul><li>Infobox Cricketer Biography </li></ul><ul><li>=> Class Cricketer (Athlete) </li></ul>
  53. 55. Step 2: Template Mapping <ul><li>Class TV Episode (Work) </li></ul><ul><li>Wikipedia Templates: </li></ul><ul><li>Television Episode </li></ul><ul><li>UK Office Episode </li></ul><ul><li>Simpsons Episode </li></ul><ul><li>DoctorWhoBox </li></ul>
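
The template mapping from the two slides above can be pictured as a lookup table from Wikipedia template names to an ontology class and its parent. The template and class names come from the slides; the dictionary layout and the function `class_for_template` are assumptions for this sketch.

```python
# Template name -> (ontology class, parent class), as listed on the slides.
TEMPLATE_TO_CLASS = {
    "Infobox Cricketer": ("Cricketer", "Athlete"),
    "Infobox Historic Cricketer": ("Cricketer", "Athlete"),
    "Infobox Recent Cricketer": ("Cricketer", "Athlete"),
    "Infobox Old Cricketer": ("Cricketer", "Athlete"),
    "Infobox Cricketer Biography": ("Cricketer", "Athlete"),
    "Television Episode": ("TV Episode", "Work"),
    "UK Office Episode": ("TV Episode", "Work"),
    "Simpsons Episode": ("TV Episode", "Work"),
    "DoctorWhoBox": ("TV Episode", "Work"),
}


def class_for_template(template_name):
    """Return (class, parent class) for a template, or None if it is unmapped."""
    return TEMPLATE_TO_CLASS.get(template_name.strip())


cricketer = class_for_template("Infobox Old Cricketer")
episode = class_for_template("DoctorWhoBox")
unknown = class_for_template("Infobox Volcano")
```

Because many templates collapse into one class, a query for all cricketers no longer has to know which of the five infobox variants an article happens to use.
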
  54. 56. Step 3: Parsers <ul><li>Handle template values specifically </li></ul><ul><li>Example: Property splitting </li></ul><ul><li>Person born „1.1.1980, [[Berlin]]“ </li></ul><ul><li>=> split to birthplace Berlin </li></ul><ul><li>birthdate 1980-01-01 </li></ul>
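
A minimal sketch of the property splitting on this slide, assuming the date is written day.month.year and the place is the first wiki link in the value. The function name `split_born` and the output keys are chosen for the example and are not taken from the framework.

```python
import re
from datetime import date

DATE_PATTERN = re.compile(r"(\d{1,2})\.(\d{1,2})\.(\d{4})")
LINK_PATTERN = re.compile(r"\[\[([^\]|]+)")


def split_born(value):
    """Split a combined 'born' value into birthdate and birthplace."""
    result = {}
    date_match = DATE_PATTERN.search(value)
    if date_match:
        day, month, year = (int(part) for part in date_match.groups())
        # Assumes day.month.year order, as in the slide's example.
        result["birthdate"] = date(year, month, day).isoformat()
    link_match = LINK_PATTERN.search(value)
    if link_match:
        # Use the link target, ignoring any display label after '|'.
        result["birthplace"] = link_match.group(1).strip()
    return result


parsed = split_born("1.1.1980, [[Berlin]]")
```
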
  55. 57. Step 3: Parsers <ul><li>Example: Class Rules </li></ul><ul><li>MusicalArtist </li></ul><ul><li>If property „currentMembers“ is set </li></ul><ul><li>=> Group </li></ul><ul><li>Otherwise </li></ul><ul><li>=> Person </li></ul>
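
The class rule on this slide is small enough to write out directly. The sketch assumes infobox properties arrive as a dictionary and treats an empty value as "not set"; both are assumptions, as is the function name.

```python
def classify_musical_artist(properties):
    """Apply the slide's rule: a set currentMembers property means a group."""
    if properties.get("currentMembers"):
        return "Group"
    return "Person"


band = classify_musical_artist({"currentMembers": "[[A]], [[B]]"})
solo = classify_musical_artist({"name": "Someone"})
```
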
  56. 58. Step 3: Parsers <ul><li>Example: Range Validation </li></ul><ul><li>Google keypeople </li></ul><ul><li>„[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]“ </li></ul><ul><li>Company#keyperson range Person#Class </li></ul><ul><li>Google keyperson Eric Schmidt </li></ul><ul><li>Sergey Brin </li></ul><ul><li>Larry Page </li></ul>
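
Range validation as described here can be sketched as a filter: extract every wiki link from the property value and keep only the targets whose class matches the declared range. The `class_of` lookup and its `Occupation` class are hypothetical stand-ins for whatever type information the ontology would provide.

```python
import re

# Matches [[Target]] and [[Target|label]], capturing the target.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def validate_range(value, expected_class, class_of):
    """Return the link targets in `value` whose class equals `expected_class`."""
    return [
        target
        for target in LINK.findall(value)
        if class_of.get(target) == expected_class
    ]


# Hypothetical class assignments for the entities on the slide.
class_of = {
    "Eric Schmidt": "Person",
    "Sergey Brin": "Person",
    "Larry Page": "Person",
    "CEO": "Occupation",
    "Chairman": "Occupation",
}

value = "[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]"
keypeople = validate_range(value, "Person", class_of)
```

With the range `Person` declared for `keyperson`, the links to CEO and Chairman are dropped and only the three people remain.
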
  57. 59. Step 4: Crowd Source it
  58. 60. Step 4: Crowd Source it
  59. 61. <ul><li>Linking Framework </li></ul>
  60. 62. Interlinking Framework
  61. 63. Interlinking Framework
  62. 64. <ul><li>„Apple“ </li></ul>
  63. 65. <ul><li>Apple </li></ul><ul><li>Google </li></ul><ul><li>Microsoft </li></ul>
  64. 66. <ul><li>Apple </li></ul><ul><li>Orange </li></ul><ul><li>Pear </li></ul>
  65. 67. <ul><li>Orange </li></ul><ul><li>Vodafone </li></ul><ul><li>T-Mobile </li></ul>
  66. 68. <ul><li>Context </li></ul><ul><li>Similarity </li></ul>
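
The Apple and Orange slides suggest picking the meaning of an ambiguous name from the terms that appear alongside it. The slides name only "Context" and "Similarity", so the Jaccard measure used below is a substitute chosen for this sketch, and the candidate identifiers and their related-term sets are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disambiguate(context, candidates):
    """Return the candidate whose related terms overlap most with the context."""
    return max(candidates, key=lambda name: jaccard(context, candidates[name]))


# Hypothetical candidates for the string "Apple" with made-up related terms.
candidates = {
    "Apple_(company)": {"Google", "Microsoft", "IBM"},
    "Apple_(fruit)": {"Orange", "Pear", "Banana"},
}

among_companies = disambiguate({"Google", "Microsoft"}, candidates)
among_fruit = disambiguate({"Orange", "Pear"}, candidates)
```

The same string resolves to the company next to Google and Microsoft, and to the fruit next to Orange and Pear.
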
  67. 69. Linking: The Future <ul><li>Hosted Webservice </li></ul><ul><li>for Linked Data publishers </li></ul>
  68. 70. Summary
  69. 71. <ul><li>http://dbpedia.org </li></ul><ul><li>Georgi Kobilarov </li></ul><ul><li>Freie Universität Berlin </li></ul>
