Georgi Kobilarov, Chris Bizer, Christian Becker, Freie Universität Berlin
DBpedia Framework - BBC Talk
Published in: Technology, Travel
  • DBpedia Framework - BBC Talk

    1. Georgi Kobilarov, Chris Bizer, Christian Becker, Freie Universität Berlin
    2. Hello again
        • Georgi Kobilarov
        • Researcher at Freie Universität Berlin
        • DBpedia Development Lead
    3. Agenda
        • Status Quo
        • Technical Overview
        • Challenges
        • Outlook
    4. How to extract Wikipedia data, and how to not do it
    5. Lessons learned
    6. Title, Description, Languages, Web Links, Categorization, Domain-specific Data, Images, Infoboxes
    7. <http://dbpedia.org/resource/Hewlett-Packard>
        • rdfs:label “Hewlett-Packard”
        • p:foundation dbpedia:Palo_Alto
        • p:keypeople dbpedia:Bill_Hewlett
        • p:keypeople dbpedia:David_Packard
        • p:keypeople dbpedia:Mark_V._Hurd
        • p:industry dbpedia:Computer_Systems
        • p:industry dbpedia:Computer_software
        • p:revenue 104300000000 $
        • p:netincome 7300000000 $
        • p:employees 156000
        • p:slogan “Invent”
    8. Problems
        • Poor abstract extraction
        • Property synonyms
        • Redirects
        • Missing class hierarchy
        • Range validation
    9. Property Synonyms
    10. Redirects
        • Florida located_in USA
        • California located_in United_States
        • USA redirects_to United_States
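The redirect problem on this slide can be illustrated with a small sketch: if redirects are resolved before triples are stored, both statements end up pointing at the same resource. The data structures and the `resolve` helper below are hypothetical and are not taken from the DBpedia code base.

```python
# Hypothetical in-memory data; names follow the slide, not the real framework.
redirects = {"USA": "United_States"}
triples = [
    ("Florida", "located_in", "USA"),
    ("California", "located_in", "United_States"),
]


def resolve(name, redirects):
    """Follow redirect chains to the canonical name, guarding against cycles."""
    seen = set()
    while name in redirects and name not in seen:
        seen.add(name)
        name = redirects[name]
    return name


# Rewrite subject and object of every triple to its canonical resource.
canonical = [
    (resolve(s, redirects), p, resolve(o, redirects)) for s, p, o in triples
]
```

After this step a query for everything located in `United_States` would find both Florida and California.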
    11. Class Hierarchy
        • “Select all PEOPLE born in …”
    12. Range Validation
        • dbpedia:Google
        • keyperson Eric Schmidt
        • keyperson Sergey Brin
        • keyperson Larry Page
        • keyperson CEO
        • keyperson Chairman
    13. Range Validation
    14. Technical Overview
    15. And how does it work?
        • Extraction Framework
        • (and a lot of regular expressions)
    16. Extraction Framework
        • Open Source
        • http://dbpedia.svn.sourceforge.net
        • Implemented in PHP
    17. Extraction Framework
        • Data Input (PageCollections)
        • DatabaseWikipedia
        • LiveWikipedia
    18. Extraction Framework
        • Data Processing (Extractors)
        • InfoboxExtractor
        • LabelExtractor
        • CategoryExtractor
        • RedirectExtractor
        • GeoExtractor
    19. Extraction Framework
        • Data Output (Destinations)
        • SimpleDumpDestination (stdout)
        • NTripleDumpDestination
    20. Extraction Framework
        • Tie things together
        • Extraction Manager
        • Extraction Jobs
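The three roles named on the preceding slides (page collections as input, extractors for processing, destinations for output, tied together by a job) might be sketched as follows. The real framework is written in PHP; this Python outline only mirrors the division of labour, and every class body, method name and the `run_job` function are assumptions made for illustration.

```python
class ListPageCollection:
    """Stand-in for a page source such as DatabaseWikipedia or LiveWikipedia."""

    def __init__(self, pages):
        self.pages = dict(pages)  # title -> wiki source text

    def titles(self):
        return list(self.pages)

    def source(self, title):
        return self.pages[title]


class LabelExtractor:
    """Toy extractor: derives a label triple from the page title only."""

    def extract(self, title, source):
        yield (title, "rdfs:label", title.replace("_", " "))


class ListDestination:
    """Collects triples in memory instead of writing a dump."""

    def __init__(self):
        self.triples = []

    def accept(self, triple):
        self.triples.append(triple)


def run_job(collection, extractors, destination):
    """Feed every page through every extractor and hand triples to the sink."""
    for title in collection.titles():
        source = collection.source(title)
        for extractor in extractors:
            for triple in extractor.extract(title, source):
                destination.accept(triple)


dest = ListDestination()
pages = ListPageCollection({"Hewlett-Packard": "...", "Palo_Alto": "..."})
run_job(pages, [LabelExtractor()], dest)
```

Keeping input, processing and output behind separate interfaces is what allows the same extractor to run against either a database dump or a live page source.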
    21. DBpedia Dataset
        • Provided as RDF dumps
        • Updated every 3 months
        • Hosted by Openlink Software
        • Available as Linked Data
    22. SPARQL Endpoint
        • http://dbpedia.org/sparql
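As a rough illustration of how the endpoint could be addressed, the sketch below builds a GET request URL for a query about the Hewlett-Packard resource shown earlier. The `query` parameter comes from the SPARQL protocol; the `format` parameter and its value are an assumption about the server and may need adjusting. The helper name `build_query_url` is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"


def build_query_url(sparql, endpoint=ENDPOINT):
    """Return a GET URL carrying the query; 'format' is an assumed parameter."""
    params = {"query": sparql, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Hewlett-Packard> rdfs:label ?label .
}
"""

url = build_query_url(query)
```

The resulting URL can be fetched with any HTTP client.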
    23. Linked Data
        • Use URIs as names for things.
        • Use HTTP URIs so that people can look up those names.
        • When someone looks up a URI, provide useful information.
        • Include links to other URIs, so that they can discover more things.
    24. HTTP URIs
        • Information Resources: http://dbpedia.org/page/Bristol, HTTP GET -> 200 OK
        • Non-Information Resources: http://dbpedia.org/resource/Bristol, HTTP GET -> 303 See Other, pointing to http://dbpedia.org/page/Bristol or http://dbpedia.org/data/Bristol -> 200 OK
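The lookup behaviour described on this slide can be modelled as a small pure function. It is a simplified sketch, not the server's implementation: the function name `dereference` is hypothetical, and the rule that the Accept header chooses between the `/page/` and `/data/` documents is an assumption about how the 303 target is selected.

```python
BASE = "http://dbpedia.org"


def dereference(uri, accept="text/html"):
    """Return (status, redirect_location) for a simulated HTTP GET."""
    resource_prefix = BASE + "/resource/"
    if uri.startswith(resource_prefix):
        name = uri[len(resource_prefix):]
        # Assumed rule: RDF clients get the data document, others the HTML page.
        target = "/data/" if "application/rdf+xml" in accept else "/page/"
        return 303, BASE + target + name
    if uri.startswith(BASE + "/page/") or uri.startswith(BASE + "/data/"):
        return 200, None  # information resources are served directly
    return 404, None
```

For example, `dereference("http://dbpedia.org/resource/Bristol")` yields a 303 pointing at the `/page/Bristol` document, while the same call with `accept="application/rdf+xml"` points at `/data/Bristol`.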
    25. How to get started
        • Documentation: http://wiki.dbpedia.org/Documentation
        • Source Code
        • start.php
    26. Next Tasks
        • Improve Extractors
        • Cleaner Abstracts
        • Include Redirects in the Extraction Process
        • Fix more Extraction Bugs
        • http://sourceforge.net/projects/dbpedia/
        • Provide Live Update Service
    27. Infobox Extraction
        • One script to rule them all
        • Not sufficient
    28. Next Challenges
    29. Next challenges
        • Higher Data Quality + Ontologies
        • Consistency Checks
        • Augmentation
        • Live Updates
    30. Live Updates
        • Wikipedia Update Stream
        • Extraction Cluster
        • Named Graphs
    31. Augmentation
        • Enrich DBpedia with data from:
        • 1. other languages
        • 2. external datasets
    32. Consistency Checks
        • German Wikipedia says Berlin's population is X
        • Italian Wikipedia says it's Y
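One way such a cross-language check could look is sketched below: values for the same property are collected per language edition and flagged when they differ by more than a relative tolerance. The function name, the data layout and the numbers are placeholders invented for the example; the figures are not real statistics.

```python
def find_conflicts(claims, tolerance=0.01):
    """Return properties whose per-language values differ beyond the tolerance.

    claims maps a property key to {language: numeric value}.
    """
    conflicts = {}
    for prop, by_lang in claims.items():
        if not by_lang:
            continue
        values = list(by_lang.values())
        lo, hi = min(values), max(values)
        if hi - lo > tolerance * abs(hi):
            conflicts[prop] = dict(by_lang)
    return conflicts


# Placeholder values standing in for X and Y on the slide.
claims = {
    "Berlin/population": {"de": 3_400_000, "it": 3_300_000},
    "Berlin/elevation_m": {"de": 34, "it": 34},
}

conflicts = find_conflicts(claims)
```

A check like this can only flag a disagreement; deciding which edition is right remains a separate problem.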
    33. Data Quality
        • We need humans
    34. The Vision
    35. Semantic Web
        • Users shouldn’t care
    36. Semantic Web
        • Users shouldn’t have to care
        • (del.icio.us lesson)
    37. DBpedia Extraction: Wikipedia -> DBpedia Extraction Framework -> Triple Store
    38. Freebase Extraction: Wikipedia -> Extraction -> Metaweb Graph Store
    39. What is the Wikipedia for Data?
    40. Wikipedia is the Wikipedia for Data
    41. Crowd Sourced Extraction
        • Where's the user benefit?
    42. Users
        • Mashup Developer
    43. Benefit
    44. Outlook
    45. Infobox Extraction
        • We need a new approach
        • Break it down into smaller pieces
    46. Step 1: Create an ontology
        • Five domains: people, places, organisations, events, works
    47. People
        • Actor
        • Athlete
        • Journalist
        • MusicalArtist
        • Politician
        • Scientist
        • Writer
    48. Places
        • Airport
        • City
        • Country
        • Island
        • Mountain
        • River
    49. Organisations
        • Band
        • Company
        • Educational Institution
        • Radio Station
        • Sports Team
    50. Event
        • Convention
        • Military Conflict
        • Music Event
        • Sport Event
    51. Work
        • Book
        • Broadcast
        • Film
        • Software
        • Television
    52. Step 2: Template Mapping
        • Infobox Cricketer
        • Infobox Historic Cricketer
        • Infobox Recent Cricketer
        • Infobox Old Cricketer
        • Infobox Cricketer Biography
        • => Class Cricketer (Athlete)
    53. Step 2: Template Mapping
        • Class TV Episode (Work)
        • Wikipedia Templates:
        • Television Episode
        • UK Office Episode
        • Simpsons Episode
        • DoctorWhoBox
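The template mapping from the two slides above can be pictured as a lookup table from template names to ontology classes, plus a parent table for the class hierarchy. The sketch below uses the template and class names from the slides; the identifier spellings (`TVEpisode`), the `Person` parent of `Athlete` and the helper `classes_for` are assumptions.

```python
# Many Wikipedia templates collapse onto one ontology class.
TEMPLATE_TO_CLASS = {
    "Infobox Cricketer": "Cricketer",
    "Infobox Historic Cricketer": "Cricketer",
    "Infobox Recent Cricketer": "Cricketer",
    "Infobox Old Cricketer": "Cricketer",
    "Infobox Cricketer Biography": "Cricketer",
    "Television Episode": "TVEpisode",
    "UK Office Episode": "TVEpisode",
    "Simpsons Episode": "TVEpisode",
    "DoctorWhoBox": "TVEpisode",
}

# Assumed hierarchy: each class points at its parent class.
SUPERCLASS = {"Cricketer": "Athlete", "Athlete": "Person", "TVEpisode": "Work"}


def classes_for(template):
    """Return the mapped class followed by all of its ancestors."""
    chain = []
    cls = TEMPLATE_TO_CLASS.get(template)
    while cls is not None:
        chain.append(cls)
        cls = SUPERCLASS.get(cls)
    return chain
```

With the ancestors attached to each resource, a query such as “select all people born in …” can match cricketers without listing every infobox variant.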
    54. Step 3: Parsers
        • Handle template values specifically
        • Example: Property splitting
        • Person born “1.1.1980, [[Berlin]]”
        • => split into birthplace Berlin and birthdate 1980-01-01
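A minimal sketch of the property splitting on this slide, assuming the date is written day.month.year and the place is the first wiki link in the value. The function name `split_born` and both regular expressions are illustrative, not the framework's own parser.

```python
import re
from datetime import date

# [[Target]] or [[Target|display text]]; group 1 is the link target.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")
# Assumed day.month.year notation, as in the slide's example.
DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def split_born(value):
    """Split a combined 'born' value into birthdate and birthplace."""
    result = {}
    match = DATE.search(value)
    if match:
        day, month, year = map(int, match.groups())
        try:
            result["birthdate"] = date(year, month, day).isoformat()
        except ValueError:
            pass  # impossible calendar date: leave birthdate unset
    link = LINK.search(value)
    if link:
        result["birthplace"] = link.group(1).strip()
    return result
```

Calling `split_born("1.1.1980, [[Berlin]]")` gives a birthdate of `1980-01-01` and a birthplace of `Berlin`, matching the slide.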
    55. Step 3: Parsers
        • Example: Class Rules
        • MusicalArtist
        • If property “currentMembers” is set => Group
        • Otherwise => Person
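The class rule on this slide amounts to a single conditional. In the sketch below, "is set" is interpreted as "present and non-empty", which is an assumption; the function name is hypothetical.

```python
def classify_musical_artist(properties):
    """Apply the slide's rule to a dict of infobox properties."""
    return "Group" if properties.get("currentMembers") else "Person"
```

An infobox with a `currentMembers` entry would be typed as a Group, and one without it as a Person.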
    56. Step 3: Parsers
        • Example: Range Validation
        • Google keypeople: “[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]”
        • Company#keyperson range Person#Class
        • Google keyperson: Eric Schmidt, Sergey Brin, Larry Page
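Range validation as outlined here could be sketched like this: every link in the property value is a candidate, and only candidates whose known class matches the property's declared range are kept. The `known_classes` table, the `Occupation` class name and the function `validate_range` are hypothetical stand-ins for whatever type information the extractor has available.

```python
import re

# [[Target]] or [[Target|display text]]; group 1 is the link target.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def validate_range(value, known_classes, expected):
    """Keep only link targets whose class set contains the expected range."""
    targets = [m.group(1).strip() for m in LINK.finditer(value)]
    return [t for t in targets if expected in known_classes.get(t, ())]


# Assumed type information for the linked resources.
known_classes = {
    "Eric Schmidt": {"Person"},
    "Sergey Brin": {"Person"},
    "Larry Page": {"Person"},
    "CEO": {"Occupation"},
    "Chairman": {"Occupation"},
}

keypeople = "[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]"
valid = validate_range(keypeople, known_classes, "Person")
```

The links to CEO and Chairman are dropped because they are not typed as Person, which leaves the three people named on the slide.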
    57. Step 4: Crowd Source it
    58. Step 4: Crowd Source it
    59. Linking Framework
    60. Interlinking Framework
    61. Interlinking Framework
    62. “Apple”
    63. Apple, Google, Microsoft
    64. Apple, Orange, Pear
    65. Orange, Vodafone, T-Mobile
    66. Context, Similarity
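The “Apple” slides suggest that the neighbouring terms decide which resource an ambiguous name refers to. A minimal sketch of that idea follows, using Jaccard overlap between the context and a set of terms associated with each candidate. The slides only name "Context" and "Similarity"; the choice of Jaccard, the candidate identifiers and the associated term sets are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical candidates for the name "Apple" with illustrative term sets.
CANDIDATES = {
    "Apple_Inc.": {"Google", "Microsoft", "company", "computer"},
    "Apple_(fruit)": {"Orange", "Pear", "fruit", "tree"},
}


def disambiguate(context, candidates=CANDIDATES):
    """Pick the candidate whose term set is most similar to the context."""
    return max(candidates, key=lambda name: jaccard(context, candidates[name]))
```

With the context Google and Microsoft the company is chosen; with Orange and Pear the fruit is chosen. The Orange, Vodafone, T-Mobile slide shows that the context terms themselves can be ambiguous, which this simple overlap measure does not address.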
    67. Linking: The Future
        • Hosted web service for Linked Data publishers
    68. Summary
    69. Contact
        • http://dbpedia.org
        • Georgi Kobilarov
        • Freie Universität Berlin