DBpedia Framework - BBC Talk
Published in: Technology, Travel
  • Transcript

    • 1. Georgi Kobilarov, Chris Bizer, Christian Becker, Freie Universität Berlin
    • 2. Hello again
        • Georgi Kobilarov
        • Researcher at Freie Universität Berlin
        • DBpedia Development Lead
    • 3. Agenda
        • Status Quo
        • Technical Overview
        • Challenges
        • Outlook
    • 4. How to extract Wikipedia data, and how to not do it
    • 5. Lessons learned
    • 6. Title, Description, Languages, Web Links, Categorization, Domain-specific Data, Images, Infoboxes
    • 7.  
    • 8. <http://dbpedia.org/resource/Hewlett-Packard>
        • rdfs:label “Hewlett-Packard”
        • p:foundation dbpedia:Palo_Alto
        • p:keypeople dbpedia:Bill_Hewlett
        • p:keypeople dbpedia:David_Packard
        • p:keypeople dbpedia:Mark_V._Hurd
        • p:industry dbpedia:Computer_Systems
        • p:industry dbpedia:Computer_software
        • p:revenue 104300000000 $
        • p:netincome 7300000000 $
        • p:employees 156000
        • p:slogan “Invent”
    • 9. Problems
        • Poor abstract extraction
        • Property synonyms
        • Redirects
        • Missing class hierarchy
        • Range validation
    • 10. Property Synonyms
    • 11. Redirects
        • Florida located_in USA
        • California located_in United_States
        • USA redirects_to United_States
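The redirect problem on slide 11 can be illustrated with a minimal Python sketch. It assumes redirects are available as a plain mapping from a redirecting title to its target; the function names `resolve` and `canonicalize` are hypothetical and are not part of the DBpedia framework. Only the object of each triple is rewritten here.

```python
def resolve(resource, redirects):
    """Follow redirects_to links until a resource with no redirect is reached."""
    seen = set()
    while resource in redirects and resource not in seen:
        seen.add(resource)  # guard against redirect cycles
        resource = redirects[resource]
    return resource


def canonicalize(triples, redirects):
    """Rewrite the object of every triple to its redirect target."""
    return [(s, p, resolve(o, redirects)) for s, p, o in triples]


# Data taken from the slide.
redirects = {"USA": "United_States"}
triples = [
    ("Florida", "located_in", "USA"),
    ("California", "located_in", "United_States"),
]
canonical = canonicalize(triples, redirects)
```

After this step both Florida and California point at the same resource, so a query for everything located in United_States finds both.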
    • 12. Class Hierarchy
        • “Select all PEOPLE born in …”
    • 13. Range Validation
        • dbpedia:Google
        • keyperson Eric Schmidt
        • keyperson Sergey Brin
        • keyperson Larry Page
        • keyperson CEO
        • keyperson Chairman
    • 14. Range Validation
    • 15. Technical Overview
    • 16. And how does it work?
        • Extraction Framework
        • (and a lot of regular expressions)
    • 17. Extraction Framework
        • Open Source
        • http://dbpedia.svn.sourceforge.net
        • Implemented in PHP
    • 18. Extraction Framework
        • Data Input (PageCollections)
        • DatabaseWikipedia
        • LiveWikipedia
    • 19. Extraction Framework
        • Data Processing (Extractors)
        • InfoboxExtractor
        • LabelExtractor
        • CategoryExtractor
        • RedirectExtractor
        • GeoExtractor
    • 20. Extraction Framework
        • Data Output (Destinations)
        • SimpleDumpDestination (stdout)
        • NTripleDumpDestination
    • 21. Extraction Framework
        • Tie things together
        • Extraction Manager
        • Extraction Jobs
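How the pieces on slides 18 to 21 might fit together can be sketched as follows. This is an illustration in Python, not the framework's actual PHP code: the class names echo the slides, but every method name and signature here (`get_source`, `extract`, `accept`, `execute`) is an assumption, and `DictPageCollection` and `ListDestination` are hypothetical stand-ins for the real inputs and outputs.

```python
class DictPageCollection:
    """Stand-in for a page source such as DatabaseWikipedia or LiveWikipedia."""

    def __init__(self, pages):
        self.pages = pages  # maps page title -> wiki source text

    def get_source(self, title):
        return self.pages[title]


class LabelExtractor:
    """Emits one rdfs:label triple per page (simplified, hypothetical logic)."""

    def extract(self, title, source):
        return [(title, "rdfs:label", title.replace("_", " "))]


class ListDestination:
    """Collects triples in memory instead of writing a dump file."""

    def __init__(self):
        self.triples = []

    def accept(self, triple):
        self.triples.append(triple)


class ExtractionJob:
    """Bundles one input, a list of pages, some extractors and one output."""

    def __init__(self, collection, titles, extractors, destination):
        self.collection = collection
        self.titles = titles
        self.extractors = extractors
        self.destination = destination


class ExtractionManager:
    """Runs every extractor of a job over every page of that job."""

    def execute(self, job):
        for title in job.titles:
            source = job.collection.get_source(title)
            for extractor in job.extractors:
                for triple in extractor.extract(title, source):
                    job.destination.accept(triple)


destination = ListDestination()
job = ExtractionJob(
    DictPageCollection({"Palo_Alto": "(wiki source omitted)"}),
    ["Palo_Alto"],
    [LabelExtractor()],
    destination,
)
ExtractionManager().execute(job)
```

Keeping input, processing and output behind separate interfaces is what would let the same extractors run against a database dump or the live site.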
    • 22. DBpedia Dataset
        • Provided as RDF dumps
        • Updated every 3 months
        • Hosted by OpenLink Software
        • Available as Linked Data
    • 23. SPARQL Endpoint
        • http://dbpedia.org/sparql
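As a rough illustration of using the endpoint on slide 23, the sketch below builds a GET request URL that carries a query in the standard `query` parameter of the SPARQL protocol. The query reuses the resource and the rdfs:label property from slide 8; the helper name `build_request_url` is hypothetical, and nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ENDPOINT = "http://dbpedia.org/sparql"

# Ask for the label of the Hewlett-Packard resource shown on slide 8.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Hewlett-Packard> rdfs:label ?label .
}
"""


def build_request_url(endpoint, query):
    """Encode the query text into the endpoint's query string."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
```

The resulting URL could then be fetched with any HTTP client, for example `urllib.request.urlopen(url)`.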
    • 24. Linked Data
        • Use URIs as names for things.
        • Use HTTP URIs so that people can look up those names.
        • When someone looks up a URI, provide useful information.
        • Include links to other URIs, so that they can discover more things.
    • 25. HTTP URIs
        • Information Resources: http://dbpedia.org/page/Bristol, HTTP GET -> 200 OK
        • Non-Information Resources: http://dbpedia.org/resource/Bristol, HTTP GET -> 303 See Other, pointing to http://dbpedia.org/page/Bristol or http://dbpedia.org/data/Bristol -> 200 OK
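The URI scheme on slide 25 can be modelled as a small pure function. This is a sketch only: the slide lists two redirect targets but does not say how one is chosen, so picking between them by the `Accept` header is an assumption, as is the function name `respond`.

```python
RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def respond(uri, accept="text/html"):
    """Return (status, location) for a GET request on a DBpedia-style URI."""
    if uri.startswith(RESOURCE_PREFIX):
        # Non-information resource: answer with 303 See Other.
        name = uri[len(RESOURCE_PREFIX):]
        # Assumption: RDF clients are sent to /data/, everyone else to /page/.
        kind = "data" if "application/rdf+xml" in accept else "page"
        return 303, "http://dbpedia.org/" + kind + "/" + name
    # Information resource: served directly.
    return 200, None
```

For example, `respond("http://dbpedia.org/resource/Bristol")` yields a 303 pointing at the `/page/Bristol` document, while the page URI itself yields 200.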
    • 26. How to get started
        • Documentation: http://wiki.dbpedia.org/Documentation
        • Source Code
        • start.php
    • 27. Next Tasks
        • Improve Extractors
        • Cleaner Abstracts
        • Include Redirects into Extraction Process
        • Fix more Extraction Bugs
        • http://sourceforge.net/projects/dbpedia/
        • Provide Live Update Service
    • 28. Infobox Extraction
        • One script to rule them all
        • Not sufficient
    • 29. Next Challenges
    • 30. Next challenges
        • Higher Data Quality + Ontologies
        • Consistency Checks
        • Augmentation
        • Live Updates
    • 31. Live Updates
        • Wikipedia Update Stream
        • Extraction Cluster
        • Named Graphs
    • 32. Augmentation
        • Enrich DBpedia with data from:
        • 1. other languages
        • 2. external datasets
    • 33. Consistency Checks
        • German Wikipedia says Berlin's population is X
        • Italian Wikipedia says it's Y
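A consistency check like the one on slide 33 could start as simply as the sketch below. It assumes the values extracted from each language edition have already been collected per property; the data layout and the name `find_conflicts` are hypothetical, and X and Y are kept as the placeholders used on the slide.

```python
def find_conflicts(values_by_language):
    """Return the properties whose value differs between language editions."""
    conflicts = {}
    for prop, by_language in values_by_language.items():
        if len(set(by_language.values())) > 1:
            conflicts[prop] = by_language
    return conflicts


# Placeholder values X and Y as on the slide.
berlin = {"population": {"de": "X", "it": "Y"}}
conflicts = find_conflicts(berlin)
```

Deciding which of the conflicting values is right is a separate problem that this sketch does not attempt.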
    • 34. Data Quality
        • We need humans
    • 35. The Vision
    • 36. Semantic Web
        • Users shouldn’t care
    • 37. Semantic Web
        • Users shouldn’t have to care
        • (del.icio.us lesson)
    • 38. DBpedia Extraction: Wikipedia -> DBpedia Extraction Framework -> Triple Store
    • 39. Freebase Extraction: Wikipedia -> Extraction -> Metaweb Graph Store
    • 40. What is the Wikipedia for Data?
    • 41. Wikipedia is the Wikipedia for Data
    • 42.  
    • 43. Crowd Sourced Extraction
        • Where's the user benefit?
    • 44. Users
        • Mashup Developer
    • 45. Benefit
    • 46. Outlook
    • 47. Infobox Extraction
        • We need a new approach
        • Break it down into smaller pieces
    • 48. Step 1: Create an ontology
        • Five domains: people, places, organisations, events, works
    • 49. People
        • Actors
        • Athlete
        • Journalist
        • MusicalArtist
        • Politician
        • Scientist
        • Writer
    • 50. Places
        • Airport
        • City
        • Country
        • Island
        • Mountain
        • River
    • 51. Organisations
        • Band
        • Company
        • Educational Institution
        • Radio Station
        • Sports Team
    • 52. Event
        • Convention
        • Military Conflict
        • Music Event
        • Sport Event
    • 53. Work
        • Book
        • Broadcast
        • Film
        • Software
        • Television
    • 54. Step 2: Template Mapping
        • Infobox Cricketer
        • Infobox Historic Cricketer
        • Infobox Recent Cricketer
        • Infobox Old Cricketer
        • Infobox Cricketer Biography
        • => Class Cricketer (Athlete)
    • 55. Step 2: Template Mapping
        • Class TV Episode (Work)
        • Wikipedia Templates:
        • Television Episode
        • UK Office Episode
        • Simpsons Episode
        • DoctorWhoBox
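The template mapping on slides 54 and 55 amounts to a many-to-one lookup from Wikipedia template names to ontology classes. A minimal Python sketch, using only the names from the slides; the dictionary layout and the function name `classify` are assumptions, not the framework's mapping format.

```python
# Several infobox templates map onto one ontology class.
TEMPLATE_TO_CLASS = {
    "Infobox Cricketer": "Cricketer",
    "Infobox Historic Cricketer": "Cricketer",
    "Infobox Recent Cricketer": "Cricketer",
    "Infobox Old Cricketer": "Cricketer",
    "Infobox Cricketer Biography": "Cricketer",
    "Television Episode": "TV Episode",
    "UK Office Episode": "TV Episode",
    "Simpsons Episode": "TV Episode",
    "DoctorWhoBox": "TV Episode",
}

# Parent of each mapped class, as given in parentheses on the slides.
CLASS_PARENT = {"Cricketer": "Athlete", "TV Episode": "Work"}


def classify(template_name):
    """Return (class, parent class) for a template, or None if unmapped."""
    cls = TEMPLATE_TO_CLASS.get(template_name)
    if cls is None:
        return None
    return cls, CLASS_PARENT[cls]
```

Unmapped templates return `None`, which marks exactly the templates that still need a mapping.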
    • 56. Step 3: Parsers
        • Handle template values specifically
        • Example: Property splitting
        • Person born “1.1.1980, [[Berlin]]”
        • => split to birthplace Berlin, birthdate 1980-01-01
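The property-splitting example on slide 56 could be handled along these lines. This is a sketch under stated assumptions: dates are written day.month.year as in the slide's example, the first wiki link in the value is the birthplace, and the name `split_born` is hypothetical.

```python
import re
from datetime import date

# Day.month.year dates, as in the slide's example value.
DATE_RE = re.compile(r"(\d{1,2})\.(\d{1,2})\.(\d{4})")
# Wiki links of the form [[Target]] or [[Target|label]].
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def split_born(value):
    """Split a raw 'born' value into birthdate and birthplace."""
    result = {}
    match = DATE_RE.search(value)
    if match:
        day, month, year = (int(group) for group in match.groups())
        result["birthdate"] = date(year, month, day).isoformat()
    link = LINK_RE.search(value)
    if link:
        result["birthplace"] = link.group(1).strip().replace(" ", "_")
    return result


parsed = split_born("1.1.1980, [[Berlin]]")
```

Real infobox values come in many more formats, so a parser like this would need one pattern per date style it is meant to support.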
    • 57. Step 3: Parsers
        • Example: Class Rules
        • MusicalArtist
        • If property “currentMembers” is set => Group
        • Otherwise => Person
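The class rule on slide 57 reduces to a single conditional. In this sketch the infobox is assumed to be a dictionary of property values, "is set" is read as "present and non-empty", and the function name is hypothetical.

```python
def classify_musical_artist(properties):
    """Apply the slide's rule: currentMembers set => Group, otherwise Person."""
    if properties.get("currentMembers"):
        return "Group"
    return "Person"
```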
    • 58. Step 3: Parsers
        • Example: Range Validation
        • Google keypeople “[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]”
        • Company#keyperson range Person#Class
        • Google keyperson Eric Schmidt, Sergey Brin, Larry Page
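Range validation as described on slide 58 can be sketched as a filter over the links found in the raw value. The sketch assumes the class of each linked page is already known; the `types` table, the class name "Title" for CEO and Chairman, and the function name `validate_range` are illustrative assumptions.

```python
import re

# Wiki links of the form [[Target]] or [[Target|label]].
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def validate_range(raw_value, types, expected_class):
    """Keep only the link targets whose class matches the property's range."""
    targets = [target.strip() for target in LINK_RE.findall(raw_value)]
    return [target for target in targets if types.get(target) == expected_class]


# Assumed type table; "Title" is a made-up class for the non-person links.
types = {
    "Eric Schmidt": "Person",
    "Sergey Brin": "Person",
    "Larry Page": "Person",
    "CEO": "Title",
    "Chairman": "Title",
}
raw = "[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]"
key_people = validate_range(raw, types, "Person")
```

With the range of keyperson declared as Person, the CEO and Chairman links are dropped and only the three people remain.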
    • 59. Step 4: Crowd Source it
    • 60. Step 4: Crowd Source it
    • 61. Linking Framework
    • 62. Interlinking Framework
    • 63. Interlinking Framework
    • 64. “Apple”
    • 65. Apple, Google, Microsoft
    • 66. Apple, Orange, Pear
    • 67. Orange, Vodafone, T-Mobile
    • 68. Context, Similarity
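Slides 64 to 68 suggest resolving an ambiguous name such as "Apple" by comparing the terms around it with the terms associated with each candidate. The slides do not name a similarity measure, so the sketch below swaps in Jaccard overlap as one simple choice; the candidate identifiers and their context sets are hypothetical and built only from words on the slides.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disambiguate(context, candidates):
    """Pick the candidate whose known context best overlaps the given one."""
    return max(candidates, key=lambda name: jaccard(context, candidates[name]))


# Hypothetical candidates for the label "Apple".
candidates = {
    "apple_company": {"Google", "Microsoft"},
    "apple_fruit": {"Orange", "Pear"},
}
in_tech_list = disambiguate(["Google", "Microsoft"], candidates)
in_fruit_list = disambiguate(["Orange", "Pear"], candidates)
```

The same approach would apply to "Orange", which appears next to fruit on one slide and next to Vodafone and T-Mobile on another.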
    • 69. Linking: The Future
        • Hosted web service for Linked Data publishers
    • 70. Summary
    • 71. http://dbpedia.org, Georgi Kobilarov, Freie Universität Berlin
