DBpedia Framework - BBC Talk
Published in: Technology, Travel

  • Transcript

    • 1. Georgi Kobilarov, Chris Bizer, Christian Becker (Freie Universität Berlin)
    • 2. Hello again
      • Georgi Kobilarov
      • Researcher at Freie Universität Berlin
      • DBpedia Development Lead
    • 3. Agenda
      • Status Quo
      • Technical Overview
      • Challenges
      • Outlook
    • 4.
      • How to extract Wikipedia data
      • and how to not do it
    • 5.
      • Lessons learned
    • 6.
      • Title
      • Description
      • Languages
      • Web Links
      • Categorization
      • Domain specific Data
      • Images
      • Infoboxes
    • 7.  
    • 8.
      • <http://dbpedia.org/resource/Hewlett-Packard>
      • rdfs:label “Hewlett-Packard”
      • p:foundation dbpedia:Palo_Alto
      • p:keypeople dbpedia:Bill_Hewlett
      • p:keypeople dbpedia:David_Packard
      • p:keypeople dbpedia:Mark_V._Hurd
      • p:industry dbpedia:Computer_Systems
      • p:industry dbpedia:Computer_software
      • p:revenue 104300000000 $
      • p:netincome 7300000000 $
      • p:employees 156000
      • p:slogan “Invent”
    • 9. Problems
      • Poor Abstract extraction
      • Property synonyms
      • Redirects
      • Missing class hierarchy
      • Range validation
    • 10. Property Synonyms
    • 11. Redirects
      • Florida located_in USA
      • California located_in United_States
      • USA redirects_to United_States
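The redirect problem on this slide can be read as a canonicalisation step: follow `redirects_to` links until a real page is reached, then rewrite every triple to use that target. The following is a minimal Python sketch under that assumption; the function names `resolve` and `canonicalise` are hypothetical and not part of the DBpedia framework.

```python
def resolve(name, redirects):
    """Follow redirect links until a page without a redirect is reached."""
    seen = set()
    while name in redirects and name not in seen:
        seen.add(name)  # guard against redirect loops
        name = redirects[name]
    return name


def canonicalise(triples, redirects):
    """Rewrite subject and object of each triple to its redirect target."""
    return [(resolve(s, redirects), p, resolve(o, redirects)) for s, p, o in triples]


# Data taken from the slide: USA is a redirect to United_States.
redirects = {"USA": "United_States"}
triples = [
    ("Florida", "located_in", "USA"),
    ("California", "located_in", "United_States"),
]
canonical = canonicalise(triples, redirects)
```

After this pass both Florida and California point at the same `United_States` resource, so a query for places located in the United States would find both.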
    • 12. Class Hierarchy
      • "Select all PEOPLE born in …"
    • 13. Range Validation
      • dbpedia:Google
      • keyperson Eric Schmidt
      • keyperson Sergey Brin
      • keyperson Larry Page
      • keyperson CEO
      • keyperson Chairman
    • 14. Range Validation
    • 15.
      • Technical Overview
    • 16. And how does it work?
      • Extraction Framework
      • (and a lot of regular expressions)
    • 17. Extraction Framework
      • Open Source
      • http://dbpedia.svn.sourceforge.net
      • implemented in PHP
    • 18. Extraction Framework
      • Data Input (PageCollections)
      • DatabaseWikipedia
      • LiveWikipedia
    • 19. Extraction Framework
      • Data Processing (Extractors)
      • InfoboxExtractor
      • LabelExtractor
      • CategoryExtractor
      • RedirectExtractor
      • GeoExtractor
    • 20. Extraction Framework
      • Data Output (Destinations)
      • SimpleDumpDestination (stdout)
      • NTripleDumpDestination
    • 21. Extraction Framework
      • Tie things together
      • Extraction Manager
      • Extraction Jobs
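The three-part design from the preceding slides (page collections as input, extractors as processing, destinations as output, tied together by a job) can be sketched as follows. The actual framework is written in PHP; this Python version is only an illustration, and every class, method and function name in it is an assumption rather than the framework's API.

```python
import re


class InMemoryPageCollection:
    """Hypothetical stand-in for a page collection: maps title to wikitext."""

    def __init__(self, pages):
        self.pages = pages

    def get_source(self, title):
        return self.pages[title]


class LabelExtractor:
    """Emits one rdfs:label triple per page, derived from the title."""

    def extract(self, title, source):
        return [(title, "rdfs:label", title.replace("_", " "))]


class RedirectExtractor:
    """Emits a redirect triple when the page source starts with #REDIRECT."""

    PATTERN = re.compile(r"#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

    def extract(self, title, source):
        match = self.PATTERN.match(source.strip())
        if not match:
            return []
        target = match.group(1).strip().replace(" ", "_")
        return [(title, "redirects_to", target)]


class ListDestination:
    """Collects triples in memory; a dump destination would write them out."""

    def __init__(self):
        self.triples = []

    def accept(self, triples):
        self.triples.extend(triples)


def run_job(collection, titles, extractors, destination):
    """Feed every page through every extractor into the destination."""
    for title in titles:
        source = collection.get_source(title)
        for extractor in extractors:
            destination.accept(extractor.extract(title, source))


pages = InMemoryPageCollection({
    "USA": "#REDIRECT [[United States]]",
    "United_States": "The United States is a country ...",
})
out = ListDestination()
run_job(pages, ["USA", "United_States"], [LabelExtractor(), RedirectExtractor()], out)
```

Keeping input, processing and output behind small interfaces means a new extractor or a new output format can be added without touching the other two parts.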
    • 22. DBpedia Dataset
      • Provided as RDF Dumps
      • Updated every 3 months
      • Hosted by OpenLink Software
      • Available as Linked Data
    • 23. SPARQL Endpoint
      • http://dbpedia.org/sparql
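As a rough illustration of how a client might address this endpoint, the sketch below builds a request URL using the standard SPARQL protocol `query` parameter. It only constructs the URL and does not send a request; the helper name `build_query_url` and the example query are assumptions, with `rdfs:label` taken from the Hewlett-Packard slide.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"


def build_query_url(resource_uri, limit=10):
    """Build a GET URL that asks the endpoint for the labels of one resource."""
    query = (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label WHERE { <%s> rdfs:label ?label } LIMIT %d"
        % (resource_uri, limit)
    )
    # urlencode percent-escapes the query text so it is safe inside a URL.
    return ENDPOINT + "?" + urlencode({"query": query})


url = build_query_url("http://dbpedia.org/resource/Hewlett-Packard")
```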
    • 24. Linked Data
      • Use URIs as names for things
      • Use HTTP URIs so that people can look up those names.
      • When someone looks up a URI, provide useful information.
      • Include links to other URIs, so that they can discover more things.
    • 25. HTTP URIs
      • Information Resources
      • http://dbpedia.org/page/Bristol
      • HTTP GET -> 200 OK
      • Non-Information Resources
      • http://dbpedia.org/resource/Bristol
      • HTTP GET -> 303 See Other
      • http://dbpedia.org/page/Bristol
      • http://dbpedia.org/data/Bristol -> 200 OK
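The 303 behaviour on this slide can be modelled as a small routing function. This is a sketch only: it simulates the responses instead of running a server, and choosing between the `page` and `data` documents by looking for "rdf" in the Accept header is an assumption about how the redirect target is picked.

```python
BASE = "http://dbpedia.org"


def respond(path, accept="text/html"):
    """Return (status, location) for a request path; location is None on 200."""
    kind, _, name = path.lstrip("/").partition("/")
    if kind == "resource":
        # Non-information resource: redirect to a document that describes it.
        target = "data" if "rdf" in accept else "page"
        return 303, "%s/%s/%s" % (BASE, target, name)
    if kind in ("page", "data"):
        # Information resource: served directly.
        return 200, None
    return 404, None
```

For example, `respond("/resource/Bristol")` yields a 303 pointing at the HTML page, while the same path with `accept="application/rdf+xml"` yields a 303 pointing at the data document.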
    • 26. How to get started
      • Documentation http://wiki.dbpedia.org/Documentation
      • Source Code
      • start.php
    • 27. Next Tasks
      • Improve Extractors
      • Cleaner Abstracts
      • Include Redirects into Extraction Process
      • Fix more Extraction Bugs
      • http://sourceforge.net/projects/dbpedia/
      • Provide Live Update Service
    • 28. Infobox Extraction
      • One script to rule them all
      • Not sufficient
    • 29.
      • Next Challenges
    • 30. Next challenges
      • Higher Data Quality + Ontologies
      • Consistency Checks
      • Augmentation
      • Live Updates
    • 31. Live Updates
      • Wikipedia Update Stream
      • Extraction Cluster
      • Named Graphs
    • 32. Augmentation
      • Enrich DBpedia with data from:
      • 1. other languages
      • 2. external datasets
    • 33. Consistency Checks
      • German Wikipedia says Berlin's population is X
      • Italian Wikipedia says it's Y
    • 34. Data Quality
      • We need humans
    • 35.
      • The Vision
    • 36. Semantic Web
      • Users shouldn’t care
    • 37. Semantic Web
      • Users shouldn’t have to care
      • (del.icio.us lesson)
    • 38. DBpedia Extraction
      • Wikipedia
      • DBpedia Extraction Framework
      • Triple Store
    • 39. Freebase Extraction
      • Wikipedia
      • Extraction
      • Metaweb Graph Store
    • 40.
      • What is the
      • Wikipedia for Data?
    • 41.
      • Wikipedia is the
      • Wikipedia for Data
    • 42.  
    • 43. Crowd Sourced Extraction
      • Where's the user benefit?
    • 44. Users
      • Mashup Developer
    • 45.
      • Benefit
    • 46.
      • Outlook
    • 47. Infobox Extraction
      • We need a new approach
      • Break it down into smaller pieces
    • 48. Step 1: Create an ontology
      • Five domains:
      • people, places, organisations,
      • events, works
    • 49. People
      • Actor
      • Athlete
      • Journalist
      • MusicalArtist
      • Politician
      • Scientist
      • Writer
    • 50. Places
      • Airport
      • City
      • Country
      • Island
      • Mountain
      • River
    • 51. Organisations
      • Band
      • Company
      • Educational Institution
      • Radio Station
      • Sports Team
    • 52. Event
      • Convention
      • Military Conflict
      • Music Event
      • Sport Event
    • 53. Work
      • Book
      • Broadcast
      • Film
      • Software
      • Television
    • 54. Step 2: Template Mapping
      • Infobox Cricketer
      • Infobox Historic Cricketer
      • Infobox Recent Cricketer
      • Infobox Old Cricketer
      • Infobox Cricketer Biography
      • => Class Cricketer (Athlete)
    • 55. Step 2: Template Mapping
      • Class TV Episode (Work)
      • Wikipedia Templates:
      • Television Episode
      • UK Office Episode
      • Simpsons Episode
      • DoctorWhoBox
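The template mapping step can be pictured as a lookup table from Wikipedia template names to ontology classes, plus a second table for the class hierarchy. The sketch below uses only the templates and classes named on the two slides above; the table and function names are hypothetical.

```python
# Several templates map to the same class (names taken from the slides).
TEMPLATE_TO_CLASS = {
    "Infobox Cricketer": "Cricketer",
    "Infobox Historic Cricketer": "Cricketer",
    "Infobox Recent Cricketer": "Cricketer",
    "Infobox Old Cricketer": "Cricketer",
    "Infobox Cricketer Biography": "Cricketer",
    "Television Episode": "TV Episode",
    "UK Office Episode": "TV Episode",
    "Simpsons Episode": "TV Episode",
    "DoctorWhoBox": "TV Episode",
}

# Each class points at its parent class.
SUPERCLASS = {"Cricketer": "Athlete", "TV Episode": "Work"}


def classes_for(template):
    """Return the mapped class followed by its ancestors, or [] if unmapped."""
    chain = []
    cls = TEMPLATE_TO_CLASS.get(template)
    while cls is not None:
        chain.append(cls)
        cls = SUPERCLASS.get(cls)
    return chain
```

With a table like this, a query for all athletes would also return pages that use any of the five cricketer templates.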
    • 56. Step 3: Parsers
      • Handle template values specifically
      • Example: Property splitting
      • Person born "1.1.1980, [[Berlin]]"
      • => split to birthplace Berlin
      • birthdate 1980-01-01
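The property-splitting example can be sketched with two regular expressions, one for the date and one for the first wiki link. This assumes a day.month.year date format, which the slide's "1.1.1980" does not disambiguate; the function name `split_born` is hypothetical.

```python
import re
from datetime import date


def split_born(value):
    """Split a 'born' value into an ISO birth date and a linked birthplace."""
    result = {}
    found = re.search(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b", value)
    if found:
        # Assumption: the order is day, month, year.
        day, month, year = (int(group) for group in found.groups())
        result["birthdate"] = date(year, month, day).isoformat()
    place = re.search(r"\[\[([^\]|]+)", value)
    if place:
        # Take the link target, ignoring any display text after "|".
        result["birthplace"] = place.group(1).strip()
    return result
```

A value with an impossible date such as "31.2.1980" would raise `ValueError` from `date(...)`; a real parser would need to decide how to handle that case.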
    • 57. Step 3: Parsers
      • Example: Class Rules
      • MusicalArtist
      • If property "currentMembers" is set
      • => Group
      • Otherwise
      • => Person
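The class rule above amounts to a one-line decision. In this sketch an empty or missing `currentMembers` value counts as "not set", which is an assumption; the function name is hypothetical.

```python
def classify_musical_artist(properties):
    """Apply the slide's rule: a MusicalArtist with current members is a Group."""
    return "Group" if properties.get("currentMembers") else "Person"
```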
    • 58. Step 3: Parsers
      • Example: Range Validation
      • Google keypeople
      • "[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]"
      • Company#keyperson range Person#Class
      • Google keyperson Eric Schmidt
      • Sergey Brin
      • Larry Page
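Range validation as described here can be sketched as: pull every wiki link out of the raw value, then keep only those whose known type matches the property's range. The `types` lookup table below is a hypothetical stand-in for whatever class information the extraction has available, and the "Occupation" label is made up for illustration.

```python
import re

# Matches [[Target]] and [[Target|display text]], capturing the target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def validate_range(value, types, expected):
    """Keep only linked resources whose type set contains the expected class."""
    links = [target.strip() for target in WIKILINK.findall(value)]
    return [link for link in links if expected in types.get(link, ())]


# Hypothetical type information; "Occupation" is an illustrative label.
types = {
    "Eric Schmidt": {"Person"},
    "Sergey Brin": {"Person"},
    "Larry Page": {"Person"},
    "CEO": {"Occupation"},
    "Chairman": {"Occupation"},
}
value = "[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]"
keypeople = validate_range(value, types, "Person")
```

The links to CEO and Chairman are dropped because they are not of type Person, leaving only the three people as key persons.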
    • 59. Step 4: Crowd Source it
    • 60. Step 4: Crowd Source it
    • 61.
      • Linking Framework
    • 62. Interlinking Framework
    • 63. Interlinking Framework
    • 64.
      • "Apple"
    • 65.
      • Apple
      • Google
      • Microsoft
    • 66.
      • Apple
      • Orange
      • Pear
    • 67.
      • Orange
      • Vodafone
      • T-Mobile
    • 68.
      • Context
      • Similarity
    • 69. Linking: The Future
      • Hosted Webservice
      • for Linked Data publishers
    • 70. Summary
    • 71.
      • http://dbpedia.org
      • Georgi Kobilarov
      • Freie Universität Berlin
