DBpedia Framework - BBC Talk

Transcript

  • 1. Georgi Kobilarov, Chris Bizer, Christian Becker, Freie Universität Berlin
  • 2. Hello again
    • Georgi Kobilarov
    • Researcher at Freie Universität Berlin
    • DBpedia Development Lead
  • 3. Agenda
    • Status Quo
    • Technical Overview
    • Challenges
    • Outlook
  • 4.
    • How to extract Wikipedia data
    • and how to not do it
  • 5.
    • Lessons learned
  • 6.
    • Title
    • Description
    • Languages
    • Web Links
    • Categorization
    • Domain specific Data
    • Images
    • Infoboxes
  • 7.  
  • 8.
    • <http://dbpedia.org/resource/Hewlett-Packard>
    • rdfs:label “Hewlett-Packard”
    • p:foundation dbpedia:Palo_Alto
    • p:keypeople dbpedia:Bill_Hewlett
    • p:keypeople dbpedia:David_Packard
    • p:keypeople dbpedia:Mark_V._Hurd
    • p:industry dbpedia:Computer_Systems
    • p:industry dbpedia:Computer_software
    • p:revenue 104300000000 $
    • p:netincome 7300000000 $
    • p:employees 156000
    • p:slogan “Invent”
  • 9. Problems
    • Poor Abstract extraction
    • Property synonyms
    • Redirects
    • Missing class hierarchy
    • Range validation
  • 10. Property Synonyms
  • 11. Redirects
    • Florida located_in USA
    • California located_in United_States
    • USA redirects_to United_States
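The redirect problem above can be sketched as a canonicalization step: until redirects are followed, Florida and California appear to be located in two different things. This is a minimal Python illustration, not the framework's own code (which is PHP); the function names `resolve` and `canonicalize` and the tuple layout are assumptions made for the example.

```python
def resolve(name, redirects):
    """Follow redirects_to links until a name with no further redirect is reached."""
    seen = set()
    # The 'seen' set guards against redirect cycles.
    while name in redirects and name not in seen:
        seen.add(name)
        name = redirects[name]
    return name


def canonicalize(triples, redirects):
    """Rewrite the object of every triple to its redirect target."""
    return [(s, p, resolve(o, redirects)) for s, p, o in triples]


triples = [
    ("Florida", "located_in", "USA"),
    ("California", "located_in", "United_States"),
]
redirects = {"USA": "United_States"}

canonical = canonicalize(triples, redirects)
```

After this step both triples share the object `United_States`, so a query for everything located in the United States would return both states.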
  • 12. Class Hierarchy
    • “Select all PEOPLE born in …”
  • 13. Range Validation
    • dbpedia:Google
    • keyperson Eric Schmidt
    • keyperson Sergey Brin
    • keyperson Larry Page
    • keyperson CEO
    • keyperson Chairman
  • 14. Range Validation
  • 15.
    • Technical Overview
  • 16. And how does it work?
    • Extraction Framework
    • (and a lot of regular expressions)
  • 17. Extraction Framework
    • Open Source
    • http://dbpedia.svn.sourceforge.net
    • implemented in PHP
  • 18. Extraction Framework
    • Data Input (PageCollections)
    • DatabaseWikipedia
    • LiveWikipedia
  • 19. Extraction Framework
    • Data Processing (Extractors)
    • InfoboxExtractor
    • LabelExtractor
    • CategoryExtractor
    • RedirectExtractor
    • GeoExtractor
  • 20. Extraction Framework
    • Data Output (Destinations)
    • SimpleDumpDestination (stdout)
    • NTripleDumpDestination
  • 21. Extraction Framework
    • Tie things together
    • Extraction Manager
    • Extraction Jobs
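The three roles named on the preceding slides (page collections as input, extractors as processing, destinations as output) and the job that ties them together can be sketched roughly as follows. The real framework is implemented in PHP; this Python version only illustrates the shape of the pipeline. The class `InMemoryPageCollection`, the method names `get_source`, `extract` and `accept`, and the function `run_job` are all assumptions made for the example, and this `LabelExtractor` is a simplified stand-in for the extractor of the same name.

```python
class InMemoryPageCollection:
    """Stand-in for a page source such as DatabaseWikipedia or LiveWikipedia."""

    def __init__(self, pages):
        self._pages = pages

    def get_source(self, title):
        return self._pages[title]


class LabelExtractor:
    """Emits one rdfs:label triple per page (simplified, hypothetical)."""

    def extract(self, title, source):
        subject = "http://dbpedia.org/resource/" + title.replace(" ", "_")
        return [(subject, "rdfs:label", title)]


class ListDestination:
    """Collects triples in memory instead of writing a dump file."""

    def __init__(self):
        self.triples = []

    def accept(self, triples):
        self.triples.extend(triples)


def run_job(titles, collection, extractors, destination):
    """Run every extractor over every page and pass the results to the destination."""
    for title in titles:
        source = collection.get_source(title)
        for extractor in extractors:
            destination.accept(extractor.extract(title, source))


pages = {"Hewlett-Packard": "{{Infobox Company}}"}
dest = ListDestination()
run_job(["Hewlett-Packard"], InMemoryPageCollection(pages), [LabelExtractor()], dest)
```

Keeping input, processing and output behind separate interfaces is what lets the same extractors run against a database dump or the live site.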
  • 22. DBpedia Dataset
    • Provided as RDF Dumps
    • Updated every 3 months
    • Hosted by OpenLink Software
    • Available as Linked Data
  • 23. SPARQL Endpoint
    • http://dbpedia.org/sparql
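A request to the endpoint can be sketched as below. The `query` parameter is the one defined by the SPARQL protocol; the query itself (list properties of the Hewlett-Packard resource shown earlier) and the helper name `build_query_url` are only illustrations. The sketch builds the URL and does not send the request.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query: list up to ten property/value pairs of one resource.
QUERY = """SELECT ?p ?o WHERE {
  <http://dbpedia.org/resource/Hewlett-Packard> ?p ?o
} LIMIT 10"""


def build_query_url(endpoint, query):
    """Return a GET URL that carries the SPARQL query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
```

The resulting URL can be opened with any HTTP client; the result format depends on what the endpoint is asked to return.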
  • 24. Linked Data
    • Use URIs as names for things
    • Use HTTP URIs so that people can look up those names.
    • When someone looks up a URI, provide useful information.
    • Include links to other URIs, so that they can discover more things.
  • 25. HTTP URIs
    • Information Resources
    • http://dbpedia.org/page/Bristol HTTP GET -> 200 OK
    • Non-Information Resources
    • http://dbpedia.org/resource/Bristol HTTP GET -> 303 See other
    • http://dbpedia.org/page/Bristol
    • http://dbpedia.org/data/Bristol -> 200 OK
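The URI scheme on this slide can be sketched as a small dispatch function: a resource URI names the thing itself and answers with a 303 redirect to a document about it, while page and data URIs are documents and answer 200. Choosing between the two redirect targets by looking at the Accept header is an assumption here (the slide only lists both targets), as are the function name `respond` and the RDF/XML media type check.

```python
BASE = "http://dbpedia.org"


def respond(path, accept="text/html"):
    """Return (status, location) for a GET on the given path."""
    if path.startswith("/resource/"):
        name = path[len("/resource/"):]
        # Assumed rule: RDF clients go to /data/, everyone else to /page/.
        target = "/data/" if "application/rdf+xml" in accept else "/page/"
        return 303, BASE + target + name
    if path.startswith(("/page/", "/data/")):
        return 200, None
    return 404, None
```

For example, `respond("/resource/Bristol")` yields a 303 pointing at `http://dbpedia.org/page/Bristol`, matching the browser case on the slide.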
  • 26. How to get started
    • Documentation http://wiki.dbpedia.org/Documentation
    • Source Code
    • start.php
  • 27. Next Tasks
    • Improve Extractors
    • Cleaner Abstracts
    • Include Redirects into Extraction Process
    • Fix more Extraction Bugs
    • http://sourceforge.net/projects/dbpedia/
    • Provide Live Update Service
  • 28. Infobox Extraction
    • One script to rule them all
    • Not sufficient
  • 29.
    • Next Challenges
  • 30. Next challenges
    • Higher Data Quality + Ontologies
    • Consistency Checks
    • Augmentation
    • Live Updates
  • 31. Live Updates
    • Wikipedia Update Stream
    • Extraction Cluster
    • Named Graphs
  • 32. Augmentation
    • Enrich DBpedia with data from:
    • 1. other languages
    • 2. external datasets
  • 33. Consistency Checks
    • German Wikipedia says Berlin’s population is X
    • Italian Wikipedia says it’s Y
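A consistency check of this kind might look roughly like the following: group the values extracted from each language edition by subject and property, then report the groups where the editions disagree. The function name `find_conflicts`, the tuple layout, and the population numbers are all placeholders invented for the example (the slide only says X and Y).

```python
from collections import defaultdict


def find_conflicts(claims):
    """claims: iterable of (language, subject, property, value) tuples.

    Returns the (subject, property) pairs whose values differ across languages.
    """
    grouped = defaultdict(dict)
    for language, subject, prop, value in claims:
        grouped[(subject, prop)][language] = value
    return {
        key: by_language
        for key, by_language in grouped.items()
        if len(set(by_language.values())) > 1
    }


# Placeholder values standing in for X and Y.
claims = [
    ("de", "Berlin", "population", 100),
    ("it", "Berlin", "population", 120),
    ("de", "Berlin", "country", "Germany"),
    ("it", "Berlin", "country", "Germany"),
]

conflicts = find_conflicts(claims)
```

Detecting the conflict is the easy part; deciding which edition is right is where the next slide's point about needing humans comes in.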
  • 34. Data Quality
    • We need humans
  • 35.
    • The Vision
  • 36. Semantic Web
    • Users shouldn’t care
  • 37. Semantic Web
    • Users shouldn’t have to care
    • (del.icio.us lesson)
  • 38. DBpedia Extraction
    • Wikipedia -> DBpedia Extraction Framework -> Triple Store
  • 39. Freebase Extraction
    • Wikipedia -> Extraction -> Metaweb Graph Store
  • 40.
    • What is the
    • Wikipedia for Data?
  • 41.
    • Wikipedia is the
    • Wikipedia for Data
  • 42.  
  • 43. Crowd Sourced Extraction
    • Where’s the user benefit?
  • 44. Users
    • Mashup Developer
  • 45.
    • Benefit
  • 46.
    • Outlook
  • 47. Infobox Extraction
    • We need a new approach
    • Break it down into smaller pieces
  • 48. Step 1: Create an ontology
    • Five domains:
    • people, places, organisations,
    • events, works
  • 49. People
    • Actors
    • Athlete
    • Journalist
    • MusicalArtist
    • Politician
    • Scientist
    • Writer
  • 50. Places
    • Airport
    • City
    • Country
    • Island
    • Mountain
    • River
  • 51. Organisations
    • Band
    • Company
    • Educational Institution
    • Radio Station
    • Sports Team
  • 52. Event
    • Convention
    • Military Conflict
    • Music Event
    • Sport Event
  • 53. Work
    • Book
    • Broadcast
    • Film
    • Software
    • Television
  • 54. Step 2: Template Mapping
    • Infobox Cricketer
    • Infobox Historic Cricketer
    • Infobox Recent Cricketer
    • Infobox Old Cricketer
    • Infobox Cricketer Biography
    • => Class Cricketer (Athlete)
  • 55. Step 2: Template Mapping
    • Class TV Episode (Work)
    • Wikipedia Templates:
    • Television Episode
    • UK Office Episode
    • Simpsons Episode
    • DoctorWhoBox
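The template mapping step can be sketched as two lookup tables: one from Wikipedia template names to ontology classes, and one from each class to its parent. The template and class names come from the two slides above; treating Athlete as a subclass of `Person` follows the People slide, and the dictionary and function names are assumptions made for the example.

```python
TEMPLATE_TO_CLASS = {
    "Infobox Cricketer": "Cricketer",
    "Infobox Historic Cricketer": "Cricketer",
    "Infobox Recent Cricketer": "Cricketer",
    "Infobox Old Cricketer": "Cricketer",
    "Infobox Cricketer Biography": "Cricketer",
    "Television Episode": "TV Episode",
    "UK Office Episode": "TV Episode",
    "Simpsons Episode": "TV Episode",
    "DoctorWhoBox": "TV Episode",
}

SUPERCLASS = {
    "Cricketer": "Athlete",
    "Athlete": "Person",
    "TV Episode": "Work",
}


def classes_for(template):
    """Return the mapped class followed by its ancestors, most specific first."""
    chain = []
    cls = TEMPLATE_TO_CLASS.get(template)
    while cls is not None:
        chain.append(cls)
        cls = SUPERCLASS.get(cls)
    return chain
```

With the hierarchy in place, a page using any of the five cricketer infoboxes is also found by a query for athletes or for people, which addresses the missing class hierarchy listed under Problems.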
  • 56. Step 3: Parsers
    • Handle template values specifically
    • Example: Property splitting
    • Person born “1.1.1980, [[Berlin]]”
    • => split to birthplace Berlin
    • birthdate 1980-01-01
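The property splitting example can be sketched with a single regular expression. This is only an illustration for the one value format shown on the slide; real infobox values vary far more. The day.month.year reading of the date, the pattern, and the function name `split_born` are assumptions.

```python
import re
from datetime import date

# Matches "D.M.YYYY, [[Place]]" and also "[[Place|label]]".
BORN = re.compile(
    r"^\s*(\d{1,2})\.(\d{1,2})\.(\d{4})\s*,\s*\[\[([^\]|]+)(?:\|[^\]]*)?\]\]\s*$"
)


def split_born(value):
    """Split a 'born' value into birthdate (ISO 8601) and birthplace, or return None."""
    match = BORN.match(value)
    if match is None:
        return None
    day, month, year, place = match.groups()
    try:
        birthdate = date(int(year), int(month), int(day)).isoformat()
    except ValueError:
        # Not a real calendar date, e.g. 31.2.1980.
        return None
    return {"birthdate": birthdate, "birthplace": place.strip().replace(" ", "_")}
```

Returning `None` for anything the pattern does not recognise keeps malformed values out of the dataset instead of guessing.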
  • 57. Step 3: Parsers
    • Example: Class Rules
    • MusicalArtist
    • If property “currentMembers” is set
    • => Group
    • Otherwise
    • => Person
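The class rule on this slide is small enough to write down directly. The function name and the choice to treat an empty value as "not set" are assumptions.

```python
def classify_musical_artist(properties):
    """Return 'Group' if currentMembers has a non-empty value, otherwise 'Person'."""
    return "Group" if properties.get("currentMembers") else "Person"
```

A rule like this is needed because the same MusicalArtist infobox is used for both solo artists and bands.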
  • 58. Step 3: Parsers
    • Example: Range Validation
    • Google keypeople
    • “[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]”
    • Company#keyperson range Person#Class
    • Google keyperson Eric Schmidt
    • Sergey Brin
    • Larry Page
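Range validation for the Google example can be sketched as: collect every wiki link in the value, then keep only those whose target belongs to the expected class. The lookup table `class_of` is a hypothetical stand-in for whatever class information the ontology provides, the label `Occupation` is invented for the example, and the function name is an assumption.

```python
import re

# Matches [[Target]] and [[Target|label]], capturing the target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def validate_range(value, expected_class, class_of):
    """Return the linked targets in value whose class equals expected_class."""
    linked = [m.group(1).strip() for m in WIKILINK.finditer(value)]
    return [name for name in linked if class_of.get(name) == expected_class]


# Hypothetical class assignments for the linked pages.
class_of = {
    "Eric Schmidt": "Person",
    "Sergey Brin": "Person",
    "Larry Page": "Person",
    "CEO": "Occupation",
    "Chairman": "Occupation",
}

raw = "[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]"
keypeople = validate_range(raw, "Person", class_of)
```

Here `CEO` and `Chairman` are dropped because they are not of class Person, leaving the three names from the slide.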
  • 59. Step 4: Crowd Source it
  • 60. Step 4: Crowd Source it
  • 61.
    • Linking Framework
  • 62. Interlinking Framework
  • 63. Interlinking Framework
  • 64.
    • “Apple”
  • 65.
    • Apple
    • Google
    • Microsoft
  • 66.
    • Apple
    • Orange
    • Pear
  • 67.
    • Orange
    • Vodafone
    • T-Mobile
  • 68.
    • Context
    • Similarity
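The idea behind the last few slides, that the neighbours of an ambiguous name decide what it refers to, can be sketched with a simple set similarity. The talk does not say which similarity measure is used; Jaccard similarity is a stand-in here, and the candidate identifiers and their neighbour sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disambiguate(context, candidates):
    """Pick the candidate whose known neighbours best match the context."""
    return max(candidates, key=lambda name: jaccard(context, candidates[name]))


# Hypothetical candidates for the name "Apple" and things they co-occur with.
CANDIDATES = {
    "Apple_Inc.": {"Google", "Microsoft"},
    "Apple_(fruit)": {"Orange", "Pear"},
}
```

Given the context `{"Google", "Microsoft"}` the sketch picks `Apple_Inc.`, and given `{"Orange", "Pear"}` it picks `Apple_(fruit)`, mirroring the two Apple slides.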
  • 69. Linking: The Future
    • Hosted Webservice
    • for Linked Data publishers
  • 70. Summary
  • 71.
    • http://dbpedia.org
    • Georgi Kobilarov
    • Freie Universität Berlin