DBpedia Framework - BBC Talk


Transcript

  • 1. Georgi Kobilarov, Chris Bizer, Christian Becker (Freie Universität Berlin)
  • 2. Hello again
    • Georgi Kobilarov
    • Researcher at Freie Universität Berlin
    • DBpedia Development Lead
  • 3. Agenda
    • Status Quo
    • Technical Overview
    • Challenges
    • Outlook
  • 4.
    • How to extract Wikipedia data
    • and how to not do it
  • 5.
    • Lessons learned
  • 6.
    • Title
    • Description
    • Languages
    • Web Links
    • Categorization
    • Domain-specific Data
    • Images
    • Infoboxes
  • 7.  
  • 8.
    • <http://dbpedia.org/resource/Hewlett-Packard>
    • rdfs:label “Hewlett-Packard”
    • p:foundation dbpedia:Palo_Alto
    • p:keypeople dbpedia:Bill_Hewlett
    • p:keypeople dbpedia:David_Packard
    • p:keypeople dbpedia:Mark_V._Hurd
    • p:industry dbpedia:Computer_Systems
    • p:industry dbpedia:Computer_software
    • p:revenue 104300000000 $
    • p:netincome 7300000000 $
    • p:employees 156000
    • p:slogan “Invent”
  • 9. Problems
    • Poor Abstract extraction
    • Property synonyms
    • Redirects
    • Missing class hierarchy
    • Range validation
  • 10. Property Synonyms
  • 11. Redirects
    • Florida located_in USA
    • California located_in United_States
    • USA redirects_to United_States
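The redirect problem on this slide can be sketched in a few lines of Python (the framework itself is PHP). This is a minimal illustration, assuming triples are plain tuples and redirects are available as a dictionary; the names `resolve` and `canonicalize` are hypothetical and not part of the DBpedia framework.

```python
# Redirect table and triples taken from the slide's example.
redirects = {"USA": "United_States"}

triples = [
    ("Florida", "located_in", "USA"),
    ("California", "located_in", "United_States"),
]


def resolve(name, redirects, max_hops=10):
    """Follow redirects until a page that does not redirect is reached."""
    hops = 0
    while name in redirects and hops < max_hops:  # max_hops guards against cycles
        name = redirects[name]
        hops += 1
    return name


def canonicalize(triples, redirects):
    """Rewrite the subject and object of every triple to its redirect target."""
    return [
        (resolve(s, redirects), p, resolve(o, redirects))
        for s, p, o in triples
    ]


canonical = canonicalize(triples, redirects)
print(canonical)
```

After canonicalization both statements point at `United_States`, so a query for everything located in the United States finds Florida as well as California.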
  • 12. Class Hierarchy
    • „Select all PEOPLE born in …“
  • 13. Range Validation
    • dbpedia:Google
    • keyperson Eric Schmidt
    • keyperson Sergey Brin
    • keyperson Larry Page
    • keyperson CEO
    • keyperson Chairman
  • 14. Range Validation
  • 15.
    • Technical Overview
  • 16. And how does it work?
    • Extraction Framework
    • (and a lot of regular expressions)
  • 17. Extraction Framework
    • Open Source
    • http://dbpedia.svn.sourceforge.net
    • implemented in PHP
  • 18. Extraction Framework
    • Data Input (PageCollections)
    • DatabaseWikipedia
    • LiveWikipedia
  • 19. Extraction Framework
    • Data Processing (Extractors)
    • InfoboxExtractor
    • LabelExtractor
    • CategoryExtractor
    • RedirectExtractor
    • GeoExtractor
  • 20. Extraction Framework
    • Data Output (Destinations)
    • SimpleDumpDestination (stdout)
    • NTripleDumpDestination
  • 21. Extraction Framework
    • Tie things together
    • Extraction Manager
    • Extraction Jobs
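The four building blocks from the preceding slides (PageCollections, Extractors, Destinations, and a manager running jobs) can be sketched as follows. This is an illustrative Python analogue only: the real framework is implemented in PHP, and every class interface and method name below (`extract`, `accept`, `execute`, `ListPageCollection`, `ListDestination`) is an assumption made for the example.

```python
# Illustrative Python analogue; the real framework is PHP and its actual
# class interfaces may differ. All method names here are assumptions.


class ListPageCollection:
    """Data input: serves (title, wikitext) pairs from an in-memory dict."""

    def __init__(self, pages):
        self.pages = pages

    def __iter__(self):
        return iter(self.pages.items())


class LabelExtractor:
    """Data processing: emits one rdfs:label triple per page."""

    def extract(self, title, wikitext):
        return [(title, "rdfs:label", title.replace("_", " "))]


class ListDestination:
    """Data output: collects triples in memory instead of writing a dump."""

    def __init__(self):
        self.triples = []

    def accept(self, triples):
        self.triples.extend(triples)


class ExtractionJob:
    """Bundles one input, a list of extractors, and one output."""

    def __init__(self, collection, extractors, destination):
        self.collection = collection
        self.extractors = extractors
        self.destination = destination


class ExtractionManager:
    """Runs every extractor of a job over every page of its collection."""

    def execute(self, job):
        for title, wikitext in job.collection:
            for extractor in job.extractors:
                job.destination.accept(extractor.extract(title, wikitext))


destination = ListDestination()
job = ExtractionJob(
    ListPageCollection({"Hewlett-Packard": "(wikitext placeholder)"}),
    [LabelExtractor()],
    destination,
)
ExtractionManager().execute(job)
print(destination.triples)
```

Keeping input, processing, and output behind separate interfaces means a new extractor or a different output format can be added without touching the other parts.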
  • 22. DBpedia Dataset
    • Provided as RDF Dumps
    • Updated every 3 months
    • Hosted by OpenLink Software
    • Available as Linked Data
  • 23. SPARQL Endpoint
    • http://dbpedia.org/sparql
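As a rough sketch of how a client might address the endpoint, the snippet below builds a query URL for the label of the Hewlett-Packard resource shown earlier. The `query` parameter follows the SPARQL Protocol; the `format` parameter and the helper name `build_query_url` are assumptions for illustration. The request itself is not sent, so the example runs offline.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Hewlett-Packard> rdfs:label ?label .
}
"""


def build_query_url(query, endpoint=ENDPOINT,
                    result_format="application/sparql-results+json"):
    """Encode a SPARQL query as an HTTP GET URL.

    The 'format' parameter is an assumption about the endpoint; a client
    could send an Accept header instead.
    """
    params = urlencode({"query": query, "format": result_format})
    return f"{endpoint}?{params}"


url = build_query_url(QUERY)
print(url)
```

Passing the resulting URL to `urllib.request.urlopen` would execute the query against the live endpoint.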
  • 24. Linked Data
    • Use URIs as names for things
    • Use HTTP URIs so that people can look up those names.
    • When someone looks up a URI, provide useful information.
    • Include links to other URIs, so that they can discover more things.
  • 25. HTTP URIs
    • Information Resources
    • http://dbpedia.org/page/Bristol
    • HTTP GET -> 200 OK
    • Non-Information Resources
    • http://dbpedia.org/resource/Bristol
    • HTTP GET -> 303 See Other
    • http://dbpedia.org/page/Bristol
    • http://dbpedia.org/data/Bristol -> 200 OK
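The URI pattern on this slide can be expressed as a small routing function. This is a sketch under stated assumptions: the choice between the `page` and `data` documents is made from the `Accept` header, which the slide does not spell out, and `dereference` is a hypothetical name, not server code from the project.

```python
def dereference(path, accept="text/html"):
    """Return (status, location) for a request path, following the slide's pattern."""
    kind, _, name = path.lstrip("/").partition("/")
    if kind == "resource":
        # Non-information resource: redirect to a document about the thing.
        # Assumption: RDF clients get the data document, everyone else the page.
        target = "data" if "application/rdf+xml" in accept else "page"
        return 303, f"http://dbpedia.org/{target}/{name}"
    if kind in ("page", "data"):
        # Information resource: served directly.
        return 200, None
    return 404, None


print(dereference("/resource/Bristol"))
print(dereference("/resource/Bristol", accept="application/rdf+xml"))
print(dereference("/page/Bristol"))
```

The 303 response keeps the identifier of the city itself separate from the documents that describe it.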
  • 26. How to get started
    • Documentation http://wiki.dbpedia.org/Documentation
    • Source Code
    • start.php
  • 27. Next Tasks
    • Improve Extractors
    • Cleaner Abstracts
    • Include Redirects into Extraction Process
    • Fix more Extraction Bugs
    • http://sourceforge.net/projects/dbpedia/
    • Provide Live Update Service
  • 28. Infobox Extraction
    • One script to rule them all
    • Not sufficient
  • 29.
    • Next Challenges
  • 30. Next challenges
    • Higher Data Quality + Ontologies
    • Consistency Checks
    • Augmentation
    • Live Updates
  • 31. Live Updates
    • Wikipedia Update Stream
    • Extraction Cluster
    • Named Graphs
  • 32. Augmentation
    • Enrich DBpedia with data from:
    • 1. other languages
    • 2. external datasets
  • 33. Consistency Checks
    • German Wikipedia says Berlin's population is X
    • Italian Wikipedia says it's Y
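One possible shape for such a check is sketched below: collect the value each language edition reports and flag the property when the spread exceeds a tolerance. The function name `check_consistency`, the tolerance, and the numbers are all placeholders chosen for illustration; they are not real population figures and not part of the framework.

```python
def check_consistency(values, rel_tolerance=0.01):
    """Compare the values reported by different language editions.

    values: non-empty mapping from edition code to the numeric value it reports.
    Returns (consistent, (lowest, highest)).
    """
    lowest, highest = min(values.values()), max(values.values())
    consistent = highest - lowest <= rel_tolerance * highest
    return consistent, (lowest, highest)


# Arbitrary placeholder numbers standing in for X and Y on the slide.
reported = {"de": 1000, "it": 1200}
print(check_consistency(reported))
```

A flagged property could then be passed on for human review rather than resolved automatically.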
  • 34. Data Quality
    • We need humans
  • 35.
    • The Vision
  • 36. Semantic Web
    • Users shouldn’t care
  • 37. Semantic Web
    • Users shouldn’t have to care
    • (del.icio.us lesson)
  • 38. DBpedia Extraction
    • Wikipedia -> DBpedia Extraction Framework -> Triple Store
  • 39. Freebase Extraction
    • Wikipedia -> Extraction -> Metaweb Graph Store
  • 40.
    • What is the
    • Wikipedia for Data?
  • 41.
    • Wikipedia is the
    • Wikipedia for Data
  • 42.  
  • 43. Crowd Sourced Extraction
    • Where's the user benefit?
  • 44. Users
    • Mashup Developer
  • 45.
    • Benefit
  • 46.
    • Outlook
  • 47. Infobox Extraction
    • We need a new approach
    • Break it down into smaller pieces
  • 48. Step 1: Create an ontology
    • Five domains:
    • people, places, organisations,
    • events, works
  • 49. People
    • Actor
    • Athlete
    • Journalist
    • MusicalArtist
    • Politician
    • Scientist
    • Writer
  • 50. Places
    • Airport
    • City
    • Country
    • Island
    • Mountain
    • River
  • 51. Organisations
    • Band
    • Company
    • Educational Institution
    • Radio Station
    • Sports Team
  • 52. Event
    • Convention
    • Military Conflict
    • Music Event
    • Sport Event
  • 53. Work
    • Book
    • Broadcast
    • Film
    • Software
    • Television
  • 54. Step 2: Template Mapping
    • Infobox Cricketer
    • Infobox Historic Cricketer
    • Infobox Recent Cricketer
    • Infobox Old Cricketer
    • Infobox Cricketer Biography
    • => Class Cricketer (Athlete)
  • 55. Step 2: Template Mapping
    • Class TV Episode (Work)
    • Wikipedia Templates:
    • Television Episode
    • UK Office Episode
    • Simpsons Episode
    • DoctorWhoBox
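The template mapping from the two preceding slides amounts to a many-to-one lookup from Wikipedia template names to ontology classes. A minimal sketch, assuming the mapping is kept as plain dictionaries; the function name `classes_for` and the data layout are illustrative, not the project's actual mapping format.

```python
# Template names and classes as listed on the slides.
TEMPLATE_TO_CLASS = {
    "Infobox Cricketer": "Cricketer",
    "Infobox Historic Cricketer": "Cricketer",
    "Infobox Recent Cricketer": "Cricketer",
    "Infobox Old Cricketer": "Cricketer",
    "Infobox Cricketer Biography": "Cricketer",
    "Television Episode": "TV Episode",
    "UK Office Episode": "TV Episode",
    "Simpsons Episode": "TV Episode",
    "DoctorWhoBox": "TV Episode",
}

# Parent classes as given in parentheses on the slides.
SUPERCLASS = {"Cricketer": "Athlete", "TV Episode": "Work"}


def classes_for(template):
    """Return the mapped class followed by its ancestors, or [] if unmapped."""
    chain = []
    cls = TEMPLATE_TO_CLASS.get(template)
    while cls is not None and cls not in chain:  # 'not in' guards against cycles
        chain.append(cls)
        cls = SUPERCLASS.get(cls)
    return chain


print(classes_for("Infobox Old Cricketer"))
print(classes_for("DoctorWhoBox"))
```

Because the mapping is data rather than code, it is the kind of artifact that could later be maintained by a community.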
  • 56. Step 3: Parsers
    • Handle template values specifically
    • Example: Property splitting
    • Person born „1.1.1980, [[Berlin]]“
    • => split to birthplace Berlin
    • birthdate 1980-01-01
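The property-splitting example above can be sketched with a single regular expression. The sketch assumes the value always has the shape "day.month.year, [[Place]]"; the pattern, the function name `split_born`, and the returned dictionary keys are illustrative choices, and real infobox values vary far more than this.

```python
import re
from datetime import date

# day.month.year, followed by one wiki link (optionally with a |display label).
BORN = re.compile(
    r"^\s*(\d{1,2})\.(\d{1,2})\.(\d{4})\s*,\s*\[\[([^\]|]+)(?:\|[^\]]*)?\]\]\s*$"
)


def split_born(value):
    """Split a combined 'born' value into birthdate and birthplace."""
    match = BORN.match(value)
    if match is None:
        return None  # value does not follow the assumed shape
    day, month, year, place = match.groups()
    return {
        "birthdate": date(int(year), int(month), int(day)).isoformat(),
        "birthplace": place.strip().replace(" ", "_"),
    }


print(split_born("1.1.1980, [[Berlin]]"))
```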
  • 57. Step 3: Parsers
    • Example: Class Rules
    • MusicalArtist
    • If property „currentMembers“ is set
    • => Group
    • Otherwise
    • => Person
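The class rule on this slide reduces to one conditional. A minimal sketch, assuming infobox properties arrive as a dictionary of strings; the function name `classify_musical_artist` is hypothetical.

```python
def classify_musical_artist(properties):
    """Apply the slide's rule: a set 'currentMembers' property means Group."""
    value = properties.get("currentMembers")
    if value is not None and str(value).strip():
        return "Group"
    return "Person"


print(classify_musical_artist({"currentMembers": "[[Member A]], [[Member B]]"}))
print(classify_musical_artist({}))
```

Treating an empty or whitespace-only value as "not set" is a design choice of this sketch, since infobox templates are often filled in with blank fields.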
  • 58. Step 3: Parsers
    • Example: Range Validation
    • Google keypeople
    • „[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]“
    • Company#keyperson range Person#Class
    • Google keyperson Eric Schmidt
    • Sergey Brin
    • Larry Page
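The range-validation example can be sketched as: extract every wiki link from the raw value, then keep only the targets whose class matches the declared range of the property. The class table below is a hypothetical stand-in for the ontology (in particular, the class name `Title` for CEO and Chairman is an assumption), and `extract_valid` is an illustrative name.

```python
import re

# Matches [[Target]] and [[Target|label]]; captures the target only.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

# Hypothetical class assignments; in practice these would come from the ontology.
CLASS_OF = {
    "Eric Schmidt": "Person",
    "Sergey Brin": "Person",
    "Larry Page": "Person",
    "CEO": "Title",
    "Chairman": "Title",
}

# Declared range per property, as on the slide: keyperson expects a Person.
RANGE = {"keyperson": "Person"}


def extract_valid(prop, raw_value):
    """Return the link targets in raw_value whose class matches prop's range."""
    expected = RANGE[prop]
    targets = [t.strip() for t in WIKILINK.findall(raw_value)]
    return [t for t in targets if CLASS_OF.get(t) == expected]


raw = "[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]"
print(extract_valid("keyperson", raw))
```

Links to CEO and Chairman are dropped because they are not people, which removes the bogus `keyperson CEO` and `keyperson Chairman` statements shown on the earlier Range Validation slide.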
  • 59. Step 4: Crowd Source it
  • 60. Step 4: Crowd Source it
  • 61.
    • Linking Framework
  • 62. Interlinking Framework
  • 63. Interlinking Framework
  • 64.
    • „Apple“
  • 65.
    • Apple
    • Google
    • Microsoft
  • 66.
    • Apple
    • Orange
    • Pear
  • 67.
    • Orange
    • Vodafone
    • T-Mobile
  • 68.
    • Context
    • Similarity
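The idea behind the Apple and Orange slides, picking the right sense of an ambiguous label from the terms around it, can be sketched with a simple set-overlap measure. Jaccard similarity is used here as one possible choice; the slides do not say which measure the interlinking framework uses. The sense identifiers such as `Apple_(company)` are hypothetical labels, not actual DBpedia resource names.

```python
# Hypothetical candidate senses, each with the context terms from the slides.
CANDIDATES = {
    "Apple": {
        "Apple_(company)": {"Google", "Microsoft"},
        "Apple_(fruit)": {"Orange", "Pear"},
    },
    "Orange": {
        "Orange_(fruit)": {"Apple", "Pear"},
        "Orange_(telecom)": {"Vodafone", "T-Mobile"},
    },
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disambiguate(label, context):
    """Pick the candidate sense whose context terms overlap most with 'context'."""
    senses = CANDIDATES[label]
    return max(senses, key=lambda sense: jaccard(senses[sense], context))


print(disambiguate("Apple", {"Microsoft"}))
print(disambiguate("Orange", {"Vodafone", "T-Mobile"}))
```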
  • 69. Linking: The Future
    • Hosted Webservice
    • for Linked Data publishers
  • 70. Summary
  • 71.
    • http://dbpedia.org
    • Georgi Kobilarov
    • Freie Universität Berlin