DBpedia Framework - BBC Talk
 

Presentation Transcript

  • Georgi Kobilarov, Chris Bizer, Christian Becker (Freie Universität Berlin)
  • Hello again
    • Georgi Kobilarov
    • Researcher at Freie Universität Berlin
    • DBpedia Development Lead
  • Agenda
    • Status Quo
    • Technical Overview
    • Challenges
    • Outlook
    • How to extract Wikipedia data
    • and how to not do it
    • Lessons learned
  • Title, Description, Languages, Web Links, Categorization, Domain-specific Data, Images, Infoboxes
  •  
    • <http://dbpedia.org/resource/Hewlett-Packard>
    • rdfs:label “Hewlett-Packard”
    • p:foundation dbpedia:Palo_Alto
    • p:keypeople dbpedia:Bill_Hewlett
    • p:keypeople dbpedia:David_Packard
    • p:keypeople dbpedia:Mark_V._Hurd
    • p:industry dbpedia:Computer_Systems
    • p:industry dbpedia:Computer_software
    • p:revenue 104300000000 $
    • p:netincome 7300000000 $
    • p:employees 156000
    • p:slogan “Invent”
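The statements on this slide are subject/predicate/object triples. As a minimal sketch of how such prefixed names could be written out as N-Triples lines, the following assumes that `dbpedia:` and `p:` expand to the DBpedia resource and property namespaces; the helper names `expand` and `to_ntriple` are hypothetical and are not part of the DBpedia framework.

```python
# Assumed prefix expansions; the slide only shows the short forms.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "p": "http://dbpedia.org/property/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(term):
    """Turn a prefixed name such as 'p:industry' into a URI in angle brackets."""
    prefix, local = term.split(":", 1)
    return "<" + PREFIXES[prefix] + local + ">"


def to_ntriple(subject, predicate, obj, literal=False):
    """Serialize one statement as an N-Triples line (plain literals only)."""
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_text = '"' + escaped + '"'
    else:
        obj_text = expand(obj)
    return f"{expand(subject)} {expand(predicate)} {obj_text} ."


lines = [
    to_ntriple("dbpedia:Hewlett-Packard", "rdfs:label", "Hewlett-Packard", literal=True),
    to_ntriple("dbpedia:Hewlett-Packard", "p:foundation", "dbpedia:Palo_Alto"),
    to_ntriple("dbpedia:Hewlett-Packard", "p:keypeople", "dbpedia:Bill_Hewlett"),
]
```

Typed literals such as the revenue figure would need a datatype suffix, which this sketch leaves out.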
  • Problems
    • Poor Abstract extraction
    • Property synonyms
    • Redirects
    • Missing class hierarchy
    • Range validation
  • Property Synonyms
  • Redirects
    • Florida located_in USA
    • California located_in United_States
    • USA redirects_to United_States
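The redirect problem above means two facts about the same country end up pointing at different names. One possible fix, sketched here under the assumption that redirects are available as a simple name-to-name mapping, is to rewrite every object to its canonical name; the function `resolve` is hypothetical.

```python
def resolve(name, redirects):
    """Follow redirect chains to the canonical name, stopping on cycles."""
    seen = set()
    while name in redirects and name not in seen:
        seen.add(name)
        name = redirects[name]
    return name


# Data taken from the slide.
redirects = {"USA": "United_States"}
triples = [
    ("Florida", "located_in", "USA"),
    ("California", "located_in", "United_States"),
]

# Both statements now share the same object.
canonical = [(s, p, resolve(o, redirects)) for s, p, o in triples]
```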
  • Class Hierarchy
    • “Select all PEOPLE born in …”
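A query for all people only works if specific types such as Politician or Cricketer are known to be kinds of Person. The sketch below shows that idea with a tiny subclass table; the class names echo the ontology discussed later in the talk, while the entity names and birthplaces are made-up placeholders.

```python
# Child class -> parent class (illustrative subset).
SUBCLASS_OF = {
    "Cricketer": "Athlete",
    "Athlete": "Person",
    "Politician": "Person",
    "Actor": "Person",
    "City": "Place",
}


def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


# Placeholder entities: name -> (most specific class, birthplace).
entities = {
    "Entity_A": ("Cricketer", "Berlin"),
    "Entity_B": ("City", None),
    "Entity_C": ("Politician", "Berlin"),
    "Entity_D": ("Actor", "Bristol"),
}

people_born_in_berlin = sorted(
    name
    for name, (cls, birthplace) in entities.items()
    if is_a(cls, "Person") and birthplace == "Berlin"
)
```

Without the hierarchy, the query would have to enumerate every person-like class by hand.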
  • Range Validation
    • dbpedia:Google
    • keyperson Eric Schmidt
    • keyperson Sergey Brin
    • keyperson Larry Page
    • keyperson CEO
    • keyperson Chairman
  • Range Validation
    • Technical Overview
  • And how does it work?
    • Extraction Framework
    • (and a lot of regular expressions)
  • Extraction Framework
    • Open Source
    • http://dbpedia.svn.sourceforge.net
    • implemented in PHP
  • Extraction Framework
    • Data Input (PageCollections)
    • DatabaseWikipedia
    • LiveWikipedia
  • Extraction Framework
    • Data Processing (Extractors)
    • InfoboxExtractor
    • LabelExtractor
    • CategoryExtractor
    • RedirectExtractor
    • GeoExtractor
  • Extraction Framework
    • Data Output (Destinations)
    • SimpleDumpDestination (stdout)
    • NTripleDumpDestination
  • Extraction Framework
    • Tie things together
    • Extraction Manager
    • Extraction Jobs
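The three roles named on these slides (page collections, extractors, destinations) and the manager that ties them together can be illustrated with a small pipeline. This is a Python sketch of the architecture only: the real framework is written in PHP, and every method name and signature here (`get_source`, `extract`, `accept`, `execute`) is an assumption made for illustration.

```python
import re


class InMemoryPageCollection:
    """Stand-in for DatabaseWikipedia / LiveWikipedia: wiki source by title."""

    def __init__(self, pages):
        self.pages = pages

    def get_source(self, title):
        return self.pages[title]


class LabelExtractor:
    """Emits a label derived from the page title."""

    def extract(self, title, source):
        return [(title, "rdfs:label", title.replace("_", " "))]


class RedirectExtractor:
    """Emits a redirect triple for pages that start with #REDIRECT [[...]]."""

    PATTERN = re.compile(r"#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

    def extract(self, title, source):
        match = self.PATTERN.match(source.strip())
        if match is None:
            return []
        target = match.group(1).strip().replace(" ", "_")
        return [(title, "redirects_to", target)]


class ListDestination:
    """Collects triples in memory instead of writing a dump file."""

    def __init__(self):
        self.triples = []

    def accept(self, triples):
        self.triples.extend(triples)


class ExtractionJob:
    """Bundles one input, a list of titles, extractors, and one output."""

    def __init__(self, collection, titles, extractors, destination):
        self.collection = collection
        self.titles = titles
        self.extractors = extractors
        self.destination = destination


class ExtractionManager:
    """Runs every extractor of a job over every page of that job."""

    def execute(self, job):
        for title in job.titles:
            source = job.collection.get_source(title)
            for extractor in job.extractors:
                job.destination.accept(extractor.extract(title, source))


pages = {
    "USA": "#REDIRECT [[United States]]",
    "United_States": "The United States is a country.",
}
destination = ListDestination()
job = ExtractionJob(
    InMemoryPageCollection(pages),
    ["USA", "United_States"],
    [LabelExtractor(), RedirectExtractor()],
    destination,
)
ExtractionManager().execute(job)
```

Keeping input, processing, and output behind separate interfaces is what lets the same extractors run against a database dump or a live update stream.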
  • DBpedia Dataset
    • Provided as RDF Dumps
    • Updated every 3 months
    • Hosted by OpenLink Software
    • Available as Linked Data
  • SPARQL Endpoint
    • http://dbpedia.org/sparql
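As a rough illustration of how a client might address this endpoint, the sketch below builds a GET URL that carries the query in the standard SPARQL Protocol `query` parameter. It only constructs the URL and sends nothing; the example query and the helper `build_query_url` are assumptions, not something shown in the talk.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"


def build_query_url(query):
    """Return a GET URL with the SPARQL query percent-encoded."""
    return ENDPOINT + "?" + urlencode({"query": query})


# Illustrative query: fetch the label of the Hewlett-Packard resource.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Hewlett-Packard> rdfs:label ?label
}
"""
url = build_query_url(query)
```

The resulting URL can be fetched with any HTTP client; the desired result format is normally negotiated through the `Accept` header.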
  • Linked Data
    • Use URIs as names for things
    • Use HTTP URIs so that people can look up those names.
    • When someone looks up a URI, provide useful information.
    • Include links to other URIs, so that they can discover more things.
  • HTTP URIs
    • Information Resources
    • http://dbpedia.org/page/Bristol HTTP GET -> 200 OK
    • Non-Information Resources
    • http://dbpedia.org/resource/Bristol HTTP GET -> 303 See Other
    • http://dbpedia.org/page/Bristol, http://dbpedia.org/data/Bristol -> 200 OK
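The lookup behaviour on this slide can be modelled as a small function: a resource URI answers with a 303 redirect, while page and data URIs answer with 200. The sketch assumes that the redirect target is chosen from the `Accept` header (RDF clients go to `/data/`, everything else to `/page/`); the slide lists both targets but does not say how the choice is made, and `respond` is a hypothetical name.

```python
RESOURCE = "http://dbpedia.org/resource/"
PAGE = "http://dbpedia.org/page/"
DATA = "http://dbpedia.org/data/"


def respond(uri, accept="text/html"):
    """Return (status code, location) for a GET on a DBpedia-style URI."""
    if uri.startswith(RESOURCE):
        name = uri[len(RESOURCE):]
        # Assumption: RDF clients are redirected to the data document.
        target = DATA if "application/rdf+xml" in accept else PAGE
        return 303, target + name
    if uri.startswith(PAGE) or uri.startswith(DATA):
        return 200, uri
    return 404, None


html_lookup = respond("http://dbpedia.org/resource/Bristol")
rdf_lookup = respond("http://dbpedia.org/resource/Bristol", "application/rdf+xml")
```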
  • How to get started
    • Documentation http://wiki.dbpedia.org/Documentation
    • Source Code
    • start.php
  • Next Tasks
    • Improve Extractors
    • Cleaner Abstracts
    • Include Redirects into Extraction Process
    • Fix more Extraction Bugs
    • http://sourceforge.net/projects/dbpedia/
    • Provide Live Update Service
  • Infobox Extraction
    • One script to rule them all
    • Not sufficient
    • Next Challenges
  • Next Challenges
    • Higher Data Quality + Ontologies
    • Consistency Checks
    • Augmentation
    • Live Updates
  • Live Updates
    • Wikipedia Update Stream
    • Extraction Cluster
    • Named Graphs
  • Augmentation
    • Enrich DBpedia with data from:
    • 1. other languages
    • 2. external datasets
  • Consistency Checks
    • German Wikipedia says Berlin's population is X
    • Italian Wikipedia says it's Y
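A consistency check of this kind could start by grouping the values that different language editions report for the same property and flagging those that disagree. The sketch below assumes numeric values and a relative tolerance; the numbers are placeholders standing in for X and Y, not real figures, and `find_conflicts` is a hypothetical helper.

```python
def find_conflicts(claims, tolerance=0.0):
    """Return the properties whose per-source values differ by more than
    the given relative tolerance.

    claims maps property -> {source: numeric value}; each inner dict must
    contain at least one value.
    """
    conflicts = {}
    for prop, by_source in claims.items():
        values = list(by_source.values())
        low, high = min(values), max(values)
        if high - low > tolerance * max(abs(high), abs(low)):
            conflicts[prop] = by_source
    return conflicts


# Placeholder values only; "de" and "it" stand for the two Wikipedias.
claims = {
    "population": {"de": 100, "it": 120},
    "area": {"de": 50, "it": 50},
}
conflicts = find_conflicts(claims, tolerance=0.01)
```

Flagging is the easy part; deciding which source is right is where the talk argues that humans are needed.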
  • Data Quality
    • We need humans
    • The Vision
  • Semantic Web
    • Users shouldn’t care
  • Semantic Web
    • Users shouldn’t have to care
    • (del.icio.us lesson)
  • DBpedia Extraction: Wikipedia -> DBpedia Extraction Framework -> Triple Store
  • Freebase Extraction: Wikipedia -> Extraction -> Metaweb Graph Store
    • What is the
    • Wikipedia for Data?
    • Wikipedia is the
    • Wikipedia for Data
  • Crowd Sourced Extraction
    • Where's the user benefit?
  • Users
    • Mashup Developer
    • Benefit
    • Outlook
  • Infobox Extraction
    • We need a new approach
    • Break it down into smaller pieces
  • Step 1: Create an ontology
    • Five domains:
    • people, places, organisations,
    • events, works
  • People
    • Actor
    • Athlete
    • Journalist
    • MusicalArtist
    • Politician
    • Scientist
    • Writer
  • Places
    • Airport
    • City
    • Country
    • Island
    • Mountain
    • River
  • Organisations
    • Band
    • Company
    • Educational Institution
    • Radio Station
    • Sports Team
  • Event
    • Convention
    • Military Conflict
    • Music Event
    • Sport Event
  • Work
    • Book
    • Broadcast
    • Film
    • Software
    • Television
  • Step 2: Template Mapping
    • Infobox Cricketer
    • Infobox Historic Cricketer
    • Infobox Recent Cricketer
    • Infobox Old Cricketer
    • Infobox Cricketer Biography
    • => Class Cricketer (Athlete)
  • Step 2: Template Mapping
    • Class TV Episode (Work)
    • Wikipedia Templates:
    • Television Episode
    • UK Office Episode
    • Simpsons Episode
    • DoctorWhoBox
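Template mapping, as described on the two slides above, amounts to a many-to-one lookup from Wikipedia template names to ontology classes. The following sketch uses the template and class names from the slides; the normalization rule (case and underscores ignored) and the function `classify` are assumptions.

```python
# Template name -> ontology class, taken from the slides.
TEMPLATE_TO_CLASS = {
    "Infobox Cricketer": "Cricketer",
    "Infobox Historic Cricketer": "Cricketer",
    "Infobox Recent Cricketer": "Cricketer",
    "Infobox Old Cricketer": "Cricketer",
    "Infobox Cricketer Biography": "Cricketer",
    "Television Episode": "TV Episode",
    "UK Office Episode": "TV Episode",
    "Simpsons Episode": "TV Episode",
    "DoctorWhoBox": "TV Episode",
}
# Class -> parent class, as given in parentheses on the slides.
PARENT = {"Cricketer": "Athlete", "TV Episode": "Work"}


def normalize(name):
    """Ignore case, underscores, and repeated whitespace in template names."""
    return " ".join(name.replace("_", " ").split()).lower()


LOOKUP = {normalize(name): cls for name, cls in TEMPLATE_TO_CLASS.items()}


def classify(template_name):
    """Return (class, parent class) for a template, or None if unmapped."""
    cls = LOOKUP.get(normalize(template_name))
    if cls is None:
        return None
    return cls, PARENT[cls]
```

For example, `classify("infobox_historic_cricketer")` and `classify("Infobox Cricketer")` both resolve to the same class, which is the point of the mapping step.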
  • Step 3: Parsers
    • Handle template values specifically
    • Example: Property splitting
    • Person born “1.1.1980, [[Berlin]]”
    • => split to birthplace Berlin
    • birthdate 1980-01-01
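The property-splitting example can be sketched with two regular expressions, one for the date and one for the wiki link. The day.month.year reading of the date is an assumption (the slide's example is ambiguous), and `split_born` is a hypothetical helper rather than the framework's parser.

```python
import re
from datetime import date

# Assumed date layout: day.month.year, e.g. 1.1.1980.
DATE = re.compile(r"(\d{1,2})\.(\d{1,2})\.(\d{4})")
# [[Target]] or [[Target|shown text]]; only the target is captured.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def split_born(value):
    """Split a combined 'born' value into birthdate and birthplace."""
    result = {}
    date_match = DATE.search(value)
    if date_match:
        day, month, year = map(int, date_match.groups())
        # date() raises ValueError for impossible dates such as 31.2.1980.
        result["birthdate"] = date(year, month, day).isoformat()
    link_match = LINK.search(value)
    if link_match:
        result["birthplace"] = link_match.group(1).strip().replace(" ", "_")
    return result


parsed = split_born("1.1.1980, [[Berlin]]")
```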
  • Step 3: Parsers
    • Example: Class Rules
    • MusicalArtist
    • If property “currentMembers” is set
    • => Group
    • Otherwise
    • => Person
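The class rule on this slide is a simple conditional on one infobox property. In the sketch below, "set" is taken to mean present and non-blank, which is an assumption; the function name is hypothetical.

```python
def classify_musical_artist(properties):
    """Apply the slide's rule: currentMembers set => Group, else Person."""
    current_members = (properties.get("currentMembers") or "").strip()
    return "Group" if current_members else "Person"


band = classify_musical_artist({"currentMembers": "[[Member A]], [[Member B]]"})
solo = classify_musical_artist({"birthName": "Some Name"})
```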
  • Step 3: Parsers
    • Example: Range Validation
    • Google keypeople
    • “[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]”
    • Company#keyperson range Person#Class
    • Google keyperson Eric Schmidt
    • Sergey Brin
    • Larry Page
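Range validation, as in the Google example, can be sketched as filtering the links found in a raw value by the class the property expects. The raw value and the Person range come from the slide; the `Title` class assigned to CEO and Chairman and the helper `extract_valid` are assumptions.

```python
import re

# [[Target]] or [[Target|shown text]]; only the target is captured.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

# Expected class of each property's values (from the slide).
PROPERTY_RANGE = {"keyperson": "Person"}

# Known class of each linked entity; "Title" is a made-up class name.
ENTITY_CLASS = {
    "Eric Schmidt": "Person",
    "Sergey Brin": "Person",
    "Larry Page": "Person",
    "CEO": "Title",
    "Chairman": "Title",
}


def extract_valid(property_name, raw_value, entity_class):
    """Keep only linked entities whose class matches the property's range."""
    expected = PROPERTY_RANGE[property_name]
    return [
        target
        for target in LINK.findall(raw_value)
        if entity_class.get(target) == expected
    ]


raw = "[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]"
key_people = extract_valid("keyperson", raw, ENTITY_CLASS)
```

Entities with an unknown class are dropped here; a more lenient variant could keep them and flag them for review.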
  • Step 4: Crowd Source it
  • Step 4: Crowd Source it
    • Linking Framework
  • Interlinking Framework
  • Interlinking Framework
    • “Apple”
    • Apple
    • Google
    • Microsoft
    • Apple
    • Orange
    • Pear
    • Orange
    • Vodafone
    • T-Mobile
    • Context
    • Similarity
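The slide suggests disambiguating a label such as "Apple" by comparing the context it appears in with the context of each candidate. One simple way to do that, shown here purely as an illustration, is Jaccard similarity over neighbouring terms; the slide names only "Context" and "Similarity", so the choice of measure, the candidate labels, and the function `link` are all assumptions.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Candidate meanings of "Apple" with the neighbours shown on the slide.
CANDIDATES = {
    "Apple (company)": {"Google", "Microsoft"},
    "Apple (fruit)": {"Orange", "Pear"},
}


def link(candidates, context):
    """Pick the candidate whose neighbours best overlap the given context.

    Ties go to the candidate listed first.
    """
    return max(candidates, key=lambda name: jaccard(candidates[name], context))


in_tech_list = link(CANDIDATES, {"Google", "Microsoft"})
in_fruit_list = link(CANDIDATES, {"Pear", "Banana"})
```

The same approach applies to "Orange", which sits next to Pear in one context and next to Vodafone and T-Mobile in another.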
  • Linking: The Future
    • Hosted Webservice
    • for Linked Data publishers
  • Summary
    • http://dbpedia.org
    • Georgi Kobilarov
    • Freie Universität Berlin