
DBpedia InsideOut


Introduction to DBpedia, the most popular and interconnected source of Linked Open Data. Part of EXPLORING WIKIDATA AND THE SEMANTIC WEB FOR LIBRARIES at METRO

Published in: Technology

  1. 1. DBPEDIA INSIDEOUT: AN INTRODUCTION TO THE MAJOR HUB FOR LINKED OPEN DATA Cristina Pattuelli, Pratt Institute March 16, 2015
  2. 2. “DBpedia is the Semantic Web mirror of Wikipedia”
  3. 3. WHAT IT IS DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web in the form of Linked Open Data.
  4. 4. Source: THE STATE OF THE LOD CLOUD 2014
  5. 5. Source: THE STATE OF THE LOD CLOUD 2014 2011: 295 DATASETS 2014: 570 DATASETS (+93%)
  6. 6. Source:
  7. 7. Connected with other Linked Datasets by 50 million RDF links. Most widely used linking predicates: owl:sameAs, rdfs:seeAlso, foaf:knows. CENTRAL INTERLINKING HUB OF THE WEB OF DATA
  8. 8. Web of Data Browsing and Crawling Web Data Integration and Mashups
  9. 9. “Which albums did Miles Davis record with female instrumentalists?” “Which populated places in Australia are below sea level?” “What did Andy Warhol and Thelonious Monk have in common?”
  10. 10. PAEAN TO DBPEDIA Multi-domain; Automatically evolving; Community consensus driven; Multilingual; >125 language editions; Accessible on the Web
  11. 11. DBPEDIA SEMANTICS 4.58 million “things” 583 million “facts”
  12. 12. “THINGS” Each thing in the DBpedia dataset is identified by a URI of the form Name is derived from the URL of the source Wikipedia article, which has the form
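
The naming rule on this slide can be sketched in a few lines of Python. The transcript does not preserve the two URI patterns, so the `/wiki/` path prefix and the `http://dbpedia.org/resource/` namespace below are assumptions based on the commonly documented conventions, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

# Assumed patterns (not preserved in the slide transcript).
WIKIPEDIA_PATH_PREFIX = "/wiki/"
DBPEDIA_RESOURCE_NS = "http://dbpedia.org/resource/"


def wikipedia_url_to_dbpedia_uri(url: str) -> str:
    """Derive a DBpedia resource URI from a Wikipedia article URL."""
    path = urlsplit(url).path
    if not path.startswith(WIKIPEDIA_PATH_PREFIX):
        raise ValueError(f"not a Wikipedia article URL: {url!r}")
    name = path[len(WIKIPEDIA_PATH_PREFIX):]
    if not name:
        raise ValueError(f"no article name in URL: {url!r}")
    # The article name is reused unchanged as the local part of the URI.
    return DBPEDIA_RESOURCE_NS + name


print(wikipedia_url_to_dbpedia_uri("http://en.wikipedia.org/wiki/Billie_Holiday"))
```

The sketch covers only the name-to-name correspondence; it makes no attempt to handle redirects or non-English language editions.
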
  13. 13. Dereferencing the URI DBpedia: Billie Holiday’s Green Page
  14. 14.
  15. 15. /Billie_Holiday
  16. 16. DBPEDIA SEMANTICS 4.58 million “things” 583 million “facts”
  17. 17. “Facts” as RDF Triples has name Subject Predicate Object (Thing) Billie Holiday
  18. 18. GENERATING FACTS FOR THE ENTITY BILLIE HOLIDAY has name Subject Predicate Object S <> P <> O ”Billie Holiday” Billie Holiday
  19. 19. S <> P <http://dbpedia-owl:alias> O “Lady Day”
  20. 20. S <> P <http://dbpedia-owl:occupation> O <>
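
As a rough illustration of the subject/predicate/object pattern on the preceding slides, the facts can be held as plain tuples and printed in a Turtle-like notation. The subject `dbpedia:Billie_Holiday` and the `foaf:name` predicate are assumptions (the transcript leaves those URIs blank); `dbpedia-owl:alias` and the literals come from the slides. The `Triple` type and `to_turtle` helper are hypothetical names.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    obj_is_literal: bool  # True for strings such as names, False for URIs


def to_turtle(triple: Triple) -> str:
    """Render one triple as a Turtle-style statement using prefixed names."""
    if triple.obj_is_literal:
        # Escape backslashes and double quotes inside the literal.
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = triple.obj
    return f"{triple.subject} {triple.predicate} {obj} ."


FACTS = [
    # Subject and foaf:name are assumed; the slide shows them as empty <>.
    Triple("dbpedia:Billie_Holiday", "foaf:name", "Billie Holiday", True),
    Triple("dbpedia:Billie_Holiday", "dbpedia-owl:alias", "Lady Day", True),
]

for fact in FACTS:
    print(to_turtle(fact))
```
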
  21. 21. CHARTING DBPEDIA Extraction Mapping Categorization
  22. 22. HARVESTING FACTS Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorization information, images, geo-coordinates, and links to external Web pages.
  23. 23. DBPEDIA COMPONENTS Source:
  24. 24. DBPEDIA COMPONENTS Extractors turn a specific type of wiki markup into triples.
  25. 25. DBPEDIA COMPONENTS Extractors turn a specific type of wiki markup into triples.
  26. 26. The core of DBpedia consists of an infobox extraction process. Infoboxes are templates contained in many Wikipedia articles. They are usually displayed in the top right corner of articles and contain factual information.
  27. 27. Infobox for MusicalArtist
  28. 28. INFOBOX EXTRACTION Raw Infobox Extraction – create triples directly from the infobox data. Mapping-based Infobox Extraction – mappings against the DBpedia Ontology.
  29. 29. RAW INFOBOX EXTRACTION Generic Algorithm-based Retains property names used in the infobox Properties are identified by the dbpprop prefix.
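
A minimal sketch of the raw approach, assuming infobox markup with one `| key = value` pair per line. It keeps each property name exactly as written and only adds the `dbpprop:` prefix. The sample markup, the regular expression, and the function name are illustrative assumptions, not the DBpedia extraction framework itself.

```python
import re
from typing import List, Tuple

# Matches lines such as "| birth_name = Eleanora Fagan".
INFOBOX_LINE = re.compile(r"^\s*\|\s*(?P<key>[^=|]+?)\s*=\s*(?P<value>.*?)\s*$")


def raw_infobox_triples(subject: str, infobox_markup: str) -> List[Tuple[str, str, str]]:
    """Turn each non-empty infobox attribute into a (subject, property, value) triple."""
    triples = []
    for line in infobox_markup.splitlines():
        match = INFOBOX_LINE.match(line)
        if match is None or not match.group("value"):
            continue  # skip template delimiters and empty attributes
        # The property name is retained as-is, only prefixed.
        triples.append((subject, "dbpprop:" + match.group("key"), match.group("value")))
    return triples


# Hypothetical sample markup for illustration only.
SAMPLE = """{{Infobox musical artist
| name = Billie Holiday
| birth_name = Eleanora Fagan
| occupation =
}}"""

for triple in raw_infobox_triples("dbpedia:Billie_Holiday", SAMPLE):
    print(triple)
```

Because nothing is normalized, two infoboxes that spell the same attribute differently yield two different properties, which is the data quality cost noted for this approach.
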
  30. 30. MAPPING-BASED INFOBOX EXTRACTION Mapping of infobox data to community-curated DBpedia Ontology. Properties are identified by the dbpedia-owl prefix.
  31. 31. RAW INFOBOX EXTRACTION Pros: Complete coverage of all the infobox attributes (not all the infoboxes have been mapped yet). Cons: Lower data quality (synonyms are not resolved, e.g., placeOfBirth/birthPlace; high error rate when determining the datatype of an attribute value).
  32. 32. MAPPING-BASED INFOBOX EXTRACTION Pros: Data is cleaner (typing resources, merging name variants, assigning specific datatypes to the values). Cons: Not full coverage. Of 4.58 million things, 4.22 million are classified in a consistent ontology.
  33. 33. Normalization of variant names
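
The trade-off between the two extraction modes can be sketched with a small lookup table: variant attribute names that have a mapping collapse onto one `dbpedia-owl:` property, while unmapped attributes are set aside. The table below is a hypothetical stand-in for the community-curated mappings; only the placeOfBirth/birthPlace pair comes from the slides.

```python
from typing import Dict, Tuple

# Hypothetical mapping table; the real mappings are community-curated.
PROPERTY_MAPPING: Dict[str, str] = {
    "placeOfBirth": "birthPlace",
    "birthPlace": "birthPlace",
}


def map_properties(raw: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split raw infobox attributes into ontology-mapped and unmapped ones."""
    mapped: Dict[str, str] = {}
    unmapped: Dict[str, str] = {}
    for key, value in raw.items():
        target = PROPERTY_MAPPING.get(key)
        if target is None:
            unmapped[key] = value  # not covered by any mapping
        else:
            mapped["dbpedia-owl:" + target] = value  # variant names merged
    return mapped, unmapped


mapped, unmapped = map_properties({"placeOfBirth": "X", "someUnmappedField": "Y"})
print(mapped)
print(unmapped)
```

The `unmapped` dictionary is where the "not full coverage" cost shows up: those attributes survive only in the raw extraction.
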
  34. 34. THE DBPEDIA ONTOLOGY Cross-domain ontology. Large thematic coverage. Currently covers 685 classes which form a subsumption hierarchy and 2,795 different properties describing the classes (aircraftHelicopterAttack). Shallow (≤ 5 levels)
  35. 35. THE DBPEDIA ONTOLOGY Because the DBpedia Ontology is built upon infobox templates, its semantic structure suffers from a lack of logical consistency and presents significant semantic gaps in the hierarchy.
  37. 37. Hierarchy is kept shallow (for the sake of visualization and navigation).
  39. 39. WIKIPEDIA CATEGORY SYSTEM Wikipedia uses categories to group articles that share similar subjects. Wikipedia categories are constantly evolving and currently number more than 740,000. 80.9 million links to Wikipedia categories.
  40. 40. WIKIPEDIA CATEGORY SYSTEM Most categories are assigned manually by Wikipedia contributors and can be found listed as links at the bottom of a Wikipedia article.
  41. 41. CATEGORIZING PEOPLE At least four categories: •  the year the person was born •  the year they died •  their nationality •  their reason for being notable.
  42. 42. CATEGORIZATION OF PEOPLE First sentence of an article: Billie Holiday (born Eleanora Fagan; April 7, 1915 – July 17, 1959) was an American jazz singer and songwriter. Year born: Category:1915 births. Year died: Category:1959 deaths. Nationality: Category:American people. Reason for notability / Occupation: Category:Musicians
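
Categories are assigned manually by contributors, but the correspondence between the lead sentence and the birth and death categories is regular enough to illustrate in code. The heuristic below, which reads the first two four-digit years in the opening parenthetical, is an assumption made for the sketch and would not hold for every article.

```python
import re
from typing import List


def year_categories(first_sentence: str) -> List[str]:
    """Guess the birth/death year categories from an article's first sentence."""
    parenthetical = re.search(r"\(([^)]*)\)", first_sentence)
    if parenthetical is None:
        return []
    years = re.findall(r"\b(\d{4})\b", parenthetical.group(1))
    categories = []
    if len(years) >= 1:
        categories.append(f"Category:{years[0]} births")
    if len(years) >= 2:
        categories.append(f"Category:{years[1]} deaths")
    return categories


SENTENCE = (
    "Billie Holiday (born Eleanora Fagan; April 7, 1915 – July 17, 1959) "
    "was an American jazz singer and songwriter."
)
print(year_categories(SENTENCE))
```
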
  43. 43. WIKIPEDIA CATEGORY SYSTEM Collaborative effort. Advantages → categories are continually updated to correspond with article content. Dis/advantages → lack of consistency in its hierarchical structure and “rather loose relatedness between articles” (Bizer et al., 2009). “Messy hierarchy”
  44. 44. RE-CATEGORIZATION OF BILLIE HOLIDAY (→External links: re-categorisation per Wikipedia:Categories for discussion/Log/2014 December 26, replaced: Category:American women composers → Category:American female composers) (undo) -- (Robot - Moving category African-American female musicians to Category:African-American musicians per CFD at Wikipedia:Categories for discussion/Log/2013 January 10.)
  45. 45. WIKIPEDIA ONTOLOGY IN DBPEDIA The hierarchical structure of the categories is represented in DBpedia by way of two different properties: dcterms:subject (relate entity to category) skos:broader (relate child to parent category)
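
A small sketch of how the two properties combine: `dcterms:subject` gives an entity's direct categories, and following `skos:broader` repeatedly collects every ancestor category. The dictionaries below are hypothetical toy data standing in for the triples, and a visited set guards against the cycles a loosely structured hierarchy can contain.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical toy data: entity -> categories (dcterms:subject).
SUBJECT: Dict[str, List[str]] = {
    "dbpedia:Example_Artist": ["Category:A"],
}
# Hypothetical toy data: child category -> parent categories (skos:broader).
BROADER: Dict[str, List[str]] = {
    "Category:A": ["Category:B"],
    "Category:B": ["Category:C", "Category:A"],  # deliberate cycle back to A
}


def all_categories(entity: str,
                   subject: Dict[str, List[str]],
                   broader: Dict[str, List[str]]) -> Set[str]:
    """Collect direct and ancestor categories of an entity (breadth-first)."""
    seen: Set[str] = set()
    queue = deque(subject.get(entity, []))
    while queue:
        category = queue.popleft()
        if category in seen:
            continue  # already visited; prevents infinite loops on cycles
        seen.add(category)
        queue.extend(broader.get(category, []))
    return seen


print(sorted(all_categories("dbpedia:Example_Artist", SUBJECT, BROADER)))
```
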
  46. 46. The hierarchy of categories between “flower” and “cucumber”
  48. 48. YAGO ONTOLOGY A robust classification scheme with a deep hierarchical structure. Originally derived from the Wikipedia category system using the semantic lexicon WordNet. Over 350,000 classes; 100 relationships Provides DBpedia data with coherence and structural consistency A taxonomic backbone
  49. 49. QUERYING DBPEDIA FOR LINKED JAZZ Jazz Name Vocabulary: personal name vocabulary in the form of RDF statements including the artist’s name paired with a Uniform Resource Identifier (URI). <> <> “Billie Holiday”
  50. 50. QUERYING DBPEDIA FOR LINKED JAZZ DBpedia was initially queried for literal triples with a foaf:name predicate that satisfied the following criteria: 1. the entity must be an rdf:type of dbpedia-owl:MusicalArtist 2. must have dbpedia:genre property: dbpedia:Jazz.
  51. 51. QUERYING DBPEDIA FOR LINKED JAZZ DBpedia was initially queried for literal triples with a foaf:name predicate that satisfied the following criteria: 1. the entity must be an rdf:type of dbpedia-owl:MusicalArtist 2. must have dbpedia:genre property: dbpedia:Jazz. + rdfs:label → name of the resource
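
The two criteria translate fairly directly into a SPARQL query, sketched here as a Python string builder. This is a reconstruction under assumptions rather than the project's actual query: the namespace URIs are the commonly documented ones, the genre property is written as `dbpedia-owl:genre` although the slide writes `dbpedia:genre`, and `build_name_query` is a hypothetical name. Nothing is sent over the network.

```python
# Assumed namespace bindings for the prefixes used on the slides.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "dbpedia": "http://dbpedia.org/resource/",
}


def build_name_query(genre: str = "dbpedia:Jazz") -> str:
    """Return a SPARQL query for the names of musical artists in one genre."""
    header = "\n".join(f"PREFIX {prefix}: <{uri}>" for prefix, uri in PREFIXES.items())
    body = (
        "SELECT DISTINCT ?artist ?name WHERE {\n"
        "  ?artist rdf:type dbpedia-owl:MusicalArtist ;\n"   # criterion 1
        f"          dbpedia-owl:genre {genre} ;\n"            # criterion 2 (assumed property)
        "          foaf:name ?name .\n"                       # the literal being harvested
        "}"
    )
    return header + "\n" + body


print(build_name_query())
```
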
  52. 52. QUERYING DBPEDIA FOR LINKED JAZZ Prominent musicians who we expected to find by querying dbpedia:Jazz property were not returned. Example: “Count Basie” - fell under dbpedia:Swing_music, dbpedia:Big_band_music and dbpedia:Piano_blues - not under dbpedia:Jazz. This required us to revise our query method by expanding it to include additional relevant music genres.
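
The effect of widening the genre list can be shown on local data without querying anything. Count Basie's genres are taken from the slide; the second artist is a hypothetical placeholder, and the expanded set here contains only the genres the slide names, which is an assumption about what "additional relevant music genres" included.

```python
from typing import Dict, List, Set

JAZZ_ONLY: Set[str] = {"dbpedia:Jazz"}
# Assumed expansion: just the genres mentioned on the slide.
EXPANDED: Set[str] = JAZZ_ONLY | {
    "dbpedia:Swing_music",
    "dbpedia:Big_band_music",
    "dbpedia:Piano_blues",
}

ARTIST_GENRES: Dict[str, Set[str]] = {
    "Count Basie": {"dbpedia:Swing_music", "dbpedia:Big_band_music", "dbpedia:Piano_blues"},
    "Hypothetical Jazz Artist": {"dbpedia:Jazz"},  # placeholder for contrast
}


def select_artists(artist_genres: Dict[str, Set[str]], wanted: Set[str]) -> List[str]:
    """Return artists having at least one genre in the wanted set, sorted by name."""
    return sorted(name for name, genres in artist_genres.items() if genres & wanted)


print(select_artists(ARTIST_GENRES, JAZZ_ONLY))
print(select_artists(ARTIST_GENRES, EXPANDED))
```

With the narrow set the first call misses Count Basie; with the expanded set both artists are returned.
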
  53. 53. Name Extraction from DBpedia: Bootstrapping & Querying
  54. 54. IN SUM New type of knowledge representation environment -constant state of flux. -decentralized interplay of different descriptive and classification systems. -it challenges our tolerance threshold for data quality and our traditional notion of authority control.
  55. 55. LodLive Visualizing DBpedia
  56. 56. Thank You! @cristinapattuel