
Sebastian Hellmann

Data Quality and Data Usage in a large-scale Multilingual Knowledge Graph


  1. DBpedia - A Global Open Knowledge Network. Sebastian Hellmann and Sören Auer
  2. Outline: 1. Concepts and DBpedia Strategy; 2. Technologies; 3. Outlook
  3. Introduction ● Core and context separation ○ Core data: high value, low maintenance ○ Context data: low value, high maintenance ● Fraud detection (credit-card institute) ○ Core data: credit card transactions ○ Context data: public information (ATMs, cities, flight plans, products, crime rate, etc.) ● Supply-chain management (manufacturing) ○ Core data: manufacturing know-how (what can you build?) ○ Context data: supplier market ● Publishers ○ Core data: content ○ Context data: taxonomies and items to describe the content (persons, places, events)
  4. Common Challenges for us ● Speed of data ingestion ○ How fast can you find, understand and integrate external data? ● Virtually no feedback mechanisms to data providers ● No effective collaboration on your data, although you are curating the same data as others. How do we enable collaboration on data?
  5. DBpedia Strategy Overview. Starting point: DBpedia is the most successful open knowledge graph (OKG). OKG Governance -> Collaboration & Curation -> maximum societal value. Medium-term goals: ● 10 million users ● millions of active contributors ● thousands of new businesses and initiatives ● take DBpedia to a global level
  6. Potentializing Societal Value by: ● OKG Governance - licensing, incubation, maturity model for OKGs ○ an Apache Foundation for data ● OKG Collaboration & Curation - for individuals & organisations ○ a Git and GitHub for data ● Providing a trustworthy global OKG infrastructure - for enterprises small and large as well as non-profits and societal initiatives alike ● Maximizing societal value of open knowledge by incubating open knowledge initiatives and businesses (e.g. in education, public health, open science)
  7. GitHub for Data. DBpedia aims to create a knowledge graph curation service, which allows communities to collaborate on rich semantic representations. ● The knowledge graph uses the RDF data model as scaffold but is augmented with rich metadata about provenance, discourse, evolution etc. ● Atomic units of the knowledge graph are facts/statements, which are aggregated into resources/entity descriptions ● All contributions and changes are tracked and versioned
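The fact-level tracking and versioning described on this slide can be sketched as a change log over atomic statements. This is a toy illustration only, not DBpedia's actual storage layer; the class names and the sample triple are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Statement:
    # An atomic fact: a subject-predicate-object triple.
    subject: str
    predicate: str
    obj: str

@dataclass
class Change:
    # Provenance metadata attached to every contribution.
    statement: Statement
    action: str          # "add" or "remove"
    contributor: str
    timestamp: str

class VersionedGraph:
    """Toy statement-level store: every change is tracked, nothing is lost."""
    def __init__(self):
        self.history: list[Change] = []

    def add(self, stmt: Statement, contributor: str) -> None:
        self.history.append(Change(stmt, "add", contributor,
                                   datetime.now(timezone.utc).isoformat()))

    def remove(self, stmt: Statement, contributor: str) -> None:
        self.history.append(Change(stmt, "remove", contributor,
                                   datetime.now(timezone.utc).isoformat()))

    def current(self) -> set[Statement]:
        # Replay the full change log to obtain the latest graph state.
        state: set[Statement] = set()
        for ch in self.history:
            (state.add if ch.action == "add" else state.discard)(ch.statement)
        return state

g = VersionedGraph()
s = Statement("dbr:London", "dbo:country", "dbr:United_Kingdom")
g.add(s, "alice")
print(len(g.current()))  # 1
```

Because the state is derived by replaying the log, every past version of the graph remains reconstructible, which is the "Git for data" property the slide alludes to.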
  8. OKG Clearinghouse and Steward. DBpedia will be a clearinghouse for OKG contributions and a steward for their sustainable maintenance ● Open Data license and contributor agreements ● Incubation and maturity model for OKG assets ○ Based on an automatic and sample-based quality and coverage assessment ● Continuous integration for OKG assets ○ Automatic link generation ○ Execution of test cases ○ OKG publication in various human-readable and machine-readable formats ● Communication and collaboration infrastructure for OKG communities
  9. Comparing with related initiatives. ● Use-case driven: contrast with platform-first approaches; we build the platform to support the use cases. ● Collaboration-driven: contrast with volunteer-driven initiatives (like Wikidata); improving completeness & correctness of the areas that are most used by the partners (full-time curators instead of sporadic ones). ● Knowledge-integration-driven: contrast with loose collections (like data markets); we make every small piece of information identifiable & referenceable ⇒ a knowledge melting pot. We are willing and able to collaborate with and integrate other open data initiatives.
  10. “Publish and Link” falls short for the Web of Data. Connecting data is about connecting people and organisations.
  11. Collaboration in the Web of Data. Connecting data is about connecting people and organisations. ● DBpedia’s mission is to ○ serve as an access point for data ○ facilitate collaboration ○ disseminate data on a global scale
  12. Data Incubator model ● LVL 0: Excel anarchy, no governance (Counselling) ● LVL 1: DBpedia Contributor Requirements (Analysis) ● LVL 2: Shared OKG Governance (Integration) ● LVL 3: Full collaboration benefits (Collaboration) ● LVL 3: access to all relevant data, links and users of the ecosystem. By reaching LVL 3, the cost of maintaining LVL 2 as well as OKG Governance and Curation is shared effectively with the DBpedia ecosystem.
  13. LVL 0: Excel Anarchy, No Governance ● Each employee/department governs their own data (anarchy) ● Intensive counselling required ● Best to build a parallel structure and show the value (KG prototype)
  14. LVL 1: DBpedia Contributor Requirements ● Stable identifiers ● Good level of schema unification and management ● Data strategy & knowledge graph available ● Core and context separation ○ Core data: high value, low maintenance ○ Context data: very low value, very high maintenance (commodity) ● What data maintenance can you outsource to DBpedia? (Analysis)
  15. LVL 2: Shared OKG Governance ● Technical steps (Integration): ○ Identifier linking ○ Schema mapping ○ Release data into DBpedia ● Continuously maturing tool stack to improve these three steps ● The DBpedia Association comprises a large network of universities -> we mediate internships to tackle the above tasks
  16. LVL 3: Full collaboration benefits ● Link triangulation (who links to you? subscription) ● Source validation (error reports) ● Data comparison (your data with all other data sources) ● Mediated contact to other organisations with the same data ● Any user feedback is directed to the sources
  17. Incubator model ● Organisations… ○ can use the DBpedia incubator model to improve their OKG ○ each organisation that joins adds data and experience to DBpedia ● DBpedia… ○ acts as the mediator ○ will distribute value to other organisations and users on a global scale
  18. Technologies ● ID management + linking ● DataID (metadata treatment) ● Data comparison and feedback ● SHACL - test-driven data development
  19. ID Management + Linking ● For each source ID, DBpedia assigns a local DBpedia ID ● Links are then grouped into clusters ● From each cluster a representative ID is chosen; the others become redirects ● Properties: ○ Every imported entity is identifiable and traceable via its local ID ○ Holistic identifier space -> allows complete linkage ○ Stable IDs allow link accuracy to improve over time
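The grouping of linked IDs into clusters with one stable representative can be sketched with a union-find structure. A minimal sketch only: the `db:` identifiers are made up, and choosing the representative lexicographically is an assumed stand-in for whatever policy the real ID service applies:

```python
class IdCluster:
    """Union-find over local IDs: each cluster's representative is the
    canonical ID; every other member resolves to it as a redirect."""

    def __init__(self):
        self.parent: dict[str, str] = {}

    def find(self, x: str) -> str:
        # Follow redirects to the cluster representative (with path halving).
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def link(self, a: str, b: str) -> None:
        # A newly discovered identity link merges two clusters.
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Deterministic choice of representative keeps IDs stable
            # across re-runs; the loser becomes a redirect.
            keep, redirect = sorted((ra, rb))
            self.parent[redirect] = keep

c = IdCluster()
c.link("db:Q1", "db:L42")   # hypothetical local DBpedia IDs
c.link("db:L42", "db:X99")
print(c.find("db:X99"))     # db:L42
```

Because redirects are kept rather than deleted, old IDs remain resolvable, which is what makes the "improve link accuracy over time" property workable.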
  20. DataID (metadata treatment) ● FOAF and WebID -> you keep your data local, all online accounts are updated automatically ● DataID -> a DCAT extension: keep the data description locally, all data repositories will be updated automatically
  21. Data comparison and feedback. Show differences in the data, e.g. areaTotal of London, population of London (Wikidata property P1082)
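The cross-source comparison this slide illustrates can be sketched as a function that flags disagreeing values for a property so feedback can be routed back to the sources. The function name and the population figures below are illustrative assumptions, not authoritative values:

```python
def compare(property_name: str, source_values: dict[str, int]) -> dict:
    """Report whether the sources disagree on one property of one entity.

    source_values maps source name -> value reported by that source.
    """
    distinct = set(source_values.values())
    if len(distinct) > 1:
        # Disagreement: surface all values so curators can investigate.
        return {"property": property_name,
                "conflict": True,
                "values": dict(source_values)}
    return {"property": property_name, "conflict": False}

# Illustrative numbers only, not real figures for London.
report = compare("populationTotal", {
    "dbpedia": 8_799_800,
    "wikidata": 8_961_989,   # Wikidata models population as property P1082
})
print(report["conflict"])  # True
```

In practice such a report would feed the error-reporting and feedback channels mentioned on the LVL 3 slide, directing each conflict back to the originating sources.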
  22. CrossWikiFact
  23. Test-driven data development ● Test-driven data development (2014) ● Dimitris Kontokostas (CTO of DBpedia), co-editor of the SHACL specification ● RDFUnit ○ uses machine learning (DL-Learner) to enrich the OWL schema ○ TAGs - Test Autogenerators derived from the enriched schema ○ 44,000 tests generated from the DBpedia Ontology ● Tests are transferred to sources (schema mapping) ● Tests are written collaboratively: ○ Universal: deathDate should not be before birthDate ○ Shared: specialised domain and application tests. Reference: Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri. In Proceedings of the 23rd International Conference on World Wide Web (WWW 2014).
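A universal test like "deathDate should not be before birthDate" can be sketched as a plain validation function. This is only an illustration of the idea; RDFUnit itself expresses such tests as SPARQL/SHACL patterns, and the entity records below are made up:

```python
from datetime import date

def check_lifespan(entity: dict) -> list[str]:
    """Universal test: deathDate must not precede birthDate."""
    birth, death = entity.get("birthDate"), entity.get("deathDate")
    if birth is not None and death is not None and death < birth:
        return [f"{entity['id']}: deathDate {death} before birthDate {birth}"]
    return []  # missing dates are not a violation of this test

errors: list[str] = []
for e in [
    {"id": "dbr:Ada_Lovelace",
     "birthDate": date(1815, 12, 10), "deathDate": date(1852, 11, 27)},
    {"id": "dbr:Broken_Entity",       # hypothetical record that
     "birthDate": date(1900, 1, 1),   # violates the constraint
     "deathDate": date(1850, 1, 1)},
]:
    errors.extend(check_lifespan(e))
print(len(errors))  # 1
```

Auto-generating many such checks from an (enriched) schema and running them on every release is the continuous-integration step the clearinghouse slide describes.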
  24. DBpedia in 10 years ● DBpedia connects hundreds of thousands of data spaces (centralised-decentralised architecture) ● Data about the world is a commodity (freely available to everybody) ● Working with data will be fun
  25. Become a supporter or an early adopter. This is not a vision of the far future; it is happening now.
  26. Contact for the DBpedia Association (non-profit) @dbpedia