DBpedia - A Global Open Knowledge Network
Sebastian Hellmann and Sören Auer
http://dbpedia.org
Outline
1. Concepts and DBpedia Strategy
2. Technologies
3. Outlook
Introduction
● Core and Context separation
○ Core Data: High value, low maintenance
○ Context data: Low value, high maintenance
● Fraud detection (Credit Card institute)
○ Core data: Credit card transactions
○ Context data: Public information (ATMs, cities, flight plans, products, crime rate, etc.)
● Supply-chain management (Manufacturing)
○ Core data: Know-how of manufacturing (What can you build?)
○ Context data: Supplier market
● Publishers
○ Core data: Content
○ Context data: Taxonomies and items to describe the content (Persons, Places, Events)
Common Challenges for us
● Speed of data ingestion
○ How fast can you find, understand and integrate external data?
● Virtually no feedback mechanisms to data providers
● No effective collaboration on your data, although you are curating the same
data as others
How do we enable collaboration on data?
DBpedia Strategy Overview
Starting point
DBpedia is the most successful
open knowledge graph (OKG)
Path: OKG Governance → Collaboration & Curation → Max. societal value

Medium-term goals:
● 10 million users
● millions of active contributors
● thousands of new businesses
and initiatives

Take DBpedia to a global level.
Unlocking Societal Value by:
● OKG Governance - licensing, incubation, maturity model for OKGs
○ Apache Foundation for data
● OKG Collaboration & Curation - for individuals & organizations
○ Git and GitHub for data
● Providing a trustworthy global OKG infrastructure - for enterprises small
and large as well as non-profits and societal initiatives alike
● Maximizing societal value of open knowledge by incubating open
knowledge initiatives and businesses (e.g. in education, public health, open
science)
GitHub for Data
DBpedia aims to create a knowledge graph curation service, which allows
communities to collaborate on rich semantic representations.
● The knowledge graph uses the RDF data model as a scaffold, but is augmented
with rich metadata about provenance, discourse, evolution, etc.
● Atomic units of the knowledge graph are facts/statements, which are
aggregated into resources/entity descriptions
● All contributions and changes are tracked and versioned
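The three properties above can be sketched as a tiny in-memory store: a minimal illustration of "Git for data", in which atomic statements carry provenance and every change is appended as a new version rather than overwriting. All names here (`Statement`, `StatementStore`) are hypothetical, not DBpedia's actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    source: str    # provenance: who contributed the fact
    version: int   # monotonically increasing revision number

class StatementStore:
    """Append-only log of statements; history is never rewritten."""

    def __init__(self):
        self.history = []  # list[Statement], full version history

    def assert_fact(self, s, p, o, source):
        stmt = Statement(s, p, o, source, version=len(self.history) + 1)
        self.history.append(stmt)
        return stmt

    def latest(self, s, p):
        """Current value of (s, p): the most recent assertion wins."""
        for stmt in reversed(self.history):
            if stmt.subject == s and stmt.predicate == p:
                return stmt
        return None

store = StatementStore()
store.assert_fact("dbr:London", "dbo:populationTotal", "8673713", "wikipedia")
store.assert_fact("dbr:London", "dbo:populationTotal", "8799800", "ons")
print(store.latest("dbr:London", "dbo:populationTotal").source)  # -> ons
```

Because the log is append-only, earlier contributions remain traceable even after a fact is corrected, which is what makes fact-level collaboration auditable.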
OKG Clearinghouse and Steward
DBpedia will be a clearinghouse for OKG contributions and steward for their sustainable
maintenance
● Open Data license and contributor agreements
● Incubation and maturity model for OKG assets
○ Based on an automatic and sample-based quality and coverage assessment
● Continuous integration for OKG assets
○ Automatic link generation
○ Execution of test cases
○ OKG publication in various human-readable and machine-readable formats
● Communication and collaboration infrastructure for OKG communities
Comparing with related initiatives
Use-case driven
● contrast with platform-first approaches (repositories like DataHub.io)
● We build the platform to support the use cases.
Collaboration-driven
● contrast with volunteer-driven (like Wikidata)
● improving completeness & correctness of the areas most used by the partners (full-time
curators instead of sporadic volunteers)
Knowledge-integration-driven
● contrast with loose collections (like data markets)
● We make every small piece of information identifiable & referenceable
⇒ knowledge melting pot.
We are willing and able to collaborate with and integrate other open data initiatives.
“Publish and Link” falls short for the Web of Data
Connecting data is about connecting people and organisations.
Collaboration in the Web of Data
Connecting data is about connecting people and organisations.
● DBpedia’s mission is to
○ serve as an access point for data
○ facilitate collaboration
○ disseminate data on a global scale
Data Incubator model
(Figure: four-level incubator pyramid)
● LVL 0: Excel anarchy, no governance → Counselling
● LVL 1: DBpedia Contributor Requirements → Analysis
● LVL 2: Shared OKG Governance → Integration
● LVL 3: Full collaboration benefits → Collaboration

LVL 3 grants access to all relevant data, links, and users of the ecosystem. By reaching LVL 3, the cost of maintaining LVL 2, as well as OKG Governance and Curation, is shared effectively with the DBpedia ecosystem.
LVL 0: Excel Anarchy, No Governance
● Each employee/department governs its own data (anarchy)
● Intensive counselling required
● Best to build a parallel structure and show the value (KG prototype)
LVL 1: DBpedia Contributor Requirements
● Stable identifiers
● Good level of schema unification and management
● Data strategy & Knowledge Graph available
● Core and Context separation
○ Core Data: High value, low maintenance
○ Context data: Very low value, very high maintenance (commodity)
● What data maintenance can you outsource to DBpedia? (Analysis)
LVL 2: Shared OKG Governance
● Technical steps (Integration):
○ Identifier Linking
○ Schema Mapping
○ Release data into DBpedia
● Continuously maturing tool stack to improve these three steps
The DBpedia Association comprises a large network of universities
-> we mediate internships to tackle the tasks above
LVL 3: Full collaboration benefits
● Link triangulation (subscription: who links to you?)
● Source validation (error reports)
● Data comparison (your data against all other data sources)
● Mediate contact to other organisations with the same data
● Any user feedback is directed to the sources
Incubator model
● Organisations…
○ can use the DBpedia incubator model to improve their OKG
○ each joining organisation enriches DBpedia with data and experience
● DBpedia...
○ acts as the mediator
○ will distribute value to other orgs and users on a global scale
Technologies
● ID Management + Linking
● DataID (Metadata treatment)
● Data comparison and feedback
● SHACL - Test-driven data development
ID Management + Linking
● For each source ID, DBpedia assigns a local DBpedia ID
● Links are then grouped into clusters
● From the cluster a representative ID is chosen, others are redirects
● Properties:
○ Every imported entity is identifiable and traceable via its local ID
○ Holistic identifier space -> allows complete linkage
○ Stable IDs allow link accuracy to be improved over time
● http://dbpedia.github.io/links/tools/linkviz/
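The clustering step above can be sketched with a union-find structure: sameAs-style links between source IDs are merged into clusters, and one member per cluster is chosen as the representative while the rest become redirects. This is an illustrative sketch of the idea, not DBpedia's actual algorithm; the example IDs and the "smallest ID wins" rule are assumptions.

```python
from collections import defaultdict

parent = {}

def find(x):
    """Root of x's cluster, with path halving for near-constant lookups."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    """Merge the clusters containing a and b."""
    parent[find(a)] = find(b)

# Example links between IDs from different sources (illustrative)
links = [("wd:Q84", "dbp:London"), ("dbp:London", "geo:2643743")]
for a, b in links:
    union(a, b)

# Group IDs into clusters and pick a representative; the others redirect to it
clusters = defaultdict(list)
for node in list(parent):
    clusters[find(node)].append(node)

for members in clusters.values():
    representative = min(members)  # arbitrary but deterministic choice
    redirects = {m: representative for m in members if m != representative}
    print(representative, redirects)
```

Because the representative is chosen deterministically from a stable cluster, new links can only merge clusters and refine redirects; existing IDs never break.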
DataID (Metadata treatment)
● FOAF and WebID -> you keep your profile data local; all online accounts are
updated automatically
● DataID -> a DCAT extension; keep the data description local, and all data
repositories are updated automatically
Data comparison and feedback
Show differences in the data (CrossWikiFact):
http://downloads.dbpedia.org/temporary/crosswikifact/results/q84.html
http://wikidata.org/wiki/Q84
Examples: areaTotal of London; population of London (P1082)
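The comparison idea can be sketched as follows: fetch the same property of the same entity from several sources and report agreements and conflicts, which can then be fed back to the source maintainers. The source names, property names, and values below are illustrative, not actual CrossWikiFact output.

```python
# Two sources describing the same entity (values are illustrative)
sources = {
    "dbpedia":  {"populationTotal": 8673713, "areaTotal": 1572.0},
    "wikidata": {"populationTotal": 8799800, "areaTotal": 1572.0},
}

def compare(sources, prop):
    """Collect each source's value for prop and flag agreement or conflict."""
    values = {name: data.get(prop) for name, data in sources.items()}
    status = "agree" if len(set(values.values())) == 1 else "conflict"
    return status, values

for prop in ("populationTotal", "areaTotal"):
    status, values = compare(sources, prop)
    print(f"{prop}: {status} {values}")
```

A conflict report like this is exactly the kind of feedback that flows back to data providers at LVL 3.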
Test-driven data development
● Test-driven data development (2014)
● Dimitris Kontokostas (CTO of DBpedia) is co-editor of the SHACL specification
● RDFUnit
○ uses Machine Learning (DL-Learner) to enrich the OWL schema
○ TAG - Test Autogenerators from enriched schema
○ 44,000 tests generated from the DBpedia Ontology
● Tests are transferred to sources (schema mapping)
● Tests are written collaboratively:
○ Universal: deathdate should not be before birthdate
○ Shared: specialised domain and application tests
Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann,
Roland Cornelissen, and Amrapali J. Zaveri. In Proceedings of the 23rd International Conference on World Wide Web (WWW 2014).
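A "universal" test like the one above (death date must not precede birth date) can be expressed as a plain predicate over the data. This is a minimal illustration in the spirit of RDFUnit's auto-generated constraints; the record format and test runner are assumptions, not RDFUnit's API.

```python
import datetime

# Illustrative records; the second deliberately violates the constraint
people = [
    {"id": "dbr:Ada_Lovelace", "birthDate": "1815-12-10", "deathDate": "1852-11-27"},
    {"id": "dbr:Broken_Entry", "birthDate": "1950-01-01", "deathDate": "1900-01-01"},
]

def death_not_before_birth(person):
    """Universal test: deathDate must not be before birthDate."""
    birth = datetime.date.fromisoformat(person["birthDate"])
    death = datetime.date.fromisoformat(person["deathDate"])
    return death >= birth

violations = [p["id"] for p in people if not death_not_before_birth(p)]
print(violations)  # -> ['dbr:Broken_Entry']
```

Once such a test is written collaboratively, it can be transferred to every source via schema mapping and run continuously, which is how tens of thousands of generated tests stay maintainable.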
DBpedia in 10 years
● DBpedia connects hundreds of thousands of data spaces
(centralised-decentralised architecture)
● Data about the world is a commodity (freely available to everybody)
● Working with data will be fun
Become a supporter or an early adopter
This is not a vision of the far future, it is happening now:
Contact for the DBpedia Association (non-profit)
dbpedia@infai.org @dbpedia
wiki.dbpedia.org @dbpedia.org
