Knowledge Graph Construction and the Role of DBPedia
Some uses of DBPedia & Wikidata for structuring internal information. Presented at: http://wiki.dbpedia.org/meetings/TheHague2016


  1. KNOWLEDGE GRAPHS AND THE ROLE OF DBPEDIA
     Paul Groth @pgroth, pgroth.com
     Thanks to Joao Moura
     Elsevier Labs @elsevierlabs
     6th DBpedia Community Meeting in The Hague 2016, Feb. 12, 2016
  2. FAVORITE DBPEDIA PREDICATE…
  3. OUTLINE
     • The Importance of Structure
     • Better taxonomies
     • Knowledge graph construction
  4. ELSEVIER LABS - INTRO
     WORLD LEADER IN DIGITAL INFO SOLUTIONS
     • Published over 330,000 articles in 2013
     • Founded over 130 years ago
     • Work with over 30 million scientists, students, health & information professionals
     • Employ over 7,000 employees in 24 countries
     • Received over 1 million submissions in 2013
     • Over the last 50 years the majority of Nobel Laureates have published with Elsevier
     • Over 53 million items indexed by Scopus
     • Elsevier eBooks, Online Journals, Databases: publishes over 2,200 online journals & over 10,000 e-books
     SOLUTIONS
     • Elsevier R+D Solutions: helps corporate researchers, R+D professionals, and engineers improve how they interact with, share, and apply information to solve problems using our digital workflow tools, analytics, and data.
     • Elsevier Research Intelligence: provides universities, governments, and research institutions with the resources and insights to improve institutional research strategy, management, and performance.
     • Elsevier Clinical Solutions: helps medical professionals apply trusted data and sophisticated tools to make better clinical decisions, deliver better care, and produce better healthcare outcomes.
     • Elsevier Education: helps educate highly-skilled, effective healthcare professionals, using the most advanced pedagogical tools and reference works.
     CONTENT, CAPABILITIES, PLATFORMS
  5. 60% OF TIME IS SPENT ON DATA PREPARATION
  6. STRUCTURED DATA
  7. STRUCTURED DATA
  8. CONNECTING DATA TO APPS
  9. BUILDING BETTER TAXONOMIES
     • Ontologies and taxonomies help organize and query content
       • Annotation
       • Classification / Navigation
       • Autocomplete
       • Suggestion & Recommendation
     • We have lots of taxonomies/ontologies
       • Journal Classification for Scopus
       • Mendeley classification system
       • Science Direct Subject classification
       • Reference Modules Hierarchies for Books
       • Submission system Journal classifications
       • …
     • Connect to external ontologies (e.g. MESH)
     • Ontology Maintenance, Usage and Mapping
  10. TAXONOMY INDUCTION
      Starting with a very shallow hierarchy of syntactical concepts with almost no intersections:
      1. Match concepts against a target (well-accepted) taxonomy and DBpedia.
         • Problems: the same concept may have different names or terminology in different branches; multiple languages; etc.
      2. Check for partial orders between these concepts, using the hierarchy of the target taxonomy and DBpedia (skos:broader).
      3. Find and complete missing links between concepts.
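  11. Example
      Given two concepts, check if they form a parent-child relation:

      PREFIX dbo: <http://dbpedia.org/ontology/>
      PREFIX dct: <http://purl.org/dc/terms/>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

      # Follow redirects from each concept, collect its categories (dct:subject),
      # and keep rows where a child category has a parent category as skos:broader.
      SELECT DISTINCT * WHERE {
        <http://dbpedia.org/resource/Model-checking> dbo:wikiPageRedirects* ?conceptChild .
        ?conceptChild dbo:wikiPageRedirects* ?redirectedChild .
        ?redirectedChild dct:subject ?subjectChild .
        <http://dbpedia.org/resource/Formal_methods> dbo:wikiPageRedirects* ?conceptParent .
        ?conceptParent dbo:wikiPageRedirects* ?redirectedParent .
        ?redirectedParent dct:subject ?subjectParent .
        ?subjectChild skos:broader ?subjectChildsParent .
        FILTER(?subjectChildsParent = ?subjectParent)
      }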
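
The parent-child check in the query above can also be sketched offline. The following is a minimal Python illustration, assuming the `dct:subject` and `skos:broader` links have already been fetched into two dictionaries; `SUBJECTS`, `BROADER`, their toy contents, and `is_parent_child` are hypothetical names and data introduced here, not actual DBpedia results or anything shown in the slides.

```python
from typing import Dict, Set

# Hypothetical toy data standing in for DBpedia; these are not real query results.
# SUBJECTS maps a concept to its categories (dct:subject).
SUBJECTS: Dict[str, Set[str]] = {
    "Model_checking": {"Category:Model_checking"},
    "Formal_methods": {"Category:Formal_methods"},
    "Glaucoma": {"Category:Eye_diseases"},
}
# BROADER maps a category to its broader categories (skos:broader).
BROADER: Dict[str, Set[str]] = {
    "Category:Model_checking": {"Category:Formal_methods"},
    "Category:Eye_diseases": {"Category:Ophthalmology"},
}


def is_parent_child(child: str, parent: str) -> bool:
    """True if some category of `child` has a category of `parent` as skos:broader."""
    parent_subjects = SUBJECTS.get(parent, set())
    for subject in SUBJECTS.get(child, set()):
        # Same condition as the FILTER in the query: a broader category of the
        # child's subject must equal one of the parent's subjects.
        if BROADER.get(subject, set()) & parent_subjects:
            return True
    return False


print(is_parent_child("Model_checking", "Formal_methods"))
print(is_parent_child("Formal_methods", "Model_checking"))
```

The check is deliberately one hop deep, like the query; a transitive variant would walk `BROADER` repeatedly.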
  12. TOWARDS AN ELSEVIER KNOWLEDGE GRAPH
      • Ongoing proof-of-concept work by Paul Groth, Sujit Pal and Ron Daniel of Elsevier Labs
      • Unsupervised, scalable and built with off-the-shelf technologies
      • Based on recent work at University College London and University of Massachusetts Amherst
        • Riedel, Sebastian, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. "Relation extraction with matrix factorization and universal schemas." (2013).
      [Pipeline diagram. Labels: Content; Universal schema; Surface form relations; Structured relations; Factorization model; Matrix Construction; Open Information Extraction; Entity Resolution; Matrix Factorization; Knowledge graph; Curation; Predicted relations; Matrix Completion; Taxonomy; Triple Extraction. Data labels: 14M articles from Science Direct; 3.3M triples; 475M triples; 49M triples; p x r matrix; p x k, k x r latent factor matrices; ~102 triples; 920K concepts from EMMeT]
  13. ENTITY RESOLUTION: GLAUCOMA
      Surface form triples downsampled from 49M entity-resolved triples
  14. ANNOTATION
      • http://www.slideshare.net/SparkSummit/dictionary-based-annotation-at-scale-with-spark-by-sujit-pal
      • What is the problem?
        • Annotate millions of documents from different corpora.
        • 14M docs from Science Direct alone.
        • More from other corpora, dependency parsing, etc.
        • Critical step for Machine Reading and Knowledge Graph applications.
      • Why is this such a big deal?
        • Takes advantage of existing linked data.
        • No model training for multiple complex STM domains.
        • However, simple until done at scale.
  15. ANNOTATION PIPELINE
  16. DICTIONARY BASED NE ANNOTATOR (SODA)
      • Part of Document Annotation Pipeline.
      • Annotates text with Named Entities from external Dictionaries.
      • Why do we have to scale (Wikipedia KBs) – 8 Million entities
      • Built with Open Source Components
        • Apache Solr – Highly reliable, scalable and fault-tolerant search index.
        • SolrTextTagger – Solr component for text tagging, uses Lucene FST technology.
        • Apache OpenNLP – Machine Learning based toolkit for processing Natural Language Text.
        • Apache Spark – Lightning fast, large scale data processing.
      • Uses ideas from other Open Source libraries
        • FuzzyWuzzy – Fuzzy String Matching like a boss.
      • Contributed back to Open Source
        • https://github.com/elsevierlabs-os/soda
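
To make the idea of dictionary-based annotation concrete, here is a minimal Python sketch of greedy longest-match tagging over whitespace tokens. It is not SoDA's implementation (which, per the slide, builds on Solr and SolrTextTagger); the `annotate` function, the `max_len` parameter, and the entity identifiers in `DICTIONARY` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def annotate(text: str, dictionary: Dict[str, str], max_len: int = 4) -> List[Tuple[int, int, str]]:
    """Return (start_token, end_token, entity_id) spans using greedy longest match."""
    tokens = text.lower().split()
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shrink it token by token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                match = (i, i + n, dictionary[phrase])
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched phrase
        else:
            i += 1
    return spans


# Hypothetical dictionary: surface form -> entity identifier.
DICTIONARY = {
    "glaucoma": "ent:glaucoma",
    "visual field loss": "ent:visual_field_loss",
}

print(annotate("Glaucoma can cause visual field loss", DICTIONARY))
```

A hash lookup per candidate phrase is fine for a toy dictionary; with millions of entries and documents, an index-backed tagger such as the one named on the slide is the more plausible choice.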
  17. TOWARDS AN ELSEVIER KNOWLEDGE GRAPH
      [Pipeline diagram repeated from slide 12, with the same stage and data labels]
  18. MATRIX CONSTRUCTION: GLAUCOMA
      • 83 x 176 sparse binary-valued matrix with 366 entries
      • Rows: entity pairs (p = 83)
      • Columns: surface form relations and structured relations (r = 176)
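
As a rough illustration of this construction step, the Python sketch below turns (subject, relation, object) triples into a sparse binary matrix stored as a set of filled (row, column) cells, with one row per entity pair and one column per relation. The `build_matrix` helper and the three sample triples are assumptions made for this example; the actual Elsevier Labs pipeline is not shown in the slides.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def build_matrix(triples: List[Triple]):
    """Index entity pairs as rows and relations as columns; return the filled cells."""
    rows: Dict[Tuple[str, str], int] = {}
    cols: Dict[str, int] = {}
    cells: Set[Tuple[int, int]] = set()
    for subj, rel, obj in triples:
        # setdefault assigns the next free index the first time a key is seen.
        r = rows.setdefault((subj, obj), len(rows))
        c = cols.setdefault(rel, len(cols))
        cells.add((r, c))  # a present cell means "1"; absent cells are "0"
    return rows, cols, cells


# Illustrative triples only.
triples = [
    ("glaucoma", "contributed to", "functional visual field loss"),
    ("glaucoma", "the risk of", "functional visual field loss"),
    ("glaucoma", "can appear soon in", "age over 40"),
]
rows, cols, cells = build_matrix(triples)
print(len(rows), "x", len(cols), "matrix with", len(cells), "entries")
```

With the slide's figures, 366 filled cells out of 83 x 176 = 14,608 is a density of about 2.5%, which is why a sparse representation is a natural fit.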
  19. MATRIX COMPLETION: GLAUCOMA
      Latent factor matrix (p = 83 rows) × latent factor matrix (r = 176 columns) = 83 x 176 real-valued matrix with 14,608 entries
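
The reconstruction step on this slide can be sketched as a plain matrix product. The Python example below multiplies a p x k matrix by a k x r matrix to obtain a dense p x r score matrix. Only p = 83 and r = 176 come from the slide; the latent dimension `K = 10`, the random placeholder factors (no training is performed here), and the logistic squashing of scores into (0, 1) are assumptions for illustration.

```python
import math
import random
from typing import List


def complete(row_factors: List[List[float]], col_factors: List[List[float]]) -> List[List[float]]:
    """Multiply a p x k matrix by a k x r matrix and squash each score into (0, 1)."""
    k = len(col_factors)
    r = len(col_factors[0])
    return [
        [
            1.0 / (1.0 + math.exp(-sum(row[f] * col_factors[f][j] for f in range(k))))
            for j in range(r)
        ]
        for row in row_factors
    ]


P, R, K = 83, 176, 10  # P and R are from the slide; K = 10 is an arbitrary assumption
rng = random.Random(0)
# Placeholder factor values; a real model would learn these from the observed cells.
row_factors = [[rng.gauss(0.0, 0.1) for _ in range(K)] for _ in range(P)]
col_factors = [[rng.gauss(0.0, 0.1) for _ in range(R)] for _ in range(K)]

scores = complete(row_factors, col_factors)
print(len(scores), "x", len(scores[0]), "=", sum(len(row) for row in scores), "entries")
```

Every cell receives a score, including the cells that were empty in the binary input matrix, which is what makes prediction of unseen relations possible.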
  20. PREDICTED RELATIONS: GLAUCOMA
      • At threshold = 0.08
        • 22 unseen relations
        • F1 = 0.71
      • Applications beyond knowledge graph construction
        • Taxonomy and ontology maintenance
        • Entity search in task-specific and/or mobile context
        • Question answering
      Example relations shown on the slide:
        • glaucoma developed many years after chronic inflammation of uveal tract
        • glaucoma develop following chronic inflammation of uveal tract
        • glaucoma can appear soon in family history of glaucoma
        • glaucoma can appear soon in age over 40
        • glaucoma the risk of functional visual field loss
        • glaucoma contributing causes of functional visual field loss
        • glaucoma contributed to functional visual field loss
        • glaucoma is considered the second leading cause of functional visual field loss
        • glaucoma remains the second leading cause of functional visual field loss
      (Callout: this is a unique entity, not a string)
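
Thresholding and scoring can be sketched as follows in Python. The slides do not say how the F1 figure was computed, so the evaluation against a hand-labelled gold set is an assumption, as are the helper names `predict_unseen` and `f1_score` and the toy scores; only the threshold value 0.08 is taken from the slide.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[str, str]  # (entity pair label, relation label)


def predict_unseen(scores: Dict[Cell, float], observed: Set[Cell], threshold: float) -> Set[Cell]:
    """Keep cells scoring at or above the threshold that were not in the input matrix."""
    return {cell for cell, s in scores.items() if s >= threshold and cell not in observed}


def f1_score(predicted: Set[Cell], gold: Set[Cell]) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy scores and labels, for illustration only.
scores = {("a", "r1"): 0.90, ("a", "r2"): 0.05, ("b", "r1"): 0.30, ("b", "r2"): 0.10}
observed = {("a", "r1")}
gold = {("b", "r1")}

predicted = predict_unseen(scores, observed, threshold=0.08)
print(sorted(predicted), round(f1_score(predicted, gold), 2))
```

In this toy case two unseen cells pass the threshold and one of them is correct, so precision is 0.5, recall is 1.0, and F1 is 2/3.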
  21. A DBPEDIA IDEA?
      • Connect to the Scholarly Ecosystem
      • Crossref & DataCite DOIs + ORCIDs
  22. CONCLUSION
      • DBpedia and Wikipedia KBs are great reference sources
      • Beyond expected use for…
        • Internal knowledge curation
        • Stress testing
      • We’re hiring
