Knowledge Graph Construction and the Role of DBPedia

Some uses of DBPedia & Wikidata for structuring internal information. Presented at: http://wiki.dbpedia.org/meetings/TheHague2016

Published in: Technology
Notes
  • http://wiki.dbpedia.org/meetings/TheHague2016
  • More context…
  • Elsevier Labs in context
  • NASA, A.40 Computational Modeling Algorithms and Cyberinfrastructure, tech. report, NASA, 19 Dec. 2011
  • “Mendeley Suggest” is our personalised article recommender. It recommends related articles based on what users have in their libraries, and it uses taxonomies.
  • Can we do structuring automatically?
  • Knowledge Graph Construction and the Role of DBPedia

    1. KNOWLEDGE GRAPHS AND THE ROLE OF DBPEDIA • Paul Groth (@pgroth, pgroth.com) • Thanks to Joao Moura • Elsevier Labs (@elsevierlabs) • 6th DBpedia Community Meeting in The Hague 2016 • Feb. 12, 2016
    2. FAVORITE DBPEDIA PREDICATE…
    3. OUTLINE • The Importance of Structure • Better taxonomies • Knowledge graph construction
    4. ELSEVIER LABS - INTRO: WORLD LEADER IN DIGITAL INFO SOLUTIONS • Published over 330,000 articles in 2013 • Founded over 130 years ago • Work with over 30 million scientists, students, health & information professionals • Employ over 7,000 employees in 24 countries • Received over 1 million submissions in 2013 • Over the last 50 years the majority of Nobel Laureates have published with Elsevier • Over 53 million items indexed by Scopus • Elsevier eBooks, Online Journals, Databases • Publishes over 2,200 online journals & over 10,000 e-books • SOLUTIONS: • Elsevier R+D Solutions: helps corporate researchers, R+D professionals, and engineers improve how they interact with, share, and apply information to solve problems using our digital workflow tools, analytics, and data. • Elsevier Research Intelligence: provides universities, governments, and research institutions with the resources and insights to improve institutional research strategy, management, and performance. • Elsevier Clinical Solutions: helps medical professionals apply trusted data and sophisticated tools to make better clinical decisions, deliver better care, and produce better healthcare outcomes. • Elsevier Education: helps educate highly-skilled, effective healthcare professionals, using the most advanced pedagogical tools and reference works. • CONTENT • CAPABILITIES • PLATFORMS
    5. 60% OF TIME IS SPENT ON DATA PREPARATION
    6. STRUCTURED DATA
    7. STRUCTURED DATA
    8. CONNECTING DATA TO APPS
    9. BUILDING BETTER TAXONOMIES • Ontologies and taxonomies help organize and query content • Annotation • Classification / Navigation • Autocomplete • Suggestion & Recommendation • We have lots of taxonomies/ontologies • Journal Classification for Scopus • Mendeley classification system • Science Direct Subject classification • Reference Modules Hierarchies for Books • Submission system Journal classifications • … • Connect to external ontologies (e.g. MeSH) • Ontology Maintenance, Usage and Mapping
    10. TAXONOMY INDUCTION Starting with a very shallow hierarchy of syntactical concepts with almost no intersections: 1. Match concepts against a target (well-accepted) taxonomy and DBpedia. • Problems: the same concept may have different names or terminology in different branches; multiple languages; etc. 2. Check for partial orders between these concepts, using the hierarchy of the target taxonomy and DBpedia (skos:broader). 3. Find and complete missing links between concepts.
    11. Example: given two concepts, check if they form a parent-child relation:
        # dbo:, dct: and skos: are predefined prefixes on the public DBpedia endpoint
        select distinct * where {
          <http://dbpedia.org/resource/Model-checking> dbo:wikiPageRedirects* ?conceptChild .
          ?conceptChild dbo:wikiPageRedirects* ?redirectedChild .
          ?redirectedChild dct:subject ?subjectChild .
          <http://dbpedia.org/resource/Formal_methods> dbo:wikiPageRedirects* ?conceptParent .
          ?conceptParent dbo:wikiPageRedirects* ?redirectedParent .
          ?redirectedParent dct:subject ?subjectParent .
          ?subjectChild skos:broader ?subjectChildsParent .
          filter(?subjectChildsParent = ?subjectParent)
        }
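The parent-child check in the query above can be mirrored offline as a minimal Python sketch. The three dictionaries are toy stand-ins for DBpedia's redirect, dct:subject and skos:broader data; their contents and the function names are illustrative assumptions, not values taken from DBpedia or from the slides. Unlike the property path in the query, `resolve` follows redirects only to their final target.

```python
from typing import Dict, Set

# Toy stand-ins for DBpedia data (illustrative assumptions, not real query results).
REDIRECTS: Dict[str, str] = {"Model-checking": "Model_checking"}
SUBJECTS: Dict[str, Set[str]] = {
    "Model_checking": {"Category:Model_checking"},
    "Formal_methods": {"Category:Formal_methods"},
}
BROADER: Dict[str, Set[str]] = {
    "Category:Model_checking": {"Category:Formal_methods"},
}


def resolve(concept: str) -> str:
    """Follow redirects until a resource with no further redirect is reached."""
    seen = set()
    while concept in REDIRECTS and concept not in seen:
        seen.add(concept)  # guard against redirect cycles
        concept = REDIRECTS[concept]
    return concept


def is_parent_child(child: str, parent: str) -> bool:
    """True if some category of `child` has a skos:broader category of `parent`."""
    child_subjects = SUBJECTS.get(resolve(child), set())
    parent_subjects = SUBJECTS.get(resolve(parent), set())
    return any(BROADER.get(s, set()) & parent_subjects for s in child_subjects)
```

With the toy data, `is_parent_child("Model-checking", "Formal_methods")` holds while the reverse direction does not, which is the asymmetry needed to build a partial order.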
    12. TOWARDS AN ELSEVIER KNOWLEDGE GRAPH • Ongoing proof-of-concept work by Paul Groth, Sujit Pal and Ron Daniel of Elsevier Labs • Unsupervised, scalable and built with off-the-shelf technologies • Based on recent work at University College London and University of Massachusetts Amherst • Riedel, Sebastian, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. "Relation extraction with matrix factorization and universal schemas." (2013). • [Pipeline diagram. Labels: Content; Universal schema; Surface form relations; Structured relations; Factorization model; Matrix Construction; Open Information Extraction; Entity Resolution; Matrix Factorization; Knowledge graph; Curation; Predicted relations; Matrix Completion; Taxonomy; Triple Extraction. Annotations: 14M articles from Science Direct; 3.3M triples; 475M triples; 49M triples; p x r matrix; p x k, k x r latent factor matrices; ~102 triples; 920K concepts from EMMeT]
    13. ENTITY RESOLUTION: GLAUCOMA • Surface form triples downsampled from 49M entity-resolved triples
    14. ANNOTATION • http://www.slideshare.net/SparkSummit/dictionary-based-annotation-at-scale-with-spark-by-sujit-pal • What is the problem? • Annotate millions of documents from different corpora. • 14M docs from Science Direct alone. • More from other corpora, dependency parsing, etc. • Critical step for Machine Reading and Knowledge Graph applications. • Why is this such a big deal? • Takes advantage of existing linked data. • No model training for multiple complex STM domains. • However, it is simple only until done at scale.
    15. ANNOTATION PIPELINE
    16. DICTIONARY BASED NE ANNOTATOR (SODA) • Part of Document Annotation Pipeline. • Annotates text with Named Entities from external Dictionaries. • Why do we have to scale (Wikipedia KBs) – 8 Million entities • Built with Open Source Components • Apache Solr – Highly reliable, scalable and fault-tolerant search index. • SolrTextTagger – Solr component for text tagging, uses Lucene FST technology. • Apache OpenNLP – Machine Learning based toolkit for processing Natural Language Text. • Apache Spark – Lightning fast, large scale data processing. • Uses ideas from other Open Source libraries • FuzzyWuzzy – Fuzzy String Matching like a boss. • Contributed back to Open Source • https://github.com/elsevierlabs-os/soda
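To make the idea of dictionary-based annotation concrete, here is a minimal in-memory sketch using greedy longest-match lookup over whitespace tokens. It is not SoDA's API and does not use Solr or SolrTextTagger; the function name, the `max_len` parameter and the `ex:` entity identifiers are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def annotate(text: str, dictionary: Dict[str, str], max_len: int = 4) -> List[Tuple[str, str]]:
    """Return (surface form, entity id) pairs found by greedy longest match.

    Tokens are lowercased and split on whitespace; punctuation is not handled.
    """
    tokens = text.lower().split()
    matches: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                matches.append((phrase, dictionary[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return matches


# Hypothetical dictionary entries; the ex: identifiers are placeholders.
DICTIONARY = {
    "glaucoma": "ex:Glaucoma",
    "visual field": "ex:VisualField",
    "visual field loss": "ex:VisualFieldLoss",
}
```

Calling `annotate("Glaucoma contributed to functional visual field loss", DICTIONARY)` tags "glaucoma" and prefers "visual field loss" over the shorter "visual field". A dictionary of around 8 million entities would need an indexed structure (the slides mention Lucene FSTs) rather than a Python dict probed phrase by phrase.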
    17. TOWARDS AN ELSEVIER KNOWLEDGE GRAPH • [Pipeline diagram. Labels: Content; Universal schema; Surface form relations; Structured relations; Factorization model; Matrix Construction; Open Information Extraction; Entity Resolution; Matrix Factorization; Knowledge graph; Curation; Predicted relations; Matrix Completion; Taxonomy; Triple Extraction. Annotations: 14M articles from Science Direct; 3.3M triples; 475M triples; 49M triples; p x r matrix; p x k, k x r latent factor matrices; ~102 triples; 920K concepts from EMMeT]
    18. MATRIX CONSTRUCTION: GLAUCOMA • p = 83 entity pairs (rows) • r = 176 relations (columns): surface form relations and structured relations • 83 x 176 sparse binary-valued matrix with 366 entries
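A minimal sketch of the matrix construction step, assuming the input is a list of entity-resolved (subject, relation, object) triples: rows are entity pairs, columns are relations, and a cell is set when the triple was observed. The function name and the sample triples are illustrative; the sparse matrix is represented simply as a set of (row, column) index pairs.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def build_matrix(triples: List[Triple]) -> Tuple[List[Tuple[str, str]], List[str], Set[Tuple[int, int]]]:
    """Build a sparse binary matrix: entity pairs (rows) x relations (columns)."""
    pairs = sorted({(s, o) for s, _, o in triples})
    relations = sorted({r for _, r, _ in triples})
    row = {pair: i for i, pair in enumerate(pairs)}
    col = {rel: j for j, rel in enumerate(relations)}
    # Only observed cells are stored; every other cell is implicitly 0.
    cells = {(row[(s, o)], col[r]) for s, r, o in triples}
    return pairs, relations, cells


# Illustrative triples in the style of the glaucoma example.
TRIPLES = [
    ("glaucoma", "contributed to", "functional visual field loss"),
    ("glaucoma", "remains the second leading cause of", "functional visual field loss"),
    ("glaucoma", "can appear soon in", "age over 40"),
]
```

For the three sample triples this gives p = 2 entity pairs, r = 3 relations and 3 observed cells; the slide's matrix has the same structure at 83 x 176 with 366 observed entries.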
    19. MATRIX COMPLETION: GLAUCOMA • Latent factor matrix (p = 83 rows) × latent factor matrix (r = 176 columns) = 83 x 176 real-valued matrix with 14,608 entries
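The completion step can be sketched as multiplying a p x k factor matrix by a k x r factor matrix, which yields a score for every cell, including the ones never observed. The sketch below assumes the factors are already learned and passes each dot product through a logistic function so that scores fall between 0 and 1; that squashing choice and the function name are assumptions, not details confirmed by the slides.

```python
import math
from typing import List

Matrix = List[List[float]]


def complete(pair_factors: Matrix, relation_factors: Matrix) -> Matrix:
    """Score every (entity pair, relation) cell as sigmoid(row . column).

    pair_factors is p x k and relation_factors is k x r; the result is p x r.
    """
    k = len(relation_factors)
    r = len(relation_factors[0])
    scores: Matrix = []
    for row in pair_factors:
        if len(row) != k:
            raise ValueError("inner dimensions of the factor matrices differ")
        scores.append([
            1.0 / (1.0 + math.exp(-sum(row[f] * relation_factors[f][j] for f in range(k))))
            for j in range(r)
        ])
    return scores
```

With p = 83 and r = 176 the result has 83 × 176 = 14,608 real-valued entries, compared with the 366 observed entries in the sparse input matrix.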
    20. PREDICTED RELATIONS: GLAUCOMA • At threshold = 0.08 • 22 unseen relations • F1 = 0.71 • Applications beyond knowledge graph construction • Taxonomy and ontology maintenance • Entity search in task-specific and/or mobile context • Question answering • Example triples (subject | relation | object): • glaucoma | developed many years after | chronic inflammation of uveal tract • glaucoma | develop following | chronic inflammation of uveal tract • glaucoma | can appear soon in | family history of glaucoma • glaucoma | can appear soon in | age over 40 • glaucoma | the risk of | functional visual field loss • glaucoma | contributing causes of | functional visual field loss • glaucoma | contributed to | functional visual field loss • glaucoma | is considered the second leading cause of | functional visual field loss • glaucoma | remains the second leading cause of | functional visual field loss • Callout: This is a unique entity, not a string
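The thresholding and F1 figures above can be illustrated with a short sketch: keep cells whose score reaches the threshold and that were not already observed, then compare them with a held-out gold set. The function names and the toy scores are assumptions; how the slides' F1 = 0.71 was actually evaluated is not stated in the source.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]


def predict(scores: Dict[Cell, float], seen: Set[Cell], threshold: float) -> Set[Cell]:
    """Cells scoring at or above the threshold that were not observed in the input."""
    return {cell for cell, score in scores.items() if score >= threshold and cell not in seen}


def f1_score(predicted: Set[Cell], gold: Set[Cell]) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing is correct."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy scores for a 2 x 2 matrix (illustrative values only).
SCORES = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.05, (1, 1): 0.5}
SEEN = {(0, 0)}
```

Lowering the threshold admits more unseen relations at the cost of precision, so a value such as 0.08 would normally be chosen by checking F1 on held-out data.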
    21. A DBPEDIA IDEA? • Connect to the Scholarly Ecosystem • Crossref & DataCite DOIs + ORCIDs
    22. CONCLUSION • DBPedia and Wikipedia KBs are great reference sources • Beyond expected use for… • Internal knowledge curation • Stress testing • We’re hiring