Knowledge Graph Maintenance

Presentation for NEC Lab Europe.
Knowledge graphs are increasingly built using complex, multifaceted machine learning-based systems relying on a wide range of different data sources. To be effective, these must constantly evolve and thus be maintained. I present work on combining knowledge graph construction (e.g. information extraction) and refinement (e.g. link prediction) in end-to-end systems. In particular, I will discuss recent work on using inductive representations for link prediction. I then discuss the challenges of ongoing system maintenance, knowledge graph quality and traceability.

  1. 1. Knowledge Graph Maintenance Prof. Paul Groth | @pgroth | pgroth.com | indelab.org Thanks to Daniel Daza, Corey Harper, Thiviyan Thanapalsingam, Niels ten Oever, Marieke van Erp, Valentin Vogelman and Frank van Harmelen NEC Lab Europe, January 22, 2021
  2. 2. We investigate intelligent systems that support people in their work with data and information from diverse sources. In this area, we perform applied and fundamental research informed by empirical insights into data science practice. Current topics: • Automated Knowledge Base Construction • Data Search + Data Provenance • Data Management for Machine Learning • Causality for machine learning on messy data indelab.org
  3. 3. Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure. Written by Nadia Eghbal. Source: https://www.fordfoundation.org/work/learning/research-reports/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/
  4. 4. Faculty of Science Knowledge Graphs
  5. 5. Source: Azzaoui, K., Jacoby, E., Senger, S., Rodríguez, E. C., Loza, M., Zdrazil, B., … Ecker, G. F. (2013). Scientific competency questions as the basis for semantically enriched open pharmacological space development. Drug Discovery Today, 18(17–18), 843–852. https://doi.org/10.1016/j.drudis.2013.05.008
  6. 6. Source: https://www.biocuration2019.org/about
  7. 7. Source: https://www.wired.com/story/inside-the-alexa-friendly-world-of-wikidata/
  8. 8. Source: https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/user-edits/normal||2001-01-01~2019-09-01|~total|
  9. 9. Crowdsourcing 100,000s of hand-annotated examples. The TAC Relation Extraction Dataset. Source: Zhang, Yuhao, et al. "Position-aware attention and supervised data improve slot filling." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. Karen Fort, Gilles Adda, Kevin Bretonnel Cohen. Amazon Mechanical Turk: Gold Mine or Coal Mine? Computational Linguistics, Massachusetts Institute of Technology Press (MIT Press), 2011, pp. 413-420. 10.1162/COLI_a_00057
  10. 10. Concept1 Concept2 Concept3 KOS
      Professional Curators | Literature | Software | Non-professional contributors
      1. dealing with changing cultural and societal norms, specifically to address or correct bias
      2. political influence
      3. new concepts and terminology arising from discoveries or change in perspective within a technical/scientific community
      4. gardening
      5. incremental contributorship
      6. progressive formalization
      7. software and automation
      8. integration of large numbers of data sources
      9. variance in algorithm training data
      Data ⚐ Society & Politics
      (4, 5, 6) (7, 8, 9) (3) (1, 2)
      Source: Michael Lauruhn and Paul Groth. “Sources of Change for Modern Knowledge Organization Systems.” Knowledge Organization 43, no. 8 (2016).
  11. 11. Content Universal schema Surface form relations Structured relations Factorization model Matrix Construction Open Information Extraction Entity Resolution Matrix Factorization Knowledge graph Curation Predicted relations Matrix Completion Taxonomy Triple Extraction Concept Resolution. 14M SD articles; 475M triples; 3.3 million relations; 49M relations; ~15k -> 1M entries. Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel. “Applying Universal Schemas for Domain Specific Ontology Expansion.” 5th Workshop on Automated Knowledge Base Construction (AKBC) 2016. Knowledge Graph Curation and Refinement
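
The matrix construction, factorization and completion steps named on this slide can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the model from the cited paper: rows are entity pairs, columns mix surface-form and structured relations, and a rank-1 approximation (found by power iteration) is swapped in for the slide's "factorization model". The function names and the example matrix are hypothetical.

```python
from math import sqrt
from typing import List

Matrix = List[List[float]]


def top_right_singular_vector(matrix: Matrix, iterations: int = 100) -> List[float]:
    """Power iteration on M^T M; assumes the matrix is not all zeros."""
    n_cols = len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        # row_scores = M v
        row_scores = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        # v = M^T (M v), then renormalise
        v = [sum(matrix[i][j] * row_scores[i] for i in range(len(matrix)))
             for j in range(n_cols)]
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


def rank_one_scores(matrix: Matrix) -> Matrix:
    """Rank-1 reconstruction (M v) v^T, used as a score for every cell."""
    v = top_right_singular_vector(matrix)
    row_scores = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
    return [[rs * vj for vj in v] for rs in row_scores]


# Columns (hypothetical): "X is located in Y", "X lies in Y", locatedIn
observed = [
    [1.0, 1.0, 1.0],  # entity pair 1
    [1.0, 1.0, 1.0],  # entity pair 2
    [1.0, 1.0, 0.0],  # entity pair 3: structured relation not yet in the graph
    [0.0, 0.0, 0.0],  # entity pair 4: no evidence at all
]
scores = rank_one_scores(observed)
```

In this toy matrix the missing structured cell for entity pair 3 receives a clearly positive score because its surface-form columns co-occur with `locatedIn` elsewhere, while the pair with no evidence stays at zero. Such predicted cells would then be candidates for curation rather than automatic additions.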
  12. 12. Apply ML
  13. 13. Link Prediction
  14. 14. Inductive Prediction Daniel Daza, Michael Cochez, and Paul Groth. "Inductive Entity Representations from Text via Link Prediction." arXiv preprint arXiv:2010.03496 (2020). Accepted to WebConf 2021
  15. 15. Inductive Prediction
  16. 16. Inductive Prediction
  17. 17. Inductive Prediction Transformer
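
The idea behind inductive prediction can be sketched in a few lines: if an entity's vector is computed from its textual description, an entity never seen during training still gets a representation and can be scored in a triple. The sketch below rests on assumptions: `encode_text` is a toy hashed bag-of-words stand-in for the Transformer encoder on the slide, and a TransE-style score is used as one common link prediction scoring function. Neither is claimed to be the exact setup of the cited work.

```python
import hashlib
from typing import List

DIM = 8  # hypothetical embedding size


def encode_text(description: str, dim: int = DIM) -> List[float]:
    """Toy stand-in for a Transformer encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in description.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm else vec


def transe_score(head: List[float], relation: List[float], tail: List[float]) -> float:
    """Negative L1 distance between head + relation and tail (higher is more plausible)."""
    return -sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


# An entity that was never part of a training graph still gets a vector
# because the vector is a function of its description.
new_entity = encode_text("river flowing through several European countries")
known_entity = encode_text("country in central Europe")
relation = [0.0] * DIM  # in a trained model this would be a learned relation vector
plausibility = transe_score(new_entity, relation, known_entity)
```

In a real system the encoder and relation vectors would be trained jointly on a link prediction objective; here they are fixed so the example stays self-contained.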
  18. 18. Evaluation Scenarios • Transductive (standard) - predict links between entities seen at training time • Dynamic - predict links where an entity unseen at training time can be in the head, tail or both positions of the triple. • Situation: adding entities to a KG • Transfer - predict links on entities that are unseen at training time. • Situation: predicting links on an unseen KG
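
The three scenarios above differ only in how the test entities overlap with the training entities, which can be checked mechanically. The following is a minimal sketch under that reading of the slide; the function names are hypothetical, and treating the scenario as a property of the whole test split is a simplifying assumption.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def entities(triples: Iterable[Triple]) -> Set[str]:
    """Collect every entity that appears as a head or a tail."""
    found: Set[str] = set()
    for head, _, tail in triples:
        found.update((head, tail))
    return found


def classify_split(train: List[Triple], test: List[Triple]) -> str:
    """Name the evaluation scenario implied by the train/test entity overlap."""
    seen = entities(train)
    test_entities = entities(test)
    if test_entities <= seen:
        return "transductive"  # every test entity was seen at training time
    if test_entities.isdisjoint(seen):
        return "transfer"      # no test entity was seen at training time
    return "dynamic"           # seen and unseen entities are mixed


train = [("a", "r", "b"), ("b", "r", "c")]
```

For example, `classify_split(train, [("a", "r", "x")])` yields `"dynamic"` because `x` is new while `a` is known, which corresponds to adding entities to an existing KG.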
  19. 19. Dynamic Transfer
  20. 20. Representations for other tasks https://github.com/dfdazac/blp
  21. 21. “john lennon, parents” “bicycle holiday nature” “Airports in Germany” “What is the longest river?”.
  22. 22. Future: Sub-graph Prediction
  23. 23. Future: Learning KG Pipelines End-to-End. Paul T. Groth, Antony Scerri, Ron Daniel, Bradley P. Allen: End-to-End Learning for Answering Structured Queries Directly over Text. DL4KG@ESWC 2019: 57-70
  24. 24. KG POPULATION & THE LONG TAIL
  25. 25. The Importance of Attributes
  26. 26. • Extract numeric patterns (12 ± 3, 53–55, 0.245) • Extract corresponding units (°C, μM, hours, h, MPa) • Nanoamperes (nA) for neural cell Rheobase values • Megapascals (MPa) for compressive strength of concrete • Milligrams per Kilogram (mg/kg) for administered drug dosages Units of Measurement Corey Harper
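
A first pass at the two extraction steps on this slide can be written with regular expressions. This is only a sketch under stated assumptions: the unit list is limited to the units named on the slide, the pattern and function names are hypothetical, and real text needs far more cases (exponents, spelled-out numbers, nested ranges), which is part of why the annotation task turns out to be hard.

```python
import re
from typing import List, Optional, Tuple

NUMBER = r"\d+(?:\.\d+)?"
# A single value, a range (53–55) or a value with tolerance (12 ± 3).
QUANTITY = rf"{NUMBER}(?:\s*(?:±|–|-)\s*{NUMBER})?"
# Units taken from the slide; longest first so "hours" is preferred over "h".
UNITS = ["mg/kg", "hours", "MPa", "°C", "μM", "nA", "h"]
UNIT = "|".join(re.escape(unit) for unit in UNITS)

# The unit is optional and must not be followed by another letter.
PATTERN = re.compile(
    rf"(?P<value>{QUANTITY})(?:\s*(?P<unit>{UNIT})(?![A-Za-z]))?"
)


def extract_measurements(text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (numeric pattern, unit or None) pairs in order of appearance."""
    return [(m.group("value"), m.group("unit")) for m in PATTERN.finditer(text)]


sample = "Cured at 12 ± 3 °C for 53–55 hours; strength 0.245 MPa, dose 5 mg/kg."
measurements = extract_measurements(sample)
```

On the sample sentence this pairs each numeric pattern with the unit that directly follows it; a number with no recognised unit is returned with `None` so it can be reviewed rather than silently dropped.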
  27. 27. Simple Annotation - Surprisingly Hard
  28. 28. https://competitions.codalab.org/competitions/25770
  29. 29. Task List
  30. 30. Guidelines for applying the model
  31. 31. More and more detailed guidelines...
  32. 32. And started annotating, a paragraph at a time...
  33. 33. NLP Competitions -- Labs Online Lecture -- Thorne, Harper But the paragraphs got messy... 2020-12-07
  34. 34. https://github.com/harperco/measeval
  36. 36. Some Inter-annotator Agreement Info (Krippendorff’s Alpha)
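
For readers unfamiliar with the agreement measure named on this slide, the sketch below computes Krippendorff's alpha for nominal labels from a coincidence matrix, as alpha = 1 - observed disagreement / expected disagreement. It is a generic illustration; how the annotation units were defined for the task is not stated on the slide, so the input format (one list of labels per annotated item) is an assumption.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, List, Sequence


def krippendorff_alpha_nominal(units: Sequence[List[Hashable]]) -> float:
    """Alpha for nominal data; each unit lists the labels given by its annotators."""
    coincidences: Counter = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item labelled by a single annotator is not pairable
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n == 0:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label value was ever used
    return 1.0 - observed / expected


perfect = krippendorff_alpha_nominal([["a", "a"], ["b", "b"]])
mixed = krippendorff_alpha_nominal([["a", "a"], ["a", "b"], ["b", "b"], ["b", "a"]])
```

A value of 1 means perfect agreement and values near 0 mean agreement no better than chance; in the second toy example two of four items are disagreed on, which gives an alpha of 0.125.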
  37. 37. Evaluation Output
  38. 38. Evaluation Process
  39. 39. Local Evaluation Micro Averages • Can run micro averages • Broken down: • By entity / relation class • By document (paragraph) • By class & document • Could add additional analysis
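
The breakdowns listed above all follow from one operation: pool the true positive, false positive and false negative counts within a group, then compute precision, recall and F1 on the pooled counts. The sketch below illustrates this with made-up counts; the record layout and function names are hypothetical and not taken from the evaluation code referenced in the talk.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Tuple

# (document id, class label, true positives, false positives, false negatives)
Record = Tuple[str, str, int, int, int]


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 where undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_average(
    records: Iterable[Record],
    key: Callable[[Record], Hashable] = lambda record: "all",
) -> Dict[Hashable, Tuple[float, float, float]]:
    """Pool counts per group first, then score each group once."""
    pooled = defaultdict(lambda: [0, 0, 0])
    for record in records:
        group = pooled[key(record)]
        for i, count in enumerate(record[2:]):
            group[i] += count
    return {name: prf(*counts) for name, counts in pooled.items()}


records = [
    ("doc1", "Quantity", 8, 2, 0),
    ("doc1", "Unit", 1, 0, 1),
    ("doc2", "Quantity", 1, 0, 3),
]
overall = micro_average(records)
by_class = micro_average(records, key=lambda r: r[1])
by_document = micro_average(records, key=lambda r: r[0])
by_class_and_document = micro_average(records, key=lambda r: (r[0], r[1]))
```

Because counts are pooled before scoring, large documents and frequent classes dominate the overall figure, which is why the per-class and per-document views are worth inspecting alongside it.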
  40. 40. KG POPULATION & THE LONG TAIL
  41. 41. LINKING & IDENTITY

      Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2129–2137, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

      Towards Entity Spaces
      Marieke van Erp*, Paul Groth†
      * KNAW Humanities Cluster - DHLab, Amsterdam, NL
      † University of Amsterdam, Amsterdam, NL
      marieke.van.erp@dh.huc.knaw.nl, p.groth@uva.nl

      Abstract
      Entities are a central element of knowledge bases and are important input to many knowledge-centric tasks including text analysis. For example, they allow us to find documents relevant to a specific entity irrespective of the underlying syntactic expression within a document. However, the entities that are commonly represented in knowledge bases are often a simplification of what is truly being referred to in text. For example, in a knowledge base, we may have an entity for Germany as a country but not for the more fuzzy concept of Germany that covers notions of German Population, German Drivers, and the German Government. Inspired by recent advances in contextual word embeddings, we introduce the concept of entity spaces - specific representations of a set of associated entities with near-identity. Thus, these entity spaces provide a handle to an amorphous grouping of entities. We developed a proof-of-concept for English showing how, through the introduction of entity spaces in the form of disambiguation pages, the recall of entity linking can be improved.

      Keywords: entity, identity, knowledge representation, entity linking

      1. Introduction
      Entities are a central element for knowledge bases and text analysis tasks (Balog, 2018). However, the way in which entities are represented in knowledge bases and how subsequent tools use these representations are a simplification of the complexity of many entities. For example, the entity Germany in Wikidata as represented by wikidata:Q183 focuses on its properties as a location and geopolitical entity due to its membership as an instance of sovereign state, country, federal state, republic, social state, legal state, and administrative territorial entity. Similarly, in DBpedia (version 2016-10), Germany is represented as entity of type populated place and some subtypes such as yago:WikicatFederalCountries and yago:WikicatMemberStatesOfTheEuropeanUnion.[1] However, when the term Germany is used in text, it can take on many meanings that all have ‘something to do’ with Germany as it is represented in knowledge bases, but are all not quite the same:

      (1) Germany imported 47,600 sheep from Britain last year, nearly half of total imports.
      (2) German July car registrations up 14.2 pct yr / yr.
      (3) Australia last won the Davis Cup in 1986, but they were beaten finalists against Germany three years ago under Fraser’s guidance.

      In Example (1), Germany refers partly to the location, but a location usually cannot take on an active role, such that the entity ‘importing’ the sheep is most likely a referent to the German meat industry. Germany in Example (2), refers to the German population buying and registering more cars than a year before. Finally, in Example (3), Germany refers to the German Davis Cup team from 1993 (the news article is from 1996). In the AIDA-YAGO dataset, this entity is tagged as dbp:Germany Davis Cup team but this presents us with another layer of identities, namely that every year, or every couple of years, the German Davis cup team consists of different players. In 1993, the German Davis cup team consisted of Michael Stich and Marc-Kevin Goellner, in 1996 of David Prinosil and Hendrik Dreekmann and at the time of writing this article in 2019 of Alexander Zverev and Philipp Kohlschreiber.

      Both MAG (Moussallem et al., 2017) and DBpedia spotlight (Daiber et al., 2013a) annotate Australia and Germany in Example (3) as dbp:Australia and dbp:Germany respectively. While both the annotations and automatic linkages are close to the identity of the entity in resolving these referents to dbp:Germany, we argue this is an underspecification and highlights a larger problem with identity representation in knowledge bases.

      Collapsing of identities has been a frequent topic within Semantic Web discourse. However, most discussions have focused on issues with owl:sameAs links (McCusker and McGuinness, 2010; Raad et al., 2018). However, the problem of simplified entity representations (e.g. the collapsing of identities) also occurs before the creation of such owl:sameAs links. Specifically, with the fact that most knowledge bases represent a single or limited number of an entity’s facets. In this paper, we analyse the extent of the problem by connecting Semantic Web representations of identity to linguistic representations of entities, namely coreference and near-identity. To overcome this identity problem, we argue for the introduction of explicit representations of near-identity within knowledge bases. We term these explicit representations - entity spaces. We illustrate how the introduction of entity spaces can boost the performance of state-of-the-art entity linking pipelines.

      Our contributions are: 1) the definition of entity spaces; 2) a prototype showing the use of entity spaces over multiple entity linking pipelines; and 3) experiments on 13 English entity linking datasets showing the impact of a more tolerant approach to entity linking made possible through entity spaces. Our code and experimental results are available via https://github.com/MvanErp/entity-spaces.

      [1] Germany also has rdf:type dbo:Person but we assume this is a glitch.

      A good question: Thinking about an answer:
  42. 42. Situated Dialogue • Computer mediated dialogue is omnipresent • Key observation: dialogue is situated. It not only is about the “back-and-forth” between parties but also the larger environment (e.g. documents, concepts, projects, world knowledge). • Not just chat
  43. 43. Complex Environments (e.g. Standards Organizations) in-sight-it.github.io
  44. 44. conversationkg - kg extraction from dialogue
  45. 45. Concept1 Concept2 Concept3 KOS
      Professional Curators | Literature | Software | Non-professional contributors
      1. dealing with changing cultural and societal norms, specifically to address or correct bias
      2. political influence
      3. new concepts and terminology arising from discoveries or change in perspective within a technical/scientific community
      4. gardening
      5. incremental contributorship
      6. progressive formalization
      7. software and automation
      8. integration of large numbers of data sources
      9. variance in algorithm training data
      Data ⚐ Society & Politics
      (4, 5, 6) (7, 8, 9) (3) (1, 2)
      Source: Michael Lauruhn and Paul Groth. “Sources of Change for Modern Knowledge Organization Systems.” Knowledge Organization 43, no. 8 (2016).
  46. 46. Knowledge Engineering Revisited • Knowledge graphs are built ad hoc • 100s of components (extractors, scrapers, quality, scoring, user feedback, …) • Unique for each organization • Existing knowledge engineering theory does not apply: • Assumes small scale • Assumes slow change • People-centric • Expressive representations • Needed: an updated theory and methods for knowledge engineering designed for the demands of modern knowledge graphs
  47. 47. knowledgescientist.org
  48. 48. Conclusion • Knowledge graphs require maintenance • Maintenance is frequently people work • Need for new ML-based methods & new human + machine workflows • Interested? Happy to talk more Paul Groth | @pgroth | pgroth.com | indelab.org
