Semantic similarity for faster Knowledge Graph delivery at scale

Knowledge graphs promise a novel platform for better holistic decision making and analytics. Many projects fail to reach their full potential because of the prohibitively high cost of integrating new knowledge from the required information sources.

The talk explains semantic similarity as a tool for efficient entity clustering and matching based on graph and text embeddings. It demonstrates the underlying Random Indexing algorithm, which is scalable and easy to understand.
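
As a rough illustration of the Random Indexing idea mentioned above, the following minimal sketch assigns each term a sparse random "elemental" vector and builds a context vector per movie by summing the elemental vectors of its terms. It is not the Ontotext implementation: the function names, the dimensionality `DIM`, the sparsity `NONZERO`, the director assignments and the `ex:unrelated` row are all assumptions made for this example.

```python
import math
import random

DIM = 512      # assumed reduced dimensionality
NONZERO = 8    # assumed number of non-zero entries per elemental vector


def elemental_vector(term, dim=DIM, nonzero=NONZERO):
    """Return a sparse random vector with `nonzero` entries set to +1 or -1."""
    rng = random.Random(term)  # seeding by the term keeps the vector reproducible
    vector = [0] * dim
    for position in rng.sample(range(dim), nonzero):
        vector[position] = rng.choice((1, -1))
    return vector


def context_vectors(triples, dim=DIM):
    """Sum, per subject, the elemental vectors of its "[predicate] object" terms."""
    vectors = {}
    for subject, predicate, obj in triples:
        term = f"[{predicate}] {obj}"
        target = vectors.setdefault(subject, [0] * dim)
        for i, value in enumerate(elemental_vector(term, dim)):
            target[i] += value
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy rows loosely based on the slide examples; the director assignments
# and the ex:unrelated movie are assumptions made for illustration.
TRIPLES = [
    ("wd:Q550232", "Actor", "Adam LeFevre"),
    ("wd:Q550232", "Actor", "Anthony Anderson"),
    ("wd:Q550232", "Actor", "Mia Farrow"),
    ("wd:Q550232", "Director", "Luc Besson"),
    ("imdb:tt0344854", "Actor", "Adam LeFevre"),
    ("imdb:tt0344854", "Actor", "Mia Farrow"),
    ("imdb:tt0344854", "Director", "Luc Besson"),
    ("ex:unrelated", "Actor", "Placeholder Actor"),
    ("ex:unrelated", "Director", "Placeholder Director"),
]

vectors = context_vectors(TRIPLES)
same_film = cosine(vectors["wd:Q550232"], vectors["imdb:tt0344854"])
other_film = cosine(vectors["wd:Q550232"], vectors["ex:unrelated"])
print(f"wd vs imdb: {same_film:.2f}, wd vs unrelated: {other_film:.2f}")
```

Because the two records share most of their terms, their context vectors point in nearly the same direction, while the unrelated record shares none and scores close to zero. The vectors stay at `DIM` entries no matter how many distinct terms exist, which is where the dimensionality reduction comes from.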

This work is part of the Ontotext Platform, which increases productivity in developing and maintaining large-scale knowledge graphs. The platform enables enterprises to develop and operate mission-critical systems for decision support, information discovery and metadata management on top of such graphs.

Published in: Data & Analytics

  1. Semantic Similarity for Faster Knowledge Graph Delivery at Scale. Connected Data London, October 2019. Making sense of text and data.
  2. Why Knowledge Graphs? “Cross-industry studies show that on average, less than half of an organization’s structured data is actively used in making decisions, and less than 1% of its unstructured data is analyzed or used at all.” (“What’s Your Data Strategy?”, Leandro DalleMule and Thomas H. Davenport, Harvard Business Review) Top 5 USA Banks
  3. Presentation Outline: Enterprise Knowledge Graphs; Smart Graphs with Embeddings; Implementing Knowledge Graphs
  4. What is a Knowledge Graph? Graph, Semantics, Smart, Alive
  5. Multiple Enterprise Data Management Systems. KG platforms combine capabilities of several enterprise systems: master and reference data management; corporate/enterprise taxonomy; data warehouse; metadata management; digital asset management; enterprise search
  6. Challenges in Enterprise Semantic Integration. IMDB titles by type: TV Episodes 4’044’529; Short film 681’067; Feature film 516’726; Video 164’061; TV series 164’061; TV movies 126’206; …; Total* 5’838’514. WikiData titles by type: film 235’707; silent short film 16’377; television film 15’345; short film 11’225; animated film 3’785; …; Total 289’650. (* The later tests use only 5K crawled datasets.)
  7. Challenges in Enterprise Semantic Integration. Multiple levels of inconsistencies: types (film vs. “TV movie”); metadata (“science fiction”, “military science fiction” vs. “Sci-Fi”); reference data (“US” vs. “United States”). Manually curated cross-links (!) are used for testing purposes only
  8. A Classical Approach. Start with string matching of the titles: “Harry Potter and the Deathly Hallows: Part II” vs. “Harry Potter and the Deathly Hallows – Part 2”; “Perfume: The Story of a Murderer” vs. “Perfume”; “Pirate Radio” vs. “The Boat That Rocked”; “Avatar” vs. “Avatar” (4 movies)
  9. A Classical Approach with Extra Rules. Add release date matching: 10% of the matches are lost due to bad dates. Ambiguity is greatly reduced, but many ambiguous cases remain: tt0238520 (16 October 1995, 50 min); tt1125875 (11 April 1995, 48 min); tt0238520 (23 June 1995, 1h 21 min)
  10. Presentation Outline: Enterprise Knowledge Graphs; Smart Graphs with Embeddings; Implementing Knowledge Graphs
  11. What is Knowledge Graph Embedding? Predicts similar graph nodes or properties; requires no input training data; a mathematical representation of graph nodes as vectors. [Figure: The Godfather (2h 58m) vs. American Pie (1h 15 min); dimensions: duration, drama, comedy]
  12. Knowledge Graph Embedding Example. For each film include all actors, the director and the country of origin; the result is a vast matrix with entities and literals. [Table: document-term matrix with movies as documents (rows wd:Q550232, imdb:tt0344854, …) and terms as columns: [Actor] “Adam LeFevre”, [Actor] “Anthony Anderson”, [Actor] “Mia Farrow”, [Country] “France”, [Country] “US”, [Country] “United states”, [Director] “Luc Besson”, …]
  13. Random Indexing (RI) Algorithm. Reduces the matrix dimension with elemental vectors: for each term w, calculate a context vector S(w) by summing the elemental (index) vectors of all items x appearing in the context of w. Lightweight and fast (250K x 1.45M matrix in < 5 min). Sub-second searches with limited RAM. [Table: movies vs. actors (Adam LeFevre, Anthony Anderson, Mia Farrow), each actor with an elemental vector: wd:Q550232 = 1, 1, 1; imdb:tt0344854 = 1, 0, 1; …]
  14. Random Indexing (RI) Algorithm #2. Supports similarity searches for: Document to Document (similar movies); Document to Term (specific actor/director); Term to Term (similar actors/directors); Term to Document (find movies specific to this actor/director). Features all properties of a vector space model: partial matching, weights, ranking, plus context-sensitive semantic search. [Table: movies vs. actors (Adam LeFevre, Anthony Anderson, Mia Farrow), each actor with an elemental vector: wd:Q550232 = 1, 1, 1; imdb:tt0344854 = 1, 0, 1; …]
  15. Presentation Outline: Enterprise Knowledge Graphs; Smart Graphs with Embeddings; Implementing Knowledge Graphs
  16. Reference Software Architecture: easy consumption of data; no backend development; flexible data processing tools; standard and open interfaces. [Diagram labels: KG Consumers, Ontotext Platform, GraphDB, Similarity Plugin, GQL query, GQL mutation, GQL Federation, SPARQL, RDF / Structured data]
  17. Transform CSV to RDF. Perform standard ETL tasks; trim spaces, parse numbers and dates; parse IMDB ids from links for testing; map table data to RDF; SPARQL over tabular data; split multi-valued fields like “Action|Thriller”. Schema-level alignment is not yet applied
  18. Similarity Plugin API. Accepts a graph described by <s, p, o>; indexes any RDF types; works with virtual overlays. [Table: subject, predicate, object: wd:Q550232 :actor “Adam LeFevre”; imdb:tt0344854 :actor “Adam LeFevre”; …] [Diagram labels: imdb:tt0344854, imdb:actor_2_name, “Adam LeFevre”, wd:Q550232, wdt:P161, wd:Q2702964, rdfs:label, “Adam LeFevre”]
  19. Specify KG Embeddings: Select Predicates. The similarity plugin expects triples <s, p, o>
  20. Specify KG Embeddings: Align Schema. Set a translation table of the predicates
  21. Results. Find RDF resources similar to “Pirate Radio”. Even a limited set of predicates returns acceptable results. An important independent alternative for entity matching
  22. Important Design Considerations. Prefer RDF over Property Graph: much richer technology ecosystem (schema, dataset, reasoning, strings vs. things). Virtualization versus consolidation: virtualization works only for simple lookup queries, not for real data integration; push result federation to the GraphQL data consumption layer. Integrate Random Indexing in the KG database: push heavy computation as close as possible to the data. Choose GraphQL over SPARQL for app developers
  23. Questions & Answers
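
The classical title-matching step described on the "A Classical Approach" slide can be sketched as follows. This is only an illustration under stated assumptions, not the method used in the talk: the normalization rules, the small `ROMAN` lookup and the function names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever string metric was actually applied.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Assumed lookup for sequel numbering; extend as needed.
ROMAN = {"ii": "2", "iii": "3", "iv": "4"}


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and unify sequel numerals."""
    text = unicodedata.normalize("NFKD", title).lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)  # ":" and the en dash become spaces
    tokens = [ROMAN.get(token, token) for token in text.split()]
    return " ".join(tokens)


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


pairs = [
    ("Harry Potter and the Deathly Hallows: Part II",
     "Harry Potter and the Deathly Hallows – Part 2"),
    ("Perfume: The Story of a Murderer", "Perfume"),
    ("Pirate Radio", "The Boat That Rocked"),
]
for left, right in pairs:
    print(f"{title_similarity(left, right):.2f}  {left!r} vs {right!r}")
```

Normalization handles punctuation and numbering differences such as the Harry Potter pair, but alternative release titles such as “Pirate Radio” and “The Boat That Rocked” share almost no characters, and identical titles such as “Avatar” remain ambiguous. Those are the cases where the talk turns to release dates and then to embeddings.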