Knowledge graphs promise a novel platform for better holistic decision making and analytics. Many projects fail to reach their full potential because of the prohibitively high cost of integrating new knowledge from the required information sources.
The talk explains the concept of semantic similarity as a tool for efficient entity clustering and matching based on graph and text embeddings. It will demonstrate Random Indexing, the scalable and easy-to-understand algorithm underlying this approach.
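To make the idea concrete, here is a minimal sketch of Random Indexing in plain Python. The dimension, sparsity, context window, and corpus are all toy-sized illustrative choices (production systems use far larger vectors and corpora): each term gets a fixed sparse random index vector, and a term's context vector accumulates the index vectors of its neighbours, so terms appearing in similar contexts end up with similar vectors.

```python
import random
from collections import defaultdict
from math import sqrt

DIM, NONZERO = 512, 8  # toy sizes; real systems use larger dimensions

def index_vector(rng):
    """Sparse ternary random vector: a few +1/-1 entries, the rest zero."""
    v = [0.0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        v[pos] = rng.choice((-1.0, 1.0))
    return v

def train(sentences, window=2, seed=42):
    """Accumulate each term's context vector from its neighbours' index vectors."""
    rng = random.Random(seed)
    index = defaultdict(lambda: index_vector(rng))  # fixed random vector per term
    context = defaultdict(lambda: [0.0] * DIM)      # learned context vectors
    for tokens in sentences:
        for i, term in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    for k, x in enumerate(index[tokens[j]]):
                        context[term][k] += x
    return context

def cosine(a, b):
    """Cosine similarity between two context vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

Because vectors are built incrementally from sparse random projections, the method scales to large graphs and text corpora without a costly global factorisation step.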
This work is part of the Ontotext Platform, which increases productivity in developing and maintaining large-scale knowledge graphs. The platform enables enterprises to develop and operate on top of such mission-critical systems for decision support, information discovery and metadata management.
An introduction to Azure Cognitive Services from a developer's viewpoint, illustrating the RESTful and SDK approaches to consuming ACS endpoints, with examples from Text Analytics and Computer Vision.
Abstract— Movie making is a multibillion-dollar industry. In 2018, the global movie business generated nearly $41.5 billion at the box office and more than that in merchandise revenue. But it is not a guaranteed business: every year we witness big-budget blockbuster movies that become either a “hit” or a “flop”. The success of a movie is mainly judged by the ratio of its gross revenue to its budget, though some may also call a movie successful if it earns critical praise and awards, neither of which necessarily converts to financial revenue. In our project we take the point of view of an investor, who largely favours financial return over any other attribute. To predict the success of a movie, however, an investor cannot rely on superficial attributes alone, which is exactly where Machine Learning (ML) prediction proves useful. We implement this prediction using two ML methods studied during the subject CMPE542, namely Random Forest and Neural Network. These are well suited to discriminating between classes, and can thus point very effectively to successful or failed movies after being trained on a set of 5043 movies whose data were scraped from IMDB. At the end of the project, we should know which method has the highest accuracy, which movies sell best at the box office and, most importantly for movie producers, which movie features are the most decisive in making a movie profitable.
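The abstract's setup can be sketched in a few lines of Python. This is not the project's implementation: the "hit" label (gross over budget), the feature names, and the data are all made up for illustration, and the forest is deliberately simplified to bootstrap-sampled one-feature decision stumps (a real project would use full decision trees, e.g. scikit-learn's RandomForestClassifier).

```python
import random

def label(movie, ratio=1.0):
    """Hypothetical success rule: a 'hit' when gross exceeds ratio x budget."""
    return 1 if movie["gross"] / movie["budget"] > ratio else 0

def stump(rows, feature):
    """One-feature decision stump: best threshold/polarity by training accuracy."""
    values = sorted({r[feature] for r in rows})
    best = (-1.0, values[0], 1)  # (accuracy, threshold, polarity)
    for t in values:
        for pol in (1, -1):
            preds = [1 if pol * (r[feature] - t) > 0 else 0 for r in rows]
            acc = sum(p == r["hit"] for p, r in zip(preds, rows)) / len(rows)
            if acc > best[0]:
                best = (acc, t, pol)
    return feature, best[1], best[2]

def forest(rows, features, n_trees=25, seed=0):
    """Toy random forest: each 'tree' is a stump fit on a bootstrap sample."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # bootstrap resampling
        trees.append(stump(sample, rng.choice(features)))
    return trees

def predict(trees, movie):
    """Majority vote across the stumps."""
    votes = sum(1 if pol * (movie[f] - t) > 0 else 0 for f, t, pol in trees)
    return 1 if 2 * votes >= len(trees) else 0
```

The essential Random Forest ingredients are visible even at this scale: bootstrap resampling plus random feature selection decorrelates the trees, and the majority vote smooths out individual errors.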
In our research, we begin by considering the HTN planning algorithm; however, none of these papers considers applying a formal grammar approach as a system for defining the syntax of a language by specifying which strings of symbols or sentences are considered grammatical.
In our paper, we present how using a formal grammar can solve the problem of web service composition in the context of virtual enterprise synthesis, and may substantially decrease the number of possible web service combinations the algorithm has to process.
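The pruning effect of a grammar can be shown with a toy example. The grammar and service names below are hypothetical, not taken from the paper: treating each web service as a terminal symbol, only the sequences derivable from the start symbol are candidate compositions, instead of every possible ordering of services.

```python
# Hypothetical grammar: a composition is a Search service followed by an
# Order phase, where payment is mandatory and shipping is optional.
GRAMMAR = {
    "Composition": [["Search", "Order"]],
    "Order": [["Pay"], ["Pay", "Ship"]],
}

def expand(symbol):
    """Yield every terminal sequence derivable from `symbol`."""
    if symbol not in GRAMMAR:        # terminal: a concrete web service
        yield [symbol]
        return
    for production in GRAMMAR[symbol]:
        seqs = [[]]
        for part in production:      # concatenate derivations of each part
            seqs = [s + tail for s in seqs for tail in expand(part)]
        yield from seqs

valid = list(expand("Composition"))
# Two grammatical compositions survive, versus 3! = 6 arbitrary orderings
# of the three services (and far more once optional services multiply).
```

Even in this tiny case the grammar cuts the search space by two thirds; with realistic service catalogues the reduction is what makes composition tractable.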
Effective data wrangling starts with data collection. Make sure to think about front-end validation, limiting free text, GDPR, security, accessibility, inclusion, and designing to minimise the amount of data collected.
The next part is effective data storage. You can store data in relational or non-relational data stores. Relational data solutions in Azure include Azure SQL Database, Azure Database for MariaDB, and Azure Database for PostgreSQL. Non-relational data solutions include Azure Storage and Azure Cosmos DB.
You might wrangle data during a data movement process between systems. A batch process moves multiple records, typically on a schedule. A streaming process operates on each individual event/message as it happens. In ETL (Extract, Transform, Load) you run a process to retrieve data from a system, process it, and then insert it into a new data store. In ELT (Extract, Load, Transform) you run a process to retrieve data from a system, insert it into a new data store, and then process it.
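The ETL/ELT distinction is easiest to see side by side. This sketch uses an in-memory SQLite database as a stand-in for the target data store and made-up "messy" source rows; the point is only where the transform happens, in the pipeline (ETL) versus inside the store (ELT).

```python
import sqlite3

raw = [("alice", " 42 "), ("bob", "17")]  # messy extract from a source system

def etl(raw):
    """ETL: transform in the pipeline process, then load only the clean result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    clean = [(name.title(), int(age.strip())) for name, age in raw]   # Transform
    conn.executemany("INSERT INTO people VALUES (?, ?)", clean)       # Load
    return sorted(conn.execute("SELECT * FROM people"))

def elt(raw):
    """ELT: load the raw extract first, then transform inside the data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (name TEXT, age TEXT)")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", raw)        # Load
    conn.execute(
        "INSERT INTO people SELECT upper(substr(name, 1, 1)) || substr(name, 2),"
        " CAST(trim(age) AS INTEGER) FROM staging")                   # Transform
    return sorted(conn.execute("SELECT * FROM people"))
```

Both routes end with the same clean table; ELT trades a raw staging area for the ability to push the transform down to the (usually more scalable) data store.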
The modern data warehouse stack (used to combine data from multiple sources) includes Azure Data Factory, Azure Databricks, Azure Synapse Analytics and Azure Analysis Services.
Power Query is a useful language for processing data and can be used in Excel, Power BI, and Azure Data Factory.
Power BI is a self-service data modelling and visualisation tool that can connect to data stored in many different places. It has a data modelling area where Power Query can be used to build a dataset. Reports allow you to build multi-page analyses of a dataset. A dashboard combines visuals from one or more reports to provide a broader and simpler view.
SQL is a language that works with many relational and non-relational systems. You can use it to interact with data, data structures, and the access model for data objects.
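Those three roles of SQL can be shown in one short session. The table and rows here are invented for illustration, and SQLite (used via Python for convenience) stands in for any SQL-speaking system; note that SQLite itself has no user accounts, so the access-model statements (GRANT/REVOKE) apply to server databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data structures (DDL): define the shape of the data.
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

# Data (DML): insert rows, then wrangle them with a query.
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("ann", 10.0), ("ann", 5.0), ("ben", 7.5)])
totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"))

# Access model (DCL): in a server database you would control who may read it,
# e.g.  GRANT SELECT ON orders TO analyst;
```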
Finally, you can write code to wrangle data, with Python, R, and .NET being common data wrangling languages.
"Why the Semantic Web will Never Work" (note the quotes), by James Hendler
This talk refutes some criticisms of the semantic web, but also outlines some research challenges we must overcome if we are to ever realize Tim Berners-Lee's original Semantic Web vision.
The presentation I gave at the 2007 Semantic Technology Conference. "Declarative programming" has become the latest buzzword to describe languages that abstractly define systems requirements (the what) and leave the implementation (the how) to be determined by an independent process. This makes the semantics (meaning) of declarative data elements even more critical as these systems are shared between organizations. This presentation: (1) Provides a background of declarative programming (2) Describes why understanding the semantic aspects of declarative systems is critical to cost-effective software development.
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec..., by DianaGray10
Continuous accuracy and efficiency of Large Language Models (LLM) is key to successfully building out your next AI-infused automation, regardless of business use case.
For our next Connector Corner webinar, we’ll explore how using a seamless AI integration process provides access to industry leading models, curated activities, and embeddings that help achieve operational efficiency.
Join us on March 26 to learn about:
Accessing large language models, hosted by UiPath
Reducing complexities of prompt-engineering, by using curated sets of activities
Assuring accuracy and safety, by building an AI Trust Layer to moderate the output of AI models, and their generated results.
Discovering what’s new in embeddings connectivity
Cultivating your AI knowledgebase using Vector Databases
Expect to see these use cases in action:
Leveraging UiPath hosted LLMs and activities
Document comparison using our LLM framework
Please stay tuned for additional use cases
Speakers:
Charlie Greenberg, host
George Roth, Technology Evangelist
Scott Schoenberger, Senior Product Manager
Koji Takimoto, Director Product Support
The Biggest Picture: Situational Awareness on a Global Level, by Inside Analysis
The Briefing Room with Dr. Robin Bloor and Modus Operandi
Live Webcast July 28, 2015
Watch the Archive: https://bloorgroup.webex.com/bloorgroup/onstage/g.php?MTID=efc4082d9b0b0adfcd753a7435d2d6a1b
The analytic bottlenecks of yesterday need not apply today. The boundaries are also falling thanks in large part to the abundance of third-party data. The most data-driven companies these days are finding creative ways to dynamically incorporate data from within and beyond the firewall, thus building highly accurate, multidimensional views of their business, customer, competition or other subject areas.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains the magnitude of change that's occurring in the world of data, why it's happening now, and how you can take advantage. He'll be briefed by Mike Gilger and Boris Pelakh, who will showcase their company's enterprise analytics platform, which combines a range of battle-tested functionality to deliver dynamic situational awareness that can leverage a comprehensive array of data sets. They'll explain how the platform's reasoner benefits from a highly scalable rules engine, and a flexible modeling capability that can optimize data storage virtually on the fly.
Visit InsideAnalysis.com for more information.
Evolving as a professional software developer, by Anton Kirillov
This is the second edition of my keynote "On Being a Professional Software Developer", with slide comments (in Russian) which contain the main ideas of the keynote.
I hope the slides could be used as a standalone reading material.
Data Workflows for Machine Learning - Seattle DAML, by Paco Nathan
First public meetup at Twitter Seattle, for Seattle DAML:
http://www.meetup.com/Seattle-DAML/events/159043422/
We compare/contrast several open source frameworks which have emerged for Machine Learning workflows, including KNIME, IPython Notebook and related Py libraries, Cascading, Cascalog, Scalding, Summingbird, Spark/MLbase, MBrace on .NET, etc. The analysis develops several points for "best of breed" and what features would be great to see across the board for many frameworks... leading up to a "scorecard" to help evaluate different alternatives. We also review the PMML standard for migrating predictive models, e.g., from SAS to Hadoop.
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013), by Jeff Magnusson
Overview of the data platform as a service architecture at Netflix. We examine the tools and services built around the Netflix Hadoop platform that are designed to make access to big data at Netflix easy, efficient, and self-service for our users.
From the perspective of a user of the platform, we walk through how various services in the architecture can be used to build a recommendation engine. Sting, a tool for fast in memory aggregation and data visualization, and Lipstick, our workflow visualization and monitoring tool for Apache Pig, are discussed in depth. Lipstick is now part of Netflix OSS - clone it on github, or learn more from our techblog post: http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html.
The Rise of the DataOps - Dataiku - J On the Beach 2016, by Dataiku
Many organisations are creating groups dedicated to data. These groups have many names: Data Team, Data Labs, Analytics Teams…
But whatever the name, the success of those teams depends a lot on the quality of the data infrastructure and their ability to actually deploy data science applications in production.
In that regard, a new role of “DataOps” is emerging. Similar to DevOps for (web) development, the DataOps is a merge between a data engineer and a platform administrator. Well versed in cluster administration and optimisation, a DataOps would also have a perspective on the quality of the data and the relevance of predictive models.
Do you want to be a DataOps? We’ll discuss the role and its challenges during this talk.
Visualizing your results accurately can reveal hidden insights, catch errors, and inspire your audience to investigate further. During this workshop, we’ll cover types of data visualizations and when they’re most effective, different JavaScript charting libraries such as D3, Google Charts, and Dygraphs, and how to get started on a simple dashboard.
After the amazing breakthroughs of machine learning (deep learning or otherwise) in the past decade, the shortcomings of machine learning are also becoming increasingly clear: unexplainable results, data hunger and limited generalisability are all becoming bottlenecks.
In this talk we will look at how the combination with symbolic AI (in the form of very large knowledge graphs) can give us a way forward, towards machine learning systems that can explain their results, that need less data, and that generalise better outside their training set.
--
Frank van Harmelen leads the Knowledge Representation & Reasoning group in the CS Department of the VU University Amsterdam. He is also Principal Investigator of the Hybrid Intelligence Centre, a €20M, 10-year collaboration between researchers at 6 Dutch universities into AI that collaborates with people instead of replacing them.
--
While mathematicians have used graph theory since the 18th century to solve problems, the software patterns for graph data are new to most developers. To enable "mass adoption" of graph technology, we need to establish the right abstractions, access APIs, and data models.
RDF triples, while of paramount importance in establishing RDF graph semantics, are a low-level abstraction, much like using assembly language. For practical and productive “graph programming” we need something different.
Similarly, existing declarative graph query languages (such as SPARQL and Cypher) are not always the best way to access graph data, and sometimes you need a simpler interface (e.g., GraphQL), or even a different approach altogether (e.g., imperative traversals such as with Gremlin).
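The "triples as assembly language" point can be felt in a few lines. The toy in-memory triple store and its data below are invented for illustration: even a simple friends-of-friends question requires an explicit, Gremlin-style imperative traversal when all you have is raw triples.

```python
# A toy in-memory triple store: a set of (subject, predicate, object) facts.
# All names and data here are made up.
TRIPLES = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "knows", "dave"),
    ("carol", "knows", "erin"),
}

def objects(subject, predicate):
    """All objects reachable from `subject` over one `predicate` edge."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def friends_of_friends(person):
    """Imperative traversal, roughly Gremlin's out('knows').out('knows')."""
    return {fof for f in objects(person, "knows")
                for fof in objects(f, "knows")}
```

A declarative query language hides exactly this kind of hand-written loop, while a higher-level API such as GraphQL hides the graph shape itself; which abstraction is right depends on the consumer.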
Ora Lassila is a Principal Graph Technologist in the Amazon Neptune graph database group. He has a long experience with graphs, graph databases, ontologies, and knowledge representation. He was a co-author of the original RDF specification as well as a co-author of the seminal article on the Semantic Web.
Knowledge Architecture: Combining Strategy, Data Science and Information Arch..., by Connected Data World
"The most important contribution management needs to make in the 21st Century is to increase the productivity of knowledge work and the knowledge worker", said Peter F. Drucker in 1999, and time has proven him right.
Even NASA is no exception, as it faces a number of challenges. NASA has hundreds of millions of documents, reports, project data, lessons learned, scientific research, medical analysis, geospatial data, IT logs, and all kinds of other data stored nation-wide.
The data is growing in terms of variety, velocity, volume, value and veracity. NASA needs to provide accessibility to engineering data sources, whose visibility is currently limited. To convert data to knowledge a convergence of Knowledge Management, Information Architecture and Data Science is necessary.
This is what David Meza, Acting Branch Chief - People Analytics, Sr. Data Scientist at NASA, calls "Knowledge Architecture": the people, processes, and technology of designing, implementing, and applying the intellectual infrastructure of organizations.
A talk by Aleksa Gordic | Software - Deep Learning engineer, Microsoft | The AI Epiphany
What can you learn about Graph Machine Learning in 2 months?
Aleksa Gordic, Machine Learning engineer @ Microsoft and Founder @ The AI Epiphany, shares his journey into the world of Graph Machine Learning. Aleksa started by exploring the basics, and ended up implementing and open-sourcing his own Graph Attention Network in PyTorch.
In this talk, Aleksa will share the fundamentals of Graph Machine Learning, provide real-world examples, resources, and everything his younger self would be grateful for. Aleksa will also be available to answer questions.
What is Graph Machine Learning? Simply put, Graph Machine Learning is a branch of machine learning that deals with graph data.
Graphs consist of nodes, which may have feature vectors associated with them, and edges, which again may or may not have feature vectors attached. The applications are endless: massive-scale recommender systems, particle physics, computational pharmacology / chemistry / biology, traffic prediction, fake news detection, and the list goes on and on.
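That definition of a graph with node features translates directly into code. The tiny graph below is made up, and the aggregation step is a deliberately bare-bones illustration of the neighbourhood aggregation most graph ML layers build on; a real layer (such as the Graph Attention Network mentioned above) would add learned weights, attention coefficients, and nonlinearities.

```python
# Nodes with feature vectors, plus undirected edges (all data invented).
features = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
edges = [("a", "b"), ("b", "c")]

def neighbours(node):
    """Nodes adjacent to `node` (edges are undirected)."""
    return ([v for u, v in edges if u == node] +
            [u for u, v in edges if v == node])

def aggregate(features):
    """One round of message passing: each node's new feature vector is the
    mean of its own and its neighbours' feature vectors."""
    out = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbours(node)]
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out
```

Stacking several such rounds lets information flow across the graph, which is what allows a node's representation to reflect its wider neighbourhood.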
In recent years, graphs have been increasingly adopted in financial services for everything from fraud detection to Know Your Customer (KYC) to regulatory requirements. At the same time, Environmental, Social and Governance (ESG) investing has become the fastest-growing segment of financial services. In this session James discusses how many of these historical graph techniques are now being enhanced for the era of sustainable investing. Going beyond definitions, let's identify use cases, discuss news and trends, and wrap up with an ask-me-anything session.
What is graph all about, and why should you care? Graphs come in many shapes and forms, and can be used for different applications: Graph Analytics, Graph AI, Knowledge Graphs, and Graph Databases.
Talk by George Anadiotis. Connected Data London Meetup June 29th 2020.
Up until the beginning of the 2010s, the world was mostly running on spreadsheets and relational databases. To a large extent, it still does. But the NoSQL wave of databases has largely succeeded in instilling the “best tool for the job” mindset.
After relational, key-value, document, and columnar, the latest link in this evolutionary proliferation of data structures is graph. Graph analytics, Graph AI, Knowledge Graphs and Graph Databases have been making waves, included in hype cycles for the last couple of years.
The Year of the Graph marked the beginning of it all before the Gartners of the world got in the game. The Year of the Graph is a term coined to convey the fact that the time has come for this technology to flourish.
The eponymous article that set the tone was published in January 2018 on ZDNet by domain expert George Anadiotis. George has been working with, and keeping an eye on, all things Graph since the early 2000s. He was one of the first to note the continuing rise of Graph Databases, and to bring this technology in front of a mainstream audience.
The Year of the Graph has been going strong since 2018. In August 2018, Gartner started including Graph in its hype cycles. Ever since, Graph has been riding the upward slope of the Hype Cycle.
The need for knowledge on these technologies is constantly growing. To respond to that need, the Year of the Graph newsletter was released in April 2018. In addition, a constant flow of graph-related news and resources is being shared on social media.
To help people make educated choices, the Year of the Graph Database Report was released. The report has been hailed as the most comprehensive of its kind in the market, consistently helping people choose the most appropriate solution for their use case since 2018.
The report, articles, news stream, and the newsletter have been reaching thousands of people, helping them understand and navigate this landscape. We’ll talk about the Year of the Graph, the different shapes, forms, and applications for graphs, the latest news and trends, and wrap up with an ask me anything session.
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2, by Connected Data World
Do you have experience in data modeling, or using taxonomies to classify things, and want to upgrade to modeling knowledge graphs? This hands-on workshop with one of the leading knowledge graph practitioners will help you get started.
Part 3
For as long as people have been thinking about thinking, we have imagined that somewhere in the inner reaches of our minds there are ghostly, intangible things called ideas which can be linked together to create representations of the world around us — a world that has a certain structure, conforms to certain rules, and to a certain extent, can be predicted and manipulated on the basis of our ideas.
Rationalist philosophers have struggled for centuries to make a solid case for this intuitive, almost inborn view of human experience, but it is only with the advent of modern computing that we have the opportunity to build machines which truly think the way we think we think.
For the first time, we can give concrete form to our mental representations as graphs or hypergraphs, explicitly specify our mental schemas as ontologies, and formally define the rules by which we reason and act on new information. If we so choose, we can even use these human-like building blocks to construct systems that carry far more information than any single human brain, and that connect and serve millions of people in real time.
As enterprise knowledge graphs become increasingly mainstream, we appear to be headed in that direction, although there is no guarantee that the momentum will continue unless actively sustained. Where knowledge graphs are likely to be the most essential, in the long run, is at the interface between human and machine; mental representation versus formal knowledge representation.
In this talk, we will take a step back from the many practical and social challenges of building large-scale knowledge graphs, which at this point are well-known. Instead, we will take up the quest for an ideal data model for knowledge representation and data integration, seeking common ground among the most popular data models used in industry and open source software, surveying what we suspect to be true of our own inner models, and previewing structure and process in Apache TinkerPop, version 4. We will also take a tentative step forward into the world of augmented perception via graph stream processing.
Graph in Apache Cassandra. The World’s Most Scalable Graph Database - Connected Data World
Graph databases are everywhere right now. The explosive growth in the graph market coupled with the hype of solving graph problems is causing both excitement and confusion. From labeled property graphs to RDF to pure graph analytics to multi-model databases, the breadth of graph offerings is staggering.
The good news? DataStax has been listening—and building.
In this session, we’ll show you how DataStax Graph is architected into Apache Cassandra to deliver the world’s most scalable graph database. You’ll learn how to integrate Cassandra data into mixed workloads, design scalable property graphs, and even turn your existing tables into graphs.
With your high throughput time series data distributed next to its relationships, what will you build next?
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d... - Connected Data World
As one of the largest financial institutions worldwide, JP Morgan is reliant on data to drive its day-to-day operations, against an ever-evolving regulatory regime. Our global data landscape poses particular challenges for effectively maintaining data governance and metadata management.
The Data strategy at JP Morgan intends to:
a) generate business value
b) adhere to regulatory & compliance requirements
c) reduce barriers to access
d) democratize access to data
In this talk, we show how JP Morgan leverages semantic technologies to drive the implementation of our data strategy. We demonstrate how we exploit knowledge graph capabilities to answer:
1) What Data do I need?
2) What Data do we have?
3) Where does my Data come from?
4) Where should my Data come from?
5) What Data should be shared most?
Graph applications were once considered “exotic” and expensive. Until recently, few software engineers had much experience putting graphs to work. However, the use cases are now becoming more commonplace.
This talk explores a practical use case, one which addresses key issues of data governance and reproducible research, and depends on sophisticated use of graph technology.
Consider: some academic disciplines such as astronomy enjoy a wealth of data — mostly open data. Popular machine learning algorithms, open source Python libraries, and distributed systems all owe much to those disciplines and their history of big data.
Other disciplines require strong guarantees for privacy and security. Datasets used in social science research involve confidential details about human subjects: medical histories, wages, home addresses for family members, police records, etc.
Those cannot be shared openly, which impedes researchers from learning about related work by others. Reproducibility of research and the pace of science in general are limited. Nonetheless, social science research is vital for civil governance, especially for evidence-based policymaking (US federal law since 2018).
Even when data may be too sensitive to share openly, often the metadata can be shared. Constructing knowledge graphs of metadata about datasets — along with metadata about authors, their published research, methods used, data providers, data stewards, and so on — provides effective means to tackle hard problems in data governance.
Knowledge graph work supports use cases such as entity linking, discovery and recommendations, axioms for inferring compliance, etc. This talk reviews the Rich Context AI competition and the related ADRF framework, now used by more than 15 federal agencies in the US.
We’ll explore knowledge graph use cases, use of open standards and open source, and how this enhances reproducible research. Social science research for the public sector has much in common with data use in industry.
Issues of privacy, security, and compliance overlap, pointing toward what will be required of banks, media channels, etc., and what technologies apply. We’ll look at comparable work emerging in other parts of industry: open source projects, open standards emerging, and in particular a new set of features in Project Jupyter that support knowledge graphs about data governance.
Powering Question-Driven Problem Solving to Improve the Chances of Finding Ne... - Connected Data World
Making true “molecule”-“mechanism”-“observation” relationship connections is a time consuming, iterative and laborious process. In addition, it is very easy to miss critical information that affects key decisions or helps make plausible scientific connections.
The current practice for deciphering such relationships frequently involves subject matter experts (SMEs) requesting resource from resource-constrained data science departments to refine and redo highly similar ad hoc searches. The result of this is impairment of both the pace and quality of scientific reviews.
In this presentation, I show how semantic integration can be made to ultimately become part of an integrated learning framework for more informed scientific decision making. I will take the audience through our pilot journey and highlight practical learnings that should inform subsequent endeavours.
Knowledge Graphs and AI to Hyper-Personalise the Fashion Retail Experience at... - Connected Data World
What is the key to the holistic success of the fastest-growing and most successful companies of our time? Often, the key is the rapid increase in collected and analysed data. Graph databases provide a way to organise data semantically by classes rather than tables, are web-aware, and are superior to traditional relational or NoSQL data stores for handling deep, complex relationships.
It is these deep, complex relationships that can provide the rich context for hyper-personalising your product offering, inspiring consumers to purchase. In this talk, we describe how we are using artificial intelligence at Farfetch to not only help build a knowledge graph but also to evolve our insights with state-of-the-art graph-based AI.
A world of structured data promises us an incredible future. But most websites struggle to even implement basic schema.org markup. Fewer still represent and connect their pages and content in sophisticated, structured graphs. We can’t reach that incredible future without increasing and improving adoption.
To move forward, we need to make constructing rich structured data as easy as writing a recipe. This isn’t a pipe dream: at Yoast, we think we’ve solved schema for everybody, everywhere. We’d love to share our story.
The relationships between data sets matter. Discovering, analyzing, and learning those relationships is central to expanding our understanding, and is a critical step toward being able to predict and act upon the data. Unfortunately, these are not always simple or quick tasks.
To help the analyst we introduce RAPIDS, a collection of open-source libraries, incubated by NVIDIA and focused on accelerating the complete end-to-end data science ecosystem. Graph analytics is a critical piece of the data science ecosystem for processing linked data, and RAPIDS is pleased to offer cuGraph as our accelerated graph library.
Simply accelerating algorithms addresses only a portion of the problem. To address the full problem space, RAPIDS cuGraph strives to be feature-rich, easy to use, and intuitive. Rather than limiting the solution to a single graph technology, cuGraph supports property graphs, knowledge graphs, hyper-graphs, bipartite graphs, and basic directed and undirected graphs.
A Python API allows the data to be manipulated as a DataFrame, similar and compatible with Pandas, with inputs and outputs being shared across the full RAPIDS suite, for example with the RAPIDS machine learning package, cuML.
This talk will present an overview of RAPIDS and cuGraph, discuss and show examples of how to manipulate and analyze bipartite and property graphs, and show how data can be shared with machine learning algorithms. The talk will include some performance and scalability metrics, then conclude with a preview of upcoming features, such as graph query language support, and the general RAPIDS roadmap.
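As a CPU-only illustration of the workflow cuGraph accelerates (an edge list in, per-node scores out), here is a minimal pure-Python PageRank. It is a conceptual analog for readers without a GPU, not cuGraph's implementation; the graph data is made up.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """PageRank over a directed edge list [(src, dst), ...]."""
    nodes = {n for edge in edges for n in edge}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # dangling nodes spread their rank evenly over all nodes
            targets = out_links[src] or list(nodes)
            share = damping * rank[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        rank = nxt
    return rank

# tiny illustrative graph: "a" is the most linked-to node
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
scores = pagerank(edges)
```

In cuGraph the same shape of computation runs on GPU DataFrames, with inputs and outputs shared across the rest of the RAPIDS suite.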
Elegant and Scalable Code Querying with Code Property Graphs - Connected Data World
Programming is an unforgiving art form in which even minor flaws can cause rockets to explode, data to be stolen, and systems to be compromised. Today, a system tasked to automatically identify these flaws not only faces the intrinsic difficulties and theoretical limits of the task itself, it must also account for the many different forms in which programs can be formulated and account for the awe-inspiring speed at which developers push new code into CI/CD pipelines. So much code, so little time.
The code property graph – a multi-layered graph representation of code that captures properties of code across different abstractions (application code, libraries and frameworks) – has been developed over the last six years to provide a foundation for the challenging problem of identifying flaws in program code at scale, whether it is high-level dynamically-typed JavaScript, statically-typed Scala in its bytecode form, the syntax trees generated by the Roslyn C# compiler, or the bitcode that flows through LLVM.
Based on this graph, we define a common query language, grounded in a formal code property graph specification, to elegantly analyze code regardless of the source language. Paired with a state-of-the-art data flow tracker built on code property graphs, we arrive at a powerful, distributed, cloud-native code analysis platform. This talk provides an introduction to the technology.
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle... - Connected Data World
Do you want to learn how to use the low-hanging fruit of knowledge graphs — schema.org and JSON-LD — to annotate content and improve your SEO with semantics and entities? This hands-on workshop with one of the leading Semantic SEO practitioners will help you get started.
Can graph technology improve the deployment of humanitarian projects? The goal of using what we call “Graphs for good at Action Against Hunger” is to be more efficient and transparent, and this can have a crucial impact on people’s lives.
Are there common behaviour factors across different projects? Can elements of different resources or projects be related? For example, security incidents in a city could influence the way other projects run there.
The explained use case data comes from a project called Kit For Autonomous Cash Transfer in Humanitarian Emergencies (KACHE) whose goal is to deploy electronic cash transfers in emergency situations when no suitable infrastructure is available.
It also offers the opportunity to track transactions in order to better recognize crisis-affected population behaviours, understand the goods distribution network to improve recommendations, and identify the role of culture in transactional patterns, as well as the most required items for each location.
In this talk we will go back to basics with ontologies and, from there, project forwards to their future. I’ll base most of what I talk about on my experience in bio-ontologies, but most of that experience will be applicable in many domains; most domains are not as special as they think.
When it comes down to the basics, we need to know what our data represent or mean; that is where ontologies come into play: we need to know what we’re talking about. Once we have that clear, we can proceed. There is much that one can do with data once we know what it means.
We can exploit those data through knowing what it represents. We can exploit these data better if our ontologies are also better. In taking this simple point of view forwards, I will use this talk to establish a set of principles for ontologists.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Semantic similarity for faster Knowledge Graph delivery at scale
1. making sense of text and data
October, 2019
Connected Data London
Semantic Similarity for Faster Knowledge Graph Delivery at Scale
2. Why Knowledge Graphs?
“Cross-industry studies show that on average, less than half of an
organization’s structured data is actively used in making decisions—and
less than 1% of its unstructured data is analyzed or used at all”
What’s Your Data Strategy? Leandro DalleMule and Thomas H. Davenport, Harvard Business Review
Top 5 USA Banks
4. What is a Knowledge Graph?
Graph, Semantics, Smart, Alive
5. Multiple Enterprise Data Management Systems
KG platforms combine capabilities of several enterprise systems:
o Master and reference data management
o Corporate/Enterprise Taxonomy
o Data warehouse
o Metadata management
o Digital asset management
o Enterprise search
6. Challenges in Enterprise Semantic Integration
IMDB (Type / Titles):
TV Episodes 4’044’529
Short film 681’067
Feature film 516’726
Video 164’061
TV series 164’061
TV movies 126’206
… …
Total * 5’838’514

WikiData (Type / Titles):
film 235’707
silent short film 16’377
television film 15’345
short film 11’225
animated film 3’785
… …
Total 289’650

* Later the tests use only 5K crawled datasets
7. Challenges in Enterprise Semantic Integration
Multiple levels of inconsistencies:
o Types: film vs “TV movie”
o Meta-data: “science fiction”, “military science fiction” vs “Sci-Fi”
o Reference data: “US” vs. “United States”
o Manually curated cross-links (!) for testing purposes only
8. A Classical Approach
o Start with string matching of the Titles
“Harry Potter and the Deathly Hallows: Part II” vs.
“Harry Potter and the Deathly Hallows – Part 2”
“Perfume: The Story of a Murderer” vs “Perfume”
“Pirate Radio” vs. “The Boat That Rocked”
“Avatar” vs ”Avatar” (4 movies)
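The string-matching baseline above can be sketched with Python's standard difflib; the normalisation rules here are illustrative, not the exact ones used in the project:

```python
import re
from difflib import SequenceMatcher

def normalise(title: str) -> str:
    # lowercase, strip punctuation, collapse whitespace
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())

def title_similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two movie titles."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

# the "Part II" vs "Part 2" pair scores high after normalisation
title_similarity("Harry Potter and the Deathly Hallows: Part II",
                 "Harry Potter and the Deathly Hallows – Part 2")
```

Retitled releases such as “Pirate Radio” vs. “The Boat That Rocked” still score low, which is exactly why string matching alone is insufficient.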
9. A Classical Approach with extra Rules
o Add release date matching
Lose 10% of the matches due to bad dates
o Ambiguity is greatly reduced but still many:
Remaining ambiguous candidates:
tt0238520, 16 October 1995, 50 min
tt1125875, 11 April 1995, 48 min
tt0238520, 23 June 1995, 1h 21 min
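Combining the title rule with a release-date check might look like the following sketch; the thresholds are illustrative assumptions, not the values used in the talk:

```python
from datetime import date

def plausible_match(title_similarity: float,
                    release_a: date, release_b: date,
                    tolerance_days: int = 365) -> bool:
    """Accept a candidate pair only if titles are near-identical
    and release dates fall within a tolerance window."""
    close_titles = title_similarity > 0.85
    close_dates = abs((release_a - release_b).days) <= tolerance_days
    return close_titles and close_dates

# candidates from the slide: same title similarity, different dates
plausible_match(0.95, date(1995, 10, 16), date(1995, 4, 11))
```

Bad or missing dates then cost about 10% of the matches, as the slide notes.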
11. What is Knowledge Graph Embedding?
o Predict similar graph nodes or properties
o Require no input training data
o Mathematical representation of graph nodes as vectors:
e.g. films plotted along duration, drama, and comedy dimensions:
“The Godfather” (2h 58m, drama) vs. “American Pie” (1h 15 min, comedy)
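The comparison implied by such vector representations is typically cosine similarity. A minimal sketch, with made-up coordinates along the (duration, drama, comedy) axes:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# illustrative coordinates only: (duration_hours, drama, comedy)
godfather = [2.97, 1.0, 0.0]
american_pie = [1.25, 0.0, 1.0]
similarity = cosine(godfather, american_pie)
```

Nodes with similar property profiles end up with nearby vectors, which is what the embedding exploits.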
12. Knowledge Graph Embedding Example
o For each film include all actors, director, country of origin
o Vast matrix with entities and literals
Terms × Documents matrix (terms: [Actor] “Adam LeFevre”, [Actor] “Anthony Anderson”, [Actor] “Mia Farrow”, [Country] “France”, [Country] “US”, [Country] “United states”, [Director] “Luc Besson”, …):
wd:Q550232 → 1 1 1 1 1
imdb:tt0344854 → 1 1 1 1
13. Random Indexing (RI) Algorithm
o Reduces the matrix dimension with elemental vectors
o For each term w, calculate a context vector S(w) by summing the index vectors of all elemental vectors x appearing in the context of w
o Light-weight and fast (250K x 1.45M matrix in < 5m)
o Fast sub-second searches and requires limited RAM
Movie × Actors matrix with elemental vectors (Adam LeFevre, Anthony Anderson, Mia Farrow):
wd:Q550232 → 1 1 1
imdb:tt0344854 → 1 0 1
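The construction above can be sketched in a few lines of Python. This is a minimal, illustrative Random Indexing implementation; the dimensionality, sparsity, seed, and data are assumptions, not the plugin's internals:

```python
import math
import random

DIM, NONZERO = 512, 8  # illustrative hyper-parameters

def elemental_vector(rng):
    """Sparse ternary index vector: a few random +1/-1 entries."""
    v = [0] * DIM
    for i in rng.sample(range(DIM), NONZERO):
        v[i] = rng.choice((-1, 1))
    return v

def random_index(docs, seed=42):
    """docs: {doc_id: [terms]} -> {doc_id: context vector}.
    Each document's context vector is the sum of the elemental
    vectors of the terms appearing in it."""
    rng = random.Random(seed)
    elemental = {}
    context = {}
    for doc_id, terms in docs.items():
        acc = [0] * DIM
        for term in terms:
            if term not in elemental:
                elemental[term] = elemental_vector(rng)
            acc = [a + b for a, b in zip(acc, elemental[term])]
        context[doc_id] = acc
    return context

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

docs = {
    "wd:Q550232": ["Adam LeFevre", "Anthony Anderson", "Mia Farrow"],
    "imdb:tt0344854": ["Adam LeFevre", "Mia Farrow"],
    "imdb:tt0000001": ["Unrelated Actor A", "Unrelated Actor B"],
}
vectors = random_index(docs)
```

Entities that share actors end up with a high cosine similarity (a document-to-document search), while unrelated entities stay near zero, which is what makes RI usable for cross-dataset entity matching.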
14. Random Indexing (RI) Algorithm #2
o Supports similarity searches for:
Document to Document – similar movies
Document to Term – specific actor/director
Term to Term – similar actors/directors
Term to Document – find movies specific for this actor/director
o Features all properties of a Vector Space model
o Partial matching, weights, ranking + context sensitive semantic search
Movie × Actors matrix with elemental vectors (Adam LeFevre, Anthony Anderson, Mia Farrow):
wd:Q550232 → 1 1 1
imdb:tt0344854 → 1 0 1
16. Reference Software Architecture
o Easy consumption of data
o No backend development
o Flexible data processing tools
o Standard and open interfaces
Diagram components: KG Consumers; Ontotext Platform; GraphDB with Similarity Plugin; interfaces: GQL query, GQL mutation, GQL Federation, SPARQL, RDF / Structured data
17. Transform CSV to RDF
o Perform standard ETL tasks
o Trim spaces, parse numbers and dates
o Parse IMDB ids from links for testing
o Map table data to RDF
o SPARQL over tabular data
o Split multi-valued fields like ”Action|Thriller”
o Schema level alignment not yet applied
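The ETL steps above can be sketched as a small CSV-to-triples transform; the column names below are illustrative, not the exact headers of the scraped dataset:

```python
import csv
import io

# a one-row sample with a leading/trailing-space title and a
# multi-valued genres field, mimicking the scraped input
SAMPLE = "movie_title,title_year,genres\n  Pirate Radio ,2009,Comedy|Drama\n"

def csv_to_triples(text):
    """Standard ETL: trim spaces, parse numbers,
    split multi-valued fields, emit <s, p, o> triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = "imdb:" + row["movie_title"].strip().replace(" ", "_")
        triples.append((subject, ":year", int(row["title_year"].strip())))
        for genre in row["genres"].split("|"):  # e.g. "Comedy|Drama"
            triples.append((subject, ":genre", genre.strip()))
    return triples

triples = csv_to_triples(SAMPLE)
```

The resulting triples map straight onto the RDF the similarity plugin consumes.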
18. Similarity Plugin API
subject predicate object
wd:Q550232 :actor “Adam LeFevre”
imdb:tt0344854 :actor "Adam LeFevre”
… … …
o Accepts a graph described by <s, p, o>
o Indexes any RDF types
o Works with virtual overlays like:
imdb:tt0344854 --imdb:actor_2_name--> “Adam LeFevre”
wd:Q550232 --wdt:P161--> wd:Q2702964 --rdfs:label--> “Adam LeFevre”
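The virtual-overlay idea can be sketched as a predicate-rewriting step that maps source-specific predicates onto one shared predicate, so both datasets expose the same <s, p, o> shape to the index. The mapping below is illustrative:

```python
# illustrative overlay mapping: source predicate -> virtual predicate
PREDICATE_MAP = {
    "imdb:actor_2_name": ":actor",
    "wdt:P161/rdfs:label": ":actor",  # Wikidata path: cast member -> label
}

def virtual_overlay(triples, pmap=PREDICATE_MAP):
    """Rewrite source-specific predicates onto shared virtual ones."""
    return [(s, pmap.get(p, p), o) for s, p, o in triples]

raw = [
    ("imdb:tt0344854", "imdb:actor_2_name", "Adam LeFevre"),
    ("wd:Q550232", "wdt:P161/rdfs:label", "Adam LeFevre"),
]
overlay = virtual_overlay(raw)
```

After rewriting, both the IMDB and the Wikidata entity carry an identical `:actor` statement, so the shared literal makes them comparable.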
19. Specify KG Embeddings – Select Predicates
o Similarity plugin expects triples <s, p, o>
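Predicate selection can be sketched as a simple whitelist filter over the triples before indexing; the predicate names are illustrative:

```python
# hypothetical whitelist of predicates to include in the embedding
KEEP = {":actor", ":director", ":country"}

def select_predicates(triples, keep=KEEP):
    """Restrict the <s, p, o> graph to the predicates
    chosen for building the embedding."""
    return [(s, p, o) for s, p, o in triples if p in keep]

graph = [
    ("wd:Q550232", ":actor", "Adam LeFevre"),
    ("wd:Q550232", ":budget", "100000"),
]
selected = select_predicates(graph)
```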
21. Results
o Find similar RDF resources to “Pirate Radio”
o Even a limited set of predicates returns acceptable results
o Important independent alternative for entity matching
22. Important Design Considerations
o Prefer RDF over Property Graph
o Much richer technology ecosystem (schema, dataset, reasoning, strings vs things)
o Virtualization versus Consolidation
o Virtualization works only for simple lookup queries, not for real data integration
o Push result federation to the GraphQL data consumption layer
o Integrating Random Indexing in the KG database
o Push heavy computation as close as possible to the data
o Choose GraphQL over SPARQL for app developers: