5. • Input:
- Two sets of objects: R and S
- A similarity function: sim(r,s)
- A threshold: t
• Output:
- All pairs of objects (r, s), r in R and
s in S, such that sim(r, s) ≥ t
Formal
Definition
13. Applications
Data consolidation
- Lack of consistency, for example writing both "$"
and "dollars"
- Typos, for example "why everybdoy can
understand this"
- Precision, for example rounding numbers
16. By a simple nested-loop
algorithm, comparing
all pairs using the
similarity function.
Naive
Solution
18. • Time complexity? O(n²)
• Is this good enough? It depends on
the application
• Is this good enough
for RDBMS?
Naive
Solution
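A minimal Python sketch of this naive nested-loop join (the ratio-based `sim` here is only an illustrative choice, not a similarity function prescribed by the deck):

```python
from difflib import SequenceMatcher

def naive_similarity_join(R, S, sim, t):
    """Compare every (r, s) pair -- O(|R| * |S|) similarity evaluations."""
    return [(r, s) for r in R for s in S if sim(r, s) >= t]

# Illustrative similarity function; any sim(r, s) in [0, 1] would do.
sim = lambda a, b: SequenceMatcher(None, a, b).ratio()

pairs = naive_similarity_join(["food", "good"], ["good", "mood"], sim, 0.75)
```

Every pair is compared regardless of how dissimilar the records are, which is exactly the cost the rest of the deck tries to avoid.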
19. “RDBMS should provide a
solution that is as generic
and performant as possible."
Similarity
Join In
RDBMS
22. • Handle large datasets = many rows
• Handle high-dimensional datasets = many columns
• Support a variety of similarity functions
• Support hard similarity functions
• Should be correct, i.e., answer the application's needs
The solution should
Similarity
Join In
RDBMS
23. • Consider only promising pairs [1].
• Pruning and refinement paradigm [1][2][3].
• Resort to approximate solutions [1].
Optimization opportunity
Similarity
Join In
RDBMS
25. • Consider only promising pairs [1], by filtering first
What will it tackle?
Optimization opportunity:
- Large datasets
- High-dimensional datasets
- Hard similarity functions
- Should be correct
Similarity
Join In
RDBMS
27. • Resort to approximate solutions [1]
What will it tackle?
Optimization opportunity:
- Large datasets
- High-dimensional datasets
- Hard similarity functions
- Should be correct
Similarity
Join In
RDBMS
29. • Pruning and refinement paradigm [1][2][3]
What will it tackle?
Optimization opportunity:
- Large datasets
- High-dimensional datasets
- Hard similarity functions
- Should be correct
Similarity
Join In
RDBMS
32. • Consider only promising pairs [1].
• Pruning and refinement paradigm [1][2][3].
• Resort to approximate solutions [1].
Optimization opportunity:
• Numbers
• Vectors
• Sets
• Text
Inputs can be:
Introduction
33. • Similarity between sets
- Binary similarity functions like contains, intersects
- Numerical similarity functions like overlap, Jaccard, or cosine
• Similarity between strings
- Treat strings as sets and use Jaccard (on q-grams),
or use edit distance
Similarity Join on Strings/Sets:
Introduction
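The string-as-set idea can be sketched in a few lines of Python (q = 2 for brevity); note how the overlap alone is enough to derive Jaccard:

```python
def qgrams(s, q=2):
    """Map a string to its set of q-grams (here q = 2, i.e. bigrams)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

A, B = qgrams("food"), qgrams("good")   # {fo, oo, od} vs {go, oo, od}
overlap = len(A & B)                    # 2 shared bigrams: oo, od

# Jaccard rewritten purely in terms of overlap and set sizes:
j = overlap / (len(A) + len(B) - overlap)
```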
34. Our Goal
To perform filtering before the cross product
occurs, reducing the number of pairs constructed
for the join.
Introduction
41. The big picture
String → Set (mapping string to set) → Weighted Set (set weights) → Set similarity
A - food
B - good
A = { fo, oo, od }
B = { go, oo, od }
Overlap = 2
Didn't we say we want to support
multiple similarity functions?
By using overlap we can implement
many other similarity functions.
Introduction
43. SS JOIN
To exploit the observation that set overlap can be used
effectively to support a variety of similarity functions:
● Jaccard similarity.
● Edit similarity and generalized edit similarity.
● Hamming distance.
● Similarity based on co-occurrences.
Proposed solution [2][3]
44. • Algorithm [2]:
1. Compute an equi-join on the B columns between R and S,
adding the weights of all joining values of B.
2. Candidate phase: compute the overlap between groups on
R.A and S.A by grouping the result on <R.A, S.A>.
3. Verify phase: ensuring, through the HAVING clause, that the
overlap is greater than the specified threshold α yields
the result of the SSJoin.
SS JOIN
45. SS JOIN
Given two relations S and R holding company
names, compute the similarity join with overlap > 60%*
46. 1. Compute an equi-join on the B columns between
R and S, adding the weights of all joining values
of B.
In our case B is the 3-gram column.
SS JOIN
47. 2. Candidate phase: compute the overlap between
groups on R.A and S.A by grouping the result on
<R.A, S.A>.
In our example A is the orgName column, and the
overlap between the grouped orgName values is as follows:
- Microsoft has an overlap of 10.
- Google has an overlap of 2.
SS JOIN
48. 3. Verify phase: ensuring, through the HAVING clause,
that the overlap is greater than the specified threshold α
yields the result of the SSJoin.
In our example we are looking for 60% overlap, and
the verify phase is computed in the following way:
- Since Microsoft has an overlap of 10 out of 12, it has 83%
overlap and is returned in the resulting join.
- Since Google has an overlap of 2 out of 4, it has 50%
overlap and is filtered out by the join.
SS JOIN
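The three SSJoin phases map directly onto SQL. Here is a toy end-to-end run using Python's sqlite3; the schema and example rows are hypothetical (one row per 3-gram, all weights fixed at 1), chosen only to exercise the equi-join / GROUP BY / HAVING shape of the algorithm:

```python
import sqlite3

def grams(s, q=3):
    """Unweighted q-gram set of a string (weight 1 per gram)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE R (a TEXT, b TEXT);  -- a: orgName, b: one 3-gram per row
    CREATE TABLE S (a TEXT, b TEXT);
""")
for name in ("microsoft corp", "google inc"):
    con.executemany("INSERT INTO R VALUES (?, ?)", [(name, g) for g in grams(name)])
for name in ("microsoft corporation", "oracle inc"):
    con.executemany("INSERT INTO S VALUES (?, ?)", [(name, g) for g in grams(name)])

# 1. Equi-join on B; 2. group on <R.A, S.A> to get the overlap;
# 3. verify the overlap against the threshold via HAVING.
alpha = 5
result = con.execute("""
    SELECT R.a, S.a, COUNT(*) AS overlap
    FROM R JOIN S ON R.b = S.b
    GROUP BY R.a, S.a
    HAVING COUNT(*) >= ?
""", (alpha,)).fetchall()
```

With α = 5 only the Microsoft pair survives: "google inc" and "oracle inc" share just four 3-grams and are discarded by the HAVING clause.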
50. SS JOIN
• Time complexity?
Since we use an equi-join, we can use RDBMS
optimizations like merge/hash joins and get
O(N+M), or even less if one table fits in RAM.
Performance
52. SS JOIN
• Is it still problematic?
Yes, the size of the equi-join on B varies
widely with the joint-frequency distribution of
B, which can be very large.
Performance
54. SS JOIN
• Is there any other
optimization opportunity?
Yes, using the "prefix filtering principle" [2]
Performance
56. SS JOIN
With Prefix
Filtering
Reduce the intermediate number of
<R.A, S.A> groups compared, and thus
reduce the size of the resulting equi-join.
Goal
57. SS JOIN
With Prefix
Filtering
Instead of performing an equi-join on R and S, we
may ignore a large subset of S and perform the
equi-join on R and a small filtered subset of S
using prefix-filtering.
How
59. SS JOIN
With Prefix
Filtering
It is implemented by establishing an upper
bound on the overlap between two sets based
on a part of them.
Intuition
If two records are similar, some fragments of
them should overlap with each other, as
otherwise the two records won't have enough
overlap.
60. • Formal property [2]
- If Prefix(U) ∩ Prefix(V) = ∅, then overlap(U, V) < t
- A global ordering of the tokens is important
SS JOIN
With Prefix
Filtering
61. • Algorithm [2]:
1. Compute prefix(s) for each record s.
2. Compute an equi-join on the B columns between R and S,
adding the weights of all joining values of B.
3. Candidate phase: pair all records that share at least one
token in their prefix.
4. Compute the overlap between groups on R.A and S.A by
grouping the result on <R.A, S.A>.
5. Verify phase: ensuring, through the HAVING clause, that the
overlap is greater than the specified threshold α yields
the result of the SSJoin.
SS JOIN
With Prefix
Filtering
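The steps above can be sketched in Python; an in-memory inverted index stands in for the equi-join, and the global token ordering is given explicitly as a list:

```python
def prefix(rec, t, order):
    """First |rec| - t + 1 tokens under the global order: if two records'
    prefixes are disjoint, their overlap is provably below t."""
    return sorted(rec, key=order.index)[:len(rec) - t + 1]

def prefix_filter_join(R, S, t, order):
    # Candidate phase: index S by prefix tokens, pair records sharing one.
    index = {}
    for i, s in enumerate(S):
        for tok in prefix(s, t, order):
            index.setdefault(tok, set()).add(i)
    results = []
    for r in R:
        candidates = set()
        for tok in prefix(r, t, order):
            candidates |= index.get(tok, set())
        # Verify phase: check the real overlap of each candidate pair.
        for i in sorted(candidates):
            if len(set(r) & set(S[i])) >= t:
                results.append((r, S[i]))
    return results

order = ["a", "b", "c", "d", "e"]        # fixed global token ordering
R = [{"a", "b", "c"}]
S = [{"b", "c", "d"}, {"c", "d", "e"}]
pairs = prefix_filter_join(R, S, 2, order)
```

Here {c, d, e} is never compared against {a, b, c}: their prefixes ([c, d] and [a, b]) are disjoint, so the overlap cannot reach t = 2 and the pair is pruned before verification.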
64. Motivation
Types
• Lexical similarity: to compute how 'close'
two pieces of text are in surface form [5].
• Semantic similarity: to compute how 'close'
two pieces of text are in their meaning [5].
65. Motivation
Goal
Enhancing queries by allowing us
to quantify semantic relationships
inside the database using Natural
Language Processing.
66. Motivation
New
Capabilities
• Semantic similarity queries
- Find the most similar customer (semantically) to a potential
customer by industry
• Analogies
- Find all pairs of products a, b which relate to
each other as peanut butter relates to jelly
• Schema-less navigation
- Find all tickets of user "moshe" given an unknown
fuzzy foreign key between tickets and users.
74. Motivation
Semantic
Similarity
Queries
Building Blocks Needed
- cosineDistance(a, b), which takes vectors a, b and returns their
cosine distance
- vec(token), which takes a token and returns its associated vector
- Token entity e, which declares a variable that can be bound to tokens
- contains(row, entity), which states that the entity must be bound to a
token generated by tokenizing the row
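The first two building blocks are easy to sketch in Python; the tiny vec table below is hypothetical (a real system would load word2vec-style embeddings [4]):

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = a·b / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cosine_distance(a, b):
    """0 for identical directions, up to 2 for opposite directions."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical 2-d embeddings, just to exercise the functions:
vec = {"software": [0.9, 0.1], "saas": [0.8, 0.2], "farming": [0.1, 0.9]}
d_close = cosine_distance(vec["software"], vec["saas"])
d_far = cosine_distance(vec["software"], vec["farming"])
```

Smaller distance means more semantically similar, which is why the queries that follow rank candidates by ascending cosine distance.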
77. Find the most similar customer (semantically) to a
potential customer by industry
SELECT c.name
FROM customer c, potential_customer pc
WHERE c.id < pc.id
ORDER BY cosineDistance(vec(c.industry), vec(pc.industry)) ASC
LIMIT 1
Semantic
Similarity
Queries
Why do we need c.id < pc.id?
In order to avoid duplication
What change is needed to avoid non-similar customers?
Adding a filter on the proximity to the WHERE clause
79. Let's break the solution into the following steps:
1. Create a table with product names and their distance vector
2. Create a table with product names and the cosine distance
between the distance vector and the jelly/peanut butter vector
3. Find all pairs of products a, b which relate to each other as peanut
butter relates to jelly
Analogies
80. Analogies
CREATE TABLE products_distance AS
SELECT p1.name AS p_name_1,
p2.name AS p_name_2,
vec(p1.description) - vec(p2.description) AS dist_vec
FROM products p1, products p2
WHERE p1.id < p2.id;
1. Create a table with product names and
their distance vector
81. Analogies
2. Create a table with product names and
the cosine distance between their distance
vector and the jelly/peanut butter vector
CREATE TABLE products_complementary_distance AS
SELECT p_name_1,
p_name_2,
cosineDistance(dist_vec, vec('peanut_butter') - vec('jelly'))
AS compl_dist
FROM products_distance;
82. Analogies
3. Find all pairs of products a, b which relate to
each other as peanut butter relates to jelly.
SELECT p_name_1,
p_name_2
FROM (SELECT p_name_1,
p_name_2,
RANK() OVER (PARTITION BY p_name_1
ORDER BY compl_dist ASC) AS rnk
FROM products_complementary_distance)
WHERE rnk = 1
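The three steps can be mimicked end to end in plain Python. The toy 2-d embeddings below are entirely hypothetical, chosen only so that chips : salsa lines up with peanut butter : jelly:

```python
import math

def cos(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def sub(a, b):
    """Element-wise difference: the 'distance vector' of a pair."""
    return [x - y for x, y in zip(a, b)]

# Hypothetical embeddings (a real system would use word2vec-style vectors [4]):
vec = {
    "peanut_butter": [1.0, 0.0],
    "jelly":         [0.0, 1.0],
    "chips":         [1.1, 0.1],
    "salsa":         [0.1, 1.1],
    "milk":          [0.5, 0.5],
}

# Steps 1+2: compare each pair's distance vector to vec(pb) - vec(jelly).
target = sub(vec["peanut_butter"], vec["jelly"])
products = ["chips", "salsa", "milk"]
pairs = [(a, b) for a in products for b in products if a != b]

# Step 3: rank pairs; maximal cosine similarity to the target direction
# (i.e. minimal cosine distance) is the best analogy.
best = max(pairs, key=lambda p: cos(sub(vec[p[0]], vec[p[1]]), target))
```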
84. Schemaless
Navigation
Find all tickets of user “moshe” given
unknown fuzzy foreign key between tickets
and users
SELECT users.*,
tickets.*
FROM users, Token e1, e2
INNER JOIN tickets
ON contains(users.email, e1) AND
contains(tickets.*, e2) AND
cosineDistance(vec(e1), vec(e2)) < 0.5
WHERE users.name = 'moshe'
85. 1. Wang, W. (2008). Similarity Join Algorithms: An Introduction. Retrieved from
http://www.cse.unsw.edu.au/~weiw/project/tutorial-simjoin-SEBD08.pdf.
2. Chaudhuri, S., Ganti, V., Kaushik, R. (2006). A Primitive Operator for Similarity Joins
in Data Cleaning. Proceedings of the 22nd International Conference on
Data Engineering, p. 5, April 3-7, 2006.
3. Xiao, C., Wang, W., Lin, X., Yu, J. X., Wang, G. (2011). Efficient Similarity Joins
for Near Duplicate Detection. ACM Trans. Datab. Syst. V, N.
4. Mikolov, T. word2vec: Tool for computing continuous distributed representations
of words. https://code.google.com/p/word2vec.
5. Ganesan, K. (2015, November). What is text similarity? [Blog post]. Retrieved from
http://kavita-ganesan.com/what-is-text-similarity/#.Wppog5NuYv88.
6. Shmueli, O., & Bordawekar, R. (2016, March). Enabling Cognitive Intelligence
Queries in Relational Databases using Low-dimensional Word Embeddings.
Retrieved from https://arxiv.org/abs/1603.07185.
References