5. • Input:
- Two sets of objects: R and S
- A similarity function: sim(r,s)
- A threshold: t
• Output:
- All pairs of objects (r, s), r in R and
s in S, such that sim(r, s) ≥ t
Formal
Definition
13. Applications
Data consolidation
- Lack of consistency, for example writing both "$"
and "dollars"
- Typos, for example "why everybdoy can
understand this"
- Precision, for example rounding numbers
16. By a simple nested-loop
algorithm, comparing
all pairs using the
similarity function.
Naive
Solution
18. • Time complexity? O(n²)
• Is this good enough? It depends on
the application
• Is this good enough
for RDBMS?
Naive
Solution
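A minimal Python sketch of this naive nested-loop join (the ratio-based `sim` here is only an illustrative choice, not a similarity function prescribed by the deck):

```python
from difflib import SequenceMatcher

def naive_similarity_join(R, S, sim, t):
    """Compare every (r, s) pair -- O(|R| * |S|) similarity evaluations."""
    return [(r, s) for r in R for s in S if sim(r, s) >= t]

# Illustrative similarity function; any sim(r, s) in [0, 1] would do.
sim = lambda a, b: SequenceMatcher(None, a, b).ratio()

pairs = naive_similarity_join(["food", "good"], ["good", "mood"], sim, 0.75)
```

Every pair is compared regardless of how dissimilar the records are, which is exactly the cost the rest of the deck tries to avoid.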
19. “RDBMS should provide a
solution that is as generic
and performant as possible."
Similarity
Join In
RDBMS
22. • Handle large datasets = many rows
• Handle high-dimensional datasets = many columns
• Support a variety of similarity functions
• Support hard similarity functions
• Should be correct, i.e., answer the application's needs
The solution should
Similarity
Join In
RDBMS
23. • Consider only promising pairs [1].
• Pruning and refinement paradigm [1][2][3].
• Resort to approximate solutions [1].
Optimization opportunity
Similarity
Join In
RDBMS
25. • Consider only promising pairs [1], by filtering first
What will it tackle?
Optimization opportunity:
- Large datasets
- High-dimensional datasets
- Hard similarity functions
- Should be correct
Similarity
Join In
RDBMS
27. • Resort to approximate solutions [1]
What will it tackle?
Optimization opportunity:
- Large datasets
- High-dimensional datasets
- Hard similarity functions
- Should be correct
Similarity
Join In
RDBMS
29. • Pruning and refinement paradigm [1][2][3]
What will it tackle?
Optimization opportunity:
- Large datasets
- High-dimensional datasets
- Hard similarity functions
- Should be correct
Similarity
Join In
RDBMS
32. • Consider only promising pairs [1].
• Pruning and refinement paradigm [1][2][3].
• Resort to approximate solutions [1].
Optimization opportunity:
• Numbers
• Vectors
• Sets
• Text
Inputs can be:
Introduction
33. • Similarity between sets
- Binary similarity functions like contains, intersects
- Numerical similarity functions like overlap, Jaccard, or cosine
• Similarity between strings
- Treat strings as sets and use Jaccard (on q-grams),
or use edit distance
Similarity Join on Strings/Sets:
Introduction
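The string-as-set idea can be sketched in a few lines of Python (q = 2 for brevity); note how the overlap alone is enough to derive Jaccard:

```python
def qgrams(s, q=2):
    """Map a string to its set of q-grams (here q = 2, i.e. bigrams)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

A, B = qgrams("food"), qgrams("good")   # {fo, oo, od} vs {go, oo, od}
overlap = len(A & B)                    # 2 shared bigrams: oo, od

# Jaccard rewritten purely in terms of overlap and set sizes:
j = overlap / (len(A) + len(B) - overlap)
```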
34. Our Goal
To perform filtering before the cross product
occurs, reducing the number of pairs constructed
for the join.
Introduction
41. The big picture
String → Set (mapping string to set) → Weighted Set (set weights) → Set similarity
A - food
B - good
A = { fo, oo, od }
B = { go, oo, od }
Overlap = 2
Didn't we say we want to support
multiple similarity functions?
By using overlap we can implement
many other similarity functions.
Introduction
43. SS JOIN
To exploit the observation that set overlap can be used
effectively to support a variety of similarity functions:
● Jaccard similarity.
● Edit similarity and generalized edit similarity.
● Hamming distance.
● Similarity based on co-occurrences.
Proposed solution [2][3]
44. • Algorithm [2]:
1. Compute an equi-join on the B columns between R and S,
adding the weights of all joining values of B.
2. Candidate phase: compute the overlap between groups on
R.A and S.A by grouping the result on <R.A, S.A>.
3. Verify phase: ensuring, through the HAVING clause, that the
overlap is greater than the specified threshold α yields
the result of the SSJoin.
SS JOIN
45. SS JOIN
Given two relations S and R holding company
names, compute the similarity join with overlap > 60%*
46. 1. Compute an equi-join on the B columns between
R and S, adding the weights of all joining values
of B.
In our case B is the 3-gram column.
SS JOIN
47. 2. Candidate phase: compute the overlap between
groups on R.A and S.A by grouping the result on
<R.A, S.A>.
In our example A is the orgName column, and the
overlap between the grouped orgName values is as follows:
- Microsoft has an overlap of 10.
- Google has an overlap of 2.
SS JOIN
48. 3. Verify phase: ensuring, through the HAVING clause,
that the overlap is greater than the specified threshold α
yields the result of the SSJoin.
In our example we are looking for 60% overlap, and
the verify phase is computed in the following way:
- Since Microsoft has an overlap of 10 out of 12, it has 83%
overlap and is returned in the resulting join.
- Since Google has an overlap of 2 out of 4, it has 50%
overlap and is filtered out by the join.
SS JOIN
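The three SSJoin phases map directly onto SQL. Here is a toy end-to-end run using Python's sqlite3; the schema and example rows are hypothetical (one row per 3-gram, all weights fixed at 1), chosen only to exercise the equi-join / GROUP BY / HAVING shape of the algorithm:

```python
import sqlite3

def grams(s, q=3):
    """Unweighted q-gram set of a string (weight 1 per gram)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE R (a TEXT, b TEXT);  -- a: orgName, b: one 3-gram per row
    CREATE TABLE S (a TEXT, b TEXT);
""")
for name in ("microsoft corp", "google inc"):
    con.executemany("INSERT INTO R VALUES (?, ?)", [(name, g) for g in grams(name)])
for name in ("microsoft corporation", "oracle inc"):
    con.executemany("INSERT INTO S VALUES (?, ?)", [(name, g) for g in grams(name)])

# 1. Equi-join on B; 2. group on <R.A, S.A> to get the overlap;
# 3. verify the overlap against the threshold via HAVING.
alpha = 5
result = con.execute("""
    SELECT R.a, S.a, COUNT(*) AS overlap
    FROM R JOIN S ON R.b = S.b
    GROUP BY R.a, S.a
    HAVING COUNT(*) >= ?
""", (alpha,)).fetchall()
```

With α = 5 only the Microsoft pair survives: "google inc" and "oracle inc" share just four 3-grams and are discarded by the HAVING clause.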
50. SS JOIN
• Time complexity?
Since we use an equi-join, we can use RDBMS
optimizations like merge/hash joins and get
O(N+M), or even less if one table fits in RAM.
Performance
52. SS JOIN
• Is it still problematic?
Yes, the size of the equi-join on B varies
widely with the joint-frequency distribution of
B, which can be very large.
Performance
54. SS JOIN
• Is there any other
optimization opportunity?
Yes, using the "prefix filtering principle" [2]
Performance
56. SS JOIN
With Prefix
Filtering
Reduce the intermediate number of
<R.A, S.A> groups compared, and thus
reduce the size of the resulting equi-join.
Goal
57. SS JOIN
With Prefix
Filtering
Instead of performing an equi-join on R and S, we
may ignore a large subset of S and perform the
equi-join on R and a small filtered subset of S
using prefix-filtering.
How
59. SS JOIN
With Prefix
Filtering
It is implemented by establishing an upper
bound on the overlap between two sets based
on a part of them.
Intuition
If two records are similar, some fragments of
them should overlap with each other, as
otherwise the two records won't have enough
overlap.
60. • Formal property [2]
- If Prefix(U) ∩ Prefix(V) = ∅, then overlap(U, V) < t
- A global ordering of the tokens is important
SS JOIN
With Prefix
Filtering
61. • Algorithm [2]:
1. Compute prefix(s) for each record s.
2. Compute an equi-join on the B columns between R and S,
adding the weights of all joining values of B.
3. Candidate phase: pair all records that share at least one
token in their prefix.
4. Compute the overlap between groups on R.A and S.A by
grouping the result on <R.A, S.A>.
5. Verify phase: ensuring, through the HAVING clause, that the
overlap is greater than the specified threshold α yields
the result of the SSJoin.
SS JOIN
With Prefix
Filtering
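The steps above can be sketched in Python; an in-memory inverted index stands in for the equi-join, and the global token ordering is given explicitly as a list:

```python
def prefix(rec, t, order):
    """First |rec| - t + 1 tokens under the global order: if two records'
    prefixes are disjoint, their overlap is provably below t."""
    return sorted(rec, key=order.index)[:len(rec) - t + 1]

def prefix_filter_join(R, S, t, order):
    # Candidate phase: index S by prefix tokens, pair records sharing one.
    index = {}
    for i, s in enumerate(S):
        for tok in prefix(s, t, order):
            index.setdefault(tok, set()).add(i)
    results = []
    for r in R:
        candidates = set()
        for tok in prefix(r, t, order):
            candidates |= index.get(tok, set())
        # Verify phase: check the real overlap of each candidate pair.
        for i in sorted(candidates):
            if len(set(r) & set(S[i])) >= t:
                results.append((r, S[i]))
    return results

order = ["a", "b", "c", "d", "e"]        # fixed global token ordering
R = [{"a", "b", "c"}]
S = [{"b", "c", "d"}, {"c", "d", "e"}]
pairs = prefix_filter_join(R, S, 2, order)
```

Here {c, d, e} is never compared against {a, b, c}: their prefixes ([c, d] and [a, b]) are disjoint, so the overlap cannot reach t = 2 and the pair is pruned before verification.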
64. Motivation
Types
• Lexical similarity: to compute how 'close'
two pieces of text are in surface form [5].
• Semantic similarity: to compute how 'close'
two pieces of text are in their meaning [5].
65. Motivation
Goal
Enhancing queries by allowing us
to quantify semantic relationships
inside the database using Natural
Language Processing.
66. Motivation
New
Capabilities
• Semantic similarity queries
- Find the most similar customer (semantically) to a potential
customer by industry
• Analogies
- Find all pairs of products a, b which relate to
each other as peanut butter relates to jelly
• Schema-less navigation
- Find all tickets of user "moshe" given an unknown
fuzzy foreign key between tickets and users.
74. Motivation
Semantic
Similarity
Queries
Building Blocks Needed
- cosineDistance(a, b), which takes vectors a, b and returns their
cosine distance
- vec(token), which takes a token and returns its associated vector
- Token entity e, which declares a variable that can be bound to tokens
- contains(row, entity), which states that the entity must be bound to a
token generated by tokenizing the row
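The first two building blocks are easy to sketch in Python; the tiny vec table below is hypothetical (a real system would load word2vec-style embeddings [4]):

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = a·b / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cosine_distance(a, b):
    """0 for identical directions, up to 2 for opposite directions."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical 2-d embeddings, just to exercise the functions:
vec = {"software": [0.9, 0.1], "saas": [0.8, 0.2], "farming": [0.1, 0.9]}
d_close = cosine_distance(vec["software"], vec["saas"])
d_far = cosine_distance(vec["software"], vec["farming"])
```

Smaller distance means more semantically similar, which is why the queries that follow rank candidates by ascending cosine distance.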
77. Find the most similar customer (semantically) to a
potential customer by industry
SELECT c.name
FROM customer c, potential_customer pc
WHERE c.id < pc.id
ORDER BY cosineDistance(vec(c.industry), vec(pc.industry)) ASC
LIMIT 1
Semantic
Similarity
Queries
Why do we need c.id < pc.id?
In order to avoid duplication
What change is needed to avoid non-similar customers?
Adding a filter on the proximity to the WHERE clause
79. Let's break the solution into the following steps:
1. Create a table with product names and their distance vector
2. Create a table with product names and the cosine distance
between the distance vector and the jelly/peanut butter vector
3. Find all pairs of products a, b which relate to each other as peanut
butter relates to jelly
Analogies
80. Analogies
CREATE TABLE products_distance AS
SELECT p1.name AS p_name_1,
p2.name AS p_name_2,
vec(p1.description) - vec(p2.description) AS dist_vec
FROM products p1, products p2
WHERE p1.id < p2.id;
1. Create a table with product names and
their distance vector
81. Analogies
2. Create a table with product names and
the cosine distance between their distance
vector and the jelly/peanut butter vector
CREATE TABLE products_complementary_distance AS
SELECT p_name_1,
p_name_2,
cosineDistance(dist_vec, vec('peanut_butter') - vec('jelly'))
AS compl_dist
FROM products_distance;
82. Analogies
3. Find all pairs of products a, b which relate to
each other as peanut butter relates to jelly.
SELECT p_name_1,
p_name_2
FROM (SELECT p_name_1,
p_name_2,
RANK() OVER (PARTITION BY p_name_1
ORDER BY compl_dist ASC) AS rnk
FROM products_complementary_distance)
WHERE rnk = 1
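The three steps can be mimicked end to end in plain Python. The toy 2-d embeddings below are entirely hypothetical, chosen only so that chips : salsa lines up with peanut butter : jelly:

```python
import math

def cos(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def sub(a, b):
    """Element-wise difference: the 'distance vector' of a pair."""
    return [x - y for x, y in zip(a, b)]

# Hypothetical embeddings (a real system would use word2vec-style vectors [4]):
vec = {
    "peanut_butter": [1.0, 0.0],
    "jelly":         [0.0, 1.0],
    "chips":         [1.1, 0.1],
    "salsa":         [0.1, 1.1],
    "milk":          [0.5, 0.5],
}

# Steps 1+2: compare each pair's distance vector to vec(pb) - vec(jelly).
target = sub(vec["peanut_butter"], vec["jelly"])
products = ["chips", "salsa", "milk"]
pairs = [(a, b) for a in products for b in products if a != b]

# Step 3: rank pairs; maximal cosine similarity to the target direction
# (i.e. minimal cosine distance) is the best analogy.
best = max(pairs, key=lambda p: cos(sub(vec[p[0]], vec[p[1]]), target))
```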
84. Schemaless
Navigation
Find all tickets of user “moshe” given
unknown fuzzy foreign key between tickets
and users
SELECT users.*,
tickets.*
FROM users, Token e1, e2
INNER JOIN tickets
ON contains(users.email, e1) AND
contains(tickets.*, e2) AND
cosineDistance(vec(e1), vec(e2)) < 0.5
WHERE users.name = 'moshe'
85. 1. Wang, W. (2008). Similarity Join Algorithms: An Introduction. Retrieved from
http://www.cse.unsw.edu.au/~weiw/project/tutorial-simjoin-SEBD08.pdf.
2. Chaudhuri, S., Ganti, V., Kaushik, R. (2006). A Primitive Operator for Similarity Joins
in Data Cleaning. Proceedings of the 22nd International Conference on
Data Engineering, p. 5, April 3-7, 2006.
3. Xiao, C., Wang, W., Lin, X., Yu, J. X., Wang, G. (2011). Efficient Similarity Joins
for Near Duplicate Detection. ACM Trans. Datab. Syst. V, N.
4. Mikolov, T. word2vec: Tool for computing continuous distributed representations
of words. https://code.google.com/p/word2vec.
5. Ganesan, K. (2015, November). What is text similarity? [Blog post]. Retrieved from
http://kavita-ganesan.com/what-is-text-similarity/#.Wppog5NuYv88.
6. Shmueli, O., & Bordawekar, R. (2016, March). Enabling Cognitive Intelligence
Queries in Relational Databases using Low-dimensional Word Embeddings.
Retrieved from https://arxiv.org/abs/1603.07185.
References