Advanced Topics:
Database Systems
Eyal Trabelsi
Agenda
• Similarity Join
- Introduction
- Applications
- Naive solution
- Similarity Join in RDBMS
• Similarity Join Performance Optimizations
- Introduction
- SSJoin
• Semantic Similarity For Text
- Introduction
- Overall architecture
- New capabilities
Similarity Join
Definition
"Resembling without being identical"
Formal Definition
• Input:
- Two sets of objects: R and S
- A similarity function: sim(r, s)
- A threshold: t
• Output:
- All pairs (r, s), with r in R and s in S, such that sim(r, s) ≥ t
The Input Can Be:
- Numbers
- Pictures
- Vectors
- Sets
- Text
- Pretty much everything
Similarity Functions
Is it new to us?
• Lp distance
• Hamming distance
• Cosine similarity
• Jaccard
• Edit distance
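As a reminder of what these functions compute, here is a minimal Python sketch of each one. The function names and signatures are illustrative choices, not taken from the slides, and the inputs are assumed to be non-empty (and non-zero for the cosine).

```python
import math

def lp_distance(x, y, p=2):
    """Lp (Minkowski) distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def edit_distance(s, t):
    """Levenshtein distance, computed one dynamic-programming row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]
```

Note that Lp, Hamming and edit distance are distances (smaller means closer), while cosine and Jaccard are similarities (larger means closer), so the direction of the threshold test depends on which kind is used.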
Applications
• Non-exact deduplication
• Search engines
• Analytics
• Data consolidation
- Lack of consistency, for example writing both "$" and "dollars"
- Typos, for example "why everybdoy can understand this"
- Precision, for example rounding numbers
Solution
So how do we solve this?
Naive Solution
Use a simple nested-loop algorithm that compares all pairs using the similarity function.
• Time complexity? O(n²), since every pair is compared.
• Is this good enough? It depends on the application.
• Is this good enough for an RDBMS?
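The naive solution can be sketched in a few lines of Python. This is only an illustration: the function name `naive_similarity_join`, the character-set Jaccard used as `sim`, and the sample strings are assumptions made for the example, and pairs are kept when their similarity is at least the threshold.

```python
def naive_similarity_join(R, S, sim, t):
    """Compare every r in R with every s in S: |R| * |S| calls to sim."""
    return [(r, s) for r in R for s in S if sim(r, s) >= t]

def jaccard(a, b):
    """Jaccard similarity of the character sets of two non-empty strings."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

R = ["food", "microsoft"]
S = ["good", "mood", "google"]
pairs = naive_similarity_join(R, S, jaccard, 0.5)
print(pairs)
```

The two nested loops are what makes the cost quadratic: nothing is skipped, so the similarity function runs once per pair regardless of how unlikely a match is.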
Similarity Join In RDBMS
"An RDBMS should provide a solution that is as generic and performant as possible"
The solution should:
• Handle large datasets (many rows)
• Handle high-dimensional datasets (many columns)
• Support a variety of similarity functions
• Support hard similarity functions
• Be correct, that is, answer the application's needs
Optimization opportunities:
• Consider only promising pairs [1].
• Pruning and refinement paradigm [1][2][3].
• Resort to approximate solutions [1].
Optimization opportunity:
• Consider only promising pairs [1], by filtering first
What will it tackle?
- Large datasets
- High-dimensional datasets
- Hard similarity functions
- Should be correct
Optimization opportunity:
• Resort to approximate solutions [1]
What will it tackle?
- Large datasets
- High-dimensional datasets
- Hard similarity functions
- Should be correct
Optimization opportunity:
• Pruning and refinement paradigm [1][2][3]
What will it tackle?
- Large datasets
- High-dimensional datasets
- Hard similarity functions
- Should be correct
Similarity Join Performance Optimizations
Introduction
Optimization opportunities:
• Consider only promising pairs [1].
• Pruning and refinement paradigm [1][2][3].
• Resort to approximate solutions [1].
Inputs can be:
• Numbers
• Vectors
• Sets
• Text
Similarity Join On Strings/Sets:
• Similarity between sets
- Binary similarity functions such as contains and intersects
- Numerical similarity functions such as overlap, Jaccard or cosine
• Similarity between strings
- Treat strings as sets and use Jaccard (on q-grams), or use edit distance
Our Goal
To filter before the cross product is computed, and so reduce the number of pairs constructed for the join.
The big picture
- String → Set (mapping string to set) → set similarity
- Set → Weighted Set (set weights) → set similarity
Example, mapping each string to its set of 2-grams:
A - food
B - good
A = {fo, oo, od}
B = {go, oo, od}
Overlap(A, B) = 2
Didn't we say we want to support multiple similarity functions?
By using overlap we can implement many other similarity functions.
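The food/good example can be reproduced with a short Python sketch. The helper names (`qgrams`, `overlap`, `jaccard_from_overlap`) are made up for this illustration. The last function shows one way overlap can carry another similarity function: Jaccard needs only the overlap and the two set sizes.

```python
def qgrams(s, q=2):
    """Set of all contiguous substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def overlap(a, b):
    """Unweighted set overlap: the number of shared elements."""
    return len(a & b)

def jaccard_from_overlap(a, b):
    """Jaccard written purely in terms of the overlap and the set sizes."""
    o = overlap(a, b)
    return o / (len(a) + len(b) - o)

A = qgrams("food")  # {'fo', 'oo', 'od'}
B = qgrams("good")  # {'go', 'oo', 'od'}
print(overlap(A, B), jaccard_from_overlap(A, B))
```

With weighted sets the same idea applies, except that the overlap sums the weights of the shared elements instead of counting them.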
SSJoin
Proposed solution [2][3]
Exploit the observation that set overlap can be used effectively to support a variety of similarity functions:
● Jaccard similarity.
● Edit similarity and generalized edit similarity.
● Hamming distance.
● Similarity based on co-occurrences.
SSJoin
• Algorithm [2]:
1. Compute an equi-join on the B columns of R and S, adding up the weights of all joining values of B.
2. Candidate phase: compute the overlap between groups on R.A and S.A by grouping the result on <R.A, S.A>.
3. Verify phase: use the HAVING clause to keep only the groups whose overlap is greater than the specified threshold α. This yields the result of the SSJoin.
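The three steps map directly onto a join, a GROUP BY and a HAVING clause. Below is a minimal sketch that runs them in an in-memory SQLite database from Python. The table layout (columns A, B and a weight column w), the unit weights, the 3-gram tokenization and the sample company names are all assumptions made for this illustration, not the schema from [2].

```python
import sqlite3

def qgrams(s, q=3):
    """Set of all contiguous substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def load(conn, table, names):
    """Store one row per (name, 3-gram) pair, each with an assumed weight of 1."""
    conn.execute(f"CREATE TABLE {table} (A TEXT, B TEXT, w REAL)")
    rows = [(name, g, 1.0) for name in names for g in sorted(qgrams(name.lower()))]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)

def ssjoin(conn, alpha):
    query = """
        SELECT R.A, S.A, SUM(R.w) AS overlap   -- step 1: add weights of joining B values
        FROM R JOIN S ON R.B = S.B             -- step 1: equi-join on B
        GROUP BY R.A, S.A                      -- step 2: one group per candidate pair
        HAVING SUM(R.w) > ?                    -- step 3: keep overlap above alpha
        ORDER BY R.A, S.A
    """
    return conn.execute(query, (alpha,)).fetchall()

conn = sqlite3.connect(":memory:")
load(conn, "R", ["Microsoft Corp", "Google Inc"])
load(conn, "S", ["Microsoft Corporation", "Gogle Inc"])
result = ssjoin(conn, 4)
print(result)
```

Here the threshold is an absolute overlap. A relative threshold, such as a percentage of the record's size, would need the set sizes carried along as extra columns.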
SSJoin
Given two relations R and S holding company names, compute the similarity join with overlap > 60%*
1. Compute an equi-join on the B columns of R and S, adding up the weights of all joining values of B.
In our case B is the 3-gram column.
2. Candidate phase: compute the overlap between groups on R.A and S.A by grouping the result on <R.A, S.A>.
In our example A is the orgName column, and the overlap between the grouped orgName values is as follows:
- Microsoft has an overlap of 10.
- Google has an overlap of 2.
3. Verify phase: use the HAVING clause to keep only the groups whose overlap is greater than the specified threshold α. This yields the result of the SSJoin.
In our example we are looking for more than 60% overlap, so the verify phase is computed as follows:
- Microsoft has an overlap of 10 out of 12, which is 83%, so it is returned in the resulting join.
- Google has an overlap of 2 out of 4, which is 50%, so it is filtered out of the join.
SSJoin
Performance
• Time complexity?
Since we use an equi-join, we can rely on RDBMS optimizations such as merge join or hash join and get about O(N+M) to match the rows, at best when one table fits in RAM.
• Is it still problematic?
Yes, the size of the equi-join on B varies widely with the joint-frequency distribution of B, and it can be very large.
• Is there any other optimization opportunity?
Yes, using the "prefix filtering principle" [2].
SSJoin With Prefix Filtering
Goal
Reduce the number of intermediate <R.A, S.A> groups compared, and thus reduce the size of the resulting equi-join.
How
Instead of performing an equi-join on R and S, we may ignore a large subset of S and perform the equi-join on R and a small filtered subset of S using prefix filtering.
Intuition
If two records are similar, some fragments of them must overlap with each other; otherwise the two records cannot have enough overlap.
It is implemented by establishing an upper bound on the overlap between two sets based on only part of them.
• Formal statement [2]
- Prefix(X) is the first |X| - t + 1 tokens of X under the global ordering
- If Prefix(U) ∩ Prefix(V) = ∅, then overlap(U, V) < t
- Global ordering is important: every record must be sorted by the same ordering
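To make the principle concrete, here is a small Python sketch for unweighted sets with an integer overlap threshold t. The `prefix` helper and the plain alphabetical ordering are illustrative assumptions. In practice the global ordering is often chosen so that rare tokens come first, which keeps prefixes selective.

```python
def prefix(tokens, t, order):
    """First |tokens| - t + 1 tokens under the global ordering `order`."""
    ranked = sorted(set(tokens), key=order)
    return set(ranked[:max(len(ranked) - t + 1, 0)])

alphabetical = lambda tok: tok  # assumed global ordering for this example

U = {"fo", "oo", "od"}  # 2-grams of "food"
V = {"go", "oo", "od"}  # 2-grams of "good"
W = {"mo", "oo", "on"}  # 2-grams of "moon"
t = 2

# Prefixes of U and V share "od", so the pair stays a candidate.
print(prefix(U, t, alphabetical) & prefix(V, t, alphabetical))
# Prefixes of U and W are disjoint, so overlap(U, W) < t and the pair is pruned.
print(prefix(U, t, alphabetical) & prefix(W, t, alphabetical), len(U & W))
```

A shared prefix token does not prove that the overlap reaches t. It only means the pair cannot be ruled out, which is why a verification step still follows.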
SSJoin With Prefix Filtering
• Algorithm [2]:
1. Compute the prefix of each record.
2. Compute an equi-join on the B columns between the prefixes of R and S.
3. Candidate phase: pair all records that share at least one token in their prefixes.
4. Compute the overlap of each candidate pair by grouping on <R.A, S.A> and adding up the weights of all joining values of B.
5. Verify phase: use the HAVING clause to keep only the groups whose overlap is greater than the specified threshold α. This yields the result of the SSJoin.
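The steps above can be sketched outside the database with an inverted index standing in for the equi-join on prefixes. This is a simplified illustration for unweighted sets, not the implementation from [2]: the function name, the dict-based inputs and the frequency-based ordering are assumptions, and a pair is kept when its overlap is at least t.

```python
from collections import Counter, defaultdict

def prefix_filtered_join(R, S, t):
    """R, S: dicts mapping a record id to its token set; t: overlap threshold (>= 1)."""
    # Global ordering: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for tokens in list(R.values()) + list(S.values()) for tok in tokens)
    order = lambda tok: (freq[tok], tok)

    def prefix(tokens):
        ranked = sorted(tokens, key=order)
        return ranked[:max(len(ranked) - t + 1, 0)]

    # Inverted index over the prefixes of S (plays the role of the equi-join on B).
    index = defaultdict(set)
    for sid, tokens in S.items():
        for tok in prefix(tokens):
            index[tok].add(sid)

    # Candidate phase: pairs that share at least one prefix token.
    candidates = set()
    for rid, tokens in R.items():
        for tok in prefix(tokens):
            candidates.update((rid, sid) for sid in index.get(tok, ()))

    # Verify phase: compute the real overlap only for the candidates.
    return sorted((rid, sid) for rid, sid in candidates if len(R[rid] & S[sid]) >= t)

R = {"food": {"fo", "oo", "od"}}
S = {"good": {"go", "oo", "od"}, "moon": {"mo", "oo", "on"}}
print(prefix_filtered_join(R, S, 2))
```

In this tiny example the pair (food, moon) never reaches the verify phase, because the frequent token "oo" falls outside every prefix.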
Semantic Similarity Join On Text
Motivation
Text Similarity
"Different people have a slightly different notion of what text similarity means"
Types
• Lexical similarity measures how "close" two pieces of text are on the surface [5].
• Semantic similarity measures how "close" two pieces of text are in their meaning [5].
Goal
Enhance queries by allowing them to quantify semantic relationships inside the database, using natural language processing.
New Capabilities
CI Queries
• Semantic similarity queries
- Find the customer most similar (semantically) to a potential customer, by industry
• Analogies
- Find all pairs of products a, b that relate to each other as peanut butter relates to jelly
• Schema-less navigation
- Find all tickets of user "moshe" given an unknown, fuzzy foreign key between tickets and users
Architecture
Overview
How is this done?
The big picture (the authors' version [6])
[Diagram: SQL with UDFs → Query Engine → Calculate SQL → Results. The query engine reads the database relations together with the relation vectors produced by "Tokenization + calculate relation vector".]
Relation embedding
[Diagram: Tokenize relation [6] → Embed relation [4]]
Semantic Similarity Queries
Building Blocks Needed
- cosineDistance(a, b), which takes vectors a and b and returns their cosine distance (the queries treat a higher value as more similar)
- vec(token), which takes a token and returns its associated vector
- Token e, which declares a variable e that can be bound to tokens
- contains(row, entity), which states that entity must be bound to a token generated by tokenizing row
Semantic Similarity Queries
Question
- Find the customer most similar (semantically) to a potential customer, by industry.
SELECT c.name
FROM customer c, potential_customer pc
WHERE c.id < pc.id
ORDER BY cosineDistance(vec(c.industry), vec(pc.industry)) DESC
LIMIT 1
Why do we need c.id < pc.id?
In order to avoid duplicates.
What change is needed to avoid non-similar customers?
Add a filter on the proximity to the WHERE clause.
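To see what the query and the extra proximity filter compute, here is a toy Python sketch. The three-dimensional vectors, the customer rows and the names `vec`, `cosine` and `most_similar_customer` are invented for illustration. A real system would look tokens up in trained word embeddings such as word2vec [4].

```python
import math

# Hypothetical embeddings; real ones would come from a trained model.
TOY_VECTORS = {
    "banking": (0.9, 0.1, 0.0),
    "finance": (0.8, 0.2, 0.1),
    "farming": (0.0, 0.2, 0.9),
}

def vec(token):
    """Return the vector associated with a token."""
    return TOY_VECTORS[token]

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors (higher = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def most_similar_customer(customers, potential_industry, min_proximity=0.5):
    """ORDER BY proximity DESC LIMIT 1, after filtering out weak matches."""
    scored = [(cosine(vec(ind), vec(potential_industry)), name) for name, ind in customers]
    scored = [(score, name) for score, name in scored if score > min_proximity]
    return max(scored)[1] if scored else None

customers = [("Acme Bank", "banking"), ("Green Acres", "farming")]
print(most_similar_customer(customers, "finance"))
```

Without the `min_proximity` filter the query always returns some customer, even when nothing in the table is semantically close.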
Analogies
Find all pairs of products a, b that relate to each other as peanut butter relates to jelly.
Let's break the solution into the following steps:
1. Create a table with product pairs and their distance vector
2. Create a table with product pairs and the cosine distance between their distance vector and the peanut butter/jelly distance vector
3. Find all pairs of products a, b that relate to each other as peanut butter relates to jelly
Analogies
1. Create a table with product pairs and their distance vector
CREATE TABLE products_distance AS
SELECT p1.id AS p_name_1,
       p2.id AS p_name_2,
       vec(p1.description) - vec(p2.description) AS dist_vec
FROM products p1, products p2
WHERE p1.id < p2.id;
2. Create a table with product pairs and the cosine distance between their distance vector and the peanut butter/jelly distance vector
CREATE TABLE products_complemantry_distance AS
SELECT p_name_1,
       p_name_2,
       -- compare each pair's difference with the peanut butter/jelly difference
       cosineDistance(dist_vec, vec('peanut_butter') - vec('jelly')) AS compl_dist
FROM products_distance;
Analogies
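3. Find all pairs of products a, b that relate to each other as peanut butter relates to jelly.
SELECT p_name_1,
       p_name_2
FROM (
    SELECT p_name_1,
           p_name_2,
           -- highest value first, as in the other queries where higher means closer
           RANK() OVER (PARTITION BY p_name_1
                        ORDER BY compl_dist DESC) AS rnk
    FROM products_complemantry_distance
) ranked
-- RANK() starts at 1, and a window alias cannot be filtered in the same SELECT
WHERE rnk = 1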
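The same three steps can be sketched in Python with toy vectors: take the difference vector of every ordered pair, compare it with the peanut butter/jelly difference, and rank. The two-dimensional vectors and product names are invented for the illustration, and all vectors are assumed to be distinct so that no difference is the zero vector.

```python
import math

def sub(a, b):
    """Component-wise difference of two vectors."""
    return tuple(x - y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors (higher = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def best_analogies(vectors, a, b):
    """Rank ordered pairs (x, y) by how well x - y aligns with a - b."""
    target = sub(vectors[a], vectors[b])
    scored = []
    for x in vectors:
        for y in vectors:
            if x != y and {x, y} != {a, b}:
                scored.append((cosine(sub(vectors[x], vectors[y]), target), x, y))
    return sorted(scored, reverse=True)

# Hypothetical product embeddings.
vectors = {
    "peanut_butter": (1.0, 1.0),
    "jelly": (1.0, 0.0),
    "cereal": (2.0, 1.5),
    "milk": (2.0, 0.5),
}
ranking = best_analogies(vectors, "peanut_butter", "jelly")
print(ranking[0])
```

Unlike the SQL version, this sketch scores both orderings of each pair, since an analogy is directional: (cereal, milk) and (milk, cereal) get opposite scores.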
Schema-less Navigation
Find all tickets of user "moshe" given an unknown, fuzzy foreign key between tickets and users
SELECT users.*,
       tickets.*
FROM users, tickets, Token e1, e2
WHERE contains(users.email, e1) AND
      contains(tickets.*, e2) AND
      cosineDistance(vec(e1), vec(e2)) > 0.5 AND
      users.name = 'moshe'
References
1. Wang, W. (2008). Similarity Join Algorithms: An Introduction. Retrieved from http://www.cse.unsw.edu.au/~weiw/project/tutorial-simjoin-SEBD08.pdf.
2. Chaudhuri, S., Ganti, V., Kaushik, R. (2006). A Primitive Operator for Similarity Joins in Data Cleaning. Proceedings of the 22nd International Conference on Data Engineering, p. 5, April 03-07, 2006.
3. Xiao, C., Wang, W., Lin, X., Yu, J. X., Wang, G. (2011). Efficient Similarity Joins for Near Duplicate Detection. ACM Trans. Datab. Syst. V, N.
4. Mikolov, T. word2vec: Tool for computing continuous distributed representations of words. https://code.google.com/p/word2vec.
5. Ganesan, K. (2015, November). What is text similarity? [Blog post]. Retrieved from http://kavita-ganesan.com/what-is-text-similarity/#.Wppog5NuYv88.
6. Shmueli, O., & Bordawekar, R. (2016, March). Enabling Cognitive Intelligence Queries in Relational Databases using Low-dimensional Word Embeddings. Retrieved from https://arxiv.org/abs/1603.07185.
Questions

Seminar - Similarity Joins in SQL (performance and semantic joins)

  • 2. Agenda • Similarity Join - Introduction - Applications - Naive solution - Similarity Join in RDBMS • Similarity Join Performance Optimizations - Introduction - SSJoin • Semantic Similarity For Text - Introduction - Overall architecture - New capabilities
  • 5. Formal Definition
    • Input:
      - Two sets of objects: R and S
      - A similarity function: sim(r, s)
      - A threshold: t
    • Output:
      - All pairs (r, s) with r in R and s in S such that sim(r, s) ≥ t (for a distance function the condition becomes dist(r, s) ≤ t)
  • 6. Formal Definition: The input can be:
    - Numbers
    - Pictures
    - Vectors
    - Sets
    - Text
  • 7. Formal Definition: The input can be pretty much everything
  • 9. Similarity Functions: Is it new to us?
    • Lp distance
    • Hamming distance
    • Cosine similarity
    • Jaccard
    • Edit distance
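Three of the functions listed on this slide are short enough to sketch directly. The following is a minimal illustration in plain Python, not code from the talk; the function names are chosen here for the example, and the edit distance is the classic Levenshtein variant (insert, delete, substitute), which is an assumption about which variant the slide means.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]
```

Note that Hamming and edit distance are distances (smaller means closer), while Jaccard is a similarity (larger means closer), so the threshold test points in opposite directions for them.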
  • 13. Applications: Data consolidation
    - Lack of consistency, for example writing both "$" and "dollars"
    - Typos, for example “why everybdoy can understand this”
    - Precision, for example rounding numbers
  • 14. Solution: So how do we solve this?
  • 16. Naive Solution: Use a simple nested-loop algorithm and compare all pairs using the similarity function.
  • 18. Naive Solution
    • Time complexity? O(n²) similarity evaluations (O(|R|·|S|) in general)
    • Is this good enough? It depends on the application.
    • Is this good enough for an RDBMS?
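The nested-loop approach can be sketched in a few lines. This is an illustrative sketch, not code from the talk: `naive_similarity_join` and the toy `closeness` function are names invented for the example, and pairs are kept when the similarity reaches the threshold.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def naive_similarity_join(
    R: Iterable[T],
    S: Iterable[T],
    sim: Callable[[T, T], float],
    t: float,
) -> List[Tuple[T, T]]:
    """Compare every pair (r, s) and keep those with sim(r, s) >= t."""
    S = list(S)  # materialise S so it can be scanned once per r
    result = []
    for r in R:          # |R| iterations
        for s in S:      # |S| iterations each, hence O(|R| * |S|) calls to sim
            if sim(r, s) >= t:
                result.append((r, s))
    return result


def closeness(a: float, b: float) -> float:
    """Hypothetical similarity for numbers: 1 when equal, approaching 0 as they diverge."""
    return 1.0 / (1.0 + abs(a - b))


pairs = naive_similarity_join([1.0, 5.0], [1.2, 9.0], closeness, t=0.8)
```

Every one of the four pairs is evaluated here even though only one of them qualifies, which is exactly the cost the later slides try to avoid.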
  • 19. Similarity Join In RDBMS: “An RDBMS should provide a solution that is as generic and performant as possible.”
  • 22. Similarity Join In RDBMS: The solution should
    • Handle large datasets (many rows)
    • Handle high-dimensional datasets (many columns)
    • Support a variety of similarity functions
    • Support hard similarity functions
    • Be correct, that is, answer the application's needs
  • 23. Similarity Join In RDBMS: Optimization opportunities
    • Consider only promising pairs [1].
    • Pruning and refinement paradigm [1][2][3].
    • Resort to approximate solutions [1].
  • 25. Similarity Join In RDBMS: Optimization opportunity
    • Consider only promising pairs [1], by filtering first
    What will it tackle:
      - Large datasets
      - High-dimensional datasets
      - Hard similarity functions
      - Should be correct
  • 27. Similarity Join In RDBMS: Optimization opportunity
    • Resort to approximate solutions [1]
    What will it tackle:
      - Large datasets
      - High-dimensional datasets
      - Hard similarity functions
      - Should be correct
  • 29. Similarity Join In RDBMS: Optimization opportunity
    • Pruning and refinement paradigm [1][2][3]
    What will it tackle:
      - Large datasets
      - High-dimensional datasets
      - Hard similarity functions
      - Should be correct
  • 31. Introduction
    Optimization opportunities:
      • Consider only promising pairs [1].
      • Pruning and refinement paradigm [1][2][3].
      • Resort to approximate solutions [1].
    Inputs can be:
      • Numbers
      • Vectors
      • Sets
      • Text
  • 33. Introduction: Similarity Join On Strings/Sets
    • Similarity between sets
      - Binary similarity functions, like contains or intersects
      - Numerical similarity functions, like overlap, Jaccard or cosine
    • Similarity between strings
      - Treat strings as sets and use Jaccard (on q-grams), or use edit distance
  • 34. Introduction: Our Goal: to perform filtering before the cross product occurs, and so reduce the number of pairs constructed for the join.
  • 36. Introduction: The big picture
    [Diagram: String → Set via "mapping string to set"; Set → Weighted Set via "set weights"; "set similarity" applies to both the Set and the Weighted Set.]
  • 39. Introduction: The big picture (example with 2-grams)
    A = "food", B = "good"
    A = {fo, oo, od}, B = {go, oo, od}
    overlap(A, B) = |A ∩ B| = 2
  • 41. Introduction: The big picture
    Didn't we say we want to support multiple similarity functions? By using overlap we can implement many other similarity functions.
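The "food"/"good" example can be reproduced with a small sketch. The helper names below (`qgrams`, `overlap`, `jaccard_from_overlap`) are made up for this illustration; the last one shows, as one instance of the claim on the slide, how Jaccard can be derived from the overlap and the two set sizes alone.

```python
def qgrams(text: str, q: int = 2) -> set:
    """Map a string to the set of its contiguous substrings of length q."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def overlap(a: set, b: set) -> int:
    """Unweighted overlap: the number of shared tokens."""
    return len(a & b)


def jaccard_from_overlap(a: set, b: set) -> float:
    """Jaccard expressed through overlap: |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    o = overlap(a, b)
    union_size = len(a) + len(b) - o
    return o / union_size if union_size else 1.0


A = qgrams("food")  # {"fo", "oo", "od"}
B = qgrams("good")  # {"go", "oo", "od"}
shared = overlap(A, B)
```

Because Jaccard depends only on the overlap and the set sizes, a Jaccard threshold can be rewritten as an overlap threshold, which is what lets a single overlap-based operator serve several similarity functions.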
  • 43. SSJoin: Proposed solution [2][3]
    Exploit the observation that set overlap can be used effectively to support a variety of similarity functions:
      ● Jaccard similarity
      ● Edit similarity and generalized edit similarity
      ● Hamming distance
      ● Similarity based on co-occurrences
  • 44. SSJoin: Algorithm [2]
    1. Compute an equi-join on the B columns between R and S, adding up the weights of all joining values of B.
    2. Candidate phase: group the result on <R.A, S.A> to compute the overlap between the groups of R.A and S.A.
    3. Verify phase: use the HAVING clause to keep only the groups whose overlap is greater than the specified threshold α. This yields the result of the SSJoin.
  • 45. SSJoin: Given two relations R and S holding company names, compute the similarity join with overlap > 60%.
  • 46. SSJoin: Step 1
    Compute an equi-join on the B columns between R and S, adding up the weights of all joining values of B. In our case B is the 3-gram column.
  • 47. SSJoin: Step 2, candidate phase
    Group the result on <R.A, S.A> to compute the overlap between the groups of R.A and S.A. In our example A is the orgName column, and the overlaps of the grouped orgName values are as follows:
      - Microsoft has an overlap of 10.
      - Google has an overlap of 2.
  • 48. SSJoin: Step 3, verify phase
    Use the HAVING clause to keep only the groups whose overlap is greater than the specified threshold α. In our example we are looking for more than 60% overlap, so the verify phase works out as follows:
      - Microsoft has an overlap of 10 out of 12 (83%), so it is returned in the join result.
      - Google has an overlap of 2 out of 4 (50%), so it is filtered out by the join.
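The three steps map naturally onto a single SQL statement: an equi-join on B, a GROUP BY on the pair of A values, and a HAVING clause for the threshold. The sketch below runs that statement in SQLite through Python's standard `sqlite3` module. Several things are assumptions made for the illustration and are not taken from the slides: the company names (chosen so that the counts come out as 10 of 12 and 2 of 4), the unweighted tokens (every 3-gram has weight 1), and the choice to measure the percentage against the number of 3-grams of the R record, stored in a `norm` column.

```python
import sqlite3


def qgrams(text: str, q: int = 3) -> set:
    """Lower-cased set of contiguous substrings of length q."""
    text = text.lower()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def load(conn: sqlite3.Connection, table: str, names: list) -> None:
    """Store one row per (name, 3-gram); norm is the number of 3-grams of the name."""
    conn.execute(f"CREATE TABLE {table} (A TEXT, B TEXT, norm INTEGER)")
    for name in names:
        grams = qgrams(name)
        conn.executemany(
            f"INSERT INTO {table} VALUES (?, ?, ?)",
            [(name, g, len(grams)) for g in grams],
        )


SSJOIN_SQL = """
SELECT R.A, S.A, COUNT(*) AS overlap      -- step 2: overlap per <R.A, S.A> group
FROM R JOIN S ON R.B = S.B                -- step 1: equi-join on the token column
GROUP BY R.A, S.A, R.norm
HAVING COUNT(*) > :alpha * R.norm         -- step 3: verify against the threshold
ORDER BY R.A, S.A
"""


def ssjoin(r_names: list, s_names: list, alpha: float) -> list:
    conn = sqlite3.connect(":memory:")
    load(conn, "R", r_names)
    load(conn, "S", s_names)
    rows = conn.execute(SSJOIN_SQL, {"alpha": alpha}).fetchall()
    conn.close()
    return rows


matches = ssjoin(["Microsoft Corp", "Google"], ["Microsoft Co", "Gogle"], alpha=0.6)
```

With these hypothetical names, "Microsoft Corp" shares 10 of its 12 3-grams with "Microsoft Co" and passes, while "Google" shares only 2 of its 4 with "Gogle" and is dropped by the HAVING clause.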
  • 50. SSJoin: Performance
    • Time complexity? Since we use an equi-join, we can rely on RDBMS optimizations such as merge join or hash join and get O(N+M) for reading the inputs (plus the size of the join output), with lower I/O cost if one table fits in RAM.
  • 52. SSJoin: Performance
    • Is it still problematic? Yes. The size of the equi-join on B varies widely with the joint-frequency distribution of B, and it can be very large.
  • 54. SSJoin: Performance
    • Is there any other optimization opportunity? Yes, using the “prefix filtering principle” [2].
  • 56. SSJoin With Prefix Filtering: Goal
    Reduce the number of intermediate <R.A, S.A> groups compared, and thus reduce the size of the resulting equi-join.
  • 57. SSJoin With Prefix Filtering: How
    Instead of performing an equi-join on all of R and S, we may ignore a large subset of S and perform the equi-join on R and a small filtered subset of S, obtained using prefix filtering.
  • 59. SSJoin With Prefix Filtering: Intuition
    If two records are similar, some fragments of them should overlap with each other, as otherwise the two records won't have enough overlap. It is implemented by establishing an upper bound on the overlap between two sets based on only part of them.
  • 60. SSJoin With Prefix Filtering: Formal Algorithm [2]
    - If Prefix(U) ∩ Prefix(V) = ∅, then overlap(U, V) < t
    - Global ordering is important: both prefixes must be taken under the same ordering of the tokens
  • 61. SSJoin With Prefix Filtering: Algorithm [2]
    1. Compute prefix(S) for each record S.
    2. Compute an equi-join on the B columns between R and S, adding up the weights of all joining values of B.
    3. Candidate phase: pair all records that share at least one token in their prefixes.
    4. Compute the overlap between the groups of R.A and S.A by grouping the result on <R.A, S.A>.
    5. Verify phase: use the HAVING clause to keep only the groups whose overlap is greater than the specified threshold α. This yields the result of the SSJoin.
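A small in-memory sketch may help make the principle concrete. It assumes the usual unweighted form of prefix filtering, which the slides do not spell out: tokens are sorted by one global order (here rarest first, a common heuristic), and for an overlap threshold of t shared tokens the prefix of a set U is its first |U| - t + 1 tokens. Two sets with overlap of at least t must then share a token in their prefixes. All function names are invented for this example.

```python
from collections import Counter, defaultdict


def global_order(*collections) -> dict:
    """Rank tokens from rarest to most frequent (ties broken alphabetically)."""
    freq = Counter(tok for coll in collections for toks in coll.values() for tok in toks)
    return {tok: rank for rank, tok in enumerate(sorted(freq, key=lambda k: (freq[k], k)))}


def prefix(tokens: set, order: dict, t: int) -> list:
    """The first len(tokens) - t + 1 tokens under the global order."""
    ranked = sorted(tokens, key=order.__getitem__)
    return ranked[: max(len(ranked) - t + 1, 0)]


def prefix_filter_join(R: dict, S: dict, t: int) -> list:
    """Pairs (r, s) with at least t shared tokens; R and S map a name to a token set."""
    order = global_order(R, S)
    # Index only the prefix tokens of S.
    index = defaultdict(set)
    for s_name, s_toks in S.items():
        for tok in prefix(s_toks, order, t):
            index[tok].add(s_name)
    result = []
    for r_name, r_toks in R.items():
        # Candidate phase: records sharing at least one prefix token.
        candidates = set()
        for tok in prefix(r_toks, order, t):
            candidates |= index.get(tok, set())
        # Verify phase: compute the full overlap only for the candidates.
        for s_name in sorted(candidates):
            if len(r_toks & S[s_name]) >= t:
                result.append((r_name, s_name))
    return result


R = {"food": {"fo", "oo", "od"}}
S = {"good": {"go", "oo", "od"}, "bard": {"ba", "ar", "rd"}}
joined = prefix_filter_join(R, S, t=2)
```

In this toy input "bard" never becomes a candidate for "food", so its full overlap is never computed; on real data that pruning is where the savings come from.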
  • 63. Motivation: Text Similarity: “Different people have a slightly different notion of what text similarity means.”
  • 64. Motivation: Types
    • Lexical similarity measures how close two pieces of text are on the surface [5].
    • Semantic similarity measures how close two pieces of text are in their meaning [5].
  • 65. Motivation: Goal: Enhance queries by allowing them to quantify semantic relationships inside the database using natural language processing.
  • 66. Motivation: New Capabilities
    • Semantic similarity queries
      - Find the most similar customer (semantically) to a potential customer, by industry
    • Analogies
      - Find all pairs of products a, b that relate to each other as peanut butter relates to jelly
    • Schema-less navigation
      - Find all tickets of user “moshe”, given an unknown fuzzy foreign key between tickets and users
  • 72. Architecture Overview: The big picture (the authors' version [6])
    [Diagram with the following components: SQL with UDF, Query Engine, Database Relations, Calculate SQL, Tokenization + Calculate relation vector, Results.]
  • 74. Semantic Similarity Queries: Building Blocks Needed
    - cosineDistance(a, b), which takes vectors a and b and returns their cosine distance; the queries in this deck treat a higher value as more similar
    - vec(token), which takes a token and returns its associated vector
    - Token e, which declares a variable e that can be bound to tokens
    - contains(row, entity), which states that entity must be bound to a token generated by tokenizing row
  • 75. Semantic Similarity Queries: Question
    - Find the most similar customer (semantically) to a potential customer, by industry.
  • 77. Semantic Similarity Queries
    Find the most similar customer (semantically) to a potential customer, by industry:
      SELECT c.name
      FROM customer c, potential_customer pc
      WHERE c.id < pc.id
      ORDER BY cosineDistance(vec(c.industry), vec(pc.industry)) DESC
      LIMIT 1
    • Why do we need c.id < pc.id? In order to avoid duplication.
    • What change is needed to avoid non-similar customers? Add a filter on the proximity to the WHERE clause.
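To see what the query computes, here is a toy re-creation in plain Python. Everything in it is hypothetical: the three-dimensional vectors stand in for trained word embeddings, `vec` is a dictionary lookup rather than the UDF from the talk, and the score is the standard cosine similarity, where a higher value means closer (matching the DESC ordering in the query).

```python
import math

# Hypothetical 3-dimensional embeddings; a real system would load trained vectors.
EMBEDDINGS = {
    "banking": [0.9, 0.1, 0.0],
    "finance": [0.8, 0.2, 0.1],
    "farming": [0.0, 0.2, 0.9],
}

# Hypothetical customer table: name -> industry.
CUSTOMERS = {"Acme Bank": "banking", "Green Fields": "farming"}


def vec(token: str) -> list:
    """Look up the vector associated with a token."""
    return EMBEDDINGS[token]


def cosine_similarity(a: list, b: list) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_customer(potential_industry: str, customers: dict, min_score=None):
    """Mirror of ORDER BY score DESC LIMIT 1, with an optional proximity filter."""
    target = vec(potential_industry)
    scored = [(cosine_similarity(vec(ind), target), name) for name, ind in customers.items()]
    if min_score is not None:
        scored = [pair for pair in scored if pair[0] >= min_score]  # the WHERE filter
    return max(scored)[1] if scored else None


best = most_similar_customer("finance", CUSTOMERS)
```

The optional `min_score` plays the role of the proximity filter mentioned in the second answer: without it the query always returns some customer, even when none is really similar.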
  • 78. Analogies: Find all pairs of products a, b that relate to each other as peanut butter relates to jelly
  • 79. Analogies: Let's break the solution into the following steps:
    1. Create a table with product names and their distance vector
    2. Create a table with product names and the cosine distance between their distance vector and the peanut butter/jelly vector
    3. Find all pairs of products a, b that relate to each other as peanut butter relates to jelly
  • 80. Analogies: 1. Create a table with product names and their distance vector
      CREATE TABLE products_distance AS
      SELECT p1.id AS p_name_1,
             p2.id AS p_name_2,
             vec(p1.description) - vec(p2.description) AS dist_vec
      FROM products p1, products p2
      WHERE p1.id < p2.id;
  • 81. Analogies: 2. Create a table with product names and the cosine distance between their distance vector and the peanut butter/jelly vector
      CREATE TABLE products_complemantry_distance AS
      SELECT p_name_1,
             p_name_2,
             cosineDistance(dist_vec, vec('peanut_butter') - vec('jelly')) AS compl_dist
      FROM products_distance;
  • 82. Analogies: 3. Find all pairs of products a, b that relate to each other as peanut butter relates to jelly
      SELECT p_name_1, p_name_2
      FROM (
        SELECT p_name_1,
               p_name_2,
               RANK() OVER (PARTITION BY p_name_1 ORDER BY compl_dist DESC) AS rnk
        FROM products_complemantry_distance
      ) ranked
      WHERE rnk = 1;  -- RANK() starts at 1; highest score first, as in the other queries
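The three steps can be traced on toy data. The sketch below is an illustration under stated assumptions, not the talk's implementation: the two-dimensional vectors are invented, the score is the standard cosine similarity (higher means closer), and the pair (a, b) is compared through vec(a) - vec(b) against vec(peanut_butter) - vec(jelly). Unlike the SQL, which keeps only pairs with p1.id < p2.id, the sketch scores ordered pairs, because an analogy is directional.

```python
import math
from itertools import permutations

# Hypothetical 2-dimensional product embeddings.
VECTORS = {
    "peanut_butter": [1.0, 0.0],
    "jelly": [0.0, 1.0],
    "chips": [2.0, 1.0],
    "salsa": [1.0, 2.0],
    "soap": [3.0, 3.0],
}


def sub(a: list, b: list) -> list:
    """Component-wise difference, the 'distance vector' of step 1."""
    return [x - y for x, y in zip(a, b)]


def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_analogy(products: list, a: str, b: str) -> tuple:
    """Return (score, p, q) for the pair whose difference best matches vec(a) - vec(b)."""
    target = sub(VECTORS[a], VECTORS[b])
    scored = [
        (cosine_similarity(sub(VECTORS[p], VECTORS[q]), target), p, q)  # step 2
        for p, q in permutations(products, 2)
    ]
    return max(scored)  # step 3: keep the top-ranked pair


score, first, second = best_analogy(["chips", "salsa", "soap"], "peanut_butter", "jelly")
```

With these made-up vectors, chips relates to salsa the way peanut butter relates to jelly, since both differences point in the same direction.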
  • 83. Schemaless Navigation: Find all tickets of user “moshe”, given an unknown fuzzy foreign key between tickets and users
  • 84. Schemaless Navigation
      SELECT users.*, tickets.*
      FROM users, tickets, Token e1, e2
      WHERE contains(users.email, e1)
        AND contains(tickets.*, e2)
        AND cosineDistance(e1, e2) > 0.5
        AND users.name = 'moshe'
  • 85. References
    1. Wang, W. (2008). Similarity Join Algorithms: An Introduction. Retrieved from http://www.cse.unsw.edu.au/~weiw/project/tutorial-simjoin-SEBD08.pdf
    2. Chaudhuri, S., Ganti, V., Kaushik, R. (2006). A Primitive Operator for Similarity Joins in Data Cleaning. Proceedings of the 22nd International Conference on Data Engineering, p. 5, April 3-7, 2006.
    3. Xiao, C., Wang, W., Lin, X., Yu, J. X., Wang, G. (2011). Efficient Similarity Joins for Near Duplicate Detection. ACM Trans. Datab. Syst. V, N.
    4. Mikolov, T. word2vec: Tool for computing continuous distributed representations of words. https://code.google.com/p/word2vec
    5. Ganesan, K. (2015, November). What is text similarity? [Blog post]. Retrieved from http://kavita-ganesan.com/what-is-text-similarity/#.Wppog5NuYv88
    6. Shmueli, O., & Bordawekar, R. (2016, March). Enabling Cognitive Intelligence Queries in Relational Databases using Low-dimensional Word Embeddings. Retrieved from https://arxiv.org/abs/1603.07185