Advanced Topics:
Database Systems
Eyal Trabelsi
Agenda
• Similarity Join
- Introduction
- Applications
- Naive solution
- Similarity Join in RDBMS
• Similarity Join Performance Optimizations
- Introduction
- SSJoin
• Semantic Similarity For Text
- Introduction
- Overall architecture
- New capabilities
Similarity Join
Definition
“Resembling without being identical”
Formal
Definition
• Input:
- Two sets of objects: R and S
- A similarity function: sim(r,s)
- A threshold: t
• Output:
- All pairs (r, s), with r in R and s in S, such that sim(r,s) ≥ t
  (when a distance function is used instead, the condition is dist(r,s) ≤ t)
Formal
Definition
The Input Can Be:
- Numbers
- Pictures
- Vectors
- Sets
- Text
- Pretty much everything
Similarity
Functions
Is it new to us?
• Lp distance
• Hamming distance
• Cosine similarity
• Jaccard similarity
• Edit distance
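The functions listed above can be sketched in a few lines each. This is a minimal illustration, assuming plain Python sequences and sets as inputs; the function names are ours and do not come from the slides.

```python
import math

def lp_distance(x, y, p=2):
    """L_p distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets (defined as 1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def edit_distance(s, t):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]

print(edit_distance("food", "good"))     # 1
print(jaccard({1, 2, 3}, {2, 3, 4}))     # 0.5
```

Note that Lp, Hamming and edit distance are distances (smaller means more similar), while cosine and Jaccard are similarities (larger means more similar).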
Applications
• Non-exact deduplication
• Search engines
• Analytics
• Data consolidation
- Lack of consistency, for example writing both "$" and "dollars"
- Typos, for example “why everybdoy can understand this”
- Precision, for example rounding numbers
Solution
So how do we solve this?
Naive Solution
Use a simple nested-loop algorithm and compare all pairs using the similarity function.
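A minimal sketch of the nested-loop approach, assuming Jaccard similarity over the character sets of two small, made-up string lists (the data and helper names are illustrative only):

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two non-empty sets."""
    return len(a & b) / len(a | b)

def naive_similarity_join(R, S, sim, t):
    """Compare every r in R with every s in S and keep pairs with sim >= t."""
    return [(r, s) for r in R for s in S if sim(r, s) >= t]

def char_jaccard(r, s):
    """Jaccard similarity of the sets of characters of two strings."""
    return jaccard(set(r), set(s))

R = ["food", "data"]
S = ["good", "date"]
pairs = naive_similarity_join(R, S, char_jaccard, 0.5)
print(pairs)  # [('food', 'good'), ('data', 'date')]
```

Every one of the |R|·|S| pairs is evaluated, which is what makes this approach quadratic.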
• Time complexity? O(n²) similarity computations for two inputs of n objects each.
• Is this good enough? It depends on the application.
• Is this good enough for an RDBMS?
Similarity
Join In
RDBMS
“An RDBMS should provide a solution that is as generic and performant as possible”
The solution should:
• Handle large datasets = many rows
• Handle high-dimensional datasets = many columns
• Support a variety of similarity functions
• Support hard similarity functions
• Be correct, that is, answer the application's needs
Optimization opportunities:
• Consider only promising pairs [1], by filtering first.
• Pruning and refinement paradigm [1][2][3].
• Resort to approximate solutions [1].
What will each one tackle?
- Large datasets
- High-dimensional datasets
- Hard similarity functions
- Correctness
Similarity Join
Performance
Optimizations
Introduction
Optimization opportunities:
• Consider only promising pairs [1].
• Pruning and refinement paradigm [1][2][3].
• Resort to approximate solutions [1].
Inputs can be:
• Numbers
• Vectors
• Sets
• Text
Similarity Join On Strings/Sets:
• Similarity between sets
- Binary similarity functions, such as contains and intersects
- Numerical similarity functions, such as overlap, Jaccard or cosine
• Similarity between strings
- Treat strings as sets and use Jaccard (on q-grams), or use edit distance
Our Goal
To filter before the cross product is computed, and so reduce the number of pairs constructed for the join.
The big picture
[Figure: String → Set → Weighted Set, with the steps labelled “mapping string to set”, “set weights” and “set similarity”]
A = food
B = good
A = {fo, oo, od}
B = {go, oo, od}
overlap(A, B) = |A ∩ B| = 2
Didn’t we say we want to support multiple similarity functions?
By using overlap we can implement many other similarity functions.
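The mapping from strings to sets and the overlap computation can be sketched as follows. This is an illustration only: it assumes 2-grams without padding characters, and the optional `weight` dictionary is a hypothetical stand-in for the "set weights" step (every token weighs 1 unless stated otherwise).

```python
def qgrams(s, q=2):
    """Set of all substrings of length q (no padding at the string ends)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def overlap(a, b, weight=None):
    """Total weight of the tokens shared by a and b (unit weights by default)."""
    weight = weight or {}
    return sum(weight.get(token, 1) for token in a & b)

A = qgrams("food")   # {'fo', 'oo', 'od'}
B = qgrams("good")   # {'go', 'oo', 'od'}
print(overlap(A, B))                      # 2: the shared 2-grams are 'oo' and 'od'
print(overlap(A, B, weight={"oo": 3}))    # 4: 'oo' counts 3, 'od' counts 1
```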
SS JOIN
Proposed solution [2][3]
Exploit the observation that set overlap can be used effectively to support a variety of similarity functions:
● Jaccard similarity.
● Edit similarity and generalized edit similarity.
● Hamming distance.
● Similarity based on co-occurrences.
• Algorithm [2]:
1. Compute an equi-join between R and S on the B columns, adding up the weights of all joining values of B.
2. Candidate phase: compute the overlap between groups on R.A and S.A by grouping the join result on <R.A, S.A>.
3. Verify phase: through the HAVING clause, keep only the groups whose overlap is greater than the specified threshold α. The remaining groups are the result of the SSJoin.
SS JOIN
Example: given two relations R and S holding company names, compute the similarity join with overlap > 60%.
1. Compute an equi-join between R and S on the B columns, adding up the weights of all joining values of B.
In our case B is the 3-gram column.
2. Candidate phase: compute the overlap between groups on R.A and S.A by grouping the join result on <R.A, S.A>.
In our example A is the orgName column, and the overlaps of the grouped orgName values are as follows:
- Microsoft has an overlap of 10.
- Google has an overlap of 2.
3. Verify phase: through the HAVING clause, keep only the groups whose overlap is greater than the specified threshold α.
In our example we are looking for more than 60% overlap, so the verify phase is computed as follows:
- Microsoft has an overlap of 10 out of 12 (83%), so it is returned in the join result.
- Google has an overlap of 2 out of 4 (50%), so it is filtered out of the join result.
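The three phases can be sketched in plain Python. This is an illustration under stated assumptions, not the operator from [2]: each relation is a list of (A, B, weight) rows with unit weights, the overlap is normalized by the total weight of the R-side group (to mirror "10 out of 12"), and the company strings are made up so that the counts match the example; the original tables are not part of the text.

```python
from collections import defaultdict

def to_rows(name, q=3):
    """Normalize a string into (A, B, weight) rows: A = name, B = one q-gram."""
    grams = {name[i:i + q] for i in range(len(name) - q + 1)}
    return [(name, gram, 1) for gram in sorted(grams)]

def ssjoin(R, S, min_fraction):
    # Step 1: equi-join on B, implemented with a hash index on S.B.
    index = defaultdict(list)
    for s_name, gram, _ in S:
        index[gram].append(s_name)

    # Step 2 (candidate phase): group the join result on <R.A, S.A>
    # and add up the weights of the joining B values.
    overlap = defaultdict(int)
    r_size = defaultdict(int)          # total weight of each R.A group
    for r_name, gram, weight in R:
        r_size[r_name] += weight
        for s_name in index.get(gram, ()):
            overlap[(r_name, s_name)] += weight

    # Step 3 (verify phase): the HAVING clause keeps groups above the threshold.
    return {pair: ov for pair, ov in overlap.items()
            if ov / r_size[pair[0]] > min_fraction}

R = to_rows("microsoft corp") + to_rows("google")   # 12 and 4 distinct 3-grams
S = to_rows("microsoft co") + to_rows("gogle")
matches = ssjoin(R, S, 0.6)
print(matches)  # {('microsoft corp', 'microsoft co'): 10}
```

The pair ("google", "gogle") shares only 2 of 4 3-grams (50%) and is dropped by the verify phase.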
Performance
• Time complexity?
Since we use an equi-join, the RDBMS can apply its usual join optimizations, such as a merge join or a hash join. A hash join scans both inputs in O(N+M) time (not counting the size of the join output), with less I/O if one table fits in RAM.
• Is it still problematic?
Yes. The size of the equi-join on B varies widely with the joint-frequency distribution of B, and it can be very large.
• Is there another optimization opportunity?
Yes, using the “prefix filtering principle” [2].
SS JOIN
With Prefix
Filtering
Goal
Reduce the number of intermediate <R.A, S.A> groups compared, and thus reduce the size of the resulting equi-join.
How
Instead of performing an equi-join on R and S, we may ignore a large subset of S and perform the equi-join on R and a small filtered subset of S, using prefix filtering.
Intuition
If two records are similar, some fragments of them must overlap; otherwise the two records cannot have enough overlap.
It is implemented by establishing an upper bound on the overlap between two sets based on only part of them.
• Formal statement [2]
- If Prefix(U) ∩ Prefix(V) = ∅, then overlap(U, V) < t, where Prefix(X) is the first |X| − t + 1 tokens of X under the global ordering (unweighted case)
- The global ordering is important: both prefixes must be taken under the same ordering of tokens
• Algorithm [2]:
1. Compute the prefix of each record.
2. Compute an equi-join between R and S on the B columns, restricted to the prefix tokens.
3. Candidate phase: pair all records that share at least one token in their prefixes.
4. Compute the overlap between groups on R.A and S.A for the candidate pairs, adding up the weights of all joining values of B and grouping the result on <R.A, S.A>.
5. Verify phase: through the HAVING clause, keep only the groups whose overlap is greater than the specified threshold α. The remaining groups are the result of the SSJoin.
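A minimal sketch of prefix filtering for unweighted sets, assuming an absolute overlap threshold t (overlap ≥ t) and a global ordering that puts rare tokens first. The ordering choice, the dictionary layout and the toy data are assumptions made for illustration, not details taken from [2].

```python
from collections import Counter, defaultdict

def prefix_filter_join(R, S, t):
    """R, S: dict mapping a record name to its token set.
    Returns (pairs with overlap >= t, candidate pairs that were verified)."""
    # Global ordering: ascending frequency, ties broken alphabetically.
    freq = Counter(tok for toks in list(R.values()) + list(S.values())
                   for tok in toks)

    def order(tok):
        return (freq[tok], tok)

    def prefix(tokens):
        # Two sets with overlap >= t must share a token among their
        # first |X| - t + 1 tokens under the same global ordering.
        k = max(len(tokens) - t + 1, 0)
        return sorted(tokens, key=order)[:k]

    # Index the prefixes of S, then probe with the prefixes of R.
    index = defaultdict(set)
    for s_name, toks in S.items():
        for tok in prefix(toks):
            index[tok].add(s_name)

    candidates = set()
    for r_name, toks in R.items():
        for tok in prefix(toks):
            for s_name in index.get(tok, ()):
                candidates.add((r_name, s_name))

    # Verify phase: compute the full overlap only for the candidates.
    result = {(r, s) for r, s in candidates if len(R[r] & S[s]) >= t}
    return result, candidates

R = {"r1": {"a", "b", "c", "d"}, "r2": {"x", "y", "z"}}
S = {"s1": {"a", "b", "c", "e"}, "s2": {"x", "q", "w"}, "s3": {"m", "n", "o"}}
result, candidates = prefix_filter_join(R, S, t=3)
print(result)  # {('r1', 's1')}
```

Only the candidate pairs reach the verify phase, instead of all |R|·|S| pairs, and the answer is the same as that of the full join.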
Semantic Similarity
Join On Text
Motivation
Text
Similarity
“Different people have a slightly different notion of what text similarity means”
Types
• Lexical similarity measures how 'close' two pieces of text are on the surface [5].
• Semantic similarity measures how 'close' two pieces of text are in their meaning [5].
Goal
Enhance queries by allowing them to quantify semantic relationships inside the database, using natural language processing.
New
Capabilities
CI (cognitive intelligence) queries [6]
• Semantic similarity queries
- Find the customer most similar (semantically) to a potential customer by industry
• Analogies
- Find all pairs of products a, b that relate to each other as peanut butter relates to jelly
• Schema-less navigation
- Find all tickets of user “moshe”, given an unknown fuzzy foreign key between tickets and users
Architecture
Overview
How is this done?
The big picture: the authors' version [6]
[Figure: architecture diagram from [6]]
The big picture
[Figure: architecture diagram with the labels “SQL with UDF”, “Query Engine”, “Database Relations”, “Calculate SQL”, “Tokenization + calculate relation vector” and “Results”]
Tokenize relation [6] → Embed relation [4] → Relation embedding
Semantic
Similarity
Queries
Building Blocks Needed
- cosineSimilarity(a, b), which takes vectors a and b and returns their cosine similarity (the queries in this deck call it cosineDistance; higher values mean more similar)
- vec(token), which takes a token and returns its associated vector
- Token entity e, which declares a variable e that can be bound to tokens
- contains(row, entity), which states that the entity must be bound to a token generated by tokenizing the row
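The building blocks can be imitated outside the database to see how they fit together. In this sketch the three-dimensional embeddings are invented toy values (a real system would load vectors trained with a tool such as word2vec [4]), and `tokenize`, `contains` and `most_similar` are hypothetical helpers, not the UDFs from [6].

```python
import math

# Hypothetical toy embeddings; the values are made up for illustration.
EMBEDDINGS = {
    "banking":   (0.9, 0.1, 0.0),
    "insurance": (0.8, 0.3, 0.1),
    "farming":   (0.0, 0.2, 0.9),
}

def vec(token):
    """Return the vector associated with a token."""
    return EMBEDDINGS[token]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def tokenize(row):
    """Split every field of a row into lowercase, whitespace-separated tokens."""
    return [tok for field in row for tok in str(field).lower().split()]

def contains(row, token):
    """True if the token is produced by tokenizing the row."""
    return token in tokenize(row)

def most_similar(customers, industry):
    """Name of the customer whose industry vector is closest to `industry`."""
    return max(customers,
               key=lambda c: cosine_similarity(vec(c[1]), vec(industry)))[0]

customers = [("Acme Farms", "farming"), ("Safe Co", "insurance")]
print(most_similar(customers, "banking"))   # Safe Co
```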
 
Questions
- Find the customer most similar (semantically) to a potential customer by industry.
SELECT c.name
FROM customer c, potential_customer pc
WHERE c.id < pc.id
ORDER BY cosineDistance(vec(c.industry), vec(pc.industry)) DESC
LIMIT 1
Why do we need c.id < pc.id?
In order to avoid duplication.
What change is needed to avoid non-similar customers?
Add a filter on the proximity to the WHERE clause, for example requiring cosineDistance(vec(c.industry), vec(pc.industry)) to exceed a chosen threshold.
Analogies
Find all pairs of products a, b that relate to each other as peanut butter relates to jelly.
Let's break the solution into the following steps:
1. Create a table with pairs of product names and their difference vector.
2. Create a table with the product names and the cosine distance between their difference vector and the peanut butter/jelly difference vector.
3. Find all pairs of products a, b that relate to each other as peanut butter relates to jelly.
1. Create a table with pairs of product names and their difference vector
CREATE TABLE products_distance AS
SELECT p1.id AS p_name_1,
       p2.id AS p_name_2,
       vec(p1.description) - vec(p2.description) AS dist_vec
FROM products p1, products p2
WHERE p1.id < p2.id;
2. Create a table with the product names and the cosine distance between their difference vector and the peanut butter/jelly difference vector
CREATE TABLE products_complemantry_distance AS
SELECT p_name_1,
       p_name_2,
       cosineDistance(dist_vec, vec('peanut_butter') - vec('jelly'))
         AS compl_dist
FROM products_distance;
3. Find all pairs of products a, b that relate to each other as peanut butter relates to jelly (as in the other queries, a higher cosineDistance value means more similar; the rank is computed in a subquery because a window function cannot be filtered in the same WHERE clause, and RANK() starts at 1)
SELECT p_name_1,
       p_name_2
FROM (SELECT p_name_1,
             p_name_2,
             RANK() OVER (PARTITION BY p_name_1
                          ORDER BY compl_dist DESC) AS rnk
      FROM products_complemantry_distance) ranked
WHERE rnk = 1
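The same idea can be tried on toy data. In the sketch below the embeddings are invented values, and ordered pairs are scored (unlike the p1.id < p2.id condition in the SQL) because the direction of an analogy matters; both are assumptions made for illustration.

```python
import math
from itertools import permutations

# Hypothetical toy embeddings; the values are made up for illustration.
EMB = {
    "peanut_butter": (1.0, 0.0, 1.0),
    "jelly":         (0.0, 1.0, 1.0),
    "bread":         (0.9, 0.1, 0.5),
    "butter":        (0.0, 0.9, 0.5),
    "soap":          (0.2, 0.2, 0.0),
}

def sub(a, b):
    """Component-wise difference of two vectors."""
    return tuple(x - y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_analogy(products, a, b):
    """Ordered pair (p, q) whose difference vector is closest to vec(a) - vec(b)."""
    target = sub(EMB[a], EMB[b])
    scored = [(cosine_similarity(sub(EMB[p], EMB[q]), target), p, q)
              for p, q in permutations(products, 2)]
    return max(scored)

score, p, q = best_analogy(["bread", "butter", "soap"], "peanut_butter", "jelly")
print(p, q)   # bread butter
```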
Schemaless
Navigation
Find all tickets of user “moshe”, given an unknown fuzzy foreign key between tickets and users.
SELECT users.*,
       tickets.*
FROM users, tickets, Token e1, e2
WHERE contains(users.email, e1) AND
      contains(tickets.*, e2) AND
      cosineDistance(vec(e1), vec(e2)) > 0.5 AND
      users.name = 'moshe'
References
1. Wang, W. (2008). Similarity Join Algorithms: An Introduction. Retrieved from http://www.cse.unsw.edu.au/~weiw/project/tutorial-simjoin-SEBD08.pdf.
2. Chaudhuri, S., Ganti, V., & Kaushik, R. (2006). A Primitive Operator for Similarity Joins in Data Cleaning. Proceedings of the 22nd International Conference on Data Engineering, p. 5, April 3-7, 2006.
3. Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient Similarity Joins for Near Duplicate Detection. ACM Trans. Datab. Syst. V, N.
4. Mikolov, T. word2vec: Tool for computing continuous distributed representations of words. https://code.google.com/p/word2vec.
5. Ganesan, K. (2015, November). What is text similarity? [Blog post]. Retrieved from http://kavita-ganesan.com/what-is-text-similarity/#.Wppog5NuYv88.
6. Shmueli, O., & Bordawekar, R. (2016, March). Enabling Cognitive Intelligence Queries in Relational Databases using Low-dimensional Word Embeddings. Retrieved from https://arxiv.org/abs/1603.07185.
Questions

Seminar - Similarity Joins in SQL (performance and semantic joins)

  • 2. Agenda • Similarity Join - Introduction - Applications - Naive solution - Similarity Join in RDBMS • Similarity Join Performance Optimizations - Introduction - SSJoin • Semantic Similarity For Text - Introduction - Overall architecture - New capabilities
  • 5. Formal Definition • Input: - Two sets of objects: R and S - A similarity function: sim(r, s) - A threshold: t • Output: - All pairs (r, s) with r in R and s in S such that sim(r, s) ≥ t (or, when sim is a distance function, sim(r, s) ≤ t)
  • 6-7. Formal Definition: The input can be: - Numbers - Pictures - Vectors - Sets - Text - Pretty much everything
  • 9. Similarity Functions: Is it new to us? • Lp distance • Hamming distance • Cosine similarity • Jaccard • Edit distance
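A minimal Python sketch of the five functions listed above, written from their textbook definitions rather than taken from the slides. The function names are illustrative, and the set-based Jaccard assumes at least one of the two sets is non-empty.

```python
import math

def lp_distance(x, y, p=2):
    """L_p distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def jaccard(a, b):
    """|a ∩ b| / |a ∪ b| for two sets (not both empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def edit_distance(s, t):
    """Levenshtein distance, computed one dynamic-programming row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]
```

Note that some of these are distances (smaller means closer) and some are similarities (larger means closer), which matters when choosing the direction of the threshold comparison.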
  • 13. Applications: Data consolidation - Lack of consistency, for example writing both "$" and "dollars" - Typos, for example “why everybdoy can understand this” - Precision, for example rounding numbers
  • 14. Solution: So how do we solve this?
  • 16. Naive Solution: Use a simple nested-loop algorithm that compares all pairs using the similarity function.
  • 17-18. Naive Solution • Time complexity? O(n²) • Is this good enough? It depends on the application • Is this good enough for an RDBMS?
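The nested-loop approach can be sketched in a few lines of Python. This is an illustration only: it assumes a pair is kept when sim(r, s) is at least t, and it uses Jaccard over the characters of each string as a stand-in similarity function; the sample strings are made up.

```python
def naive_similarity_join(R, S, sim, t):
    """Compare every pair (r, s): exactly |R| * |S| calls to sim."""
    return [(r, s) for r in R for s in S if sim(r, s) >= t]

def jaccard(a, b):
    """Jaccard similarity over the character sets of two strings."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

R = ["food", "data"]
S = ["good", "date", "xyz"]
pairs = naive_similarity_join(R, S, jaccard, 0.5)
print(pairs)  # [('food', 'good'), ('data', 'date')]
```

With |R| = |S| = n the loop body runs n² times, regardless of how few pairs actually qualify, which is the cost the following slides try to avoid.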
  • 19-20. Similarity Join In RDBMS: “An RDBMS should provide a solution that is as generic and performant as possible”
  • 21-22. Similarity Join In RDBMS: The solution should • Handle large datasets = many rows • Handle high-dimensional datasets = many columns • Support a variety of similarity functions • Support hard similarity functions • Be correct, that is, answer the application's needs
  • 23. Similarity Join In RDBMS: Optimization opportunities • Consider only promising pairs [1]. • Pruning and refinement paradigm [1][2][3]. • Resort to approximate solutions [1].
  • 24-25. Similarity Join In RDBMS: Optimization opportunity • Consider only promising pairs [1], by filtering first. What will it tackle? - Large datasets - High-dimensional datasets - Hard similarity functions - Should be correct
  • 26-27. Similarity Join In RDBMS: Optimization opportunity • Resort to approximate solutions [1]. What will it tackle? - Large datasets - High-dimensional datasets - Hard similarity functions - Should be correct
  • 28-29. Similarity Join In RDBMS: Optimization opportunity • Pruning and refinement paradigm [1][2][3]. What will it tackle? - Large datasets - High-dimensional datasets - Hard similarity functions - Should be correct
  • 31-32. Introduction: Optimization opportunities: • Consider only promising pairs [1]. • Pruning and refinement paradigm [1][2][3]. • Resort to approximate solutions [1]. Inputs can be: • Numbers • Vectors • Sets • Text
  • 33. Introduction: Similarity Join On Strings/Sets • Similarity between sets - Binary similarity functions, like contains or intersects - Numerical similarity functions, like overlap, Jaccard or cosine • Similarity between strings - Treat strings as sets and use Jaccard (on q-grams), or use edit distance
  • 34-35. Introduction: Our Goal: To perform filtering before the cross product occurs, and so reduce the number of pairs constructed for the join.
  • 36. Introduction: The big picture: String → (mapping string to set) → Set → (set weights) → Weighted Set, with set similarity computed on the Set or the Weighted Set.
  • 37-39, 42. Introduction: The big picture, example: A = "food", B = "good". As sets of 2-grams, A = {fo, oo, od} and B = {go, oo, od}, so overlap(A, B) = 2.
  • 40. Introduction: Didn’t we say we want to support multiple similarity functions?
  • 41. Introduction: By using overlap we can implement many other similarity functions.
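The string-to-set mapping in the food/good example can be reproduced with a short Python sketch. The helper names `qgrams` and `overlap` are illustrative, and q = 2 is assumed because the sets on the slide contain two-character tokens.

```python
def qgrams(s, q=2):
    """Set of all contiguous substrings of length q in s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def overlap(a, b):
    """Unweighted set overlap: the number of shared tokens."""
    return len(a & b)

A = qgrams("food")  # {'fo', 'oo', 'od'}
B = qgrams("good")  # {'go', 'oo', 'od'}
print(overlap(A, B))  # 2, the shared tokens are 'oo' and 'od'
```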
  • 43. SSJoin: Proposed solution [2][3]: Exploit the observation that set overlap can be used effectively to support a variety of similarity functions: ● Jaccard similarity. ● Edit similarity and generalized edit similarity. ● Hamming distance. ● Similarity based on co-occurrences.
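To illustrate the observation with one of the listed functions: for non-empty sets, J(A, B) ≥ t holds exactly when |A ∩ B| ≥ t / (1 + t) · (|A| + |B|), so a Jaccard predicate can be rewritten as an overlap predicate. The sketch below checks this equivalence exhaustively on a small, made-up token universe, using exact fractions to avoid floating-point edge cases. It is a sanity check, not the reduction as implemented in [2].

```python
from fractions import Fraction
from itertools import combinations

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return Fraction(len(a & b), len(a | b))

def jaccard_via_overlap(a, b, t):
    """Overlap form of the predicate J(a, b) >= t."""
    return len(a & b) >= t / (1 + t) * (len(a) + len(b))

t = Fraction(3, 5)
universe = ["fo", "oo", "od", "go", "da", "at"]  # hypothetical tokens
sets = [set(c) for k in (1, 2, 3) for c in combinations(universe, k)]
agree = all((jaccard(a, b) >= t) == jaccard_via_overlap(a, b, t)
            for a in sets for b in sets)
print(agree)  # True
```

The equivalence follows from |A ∪ B| = |A| + |B| − |A ∩ B|: substituting into J ≥ t and collecting the overlap terms gives the bound above.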
  • 44. SSJoin • Algorithm [2]: 1. Compute an equi-join on the B columns of R and S, adding up the weights of all joining values of B. 2. Candidate phase: group the result on <R.A, S.A> to compute the overlap between each pair of groups R.A and S.A. 3. Verify phase: use the HAVING clause to keep only the groups whose overlap is greater than the specified threshold α; these form the result of the SSJoin.
  • 45. SSJoin: Given two relations R and S holding company names, compute the similarity join with overlap > 60%*
  • 46. SSJoin: 1. Compute an equi-join on the B columns of R and S, adding up the weights of all joining values of B. In our case, B is the 3-gram column.
  • 47. SSJoin: 2. Candidate phase: group the result on <R.A, S.A> to compute the overlap between each pair of groups R.A and S.A. In our example, A is the orgName column, and the overlap between the grouped orgName values is as follows: - Microsoft has an overlap of 10. - Google has an overlap of 2.
  • 48. SSJoin: 3. Verify phase: use the HAVING clause to keep only the groups whose overlap is greater than the specified threshold α. In our example we are looking for 60% overlap, so the verify phase is computed as follows: - Microsoft has an overlap of 10 out of 12, which is 83%, so it is returned in the resulting join. - Google has an overlap of 2 out of 4, which is 50%, so it is filtered out of the join.
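The three steps can be tried end to end with SQLite from Python. This is a sketch under several assumptions that are not in the slides: the table layout R(A, B) and S(A, B) with one row per (name, 3-gram), unweighted tokens (every weight is 1, so the overlap is a row count), the threshold taken as a fraction of the R-side set size, and made-up company names.

```python
import sqlite3

def qgrams(s, q=3):
    """Set of all contiguous substrings of length q in s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def ssjoin(r_names, s_names, fraction=0.6, q=3):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE R (A TEXT, B TEXT)")
    con.execute("CREATE TABLE S (A TEXT, B TEXT)")
    for table, names in (("R", r_names), ("S", s_names)):
        rows = [(n, g) for n in names for g in sorted(qgrams(n, q))]
        con.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    # Helper table (an assumption of this sketch): q-gram count per R name.
    con.execute("CREATE TABLE Rsize AS SELECT A, COUNT(*) AS n FROM R GROUP BY A")
    query = """
        SELECT R.A, S.A, COUNT(*) AS overlap
        FROM R
        JOIN S ON R.B = S.B               -- 1. equi-join on B
        JOIN Rsize ON Rsize.A = R.A
        GROUP BY R.A, S.A, Rsize.n        -- 2. candidate phase
        HAVING COUNT(*) >= ? * Rsize.n    -- 3. verify phase
        ORDER BY R.A, S.A
    """
    return con.execute(query, (fraction,)).fetchall()

result = ssjoin(["Microsoft Corp", "Google Inc"], ["Mcrosoft Corp", "Gogle LLC"])
print(result)  # [('Microsoft Corp', 'Mcrosoft Corp', 10)]
```

Because the similarity test is expressed as an ordinary equi-join plus GROUP BY and HAVING, the database is free to pick whichever join algorithm it prefers for the first step.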
  • 49-50. SSJoin: Performance • Time complexity? Since we use an equi-join, we can rely on RDBMS optimizations such as merge join or hash join and get O(N + M), and it can be faster still if one table fits in RAM.
  • 51-52. SSJoin: Performance • Is it still problematic? Yes, the size of the equi-join on B varies widely with the joint-frequency distribution of B, and it can be very large.
  • 53-54. SSJoin: Performance • Is there any other optimization opportunity? Yes, using the “prefix filtering principle” [2].
  • 56. SSJoin With Prefix Filtering: Goal: Reduce the number of intermediate <R.A, S.A> groups compared, and thus reduce the size of the resulting equi-join.
  • 57. SSJoin With Prefix Filtering: How: Instead of performing an equi-join on all of R and S, we may ignore a large subset of S and perform the equi-join on R and a small filtered subset of S, chosen by prefix filtering.
  • 58-59. SSJoin With Prefix Filtering: Intuition: If two records are similar, some fragments of them must overlap, as otherwise the two records won’t have enough overlap. It is implemented by establishing an upper bound on the overlap between two sets based on only part of each set.
  • 60. SSJoin With Prefix Filtering • Formal algorithm [2] - If Prefix(U) ∩ Prefix(V) = ∅, then overlap(U, V) < t - A global ordering of the tokens is important: both prefixes must be taken under the same ordering
  • 61. SSJoin With Prefix Filtering • Algorithm [2]: 1. Compute the prefix of each record. 2. Compute an equi-join on the B columns of R and S, adding up the weights of all joining values of B. 3. Candidate phase: pair all records that share at least one token in their prefixes. 4. Group the result on <R.A, S.A> to compute the overlap between each pair of groups R.A and S.A. 5. Verify phase: use the HAVING clause to keep only the groups whose overlap is greater than the specified threshold α; these form the result of the SSJoin.
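A small in-memory Python sketch of prefix filtering, under assumptions that go beyond the slides: tokens are unweighted, t is an absolute overlap threshold with t ≥ 1, the prefix of a set with n tokens is its first n − t + 1 tokens, and the global ordering puts rare tokens first (a common choice, picked here for illustration). Function names and data are made up.

```python
from collections import Counter, defaultdict

def prefix(tokens, t, rank):
    """First len(tokens) - t + 1 tokens under the global ordering `rank`."""
    ordered = sorted(tokens, key=lambda tok: rank[tok])
    return ordered[:max(len(ordered) - t + 1, 0)]

def prefix_filter_join(R, S, t):
    """Index pairs (i, j) with |R[i] & S[j]| >= t, found via prefix filtering."""
    # Global ordering: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for rec in list(R) + list(S) for tok in rec)
    rank = {tok: k for k, tok in
            enumerate(sorted(freq, key=lambda x: (freq[x], x)))}
    # Index only the prefix tokens of S.
    index = defaultdict(set)
    for j, s in enumerate(S):
        for tok in prefix(s, t, rank):
            index[tok].add(j)
    result = []
    for i, r in enumerate(R):
        candidates = set()
        for tok in prefix(r, t, rank):          # candidate phase
            candidates |= index.get(tok, set())
        for j in sorted(candidates):            # verify phase
            if len(r & S[j]) >= t:
                result.append((i, j))
    return result

R = [{"fo", "oo", "od"}, {"da", "at", "ta"}]
S = [{"go", "oo", "od"}, {"da", "at", "te"}, {"xy", "yz"}]
print(prefix_filter_join(R, S, t=2))  # [(0, 0), (1, 1)]
```

Only pairs whose prefixes share a token are ever verified, so pairs such as (0, 2) are never compared at all, while no qualifying pair is lost.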
  • 63. Motivation: Text Similarity: “Different people have slightly different notions of what text similarity means”
  • 64. Motivation: Types • Lexical similarity measures how "close" two pieces of text are on the surface [5]. • Semantic similarity measures how "close" two pieces of text are in their meaning [5].
  • 65. Motivation: Goal: Enhance queries by making it possible to quantify semantic relationships inside the database using natural language processing.
  • 66. Motivation: New Capabilities • Semantic similarity queries - Find the customer most similar (semantically) to a potential customer, by industry • Analogies - Find all pairs of products a, b that relate to each other as peanut butter relates to jelly • Schema-less navigation - Find all tickets of user “moshe”, given an unknown fuzzy foreign key between tickets and users.
  • 68. Architecture Overview: The big picture: The authors' version [6]
  • 69-72. Architecture Overview: The big picture (diagram built up step by step; components shown: SQL with UDF, Query Engine, Calculate SQL, Database Relations, Tokenization + Calculate Relation vector, Results)
  • 74. Motivation: Semantic Similarity Queries: Building blocks needed - cosineDistance(a, b), which takes vectors a and b and returns their cosine distance - vec(token), which takes a token and returns its associated vector - Token e, which declares a variable e that can be bound to tokens - contains(row, entity), which states that entity must be bound to a token generated by tokenizing row
  • 75. Motivation: Semantic Similarity Queries: Question - Find the customer most similar (semantically) to a potential customer, by industry.
  • 76-77. Semantic Similarity Queries: Find the customer most similar (semantically) to a potential customer, by industry: SELECT c.name FROM customer c, potential_customer pc WHERE c.id < pc.id ORDER BY cosineDistance(vec(c.industry), vec(pc.industry)) DESC LIMIT 1 Why do we need c.id < pc.id? In order to avoid duplication. What change is needed to avoid non-similar customers? Adding a filter on the proximity to the WHERE clause.
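The ranking part of this query can be imitated outside the database with a short Python sketch. Everything concrete here is an assumption for illustration: the three-dimensional vectors are invented (real ones would come from a trained embedding model such as word2vec [4]), the score is treated as a similarity where larger means closer (matching the DESC ordering in the query), and `min_score` plays the role of the extra proximity filter.

```python
import math

# Hypothetical word vectors, one per industry token.
VEC = {
    "banking":   (0.9, 0.1, 0.0),
    "insurance": (0.8, 0.3, 0.1),
    "farming":   (0.0, 0.2, 0.9),
}

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar_customer(customers, potential_industry, min_score=0.0):
    """Name of the customer whose industry vector scores highest, or None."""
    target = VEC[potential_industry]
    scored = [(cosine(VEC[industry], target), name)
              for name, industry in customers]
    scored = [(score, name) for score, name in scored if score >= min_score]
    return max(scored)[1] if scored else None

customers = [("Acme", "farming"), ("Bolt", "banking")]
print(most_similar_customer(customers, "insurance"))  # Bolt
```

Raising `min_score` makes the function return nothing when even the best match is too far away, which is the effect the proximity filter has in the SQL version.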
  • 78. Motivation: Analogies: Find all pairs of products a, b that relate to each other as peanut butter relates to jelly
  • 79. Analogies: Let's break the solution into the following steps: 1. Create a table with product names and their distance vector 2. Create a table with product names and the cosine distance between the distance vector and the peanut butter/jelly vector 3. Find all pairs of products a, b that relate to each other as peanut butter relates to jelly
  • 80. Analogies: 1. Create a table with product names and their distance vector: CREATE TABLE products_distance AS SELECT p1.id AS p_name_1, p2.id AS p_name_2, vec(p1.description) - vec(p2.description) AS dist_vec FROM products p1, products p2 WHERE p1.id < p2.id;
  • 81. Analogies: 2. Create a table with product names and the cosine distance between the distance vector and the peanut butter/jelly vector: CREATE TABLE products_complemantry_distance AS SELECT p_name_1, p_name_2, cosineDistance(dist_vec, vec('peanut_butter') - vec('jelly')) AS compl_dist FROM products_distance;
  • 82. Analogies: 3. Find all pairs of products a, b that relate to each other as peanut butter relates to jelly: SELECT p_name_1, p_name_2 FROM (SELECT p_name_1, p_name_2, RANK() OVER (PARTITION BY p_name_1 ORDER BY compl_dist ASC) AS rnk FROM products_complemantry_distance) ranked WHERE rnk = 1;
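The same three steps can be walked through in Python on toy data. The two-dimensional vectors below are invented for illustration, the score is treated as a similarity (the best pair is the one whose difference vector points most nearly in the same direction as the peanut butter/jelly difference), and ordered pairs are used so that direction matters.

```python
import math
from itertools import permutations

# Hypothetical 2-d product vectors.
VEC = {
    "peanut_butter": (1.0, 0.0),
    "jelly":         (0.0, 1.0),
    "burger":        (2.0, 0.5),
    "fries":         (1.0, 1.5),
    "shampoo":       (0.0, 3.0),
}

def sub(a, b):
    """Component-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def best_analogy(products, a="peanut_butter", b="jelly"):
    """Ordered pair (p, q) whose difference vector best matches vec(a) - vec(b)."""
    target = sub(VEC[a], VEC[b])
    return max(permutations(products, 2),
               key=lambda pair: cosine(sub(VEC[pair[0]], VEC[pair[1]]), target))

print(best_analogy(["burger", "fries", "shampoo"]))  # ('burger', 'fries')
```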
  • 83. Motivation: Schemaless Navigation: Find all tickets of user “moshe”, given an unknown fuzzy foreign key between tickets and users
  • 84. Schemaless Navigation: Find all tickets of user “moshe”, given an unknown fuzzy foreign key between tickets and users: SELECT users.*, tickets.* FROM users, Token e1, e2 INNER JOIN tickets ON contains(users.email, e1) AND contains(tickets.*, e2) AND cosineDistance(e1, e2) > 0.5 WHERE users.name = 'moshe'
  • 85. References
    1. Wang, W. (2008). Similarity Join Algorithms: An Introduction. Retrieved from http://www.cse.unsw.edu.au/~weiw/project/tutorial-simjoin-SEBD08.pdf.
    2. Chaudhuri, S., Ganti, V., Kaushik, R. (2006). A Primitive Operator for Similarity Joins in Data Cleaning. Proceedings of the 22nd International Conference on Data Engineering, p. 5, April 03-07, 2006.
    3. Xiao, C., Wang, W., Lin, X., Yu, J. X., Wang, G. (2011). Efficient Similarity Joins for Near Duplicate Detection. ACM Trans. Datab. Syst. V, N.
    4. Mikolov, T. word2vec: Tool for computing continuous distributed representations of words. https://code.google.com/p/word2vec.
    5. Ganesan, K. (2015, November). What is text similarity? [Blog post]. Retrieved from http://kavita-ganesan.com/what-is-text-similarity/#.Wppog5NuYv88.
    6. Shmueli, O., & Bordawekar, R. (2016, March). Enabling Cognitive Intelligence Queries in Relational Databases using Low-dimensional Word Embeddings. Retrieved from https://arxiv.org/abs/1603.07185.