TOP-k String Similarity Search
Chiao-Meng Huang
Guanghao Peng
Liwen Hu
Qing Hu
Motivation
Top-k String Similarity Search
• Given a collection of strings and a query string,
return the k strings with the smallest edit distance
to the query.
• Example:
▫ Search “shout” with k=5
▫ scout, shoot, short, shot, spout
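The problem statement above can be sketched as a brute-force baseline: compute the edit distance from the query to every string and keep the k smallest. This is only an illustration, not the method of the slides; the names `edit_distance` and `top_k`, the sample word list, and the alphabetical tie-break are assumptions.

```python
import heapq

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def top_k(query, strings, k):
    """Return the k strings closest to query; ties are broken alphabetically (an assumption)."""
    return heapq.nsmallest(k, strings, key=lambda s: (edit_distance(query, s), s))

words = ["scout", "shoot", "short", "shot", "spout", "shouting", "table"]
print(top_k("shout", words, 5))
# ['scout', 'shoot', 'short', 'shot', 'spout']
```

Every query scans the whole collection, which is the cost the index-based approaches below try to avoid.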
Related Works
• Search by q-gram (Z. Yang et al.)
▫ Preprocess the string collection into inverted lists
of q-grams
▫ Given a query string, compute its q-gram frequencies.
Retrieve the top-k results based on the q-grams and a
distance metric
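A minimal sketch of the inverted-list idea above, assuming q=2 and ranking candidates by the number of distinct query q-grams they share. The slides do not name the metric, so this scoring, along with the names `qgrams`, `build_index`, and `rank_by_shared_grams`, is an assumption for illustration only.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    """Inverted lists: q-gram -> set of ids of the strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return index

def rank_by_shared_grams(query, strings, index, k, q=2):
    """Rank strings by how many distinct query q-grams they contain (assumed metric)."""
    counts = Counter()
    for g in set(qgrams(query, q)):
        for sid in index.get(g, ()):
            counts[sid] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], strings[kv[0]]))
    return [strings[sid] for sid, _ in ranked[:k]]

strings = ["shout", "scout", "table"]
index = build_index(strings)
print(rank_by_shared_grams("shout", strings, index, 2))
# ['shout', 'scout']
```

Strings that share no q-gram with the query are never touched, which is where the inverted lists save work.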
Related Works
• Search with threshold (Z. Zhang et al.)
▫ Order the dictionary by string length and then
alphabetically.
▫ Similar strings tend to be close together in this
ordered dictionary
▫ Some similar strings may still be scattered across
different positions
▫ Divide the query string into n-grams and search for
them in a high-dimensional space
▪ Ex: database -> “da”, “at”, “ta”, “ab”, …
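The two preprocessing steps above can be sketched as follows, assuming n=2 to match the “database” example. The function names are hypothetical, and the high-dimensional search itself is not shown.

```python
def ngrams(s, n=2):
    """Split a string into overlapping n-grams."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def order_dictionary(words):
    """Sort by length first, then alphabetically."""
    return sorted(words, key=lambda w: (len(w), w))

print(ngrams("database"))
# ['da', 'at', 'ta', 'ab', 'ba', 'as', 'se']
print(order_dictionary(["shout", "shot", "scout", "shoot", "spout"]))
# ['shot', 'scout', 'shoot', 'shout', 'spout']
```

In the ordered output “shoot” and “shout” are adjacent, but “shot” lands in a different length group, which illustrates the scattering mentioned above.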
Related Works
• Similarity join (J. Wang et al.)
▫ Given two sets of strings, find the pairs of similar
strings with one string from each set
▪ Ex: Given {kobe, ebay, …} and {bag, koby}, return
<kobe, koby>
▫ Top-k search is a special case of similarity join
in which one of the input sets contains only one string
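A naive nested-loop version of the join above, assuming “similar” means edit distance at most a threshold `tau`. The threshold and the function names are assumptions; real similarity-join algorithms avoid comparing every pair.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity_join(left, right, tau):
    """All pairs (a, b) with a in left, b in right, and ED(a, b) <= tau."""
    return [(a, b) for a in left for b in right if edit_distance(a, b) <= tau]

print(similarity_join(["kobe", "ebay"], ["bag", "koby"], 1))
# [('kobe', 'koby')]
```

Passing a single-element list as `left` turns the join into a threshold search for one query string.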
Related Works
• Top-k similarity search by trie (J. Wang et al.)
▫ Construct a trie for the input set
▫ Search the trie in order of increasing edit distance
▫ Definition:
▪ Pivot entry <n, j, nc>
▪ Node nc is a child of node n
▪ ED(nc, q[1, j+1]) != ED(n, q[1, j])
Trie-based
• Given query q=“srajit”
• E0
▫ <n0, 0, n21>
▫ <n1, 1, n2>
▫ <n1, 1, n6>
▫ <n1, 1, n11>
Trie-based
• After substitution (increase j and go down)
▫ <n0, 0, n21>
▫ to <n21, 1, n22>
▫ <n1, 1, n2>
▫ to <n2, 2, n3>
▫ …
Trie-based
• After insertion (go down)
▫ <n0, 0, n21>
▫ to <n21, 0, n22>
▫ <n1, 1, n11>
▫ to <n11, 1, n12>
▫ and <n11, 1, n16>
▪ Node n16 matches the
rest of the query (“rajit”)
▪ Add “surajit” to the result
Trie-based
• After deletion (increase j)
▫ <n0, 0, n21>
▫ to <n0, 1, n21>
▫ <n1, 1, n2>
▫ to <n1, 2, n2>
▫ …
Trie-based
• Apply substitution, insertion, and deletion to
E0 to extend it to E1 (finding strings with ED=1 on
the fly)
• Keep extending Ei to Ei+1 until k results
are found
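The idea of searching a trie by increasing edit distance can be sketched as below. This is a simplified stand-in, not the pivot-entry algorithm: it reruns a threshold search for d = 0, 1, 2, … and shares one DP row per trie node, instead of extending the entries E_i incrementally. `TrieNode`, `build_trie`, `search_within`, `top_k_trie`, and the `max_dist` cap are all assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.word = None    # set when a stored string ends at this node

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search_within(root, query, max_dist):
    """All (distance, word) pairs with ED(word, query) <= max_dist."""
    results = []

    def visit(node, ch, prev_row):
        # prev_row[j] is ED(parent's prefix, query[:j]); build the row for this node.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        if node.word is not None and row[-1] <= max_dist:
            results.append((row[-1], node.word))
        if min(row) <= max_dist:  # otherwise no descendant can qualify
            for c, child in node.children.items():
                visit(child, c, row)

    first_row = list(range(len(query) + 1))
    for c, child in root.children.items():
        visit(child, c, first_row)
    return results

def top_k_trie(root, query, k, max_dist=8):
    """Raise the threshold until k results are found (or max_dist is reached)."""
    found = []
    for d in range(max_dist + 1):
        found = sorted(search_within(root, query, d))
        if len(found) >= k:
            break
    return [w for _, w in found[:k]]

root = build_trie(["scout", "shoot", "short", "shot", "spout", "shouter", "surajit"])
print(top_k_trie(root, "shout", 5))
print(search_within(root, "srajit", 1))
```

Restarting at each threshold repeats work that the incremental extension from E_i to E_i+1 is designed to avoid; the sketch only shows the search order.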
Trie-based
• A more advanced version uses a range to
cover several pivot entries at once
▫ <n1, 1, n2>, <n1, 1, n6>, <n1, 1, n11> can be
shortened to <1, 5, j, d>:
▫ Strings with ids 1 to 5 are
pivot entries at depth d,
matched against the query substring
starting at index j
Our Method
• Inspired by the trie-based approach
• Similar strings are still scattered around the trie
▫ symmetry and asymmetry
▫ shout and scout
• Solution: apply clustering so that similar strings
are removed and represented by one cluster center
Clustering
Function cluster(S){
  map<string, vector<string>> clusters;
  while(S.length > 0){
    s ← randomly select a string from S
    T ← strings in S within edit distance 1 of s (including s)
    clusters[s] = T;
    erase the strings in T from S
  }
  return clusters;
}
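A runnable sketch of the clustering pseudocode above. It assumes the center itself belongs to its cluster, so every iteration removes at least one string, and it uses a seeded random generator for repeatability. The `seed` parameter and the helper `edit_distance` are assumptions, not part of the slides.

```python
import random

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cluster(strings, seed=0):
    """Map each randomly chosen center to the remaining strings within 1 edit of it."""
    rng = random.Random(seed)
    remaining = set(strings)
    clusters = {}
    while remaining:
        center = rng.choice(sorted(remaining))  # sorted only to make the choice repeatable
        members = {s for s in remaining if edit_distance(center, s) <= 1}
        clusters[center] = sorted(members)      # always contains the center itself
        remaining -= members
    return clusters

for center, members in cluster(["shout", "scout", "shoot", "table"]).items():
    print(center, members)
```

Each pass compares the center against every remaining string, so the cost grows quickly with the dataset size.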
Clustered Top-k Search
Function search(clusters, query, k){
  construct primary trie Trie from the centers of clusters
  construct secondary tries sTrie[i] from each cluster i
  R = {};
  ActiveCenters = {};
  d = 0;
  while(R.size < k){
    if(d == 0)
      ActiveCenters ← find initial pivot entries(Trie, query)
    else
      ActiveCenters ← ActiveCenters ∪ expand pivot entries(Trie, query)
    end if
    for each center string i in ActiveCenters{
      R = R ∪ find strings within edit-distance d of query in sTrie[i]
    }
    d++;
  }
  return R;
}
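A simplified sketch of the two-level search above, with plain scans in place of the primary and secondary tries. It assumes every cluster member is within one edit of its center; by the triangle inequality, a string at distance d from the query then has a center at distance d-1, d, or d+1, which decides the active centers for each round. This activation rule, the `max_dist` cap, and the function names are assumptions and may differ from the pivot-entry expansion used in the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def clustered_top_k(clusters, query, k, max_dist=8):
    """clusters maps center -> members (center included), each within 1 edit of the center."""
    center_dist = {c: edit_distance(query, c) for c in clusters}
    results = []
    for d in range(max_dist + 1):
        # Only clusters whose center is within 1 of distance d can hold a string at distance d.
        active = [c for c, cd in center_dist.items() if abs(cd - d) <= 1]
        level = sorted(s for c in active for s in clusters[c]
                       if edit_distance(query, s) == d)
        results.extend(level)
        if len(results) >= k:
            break
    return results[:k]

clusters = {
    "shoot": ["shoot", "shoots", "shout"],
    "shouter": ["shoute", "shouter", "shouters"],
    "shorter": ["shoter", "shorter", "shortier"],
}
print(clustered_top_k(clusters, "shout", 3))
# ['shout', 'shoot', 'shoute']
```

Clusters whose center is far from the query are skipped entirely in early rounds, which is the saving the clustering is meant to provide.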
Clustered Top-k Search
Query: shout
Distance | Active centers | Cluster shoot | Cluster shouter | Cluster shorter
0        |                | shout         |                 |
1        | shoot          | shoot         | shoute          |
2        | shouter        | shoots        | shouter         | shoter
3        | shorter        |               | shouters        | shorter
4        |                |               |                 | shortier
Evaluation
• Dataset A
▫ Around 100,000 common English words
• Dataset B
▫ Around 200,000 words
▫ Dataset A plus words with an added suffix (e.g., dog, dogs)
• Dataset C
▫ Around 200,000 words
▫ Dataset A plus words with an added prefix (e.g., top, atop)
• Queries
▫ 100 words randomly selected from the dataset
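A measurement loop for the setup above might look like the following sketch. The slides do not say how CPU time was measured, so `time.process_time`, the fixed seed, and the `time_queries` helper are assumptions; `search` stands for any function that takes a query string.

```python
import random
import time

def time_queries(search, dataset, num_queries=100, seed=0):
    """Total CPU seconds spent running `search` on randomly sampled query words."""
    rng = random.Random(seed)
    queries = rng.sample(dataset, min(num_queries, len(dataset)))
    start = time.process_time()
    for q in queries:
        search(q)
    return time.process_time() - start

words = ["scout", "shoot", "short", "shot", "spout"]
# Placeholder search: rank by length difference only, just to exercise the harness.
elapsed = time_queries(lambda q: sorted(words, key=lambda w: abs(len(w) - len(q)))[:3],
                       words, num_queries=3)
print(f"{elapsed:.6f} s")
```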
CPU Time
[Figure: CPU Time on Dataset A. x-axis: Size K (1 to 400); y-axis: CPU Time (s); series: Range, Cluster]
[Figure: CPU Time on Dataset B (suffix). x-axis: Size K (1 to 400); y-axis: CPU Time (s); series: DP, Range]
CPU Time
[Figure: CPU Time on Dataset A. x-axis: Size K (1 to 400); y-axis: CPU Time (s); series: Range, Cluster]
[Figure: CPU Time on Dataset C (prefix). x-axis: Size K (1 to 400); y-axis: CPU Time (s); series: Range, Cluster]
Discussion
• With higher k, our method outperformed the
previous method
• Adding suffixed words doesn’t affect the
performance of the previous method
• However, adding prefixed words decreases
performance, because they are scattered
across different positions in the trie
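The prefix effect above can be illustrated by measuring how many trie nodes two words share, which equals the length of their common prefix. The helper name `shared_trie_depth` is hypothetical; the word pairs are the examples from the dataset descriptions.

```python
def shared_trie_depth(a, b):
    """Number of trie nodes below the root that two words share (common-prefix length)."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth

print(shared_trie_depth("dog", "dogs"))  # 3: the suffixed word extends the same path
print(shared_trie_depth("top", "atop"))  # 0: the prefixed word starts a different branch
```

Both pairs are one edit apart, yet only the suffixed pair stays on the same trie path.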
Entries
[Figure: # of Entries on A. x-axis: Size K (1 to 400); y-axis: # of Entries; series: Cluster, Range]
Time to Expand
[Figure: Average Time to Expand Pivot Entries. x-axis: xth entry (0 to 10); y-axis: CPU Time (s); series: Range, Cluster]
[Figure: Average Time to Expand Pivot Entries (Cluster). x-axis: xth entry (0 to 10); y-axis: CPU Time (s); series: Primary, Secondary]
Scalability Study
[Figure: CPU Time with Different Dataset Size. x-axis: Size K (1 to 400); y-axis: CPU Time (s); series: 12500, 25000, 50000, 100000]
Clustering Study
[Figure: CPU Time with Different # of Cluster Centers. x-axis: Size K (1 to 400); y-axis: CPU Time (s); series: 56335, 61347, 70957, 71036]
Challenge and Future Work
• Dataset
▫ If the dataset is too big, we don’t have enough main
memory to hold it
▫ If the dataset is too small, the search tends to find results
with large edit distances and becomes very slow
• Clustering
▫ Clustering the data takes a lot of time
▫ The resulting clusters are highly skewed: many
of them contain only one string
Task Breakdown
• Chiao-Meng Huang
▫ Implemented range-based top-k string similarity search
▫ Implemented our proposed method
• Guanghao Peng
▫ Paper survey (search by threshold)
▫ Drafting paper
▫ Parsing and preparing dataset
• Liwen Hu
▫ Paper survey (search by q-gram)
▫ Drafting and finalizing our paper
▫ Implemented baseline edit-distance metric (including dynamic
programming, progressive, and pivot-entry-based top-k search)
• Qing Hu
▫ Paper survey (similarity join)
▫ Drafting paper
▫ Parsing and preparing dataset
