Top-k String Similarity Search with Clustering

Top-k String Similarity Search
Chiao-Meng Huang
Guanghao Peng
Liwen Hu
Qing Hu

Top-k String Similarity Search
• Given a collection of strings and a query string,
return the k strings with the smallest edit distance
to the query.
• Ex:
▫ Search “shout” with k=5
▫ scout, shoot, short, shot, spout
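
The problem statement above can be illustrated with a brute-force baseline. This is only a minimal sketch, assuming edit distance means Levenshtein distance (unit-cost insertion, deletion, substitution) and that ties are broken alphabetically; the function names are hypothetical and not from the slides.

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def top_k_bruteforce(strings, query, k):
    """Scan every string; ties are broken alphabetically (an assumption)."""
    return heapq.nsmallest(
        k, strings, key=lambda s: (edit_distance(s, query), s))


words = ["scout", "shoot", "short", "shot", "spout", "shorter", "database"]
print(top_k_bruteforce(words, "shout", 5))
```

Every string is compared against the query, so the cost grows linearly with the collection size; the methods on the following slides try to avoid that full scan.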

Related Works
• Search by q-gram (Z. Yang et al.)
▫ Preprocess the string collection into inverted lists
of q-grams
▫ Given a query string, compute its q-gram frequencies.
Retrieve the top-k results based on shared q-grams and
a distance metric
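
A rough sketch of the inverted-list idea follows. It assumes q=2 and ranks candidates simply by the number of distinct q-grams shared with the query; the cited work's actual ranking metric is not given on the slide, so that ranking rule and all function names here are assumptions.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """All overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Inverted lists: q-gram -> ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index[g].append(sid)
    return index


def candidates_by_shared_grams(index, strings, query, k, q=2):
    """Rank strings by how many distinct q-grams they share with the query."""
    counts = Counter()
    for g in set(qgrams(query, q)):
        for sid in index.get(g, ()):
            counts[sid] += 1
    ranked = sorted(counts, key=lambda sid: (-counts[sid], strings[sid]))
    return [strings[sid] for sid in ranked[:k]]


strings = ["scout", "shoot", "database", "short"]
index = build_index(strings)
print(candidates_by_shared_grams(index, strings, "shout", 3))
```

Strings that share no q-gram with the query are never touched, which is the point of the preprocessing step.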

Related Works
• Search with threshold (Z. Zhang et al.)
▫ Order the dictionary by string length, then
alphabetically
▫ Similar strings tend to be close together in this
ordered dictionary
▫ Some similar strings may still be scattered across
different positions
▫ Divide the query string into n-grams and search for
them in a high-dimensional space
▪ Ex: database -> “da”, “at”, “ta”, “ab”, …
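
The ordering step can be sketched as below. This illustrates only the length-then-alphabetical ordering, using the fact that two strings within edit distance tau differ in length by at most tau; it does not reproduce the n-gram search of the cited work. The names `order_dictionary`, `length_window`, and `tau` are hypothetical.

```python
import bisect


def order_dictionary(strings):
    """Sort by length first, then alphabetically."""
    return sorted(strings, key=lambda s: (len(s), s))


def length_window(ordered, query, tau):
    """Slice of the ordered dictionary whose lengths are within tau of the query.

    Any string within edit distance tau of the query must lie in this slice,
    because each edit changes the length by at most one.
    """
    lengths = [len(s) for s in ordered]
    lo = bisect.bisect_left(lengths, len(query) - tau)
    hi = bisect.bisect_right(lengths, len(query) + tau)
    return ordered[lo:hi]


ordered = order_dictionary(
    ["database", "shot", "shout", "scout", "at", "shorter"])
print(ordered)
print(length_window(ordered, "shout", 1))
```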

Related Works
• Similarity join (J. Wang et al.)
▫ Given two sets of strings, find the pairs of similar
strings with one string from each set
▪ Ex: Given {kobe, ebay, …} and {bag, koby}, return
<kobe, koby>
▫ Top-k search is a special case of similarity join
in which one of the input sets contains a single string
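
A naive nested-loop version of the join makes the special-case remark concrete. This is a sketch only: it assumes "similar" means edit distance at most a threshold `tau` (the slide does not define it), and it is not the cited join algorithm.

```python
def edit_distance(a, b):
    """Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity_join(left, right, tau):
    """All pairs (x, y), x from left and y from right, within distance tau."""
    return [(x, y) for x in left for y in right
            if edit_distance(x, y) <= tau]


print(similarity_join(["kobe", "ebay"], ["bag", "koby"], 1))
# Searching for one query is a join whose left side is a single string.
print(similarity_join(["shout"], ["scout", "database"], 1))
```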

Related Works
• Top-k similarity search by trie (J. Wang et al.)
▫ Construct a trie over the input set
▫ Search the trie in order of increasing edit distance
▫ Definition:
▪ Pivot entry <n, j, nc>
▪ Node nc is a child of node n
▪ ED(nc, q[1, j+1]) != ED(n, q[1, j])

Trie-based
• Given query q=“srajit”
• E0
▫ <n0, 0, n21>
▫ <n1, 1, n2>
▫ <n1, 1, n6>
▫ <n1, 1, n11>

Trie-based
• After substitution (increase j and go down)
▫ <n0, 0, n21>
▫ to <n21, 1, n22>
▫ <n1, 1, n2>
▫ to <n2, 2, n3>
▫ …

Trie-based
• After insertion (go down)
▫ <n0, 0, n21>
▫ to <n21, 0, n22>
▫ <n1, 1, n11>
▫ to <n11, 1, n12>
▫ and <n11, 1, n16>
▪ Node n16 matches the
rest of the query (“rajit”)
▪ Add “surajit” to the result

Trie-based
• After deletion (increase j)
▫ <n0, 0, n21>
▫ to <n0, 1, n21>
▫ <n1, 1, n2>
▫ to <n1, 2, n2>
▫ …

Trie-based
• Apply substitution, insertion, and deletion to
E0 to extend it to E1 (finding strings with ED=1 on
the fly)
• Keep extending Ei to Ei+1 until k results are
found
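
The idea of searching a trie by increasing edit distance can be sketched in a simplified form. Note the substitution: instead of extending pivot entries from Ei to Ei+1 incrementally, this sketch reruns a threshold search from the root for tau = 0, 1, 2, … and carries one DP row per trie node. All class and function names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set when a stored string ends at this node


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def search_within(root, query, tau):
    """Return {word: distance} for stored words with edit distance <= tau."""
    found = {}
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if node.word is not None and row[-1] <= tau:
            found[node.word] = row[-1]
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,
                                   row[j] + 1,
                                   row[j - 1] + cost))
            if min(new_row) <= tau:  # prune subtrees that cannot match
                stack.append((child, new_row))
    return found


def top_k_trie(strings, query, k):
    root = build_trie(strings)
    max_tau = max([len(query)] + [len(s) for s in strings])
    found = {}
    for tau in range(max_tau + 1):
        found = search_within(root, query, tau)
        if len(found) >= k:
            break
    return sorted(found, key=lambda w: (found[w], w))[:k]


words = ["scout", "shoot", "short", "shot", "spout", "shorter", "database"]
print(top_k_trie(words, "shout", 5))
```

Restarting from the root for each threshold repeats work that the pivot-entry extension on the slides is designed to avoid; the sketch trades that efficiency for brevity.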

Trie-based
• A more advanced version uses a range variable to
cover several pivot entries at once
▫ <n1, 1, n2>, <n1, 1, n6>, <n1, 1, n11> can be
shortened to <1, 5, j, d>:
▫ Strings with ids 1 to 5 are
pivot entries at depth d,
matched against the query substring
starting at index j

Our Method
• Inspired by the trie-based approach
• Similar strings are still scattered around the trie
▫ symmetry and asymmetry
▫ shout and scout
• Solution: apply clustering to group similar
strings together

Clustering
Function cluster(S){
map<string, vector<string>> clusters;
while(S.size > 0){
s ← randomly select a string from S
T ← strings in S within edit distance 1 of s (excluding s)
clusters[s] = T;
erase s and all strings of T from S
}
return clusters;
}
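
A runnable sketch of this clustering step is given below, under a few assumptions: the center itself is removed from the pool along with its neighbours, duplicates are dropped, and a fixed `seed` parameter (hypothetical) makes the random choice repeatable. Each round scans the whole pool, so the cost is quadratic in the number of strings.

```python
import random


def edit_distance(a, b):
    """Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def cluster(strings, seed=0):
    """Map each randomly chosen center to the strings within 1 edit of it."""
    rng = random.Random(seed)
    remaining = sorted(set(strings))
    clusters = {}
    while remaining:
        # Pick a random center and take it out of the pool.
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [s for s in remaining if edit_distance(s, center) <= 1]
        clusters[center] = members
        taken = set(members)
        remaining = [s for s in remaining if s not in taken]
    return clusters


words = ["shout", "scout", "shoot", "shoots", "shorter", "shoter", "database"]
clusters = cluster(words)
for center, members in clusters.items():
    print(center, members)
```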

Clustered Top-k Search
Function search(clusters, query, k){
construct primary trie Trie from the centers of clusters
construct secondary tries sTrie[i] from each cluster i
R = {};
ActiveCenters = {};
d = 0;
while(R.size < k){
if(d == 0)
ActiveCenters ← find initial pivot entries(Trie, query)
else
ActiveCenters ← ActiveCenters ∪ expand pivot entries(Trie, query)
end if
for each center string i in ActiveCenters{
R = R ∪ find strings within edit distance d of query in sTrie[i]
}
d++;
}
return R;
}
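
The control flow of the clustered search can be sketched without tries. This is an illustration under stated assumptions, not the implementation behind the slides: linear scans stand in for the primary and secondary tries, and a center is treated as active at distance d when it lies within d + 1 of the query. That activation rule is an assumption derived from the triangle inequality, since every member is within 1 edit of its center.

```python
def edit_distance(a, b):
    """Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def clustered_top_k(clusters, query, k):
    """clusters maps a center to the strings within 1 edit of that center."""
    cache = {}

    def dist(s):
        if s not in cache:
            cache[s] = edit_distance(s, query)
        return cache[s]

    results = {}
    # No stored string can be farther from the query than this bound.
    max_d = max([len(query)] + [len(c) for c in clusters]) + 1
    for d in range(max_d + 1):
        for center, members in clusters.items():
            # A member within d of the query forces its center within d + 1.
            if dist(center) > d + 1:
                continue
            for s in [center] + members:
                if dist(s) <= d:
                    results[s] = dist(s)
        if len(results) >= k:
            break
    return sorted(results, key=lambda s: (results[s], s))[:k]


clusters = {
    "shoot": ["shoots", "shot"],
    "shouter": ["shoute", "shouters"],
    "shorter": ["shoter", "shortier"],
}
print(clustered_top_k(clusters, "shout", 3))
```

Clusters whose center is too far from the query are skipped entirely at small d, which is where the saving over a single large trie would come from.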

Clustered Top-k Search
Query: shout
Distance  Active Centers  shoot   shouter   shorter
0         shout
1         shoot           shoot   shoute
2         shouter         shoots  shouter   shoter
3         shorter                 shouters  shorter
4                                           shortier

Evaluation
• Dataset A
▫ Around 100,000 common English words
• Dataset B
▫ Around 200,000 words
▫ Dataset A plus words with an added suffix (dog, dogs)
• Dataset C
▫ Around 200,000 words
▫ Dataset A plus words with an added prefix (top, atop)
• Queries
▫ Randomly select 100 words from the dataset

CPU Time
[Figure: CPU Time on Dataset A. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s); series: Range, Cluster]
[Figure: CPU Time on Dataset B (suffix). x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s); series: DP, Range]

CPU Time
[Figure: CPU Time on Dataset A. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s); series: Range, Cluster]
[Figure: CPU Time on Dataset C (prefix). x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s); series: Range, Cluster]

Discussion
• With higher k, our method outperforms the
previous method
• Adding suffixed words doesn’t affect the
performance of the previous method
• However, adding prefixed words decreases its
performance, because prefixed words are scattered
across different positions in the trie

Entries
[Figure: # of Entries on A. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: # of Entries; series: Cluster, Range]

Time to Expand
[Figure: Average Time to Expand Pivot Entries. x-axis: xth entry (0 to 10); y-axis: CPU Time (s); series: Range, Cluster]
[Figure: Average Time to Expand Pivot Entries (Cluster). x-axis: xth entry (0 to 10); y-axis: CPU Time (s); series: Primary, Secondary]

Scalability Study
[Figure: CPU Time with Different Dataset Size. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s); series (dataset size): 12500, 25000, 50000, 100000]

Clustering Study
[Figure: CPU Time with Different # of Cluster Centers. x-axis: Size K (1, 3, 5, 10, 25, 50, 100, 200, 400); y-axis: CPU Time (s); series (# of cluster centers): 56335, 61347, 70957, 71036]

Challenge and Future Work
• Dataset
▫ If the dataset is too big, we don’t have enough main
memory to hold it
▫ If the dataset is too small, the search tends to find
results with large edit distances and becomes very slow
• Clustering
▫ Clustering the data takes a lot of time
▫ The resulting clusters are highly skewed: many
of them contain only one string

Task Breakdown
• Chiao-Meng Huang
▫ Implemented range-based top-k string similarity search
▫ Implemented our proposed method
• Guanghao Peng
▫ Paper survey (search by threshold)
▫ Drafting paper
▫ Parsing and preparing dataset
• Liwen Hu
▫ Paper survey (search by q-gram)
▫ Drafting and finalizing our paper
▫ Implemented baseline edit-distance metrics (including dynamic
programming, progressive, and pivot-entry-based top-k search)
• Qing Hu
▫ Paper survey (similarity join)
▫ Drafting paper
▫ Parsing and preparing dataset
