This document discusses top-k string similarity search and proposes a clustered trie-based approach. Previous work used tries or q-grams with edit distance metrics. The proposed method clusters similar strings and constructs a primary trie of cluster centers and secondary tries of cluster contents. It finds pivot entries between tries to iteratively expand the search. Evaluation shows the clustered approach outperforms others with higher k values and is more robust to prefix/suffix additions. Challenges include scaling to large datasets and skewed clustering.
3. Top-k String Similarity Search
• Given a collection of strings and a query string,
return the k strings most similar to the query,
ranked by edit distance.
• Ex:
▫ Search “shout” with k=5
▫ scout, shoot, short, shot, spout
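The example above can be reproduced with a brute-force baseline. This is only a minimal sketch: the names `edit_distance` and `top_k_brute_force` and the small word list are illustrative assumptions, and ties are broken alphabetically, which the slides do not specify.

```python
import heapq

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def top_k_brute_force(strings, query, k):
    """Return the k strings closest to query (ties broken alphabetically)."""
    return heapq.nsmallest(k, strings, key=lambda s: (edit_distance(s, query), s))

# Hypothetical toy collection, not one of the datasets used in the evaluation.
words = ["scout", "shoot", "short", "shot", "spout", "shelter", "zebra"]
print(top_k_brute_force(words, "shout", 5))
# prints ['scout', 'shoot', 'short', 'shot', 'spout']
```

Every query scans the whole collection here, which is the cost the index-based methods on the following slides try to avoid.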
4. Related Works
• Search by q-gram (Z. Yang et al.)
▫ Preprocess the string collection into inverted lists
of q-grams
▫ Given a query string, calculate its q-gram frequencies.
Retrieve the top-k results based on q-grams and some
distance metric
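A rough sketch of the inverted-list idea follows. It is not the method of Yang et al.: ranking candidates by the number of distinct shared q-grams is a simplifying assumption, and the names `qgrams`, `build_inverted_lists`, and `candidates_by_shared_qgrams` are hypothetical.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    """All overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_lists(strings, q=2):
    """Map each q-gram to the ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return index

def candidates_by_shared_qgrams(strings, index, query, k, q=2):
    """Rank strings by how many distinct query q-grams they share (assumed metric)."""
    counts = Counter()
    for g in set(qgrams(query, q)):
        for sid in index.get(g, ()):
            counts[sid] += 1
    ranked = sorted(counts, key=lambda sid: (-counts[sid], strings[sid]))
    return [strings[sid] for sid in ranked[:k]]

# Hypothetical toy collection.
words = ["shout", "scout", "zebra"]
index = build_inverted_lists(words)
print(candidates_by_shared_qgrams(words, index, "shout", 2))
```

Strings that share no q-gram with the query never appear in any visited list, so they are skipped without computing a distance.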
6. Related Works
• Search with threshold (Z. Zhang et al.)
▫ Order the dictionary by string length, then
alphabetically
▫ Similar strings tend to be close together in this
ordered dictionary
▫ However, some similar strings may still be scattered
across distant positions
▫ Divide the query string into n-grams and search for
them in a high-dimensional space
Ex: database -> “da”, “at”, “ta”, “ab”, …
7. Related Works
• Similarity join (J. Wang et al.)
▫ Given two sets of strings, find the pairs of strings,
one from each set, that are similar
Ex: Given {kobe, ebay, …} and {bag, koby}, return
<kobe, koby>
▫ Top-k search is a special case of similarity join
in which one of the input sets contains only one string
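The join in the example can be written as a nested loop with an edit-distance threshold. This is a naive sketch rather than the algorithm of Wang et al.; the function names and the threshold of 1 are assumptions chosen so that the slide's example pair is the only match.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity_join(left, right, max_distance):
    """All pairs (l, r) with edit distance at most max_distance."""
    return [(l, r) for l in left for r in right
            if edit_distance(l, r) <= max_distance]

print(similarity_join(["kobe", "ebay"], ["bag", "koby"], 1))
# prints [('kobe', 'koby')]

# A single-query search is the same call with a one-element left set.
print(similarity_join(["shout"], ["scout", "zebra"], 1))
# prints [('shout', 'scout')]
```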
8. Related Works
• Top-k similarity search by trie (J. Wang et al.)
▫ Construct a trie for the input set
▫ Search the trie in order of increasing edit distance
▫ Definition:
Pivot entry <n, j, nc>
Node nc is a child of node n
ED(nc, q[1, j+1]) != ED(n, q[1, j])
10. Trie-based
• After substitution (increase j and go down)
▫ <n0, 0, n21>
▫ to <n21, 1, n22>
▫ <n1, 1, n2>
▫ to <n2, 2, n3>
▫ …
11. Trie-based
• After insertion (go down)
▫ <n0, 0, n21>
▫ to <n21, 0, n22>
▫ <n1, 1, n11>
▫ to <n11, 1, n12>
▫ and <n11, 1, n16>
Node n16 matches the
rest of the query (“rajit”)
Add “surajit” to the result
12. Trie-based
• After deletion (increase j)
▫ <n0, 0, n21>
▫ to <n0, 1, n21>
▫ <n1, 1, n2>
▫ to <n1, 2, n2>
▫ …
13. Trie-based
• Apply substitution, insertion, and deletion to
E0 to extend it to E1 (finding strings with ED=1 on
the fly)
• Repeat the extension from Ei to Ei+1 until k
results are found
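The loop "raise the edit distance until k results are found" can be sketched on a plain trie. This version does not maintain pivot entries: as a simplifying assumption it restarts from the root for each threshold and carries one dynamic-programming row per trie node. All names (`TrieNode`, `build_trie`, `search_within`, `top_k_trie`) are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.word = None     # set when a stored string ends at this node

def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root

def search_within(root, query, tau):
    """Return {word: distance} for stored words with edit distance <= tau."""
    found = {}
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if node.word is not None and row[-1] <= tau:
            found[node.word] = row[-1]
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1, row[j] + 1, row[j - 1] + cost))
            if min(new_row) <= tau:   # prune subtrees that cannot match
                stack.append((child, new_row))
    return found

def top_k_trie(strings, query, k):
    """Grow the threshold 0, 1, 2, ... until at least k strings are found."""
    root = build_trie(strings)
    max_tau = max([len(query)] + [len(s) for s in strings])
    for tau in range(max_tau + 1):
        found = search_within(root, query, tau)
        if len(found) >= k or tau == max_tau:
            return sorted(found, key=lambda w: (found[w], w))[:k]
    return []

# Hypothetical toy collection.
words = ["scout", "shoot", "short", "shot", "spout", "shelter", "zebra"]
print(top_k_trie(words, "shout", 5))
```

Restarting at each threshold repeats work that the pivot-entry extension avoids, so this sketch only shows the control flow, not the efficiency of the incremental method.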
14. Trie-based
• A more advanced version uses a range variable to
cover several pivot entries at once
▫ <n1, 1, n2>, <n1, 1, n6>, <n1, 1, n11> can be
shortened to <1, 5, j, d>:
▫ the strings with ids 1 to 5 share
pivot entries at depth d,
for the query substring
starting at index j
15. Our Method
• Inspired by the trie-based approach
• Similar strings are still scattered around the trie
▫ symmetry and asymmetry
▫ shout and scout
• Solution: apply clustering so that similar strings are
grouped and only cluster centers stay in the primary trie
16. Clustering
Function cluster(S){
  map<string, vector<string>> clusters;
  while(S.length > 0){
    s ← randomly select a string from S
    T ← find strings in S within edit-distance one of s (including s)
    clusters[s] = T;
    erase the strings of T from S
  }
  return clusters;
}
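A runnable sketch of the clustering step, under a few assumptions not stated on the slide: the center is counted as a member of its own cluster (so every round removes at least one string), duplicates are dropped first, and a fixed seed makes the random choice repeatable. The helper `within_one_edit` is hypothetical and checks edit distance at most one without a full dynamic program.

```python
import random

def within_one_edit(a, b):
    """True if a and b are at edit distance 0 or 1."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter string
    i = 0
    while i < len(a) and a[i] == b[i]:   # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # at most one substitution
    return a[i:] == b[i + 1:]            # exactly one extra character in b

def cluster(strings, seed=0):
    """Greedy clustering: pick a random center, absorb its 1-edit neighbors."""
    rng = random.Random(seed)
    remaining = sorted(set(strings))
    clusters = {}
    while remaining:
        center = rng.choice(remaining)
        members = [s for s in remaining if within_one_edit(center, s)]
        clusters[center] = members       # always contains the center itself
        absorbed = set(members)
        remaining = [s for s in remaining if s not in absorbed]
    return clusters

# Hypothetical toy collection.
words = ["shout", "scout", "shot", "short", "shoot", "zebra"]
for center, members in cluster(words).items():
    print(center, members)
```

Each round scans all remaining strings, so the cost grows quadratically when most clusters are small.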
17. Clustered Top-k Search
Function search(clusters, query, k){
  construct primary trie from the centers of clusters
  construct secondary tries sTrie[i] from each cluster i
  R = {};
  ActiveCenters = {};
  d = 0;
  while(R.size < k){
    if(d == 0)
      ActiveCenters ← find initial pivot entries(trie, query)
    else
      ActiveCenters ← ActiveCenters ∪ expand pivot entries(trie, query)
    end if
    for each center string i in ActiveCenters{
      R = R ∪ find strings within edit-distance d of query in sTrie[i]
    }
    d++;
  }
  return R;
}
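The control flow of the clustered search can be sketched without tries. Several things here are assumptions rather than the authors' implementation: linear scans with `edit_distance` stand in for the primary and secondary tries, a center is treated as active once it is within d + 1 of the query (members lie within one edit of their center, so the triangle inequality guarantees no member within d is missed), and k is capped at the number of stored strings so the loop terminates.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def clustered_top_k(clusters, query, k):
    """clusters maps each center to its members (each within one edit of it)."""
    total = len({s for members in clusters.values() for s in members})
    k = min(k, total)
    center_dist = {c: edit_distance(c, query) for c in clusters}
    results = {}                          # string -> distance to query
    d = 0
    while len(results) < k:
        # Assumed activation rule: center within d + 1 of the query.
        active = [c for c, dist in center_dist.items() if dist <= d + 1]
        for c in active:
            for s in clusters[c]:
                if s not in results:
                    dist = edit_distance(s, query)
                    if dist <= d:
                        results[s] = dist
        d += 1
    return sorted(results, key=lambda s: (results[s], s))[:k]

# Hypothetical clusters; every member is within one edit of its center.
clusters = {
    "shout": ["shout", "scout", "shot"],
    "short": ["short", "shoot"],
    "zebra": ["zebra"],
}
print(clustered_top_k(clusters, "shout", 3))
# prints ['shout', 'scout', 'shoot']
```

Clusters whose center is far from the query are never opened at small d, which is the intended saving of the two-level layout.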
19. Evaluation
• Dataset A
▫ Around 100,000 common English words
• Dataset B
▫ Around 200,000 words
▫ Dataset A plus words with an added suffix (e.g., dog, dogs)
• Dataset C
▫ Around 200,000 words
▫ Dataset A plus words with an added prefix (e.g., top, atop)
• Queries
▫ Randomly select 100 words from the dataset
20. CPU Time
[Figure: two plots of CPU time (s), axis range 0 to 45, against result size k = 1, 3, 5, 10, 25, 50, 100, 200, 400]
▫ CPU Time on Dataset A (series: Range, Cluster)
▫ CPU Time on Dataset B (suffix) (series: DP, Range)
21. CPU Time
[Figure: two plots of CPU time (s) against result size k = 1, 3, 5, 10, 25, 50, 100, 200, 400]
▫ CPU Time on Dataset A, axis range 0 to 45 (series: Range, Cluster)
▫ CPU Time on Dataset C (prefix), axis range 0 to 80 (series: Range, Cluster)
22. Discussion
• With higher k, our method outperformed the
previous method
• Adding suffixed words doesn’t affect the
performance of the previous method
• However, adding prefixed words decreases its
performance, because prefixed words are scattered
across different positions in the trie
27. Challenge and Future Work
• Dataset
▫ If the dataset is too big, it does not fit in main
memory
▫ If the dataset is too small, the search has to reach
large edit distances to find k results and becomes very slow
• Clustering
▫ Clustering the data takes a lot of time
▫ The resulting clusters are highly skewed: many
of them contain only one string
28. Task Breakdown
• Chiao-Meng Huang
▫ Implemented range-based top-k string similarity search
▫ Implemented our proposed method
• Guanghao Peng
▫ Paper survey (search by threshold)
▫ Drafting paper
▫ Parsing and preparing dataset
• Liwen Hu
▫ Paper survey (search by q-gram)
▫ Drafting and finalizing our paper
▫ Implemented baseline edit-distance methods (including dynamic
programming, progressive, and pivot-entry-based top-k search)
• Qing Hu
▫ Paper survey (similarity join)
▫ Drafting paper
▫ Parsing and preparing dataset