Set Similarity Search using a Distributed Prefix Tree Index

Oct 4th, 2017
Fabian Fier
Prof. Johann-Christoph Freytag, Ph.D.

Problem Statement: Set Similarity Search
• Input
• A set of records R
• each consisting of a token set
• A search record s
• A similarity function sim
• A similarity threshold t
• Output
• All pairs of records where sim(r,s) ≥ t (r ∈ R)

Example: Jaccard Similarity Function
$sim(r, s) = \frac{|r \cap s|}{|r \cup s|} = \frac{3}{8}$
[Figure: Venn diagram of the token sets r and s]
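The Jaccard function above can be sketched in a few lines of Python. The function name `jaccard` and the two token sets are hypothetical; the sets are chosen only so that they share 3 tokens and have 8 tokens in their union, as in the slide's example value.

```python
def jaccard(r, s):
    """Jaccard similarity |r ∩ s| / |r ∪ s| of two token collections."""
    r, s = set(r), set(s)
    if not r and not s:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(r & s) / len(r | s)


# Hypothetical token sets: 3 shared tokens, 8 tokens in the union.
r = {"a", "b", "c", "d", "e"}
s = {"c", "d", "e", "f", "g", "h"}
print(jaccard(r, s))  # 0.375, i.e. 3/8
```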

Approaches for Set Similarity Search
• Naive: compute similarity for each element in R
• Use Indexes (distributed):
• Inverted Index
• Optimization: filters
• New Approach: Prefix Tree (Trie)

Inverted Index (1)
Build an inverted index {[token, {recordId}]}
Records:
r1: a b e
r2: a d e
r3: b c d e f g
r4: b c d f g
r5: b d f g
Inverted index:
a: r1, r2
b: r1, r3, r4, r5
c: r3, r4
d: r2, r3, r4, r5
e: r1, r2, r3
f: r3, r4, r5
g: r3, r4, r5
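As a minimal sketch, the index above can be built in plain Python as follows. This is not the distributed ECL implementation; the name `build_inverted_index` is an assumption made for illustration, and the records are the five example records from this slide.

```python
from collections import defaultdict

RECORDS = {
    "r1": ["a", "b", "e"],
    "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"],
    "r5": ["b", "d", "f", "g"],
}


def build_inverted_index(records):
    """Map every token to the sorted list of ids of records containing it."""
    index = defaultdict(list)
    for rid in sorted(records):          # sorted ids give sorted posting lists
        for token in set(records[rid]):  # a token counts once per record
            index[token].append(rid)
    return dict(index)


index = build_inverted_index(RECORDS)
for token in sorted(index):
    print(token, index[token])
```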

Inverted Index (2)
Probe the index
• Get the inverted lists for each token of s
• Count record ID frequencies (=overlap) and calculate the similarities
Query: s = c d f g, t = 0.8
Inverted lists for the tokens of s:
c: r3, r4
d: r2, r3, r4, r5
f: r3, r4, r5
g: r3, r4, r5
Candidates (record: overlap → similarity):
r2: 1 → 1/6
r3: 4 → 4/6
r4: 4 → 4/5
r5: 3 → 3/5
Result: (r4, s)
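The probe step can be sketched in Python as follows: fetch the inverted list of every query token, count how often each record id occurs (the overlap), and derive the Jaccard similarity from the overlap and the two set sizes. The function names are hypothetical, and the sketch is single-node rather than distributed.

```python
from collections import Counter, defaultdict

RECORDS = {
    "r1": ["a", "b", "e"],
    "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"],
    "r5": ["b", "d", "f", "g"],
}


def build_inverted_index(records):
    index = defaultdict(list)
    for rid in sorted(records):
        for token in set(records[rid]):
            index[token].append(rid)
    return index


def probe(index, records, s, t):
    """Return (overlap per candidate, similar records with their similarity)."""
    s = set(s)
    overlap = Counter()
    for token in s:
        overlap.update(index.get(token, []))   # one count per shared token
    result = {}
    for rid, o in overlap.items():
        union = len(set(records[rid])) + len(s) - o   # |r ∪ s| = |r| + |s| - |r ∩ s|
        if o / union >= t:
            result[rid] = o / union
    return overlap, result


index = build_inverted_index(RECORDS)
counts, result = probe(index, RECORDS, ["c", "d", "f", "g"], 0.8)
print(dict(counts))
print(result)
```

With the example query, every record that shares at least one token with s becomes a candidate, which is why the length filter on the next slide is worthwhile.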

Inverted Index (3)
• Optimization:
• Only documents with a similar length can be similar
• Add length to the index and use it to shrink the candidate set
Records:
r1: a b e
r2: a d e
r3: b c d e f g
r4: b c d f g
r5: b d f g
Inverted index {[token, length, {recordId}]} (new: length):
a 3 r1, r2
b 3 r1
b 4 r5
b 5 r4
… … …
Query: s = c d f g, t = 0.8, length 4 or 5
Matching index entries:
c 5 r4
d 4 r5
d 5 r4
f 4 r5
… … …
Candidates (only two candidates; record: overlap → similarity):
r4: 4 → 4/5
r5: 3 → 3/5
Result: (r4, s)
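For the Jaccard function, the length filter follows from sim(r, s) ≤ min(|r|, |s|) / max(|r|, |s|), which gives t·|s| ≤ |r| ≤ |s|/t. The sketch below keys the index by (token, length) and probes only the admissible lengths. The function names are hypothetical, and the threshold is passed as an exact `Fraction` (an assumption of this sketch) so that the rounding at the bounds is not affected by floating-point error.

```python
import math
from collections import Counter, defaultdict
from fractions import Fraction

RECORDS = {
    "r1": ["a", "b", "e"],
    "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"],
    "r5": ["b", "d", "f", "g"],
}


def length_bounds(query_len, t):
    """Jaccard length filter: t * |s| <= |r| <= |s| / t."""
    return math.ceil(t * query_len), math.floor(query_len / t)


def build_length_index(records):
    """Inverted index keyed by (token, record length)."""
    index = defaultdict(list)
    for rid in sorted(records):
        tokens = set(records[rid])
        for token in tokens:
            index[(token, len(tokens))].append(rid)
    return index


def probe(index, s, t):
    s = set(s)
    lo, hi = length_bounds(len(s), t)
    overlap, length_of = Counter(), {}
    for token in s:
        for length in range(lo, hi + 1):       # only admissible lengths
            for rid in index.get((token, length), []):
                overlap[rid] += 1
                length_of[rid] = length
    result = [rid for rid, o in overlap.items()
              if Fraction(o, length_of[rid] + len(s) - o) >= t]
    return dict(overlap), sorted(result)


T = Fraction(4, 5)  # threshold t = 0.8 as an exact fraction
index = build_length_index(RECORDS)
candidates, result = probe(index, ["c", "d", "f", "g"], T)
print(length_bounds(4, T))  # (4, 5)
print(candidates, result)
```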

Prefix Tree (1)
• Inspired by Charles Kaminski's approach (prefix trees for edit distance (ED) similarity search)
→ Our goal: find similar records with the Jaccard similarity function
1. Build the prefix tree
Records:
r1: a b e
r2: a d e
r3: b c d e f g
r4: b c d f g
r5: b d f g
Prefix tree (each node: tokens and (min, max) path length):
root
  a (3,3)
    b e (3,3) → r1
    d e (3,3) → r2
  b (4,6)
    d f g (4,4) → r5
    c d (5,6)
      e f g (6,6) → r3
      f g (5,5) → r4
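A minimal in-memory sketch of the build step is shown below. It deviates from the slide in two labeled ways: each node holds a single token (the slide draws compressed nodes such as "d f g"), and the global token order is assumed to be alphabetical. The names `Node`, `build_prefix_tree` and `dump` are hypothetical.

```python
import math

RECORDS = {
    "r1": ["a", "b", "e"],
    "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"],
    "r5": ["b", "d", "f", "g"],
}


class Node:
    def __init__(self):
        self.children = {}         # token -> Node
        self.min_len = math.inf    # shortest record length below this node
        self.max_len = 0           # longest record length below this node
        self.record_ids = []       # records that end at this node


def build_prefix_tree(records):
    root = Node()
    for rid, tokens in records.items():
        tokens = sorted(set(tokens))   # assumption: alphabetical token order
        node = root
        for token in tokens:
            node = node.children.setdefault(token, Node())
            node.min_len = min(node.min_len, len(tokens))
            node.max_len = max(node.max_len, len(tokens))
        node.record_ids.append(rid)
    return root


def dump(node, depth=0):
    """Print the tree with one token per line, indented by level."""
    for token, child in node.children.items():
        ids = " -> " + ", ".join(child.record_ids) if child.record_ids else ""
        print("  " * depth + f"{token} ({child.min_len},{child.max_len}){ids}")
        dump(child, depth + 1)


root = build_prefix_tree(RECORDS)
dump(root)
```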

Prefix Tree (2)
2. Probe the tree
• Start at the root of the tree and follow all paths
• For each path:
• Discard subtrees which fail the length filter
• Compare the query tokens with the node tokens and count all mismatches
• If there are too many mismatches, discard this path or subtree
Query: s = c d f g, t = 0.8
→ length 4 or 5
→ allowed mismatches: 0 (length 4), 1 (length 5)
Probe steps (m = mismatches so far):
root
  1. a (3,3): too short
    b e (3,3) → r1
    d e (3,3) → r2
  2. b (4,6): m: 1 (b)
    6. d f g (4,4) → r5: m: 2 (b, c), too many mismatches
    3. c d (5,6): m: 1
      4. e f g (6,6) → r3: too long
      5. f g (5,5) → r4: m: 1, similar
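The probe can be sketched as a traversal that carries the number of mismatches along each path. In this sketch a mismatch is a token that occurs in exactly one of the two sets, so a record r is similar exactly when its total mismatch count is at most |r| + |s| - 2·ceil(t·(|r| + |s|) / (1 + t)); for |s| = 4 and t = 0.8 this yields 0 for length 4 and 1 for length 5, matching the slide. The one-token-per-node tree, the alphabetical token order, the exact `Fraction` threshold and all names are assumptions of this sketch, not the authors' ECL code.

```python
import math
from fractions import Fraction

RECORDS = {
    "r1": ["a", "b", "e"],
    "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"],
    "r5": ["b", "d", "f", "g"],
}


class Node:
    def __init__(self):
        self.children, self.record_ids = {}, []
        self.min_len, self.max_len = math.inf, 0


def build(records):
    root = Node()
    for rid, tokens in records.items():
        tokens = sorted(set(tokens))       # assumption: alphabetical order
        node = root
        for token in tokens:
            node = node.children.setdefault(token, Node())
            node.min_len = min(node.min_len, len(tokens))
            node.max_len = max(node.max_len, len(tokens))
        node.record_ids.append(rid)
    return root


def allowed_mismatches(rec_len, query_len, t):
    """Largest symmetric difference that still permits Jaccard >= t."""
    min_overlap = math.ceil(t * (rec_len + query_len) / (1 + t))
    return rec_len + query_len - 2 * min_overlap


def probe(root, s, t):
    s = sorted(set(s))
    lo, hi = math.ceil(t * len(s)), math.floor(len(s) / t)   # length filter
    result = []
    stack = [(root, 0, 0, 0)]   # (node, next query position, mismatches, depth)
    while stack:
        node, pos, m, depth = stack.pop()
        for token, child in node.children.items():
            first, last = max(lo, child.min_len), min(hi, child.max_len)
            if first > last:
                continue                   # subtree too short or too long
            p, mm = pos, m
            while p < len(s) and s[p] < token:
                p, mm = p + 1, mm + 1      # query token missing on this path
            if p < len(s) and s[p] == token:
                p += 1                     # node token matches the query
            else:
                mm += 1                    # node token missing in the query
            budget = max(allowed_mismatches(n, len(s), t)
                         for n in range(first, last + 1))
            if mm > budget:
                continue                   # too many mismatches
            total = mm + len(s) - p        # unmatched query tokens also count
            if child.record_ids and total <= allowed_mismatches(depth + 1, len(s), t):
                result.extend(child.record_ids)
            stack.append((child, p, mm, depth + 1))
    return sorted(result)


print(probe(build(RECORDS), ["c", "d", "f", "g"], Fraction(4, 5)))  # ['r4']
```

Pruning is safe because tokens along a path are ascending: a mismatch counted at a node stays a mismatch for every record in that subtree.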

Implementation of the Prefix Tree (1)
1. Build the prefix tree:
• Result: INDEX which contains all prefix tree nodes
• Key: parent node id
• Payload: own node id, min. and max. path length, record id (or 0) and
is_record (boolean)
2. Probe the tree: Breadth-first search with LOOP and JOIN

• Remarks
1. Token orders
• All records must have the same token order → Which one?
• The token order influences the shape of the prefix tree
→ We experimented with different token orders
2. Level number in the prefix tree
• Each JOIN in the LOOP joins a new level from the tree with queries
→ We add an integer "level" to all tree nodes and change the index key to parent_id and level
→ We add "RIGHT.level = COUNTER" to the JOIN condition
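The level-wise traversal can be imitated in Python with a dictionary keyed by (parent_id, level), where the loop counter takes the role of COUNTER in the JOIN condition. This sketch only shows the traversal pattern, without the length filter and mismatch counting; the row layout, the function names, the one-token-per-node tree and the assumption of non-empty records are all illustrative.

```python
from collections import defaultdict

RECORDS = {
    "r1": ["a", "b", "e"],
    "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"],
    "r5": ["b", "d", "f", "g"],
}


def build_node_table(records):
    """Flatten the prefix tree into rows indexed by (parent_id, level)."""
    ids = {(): 0}                  # path prefix -> node id, the root has id 0
    rows = {}                      # node id -> row
    index = defaultdict(list)      # (parent_id, level) -> child rows
    for rid, tokens in sorted(records.items()):
        tokens = tuple(sorted(set(tokens)))   # assumes non-empty records
        for level in range(1, len(tokens) + 1):
            path = tokens[:level]
            if path not in ids:
                ids[path] = len(ids)
                row = {"node_id": ids[path], "token": path[-1],
                       "record_id": None, "is_record": False}
                rows[ids[path]] = row
                index[(ids[path[:-1]], level)].append(row)
        rows[ids[tokens]].update(record_id=rid, is_record=True)
    return index


def breadth_first(index):
    """One index lookup per parent and level; `counter` is part of the key."""
    frontier, found, counter = [0], [], 1
    while frontier:
        next_frontier = []
        for parent_id in frontier:
            for row in index.get((parent_id, counter), []):
                if row["is_record"]:
                    found.append((counter, row["record_id"]))
                next_frontier.append(row["node_id"])
        frontier, counter = next_frontier, counter + 1
    return found


print(breadth_first(build_node_table(RECORDS)))
```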

Experiments and Results
• Datasets
• Flickr (253 MB), DBLP (685 MB), Enron (1.0 GB), Netflix (1.1 GB), CSX (3.5 GB)
• US Patent Data from 2005 (9.5 GB) and 2010 (16.5 GB)
• Queries
• 100 records from the original dataset
• Token orders
• Least frequent to most frequent
• Most frequent to least frequent
• Random
• Cluster configuration
• 6 Thor nodes with 3 Thor slaves per node
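The three token orders listed above can be sketched as a global ranking of tokens by document frequency. The mode names `inc`, `dec` and `ran` follow the labels used in the result plots; the function names, the alphabetical tie-breaking and the seeded shuffle are assumptions of this sketch.

```python
import random
from collections import Counter

RECORDS = {
    "r1": ["a", "b", "e"],
    "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"],
    "r5": ["b", "d", "f", "g"],
}


def token_order(records, mode, seed=0):
    """Return a dict token -> rank for one of the three global token orders."""
    freq = Counter(tok for tokens in records.values() for tok in set(tokens))
    tokens = sorted(freq)                        # alphabetical tie-breaking
    if mode == "inc":                            # least frequent first
        tokens.sort(key=lambda tok: freq[tok])
    elif mode == "dec":                          # most frequent first
        tokens.sort(key=lambda tok: -freq[tok])
    elif mode == "ran":                          # random but reproducible
        random.Random(seed).shuffle(tokens)
    else:
        raise ValueError(mode)
    return {tok: rank for rank, tok in enumerate(tokens)}


def reorder(records, rank):
    """Sort the tokens of every record by the same global rank."""
    return {rid: sorted(set(tokens), key=rank.__getitem__)
            for rid, tokens in records.items()}


print(reorder(RECORDS, token_order(RECORDS, "inc"))["r3"])
print(reorder(RECORDS, token_order(RECORDS, "dec"))["r3"])
```

Every record must be sorted by the same ranking before the tree is built; otherwise shared prefixes would not line up.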

Result 1: Token Order has Significant Influence on Query Runtime
• Least frequent tokens at the beginning (inc)
• Tree is wide
• Most frequent tokens at the beginning (dec)
• Tree is deep
[Figure: sketches of a wide and a deep prefix tree over the records r1 to r8]
[Figure: runtime in s vs. threshold in % (60 to 100) for DBLP and Enron; series: inc, dec, ran]

Result 2: Tree Level as Additional Index Key
[Figure: runtime in s vs. threshold in % (60 to 100) for DBLP inc, DBLP dec and DBLP ran; series: normal, level]

Result 3: Comparing (Prefix) Inverted Indexes to Prefix Trees
• Prefix inverted indexes are better for high thresholds
• Normal inverted indexes are better for low thresholds
[Figure: runtime in s vs. threshold in % (60 to 100) for DBLP and Enron; series: prefixtree_best, inverted_index, prefix_inverted_index]

Result 4: Stop Word Removal Important for Big Datasets
• The patent datasets contain stopwords which appear in almost every record
• We removed the most frequent 0.075% of the tokens
• Average record length has been reduced to 44% (2005) and 40% (2010)
[Figure: token distribution of the most frequent tokens (1 %), frequency vs. rank, for 2005 and 2010; the 0.075% cut-off is marked; annotation: 99% of the tokens appear only 400 times or less]
Result 4: Stop Word Removal Important for Big Datasets
[Figure: runtime in s for the 2005 and 2010 patent datasets at t = 0.95; bars: with stopwords, with stopwords and level as key, without stopwords (0.075%), without stopwords (0.075%) and level as key; annotation: timeout]

Thank you!
Questions?

Backup

Implementation of the Inverted Index (1)
Build
• Extract tokens and length for each record with NORMALIZE
• Combine the record ids to record id sets for each token and length with ROLLUP
• BUILD an INDEX with token and length as a key and the record id set as payload
Example for NORMALIZE:
r1 a b e
→
a 3 r1
b 3 r1
e 3 r1

NORMALIZE(inputDS,
          COUNT(LEFT.token_set),
          getTidCntRid(LEFT, COUNTER))

Example for ROLLUP:
a 3 r1
a 3 r2
(distributed and sorted)
→
a 3 r1, r2

ROLLUP(tupelDS,
       LEFT.token = RIGHT.token AND
       LEFT.cnt = RIGHT.cnt,
       combineRids(LEFT, RIGHT),
       LOCAL)

Implementation of the Inverted Index (2)
Probe
• Read the index and the query records
• Use PROJECT to find the similarity pairs for each query
• PROJECT(queryDS, findSimPairs(LEFT))
• TRANSFORM function findSimPairs:
• JOIN the query token and the inverted index with the conditions LEFT.token = RIGHT.token and the length filter to find all candidates
• Extract all candidate record ids with NORMALIZE and count them with TABLE
• Calculate all similarities with PROJECT and SKIP all candidates which are not similar
Example:
Query: s = c d f g
JOIN with the index:
c 5 r4
d 4 r5
d 5 r4
… … …
NORMALIZE, TABLE (record, length, overlap):
r4 5 4
r5 4 3
PROJECT with SKIP → result: (r4, s)

Inverted Index with Prefix Filtering
• Idea:
• Two documents can only be similar if their prefixes share at least one token!
• Approach:
• Create the inverted index with only the prefixes of R
→ Reduce the index size
• Search for candidates using only the prefix tokens of s
→ Should decrease the candidate set size
• Calculate the similarity for each candidate with the original documents
→ Need an additional access to the documents
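The approach can be sketched in Python under the commonly used Jaccard prefix length |r| - ceil(t·|r|) + 1. The slides do not state the formula, so treating it as what the authors index is an assumption, as are the function names, the alphabetical token order and the exact `Fraction` threshold. The sketch combines the prefix index with the length filter and then verifies each candidate against the original record.

```python
import math
from collections import defaultdict
from fractions import Fraction

RECORDS = {
    "r1": ["a", "b", "e"],
    "r2": ["a", "d", "e"],
    "r3": ["b", "c", "d", "e", "f", "g"],
    "r4": ["b", "c", "d", "f", "g"],
    "r5": ["b", "d", "f", "g"],
}


def prefix_length(n, t):
    """Jaccard prefix length for a record with n tokens: n - ceil(t * n) + 1."""
    return n - math.ceil(t * n) + 1


def build_prefix_index(records, t):
    """Inverted index over the prefix tokens of every record only."""
    index = defaultdict(set)
    for rid, tokens in records.items():
        tokens = sorted(set(tokens))       # assumption: alphabetical order
        for token in tokens[:prefix_length(len(tokens), t)]:
            index[token].add(rid)
    return index


def search(records, index, s, t):
    s = sorted(set(s))
    lo, hi = math.ceil(t * len(s)), math.floor(len(s) / t)   # length filter
    candidates = set()
    for token in s[:prefix_length(len(s), t)]:               # query prefix only
        candidates |= {rid for rid in index.get(token, ())
                       if lo <= len(set(records[rid])) <= hi}
    result = []
    for rid in sorted(candidates):         # verification needs the full record
        r = set(records[rid])
        overlap = len(r & set(s))
        if Fraction(overlap, len(r) + len(s) - overlap) >= t:
            result.append(rid)
    return sorted(candidates), result


T = Fraction(4, 5)
index = build_prefix_index(RECORDS, T)
candidates, result = search(RECORDS, index, ["c", "d", "f", "g"], T)
print(candidates, result)  # ['r4'] ['r4']
```

For the example query the prefix of s is the single token c, so only the posting list of c is read and r4 is the only candidate left after the length filter.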

Example
1. Build the inverted index {[token, length, {recordId}]}
Records:
r1: a b e
r2: a d e
r3: b c d e f g
r4: b c d f g
r5: b d f g
Index:
a 3 r1, r2
b 3 r1
b 4 r5
b 5 r4
… … …
2. Use the index to search
Query: s = c d f g, t = 0.8, length 4 or 5
Matching index entries:
c 5 r4
d 4 r5
d 5 r4
f 4 r5
… … …
Candidates (record: overlap → similarity):
r4: 4 → 4/5
r5: 3 → 3/5
(only one candidate)
Result: (r4, s)
Implementation of the Prefix Inverted Index
• Similar to the first version
• Changes:
• Build an additional INDEX for the input records with the record id as a key
and the tokens as payload
• Build the inverted index only for the prefixes
• Change the NORMALIZE expression from COUNT(LEFT.token_set)
to indexLength(COUNT(LEFT.token_set))
• Use PROJECT again to find the similarity pairs for each query and change the findSimPairs function
• JOIN only the query prefix with the index to get the candidate record ids
• Get the candidate records from the new index
• Verify the candidates with a new C++ function

1. Build the prefix tree:
• Result: INDEX which contains all prefix tree nodes
• Key: parent node id
• Payload: own node id, min. and max. path length, record id (or 0) and
is_record (boolean)
2. Probe the tree: Breadth-first search with LOOP and JOIN
LOOP(QueryDS,
     LEFT.is_record = false,
     EXISTS(ROWS(LEFT)) = true,
     JOIN(ROWS(LEFT), pt_index,
          LEFT.node_id = RIGHT.parent_id AND
          /* length filter */ AND
          LEFT.too_much = false,
          QueryPTTransform(LEFT, RIGHT), LIMIT(0), INNER))(too_much = false);
