Gwt presen alsip-20111201

ALSIP, Dec. 1 2011

Kernel-based similarity search
in massive graph databases
with wavelet trees
Yasuo Tabei and Koji Tsuda
JST ERATO Minato Project,
National Institute of Advanced Industrial
Science and Technology

Outline
• Overview
• Wavelet Tree
✓ Problem = Range intersection on array
• Graph similarity search
✓ Weisfeiler-Lehman kernel
✓ Apply wavelet tree

• Experiments
✓ Comparison to inverted index
✓ 25 million molecular graphs

Graph similarity search
• Similarity search for 25 million molecular
graphs
✓ Find all graphs whose similarity to the query 1
✓ Similarity = Weisfeiler-Lehman kernel (NIPS, 2009)

• Use data-structure called “Wavelet
Tree” (SODA, 2003)
✓ Self-index of an integer array
✓ Enable fast array operations
‣ e.g., range minimum query, range intersection

Range intersection on array
• Array A of length N, 1 Ai M

i j k "
A 1 3 6 8 2 5 7 1 2 7 4 5

• Range intersection: rint(A, [i,j],[k,])
✓ Find common elements of A[i,j] and A[k,]

• The naive method is to concatenate and sort
Ex) concatenate:6,8,2,2,7 ⇛ sort:2,2,7,6,8

• Use wavelet tree and solve the problem faster

Tree of subarrays:
Lower half=left, Higher half=right
[1,8]
1 3 6 8 2 5 7 1 7 2 4 5

[1,4] [5,8]
1 3 2 1 2 4 6 8 5 7 7 5

[1,2] [3,4] [5,6] [7,8]
1 2 1 2 3 4 6 5 5 8 7 7

1 1 2 2 3 4 5 5 6 7 7 8

Remember if each element is either
in lower half(0) or higher half(1)
[1,8]
0 0 1 1 0 1 1 0 1 0 0 1

[1,4] [5,8]
0 1 0 0 0 1 0 1 0 1 1 0

[1,2] [3,4] [5,6] [7,8]
0 1 0 1 0 1 1 0 0 1 0 0

1 2 3 4 5 6 7 8

Index each bit array
with a rank dictionary
• Using rank dictionary, the rank operation can be
performed in O(1) time
✓ rankc(B, i): return the number of c {0, 1} in B[1,i]

• Several methods known: rank9sel (Vigna, 08)
• Example) B=0110011100
i 1 2 3 4 5 6 7 8 9 10
rank1 (B, 8) = 5 011001110 0
rank0 (B, 5) = 3 011001110 0

O(1)-division of an interval
• Using the rank operation, the division of an
•

interval can be done in constant time
✓ rank0 for left child and rank1 for right child

• Naive = linear time to the total number of elements

[1,8]
Aroot 1 3 6 8 2 5 7 1 7 2 4 5

rank0 rank1
[1,4] [5,8]
Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5

Fast computation of rank
intersection by pruning
Pruned [1,8]
1 3 6 8 2 5 7 1 7 2 4 5

[1,4] [5,8]
1 3 2 1 2 4 6 8 5 7 7 5

[1,2] [3,4] [5,6] [6,8]
1 2 1 2 3 4 6 5 5 8 7 7

1 1 2 2 3 4 5 5 6 7 7 8

solution!!

Graph Similarity Search
• Bag-of-words representation of graph
✓ Weisfeiler-Lehman procedure (NIPS, 2009), Hido and
Kashima (ICDM, 2009), Wang et al., (EDBT, 2009)

W=(A,D,E,H)

• Consine similarity query
✓ Find all graphs W whose cosine similarity (kernel) to
the query Q is at least 1

Weisfeiler-Lehman Procedure (NIPS,09)
• Convert a graph into a set of words (bag-of-words)

Semi-conjunctive query
• Cosine similarity query can be relaxed to
the following form
W s.t. |W Q| k
✓ Find all graphs W which share at least k words
to the query Q

• No false negatives
• False positives can easily be ﬁltered out by
cosine calculations

Inverted index, Array, Wavelet Tree
• Inverted index is built from
graph database
• Concatenate all rows to make
•

an array
• Index the array with wavelet
•

tree
Aroot 1 3 6 8 2 5 7 1 2 7 4 5
• Semi-conjunctive query =
•

Extension of range intersection
Wavelet Tree
✓ Find graph ids which appear at
least k times in given intervals

Pruning search space
• Find all graphs W in the database whose cosine
to a query Q is larger than a threshold 1
|W Q|
W s.t. KN (W, Q) = 1
W Q
✓ W,Q: bag-of-words of graphs
• The above solution can be relaxed as follows
•

If KN (W, Q) 1 , then
|Q|
(1 ) |Q|
2
|W |
(1 )2
✓ Can be used for pruning search space

Complexity
• Time per query: O(τm)
• τ: the number of traversed nodes
• m: the number of bag-of-words in a query

• Memory: (1+α)N log n + M log N
• N: the number of all words in the database
• M: Maximum integer in the array
• n: the number of graphs
• α: overhead for rank dictionary (α=0.6)

• Inverted index takes Nlog n bits
• About 60% overhead to inverted index!

Outline
• Overview
• A data-structure
✓ Wavelet Tree
• Graph similarity search
✓ Weisfeiler-Lehman kernel
✓ Apply wavelet tree

• Experiments
✓ Comparison to inverted index
✓ 25 million molecular graphs

Experiments

• 25 million chemical compounds from PubChem
database
• Evaluate search time and memory usage
• Cosine threshold ε=0.3,0.35,0.4
• Compare our method gWT to
✓ Inverted index (concatenate all intervals and sort)
✓ Sequential scan (Compute similarity one by one)

Search time
40 sec
38 sec

8 sec
3 sec
2 sec

Construction time
7h

Summary
• Efﬁcient similarity search method of
massive graph databases
• Solve semi-conjunctive query efﬁciently
• Build on Wavelet Tree
• Use Weisfeiler-Lehman procedure to
represent graphs as bag-of-words
• Applicable to 25 million graphs
• Software
•

http://code.google.com/p/gwt

Gwt presen alsip-20111201

More Related Content

What's hot

Viewers also liked

Similar to Gwt presen alsip-20111201

Recently uploaded

Gwt presen alsip-20111201