- 1. ALSIP, Dec. 1 2011 Kernel-based similarity search in massive graph databases with wavelet trees Yasuo Tabei and Koji Tsuda JST ERATO Minato Project, National Institute of Advanced Industrial Science and Technology
- 2. Outline • Overview • Wavelet Tree ✓ Problem = Range intersection on array • Graph similarity search ✓ Weisfeiler-Lehman kernel ✓ Apply wavelet tree • Experiments ✓ Comparison to inverted index ✓ 25 million molecular graphs
- 3. Graph similarity search • Similarity search for 25 million molecular graphs ✓ Find all graphs whose similarity to the query is at least 1 - ε ✓ Similarity = Weisfeiler-Lehman kernel (NIPS, 2009) • Use a data structure called the “Wavelet Tree” (SODA, 2003) ✓ Self-index of an integer array ✓ Enables fast array operations ‣ e.g., range minimum query, range intersection
- 4. Range intersection on array • Array A of length N, 1 ≤ A_i ≤ M; positions i, j, k, ℓ mark two ranges ✓ A = 1 3 6 8 2 5 7 1 2 7 4 5 • Range intersection: rint(A, [i,j], [k,ℓ]) ✓ Find common elements of A[i,j] and A[k,ℓ] • The naive method is to concatenate and sort Ex) concatenate: 6,8,2,2,7 ⇛ sort: 2,2,6,7,8 • Use a wavelet tree to solve the problem faster
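The naive concatenate-and-sort method can be sketched as follows. This is only an illustration: the function name `rint_naive` is made up here, and the positions i=3, j=5, k=9, ℓ=10 are inferred from the slide's example (they are the ranges that yield 6,8,2 and 2,7), not stated in the source.

```python
def rint_naive(A, i, j, k, l):
    """Common elements of A[i..j] and A[k..l] (1-based, inclusive ranges)."""
    # Tag each element with the range it came from, then concatenate and sort.
    tagged = [(v, 0) for v in A[i - 1:j]] + [(v, 1) for v in A[k - 1:l]]
    tagged.sort()
    common = []
    pos = 0
    while pos < len(tagged):
        value = tagged[pos][0]
        tags = set()
        # Collect the tags of the run of equal values.
        while pos < len(tagged) and tagged[pos][0] == value:
            tags.add(tagged[pos][1])
            pos += 1
        if tags == {0, 1}:  # value occurs in both ranges
            common.append(value)
    return common

A = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
print(rint_naive(A, 3, 5, 9, 10))  # [2]
```

The cost is dominated by sorting the concatenated ranges, which is what the wavelet tree avoids.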
- 5. Tree of subarrays: lower half = left, higher half = right • [1,8]: 1 3 6 8 2 5 7 1 7 2 4 5 • [1,4]: 1 3 2 1 2 4 | [5,8]: 6 8 5 7 7 5 • [1,2]: 1 2 1 2 | [3,4]: 3 4 | [5,6]: 6 5 5 | [7,8]: 8 7 7 • Leaves: 1 1 | 2 2 | 3 | 4 | 5 5 | 6 | 7 7 | 8
- 6. Record whether each element falls in the lower half (0) or the higher half (1) • [1,8]: 0 0 1 1 0 1 1 0 1 0 0 1 • [1,4]: 0 1 0 0 0 1 | [5,8]: 0 1 0 1 1 0 • [1,2]: 0 1 0 1 | [3,4]: 0 1 | [5,6]: 1 0 0 | [7,8]: 1 0 0 • Leaves: 1 2 3 4 5 6 7 8
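A minimal sketch of how the per-node bit arrays could be derived from the array on the previous slides. The function name `build_bits` and the dictionary keyed by value range are assumptions made for illustration; a real wavelet tree would store the bits compactly rather than as Python lists.

```python
def build_bits(values, lo, hi, out=None):
    """Record, for the node covering values in [lo, hi], one bit per element:
    0 if the element is in the lower half, 1 if it is in the higher half."""
    if out is None:
        out = {}
    if lo == hi or not values:
        return out  # leaves (and empty nodes) store no bits
    mid = (lo + hi) // 2
    out[(lo, hi)] = [0 if v <= mid else 1 for v in values]
    # Children keep the original order of their elements.
    build_bits([v for v in values if v <= mid], lo, mid, out)
    build_bits([v for v in values if v > mid], mid + 1, hi, out)
    return out

A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
bits = build_bits(A, 1, 8)
for value_range, node_bits in bits.items():
    print(value_range, node_bits)
```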
- 7. Index each bit array with a rank dictionary • With a rank dictionary, the rank operation can be performed in O(1) time ✓ rank_c(B, i): return the number of occurrences of c ∈ {0, 1} in B[1,i] • Several methods are known, e.g., rank9sel (Vigna, 08) • Example) B = 0110011100 (positions i = 1, ..., 10) ✓ rank_1(B, 8) = 5 (ones in 01100111) ✓ rank_0(B, 5) = 3 (zeros in 01100)
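The rank operation can be illustrated with a plain prefix-count table. This is a simplification, not the rank9sel structure cited on the slide: it answers queries in O(1) but spends a full integer per position instead of a small fraction of extra space. The class name `RankBits` is hypothetical.

```python
class RankBits:
    """Bit array with O(1) rank via precomputed prefix counts (not succinct)."""

    def __init__(self, bits):
        self.ones = [0]  # ones[i] = number of 1s in B[1..i]
        for b in bits:
            self.ones.append(self.ones[-1] + b)

    def rank1(self, i):
        """Number of 1s in B[1..i] (1-based, inclusive)."""
        return self.ones[i]

    def rank0(self, i):
        """Number of 0s in B[1..i] (1-based, inclusive)."""
        return i - self.ones[i]

B = RankBits([int(c) for c in "0110011100"])
print(B.rank1(8), B.rank0(5))  # 5 3
```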
- 8. O(1)-division of an interval • Using the rank operation, an interval can be divided between the two children in constant time ✓ rank_0 for the left child and rank_1 for the right child • Naive = time linear in the total number of elements • Figure: [1,8] A_root = 1 3 6 8 2 5 7 1 7 2 4 5; rank_0 leads to [1,4] A_left = 1 3 2 1 2 4; rank_1 leads to [5,8] A_right = 6 8 5 7 7 5
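As a sketch of the constant-time division, the snippet below maps an interval of the root array to the corresponding intervals in the two children using only rank values. The helper names are made up, intervals are 1-based and inclusive by assumption, and the prefix table stands in for a real rank dictionary.

```python
from itertools import accumulate

def prefix_ones(bits):
    """ones[i] = number of 1s in bits[1..i] (1-based); built once per node."""
    return [0] + list(accumulate(bits))

def split_interval(ones, start, end):
    """Map the interval [start, end] of a node to (left, right) child intervals.

    An interval whose start exceeds its end is empty.
    """
    rank1 = lambda i: ones[i]
    rank0 = lambda i: i - ones[i]
    left = (rank0(start - 1) + 1, rank0(end))
    right = (rank1(start - 1) + 1, rank1(end))
    return left, right

# Root bit array for A_root = 1 3 6 8 2 5 7 1 7 2 4 5 (0 = value <= 4).
root_bits = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
ones = prefix_ones(root_bits)
# A_root[3..5] = 6 8 2 maps to A_left[3..3] = 2 and A_right[1..2] = 6 8.
print(split_interval(ones, 3, 5))  # ((3, 3), (1, 2))
```

Each call does four table lookups, independent of how many elements the interval contains.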
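- 9. Fast computation of range intersection by pruning • A subtree is pruned as soon as one of the query intervals becomes empty in it • [1,8]: 1 3 6 8 2 5 7 1 7 2 4 5 • [1,4]: 1 3 2 1 2 4 | [5,8]: 6 8 5 7 7 5 • [1,2]: 1 2 1 2 | [3,4]: 3 4 | [5,6]: 6 5 5 | [7,8]: 8 7 7 • Leaves: 1 1 | 2 2 | 3 | 4 | 5 5 | 6 | 7 7 | 8 • Figure marks the pruned subtrees and the leaf that is the solution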
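Putting the pieces together, range intersection with pruning might look like the sketch below. It is a pointer-based toy, not the authors' implementation: `Node` and `rint` are illustrative names, rank is answered from a prefix table, and the query ranges [3,5] and [9,10] are assumed for the example.

```python
from itertools import accumulate

class Node:
    """Wavelet tree node for values in [lo, hi]; stores prefix counts of its bits."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.ones = [0]
        self.left = self.right = None
        if lo == hi or not values:
            return
        mid = (lo + hi) // 2
        self.ones = [0] + list(accumulate(0 if v <= mid else 1 for v in values))
        self.left = Node([v for v in values if v <= mid], lo, mid)
        self.right = Node([v for v in values if v > mid], mid + 1, hi)

def rint(node, intervals, out):
    """Append to `out` every value present in all 1-based inclusive intervals."""
    # Prune: if any interval is empty here, no value below is common to all.
    if any(start > end for start, end in intervals):
        return
    if node.lo == node.hi:
        out.append(node.lo)  # reached a leaf with all intervals non-empty
        return
    ones = node.ones
    # rank0 maps positions to the left child, rank1 to the right child.
    left = [(s - ones[s - 1], e - ones[e]) for s, e in intervals]
    right = [(ones[s - 1] + 1, ones[e]) for s, e in intervals]
    rint(node.left, left, out)
    rint(node.right, right, out)

A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
root = Node(A, 1, 8)
result = []
rint(root, [(3, 5), (9, 10)], result)
print(result)  # [2]
```

Only root-to-leaf paths on which every interval stays non-empty are followed, so most of the tree is never visited.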
- 10. Outline • Overview • Wavelet Tree ✓ Problem = Range intersection on array • Graph similarity search ✓ Weisfeiler-Lehman kernel ✓ Apply wavelet tree • Experiments ✓ Comparison to inverted index ✓ 25 million molecular graphs
- 11. Graph Similarity Search • Bag-of-words representation of graph ✓ Weisfeiler-Lehman procedure (NIPS, 2009), Hido and Kashima (ICDM, 2009), Wang et al. (EDBT, 2009) ✓ e.g., W = (A, D, E, H) • Cosine similarity query ✓ Find all graphs W whose cosine similarity (kernel) to the query Q is at least 1 - ε
- 12. Weisfeiler-Lehman Procedure (NIPS,09) • Convert a graph into a set of words (bag-of-words)
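A rough sketch of turning a labeled graph into a bag of words by iterated neighborhood relabeling, in the spirit of the Weisfeiler-Lehman procedure. The string encoding of labels, the function name, and the tiny C-C-O example graph are assumptions for illustration; the slides do not specify how words are encoded.

```python
def wl_bag_of_words(adj, labels, iterations=1):
    """Return the sorted list of words of a labeled graph.

    adj: dict mapping node -> list of neighbor nodes.
    labels: dict mapping node -> initial string label.
    """
    current = dict(labels)
    words = list(current.values())  # iteration 0: the node labels themselves
    for _ in range(iterations):
        relabeled = {}
        for v in adj:
            # New label = own label plus the sorted labels of the neighbors.
            neighbors = sorted(current[u] for u in adj[v])
            relabeled[v] = current[v] + "(" + ",".join(neighbors) + ")"
        current = relabeled
        words.extend(current.values())
    return sorted(words)

# Hypothetical path graph C - C - O.
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}
print(wl_bag_of_words(adj, labels))
```

Because neighbor labels are sorted before concatenation, renumbering the nodes of the same graph yields the same bag of words.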
- 13. Semi-conjunctive query • Cosine similarity query can be relaxed to the following form: find W s.t. |W ∩ Q| ≥ k ✓ Find all graphs W which share at least k words with the query Q • No false negatives • False positives can easily be filtered out by cosine calculations
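The two-stage idea (semi-conjunctive candidates first, cosine filtering second) can be sketched over plain Python sets. The toy database, the names `semi_conjunctive` and `search`, and the choice k = 2 are hypothetical; bags of words are treated as non-empty sets.

```python
from math import sqrt

def cosine(W, Q):
    """Cosine similarity of two non-empty word sets."""
    return len(W & Q) / sqrt(len(W) * len(Q))

def semi_conjunctive(db, Q, k):
    """Graph ids whose word set shares at least k words with Q."""
    return sorted(gid for gid, W in db.items() if len(W & Q) >= k)

def search(db, Q, eps, k):
    """Filter the semi-conjunctive candidates by the exact cosine threshold."""
    candidates = semi_conjunctive(db, Q, k)
    return [gid for gid in candidates if cosine(db[gid], Q) >= 1 - eps]

db = {
    1: {"A", "B", "C", "D"},
    2: {"A", "B", "T", "U", "V", "X", "Y", "Z"},
    3: {"A", "E"},
}
Q = {"A", "B", "C", "D"}
print(semi_conjunctive(db, Q, k=2))  # [1, 2]
print(search(db, Q, eps=0.3, k=2))   # [1]
```

Graph 2 is a false positive of the relaxed query: it shares two words with Q, but its cosine is far below 0.7, so the second stage removes it.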
- 14. Inverted index, Array, Wavelet Tree • Inverted index is built from the graph database • Concatenate all rows to make an array • Index the array with a wavelet tree (A_root = 1 3 6 8 2 5 7 1 2 7 4 5) • Semi-conjunctive query = extension of range intersection ✓ Find graph ids which appear at least k times in the given intervals
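To make the layout concrete, the sketch below builds an inverted index, concatenates its rows into one array, and remembers the interval of each word. The query is then answered by naive counting over the intervals, which is the step the wavelet tree is meant to accelerate. All names and the three-graph database are made up for illustration.

```python
from collections import Counter

def build_array(db):
    """db: graph id -> set of words. Returns (array, intervals).

    array is the concatenation of all posting lists (rows of the inverted
    index); intervals maps each word to its 1-based inclusive range in array.
    """
    postings = {}
    for gid in sorted(db):
        for word in db[gid]:
            postings.setdefault(word, []).append(gid)
    array, intervals = [], {}
    for word in sorted(postings):
        start = len(array) + 1
        array.extend(postings[word])
        intervals[word] = (start, len(array))
    return array, intervals

def semi_conjunctive_naive(array, intervals, query, k):
    """Graph ids appearing at least k times in the intervals of the query words."""
    counts = Counter()
    for word in query:
        if word in intervals:
            start, end = intervals[word]
            counts.update(array[start - 1:end])
    return sorted(gid for gid, c in counts.items() if c >= k)

db = {1: {"A", "B"}, 2: {"A", "C"}, 3: {"A", "B", "C"}}
array, intervals = build_array(db)
print(array)      # [1, 2, 3, 1, 3, 2, 3]
print(intervals)  # {'A': (1, 3), 'B': (4, 5), 'C': (6, 7)}
print(semi_conjunctive_naive(array, intervals, {"B", "C"}, k=2))  # [3]
```

Since a graph id occurs at most once per row, the number of intervals it appears in equals the number of words it shares with the query.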
- 15. Pruning search space • Find all graphs W in the database whose cosine to a query Q is at least a threshold 1 - ε: W s.t. $K_N(W, Q) = \frac{|W \cap Q|}{\sqrt{|W||Q|}} \geq 1 - \epsilon$ ✓ W, Q: bag-of-words of graphs • The above condition can be relaxed as follows • If $K_N(W, Q) \geq 1 - \epsilon$, then $(1 - \epsilon)^2 |Q| \leq |W| \leq \frac{|Q|}{(1 - \epsilon)^2}$ ✓ Can be used for pruning the search space
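The size bound follows because the overlap |W ∩ Q| can be no larger than min(|W|, |Q|), so the cosine is at most sqrt(min/max) of the two sizes. A small sketch of computing the admissible size range for W, with a brute-force check, is given below; the function name is hypothetical and bags of words are treated as sets.

```python
from math import ceil, floor, sqrt

def size_bounds(q_size, eps):
    """Smallest and largest |W| that can reach cosine >= 1 - eps with |Q| = q_size."""
    t = (1 - eps) ** 2
    return ceil(t * q_size), floor(q_size / t)

lo, hi = size_bounds(10, 0.3)
print(lo, hi)  # 5 20

# Brute-force check: any (|W|, overlap) pair that meets the threshold
# has |W| inside the computed bounds.
for w_size in range(1, 41):
    for overlap in range(0, min(w_size, 10) + 1):
        if overlap / sqrt(w_size * 10) >= 0.7:
            assert lo <= w_size <= hi
```

Graphs whose word count falls outside this range can be skipped without computing any overlap.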
- 16. Complexity • Time per query: O(τm) • τ: the number of traversed nodes • m: the number of words in the query • Memory: (1+α)N log n + M log N bits • N: the number of all words in the database • M: maximum integer in the array • n: the number of graphs • α: overhead for rank dictionary (α=0.6) • Inverted index takes N log n bits • About 60% overhead over the inverted index!
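For a feel of the memory formula, the snippet below simply evaluates both expressions. The logarithm base (2) and the sizes plugged in are assumptions chosen for easy arithmetic; they are not figures from the experiments.

```python
from math import log2

def gwt_bits(N, n, M, alpha=0.6):
    """(1 + alpha) * N * log n + M * log N, with base-2 logs assumed."""
    return (1 + alpha) * N * log2(n) + M * log2(N)

def inverted_index_bits(N, n):
    """N * log n bits for the plain inverted index, base-2 logs assumed."""
    return N * log2(n)

# Hypothetical sizes: about one million words, 1024 graphs, M = 1024.
N, n, M = 2 ** 20, 2 ** 10, 2 ** 10
ratio = gwt_bits(N, n, M) / inverted_index_bits(N, n)
print(round(ratio, 3))  # 1.602
```

When M log N is small relative to N log n, the ratio stays close to 1 + α, which matches the roughly 60% overhead quoted on the slide.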
- 17. Outline • Overview • A data-structure ✓ Wavelet Tree • Graph similarity search ✓ Weisfeiler-Lehman kernel ✓ Apply wavelet tree • Experiments ✓ Comparison to inverted index ✓ 25 million molecular graphs
- 18. Experiments • 25 million chemical compounds from PubChem database • Evaluate search time and memory usage • Cosine threshold ε=0.3,0.35,0.4 • Compare our method gWT to ✓ Inverted index (concatenate all intervals and sort) ✓ Sequential scan (Compute similarity one by one)
- 19. Search time (figure; labeled values: 40 sec, 38 sec, 8 sec, 3 sec, 2 sec)
- 20. Memory usage (figure; labeled value: 20 GB)
- 21. Construction time (figure; labeled value: 7 h)
- 22. Summary • Efficient similarity search method for massive graph databases • Solves semi-conjunctive queries efficiently • Built on the wavelet tree • Uses the Weisfeiler-Lehman procedure to represent graphs as bag-of-words • Applicable to 25 million graphs • Software: http://code.google.com/p/gwt