Upcoming SlideShare
×

# Gwt sdm public

4,405

Published on

Presentation slide at 2011 SIAM international conference on data mining

Published in: Technology
2 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

Views
Total Views
4,405
On Slideshare
0
From Embeds
0
Number of Embeds
20
Actions
Shares
0
21
0
Likes
2
Embeds 0
No embeds

No notes for slide

### Gwt sdm public

1. 1. 2011 SIAM International Conference on Data Mining,Apr. 28, 2011 Kernel-based similarity search in massive graph databases with wavelet trees Yasuo Tabei and Koji Tsuda JST ERATO Minato Project, Japan AIST Comp Bio Research Center, Japan
2. 2. Graph similarity search n  Similarity search for 25 million molecular graphs •  Find all graphs whose similarity to the query ≥ 1 − •  Similarity = Weisfeiler-Lehman graph kernel (NIPS,2009)n  Use data structure called “Wavelet Tree” (SODA,2003) •  Self-index of an integer array •  Combination of rank dictionaries of bit arrays •  Enables fast array operations •  e.g., range minimum query, range intersectionn  First review wavelet tree in the range intersection problem, then graph similarity search 11/05/08 2
3. 3. Range intersection on array n  Array A of length N, 1 ≤ Ai ≤ M i j k  A 1 3 6 8 2 5 7 1 2 7 4 5n  Range intersection: rint(A,[i,j],[k,]) •  Find set intersection of A[i,j] and A[k,] •  The naïve solution is to concatenate and sortn  Use an index structure of the array (Wavelet Tree) and solve the problem faster! 11/05/08 3
4. 4. Tree of subarrays:Lower half = left, Higher half=right [1,8] 1 3 6 8 2 5 7 1 7 2 4 5 [1,4] [5,8] 1 3 2 1 2 4 6 8 5 7 7 5[1,2] [3,4] [5,6] [7,8] 1 2 1 2 3 4 6 5 5 8 7 7 1 1 2 2 3 4 5 5 6 7 7 8
5. 5. Remember if each element is either inlower half (0) or higher half (1) [1,8] 0 0 1 1 0 1 1 0 1 0 0 1 [1,4] [5,8] 0 1 0 0 0 1 0 1 0 1 1 0 [1,2] [3,4] [5,6] [7,8] 0 1 0 1 0 1 1 0 0 1 0 0 1 2 3 4 5 6 7 8 11/05/08 5
6. 6. Index each bit array with a rank dictionary n  With rank dictionary, the rank operation can be done in O(1) time •  rankc(B,i): return the number of c {0,1} in B[1…i]n  Several algorithms known: rank9sel Example) B=0110011100 i 1 2 3 4 5 6 7 8 9 10 rank1(B,8)=5 011001110 0 rank0(B,5)=3 0 1 1 0 0 1 1 1 0 0 11/05/08 6
7. 7. Implementation of rank dictionary n  Divide the bit array B intoB large blocks of length l=log2n RL=Ranks of large blocksRL n  Divide each large block to small blocks of length s=logn/2 Rs=Ranks of small blocksRS relative to the large block rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank) Time:O(1) Memory: n +o(n) bits
8. 8. O(1)-time division of an interval n  Using the rank operations, the division of an interval can be done in constant time •  rank0 for left child and rank1 for right childn  Naïve = linear time to the total number of elements [1,8] Aroot 1 3 6 8 2 5 7 1 7 2 4 5 rank0 rank1 [1,4] [5,8] Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5 11/05/08 8
9. 9. Fast computation of rank intersection by pruning Pruned [1,8] 1 3 6 8 2 5 7 1 7 2 4 5 [1,4] [5,8] 1 3 2 1 2 4 6 8 5 7 7 5 [1,2] [3,4] [5,6] [6,8] 1 2 1 2 3 4 6 5 5 8 7 7 1 1 2 2 3 4 5 5 6 7 7 8 solution!!
10. 10. Graph Similarity Search n  Bag-of-words representation of graph •  Weisfeiler-Lehman procedure (NIPS, 2009), Hido and Kashima(2009),Wang et al.,(2009) W=(A,D,E,H) n  Cosine similarity query •  Find all graphs W whose cosine similarity (kernel) to the query Q is at least 1-ε 11/05/08 10
11. 11. Weisfeiler-Lehman Procedure (NIPS,09) n Convert a graph into a set of words (bag-of-words) i) Make a label set of adjacent A vertices ex) {E,A,D} B E ii) Sort ex) A,D,E iii) Add the vertex label as a prefix D ex) B,A,D,E A iv) Map the label sequence to a R E unique value ex) B,A,D,E R D v) Assign the value as the new Bag-of-words {A,B,D,E,…,R,…} vertex label
12. 12. Semi-conjunctive query n  Cosine similarity query can be relaxed to the following form W s.t |W ∩ Q| ≥ kn  Find all graphs W which share at least k words to the query Qn  No false negativesn  False positives can easily be filtered out by cosine calculations 11/05/08 12
13. 13. Inverted index, Array, Wavelet Tree n  Graph database is represented as inverted index n  Concatenate all rows to make an array n  Index the array with wavelet treeAroot 1 3 6 8 2 5 7 1 2 7 4 5 n  Semi-conjunctive query = Extension of range intersection Wavelet Tree •  Find graph ids which appear at least k times in the array 11/05/08 13
14. 14. Pruning condition n  Find all graphs W in the database whose cosine is larger than a threshold 1-ε | W !Q | W s.t K N (W,Q) = ! 1 ! |W | |Q | l  W, Q: bag-of-words of graphsn The above solution can be relaxed as follows, If K N (Q,W ) ! 1 ! , then |Q | (1! ! )2 | Q |!| W |! (1! ! )2 l  Can be used for pruning
15. 15. Complexity n  Time per query: O(τm) •  τ: the number of traversed nodes •  m: the number of bag-of-words in a queryn  Memory: (1+α)N log n + M log N •  N: the number of all words in the database •  M: Maximum integer in the array •  n: the number of graphs •  α: overhead for rank dictionary (α=0.6) n  Inverted index takes Nlogn bits n  About 60% overhead to inverted index!
16. 16. Experiments n  25 million chemical compounds from PubChem databasen  Evaluated search time and memory usagen  Cosine thresholds ε=0.3,0.35,0.4n  Compare our method gWT to •  Inverted index (concatenate all intervals and sort) •  Sequential scan (compute kernel one by one)
17. 17. Search time 40 sec 38 sec 8 sec 3 sec 2 sec
18. 18. Memory usage 20GB
19. 19. Construction time 7h
20. 20. Related work A lot of methods have been proposed so far.n  1.Graph Grep [Shasha et al., 02] 2.gIndex [Yan et al., 04] 3.Tree+Delta [Zhao et al., 07] 4.TreePi [Zhang et al., 07] 5.Gstring [Jiang et al., 07] 6.FG-Index [Cheng et al., 07] 7.GDIndex [Williams et al., 07] etc
21. 21. Related work n  Decompose graphs into a set of substructures •  subgraphs, trees, Decompose paths etc … n  Build a substructure based index n  Drawbacks Index •  Need substructure mining •  Do not scale to millions of graphs
22. 22. Summary n  Efficient similarity search method of massive graph databasesn  Solve semi-conjunctive queries efficientlyn  Our method is built on wavelet treesn  Use Weisfeiler-Lehman procedure to convert graphs into bag-of-wordsn  Applicable to 25 million graphsn  Software http://code.google.com/p/gwt/