2011 SIAM International Conference on Data Mining,Apr. 28, 2011       Kernel-based similarity search       in massive grap...
Gwt sdm public

Presentation slide at 2011 SIAM international conference on data mining

  1. 2011 SIAM International Conference on Data Mining, Apr. 28, 2011. Kernel-based similarity search in massive graph databases with wavelet trees. Yasuo Tabei and Koji Tsuda. JST ERATO Minato Project, Japan; AIST Comp Bio Research Center, Japan
  2. Graph similarity search
     • Similarity search for 25 million molecular graphs
       • Find all graphs whose similarity to the query is ≥ 1 − ε
       • Similarity = Weisfeiler-Lehman graph kernel (NIPS, 2009)
     • Use a data structure called “Wavelet Tree” (SODA, 2003)
       • Self-index of an integer array
       • Combination of rank dictionaries of bit arrays
       • Enables fast array operations, e.g., range minimum query, range intersection
     • First review the wavelet tree on the range intersection problem, then graph similarity search
  3. Range intersection on array
     • Array A of length N, 1 ≤ A_i ≤ M
       A = 1 3 6 8 2 5 7 1 2 7 4 5 (the figure marks positions i, j, k, ℓ on the array)
     • Range intersection: rint(A, [i,j], [k,ℓ])
       • Find the set intersection of A[i,j] and A[k,ℓ]
       • The naïve solution is to concatenate and sort
     • Use an index structure of the array (Wavelet Tree) and solve the problem faster!
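The naïve baseline on this slide can be sketched as below. This is only an illustration: the function name `rint_naive` and the 1-based inclusive ranges are assumptions chosen to follow the slide notation, not code from the presentation.

```python
def rint_naive(A, i, j, k, l):
    """Values occurring in both A[i..j] and A[k..l] (1-based, inclusive)."""
    # Tag each element with the range it came from, concatenate, and sort.
    tagged = sorted([(v, 0) for v in A[i - 1:j]] + [(v, 1) for v in A[k - 1:l]])
    result = []
    pos = 0
    while pos < len(tagged):
        value = tagged[pos][0]
        seen = set()
        # Walk over the run of equal values and note which ranges contributed.
        while pos < len(tagged) and tagged[pos][0] == value:
            seen.add(tagged[pos][1])
            pos += 1
        if seen == {0, 1}:
            result.append(value)
    return result


A = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
print(rint_naive(A, 1, 5, 7, 10))  # A[1..5] = 1 3 6 8 2, A[7..10] = 7 1 2 7
```

The cost is dominated by the sort over both ranges, which is what the wavelet tree index is meant to avoid.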
  4. Tree of subarrays: lower half = left, higher half = right
     [1,8]: 1 3 6 8 2 5 7 1 7 2 4 5
     [1,4]: 1 3 2 1 2 4 | [5,8]: 6 8 5 7 7 5
     [1,2]: 1 2 1 2 | [3,4]: 3 4 | [5,6]: 6 5 5 | [7,8]: 8 7 7
     Leaves: 1 1 2 2 3 4 5 5 6 7 7 8
  5. Remember whether each element is in the lower half (0) or the higher half (1)
     [1,8]: 0 0 1 1 0 1 1 0 1 0 0 1
     [1,4]: 0 1 0 0 0 1 | [5,8]: 0 1 0 1 1 0
     [1,2]: 0 1 0 1 | [3,4]: 0 1 | [5,6]: 1 0 0 | [7,8]: 1 0 0
     Leaves: 1 2 3 4 5 6 7 8
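A small sketch of how these per-node bit arrays could be derived, using the array from the tree slide. The helper name `build_bits` and the dictionary keyed by value range are hypothetical choices for illustration.

```python
def build_bits(values, lo, hi, out=None):
    """Collect, for every internal node [lo, hi], the bit array that sends
    each element to the lower half (0) or the higher half (1)."""
    if out is None:
        out = {}
    if lo == hi or not values:
        return out
    mid = (lo + hi) // 2
    out[(lo, hi)] = [0 if v <= mid else 1 for v in values]
    # Children keep the elements in their original relative order.
    build_bits([v for v in values if v <= mid], lo, mid, out)
    build_bits([v for v in values if v > mid], mid + 1, hi, out)
    return out


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
bits = build_bits(A, 1, 8)
# Print widest value ranges first, i.e. level by level.
for node in sorted(bits, key=lambda r: (r[0] - r[1], r[0])):
    print(node, " ".join(map(str, bits[node])))
```

Only the bit arrays need to be stored; the subarrays themselves can be discarded once the bits are recorded.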
  6. Index each bit array with a rank dictionary
     • With a rank dictionary, the rank operation can be done in O(1) time
       • rank_c(B, i): return the number of occurrences of c ∈ {0,1} in B[1…i]
     • Several algorithms are known: rank9sel
     • Example) B = 0110011100 (positions i = 1, …, 10)
       • rank_1(B, 8) = 5
       • rank_0(B, 5) = 3
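To make the rank operation concrete, here is a minimal sketch that answers it from a plain prefix-sum table. It is not rank9sel and it is not succinct (it stores a full integer per position); the class name `NaiveRank` is an assumption for illustration.

```python
from itertools import accumulate


class NaiveRank:
    """O(1) rank queries from a prefix-sum table (not space-efficient)."""

    def __init__(self, bits):
        # prefix[i] = number of 1s in B[1..i]; prefix[0] = 0.
        self.prefix = [0] + list(accumulate(bits))

    def rank1(self, i):
        return self.prefix[i]

    def rank0(self, i):
        return i - self.prefix[i]


B = [int(c) for c in "0110011100"]
r = NaiveRank(B)
print(r.rank1(8), r.rank0(5))  # 5 3, as in the slide example
```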
  7. Implementation of rank dictionary
     • Divide the bit array B into large blocks of length l = log² n
       • R_L = ranks of large blocks
     • Divide each large block into small blocks of length s = (log n)/2
       • R_S = ranks of small blocks relative to the large block
     • rank_1(B, i) = R_L[⌊i/l⌋] + R_S[⌊i/s⌋] + (remaining rank)
     • Time: O(1). Memory: n + o(n) bits
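The two-level layout can be sketched as follows. This is an illustration under assumptions: the class name `BlockRank` is hypothetical, block lengths are rounded so that the large block length is a multiple of the small one, and the "remaining rank" is computed by a short scan, where a real implementation would use a lookup table or a popcount instruction to stay O(1).

```python
import math


class BlockRank:
    def __init__(self, bits):
        self.bits = list(bits)
        n = max(len(self.bits), 2)
        log_n = max(2, math.ceil(math.log2(n)))
        self.s = max(1, log_n // 2)   # small block length, about (log n)/2
        self.l = self.s * 2 * log_n   # large block length, about log^2 n
        self.RL = []  # number of 1s before each large block
        self.RS = []  # number of 1s before each small block, within its large block
        total = rel = 0
        # Iterate one position past the end so boundary queries have an entry.
        for pos in range(len(self.bits) + 1):
            if pos % self.l == 0:
                self.RL.append(total)
                rel = 0
            if pos % self.s == 0:
                self.RS.append(rel)
            if pos < len(self.bits):
                total += self.bits[pos]
                rel += self.bits[pos]

    def rank1(self, i):
        """Number of 1s in B[1..i]."""
        small_start = (i // self.s) * self.s
        remaining = sum(self.bits[small_start:i])  # fewer than s bits
        return self.RL[i // self.l] + self.RS[i // self.s] + remaining

    def rank0(self, i):
        return i - self.rank1(i)


print(BlockRank([int(c) for c in "0110011100"]).rank1(8))
```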
  8. O(1)-time division of an interval
     • Using the rank operations, the division of an interval can be done in constant time
       • rank_0 for the left child and rank_1 for the right child
     • Naïve = time linear in the total number of elements
     [1,8] A_root: 1 3 6 8 2 5 7 1 7 2 4 5
     [1,4] A_left (via rank_0): 1 3 2 1 2 4 | [5,8] A_right (via rank_1): 6 8 5 7 7 5
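One way to write the interval division, assuming 1-based inclusive intervals and a precomputed prefix-sum table standing in for the rank dictionary. The helper names `prefix_ones` and `split_interval` are hypothetical.

```python
from itertools import accumulate


def prefix_ones(bits):
    """ones[p] = rank_1(B, p); built once per node."""
    return [0] + list(accumulate(bits))


def split_interval(ones, i, j):
    """Map interval [i, j] of a node to the matching intervals in its
    left and right children using two rank lookups each."""
    rank1 = lambda p: ones[p]
    rank0 = lambda p: p - ones[p]
    left = (rank0(i - 1) + 1, rank0(j))
    right = (rank1(i - 1) + 1, rank1(j))
    return left, right  # an interval (a, a - 1) is empty


root_bits = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]  # for 1 3 6 8 2 5 7 1 7 2 4 5
ones = prefix_ones(root_bits)
# [3, 9] covers 6 8 2 5 7 1 7 in A_root.
print(split_interval(ones, 3, 9))  # ((3, 4), (1, 5))
```

Here A_left[3..4] = 2 1 and A_right[1..5] = 6 8 5 7 7, which are exactly the low and high elements of the original interval.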
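  9. Fast computation of range intersection by pruning
     [1,8]: 1 3 6 8 2 5 7 1 7 2 4 5
     [1,4]: 1 3 2 1 2 4 | [5,8]: 6 8 5 7 7 5
     [1,2]: 1 2 1 2 | [3,4]: 3 4 | [5,6]: 6 5 5 | [7,8]: 8 7 7
     Leaves: 1 1 2 2 3 4 5 5 6 7 7 8
     Figure labels: “Pruned” marks the subtrees that are skipped; “solution!!” marks the leaves that are reached.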
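A compact sketch of range intersection with pruning on a pointer-based wavelet tree. The pruning rule used here, stopping as soon as either range becomes empty in a subtree, is the standard one and is assumed to be what the figure shows; the names `Node`, `rint`, and `range_intersection` are hypothetical, and prefix sums stand in for a succinct rank dictionary.

```python
from itertools import accumulate


class Node:
    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo < hi and values:
            mid = (lo + hi) // 2
            bits = [0 if v <= mid else 1 for v in values]
            self.ones = [0] + list(accumulate(bits))  # rank_1 prefix sums
            self.left = Node([v for v in values if v <= mid], lo, mid)
            self.right = Node([v for v in values if v > mid], mid + 1, hi)


def rint(node, r1, r2, out):
    (i, j), (k, l) = r1, r2
    if i > j or k > l:        # one range is empty in this subtree: prune
        return
    if node.lo == node.hi:    # leaf reached with both ranges non-empty
        out.append(node.lo)
        return
    rank1 = lambda p: node.ones[p]
    rank0 = lambda p: p - node.ones[p]
    rint(node.left, (rank0(i - 1) + 1, rank0(j)), (rank0(k - 1) + 1, rank0(l)), out)
    rint(node.right, (rank1(i - 1) + 1, rank1(j)), (rank1(k - 1) + 1, rank1(l)), out)


def range_intersection(root, r1, r2):
    out = []
    rint(root, r1, r2, out)
    return out


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
root = Node(A, 1, 8)
print(range_intersection(root, (1, 5), (7, 10)))  # [1, 2]
```

In this query the whole [5,8] side is abandoned after two levels, because 6 and 8 occur only in the first range and 7 only in the second.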
  10. Graph Similarity Search
      • Bag-of-words representation of a graph
        • Weisfeiler-Lehman procedure (NIPS, 2009), Hido and Kashima (2009), Wang et al. (2009)
        • Example: W = (A, D, E, H)
      • Cosine similarity query
        • Find all graphs W whose cosine similarity (kernel) to the query Q is at least 1 − ε
  11. Weisfeiler-Lehman Procedure (NIPS, 09)
      • Convert a graph into a set of words (bag-of-words)
        i) Make a label set of adjacent vertices, ex) {E,A,D}
        ii) Sort, ex) A,D,E
        iii) Add the vertex label as a prefix, ex) B,A,D,E
        iv) Map the label sequence to a unique value, ex) B,A,D,E → R
        v) Assign the value as the new vertex label
      • Bag-of-words: {A,B,D,E,…,R,…}
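Steps i) to v) can be sketched for one relabeling round as below. This is a minimal illustration, not the authors' implementation: the function names, the adjacency-dictionary graph format, and the generated label names (`#0`, `#1`, …, where the slide uses letters such as R) are all assumptions. The lookup `table` is passed in so that it can be shared across graphs, which keeps words comparable between them.

```python
def wl_iteration(labels, adjacency, table):
    """One Weisfeiler-Lehman relabeling round.
    labels: {vertex: label}, adjacency: {vertex: [neighbours]}."""
    new_labels = {}
    for v, label in labels.items():
        neighbours = sorted(labels[u] for u in adjacency[v])  # steps i) and ii)
        signature = (label, *neighbours)                      # step iii)
        if signature not in table:                            # step iv)
            table[signature] = f"#{len(table)}"
        new_labels[v] = table[signature]                      # step v)
    return new_labels


def bag_of_words(labels, adjacency, table, iterations=1):
    """Union of the original labels and the labels from each round."""
    words = set(labels.values())
    for _ in range(iterations):
        labels = wl_iteration(labels, adjacency, table)
        words |= set(labels.values())
    return words


# Vertex 0 is labeled B and is adjacent to vertices labeled E, A, D.
labels = {0: "B", 1: "E", 2: "A", 3: "D"}
adjacency = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
table = {}
print(sorted(bag_of_words(labels, adjacency, table)))
print(table[("B", "A", "D", "E")])  # the new word for the slide's example vertex
```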
  12. Semi-conjunctive query
      • The cosine similarity query can be relaxed to the following form: W s.t. |W ∩ Q| ≥ k
      • Find all graphs W which share at least k words with the query Q
      • No false negatives
      • False positives can easily be filtered out by cosine calculations
  13. Inverted index, Array, Wavelet Tree
      • The graph database is represented as an inverted index
      • Concatenate all rows to make an array: A_root = 1 3 6 8 2 5 7 1 2 7 4 5
      • Index the array with a wavelet tree
      • Semi-conjunctive query = extension of range intersection
        • Find graph ids which appear at least k times in the array
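To illustrate the data layout, the sketch below builds the concatenated array from an inverted index and answers a semi-conjunctive query by plain counting over the query's intervals. This is a scan-based baseline for clarity, not the wavelet tree traversal; the function names and the toy database are hypothetical.

```python
from collections import Counter


def build_index(graphs):
    """graphs: {graph_id: set of words}.
    Returns the concatenated array and each word's half-open interval."""
    rows = {}
    for gid, words in graphs.items():
        for w in words:
            rows.setdefault(w, []).append(gid)
    array, intervals = [], {}
    for w in sorted(rows):
        start = len(array)
        array.extend(sorted(rows[w]))
        intervals[w] = (start, len(array))
    return array, intervals


def semi_conjunctive(array, intervals, query, k):
    """Graph ids appearing at least k times across the query's intervals."""
    counts = Counter()
    for w in query:
        start, end = intervals.get(w, (0, 0))  # unknown word: empty interval
        counts.update(array[start:end])
    return sorted(g for g, c in counts.items() if c >= k)


graphs = {1: {"A", "D", "E", "H"}, 2: {"A", "B"}, 3: {"D", "E", "R"}}
array, intervals = build_index(graphs)
print(array)
print(semi_conjunctive(array, intervals, {"A", "D", "E"}, k=2))  # [1, 3]
```

Each graph id occurs at most once per row, so the count of an id equals the number of query words that graph shares with Q.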
  14. Pruning condition
      • Find all graphs W in the database whose cosine is at least a threshold 1 − ε:
        W s.t. K_N(W, Q) = |W ∩ Q| / (√|W| · √|Q|) ≥ 1 − ε
        • W, Q: bag-of-words of graphs
      • The above condition can be relaxed as follows: if K_N(Q, W) ≥ 1 − ε, then
        (1 − ε)² |Q| ≤ |W| ≤ |Q| / (1 − ε)²
        • Can be used for pruning
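The size bounds follow from the cosine condition together with |W ∩ Q| ≤ min(|W|, |Q|). The derivation below spells this out. Its last line also gives one value of k for the semi-conjunctive query that admits no false negatives, k = ⌈(1 − ε)² |Q|⌉; the slides do not state how k is chosen, so treat that choice as an assumption.

```latex
\begin{align*}
K_N(W,Q) = \frac{|W \cap Q|}{\sqrt{|W|}\,\sqrt{|Q|}} \ge 1-\epsilon
  \quad&\text{and}\quad |W \cap Q| \le \min(|W|,|Q|) \\
\Rightarrow\quad \frac{\min(|W|,|Q|)}{\sqrt{|W|\,|Q|}} &\ge 1-\epsilon \\
\text{if } |W| \le |Q|:\quad \sqrt{|W|/|Q|} \ge 1-\epsilon
  \;&\Rightarrow\; |W| \ge (1-\epsilon)^2\,|Q| \\
\text{if } |W| \ge |Q|:\quad \sqrt{|Q|/|W|} \ge 1-\epsilon
  \;&\Rightarrow\; |W| \le \frac{|Q|}{(1-\epsilon)^2} \\
|W \cap Q| \ge (1-\epsilon)\sqrt{|W|\,|Q|}
  &\ge (1-\epsilon)^2\,|Q|
\end{align*}
```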
  15. Complexity
      • Time per query: O(τm)
        • τ: the number of traversed nodes
        • m: the number of bag-of-words in a query
      • Memory: (1+α) N log n + M log N
        • N: the number of all words in the database
        • M: maximum integer in the array
        • n: the number of graphs
        • α: overhead for the rank dictionary (α = 0.6)
      • The inverted index takes N log n bits
      • About 60% overhead relative to the inverted index!
  16. Experiments
      • 25 million chemical compounds from the PubChem database
      • Evaluated search time and memory usage
      • Cosine thresholds ε = 0.3, 0.35, 0.4
      • Compare our method gWT to
        • Inverted index (concatenate all intervals and sort)
        • Sequential scan (compute the kernel one by one)
  17. Search time (chart; bar labels: 40 sec, 38 sec, 8 sec, 3 sec, 2 sec)
  18. Memory usage (chart; label: 20GB)
  19. Construction time (chart; label: 7h)
  20. Related work
      A lot of methods have been proposed so far.
      1. GraphGrep [Shasha et al., 02]
      2. gIndex [Yan et al., 04]
      3. Tree+Delta [Zhao et al., 07]
      4. TreePi [Zhang et al., 07]
      5. Gstring [Jiang et al., 07]
      6. FG-Index [Cheng et al., 07]
      7. GDIndex [Williams et al., 07]
      etc.
  21. Related work
      • Decompose graphs into a set of substructures
        • Subgraphs, trees, paths, etc.
      • Build a substructure-based index
      • Drawbacks
        • Need substructure mining
        • Do not scale to millions of graphs
  22. Summary
      • Efficient similarity search method for massive graph databases
      • Solve semi-conjunctive queries efficiently
      • Our method is built on wavelet trees
      • Use the Weisfeiler-Lehman procedure to convert graphs into bag-of-words
      • Applicable to 25 million graphs
      • Software: http://code.google.com/p/gwt/
