Gwt presen alsip-20111201

645 views

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
645
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Gwt presen alsip-20111201

  1. 1. ALSIP, Dec. 1 2011Kernel-based similarity search in massive graph databases with wavelet trees Yasuo Tabei and Koji Tsuda JST ERATO Minato Project, National Institute of Advanced Industrial Science and Technology
  2. 2. Outline• Overview• Wavelet Tree ✓ Problem = Range intersection on array• Graph similarity search ✓ Weisfeiler-Lehman kernel ✓ Apply wavelet tree• Experiments ✓ Comparison to inverted index ✓ 25 million molecular graphs
  3. 3. Graph similarity search• Similarity search for 25 million molecular graphs ✓ Find all graphs whose similarity to the query 1 ✓ Similarity = Weisfeiler-Lehman kernel (NIPS, 2009)• Use data-structure called “Wavelet Tree” (SODA, 2003) ✓ Self-index of an integer array ✓ Enable fast array operations ‣ e.g., range minimum query, range intersection
  4. 4. Range intersection on array• Array A of length N, 1 Ai M i j k " A 1 3 6 8 2 5 7 1 2 7 4 5• Range intersection: rint(A, [i,j],[k,]) ✓ Find common elements of A[i,j] and A[k,]• The naive method is to concatenate and sort Ex) concatenate:6,8,2,2,7 ⇛ sort:2,2,7,6,8• Use wavelet tree and solve the problem faster
  5. 5. Tree of subarrays:Lower half=left, Higher half=right [1,8] 1 3 6 8 2 5 7 1 7 2 4 5 [1,4] [5,8] 1 3 2 1 2 4 6 8 5 7 7 5 [1,2] [3,4] [5,6] [7,8] 1 2 1 2 3 4 6 5 5 8 7 7 1 1 2 2 3 4 5 5 6 7 7 8
  6. 6. Remember if each element is either in lower half(0) or higher half(1) [1,8] 0 0 1 1 0 1 1 0 1 0 0 1 [1,4] [5,8] 0 1 0 0 0 1 0 1 0 1 1 0 [1,2] [3,4] [5,6] [7,8] 0 1 0 1 0 1 1 0 0 1 0 0 1 2 3 4 5 6 7 8
  7. 7. Index each bit array with a rank dictionary• Using rank dictionary, the rank operation can be performed in O(1) time ✓ rankc(B, i): return the number of c {0, 1} in B[1,i]• Several methods known: rank9sel (Vigna, 08)• Example) B=0110011100 i 1 2 3 4 5 6 7 8 9 10 rank1 (B, 8) = 5 011001110 0 rank0 (B, 5) = 3 011001110 0
  8. 8. O(1)-division of an interval• Using the rank operation, the division of an• interval can be done in constant time ✓ rank0 for left child and rank1 for right child• Naive = linear time to the total number of elements [1,8] Aroot 1 3 6 8 2 5 7 1 7 2 4 5 rank0 rank1 [1,4] [5,8] Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
  9. 9. Fast computation of rank intersection by pruningPruned [1,8] 1 3 6 8 2 5 7 1 7 2 4 5 [1,4] [5,8] 1 3 2 1 2 4 6 8 5 7 7 5 [1,2] [3,4] [5,6] [6,8] 1 2 1 2 3 4 6 5 5 8 7 7 1 1 2 2 3 4 5 5 6 7 7 8 solution!!
  10. 10. Outline• Overview• Wavelet Tree ✓ Problem = Range intersection on array• Graph similarity search ✓ Weisfeiler-Lehman kernel ✓ Apply wavelet tree• Experiments ✓ Comparison to inverted index ✓ 25 million molecular graphs
  11. 11. Graph Similarity Search• Bag-of-words representation of graph ✓ Weisfeiler-Lehman procedure (NIPS, 2009), Hido and Kashima (ICDM, 2009), Wang et al., (EDBT, 2009) W=(A,D,E,H)• Consine similarity query ✓ Find all graphs W whose cosine similarity (kernel) to the query Q is at least 1
  12. 12. Weisfeiler-Lehman Procedure (NIPS,09)• Convert a graph into a set of words (bag-of-words)
  13. 13. Semi-conjunctive query• Cosine similarity query can be relaxed to the following form W s.t. |W Q| k ✓ Find all graphs W which share at least k words to the query Q• No false negatives• False positives can easily be filtered out by cosine calculations
  14. 14. Inverted index, Array, Wavelet Tree • Inverted index is built from graph database • Concatenate all rows to make • an array • Index the array with wavelet • treeAroot 1 3 6 8 2 5 7 1 2 7 4 5 • Semi-conjunctive query = • Extension of range intersection Wavelet Tree ✓ Find graph ids which appear at least k times in given intervals
  15. 15. Pruning search space• Find all graphs W in the database whose cosine to a query Q is larger than a threshold 1 |W Q| W s.t. KN (W, Q) = 1 W Q ✓ W,Q: bag-of-words of graphs• The above solution can be relaxed as follows• If KN (W, Q) 1 , then |Q| (1 ) |Q| 2 |W | (1 )2 ✓ Can be used for pruning search space
  16. 16. Complexity• Time per query: O(τm) • τ: the number of traversed nodes • m: the number of bag-of-words in a query• Memory: (1+α)N log n + M log N • N: the number of all words in the database • M: Maximum integer in the array • n: the number of graphs • α: overhead for rank dictionary (α=0.6)• Inverted index takes Nlog n bits• About 60% overhead to inverted index!
  17. 17. Outline• Overview• A data-structure ✓ Wavelet Tree• Graph similarity search ✓ Weisfeiler-Lehman kernel ✓ Apply wavelet tree• Experiments ✓ Comparison to inverted index ✓ 25 million molecular graphs
  18. 18. Experiments• 25 million chemical compounds from PubChem database• Evaluate search time and memory usage• Cosine threshold ε=0.3,0.35,0.4• Compare our method gWT to ✓ Inverted index (concatenate all intervals and sort) ✓ Sequential scan (Compute similarity one by one)
  19. 19. Search time 40 sec 38 sec 8 sec 3 sec 2 sec
  20. 20. Memory usage 20GB
  21. 21. Construction time 7h
  22. 22. Summary• Efficient similarity search method of massive graph databases• Solve semi-conjunctive query efficiently• Build on Wavelet Tree• Use Weisfeiler-Lehman procedure to represent graphs as bag-of-words• Applicable to 25 million graphs• Software• http://code.google.com/p/gwt

×