Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Like this presentation? Why not share!

- Scalable Partial Least Squares Regr... by Yasuo Tabei 1450 views
- Mlab2012 tabei 20120806 by Yasuo Tabei 3618 views
- SPIRE2013-tabei20131009 by Yasuo Tabei 4369 views
- CPM2013-tabei201306 by Yasuo Tabei 3931 views
- WABI2012-SuccinctMultibitTree by Yasuo Tabei 5032 views
- NIPS2013読み会: Scalable kernels for g... by Yasuo Tabei 7371 views

5,122 views

Published on

Presentation slide at 2011 SIAM international conference on data mining

Published in:
Technology

No Downloads

Total views

5,122

On SlideShare

0

From Embeds

0

Number of Embeds

3,387

Shares

0

Downloads

21

Comments

0

Likes

2

No embeds

No notes for slide

- 1. 2011 SIAM International Conference on Data Mining,Apr. 28, 2011 Kernel-based similarity search in massive graph databases with wavelet trees Yasuo Tabei and Koji Tsuda JST ERATO Minato Project, Japan AIST Comp Bio Research Center, Japan
- 2. Graph similarity search n Similarity search for 25 million molecular graphs • Find all graphs whose similarity to the query ≥ 1 − • Similarity = Weisfeiler-Lehman graph kernel (NIPS,2009)n Use data structure called “Wavelet Tree” (SODA,2003) • Self-index of an integer array • Combination of rank dictionaries of bit arrays • Enables fast array operations • e.g., range minimum query, range intersectionn First review wavelet tree in the range intersection problem, then graph similarity search 11/05/08 2
- 3. Range intersection on array n Array A of length N, 1 ≤ Ai ≤ M i j k A 1 3 6 8 2 5 7 1 2 7 4 5n Range intersection: rint(A,[i,j],[k,]) • Find set intersection of A[i,j] and A[k,] • The naïve solution is to concatenate and sortn Use an index structure of the array (Wavelet Tree) and solve the problem faster! 11/05/08 3
- 4. Tree of subarrays:Lower half = left, Higher half=right [1,8] 1 3 6 8 2 5 7 1 7 2 4 5 [1,4] [5,8] 1 3 2 1 2 4 6 8 5 7 7 5[1,2] [3,4] [5,6] [7,8] 1 2 1 2 3 4 6 5 5 8 7 7 1 1 2 2 3 4 5 5 6 7 7 8
- 5. Remember if each element is either inlower half (0) or higher half (1) [1,8] 0 0 1 1 0 1 1 0 1 0 0 1 [1,4] [5,8] 0 1 0 0 0 1 0 1 0 1 1 0 [1,2] [3,4] [5,6] [7,8] 0 1 0 1 0 1 1 0 0 1 0 0 1 2 3 4 5 6 7 8 11/05/08 5
- 6. Index each bit array with a rank dictionary n With rank dictionary, the rank operation can be done in O(1) time • rankc(B,i): return the number of c {0,1} in B[1…i]n Several algorithms known: rank9sel Example) B=0110011100 i 1 2 3 4 5 6 7 8 9 10 rank1(B,8)=5 011001110 0 rank0(B,5)=3 0 1 1 0 0 1 1 1 0 0 11/05/08 6
- 7. Implementation of rank dictionary n Divide the bit array B intoB large blocks of length l=log2n RL=Ranks of large blocksRL n Divide each large block to small blocks of length s=logn/2 Rs=Ranks of small blocksRS relative to the large block rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank) Time:O(1) Memory: n +o(n) bits
- 8. O(1)-time division of an interval n Using the rank operations, the division of an interval can be done in constant time • rank0 for left child and rank1 for right childn Naïve = linear time to the total number of elements [1,8] Aroot 1 3 6 8 2 5 7 1 7 2 4 5 rank0 rank1 [1,4] [5,8] Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5 11/05/08 8
- 9. Fast computation of rank intersection by pruning Pruned [1,8] 1 3 6 8 2 5 7 1 7 2 4 5 [1,4] [5,8] 1 3 2 1 2 4 6 8 5 7 7 5 [1,2] [3,4] [5,6] [6,8] 1 2 1 2 3 4 6 5 5 8 7 7 1 1 2 2 3 4 5 5 6 7 7 8 solution!!
- 10. Graph Similarity Search n Bag-of-words representation of graph • Weisfeiler-Lehman procedure (NIPS, 2009), Hido and Kashima(2009),Wang et al.,(2009) W=(A,D,E,H) n Cosine similarity query • Find all graphs W whose cosine similarity (kernel) to the query Q is at least 1-ε 11/05/08 10
- 11. Weisfeiler-Lehman Procedure (NIPS,09) n Convert a graph into a set of words (bag-of-words) i) Make a label set of adjacent A vertices ex) {E,A,D} B E ii) Sort ex) A,D,E iii) Add the vertex label as a prefix D ex) B,A,D,E A iv) Map the label sequence to a R E unique value ex) B,A,D,E R D v) Assign the value as the new Bag-of-words {A,B,D,E,…,R,…} vertex label
- 12. Semi-conjunctive query n Cosine similarity query can be relaxed to the following form W s.t |W ∩ Q| ≥ kn Find all graphs W which share at least k words to the query Qn No false negativesn False positives can easily be filtered out by cosine calculations 11/05/08 12
- 13. Inverted index, Array, Wavelet Tree n Graph database is represented as inverted index n Concatenate all rows to make an array n Index the array with wavelet treeAroot 1 3 6 8 2 5 7 1 2 7 4 5 n Semi-conjunctive query = Extension of range intersection Wavelet Tree • Find graph ids which appear at least k times in the array 11/05/08 13
- 14. Pruning condition n Find all graphs W in the database whose cosine is larger than a threshold 1-ε | W !Q | W s.t K N (W,Q) = ! 1 ! |W | |Q | l W, Q: bag-of-words of graphsn The above solution can be relaxed as follows, If K N (Q,W ) ! 1 ! , then |Q | (1! ! )2 | Q |!| W |! (1! ! )2 l Can be used for pruning
- 15. Complexity n Time per query: O(τm) • τ: the number of traversed nodes • m: the number of bag-of-words in a queryn Memory: (1+α)N log n + M log N • N: the number of all words in the database • M: Maximum integer in the array • n: the number of graphs • α: overhead for rank dictionary (α=0.6) n Inverted index takes Nlogn bits n About 60% overhead to inverted index!
- 16. Experiments n 25 million chemical compounds from PubChem databasen Evaluated search time and memory usagen Cosine thresholds ε=0.3,0.35,0.4n Compare our method gWT to • Inverted index (concatenate all intervals and sort) • Sequential scan (compute kernel one by one)
- 17. Search time 40 sec 38 sec 8 sec 3 sec 2 sec
- 18. Memory usage 20GB
- 19. Construction time 7h
- 20. Related work A lot of methods have been proposed so far.n 1.Graph Grep [Shasha et al., 02] 2.gIndex [Yan et al., 04] 3.Tree+Delta [Zhao et al., 07] 4.TreePi [Zhang et al., 07] 5.Gstring [Jiang et al., 07] 6.FG-Index [Cheng et al., 07] 7.GDIndex [Williams et al., 07] etc
- 21. Related work n Decompose graphs into a set of substructures • subgraphs, trees, Decompose paths etc … n Build a substructure based index n Drawbacks Index • Need substructure mining • Do not scale to millions of graphs
- 22. Summary n Efficient similarity search method of massive graph databasesn Solve semi-conjunctive queries efficientlyn Our method is built on wavelet treesn Use Weisfeiler-Lehman procedure to convert graphs into bag-of-wordsn Applicable to 25 million graphsn Software http://code.google.com/p/gwt/

No public clipboards found for this slide

×
### Save the most important slides with Clipping

Clipping is a handy way to collect and organize the most important slides from a presentation. You can keep your great finds in clipboards organized around topics.

Be the first to comment