Gwt sdm public

Yasuo Tabei
Yasuo TabeiResearcher, Software Developer at Japan Science and Technology Agency
2011 SIAM International Conference on Data Mining,
Apr. 28, 2011


       Kernel-based similarity search
       in massive graph databases
       with wavelet trees

         Yasuo Tabei and Koji Tsuda

         JST ERATO Minato Project, Japan
         AIST Comp Bio Research Center, Japan
Graph similarity search	
n    Similarity search for 25 million molecular graphs
       •  Find all graphs whose similarity to the query ≥ 1 − 
       •  Similarity = Weisfeiler-Lehman graph kernel (NIPS,2009)
n    Use data structure called “Wavelet Tree”
      (SODA,2003)
       •  Self-index of an integer array
          •  Combination of rank dictionaries of bit arrays
       •  Enables fast array operations
          •  e.g., range minimum query, range intersection
n    First review wavelet tree in the range intersection
      problem, then graph similarity search
                                                              11/05/12   2
Range intersection on array	
n    Array A of length N,    1 ≤ Ai ≤ M
                    i	
  j	
    k   
         A       1 3 6 8 2 5 7 1 2 7 4 5

n    Range intersection: rint(A,[i,j],[k,])
      •  Find set intersection of A[i,j] and A[k,]
      •  The naïve solution is to concatenate and sort
n    Use an index structure of the array (Wavelet
      Tree) and solve the problem faster!	


                                                 11/05/12   3
Tree of subarrays:
Lower half = left, Higher half=right	
          [1,8]
              1 3 6 8 2 5 7 1 7 2 4 5

  [1,4]                                             [5,8]
          1 3 2 1 2 4                 6 8 5 7 7 5

[1,2]                   [3,4] [5,6]                 [7,8]
   1 2 1 2          3 4          6 5 5        8 7 7


 1 1       2 2      3      4     5 5     6    7 7      8
Remember if each element is either in
lower half (0) or higher half (1) 	
           [1,8]
               0 0 1 1 0 1 1 0 1 0 0 1

   [1,4]                                                 [5,8]
           0 1 0 0 0 1                 0 1 0 1 1 0


 [1,2]                   [3,4] [5,6]                    [7,8]
     0 1 0 1             0 1      1 0 0          1 0 0


   1	
       2	
     3	
 4	
       5	
 6	
       7	
 8	
                                             11/05/12            5
Index each bit array with a rank
 dictionary	
n    With rank dictionary, the rank operation can be
      done in O(1) time
      •  rankc(B,i): return the number of c   {0,1} in B[1…i]
n    Several algorithms known: rank9sel	
  Example) B=0110011100	
                                i 1 2 3 4 5 6 7 8 9 10
         rank1(B,8)=5             011001110 0
         rank0(B,5)=3	
           0 1 1 0 0 1 1 1 0 0	

                                                     11/05/12   6
Implementation of rank dictionary	

                        n  Divide the bit array B into
B	
                     large blocks of length l=log2n
                          RL=Ranks of large blocks
RL	
                    n  Divide each large block to
                        small blocks of length s=logn/2
                          Rs=Ranks of small blocks
RS	
                         relative to the large block
            rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank)	
                              Time:O(1)
                              Memory: n +o(n) bits
O(1)-time division of an interval	
n    Using the rank operations, the division of an
      interval can be done in constant time
      •  rank0 for left child and rank1 for right child
n    Naïve = linear time to the total number of elements

           [1,8]
            Aroot 1 3 6 8 2 5 7 1 7 2 4 5

                      rank0	
                                rank1	
       [1,4]                            [5,8]
       Aleft    1 3 2 1 2 4              Aright 6 8 5 7 7 5
                                                          11/05/12     8
Fast computation of rank
             intersection by pruning	
Pruned      [1,8]
                  1 3 6 8 2 5 7 1 7 2 4 5

    [1,4]                                                [5,8]
            1 3 2 1 2 4                    6 8 5 7 7 5

  [1,2]                      [3,4] [5,6]             [6,8]
     1 2 1 2             3 4          6 5 5        8 7 7


    1 1       2 2        3      4     5 5     6    7 7     8

            solution!!
Graph Similarity Search	

n    Bag-of-words representation of graph
      •  Weisfeiler-Lehman procedure (NIPS, 2009), Hido and
         Kashima(2009),Wang et al.,(2009)

                         W=(A,D,E,H)	

n    Cosine similarity query
      •  Find all graphs W whose cosine similarity
         (kernel) to the query Q is at least 1-ε


                                              11/05/12    10
Weisfeiler-Lehman Procedure (NIPS,09)
                                    	
 n Convert   a graph into a set of words (bag-of-words)	

                  i) Make a label set of adjacent
   A	
                     vertices ex) {E,A,D}
        B	
  E	
  ii) Sort    ex) A,D,E
                  iii) Add the vertex label as a prefix
   D	
                              ex) B,A,D,E
    A	
                  iv) Map the label sequence to a
        R	
  E	
      unique value
                               ex) B,A,D,E R
    D	
                  v) Assign the value as the new
  Bag-of-words
  {A,B,D,E,…,R,…}	
 vertex label
Semi-conjunctive query	

n    Cosine similarity query can be relaxed to the
      following form	
                W s.t |W ∩ Q| ≥ k
n  Find all graphs W which share at least k words to the
    query Q
n  No false negatives
n  False positives can easily be filtered out by cosine
    calculations


                                            11/05/12   12
Inverted index, Array, Wavelet Tree	
         	
                   	
    n   Graph database is represented
    	
                   	
              as inverted index
    	
              	
                                     n  Concatenate all rows to make
    	
              	
    	
         	
                                         an array
                                     n  Index the array with wavelet
                                         tree
Aroot 1 3 6 8 2 5 7 1 2 7 4        5
                                     n  Semi-conjunctive query =
                                         Extension of range intersection
     Wavelet Tree	
                       •  Find graph ids which appear at
                                             least k times in the array

                                                            11/05/12     13
Pruning condition	

n    Find all graphs W in the database whose cosine
      is larger than a threshold 1-ε
                                          | W !Q |
                   W s.t K N (W,Q) =                ! 1 !
                                          |W | |Q |
      l    W, Q: bag-of-words of graphs
n The above solution can be relaxed as follows,
  If K N (Q,W ) ! 1 ! , then
                                                |Q |
                      (1! ! )2 | Q |!| W |!
                                              (1! ! )2

      l    Can be used for pruning
Complexity	

n   Time per query: O(τm)
      •  τ: the number of traversed nodes
      •  m: the number of bag-of-words in a query
n  Memory: (1+α)N log n + M log N
      •  N: the number of all words in the database
      •  M: Maximum integer in the array
      •  n: the number of graphs
      •  α: overhead for rank dictionary (α=0.6)
 n  Inverted index takes Nlogn bits
 n  About 60% overhead to inverted index!
Experiments	

n  25 million chemical compounds from PubChem
    database
n  Evaluated search time and memory usage
n  Cosine thresholds ε=0.3,0.35,0.4
n  Compare our method gWT to
      •  Inverted index (concatenate all intervals and sort)
      •  Sequential scan (compute kernel one by one)
Search time	
                40 sec	
                38 sec	




                8 sec	
                3 sec	
                2 sec
Memory usage	
                 20GB
Construction time	

                      7h
Related work	
  A lot of methods have been proposed so far.
n 
 1.Graph Grep [Shasha et al., 02]
 2.gIndex [Yan et al., 04]
 3.Tree+Delta [Zhao et al., 07]
 4.TreePi [Zhang et al., 07]
 5.Gstring [Jiang et al., 07]
 6.FG-Index [Cheng et al., 07]
 7.GDIndex [Williams et al., 07]
 etc
Related work 	
                      n  Decompose graphs into
                          a set of substructures
                           •  graphs, trees, paths
        Decompose	
                              etc

                   …	
n  Build a substructure
                          based index
                      n  Drawbacks
         Index	
                           •  Need substructure
                              mining
                           •  Do not scale to
                              millions of graphs
Summary	
n  Efficient similarity search method of massive
    graph databases
n  Solve semi-conjunctive queries efficiently
n  Our method is built on wavelet trees
n  Use Weisfeiler-Lehman procedure to convert
    graphs into bag-of-words
n  Applicable to 25 million graphs
n  Software
 http://code.google.com/p/gwt/
1 of 22

Recommended

Gwt presen alsip-20111201 by
Gwt presen alsip-20111201Gwt presen alsip-20111201
Gwt presen alsip-20111201Yasuo Tabei
631 views22 slides
Mlab2012 tabei 20120806 by
Mlab2012 tabei 20120806Mlab2012 tabei 20120806
Mlab2012 tabei 20120806Yasuo Tabei
2.8K views32 slides
WABI2012-SuccinctMultibitTree by
WABI2012-SuccinctMultibitTreeWABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTreeYasuo Tabei
4.3K views33 slides
Dmss2011 public by
Dmss2011 publicDmss2011 public
Dmss2011 publicYasuo Tabei
588 views39 slides
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA... by
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...grssieee
599 views50 slides
Tensor Train decomposition in machine learning by
Tensor Train decomposition in machine learningTensor Train decomposition in machine learning
Tensor Train decomposition in machine learningAlexander Novikov
7K views29 slides

More Related Content

What's hot

Neural Processes by
Neural ProcessesNeural Processes
Neural ProcessesSangwoo Mo
682 views43 slides
Group {1, −1, i, −i} Cordial Labeling of Product Related Graphs by
Group {1, −1, i, −i} Cordial Labeling of Product Related GraphsGroup {1, −1, i, −i} Cordial Labeling of Product Related Graphs
Group {1, −1, i, −i} Cordial Labeling of Product Related GraphsIJASRD Journal
164 views7 slides
6 adesh kumar tripathi -71-74 by
6 adesh kumar tripathi -71-746 adesh kumar tripathi -71-74
6 adesh kumar tripathi -71-74Alexander Decker
390 views4 slides
Mgm by
MgmMgm
MgmMarc Coevoet
248 views11 slides
1 cb02e45d01 by
1 cb02e45d011 cb02e45d01
1 cb02e45d01nehagurjar
126 views5 slides
l1-Embeddings and Algorithmic Applications by
l1-Embeddings and Algorithmic Applicationsl1-Embeddings and Algorithmic Applications
l1-Embeddings and Algorithmic ApplicationsGrigory Yaroslavtsev
255 views17 slides

What's hot(20)

Neural Processes by Sangwoo Mo
Neural ProcessesNeural Processes
Neural Processes
Sangwoo Mo682 views
Group {1, −1, i, −i} Cordial Labeling of Product Related Graphs by IJASRD Journal
Group {1, −1, i, −i} Cordial Labeling of Product Related GraphsGroup {1, −1, i, −i} Cordial Labeling of Product Related Graphs
Group {1, −1, i, −i} Cordial Labeling of Product Related Graphs
IJASRD Journal164 views
1 cb02e45d01 by nehagurjar
1 cb02e45d011 cb02e45d01
1 cb02e45d01
nehagurjar126 views
NIPS2010: optimization algorithms in machine learning by zukun
NIPS2010: optimization algorithms in machine learningNIPS2010: optimization algorithms in machine learning
NIPS2010: optimization algorithms in machine learning
zukun1.1K views
Multilinear singular integrals with entangled structure by VjekoslavKovac1
Multilinear singular integrals with entangled structureMultilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structure
VjekoslavKovac1222 views
Tales on two commuting transformations or flows by VjekoslavKovac1
Tales on two commuting transformations or flowsTales on two commuting transformations or flows
Tales on two commuting transformations or flows
VjekoslavKovac163 views
A Szemerédi-type theorem for subsets of the unit cube by VjekoslavKovac1
A Szemerédi-type theorem for subsets of the unit cubeA Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cube
VjekoslavKovac156 views
11.the univalence of some integral operators by Alexander Decker
11.the univalence of some integral operators11.the univalence of some integral operators
11.the univalence of some integral operators
Alexander Decker220 views
The univalence of some integral operators by Alexander Decker
The univalence of some integral operatorsThe univalence of some integral operators
The univalence of some integral operators
Alexander Decker155 views
Paraproducts with general dilations by VjekoslavKovac1
Paraproducts with general dilationsParaproducts with general dilations
Paraproducts with general dilations
VjekoslavKovac1241 views
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon... by MLconf
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
MLconf2.3K views
Tensorizing Neural Network by Ruochun Tzeng
Tensorizing Neural NetworkTensorizing Neural Network
Tensorizing Neural Network
Ruochun Tzeng326 views
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ... by Yandex
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Yandex541 views
Scattering theory analogues of several classical estimates in Fourier analysis by VjekoslavKovac1
Scattering theory analogues of several classical estimates in Fourier analysisScattering theory analogues of several classical estimates in Fourier analysis
Scattering theory analogues of several classical estimates in Fourier analysis
VjekoslavKovac1240 views
Higher-order organization of complex networks by David Gleich
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networks
David Gleich4.5K views

Viewers also liked

「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2 by
「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2
「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2Yahoo!デベロッパーネットワーク
2.2K views20 slides
Hcj2014 myui by
Hcj2014 myuiHcj2014 myui
Hcj2014 myuiMakoto Yui
12.6K views42 slides
「最近傍検索とその応用」#yjdsw2 by
「最近傍検索とその応用」#yjdsw2「最近傍検索とその応用」#yjdsw2
「最近傍検索とその応用」#yjdsw2Yahoo!デベロッパーネットワーク
1.4K views23 slides
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014) by
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)Hadoop / Spark Conference Japan
6.6K views42 slides
Hadoopカンファレンス20140707 by
Hadoopカンファレンス20140707Hadoopカンファレンス20140707
Hadoopカンファレンス20140707Recruit Technologies
26.4K views58 slides
Hivemall v0.3の機能紹介@1st Hivemall meetup by
Hivemall v0.3の機能紹介@1st Hivemall meetupHivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetupMakoto Yui
8.2K views44 slides

Viewers also liked(10)

Hcj2014 myui by Makoto Yui
Hcj2014 myuiHcj2014 myui
Hcj2014 myui
Makoto Yui12.6K views
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014) by Hadoop / Spark Conference Japan
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
Hivemall v0.3の機能紹介@1st Hivemall meetup by Makoto Yui
Hivemall v0.3の機能紹介@1st Hivemall meetupHivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetup
Makoto Yui8.2K views
FluentdやNorikraを使った データ集約基盤への取り組み紹介 by Recruit Technologies
FluentdやNorikraを使った データ集約基盤への取り組み紹介FluentdやNorikraを使った データ集約基盤への取り組み紹介
FluentdやNorikraを使った データ集約基盤への取り組み紹介
Recruit Technologies25.5K views
実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ... by MapR Technologies Japan
実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ...実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ...
実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ...
Treasure Data on The YARN - Hadoop Conference Japan 2014 by Ryu Kobayashi
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014
Ryu Kobayashi14.2K views
Shib: WebUI tool provides crossover of Hive and MPP by SATOSHI TAGOMORI
Shib: WebUI tool provides crossover of Hive and MPPShib: WebUI tool provides crossover of Hive and MPP
Shib: WebUI tool provides crossover of Hive and MPP
SATOSHI TAGOMORI9.3K views

Similar to Gwt sdm public

SISAP17 by
SISAP17SISAP17
SISAP17Yasuo Tabei
325 views32 slides
Lca seminar modified by
Lca seminar modifiedLca seminar modified
Lca seminar modifiedInbok Lee
858 views58 slides
Salem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docx by
Salem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docxSalem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docx
Salem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docxanhlodge
3 views24 slides
PAM.ppt by
PAM.pptPAM.ppt
PAM.pptjanaki raman
25 views52 slides
Linear sorting by
Linear sortingLinear sorting
Linear sorting Krishna Chaytaniah
763 views6 slides
Review session2 by
Review session2Review session2
Review session2NEEDY12345
685 views29 slides

Similar to Gwt sdm public(20)

Lca seminar modified by Inbok Lee
Lca seminar modifiedLca seminar modified
Lca seminar modified
Inbok Lee858 views
Salem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docx by anhlodge
Salem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docxSalem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docx
Salem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docx
anhlodge3 views
Review session2 by NEEDY12345
Review session2Review session2
Review session2
NEEDY12345685 views
Week09 by hccit
Week09Week09
Week09
hccit240 views
Hussam Malibari Heckman MAT 242 Spring 2017Assignment Chapte.docx by wellesleyterresa
Hussam Malibari Heckman MAT 242 Spring 2017Assignment Chapte.docxHussam Malibari Heckman MAT 242 Spring 2017Assignment Chapte.docx
Hussam Malibari Heckman MAT 242 Spring 2017Assignment Chapte.docx
Algo-Exercises-2-hash-AVL-Tree.ppt by HebaSamy22
Algo-Exercises-2-hash-AVL-Tree.pptAlgo-Exercises-2-hash-AVL-Tree.ppt
Algo-Exercises-2-hash-AVL-Tree.ppt
HebaSamy228 views
lecture 17 by sajinsc
lecture 17lecture 17
lecture 17
sajinsc588 views
Dinive conquer algorithm by Mohd Arif
Dinive conquer algorithmDinive conquer algorithm
Dinive conquer algorithm
Mohd Arif10.3K views
Dsoop (co 221) 1 by Puja Koch
Dsoop (co 221) 1Dsoop (co 221) 1
Dsoop (co 221) 1
Puja Koch539 views

More from Yasuo Tabei

Space-efficient Feature Maps for String Alignment Kernels by
Space-efficient Feature Maps for String Alignment KernelsSpace-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment KernelsYasuo Tabei
253 views15 slides
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices by
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesYasuo Tabei
4.4K views26 slides
Kdd2015reading-tabei by
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabeiYasuo Tabei
1.1K views18 slides
DCC2014 - Fully Online Grammar Compression in Constant Space by
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceYasuo Tabei
1.1K views21 slides
NIPS2013読み会: Scalable kernels for graphs with continuous attributes by
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesYasuo Tabei
8.3K views22 slides
GIW2013 by
GIW2013GIW2013
GIW2013Yasuo Tabei
1.1K views26 slides

More from Yasuo Tabei(14)

Space-efficient Feature Maps for String Alignment Kernels by Yasuo Tabei
Space-efficient Feature Maps for String Alignment KernelsSpace-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment Kernels
Yasuo Tabei253 views
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices by Yasuo Tabei
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Yasuo Tabei4.4K views
Kdd2015reading-tabei by Yasuo Tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabei
Yasuo Tabei1.1K views
DCC2014 - Fully Online Grammar Compression in Constant Space by Yasuo Tabei
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant Space
Yasuo Tabei1.1K views
NIPS2013読み会: Scalable kernels for graphs with continuous attributes by Yasuo Tabei
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
Yasuo Tabei8.3K views
CPM2013-tabei201306 by Yasuo Tabei
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
Yasuo Tabei4.5K views
SPIRE2013-tabei20131009 by Yasuo Tabei
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009
Yasuo Tabei4.9K views
Ibisml2011 06-20 by Yasuo Tabei
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20
Yasuo Tabei896 views
Lgm pakdd2011 public by Yasuo Tabei
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 public
Yasuo Tabei995 views
Lgm saarbrucken by Yasuo Tabei
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
Yasuo Tabei1.2K views
Sketch sort sugiyamalab-20101026 - public by Yasuo Tabei
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - public
Yasuo Tabei632 views
Sketch sort ochadai20101015-public by Yasuo Tabei
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-public
Yasuo Tabei1.4K views

Recently uploaded

Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...The Digital Insurer
40 views52 slides
"Surviving highload with Node.js", Andrii Shumada by
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada Fwdays
49 views29 slides
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... by
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...ShapeBlue
113 views18 slides
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates by
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesShapeBlue
178 views15 slides
State of the Union - Rohit Yadav - Apache CloudStack by
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStackShapeBlue
218 views53 slides
Ransomware is Knocking your Door_Final.pdf by
Ransomware is Knocking your Door_Final.pdfRansomware is Knocking your Door_Final.pdf
Ransomware is Knocking your Door_Final.pdfSecurity Bootcamp
81 views46 slides

Recently uploaded(20)

Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by The Digital Insurer
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...
"Surviving highload with Node.js", Andrii Shumada by Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays49 views
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... by ShapeBlue
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
ShapeBlue113 views
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates by ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue178 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue218 views
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... by ShapeBlue
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
ShapeBlue86 views
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ... by ShapeBlue
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...
ShapeBlue52 views
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by ShapeBlue
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
ShapeBlue48 views
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue134 views
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by ShapeBlue
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool
ShapeBlue56 views
Digital Personal Data Protection (DPDP) Practical Approach For CISOs by Priyanka Aash
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOs
Priyanka Aash103 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue68 views
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue128 views
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue154 views
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu287 views

Gwt sdm public

  • 1. 2011 SIAM International Conference on Data Mining, Apr. 28, 2011 Kernel-based similarity search in massive graph databases with wavelet trees Yasuo Tabei and Koji Tsuda JST ERATO Minato Project, Japan AIST Comp Bio Research Center, Japan
  • 2. Graph similarity search n  Similarity search for 25 million molecular graphs •  Find all graphs whose similarity to the query ≥ 1 − •  Similarity = Weisfeiler-Lehman graph kernel (NIPS,2009) n  Use data structure called “Wavelet Tree” (SODA,2003) •  Self-index of an integer array •  Combination of rank dictionaries of bit arrays •  Enables fast array operations •  e.g., range minimum query, range intersection n  First review wavelet tree in the range intersection problem, then graph similarity search 11/05/12 2
  • 3. Range intersection on array n  Array A of length N, 1 ≤ Ai ≤ M i j k  A 1 3 6 8 2 5 7 1 2 7 4 5 n  Range intersection: rint(A,[i,j],[k,]) •  Find set intersection of A[i,j] and A[k,] •  The naïve solution is to concatenate and sort n  Use an index structure of the array (Wavelet Tree) and solve the problem faster! 11/05/12 3
  • 4. Tree of subarrays: Lower half = left, Higher half=right [1,8] 1 3 6 8 2 5 7 1 7 2 4 5 [1,4] [5,8] 1 3 2 1 2 4 6 8 5 7 7 5 [1,2] [3,4] [5,6] [7,8] 1 2 1 2 3 4 6 5 5 8 7 7 1 1 2 2 3 4 5 5 6 7 7 8
  • 5. Remember if each element is either in lower half (0) or higher half (1) [1,8] 0 0 1 1 0 1 1 0 1 0 0 1 [1,4] [5,8] 0 1 0 0 0 1 0 1 0 1 1 0 [1,2] [3,4] [5,6] [7,8] 0 1 0 1 0 1 1 0 0 1 0 0 1 2 3 4 5 6 7 8 11/05/12 5
  • 6. Index each bit array with a rank dictionary n  With rank dictionary, the rank operation can be done in O(1) time •  rankc(B,i): return the number of c {0,1} in B[1…i] n  Several algorithms known: rank9sel Example) B=0110011100 i 1 2 3 4 5 6 7 8 9 10 rank1(B,8)=5 011001110 0 rank0(B,5)=3 0 1 1 0 0 1 1 1 0 0 11/05/12 6
  • 7. Implementation of rank dictionary n  Divide the bit array B into B large blocks of length l=log2n RL=Ranks of large blocks RL n  Divide each large block to small blocks of length s=logn/2 Rs=Ranks of small blocks RS relative to the large block rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank) Time:O(1) Memory: n +o(n) bits
  • 8. O(1)-time division of an interval n  Using the rank operations, the division of an interval can be done in constant time •  rank0 for left child and rank1 for right child n  Naïve = linear time to the total number of elements [1,8] Aroot 1 3 6 8 2 5 7 1 7 2 4 5 rank0 rank1 [1,4] [5,8] Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5 11/05/12 8
  • 9. Fast computation of rank intersection by pruning Pruned [1,8] 1 3 6 8 2 5 7 1 7 2 4 5 [1,4] [5,8] 1 3 2 1 2 4 6 8 5 7 7 5 [1,2] [3,4] [5,6] [6,8] 1 2 1 2 3 4 6 5 5 8 7 7 1 1 2 2 3 4 5 5 6 7 7 8 solution!!
  • 10. Graph Similarity Search n  Bag-of-words representation of graph •  Weisfeiler-Lehman procedure (NIPS, 2009), Hido and Kashima(2009),Wang et al.,(2009) W=(A,D,E,H) n  Cosine similarity query •  Find all graphs W whose cosine similarity (kernel) to the query Q is at least 1-ε 11/05/12 10
  • 11. Weisfeiler-Lehman Procedure (NIPS,09) n Convert a graph into a set of words (bag-of-words) i) Make a label set of adjacent A vertices ex) {E,A,D} B E ii) Sort ex) A,D,E iii) Add the vertex label as a prefix D ex) B,A,D,E A iv) Map the label sequence to a R E unique value ex) B,A,D,E R D v) Assign the value as the new Bag-of-words {A,B,D,E,…,R,…} vertex label
  • 12. Semi-conjunctive query n  Cosine similarity query can be relaxed to the following form W s.t |W ∩ Q| ≥ k n  Find all graphs W which share at least k words to the query Q n  No false negatives n  False positives can easily be filtered out by cosine calculations 11/05/12 12
  • 13. Inverted index, Array, Wavelet Tree n  Graph database is represented as inverted index n  Concatenate all rows to make an array n  Index the array with wavelet tree Aroot 1 3 6 8 2 5 7 1 2 7 4 5 n  Semi-conjunctive query = Extension of range intersection Wavelet Tree •  Find graph ids which appear at least k times in the array 11/05/12 13
  • 14. Pruning condition n  Find all graphs W in the database whose cosine is larger than a threshold 1-ε | W !Q | W s.t K N (W,Q) = ! 1 ! |W | |Q | l  W, Q: bag-of-words of graphs n The above solution can be relaxed as follows, If K N (Q,W ) ! 1 ! , then |Q | (1! ! )2 | Q |!| W |! (1! ! )2 l  Can be used for pruning
  • 15. Complexity n  Time per query: O(τm) •  τ: the number of traversed nodes •  m: the number of bag-of-words in a query n  Memory: (1+α)N log n + M log N •  N: the number of all words in the database •  M: Maximum integer in the array •  n: the number of graphs •  α: overhead for rank dictionary (α=0.6) n  Inverted index takes Nlogn bits n  About 60% overhead to inverted index!
  • 16. Experiments n  25 million chemical compounds from PubChem database n  Evaluated search time and memory usage n  Cosine thresholds ε=0.3,0.35,0.4 n  Compare our method gWT to •  Inverted index (concatenate all intervals and sort) •  Sequential scan (compute kernel one by one)
  • 17. Search time 40 sec 38 sec 8 sec 3 sec 2 sec
  • 18. Memory usage 20GB
  • 20. Related work A lot of methods have been proposed so far. n  1.Graph Grep [Shasha et al., 02] 2.gIndex [Yan et al., 04] 3.Tree+Delta [Zhao et al., 07] 4.TreePi [Zhang et al., 07] 5.Gstring [Jiang et al., 07] 6.FG-Index [Cheng et al., 07] 7.GDIndex [Williams et al., 07] etc
  • 21. Related work n  Decompose graphs into a set of substructures •  graphs, trees, paths Decompose etc … n  Build a substructure based index n  Drawbacks Index •  Need substructure mining •  Do not scale to millions of graphs
  • 22. Summary n  Efficient similarity search method of massive graph databases n  Solve semi-conjunctive queries efficiently n  Our method is built on wavelet trees n  Use Weisfeiler-Lehman procedure to convert graphs into bag-of-words n  Applicable to 25 million graphs n  Software http://code.google.com/p/gwt/