SlideShare a Scribd company logo
March 30, 2011, The 5th International Workshop on
Data-Mining and Statistical Science@Osaka University	


    Kernel-based Similarity Search
     in Massive Graph Databases
          with Wavelet Trees

               Yasuo Tabei
        (JST ERATO Minato Project)
         Joint work with Koji Tsuda (AIST)
Outline	
n  Introduction
  l  Recent development of graph databases
  l  Needs for graph similarity search

  l  Bag-of-words representation of a graph

  l  Semi-conjunctive query

n  Method
  l    Scalable similarity search with wavelet trees
n  Experiments
  l    Use a large-scale graph dataset
  l    25 million chemical compounds
Graphs are everywhere	



Gene co-expression
network                                     RNA 2D structure



                     Protein 3D structure

	
                                             	
Chemical compound
                                             Social Network
Graph Similarity Search	
n  Retrievegraphs similar to the query
n  Large databases
  l    More than 20 million chemical compounds in
        PubChem database
n  Bag-of-words    representation of graphs
  l    WL procedure (NIPS 2009)
n  Why    not use document retrieval methods?
  l  Inverted index
  l  Not that easy (explained later)
Weisfeiler-Lehman Procedure (NIPS,09)	
n Convert
         a graph into a set of words
 (bag-of-words)	
 i) Make a label set of adjacent
    A	
                     vertices ex) {E,A,D}
        B	
  E	
  ii) Sort     ex) A,D,E
                  iii) Add the vertex label as a prefix
    D	
                              ex) B,A,D,E
    A	
                  iv) Map the label sequence to a
        R	
   E	
     unique value
                               ex) B,A,D,E R
    D	
                  v) Assign the value as the new
 Bag-of-words
 {A,B,D,E,…,R,…}	
 vertex label
Search by cosine similarity	
n  Identify
          all graphs in the database whose
  cosine is larger than a threshold 1-ε
                                        | Wi !Q |
                 Wi s.t K N (Wi ,Q) =                ! 1" !
                                        | Wi | | Q |

   l    Wi, Q: bag-of-words of graphs
n  The above solution can be relaxed as
  follows,
 If K (Q,W ) ! 1" ! , then
         N

                                             |Q |
                   (1! ! )2 | Q |!| W |!
                                           (1! ! )2
   l    Can be used for fast search
Semi-conjunctive query	
n Cosine   query can be relaxed to the following
form
              Wi s.t | Wi !Q |" k
  l  The number of common words between
      two bag-of-words Wi and Q
  Ex) |Wi Q|=|(A,C,E,F,H) (A,E,I,J,L)|
                 =|(A,E)|=2
  l  k=(1-ε)2|Q|
  l No false negatives

  l False positives can easily be filtered out by
  cosine calculations
Inverted index	
n  Innatural language processing, inverted
   index has been used to solve semi-
   conjunctive query	

Inverted Index	
         	
           	
   n  Associative  map
  	
                 	
    l    key    word
  	
          	
  	
            	
         l    value graph identifiers
  	
          	
  	
                 	
                                       including a word
Bottom-up search	
   Inverted Index	
        	
           	
       i) Look the index up with
   	
               	
   	
             	
              query bag-of-words
   	
                	
   	
             	
          ii) Aggregate all the lists
   	
                   	
         of graph indices
         Query:(A,C,E)	
      iii) Sort
                Aggregation	
     )Scan
(2,8,13,15,8,10,16,4,9,13,14)	
               Sort	
(2,4,8,8,9,10,13,13,14,15,16)	
	
 k=
Search time of inverted index
         on 25 million graphs	
n Searchtime of inverted index is not so different from
that of sequencial scan	
                                              40 sec	
                                              38 sec
Why?	
                           n  Each    word is not
        Query:(A,C,E)	
          specific enough
                Aggregation	
 Query contains 1000s
                             n 
(2,8,13,15,8,10,16,4,9,13,14)	
 of words
                Sort	
       n  Aggregated array is

(2,4,8,8,9,10,13,13,14,15,16)	
 VERY long
                             n  Sorting takes O(ClogC)
              C	
                in time
Overview of our method 	
n  Top-down   search in a tree over the series of
    graphs
n  Huge memory, if tree is implemented with
    pointers
n  Wavelet Tree: Succinct data structure
n  The smaller the similarity threshold is, the
    quicker the algorithm finishes
     •  Not the case in inverted index
Binary tree over graphs	
                                      n  leaf       graph
                  [1,8]	
           0	
                        n  node       interval
                            1	
        [1,4]	
             [5,8]	
                                      n        Each node is identified
      0	
        1	
      0	
      1	
          by a bit string (v={01})
                                            n  At the leaves, the graph
    [1,2]	
    [3,4]	
   [5,6]	
  [7,8]	
                                                indices correspond to
 0	
     1	
 0	
 1	
0	
 1	
 0	
        1	
                                                int(v)+1
  1	
 2	
 3	
 4	
 5	
 6	
 7	
 8	
{000}	
 001}	
      {     {010}	
 {100}	
 {110}	
                  {011}	
 {101}	
   {111}	
    (e.g., int(010)+1=2+1=3)
Summarization of bag-of-words	
  Represent bag-of-words as a bit array
n 
                       12345678
Ex) Wi=(1,3,4,7,8) xi=(1,0,1,1,0,0,1,1)

n    Take disjunction   of all bit arrays in the interval
      of a node v


Ex) For an interval [1,4]
X1=(0,1,0,0,0,0,1,0)           yv=x1 x2 x3 x4
X2=(1,0,1,1,0,0,0,0)             =(1,1,1,1,0,0,1,1)	
X3=(1,0,0,0,0,0,1,1)
X4=(1,0,0,0,0,0,0,1)
Binary tree over graphs	
                        yv=111111	
                       n  Assign   to each node
                         [1,8]	
                             v a bit arrays yv
        yv=110111	
                yv=101101	
                                                          n  yv : bit array
              [1,4]	
                 [5,8]	
                                                            l    i-th bit is 1 if graphs
yv=010101	
yv=110100	
 v=100100	
yv=001101	
                    y
                                                                  in an interval have
    [1,2]	
       [3,4]	
       [5,6]	
         [7,8]	
                                                                  the corresponding
                                                                  word.
  1	
   2	
      3	
     4	
   5	
    6	
   7	
     8	
{000}	
 001}	
      {     {010}	
 {100}	
 {110}	
                 {011}	
 {101}	
 {111}
Top-down traversal	
                   yv=111111	
                      n  Q: bag-of-words of a
                       [1,8]	
                          query
        vy =110111 	
                               y =101101
                               v          	
        n  Perform top-down
             [1,4]	
              [5,8]	
               traversal
 y =010101 y =110100 y =100100 y =001101 n  Prune the search
 v          	
 v        	
 v       	
 v        	

     [1,2]	
     [3,4]	
     [5,6]	
      [7,8]	
       space if # y [ j] ! k
                                                             j"Q
                                                                   v


                                                    n  The larger k is, the
   1	
 2	
 3	
 4	
 5	
 6	
 7	
 8	
{000}	
 001}	
       {      {010}	
 {100}	
 {110}	
                     {011}	
 {101}	
        {111}	
     smaller the search
                                                        space is
Huge Memory	
n  Time   is O(τm) : Very fast
  l  τ: the number of traversed node
  l  m: the number of bag-of-words in a query

n  Space   is O(Mnlogn) bit
  l  M: the number of unique words
  l  n: the number of graphs
Wavelet Tree! (SODA,03)	
n  Replace yv    in each node v by a rank
  dictionary
  l    explained in next slides
n  Implement  a tree without using pointers
n  Only 60% memory overhead compared to
    the inverted index (Vigna,08)
n  Access to the summary information in any
    internal node
Rank dictionary (Raman,02)	
n  Give    bit array B[1,n] the following operation:
   l    rankc(B,i): return the number of c   {0,1} in B[1…i]	



Ex) B=0110011100	
                        i 1 2 3 4 5 6 7 8 9 10
           rank1(B,8)=5 0 1 1 0 0 1 1 1 0 0
           rank0(B,5)=3	
 0 1 1 0 0 1 1 1 0 0
Implementation of rank dictionary	
                           n  Divide the bit array B into
B	
                        large blocks of length l=log2n
                             RL=Ranks of large blocks
RL	
                       n  Divide each large block to
                           small blocks of length s=logn/2
                             Rs=Ranks of small blocks
RS	
                            relative to the large block
               rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank)	
                                 Time:O(1)
                                 Memory: n +o(n) bits
Restricted inverted index	
Inverted Index	
                      n  Concatenate    graph ids
     	
                          	
                                          for words in the root
	
                          	
	
                     	
             n  Restrict the inverted index for
	
                     	
                 the interval [sv,tv] of a node v	
	
                	

          [1,8]    A	
    B	
  C	
   D	
          Aroot 1 3 6 8 2 5 7 1 2 7 4 5

                                       4	
              >4	
[1,4]        A	
 B	
 C	
 D	
                           A	
 B	
 C	
 D	
 [4,8]
 Aleft      1 3 2 1 2 4                        Aright 6 8 5 7 7 5
Whole structure of
          restricted inverted index	
          [1,8]    A	
    B	
   C	
  D	
                1 3 6 8 2 5 7 1 2 7 4 5


  [1,4]    A	
 B	
 C	
 D	
                  A	
 B	
 C	
 D	
[5,8]
          1 3 2 1 2 4                      6 8 5 7 7 5

[1,2]                        [3,4] [5,6]                 [6,7]
   A	
B	
 C	
          A	
D	
        A	
B	
 D	
      A	
 B	
C	
   1 2 1 2             3 4           6 5 5            8 7 7


 A	
C	
    B	
C	
      A	
     D	
   B	
D	
 A	
      B	
 C	
 A	
 1 1       2 2         3        4    5 5 6           7 7     8
Similarity search	
n  To
    retrieve graphs similar to a query
  Q=(A,C), the tree is traversed in the top-
  down manner.

          [1,8]    A	
         C	
          Aroot 1 3 6 8 2 5 7 1 2 7 4 5

                       4	
           >4	
 [1,4]       A	
  C	
        [5,8]   A	
   C	
  Aleft     1 3 2 1 2 4      Aright 6 8 5 7 7 5
Similarity search	
n    To retrieve graphs similar to a query Q=(A,C), the tree
      is traversed in a top-down manner.
n  Observation
      l    To perform top-down traversal, only intervals
            of words in each node are necessary

                    A [1,4]	
    C [8,10]	
              Aroot 1 3 6 8 2 5 7 1 2 7 4 5

                             4	
              >4	
            A [1,2]	
C [4,5]	
             A[1,2]	
 C[5,5]	
       Aleft 1 3 2 1 2 4            Aright 6 8 5 7 7 5
Similarity search	
n  Replace
          restricted inverted index Av in
  each node v with a bit array bv.
   l    bv[i]=1 if Av[i] goes to the right child


                	
 0	
 0	
 1	
1	
 0	
 1	
 1	
 0	
 0	
 1	
 0	
 1	
            Aroot 1 3 6 8 2 5 7 1 2 7 4 5
                             0	
                       1	

 bleft	
     0	
 1	
 0	
0	
 0	
 1	
      bright	
 0	
 1	
 0	
1	
 1	
 0	
 Aleft       1 3 2 1 2 4                 Aright 6 8 5 7 7 5
Similarity search	
n  Intervals       of child nodes can be computed by
      rank operations
l    sleft(v),j=rank0(bv,svj-1)+1,tleft(v),j=rank0(bv,tvj)
l    sright(v),j =rank1(bv,svj-1)+1,tright(v),j=rank1(bv,tvj)
                                               C [8,10]	
      Ex)
                    	
 0	
 0	
 1	
1	
 0	
 1	
 1	
 0	
 0	
 1	
 0	
 1	
                Aroot 1 3 6 8 2 5 7 1 2 7 4 5
rank0(broot,8-1)+1=4,                                            rank1(broot,8-1)+1=5,
rank0(broot,10)=5	
                                              rank1(broot,10)=5	
                        C [4,5]	
                               C [5,5]	
      bleft	
    0	
 1	
 0	
0	
 0	
 1	
       bright	
 0	
 1	
 0	
1	
 1	
 0	
      Aleft      1 3 2 1 2 4                  Aright 6 8 5 7 7 5
Wavelet Tree	
n Wavelet tree can be obtained to replace the restricted
Inverted indices with bit arrays
n Wavelet tree consists of bit arrays bv and initial
intervals Croot.

     Croot	
      A	
    B	
   C	
  D	
               0 0 1 1 0 1 1 0 0 1 0 1



         0 1 0 0 0 1             0 1 0 1 1 0



    0 1 0 1          0 1       1 0 0       1 0 0
Wavelet Tree	
n Graphids can be recovered from bit strings
 on the path from the root to leaves	
  Croot	
 A	
       B	
   C	
   D	
                 0 0 1 1 0 1 1 0 0 1 0 1

                     0	
                   1	

        0 1 0 0 0 1                0 1 0 1 1 0

        0	
           1	
           0	
           1	

  0 1 0 1                  0 1   1 0 0            1 0 0

  0	
      1	
        0	
 1	
    0	
 1	
          0	
 1	
 000    001          010 011 100 101             110 111
Memory	
n  (1+α)Nlogn          + MlogN bits
        Bit arrays bv	
 Initial intervals Croot	
      l  N: the number of all words in the database
      l  n: the number of graphs

      l  α: overhead for rank dictionary (α=0.6)

n    For inverted index, Nlogn bits
n    About 60% overhead to inverted index!!
Experiments	
n  25 million chemical compounds from
    PubChem database
n  Use search time and memory as
    evaluation measures
n  Compare our method gWT to
   l  inverted index
   l  sequential scan implemented in G-Hash

       (Wang et al, 2009)
Search time on 25 million graphs	
                              40 sec	
                              38 sec	




                              8 sec	
                              3 sec	
                              2 sec
Memory usage
Overhead of rank dictionary
Construction time
Related work	
•  A lot of methods have been proposed so far.
 1.gIndex [Yan et al., 04]
 2.Graph Grep [Shasha et al., 07]
 3.Tree+Delta [Zhao et al., 07]
 4.TreePi [Zhang et al., 07]
 5.Gstring [Jiang et al., 07]
 6.FG-Index [Cheng et al., 07]
 7.GDIndex [Williams et al., 07]
 etc
Related work 	
              •  Decompose graphs
                 into a set of
Decompose	
      substructures
               - subgraphs, trees,
          …	
    paths etc
              •  Build a substructure-
Index	
                 based index
Drawbacks	
•  Require frequent subgraph mining
•  Do not scale to millions of graphs
Summary	
n  Efficientsimilarity search method for
    massive graph databases
n  Solve semi-conjunctive query efficiently
n  Built on wavelet trees
n  Use Weisfeiler-Lehman procedure to
    convert graphs into bag-of-words
n  Applicable to more than 20 million graphs
n  Software
    http://code.google.com/p/gwt/
Acknowledgements	

•  Prof. Shin-ichi Minato (Hokkaido Univ.)
•  Dr. Daisuke Okanohara (PFI)
•  Members in ERATO Minato Project

More Related Content

What's hot

A new class of a stable implicit schemes for treatment of stiff
A new class of a stable implicit schemes for treatment of stiffA new class of a stable implicit schemes for treatment of stiff
A new class of a stable implicit schemes for treatment of stiff
Alexander Decker
 
Introduction to inverse problems
Introduction to inverse problemsIntroduction to inverse problems
Introduction to inverse problems
Delta Pi Systems
 
1 cb02e45d01
1 cb02e45d011 cb02e45d01
1 cb02e45d01
nehagurjar
 
Image denoising
Image denoisingImage denoising
Image denoising
Yap Wooi Hen
 
Multilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structureMultilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structure
VjekoslavKovac1
 
Datamining 6th svm
Datamining 6th svmDatamining 6th svm
Datamining 6th svm
sesejun
 
A Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cubeA Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cube
VjekoslavKovac1
 
Numerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis methodNumerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis method
Alexander Decker
 
cyclic_code.pdf
cyclic_code.pdfcyclic_code.pdf
cyclic_code.pdf
rahelbirhanu1
 
Tales on two commuting transformations or flows
Tales on two commuting transformations or flowsTales on two commuting transformations or flows
Tales on two commuting transformations or flows
VjekoslavKovac1
 
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
grssieee
 
New Mathematical Tools for the Financial Sector
New Mathematical Tools for the Financial SectorNew Mathematical Tools for the Financial Sector
New Mathematical Tools for the Financial Sector
SSA KPI
 
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Alex (Oleksiy) Varfolomiyev
 
Paraproducts with general dilations
Paraproducts with general dilationsParaproducts with general dilations
Paraproducts with general dilations
VjekoslavKovac1
 
Signal Processing Course : Convex Optimization
Signal Processing Course : Convex OptimizationSignal Processing Course : Convex Optimization
Signal Processing Course : Convex Optimization
Gabriel Peyré
 
Signal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems RegularizationSignal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems Regularization
Gabriel Peyré
 
Datamining 6th Svm
Datamining 6th SvmDatamining 6th Svm
Datamining 6th Svm
sesejun
 
Data Exchange over RDF
Data Exchange over RDFData Exchange over RDF
Data Exchange over RDF
net2-project
 
A Szemeredi-type theorem for subsets of the unit cube
A Szemeredi-type theorem for subsets of the unit cubeA Szemeredi-type theorem for subsets of the unit cube
A Szemeredi-type theorem for subsets of the unit cube
VjekoslavKovac1
 
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...
IJERA Editor
 

What's hot (20)

A new class of a stable implicit schemes for treatment of stiff
A new class of a stable implicit schemes for treatment of stiffA new class of a stable implicit schemes for treatment of stiff
A new class of a stable implicit schemes for treatment of stiff
 
Introduction to inverse problems
Introduction to inverse problemsIntroduction to inverse problems
Introduction to inverse problems
 
1 cb02e45d01
1 cb02e45d011 cb02e45d01
1 cb02e45d01
 
Image denoising
Image denoisingImage denoising
Image denoising
 
Multilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structureMultilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structure
 
Datamining 6th svm
Datamining 6th svmDatamining 6th svm
Datamining 6th svm
 
A Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cubeA Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cube
 
Numerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis methodNumerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis method
 
cyclic_code.pdf
cyclic_code.pdfcyclic_code.pdf
cyclic_code.pdf
 
Tales on two commuting transformations or flows
Tales on two commuting transformations or flowsTales on two commuting transformations or flows
Tales on two commuting transformations or flows
 
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
 
New Mathematical Tools for the Financial Sector
New Mathematical Tools for the Financial SectorNew Mathematical Tools for the Financial Sector
New Mathematical Tools for the Financial Sector
 
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
 
Paraproducts with general dilations
Paraproducts with general dilationsParaproducts with general dilations
Paraproducts with general dilations
 
Signal Processing Course : Convex Optimization
Signal Processing Course : Convex OptimizationSignal Processing Course : Convex Optimization
Signal Processing Course : Convex Optimization
 
Signal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems RegularizationSignal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems Regularization
 
Datamining 6th Svm
Datamining 6th SvmDatamining 6th Svm
Datamining 6th Svm
 
Data Exchange over RDF
Data Exchange over RDFData Exchange over RDF
Data Exchange over RDF
 
A Szemeredi-type theorem for subsets of the unit cube
A Szemeredi-type theorem for subsets of the unit cubeA Szemeredi-type theorem for subsets of the unit cube
A Szemeredi-type theorem for subsets of the unit cube
 
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...
 

Viewers also liked

GIW2013
GIW2013GIW2013
GIW2013
Yasuo Tabei
 
Sketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - public
Yasuo Tabei
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20Yasuo Tabei
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabei
Yasuo Tabei
 
DCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant Space
Yasuo Tabei
 
Sketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-public
Yasuo Tabei
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
Yasuo Tabei
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009
Yasuo Tabei
 
Lgm saarbrucken
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
Yasuo Tabei
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesYasuo Tabei
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Yasuo Tabei
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 public
Yasuo Tabei
 
異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法
MapR Technologies Japan
 
ウェーブレット木の世界
ウェーブレット木の世界ウェーブレット木の世界
ウェーブレット木の世界
Preferred Networks
 
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
Shirou Maruyama
 
bigdata2012nlp okanohara
bigdata2012nlp okanoharabigdata2012nlp okanohara
bigdata2012nlp okanohara
Preferred Networks
 
мобильный смарт фитнес
мобильный смарт фитнесмобильный смарт фитнес
мобильный смарт фитнес
Интернет магазин Staleks.SU
 
Grammar Tenses.pptx
Grammar Tenses.pptxGrammar Tenses.pptx
Grammar Tenses.pptx
wendyvinueza
 
Com Ensenyar Llengua A Xinesos Xiv Tallers
Com Ensenyar Llengua A Xinesos Xiv TallersCom Ensenyar Llengua A Xinesos Xiv Tallers
Com Ensenyar Llengua A Xinesos Xiv TallersArnau Cerdà
 

Viewers also liked (20)

GIW2013
GIW2013GIW2013
GIW2013
 
Sketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - public
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabei
 
DCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant Space
 
Sketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-public
 
Lp Boost
Lp BoostLp Boost
Lp Boost
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009
 
Lgm saarbrucken
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 public
 
異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法
 
ウェーブレット木の世界
ウェーブレット木の世界ウェーブレット木の世界
ウェーブレット木の世界
 
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
 
bigdata2012nlp okanohara
bigdata2012nlp okanoharabigdata2012nlp okanohara
bigdata2012nlp okanohara
 
мобильный смарт фитнес
мобильный смарт фитнесмобильный смарт фитнес
мобильный смарт фитнес
 
Grammar Tenses.pptx
Grammar Tenses.pptxGrammar Tenses.pptx
Grammar Tenses.pptx
 
Com Ensenyar Llengua A Xinesos Xiv Tallers
Com Ensenyar Llengua A Xinesos Xiv TallersCom Ensenyar Llengua A Xinesos Xiv Tallers
Com Ensenyar Llengua A Xinesos Xiv Tallers
 

Similar to Dmss2011 public

Partitions
PartitionsPartitions
Partitions
Nicholas Teff
 
SISAP17
SISAP17SISAP17
SISAP17
Yasuo Tabei
 
Vector Space.pptx
Vector Space.pptxVector Space.pptx
Vector Space.pptx
Dr Ashok Tiwari
 
Algorithm Exam Help
Algorithm Exam HelpAlgorithm Exam Help
Algorithm Exam Help
Programming Exam Help
 
[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes
Deep Learning JP
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processes
Kazuki Fujikawa
 
Bolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarrayBolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarray
Jen Aman
 
Review session2
Review session2Review session2
Review session2
NEEDY12345
 
Algorithm Assignment Help
Algorithm Assignment HelpAlgorithm Assignment Help
Algorithm Assignment Help
Programming Homework Help
 
Estimating the Evolution Direction of Populations to Improve Genetic Algorithms
Estimating the Evolution Direction of Populations to Improve Genetic AlgorithmsEstimating the Evolution Direction of Populations to Improve Genetic Algorithms
Estimating the Evolution Direction of Populations to Improve Genetic Algorithms
Annibale Panichella
 
Pre-Cal 20S January 20, 2009
Pre-Cal 20S January 20, 2009Pre-Cal 20S January 20, 2009
Pre-Cal 20S January 20, 2009
Darren Kuropatwa
 
Thesis defense
Thesis defenseThesis defense
Thesis defense
Zheng Mengdi
 
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
inventy
 
Skew Products on Directed Graphs
Skew Products on Directed GraphsSkew Products on Directed Graphs
Skew Products on Directed Graphs
Caridad Arroyo
 
Data Structures and Agorithm: DS 21 Graph Theory.pptx
Data Structures and Agorithm: DS 21 Graph Theory.pptxData Structures and Agorithm: DS 21 Graph Theory.pptx
Data Structures and Agorithm: DS 21 Graph Theory.pptx
RashidFaridChishti
 
Q
QQ
ALG5.1.ppt
ALG5.1.pptALG5.1.ppt
ALG5.1.ppt
AbhinavAshish17
 
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
Computer Science Club
 
Linear sorting
Linear sortingLinear sorting
Linear sorting
Krishna Chaytaniah
 
10_rnn.pdf
10_rnn.pdf10_rnn.pdf

Similar to Dmss2011 public (20)

Partitions
PartitionsPartitions
Partitions
 
SISAP17
SISAP17SISAP17
SISAP17
 
Vector Space.pptx
Vector Space.pptxVector Space.pptx
Vector Space.pptx
 
Algorithm Exam Help
Algorithm Exam HelpAlgorithm Exam Help
Algorithm Exam Help
 
[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processes
 
Bolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarrayBolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarray
 
Review session2
Review session2Review session2
Review session2
 
Algorithm Assignment Help
Algorithm Assignment HelpAlgorithm Assignment Help
Algorithm Assignment Help
 
Estimating the Evolution Direction of Populations to Improve Genetic Algorithms
Estimating the Evolution Direction of Populations to Improve Genetic AlgorithmsEstimating the Evolution Direction of Populations to Improve Genetic Algorithms
Estimating the Evolution Direction of Populations to Improve Genetic Algorithms
 
Pre-Cal 20S January 20, 2009
Pre-Cal 20S January 20, 2009Pre-Cal 20S January 20, 2009
Pre-Cal 20S January 20, 2009
 
Thesis defense
Thesis defenseThesis defense
Thesis defense
 
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
 
Skew Products on Directed Graphs
Skew Products on Directed GraphsSkew Products on Directed Graphs
Skew Products on Directed Graphs
 
Data Structures and Agorithm: DS 21 Graph Theory.pptx
Data Structures and Agorithm: DS 21 Graph Theory.pptxData Structures and Agorithm: DS 21 Graph Theory.pptx
Data Structures and Agorithm: DS 21 Graph Theory.pptx
 
Q
QQ
Q
 
ALG5.1.ppt
ALG5.1.pptALG5.1.ppt
ALG5.1.ppt
 
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
 
Linear sorting
Linear sortingLinear sorting
Linear sorting
 
10_rnn.pdf
10_rnn.pdf10_rnn.pdf
10_rnn.pdf
 

Recently uploaded

Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 

Recently uploaded (20)

Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 

Dmss2011 public

  • 1. March 30, 2011, The 5th International Workshop on Data-Mining and Statistical Science@Osaka University Kernel-based Similarity Search in Massive Graph Databases with Wavelet Trees Yasuo Tabei (JST ERATO Minato Project) Joint work with Koji Tsuda (AIST)
  • 2. Outline n  Introduction l  Recent development of graph databases l  Needs for graph similarity search l  Bag-of-words representation of a graph l  Semi-conjunctive query n  Method l  Scalable similarity search with wavelet trees n  Experiments l  Use a large-scale graph dataset l  25 million chemical compounds
  • 3. Graphs are everywhere Gene co-expression network RNA 2D structure Protein 3D structure Chemical compound Social Network
  • 4. Graph Similarity Search n  Retrievegraphs similar to the query n  Large databases l  More than 20 million chemical compounds in PubChem database n  Bag-of-words representation of graphs l  WL procedure (NIPS 2009) n  Why not use document retrieval methods? l  Inverted index l  Not that easy (explained later)
  • 5. Weisfeiler-Lehman Procedure (NIPS,09) n Convert a graph into a set of words (bag-of-words) i) Make a label set of adjacent A vertices ex) {E,A,D} B E ii) Sort ex) A,D,E iii) Add the vertex label as a prefix D ex) B,A,D,E A iv) Map the label sequence to a R E unique value ex) B,A,D,E R D v) Assign the value as the new Bag-of-words {A,B,D,E,…,R,…} vertex label
  • 6. Search by cosine similarity n  Identify all graphs in the database whose cosine is larger than a threshold 1-ε | Wi !Q | Wi s.t K N (Wi ,Q) = ! 1" ! | Wi | | Q | l  Wi, Q: bag-of-words of graphs n  The above solution can be relaxed as follows, If K (Q,W ) ! 1" ! , then N |Q | (1! ! )2 | Q |!| W |! (1! ! )2 l  Can be used for fast search
  • 7. Semi-conjunctive query n Cosine query can be relaxed to the following form Wi s.t | Wi !Q |" k l  The number of common words between two bag-of-words Wi and Q Ex) |Wi Q|=|(A,C,E,F,H) (A,E,I,J,L)| =|(A,E)|=2 l  k=(1-ε)2|Q| l No false negatives l False positives can easily be filtered out by cosine calculations
  • 8. Inverted index n  Innatural language processing, inverted index has been used to solve semi- conjunctive query Inverted Index n  Associative map l  key word l  value graph identifiers including a word
  • 9. Bottom-up search Inverted Index i) Look the index up with query bag-of-words ii) Aggregate all the lists of graph indices Query:(A,C,E) iii) Sort Aggregation )Scan (2,8,13,15,8,10,16,4,9,13,14) Sort (2,4,8,8,9,10,13,13,14,15,16) k=
  • 10. Search time of inverted index on 25 million graphs n Searchtime of inverted index is not so different from that of sequencial scan 40 sec 38 sec
  • 11. Why? n  Each word is not Query:(A,C,E) specific enough Aggregation Query contains 1000s n  (2,8,13,15,8,10,16,4,9,13,14) of words Sort n  Aggregated array is (2,4,8,8,9,10,13,13,14,15,16) VERY long n  Sorting takes O(ClogC) C in time
  • 12. Overview of our method n  Top-down search in a tree over the series of graphs n  Huge memory, if tree is implemented with pointers n  Wavelet Tree: Succinct data structure n  The smaller the similarity threshold is, the quicker the algorithm finishes •  Not the case in inverted index
  • 13. Binary tree over graphs n  leaf graph [1,8] 0 n  node interval 1 [1,4] [5,8] n  Each node is identified 0 1 0 1 by a bit string (v={01}) n  At the leaves, the graph [1,2] [3,4] [5,6] [7,8] indices correspond to 0 1 0 1 0 1 0 1 int(v)+1 1 2 3 4 5 6 7 8 {000} 001} { {010} {100} {110} {011} {101} {111} (e.g., int(010)+1=2+1=3)
  • 14. Summarization of bag-of-words Represent bag-of-words as a bit array n  12345678 Ex) Wi=(1,3,4,7,8) xi=(1,0,1,1,0,0,1,1) n  Take disjunction of all bit arrays in the interval of a node v Ex) For an interval [1,4] X1=(0,1,0,0,0,0,1,0) yv=x1 x2 x3 x4 X2=(1,0,1,1,0,0,0,0) =(1,1,1,1,0,0,1,1) X3=(1,0,0,0,0,0,1,1) X4=(1,0,0,0,0,0,0,1)
  • 15. Binary tree over graphs yv=111111 n  Assign to each node [1,8] v a bit arrays yv yv=110111 yv=101101 n  yv : bit array [1,4] [5,8] l  i-th bit is 1 if graphs yv=010101 yv=110100 v=100100 yv=001101 y in an interval have [1,2] [3,4] [5,6] [7,8] the corresponding word. 1 2 3 4 5 6 7 8 {000} 001} { {010} {100} {110} {011} {101} {111}
  • 16. Top-down traversal yv=111111 n  Q: bag-of-words of a [1,8] query vy =110111 y =101101 v n  Perform top-down [1,4] [5,8] traversal y =010101 y =110100 y =100100 y =001101 n  Prune the search v v v v [1,2] [3,4] [5,6] [7,8] space if # y [ j] ! k j"Q v n  The larger k is, the 1 2 3 4 5 6 7 8 {000} 001} { {010} {100} {110} {011} {101} {111} smaller the search space is
  • 17. Huge Memory n  Time is O(τm) : Very fast l  τ: the number of traversed node l  m: the number of bag-of-words in a query n  Space is O(Mnlogn) bit l  M: the number of unique words l  n: the number of graphs
  • 18. Wavelet Tree! (SODA,03) n  Replace yv in each node v by a rank dictionary l  explained in next slides n  Implement a tree without using pointers n  Only 60% memory overhead compared to the inverted index (Vigna,08) n  Access to the summary information in any internal node
  • 19. Rank dictionary (Raman,02) n  Give bit array B[1,n] the following operation: l  rankc(B,i): return the number of c {0,1} in B[1…i] Ex) B=0110011100 i 1 2 3 4 5 6 7 8 9 10 rank1(B,8)=5 0 1 1 0 0 1 1 1 0 0 rank0(B,5)=3 0 1 1 0 0 1 1 1 0 0
  • 20. Implementation of rank dictionary n  Divide the bit array B into B large blocks of length l=log2n RL=Ranks of large blocks RL n  Divide each large block to small blocks of length s=logn/2 Rs=Ranks of small blocks RS relative to the large block rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank) Time:O(1) Memory: n +o(n) bits
  • 21. Restricted inverted index Inverted Index n  Concatenate graph ids for words in the root n  Restrict the inverted index for the interval [sv,tv] of a node v [1,8] A B C D Aroot 1 3 6 8 2 5 7 1 2 7 4 5 4 >4 [1,4] A B C D A B C D [4,8] Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
  • 22. Whole structure of restricted inverted index [1,8] A B C D 1 3 6 8 2 5 7 1 2 7 4 5 [1,4] A B C D A B C D [5,8] 1 3 2 1 2 4 6 8 5 7 7 5 [1,2] [3,4] [5,6] [6,7] A B C A D A B D A B C 1 2 1 2 3 4 6 5 5 8 7 7 A C B C A D B D A B C A 1 1 2 2 3 4 5 5 6 7 7 8
  • 23. Similarity search n  To retrieve graphs similar to a query Q=(A,C), the tree is traversed in the top- down manner. [1,8] A C Aroot 1 3 6 8 2 5 7 1 2 7 4 5 4 >4 [1,4] A C [5,8] A C Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
  • 24. Similarity search n  To retrieve graphs similar to a query Q=(A,C), the tree is traversed in a top-down manner. n  Observation l  To perform top-down traversal, only intervals of words in each node are necessary A [1,4] C [8,10] Aroot 1 3 6 8 2 5 7 1 2 7 4 5 4 >4 A [1,2] C [4,5] A[1,2] C[5,5] Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
  • 25. Similarity search n  Replace restricted inverted index Av in each node v with a bit array bv. l  bv[i]=1 if Av[i] goes to the right child 0 0 1 1 0 1 1 0 0 1 0 1 Aroot 1 3 6 8 2 5 7 1 2 7 4 5 0 1 bleft 0 1 0 0 0 1 bright 0 1 0 1 1 0 Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
  • 26. Similarity search n  Intervals of child nodes can be computed by rank operations l  sleft(v),j=rank0(bv,svj-1)+1,tleft(v),j=rank0(bv,tvj) l  sright(v),j =rank1(bv,svj-1)+1,tright(v),j=rank1(bv,tvj) C [8,10] Ex) 0 0 1 1 0 1 1 0 0 1 0 1 Aroot 1 3 6 8 2 5 7 1 2 7 4 5 rank0(broot,8-1)+1=4, rank1(broot,8-1)+1=5, rank0(broot,10)=5 rank1(broot,10)=5 C [4,5] C [5,5] bleft 0 1 0 0 0 1 bright 0 1 0 1 1 0 Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
  • 27. Wavelet Tree n Wavelet tree can be obtained to replace the restricted Inverted indices with bit arrays n Wavelet tree consists of bit arrays bv and initial intervals Croot. Croot A B C D 0 0 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 1 0 0
  • 28. Wavelet Tree n Graphids can be recovered from bit strings on the path from the root to leaves Croot A B C D 0 0 1 1 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 0 1 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 1 000 001 010 011 100 101 110 111
  • 29. Memory n  (1+α)Nlogn + MlogN bits Bit arrays bv Initial intervals Croot l  N: the number of all words in the database l  n: the number of graphs l  α: overhead for rank dictionary (α=0.6) n  For inverted index, Nlogn bits n  About 60% overhead to inverted index!!
  • 30. Experiments n  25 million chemical compounds from PubChem database n  Use search time and memory as evaluation measures n  Compare our method gWT to l  inverted index l  sequential scan implemented in G-Hash (Wang et al, 2009)
  • 31. Search time on 25 million graphs 40 sec 38 sec 8 sec 3 sec 2 sec
  • 33. Overhead of rank dictionary
  • 35. Related work •  A lot of methods have been proposed so far. 1.gIndex [Yan et al., 04] 2.Graph Grep [Shasha et al., 07] 3.Tree+Delta [Zhao et al., 07] 4.TreePi [Zhang et al., 07] 5.Gstring [Jiang et al., 07] 6.FG-Index [Cheng et al., 07] 7.GDIndex [Williams et al., 07] etc
  • 36. Related work •  Decompose graphs into a set of Decompose substructures - subgraphs, trees, … paths etc •  Build a substructure- Index based index
  • 37. Drawbacks •  Require frequent subgraph mining •  Do not scale to millions of graphs
  • 38. Summary n  Efficientsimilarity search method for massive graph databases n  Solve semi-conjunctive query efficiently n  Built on wavelet trees n  Use Weisfeiler-Lehman procedure to convert graphs into bag-of-words n  Applicable to more than 20 million graphs n  Software http://code.google.com/p/gwt/
  • 39. Acknowledgements •  Prof. Shin-ichi Minato (Hokkaido Univ.) •  Dr. Daisuke Okanohara (PFI) •  Members in ERATO Minato Project