ALSIP, Dec. 1 2011


Kernel-based similarity search
 in massive graph databases
     with wavelet trees
         Yasuo Tabei and Koji Tsuda
       JST ERATO Minato Project,
 National Institute of Advanced Industrial
         Science and Technology
Outline
• Overview
• Wavelet Tree
  ✓ Problem = Range intersection on array
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Graph similarity search
• Similarity search for 25 million molecular
  graphs
 ✓ Find all graphs whose similarity to the query is at least 1 − ε
 ✓ Similarity = Weisfeiler-Lehman kernel (NIPS, 2009)

• Use a data structure called the “Wavelet
  Tree” (SODA, 2003)
 ✓ Self-index of an integer array
 ✓ Enables fast array operations
 ‣ e.g., range minimum query, range intersection
Range intersection on array
•   Array A of length N, 1 ≤ A_i ≤ M

                       i   j       k ℓ
        A          1 3 6 8 2 5 7 1 2 7 4 5


• Range intersection: rint(A, [i,j], [k,ℓ])
    ✓ Find common elements of A[i,j] and A[k,ℓ]

• The naive method is to concatenate the two ranges and sort
    Ex) concatenate: 6,8,2,2,7 ⇛ sort: 2,2,6,7,8 ⇛ common element: 2

• Use a wavelet tree to solve the problem faster
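The naive method above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name rint_naive is hypothetical, and the positions i=3, j=5, k=9, ℓ=10 (1-based, inclusive) are an assumption read off the example values 6,8,2 and 2,7.

```python
def rint_naive(A, i, j, k, l):
    """Common values of A[i..j] and A[k..l] (1-based, inclusive).

    Naive approach: concatenate both ranges, sort, then scan for values
    that occur in both ranges.
    """
    # Tag each element with the range it came from, then sort by value.
    tagged = sorted([(v, 0) for v in A[i - 1:j]] + [(v, 1) for v in A[k - 1:l]])
    common = []
    pos = 0
    while pos < len(tagged):
        value = tagged[pos][0]
        sources = set()
        # Consume the run of equal values and note which ranges they came from.
        while pos < len(tagged) and tagged[pos][0] == value:
            sources.add(tagged[pos][1])
            pos += 1
        if len(sources) == 2:
            common.append(value)
    return common


A = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
result = rint_naive(A, 3, 5, 9, 10)  # ranges 6,8,2 and 2,7
print(result)
```

The cost is dominated by the sort, so it grows with the total length of the two ranges; this is the baseline the wavelet tree is meant to beat.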
Tree of subarrays:
Lower half=left, Higher half=right
            [1,8]
                1 3 6 8 2 5 7 1 7 2 4 5

    [1,4]                                             [5,8]
            1 3 2 1 2 4                 6 8 5 7 7 5

  [1,2]                   [3,4] [5,6]                 [7,8]
     1 2 1 2          3 4          6 5 5        8 7 7


   1 1       2 2      3      4     5 5     6    7 7      8
Record whether each element is
 in the lower half (0) or the higher half (1)
            [1,8]
                  0 0 1 1 0 1 1 0 1 0 0 1

    [1,4]                                             [5,8]
            0 1 0 0 0 1                 0 1 0 1 1 0


  [1,2]                   [3,4] [5,6]               [7,8]
        0 1 0 1           0 1      1 0 0        1 0 0


    1         2         3   4       5    6      7     8
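The construction on these two slides can be sketched in a few lines. This is an illustrative sketch only: the nested-dict layout and the name build are assumptions, and the bits are kept as plain Python lists rather than in a succinct representation.

```python
def build(values, lo, hi):
    """Build a wavelet tree node for the value range [lo, hi].

    Each internal node stores one bit per element: 0 if the element falls
    in the lower half of the range, 1 if it falls in the higher half.
    """
    if lo == hi:
        return {"range": (lo, hi), "bits": None}  # leaf: a single value
    mid = (lo + hi) // 2
    bits = [0 if v <= mid else 1 for v in values]
    lower = [v for v in values if v <= mid]   # goes to the left child
    higher = [v for v in values if v > mid]   # goes to the right child
    return {
        "range": (lo, hi),
        "bits": bits,
        "left": build(lower, lo, mid),
        "right": build(higher, mid + 1, hi),
    }


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
root = build(A, 1, 8)
print(root["bits"])           # bit array of node [1,8]
print(root["left"]["bits"])   # bit array of node [1,4]
print(root["right"]["bits"])  # bit array of node [5,8]
```

Only the bit arrays need to be kept; the subarrays themselves can be discarded once the children are built.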
Index each bit array
      with a rank dictionary
• Using a rank dictionary, the rank operation can be
  performed in O(1) time
  ✓ rank_c(B, i): return the number of occurrences of c ∈ {0, 1} in B[1,i]

• Several methods are known, e.g., rank9sel (Vigna, 2008)
• Example) B = 0110011100
                    i 1 2 3 4 5 6 7 8 9 10
                    B 0 1 1 0 0 1 1 1 0 0
 rank_1(B, 8) = 5     (five 1s in B[1,8] = 01100111)
 rank_0(B, 5) = 3     (three 0s in B[1,5] = 01100)
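A toy rank dictionary illustrating the interface. Note the assumption: this sketch stores a full prefix-count table, which gives O(1) rank but uses far more space than succinct structures such as rank9sel; the class name RankDict is hypothetical.

```python
class RankDict:
    """O(1) rank over a bit array, using a plain prefix-count table."""

    def __init__(self, bits):
        # ones[i] = number of 1s in B[1..i] (1-based); ones[0] = 0.
        self.ones = [0]
        for b in bits:
            self.ones.append(self.ones[-1] + b)

    def rank1(self, i):
        """Number of 1s in B[1..i]."""
        return self.ones[i]

    def rank0(self, i):
        """Number of 0s in B[1..i]."""
        return i - self.ones[i]


B = RankDict([int(c) for c in "0110011100"])
print(B.rank1(8), B.rank0(5))
```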
O(1)-division of an interval
• Using the rank operation, the division of an
  interval can be done in constant time
    ✓ rank0 for left child and rank1 for right child

•   Naive division takes time linear in the number of elements


          [1,8]
           Aroot 1 3 6 8 2 5 7 1 7 2 4 5

                   rank0                       rank1
      [1,4]                      [5,8]
       Aleft   1 3 2 1 2 4        Aright 6 8 5 7 7 5
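A sketch of the interval division, assuming 1-based inclusive intervals and the usual mapping [s, e] → [rank(s−1)+1, rank(e)]. The function name split_interval is hypothetical, and the prefix table is rebuilt on every call only to keep the example self-contained; a real index would precompute the rank dictionary once.

```python
from itertools import accumulate


def split_interval(bits, s, e):
    """Map interval [s, e] of a node to the matching intervals in its children.

    Returns (left, right); an interval with start > end is empty.
    """
    ones = [0] + list(accumulate(bits))  # ones[i] = number of 1s in bits[1..i]

    def rank1(i):
        return ones[i]

    def rank0(i):
        return i - ones[i]

    left = (rank0(s - 1) + 1, rank0(e))    # elements sent to the lower half
    right = (rank1(s - 1) + 1, rank1(e))   # elements sent to the higher half
    return left, right


# Root bits for Aroot = 1 3 6 8 2 5 7 1 7 2 4 5 over the value range [1,8].
root_bits = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
left, right = split_interval(root_bits, 3, 5)  # Aroot[3,5] = 6 8 2
print(left, right)
```

For Aroot[3,5] = 6 8 2 this yields position 3 of Aleft (the value 2) and positions 1 to 2 of Aright (the values 6 8).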
Fast computation of range
   intersection by pruning
Pruned      [1,8]
                  1 3 6 8 2 5 7 1 7 2 4 5

    [1,4]                                                [5,8]
            1 3 2 1 2 4                    6 8 5 7 7 5

  [1,2]                      [3,4] [5,6]             [7,8]
     1 2 1 2             3 4          6 5 5        8 7 7


    1 1       2 2        3      4     5 5     6    7 7     8

            solution!!
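The pruned traversal can be sketched as below. This is a simplified illustration under stated assumptions, not the gWT implementation: the Node class and rint function are hypothetical names, rank is answered from a plain prefix-count table, and intervals are 1-based and inclusive.

```python
from itertools import accumulate


class Node:
    """Wavelet tree node over the value range [lo, hi]."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo < hi:
            mid = (lo + hi) // 2
            bits = [0 if v <= mid else 1 for v in values]
            self.ones = [0] + list(accumulate(bits))  # prefix counts of 1s
            self.left = Node([v for v in values if v <= mid], lo, mid)
            self.right = Node([v for v in values if v > mid], mid + 1, hi)

    def split(self, s, e):
        """Map [s, e] in this node to (left interval, right interval)."""
        r1s, r1e = self.ones[s - 1], self.ones[e]
        return ((s - 1 - r1s) + 1, e - r1e), (r1s + 1, r1e)


def rint(node, r1, r2, out):
    """Append to `out` every value present in both intervals r1 and r2."""
    # Prune: if either interval is empty, nothing below can be common.
    if r1[0] > r1[1] or r2[0] > r2[1]:
        return
    if node.lo == node.hi:  # leaf reached with both intervals non-empty
        out.append(node.lo)
        return
    l1, h1 = node.split(*r1)
    l2, h2 = node.split(*r2)
    rint(node.left, l1, l2, out)
    rint(node.right, h1, h2, out)


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
root = Node(A, 1, 8)
found = []
rint(root, (3, 5), (9, 10), found)  # A[3,5] = 6 8 2 and A[9,10] = 7 2
print(found)
```

Subtrees such as [3,4] and [5,6] are never descended into for this query, because one of the two mapped intervals becomes empty there.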
Outline
• Overview
• Wavelet Tree
  ✓ Problem = Range intersection on array
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Graph Similarity Search
• Bag-of-words representation of graph
   ✓ Weisfeiler-Lehman procedure (NIPS, 2009), Hido and
     Kashima (ICDM, 2009), Wang et al., (EDBT, 2009)


                      W=(A,D,E,H)


• Cosine similarity query
 ✓ Find all graphs W whose cosine similarity (kernel) to
   the query Q is at least 1 − ε
Weisfeiler-Lehman Procedure (NIPS,09)
•   Convert a graph into a set of words (bag-of-words)
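A rough sketch of the idea, assuming a graph given as an adjacency dict plus node labels. The function name wl_words, the string encoding of relabeled nodes, and the default iteration count are all illustrative assumptions; the slides do not specify how words are encoded.

```python
def wl_words(adj, labels, iterations=1):
    """Return the set of words produced by Weisfeiler-Lehman style relabeling.

    adj:    node -> list of neighbour nodes
    labels: node -> initial label (string)
    """
    current = dict(labels)
    words = set(current.values())
    for _ in range(iterations):
        nxt = {}
        for v, nbrs in adj.items():
            # New label = own label plus the sorted labels of the neighbours.
            neighbour_part = ",".join(sorted(current[u] for u in nbrs))
            nxt[v] = current[v] + "(" + neighbour_part + ")"
        current = nxt
        words.update(current.values())
    return words


# Toy path graph C - O - C (hypothetical example, not from the slides).
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "O", 2: "C"}
W = wl_words(adj, labels, iterations=1)
print(sorted(W))
```

Sorting the neighbour labels makes the resulting word independent of the order in which neighbours are listed.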
Semi-conjunctive query
• Cosine similarity query can be relaxed to
  the following form
           W s.t. |W ∩ Q| ≥ k
  ✓ Find all graphs W which share at least k words
    with the query Q

• No false negatives
• False positives can easily be filtered out by
  cosine calculations
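A small sketch of the two-stage search: a semi-conjunctive filter followed by exact cosine checks. The slide does not say how k is chosen; the choice k = ⌈(1 − ε)² |Q|⌉ below is an assumption, justified because cosine ≥ 1 − ε together with |W| ≥ |W ∩ Q| implies |W ∩ Q| ≥ (1 − ε)² |Q|. Function names and the toy database are hypothetical.

```python
import math


def cosine(w, q):
    """Cosine similarity of two non-empty word sets: |W ∩ Q| / sqrt(|W| |Q|)."""
    return len(w & q) / math.sqrt(len(w) * len(q))


def search(db, q, eps):
    """Return (k, candidates, results) for the cosine threshold 1 - eps."""
    # Assumed k: the small tolerance guards against floating-point round-up.
    k = math.ceil((1 - eps) ** 2 * len(q) - 1e-9)
    # Stage 1: semi-conjunctive query, may contain false positives.
    candidates = [gid for gid, w in db.items() if len(w & q) >= k]
    # Stage 2: exact cosine check removes the false positives.
    results = [gid for gid in candidates if cosine(db[gid], q) >= 1 - eps]
    return k, candidates, results


db = {
    1: {"A", "D", "E", "H"},
    2: {"A", "D", "E", "X"},
    3: {"A", "B"},
    5: {"A", "D", "P", "Q", "R", "S", "T", "U"},
}
Q = {"A", "D", "E", "H"}
k, candidates, results = search(db, Q, eps=0.3)
print(k, candidates, results)
```

Graph 5 shares enough words to pass the first stage but is large, so its cosine falls below the threshold and it is dropped in the second stage.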
Inverted index, Array, Wavelet Tree
• Inverted index is built from the
  graph database
• Concatenate all rows to make
  an array
• Index the array with a wavelet
  tree
• Semi-conjunctive query =
  extension of range intersection
    ✓ Find graph ids which appear at
      least k times in given intervals

[Figure: inverted index → array Aroot = 1 3 6 8 2 5 7 1 2 7 4 5 → Wavelet Tree]
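The reduction from inverted index to array can be sketched as follows. The counting step here is a plain scan with a Counter, shown only to make the query semantics concrete; it is not the wavelet tree traversal. All names and the toy database are hypothetical.

```python
from collections import Counter


def build_array(db):
    """Concatenate inverted-index rows into one array of graph ids.

    db: graph id -> set of words.
    Returns (array, intervals), where intervals[word] = (start, end) is the
    1-based inclusive span of that word's row inside the array.
    """
    postings = {}
    for gid in sorted(db):
        for word in db[gid]:
            postings.setdefault(word, []).append(gid)
    array, intervals = [], {}
    for word in sorted(postings):
        start = len(array) + 1
        array.extend(postings[word])
        intervals[word] = (start, len(array))
    return array, intervals


def semi_conjunctive(array, intervals, query, k):
    """Graph ids that appear at least k times in the query words' intervals."""
    counts = Counter()
    for word in query:
        if word in intervals:
            s, e = intervals[word]
            counts.update(array[s - 1:e])
    return sorted(g for g, c in counts.items() if c >= k)


db = {1: {"A", "D"}, 2: {"A", "E"}, 3: {"D", "E", "H"}}
array, intervals = build_array(db)
hits = semi_conjunctive(array, intervals, {"D", "E", "H"}, k=2)
print(array, intervals, hits)
```

Because each graph id occurs at most once per row, the number of times an id appears across the query's intervals equals the number of words it shares with the query.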
Pruning search space
• Find all graphs W in the database whose cosine
    to a query Q is at least a threshold 1 − ε

       W s.t. K_N(W, Q) = |W ∩ Q| / √(|W| |Q|) ≥ 1 − ε

    ✓ W, Q: bag-of-words of graphs
• The above condition can be relaxed as follows

    If K_N(W, Q) ≥ 1 − ε, then

          (1 − ε)² |Q| ≤ |W| ≤ |Q| / (1 − ε)²

    ✓ Can be used for pruning search space
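The size bound translates directly into a cheap filter on |W|. A minimal sketch, assuming ε < 1 and that the number of words per graph is known in advance; the function names and the sample sizes are hypothetical.

```python
import math


def size_bounds(q_size, eps):
    """Smallest and largest |W| that can still reach cosine >= 1 - eps."""
    # Small tolerances guard against floating-point rounding at exact integers.
    lo = math.ceil((1 - eps) ** 2 * q_size - 1e-9)
    hi = math.floor(q_size / (1 - eps) ** 2 + 1e-9)
    return lo, hi


def prune_by_size(sizes, q_size, eps):
    """Keep only graph ids whose word count lies within the bounds."""
    lo, hi = size_bounds(q_size, eps)
    return [gid for gid, n in sizes.items() if lo <= n <= hi]


bounds = size_bounds(10, 0.3)
kept = prune_by_size({1: 4, 2: 5, 3: 12, 4: 20, 5: 21}, q_size=10, eps=0.3)
print(bounds, kept)
```

Graphs outside the bounds cannot satisfy the cosine condition, so skipping them introduces no false negatives.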
Complexity
• Time per query: O(τm)
 •   τ: the number of traversed nodes
 •   m: the number of words in the query

• Memory: (1+α)N log n + M log N bits
 •   N: the number of all words in the database
 •   M: maximum integer in the array
 •   n: the number of graphs
 •   α: overhead for rank dictionary (α=0.6)

• Inverted index takes N log n bits
• About 60% overhead compared to the inverted index!
Outline
• Overview
• A data-structure
  ✓ Wavelet Tree
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Experiments

• 25 million chemical compounds from PubChem
  database
• Evaluate search time and memory usage
• Cosine threshold ε = 0.3, 0.35, 0.4
• Compare our method gWT to
 ✓   Inverted index (concatenate all intervals and sort)
 ✓   Sequential scan (Compute similarity one by one)
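Search time
[Figure: search time comparison; values shown: 40 sec, 38 sec, 8 sec, 3 sec, 2 sec]
Memory usage
[Figure: memory usage; value shown: 20GB]
Construction time
[Figure: construction time; value shown: 7h]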
Summary
• Efficient similarity search method for
    massive graph databases
• Solves semi-conjunctive queries efficiently
• Built on the wavelet tree
• Uses the Weisfeiler-Lehman procedure to
    represent graphs as bag-of-words
• Applicable to 25 million graphs
• Software:
    http://code.google.com/p/gwt
