Gwt presen alsip-20111201

Yasuo Tabei, Researcher, Software Developer at Japan Science and Technology Agency
ALSIP, Dec. 1 2011


Kernel-based similarity search in massive graph databases with wavelet trees
Yasuo Tabei and Koji Tsuda
JST ERATO Minato Project,
National Institute of Advanced Industrial Science and Technology
Outline
• Overview
• Wavelet Tree
  ✓ Problem = Range intersection on array
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Graph similarity search
• Similarity search for 25 million molecular graphs
 ✓ Find all graphs whose similarity to the query is at least $1 - \epsilon$
 ✓ Similarity = Weisfeiler-Lehman kernel (NIPS, 2009)

• Use a data structure called the “Wavelet Tree” (SODA, 2003)
 ✓ Self-index of an integer array
 ✓ Enables fast array operations
 ‣ e.g., range minimum query, range intersection
Range intersection on array
• Array A of length N, $1 \le A_i \le M$

                   i    j      k   ℓ
        A       1 3 6 8 2 5 7 1 2 7 4 5

• Range intersection: rint(A, [i,j], [k,ℓ])
    ✓ Find common elements of A[i,j] and A[k,ℓ]

• The naive method is to concatenate and sort
    Ex) concatenate: 6,8,2,2,7 ⇛ sort: 2,2,6,7,8

• Use a wavelet tree to solve the problem faster
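The naive method above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `rint_naive` is hypothetical, and the 1-based positions i=3, j=5, k=9, ℓ=10 are inferred from the concatenation example (6,8,2 and 2,7) rather than stated on the slide.

```python
def rint_naive(A, i, j, k, l):
    """Common values of A[i..j] and A[k..l] (1-based, inclusive) by concatenate-and-sort."""
    # Tag every element with the range it came from, then sort by value.
    tagged = [(v, 0) for v in A[i - 1:j]] + [(v, 1) for v in A[k - 1:l]]
    tagged.sort()

    common = []
    idx = 0
    while idx < len(tagged):
        value = tagged[idx][0]
        sources = set()
        # Scan the run of equal values and remember which ranges contributed.
        while idx < len(tagged) and tagged[idx][0] == value:
            sources.add(tagged[idx][1])
            idx += 1
        if sources == {0, 1}:
            common.append(value)
    return common


A = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
print(rint_naive(A, 3, 5, 9, 10))  # ranges hold 6,8,2 and 2,7
```

The cost is dominated by sorting the concatenated ranges, which is what the wavelet tree is meant to avoid.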
Tree of subarrays:
Lower half=left, Higher half=right
            [1,8]
                1 3 6 8 2 5 7 1 7 2 4 5

    [1,4]                                             [5,8]
            1 3 2 1 2 4                 6 8 5 7 7 5

  [1,2]                   [3,4] [5,6]                 [7,8]
     1 2 1 2          3 4          6 5 5        8 7 7


   1 1       2 2      3      4     5 5     6    7 7      8
Remember whether each element is in the
 lower half (0) or the higher half (1)
            [1,8]
                  0 0 1 1 0 1 1 0 1 0 0 1

    [1,4]                                             [5,8]
            0 1 0 0 0 1                 0 1 0 1 1 0


  [1,2]                   [3,4] [5,6]               [7,8]
        0 1 0 1           0 1      1 0 0        1 0 0


    1         2         3   4       5    6      7     8
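The bit arrays on this slide can be reproduced with a short recursive sketch. It uses the array as printed on the tree slides; the function name `build_bits` and the dictionary keyed by value range are illustrative choices, not part of the original software.

```python
def build_bits(values, lo, hi, out=None):
    """Collect the bit array of every internal wavelet-tree node, keyed by value range."""
    if out is None:
        out = {}
    if lo == hi or not values:
        return out  # leaves (and empty nodes) store no bits
    mid = (lo + hi) // 2
    # 0 = value belongs to the lower half [lo, mid], 1 = higher half [mid+1, hi].
    out[(lo, hi)] = [0 if v <= mid else 1 for v in values]
    # Children keep the original relative order of their elements.
    build_bits([v for v in values if v <= mid], lo, mid, out)
    build_bits([v for v in values if v > mid], mid + 1, hi, out)
    return out


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
bits = build_bits(A, 1, 8)
for value_range in sorted(bits):
    print(value_range, bits[value_range])
```

Only the bit arrays need to be stored; the subarrays themselves can be discarded once the tree is built.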
Index each bit array
      with a rank dictionary
• With a rank dictionary, the rank operation can be
  performed in O(1) time
  ✓ $\mathrm{rank}_c(B, i)$: return the number of occurrences of $c \in \{0, 1\}$ in $B[1, i]$

• Several methods known: rank9sel (Vigna, 08)
• Example) B=0110011100
                    i 1 2 3 4 5 6 7 8 9 10
                    B 0 1 1 0 0 1 1 1 0 0
 rank1(B, 8) = 5   (ones in B[1,8] = 0 1 1 0 0 1 1 1)
 rank0(B, 5) = 3   (zeros in B[1,5] = 0 1 1 0 0)
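A rank dictionary can be imitated with a plain prefix-count table, as in the sketch below. This is a simplification and an assumption of this example: it answers rank in O(1) but stores one integer per bit, whereas structures such as rank9 achieve the same query time with far less extra space. The class name `RankDict` is hypothetical.

```python
class RankDict:
    """Prefix-count table over a bit list; positions are 1-based as on the slide."""

    def __init__(self, bits):
        self.ones = [0]  # ones[i] = number of 1s in B[1..i]
        for b in bits:
            self.ones.append(self.ones[-1] + b)

    def rank1(self, i):
        return self.ones[i]

    def rank0(self, i):
        return i - self.ones[i]


B = [int(c) for c in "0110011100"]
rd = RankDict(B)
print(rd.rank1(8), rd.rank0(5))
```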
O(1)-division of an interval
• Using the rank operation, the division of an
  interval can be done in constant time
    ✓ rank0 for left child and rank1 for right child

• Naive division takes time linear in the total number of elements


          [1,8]
           Aroot 1 3 6 8 2 5 7 1 7 2 4 5

                   rank0                       rank1
      [1,4]                      [5,8]
       Aleft   1 3 2 1 2 4        Aright 6 8 5 7 7 5
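The constant-time division can be written out as follows. This is a sketch under the usual wavelet-tree convention (assumed here, since the slide shows only the picture): positions s..e of a node map to rank0(s-1)+1 .. rank0(e) in the left child and rank1(s-1)+1 .. rank1(e) in the right child. The helper names are hypothetical, and the prefix table stands in for a real rank dictionary.

```python
def prefix_ones(bits):
    """ones[i] = number of 1s among the first i bits."""
    ones = [0]
    for b in bits:
        ones.append(ones[-1] + b)
    return ones


def split_interval(ones, s, e):
    """Map positions s..e (1-based, inclusive) of a node to its left and right child."""
    def rank1(i):
        return ones[i]

    def rank0(i):
        return i - ones[i]

    left = (rank0(s - 1) + 1, rank0(e))
    right = (rank1(s - 1) + 1, rank1(e))
    return left, right  # an interval with start > end is empty


root_bits = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]  # bits of Aroot = 1 3 6 8 2 5 7 1 7 2 4 5
ones = prefix_ones(root_bits)
print(split_interval(ones, 3, 5))   # Aroot[3..5] = 6 8 2
print(split_interval(ones, 9, 10))  # Aroot[9..10] = 7 2
```

For Aroot[3..5] the left interval selects the single 2 in Aleft and the right interval selects 6 8 in Aright, using four rank calls regardless of the interval length.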
Fast computation of range
   intersection by pruning
Pruned      [1,8]
                  1 3 6 8 2 5 7 1 7 2 4 5

    [1,4]                                                [5,8]
            1 3 2 1 2 4                    6 8 5 7 7 5

  [1,2]                      [3,4] [5,6]             [7,8]
     1 2 1 2             3 4          6 5 5        8 7 7


    1 1       2 2        3      4     5 5     6    7 7     8

            solution!!
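The pruned traversal can be sketched as below. The code is an illustration written for this transcript, not the gWT implementation: the `Node` class, the prefix-count rank table, and the interval convention (1-based, inclusive, empty when start > end) are all assumptions. A subtree is skipped as soon as either query interval becomes empty in it.

```python
class Node:
    """Wavelet-tree node over the value range [lo, hi] with a prefix-count rank table."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        self.ones = [0]
        if lo < hi:
            mid = (lo + hi) // 2
            for v in values:
                self.ones.append(self.ones[-1] + (1 if v > mid else 0))
            self.left = Node([v for v in values if v <= mid], lo, mid)
            self.right = Node([v for v in values if v > mid], mid + 1, hi)

    def rank1(self, i):
        return self.ones[i]

    def rank0(self, i):
        return i - self.ones[i]


def rint(node, first, second, out):
    """Append to out every value present in both intervals of this node."""
    (s1, e1), (s2, e2) = first, second
    if s1 > e1 or s2 > e2:
        return  # one interval is empty here: prune the whole subtree
    if node.lo == node.hi:
        out.append(node.lo)  # leaf reached with both intervals non-empty
        return
    rint(node.left,
         (node.rank0(s1 - 1) + 1, node.rank0(e1)),
         (node.rank0(s2 - 1) + 1, node.rank0(e2)), out)
    rint(node.right,
         (node.rank1(s1 - 1) + 1, node.rank1(e1)),
         (node.rank1(s2 - 1) + 1, node.rank1(e2)), out)


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
root = Node(A, 1, 8)
result = []
rint(root, (3, 5), (9, 10), result)  # A[3..5] = 6 8 2, A[9..10] = 7 2
print(result)
```

In this example the subtrees for [3,4] and [5,6] and the leaves 1, 7 and 8 are never expanded, which matches the pruned picture on the slide.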
Outline
• Overview
• Wavelet Tree
  ✓ Problem = Range intersection on array
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Graph Similarity Search
• Bag-of-words representation of graph
   ✓ Weisfeiler-Lehman procedure (NIPS, 2009), Hido and
     Kashima (ICDM, 2009), Wang et al., (EDBT, 2009)


                      W=(A,D,E,H)


• Cosine similarity query
 ✓ Find all graphs W whose cosine similarity (kernel) to
   the query Q is at least $1 - \epsilon$
Weisfeiler-Lehman Procedure (NIPS,09)
•   Convert a graph into a set of words (bag-of-words)
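The slide states only the goal of the procedure. The sketch below assumes a common formulation of Weisfeiler-Lehman relabeling: in each round a node's new label combines its current label with the sorted labels of its neighbours, and the labels from every round are collected as words. The function name, the string encoding of labels, and the choice to keep words from all rounds are assumptions of this example, not details taken from the talk.

```python
def wl_bag_of_words(adj, labels, iterations):
    """adj: node -> list of neighbours; labels: node -> initial string label."""
    current = dict(labels)
    words = set(current.values())
    for _ in range(iterations):
        updated = {}
        for v in adj:
            # Sorting makes the word independent of neighbour order.
            neighbour_labels = sorted(current[u] for u in adj[v])
            updated[v] = current[v] + "(" + ",".join(neighbour_labels) + ")"
        current = updated
        words.update(current.values())
    return words


# Toy path graph C - C - O.
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}
print(sorted(wl_bag_of_words(adj, labels, 1)))
```

Practical implementations usually compress each combined label to a short integer id so that words can be stored in an integer array.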
Semi-conjunctive query
• Cosine similarity query can be relaxed to
  the following form
           W s.t. $|W \cap Q| \ge k$
  ✓ Find all graphs W which share at least k words
    with the query Q

• No false negatives
• False positives can easily be filtered out by
  cosine calculations
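The two-stage search (relaxed query, then exact filtering) can be sketched as follows. The slide does not say how k is chosen; this example assumes k = ⌈(1 − ε)²|Q|⌉, which follows from |W ∩ Q| ≤ |W| and gives no false negatives. The toy database, graph ids, and function names are hypothetical, and bags are modelled as non-empty Python sets.

```python
import math


def cosine(W, Q):
    return len(W & Q) / math.sqrt(len(W) * len(Q))


def search(database, Q, eps):
    # Assumed threshold: K_N >= 1 - eps implies |W ∩ Q| >= (1 - eps)^2 |Q|.
    # The small tolerance guards against floating-point round-up before ceil.
    k = math.ceil((1 - eps) ** 2 * len(Q) - 1e-9)
    # Stage 1: semi-conjunctive query (may contain false positives).
    candidates = [gid for gid, W in database.items() if len(W & Q) >= k]
    # Stage 2: filter candidates with the exact cosine.
    hits = [gid for gid in candidates if cosine(database[gid], Q) >= 1 - eps]
    return candidates, hits


db = {
    "g1": {"A", "D", "E", "H"},
    "g2": {"A", "D", "E", "X"},
    "g3": {"A", "B"},
    "g4": {"A", "D", "E", "H", "P", "R", "S", "T", "U"},
}
Q = {"A", "D", "E", "H"}
candidates, hits = search(db, Q, 0.3)
print(candidates, hits)
```

Here g4 shares all four query words but is much larger than Q, so it passes the relaxed query and is then removed by the cosine check.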
Inverted index, Array, Wavelet Tree
[Figure: inverted index → array Aroot = 1 3 6 8 2 5 7 1 2 7 4 5 → Wavelet Tree]
• Inverted index is built from the graph database
• Concatenate all rows to make an array
• Index the array with a wavelet tree
• Semi-conjunctive query = Extension of range intersection
    ✓ Find graph ids which appear at least k times in given intervals
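The data layout described above can be sketched without the wavelet tree itself. In this illustration each word's row of graph ids becomes one interval of the concatenated array, and the query counts how often each id occurs in the intervals of the query words. Counting with `Counter` is a stand-in for the tree traversal, and all names and the toy database are hypothetical.

```python
from collections import Counter


def build_array(database):
    """Return the concatenated inverted index and each word's 1-based interval."""
    inverted = {}
    for gid in sorted(database):
        for word in database[gid]:
            inverted.setdefault(word, []).append(gid)
    array, intervals = [], {}
    for word in sorted(inverted):
        start = len(array) + 1
        array.extend(inverted[word])
        intervals[word] = (start, len(array))
    return array, intervals


def semi_conjunctive(array, intervals, Q, k):
    """Graph ids that appear at least k times in the intervals of the query words."""
    counts = Counter()
    for word in Q:
        if word in intervals:
            s, e = intervals[word]
            counts.update(array[s - 1:e])
    return sorted(g for g, c in counts.items() if c >= k)


db = {1: {"A", "D", "E", "H"}, 2: {"A", "D", "X"}, 3: {"B", "E"}}
array, intervals = build_array(db)
print(array, intervals)
print(semi_conjunctive(array, intervals, {"A", "D", "E"}, 2))
```

Because a graph id occurs at most once per row, the count for a graph equals the number of words it shares with the query.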
Pruning search space
• Find all graphs W in the database whose cosine
    to a query Q is at least a threshold $1 - \epsilon$
       W s.t. $K_N(W, Q) = \frac{|W \cap Q|}{\sqrt{|W|\,|Q|}} \ge 1 - \epsilon$
    ✓ W, Q: bag-of-words of graphs
• The above condition can be relaxed as follows
    If $K_N(W, Q) \ge 1 - \epsilon$, then
       $(1 - \epsilon)^2 |Q| \le |W| \le \frac{|Q|}{(1 - \epsilon)^2}$
    ✓ Can be used for pruning search space
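The size bounds follow in two steps from the fact that the intersection is no larger than either set. The derivation below is a reconstruction added for this transcript and assumes $0 \le \epsilon < 1$.

```latex
% Since W \cap Q is a subset of both W and Q, |W \cap Q| \le \min(|W|, |Q|).
\begin{align*}
1 - \epsilon \le K_N(W, Q)
  &= \frac{|W \cap Q|}{\sqrt{|W|\,|Q|}}
   \le \frac{|W|}{\sqrt{|W|\,|Q|}} = \sqrt{\frac{|W|}{|Q|}}
  &&\Longrightarrow\quad |W| \ge (1 - \epsilon)^2 |Q| \\
1 - \epsilon \le K_N(W, Q)
  &\le \frac{|Q|}{\sqrt{|W|\,|Q|}} = \sqrt{\frac{|Q|}{|W|}}
  &&\Longrightarrow\quad |W| \le \frac{|Q|}{(1 - \epsilon)^2}
\end{align*}
```

Graphs whose bag size falls outside this range cannot satisfy the cosine condition, so they can be skipped without computing the intersection.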
Complexity
• Time per query: O(τm)
 •   τ: the number of traversed nodes
 •   m: the number of words in the query's bag-of-words

• Memory: (1+α)N log n + M log N bits
 •   N: the number of all words in the database
 •   M: maximum integer in the array
 •   n: the number of graphs
 •   α: overhead for rank dictionary (α=0.6)

• Inverted index takes N log n bits
• About 60% overhead over the inverted index
Outline
• Overview
• A data-structure
  ✓ Wavelet Tree
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Experiments

• 25 million chemical compounds from PubChem
  database
• Evaluate search time and memory usage
• Threshold parameter ε = 0.3, 0.35, 0.4 (cosine similarity at least $1 - \epsilon$)
• Compare our method gWT to
 ✓   Inverted index (concatenate all intervals and sort)
 ✓   Sequential scan (Compute similarity one by one)
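Search time
[Figure: search time plot; annotated values 40 sec, 38 sec, 8 sec, 3 sec, 2 sec]
Memory usage
[Figure: memory usage plot; annotated value 20GB]
Construction time
[Figure: construction time plot; annotated value 7h]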
Summary
• Efficient similarity search method for
    massive graph databases
• Solves semi-conjunctive queries efficiently
• Built on the Wavelet Tree
• Uses the Weisfeiler-Lehman procedure to
    represent graphs as bag-of-words
• Applicable to 25 million graphs
• Software
    http://code.google.com/p/gwt