2012 Sapporo Workshop on
Machine Learning and Applications to Biology
Aug 6-7 2012: Sapporo, Hokkaido, Japan




        Space-Efficient Multibit Trees
         for Large-Scale Chemical
           Fingerprint Searches	

                           Yasuo Tabei
                          JST ERATO
       HomePage: https://sites.google.com/site/yasuotabei/
              E-mail: yasuo.tabei@gmail.com
Outline	
•  Overview
   –  Chemical fingerprint search
•  Multibit Tree
•  Succinct Data Structures
   –  Rank/Select dictionary
   –  Succinct ordered tree: LOUDS
•  Succinct Multibit Trees
   –  Compact representation of multibit trees
   –  Compact representation of fingerprint databases
      1.  Variable-length array
      2.  Succinct Trie
•  Experiments
Chemical fingerprint search	
•  Space-efficient data structures to index 30 million
   chemical fingerprints, e.g., W=(1,5,7,10)
•  Searching all fingerprints similar to a query (≧ε)
   –  Similarity = Jaccard (Tanimoto) (J(W,W’)=|W∩W’|/|W∪W’|)
•  Multibit tree (Kristensen et al.,09): Data structure
   enabling fast similarity searches
   –  Memory-inefficient pointer-based representation
   –  Need to store original fingerprint databases to prevent
      false positives
•  Succinct data structure (Jacobson, 1989)
   –  Space-efficient and enables fast operations
Multibit Tree (MT) (Kristensen et al., 09)	
•  Multiple decision trees built on fingerprints
   clustered with respect to cardinality
   (i)   Fingerprint database:
         W1=(1,2,7,4,8), W2=(1,3,7), W3=(1,3), W5=(1,4,8,7), W6=(1), ..., Wn=(1,3,4)
   (ii)  Cluster into bins w.r.t. cardinality:
         -  Cardinality 1: W6=(1), W32=(2), W42=(4), W50=(8)
         -  Cardinality 2: W3=(1,3), W9=(2,4), W12=(1,4)
         -  Cardinality 3: W3=(2,5,6), W9=(1,3,6), W12=(4,6,7), W15=(2,3,5), W18=(4,6,8)
         -  ...
   (iii) Build a decision tree for each bin
   [Figure: one decision tree per bin, with the fingerprints of the bin at the leaves]
Similarity search of a query fingerprint Q	
l Find all fingerprints     such that J(Wi , Q)
•  For an efficient similarity search, find the candidate
   solutions Wi satisfying two constrains:
      1.  Cardinality constraint
                                     1
                       |Q|   |Wi |       |Q|
      2.  Upper bound of Jaccard similarity
                        min(|Wi | N0 , |Q| N1 )
                 |Wi | + |Q| min(|Wi | N0 , |Q|   N1 )
 - N0: The number of elements contained in Wi and not in Q
 - N1: The number of elements contained in Q and not in Wi
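The two constraints can be written down directly. The following is a minimal Python sketch, assuming fingerprints are given as collections of on-bit indices; the function names are illustrative and do not come from the slides.

```python
import math

def jaccard(w, q):
    """Jaccard (Tanimoto) similarity |W∩Q| / |W∪Q| of two fingerprints."""
    w, q = set(w), set(q)
    union = len(w | q)
    return len(w & q) / union if union else 1.0

def cardinality_range(q_size, eps):
    """Smallest and largest |Wi| compatible with J(Wi, Q) >= eps.

    Uses floating-point arithmetic, so thresholds that are not exactly
    representable may need extra care at the boundaries.
    """
    return math.ceil(eps * q_size), math.floor(q_size / eps)

def jaccard_upper_bound(w_size, q_size, n0, n1):
    """Upper bound on J(Wi, Q).

    n0: elements known to be in Wi but not in Q
    n1: elements known to be in Q but not in Wi
    """
    inter = min(w_size - n0, q_size - n1)   # bound on |Wi ∩ Q|
    return inter / (w_size + q_size - inter)
```

For example, with W=(1,5,7,10) and Q=(1,5,7), one element of W is missing from Q (N0=1, N1=0), and the bound coincides with the true similarity of 0.75.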
Similarity search of a query fingerprint Q	
Step 1: Find candidate solutions I1 satisfying the cardinality constraint
Step 2: Find candidate solutions I2 satisfying the upper bound
        (branches of the decision trees are either searched or pruned)
Step 3: Calculate similarities to remove false positives in I2
[Figure: the decision trees of the selected bins, with searched and pruned branches marked and the surviving leaves checked in Step 3]
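A simplified sketch of this search in Python, covering Steps 1 and 3 only: the decision-tree pruning of Step 2 is deliberately left out, so every fingerprint in a selected bin is verified exactly. The names `build_bins` and `search` are hypothetical, and the small database reuses the fingerprints listed on the multibit tree slide.

```python
import math
from collections import defaultdict

# Example fingerprints taken from the multibit tree slide.
DB = {
    "W1": (1, 2, 7, 4, 8),
    "W2": (1, 3, 7),
    "W3": (1, 3),
    "W5": (1, 4, 8, 7),
    "W6": (1,),
}

def build_bins(db):
    """Cluster fingerprint identifiers into bins by cardinality."""
    bins = defaultdict(list)
    for fid, fp in db.items():
        bins[len(set(fp))].append(fid)
    return bins

def search(db, bins, query, eps):
    """Return (identifier, similarity) pairs with Jaccard similarity >= eps.

    Assumes a non-empty query and 0 < eps <= 1.
    """
    q = set(query)
    lo, hi = math.ceil(eps * len(q)), math.floor(len(q) / eps)
    hits = []
    for c in range(lo, hi + 1):          # Step 1: cardinality constraint
        for fid in bins.get(c, ()):      # Step 2 (tree pruning) omitted here
            w = set(db[fid])
            sim = len(w & q) / len(w | q)
            if sim >= eps:               # Step 3: exact verification
                hits.append((fid, sim))
    return sorted(hits)

BINS = build_bins(DB)
```

With Q=(1,3,7) and ε=0.5, only bins of cardinality 2 to 6 are visited, so W6 is never compared, and W2 and W3 are returned.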
Drawbacks	

•  Pointer-based representation of multibit trees
   needs a large amount of memory
                 bits
 - Kc: number of fingerprints in bin c
 - C: total number of bins
   –  Log(.) factor is too large!
•  Need to store original fingerprint databases in
   memory to filter out false positives
Succinct Data Structures	
•  Space-efficient data structures enabling fast
   operations
•  Pointer-based representations of ordered trees
   consume a large amount of memory
  –  O(nlogn) bits for the number n of nodes
  –  logn factor is too large for large-scale data
•  Succinct representation: an ordered tree is stored as a bit string
   of length 2n + 1 that still supports O(1)-time operations
  –  Ex) 0100100101000
•  Various succinct data structures
  –  sets (Raman, 2002), sequences (Ferragina, 2001),
     trees (Jacobson, 1989), graphs (Turan, 1989)
Rank/select dictionary (RRR, 2002)
       : Foundation of various succinct data structures	

•  Enables the rank/select operations on bit string B in
   O(1)-time
   -  rank_c(B,i): return the number of c∈{0,1} in B[1…i]
   -  select_c(B,i): return the position of the i-th occurrence of c∈{0,1}
•  Efficient rank/select dictionary (Navarro and Providel, 2012)

 Ex) B=0110011100
        i  1 2 3 4 5 6 7 8 9 10
        B  0 1 1 0 0 1 1 1 0 0
     rank1(B,8)=5
     select1(B,3)=6

•  Divide the bit array B into large blocks of length l=(log n)²
   RL = ranks of large blocks
•  Divide each large block into small blocks of length s=(log n)/2
   Rs = ranks of small blocks relative to the large block
      rank1(B,i) = RL[⌊i/l⌋] + Rs[⌊i/s⌋] + (remaining rank)
          Time: O(1)
          Memory: n + o(n) bits
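The two-level block scheme can be illustrated with a small Python class. This is a sketch under simplifying assumptions: block lengths are fixed parameters rather than functions of log n, the remaining rank is counted bit by bit instead of via a lookup table, and select uses binary search over rank, so select is O(log n) here rather than O(1). The class name `RankSelect` is illustrative.

```python
class RankSelect:
    """Bit string with 1-based rank/select, as in the slide's convention."""

    def __init__(self, bits, large=8, small=4):
        assert large % small == 0, "small blocks must tile a large block"
        self.b = [int(ch) for ch in bits]
        self.large, self.small = large, small
        self.RL = []   # number of 1s before each large block
        self.Rs = []   # number of 1s before each small block, within its large block
        total = in_large = 0
        for pos in range(len(self.b) + 1):
            if pos % large == 0:
                self.RL.append(total)
                in_large = 0
            if pos % small == 0:
                self.Rs.append(in_large)
            if pos < len(self.b):
                total += self.b[pos]
                in_large += self.b[pos]

    def rank(self, c, i):
        """Number of bits equal to c in B[1..i]."""
        start = (i // self.small) * self.small
        ones = self.RL[i // self.large] + self.Rs[i // self.small] + sum(self.b[start:i])
        return ones if c == 1 else i - ones

    def select(self, c, k):
        """Position of the k-th occurrence of c (binary search over rank)."""
        if k < 1 or self.rank(c, len(self.b)) < k:
            raise ValueError("no such occurrence")
        lo, hi = 1, len(self.b)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank(c, mid) >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo
```

On the slide's example, `RankSelect("0110011100").rank(1, 8)` gives 5 and `.select(1, 3)` gives 6.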
Level-order Unary Degree Sequence
            (LOUDS) (Jacobson, 1989)	
•  Represents an ordered tree as a bit string
   of length 2n+1 (n: number of nodes)
•  Construction
1)  Traverse the tree in a breadth-first manner, starting from a
    super root placed above the root
2)  For each visited node of degree k, append k 1s followed by a 0

  Ex) Tree: 1 → (2, 3), 2 → (4, 5), 3 → (6, 7); S is the super root
      Node:  S   1    2    3    4  5  6  7
      Bits:  10  110  110  110  0  0  0  0
      B = 101101101100000
Properties of LOUDS	
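  Ex) B: 101101101100000 for the tree 1 → (2, 3), 2 → (4, 5), 3 → (6, 7)
      -  The i-th 1 on B corresponds to node i (positions 1, 3, 4, 6, 7, 9, 10)
      -  The (i+1)-th 0 on B corresponds to node i (positions 5, 8, 11, 12, 13, 14, 15)

•  For a tree consisting of n nodes, there are n 1s
   and n+1 0s on bit string B
•  Each tree node corresponds to exactly one 1 on B and to
   exactly one 0 on B other than the first 0
•  Positions of the parent and children for a tree
   node on B can be calculated by combining the
   rank/select operations in O(1)-time.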
O(1)-time operations on a tree	
•  Parent/child operations for i such that B[i]=1
     –  First child: p = select0(B, rank1(B,i)) + 1
     –  Next child: i+1 for position i of the current child (while B[i+1]=1)
     –  Parent: p = select1(B, rank0(B,i))

Ex) Calculating the first child for i = 4
      i  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      B  1 0 1 1 0 1 1 0 1 1  0  0  0  0  0
    rank1(B,4)=3
    select0(B,3)=8, so p = 8 + 1 = 9
    (B[4] is node 3; B[9] is its first child, node 6)
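The construction and the navigation formulas can be checked on the seven-node example tree. The Python sketch below uses linear-time rank/select helpers as stand-ins for an O(1) rank/select dictionary; positions are 1-based as on the slides, and all function names are illustrative.

```python
from collections import deque

def louds(children, root):
    """Build the LOUDS bit string by a breadth-first traversal."""
    bits = ["10"]                       # super root with a single child
    queue = deque([root])
    while queue:
        v = queue.popleft()
        kids = children.get(v, [])
        bits.append("1" * len(kids) + "0")
        queue.extend(kids)
    return "".join(bits)

def rank(B, c, i):
    """Number of characters c in B[1..i] (linear-time stand-in)."""
    return B[:i].count(c)

def select(B, c, k):
    """1-based position of the k-th c in B; returns 0 when k == 0."""
    pos = -1
    for _ in range(k):
        pos = B.index(c, pos + 1)
    return pos + 1

def first_child(B, i):
    """Position of the first child of the node at position i, or None for a leaf."""
    p = select(B, "0", rank(B, "1", i)) + 1
    return p if p <= len(B) and B[p - 1] == "1" else None

def parent(B, i):
    """Position of the parent of the node at position i."""
    return select(B, "1", rank(B, "0", i))

TREE = {1: [2, 3], 2: [4, 5], 3: [6, 7]}
B = louds(TREE, 1)
```

Here `B` equals `101101101100000`, `first_child(B, 4)` is 9, and `parent(B, 9)` leads back to 4.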
Succinct Multibit Trees (SMT)	

•  Consist of compact representations of multibit
   trees and fingerprint databases
•  Represent multibit trees by LOUDS
  –  O(8 Σ_{c=1}^{C} |Kc| − 4C + M C) bits not including log factor
  –  Fast similarity searches
•  Two compact representations of fingerprint
   databases
  –  Variable-length array (VLA)
  –  Succinct trie (TRIE)
Succinct representation of
             multibit trees (SMT)	
•  Basic idea is to represent MT by LOUDS
  –  MT consists of multiple binary decision trees.
•  Bc: LOUDS representation of a decision tree
•  Lc: bit string indicating whether Bc[i] is a leaf or not
•  IDs: Array containing fingerprint identifiers

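      [Figure: a multibit tree (MT) with nodes 1 to 7, whose leaves 4 to 7 hold the fingerprints W3, W4, W1 and W2, shown next to its SMT representation]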
Access to node auxiliaries and
    fingerprint identifiers in O(1)-time	
•  Access to node auxiliaries Mv^1, Mv^0 for calculating
   upper bounds
   –  v = rank1(Bc, p) for a given position p
   –  Each 1 bit in Bc corresponds to a node v
•  Identifiers for calculating Jaccard similarities, for positions p with Lc[p] = 1
   –  IDs[rank1(Lc, p)] for a given position p
   –  Each 1 bit in Lc corresponds to an index on IDs
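The two lookups can be tried on the seven-node example. In the Python sketch below, the leaf bit string and the identifier order (W3, W4, W1, W2 for leaves 4 to 7, one fingerprint per leaf) are assumptions read off the figure, and the rank helper is a linear-time stand-in for an O(1) rank dictionary.

```python
# LOUDS bit string of the example tree 1 -> (2, 3), 2 -> (4, 5), 3 -> (6, 7).
B = "101101101100000"
# Assumed leaf markers: L[p] = 1 when the 1 bit at position p of B is a leaf.
L = "000001101100000"
# Assumed identifier order, one fingerprint per leaf.
IDS = ["W3", "W4", "W1", "W2"]

def rank1(bits, p):
    """Number of 1s in bits[1..p] (1-based, linear-time stand-in)."""
    return bits[:p].count("1")

def node_id(B, p):
    """Node number v for a position p with B[p] = 1."""
    return rank1(B, p)

def fingerprint_id(L, ids, p):
    """Fingerprint identifier stored at position p, or None if p is not a leaf."""
    if L[p - 1] != "1":
        return None
    return ids[rank1(L, p) - 1]   # IDs is 1-based on the slide, 0-based here
```

Position 6 of `B` is node 4, and `fingerprint_id(L, IDS, 6)` returns its identifier W3; position 3 is the inner node 2 and yields no identifier.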
Variable-length array for compactly
        representing fingerprints	
•  Standard array consists of bit strings of fixed length
   –  Space-inefficient for storing small values
   Ex) Array in which each element is represented as 8 bits (32 bits in total)
         Integer     2         1         3         4
         Bit string  00000010  00000001  00000011  00000100

•  Variable-length array = bit strings of different lengths
   Ex) The same integers stored in 8 bits in total
         Integer     2   1   3   4
         Bit string  10  1   11  100
   –  Space-efficient
   –  Difficulty: finding where the bits of each integer begin and end
Representation of variable-length array	




•  Use two bit strings to represent an array A:
   -  R: bit string whose k-th substring is the binary
      representation of A[k]
   -  P: bit string whose k-th substring consists of
      (len(A[k]) − 1) 0s followed by a 1, where
      len(A[k]) = ⌊log2 A[k]⌋ + 1 is the number of bits of A[k]
Recovering A[k] from variable-length array	
 •  A[k] is recovered in three steps:
    1.  Start position s: if k=1 then s=1, else s = select1(P,k-1) + 1
    2.  End position e: e = select1(P,k)
    3.  Conversion: convert substring R[s,e] to an integer
 •  O(1)-time
Ex) k=3
  1.  s = select1(P,2)+1 = 4
  2.  e = select1(P,3) = 7
  3.  Convert R[4,7]=1000 to the integer 8
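The encoding into R and P and the three recovery steps can be sketched in Python as follows. Bit strings are ordinary Python strings, select1 is a linear scan standing in for an O(1) select, and the function names are illustrative. The array [2, 1, 8] is an assumption chosen so that the positions match the k=3 example (s=4, e=7).

```python
def encode(values):
    """Build R (concatenated binary codes) and P (end markers) for non-negative integers."""
    R, P = [], []
    for v in values:
        code = bin(v)[2:]                      # binary representation without leading zeros
        R.append(code)
        P.append("0" * (len(code) - 1) + "1")  # a 1 marks the last bit of this element
    return "".join(R), "".join(P)

def select1(P, k):
    """1-based position of the k-th 1 in P (linear-time stand-in)."""
    pos = -1
    for _ in range(k):
        pos = P.index("1", pos + 1)
    return pos + 1

def get(R, P, k):
    """Recover A[k] (1-based) from R and P."""
    s = 1 if k == 1 else select1(P, k - 1) + 1   # step 1: start position
    e = select1(P, k)                            # step 2: end position
    return int(R[s - 1:e], 2)                    # step 3: convert R[s, e]

R, P = encode([2, 1, 8])
```

For this array, R is `1011000` and P is `0110001`, so `get(R, P, 3)` reads R[4,7]=1000 and returns 8.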
Trie	
•  Used to store an associative array
   –  Keys are usually strings
•  Applicable to fingerprints considered as strings
   –  The position of each node defines the key it is associated with
   –  All the descendants of a node have a common prefix: the
      string associated with that node
   –  Values are associated only with leaves and with those inner
      nodes that correspond to keys
    Ex) Build a trie from
            W1=(1,2,3)
            W2=(2,3,7,8)
            W3=(1,2,5,8)
            W4=(1,3,5)
        [Figure: trie with root label 0 and 12 nodes numbered in level order; the root-to-leaf paths spell W1 to W4]
Difficulty	
•  The alphabet size tends to be small for typical trie
   applications, e.g., DNA(4), English(26)
•  Difficulty: the word size of fingerprints is not always
   small, e.g., PubChem, 881 dimensions
   –  Memory usage is dominated by labels
•  Compute the difference between each node label and
   its parent node label
Ex) W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5)
    Build trie → compute differences → succinct trie by LOUDS
    [Figure: the trie before and after replacing each node label by its difference from the parent label]
Succinct Trie (TRIE)	
•  Three components:
  –  T: LOUDS representation of trie
  –  D: Variable-length array containing node labels
  –  Idconv: Array containing fingerprint identifiers	




  Ex) Succinct trie for W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5)
      Node ids  -   1    2    3   4    5   6   7  8   9  10  11  12
      LBS T     10  110  110  10  110  10  10  0  10  0  10  0   0
      Words D       0    1    2   1    2   1   1  3   2  4   3   1

      Index     W1  W2  W3  W4
      idconv    7   12  11  9
  [Figure: the trie with difference labels next to its succinct representation]
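The three components can be rebuilt from the four example fingerprints. The Python sketch below constructs a pointer-based trie first and then emits T, D and idconv in level order; D is kept as a plain list rather than a variable-length array, and the function name `build_succinct_trie` is hypothetical. It assumes each fingerprint is given in increasing order.

```python
from collections import deque

def build_succinct_trie(fingerprints):
    """Return (T, D, idconv) for a dict mapping identifiers to sorted fingerprints."""
    # Pointer-based trie; the root carries label 0.
    root = {"label": 0, "kids": {}, "key": None}
    for name, fp in fingerprints.items():
        node = root
        for x in fp:
            node = node["kids"].setdefault(x, {"label": x, "kids": {}, "key": None})
        node["key"] = name

    T = ["10"]        # LOUDS codes, starting with the super root
    D = []            # label of each node minus the label of its parent
    idconv = {}       # fingerprint identifier -> node id (level order, 1-based)
    queue = deque([(root, 0)])
    node_id = 0
    while queue:
        node, parent_label = queue.popleft()
        node_id += 1
        kids = [node["kids"][x] for x in sorted(node["kids"])]
        T.append("1" * len(kids) + "0")
        D.append(node["label"] - parent_label)
        if node["key"] is not None:
            idconv[node["key"]] = node_id
        queue.extend((kid, node["label"]) for kid in kids)
    return T, D, idconv

FPS = {"W1": (1, 2, 3), "W2": (2, 3, 7, 8), "W3": (1, 2, 5, 8), "W4": (1, 3, 5)}
T, D, IDCONV = build_succinct_trie(FPS)
```

Because every difference is small compared with the original labels (at most 4 here, against labels up to 8), storing D in a variable-length array is where the space saving comes from.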
Experiments	

•  30 million chemical fingerprints from
   PubChem database
•  Evaluate search time and memory
•  Compared succinct multibit tree (SMT) to
   pointer-based multibit tree (MT)
•  Compared variable-length array (VLA) and
   succinct trie (TRIE) to the raw
   representation of fingerprint databases.
Memory usage of multibit trees	
[Figure: memory (MB) against # of fingerprints (0 to 3.0e+07) for SMT and MT. At 30 million fingerprints: MT 6 GB, SMT 847 MB.]
Memory usage of representations of
                           fingerprint databases	
[Figure: memory (MB) against # of fingerprints (0 to 3.0e+07) for TRIE, VLA and RAW. At 30 million fingerprints: RAW 16 GB, VLA 3.2 GB, TRIE 1.3 GB.]
Search time and memory on 30 million
  fingerprints (ε=0.98) #answers:10	
[Figure: search time (sec) against memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA and MT+RAW. Annotated points: SMT+TRIE 0.021 sec at 2 GB, SMT+VLA 0.014 sec at 4 GB, MT+RAW 0.006 sec at 22 GB.]
Search time and memory on 30 million
 fingerprints (ε=0.9) #answers:1,440	
[Figure: search time (sec) against memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA and MT+RAW. Annotated points: SMT+TRIE 1.7 sec at 2 GB, SMT+VLA 0.58 sec at 4 GB, MT+RAW 0.3 sec at 22 GB.]
Summary	

•  Succinct Multibit Trees (SMT)
•  Compactly represent multibit trees and
   fingerprints by succinct data structures
•  Represent multibit trees by LOUDS
•  Represent fingerprints by variable-length array and
   succinct trie
•  Enables us to index 30 million fingerprints in 2GB
   by SMT+TRIE and in 4GB by SMT+VLA
•  Search time remains practically fast

More Related Content

What's hot

IMT, col space again
IMT, col space againIMT, col space again
IMT, col space again
Prasanth George
 
Tensorizing Neural Network
Tensorizing Neural NetworkTensorizing Neural Network
Tensorizing Neural Network
Ruochun Tzeng
 
On the Zeros of Complex Polynomials
On the Zeros of Complex PolynomialsOn the Zeros of Complex Polynomials
Datastructure tree
Datastructure treeDatastructure tree
Datastructure tree
rantd
 
Add Maths Module
Add Maths ModuleAdd Maths Module
Add Maths Module
bspm
 
E-Cordial Labeling of Some Mirror Graphs
E-Cordial Labeling of Some Mirror GraphsE-Cordial Labeling of Some Mirror Graphs
E-Cordial Labeling of Some Mirror Graphs
Waqas Tariq
 
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Yandex
 
Mgm
MgmMgm
Teknik menjawab-percubaan-pmr-melaka-2010
Teknik menjawab-percubaan-pmr-melaka-2010Teknik menjawab-percubaan-pmr-melaka-2010
Teknik menjawab-percubaan-pmr-melaka-2010
Ieda Adam
 
Chapter 1 functions
Chapter 1  functionsChapter 1  functions
Chapter 1 functions
Umair Pearl
 
NIPS2010: optimization algorithms in machine learning
NIPS2010: optimization algorithms in machine learningNIPS2010: optimization algorithms in machine learning
NIPS2010: optimization algorithms in machine learning
zukun
 
An order seven implicit symmetric sheme applied to second order initial value...
An order seven implicit symmetric sheme applied to second order initial value...An order seven implicit symmetric sheme applied to second order initial value...
An order seven implicit symmetric sheme applied to second order initial value...
Alexander Decker
 
Functions
FunctionsFunctions
Functions
Leo Crisologo
 
Paraproducts with general dilations
Paraproducts with general dilationsParaproducts with general dilations
Paraproducts with general dilations
VjekoslavKovac1
 
Multilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structureMultilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structure
VjekoslavKovac1
 
A Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cubeA Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cube
VjekoslavKovac1
 
International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions www.ijeijournal.com
 
2012 mdsp pr09 pca lda
2012 mdsp pr09 pca lda2012 mdsp pr09 pca lda
2012 mdsp pr09 pca lda
nozomuhamada
 
Tales on two commuting transformations or flows
Tales on two commuting transformations or flowsTales on two commuting transformations or flows
Tales on two commuting transformations or flows
VjekoslavKovac1
 

What's hot (19)

IMT, col space again
IMT, col space againIMT, col space again
IMT, col space again
 
Tensorizing Neural Network
Tensorizing Neural NetworkTensorizing Neural Network
Tensorizing Neural Network
 
On the Zeros of Complex Polynomials
On the Zeros of Complex PolynomialsOn the Zeros of Complex Polynomials
On the Zeros of Complex Polynomials
 
Datastructure tree
Datastructure treeDatastructure tree
Datastructure tree
 
Add Maths Module
Add Maths ModuleAdd Maths Module
Add Maths Module
 
E-Cordial Labeling of Some Mirror Graphs
E-Cordial Labeling of Some Mirror GraphsE-Cordial Labeling of Some Mirror Graphs
E-Cordial Labeling of Some Mirror Graphs
 
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
 
Mgm
MgmMgm
Mgm
 
Teknik menjawab-percubaan-pmr-melaka-2010
Teknik menjawab-percubaan-pmr-melaka-2010Teknik menjawab-percubaan-pmr-melaka-2010
Teknik menjawab-percubaan-pmr-melaka-2010
 
Chapter 1 functions
Chapter 1  functionsChapter 1  functions
Chapter 1 functions
 
NIPS2010: optimization algorithms in machine learning
NIPS2010: optimization algorithms in machine learningNIPS2010: optimization algorithms in machine learning
NIPS2010: optimization algorithms in machine learning
 
An order seven implicit symmetric sheme applied to second order initial value...
An order seven implicit symmetric sheme applied to second order initial value...An order seven implicit symmetric sheme applied to second order initial value...
An order seven implicit symmetric sheme applied to second order initial value...
 
Functions
FunctionsFunctions
Functions
 
Paraproducts with general dilations
Paraproducts with general dilationsParaproducts with general dilations
Paraproducts with general dilations
 
Multilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structureMultilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structure
 
A Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cubeA Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cube
 
International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)
 
2012 mdsp pr09 pca lda
2012 mdsp pr09 pca lda2012 mdsp pr09 pca lda
2012 mdsp pr09 pca lda
 
Tales on two commuting transformations or flows
Tales on two commuting transformations or flowsTales on two commuting transformations or flows
Tales on two commuting transformations or flows
 

Viewers also liked

SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009
Yasuo Tabei
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
Yasuo Tabei
 
WABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTreeWABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTree
Yasuo Tabei
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesYasuo Tabei
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Yasuo Tabei
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 public
Yasuo Tabei
 
DCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant Space
Yasuo Tabei
 
Sketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-public
Yasuo Tabei
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabei
Yasuo Tabei
 
GIW2013
GIW2013GIW2013
GIW2013
Yasuo Tabei
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20Yasuo Tabei
 
Sketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - public
Yasuo Tabei
 
Lgm saarbrucken
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
Yasuo Tabei
 
異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法
MapR Technologies Japan
 
ウェーブレット木の世界
ウェーブレット木の世界ウェーブレット木の世界
ウェーブレット木の世界
Preferred Networks
 
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
Shirou Maruyama
 
bigdata2012nlp okanohara
bigdata2012nlp okanoharabigdata2012nlp okanohara
bigdata2012nlp okanohara
Preferred Networks
 
Constitution herd
Constitution  herdConstitution  herd
Constitution herd
acolyte26
 
American YouthWorks' Environmental Corps
American YouthWorks' Environmental CorpsAmerican YouthWorks' Environmental Corps
American YouthWorks' Environmental Corps
American YouthWorks
 

Viewers also liked (20)

SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
 
WABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTreeWABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTree
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 public
 
DCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant Space
 
Sketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-public
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabei
 
GIW2013
GIW2013GIW2013
GIW2013
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20
 
Sketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - public
 
Lp Boost
Lp BoostLp Boost
Lp Boost
 
Lgm saarbrucken
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
 
異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法
 
ウェーブレット木の世界
ウェーブレット木の世界ウェーブレット木の世界
ウェーブレット木の世界
 
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
 
bigdata2012nlp okanohara
bigdata2012nlp okanoharabigdata2012nlp okanohara
bigdata2012nlp okanohara
 
Constitution herd
Constitution  herdConstitution  herd
Constitution herd
 
American YouthWorks' Environmental Corps
American YouthWorks' Environmental CorpsAmerican YouthWorks' Environmental Corps
American YouthWorks' Environmental Corps
 

Mlab2012 tabei 20120806

  • 1. 2012 Sapporo Workshop on Machine Learning and Applications to Biology Aug 6-7 2012: Sapporo, Hokkaido, Japan Space-Efficient Multibit Trees for Large-Scale Chemical Fingerprint Searches Yasuo Tabei JST ERATO HomePage: https://sites.google.com/site/yasuotabei/ E-mail: yasuo.tabei@gmail.com
  • 2. Outline •  Overview –  Chemical fingerprint search •  Multibit Tree •  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS •  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie •  Experiments
  • 3. Chemical fingerprint search •  Space-efficient data structures to index 30 million chemical fingerprints, e.g., W=(1,5,7,10) •  Searching all fingerprints similar to a query (≧ε) –  Similarity = Jaccard (Tanimoto) (J(W,W’)=|W∩W’|/|W∪W’|) •  Multibit tree (Kristensen et al.,09): Data structure enabling fast similarity searches –  Memory-inefficient pointer-based representation –  Need to store original fingerprint databases to prevent false positives •  Succinct data structure (Jacobson, 1998) –  Space-efficient and enables fast operations
  • 4. Outline •  Overview –  Chemical fingerprint search •  Multibit Tree •  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS •  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie •  Experiments
  • 5. Multibit Tree (MT) (Kristensen et al., 09) •  Multiple decision trees built on fingerprints clustered with respect to cardinality •  Construction: (i) fingerprint database, e.g., W1=(1,2,7,4,8), W2=(1,3,7), W3=(1,3), W5=(1,4,8,7), W6=(1), ..., Wn=(1,3,4); (ii) cluster into bins w.r.t. cardinality, e.g., a bin of cardinality 1 holding W6=(1), W32=(2), W42=(4), W50=(8); (iii) build decision trees [Figure: fingerprint database, cardinality bins, and the decision trees built on them]
  • 6. Similarity search of a query fingerprint Q •  Find all fingerprints Wi such that J(Wi, Q) ≥ ε •  For an efficient similarity search, find the candidate solutions Wi satisfying two constraints: 1.  Cardinality constraint: ε|Q| ≤ |Wi| ≤ |Q|/ε 2.  Upper bound of Jaccard similarity: J(Wi, Q) ≤ min(|Wi| − N0, |Q| − N1) / (|Wi| + |Q| − min(|Wi| − N0, |Q| − N1)) - N0: the number of elements contained in Wi and not in Q - N1: the number of elements contained in Q and not in Wi
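The two pruning conditions on slide 6 can be illustrated with a minimal Python sketch. The function names are hypothetical, fingerprints are assumed to be non-empty collections of on-bit indices, and N0 and N1 are passed in directly rather than read from a tree node.

```python
def jaccard(w, q):
    """Jaccard (Tanimoto) similarity |W ∩ Q| / |W ∪ Q| of two non-empty fingerprints."""
    w, q = set(w), set(q)
    return len(w & q) / len(w | q)


def passes_cardinality(w_size, q_size, eps):
    """Cardinality constraint: eps * |Q| <= |W| <= |Q| / eps."""
    return eps * q_size <= w_size <= q_size / eps


def jaccard_upper_bound(w_size, q_size, n0, n1):
    """Upper bound on J(W, Q) given n0 elements in W but not Q and n1 in Q but not W."""
    # The intersection can be no larger than either set minus its known mismatches.
    m = min(w_size - n0, q_size - n1)
    return m / (w_size + q_size - m)


W = (1, 5, 7, 10)
Q = (1, 5, 7, 9)
print(jaccard(W, Q))                      # 3 shared elements out of 5 in the union
print(passes_cardinality(len(W), len(Q), 0.9))
print(jaccard_upper_bound(len(W), len(Q), n0=1, n1=1))
```

When N0 and N1 count all mismatches exactly, as in this example, the bound equals the true similarity; with partial counts it can only be larger, which is why it is safe for pruning.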
  • 7. Similarity search of a query fingerprint Q Step 1: Find candidate solutions I1 satisfying the cardinality constraint. Step 2: Find candidate solutions I2 satisfying the upper bound. Step 3: Calculate similarities to remove false positives. [Figure: decision trees with searched and pruned subtrees]
  • 8. Drawbacks •  Pointer-based representation of multibit trees needs a large memory                  bits - Kc: number of fingerprints in bin c - C: total number of bins –  Log(.) factor is too large! •  Need to store original fingerprint databases in memory to filter out false positives
  • 9. Outline •  Overview –  Chemical fingerprint search •  Multibit Tree •  Succinct Data Structures –  Rank/select dictionary –  Succinct ordered tree: LOUDS •  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie •  Experiments
  • 10. Succinct Data Structures •  Space-efficient data structures enabling fast operations •  Pointer-based representations of ordered trees consume a large amount of memory –  O(n log n) bits for the number n of nodes –  The log n factor is too large for large-scale data •  Succinct representations store an ordered tree as a bit string of length 2n + 1 and enable O(1)-time operations –  Ex) 0100100101000 •  Various succinct data structures –  sets (Raman, 2002), sequences (Ferragina, 2001), trees (Jacobson, 1989), graphs (Turan, 1989)
  • 11. Rank/select dictionary (RRR, 2002): Foundation of various succinct data structures •  Enables the rank/select operations on a bit string B in O(1) time -  Rankc(B,i): returns the number of occurrences of c∈{0,1} in B[1…i] -  Selectc(B,i): returns the position of the i-th occurrence of c∈{0,1} in B •  Efficient rank/select dictionary (Navarro and Providel, 2012) Ex) B=0110011100 (positions i=1,…,10): Rank1(B,8)=5 since B[1…8]=01100111 contains five 1s; Select1(B,3)=6 since the third 1 is at position 6
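As a reference for the definitions on slide 11, here is a minimal Python sketch of rank and select with 1-based positions. It scans the string linearly, so it only illustrates the semantics and is not the O(1)-time dictionary the slide refers to; the function names are hypothetical.

```python
def rank(B, c, i):
    """Number of occurrences of bit c ('0' or '1') in B[1..i], 1-based and inclusive."""
    return B[:i].count(c)


def select(B, c, i):
    """1-based position of the i-th occurrence of bit c in B."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        if bit == c:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("B contains fewer than i occurrences of c")


B = "0110011100"
print(rank(B, "1", 8))    # Rank1(B,8) from the slide
print(select(B, "1", 3))  # Select1(B,3) from the slide
```

The two operations are inverse in the sense that rank(B, c, select(B, c, i)) returns i whenever the i-th occurrence exists.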
  • 12. [Figure: bit array B split into large and small blocks] •  Divide the bit array B into large blocks of length l=(log n)^2; RL = ranks of large blocks •  Divide each large block into small blocks of length s=(log n)/2; Rs = ranks of small blocks relative to the large block •  rank1(B,i) = RL[⌊i/l⌋] + Rs[⌊i/s⌋] + (remaining rank inside the small block) •  Time: O(1) Memory: n + o(n) bits
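The block decomposition on slide 12 can be sketched as follows. This is an illustrative Python version under stated assumptions: bits are held in a plain list rather than packed words, block lengths are rounded so that the large block length is a multiple of the small one, and the remaining rank is computed by a short scan where a real implementation would use a lookup table or a popcount instruction. The class name `RankDirectory` is hypothetical.

```python
import math


class RankDirectory:
    """Two-level rank directory over a list of 0/1 bits (illustrative, not bit-packed)."""

    def __init__(self, bits):
        self.bits = bits
        log_n = max(2, math.ceil(math.log2(max(len(bits), 2))))
        self.s = log_n // 2            # small block length, about (log n)/2
        self.l = self.s * 2 * log_n    # large block length, about (log n)^2, a multiple of s
        self.RL = []                   # number of 1s before each large block
        self.Rs = []                   # number of 1s from the large-block start to each small block
        total = in_large = 0
        for start in range(0, len(bits) + 1, self.s):
            if start % self.l == 0:
                self.RL.append(total)
                in_large = 0
            self.Rs.append(in_large)
            ones = sum(bits[start:start + self.s])
            total += ones
            in_large += ones

    def rank1(self, i):
        """Number of 1s among the first i bits, i.e. in B[1..i] with 1-based positions."""
        small = i // self.s
        tail = sum(self.bits[small * self.s:i])  # scans fewer than s bits
        return self.RL[i // self.l] + self.Rs[small] + tail


B = [int(c) for c in "0110011100"]
print(RankDirectory(B).rank1(8))
```

Storing the small-block ranks relative to the enclosing large block is what keeps their entries short, which is where the o(n) overhead on the slide comes from.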
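  • 13. Level-order Unary Degree Sequence (LOUDS) (Jacobson, 1989) •  Represents an ordered tree as a bit string of length 2n+1 (n: number of nodes) •  Construction 1)  Traverse the tree in a breadth-first manner 2)  Generate k 1s followed by a 0 for each node of degree k, in the order visited •  Ex) Tree with root 1 whose children are 2 and 3; node 2 has children 4 and 5; node 3 has children 6 and 7. With a super root S (encoded as 10) added above node 1: B = 10 110 110 110 0 0 0 0 = 101101101100000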
  • 14. Properties of LOUDS •  Ex) B: 101101101100000 for the 7-node tree (root 1; children 2, 3; grandchildren 4, 5, 6, 7) •  For a tree consisting of n nodes, B contains n 1s and n+1 0s •  Each 1 on B, and each 0 except the first, corresponds one-to-one to a tree node •  The positions of the parent and children of a tree node on B can be calculated in O(1) time by combining the rank/select operations
  • 15. O(1)-time operations on a tree •  Parent/child operations for i such that B[i]=1 –  First child: p=select0(B,rank1(B,i))+1 –  Next child: i+1 for position i of the first child –  Parent: p=select1(B,rank0(B,i)) Ex) Calculating the first child for i=4 on B=101101101100000 (positions 1…15): rank1(B,4)=3, select0(B,3)=8, so the first child is at p=8+1=9
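The LOUDS construction and the parent/child formulas from slides 13 to 15 can be checked with a small Python sketch. The tree is given as a hypothetical dict from a node to its ordered list of children, positions are 1-based as on the slides, and rank/select are naive linear scans rather than O(1)-time dictionaries.

```python
from collections import deque


def build_louds(children, root):
    """LOUDS string: '10' for the super root, then k 1s and a 0 per node in breadth-first order."""
    bits = ["10"]
    queue = deque([root])
    while queue:
        v = queue.popleft()
        kids = children.get(v, [])
        bits.append("1" * len(kids) + "0")
        queue.extend(kids)
    return "".join(bits)


def rank(B, c, i):
    return B[:i].count(c)


def select(B, c, i):
    seen = 0
    for pos, bit in enumerate(B, start=1):
        if bit == c:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("fewer than i occurrences")


def first_child(B, i):
    """Position of the first child of the node at position i (B[i] = 1), or None for a leaf."""
    p = select(B, "0", rank(B, "1", i)) + 1
    return p if B[p - 1] == "1" else None


def parent(B, i):
    """Position of the parent of the node at position i; not defined for the root at i = 1."""
    return select(B, "1", rank(B, "0", i))


tree = {1: [2, 3], 2: [4, 5], 3: [6, 7]}
B = build_louds(tree, 1)
print(B)                  # bit string for the 7-node example tree
print(first_child(B, 4))  # first child of node 3, which sits at position 4
print(parent(B, 9))       # parent of node 6, which sits at position 9
```

The leaf check in `first_child` relies on the fact that a leaf contributes a lone 0, so the position after the selected 0 holds another 0 rather than a child's 1.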
  • 16. Outline •  Overview –  Chemical fingerprint search •  Multibit Tree •  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS •  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie •  Experiments
  • 17. Succinct Multibit Trees (SMT) •  Consist of compact representations of multibit trees and fingerprint databases •  Represent multibit trees by LOUDS –  O(8 Σ_{c=1}^{C} |Kc| − 4C + MC) bits, not including the log factor –  Fast similarity searches •  Two compact representations of fingerprint databases –  Variable-length array (VLA) –  Succinct trie (TRIE)
  • 18. Succinct representation of multibit trees (SMT) •  Basic idea is to represent MT by LOUDS –  MT consists of multiple binary decision trees •  Bc: LOUDS representation of a decision tree •  Lc: bit string indicating whether Bc[i] is a leaf or not •  IDs: array containing fingerprint identifiers [Figure: a 7-node decision tree (MT) with leaves holding W3, W4, W1, W2, and its SMT representation]
  • 19. Access to node auxiliaries and fingerprint identifiers in O(1)-time •  Access to node auxiliaries M_v^1, M_v^0 for calculating upper bounds –  v = rank1(Bc, p) for a given position p –  Each 1 bit in Bc corresponds to a node v •  Identifiers for calculating Jaccard similarities (positions p with Lc[p] = 1) –  IDs[rank1(Lc, p)] for a given position p –  Each 1 bit in Lc corresponds to an index on IDs
  • 20. Variable-length array for compactly representing fingerprints •  A standard array consists of bit strings of fixed length –  Space-inefficient for storing small values Ex) Array in which each element is represented as 8 bits: integers 2, 1, 3, 4 → bit strings 00000010 00000001 00000011 00000100 (32 bits) •  Variable-length array = bit strings of different lengths Ex) Integers 2, 1, 3, 4 → bit strings 10 1 11 100 (8 bits) –  Space-efficient –  Difficulty: distinguishing the range of bits that belongs to each integer
  • 21. Representation of variable-length array •  Use two bit strings to represent an array A: -  R: bit string whose k-th substring is the binary representation of A[k] -  P: bit string whose k-th substring consists of ⌊log2 A[k]⌋ 0s followed by a 1, so that it has the same length as the k-th substring of R
  • 22. Recovering A[k] from variable-length array •  A[k] is recovered in three steps: 1.  Start position s: if k=1 then s=1, else s = select1(P,k-1) + 1 2.  End position e: e = select1(P,k) 3.  Conversion: convert the substring R[s,e] to an integer •  O(1)-time Ex) k=3: 1.  s = select1(P,2)+1 = 4 2.  e = select1(P,3) = 7 3.  Convert R[4,7]=1000 to the integer 8
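The R/P encoding and the three recovery steps from slides 21 and 22 can be sketched in Python as below. Two assumptions are mine: every value is a positive integer (so its binary form starts with a 1), and the example array A = [2, 1, 8] is a hypothetical input chosen to reproduce the positions in the slide's k=3 example. The select here is a linear scan, not an O(1)-time structure.

```python
def encode_vla(values):
    """Return (R, P): R concatenates binary codes, P marks the last bit of each code with a 1."""
    R, P = [], []
    for v in values:
        if v < 1:
            raise ValueError("this sketch assumes positive integers")
        code = bin(v)[2:]                      # binary representation without the '0b' prefix
        R.append(code)
        P.append("0" * (len(code) - 1) + "1")  # same length as the code, ending in 1
    return "".join(R), "".join(P)


def select1(P, k):
    """1-based position of the k-th 1 in P."""
    seen = 0
    for pos, bit in enumerate(P, start=1):
        if bit == "1":
            seen += 1
            if seen == k:
                return pos
    raise IndexError("k is out of range")


def get(R, P, k):
    """Recover A[k] (1-based) following the three steps on the slide."""
    s = 1 if k == 1 else select1(P, k - 1) + 1   # start position
    e = select1(P, k)                            # end position
    return int(R[s - 1:e], 2)                    # convert R[s..e] to an integer


R, P = encode_vla([2, 1, 8])
print(R, P)
print(get(R, P, 3))
```

Because P has exactly one 1 per element, the number of stored elements is the number of 1s in P, and both strings always have the same length.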
  • 23. Trie •  Used to store an associative array –  Keys are usually strings •  Applicable to fingerprints considered as strings –  Each node defines the key it is associated with –  All the descendants of a node have a common prefix, the string associated with that node –  Values are associated only with leaves and with some inner nodes that correspond to keys Ex) Build a trie from W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5) [Figure: the resulting trie]
  • 24. Difficulty •  The alphabet size tends to be small for typical trie applications, e.g., DNA (4), English (26) •  Difficulty: the word size of fingerprints is not always small, e.g., 881 dimensions for PubChem –  Memory usage is dominated by labels •  Compute the difference between each node label and its parent node label Ex) For W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5): build the trie, replace each label by its difference from the parent label, and represent the result as a succinct trie by LOUDS [Figure: trie before and after computing label differences]
  • 25. Succinct Trie (TRIE) •  Three components: –  T: LOUDS representation of the trie –  D: variable-length array containing node labels –  idconv: array containing fingerprint identifiers Ex) Node ids: 1 2 3 4 5 6 7 8 9 10 11 12; T (LOUDS bit string, super root first): 10 110 110 10 110 10 10 0 10 0 10 0 0; D (label differences): 0 1 2 1 2 1 1 3 2 4 3 1; idconv (index W1 W2 W3 W4 W5): 7 12 11 10 9 [Figure: trie and its succinct representation]
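The T and D components on slide 25 can be derived from the four example fingerprints with a short Python sketch. It assumes each fingerprint is sorted in ascending order (so every label difference is positive), uses nested dicts as a temporary pointer-based trie, and leaves out idconv; the function name is hypothetical and D is returned as a plain list rather than a variable-length array.

```python
from collections import deque


def build_succinct_trie(fingerprints):
    """Return (T, D): the LOUDS bit string of the trie and the label differences in BFS order."""
    # Temporary pointer-based trie: each node is a dict mapping a label to a child node.
    root = {}
    for w in fingerprints:
        node = root
        for x in w:
            node = node.setdefault(x, {})

    T = ["10"]                   # super root
    D = [0]                      # the root carries label 0
    queue = deque([(root, 0)])   # (node, label of that node)
    while queue:
        node, label = queue.popleft()
        T.append("1" * len(node) + "0")
        for child_label in sorted(node):
            D.append(child_label - label)   # difference from the parent label
            queue.append((node[child_label], child_label))
    return "".join(T), D


words = [(1, 2, 3), (2, 3, 7, 8), (1, 2, 5, 8), (1, 3, 5)]
T, D = build_succinct_trie(words)
print(T)
print(D)
```

For these inputs the largest original label is 8 while the largest difference is 4, which is the effect the difference step is meant to achieve before D is stored in a variable-length array.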
  • 26. Outline •  Overview –  Chemical fingerprint search •  Multibit Tree •  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS •  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie •  Experiments
  • 27. Experiments •  30 million chemical fingerprints from the PubChem database •  Evaluated search time and memory •  Compared the succinct multibit tree (SMT) to the pointer-based multibit tree (MT) •  Compared the variable-length array (VLA) and succinct trie (TRIE) to the raw representation of fingerprint databases
  • 28. Memory usage of multibit trees [Plot: memory (MB) versus number of fingerprints, up to 3.0e+07, for SMT and MT; annotated values: 6 GB (MT) and 847 MB (SMT)]
  • 29. Memory usage of representations of fingerprint databases [Plot: memory (MB) versus number of fingerprints, up to 3.0e+07, for TRIE, VLA, and RAW; annotated values: 16 GB (RAW), 3.2 GB (VLA), 1.3 GB (TRIE)]
  • 30. Search time and memory on 30 million fingerprints (ε=0.98), #answers: 10 [Plot: search time (sec) versus memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA, and MT+RAW; annotated search times: 0.021, 0.014, and 0.006 sec; annotated memory: 2 GB, 4 GB, and 22 GB]
  • 31. Search time and memory on 30 million fingerprints (ε=0.9), #answers: 1,440 [Plot: search time (sec) versus memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA, and MT+RAW; annotated search times: 1.7, 0.58, and 0.3 sec; annotated memory: 2 GB, 4 GB, and 22 GB]
  • 32. Summary •  Succinct Multibit Trees (SMT) •  Compactly represent multibit trees and fingerprints by succinct data structures •  Represent multibit trees by LOUDS •  Represent fingerprints by variable-length array and succinct trie •  Enable us to index 30 million fingerprints in 2GB by SMT+TRIE and in 4GB by SMT+VLA •  Search time remains practically fast