WABI2012-SuccinctMultibitTree


  1. 1. 12th Workshop on Algorithms in Bioinformatics, Ljubljana, Slovenia
      Succinct Multibit Tree: Compact Representation of Multibit Trees by Succinct Data Structures in Chemical Fingerprint Searches
      Yasuo Tabei, JST ERATO Minato Project
  2. 2. Chemical fingerprint search
      •  Space-efficient data structures to index 30 million chemical fingerprints, e.g., W=(1,5,7,10)
      •  Find all fingerprints similar to a query (similarity ≥ ε)
         –  Similarity = Jaccard (Tanimoto): J(W,W')=|W∩W'|/|W∪W'|
      •  Multibit tree (Kristensen et al., WABI09)
         –  Data structure enabling fast similarity searches
         –  Pointer-based representation is memory-inefficient
      •  Succinct data structures (Jacobson, 1989)
         –  Space-efficient while enabling fast operations
      ⇒ Present a succinct representation of the multibit tree
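The similarity search stated on this slide can be sketched as a brute-force baseline. This is only an illustration, not the method of the talk: the function names `jaccard` and `linear_search` are hypothetical, and fingerprints are assumed to be given as collections of on-bit positions.

```python
def jaccard(w, q):
    """Jaccard (Tanimoto) similarity |W∩Q| / |W∪Q| of two fingerprints."""
    w, q = set(w), set(q)
    union = w | q
    # Two empty fingerprints are treated as identical (an assumption).
    return len(w & q) / len(union) if union else 1.0


def linear_search(database, query, eps):
    """Return the indices of all fingerprints with similarity >= eps.

    This scans every fingerprint; the multibit tree exists to avoid that.
    """
    return [i for i, w in enumerate(database) if jaccard(w, query) >= eps]


database = [(1, 5, 7, 10), (1, 5, 7), (2, 3)]
hits = linear_search(database, (1, 5, 7, 10), 0.7)
```

Here `hits` contains the first two fingerprints, whose similarities to the query are 1.0 and 0.75.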
  3. 3. Outline •  Multibit Tree•  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS•  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie•  Experiments
  4. 4. Outline •  Multibit Tree•  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS•  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie•  Experiments
  5. 5. Multibit Tree (MT) (Kristensen et al., 09)
      •  Multiple decision trees built on fingerprints clustered with respect to cardinality
         (i)  Fingerprint database, e.g., W1=(1,2,7,4,8), W2=(1,3,7), W3=(1,3), W5=(1,4,8,7), W6=(1), ..., Wn=(1,3,4)
         (ii)  Cluster into bins w.r.t. cardinality
         (iii)  Build decision trees
      [Figure: the database is split into bins of equal cardinality, and one decision tree is built over the fingerprints of each bin.]
  6. 6. Similarity search of a query fingerprint Q
      •  If the Jaccard similarity satisfies $J(W_i, Q) \ge \epsilon$, two constraints are satisfied:
         1.  Cardinality constraint: $\epsilon |Q| \le |W_i| \le \epsilon^{-1} |Q|$
         2.  Upper bound of Jaccard similarity: $\epsilon \le J(W_i, Q) \le \dfrac{\min(|W_i| - N_0,\ |Q| - N_1)}{|W_i| + |Q| - \min(|W_i| - N_0,\ |Q| - N_1)}$
         -  $N_0$: the number of elements contained in $W_i$ and not in $Q$
         -  $N_1$: the number of elements contained in $Q$ and not in $W_i$
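The two pruning tests above can be written down directly. The sketch below is a hedged illustration: the function names are hypothetical, and `n0` and `n1` are assumed to be the counts N0 and N1 known at the point where the test is made.

```python
def satisfies_cardinality(w_size, q_size, eps):
    """Cardinality constraint: eps*|Q| <= |W| <= |Q|/eps."""
    return eps * q_size <= w_size <= q_size / eps


def jaccard_upper_bound(w_size, q_size, n0, n1):
    """Upper bound on J(W, Q).

    At most min(|W| - N0, |Q| - N1) elements can be shared, and the
    Jaccard similarity grows with the size of the intersection.
    """
    m = min(w_size - n0, q_size - n1)
    return m / (w_size + q_size - m)


def can_prune(w_size, q_size, n0, n1, eps):
    """True if no fingerprint with these counts can reach similarity eps."""
    if not satisfies_cardinality(w_size, q_size, eps):
        return True
    return jaccard_upper_bound(w_size, q_size, n0, n1) < eps
```

For example, with |W| = |Q| = 4, N0 = 1 and N1 = 2, at most 2 elements can be shared, so the bound is 2 / (4 + 4 - 2) = 1/3 and the candidate is pruned for any ε above 1/3.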
  7. 7. Similarity search of a query fingerprint Q
      •  Step 1: Find candidate solutions I1 satisfying the cardinality constraint
      •  Step 2: Find candidate solutions I2 satisfying the upper bound
      •  Step 3: Calculate similarities to remove false positives among the candidates
      [Figure: decision trees of the bins; some subtrees are searched and others are pruned.]
  8. 8. Drawbacks
      •  Pointer-based representation of multibit trees needs a large amount of memory; its size in bits is expressed in terms of
         -  Kc: number of fingerprints in bin c
         -  C: total number of bins
         –  The log(·) factor is too large!
      •  Need to store original fingerprint databases in memory to filter out false positives
  9. 9. Outline •  Multibit Tree•  Succinct Data Structures –  Rank/select dictionary –  Succinct ordered tree: LOUDS•  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie•  Experiments
  10. 10. Rank/select dictionary (RRR, 2002): Foundation of various succinct data structures
      •  Enables the rank/select operations on a bit string B in O(1) time
         -  Rankc(B,i): returns the number of occurrences of c∈{0,1} in B[1…i]
         -  Selectc(B,i): returns the position of the i-th occurrence of c∈{0,1} in B
      •  Efficient rank/select dictionary (Navarro and Providel, 2012)
      Ex) B=0110011100 (positions i = 1, ..., 10): Rank1(B,8)=5, Select1(B,3)=6
      Memory: n + o(n) bits
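To make the two operations concrete, here is a naive reference version that scans the bit string. It only fixes the semantics (1-based positions, as on the slide); it is not the O(1)-time dictionary the slide refers to, and the function names are illustrative.

```python
def rank(bits, c, i):
    """Number of occurrences of bit c ('0' or '1') in B[1..i]."""
    return bits[:i].count(c)


def select(bits, c, i):
    """1-based position of the i-th occurrence of bit c in B."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == c:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("fewer than i occurrences of c in the bit string")


B = "0110011100"
r = rank(B, "1", 8)    # ones at positions 2, 3, 6, 7, 8
s = select(B, "1", 3)  # the third one sits at position 6
```

The two operations are inverse in the sense that `rank(B, c, select(B, c, i)) == i` whenever the i-th occurrence exists.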
  11. 11. Level-order Unary Degree Sequence (LOUDS) (Jacobson, 1989)
      •  Represents an ordered tree as a bit string of length 2n+1 (n: number of nodes)
      •  Construction
         1)  Traverse the tree in a breadth-first manner
         2)  Generate k 1s followed by a 0 for each node of degree k, in the order the nodes are visited
      Ex) Root 1 has children 2 and 3, node 2 has children 4 and 5, node 3 has children 6 and 7. With a super root S added above node 1: B = 101101101100000
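The construction can be sketched in a few lines. The representation of the tree as a dictionary from node to ordered child list, and the function name `louds`, are assumptions made for this illustration.

```python
from collections import deque


def louds(children, root):
    """LOUDS bit string of an ordered tree.

    children maps a node to the ordered list of its children; nodes
    without an entry are leaves.
    """
    bits = ["10"]  # super root: one child (the root), then a 0
    queue = deque([root])
    while queue:
        node = queue.popleft()  # breadth-first (level) order
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")  # k ones followed by a zero
        queue.extend(kids)
    return "".join(bits)


tree = {1: [2, 3], 2: [4, 5], 3: [6, 7]}
B = louds(tree, 1)
```

For the seven-node example tree this reproduces the slide's bit string of length 2n+1 = 15.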
  12. 12. Properties of LOUDS
      Ex) B: 101101101100000 for the tree with root 1, inner nodes 2 and 3, and leaves 4, 5, 6, 7
      •  For a tree consisting of n nodes, there are n 1s and n+1 0s in the bit string B
      •  Each 1, and each 0 except the first, corresponds one-to-one to a tree node
      •  Positions of the parent and children of a tree node on B can be calculated by combining the rank/select operations in O(1) time
  13. 13. O(1)-time operations on a tree
      •  Parent/child operations for i such that B[i]=1
         –  First child: p=select0(B,rank1(B,i))+1
         –  Next child: i+1 for position i of the first child
         –  Parent: p=select1(B,rank0(B,i))
      Ex) Calculating the first child for i = 4 (node 3) on B = 101101101100000:
          rank1(B,4)=3, select0(B,3)=8, so p=8+1=9, which is the position of node 6
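The navigation formulas can be checked on the example bit string with naive rank/select. This is a sketch of the formulas only: the helper names are illustrative, positions are 1-based as on the slide, and the scans are linear rather than O(1).

```python
def rank(bits, c, i):
    """Number of occurrences of bit c in B[1..i]."""
    return bits[:i].count(c)


def select(bits, c, j):
    """1-based position of the j-th occurrence of bit c."""
    pos = -1
    for _ in range(j):
        pos = bits.index(c, pos + 1)  # raises ValueError if it does not exist
    return pos + 1


def first_child(bits, i):
    """Position of the first child of the node at position i, or None for a leaf."""
    p = select(bits, "0", rank(bits, "1", i)) + 1
    return p if bits[p - 1] == "1" else None


def parent(bits, i):
    """Position of the parent of the node at position i (not defined for the root)."""
    return select(bits, "1", rank(bits, "0", i))


B = "101101101100000"
child_pos = first_child(B, 4)   # node 3 sits at position 4
parent_pos = parent(B, 9)       # node 6 sits at position 9
node_id = rank(B, "1", child_pos)
```

The node number of a position is the rank of its 1 bit, so position 9 is node 6 and its parent at position 4 is node 3.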
  14. 14. Outline •  Overview –  Chemical fingerprint search•  Multibit Tree•  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS•  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie•  Experiments
  15. 15. Succinct Multibit Trees (SMT)
      •  Consist of compact representations of multibit trees and fingerprint databases
      •  Represent multibit trees by LOUDS
         –  $O(8 \sum_{c=1}^{C} |K_c| - 4C + MC)$ bits, not including the log factor
         –  Fast similarity searches
      •  Two compact representations of fingerprint databases
         –  Variable-length array (VLA)
         –  Succinct trie (TRIE)
  16. 16. Succinct representation of multibit trees (SMT)
      •  Basic idea is to represent MT by LOUDS
         –  MT consists of multiple binary decision trees
      •  Bc: LOUDS representation of a decision tree
      •  Lc: bit string indicating whether Bc[i] is a leaf or not
      •  IDs: Array containing fingerprint identifiers
      [Figure: a decision tree of MT with nodes 1 to 7 and leaves holding W3, W4, W1, W2, next to its SMT representation.]
  17. 17. Access to node auxiliaries and fingerprint identifiers in O(1) time
      •  Access to node auxiliaries $M_v^1$, $M_v^0$ for calculating upper bounds
         –  v = rank1(Bc, p) for a given position p
         –  Each 1 bit in Bc corresponds to a node v
      •  Identifiers for calculating Jaccard similarities
         –  IDs[rank1(Lc, p)] for a given position p with Lc[p] = 1
         –  Each 1 bit in Lc corresponds to an index on IDs
      [Figure: the decision tree with nodes 1 to 7 and leaves holding W3, W4, W1, W2.]
  18. 18. Variable-length array for compactly representing fingerprints
      •  A standard array consists of bit strings of fixed length
         –  Space-inefficient for storing small values
         Ex) Array in which each element is represented as 8 bits (32 bits in total)
             Integer:    2        1        3        4
             Bit string: 00000010 00000001 00000011 00000100
      •  Variable-length array = bit strings of different lengths
         Ex) 8 bits in total
             Integer:    2  1 3  4
             Bit string: 10 1 11 100
         –  Space-efficient
         –  Random access is impossible on the concatenated bit string alone
  19. 19. Representation of variable-length array
      •  Use two bit strings to represent an array A:
         -  R: bit string whose k-th substring is the bit string representation of A[k]
         -  P: bit string whose k-th substring consists of $\lfloor \log_2 A[k] \rfloor$ 0s followed by a 1, so that it has the same length as the k-th substring of R
  20. 20. Recovering A[k] from variable-length array
      •  A[k] is recovered in three steps:
         1.  Start position s: if k=1 then s=1, else s = select1(P,k-1) + 1
         2.  End position e: e = select1(P,k)
         3.  Conversion: convert the substring R[s,e] to an integer
      •  O(1) time
      [Figure: example for k=3 marking the start position s and the end position e.]
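The three recovery steps can be sketched as follows, using the array (2, 1, 3, 4) from the earlier slide. The function names are hypothetical, the values are assumed to be non-negative integers, and `select1` is a linear scan standing in for an O(1)-time select dictionary.

```python
def build_vla(values):
    """Build the two bit strings R and P for an array of non-negative integers."""
    r_parts, p_parts = [], []
    for v in values:
        b = format(v, "b")                        # binary representation of A[k]
        r_parts.append(b)
        p_parts.append("0" * (len(b) - 1) + "1")  # a 1 marks the last bit of A[k]
    return "".join(r_parts), "".join(p_parts)


def select1(bits, j):
    """1-based position of the j-th 1 bit (naive scan)."""
    pos = -1
    for _ in range(j):
        pos = bits.index("1", pos + 1)
    return pos + 1


def access(R, P, k):
    """Recover A[k] (1-based k) from R and P."""
    s = 1 if k == 1 else select1(P, k - 1) + 1  # step 1: start position
    e = select1(P, k)                           # step 2: end position
    return int(R[s - 1:e], 2)                   # step 3: conversion


R, P = build_vla([2, 1, 3, 4])
third = access(R, P, 3)
```

Here R is 10111100 and P is 01101001, eight bits each, so the pair costs twice the length of the concatenated values before the select structure is added.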
  21. 21. Trie
      •  Used to store an associative array
         –  Keys are usually strings
      •  Applicable to fingerprints considered as strings
      Ex) Build a trie from W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5)
      [Figure: the trie built from W1 to W4; each node carries a node id and an element label.]
  22. 22. Difficulty
      •  The alphabet size tends to be small for typical trie applications, e.g., DNA (4), English (26)
      •  Difficulty: the word size of fingerprints is not always small, e.g., PubChem, 881 dimensions
         –  Memory usage is dominated by labels
      •  Compute the difference between each node label and the label of its parent node
      Ex) W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5)
      [Figure: the trie for W1 to W4 with its original labels and with the labels replaced by differences; each is stored as a succinct trie by LOUDS.]
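The effect of storing label differences can be illustrated with a pointer-based trie built from nested dictionaries. This is a sketch under assumptions, not the succinct structure of the talk: the function names are hypothetical, the root is taken to have label 0, and a fingerprint that is a prefix of another is not marked separately.

```python
def build_delta_trie(fingerprints):
    """Trie whose edges are keyed by (node label - parent label)."""
    root = {}
    for w in fingerprints:
        node, prev = root, 0  # the root is assumed to carry label 0
        for x in sorted(w):
            # Siblings share the same parent label, so the difference
            # identifies a child as uniquely as the label itself.
            node = node.setdefault(x - prev, {})
            prev = x
    return root


def largest_label(trie):
    """Largest difference stored anywhere in the trie."""
    return max((max(d, largest_label(sub)) for d, sub in trie.items()), default=0)


fingerprints = [(1, 2, 3), (2, 3, 7, 8), (1, 2, 5, 8), (1, 3, 5)]
trie = build_delta_trie(fingerprints)
biggest = largest_label(trie)
```

For the four example fingerprints the original labels go up to 8 while the stored differences go up to 4, which is what makes a variable-length encoding of the labels pay off.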
  23. 23. Succinct Trie (TRIE)
      •  Three components:
         –  T: LOUDS representation of the trie
         –  D: Variable-length array containing node labels
         –  idconv: Array containing fingerprint identifiers
      Ex) For a trie with node ids 1 to 12, the LOUDS bit string (LBS) is T = 10 110 110 10 110 10 10 0 10 0 10 0 0
      [Figure: the example trie with difference labels, shown with T, the label array D, and the idconv array indexed by W1, W2, ...]
  24. 24. Outline •  Multibit Tree•  Succinct Data Structures –  Rank/Select dictionary –  Succinct ordered tree: LOUDS•  Succinct Multibit Trees –  Compact representation of multibit trees –  Compact representation of fingerprint databases 1.  Variable-length array 2.  Succinct Trie•  Experiments
  25. 25. Experiments
      •  30 million chemical fingerprints from the PubChem database
      •  Evaluate search time and memory
      •  Compared succinct multibit tree (SMT) to pointer-based multibit tree (MT)
      •  Compared variable-length array (VLA) and succinct trie (TRIE) to the raw representation of fingerprint databases
  26. 26. Memory usage of multibit trees
      [Plot: memory (MB) against the number of fingerprints, up to 3.0e+07, for SMT and MT; annotated values: 6GB and 847MB.]
  27. 27. Memory usage of representations of fingerprint databases
      [Plot: memory (MB) against the number of fingerprints, up to 3.0e+07, for TRIE, VLA and RAW; annotated values: 16GB, 3.2GB and 1.3GB.]
  28. 28. Search time and memory on 30 million fingerprints (ε=0.98), #answers: 10
      [Plot: search time (sec) against memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA and MT+RAW; annotated search times: 0.021, 0.014 and 0.006; annotated memory: 2GB, 4GB and 22GB.]
  29. 29. Search time and memory on 30 million fingerprints (ε=0.9), #answers: 1,440
      [Plot: search time (sec) against memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA and MT+RAW; annotated search times: 1.7, 0.58 and 0.3; annotated memory: 2GB, 4GB and 22GB.]
  30. 30. Summary
      •  Succinct Multibit Trees (SMT)
      •  Compactly represent multibit trees and fingerprints by succinct data structures
      •  Represent multibit trees by LOUDS
      •  Represent fingerprints by variable-length array and succinct trie
      •  Enable us to index 30 million fingerprints in 2GB by SMT+TRIE and in 4GB by SMT+VLA
      •  Search time remains fast in practice
  31. 31. Succinct Data Structures
      •  Space-efficient data structures enabling fast operations
      •  Pointer-based representations of ordered trees consume a large amount of memory
         –  O(n log n) bits for n nodes
         –  The log n factor is too large for large-scale data
      •  Succinct representations store an ordered tree as a bit string of length 2n + 1 and enable O(1)-time operations
         –  Ex) 0100100101000
      •  Various succinct data structures
         –  sets (Raman, 2002), sequences (Ferragina, 2001), trees (Jacobson, 1989), graphs (Turan, 1989)
  32. 32. Bit array B
      •  Divide the bit array B into large blocks of length l = log² n
         –  RL = ranks of large blocks
      •  Divide each large block into small blocks of length s = (log n)/2
         –  Rs = ranks of small blocks relative to the large block
      •  rank1(B,i) = RL[⌊i/l⌋] + Rs[⌊i/s⌋] + (remaining rank)
      •  Time: O(1); Memory: n + o(n) bits
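The two-level layout on this slide can be sketched as below. The class name `RankDirectory` is hypothetical, and two simplifications are assumptions of this sketch: log n is rounded up to an even integer so that large blocks are a whole number of small blocks, and the remaining rank is counted by scanning at most s bits where a real implementation would use a lookup table or a popcount instruction.

```python
import math


class RankDirectory:
    """Two-level rank directory over a bit string of '0'/'1' characters."""

    def __init__(self, bits):
        self.bits = bits
        log_n = math.ceil(math.log2(max(len(bits), 2)))
        log_n += log_n % 2            # make log n even so that s divides l
        self.s = log_n // 2           # small block length (log n)/2
        self.l = log_n * log_n        # large block length log^2 n
        self.RL, self.Rs = [], []
        total = in_large = 0
        for start in range(0, len(bits) + 1, self.s):
            if start % self.l == 0:   # a new large block begins here
                self.RL.append(total)
                in_large = 0
            self.Rs.append(in_large)  # rank relative to the large block
            ones = bits[start:start + self.s].count("1")
            total += ones
            in_large += ones

    def rank1(self, i):
        """Number of 1s in B[1..i], for 0 <= i <= len(B)."""
        j = i // self.s
        remaining = self.bits[j * self.s:i].count("1")
        return self.RL[i // self.l] + self.Rs[j] + remaining


B = "0110011100"
directory = RankDirectory(B)
r = directory.rank1(8)
```

Because every entry of Rs is bounded by the large block length, it fits in O(log log n) bits, which is where the o(n) overhead of the directory comes from.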
  33. 33. Recovering A[k] from variable-length array
      •  A[k] is recovered in three steps:
         1.  Start position s: if k=1 then s=1, else s = select1(P,k-1) + 1
         2.  End position e: e = select1(P,k)
         3.  Conversion: convert the substring R[s,e] to an integer
      •  O(1) time
      Ex) k=3
         1.  s = select1(P,2)+1 = 4
         2.  e = select1(P,3) = 7
         3.  Convert R[4,7]=1000 to the integer 8
