- 1. 2012 Sapporo Workshop on Machine Learning and Applications to Biology Aug 6-7 2012: Sapporo, Hokkaido, Japan Space-Efficient Multibit Trees for Large-Scale Chemical Fingerprint Searches Yasuo Tabei JST ERATO HomePage: https://sites.google.com/site/yasuotabei/ E-mail: yasuo.tabei@gmail.com
- 2. Outline • Overview – Chemical fingerprint search • Multibit Tree • Succinct Data Structures – Rank/Select dictionary – Succinct ordered tree: LOUDS • Succinct Multibit Trees – Compact representation of multibit trees – Compact representation of fingerprint databases 1. Variable-length array 2. Succinct Trie • Experiments
- 3. Chemical fingerprint search • Goal: space-efficient data structures to index 30 million chemical fingerprints, e.g., W=(1,5,7,10) • Search for all fingerprints whose similarity to a query is at least a threshold ε – Similarity = Jaccard (Tanimoto): J(W,W')=|W∩W'|/|W∪W'| • Multibit tree (Kristensen et al., 2009): data structure enabling fast similarity searches – Memory-inefficient pointer-based representation – Needs the original fingerprint database in memory to filter out false positives • Succinct data structures (Jacobson, 1989) – Space-efficient while still enabling fast operations
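A minimal sketch of the Jaccard (Tanimoto) similarity defined on this slide, assuming a fingerprint is given as a collection of on-bit indices. The function name `jaccard`, the query `Q`, and the convention for two empty fingerprints are illustrative assumptions, not taken from the slides.

```python
def jaccard(w, q):
    """Jaccard (Tanimoto) similarity |W ∩ Q| / |W ∪ Q| of two fingerprints."""
    w, q = set(w), set(q)
    union = w | q
    if not union:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    return len(w & q) / len(union)


W = (1, 5, 7, 10)   # example fingerprint from the slide
Q = (1, 5, 7, 11)   # hypothetical query
print(jaccard(W, Q))  # 3 shared indices out of 5 distinct ones
```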
- 5. Multibit Tree (MT) (Kristensen et al., 2009) • Multiple decision trees built on fingerprints clustered with respect to cardinality: (i) fingerprint database, (ii) cluster into bins w.r.t. cardinality, (iii) build decision trees • Example database: W1=(1,2,7,4,8), W2=(1,3,7), W3=(1,3), W5=(1,4,8,7), W6=(1), ..., Wn=(1,3,4) • Example bins: cardinality 1: W6=(1), W32=(2), W42=(4), W50=(8); cardinality 2: W3=(1,3), W9=(2,4), W12=(1,4); cardinality 3: W3=(2,5,6), W9=(1,3,6), W12=(4,6,7), W15=(2,3,5), W18=(4,6,8) • [Figure: one decision tree per bin, with the fingerprints of the bin at the leaves]
- 6. Similarity search of a query fingerprint Q • Find all fingerprints $W_i$ such that $J(W_i, Q) \ge \epsilon$ • For an efficient similarity search, find the candidate solutions $W_i$ satisfying two constraints: 1. Cardinality constraint: $\epsilon |Q| \le |W_i| \le \frac{1}{\epsilon} |Q|$ 2. Upper bound of Jaccard similarity: $J(W_i, Q) \le \frac{\min(|W_i| - N_0,\ |Q| - N_1)}{|W_i| + |Q| - \min(|W_i| - N_0,\ |Q| - N_1)}$ – $N_0$: the number of elements contained in $W_i$ and not in $Q$ – $N_1$: the number of elements contained in $Q$ and not in $W_i$
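The two constraints can be sketched as small helper functions. This is only an illustration under stated assumptions: the function names are hypothetical, the cardinality range is rounded to integers with `ceil`/`floor`, and floating-point rounding of ε is ignored.

```python
import math


def cardinality_range(q_size, eps):
    """Sizes |W| that can still reach J(W, Q) >= eps: eps*|Q| <= |W| <= |Q|/eps."""
    return math.ceil(eps * q_size), math.floor(q_size / eps)


def jaccard_upper_bound(w_size, q_size, n0, n1):
    """Upper bound on J(W, Q) given n0 elements known to be in W but not in Q,
    and n1 elements known to be in Q but not in W."""
    inter = min(w_size - n0, q_size - n1)   # largest possible |W ∩ Q|
    return inter / (w_size + q_size - inter)


# Hypothetical example: |W| = 3, |Q| = 4.
print(cardinality_range(4, 0.5))
print(jaccard_upper_bound(3, 4, n0=0, n1=1))  # partial knowledge: loose bound
print(jaccard_upper_bound(3, 4, n0=1, n1=2))  # full knowledge: bound is tight
```

With only partial counts the bound is loose but never below the true similarity, which is what allows a subtree to be pruned as soon as the bound drops under ε.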
- 7. Similarity search of a query fingerprint Q • Step 1: Find candidate solutions I1 satisfying the cardinality constraint • Step 2: Find candidate solutions I2 satisfying the upper bound • Step 3: Calculate similarities to remove false positives • [Figure: the decision trees of the bins, with subtrees marked as searched or pruned]
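A simplified sketch of this search, assuming fingerprints are kept in a plain dictionary and binned by cardinality. It covers only Step 1 (cardinality constraint) and Step 3 (exact verification); the tree-based pruning of Step 2 is deliberately left out, so this is not the multibit tree itself. The names `build_bins` and `search` and the query are hypothetical, and the query is assumed to be non-empty.

```python
import math
from collections import defaultdict


def build_bins(db):
    """Cluster fingerprint identifiers into bins by cardinality."""
    bins = defaultdict(list)
    for ident, w in db.items():
        bins[len(w)].append(ident)
    return bins


def search(db, bins, q, eps):
    """Return identifiers of all fingerprints with Jaccard similarity >= eps."""
    q = set(q)
    lo, hi = math.ceil(eps * len(q)), math.floor(len(q) / eps)
    hits = []
    for size in range(lo, hi + 1):          # Step 1: only bins in range
        for ident in bins.get(size, []):
            w = set(db[ident])
            if len(w & q) / len(w | q) >= eps:   # Step 3: exact check
                hits.append(ident)
    return sorted(hits)


# Database entries taken from the multibit tree slide.
db = {"W1": (1, 2, 7, 4, 8), "W2": (1, 3, 7), "W3": (1, 3),
      "W5": (1, 4, 8, 7), "W6": (1,)}
bins = build_bins(db)
print(search(db, bins, (1, 4, 7, 8), 0.75))
```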
- 8. Drawbacks • The pointer-based representation of multibit trees needs a large number of bits of memory – Kc: number of fingerprints in bin c – C: total number of bins – The log(·) factor is too large! • The original fingerprint database must be kept in memory to filter out false positives
- 10. Succinct Data Structures • Space-efficient data structures that still enable fast operations • Pointer-based representations of ordered trees consume a large amount of memory – O(n log n) bits for n nodes – The log n factor is too large for large-scale data • Succinct alternative: represent an ordered tree as a bit string of length 2n + 1 that supports O(1)-time operations – Ex) 0100100101000 • Various succinct data structures – sets (Raman, 2002), sequences (Ferragina, 2001), trees (Jacobson, 1989), graphs (Turan, 1989)
- 11. Rank/select dictionary (RRR, 2002): foundation of various succinct data structures • Enables the rank/select operations on a bit string B in O(1) time – rank_c(B,i): returns the number of occurrences of c∈{0,1} in B[1…i] – select_c(B,i): returns the position of the i-th occurrence of c∈{0,1} in B • Efficient rank/select dictionary (Navarro and Providel, 2012) • Ex) B=0110011100 (positions i=1…10): rank1(B,8)=5, select1(B,3)=6
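To make the two operations concrete, here is a naive reference sketch with the same 1-indexed semantics. It runs in linear time, not the O(1) time of a real rank/select dictionary, and the function names are illustrative only.

```python
def rank(bits, c, i):
    """Number of occurrences of bit c in bits[1..i] (1-indexed, inclusive)."""
    return bits[:i].count(c)


def select(bits, c, i):
    """1-indexed position of the i-th occurrence of bit c, or None if absent."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == c:
            seen += 1
            if seen == i:
                return pos
    return None


B = "0110011100"
print(rank(B, "1", 8))    # matches rank1(B,8) on the slide
print(select(B, "1", 3))  # matches select1(B,3) on the slide
```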
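- 12. Rank/select dictionary: block decomposition of B • Divide the bit array B into large blocks of length $l = \log^2 n$ – $R_L$: ranks at the large-block boundaries • Divide each large block into small blocks of length $s = (\log n)/2$ – $R_s$: ranks at the small-block boundaries, relative to the enclosing large block • $\mathrm{rank}_1(B,i) = R_L[\lfloor i/l \rfloor] + R_s[\lfloor i/s \rfloor] + (\text{remaining rank})$ • Time: O(1) • Memory: n + o(n) bits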
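A rough sketch of the two-level directory, under several assumptions: the class name `RankDirectory` is hypothetical, the large-block length is rounded to a multiple of the small-block length so that blocks align, and the remaining rank inside a small block is counted by a short scan rather than the constant-time table lookup a real implementation would use. Block indices are taken at position i-1 because the query is 1-indexed and inclusive.

```python
import math


class RankDirectory:
    def __init__(self, bits):
        self.bits = bits
        log_n = math.log2(max(len(bits), 2))
        self.s = max(1, int(log_n / 2))                      # small block length
        self.l = self.s * max(1, int(log_n ** 2) // self.s)  # large block length
        self.RL = []   # number of 1s before each large block
        self.Rs = []   # number of 1s before each small block, within its large block
        total = in_large = 0
        for start in range(0, len(bits), self.s):
            if start % self.l == 0:
                self.RL.append(total)
                in_large = 0
            self.Rs.append(in_large)
            ones = bits[start:start + self.s].count("1")
            total += ones
            in_large += ones

    def rank1(self, i):
        """Number of 1s in bits[1..i] (1-indexed, inclusive)."""
        i = min(i, len(self.bits))
        if i <= 0:
            return 0
        j = (i - 1) // self.s                # small block holding position i
        start = j * self.s
        rest = self.bits[start:i].count("1")  # at most s bits are scanned
        return self.RL[start // self.l] + self.Rs[j] + rest


rd = RankDirectory("0110011100")
print(rd.rank1(8))
```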
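- 13. Level-order Unary Degree Sequence (LOUDS) (Jacobson, 1989) • Represents an ordered tree as a bit string of length 2n+1 (n: number of nodes) • Construction: 1) Traverse the tree in a breadth-first manner 2) For each visited node of degree k, emit k 1s followed by a 0 (a super root S with the root as its only child is emitted first) • Ex) Tree with root 1; node 1 has children 2 and 3, node 2 has children 4 and 5, node 3 has children 6 and 7 – B = 10 110 110 110 0 0 0 0 = 101101101100000 (S first, then nodes 1 to 7)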
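The construction can be sketched as a breadth-first traversal. This assumes the tree is given as a dictionary from a node to its ordered list of children; the function name `louds` and that input format are illustrative choices, not part of the slides.

```python
from collections import deque


def louds(children, root):
    """Build the LOUDS bit string of an ordered tree.

    children: dict mapping a node to its children in left-to-right order.
    """
    bits = ["10"]                 # super root with the real root as only child
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")   # degree k -> k 1s followed by 0
        queue.extend(kids)
    return "".join(bits)


tree = {1: [2, 3], 2: [4, 5], 3: [6, 7]}   # the 7-node example tree
B = louds(tree, 1)
print(B, len(B))
```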
- 14. Properties of LOUDS • Ex) B: 101101101100000 for the 7-node tree (root 1; children 2 and 3; grandchildren 4, 5, 6, 7) • For a tree consisting of n nodes, there are n 1s and n+1 0s in the bit string B • The 1s, and likewise the 0s other than the first 0, correspond one-to-one to the tree nodes • The positions of the parent and children of a tree node on B can be calculated in O(1) time by combining the rank/select operations
- 15. O(1)-time operations on a tree • Parent/child operations for a position i such that B[i]=1 – First child: p = select0(B, rank1(B,i)) + 1 – Next child: i+1 for position i of the first child (if B[i+1]=1) – Parent: p = select1(B, rank0(B,i)) • Ex) Calculating the first child for i = 4 (node 3) on B = 101101101100000 (positions 1…15): rank1(B,4)=3 and select0(B,3)=8, so the first child is at position p = 8+1 = 9 (node 6)
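The navigation formulas can be checked with a small sketch that uses naive linear-time rank/select in place of an O(1) dictionary. The function names, and returning `None` for a leaf or for the root's parent, are illustrative assumptions.

```python
def rank(bits, c, i):
    """Number of occurrences of bit c in bits[1..i] (1-indexed)."""
    return bits[:i].count(c)


def select(bits, c, i):
    """1-indexed position of the i-th occurrence of bit c, or None."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == c:
            seen += 1
            if seen == i:
                return pos
    return None


def first_child(B, i):
    """Position of the first child of the node at position i, or None for a leaf."""
    p = select(B, "0", rank(B, "1", i)) + 1
    return p if p <= len(B) and B[p - 1] == "1" else None


def parent(B, i):
    """Position of the parent of the node at position i (None for the root)."""
    return select(B, "1", rank(B, "0", i))


B = "101101101100000"
print(first_child(B, 4))  # node 3 -> its first child, node 6
print(parent(B, 9))       # node 6 -> its parent, node 3
```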
- 17. Succinct Multibit Trees (SMT) • Consist of compact representations of multibit trees and fingerprint databases • Represent multibit trees by LOUDS – $O\left(8 \sum_{c=1}^{C} |K_c| - 4C + MC\right)$ bits, with no log factor – Fast similarity searches • Two compact representations of fingerprint databases – Variable-length array (VLA) – Succinct trie (TRIE)
- 18. Succinct representation of multibit trees (SMT) • Basic idea: represent MT by LOUDS – MT consists of multiple binary decision trees • Bc: LOUDS representation of a decision tree • Lc: bit string indicating whether Bc[i] is a leaf or not • IDs: array containing fingerprint identifiers • [Figure: a 7-node decision tree (MT) with leaves W3, W4, W1, W2, next to its SMT representation]
- 19. Access to node auxiliaries and fingerprint identifiers in O(1) time • Access to the node auxiliaries $M_v^1$, $M_v^0$ for calculating upper bounds – v = rank1(Bc, p) for a given position p – Each 1 bit in Bc corresponds to a node v • Access to the identifiers for calculating Jaccard similarities, at positions where Lc[p] = 1 – IDs[rank1(Lc, p)] for a given position p – Each 1 bit in Lc corresponds to an index on IDs • [Figure: the 7-node decision tree with leaves W3, W4, W1, W2]
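A small sketch of the two lookups on the 7-node example tree. The concrete layout is an assumption made for illustration: `Lc` is taken to be aligned with `Bc` position by position, with a 1 exactly at the positions of leaf nodes, and `IDs` lists the leaves left to right. Naive rank is used in place of an O(1) dictionary.

```python
def rank1(bits, i):
    """Number of 1s in bits[1..i] (1-indexed)."""
    return bits[:i].count("1")


Bc = "101101101100000"          # LOUDS of the 7-node decision tree
Lc = "000001101100000"          # assumed: 1 where Bc[p] is a leaf (nodes 4..7)
IDs = ["W3", "W4", "W1", "W2"]  # leaf identifiers, left to right


def node_number(p):
    """Node v for a position p with Bc[p] = 1."""
    return rank1(Bc, p)


def fingerprint_id(p):
    """Identifier stored at leaf position p, or None if p is not a leaf."""
    if Lc[p - 1] != "1":
        return None
    return IDs[rank1(Lc, p) - 1]   # IDs is 1-indexed on the slide


print(node_number(9), fingerprint_id(9))
```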
- 20. Variable-length array for compactly representing fingerprints • A standard array consists of bit strings of fixed length – Space-inefficient for storing small values – Ex) each element represented in 8 bits: integers 2, 1, 3, 4 → bit strings 00000010 00000001 00000011 00000100 (32 bits) • Variable-length array = bit strings of different lengths – Ex) integers 2, 1, 3, 4 → bit strings 10 1 11 100 (8 bits) – Space-efficient – Difficulty: the boundaries between consecutive values must be recoverable
- 21. Representation of variable-length array • Use two bit strings to represent an array A: – R: bit string whose k-th substring is the binary representation of A[k] – P: bit string whose k-th substring consists of $\lfloor \log_2 A[k] \rfloor$ 0s followed by a 1, so that it has the same length as the k-th substring of R and its 1 marks where A[k] ends
- 22. Recovering A[k] from variable-length array • A[k] is recovered in three steps: 1. Start position s: if k=1 then s=1, else s = select1(P,k-1) + 1 2. End position e: e = select1(P,k) 3. Conversion: convert the substring R[s,e] to an integer • O(1) time • Ex) k=3: 1. s = select1(P,2)+1 = 4 2. e = select1(P,3) = 7 3. Convert R[4,7]=1000 to the integer 8
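The encoding and the three recovery steps can be sketched as follows, with bit strings held as Python strings and a naive select. The function names are hypothetical, values are assumed to be positive integers, and the array `[2, 1, 8]` is an assumed example chosen so that k=3 reproduces s=4, e=7 and the value 8 from the slide.

```python
def vla_encode(values):
    """Return (R, P): concatenated binary codes and their end markers."""
    R, P = [], []
    for v in values:
        if v < 1:
            raise ValueError("this sketch only handles positive integers")
        code = bin(v)[2:]                       # binary representation, no padding
        R.append(code)
        P.append("0" * (len(code) - 1) + "1")   # 1 marks the last bit of the code
    return "".join(R), "".join(P)


def select1(bits, k):
    """1-indexed position of the k-th 1 in bits (naive scan)."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == "1":
            seen += 1
            if seen == k:
                return pos
    raise IndexError("fewer than k ones")


def vla_get(R, P, k):
    """Recover A[k] (1-indexed) from R and P."""
    s = 1 if k == 1 else select1(P, k - 1) + 1   # step 1: start position
    e = select1(P, k)                            # step 2: end position
    return int(R[s - 1:e], 2)                    # step 3: convert R[s..e]


R, P = vla_encode([2, 1, 8])
print(R, P, vla_get(R, P, 3))
```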
- 23. Trie • Used to store an associative array – Keys are usually strings • Applicable to fingerprints considered as strings – The position of each node in the tree defines the key it is associated with – All the descendants of a node have a common prefix: the string associated with that node – Values are associated only with leaves and with those inner nodes that correspond to keys • Ex) Build a trie from W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5) • [Figure: the resulting trie, with root label 0]
- 24. Difficulty • The alphabet size tends to be small for typical trie applications, e.g., DNA (4), English (26) • Difficulty: the alphabet size for fingerprints is not always small, e.g., 881 dimensions for PubChem – Memory usage is dominated by the node labels • Remedy: replace each node label by its difference from the parent node label, then represent the trie succinctly by LOUDS • Ex) W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5) • [Figure: the trie before and after computing the label differences]
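Along any root-to-leaf path, taking the difference between a node label and its parent label amounts to gap-encoding the sorted fingerprint, with the root label 0 as the starting point. A minimal sketch of that idea, with hypothetical function names:

```python
from itertools import accumulate


def to_deltas(fingerprint):
    """Differences between consecutive labels on the trie path (root label 0)."""
    prev, deltas = 0, []
    for x in sorted(fingerprint):
        deltas.append(x - prev)
        prev = x
    return deltas


def from_deltas(deltas):
    """Recover the original labels by a running sum."""
    return list(accumulate(deltas))


W2 = (2, 3, 7, 8)
print(to_deltas(W2), from_deltas(to_deltas(W2)))
```

The differences are small even when the raw labels range up to the fingerprint dimension, which is what makes a variable-length encoding of the labels pay off.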
- 25. Succinct Trie (TRIE) • Three components: – T: LOUDS representation of the trie – D: variable-length array containing the node labels – idconv: array containing fingerprint identifiers • Ex) for W1=(1,2,3), W2=(2,3,7,8), W3=(1,2,5,8), W4=(1,3,5): node ids 1 to 12 in level order; T = 10 110 110 10 110 10 10 0 10 0 10 0 0; D = 0 1 2 1 2 1 1 3 2 4 3 1 • [Figure: the trie and its succinct representation, including the idconv array]
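A sketch of how the three components could be derived from a set of fingerprints, assuming trie nodes are identified by key prefixes and numbered in level order. Here `D` is a plain list rather than a variable-length array, and `idconv` is a dictionary from a fingerprint name to its node id; both are simplifications, and the function name is hypothetical.

```python
from collections import deque


def build_succinct_trie(fingerprints):
    """Return (T, D, idconv) for a dict name -> fingerprint."""
    children = {(): []}                       # prefix -> list of child prefixes
    for fp in fingerprints.values():
        key = tuple(sorted(fp))
        for depth in range(1, len(key) + 1):
            prefix = key[:depth]
            if prefix not in children:
                children[prefix] = []
                children[prefix[:-1]].append(prefix)
    T, D, node_of = ["10"], [], {}            # "10" is the super root
    queue = deque([()])
    while queue:
        prefix = queue.popleft()
        node_of[prefix] = len(node_of) + 1    # level-order node id, root = 1
        parent_label = prefix[-2] if len(prefix) >= 2 else 0
        D.append(prefix[-1] - parent_label if prefix else 0)
        kids = sorted(children[prefix])
        T.append("1" * len(kids) + "0")
        queue.extend(kids)
    idconv = {name: node_of[tuple(sorted(fp))]
              for name, fp in fingerprints.items()}
    return "".join(T), D, idconv


fps = {"W1": (1, 2, 3), "W2": (2, 3, 7, 8), "W3": (1, 2, 5, 8), "W4": (1, 3, 5)}
T, D, idconv = build_succinct_trie(fps)
print(T)
print(D)
print(idconv)
```

For the four example fingerprints, the resulting `T` and `D` agree with the T and D rows given on the slide.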
- 27. Experiments • 30 million chemical fingerprints from the PubChem database • Evaluated search time and memory • Compared the succinct multibit tree (SMT) to the pointer-based multibit tree (MT) • Compared the variable-length array (VLA) and the succinct trie (TRIE) to the raw representation (RAW) of fingerprint databases
- 28. Memory usage of multibit trees • [Plot: memory (MB) versus number of fingerprints, up to 3.0e+07, for SMT and MT] • At 30 million fingerprints: MT 6 GB, SMT 847 MB
- 29. Memory usage of representations of fingerprint databases • [Plot: memory (MB) versus number of fingerprints, up to 3.0e+07, for TRIE, VLA, and RAW] • At 30 million fingerprints: RAW 16 GB, VLA 3.2 GB, TRIE 1.3 GB
- 30. Search time and memory on 30 million fingerprints (ε=0.98) • #answers: 10 • [Plot: search time (sec) versus memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA, and MT+RAW] • Annotated search times: 0.021, 0.014, and 0.006 sec • Annotated memory: 2 GB, 4 GB, and 22 GB
- 31. Search time and memory on 30 million fingerprints (ε=0.9) • #answers: 1,440 • [Plot: search time (sec) versus memory (MB) for SMT+TRIE, SMT+VLA, SMT+RAW, MT+TRIE, MT+VLA, and MT+RAW] • Annotated search times: 1.7, 0.58, and 0.3 sec • Annotated memory: 2 GB, 4 GB, and 22 GB
- 32. Summary • Succinct Multibit Trees (SMT) • Compactly represent multibit trees and fingerprints by succinct data structures • Represent multibit trees by LOUDS • Represent fingerprints by a variable-length array and a succinct trie • Index 30 million fingerprints in 2 GB with SMT+TRIE and in 4 GB with SMT+VLA • Search time remains fast in practice