1. Scalable Similarity Search for
Molecular Descriptors
Yasuo Tabei
RIKEN Center for Advanced Intelligence Project (AIP),
Japan
Joint work with
Simon J. Puglisi
University of Helsinki, Finland
SISAP’17, Oct. 6, 2017
2. Similarity search in chemoinformatics
• Similarity search of chemical compounds is an important
task for novel drug discoveries
• Important fact: similar molecules tend to have similar
molecular functions
• Can find functions of a query by searching databases of
compounds
• The whole chemical space is said to contain approximately 10^60 molecules
• There are large databases storing tens of millions of
compounds, e.g., PubChem, ChEMBL
• Scalable similarity searches of chemical compounds are
required
3. Chemical fingerprint
• Binary vector representation of molecule
E.g.: x=(1, 0, 0, 1, 0)
– Each dimension indicates the presence/absence of a
substructure
– Representative fingerprints: Dragon, PubChem, etc.
• Jaccard (a.k.a. Tanimoto) similarity is used
• Many methods have been proposed
– Multibit tree, XOR-based, b-bit minhashing, etc.
4. Molecular descriptor (NEW)
• Integer vector representation of molecules
– x=(3, 1, 0, 0, 2), equivalent to set W=(1:3, 2:1, 5:2)
– Each dimension indicates a chemical property
• Descriptors: LINGO [Vidal et al., 05] and KCF-S
[Kotera et al., 13]
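• Generalized Jaccard: $J(W, Q) = \frac{\sum_i \min(w_i, q_i)}{\sum_i \max(w_i, q_i)}$, where $w_i$ and $q_i$ are the weights of feature $i$ in W and Q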
• Similarity search of molecular descriptors is still in
its infancy
• Problem: Find all Wi whose similarity to query Q is at least ε
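As a minimal sketch of the similarity on this slide, assuming the usual min-max form of generalized Jaccard (sum of element-wise minima divided by sum of element-wise maxima): descriptors are stored as sparse dictionaries from feature id to weight. The function name and the example query are illustrative choices, not part of the original slides.

```python
def generalized_jaccard(w, q):
    """Generalized Jaccard (min-max) similarity of two sparse descriptors.

    Each descriptor is a dict {feature id: positive integer weight};
    a feature that is absent from a dict has weight 0.
    """
    features = set(w) | set(q)
    num = sum(min(w.get(f, 0), q.get(f, 0)) for f in features)
    den = sum(max(w.get(f, 0), q.get(f, 0)) for f in features)
    return num / den if den else 0.0


# W is the example descriptor from the slide; Q is an illustrative query.
W = {1: 3, 2: 1, 5: 2}
Q = {1: 3, 4: 1}
print(generalized_jaccard(W, Q))  # 3 / 7
```

The brute-force problem statement then amounts to evaluating this function for every Wi and keeping those with similarity at least ε.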
5. Similarity search using
inverted index [Nasr’12]
• Inverted index: associative array
– Key = feature id, value = list of (descriptor id:weight) pairs
• Similarity search for query Q
• Look up inverted index for each query
element and compute similarities
(i) Descriptors
i   Wi
1   (1:3)
2   (5:3)
3   (2:3)
4   (1:1,2:2,4:2)
5   (4:3)
6   (2:2,3:1,5:2)
7   (3:3)
8   (2:3)
(ii) Inverted index (feature id, then descriptor id:weight pairs)
1   (1:3) (4:1)
2   (3:3) (4:2) (6:2) (8:3)
3   (6:1) (7:3)
4   (4:2) (5:3)
5   (2:3) (6:2)
(iii) Similarity search for query Q=(1:3, 4:1)
Feature 1:  (1:3) (4:1)
Feature 4:  (4:2) (5:3)
Linear time in the total length of the scanned lists
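The lookup on this slide can be sketched as follows, using the eight example descriptors. This is an illustrative implementation rather than the code of [Nasr'12]: the function names are hypothetical, and the denominator uses the identity max(a, b) = a + b - min(a, b), so that only the lists of the query features need to be scanned.

```python
from collections import defaultdict


def build_inverted_index(descriptors):
    """Map each feature id to a list of (descriptor id, weight) pairs."""
    index = defaultdict(list)
    for i, w in descriptors.items():
        for feature, weight in w.items():
            index[feature].append((i, weight))
    return index


def search(index, totals, query, eps):
    """Return (descriptor id, similarity) pairs with similarity >= eps."""
    q_total = sum(query.values())
    shared = defaultdict(int)  # descriptor id -> sum of min(weight, query weight)
    for feature, q_weight in query.items():
        for i, weight in index.get(feature, []):
            shared[i] += min(weight, q_weight)
    hits = []
    for i, s in shared.items():
        # sum of maxima = total weight of Wi + total weight of Q - sum of minima
        sim = s / (totals[i] + q_total - s)
        if sim >= eps:
            hits.append((i, sim))
    return sorted(hits)


descriptors = {
    1: {1: 3}, 2: {5: 3}, 3: {2: 3}, 4: {1: 1, 2: 2, 4: 2},
    5: {4: 3}, 6: {2: 2, 3: 1, 5: 2}, 7: {3: 3}, 8: {2: 3},
}
index = build_inverted_index(descriptors)
totals = {i: sum(w.values()) for i, w in descriptors.items()}
Q = {1: 3, 4: 1}
print(search(index, totals, Q, 0.7))  # [(1, 0.75)]
```

Descriptors that share no feature with the query never appear in a scanned list, which is what keeps the cost linear in the total length of the lists for the query features.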
6. Drawback
• Scanning lists takes much time, especially for long lists
• Huge memory:
– N: number of descriptors
– M: maximum weight
• One can compress inverted index by using compression
methods, e.g., variable-byte codes and PForDelta
• Decompression is time-consuming
• Challenge: developing a fast and space-efficient
similarity search for molecular descriptors
7. SITAd: Scalable similarity search for
molecular descriptors
• Two techniques
1. Database partitioning
2. Conversion to inner product search
• Build a wavelet tree based on these two
techniques
• Solve inner product search on the wavelet tree
8. Database partitioning
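Before partitioning:
W1=(1:1, 3:1)
W2=(2:1)
W3=(2:2, 4:1)
W4=(2:1, 4:1)
W5=(3:1)
W6=(1:2)
W7=(1:1, 4:1)
W8=(1:1, 2:2)
After partitioning (in this example, descriptors with the same total weight share a block):
Total weight 1:  W2=(2:1), W5=(3:1)
Total weight 2:  W1=(1:1, 3:1), W4=(2:1, 4:1), W7=(1:1, 4:1), W6=(1:2)
Total weight 3:  W3=(2:2, 4:1), W8=(1:1, 2:2)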
Theorem 1
• Classify each descriptor Wi into block Bc
• Search space can be limited to blocks satisfying
Theorem 1 for a given query Q and ε
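The statement of Theorem 1 is not reproduced in these slides, so the following is only a sketch under two assumptions: that block Bc holds the descriptors whose weights sum to c (which matches the grouping of W1 to W8 in the example), and that the pruning rule is the bound that follows directly from the min-max definition of generalized Jaccard, namely that similarity at least ε requires ε·|Q| ≤ c ≤ |Q|/ε, where |Q| is the total weight of the query. Function names are hypothetical.

```python
from collections import defaultdict


def partition_by_total_weight(descriptors):
    """Group descriptor ids into blocks keyed by the sum of their weights."""
    blocks = defaultdict(list)
    for i, w in descriptors.items():
        blocks[sum(w.values())].append(i)
    return dict(blocks)


def candidate_blocks(blocks, query, eps):
    """Keys c of blocks that can contain a descriptor with similarity >= eps.

    Assumed bound: the sum of minima is at most min(c, |Q|) and the sum of
    maxima is at least max(c, |Q|), so similarity >= eps forces
    eps * |Q| <= c <= |Q| / eps.
    """
    q_total = sum(query.values())
    return sorted(c for c in blocks if eps * q_total <= c <= q_total / eps)


descriptors = {
    1: {1: 1, 3: 1}, 2: {2: 1}, 3: {2: 2, 4: 1}, 4: {2: 1, 4: 1},
    5: {3: 1}, 6: {1: 2}, 7: {1: 1, 4: 1}, 8: {1: 1, 2: 2},
}
blocks = partition_by_total_weight(descriptors)
print(blocks)                                          # {2: [1, 4, 6, 7], 1: [2, 5], 3: [3, 8]}
print(candidate_blocks(blocks, {2: 2, 4: 1}, eps=0.8))  # [3]
```

With a high threshold, only the block whose total weight is close to that of the query has to be searched.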
9. Conversion to inner product search
• Similarity search using generalized Jaccard similarity can
be converted to inner product search
• How to solve inner product search efficiently?
• Consider the simple case where all weights are one
Ex) x = (3, 2, 0, 4, 2) ➞ x’= (1, 1, 0, 0, 1) ➞ W’=(1,2,5)
• Can be solved as a semi-conjunctive query
(Equation annotations on the slide: "Inner product", "Constant", "Set", "≧ε")
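The equation on this slide did not survive extraction. For the all-ones case described above, the standard rearrangement of the Jaccard threshold is given here as a reconstruction under that assumption, not as the slide's exact formula. W' and Q' denote the sets of features with non-zero weight, and |W' ∩ Q'| equals the inner product of the two binary vectors.

```latex
J(W', Q') = \frac{|W' \cap Q'|}{|W'| + |Q'| - |W' \cap Q'|} \ge \epsilon
\quad\Longleftrightarrow\quad
|W' \cap Q'| \ge \frac{\epsilon}{1 + \epsilon}\,\bigl(|W'| + |Q'|\bigr)
```

If every descriptor in a block has the same size |W'|, the right-hand side is a constant for a given query and ε, so the search within that block reduces to finding the sets whose inner product with the query reaches that constant.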
10. Conjunctive Query
• Query with k keywords
• Example: (Word 2, Word 4)
• Identify the set intersection by sorting the merged id list
• It takes O(|A|+|B|) time
• Can that be any faster?
Word     Document ids
Word 1   1, 3
Word 2   2, 6, 8
Word 3   1, 5, 7
Word 4   2, 7
A (Word 2) = 2 6 8, B (Word 4) = 2 7
Merged and sorted: 2 2 6 7 8
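A linear-time intersection of two sorted id lists can be sketched with a two-pointer merge, shown here on the lists for Word 2 and Word 4. This is a standard technique given for illustration; the function name is not from the slides.

```python
def intersect_sorted(a, b):
    """Intersect two sorted id lists in O(|A| + |B|) time."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur later in b
        else:
            j += 1  # b[j] cannot occur later in a
    return common


A = [2, 6, 8]  # document ids for Word 2
B = [2, 7]     # document ids for Word 4
print(intersect_sorted(A, B))  # [2]
```

Every step advances at least one pointer, which is where the O(|A|+|B|) bound comes from.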
11. Alternation α
• Number of switches after sorting
• There exists a data structure that finds the set
intersection in O(α log m) time (Barbay/Kenyon, 2002)
• m: maximum value
Example: A = 2 6 8, B = 2 7; merged and sorted: 2 2 6 7 8; α = 2
12. Range intersection on array
• Concatenate all rows of the inverted index
• Array A of length n, values 1 ≤ A[i] ≤ m
• Query word = interval
• Range intersection: rint(A,[i,j],[k,l])
– Find the set intersection of A[i,j] and A[k,l]
• O(α log m) time using a wavelet tree!
A = 1 3 2 6 8 1 5 7 2 7 4 5
(i, j, k, l mark the endpoints of the two query intervals in A)
15. Record whether each element is in the
lower half (0) or the higher half (1)
Level 1, range [1,8]:                           0 0 0 1 1 0 1 1 0 1 0 1
Level 2, ranges [1,4] | [5,8]:                  0 1 0 0 0 1 | 0 1 0 1 1 0
Level 3, ranges [1,2] | [3,4] | [5,6] | [7,8]:  0 1 0 1 | 0 1 | 1 0 0 | 1 0 0
Leaves:                                         1 2 3 4 5 6 7 8
16. Index each bit array with a rank
dictionary
• With a rank dictionary, the rank operation can be
done in O(1) time
– rank_c(B,i): return the number of c∈{0,1} in B[1…i]
Ex) B=0110011100
i     1 2 3 4 5 6 7 8 9 10
B[i]  0 1 1 0 0 1 1 1 0 0
rank_1(B,8)=5
rank_0(B,5)=3
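The rank operation can be illustrated with a plain prefix-sum table. Note that this is a simplification: it answers queries in O(1) time but stores one integer per position, whereas the rank dictionaries meant on the slide are succinct structures with a small overhead on top of the bit array. The class name is hypothetical.

```python
class PrefixRank:
    """O(1) rank queries on a bit array via prefix sums (not succinct)."""

    def __init__(self, bits):
        # ones[i] = number of 1s among the first i bits
        self.ones = [0]
        for b in bits:
            self.ones.append(self.ones[-1] + b)

    def rank1(self, i):
        """Number of 1s in B[1..i] (1-based, inclusive)."""
        return self.ones[i]

    def rank0(self, i):
        """Number of 0s in B[1..i] (1-based, inclusive)."""
        return i - self.ones[i]


B = PrefixRank([0, 1, 1, 0, 0, 1, 1, 1, 0, 0])
print(B.rank1(8), B.rank0(5))  # 5 3
```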
18. Memory Usage
• (1+γ) n log m bits
– n: number of all words in the database
– m: number of unique words
– γ: overhead for the rank dictionary (around 0.6)
– Not so different from simply storing the array (n log m bits)
20. Range intersection: recap
• Array A of length n, values 1 ≤ A[i] ≤ m
• Query word = interval
• Range intersection: rint(A,[i,j],[k,l])
– Find the set intersection of A[i,j] and A[k,l]
• O(α log m) time using a wavelet tree
A = 1 3 2 6 8 1 5 7 2 7 4 5
(i, j, k, l mark the endpoints of the two query intervals in A)
21. O(1)-time division of an interval
• Using the rank operations, the division of an
interval can be done in constant time
– rank0 for the left child and rank1 for the right child
– Naïve = time linear in the total number of elements
A_root  [1,8]:  1 3 2 6 8 1 5 7 2 7 4 5
A_left  [1,4]:  1 3 2 1 2 4
A_right [5,8]:  6 8 5 7 7 5
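The interval division and the resulting range intersection can be sketched with a small pointer-based wavelet tree. This is an illustrative toy under stated assumptions, not the SITAd implementation: each node keeps a prefix-sum table in place of a succinct rank dictionary, positions are 0-based half-open ranges, and the names `WaveletNode` and `rint` are chosen to mirror the slides.

```python
class WaveletNode:
    """Wavelet tree node over values in [lo, hi]."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not values:
            return
        mid = (lo + hi) // 2
        # ones[i] = how many of the first i elements go to the right child
        self.ones = [0]
        for v in values:
            self.ones.append(self.ones[-1] + (v > mid))
        self.left = WaveletNode([v for v in values if v <= mid], lo, mid)
        self.right = WaveletNode([v for v in values if v > mid], mid + 1, hi)


def rint(node, r1, r2):
    """Values that occur in both half-open position ranges r1 and r2."""
    (s1, e1), (s2, e2) = r1, r2
    if s1 >= e1 or s2 >= e2:
        return []  # one range is empty here: prune the whole subtree
    if node.lo == node.hi:
        return [node.lo]
    p = node.ones
    # O(1) division of each range: rank0 maps to the left child, rank1 to the right
    left = rint(node.left, (s1 - p[s1], e1 - p[e1]), (s2 - p[s2], e2 - p[e2]))
    right = rint(node.right, (p[s1], p[e1]), (p[s2], p[e2]))
    return left + right


A = [1, 3, 2, 6, 8, 1, 5, 7, 2, 7, 4, 5]
root = WaveletNode(A, 1, 8)
# Word 2 occupies positions 3..5 and Word 4 positions 9..10 (1-based) in A.
print(rint(root, (2, 5), (8, 10)))  # [2]
```

The recursion only continues into children where both ranges are non-empty, which is the pruning that the O(α log m) bound relies on.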
25. Solve inner product search
• Inverted index
1   (1:3) (4:1)
2   (3:3) (4:2) (6:2) (8:3)
3   (6:1) (7:3)
4   (4:2) (5:3)
5   (2:3) (6:2)
• Build two arrays by concatenating the ids and the weights of each row separately
– A (array of ids ➞ wavelet tree):              1 4 3 4 6 8 6 7 4 5 2 6
– B (array of weights ➞ RMQ data structure):    3 1 3 2 2 3 1 3 2 3 3 2
• RMQ data structure
– Computes max B[t,s] in O(1) time, using |B|log|B|/2 + |B|logM + 2n bits of space
• Query = extension of range intersection to multiple intervals
• Find ids whose sum of products of weights is at least the threshold
26. Computing upper bound of inner product
in O(1) time
• Using RMQ data structure, upper bound of inner
product can be computed in O(1)-time
• We compute max B[t,s] for each interval on
wavelet tree and compute upper bound
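A (ids):                     1 4 3 4 6 8 6 7 4 5 2 6
B (weights):                 3 1 3 2 2 3 1 3 2 3 3 2
Wavelet tree, second level:  1 4 3 4 4 2 | 6 8 6 7 5 6
Ex) Q=(2:2,4:1): the RMQ gives maximum weight 3 in row 2 and in row 4,
so the upper bound is 3・2 + 3・1 = 9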
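The root-level upper bound from the example can be sketched with a sparse table, a standard structure that answers range-maximum queries in O(1) time. It is used here as a stand-in: it needs O(|B| log |B|) words, unlike the compact RMQ structure cited on the slides, and the row ranges and function names are illustrative assumptions. The same computation would be repeated for the intervals at each wavelet tree node.

```python
class SparseTableMax:
    """O(1) range-maximum queries after O(n log n) preprocessing."""

    def __init__(self, values):
        n = len(values)
        self.table = [list(values)]  # table[j][i] = max of values[i : i + 2**j]
        k = 1
        while 2 * k <= n:
            prev = self.table[-1]
            self.table.append(
                [max(prev[i], prev[i + k]) for i in range(n - 2 * k + 1)]
            )
            k *= 2

    def query(self, s, e):
        """Maximum of values[s:e] for a non-empty half-open range."""
        level = (e - s).bit_length() - 1
        row = self.table[level]
        return max(row[s], row[e - (1 << level)])


def upper_bound(rmq, row_ranges, query):
    """Sum over query features of (query weight * max weight in that row)."""
    return sum(qw * rmq.query(*row_ranges[f]) for f, qw in query.items())


B = [3, 1, 3, 2, 2, 3, 1, 3, 2, 3, 3, 2]
# Half-open positions of each inverted-index row inside B (0-based).
row_ranges = {1: (0, 2), 2: (2, 6), 3: (6, 8), 4: (8, 10), 5: (10, 12)}
rmq = SparseTableMax(B)
print(upper_bound(rmq, row_ranges, {2: 2, 4: 1}))  # 9
```

If this bound falls below the required inner product, no descriptor in the corresponding intervals can qualify, so the search can stop descending there.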
27. Experiments
• 42,971,672 compounds in PubChem database
• Use KCF-S descriptors with 642,297 dimensions
• Use search time and memory as evaluation
measures
• Compare SITAd (proposed) to
– OVA: compute similarity one-by-one
– INV (state-of-the-art): similarity search using inverted
index
– INV+VBYTE: INV compressed by variable byte codes
– INV+PD: INV compressed by PForDelta
28. Search time for the number of
compounds
[Figure: search time (sec, 0 to 6) versus # of descriptors (0 to 4e+07). Series: SITAd epsilon=0.9, SITAd epsilon=0.95, SITAd epsilon=0.98, inverted index, inverted index (varbyte), inverted index (pfordelta)]
29. Search time and memory (MB)
on 42 million compounds
[Figure: search time (sec, 0 to 10) versus memory (megabytes, 0 to 35,000). Series: SITAd epsilon=0.98, SITAd epsilon=0.95, SITAd epsilon=0.9, INV, INV-VBYTE, INV-PD, OVA. Data labels shown in the plot: 2,400; 0.23; 0.61; 1.54; 33,012; 5.24; 9.58; 8,171]
31. Summary
• Present SITAd, scalable similarity search for
molecular descriptors
• Use two data structures: wavelet tree, RMQ
• Takes around 1 sec and uses 2.5 GB of memory for
searching 42 million compounds
• Future work: develop similarity search methods
using ANN
32. Software for similarity search is available
https://sites.google.com/site/yasuotabei/
• All software is applicable to high-dimensional data with
hundreds of millions of items
• All pairs similarity search (similarity join)
- SketchSort for cosine similarity
- SketchSortj for Jaccard similarity
- SketchSort-minmax for minmax similarity
• Similarity search
- SMBT for Jaccard similarity
• Graph similarity search
- gWT