Scalable Similarity Search for
Molecular Descriptors	
Yasuo Tabei
RIKEN Center for Advanced Intelligence Project (AIP),
Japan
Joint work with
Simon J. Puglisi
University of Helsinki, Finland
	
SISAP’17, Oct. 6, 2017
Similarity search in chemoinformatics	
•  Similarity search of chemical compounds is an important
task for novel drug discovery
•  Important fact: similar molecules tend to have similar
molecular functions
•  The functions of a query compound can be found by searching databases of
compounds
•  The whole chemical space is estimated at approximately 10^60 molecules
•  There are large databases storing tens of millions of
compounds, e.g., PubChem, ChEMBL
•  Scalable similarity searches of chemical compounds are
required
Chemical fingerprint	
•  Binary vector representation of molecule
E.g.: x=(1, 0, 0, 1, 0)
Ø Each dimension indicates the presence/absence of a
substructure
–  Representative fingerprints: Dragon, PubChem, etc.
•  Jaccard (a.k.a. Tanimoto) similarity is used
•  Many methods have been proposed
–  Multibit tree, XOR-based, b-bit minhashing, etc.
Molecular descriptor (NEW)	
•  Integer vector representation of molecules
–  x=(3, 1, 0, 0, 2), equivalent to set W=(1:3, 2:1, 5:2)
–  Each dimension indicates a chemical property
•  Descriptors: LINGO [Vidal et al., 05] and KCF-S
[Kotera et al., 13]
•  Generalized Jaccard:
•  Similarity search of molecular descriptors is still in
its infancy
•  Problem: Find all Wi similar to query Q (≧ε)
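The formula for the generalized Jaccard similarity did not survive on this slide. As a minimal sketch, the code below assumes the Tanimoto form for non-negative integer vectors, J(W,Q) = W·Q / (|W|² + |Q|² − W·Q), which is consistent with the later conversion to inner product search. The function names are illustrative, not taken from the talk.

```python
from typing import Dict

# A descriptor maps feature id -> positive integer weight, e.g. W=(1:3, 2:1, 5:2).
Descriptor = Dict[int, int]


def inner_product(w: Descriptor, q: Descriptor) -> int:
    """Sum of weight products over the features shared by w and q."""
    if len(w) > len(q):
        w, q = q, w  # iterate over the shorter descriptor
    return sum(weight * q.get(fid, 0) for fid, weight in w.items())


def squared_norm(w: Descriptor) -> int:
    """Sum of squared weights."""
    return sum(weight * weight for weight in w.values())


def generalized_jaccard(w: Descriptor, q: Descriptor) -> float:
    """Assumed definition: W.Q / (|W|^2 + |Q|^2 - W.Q)."""
    ip = inner_product(w, q)
    denominator = squared_norm(w) + squared_norm(q) - ip
    return ip / denominator if denominator else 0.0


w = {1: 3, 2: 1, 5: 2}  # x = (3, 1, 0, 0, 2)
q = {1: 3, 4: 1}
similarity = generalized_jaccard(w, q)  # 9 / (14 + 10 - 9)
```

The search problem is then to report every Wi with generalized_jaccard(Wi, Q) ≧ ε.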
Similarity search using
inverted index [Nasr’12]	
•  Inverted index: associative array
–  Key = feature id, value = list of (descriptor id:weight) pairs
•  Similarity search for query Q
•  Look up the inverted index for each query
element and compute similarities
(i) Descriptors
i	Wi
1	(1:3)
2	(5:3)
3	(2:3)
4	(1:1,2:2,4:2)
5	(4:3)
6	(2:2,3:1,5:2)
7	(3:3)
8	(2:3)
(ii) Inverted index (feature id, then descriptor id:weight pairs)
1	(1:3) (4:1)
2	(3:3) (4:2) (6:2) (8:3)
3	(6:1) (7:3)
4	(4:2) (5:3)
5	(2:3) (6:2)
(iii) Similarity search for query Q=(1:3, 4:1): scan the lists of features 1 and 4
1	(1:3) (4:1)
4	(4:2) (5:3)
Linear time in the total length of the scanned lists
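The example above can be sketched in a few lines. This is an illustrative baseline, not the authors' implementation: it builds the inverted index from the eight descriptors and accumulates, for the query Q=(1:3, 4:1), the inner product with every descriptor that shares a feature with Q. The similarity itself would then be derived from these inner products together with stored norms.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Descriptor = Dict[int, int]                      # feature id -> weight
InvertedIndex = Dict[int, List[Tuple[int, int]]]  # feature id -> [(descriptor id, weight)]

descriptors: Dict[int, Descriptor] = {
    1: {1: 3}, 2: {5: 3}, 3: {2: 3}, 4: {1: 1, 2: 2, 4: 2},
    5: {4: 3}, 6: {2: 2, 3: 1, 5: 2}, 7: {3: 3}, 8: {2: 3},
}


def build_inverted_index(descs: Dict[int, Descriptor]) -> InvertedIndex:
    """One posting list per feature, ordered by descriptor id."""
    index: InvertedIndex = defaultdict(list)
    for did in sorted(descs):
        for fid, weight in descs[did].items():
            index[fid].append((did, weight))
    return dict(index)


def accumulate_inner_products(index: InvertedIndex, query: Descriptor) -> Dict[int, int]:
    """Scan the posting list of every query feature; time is linear in their total length."""
    scores: Dict[int, int] = defaultdict(int)
    for fid, query_weight in query.items():
        for did, weight in index.get(fid, []):
            scores[did] += query_weight * weight
    return dict(scores)


index = build_inverted_index(descriptors)
scores = accumulate_inner_products(index, {1: 3, 4: 1})
```

Only the lists for features 1 and 4 are touched, but each is scanned in full, which is the cost the next slide identifies as the drawback.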
Drawback	
•  Scanning lists takes much time, especially for long lists
•  Huge memory:
–  N: number of descriptors
–  M: maximum weight
•  The inverted index can be compressed with
methods such as variable-byte codes and PForDelta
•  However, decompression is time-consuming
•  Challenge: developing a fast and space-efficient
similarity search for molecular descriptors
SITAd: Scalable similarity search for
molecular descriptors	
•  Two techniques
1.  Database partitioning
2.  Conversion to inner product search
•  Build a wavelet tree based on these two
techniques
•  Solve inner product search on wavelet tree
Database partitioning	
Before partitioning:
W1=(1:1, 3:1)
W2=(2:1)
W3=(2:2, 4:1)
W4=(2:1, 4:1)
W5=(3:1)
W6=(1:2)
W7=(1:1, 4:1)
W8=(1:1, 2:2)
After partitioning (descriptors reordered block by block):
W2=(2:1)
W5=(3:1)
W1=(1:1, 3:1)
W4=(2:1, 4:1)
W7=(1:1, 4:1)
W6=(1:2)
W3=(2:2, 4:1)
W8=(1:1, 2:2)
Theorem1	
•  Classify each descriptor Wi into block Bc
•  Search space can be limited to blocks satisfying
Theorem1 for given query q and ε
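The statement of Theorem1 is not reproduced in this text, so the following is only a sketch under stated assumptions. It assumes (a) blocks are keyed by the squared norm of each descriptor, which happens to reproduce the ordering in the example above, and (b) the similarity is the Tanimoto form W·Q / (|W|² + |Q|² − W·Q). The pruning bound used here follows from the Cauchy-Schwarz inequality and may differ from, or be looser than, the bound in the talk.

```python
import math
from collections import defaultdict
from typing import Dict, List

Descriptor = Dict[int, int]  # feature id -> weight

descriptors: Dict[int, Descriptor] = {
    1: {1: 1, 3: 1}, 2: {2: 1}, 3: {2: 2, 4: 1}, 4: {2: 1, 4: 1},
    5: {3: 1}, 6: {1: 2}, 7: {1: 1, 4: 1}, 8: {1: 1, 2: 2},
}


def squared_norm(w: Descriptor) -> int:
    return sum(weight * weight for weight in w.values())


def partition(descs: Dict[int, Descriptor]) -> Dict[int, List[int]]:
    """Assumed block key: squared norm c; block B_c lists descriptor ids."""
    blocks: Dict[int, List[int]] = defaultdict(list)
    for did in sorted(descs):
        blocks[squared_norm(descs[did])].append(did)
    return dict(blocks)


def similarity_upper_bound(c: int, q: int) -> float:
    """Best possible similarity between squared norms c and q (both > 0).

    Since W.Q <= sqrt(c * q), the assumed similarity is at most
    sqrt(c * q) / (c + q - sqrt(c * q)).
    """
    p = math.sqrt(c * q)
    return p / (c + q - p)


def candidate_blocks(blocks: Dict[int, List[int]], query: Descriptor, eps: float) -> List[int]:
    """Keys of the blocks that can still contain a descriptor with similarity >= eps."""
    q = squared_norm(query)
    return sorted(c for c in blocks if similarity_upper_bound(c, q) >= eps)


blocks = partition(descriptors)
survivors = candidate_blocks(blocks, {2: 2, 4: 1}, eps=0.9)
```

For the query (2:2, 4:1) and ε=0.9, only the blocks with keys 4 and 5 survive, so W1, W2, W4, W5 and W7 are never compared against the query.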
Conversion to inner product search	
•  Similarity search using generalized Jaccard similarity can
be converted to inner product search
•  How to solve inner product search efficiently?
•  Consider the simple case in which all weights are one
Ex) x = (3, 2, 0, 0, 2) ➞ x’= (1, 1, 0, 0, 1) ➞ W’=(1,2,5)
•  Can be solved as a semi-conjunctive query
[Equation: the similarity condition (≧ε) rewritten as inner product ≧ constant]
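The conversion can be checked numerically. The sketch below again assumes the Tanimoto form J(W,Q) = W·Q / (|W|² + |Q|² − W·Q); under that assumption, J(W,Q) ≧ ε holds exactly when W·Q ≧ ε/(1+ε) · (|W|² + |Q|²), where the right-hand side is a constant once the norms are known. Exact fractions are used so the comparison has no rounding issues. Descriptors are assumed to be non-empty.

```python
from fractions import Fraction
from typing import Dict

Descriptor = Dict[int, int]  # feature id -> weight


def inner_product(w: Descriptor, q: Descriptor) -> int:
    return sum(weight * q.get(fid, 0) for fid, weight in w.items())


def squared_norm(w: Descriptor) -> int:
    return sum(weight * weight for weight in w.values())


def similar_direct(w: Descriptor, q: Descriptor, eps: Fraction) -> bool:
    """Evaluate the assumed similarity and compare it with eps."""
    ip = inner_product(w, q)
    return Fraction(ip, squared_norm(w) + squared_norm(q) - ip) >= eps


def similar_by_inner_product(w: Descriptor, q: Descriptor, eps: Fraction) -> bool:
    """Equivalent test: inner product against a constant that depends on eps and the norms."""
    constant = eps / (1 + eps) * (squared_norm(w) + squared_norm(q))
    return inner_product(w, q) >= constant


w = {1: 3, 2: 1, 5: 2}
q = {1: 3, 4: 1}

# With all weights equal to one, the inner product is the size of the set intersection.
binary_w = {1: 1, 2: 1, 5: 1}   # W' = (1, 2, 5)
binary_q = {2: 1, 5: 1, 7: 1}
shared = inner_product(binary_w, binary_q)
```

The binary case at the end is why the problem resembles a conjunctive query: counting shared features is a set intersection.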
Conjunctive Query 	
§  Query with k keywords
§  E.g., (Word 2, Word 4)
§  Identify the set intersection by merging the two sorted id lists
§  It takes O(|A|+|B|) time
§  Can that be any faster?
Word	Document ids
Word 1	1, 3
Word 2	2, 6, 8
Word 3	1, 5, 7
Word 4	2, 7
A (Word 2)	2 6 8
B (Word 4)	2 7
Merged and sorted	2 2 6 7 8
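The O(|A|+|B|) baseline can be sketched as a two-pointer merge over the two sorted id lists. This is a standard textbook routine shown for orientation; the function name is illustrative.

```python
from typing import List


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Intersection of two ascending id lists in O(|A| + |B|) time."""
    i = j = 0
    common: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear in b any more
        else:
            j += 1  # b[j] cannot appear in a any more
    return common


word2 = [2, 6, 8]
word4 = [2, 7]
result = intersect_sorted(word2, word4)
```

Every element of both lists is visited once, even when the answer is tiny, which is what motivates the question of whether this can be made faster.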
Alternation α	
§  Number of switches after sorting
§  There exists a data structure that finds the set
intersection in O(α log m) time (Barbay/Kenyon, 2002)
§  m : maximum value	
α = 2	
2	 6	 8	 2	 7	
2	 2	 6	 7	 8
Range intersection on array	
§  Concatenate all rows of inverted index
§  Array A of length n, values 1 ≤ A[i] ≤ m
§  Query word = Interval
§  Range intersection: rint(A,[i,j],[k,l])
•  Find set intersection of A[i,j] and A[k,l]
§  O(α log m) time using wavelet tree!
A = 1 3 2 6 8 1 5 7 2 7 4 5 (the intervals [i,j] and [k,l] are marked on A)
Definition of Wavelet Tree
Tree of subarrays:
Lower half = left, Higher half = right
[1,8]	1 3 2 6 8 5 7 1 2 7 4 5
[1,4] [5,8]	1 3 2 1 2 4 | 6 8 5 7 7 5
[1,2] [3,4] [5,6] [7,8]	1 2 1 2 | 3 4 | 6 5 5 | 8 7 7
Leaves 1 to 8	1 1 2 2 3 4 5 5 6 7 7 8
Record whether each element is in the
lower half (0) or the higher half (1)
[1,8]	0 0 0 1 1 1 1 0 0 1 0 1
[1,4] [5,8]	0 1 0 0 0 1 | 0 1 0 1 1 0
[1,2] [3,4] [5,6] [7,8]	0 1 0 1 | 0 1 | 1 0 0 | 1 0 0
Leaves	1 2 3 4 5 6 7 8
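The bit arrays of this slide can be reproduced with a short level-by-level construction. This is a minimal sketch, assuming the value range [1, m] has m equal to a power of two so that all nodes on a level have the same range width; the function name is illustrative.

```python
from typing import List, Tuple


def wavelet_levels(values: List[int], lo: int, hi: int) -> List[List[int]]:
    """Bit arrays of a wavelet tree over [lo, hi], one concatenated array per level.

    Bit 0 means the element goes to the lower half (left child),
    bit 1 means it goes to the higher half (right child).
    """
    levels: List[List[int]] = []
    nodes: List[Tuple[List[int], int, int]] = [(values, lo, hi)]
    while nodes[0][1] < nodes[0][2]:  # stop once node ranges are single values
        bits: List[int] = []
        next_nodes: List[Tuple[List[int], int, int]] = []
        for node_values, a, b in nodes:
            mid = (a + b) // 2
            bits.extend(0 if v <= mid else 1 for v in node_values)
            # Stable split: relative order is kept inside each child.
            next_nodes.append(([v for v in node_values if v <= mid], a, mid))
            next_nodes.append(([v for v in node_values if v > mid], mid + 1, b))
        levels.append(bits)
        nodes = next_nodes
    return levels


array = [1, 3, 2, 6, 8, 5, 7, 1, 2, 7, 4, 5]
levels = wavelet_levels(array, 1, 8)
```

For the array on this slide the three levels come out as 000111100101, 010001010110 and 010101100100, matching the figure.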
Index each bit array with a rank
dictionary	
§  With a rank dictionary, the rank operation can be
done in O(1) time
•  rank_c(B,i): return the number of occurrences of c∈{0,1} in B[1…i]
Ex) B=0110011100
i	1 2 3 4 5 6 7 8 9 10
B	0 1 1 0 0 1 1 1 0 0
rank_1(B,8)=5
rank_0(B,5)=3
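A rank dictionary can be sketched with precomputed per-block counts plus a short scan inside one block. The class below is illustrative only: a practical implementation packs the bits into machine words and uses popcount so that the in-block step takes constant time, and the block size of 4 is an arbitrary choice for the example.

```python
from typing import List


class RankDictionary:
    """Bit array with block-level prefix counts for rank queries (1-based, inclusive)."""

    def __init__(self, bits: str, block_size: int = 4) -> None:
        self.bits: List[int] = [int(c) for c in bits]
        self.block_size = block_size
        # block_ranks[b] = number of 1s in the first b complete blocks.
        self.block_ranks: List[int] = [0]
        for start in range(0, len(self.bits), block_size):
            ones = sum(self.bits[start:start + block_size])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i: int) -> int:
        """Number of 1s in B[1..i]."""
        block, offset = divmod(i, self.block_size)
        start = block * self.block_size
        return self.block_ranks[block] + sum(self.bits[start:start + offset])

    def rank0(self, i: int) -> int:
        """Number of 0s in B[1..i]."""
        return i - self.rank1(i)


b = RankDictionary("0110011100")
ones_up_to_8 = b.rank1(8)
zeros_up_to_5 = b.rank0(5)
```

The two queries reproduce the values on the slide, rank_1(B,8)=5 and rank_0(B,5)=3.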
Wavelet Tree = Collection of bit arrays
indexed by rank dictionaries 	
[1,8]	0 0 0 1 1 1 1 0 0 1 0 1
[1,4] [5,8]	0 1 0 0 0 1 | 0 1 0 1 1 0
[1,2] [3,4] [5,6] [7,8]	0 1 0 1 | 0 1 | 1 0 0 | 1 0 0
Leaves	1 2 3 4 5 6 7 8
Memory Usage	
§  (1+γ) n log m bits
–  n: Number of all words in the database
–  m: Number of unique words
–  γ: Overhead for rank dictionary (around 0.6)
–  Not so different from simply storing the array (n log m bits)
Solving range intersection using
wavelet tree
Range intersection: recap	
§  Array A of length n, values 1 ≤ A[i] ≤ m
§  Query word = Interval
§  Range intersection: rint(A,[i,j],[k,l])
•  Find set intersection of A[i,j] and A[k,l]
§  O(α log m) time using wavelet tree
A = 1 3 2 6 8 1 5 7 2 7 4 5 (the intervals [i,j] and [k,l] are marked on A)
O(1)-time division of an interval	
§  Using the rank operations, the division of an
interval can be done in constant time
•  rank0 for left child and rank1 for right child
•  Naïve = linear time in the total number of elements
[1,8]	Aroot	1 3 2 6 8 1 5 7 2 7 4 5
[1,4]	Aleft	1 3 2 1 2 4
[5,8]	Aright	6 8 5 7 7 5
Fast computation of range intersection on
wavelet tree	
[1,8]	1 3 2 6 8 1 5 7 2 7 4 5
[1,4] [5,8]	1 3 2 1 2 4 | 6 8 5 7 7 5
[1,2] [3,4] [5,6] [7,8]	1 2 1 2 | 3 4 | 6 5 5 | 8 7 7
Leaves	1 1 2 2 3 4 5 5 6 7 7 8
Callouts on the figure: "Pruned", "solution!!"
Height of the tree: log m
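Range intersection on a wavelet tree can be sketched as a simultaneous descent of both intervals. The code below is an illustrative pointer-based version, not the succinct structure from the talk: plain prefix-count lists stand in for the rank dictionaries, and intervals are 0-based and half-open, unlike the 1-based inclusive notation on the slides.

```python
from typing import List, Optional


class WaveletTree:
    """Wavelet tree over values in [lo, hi]; prefix counts play the role of rank0."""

    def __init__(self, values: List[int], lo: int, hi: int) -> None:
        self.lo, self.hi = lo, hi
        self.left: Optional["WaveletTree"] = None
        self.right: Optional["WaveletTree"] = None
        if lo == hi or not values:
            return
        mid = (lo + hi) // 2
        # zeros[x] = how many of the first x elements go to the left child.
        self.zeros: List[int] = [0]
        for v in values:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = WaveletTree([v for v in values if v <= mid], lo, mid)
        self.right = WaveletTree([v for v in values if v > mid], mid + 1, hi)

    def rint(self, i: int, j: int, k: int, l: int) -> List[int]:
        """Distinct values present in both A[i:j] and A[k:l], in ascending order."""
        if i >= j or k >= l:
            return []  # one interval is empty here: prune the whole subtree
        if self.lo == self.hi:
            return [self.lo]  # a leaf reached by both intervals is a solution
        li, lj, lk, ll = (self.zeros[x] for x in (i, j, k, l))
        # Each interval is divided between the children with two lookups.
        return (self.left.rint(li, lj, lk, ll)
                + self.right.rint(i - li, j - lj, k - lk, l - ll))


A = [1, 3, 2, 6, 8, 1, 5, 7, 2, 7, 4, 5]
tree = WaveletTree(A, 1, 8)
common = tree.rint(2, 5, 8, 10)  # A[2:5] = 2 6 8 and A[8:10] = 2 7
```

Each surviving path has length at most the tree height, and subtrees are abandoned as soon as either interval becomes empty, which is where the pruning in the figure comes from.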
Solve inner product search using
wavelet tree
•  Inverted index
1	(1:3) (4:1)
2	(3:3) (4:2) (6:2) (8:3)
3	(6:1) (7:3)
4	(4:2) (5:3)
5	(2:3) (6:2)
§  Build two arrays by concatenating feature ids and weights separately in each row
§  Array of ids ➞ wavelet tree
A	1 4 3 4 6 8 6 7 4 5 2 6
§  Array of weights ➞ RMQ data structure
B	3 1 3 2 2 3 1 3 2 3 3 2
§  RMQ data structure: compute max B[t,s] in O(1) time and |B|log|B|/2 + |B|logM + 2n bits of space
Solve inner product search
§  Query = multiple interval extensions of range intersection
§  Find ids whose sum-of-products for weights is at least threshold
Computing upper bound of inner product
in O(1) time 	
•  Using RMQ data structure, upper bound of inner
product can be computed in O(1)-time
•  We compute max B[t,s] for each interval on
wavelet tree and compute upper bound	
A (wavelet tree)	1 4 3 4 6 8 6 7 4 5 2 6
B (RMQ)	3 1 3 2 2 3 1 3 2 3 3 2
Next level of A	1 4 3 4 4 2 | 6 8 6 7 5 6
Ex) Q=(2:2,4:1)
Upper bound: 3・2 + 3・1 = 9
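The upper bound in the example can be reproduced with any range-maximum structure. The sketch below swaps in a sparse table, which answers queries in O(1) time but uses O(n log n) words rather than the succinct space quoted in the talk. Rows are given as 0-based half-open intervals into B, and only the root level is shown; the talk applies the same idea to each interval on the wavelet tree.

```python
from typing import Dict, List, Tuple


class SparseTableMax:
    """Range-maximum queries over a fixed list (sparse table, not succinct)."""

    def __init__(self, values: List[int]) -> None:
        n = len(values)
        self.table: List[List[int]] = [list(values)]
        span = 1
        while 2 * span <= n:
            prev = self.table[-1]
            # Entry i of the new row covers values[i : i + 2 * span].
            self.table.append([max(prev[i], prev[i + span])
                               for i in range(n - 2 * span + 1)])
            span *= 2

    def query(self, start: int, end: int) -> int:
        """Maximum of values[start:end]; the interval must be non-empty."""
        level = (end - start).bit_length() - 1
        row = self.table[level]
        return max(row[start], row[end - (1 << level)])


B = [3, 1, 3, 2, 2, 3, 1, 3, 2, 3, 3, 2]
# Row of each inverted-index key inside the concatenated arrays.
rows: Dict[int, Tuple[int, int]] = {1: (0, 2), 2: (2, 6), 3: (6, 8), 4: (8, 10), 5: (10, 12)}


def inner_product_upper_bound(rmq: SparseTableMax, query: Dict[int, int]) -> int:
    """Sum over query elements of (query weight) x (largest weight in that row)."""
    return sum(weight * rmq.query(*rows[key])
               for key, weight in query.items() if key in rows)


rmq = SparseTableMax(B)
bound = inner_product_upper_bound(rmq, {2: 2, 4: 1})
```

For Q=(2:2,4:1) the largest weights in rows 2 and 4 are both 3, giving the bound 3・2 + 3・1 = 9 from the slide; a subtree whose bound falls below the required inner product can be pruned.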
Experiments	
•  42,971,672 compounds in PubChem database
•  Use KCF-S descriptors, 642,297 dimensions
•  Use search time and memory as evaluation
measures
•  Compare SITAd (proposed) to
–  OVA: compute similarity one-by-one
–  INV (state-of-the-art): similarity search using inverted
index
–  INV+VBYTE: INV compressed by variable byte codes
–  INV+PD: INV compressed by PForDelta
Search time for the number of
compounds	
[Figure: search time (sec) versus number of descriptors (0 to 4e+07). Series: SITAd epsilon=0.9, SITAd epsilon=0.95, SITAd epsilon=0.98, inverted index, inverted index (varbyte), inverted index (pfordelta)]
Search time and memory (MB)
on 42 million compounds	
[Figure: search time (sec) versus memory (megabytes). Series: SITAd epsilon=0.98, SITAd epsilon=0.95, SITAd epsilon=0.9, INV, INV-VBYTE, INV-PD, OVA. Data labels on the plot: 2,400; 0.23; 0.61; 1.54; 33,012; 5.24; 9.58; 8,171]
Construction time	
[Figure: construction time (sec) versus number of descriptors (0 to 4e+07). Series: SITAd, INV, INV-VBYTE, INV-PD]
Summary	
•  Presented SITAd, a scalable similarity search for
molecular descriptors
•  Uses two data structures: wavelet tree and RMQ
•  Takes around 1 sec and uses 2.5GB of memory for
searching 42 million compounds
•  Future work: develop similarity search methods
using ANN
Software for similarity search is available
https://sites.google.com/site/yasuotabei/	
•  All software is applicable to high-dimensional data and
hundreds of millions of data points
•  All pairs similarity search (similarity join)
- SketchSort for cosine similarity
- SketchSortj for Jaccard similarity
- SketchSort-minmax for minmax similarity
•  Similarity search
- SMBT for Jaccard similarity
•  Graph similarity search
- gWT

Call For Paper -3rd International Conference on Artificial Intelligence Advan...
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
 
smart pill dispenser is designed to improve medication adherence and safety f...
smart pill dispenser is designed to improve medication adherence and safety f...smart pill dispenser is designed to improve medication adherence and safety f...
smart pill dispenser is designed to improve medication adherence and safety f...
 
Mechatronics material . Mechanical engineering
Mechatronics material . Mechanical engineeringMechatronics material . Mechanical engineering
Mechatronics material . Mechanical engineering
 
Applications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdfApplications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdf
 
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICSUNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
UNIT 4 LINEAR INTEGRATED CIRCUITS-DIGITAL ICS
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
 

SISAP17

  • 1. Scalable Similarity Search for Molecular Descriptors. Yasuo Tabei, RIKEN Center for Advanced Intelligence Project (AIP), Japan. Joint work with Simon J. Puglisi, University of Helsinki, Finland. SISAP’17, Oct. 6, 2017
  • 2. Similarity search in chemoinformatics •  Similarity search of chemical compounds is an important task in novel drug discovery •  Important fact: similar molecules tend to have similar molecular functions •  The functions of a query compound can be found by searching databases of compounds •  The size of the whole chemical space is said to be approximately 10^60 •  There are large databases storing tens of millions of compounds, e.g., PubChem, ChEMBL •  Scalable similarity searches of chemical compounds are required
  • 3. Chemical fingerprint •  Binary vector representation of a molecule, e.g., x=(1, 0, 0, 1, 0) –  Each dimension indicates the presence/absence of a substructure –  Representative fingerprints: Dragon, PubChem, etc. •  Jaccard (a.k.a. Tanimoto) similarity is used •  Many methods have been proposed –  Multibit tree, XOR-based, b-bit minhashing, etc.
  • 4. Molecular descriptor (NEW) •  Integer vector representation of molecules –  x=(3, 1, 0, 0, 2), equivalent to the set W=(1:3, 2:1, 5:2) –  Each dimension indicates a chemical property •  Descriptors: RINGO [Vida et al.,05] and KCF-S [Kotera et al., 13] •  Generalized Jaccard: $J(W, Q) = \sum_i \min(w_i, q_i) / \sum_i \max(w_i, q_i)$ •  Similarity search of molecular descriptors is still in its infancy •  Problem: Find all Wi similar to query Q (similarity ≧ε)
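The generalized Jaccard similarity on slide 4 can be illustrated with a short sketch. It assumes the usual min-max definition (sum of element-wise minima over sum of element-wise maxima) and stores each descriptor as a dict from feature id to weight; the function name `generalized_jaccard` is chosen here for illustration and does not come from the slides.

```python
def generalized_jaccard(w, q):
    """Generalized Jaccard (min-max) similarity of two sparse integer vectors.

    w and q map feature id -> positive integer weight; a missing feature
    counts as weight 0.
    """
    features = set(w) | set(q)
    numerator = sum(min(w.get(f, 0), q.get(f, 0)) for f in features)
    denominator = sum(max(w.get(f, 0), q.get(f, 0)) for f in features)
    return numerator / denominator if denominator else 0.0


# The descriptor from slide 4, x = (3, 1, 0, 0, 2), as a sparse set W.
W = {1: 3, 2: 1, 5: 2}
print(generalized_jaccard(W, W))               # identical vectors
print(generalized_jaccard({1: 3}, {1: 3, 4: 1}))
```

For binary weights this reduces to the ordinary Jaccard (Tanimoto) similarity used for fingerprints on slide 3.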
  • 5. Similarity search using inverted index [Nasr’12] •  Inverted index: associative array –  Key = feature id, value = list of (descriptor id : weight) pairs •  Similarity search for query Q •  Look up the inverted index for each query element and compute similarities (i) Descriptors (i: Wi): 1: (1:3); 2: (5:3); 3: (2:3); 4: (1:1,2:2,4:2); 5: (4:3); 6: (2:2,3:1,5:2); 7: (3:3); 8: (2:3) (ii) Inverted index (feature: list): 1: (1:3) (4:1); 2: (3:3) (4:2) (6:2) (8:3); 3: (6:1) (7:3); 4: (4:2) (5:3); 5: (2:3) (6:2) (iii) Similarity search for query Q=(1:3, 4:1) scans list 1: (1:3) (4:1) and list 4: (4:2) (5:3) •  Linear time in the total length of the scanned lists
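A minimal sketch of the inverted-index search on slide 5, using the eight descriptors shown there. It assumes the generalized Jaccard (min-max) similarity and relies on the identity sum of maxima = |W| + |Q| - sum of minima, so only the lists of the query's features need to be scanned. The helper names (`build_inverted_index`, `search`) are illustrative, not taken from [Nasr’12].

```python
from collections import defaultdict

# Descriptors from slide 5: descriptor id -> {feature id: weight}.
DESCRIPTORS = {
    1: {1: 3}, 2: {5: 3}, 3: {2: 3}, 4: {1: 1, 2: 2, 4: 2},
    5: {4: 3}, 6: {2: 2, 3: 1, 5: 2}, 7: {3: 3}, 8: {2: 3},
}


def build_inverted_index(descriptors):
    """Return (feature id -> [(descriptor id, weight)], descriptor id -> weight sum)."""
    index = defaultdict(list)
    norms = {}
    for i, w in sorted(descriptors.items()):
        norms[i] = sum(w.values())
        for feature, weight in w.items():
            index[feature].append((i, weight))
    return index, norms


def search(index, norms, query, eps):
    """Return sorted (descriptor id, similarity) pairs with similarity >= eps."""
    q_norm = sum(query.values())
    overlap = defaultdict(int)          # descriptor id -> sum of minima so far
    for feature, q_weight in query.items():
        for i, weight in index.get(feature, []):
            overlap[i] += min(weight, q_weight)
    hits = []
    for i, s in overlap.items():
        similarity = s / (norms[i] + q_norm - s)
        if similarity >= eps:
            hits.append((i, similarity))
    return sorted(hits)


index, norms = build_inverted_index(DESCRIPTORS)
print(search(index, norms, {1: 3, 4: 1}, eps=0.7))
```

The running time is proportional to the total length of the scanned lists, which is the cost the next slide identifies as the drawback.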
  • 6. Drawback •  Scanning lists takes much time, especially for long lists •  Huge memory: –  N: number of descriptors –  M: maximum weight •  One can compress inverted index by using compression methods, e.g., variable-byte codes and PForDelta •  Decompression is time-consuming •  Challenge: developing a fast and space-efficient similarity search for molecular descriptors
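To make the compression trade-off on slide 6 concrete, here is a small sketch of variable-byte coding applied to the gaps of a sorted id list. Several byte-layout conventions exist; this one (7 payload bits per byte, low bits first, high bit set on the last byte of each number) is an assumption and not necessarily the variant used in the experiments.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order bits first.

    The high bit is set on the final byte of each number.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                 # last byte of this number
            numbers.append(current | ((byte & 0x7F) << shift))
            current, shift = 0, 0
        else:
            current |= byte << shift
            shift += 7
    return numbers


ids = [3, 4, 6, 8]                                        # a sorted id list
gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]   # 3, 1, 2, 2
encoded = vbyte_encode(gaps)
print(len(encoded), vbyte_decode(encoded))
```

Every lookup has to run the decoding loop over the whole list, which is why the slide notes that decompression is time-consuming.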
  • 7. SITAd: Scalable similarity search for molecular descriptors •  Two techniques 1.  Database partitioning 2.  Conversion to inner product search •  Build a wavelet tree based on these two techniques •  Solve the inner product search on the wavelet tree
  • 8. Database partitioning •  Descriptors: W1=(1:1, 3:1), W2=(2:1), W3=(2:2, 4:1), W4=(2:1, 4:1), W5=(3:1), W6=(1:2), W7=(1:1, 4:1), W8=(1:1, 2:2) •  Reordered by block: W2=(2:1), W5=(3:1), W1=(1:1, 3:1), W4=(2:1, 4:1), W7=(1:1, 4:1), W6=(1:2), W3=(2:2, 4:1), W8=(1:1, 2:2) •  Theorem 1 •  Classify each descriptor Wi into block Bc •  The search space can be limited to blocks satisfying Theorem 1 for a given query q and ε
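The statement of Theorem 1 is not reproduced in the transcript. The sketch below therefore rests on an assumption: blocks are keyed by the sum of weights c = |W| (which matches the grouping in the slide's example), and the filter is the standard bound that generalized Jaccard similarity ≧ ε implies ε|Q| ≤ |W| ≤ |Q|/ε. The function names are hypothetical.

```python
from collections import defaultdict

# Descriptors from slide 8: descriptor id -> {feature id: weight}.
DESCRIPTORS = {
    1: {1: 1, 3: 1}, 2: {2: 1}, 3: {2: 2, 4: 1}, 4: {2: 1, 4: 1},
    5: {3: 1}, 6: {1: 2}, 7: {1: 1, 4: 1}, 8: {1: 1, 2: 2},
}


def partition_by_norm(descriptors):
    """Group descriptor ids into blocks keyed by their weight sum (assumed block key)."""
    blocks = defaultdict(list)
    for i, w in sorted(descriptors.items()):
        blocks[sum(w.values())].append(i)
    return dict(blocks)


def candidate_blocks(blocks, query, eps):
    """Block keys c that can contain a descriptor with similarity >= eps.

    Assumed bound: min-max similarity <= min(|W|,|Q|) / max(|W|,|Q|),
    hence eps * |Q| <= c <= |Q| / eps.
    """
    q_norm = sum(query.values())
    return sorted(c for c in blocks if eps * q_norm <= c <= q_norm / eps)


blocks = partition_by_norm(DESCRIPTORS)
print(blocks)
print(candidate_blocks(blocks, {2: 1, 4: 1}, eps=0.9))
```

With a high threshold such as ε = 0.9 only blocks whose weight sum is close to the query's survive, so most of the database is never touched.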
  • 9. Conversion to inner product search •  Similarity search using generalized Jaccard similarity can be converted to an inner product search (inner product over the set ≧ a constant that depends on ε) •  How can the inner product search be solved efficiently? •  Consider the simple case in which all weights are one Ex) x = (3, 2, 0, 4, 2) ➞ x’= (1, 1, 0, 0, 1) ➞ W’=(1,2,5) •  It can then be solved as a semi-conjunctive query
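One way to see why the threshold becomes a constant inside a block is sketched below. It assumes the min-max form of generalized Jaccard: with S the sum of element-wise minima, S / (|W| + |Q| - S) ≧ ε is equivalent to S ≧ ε(|W| + |Q|) / (1 + ε), and |W| is fixed within a block. S itself is an inner product of unary-expanded binary vectors, modelled here as a set intersection. This is an illustrative derivation, not necessarily the exact formulation used in the talk.

```python
def unary_set(w):
    """Expand {feature: weight} into {(feature, 1), ..., (feature, weight)}."""
    return {(f, k) for f, weight in w.items() for k in range(1, weight + 1)}


def min_sum(w, q):
    """Sum of element-wise minima, computed as a binary inner product."""
    return len(unary_set(w) & unary_set(q))


def overlap_threshold(w_norm, q_norm, eps):
    """Smallest min-sum S for which S / (w_norm + q_norm - S) >= eps."""
    return eps * (w_norm + q_norm) / (1 + eps)


W = {1: 1, 2: 2, 4: 2}
Q = {1: 3, 4: 1}
S = min_sum(W, Q)                       # min(1,3) + min(2,1) = 2
needed = overlap_threshold(sum(W.values()), sum(Q.values()), eps=0.5)
print(S, needed, S >= needed)
```

Because the right-hand side depends only on |W|, |Q| and ε, every descriptor in the same block is tested against the same constant.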
  • 10. 2 Conjunctive Query •  Query with k keywords •  Example: (Word 2, Word 4) •  Identify the set intersection by sorting the merged id list •  It takes O(|A|+|B|) time •  Can that be any faster? Inverted index (word: document ids): Word 1: 1,3; Word 2: 2,6,8; Word 3: 1,5,7; Word 4: 2,7. A = (2, 6, 8), B = (2, 7); merged and sorted: 2 2 6 7 8
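The O(|A|+|B|) baseline on slide 10 can be sketched as a two-pointer merge over the two sorted id lists. The function name `intersect_sorted` is illustrative.

```python
def intersect_sorted(a, b):
    """Intersection of two sorted id lists in O(len(a) + len(b)) time."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


# Word 2 and Word 4 from the slide's inverted index.
print(intersect_sorted([2, 6, 8], [2, 7]))
```

Every element of both lists is visited at most once, which is what the wavelet-tree approach on the following slides tries to avoid.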
  • 11. Alternation α •  Number of switches after sorting •  There exists a data structure that finds the set intersection in O(α log m) time (Barbay/Kenyon, 2002) •  m: maximum value. Example: A = (2, 6, 8), B = (2, 7); merged and sorted: 2 2 6 7 8; α = 2
  • 12. Range intersection on array •  Concatenate all rows of the inverted index •  Array A of length n, values 1 ≤ A[i] ≤ m •  Query word = interval •  Range intersection: rint(A,[i,j],[k,l]) –  Find the set intersection of A[i,j] and A[k,l] •  O(α log m) time using a wavelet tree! A = 1 3 2 6 8 1 5 7 2 7 4 5, with the two query intervals [i,j] and [k,l] marked on A
  • 14. Tree of subarrays: lower half = left, higher half = right. [1,8]: 1 3 2 6 8 5 7 1 2 7 4 5; [1,4]: 1 3 2 1 2 4; [5,8]: 6 8 5 7 7 5; [1,2]: 1 2 1 2; [3,4]: 3 4; [5,6]: 6 5 5; [7,8]: 8 7 7; leaves: 1 1 2 2 3 4 5 5 6 7 7 8
  • 15. Record whether each element is in the lower half (0) or the higher half (1). [1,8]: 0 0 0 1 1 1 1 0 0 1 0 1; [1,4]: 0 1 0 0 0 1; [5,8]: 0 1 0 1 1 0; [1,2]: 0 1 0 1; [3,4]: 0 1; [5,6]: 1 0 0; [7,8]: 1 0 0; leaves: 1 2 3 4 5 6 7 8
  • 16. Index each bit array with a rank dictionary •  With a rank dictionary, the rank operation can be done in O(1) time –  rank_c(B,i): return the number of occurrences of c∈{0,1} in B[1…i] Ex) B=0110011100 (positions i = 1…10): rank_1(B,8)=5, rank_0(B,5)=3
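The rank operation on slide 16 can be sketched with a plain prefix-count array. This is a simplification: it answers queries in O(1) time but stores one counter per position, whereas a succinct rank dictionary keeps the overhead to a small fraction of the bit array. The class name `RankDict` is illustrative.

```python
class RankDict:
    """O(1) rank over a bit array, using an uncompressed prefix-count table."""

    def __init__(self, bits):
        # prefix[i] = number of 1s in B[1..i] (1-based), prefix[0] = 0
        self.prefix = [0]
        for bit in bits:
            self.prefix.append(self.prefix[-1] + bit)

    def rank1(self, i):
        """Number of 1s in B[1..i]."""
        return self.prefix[i]

    def rank0(self, i):
        """Number of 0s in B[1..i]."""
        return i - self.prefix[i]


B = RankDict([int(c) for c in "0110011100"])
print(B.rank1(8), B.rank0(5))   # the slide's example values: 5 and 3
```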
  • 17. Wavelet Tree = Collection of bit arrays indexed by rank dictionaries. [1,8]: 0 0 0 1 1 1 1 0 0 1 0 1; [1,4]: 0 1 0 0 0 1; [5,8]: 0 1 0 1 1 0; [1,2]: 0 1 0 1; [3,4]: 0 1; [5,6]: 1 0 0; [7,8]: 1 0 0; leaves: 1 2 3 4 5 6 7 8
  • 18. Memory Usage •  (1+γ) n log m bits –  n: number of all words in the database –  m: number of unique words –  γ: overhead for the rank dictionary (around 0.6) –  Not so different from simply storing the array (n log m bits)
  • 19. Solving range intersection using wavelet tree
  • 20. Range intersection: recap •  Array A of length n, values 1 ≤ A[i] ≤ m •  Query word = interval •  Range intersection: rint(A,[i,j],[k,l]) –  Find the set intersection of A[i,j] and A[k,l] •  O(α log m) time using a wavelet tree. A = 1 3 2 6 8 1 5 7 2 7 4 5, with the two query intervals [i,j] and [k,l] marked on A
  • 21. O(1)-time division of an interval •  Using the rank operations, the division of an interval can be done in constant time –  rank_0 for the left child and rank_1 for the right child –  Naïve = linear time in the total number of elements. A_root ([1,8]) = 1 3 2 6 8 1 5 7 2 7 4 5; A_left ([1,4]) = 1 3 2 1 2 4; A_right ([5,8]) = 6 8 5 7 7 5
  • 22. Fast computation of range intersection on wavelet tree. [1,8]: 1 3 2 6 8 1 5 7 2 7 4 5; [1,4]: 1 3 2 1 2 4; [5,8]: 6 8 5 7 7 5; [1,2]: 1 2 1 2; [3,4]: 3 4; [5,6]: 6 5 5; [7,8]: 8 7 7; leaves: 1 1 2 2 3 4 5 5 6 7 7 8. Pruned solution!!
  • 23. Fast computation of range intersection on wavelet tree. [1,8]: 1 3 2 6 8 1 5 7 2 7 4 5; [1,4]: 1 3 2 1 2 4; [5,8]: 6 8 5 7 7 5; [1,2]: 1 2 1 2; [3,4]: 3 4; [5,6]: 6 5 5; [7,8]: 8 7 7; leaves: 1 1 2 2 3 4 5 5 6 7 7 8. Height: log m
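The range-intersection procedure of slides 12 to 23 can be sketched with a pointer-based wavelet tree. This is a simplified illustration: each node keeps a plain prefix count of "goes left" bits instead of a succinct rank dictionary, so it shows the interval-splitting and pruning logic but not the space bound. The class and method names are chosen here and are not taken from the SITAd software.

```python
class WaveletTree:
    """Wavelet tree over integer values in [lo, hi] (illustrative, not succinct)."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not values:
            return
        mid = (lo + hi) // 2
        # zeros[i] = how many of the first i elements go to the left child,
        # i.e. rank_0 of this node's bit array.
        self.zeros = [0]
        for v in values:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = WaveletTree([v for v in values if v <= mid], lo, mid)
        self.right = WaveletTree([v for v in values if v > mid], mid + 1, hi)

    def rint(self, i, j, k, l):
        """Values occurring in both A[i..j] and A[k..l] (1-based, inclusive)."""
        return self._rint(i - 1, j, k - 1, l)

    def _rint(self, s1, e1, s2, e2):
        # Intervals are half-open [s, e) at this node.
        if s1 >= e1 or s2 >= e2:
            return []                    # prune: one interval is empty
        if self.lo == self.hi:
            return [self.lo]             # leaf reached with both intervals non-empty
        z = self.zeros
        found = self.left._rint(z[s1], z[e1], z[s2], z[e2])
        found += self.right._rint(s1 - z[s1], e1 - z[e1], s2 - z[s2], e2 - z[e2])
        return found


# Concatenated array from slide 12; positions 3..5 hold Word 2's list (2, 6, 8)
# and positions 9..10 hold Word 4's list (2, 7) from slide 10.
A = [1, 3, 2, 6, 8, 1, 5, 7, 2, 7, 4, 5]
tree = WaveletTree(A, 1, 8)
print(tree.rint(3, 5, 9, 10))
```

Each interval is mapped to a child with two rank lookups, and a branch is abandoned as soon as either interval becomes empty, so only root-to-leaf paths that can lead to a common value are explored.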
  • 24. Solve inner product search using wavelet tree
  • 25. Solve inner product search •  Inverted index (feature: list): 1: (1:3) (4:1); 2: (3:3) (4:2) (6:2) (8:3); 3: (6:1) (7:3); 4: (4:2) (5:3); 5: (2:3) (6:2) •  Build two arrays by concatenating the ids and the weights of each row separately •  Array of ids A = 1 4 3 4 6 8 6 7 4 5 2 6 ➞ wavelet tree •  Array of weights B = 3 1 3 2 2 3 1 3 2 3 3 2 ➞ RMQ data structure •  RMQ data structure: computes max B[t,s] in O(1) time and |B|log|B|/2 + |B|logM + 2n bits of space •  Query = multiple-interval extension of range intersection •  Find ids whose sum of products of weights is at least the threshold
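The construction of the two arrays on slide 25 can be sketched as follows. Besides A and B it records, for each feature, the 1-based interval its row occupies, since a query word is an interval over the concatenated arrays. The function name `flatten_index` and the `intervals` dict are illustrative additions.

```python
# Inverted index from slide 25: feature id -> [(descriptor id, weight), ...].
INDEX = {
    1: [(1, 3), (4, 1)],
    2: [(3, 3), (4, 2), (6, 2), (8, 3)],
    3: [(6, 1), (7, 3)],
    4: [(4, 2), (5, 3)],
    5: [(2, 3), (6, 2)],
}


def flatten_index(index):
    """Concatenate rows into an id array A and a weight array B.

    Also returns feature id -> (first, last) 1-based positions of its row.
    """
    ids, weights, intervals = [], [], {}
    for feature in sorted(index):
        start = len(ids) + 1
        for descriptor_id, weight in index[feature]:
            ids.append(descriptor_id)
            weights.append(weight)
        intervals[feature] = (start, len(ids))
    return ids, weights, intervals


A, B, intervals = flatten_index(INDEX)
print(A)
print(B)
print(intervals)
```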
  • 26. Computing upper bound of inner product in O(1) time •  Using the RMQ data structure, an upper bound of the inner product can be computed in O(1) time •  We compute max B[t,s] for each interval on the wavelet tree and compute the upper bound. A = 1 4 3 4 6 8 6 7 4 5 2 6 (wavelet tree; second level: [1,4]: 1 4 3 4 4 2, [5,8]: 6 8 6 7 5 6); B = 3 1 3 2 2 3 1 3 2 3 3 2 (RMQ). Ex) Q=(2:2,4:1): 3・2 + 3・1 = 9
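The upper-bound computation on slide 26 can be sketched with a sparse table standing in for the RMQ structure. A sparse table also answers range-maximum queries in O(1) time but needs O(|B| log |B|) words, so it is a swapped-in substitute and not the succinct structure whose space bound is quoted on slide 25. The sketch evaluates the bound only on the root-level intervals, whereas the slides apply it to every interval visited on the wavelet tree. All names are illustrative.

```python
class SparseTableMax:
    """Range-maximum queries in O(1) time after O(n log n) preprocessing."""

    def __init__(self, values):
        # table[k][i] = max(values[i : i + 2**k])
        self.table = [list(values)]
        span = 1
        while 2 * span <= len(values):
            prev = self.table[-1]
            self.table.append(
                [max(prev[i], prev[i + span])
                 for i in range(len(values) - 2 * span + 1)]
            )
            span *= 2

    def query(self, t, s):
        """Maximum of B[t..s], 1-based inclusive."""
        level = (s - t + 1).bit_length() - 1
        row = self.table[level]
        return max(row[t - 1], row[s - (1 << level)])


def upper_bound(rmq, intervals, query):
    """Sum over query features of (query weight * largest weight in that row)."""
    return sum(q_weight * rmq.query(*intervals[f])
               for f, q_weight in query.items() if f in intervals)


B = [3, 1, 3, 2, 2, 3, 1, 3, 2, 3, 3, 2]           # weight array from slide 25
intervals = {1: (1, 2), 2: (3, 6), 3: (7, 8), 4: (9, 10), 5: (11, 12)}
rmq = SparseTableMax(B)
print(upper_bound(rmq, intervals, {2: 2, 4: 1}))    # 3*2 + 3*1 = 9, as on the slide
```

If the bound for a node falls below the required inner product, no id under that node can qualify and the subtree can be skipped.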
  • 27. Experiments •  42,971,672 compounds in the PubChem database •  Use KCF-S descriptors with 642,297 dimensions •  Use search time and memory as evaluation measures •  Compare SITAd (proposed) to –  OVA: compute similarity one-by-one –  INV (state-of-the-art): similarity search using an inverted index –  INV+VBYTE: INV compressed by variable-byte codes –  INV+PD: INV compressed by PForDelta
  • 28. Search time for the number of compounds [Figure: search time (sec) vs. # of descriptors (0 to 4e+07) for SITAd epsilon=0.9, SITAd epsilon=0.95, SITAd epsilon=0.98, inverted index, inverted index (varbyte) and inverted index (pfordelta)]
  • 29. Search time and memory (MB) on 42 million compounds [Figure: search time (sec) vs. memory (megabytes) for SITAd epsilon=0.98, SITAd epsilon=0.95, SITAd epsilon=0.9, INV, INV-VBYTE, INV-PD and OVA; data labels in extraction order: 2,400; 0.23; 0.61; 1.54; 33,012; 5.24; 9.58; 8,171]
  • 30. Construction time [Figure: construction time (sec) vs. # of descriptors (0 to 4e+07) for SITAd, INV, INV-VBYTE and INV-PD]
  • 31. Summary •  Presented SITAd, a scalable similarity search method for molecular descriptors •  Uses two data structures: wavelet tree and RMQ •  Takes around 1 sec and uses 2.5GB of memory for searching 42 million compounds •  Future work: develop similarity search methods using ANN
  • 32. Software for similarity search is available at https://sites.google.com/site/yasuotabei/ •  All the software is applicable to high-dimensional data and hundreds of millions of data items •  All pairs similarity search (similarity join) - SketchSort for cosine similarity - SketchSortj for Jaccard similarity - SketchSort-minmax for minmax similarity •  Similarity search - SMBT for Jaccard similarity •  Graph similarity search - gWT