Scalable Similarity Search for Molecular Descriptors
Yasuo Tabei
RIKEN Center for Advanced Intelligent Project (AIP), Japan
Joint work with Simon J. Puglisi, University of Helsinki, Finland
SISAP’17, Oct. 6, 2017
Similarity search in chemoinformatics
•  Similarity search of chemical compounds is an important task in novel drug discovery
•  Important fact: similar molecules tend to have similar molecular functions
•  The functions of a query compound can be found by searching databases of compounds
•  The size of the whole chemical space is said to be approximately 10^60
•  There are large databases storing tens of millions of compounds, e.g., PubChem and ChEMBL
•  Scalable similarity searches of chemical compounds are required
Chemical fingerprint
•  Binary vector representation of a molecule
E.g.: x=(1, 0, 0, 1, 0)
–  Each dimension indicates the presence/absence of a substructure
–  Representative fingerprints: Dragon, PubChem, etc.
•  Jaccard (a.k.a. Tanimoto) similarity is used
•  Many methods have been proposed
–  Multibit tree, XOR-based, b-bit minhashing, etc.
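As a small illustration of the Jaccard (Tanimoto) similarity on binary fingerprints, the sketch below counts the bits set in both vectors and divides by the bits set in at least one. The function name `tanimoto`, the second fingerprint `y`, and the convention for two all-zero fingerprints are assumptions made for this example, not part of the slides.

```python
def tanimoto(x, y):
    """Jaccard (Tanimoto) similarity of two equal-length binary fingerprints."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)    # substructures present in both
    either = sum(1 for a, b in zip(x, y) if a or b)   # substructures present in at least one
    # Assumed convention: two all-zero fingerprints count as identical.
    return both / either if either else 1.0


x = (1, 0, 0, 1, 0)   # fingerprint from the slide
y = (1, 1, 0, 1, 0)   # hypothetical second fingerprint
print(tanimoto(x, y))
```

Here two substructures are shared and three occur in at least one molecule, so the similarity is 2/3.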
Molecular descriptor (NEW)
•  Integer vector representation of molecules
–  x=(3, 1, 0, 0, 2), equivalent to set W=(1:3, 2:1, 5:2)
–  Each dimension indicates a chemical property
•  Descriptors: RINGO [Vida et al., 05] and KCF-S [Kotera et al., 13]
•  Similarity measure: generalized Jaccard
•  Similarity search of molecular descriptors is still in its infancy
•  Problem: find all Wi whose similarity to query Q is at least ε
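The generalized Jaccard formula itself is not present in the extracted slide text. The sketch below therefore assumes the common inner-product form J(W,Q) = W·Q / (||W||² + ||Q||² − W·Q), which fits the later "conversion to inner product search" slide; treat that definition, and the helper names `to_sparse`, `inner` and `generalized_jaccard`, as assumptions rather than the authors' exact formulation.

```python
def to_sparse(x):
    """Dense integer vector -> {feature id: weight}, using 1-based feature ids."""
    return {d: w for d, w in enumerate(x, start=1) if w != 0}


def inner(w, q):
    """Inner product of two sparse descriptors."""
    return sum(weight * q[d] for d, weight in w.items() if d in q)


def generalized_jaccard(w, q):
    """ASSUMED definition: W.Q / (|W|^2 + |Q|^2 - W.Q)."""
    dot = inner(w, q)
    denom = inner(w, w) + inner(q, q) - dot
    return dot / denom if denom else 1.0   # both descriptors empty


W = to_sparse((3, 1, 0, 0, 2))   # {1: 3, 2: 1, 5: 2}, as on the slide
Q = {1: 3, 4: 1}                 # query used on the inverted index slide
print(W, generalized_jaccard(W, Q))
```

Under this assumed definition, W·Q = 9, ||W||² = 14 and ||Q||² = 10, so the similarity is 9 / 15 = 0.6.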
Similarity search using inverted index [Nasr’12]
•  Inverted index: associative array
–  Key = feature id, value = list of (descriptor id : weight) pairs
•  Similarity search for query Q
•  Look up the inverted index for each query element and compute similarities

(i) Descriptors
i    Wi
1    (1:3)
2    (5:3)
3    (2:3)
4    (1:1,2:2,4:2)
5    (4:3)
6    (2:2,3:1,5:2)
7    (3:3)
8    (2:3)

(ii) Inverted index
1    (1:3) (4:1)
2    (3:3) (4:2) (6:2) (8:3)
3    (6:1) (7:3)
4    (4:2) (5:3)
5    (2:3) (6:2)

(iii) Similarity search for query Q=(1:3, 4:1)
1    (1:3) (4:1)
4    (4:2) (5:3)
•  Linear time in the total length of the lists
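The inverted index example can be reproduced with a few lines of Python. This is a minimal sketch with ordinary dictionaries and lists; the function names `build_inverted_index` and `inner_products` are hypothetical, and the scan only accumulates inner products, leaving the final similarity computation aside.

```python
from collections import defaultdict

# Descriptors from table (i): descriptor id -> {feature id: weight}
descriptors = {
    1: {1: 3}, 2: {5: 3}, 3: {2: 3}, 4: {1: 1, 2: 2, 4: 2},
    5: {4: 3}, 6: {2: 2, 3: 1, 5: 2}, 7: {3: 3}, 8: {2: 3},
}


def build_inverted_index(descs):
    """feature id -> list of (descriptor id, weight), ordered by descriptor id."""
    index = defaultdict(list)
    for i in sorted(descs):
        for feature, weight in descs[i].items():
            index[feature].append((i, weight))
    return dict(index)


def inner_products(index, query):
    """Scan the list of every query feature and accumulate weight products."""
    acc = defaultdict(int)
    for feature, q_weight in query.items():
        for i, weight in index.get(feature, []):
            acc[i] += weight * q_weight
    return dict(acc)


index = build_inverted_index(descriptors)
scores = inner_products(index, {1: 3, 4: 1})   # Q = (1:3, 4:1)
print(index[1], index[4])
print(scores)
```

Only the lists of features 1 and 4 are scanned, so the cost is linear in their total length; descriptors 1, 4 and 5 are the only ones that receive a non-zero score.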
Drawback	
•  Scanning lists takes much time, especially for long lists
•  Huge memory:
–  N: number of descriptors
–  M: maximum weight
•  One can compress inverted index by using compression
methods, e.g., variable-byte codes and PForDelta
•  Decompression is time-consuming
•  Challenge: developing a fast and space-efficient
similarity search for molecular descriptors
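To make the compression trade-off concrete, here is a minimal sketch of variable-byte coding applied to the gaps of a sorted id list. The byte layout (7 payload bits per byte, high bit marking the last byte of a number) is one common variant and is an assumption here; the helper names are hypothetical, and PForDelta is not shown.

```python
def to_gaps(ids):
    """Replace sorted ids by differences to the previous id (first gap = first id)."""
    return [cur - prev for prev, cur in zip([0] + ids[:-1], ids)]


def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order bits first."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("only non-negative integers are supported")
        while n >= 128:
            out.append(n & 0x7F)   # high bit clear: more bytes follow
            n >>= 7
        out.append(n | 0x80)       # high bit set: last byte of this number
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


ids = [3, 4, 6, 8]                       # ids in the list of feature 2
encoded = vbyte_encode(to_gaps(ids))
print(len(encoded), from_gaps(vbyte_decode(encoded)))
```

Small gaps fit in one byte each, but every query has to decode the list byte by byte before it can be scanned, which is the decompression cost the slide refers to.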
SITAd: Scalable similarity search for molecular descriptors
•  Two techniques
1.  Database partitioning
2.  Conversion to inner product search
•  Build a wavelet tree based on these two techniques
•  Solve inner product search on the wavelet tree
Database partitioning
Input descriptors:
W1=(1:1, 3:1)
W2=(2:1)
W3=(2:2, 4:1)
W4=(2:1, 4:1)
W5=(3:1)
W6=(1:2)
W7=(1:1, 4:1)
W8=(1:1, 2:2)
Descriptors reordered into blocks:
W2=(2:1)
W5=(3:1)
W1=(1:1, 3:1)
W4=(2:1, 4:1)
W7=(1:1, 4:1)
W6=(1:2)
W3=(2:2, 4:1)
W8=(1:1, 2:2)
Theorem 1
•  Classify each descriptor Wi into block Bc
•  Search space can be limited to the blocks satisfying Theorem 1 for a given query Q and ε
Conversion to inner product search
•  Similarity search using generalized Jaccard similarity can be converted to inner product search
•  How to solve inner product search efficiently?
•  Suppose the simple case in which all weights are one
Ex) x = (3, 2, 0, 0, 2) ➞ x’= (1, 1, 0, 0, 1) ➞ W’=(1,2,5)
•  Can be solved as a semi-conjunctive query
[Equation with annotated terms: constant, inner product, set; condition ≧ε]
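The equation on this slide did not survive extraction. As a hedged reconstruction of the idea, assume the generalized Jaccard similarity has the inner-product form J(W,Q) = W·Q / (||W||² + ||Q||² − W·Q); this definition is an assumption, not taken from the slide text. The threshold condition then rearranges into a lower bound on the inner product:

```latex
\begin{align*}
J(W,Q) = \frac{W \cdot Q}{\|W\|^2 + \|Q\|^2 - W \cdot Q} \ge \varepsilon
  &\iff W \cdot Q \ge \varepsilon \left( \|W\|^2 + \|Q\|^2 - W \cdot Q \right) \\
  &\iff (1 + \varepsilon)\, W \cdot Q \ge \varepsilon \left( \|W\|^2 + \|Q\|^2 \right) \\
  &\iff W \cdot Q \ge \frac{\varepsilon}{1 + \varepsilon} \left( \|W\|^2 + \|Q\|^2 \right)
\end{align*}
```

The first step uses the fact that the denominator is positive for non-empty descriptors. If all descriptors in one block share the same squared norm (again an assumption about how blocks are formed), the right-hand side is a constant for that block and a given query. When all weights are one, W·Q equals the size of the set intersection |W' ∩ Q'|, which is why the simple case reduces to a conjunctive-style query.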
Conjunctive Query
•  Query with k keywords
•  E.g., (Word 2, Word 4)
•  Identify the set intersection by sorting the merged id list
•  It takes O(|A|+|B|) time
•  Can that be any faster?

Word    Document ids
Word 1  1,3
Word 2  2,6,8
Word 3  1,5,7
Word 4  2,7

A (Word 2): 2 6 8    B (Word 4): 2 7
Merged and sorted: 2 2 6 7 8
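A linear-time intersection of two sorted id lists can be sketched as a two-pointer merge, which visits each element of A and B at most once. The function name `intersect_sorted` and the dictionary `postings` are hypothetical names for this example; the lists are those from the table.

```python
def intersect_sorted(a, b):
    """Intersection of two sorted id lists in O(|A| + |B|) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # id occurs in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1             # advance the list with the smaller head
        else:
            j += 1
    return out


postings = {
    "Word 1": [1, 3],
    "Word 2": [2, 6, 8],
    "Word 3": [1, 5, 7],
    "Word 4": [2, 7],
}
print(intersect_sorted(postings["Word 2"], postings["Word 4"]))   # [2]
```

The merge never skips ahead, so even when the answer is tiny it still walks both lists; that is the cost the alternation-based bound on the following slides improves on.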
Alternation α
•  Number of switches after sorting
•  There exists a data structure that finds the set intersection in O(α log m) time (Barbay/Kenyon, 2002)
•  m: maximum value
Example: α = 2
A, B:    2 6 8 | 2 7
Sorted:  2 2 6 7 8
Range intersection on array
•  Concatenate all rows of the inverted index
•  Array A of length n, values 1 ≤ A[i] ≤ m
•  Query word = interval
•  Range intersection: rint(A,[i,j],[k,l])
–  Find the set intersection of A[i,j] and A[k,l]
•  O(α log m) time using a wavelet tree!

A = 1 3 2 6 8 1 5 7 2 7 4 5
(i, j and k, l mark the endpoints of the two query intervals in A)
Definition of Wavelet Tree
Tree of subarrays: lower half = left, higher half = right

[1,8]                     1 3 2 6 8 5 7 1 2 7 4 5
[1,4] [5,8]               1 3 2 1 2 4 | 6 8 5 7 7 5
[1,2] [3,4] [5,6] [7,8]   1 2 1 2 | 3 4 | 6 5 5 | 8 7 7
Leaves                    1 1 2 2 3 4 5 5 6 7 7 8
Remember whether each element is in the lower half (0) or the higher half (1)

[1,8]                     0 0 0 1 1 1 1 0 0 1 0 1
[1,4] [5,8]               0 1 0 0 0 1 | 0 1 0 1 1 0
[1,2] [3,4] [5,6] [7,8]   0 1 0 1 | 0 1 | 1 0 0 | 1 0 0
Leaves                    1 2 3 4 5 6 7 8
Index each bit array with a rank dictionary
•  With a rank dictionary, the rank operation can be done in O(1) time
–  rank_c(B,i): return the number of occurrences of c∈{0,1} in B[1…i]
Ex) B=0110011100
i  1 2 3 4 5 6 7 8 9 10
B  0 1 1 0 0 1 1 1 0 0
rank_1(B,8)=5
rank_0(B,5)=3
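The rank operation can be illustrated with a table of prefix counts. This is only a sketch of the interface: it answers rank in O(1) time but stores one integer per position, whereas a practical rank dictionary uses far less extra space. The class name `RankDictionary` is hypothetical.

```python
class RankDictionary:
    """O(1) rank over a bit array via precomputed prefix counts (not succinct)."""

    def __init__(self, bits):
        # prefix[i] = number of 1s among the first i bits
        self.prefix = [0]
        for b in bits:
            self.prefix.append(self.prefix[-1] + b)

    def rank1(self, i):
        """Number of 1s in B[1..i] (1-based, inclusive)."""
        return self.prefix[i]

    def rank0(self, i):
        """Number of 0s in B[1..i] (1-based, inclusive)."""
        return i - self.prefix[i]


B = [int(c) for c in "0110011100"]
rd = RankDictionary(B)
print(rd.rank1(8), rd.rank0(5))   # 5 3, matching the slide
```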
Wavelet Tree = Collection of bit arrays indexed by rank dictionaries

[1,8]                     0 0 0 1 1 1 1 0 0 1 0 1
[1,4] [5,8]               0 1 0 0 0 1 | 0 1 0 1 1 0
[1,2] [3,4] [5,6] [7,8]   0 1 0 1 | 0 1 | 1 0 0 | 1 0 0
Leaves                    1 2 3 4 5 6 7 8
Memory Usage
•  (1+γ) n log m bits
–  n: number of all words in the database
–  m: number of unique words
–  γ: overhead for the rank dictionary (around 0.6)
–  Not so different from simply storing the array (n log m bits)
Solving range intersection using wavelet tree
Range intersection: recap
•  Array A of length n, values 1 ≤ A[i] ≤ m
•  Query word = interval
•  Range intersection: rint(A,[i,j],[k,l])
–  Find the set intersection of A[i,j] and A[k,l]
•  O(α log m) time using a wavelet tree

A = 1 3 2 6 8 1 5 7 2 7 4 5
(i, j and k, l mark the endpoints of the two query intervals in A)
O(1)-time division of an interval
•  Using the rank operations, the division of an interval can be done in constant time
–  rank_0 for the left child and rank_1 for the right child
–  Naïve = time linear in the total number of elements

[1,8]   Aroot    1 3 2 6 8 1 5 7 2 7 4 5
[1,4]   Aleft    1 3 2 1 2 4
[5,8]   Aright   6 8 5 7 7 5
Fast computation of range intersection on wavelet tree

[1,8]                     1 3 2 6 8 1 5 7 2 7 4 5
[1,4] [5,8]               1 3 2 1 2 4 | 6 8 5 7 7 5
[1,2] [3,4] [5,6] [7,8]   1 2 1 2 | 3 4 | 6 5 5 | 8 7 7
Leaves                    1 1 2 2 3 4 5 5 6 7 7 8
Labels on the tree: "Pruned", "solution!!"
Height: log m
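The range intersection walk can be sketched in Python as follows. Both query intervals are mapped to each child with rank counts in constant time, and a subtree is abandoned as soon as either interval becomes empty. This sketch stores plain prefix counts in pointer-based nodes instead of succinct bit arrays, and the names `WaveletNode`, `rint` and `range_intersection` are hypothetical.

```python
class WaveletNode:
    """Wavelet tree node covering the value range [lo, hi]."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not values:
            return
        mid = (lo + hi) // 2
        # zeros[p] = how many of the first p elements go to the left child (bit 0)
        self.zeros = [0]
        for v in values:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = WaveletNode([v for v in values if v <= mid], lo, mid)
        self.right = WaveletNode([v for v in values if v > mid], mid + 1, hi)


def rint(node, s1, e1, s2, e2, out):
    """Collect values occurring in both half-open position ranges [s1,e1) and [s2,e2)."""
    if s1 >= e1 or s2 >= e2:
        return                      # pruned: one interval is empty in this subtree
    if node.lo == node.hi:
        out.append(node.lo)         # leaf reached by both intervals
        return
    z = node.zeros
    rint(node.left, z[s1], z[e1], z[s2], z[e2], out)                      # rank0
    rint(node.right, s1 - z[s1], e1 - z[e1], s2 - z[s2], e2 - z[e2], out)  # rank1


def range_intersection(root, i, j, k, l):
    """rint(A,[i,j],[k,l]) with 1-based inclusive positions, as on the slides."""
    out = []
    rint(root, i - 1, j, k - 1, l, out)
    return out


A = [1, 3, 2, 6, 8, 1, 5, 7, 2, 7, 4, 5]
root = WaveletNode(A, 1, 8)
# Positions 3..5 hold (2, 6, 8) and positions 9..10 hold (2, 7).
print(range_intersection(root, 3, 5, 9, 10))   # [2]
```

Each interval is split with two table lookups per node, so the work per visited node is constant and the recursion depth is bounded by the tree height log m.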
Solve inner product search using wavelet tree
•  Inverted index
1    (1:3) (4:1)
2    (3:3) (4:2) (6:2) (8:3)
3    (6:1) (7:3)
4    (4:2) (5:3)
5    (2:3) (6:2)
•  Build two arrays by concatenating the ids and the weights stored in each row separately
–  Array of ids ➞ wavelet tree:              1 4 3 4 6 8 6 7 4 5 2 6
–  Array of weights ➞ RMQ data structure:    3 1 3 2 2 3 1 3 2 3 3 2
•  RMQ data structure: computes max B[t,s] in O(1) time using |B|log|B|/2 + |B|logM + 2n bits of space
•  Query = multiple-interval extension of range intersection
•  Find ids whose sum of products of weights is at least the threshold
Solve inner product search
[Figure: arrays A and B]
Computing upper bound of inner product in O(1) time
•  Using the RMQ data structure, an upper bound of the inner product can be computed in O(1) time
•  We compute max B[t,s] for each interval on the wavelet tree and compute the upper bound

A (wavelet tree)   1 4 3 4 6 8 6 7 4 5 2 6
                   1 4 3 4 4 2 | 6 8 6 7 5 6
B (RMQ)            3 1 3 2 2 3 1 3 2 3 3 2
Ex) Q=(2:2,4:1): 3・2 + 3・1 = 9
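The upper bound in the example can be reproduced with a range-maximum structure over the weight array. The sketch below uses a sparse table, which answers max B[t,s] in O(1) time but needs O(n log n) words, so it stands in for the space-efficient RMQ structure named on the slides rather than implementing it. It evaluates the bound only at the root level; the names `SparseTableRMQ`, `rows` and `upper_bound` are hypothetical.

```python
class SparseTableRMQ:
    """Range-maximum queries in O(1) time after O(n log n) preprocessing."""

    def __init__(self, values):
        self.table = [list(values)]   # table[k][p] = max of values[p .. p + 2**k - 1]
        n, span = len(values), 1
        while 2 * span <= n:
            prev = self.table[-1]
            self.table.append(
                [max(prev[p], prev[p + span]) for p in range(n - 2 * span + 1)]
            )
            span *= 2

    def query(self, t, s):
        """Maximum of values[t..s], 0-based inclusive."""
        level = (s - t + 1).bit_length() - 1
        row = self.table[level]
        return max(row[t], row[s - (1 << level) + 1])


# Weight array B from the slide, and the 0-based inclusive interval of each
# feature's row inside the concatenated arrays.
weights = [3, 1, 3, 2, 2, 3, 1, 3, 2, 3, 3, 2]
rows = {1: (0, 1), 2: (2, 5), 3: (6, 7), 4: (8, 9), 5: (10, 11)}
rmq = SparseTableRMQ(weights)


def upper_bound(query):
    """Sum over query features of (query weight * largest weight in that row)."""
    return sum(q_w * rmq.query(*rows[f]) for f, q_w in query.items() if f in rows)


print(upper_bound({2: 2, 4: 1}))   # 3*2 + 3*1 = 9
```

No descriptor can have an inner product with Q larger than this value, so intervals whose bound falls below the threshold can be skipped without inspecting individual weights.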
Experiments
•  42,971,672 compounds in the PubChem database
•  Use KCF-S descriptors with 642,297 dimensions
•  Use search time and memory as evaluation measures
•  Compare SITAd (proposed) to
–  OVA: compute similarity one by one
–  INV (state-of-the-art): similarity search using inverted index
–  INV+VBYTE: INV compressed by variable-byte codes
–  INV+PD: INV compressed by PForDelta
Search time for the number of compounds
[Figure: search time (sec) versus number of descriptors (0 to 4e+07). Series: SITAd epsilon=0.9, SITAd epsilon=0.95, SITAd epsilon=0.98, inverted index, inverted index (varbyte), inverted index (pfordelta)]
Search time and memory (MB) on 42 million compounds
[Figure: search time (sec) versus memory (megabytes). Series: SITAd epsilon=0.98, SITAd epsilon=0.95, SITAd epsilon=0.9, INV, INV−VBYTE, INV−PD, OVA. Data labels on the plot: 2,400; 0.23; 0.61; 1.54; 33,012; 5.24; 9.58; 8,171]
Construction time
[Figure: construction time (sec) versus number of descriptors (0 to 4e+07). Series: SITAd, INV, INV-VBYTE, INV-PD]
Summary
•  Presented SITAd, a scalable similarity search method for molecular descriptors
•  Uses two data structures: wavelet tree and RMQ
•  Takes around 1 sec and uses 2.5 GB of memory for searching 42 million compounds
•  Future work: develop similarity search methods using ANN
Software for similarity search is available
https://sites.google.com/site/yasuotabei/	
•  All software is applicable to high-dimensional data and to hundreds of millions of data items
•  All pairs similarity search (similarity join)
- SketchSort for cosine similarity
- SketchSortj for Jaccard similarity
- SketchSort-minmax for minmax similarity
•  Similarity search
- SMBT for Jaccard similarity
•  Graph similarity search
- gWT

SISAP17

  • 1. Scalable Similarity Search for Molecular Descriptors. Yasuo Tabei, RIKEN Center for Advanced Intelligence Project (AIP), Japan. Joint work with Simon J. Puglisi, University of Helsinki, Finland. SISAP’17, Oct. 6, 2017
  • 2. Similarity search in chemoinformatics •  Similarity search of chemical compounds is an important task in novel drug discovery •  Important fact: similar molecules tend to have similar molecular functions •  The functions of a query compound can be found by searching databases of compounds •  The whole chemical space is said to be approximately 10^60 in size •  There are large databases storing tens of millions of compounds, e.g., PubChem, ChEMBL •  Scalable similarity searches of chemical compounds are required
  • 3. Chemical fingerprint •  Binary vector representation of a molecule, e.g., x=(1, 0, 0, 1, 0) –  Each dimension indicates the presence/absence of a substructure –  Representative fingerprints: Dragon, PubChem and others •  Jaccard (a.k.a. Tanimoto) similarity is used •  Many methods have been proposed –  Multibit tree, XOR-based, b-bit minhashing and others
  • 4. Molecular descriptor (NEW) •  Integer vector representation of molecules –  x=(3, 1, 0, 0, 2), equivalent to set W=(1:3, 2:1, 5:2) –  Each dimension indicates a chemical property •  Descriptors: RINGO [Vida et al., 05] and KCF-S [Kotera et al., 13] •  Generalized Jaccard similarity is used (the formula is an image on the slide) •  Similarity search of molecular descriptors is in its infancy •  Problem: find all Wi whose similarity to query Q is ≥ ε
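The sparse representation on this slide can be sketched in a few lines. The transcript does not preserve the generalized Jaccard formula, so the sketch assumes the Tanimoto-style form J(W,Q) = W·Q / (‖W‖² + ‖Q‖² − W·Q), which fits the later slides on inner product search. The function names `to_sparse`, `dot` and `generalized_jaccard` are illustrative and not taken from the talk.

```python
def to_sparse(x):
    """Dense integer vector -> {1-based feature id: weight}, dropping zeros."""
    return {i: w for i, w in enumerate(x, start=1) if w != 0}


def dot(w, q):
    """Inner product of two sparse descriptors."""
    if len(w) > len(q):
        w, q = q, w  # iterate over the shorter descriptor
    return sum(v * q.get(i, 0) for i, v in w.items())


def generalized_jaccard(w, q):
    """ASSUMED form: W.Q / (|W|^2 + |Q|^2 - W.Q); not quoted from the slide."""
    wq = dot(w, q)
    denom = dot(w, w) + dot(q, q) - wq
    return wq / denom if denom else 0.0


W = to_sparse([3, 1, 0, 0, 2])   # the slide's example x=(3, 1, 0, 0, 2)
Q = {1: 3, 4: 1}
sim = generalized_jaccard(W, Q)  # inner product 9, squared norms 14 and 10
```

Storing only the non-zero entries keeps the cost of one similarity computation proportional to the number of non-zero features rather than to the full dimension.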
  • 5. Similarity search using inverted index [Nasr’12] •  Inverted index: associative array –  Key = feature id, value = list of (descriptor id : weight) pairs •  Similarity search for query Q: look up the inverted index for each query element and compute similarities •  (i) Descriptors: W1=(1:3), W2=(5:3), W3=(2:3), W4=(1:1,2:2,4:2), W5=(4:3), W6=(2:2,3:1,5:2), W7=(3:3), W8=(2:3) •  (ii) Inverted index: 1 → (1:3) (4:1); 2 → (3:3) (4:2) (6:2) (8:3); 3 → (6:1) (7:3); 4 → (4:2) (5:3); 5 → (2:3) (6:2) •  (iii) Similarity search for query Q=(1:3, 4:1): scan lists 1 and 4, i.e., (1:3) (4:1) and (4:2) (5:3) •  Linear time in the total length of the scanned lists
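The lookup described on this slide can be sketched as follows, using the slide's eight descriptors and query. The similarity is again assumed to be the Tanimoto-style generalized Jaccard, since the slide's formula is not preserved, and the helper names `build_inverted_index` and `search` are hypothetical.

```python
from collections import defaultdict

# Descriptors from the slide: descriptor id -> {feature id: weight}
db = {1: {1: 3}, 2: {5: 3}, 3: {2: 3}, 4: {1: 1, 2: 2, 4: 2},
      5: {4: 3}, 6: {2: 2, 3: 1, 5: 2}, 7: {3: 3}, 8: {2: 3}}


def build_inverted_index(db):
    """feature id -> list of (descriptor id, weight) pairs."""
    index = defaultdict(list)
    for i, w in db.items():
        for f, v in w.items():
            index[f].append((i, v))
    return index


def search(index, norms, q, eps):
    """Scan the lists of the query features, accumulate inner products,
    then keep descriptors whose (assumed) similarity is at least eps."""
    acc = defaultdict(int)
    for f, qv in q.items():
        for i, v in index.get(f, []):
            acc[i] += v * qv
    qq = sum(v * v for v in q.values())
    hits = {}
    for i, ip in acc.items():
        sim = ip / (norms[i] + qq - ip)
        if sim >= eps:
            hits[i] = sim
    return hits


index = build_inverted_index(db)
norms = {i: sum(v * v for v in w.values()) for i, w in db.items()}
hits = search(index, norms, {1: 3, 4: 1}, eps=0.8)
```

Only descriptors that share at least one feature with the query are touched, but every entry of every scanned list is visited, which is the linear cost the slide points out.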
  • 6. Drawback •  Scanning lists takes much time, especially for long lists •  Huge memory: –  N: number of descriptors –  M: maximum weight •  One can compress inverted index by using compression methods, e.g., variable-byte codes and PForDelta •  Decompression is time-consuming •  Challenge: developing a fast and space-efficient similarity search for molecular descriptors
  • 7. SITAd: Scalable similarity search for molecular descriptors •  Two techniques: 1.  Database partitioning 2.  Conversion to inner product search •  Build a wavelet tree based on these two techniques •  Solve inner product search on the wavelet tree
  • 8. Database partitioning •  Database: W1=(1:1, 3:1), W2=(2:1), W3=(2:2, 4:1), W4=(2:1, 4:1), W5=(3:1), W6=(1:2), W7=(1:1, 4:1), W8=(1:1, 2:2) •  After partitioning, in block order: W2=(2:1), W5=(3:1), W1=(1:1, 3:1), W4=(2:1, 4:1), W7=(1:1, 4:1), W6=(1:2), W3=(2:2, 4:1), W8=(1:1, 2:2) •  Classify each descriptor Wi into block Bc •  Search space can be limited to blocks satisfying Theorem 1 for a given query q and ε
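A rough sketch of the partitioning step, under two assumptions that the transcript does not confirm. First, blocks are keyed by the squared Euclidean norm of each descriptor, which reproduces the ordering in the slide's example. Second, the generalized Jaccard similarity has the Tanimoto form J(W,Q) = W·Q / (‖W‖² + ‖Q‖² − W·Q). The pruning bound below is not the slide's Theorem 1, whose statement is not preserved; it follows from the Cauchy-Schwarz inequality W·Q ≤ ‖W‖‖Q‖, which forces r = ‖W‖/‖Q‖ to satisfy ε r² − (1+ε) r + ε ≤ 0 whenever J(W,Q) ≥ ε.

```python
import math
from collections import defaultdict

# Descriptors from the slide: descriptor id -> {feature id: weight}
db = {1: {1: 1, 3: 1}, 2: {2: 1}, 3: {2: 2, 4: 1}, 4: {2: 1, 4: 1},
      5: {3: 1}, 6: {1: 2}, 7: {1: 1, 4: 1}, 8: {1: 1, 2: 2}}


def partition(db):
    """Group descriptor ids by squared norm c (ASSUMED block key)."""
    blocks = defaultdict(list)
    for i, w in db.items():
        blocks[sum(v * v for v in w.values())].append(i)
    return dict(blocks)


def candidate_blocks(blocks, q, eps, tol=1e-9):
    """Block keys that can contain a descriptor with J(W, q) >= eps,
    for 0 < eps <= 1, using the Cauchy-Schwarz bound described above."""
    qq = sum(v * v for v in q.values())
    root = math.sqrt((1 - eps) * (1 + 3 * eps))
    r_lo = ((1 + eps) - root) / (2 * eps)  # smallest admissible |W| / |q|
    r_hi = ((1 + eps) + root) / (2 * eps)  # largest admissible |W| / |q|
    return sorted(c for c in blocks
                  if r_lo ** 2 * qq - tol <= c <= r_hi ** 2 * qq + tol)


blocks = partition(db)
keys = candidate_blocks(blocks, {1: 2}, eps=0.9)
```

For the query (1:2) and ε = 0.9, only the blocks with squared norm 4 and 5 remain, so five of the eight descriptors are never examined.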
  • 9. Conversion to inner product search •  Similarity search using generalized Jaccard similarity can be converted to inner product search (figure labels: inner product, ≥ ε, constant, set) •  How can inner product search be solved efficiently? •  Consider the simple case where all weights are one. Ex) x = (3, 2, 0, 4, 2) ➞ x’= (1, 1, 0, 0, 1) ➞ W’=(1,2,5) •  It can then be solved as a semi-conjunctive query
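The conversion can be written out under one assumption: that the generalized Jaccard similarity has the Tanimoto form used below. The transcript does not preserve the slide's own formula, so this is a plausible reconstruction rather than a quotation. The denominator is positive whenever W and Q are not both zero, so the inequality can be rearranged safely.

```latex
\begin{align*}
J(W,Q) \ge \varepsilon
&\iff \frac{W \cdot Q}{\|W\|_2^2 + \|Q\|_2^2 - W \cdot Q} \ge \varepsilon \\
&\iff W \cdot Q \ge \varepsilon \left( \|W\|_2^2 + \|Q\|_2^2 \right) - \varepsilon \, W \cdot Q \\
&\iff W \cdot Q \ge \frac{\varepsilon}{1+\varepsilon} \left( \|W\|_2^2 + \|Q\|_2^2 \right)
\end{align*}
```

If every descriptor in a block shares the same norm, the right-hand side is a constant for that block and a given query, which leaves a pure inner product threshold test.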
  • 10. 2 Conjunctive Query •  Query with k keywords, e.g., (Word 2, Word 4) •  Identify the set intersection by sorting the merged id list •  It takes O(|A|+|B|) time •  Can it be done any faster? •  Inverted index (word: document ids): Word 1: 1,3; Word 2: 2,6,8; Word 3: 1,5,7; Word 4: 2,7 •  Example: A = (2, 6, 8), B = (2, 7); merged and sorted: 2, 2, 6, 7, 8
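The O(|A|+|B|) baseline can be sketched as a two-pointer merge over the sorted id lists, shown here on the slide's example lists for Word 2 and Word 4. The function name `intersect_sorted` is illustrative.

```python
def intersect_sorted(a, b):
    """Intersection of two sorted id lists in O(|a| + |b|) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b
        else:
            j += 1  # b[j] cannot appear later in a
    return out


A = [2, 6, 8]  # document ids for Word 2
B = [2, 7]     # document ids for Word 4
common = intersect_sorted(A, B)
```

Every element of both lists may be visited once, which is why long lists make this approach slow.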
  • 11. Alternation α •  Number of switches between lists after sorting •  There exists a data structure that finds the set intersection in O(α log m) time (Barbay/Kenyon, 2002) •  m: maximum value •  Example: A = (2, 6, 8), B = (2, 7); sorted: 2, 2, 6, 7, 8; α = 2
  • 12. Range intersection on array •  Concatenate all rows of the inverted index •  Array A of length n, values 1 ≤ A[i] ≤ m •  Query word = interval •  Range intersection: rint(A,[i,j],[k,l]) –  Find the set intersection of A[i,j] and A[k,l] •  O(α log m) time using a wavelet tree! •  Example: A = 1 3 2 6 8 1 5 7 2 7 4 5, with i, j and k, l marking the two query intervals
  • 14. Tree of subarrays: lower half = left, higher half = right •  [1,8]: 1 3 2 6 8 5 7 1 2 7 4 5 •  [1,4]: 1 3 2 1 2 4; [5,8]: 6 8 5 7 7 5 •  [1,2]: 1 2 1 2; [3,4]: 3 4; [5,6]: 6 5 5; [7,8]: 8 7 7 •  Leaves: 1 1 2 2 3 4 5 5 6 7 7 8
  • 15. Record whether each element is in the lower half (0) or the higher half (1) •  [1,8]: 0 0 0 1 1 1 1 0 0 1 0 1 •  [1,4]: 0 1 0 0 0 1; [5,8]: 0 1 0 1 1 0 •  [1,2]: 0 1 0 1; [3,4]: 0 1; [5,6]: 1 0 0; [7,8]: 1 0 0 •  Leaves: 1 2 3 4 5 6 7 8
  • 16. Index each bit array with a rank dictionary •  With a rank dictionary, the rank operation can be done in O(1) time –  rank_c(B,i): return the number of occurrences of c∈{0,1} in B[1…i] •  Ex) B=0110011100 (positions i = 1 … 10): rank_1(B,8)=5, rank_0(B,5)=3
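A minimal sketch of a rank dictionary follows. It stores the number of 1s before each block boundary and counts the remainder inside a single block, so the work per query is bounded by the block size. Practical implementations pack bits into machine words and use popcount. The class name `RankDict` and the block size of 8 are arbitrary choices for illustration.

```python
class RankDict:
    BLOCK = 8  # illustrative block size

    def __init__(self, bits):
        self.bits = list(bits)
        # block_ones[b] = number of 1s before block b starts
        self.block_ones = [0]
        for start in range(0, len(self.bits), self.BLOCK):
            ones = sum(self.bits[start:start + self.BLOCK])
            self.block_ones.append(self.block_ones[-1] + ones)

    def rank1(self, i):
        """Number of 1s in B[1..i] (1-based, inclusive); rank1(0) is 0."""
        b, r = divmod(i, self.BLOCK)
        start = b * self.BLOCK
        return self.block_ones[b] + sum(self.bits[start:start + r])

    def rank0(self, i):
        """Number of 0s in B[1..i], obtained from rank1."""
        return i - self.rank1(i)


B = RankDict([0, 1, 1, 0, 0, 1, 1, 1, 0, 0])  # the slide's B=0110011100
r1 = B.rank1(8)
r0 = B.rank0(5)
```

Here `r1` and `r0` reproduce the slide's values rank_1(B,8)=5 and rank_0(B,5)=3.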
  • 17. Wavelet Tree = collection of bit arrays indexed by rank dictionaries •  [1,8]: 0 0 0 1 1 1 1 0 0 1 0 1 •  [1,4]: 0 1 0 0 0 1; [5,8]: 0 1 0 1 1 0 •  [1,2]: 0 1 0 1; [3,4]: 0 1; [5,6]: 1 0 0; [7,8]: 1 0 0 •  Leaves: 1 2 3 4 5 6 7 8
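The construction from slides 14 to 17 can be sketched as a pointer-based tree in which each node keeps only its bit array. This is an illustration, not the succinct layout the talk relies on. The helper `access` uses a linear-time count where a rank dictionary would give O(1). Both names are hypothetical.

```python
class WaveletNode:
    """One node of a pointer-based wavelet tree over values in [lo, hi]."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.bits = []
        self.left = self.right = None
        if lo == hi or not values:
            return  # leaf, or an empty subtree
        mid = (lo + hi) // 2
        # 0 = value in the lower half, 1 = value in the higher half
        self.bits = [0 if v <= mid else 1 for v in values]
        self.left = WaveletNode([v for v in values if v <= mid], lo, mid)
        self.right = WaveletNode([v for v in values if v > mid], mid + 1, hi)


def access(node, pos):
    """Recover the value at 0-based position pos from the bits alone."""
    while node.lo != node.hi:
        bit = node.bits[pos]
        # the rank of this bit among the first pos bits is the child position
        pos = node.bits[:pos].count(bit)
        node = node.left if bit == 0 else node.right
    return node.lo


A = [1, 3, 2, 6, 8, 5, 7, 1, 2, 7, 4, 5]  # array from slide 14
root = WaveletNode(A, 1, 8)
```

The root's bit array is 0 0 0 1 1 1 1 0 0 1 0 1, as on slide 15, and `access` shows that the original array is recoverable even though no node stores the values themselves.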
  • 18. Memory usage •  (1+γ) n log m bits –  n: number of all words in the database –  m: number of unique words –  γ: overhead for the rank dictionary (around 0.6) •  Not so different from simply storing the array (n log m bits)
  • 19. Solving range intersection using wavelet tree
  • 20. Range intersection: recap •  Array A of length n, values 1 ≤ A[i] ≤ m •  Query word = interval •  Range intersection: rint(A,[i,j],[k,l]) –  Find the set intersection of A[i,j] and A[k,l] •  O(α log m) time using a wavelet tree •  Example: A = 1 3 2 6 8 1 5 7 2 7 4 5, with i, j and k, l marking the two query intervals
  • 21. O(1)-time division of an interval •  Using the rank operations, an interval can be divided between the two children in constant time –  rank0 for the left child and rank1 for the right child •  Naive approach: time linear in the total number of elements •  Example: A_root [1,8] = 1 3 2 6 8 1 5 7 2 7 4 5; A_left [1,4] = 1 3 2 1 2 4; A_right [5,8] = 6 8 5 7 7 5
  • 22. Fast computation of range intersection on wavelet tree •  [1,8]: 1 3 2 6 8 1 5 7 2 7 4 5 •  [1,4]: 1 3 2 1 2 4; [5,8]: 6 8 5 7 7 5 •  [1,2]: 1 2 1 2; [3,4]: 3 4; [5,6]: 6 5 5; [7,8]: 8 7 7 •  Leaves: 1 1 2 2 3 4 5 5 6 7 7 8 •  Figure labels: pruned; solution!!
  • 23. Fast computation of range intersection on wavelet tree •  [1,8]: 1 3 2 6 8 1 5 7 2 7 4 5 •  [1,4]: 1 3 2 1 2 4; [5,8]: 6 8 5 7 7 5 •  [1,2]: 1 2 1 2; [3,4]: 3 4; [5,6]: 6 5 5; [7,8]: 8 7 7 •  Leaves: 1 1 2 2 3 4 5 5 6 7 7 8 •  Height: log m
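The traversal on slides 21 to 23 can be sketched as a recursion that splits both query intervals at every node and abandons a branch as soon as either interval becomes empty. For brevity the sketch stores a prefix count of 0-bits per node instead of a succinct rank dictionary, so it shows the logic rather than the space bound. The names `RintNode` and `rint` and the half-open 0-based ranges are choices made here.

```python
class RintNode:
    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        # zeros[p] = number of lower-half elements among the first p
        self.zeros = [0]
        if lo == hi:
            return
        mid = (lo + hi) // 2
        for v in values:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = RintNode([v for v in values if v <= mid], lo, mid)
        self.right = RintNode([v for v in values if v > mid], mid + 1, hi)


def rint(node, r1, r2):
    """Values present in both half-open position ranges r1 and r2."""
    (s1, e1), (s2, e2) = r1, r2
    if s1 >= e1 or s2 >= e2:
        return []  # one interval is empty: prune this branch
    if node.lo == node.hi:
        return [node.lo]  # leaf reached with both intervals non-empty
    z = node.zeros
    # rank0 maps a range into the left child, rank1 into the right child
    left = rint(node.left, (z[s1], z[e1]), (z[s2], z[e2]))
    right = rint(node.right, (s1 - z[s1], e1 - z[e1]),
                 (s2 - z[s2], e2 - z[e2]))
    return left + right


A = [1, 3, 2, 6, 8, 1, 5, 7, 2, 7, 4, 5]  # rows 1,3 | 2,6,8 | 1,5,7 | 2,7 | 4,5
root = RintNode(A, 1, 8)
word2_and_word4 = rint(root, (2, 5), (8, 10))
```

For the query (Word 2, Word 4) the right subtree is abandoned after a single level, because the two mapped intervals there share no value range below it, and the only leaf reached is 2.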
  • 24. Solve inner product search using wavelet tree
  • 25. Solve inner product search •  Inverted index: 1 → (1:3) (4:1); 2 → (3:3) (4:2) (6:2) (8:3); 3 → (6:1) (7:3); 4 → (4:2) (5:3); 5 → (2:3) (6:2) •  Build two arrays by concatenating the ids and the weights of each row separately •  Array of ids A = 1 4 3 4 6 8 6 7 4 5 2 6 ➞ wavelet tree •  Array of weights B = 3 1 3 2 2 3 1 3 2 3 3 2 ➞ RMQ data structure •  RMQ data structure: computes max B[t,s] in O(1) time using |B|log|B|/2 + |B|logM + 2n bits of space •  Query = extension of range intersection to multiple intervals •  Find ids whose sum of products of weights is at least the threshold
  • 26. Computing upper bound of inner product in O(1) time •  Using the RMQ data structure, an upper bound of the inner product can be computed in O(1) time •  We compute max B[t,s] for each interval on the wavelet tree and combine the maxima into the upper bound •  Ex) Q=(2:2,4:1): 3・2 + 3・1 = 9 •  Figure: wavelet tree over the id array A, RMQ over the weight array B
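The upper bound can be sketched with a sparse table, a standard O(1)-query range-maximum structure. It is swapped in here for the succinct RMQ structure named on slide 25 and uses O(n log n) words instead of the stated bit bound. The weight array is the one from slide 25. The row boundaries, the class `SparseTableRMQ` and the function `upper_bound` are written out for this illustration.

```python
class SparseTableRMQ:
    """table[j][i] = max of values[i : i + 2**j]."""

    def __init__(self, values):
        self.table = [list(values)]
        n, k = len(values), 1
        while 2 * k <= n:
            prev = self.table[-1]
            self.table.append([max(prev[i], prev[i + k])
                               for i in range(n - 2 * k + 1)])
            k *= 2

    def query_max(self, s, e):
        """Maximum of values[s:e] (half-open, s < e) from two lookups."""
        level = (e - s).bit_length() - 1
        row = self.table[level]
        return max(row[s], row[e - (1 << level)])


# Weight array B from slide 25; rows[f] = half-open position range of feature f
B = [3, 1, 3, 2, 2, 3, 1, 3, 2, 3, 3, 2]
rows = {1: (0, 2), 2: (2, 6), 3: (6, 8), 4: (8, 10), 5: (10, 12)}
rmq = SparseTableRMQ(B)


def upper_bound(rmq, rows, q):
    """Sum over query features of (query weight * largest weight in the row)."""
    return sum(qw * rmq.query_max(*rows[f]) for f, qw in q.items())


bound = upper_bound(rmq, rows, {2: 2, 4: 1})
```

For Q=(2:2, 4:1) the largest weights in rows 2 and 4 are both 3, giving the slide's value 3·2 + 3·1 = 9. If such a bound falls below the threshold for an interval, that part of the tree can be skipped.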
  • 27. Experiments •  42,971,672 compounds in the PubChem database •  Use KCF-S descriptors with 642,297 dimensions •  Use search time and memory as evaluation measures •  Compare SITAd (proposed) to –  OVA: compute similarity one-by-one –  INV (state-of-the-art): similarity search using inverted index –  INV+VBYTE: INV compressed by variable-byte codes –  INV+PD: INV compressed by PForDelta
  • 28. Search time for the number of compounds. Figure: search time (sec) versus number of descriptors (0 to 4e+07) for SITAd (ε = 0.9, 0.95, 0.98), inverted index, inverted index (varbyte) and inverted index (pfordelta)
  • 29. Search time and memory (MB) on 42 million compounds. Figure: search time (sec) versus memory (megabytes) for SITAd (ε = 0.98, 0.95, 0.9), INV, INV−VBYTE, INV−PD and OVA. Values labeled on the plot: 2,400; 0.23; 0.61; 1.54; 33,012; 5.24; 9.58; 8,171
  • 30. Construction time. Figure: construction time (sec) versus number of descriptors (0 to 4e+07) for SITAd, INV, INV-VBYTE and INV-PD
  • 31. Summary •  Presented SITAd, a scalable similarity search for molecular descriptors •  Uses two data structures: wavelet tree and RMQ •  Takes around 1 sec and uses 2.5GB of memory for searching 42 million compounds •  Future work: develop similarity search methods using ANN
  • 32. Software for similarity search is available: https://sites.google.com/site/yasuotabei/ •  All software is applicable to high-dimensional data with hundreds of millions of entries •  All pairs similarity search (similarity join) - SketchSort for cosine similarity - SketchSortj for Jaccard similarity - SketchSort-minmax for minmax similarity •  Similarity search - SMBT for Jaccard similarity •  Graph similarity search - gWT