T5 Informatics GmbH
greg.landrum@t5informatics.com
@dr_greg_landrum
Big (chemical) data? No Problem!
Storing and searching large amounts of data with
open-source software
Greg Landrum
This work is licensed under a
Creative Commons Attribution 4.0
International License.
Defining our terms
Big data is a term for data sets that are so large or
complex that traditional data processing
applications are inadequate to deal with them.
https://en.wikipedia.org/wiki/Big_data
Defining our terms
Big data is a term for data sets that are so large or complex that traditional
data processing applications are inadequate to deal with them. Challenges
include analysis, capture, data curation, search, sharing, storage, transfer,
visualization, querying, updating and information privacy. The term "big
data" often refers simply to the use of predictive analytics, user behavior
analytics, or certain other advanced data analytics methods that extract
value from data, and seldom to a particular size of data set.[2] "There is little
doubt that the quantities of data now available are indeed large, but that’s
not the most relevant characteristic of this new data ecosystem."[3]
https://en.wikipedia.org/wiki/Big_data
Motivation
● Starting point: we frequently end up working with large collections of
compounds and data about them but we don’t really have standard, efficient,
portable, cross-platform ways of storing/working with that data
● Let’s take a look at a few datasets and use cases for them and see what we
can do
A point of faith
We’ve got lots of tools...
… and there’s not going to be a silver bullet that solves every problem.
One size does not fit all.
We need to match the complexity of the tool to the task at hand
Even the simplest tools can end up being problematic
Looking at the technologies
Criteria:
● Flexibility
● Schema free?
● Size of our data on disk
● Speed of retrieving data
Demo/test dataset 1
Working with PubChem
● 223 million substance records
● 99 million substance synonyms
● 91.5 million compound records
● 207 million associations
Use cases: looking up names and basic chemical information while dealing with the real-world
complexity of that data
Demo/test dataset 1
PubChem
● 223 million substance records (15GB)
○ SID, source, source regnum, xref
● 99 million substance synonyms (5GB)
● 91.5 million compound records (42GB)
○ pubchem_{inchi,inchi_key,iupac_name},rdkit_{smiles,inchi,inchi_key}
● 207 million associations (8GB)
Setup
● PostgreSQL 9.5
● Macbook Pro 13 (early 2015) with a 3.1GHz Core i7, 16GB of RAM, 500GB SSD
https://www.postgresql.org/
Dataset 1: Fuzzy name lookup
I’ve got a bunch of compound names and I’d like to get the info available about
them.
chem_integration=# select cid,sid,assoc_type,lower(synonym),rdkit_smiles from pubchem_substance_assocs join
pubchem_substance_synonyms using (sid) join pubchem_compounds using (cid)
where lower(synonym) like 'chlorprom%';
cid | sid | assoc_type | lower | rdkit_smiles
--------+-----------+------------+------------------------------+------------------------------------------
2726 | 7978926 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21
2726 | 8149253 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21
2726 | 252402355 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21
2726 | 268735291 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21
2726 | 273002747 | 1 | chlorpromazine hydrochloride | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21
165214 | 274768709 | 1 | chlorpromazine sulfone | CN(C)CCCN1c2ccccc2S(=O)(=O)c2ccc(Cl)cc21
(6 rows)
Time: 13.206 ms
That’s a typical result.
Dataset 1: Doing some chemistry
String operations on the database are really fast. Can we use those to do some
chemical queries?
“Find all molecules with a particular formula”
“Find all molecules with a particular formula and connectivity”
Dataset 1: InChI aggregation
Using the structure of InChI we can do chemical aggregation just using string
operations:
● Group by formula:
rdkit_inchi like 'InChI=1S/C22H25N5O/%'
● Group by formula + connectivity
rdkit_inchi like 'InChI=1S/C22H25N5O/c1-16-20(21(23)27(24-16)19-6-4-3-5-7-19)17-8-10-18(11-9-17)22(28)26-14-12-25(2)13-15-26/%'
Dataset 1: InChI aggregation
Do that aggregation across pubchem_compound for 1000 random InChIs drawn from it
Results for group by formula:
Dataset 1: InChI aggregation
Do that aggregation across pubchem_compound for 1000 random InChIs drawn from it
Results for group by connectivity:
Aside: examples of duplicate connectivity
Next dataset...
Demo/test dataset 2
Base dataset: 14.9 million compounds from ZINC15
● MW and logp: all tranches
● Reactivity: tranche A ("anodyne")
● Availability: tranche A ("in-stock")
For this example, 4 million of these were randomly selected.
Data fields: label, smiles, molpkl, num_atoms, num_heavy_atoms, num_rotatable,
num_rings, tpsa, mollogp, molwt, Patternfp (#on bits, fpdata), Morganfp2 (#on bits, fpdata)
Data generated using the RDKit
Total dataset size (stored as a Python binary pickle): ~3.5GB
http://zinc15.docking.org/tranches/home/
Benchmarking setup
For data set 2:
Dell XPS desktop with a 3.6GHz i7-4790 CPU, 16GB RAM, standard HD
Ubuntu 16.04
Python 3.5.1 (anaconda python)
Demo/test dataset 2
Queries:
● Q1: Retrieve num_atoms for 50K rows
● Q2: Retrieve on_bit_count for 50K fingerprints
● Q3: Count number of mols with num_atoms>25
● Q4: Count number of fingerprints with on_bit_count=50
● Q5: Count number of fingerprints with on_bit_count between 40 and 50
● Q6: Retrieve fingerprints with on_bit_count=50
● Q7: Retrieve fingerprints with on_bit_count between 40 and 50
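The timings on the following slides are wall-clock numbers; a minimal harness along these lines (the helper name is mine, and the actual benchmark code may differ) is enough to reproduce the pattern:

```python
import time

def best_of(fn, repeats=3):
    """Run fn several times and return the best wall-clock time in milliseconds."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best
```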
Baseline: Reading from Python
Just read everything and unpickle it.
Dataset size: 3.5GB
● Q1: 120ms
● Q2: same
● Q3: 10.2s
● Q4: same
● Q5: same
● Q6: same
● Q7: same
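There's no index of any kind here, so every query is a full pass over the unpickled records. A minimal sketch of Q3 as that kind of brute-force scan (the file layout and the position of num_atoms in each record are illustrative assumptions, not the talk's actual code):

```python
import pickle

def count_num_atoms_gt(path, threshold=25, field=3):
    """Q3, brute force: unpickle the whole dataset, then filter in Python."""
    with open(path, 'rb') as f:
        records = pickle.load(f)
    return sum(1 for rec in records if rec[field] > threshold)
```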
Tech 1: MessagePack
"It's like JSON, but fast and small"
Binary format, simple to read and write from multiple languages
Now being used by the PDB
Readers and writers from many, many languages
Very flexible, schema free
https://github.com/msgpack/msgpack-python
MessagePack performance
Data stored as tuples
Dataset size: 3.5GB
● Q1: 26ms
● Q2: 26ms
● Q3: 2.1s
● Q4: same
● Q5: same
● Q6: same
● Q7: same
Tech 2: FlatBuffers
Cross platform serialization library
Binary format, simple to read and write from
multiple languages
Flexible hierarchical schema
http://google.github.io/flatbuffers/index.html
namespace storage_formats;

table Fingerprint {
  on_bit_count:ushort;
  bytes:[ubyte];
}

table Molecule {
  smiles:string;
  name:string;
  pkl:[ubyte];
  num_atoms:ushort;
  num_heavy_atoms:ushort;
  pattern_fp:Fingerprint;
  morgan2_fp:Fingerprint;
  num_rotatable_bonds:ushort;
  num_rings:ushort;
  tpsa:double;
  mollogp:double;
  molwt:double;
}

root_type Molecule;
FlatBuffers performance
Using a C++ reader
Dataset size: 3.9GB
● Q1: 42ms
● Q2: 26ms
● Q3: 1.2s
● Q4: same
● Q5: same
● Q6: same
● Q7: same
Tech 3: Pandas
"Standard" Python data frame
Powerful data manipulation and aggregation
Very extensible (excellent RDKit integration)
Not suitable for this project because the entire data frame needs to be in memory.
http://pandas.pydata.org/
Tech 4: SQLite
Open-source SQL database
File based, no server required
Extremely flexible
Basic RDKit integration available
Connectors available from many, many languages
Requires a schema
https://sqlite.org/
SQLite performance
Dataset size: 4.6GB (without index)
● Q1: 30ms
● Q2: 30ms
● Q3: 1.4s (with index 13ms)
● Q4: 1.0s (with index 5ms)
● Q5: 1.0s (with index 54ms)
● Q6: 1.1s (with index 199ms)
● Q7: 2.8s (with index 2.2s)
Q5: select count(*) from morganfps where on_bit_count>40 and on_bit_count<50;
Q7: select pkl from morganfps where on_bit_count>40 and on_bit_count<50;
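These numbers are easy to reproduce with Python's built-in sqlite3 module. A toy sketch using the morganfps table from the queries above (the column names match those queries; the rest of the schema and the data are synthetic):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('create table morganfps (molid integer, on_bit_count integer, pkl blob)')
con.executemany('insert into morganfps values (?,?,?)',
                [(i, 40 + i % 20, b'\x00') for i in range(1000)])

# Without an index, these are full table scans:
q5 = con.execute('select count(*) from morganfps '
                 'where on_bit_count>40 and on_bit_count<50').fetchone()[0]

# The "with index" timings just add an index on the queried column:
con.execute('create index obc_idx on morganfps (on_bit_count)')
q4 = con.execute('select count(*) from morganfps '
                 'where on_bit_count=50').fetchone()[0]
```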
Tech 5: PostgreSQL
Open-source SQL database
Uses a server
Extremely flexible and extensible
Strong RDKit integration
Connectors available from many, many languages
Requires a schema
https://www.postgresql.org/
PostgreSQL performance
Dataset size: 4.0GB (4.3GB with indices)
● Q1: 23ms
● Q2: 102ms
● Q3: 697ms (215ms with index)
● Q4: 432ms (58ms with index)
● Q5: 504ms (182ms with index)
● Q6: 608ms (220ms with index)
● Q7: 2.4s (2.2s with index)
Q5: select count(*) from morganfps where on_bit_count>40 and on_bit_count<50;
Q7: select pkl from morganfps where on_bit_count>40 and on_bit_count<50;
Tech 6: bcolz
Columnar data format for Python
Compressed on disk and/or in memory
Provides a similar API to numpy
Pretty good querying primitives
Requires a schema
https://github.com/Blosc/bcolz
dtype=[('zincid','S16'),
('smiles','S256'),
('pkl','S512'),
('pfp_obc','u4'),
('pfp_pkl','S256'),
('mfp2_obc','u4'),
('mfp2_pkl','S256'),
('num_atoms','u4'),
('num_heavy_atoms','u4'),
('num_rotatable_bonds','u4'),
('num_rings','u4'),
('tpsa','f8'),
('mollogp','f8'),
('molwt','f8')]
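Much of bcolz's speed on the count queries comes from the columnar layout itself: a query only reads the columns it mentions, stored contiguously, instead of deserializing whole records (bcolz adds chunking and compression on top of this). A stdlib-only toy illustration of the idea, not bcolz itself:

```python
from array import array

# One column stored contiguously as unsigned 32-bit ints ('u4' in the dtype above)
num_atoms = array('I', [10, 30, 27, 5, 26])

# A Q3-style scan touches only this column, never the SMILES/pickle/fingerprint fields
count = sum(1 for n in num_atoms if n > 25)
```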
Bcolz performance
Dataset size: 2.0GB
● Q1: 8ms
● Q2: 8ms
● Q3: 295ms
● Q4: 83ms
● Q5: 587ms
● Q6: 1.1s
● Q7: 2.2s
Q4: len([x for x in tbl.where("mfp2_obc==50", outcols="mfp2_obc")])
Q7: [x for x in tbl.where("(mfp2_obc>40) & (mfp2_obc<50)", outcols="mfp2_pkl")]
Tech 7: dask
"Flexible parallel computing library for analytic computing"
Does *way* more than what I'm using it for here
Provides, among other things, a parallel Pandas-like interface to bcolz data
http://dask.pydata.org/
dask performance
Dataset size: 2.0GB (uses bcolz data)
● Q1: N/A
● Q2: N/A
● Q3: 74ms
● Q4: 116ms
● Q5: 131ms
● Q6: 5.8s
● Q7: 7.8s
Q4: len(df.zincid[df.mfp2_obc==50])
Q7: df.mfp2_pkl[df.mfp2_obc.between(40,50,inclusive=False)]
Things I didn't look at (yet)
HDF5: well-established hierarchical binary data format
Arrow: new (and rapidly evolving) column-oriented format
Parquet: Columnar data store for Hadoop
Impala: massively parallel SQL search on top of Hadoop (from Cloudera)
MonetDB: Open-source column-oriented database
Summary of all that
Technology Size Q1 Q2 Q3 Q4 Q5 Q6 Q7
Raw Python 3.5GB 120ms 120ms 10200ms 10200ms 10200ms 10200ms 10200ms
MessagePack 3.5GB 26ms 25ms 2100ms 2100ms 2100ms 2100ms 2100ms
FlatBuffers 3.9GB 42ms 26ms 1200ms 1200ms 1200ms 1200ms 1200ms
SQLite 4.6GB 30ms 30ms 13ms 5ms 54ms 199ms 2200ms
PostgreSQL 4.3GB 23ms 102ms 215ms 58ms 182ms 220ms 2200ms
bcolz 2.0GB 8ms 8ms 295ms 83ms 587ms 1100ms 2200ms
dask 2.0GB N/A N/A 74ms 116ms 131ms 5800ms 7800ms
Next dataset…
Let’s actually do some chemistry
Demo/test dataset 3
PubChem
● 10 million compound records loaded as RDKit molecules into
PostgreSQL (4.0GB)
● Substructure search index built (3.9GB)
Setup
● PostgreSQL 9.5
● RDKit v2016.09.1
● Macbook Pro 13 (early 2015) with a 3.1GHz Core i7, 16GB of RAM,
500GB SSD
https://www.postgresql.org/
Dataset 3: Substructure searches 1
Construct the Murcko Scaffolds for 100 random molecules from the
pubchem_compound set and use them as substructure queries against the 10
million compounds
Search times are to
retrieve the first 100
results.
Dataset 3: Substructure searches 1
Construct the Murcko Scaffolds for 100 random molecules from the
pubchem_compound set and use them as substructure queries against the 10
million compounds
Zoomed to ignore the 6 queries that took longer
than 4 seconds.
Search times are to
retrieve the first 100
results.
Aside: some slower substructure queries
Mainly a note to myself to go back and tune the fingerprint for these
Example of a slow one:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..419.53 rows=100 width=397) (actual time=78.800..16617.229 rows=100 loops=1)
-> Index Scan using molidx on mols (cost=0.42..41911.42 rows=10000 width=397) (actual time=78.799..16617.169 rows=100 loops=1)
Index Cond: (m @> 'c1ccc(COc2ccc(Cc3ccccc3)cc2)cc1'::mol)
Rows Removed by Index Recheck: 62166
Planning time: 0.064 ms
Execution time: 16617.718 ms
(6 rows)
More normal example:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..419.53 rows=100 width=397) (actual time=7.690..106.251 rows=100 loops=1)
-> Index Scan using molidx on mols (cost=0.42..41911.42 rows=10000 width=397) (actual time=7.688..106.228 rows=100 loops=1)
Index Cond: (m @> 'O=C(CNc1ccccc1)Nc1ccccc1'::mol)
Rows Removed by Index Recheck: 474
Planning time: 0.060 ms
Execution time: 106.646 ms
(6 rows)
Dataset 3: Substructure searches 2
Do a substructure search with 100 random lead-like molecules from ZINC against
the 10 million compounds
Search times are to
retrieve at most the first
100 results.
Dataset 3: Substructure searches 2
Do a substructure search with 100 random lead-like molecules from ZINC against
the 10 million compounds
Search times are to
retrieve at most the first
100 results.
Zoomed to ignore the 7 queries that took longer
than 225ms.
Dataset 3: Substructure searches 3
Do a substructure search with 100 random fragment-like molecules from ZINC
against the 10 million compounds
Search times are to
retrieve at most the first
100 results.
Dataset 3: Substructure searches 3
Do a substructure search with 100 random fragment-like molecules from ZINC
against the 10 million compounds
Search times are to
retrieve at most the first
100 results.
Zoomed to ignore the 6 queries that took longer
than 1s.
Dataset 3: Similarity searches 1
Do a similarity search with 100 random lead-like molecules from ZINC against the
10 million compounds. Similarity threshold = 0.8, fingerprint MFP2
Search times are to
retrieve at most the 10
most similar results
The right tool for the job
The index type provided by PostgreSQL works really well for substructure
searching, but is less effective for similarity queries.
Let’s switch from the general tool to something specialized
Dataset 3: Using the FPB format
Do a similarity search with 100 random lead-like molecules from ZINC against the
10 million compounds. Similarity threshold = 0.8, fingerprint MFP2
File size: 2.6GB
http://chemfp.com/
mean: 90ms / search
Dataset 3: Using chemfp
Do a similarity search with 500 random lead-like molecules from ZINC against the
10 million compounds. Fingerprint MFP2
● Threshold = 0.8: 2.5 seconds (5ms / search)
● Threshold = 0.7: 4.6 seconds (9ms / search)
● Threshold = 0.5: 9.0 seconds (18ms / search)
http://chemfp.com/
The right data structure meets a clever algorithm...
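One of chemfp's tricks is to keep fingerprints grouped by popcount: since Tanimoto(A, B) can never exceed min(|A|, |B|) / max(|A|, |B|), a threshold search can skip whole popcount bands without computing a single similarity. A stdlib-only sketch of the idea, treating integers as bit vectors (this is not chemfp's actual implementation):

```python
def popcount(x):
    """Number of set bits in an integer fingerprint."""
    return bin(x).count('1')

def tanimoto(a, b):
    return popcount(a & b) / popcount(a | b)

def threshold_search(query, fps_by_popcount, threshold):
    """fps_by_popcount maps a popcount to the list of fingerprints with that popcount."""
    q = popcount(query)
    hits = []
    for n, fps in fps_by_popcount.items():
        # Upper bound on Tanimoto for this band; skip bands that can't reach the threshold
        if min(n, q) / max(n, q) < threshold:
            continue
        hits.extend(fp for fp in fps if tanimoto(query, fp) >= threshold)
    return hits
```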
Wrapping up
One size does not fit all: matching the format and algorithm to the problem at hand
is always going to win
For standard searching and relational queries, good old SQL with indices is pretty
hard to beat
But some of the newer column-oriented technologies are very promising,
particularly if you really are working on one column at a time.
We shouldn't lose track of the fact that the tools are just a means to an end

Big (chemical) data? No Problem!

  • 1.
    T5 Informatics GmbH greg.landrum@t5informatics.com @dr_greg_landrum Big(chemical) data? No Problem! Storing and searching large amounts of data with open-source software Greg Landrum This work is licensed under a Creative Commons Attribution 4.0 International License.
  • 2.
    2T5 Informatics Defining ourterms Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. https://en.wikipedia.org/wiki/Big_data
  • 3.
    3T5 Informatics Defining ourterms Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.[2] "There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem."[3] https://en.wikipedia.org/wiki/Big_data
  • 4.
    4T5 Informatics Motivation ● Startingpoint: we frequently end up working with large collections of compounds and data about them but we don’t really have standard, efficient, portable, cross-platform ways of storing/working with that data ● Let’s take a look at a few datasets and use cases for them and see what we can do
  • 5.
    5T5 Informatics A pointof faith We’ve got lots of tools... …. and there’s not going to be a silver bullet that solves every problem. One size does not fit all. We need to match the complexity of the tool to the task at hand Even the simplest tools can end up being problematic ğ
  • 6.
    6T5 Informatics Looking atthe technologies Criteria: ● Flexibility ● Schema free? ● Size of our data on disk ● Speed of retrieving data
  • 7.
    7T5 Informatics Demo/test dataset1 Working with PubChem ● 223 million substance records ● 99 millon substance synonyms ● 91.5 million compound records ● 207 million associations Use cases: looking up names and basic chemical information, but dealing with the real-world complexity of that
  • 8.
    8T5 Informatics Demo/test dataset1 PubChem ● 223 million substance records (15GB) ○ SID, source, source regnum, xref ● 99 millon substance synonyms (5GB) ● 91.5 million compound records (42GB) ○ pubchem_{inchi,inchi_key,iupac_name},rdkit_{smiles,inchi,inchi_key} ● 207 million associations (8GB) Setup ● PostgreSQL 9.5 ● Macbook Pro 13 (early 2015) with a 3.1GHz Core i7, 16MB of RAM, 500GB SSD https://www.postgresql.org/
  • 9.
    9T5 Informatics Dataset 1:Fuzzy name lookup I’ve got a bunch of compound names and I’d like to get the info available about them. chem_integration=# select cid,sid,assoc_type,lower(synonym),rdkit_smiles from pubchem_substance_assocs join pubchem_substance_synonyms using (sid) join pubchem_compounds using (cid) where lower(synonym) like 'chlorprom%'; cid | sid | assoc_type | lower | rdkit_smiles --------+-----------+------------+------------------------------+------------------------------------------ 2726 | 7978926 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21 2726 | 8149253 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21 2726 | 252402355 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21 2726 | 268735291 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21 2726 | 273002747 | 1 | chlorpromazine hydrochloride | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21 165214 | 274768709 | 1 | chlorpromazine sulfone | CN(C)CCCN1c2ccccc2S(=O)(=O)c2ccc(Cl)cc21 (6 rows) Time: 13.206 ms That’s a typical result.
  • 10.
    10T5 Informatics Dataset 1:Doing some chemistry String operations on the database are really fast. Can we use those to do some chemical queries? “Find all molecules with a particular formula” “Find all molecules with a particular formula and connectivity”
  • 11.
    11T5 Informatics Dataset 1:InChI aggregation Using the structure of InChI we can do chemical aggregation just using string operations: ● Group by formula: rdkit_inchi like 'InChI=1S/C22H25N5O/%' ● Group by formula + connectivity rdkit_inchi like 'InChI=1S/C22H25N5O/c1-16-20(21(23)27(24-16)19-6-4-3-5-7-19)17-8-10-18(11-9- 17)22(28)26-14-12-25(2)13-15-26/%'
  • 12.
    12T5 Informatics Dataset 1:InChI aggregation Do that aggregation across pubchem_compound for 1000 random InChIs from pubchem_compound Results for group by formula:
  • 13.
    13T5 Informatics Dataset 1:InChI aggregation Do that aggregation across pubchem_compound for 1000 random InChIs from pubchem_compound Results for group by connectivity:
  • 14.
    14T5 Informatics Aside: examplesof duplicate connectivity
  • 15.
  • 16.
    16T5 Informatics Demo/test dataset2 Base dataset: 14.9 million compounds from ZINC15 ● MW and logp: all tranches ● Reactivity: tranche A ("anodyne") ● Availability: tranche A ("in-stock") For this example, 4 million of these were randomly selected. Data fields: label, smiles, molpkl, num_atoms, num_heavy_atoms, num_rotatable, num_rings, tpsa, mollogp, molwt, Patternfp (#on bits, fpdata), Morganfp2 (#on bits, fpdata) Data generated using the RDKit Total dataset size (stored as python binary pickle): ~3.5GB http://zinc15.docking.org/tranches/home/
  • 17.
    17T5 Informatics Benchmarking setup Fordata set 2: Dell XPS desktop with a 3.6GHz i7-4790 CPU, 16GB RAM, standard HD Ubuntu 16.04 Python 3.5.1 (anaconda python)
  • 18.
    18T5 Informatics Demo/test dataset2 Queries: ● Q1: Retrieve num_atoms for 50K rows ● Q2: Retrieve on_bit_count for 50K fingerprints ● Q3: Count number of mols with num_atoms>25 ● Q4: Count number of fingerprints with on_bit_count=50 ● Q5: Count number of fingerprints with on_bit_count between 40 and 50 ● Q6: Retrieve fingerprints with on_bit_count=50 ● Q7: Retrieve fingerprints with on_bit_count between 40 and 50
  • 19.
    19T5 Informatics Baseline: Readingfrom Python Just read everything and unpickle it. Dataset size: 3.5GB ● Q1: 120ms ● Q2: same ● Q3: 10.2s ● Q4: same ● Q5: same ● Q6: same ● Q7: same 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50
  • 20.
    20T5 Informatics Tech 1:MessagePack "It's like JSON, but fast and small" Binary format, simple to read and write from multiple languages Now being used by the PDB Readers and writers from many, many languages Very flexible, schema free https://github.com/msgpack/msgpack-python
  • 21.
    21T5 Informatics MessagePack performance Datastored as tuples Dataset size: 3.5GB ● Q1: 26ms ● Q2: 26ms ● Q3: 2.1s ● Q4: same ● Q5: same ● Q6: same ● Q7: same 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50
  • 22.
    22T5 Informatics Tech 2:Flat buffers Cross platform serialization library Binary format, simple to read and write from multiple languages Flexible hierarchical schema http://google.github.io/flatbuffers/index.html namespace storage_formats; table Fingerprint { on_bit_count:ushort; bytes:[ubyte]; } table Molecule { smiles:string; name:string; pkl:[ubyte]; num_atoms:ushort; num_heavy_atoms:ushort; pattern_fp:Fingerprint; morgan2_fp:Fingerprint; num_rotatable_bonds:ushort; num_rings:ushort; tpsa:double; mollogp:double; molwt:double; } root_type Molecule;
  • 23.
    23T5 Informatics FlatBuffers performance Usinga C++ reader Dataset size: 3.9GB ● Q1: 42ms ● Q2: 26ms ● Q3: 1.2s ● Q4: same ● Q5: same ● Q6: same ● Q7: same 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50
  • 24.
    24T5 Informatics Tech 3:Pandas "Standard" Python data frame Powerful data manipulation and aggregation Very extensible (excellent RDKit integration) Not suitable for this project because the entire data frame needs to be in memory. http://pandas.pydata.org/
  • 25.
    25T5 Informatics Tech 4:SQLite Open-source SQL database File based, no server required Extremely flexible Basic RDKit integration available Connectors available from many, many languages Requires a schema https://sqlite.org/
  • 26.
    26T5 Informatics SQLite performance Datasetsize: 4.6GB (without index) ● Q1: 30ms ● Q2: 30ms ● Q3: 1.4s (with index 13ms) ● Q4: 1.0s (with index 5ms) ● Q5: 1.0s (with index 54ms) ● Q6: 1.1s (with index 199ms) ● Q7: 2.8s (with index 2.2s) 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50 Q5: select count(*) from morganfps where on_bit_count>40 and on_bit_count<50; Q6: select pkl from morganfps where on_bit_count>40 and on_bit_count<50;
  • 27.
    27T5 Informatics Tech 5:PostgreSQL Open-source SQL database Uses a server Extremely flexible and extensible Strong RDKit integration Connectors available from many, many languages Requires a schema https://www.postgresql.org/
  • 28.
    28T5 Informatics PostgreSQL performance Datasetsize: 4.0GB (4.3GB with indices) ● Q1: 23ms ● Q2: 102ms ● Q3: 697ms (215ms with index) ● Q4: 432ms (58ms with index) ● Q5: 504ms (182ms with index) ● Q6: 608ms (220ms with index) ● Q7: 2.4s (2.2s with index) 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50 Q5: select count(*) from morganfps where on_bit_count>40 and on_bit_count<50; Q6: select pkl from morganfps where on_bit_count>40 and on_bit_count<50;
  • 29.
    29T5 Informatics Tech 6:bcolz Columnar data format for Python Compressed on disk and/or in memory Provides a similar API to numpy Pretty good querying primitives Requires a schema https://github.com/Blosc/bcolz dtype=[('zincid','S16'), ('smiles','S256'), ('pkl','S512'), ('pfp_obc','u4'), ('pfp_pkl','S256'), ('mfp2_obc','u4'), ('mfp2_pkl','S256'), ('num_atoms','u4'), ('num_heavy_atoms','u4'), ('num_rotatable_bonds','u4'), ('num_rings','u4'), ('tpsa','f8'), ('mollogp','f8'), ('molwt','f8')]
  • 30.
    30T5 Informatics Bcolz performance Datasetsize: 2.0GB ● Q1: 8ms ● Q2: 8ms ● Q3: 295ms ● Q4: 83ms ● Q5: 587ms ● Q6: 1.1s ● Q7: 2.2s 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50 Q5: len([x for x in tbl.where("mfp2_obc==50", outcols="mfp2_obc")]) Q6: [x for x in tbl.where("(mfp2_obc>40) & (mfp2_obc<50)", outcols="mfp2_pkl")]
  • 31.
    31T5 Informatics Tech 7:dask "Flexible parallel computing library for analytic computing" Does *way* more than what I'm using it for here Provides, among other things, a parallel Pandas-like interface to bcolz data http://dask.pydata.org/
  • 32.
    32T5 Informatics dask performance Datasetsize: 2.0GB (uses bcolz data) ● Q1: N/A ● Q2: N/A ● Q3: 74ms ● Q4: 116ms ● Q5: 131ms ● Q6: 5.8s ● Q7: 7.8s 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50 Q5: len(df.zincid[df.mfp2_obc==50]) Q6: df.mfp2_pkl[df.mfp2_obc.between(40,50,inclusive=False)]
  • 33.
    33T5 Informatics Things Ididn't look at (yet) HDF5: well-established hierarchical binary data format Arrow: new (and rapidly evolving) column-oriented format Parquet: Columnar data store for Hadoop Impala: massively parallel SQL search on top of Hadoop (from Cloudera) MonetDb: Open-source column-oriented database
  • 34.
    34T5 Informatics Summary ofall that Technology Size Q1 Q2 Q3 Q4 Q5 Q6 Q7 Raw Python 3.5GB 120ms 120ms 10200ms 10200ms 10200ms 10200ms 10200ms MessagePack 3.5GB 26ms 25ms 2100ms 2100ms 2100ms 2100ms 2100ms FlatBuffers 3.9GB 42ms 26ms 1200ms 1200ms 1200ms 1200ms 1200ms SQLite 4.6GB 30ms 30ms 13ms 5ms 54ms 199ms 2200ms PostgreSQL 4.3GB 23ms 102ms 215ms 58ms 182ms 220ms 2200ms bcolz 2.0GB 8ms 8ms 295ms 83ms 587ms 1100ms 2200ms dask 2.0GB N/A N/A 74ms 116ms 131ms 5800ms 7800ms 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50
35T5 Informatics Next dataset… Let's actually do some chemistry
36T5 Informatics Demo/test dataset 3: PubChem
● 10 million compound records loaded as RDKit molecules into PostgreSQL (4.0GB)
● Substructure search index built (3.9GB)
Setup
● PostgreSQL 9.5
● RDKit v2016.09.1
● MacBook Pro 13 (early 2015) with a 3.1GHz Core i7, 16GB of RAM, 500GB SSD
https://www.postgresql.org/
37T5 Informatics Dataset 3: Substructure searches 1
Construct the Murcko scaffolds for 100 random molecules from the pubchem_compound set and use them as substructure queries against the 10 million compounds.
Search times are to retrieve the first 100 results.
38T5 Informatics Dataset 3: Substructure searches 1
Construct the Murcko scaffolds for 100 random molecules from the pubchem_compound set and use them as substructure queries against the 10 million compounds.
Search times are to retrieve the first 100 results.
Zoomed to ignore the 6 queries that took longer than 4 seconds.
39T5 Informatics Aside: some slower substructure queries
Mainly a note to myself to go back and tune the fingerprint for these. The telling number is "Rows Removed by Index Recheck": the slow query's fingerprint screen passes ~62K false positives that the full substructure match then has to reject; the normal one passes only 474.
Example of a slow one:
                                QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.42..419.53 rows=100 width=397) (actual time=78.800..16617.229 rows=100 loops=1)
   ->  Index Scan using molidx on mols  (cost=0.42..41911.42 rows=10000 width=397) (actual time=78.799..16617.169 rows=100 loops=1)
         Index Cond: (m @> 'c1ccc(COc2ccc(Cc3ccccc3)cc2)cc1'::mol)
         Rows Removed by Index Recheck: 62166
 Planning time: 0.064 ms
 Execution time: 16617.718 ms
(6 rows)
More normal example:
                                QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.42..419.53 rows=100 width=397) (actual time=7.690..106.251 rows=100 loops=1)
   ->  Index Scan using molidx on mols  (cost=0.42..41911.42 rows=10000 width=397) (actual time=7.688..106.228 rows=100 loops=1)
         Index Cond: (m @> 'O=C(CNc1ccccc1)Nc1ccccc1'::mol)
         Rows Removed by Index Recheck: 474
 Planning time: 0.060 ms
 Execution time: 106.646 ms
(6 rows)
40T5 Informatics Dataset 3: Substructure searches 2
Do a substructure search with 100 random lead-like molecules from ZINC against the 10 million compounds.
Search times are to retrieve at most the first 100 results.
41T5 Informatics Dataset 3: Substructure searches 2
Do a substructure search with 100 random lead-like molecules from ZINC against the 10 million compounds.
Search times are to retrieve at most the first 100 results.
Zoomed to ignore the 7 queries that took longer than 225ms.
42T5 Informatics Dataset 3: Substructure searches 3
Do a substructure search with 100 random fragment-like molecules from ZINC against the 10 million compounds.
Search times are to retrieve at most the first 100 results.
43T5 Informatics Dataset 3: Substructure searches 3
Do a substructure search with 100 random fragment-like molecules from ZINC against the 10 million compounds.
Search times are to retrieve at most the first 100 results.
Zoomed to ignore the 6 queries that took longer than 1s.
44T5 Informatics Dataset 3: Similarity searches 1
Do a similarity search with 100 random lead-like molecules from ZINC against the 10 million compounds.
Similarity threshold = 0.8, fingerprint: MFP2
Search times are to retrieve at most the 10 most similar results.
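For reference, the quantity behind the similarity threshold above is the Tanimoto coefficient between two bit-vector fingerprints: the number of bits set in both divided by the number set in either. A minimal pure-Python sketch, using plain integers as bit vectors (FPB/chemfp do the same arithmetic on packed byte arrays with hardware popcounts; the function name and toy values here are just illustrative):

```python
# Tanimoto similarity between two fingerprints represented as Python ints,
# where "bit i set" means "feature i present".
def tanimoto(fp1: int, fp2: int) -> float:
    common = bin(fp1 & fp2).count("1")   # bits set in both fingerprints
    union = bin(fp1 | fp2).count("1")    # bits set in either fingerprint
    return common / union if union else 0.0

a = 0b1011_0110
b = 0b1010_0111
print(tanimoto(a, b))  # 4 bits in common / 6 bits in the union = 0.666...
```

A threshold search like the one on the slide simply keeps the targets whose Tanimoto score against the query is at least 0.8.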
45T5 Informatics The right tool for the job
The index type provided by PostgreSQL works really well for substructure searching, but is less effective for similarity queries.
Let's switch from the general tool to something specialized.
46T5 Informatics Dataset 3: Using the FPB format
Do a similarity search with 100 random lead-like molecules from ZINC against the 10 million compounds.
Similarity threshold = 0.8, fingerprint: MFP2
File size: 2.6GB
mean: 90ms / search
http://chemfp.com/
47T5 Informatics Dataset 3: Using chemfp
Do a similarity search with 500 random lead-like molecules from ZINC against the 10 million compounds. Fingerprint: MFP2
● Threshold = 0.8: 2.5 seconds (5ms / search)
● Threshold = 0.7: 4.6 seconds (9ms / search)
● Threshold = 0.5: 9.0 seconds (18ms / search)
The right data structure meets a clever algorithm...
http://chemfp.com/
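One well-known part of the "clever algorithm" behind fast threshold searches of this kind (exploited by tools like chemfp) is popcount pruning: if Tanimoto(A, B) >= T, then the on-bit counts must satisfy T*|A| <= |B| <= |A|/T, so fingerprints stored sorted or binned by popcount can be skipped wholesale without ever being compared. A pure-Python sketch under that assumption; the function names are illustrative, not chemfp's API:

```python
# Popcount bounds for a Tanimoto threshold search: any target whose
# on-bit count falls outside [ceil(T*a), floor(a/T)] cannot possibly
# reach similarity T against a query with a bits set.
import math

def popcount_bounds(query_popcount: int, threshold: float):
    """Range of target popcounts that can still reach the threshold."""
    lo = math.ceil(threshold * query_popcount)
    hi = math.floor(query_popcount / threshold)
    return lo, hi

def candidates_by_popcount(targets, query_popcount, threshold):
    """Cheap pre-filter: keep only targets whose popcount is in range."""
    lo, hi = popcount_bounds(query_popcount, threshold)
    return [fp for fp in targets if lo <= bin(fp).count("1") <= hi]

# Query with 5 bits set, threshold 0.8 -> only popcounts 4..6 survive.
targets = [0b1111_1111, 0b0000_0011, 0b1011_0110, 0b1110_0000]
survivors = candidates_by_popcount(targets, query_popcount=5, threshold=0.8)
print([bin(fp) for fp in survivors])  # only 0b10110110 (popcount 5) remains
```

Lowering the threshold widens the popcount window, which is why the 0.5 searches on the slide take longer per query than the 0.8 ones.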
48T5 Informatics Wrapping up
● One size does not fit all: matching the format and algorithm to the problem at hand is always going to win
● For standard searching and relational queries, good old SQL with indices is pretty hard to beat
● But some of the new(er) column-oriented technologies are very promising, particularly if you are really working on one column at a time
● We shouldn't lose track of the fact that the tools are just a means to an end