T5 Informatics GmbH
greg.landrum@t5informatics.com
@dr_greg_landrum
Big (chemical) data? No Problem!
Storing and searching large amounts of data with
open-source software
Greg Landrum
This work is licensed under a
Creative Commons Attribution 4.0
International License.
Defining our terms
Big data is a term for data sets that are so large or
complex that traditional data processing
applications are inadequate to deal with them.
https://en.wikipedia.org/wiki/Big_data
Defining our terms
Big data is a term for data sets that are so large or complex that traditional
data processing applications are inadequate to deal with them. Challenges
include analysis, capture, data curation, search, sharing, storage, transfer,
visualization, querying, updating and information privacy. The term "big
data" often refers simply to the use of predictive analytics, user behavior
analytics, or certain other advanced data analytics methods that extract
value from data, and seldom to a particular size of data set.[2] "There is little
doubt that the quantities of data now available are indeed large, but that’s
not the most relevant characteristic of this new data ecosystem."[3]
https://en.wikipedia.org/wiki/Big_data
Motivation
● Starting point: we frequently end up working with large collections of
compounds and data about them but we don’t really have standard, efficient,
portable, cross-platform ways of storing/working with that data
● Let’s take a look at a few datasets and use cases for them and see what we
can do
A point of faith
We’ve got lots of tools...
… and there’s not going to be a silver bullet that solves every problem.
One size does not fit all.
We need to match the complexity of the tool to the task at hand
Even the simplest tools can end up being problematic
Looking at the technologies
Criteria:
● Flexibility
● Schema free?
● Size of our data on disk
● Speed of retrieving data
Demo/test dataset 1
Working with PubChem
● 223 million substance records
● 99 million substance synonyms
● 91.5 million compound records
● 207 million associations
Use cases: looking up names and basic chemical information while dealing with the real-world
complexity of that data
Demo/test dataset 1
PubChem
● 223 million substance records (15GB)
○ SID, source, source regnum, xref
● 99 million substance synonyms (5GB)
● 91.5 million compound records (42GB)
○ pubchem_{inchi,inchi_key,iupac_name},rdkit_{smiles,inchi,inchi_key}
● 207 million associations (8GB)
Setup
● PostgreSQL 9.5
● Macbook Pro 13 (early 2015) with a 3.1GHz Core i7, 16GB of RAM, 500GB SSD
https://www.postgresql.org/
Dataset 1: Fuzzy name lookup
I’ve got a bunch of compound names and I’d like to get the info available about
them.
chem_integration=# select cid,sid,assoc_type,lower(synonym),rdkit_smiles from pubchem_substance_assocs join
pubchem_substance_synonyms using (sid) join pubchem_compounds using (cid)
where lower(synonym) like 'chlorprom%';
cid | sid | assoc_type | lower | rdkit_smiles
--------+-----------+------------+------------------------------+------------------------------------------
2726 | 7978926 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21
2726 | 8149253 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21
2726 | 252402355 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21
2726 | 268735291 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21
2726 | 273002747 | 1 | chlorpromazine hydrochloride | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21
165214 | 274768709 | 1 | chlorpromazine sulfone | CN(C)CCCN1c2ccccc2S(=O)(=O)c2ccc(Cl)cc21
(6 rows)
Time: 13.206 ms
That’s a typical result.
Dataset 1: Doing some chemistry
String operations on the database are really fast. Can we use those to do some
chemical queries?
“Find all molecules with a particular formula”
“Find all molecules with a particular formula and connectivity”
Dataset 1: InChI aggregation
Using the structure of InChI we can do chemical aggregation just using string
operations:
● Group by formula:
rdkit_inchi like 'InChI=1S/C22H25N5O/%'
● Group by formula + connectivity
rdkit_inchi like 'InChI=1S/C22H25N5O/c1-16-20(21(23)27(24-16)19-6-4-3-5-7-19)17-8-10-18(11-9-17)22(28)26-14-12-25(2)13-15-26/%'
Dataset 1: InChI aggregation
Do that aggregation across pubchem_compound for 1000 random InChIs drawn from it
Results for group by formula:
Dataset 1: InChI aggregation
Do that aggregation across pubchem_compound for 1000 random InChIs drawn from it
Results for group by connectivity:
Aside: examples of duplicate connectivity
Next dataset...
Demo/test dataset 2
Base dataset: 14.9 million compounds from ZINC15
● MW and logp: all tranches
● Reactivity: tranche A ("anodyne")
● Availability: tranche A ("in-stock")
For this example, 4 million of these were randomly selected.
Data fields: label, smiles, molpkl, num_atoms, num_heavy_atoms, num_rotatable,
num_rings, tpsa, mollogp, molwt, Patternfp (#on bits, fpdata), Morganfp2 (#on bits, fpdata)
Data generated using the RDKit
Total dataset size (stored as a Python binary pickle): ~3.5GB
http://zinc15.docking.org/tranches/home/
Benchmarking setup
For data set 2:
Dell XPS desktop with a 3.6GHz i7-4790 CPU, 16GB RAM, standard HD
Ubuntu 16.04
Python 3.5.1 (anaconda python)
Demo/test dataset 2
Queries:
● Q1: Retrieve num_atoms for 50K rows
● Q2: Retrieve on_bit_count for 50K fingerprints
● Q3: Count number of mols with num_atoms>25
● Q4: Count number of fingerprints with on_bit_count=50
● Q5: Count number of fingerprints with on_bit_count between 40 and 50
● Q6: Retrieve fingerprints with on_bit_count=50
● Q7: Retrieve fingerprints with on_bit_count between 40 and 50
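The timings on the following slides are wall-clock numbers; a minimal harness along these lines (the helper name is mine, and the actual benchmark code may differ) is enough to reproduce the pattern:

```python
import time

def best_of(fn, repeats=3):
    """Run fn several times and return the best wall-clock time in milliseconds."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best
```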
Baseline: Reading from Python
Just read everything and unpickle it.
Dataset size: 3.5GB
● Q1: 120ms
● Q2: same
● Q3: 10.2s
● Q4: same
● Q5: same
● Q6: same
● Q7: same
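There's no index of any kind here, so every query is a full pass over the unpickled records. A minimal sketch of Q3 as that kind of brute-force scan (the file layout and the position of num_atoms in each record are illustrative assumptions, not the talk's actual code):

```python
import pickle

def count_num_atoms_gt(path, threshold=25, field=3):
    """Q3, brute force: unpickle the whole dataset, then filter in Python."""
    with open(path, 'rb') as f:
        records = pickle.load(f)
    return sum(1 for rec in records if rec[field] > threshold)
```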
Tech 1: MessagePack
"It's like JSON, but fast and small"
Binary format, simple to read and write from multiple languages
Now being used by the PDB
Readers and writers from many, many languages
Very flexible, schema free
https://github.com/msgpack/msgpack-python
MessagePack performance
Data stored as tuples
Dataset size: 3.5GB
● Q1: 26ms
● Q2: 26ms
● Q3: 2.1s
● Q4: same
● Q5: same
● Q6: same
● Q7: same
Tech 2: FlatBuffers
Cross platform serialization library
Binary format, simple to read and write from
multiple languages
Flexible hierarchical schema
http://google.github.io/flatbuffers/index.html
namespace storage_formats;

table Fingerprint {
  on_bit_count:ushort;
  bytes:[ubyte];
}

table Molecule {
  smiles:string;
  name:string;
  pkl:[ubyte];
  num_atoms:ushort;
  num_heavy_atoms:ushort;
  pattern_fp:Fingerprint;
  morgan2_fp:Fingerprint;
  num_rotatable_bonds:ushort;
  num_rings:ushort;
  tpsa:double;
  mollogp:double;
  molwt:double;
}

root_type Molecule;
FlatBuffers performance
Using a C++ reader
Dataset size: 3.9GB
● Q1: 42ms
● Q2: 26ms
● Q3: 1.2s
● Q4: same
● Q5: same
● Q6: same
● Q7: same
Tech 3: Pandas
"Standard" Python data frame
Powerful data manipulation and aggregation
Very extensible (excellent RDKit integration)
Not suitable for this project because the entire data frame needs to be in memory.
http://pandas.pydata.org/
Tech 4: SQLite
Open-source SQL database
File based, no server required
Extremely flexible
Basic RDKit integration available
Connectors available from many, many languages
Requires a schema
https://sqlite.org/
SQLite performance
Dataset size: 4.6GB (without index)
● Q1: 30ms
● Q2: 30ms
● Q3: 1.4s (with index 13ms)
● Q4: 1.0s (with index 5ms)
● Q5: 1.0s (with index 54ms)
● Q6: 1.1s (with index 199ms)
● Q7: 2.8s (with index 2.2s)
Q5: select count(*) from morganfps where on_bit_count>40 and on_bit_count<50;
Q7: select pkl from morganfps where on_bit_count>40 and on_bit_count<50;
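These numbers are easy to reproduce with Python's built-in sqlite3 module. A toy sketch using the morganfps table from the queries above (the column names match those queries; the rest of the schema and the data are synthetic):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('create table morganfps (molid integer, on_bit_count integer, pkl blob)')
con.executemany('insert into morganfps values (?,?,?)',
                [(i, 40 + i % 20, b'\x00') for i in range(1000)])

# Without an index, these are full table scans:
q5 = con.execute('select count(*) from morganfps '
                 'where on_bit_count>40 and on_bit_count<50').fetchone()[0]

# The "with index" timings just add an index on the queried column:
con.execute('create index obc_idx on morganfps (on_bit_count)')
q4 = con.execute('select count(*) from morganfps '
                 'where on_bit_count=50').fetchone()[0]
```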
Tech 5: PostgreSQL
Open-source SQL database
Uses a server
Extremely flexible and extensible
Strong RDKit integration
Connectors available from many, many languages
Requires a schema
https://www.postgresql.org/
PostgreSQL performance
Dataset size: 4.0GB (4.3GB with indices)
● Q1: 23ms
● Q2: 102ms
● Q3: 697ms (215ms with index)
● Q4: 432ms (58ms with index)
● Q5: 504ms (182ms with index)
● Q6: 608ms (220ms with index)
● Q7: 2.4s (2.2s with index)
Q5: select count(*) from morganfps where on_bit_count>40 and on_bit_count<50;
Q7: select pkl from morganfps where on_bit_count>40 and on_bit_count<50;
Tech 6: bcolz
Columnar data format for Python
Compressed on disk and/or in memory
Provides a similar API to numpy
Pretty good querying primitives
Requires a schema
https://github.com/Blosc/bcolz
dtype=[('zincid','S16'),
('smiles','S256'),
('pkl','S512'),
('pfp_obc','u4'),
('pfp_pkl','S256'),
('mfp2_obc','u4'),
('mfp2_pkl','S256'),
('num_atoms','u4'),
('num_heavy_atoms','u4'),
('num_rotatable_bonds','u4'),
('num_rings','u4'),
('tpsa','f8'),
('mollogp','f8'),
('molwt','f8')]
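Much of bcolz's speed on the count queries comes from the columnar layout itself: a query only reads the columns it mentions, stored contiguously, instead of deserializing whole records (bcolz adds chunking and compression on top of this). A stdlib-only toy illustration of the idea, not bcolz itself:

```python
from array import array

# One column stored contiguously as unsigned 32-bit ints ('u4' in the dtype above)
num_atoms = array('I', [10, 30, 27, 5, 26])

# A Q3-style scan touches only this column, never the SMILES/pickle/fingerprint fields
count = sum(1 for n in num_atoms if n > 25)
```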
Bcolz performance
Dataset size: 2.0GB
● Q1: 8ms
● Q2: 8ms
● Q3: 295ms
● Q4: 83ms
● Q5: 587ms
● Q6: 1.1s
● Q7: 2.2s
Q4: len([x for x in tbl.where("mfp2_obc==50", outcols="mfp2_obc")])
Q7: [x for x in tbl.where("(mfp2_obc>40) & (mfp2_obc<50)", outcols="mfp2_pkl")]
Tech 7: dask
"Flexible parallel computing library for analytic computing"
Does *way* more than what I'm using it for here
Provides, among other things, a parallel Pandas-like interface to bcolz data
http://dask.pydata.org/
dask performance
Dataset size: 2.0GB (uses bcolz data)
● Q1: N/A
● Q2: N/A
● Q3: 74ms
● Q4: 116ms
● Q5: 131ms
● Q6: 5.8s
● Q7: 7.8s
Q4: len(df.zincid[df.mfp2_obc==50])
Q7: df.mfp2_pkl[df.mfp2_obc.between(40,50,inclusive=False)]
Things I didn't look at (yet)
HDF5: well-established hierarchical binary data format
Arrow: new (and rapidly evolving) column-oriented format
Parquet: Columnar data store for Hadoop
Impala: massively parallel SQL search on top of Hadoop (from Cloudera)
MonetDB: Open-source column-oriented database
Summary of all that
Technology Size Q1 Q2 Q3 Q4 Q5 Q6 Q7
Raw Python 3.5GB 120ms 120ms 10200ms 10200ms 10200ms 10200ms 10200ms
MessagePack 3.5GB 26ms 25ms 2100ms 2100ms 2100ms 2100ms 2100ms
FlatBuffers 3.9GB 42ms 26ms 1200ms 1200ms 1200ms 1200ms 1200ms
SQLite 4.6GB 30ms 30ms 13ms 5ms 54ms 199ms 2200ms
PostgreSQL 4.3GB 23ms 102ms 215ms 58ms 182ms 220ms 2200ms
bcolz 2.0GB 8ms 8ms 295ms 83ms 587ms 1100ms 2200ms
dask 2.0GB N/A N/A 74ms 116ms 131ms 5800ms 7800ms
Next dataset…
Let’s actually do some chemistry
Demo/test dataset 3
PubChem
● 10 million compound records loaded as RDKit molecules into
PostgreSQL (4.0GB)
● Substructure search index built (3.9GB)
Setup
● PostgreSQL 9.5
● RDKit v2016.09.1
● Macbook Pro 13 (early 2015) with a 3.1GHz Core i7, 16GB of RAM,
500GB SSD
https://www.postgresql.org/
Dataset 3: Substructure searches 1
Construct the Murcko Scaffolds for 100 random molecules from the
pubchem_compound set and use them as substructure queries against the 10
million compounds
Search times are to
retrieve the first 100
results.
Dataset 3: Substructure searches 1
Construct the Murcko Scaffolds for 100 random molecules from the
pubchem_compound set and use them as substructure queries against the 10
million compounds
Zoomed to ignore the 6 queries that took longer
than 4 seconds.
Search times are to
retrieve the first 100
results.
Aside: some slower substructure queries
Mainly a note to myself to go back and tune the fingerprint for these
Example of a slow one:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..419.53 rows=100 width=397) (actual time=78.800..16617.229 rows=100 loops=1)
-> Index Scan using molidx on mols (cost=0.42..41911.42 rows=10000 width=397) (actual time=78.799..16617.169 rows=100 loops=1)
Index Cond: (m @> 'c1ccc(COc2ccc(Cc3ccccc3)cc2)cc1'::mol)
Rows Removed by Index Recheck: 62166
Planning time: 0.064 ms
Execution time: 16617.718 ms
(6 rows)
More normal example:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..419.53 rows=100 width=397) (actual time=7.690..106.251 rows=100 loops=1)
-> Index Scan using molidx on mols (cost=0.42..41911.42 rows=10000 width=397) (actual time=7.688..106.228 rows=100 loops=1)
Index Cond: (m @> 'O=C(CNc1ccccc1)Nc1ccccc1'::mol)
Rows Removed by Index Recheck: 474
Planning time: 0.060 ms
Execution time: 106.646 ms
(6 rows)
Dataset 3: Substructure searches 2
Do a substructure search with 100 random lead-like molecules from ZINC against
the 10 million compounds
Search times are to
retrieve at most the first
100 results.
Dataset 3: Substructure searches 2
Do a substructure search with 100 random lead-like molecules from ZINC against
the 10 million compounds
Search times are to
retrieve at most the first
100 results.
Zoomed to ignore the 7 queries that took longer
than 225ms.
Dataset 3: Substructure searches 3
Do a substructure search with 100 random fragment-like molecules from ZINC
against the 10 million compounds
Search times are to
retrieve at most the first
100 results.
Dataset 3: Substructure searches 3
Do a substructure search with 100 random fragment-like molecules from ZINC
against the 10 million compounds
Search times are to
retrieve at most the first
100 results.
Zoomed to ignore the 6 queries that took longer
than 1s.
Dataset 3: Similarity searches 1
Do a similarity search with 100 random lead-like molecules from ZINC against the
10 million compounds. Similarity threshold = 0.8, fingerprint MFP2
Search times are to
retrieve at most the 10
most similar results
The right tool for the job
The index type provided by PostgreSQL works really well for substructure
searching, but is less effective for similarity queries.
Let’s switch from the general tool to something specialized
Dataset 3: Using the FPB format
Do a similarity search with 100 random lead-like molecules from ZINC against the
10 million compounds. Similarity threshold = 0.8, fingerprint MFP2
File size: 2.6GB
http://chemfp.com/
mean: 90ms / search
Dataset 3: Using chemfp
Do a similarity search with 500 random lead-like molecules from ZINC against the
10 million compounds. Fingerprint MFP2
● Threshold = 0.8: 2.5 seconds (5ms / search)
● Threshold = 0.7: 4.6 seconds (9ms / search)
● Threshold = 0.5: 9.0 seconds (18ms / search)
http://chemfp.com/
The right data structure meets a clever algorithm...
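One of chemfp's tricks is to keep fingerprints grouped by popcount: since Tanimoto(A, B) can never exceed min(|A|, |B|) / max(|A|, |B|), a threshold search can skip whole popcount bands without computing a single similarity. A stdlib-only sketch of the idea, treating integers as bit vectors (this is not chemfp's actual implementation):

```python
def popcount(x):
    """Number of set bits in an integer fingerprint."""
    return bin(x).count('1')

def tanimoto(a, b):
    return popcount(a & b) / popcount(a | b)

def threshold_search(query, fps_by_popcount, threshold):
    """fps_by_popcount maps a popcount to the list of fingerprints with that popcount."""
    q = popcount(query)
    hits = []
    for n, fps in fps_by_popcount.items():
        # Upper bound on Tanimoto for this band; skip bands that can't reach the threshold
        if min(n, q) / max(n, q) < threshold:
            continue
        hits.extend(fp for fp in fps if tanimoto(query, fp) >= threshold)
    return hits
```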
Wrapping up
One size does not fit all: matching the format and algorithm to the problem at hand
is always going to win
For standard searching and relational queries, good old SQL with indices is pretty
hard to beat
But some of the newer column-oriented technologies are very promising,
particularly if you really are working on one column at a time.
We shouldn't lose track of the fact that the tools are just a means to an end

Big (chemical) data? No Problem!

  • 1.
    T5 Informatics GmbH greg.landrum@t5informatics.com @dr_greg_landrum Big(chemical) data? No Problem! Storing and searching large amounts of data with open-source software Greg Landrum This work is licensed under a Creative Commons Attribution 4.0 International License.
  • 2.
    2T5 Informatics Defining ourterms Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. https://en.wikipedia.org/wiki/Big_data
  • 3.
    3T5 Informatics Defining ourterms Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.[2] "There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem."[3] https://en.wikipedia.org/wiki/Big_data
  • 4.
    4T5 Informatics Motivation ● Startingpoint: we frequently end up working with large collections of compounds and data about them but we don’t really have standard, efficient, portable, cross-platform ways of storing/working with that data ● Let’s take a look at a few datasets and use cases for them and see what we can do
  • 5.
    5T5 Informatics A pointof faith We’ve got lots of tools... …. and there’s not going to be a silver bullet that solves every problem. One size does not fit all. We need to match the complexity of the tool to the task at hand Even the simplest tools can end up being problematic ğ
  • 6.
    6T5 Informatics Looking atthe technologies Criteria: ● Flexibility ● Schema free? ● Size of our data on disk ● Speed of retrieving data
  • 7.
    7T5 Informatics Demo/test dataset1 Working with PubChem ● 223 million substance records ● 99 millon substance synonyms ● 91.5 million compound records ● 207 million associations Use cases: looking up names and basic chemical information, but dealing with the real-world complexity of that
  • 8.
    8T5 Informatics Demo/test dataset1 PubChem ● 223 million substance records (15GB) ○ SID, source, source regnum, xref ● 99 millon substance synonyms (5GB) ● 91.5 million compound records (42GB) ○ pubchem_{inchi,inchi_key,iupac_name},rdkit_{smiles,inchi,inchi_key} ● 207 million associations (8GB) Setup ● PostgreSQL 9.5 ● Macbook Pro 13 (early 2015) with a 3.1GHz Core i7, 16MB of RAM, 500GB SSD https://www.postgresql.org/
  • 9.
    9T5 Informatics Dataset 1:Fuzzy name lookup I’ve got a bunch of compound names and I’d like to get the info available about them. chem_integration=# select cid,sid,assoc_type,lower(synonym),rdkit_smiles from pubchem_substance_assocs join pubchem_substance_synonyms using (sid) join pubchem_compounds using (cid) where lower(synonym) like 'chlorprom%'; cid | sid | assoc_type | lower | rdkit_smiles --------+-----------+------------+------------------------------+------------------------------------------ 2726 | 7978926 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21 2726 | 8149253 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21 2726 | 252402355 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21 2726 | 268735291 | 1 | chlorpromazine | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21 2726 | 273002747 | 1 | chlorpromazine hydrochloride | CN(C)CCCN1c2ccccc2Sc2ccc(Cl)cc21 165214 | 274768709 | 1 | chlorpromazine sulfone | CN(C)CCCN1c2ccccc2S(=O)(=O)c2ccc(Cl)cc21 (6 rows) Time: 13.206 ms That’s a typical result.
  • 10.
    10T5 Informatics Dataset 1:Doing some chemistry String operations on the database are really fast. Can we use those to do some chemical queries? “Find all molecules with a particular formula” “Find all molecules with a particular formula and connectivity”
  • 11.
    11T5 Informatics Dataset 1:InChI aggregation Using the structure of InChI we can do chemical aggregation just using string operations: ● Group by formula: rdkit_inchi like 'InChI=1S/C22H25N5O/%' ● Group by formula + connectivity rdkit_inchi like 'InChI=1S/C22H25N5O/c1-16-20(21(23)27(24-16)19-6-4-3-5-7-19)17-8-10-18(11-9- 17)22(28)26-14-12-25(2)13-15-26/%'
  • 12.
    12T5 Informatics Dataset 1:InChI aggregation Do that aggregation across pubchem_compound for 1000 random InChIs from pubchem_compound Results for group by formula:
  • 13.
    13T5 Informatics Dataset 1:InChI aggregation Do that aggregation across pubchem_compound for 1000 random InChIs from pubchem_compound Results for group by connectivity:
  • 14.
    14T5 Informatics Aside: examplesof duplicate connectivity
  • 15.
  • 16.
    16T5 Informatics Demo/test dataset2 Base dataset: 14.9 million compounds from ZINC15 ● MW and logp: all tranches ● Reactivity: tranche A ("anodyne") ● Availability: tranche A ("in-stock") For this example, 4 million of these were randomly selected. Data fields: label, smiles, molpkl, num_atoms, num_heavy_atoms, num_rotatable, num_rings, tpsa, mollogp, molwt, Patternfp (#on bits, fpdata), Morganfp2 (#on bits, fpdata) Data generated using the RDKit Total dataset size (stored as python binary pickle): ~3.5GB http://zinc15.docking.org/tranches/home/
  • 17.
    17T5 Informatics Benchmarking setup Fordata set 2: Dell XPS desktop with a 3.6GHz i7-4790 CPU, 16GB RAM, standard HD Ubuntu 16.04 Python 3.5.1 (anaconda python)
  • 18.
    18T5 Informatics Demo/test dataset2 Queries: ● Q1: Retrieve num_atoms for 50K rows ● Q2: Retrieve on_bit_count for 50K fingerprints ● Q3: Count number of mols with num_atoms>25 ● Q4: Count number of fingerprints with on_bit_count=50 ● Q5: Count number of fingerprints with on_bit_count between 40 and 50 ● Q6: Retrieve fingerprints with on_bit_count=50 ● Q7: Retrieve fingerprints with on_bit_count between 40 and 50
  • 19.
    19T5 Informatics Baseline: Readingfrom Python Just read everything and unpickle it. Dataset size: 3.5GB ● Q1: 120ms ● Q2: same ● Q3: 10.2s ● Q4: same ● Q5: same ● Q6: same ● Q7: same 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50
  • 20.
    20T5 Informatics Tech 1:MessagePack "It's like JSON, but fast and small" Binary format, simple to read and write from multiple languages Now being used by the PDB Readers and writers from many, many languages Very flexible, schema free https://github.com/msgpack/msgpack-python
  • 21.
    21T5 Informatics MessagePack performance Datastored as tuples Dataset size: 3.5GB ● Q1: 26ms ● Q2: 26ms ● Q3: 2.1s ● Q4: same ● Q5: same ● Q6: same ● Q7: same 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50
  • 22.
    22T5 Informatics Tech 2:Flat buffers Cross platform serialization library Binary format, simple to read and write from multiple languages Flexible hierarchical schema http://google.github.io/flatbuffers/index.html namespace storage_formats; table Fingerprint { on_bit_count:ushort; bytes:[ubyte]; } table Molecule { smiles:string; name:string; pkl:[ubyte]; num_atoms:ushort; num_heavy_atoms:ushort; pattern_fp:Fingerprint; morgan2_fp:Fingerprint; num_rotatable_bonds:ushort; num_rings:ushort; tpsa:double; mollogp:double; molwt:double; } root_type Molecule;
  • 23.
    23T5 Informatics FlatBuffers performance Usinga C++ reader Dataset size: 3.9GB ● Q1: 42ms ● Q2: 26ms ● Q3: 1.2s ● Q4: same ● Q5: same ● Q6: same ● Q7: same 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50
  • 24.
    24T5 Informatics Tech 3:Pandas "Standard" Python data frame Powerful data manipulation and aggregation Very extensible (excellent RDKit integration) Not suitable for this project because the entire data frame needs to be in memory. http://pandas.pydata.org/
  • 25.
    25T5 Informatics Tech 4:SQLite Open-source SQL database File based, no server required Extremely flexible Basic RDKit integration available Connectors available from many, many languages Requires a schema https://sqlite.org/
  • 26.
    26T5 Informatics SQLite performance Datasetsize: 4.6GB (without index) ● Q1: 30ms ● Q2: 30ms ● Q3: 1.4s (with index 13ms) ● Q4: 1.0s (with index 5ms) ● Q5: 1.0s (with index 54ms) ● Q6: 1.1s (with index 199ms) ● Q7: 2.8s (with index 2.2s) 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50 Q5: select count(*) from morganfps where on_bit_count>40 and on_bit_count<50; Q6: select pkl from morganfps where on_bit_count>40 and on_bit_count<50;
  • 27.
    27T5 Informatics Tech 5:PostgreSQL Open-source SQL database Uses a server Extremely flexible and extensible Strong RDKit integration Connectors available from many, many languages Requires a schema https://www.postgresql.org/
  • 28.
    28T5 Informatics PostgreSQL performance Datasetsize: 4.0GB (4.3GB with indices) ● Q1: 23ms ● Q2: 102ms ● Q3: 697ms (215ms with index) ● Q4: 432ms (58ms with index) ● Q5: 504ms (182ms with index) ● Q6: 608ms (220ms with index) ● Q7: 2.4s (2.2s with index) 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50 Q5: select count(*) from morganfps where on_bit_count>40 and on_bit_count<50; Q6: select pkl from morganfps where on_bit_count>40 and on_bit_count<50;
  • 29.
    29T5 Informatics Tech 6:bcolz Columnar data format for Python Compressed on disk and/or in memory Provides a similar API to numpy Pretty good querying primitives Requires a schema https://github.com/Blosc/bcolz dtype=[('zincid','S16'), ('smiles','S256'), ('pkl','S512'), ('pfp_obc','u4'), ('pfp_pkl','S256'), ('mfp2_obc','u4'), ('mfp2_pkl','S256'), ('num_atoms','u4'), ('num_heavy_atoms','u4'), ('num_rotatable_bonds','u4'), ('num_rings','u4'), ('tpsa','f8'), ('mollogp','f8'), ('molwt','f8')]
  • 30.
    30T5 Informatics Bcolz performance Datasetsize: 2.0GB ● Q1: 8ms ● Q2: 8ms ● Q3: 295ms ● Q4: 83ms ● Q5: 587ms ● Q6: 1.1s ● Q7: 2.2s 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50 Q5: len([x for x in tbl.where("mfp2_obc==50", outcols="mfp2_obc")]) Q6: [x for x in tbl.where("(mfp2_obc>40) & (mfp2_obc<50)", outcols="mfp2_pkl")]
  • 31.
    31T5 Informatics Tech 7:dask "Flexible parallel computing library for analytic computing" Does *way* more than what I'm using it for here Provides, among other things, a parallel Pandas-like interface to bcolz data http://dask.pydata.org/
  • 32.
    32T5 Informatics dask performance Datasetsize: 2.0GB (uses bcolz data) ● Q1: N/A ● Q2: N/A ● Q3: 74ms ● Q4: 116ms ● Q5: 131ms ● Q6: 5.8s ● Q7: 7.8s 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50 Q5: len(df.zincid[df.mfp2_obc==50]) Q6: df.mfp2_pkl[df.mfp2_obc.between(40,50,inclusive=False)]
  • 33.
    33T5 Informatics Things Ididn't look at (yet) HDF5: well-established hierarchical binary data format Arrow: new (and rapidly evolving) column-oriented format Parquet: Columnar data store for Hadoop Impala: massively parallel SQL search on top of Hadoop (from Cloudera) MonetDb: Open-source column-oriented database
  • 34.
    34T5 Informatics Summary ofall that Technology Size Q1 Q2 Q3 Q4 Q5 Q6 Q7 Raw Python 3.5GB 120ms 120ms 10200ms 10200ms 10200ms 10200ms 10200ms MessagePack 3.5GB 26ms 25ms 2100ms 2100ms 2100ms 2100ms 2100ms FlatBuffers 3.9GB 42ms 26ms 1200ms 1200ms 1200ms 1200ms 1200ms SQLite 4.6GB 30ms 30ms 13ms 5ms 54ms 199ms 2200ms PostgreSQL 4.3GB 23ms 102ms 215ms 58ms 182ms 220ms 2200ms bcolz 2.0GB 8ms 8ms 295ms 83ms 587ms 1100ms 2200ms dask 2.0GB N/A N/A 74ms 116ms 131ms 5800ms 7800ms 1. Retrieve num_atoms for 50K rows 2. Retrieve on_bit_count for 50K fingerprints 3. Count number of mols with num_atoms>25 4. Count number of fingerprints with on_bit_count=50 5. Count number of fingerprints with on_bit_count between 40 and 50 6. Retrieve fingerprints with on_bit_count=50 7. Retrieve fingerprints with on_bit_count between 40 and 50
35T5 Informatics Next dataset… Let's actually do some chemistry
36T5 Informatics Demo/test dataset 3: PubChem
● 10 million compound records loaded as RDKit molecules into PostgreSQL (4.0GB)
● Substructure search index built (3.9GB)
Setup
● PostgreSQL 9.5
● RDKit v2016.09.1
● MacBook Pro 13 (early 2015) with a 3.1GHz Core i7, 16GB of RAM, 500GB SSD
https://www.postgresql.org/
37T5 Informatics Dataset 3: Substructure searches 1
Construct the Murcko scaffolds for 100 random molecules from the pubchem_compound set and use them as substructure queries against the 10 million compounds.
Search times are to retrieve the first 100 results.
38T5 Informatics Dataset 3: Substructure searches 1
Construct the Murcko scaffolds for 100 random molecules from the pubchem_compound set and use them as substructure queries against the 10 million compounds.
Search times are to retrieve the first 100 results.
Zoomed to ignore the 6 queries that took longer than 4 seconds.
39T5 Informatics Aside: some slower substructure queries
Mainly a note to myself to go back and tune the fingerprint for these. The telling number is "Rows Removed by Index Recheck": the slow query's fingerprint screen passes ~62K false positives that the full substructure match then has to reject; the normal one passes only 474.
Example of a slow one:
                                QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.42..419.53 rows=100 width=397) (actual time=78.800..16617.229 rows=100 loops=1)
   ->  Index Scan using molidx on mols  (cost=0.42..41911.42 rows=10000 width=397) (actual time=78.799..16617.169 rows=100 loops=1)
         Index Cond: (m @> 'c1ccc(COc2ccc(Cc3ccccc3)cc2)cc1'::mol)
         Rows Removed by Index Recheck: 62166
 Planning time: 0.064 ms
 Execution time: 16617.718 ms
(6 rows)
More normal example:
                                QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.42..419.53 rows=100 width=397) (actual time=7.690..106.251 rows=100 loops=1)
   ->  Index Scan using molidx on mols  (cost=0.42..41911.42 rows=10000 width=397) (actual time=7.688..106.228 rows=100 loops=1)
         Index Cond: (m @> 'O=C(CNc1ccccc1)Nc1ccccc1'::mol)
         Rows Removed by Index Recheck: 474
 Planning time: 0.060 ms
 Execution time: 106.646 ms
(6 rows)
40T5 Informatics Dataset 3: Substructure searches 2
Do a substructure search with 100 random lead-like molecules from ZINC against the 10 million compounds.
Search times are to retrieve at most the first 100 results.
41T5 Informatics Dataset 3: Substructure searches 2
Do a substructure search with 100 random lead-like molecules from ZINC against the 10 million compounds.
Search times are to retrieve at most the first 100 results.
Zoomed to ignore the 7 queries that took longer than 225ms.
42T5 Informatics Dataset 3: Substructure searches 3
Do a substructure search with 100 random fragment-like molecules from ZINC against the 10 million compounds.
Search times are to retrieve at most the first 100 results.
43T5 Informatics Dataset 3: Substructure searches 3
Do a substructure search with 100 random fragment-like molecules from ZINC against the 10 million compounds.
Search times are to retrieve at most the first 100 results.
Zoomed to ignore the 6 queries that took longer than 1s.
44T5 Informatics Dataset 3: Similarity searches 1
Do a similarity search with 100 random lead-like molecules from ZINC against the 10 million compounds.
Similarity threshold = 0.8, fingerprint: MFP2
Search times are to retrieve at most the 10 most similar results.
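For reference, the quantity behind the similarity threshold above is the Tanimoto coefficient between two bit-vector fingerprints: the number of bits set in both divided by the number set in either. A minimal pure-Python sketch, using plain integers as bit vectors (FPB/chemfp do the same arithmetic on packed byte arrays with hardware popcounts; the function name and toy values here are just illustrative):

```python
# Tanimoto similarity between two fingerprints represented as Python ints,
# where "bit i set" means "feature i present".
def tanimoto(fp1: int, fp2: int) -> float:
    common = bin(fp1 & fp2).count("1")   # bits set in both fingerprints
    union = bin(fp1 | fp2).count("1")    # bits set in either fingerprint
    return common / union if union else 0.0

a = 0b1011_0110
b = 0b1010_0111
print(tanimoto(a, b))  # 4 bits in common / 6 bits in the union = 0.666...
```

A threshold search like the one on the slide simply keeps the targets whose Tanimoto score against the query is at least 0.8.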
45T5 Informatics The right tool for the job
The index type provided by PostgreSQL works really well for substructure searching, but is less effective for similarity queries.
Let's switch from the general tool to something specialized.
46T5 Informatics Dataset 3: Using the FPB format
Do a similarity search with 100 random lead-like molecules from ZINC against the 10 million compounds.
Similarity threshold = 0.8, fingerprint: MFP2
File size: 2.6GB
mean: 90ms / search
http://chemfp.com/
47T5 Informatics Dataset 3: Using chemfp
Do a similarity search with 500 random lead-like molecules from ZINC against the 10 million compounds. Fingerprint: MFP2
● Threshold = 0.8: 2.5 seconds (5ms / search)
● Threshold = 0.7: 4.6 seconds (9ms / search)
● Threshold = 0.5: 9.0 seconds (18ms / search)
The right data structure meets a clever algorithm...
http://chemfp.com/
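One well-known part of the "clever algorithm" behind fast threshold searches of this kind (exploited by tools like chemfp) is popcount pruning: if Tanimoto(A, B) >= T, then the on-bit counts must satisfy T*|A| <= |B| <= |A|/T, so fingerprints stored sorted or binned by popcount can be skipped wholesale without ever being compared. A pure-Python sketch under that assumption; the function names are illustrative, not chemfp's API:

```python
# Popcount bounds for a Tanimoto threshold search: any target whose
# on-bit count falls outside [ceil(T*a), floor(a/T)] cannot possibly
# reach similarity T against a query with a bits set.
import math

def popcount_bounds(query_popcount: int, threshold: float):
    """Range of target popcounts that can still reach the threshold."""
    lo = math.ceil(threshold * query_popcount)
    hi = math.floor(query_popcount / threshold)
    return lo, hi

def candidates_by_popcount(targets, query_popcount, threshold):
    """Cheap pre-filter: keep only targets whose popcount is in range."""
    lo, hi = popcount_bounds(query_popcount, threshold)
    return [fp for fp in targets if lo <= bin(fp).count("1") <= hi]

# Query with 5 bits set, threshold 0.8 -> only popcounts 4..6 survive.
targets = [0b1111_1111, 0b0000_0011, 0b1011_0110, 0b1110_0000]
survivors = candidates_by_popcount(targets, query_popcount=5, threshold=0.8)
print([bin(fp) for fp in survivors])  # only 0b10110110 (popcount 5) remains
```

Lowering the threshold widens the popcount window, which is why the 0.5 searches on the slide take longer per query than the 0.8 ones.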
48T5 Informatics Wrapping up
● One size does not fit all: matching the format and algorithm to the problem at hand is always going to win
● For standard searching and relational queries, good old SQL with indices is pretty hard to beat
● But some of the new(er) column-oriented technologies are very promising, particularly if you are really working on one column at a time
● We shouldn't lose track of the fact that the tools are just a means to an end