With the unprecedented growth of chemical databases, which now incorporate up to several hundred billion synthetically feasible chemicals, modelers face no shortage of chemicals to process. Such "Big Chemical Data" offers major opportunities for discovering novel bioactive molecules. However, the current generation of cheminformatics software tools cannot handle, characterize, and process such extremely large chemical libraries. In this presentation, we discuss the rationale and the main theoretical and technical challenges of screening very large repositories of compounds in the current context of drug discovery. We present several proof-of-concept studies on screening extremely large libraries (1+ billion compounds) with our novel GPU-accelerated cheminformatics platform to identify molecules with defined bioactivity. Overall, we show that GPU computing is an effective and inexpensive architecture for developing, employing, and validating a new generation of cheminformatics methods and tools ready to process billions of compounds.
2. Outline
o “Big Data” in the chemistry world
o Sources
o Challenges
o Our vision for GPU accelerated cheminformatics workflow
o Benchmarks & Case studies
o Descriptor calculations
o Similarity
o Predictive modeling
3. Data – Knowledge Gap
Drowning in Data but starving for Knowledge
Tremendous opportunities for discovery of new drugs / materials
5. Decline in Pharmaceutical R&D efficiency
The cost of developing a new drug (~$2B) roughly doubles every nine years.
Scannell et al., Nature Reviews Drug Discovery, 2012, 11, 191-200
Need for novel approaches that
(i) Fully exploit the potential of modern chemical and biological data streams;
(ii) Reliably forecast compounds’ bioactivity and safety profiles;
(iii) Accelerate the translation from basic research to drug candidates.
6. Quantitative Structure-Activity Relationships (QSAR)
[Figure: COMPOUNDS (drawn structures) listed with their ACTIVITY values and DESCRIPTORS]
Thousands of molecular descriptors are available for organic compounds: constitutional, topological, structural, quantum mechanics based, fragmental, steric, pharmacophoric, geometrical, thermodynamical, conformational, etc.
- Building of models using machine learning methods (NN, SVM, RF)
- Validation of models according to numerous statistical procedures, and their applicability domains.
7. Descriptor matrix
Samples        Features (descriptors)
(compounds)    X1     X2     ...    Xm
1              X11    X12    ...    X1m
2              X21    X22    ...    X2m
...            ...    ...    ...    ...
n              Xn1    Xn2    ...    Xnm
Each compound i also has an ACTIVITY (i) value.
External predictive power of QSAR models is critical to enable their application to virtual screening.
Technically challenging to compute molecular properties and descriptors for >10^9 compounds.
No cheminformatics architecture is able to screen >10^9 compounds.
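To make the >10^9 scale concrete, here is a rough storage estimate for a dense descriptor matrix. It is a back-of-the-envelope sketch only: the descriptor count (1,000), the 4-byte value width, and the 1024-bit fingerprint width are illustrative assumptions, not figures reported for the platform.

```python
# Back-of-the-envelope storage estimate for an n x m descriptor matrix.
# All sizes below are assumptions chosen for illustration.

def matrix_bytes(n_compounds: int, n_descriptors: int, bytes_per_value: int) -> int:
    """Size in bytes of a dense n x m matrix."""
    return n_compounds * n_descriptors * bytes_per_value

N = 10**9      # one billion compounds
M = 1000       # assumed: 1,000 descriptors per compound
FLOAT32 = 4    # assumed: 4-byte floating-point values

dense = matrix_bytes(N, M, FLOAT32)
print(f"dense float32 matrix: {dense / 1e12:.1f} TB")

# A 1024-bit binary fingerprint needs 1024 / 8 = 128 bytes per compound.
fingerprints = N * (1024 // 8)
print(f"1024-bit fingerprints: {fingerprints / 1e9:.0f} GB")
```

Under these assumptions the dense matrix is far larger than the memory of a single workstation, while compact binary fingerprints stay within reach of a small number of devices.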
11. Small Changes, Big Speed-up
[Figure: Application Code split between GPU and CPU. The GPU is used to parallelize compute-intensive functions; the rest of the sequential code runs on the CPU. Courtesy of NVIDIA]
12. Data Processing
Chemical Library (text file), ~300 bytes/cmpd
- Data parsing from SMILES
- 2D structure generation
- Automatic curation
High throughput descriptor generator, 30-50M/hr
- Mol weight, logP, Rule of 5, Daylight fingerprints
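As an illustration of one of the listed properties, the sketch below counts Rule of 5 (Lipinski) violations from properties that have already been computed. The record layout, the function names, and the "at most one violation" default are assumptions made for this example; the slides do not show how the platform implements the check.

```python
from dataclasses import dataclass

@dataclass
class Properties:
    """Precomputed properties for one compound (hypothetical record layout)."""
    mol_wt: float      # molecular weight in daltons
    logp: float        # octanol-water partition coefficient
    h_donors: int      # hydrogen-bond donors
    h_acceptors: int   # hydrogen-bond acceptors

def rule_of_5_violations(p: Properties) -> int:
    """Count how many of the four Lipinski criteria are violated."""
    return sum([
        p.mol_wt > 500,
        p.logp > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ])

def passes_rule_of_5(p: Properties, max_violations: int = 1) -> bool:
    """Assumed convention: tolerate at most `max_violations` violations."""
    return rule_of_5_violations(p) <= max_violations

# Illustrative values only, not taken from any dataset in the deck.
small = Properties(mol_wt=180.2, logp=1.2, h_donors=1, h_acceptors=4)
large = Properties(mol_wt=650.0, logp=6.2, h_donors=6, h_acceptors=12)
print(passes_rule_of_5(small), passes_rule_of_5(large))
```

Because each compound is checked independently, a filter of this kind parallelizes trivially across a library.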
13. Data Processing and Similarity Search
Indexed, fully searchable, accessible via high level API, e.g.,
(MolWt > 150) & (logP == 3)
Access in chunks or streaming
Interactive analytics with IPython
GPU accelerated similarity search, 177M/s on K40
- GPUsim
- GPUdup
- GPUdiv
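The query expression and the "chunks or streaming" access pattern can be sketched in plain Python as follows. This is not the platform's API: the helper names `chunks` and `query`, the dictionary record layout, and the compound IDs are all hypothetical, and the predicate simply mirrors the example expression on the slide.

```python
from itertools import islice
from typing import Iterable, Iterator

def chunks(records: Iterable[dict], size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def query(records: Iterable[dict], size: int = 2) -> Iterator[dict]:
    """Stream records matching (MolWt > 150) & (logP == 3), one chunk at a time."""
    for chunk in chunks(records, size):
        for r in chunk:
            if (r["MolWt"] > 150) and (r["logP"] == 3):
                yield r

# Hypothetical records; a real library would be read lazily from disk.
library = [
    {"id": "c1", "MolWt": 120.0, "logP": 3},
    {"id": "c2", "MolWt": 210.5, "logP": 3},
    {"id": "c3", "MolWt": 305.1, "logP": 2},
    {"id": "c4", "MolWt": 151.0, "logP": 3},
    {"id": "c5", "MolWt": 480.9, "logP": 4},
]
hits = [r["id"] for r in query(library)]
print(hits)
```

Only one chunk is held in memory at a time, which is the property that matters when the library does not fit in RAM.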
14. Predictive Modeling (QSAR/QSPR)
Rapid screening of extremely large libraries with multiple molecular probes and QSAR/QSPR models
- GPUrf: CudaTree@GitHub wrapper
- GPUdnn: Deep Learning based on Theano
- …
15. Chemical Datasets
Largest publicly available virtual libraries
Library                  Size
GDB-13 [1]               955 M compounds
GDB-13-ABCDE subset      141 M
GDB-17 subset [2]        50 M
[1] Blum and Reymond, 2009, J Am Chem Soc, 131, 8732–8733
[2] Ruddigkeit et al., 2012, J Chem Inf Model, 52, 2864-2875
16. GPU - Case Study 1
Fast Computation of Molecular Properties for Extremely Large Chemical Libraries
Our GPU-accelerated cheminformatics platform is able to compute key molecular properties for GDB-13 (855M), GDB-13-ABCDE (141M), and a subset of GDB-17 (50M) compounds.
[Figure panels: GDB-13, subset of 141 M; GDB-17, random sample of 50 M]
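For a sense of scale, the arithmetic below converts these library sizes into wall-clock time at the 30-50M compounds/hr rate quoted earlier in the deck for the descriptor generator. Treating that rate as applicable to this case study, and to a single generator instance, is an assumption of this sketch; the slides do not report timings for the case study.

```python
# Library sizes as stated on the slide (compounds).
LIBRARIES = {
    "GDB-13": 855_000_000,
    "GDB-13-ABCDE": 141_000_000,
    "GDB-17 subset": 50_000_000,
}

# Assumed throughput range, taken from the descriptor-generator slide.
RATE_SLOW = 30_000_000   # compounds per hour
RATE_FAST = 50_000_000   # compounds per hour

def hours(n_compounds: int, rate_per_hour: int) -> float:
    """Wall-clock hours to process n_compounds at a constant rate."""
    return n_compounds / rate_per_hour

for name, n in LIBRARIES.items():
    best, worst = hours(n, RATE_FAST), hours(n, RATE_SLOW)
    print(f"{name}: {best:.1f} to {worst:.1f} hours")
```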
18. Similarity Search
From J. Bajorath, SSS Cheminformatics, Obernai 2008
Similarity searching using fingerprint representations of molecules is one of the
most widely used approaches for chemical database mining: it assumes that
similar compounds possess similar biological activities.
Tanimoto Coefficient
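For reference, the Tanimoto coefficient of two binary fingerprints A and B is usually written as below. The symbols a, b, and c are chosen here to match the variable names in the CUDA similarity function shown later in the deck: a and b are the numbers of bits set in each fingerprint, and c is the number of bits set in both.

```latex
T(A, B) = \frac{c}{a + b - c},
\qquad a = |A|, \quad b = |B|, \quad c = |A \cap B|
```

The value ranges from 0 (no bits in common) to 1 (identical fingerprints).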
19. Implementation: __popc() and __popcll() instructions
● Tanimoto similarity needs the number of 1s in a binary representation of the data (popcount)
● CUDA includes device instructions for this: __popc() for 32-bit data and __popcll() for 64-bit data
● We used __popcll() in our implementation
● We break 1024-bit fingerprints into sixteen 64-bit chunks
● Bit counts are accumulated over the chunks, and the similarity is computed from the totals
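20. Some GPU / CUDA code
// Tanimoto similarity of two fingerprints stored as data_len 64-bit words
// (data_len = 16 for a 1024-bit fingerprint).
__device__ double similarity(const unsigned long long *query,
                             const unsigned long long *target, int data_len) {
    int a = 0, b = 0, c = 0;  // bits set in query, in target, and in both
    for (int i = 0; i < data_len; i++) {
        a += __popcll(query[i]);
        b += __popcll(target[i]);
        c += __popcll(query[i] & target[i]);
    }
    int union_bits = a + b - c;
    // Two all-zero fingerprints would give 0/0, so report 0 similarity instead.
    return union_bits > 0 ? (double) c / union_bits : 0.0;
}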
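A CPU reference is handy for checking a kernel like this on small inputs. The Python sketch below follows the same chunked popcount logic; the helper names `to_chunks` and `popcount`, the toy fingerprints, and the choice to return 0.0 when no bits are set in either fingerprint are assumptions of this example rather than part of the authors' code.

```python
MASK64 = (1 << 64) - 1

def to_chunks(fingerprint: int, n_bits: int = 1024) -> list:
    """Split an n_bits-wide fingerprint (a Python int) into 64-bit words."""
    return [(fingerprint >> shift) & MASK64 for shift in range(0, n_bits, 64)]

def popcount(word: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(word).count("1")

def similarity(query: list, target: list) -> float:
    """Tanimoto similarity accumulated over 64-bit chunks."""
    a = b = c = 0
    for q, t in zip(query, target):
        a += popcount(q)        # bits set in the query
        b += popcount(t)        # bits set in the target
        c += popcount(q & t)    # bits set in both
    union_bits = a + b - c
    return c / union_bits if union_bits else 0.0

# Toy fingerprints with bits {0, 1, 70} and {1, 70, 500} set.
fp_a = (1 << 0) | (1 << 1) | (1 << 70)
fp_b = (1 << 1) | (1 << 70) | (1 << 500)
score = similarity(to_chunks(fp_a), to_chunks(fp_b))
print(score)  # 2 shared bits out of 4 distinct bits
```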
23. GPU - Case Study
Virtual Screening of Very Large Chemical
Libraries to Identify Bioactive Compounds
Lacosamide
- Lacosamide (trade name Vimpat) is
an anticonvulsant drug used to
prevent seizures for patients treated
for epilepsy;
- Functionalized amino acid;
- Many active analogues have been
synthesized in Prof. Harold Kohn’s
laboratory* at UNC-CH.
*Wang et al., 2011, ACS Chem Neurosci, 2, 90–106
24. GPU - Case Study
Virtual Screening of Very Large Chemical Libraries to Identify Bioactive Compounds
[Figure: structures of Analog 1, Analog 2, Analog 3, Analog 4, and Analog 5]
Similarity search over a 200M compound subset of GDB-13/17, using Lacosamide as the molecular probe
25. GPU - Case Study
Virtual Screening of Very Large Chemical Libraries to Identify Bioactive Compounds
Compound ID          Tanimoto Ts
Analog 2             0.997
Analog 3             0.995
Analog 1             0.994
Analog 4             0.992
Analog 5             0.978
Gdb13-a10573585      0.977
Gdb13-b28137563      0.977
Gdb13-a36264983      0.976
Gdb13-a36264952      0.976
Gdb13-a10616005      0.976
Gdb13-a3011053       0.976
Gdb13-b21242261      0.976
Gdb17-44140083       0.976
Gdb13-a30878321      0.975
Gdb13-b3485216       0.975
The GPU-accelerated screening platform was able to retrieve:
- known active analogues of lacosamide,
- several functionalized amino acids present in GDB-13,
- a novel compound (Gdb17-44140083) fully matching the pharmacophore of lacosamide.
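A ranked hit list of this kind can be produced without sorting the whole library, by keeping only the best k scores while streaming. The sketch below uses Python's `heapq.nlargest` for that; the 8-bit fingerprints, the compound IDs, and the function names are invented for illustration and do not reflect the platform's GPU implementation.

```python
import heapq

def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    common = bin(a & b).count("1")
    union = bin(a | b).count("1")
    return common / union if union else 0.0

def top_k(probe: int, library, k: int) -> list:
    """Return the k best (score, compound_id) pairs, highest score first.

    `library` is any iterable of (compound_id, fingerprint) pairs,
    so it can be streamed rather than loaded in full.
    """
    scored = ((tanimoto(probe, fp), cid) for cid, fp in library)
    return heapq.nlargest(k, scored)

# Hypothetical 8-bit fingerprints for a probe and four library compounds.
probe = 0b11110000
library = [
    ("cmpd-1", 0b11110000),  # identical to the probe
    ("cmpd-2", 0b11100000),  # 3 shared bits of 4
    ("cmpd-3", 0b00001111),  # no shared bits
    ("cmpd-4", 0b11111000),  # 4 shared bits of 5
]
hits = top_k(probe, library, k=3)
for score, cid in hits:
    print(f"{cid}  {score:.3f}")
```

Memory use depends on k rather than on the library size, which suits screens where only the top few hundred hits are inspected.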
26. DeepLearning - Case Study
Large scale QSPR prediction of bioactivity
Build model with Deep Learning: 103K molecules, model accuracy 97%
Rapid screening of potential candidates: 200 M molecules
Deep Neural Net: 2 hidden layers, Rectified Linear Unit (ReLU)
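The architecture named here, two hidden layers with ReLU activations, can be sketched as a forward pass in a few lines. Everything numeric below is a placeholder: the layer sizes, the weights, and the sigmoid output unit (assumed because the slide reports a classification-style accuracy) are illustrative assumptions, not the trained model from the case study.

```python
import math

def relu(values: list) -> list:
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]

def dense(x: list, weights: list, bias: list) -> list:
    """Fully connected layer: y[j] = sum_i x[i] * weights[i][j] + bias[j]."""
    return [
        sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
        for j in range(len(bias))
    ]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(x: list, params: dict) -> float:
    """Forward pass: two ReLU hidden layers, then one sigmoid output."""
    h1 = relu(dense(x, params["W1"], params["b1"]))
    h2 = relu(dense(h1, params["W2"], params["b2"]))
    (logit,) = dense(h2, params["W3"], params["b3"])
    return sigmoid(logit)

# Placeholder parameters: 3 descriptors -> 2 units -> 2 units -> 1 output.
params = {
    "W1": [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]],
    "b1": [0.0, 0.0],
    "W2": [[1.0, 0.0], [0.0, 1.0]],
    "b2": [0.0, 0.0],
    "W3": [[1.0], [-1.0]],
    "b3": [0.0],
}
probability = predict([1.0, 0.0, 0.0], params)
print(f"predicted probability of activity: {probability:.3f}")
```

Screening a large library amounts to running this forward pass once per compound, which is the part that maps naturally onto a GPU.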
27. In Summary
• GPU-accelerated cheminformatics platform for high
performance virtual screening of extremely large
chemical libraries.
• Tested for (1) the analysis of the largest publicly available
dataset, GDB-13 (~900M compounds), and (2) the
screening of a ~200M compound library by similarity
search using an anticonvulsant drug as the molecular
probe.
• Our platform aims to virtually screen billions of
compounds using similarity filters and QSAR models.
28. Acknowledgements
• UNC-CS: Vance Miller, Chun-Wei Liu, Zimeng Wang and Reed Palmer
• Prof. Alex Tropsha (UNC-CH)
• Prof. Denis Fourches (NCSU)
• NVIDIA & Mark Berger for help & generous hardware donation
Funding
- NSF ABI program
- Office of Naval Research