GPU-accelerated Virtual Screening

With the unprecedented growth of chemical databases, which now incorporate up to several hundred billion synthetically feasible chemicals, modelers have no shortage of chemicals to process. Importantly, such "Big Chemical Data" offers enormous opportunities for discovering novel bioactive molecules. However, the current generation of cheminformatics software tools is not capable of handling, characterizing, and processing such extremely large chemical libraries. In this presentation, we will discuss the rationale and the main challenges (theoretical and technical) of screening very large repositories of compounds in the current context of drug discovery. We will present several proof-of-concept studies on the screening of extremely large libraries (1+ billion compounds) using our novel GPU-accelerated cheminformatics platform to identify molecules with defined bioactivity. Overall, we will show that GPU computing represents an effective and inexpensive architecture on which to develop, employ, and validate a new generation of cheminformatics methods and tools ready to process billions of compounds.

Published in: Science

  1. Olexandr Isayev, Ph.D., University of North Carolina at Chapel Hill. Twitter: @olexandr. http://olexandrisayev.com
  2. Outline: “Big Data” in the chemistry world; sources; challenges; our vision for a GPU-accelerated cheminformatics workflow; benchmarks and case studies; descriptor calculations; similarity; predictive modeling.
  3. Data – Knowledge Gap. Drowning in data but starving for knowledge. Tremendous opportunities for the discovery of new drugs and materials.
  4. [Figure: example chemical structures.] 10^60 to 10^100 chemicals; 10^33 drug-like chemicals*; 10^8 compounds in PubChem; 10^6 compounds in ChEMBL with ≥ 1 known bioactivity. D. Fourches. Cheminformatics at the crossroads of eras. In: Applications of Computational Techniques in Pharmacy and Medicine, Springer. Available 04/2014. * Polishchuk, Madzhidov, Varnek. J Comput Aided Mol Des. 2013, 27(8):675-9.
  5. Decline in pharmaceutical R&D efficiency. The cost of developing a new drug (~$2B) roughly doubles every nine years. Novel approaches are needed that (i) fully exploit the potential of modern chemical and biological data streams; (ii) reliably forecast compounds’ bioactivity and safety profiles; (iii) accelerate the translation from basic research to drug candidates. Scannell et al., Nature Reviews Drug Discovery, 2012, 11, 191-200.
  6. Quantitative Structure-Activity Relationships (QSAR). Descriptor matrix: samples (compounds) are rows 1 ... n and features (descriptors) are columns X1 ... Xm, so entry Xij is descriptor j of compound i; each compound also has an activity value. [Figure: compound structures with example numeric values 0.613, 0.380, -0.222, 0.708, 1.146, 0.491, 0.301, 0.141, 0.956, 0.256, 0.799, 1.195, 1.005.] Thousands of molecular descriptors are available for organic compounds: constitutional, topological, structural, quantum-mechanics-based, fragmental, steric, pharmacophoric, geometrical, thermodynamical, conformational, etc. Models are built using machine learning methods (NN, SVM, RF) and validated according to numerous statistical procedures, together with their applicability domains. The external predictive power of QSAR models is critical to enable their application to virtual screening. It is technically challenging to compute molecular properties and descriptors for more than 10^9 compounds. No cheminformatics architecture is able to screen >10^9 compounds.
  7. Virtual Screening to identify potential hits. Candidate molecules (~10^6 – 10^7 molecules) → VIRTUAL SCREENING (Empirical Rules/Filters; Similarity Search; ML or QSAR Models; Structure-based Models; Consensus QSA) → Potential Hits (~10^2 – 10^3 molecules).
  8. Polypharmacology & Biological profiles
  9. Our vision for next-gen cheminformatics platforms
  10. Add GPUs: Accelerate Science Applications. [Diagram: CPU + GPU.] Courtesy of NVIDIA.
  11. Small Changes, Big Speed-up. Application code: use the GPU to parallelize compute-intensive functions; the rest of the sequential code runs on the CPU. Courtesy of NVIDIA.
  12. Data Processing. Chemical library (text file, ~300 bytes/cmpd): data parsing from SMILES, 2D structure generation, automatic curation. High-throughput descriptor generator: Mol weight, logP, Rule of 5, Daylight fingerprints; 30-50M/hr.
  13. Data Processing (as on slide 12), then Similarity Search. The library is indexed, fully searchable, and accessible via a high-level API, e.g., (MolWt > 150) & (logP == 3); access in chunks or streaming; interactive analytics with IPython. GPU-accelerated similarity search: 177M/s on K40. Tools: GPUsim, GPUdup, GPUdiv.
  14. Data Processing and Similarity Search (as on slides 12 and 13), then Predictive Modeling (QSAR/QSPR): rapid screening of extremely large libraries with multiple molecular probes and QSAR/QSPR models. Tools: GPUrf (CudaTree@GitHub wrapper), GPUdnn (deep learning based on Theano), …
  15. Chemical Datasets: largest publicly available virtual libraries. GDB-13: 955 M compounds [1]; GDB-13-ABCDE subset: 141 M; GDB-17 subset: 50 M [2]. [1] Blum and Reymond, 2009, J Am Chem Soc, 131, 8732–8733. [2] Ruddigkeit et al., 2012, J Chem Inf Model, 52, 2864-2875.
  16. GPU - Case Study 1: Fast Computation of Molecular Properties for Extremely Large Chemical Libraries. Our GPU-accelerated cheminformatics platform is able to compute key molecular properties for GDB-13 (855M), GDB-13-ABCDE (141M), and a subset of GDB-17 (50M) compounds. [Figure panels: GDB-13 subset of 141 M; GDB-17 random sample of 50 M.]
  17. GPU - Case Study 1: Fast Computation of Molecular Properties for Extremely Large Chemical Libraries. Our GPU-accelerated cheminformatics platform is able to compute key molecular properties for GDB-13 (855M), GDB-13-ABCDE (141M), and a subset of GDB-17 (50M) compounds.
  18. Similarity Search. Similarity searching using fingerprint representations of molecules is one of the most widely used approaches for chemical database mining: it assumes that similar compounds possess similar biological activities. Similarity measure: Tanimoto coefficient. From J. Bajorath, SSS Cheminformatics, Obernai 2008.
  19. Implementation: __popc() and __popcll() instructions. ● Tanimoto similarity needs the number of 1s in a binary representation of the data (popcount). ● CUDA includes device intrinsics for this: __popc() for 32-bit data and __popcll() for 64-bit data. ● We used __popcll() in our implementation. ● We break 1024-bit fingerprints into 64-bit chunks. ● The resulting similarity is aggregated over chunks.
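  20. Some GPU / CUDA code:
      // Tanimoto similarity of two fingerprints, each stored as data_len 64-bit chunks.
      __device__ double similarity(const long long *query, const long long *target, int data_len)
      {
          int a = 0, b = 0, c = 0;
          for (int i = 0; i < data_len; i++) {
              a += __popcll(query[i]);              // bits set in the query
              b += __popcll(target[i]);             // bits set in the target
              c += __popcll(query[i] & target[i]);  // bits set in both
          }
          int u = a + b - c;                        // bits set in either
          return u ? (double) c / u : 0.0;          // guard added: two empty fingerprints give 0, not NaN
      }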
  21. Benchmarks
  22. Benchmarks
  23. GPU - Case Study: Virtual Screening of Very Large Chemical Libraries to Identify Bioactive Compounds. Lacosamide (trade name Vimpat) is an anticonvulsant drug used to prevent seizures in patients treated for epilepsy. It is a functionalized amino acid, and many active analogues have been synthesized in Prof. Harold Kohn’s laboratory* at UNC-CH. *Wang et al., 2011, ACS Chem Neurosci, 2, 90–106.
  24. GPU - Case Study: Virtual Screening of Very Large Chemical Libraries to Identify Bioactive Compounds. Similarity search over a 200M-compound subset of GDB-13/17 using lacosamide as the molecular probe. [Figure: structures of Analog 1, Analog 2, Analog 3, Analog 4, Analog 5.]
  25. GPU - Case Study: Virtual Screening of Very Large Chemical Libraries to Identify Bioactive Compounds. The GPU-accelerated screening platform was able to retrieve: known active analogues of lacosamide; several functionalized amino acids present in GDB-13; a novel compound (Gdb17-44140083) fully matching the pharmacophore of lacosamide.

      Compound ID        Tanimoto Ts
      Analog 2           0.997
      Analog 3           0.995
      Analog 1           0.994
      Analog 4           0.992
      Analog 5           0.978
      Gdb13-a10573585    0.977
      Gdb13-b28137563    0.977
      Gdb13-a36264983    0.976
      Gdb13-a36264952    0.976
      Gdb13-a10616005    0.976
      Gdb13-a3011053     0.976
      Gdb13-b21242261    0.976
      Gdb17-44140083     0.976
      Gdb13-a30878321    0.975
      Gdb13-b3485216     0.975
  26. DeepLearning - Case Study: Large-scale QSPR prediction of bioactivity. Build the model with deep learning on 103K molecules (deep neural net, 2 hidden layers, rectified linear units (ReLU)); model accuracy 97%. Rapid screening of potential candidates among 200 M molecules.
  27. In Summary. • GPU-accelerated cheminformatics platform for high-performance virtual screening of extremely large chemical libraries. • Tested for (1) the analysis of the largest publicly available dataset, GDB-13 (~900M compounds), and (2) the screening of a ~200M-compound library by similarity search using an anticonvulsant drug as the molecular probe. • Our platform aims to virtually screen billions of compounds using similarity filters and QSAR models.
  28. Acknowledgements. • UNC-CS: Vance Miller, Chun-Wei Liu, Zimeng Wang and Reed Palmer. • Prof. Alex Tropsha (UNC-CH). • Prof. Denis Fourches (NCSU). • NVIDIA & Mark Berger for help & generous hardware donation. Funding: NSF ABI program; Office of Naval Research.
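
Slide 18 names the Tanimoto coefficient, but the formula itself does not survive in the transcript. The following is a sketch of the definition that is consistent with the similarity() routine on slide 20, assuming A and B denote the sets of bits set in the query and target fingerprints (the set notation is ours, not the slide's):

```latex
T(A, B) = \frac{c}{a + b - c},
\qquad a = |A|, \quad b = |B|, \quad c = |A \cap B|
```

For example, a query with a = 4 bits set and a target with b = 3 bits set that share c = 2 bits would score T = 2 / (4 + 3 - 2) = 0.4.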
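
The chunked popcount approach from slides 19 and 20 can be checked on the CPU before moving to the GPU. Below is a minimal host-side reference sketch, not the authors' implementation: it mirrors similarity() and adds a serial scan for the best-matching library entry. The names popcount64, best_match, WORDS and the HD macro are hypothetical, the flat library layout is an assumption, and the empty-fingerprint convention (score 0) is our choice. The code builds as plain C++ and should also build with nvcc, where the device path uses __popcll().

```cpp
// Host-side reference for a popcount-based Tanimoto search (illustrative sketch).
#include <cstddef>
#include <cstdint>
#include <vector>

#ifdef __CUDACC__
#define HD __host__ __device__   // callable from kernels when built with nvcc
#else
#define HD                       // plain C++ build: no execution-space qualifiers
#endif

constexpr int WORDS = 16;        // 1024-bit fingerprint = 16 x 64-bit chunks

// Count the set bits in one 64-bit chunk.
HD inline int popcount64(std::uint64_t x) {
#ifdef __CUDA_ARCH__
    return __popcll(x);          // CUDA device intrinsic named on slide 19
#else
    int n = 0;                   // portable fallback: clear the lowest set bit per pass
    while (x) { x &= x - 1; ++n; }
    return n;
#endif
}

// Tanimoto similarity c / (a + b - c), aggregated over the chunks.
HD inline double similarity(const std::uint64_t* query,
                            const std::uint64_t* target, int data_len) {
    int a = 0, b = 0, c = 0;
    for (int i = 0; i < data_len; ++i) {
        a += popcount64(query[i]);
        b += popcount64(target[i]);
        c += popcount64(query[i] & target[i]);
    }
    const int u = a + b - c;     // bits set in either fingerprint
    return u ? static_cast<double>(c) / u : 0.0;  // assumption: empty vs empty scores 0
}

// Scan a flat library (n * WORDS chunks) and return the index of the most
// similar entry. This is a serial loop; a GPU version would typically give
// each library entry to its own thread.
inline std::size_t best_match(const std::uint64_t* query,
                              const std::vector<std::uint64_t>& library) {
    std::size_t best = 0;
    double best_t = -1.0;
    const std::size_t n = library.size() / WORDS;
    for (std::size_t i = 0; i < n; ++i) {
        const double t = similarity(query, &library[i * WORDS], WORDS);
        if (t > best_t) { best_t = t; best = i; }
    }
    return best;
}
```

Keeping the scoring function identical on host and device makes it straightforward to compare GPU results against this reference on a small sample of the library.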
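
To put the throughput figures from slides 12 and 13 in the context of a billion-compound library, here is a rough back-of-the-envelope calculation. It assumes that ~300 bytes/cmpd is the stored size per compound, that 30-50M/hr is compounds processed per hour, and that 177M/s is fingerprint comparisons per second for a single probe; the slides do not spell out these units.

```latex
\begin{aligned}
\text{storage}      &\approx 10^{9} \times 300\ \text{B} = 3 \times 10^{11}\ \text{B} \approx 300\ \text{GB} \\
\text{descriptors}  &\approx \frac{10^{9}}{(3\text{--}5) \times 10^{7}\ \text{h}^{-1}} \approx 20\text{--}33\ \text{h} \\
\text{similarity}   &\approx \frac{10^{9}}{1.77 \times 10^{8}\ \text{s}^{-1}} \approx 5.6\ \text{s per probe}
\end{aligned}
```

Under these assumptions, descriptor generation rather than the similarity scan would dominate the cost of a one-off screen.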
