AI & ML in Drug Design: Pistoia Alliance CoE


Creating novel drugs is an extraordinarily hard and complex problem.
One of the many challenges in drug design is the sheer size of the search space for novel chemical compounds. Scientists need to find molecules that are active toward a biological target or pathway and at the same time have acceptable ADMET properties.

Considerable research now applies AI and ML approaches to these challenges.

Our distinguished speakers, Drs. Alex Tropsha and Ola Engkvist, will discuss their recent work in Drug Design involving Deep Reinforcement Learning and Neural Networks, and will answer questions from the audience on the current state of the research in the field.


Prof Alex Tropsha, Professor at University of North Carolina at Chapel Hill, USA
Dr. Ola Engkvist, Associate Director at AstraZeneca R&D, Gothenburg, Sweden

Published in: Health & Medicine

  1. 1. 26 February 2019. AI in Drug Design. Pistoia Alliance Centre of Excellence for AI in Life Sciences. Moderators: Vladimir Makarov and Nick Lynch
  2. 2. This webinar is being recorded
  3. 3. Poll Question 1: Are you or your organisation using AI / ML in Drug Design? A. Yes, already B. Plan to do in next 12 months C. Plan in next 12-24 months D. No plans
  4. 4. ©PistoiaAlliance Introduction to Today’s Speakers: Prof Alex Tropsha, Associate Dean for Pharmacoinformatics and Data Science, K.H. Lee Distinguished Professor; Dr Ola Engkvist, Associate Director, Discovery Sciences, AstraZeneca
  5. 5. Alexander Tropsha UNC Eshelman School of Pharmacy Machine learning, text mining, and AI approaches for drug discovery and repurposing
  6. 6. The ultimate dream of a computational chemist
  7. 7. The chief utility of computational models: annotation of new compounds. [Diagram labels: CHEMICAL DATABASE (~10^6 to 10^9 molecules); VIRTUAL SCREENING; CHEMICAL STRUCTURES; CHEMICAL DESCRIPTORS; PREDICTIVE QSAR MODELS ("QSAR MAGIC"); PROPERTY/ACTIVITY; confirmed actives; confirmed inactives.] Varnek, A., Tropsha, A. (Eds) Chemoinformatics Approaches to Virtual Screening, RSC Publishing, Cambridge, UK, 2008
  8. 8. Datasets are represented by a matrix of molecular descriptors. Rows are samples (compounds); columns are variables (descriptors).

            X1    X2    ...   Xm
      1     X11   X12   ...   X1m
      2     X21   X22   ...   X2m
      ...   ...   ...   ...   ...
      n     Xn1   Xn2   ...   Xnm
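A minimal sketch of the compounds-by-descriptors matrix described on this slide. The four descriptors below are toy string counts chosen only for illustration (they are assumptions, not descriptors named in the talk); real QSAR work would compute descriptors with a cheminformatics toolkit. The two SMILES strings are taken from later slides.

```python
# Toy descriptors computed directly from SMILES text (hypothetical, for
# illustration only; not the Dragon or MOE descriptors mentioned in the talk).
DESCRIPTORS = {
    "n_chars": len,
    "n_oxygen": lambda s: s.count("O") + s.count("o"),
    "n_nitrogen": lambda s: s.count("N") + s.count("n"),
    "n_ring_digits": lambda s: sum(ch.isdigit() for ch in s),
}


def descriptor_matrix(smiles_list):
    """Return an n x m matrix: one row per compound, one column per descriptor."""
    return [[fn(s) for fn in DESCRIPTORS.values()] for s in smiles_list]


compounds = ["c1ccc(O)cc1", "CC(N)Sc1ccc(Cl)nc1"]
X = descriptor_matrix(compounds)

for smiles, row in zip(compounds, X):
    print(smiles, row)
```

Each row of `X` corresponds to one compound and each column to one descriptor, matching the X11 ... Xnm layout above.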
  9. 9. Quantitative Structure Activity Relationships. [Figure: table of compounds against descriptors and activity values; scatter plot of predicted vs. actual LogED50 (ED50 in mM/kg) for the training set, with a linear fit.] Thousands of molecular descriptors are available for organic compounds: constitutional, topological, structural, quantum mechanics based, fragmental, steric, pharmacophoric, geometrical, thermodynamical, conformational, etc. - Building of models using machine learning methods (NN, SVM, etc.); - Validation of models according to numerous statistical procedures, and their applicability domains. Credit: Denis Fourches. Tropsha, A. Best Practices for QSAR Model Development, Validation, and Exploitation Mol. Inf., 2010, 29, 476
  10. 10. QSAR Modeling Workflow: the importance of rigorous validation. [Workflow diagram, courtesy of L. Zhang. Labels: datasets; 5-fold external validation; modeling set and external set; Combi-QSAR modeling with descriptors (Dragon, MOE) and modeling methods (K-Nearest Neighbors (kNN), Random Forest (RF), Support Vector Machines (SVM)); internal validation; model selection; an ensemble of QSAR models; evaluation of external performance; virtual screening (with AD threshold); experimental confirmation.] Fully implemented on CHEMBENCH.MML.UNC.EDU. Tropsha, A. Best Practices for QSAR Model Development, Validation, and Exploitation Mol. Inf., 2010, 29, 476 – 488
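The 5-fold external validation step in this workflow can be sketched as follows. This is only an illustration of the splitting idea, assuming a random partition into five folds; the function name and seed are hypothetical, and the model building and internal validation done on each modeling set are not shown.

```python
import random


def five_fold_external_splits(n_compounds, n_folds=5, seed=0):
    """Yield (modeling_set, external_set) index lists, one pair per fold.

    Each compound lands in the external set exactly once and is never used
    to build the models that are evaluated on it.
    """
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        external = sorted(folds[k])
        modeling = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield modeling, external


splits = list(five_fold_external_splits(10))
for modeling, external in splits:
    print("modeling:", modeling, "external:", external)
```

Internal validation and model selection would then happen inside each modeling set only, so the external set stays untouched until the final performance estimate.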
  11. 11. SMILES: a compact way to encode, store, and share chemical data
  12. 12. Representation of molecules by the SMILES language p?curid=2556784
  13. 13. ReLeaSE* design principles: learning and exploiting structural linguistics of SMILES notation • SMILES notations reflect rules of Chemistry • SMILES notation may embed linguistic rules • Neural nets could learn both of the above types of rules • This knowledge can be transformed into the generation of new SMILES corresponding to novel chemically feasible molecules (generative model) • One can build QSAR models based solely on SMILES notation (predictive model) • QSAR models can be used as a reward function for reinforcement learning to bias the design of novel libraries *Popova, M., Isayev, O., and Tropsha, A. "Deep reinforcement learning for de-novo drug design." Science Advances, 2018 Jul 25;4(7):eaap7885.
  14. 14. NLP/Text mining: directly learn low-dimensional word vectors • In deep learning models, a word is represented as a dense vector • Word vectors form the basis for deep learning methods • Objective: predict a word based on the context. Mikolov, T. et al. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 2013, pp. 3111-3119.
  15. 15. Design of the ReLeaSE* method (Reinforcement Learning for Structural Evolution). Elements of the thought cycle (molecules -> models -> molecules): • Generate chemically feasible SMILES • Develop SMILES-based QSAR model • Employ QSAR model to bias library generation • Produce new SMILES *Popova, Mariya, Olexandr Isayev, and Alexander Tropsha. "Deep reinforcement learning for de-novo drug design." arXiv preprint arXiv:1711.10907 (2017).
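The generate, score, bias cycle listed on this slide can be illustrated with a deliberately tiny policy-gradient loop. This is not the ReLeaSE implementation: the "generator" here is a single categorical distribution over three tokens, the "QSAR model" is a hypothetical stand-in reward (fraction of oxygen tokens), and the update is plain REINFORCE with a running baseline. All names and hyperparameters are assumptions made for the sketch.

```python
import math
import random

VOCAB = ["C", "N", "O"]  # toy token set, not a full SMILES alphabet
LENGTH = 6               # fixed string length for the toy generator


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(tokens):
    """Stand-in for a predictive model: fraction of oxygen tokens (hypothetical)."""
    return tokens.count("O") / len(tokens)


def train(steps=300, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(VOCAB)  # one shared policy for every position
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # 1. Generate: sample a token string from the current policy.
        sample = rng.choices(range(len(VOCAB)), weights=probs, k=LENGTH)
        # 2. Score: the predictive model supplies the reward.
        reward = toy_reward([VOCAB[i] for i in sample])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # 3. Bias: REINFORCE update; d log p(i) / d logit_j = 1[i == j] - p_j.
        for i in sample:
            for j in range(len(VOCAB)):
                indicator = 1.0 if i == j else 0.0
                logits[j] += lr * advantage * (indicator - probs[j]) / LENGTH
    return softmax(logits)


final_probs = train()
print(dict(zip(VOCAB, final_probs)))
```

After training, the policy puts most of its probability on the rewarded token, which is the toy analogue of a generator being biased toward molecules that the predictive model scores highly.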
  16. 16. ReLeaSE:* Disruptive Innovation of Conventional Computational Drug Discovery Pipeline. Traditional workflow: learn from target-specific data (300-500 molecules); target-specific models; virtual screening of internal/public databases; selection and testing of known molecules. ReLeaSE workflow: learn from all data (2M molecules); target-specific and property models / reinforcement learning; generation of novel molecules; selection and testing of novel molecules. Outcome: hits with desired properties. *Popova, M., Isayev, O., and Tropsha, A. "Deep reinforcement learning for de-novo drug design." Science Advances, 2018 Jul 25;4(7):eaap7885.
  17. 17. Disruptive innovation in QSAR: Can we avoid descriptor generation altogether and, besides, predict new structures? [Figure: generative network trained on 1.5M molecules from ChEMBL. The example string <START>c1ccc(O)cc1<END> is fed in one character at a time, the next character is predicted at each step with a softmax loss, and the losses are summed; the loop repeats until "Did the training converge?" is answered YES.] *Popova, M., Isayev, O., and Tropsha, A. "Deep reinforcement learning for de-novo drug design." Science Advances, 2018 Jul 25;4(7):eaap7885.
  18. 18. Are we making legitimate SMILES? AI learning system: 95% of generated SMILES strings are valid, chemically feasible molecules
  19. 19. Fc1ccc2c(Nc3ccc(F)c(F)c3)ncnc2c1 Generative model Reinforcement learning for chemical design Predictive model
  20. 20. QSAR: Smile-ification of QSAR! Quantitative SMILES-Activity Relationships. Compounds: O=C(C)Oc1ccccc1C(=O)O, CCOc1cc(C)ccc1OCC=CF, COc1ccccc1OCCO, CC(N)Sc1ccc(Cl)nc1, COC(=O)NCc1ccccc1Cl. Activity: 0.531, 1.299, 0.946, -0.218, 0.017
  21. 21. QSAR modeling using SMILES strings only* [Figure: neural network predicting a property directly from a SMILES string, e.g. CN2C(=O)N(C)C(=O)C1=C2N=CN1C; plots of predicted vs. observed LogP for two models: 5CV RF model with DRAGON7 descriptors, and 5CV NN model with SMILES directly.] RMSE: 0.57, 0.53; MAE: 0.37, 0.35; R2 ext: 0.90, 0.91. *LogP data for ~16K molecules from PHYSPROP, ToxCast Dashboard, and others.
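For reference, the three statistics quoted on this slide can be computed as below. This is a generic sketch using the standard definitions (R2 as one minus the ratio of residual to total sum of squares); the talk does not state exactly how "R2 ext" was defined, and the observed and predicted values are made-up numbers for illustration, not the LogP data.

```python
import math


def regression_metrics(observed, predicted):
    """Return (RMSE, MAE, R2) for paired observed and predicted values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


# Illustrative numbers only (not taken from the slide).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
rmse, mae, r2 = regression_metrics(observed, predicted)
print(f"RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
```

For an external estimate, `observed` and `predicted` should come from compounds that were held out of model building.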
  22. 22. Generative model; Predictive model: ACTIVE! Reinforcement learning for chemical design
  23. 23. Generative model; Predictive model: ACTIVE! Reinforcement learning for chemical design
  24. 24. Generative model; Predictive model: ACTIVE! Reinforcement learning for chemical design
  25. 25. Generative model Predictive model Reinforcement learning for chemical design
  26. 26. FC(F)COc1ccc2c(Nc3ccc(Cl)c(Cl)c3)ncnc2c1 Generative model Predictive model Reinforcement learning for chemical design
  27. 27. FC(F)COc1ccc2c(Nc3ccc(Cl)c(Cl)c3)ncnc2c1 Generative model Predictive model Reinforcement learning for chemical design
  28. 28. Generative model; Predictive model: INACTIVE! Reinforcement learning for chemical design
  29. 29. Generative model; Predictive model: INACTIVE! Reinforcement learning for chemical design
  30. 30. Generative model; Predictive model: INACTIVE! Reinforcement learning for chemical design
  31. 31. Generative model Predictive model Reinforcement learning for chemical design
  32. 32. Results: Synthetic accessibility score* of the designed libraries *Ertl, Peter, and Ansgar Schuffenhauer. "Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions." Journal of cheminformatics 1.1 (2009): 8.
  33. 33. PoC: Physical properties. LogP (10K compounds); T melt, °C (10K compounds)
  34. 34. Predicted pIC50 for JAK2 kinase. Examples: CAS 236-084-2 (buffer reagent); ZINC37859566; new molecule. SIMILAR SCAFFOLDS; NEW CHEMOTYPE. JAK2 kinase inhibition: untrained data distribution; maximized property distribution; minimized property distribution
  35. 35. Target predictions for generated compounds using SEA* *Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK. Relating protein pharmacology by ligand chemistry. Nat Biotech 25 (2), 197-206 (2007).
  36. 36. Target predictions for generated compounds using SEA* *Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK. Relating protein pharmacology by ligand chemistry. Nat Biotech 25 (2), 197-206 (2007).
  37. 37. Practical implementation workflow • Select a target • Train ReLeaSE to generate new target-specific molecules; collect computational hits • Identify a fraction of hits available in commercial libraries; purchase and test selected hits • Following successful validation, order NCE synthesis and testing in vitro and in vivo and, if successful, file for IP protection
  38. 38. Summary • We propose an innovative de novo drug discovery technology termed Reinforcement Learning for Structural Evolution (ReLeaSE)* • ReLeaSE is a product of convergence of fields as disparate as cheminformatics and text mining united by AI • Unlike most of the current technologies, ReLeaSE enables the discovery of new chemical entities with the desired bioactivity and drug-like properties. Patent application filed (application # 62/535069, filed by UNC, 07/2018)
  39. 39. General Summary • Accumulation of Big Data in all areas of research creates previously unachievable opportunities for using ML and AI approaches – However, primary data must be handled with extreme care (curation, reproducibility) • Exciting developments in computational chemistry – Critical shift from discovery to design and AI-driven robotics • Rapid progression from the use of computational modeling for decision support to using models to guide experimental research – Critical importance of rigorous and comprehensive model validation using truly external data • Natural progression toward automated chemical labs driven by AI
  40. 40. Acknowledgements. Principal Investigator: Alexander Tropsha. Research Professors: Alexander Golbraikh, Olexander Isayev, Eugene Muratov. Graduate students: Sherif Faraq, Kyle Bowers, Maria Popova, Andrew Thieme, Dan Korn, Phil Gusev. Postdoctoral Fellows: Vinicius Alves, Joyce Borba. MAJOR FUNDING: NIH (1U01CA207160, R01-GM114015, 5U54CA198999, 1OT3TR002020); ONR (N00014-16-1-2311)
  41. 41. Poll Question 2: What are the biggest barriers to machine learning adoption in Drug Design? (multi select) A. Lack of access to AI/ML Skills B. Access to Data C. Quality of Data D. Access to ML & AI Tools E. Other
  42. 42. Artificial Intelligence in Drug Design. Ola Engkvist, Hit Discovery, Discovery Sciences, IMED Biotech Unit, AstraZeneca, Gothenburg, Sweden. February 26, 2019, Pistoia webinar
  43. 43. Drug Design What to make next? How to make it? De novo design Multi-parameter scoring function Retrosynthesis
  44. 44. What is different now? Augmented design; Autonomous design; Automatic design; de novo molecular design; Synthesis prediction; Automation; Data generation
  45. 45. It takes two to tango: Artificial Intelligence; Chemistry Automation
  46. 46. AI/ML for drug design science @AZ
  47. 47. Neural Networks & Deep Learning • Neural Networks known for decades • Inputs, Hidden Layers, Outputs • Single layer NNs have been used in QSAR modelling for years • Recent applications use more complex networks such as • Multi-layer Feed-Forward NNs • Convolutional NNs (biological image processing) • Auto-encoder NNs • Adversarial NNs • Recurrent NNs
  48. 48. Why? Generation of Novel Compounds in the 10^60 Chemical Space! Where's the impact? • Use for de novo Molecular Design • Scaffold Hopping • Novelty • Virtual Screening • Library Design [Figure labels: 10^60; 10^10-10^12]
  49. 49. Natural language generation and molecular structure generation • Can we borrow concepts from natural language processing and apply them to the SMILES description of molecular structures to generate molecules? • Conditional probability distributions given context • P(green | is, grass, The) for "The grass is ?" • P(O | =, C, C) for "C C = ?"
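The conditional probability P(O | =, C, C) on this slide can be made concrete with a simple count-based model. This is only a sketch of the idea of "next symbol given context": it uses character-level counts over a four-string toy corpus (an assumption for illustration), whereas the talk uses recurrent neural networks trained on large SMILES sets. Function names and the padding symbols `^` and `$` are hypothetical.

```python
from collections import Counter, defaultdict


def fit_next_char_model(strings, context_len=3):
    """Count which character follows each context of `context_len` characters."""
    counts = defaultdict(Counter)
    for s in strings:
        padded = "^" * context_len + s + "$"  # ^ = start padding, $ = end marker
        for i in range(context_len, len(padded)):
            counts[padded[i - context_len:i]][padded[i]] += 1
    return counts


def next_char_probs(counts, context):
    """Return P(next character | context) estimated from the counts."""
    followers = counts[context]
    total = sum(followers.values())
    return {ch: n / total for ch, n in followers.items()}


# Toy corpus, for illustration only.
corpus = ["CC=O", "CC=C", "CC=O", "CCO"]
model = fit_next_char_model(corpus)
probs = next_char_probs(model, "CC=")
print(probs)
```

In this toy corpus the context `CC=` is followed by `O` in two of three cases, so the estimate of P(O | =, C, C) is 2/3. A recurrent network plays the same role but conditions on the whole preceding string rather than a fixed window.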
  50. 50. Tokenization of SMILES • Tokenize combinations of characters like “Cl” or “[nH]” • Represent the characters as one-hot vectors
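A minimal sketch of the two steps on this slide, assuming a regular expression that keeps bracket atoms, "Cl", and "Br" together and treats every other character as its own token. The pattern and the example string are illustrative assumptions, not the tokenizer used in the talk.

```python
import re

# Bracket atoms such as [nH] first, then two-letter halogens, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into tokens, keeping multi-character tokens whole."""
    return TOKEN_PATTERN.findall(smiles)


def one_hot(tokens, vocab):
    """Encode each token as a vector with a single 1 at its vocabulary index."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [[1 if index[tok] == j else 0 for j in range(len(vocab))] for tok in tokens]


tokens = tokenize("Clc1cc[nH]c1")
vocab = sorted(set(tokens))
vectors = one_hot(tokens, vocab)

print(tokens)
print(vocab)
print(vectors[0])
```

Without the multi-character rule, "Cl" would be read as a carbon followed by an unknown "l", which is why the slide singles out tokens like "Cl" and "[nH]".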
  51. 51. The generative process
  52. 52. Reinforcement learning: learning from doing. Cycle: Action; Reward; Update behaviour. Agent: Design molecule. Active? Good DMPK? Synthetically accessible? Make more like this? Make something else instead?
  53. 53. AI live: Create Structures Similar to Celecoxib • Key Message • RNN generates structures similar to Celecoxib • Rapid sampling! • Average score describes how many learning steps are required to reach similar compounds
  54. 54. Some misconceptions about de novo RNN generated molecules: “The molecules are not diverse”; “The molecules are not synthetic feasible”. Answer: The generated molecules follow the properties of the dataset used as prior. Segler et al., ACS Central Sci. 2018, 4, 120-131; Ertl et al., arXiv:1712.07449. [Figure panels: Diversity; Synthetic feasibility]
  55. 55. “Cambrian explosion” of different DL based molecular de novo generation methods. PyTorch + RDKit + ChEMBL => anyone with a computer can contribute => Benchmarking is urgently needed
  56. 56. Which benchmarks? What are the relevant questions? • Does the same algorithm work best for both scaffold hopping and lead series optimization? • Which algorithm samples the underlying chemical space most completely? • Which algorithm zooms most efficiently to the most interesting regions of chemical space? • Which is the best way to describe molecules, strings or graphs?
  57. 57. Benchmarks published by the scientific community • MOSES (Polykovskiy et al.): diversity and quality of generated molecules • Arus-Pous et al.: complete sampling of the relevant chemical space • Klambauer et al., J. Chem. Inf. Mod. 2018, 58, 1736: distribution between generated and real molecules • GuacaMol (Brown et al.): efficient optimisation of a specific property
  58. 58. Artificial Intelligence Guided Drug Design Platform. AI Design Platform: generation of novel chemical space; reaction & synthesis prediction; desirability function (Σ IC50, LogP, novelty, etc.); profiling; iterations. iLAB: fully automated DMTA cycle (Design, Make, Test, Analyse)
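The "desirability function Σ IC50, LogP, Novelty etc." on this slide can be read as a weighted sum of per-property scores. The sketch below is one possible form under stated assumptions: linear ramps map each property to [0, 1], and the thresholds, the LogP optimum of 2.5, and the weights are all hypothetical values chosen for illustration, not the function used in the platform.

```python
def ramp(value, low, high):
    """Linear desirability: 0 at or below `low`, 1 at or above `high`."""
    if high <= low:
        raise ValueError("high must be greater than low")
    t = (value - low) / (high - low)
    return max(0.0, min(1.0, t))


def desirability(pic50, logp, novelty, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of per-property desirabilities, normalised to [0, 1].

    All thresholds and weights here are illustrative assumptions.
    """
    d_potency = ramp(pic50, 5.0, 8.0)                # higher predicted potency is better
    d_logp = 1.0 - ramp(abs(logp - 2.5), 0.0, 2.5)   # best near a hypothetical optimum of 2.5
    d_novelty = ramp(novelty, 0.0, 1.0)              # e.g. 1 - similarity to known compounds
    parts = (d_potency, d_logp, d_novelty)
    return sum(w * d for w, d in zip(weights, parts)) / sum(weights)


score = desirability(pic50=6.5, logp=2.5, novelty=0.5)
print(round(score, 3))
```

A single scalar of this kind is what a reinforcement-learning agent or a ranking step needs in order to trade potency, physical properties, and novelty against each other within one design iteration.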
  59. 59. 2018 Proof-of-Principle Pilot Study. Timeline, ~2 months between iterations, with constant re-learning and training: 1st iteration (Novelty); 2nd iteration (Novelty); 3rd iteration (Expansion); 4th iteration (Chemistry Automation library). 1 • Novelty key goal • Crowded IP space • Lots of available data • Selectivity • New promising series identified. 2 • Selectivity key goal • Novelty • Several promising series identified. 3 • Optimising HI series • Tool compound • Optimization successful
  60. 60. Lessons from pilot study • It works! • Novel scaffolds were identified in crowded chemical space • Compound series could be efficiently optimised • Affinity and ADME predictions are still bottlenecks • Too many ideas might make prioritization for synthesis challenging • Chemistry resources need to be frontloaded • Optimisation under constraints might lead to molecules that are difficult to synthesize
  61. 61. How can we improve affinity prediction? • Synergize with automation • Better Machine Learning Models • Access to more data (for instance IMI2 Call 14 Topic 3) • Experimental descriptors • Graph convolution, include protein based information • Multi-task modelling • Matrix factorization with side information • Free energy calculations • Progress in speed • Combine with machine learning • Confidence estimation • Conformal prediction • Bayesian methods • Benchmarking • Public Chemogenomics set available (ChEMBL, Excape-DB, Pidgin) • Blind competitions (SAMPL, D3R)
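Of the confidence-estimation methods listed on this slide, conformal prediction is the simplest to sketch. The example below shows split (inductive) conformal regression with absolute residuals as the nonconformity score. It is a generic textbook form under the usual exchangeability assumption, not the specific method used at AstraZeneca; the calibration values are made-up numbers.

```python
import math


def conformal_half_width(calib_observed, calib_predicted, alpha=0.2):
    """Half-width of a prediction interval with target coverage 1 - alpha.

    Uses absolute residuals on a held-out calibration set as nonconformity
    scores and returns the ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = sorted(abs(o - p) for o, p in zip(calib_observed, calib_predicted))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this confidence level
    return scores[rank - 1]


# Illustrative calibration data (e.g. pIC50 values), not from the talk.
calib_observed = [5.0, 6.0, 7.0, 8.0]
calib_predicted = [5.1, 5.7, 7.2, 8.5]
half_width = conformal_half_width(calib_observed, calib_predicted, alpha=0.25)

new_prediction = 6.4
interval = (new_prediction - half_width, new_prediction + half_width)
print(half_width, interval)
```

The calibration compounds must not have been used to train the underlying model; otherwise the residuals are too optimistic and the intervals too narrow.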
  62. 62. Will ML/AI revolutionize drug design? My personal opinion(s) • Only time will tell… • The last commonly agreed revolution was the introduction of DMPK departments in the 90s, so the bar is high • ML/AI, like other promising technologies (for instance PROTACS), warrants further investments • More data, automation and ability to learn make ML/AI bound to have larger impact on drug design in the future • During my 19 years in industry it has never been as exciting to work with in silico drug design
  63. 63. Acknowledgements. Discovery Sciences CompChem ML/AI Team: Thierry Kogej, Hongming Chen, Isabella Feierberg, Atanas Patronov, Esben Jannik Bjerrum, Preeti Iyer, Jiangming Sun (Postdoc 2015-2017), Noe Sturm (Postdoc 2017-2018), Philipp Buerger (Postdoc 2017-2020), Jiazhen He (Postdoc 2019-2022), Rocio Mercado (Postdoc 2018-2021), Thomas Blaschke (PhD student 2017-2018), Josep Arus Pous (PhD student 2018-2019), Michael Withnall (PhD student 2018-2019), Oliver Laufkötter (PhD student 2018-2019), Laurent David (PhD student 2018-2019), Ave Kuusk (PhD student 2016-2019), Marcus Olivecrona (AZ GradProgram 2017), Alexander Aivazidis (AZ GradProgram 2018), Dhanushka Weerakoon (AZ GradProgram 2018-2019), Panagiotis-Christos Kotsias (AZ AI GradProgram 2018-2019), Edvard Lindelöf (Master Thesis Student 2018-2019), Simon Johansson (Master Thesis Student 2019), Oleksii Prykhodko (Master Thesis Student 2019). Academic Collaborators: Marwin Segler (Munster), Juergen Bajorath (Bonn), Jean-Louis Reymond (Bern), Andreas Bender (Cambridge), Sepp Hochreiter (Linz), Gunther Klambauer (Linz), Sami Kaski (Helsinki). Discovery Sciences: Garry Pairaudeau, Clive Green, Lars Carlsson, Nidhal Selmi. DSM AI Team: Ernst Ahlberg, Suzanne Winiwarter, Ioana Oprisiu, Ruben Buendia (Postdoc 2018). PharmSci: Per-Ola Norrby. 2018 PoP Pilot Study: Werngard Czechtizky, Ina Terstiege, Christian Tyrchan, Anders Johansson, Jonas Boström, Kun Song, Alex Hird, Neil Grimster, Richard Ward, Jeff Johannes
  64. 64. Confidentiality Notice. This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and remove it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the contents of this file is not permitted and may be unlawful. AstraZeneca PLC, 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA, UK, T: +44(0)203 749 5000
  65. 65. Utilize the GDB-13 database (975 million compounds). If we train with 1 million compounds and sample 2 billion, what will we get? (Josep Arus)
  66. 66. Utilize the GDB-13 database: 80% of the 2B sampled molecules are within GDB-13; 70% of GDB-13 is sampled (Josep Arus)
  67. 67. Utilize the GDB-13 database: long tail distribution, 99.5% of molecules sampled at least once; molecules with uncommon substrings are sampled less often (Josep Arus)
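One way to read the coverage figures on these slides is against an idealised baseline. The sketch below assumes a hypothetical perfect sampler that draws uniformly, with replacement, from the whole database; under that assumption the expected fraction of distinct entries seen after k draws from N items is 1 - (1 - 1/N)^k. The database size and sample count are taken from the slides; the uniform sampler is an assumption made only for comparison.

```python
import math


def expected_coverage(space_size, n_samples):
    """Expected fraction of a space seen after uniform sampling with replacement.

    Computes 1 - (1 - 1/N)^k in a numerically stable way.
    """
    return 1.0 - math.exp(n_samples * math.log1p(-1.0 / space_size))


ideal = expected_coverage(975_000_000, 2_000_000_000)
print(f"{ideal:.3f}")
```

Under this idealisation roughly 87% of the database would be covered by 2 billion samples, which gives a ceiling against which the 70% reported on the slide can be compared.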
  68. 68. Getting Involved • Suggest future webinar topics & speakers • Datathon engagement: share and collaborate • Centre of Excellence Community • Planning for London March 2019 • New project idea groups • Register or involve colleagues
  69. 69. Poll Question 3: Where do you see the biggest benefits of AI / ML in Drug Design? A. Finding novel chemical compounds (unbiased) B. Using full breadth of available data (ADME, Assay, Target etc) C. Quicker cycle time & speed to lead compound(s) D. Ability to cope with data breadth & volume E. Other
  70. 70. Panel & Audience Q&A. Please use the Question function in GoToWebinar
  71. 71. Upcoming Webinars. Future webinars will focus on: further examples of AI in Drug Design and downstream impact; ethics and AI; imaging and AI in life science. Please suggest other examples
  72. 72. @pistoiaalliance Thank You