AI & ML in Drug Design: Pistoia Alliance CoE

26 February, 2019
AI in Drug Design
Pistoia Alliance Centre of Excellence
for AI in Life Sciences
Moderators: Vladimir Makarov and Nick Lynch

This webinar is being recorded

Poll Question 1:
Are you or your organisation using AI /
ML in Drug Design?
A. Yes, already
B. Plan to do in next 12 months
C. Plan in next 12-24 months
D. No plans

©PistoiaAlliance
Introduction to Today’s Speakers
Prof Alex Tropsha
Associate Dean for
Pharmacoinformatics and data
science
K.H. Lee distinguished professor
Dr Ola Engkvist
Associate Director
Discovery Sciences
AstraZeneca

Alexander Tropsha
UNC Eshelman School of
Pharmacy
Machine learning, text mining, and
AI approaches for drug discovery
and repurposing

The ultimate dream of a
computational chemist

~10^6 – 10^9
molecules
VIRTUAL
SCREENING
CHEMICAL
STRUCTURES
CHEMICAL
DESCRIPTORS
PROPERTY/
ACTIVITY
PREDICTIVE
QSAR MODELS
Confirmed inactives
QSAR
MAGIC
Confirmed
actives
CHEMICAL DATABASE
The chief utility of computational models:
Annotation of new compounds
Varnek, A., Tropsha, A. (Eds) Chemoinformatics
Approaches to Virtual Screening, RSC
Publishing, Cambridge, UK, 2008

Datasets are represented by a matrix
of molecular descriptors
Samples
(Compounds)
Variables (descriptors)
X1 X2 ... Xm
1 X11 X12 ... X1m
2 X21 X22 ... X2m
... ... ... ... ...
n Xn1 Xn2 ... Xnm

Quantitative
Structure
Activity
Relationships
[Figure: each compound (structure drawing) is a row with its DESCRIPTORS and its ACTIVITY; activity values shown: 0.613, 0.380, -0.222, 0.708, 1.146, 0.491, 0.301, 0.141, 0.956, 0.256, 0.799, 1.195, 1.005]
Thousands of molecular descriptors
are available for organic compounds
constitutional, topological, structural,
quantum mechanics based, fragmental,
steric, pharmacophoric, geometrical,
thermodynamic, conformational, etc.
- Building of models using
machine learning methods
(NN, SVM etc.);
- Validation of models
according to numerous
statistical procedures, and
their applicability domains.
Credit: Denis Fourches
[Figure: scatter plot of Predicted LogED50 vs. Actual LogED50 (ED50 = mM/kg) for the training set, with a linear fit; structure drawings of the compounds alongside]
Tropsha, A. Best Practices for QSAR Model Development,
Validation, and Exploitation Mol. Inf., 2010, 29, 476

QSAR Modeling Workflow: the
importance of rigorous validation
Datasets
Split into modeling set and external set (5-fold external validation)
Combi-QSAR modeling on the modeling set:
- Modeling methods: K-Nearest Neighbors (kNN), Random Forest (RF), Support Vector Machines (SVM)
- Descriptors: Dragon, MOE
Internal validation and model selection
An ensemble of QSAR models
Evaluation of external performance on the external set
Virtual screening (with AD threshold)
Experimental confirmation
Figure courtesy of L. Zhang
Tropsha, A. Best Practices for QSAR Model Development, Validation,
and Exploitation Mol. Inf., 2010, 29, 476 – 488
Fully implemented on CHEMBENCH.MML.UNC.EDU
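The external validation step in this workflow can be sketched as follows. This is a minimal illustration, not the Chembench implementation: the function names, the toy descriptor matrix, and the plain kNN regressor are all assumptions made for the example. Each fold is held out once as the external set and is never used when building the model.

```python
import random

def five_fold_external_splits(n_compounds, n_folds=5, seed=0):
    """Yield (modeling_idx, external_idx); each fold is the external set once."""
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        external = folds[k]
        modeling = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield modeling, external

def knn_predict(x_train, y_train, x_query, k=3):
    """Predict activity as the mean over the k nearest training compounds."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, x_query)), y)
        for x, y in zip(x_train, y_train)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical toy data: rows are compounds, columns are descriptors.
X = [[float(i), float(i % 3)] for i in range(20)]
y = [0.1 * i for i in range(20)]

external_errors = []
for modeling_idx, external_idx in five_fold_external_splits(len(X)):
    x_mod = [X[i] for i in modeling_idx]
    y_mod = [y[i] for i in modeling_idx]
    for i in external_idx:
        # The external compound is predicted from the modeling set only.
        external_errors.append(abs(knn_predict(x_mod, y_mod, X[i]) - y[i]))

mae_external = sum(external_errors) / len(external_errors)
print(f"External MAE over 5 folds: {mae_external:.3f}")
```

In a real study the internal validation and model selection would happen inside each modeling set, so that the external fold stays untouched until the final evaluation.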

SMILES: a compact way to encode,
store, and share chemical data

Representation of molecules
by the SMILES language
https://commons.wikimedia.org/w/index.php?curid=2556784

ReLeaSE* design principles: learning
and exploiting structural linguistics of
SMILES notation
• SMILES notations reflect rules of Chemistry
• SMILES notation may embed linguistic rules
• Neural nets could learn both of the above types of rules
• This knowledge can be transformed into the generation of
new SMILES corresponding to novel chemically feasible
molecules (generative model)
• One can build QSAR models based solely on SMILES
notation (predictive model)
• QSAR models can be used as a reward function for
reinforcement learning to bias the design of novel libraries
*Popova, M., Isayev, O., and Tropsha, A. "Deep reinforcement learning for de-novo drug design."
Science Advances, 2018 Jul 25;4(7):eaap7885.

NLP/Text mining: directly learn
low-dimensional word vectors
∙ In deep learning models, a word is represented as a dense vector
∙ Word vectors form the basis for deep learning methods
∙ Objective: predict word based on the context
Mikolov, T. et al. Distributed representations of words and phrases and their compositionality.
Advances in Neural Information Processing Systems, 2013, pp. 3111-3119.
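The objective "predict word based on the context" can be made concrete by listing the training pairs such a model would see. The sketch below is only an illustration: the function name and the window size are assumptions, and it builds the pairs without training any vectors.

```python
def context_target_pairs(tokens, window=2):
    """Return (context, target) pairs: the model is asked to predict
    `target` from the surrounding `context` tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

sentence = "the grass is green".split()
for context, target in context_target_pairs(sentence, window=1):
    print(context, "->", target)
```

The same pairing works on SMILES tokens instead of words, which is the bridge to the generative models discussed in the rest of the deck.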

Design of the ReLeaSE* method
(Reinforcement Learning for Structural Evolution)
Elements of the
thought cycle
(molecules -> models -> molecules):
• Generate chemically
feasible SMILES
• Develop SMILES-
based QSAR model
• Employ QSAR model
to bias library
generation
• Produce new
SMILES
*Popova, Mariya, Olexandr Isayev, and Alexander Tropsha. "Deep reinforcement learning for de-novo drug design."
arXiv preprint arXiv:1711.10907 (2017).

ReLeaSE:* Disruptive Innovation of
Conventional Computational Drug
Discovery Pipeline
Traditional Workflow
Learn from target-specific data (300-500 molecules)
Target-specific models
Virtual screening
Internal/public databases
Selection and testing of known molecules
ReLeaSE Workflow
Learn from all data (2M molecules)
Target-specific and property models / Reinforcement learning
Generation of novel molecules
Selection and testing of novel molecules
Hits with desired properties

Disruptive innovation in QSAR: Can we avoid
descriptor generation altogether and besides,
predict new structures?
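[Figure: character-by-character training of the generative model on 1.5M molecules from ChEMBL. Example string: <START>c1ccc(O)cc1<END>; a softmax loss is added at each predicted character; loop on "Did the training converge?" (NO: continue, YES: stop); sample output: c1ccc(O)cc1]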

Are we making legitimate SMILES?
AI learning system
95% valid, chemically feasible molecules
SMILES strings
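A validity rate like the 95% quoted above is measured by parsing every generated string. The sketch below is a deliberately crude stand-in, assuming that balanced brackets, balanced parentheses, and paired single-digit ring closures are enough to flag obvious failures. A real check would parse each string with a cheminformatics toolkit and would also test valence and aromaticity; the function name and the sample strings are hypothetical.

```python
def looks_like_valid_smiles(smiles):
    """Crude syntax check only: balanced (), balanced [], paired ring digits.
    Two-digit ring closures (%nn) and chemistry rules are not handled."""
    depth = 0
    in_bracket = False
    open_rings = set()
    for ch in smiles:
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif in_bracket:
            continue  # isotope/charge digits inside [...] are not ring bonds
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens, second closes
    return bool(smiles) and depth == 0 and not in_bracket and not open_rings

# Hypothetical generator output, two well-formed and two truncated strings.
generated = ["c1ccc(O)cc1", "CC(=O", "COc1ccccc1OCCO", "c1ccccc"]
valid_fraction = sum(map(looks_like_valid_smiles, generated)) / len(generated)
print(f"Syntactically plausible: {valid_fraction:.0%}")
```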

Fc1ccc2c(Nc3ccc(F)c(F)c3)ncnc2c1
Generative model
Reinforcement learning for
chemical design
Predictive model

COMPOUNDS (SMILES)        ACTIVITY
O=C(C)Oc1ccccc1C(=O)O     0.531
CCOc1cc(C)ccc1OCC=CF      1.299
COc1ccccc1OCCO            0.946
CC(N)Sc1ccc(Cl)nc1        -0.218
COC(=O)NCc1ccccc1Cl       0.017
QSAR
Smile-ification of QSAR!
Quantitative Smiles – Activity Relationships

QSAR modeling using Smiles strings
only*
RMSE: 0.57 0.53
MAE: 0.37 0.35
R² (ext): 0.90 0.91
CN2C(=O)N(C)C(=O)C1=C2N=CN1C
Neural
Network
Property prediction
Predicted LogP
Observed LogP
5CV RF model with
DRAGON7 Descriptors
5CV NN model with
SMILES directly
*LogP data for ~16K molecules from PHYSPROP (srcinc.com), Toxcast Dashboard
(https://comptox.epa.gov/dashboard), and others.

Generative model
Predictive model: ACTIVE!
chemical design

Generative model
Predictive model
chemical design

FC(F)COc1ccc2c(Nc3ccc(Cl)c(Cl)c3)ncnc2c1
Generative model
Predictive model
chemical design

Generative model
Predictive model: INACTIVE!
chemical design

Results: Synthetic accessibility
score* of the designed libraries
*Ertl, Peter, and Ansgar Schuffenhauer. "Estimation of synthetic accessibility score of drug-like molecules based on molecular
complexity and fragment contributions." Journal of cheminformatics 1.1 (2009): 8.

PoC: Physical properties
LogP (10K compounds); T melt, °C (10K compounds)

Predicted pIC50 for JAK2 kinase
CAS 236-084-2
(buffer reagent)
ZINC37859566
New molecule
SIMILAR SCAFFOLDS
NEW CHEMOTYPE
JAK2 Kinase inhibition
Untrained data distribution
Maximized property distribution
Minimized property distribution

Target predictions for generated
compounds using SEA*
*Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK. Relating protein pharmacology by ligand
chemistry. Nat Biotech 25 (2), 197-206 (2007).

Practical implementation workflow
• Select a target
• Train ReLeaSE to generate new target-specific
molecules; collect computational hits
• Identify a fraction of hits available in commercial
libraries; purchase and test selected hits
• Following successful validation, order NCE synthesis
and testing in vitro and in vivo and, if successful, file
for IP protection

Summary
• We propose an innovative de novo drug discovery
technology termed Reinforcement Learning for
Structural Evolution (ReLeaSE)*
• ReLeaSE is a product of convergence of fields as
disparate as cheminformatics and text mining united
by AI
• Unlike most of the current technologies, ReLeaSE
enables the discovery of new chemical entities with the
desired bioactivity and drug-like properties
Patent application filed (application # 62/535069, filed by UNC, 07/2018)

General Summary
• Accumulation of Big Data in all areas of research creates
previously unachievable opportunities for using ML and AI
approaches
– However, primary data must be handled with extreme care (curation,
reproducibility)
• Exciting developments in computational chemistry
– Critical shift from discovery to design and AI-driven robotics
• Rapid progression from the use of computational modeling
for decision support to using models to guide experimental
research
– Critical importance of rigorous and comprehensive model validation
using truly external data
• Natural progression toward automated chemical labs driven
by AI

Principal Investigator
Alexander Tropsha
Research Professors
Alexander Golbraikh
Olexander Isayev
Eugene Muratov
Graduate students
Sherif Faraq
Kyle Bowers
Maria Popova
Andrew Thieme
Dan Korn
Phil Gusev
Postdoctoral Fellows
Vinicius Alves
Joyce Borba
MAJOR FUNDING
NIH
- 1U01CA207160
- R01-GM114015
- 5U54CA198999
- 1OT3TR002020
ONR
- N00014-16-1-2311
Acknowledgements

Poll Question 2:
What are the biggest barriers to machine
learning adoption in Drug Design? (multi
select)
A. Lack of access to AI/ML Skills
B. Access to Data
C. Quality of Data
D. Access to ML & AI Tools
E. Other

Artificial Intelligence in Drug Design
Ola Engkvist, Hit Discovery, Discovery Sciences, IMED Biotech Unit, AstraZeneca, Gothenburg, Sweden
February 26 2019, PISTOIA Webinar

Drug Design
What to make next? How to make it?
De novo design
Multi-parameter scoring function
Retrosynthesis

What is different now?
Augmented
design
Autonomous
design
Automatic
design
de novo molecular
design
Synthesis prediction
Automation
Data generation

It takes two to tango
Artificial Intelligence Chemistry Automation

AI/ML for drug design science @AZ

Neural Networks & Deep Learning
• Neural Networks known for decades
• Inputs, Hidden Layers, Outputs
• Single layer NNs have been used in QSAR
modelling for years
• Recent Applications use more complex
networks such as
• Multi-layer Feed-Forward NNs
• Convolutional NNs
• biological image processing
• Auto-encoder NNs
• Adversarial NNs
• Recurrent NNs

Why? Generation of Novel Compounds in the 10^60 Chemical Space!
Where's the impact?
• Use for de novo Molecular Design
• Scaffold Hopping
• Novelty
• Virtual Screening
• Library Design
10^60
10^10 - 10^12

Natural language generation and molecular structure generation
• Can we borrow concepts from natural language processing and
apply to SMILES description of molecular structures to generate
molecules?
• Conditional probability distributions given context
• P(green | is, grass, The)
• P(O | =, C, C)
The grass is ?
C C = ?
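The conditional probability P(O | =, C, C) can be estimated from counts in the simplest possible language model, a character n-gram. The sketch below assumes a hypothetical four-string corpus and uses "^" and "$" as start and end markers, on the assumption that neither character occurs in the corpus. The neural models discussed in this talk replace the count table with a recurrent network, but the quantity being modelled is the same.

```python
from collections import Counter, defaultdict

def fit_ngram(smiles_list, order=3):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for smi in smiles_list:
        padded = "^" * order + smi + "$"  # assumed start/end markers
        for i in range(order, len(padded)):
            counts[padded[i - order:i]][padded[i]] += 1
    return counts

def next_char_probs(counts, context):
    """Return P(next character | context) as a dict; empty if unseen."""
    total = sum(counts[context].values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts[context].items()}

corpus = ["CC=O", "CC=C", "CC=O", "CCO"]  # hypothetical toy corpus
model = fit_ngram(corpus, order=3)
probs = next_char_probs(model, "CC=")
print(probs)
```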

Tokenization of SMILES
• Tokenize combinations of characters like “Cl” or “[nH]”
• Represent the characters as one-hot vectors
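The two steps on this slide can be sketched as below. The regular expression is an assumption chosen for illustration: it keeps bracket atoms such as "[nH]", the two-letter halogens "Cl" and "Br", and two-digit ring closures as single tokens, and falls back to single characters otherwise. The vocabulary here is built from one molecule only to keep the example small.

```python
import re

# Assumed token pattern: bracket atoms, Br, Cl, %nn ring closures, else one char.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into tokens, keeping multi-character atoms whole."""
    return TOKEN_PATTERN.findall(smiles)

def one_hot(tokens, vocabulary):
    """Encode each token as a vector with a single 1 at its vocabulary index."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    vectors = []
    for tok in tokens:
        row = [0] * len(vocabulary)
        row[index[tok]] = 1
        vectors.append(row)
    return vectors

tokens = tokenize("Clc1cc[nH]c1")
vocab = sorted(set(tokens))
encoded = one_hot(tokens, vocab)
print(tokens)
print(vocab)
```

Without the multi-character rule, "Cl" would be read as a carbon followed by an unknown "l", which is why tokenization comes before encoding.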

Reinforcement learning
Learning from doing
Action Reward Update behaviour
Design molecule
Active?
Good DMPK?
Synthetically accessible?
Make more like this?
Make something else instead?
Agent
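The action, reward, update loop can be illustrated with a toy agent. This sketch assumes a REINFORCE-style policy-gradient update, which the slide does not name, and it replaces the molecule generator with a choice among three hypothetical designs whose fixed rewards stand in for the combined answer to "Active? Good DMPK? Synthetically accessible?".

```python
import math
import random

def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce(rewards, steps=2000, lr=0.1, seed=0):
    """Toy policy gradient: pick a design, observe its reward, and shift
    probability towards designs that score above a running baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)  # running average of rewards
        for a in range(len(logits)):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad  # "update behaviour"
    return softmax(logits)

# Hypothetical scores for three candidate designs.
final_policy = reinforce([0.1, 0.3, 0.9])
print([round(p, 3) for p in final_policy])
```

After training, most of the probability mass should sit on the highest-scoring design, which is the "make more like this" behaviour described above.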

AI live: Create Structures Similar to Celecoxib
• Key Message
• RNN generates
structures similar
to Celecoxib
• Rapid sampling!
• Average score
describes how
many learning
steps are required
to reach similar
compounds

Some misconceptions about de novo RNN generated molecules
“The molecules are not diverse”
“The molecules are not synthetically feasible”
Answer: The generated molecules follow the properties of the dataset used as prior
Segler et al ACS Central Sci. 2018, 4, 120-131 Ertl et al arXiv:1712.07449
Diversity Synthetic feasibility

“Cambrian explosion” of different DL based molecular de novo generation
methods
PyTorch + RDKit + ChEMBL => anyone with a computer can contribute =>
Benchmarking is urgently needed

Which benchmarks? What are the relevant questions?
• Does the same algorithm work best for both
scaffold hopping and lead series optimization?
• Which algorithm samples the underlying
chemical space most completely?
• Which algorithm zooms most efficiently to the
most interesting regions of chemical space?
• Which is the best way to describe molecules,
strings or graphs?

Benchmark published by the scientific community
• MOSES Polykovskiy et al
• https://arxiv.org/abs/1811.12823
• Diversity and quality of generated molecules
• Arus-Pous et al
• https://chemrxiv.org/articles/Exploring_the_GDB13_Chemical_Space_Using_Deep_Generative_Models/7172849
• Complete sampling of the relevant chemical space
• Klambauer et al
• J. Chem. Inf. Mod. 2018, 58, 1736
• Distribution between generated and real molecules
• GuacaMol Brown et al
• https://arxiv.org/abs/1811.09621
• Efficient optimisation of a specific property

Artificial Intelligence Guided Drug Design Platform
Generation of Novel Chemical
Space
Reaction & Synthesis
Prediction
iLAB
DMTA
Make
Test
Analyse
Design
Desirability
function
Σ IC50, LogP,
Novelty etc.
Iterations
Profiling
AI Design
Platform
Fully Automated
DMTA Cycle
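The desirability function in this platform combines several predicted properties into one score. A minimal sketch follows, assuming a weighted average of per-property scores scaled to [0, 1]. Every threshold, weight, and function name below is hypothetical and chosen only to make the example runnable; the slide does not give the actual form used at AstraZeneca.

```python
def ramp(value, low, high):
    """Scale value linearly to [0, 1]: 0 at or below low, 1 at or above high."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))

def desirability(pic50, logp, novelty, weights=(0.5, 0.3, 0.2)):
    """Weighted average of three assumed component scores."""
    scores = (
        ramp(pic50, 5.0, 8.0),                    # assumed: potency saturates at pIC50 8
        1.0 - ramp(abs(logp - 2.5), 0.0, 2.5),    # assumed: LogP near 2.5 preferred
        ramp(novelty, 0.0, 1.0),                  # assumed: novelty already in [0, 1]
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

print(round(desirability(pic50=7.0, logp=3.0, novelty=0.6), 3))
```

A score of this kind is what the design step would try to maximise on each iteration of the DMTA cycle, with the weights adjusted to the goal of the project.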

2018 Proof-of-Principle Pilot Study
1st iteration
Novelty
3rd iteration
Expansion
2nd iteration
Novelty
4th iteration
Chemistry Automation
library
~2 months ~2 months ~2 months
Constant re-learning and training
1
• Novelty key goal
• Crowded IP space
• Lots of available data
• Selectivity
• New promising series
identified
2
• Selectivity key goal
• Novelty
• Several promising
series identified
3
• Optimising HI series
• Tool compound
• Optimization successful

Lessons from pilot study
• It works!
• Novel scaffolds were identified in crowded chemical space
• Compound series could be efficiently optimised
• Affinity and ADME predictions are still bottlenecks
• Too many ideas might make prioritization for synthesis challenging
• Chemistry resources need to be frontloaded
• Optimisation under constraints might lead to molecules that are difficult to synthesize

How can we improve affinity prediction?
• Synergize with automation
• Better Machine Learning Models
• Access to more data (for instance IMI2 Call 14 Topic 3)
• Experimental descriptors
• Graph convolution, include protein based information
• Multi-task modelling
• Matrix factorization with side information
• Free energy calculations
• Progress in speed
• Combine with machine learning
• Confidence estimation
• Conformal prediction
• Bayesian methods
• Benchmarking
• Public Chemogenomics set available (ChEMBL, Excape-DB, Pidgin)
• Blind competitions (SAMPL, D3R)
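Of the confidence estimation options listed for affinity prediction, conformal prediction is simple enough to sketch in a few lines. The example assumes the split (inductive) variant for regression: absolute residuals on a held-out calibration set give a half-width that is added around each new point prediction. The function names and the residual values are hypothetical.

```python
import math

def conformal_radius(calibration_residuals, confidence=0.8):
    """Half-width of the prediction interval from calibration residuals.
    Uses the ceil((n + 1) * confidence)-th smallest absolute residual."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        return float("inf")  # too few calibration points for this confidence
    return scores[rank - 1]

def prediction_interval(point_prediction, radius):
    """Symmetric interval around a model's point prediction."""
    return point_prediction - radius, point_prediction + radius

# Hypothetical residuals (observed minus predicted pIC50) on calibration compounds.
residuals = [0.4, -0.1, 0.3, -0.2]
radius = conformal_radius(residuals, confidence=0.5)
low, high = prediction_interval(6.0, radius)
print(f"pIC50 in [{low:.2f}, {high:.2f}] at 50% confidence")
```

With only four calibration compounds a high confidence level cannot be supported, which the function signals by returning an infinite radius.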

Will ML/AI revolutionize drug design?
My personal opinion(s)
• Only time will tell….
• The last commonly agreed revolution was the introduction of DMPK
departments in the 90s, so the bar is high
• ML/AI, like other promising technologies (for instance PROTACs), warrants
further investment
• More data, automation and the ability to learn make ML/AI bound to have
a larger impact on drug design in the future
• During my 19 years in industry it has never been as exciting to work with in
silico drug design

Acknowledgements
Discovery Sciences CompChem ML/AI Team
Thierry Kogej
Hongming Chen
Isabella Feierberg
Atanas Patronov
Esben Jannik Bjerrum
Preeti Iyer
Jiangming Sun (Postdoc 2015-2017)
Noe Sturm (Postdoc 2017-2018)
Philipp Buerger (Postdoc 2017-2020)
Jiazhen He (Postdoc 2019-2022)
Rocio Mercado (Postdoc 2018-2021)
Thomas Blaschke (PhD student 2017-2018)
Josep Arus Pous (PhD student 2018-2019)
Michael Withnall (PhD student 2018-2019)
Oliver Laufkötter (PhD student 2018-2019)
Laurent David (PhD student 2018-2019)
Ave Kuusk (PhD student 2016-2019)
Marcus Olivecrona (AZ GradProgram 2017)
Alexander Aivazidis (AZ GradProgram 2018)
Dhanushka Weerakoon (AZ GradProgram 2018-2019)
Panagiotis-Christos Kotsias (AZ AI GradProgram 2018-2019)
Edvard Lindelöf (Master Thesis Student 2018-2019)
Simon Johansson (Master Thesis Student 2019)
Oleksii Prykhodko (Master Thesis Student 2019)
Academic Collaborators
Marwin Segler (Munster)
Juergen Bajorath (Bonn)
Jean-Louis Reymond (Bern)
Andreas Bender (Cambridge)
Sepp Hochreiter (Linz)
Gunther Klambauer (Linz)
Sami Kaski (Helsinki)
Discovery Sciences
Garry Pairaudeau
Clive Green
Lars Carlsson
Nidhal Selmi
DSM AI Team
Ernst Ahlberg
Suzanne Winiwarter
Ioana Oprisiu
Ruben Buendia (Postdoc 2018)
PharmSci
Per-Ola Norrby
2018 PoP Pilot Study
Werngard Czechtizky
Ina Terstiege
Christian Tyrchan
Anders Johansson
Jonas Boström
Kun Song
Alex Hird
Neil Grimster
Richard Ward
Jeff Johannes


Utilize the GDB-13 database (975 Million compounds)
If we train with 1 million compounds and sample 2 billion, what will we get?
Josep Arus
https://chemrxiv.org/articles/Exploring_the_GDB-13_Chemical_Space_Using_Deep_Generative_Models/7172849

Utilize the GDB-13 database
80% of the 2B sampled molecules are within GDB-13
70% of GDB-13 sampled
Josep Arus

Utilize the GDB-13 database
Long tail distribution, 99.5% of molecules sampled at least once
Molecules with uncommon substrings sampled less often
Josep Arus

Getting Involved
• Suggest Future webinar topics & speakers
• Datathon engagement – share and collaborate
• Centre of Excellence Community
• Planning for London March 2019
• New project idea groups
• register or involve colleagues

Poll Question 3:
Where do you see the biggest benefits of AI / ML in Drug
Design?
A. Finding novel chemical compounds (unbiased)
B. Using full breadth of available data (ADME, Assay, Target etc)
C. Quicker cycle time & speed to lead compound(s)
D. Ability to cope with data breadth & volume
E. Other

Upcoming Webinars
Future webinars will focus on:
Further examples of AI in Drug Design and
downstream impact
Ethics and AI
Imaging and AI in Life Science
Please suggest other examples

info@pistoiaalliance.org @pistoiaalliance www.pistoiaalliance.org
Thank You
