Computational Drug Discovery: Machine Learning for Making Sense of Big Data in Drug Discovery

Associate Professor Dr. Chanin Nantasenamat
 
E-mail: chanin.nan@mahidol.edu
YouTube: http://bit.ly/dataprofessor

About the Speaker
• Research group website at http://codes.bio
• Code and data at http://github.com/chaninn and http://github.com/chaninlab
• YouTube channel called Data Professor, available at http://bit.ly/dataprofessor
• Data Professor Facebook page at http://facebook.com/dataprofessor
Icon made by Freepik from www.flaticon.com

Disease
• The word ‘disease’ is defined by the Cambridge Dictionary as: “illness of people, animals, plants, etc., caused by infection or a failure of health rather than by an accident”
http://static.filmannex.com/users/galleries/294182/19265_fa_rszd.jpg

Drugs
• A ‘drug’ is a biological or chemical entity that can modulate the course of a disease state by interacting with its target protein
• Biological entity (e.g. antibodies)
• Chemical entity (e.g. small molecules)
Natthapon Ngamnithiporn. Image from Freepik. http://www.freepik.com/free-photo/packings-of-pills-and-capsules-of-medicines_1178867.htm

Li et al. BMC Syst Biol 8 (2014) 141.

Drug Discovery Process
• Costs ~2 billion USD
• Takes about 10-15 years
• Failure rate is > 90%
http://drugdiscovery.nd.edu/

Drug Discovery Process
1. Identify a target protein that is key in modulating the disease
2. Screen for ‘hit’ molecules that can inhibit the target protein
3. ‘Hit-to-lead’ and ‘lead optimization’
4. Evaluate pharmacokinetic properties
5. Initiate clinical trials to evaluate safety & dosage; efficacy & side effects; adverse reactions to long-term use
6. Drug reaches the market
Ashburn and Thor. Nature Rev. Drug Discov. 3 (2004) 673-683

https://slideplayer.com/slide/13182763/
From a million to one

Multi-objective optimization
• A drug must not only target the protein of interest but must also possess other properties
• Desirable characteristics of a drug:
1. Binds selectively to the target protein
2. Is absorbed in the stomach (oral drugs)
3. Permeates the gut wall or cell membrane (can reach the target site)
4. Metabolically stable
5. Non-toxic
6. Can be synthesized
• To achieve all these desirable properties, the chemical structure needs to be optimized (an optimal balance must be struck among many factors)

Creating new compounds
• We can look to nature for inspiration (biologically inspired) or use existing drugs as a starting point
• Medicinal chemists optimize existing compounds by modifying them in a process known as bioisosteric replacement (e.g. replacing a hydrogen atom with a halogen atom)
• Cheminformaticians can computationally enumerate a compound library (compound enumeration) using the rules of organic chemistry (taking chemical stability and synthetic feasibility into account)
Icon made by dDara from www.flaticon.com

Molecules
• Molecules can be thought of as a framework of atoms (a molecular graph) in which atoms are vertices and bonds are edges
- Each vertex is typically one of a small set of atoms (C, N, O, F, P, S, Cl or Br)
- Each edge linking two vertices can be a single, double or triple bond
• Compound enumeration as performed by the research group of JL Reymond (Acc Chem Res 2015, 48(3):722-730)
- Molecules of up to 13 atoms ⟶ 977 million possible molecules (~10^9)
- Molecules of up to 17 atoms ⟶ 166 billion possible molecules (~10^11)
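
The molecular-graph idea above can be sketched in a few lines of Python. This is only an illustrative data structure, not the representation used by any enumeration software: the class name `MolecularGraph` and its methods are hypothetical, and hydrogens are assumed to be implicit.

```python
from collections import Counter


class MolecularGraph:
    """Hypothetical minimal molecular graph: atoms are vertices, bonds are edges."""

    def __init__(self):
        self.atoms = []   # element symbol of each vertex (heavy atoms only)
        self.bonds = {}   # (i, j) with i < j  ->  bond order (1, 2 or 3)

    def add_atom(self, element):
        """Add a vertex and return its index."""
        self.atoms.append(element)
        return len(self.atoms) - 1

    def add_bond(self, i, j, order=1):
        """Add an edge; only single, double and triple bonds are allowed."""
        if order not in (1, 2, 3):
            raise ValueError("bond order must be 1, 2 or 3")
        if i == j:
            raise ValueError("an atom cannot be bonded to itself")
        self.bonds[(min(i, j), max(i, j))] = order

    def neighbors(self, i):
        """Indices of the atoms bonded to atom i."""
        return sorted(b if a == i else a for (a, b) in self.bonds if i in (a, b))

    def heavy_atom_count(self):
        return len(self.atoms)

    def heavy_atom_formula(self):
        """Element counts in alphabetical order, e.g. 'C2O2'."""
        counts = Counter(self.atoms)
        return "".join(f"{el}{n if n > 1 else ''}" for el, n in sorted(counts.items()))


# Heavy-atom skeleton of acetic acid: C-C(=O)-O
mol = MolecularGraph()
c1 = mol.add_atom("C")
c2 = mol.add_atom("C")
o1 = mol.add_atom("O")
o2 = mol.add_atom("O")
mol.add_bond(c1, c2, 1)
mol.add_bond(c2, o1, 2)
mol.add_bond(c2, o2, 1)
```

Enumeration then amounts to generating every valid graph of this kind up to a given number of heavy atoms, which is why the counts grow so steeply with atom number.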

Chemical space
• Theoretically possible chemical space as revealed via compound enumeration by the research group of JL Reymond (Acc Chem Res 2015, 48(3):722-730)
- Molecules of up to 13 atoms ⟶ 977 million possible molecules (~10^9)
- Molecules of up to 17 atoms ⟶ 166 billion possible molecules (~10^11)
• Drug space (<500 Da) is estimated to comprise molecules of up to 40 atoms (in some cases, even more) ⟶ roughly 10^60 molecules

Drug Discovery Toolbox
• Combinatorial Chemistry
• Chemical Libraries
• Chemical Space
• High-Throughput Screening
• Property Filters
• Computational Chemistry
• Machine Learning
• QSAR
• Proteochemometrics
• Molecular Modeling
• Molecular Dynamics
• Molecular Docking
Bioactivity
• Bioactivity is the activity elicited by the target protein of interest
• Such target proteins are typically involved in key pathways that influence the course of a disease
• Thus, great attention has been paid to modulating these target proteins
• Primary literature
• Curated databases
- ChEMBL, BindingDB, MOAD, PubChem
• Open innovation
- Pharmaceutical companies are making data publicly available for non-commercial diseases

What can computers do?
• Computers have defeated humans at chess (IBM Deep Blue) and Jeopardy! (IBM Watson)
• Google released a self-driving car
• NASA uses computers to simulate space missions
• Computers are being used to design aircraft and cars
• Supermarkets and shopping malls use our purchase history to analyze and predict our spending behavior
• Why not use them to discover, design and develop new drugs?
• Computers (deep learning) can paint like Van Gogh and Picasso
• Computers can programmatically code music (Sonic Pi)
• Computers can dream
http://www.boredpanda.com/computer-deep-learning-algorithm-painting-masters/

https://storage.googleapis.com/cdn.thenewstack.io/media/2015/07/google-deep-dream-artificial-neural-networks-12.jpg

Why do we need computational
models in drug discovery?
• To discern the structure-activity relationships of a chemical library
• In vitro data are limited, expensive, time-consuming and laborious to obtain
• Computational models can be built quickly to give preliminary predictions of the pharmacokinetics and bioactivity of query compounds
Anuwongcharoen et al. PeerJ 4 (2016) e1958

Questions that can be answered by
computational models
• What target proteins could my compound(s) bind to and modulate?
• Would my compound bind non-specifically to other proteins and thus have off-target activity?
• What types of compounds can bind and modulate the bioactivity of my target protein of interest?
• Are there compounds similar to my query compound that may exert similar binding behavior?
• How does my compound bind to the protein structure of its target?
• How can I modify the structure of my compound to enhance its pharmacokinetics and bioactivity?
Hall et al. Prog Biophys Mol Biol 116 (2014) 82-91.

Overview of Computational Drug Discovery
[Figure: Maximizing computational tools for successful drug discovery. Terms shown in the figure:]
• Disciplines and approaches: Medicinal chemistry, Bioinformatics, Cheminformatics, Computational chemistry, Chemogenomics; Ligand-based, Structure-based, Systems-based, Fragment-based
• Methods and concepts: ADMET, QSAR, Pharmacophore, Statistical molecular design, Molecular modeling, Protein structure prediction (homology/comparative, ab initio), Molecular dynamics, Normal mode analysis, Docking/reverse docking, Binding cavity analysis, Protein–ligand interactome, Protein–protein interactome, Drug target gene expression, Intrinsically disordered proteins, Allo-network drugs, High-throughput synthesis, High-throughput screening, Privileged structures, Bioisostere, Chemoisostere, Scaffold hopping, Sequence alignment, BLAST, Phylogenetic analysis, Biological space, Chemical space, Molecular descriptors, Profiling, Filtering (Lipinski’s rule of 5), Search (molecular similarity; substructure similarity; shape, volume and charge-based similarity), Proteochemometrics, Computational chemogenomics, Graph/network theory, Fragment-based docking, Fragment-based QSAR, Ligand growing
• Databases
- Small molecules: DrugBank, ChEMBL, PubChem, BindingDB, ZINC
- Proteins: PDB, UniProt, SCOP
- Protein-protein: MINT, STITCH, STRING
- Pathway: KEGG, Reactome
Nantasenamat and Prachayasittikul. Expert Opin Drug Discov 10 (2015) 321-329.

Bioinformatics
• Bioinformatics is a discipline entailing
the use of computational approaches to
analyze biological data
‣ Analyze and compare genes, proteins
and genomes
‣ Explore structures and functions of
biomolecules (DNA, protein, lipid and
carbohydrate)
‣ Explore network biology and metabolic
pathways
http://www.gettyimages.com/detail/photo/bioinformatics-background-concept-royalty-free-image/475811932?esource=SEO_GIS_CDN_Redirect
[Figure: binding-site residues labelled I424, L428, F404, R394, E353, A350, D351, L354, P535, W383, L525]
Suvannang et al. Manuscript under preparation.

Cheminformatics
• Cheminformatics is a discipline at the interface of chemistry and computing that enables the analysis of various aspects of chemical structures
‣ Chemical space for investigating molecular similarity/diversity
‣ Molecular descriptors (e.g. MW, LogP, nHBdon, nHBacc) and quantum chemical descriptors (HOMO, LUMO, HOMO-LUMO gap)
Ertl and Rohde. J Cheminf 4 (2012) 12.

Drugs and their precursors
• Fragments: substructures found within a compound (drug)
• Privileged substructures: substructures commonly found in inhibitors/activators (drugs) against several therapeutic targets
• Hits: a small subset of compounds from large chemical libraries, identified by high-throughput screening
• Leads: compounds that have undergone minor structural optimization from hits; leads often undergo further rounds of “lead optimization”
• Drugs: the few leads that have passed rigorous tests (pre-clinical and clinical trials) before reaching the market

Identifying hits
• So how does one go about identifying hit compounds?
- High-throughput screening (experimental and computational)
- Find compounds similar to known actives, since the bioactivity of each compound is not an isolated point (similar chemical structures tend to have similar biological activity)
๏ 30% of compounds similar to known actives are themselves active
https://southernresearch.org/news/nih-contract-high-throughput-screening-for-zika/
Hernández-Santoyo et al. Protein-protein and protein-ligand docking. DOI:10.5772/56376
Martin YC, J Med Chem 2002, 45(19):4350-4358
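
The "find similar compounds" strategy can be illustrated with the Tanimoto coefficient, a widely used similarity measure for molecular fingerprints. The sketch below assumes each fingerprint is simply a set of "on" bit indices; the compound names, the fingerprints and the 0.5 threshold are made up for illustration, and in practice the fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def rank_by_similarity(query_fp, library, threshold=0.0):
    """Return (name, similarity) pairs at or above the threshold, most similar first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints (sets of bit indices), for illustration only
known_active = {1, 2, 3, 4}
library = {
    "cmpd_A": {1, 2, 3, 4},
    "cmpd_B": {1, 2, 3, 9},
    "cmpd_C": {7, 8},
}
candidates = rank_by_similarity(known_active, library, threshold=0.5)
```

Here `candidates` keeps `cmpd_A` and `cmpd_B` and discards `cmpd_C`, which shares no bits with the known active.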

Lead generation (Hit-to-Lead)
• Hits identified from high-throughput screens are transformed into leads by means of limited structural modification (so as to optimize their ADMET properties)
• Generated leads are subjected to further rounds of lead optimization
Fuller N et al. Drug Discov Today 2016, 21(8):1272-1283.

Fragment-based Drug Design
Source: http://practicalfragments.blogspot.com/2011/08/first-fragment-based-drug-approved.html
Zelboraf treats melanoma by inhibiting BRAF.

DeLaBarre B. http://consultingbiochemist.com/2014/12/cone-chemical-space/

Lipinski’s Rule of 5
• Christopher Lipinski (at Pfizer) analyzed a large set of > 2,000 orally active drugs, which led to what is known as Lipinski’s Rule of 5: a set of rules defining the drug-likeness of small molecules
‣ Molecular weight ≤ 500 Da
‣ Lipophilicity (LogP) ≤ 5
‣ Hydrogen bond donors ≤ 5
‣ Hydrogen bond acceptors ≤ 10
Lipinski et al. Adv Drug Deliv Rev 23 (1997) 3-25
[Figure: four panels (a to d). Suvannang et al. (2017) Unpublished results]
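
As a rough illustration, the Rule of 5 can be written as a simple property filter. The sketch assumes the four descriptors have already been computed elsewhere, uses the "no more than" form of each cut-off, and exposes a hypothetical `max_violations` parameter because tolerating a single violation is a common convention. The example property values are approximate and purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Pre-computed properties of one compound (hypothetical container)."""
    mol_weight: float   # Da
    logp: float
    h_donors: int
    h_acceptors: int


def lipinski_violations(d):
    """Names of the Rule-of-5 criteria that the compound violates."""
    checks = [
        ("molecular weight", d.mol_weight <= 500),
        ("LogP", d.logp <= 5),
        ("H-bond donors", d.h_donors <= 5),
        ("H-bond acceptors", d.h_acceptors <= 10),
    ]
    return [name for name, ok in checks if not ok]


def passes_rule_of_5(d, max_violations=1):
    """True if the number of violations does not exceed max_violations."""
    return len(lipinski_violations(d)) <= max_violations


# Approximate, illustrative values for a small drug-like molecule
small_molecule = Descriptors(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4)
# A made-up large, greasy compound that breaks every criterion
large_compound = Descriptors(mol_weight=720.0, logp=6.1, h_donors=6, h_acceptors=12)
```

A filter like this is cheap enough to apply to millions of enumerated structures before any more expensive modeling step.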

Lead-like Rule of 3
• In drug discovery, lipophilicity and molecular weight tend to increase as lead optimization progresses so as to improve the drug’s affinity and selectivity; starting points are therefore held to tighter limits
‣ Molecular weight < 300 Da
‣ Lipophilicity (LogP) ≤ 3
‣ Hydrogen bond donors ≤ 3
‣ Hydrogen bond acceptors ≤ 3
‣ Rotatable bonds ≤ 3

Chemical space
• Chemical space can be generally defined as the universe of synthetically feasible small molecules of <500 Da, estimated to be on the order of ~10^60 molecules
• Visualizing it gives us a bird’s-eye view of the relative diversity/likeness of chemical libraries
• The Reymond group at the University of Bern, Switzerland developed a computational algorithm that enumerates all possible chemical structures that can be built from up to 17 heavy atoms; their GDB-17 database amounts to 166.4 billion molecules
Reymond and Awale. ACS Chem Neurosci 3 (2012) 649-657.

Biological space
• Biological space refers to the chemical space of druggable protein families
‣ ADMET
‣ Aminergic/Lipophilic GPCR space
‣ Kinase space
‣ Protease space
‣ CYP450
‣ Nuclear receptors
Petit-Zeman S. http://www.nature.com/horizon/chemicalspace/background/figs/explore_b1.html

Fragment space
• Fragment space can be defined as the universe or collection of all possible molecular fragments (substructures)
• Fragments are < 300 Da
• Utilization of the fragment space has been suggested to allow more diverse exploration of the possible chemical space
• The Reymond group also extracted 10 million fragments from GDB-17
https://software.zbh.uni-hamburg.de/assets/softwareserverslide6-a0e42ecb3651120926821932574540d5b2e83ff0209654f9ab14804c7858451a.png
Virshup et al. J Am Chem Soc 135 (2013) 7296-7303

Koch et al. PNAS 102 (2005) 17272-17277
Structural classiﬁcation of natural products (SCONP)

Nadin et al. Angew Chem Int Ed 51 (2012) 1114-1122.

Polypharmacology
• There is a paradigm shift from ‘one drug-one target’ to ‘one drug-multiple targets’
• Unintended off-target binding may elicit undesirable side effects and adverse effects
• Desirable off-target binding offers drug repositioning opportunities
• Knowledge of polypharmacology may aid in the design of multi-targeted drugs
[Figure: Kinase targets of staurosporine]
Reddy and Zhang. Expert Rev Clin Pharmacol 6 (2013) 41-47

Drug repositioning/repurposing
• There is a need to discover new drugs, especially for the treatment of rare and neglected diseases
• Drug repositioning/repurposing is a lucrative approach as it tests existing FDA-approved drugs against various other whole-cell and target assays
Wu et al. Mol BioSyst 9 (2013) 1268-1281.

What is QSAR? (1)
• QSAR/QSPR is the acronym for Quantitative Structure-Activity/Property Relationship
• QSAR seeks to correlate structural features of compounds with their biological activities
[Figure: scatter plot of predicted activity (pIC50) versus experimental activity (pIC50); both axes span 5.0 to 8.0]

What is QSAR? (2)
• Structure governs activity/property
• Typically in the medicinal chemistry literature, the effects of substituent groups on activity are extensively studied
[Figure labels: 1, 2, 3, 4, 5, 6]
• QSAR/QSPR studies exploit this knowledge for modeling biological or chemical activities/properties

What is QSAR? (3)
• QSAR involves three main concepts:
1. Selecting a biological activity or chemical property of interest
2. Generating the physicochemical description
3. Predicting the biological activity or chemical property

Molecular descriptors (computational chemistry) and biological activity:

Qm      Energy    μ       HOMO      LUMO      HOMO-LUMO gap  IC50
0.2271  -309.834  1.0521  -0.21346  -0.0127   0.20076        0.05
0.2142  -195.31   0.2337  -0.22611  -0.01915  0.20696        1.50

Compounds of interest ⟶ Computational chemistry ⟶ Molecular descriptors ⟶ Machine learning ⟶ Predict biological activity
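
The three concepts above (choose an activity, describe the compounds, predict) can be sketched with the simplest possible model: a one-descriptor linear regression fitted by ordinary least squares. The training numbers are synthetic and exist only to make the example run; a real QSAR model would use many descriptors and a dedicated machine learning library.

```python
def fit_linear_qsar(descriptor_values, activities):
    """Fit activity = slope * descriptor + intercept by ordinary least squares."""
    n = len(descriptor_values)
    if n < 2 or n != len(activities):
        raise ValueError("need at least two paired observations")
    mean_x = sum(descriptor_values) / n
    mean_y = sum(activities) / n
    sxx = sum((x - mean_x) ** 2 for x in descriptor_values)
    if sxx == 0:
        raise ValueError("descriptor has no variance")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(descriptor_values, activities))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, descriptor_value):
    """Predict the activity of a new compound from its descriptor value."""
    slope, intercept = model
    return slope * descriptor_value + intercept


# Step 1: activity of interest (synthetic pIC50 values, not real data)
training_pic50 = [5.1, 5.9, 7.1, 7.9]
# Step 2: one descriptor per compound (synthetic values)
training_descriptor = [1.0, 2.0, 3.0, 4.0]
# Step 3: build the model and predict a query compound
model = fit_linear_qsar(training_descriptor, training_pic50)
query_prediction = predict(model, 5.0)
```

Plotting predictions like these against experimental values gives the kind of predicted-versus-experimental scatter plot commonly used to present QSAR results.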

Growth of QSAR
• A search in SCOPUS shows the growing trend of QSAR publications

A typical QSAR workflow

Data set preparation
1. ChEMBL 23: bioactivity data of ERα inhibitors ⟶ initial data set (10,666 bioactivity data for 5,809 compounds)
2. Select bioactivity measured by IC50 ⟶ IC50 subset (3,527 compounds)
3. Select entries with CONFIDENCE_SCORE=9 and assay_type=B ⟶ selected data set (1,346 compounds)
4. Delete entries with < or > signs
5. Delete entries with missing SMILES notation
6. Remove duplicate SMILES
7. Salt removal
8. Tautomer standardization
9. Transform IC50 to pIC50 ⟶ final data set (1,299 compounds)

QSAR modeling
1. Descriptor calculation: 12 sets of PaDEL fingerprints
2. Remove collinear descriptors
3. Feature selection
4. Data splitting: 70/30 split ratio, 10 data splits
5. QSAR model ⟶ predicted pIC50 values
6. Evaluate performance: R^2, Q^2, Rm^2, RMSE
7. Y-scrambling for evaluating chance correlation
8. Mechanistic interpretation of feature importance

Suvannang et al. RSC Adv 2018, 8: 11344-11356
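
A few steps of this workflow are simple enough to sketch in plain Python: the IC50 to pIC50 transformation, a 70/30 data split, two of the performance metrics, and the shuffling at the heart of Y-scrambling. This is a minimal sketch, not the code used in the cited study; the function names are hypothetical and IC50 values are assumed to be in nanomolar.

```python
import math
import random


def pic50_from_nanomolar(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L); for IC50 in nM this is 9 - log10(IC50)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def split_data(records, train_fraction=0.7, seed=0):
    """Randomly split records into training and test sets (70/30 by default)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted activities."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def y_scramble(activities, seed=0):
    """Shuffled copy of the activities; refitting on it should give a poor model."""
    rng = random.Random(seed)
    scrambled = list(activities)
    rng.shuffle(scrambled)
    return scrambled


# Repeating the split with different seeds mimics performing several data splits
train, test = split_data(range(10), train_fraction=0.7, seed=1)
```

If a model trained on `y_scramble` output scores nearly as well as the real one, the original correlation was probably due to chance.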

Applications of QSAR/QSPR models
• Regulatory Use: QSAR for modelling environmental
toxicity/chemical hazards by EPA and OECD
• Drug Design: QSAR for modelling biological activities
• Materials Design: QSPR for modelling chemical
properties

[Figure labels: GFP, LPS, QSAR, DNA, PCP, BPA, Bacitracin, Quorum, Furin, Vasorelaxant, Vitamin E, Template-Monomer, Phenol, Sulfonamide, EDTA-DPPC, Copper Complex, Antimalarial, Anti-H1N1, Aromatase Inhibitors, CYP450 Inhibitors, Monte Carlo Feature Selection, Text Mining]

Biological activity/chemical property modeled by QSAR

Biological Activity    Chemical Property
Bioconcentration       Boiling point
Biodegradation         Chromatographic retention time
Carcinogenicity        Dielectric constant
Drug metabolism        Diffusion coefficient
Inhibitor constant     Dissociation constant
Mutagenicity           Melting point
Permeability           Reactivity
Blood brain barrier    Solubility
Skin                   Stability
Pharmacokinetics       Thermodynamic properties
Receptor binding       Viscosity

Nantasenamat et al. EXCLI J. (2009) 8: 74-88

• QSAR: multiple compounds, single target protein
• Proteochemometrics: multiple compounds, multiple target proteins

Summary
• QSAR models allow us to understand how changes to the chromophore structure lead to GFP color change
• PCM models allow us to understand how changes to the chromophore structure, changes to the protein structure and the chromophore-protein interaction influence GFP color change
• Insights from the predictive models could be used in further
extending the spectral repertoire of GFP
Nantasenamat C et al. J. Comput. Chem. 35(27): 1951-1966.

Proteochemometrics
• Proteochemometrics was developed by Maris Lapins and Jarl Wikberg of
Uppsala University in 2001
• Advantages
• Can explain ligand-target affinity by providing detailed maps down to the substructure and amino acid level
• Can be used to rationalise why a ligand is active toward one target but not toward a related target
• Has been shown to be useful for Drug Repositioning
• Could be used for Personalized Medicine
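
One common way to set up a proteochemometric model is to describe each compound-protein pair with the compound descriptors, the protein descriptors and, optionally, their pairwise products (cross-terms) that represent the interaction. The sketch below shows that feature construction only; the function name, the `cross_terms` flag and the descriptor values are illustrative assumptions.

```python
def pcm_features(ligand_desc, protein_desc, cross_terms=True):
    """Feature vector for one compound-protein pair.

    Concatenates ligand and protein descriptors and, if requested, appends
    every ligand x protein product as an interaction (cross) term.
    """
    features = list(ligand_desc) + list(protein_desc)
    if cross_terms:
        features += [l * p for l in ligand_desc for p in protein_desc]
    return features


def build_pcm_rows(ligands, proteins):
    """One row per (compound, protein) pair, keyed by their names."""
    return {
        (lig_name, prot_name): pcm_features(lig_desc, prot_desc)
        for lig_name, lig_desc in ligands.items()
        for prot_name, prot_desc in proteins.items()
    }


# Made-up descriptor values, for illustration only
ligands = {"cmpd_1": [1.0, 2.0], "cmpd_2": [0.5, 1.5]}
proteins = {"target_A": [3.0, 4.0], "target_B": [1.0, 0.0]}
rows = build_pcm_rows(ligands, proteins)
```

A conventional QSAR model corresponds to the special case of a single protein, where the protein block is constant and adds no information.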

Conclusion (1)
• The QSAR paradigm undoubtedly offers much benefit for the rational design of robust compounds
• Nevertheless, certain shortcomings may limit the widespread application of QSAR
• Workﬂow of QSAR model development
• High dimensionality of the input space
• Representation of the molecular structure
• Interpretability and meaning of the developed QSAR models
• Presence of outliers or activity cliffs
• Validation of QSAR model performance
• Applicability in real-world setting

Conclusion (2)
• In spite of certain inherent flaws, the QSAR paradigm is inevitably one of the most useful forces contributing to the rapid development of drug discovery and design.
• As with all technologies, QSAR is not perfect; however, its weaknesses and flaws are continuously being identified, solved and reformed to help shape a new, improved and robust approach that is approaching minimal predictive error
• To help realize the goal of developing an intuitive approach toward the development of robust QSAR models, our laboratory has developed software that affords semi-automated, if not automated, QSAR modeling.

Conclusion (3)
• After more than 10 years of QSAR research, we can say that the demise of QSAR is a myth if it is done properly, and we have only scratched the surface of its full potential
• QSAR is continuously evolving…starting from 2D-QSAR to 8D-QSAR!
• Proteochemometrics (so to say, multi-target QSAR) enables us to take advantage of the explosion of omics data

AutoWeka
A software for performing automated data mining
• AutoWeka is a Python wrapper of Weka
• It is freely available at http://www.mt.mahidol.ac.th/autoweka/
• Nantasenamat et al. Chapter 8: AutoWeka: Toward an Automated Data Mining Software for QSAR and QSPR Studies. In: Cartwright H. Artificial Neural Networks, Springer, pp. 119-147.

BioCurator
Nantasenamat et al. Manuscript under preparation.
• We have developed a web application that allows users to upload ChEMBL bioactivity data for automatic data curation
Protocol
• The web app selects a subset of IC50/Ki data
• Removes redundant compounds if bioactivity values exceed 2 SD
• Removes data with < or > symbols in the bioactivity label
• Removes redundant compounds based on SMILES notation
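
The protocol above can be approximated as follows. This is not the BioCurator source code: the record keys (`smiles`, `relation`, `value`) are hypothetical, and the handling of the 2 SD rule is one possible reading, in which replicate measurements of the same SMILES are averaged unless their standard deviation exceeds the cut-off, in which case the compound is dropped.

```python
from statistics import mean, stdev


def curate(records, sd_cutoff=2.0):
    """Return {smiles: bioactivity} after basic curation.

    Assumed record layout: {"smiles": str, "relation": str, "value": float}.
    """
    groups = {}
    for rec in records:
        if not rec.get("smiles"):
            continue                      # missing SMILES notation
        if rec.get("relation") != "=":
            continue                      # drop values qualified with < or >
        groups.setdefault(rec["smiles"], []).append(rec["value"])

    curated = {}
    for smiles, values in groups.items():
        # Assumption: discordant replicates (SD above cut-off) are discarded
        if len(values) > 1 and stdev(values) > sd_cutoff:
            continue
        curated[smiles] = mean(values)    # one entry per unique SMILES
    return curated


# Made-up records, for illustration only
raw_records = [
    {"smiles": "CCO", "relation": "=", "value": 6.0},
    {"smiles": "CCO", "relation": "=", "value": 6.5},
    {"smiles": "c1ccccc1", "relation": ">", "value": 4.0},
    {"smiles": "", "relation": "=", "value": 7.0},
    {"smiles": "CCN", "relation": "=", "value": 4.0},
    {"smiles": "CCN", "relation": "=", "value": 9.0},
    {"smiles": "CC(=O)O", "relation": "=", "value": 5.5},
]
clean = curate(raw_records)
```

Only the consistent duplicate pair and the single clean measurement survive; the qualified, SMILES-less and discordant entries are removed.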

osFP
Simeon et al. J Cheminf 8 (2016) 72.
Protocol
• The web app accepts the input peptide sequence and computes amino acid composition descriptors
• Applies the constructed predictive model to predict the class label of the query peptide
• The predicted class label is relayed to the Results output
Simeon et al. J Cheminform (2016) 8:72. DOI 10.1186/s13321-016-0185-8
Research article (Open Access): osFP: a web server for predicting the oligomeric states of fluorescent proteins
Saw Simeon, Watshara Shoombuatong, Nuttapat Anuwongcharoen, Likit Preeyanon, Virapong Prachayasittikul, Jarl E. S. Wikberg and Chanin Nantasenamat
Abstract (Background): Currently, monomeric fluorescent proteins (FP) are ideal markers for protein tagging.
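
The first step of the protocol, computing amino acid composition descriptors, is straightforward to illustrate: it is the fraction of each of the 20 standard amino acids in the sequence. The sketch below is a generic implementation, not the osFP code; the function name and the choice to ignore non-standard letters are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each standard amino acid.

    Characters outside the 20 standard residues are ignored (an assumption).
    """
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return [residues.count(aa) / total for aa in AMINO_ACIDS]


# Toy sequence, for illustration only
composition = amino_acid_composition("AAGC")
```

The resulting fixed-length vector can be fed to any classifier regardless of sequence length, which is what makes this descriptor convenient for web servers of this kind.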

HemoPred
Win et al. Future Med Chem 9 (2017) 275-291.
Research article (Future Medicinal Chemistry): HemoPred: a web server for predicting the hemolytic activity of peptides

CryoProtect
Pratiwi et al. Journal of Chemistry, Volume 2017, Article ID 9861752, 15 pages. https://doi.org/10.1155/2017/9861752
Research article (Hindawi): CryoProtect: A Web Server for Classifying Antifreeze Proteins from Nonantifreeze Proteins
Reny Pratiwi, Aijaz Ahmad Malik, Nalini Schaduangrat, Virapong Prachayasittikul, Jarl E. S. Wikberg, Chanin Nantasenamat and Watshara Shoombuatong

How to get started in CDD?
• Hardware
◦ Laptop
◦ Desktop
◦ High-performance computer
◦ Compute clusters
◦ Cloud computing
• Software
◦ Commercial
◦ Free
• Programming
◦ C, Java, etc.
◦ R, Python, MATLAB, etc.

Computational Drug Discovery
based on Open Source
• Data source
◦ Bioactivity data: ChEMBL, PubChem, BindingDB
◦ Chemical database: ZINC, ChemSpider, GDB-17
◦ Biological database: PDB, UniProt
• Data curation and pre-processing
◦ BioCurator (developed in-house)
◦ Open Babel
• Descriptor calculation
◦ Rcpi, PyDPI, CDK, PaDEL
• Multivariate analysis
◦ R: caret
◦ Python: scikit-learn
• Plots
◦ R: ggplot2
◦ Python: Matplotlib, seaborn
• Molecular modeling
◦ Avogadro
◦ PyMOL
◦ Chimera
◦ VMD
• Molecular docking
◦ AutoDock
• Molecular dynamics
◦ GROMACS
◦ NAMD
