In this lecture, I provide an overview of how computers can be instrumental in drug discovery efforts. Topics covered include: big data as a result of omics efforts; bioinformatics; cheminformatics; biological space; chemical space; and how computers, particularly machine learning (and data science), can be applied in the context of drug discovery.
A video of this lecture is also provided on the "Data Professor" YouTube channel available at http://bit.ly/dataprofessor
If you are fascinated by data science, it would mean the world to me if you would consider subscribing to this channel (by clicking the link below):
http://bit.ly/dataprofessor
Computational Drug Discovery: Machine Learning for Making Sense of Big Data in Drug Discovery
1. Computational Drug Discovery
Machine Learning for Making Sense of Big Data in Drug Discovery
Associate Professor Dr. Chanin Nantasenamat
E-mail: chanin.nan@mahidol.edu
YouTube: http://bit.ly/dataprofessor
2. About the Speaker
• Research group website at http://codes.bio
• Codes and data at http://github.com/chaninn and http://github.com/chaninlab
• YouTube channel called Data Professor, available at http://bit.ly/dataprofessor
• Data Professor Facebook page at http://facebook.com/dataprofessor
Icon made by Freepik from www.flaticon.com
3. Disease
• The word “disease” is defined by the Cambridge Dictionary as an “illness of people, animals, plants, etc., caused by infection or a failure of health rather than by an accident”
http://static.filmannex.com/users/galleries/294182/19265_fa_rszd.jpg
4. Drugs
• A “drug” is a biological or chemical entity that can modulate the course of a disease state by interacting with its target protein
• Biological entity (e.g. antibodies)
• Chemical entity (e.g. small molecules)
Natthapon Ngamnithiporn. Image from Freepik. http://www.freepik.com/free-photo/packings-of-pills-and-capsules-of-medicines_1178867.htm
6. Drug Discovery Process
• Costs ~2 billion USD
• Takes about 10-15 years
• Failure rate is > 90%
http://drugdiscovery.nd.edu/
7. Drug Discovery Process
1. Identify a target protein that is key in modulating the disease
2. Screen for “hit” molecules that can inhibit the target protein
3. “Hit-to-lead” and “lead optimization”
4. Evaluate pharmacokinetic properties
5. Initiate clinical trials to evaluate safety and dosage; efficacy and side effects; adverse reactions to long-term use
6. Drug reaches the market
Ashburn and Thor. Nature Rev. Drug Discov. 3 (2004) 673-683
9. Multi-objective optimization
• A drug must not only target the protein of interest but must also possess other properties
• Desirable characteristics of a drug:
1. Binds selectively to the target protein
2. Absorbs in the stomach (oral drugs)
3. Permeates the gut wall or cell membrane (can reach the target site)
4. Metabolically stable
5. Non-toxic
6. Can be synthesized
• To achieve all of these desirable properties, the chemical structure needs to be optimized (an optimal balance must be struck among many factors)
10. Creating new compounds
• We can look to nature for inspiration (biologically inspired) or use existing drugs as a starting point
• Medicinal chemists optimize existing compounds by modifying them in a process known as bioisosteric replacement (e.g. replacing a hydrogen atom with a halogen atom)
• Cheminformaticians can computationally enumerate a compound library (compound enumeration) using the rules of organic chemistry (considering chemical stability and synthetic feasibility)
Icon made by dDara from www.flaticon.com
11. Molecules
• Molecules can be thought of as a framework of atoms (molecular graph) where atoms are vertices and bonds are edges
- Each vertex is typically one of a small set of atom types (e.g. C, N, O, F, P, S, Cl or Br)
- Each edge that links two vertices can be a single, double or triple bond
• Compound enumeration as performed by the research group of JL Reymond (Acc Chem Res 2015, 48(3):722-730)
- Molecules of up to 13 atoms → 977 million possible molecules (~10^9)
- Molecules of up to 17 atoms → 166 billion possible molecules (~10^11)
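As a minimal sketch of the molecular-graph idea, the Python snippet below stores a molecule as a list of atoms (vertices) and a list of bonds (edges with a bond order of 1, 2 or 3). The `MoleculeGraph` class, its method names, the simplified valence table and the acetic acid example are illustrative assumptions and do not come from the slides; hydrogens are left implicit and inferred from typical valences.

```python
from dataclasses import dataclass, field

# Simplified typical valences, used only to infer implicit hydrogens (assumption).
TYPICAL_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "P": 3, "S": 2, "Cl": 1, "Br": 1}


@dataclass
class MoleculeGraph:
    atoms: list                                  # element symbol of each vertex
    bonds: list = field(default_factory=list)    # edges as (i, j, bond_order)

    def add_bond(self, i, j, order=1):
        """Add an edge between vertices i and j with bond order 1, 2 or 3."""
        if order not in (1, 2, 3):
            raise ValueError("bond order must be 1, 2 or 3")
        self.bonds.append((i, j, order))

    def bond_order_sum(self, idx):
        """Sum of bond orders of all edges touching vertex idx."""
        return sum(order for i, j, order in self.bonds if idx in (i, j))

    def implicit_hydrogens(self, idx):
        """Hydrogens needed to fill the typical valence of vertex idx."""
        return TYPICAL_VALENCE[self.atoms[idx]] - self.bond_order_sum(idx)

    def formula_counts(self):
        """Element counts, including implicit hydrogens."""
        counts = {}
        for idx, element in enumerate(self.atoms):
            counts[element] = counts.get(element, 0) + 1
            counts["H"] = counts.get("H", 0) + self.implicit_hydrogens(idx)
        return counts


# Acetic acid, CH3-C(=O)-OH: four heavy atoms, three bonds.
acetic_acid = MoleculeGraph(atoms=["C", "C", "O", "O"])
acetic_acid.add_bond(0, 1, 1)   # C-C single bond
acetic_acid.add_bond(1, 2, 2)   # C=O double bond
acetic_acid.add_bond(1, 3, 1)   # C-O single bond
print(acetic_acid.formula_counts())
```

Enumeration tools work on this kind of representation: they generate graphs vertex by vertex and keep only those that satisfy valence, stability and synthetic-feasibility rules.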
12. Chemical space
• Theoretically possible chemical space as revealed via compound enumeration by the research group of JL Reymond (Acc Chem Res 2015, 48(3):722-730)
- Molecules of up to 13 atoms → 977 million possible molecules (~10^9)
- Molecules of up to 17 atoms → 166 billion possible molecules (~10^11)
• Drug space (<500 Da) is estimated to comprise molecules of up to 40 atoms (in some cases, even more) → roughly 10^60 molecules
14. Bioactivity
• Bioactivity is the activity elicited by the target protein of interest
• Such target proteins are typically involved in key pathways that influence the course of a disease
• Thus, much attention has been devoted to modulating these target proteins
• Primary literature
• Curated databases
- ChEMBL, BindingDB, MOAD, PubChem
• Open innovation
- Pharmaceutical companies are making data publicly available for non-commercial diseases
15. What can computers do?
• Computers have defeated humans at chess (IBM Deep Blue) and at Jeopardy! (IBM Watson)
• Google released a self-driving car
• NASA uses computers to simulate space missions
• Computers are being used to design aircraft and cars
• Supermarkets and shopping malls are using our purchase history to analyze and predict our spending behavior
• Why not use them to discover, design and develop new drugs?
• Computers (deep learning) can paint like Van Gogh and Picasso
• Computers can programmatically code music (Sonic Pi)
• Computers can dream
18. Why do we need computational models in drug discovery?
• To discern the structure-activity relationship of a chemical library
• In vitro data are limited, and obtaining them is expensive, time-consuming and laborious
• Computational models can be built quickly to make preliminary predictions of the pharmacokinetics and bioactivity of query compounds
Anuwongcharoen et al. PeerJ 4 (2016) e1958
19. Questions that can be answered by computational models
• What target proteins could my compound(s) bind to and modulate?
• Would my compound bind nonspecifically to other proteins and thus have off-target activity?
• What type of compounds can bind and modulate the bioactivity of my target protein of interest?
• Are there compounds similar to my query compound that may exert similar binding behavior?
• How does my compound bind to the protein structure of its target?
• How can I modify the structure of my compound to enhance its pharmacokinetics and bioactivity?
Hall et al. Prog Biophys Mol Biol 116 (2014) 82-91.
21. Overview of Computational Drug Discovery
[Figure: “Maximizing computational tools for successful drug discovery”. Terms shown in the figure:]
• Approaches: ligand-based, structure-based, systems-based, fragment-based, chemogenomics, bioinformatics, cheminformatics, medicinal chemistry
• ADMET, QSAR, pharmacophore, statistical molecular design
• Molecular modeling, protein structure prediction (homology/comparative, ab initio), molecular dynamics, normal mode analysis, docking/reverse docking, binding cavity analysis, pharmacophore
• Protein–ligand interactome, protein–protein interactome, drug target gene expression, intrinsically disordered proteins, allo-network drugs
• High-throughput synthesis, high-throughput screening, privileged structures, bioisostere, chemoisostere, scaffold hopping
• Sequence alignment, BLAST, phylogenetic analysis, biological space
• Computational chemistry, molecular descriptors, chemical space
• Profiling; filtering (Lipinski's rule of 5); search (molecular similarity, substructure similarity, shape, volume and charge-based similarity)
• Databases: small molecules (DrugBank, ChEMBL, PubChem, BindingDB, ZINC); proteins (PDB, UniProt, SCOP); protein-protein (MINT, STITCH, STRING); pathway (KEGG, Reactome)
• Proteochemometrics, computational chemogenomics, graph/network theory
• Fragment-based docking, fragment-based QSAR, ligand growing
Nantasenamat and Prachayasittikul. Expert Opin Drug Discov 10 (2015) 321-329.
22. Bioinformatics
• Bioinformatics is a discipline entailing the use of computational approaches to analyze biological data
‣ Analyze and compare genes, proteins and genomes
‣ Explore structures and functions of biomolecules (DNA, protein, lipid and carbohydrate)
‣ Explore network biology and metabolic pathways
http://www.gettyimages.com/detail/photo/bioinformatics-background-concept-royalty-free-image/475811932?esource=SEO_GIS_CDN_Redirect
[Figure: protein structure with labeled residues I424, L428, F404, R394, E353, A350, D351, L354, P535, W383 and L525]
Suvannang et al. Manuscript under preparation.
23. Cheminformatics
• Cheminformatics is a discipline at the interface of chemistry and computing that enables the analysis of various aspects of chemical structures
‣ Chemical space for investigating molecular similarity/diversity
‣ Molecular descriptors (e.g. MW, LogP, nHBdon, nHBacc) and quantum chemical descriptors (HOMO, LUMO, HOMO-LUMO gap)
Ertl and Rohde. J Cheminf 4 (2012) 12.
24. Drugs and their precursors
• Fragments are substructures found in a compound (drug)
• Privileged substructures are substructures that are commonly found in inhibitors/activators (drugs) against several therapeutic targets
• Hits are a small subset of compounds from large chemical libraries that are identified by high-throughput screening
• Leads are compounds that have undergone minor structural optimization from hits. These leads often undergo further rounds of “lead optimization”
• Drugs are the few leads that have passed rigorous tests (pre-clinical and clinical trials) before reaching the market
25. Identifying hits
• So how does one go about identifying hit compounds?
- High-throughput screening (experimental and computational)
- Find compounds similar to known actives, since the bioactivity of each compound is not an isolated point (similar chemical structures tend to have similar biological activity)
- About 30% of compounds similar to known actives are themselves active
https://southernresearch.org/news/nih-contract-high-throughput-screening-for-zika/
Hernandez-Santoyo et al. Protein-protein and protein-ligand docking. DOI:10.5772/56376
Martin YC, J Med Chem 2002, 45(19):4350-4358
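Finding compounds similar to a known active can be sketched with the Tanimoto coefficient, a widely used similarity measure for binary fingerprints. The snippet below is a minimal illustration only: fingerprints are represented as Python sets of "on" bit positions, the compound names and bit values are made up, and the 0.7 cutoff is an arbitrary assumption rather than a recommended value. In practice the fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(query_fp, library, threshold=0.7):
    """Return (name, similarity) pairs at or above threshold, most similar first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints for a known active and a small screening library.
known_active = {1, 4, 7, 9, 15, 22}
library = {
    "cmpd_A": {1, 4, 7, 9, 15, 23},   # shares 5 of 7 distinct bits with the query
    "cmpd_B": {2, 3, 5, 8},           # shares no bits with the query
    "cmpd_C": {1, 4, 7, 9, 15, 22},   # identical to the query
}

hits = similarity_search(known_active, library, threshold=0.7)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

Compounds that pass the similarity cutoff are candidates for testing, not confirmed actives; as the slide notes, only a fraction of them turn out to be active.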
26. Lead generation (Hit-to-Lead)
• Hits identified from high-throughput screens are transformed into leads by means of limited structural modification (so as to optimize their ADMET properties)
• Generated leads are subjected to further rounds of lead optimization
Fuller N et al. Drug Discov Today 2016, 21(8):1272-1283.
27. Fragment-based Drug Design
Source: http://practicalfragments.blogspot.com/2011/08/first-fragment-based-drug-approved.html
Zelboraf treats melanoma by inhibiting BRAF.
29. Lipinski's Rule of 5
• Christopher Lipinski (at Pfizer) analyzed a large set of > 2,000 orally active drugs, which led to what is known as Lipinski's Rule of 5, a set of rules defining the drug-likeness of small molecules
‣ Molecular weight ≤ 500 Da
‣ Lipophilicity (LogP) ≤ 5
‣ Hydrogen bond donors ≤ 5
‣ Hydrogen bond acceptors ≤ 10
[Figure: four panels, a to d]
Lipinski et al. Adv Drug Deliv Rev 23 (1997) 3-25
Suvannang et al. (2017) Unpublished results
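The four criteria of the Rule of 5 translate directly into a filter. The sketch below assumes the descriptors have already been computed elsewhere (for example with a descriptor calculator); the `Descriptors` class, the function name and both example compounds are hypothetical, and treating a value exactly at a cutoff as passing is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float    # molecular weight in Da
    logp: float          # lipophilicity (LogP)
    h_donors: int        # number of hydrogen bond donors
    h_acceptors: int     # number of hydrogen bond acceptors


def lipinski_violations(d):
    """Return the list of Rule-of-5 criteria that the compound violates."""
    checks = {
        "MW > 500 Da": d.mol_weight > 500,
        "LogP > 5": d.logp > 5,
        "H-bond donors > 5": d.h_donors > 5,
        "H-bond acceptors > 10": d.h_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]


# Hypothetical descriptor values for two compounds.
small_compound = Descriptors(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4)
large_compound = Descriptors(mol_weight=720.0, logp=6.1, h_donors=2, h_acceptors=12)

print(lipinski_violations(small_compound))   # no violations
print(lipinski_violations(large_compound))   # three violations
```

Returning the list of violated criteria, rather than a single pass/fail flag, makes it easy to apply a tolerance such as "at most one violation" when filtering a library.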
30. Lead-like Rule of 3
• In drug discovery, lipophilicity and molecular weight tend to increase as lead optimization progresses in order to improve the drug's affinity and selectivity, so leads should start out smaller and less lipophilic than the final drug
‣ Molecular weight < 300 Da
‣ Lipophilicity (LogP) < 3
‣ Hydrogen bond donors < 3
‣ Hydrogen bond acceptors < 3
‣ Rotatable bonds < 3
31. Chemical space
• Chemical space can be generally defined as the universe of synthetically feasible small molecules of <500 Da, estimated to be on the order of ~10^60 molecules
• Visualizing it gives us a bird's-eye view of the relative diversity/likeness of chemical libraries
• The Reymond group at the University of Bern, Switzerland, developed a computational algorithm that enumerates all possible chemical structures that can be built from up to 17 heavy atoms; the resulting GDB-17 database contains 166.4 billion molecules
Reymond and Awale. ACS Chem Neurosci 3 (2012) 649-657.
32. Biological space
• Biological space refers to the chemical space of druggable protein families
‣ ADMET
‣ Aminergic/Lipophilic GPCR space
‣ Kinase space
‣ Protease space
‣ CYP450
‣ Nuclear receptors
Petit-Zeman S. http://www.nature.com/horizon/chemicalspace/background/figs/explore_b1.html
33. Fragment space
• Fragment space can be defined as the universe or collection of all possible molecular fragments (substructures)
• Fragments are < 300 Da
• Utilization of the fragment space has been suggested to allow more diverse exploration of the possible chemical space
• The Reymond group also extracted 10 million fragments from GDB-17
https://software.zbh.uni-hamburg.de/assets/softwareserverslide6-a0e42ecb3651120926821932574540d5b2e83ff0209654f9ab14804c7858451a.png
Virshup et al. J Am Chem Soc 135 (2013) 7296-7303
34. Structural classification of natural products (SCONP)
Koch et al. PNAS 102 (2005) 17272-17277
36. Polypharmacology
• There is a paradigm shift from “one drug-one target” to “one drug-multiple targets”
• Unintended off-target binding may elicit undesirable side effects and adverse effects
• Desirable off-target binding gives you drug repositioning opportunities
• Knowledge of polypharmacology may aid in the design of multi-targeted drugs
[Figure: kinase targets of staurosporine]
Reddy and Zhang. Expert Rev Clin Pharmacol 6 (2013) 41-47
37. Drug repositioning/repurposing
• There is a need to discover new drugs, especially for treating rare and neglected diseases
• Drug repositioning/repurposing is a lucrative approach as it tests existing FDA-approved drugs against various other whole-cell and target assays
Wu et al. Mol BioSyst 9 (2013) 1268-1281.
38. What is QSAR? (1)
• QSAR/QSPR is the acronym for Quantitative Structure-Activity/Property Relationship
• QSAR seeks to correlate structural features of compounds with their biological activities
[Figure: scatter plot of predicted activity (pIC50) versus experimental activity (pIC50), both axes ranging from 5.0 to 8.0]
39. What is QSAR? (2)
• Structure governs activity/property
• In the medicinal chemistry literature, the effects of substituent groups on activity are typically studied extensively
[Figure: structure with positions labeled 1 to 6]
• QSAR/QSPR studies exploit this knowledge for modeling biological or chemical activities/properties
40. What is QSAR? (3)
• QSAR involves three main steps:
1. Selecting a biological activity or chemical property of interest
2. Generating the physicochemical description
3. Predicting the biological activity or chemical property
Molecular descriptors and biological activity for two example compounds:
Qm       Energy     μ        HOMO       LUMO       HOMO-LUMO gap   IC50
0.2271   -309.834   1.0521   -0.21346   -0.0127    0.20076         0.05
0.2142   -195.31    0.2337   -0.22611   -0.01915   0.20696         1.50
[Figure: compounds of interest → molecular descriptors (computational chemistry) → predicted biological activity (machine learning)]
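The three steps can be illustrated with the simplest possible QSAR model: an ordinary least-squares line relating a single descriptor to activity. Everything specific in the sketch below is an assumption made for illustration: the descriptor (LogP), the pIC50 values and the function names are invented, and real QSAR models use many descriptors and more capable learning algorithms.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x_new):
    """Apply a fitted (slope, intercept) model to new descriptor values."""
    slope, intercept = model
    return [slope * xi + intercept for xi in x_new]


# Step 1: activity of interest (hypothetical pIC50 values for four compounds).
pic50 = [5.1, 5.9, 7.1, 7.9]
# Step 2: physicochemical description (hypothetical LogP for the same compounds).
logp = [1.0, 2.0, 3.0, 4.0]
# Step 3: build the model and predict the activity of a new compound.
model = fit_line(logp, pic50)
predicted = predict(model, [2.5])
print(model, predicted)
```

The fitted slope indicates how strongly the descriptor is associated with activity within this small set, which is the kind of structure-activity insight a QSAR model is meant to provide.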
41. Growth of QSAR
• A search in SCOPUS shows the growing trend of QSAR publications
42. A typical QSAR workflow
Data set preparation
1. ChEMBL 23: bioactivity data of ERα inhibitors → initial data set (10,666 bioactivity data for 5,809 compounds)
2. Keep bioactivity measured by IC50 → IC50 subset (3,527 compounds)
3. Select entries with CONFIDENCE_SCORE=9 and assay_type=B → selected data set (1,346 compounds)
4. Delete entries with < or > signs and those with […]; delete entries with missing SMILES notation; remove duplicate SMILES → final data set (1,299 compounds)
QSAR modeling
1. Final data set: salt removal, tautomer standardization, transform IC50 to pIC50
2. Descriptor calculation: 12 sets of PaDEL fingerprints
3. Feature selection: remove collinear descriptors
4. Data splitting: 70/30 split ratio, perform 10 data splits
5. QSAR model → predicted pIC50 values
6. Evaluate performance: R^2, Q^2, Rm^2, RMSE
7. Y-scrambling for evaluating chance correlation
8. Mechanistic interpretation of feature importance
Suvannang et al. RSC Adv 2018, 8: 11344-11356
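Several of the workflow steps are simple enough to sketch with the Python standard library: transforming IC50 to pIC50, repeated 70/30 data splitting, two of the performance metrics (R^2 and RMSE), and the label shuffling used in Y-scrambling. The function names and the assumption that IC50 is given in nM are illustrative choices of this sketch and not taken from the cited paper; Q^2 and Rm^2 are not shown.

```python
import math
import random


def ic50_nm_to_pic50(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L); the input is assumed to be in nM."""
    return -math.log10(ic50_nm * 1e-9)


def split_70_30(items, seed):
    """Shuffle with a fixed seed and split into 70% training / 30% test."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def y_scramble(y, seed):
    """Return a shuffled copy of the activity values (for Y-scrambling)."""
    rng = random.Random(seed)
    scrambled = list(y)
    rng.shuffle(scrambled)
    return scrambled


# An IC50 of 100 nM corresponds to a pIC50 of about 7.
print(ic50_nm_to_pic50(100.0))

# Ten independent 70/30 splits of ten compound indices, one per seed.
splits = [split_70_30(range(10), seed) for seed in range(10)]
print(len(splits), [len(part) for part in splits[0]])

# Metrics on a toy example where every prediction is off by one log unit.
print(rmse([5.0, 6.0, 7.0], [6.0, 7.0, 8.0]))
```

In Y-scrambling, the model is refit on descriptors paired with the shuffled activities; if the scrambled models perform about as well as the real one, the original correlation is likely due to chance.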
43. Applications of QSAR/QSPR models
• Regulatory use: QSAR for modelling environmental toxicity/chemical hazards by EPA and OECD
• Drug design: QSAR for modelling biological activities
• Materials design: QSPR for modelling chemical properties
47. Summary
• QSAR models allow us to understand how changes to the chromophore structure lead to GFP color change
• PCM models allow us to understand how changes to the chromophore structure, changes to the protein structure and the chromophore-protein interaction influence GFP color change
• Insights from the predictive models could be used to further extend the spectral repertoire of GFP
Nantasenamat C et al. J. Comput. Chem. 35(27): 1951-1966.
48. Proteochemometrics
• Proteochemometrics was developed by Maris Lapins and Jarl Wikberg of Uppsala University in 2001
• Advantages
- Can explain ligand-target affinity by providing detailed maps down to the substructure and amino acid level
- Can be used to rationalize why a ligand is active toward one target but not toward another related target
- Has been shown to be useful for drug repositioning
- Could be used for personalized medicine
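A proteochemometric (PCM) model describes each ligand-target pair rather than each ligand alone. One common way to do this, sketched below under stated assumptions, is to concatenate the ligand descriptors, the protein descriptors and their pairwise products (cross-terms) into a single feature vector. The function name and the descriptor values are hypothetical, and actual PCM studies may build their interaction terms differently.

```python
def pcm_features(ligand_desc, protein_desc):
    """Concatenate ligand, protein and ligand x protein cross-term descriptors."""
    cross_terms = [l * p for l in ligand_desc for p in protein_desc]
    return list(ligand_desc) + list(protein_desc) + cross_terms


# Hypothetical descriptor vectors for one ligand and one target protein.
ligand = [0.5, 2.0]
protein = [1.0, -1.0, 3.0]

features = pcm_features(ligand, protein)
print(len(features), features)
```

The cross-terms are what let a single model learn that a given substructure matters for one target but not for a related one, which is the basis of the selectivity explanations listed above.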
49. Conclusion (1)
• Without a doubt, the QSAR paradigm offers much benefit for the rational design of robust compounds
• Nevertheless, certain shortcomings may limit the widespread application of QSAR
- Workflow of QSAR model development
- High dimensionality of the input space
- Representation of the molecular structure
- Interpretability and meaning of the developed QSAR models
- Presence of outliers or activity cliffs
- Validation of QSAR model performance
- Applicability in real-world settings
50. Conclusion (2)
• In spite of certain inherent flaws, the QSAR paradigm is inevitably one of the most useful forces contributing to the rapid development of drug discovery and design
• As with all technologies, QSAR is not perfect; however, its weaknesses and flaws are continuously being identified and addressed, helping to shape an improved and more robust approach with ever smaller predictive error
• To help realize the goal of an intuitive approach to developing robust QSAR models, our laboratory has developed software that affords semi-automated, if not fully automated, QSAR modeling
51. Conclusion (3)
• After more than 10 years of QSAR research, we can say that the demise of QSAR is a myth, provided it is done properly, and that we have only scratched the surface of its full potential
• QSAR is continuously evolving… starting from 2D-QSAR to 8D-QSAR!
• Proteochemometrics (so to speak, multi-target QSAR) enables us to take advantage of the explosion of omics data
53. BioCurator
• We have developed a web application that allows users to upload ChEMBL bioactivity data for automatic data curation
Protocol
• The web app selects a subset of IC50/Ki data
• Removes redundant compounds if bioactivity values exceed 2 SD
• Removes data with < or > symbols in the bioactivity label
• Removes redundant compounds based on SMILES notation
Nantasenamat et al. Manuscript under preparation.
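The curation protocol can be sketched as a small filtering function. This is not the BioCurator implementation: the record fields (`smiles`, `type`, `relation`, `value`), the function name and the example data are hypothetical. In particular, the "2 SD" rule is interpreted here as "discard a compound whose replicate measurements have a standard deviation above `max_sd`, otherwise average them", which is an assumption about what the slide means.

```python
from collections import defaultdict
from statistics import mean, pstdev


def curate(records, activity_type="IC50", max_sd=2.0):
    """Filter bioactivity records and merge replicates by SMILES."""
    grouped = defaultdict(list)
    for rec in records:
        if rec.get("type") != activity_type:
            continue                      # keep only the chosen activity type
        if rec.get("relation") != "=":
            continue                      # drop values qualified with < or >
        if not rec.get("smiles"):
            continue                      # drop entries without a structure
        grouped[rec["smiles"]].append(rec["value"])

    curated = {}
    for smiles, values in grouped.items():
        if len(values) > 1 and pstdev(values) > max_sd:
            continue                      # replicates disagree too much (assumed rule)
        curated[smiles] = mean(values)    # one value per unique SMILES
    return curated


# Hypothetical records; "value" stands for an activity on a log scale.
records = [
    {"smiles": "CCO", "type": "IC50", "relation": "=", "value": 6.0},
    {"smiles": "CCO", "type": "IC50", "relation": "=", "value": 6.4},
    {"smiles": "CCN", "type": "IC50", "relation": ">", "value": 4.0},
    {"smiles": "", "type": "IC50", "relation": "=", "value": 5.0},
    {"smiles": "c1ccccc1", "type": "Ki", "relation": "=", "value": 7.0},
    {"smiles": "CCC", "type": "IC50", "relation": "=", "value": 4.0},
    {"smiles": "CCC", "type": "IC50", "relation": "=", "value": 9.0},
]

curated = curate(records)
print(curated)
```

Only the two agreeing measurements for `CCO` survive in this example; the other records are removed by the qualifier, missing-structure, activity-type and replicate-spread checks respectively.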
54. osFP
Simeon et al. J Cheminf 8 (2016) 72.
Protocol
• The web app accepts the input peptide sequence and computes amino acid composition descriptors
• Applies the constructed predictive model to predict the class label of the query peptide
• Predicted class label is relayed into the Results output
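Amino acid composition, the descriptor named in the protocol, is the fraction of each of the 20 standard amino acids in a sequence. The sketch below shows one straightforward way to compute it; the function name and the example sequence are illustrative, and skipping non-standard characters is an assumption of this sketch. The prediction step is not shown because it depends on the trained model behind each web server.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return {aa: residues.count(aa) / total for aa in AMINO_ACIDS}


# Hypothetical ten-residue query sequence.
aac = amino_acid_composition("MSKGEELFTG")
print({aa: frac for aa, frac in aac.items() if frac > 0})
```

The resulting 20-element vector has the same length for any sequence, which is what allows sequences of different lengths to be fed to the same predictive model.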
[Figure: first page of the research article “osFP: a web server for predicting the oligomeric states of fluorescent proteins” by Saw Simeon, Watshara Shoombuatong, Nuttapat Anuwongcharoen, Likit Preeyanon, Virapong Prachayasittikul, Jarl E. S. Wikberg and Chanin Nantasenamat. J Cheminform (2016) 8:72, DOI 10.1186/s13321-016-0185-8, Open Access. Abstract excerpt: “Currently, monomeric fluorescent proteins (FP) are ideal markers for protein tagging.”]
55. HemoPred
Win et al. Future Med Chem 9 (2017) 275-291.
Protocol
• The web app accepts the input peptide sequence and computes amino acid composition descriptors
• Applies the constructed predictive model to predict the class label of the query peptide
• Predicted class label is relayed into the Results output
[Figure: first page of the research article “HemoPred: a web server for predicting the hemolytic activity of peptides”, Future Medicinal Chemistry]
56. CryoProtect
Pratiwi et al. J Chem 2017 (2017) 9861752.
Protocol
• The web app accepts the input peptide sequence and computes amino acid composition descriptors
• Applies the constructed predictive model to predict the class label of the query peptide
• Predicted class label is relayed into the Results output
[Figure: first page of the research article “CryoProtect: A Web Server for Classifying Antifreeze Proteins from Nonantifreeze Proteins” by Reny Pratiwi, Aijaz Ahmad Malik, Nalini Schaduangrat, Virapong Prachayasittikul, Jarl E. S. Wikberg, Chanin Nantasenamat and Watshara Shoombuatong. Hindawi, Journal of Chemistry, Volume 2017, Article ID 9861752, 15 pages, https://doi.org/10.1155/2017/9861752]
Author affiliations shown in the figure:
1. Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
2. Department of Medical Laboratory Technology, Faculty of Health Science, Setia Budi University, Surakarta 57127, Indonesia
3. Department of Clinical Microbiology and Applied Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
57. How to get started in CDD?
• Hardware
- Laptop
- Desktop
- High-performance computer
- Compute clusters
- Cloud computing
• Software
- Commercial
- Free
• Programming
- C, Java, etc.
- R, Python, MATLAB, etc.
58. Computational Drug Discovery based on Open Source
• Data source
◦ Bioactivity data: ChEMBL, PubChem, BindingDB
◦ Chemical database: ZINC, ChemSpider, GDB-17
◦ Biological database: PDB, UniProt
• Data curation and pre-processing
◦ BioCurator (developed in-house)
◦ Babel
• Descriptor calculation
◦ Rcpi, PyDPI, CDK, PaDEL
• Multivariate analysis
◦ R: caret
◦ Python: scikit-learn
• Plots
◦ R: ggplot2
◦ Python: Matplotlib, seaborn
• Molecular modeling
◦ Avogadro
◦ PyMOL
◦ Chimera
◦ VMD
• Molecular docking
◦ AutoDock
• Molecular dynamics
◦ GROMACS
◦ NAMD