ML & AI in drug development: an introduction & overview
Paul Agapow
Statistics & Data Science Innovation Hub, GSK
Disclosure
– No conflicts of interest
– These are my own views and do not reflect official company thought or projects
– Based on experience in current & previous positions:
  – Data Science / Statistics @GSK
  – ML&AI / Health Informatics @AZ
  – Data Science Institute @ICL
  – Bioinformatics @Health Protection Agency (UK) …
Agenda
1. What is drug development, how does it work?
2. Why ML & AI is difficult in pharma
3. Where ML & AI can be powerful in pharma and what we need to do
1. How we make drugs
Drug development is a long & complex process
– Pathophysiology: identifying and understanding disease, unravelling the molecular machinery, pinpointing targets
– Drug candidates: developing molecules that can be synthesized and delivered safely to the target
– Clinical trials: testing via trials, dissecting failures and successes, tracking adverse events, seeking regulatory approval
– Post-approval: who gets the drug, how it is reimbursed, tracking long-term adverse events
The tough maths of drug development
• ~$2B and 10 years to develop & launch a drug
• The “valley of death”: most candidate drugs will fail
• It can be difficult to predict what will work
ePharmacology.hubpages.com
2. Why ML & AI is difficult in pharma
10 June 2021
“AI will not replace drug hunters, but drug hunters who don’t use AI will be replaced by those who do.”
- Andrew Hopkins, CEO, Exscientia
Why?
– Biology is outrageously complex
– Data is frequently biased, irregular, incomplete and in different formats
– Biomedicine is a label desert
– As a consequence:
  – Advances are throttled by domain knowledge
  – It is unclear how to represent & analyse such a complex domain
  – Suitable data is often scarce
12 July 2021
The complexity of biomedicine:
About 50 trillion cells of 200 types
Each cell has 23 pairs of chromosomes
In total 6.4 billion basepairs (positions)
Organised into about 18,000 genes
(Or maybe more like 40,000 genes)
Genetic material elsewhere in the cell
Epigenetic modification
1 million different types of molecules
Lifestyle & history
Exposure & environment
Immune system repertoire & priming
…
Of which we know only a fraction
The classic analytical tension
What we tend to solve:
– Easy things
– Available, ideal data
– Ground truth
– Simplify
– “Interesting”
– “Table-land”
What we need to solve:
– Useful things
– Incomplete messy data
– Unclear biological reality
– Uncertain findings
– Needful
– “Network-land”
3. Where can ML & AI be powerful in pharma?
Radiology & imaging are widely used in healthcare
• They capture important data that is difficult to abstract
  – E.g. presence, size, shape of tumor
• Radiologists
  – Never enough of them
  – Rushed
  – Frequently wrong
• But AI is good at interpreting images …
SubtleMedical.com
Not just X-rays & MRI but microscopes
• Cancers are associated with certain proteins
• Traditionally, samples have to be stained & examined visually
• Deep learning can automatically do this for us
• Faster, more consistent
Li et al. 2021
Precision medicine: subtypes of diseases & patients
• Many conditions have similar clinical presentations but vastly different underlying molecular machinery
• Precision medicine
  – The right drug for the right patient at the right time
• Clustering
  – But not as simple as it seems
  – E.g. asthma
Kermani et al. 2018
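To make the clustering idea concrete, here is a minimal sketch of k-means on toy data. It is only an illustration: the patient values and starting centroids are hypothetical, and k-means is one simple choice of algorithm, not the method used in the cited asthma work.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then update centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical, standardised biomarker pairs for six patients (not real data).
patients = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
            (0.9, 0.8), (0.85, 0.95), (0.8, 0.9)]
labels, centroids = k_means(patients, centroids=[patients[0], patients[3]])
print(labels)
```

On real patient data the hard parts are the ones the slide hints at: choosing features, the number of clusters and a distance measure, and checking that the resulting subtypes are stable and clinically meaningful.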
• A lot of biomedical knowledge is associative or relational & multimodal
• Knowledge graphs / graph convolutional networks (GCNs) help us to capture and analyse it
• Have been used to propose new drugs and patient subtypes
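As a rough illustration of how relational knowledge can be stored and queried, the sketch below holds a knowledge graph as (subject, relation, object) triples and follows a two-step path from drug to protein to disease. All entity and relation names are hypothetical placeholders, and a simple path query stands in for the learned models (such as GCNs) that real systems use.

```python
from collections import defaultdict

# Hypothetical triples (subject, relation, object); placeholders, not real findings.
triples = [
    ("drug_A", "inhibits", "protein_P"),
    ("drug_B", "inhibits", "protein_Q"),
    ("protein_P", "implicated_in", "disease_X"),
    ("protein_P", "implicated_in", "disease_Y"),
    ("drug_A", "approved_for", "disease_X"),
]

def build_index(triples):
    """Map (subject, relation) to the set of objects it points at."""
    index = defaultdict(set)
    for subj, rel, obj in triples:
        index[(subj, rel)].add(obj)
    return index

def repurposing_candidates(triples):
    """Drug-disease pairs linked through a shared protein but not yet approved."""
    index = build_index(triples)
    drugs = {subj for subj, rel, _ in triples if rel == "inhibits"}
    candidates = set()
    for drug in drugs:
        for protein in index[(drug, "inhibits")]:
            for disease in index[(protein, "implicated_in")]:
                if disease not in index[(drug, "approved_for")]:
                    candidates.add((drug, disease))
    return sorted(candidates)

print(repurposing_candidates(triples))
```

Any pair returned this way is only a hypothesis to investigate, since an association in the graph says nothing about efficacy or safety.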
Good (engineering) practices & production quality are vital
Wynants et al. 2021
We need more data
• Many possible types of useful data
• For many purposes
• From where?
• How to manage & interoperate?
• Issues of representation & diversity
Interpretability (etc.) is vital
– May feed back to inspire mechanistic research, but …
– What actually is interpretability?
– Essential for:
  – a smoke test, validation
  – checking for bias
  – communication
– Likewise calibration
  – Important to understand how (un)sure we are
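One way to check how (un)sure a model is: compare its predicted probabilities with what actually happened. The sketch below computes a Brier score and a simple binned calibration table. The function names, the bin count and the toy predictions are all assumptions made for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def calibration_table(probs, outcomes, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed event rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows

# Hypothetical predictions and outcomes (not real data).
probs = [0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 1, 1]
print(brier_score(probs, outcomes))
for mean_p, rate, count in calibration_table(probs, outcomes):
    print(f"predicted {mean_p:.2f}  observed {rate:.2f}  n={count}")
```

For a well-calibrated model the predicted and observed columns should roughly agree in every bin; large gaps suggest over- or under-confidence.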
Takeaways
• Drug development is an enormously complex process
• Although attractive, ML & AI are often hindered by the nature of the data
• Areas of definite value include subtyping, imaging & knowledge graphs
• More and wider data & better engineering are key to further progress
Some light reading
Academic Press (2021)
Looking for work?
– If you are driven by science and passionate about improving lives, why not look at a job in pharma?
– Principal Statistician, Internship, Software Engineer, Data Analyst, Apprentice, Future Leaders Programme …
– Visit our careers website for much, much more: https://www.gsk.com/en-gb/careers/

ML & AI in pharma: an overview


Editor's Notes

  • #17 COPD: Distill events from patients' histories contained in real-world data (RWD) into a graph demonstrating commonalities and diverging pathways. In the resulting patient-patient network, patients (nodes) are connected to one another by edges if they exhibit clinical similarity across many clinical dimensions (for example, laboratory tests). Patients who exhibited very high degrees of similarity were grouped into single nodes. The filtering step resulted in 73 clinical features that were used for topological inference of the patient-patient similarity network (table S1). From the resulting patient-patient network, we identified three completely segregated clusters with 762 (subtype 1), 617 (subtype 2), and 1096 (subtype 3) patients. Subtype 1 was characterized by the T2D complications diabetic nephropathy and diabetic retinopathy; subtype 2 was enriched for cancer malignancy and cardiovascular diseases; and subtype 3 was associated most strongly with cardiovascular diseases, neurological diseases, allergies, and HIV infections.
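The patient-patient network described in this note can be sketched in a few lines. This is a deliberately simplified stand-in: it links patients whose feature vectors lie within a distance threshold and reads off connected components as candidate subtypes, rather than performing the topological inference the note refers to. Patient IDs, feature values and the threshold are hypothetical.

```python
from itertools import combinations
from math import dist

def similarity_network(features, threshold):
    """Connect two patients when their feature vectors are within `threshold`."""
    edges = {pid: set() for pid in features}
    for a, b in combinations(features, 2):
        if dist(features[a], features[b]) <= threshold:
            edges[a].add(b)
            edges[b].add(a)
    return edges

def connected_components(edges):
    """Group patients that are reachable from one another through edges."""
    seen, components = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(edges[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

# Hypothetical patients with two standardised lab values each (not real data).
features = {
    "p1": (0.0, 0.1), "p2": (0.1, 0.0),
    "p3": (1.0, 1.1), "p4": (1.1, 1.0),
    "p5": (3.0, 3.0),
}
network = similarity_network(features, threshold=0.5)
print(connected_components(network))
```

With many clinical features, as in the note, the choice of similarity measure and threshold largely determines which clusters appear, so both would need careful validation.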