Accelerating lead optimisation with active learning by exploiting MMPA based ADMET knowledge with regression forest potency models

Ed Griffen, Technical Director at MedChemica Ltd

Presented at the 15th GCC (German Conference on Cheminformatics), November 2019. We combine regression forest machine learning with our MMPA-based generative methods to deliver an active learning system that accelerates lead optimisation. In the process we identify permutative MMPA as a method to leverage SAR information from small data sets. Published by MedChemica Ltd.

Regression forest models
• Features are acid, base, hydrogen bond donor, acceptor, hydrophobe, aromatic attachment, aliphatic attachment and halogen. Definitions are highly engineered.†
• Each descriptor has the form: Feature 1 - topological distance - Feature 2
• Engineered for chemical relevance: features can be superimposed or directly linked, which e.g. enables a group to be both a hydrogen bond acceptor and a base
• A bit identifies a pharmacophore pair, e.g. Aromatic - 3 bonds - Base
• Used as unfolded 280-bit fingerprints
• Regression Forest as ML method
• Build models with 10-fold CV; report CV Pearson's R² and CV RMSE
• Build an RF error model, using the same descriptors, to generate a predicted error for each compound
†Taylor, R.; Cole, J. C.; Cosgrove, D. A.; Gardiner, E. J.; Gillet, V. J.; Korb, O. J Comput Aided Mol Des 2012, 26 (4), 451–472.
†Acid and base definitions are SMARTS patterns (MedChemica definitions): acids include C, N and heteroaromatic acids; bases exclude weak aniline bases and include amidines and guanidines.
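The pharmacophore-pair bits described above can be illustrated with a minimal sketch. It assumes a molecule is already available as an adjacency list with feature labels attached to atoms. The feature names, the graph encoding and the function names are hypothetical and are not MedChemica's engineered SMARTS definitions. Each distinct "Feature - distance - Feature" key would map to its own bit in an unfolded (unhashed) fingerprint.

```python
from collections import deque


def topological_distances(adjacency, start):
    """Breadth-first search: bond counts from `start` to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    return dist


def pharmacophore_pairs(adjacency, features):
    """Return the set of 'Feature-distance-Feature' keys found in one molecule.

    adjacency: {atom_index: [neighbour indices]}
    features:  {atom_index: set of feature names}; one atom may carry several
    """
    keys = set()
    typed_atoms = sorted(features)
    for a in typed_atoms:
        dist = topological_distances(adjacency, a)
        for b in typed_atoms:
            if b < a or b not in dist:
                continue  # visit each atom pair once; skip disconnected atoms
            for fa in features[a]:
                for fb in features[b]:
                    if a == b and fa == fb:
                        continue  # a feature is not paired with itself
                    first, second = sorted((fa, fb))
                    keys.add(f"{first}-{dist[b]}-{second}")
    return keys


# Toy chain: aromatic atom (0) - CH2 (1) - CH2 (2) - amine N (3).
# The nitrogen is labelled both Base and Donor (superimposed features).
toy_adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
toy_features = {0: {"Aromatic"}, 3: {"Base", "Donor"}}
toy_keys = pharmacophore_pairs(toy_adjacency, toy_features)
print(sorted(toy_keys))
```

The toy molecule yields the "Aromatic - 3 bonds - Base" pair used as the example above, plus a distance-0 key for the two features superimposed on the nitrogen.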
Strategy                     Compounds generated   Matches to D2 known set   Max pIC50 (actual)   Max pIC50 (predicted [error])
Hit-to-Lead                  682                   10                        7.8                  5.5 [0.21]
Dopamine class               469                   8                         7.9                  5.5 [0.23]
Solubility                   10148                 10                        7.8                  5.5 [0.21]
Metabolism                   12729                 19                        7.9                  5.5 [0.21]
Permutative MMPA (env = 4)   5                     3                         7.9                  6.1 [?]
A. G. Dossetter•, E. Griffen•, A. Leach•+, P. de Sousa•
•MedChemica Ltd, Macclesfield, UK; +Pharmacy and Biomolecular Sciences, Liverpool John Moores University
Problem
How can we reduce the number of compounds made in going from a small set of confirmed hits to compounds we can test in vivo? For example: can we go from 30 hits to potent, in vivo available leads in 10 rounds of synthesizing 30 compounds?
Learning
Combining focused generative approaches with explainable QSAR models shows initial promise.
The pinch point is the second set of compounds.
MedChemica
contact@medchemica.com
Approach
Generate virtual compounds from the MedChemica Knowledge database
• Hit-to-Lead transformations: the most used medicinal chemistry
• ADMET transformations for metabolism and solubility
• Target class transformations learning from target analogues
Permutative MMPA
• Generate compounds from data already gained
Regression forest models
• Accurate pharmacophore features with topological distance
• Unfolded fingerprints connect feature importance to pharmacophores
• Error models give accuracy of prediction for each compound
Active Learning
• Explore from predicted high potency, high error
• Exploit from predicted high potency, low error
Case Study
Dopamine D2 dataset
• Well-studied target, ligand-based design
• >5200 measured compounds known
• Simulate hit optimization process
• Use known compounds as validation
The Startpoints
30 compounds: 5 <= pIC50 <= 6, -1 < AlogP < 3.5, selected by LLE sort
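The start-point selection in the case study (potency and AlogP window, then an LLE sort) can be sketched as follows. LLE is taken here in its usual sense, pIC50 minus logP. Ranking by descending LLE, the record layout and the function name are assumptions for illustration, and the example values are invented.

```python
def select_start_points(compounds, n=30):
    """Filter to the stated potency/lipophilicity window, then rank by LLE.

    compounds: list of dicts with 'id', 'pIC50' and 'AlogP' keys (assumed layout).
    """
    in_window = [
        c for c in compounds
        if 5.0 <= c["pIC50"] <= 6.0 and -1.0 < c["AlogP"] < 3.5
    ]
    # LLE (lipophilic ligand efficiency) = pIC50 - logP; higher is preferred here.
    ranked = sorted(in_window, key=lambda c: c["pIC50"] - c["AlogP"], reverse=True)
    return ranked[:n]


# Invented example records, not data from the D2 set.
example_hits = [
    {"id": "A", "pIC50": 5.8, "AlogP": 3.0},  # LLE 2.8
    {"id": "B", "pIC50": 5.2, "AlogP": 0.5},  # LLE 4.7
    {"id": "C", "pIC50": 6.4, "AlogP": 2.0},  # outside the potency window
    {"id": "D", "pIC50": 5.5, "AlogP": 4.1},  # outside the AlogP window
    {"id": "E", "pIC50": 5.9, "AlogP": 1.9},  # LLE 4.0
]
picked = select_start_points(example_hits, n=2)
print([c["id"] for c in picked])
```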
Permutative MMPA
• Take all compounds in a data set
• Find all matched pairs; extract ΔpIC50 and the transforms between them
• Aggregate transformations with median ΔpIC50 and count of pairs
• Apply all transformations back to the initial data set (at what environment level?)
• Predicted pIC50 = substrate pIC50 + median ΔpIC50
• Remove existing compounds
• Prioritise new compounds by pIC50 estimate
[Figure: matched pairs M1/M2 and M3/M4 share transform t1; applying t1 to M5 generates M*]
• M1 → M2: transform t1
• M3 → M4: transform t1
• M5 matches t1 and generates M*
• Predict pIC50(M*) = pIC50(M5) + median ΔpIC50(t1)
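The permutative MMPA steps above can be sketched on a toy representation. This is a simplification under stated assumptions: each molecule is reduced to a (scaffold, substituent) pair, a transformation is a substituent swap, and the chemical environment level is ignored entirely. Real matched-pair analysis requires molecular fragmentation, which is not shown. All names and values are hypothetical.

```python
from collections import defaultdict
from itertools import permutations
from statistics import median


def permutative_mmpa(data):
    """data: {(scaffold, substituent): pIC50}. Returns ranked predictions for new compounds."""
    # 1. Find matched pairs (same scaffold, different substituent) and their delta pIC50.
    by_scaffold = defaultdict(list)
    for (scaffold, sub), pic50 in data.items():
        by_scaffold[scaffold].append((sub, pic50))
    deltas = defaultdict(list)
    for members in by_scaffold.values():
        for (sub_a, p_a), (sub_b, p_b) in permutations(members, 2):
            deltas[(sub_a, sub_b)].append(p_b - p_a)
    # 2. Aggregate each transformation: median delta and number of supporting pairs.
    transforms = {t: (median(d), len(d)) for t, d in deltas.items()}
    # 3. Apply every transformation back to every compound.
    predictions = {}
    for (scaffold, sub), pic50 in data.items():
        for (sub_from, sub_to), (med, count) in transforms.items():
            product = (scaffold, sub_to)
            # 4. Skip non-matching substrates and compounds that already exist.
            if sub_from != sub or product in data:
                continue
            estimate = pic50 + med  # predicted pIC50 = substrate pIC50 + median delta
            # If several routes reach one product, keep the best-supported estimate.
            if product not in predictions or count > predictions[product][1]:
                predictions[product] = (estimate, count)
    # 5. Prioritise new compounds by predicted pIC50.
    return sorted(predictions.items(), key=lambda kv: kv[1][0], reverse=True)


# Mirrors the schematic: M1->M2 and M3->M4 share transform t1 (H to F); M5 matches t1.
toy_data = {
    ("s1", "H"): 5.0,  # M1
    ("s1", "F"): 5.6,  # M2
    ("s2", "H"): 5.2,  # M3
    ("s2", "F"): 6.0,  # M4
    ("s3", "H"): 5.5,  # M5
}
ranked = permutative_mmpa(toy_data)
for product, (estimate, count) in ranked:
    print(product, round(estimate, 2), count)
```

In the toy set the only new compound is the fluoro analogue of M5, predicted from the median of the two observed H to F changes. The open question raised above, which environment level to require for a match, would tighten the `sub_from != sub` test in a real implementation.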
Generate molecules from Knowledge Database
[Figure labels: Substrate molecules; MedChemica Transformation Database; Generator; Virtual molecules]
• Hit-to-Lead transformations: 689 transformations with >= 250 example pairs
• Dopamine receptor transformations (not D2!): 1027 transformations
• Solubility: 6320 transformations
• Metabolism: 12719 transformations
Generating new structures is not an issue…
Conclusions
• Good starting points are key (!)
• There is no free lunch: good models need data
• Make best use of the data you already have: focused permutative MMPA finds SAR you may have missed by eye
• Target class based enumeration is most efficient, but we still need a better method for round 2 synthesis
• The first set of compounds after the hits is critical if you want to move fast…
Experiment: Fully automated active learning
• Build RF model: CV-R2 of -0.26 on a small data set. Is it useful?
• Enumerate from all compounds:
  • What's the best enumeration strategy?
  • How to pick the (few) compounds to make from the enumerated set?
[Figure: 90% of predictions within 0.5 log of measured]
• Enumeration generates high potency compounds, but early models are too coarse to correctly prioritize the best small set for synthesis, either by high error or by high potency
[Figure annotation: 7.9!]
• Permutative MMPA with a tight definition of the MMPA environment generates an excellent first set of follow-up compounds, learning from the SAR within the hits
• The second batch of compounds is more of a challenge…
[Figure caption: Most potent compound (measured) from HtL enumeration]
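The model statistics quoted on the poster (CV R2, CV RMSE, fraction of predictions within 0.5 log of measured) can be computed as in the sketch below. One point is worth noting: a squared Pearson correlation cannot be negative, whereas the coefficient of determination can be when predictions are worse than predicting the mean. The poster does not state which definition underlies the reported -0.26, so both are shown. Function names and example values are assumptions for illustration.

```python
from math import sqrt


def rmse(measured, predicted):
    """Root mean squared error between measured and predicted values."""
    return sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured))


def pearson_r2(measured, predicted):
    """Squared Pearson correlation; always between 0 and 1."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_m * var_p)


def coefficient_of_determination(measured, predicted):
    """1 - SS_res / SS_tot; negative when the model is worse than the mean."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def fraction_within(measured, predicted, tolerance=0.5):
    """Share of predictions within `tolerance` log units of the measured value."""
    hits = sum(abs(m - p) <= tolerance for m, p in zip(measured, predicted))
    return hits / len(measured)


# Invented held-out values: a compressed prediction range, as early models often give.
measured_pic50 = [5.0, 5.5, 6.0, 6.5, 7.0]
predicted_pic50 = [5.8, 5.6, 5.7, 5.5, 5.6]
print(rmse(measured_pic50, predicted_pic50))
print(pearson_r2(measured_pic50, predicted_pic50))
print(coefficient_of_determination(measured_pic50, predicted_pic50))
print(fraction_within(measured_pic50, predicted_pic50))
```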
Active Learning
[Flowchart: Hits → Build model with error estimates → Enumerate → Select for Explore and Exploit → Synthesise & Test → Compounds with data → Compounds meet criteria? (Yes / No)]
Explore: prioritize high error
Exploit: prioritize high potency & low error
Ratio of explore to exploit varies with stage
Select enumeration strategy by stage: Hit-to-lead, target class, solubility, metabolism
For in silico simulation, match to known and measured compounds
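The explore/exploit selection step can be sketched as below. This is one possible reading of the rules stated above, not the authors' implementation: the potency cutoff, the explore fraction and the record layout are hypothetical parameters, and the example values are invented.

```python
def select_batch(candidates, batch_size, explore_fraction, potency_cutoff):
    """Split a synthesis batch into explore and exploit picks.

    candidates: list of dicts with 'id', 'pred_pIC50' and 'pred_error' (assumed layout).
    explore_fraction: share of the batch spent on exploration; may vary by stage.
    """
    potent = [c for c in candidates if c["pred_pIC50"] >= potency_cutoff]
    n_explore = round(batch_size * explore_fraction)
    # Explore: among predicted-potent compounds, take those the model is least sure about.
    explore = sorted(potent, key=lambda c: c["pred_error"], reverse=True)[:n_explore]
    chosen = {c["id"] for c in explore}
    # Exploit: highest predicted potency, ties broken by lowest predicted error.
    remaining = [c for c in potent if c["id"] not in chosen]
    exploit = sorted(
        remaining, key=lambda c: (-c["pred_pIC50"], c["pred_error"])
    )[: batch_size - len(explore)]
    return explore, exploit


# Invented virtual compounds with model predictions and predicted errors.
virtual_set = [
    {"id": "v1", "pred_pIC50": 6.1, "pred_error": 0.20},
    {"id": "v2", "pred_pIC50": 5.9, "pred_error": 0.60},
    {"id": "v3", "pred_pIC50": 5.7, "pred_error": 0.45},
    {"id": "v4", "pred_pIC50": 5.2, "pred_error": 0.90},  # below the potency cutoff
    {"id": "v5", "pred_pIC50": 6.0, "pred_error": 0.25},
]
explore_picks, exploit_picks = select_batch(
    virtual_set, batch_size=4, explore_fraction=0.5, potency_cutoff=5.5
)
print([c["id"] for c in explore_picks], [c["id"] for c in exploit_picks])
```

Changing `explore_fraction` between rounds is one way to express the statement that the ratio of explore to exploit varies with stage.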

