Accelerating lead optimisation with active learning by exploiting MMPA based ADMET knowledge with regression forest potency models

Presented at the 15th GCC - German Conference on Cheminformatics November 2019

We combine regression forest machine learning with our MMPA based generative methods to deliver an active learning system to accelerate lead optimisation. In the process we identify permutative MMPA as a method to leverage SAR information from small data sets.

Published by MedChemica Ltd

A. G. Dossetter•, E. Griffen•, A. Leach•+, P. de Sousa•
•MedChemica Ltd, Macclesfield, UK; +Pharmacy and Biomolecular Sciences, Liverpool John Moores University
Contact: MedChemica, contact@medchemica.com

Problem
How can we reduce the number of compounds made in going from a small set of confirmed hits to compounds we can test in vivo? For example: can we go from 30 hits to potent in vivo available leads in 10 rounds of synthesizing 30 compounds?

Approach
Generate virtual compounds from MedChemica Knowledge database
• Hit-to-Lead transformations: the most used medicinal chemistry
• ADMET transformations for metabolism and solubility
• Target class transformations learning from target analogues
Permutative MMPA
• Generate compounds from data already gained
Regression forest models
• Accurate pharmacophore features with topological distance
• Unfolded fingerprints connect feature importance to pharmacophores
• Error models give accuracy of prediction for each compound
Active Learning
• Explore from predicted high potency, high error
• Exploit from predicted high potency, low error

Case Study: Dopamine D2 dataset
• Well studied target, ligand based design
• >5200 measured compounds known
• Simulate hit optimization process
• Use known compounds as validation

The Startpoints
30 compounds: 5 <= pIC50 <= 6, -1 < AlogP < 3.5, selected by LLE sort

Generate molecules from Knowledge Database
[Diagram labels: MedChemica Transformation Database, Generator, Substrate molecules, Virtual molecules]
• Hit-to-Lead transformations: 689 transformations with >=250 example pairs
• Dopamine receptor transformations (not D2!): 1027 transformations
• Solubility: 6320 transformations
• Metabolism: 12719 transformations
Generating new structures is not an issue…

Permutative MMPA
• Take all compounds in a data set
• Find all matched pairs, extract ΔpIC50 and the transforms between them
• Aggregate transformations with median ΔpIC50 and count of pairs
• Apply all transformations back to the initial data set (at what environment level?)
• Predicted pIC50 = substrate pIC50 + median ΔpIC50
• Remove existing compounds
• Prioritise new compounds by pIC50 estimate
Example:
• M1 → M2: transform t1
• M3 → M4: transform t1
• M5 matches t1 and generates M*
• Predict pIC50: pIC50(M5) + median ΔpIC50(t1)

Regression forest models
• Features are acid, base, hydrogen bond donor, acceptor, hydrophobe, aromatic attachment, aliphatic attachment and halogen. Definitions are highly engineered.†
• Feature 1 - topological distance - Feature 2
• Engineered for chemical relevance: features can be superimposed or directly linked, which e.g. enables a group to be both a hydrogen bond acceptor and a base
• A bit identifies a pharmacophore pair, e.g. Aromatic - 3 bonds - Base
• Used as unfolded 280 bit fingerprints
• Regression Forest as ML method
• Build models with 10-fold CV; report CV Pearson's R2 and CV RMSE
• Build RF error model to generate predicted error for each compound using the same descriptors
† Taylor, R.; Cole, J. C.; Cosgrove, D. A.; Gardiner, E. J.; Gillet, V. J.; Korb, O. J Comput Aided Mol Des 2012, 26 (4), 451-472.
† Acid and base definitions are SMARTS (MedChemica definitions): acids include C, N and heteroaromatic acids; bases exclude weak aniline bases and include amidines and guanidines.

Active Learning
[Flowchart labels, in order: Hits; Build model with error estimates; Enumerate; Select for Explore and Exploit; Synthesise & Test; Compounds with data; Compounds meet criteria? Yes / No]
• Explore: prioritize high error
• Exploit: prioritize high potency & low error
• Ratio of explore to exploit varies with stage
• Select enumeration strategy by stage: Hit-to-lead, target class, solubility, metabolism
• For in silico simulation match to known and measured compounds

Experiment: Fully automated active learning
• Build RF model CV-R2 -0.26, small data set, is it useful?
• Enumerate from all compounds:
  • what's the best enumeration strategy?
  • how to pick the (few) compounds to make from the enumerated set?
• ? 90% of predictions within 0.5 log of measured
• Enumeration generates high potency compounds, but early models are too coarse to correctly prioritize the best small set for synthesis, either by high error or by high potency
• Permutative MMPA with a tight definition of the MMPA environment generates an excellent first set of follow up compounds, learning from the SAR within the hits
• The second batch of compounds is more of a challenge…
[Figure annotations: "7.9!"; "Most potent compound (measured) from HtL enumeration"]

Strategy | Number of compounds generated | Number of matches to D2 known set | Maximum pIC50 (actual) | Maximum pIC50 (predicted [error])
Hit-to-Lead | 682 | 10 | 7.8 | 5.5 [0.21]
Dopamine class | 469 | 8 | 7.9 | 5.5 [0.23]
Solubility | 10148 | 10 | 7.8 | 5.5 [0.21]
Metabolism | 12729 | 19 | 7.9 | 5.5 [0.21]
Permutative MMPA (env = 4) | 5 | 3 | 7.9 | 6.1 [?]

Conclusions
• Good starting points are key(!)
• There is no free lunch: good models need data
• Make best use of the data you already have: focused permutative MMPA finds SAR you may have missed by eye
• Target class based enumeration is most efficient, but we still need a better method for round 2 synthesis
• The first set of compounds after the hits is critical if you want to move fast…

Learning
Combining focused generative approaches with explainable QSAR models shows initial promise. The pinch point is the second set of compounds.
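
The permutative MMPA steps listed on the poster (find pairs, aggregate by median ΔpIC50, apply back, remove existing compounds, prioritise) can be sketched in a few lines. This is only an illustration under strong assumptions, not MedChemica's implementation: molecules are encoded as hypothetical (core, substituent) tuples, a matched pair is any two molecules sharing a core, the chemical environment level is ignored, and all pIC50 values are invented toy numbers.

```python
from collections import defaultdict
from itertools import permutations
from statistics import median


def permutative_mmpa(data):
    """data maps a toy molecule (core, substituent) to its measured pIC50."""
    # Step 1: find all matched pairs (same core, different substituent)
    # and record the pIC50 change for each directed transform.
    by_core = defaultdict(list)
    for (core, sub), pic50 in data.items():
        by_core[core].append((sub, pic50))
    deltas = defaultdict(list)
    for members in by_core.values():
        for (sub_a, p_a), (sub_b, p_b) in permutations(members, 2):
            deltas[(sub_a, sub_b)].append(p_b - p_a)

    # Step 2: aggregate each transform as (median delta, number of pairs).
    stats = {t: (median(d), len(d)) for t, d in deltas.items()}

    # Step 3: apply every transform back to every matching substrate.
    predictions = {}
    for (core, sub), pic50 in data.items():
        for (sub_a, sub_b), (med, count) in stats.items():
            if sub != sub_a:
                continue
            product = (core, sub_b)
            if product in data:  # remove compounds that already exist
                continue
            # If several transforms reach one product, keep the estimate
            # backed by the most pairs (an assumption of this sketch).
            if product not in predictions or count > predictions[product][1]:
                predictions[product] = (pic50 + med, count)

    # Step 4: prioritise new compounds by estimated pIC50.
    rows = [(mol, round(est, 2), n) for mol, (est, n) in predictions.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy data mirroring the poster's M1..M5 example (values are invented).
toy = {
    ("coreA", "H"): 5.0,  # M1
    ("coreA", "F"): 5.6,  # M2
    ("coreB", "H"): 5.2,  # M3
    ("coreB", "F"): 6.0,  # M4
    ("coreC", "H"): 5.5,  # M5
}
ranked = permutative_mmpa(toy)
print(ranked)
```

Here the H to F transform is seen twice, so M5 yields one new virtual compound, ("coreC", "F"), whose estimate is pIC50(M5) plus the median ΔpIC50 of that transform. A real workflow would need a cheminformatics toolkit to fragment structures and to match the transform environment.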

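
The explore/exploit selection described for the active learning loop could look roughly like the sketch below. The poster does not say how compounds are actually chosen, so everything here is an assumption made for illustration: the `Candidate` fields, the `potency_cutoff` filter, and the `explore_fraction` parameter standing in for a ratio that "varies with stage".

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    pred_pic50: float  # potency predicted by the regression forest
    pred_error: float  # error predicted by the separate error model


def select_batch(candidates, batch_size, explore_fraction, potency_cutoff):
    """Split one synthesis batch into explore and exploit picks (hypothetical rule)."""
    # Both arms start from compounds predicted to be potent.
    potent = [c for c in candidates if c.pred_pic50 >= potency_cutoff]
    n_explore = int(batch_size * explore_fraction)

    # Explore: potent candidates where the model is least certain.
    explore = sorted(potent, key=lambda c: c.pred_error, reverse=True)[:n_explore]

    # Exploit: remaining potent candidates, lowest error first,
    # ties broken by higher predicted potency.
    taken = set(explore)
    remaining = [c for c in potent if c not in taken]
    exploit = sorted(remaining, key=lambda c: (c.pred_error, -c.pred_pic50))
    return explore, exploit[: batch_size - n_explore]


# Invented predictions for five virtual compounds.
pool = [
    Candidate("a", 6.0, 0.9),
    Candidate("b", 6.2, 0.2),
    Candidate("c", 5.9, 0.3),
    Candidate("d", 4.0, 1.5),
    Candidate("e", 6.5, 0.6),
]
explore, exploit = select_batch(pool, batch_size=2, explore_fraction=0.5, potency_cutoff=5.5)
print([c.name for c in explore], [c.name for c in exploit])
```

Raising `explore_fraction` in early rounds and lowering it later is one way to vary the ratio by stage; the poster's finding that early models are too coarse to prioritise well applies to any such rule.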