Dual-Event Machine Learning Models to Accelerate Drug Discovery
Sean Ekins1,2*, Robert C. Reynolds3,4*, Hiyun Kim5, Mi-Sun Koo5, Marilyn Ekonomidis5, Meliza Talaue5, Steve D. Paget5, Lisa K. Woolhiser6, Anne J. Lenaerts6, Barry A. Bunin1, Nancy Connell5 and Joel S. Freundlich5,7*
1 Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010, USA.
2 Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA.
3 Southern Research Institute, 2000 Ninth Avenue South, Birmingham, AL 35205, USA.
4 Current address: University of Alabama at Birmingham, College of Arts and Sciences, Department of Chemistry, 1530 3rd Avenue South, Birmingham, Alabama 35294-1240, USA.
5 Department of Medicine, Center for Emerging and Reemerging Pathogens, UMDNJ – New Jersey Medical School, 185 South Orange Avenue, Newark, NJ 07103, USA.
6 Department of Microbiology, Immunology and Pathology, Colorado State University, 200 West Lake Street, CO 80523, USA.
7 Department of Pharmacology & Physiology, UMDNJ – New Jersey Medical School, 185 South Orange Avenue, Newark, NJ 07103, USA.
TB facts
• Tuberculosis kills 1.6-1.7 million people per year (~1 every 8 seconds)
• One third of the world's population is infected
• Multidrug resistance in 4.3% of cases
• Extensively drug-resistant TB: increasing incidence
• One new drug (bedaquiline) in 40 years; the earlier drugs:
- streptomycin (1943)
- para-aminosalicylic acid (1949)
- isoniazid (1952) (Bayer, Roche, Squibb)
- pyrazinamide (1954)
- cycloserine (1955)
- ethambutol (1962)
- rifampicin (1967)
• Drug-drug interactions and co-morbidity with HIV
• Collaboration between groups is rare; these groups may work on existing or new targets
• Use of computational methods with TB is rare
~20 public datasets for TB
• Including Novartis data on TB hits
• >300,000 cpds
• Patents, papers
• Annotated by CDD
• Open to browse by anyone: http://www.collaborativedrug.com/register
Phenotypic screening
• HTS hit rates (SRI papers): usually less than 1%
Bayesian Model Construction: Mtb Whole-Cell HTS
• Learning from 3,779 compounds from an NIAID library
- active: MIC < 5 µM
- inactive: MIC ≥ 5 µM
Bayesian machine learning
Bayesian classification is a simple probabilistic classification model. It is based on Bayes' theorem:
p(h|d) = p(d|h) p(h) / p(d)
• h is the hypothesis or model
• d is the observed data
• p(h) is the prior belief (probability of hypothesis h before observing any data)
• p(d) is the data evidence (marginal probability of the data)
• p(d|h) is the likelihood (probability of data d if hypothesis h is true)
• p(h|d) is the posterior probability (probability of hypothesis h being true given the observed data d)
A weight is calculated for each feature using a Laplacian-adjusted probability estimate to account for the different sampling frequencies of different features. The weights are summed to provide a probability estimate.
Ekins, Williams and Xu, Drug Metab Dispos 38: 2302-2308, 2010
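The per-feature weighting described above can be sketched in a few lines. This is a minimal illustration, not the Discovery Studio implementation: it assumes one commonly described form of the Laplacian-corrected estimate, weight(F) = ln((A_F + 1) / (N_F * P + 1)), where N_F is the number of training molecules containing feature F, A_F is how many of those are active, and P is the overall fraction of actives. The feature names and the `TRAINING` set are hypothetical.

```python
import math
from collections import Counter


def laplacian_weights(samples):
    """Return {feature: weight} from (features, is_active) training pairs.

    Assumed form: weight = ln((A_F + 1) / (N_F * P + 1)), with P the overall
    active fraction. Rarely seen features are pulled toward a weight of 0.
    """
    if not samples:
        raise ValueError("need at least one training sample")
    base_rate = sum(1 for _, active in samples if active) / len(samples)
    n_with = Counter()  # N_F: molecules containing the feature
    a_with = Counter()  # A_F: active molecules containing the feature
    for features, active in samples:
        for f in set(features):
            n_with[f] += 1
            if active:
                a_with[f] += 1
    return {
        f: math.log((a_with[f] + 1) / (n_with[f] * base_rate + 1))
        for f in n_with
    }


def bayesian_score(features, weights):
    """Sum the weights of the features present; unseen features add 0."""
    return sum(weights.get(f, 0.0) for f in set(features))


# Hypothetical toy training set: (set of substructure features, active?)
TRAINING = [
    ({"nitrofuran", "hydrazone"}, True),
    ({"nitrofuran", "amide"}, True),
    ({"amide", "phenyl"}, False),
    ({"phenyl", "hydrazone"}, False),
]
WEIGHTS = laplacian_weights(TRAINING)
```

In this toy set a feature seen only in actives gets a positive weight, a feature seen only in inactives gets a negative weight, and a feature split evenly gets a weight of zero, so a molecule's summed score rises with the number of "good" features it carries.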
Novel Bayesian Models for Mtb Whole-Cell Efficacy
• SRI MLSMR 220K single-point model
- active: ≥90% inhibition @ 10 µM
- inactive: <90% inhibition @ 10 µM
• SRI MLSMR 2.5K dose-response model
- active: IC50 ≤ 5 µM
- inactive: IC50 > 5 µM
Model Building and Validation
• Laplacian-corrected Bayesian classifier models (Accelrys Discovery Studio)
• Molecular function class fingerprints of maximum diameter 6 (FCFP_6)
• Simple molecular descriptors chosen including AlogP, molecular weight, # rotatable bonds, # rings, # hydrogen bond acceptors, # hydrogen bond donors, and polar surface area
• Validated w/ leave-one-out cross-validation & leave-50%-out cross-validation
Ekins, S. et al., Mol. Biosyst. 2010, 6, 840-851; Ekins, S. et al., Mol. Biosyst. 2010, 6, 2316-2324.
Bayesian Classification TB Models
We can use the public data for machine learning model building, using the Discovery Studio Bayesian model.
Leave out 50% x 100:
Dataset (number of molecules) | External ROC score | Internal ROC score | Concordance (%) | Specificity (%) | Sensitivity (%)
MLSMR all single-point screen (N = 220,463) | 0.86 ± 0 | 0.86 ± 0 | 78.56 ± 1.86 | 78.59 ± 1.94 | 77.13 ± 2.26
MLSMR dose-response set (N = 2,273) | 0.73 ± 0.01 | 0.75 ± 0.01 | 66.85 ± 4.06 | 67.21 ± 7.05 | 65.47 ± 7.96
Ekins et al., Mol BioSyst, 6: 840-851, 2010
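The "leave out 50% x 100" validation can be sketched as repeated random half splits, scoring the held-out half each time and summarising the ROC AUC as mean ± spread. This is only an illustration under stated assumptions: `fit_direction` is a hypothetical toy model standing in for the Bayesian classifier, `TOY_DATA` is invented, and the AUC is computed by direct pair counting rather than by any particular package.

```python
import random
import statistics


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (active, inactive) pairs in which the
    active has the higher score; tied pairs count as 0.5."""
    actives = [s for y, s in zip(labels, scores) if y]
    inactives = [s for y, s in zip(labels, scores) if not y]
    if not actives or not inactives:
        raise ValueError("AUC needs at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


def leave_half_out(data, fit, n_repeats=100, seed=0):
    """Repeat: shuffle, train on one half, compute AUC on the other half.

    data is a list of (x, is_active); fit(train) returns a scoring function.
    Returns (mean AUC, population standard deviation of AUC).
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_repeats):
        shuffled = list(data)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, held_out = shuffled[:half], shuffled[half:]
        labels = [y for _, y in held_out]
        if all(labels) or not any(labels):
            continue  # AUC is undefined when the held-out half has one class
        score = fit(train)
        aucs.append(roc_auc(labels, [score(x) for x, _ in held_out]))
    if not aucs:
        raise ValueError("no split contained both classes")
    return statistics.mean(aucs), statistics.pstdev(aucs)


def fit_direction(train):
    """Hypothetical toy model: score by x, signed so actives rank higher."""
    actives = [x for x, y in train if y]
    inactives = [x for x, y in train if not y]
    if not actives or not inactives:
        return lambda x: 0.0
    sign = 1.0 if statistics.mean(actives) >= statistics.mean(inactives) else -1.0
    return lambda x: sign * x


# Invented, perfectly separable data: values of 10 and above are "active".
TOY_DATA = [(float(i), i >= 10) for i in range(20)]
```

On real screening data the same loop would be run with the actual classifier as `fit`, and the spread across the 100 splits gives the ± values of the kind shown in the table.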
Bayesian Classification Models for TB
• Laplacian-corrected Bayesian classifier models were generated using FCFP_6 and simple descriptors.
• 2 models: 220,000 and >2,000 compounds
• Active compounds: MIC < 5 µM
[Figure panels: Good, Bad]
Ekins et al., Mol BioSyst, 6: 840-851, 2010
Additional test sets
• 100K library: 1,702 hits in >100K cpds
• Novartis data: 34 hits in 248 cpds
• FDA drugs: 21 hits in 2,108 cpds
Suggests models can predict data from the same and independent labs.
Enrichments of 4-10 fold. Initial enrichment enables screening few compounds to find actives.
Ekins et al., Mol BioSyst, 6: 840-851, 2010; Ekins and Freundlich, Pharm Res, 28, 1859-1869, 2011.
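Fold enrichment of the kind quoted above is usually computed as the hit rate in the top-scoring fraction of a ranked library divided by the hit rate of the whole library. The sketch below assumes that definition; the function name, the scores, and the positions of the actives are all hypothetical.

```python
def enrichment_factor(scores, labels, fraction):
    """Hit rate among the top-scoring `fraction` of compounds divided by
    the hit rate across the whole set (assumed definition of enrichment)."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    total_hits = sum(1 for y in labels if y)
    if total_hits == 0:
        raise ValueError("enrichment is undefined without any actives")
    n = len(scores)
    n_top = max(1, int(n * fraction))  # always inspect at least one compound
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(1 for _, y in ranked[:n_top] if y)
    return (hits_top / n_top) / (total_hits / n)


# Hypothetical library of 20 compounds, already in descending score order,
# with 4 actives at ranks 1, 2, 8 and 16.
SCORES = list(range(20, 0, -1))
LABELS = [i in (0, 1, 7, 15) for i in range(20)]

TOP_10_PERCENT = enrichment_factor(SCORES, LABELS, 0.10)
TOP_50_PERCENT = enrichment_factor(SCORES, LABELS, 0.50)
```

Here the top 10% (two compounds) are both active against a 20% base rate, giving a 5-fold enrichment; widening to the top 50% dilutes this to 1.5-fold, which is why early enrichment matters when only a few compounds can be tested.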
Testing prospectively
• Testing to date has been retrospective
• Can we use our models to select compounds and influence design?
• Prospective prediction
• Do it enough times to show robustness
Bayesian Machine Learning Models – testing
• Ranked Asinex 25K library with MLSMR dose-response model; Bayesian score range -28.4 to 15.3
• 99 compounds screened (Bayesian score 9.4 to 15.3)
• 12 cpds were identified with IC90 < 30 µg/mL (~12% hit rate)
• Most active: SYN 22269076, a pyrazolo[1,5-a]pyrimidine; IC50 1.1 µg/mL (3.2 µM)
[Figure: compound structures labeled with Bayesian scores 14.9, 10.6, 9.8]
Principal component analysis (PCA) of all SRI data sets to illustrate overlap of chemistry space using the datasets from this study (red = TAACF-CB2, green = MLSMR, black = kinase dataset). 3 PCs explain 72% of the variance.
Dual-Event models
• High-throughput phenotypic Mtb screening → molecule database
• Descriptors + Bioactivity (+ Cytotoxicity) → Bayesian Machine Learning Mtb Model
• Molecule Database (e.g. GSK malaria actives) virtually scored using Bayesian Models
• Top scoring molecules assayed for Mtb growth inhibition
• Identify in vitro hits; new bioactivity data may enhance models
• Increased hit/lead discovery efficiency
Dual-Event models
Become more stringent in what we call an ACTIVE: IC90 < 10 µg/mL (CB2) or < 10 µM (MLSMR), and a selectivity index (SI) greater than ten.
SI was calculated as SI = CC50/IC90, where CC50 is the concentration that resulted in 50% inhibition of Vero cells.
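The dual-event "active" label combines the two criteria above: potency (IC90 below the cutoff) and selectivity (SI above ten). A minimal sketch of that labeling rule follows; the `Compound` class, the compound names, and the example values are hypothetical, and IC90 and CC50 are assumed to be expressed in the same concentration units.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    name: str
    ic90: float  # Mtb IC90 (same units as cc50)
    cc50: float  # Vero cell CC50 (same units as ic90)


def selectivity_index(compound):
    """SI = CC50 / IC90."""
    if compound.ic90 <= 0:
        raise ValueError("IC90 must be positive")
    return compound.cc50 / compound.ic90


def is_dual_event_active(compound, ic90_cutoff=10.0, si_cutoff=10.0):
    """Active only if potent (IC90 < cutoff) AND selective (SI > cutoff)."""
    return compound.ic90 < ic90_cutoff and selectivity_index(compound) > si_cutoff


# Hypothetical examples, not data from the screens.
POTENT_AND_SELECTIVE = Compound("cpd_A", ic90=0.5, cc50=40.0)   # SI = 80
POTENT_BUT_CYTOTOXIC = Compound("cpd_B", ic90=2.0, cc50=8.0)    # SI = 4
SELECTIVE_BUT_WEAK = Compound("cpd_C", ic90=25.0, cc50=500.0)   # SI = 20
```

Only the first example would be labeled active. The second would have counted as active under a potency-only definition, which is the class of compound the stricter labeling is meant to exclude from the training set's actives.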
Bayesian Classification TB Models
Cross-validated ROC AUC (ROC plot): single point = 0.88; dose response = 0.78; dose response + cytotoxicity = 0.86
Dataset (number of molecules) | External ROC score | Internal ROC score | Concordance (%) | Specificity (%) | Sensitivity (%)
MLSMR all single-point screen (N = 220,463) | 0.86 ± 0 | 0.86 ± 0 | 78.56 ± 1.86 | 78.59 ± 1.94 | 77.13 ± 2.26
MLSMR dose-response set (N = 2,273) | 0.73 ± 0.01 | 0.75 ± 0.01 | 66.85 ± 4.06 | 67.21 ± 7.05 | 65.47 ± 7.96
NEW: dose response and cytotoxicity (N = 2,273) | 0.82 ± 0.02 | 0.84 ± 0.02 | 82.61 ± 4.68 | 83.91 ± 5.48 | 65.99 ± 7.47
Ekins et al., PLOS ONE, in press 2013
Prospective prediction of antimalarial compounds vs Mtb
1. Virtually screen 13,533-member GSK antimalarial hit library
2. Model = SRI TAACF-CB dose response + cytotoxicity model
3. Top 46 commercially available compounds visually inspected
4. 7 compounds chosen for Mtb testing based on
- drug-likeness
- chemotype diversity
Model statistics (column headers on the slide: External ROC score, Internal ROC score, Concordance, Specificity, Sensitivity; values in slide order):
TAACF-CB2 IC90 and cytotoxicity (N = 1,783): 0.64; 0.59 ± 0.01; 0.63 ± 0.02; 55.74 ± 1.31; 61.61 ± 8.96
Prospective prediction of antimalarial compounds vs Mtb
7 tested, 5 active (~71% hit rate)
Ekins et al., Chem Biol 20, 370–378, 2013
Bayesian Model Follow-up: Do we have a lead?
• BAS00521003 / TCMDC-125802 reported to be a P. falciparum lactate dehydrogenase inhibitor
• Only one report (that we were unaware of when picking the compound) of antitubercular activity, from 1969
- solid agar MIC = 1 µg/mL ("wild strain")
- "no activity" in mouse model up to 400 mg/kg
- however, activity was solely judged by extension of survival!
• MIC of 0.0625 µg/mL
• SRI MLSMR 220K library contains:
- 107 hits with this substructure
- 32 inactives with this substructure
- 3 nitrofuryl hydrazones, 10 furyl hydrazones, 19 nitrophenyl hydrazones
Bruhin, H. et al., J. Pharm. Pharmac. 1969, 21, 423-433. Maddry et al., Tuberculosis 2009, 89, 354.
Efficacy Profiling of TCMDC-125802
• 64X MIC affords 6 logs of kill
• Resistance and/or drug instability beyond 14 d
• Vero cells: CC50 = 4.0 µg/mL
• Selectivity Index: SI = CC50/MIC(Mtb) = 16 – 64
Ekins et al., Chem Biol 20, 370–378, 2013
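As a check on the selectivity index range quoted above, the relation SI = CC50/MIC can be rearranged to give the MIC range it implies. This is a worked rearrangement only, assuming a single CC50 of 4.0 µg/mL and that the SI range of 16 to 64 arises entirely from a range of MIC values (an assumption, not something stated on the slide):

```latex
\mathrm{MIC}_{Mtb} = \frac{\mathrm{CC}_{50}}{\mathrm{SI}}, \qquad
\frac{4.0\ \mu\mathrm{g/mL}}{64} = 0.0625\ \mu\mathrm{g/mL}, \qquad
\frac{4.0\ \mu\mathrm{g/mL}}{16} = 0.25\ \mu\mathrm{g/mL}
```

The lower bound matches the MIC of 0.0625 µg/mL quoted for this compound earlier in the deck.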
In vivo Evaluation of TCMDC-125802
Goal: Evaluate the in vivo safety and efficacy of JSF-2019 in mouse models of TB infection
Step #2: 7-day Maximum Tolerated Dose study in mice
- formulated in 0.5% methyl cellulose
- single dose p.o. @ 30, 100, and 300 mg/kg in B6D2F1 mice
- no overt toxicity
Step #3: evaluation in GKO mouse model of TB infection
- Five 12-week-old female C57BL/6 mice infected with Mtb Erdman via low-dose aerosol exposure
- Days 16 – 23: dosed w/ 300 mg/kg JSF-2019 p.o. OR 25 mg/kg INH OR untreated
- Sacrificed day 24; lung and spleen homogenates were cultured
- no difference in lungs and spleens vs. control
Lisa Woolhiser and Anne Lenaerts (CSU)
Why screen cpds?
• Ballell et al., Fueling Open-Source drug discovery: 177 small-molecule leads against tuberculosis, ChemMedChem 2013. http://goo.gl/UujRX
• GSK screened 2M compounds, 3 yrs ago
• Bayesian predictions for 14,000 cpds: 11 / 15 (73%) correct when paper was published
• Further prospective validation example
Conclusions
• >38,000 molecules screened through Bayesian models
• 106 molecules were tested in vitro
• 17 actives were identified (22.5% hit rate)
• Identified several novel potent lead series with good cytotoxicity & selectivity
• Some series have been missed in SRI screening data
• Took a non-toxic molecule quickly in vivo; have made analogs in attempt to overcome in vivo efficacy failure
• All Bayesian models shared with Abbott and Merck in TB Accelerator project
• All Bayesian models are freely available to researchers
Ekins et al., Chem Biol 20, 370–378, 2013
Acknowledgments
• Joel Freundlich Lab
• The project described was supported by Award Number R43 LM011152-01 "Biocomputation across distributed private datasets to enhance drug discovery" from the National Library of Medicine (PI: S. Ekins)
• Accelrys
• The CDD TB has been developed thanks to funding from the Bill and Melinda Gates Foundation (Grant #49852 "Collaborative drug discovery for TB through a novel database of SAR data optimized to promote data archiving and sharing")
• Allen Casey (IDRI)
You can find me @... CDD Booth 205
• PAPER ID: 13433. "Dispensing processes profoundly impact biological assays and computational and statistical analyses". April 8th, 8.35am, Room 349
• PAPER ID: 14750. "Enhancing High Throughput Screening For Mycobacterium tuberculosis Drug Discovery Using Bayesian Models". April 9th, 1.30pm, Room 353
• PAPER ID: 21524. "Navigating between patents, papers, abstracts and databases using public sources and tools". April 9th, 3.50pm, Room 350
• PAPER ID: 13358. "TB Mobile: Appifying Data on Anti-tuberculosis Molecule Targets". April 10th, 8.30am, Room 357
• PAPER ID: 13382. "Challenges and recommendations for obtaining chemical structures of industry-provided repurposing candidates". April 10th, 10.20am, Room 350
• PAPER ID: 13438. "Dual-event machine learning models to accelerate drug discovery". April 10th, 3.05pm, Room 350