Mining 'Bigger' Datasets to Create, Validate and Share Machine
Learning Models
Sean Ekins1,2,3* and Alex M. Clark4
1 Collaborations Pharmaceuticals, Inc., 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA
2 Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010, USA
3 Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA
4 Molecular Materials Informatics, 1900 St. Jacques #302, Montreal H3J 2S1, Quebec, Canada
Disclosure: In addition to being an employee of the above, funded by NIH and EC FP7; consultant for several
rare disease foundations, drug companies, consumer product companies, etc.
Laboratories past and present
Lavoisier’s lab 18th C Edison’s lab 20th C
Author’s lab 21st C
+ Network of global collaborators
"Rub al Khali 002" by Nepenthes
The chemistry/biology data desert outside of pharma circa early 2000s
Limited ADME/Tox data
Paucity of Structure Activity Data
Small datasets for modeling
Drug companies – gate keepers of information for drug discovery
"Oasis in Libya" by Sfivat
The growing chemistry/biology data Oasis outside of pharma circa 2015
ADME/Tox models 15 yrs on: Then & Now
Then:
• Datasets very small, < 100 cpds
• Heavy focus on P450
• Models rarely used
• Very limited number of properties addressed
• Few tools / algorithms used
• Limited access to models
Now:
• Much bigger datasets: > 1000s of cpds, some > 10,000
• Broader range of models
• Models more widely used and reported
• More accessible models
• Pharma making data available
• 70 hERG models (Villoutreix and Taboureau 2015)
• 19 protein binding models (Lambrinidis et al 2015)
• 40 BBB models up to 2009
Model resources for ADME/Tox
CYP              1A2               2C9               2C19
Substrate (µM)   phenacetin (10)   diclofenac (10)   omeprazole (0.5)
Inhibitor        naphthoflavone    sulfaphenazole    tranylcypromine
Compound         IC50 (µM)         IC50 (µM)         IC50 (µM)
JSF-2019         2.25              3.55              10.8
Retinal dehydrogenase 1
ADME SARfari predicts importance of CYP1A2, CYP2C9, CYP2C19
The Naïve Bayes model was built with 142345 compounds (training and validation) and features 135 learned classes.
Testing by
Dr. Joel Freundlich
Just a matter of scale?
Drug Discovery’s
definition of Big data
Everyone else’s definition of Big data
• Data Sources
• PubChem
• ChEMBL
• ToxCast over 1800 molecules tested against over 800 endpoints
Where can we get the datasets
Open source – but much smaller
400 diverse, drug-like molecules active against neglected diseases
400 cpds selected from around 20,000 hits generated by a screening campaign of ~ four million compounds from the
libraries of St. Jude Children's Research Hospital, TN, USA, Novartis and GSK.
Many screens completed
Bigger datasets and model collections
• Profiling “big datasets” is going to
be the norm.
• A recent study mined PubChem
datasets for compounds that have
rat in vivo acute toxicity data
• This could be used in other big data
initiatives like ToxCast (> 1000
compounds x 800 assays) and
Tox21 etc.
• Kinase screening data (1000s mols
x 100s assays)
• GPCR datasets etc (1000s mols x
100s assays)
Zhang J, Hsieh JH, Zhu H (2014) Profiling Animal
Toxicants by Automatically Mining Public
Bioassay Data: A Big Data Approach for
Computational Toxicology. PLoS ONE 9(6):
e99863. doi:10.1371/journal.pone.0099863
‘Bigger’ and not ‘Big’
[Figure: dataset counts; labels not recoverable]
(220463) (102633) (23797) (346893)
(2273) (1783) (1248) (5304)
(218640) (102634) (23737) (345011)
1771924
Are bigger models better for tuberculosis?
Ekins et al., J Chem Inf Model
54: 2157-2165 (2014)
No relationship between internal or external ROC and the
number of molecules in the training set?
PCA of combined
data and ARRA(red)
Ekins et al., J Chem Inf Model
54: 2157-2165 (2014)
Internal and leave out 50%x100 ROC track each other
External ROC less correlation
Smaller models do just as well with external testing
~350,000
The Opportunity
•Get pharmas to use open source molecular descriptors and algorithms
•Benefit from initial work done by Pfizer/CDD
•Avoid repetition of open source tools vs commercial tools comparisons
•Change the mindset from real data to virtual data – confirm predictions
•ADME/Tox is precompetitive
•Expand the chemical space and predictivity of models
•Share models with collaborators – Companies could share data as models
Ekins and Williams, Lab On A Chip, 10: 13-22, 2010.
Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010
Pfizer Open models and descriptors
Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010
• What can be developed with very large training
and test sets?
• HLM training 50,000 testing 25,000 molecules
• training 194,000 and testing 39,000
• MDCK training 25,000 testing 25,000
• MDR training 25,000 testing 18,400
• Open molecular descriptors / models vs
commercial descriptors
• Examples – Metabolic Stability
Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010
HLM models with SMARTS keys: CDK vs MOE2D descriptors

                                   CDK + SMARTS keys   MOE2D + SMARTS keys
# Descriptors                      578                 818
# Training set compounds           193,650             193,930
Cross-validation results (cpds)    38,730              38,786
Training R2                        0.79                0.77
20% test set R2                    0.69                0.69
Blind data set (2310 compounds):
  R2                               0.53                0.53
  RMSE                             0.367               0.367
Continuous → categorical:
  κ                                0.40                0.42
  Sensitivity                      0.16                0.24
  Specificity                      0.99                0.987
  PPV                              0.80                0.823
Time (sec/compound)                0.252               0.303
PCA of training (red) and test (blue)
compounds
Overlap in Chemistry space
• Examples – P-gp
Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010
Open source descriptors CDK and C5.0 algorithm
~60,000 molecules with P-gp efflux data from Pfizer
MDR <2.5 (low risk) (N = 14,175) MDR > 2.5 (high risk) (N = 10,820)
Test set MDR <2.5 (N = 10,441) > 2.5 (N = 7972)
Could facilitate model sharing?
              CDK + fragment descriptors   MOE 2D + fragment descriptors
Kappa         0.65                         0.67
Sensitivity   0.86                         0.86
Specificity   0.78                         0.8
PPV           0.84                         0.84
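The kappa, sensitivity, specificity and PPV values reported for these classification models can all be derived from a confusion matrix. The Python sketch below shows the standard definitions; the function name and the example counts are hypothetical and are not taken from the Pfizer data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute kappa, sensitivity, specificity and PPV from confusion-matrix counts.

    tp/fp/tn/fn are true/false positives/negatives for a two-class model
    (for example, high-risk vs low-risk MDR).
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # fraction of real positives that were found
    specificity = tn / (tn + fp)   # fraction of real negatives that were found
    ppv = tp / (tp + fp)           # fraction of predicted positives that are real
    observed = (tp + tn) / total   # observed agreement
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return {
        "kappa": kappa,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
```

Kappa corrects the raw accuracy for chance agreement, which makes it more informative than accuracy alone when the two classes are unevenly populated.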
MODELS RESIDE IN PAPERS
NOT ACCESSIBLE… THIS IS UNDESIRABLE
How do we share them?
How do we use them?
Open Extended Connectivity Fingerprints
ECFP_6 FCFP_6
• Collected, deduplicated, hashed
• Sparse integers
• Invented for Pipeline Pilot: public method, proprietary details
• Often used with Bayesian models: many published papers
• Built a new implementation: open source, Java, CDK
– stable: fingerprints don't change with each new toolkit release
– well defined: easy to document precise steps
– easy to port: already migrated to iOS (Objective-C) for TB Mobile app
• Provides core basis feature for CDD open source model service
Clark et al., J Cheminform 6:38 2014
Uses Bayesian algorithm and FCFP_6 fingerprints
Bayesian models
Clark et al., J Cheminform 6:38 2014
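A Bayesian model over sparse fingerprint integers can be sketched in a few lines of Python. The sketch uses the Laplacian-corrected estimator commonly described for fingerprint-based Bayesian models; it is an illustration under that assumption, not the CDD or CDK code, and the function names and toy fingerprints are hypothetical.

```python
import math
from collections import Counter


def train_laplacian_bayes(fingerprints, labels):
    """Return a per-bit contribution table from fingerprints and active/inactive labels.

    fingerprints: iterable of collections of integer bits (one per molecule).
    labels: iterable of booleans, True for active.
    """
    labels = list(labels)
    base_rate = sum(labels) / len(labels)   # overall fraction of actives
    total_count = Counter()                 # molecules containing each bit
    active_count = Counter()                # active molecules containing each bit
    for bits, is_active in zip(fingerprints, labels):
        for bit in set(bits):
            total_count[bit] += 1
            if is_active:
                active_count[bit] += 1
    # Laplacian correction: rarely seen bits are pulled towards zero contribution.
    return {
        bit: math.log((active_count[bit] + 1) / (total_count[bit] * base_rate + 1))
        for bit in total_count
    }


def score(model, bits):
    """Sum the contributions of the bits present; unseen bits contribute nothing."""
    return sum(model.get(bit, 0.0) for bit in set(bits))


# Toy data: bit 1 occurs only in actives, bit 4 only in inactives.
toy_fps = [{1, 2}, {1, 3}, {2, 4}, {3, 4}]
toy_labels = [True, True, False, False]
toy_model = train_laplacian_bayes(toy_fps, toy_labels)
```

The resulting score is an unscaled sum of log ratios rather than a probability, so a threshold or calibration step is still needed before it can be read as active or inactive.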
Exporting models from CDD
Clark et al., JCIM 55: 1231-1245 (2015)
Machine Learning – Different tools
• Models generated using : molecular
function class fingerprints of maximum
diameter 6 (FCFP_6), AlogP, molecular
weight, number of rotatable bonds,
number of rings, number of aromatic
rings, number of hydrogen bond
acceptors, number of hydrogen bond
donors, and molecular fractional polar
surface area.
• Models were validated using five-fold
cross validation (leave out 20% of the
database).
• Bayesian, Support Vector Machine and
Recursive Partitioning Forest and single
tree models built.
• RP Forest and RP Single Tree models
used the standard protocol in Discovery
Studio.
• 5-fold cross validation or leave out 50%
x 100 fold cross validation was used to
calculate the ROC for the models
generated
• *fingerprints only Ai et al., ADDR 86: 46-60, 2015
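The ROC value and the leave out 50% x 100 scheme mentioned above can be sketched in plain Python. This is a minimal illustration and not the Discovery Studio protocol: the ROC area is computed from pairwise comparisons of active and inactive scores, and the function names, the seed and the example scores are assumptions.

```python
import random


def roc_auc(scores, labels):
    """Area under the ROC curve: the chance that an active outscores an inactive."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    # Count pairs where the active scores higher; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leave_out_half_splits(n_items, n_repeats=100, seed=0):
    """Yield (train, test) index lists, leaving out a random 50% each time."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    half = n_items // 2
    for _ in range(n_repeats):
        rng.shuffle(indices)
        yield sorted(indices[:half]), sorted(indices[half:])


# Hypothetical scores for two actives followed by two inactives.
example_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

In use, a model would be rebuilt on each training half and scored on the matching test half, and the 100 ROC values averaged.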
KCNQ1
Ames Bayesian model built with 6512 molecules (Hansen et al., 2009)
Features important for Ames actives. Features important for Ames inactives.
Ames Bayesian model built using CDD Models showing ROC
for 3 fold cross validation. Note only FCFP_6 descriptors were
used
FCFP6 fingerprint models in CDD
Clark et al., JCIM 55: 1231-1245 (2015)
ECFP6 fingerprint only models in MMDS
Clark et al., JCIM 55: 1231-1245 (2015)
Using AZ-ChEMBL data for CDD Models
• Human microsomal
intrinsic clearance
• Rat hepatocyte
intrinsic clearance
What if the models were already
built for you
• Instead of having to go into a database and find
data
• The models are already prebuilt
• Ready to use
• Shareable
• Create a repository of models
Previous work by others
• Using large datasets to predict targets with Bayesian algorithm
• Bayesian classifier - 698 target models (> 200,000 molecules, 561,000
measurements) Paolini et al 2006
• 246 targets (65,241 molecules) Similarity ensemble analysis Keiser et al 2007
• 2000 targets (167,000 molecules) target identification from zebrafish screen
Laggner et al 2012
• 70 targets (100,269 data points) Bender et al 2007
• Many others…..
• None of these enable you qualitatively or quantitatively predict activity for a
single target.
Recent Studies
• Bit folding – trade off between performance & efficiency
• Model cut-off selection for cross validation
• Scalability of ECFP6 and FCFP6 using ChEMBL 20 mid size
datasets
• CDK codebase on GitHub (http://github.com/cdk/cdk: look
for class org.openscience.cdk.fingerprint.model.Bayesian)
• Made the models accessible http://molsync.com/bayesian2
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
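Bit folding maps the sparse fingerprint integers onto a fixed number of bits, trading some resolution for a smaller model. The Python sketch below assumes folding by modulo, which is one common way to do it and is not confirmed here as the method used in the study; the function name and example hashes are hypothetical.

```python
def fold_fingerprint(hashes, size=4096):
    """Fold sparse integer hashes into the range 0..size-1 and deduplicate.

    Python's % always returns a non-negative result for a positive size,
    so signed 32-bit style hashes are handled as well.
    """
    return sorted({h % size for h in hashes})


# 5 and 4101 collide when folded to 4096 bits; -1 folds to 4095.
folded = fold_fingerprint([5, 4101, -1], size=4096)
```

Smaller sizes cause more collisions between unrelated features, which is the source of the performance loss that folding studies measure.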
What do 2000 ChEMBL models
look like
Folding bit size
Average
ROC
http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
ChEMBL 20
• Skipped targets with > 100,000 assays and sets with
< 100 measurements
• Converted data to –log
• Dealt with duplicates
• 2152 datasets
• Cutoff determination
• Balance active/ inactive ratio
• Favor structural diversity and activity distribution
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
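Two of the preparation steps above, converting activities to –log units and dealing with duplicates, can be sketched as follows. The slide does not say which units were used or how duplicates were resolved, so nanomolar input and averaging of repeated measurements are assumptions made for this example, as are the function names.

```python
import math
from collections import defaultdict
from statistics import mean


def to_neg_log(value_nm):
    """Convert a concentration in nM to -log10 of the molar value (assumed units)."""
    return -math.log10(value_nm * 1e-9)


def merge_duplicates(measurements):
    """Average the -log values of repeated measurements for the same molecule.

    measurements: iterable of (molecule_id, value_nm) pairs.
    Averaging is an assumption; other policies (median, most potent) are possible.
    """
    grouped = defaultdict(list)
    for mol_id, value_nm in measurements:
        grouped[mol_id].append(to_neg_log(value_nm))
    return {mol_id: mean(values) for mol_id, values in grouped.items()}


# Hypothetical measurements: molecule "A" was measured twice.
merged = merge_duplicates([("A", 1000.0), ("A", 10.0), ("B", 100.0)])
```

Averaging after the log transform keeps a single outlying measurement from dominating, which is one reason to merge on the –log scale rather than on raw concentrations.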
Desirability score
• ROC integral for model using subset of molecules and
threshold for partitioning active / inactive (higher is
better)
• Second derivative of population interpolated from the
current threshold (lower is better)
• Ratio of actives to inactives if the collection is partitioned at the threshold:
(actives+1) / (inactives+1) or its reciprocal, whichever is greater
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
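The third component of the desirability score, the active/inactive ratio, is simple enough to sketch directly. The Python example below covers only that term; how it is combined with the ROC integral and the second-derivative term is not given on the slide and is not attempted. Treating values at or above the threshold as active is an assumption, and the function names are hypothetical.

```python
def balance_ratio(activities, threshold):
    """(actives+1)/(inactives+1) or its reciprocal, whichever is greater.

    A value of 1.0 means the threshold splits the set evenly.
    Assumes activities are on a -log scale, so higher means more active.
    """
    actives = sum(1 for a in activities if a >= threshold)
    inactives = len(activities) - actives
    ratio = (actives + 1) / (inactives + 1)
    return max(ratio, 1 / ratio)


def most_balanced_threshold(activities, candidates):
    """Return the candidate threshold giving the most even active/inactive split."""
    return min(candidates, key=lambda t: balance_ratio(activities, t))


# Hypothetical -log activities for four molecules.
best = most_balanced_threshold([5.0, 6.0, 7.0, 8.0], [5.5, 6.5, 7.5])
```

The +1 in numerator and denominator keeps the ratio finite when a threshold leaves one of the two classes empty.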
Ekins et al Drug Metab Dispos 43(10):1642-5, 2015
Models from ChEMBL data
http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
Results
• Bit folding – plateau at 4096, can use 1024 with little degradation
• Cut off – works well
• Evaluated balanced training:test partitions and 'diabolical' partitions, where test
and training sets are structurally different
Easy ROC 0.83 ± 0.11 Hard ROC 0.39 ± 0.23
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
Models in mobile app
• Added atom coloring using ECFP6 fingerprints
• Red and green: high and low probability of activity, respectively
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
Results for Bayesian model cross validation. 5-fold and Leave one out (LOO)
validation with Bayesian models generated with Discovery Studio and Open
Models implemented in the mobile app MMDS. * = previously published
Ekins et al Drug Metab Dispos 43(10):1642-5, 2015
Transporter models
Ekins et al Drug Metab Dispos 43(10):1642-5, 2015
Transporter models
ToxCast data
• Few studies use the ToxCast data for machine learning
• Recent reviews Sipes et al., Chem Res Toxicol. 2013 Jun 17; 26(6): 878–895.
• Liu et al., Chem Res Toxicol. 2015 Apr 20;28(4):738-51
• A set of 677 chemicals was represented by 711 in vitro bioactivity descriptors
(from ToxCast assays), 4,376 chemical structure descriptors (from QikProp,
OpenBabel, PaDEL, and PubChem), and three hepatotoxicity categories (from animal
studies)
• Six machine learning algorithms: linear discriminant analysis (LDA), Naïve Bayes
(NB), support vector machines (SVM), classification and regression trees (CART),
k-nearest neighbors (KNN), and an ensemble of these classifiers (ENSMB)
• nuclear receptor activation and mitochondrial functions were frequently found in
highly predictive classifiers of hepatotoxicity
• CART, ENSMB, and SVM classifiers performed the best
CDD Models for human P450s (NVS data) from ToxCast (n=1787) <1uM cutoff
CYP1A1 CYP1A2 CYP2B6 CYP2C18
CYP2C19 CYP2C9 CYP3A4 CYP3A5
ToxCast models in a mobile app
IC50 1A2 = 2.25 uM
IC50 2C9 = 3.55 uM
IC50 2C19 = 10.8 uM
In vitro data
Courtesy Dr. Joel Freundlich
PolyPharma a new free app for drug discovery
Composite models - Binned Bayesians
Clark et al., Submitted 2015
Summary
• Shown that open source models/ descriptors comparable to
previously published models with commercial software
• Implemented Bayesian machine learning in CDD Vault
• Can be used on private or public data
• Can enable sharing of models in CDD Vault
• Enabled export of models – can use models in 3rd party mobile apps
or other tools
• Demonstrated various ADME/Tox models and transporters
• Make ToxCast data into models that can be used by anyone
• Provide more information on models and predictions
• Visualize training set molecules vs test compounds
• Use a model to predict compounds and then test them
Future ?
+ = Big Models
Thousands of Big Models
How do you validate 1000s of models?
How do algorithms handle 500K – 1M molecules?
Need new algorithms, data visualization, mining approaches
Model sharing is here
Need for broad Biology & Chemistry knowledge – open minds, BIG thinkers
Acknowledgments
• Alex Clark Antony Williams
• Joel Freundlich Robert Reynolds
• Steven Wright
• Krishna Dole and all colleagues at CDD
• Award Number 9R44TR000942-02 “Biocomputation across distributed private datasets to enhance drug
discovery” from the NIH National Center for Advancing Translational Sciences.
• R41-AI108003-01 “Identification and validation of targets of phenotypic high throughput screening” from NIH
National Institute of Allergy and Infectious Diseases
• Bill and Melinda Gates Foundation (Grant#49852 “Collaborative drug discovery for TB through a novel
database of SAR data optimized to promote data archiving and sharing”).
Software on github
Models can be accessed at
• http://molsync.com/bayesian1
• http://molsync.com/bayesian2
• http://molsync.com/transporters

More Related Content

What's hot

Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...
Greg Landrum
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Medicinal Chemistry Due Diligence: Computational Predictions of an expert’s e...
Medicinal Chemistry Due Diligence: Computational Predictions of an expert’s e...Medicinal Chemistry Due Diligence: Computational Predictions of an expert’s e...
Medicinal Chemistry Due Diligence: Computational Predictions of an expert’s e...
Sean Ekins
 
Complex Systems Biology Informed Data Analysis and Machine Learning
Complex Systems Biology Informed Data Analysis and Machine LearningComplex Systems Biology Informed Data Analysis and Machine Learning
Complex Systems Biology Informed Data Analysis and Machine Learning
Dmitry Grapov
 
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental scienceUS-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Reaxys rmc unified platform_ webinar_
Reaxys rmc unified platform_ webinar_Reaxys rmc unified platform_ webinar_
Reaxys rmc unified platform_ webinar_
Ann-Marie Roche
 
Pistoia Alliance-Elsevier Datathon
Pistoia Alliance-Elsevier DatathonPistoia Alliance-Elsevier Datathon
Pistoia Alliance-Elsevier Datathon
Pistoia Alliance
 
OpenTox Europe 2013
OpenTox Europe 2013OpenTox Europe 2013
OpenTox Europe 2013
Alejandra Gonzalez-Beltran
 
Connecting the Data Wires
Connecting the Data WiresConnecting the Data Wires
Connecting the Data Wires
Medicines Discovery Catapult
 
Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019
Pistoia Alliance
 
Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
CEDAR work bench for metadata management
CEDAR work bench for metadata managementCEDAR work bench for metadata management
CEDAR work bench for metadata management
Pistoia Alliance
 
Data for AI models, the past, the present, the future
Data for AI models, the past, the present, the futureData for AI models, the past, the present, the future
Data for AI models, the past, the present, the future
Pistoia Alliance
 
Cassavabase-PhenoApps demo ISTRC 2018
Cassavabase-PhenoApps demo ISTRC 2018Cassavabase-PhenoApps demo ISTRC 2018
Cassavabase-PhenoApps demo ISTRC 2018
solgenomics
 
Multivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysisMultivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysis
Dmitry Grapov
 
NETTAB 2013
NETTAB 2013NETTAB 2013
Omic Data Integration Strategies
Omic Data Integration StrategiesOmic Data Integration Strategies
Omic Data Integration Strategies
Dmitry Grapov
 
Towards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsTowards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery Labs
Ola Spjuth
 
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Alejandra Gonzalez-Beltran
 

What's hot (20)

Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
Medicinal Chemistry Due Diligence: Computational Predictions of an expert’s e...
Medicinal Chemistry Due Diligence: Computational Predictions of an expert’s e...Medicinal Chemistry Due Diligence: Computational Predictions of an expert’s e...
Medicinal Chemistry Due Diligence: Computational Predictions of an expert’s e...
 
Complex Systems Biology Informed Data Analysis and Machine Learning
Complex Systems Biology Informed Data Analysis and Machine LearningComplex Systems Biology Informed Data Analysis and Machine Learning
Complex Systems Biology Informed Data Analysis and Machine Learning
 
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental scienceUS-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
 
Reaxys rmc unified platform_ webinar_
Reaxys rmc unified platform_ webinar_Reaxys rmc unified platform_ webinar_
Reaxys rmc unified platform_ webinar_
 
Pistoia Alliance-Elsevier Datathon
Pistoia Alliance-Elsevier DatathonPistoia Alliance-Elsevier Datathon
Pistoia Alliance-Elsevier Datathon
 
OpenTox Europe 2013
OpenTox Europe 2013OpenTox Europe 2013
OpenTox Europe 2013
 
Connecting the Data Wires
Connecting the Data WiresConnecting the Data Wires
Connecting the Data Wires
 
Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019
 
Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...
 
CEDAR work bench for metadata management
CEDAR work bench for metadata managementCEDAR work bench for metadata management
CEDAR work bench for metadata management
 
Data for AI models, the past, the present, the future
Data for AI models, the past, the present, the futureData for AI models, the past, the present, the future
Data for AI models, the past, the present, the future
 
Cassavabase-PhenoApps demo ISTRC 2018
Cassavabase-PhenoApps demo ISTRC 2018Cassavabase-PhenoApps demo ISTRC 2018
Cassavabase-PhenoApps demo ISTRC 2018
 
Multivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysisMultivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysis
 
NETTAB 2013
NETTAB 2013NETTAB 2013
NETTAB 2013
 
Omic Data Integration Strategies
Omic Data Integration StrategiesOmic Data Integration Strategies
Omic Data Integration Strategies
 
Towards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsTowards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery Labs
 
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
 

Viewers also liked

Plans for Creative Writing
Plans for Creative WritingPlans for Creative Writing
Plans for Creative Writing
Fatheha Rahman
 
Uram ecp course
Uram ecp courseUram ecp course
Uram ecp course
tigerron
 
Eit orginal
Eit orginalEit orginal
Eit orginal
anamsini
 
Codes & Tiny Houses
Codes & Tiny HousesCodes & Tiny Houses
Codes & Tiny Houses
Historic Shed
 
Pintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenesPintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenes
Olmeda Orígenes
 
A Writing Group Strategy for Scientists
A Writing Group Strategy for ScientistsA Writing Group Strategy for Scientists
A Writing Group Strategy for Scientists
gizemk
 
Haapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminarHaapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminar
Jüri Kaljundi
 
ILASCD - Student-Centered Leadership
ILASCD - Student-Centered LeadershipILASCD - Student-Centered Leadership
ILASCD - Student-Centered Leadership
PJ Caposey
 
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Insight
 
MANEJO DE LA ENFERMEDAD DE CHAGAS EN ATENCIÓN PRIMARIA EN ESPAÑA
MANEJO DE LA ENFERMEDAD DE CHAGAS  EN ATENCIÓN PRIMARIA EN ESPAÑAMANEJO DE LA ENFERMEDAD DE CHAGAS  EN ATENCIÓN PRIMARIA EN ESPAÑA
MANEJO DE LA ENFERMEDAD DE CHAGAS EN ATENCIÓN PRIMARIA EN ESPAÑA
Centro Fuensanta Valencia. Departamento Hospital General
 
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
Digiday
 
Crunching Molecules and Numbers in R
Crunching Molecules and Numbers in RCrunching Molecules and Numbers in R
Crunching Molecules and Numbers in R
Rajarshi Guha
 
MEMS sensor catalog with I2C
MEMS sensor catalog with I2CMEMS sensor catalog with I2C
MEMS sensor catalog with I2C
Akira Sasaki
 
Green chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designGreen chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by design
Alex Clark
 
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
Kamel Mansouri
 
The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...
Pieter De Leenheer
 
[Etude] Entrepreneurs de la Tech : qui sont-ils?
[Etude] Entrepreneurs de la Tech : qui sont-ils?[Etude] Entrepreneurs de la Tech : qui sont-ils?
[Etude] Entrepreneurs de la Tech : qui sont-ils?
FrenchWeb.fr
 
Pay’s Role in a Performance Culture
Pay’s Role in a Performance CulturePay’s Role in a Performance Culture
Pay’s Role in a Performance Culture
The VisionLink Advisory Group
 
Ultimate Guitar Chord Chart
Ultimate Guitar Chord ChartUltimate Guitar Chord Chart
Ultimate Guitar Chord Chart
swamy g
 

Viewers also liked (20)

Plans for Creative Writing
Plans for Creative WritingPlans for Creative Writing
Plans for Creative Writing
 
Uram ecp course
Uram ecp courseUram ecp course
Uram ecp course
 
Eit orginal
Eit orginalEit orginal
Eit orginal
 
Codes & Tiny Houses
Codes & Tiny HousesCodes & Tiny Houses
Codes & Tiny Houses
 
Pintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenesPintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenes
 
A Writing Group Strategy for Scientists
A Writing Group Strategy for ScientistsA Writing Group Strategy for Scientists
A Writing Group Strategy for Scientists
 
Haapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminarHaapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminar
 
ILASCD - Student-Centered Leadership
ILASCD - Student-Centered LeadershipILASCD - Student-Centered Leadership
ILASCD - Student-Centered Leadership
 
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
 
MANEJO DE LA ENFERMEDAD DE CHAGAS EN ATENCIÓN PRIMARIA EN ESPAÑA
MANEJO DE LA ENFERMEDAD DE CHAGAS  EN ATENCIÓN PRIMARIA EN ESPAÑAMANEJO DE LA ENFERMEDAD DE CHAGAS  EN ATENCIÓN PRIMARIA EN ESPAÑA
MANEJO DE LA ENFERMEDAD DE CHAGAS EN ATENCIÓN PRIMARIA EN ESPAÑA
 
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
 
Crunching Molecules and Numbers in R
Crunching Molecules and Numbers in RCrunching Molecules and Numbers in R
Crunching Molecules and Numbers in R
 
MEMS sensor catalog with I2C
MEMS sensor catalog with I2CMEMS sensor catalog with I2C
MEMS sensor catalog with I2C
 
Green chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designGreen chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by design
 
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
 
The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...
 
[Etude] Entrepreneurs de la Tech : qui sont-ils?
[Etude] Entrepreneurs de la Tech : qui sont-ils?[Etude] Entrepreneurs de la Tech : qui sont-ils?
[Etude] Entrepreneurs de la Tech : qui sont-ils?
 
школа
школашкола
школа
 
Pay’s Role in a Performance Culture
Pay’s Role in a Performance CulturePay’s Role in a Performance Culture
Pay’s Role in a Performance Culture
 
Ultimate Guitar Chord Chart
Ultimate Guitar Chord ChartUltimate Guitar Chord Chart
Ultimate Guitar Chord Chart
 

Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models

  • 1. Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models Sean Ekins1,2,3* and Alex M. Clark4 1 Collaborations Pharmaceuticals, Inc., 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA 2 Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010, USA 3 Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA 4 Molecular Materials Informatics, 1900 St. Jacques #302, Montreal H3J 2S1, Quebec, Canada Disclosure: As well as employee of above funded by NIH, EC FP7, consultant for several rare disease foundations, drug companies and consumer product companies etc.
  • 2. Laboratories past and present: Lavoisier’s lab (18th C), Edison’s lab (20th C), author’s lab (21st C) + network of global collaborators
  • 3. "Rub al Khali 002" by Nepenthes. The chemistry/biology data desert outside of pharma circa early 2000s: limited ADME/Tox data; paucity of structure-activity data; small datasets for modeling; drug companies – gatekeepers of information for drug discovery
  • 4. "Oasis in Libya" by Sfivat The growing chemistry/biology data Oasis outside of pharma circa 2015
  • 5. ADME/Tox models 15 yrs on: Then & Now. Then: • Datasets very small (< 100 cpds) • Heavy focus on P450 • Models rarely used • Very limited number of properties addressed • Few tools / algorithms used • Limited access to models. Now: • Much bigger datasets (> 1000s cpds, > 10,000) • Broader range of models • Models more widely used and reported • More accessible models • Pharma making data available • 70 hERG models (Villoutreix and Taboureau 2015) • 19 protein binding models (Lambrinidis et al 2015) • 40 BBB models up to 2009
  • 7. CYP inhibition testing of JSF-2019:
      CYP             | 1A2             | 2C9             | 2C19
      Substrate (mM)  | phenacetin (10) | diclofenac (10) | omeprazole (0.5)
      Inhibitor       | naphthoflavone  | sulfaphenazole  | tranylcypromine
      Compounds       | IC50 (mM)       | IC50 (mM)       | IC50 (mM)
      JSF-2019        | 2.25            | 3.55            | 10.8
    Retinal dehydrogenase 1. ADME SARfari predicts importance of CYP1A2, CYP2C9, CYP2C19. The Naïve Bayes model was built with 142345 compounds (training and validation) and features 135 learned classes. Testing by Dr. Joel Freundlich
  • 8.
  • 9. Just a matter of scale? Drug Discovery’s definition of Big data Everyone else’s definition of Big data
  • 10. Where can we get the datasets? Data sources: • PubChem • ChEMBL • ToxCast (over 1800 molecules tested against over 800 endpoints)
  • 11. Open source – but much smaller: 400 diverse, drug-like molecules active against neglected diseases. 400 cpds from around 20,000 hits generated by screening campaigns of ~ four million compounds from the libraries of St. Jude Children's Research Hospital, TN, USA, Novartis and GSK. Many screens completed
  • 12. Bigger datasets and model collections • Profiling “big datasets” is going to be the norm. • A recent study mined PubChem datasets for compounds that have rat in vivo acute toxicity data • This could be used in other big data initiatives like ToxCast (> 1000 compounds x 800 assays) and Tox21 etc. • Kinase screening data (1000s mols x 100s assays) • GPCR datasets etc (1000s mols x 100s assays). Zhang J, Hsieh JH, Zhu H (2014) Profiling Animal Toxicants by Automatically Mining Public Bioassay Data: A Big Data Approach for Computational Toxicology. PLoS ONE 9(6): e99863. doi:10.1371/journal.pone.0099863 http://127.0.0.1:8081/plosone/article?id=info:doi/10.1371/journal.pone.0099863
  • 13.
  • 14. ‘Bigger’ and not ‘Big’
  • 16. No relationship between internal or external ROC and the number of molecules in the training set? PCA of combined data and ARRA (red). Internal and leave-out 50% x 100 ROC track each other; external ROC shows less correlation; smaller models do just as well with external testing. ~350,000. Ekins et al., J Chem Inf Model 54: 2157-2165 (2014)
  • 17. The Opportunity •Get pharmas to use open source molecular descriptors and algorithms •Benefit from initial work done by Pfizer/CDD •Avoid repetition of open source tools vs commercial tools comparisons •Change the mindset from real data to virtual data – confirm predictions •ADME/Tox is precompetitive •Expand the chemical space and predictivity of models •Share models with collaborators – Companies could share data as models Ekins and Williams, Lab On A Chip, 10: 13-22, 2010. Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010
  • 18. Pfizer Open models and descriptors Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010 • What can be developed with very large training and test sets? • HLM training 50,000 testing 25,000 molecules • training 194,000 and testing 39,000 • MDCK training 25,000 testing 25,000 • MDR training 25,000 testing 18,400 • Open molecular descriptors / models vs commercial descriptors
  • 19. • Examples – Metabolic Stability Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010 HLM Model with CDK and SMARTS Keys: HLM Model with MOE2D and SMARTS Keys # Descriptors: 578 Descriptors # Training Set compounds: 193,650 Cross Validation Results: 38,730 compounds Training R2: 0.79 20% Test Set R2: 0.69 Blind Data Set (2310 compounds): R2 = 0.53 RMSE = 0.367 Continuous → Categorical: κ = 0.40 Sensitivity = 0.16 Specificity = 0.99 PPV = 0.80 Time (sec/compound): 0.252 # Descriptors: 818 Descriptors # Training Set compounds: 193,930 Cross Validation Results: 38,786 compounds Training R2: 0.77 20% Test Set R2: 0.69 Blind Data Set (2310 compounds): R2 = 0.53 RMSE = 0.367 Continuous → Categorical: κ = 0.42 Sensitivity = 0.24 Specificity = 0.987 PPV = 0.823 Time (sec/compound): 0.303 PCA of training (red) and test (blue) compounds Overlap in Chemistry space
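The categorical statistics quoted on this slide (κ, sensitivity, specificity, PPV) can all be derived from a 2x2 confusion matrix. The following is a minimal sketch of those standard definitions; the function name and the example counts are hypothetical and are not taken from Gupta et al.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV and Cohen's kappa from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # fraction of true actives that were predicted active
    specificity = tn / (tn + fp)   # fraction of true inactives that were predicted inactive
    ppv = tp / (tp + fp)           # fraction of predicted actives that are truly active
    observed = (tp + tn) / total   # observed agreement between prediction and truth
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "kappa": kappa}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```

The sketch does not guard against empty classes (for example, no predicted actives), which would cause a division by zero.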
  • 20. • Examples – P-gp Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010 • Open source descriptors CDK and C5.0 algorithm • ~60,000 molecules with P-gp efflux data from Pfizer • Training set: MDR < 2.5 (low risk) (N = 14,175), MDR > 2.5 (high risk) (N = 10,820) • Test set: MDR < 2.5 (N = 10,441), MDR > 2.5 (N = 7972) • Could facilitate model sharing? • Results, CDK + fragment descriptors vs MOE 2D + fragment descriptors: Kappa 0.65 vs 0.67; sensitivity 0.86 vs 0.86; specificity 0.78 vs 0.80; PPV 0.84 vs 0.84
  • 21. MODELS RESIDE IN PAPERS NOT ACCESSIBLE…THIS IS UNDESIRABLE How do we share them? How do we use them?
  • 22. Open Extended Connectivity Fingerprints ECFP_6 FCFP_6 • Collected, deduplicated, hashed • Sparse integers • Invented for Pipeline Pilot: public method, proprietary details • Often used with Bayesian models: many published papers • Built a new implementation: open source, Java, CDK – stable: fingerprints don't change with each new toolkit release – well defined: easy to document precise steps – easy to port: already migrated to iOS (Objective-C) for TB Mobile app • Provides core basis feature for CDD open source model service Clark et al., J Cheminform 6:38 2014
  • 23. Uses Bayesian algorithm and FCFP_6 fingerprints Bayesian models Clark et al., J Cheminform 6:38 2014
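The slide names only a "Bayesian algorithm" applied to FCFP_6 fingerprints. As a rough illustration, the sketch below assumes the Laplacian-corrected naive Bayes estimator that is commonly paired with this kind of sparse fingerprint; that choice, the function names, and the toy data are assumptions, not details given in the slides.

```python
import math
from collections import Counter


def train_laplacian_bayes(fingerprints, labels):
    """Return a per-bit contribution table.

    fingerprints: list of sets of integer fingerprint bits (one set per molecule)
    labels: list of booleans (True = active)
    """
    base_rate = sum(labels) / len(labels)  # overall fraction of actives
    seen = Counter()         # molecules containing each bit
    seen_active = Counter()  # active molecules containing each bit
    for bits, active in zip(fingerprints, labels):
        for bit in bits:
            seen[bit] += 1
            if active:
                seen_active[bit] += 1
    # Assumed Laplacian-corrected estimate: log((A_f + 1) / (T_f * P + 1)).
    return {bit: math.log((seen_active[bit] + 1) / (seen[bit] * base_rate + 1))
            for bit in seen}


def score(model, bits):
    """Sum the contributions of the bits present; unseen bits contribute 0."""
    return sum(model.get(bit, 0.0) for bit in bits)


# Toy data: integer "bits" stand in for real hashed fingerprint features.
toy_fps = [{1, 2}, {1, 3}, {4, 5}, {4, 6}]
toy_labels = [True, True, False, False]
model = train_laplacian_bayes(toy_fps, toy_labels)
print(score(model, {1, 2}), score(model, {4, 5}))
```

Higher scores indicate features seen more often in actives than the base rate would predict; the raw score is a relative ranking, not a calibrated probability.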
  • 24. Exporting models from CDD Clark et al., JCIM 55: 1231-1245 (2015)
  • 25. Machine Learning – Different tools • Models generated using : molecular function class fingerprints of maximum diameter 6 (FCFP_6), AlogP, molecular weight, number of rotatable bonds, number of rings, number of aromatic rings, number of hydrogen bond acceptors, number of hydrogen bond donors, and molecular fractional polar surface area. • Models were validated using five-fold cross validation (leave out 20% of the database). • Bayesian, Support Vector Machine and Recursive Partitioning Forest and single tree models built. • RP Forest and RP Single Tree models used the standard protocol in Discovery Studio. • 5-fold cross validation or leave out 50% x 100 fold cross validation was used to calculate the ROC for the models generated • *fingerprints only Ai et al., ADDR 86: 46-60, 2015 KCNQ1
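The validation scheme on this slide (five-fold cross validation with a ROC value per model) can be sketched in plain Python as follows. The pairwise ROC AUC calculation and the fold construction are generic illustrations with hypothetical function names; they are not the Discovery Studio protocol.

```python
import random


def roc_auc(scores, labels):
    """ROC AUC as the chance that a random active outscores a random inactive.

    Ties count as half a win. Pairwise and O(n^2), so intended for small examples.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def five_fold_indices(n_items, seed=0):
    """Return five (train_indices, test_indices) pairs, each leaving out ~20%."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


auc_perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
auc_partial = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
splits = five_fold_indices(10)
print(auc_perfect, auc_partial, splits[0])
```

In use, a model would be trained on each `train_indices` set and scored on the matching `test_indices`, with the ROC values then averaged across the five folds.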
  • 26. Ames Bayesian model built with 6512 molecules (Hansen et al., 2009) Features important for Ames actives. Features important for Ames inactives.
  • 27. Ames Bayesian model built using CDD Models showing ROC for 3 fold cross validation. Note only FCFP_6 descriptors were used
  • 28. FCFP6 fingerprint models in CDD Clark et al., JCIM 55: 1231-1245 (2015)
  • 29. ECFP6 fingerprint only models in MMDS Clark et al., JCIM 55: 1231-1245 (2015)
  • 30. Using AZ-ChEMBL data for CDD Models
  • 31. • Human microsomal intrinsic clearance • Rat hepatocyte intrinsic clearance
  • 32. What if the models were already built for you • Instead of having to go into a database and find data • The models are already prebuilt • Ready to use • Shareable • Create a repository of models
  • 33. Previous work by others • Using large datasets to predict targets with Bayesian algorithm • Bayesian classifier - 698 target models (> 200,000 molecules, 561,000 measurements) Paolini et al 2006 • 246 targets (65,241 molecules) Similarity ensemble analysis Keiser et al 2007 • 2000 targets (167,000 molecules) target identification from zebrafish screen Laggner et al 2012 • 70 targets (100,269 data points) Bender et al 2007 • Many others….. • None of these enable you to qualitatively or quantitatively predict activity for a single target.
  • 34. Recent Studies • Bit folding – trade off between performance & efficacy • Model cut-off selection for cross validation • Scalability of ECFP6 and FCFP6 using ChEMBL 20 mid size datasets • CDK codebase on GitHub (http://github.com/cdk/cdk: look for class org.openscience.cdk.fingerprint.model.Bayesian) • Made the models accessible http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
  • 35. What do 2000 ChEMBL models look like Folding bit size Average ROC http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
  • 36. ChEMBL 20 • Skipped targets with > 100,000 assays and sets with < 100 measurements • Converted data to –log • Dealt with duplicates • 2152 datasets • Cutoff determination • Balance active/ inactive ratio • Favor structural diversity and activity distribution Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
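Two of the preparation steps listed here, "Converted data to –log" and "Dealt with duplicates", can be illustrated with a short sketch. It assumes activities are reported in nM and that duplicate measurements for a molecule are averaged after conversion; the slides state neither detail, so both are assumptions, as are the function names and example records.

```python
import math
from collections import defaultdict
from statistics import mean


def to_p_activity(value_nm):
    """Convert an activity in nM to -log10 of the molar value (e.g. 1000 nM -> 6.0)."""
    return 9.0 - math.log10(value_nm)


def merge_duplicates(records):
    """Collapse repeated measurements to one value per molecule.

    records: iterable of (molecule_id, value_nm) pairs
    Returns {molecule_id: mean -log10 activity}. Averaging is an assumed policy.
    """
    grouped = defaultdict(list)
    for mol_id, value_nm in records:
        grouped[mol_id].append(to_p_activity(value_nm))
    return {mol_id: mean(values) for mol_id, values in grouped.items()}


# Hypothetical measurements, for illustration only.
merged = merge_duplicates([("mol-1", 1000.0), ("mol-1", 10.0), ("mol-2", 100.0)])
print(merged)
```

Averaging on the log scale rather than the raw scale keeps a single very weak measurement from dominating the merged value.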
  • 37. Desirability score • ROC integral for model using subset of molecules and threshold for partitioning active / inactive (higher is better) • Second derivative of population interpolated from the current threshold (lower is better) • Ratio of actives to inactives if the collection is partitioned at the threshold: (actives+1) / (inactives+1) or its reciprocal, whichever is greater Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60 Ekins et al Drug Metab Dispos 43(10):1642-5, 2015
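The third term of the desirability score is simple enough to write out directly. The sketch below computes only that ratio for a candidate threshold; how the three terms are weighted and combined is not given in the slides and is not attempted here. The function names and the convention that values at or above the threshold count as active are assumptions.

```python
def partition_counts(p_activities, threshold):
    """Count actives and inactives, treating values >= threshold as active (assumed)."""
    actives = sum(1 for value in p_activities if value >= threshold)
    return actives, len(p_activities) - actives


def balance_ratio(actives, inactives):
    """(actives+1)/(inactives+1) or its reciprocal, whichever is greater.

    1.0 means a perfectly balanced partition; larger values mean more imbalance.
    """
    ratio = (actives + 1) / (inactives + 1)
    return max(ratio, 1 / ratio)


# Hypothetical -log activity values, for illustration only.
values = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
balanced = balance_ratio(*partition_counts(values, threshold=6.5))
skewed = balance_ratio(*partition_counts(values, threshold=8.0))
print(balanced, skewed)
```

Adding 1 to both counts keeps the ratio finite when a threshold puts every molecule on one side.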
  • 38. Models from ChEMBL data http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
  • 39. Results • Bit folding – plateau at 4096, can use 1024 with little degradation • Cut off – works well • Evaluated balanced training/test splits and diabolical splits, where test and training sets are structurally different • Easy ROC 0.83 ± 0.11 • Hard ROC 0.39 ± 0.23 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
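Bit folding maps the sparse integer hashes of an ECFP6/FCFP6 fingerprint onto a fixed number of positions such as 1024 or 4096. The sketch below uses a plain modulo as the folding rule; this is one common way to fold and is an assumption here, since the slides do not spell out the exact scheme used in the cited study.

```python
def fold_fingerprint(hashes, size=1024):
    """Fold sparse integer hashes into sorted bit positions in the range [0, size).

    Different hashes can land on the same position (a collision); smaller sizes
    save space at the cost of more collisions.
    """
    return sorted({h % size for h in hashes})


# Hypothetical hash values: 5 and 1029 collide when folded to 1024 positions.
example_hashes = [5, 1029, 2048, -1]
folded_1024 = fold_fingerprint(example_hashes, size=1024)
folded_4096 = fold_fingerprint(example_hashes, size=4096)
print(folded_1024, folded_4096)
```

With 4096 positions all four example hashes stay distinct, while 1024 merges two of them, which illustrates the size versus information trade-off that the folding results describe.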
  • 40. Models in mobile app • Added atom coloring using ECFP6 fingerprints • Red and green high and low probability of activity, respectively Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
  • 41. Results for Bayesian model cross validation. 5-fold and Leave one out (LOO) validation with Bayesian models generated with Discovery Studio and Open Models implemented in the mobile app MMDS. * = previously published Ekins et al Drug Metab Dispos 43(10):1642-5, 2015 Transporter models
  • 42. Ekins et al Drug Metab Dispos 43(10):1642-5, 2015 Transporter models
  • 43. ToxCast data • Few studies use the ToxCast data for machine learning • Recent reviews Sipes et al., Chem Res Toxicol. 2013 Jun 17; 26(6): 878–895. • Liu et al., Chem Res Toxicol. 2015 Apr 20;28(4):738-51 • A set of 677 chemicals was represented by 711 in vitro bioactivity descriptors (from ToxCast assays), 4,376 chemical structure descriptors (from QikProp, OpenBabel, PaDEL, and PubChem), and three hepatotoxicity categories (from animal studies) • Six machine learning algorithms: linear discriminant analysis (LDA), Naïve Bayes (NB), support vector machines (SVM), classification and regression trees (CART), k-nearest neighbors (KNN), and an ensemble of these classifiers (ENSMB) • Nuclear receptor activation and mitochondrial functions were frequently found in highly predictive classifiers of hepatotoxicity • CART, ENSMB, and SVM classifiers performed the best
  • 44. CDD Models for human P450s (NVS data) from ToxCast (n=1787) <1uM cutoff CYP1A1 CYP1A2 CYP2B6 CYP2C18 CYP2C19 CYP2C9 CYP3A4 CYP3A5
  • 45. ToxCast models in a mobile app IC50 1A2 = 2.25 uM IC50 2C9 = 3.55 uM IC50 2C19 = 10.8 uM In vitro data Courtesy Dr. Joel Freundlich
  • 46. PolyPharma a new free app for drug discovery
  • 47. Composite models - Binned Bayesians Clark et al., Submitted 2015
  • 48. Summary • Shown that open source models/descriptors are comparable to previously published models built with commercial software • Implemented Bayesian machine learning in CDD Vault • Can be used on private or public data • Can enable sharing of models in CDD Vault • Enabled export of models – can use models in 3rd party mobile apps or other tools • Demonstrated various ADME/Tox models and transporters • Make ToxCast data into models that can be used by anyone • Provide more information on models and predictions • Visualize training set molecules vs test compounds • Use a model to predict compounds and then test them
  • 49. Future ? + = Big Models Thousands of Big Models How do you validate 1000’s of models? How do algorithms handle 500K – 1M molecules? Need new algorithms, data visualization, mining approaches Model sharing is here Need for broad Biology & Chemistry knowledge – open minds, BIG thinkers
  • 50. Acknowledgments • Alex Clark Antony Williams • Joel Freundlich Robert Reynolds • Steven Wright • Krishna Dole and all colleagues at CDD • Award Number 9R44TR000942-02 “Biocomputation across distributed private datasets to enhance drug discovery” from the NIH National Center for Advancing Translational Sciences. • R41-AI108003-01 “Identification and validation of targets of phenotypic high throughput screening” from NIH National Institute of Allergy and Infectious Diseases • Bill and Melinda Gates Foundation (Grant#49852 “Collaborative drug discovery for TB through a novel database of SAR data optimized to promote data archiving and sharing”).
  • 51. Software on github Models can be accessed at • http://molsync.com/bayesian1 • http://molsync.com/bayesian2 • http://molsync.com/transporters