SlideShare a Scribd company logo
© 2018 KNIME AG. All Rights Reserved.
How Do You Build and Validate 1500
Models and What Can You Learn from
Them?
Greg Landrum*, Anna Martin,
Daria Goldmann
KNIME AG
2018 ICCS
@dr_greg_landrum
© 2018 KNIME AG. All Rights Reserved.
The Monster Model Factory
Greg Landrum*, Anna Martin,
Daria Goldmann
KNIME AG
2018 ICCS
@dr_greg_landrum
© 2018 KNIME AG. All Rights Reserved. 3
Who cares?
• I have >1500 datasets from ChEMBL that I would like
to build models for
• I want to actually use the models, so they need to
be deployed
• The whole process needs to be automated and
reproducible so that I can do it again when ChEMBL
is updated
• Maybe we can learn something interesting from the
models themselves
4© 2018 KNIME AG. All Rights Reserved.
Back to the beginning
© 2018 KNIME AG. All Rights Reserved. 5
The model process
Image from:
https://upload.wikimedia.org/wikipedia/commons
/b/b9/CRISP-DM_Process_Diagram.png
CRISP-DM (CRoss Industry
Standard Process for Data
Mining) is a standard
process for data mining
solutions.
wikipedia://CRISP-DM
© 2018 KNIME AG. All Rights Reserved. 6
The model process
Image from:
https://upload.wikimedia.org/wiki
pedia/commons/b/b9/CRISP-
DM_Process_Diagram.png
Init Load Transform Learn Score Evaluate Deploy
© 2018 KNIME AG. All Rights Reserved. 7
The model process, multiple models
…
© 2018 KNIME AG. All Rights Reserved. 8
The model process, multiple models
…
© 2018 KNIME AG. All Rights Reserved. 9
The model process, multiple models
…
https://commons.wikimedia.org/wiki/File:Jabberwocky.jpg
© 2018 KNIME AG. All Rights Reserved. 10
The model process, multiple models
…
It’s not feasible to manually do this
for a daunting number of models!
https://commons.wikimedia.org/wiki/File:Jabberwocky.jpg
11© 2018 KNIME AG. All Rights Reserved.
https://www.publicdomainpictures.net/view-image.php?image=155188
© 2018 KNIME AG. All Rights Reserved. 12
Automation: the model process factory
© 2018 KNIME AG. All Rights Reserved. 13
Init Load Transform Learn Score Evaluate Deploy
Automation: the model process factory
Score EvaluateTransform DeployLoad Learn
Score
Learn
Load Transform Evaluate Deploy
Score EvaluateTransform DeployLoad Learn
Score
Learn
Load Transform Evaluate Deploy
Make each step a separate workflow.
Use KNIME to orchestrate calling those workflows
KNIME blog post: https://goo.gl/LvESqB
White paper: https://goo.gl/d6UpUu
© 2018 KNIME AG. All Rights Reserved. 14
Model Factory Init Load Transform Learn Score Evaluate Deploy
© 2018 KNIME AG. All Rights Reserved. 15
The heart of the factory: Call Local Workflow1
• Executes another workflow in the same local repository
https://pixabay.com/en/heart-veins-arteries-anatomy-152594/
1 Call Remote Workflow
when run on the KNIME
Server
© 2018 KNIME AG. All Rights Reserved. 16
Model Factory Init Load Transform Learn Score Evaluate Deploy
© 2018 KNIME AG. All Rights Reserved. 17
Model Factory Init Load Transform Learn Score Evaluate Deploy
18© 2018 KNIME AG. All Rights Reserved.
Details
© 2018 KNIME AG. All Rights Reserved. 19
Extracting the data
• Data source: ChEMBL 23
• Activity types:
('GI50', 'IC50', 'Ki', 'MIC', 'EC50', 'AC50', 'ED50', 'GI', 'Kd', 'CC50', 'LC50',
'MIC90', 'MIC50', 'ID50’) -> 6.5 million points
• Define active:
Standard_value < 100nM -> 1.3 million actives
• Define inactive:
Standard_value > 1uM
• Define an interesting assay
At least 50 actives -> 1556 assays
• Final dataset size: 2.5 million data points, 1.5 million
compounds
Init Load Transform Learn Score Evaluate Deploy
© 2018 KNIME AG. All Rights Reserved. 20
Init Load Transform Learn Score Evaluate DeployFinding more inactives
• The ChEMBL datasets almost all have an
unrealistically high ratio of actives to inactives
• “Fix” that by adding enough assumed inactives to
each dataset to get a 1:10 active:inactive ratio
• Pick those assumed inactives to be roughly similar
to the actives: Tanimoto similarity of between 0.35
and 0.6 using RDKit Morgan 2 fingerprints
© 2018 KNIME AG. All Rights Reserved. 21
Extracting the data Init Load Transform Learn Score Evaluate Deploy
© 2018 KNIME AG. All Rights Reserved. 22
Transform
• Convert SMILES from database into chemical
structures
• Cleanup the chemical structures
Init Load Transform Learn Score Evaluate Deploy
http://www.rdkit.org
© 2018 KNIME AG. All Rights Reserved. 23
Transform
• Convert SMILES from database into chemical
structures
• Cleanup the chemical structures
• Generate five chemical fingerprints for each
structure
Init Load Transform Learn Score Evaluate Deploy
http://www.rdkit.org
© 2018 KNIME AG. All Rights Reserved. 24
Transform
• Convert SMILES from database into chemical
structures
• Cleanup the chemical structures
• Generate five chemical fingerprints for each
structure
– Morgan 3 counts (ECFC6), 4K “bits”
– Morgan 3 (ECFP6), 4K bits
– Morgan 2 (ECFP4), 2K bits
– RDKit FP, length 1-5, 2K bits
– Atom pairs, distances 1-20, 4K bits
Init Load Transform Learn Score Evaluate Deploy
http://www.rdkit.org
© 2018 KNIME AG. All Rights Reserved. 25
Learn and Score Init Load Transform Learn Score Evaluate Deploy
10 different stratified random
training/holdout splits generated for
each assay
© 2018 KNIME AG. All Rights Reserved. 26
Learn Init Load Transform Learn Score Evaluate Deploy
Learning:
• Fingerprint Bayes (NB)
• Logistic Regression (LR)
• Random Forest (RF)
200 trees, max depth=10,
min_leaf_size=3, min_node_size=6
• Gradient Boosting (H2O)
100 trees, max_depth = 5,
learning_rate = 0.05
Model Selection:
• Pick best model based on Enrichment
factor at 5% (EF5)
© 2018 KNIME AG. All Rights Reserved. 27
Learn Init Load Transform Learn Score Evaluate Deploy
Where did these parameters
come from?
Learning:
• Fingerprint Bayes (NB)
• Logistic Regression (LR)
• Random Forest (RF)
200 trees, max depth=10,
min_leaf_size=3, min_node_size=6
• Gradient Boosting (H2O)
100 trees, max_depth = 5,
learning_rate = 0.05
© 2018 KNIME AG. All Rights Reserved. 28
Parameter Optimization Init Load Transform Learn Score Evaluate Deploy
• Full parameter optimization done for each
method+fingerprint on 70 assays
• Results used to pick “standard” parameter
sets:
– Random Forest: 200 trees, max depth=10,
min_leaf_size=3, min_node_size=6
– Gradient Boosting: 100 trees, max_depth = 5,
learning_rate = 0.05
© 2018 KNIME AG. All Rights Reserved. 29
Parameter Optimization Init Load Transform Learn Score Evaluate Deploy
© 2018 KNIME AG. All Rights Reserved. 30
Parameter Optimization Init Load Transform Learn Score Evaluate Deploy
The optimization and model selection workflow is presented in detail in Daria’s
KNIME blog post:
https://www.knime.com/blog/stuck-in-the-nine-circles-of-hell-try-parameter-
optimization-a-cup-of-tea
The workflow is available in the EXAMPLES folder inside KNIME:
04_Analytics/11_Optimization/08_Model_Optimization_and_Selection
31© 2018 KNIME AG. All Rights Reserved.
Making it all run
Init Load Transform Learn Score Evaluate Deploy
© 2018 KNIME AG. All Rights Reserved. 32
Execution
• In total >310K models were built1
1 ~1550 assays * 4 methods * 5 FPs * 10 repeats
© 2018 KNIME AG. All Rights Reserved. 33
Execution
KNIME Analytics
Platform
KNIME
Server
...
Distributed Executor
Distributed Executor
Distributed Executor
Build/test workflows Run model factory Run individual assays
65-70 load-balanced
distributed executors
34© 2018 KNIME AG. All Rights Reserved.
Are the models any good?
© 2018 KNIME AG. All Rights Reserved. 35
Performance on validation sets
• AUC: mean=0.958
s=0.070
• Cohen’s kappa:
mean=0.690 s=0.382
© 2018 KNIME AG. All Rights Reserved. 36
Performance on validation sets
• AUC: mean=0.958
s=0.070
• Cohen’s kappa:
mean=0.690 s=0.382
Yeah!
© 2018 KNIME AG. All Rights Reserved. 37
Performance on validation sets
• AUC: mean=0.958
s=0.070
• Cohen’s kappa:
mean=0.690 s=0.382
Yeah! Uh oh…
38© 2018 KNIME AG. All Rights Reserved.
https://www.publicdomainpictures.net/view-image.php?image=155188
© 2018 KNIME AG. All Rights Reserved. 39
An experiment to check model generalizability
• Pick assays where standard_type is Ki
• Group them by target ID
• Limit to targets where Ki was measured in at least 5
assays -> 11 targets, 73 assays
• Use the model built on one assay from a target ID to
predict activity across the other assays.
© 2018 KNIME AG. All Rights Reserved. 40
An experiment to check model generalizability
• The targets:
TargetID Name Num Assays
CHEMBL205 Carbonic anhydrase II 7
CHEMBL224 Serotonin 2a (5-HT2a) receptor 8
CHEMBL234 Dopamine D3 receptor 10
CHEMBL243 Human immunodeficiency virus type 1 protease 6
CHEMBL244 Coagulation factor X 5
CHEMBL253 Cannabinoid CB2 receptor 7
CHEMBL281 Carbonic anhydrase IV 5
CHEMBL3371 Serotonin 6 (5-HT6) receptor 8
CHEMBL344 Melanin-concentrating hormone receptor 1 5
CHEMBL4550 5-lipoxygenase activating protein 5
CHEMBL4908 Trace amine-associated receptor 1 7
© 2018 KNIME AG. All Rights Reserved. 41
Carbonic Anhydrase IV
Carbonic Anhydrase II
HIV Protease
Factor X
5-HT6
TAAR1
© 2018 KNIME AG. All Rights Reserved. 42
Carbonic Anhydrase IV
Carbonic Anhydrase II HIV Protease
Factor X
5-HT6 TAAR1
© 2018 KNIME AG. All Rights Reserved. 43
An Example
Target: CHEMBL3371 (5-HT6)
Train on Assay ID: 448716
Test with Assay ID: 1366806
AUROC: 0.38
EF5: 0
© 2018 KNIME AG. All Rights Reserved. 44
An Example
Assay_ID 448716 Assay_ID 1366806
© 2018 KNIME AG. All Rights Reserved. 45
An Example
Target: CHEMBL3371 (5-HT6)
Train on Assay ID: 448716
Test with Assay ID: 659849
AUROC: 0.99
EF5: 8.8
© 2018 KNIME AG. All Rights Reserved. 46
An Example
Assay_ID 448716 Assay_ID 659849
© 2018 KNIME AG. All Rights Reserved. 47
An Example
Target: CHEMBL3371 (5-HT6)
Train on Assay ID: 448716
Test with Assay ID: 1528679
AUROC: 0.83
EF5: 0.4
© 2018 KNIME AG. All Rights Reserved. 48
An Example
Assay_ID 448716 Assay_ID 1528679
© 2018 KNIME AG. All Rights Reserved. 49
Intermediate conclusion
• Many/most of the models have likely overfit the
training data
• Alternative interpretation: we’ve actually built
models to predict whether or not a compound is
taken from a particular paper
• Unfortunately these are functionally the same if you
want to predict activity
50© 2018 KNIME AG. All Rights Reserved.
https://www.publicdomainpictures.net/view-image.php?image=155188
© 2018 KNIME AG. All Rights Reserved. 51
Look for frequent algorithm + fingerprint combinations
• For each of the ~1550 assays * 4 learning algorithms
* 10 repeats, look at which fingerprint performed
best (as measured by EF5)
© 2018 KNIME AG. All Rights Reserved. 52
Look for frequent algorithm + fingerprint combinations
For each of the ~1550 assays * 4 learning
algorithms * 10 repeats, look at which fingerprint
performed best (as measured by EF5)
© 2018 KNIME AG. All Rights Reserved. 53
Which method/FP pair is best for each assay?
• For each of the ~1550 assays * 10 repeats, look at
which algorithm + fingerprint performed best (as
measured by EF51, AUC2, and algorithm complexity3)
1 Rounded to 1 decimal point
2 Rounded to 2 decimal points
3 Random Forest > Gradient Boosting > Fingerprint Bayes > Logistic
Regression
© 2018 KNIME AG. All Rights Reserved. 54
Which method/FP pair is best for each assay?
Select best model using EF5,
AUC, algorithm complexity
© 2018 KNIME AG. All Rights Reserved. 55
Wrapping up
• We have automated the construction and evaluation
of >1500 models for bioassays using data pulled
from ChEMBL
• We’ve got some strong evidence that the models
themselves are significantly overfit
• We were able to start to draw some general
conclusions about fingerprints and methods
© 2018 KNIME AG. All Rights Reserved. 56
There’s still a lot left to do
• Verify the repeatability of the process by updating
when the next version of ChEMBL is released
• Some more thought into combining assays to get
around the “one series per paper” problem
• Look into doing the full optimization run
• Come up with a good way of presenting the
predictions
© 2018 KNIME AG. All Rights Reserved. 57
More details…
• Model process factory blog post: https://goo.gl/LvESqB
• Model process factory white paper:
https://goo.gl/d6UpUu
• Model process factory workflow:
knime://EXAMPLES/50_Applications/26_Model_Process_
Management
• Daria’s blog post on the model optimization workflow:
https://www.knime.com/blog/stuck-in-the-nine-circles-
of-hell-try-parameter-optimization-a-cup-of-tea
• Accompanying workflow: knime://EXAMPLES/
04_Analytics/11_Optimization/08_Model_Optimization_
and_Selection
• When we’re done cleaning up, there will be a blog
post/sample workflow for the monster model factory too.
© 2018 KNIME AG. All Rights Reserved. 58
7th RDKit UGM: 19 - 21 September
• Hosted by Andreas Bender, Cambridge
University
• Free registration: https://goo.gl/VVvHUH
(or get it on http://www.rdkit.org)
http://www.rdkit.org
© 2018 KNIME AG. All Rights Reserved. 59
KNIME Fall Summit 2018
November 6 – 9 at AT&T Executive Education and
Conference Center, Austin, Texas
• Tuesday & Wednesday: One-day courses
• Thursday & Friday: Summit sessions
Use the code
ICCS-2018
for 10% off tickets.
Register at:
knime.com/fall-summit2018
60© 2018 KNIME AG. All Rights Reserved.
The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by
KNIME.com AG under license from KNIME GmbH, and are registered in the United States.
KNIME® is also registered in Germany.

More Related Content

What's hot

HDF Product Designer
HDF Product DesignerHDF Product Designer
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntopIT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
InfluxData
 
Case Studies in advanced analytics with R
Case Studies in advanced analytics with RCase Studies in advanced analytics with R
Case Studies in advanced analytics with R
Wit Jakuczun
 
Know your R usage workflow to handle reproducibility challenges
Know your R usage workflow to handle reproducibility challengesKnow your R usage workflow to handle reproducibility challenges
Know your R usage workflow to handle reproducibility challenges
Wit Jakuczun
 
Developing Your Own Flux Packages by David McKay | Head of Developer Relation...
Developing Your Own Flux Packages by David McKay | Head of Developer Relation...Developing Your Own Flux Packages by David McKay | Head of Developer Relation...
Developing Your Own Flux Packages by David McKay | Head of Developer Relation...
InfluxData
 
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Revolution Analytics
 
Clearing Airflow Obstructions
Clearing Airflow ObstructionsClearing Airflow Obstructions
Clearing Airflow Obstructions
Tatiana Al-Chueyr
 
Performance Co-Pilot
Performance Co-PilotPerformance Co-Pilot
Performance Co-Pilot
YOSHIKAWA Ryota
 
HDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF ServiceHDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF Service
The HDF-EOS Tools and Information Center
 
Managing large (and small) R based solutions with R Suite
Managing large (and small) R based solutions with R SuiteManaging large (and small) R based solutions with R Suite
Managing large (and small) R based solutions with R Suite
Wit Jakuczun
 
Raster Algebra mit Oracle Spatial und uDig
Raster Algebra mit Oracle Spatial und uDigRaster Algebra mit Oracle Spatial und uDig
Raster Algebra mit Oracle Spatial und uDig
Karin Patenge
 
RIPE Atlas
RIPE AtlasRIPE Atlas
RIPE Atlas
RIPE NCC
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
InfluxData
 
OpenACC Highlights - February
OpenACC Highlights - FebruaryOpenACC Highlights - February
OpenACC Highlights - February
NVIDIA
 
Migrating PostgreSQL to the Cloud
Migrating PostgreSQL to the CloudMigrating PostgreSQL to the Cloud
Migrating PostgreSQL to the Cloud
Mike Fowler
 
Helix Nebula the Science Cloud: Pre-Commercial Procurement pilot
Helix Nebula the Science Cloud: Pre-Commercial Procurement pilotHelix Nebula the Science Cloud: Pre-Commercial Procurement pilot
Helix Nebula the Science Cloud: Pre-Commercial Procurement pilot
Helix Nebula The Science Cloud
 
Deploying MariaDB for HA on Google Cloud Platform
Deploying MariaDB for HA on Google Cloud PlatformDeploying MariaDB for HA on Google Cloud Platform
Deploying MariaDB for HA on Google Cloud Platform
MariaDB plc
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Sonja Schweigert
 
Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!
Databricks
 
Scossu gdi iiif_r+d_report_2019
Scossu gdi iiif_r+d_report_2019Scossu gdi iiif_r+d_report_2019
Scossu gdi iiif_r+d_report_2019
Stefano Cossu
 

What's hot (20)

HDF Product Designer
HDF Product DesignerHDF Product Designer
HDF Product Designer
 
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntopIT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
 
Case Studies in advanced analytics with R
Case Studies in advanced analytics with RCase Studies in advanced analytics with R
Case Studies in advanced analytics with R
 
Know your R usage workflow to handle reproducibility challenges
Know your R usage workflow to handle reproducibility challengesKnow your R usage workflow to handle reproducibility challenges
Know your R usage workflow to handle reproducibility challenges
 
Developing Your Own Flux Packages by David McKay | Head of Developer Relation...
Developing Your Own Flux Packages by David McKay | Head of Developer Relation...Developing Your Own Flux Packages by David McKay | Head of Developer Relation...
Developing Your Own Flux Packages by David McKay | Head of Developer Relation...
 
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
 
Clearing Airflow Obstructions
Clearing Airflow ObstructionsClearing Airflow Obstructions
Clearing Airflow Obstructions
 
Performance Co-Pilot
Performance Co-PilotPerformance Co-Pilot
Performance Co-Pilot
 
HDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF ServiceHDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF Service
 
Managing large (and small) R based solutions with R Suite
Managing large (and small) R based solutions with R SuiteManaging large (and small) R based solutions with R Suite
Managing large (and small) R based solutions with R Suite
 
Raster Algebra mit Oracle Spatial und uDig
Raster Algebra mit Oracle Spatial und uDigRaster Algebra mit Oracle Spatial und uDig
Raster Algebra mit Oracle Spatial und uDig
 
RIPE Atlas
RIPE AtlasRIPE Atlas
RIPE Atlas
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
 
OpenACC Highlights - February
OpenACC Highlights - FebruaryOpenACC Highlights - February
OpenACC Highlights - February
 
Migrating PostgreSQL to the Cloud
Migrating PostgreSQL to the CloudMigrating PostgreSQL to the Cloud
Migrating PostgreSQL to the Cloud
 
Helix Nebula the Science Cloud: Pre-Commercial Procurement pilot
Helix Nebula the Science Cloud: Pre-Commercial Procurement pilotHelix Nebula the Science Cloud: Pre-Commercial Procurement pilot
Helix Nebula the Science Cloud: Pre-Commercial Procurement pilot
 
Deploying MariaDB for HA on Google Cloud Platform
Deploying MariaDB for HA on Google Cloud PlatformDeploying MariaDB for HA on Google Cloud Platform
Deploying MariaDB for HA on Google Cloud Platform
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
 
Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!
 
Scossu gdi iiif_r+d_report_2019
Scossu gdi iiif_r+d_report_2019Scossu gdi iiif_r+d_report_2019
Scossu gdi iiif_r+d_report_2019
 

Similar to How Do You Build and Validate 1500 Models and What Can You Learn from Them?

KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIMESlides
 
From raw data to deployment
From raw data to deployment From raw data to deployment
From raw data to deployment
KNIMESlides
 
From Raw Data to Deployment
From Raw Data to DeploymentFrom Raw Data to Deployment
From Raw Data to Deployment
KNIMESlides
 
Python tutorial for ML
Python tutorial for MLPython tutorial for ML
Python tutorial for ML
Bin Han
 
KNIME Data Science Learnathon: From Raw Data To Deployment
KNIME Data Science Learnathon: From Raw Data To DeploymentKNIME Data Science Learnathon: From Raw Data To Deployment
KNIME Data Science Learnathon: From Raw Data To Deployment
KNIMESlides
 
Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...
Greg Landrum
 
Fri benghiat gil-odsc-data-kitchen-data science to dataops
Fri benghiat gil-odsc-data-kitchen-data science to dataopsFri benghiat gil-odsc-data-kitchen-data science to dataops
Fri benghiat gil-odsc-data-kitchen-data science to dataops
DataKitchen
 
ODSC data science to DataOps
ODSC data science to DataOpsODSC data science to DataOps
ODSC data science to DataOps
Christopher Bergh
 
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
Alok Singh
 
Webinar: Deep Learning Pipelines Beyond the Learning
Webinar: Deep Learning Pipelines Beyond the LearningWebinar: Deep Learning Pipelines Beyond the Learning
Webinar: Deep Learning Pipelines Beyond the Learning
Mesosphere Inc.
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Databricks
 
Machine Learning at the Edge (AIM302) - AWS re:Invent 2018
Machine Learning at the Edge (AIM302) - AWS re:Invent 2018Machine Learning at the Edge (AIM302) - AWS re:Invent 2018
Machine Learning at the Edge (AIM302) - AWS re:Invent 2018
Amazon Web Services
 
Master the RETE algorithm
Master the RETE algorithmMaster the RETE algorithm
Master the RETE algorithm
Masahiko Umeno
 
Your Flight is Boarding Now!
Your Flight is Boarding Now!Your Flight is Boarding Now!
Your Flight is Boarding Now!
MeetupDataScienceRoma
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon Web Services
 
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
NETWAYS
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...
Amazon Web Services
 
Sharing and Deploying Data Science with KNIME Server
Sharing and Deploying Data Science with KNIME ServerSharing and Deploying Data Science with KNIME Server
Sharing and Deploying Data Science with KNIME Server
KNIMESlides
 
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Amazon Web Services
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon Web Services
 

Similar to How Do You Build and Validate 1500 Models and What Can You Learn from Them? (20)

KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
 
From raw data to deployment
From raw data to deployment From raw data to deployment
From raw data to deployment
 
From Raw Data to Deployment
From Raw Data to DeploymentFrom Raw Data to Deployment
From Raw Data to Deployment
 
Python tutorial for ML
Python tutorial for MLPython tutorial for ML
Python tutorial for ML
 
KNIME Data Science Learnathon: From Raw Data To Deployment
KNIME Data Science Learnathon: From Raw Data To DeploymentKNIME Data Science Learnathon: From Raw Data To Deployment
KNIME Data Science Learnathon: From Raw Data To Deployment
 
Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...
 
Fri benghiat gil-odsc-data-kitchen-data science to dataops
Fri benghiat gil-odsc-data-kitchen-data science to dataopsFri benghiat gil-odsc-data-kitchen-data science to dataops
Fri benghiat gil-odsc-data-kitchen-data science to dataops
 
ODSC data science to DataOps
ODSC data science to DataOpsODSC data science to DataOps
ODSC data science to DataOps
 
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
 
Webinar: Deep Learning Pipelines Beyond the Learning
Webinar: Deep Learning Pipelines Beyond the LearningWebinar: Deep Learning Pipelines Beyond the Learning
Webinar: Deep Learning Pipelines Beyond the Learning
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
 
Machine Learning at the Edge (AIM302) - AWS re:Invent 2018
Machine Learning at the Edge (AIM302) - AWS re:Invent 2018Machine Learning at the Edge (AIM302) - AWS re:Invent 2018
Machine Learning at the Edge (AIM302) - AWS re:Invent 2018
 
Master the RETE algorithm
Master the RETE algorithmMaster the RETE algorithm
Master the RETE algorithm
 
Your Flight is Boarding Now!
Your Flight is Boarding Now!Your Flight is Boarding Now!
Your Flight is Boarding Now!
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
 
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...
 
Sharing and Deploying Data Science with KNIME Server
Sharing and Deploying Data Science with KNIME ServerSharing and Deploying Data Science with KNIME Server
Sharing and Deploying Data Science with KNIME Server
 
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
 

More from Greg Landrum

Chemical registration
Chemical registrationChemical registration
Chemical registration
Greg Landrum
 
Mike Lynch Award Lecture, ICCS 2022
Mike Lynch Award Lecture, ICCS 2022Mike Lynch Award Lecture, ICCS 2022
Mike Lynch Award Lecture, ICCS 2022
Greg Landrum
 
Google BigQuery for analysis of scientific datasets: Interactive exploration ...
Google BigQuery for analysis of scientific datasets: Interactive exploration ...Google BigQuery for analysis of scientific datasets: Interactive exploration ...
Google BigQuery for analysis of scientific datasets: Interactive exploration ...
Greg Landrum
 
Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)
Greg Landrum
 
Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)
Greg Landrum
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical research
Greg Landrum
 
Some "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data frontSome "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data front
Greg Landrum
 
Large scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataLarge scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent data
Greg Landrum
 
Machine learning in the life sciences with knime
Machine learning in the life sciences with knimeMachine learning in the life sciences with knime
Machine learning in the life sciences with knime
Greg Landrum
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKit
Greg Landrum
 
Open-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesOpen-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databases
Greg Landrum
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Greg Landrum
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...
Greg Landrum
 

More from Greg Landrum (13)

Chemical registration
Chemical registrationChemical registration
Chemical registration
 
Mike Lynch Award Lecture, ICCS 2022
Mike Lynch Award Lecture, ICCS 2022Mike Lynch Award Lecture, ICCS 2022
Mike Lynch Award Lecture, ICCS 2022
 
Google BigQuery for analysis of scientific datasets: Interactive exploration ...
Google BigQuery for analysis of scientific datasets: Interactive exploration ...Google BigQuery for analysis of scientific datasets: Interactive exploration ...
Google BigQuery for analysis of scientific datasets: Interactive exploration ...
 
Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)
 
Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)Building useful models for imbalanced datasets (without resampling)
Building useful models for imbalanced datasets (without resampling)
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical research
 
Some "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data frontSome "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data front
 
Large scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataLarge scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent data
 
Machine learning in the life sciences with knime
Machine learning in the life sciences with knimeMachine learning in the life sciences with knime
Machine learning in the life sciences with knime
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKit
 
Open-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesOpen-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databases
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...
 

Recently uploaded

(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
anitaento25
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
muralinath2
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
Areesha Ahmad
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
Health Advances
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
DiyaBiswas10
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
sachin783648
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
ChetanK57
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
AADYARAJPANDEY1
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SELF-EXPLANATORY
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
muralinath2
 
Citrus Greening Disease and its Management
Citrus Greening Disease and its ManagementCitrus Greening Disease and its Management
Citrus Greening Disease and its Management
subedisuryaofficial
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 

Recently uploaded (20)

(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
Citrus Greening Disease and its Management
Citrus Greening Disease and its ManagementCitrus Greening Disease and its Management
Citrus Greening Disease and its Management
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 

How Do You Build and Validate 1500 Models and What Can You Learn from Them?

  • 1. © 2018 KNIME AG. All Rights Reserved. How Do You Build and Validate 1500 Models and What Can You Learn from Them? Greg Landrum*, Anna Martin, Daria Goldmann KNIME AG 2018 ICCS @dr_greg_landrum
  • 2. © 2018 KNIME AG. All Rights Reserved. The Monster Model Factory Greg Landrum*, Anna Martin, Daria Goldmann KNIME AG 2018 ICCS @dr_greg_landrum
  • 3. © 2018 KNIME AG. All Rights Reserved. 3 Who cares? • I have >1500 datasets from ChEMBL that I would like to build models for • I want to actually use the models, so they need to be deployed • The whole process needs to be automated and reproducible so that I can do it again when ChEMBL is updated • Maybe we can learn something interesting from the models themselves
  • 4. 4© 2018 KNIME AG. All Rights Reserved. Back to the beginning
  • 5. © 2018 KNIME AG. All Rights Reserved. 5 The model process Image from: https://upload.wikimedia.org/wikipedia/commons /b/b9/CRISP-DM_Process_Diagram.png CRISP-DM (CRoss Industry Standard Process for Data Mining) is a standard process for data mining solutions. wikipedia://CRISP-DM
  • 6. © 2018 KNIME AG. All Rights Reserved. 6 The model process Image from: https://upload.wikimedia.org/wiki pedia/commons/b/b9/CRISP- DM_Process_Diagram.png Init Load Transform Learn Score Evaluate Deploy
  • 7. © 2018 KNIME AG. All Rights Reserved. 7 The model process, multiple models …
  • 8. © 2018 KNIME AG. All Rights Reserved. 8 The model process, multiple models …
  • 9. © 2018 KNIME AG. All Rights Reserved. 9 The model process, multiple models … https://commons.wikimedia.org/wiki/File:Jabberwocky.jpg
  • 10. © 2018 KNIME AG. All Rights Reserved. 10 The model process, multiple models … It’s not feasible to manually do this for a daunting number of models! https://commons.wikimedia.org/wiki/File:Jabberwocky.jpg
  • 11. 11© 2018 KNIME AG. All Rights Reserved. https://www.publicdomainpictures.net/view-image.php?image=155188
  • 12. © 2018 KNIME AG. All Rights Reserved. 12 Automation: the model process factory
  • 13. © 2018 KNIME AG. All Rights Reserved. 13 Init Load Transform Learn Score Evaluate Deploy Automation: the model process factory Score EvaluateTransform DeployLoad Learn Score Learn Load Transform Evaluate Deploy Score EvaluateTransform DeployLoad Learn Score Learn Load Transform Evaluate Deploy Make each step a separate workflow. Use KNIME to orchestrate calling those workflows KNIME blog post: https://goo.gl/LvESqB White paper: https://goo.gl/d6UpUu
  • 14. © 2018 KNIME AG. All Rights Reserved. 14 Model Factory Init Load Transform Learn Score Evaluate Deploy
  • 15. © 2018 KNIME AG. All Rights Reserved. 15 The heart of the factory: Call Local Workflow1 • Executes another workflow in the same local repository https://pixabay.com/en/heart-veins-arteries-anatomy-152594/ 1 Call Remote Workflow when run on the KNIME Server
  • 16. © 2018 KNIME AG. All Rights Reserved. 16 Model Factory Init Load Transform Learn Score Evaluate Deploy
  • 17. © 2018 KNIME AG. All Rights Reserved. 17 Model Factory Init Load Transform Learn Score Evaluate Deploy
  • 18. 18© 2018 KNIME AG. All Rights Reserved. Details
  • 19. © 2018 KNIME AG. All Rights Reserved. 19 Extracting the data • Data source: ChEMBL 23 • Activity types: ('GI50', 'IC50', 'Ki', 'MIC', 'EC50', 'AC50', 'ED50', 'GI', 'Kd', 'CC50', 'LC50', 'MIC90', 'MIC50', 'ID50’) -> 6.5 million points • Define active: Standard_value < 100nM -> 1.3 million actives • Define inactive: Standard_value > 1uM • Define an interesting assay At least 50 actives -> 1556 assays • Final dataset size: 2.5 million data points, 1.5 million compounds Init Load Transform Learn Score Evaluate Deploy
  • 20. © 2018 KNIME AG. All Rights Reserved. 20 Init Load Transform Learn Score Evaluate DeployFinding more inactives • The ChEMBL datasets almost all have an unrealistically high ratio of actives to inactives • “Fix” that by adding enough assumed inactives to each dataset to get a 1:10 active:inactive ratio • Pick those assumed inactives to be roughly similar to the actives: Tanimoto similarity of between 0.35 and 0.6 using RDKit Morgan 2 fingerprints
  • 21. © 2018 KNIME AG. All Rights Reserved. 21 Extracting the data Init Load Transform Learn Score Evaluate Deploy
  • 22. © 2018 KNIME AG. All Rights Reserved. 22 Transform • Convert SMILES from database into chemical structures • Cleanup the chemical structures Init Load Transform Learn Score Evaluate Deploy http://www.rdkit.org
  • 23. © 2018 KNIME AG. All Rights Reserved. 23 Transform • Convert SMILES from database into chemical structures • Cleanup the chemical structures • Generate five chemical fingerprints for each structure Init Load Transform Learn Score Evaluate Deploy http://www.rdkit.org
  • 24. © 2018 KNIME AG. All Rights Reserved. 24 Transform • Convert SMILES from database into chemical structures • Cleanup the chemical structures • Generate five chemical fingerprints for each structure – Morgan 3 counts (ECFC6), 4K “bits” – Morgan 3 (ECFP6), 4K bits – Morgan 2 (ECFP4), 2K bits – RDKit FP, length 1-5, 2K bits – Atom pairs, distances 1-20, 4K bits Init Load Transform Learn Score Evaluate Deploy http://www.rdkit.org
  • 25. © 2018 KNIME AG. All Rights Reserved. 25 Learn and Score Init Load Transform Learn Score Evaluate Deploy 10 different stratified random training/holdout splits generated for each assay
  • 26. © 2018 KNIME AG. All Rights Reserved. 26 Learn Init Load Transform Learn Score Evaluate Deploy Learning: • Fingerprint Bayes (NB) • Logistic Regression (LR) • Random Forest (RF) 200 trees, max depth=10, min_leaf_size=3, min_node_size=6 • Gradient Boosting (H2O) 100 trees, max_depth = 5, learning_rate = 0.05 Model Selection: • Pick best model based on Enrichment factor at 5% (EF5)
  • 27. © 2018 KNIME AG. All Rights Reserved. 27 Learn Init Load Transform Learn Score Evaluate Deploy Where did these parameters come from? Learning: • Fingerprint Bayes (NB) • Logistic Regression (LR) • Random Forest (RF) 200 trees, max depth=10, min_leaf_size=3, min_node_size=6 • Gradient Boosting (H2O) 100 trees, max_depth = 5, learning_rate = 0.05
  • 28. © 2018 KNIME AG. All Rights Reserved. 28 Parameter Optimization Init Load Transform Learn Score Evaluate Deploy • Full parameter optimization done for each method+fingerprint on 70 assays • Results used to pick “standard” parameter sets: – Random Forest: 200 trees, max depth=10, min_leaf_size=3, min_node_size=6 – Gradient Boosting: 100 trees, max_depth = 5, learning_rate = 0.05
  • 29. © 2018 KNIME AG. All Rights Reserved. 29 Parameter Optimization Init Load Transform Learn Score Evaluate Deploy
  • 30. © 2018 KNIME AG. All Rights Reserved. 30 Parameter Optimization Init Load Transform Learn Score Evaluate Deploy The optimization and model selection workflow is presented in detail in Daria’s KNIME blog post: https://www.knime.com/blog/stuck-in-the-nine-circles-of-hell-try-parameter- optimization-a-cup-of-tea The workflow is available in the EXAMPLES folder inside KNIME: 04_Analytics/11_Optimization/08_Model_Optimization_and_Selection
  • 31. 31© 2018 KNIME AG. All Rights Reserved. Making it all run Init Load Transform Learn Score Evaluate Deploy
  • 32. © 2018 KNIME AG. All Rights Reserved. 32 Execution • In total >310K models were built1 1 ~1550 assays * 4 methods * 5 FPs * 10 repeats
  • 33. © 2018 KNIME AG. All Rights Reserved. 33 Execution KNIME Analytics Platform KNIME Server ... Distributed Executor Distributed Executor Distributed Executor Build/test workflows Run model factory Run individual assays 65-70 load-balanced distributed executors
  • 34. 34© 2018 KNIME AG. All Rights Reserved. Are the models any good?
  • 35. © 2018 KNIME AG. All Rights Reserved. 35 Performance on validation sets • AUC: mean=0.958 s=0.070 • Cohen’s kappa: mean=0.690 s=0.382
  • 36. © 2018 KNIME AG. All Rights Reserved. 36 Performance on validation sets • AUC: mean=0.958 s=0.070 • Cohen’s kappa: mean=0.690 s=0.382 Yeah!
  • 37. © 2018 KNIME AG. All Rights Reserved. 37 Performance on validation sets • AUC: mean=0.958 s=0.070 • Cohen’s kappa: mean=0.690 s=0.382 Yeah! Uh oh…
  • 38. 38© 2018 KNIME AG. All Rights Reserved. https://www.publicdomainpictures.net/view-image.php?image=155188
  • 39. © 2018 KNIME AG. All Rights Reserved. 39 An experiment to check model generalizability • Pick assays where standard_type is Ki • Group them by target ID • Limit to targets where Ki was measured in at least 5 assays -> 11 targets, 73 assays • Use the model built on one assay from a target ID to predict activity across the other assays.
  • 40. © 2018 KNIME AG. All Rights Reserved. 40 An experiment to check model generalizability • The targets: TargetID Name Num Assays CHEMBL205 Carbonic anhydrase II 7 CHEMBL224 Serotonin 2a (5-HT2a) receptor 8 CHEMBL234 Dopamine D3 receptor 10 CHEMBL243 Human immunodeficiency virus type 1 protease 6 CHEMBL244 Coagulation factor X 5 CHEMBL253 Cannabinoid CB2 receptor 7 CHEMBL281 Carbonic anhydrase IV 5 CHEMBL3371 Serotonin 6 (5-HT6) receptor 8 CHEMBL344 Melanin-concentrating hormone receptor 1 5 CHEMBL4550 5-lipoxygenase activating protein 5 CHEMBL4908 Trace amine-associated receptor 1 7
  • 41. © 2018 KNIME AG. All Rights Reserved. 41 Carbonic Anhydrase IV Carbonic Anhydrase II HIV Protease Factor X 5-HT6 TAAR1
  • 42. © 2018 KNIME AG. All Rights Reserved. 42 Carbonic Anhydrase IV Carbonic Anhydrase II HIV Protease Factor X 5-HT6 TAAR1
  • 43. © 2018 KNIME AG. All Rights Reserved. 43 An Example Target: CHEMBL3371 (5-HT6) Train on Assay ID: 448716 Test with Assay ID: 1366806 AUROC: 0.38 EF5: 0
  • 44. © 2018 KNIME AG. All Rights Reserved. 44 An Example Assay_ID 448716 Assay_ID 1366806
  • 45. © 2018 KNIME AG. All Rights Reserved. 45 An Example Target: CHEMBL3371 (5-HT6) Train on Assay ID: 448716 Test with Assay ID: 659849 AUROC: 0.99 EF5: 8.8
  • 46. © 2018 KNIME AG. All Rights Reserved. 46 An Example Assay_ID 448716 Assay_ID 659849
  • 47. © 2018 KNIME AG. All Rights Reserved. 47 An Example Target: CHEMBL3371 (5-HT6) Train on Assay ID: 448716 Test with Assay ID: 1528679 AUROC: 0.83 EF5: 0.4
  • 48. © 2018 KNIME AG. All Rights Reserved. 48 An Example Assay_ID 448716 Assay_ID 1528679
  • 49. © 2018 KNIME AG. All Rights Reserved. 49 Intermediate conclusion • Many/most of the models have likely overfit the training data • Alternative interpretation: we’ve actually built models to predict whether or not a compound is taken from a particular paper • Unfortunately these are functionally the same if you want to predict activity
  • 50. 50© 2018 KNIME AG. All Rights Reserved. https://www.publicdomainpictures.net/view-image.php?image=155188
  • 51. © 2018 KNIME AG. All Rights Reserved. 51 Look for frequent algorithm + fingerprint combinations • For each of the ~1550 assays * 4 learning algorithms * 10 repeats, look at which fingerprint performed best (as measured by EF5)
  • 52. © 2018 KNIME AG. All Rights Reserved. 52 Look for frequent algorithm + fingerprint combinations For each of the ~1550 assays * 4 learning algorithms * 10 repeats, look at which fingerprint performed best (as measured by EF5)
  • 53. © 2018 KNIME AG. All Rights Reserved. 53 Which method/FP pair is best for each assay? • For each of the ~1550 assays * 10 repeats, look at which algorithm + fingerprint performed best (as measured by EF51, AUC2, and algorithm complexity3) 1 Rounded to 1 decimal point 2 Rounded to 2 decimal points 3 Random Forest > Gradient Boosting > Fingerprint Bayes > Logistic Regression
  • 54. © 2018 KNIME AG. All Rights Reserved. 54 Which method/FP pair is best for each assay? Select best model using EF5, AUC, algorithm complexity
  • 55. © 2018 KNIME AG. All Rights Reserved. 55 Wrapping up • We have automated the construction and evaluation of >1500 models for bioassays using data pulled from ChEMBL • We’ve got some strong evidence that the models themselves are significantly overfit • We were able to start to draw some general conclusions about fingerprints and methods
  • 56. © 2018 KNIME AG. All Rights Reserved. 56 There’s still a lot left to do • Verify the repeatability of the process by updating when the next version of ChEMBL is released • Some more thought into combining assays to get around the “one series per paper” problem • Look into doing the full optimization run • Come up with a good way of presenting the predictions
  • 57. © 2018 KNIME AG. All Rights Reserved. 57 More details… • Model process factory blog post: https://goo.gl/LvESqB • Model process factory white paper: https://goo.gl/d6UpUu • Model process factory workflow: knime://EXAMPLES/50_Applications/26_Model_Process_ Management • Daria’s blog post on the model optimization workflow: https://www.knime.com/blog/stuck-in-the-nine-circles- of-hell-try-parameter-optimization-a-cup-of-tea • Accompanying workflow: knime://EXAMPLES/ 04_Analytics/11_Optimization/08_Model_Optimization_ and_Selection • When we’re done cleaning up, there will be a blog post/sample workflow for the monster model factory too.
  • 58. © 2018 KNIME AG. All Rights Reserved. 58 7th RDKit UGM: 19 - 21 September • Hosted by Andreas Bender, Cambridge University • Free registration: https://goo.gl/VVvHUH (or get it on http://www.rdkit.org) http://www.rdkit.org
  • 59. © 2018 KNIME AG. All Rights Reserved. 59 KNIME Fall Summit 2018 November 6 – 9 at AT&T Executive Education and Conference Center, Austin, Texas • Tuesday & Wednesday: One-day courses • Thursday & Friday: Summit sessions Use the code ICCS-2018 for 10% off tickets. Register at: knime.com/fall-summit2018
  • 60. 60© 2018 KNIME AG. All Rights Reserved. The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States. KNIME® is also registered in Germany.