© 2019 KNIME AG. All Rights Reserved.
Building useful models for
imbalanced datasets (without
resampling)
Greg Landrum
(greg.landrum@knime.com)
COMP Together, UCSF
22 Aug 2019
First things first
• RDKit blog post with initial work:
http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
• The notebooks I used for this presentation are all on
GitHub:
– Original notebook: https://bit.ly/2UY2u2K
– Using the balanced random forest: https://bit.ly/2tuafSc
– Plotting: https://bit.ly/2GJSeHH
• I have a KNIME workflow that does the same thing. Let
me know if you're interested
• Download links for the datasets are in the blog post
The problem
• Typical datasets for bioactivity prediction tend to
have way more inactives than actives
• This leads to a couple of pathologies:
– Overall accuracy is really not a good metric for how useful
a model is
– Many learning algorithms produce way too many false
negatives
Example dataset
• Assay CHEMBL1614421 (PUBCHEM_BIOASSAY: qHTS
for Inhibitors of Tau Fibril Formation, Thioflavin T
Binding. (Class of assay: confirmatory))
– https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614166/
– https://pubchem.ncbi.nlm.nih.gov/bioassay/1460
• 43345 inactives, 5602 actives (using the annotations
from PubChem)
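To see why overall accuracy misleads on data like this, consider a trivial model that calls every compound inactive. The sketch below uses the active/inactive counts quoted above; the `cohen_kappa` helper is written out by hand for illustration and is not taken from the original notebooks.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa from the four cells of a binary confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


n_inactive, n_active = 43345, 5602  # counts from the slide

# A "model" that predicts inactive for every compound
tp, fp, fn, tn = 0, 0, n_active, n_inactive
accuracy = (tp + tn) / (tp + fp + fn + tn)
kappa = cohen_kappa(tp, fp, fn, tn)
print(f"accuracy={accuracy:.3f} kappa={kappa:.3f}")
```

The accuracy comes out at roughly 0.89 even though the model never finds a single active, while kappa is zero because the agreement is no better than chance.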
Data Preparation
• Structures are taken from ChEMBL
– Already some standardization done
– Processed with RDKit
• Fingerprints: RDKit Morgan-2, 2048 bits
Modeling
• Stratified 80-20 training/holdout split
• KNIME random forest classifier
– 500 trees
– Max depth 15
– Min node size 2
This is a first pass through the cycle; we will try
other fingerprints, learning algorithms, and
hyperparameters in future iterations.
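A stratified split keeps the active/inactive ratio the same in the training and holdout sets. Here is a minimal pure-Python sketch, assuming binary 0/1 labels; the function name, the seed, and the toy labels are illustrative and do not come from the KNIME workflow.

```python
import random


def stratified_split(labels, holdout_fraction=0.2, seed=42):
    """Return (train_idx, holdout_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = round(len(indices) * holdout_fraction)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


# Toy labels: 900 inactives (0) and 100 actives (1)
labels = [0] * 900 + [1] * 100
train_idx, holdout_idx = stratified_split(labels)
n_holdout_actives = sum(labels[i] for i in holdout_idx)
print(len(train_idx), len(holdout_idx), n_holdout_actives)
```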
Results CHEMBL1614421: holdout data
Evaluation CHEMBL1614421: holdout data
AUROC=0.75
Taking stock
• Model has:
– Good overall accuracies (because of imbalance)
– Decent AUROC values
– Terrible Cohen kappas
Now what?
Quick diversion on bag classifiers
When making predictions, each tree in the
classifier votes on the result.
Majority wins
The predicted class probabilities are often the
means of the predicted probabilities from the
individual trees
We construct the ROC curve by sorting the
predictions in decreasing order of predicted
probability of being active.
Note that the predicted class labels themselves are irrelevant for an ROC curve;
only the ranking by predicted probability matters. As long as true actives tend to
have a higher predicted probability of being active than true inactives, the AUC will be good.
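The point about ranking can be checked with a small sketch. The labels and probabilities below are made up for illustration, and `auc_from_scores` is a hand-written pairwise AUROC rather than a library call.

```python
def auc_from_scores(labels, scores):
    """AUROC as the fraction of (active, inactive) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for a in actives:
        for b in inactives:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Made-up example: 3 actives, 7 inactives, no probability reaches 0.5
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.40, 0.30, 0.10, 0.25, 0.15, 0.05, 0.05, 0.02, 0.01, 0.00]

auc = auc_from_scores(labels, scores)
# A monotone rescaling keeps the ranking, so the AUROC cannot change
auc_rescaled = auc_from_scores(labels, [s / 4 for s in scores])
# Yet the default decision rule predicts no actives at all
n_predicted_active = sum(s >= 0.5 for s in scores)
print(auc, auc_rescaled, n_predicted_active)
```

In this toy case 19 of the 21 active/inactive pairs are ranked correctly, so the AUROC is about 0.90 even though the 0.5 rule calls every compound inactive.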
Handling imbalanced data
• The standard decision rule for a random forest (or
any bag classifier) is that the majority wins¹, i.e. the
predicted probability of being active must be
>=0.5 in order for the model to predict "active"
• Shift that threshold to a lower value for models built
on highly imbalanced datasets²
¹ This is only strictly true for binary classifiers
² Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and
QSAR in Environmental Research 17 (2006): 337–52.
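As a small illustration of the decision rule, the sketch below averages hypothetical per-tree probabilities for one compound and applies the default and a lowered threshold. The numbers and the function names are assumptions made for the example.

```python
def forest_probability(tree_probs):
    """Mean of the per-tree predicted probabilities of being active."""
    return sum(tree_probs) / len(tree_probs)


def predict_active(prob_active, threshold=0.5):
    """Apply the decision rule: call the compound active at or above the threshold."""
    return prob_active >= threshold


# Hypothetical per-tree P(active) for a single compound
tree_probs = [0.1, 0.6, 0.2, 0.4, 0.3]
p_active = forest_probability(tree_probs)

default_call = predict_active(p_active)       # threshold 0.5
shifted_call = predict_active(p_active, 0.2)  # lowered threshold
print(round(p_active, 2), default_call, shifted_call)
```

The same forest output of about 0.32 is called inactive under the default rule and active once the threshold is lowered to 0.2.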
Picking a new decision threshold: approach 1
• Generate a random forest for the dataset using the
training set
• Generate out-of-bag predicted probabilities using
the training set
• Try a number of different decision thresholds¹ and
pick the one that gives the best kappa
• Once we have the decision threshold, use it to
generate predictions for the test set.
¹ Here we use: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
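The threshold scan in approach 1 can be sketched in a few lines of plain Python. The out-of-bag probabilities below are made up, and `cohen_kappa` and `pick_threshold` are illustrative helpers rather than code from the notebooks. With scikit-learn, the out-of-bag probabilities would come from `oob_decision_function_[:, 1]` of a forest fitted with `oob_score=True`.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length sequences of 0/1 labels."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = 0.0
    for cls in (0, 1):
        frac_true = sum(t == cls for t in y_true) / n
        frac_pred = sum(p == cls for p in y_pred) / n
        expected += frac_true * frac_pred
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


THRESHOLDS = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]


def pick_threshold(y_true, prob_active, thresholds=THRESHOLDS):
    """Return (threshold, kappa) for the threshold giving the highest kappa."""
    best = None
    for thresh in thresholds:
        y_pred = [int(p >= thresh) for p in prob_active]
        kappa = cohen_kappa(y_true, y_pred)
        if best is None or kappa > best[1]:
            best = (thresh, kappa)
    return best


# Made-up out-of-bag probabilities for 10 training compounds (3 actives)
y_train = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
oob_probs = [0.40, 0.30, 0.22, 0.25, 0.15, 0.05, 0.05, 0.02, 0.01, 0.00]

best_thresh, best_kappa = pick_threshold(y_train, oob_probs)
print(best_thresh, round(best_kappa, 3))
```

On this toy data the scan settles on 0.2, where all three actives are recovered at the cost of one false positive; at 0.5 nothing is predicted active and kappa is zero.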
Results CHEMBL1614421
• Balanced confusion matrix
Previously 0.005
Nice! But does it work in general?
Validation experiment
The datasets (all extracted from ChEMBL_24)
• "Serotonin": 6 datasets with >900 Ki values for human serotonin receptors
– Active: pKi > 9.0, Inactive: pKi < 8.5
– If that doesn't yield at least 50 actives: Active: pKi > 8.0, Inactive: pKi < 7.5
• "DS1": 80 "Dataset 1" sets¹
– Active: 100 diverse measured actives ("standard_value<10uM"); Inactive: 2000 random compounds from the same property space
• "PubChem": 8 HTS Validation assays with at least 3K "Potency" values
– Active: "active" in dataset. Inactive: "inactive", "not active", or "inconclusive" in dataset
• "DrugMatrix": 44 DrugMatrix assays with at least 40 actives
– Active: "active" in dataset. Inactive: "not active" in dataset
¹ S. Riniker, N. Fechner, G. A. Landrum. "Heterogeneous classifier fusion for ligand-based virtual screening: or, how decision making by committee can be a good thing." Journal of Chemical Information and Modeling 53:2829-36 (2013).
Model building and validation
• Fingerprints: 2048 bit MorganFP radius=2
• 80/20 training/test split
• Random forest parameters:
– cls = RandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)
• Try threshold values of [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35,
0.4, 0.45, 0.5] with out-of-bag predictions and pick the best
based on kappa
• Generate initial kappa value for the test data using threshold
= 0.5
• Generate "balanced" kappa value for the test data with the
optimized threshold
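The steps above can be sketched end to end with scikit-learn. This is an illustration under stated assumptions, not the notebook code: the data are synthetic (`make_classification` stands in for the fingerprints), the random seeds are arbitrary, and the forest uses 100 trees instead of 500 to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for fingerprint data with roughly 5% actives (assumption)
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Parameters from the slide, except for the smaller number of trees
cls = RandomForestClassifier(n_estimators=100, max_depth=15, min_samples_leaf=2,
                             n_jobs=4, oob_score=True, random_state=0)
cls.fit(X_train, y_train)

# Out-of-bag probability of the active class for each training point
oob_probs = cls.oob_decision_function_[:, 1]

# Scan the thresholds on the out-of-bag predictions and keep the best kappa
thresholds = np.array([0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5])
kappas = [cohen_kappa_score(y_train, (oob_probs >= t).astype(int))
          for t in thresholds]
best_thresh = float(thresholds[int(np.argmax(kappas))])

# Compare the default and the optimized threshold on the test data
test_probs = cls.predict_proba(X_test)[:, 1]
kappa_default = cohen_kappa_score(y_test, (test_probs >= 0.5).astype(int))
kappa_balanced = cohen_kappa_score(y_test, (test_probs >= best_thresh).astype(int))
print(f"threshold={best_thresh:.2f} "
      f"kappa@0.5={kappa_default:.3f} kappa@best={kappa_balanced:.3f}")
```

The test data are never used to choose the threshold; they only serve to compare the two kappa values.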
Does it work in general?
ChEMBL data, random-split validation
Does it work in general?
Proprietary data, time-split validation
Picking a new decision threshold: approach 2
• Generate a random forest for the dataset using the
training set
• Generate out-of-bag predicted probabilities using
the training set
• Pick the threshold corresponding to the point on the
ROC curve that’s closest to the upper left corner
• Once we have the decision threshold, use it to
generate predictions for the test set.
Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and QSAR in
Environmental Research 17 (2006): 337–52.
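A minimal sketch of approach 2, assuming every distinct out-of-bag probability is treated as a candidate threshold. The data are made up and the helper name is illustrative.

```python
import math


def closest_to_corner_threshold(y_true, prob_active):
    """Pick the threshold whose ROC point (FPR, TPR) lies nearest to (0, 1)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = None
    # Each distinct predicted probability defines one point on the ROC curve
    for thresh in sorted(set(prob_active), reverse=True):
        tp = sum(1 for y, p in zip(y_true, prob_active) if p >= thresh and y == 1)
        fp = sum(1 for y, p in zip(y_true, prob_active) if p >= thresh and y == 0)
        tpr = tp / n_pos
        fpr = fp / n_neg
        dist = math.hypot(fpr, 1.0 - tpr)  # distance to the upper left corner
        if best is None or dist < best[1]:
            best = (thresh, dist)
    return best


# Made-up out-of-bag probabilities for 10 training compounds (3 actives)
y_train = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
oob_probs = [0.40, 0.30, 0.22, 0.25, 0.15, 0.05, 0.05, 0.02, 0.01, 0.00]

best_thresh, best_dist = closest_to_corner_threshold(y_train, oob_probs)
print(best_thresh, round(best_dist, 3))
```

Unlike approach 1, this needs no predefined grid of thresholds, because the candidates come directly from the predicted probabilities.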
Does it work in general?
ChEMBL data, random-split validation
Does it work in general?
ChEMBL data, random-split validation
Other evaluation metrics: F1 score
ChEMBL data, random-split validation
Does it work in general?
Proprietary data, time-split validation
Compare to balanced random forests
• Resampling strategy that still uses the entire training
set
• Idea: train each tree on a balanced bootstrap
sample of the training data
Chen, C., Liaw, A. & Breiman, L. Using Random Forest to Learn Imbalanced Data.
https://statistics.berkeley.edu/tech-reports/666 (2004).
How do bag classifiers end up with different models?
Each tree is built with a different dataset
Balanced random forests
• Take advantage of the structure of the classifier.
• Learn each tree with a balanced dataset:
– Select a bootstrap sample of the minority class (actives)
– Randomly select, with replacement, the same number of
points from the majority class (inactives)
• Prediction works the same as with a normal random
forest
• Easy to do in scikit-learn using the imbalanced-learn
contrib package: https://imbalanced-learn.readthedocs.io/en/stable/ensemble.html#forest-of-randomized-trees
– cls = BalancedRandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)
Chen, C., Liaw, A. & Breiman, L. Using Random Forest to Learn Imbalanced Data. https://statistics.berkeley.edu/tech-reports/666 (2004).
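The per-tree sampling described above can be sketched without any library support. This is an illustration of the idea only, not how imbalanced-learn implements it; it assumes the actives (label 1) are the minority class, and the function name and toy labels are made up.

```python
import random


def balanced_bootstrap(labels, rng):
    """Indices for one tree: a bootstrap sample of the minority class plus
    the same number of majority-class points drawn with replacement."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    minority_sample = rng.choices(minority, k=len(minority))
    majority_sample = rng.choices(majority, k=len(minority))
    return minority_sample + majority_sample


# Toy data: 50 actives and 950 inactives
labels = [1] * 50 + [0] * 950
rng = random.Random(0)

# One balanced sample per tree; three trees shown here
samples = [balanced_bootstrap(labels, rng) for _ in range(3)]
for sample in samples:
    n_active = sum(labels[i] for i in sample)
    print(len(sample), n_active)
```

Each tree sees a 50/50 training set, but because the majority-class points are redrawn for every tree, the forest as a whole still gets to use most of the inactives.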
Comparing to resampling: balanced random forests
ChEMBL data, random-split validation
Comparing to resampling: balanced random forests
ChEMBL data, random-split validation
What comes next
• Try the same thing with other learning methods like
logistic regression and stochastic gradient boosting
– These are more complicated since they can't do out-of-bag
classification
– We need to add another data split and loop to do
calibration and find the best threshold
• More datasets! I need *your* help with this
– I have a script for you to run that takes sets of compounds
with activity labels and outputs the summary statistics
that I'm using here
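One possible shape for the extra data split, sketched with scikit-learn's logistic regression. This is an assumption about how it could be done, not the author's implementation: it uses a single held-out part of the training set where the slide mentions a split and a loop, and the synthetic data, seeds, and split sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data with roughly 10% actives (assumption)
X, y = make_classification(n_samples=3000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Extra split: fit on one part of the training set and choose the
# threshold on the other, since there are no out-of-bag predictions
X_fit, X_cal, y_fit, y_cal = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=1)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)
cal_probs = model.predict_proba(X_cal)[:, 1]

thresholds = np.array([0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5])
kappas = [cohen_kappa_score(y_cal, (cal_probs >= t).astype(int))
          for t in thresholds]
best_thresh = float(thresholds[int(np.argmax(kappas))])

# Apply the chosen threshold to the untouched test data
test_probs = model.predict_proba(X_test)[:, 1]
kappa_test = cohen_kappa_score(y_test, (test_probs >= best_thresh).astype(int))
print(f"threshold={best_thresh:.2f} test kappa={kappa_test:.3f}")
```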
Acknowledgements
• Dean Abbott (Abbott Analytics)
• Daria Goldmann (KNIME)
• NIBR:
– Nik Stiefl
– Nadine Schneider
– Niko Fechner

 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaPraksha3
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫qfactory1
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptArshadWarsi13
 
insect anatomy and insect body wall and their physiology
insect anatomy and insect body wall and their  physiologyinsect anatomy and insect body wall and their  physiology
insect anatomy and insect body wall and their physiologyDrAnita Sharma
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett SquareIsiahStephanRadaza
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 

Recently uploaded (20)

Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutions
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Temporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of MasticationTemporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of Mastication
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Welcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work DayWelcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work Day
 
Heredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of TraitsHeredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of Traits
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.ppt
 
insect anatomy and insect body wall and their physiology
insect anatomy and insect body wall and their  physiologyinsect anatomy and insect body wall and their  physiology
insect anatomy and insect body wall and their physiology
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett Square
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 

Building useful models for imbalanced datasets (without resampling)

  • 1. Building useful models for imbalanced datasets (without resampling). Greg Landrum (greg.landrum@knime.com), COMP Together, UCSF, 22 Aug 2019. © 2019 KNIME AG. All Rights Reserved.
  • 2. First things first
      • RDKit blog post with initial work: http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
      • The notebooks I used for this presentation are all on GitHub:
          – Original notebook: https://bit.ly/2UY2u2K
          – Using the balanced random forest: https://bit.ly/2tuafSc
          – Plotting: https://bit.ly/2GJSeHH
      • I have a KNIME workflow that does the same thing. Let me know if you're interested.
      • Download links for the datasets are in the blog post.
  • 3. The problem
      • Typical datasets for bioactivity prediction tend to have way more inactives than actives
      • This leads to a couple of pathologies:
          – Overall accuracy is really not a good metric for how useful a model is
          – Many learning algorithms produce way too many false negatives
  • 4. Example dataset
      • Assay CHEMBL1614421 (PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation, Thioflavin T Binding. (Class of assay: confirmatory))
          – https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614166/
          – https://pubchem.ncbi.nlm.nih.gov/bioassay/1460
      • 43345 inactives, 5602 actives (using the annotations from PubChem)
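To see why overall accuracy says little on a dataset like this, it helps to work out what a model that never predicts "active" would score. The short calculation below uses only the class counts quoted on the slide; the variable names are illustrative.

```python
# Class counts quoted for the example assay (PubChem annotations)
n_inactive = 43345
n_active = 5602
n_total = n_inactive + n_active

# A trivial "model" that labels every compound inactive is right
# for every inactive and wrong for every active.
trivial_accuracy = n_inactive / n_total
active_fraction = n_active / n_total

print(f"fraction of actives:       {active_fraction:.3f}")
print(f"accuracy of trivial model: {trivial_accuracy:.3f}")
```

Such a model reaches close to 89% accuracy while finding none of the actives, which is why the later slides lean on kappa rather than accuracy.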
  • 5. Data Preparation
      • Structures are taken from ChEMBL
          – Already some standardization done
          – Processed with RDKit
      • Fingerprints: RDKit Morgan-2, 2048 bits
  • 6. Modeling
      • Stratified 80-20 training/holdout split
      • KNIME random forest classifier
          – 500 trees
          – Max depth 15
          – Min node size 2
      This is a first pass through the cycle; we will try other fingerprints, learning algorithms, and hyperparameters in future iterations.
  • 7. Results, CHEMBL1614421: holdout data
  • 8. Evaluation, CHEMBL1614421: holdout data. AUROC = 0.75
  • 9. Taking stock
      • Model has:
          – Good overall accuracies (because of imbalance)
          – Decent AUROC values
          – Terrible Cohen kappas
      Now what?
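The gap between accuracy and Cohen's kappa can be illustrated with a small calculation. Kappa compares the observed agreement with the agreement expected by chance given the row and column totals of the confusion matrix. The function name and the counts below are hypothetical and chosen only to mimic an imbalanced holdout set; they are not the results from the talk.

```python
def accuracy_and_kappa(tn, fp, fn, tp):
    """Return (accuracy, Cohen's kappa) for a binary confusion matrix."""
    n = tn + fp + fn + tp
    observed = (tn + tp) / n
    # Agreement expected by chance: product of the true and predicted
    # class frequencies, summed over both classes.
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts (NOT from the presentation): almost everything is
# predicted inactive, so most actives are missed.
acc, kappa = accuracy_and_kappa(tn=8600, fp=69, fn=1100, tp=20)
print(f"accuracy = {acc:.3f}, kappa = {kappa:.3f}")
```

With counts like these the accuracy stays high because the inactives dominate, while kappa sits close to zero because the model does barely better than chance on the actives.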
  • 10. Quick diversion on bag classifiers
      When making predictions, each tree in the classifier votes on the result. Majority wins.
      The predicted class probabilities are often the means of the predicted probabilities from the individual trees.
      We construct the ROC curve by sorting the predictions in decreasing order of predicted probability of being active. Note that the actual class predictions are irrelevant for an ROC curve: as long as true actives tend to have a higher predicted probability of being active than true inactives, the AUC will be good.
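The point that the ROC curve depends only on the ordering of the predicted probabilities can be checked with a few lines of plain Python. This is a minimal sketch: `roc_auc` is a hypothetical helper that computes the AUC as the fraction of correctly ordered (active, inactive) pairs, and the labels and probabilities are made up.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive count as half a correct pair.
    """
    actives = [s for y, s in zip(y_true, scores) if y == 1]
    inactives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Made-up labels and predicted probabilities of being active
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
probs = [0.02, 0.10, 0.35, 0.05, 0.20, 0.30, 0.01, 0.45]

# No probability reaches 0.5, so the majority-vote rule predicts no actives.
n_predicted_active = sum(p >= 0.5 for p in probs)

# The AUC only sees the ordering, so a monotonic rescaling leaves it unchanged.
auc_original = roc_auc(y_true, probs)
auc_rescaled = roc_auc(y_true, [p ** 3 for p in probs])
print(n_predicted_active, auc_original, auc_rescaled)
```

In this toy case the ranking is good even though the default decision rule never predicts "active", which is the situation the following slides address.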
  • 11. Handling imbalanced data
      • The standard decision rule for a random forest (or any bag classifier) is that the majority wins [1], i.e. the predicted probability of being active must be >= 0.5 for the model to predict "active"
      • Shift that threshold to a lower value for models built on highly imbalanced datasets [2]
      [1] This is only strictly true for binary classifiers
      [2] Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and QSAR in Environmental Research 17 (2006): 337–52.
  • 12. Picking a new decision threshold: approach 1
      • Generate a random forest for the dataset using the training set
      • Generate out-of-bag predicted probabilities using the training set
      • Try a number of different decision thresholds [1] and pick the one that gives the best kappa
      • Once we have the decision threshold, use it to generate predictions for the test set.
      [1] Here we use: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
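The threshold scan in approach 1 can be sketched in plain Python once out-of-bag probabilities are available. The helper names (`cohen_kappa`, `pick_threshold_by_kappa`) and the toy labels and probabilities are assumptions for illustration; only the threshold grid comes from the slide.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two lists of 0/1 labels."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = 0.0
    for label in (0, 1):
        true_frac = sum(t == label for t in y_true) / n
        pred_frac = sum(p == label for p in y_pred) / n
        expected += true_frac * pred_frac
    if expected == 1.0:  # labels and predictions are all one class
        return 0.0
    return (observed - expected) / (1 - expected)


def pick_threshold_by_kappa(y_true, probs, thresholds):
    """Return (threshold, kappa) for the threshold with the highest kappa."""
    best_threshold, best_kappa = None, float("-inf")
    for threshold in thresholds:
        y_pred = [1 if p >= threshold else 0 for p in probs]
        kappa = cohen_kappa(y_true, y_pred)
        if kappa > best_kappa:
            best_threshold, best_kappa = threshold, kappa
    return best_threshold, best_kappa


# Threshold grid from the slide
thresholds = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]

# Made-up training labels and out-of-bag probabilities of being active
y_train = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
oob_probs = [0.02, 0.04, 0.08, 0.12, 0.14, 0.18, 0.22, 0.27, 0.33, 0.42]

best_threshold, best_kappa = pick_threshold_by_kappa(y_train, oob_probs, thresholds)
print(best_threshold, round(best_kappa, 3))
```

Because 0.5 is part of the grid, the selected threshold can never give a lower out-of-bag kappa than the default rule; whether the gain carries over to the test set is what the validation slides examine.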
  • 13. Results, CHEMBL1614421
      • Balanced confusion matrix (figure annotation: "Previously 0.005")
      Nice! But does it work in general?
  • 14. Validation experiment
  • 15. The datasets (all extracted from ChEMBL_24)
      • "Serotonin": 6 datasets with >900 Ki values for human serotonin receptors
          – Active: pKi > 9.0, Inactive: pKi < 8.5
          – If that doesn't yield at least 50 actives: Active: pKi > 8.0, Inactive: pKi < 7.5
      • "DS1": 80 "Dataset 1" sets [1]
          – Active: 100 diverse measured actives ("standard_value<10uM"); Inactive: 2000 random compounds from the same property space
      • "PubChem": 8 HTS Validation assays with at least 3K "Potency" values
          – Active: "active" in dataset. Inactive: "inactive", "not active", or "inconclusive" in dataset
      • "DrugMatrix": 44 DrugMatrix assays with at least 40 actives
          – Active: "active" in dataset. Inactive: "not active" in dataset
      [1] S. Riniker, N. Fechner, G. A. Landrum. "Heterogeneous classifier fusion for ligand-based virtual screening: or, how decision making by committee can be a good thing." Journal of chemical information and modeling 53:2829-36 (2013).
  • 16. Model building and validation
      • Fingerprints: 2048 bit MorganFP radius=2
      • 80/20 training/test split
      • Random forest parameters:
          – cls = RandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)
      • Try threshold values of [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5] with out-of-bag predictions and pick the best based on kappa
      • Generate initial kappa value for the test data using threshold = 0.5
      • Generate "balanced" kappa value for the test data with the optimized threshold
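The protocol on this slide can be sketched end to end with scikit-learn. This is a rough illustration under stated assumptions, not the notebook used for the talk: the data are synthetic (`make_classification` stands in for the fingerprint matrix), the split is stratified as on the earlier modeling slide, and the forest uses 100 trees instead of 500 to keep the sketch fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a fingerprint matrix with roughly 10% actives (assumption)
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Slide settings, except n_estimators (100 here instead of 500, for speed)
cls = RandomForestClassifier(n_estimators=100, max_depth=15, min_samples_leaf=2,
                             n_jobs=4, oob_score=True, random_state=42)
cls.fit(X_train, y_train)

# Out-of-bag predicted probability of the active class for each training row
oob_probs = cls.oob_decision_function_[:, 1]

# Scan the threshold grid on the out-of-bag predictions and keep the best kappa
thresholds = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
oob_kappas = [cohen_kappa_score(y_train, (oob_probs >= t).astype(int))
              for t in thresholds]
best_threshold = thresholds[int(np.argmax(oob_kappas))]

# Compare the default and the optimized threshold on the held-out test data
test_probs = cls.predict_proba(X_test)[:, 1]
kappa_default = cohen_kappa_score(y_test, (test_probs >= 0.5).astype(int))
kappa_balanced = cohen_kappa_score(y_test, (test_probs >= best_threshold).astype(int))
print(best_threshold, round(kappa_default, 3), round(kappa_balanced, 3))
```

The threshold is chosen without touching the test set, so the two test-set kappas give a fair comparison between the default and the shifted decision rule.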
  • 17. Does it work in general? ChEMBL data, random-split validation
  • 18. Does it work in general? Proprietary data, time-split validation
  • 19. Picking a new decision threshold: approach 2
      • Generate a random forest for the dataset using the training set
      • Generate out-of-bag predicted probabilities using the training set
      • Pick the threshold corresponding to the point on the ROC curve that’s closest to the upper left corner
      • Once we have the decision threshold, use it to generate predictions for the test set.
      Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and QSAR in Environmental Research 17 (2006): 337–52.
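The "closest to the upper left corner" rule can be sketched with NumPy. Each candidate threshold gives one point (FPR, TPR) on the ROC curve, and the rule picks the threshold whose point has the smallest Euclidean distance to (0, 1). The function name and the toy data are hypothetical; in the approach above the probabilities would be the out-of-bag predictions for the training set.

```python
import numpy as np


def threshold_closest_to_corner(y_true, probs):
    """Return (threshold, distance) for the ROC point nearest to (FPR=0, TPR=1)."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    n_active = (y_true == 1).sum()
    n_inactive = (y_true == 0).sum()
    best_threshold, best_distance = None, np.inf
    # Every distinct predicted probability is a candidate threshold
    for threshold in np.unique(probs):
        predicted_active = probs >= threshold
        tpr = (predicted_active & (y_true == 1)).sum() / n_active
        fpr = (predicted_active & (y_true == 0)).sum() / n_inactive
        distance = np.hypot(fpr, 1.0 - tpr)
        if distance < best_distance:
            best_threshold, best_distance = float(threshold), float(distance)
    return best_threshold, best_distance


# Made-up labels and predicted probabilities of being active
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.60]

best_threshold, best_distance = threshold_closest_to_corner(y_true, probs)
print(best_threshold, best_distance)
```

Unlike approach 1, this rule needs no fixed grid of thresholds and does not compute kappa, so the two approaches can select different thresholds on the same data.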
  • 20. Does it work in general? ChEMBL data, random-split validation
  • 21. Does it work in general? ChEMBL data, random-split validation
  • 22. Other evaluation metrics: F1 score. ChEMBL data, random-split validation
  • 23. Does it work in general? Proprietary data, time-split validation
  • 24. Compare to balanced random forests
      • Resampling strategy that still uses the entire training set
      • Idea: train each tree on a balanced bootstrap sample of the training data
      Chen, C., Liaw, A. & Breiman, L. Using Random Forest to Learn Imbalanced Data. https://statistics.berkeley.edu/tech-reports/666 (2004).
  • 25. How do bag classifiers end up with different models? Each tree is built with a different dataset
  • 26. Balanced random forests
      • Take advantage of the structure of the classifier.
      • Learn each tree with a balanced dataset:
          – Select a bootstrap sample of the minority class (actives)
          – Randomly select, with replacement, the same number of points from the majority class (inactives)
      • Prediction works the same as with a normal random forest
      • Easy to do in scikit-learn using the imbalanced-learn contrib package: https://imbalanced-learn.readthedocs.io/en/stable/ensemble.html#forest-of-randomized-trees
          – cls = BalancedRandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)
      Chen, C., Liaw, A. & Breiman, L. Using Random Forest to Learn Imbalanced Data. https://statistics.berkeley.edu/tech-reports/666 (2004).
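The per-tree sampling step described above can be illustrated with NumPy alone. This is a sketch of the idea only, not how imbalanced-learn implements it: `balanced_bootstrap_indices` is a hypothetical helper, and it assumes the actives are labelled 1 and form the minority class.

```python
import numpy as np


def balanced_bootstrap_indices(y, rng):
    """Row indices for one tree: equal numbers of actives and inactives.

    Assumes label 1 (active) is the minority class and label 0 the majority.
    """
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n = len(minority_idx)
    # Bootstrap sample of the minority class
    sampled_minority = rng.choice(minority_idx, size=n, replace=True)
    # Same number of majority-class points, also drawn with replacement
    sampled_majority = rng.choice(majority_idx, size=n, replace=True)
    return np.concatenate([sampled_minority, sampled_majority])


rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 950)  # made-up labels: 5% actives
idx = balanced_bootstrap_indices(y, rng)
print(len(idx), int(y[idx].sum()))
```

Each tree draws a fresh sample of the inactives, so across a few hundred trees most of the majority class still contributes to the forest.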
  • 27. Comparing to resampling: balanced random forests. ChEMBL data, random-split validation
  • 28. Comparing to resampling: balanced random forests. ChEMBL data, random-split validation
  • 29. What comes next
      • Try the same thing with other learning methods like logistic regression and stochastic gradient boosting
          – These are more complicated since they can't do out-of-bag classification
          – We need to add another data split and loop to do calibration and find the best threshold
      • More datasets! I need *your* help with this
          – I have a script for you to run that takes sets of compounds with activity labels and outputs the summary statistics that I'm using here
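For learners without out-of-bag predictions, one way to read the extra data split is to hold out part of the training data and pick the threshold there. The sketch below is an assumption about how that could look, not the author's planned implementation: it uses synthetic data, a single validation split rather than a loop, scikit-learn's `LogisticRegression`, and it skips the calibration step entirely.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for fingerprints (assumption)
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)

# First split: hold out the test set. Second split: carve a validation set
# out of the remaining data to replace the out-of-bag predictions.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# Pick the threshold with the best kappa on the validation split
thresholds = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
val_probs = model.predict_proba(X_val)[:, 1]
val_kappas = [cohen_kappa_score(y_val, (val_probs >= t).astype(int))
              for t in thresholds]
best_threshold = thresholds[int(np.argmax(val_kappas))]

# Apply the selected threshold to the untouched test set
test_probs = model.predict_proba(X_test)[:, 1]
kappa_test = cohen_kappa_score(y_test, (test_probs >= best_threshold).astype(int))
print(best_threshold, round(kappa_test, 3))
```

The cost of this design is that the model is fit on less data than the random forest, which sees the whole training set and still gets held-out predictions for free.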
  • 30. Acknowledgements
      • Dean Abbott (Abbott Analytics)
      • Daria Goldmann (KNIME)
      • NIBR:
          – Nik Stiefl
          – Nadine Schneider
          – Niko Fechner