Moving from Artisanal to Industrial Machine Learning

© 2019 KNIME AG. All Rights Reserved.
Moving from Artisanal to Industrial
Machine Learning
Greg Landrum
(greg.landrum@knime.com)

This talk
• Motivation
• Creating a reproducible/industrial artisan
• An artisanal side trip into working with imbalanced
data

Context
Artisanal Industrial
Images: https://flic.kr/p/RJ5xEs (License: CC BY 2.0); https://flic.kr/p/a3LLdm (License: CC BY 2.0)

Context
Artisanal
• Creative/Exploratory
• Flexible
Industrial
• Automated
• Reproducible
• Repeatable
• Quality control

Motivation: utility
• Thinking about the models that are useful in the
design-make-test cycle of a med-chem project
• Perhaps something project-specific for the main
target + important anti-targets.
• Likely a host of additional global models that could
be used (solubility, pKa, hERG, CYPs, synthetic
accessibility, etc.)

Aspirations
• Can we figure out how to help the artisan be more
reproducible/repeatable?
• Can we provide an “industrial” framework the
artisan can work within?
• Can this somehow be practical?

A process for data mining

Cross-industry standard process for data mining
• An EU-funded project from the late ’90s run by
Integral Solutions (bought by SPSS, itself later bought
by IBM), Teradata, Daimler-Benz, NCR, and OHRA.

I can guess what you’re thinking…

Shockingly, this actually produced
something useful

The CRISP-DM Process
CRISP-DM (CRoss Industry
Standard Process for Data
Mining) is a standard
process for data mining
solutions.
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png

Establishing context
• Business understanding
– What problem are we trying to solve?
– What would a solution look like?
• Data understanding
– What data do we have available?
– Is it any good?
– What might be useful for this problem?
Domain expertise required here

The problem
• Build predictive models for bioactivity based
on the data in screening assays

The datasets we’ll be working with
• qHTS data from eight PubChem assays
captured in ChEMBL
• The assays have very different numbers of
actives in them, so to get started we’ll use
two at different ends of the spectrum

• Assay CHEMBL1614166 (PubChem BioAssay.
qHTS Assay for Inhibitors of MBNL1-poly(CUG)
RNA binding. (Class of assay: confirmatory))
– https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614166/
– https://pubchem.ncbi.nlm.nih.gov/bioassay/2675
• 34018 inactives, 98 actives (using the
annotations from PubChem)

Nature of the actives (CHEMBL1614166)

• Assay CHEMBL1614421 (PUBCHEM_BIOASSAY: qHTS
for Inhibitors of Tau Fibril Formation, Thioflavin T
Binding. (Class of assay: confirmatory))
– https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614421/
– https://pubchem.ncbi.nlm.nih.gov/bioassay/1460
• 43345 inactives, 5602 actives (using the annotations
from PubChem)

Model building
• Data Preparation
– Making it machine-useable
– Cleanup
– Feature engineering
• Modeling
– The cool ML/AI stuff

Data Preparation
• Structures are taken from ChEMBL
– Already some standardization done
– Processed with RDKit
• Fingerprints: RDKit Morgan-2, 2048 bits

Modeling
• Stratified 80-20 training/holdout split
• KNIME random forest classifier
– 500 trees
– Max depth 15
– Min node size 2
This is a first pass through the cycle; we will try
other fingerprints, learning algorithms, and
hyperparameters in future iterations.
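
A stratified split matters here because the actives are so rare: a purely random 80-20 split could leave the holdout set with almost no actives. The sketch below is a minimal standard-library illustration of the idea, not the KNIME node used in the talk; the function name `stratified_split`, the seed, and the toy labels (which only reuse the class counts quoted for CHEMBL1614166) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, holdout_fraction=0.2, seed=42):
    """Return (train_idx, holdout_idx), splitting each class separately
    so both parts keep roughly the same active/inactive ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = round(len(indices) * holdout_fraction)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


# Toy labels with the class counts quoted for CHEMBL1614166
# (1 = active, 0 = inactive); no real structures are involved.
labels = [1] * 98 + [0] * 34018
train_idx, holdout_idx = stratified_split(labels)

n_active_holdout = sum(labels[i] for i in holdout_idx)
n_active_train = sum(labels[i] for i in train_idx)
print(len(train_idx), len(holdout_idx), n_active_train, n_active_holdout)
```

With these counts, roughly one fifth of the actives and one fifth of the inactives end up in the holdout set, whichever seed is used.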

Evaluation
• Does the model work?
• Does it actually solve the problem?
• Was the problem well posed?
• Is it implying data problems?

Evaluation
• AUROC, overall accuracy and Cohen’s kappa
on the holdout data
Many, many, many options here. I’m using global
metrics because in the end I want to use the
“active/inactive” predictions made by the model

Using
• Deployment
– How do you actually use the model?
– How do you keep it up to date?
– How do you get people to accept the
results?

Deployment: technical
• Easy since I’m using KNIME
• Deploy as a web service
– Easy to validate/test
• Automated rebuild/re-evaluate when new data
are available

Deployment: practical
• Providing “active/inactive” classifications and
predicted probabilities likely not enough
• Similar compounds from training set?
• Applicability domain?
• Conformal prediction?
• “Explanation” of the prediction (i.e. similarity
maps)?

Results

Evaluation CHEMBL1614166: holdout data

Evaluation CHEMBL1614166: test data
AUROC=0.72

Results CHEMBL1614421: holdout data

Evaluation CHEMBL1614421: holdout data
AUROC=0.75

Taking stock
• Both models have:
– Good overall accuracies (because of imbalance)
– Decent AUROC values
– Terrible Cohen kappas
Now what?
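
To see why a good overall accuracy says little on data this imbalanced, consider a hypothetical baseline (not the model from the talk) that predicts "inactive" for every compound. Applied to the full CHEMBL1614166 counts quoted earlier (34018 inactives, 98 actives, so 34116 compounds) rather than to the holdout set, it scores about 99.7% accuracy but a Cohen's kappa of exactly zero. Here p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance given the predicted and true class frequencies:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}

p_o = \frac{34018}{34116} \approx 0.9971

p_e = \underbrace{1 \cdot \frac{34018}{34116}}_{\text{inactive}}
    + \underbrace{0 \cdot \frac{98}{34116}}_{\text{active}}
    \approx 0.9971

\kappa = \frac{p_o - p_e}{1 - p_e} = 0
```

Because p_o equals p_e for this baseline, kappa gives it no credit, which is the behaviour we want from a metric on these datasets.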

Let’s get artisanal…

Quick diversion on bag classifiers
When making predictions, each tree in the
classifier votes on the result.
Majority wins
The predicted class probabilities are often the
means of the predicted probabilities from the
individual trees
We construct the ROC curve by sorting the
predictions in decreasing order of predicted
probability of being active.
Note that the actual predictions are irrelevant for an ROC curve. As long
as true actives tend to have a higher predicted probability of being active
than true inactives the AUC will be good.
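
The point about ROC curves can be checked with a few lines of code. The sketch below computes the AUC directly as the fraction of (active, inactive) pairs in which the active gets the higher predicted probability, with ties counting half. The labels and probabilities are made up for illustration, and `roc_auc` is a hypothetical helper, not a library function.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen active is ranked
    above a randomly chosen inactive (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = active) and predicted probabilities of being active.
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
probs = [0.40, 0.30, 0.35, 0.20, 0.10, 0.05, 0.05, 0.02]

# Rescaling the probabilities keeps the ranking, so the AUC cannot change.
shrunk = [p / 4 for p in probs]

auc_original = roc_auc(y_true, probs)
auc_shrunk = roc_auc(y_true, shrunk)

# With the standard 0.5 threshold, nothing is predicted active at all.
n_predicted_active = sum(p >= 0.5 for p in probs)
print(auc_original, auc_shrunk, n_predicted_active)
```

In this toy case the AUC is respectable even though no compound reaches the 0.5 threshold, which is the situation the imbalanced models in this talk are in.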

Handling imbalanced data
• The standard decision rule for a random forest (or
any bag classifier) is that the majority wins¹, i.e. the
predicted probability of being active must be
>=0.5 in order for the model to predict "active"
• Shift that threshold to a lower value for models built
on highly imbalanced datasets²
¹ This is only strictly true for binary classifiers
² Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and
QSAR in Environmental Research 17 (2006): 337–52.

Picking a new decision threshold
• Generate a random forest for the dataset using the
training set
• Generate out-of-bag predicted probabilities using
the training set
• Try a number of different decision thresholds¹ and
pick the one that gives the best kappa
• Once we have the decision threshold, use it to
generate predictions for the test set.
¹ Here we use: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
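
The threshold scan described above can be sketched in plain Python. This is an illustration under stated assumptions, not the KNIME workflow or the code behind the results in this talk: the labels and out-of-bag probabilities are invented, and `cohen_kappa` and `pick_threshold` are hypothetical helper names. In practice the out-of-bag probabilities would come from the random forest itself (scikit-learn, for example, exposes them as `oob_decision_function_` when the forest is built with `oob_score=True`).

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels (1 = active, 0 = inactive)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    p_o = (tp + tn) / n  # observed agreement
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)  # chance agreement
    if p_e == 1.0:
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


def pick_threshold(y_true, oob_probs, thresholds):
    """Return (threshold, kappa) for the threshold with the best kappa.
    On ties, the first (lowest) threshold in the list wins."""
    best_threshold, best_kappa = None, float("-inf")
    for thresh in thresholds:
        y_pred = [1 if p >= thresh else 0 for p in oob_probs]
        kappa = cohen_kappa(y_true, y_pred)
        if kappa > best_kappa:
            best_threshold, best_kappa = thresh, kappa
    return best_threshold, best_kappa


# The thresholds listed on the slide.
thresholds = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]

# Invented training labels and out-of-bag predicted probabilities of "active".
y_train = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
oob_probs = [0.42, 0.31, 0.12, 0.22, 0.08, 0.04, 0.03, 0.02, 0.02, 0.01, 0.01, 0.0]

best_threshold, best_kappa = pick_threshold(y_train, oob_probs, thresholds)
kappa_at_default = cohen_kappa(y_train, [1 if p >= 0.5 else 0 for p in oob_probs])
print(best_threshold, best_kappa, kappa_at_default)
```

The selected threshold is then applied unchanged to the predicted probabilities for the test set, so the test data never influence the choice.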

Results CHEMBL1614166
• Balanced confusion matrix
Previously 0.181

Results CHEMBL1614421
• Balanced confusion matrix
Previously 0.005

Does it work in general?
ChEMBL data, random-split validation

Does it work in general?
Proprietary data, time-split validation

Coming back to validation
• CHEMBL1614166:
– Overall accuracy: 99.8%
– Kappa: 0.53
– AUROC: 0.72
• CHEMBL1614421:
– Overall accuracy: 89.6%
– Kappa: 0.30
– AUROC: 0.75

Wrapping up
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png

Maybe useful…
• “Practical Machine Learning Canvas”

Data/Scripts
• KNIME workflow for adjusting the decision
threshold: https://kni.me/w/HRDmzyQy0UL0k7H2
• RDKit blog post about adjusting the decision
threshold (includes links to code):
http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
• Practical ML Canvas: https://bit.ly/2JLLsRC

Acknowledgements
• Dean Abbott (Abbott Analytics)
• KNIME:
– Daria Goldmann
– Rosaria Silipo
• NIBR:
– Nik Stiefl
– Nadine Schneider
– Niko Fechner
For more amazing car pictures: do an image search for “rat rod”
