Machine Learning in the Life Sciences...
with KNIME!
Gregory Landrum
NIBR Informatics
Novartis Institutes for BioMedical Research, Basel
Cartoon machine learning
Training a model: training data → training → model
Using a model: new items → model → predictions
The data: introducing vocabulary
Each row is an item; its columns are descriptors plus an end point (the value we want to predict).
A typical life-sciences problem
Training a model: training data (literature molecules active for an interesting protein target) → training → model
Using a model: new items (new molecules we are thinking about making) → model → predictions (a prioritized list)
A problem...
Here’s what our input looks like:
All data taken from ChEMBL (https://www.ebi.ac.uk/chembl/)
Good luck training a model with that!
One solution: molecular fingerprints
§  Idea: apply a kernel to a molecule to generate a bit vector or, less frequently, a count vector
§  Typical kernels extract features of the molecule, hash them, and use the hash to determine which bits should be set
§  Typical fingerprint sizes: 1K-4K bits
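To make the hashing idea concrete, here is a toy sketch in plain Python. This is not the RDKit's actual algorithm: character trigrams of a SMILES string stand in for real substructure features; each feature is hashed, and the hash picks which bit to set.

```python
import hashlib

FP_SIZE = 2048  # typical fingerprint sizes are 1K-4K bits


def _hash_feature(feature: str) -> int:
    # md5 gives a deterministic hash across runs (unlike Python's hash())
    return int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "big")


def hashed_fingerprint(smiles: str, n_bits: int = FP_SIZE) -> list:
    """Set one bit per hashed 'feature' of the molecule (toy version)."""
    bits = [0] * n_bits
    # character trigrams of the SMILES stand in for real substructure features
    for i in range(len(smiles) - 2):
        bits[_hash_feature(smiles[i:i + 3]) % n_bits] = 1
    return bits


fp = hashed_fingerprint("c1ccccc1O")  # phenol, as a toy input
print(len(fp), "bits,", sum(fp), "set")
```

Note that different features can hash to the same bit, which is why fingerprint sizes matter: too few bits and collisions wash out the signal.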
The toolbox: KNIME + the RDKit
§  Open-source RDKit-based nodes for KNIME providing cheminformatics functionality
§  Trusted nodes distributed from the KNIME community site
§  Work in progress: more nodes being added (a new wizard makes it easy)
What’s there?
Let’s build a model!
Step 1, getting the data ready
Detail: we’re
using atom-pair
fingerprints
100 actives
~83K assumed inactives
Detail: we’re using
Histamine H3 actives
Let’s build a model!
Step 2: training
For this example I use 70% of the data (randomly selected) to train the model.
Detail: the model is a depth-limited random forest with 500 trees.
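The talk builds this with KNIME nodes; as a rough stand-in, the same setup can be sketched with scikit-learn. The fingerprints here are synthetic (invented bit densities, smaller counts to keep it fast); only the 70/30 random split, the 500 trees, and the depth limit come from the slides.

```python
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

random.seed(0)


def random_fp(p, n_bits=128):
    # synthetic stand-in for a molecular fingerprint: each bit set with prob p
    return [1 if random.random() < p else 0 for _ in range(n_bits)]


# actives and inactives get different bit densities so there is a signal
X = [random_fp(0.3) for _ in range(100)] + [random_fp(0.1) for _ in range(2000)]
y = [1] * 100 + [0] * 2000

# 70/30 random split, as on the slide
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# depth-limited random forest with 500 trees, matching the slide's settings
clf = RandomForestClassifier(n_estimators=500, max_depth=10, random_state=0)
clf.fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")
```

On this synthetic data the accuracy looks excellent, which sets up exactly the trap the next slide walks into.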
Let’s build a model!
Step 3: testing
Test with the 30% of the data that was not used to build the model.
The model is 99.9% accurate. Unfortunately, it’s saying “inactive” almost all the time. This makes sense given how unbalanced the data is.
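The arithmetic behind that 99.9% is worth spelling out: with 100 actives and ~83,000 assumed inactives, a model that always answers “inactive” is still almost always right.

```python
# The accuracy paradox on highly unbalanced data: always predicting
# "inactive" misclassifies only the 100 actives.
n_active, n_inactive = 100, 83_000
accuracy = n_inactive / (n_active + n_inactive)
print(f"always-'inactive' accuracy: {accuracy:.1%}")  # -> 99.9%
```

So raw accuracy tells us almost nothing here; we need to look at how the model scores the two classes.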
Adjusting the model for highly unbalanced data
Is there a signal there?
Test with the 30% of the data that was not used to build the model.
There’s obviously a strong signal there; we just need to figure out how to use it.
How about changing the decision boundary? Find the model score that corresponds to this point on the ROC curve for the training data.
Adjusting the model for highly unbalanced data
Shifting the decision boundary
Set the decision boundary here (on the training-data ROC curve).
Now we’ve got a >99% accurate model that does a good job of retrieving actives without mixing in too many inactives.
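The threshold-picking step can be sketched in plain Python. The scores here are synthetic (made-up Gaussians for the two classes), and the “closest to the top-left corner” criterion is one common way to choose a point on the ROC curve; the slides do not specify which criterion was used.

```python
import math
import random

random.seed(1)

# synthetic model scores: actives tend to score higher, but the classes overlap
actives = [random.gauss(0.7, 0.15) for _ in range(100)]
inactives = [random.gauss(0.3, 0.15) for _ in range(2000)]
scores = actives + inactives

# sweep candidate thresholds; keep the one closest to the ROC top-left
# corner (tpr = 1, fpr = 0)
best_cut, best_d = None, float("inf")
for cut in sorted(set(scores)):
    tpr = sum(s >= cut for s in actives) / len(actives)
    fpr = sum(s >= cut for s in inactives) / len(inactives)
    d = math.hypot(fpr, 1 - tpr)
    if d < best_d:
        best_d, best_cut = d, cut

tp = sum(s >= best_cut for s in actives)    # actives retrieved
fp = sum(s >= best_cut for s in inactives)  # inactives mixed in
print(f"threshold {best_cut:.2f}: {tp}/100 actives retrieved, "
      f"{fp}/2000 false positives")
```

The model and its scores are unchanged; only the cutoff that turns a score into an active/inactive call has moved, which is why this rescues an otherwise useless-looking classifier.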
Wrapping up
§  We were able to build very accurate random forests for predicting biological activity by adjusting the decision boundary for models built using highly unbalanced data.
§  The same approach works with the KNIME “Fingerprint Bayesian” nodes.
§  Acknowledgements:
•  Manuel Schwarze (NIBR)
•  Sereina Riniker (NIBR)
•  Nikolas Fechner (NIBR)
•  Bernd Wiswedel (KNIME)
•  Dean Abbott (Abbott Analytics)
Advertising
3rd RDKit User Group Meeting
22-24 October 2014
Merck KGaA, Darmstadt, Germany
Talks, “talktorials”, lightning talks, social activities, and a hackathon on the 24th.
Announcement and (free) registration links at www.rdkit.org
We’re looking for speakers. Please contact greg.landrum@gmail.com
