Chemaxon EU UGM 2022 | Translating data to predictive models

Translating data to predictive models
Akos Tarcsay

From data to prediction
[Diagram: Data ingestion; Preprocessing; Modelling; Features; Models; Review; Prediction; Model repository]

System overview
- Chemaxon Descriptor generation
- Chemaxon Standardizer
- Persistence and search
- DB layer: PostgreSQL
- Statistical evaluation
- ML library (SMILE)
- Conformal prediction wrapper
- Service layer
- Programmatic access: REST interface
- “Comp Chem”: Trainer GUI
- “Med Chem”: Prediction GUI

How to reduce noise?
Preprocessing

Effect of standardization
- Simple descriptors (Mw, fsp3, HBDA, etc.)
- Phys-chem (logD, pKa)
- Molecular graph, Fingerprints
Examples: salts, solvates (Imipramine pamoate); tautomerism (Furan-2-ol)
“Overall and despite our efforts to use open software wherever possible, we find that ChemAxon Tautomers node outperforms the other approaches we tested.”
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00606-7
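
To make the salt-stripping step concrete, here is a minimal sketch in plain Python. It is not the Chemaxon Standardizer: it is a crude stand-in that keeps the dot-separated SMILES fragment with the most heavy atoms, and the function names are illustrative. It does not neutralize charges or handle tautomers, and a largest-fragment heuristic can pick the counter-ion when the counter-ion is larger than the compound of interest.

```python
import re

# One heavy-atom token in a SMILES string: a bracket atom, a two-letter
# halogen, or a one-letter organic-subset atom (aliphatic or aromatic).
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(fragment: str) -> int:
    """Count atom tokens in one SMILES fragment, ignoring explicit [H]."""
    return sum(1 for tok in _ATOM.findall(fragment) if tok != "[H]")


def keep_largest_fragment(smiles: str) -> str:
    """Crude salt stripping: keep the fragment with the most heavy atoms."""
    return max(smiles.split("."), key=heavy_atom_count)


# Sodium acetate: the sodium counter-ion is dropped, the acetate is kept.
stripped = keep_largest_fragment("CC(=O)[O-].[Na+]")
print(stripped)
```

A production workflow would use a cheminformatics toolkit for this step, since descriptors, phys-chem values and fingerprints all depend on which form of the structure is kept.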

Small molecule retention time (SMRT) dataset: Tautomerization
https://www.nature.com/articles/s41467-019-13680-7

Prediction power?
ChEMBL Benchmark

Activity dataset: the ‘ChEMBL bioactivity benchmark set’
Data source: Journal of Cheminformatics, 9, 45 (2017) by Eelke B. Lenselink, Niels
ten Dijke, Brandon Bongers, George Papadatos, Herman W. T. van Vlijmen, Wojtek
Kowalczyk, Adriaan P. IJzerman, Gerard J. P. van Westen
- ChEMBL database (version 20)
- Activities were selected that met the following criteria:
- at least 30 compounds tested per protein and from at least 2 separate publications
- assay confidence score of 9
- ‘single protein’ target type
- assigned pCHEMBL value
- no flags on potential duplicate or data validity comment
- originating from scientific literature
- data points with activity comments ‘not active’, ‘inactive’, ‘inconclusive’, and ‘undetermined’ were removed
- where multiple measurements existed for a data point, the median value was chosen
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0

Application Study on ChEMBL
- Data points in range: 500-4703 (median: 776)
- 163 ChEMBL targets, pAct
- Sorted by Document Year, last 30 points reserved as External set: Ext
- Random 10%/90% test/training set split of the remaining points: Test (10%), Train (90%)
- ~160k total training size
- ~18k total test size
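
The split described above can be sketched per target as follows. This is a minimal illustration, not the code used in the study: the record layout (a dict with a `year` key), the function name and the fixed random seed are assumptions.

```python
import random


def split_target(records, n_ext=30, test_frac=0.10, seed=0):
    """Return (train, test, ext) for one target.

    records: list of dicts with a 'year' key (hypothetical layout).
    The most recent n_ext points form the external set; the remaining
    points are shuffled and split at random into test and training sets.
    """
    ordered = sorted(records, key=lambda r: r["year"])
    ext = ordered[-n_ext:]
    rest = ordered[:-n_ext]
    rng = random.Random(seed)
    rng.shuffle(rest)
    n_test = round(len(rest) * test_frac)
    return rest[n_test:], rest[:n_test], ext


# Synthetic example: 500 data points spread over 25 document years.
records = [{"id": i, "year": 1990 + i % 25} for i in range(500)]
train, test, ext = split_target(records)
print(len(train), len(test), len(ext))
```

Because Ext holds the latest publications, it probes how a model behaves on chemistry made after the training data, which is a harder setting than the random Test split.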

Benchmark on 163 ChEMBL targets

Analysis of the best models per target
          Pearson           R2
          Ext     Test      Ext     Test
Avg       0.672   0.824     0.306   0.679
Median    0.722   0.833     0.385   0.697
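
On the external set the reported R2 is lower than the square of the reported Pearson value. One possible reading, stated here as an assumption, is that R2 is the coefficient of determination, which penalizes systematic offsets that the Pearson correlation ignores. A minimal sketch with made-up numbers:

```python
from math import sqrt


def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    vt = sum((t - mt) ** 2 for t in y_true)
    vp = sum((p - mp) ** 2 for p in y_pred)
    return cov / sqrt(vt * vp)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mt = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mt) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.5, 7.5, 8.5]  # perfectly correlated, shifted by +0.5
r = pearson_r(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(r, r2)
```

Here the correlation is perfect while R2 drops to 0.8, so a gap between the two metrics can point to a bias in the predictions rather than a loss of ranking power.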

Conformal prediction
- Training set is split into a proper training set and a calibration set
- Proper training set: Model and Error model
- Calibration set: Error prediction, P(80%) calibration factor (α)
https://www.jmlr.org/papers/volume9/shafer08a/shafer08a.pdf
https://pubs.acs.org/doi/10.1021/ci5001168

Conformal prediction
Test: 14233 / 17661 (80.6%) within the error bound
Ext: 3344 / 4890 (68.4%) within the error bound
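
The calibration step can be sketched as a normalized split-conformal procedure. This is an illustration under assumptions, not the Trainer Engine implementation: the error model is assumed to return a positive predicted error per compound, and the function names are made up for the example.

```python
import math


def calibration_factor(y_cal, pred_cal, err_cal, confidence=0.80):
    """Scaling factor alpha chosen on the calibration set.

    Each calibration point gets the score |y - pred| / err. Alpha is the
    score at rank ceil((n + 1) * confidence), capped at the largest score.
    """
    scores = sorted(abs(y - p) / e for y, p, e in zip(y_cal, pred_cal, err_cal))
    k = math.ceil((len(scores) + 1) * confidence)  # 1-based rank
    return scores[min(k, len(scores)) - 1]


def interval(pred, err, alpha):
    """Prediction interval: pred +/- alpha * predicted error."""
    return pred - alpha * err, pred + alpha * err


def coverage(y, preds, errs, alpha):
    """Fraction of observations that fall inside their interval."""
    hits = sum(1 for t, p, e in zip(y, preds, errs) if abs(t - p) <= alpha * e)
    return hits / len(y)


# Toy calibration set: constant prediction, unit predicted error.
y_cal = [6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9]
pred_cal = [6.0] * 9
err_cal = [1.0] * 9
alpha = calibration_factor(y_cal, pred_cal, err_cal)
cov = coverage(y_cal, pred_cal, err_cal, alpha)
print(alpha, cov)
```

The coverage guarantee holds when new compounds resemble the calibration data. That matches the pattern in the reported numbers: coverage is close to the nominal 80% on the random Test set and lower on the time-split Ext set.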

Feature engineering is the process of using domain knowledge to extract features from raw data.

Combined descriptor set
∑ 20 × 163 = 3260 descriptor sets

Importance (ECFP4_CHEMTERM:All)
50% related to protonation or partitioning

SAMPL7 pKa
https://link.springer.com/article/10.1007/s10822-021-00397-3

Classification use case:
Blood-Brain Barrier Penetration

Blood-brain barrier penetration model
MoleculeNet, DOI: 10.1039/c7sc02664a

Classification use case:
PAMPA Permeability

Data preparation
1) Source: PubChem BioAssay AID: 1508612
a) NCATS Parallel Artificial Membrane Permeability Assay (PAMPA) Profiling
b) Observed PAMPA at pH 7.4 (× 10^-6 cm/s)
2) Standardize (strip salts, tautomerize)
3) Mw 0-800 -> 2029 cases
4) Permeability 100 cutoff using “Phenotype” field
0:Low, Medium (646 cases)
1:High (1383 cases)
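
Steps 3 and 4 can be sketched as below. This is a simplified illustration: the record keys `mw` and `pampa` are hypothetical, and the sketch applies the numeric cutoff of 100 directly to the measured permeability instead of reading the “Phenotype” field.

```python
def prepare(records, mw_max=800.0, cutoff=100.0):
    """Filter by molecular weight and binarize PAMPA permeability.

    records: dicts with hypothetical keys 'mw' and 'pampa'
    (permeability in 1e-6 cm/s). Returns (record, label) pairs,
    where label 1 = high and label 0 = low or medium.
    """
    kept = [r for r in records if 0 < r["mw"] <= mw_max]
    return [(r, 1 if r["pampa"] > cutoff else 0) for r in kept]


# Made-up records for illustration only.
records = [
    {"id": "A", "mw": 300.0, "pampa": 250.0},
    {"id": "B", "mw": 450.0, "pampa": 40.0},
    {"id": "C", "mw": 950.0, "pampa": 500.0},  # dropped: Mw above 800
]
labels = [label for _, label in prepare(records)]
print(labels)
```

With 1383 high and 646 low or medium cases, about 68% of the data set falls in the positive class, so accuracy alone would be a weak measure of model quality.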

[Figure: Clustering; MACCS similarity matrix; t-SNE projection]
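
A pairwise similarity matrix over key-based fingerprints can be sketched as follows. The Tanimoto coefficient is assumed here as the similarity measure, and the bit indices are made up; generating real MACCS keys requires a cheminformatics toolkit.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def similarity_matrix(fps):
    """Full pairwise similarity matrix as a list of rows."""
    return [[tanimoto(x, y) for y in fps] for x in fps]


# Three toy fingerprints; the key indices are illustrative only.
fps = [frozenset({1, 5, 9}), frozenset({1, 5, 12}), frozenset({40, 41})]
matrix = similarity_matrix(fps)
print(matrix[0][1], matrix[0][2])
```

Such a matrix can feed both a clustering step and a low-dimensional projection like t-SNE.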

Is it a hard task? Base model: MPNN
Epoch 40/40
Train: AUC: 0.7897 MCC: 0.3921
Validation: AUC: 0.5879 MCC: 0.0740
https://keras.io/examples/graph/mpnn-molecular-graphs/
atom_featurizer = AtomFeaturizer(
    allowable_sets={
        "symbol": {"B", "Br", "C", "Ca", "Cl", "F", "H", "I", "N", "Na", "O", "P", "S"},
        "n_valence": {0, 1, 2, 3, 4, 5, 6},
        "n_hydrogens": {0, 1, 2, 3, 4},
        "hybridization": {"s", "sp", "sp2", "sp3"},
    }
)
bond_featurizer = BondFeaturizer(
    allowable_sets={
        "bond_type": {"single", "double", "triple", "aromatic"},
        "conjugated": {True, False},
    }
)
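
For reference, the two metrics reported for the MPNN baseline can be computed as in the sketch below. It is written in plain Python for illustration; the training run itself presumably used library implementations.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half. Requires at least one example of each class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# A model that always predicts the majority class scores MCC = 0.
majority_mcc = mcc([1, 1, 1, 0], [1, 1, 1, 1])
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(majority_mcc, auc)
```

A validation MCC of 0.0740 is therefore close to what a majority-class predictor achieves, and a validation AUC of 0.5879 is only modestly above the 0.5 of random ranking.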

CONVERTING DATA TO
PROJECT TEAM INSIGHTS

Discovery teams: Design Landscape
[Diagram: Design Hub; Series; H1, H2, H3, H4]
Discovery hub connecting chemical series, data, predictions and chemical project management

Discovery teams: Analysis
[Diagram: Design Hub; Series; H1, H2, H3, H4; Biological measurements]

Discovery teams: Biological data focus
[Diagram: Design Hub; Series; H1, H2, H3, H4; Trainer GUI (Training / Analysis, Comp. Chem); Trainer Engine]

Discovery teams: Train & Deploy
[Diagram: Design Hub; Series; H1, H2, H3, H4; Services; Production Models; Trainer GUI (Training / Analysis, Comp. Chem); Trainer Engine; REST API]
Discovery teams: Fill the gap
[Diagram: Design Hub; Series; H1, H2, H3, H4; Services; Production Models; Trainer GUI (Training / Analysis, Comp. Chem); Trainer Engine; REST API]
Discovery teams: Multi parameter optimization
[Diagram: Design Hub; Series; H1, H2, H3, H4; Services; Production Models; Trainer GUI (Training / Analysis, Comp. Chem); Trainer Engine]

Translate data to reliable models
Centralize model management
Connect project team members and resources
Track and manage discovery
Design Hub
Lower the barrier to adopt AI models in design
Trainer Engine
https://chemaxon.com/products/trainer-engine
https://chemaxon.com/products/design-hub

Interested?
atarcsay@chemaxon.com
