Translating data to predictive models
Akos Tarcsay
Machine learning life-cycle
From data to prediction
[Diagram: Data ingestion; Preprocessing; Modelling; Features; Models; Review; Prediction; Model repository]
System overview
- Chemaxon Standardizer
- Chemaxon descriptor generation
- DB layer: PostgreSQL (persistence and search)
- ML library (SMILE) with conformal prediction wrapper
- Statistical evaluation
- Service layer: REST interface (programmatic access)
- Trainer GUI (“Comp Chem”)
- Prediction GUI (“Med Chem”)
How to reduce noise?
Preprocessing
Effect of standardization
- Simple descriptors (Mw, Fsp3, HBDA, etc.)
- Phys-chem (logD, pKa)
- Molecular graph, fingerprints
Examples: salts, solvates (imipramine pamoate); tautomerism (furan-2-ol)
“Overall and despite our efforts to use open software wherever possible, we find that
ChemAxon Tautomers node outperforms the other approaches we tested.”
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00606-7
Small molecule retention time (SMRT) dataset: Tautomerization
https://www.nature.com/articles/s41467-019-13680-7
SMRT Tautomer effect
Prediction power?
ChEMBL Benchmark
Activity dataset: the ‘ChEMBL bioactivity benchmark set’
Data source: Journal of Cheminformatics, 9, 45 (2017) by Eelke B. Lenselink, Niels
ten Dijke, Brandon Bongers, George Papadatos, Herman W. T. van Vlijmen, Wojtek
Kowalczyk, Adriaan P. IJzerman, Gerard J. P. van Westen
- ChEMBL database (version 20)
- Activities were selected that met the following criteria:
  - at least 30 compounds tested per protein and from at least 2 separate publications
  - assay confidence score of 9
  - ‘single protein’ target type
  - assigned pChEMBL value
  - no flags on potential duplicate or data validity comment
  - originating from scientific literature
- data points with activity comments ‘not active’, ‘inactive’, ‘inconclusive’, and ‘undetermined’ were removed
- MED value was chosen
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
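The selection criteria listed above can be sketched as a filter over activity records. This is a minimal illustration, not the benchmark authors' code: every field name (`target_id`, `compound_id`, `doc_id`, `confidence_score`, `target_type`, `pchembl_value`, `potential_duplicate`, `data_validity_comment`, `activity_comment`, `from_literature`) is hypothetical, and reading "MED value" as the median of repeated measurements is an assumption.

```python
from collections import defaultdict
from statistics import median

EXCLUDED_COMMENTS = {"not active", "inactive", "inconclusive", "undetermined"}


def passes_record_filters(rec):
    """Record-level criteria from the slide (field names are hypothetical)."""
    comment = (rec.get("activity_comment") or "").strip().lower()
    return (
        rec.get("confidence_score") == 9
        and rec.get("target_type") == "SINGLE PROTEIN"
        and rec.get("pchembl_value") is not None
        and not rec.get("potential_duplicate")
        and not rec.get("data_validity_comment")
        and rec.get("from_literature", False)
        and comment not in EXCLUDED_COMMENTS
    )


def build_benchmark(records, min_compounds=30, min_documents=2):
    """Return {target_id: {compound_id: aggregated pChEMBL value}}."""
    values = defaultdict(lambda: defaultdict(list))
    documents = defaultdict(set)
    for rec in filter(passes_record_filters, records):
        values[rec["target_id"]][rec["compound_id"]].append(rec["pchembl_value"])
        documents[rec["target_id"]].add(rec["doc_id"])

    benchmark = {}
    for target, by_compound in values.items():
        enough_compounds = len(by_compound) >= min_compounds
        enough_documents = len(documents[target]) >= min_documents
        if enough_compounds and enough_documents:
            # Assumption: "MED value" means the median of repeated measurements.
            benchmark[target] = {c: median(v) for c, v in by_compound.items()}
    return benchmark
```

Targets that fail the per-target thresholds are dropped as a whole, so the compound and publication counts are checked only after the record-level filters have been applied.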
Application Study on ChEMBL
- Data points per target: 500-4703 (median: 776)
- 163 ChEMBL targets, pAct
- Sorted by document year; the last 30 points of each target reserved as the external set (Ext)
- Remaining points split at random, 10% test (Test) and 90% training (Train)
- ~160k total training size
- ~18k total test size
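The split described on this slide can be sketched per target as follows. This is a minimal sketch under stated assumptions, not the Trainer Engine implementation: the `year` key, the function name `split_target` and the fixed random seed are all hypothetical.

```python
import random


def split_target(records, n_external=30, test_fraction=0.10, seed=0):
    """Split one target's records into (train, test, ext).

    records: list of dicts with a "year" key (hypothetical field name).
    """
    ordered = sorted(records, key=lambda r: r["year"])
    cut = max(len(ordered) - n_external, 0)
    ext = ordered[cut:]            # most recent points, held out as Ext
    remaining = ordered[:cut]

    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    rng.shuffle(remaining)
    n_test = round(len(remaining) * test_fraction)
    test, train = remaining[:n_test], remaining[n_test:]
    return train, test, ext
```

Holding out the most recent points mimics prospective use, where a model trained on past publications is applied to compounds reported later; the random test set measures interpolation within the already covered chemistry.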
Benchmark on 163 ChEMBL targets
163 ChEMBL Targets
Analysis of the best models per target
         Pearson          R2
         Ext     Test     Ext     Test
Avg      0.672   0.824    0.306   0.679
Median   0.722   0.833    0.385   0.697
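The two metrics in the table answer different questions, which a short sketch makes concrete. The slide does not say how R2 was computed; the code below assumes the coefficient of determination (1 - SS_res / SS_tot), and the function names are illustrative.

```python
from statistics import mean


def pearson_r(y_true, y_pred):
    """Linear correlation between observed and predicted values."""
    mt, mp = mean(y_true), mean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / (var_t * var_p) ** 0.5


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mt = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mt) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Predictions that rank perfectly but are shifted by one log unit.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [6.0, 7.0, 8.0, 9.0]
print(pearson_r(observed, predicted))   # 1.0
print(r_squared(observed, predicted))   # 0.2
```

Under this definition a model can keep a high Pearson correlation while R2 drops sharply if its predictions are systematically offset, which is one possible reading of the larger gap between the two metrics on the external set.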
How reliable?
Confidence?
Conformal prediction
[Diagram: the training set is split into a proper training set and a calibration set; labels: Model, Error model, Error, Prediction, P(80%), calibration factor (ɑ)]
https://www.jmlr.org/papers/volume9/shafer08a/shafer08a.pdf
https://pubs.acs.org/doi/10.1021/ci5001168
Test: 14233 / 17661 (80.6%) within the error bound
Ext: 3344 / 4890 (68.4%) within the error bound
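The calibration step behind these coverage numbers can be sketched as an inductive conformal regressor with normalized residuals. The slides do not give the exact nonconformity measure, so scaling the residual by the error model's output is an assumption here, as are all function names.

```python
import math


def calibration_factor(y_cal, pred_cal, err_cal, confidence=0.80):
    """Quantile of the normalized calibration residuals |y - y_hat| / err_hat."""
    scores = sorted(abs(y - p) / e for y, p, e in zip(y_cal, pred_cal, err_cal))
    # Conservative finite-sample rank used in inductive conformal prediction.
    rank = math.ceil(confidence * (len(scores) + 1))
    if rank > len(scores):
        return math.inf  # too few calibration points for this confidence level
    return scores[rank - 1]


def prediction_interval(pred, err, alpha):
    """Error bound for one prediction: y_hat +/- alpha * err_hat."""
    return pred - alpha * err, pred + alpha * err


def coverage(y_true, preds, errs, alpha):
    """Fraction of observed values that fall within their error bound."""
    hits = 0
    for y, p, e in zip(y_true, preds, errs):
        low, high = prediction_interval(p, e, alpha)
        hits += low <= y <= high
    return hits / len(y_true)
```

The coverage guarantee of conformal prediction rests on the calibration and evaluation data being exchangeable. The random test set satisfies this by construction and lands at 80.6%, close to the 80% target; the external set holds the most recent compounds of each target, so a drift away from the calibration data is one plausible explanation for its lower coverage of 68.4%.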
Feature engineering is the process of using domain knowledge to extract features from raw data.
Combined descriptor set
Total: 20 × 163 = 3260 descriptor sets
Importance (ECFP4_CHEMTERM:All)
50% related to protonation or partitioning
SAMPL7 pKa
https://link.springer.com/article/10.1007/s10822-021-00397-3
Classification use case: Blood-Brain Barrier Penetration
Blood brain barrier penetration model
MoleculeNet, 10.1039/c7sc02664a
Model analysis
Rich prediction results
Classification use case: PAMPA Permeability
Data preparation
1) Source: PubChem BioAssay AID: 1508612
   a) NCATS Parallel Artificial Membrane Permeability Assay (PAMPA) Profiling
   b) Observed PAMPA at pH 7.4 (×10^-6 cm/sec)
2) Standardize (strip salts, tautomerize)
3) Mw 0-800 -> 2029 cases
4) Permeability cutoff of 100 using the “Phenotype” field
   0: Low, Medium (646 cases)
   1: High (1383 cases)
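Steps 3 and 4 of the preparation can be sketched as a filter plus a label mapping. This is an illustration only: the keys `smiles`, `mw` and `phenotype`, and the phenotype strings `low`, `medium` and `high`, are hypothetical stand-ins for whatever the downloaded assay table actually contains, and standardization (step 2) is assumed to have been done beforehand.

```python
def prepare_pampa(rows, max_mw=800.0):
    """Filter by molecular weight and map the phenotype to a binary label.

    rows: dicts with hypothetical keys "smiles", "mw" and "phenotype".
    """
    label_of = {"low": 0, "medium": 0, "high": 1}
    prepared = []
    for row in rows:
        if not 0.0 < row["mw"] <= max_mw:
            continue  # outside the Mw 0-800 window
        label = label_of.get(row["phenotype"].strip().lower())
        if label is None:
            continue  # unknown phenotype text: skip rather than guess
        prepared.append({"smiles": row["smiles"], "label": label})
    return prepared
```

With 1383 of the 2029 cases in the high class (about 68%), a model that always predicts "high" already reaches roughly 68% accuracy, which is why class-balance-aware metrics such as AUC and MCC are the more informative choice for this dataset.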
Clustering: MACCS similarity matrix, tSNE projection
Is it a hard task? Base model: MPNN
Epoch 40/40
Train: AUC: 0.7897 MCC: 0.3921
Validation: AUC: 0.5879 MCC: 0.0740
https://keras.io/examples/graph/mpnn-molecular-graphs/
# AtomFeaturizer and BondFeaturizer are the helper classes defined in the Keras example linked above.
atom_featurizer = AtomFeaturizer(
    allowable_sets={
        "symbol": {"B", "Br", "C", "Ca", "Cl", "F", "H", "I", "N", "Na", "O", "P", "S"},
        "n_valence": {0, 1, 2, 3, 4, 5, 6},
        "n_hydrogens": {0, 1, 2, 3, 4},
        "hybridization": {"s", "sp", "sp2", "sp3"},
    }
)
bond_featurizer = BondFeaturizer(
    allowable_sets={
        "bond_type": {"single", "double", "triple", "aromatic"},
        "conjugated": {True, False},
    }
)
MPNN: Epoch 40/40
Train: AUC: 0.7897 MCC: 0.3921
Validation: AUC: 0.5879 MCC: 0.0740
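For reference, the two metrics reported for the MPNN baseline can be computed as in the sketch below. It uses the standard definitions (AUC as the pairwise ranking probability, MCC from the confusion matrix) and is not the Keras example's own metric code; the function names are illustrative.

```python
import math


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mcc(labels, predictions):
    """Matthews correlation coefficient from binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(y == 1 and p == 1 for y, p in pairs)
    tn = sum(y == 0 and p == 0 for y, p in pairs)
    fp = sum(y == 0 and p == 1 for y, p in pairs)
    fn = sum(y == 1 and p == 0 for y, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A random ranking gives an AUC of 0.5 and an MCC of 0, so the validation values of 0.5879 and 0.0740 sit close to chance level while the training values are clearly higher, a pattern consistent with overfitting.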
CONVERTING DATA TO PROJECT TEAM INSIGHTS
Design Landscape
Discovery hub connecting chemical series, data, predictions and chemical project management
[Diagram: Discovery teams; Design Hub; Series; H1 H2 H3 H4]

Analysis
[Diagram: Discovery teams; Design Hub; Series; H1 H2 H3 H4; Biological measurements]

Biological data focus
[Diagram: Discovery teams; Design Hub; Series; H1 H2 H3 H4; Comp. Chem; Trainer GUI (Training / Analysis); Trainer Engine]

Train&Deploy
[Diagram: Discovery teams; Production Models; Design Hub Services; Series; H1 H2 H3 H4; Comp. Chem; Trainer GUI (Training / Analysis); Trainer Engine; REST API]

Fill the gap
[Diagram: Discovery teams; Production Models; Design Hub Services; Series; H1 H2 H3 H4; Comp. Chem; Trainer GUI (Training / Analysis); Trainer Engine; REST API]

Multi parameter optimization
[Diagram: Discovery teams; Production Models; Design Hub Services; Series; H1 H2 H3 H4; Comp. Chem; Trainer GUI (Training / Analysis); Trainer Engine]
Translate data to reliable models
Centralize model management
Connect project team members and resources
Track and manage discovery
Design Hub
Lower the barrier to adopt AI models in design
Trainer Engine
https://chemaxon.com/products/trainer-engine
https://chemaxon.com/products/design-hub
Interested?
atarcsay@chemaxon.com

Chemaxon EU UGM 2022 | Translating data to predictive models
