Dataledger(TM) - research data platform with data acquisition and integrated machine learning pipeline
SCIENCE DATA SOFTWARE, LLC, Rockville, Maryland, United States
Data, Information, Knowledge, Wisdom
[Diagram labels: Laboratory Data Acquisition; Virtual Standard FAIR Data Bus; Other Registries]
Open Data Science Platform – Dataledger(TM)
[Diagram labels: Data sources (Social Media, Electronic Notebooks, Databases, Sensor, Med Dev, IoT); Data Lake; Curation & Integration; Curated Repository; Mining; Analysis & Modeling; Validation; Models; Decision Support; Users; Model-driven experimental studies]
Workflow
• Data acquisition and templated data entry
• Processing
• Curation
• Datasets management
• Descriptive analysis
• Modeling
• Validation
• Prediction
• Feedback cycle
• Automation
Chemical processing
● Support for chemical formats
● Chemistry validation and standardization
● Automatic processing and visualization
FAIR Data Principles
Collaborative data authoring and curation
● Datacite.org support
● Other formats
● Audit trail
● Notifications
Built-in Machine Learning
● Automated ML pipeline
● Pre-built ML modules
● Comparison between different ML algorithms
● NB, NN, RF, SVM, LR
● DNN
Train Models: auto-optimized fingerprints
MACCS – 166 public MACCS keys + 1 zero bit
AVALON – Avalon count fingerprints from the Avalon cheminformatics toolkit
ECFP – Extended Connectivity Fingerprint
FCFC – Feature Connectivity Fingerprint Count vector
DESC – all of the RDKit-supported descriptors
Automated optimization of molecular representation by combining up to 4 different types of fingerprints
TOP-5 combinations (BioHC, VP, KM datasets):
• DESC, MACCS, ECFP (Radius: 3, Size: 512), FCFC (Radius: 2, Size: 128)
• DESC, AVALON (Size: 128), ECFP (Radius: 2, Size: 128), FCFC (Radius: 2, Size: 128)
• DESC, MACCS, ECFP (Radius: 2, Size: 256), FCFC (Radius: 3, Size: 128)
• MACCS, AVALON (Size: 1024), ECFP (Radius: 2, Size: 256), FCFC (Radius: 4, Size: 512)
• DESC, AVALON (Size: 128), ECFP (Radius: 3, Size: 512)
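The search over representations can be pictured as enumerating every subset of one to four feature types and scoring each one. The sketch below is only an illustration under stated assumptions: `toy_score` and its weights are hypothetical stand-ins for "train a model on the concatenated features and return a validation score", and per-type parameters such as radius and size are left out.

```python
from itertools import combinations

# Feature types named on the slide.
FEATURE_TYPES = ["DESC", "MACCS", "AVALON", "ECFP", "FCFC"]


def candidate_combinations(types, max_size=4):
    """Yield every combination of 1..max_size distinct feature types."""
    for k in range(1, max_size + 1):
        yield from combinations(types, k)


def rank_combinations(types, score, top_n=5, max_size=4):
    """Score each combination with a user-supplied function; return the best top_n."""
    scored = [(score(combo), combo) for combo in candidate_combinations(types, max_size)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def toy_score(combo):
    """Hypothetical scorer (not from the source); replace with real model validation."""
    weights = {"DESC": 0.30, "MACCS": 0.20, "AVALON": 0.15, "ECFP": 0.25, "FCFC": 0.10}
    return sum(weights[name] for name in combo)


all_combos = list(candidate_combinations(FEATURE_TYPES))
top5 = rank_combinations(FEATURE_TYPES, toy_score)
```

With five feature types and at most four per combination, the search space has 30 candidates before any radius or size settings are varied.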
Prediction
Sharing
OSDR Single Structure Prediction Free Service
https://ssp.dataledger.io/predict
Molecule representation
RDKit: Descriptors, Fingerprints; Autoencoder FP, CNN
Smart Molecule representation: SMILES-based seq2seq encoders
• A powerful tool for data-driven feature extraction
• The latent vector can be used as a QSAR descriptor of a molecule
[Figure: initial SMILES CC(C)Cc1ccc(cc1)C(C)C(=O)O; latent vector (32.4, 24.2, …, 24.2, 32.1); reconstructed SMILES CC(C)Cc1ccc(cc1)C(C)C(=O)O]

Representation   IGC50, R2   Solubility, R2
ENUM2CAN         0.7918      0.8954
CAN2CAN          0.7256      0.8289
ECFP4            0.562       0.6359

A trained decoder combined with a multiple-objective optimization algorithm can be used as a specialized inverse-QSAR-type generative model.
B. Sattarov et al., ready for publication
Publicly available free service for calculating molecular representations: https://ssp.dataledger.io/features
AI and Machine Learning methods in OSDR for QSAR, QSPR, and material SPR
Shallow Machine Learning (SML) methods:
• Naive Bayes, k Nearest Neighbors, Linear Logistic Regression, AdaBoost Decision Tree, Random Forest, Elastic Net regression (L1 and L2 regularized), Support Vector Machine, XGBoost
• Clustering, Classification and Regression models
• Open-source ML libraries such as scikit-learn (CPU for training and prediction) for training, tuning (automated optimization), and validating all SML models
Deep Neural Networks (DNN) methods:
• Feed-forward DNNs of varying complexity (up to 6 hidden layers)
• Encoder-decoder NN for feature selection and clustering
• Long Short-Term Memory (LSTM) NN for inverse QSAR (generating SMILES of molecules with desired properties)
• Convolutional Graph NN (Graph CNN)
• Clustering, Classification, Regression and Generative models; model optimization and tuning
• Open-source Keras and TensorFlow (GPU for training; CPU or GPU for prediction)
Datasets preparation:
• Exploratory dataset analysis (value distributions, average number of heavy atoms, rotatable bonds, etc.)
• Converting SMILES to feature vectors (fingerprints and descriptors, alone and in combination)
• Splitting into train and test datasets (80/20 by default) while maintaining equal proportions of active/inactive classes, or similar train/test distributions for regression (stratified splitting)
• k-fold cross-validation on the training data for better model generalization
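The stratified 80/20 split and k-fold step can be sketched in plain Python as below. This is a minimal illustration for the classification case, not the platform's implementation; the function names, the fixed seed, and the toy label list are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps about the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Toy labels (illustrative only): 20 actives and 80 inactives.
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels)
folds = list(k_fold_indices(train_idx, k=5))
```

For regression targets, one common way to get similar train/test distributions is to bin the target values first and stratify on the bin labels.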
Model performance evaluation metrics
• Receiver Operating Characteristic (ROC) curve and the area under it (AUC): the curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
• F1-Score: the harmonic mean of Recall and Precision:
$$F_1\ \mathrm{Score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
• Accuracy: the fraction of correctly identified labels out of the entire population:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
• Matthews correlation coefficient (MCC): generally regarded as a balanced measure that can be used even if the classes are of very different sizes:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
• Cohen’s Kappa coefficient (CK): estimates overall model performance by normalizing the Accuracy to the probability that the classification would agree by chance ($p_e$):
$$\mathrm{CK} = \frac{\mathrm{Accuracy} - p_e}{1 - p_e}, \qquad p_e = p_{True} + p_{False}$$
$$p_{True} = \frac{TP + FN}{TP + TN + FP + FN} \cdot \frac{TP + FP}{TP + TN + FP + FN}$$
$$p_{False} = \frac{TN + FP}{TP + TN + FP + FN} \cdot \frac{TN + FN}{TP + TN + FP + FN}$$
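All of the threshold-based metrics above follow from the four confusion-matrix counts. The sketch below uses the standard definitions of these metrics; the function name and the example counts are illustrative assumptions, and no guard is included for degenerate cases such as an empty class.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Compute F1, Accuracy, MCC and Cohen's Kappa from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes present and predicted).
    """
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / n
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Chance agreement: actual-positive rate times predicted-positive rate,
    # plus the same product for the negative class.
    p_true = ((tp + fn) / n) * ((tp + fp) / n)
    p_false = ((tn + fp) / n) * ((tn + fn) / n)
    p_e = p_true + p_false
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"f1": f1, "accuracy": accuracy, "mcc": mcc, "kappa": kappa}


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```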
OSDR models performance on publicly available datasets
OPERA: https://cfpub.epa.gov/si/si_public_record_report.cfm?dirEntryId=337488
TEST: https://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test
EPA shallow ML vs OSDR deep models

Dataset                               EPA OPERA, R2 test   OSDR DNN, R2 test
Octanol-water Partition Coefficient   0.860                0.916
Melting Point                         0.730                0.817
Water solubility                      0.860                0.931
Vapor Pressure                        0.920                0.944
Boiling Point                         0.930                0.964
Fish Biotransformation Half-life      0.730                0.894
Biological Half-life                  0.750                0.832
Bioconcentration Factor               0.830                0.775
Soil Adsorption Coefficient           0.710                0.785

Dataset     EPA TEST, R2 test   OSDR DNN, R2 test
BCF         0.758               0.743
BP          0.946               0.959
Density     0.955               0.968
FP          0.878               0.898
IGC50       0.763               0.855
LD50        0.624               0.681
LC50        0.729               0.760
LC50DM      0.657               0.716
MP          0.830               0.842
ST          0.898               0.938
TC          0.889               0.911
Viscosity   0.856               0.875
VP          0.953               0.953
WS          0.858               0.853
SDS models trained on the publicly available datasets used by EPA
[Chart: OSDR vs EPA predictions consensus. Deviations from experimental values for EPA and OSDR, with bins 10%, 25%, 50%, and outside 50%; axis from 0 to 0.45]
Tested on 40 structures from PHYSPROP and DrugBank, 9 properties: Water Solubility, Melting Point, Vapor Pressure, Boiling Point, LogP, LogBCF, LC50, LD50, IGC50
Trained models are available in OSDR for Single Structure Prediction as a free service:
https://ssp.dataledger.io/predict
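One plausible way to produce such a deviation breakdown is to bin each prediction by its relative deviation from the experimental value. The sketch below is an assumption about how the bins might be computed: the bin edges come from the chart legend, but treating them as non-overlapping bins of relative absolute deviation is a guess, and the example pairs are made up.

```python
from collections import Counter

# Bin edges from the chart legend; exclusive, non-overlapping bins are an assumption.
BINS = [(0.10, "within 10%"), (0.25, "within 25%"), (0.50, "within 50%")]


def deviation_bin(predicted, experimental):
    """Classify a prediction by its relative deviation from the experimental value."""
    if experimental == 0:
        raise ValueError("relative deviation is undefined for a zero experimental value")
    deviation = abs(predicted - experimental) / abs(experimental)
    for limit, label in BINS:
        if deviation <= limit:
            return label
    return "outside 50%"


# Made-up (predicted, experimental) pairs, for illustration only.
pairs = [(1.05, 1.0), (1.2, 1.0), (1.4, 1.0), (2.0, 1.0), (0.95, 1.0)]
tally = Counter(deviation_bin(p, e) for p, e in pairs)
```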
Overall, the deep models showed a better ability to learn representations and generalized better than the shallow approaches.
In progress: Materials
In progress: Graph convolutional neural networks as
"general-purpose" property predictors
V. Korolev et al, Ready for publication
In progress: Deep Learning approach for Ln(III) complexation models and model analysis
[Charts: two per-lanthanide plots (La, Ce, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu) with Train, Validation and Test series; one plot of Impact [logK] against bit number]
A. Mitrofanov, ready for submission
Reaction 1: compounds and solutions
Diisobutylaluminium hydride (1.1 M in cyclohexane,
2.93 mL, 3.23 mmol) was added dropwise to the
solution of 9 (500 mg, 1.29 mmol) and
dichloromethane (20 mL) at −78 °C. The reaction
mixture was stirred at −78 °C for another 2 h, warmed
up to rt, quenched with methanol (3 mL) and citric
acid (aq) (w/w, 10%, 5 mL), concentrated. The residue
was added with water (10 mL) and extracted with
dichloromethane (12 mL × 3). The organic layers were
combined, dried over Na2SO4, filtered and
concentrated. The crude product was further purified
by column chromatography (SiO2, EtOAc–hexanes,
1 : 7; Rf 0.33) to give 10 (308 mg, 1.02 mmol, 79%) as a
colourless liquid.
Solutions
• Diisobutylaluminium hydride
Compounds
• 9
• dichloromethane
• methanol
• citric acid
• water
• Dichloromethane
• Na2SO4
• 10
Ignored for now (only the name was
extracted in this pass) – in time
“Substances”
• SiO2
• EtOAc–hexanes
Reaction components: reactant , solvent, product
Other compound/substance used in procedure
Reaction 2: Planned reaction run
• Reaction Run: Reaction SJH-01-227 dated 2/12/2014; FailedReaction: false; Experiment Stage: Planned
• Stoichiometry Table (columns: Label; Reaction Component; Substance; Amounts; Comments):
SJH-01-223
  Reaction Component: Role: Limiting Reactant
  Substance: Compound; Molecular Mass: 756.95; State: Solid
  Amounts: Equivalence: 1; Moles: 0.132 mmol; Mass: 0.1 g
benzaldehyde
  Reaction Component: Role: Reactant
  Substance: Compound; Molecular Mass: 206.24; State: Solid
  Amounts: Equivalence: 6; Moles: 0.293 mmol; Mass: 0.163 g
Cu(OTf)2
  Reaction Component: Role: Reactant
  Substance: Compound; Molecular Mass: 361.67; State: Liquid; Purity: 98%; Source: 283673-5G, Sigma Aldrich
  Amounts: Equivalence: 0.1; Moles: 0.013 mmol; Mass: 0.005 g
DCE
  Reaction Component: Role: Solvent
  Substance: Compound; State: Liquid; Purity: 99-100%; Source: 283673-5G, Sigma Aldrich
  Amounts: Volume: 1.321 mL
  Comments: Concentration in line 1: 0.1 M
TFA
  Reaction Component: Role: Solvent
  Substance: Compound; Molecular Mass: 114.02; Density: 1.49 g/mL; State: Liquid; Purity: 99%; Source: T6508-500mL, Sigma Aldrich
  Amounts: Equivalence: 3; Moles: 0.396 mmol; Mass: 0.045 g; Volume: 0.030 mL
SJH_01_227
  Reaction Component: Role: Product
  Substance: Compound; State: Solid
  Amounts: Equivalence: 1; Moles: 0.132 mmol
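The amounts in a stoichiometry table of this kind follow from the limiting reactant by simple arithmetic: moles = mass / molecular mass, scaled by equivalents, then converted to mass and (for liquids with a known density) to volume. The sketch below reproduces the SJH-01-223, Cu(OTf)2, TFA and DCE figures from the table; the function name and argument names are illustrative assumptions.

```python
def plan_component(limiting_mmol, equivalents, molar_mass, density_g_per_ml=None):
    """Return (mmol, mass in g, volume in mL or None) for one reaction component."""
    mmol = limiting_mmol * equivalents
    mass_g = mmol * molar_mass / 1000.0
    volume_ml = mass_g / density_g_per_ml if density_g_per_ml else None
    return mmol, mass_g, volume_ml


# Limiting reactant SJH-01-223: 0.1 g at a molecular mass of 756.95.
limiting_mmol = 0.1 / 756.95 * 1000.0

# TFA: 3 equivalents, molecular mass 114.02, density 1.49 g/mL.
tfa_mmol, tfa_mass_g, tfa_volume_ml = plan_component(limiting_mmol, 3, 114.02, 1.49)

# Cu(OTf)2: 0.1 equivalents, molecular mass 361.67.
cu_mmol, cu_mass_g, _ = plan_component(limiting_mmol, 0.1, 361.67)

# DCE volume for 0.1 M in the limiting reactant: mmol / (mol/L) gives mL.
dce_volume_ml = limiting_mmol / 0.1
```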
S88 process standard approach
S88 level           Mapped to
Process             Experiment
Process Stage       Synthesis stage
Process Operation   Preparation / Reaction / Work up / Isolation
Process Actions     Heat / Cool / Dose / Stir etc.
S88 allows procedure steps (process actions) to be grouped into “process operations”.
We allow “Procedure Steps” to be nested and have seeded the following procedure step types to assign to procedure steps for these parent operations:
S88 process operation / Procedure.StepTypes.Title:
• Preparation
• Reaction
• Work up
• Isolation
S88-style procedures
• Type of actions which can be assigned to procedure steps
Action Types
Add Synthesize Wait Degass
Yield Wash Unknown Irradiate
Stir Extract Precipitate Mill
Remove Filter Partition Sample
Heat Concentrate Quench Reflux
Dry Cool Apparatus Action Transfer
Purify Dissolve Recover
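The nesting of procedure steps under parent operations could be modeled roughly as a small tree. In the sketch below, the class name `ProcedureStep`, its fields, and the assignment of particular actions to particular operations are assumptions made for illustration; they are not the Dataledger schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Seeded parent step types listed on the slide.
OPERATION_TYPES = {"Preparation", "Reaction", "Work up", "Isolation"}


@dataclass
class ProcedureStep:
    title: str
    step_type: Optional[str] = None  # one of OPERATION_TYPES for parent steps
    action: Optional[str] = None     # an action type such as "Add" for leaf steps
    children: List["ProcedureStep"] = field(default_factory=list)

    def leaf_actions(self) -> List[str]:
        """Collect action types of all leaf steps, depth first, in order."""
        if not self.children:
            return [self.action] if self.action else []
        actions: List[str] = []
        for child in self.children:
            actions.extend(child.leaf_actions())
        return actions


# Hypothetical grouping of a few actions under two parent operations.
experiment = ProcedureStep(
    title="Synthesis stage",
    children=[
        ProcedureStep("Reaction", step_type="Reaction", children=[
            ProcedureStep("Add reagent", action="Add"),
            ProcedureStep("Stir mixture", action="Stir"),
        ]),
        ProcedureStep("Work up", step_type="Work up", children=[
            ProcedureStep("Quench", action="Quench"),
            ProcedureStep("Extract", action="Extract"),
        ]),
    ],
)
actions = experiment.leaf_actions()
```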
Reaction 1: procedure steps
Diisobutylaluminium hydride (1.1 M in cyclohexane,
2.93 mL, 3.23 mmol) was added dropwise to the
solution of 9 (500 mg, 1.29 mmol) and
dichloromethane (20 mL) at −78 °C. The reaction
mixture was stirred at −78 °C for another 2 h,
warmed up to rt, quenched with methanol (3 mL)
and citric acid (aq) (w/w, 10%, 5 mL), concentrated.
The residue was added with water (10 mL) and
extracted with dichloromethane (12 mL × 3). The
organic layers were combined, dried over Na2SO4,
filtered and concentrated. The crude product was
further purified by column chromatography (SiO2,
EtOAc–hexanes, 1 : 7; Rf 0.33) to give 10 (308 mg,
1.02 mmol, 79%) as a colourless liquid.
Text mining breaks down the procedure summary into steps:
<dl:reactionActionList/dl:reactionActions> dl:phraseTexts
• action="Add": Diisobutylaluminium hydride (1.1 M in cyclohexane, 2.93 mL, 3.23 mmol) was added dropwise to the solution of 9 (500 mg, 1.29 mmol) and dichloromethane (20 mL) at −78 °C
• action="Stir": The reaction mixture was stirred at −78 °C for another 2 h
• action="Heat": warmed up to rt
• action="Quench": quenched with methanol (3 mL) and citric acid (aq) (w/w, 10%, 5 mL)
• action="Concentrate": concentrated
• action="Add": The residue was added with water (10 mL)
• action="Extract": extracted with dichloromethane (12 mL × 3)
• action="Dry": dried over Na2SO4
• action="Filter": filtered
• action="Concentrate": concentrated
• action="Purify": The crude product was further purified by column chromatography (SiO2, EtOAc–hexanes, 1 : 7; Rf 0.33)
• action="Yield": to give 10 (308 mg, 1.02 mmol, 79%) as a colourless liquid
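A very rough idea of how a procedure paragraph maps to an ordered list of action types is shown below. This keyword lookup is a deliberately naive stand-in, not the text-mining method used in the platform; the keyword table and the function name are assumptions chosen to fit this one example.

```python
import re

# Hypothetical keyword-to-action table; real text mining would need far more than this.
KEYWORDS = [
    ("was added", "Add"), ("stirred", "Stir"), ("warmed", "Heat"),
    ("quenched", "Quench"), ("concentrated", "Concentrate"),
    ("extracted", "Extract"), ("dried", "Dry"), ("filtered", "Filter"),
    ("purified", "Purify"), ("to give", "Yield"),
]

PROCEDURE = (
    "Diisobutylaluminium hydride (1.1 M in cyclohexane, 2.93 mL, 3.23 mmol) was added "
    "dropwise to the solution of 9 (500 mg, 1.29 mmol) and dichloromethane (20 mL) at "
    "−78 °C. The reaction mixture was stirred at −78 °C for another 2 h, warmed up to rt, "
    "quenched with methanol (3 mL) and citric acid (aq) (w/w, 10%, 5 mL), concentrated. "
    "The residue was added with water (10 mL) and extracted with dichloromethane "
    "(12 mL × 3). The organic layers were combined, dried over Na2SO4, filtered and "
    "concentrated. The crude product was further purified by column chromatography "
    "(SiO2, EtOAc–hexanes, 1 : 7; Rf 0.33) to give 10 (308 mg, 1.02 mmol, 79%) as a "
    "colourless liquid."
)


def tag_actions(text):
    """Return action types in order of first character position of each keyword hit."""
    hits = []
    for keyword, action in KEYWORDS:
        for match in re.finditer(re.escape(keyword), text):
            hits.append((match.start(), action))
    return [action for _, action in sorted(hits)]


actions = tag_actions(PROCEDURE)
```

On this paragraph the lookup yields the same twelve action types, in the same order, as the step list above; it would not generalize to differently worded procedures.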
Extensible micro-service based architecture
Technologies
● Mix of technologies connected through microservices architecture
● Open source toolkits and libraries with permissive licenses
● NoSQL Databases
● Containerization
● Leading practices in CI/CD
● Automated testing, rapid development
OSDR CLI - https://github.com/scidatasoft/osdr-cli
Summary
• A Machine Learning toolkit with a simple user interface has been developed for the Dataledger(TM) software.
• Exploratory analysis of chemical datasets, a wide range of descriptor and fingerprint selections, and dataset preparation and validation for QSAR/QSPR readiness for organic and inorganic materials
• Intelligent Machine Learning for QSAR and inverse QSAR, QSPR, and material SPR
• Shallow Machine Learning methods (Naive Bayes, k Nearest Neighbors, Linear Logistic Regression, AdaBoost Decision Tree, Random Forest, Elastic Net regression (L1 and L2 regularized), Kernel Ridge Regression, Support Vector Machine, XGBoost) for solving classification and regression problems in SAR and SPR
Summary
• Deep Learning methods (Feed-Forward Deep Neural Networks (NN), Encoder-Decoder NN, Long Short-Term Memory NN, Convolutional Graph NN) and their applications in drug discovery and pharmaceutical research
• Clustering, Classification, Regression and Generative models; model optimization and tuning; model interpretation and applicability domain estimation; property prediction and screening of chemical compounds for lead search and hit optimization
• Free public Feature Vector Computation service: https://ssp.dataledger.io/features
• Free public Single Structure Prediction service (multiple properties, multiple models): https://ssp.dataledger.io/predict
Thank you!
On Web:
scidatasoft.com
Slides:
https://www.slideshare.net/valerytkachenko16
Contact us:
info@scidatasoft.com

Open Science Data Repository - Dataledger
