Making Solubility Models with Reaxys

Creating Solubility Models with Reaxys
Elsevier R&D Solutions Services
Presented by: Dr. Matthew Clark
Date: 19 January 2016

• Reaxys has solubility data that can be used to create and study predictive
models
• Its data appear to be more diverse than the well-studied “Huuskonen” data set.
• The nature/diversity of the training set is very important for predictive
models
• The best reported models have the smallest training sets.
• However, these training sets may not be useful for prediction of more diverse
compounds.
• Predictions on the Reaxys set from a model trained on the Huuskonen set are poor.
• Reaxys has a diverse set of structures and solubilities
• Each individual measurement is referenced.
• Good source for model making
What We Will Learn

• In addition to the well-known reactions and compounds, Reaxys is filled with
hundreds of different measured properties reported for the compounds
• Each property is associated with a reference
• Each property has a “cluster” of values such as measurement temperature, pressure,
solvents etc. describing the conditions of the measurement.
• In many cases, multiple measurements of the same property are reported by
different authors at different times.
• A mean, median, and standard deviation can be computed from these measurements.
Each value is associated with a reference.
• One can use this data, combined with the chemical structures of the
compounds to make structure-based predictive models for these properties.
• One can then predict the value of new or proposed compounds from their chemical
structures.
Reaxys Property Data
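
The per-compound statistics mentioned above can be sketched in a few lines. This is a minimal illustration, not the presenter's workflow: the record layout, compound identifiers, values, and reference labels are all hypothetical placeholders.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical records: (compound id, solubility in g/L, reference label).
measurements = [
    ("cmpd-1", 1.9, "ref A"),
    ("cmpd-1", 2.1, "ref B"),
    ("cmpd-1", 2.0, "ref C"),
    ("cmpd-2", 0.5, "ref D"),
]


def summarize(records):
    """Group values by compound and return count, mean, median, and sample std. dev."""
    grouped = defaultdict(list)
    for compound_id, value, _reference in records:
        grouped[compound_id].append(value)
    summary = {}
    for compound_id, values in grouped.items():
        summary[compound_id] = {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            # stdev needs at least two values; report None for a single measurement
            "stdev": stdev(values) if len(values) > 1 else None,
        }
    return summary


stats = summarize(measurements)
```

The median is often the safer summary when a few reported values are outliers, which is one reason to keep all three statistics.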

Reaxys Property Data is Grouped with Conditions
You can select the measurement conditions relevant to your model
Boiling Point
  • Boiling Point, °C (BP.BP)
  • Pressure, Torr (BP.P)
Refractive Index
  • Refractive Index (RI.RI)
  • Wavelength, nm (RI.W)
  • Temperature, °C (RI.T)
Dielectric Constant
  • Dielectric Constant (DIC.DIC)
  • Frequency, Hz (DIC.F)
  • Temperature, °C (DIC.T)
Electrical Moment
  • Description (EM.KW)
  • Moment, D (EM.EM)
  • Temperature, °C (EM.T)
  • Method (EM.MET)
  • Solvent (EM.SOL)
Enthalpy of Formation
  • Enthalpy of Formation, J mol⁻¹ (HFOR.HFOR)
  • Temperature, °C (HFOR.T)
  • Pressure, Torr (HFOR.P)
Solubility (MCS)
  • Solubility, g L⁻¹ (SLB.SLB)
  • Saturation (SLB.SAT)
  • Temperature, °C (SLB.T)
  • Solvent (SLB.SOL)
  • Ratio of Solvents (SLB.RAT)

• There are several ways to access this data
• API (Application Programming Interface) allows direct access
• Download tagged SD file from Reaxys after searching
• “Hop in to” links to automatically go to data
• Reaxys API allows direct access to the data
• XML-based interface
• KNIME and Pipeline Pilot are supported.
• Queries need to select on measurement conditions (temperature, solvent) and on
the nature of the molecules (organic, single-fragment)
• Form-based query
• “Advanced Query”
Model Making Tools

Solubility Query To Select Data and Molecules
Query criterion                      Meaning
SLB.SLB > 0                          has a reported solubility
Temperature 19-25                    temperature range of measurement (°C)
Solvent 'H2O'                        solubility in water
Number of Fragments = 1              only one contiguous fragment
Elements = 'c'                       contains carbon
NOT Chemical Name = '*radical'       not a radical
Molecular Weight > 40 AND < 1000     molecular weight range
Number of Elements < 5               fewer than 5 different elements
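
The same selection logic can be sketched as a filter over records that have already been downloaded. This is only an illustration of the criteria listed above: the dictionary keys and example records are hypothetical and are not Reaxys field names or Reaxys query syntax.

```python
# Hypothetical record layout; keys are illustrative, not Reaxys field names.
records = [
    {"name": "example-1", "solubility_g_per_l": 2.0, "temp_c": 22, "solvent": "H2O",
     "fragments": 1, "elements": {"C", "H", "O"}, "mol_weight": 150.0},
    {"name": "example-2 radical", "solubility_g_per_l": 2.0, "temp_c": 22, "solvent": "H2O",
     "fragments": 1, "elements": {"C", "H"}, "mol_weight": 60.0},
    {"name": "example-3", "solubility_g_per_l": 2.0, "temp_c": 37, "solvent": "H2O",
     "fragments": 1, "elements": {"C", "H", "N"}, "mol_weight": 200.0},
    {"name": "example-4", "solubility_g_per_l": 2.0, "temp_c": 20, "solvent": "H2O",
     "fragments": 1, "elements": {"C", "H", "N", "O", "S"}, "mol_weight": 300.0},
]


def passes_query(rec):
    """Apply the selection criteria from the query slide to one record."""
    return (
        rec["solubility_g_per_l"] > 0                            # has a reported solubility
        and 19 <= rec["temp_c"] <= 25                            # measured at 19-25 °C
        and rec["solvent"] == "H2O"                              # solubility in water
        and rec["fragments"] == 1                                # one contiguous fragment
        and "C" in rec["elements"]                               # contains carbon
        and not rec["name"].lower().endswith("radical")          # not a radical
        and 40 < rec["mol_weight"] < 1000                        # molecular weight range
        and len(rec["elements"]) < 5                             # fewer than 5 elements
    )


selected = [rec["name"] for rec in records if passes_query(rec)]
```

Only the first placeholder record satisfies every criterion; the others fail on name, temperature, and element count respectively.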

Reviewing Solubility Data in Reaxys

Solubility Sources
Reaxys logS is -3.67

Data Processing in KNIME
• Combines compounds with solubilities measured under the desired conditions
• Converts values from g/L to molarity (mol/L) by dividing by the molecular weight
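
As a worked illustration of the conversion, the sketch below turns a solubility in g/L into logS (log10 of the molar solubility). The function name and the example numbers are placeholders chosen for easy arithmetic, not data from Reaxys.

```python
import math


def log_s(solubility_g_per_l, mol_weight_g_per_mol):
    """Return log10 of the molar solubility (mol/L) from a value in g/L."""
    if solubility_g_per_l <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("solubility and molecular weight must be positive")
    molarity = solubility_g_per_l / mol_weight_g_per_mol  # g/L divided by g/mol gives mol/L
    return math.log10(molarity)


# Placeholder compound: 1.8 g/L with a molecular weight of 180 g/mol is 0.01 mol/L.
example_log_s = log_s(1.8, 180.0)
```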

• The workflow is used with data from Reaxys and from the Huuskonen paper
• It uses R and stepwise multiple regression
• Results and error of prediction appear in a spreadsheet
Model Making Workflow

• Full compound set, no further filtering
• 3590 compounds
• Standard error of prediction: 1.1 log units
• Not spectacular, but useful
• Training set covers a larger range of diversity than is used in most models
• r² = 0.56
Initial Model and Prediction Result is OK-ish
[Figure: scatter plot of predicted logS and experimental logS; both axes run from -12 to 4]
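
For readers who want to reproduce this kind of summary, the sketch below computes a prediction error and r² from paired experimental and predicted logS values. It assumes the error is a root-mean-square error and r² is the coefficient of determination; the slides do not state the exact definitions used, and the numbers are made up.

```python
import math


def prediction_metrics(experimental, predicted):
    """Return (root-mean-square error, coefficient of determination)."""
    n = len(experimental)
    residuals = [e - p for e, p in zip(experimental, predicted)]
    ss_res = sum(r * r for r in residuals)                    # residual sum of squares
    mean_exp = sum(experimental) / n
    ss_tot = sum((e - mean_exp) ** 2 for e in experimental)   # total sum of squares
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2


# Made-up logS values for illustration only.
experimental = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]
rmse, r2 = prediction_metrics(experimental, predicted)
```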

Reaxys Solubility Model 2 – Filtering of Source Compounds
Residual standard error: 0.6932 on 2697 degrees of freedom
Multiple R-squared: 0.8099, Adjusted R-squared: 0.8037
F-statistic: 132 on 87 and 2697 DF, p-value: < 2.2e-16
[Figure: scatter plot of predicted logS and experimental logS; both axes run from -12 to 4]
2785 compounds remain after filtering. [Figure: examples of filtered compounds]
The model is better, but it does not improve prediction of the Huuskonen data set.
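
The figures in the regression summary on this slide are mutually consistent, which can be checked with the standard least-squares identities. The sketch assumes an ordinary linear model with an intercept, so that the residual degrees of freedom equal compounds minus predictors minus one; the input numbers are taken from the summary, and the variable names are illustrative.

```python
n_compounds = 2785      # compounds remaining after filtering
n_predictors = 87       # numerator degrees of freedom of the F-statistic
r_squared = 0.8099      # multiple R-squared from the summary

# Residual degrees of freedom, assuming the model includes an intercept.
residual_df = n_compounds - n_predictors - 1

# Adjusted R-squared penalizes the number of predictors.
adjusted_r2 = 1 - (1 - r_squared) * (n_compounds - 1) / residual_df

# F-statistic for the overall regression.
f_statistic = (r_squared / n_predictors) / ((1 - r_squared) / residual_df)
```

This gives 2697 residual degrees of freedom, an adjusted R-squared of about 0.804, and an F-statistic of about 132, in line with the reported values to within rounding.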

Comparison with Other Reports
Clark: fragment-based solubility model using the “PHYSPROP” data set, r² 0.73, SE 0.89
Clark, M. Generalized Fragment-Substructure Based Property Prediction Method.
J. Chem. Inf. Model. 2005, 45 (1), 30–38. DOI: 10.1021/ci049744c

Comparison with other data sets
Huuskonen defined a training set of compounds and solubilities, together with
test sets, that have been used in several comparative studies

• Models made with Huuskonen structures and data, using CDK descriptors
and an R model
• Uses the published training and test sets.
• The models are not as good as those in the publication, which used a different
descriptor computation and statistical method. Standard error 0.67 log units.
Huuskonen Molecule/Data Set Models – (No Reaxys Data)
[Figures: predicted logS and experimental logS, one scatter plot per set; both axes run from -12 to 4]
Set            Trend line     R²
Training Set   y = 0.961x     0.8832
Test Set 1     y = 0.9452x    0.8598
Test Set 2     y = 0.9912x    0.7857

• Same molecule sets, predicted with the model trained on the Reaxys training set
• Standard error 0.98 log units (not bad)
Huuskonen Molecule Sets – Predicted with Model Created from
Reaxys Data Set
[Figures: three scatter plots of predicted logS and experimental logS; both axes run from -12 to 4.
Trend lines: y = 0.8824x (R² = 0.6522), y = 0.8834x (R² = 0.6889), y = 0.8741x (R² = 0.7968)]

• Standard error 3.5 log units
• The issue is likely that many molecules from Reaxys are “outside” the
structural diversity of the Huuskonen data set
• This illustrates a significant issue with modeling: predictions are generally
best when the molecules are similar to the training set.
Reaxys Molecule Set Predicted with Model Created from
Huuskonen Data Set – Not Very Good
[Figure: scatter plot of predicted logS and experimental logS; both axes run from -12 to 4.
Trend line: y = 0.6596x - 1.0645, R² = 0.1459]
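
One common way to flag molecules that fall outside a training set is to look at each query molecule's highest similarity to any training molecule. The sketch below illustrates that idea with the Tanimoto coefficient on fingerprints represented as sets of "on" bit indices. It is not taken from the presentation: the toy fingerprints and the 0.5 cut-off are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def nearest_training_similarity(query_fp, training_fps):
    """Highest similarity between the query and any training-set molecule."""
    return max(tanimoto(query_fp, fp) for fp in training_fps)


# Toy fingerprints for illustration only.
training_fps = [{1, 2, 3, 4}, {2, 3, 5}, {1, 4, 6, 7}]
THRESHOLD = 0.5  # hypothetical cut-off for "similar enough to trust the prediction"

similar_query = nearest_training_similarity({1, 2, 3}, training_fps)
dissimilar_query = nearest_training_similarity({8, 9, 10}, training_fps)
in_domain = [score >= THRESHOLD for score in (similar_query, dissimilar_query)]
```

A prediction for the second query would be treated with caution here, because nothing in the toy training set resembles it.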

• Only a subset of the Huuskonen-set solubilities is found in Reaxys.
• Differences are generally due to multiple measurements being reported with
outliers
Does Reaxys Give The Same Solubility Values as Huuskonen Data
Set? Yes.
[Figure: scatter plot of Reaxys logS and Huuskonen logS; both axes run from -12 to 4.
Trend line: y = 1.0082x - 0.0367, R² = 0.9607]

• A similarity matrix was computed for each data set using fingerprints and the
Tanimoto coefficient
• Molecules in the Huuskonen set are more similar to each other than those in the
Reaxys set
Reaxys Solubility Data Set is Structurally More Diverse
[Figure: normalized fraction of pair-similarity count (0 to 0.2) plotted against similarity value
for the Huuskonen and Reaxys sets; normalized for different data set sizes]
Reaxys has a higher proportion of molecules not similar to others in the set.
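
The diversity comparison described on this slide can be sketched as a histogram of all pairwise Tanimoto similarities, divided by the number of pairs so that sets of different sizes are comparable. The fingerprints below are toy sets of bit indices and the bin count is an arbitrary choice; the presentation does not say which fingerprint type or binning was used.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_histogram(fingerprints, n_bins=10):
    """Fraction of molecule pairs falling in each similarity bin over [0, 1]."""
    if len(fingerprints) < 2:
        raise ValueError("need at least two fingerprints")
    counts = [0] * n_bins
    n_pairs = 0
    for fp_a, fp_b in combinations(fingerprints, 2):
        sim = tanimoto(fp_a, fp_b)
        bin_index = min(int(sim * n_bins), n_bins - 1)  # put sim == 1.0 in the last bin
        counts[bin_index] += 1
        n_pairs += 1
    return [count / n_pairs for count in counts]        # normalize for data set size


# Toy sets: one of close analogues, one of mostly unrelated molecules.
compact_set = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}]
diverse_set = [{1, 2}, {3, 4}, {5, 6}, {1, 7}]
compact_hist = similarity_histogram(compact_set)
diverse_hist = similarity_histogram(diverse_set)
```

A more diverse set puts more of its pairs into the low-similarity bins, which is the pattern the slide reports for Reaxys.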

• Reaxys has solubility data that can be used to create and study predictive
models
• Its data appear to be more diverse than the well-studied “Huuskonen” data set.
• The nature/diversity of the training set is very important.
• The best reported models have the smallest training sets.
• However, these training sets may not be useful for prediction of more diverse
compounds.
• Predictions on the Reaxys set from a model trained on the Huuskonen set are poor.
• Generally good models can predict with a standard error of about 1 log unit – for
compounds similar to training set.
• Question: what is the accuracy of measurement?
• $\partial \log S / \partial c = 1 / (2.303\,c)$, where $c$ is the solubility in g L⁻¹:
logS changes by about 0.4 log units per mg/L at a solubility of 1 mg/L
• Reaxys has a diverse set of structures and solubilities
• Each individual measurement is referenced.
• Good source for model making
What We Learned
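
The measurement-sensitivity estimate on this slide follows from differentiating log10 of the solubility. The sketch below evaluates that derivative at two solubilities to show how quickly the sensitivity falls off. The function name is illustrative; the molecular weight drops out because it only adds a constant to logS.

```python
import math


def dlogs_per_mg_per_l(solubility_g_per_l):
    """Change in logS per 1 mg/L change in solubility (analytic derivative)."""
    per_g_per_l = 1.0 / (math.log(10) * solubility_g_per_l)  # d(log10 c)/dc = 1/(2.303 c)
    return per_g_per_l / 1000.0                              # convert per g/L to per mg/L


sensitivity_at_1_mg_per_l = dlogs_per_mg_per_l(0.001)  # about 0.43 log units per mg/L
sensitivity_at_1_g_per_l = dlogs_per_mg_per_l(1.0)     # about 0.0004 log units per mg/L
```

For poorly soluble compounds, a measurement uncertainty of a milligram per litre is therefore already a sizeable fraction of a log unit.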

• Reaxys is a rich source of data for solubility and other properties.
• One can explore many subsets based on condition, molecule class etc.
• High diversity of molecules – organic, inorganic, peptides etc.
• Reaxys is a good source of data for making predictive models
• It provides not just the value, but the measurement conditions
• Selection of “good” measurements is an important factor in making models
• Reaxys contains hundreds of measured properties!
• Solubility is well studied
• Not as many models available for refractive index, magnetic susceptibility etc.
• Reaxys has only measured solubilities; SciFinder has predicted values
• The effect of the training set on model quality can be seen in this presentation.
• Reaxys Medicinal Chemistry contains thousands of bioassay results on
thousands of targets that can be used for predictive models.
Conclusion
