Creating Solubility Models with Reaxys
Presented by: Dr. Matthew Clark, Elsevier R&D Solutions Services
Date: 19 January 2016
What We Will Learn
• Reaxys has solubility data that can be used to create and study predictive models.
• The data appear to be more diverse than the well-studied “Huuskonen” data set.
• The nature and diversity of the training set are very important for predictive models.
• The best reported models have the smallest training sets.
• However, these training sets may not be useful for predicting more diverse compounds.
• Predictions of the Reaxys set by a model trained on the Huuskonen set are poor.
• Reaxys has a diverse set of structures and solubilities.
• Each individual measurement is referenced.
• Reaxys is a good source of data for model making.
Reaxys Property Data
• In addition to the well-known reactions and compounds, Reaxys contains hundreds of different measured properties reported for the compounds.
• Each property is associated with a reference.
• Each property has a “cluster” of values, such as measurement temperature, pressure, and solvents, describing the conditions of the measurement.
• In many cases, multiple measurements of a particular value are reported by different authors at different times.
• A mean, median, and standard deviation can be computed for the value. Each value is associated with a reference.
• One can combine these data with the chemical structures of the compounds to make structure-based predictive models for these properties.
• One can then predict the value for new or proposed compounds from their chemical structures.
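As an illustration of summarizing several reported measurements of one value, here is a minimal Python sketch. The numbers and reference labels are invented for the example and do not come from Reaxys.

```python
from statistics import mean, median, stdev

# Hypothetical replicate solubility values (g/L) for one compound,
# each paired with a placeholder reference label.
measurements = [
    (0.21, "ref A"),
    (0.25, "ref B"),
    (0.23, "ref C"),
    (1.90, "ref D"),  # a possible outlier
]

def summarize(values):
    """Return count, mean, median and sample standard deviation."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values) if len(values) > 1 else 0.0,
    }

summary = summarize([value for value, _ in measurements])
```

With an outlier present, the median stays close to the bulk of the values while the mean is pulled upward, which is one reason to inspect both before choosing a training value.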
Reaxys Property Data is Grouped with Conditions
You can select the measurement conditions relevant to your model.
Boiling Point
• Boiling Point, °C (BP.BP)
• Pressure, Torr (BP.P)
Refractive Index
• Refractive Index (RI.RI)
• Wavelength, nm (RI.W)
• Temperature, °C (RI.T)
Dielectric Constant
• Dielectric Constant (DIC.DIC)
• Frequency, Hz (DIC.F)
• Temperature, °C (DIC.T)
Electrical Moment
• Description (EM.KW)
• Moment, D (EM.EM)
• Temperature, °C (EM.T)
• Method (EM.MET)
• Solvent (EM.SOL)
Enthalpy of Formation
• Enthalpy of Formation, J mol⁻¹ (HFOR.HFOR)
• Temperature, °C (HFOR.T)
• Pressure, Torr (HFOR.P)
Solubility (MCS)
• Solubility, g L⁻¹ (SLB.SLB)
• Saturation (SLB.SAT)
• Temperature, °C (SLB.T)
• Solvent (SLB.SOL)
• Ratio of Solvents (SLB.RAT)
Model Making Tools
• There are several ways to access this data:
• API (Application Programming Interface) allows direct access
• Download a tagged SD file from Reaxys after searching
• “Hop in to” links go to the data automatically
• The Reaxys API allows direct access to the data
• XML-based interface
• KNIME and Pipeline Pilot are supported.
• Queries need to specify the measurement conditions (temperature, solvent) and the nature of the molecules (organic, single-fragment)
• Form-based query
• “Advanced Query”
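For the tagged SD file route, the data items can be read with a few lines of Python. This is a sketch that relies only on the general SD file layout (a `> <TAG>` header, value lines, a blank line, and `$$$$` between records). The tag names in the sample are assumed to match the Reaxys field codes; a real export may label them differently.

```python
def parse_sd_data_fields(sd_text):
    """Return one dict per SD record, mapping tag name to value text.

    The molblock (structure) part of each record is ignored here.
    """
    records = []
    for chunk in sd_text.split("$$$$"):
        lines = chunk.strip("\n").splitlines()
        if not any(line.strip() for line in lines):
            continue  # skip the empty tail after the last delimiter
        fields = {}
        tag = None
        for line in lines:
            if line.startswith(">") and "<" in line and ">" in line[1:]:
                # Header line such as "> <SLB.SLB>"
                tag = line[line.index("<") + 1 : line.rindex(">")]
                fields[tag] = []
            elif tag is not None:
                if line.strip() == "":
                    tag = None  # a blank line ends the data item
                else:
                    fields[tag].append(line)
        records.append({k: "\n".join(v) for k, v in fields.items()})
    return records

# Hypothetical one-record file with an empty structure block.
sample = """\
mol1
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <SLB.SLB>
0.5

> <SLB.SOL>
H2O

$$$$
"""

records = parse_sd_data_fields(sample)
```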
Solubility Query To Select Data and Molecules
Query term                          Meaning
SLB.SLB > 0                         has a reported solubility
Temperature 19-25                   temperature range of measurement
Solvent 'H2O                        solubility in water
Number of Fragments = 1             only one contiguous fragment
Elements = 'c'                      contains carbon
NOT Chemical Name = '*radical       not a radical
Molecular Weight > 40 AND < 1000    molecular weight range
Number of Elements < 5              fewer than 5 different elements
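The same selection criteria can be expressed as a filter over records that have already been downloaded. The sketch below is not Reaxys query syntax; the `SolubilityRecord` layout and its field names are hypothetical and used only for illustration.

```python
from dataclasses import dataclass

@dataclass
class SolubilityRecord:
    # Hypothetical record layout; field names are illustrative only.
    name: str
    solubility_g_per_l: float
    temperature_c: float
    solvent: str
    n_fragments: int
    elements: frozenset
    mol_weight: float

def passes_query(r):
    """Apply the selection criteria listed on the slide to one record."""
    return (
        r.solubility_g_per_l > 0                          # has a reported solubility
        and 19 <= r.temperature_c <= 25                   # measurement temperature range
        and r.solvent == "H2O"                            # solubility in water
        and r.n_fragments == 1                            # one contiguous fragment
        and "C" in r.elements                             # contains carbon
        and not r.name.lower().endswith("radical")        # not a radical
        and 40 < r.mol_weight < 1000                      # molecular weight range
        and len(r.elements) < 5                           # fewer than 5 different elements
    )

# Invented example records.
ok = SolubilityRecord("compound A", 0.5, 20.0, "H2O", 1,
                      frozenset({"C", "H", "O"}), 180.0)
too_hot = SolubilityRecord("compound B", 0.5, 60.0, "H2O", 1,
                           frozenset({"C", "H", "O"}), 180.0)
radical = SolubilityRecord("methyl radical", 0.5, 20.0, "H2O", 1,
                           frozenset({"C", "H"}), 45.0)

selected = [r.name for r in (ok, too_hot, radical) if passes_query(r)]
```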
Reviewing Solubility Data in Reaxys
Solubility Sources
Reaxys logS is -3.67
Data Processing in KNIME
• Combines compounds with solubility measured under the desired conditions.
• Converts values from g/L to molarity by dividing by molecular weight.
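The unit conversion can be sketched in Python as follows, assuming solubility is given in g/L (the unit of SLB.SLB) and that logS means the base-10 logarithm of the molar solubility. The example numbers are invented.

```python
import math

def log_s(solubility_g_per_l, mol_weight_g_per_mol):
    """Convert a solubility in g/L to logS (log10 of mol/L)."""
    molarity = solubility_g_per_l / mol_weight_g_per_mol
    return math.log10(molarity)

# Hypothetical compound: 1.8 g/L with a molecular weight of 180 g/mol
# is 0.01 mol/L.
example = log_s(1.8, 180.0)
```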
Model Making Workflow
• Used with data from Reaxys and from the Huuskonen paper.
• Uses R and stepwise multiple regression.
• Results and prediction errors appear in a spreadsheet.
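To show the idea behind stepwise multiple regression, here is a small pure-Python sketch of greedy forward selection. It is not the R procedure used in the workflow, and the slides do not say which selection criterion was applied; this version simply adds the descriptor that lowers the residual sum of squares (RSS) most and stops when the relative gain drops below a threshold. It does not handle collinear descriptors. The descriptor names and data are invented.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_rss(columns, y):
    """Least-squares fit of y on an intercept plus columns; return the RSS."""
    design = [[1.0] * len(y)] + columns
    k = len(design)
    xtx = [[sum(p * q for p, q in zip(design[i], design[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(p * q for p, q in zip(design[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    rss = 0.0
    for row in range(len(y)):
        pred = sum(beta[i] * design[i][row] for i in range(k))
        rss += (y[row] - pred) ** 2
    return rss

def forward_stepwise(descriptors, y, min_gain=0.1):
    """Add descriptors one at a time while RSS drops by more than min_gain."""
    chosen = []
    best_rss = fit_rss([], y)
    remaining = list(descriptors)
    while remaining:
        trials = [(fit_rss([descriptors[n] for n in chosen + [name]], y), name)
                  for name in remaining]
        rss, name = min(trials)
        if best_rss - rss <= min_gain * best_rss:
            break
        chosen.append(name)
        remaining.remove(name)
        best_rss = rss
    return chosen, best_rss

# Invented data: y depends on x1 only, plus a small disturbance.
descriptors = {
    "x1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
y = [1.1, 3.1, 4.9, 6.9, 9.0, 11.0]
chosen, rss = forward_stepwise(descriptors, y, min_gain=0.1)
```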
Initial Model and Prediction Result is OK-ish
• Full compound set, no further filtering
• 3590 compounds
• Standard error of prediction 1.1 log units
• Not spectacular, but useful
• Training set covers a larger range of diversity than used in most models
• r² 0.56
[Figure: predicted logS vs. experimental logS; both axes run from -12 to 4]
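The two summary numbers quoted for a model can be computed from paired predicted and experimental values. In this sketch the standard error of prediction is taken to be the root-mean-square error and r² the squared Pearson correlation; both definitions are assumptions, since the slides do not state how the figures were obtained. The data are invented.

```python
import math

def prediction_stats(experimental, predicted):
    """Return (root-mean-square error, squared Pearson correlation)."""
    n = len(experimental)
    rmse = math.sqrt(
        sum((p - e) ** 2 for e, p in zip(experimental, predicted)) / n
    )
    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    cov = sum((e - mean_e) * (p - mean_p)
              for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return rmse, cov ** 2 / (var_e * var_p)

# Invented logS values: every prediction is 0.5 log units too low.
experimental = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.5, -3.5, -4.5]
rmse, r2 = prediction_stats(experimental, predicted)
```

The example shows why both numbers are worth reporting: a constant offset leaves the correlation perfect while the prediction error is still half a log unit.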
Reaxys Solubility Model 2 – Filtering of Source Compounds
Residual standard error: 0.6932 on 2697 degrees of freedom
Multiple R-squared: 0.8099, Adjusted R-squared: 0.8037
F-statistic: 132 on 87 and 2697 DF, p-value: < 2.2e-16
[Figure: predicted logS vs. experimental logS; both axes run from -12 to 4]
• 2785 compounds remain after filtering. [Examples of filtered compounds were shown as structure images.]
• The model is better, but it does not improve prediction of the Huuskonen data set.
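The regression summary above is internally consistent, which can be checked with the standard formulas for adjusted R² and the F-statistic. The Python sketch below takes only the numbers printed in the summary as input.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def f_statistic(r2, n_predictors, resid_df):
    """Overall F-statistic from R-squared and the degrees of freedom."""
    return (r2 / n_predictors) / ((1.0 - r2) / resid_df)

n_predictors = 87                      # model degrees of freedom
resid_df = 2697                        # residual degrees of freedom
n_obs = resid_df + n_predictors + 1    # observations, including the intercept

adj = adjusted_r2(0.8099, n_obs, n_predictors)
f_value = f_statistic(0.8099, n_predictors, resid_df)
```

The observation count recovered from the degrees of freedom is 2785, matching the number of compounds that remain after filtering.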
Comparison with Other Reports
Clark: fragment-based solubility model, r² 0.73, SE 0.89, using the “PHYSPROP” data set.
Matthew Clark, “Generalized Fragment-Substructure Based Property Prediction Method,” J. Chem. Inf. Model., 2005, 45 (1), pp 30–38. DOI: 10.1021/ci049744c
Comparison with Other Data Sets
The Huuskonen data set defines a training set of compounds and solubilities, plus test sets, that have been used in several comparative studies.
Huuskonen Molecule/Data Set Models – (No Reaxys Data)
• Models were made with the Huuskonen structures and data, using CDK descriptors and an R model.
• The published training and test sets were used.
• The models are not as good as in the publication, which used a different descriptor computation and statistical method. Standard error 0.67 log units.
[Figure: three panels of predicted logS vs. experimental logS; both axes run from -12 to 4; fitted lines as listed]
• Training Set: y = 0.961x, R² = 0.8832
• Test Set 1: y = 0.9452x, R² = 0.8598
• Test Set 2: y = 0.9912x, R² = 0.7857
Huuskonen Molecule Sets – Predicted with Model Created from Reaxys Data Set
• Same molecule sets, predicted with a model trained on the Reaxys training set.
• Standard error 0.98 log units, which is not bad.
[Figure: three panels of predicted logS vs. experimental logS; both axes run from -12 to 4; fitted lines in order of appearance]
• y = 0.8824x, R² = 0.6522
• y = 0.8834x, R² = 0.6889
• y = 0.8741x, R² = 0.7968
Reaxys Molecule Set Predicted with Model Created from Huuskonen Data Set – Not Very Good
• Standard error 3.5 log units.
• The likely issue is that many molecules from Reaxys are “outside” the structural diversity of the Huuskonen data set.
• This illustrates a significant issue with modeling: predictions are generally best when the molecules are similar to the training set.
[Figure: predicted logS vs. experimental logS; both axes run from -12 to 4; fitted line y = 0.6596x - 1.0645, R² = 0.1459]
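One simple way to flag molecules that fall outside a training set is a descriptor range check. The slides do not describe applying such a check; the Python sketch below is a basic "bounding box" test, and the descriptor names and values are invented for illustration.

```python
def descriptor_ranges(training_rows):
    """Per-descriptor (min, max) over the training set."""
    names = training_rows[0].keys()
    return {
        n: (min(r[n] for r in training_rows), max(r[n] for r in training_rows))
        for n in names
    }

def out_of_range(query, ranges):
    """Names of descriptors whose value lies outside the training range."""
    return [n for n, (lo, hi) in ranges.items() if not lo <= query[n] <= hi]

# Hypothetical training set with two descriptors.
training = [
    {"mol_weight": 100.0, "logp": 1.0},
    {"mol_weight": 300.0, "logp": 3.5},
]
ranges = descriptor_ranges(training)

inside = out_of_range({"mol_weight": 250.0, "logp": 2.0}, ranges)
outside = out_of_range({"mol_weight": 650.0, "logp": 2.0}, ranges)
```

A molecule with any flagged descriptor is being extrapolated rather than interpolated, so its prediction deserves less trust.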
Does Reaxys Give The Same Solubility Values as Huuskonen Data Set? Yes.
• Only a subset of the solubilities in the Huuskonen set is found in Reaxys.
• Differences are generally due to multiple reported measurements that include outliers.
[Figure: Reaxys logS vs. Huuskonen logS; both axes run from -12 to 4; fitted line y = 1.0082x - 0.0367, R² = 0.9607]
Reaxys Solubility Data Set is Structurally More Diverse
• A similarity matrix was computed for each data set using fingerprints and the Tanimoto coefficient.
• Molecules in the Huuskonen set are more similar to each other than those in the Reaxys set.
[Figure: normalized fraction of pair-similarity count (0 to 0.2) vs. similarity value for the Huuskonen and Reaxys sets; counts are normalized for the different data set sizes]
• Reaxys has a higher proportion of molecules not similar to others in the set.
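The diversity comparison can be sketched in Python as a normalized histogram of pairwise Tanimoto similarities. Fingerprints are represented here as sets of on-bit indices; the fingerprint type and bin width used for the original figure are not stated, so the ones below are assumptions.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similarity_histogram(fingerprints, n_bins=10):
    """Fraction of molecule pairs falling in each similarity bin on [0, 1]."""
    counts = [0] * n_bins
    n_pairs = 0
    for a, b in combinations(fingerprints, 2):
        s = tanimoto(a, b)
        # A similarity of exactly 1.0 goes into the last bin.
        counts[min(int(s * n_bins), n_bins - 1)] += 1
        n_pairs += 1
    return [c / n_pairs for c in counts] if n_pairs else counts

# Invented fingerprints: two identical molecules and one unrelated one.
fingerprints = [{1, 2, 3}, {1, 2, 3}, {4, 5}]
hist = similarity_histogram(fingerprints)
```

Dividing by the number of pairs is what makes histograms from data sets of different sizes comparable.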
What We Learned
• Reaxys has solubility data that can be used to create and study predictive models.
• The data appear to be more diverse than the well-studied “Huuskonen” data set.
• The nature and diversity of the training set are very important.
• The best reported models have the smallest training sets.
• However, these training sets may not be useful for predicting more diverse compounds.
• Predictions of the Reaxys set by a model trained on the Huuskonen set are poor.
• Generally, good models can predict with a standard error of about 1 log unit for compounds similar to the training set.
• Question: what is the accuracy of measurement?
• $\frac{\partial \log S}{\partial\,(\mathrm{g\,L^{-1}})} = \frac{1}{2.303 \cdot \mathrm{g\,L^{-1}}}$, so logS changes by about 0.4 log units per mg/L at a solubility of 1 mg/L.
• Reaxys has a diverse set of structures and solubilities.
• Each individual measurement is referenced.
• Reaxys is a good source of data for model making.
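The sensitivity of logS to measurement error follows from the derivative of the base-10 logarithm, and a short Python sketch makes the numbers concrete. The concentration values are illustrative.

```python
import math

def dlogs_dc(concentration):
    """Derivative of log10(c) with respect to c, i.e. 1 / (2.303 * c)."""
    return 1.0 / (math.log(10) * concentration)

# With concentrations in mg/L: sensitivity at 1 mg/L and at 1000 mg/L.
at_1_mg_per_l = dlogs_dc(1.0)
at_1_g_per_l = dlogs_dc(1000.0)
```

The same absolute error of 1 mg/L therefore matters roughly a thousand times more for a compound with 1 mg/L solubility than for one with 1 g/L.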
Conclusion
• Reaxys is a rich source of data for solubility and other properties.
• One can explore many subsets based on condition, molecule class, etc.
• High diversity of molecules: organic, inorganic, peptides, etc.
• Reaxys is a good source of data for making predictive models.
• It provides not just the value, but also the measurement conditions.
• Selection of “good” measurements is an important factor in making models.
• Reaxys contains hundreds of measured properties.
• Solubility is well studied.
• Not as many models are available for refractive index, magnetic susceptibility, etc.
• Reaxys has only measured solubilities; SciFinder has predicted values.
• This presentation shows the effect of the training set on model quality.
• Reaxys Medicinal Chemistry contains thousands of bioassay results on thousands of targets that can be used for predictive models.
