Modeling Chemical Datasets


  1. Modeling Chemical Datasets, with a focus on regression-based methods (dsdht.wikispaces.com)
  2. Aims
     •  How does the dynamic range of the data being modeled impact the apparent performance of the model?
     •  How does experimental error impact the apparent predictivity of a model?
     •  How can we determine whether a model is applicable to a new dataset?
     •  How should we compare the performance of regression models?
     http://media.johnwiley.com.au/product_data/excerpt/00/11181391/1118139100-4.pdf
  3. Example
     Examine a number of datasets containing measured values for aqueous solubility, and use these datasets to build and evaluate predictive models.
  4. Challenges in modeling solubility
     The aqueous solubility of a compound can vary depending on a number of factors:
     •  Temperature
     •  Purity
     •  Polymorph
  5. Datasets under study
     •  The Huuskonen dataset: 1274 experimental solubility values; one of the first large solubility datasets.
     •  The JCIM dataset: 94 experimental solubility values (2008).
     •  The PubChem dataset (AID1996): a randomly selected subset of 1000 measured solubility values, drawn from a set of 58,000 values that were experimentally determined using chemiluminescent nitrogen detection (CLND).
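  6. Formula
     LogS = log10((solubility in µg/mL) / (1000.0 × MW))
     where MW is the molecular weight in g/mol, so that LogS is the base-10 logarithm of the solubility in mol/L.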
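The unit conversion in the formula above can be sketched in a few lines of Python. The function name `log_s` and the example compound are illustrative assumptions, not part of the original slides.

```python
import math


def log_s(solubility_ug_per_ml: float, mol_weight: float) -> float:
    """Convert a solubility in µg/mL to LogS (log10 of mol/L)."""
    if solubility_ug_per_ml <= 0 or mol_weight <= 0:
        raise ValueError("solubility and molecular weight must be positive")
    # µg/mL equals mg/L; dividing by 1000 gives g/L, dividing by MW gives mol/L.
    return math.log10(solubility_ug_per_ml / (1000.0 * mol_weight))


# Hypothetical compound: 180 µg/mL at MW 180 g/mol is 1e-3 mol/L,
# so LogS should come out close to -3.
example_log_s = log_s(180.0, 180.0)
print(round(example_log_s, 3))
```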
  7. Solubility Comparison
     [Figure: a boxplot comparison of LogS for the three datasets]
  8. Requirements for a predictive model
     •  Reliable experimental data
     •  Sets of molecular descriptors
     •  Statistical or machine-learning methods
  9. Types of Models
     Classification model:
     •  Imposing cutoff points introduces "edge effects". Consider a two-class system with a cutoff of 100 μM: a value of 99 μM will be considered insoluble, while a value of 101 μM will be considered soluble, although the two measurements are nearly identical.
     •  Another difficulty with classification models is that they provide limited direction for improving the properties of a compound.
  10. Types of Models
      Regression model:
      •  It is difficult to create a regression model from data with a limited dynamic range.
      •  A limited dynamic range leads to an unreliable model.
  11. Evaluating a predictive model
      •  Pearson's r: the correlation coefficient, commonly reported as Pearson's r or as its square, r^2. Values of r can vary between −1 and 1.
      •  Kendall's tau: a limitation of Pearson's r is that it is sensitive to outliers and to the distribution of the underlying data. Kendall's tau instead uses the rank order of the values.
      •  RMSD: for N paired values X and Y, RMSD = sqrt( (1/N) Σ (X_i − Y_i)^2 ).
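As a rough illustration of the three measures listed above, the sketch below implements them in plain Python. The Kendall function is the simple tau-a variant (ties count as neither concordant nor discordant), which is an assumption; the slides do not say which variant was used. The data values are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


def rmsd(x, y):
    """Root-mean-square deviation between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical experimental and predicted LogS values.
experimental = [-6.0, -4.5, -3.2, -2.0, -1.1]
predicted = [-5.5, -4.8, -3.0, -2.6, -0.9]
print(pearson_r(experimental, predicted),
      kendall_tau(experimental, predicted),
      rmsd(experimental, predicted))
```

Because Kendall's tau depends only on rank order, it is unchanged by any monotonic rescaling of the predictions, whereas Pearson's r and RMSD are not.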
  12. Steps involved in building a predictive model
      •  Integrate the experimental data and molecular descriptors
      •  Divide the data into training and test sets
      •  Build a model from the training set
      •  Use this model to predict the test set
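The four steps above can be sketched end to end in plain Python. This is a toy under stated assumptions: the data are synthetic, there is a single made-up descriptor, and a one-variable least-squares line stands in for the random forest model used in the slides. The helper names `train_test_split` and `fit_line` are hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(train):
    """Ordinary least-squares fit of LogS against a single descriptor."""
    n = len(train)
    mx = sum(d for d, _ in train) / n
    my = sum(s for _, s in train) / n
    sxx = sum((d - mx) ** 2 for d, _ in train)
    slope = sum((d - mx) * (s - my) for d, s in train) / sxx
    return slope, my - slope * mx


# Step 1: integrated table of (descriptor value, measured LogS); synthetic data.
data_rng = random.Random(42)
rows = [(float(d), -0.8 * d + 0.5 + data_rng.gauss(0.0, 0.3)) for d in range(1, 21)]

# Step 2: divide into training and test sets.
train, test = train_test_split(rows)

# Step 3: build a model from the training set only.
slope, intercept = fit_line(train)

# Step 4: use the model to predict the held-out test set.
predictions = [slope * d + intercept for d, _ in test]
print(list(zip((s for _, s in test), predictions)))
```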
  13. Random forest model
      The dynamic range in a dataset can have a large impact on the apparent correlation between experimental and predicted activity.
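A small simulation can make the dynamic-range effect concrete. In the sketch below, the "predictions" are synthetic experimental values plus Gaussian noise with a fixed spread; the same predictions are then scored over the full range and over a one-log-unit window. All numbers are illustrative assumptions, not values from the slides.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)
# Synthetic LogS values spanning eight log units.
experimental = [rng.uniform(-8.0, 0.0) for _ in range(2000)]
# Synthetic predictions with the same error everywhere (sd 0.7 log units).
predicted = [v + rng.gauss(0.0, 0.7) for v in experimental]

# Keep only compounds whose experimental value lies in a one-log-unit window.
narrow = [(e, p) for e, p in zip(experimental, predicted) if -5.0 <= e <= -4.0]

r_wide = pearson_r(experimental, predicted)
r_narrow = pearson_r([e for e, _ in narrow], [p for _, p in narrow])
print(r_wide, r_narrow)
```

The prediction error is identical in both cases; only the spread of the experimental values changes, and the correlation over the narrow window is much lower.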
  14. Experimental Error and Model Performance
      •  Every experimental data point has an error associated with it. If we measure the LogS of a compound as −6 and that data point has an error of 0.3 log units, the actual value could be anywhere between −6.3 and −5.7.
      •  Brown examined the relationship between experimental error and model performance.
      •  Gaussian-distributed random values were added to the data to simulate experimental errors.
      •  The correlation between the measured values and the same values with simulated error is then measured.
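The simulation described above can be approximated as follows. This is a minimal sketch, not the original analysis: the "measured" values are synthetic, and the function name `max_correlation`, the number of trials, and the seeds are assumptions.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_correlation(values, error_sd, trials=200, seed=0):
    """Mean correlation between values and copies with Gaussian error added."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [v + rng.gauss(0.0, error_sd) for v in values]
        total += pearson_r(values, noisy)
    return total / trials


# Synthetic stand-in for a set of measured LogS values.
data_rng = random.Random(7)
measured = [data_rng.uniform(-8.0, 0.0) for _ in range(200)]

# Estimated correlation ceiling at each assumed experimental error (log units).
ceilings = {sd: max_correlation(measured, sd) for sd in (0.3, 0.5, 1.0)}
print(ceilings)
```

The resulting value can be read as an estimate of the best correlation any model could achieve against data carrying that much experimental error.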
  15. Experimental Error and Model Performance
      •  The table shows the maximum possible correlation for each of the three solubility datasets we have been examining when experimental errors of 0.3, 0.5, and 1.0 log units are considered.
      •  The impact of error is greatest for a dataset like PubChem.
  16. Model Applicability
      •  Models often perform poorly on molecules that bear little resemblance to those in the training set.

      Similarity of each test set
      Dataset          Mean   Median
      Huuskonen_Test   0.76   0.78
      JCIM             0.74   0.62
      Pubchem          0.56   0.56

      Dataset          R2     Kendall   RMS Error
      Huuskonen_Test   0.92   0.82      0.58
      JCIM             0.58   0.59      0.83
      Pubchem          0.11   0.22      1.12
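One way to quantify how closely a test set resembles a training set is sketched below. The slides do not state how similarity was computed; this example assumes Tanimoto similarity on fingerprints represented as sets of "on" bit indices, and scores each test molecule by its most similar training molecule. The fingerprints are toy values.

```python
from statistics import mean, median


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_training_similarity(test_fps, train_fps):
    """For each test fingerprint, similarity to its closest training fingerprint."""
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]


# Toy fingerprints (sets of "on" bit indices); purely illustrative.
train_fps = [{1, 2, 3, 4}, {2, 3, 5, 8}]
test_fps = [{1, 2, 3, 4}, {2, 3, 5}, {9, 10}]

similarities = nearest_training_similarity(test_fps, train_fps)
print(mean(similarities), median(similarities))
```

A low mean or median for a test set suggests that many of its molecules fall outside the region of chemical space the model was trained on.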
  17. Comparing Predictive Models
      •  When comparing correlation coefficients, we must consider not only the value of each coefficient but also the confidence interval around it.
      •  If the confidence intervals of two correlations overlap, we cannot claim that one predictive model is superior to the other.
      •  For a subset of 25 compounds, the confidence intervals overlap, so we cannot say that one correlation is superior to the other.
      •  For a subset of 50 compounds, only a very small difference separates the bounds of the two 95% confidence intervals.
      •  For a subset of 100 compounds, there is clear separation between the confidence intervals, which implies a clear separation between the correlation coefficients.
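A confidence interval for a correlation coefficient can be approximated with the Fisher z-transformation, as sketched below. Whether the slides used this exact method is an assumption, and the two correlation values (0.7 and 0.9) are hypothetical; only the subset sizes 25, 50, and 100 come from the text above.

```python
import math


def pearson_ci(r: float, n: int, z_crit: float = 1.96):
    """Approximate confidence interval for Pearson's r via the Fisher z-transform.

    z_crit = 1.96 corresponds to a 95% interval; requires n > 3.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def intervals_overlap(a, b) -> bool:
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


# Two hypothetical models with r = 0.7 and r = 0.9 on the same subset size.
for n in (25, 50, 100):
    ci_a = pearson_ci(0.7, n)
    ci_b = pearson_ci(0.9, n)
    print(n, ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

The intervals narrow as n grows, so the same pair of correlation values can be indistinguishable on a small subset and clearly separated on a larger one.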
  18. References
      •  http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118139100.html
      •  https://github.com/PatWalters/cheminformaticsbook
