
Modeling Chemical Datasets


Published in: Education


  1. 1. Modeling chemical datasets, with a focus on regression-based methods
  2. 2. Aims
     • How does the dynamic range of the data being modeled impact the apparent performance of the model?
     • How does experimental error impact the apparent predictivity of a model?
     • How can we determine whether a model is applicable to a new dataset?
     • How should we compare the performance of regression models?
     http://-4.pdf
  3. 3. Example
     Examine a number of datasets containing measured values for aqueous solubility, and use these datasets to build and evaluate predictive models.
  4. 4. Challenges in modeling solubility
     The aqueous solubility of a compound can vary depending on a number of factors:
     • Temperature
     • Purity
     • Polymorph
  5. 5. Datasets under study
     • The Huuskonen dataset: 1274 experimental solubility values; one of the first large solubility datasets.
     • The JCIM dataset: 94 experimental solubility values (2008).
     • The PubChem dataset (AID1996): a randomly selected subset of 1000 measured solubility values, drawn from a set of 58,000 values that were experimentally determined using chemiluminescent nitrogen detection (CLND).
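  6. 6. Formula
     LogS = log10((solubility in µg/mL) / (1000.0 × MW))
     Here MW is the molecular weight in g/mol, so the argument of the logarithm is the solubility in mol/L.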
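
The conversion on this slide can be sketched in a few lines of Python. The function name `log_s` and the example compound are illustrative assumptions, not taken from the slides.

```python
import math


def log_s(solubility_ug_per_ml: float, mol_weight: float) -> float:
    """Convert a solubility in µg/mL to LogS (log10 of the molar solubility).

    µg/mL equals mg/L, so dividing by 1000.0 gives g/L, and dividing by
    the molecular weight (g/mol) gives mol/L.
    """
    if solubility_ug_per_ml <= 0 or mol_weight <= 0:
        raise ValueError("solubility and molecular weight must be positive")
    return math.log10(solubility_ug_per_ml / (1000.0 * mol_weight))


# Hypothetical compound: 50 µg/mL with a molecular weight of 250 g/mol.
example = log_s(50.0, 250.0)
```

A compound whose solubility in µg/mL is exactly 1000 times its molecular weight dissolves at 1 mol/L, so its LogS is 0.
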
  7. 7. Solubility Comparison
     [Figure: a boxplot comparison of Log S for the three datasets]
  8. 8. Requirements for a predictive model
     • Reliable experimental data
     • Sets of molecular descriptors
     • Statistical or machine-learning methods
  9. 9. Types of Models
     Classification model:
     • Choosing cutoff points introduces "edge effects". Consider a two-class system with a cutoff of 100 µM: a value of 99 µM is considered insoluble, while a value of 101 µM is considered soluble, although the two values differ by only about 2%.
     • Another difficulty with classification models is that they provide limited direction for improving the properties of a compound.
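
The edge effect described above can be illustrated with a short sketch. The 100 µM cutoff and the two values come from the slide; the function name and the choice to treat a value exactly at the cutoff as soluble are assumptions made for the example.

```python
import math

CUTOFF_UM = 100.0  # two-class cutoff from the slide, in µM


def solubility_class(value_um: float) -> str:
    # Assumption: a value exactly at the cutoff counts as soluble.
    return "soluble" if value_um >= CUTOFF_UM else "insoluble"


low, high = 99.0, 101.0
labels = (solubility_class(low), solubility_class(high))

# The same two measurements on a log10 scale, as a regression model would see them.
log_gap = math.log10(high) - math.log10(low)
```

The two compounds land in opposite classes even though their log-scale difference is far smaller than the experimental errors discussed later in the deck.
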
  10. 10. Types of Models
     Regression model:
     • It is difficult to create a regression model from data with a limited dynamic range.
     • A limited dynamic range leads to an unreliable model.
  11. 11. Evaluating a predictive model
     • Pearson's r: the Pearson correlation coefficient, commonly reported as r or as its square, r^2. Values of r range from −1 to 1.
     • Kendall's tau: a drawback of Pearson's r is that it is sensitive to outliers and to the distribution of the underlying data. Kendall's tau instead uses the rank order of the values.
     • RMSD: for N paired values X and Y, the root-mean-square deviation is
       $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i - Y_i)^2}$
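
The three measures above can be sketched with the Python standard library alone. This is a minimal illustration rather than the implementation used for the slides: the function names are assumptions, the Kendall variant shown is tau-a (ties count as neither concordant nor discordant), and the numbers are made up.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


def rmsd(x, y):
    """Root-mean-square deviation between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up experimental and predicted LogS values.
experimental = [-6.0, -4.5, -3.2, -2.0, -1.1]
predicted = [-5.4, -4.9, -3.0, -2.6, -0.8]

# A perfectly ordered series with one outlier: the ranks are unaffected,
# but the linear correlation is pulled down.
ordered = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]
```

Comparing `kendall_tau(ordered, with_outlier)` with `pearson_r(ordered, with_outlier)` shows the sensitivity to outliers mentioned on the slide: the rank-based measure stays at 1 while Pearson's r drops.
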
  12. 12. Steps involved in building a predictive model
     • Integrate the experimental data and molecular descriptors
     • Divide the data into training and test sets
     • Build a model from the training set
     • Use this model to predict the test set
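
The four steps can be sketched end to end as follows. To stay dependency-free, this example swaps in a one-descriptor least-squares line in place of the random forest used later in the deck; the compound names, descriptor values, and LogS values are all hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split off a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    return slope, my - slope * mx


# Step 1: integrate experimental data and descriptors (hypothetical values).
descriptor = {"c1": 0.5, "c2": 1.0, "c3": 1.5, "c4": 2.0,
              "c5": 2.5, "c6": 3.0, "c7": 3.5, "c8": 4.0}
measured_log_s = {"c1": -1.1, "c2": -1.4, "c3": -2.1, "c4": -2.4,
                  "c5": -3.2, "c6": -3.4, "c7": -4.1, "c8": -4.4}
rows = [(descriptor[k], measured_log_s[k]) for k in descriptor if k in measured_log_s]

# Step 2: divide the data into training and test sets.
train, test = train_test_split(rows)

# Step 3: build a model from the training set only.
slope, intercept = fit_line(train)

# Step 4: use the model to predict the held-out test set.
predictions = [slope * x + intercept for x, _ in test]
```

Keeping the test compounds out of `fit_line` is the point of step 2: the predictions in step 4 are only informative if the model has never seen those measurements.
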
  13. 13. Random forest model
     The dynamic range in a dataset can have a large impact on the apparent correlation between experimental and predicted activity.
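
The effect of dynamic range on apparent correlation can be reproduced with synthetic data. The sketch below is not the random forest model from the slides: it assumes predictions with a constant Gaussian error of 0.7 log units, then compares the correlation over a wide range with the correlation inside a one-log-unit window. The range, error size, and seed are arbitrary choices.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)

# Synthetic "experimental" LogS values spanning ten log units.
experimental = [rng.uniform(-10.0, 0.0) for _ in range(2000)]
# Synthetic "predictions": the same values plus a fixed-size random error.
predicted = [v + rng.gauss(0.0, 0.7) for v in experimental]

r_full = pearson_r(experimental, predicted)

# The same predictions, restricted to compounds within one log unit.
narrow = [(e, p) for e, p in zip(experimental, predicted) if -5.0 <= e <= -4.0]
r_narrow = pearson_r([e for e, _ in narrow], [p for _, p in narrow])
```

The prediction error is identical in both cases, yet `r_narrow` comes out far lower than `r_full`, which is why a correlation coefficient should be read alongside the range of the data it was computed on.
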
  14. 14. Experimental Error and Model Performance
     • Every experimental data point has an error associated with it. If we measure the Log S of a compound as −6 and that data point has an error of 0.3 log units, the actual value could be anywhere between −6.3 and −5.7.
     • Brown examined the relationship between experimental error and model performance.
     • Gaussian-distributed random values were added to the data to simulate experimental error.
     • The correlation between the measured values and the same values with simulated error was then measured.
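
The simulation described above can be sketched as follows. The synthetic LogS values, the number of trials, and the seeds are assumptions made for illustration; the error levels of 0.3, 0.5, and 1.0 log units are the ones quoted in the deck.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_correlation(values, error_sd, trials=20, seed=0):
    """Average correlation between values and noisy copies of themselves."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [v + rng.gauss(0.0, error_sd) for v in values]
        total += pearson_r(values, noisy)
    return total / trials


# Synthetic stand-in for a measured dataset (not one of the three in the deck).
data_rng = random.Random(1)
log_s_values = [data_rng.gauss(-3.0, 2.0) for _ in range(500)]

results = {sd: max_correlation(log_s_values, sd) for sd in (0.3, 0.5, 1.0)}
```

Because the noisy copy stands in for a repeat measurement, the resulting correlation is an estimate of the best agreement any model could be expected to show with data of that quality.
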
  15. 15. Experimental Error and Model Performance
     • The table shows the maximum possible correlation for each of the three solubility datasets we have been examining when experimental errors of 0.3, 0.5, and 1.0 log units are considered.
     • The impact of error is greatest for a dataset like PubChem.
  16. 16. Model Applicability
     • Models often perform poorly on molecules that bear little resemblance to those in the training set.

     Similarity of Each Test Set
     Dataset          Mean   Median
     Huuskonen_Test   0.76   0.78
     JCIM             0.74   0.62
     PubChem          0.56   0.56

     Dataset          R2     Kendall   RMS Error
     Huuskonen_Test   0.92   0.82      0.58
     JCIM             0.58   0.59      0.83
     PubChem          0.11   0.22      1.12
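
The slide does not say how the similarity of each test set was computed. One common approach, assumed here, is to take each test molecule's highest Tanimoto similarity to any training molecule and then summarize those values with the mean and median. The fingerprints below are hypothetical sets of feature indices.

```python
from statistics import mean, median


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_training_similarity(test_fps, train_fps):
    """For each test molecule, the similarity to its closest training molecule."""
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]


# Hypothetical fingerprints, each a set of "on" bit indices.
train = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5, 8}), frozenset({1, 4, 6, 7})]
test = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5, 9}), frozenset({10, 11, 12})]

sims = nearest_training_similarity(test, train)
summary = (mean(sims), median(sims))
```

A test set with low mean and median values under a measure like this lies largely outside the chemistry the model was trained on, which is consistent with the weaker performance reported for it in the second table.
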
  17. 17. Comparing Predictive Models
     • When comparing correlation coefficients, we must consider not only their values but also the confidence intervals around them.
     • If the confidence intervals of two correlations overlap, we cannot claim that one predictive model is superior to the other.
     • For a subset of 25 compounds, the confidence intervals overlap, so we cannot say that one correlation is superior to the other.
     • For a subset of 50 compounds, there is only a very small difference between the upper bound of one 95% confidence interval and the lower bound of the other.
     • For a subset of 100 compounds, there is clear separation between the confidence intervals, which implies a clear separation between the correlation coefficients.
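
The slides do not state how the confidence intervals were obtained. A standard choice, assumed in this sketch, is the Fisher z-transformation, which gives an approximate 95% interval for a Pearson correlation from its value and the number of compounds. The two correlations compared below (0.7 and 0.9) are hypothetical.

```python
import math


def pearson_ci(r: float, n: int, z_crit: float = 1.96):
    """Approximate confidence interval for Pearson's r via the Fisher transform."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                      # transform r to an approximately normal scale
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


def intervals_overlap(a, b) -> bool:
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical correlations for two models evaluated on the same compounds.
r_model_a, r_model_b = 0.7, 0.9

overlap_by_size = {
    n: intervals_overlap(pearson_ci(r_model_a, n), pearson_ci(r_model_b, n))
    for n in (25, 50, 100)
}
```

The interval width shrinks roughly with the square root of the number of compounds, so the same pair of correlations can be indistinguishable on a small subset and clearly separated on a larger one.
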
  18. 18. References
     • http:// productCd-1118139100.html
     • https:// cheminformaticsbook