Structure-Activity Relationships and Networks: A Generalized Approach to Exploring Structure-Activity Landscapes

Structure-Activity Relationships and Networks: A Generalized Approach to Exploring Structure-Activity Landscapes Presentation Transcript

  • 1. Structure-Activity Relationships and Networks: A Generalized Approach to Exploring Structure-Activity Landscapes
      Rajarshi Guha, NIH Chemical Genomics Center / NIH Center for Translational Therapeutics
      March 29, 2011
  • 2. NIH Chemical Genomics Center
      • Founded 2004 as part of NIH Roadmap Molecular Libraries Initiative
          – NCGC staffed with 90+ scientists: biologists, chemists, informaticians, engineers
          – Post-doc program
      • Mission
          – MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
              • Complements individual investigator-initiated research programs
              • Enables "pharma-level" HTS and early chemical optimization
          – Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
          – New paradigms & applications of HTS for chemical biology / chemical genomics
      • All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide
          – 75% NIH extramural, 10% NIH intramural, 15% Foundations/Research Consortia/Pharma/Biotech
  • 3. NCGC Project Diversity
      [Figure panels: (A) Disease areas; (B) Target types; (C) Detection methods]
  • 4. qHTS: High Throughput Dose Response
      • Assay concentration ranges over 4 logs (high: ~100 μM)
      • 1536-well plates, inter-plate dilution series
      • Assay volumes 2–5 μL
      • Automated concentration-response data collection, ~1 CRC/sec
      • Informatics pipeline: automated curve fitting and classification; 300K samples
  • 5. Background
      • Cheminformatics methods
          – QSAR, diversity analysis, virtual screening, fragments, polypharmacology, networks
      • More recently
          – RNAi screening, high content imaging
      • Extensive use of machine learning
      • All tied together with software development
          – User-facing GUI tools
          – Low level programmatic libraries
      • Believer & practitioner of Open Source
  • 6. Outline
      • Structure-activity relationships
      • Characterizing activity cliffs
      • Working with the structure-activity landscape
  • 7. Structure Activity Relationships
      • Similar molecules will have similar activities
      • Small changes in structure will lead to small changes in activity
      • One implication is that SARs are additive
      • This is the basis for QSAR modeling
      Martin, Y.C. et al., J. Med. Chem., 2002, 45, 4350–4358
  • 8. Exceptions Are Easy to Find
      [Figure: four structurally similar compounds with Ki = 39.0 nM, 1.8 nM, 10.0 nM and 1.0 nM]
      Tran, J.A. et al., Bioorg. Med. Chem. Lett., 2007, 15, 5166–5176
  • 9. Structure Activity Landscapes
      • Rugged gorges or rolling hills?
          – Small structural changes associated with large activity changes represent steep slopes in the landscape
          – But traditionally, QSAR assumes gentle slopes
          – Machine learning is not very good for special cases
      Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
  • 10. Structure Activity Landscapes
  • 11. Characterizing the Landscape
      • A cliff can be numerically characterized
      • Structure Activity Landscape Index (SALI)
        $$\mathrm{SALI}_{i,j} = \frac{|A_i - A_j|}{1 - \mathrm{sim}(i,j)}$$
      • Cliffs are characterized by elements of the matrix with very large values
      Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
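As a minimal sketch of the SALI definition on this slide, the Python below computes the pairwise matrix from a list of activities and a precomputed similarity matrix. The function name `sali_matrix` and the toy numbers are illustrative assumptions, not taken from the talk. Pairs with similarity 1.0 are given an infinite value when their activities differ and 0.0 when they match, which is one possible convention.

```python
import math

def sali_matrix(activities, sim):
    """Return the symmetric SALI matrix for n compounds.

    activities: list of n activity values (e.g. pIC50).
    sim: n x n nested list of pairwise similarities in [0, 1].
    """
    n = len(activities)
    sali = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = abs(activities[i] - activities[j])
            denom = 1.0 - sim[i][j]
            if denom > 0.0:
                value = diff / denom
            else:
                # Identical representations: infinite cliff unless activities match
                # (an assumed convention, not stated on the slide).
                value = math.inf if diff > 0.0 else 0.0
            sali[i][j] = sali[j][i] = value
    return sali

# Hypothetical toy data: three compounds, the first two very similar.
acts = [7.0, 9.0, 7.5]
sims = [[1.0, 0.9, 0.5],
        [0.9, 1.0, 0.5],
        [0.5, 0.5, 1.0]]
m = sali_matrix(acts, sims)
```

In this toy case the largest entry belongs to the pair (0, 1), which combines high similarity with a two-log activity difference.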
  • 12. Visualizing the SALI Matrix
  • 13. Fingerprints
      [Example bit string: 1 0 1 1 0 0 0 1 0]
      • Lots of types of fingerprints
      • Each bit indicates the presence or absence of a structural feature
      • Length can vary from 166 to 4096 bits or more
      • Fingerprints are usually compared using the Tanimoto metric
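The Tanimoto comparison mentioned on this slide can be sketched in a few lines of Python. The first fingerprint is the bit pattern shown on the slide; the second one and the function name `tanimoto` are made up for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two equal-length lists of 0/1 bits."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)     # bits set in both
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)    # bits set in at least one
    # Two all-zero fingerprints are treated as identical (an assumed convention).
    return both / either if either else 1.0

fp1 = [1, 0, 1, 1, 0, 0, 0, 1, 0]  # bit pattern shown on the slide
fp2 = [1, 0, 0, 1, 0, 1, 0, 1, 0]  # hypothetical second fingerprint
similarity = tanimoto(fp1, fp2)    # 3 shared bits out of 5 set in either
```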
  • 14. Varying Fingerprint Methods
      [Figure: density of Tanimoto similarity for three fingerprints: BCI 1052 bit, MACCS 166 bit, CDK 1024 bit]
      • Shorter fingerprints will lead to more "similar" pairs
      • Requires a higher cutoff to focus on significant cliffs
  • 15. Varying the Similarity Metric
  • 16. Different Activity Representations
      • Using the Hill parameters from a dose-response curve represents richer data than a single IC50
        $$P = \{S_0,\ S_{\mathrm{inf}},\ AC_{50},\ H\}, \qquad \mathrm{SALI}_{i,j} = \frac{d(P_i, P_j)}{1 - \mathrm{sim}(i,j)}$$
      [Figure: dose-response curve of activity versus concentration, marking S0, Sinf, the 50% level and AC50]
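A minimal sketch of this variant follows. The slide does not say which distance d is used, so Euclidean distance over the four curve parameters is an assumption here, as are the function name `hill_sali` and the parameter ordering. In practice the parameters sit on very different scales, so some scaling would probably be needed first.

```python
import math

def hill_sali(params_i, params_j, similarity):
    """SALI variant using a distance between curve-parameter vectors.

    params_i, params_j: sequences such as (S0, Sinf, AC50, H).
    similarity: structural similarity in [0, 1); a value of 1.0 would
    divide by zero and is left for the caller to handle.
    """
    d = math.dist(params_i, params_j)  # Euclidean distance: an assumed choice of d
    return d / (1.0 - similarity)

# Hypothetical parameter vectors that differ in the first two entries only.
value = hill_sali((0.0, 0.0, 0.0, 0.0), (3.0, 4.0, 0.0, 0.0), 0.5)
```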
  • 17. Visualizing SALI Values
      • Alternatives?
          – A heatmap is an easy to understand visualization
          – Coupled with brushing, can be a handy tool
          – A more flexible approach is to consider a network view of the matrix
      • The SALI graph
          – Compounds are nodes
          – Nodes i,j are connected if SALI(i,j) > X
          – Only display connected nodes
  • 18. Visualizing SALI Values
      • The SALI graph
          – Compounds are nodes
          – Nodes i,j are connected if SALI(i,j) > X
          – Only display connected nodes
      [Figure: SALI graph with numbered compound nodes]
  • 19. Varying the Cutoff
      • The cutoff controls the complexity of the graph
      • Higher cutoffs will highlight the most significant activity cliffs
      [Figure: SALI graphs at Cutoff = 90%, Cutoff = 50% and Cutoff = 20%]
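The graph construction and the effect of the cutoff can be sketched as below. How the percentage cutoff maps onto a SALI threshold is not spelled out on the slides; here it is assumed to be a fraction of the range between the smallest and largest finite SALI value. The names `sali_graph` and `connected_nodes` and the toy matrix are hypothetical.

```python
def sali_graph(sali, cutoff_fraction):
    """Return edges (i, j, value) whose SALI value exceeds the cutoff.

    sali: symmetric n x n nested list of finite SALI values (n >= 2).
    cutoff_fraction: position of the threshold X between the smallest and
    largest off-diagonal value, from 0.0 to 1.0 (an assumed scaling).
    """
    n = len(sali)
    pairs = [(i, j, sali[i][j]) for i in range(n) for j in range(i + 1, n)]
    values = [v for _, _, v in pairs]
    lo, hi = min(values), max(values)
    threshold = lo + cutoff_fraction * (hi - lo)
    return [(i, j, v) for i, j, v in pairs if v > threshold]

def connected_nodes(edges):
    """Only nodes that take part in at least one edge are displayed."""
    return sorted({k for i, j, _ in edges for k in (i, j)})

# Hypothetical 4-compound SALI matrix.
toy = [[0.0, 20.0, 1.0, 2.0],
       [20.0, 0.0, 3.0, 12.0],
       [1.0, 3.0, 0.0, 4.0],
       [2.0, 12.0, 4.0, 0.0]]

strict = sali_graph(toy, 0.9)   # only the steepest cliff survives
medium = sali_graph(toy, 0.5)
loose = sali_graph(toy, 0.0)    # everything above the minimum value
```

Lowering the cutoff adds edges and nodes, which is the growth in graph complexity the slide describes.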
  • 20. Better Visualization - SALIViewer
      http://sali.rguha.net
  • 21. What Can We Do With SALIs?
      • SALI characterizes cliffs & non-cliffs
      • For a given molecular representation, SALI values give us an idea of the smoothness of the SAR landscape
      • Models try to encode this landscape
      • Use the landscape to guide descriptor or model selection
  • 22. Descriptor Space Smoothness
      [Figure: number of edges in the SALI graph versus SALI cutoff (0.0 to 1.0), shown alongside SALI graphs whose nodes are labelled with drug names such as astemizole, cisapride, sildenafil and moxifloxacin]
      • Edge count of the SALI graph for varying cutoffs
      • Measures smoothness of the descriptor space
      • Can reduce this to a single number (AUC)
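One way to sketch the edge-count curve and its reduction to a single number is shown below. The cutoff is again assumed to be a fraction of the range of SALI values, and the trapezoidal rule is an assumed choice for the area; `edge_count_curve` and `auc` are illustrative names.

```python
def edge_count_curve(sali_values, n_steps=101):
    """Count SALI-graph edges at evenly spaced cutoffs between 0 and 1.

    sali_values: finite SALI values, one per compound pair.
    """
    lo, hi = min(sali_values), max(sali_values)
    cutoffs = [k / (n_steps - 1) for k in range(n_steps)]
    counts = []
    for x in cutoffs:
        threshold = lo + x * (hi - lo)
        counts.append(sum(1 for v in sali_values if v > threshold))
    return cutoffs, counts

def auc(xs, ys):
    """Trapezoidal area under the curve defined by points (xs, ys)."""
    return sum((xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2.0
               for k in range(len(xs) - 1))

# Hypothetical pairwise SALI values, evaluated at three cutoffs for brevity.
xs, ys = edge_count_curve([1.0, 2.0, 3.0, 4.0, 12.0, 20.0], n_steps=3)
area = auc(xs, ys)
```

Dividing the area by the total number of pairs would make values comparable across datasets of different size; that normalization is a suggestion, not something stated in the talk.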
  • 23. Other Examples
      • Instead of fingerprints, we use molecular descriptors
      • SALI denominator now uses Euclidean distance
      • 2D & 3D random descriptor sets
          – None are really good
          – Too rough, or
          – Too flat
      [Figure: two panels (2D, 3D) of number of edges in the SALI graph versus SALI cutoff]
  • 24. Feature Selection Using SALI
      • Surprisingly, exhaustive search of 66,000 4-descriptor combinations did not yield semi-smoothly decreasing curves
      • Not entirely clear what type of curve is desirable
  • 25. SALI Graphs & Predictive Models
      • The graph view allows us to view SARs and identify trends easily
      • The aim of a QSAR model is to encode SARs
      • Traditionally, we consider the quality of a model in terms of RMSE or R2
      • But in general, we're not as interested in RMSEs as we are in whether the model predicted something as more active than something else
          – What we want to have is the correct ordering
          – We assume the model is statistically significant
  • 26. Measuring Model Quality
      • A QSAR model should easily encode the "rolling hills"
      • A good model captures the most significant cliffs
      • Can be formalized as: how many of the edge orderings of a SALI graph does the model predict correctly?
      • Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X
      • Repeat for varying X and obtain the SALI curve
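A rough sketch of one point on the SALI curve is given below. The slide describes S(X) as a count of correctly predicted edges; the normalization used here, (correct minus incorrect) divided by the number of edges, is an assumption chosen so that values fall between -1 and 1. Treating predicted ties as neutral is also an assumption, and `sali_curve_point` is a hypothetical name.

```python
def sali_curve_point(edges, predicted, threshold):
    """Return S(X) for one threshold X, or None if no edges remain.

    edges: list of (i, j, sali) where compound i is observed to be less
           active than compound j.
    predicted: predicted activities indexed by compound.
    """
    kept = [(i, j) for i, j, v in edges if v > threshold]
    if not kept:
        return None
    score = 0
    for i, j in kept:
        if predicted[i] < predicted[j]:
            score += 1      # observed ordering reproduced
        elif predicted[i] > predicted[j]:
            score -= 1      # observed ordering reversed
        # equal predictions contribute nothing (assumed convention)
    return score / len(kept)

# Hypothetical edges and model predictions for four compounds.
edges = [(0, 1, 20.0), (3, 1, 12.0), (2, 3, 4.0)]
pred = [6.0, 8.0, 7.5, 7.0]
s_low = sali_curve_point(edges, pred, 0.0)    # all three edges considered
s_high = sali_curve_point(edges, pred, 10.0)  # only the two steepest cliffs
```

Evaluating the function over a grid of thresholds gives the full curve.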
  • 27. SALI Curves
      [Figure: two panels of S(X) versus X, with S(X) from -1.0 to 1.0 and X from 0.0 to 1.0; legend entries: 3-descriptor, 5-descriptor, scrambled 3-descriptor; annotation: SCI = 0.12]
  • 28. Model Search Using the SCI
      • We've used the SALI to retrospectively analyze models
      • Can we use SALI to develop models?
          – Identify a model that captures the cliffs
      • Tricky
          – Cliffs are fundamentally outliers
          – Optimizing for good SALI values implies overfitting
          – Need to trade off between SALI & generalizability
  • 29. The Objective Function
      • S0 is a measure of the model's ability to summarize the dataset (analogous to RMSE)
      • S100 measures the model's ability to capture cliffs
      • Ideally, the curve starts high and stays high
      [Figure: S(X) versus SALI cutoff, with S0 and S100 marked at the two ends of the curve]
        $$F = \frac{1}{S_{100}}, \qquad F = \frac{1}{S_0} + \frac{S_{100} - S_0}{2}, \qquad F = \frac{1}{\mathrm{SCI}}$$
  • 30. SALI Based Model Selection
      • Considered the BZR dataset from Sutherland et al
      • Identified "best" models using a GA to select from a pool of 2D descriptors
      • While SALI based optimization can lead to a "better" curve, it doesn't give the best model
      [Figure: two panels of S(X) versus SALI cutoff for models selected using RMSE, SCI and S(100)]
      Sutherland, J. et al, J. Chem. Inf. Comput. Sci., 2003, 43, 1906-1915
  • 31. SALI Based Model Selection
      • 107 aryl azoles as ER-β agonists
      • Used a GA and 2D descriptors to identify models
      • In this case, a SALI based objective function was able to identify the best model
      • Interestingly, SCI does not seem to perform very well
      [Figure: two panels of S(X) versus SALI cutoff for models selected using RMSE, SCI and S(0) + D/2]
      Malamas, M.S. et al, J Med Chem, 2004, 47, 5021-5040
  • 32. SALI Based Model Selection
      • The size of the solution space explored depends on the SALI objective function
      [Figure: RMSE by objective function for the BZR dataset (RMSE, S(100), SCI) and the ER-β dataset (1/S(0) + D/2, RMSE, SCI)]
  • 33. Predicting the Landscape
      • Rather than predicting activity directly, we can try to predict the SAR landscape
      • Implies that we attempt to directly predict cliffs
          – Observations are now pairs of molecules
      • A more complex problem
          – Choice of features is trickier
          – Still face the problem of cliffs as outliers
          – Somewhat similar to predicting activity differences
      Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122
  • 34. Predicting Cliffs
      • Dependent variables are pairwise SALI values, calculated using fingerprints
      • Independent variables are molecular descriptors
          – but considered pairwise
          – Absolute difference of descriptor pairs, or
          – Geometric mean of descriptor pairs
          – …
      • Develop a model to correlate pairwise descriptors to pairwise SALI values
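The two aggregation functions named on this slide can be sketched as follows. The function name `pairwise_features`, the dictionary layout and the toy descriptor values are assumptions for illustration; the geometric mean as written only works for non-negative descriptor values.

```python
from itertools import combinations
import math

def pairwise_features(descriptors, how="absdiff"):
    """Aggregate per-molecule descriptor vectors into per-pair vectors.

    descriptors: list of equal-length numeric lists, one per molecule.
    how: "absdiff" for absolute differences, "geomean" for geometric means.
    Returns a dict mapping (i, j) with i < j to the pairwise vector.
    """
    rows = {}
    for i, j in combinations(range(len(descriptors)), 2):
        a, b = descriptors[i], descriptors[j]
        if how == "absdiff":
            rows[(i, j)] = [abs(x - y) for x, y in zip(a, b)]
        elif how == "geomean":
            # math.sqrt raises ValueError if a product is negative
            rows[(i, j)] = [math.sqrt(x * y) for x, y in zip(a, b)]
        else:
            raise ValueError(f"unknown aggregation: {how}")
    return rows

# Hypothetical descriptor table: three molecules, two descriptors each.
desc = [[1.0, 4.0], [4.0, 9.0], [9.0, 1.0]]
abs_rows = pairwise_features(desc, "absdiff")
geo_rows = pairwise_features(desc, "geomean")
```

The number of observations grows quadratically with the number of molecules: n molecules give n(n-1)/2 pairs, so 30 molecules give 435 pairs.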
  • 35. A Test Case
      • We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50s
      • Evaluate topological and physicochemical descriptors
      • Developed random forest models
          – On the original observed values (30 observations)
          – On the SALI values (435 observations)
      Cavalli, A. et al, J Med Chem, 2002, 45, 3844-3853
  • 36. Double Counting Structures?
      • The dependent and independent variables both encode structure
      • But pretty low correlations between individual pairwise descriptors and the SALI values
      [Figure: histograms (percent of total) of R2 for the GeoMean and AbsDiff pairwise descriptors; R2 axis from 0.00 to 0.15]
  • 37. Model Summaries
      [Figure: predicted versus observed plots. Original pIC50: RMSE = 0.97; SALI, AbsDiff: RMSE = 1.10; SALI, GeoMean: RMSE = 1.04]
      • All models explain similar % of variance of their respective datasets
      • Using geometric mean as the descriptor aggregation function seems to perform best
      • SALI models are more robust due to larger size of the dataset
  • 38. Test Case 2
      • Considered the Holloway docking dataset, 32 molecules with pIC50s and Einter
      • Similar strategy as before
      • Need to transform SALI values
      • Descriptors show minimal correlation
      [Figure: histograms (percent of total) of SALI and of log10(SALI)]
      Holloway, M.K. et al, J Med Chem, 1995, 38, 305-317
  • 39. Model Summaries
      [Figure: predicted versus observed plots. Original pIC50: RMSE = 1.05; log10(SALI), AbsDiff: RMSE = 0.48; log10(SALI), GeoMean: RMSE = 0.48]
      • The SALI models perform much worse in terms of % of variance explained
      • Descriptor aggregation method does not seem to have much effect
      • The SALI models appear to perform decently on the cliffs, but miss the most significant ones
  • 40. Model Summaries
      [Figure: predicted versus observed plots. Original pIC50: RMSE = 1.05; SALI, AbsDiff: RMSE = 9.76; SALI, GeoMean: RMSE = 10.01]
      • With untransformed SALI values, models perform similarly in terms of % of variance explained
      • The most significant cliffs correspond to stereoisomers
  • 41. Model Caveats
      • Models based on SALI values are dependent on there being an SAR in the original activity data
      • Scrambling results for these models are poorer than the original models but aren't as random as expected
      [Figure: predicted versus observed SALI]
  • 42. SALI in Bulk
      • Much of this material is exploratory
      • So we're interested in trends across many assays
      • ChEMBL is an excellent source for activity cliffs
      • Assay selection
          – Human target, binding assay
          – High confidence (score = 9)
          – Number of compounds between 75 & 300
          – Only consider non-NA activity values
          – Censored data is considered the same as exact data
          – 31 assays
      • We identify datasets with activity cliffs by the skewness of the dependent variable
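The skewness screen in the last bullet could look roughly like the sketch below. The slide does not say which skewness estimator was used; the third standardized moment in its population form is assumed here, and the function name and toy values are illustrative.

```python
def skewness(values):
    """Third standardized moment (population form) of a list of numbers.

    Raises ZeroDivisionError if all values are identical.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical dependent-variable values for two assays.
symmetric = skewness([1.0, 2.0, 3.0])
long_tail = skewness([1.0, 1.0, 1.0, 10.0])  # one extreme value on the right
```

A strongly positive value indicates a few unusually large entries, which is the signature one would expect from a dataset containing cliffs.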
  • 43. SALI in Bulk
      • Used pIC50s and CDK hashed fingerprints
      • Lots of material to explore here
  • 44. SALI in Bulk
      • But fingerprint quality is important
      • Assay 379744 has 83 cliffs of infinite height since 83 pairs of molecules have Tc = 1.0
      • Probably should simply ignore such "identical" molecules
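Ignoring pairs that the fingerprint cannot tell apart can be sketched as a simple filter applied before computing SALI values. The function name `drop_identical_pairs` and the toy triples are hypothetical.

```python
def drop_identical_pairs(pairs):
    """Split (i, j, tc) triples into usable pairs and 'identical' ones.

    Pairs with Tc = 1.0 would give a zero SALI denominator, so they are
    set aside instead of being reported as infinitely high cliffs.
    """
    usable = [(i, j, tc) for i, j, tc in pairs if tc < 1.0]
    identical = [(i, j, tc) for i, j, tc in pairs if tc >= 1.0]
    return usable, identical

# Hypothetical pairwise Tanimoto values for three compounds.
pairs = [(0, 1, 1.0), (0, 2, 0.8), (1, 2, 0.75)]
usable, identical = drop_identical_pairs(pairs)
```

Keeping the set-aside list makes it easy to inspect whether the "identical" pairs are stereoisomers or other structures the fingerprint does not distinguish.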
  • 45. Conclusions
      • SALI is the first step in characterizing the SAR landscape
      • Allows us to directly analyze the landscape, as opposed to individual molecules
      • Being able to predict the landscape could serve as a useful way to extend an SAR landscape
  • 46. Acknowledgements
      • John Van Drie
      • Gerry Maggiora
      • Mic Lajiness
      • Jurgen Bajorath
      • Ajit Jadhav
      • Trung Nguyen
  • 47. ER-β Dataset
      • 107 molecules, censored data taken as exact
      • A few big cliffs
      • The best linear model performs decently
      [Figure: predicted versus observed pIC50, and a histogram (frequency) of pIC50]
  • 48. SALI Curves from DRCs
      • No difference in major cliffs
      • Some of the minor cliffs are highlighted using the DRC instead of IC50
  • 49. Clustering in the Holloway Dataset
      [Figure: hierarchical clustering dendrogram (hclust, "complete") of the 33 numbered compounds; height axis from 0.5 to 1.0]