Classification and Clustering for Hit Identification in High Content RNAi Screens

Transcript

  • 1. Classification and Clustering for Hit Identification in High Content RNAi Screens
      Rajarshi Guha, Ph.D.
      NIH Center for Translational Therapeutics
      January 11, 2012
  • 2. DNA Re-replication
      Collaborators: Mel Depamphilis, NICHD; Wenge Zhu, Georgetown U
      Sivaprasad et al., Cell Division
      • Levels of geminin increase as cells enter S phase, which helps to prevent a second round of DNA replication.
      • After mitosis, levels of geminin and cyclins decrease through ubiquitin-mediated degradation.
      DNA replication is a tightly controlled and well-studied process. Proteins including geminin, cyclin A, and Emi1 can help prevent DNA re-replication.
  • 3. DNA Re-replication
      Zhu et al., Cancer Res, 2009
      Certain cancer cells may have fewer safeguards against DNA re-replication than normal cells (i.e. an Achilles heel). Induction of re-replication results in apoptosis.
  • 4. Screening Protocol
      • HCT-116 colon cancer cells are fixed and stained (Hoechst)
      • Image at 4X on ImageXpress
      • MetaXpress used to perform cell cycle analysis to quantify cells with >4N DNA content
      • Screens were run with singles and pools
  • 5. Screen Summary
      • Qiagen druggable genome library (6,866 genes)
      • 94 plates, 36K wells including controls
      • Good screen performance, some poorer plates were redone
      [Figure: SSMD and Trimmed Z statistics by Plate Index]
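The per-plate SSMD in the figure can be illustrated with a minimal sketch. It uses the standard definition of the strictly standardized mean difference for two independent groups; the control readouts below are made-up numbers, not values from the screen, and the slides do not say which readout was used.

```python
from math import sqrt
from statistics import mean, variance

def ssmd(positive, negative):
    """Strictly standardized mean difference between two groups of control wells.

    Assumes the groups are independent, so their variances simply add.
    """
    spread = sqrt(variance(positive) + variance(negative))
    return (mean(positive) - mean(negative)) / spread

# Hypothetical readouts (e.g. %G2) for the control wells on one plate.
pos_controls = [42.0, 45.0, 39.0, 44.0]
neg_controls = [11.0, 9.0, 12.0, 10.0]

plate_ssmd = ssmd(pos_controls, neg_controls)
print(f"SSMD = {plate_ssmd:.2f}")
```

A larger absolute SSMD means the controls are better separated relative to their spread, which is the sense in which a plate "performs well".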
  • 6. Goals
      • Can we identify genes with GMNN-like phenotypes?
          – We already identified a set of genes via thresholding the %G2 parameter
          – We’d like to see what we get when we use a multi-dimensional representation
      • Employ predictive modeling to “learn” the phenotype
      • Apply clustering and identify biologically relevant clusters
  • 7. What Do GMNN Wells Look Like?
  • 8. Cell-Level Modeling
      • A first approach was to match distributions of individual wells with the overall distribution from the positive control wells
          – Expected that distribution for GMNN wells should match the positive control
          – Use KS test to identify wells with similar distributions
          – Doesn’t work too well, even for GMNN itself
          – Considers 1 parameter at a time (though a 2D KS test is possible)
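The distribution matching described above rests on the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical CDFs. The sketch below computes only that statistic for one parameter at a time; the sample values are invented, and the slides do not say which cutoff or p-value was used to call two distributions "similar".

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # Both CDFs are step functions, so the maximum gap occurs at a data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Hypothetical per-cell values of a single parameter.
positive_control = [1.0, 2.0, 3.0, 4.0]
candidate_well = [3.0, 4.0, 5.0, 6.0]

print(ks_statistic(positive_control, candidate_well))
```

A statistic near 0 means the well's distribution tracks the positive control; a value near 1 means the two barely overlap.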
  • 9. Random Forest Model
      • Ensemble of decision trees (Breiman 1984)
      • Not always the most accurate, but great for exploratory modeling
          – Implicit feature selection
          – Proven not to overfit as more trees are added
          – Provides a measure of feature importance
      • Employ the randomForest package from R
      Figure source: http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html
  • 10. Cell-Level Modeling
      • Removed cells with “incomplete” parameters
      • Still leaves 291K positive cases and 3M negative cases
      • Developed a random forest model, sampling from negatives to maintain balanced classes
          – Predict whether a cell is GMNN-like
          – Models from multiple samples of the negative control exhibited similar performance

                      Positive    Negative
          Positive     220,636      72,498
          Negative      35,614     257,520

      Overall 18% error, 25% error on positive class and 12% error on negative class
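The quoted error rates can be reproduced from the confusion matrix. The sketch below assumes that rows are the actual class and columns the predicted class; the slide does not label the axes, but this reading is the one that yields the stated 25% and 12% per-class errors.

```python
# Confusion matrix from the slide; rows assumed to be the actual class.
confusion = {
    "Positive": {"Positive": 220_636, "Negative": 72_498},
    "Negative": {"Positive": 35_614, "Negative": 257_520},
}

def error_rates(matrix):
    """Return (overall error, per-class error) for a nested-dict confusion matrix."""
    per_class = {}
    wrong = total = 0
    for actual, row in matrix.items():
        n = sum(row.values())
        misses = n - row[actual]  # everything not on the diagonal
        per_class[actual] = misses / n
        wrong += misses
        total += n
    return wrong / total, per_class

overall, per_class = error_rates(confusion)
print(f"overall {overall:.1%}")
for label, rate in per_class.items():
    print(f"{label} class {rate:.1%}")
```

Both rows sum to the same count, which reflects the balanced sampling of the negative class.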
  • 11. Cell-Level Modeling
      • Significant overlap between distributions for the negative and positive controls
  • 12. Cell-Level Predictions
      • Aggregate predictions for all cells in a well to label a well as GMNN-like
      • Identify genes with >= 2 siRNAs (i.e. wells) labeled as GMNN-like
          – 31 genes identified (GMNN, KIF11, ESPL1, …)
      • Identified expected genes and most of the set were functionally relevant
          – Also identified a few interesting, novel genes
      • Reconfirmation based on Ambion sequences was relatively low (9/31)
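The two aggregation steps above can be sketched as follows. The slides do not state how cell-level calls were combined into a well label, so the majority-vote threshold here is an assumption, and the well identifiers, gene name GENE_X, and predictions are hypothetical.

```python
from collections import defaultdict

def gmnn_like_wells(cell_predictions, min_fraction=0.5):
    """Label a well GMNN-like when more than min_fraction of its cells are.

    cell_predictions maps well id -> list of per-cell booleans.
    The majority-vote rule is an assumption, not the authors' stated rule.
    """
    return {
        well
        for well, calls in cell_predictions.items()
        if calls and sum(calls) / len(calls) > min_fraction
    }

def genes_with_support(well_to_gene, positive_wells, min_wells=2):
    """Genes with at least min_wells siRNA wells labeled GMNN-like."""
    counts = defaultdict(int)
    for well in positive_wells:
        counts[well_to_gene[well]] += 1
    return sorted(gene for gene, n in counts.items() if n >= min_wells)

# Hypothetical per-cell predictions and well-to-gene mapping.
cells = {
    "w1": [True, True, False],
    "w2": [True, True, True],
    "w3": [False, False, True],
    "w4": [True, False, True],
    "w5": [False, False],
}
well_to_gene = {"w1": "GMNN", "w2": "GMNN", "w3": "KIF11", "w4": "KIF11", "w5": "GENE_X"}

wells = gmnn_like_wells(cells)
hits = genes_with_support(well_to_gene, wells)
print(sorted(wells), hits)
```

In this toy data only one of the two KIF11 wells passes the vote, so KIF11 misses the two-siRNA requirement even though it shows some signal.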
  • 13. Well-Level Modeling
      • Started with 27 parameters from MetaXpress
      • Performed automated feature selection
          – Remove undefined, constant features
          – Manually removed a few highly correlated features
      • Work with 12 parameters
      • Convert to Z-scores
      • Positive & negative controls are nicely separated
      [Figure: panels “All Wells” and “Control Wells”]
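A minimal sketch of the automated part of this preprocessing: dropping undefined or constant features and converting the rest to Z-scores. The feature names and values are hypothetical, and the choice of the population standard deviation over all wells is an assumption, since the slides do not say which reference set was used for scaling.

```python
from math import isfinite
from statistics import mean, pstdev

def drop_unusable(table):
    """Drop features that contain undefined values or never vary.

    table maps feature name -> list of per-well values.
    """
    kept = {}
    for name, values in table.items():
        if any(v is None or not isfinite(v) for v in values):
            continue  # undefined entries
        if max(values) == min(values):
            continue  # constant feature carries no information
        kept[name] = values
    return kept

def z_scores(values):
    """Center and scale one feature (population standard deviation assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical well-level features.
features = {
    "feature_a": [10.0, 12.0, 14.0],
    "feature_constant": [5.0, 5.0, 5.0],
    "feature_undefined": [1.0, float("nan"), 2.0],
}

usable = drop_unusable(features)
scaled = {name: z_scores(values) for name, values in usable.items()}
print(scaled)
```

Removing constant features first also avoids a division by zero in the scaling step.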
  • 14. Parameter Distributions
  • 15. Model Performance
      • Classification model trained using the positive (GMNN-like) and negative (not GMNN-like) controls
      • Perfect classification!

                      Positive    Negative
          Positive       1504           0
          Negative          0        1504

          – Worrying – overfitting?
          – Nearly 99% of the control wells were confidently classified as positive or negative
  • 16. Descriptor Importance
      • What does the model identify as the most relevant descriptors?
      • Some parameters are moderately correlated
      [Figure: MeanDecreaseGini (0 to 300) for the descriptors G0.G1Cells, SPhaseCells, X.G2, Cell.MitoticIntegratedIntensity, Cell.DNAIntegratedIntensity, X.G0.G1, Cell.DNAArea, DNABackgroundValue, G2Cells, X.SPhase, Cell.DNAAverageIntensity, Cell.MitoticAverageIntensity]
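The importance measure on the plot axis, MeanDecreaseGini, is built from the drop in Gini impurity produced by each split on a descriptor, totalled per descriptor and averaged over the trees. The sketch below shows only that building block for a single split; the labels are toy data, and this is not the randomForest implementation.

```python
def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """Impurity drop from splitting parent into left and right, weighted by size."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Toy node with balanced classes.
node = ["pos"] * 4 + ["neg"] * 4

clean_split = gini_decrease(node, ["pos"] * 4, ["neg"] * 4)
useless_split = gini_decrease(node, ["pos", "pos", "neg", "neg"], ["pos", "pos", "neg", "neg"])
print(clean_split, useless_split)
```

A descriptor that often produces clean splits accumulates a large total decrease, which is why it ranks high on the plot.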
  • 17. Random Forest Predictions
      • We use the model to predict the class for all the remaining wells
      • All four siRNAs targeting GMNN are classified as Geminin-like with high confidence
      [Figure: histogram of Percent of Total against Probability of being Geminin-like]
  • 18. Random Forest Predictions
      • Select genes for which > 75% of their siRNAs are predicted to be Geminin-like with probability > 0.8
      • Good overlap with cell-level model
      [Figure: Probability of being Geminin-like (0.0 to 1.0) for each selected gene]
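The selection rule above is simple to state in code. This is a minimal sketch with hypothetical probabilities and placeholder gene names (GENE_A, GENE_B); only the two thresholds come from the slide.

```python
def select_genes(probabilities, min_prob=0.8, min_fraction=0.75):
    """Genes where more than min_fraction of siRNAs exceed min_prob.

    probabilities maps gene -> list of per-siRNA probabilities of being Geminin-like.
    """
    hits = []
    for gene, probs in probabilities.items():
        confident = sum(p > min_prob for p in probs)
        if probs and confident / len(probs) > min_fraction:
            hits.append(gene)
    return sorted(hits)

# Hypothetical per-siRNA probabilities.
predicted = {
    "GMNN": [0.95, 0.90, 0.99, 0.97],
    "GENE_A": [0.90, 0.85, 0.95, 0.40],
    "GENE_B": [0.20, 0.10, 0.30, 0.05],
}

selected = select_genes(predicted)
print(selected)
```

Because both comparisons are strict, a gene with four siRNAs needs all four above 0.8; three of four is exactly 75% and does not pass.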
  • 19. GO Enrichment
      • GO Biological Processes enriched in this set of selected genes are relevant to the biology
      • Similarly with pathways (from GeneGo)
  • 20. Clustering
      • RF classification is useful, but doesn’t directly tell us much about finer groups of genes that might be phenotypically related
      • So we apply unsupervised clustering (PAM)
          – Explore different numbers of clusters
          – Evaluate statistical cluster quality metrics
          – Evaluate biologically motivated quality metrics
      • We considered both plate-wise and experiment-wise clustering protocols
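PAM (partitioning around medoids) picks k actual data points as cluster centers so that the total distance from every point to its nearest medoid is as small as possible. The sketch below minimizes that same objective by brute force over all medoid sets, which is only feasible for toy data; it is not the build-and-swap procedure PAM itself uses, and the points are invented.

```python
from itertools import combinations
from math import dist

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid (the PAM objective)."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def k_medoids_exhaustive(points, k):
    """Try every set of k medoid indices and keep the cheapest one."""
    best = min(combinations(range(len(points)), k), key=lambda ms: total_cost(points, ms))
    labels = [min(best, key=lambda m: dist(p, points[m])) for p in points]
    return best, labels

# Hypothetical wells described by two Z-scored parameters.
wells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

medoids, labels = k_medoids_exhaustive(wells, k=2)
print(medoids, labels)
```

Each label is the index of the medoid well that the point is assigned to, so clusters stay interpretable as "wells that look like this representative well".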
  • 21. Platewise Clustering (k=4)
      • Cluster assignments can’t be directly compared across plates
      • Good to see that control columns are distinctly clustered
      • Certain plates show no membership to the ‘GMNN cluster’
  • 22. Experimentwise Clustering (k=2)
      • Encouraging to see clean separation between control columns
      • Bulk of wells are identified as inactive
      • We can compare results from this clustering to RF classification
          – 6 genes identified, with multiple siRNAs clustered with negative control
  • 23. Experimentwise Clustering (k=2)
      • 6 genes identified with multiple siRNAs clustered with the negative control
      • These were confidently identified by the RF model
      [Figure: Probability of being Geminin-like (0.0 to 1.0) for each selected gene]
  • 24. How Many Clusters?
      • A priori, difficult to decide how many clusters there should be
          – Manual spot checks did not identify distinctly different morphologies, counts
      • Evaluate clusters with varying k and calculate average silhouette width
      • Clustering based on the Euclidean metric doesn’t do a good job
      [Figure: Average Silhouette Width against Number of Clusters (2 to 20)]
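The average silhouette width used above compares, for each point, its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below uses the Euclidean metric and invented points; it needs at least two clusters and scores singleton clusters as 0 by convention.

```python
from math import dist
from statistics import mean

def average_silhouette_width(points, labels):
    """Mean silhouette over all points, Euclidean metric, at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # dist(p, p) is 0, so dividing by len(own) - 1 excludes the point itself.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            mean(dist(p, q) for q in members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(widths)

# Two tight, well separated groups of hypothetical wells.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]

good = average_silhouette_width(pts, [0, 0, 1, 1])
bad = average_silhouette_width(pts, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Values near 1 indicate compact, well separated clusters, while values near or below 0 indicate that points sit closer to another cluster than to their own.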
  • 25. How Many Clusters?
      • One approach is to ignore clusterings that have spread all GMNN siRNAs across multiple clusters
      • The current data suggests that we stick to k = 5
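One way to turn this filter into code is to measure, for each k, what fraction of the GMNN siRNAs land in their most common cluster and discard values of k where that fraction is too low. The 0.75 cutoff and the cluster assignments below are assumptions for illustration; the slides give neither the exact rule nor the assignments.

```python
from collections import Counter

def gmnn_cohesion(assignments):
    """Fraction of GMNN siRNAs that fall in their most common cluster."""
    counts = Counter(assignments)
    return counts.most_common(1)[0][1] / len(assignments)

def acceptable_ks(gmnn_clusters_by_k, min_cohesion=0.75):
    """Values of k whose clustering keeps the GMNN siRNAs mostly together."""
    return [
        k
        for k, assignments in sorted(gmnn_clusters_by_k.items())
        if gmnn_cohesion(assignments) >= min_cohesion
    ]

# Hypothetical cluster ids of the four GMNN siRNAs under different k.
gmnn_clusters = {
    2: [1, 1, 1, 1],
    5: [3, 3, 3, 1],
    8: [1, 4, 6, 7],
}

keep = acceptable_ks(gmnn_clusters)
print(keep)
```

Among the surviving values of k, a statistical criterion such as silhouette width or a biological one such as enrichment can then decide the final choice.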
  • 26. Biological Enrichment in Clusters
      • Considering 5 clusters
      • Some clusters are annotated with more relevant terms
      [Figure label: Cluster containing 3 of the 4 GMNN siRNAs]
  • 27. Signal Enhancement in Clusters
      • Signal is significantly enhanced in some clusters versus others
      • Clusters 1, 2 and 4 did not contain any siRNAs above Z = 3
  • 28. Making a Final Hitlist
      • Off-target effects are a major confounding factor
      • We are able to assess OTE on a gene-by-gene basis using Common Seed Analysis
      • Select genes from individual clusters, using %G2 and number of siRNAs as secondary filters
      • Combine with hits from random forest model
      Marine, S. et al., J. Biomol. Screen., 2011, ASAP
  • 29. Reconfirmation
      • 18/211 genes selected based on thresholding from the primary screen reconfirmed using Ambion sequences
      • Considering just the genes selected by the random forest and/or clustering methods
          – 11/30 genes selected by RF reconfirmed using Ambion libraries
          – 5/6 genes identified by RF & clustering reconfirmed using multiple libraries
              • ESPL1, FBXO5, INCENP, KIF11 reconfirmed very strongly
      • Based on k = 5 clustering,
          – 23/181 genes from cluster 3 reconfirmed
          – 5/5 genes from cluster 5 reconfirmed
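Expressing these counts as rates makes the selection methods easier to compare. The sketch below only does the arithmetic on the numbers quoted on the slide; the dictionary keys are shorthand labels introduced here.

```python
# (reconfirmed, selected) counts as quoted on the slide.
reconfirmation = {
    "thresholding": (18, 211),
    "random forest": (11, 30),
    "random forest and clustering": (5, 6),
    "cluster 3 (k = 5)": (23, 181),
    "cluster 5 (k = 5)": (5, 5),
}

rates = {name: hits / total for name, (hits, total) in reconfirmation.items()}

for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.1%}")
```

The smaller, model-derived gene sets reconfirm at a much higher rate than the thresholded list, although the small denominators (5 and 6 genes) call for caution.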
  • 30. Outlook
      • Complements traditional threshold-based selection methods
      • The random forest approach is sufficiently accurate and lets us avoid explicitly selecting features up front
      • Combining it with clustering lets us zoom into biologically relevant clusters of genes
  • 31. Acknowledgements
      • Scott Martin
      • Pinar Tuzmen
      • Carleen Klump
      • Eugen Buehler