Classification and Clustering for Hit Identification in High Content RNAi Screens

Classification and Clustering for Hit Identification in High Content RNAi Screens Presentation Transcript

  • 1. Classification and Clustering for Hit Identification in High Content RNAi Screens
      Rajarshi Guha, Ph.D.
      NIH Center for Translational Therapeutics
      January 11, 2012
  • 2. DNA Re-replication
      Collaborators: Mel Depamphilis, NICHD; Wenge Zhu, Georgetown U
      Sivaprasad et al, Cell Division
      • Levels of geminin increase as cells enter S phase, which helps to prevent a second round of DNA replication.
      • After mitosis, levels of geminin and cyclins decrease through ubiquitin-mediated degradation.
      DNA replication is a tightly controlled and well-studied process. Proteins including geminin, cyclin A, and Emi1 can help prevent DNA re-replication.
  • 3. DNA Re-replication
      Zhu et al, Cancer Res, 2009
      Certain cancer cells may have fewer safeguards against DNA re-replication than normal cells (i.e., an Achilles heel). Induction of re-replication results in apoptosis.
  • 4. Screening Protocol
      • HCT-116 colon cancer cells are fixed and stained (Hoechst)
      • Image at 4X on ImageXpress
      • MetaXpress used to perform cell cycle analysis to quantify cells with >4N DNA content
      • Screens were run with singles and pools
  • 5. Screen Summary
      • Qiagen druggable genome library (6,866 genes)
      • 94 plates, 36K wells including controls
      • Good screen performance, some poorer plates were redone
      [Figure: per-plate quality statistics (SSMD and Trimmed Z) versus Plate Index]
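The per-plate SSMD statistic plotted on this slide can be illustrated with a short sketch. The slides do not say which SSMD estimator was used, so this assumes the common form for two independent groups (difference of means divided by the square root of the summed sample variances). The control readouts below are invented for illustration.

```python
from statistics import mean, variance

def ssmd(positive, negative):
    """SSMD for two independent groups: difference of the group means
    divided by the square root of the summed sample variances."""
    spread = (variance(positive) + variance(negative)) ** 0.5
    return (mean(positive) - mean(negative)) / spread

# Hypothetical control-well readouts for one plate (not data from the screen).
pos = [80.0, 82.0, 78.0, 84.0]
neg = [20.0, 22.0, 18.0, 24.0]
score = ssmd(pos, neg)
```

A larger absolute value indicates better separation between the positive and negative controls on that plate.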
  • 6. Goals
      • Can we identify genes with GMNN-like phenotypes?
          – We already identified a set of genes via thresholding the %G2 parameter
          – We’d like to see what we get when we use a multi-dimensional representation
      • Employ predictive modeling to “learn” the phenotype
      • Apply clustering and identify biologically relevant clusters
  • 7. What Do GMNN Wells Look Like?
  • 8. Cell-Level Modeling
      • A first approach was to match distributions of individual wells with the overall distribution from the positive control wells
          – Expected that the distribution for GMNN wells should match the positive control
          – Use KS test to identify wells with similar distributions
          – Doesn’t work too well, even for GMNN itself
          – Considers 1 parameter at a time (though a 2D KS test is possible)
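To make the distribution-matching idea concrete, here is a minimal sketch of the two-sample KS statistic for a single parameter. It computes only the statistic (the largest gap between the two empirical CDFs), not a p-value, and the per-cell values are invented; the slides do not describe the exact KS procedure that was applied.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest absolute difference between
    the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the gap can only change at observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical per-cell values of one parameter (not data from the screen).
control = [2.0, 2.1, 3.9, 4.0, 4.2, 4.1]
similar_well = [2.0, 2.2, 4.0, 4.1, 4.3, 3.8]
different_well = [1.9, 2.0, 2.1, 2.0, 2.2, 2.1]

d_similar = ks_statistic(control, similar_well)
d_different = ks_statistic(control, different_well)
```

A small statistic means the well's distribution resembles the control's. As the slide notes, this compares one parameter at a time.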
  • 9. Random Forest Model
      • Ensemble of decision trees (Breiman 1984)
      • Not always the most accurate, but great for exploratory modeling
          – Implicit feature selection
          – Proven to not overfit
          – Provides a measure of feature importance
      • Employ the randomForest package from R
      http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html
  • 10. Cell-Level Modeling
      • Removed cells with “incomplete” parameters
      • Still leaves 291K positive cases and 3M negative cases
      • Developed a random forest model, sampling from negatives to maintain balanced classes
          – Predict whether a cell is GMNN-like
          – Models from multiple samples of the negative control exhibited similar performance
      Confusion matrix:
                    Positive   Negative
        Positive    220,636     72,498
        Negative     35,614    257,520
      Overall 18% error, 25% error on positive class and 12% error on negative class
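The quoted error rates can be recomputed from the confusion matrix on this slide. The sketch below assumes rows are actual classes and columns are predicted classes; the slide does not label the axes, but this orientation reproduces the quoted 25% and 12% class errors.

```python
# Counts copied from the slide. Assumption: outer key = actual class,
# inner key = predicted class.
confusion = {
    "positive": {"positive": 220_636, "negative": 72_498},
    "negative": {"positive": 35_614, "negative": 257_520},
}

def class_error(actual):
    """Fraction of cells of the given actual class that were misclassified."""
    row = confusion[actual]
    return 1 - row[actual] / sum(row.values())

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[c][c] for c in confusion)

overall_error = 1 - correct / total          # about 0.18
positive_error = class_error("positive")     # about 0.25
negative_error = class_error("negative")     # about 0.12
```

Both rows sum to 293,134 cells, consistent with sampling the negatives down to a balanced training set.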
  • 11. Cell-Level Modeling
      • Significant overlap between distributions for the negative and positive controls
  • 12. Cell-Level Predictions
      • Aggregate predictions for all cells in a well to label a well as GMNN-like
      • Identify genes with >= 2 siRNAs (i.e., wells) labeled as GMNN-like
          – 31 genes identified (GMNN, KIF11, ESPL1, …)
      • Identified expected genes, and most of the set were functionally relevant
          – Also identified a few interesting, novel genes
      • Reconfirmation based on Ambion sequences was relatively low (9/31)
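The aggregation step can be sketched as follows. The slides do not state how per-cell predictions were combined into a well label, so the majority rule here (`min_fraction=0.5`) is an assumption, and the well ids and gene names are placeholders. Only the ">= 2 wells per gene" rule comes from the slide.

```python
from collections import defaultdict

def gmnn_like_genes(cell_calls, well_to_gene, min_fraction=0.5, min_wells=2):
    """cell_calls maps well id -> list of per-cell booleans (True = GMNN-like).
    A well is labeled GMNN-like when more than min_fraction of its cells are
    (assumed rule); a gene is kept when at least min_wells of its wells
    (one siRNA per well) are labeled GMNN-like."""
    wells_per_gene = defaultdict(int)
    for well, calls in cell_calls.items():
        if calls and sum(calls) / len(calls) > min_fraction:
            wells_per_gene[well_to_gene[well]] += 1
    return sorted(g for g, n in wells_per_gene.items() if n >= min_wells)

# Hypothetical wells and genes for illustration only.
cell_calls = {
    "A1": [True, True, False], "A2": [True, True, True],
    "B1": [True, False, False], "B2": [True, True, False],
}
well_to_gene = {"A1": "GENE_A", "A2": "GENE_A", "B1": "GENE_B", "B2": "GENE_B"}
hits = gmnn_like_genes(cell_calls, well_to_gene)
```

Here GENE_A has two GMNN-like wells and is kept, while GENE_B has only one and is dropped.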
  • 13. Well-Level Modeling
      • Started with 27 parameters from MetaXpress
      • Performed automated feature selection
          – Remove undefined, constant features
          – Manually removed a few highly correlated features
      • Work with 12 parameters
      • Convert to Z-scores
      • Positive & negative controls are nicely separated
      [Figure panels: All Wells; Control Wells]
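A minimal sketch of the automated part of this preprocessing: drop features with undefined or constant values, then convert the rest to Z-scores. The slides do not say what the Z-scores were computed relative to, so this assumes a per-feature mean and standard deviation over all wells passed in. The feature names `Constant` and `Broken` are invented.

```python
import math
from statistics import mean, pstdev

def select_and_scale(table):
    """table maps feature name -> list of per-well values.
    Drops features containing undefined values (None/NaN) or with zero
    spread, then rescales each remaining feature to Z-scores."""
    scaled = {}
    for name, values in table.items():
        if any(v is None or math.isnan(v) for v in values):
            continue  # undefined feature
        sd = pstdev(values)
        if sd == 0:
            continue  # constant feature
        mu = mean(values)
        scaled[name] = [(v - mu) / sd for v in values]
    return scaled

# Hypothetical well-level values (not data from the screen).
table = {
    "X.G2": [10.0, 20.0, 30.0],
    "Constant": [1.0, 1.0, 1.0],
    "Broken": [1.0, float("nan"), 2.0],
}
scaled = select_and_scale(table)
```

The manual removal of highly correlated features mentioned on the slide is not covered by this sketch.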
  • 14. Parameter Distributions
  • 15. Model Performance
      • Classification model trained using the positive (GMNN-like) and negative (not GMNN-like) controls
      • Perfect classification!
          – Worrying: overfitting?
          – Nearly 99% of the control wells were confidently classified as positive or negative
      Confusion matrix:
                    Positive   Negative
        Positive      1504          0
        Negative         0       1504
  • 16. Descriptor Importance
      • What does the model identify as the most relevant descriptors?
      • Some parameters are moderately correlated
      [Figure: descriptor importance (MeanDecreaseGini, axis 0 to 300) for G0.G1Cells, SPhaseCells, X.G2, Cell.MitoticIntegratedIntensity, Cell.DNAIntegratedIntensity, X.G0.G1, Cell.DNAArea, DNABackgroundValue, G2Cells, X.SPhase, Cell.DNAAverageIntensity, Cell.MitoticAverageIntensity]
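MeanDecreaseGini is built from the drop in Gini impurity each time a descriptor is used to split a node, accumulated over a tree and averaged across the forest. The sketch below shows that drop for a single binary split with toy labels; it is not the randomForest implementation.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (1 = GMNN-like)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the
    two child nodes produced by a split."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Toy node with three positive and three negative wells.
parent = [1, 1, 1, 0, 0, 0]
perfect = gini_decrease(parent, [1, 1, 1], [0, 0, 0])   # separates the classes
poor = gini_decrease(parent, [1, 1, 0], [1, 0, 0])      # barely separates them
```

A descriptor that repeatedly produces splits like `perfect` accumulates a high importance score.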
  • 17. Random Forest Predictions
      • We use the model to predict the class for all the remaining wells
      • All four siRNAs targeting GMNN are classified as Geminin-like with high confidence
      [Figure: histogram of Percent of Total versus Probability of being Geminin-like]
  • 18. Random Forest Predictions
      • Select genes for which > 75% of their siRNAs are predicted to be Geminin-like with probability > 0.8
      • Good overlap with cell-level model
      [Figure: Probability of being Geminin-like for the siRNAs of each selected gene]
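The selection rule on this slide is easy to state in code. In this sketch the two thresholds come from the slide, while the probabilities and the names `GENE_X` and `GENE_Y` are invented for illustration.

```python
def select_genes(probabilities, min_prob=0.8, min_fraction=0.75):
    """probabilities maps gene -> predicted P(Geminin-like) for each siRNA.
    Keeps genes where more than min_fraction of the siRNAs exceed min_prob."""
    selected = []
    for gene, probs in probabilities.items():
        confident = sum(p > min_prob for p in probs)
        if probs and confident / len(probs) > min_fraction:
            selected.append(gene)
    return selected

# Hypothetical predicted probabilities (not values from the screen).
probabilities = {
    "GMNN": [0.95, 0.92, 0.97, 0.90],    # all four siRNAs confident
    "GENE_X": [0.90, 0.85, 0.88, 0.40],  # 3 of 4 is exactly 75%, not above it
    "GENE_Y": [0.30, 0.20, 0.95, 0.10],
}
hits = select_genes(probabilities)
```

With four siRNAs per gene, the strict "> 75%" cutoff means all four must be confidently predicted.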
  • 19. GO Enrichment
      • GO Biological Processes enriched in this set of selected genes are relevant to the biology
      • Similarly with pathways (from GeneGo)
  • 20. Clustering
      • RF classification is useful, but doesn’t directly tell us much about finer groups of genes that might be phenotypically related
      • So we apply unsupervised clustering (PAM)
          – Explore different numbers of clusters
          – Evaluate statistical cluster quality metrics
          – Evaluate biologically motivated quality metrics
      • We considered both plate-wise and experiment-wise clustering protocols
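For readers unfamiliar with PAM (partitioning around medoids), here is a small sketch of its swap phase: start from some medoids and keep swapping a medoid for a non-medoid whenever that lowers the total distance of points to their nearest medoid. It assumes Euclidean distance and a naive initialisation, and is not the implementation used for the screen (which the slides do not name).

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Swap-based k-medoids. Returns (medoid indices, cluster label per point)."""
    medoids = list(range(k))  # naive start: the first k points
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i, candidate in product(range(k), range(len(points))):
            if candidate in medoids:
                continue
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best - 1e-12:  # accept only strict improvements
                medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels

# Two well-separated toy groups in a 2-parameter space.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = pam(points, k=2)
```

Unlike k-means, each cluster centre is an actual data point, which makes it straightforward to inspect the representative well of a cluster.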
  • 21. Platewise Clustering (k=4)
      • Cluster assignments can’t be directly compared across plates
      • Good to see that control columns are distinctly clustered
      • Certain plates show no membership to the ‘GMNN cluster’
  • 22. Experimentwise Clustering (k=2)
      • Encouraging to see clean separation between control columns
      • Bulk of wells are identified as inactive
      • We can compare results from this clustering to RF classification
          – 6 genes identified, with multiple siRNAs clustered with negative control
  • 23. Experimentwise Clustering (k=2)
      • 6 genes identified with multiple siRNAs clustered with the negative control
      • These were confidently identified by the RF model
      [Figure: Probability of being Geminin-like for the siRNAs of each selected gene]
  • 24. How Many Clusters?
      • A priori, difficult to decide how many clusters there should be
          – Manual spot checks did not identify distinctly different morphologies, counts
      • Evaluate clusters with varying k and calculate average silhouette width
      • Clustering based on the Euclidean metric doesn’t do a good job
      [Figure: Average Silhouette Width versus Number of Clusters (2 to 20)]
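The average silhouette width can be sketched directly from its definition. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the silhouette is (b - a) / max(a, b). This version assumes Euclidean distance and scores singleton clusters (or a single-cluster labeling) as 0. The points are toy data.

```python
import math

def average_silhouette(points, labels):
    """Mean silhouette value over all points for a given cluster labeling."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster present
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = average_silhouette(points, [0, 0, 0, 1, 1, 1])  # matches the two groups
bad = average_silhouette(points, [0, 1, 0, 1, 0, 1])   # mixes the two groups
```

Computing this for each candidate k and comparing the averages is the procedure the slide describes; values near 1 indicate compact, well-separated clusters.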
  • 25. How Many Clusters?
      • One approach is to ignore clusterings that have spread all GMNN siRNAs across multiple clusters
      • The current data suggests that we stick to k = 5
  • 26. Biological Enrichment in Clusters
      • Considering 5 clusters
      • Some clusters are annotated with more relevant terms
      [Figure label: Cluster containing ¾ of the GMNN siRNAs]
  • 27. Signal Enhancement in Clusters
      • Signal is significantly enhanced in some clusters versus others
      • Clusters 1, 2 and 4 did not contain any siRNAs above Z = 3
  • 28. Making a Final Hitlist
      • Off-target effects are a major confounding factor
      • We are able to assess OTE on a gene-by-gene basis using Common Seed Analysis
      • Select genes from individual clusters, using % G2 and number of siRNAs as secondary filters
      • Combine with hits from random forest model
      Marine, S. et al, J. Biomol. Screen., 2011, ASAP
  • 29. Reconfirmation
      • 18/211 genes selected based on thresholding from the primary reconfirmed using Ambion sequences
      • Considering just the genes selected by the random forest and/or clustering methods
          – 11/30 genes selected by RF reconfirmed using Ambion libraries
          – 5/6 genes identified by RF & clustering reconfirmed using multiple libraries
              • ESPL1, FBXO5, INCENP, KIF11 reconfirmed very strongly
      • Based on k = 5 clustering,
          – 23/181 genes from cluster 3 reconfirmed
          – 5/5 genes from cluster 5 reconfirmed
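Expressing the counts on this slide as percentages makes the selection methods easier to compare. The counts are copied from the slide; the dictionary keys are shorthand labels chosen here.

```python
# (reconfirmed, selected) counts as reported on the slide.
reconfirmed = {
    "threshold on primary screen": (18, 211),
    "random forest": (11, 30),
    "random forest and clustering": (5, 6),
    "cluster 3 (k = 5)": (23, 181),
    "cluster 5 (k = 5)": (5, 5),
}

# Reconfirmation rate in percent, rounded to one decimal place.
rates = {name: round(100 * hits / total, 1)
         for name, (hits, total) in reconfirmed.items()}
```

The rates come out at roughly 8.5% for thresholding, 36.7% for the random forest, and 83.3% for genes picked by both the random forest and clustering, although the latter groups are small.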
  • 30. Outlook
      • Complements traditional threshold-based selection methods
      • The random forest approach is sufficiently accurate and lets us avoid explicitly selecting features up front
      • Combining it with clustering lets us zoom into biologically relevant clusters of genes
  • 31. Acknowledgements
      • Scott Martin
      • Pinar Tuzmen
      • Carleen Klump
      • Eugen Buehler