Classification and Clustering for Hit Identification in High Content RNAi Screens (Presentation Transcript)
Rajarshi Guha, Ph.D., NIH Center for Translational Therapeutics, January 11, 2012
DNA Re-replication
• Collaborators: Mel Depamphilis, NICHD; Wenge Zhu, Georgetown U
• Levels of geminin increase as cells enter S phase, which helps to prevent a second round of DNA replication.
• After mitosis, levels of geminin and cyclins decrease through ubiquitin-mediated degradation.
• DNA replication is a tightly controlled and well-studied process. Proteins including geminin, cyclin A, and Emi1 can help prevent DNA re-replication.
(Figure: Sivaprasad et al., Cell Division)
DNA Re-replication
• Certain cancer cells may have fewer safeguards against DNA re-replication than normal cells (i.e., an Achilles heel). Induction of re-replication results in apoptosis.
(Zhu et al., Cancer Res, 2009)
Screening Protocol
• HCT-116 colon cancer cells are fixed and stained (Hoechst)
• Image at 4X on ImageXpress
• MetaXpress used to perform cell cycle analysis to quantify cells with >4N DNA content
• Screens were run with singles and pools
Screen Summary
• Qiagen druggable genome library (6,866 genes)
• 94 plates, 36K wells including controls
• Good screen performance, some poorer plates were redone
(Figure: per-plate quality statistics, with panels for SSMD and Trimmed Z, plotted against Plate Index)
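As a rough illustration of the SSMD statistic named in the plot, here is a minimal sketch. It assumes the common definition for two independent control groups, (mean_pos - mean_neg) / sqrt(var_pos + var_neg). The slides do not say which SSMD estimator was used, and the function name and well readouts below are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def ssmd(pos, neg):
    """SSMD for two independent groups of control-well readouts."""
    return (mean(pos) - mean(neg)) / sqrt(variance(pos) + variance(neg))

# Hypothetical %G2 readouts for the control wells of one plate (not real data).
pos_wells = [42.0, 45.0, 44.0, 41.0]
neg_wells = [10.0, 12.0, 11.0, 9.0]
plate_ssmd = ssmd(pos_wells, neg_wells)
print(f"SSMD for this plate: {plate_ssmd:.2f}")
```

A larger absolute SSMD means the positive and negative controls on a plate are better separated relative to their spread.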
Goals
• Can we identify genes with GMNN-like phenotypes?
– We already identified a set of genes via thresholding the %G2 parameter
– We’d like to see what we get when we use a multi-dimensional representation
• Employ predictive modeling to “learn” the phenotype
• Apply clustering and identify biologically relevant clusters
What Do GMNN Wells Look Like?
Cell-Level Modeling
• A first approach was to match distributions of individual wells with the overall distribution from the positive control wells
– Expected that the distribution for GMNN wells should match the positive control
– Use KS test to identify wells with similar distributions
– Doesn’t work too well, even for GMNN itself
– Considers 1 parameter at a time (though a 2D KS test is possible)
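The distribution-matching idea can be sketched with a two-sample Kolmogorov-Smirnov statistic, which is the largest gap between two empirical cumulative distribution functions. This is only an illustration of the statistic for a single parameter. The slides do not say which implementation was used, and the function name and intensity values are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the maximum gap is attained at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical per-cell DNA intensities (illustrative values only).
positive_control = [4.1, 4.3, 4.8, 5.0, 5.2]
well_similar = [4.0, 4.4, 4.7, 5.1, 5.3]
well_shifted = [2.0, 2.1, 2.2, 2.4, 2.5]

d_similar = ks_statistic(positive_control, well_similar)
d_shifted = ks_statistic(positive_control, well_shifted)
```

A small statistic suggests the well resembles the positive control for that one parameter; a value near 1 means the two distributions barely overlap.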
Random Forest Model
• Ensemble of decision trees (Breiman 2001)
• Not always the most accurate, but great for exploratory modeling
– Implicit feature selection
– Shown not to overfit as more trees are added
– Provides a measure of feature importance
• Employ the randomForest package from R
(Figure source: http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html)
Cell-Level Modeling
• Removed cells with “incomplete” parameters
• Still leaves 291K positive cases and 3M negative cases
• Developed a random forest model, sampling from negatives to maintain balanced classes
– Predict whether a cell is GMNN-like
– Models from multiple samples of the negative control exhibited similar performance
• Confusion matrix (rows: actual class, columns: predicted class, the reading that matches the error rates below):
            Positive   Negative
Positive     220,636     72,498
Negative      35,614    257,520
• Overall 18% error, 25% error on positive class and 12% error on negative class
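The quoted error rates can be checked directly from the confusion matrix. The sketch below assumes rows are the actual classes and columns the predicted classes, which is the reading that reproduces the 25%, 12% and 18% figures; the function names are illustrative.

```python
# Counts from the slide; rows are assumed to be actual classes, columns predicted.
confusion = {
    "positive": {"positive": 220_636, "negative": 72_498},
    "negative": {"positive": 35_614, "negative": 257_520},
}

def class_error(matrix, label):
    """Fraction of cells of class `label` assigned to the other class."""
    row = matrix[label]
    return 1 - row[label] / sum(row.values())

def overall_error(matrix):
    """Fraction of all cells that were misclassified."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in matrix)
    return 1 - correct / total

pos_err = class_error(confusion, "positive")
neg_err = class_error(confusion, "negative")
all_err = overall_error(confusion)
print(f"positive: {pos_err:.1%}, negative: {neg_err:.1%}, overall: {all_err:.1%}")
```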
Cell-Level Modeling
• Significant overlap between distributions for the negative and positive controls
Cell-Level Predictions
• Aggregate predictions for all cells in a well to label a well as GMNN-like
• Identify genes with >= 2 siRNAs (i.e., wells) labeled as GMNN-like
– 31 genes identified (GMNN, KIF11, ESPL1, …)
• Identified expected genes and most of the set were functionally relevant
– Also identified a few interesting, novel genes
• Reconfirmation based on Ambion sequences was relatively low (9/31)
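The aggregation step might look like the sketch below. The slides do not state the rule used to turn cell-level predictions into a well label, so the majority threshold (`min_fraction=0.5`) is an assumption, as are the function names, well IDs and the gene name GENE_X.

```python
from collections import defaultdict

def gmnn_like_wells(cell_predictions, min_fraction=0.5):
    """Label a well GMNN-like when more than `min_fraction` of its cells are (assumed rule)."""
    counts = defaultdict(lambda: [0, 0])  # well -> [GMNN-like cells, all cells]
    for well, is_gmnn_like in cell_predictions:
        counts[well][0] += int(is_gmnn_like)
        counts[well][1] += 1
    return {well for well, (pos, total) in counts.items() if pos / total > min_fraction}

def genes_with_support(well_to_gene, positive_wells, min_wells=2):
    """Genes with at least `min_wells` siRNA wells labeled GMNN-like."""
    per_gene = defaultdict(int)
    for well in positive_wells:
        per_gene[well_to_gene[well]] += 1
    return sorted(gene for gene, n in per_gene.items() if n >= min_wells)

# Hypothetical toy data: (well id, per-cell prediction).
cells = [("A1", True), ("A1", True), ("A1", False),
         ("A2", True), ("A2", True),
         ("B1", False), ("B1", True), ("B1", False),
         ("B2", True), ("B2", True)]
layout = {"A1": "GMNN", "A2": "GMNN", "B1": "GENE_X", "B2": "GENE_X"}

positive_wells = gmnn_like_wells(cells)
hit_genes = genes_with_support(layout, positive_wells)
```

In this toy layout GENE_X has only one GMNN-like well, so it is dropped by the two-siRNA requirement.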
Well-Level Modeling
• Started with 27 parameters from MetaXpress
• Performed automated feature selection
– Remove undefined, constant features
– Manually removed a few highly correlated features
• Work with 12 parameters
• Convert to Z-scores
• Positive & negative controls are nicely separated
(Figure panels: All Wells; Control Wells)
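A minimal sketch of the feature clean-up and Z-score conversion follows. It assumes Z-scores are computed per parameter across all wells supplied (the slides do not say whether scaling was per plate or per experiment). The parameter names are taken from the slides, but the values and function names are made up.

```python
from statistics import mean, pstdev, stdev

def usable_features(table):
    """Drop parameters that are undefined for some well or constant across wells."""
    kept = {}
    for name, values in table.items():
        if any(v is None for v in values):
            continue  # undefined for at least one well
        if pstdev(values) == 0:
            continue  # constant across wells, carries no information
        kept[name] = values
    return kept

def z_scores(values):
    """Center on the mean and scale by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical well-level values for three wells.
wells = {
    "X.G2": [10.0, 20.0, 30.0],
    "Cell.DNAArea": [5.0, 5.0, 5.0],   # constant in this toy example
    "G2Cells": [3.0, None, 4.0],       # undefined for one well
}
scaled = {name: z_scores(values) for name, values in usable_features(wells).items()}
```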
Model Performance
• Classification model trained using the positive (GMNN-like) and negative (not GMNN-like) controls
• Perfect classification!
            Positive   Negative
Positive        1504          0
Negative           0       1504
– Worrying: overfitting?
– Nearly 99% of the control wells were confidently classified as positive or negative
Descriptor Importance
• What does the model identify as the most relevant descriptors?
• Some parameters are moderately correlated
(Figure: descriptor importance measured by MeanDecreaseGini for G0.G1Cells, SPhaseCells, X.G2, Cell.MitoticIntegratedIntensity, Cell.DNAIntegratedIntensity, X.G0.G1, Cell.DNAArea, DNABackgroundValue, G2Cells, X.SPhase, Cell.DNAAverageIntensity, Cell.MitoticAverageIntensity)
Random Forest Predictions
• We use the model to predict the class for all the remaining wells
• All four siRNAs targeting GMNN are classified as Geminin-like with high confidence
(Figure: histogram of Percent of Total versus Probability of being Geminin-like)
Random Forest Predictions
• Select genes for which > 75% of their siRNAs are predicted to be Geminin-like with probability > 0.8
• Good overlap with cell-level model
(Figure: Probability of being Geminin-like for each selected gene)
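The selection rule above can be written as a small filter. This sketch applies the two thresholds from the slide literally (strictly greater than 75% of siRNAs, each with probability strictly greater than 0.8). The function name, the genes GENE_A and GENE_B, and all probabilities are hypothetical.

```python
def select_genes(probabilities, min_fraction=0.75, min_probability=0.8):
    """Genes where more than `min_fraction` of siRNAs exceed `min_probability`."""
    selected = []
    for gene, probs in probabilities.items():
        confident = sum(p > min_probability for p in probs)
        if confident / len(probs) > min_fraction:
            selected.append(gene)
    return sorted(selected)

# Hypothetical per-siRNA probabilities of being Geminin-like.
predicted = {
    "GMNN": [0.95, 0.91, 0.88, 0.97],    # 4 of 4 siRNAs pass
    "GENE_A": [0.90, 0.85, 0.82, 0.40],  # 3 of 4 = 75%, not more than 75%
    "GENE_B": [0.30, 0.20, 0.85, 0.10],  # 1 of 4
}
selected_genes = select_genes(predicted)
```

Read literally, a gene with four siRNAs passes only when all four are confidently predicted, since three of four is exactly 75%.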
GO Enrichment
• GO Biological Processes enriched in this set of selected genes are relevant to the biology
• Similarly with pathways (from GeneGo)
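The slides do not say how enrichment was scored. One common choice, shown here purely as an assumption, is the upper tail of the hypergeometric distribution: the chance of seeing at least the observed number of annotated genes in a selection of a given size. Only the library size of 6,866 genes comes from the slides; the other counts and the function name are invented.

```python
from math import comb

def enrichment_p_value(population, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn from `population`,
    of which `annotated` carry the term (hypergeometric upper tail)."""
    upper = min(annotated, selected)
    tail = sum(comb(annotated, k) * comb(population - annotated, selected - k)
               for k in range(overlap, upper + 1))
    return tail / comb(population, selected)

# 6,866 genes in the library (from the slides); the other numbers are hypothetical.
p = enrichment_p_value(population=6866, annotated=100, selected=30, overlap=5)
```

In practice a multiple-testing correction across GO terms would also be needed; that step is omitted here.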
Clustering
• RF classification is useful, but doesn’t directly tell us much about finer groups of genes that might be phenotypically related
• So we apply unsupervised clustering (PAM)
– Explore different numbers of clusters
– Evaluate statistical cluster quality metrics
– Evaluate biologically motivated quality metrics
• We considered both plate-wise and experiment-wise clustering protocols
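To give a feel for medoid-based clustering, here is a simplified alternating k-medoids sketch. It is not the full PAM algorithm (it has no BUILD or SWAP phase) and not the implementation used in the talk. The naive initialization, the function names and the two-dimensional toy points are all assumptions made for illustration.

```python
def distance(p, q):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def assign(points, medoids):
    """Give each point the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: distance(p, points[medoids[c]]))
            for p in points]

def k_medoids(points, k, max_iter=100):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    medoids = list(range(k))  # naive start: the first k points
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties out
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest total distance to the rest.
            new_medoids.append(min(
                members,
                key=lambda i: sum(distance(points[i], points[j]) for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)

# Hypothetical Z-scored wells forming two well-separated groups.
wells = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
medoids, labels = k_medoids(wells, k=2)
```

Because medoids are actual data points, each cluster can be summarized by a real well, which is convenient when inspecting images.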
Platewise Clustering (k=4)
• Cluster assignments can’t be directly compared across plates
• Good to see that control columns are distinctly clustered
• Certain plates show no membership to the ‘GMNN cluster’
Experimentwise Clustering (k=2)
• Encouraging to see clean separation between control columns
• Bulk of wells are identified as inactive
• We can compare results from this clustering to RF classification
– 6 genes identified, with multiple siRNAs clustered with negative control
Experimentwise Clustering (k=2)
• 6 genes identified with multiple siRNAs clustered with the negative control
• These were confidently identified by the RF model
(Figure: Probability of being Geminin-like for each selected gene)
How Many Clusters?
• A priori, difficult to decide how many clusters there should be
– Manual spot checks did not identify distinctly different morphologies, counts
• Evaluate clusters with varying k and calculate average silhouette width
• Clustering based on the Euclidean metric doesn’t do a good job
(Figure: Average Silhouette Width versus Number of Clusters, k = 2 to 20)
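For reference, the average silhouette width can be sketched as follows. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the width is (b - a) / max(a, b). The sketch uses the Euclidean metric and assumes at least two clusters; the function names and one-dimensional toy points are hypothetical.

```python
def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: s(i) is defined as 0
            widths.append(0.0)
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(sum(euclidean(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Hypothetical one-dimensional wells with an obvious two-group structure.
wells = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = average_silhouette_width(wells, [0, 0, 1, 1])
bad = average_silhouette_width(wells, [0, 1, 0, 1])
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned clusters.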
How Many Clusters?
• One approach is to ignore clusterings that spread the GMNN siRNAs across multiple clusters
• The current data suggests that we stick with k = 5
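One way to express this filter is to measure, for each k, the largest fraction of GMNN siRNAs that land in a single cluster, and to discard clusterings below a cut-off. The 0.75 cut-off, the function names, the siRNA IDs and the cluster assignments below are all assumptions for illustration; the slides do not give a formal criterion.

```python
from collections import Counter

def gmnn_cohesion(labels_by_sirna, gmnn_sirnas):
    """Largest fraction of GMNN siRNAs that share a single cluster."""
    counts = Counter(labels_by_sirna[s] for s in gmnn_sirnas)
    return counts.most_common(1)[0][1] / len(gmnn_sirnas)

def acceptable_ks(clusterings, gmnn_sirnas, min_cohesion=0.75):
    """Values of k whose clustering keeps the GMNN siRNAs mostly together."""
    return [k for k, labels in sorted(clusterings.items())
            if gmnn_cohesion(labels, gmnn_sirnas) >= min_cohesion]

# Hypothetical cluster assignments for the four GMNN siRNAs at three values of k.
gmnn = ["GMNN_1", "GMNN_2", "GMNN_3", "GMNN_4"]
clusterings = {
    2: {"GMNN_1": 1, "GMNN_2": 1, "GMNN_3": 1, "GMNN_4": 1},
    5: {"GMNN_1": 3, "GMNN_2": 3, "GMNN_3": 3, "GMNN_4": 5},
    8: {"GMNN_1": 1, "GMNN_2": 4, "GMNN_3": 6, "GMNN_4": 7},
}
kept = acceptable_ks(clusterings, gmnn)
```

A filter like this only prunes candidate values of k; choosing among the survivors still relies on other evidence such as silhouette width and biological enrichment.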
Biological Enrichment in Clusters
• Considering 5 clusters
• Some clusters are annotated with more relevant terms
(Figure annotation: cluster containing 3 of the 4 GMNN siRNAs)
Signal Enhancement in Clusters
• Signal is significantly enhanced in some clusters versus others
• Clusters 1, 2 and 4 did not contain any siRNAs above Z = 3
Making a Final Hitlist
• Off-target effects (OTE) are a major confounding factor
• We are able to assess OTE on a gene-by-gene basis using Common Seed Analysis
• Select genes from individual clusters, using %G2 and number of siRNAs as secondary filters
• Combine with hits from random forest model
(Marine, S. et al., J. Biomol. Screen., 2011, ASAP)
Reconfirmation
• 18/211 genes selected based on thresholding from the primary screen reconfirmed using Ambion sequences
• Considering just the genes selected by the random forest and/or clustering methods
– 11/30 genes selected by RF reconfirmed using Ambion libraries
– 5/6 genes identified by RF & clustering reconfirmed using multiple libraries
• ESPL1, FBXO5, INCENP, KIF11 reconfirmed very strongly
• Based on k = 5 clustering,
– 23/181 genes from cluster 3 reconfirmed
– 5/5 genes from cluster 5 reconfirmed
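Expressing the counts above as rates makes the selection methods easier to compare. The counts are taken from the slide; the dictionary keys are shorthand labels introduced here.

```python
# (reconfirmed genes, selected genes) for each selection method, from the slide.
reconfirmed = {
    "Thresholding (primary screen)": (18, 211),
    "Random forest": (11, 30),
    "Random forest & clustering": (5, 6),
    "Cluster 3 (k = 5)": (23, 181),
    "Cluster 5 (k = 5)": (5, 5),
}

rates = {name: hits / total for name, (hits, total) in reconfirmed.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

The rates work out to roughly 8.5% for thresholding, 36.7% for the random forest, 83.3% for random forest and clustering combined, 12.7% for cluster 3 and 100% for cluster 5, although the smaller sets involve very few genes.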
Outlook
• Complements traditional threshold-based selection methods
• The random forest approach is sufficiently accurate and lets us avoid explicitly selecting features up front
• Combining it with clustering lets us zoom in on biologically relevant clusters of genes