Small Molecules and siRNA: Methods to Explore Bioactivity Data


Published in: Technology, Education

  • Outliers in a cliff prediction model are not as severe since SALI changes more slowly than just activity differences
  • For SALI = 0, had to set log10(SALI) = 0. Similar performance if we use SALI rather than log10(SALI); at least more % variance is explained. Still fails on the most significant cliffs
  • View plates (raw, normalized, adjusted, …). Highlight specific genes and siRNAs. View assay statistics. View pathway membership (via WikiPathways). Link out to external resources (Entrez, GeneCards, …). Hit selection and follow-up (DRC)
  • * Proscillaridin A was not selected among the 20 compounds for further analysis in the paper. * 2 cardiac glycosides are in the top 3; the target appears to be caspase-3 (activating it). CG inhibition of NF-κB is well known; see Pollard, PNAS 2005. * Trabectedin induces lethal DNA strand breaks and blocks the cell cycle in G2 phase
  • PSM* genes code for proteasome subunits, so they likely prevent the ubiquitination of the IκBα complex, so that RelA+cp50 cannot be released from the IκBα complex and enter the nucleus
  • Size of node indicates potency; larger is more potent. Lanatoside C and A have Tc = 1 and hence the edge was not shown (ideally it should be shown)
  • Good confirmation that SEA works. Size of node corresponds to SEA confidence score
  • We consider 41 compounds rather than 55, since a number of them did not have sufficiently confident target predictions. We then get to 18 compounds, since many of the predicted genes did not map to an NCI PID pathway
  • Phenotypic differences can arise when PPIs are involved
  • HPRD subnetwork corresponding to the Qiagen HDG has 6782 genes
  • Small Molecules and siRNA: Methods to Explore Bioactivity Data

    1. 1. Small Molecules and siRNA: Methods to Explore Bioactivity Data<br />Rajarshi Guha<br />NIH Center for Translational Therapeutics<br />August 17, 2011<br />Pfizer, Groton<br />
    2. 2. Background<br />Cheminformatics methods<br />QSAR, diversity analysis, virtual screening, fragments, polypharmacology, networks<br />More recently<br />siRNA screening, high content imaging, combination screening<br />Extensive use of machine learning<br />All tied together with software development<br />Integrate small molecule information & biosystems – systems chemical biology<br />
    3. 3. Outline<br />Exploring the SAR landscape<br />The landscape view of SAR data<br />Quantifying SAR landscapes<br />Extending an SAR landscape<br />Linking small molecule & RNAi HTS<br />Overview of the Trans-NIH RNAi Screening Initiative<br />Infrastructure components<br />Linking small molecule & siRNA screens<br />
    4. 4. The Landscape View of Structure Activity Datasets<br />
    5. 5. Structure Activity Relationships<br />Similar molecules will have similar activities<br />Small changes in structure will lead to small changes in activity<br />One implication is that SAR’s are additive<br />This is the basis for QSAR modeling<br />Martin, Y.C. et al., J. Med. Chem., 2002, 45, 4350–4358<br />
    6. 6. Structure Activity Landscapes<br />Rugged gorges or rolling hills?<br />Small structural changes associated with large activity changes represent steep slopes in the landscape<br />But traditionally, QSAR assumes gentle slopes<br />Machine learning is not very good for special cases<br />Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535<br />
    7. 7. Characterizing the Landscape<br />A cliff can be numerically characterized<br />Structure Activity Landscape Index (SALI)<br />Cliffs are characterized by elements of the matrix with very large values<br />Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658<br />
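The formula on this slide did not survive extraction. For reference, the pairwise index as defined in the cited Guha & Van Drie paper is given below, where A_i is the activity of molecule i and sim(i, j) is the similarity between molecules i and j (for example, a fingerprint-based Tanimoto coefficient):

```latex
\mathrm{SALI}_{i,j} = \frac{\lvert A_i - A_j \rvert}{1 - \mathrm{sim}(i,j)}
```

Pairs with sim(i, j) = 1 make the denominator zero, so they need special handling in practice.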
    8. 8. Visualizing SALI Values<br />The SALI graph<br />Compounds are nodes<br />Nodes i,j are connected if SALI(i,j) > X<br />Only display connected nodes<br />
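A minimal sketch of how the SALI graph above could be built. The function names, the `eps` guard for identical structures, the direction of the edges (lower to higher activity), and the toy data are all assumptions for illustration, not the talk's actual code:

```python
from itertools import combinations

def sali(act_i, act_j, sim, eps=1e-6):
    """SALI for one pair; eps guards against sim == 1 (an assumption)."""
    return abs(act_i - act_j) / max(1.0 - sim, eps)

def sali_graph(activities, similarity, cutoff):
    """Return edges (low, high, sali) for pairs whose SALI exceeds the cutoff."""
    edges = []
    for i, j in combinations(sorted(activities), 2):
        value = sali(activities[i], activities[j], similarity[(i, j)])
        if value > cutoff:
            # Orient the edge from the less active to the more active compound
            lo, hi = (i, j) if activities[i] <= activities[j] else (j, i)
            edges.append((lo, hi, value))
    return edges

# Hypothetical activities (e.g. pIC50) and pairwise similarities
activities = {"a": 5.0, "b": 8.0, "c": 5.2}
similarity = {("a", "b"): 0.9, ("a", "c"): 0.5, ("b", "c"): 0.2}
edges = sali_graph(activities, similarity, cutoff=10.0)
```

Only compounds that appear in at least one edge would be displayed; here that is the pair a and b.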
    9. 9. What Can We Do With SALI’s?<br />SALI characterizes cliffs & non-cliffs<br />For a given molecular representation, SALI gives us an idea of the smoothness of the SAR landscape<br />Models try to encode this landscape<br />Use the landscape to guide descriptor or model selection<br />
    10. 10. Descriptor Space Smoothness<br />Edge count of the SALI graph for varying cutoffs<br />Measures smoothness of the descriptor space<br />Can reduce this to a single number (AUC)<br />
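One way the edge-count curve and its AUC might be computed is sketched below. The evenly spaced cutoffs, the normalization of both axes to [0, 1], and the sample values are assumptions; the slides do not specify them:

```python
def edge_count_curve(sali_values, n_cutoffs=11):
    """Fraction of pairs kept as edges at evenly spaced cutoffs from 0 to max SALI."""
    top = max(sali_values)
    cutoffs = [top * k / (n_cutoffs - 1) for k in range(n_cutoffs)]
    fractions = [sum(v > c for v in sali_values) / len(sali_values) for c in cutoffs]
    return cutoffs, fractions

def curve_auc(cutoffs, fractions):
    """Trapezoidal area under the curve, with the cutoff axis rescaled to [0, 1]."""
    span = cutoffs[-1] - cutoffs[0]
    area = 0.0
    for k in range(1, len(cutoffs)):
        width = (cutoffs[k] - cutoffs[k - 1]) / span
        area += width * (fractions[k] + fractions[k - 1]) / 2
    return area

# Hypothetical pairwise SALI values (all positive)
sali_values = [0.4, 3.5, 30.0, 1.2, 0.9, 12.0]
cutoffs, fractions = edge_count_curve(sali_values)
auc = curve_auc(cutoffs, fractions)
```

Computing this single number for each descriptor set allows the sets to be compared directly.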
    11. 11. Other Examples<br />Instead of fingerprints, we use molecular descriptors<br />SALI denominator now uses Euclidean distance<br />2D & 3D random descriptor sets<br />None are really good<br />Too rough, or<br />Too flat<br />2D<br />3D<br />
    12. 12. Feature Selection Using SALI<br />Surprisingly, exhaustive search of 66,000 4-descriptor combinations did not yield semi-smoothly decreasing curves<br />Not entirely clear what type of curve is desirable<br />
    13. 13. Measuring Model Quality<br />A QSAR model should easily encode the “rolling hills”<br />A good model captures the most significant cliffs<br />Can be formalized as<br />How many of the edge orderings of a SALI graph does the model predict correctly?<br />Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X<br />Repeat for varying X and obtain the SALI curve<br />
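A sketch of one point on the SALI curve. The slide describes S(X) as a count of correctly ordered edges; the version below instead scores each edge +1 (correct order), -1 (wrong order) or 0 (tie) and divides by the number of edges, which is an assumed normalization. Names and data are hypothetical:

```python
def sali_curve_point(edges, predicted, cutoff):
    """Score edge orderings at one cutoff.

    edges: (low, high, sali) tuples where the observed activity of low <= high.
    Returns (n_correct - n_wrong) / n_edges, or None if no edge passes the cutoff.
    """
    kept = [(lo, hi) for lo, hi, value in edges if value > cutoff]
    if not kept:
        return None
    score = 0
    for lo, hi in kept:
        if predicted[lo] < predicted[hi]:
            score += 1   # model reproduces the observed ordering
        elif predicted[lo] > predicted[hi]:
            score -= 1   # model inverts the observed ordering
    return score / len(kept)

# Hypothetical SALI edges and model predictions
edges = [("a", "b", 30.0), ("c", "b", 3.5), ("a", "c", 0.4)]
predicted = {"a": 5.5, "b": 7.0, "c": 5.1}
curve = [(x, sali_curve_point(edges, predicted, x)) for x in (0.0, 1.0, 10.0)]
```

In this toy case the model misorders only the weakest edge, so the score rises to 1.0 once the cutoff excludes it.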
    14. 14. SALI Curves<br />
    15. 15. Model Search Using the SCI<br />We’ve used the SALI to retrospectively analyze models<br />Can we use SALI to develop models?<br />Identify a model that captures the cliffs<br />Tricky<br />Cliffs are fundamentally outliers<br />Optimizing for good SALI values implies overfitting<br />Need to trade-off between SALI & generalizability<br />
    16. 16. Predicting the Landscape<br />Rather than predicting activity directly, we can try to predict the SAR landscape<br />Implies that we attempt to directly predict cliffs<br />Observations are now pairs of molecules<br />A more complex problem<br />Choice of features is trickier<br />Still face the problem of cliffs as outliers<br />Somewhat similar to predicting activity differences<br />Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122<br />
    17. 17. Motivation<br />Predicting activity cliffs corresponds to extending the SAR landscape<br />Identify whether a new molecule will perform better or worse compared to the specific molecules in the dataset<br />Can be useful for guiding lead optimization, but not necessarily useful for lead hopping<br />
    18. 18. Predicting Cliffs<br />Dependent variables are pairwise SALI values, calculated using fingerprints<br />Independent variables are molecular descriptors – but considered pairwise<br />Absolute difference of descriptor pairs, or<br />Geometric mean of descriptor pairs<br />…<br />Develop a model to correlate pairwise descriptors to pairwise SALI values<br />
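The two aggregation functions named above can be sketched as follows. The function name, the dictionary layout, and the toy descriptor values are assumptions for illustration; the geometric mean is only meaningful here for non-negative descriptors:

```python
from itertools import combinations
from math import sqrt

def pairwise_features(descriptors, how="absdiff"):
    """Aggregate per-molecule descriptor vectors into one vector per molecule pair."""
    rows = {}
    for i, j in combinations(sorted(descriptors), 2):
        xi, xj = descriptors[i], descriptors[j]
        if how == "absdiff":
            rows[(i, j)] = [abs(a - b) for a, b in zip(xi, xj)]
        elif how == "geomean":
            # Assumes non-negative descriptor values
            rows[(i, j)] = [sqrt(a * b) for a, b in zip(xi, xj)]
        else:
            raise ValueError(f"unknown aggregation: {how}")
    return rows

# Hypothetical two-descriptor vectors for three molecules
descriptors = {"m1": [1.0, 4.0], "m2": [4.0, 9.0], "m3": [9.0, 1.0]}
absdiff = pairwise_features(descriptors, "absdiff")
geomean = pairwise_features(descriptors, "geomean")

# Number of unordered pairs for a 30-molecule dataset
n_pairs_30 = 30 * 29 // 2
```

Each pair becomes one observation, so a dataset of n molecules yields n(n-1)/2 rows; for 30 molecules that is 435.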
    19. 19. A Test Case<br />We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50’s<br />Evaluate topological and physicochemical descriptors<br />Developed random forest models<br />On the original observed values (30 obs)<br />On the SALI values (435 observations)<br />Cavalli, A. et al, J Med Chem, 2002, 45, 3844-3853<br />
    20. 20. Double Counting Structures?<br />The dependent and independent variables both encode structure.<br />But pretty low correlations between individual pairwise descriptors and the SALI values<br />
    21. 21. Model Summaries<br />Original pIC50<br />RMSE = 0.97<br />SALI, AbsDiff<br />RMSE = 1.10<br />SALI, GeoMean<br />RMSE = 1.04<br />All models explain similar % of variance of their respective datasets <br />Using geometric mean as the descriptor aggregation function seems to perform best<br />SALI models are more robust due to larger size of the dataset<br />
    22. 22. Test Case 2<br />Considered the Holloway docking dataset, 32 molecules with pIC50’s and Einter<br />Similar strategy as before<br />Need to transform SALI values <br />Descriptors show minimal correlation<br />Holloway, M.K. et al, J Med Chem, 1995, 38, 305-317<br />
    23. 23. Model Summaries<br />Original pIC50<br />RMSE = 1.05<br />SALI, AbsDiff<br />RMSE = 0.48<br />SALI, GeoMean<br />RMSE = 0.48<br />The SALI models perform much more poorly in terms of % of variance explained<br />Descriptor aggregation method does not seem to have much effect<br />The SALI models appear to perform decently on the cliffs – but miss the most significant cliffs<br />
    24. 24. Model Summaries<br />Original pIC50<br />RMSE = 1.05<br />SALI, AbsDiff<br />RMSE = 9.76<br />SALI, GeoMean<br />RMSE = 10.01<br />With untransformed SALI values, models perform similarly in terms of % of variance explained<br />The most significant cliffs correspond to stereoisomers<br />
    25. 25. Test Case 3<br />38 adenosine receptor antagonists with reported Ki values; use 35 for training and 3 for testing<br />Random forest model on the SALI values performed reasonably well (RMSE = 7.51, R2=0.62)<br />Upper end of SALI range is better predicted<br />Kalla, R.V. et al, J. Med. Chem., 2006, 48, 1984-2008<br />
    26. 26. Test Case 3<br /><ul><li>The dataset does not contain any really big cliffs
    27. 27. Generally, performance is poorer for smaller cliffs</li></ul>For any given hold-out molecule, the range of error in SALI prediction is large<br />Suggests that some form of domain applicability metric would be useful<br />
    28. 28. Model Caveats<br />Models based on SALI values are dependent on there being an SAR in the original activity data<br />Scrambling results for these models are poorer than the original models but aren’t as random as expected<br />
    29. 29. Conclusions<br />SALI is the first step in characterizing the SAR landscape<br />Allows us to directly analyze the landscape, as opposed to individual molecules<br />Being able to predict the landscape could serve as a useful way to extend an SAR landscape<br />
    30. 30. Joining the Dots: Integrating High Throughput Small Molecule and RNAi Screens<br />
    31. 31. RNAi Facility Mission<br />Perform collaborative genome-wide RNAi screening-based projects with intramural investigators<br />Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency, reliability, and costs.<br />Range of Assays<br />Simple Phenotypes (Viability, cytotoxicity, oxidative stress, etc)<br />Pathway (Reporter assays, e.g. luciferase, β-lactamase)<br />Complex Phenotypes (High-content imaging, cell cycle, translocation, etc)<br />
    32. 32. RNAi Informatics Infrastructure<br />
    33. 33. RNAi Analysis Workflow<br />Raw and Processed Data<br />GO annotations<br />Pathways<br />Interactions<br />Hit List<br />Follow-up<br />
    34. 34. RNAi Informatics Toolset<br />Local databases (screen data, pathways, interactions, etc).<br />Commercial pathway tools. <br />Custom software for loading, analysis and visualization.<br />
    35. 35. Back End Services<br />Currently all computational analysis performed on the backend<br />R & Bioconductor code<br />Custom R package (ncgcrnai) to support NCGC infrastructure<br />Partly derived from cellHTS2<br />Supports QC metrics, normalization, adjustments, selections, triage, (static) visualization, reports<br />Some Java tools for<br />Data loading<br />Library and plate registration<br />
    36. 36. User Accessible Tools<br />
    37. 37. User Accessible Tools<br />
    38. 38. RNAi & Small Molecule Screens<br />CAGCATGAGTACTACAGGCCA<br />TACGGGAACTACCATAATTTA<br />What targets mediate activity of siRNA and compound<br />Pathway elucidation, identification of interactions<br /><ul><li> Reuse pre-existing MLI data
    39. 39. Develop new annotated libraries</li></ul>Target ID and validation<br />Link RNAi generated pathway perturbations to small molecule activities. Could provide insight into polypharmacology<br /><ul><li> Run parallel RNAi screen</li></ul>Goal: Develop systems level view of small molecule activity<br />
    40. 40. HTS for NF-κB Antagonists<br />NF-κB controls DNA transcription <br />Involved in cellular responses to stimuli<br />Immune response, memory formation<br />Inflammation, cancer, auto-immune diseases<br /><br />
    41. 41. HTS for NF-κB Antagonists<br />ME-180 cell line<br />Stimulate cells using TNF, leading to NF-κB activation, readout via a β-lactamase reporter<br />Identify small molecules and siRNA’s that block the resultant activation<br />
    42. 42. Small Molecule HTS Summary<br />2,899 FDA-approved compounds screened<br />55 compounds retested active<br />Which components of the NF-κB pathway do they hit?<br />17 molecules have target/pathway information in GeneGO<br />Literature searches list a few more<br />Most Potent Actives<br />Proscillaridin A<br />Trabectedin<br />Digoxin<br />Miller, S.C. et al, Biochem. Pharmacol., 2010, ASAP<br />
    43. 43. RNAi HTS Summary<br />Qiagen HDG library – 6886 genes, 4 siRNA’s per gene<br />A total of 567 genes were knocked down by 1 or more siRNA’s<br />We consider >= 2 as a “reliable” hit<br />16 reliable hits<br />Added in 66 genes for follow up via triage procedure<br />
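The "two or more active siRNAs" rule can be expressed as a short filter. This is a sketch only: the input layout, gene labels, and function name are hypothetical, and how an individual siRNA is called active is not specified in the slides:

```python
from collections import Counter

def reliable_hits(active_sirnas, min_active=2):
    """Genes with at least min_active distinct active siRNAs."""
    # set() drops repeated (gene, siRNA) records, e.g. from replicate plates
    counts = Counter(gene for gene, _ in set(active_sirnas))
    return sorted(g for g, n in counts.items() if n >= min_active)

# Hypothetical (gene, siRNA id) pairs flagged active in the primary screen
active = [("GENE1", "A"), ("GENE1", "B"), ("GENE2", "A"),
          ("GENE3", "C"), ("GENE2", "A"), ("GENE3", "D")]
hits = reliable_hits(active)
```

Here GENE2 is excluded because its two records refer to the same siRNA.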
    44. 44. The Obvious Conclusion<br />The active compounds target the 16 hits (at least) from the RNAi screen<br />Useful if the RNAi screen was small & focused<br />But what if we’re investigating a larger system?<br />Is there a way to get more specific?<br />Can compound data suggest RNAi non-hits?<br />
    45. 45. Small Molecule Targets<br />Bortezomib (proteasome inhibitor)<br />Some small molecules interact with core components<br />Daunorubicin (IκBα inhibitor)<br />
    46. 46. Small Molecule Targets<br />Montelukast (LTD4 antagonist)<br />Others are active against upstream targets<br />We also get an idea of off-target effects<br />
    47. 47. Compound Networks - Similarity<br />Evaluate fingerprint-based similarity matrix for the 55 actives<br />Connect pairs that exhibit Tc > 0.7<br />Edges are weighted by the Tc value<br />Most groupings are obvious<br />
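A minimal sketch of the similarity network construction, assuming fingerprints are stored as sets of on-bit indices. The fingerprint type used in the talk is not stated, and the compound names and bits below are made up:

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for fingerprints stored as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def similarity_network(fingerprints, cutoff=0.7):
    """Weighted edges (i, j, Tc) for compound pairs with Tc above the cutoff."""
    edges = []
    for i, j in combinations(sorted(fingerprints), 2):
        tc = tanimoto(fingerprints[i], fingerprints[j])
        if tc > cutoff:
            edges.append((i, j, tc))
    return edges

# Hypothetical fingerprints
fingerprints = {
    "cpd1": {1, 2, 3, 4, 5},
    "cpd2": {1, 2, 3, 4, 6},
    "cpd3": {1, 2, 3, 4, 5, 7},
    "cpd4": {10, 11},
}
network = similarity_network(fingerprints)
```

Only cpd1 and cpd3 (Tc = 5/6) are connected at the 0.7 cutoff; cpd1 and cpd2 fall just below it at 4/6.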
    48. 48. A “Dictionary” Based Approach<br />Create a small-ish annotated library<br />“Seed” compounds<br />Use it in parallel small molecule/RNAi screens<br />Use a similarity based approach to prioritize larger collections, in terms of anticipated targets<br />Currently, we’d use structural similarity<br />Diversity of prioritized structures is dependent on the diversity of the annotated library<br />
    49. 49. Compound Networks - Targets<br />Predict targets for the actives using SEA<br />Target based compound network maps nearly identically to the similarity based network<br />But depending on the predicted target quality we get poor (or no) mappings to the RNAi targeted genes<br />Keiser, M.J. et al, Nat. Biotech., 2007, 25, 197-206<br />
    50. 50. Gene Networks - Pathways<br />Nodes are 1374 HDG genes contained in the NCI PID <br />Edge indicates two genes/proteins are involved in the same pathway<br />“Good” hits tend to be very highly connected<br />Wang, L. et al, BMC Genomics, 2009, 10, 220<br />
    51. 51. (Reduced) Gene Networks – Pathways<br />Nodes are 526 genes with >= 1 siRNA showing knockdown <br />Edge indicates two genes/proteins are involved in the same pathway<br />
    52. 52. Pathway Based Integration<br />Direct matching of targets is not very useful<br />Try and map compounds to siRNA targets if the compounds’ predicted target(s) and siRNA targets are in the same pathway<br />Considering 16 reliable hits, we cover 26 pathways<br />Predicted compound targets cover 131 pathways<br />For 18 out of 41 compounds<br />3 RNAi-derived pathways not covered by compound-derived pathways <br />Rhodopsin, alternative NFkB, FAS<br />
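The pathway-level matching described above might look like the following sketch. The gene-to-pathway dictionary stands in for a resource such as NCI PID, and all identifiers and function names are hypothetical:

```python
def pathways_for(genes, gene_to_pathways):
    """Union of pathways containing any of the given genes."""
    found = set()
    for gene in genes:
        found |= gene_to_pathways.get(gene, set())
    return found

def link_compounds_to_hits(compound_targets, hit_genes, gene_to_pathways):
    """Map each compound to the hit genes sharing a pathway with its predicted targets."""
    links = {}
    for compound, targets in compound_targets.items():
        target_pathways = pathways_for(targets, gene_to_pathways)
        shared = sorted(g for g in hit_genes
                        if gene_to_pathways.get(g, set()) & target_pathways)
        if shared:
            links[compound] = shared
    return links

# Hypothetical annotations, for illustration only
gene_to_pathways = {"G1": {"P1", "P2"}, "G2": {"P2"}, "G3": {"P3"}, "G4": {"P4"}}
compound_targets = {"cpdA": ["G1"], "cpdB": ["G4"], "cpdC": ["G9"]}
links = link_compounds_to_hits(compound_targets, ["G2", "G3"], gene_to_pathways)
```

Compounds whose predicted targets map to no pathway (cpdC here) drop out entirely, which mirrors the reduction from 41 to 18 compounds noted on the slide.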
    53. 53. Pathway Based Integration<br />Still not completely useful, as it only handled 18 compounds<br />Depending on target predictions is probably not a great idea<br />
    54. 54. Integration Caveats<br />Biggest bottleneck is lack of resolution<br />Currently, both small molecule and RNAi data are 1-D<br />Active or inactive, high/low signal<br />CRC’s for small molecules alleviate this a bit<br />High content screens can provide significantly more information and so better resolution<br />Data size & feature selection are of concern<br />
    55. 55. Integration Caveats<br />Compound annotations are key<br />Currently working on using ChEMBL data to provide target ‘suggestions’<br />More comprehensive pathway data will be required<br />RNAi and small molecule inhibition do not always lead to the same phenotype<br />Could be indicative of promiscuity<br />Could indicate true biological differences<br />Weiss, W.A. et al, Nat. Chem. Biol., 2007, 3, 739-744<br />
    56. 56. Conclusions<br />Building up a wealth of small molecule and RNAi data<br />“Standard” analysis of RNAi screens relatively straightforward<br />Challenges involve integrating RNAi data with other sources<br />Primary bottleneck is dimensionality of the data<br />Simple fluorescence-based approaches do not provide sufficient resolution<br />High-content is required<br />
    57. 57. Acknowledgements<br />John Van Drie<br />Gerry Maggiora<br />Mic Lajiness<br />Jurgen Bajorath<br />Scott Martin<br />Pinar Tuzmen<br />Carleen Klump<br />Dac-Trung Nguyen<br />Ruili Huang<br />Yuhong Wang<br />
    58. 58. CPT Sensitization & “Central” Genes<br />Yves Pommier, Nat. Rev. Cancer, 2006. <br />TOP1 poisons prevent DNA religation resulting in replication-dependent double strand breaks. Cell activates DNA damage response (e.g. ATR).<br />
    59. 59. Screening Protocol<br />Screen conducted in the human breast cancer cell line MDA-MB-231. Many variables to optimize including transfection conditions, cell seeding density, assay conditions, and the selection of positive and negative controls.<br />
    60. 60. Hit Selection<br />[Figure: sensitization ranked by log2 fold change for Screen #1 and Screen #2]<br />[Figure: follow-up dose response analysis, viability (%) vs. CPT (log M), for ATR (siNeg, siATR-A, siATR-B, siATR-C) and MAP3K7IP2 (siNeg, siMAP3K7IP2-A, siMAP3K7IP2-B, siMAP3K7IP2-C, siMAP3K7IP2-D)]<br />Multiple active siRNAs for ATR, MAP3K7IP2, and BCL2L1.<br />
    61. 61. Are These Genes Relevant?<br />Some are well known to be CPT-sensitizers<br />Consider a HPRD PPI sub-network corresponding to the Qiagen HDG gene set<br />How “central” are these selected genes?<br />Larger values of betweenness indicate that the node lies on many shortest paths<br />Makes sense - a number of them are stress-related<br />But some of them have very low betweenness values<br />
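To make the betweenness measure concrete, here is a self-contained sketch using Brandes' algorithm on an unweighted, undirected graph. The toy network and names are hypothetical, and the talk does not say which implementation was used on the HPRD sub-network:

```python
from collections import deque

def betweenness(adjacency):
    """Unnormalized betweenness centrality for an undirected, unweighted graph."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        order = []
        parents = {v: [] for v in adjacency}
        n_paths = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        n_paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:  # breadth-first search that counts shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    parents[w].append(v)
        dependency = {v: 0.0 for v in adjacency}
        for w in reversed(order):  # accumulate dependencies back towards the source
            for v in parents[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair is counted from both endpoints, so halve the totals
    return {v: s / 2 for v, s in score.items()}

# Hypothetical toy PPI network: a path A - B - C - D with a leaf E attached to B
ppi = {"A": ["B"], "B": ["A", "C", "E"], "C": ["B", "D"], "D": ["C"], "E": ["B"]}
centrality = betweenness(ppi)
```

In the toy network B lies on five shortest paths and C on three, while the leaf nodes have a betweenness of zero, which is the kind of low value the slide flags for some selected genes.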
    62. 62. Are These Genes Relevant?<br />Most selected genes are densely connected<br />A few are not<br />Generally did not reconfirm<br />Network metrics could be used to provide confidence in selections<br />