Small Molecules and siRNA: Methods to Explore Bioactivity Data

Small Molecules and siRNA: Methods to Explore Bioactivity Data Rajarshi Guha NIH Center for Translational Therapeutics August 17, 2011 Pfizer, Groton

Background Cheminformatics methods QSAR, diversity analysis, virtual screening, fragments, polypharmacology, networks More recently siRNA screening, high content imaging, combination screening Extensive use of machine learning All tied together with software development Integrate small molecule information & biosystems – systems chemical biology

Outline Exploring the SAR landscape The landscape view of SAR data Quantifying SAR landscapes Extending an SAR landscape Linking small molecule & RNAi HTS Overview of the Trans-NIH RNAi Screening Initiative Infrastructure components Linking small molecule & siRNA screens

The Landscape View of Structure Activity Datasets

Structure Activity Relationships Similar molecules will have similar activities Small changes in structure will lead to small changes in activity One implication is that SAR’s are additive This is the basis for QSAR modeling Martin, Y.C. et al., J. Med. Chem., 2002, 45, 4350–4358

Structure Activity Landscapes Rugged gorges or rolling hills? Small structural changes associated with large activity changes represent steep slopes in the landscape But traditionally, QSAR assumes gentle slopes Machine learning is not very good for special cases Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535

Characterizing the Landscape A cliff can be numerically characterized Structure Activity Landscape Index (SALI) Cliffs are characterized by elements of the matrix with very large values Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
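The formula on this slide did not survive extraction. As a reminder of what the cited Guha and Van Drie paper defines, SALI for a pair of molecules i and j can be written as below, where A_i is the activity of molecule i and sim(i, j) is a similarity between the two molecules (for example a Tanimoto coefficient on fingerprints). The symbol names are chosen here for illustration.

```latex
\mathrm{SALI}_{i,j} = \frac{\lvert A_i - A_j \rvert}{1 - \mathrm{sim}(i,j)}
```

A pair that is structurally very similar (sim close to 1) but differs strongly in activity gets a very large value, which is what marks it as a cliff.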

Visualizing SALI Values The SALI graph Compounds are nodes Nodes i,j are connected if SALI(i,j) > X Only display connected nodes
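A minimal sketch of the graph construction described on this slide, assuming activities and pairwise similarities are already available. The function names, the dictionary layout, and the handling of identical pairs (similarity of 1) are assumptions made for illustration, not the author's implementation.

```python
from itertools import combinations


def sali(act_i, act_j, sim_ij):
    """SALI for one pair: |A_i - A_j| / (1 - sim)."""
    diff = abs(act_i - act_j)
    if sim_ij >= 1.0:
        # Assumption: identical representations give 0 when the activities
        # match and an infinitely steep cliff otherwise.
        return 0.0 if diff == 0 else float("inf")
    return diff / (1.0 - sim_ij)


def sali_graph(activities, similarity, cutoff):
    """Return edges (i, j, sali) for every pair with SALI(i, j) > cutoff.

    activities: dict mapping compound name -> activity
    similarity: dict mapping frozenset({i, j}) -> similarity in [0, 1]
    Compounds that appear in no edge are simply absent from the result,
    which matches "only display connected nodes".
    """
    edges = []
    for i, j in combinations(sorted(activities), 2):
        value = sali(activities[i], activities[j], similarity[frozenset((i, j))])
        if value > cutoff:
            edges.append((i, j, value))
    return edges
```

Raising the cutoff X removes edges, so the graph shrinks toward the most pronounced cliffs.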

What Can We Do With SALI’s? SALI characterizes cliffs & non-cliffs For a given molecular representation, SALI gives us an idea of the smoothness of the SAR landscape Models try to encode this landscape Use the landscape to guide descriptor or model selection

Descriptor Space Smoothness Edge count of the SALI graph for varying cutoffs Measures smoothness of the descriptor space Can reduce this to a single number (AUC)
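One way to read this slide is as a curve of edge counts against cutoff, summarized by its area. The sketch below assumes the pairwise SALI values are given as a flat list and uses a plain trapezoidal rule; the slide does not say how the AUC was actually computed or normalized, so treat both functions as hypothetical.

```python
def edge_counts(sali_values, cutoffs):
    """Number of SALI graph edges (values above the cutoff) at each cutoff."""
    return [sum(1 for v in sali_values if v > c) for c in cutoffs]


def curve_auc(xs, ys):
    """Trapezoidal area under the curve defined by points (xs[k], ys[k])."""
    return sum(
        (xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2.0
        for k in range(len(xs) - 1)
    )
```

Because the area depends on the cutoff range and on the number of pairs, comparing descriptor sets would presumably require the same cutoffs and some normalization of the counts.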

Other Examples Instead of fingerprints, we use molecular descriptors SALI denominator now uses Euclidean distance 2D & 3D random descriptor sets None are really good: too rough or too flat [Figure panels: 2D, 3D]

Feature Selection Using SALI Surprisingly, exhaustive search of 66,000 4-descriptor combinations did not yield semi-smoothly decreasing curves Not entirely clear what type of curve is desirable

Measuring Model Quality A QSAR model should easily encode the “rolling hills” A good model captures the most significant cliffs Can be formalized as: how many of the edge orderings of a SALI graph does the model predict correctly? Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X Repeat for varying X and obtain the SALI curve
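The SALI curve can be sketched as follows. The slide defines S(X) as the number of correctly predicted edges; dividing by the number of edges at each threshold, and returning None when no edges remain, are assumptions of this sketch so that values at different thresholds are comparable.

```python
def s_of_x(pairs, observed, predicted, cutoff):
    """Fraction of SALI edges above `cutoff` whose activity ordering the model reproduces.

    pairs:     list of (i, j, sali_value)
    observed:  dict mapping compound -> measured activity
    predicted: dict mapping compound -> model-predicted activity
    """
    edges = [(i, j) for i, j, value in pairs if value > cutoff]
    if not edges:
        return None  # no edges at this threshold
    correct = 0
    for i, j in edges:
        obs_diff = observed[i] - observed[j]
        pred_diff = predicted[i] - predicted[j]
        # Same sign means the model ranks the pair in the observed order.
        if obs_diff * pred_diff > 0:
            correct += 1
    return correct / len(edges)


def sali_curve(pairs, observed, predicted, cutoffs):
    """S(X) evaluated at each threshold X in `cutoffs`."""
    return [s_of_x(pairs, observed, predicted, c) for c in cutoffs]
```

A model that only gets the steepest cliffs right shows high values at large X and lower values as X decreases.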

Model Search Using the SCI (SALI Curve Integral) We’ve used the SALI to retrospectively analyze models Can we use SALI to develop models? Identify a model that captures the cliffs Tricky Cliffs are fundamentally outliers Optimizing for good SALI values implies overfitting Need to trade off between SALI & generalizability

Predicting the Landscape Rather than predicting activity directly, we can try to predict the SAR landscape Implies that we attempt to directly predict cliffs Observations are now pairs of molecules A more complex problem Choice of features is trickier Still face the problem of cliffs as outliers Somewhat similar to predicting activity differences Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122

Motivation Predicting activity cliffs corresponds to extending the SAR landscape Identify whether a new molecule will perform better or worse compared to the specific molecules in the dataset Can be useful for guiding lead optimization, but not necessarily useful for lead hopping

Predicting Cliffs Dependent variables are pairwise SALI values, calculated using fingerprints Independent variables are molecular descriptors – but considered pairwise Absolute difference of descriptor pairs, or Geometric mean of descriptor pairs … Develop a model to correlate pairwise descriptors to pairwise SALI values
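The pairwise feature construction described above might look like the following sketch. The function names and data layout are hypothetical, and the geometric mean here assumes non-negative descriptor values, which the slides do not state.

```python
from itertools import combinations
from math import sqrt


def abs_diff(x, y):
    """Absolute difference of one descriptor across a pair of molecules."""
    return abs(x - y)


def geo_mean(x, y):
    """Geometric mean of one descriptor across a pair (assumes x, y >= 0)."""
    return sqrt(x * y)


def pairwise_features(descriptors, aggregate):
    """Build one feature vector per molecule pair.

    descriptors: dict mapping molecule -> list of descriptor values
    aggregate:   function combining two descriptor values into one
    Returns a dict mapping (i, j) -> aggregated descriptor vector.
    """
    return {
        (i, j): [aggregate(x, y) for x, y in zip(descriptors[i], descriptors[j])]
        for i, j in combinations(sorted(descriptors), 2)
    }
```

Each pair then becomes one observation, with its SALI value as the response. For n molecules this gives n(n-1)/2 observations, so 30 molecules produce 435 pairs.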

A Test Case We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50’s Evaluate topological and physicochemical descriptors Developed random forest models On the original observed values (30 obs) On the SALI values (435 observations) Cavalli, A. et al, J Med Chem, 2002, 45, 3844-3853

Double Counting Structures? The dependent and independent variables both encode structure. But pretty low correlations between individual pairwise descriptors and the SALI values

Model Summaries Original pIC50 RMSE = 0.97 SALI, AbsDiff RMSE = 1.10 SALI, GeoMean RMSE = 1.04 All models explain similar % of variance of their respective datasets Using geometric mean as the descriptor aggregation function seems to perform best SALI models are more robust due to larger size of the dataset

Test Case 2 Considered the Holloway docking dataset, 32 molecules with pIC50’s and Einter Similar strategy as before Need to transform SALI values Descriptors show minimal correlation Holloway, M.K. et al, J Med Chem, 1995, 38, 305-317

Model Summaries Original pIC50 RMSE = 1.05 SALI, AbsDiff RMSE = 0.48 SALI, GeoMean RMSE = 0.48 The SALI models perform much more poorly in terms of % of variance explained Descriptor aggregation method does not seem to have much effect The SALI models appear to perform decently on the cliffs – but miss the most significant ones

Model Summaries Original pIC50 RMSE = 1.05 SALI, AbsDiff RMSE = 9.76 SALI, GeoMean RMSE = 10.01 With untransformed SALI values, models perform similarly in terms of % of variance explained The most significant cliffs correspond to stereoisomers

Test Case 3 38 adenosine receptor antagonists with reported Ki values; use 35 for training and 3 for testing Random forest model on the SALI values performed reasonably well (RMSE = 7.51, R2=0.62) Upper end of SALI range is better predicted Kalla, R.V. et al, J. Med. Chem., 2006, 48, 1984-2008

Generally, performance is poorer for smaller cliffs For any given hold out molecule, range of error in SALI prediction is large Suggests that some form of domain applicability metric would be useful

Model Caveats Models based on SALI values are dependent on there being an SAR in the original activity data Scrambling results for these models are poorer than the original models but aren’t as random as expected

Conclusions SALI is the first step in characterizing the SAR landscape Allows us to directly analyze the landscape, as opposed to individual molecules Being able to predict the landscape could serve as a useful way to extend an SAR landscape

Joining the Dots: Integrating High Throughput Small Molecule and RNAi Screens

RNAi Facility Mission Perform collaborative genome-wide RNAi screening-based projects with intramural investigators Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency, reliability, and costs Range of Assays Pathway (Reporter assays, e.g. luciferase, β-lactamase) Simple Phenotypes (Viability, cytotoxicity, oxidative stress, etc.) Complex Phenotypes (High-content imaging, cell cycle, translocation, etc.)

RNAi Informatics Infrastructure

RNAi Analysis Workflow Raw and Processed Data GO annotations Pathways Interactions Hit List Follow-up

RNAi Informatics Toolset Local databases (screen data, pathways, interactions, etc). Commercial pathway tools. Custom software for loading, analysis and visualization.

Back End Services Currently all computational analysis performed on the backend R & Bioconductor code Custom R package (ncgcrnai) to support NCGC infrastructure Partly derived from cellHTS2 Supports QC metrics, normalization, adjustments, selections, triage, (static) visualization, reports Some Java tools for Data loading Library and plate registration

RNAi & Small Molecule Screens CAGCATGAGTACTACAGGCCA TACGGGAACTACCATAATTTA What targets mediate activity of siRNA and compound Pathway elucidation, identification of interactions

Develop new annotated libraries Target ID and validation Link RNAi generated pathway perturbations to small molecule activities. Could provide insight into polypharmacology Goal: Develop systems level view of small molecule activity

HTS for NF-κB Antagonists NF-κB controls DNA transcription Involved in cellular responses to stimuli Immune response, memory formation Inflammation, cancer, auto-immune diseases http://www.genego.com

HTS for NF-κB Antagonists ME-180 cell line Stimulate cells using TNF, leading to NF-κB activation, readout via a β-lactamase reporter Identify small molecules and siRNA’s that block the resultant activation

Small Molecule HTS Summary 2,899 FDA-approved compounds screened 55 compounds retested active Which components of the NF-κB pathway do they hit? 17 molecules have target/pathway information in GeneGO Literature searches list a few more Most Potent Actives Proscillaridin A Trabectedin Digoxin Miller, S.C. et al, Biochem. Pharmacol., 2010, ASAP

RNAi HTS Summary Qiagen HDG library – 6886 genes, 4 siRNA’s per gene A total of 567 genes were knocked down by 1 or more siRNA’s We consider >= 2 as a “reliable” hit 16 reliable hits Added in 66 genes for follow up via triage procedure
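The "reliable hit" rule on this slide (a gene counts when at least 2 of its 4 siRNAs are active) can be sketched as a simple count. The input layout and the function name are assumptions, and the gene labels used with it would be placeholders rather than screen results.

```python
from collections import Counter


def reliable_hits(active_sirnas, min_active=2):
    """Genes with at least `min_active` distinct active siRNAs.

    active_sirnas: iterable of (gene, sirna_id) pairs that scored as active.
    Duplicate (gene, sirna_id) records are counted once.
    """
    counts = Counter(gene for gene, _ in set(active_sirnas))
    return sorted(gene for gene, n in counts.items() if n >= min_active)
```

Setting min_active=1 would instead return every gene with any active siRNA, the larger and less stringent set mentioned on the slide.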

The Obvious Conclusion The active compounds target the 16 hits (at least) from the RNAi screen Useful if the RNAi screen was small & focused But what if we’re investigating a larger system? Is there a way to get more specific? Can compound data suggest RNAi non-hits?

Small Molecule Targets Some small molecules interact with core components Bortezomib (proteasome inhibitor) Daunorubicin (IκBα inhibitor)

Small Molecule Targets Others are active against upstream targets Montelukast (LTD4 antagonist) We also get an idea of off-target effects

Compound Networks - Similarity Evaluate fingerprint-based similarity matrix for the 55 actives Connect pairs that exhibit Tc > 0.7 Edges are weighted by the Tc value Most groupings are obvious
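A minimal sketch of the similarity network on this slide, assuming each fingerprint is represented as a Python set of "on" bit positions. The slides do not say which fingerprint or toolkit was used, so the representation and function names are illustrative only.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def similarity_network(fingerprints, threshold=0.7):
    """Return {(a, b): Tc} for every compound pair with Tc > threshold.

    fingerprints: dict mapping compound name -> set of on-bit positions.
    The Tc value doubles as the edge weight.
    """
    edges = {}
    for a, b in combinations(sorted(fingerprints), 2):
        tc = tanimoto(fingerprints[a], fingerprints[b])
        if tc > threshold:
            edges[(a, b)] = tc
    return edges
```

Connected components of the resulting edge set correspond to the compound groupings the slide refers to.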

A “Dictionary” Based Approach Create a small-ish annotated library “Seed” compounds Use it in parallel small molecule/RNAi screens Use a similarity based approach to prioritize larger collections, in terms of anticipated targets Currently, we’d use structural similarity Diversity of prioritized structures is dependent on the diversity of the annotated library

Compound Networks - Targets Predict targets for the actives using SEA Target based compound network maps nearly identically to the similarity based network But depending on the predicted target quality we get poor (or no) mappings to the RNAi targeted genes Keiser, M.J. et al, Nat. Biotech., 2007, 25, 197-206

Gene Networks - Pathways Nodes are 1374 HDG genes contained in the NCI PID Edge indicates two genes/proteins are involved in the same pathway “Good” hits tend to be very highly connected Wang, L. et al, BMC Genomics, 2009, 10, 220

(Reduced) Gene Networks – Pathways Nodes are 526 genes with >= 1 siRNA showing knockdown Edge indicates two genes/proteins are involved in the same pathway

Pathway Based Integration Direct matching of targets is not very useful Try to map compounds to siRNA targets if the compounds’ predicted target(s) and siRNA targets are in the same pathway Considering 16 reliable hits, we cover 26 pathways Predicted compound targets cover 131 pathways For 18 out of 41 compounds 3 RNAi-derived pathways not covered by compound-derived pathways Rhodopsin, alternative NF-κB, FAS
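The pathway-level matching described on this slide can be sketched as a set intersection. All three inputs (predicted targets per compound, the siRNA hit genes, and a gene-to-pathway map) are assumed to be precomputed, and the names are hypothetical; this is not the pipeline used for the reported numbers.

```python
def pathway_matches(compound_targets, hit_genes, gene_pathways):
    """Link compounds to siRNA hit genes that share at least one pathway.

    compound_targets: dict mapping compound -> set of predicted target genes
    hit_genes:        set of genes identified as hits in the RNAi screen
    gene_pathways:    dict mapping gene -> set of pathway names
    Returns a dict mapping compound -> set of hit genes sharing a pathway
    with any of that compound's predicted targets.
    """
    hit_pathways = {g: gene_pathways.get(g, set()) for g in hit_genes}
    matches = {}
    for compound, targets in compound_targets.items():
        # Union of pathways over all predicted targets of this compound.
        cpd_pathways = set()
        for target in targets:
            cpd_pathways |= gene_pathways.get(target, set())
        linked = {g for g, pw in hit_pathways.items() if pw & cpd_pathways}
        if linked:
            matches[compound] = linked
    return matches
```

Compounds whose predicted targets have no pathway annotation drop out entirely, which is one reason such a mapping covers only part of the active set.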

Pathway Based Integration Still not completely useful, as it only handled 18 compounds Depending on target predictions is probably not a great idea

Integration Caveats Biggest bottleneck is lack of resolution Currently, both small molecule and RNAi data are 1-D Active or inactive, high/low signal CRC’s for small molecules alleviate this a bit High content screens can provide significantly more information and so better resolution Data size & feature selection are of concern

Integration Caveats Compound annotations are key Currently working on using ChEMBL data to provide target ‘suggestions’ More comprehensive pathway data will be required RNAi and small molecule inhibition do not always lead to the same phenotype Could be indicative of promiscuity Could indicate true biological differences Weiss, W.A. et al, Nat. Chem. Biol., 2007, 12, 739-744

Conclusions Building up a wealth of small molecule and RNAi data “Standard” analysis of RNAi screens relatively straightforward Challenges involve integrating RNAi data with other sources Primary bottleneck is dimensionality of the data Simple fluorescence-based approaches do not provide sufficient resolution High-content is required

Acknowledgements John Van Drie Gerry Maggiora Mic Lajiness Jurgen Bajorath Scott Martin Pinar Tuzmen Carleen Klump Dac-Trung Nguyen Ruili Huang Yuhong Wang

CPT Sensitization & “Central” Genes Yves Pommier, Nat. Rev. Cancer, 2006. TOP1 poisons prevent DNA religation resulting in replication-dependent double strand breaks. Cell activates DNA damage response (e.g. ATR).

Screening Protocol Screen conducted in the human breast cancer cell line MDA-MB-231. Many variables to optimize including transfection conditions, cell seeding density, assay conditions, and the selection of positive and negative controls.

Hit Selection and Follow-Up Dose Response Analysis [Figure: hits from Screen #1 and Screen #2 with sensitization ranked by Log2 fold change; dose-response curves of Viability (%) against CPT (Log M) for ATR (siNeg, siATR-A, siATR-B, siATR-C) and MAP3K7IP2 (siNeg, siMAP3K7IP2-A, siMAP3K7IP2-B, siMAP3K7IP2-C, siMAP3K7IP2-D)] Multiple active siRNAs for ATR, MAP3K7IP2, and BCL2L1.
