This document discusses methods for exploring bioactivity data from small molecules and siRNA. It begins with background on cheminformatics methods like QSAR and machine learning approaches. It then outlines exploring structure-activity relationships using a "landscape" view and quantifying cliffs in activity. Models can be developed to predict these SAR landscapes. The document also discusses linking small molecule and siRNA screening data by looking at shared targets and pathways. It notes challenges integrating the different data types due to differences in dimensionality and resolution.
Small Molecules and siRNA: Methods to Explore Bioactivity Data
1. Small Molecules and siRNA: Methods to Explore Bioactivity Data Rajarshi Guha NIH Center for Translational Therapeutics August 17, 2011 Pfizer, Groton
2. Background Cheminformatics methods: QSAR, diversity analysis, virtual screening, fragments, polypharmacology, networks. More recently: siRNA screening, high content imaging, combination screening. Extensive use of machine learning. All tied together with software development. Integrate small molecule information & biosystems: systems chemical biology
3. Outline Exploring the SAR landscape: The landscape view of SAR data; Quantifying SAR landscapes; Extending an SAR landscape. Linking small molecule & RNAi HTS: Overview of the Trans NIH RNAi Screening Initiative; Infrastructure components; Linking small molecule & siRNA screens
5. Structure Activity Relationships Similar molecules will have similar activities. Small changes in structure will lead to small changes in activity. One implication is that SARs are additive. This is the basis for QSAR modeling. Martin, Y.C. et al., J. Med. Chem., 2002, 45, 4350–4358
6. Structure Activity Landscapes Rugged gorges or rolling hills? Small structural changes associated with large activity changes represent steep slopes in the landscape. But traditionally, QSAR assumes gentle slopes. Machine learning does not handle such special cases well. Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
7. Characterizing the Landscape A cliff can be numerically characterized Structure Activity Landscape Index (SALI) Cliffs are characterized by elements of the matrix with very large values Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
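The slide text does not carry the SALI formula itself. In the cited Guha & Van Drie paper it is SALI(i,j) = |A_i − A_j| / (1 − sim(i,j)), where A is the activity and sim is a structural similarity. The sketch below assumes fingerprints are given as sets of on-bits and that similarity is Tanimoto; the function names and the handling of identical fingerprints are illustrative choices, not taken from the talk.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def sali_matrix(fps, activities):
    """Return {(i, j): SALI(i, j)} for every pair i < j.

    Assumption for this sketch: pairs with identical fingerprints
    (similarity 1) get inf if their activities differ, and 0 otherwise.
    """
    out = {}
    for i, j in combinations(range(len(fps)), 2):
        d_act = abs(activities[i] - activities[j])
        dist = 1.0 - tanimoto(fps[i], fps[j])
        if dist == 0.0:
            out[(i, j)] = float("inf") if d_act > 0 else 0.0
        else:
            out[(i, j)] = d_act / dist
    return out
```

Large entries correspond to pairs that are structurally close but far apart in activity, which is what the slide calls a cliff.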
8. Visualizing SALI Values The SALI graph: Compounds are nodes. Nodes i,j are connected if SALI(i,j) > X. Only display connected nodes
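A minimal sketch of the graph construction described on this slide, assuming the pairwise SALI values are already available as a dictionary keyed by (i, j). Orienting each edge from the less active to the more active compound is an assumption made here so that edges carry an ordering; the function name is hypothetical.

```python
def sali_graph(sali, activities, cutoff):
    """Build the SALI graph at a given cutoff X.

    sali: {(i, j): value}; activities: sequence of observed activities.
    Returns (nodes, edges); only compounds taking part in at least one
    edge are listed as nodes.
    """
    edges = []
    for (i, j), value in sali.items():
        if value > cutoff:
            # Assumed convention: edge points from lower to higher activity.
            lo, hi = (i, j) if activities[i] <= activities[j] else (j, i)
            edges.append((lo, hi, value))
    nodes = sorted({n for lo, hi, _ in edges for n in (lo, hi)})
    return nodes, edges
```

Raising the cutoff removes edges and isolates nodes, so only the steepest cliffs remain visible.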
9. What Can We Do With SALI’s? SALI characterizes cliffs & non-cliffs. For a given molecular representation, SALI values give us an idea of the smoothness of the SAR landscape. Models try to encode this landscape. Use the landscape to guide descriptor or model selection
10. Descriptor Space Smoothness Edge count of the SALI graph for varying cutoffs Measures smoothness of the descriptor space Can reduce this to a single number (AUC)
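One way to compute the edge-count curve and reduce it to a single number is sketched below. It assumes cutoffs are expressed as fractions of the largest finite SALI value and that the area is taken with the trapezoid rule; both choices, and the function names, are assumptions for illustration rather than details given on the slide.

```python
def edge_count_curve(sali_values, n_cutoffs=11):
    """Count SALI graph edges at evenly spaced cutoffs.

    Cutoffs run from 0 to 1 as fractions of the largest finite SALI value.
    Returns (fractions, counts).
    """
    finite = [v for v in sali_values if v != float("inf")]
    top = max(finite)
    fractions = [k / (n_cutoffs - 1) for k in range(n_cutoffs)]
    counts = [sum(v > f * top for v in sali_values) for f in fractions]
    return fractions, counts


def trapezoid_auc(xs, ys):
    """Area under the piecewise-linear curve through (xs, ys)."""
    return sum((xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2
               for k in range(len(xs) - 1))
```

A curve that drops quickly (small area) indicates that few pairs have large SALI values under the chosen representation.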
11. Other Examples Instead of fingerprints, we use molecular descriptors. SALI denominator now uses Euclidean distance. 2D & 3D random descriptor sets. None are really good: too rough, or too flat. (Figure panels: 2D, 3D)
12. Feature Selection Using SALI Surprisingly, exhaustive search of 66,000 4-descriptor combinations did not yield semi-smoothly decreasing curves Not entirely clear what type of curve is desirable
13. Measuring Model Quality A QSAR model should easily encode the “rolling hills”. A good model captures the most significant cliffs. Can be formalized as: how many of the edge orderings of a SALI graph does the model predict correctly? Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X. Repeat for varying X and obtain the SALI curve
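The edge-ordering check can be sketched as follows, assuming observed and predicted activities are indexed by compound and pairwise SALI values are held in a dictionary. The function returns both the count of correctly ordered edges and the total edge count at threshold X; how S(X) is normalized is left to the caller because the slide text does not specify it. The function name is hypothetical.

```python
def s_of_x(sali, observed, predicted, cutoff):
    """Count SALI graph edges whose activity ordering the model reproduces.

    An edge exists for pair (i, j) when SALI(i, j) > cutoff. It counts as
    correct when the predicted activities are ordered the same way as the
    observed ones. Returns (n_correct, n_edges).
    """
    def sign(x, y):
        return (x > y) - (x < y)

    correct = total = 0
    for (i, j), value in sali.items():
        if value <= cutoff:
            continue
        total += 1
        if sign(observed[i], observed[j]) == sign(predicted[i], predicted[j]):
            correct += 1
    return correct, total
```

Evaluating this over a range of cutoffs gives the points of the SALI curve for a model.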
15. Model Search Using the SCI We’ve used the SALI to retrospectively analyze models. Can we use SALI to develop models? Identify a model that captures the cliffs. Tricky: cliffs are fundamentally outliers; optimizing for good SALI values implies overfitting. Need to trade off between SALI & generalizability
16. Predicting the Landscape Rather than predicting activity directly, we can try to predict the SAR landscape Implies that we attempt to directly predict cliffs Observations are now pairs of molecules A more complex problem Choice of features is trickier Still face the problem of cliffs as outliers Somewhat similar to predicting activity differences Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122
17. Motivation Predicting activity cliffs corresponds to extending the SAR landscape Identify whether a new molecule will perform better or worse compared to the specific molecules in the dataset Can be useful for guiding lead optimization, but not necessarily useful for lead hopping
18. Predicting Cliffs Dependent variables are pairwise SALI values, calculated using fingerprints. Independent variables are molecular descriptors, but considered pairwise: absolute difference of descriptor pairs, or geometric mean of descriptor pairs, … Develop a model to correlate pairwise descriptors to pairwise SALI values
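The two aggregation functions named on this slide can be sketched as below, assuming each molecule is described by a list of numeric descriptors. The function name and the `how` argument are hypothetical; the geometric mean as written assumes non-negative descriptor values.

```python
from itertools import combinations
from math import sqrt


def pairwise_features(descriptors, how="absdiff"):
    """Turn per-molecule descriptor vectors into per-pair feature vectors.

    descriptors: list of equal-length numeric lists, one per molecule.
    how: "absdiff" for |x - y|, "geomean" for sqrt(x * y)
         (the latter assumes non-negative descriptor values).
    Returns {(i, j): feature_vector} for every pair i < j.
    """
    rows = {}
    for i, j in combinations(range(len(descriptors)), 2):
        a, b = descriptors[i], descriptors[j]
        if how == "absdiff":
            rows[(i, j)] = [abs(x - y) for x, y in zip(a, b)]
        elif how == "geomean":
            rows[(i, j)] = [sqrt(x * y) for x, y in zip(a, b)]
        else:
            raise ValueError(f"unknown aggregation: {how}")
    return rows
```

For n molecules this yields n(n − 1)/2 pairwise observations, so 30 molecules give 435 pairs.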
19. A Test Case We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50’s. Evaluate topological and physicochemical descriptors. Developed random forest models: on the original observed values (30 obs) and on the SALI values (435 observations). Cavalli, A. et al, J Med Chem, 2002, 45, 3844-3853
20. Double Counting Structures? The dependent and independent variables both encode structure. But there are pretty low correlations between individual pairwise descriptors and the SALI values
21. Model Summaries Original pIC50 RMSE = 0.97 SALI, AbsDiff RMSE = 1.10 SALI, GeoMean RMSE = 1.04 All models explain similar % of variance of their respective datasets Using geometric mean as the descriptor aggregation function seems to perform best SALI models are more robust due to larger size of the dataset
22. Test Case 2 Considered the Holloway docking dataset, 32 molecules with pIC50’s and Einter Similar strategy as before Need to transform SALI values Descriptors show minimal correlation Holloway, M.K. et al, J Med Chem, 1995, 38, 305-317
23. Model Summaries Original pIC50 RMSE = 1.05 SALI, AbsDiff RMSE = 0.48 SALI, GeoMean RMSE = 0.48 The SALI models perform much more poorly in terms of % of variance explained. Descriptor aggregation method does not seem to have much effect. The SALI models appear to perform decently on the cliffs, but miss the most significant
24. Model Summaries Original pIC50 RMSE = 1.05 SALI, AbsDiff RMSE = 9.76 SALI, GeoMean RMSE = 10.01 With untransformed SALI values, models perform similarly in terms of % of variance explained The most significant cliffs correspond to stereoisomers
25. Test Case 3 38 adenosine receptor antagonists with reported Ki values; use 35 for training and 3 for testing. Random forest model on the SALI values performed reasonably well (RMSE = 7.51, R² = 0.62). Upper end of SALI range is better predicted. Kalla, R.V. et al, J. Med. Chem., 2006, 48, 1984-2008
26.
27. Generally, performance is poorer for smaller cliffs. For any given hold-out molecule, the range of error in SALI prediction is large. Suggests that some form of domain applicability metric would be useful
28. Model Caveats Models based on SALI values are dependent on there being an SAR in the original activity data. Scrambling results for these models are poorer than the original models but aren’t as random as expected
29. Conclusions SALI is the first step in characterizing the SAR landscape Allows us to directly analyze the landscape, as opposed to individual molecules Being able to predict the landscape could serve as a useful way to extend an SAR landscape
30. Joining the Dots: Integrating High Throughput Small Molecule and RNAi Screens
31. RNAi Facility Mission. Perform collaborative genome-wide RNAi screening-based projects with intramural investigators. Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency, reliability, and costs. Range of Assays: Pathway (reporter assays, e.g. luciferase, β-lactamase); Simple Phenotypes (viability, cytotoxicity, oxidative stress, etc.); Complex Phenotypes (high-content imaging, cell cycle, translocation, etc.).
33. RNAi Analysis Workflow: Raw and Processed Data; GO annotations; Pathways; Interactions; Hit List; Follow-up.
34. RNAi Informatics Toolset. Local databases (screen data, pathways, interactions, etc.). Commercial pathway tools. Custom software for loading, analysis and visualization.
35. Back End Services. Currently all computational analysis is performed on the back end. R & Bioconductor code. Custom R package (ncgcrnai) to support NCGC infrastructure; partly derived from cellHTS2. Supports QC metrics, normalization, adjustments, selections, triage, (static) visualization, and reports. Some Java tools for data loading and for library and plate registration.
40. HTS for NF-κB Antagonists. NF-κB controls DNA transcription. Involved in cellular responses to stimuli. Immune response, memory formation. Inflammation, cancer, auto-immune diseases. http://www.genego.com
41. HTS for NF-κB Antagonists. ME-180 cell line. Stimulate cells using TNF, leading to NF-κB activation; readout via a β-lactamase reporter. Identify small molecules and siRNAs that block the resultant activation.
42. Small Molecule HTS Summary. 2,899 FDA-approved compounds screened. 55 compounds retested active. Which components of the NF-κB pathway do they hit? 17 molecules have target/pathway information in GeneGO; literature searches list a few more. Most potent actives: Proscillaridin A, Trabectedin, Digoxin. Miller, S.C. et al., Biochem. Pharmacol., 2010, ASAP
43. RNAi HTS Summary. Qiagen HDG library: 6886 genes, 4 siRNAs per gene. A total of 567 genes were knocked down by 1 or more siRNAs. We consider >= 2 a “reliable” hit. 16 reliable hits. Added in 66 genes for follow-up via triage procedure.
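The ">= 2 siRNAs" rule can be sketched as a simple count per gene. This is only an illustration: the input format (gene, siRNA identifier pairs for siRNAs already called active) and the function name are assumptions, and the gene names in any usage are placeholders.

```python
from collections import Counter


def classify_hits(active_sirnas, min_sirnas=2):
    """Split genes into reliable hits and weaker hits.

    active_sirnas: iterable of (gene, sirna_id) pairs that were called active.
    A gene is "reliable" when at least min_sirnas distinct siRNAs are active.
    """
    unique_pairs = set(active_sirnas)  # ignore repeated measurements
    counts = Counter(gene for gene, _sirna in unique_pairs)
    reliable = sorted(g for g, n in counts.items() if n >= min_sirnas)
    weak = sorted(g for g, n in counts.items() if n < min_sirnas)
    return reliable, weak
```

Genes in the second list are candidates for a triage step rather than being discarded outright.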
44. The Obvious Conclusion. The active compounds target the 16 hits (at least) from the RNAi screen. Useful if the RNAi screen was small and focused. But what if we’re investigating a larger system? Is there a way to get more specific? Can compound data suggest RNAi non-hits?
45. Small Molecule Targets. Some small molecules interact with core components: Bortezomib (proteasome inhibitor), Daunorubicin (IκBα inhibitor).
46. Small Molecule Targets. Others are active against upstream targets: Montelukast (LTD4 antagonist). We also get an idea of off-target effects.
47. Compound Networks - Similarity. Evaluate a fingerprint-based similarity matrix for the 55 actives. Connect pairs that exhibit Tc > 0.7. Edges are weighted by the Tc value. Most groupings are obvious.
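A minimal sketch of the network construction, assuming each fingerprint is available as a set of "on" bit positions. The fingerprint type used in the talk is not stated, and the function and compound names here are hypothetical.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_edges(fingerprints, threshold=0.7):
    """Return (compound_a, compound_b, Tc) for every pair with Tc > threshold.

    fingerprints: dict mapping compound name -> iterable of on-bit positions.
    The Tc value doubles as the edge weight.
    """
    edges = []
    for name_a, name_b in combinations(sorted(fingerprints), 2):
        tc = tanimoto(fingerprints[name_a], fingerprints[name_b])
        if tc > threshold:
            edges.append((name_a, name_b, tc))
    return edges
```

Connected components of the resulting edge list correspond to the compound groupings visible in the network.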
48. A “Dictionary” Based Approach. Create a small-ish annotated library of “seed” compounds. Use it in parallel small molecule/RNAi screens. Use a similarity-based approach to prioritize larger collections in terms of anticipated targets. Currently, we’d use structural similarity. Diversity of prioritized structures depends on the diversity of the annotated library.
49. Compound Networks - Targets. Predict targets for the actives using SEA. The target-based compound network maps nearly identically to the similarity-based network. But depending on the quality of the predicted targets, we get poor (or no) mappings to the RNAi-targeted genes. Keiser, M.J. et al., Nat. Biotech., 2007, 25, 197-206
50. Gene Networks - Pathways. Nodes are the 1374 HDG genes contained in the NCI PID. An edge indicates that two genes/proteins are involved in the same pathway. “Good” hits tend to be very highly connected. Wang, L. et al., BMC Genomics, 2009, 10, 220
51. (Reduced) Gene Networks – Pathways. Nodes are the 526 genes with >= 1 siRNA showing knockdown. An edge indicates that two genes/proteins are involved in the same pathway.
52. Pathway Based Integration. Direct matching of targets is not very useful. Try to map compounds to siRNA targets if the compounds’ predicted target(s) and the siRNA targets are in the same pathway. Considering the 16 reliable hits, we cover 26 pathways. Predicted compound targets cover 131 pathways, for 18 out of 41 compounds. 3 RNAi-derived pathways are not covered by compound-derived pathways: Rhodopsin, alternative NF-κB, FAS.
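The pathway-based mapping can be sketched as a join on shared pathway membership. The data structures below (dictionaries of sets) and the function names are assumptions for illustration only; in the talk the pathway annotations come from the NCI PID and the target predictions from SEA.

```python
def map_compounds_to_hits(compound_targets, hit_genes, gene_pathways):
    """Link each compound to RNAi hit genes that share a pathway with
    one of its predicted targets.

    compound_targets: dict compound -> set of predicted target genes
    hit_genes: iterable of RNAi hit genes
    gene_pathways: dict gene -> set of pathway names
    Compounds with no shared pathway are omitted from the result.
    """
    hits_by_pathway = {}
    for gene in hit_genes:
        for pathway in gene_pathways.get(gene, ()):
            hits_by_pathway.setdefault(pathway, set()).add(gene)

    mapping = {}
    for compound, targets in compound_targets.items():
        linked = set()
        for target in targets:
            for pathway in gene_pathways.get(target, ()):
                linked |= hits_by_pathway.get(pathway, set())
        if linked:
            mapping[compound] = linked
    return mapping


def uncovered_pathways(compound_targets, hit_genes, gene_pathways):
    """Pathways reached by RNAi hits but by no predicted compound target."""
    hit_pw = set().union(*(gene_pathways.get(g, ()) for g in hit_genes))
    all_targets = set().union(*compound_targets.values())
    compound_pw = set().union(*(gene_pathways.get(t, ()) for t in all_targets))
    return hit_pw - compound_pw
```

Compounds whose predicted targets have no pathway annotation drop out of the mapping, which is the attrition described on the slide.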
53. Pathway Based Integration. Still not completely useful, as it only handled 18 compounds. Depending on target predictions is probably not a great idea.
54. Integration Caveats. The biggest bottleneck is lack of resolution. Currently, both small molecule and RNAi data are 1-D: active or inactive, high/low signal. CRCs for small molecules alleviate this a bit. High-content screens can provide significantly more information and so better resolution. Data size and feature selection are of concern.
55. Integration Caveats. Compound annotations are key. Currently working on using ChEMBL data to provide target ‘suggestions’. More comprehensive pathway data will be required. RNAi and small molecule inhibition do not always lead to the same phenotype: this could be indicative of promiscuity, or could indicate true biological differences. Weiss, W.A. et al., Nat. Chem. Biol., 2007, 3, 739-744
56. Conclusions. Building up a wealth of small molecule and RNAi data. “Standard” analysis of RNAi screens is relatively straightforward. Challenges involve integrating RNAi data with other sources. The primary bottleneck is the dimensionality of the data. Simple fluorescence-based approaches do not provide sufficient resolution. High-content is required.
57. Acknowledgements: John Van Drie, Gerry Maggiora, Mic Lajiness, Jurgen Bajorath, Scott Martin, Pinar Tuzmen, Carleen Klump, Dac-Trung Nguyen, Ruili Huang, Yuhong Wang
58. CPT Sensitization & “Central” Genes. TOP1 poisons prevent DNA religation, resulting in replication-dependent double strand breaks. The cell activates the DNA damage response (e.g. ATR). Yves Pommier, Nat. Rev. Cancer, 2006.
59. Screening Protocol. Screen conducted in the human breast cancer cell line MDA-MB-231. Many variables to optimize, including transfection conditions, cell seeding density, assay conditions, and the selection of positive and negative controls.
60. Hit Selection and Follow-Up Dose Response Analysis. [Figure: hits from Screen #1 and Screen #2 with sensitization ranked by log2 fold change; follow-up dose-response curves of viability (%) versus CPT (log M) for ATR (siNeg, siATR-A, siATR-B, siATR-C) and MAP3K7IP2 (siNeg, siMAP3K7IP2-A, siMAP3K7IP2-B, siMAP3K7IP2-C, siMAP3K7IP2-D).] Multiple active siRNAs for ATR, MAP3K7IP2, and BCL2L1.
61. Are These Genes Relevant? Some are well known to be CPT sensitizers. Consider an HPRD PPI sub-network corresponding to the Qiagen HDG gene set. How “central” are these selected genes? Larger values of betweenness indicate that the node lies on many shortest paths. Makes sense: a number of them are stress-related. But some of them have very low betweenness values.
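For readers unfamiliar with the metric, the sketch below computes unnormalized betweenness on an undirected, unweighted graph using Brandes' algorithm. It is a generic illustration, not the code used in the talk; the adjacency-dictionary input format is an assumption, and every node must appear as a key with a symmetric neighbor list.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness centrality for an undirected graph.

    adjacency: dict node -> iterable of neighbor nodes (symmetric).
    """
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search, counting shortest paths from the source.
        order = []
        preds = {v: [] for v in adjacency}
        n_paths = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:  # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes inward.
        dep = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                dep[v] += n_paths[v] / n_paths[w] * (1.0 + dep[w])
            if w != source:
                score[w] += dep[w]
    # Each unordered pair was counted from both endpoints.
    return {v: s / 2.0 for v, s in score.items()}
```

A hub connecting otherwise separate nodes scores high, while a peripheral node that lies on no shortest path between other nodes scores 0.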
62. Are These Genes Relevant? Most selected genes are densely connected. A few are not; these generally did not reconfirm. Network metrics could be used to provide confidence in selections.
Editor's Notes
Outliers in a cliff prediction model are not as severe since SALI changes more slowly than just activity differences
For SALI = 0, had to set log10(SALI) = 0. Similar performance if we use SALI and not log10(SALI); at least more % variance is explained. Still fail on most significant cliffs.
View plates (raw, normalized, adjusted, …). Highlight specific genes, siRNAs. View assay statistics. View pathway membership (via WikiPathways). Link out to external resources (Entrez, GeneCards, …). Hit selection, follow-up (DRC).
* Proscillaridin A was not selected in the 20 compounds for further analysis in the paper. * 2 cardiac glycosides in the top 3; target appears to be caspase-3 (activating it). CG inhibition of NF-κB is well known; see PNAS 2005, by Pollard. * Trabectedin induces lethal DNA strand breaks and blocks the cell cycle in G2 phase.
PSM* genes code for proteasome subunits, so they likely prevent the ubiquitination of the IκBα complex, so that RelA+cp50 cannot be released from the IκBα complex and enter the nucleus.
Size of node indicates potency: larger is more potent. Lanatoside C and A have Tc = 1 and hence the edge was not shown (ideally it should be shown).
Good confirmation that SEA works. Size of node corresponds to SEA confidence score.
We consider 41 compounds rather than 55, since a number of them did not have sufficiently confident target predictions. We then get to 18 compounds, since many of the predicted genes did not map to an NCI PID pathway.
Phenotypic differences can arise when PPIs are involved.
HPRD subnetwork corresponding to the Qiagen HDG has 6782 genes