We present a method for visualizing and navigating large and diverse chemical spaces, such as screening datasets, along with their activities and properties. Our approach is to annotate the data with all possible scaffolds contained within each molecule using an exhaustive algorithm developed at NCATS. We have developed a Spotfire visualization that is used to drive the hit triage process. Progression decisions can be made using aggregate scaffold parameters and data from multiple datasets merged at the scaffold level. This visualization easily reveals overlaps that help prioritize hits, highlight tractable series and posit ways to combine aspects of multiple hits. The SAR of a large and complex hit is automatically mapped into all constituent scaffolds making it possible to navigate, via any shared scaffold, to all related hits. This scaffold “walking” helps address bias toward a handful of potent and ligand-efficient molecules at the expense of coverage of chemical space. The mapping also automates the laborious process of substructure searches within a dataset as structures are now linked to pre-processed search results. We compare the NCATS scaffold generation method with published screening triage methods such as nearest-neighbor clustering, data-driven clustering and scaffold networks. We believe that our Spotfire visualization used in combination with structure annotation provides a novel view of large and diverse datasets. This allows teams to effortlessly navigate between structurally related molecules and enriches the population of leads considered and progressed in a manner complementary to established approaches.
This document presents an overview of weighted correlation network analysis (WGCNA), an R package used to identify clusters (modules) of highly correlated genes in a biological network. It describes the main steps of WGCNA, including data preprocessing, constructing a weighted correlation network, identifying modules of co-expressed genes, relating modules to external traits, studying relationships between modules, and finding key driver genes. The goal is to discover how groups of interacting genes work together to impact phenotypic traits.
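The first two steps of the pipeline described above can be sketched in miniature. The following Python snippet is an illustrative toy, not the WGCNA R package itself: it builds a soft-thresholded correlation network over a made-up expression matrix with two blocks of co-expressed genes, then recovers the two modules by hierarchical clustering (WGCNA proper uses the topological overlap measure rather than the raw adjacency used here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
# Toy expression matrix: 20 samples x 12 genes, two correlated blocks
base1 = rng.normal(size=(20, 1))
base2 = rng.normal(size=(20, 1))
expr = np.hstack([base1 + 0.1 * rng.normal(size=(20, 6)),
                  base2 + 0.1 * rng.normal(size=(20, 6))])

beta = 6  # soft-threshold power
adjacency = np.abs(np.corrcoef(expr, rowvar=False)) ** beta
dissimilarity = 1 - adjacency  # WGCNA would use 1 - TOM here
np.fill_diagonal(dissimilarity, 0)

# Hierarchical clustering on the dissimilarity; cut the tree into 2 modules
Z = linkage(squareform(dissimilarity, checks=False), method="average")
modules = fcluster(Z, t=2, criterion="maxclust")
print(modules)  # the first 6 genes land in one module, the last 6 in the other
```

The soft-threshold power `beta` is the key WGCNA parameter; raising it suppresses weak correlations so that module structure dominates the network.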
Next-generation sequencing and quality control: An Introduction (2016), Sebastian Schmeier
This lecture is part of an introductory bioinformatics workshop. It gives a background to what sequencing is, what the results of a sequencing experiment are, how to assess the quality of a sequencing run, what error sources exist and how to deal with errors. The accompanying websites are available at http://sschmeier.com/bioinf-workshop/
This document outlines exercises for quality control of NGS data from an Illumina sequencing experiment on tomato ripening stages. The exercises include: 1) evaluating raw fastq files for format and number of sequences; 2) using FastQC to analyze read quality scores, lengths, duplication levels, and k-mer content; and 3) preprocessing the reads using fastq-mcf to trim low quality ends and remove short reads before reanalyzing with FastQC. The goal is to learn how to evaluate NGS read quality and preprocess data prior to downstream analysis.
Degradome sequencing and small RNA targets, Aswin Chilakala
Small RNAs have numerous roles in plant developmental biology, and their discovery is one of the important advances for taking advantage of the plant system. Moreover, identifying the gene targets of small RNAs helps us engineer plant development more effectively; one such method is degradome sequencing.
Zinc finger proteins bind DNA through zinc finger motifs. Each motif contains a beta sheet and alpha helix coordinated by a zinc ion. Early research found zinc fingers bind DNA in the major groove, with fingers 2-5 binding DNA directly and fingers 1-2 interacting through protein-protein interactions. Later, zinc finger proteins were engineered to cut DNA at specific sites, with cleavage occurring near the binding site. This led to the founding of Sangamo Biosciences to develop a modular approach to engineering zinc fingers to target desired DNA sequences. Zinc finger nucleases can introduce double-strand breaks to promote genome editing through homology directed repair.
Continual Learning with Deep Architectures - Tutorial ICML 2021, Vincenzo Lomonaco
Humans have the extraordinary ability to learn continually from experience. Not only can we apply previously learned knowledge and skills to new situations, we can also use these as the foundation for later learning. One of the grand goals of Artificial Intelligence (AI) is building an artificial “continual learning” agent that constructs a sophisticated understanding of the world from its own experience through the autonomous incremental development of ever more complex knowledge and skills (Parisi, 2019). However, despite early speculations and a few pioneering works (Ring, 1998; Thrun, 1998; Carlson, 2010), very little research and effort has been devoted to addressing this vision. Current AI systems suffer greatly from exposure to new data or environments that differ even slightly from the ones they were trained on (Goodfellow, 2013). Moreover, the learning process is usually constrained to fixed datasets within narrow and isolated tasks, which can hardly lead to the emergence of more complex and autonomous intelligent behaviors. In essence, continual learning and adaptation capabilities, while more often than not regarded as fundamental pillars of every intelligent agent, have been mostly left out of the main AI research focus.
In this tutorial, we propose to summarize the application of these ideas in light of the more recent advances in machine learning research and in the context of deep architectures for AI (Lomonaco, 2019). Starting from a motivation and a brief history, we link recent Continual Learning advances to previous research endeavours on related topics and we summarize the state-of-the-art in terms of major approaches, benchmarks and key results. In the second part of the tutorial we plan to cover more exploratory studies about Continual Learning with low supervised signals and the relationships with other paradigms such as Unsupervised, Semi-Supervised and Reinforcement Learning. We will also highlight the impact of recent Neuroscience discoveries in the design of original continual learning algorithms as well as their deployment in real-world applications. Finally, we will underline the notion of continual learning as a key technological enabler for Sustainable Machine Learning and its societal impact, as well as recap interesting research questions and directions worth addressing in the future.
Authors: Vincenzo Lomonaco, Irina Rish
Official Website: https://sites.google.com/view/cltutorial-icml2021
OLC assembly involves three main steps:
1. Overlap - Compute all overlaps between reads to construct an overlap graph
2. Layout - Bundle stretches of the overlap graph into contigs
3. Consensus - Pick the most likely nucleotide sequence for each contig by determining consensus from the underlying reads
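The three steps above can be sketched as a toy greedy assembler. The reads below are hypothetical, and the consensus step is trivial here because the toy reads contain no sequencing errors; real assemblers build the full overlap graph and call consensus from read pileups:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (>= min_len)."""
    for l in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:l]):
            return l
    return 0

def olc_assemble(reads, min_len=3):
    """Greedy OLC sketch: repeatedly merge the pair with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    l = overlap(a, b, min_len)
                    if l > best[0]:
                        best = (l, i, j)
        l, i, j = best
        if l == 0:
            break  # no remaining overlaps; leftover reads are separate contigs
        merged = reads[i] + reads[j][l:]  # layout: bundle the pair into a contig
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

reads = ["ATGCGTAC", "GTACCTGA", "CTGATTT"]
print(olc_assemble(reads))  # ['ATGCGTACCTGATTT'] - a single contig
```

The all-pairs overlap loop is the expensive part in practice, which is why real OLC assemblers use indexing (e.g. suffix structures) rather than this quadratic scan.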
This document provides an overview of next generation sequencing (NGS) analysis. It discusses various NGS platforms such as Illumina, Roche 454, PacBio, and Ion Torrent. It also covers common file formats for sequencing data like FASTQ, quality control measures to assess data quality, and applications of NGS such as RNA-seq and ChIP-seq. The document aims to introduce researchers to basic concepts in NGS analysis and highlights available resources for storing and analyzing large sequencing datasets.
Protein ligands bind to specific sites on proteins. Common protein ligands include antibodies and molecules like nucleic acids and peptides. Main methods to study protein-ligand interactions are spectroscopic techniques like fluorescence spectroscopy and structural methods like X-ray crystallography and NMR spectroscopy. Protein-ligand interactions are crucial for processes in living organisms as they allow for molecular recognition and signal transmission essential to biological functions.
This document discusses generative adversarial networks (GANs) and their relationship to reinforcement learning. It begins with an introduction to GANs, explaining how they can generate images without explicitly defining a probability distribution by using an adversarial training process. The second half discusses how GANs are related to actor-critic models and inverse reinforcement learning in reinforcement learning. It explains how GANs can be viewed as training a generator to fool a discriminator, similar to how policies are trained in reinforcement learning.
Recombinant DNA technology allows DNA from different species to be isolated, cut, spliced together, and replicated. This creates new "recombinant" DNA molecules. Key steps include using restriction enzymes to cut DNA into fragments, inserting fragments into cloning vectors like plasmids, and transforming host cells to replicate the recombinant DNA. PCR is also used to amplify specific DNA sequences. Recombinant DNA technology has many applications, including producing human proteins, diagnosing genetic diseases, and detecting bacteria and viruses.
[DL Hacks] Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternati..., Deep Learning JP
The document discusses Deep Learning Japan (DL Papers), a website that aggregates and shares Japanese-language papers on deep learning. It provides an overview of the website's features and content, including sections on recent papers, tutorials, tools and frameworks. In summary:
- DL Papers collects and shares Japanese papers on deep learning techniques to help disseminate research.
- The site organizes papers into categories like recent publications, tutorials, tools and frameworks.
- It aims to help more researchers engage with deep learning and accelerate progress in the field through open sharing of ideas.
PCR Array Data Analysis Tutorial: qPCR Technology Webinar Series Part 3, QIAGEN
This webinar presentation provides an overview and tutorial on analyzing data from RT2 Profiler PCR Array experiments. It discusses organizing raw Ct value data, performing ΔCt and ΔΔCt calculations to analyze gene expression changes between sample groups, and using the GeneGlobe Data Analysis Center web portal to analyze the data. The webinar highlights new features of the Data Analysis Center including improved data visualization and an upgraded sample manager. It emphasizes following the standard protocol for setting baselines and thresholds when analyzing PCR array data.
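The ΔCt and ΔΔCt arithmetic mentioned above is simple to state in code. The sketch below uses made-up Ct values, gene names, and housekeeping gene, none of which are taken from the webinar:

```python
# ΔΔCt fold-change calculation for a qPCR array (illustrative values only)
control_ct = {"GENE1": 24.0, "GENE2": 28.0, "ACTB": 18.0}   # reference group
treated_ct = {"GENE1": 22.0, "GENE2": 29.0, "ACTB": 18.5}   # test group
housekeeping = "ACTB"

def fold_change(gene):
    # ΔCt = Ct(target) - Ct(housekeeping), computed within each group
    d_control = control_ct[gene] - control_ct[housekeeping]
    d_treated = treated_ct[gene] - treated_ct[housekeeping]
    # ΔΔCt = ΔCt(treated) - ΔCt(control); fold change = 2^(-ΔΔCt)
    return 2 ** -(d_treated - d_control)

print(fold_change("GENE1"))  # ~5.66: lower Ct means more starting template
print(fold_change("GENE2"))  # ~0.71: slightly down-regulated
```

The 2^(-ΔΔCt) step assumes perfect doubling per PCR cycle; the webinar's point about setting baselines and thresholds consistently is what makes the Ct values comparable in the first place.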
What is PCR
Basic Requirements
Types of PCR
Asymmetric PCR
Applications of PCR
Advantages of PCR
Limitations of PCR
DNA Template
Primers
Taq polymerase
Deoxynucleoside triphosphates (dNTPs)
Buffer solution
Divalent cations (e.g. Mg2+)
Continual learning involves building machine learning systems that can learn continuously over time from new data and tasks while retaining knowledge from previous learning. This mimics how humans learn throughout their lives. However, continual learning faces challenges like catastrophic forgetting where new learning interferes with past knowledge. Potential solutions involve balancing plasticity to learn new things with stability to retain old knowledge. The field is still new with experiments focused on simple tasks, but continual learning could enable increasingly intelligent systems that learn forever.
ChIP-seq is a technique to identify where proteins bind to DNA in the genome. It involves cross-linking proteins to DNA in cells, fragmenting the DNA, immunoprecipitating the protein-DNA complexes using an antibody for the protein of interest, and then sequencing the retrieved DNA. This allows mapping of the genomic binding sites for the protein. The document discusses experimental design considerations for ChIP-seq, such as antibody choice and controls. It also reviews data analysis steps including read mapping, peak calling to identify enriched regions, and downstream analyses like motif finding. Higher resolution techniques like ChIP-exo are also introduced that can identify protein binding sites at base pair level.
Tensors are higher order extensions of matrices that can incorporate multiple modalities and encode higher order relationships in data. Tensors play a significant role in machine learning through (1) tensor contractions, (2) tensor sketches, and (3) tensor decompositions. Tensor contractions are extensions of matrix products to higher dimensions. Tensor sketches efficiently compress tensors while preserving information. Tensor decompositions compute low rank components that constitute a tensor.
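Two of the three operations above can be illustrated with NumPy's `einsum`; shapes and values below are arbitrary examples, and tensor sketches are omitted as they need more machinery:

```python
import numpy as np

# Tensor contraction generalizes matrix products: here a 3-way tensor T is
# contracted with a matrix M along its last mode.
T = np.arange(24.0).reshape(2, 3, 4)   # 2 x 3 x 4 tensor
M = np.ones((4, 5))                    # contract the size-4 mode against M
C = np.einsum("ijk,kl->ijl", T, M)     # result: 2 x 3 x 5
print(C.shape)

# A rank-1 CP component, the building block of a tensor decomposition:
# the outer product of three vectors.
a, b, c = np.ones(2), np.arange(3.0), np.ones(4)
component = np.einsum("i,j,k->ijk", a, b, c)   # 2 x 3 x 4 rank-1 tensor
```

A CP decomposition approximates a tensor as a sum of such rank-1 components, just as a truncated SVD approximates a matrix as a sum of rank-1 outer products.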
Slides from a paper-reading session. GANs are also explained from scratch. Details of the target paper are given below.
Note: due to gaps in my knowledge there may be mistakes, so please treat these slides only as a reference.
Generative Image Inpainting with Contextual Attention
Yu, Jiahui and Lin, Zhe and Yang, Jimei and Shen, Xiaohui and Lu, Xin and Huang, Thomas S
arXiv:https://arxiv.org/pdf/1801.07892.pdf
Github:https://github.com/JiahuiYu/generative_inpainting
This document summarizes different computational methods for protein structure prediction, including homology modeling, fold recognition, threading, and ab initio modeling. Homology modeling relies on identifying proteins with similar sequences and known structures. Fold recognition and threading can be used when there are no homologs, to identify proteins with the same overall fold but different sequences. Ab initio modeling uses physics-based modeling and protein fragments to predict structure from sequence alone, and has challenges due to the vast number of possible conformations.
Peter Langfelder presented on weighted gene co-expression network analysis of HD data. Key points:
- WGCNA identified gene modules in mouse striatum associated with CAG repeat length. Neuronal modules were down with increasing repeats while oligodendrocyte modules were up.
- Human HD brain regions showed common and region-specific responses. A neuronal module was down across all regions while astrocyte and microglial modules were up.
- Consensus modules identified co-expressed genes consistently changed across multiple human HD datasets, providing robust modules for further investigation.
EnrichNet: Graph-based statistic and web-application for gene/protein set enr..., Enrico Glaab
EnrichNet is a web-application and web-service to identify and visualize functional associations between a user-defined list of genes/proteins and known cellular pathways. As a complement to classical overlap-based enrichment analysis methods, the EnrichNet approach integrates a novel graph-based statistic with a new interactive visualization of network sub-structures to enable a direct molecular interpretation of how a set of genes or proteins is related to a specific cellular pathway. Available at: http://www.enrichnet.org
Integrative analysis of transcriptomics and proteomics data with ArrayMining ..., Natalio Krasnogor
These slides are part of a presentation I gave on March 2010 at the BioInformatics and Genome Research Open Club at the Weizmann Institute of Science, Israel.
In these slides my student and I describe two web-applications for microarray and gene/protein set analysis, ArrayMining.net and TopoGSA. These use ensemble and consensus methods, as well as modular combinations of different analysis techniques, for an integrative view of (microarray-based) gene sets, interlinking transcriptomics with proteomics data sources. This integrative process uses tools from different fields, e.g. statistics, optimisation and network topological studies. As an example of these integrative techniques, we use a microarray consensus-clustering approach based on Simulated Annealing, which is part of the ArrayMining.net Class Discovery Analysis module, and show how this approach can be combined in a modular fashion with a prior gene set analysis. The results reveal that improved cluster validity indices can be obtained by merging the two methods, and provide pointers to distinct sub-classes within pre-defined tumour categories for a breast cancer dataset from the Nottingham Queens Medical Centre.
In the second part of the talk, I show how results from a supervised microarray feature selection analysis on ArrayMining.net can be investigated in further detail with TopoGSA, a new web-tool for network topological analysis of gene/protein sets mapped onto a comprehensive human protein-protein interaction network. I discuss results from a TopoGSA analysis of the complete set of genes currently known to be mutated in cancer.
T-BioInfo is a platform for processing, analyzing, and integrating multi-omics data. It is used by multiple research groups to extract meaningful insights from large multi-omics datasets. The platform is expanding its educational capabilities to enable more people to extract meaningful, data-driven insights from omics datasets with biomedical applications. The document provides links to learn more about the platform's research and educational features.
This document introduces an open source framework called AIQC that aims to accelerate deep learning research in biomedicine. It describes the pain points of current experimentation workflows, such as a lack of reproducibility and scalability. The framework provides a unified and declarative API to encode datasets, define machine learning tasks, and run experiments. It integrates Python data science and deep learning tools to make experiments more systematic and reusable. Examples show how the framework can be used for applications like tumor classification based on gene expression and compound screening based on structural characteristics. The framework is positioned to help biomedical researchers adopt machine learning methods and partner with cloud platforms to apply AI to genomics.
Many of today's researchers are generating DNA sequence data for large numbers of samples in population-based experiments. This may include whole genomes, exomes, or targeted regions. The Golden Helix SNP and Variation Suite (SVS) provides a powerful computing environment for analyzing these data and performing association tests at the gene and/or variant level.
In this presentation, Dr. Christensen will review fundamentals of population-based variant analysis and demonstrate some of the tools available in SVS for analysis of both common and rare variants. The presentation will feature the recently implemented SKAT-O method, as well as other functions for annotation, visualization, quality control and statistical analysis of DNA sequence variants.
Omics data integration for MSA | International Society for Clinical Biostatis...Said el Bouhaddani 👩💻
Our aim is to predict Multiple System Atrophy (MSA), a rare neurodegenerative disorder, using multiple omics datasets in cell lines.
We develop a probabilistic data integration method, POPLS-DA, to identify consistent molecular biomarkers across high dimensional and correlated omics layers.
This document provides an overview of the course BIONF/BENG 203: Functional Genomics. It discusses the grading breakdown, course outline, sources of functional genomic data including expression data from microarrays and RNA-Seq, proteomic data from mass spectrometry, protein-protein interaction data, and systematic phenotyping data. High-throughput methods for measuring these various types of omics data are also summarized.
Analyzing Current Project and High-Throughput Screening Data by Interactive Selection of Frequently-Occurring Scaffolds. Methods described: how to tweak the MOE SA/Report tool to interactively discover scaffolds in large and diverse HTS-like chemical datasets (code on SVL exchange), and how to automate creation of SA/Reports from project data using KNIME.
Extracting a cellular hierarchy from high-dimensional cytometry data with SPADENikolas Pontikos
SPADE is an algorithm that analyzes high-dimensional cytometry data to extract cellular hierarchies and identify rare cell populations. It works by first downsampling the data based on cell density, then performing agglomerative clustering and constructing a minimum spanning tree to connect cell clusters. This allows visualization of marker expression across cell populations and identification of rare cell types that may be missed with manual gating. SPADE was shown to reconstruct known hierarchies in mouse hematopoiesis and identify an unexpected human cell population from mass cytometry data.
Speaker: Benedict C. S. Cross, PhD, Team leader (Discovery Screening), Horizon Discovery
CRISPR–Cas9 mediated genome editing provides a highly efficient way to probe gene function. Using this technology, thousands of genes can be knocked out and their function assessed in a single experiment. We have conducted over 150 of these complex and powerful screens and will use our experience to guide you through the process of screen design, performance and analysis.
We'll be discussing:
• How to use CRISPR screening for target ID and validation, understanding drug MOA and patient stratification
• The screen design, quality control and how to evaluate success of your screening program
• Horizon’s latest developments to the platform
• Horizon’s novel approaches to target validation screening
The document summarizes Andy Pope's presentation on lead discovery and hit identification approaches at GSK. It discusses current hit identification methods like high-throughput screening (HTS) and encoded library technologies. HTS involves screening large libraries of up to 2 million compounds but has limitations due to costs and inability to screen under different conditions. Encoded library technologies address these limitations by allowing synthesis and screening of much larger virtual libraries of over 1 billion compounds under various conditions using DNA-encoded small molecules. The document emphasizes the importance of compound quality and hit qualification methods to identify true hits from screening assays.
The information revolution has transformed many business sectors over the last decade and the pharmaceutical industry is no exception. Developments in scientific and information technologies have unleashed an avalanche of content on research scientists who are struggling to access and filter this in an efficient manner. Furthermore, this domain has traditionally suffered from a lack of standards in how entities, processes and experimental results are described, leading to difficulties in determining whether results from two different sources can be reliably compared. The need to transform the way the life-science industry uses information has led to new thinking about how companies should work beyond their firewalls. In this talk we will provide an overview of the traditional approaches major pharmaceutical companies have taken to knowledge management and describe the business reasons why pre-competitive, cross-industry and public-private partnerships have gained much traction in recent years. We will consider the scientific challenges concerning the integration of biomedical knowledge, highlighting the complexities in representing everyday scientific objects in computerised form. This leads us to discuss how the semantic web might lead us to a long-overdue solution. The talk will be illustrated by focusing on the EU-Open PHACTS initiative (openphacts.org), established to provide a unique public-private infrastructure for pharmaceutical discovery. The aims of this work will be described and how technologies such as just-in-time identity resolution, nanopublication and interactive visualisations are helping to build a powerful software platform designed to appeal to directly to scientific users across the public and private sectors.
InterPro is a database that classifies proteins into families, domains, and sequence features based on their structural and functional properties. It integrates predictive models from several member databases to annotate unknown protein sequences. Protein signatures like patterns, profiles, fingerprints and hidden Markov models are generated from multiple sequence alignments and used by InterPro for classification. AlphaFold is an artificial intelligence system that can predict protein three-dimensional structures directly from amino acid sequences, representing a major advance in solving the protein folding problem.
Objectives are an understanding of:
▶ Homology search tools
▶ E-values
▶ how BLAST works
▶ how profile HMMs (hmmer) work
▶ which is the right tool for different questions
Single-cell RNA sequencing workshop given at the Ottawa Hospital Research Institute in 2018. Note that slides contain animations that won't be viewed in the slidehsare
This document summarizes a study that used the BigLD algorithm to partition haplotype blocks in chromosome 21 of the NARAC genomic dataset. The researchers:
1) Applied the BigLD algorithm and three other methods (FGT, CIT, SSLD) to detect haplotype blocks in a portion of chromosome 21.
2) Analyzed and compared the blocks detected by each method based on parameters like block size, number of blocks, and genomic coverage.
3) Found that BigLD produced the fewest and largest blocks, indicating more robust partitioning compared to the other methods.
Visual Exploration of Clinical and Genomic Data for Patient StratificationNils Gehlenborg
Talk presented at the Simons Foundation Biotech Symposium "Complex Data Visualization: Approach and Application" (12 September 2014)
http://www.simonsfoundation.org/event/complex-data-visualization-approach-and-application/
In this talk I describe how we integrated a sophisticated computational framework directly into the StratomeX visualization technique to enable rapid exploration of tens of thousands of stratifications in cancer genomics data, creating a unique and powerful tool for the identification and characterization of tumor subtypes. The tool can handle a wide range of genomic and clinical data types for cohorts with hundreds of patients. StratomeX also provides direct access to comprehensive data sets generated by The Cancer Genome Atlas Firehose analysis pipeline.
http://stratomex.caleydo.org
Scaffold-based Analytics: Enabling Hit-to-Lead Decisions by Visualizing Chemical Series Linked Across Large Datasets (ACS Boston 2015)
1. Scaffold-Based Analytics: Enabling Hit-to-Lead Decisions by Visualizing Chemical Series Linked Across Large Datasets
Deepak Bandyopadhyay, Constantine Kreatsoulas, Pat G. Brady, Genaro Scavello, Dac-Trung Nguyen, Tyler Peryea, Ajit Jadhav
GSK | NCATS
Thanks to: Lena Dang and Josh Swamidass (WUSTL), Rajarshi Guha, Stephen Pickett, Martin Saunders, Nicola Richmond, Darren Green, Eric Manas, Todd Graybill, Rob Young, Mike Ouellette, Stan Martens, Javier Gamo, Lourdes Rueda
2. Outline
– Intro: analyzing and merging screening output
– Methods for Scaffold-Based Analytics
– Examples – Linking series across datasets
– Hit Prioritization & Scaffold Hopping (TCAMS)
– Dataset Integration & Scaffold Progression (Kinase “X”)
– Conclusion
3. Small Molecule Lead Discovery at GSK
– High Throughput Screening: maximize chemical diversity
– Focused Screening: compound sets tailored to target families; small-scale process
– Fragment Hit ID: low molecular weight, ligand-efficient starting points
– High-Content / Phenotypic Screening: disease-relevant assays; target agnostic (GSK, Tres Cantos, Spain)
– DNA Encoded Library Technology (ELT): massive combinatorial libraries; binders found by Next-Gen Sequencing
Screening output: large, diverse, and difficult to navigate
4. Historical Hit Triage – on Individual Compounds
Manual data surfing: primary bioassay pIC50 vs. orthogonal assay pIC50, with filters.
Criteria
– Activity data: potency in a suite of assays; selectivity against off-targets; Inhibition Frequency Index (IFI)
– Physical/chemical properties: MW, solubility, permeability, …; Property Forecast Index (PFI)
Use case: isolate good chemical starting points and weed out bad ones.
IFI (%) = (# HTS assays hit / # HTS assays tested) × 100
PFI = chromatographic LogD + # of aromatic rings
Lower PFI improves chances of a positive outcome in phys/chem assays correlated with developability.
IFI: S. Chakravorty, ACS New Orleans 2013. PFI: R. Young, D. V. S. Green, C. Luscombe, A. Hill. Drug Discovery Today 16(17/18), September 2011.
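The IFI and PFI definitions above can be sketched directly in code. This is a minimal sketch; the function names and example numbers are ours, not from the slides:

```python
def ifi(n_assays_hit: int, n_assays_tested: int) -> float:
    """Inhibition Frequency Index: percent of HTS assays in which a compound hit."""
    if n_assays_tested == 0:
        raise ValueError("compound was not tested in any HTS assay")
    return 100.0 * n_assays_hit / n_assays_tested

def pfi(chrom_logd: float, n_aromatic_rings: int) -> float:
    """Property Forecast Index: chromatographic LogD + number of aromatic rings."""
    return chrom_logd + n_aromatic_rings

# A compound that hit 3 of 200 HTS assays has IFI = 1.5%
print(ifi(3, 200))   # 1.5
# Chromatographic LogD 3.2 with 2 aromatic rings gives PFI = 5.2 (lower is better)
print(pfi(3.2, 2))
```

High IFI flags promiscuous (frequent-hitter) compounds; low PFI favors better developability outcomes.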
5. Datasets Used in this Presentation
– Tres Cantos Anti-Malarial Set (TCAMS): 13.5k public compounds from GSK HTS; pIC50 against the Plasmodium falciparum (PF) “susceptible” 3D7 strain; percent inhibition against the “resistant” DD2 strain; other properties including IFI (used for Hit Prioritization and Scaffold Hopping)
– In-house data on Kinase “X”: HTS, FBDD, and ELT data (used for Dataset Integration)
6. Outline
– Intro: analyzing and merging screening output
– Methods for Scaffold-Based Analytics
– Examples – Linking series across datasets
– Hit Prioritization & Scaffold Hopping (TCAMS)
– Dataset Integration & Scaffold Progression (Kinase “X”)
– Conclusion
7. Automation is Necessary for Screening Hit Triage…
• Manual selection and scaffold/R-group based SAR do not scale: 5–50k molecules, 1000s of chemotypes!
• Traditional methods: clustering, substructure/similarity search, …
(Figure: hierarchical clustering; multiple substructure searches (SSS1, SSS2, SSS3) with manually merged results; agglomerative clustering and similarity search at thresholds 0.75 and 0.9; scaffold network, adapted from J. Swamidass, swami.wustl.edu)
8. … But Clustering Is Not Sufficient for SAR Navigation
– Agglomerative clustering: similar molecules ≠ same cluster; many singletons; each molecule lands in a single cluster, which can be limiting
– Hierarchical clustering: same underlying issues, adds complexity (level of hierarchy, e.g. # rings)
(Figure: animal analogy with seals (fur), ducks (bill), penguins (flipper), and a singleton split across Cluster 3 and Cluster 10; bar chart of cluster size by complete-link cluster ID)
9. Proposed Improvement: Automatic Decomposition into All (Overlapping) Scaffolds
Example hit (PF 3D7 pIC50 8.1, LE 0.34, IFI 1.5%): the molecule is decomposed into several overlapping scaffolds, and each scaffold links to its related molecules (here 2, 49, and 226 total).
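This decomposition is what makes scaffold “walking” cheap: once every molecule is annotated with all of its scaffolds, navigating to related hits is an index lookup rather than a substructure search. A minimal sketch, assuming hypothetical compound and scaffold IDs:

```python
from collections import defaultdict

# Hypothetical annotation table: each molecule pre-annotated with ALL scaffolds
# it contains (the output of an exhaustive decomposition tool such as the NCATS
# R-group tool; the IDs below are made up for illustration).
mol_scaffolds = {
    "GSK-001": {"S1", "S2", "S3"},
    "GSK-002": {"S2", "S4"},
    "GSK-003": {"S3", "S5"},
    "GSK-004": {"S4"},
}

# Invert once: scaffold -> molecules containing it. After this, every
# "substructure search" is a dictionary lookup on pre-processed results.
scaffold_mols = defaultdict(set)
for mol, scaffolds in mol_scaffolds.items():
    for s in scaffolds:
        scaffold_mols[s].add(mol)

def related(mol: str) -> set:
    """Scaffold 'walk': all molecules sharing at least one scaffold with mol."""
    hits = set()
    for s in mol_scaffolds[mol]:
        hits |= scaffold_mols[s]
    hits.discard(mol)
    return hits

print(sorted(related("GSK-001")))  # ['GSK-002', 'GSK-003']
```

In Spotfire this inverted mapping is what the linked data tables encode, so selecting a scaffold immediately highlights all compounds containing it.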
10. Next Step: Combine with Activities and Properties
Molecule → Scaffold(s) → Annotation → Related Molecules. Aggregates for the three example scaffolds: Avg IFI 1.5%, Avg pIC50 8.15, Avg LE 0.32; Avg IFI 3.0%, Avg pIC50 7.8, Avg LE 0.45; Avg IFI 4.0%, Avg pIC50 7.8, Avg LE 0.46. The related-molecule sets (2, 49, and 226 total) carry per-compound annotations (pIC50, LE, IFI) for drill-down.
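Aggregating activities and properties at the scaffold level, as on this slide, is a group-by over the many-to-many compound-scaffold mapping. A minimal sketch with made-up compound IDs and values:

```python
from statistics import mean

# Hypothetical per-compound data (IDs and values made up for illustration).
compounds = {
    "GSK-001": {"pIC50": 8.2, "LE": 0.31, "IFI": 1.5},
    "GSK-002": {"pIC50": 8.1, "LE": 0.33, "IFI": 1.5},
    "GSK-003": {"pIC50": 7.5, "LE": 0.57, "IFI": 2.1},
}
# scaffold -> compounds containing it (many-to-many: scaffolds overlap,
# so one compound may contribute to several scaffold aggregates)
scaffold_members = {
    "S1": ["GSK-001", "GSK-002"],
    "S2": ["GSK-002", "GSK-003"],
}

def aggregate(scaffold: str) -> dict:
    """Average each activity/property over all compounds containing a scaffold."""
    rows = [compounds[c] for c in scaffold_members[scaffold]]
    return {key: round(mean(r[key] for r in rows), 2) for key in rows[0]}

print(aggregate("S1"))  # {'pIC50': 8.15, 'LE': 0.32, 'IFI': 1.5}
```

Progression decisions can then be made on the scaffold-level averages, with the per-compound rows kept for drill-down.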
11. Methods Used to Exhaustively Generate Overlapping Scaffolds
– NCATS R-Group Tool: SSSR scaffolds optimized for R-group tables
– Frameworks (GSK): Bemis-Murcko-like & RECAP
– Scaffold Network Generator: hierarchical directed graph of scaffolds (2, 3, 4 rings, …); exhaustive (pro: complete; con: redundant/too simple); scales to large datasets
12. Details: Integrating Scaffold-Based Analytics into a Single Spotfire Visualization
Main data table: ChemBLNTD_TCAMS (Compound ID, SMILES, properties, activities), joined on Compound ID to:
– Scaffolds from the NCATS R-Group Tool: scaffold info (IDs, SMILES); properties & activities aggregated by scaffold; many scaffold IDs per compound
– Frames from Data-Driven Frameworks: Framework ID, FW SMILES, compound IDs
– Clusters from clustering: Cluster ID, cluster size, compound IDs
– Top-level scaffolds from the Scaffold Network Generator: scaffold↔subscaffold (n:n) relations; compound exemplars from top-level scaffolds
Method-specific group IDs link each table back to compound info (IDs, SMILES, properties). We found Scaffold Networks complex to integrate & navigate…
13. Outline
– Intro: analyzing and merging screening output
– Methods for Scaffold-Based Analytics
– Examples – Linking series across datasets
– Hit Prioritization & Scaffold Hopping (TCAMS)
– Dataset Integration & Scaffold Progression (Kinase “X”)
– Conclusion
14. Framework Overlaps in Related Molecules Reveal Substructures Associated with Activity (Hit Prioritization)
Plot: percent inhibition in DD2 (PF resistant strain) vs. pIC50 in 3D7 (PF susceptible strain). Each pie is one compound; each sector/color is one framework; sector size: # molecules; size by ligand efficiency (PF 3D7); exemplar compounds shown.
Annotations: one framework not active in the 3D7 strain (and not found by the R-group tool); several frameworks active and overlapping; one framework moderately active.
15. Scaffold Networks Example: Identify Related Scaffolds with a Desirable Profile (Scaffold Hopping)
Plot: percent inhibition in DD2 (resistant strain) vs. pIC50 in 3D7 (PF susceptible strain); trellis by # rings in scaffold (… possibly more layers with higher # rings …); color by top-level scaffold; size by ligand efficiency (PF 3D7).
The original tricyclic scaffold is inactive against the resistant DD2 strain; new bicyclic and tricyclic scaffolds active against DD2 are found.
16. NCATS R-Group Tool Connects Molecules to Scaffolds with Aggregate Data and Drill-Down
– Minimum # of “useful” scaffolds
– Tautomers under a single scaffold
– Bonus: sensible R-group tables generated
5.7k scaffolds, filtered to 428 by max pIC50 (plotted as Avg. IFI vs. Avg. pIC50 in 3D7, PF sensitive strain).
17. NCATS R-Group Tool Example: Deconstruct SAR of Related Molecules (Scaffold Hopping)
Plot: IFI vs. pIC50 in 3D7 (PF susceptible strain); each pie is one compound; each sector/color is one scaffold; size by ligand efficiency (3D7).
Quinazolines alone are active and ligand efficient; indazoles alone are only weakly active. Opportunities: discover alternative tricycles and fuse design ideas.
18. NCATS R-Group Tool Example: Iterative SAR Exploration (Scaffold Hopping)
Same plot: IFI vs. pIC50 in 3D7 (PF susceptible strain); pies are compounds, sectors are scaffolds, size by ligand efficiency (3D7).
A new tricycle scaffold (1824) seems more active than indoles or quinazolines alone.
19. Scaffold-Based Decision Making and Hit ID Integration (Dataset Integration)
– Kinase “X”: candidate compound demonstrates exquisite kinase selectivity; active against wild-type, inactive against the mutant enzyme
– Backup program: new screens analyzed & integrated using the NCATS R-Group Tool
Hit ID timeline: 2011 fragment hits (288 pIC50s); 2012 HTS (2M screened, 4564 pIC50s); 2014 HTS top-up (350K, 3613 pIC50s); DNA ELT (130 libraries, 824 features, no activity data). Activity data available for 9259 compounds.
Goal: identify selective backup series from new Hit ID efforts.
20. Selective Lead Series Linked Across Datasets (Dataset Integration)
Plot: mean Δ(WT pIC50 – mutant pIC50) vs. mean predicted PFI; scaffolds classified by mutant binding (selective WT/mutant vs. non-selective); size: pIC50. An HTS 2014 hit is linked to an HTS 2012 hit that was not followed up.
Scaffold-level details: mech. pIC50 7.1, cell pIC50 6.3, LE 0.44. Statistics for 8 exemplars: mech. pIC50 6.0 ± 0.88, cell pIC50 5.3 ± 0.81, LE 0.35 ± 0.05.
Assay drill-down per GSK compound ID (2012/2014): mechanistic, full-length WT, truncated WT, cell, and mutant pIC50. Chemistry initiated on series!
21. Identify and Test Unmeasured Compounds Based on Overlap with Actives Across Datasets (Dataset Integration)
Plot: MW vs. PFI, trellised by scaffold; color by LE; shape by: …; weak actives for Kinase “X” marked. One scaffold holds a ligand-efficient HTS hit; another holds ligand-efficient HTS and fragment hits.
22. Identify and Test Unmeasured Compounds Based on Overlap with Actives Across Datasets (continued)
Same plot, now annotated: alongside the ligand-efficient HTS hit and the ligand-efficient HTS and fragment hits, low-MW/PFI untested fragments and low-MW/PFI ELT features to synthesize are identified.
23. Conclusions and Future Directions
• Merging datasets using scaffolds enables a cohesive visualization of chemical series and suggests opportunities for hybridization
• Automated scaffold and R-group generation is a powerful way to prioritize hits and replace scaffolds in large and diverse datasets
• Partitioning into clusters is ambiguous and incomplete for SAR navigation
• Scaffold-generation methods (Frameworks, Scaffold Networks, NCATS R-Group Tool) have their differences, pros and cons; all revealed similar insights from the TCAMS dataset
• Future improvements: scalability to larger and ever-changing datasets; automated selection of informative overlapping scaffolds; combining multiple scaffold-generation methods
25. Backup and References
Scaffold generation methods:
– NCATS R-group analysis (http://tripod.nih.gov/?p=46); tool at http://tripod.nih.gov
– Frameworks (Data-Driven Clustering, GSK/ChemAxon): G. Harper, G. S. Bravi, S. D. Pickett, J. Hussain, and D. V. S. Green. J. Chem. Inf. Comput. Sci. 44(6), 2145–2156 (2004)
– Scaffold Network Generator (http://swami.wustl.edu/sng): M. K. Matlock, J. M. Zaretzki, and S. J. Swamidass. Bioinformatics 29(20), 2655–2656 (2013)
– Agglomerative Clustering (Complete Linkage, GSK/ChemAxon)
26. Hit Prioritization via Clustering: Exploration within Pre-determined Groups Only
– ~2000 complete-linkage clusters in the TCAMS set
– The initial clustering limits the neighbors you can discover
Plots: percent inhibition in DD2 (PF resistant strain), IFI, and # aromatic rings vs. pXC50 in 3D7 (PF susceptible strain), with query molecules in a scatter plot.
27. Using GSK Frameworks
– 80k GSK frameworks, 7.5k RECAP fragments in the TCAMS set
– Score of a framework = average activity of the molecules containing it; low-scoring frameworks can be filtered out
– Issues identified: many equivalent and redundant frameworks; tautomers not unified by the current implementation
28. Related Molecules with Framework Overlaps Reveal Potential Scaffold Hops (Scaffold Hopping)
Plot: percent inhibition in DD2 (PF resistant strain) vs. pXC50 in 3D7 (PF susceptible strain); each pie is one compound; each sector/color is one framework; sector size: # molecules; size by ligand efficiency.
A shared framework links related chemotypes: an opportunity to design a hybrid series.
29. Hit Prioritization via Scaffold Networks: Navigate to Related Scaffolds
13.5k compounds map to 7715 top-level scaffolds (28.5k total).
Plot: percent inhibition in DD2 (PF resistant strain) vs. pXC50 in 3D7 (PF susceptible strain); color by top-level scaffold; size by ligand efficiency; trellis by number of rings in scaffold (2, 3, 4+; … possibly more layers with higher # rings …).
30. Related Molecules from NCATS R-Group Tool: Visualizing Scaffold Overlap and Activity (Hit Prioritization)
Plot: IFI vs. pXC50 in 3D7 (PF susceptible strain); each pie is one compound; each sector/color is one scaffold.
Co-occurring active scaffolds are visible: scaffold 4719 is active by itself, while scaffold 978 alone is not highly active.
Editor's Notes
Data visualization & exploration environment (we use Spotfire). PFI is a lipophilicity measure akin to cLogP; lower is better. 30 sec.
Adding the hierarchy does not fix the agglomerative-clustering issues; it only adds complexity in navigation. Things at different levels may not be matched.
What I will be describing is a method that exhaustively finds all possible shared (or common or frequent) substructures – which we call scaffolds – within your data set, using a tool from the NIH.
Here is a screening hit that I will use to demonstrate this.
… (don’t need to go into gory details)
Biaryl substructure is contained in these molecules that have low similarity to the original hit molecule.
We can aggregate activities & properties at the scaffold-level and then drill-down to the underlying data for individual compounds to progress scaffolds of interest.
Text up top. Grey out clustering. Purple box for aggregate props.
Preprocessed substructure search: which substructure encodes activity?
10 sec. short script
We used the scaffolds to merge all of this data and identify more series that bind selectively
Key message: prioritize ELT with no activity data, just based on overlap with actives from other datasets
This slide can be backup.
Automated substructure search to find part of molecule that’s active. Backup?