Details on several key chemical, natural-product, and commercial databases that are well curated for drug discovery studies. The importance of pharmacokinetics and ADME in drug candidate selection during hit-to-lead optimisation.
How to implement cheminformatics methods and computational approaches in medicinal chemistry for drug candidate selection.
Many images and charts are adapted from research articles and webpages cited in the original slide deck.
Drug and Chemical Databases 2018 - Drug Discovery | Girinath Pillai
Latest collection of chemical and drug databases for biological research as well as drug design studies. Database statistics, links, and overview data, with an introduction to CADD.
Unit 4 - Informatics & Methods in drug design: Introduction to Bioinformatics, chemoinformatics. ADME databases, chemical, biochemical and pharmaceutical databases.
SAR versus QSAR, history and development of QSAR, types of physicochemical parameters, experimental and theoretical approaches for the determination of physicochemical parameters such as the partition coefficient, Hammett's substituent constant, and Taft's steric constant. Hansch analysis, Free-Wilson analysis, and 3D-QSAR approaches like CoMFA and CoMSIA.
KNIME in Life Science, Cheminformatics and Computational Chemistry | Girinath Pillai
This document discusses using KNIME for life sciences applications like drug discovery. It provides an overview of KNIME's life science extensions and nodes, capabilities for data management and analysis, and examples of building predictive models. Specific topics covered include data visualization, cheminformatics nodes, generating molecular properties, similarity searching, and developing QSAR models using data from sources like ChEMBL. The presenter aims to demonstrate how KNIME can be used to generate and analyze chemistry data for machine learning applications in drug discovery.
This document provides an introduction to quantitative structure-activity relationships (QSAR). It defines QSAR as quantifying physicochemical properties of drugs to see their effect on biological activity. Graphs are used to plot activity versus properties, and regression analysis determines correlation. Key properties discussed are hydrophobicity, steric effects, and electronic effects. Hansch analysis uses equations to relate activity to multiple properties. Advantages are understanding structure-activity and enabling novel analog design. Limitations include potential for false correlations and need for large, high-quality datasets.
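The regression step described above can be sketched in a few lines. This is a minimal, single-descriptor illustration in the spirit of a Hansch-style fit, not the method from the summarized deck: the helper `fit_line`, the variable names, and the data points are all hypothetical, and the data are synthetic values chosen to lie on a straight line.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Synthetic, illustrative data (an assumption, not measured values):
# hydrophobicity (log P) versus activity expressed as log(1/C).
log_p = [0.5, 1.0, 1.5, 2.0, 2.5]
activity = [1.6, 2.0, 2.4, 2.8, 3.2]

slope, intercept, r = fit_line(log_p, activity)
print(f"log(1/C) = {slope:.2f} * logP + {intercept:.2f} (r = {r:.3f})")
```

A real Hansch analysis usually combines several descriptors (hydrophobic, electronic, steric) in one multiple regression, and a high correlation on a small dataset can still be a false correlation, as the summary notes.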
Computer Aided Drug Design and Discovery: An Overview (2006) | Girinath Pillai
The document discusses computer aided drug design and virtual screening. It describes how virtual screening can be used to discover new inhibitors for drug development by simulating the binding of compounds to protein targets. The document outlines the drug discovery process and different types of virtual screening techniques, such as ligand-based and structure-based approaches. It also discusses molecular docking methods and tools that are commonly used to simulate compound binding as part of virtual screening.
DRUG DISCOVERY
Drug Discovery without a lead
LEAD DISCOVERY/IDENTIFICATION
LEAD MODIFICATION
CONCEPT OF PRODRUGS AND SOFT DRUGS
DRUG RECEPTOR INTERACTIONS
Computer Assisted Drug Design By Rauf Pathan and Patel Mo Shaffan | Pathan Rauf Khan
CADD is a modern drug design technique; its use reduces drug screening time and helps discover new drugs with specific therapeutic activity.
This document discusses protein engineering techniques for modifying proteins, including rational protein design using site-directed mutagenesis and directed evolution using random mutagenesis. Site-directed mutagenesis involves introducing point mutations in a particular known area to modify a specific protein function, while directed evolution generates genetic diversity through random mutagenesis and screens variants to identify successful mutations without requiring structural information. Common random mutagenesis methods discussed are error-prone PCR and DNA shuffling, which can be used to engineer properties like protein folding, stability, binding, and catalysis.
This document provides an overview of the history and methods of drug discovery, including traditional and computer-aided approaches. It discusses the traditional drug discovery life cycle from hit identification through random screening and the use of natural products and synthetic chemicals. It then introduces computer-aided drug design (CADD) and describes how it can be used throughout the drug discovery process, including structure-based design, ligand-based design, and de novo design to speed up screening and enable more rational drug design. It also lists some advantages of CADD over traditional methods and examples of drugs successfully developed using these approaches.
Pharmacogenetics is the study of how genes influence the therapeutic and adverse effects of drugs.
Pharmacogenetics plays an important role in drug development and drug safety.
The document discusses the process of preparing a chemical database for virtual screening or compound acquisition. It begins with assembling collections from in-house and external databases. The collection is then cleaned by removing invalid structures and standardizing structure representations. Property filtering is used to focus on lead-like compounds. Known active molecules are searched for structural similarity. Alternative structures like stereoisomers are explored. Representatives are selected from clustered structures using descriptors and similarity metrics. 3D structures are generated and a final list of compounds is assembled for screening, with some random additions, completing the preparation.
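The cleaning and property-filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function `prepare_library`, the record fields, the compound identifiers, the property values, and the thresholds are all hypothetical, and the SMILES strings are assumed to be already standardized so that identical strings mean identical structures.

```python
def prepare_library(records, max_mw=350.0, max_logp=3.5):
    """Clean, deduplicate, and property-filter compound records.

    The default thresholds are illustrative assumptions, not a
    standard definition of lead-likeness.
    """
    seen = set()
    kept = []
    for rec in records:
        smiles = (rec.get("smiles") or "").strip()
        if not smiles:
            continue  # cleaning: drop records without a structure
        if smiles in seen:
            continue  # cleaning: drop duplicate structures
        seen.add(smiles)
        if rec["mw"] <= max_mw and rec["logp"] <= max_logp:
            kept.append({**rec, "smiles": smiles})
    return kept


# Hypothetical records with precomputed, illustrative property values.
collection = [
    {"id": "cmpd-1", "smiles": "CCO", "mw": 46.1, "logp": -0.3},
    {"id": "cmpd-2", "smiles": "", "mw": 0.0, "logp": 0.0},
    {"id": "cmpd-3", "smiles": "CCO", "mw": 46.1, "logp": -0.3},
    {"id": "cmpd-4", "smiles": "c1ccccc1O", "mw": 94.1, "logp": 1.5},
    {"id": "cmpd-5", "smiles": "CCCCCCCCCCCCCCCCCC", "mw": 254.5, "logp": 9.0},
]

library = prepare_library(collection)
kept_ids = [rec["id"] for rec in library]
print(kept_ids)
```

In practice a cheminformatics toolkit would handle structure validation, standardization, and property calculation; the sketch only shows the order of the steps.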
This document outlines the key steps involved in the drug discovery process:
1. Target identification involves identifying protein targets through methods like sequence analysis and cDNA library generation.
2. Target validation confirms the drug will affect the specific target through techniques like chemogenomics and target gene disruption.
3. Lead identification screens compound libraries to find leads that are potent against the target using assays.
4. Lead optimization refines the leads using computer-aided drug design and de novo drug design.
5. Preclinical pharmacology involves testing the drugs on lab animals and human cells, followed by clinical trials with government approval.
This presentation discusses molecular similarity searching methods for drug discovery. It begins with an introduction to cheminformatics and the principle that structurally similar molecules tend to have similar biological properties. The document then covers molecular representations, methods for calculating similarity coefficients between molecules, and a probabilistic model for similarity searching. It proposes a contribution called the Molecular Dynamic Clustering method that uses molecular dynamics simulations and classification algorithms to better assess molecular similarity.
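As a concrete instance of a similarity coefficient, the sketch below computes the Tanimoto coefficient between fingerprints represented as sets of "on" bit positions and ranks a small library against a query. The fingerprints and molecule names are made up for illustration, and the summary does not say which coefficient the presentation uses; the Tanimoto coefficient is chosen here only because it is a common one.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: |A & B| / |A | B| for two sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


# Hypothetical fingerprints (sets of on-bit positions).
query = {1, 4, 7, 9}
library = {
    "mol-A": {1, 4, 7, 9, 12},
    "mol-B": {1, 4},
    "mol-C": {2, 3},
}

# Rank library molecules by decreasing similarity to the query.
ranking = sorted(
    ((name, tanimoto(query, bits)) for name, bits in library.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Under the similar-property principle mentioned above, the top-ranked molecules would be the first candidates to test.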
The document discusses the process of drug discovery, including target selection, lead discovery, medicinal chemistry, in vitro and in vivo studies, and clinical trials. Target selection involves identifying cellular or genetic targets involved in disease through techniques like genomics, proteomics, and bioinformatics. Lead discovery focuses on identifying small molecule modulators of protein function through methods like synthesis, combinatorial chemistry, assay development, and high-throughput screening. Medicinal chemistry then works to optimize these leads.
Challenges and drawbacks of drug discovery and development | Gaurav Aggarwal
Drug discovery and development faces several key challenges:
- It is a lengthy, complex, and costly process with high uncertainty about whether a drug will succeed.
- Identifying drug targets is difficult due to unknown disease pathophysiology.
- Animal models often cannot fully replicate human diseases.
- Patient heterogeneity makes drug testing challenging without extensive clinical data.
- There is a lack of validated biomarkers to measure disease states.
- Navigating regulatory guidelines from different organizations adds complexity.
1) Docking attempts to predict how biological molecules, such as proteins and ligands, interact and bind to each other. It involves finding the optimal orientation that maximizes molecular interaction and minimizes total energy.
2) Rational drug design uses docking to identify potential drug candidates in ligand databases that may bind to a target protein or receptor. The highest scoring candidates then undergo further testing and optimization.
3) Accurate docking is challenging due to the high degrees of flexibility in both molecules as they interact and conformational changes that can occur upon binding. Improving scoring functions and algorithms to model flexibility remains an important area of research.
PRINCIPLES OF DRUG DISCOVERY & DEVELOPMENT.pptx | DharaMehta45
The document provides an overview of the principles of drug discovery and development. It discusses the various phases including target identification and validation, hit identification and validation, lead selection and profiling, and pre-clinical and clinical development. The target identification process involves techniques like molecular biology, genetics, and data mining to identify potential biological targets. High-throughput screening is used to test large libraries of compounds to identify initial hits which are then optimized into drug candidates or leads through techniques such as medicinal chemistry and structure-activity relationships. The overall process takes 13-15 years and over $2 billion from initial drug discovery to regulatory approval and market launch.
Drug discovery and Development by vinay gupta | Dr Vinay Gupta
The document discusses various aspects of drug discovery and development, including:
1) The drug development process involves pre-clinical and clinical trials that are regulated by agencies like DCGI in India and FDA in the US.
2) Pre-clinical trials involve pharmacological, toxicological, and pharmacokinetic testing in animals to establish safety before human trials.
3) Clinical trials have 4 phases - Phase I evaluates safety in healthy volunteers, Phase II explores efficacy in patients, Phase III confirms efficacy and monitors side effects in large patient groups, and Phase IV involves post-marketing surveillance.
The document provides an overview of the drug development pathway and requirements for clinical trials and regulatory approval.
Molecular descriptors are numerical values that characterize molecular properties and structures. They can represent physicochemical properties or values derived from algorithmic techniques applied to molecular structures. Descriptors vary in complexity and computational requirements. Some are based on experimental data while others are algorithmic constructs. Two-dimensional (2D) descriptors are calculated from 2D structures and include counts, physicochemical properties, and topological indices. Three-dimensional (3D) descriptors encode spatial relationships and include fragment screens and pharmacophore keys.
This document discusses structure-activity relationships (SAR) and quantitative structure-activity relationships (QSAR). SAR involves analyzing how changes to a molecule's structure affect its biological activity. QSAR establishes a mathematical relationship between biological activity and a molecule's geometric and chemical characteristics. SAR identifies important functional groups for binding through systematic structural modifications. QSAR analysis can be 2D, considering factors like those in Hansch analysis or Free-Wilson analysis, or 3D, considering steric and electrostatic values as well as hydrogen bonding abilities as in Comparative Molecular Field Analysis or Comparative Molecular Similarity Index Analysis.
The document discusses revising the Topliss decision tree for analog synthesis based on 30 years of additional medicinal chemistry literature and data from the ChEMBL bioactivity database. It describes creating a "Matsy decision tree" directly from experimental matched molecular series data in ChEMBL, which supports most of the Topliss tree but suggests some differences. The authors developed a data-driven approach not limited to the original Topliss trees to make predictions backed by experimental evidence from targeted datasets.
Quantitative structure-activity relationships (QSAR) use mathematical models to predict biological activity based on molecular properties. QSAR models are developed using statistical methods like partial least squares on datasets of compounds with known activities. Three-dimensional (3D) QSAR extends this approach by incorporating 3D structural descriptors and molecular fields derived from programs like CoMFA, VolSurf, and Catalyst to model activity based on interactions at binding sites. These 3D-QSAR models can be used to predict activity and design new compounds with improved properties.
1) The document discusses the basics of drug design including defining the disease process, identifying targets for drug design like enzymes, receptors and nucleic acids, and the different approaches of ligand-based drug design and structure-based drug design.
2) It also covers important techniques in drug design like computer-aided drug design using computational methods, quantitative structure-activity relationships (QSAR), and the uses of computer graphics in molecular modeling and dynamics simulations.
3) Important experimental techniques discussed are x-ray crystallography and NMR spectroscopy that provide structural information for target biomolecules essential for structure-based drug design.
The Role of Bioinformatics in The Drug Discovery Process | Adebowale Qazeem
The Role of Bioinformatics in The Drug Discovery Process is an undergraduate seminar presentation in the Department of Biochemistry, Faculty of Life Sciences, University of Ilorin, Ilorin.
Exploiting PubChem for drug discovery based on natural products | Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 19, 2018).
==== Abstract ====
PubChem is one of the largest sources of publicly available chemical information, with more than 242.3 million depositor-provided substance descriptions, 94.7 million unique chemical structures, and 234.8 million bioactivity outcomes from 1.25 million assays covering around ten thousand unique protein target sequences. This presentation provides an overview of PubChem’s data, tools, and services useful for drug discovery based on natural products.
PubChem contains a large amount of bioactivity data, most of which are generated from high-throughput screening (HTS). However, these data also include a substantial amount of bioactivity information extracted from scientific articles published in journals in the chemical biology, medicinal chemistry, and natural product domains, thanks to data contribution by other databases like ChEMBL, Guide to Pharmacology, BindingDB, and PDBbind. In addition, through data integration with other databases such as DrugBank, HSDB, and HMDB, PubChem contains a wide range of annotations useful for drug discovery, including pharmacology, toxicology, drug target, metabolism, chemical vendors, scientific articles, patents, and many others.
PubChem supports various types of chemical structure searches, including identity search, 2-D and 3-D similarity searches, substructure and superstructure searches, and molecular formula search. It also provides multiple programmatic access routes, including E-Utilities, Power User Gateway (PUG), PUG-SOAP, PUG-REST, and PUG-View, allowing one to build an automated workflow that takes advantage of information contained in PubChem. In addition, through PubChemRDF, users can integrate PubChem’s data into their own in-house data on a local computing machine.
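As an illustration of programmatic access, the sketch below builds a PUG-REST request URL that looks up a compound by name and asks for selected computed properties as JSON. The helper `build_pug_rest_url` is a hypothetical name introduced here, the example only constructs the URL and makes no network call, and the path layout and property names should be checked against the current PUG-REST documentation before being used in a workflow.

```python
from urllib.parse import quote

# Base address of the PubChem PUG-REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_rest_url(name, properties):
    """Build a PUG-REST URL: compound lookup by name, properties as JSON."""
    encoded_name = quote(name, safe="")  # percent-encode spaces, slashes, etc.
    property_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{property_list}/JSON"


url = build_pug_rest_url("aspirin", ["MolecularFormula", "MolecularWeight"])
print(url)
```

The resulting URL can be fetched with any HTTP client; an automated workflow should also throttle its requests in line with PubChem's usage policy.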
Digging out Structures for Repurposing: Non-competitive Intelligence ... | Chris Southan
This document summarizes Christopher Southan's presentation on digging through non-competitive intelligence to connect drug code names to structures and related data for drug repurposing opportunities. It finds that only 40-50% of approximately 30,000 drug code names have publicly accessible structures, and an even smaller portion are recorded in PubChem. It outlines Southan's methods for mapping code names to structures through multiple sources and finding associated data in clinical trials, publications, and patents. The conclusion advocates for increased transparency and data sharing to improve opportunities for drug repurposing based on stalled or failed drug candidates.
Computer Assisted Drug Design By Rauf Pathan and Patel Mo ShaffanPathan Rauf Khan
CADD is modern technique of drug design and use of this technique reduce drug screening time and discover new drugs with specific therapeutic activity.
This document discusses protein engineering techniques for modifying proteins, including rational protein design using site-directed mutagenesis and directed evolution using random mutagenesis. Site-directed mutagenesis involves introducing point mutations in a particular known area to modify a specific protein function, while directed evolution generates genetic diversity through random mutagenesis and screens variants to identify successful mutations without requiring structural information. Common random mutagenesis methods discussed are error-prone PCR and DNA shuffling, which can be used to engineer properties like protein folding, stability, binding, and catalysis.
This document provides an overview of the history and methods of drug discovery, including traditional and computer-aided approaches. It discusses the traditional drug discovery life cycle from hit identification through random screening and the use of natural products and synthetic chemicals. It then introduces computer-aided drug design (CADD) and describes how it can be used throughout the drug discovery process, including structure-based design, ligand-based design, and de novo design to speed up screening and enable more rational drug design. It also lists some advantages of CADD over traditional methods and examples of drugs successfully developed using these approaches.
Pharmacogenetics is the study of influences of a gene on therapeutic and adverse effects of drugs.
Pharmacogenetics plays an important role in drug development and drug safety.
The document discusses the process of preparing a chemical database for virtual screening or compound acquisition. It begins with assembling collections from in-house and external databases. The collection is then cleaned by removing invalid structures and standardizing structure representations. Property filtering is used to focus on lead-like compounds. Known active molecules are searched for structural similarity. Alternative structures like stereoisomers are explored. Representatives are selected from clustered structures using descriptors and similarity metrics. 3D structures are generated and a final list of compounds is assembled for screening, with some random additions, completing the preparation.
This document outlines the key steps involved in the drug discovery process:
1. Target identification involves identifying protein targets through methods like sequence analysis and cDNA library generation.
2. Target validation confirms the drug will affect the specific target through techniques like chemogenomics and target gene disruption.
3. Lead identification screens compound libraries to find leads that are potent against the target using assays.
4. Lead optimization refines the leads using computer-aided drug design and de novo drug design.
5. Preclinical pharmacology involves testing the drugs on lab animals, human cells, and clinical trials with government approval.
This presentation discusses molecular similarity searching methods for drug discovery. It begins with an introduction to cheminformatics and the principle that structurally similar molecules tend to have similar biological properties. The document then covers molecular representations, methods for calculating similarity coefficients between molecules, and a probabilistic model for similarity searching. It proposes a contribution called the Molecular Dynamic Clustering method that uses molecular dynamics simulations and classification algorithms to better assess molecular similarity.
The document discusses the process of drug discovery, including target selection, lead discovery, medicinal chemistry, in vitro and in vivo studies, and clinical trials. Target selection involves identifying cellular or genetic targets involved in disease through techniques like genomics, proteomics, and bioinformatics. Lead discovery focuses on identifying small molecule modulators of protein function through methods like synthesis, combinatorial chemistry, assay development, and high-throughput screening. Medicinal chemistry then works to optimize these leads. [/SUMMARY]
Challenges and drawbacks of drug discovery and developmentGaurav Aggarwal
Drug discovery and development faces several key challenges:
- It is a lengthy, complex, and costly process with high uncertainty if a drug will succeed.
- Identifying drug targets is difficult due to unknown disease pathophysiology.
- Animal models often cannot fully replicate human diseases.
- Patient heterogeneity makes drug testing challenging without extensive clinical data.
- There is a lack of validated biomarkers to measure disease states.
- Navigating regulatory guidelines from different organizations adds complexity.
1) Docking attempts to predict how biological molecules, such as proteins and ligands, interact and bind to each other. It involves finding the optimal orientation that maximizes molecular interaction and minimizes total energy.
2) Rational drug design uses docking to identify potential drug candidates in ligand databases that may bind to a target protein or receptor. The highest scoring candidates then undergo further testing and optimization.
3) Accurate docking is challenging due to the high degrees of flexibility in both molecules as they interact and conformational changes that can occur upon binding. Improving scoring functions and algorithms to model flexibility remains an important area of research.
PRINCIPLES OF DRUG DISCOVERY & DEVELOPMENT.pptxDharaMehta45
The document provides an overview of the principles of drug discovery and development. It discusses the various phases including target identification and validation, hit identification and validation, lead selection and profiling, and pre-clinical and clinical development. The target identification process involves techniques like molecular biology, genetics, and data mining to identify potential biological targets. High-throughput screening is used to test large libraries of compounds to identify initial hits which are then optimized into drug candidates or leads through techniques such as medicinal chemistry and structure-activity relationships. The overall process takes 13-15 years and over $2 billion from initial drug discovery to regulatory approval and market launch.
Drug discovery and Development by vinay guptaDr Vinay Gupta
The document discusses various aspects of drug discovery and development, including:
1) The drug development process involves pre-clinical and clinical trials that are regulated by agencies like DCGI in India and FDA in the US.
2) Pre-clinical trials involve pharmacological, toxicological, and pharmacokinetic testing in animals to establish safety before human trials.
3) Clinical trials have 4 phases - Phase I evaluates safety in healthy volunteers, Phase II explores efficacy in patients, Phase III confirms efficacy and monitors side effects in large patient groups, and Phase IV involves post-marketing surveillance.
The document provides an overview of the drug development pathway and requirements for clinical trials and regulatory approval.
Molecular descriptors are numerical values that characterize molecular properties and structures. They can represent physicochemical properties or values derived from algorithmic techniques applied to molecular structures. Descriptors vary in complexity and computational requirements. Some are based on experimental data while others are algorithmic constructs. Two-dimensional (2D) descriptors are calculated from 2D structures and include counts, physicochemical properties, and topological indices. Three-dimensional (3D) descriptors encode spatial relationships and include fragment screens and pharmacophore keys.
This document discusses structure-activity relationships (SAR) and quantitative structure-activity relationships (QSAR). SAR involves analyzing how changes to a molecule's structure affect its biological activity. QSAR establishes a mathematical relationship between biological activity and a molecule's geometric and chemical characteristics. SAR identifies important functional groups for binding through systematic structural modifications. QSAR analysis can be 2D, considering factors like those in Hansch analysis or Free-Wilson analysis, or 3D, considering steric and electrostatic values as well as hydrogen bonding abilities as in Comparative Molecular Field Analysis or Comparative Molecular Similarity Index Analysis.
The document discusses revising the Topliss decision tree for analog synthesis based on 30 years of additional medicinal chemistry literature and data from the ChEMBL bioactivity database. It describes creating a "Matsy decision tree" directly from experimental matched molecular series data in ChEMBL, which supports most of the Topliss tree but suggests some differences. The authors developed a data-driven approach not limited to the original Topliss trees to make predictions backed by experimental evidence from targeted datasets.
Quantitative structure-activity relationships (QSAR) use mathematical models to predict biological activity based on molecular properties. QSAR models are developed using statistical methods like partial least squares on datasets of compounds with known activities. Three-dimensional (3D) QSAR extends this approach by incorporating 3D structural descriptors and molecular fields derived from programs like CoMFA, VolSurf, and Catalyst to model activity based on interactions at binding sites. These 3D-QSAR models can be used to predict activity and design new compounds with improved properties.
1) The document discusses the basics of drug design including defining the disease process, identifying targets for drug design like enzymes, receptors and nucleic acids, and the different approaches of ligand-based drug design and structure-based drug design.
2) It also covers important techniques in drug design like computer-aided drug design using computational methods, quantitative structure-activity relationships (QSAR), and the uses of computer graphics in molecular modeling and dynamics simulations.
3) Important experimental techniques discussed are x-ray crystallography and NMR spectroscopy that provide structural information for target biomolecules essential for structure-based drug design.
The Role of Bioinformatics in The Drug Discovery ProcessAdebowale Qazeem
The Role of Bioinformatics in The Drug Discovery Process, is an undergraduate seminar presentation in the department of Biochemistry, Faculty of life Sciences, University of Ilorin, Ilorin.
Exploiting PubChem for drug discovery based on natural productsSunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 19, 2018).
==== Abstract ====
PubChem is one of the largest sources of publicly available chemical information, with more than 242.3 million depositor-provided substance descriptions, 94.7 million unique chemical structures, and 234.8 million bioactivity outcomes from 1.25 million assays covering around ten thousand unique protein target sequences. This presentation provides an overview of PubChem’s data, tools, and services useful for drug discovery based on natural products.
PubChem contains a large amount of bioactivity data, most of which are generated from high-throughput screening (HTS). However, these data also include a substantial amount of bioactivity information extracted from scientific articles published in journals in the chemical biology, medicinal chemistry, and natural product domains, thanks to data contribution by other databases like ChEMBL, Guide to Pharmacology, BindingDB, and PDBbind. In addition, through data integration with other databases such as DrugBank, HSDB, and HMDB, PubChem contains a wide range of annotations useful for drug discovery, including pharmacology, toxicology, drug target, metabolism, chemical vendors, scientific articles, patents, and many others.
PubChem supports various types of chemical structure searches, including identify search, 2-D and 3-D similarity searches, substructure and superstructure searches, molecular formula search. It also provides multiple programmatic access routes, including E-Utilities, Power User Gateway (PUG), PUG-SOAP, PUG-REST, and PUG-View, allowing one to build an automated workflow that takes advantage of information contained in PubChem. In addition, through PubChemRDF, users can integrate PubChem’s data into their own in-house data on a local computing machine.
Digging out Structures for Repurposing: Non-competitive Intelligence ... (Chris Southan)
This document summarizes Christopher Southan's presentation on digging through non-competitive intelligence to connect drug code names to structures and related data for drug repurposing opportunities. It finds that only 40-50% of approximately 30,000 drug code names have publicly accessible structures, and an even smaller portion are recorded in PubChem. It outlines Southan's methods for mapping code names to structures through multiple sources and finding associated data in clinical trials, publications, and patents. The conclusion advocates for increased transparency and data sharing to improve opportunities for drug repurposing based on stalled or failed drug candidates.
This document appears to be a website homepage for Dr. Iddo Friedberg and his lab at Iowa State University. It provides information about Dr. Friedberg's background, research interests in bacterial genome evolution and protein function prediction, current lab members, and lab philosophy of asking biological questions with computational approaches. It also provides summaries of several projects involving modeling the evolution of gene blocks, reconstructing the ancestry of these blocks, and studying the relationship between genome organization and gene expression.
This presentation describes several deficiencies of drug product labels and how information on the semantic web can be used to update the drug product label.
Code camp 2014 Talk Scientific Thinking (Mitch Miller)
The document discusses the role of a scientific geek consultant in the ChemIDplus project. It provides an overview of ChemIDplus, a database of over 400,000 chemicals maintained by the National Library of Medicine. As a consultant, the author helped develop the original ChemIDplus system, performed database administration tasks like reindexing and updating structures, tested migrations to new systems, and created tools to help orient and clean chemical structure data. The document outlines the technical architecture and some historical details about ChemIDplus and the author's contributions over the years.
Presented online at KSEA - Virginia Washington Metro Regional Conference 2020 (VWMRC 2020) (May 9, 2020)
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical information resource, visited by millions of unique users per month. It contains chemical data from more than 700 data sources and disseminates these data to the public free of charge. Arguably, it is the largest source of publicly available chemical information, containing more than 250 million depositor-provided substance descriptions, 100 million unique chemical structures, and 260 million bioactivity outcomes from one million assays covering around ten thousand unique protein target sequences. This presentation provides an overview of PubChem’s data, tools, and services useful for drug discovery.
The immense quantity of bioactivity data in PubChem can be used to develop computational models to predict bioactivities of small molecules. While these data are primarily generated from high-throughput screening (HTS), they also include a substantial amount of bioactivity information extracted from peer-reviewed journal articles. In addition, through data integration with other databases, PubChem has a wide range of annotations useful for drug discovery, including pharmacology, toxicology, drug target, metabolism, chemical vendors, scientific articles, patents, and many others.
PubChem supports various types of chemical structure searches, including identity, 2-D and 3-D similarity, substructure, superstructure, and molecular formula. It also provides multiple programmatic access routes, including E-Utilities, Power User Gateway (PUG), PUG-SOAP, PUG-REST, and PUG-View, allowing one to build an automated workflow that takes advantage of information contained in PubChem. In addition, through PubChemRDF, users can integrate PubChem data with their own.
The Open Source Drug Discovery (OSDD) strategy uses an open innovation model with a porous-walled funnel to facilitate the free flow of ideas and projects. It brings in more contributors to look at projects and enables redundancies and parallelization. OSDD acts as a facilitator to marry academic and delivery-focused approaches and provides expertise, discovery platforms, and coordination of activities from both individual and centrally coordinated projects. OSDD has established multiple platforms for drug discovery including compound management, screening, target validation, and mechanistic studies. It has an extensive portfolio involving over 180 principal investigators from over 100 institutions working on projects ranging from whole cell screening to structure-based drug design.
Ontology for the Financial Services Industry (Barry Smith)
This document discusses strategies for integrating reference data using semantic technologies like ontologies. It begins by introducing the speaker and their work developing ontologies. It then discusses challenges like finding, understanding, using, and integrating data across silos. The solution proposed is to publish data using standard web formats like RDF and OWL, link datasets using common controlled vocabularies in ontologies, and build a "web of data". Examples of successful ontology projects like Gene Ontology are provided. The document argues the financial industry should pool information on existing controlled vocabularies, select common modules for reference data integration, and establish governance and training to ensure interoperability and avoid new silos.
Presented at the Bioinformatics Seminar at the University of Arkansas, Little Rock on November 5, 2021.
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical database at the National Library of Medicine, National Institutes of Health. Arguably, PubChem is one of the largest chemical information resources in the public domain, with 111 million unique chemical structures, 1.39 million biological assays, and 292 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents, and more. PubChem is a key resource for big data in chemistry and has been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). It has also been used for cheminformatics education as well as chemical health and safety training. This presentation provides a high-level overview of PubChem’s data, tools, and services.
This document summarizes biologics information available in PubChem. It defines biologics as large molecules composed of sugars, proteins, lipids, or nucleic acids. PubChem contains over 1.5 million compounds labeled as biologics, including line notations describing their structure generated by Sugar and Splice. Biologics information in PubChem can be accessed on the website or programmatically via APIs. The document also describes NCBI Glycans, a resource providing definitions for carbohydrate monomers and examples of Symbol Nomenclature for Glycans notation.
PubChem for chemical information literacy training (Sunghwan Kim)
Presented at the American Chemical Society Fall 2021 National Meeting (August 23, 2021; virtual).
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource that collects chemical information from 780+ data sources. It is visited by millions of users every month, and many of them are undergraduate or graduate students at academic institutions. While PubChem has great potential as an online resource for chemical education, it also raises important issues that are not familiar to students and educators, including data accuracy, data provenance, structure standardization, and terminology. In this presentation, various aspects of PubChem as a chemical education resource will be discussed, with a special emphasis on how to help students develop chemical information literacy skills.
Antimalarial drug discovery data disclosure (Chris Southan)
Dr. Christopher Southan presented on comparing open and closed antimalarial drug discovery approaches. He examined 32 recent antimalarial compounds and found major data connectivity issues, such as leads not being findable by code name or having publications not citing patents. In contrast, the open source Sydney University Malaria Project surfaces structures and shares data in near real-time through open lab books and crowdsourcing. Dr. Southan analyzed their collection of 411 molecules and found 250 matched in PubChem quickly. Open approaches can accelerate discovery by years by openly sharing data.
PubChem: A Public Chemical Information Resource for Big Data Chemistry (Sunghwan Kim)
A web-seminar jointly organized by KWSE (Korean Woman Scientists & Engineers) and KWiSE (Korean-American Women in Science and Engineering). Presented on July 27, 2021.
Dyadic International is developing its C1 gene expression platform to more accessibly and affordably produce biologic vaccines and drugs. C1 uses a proprietary fungal system that offers higher productivity, lower costs, and a faster development timeline compared to the commonly used CHO system. Dyadic has collaborative research programs underway with companies like Sanofi and Mitsubishi Tanabe Pharma to evaluate C1 for producing various biologics. The company aims to license its C1 technology to partners in order to disrupt biomanufacturing and address the growing demand for more affordable biologic treatments.
This document discusses molecular docking, which is a computational method used in structure-based drug design to predict the preferred orientation of molecules when bound to their protein targets to form stable complexes. It begins by introducing drug discovery and computational chemistry approaches. It then defines molecular docking and describes different docking types and software. Applications of docking in modern drug discovery are presented, along with case studies and achievements that have resulted in new drug classes. The document concludes that docking contributes promisingly to drug discovery by aiding in target identification and lead optimization.
Structural bioinformatics uses computational techniques to aid in drug discovery. Bioinformatics analyzes gene and protein sequence data to identify potential drug targets. Molecular docking then simulates how candidate drug compounds interact with these targets at the atomic level. This provides insight into how well a compound may bind to and affect the target. The results can help optimize drug candidates prior to further testing.
Predictive in vitro & in silico Methods for Precision Medicine- Robert G. Hun... (RobertGHunter)
The document summarizes a webcast presentation on predictive toxicology (PredTox) methods and their market opportunity. It discusses how PredTox fits with key trends in genomics, systems biology, and health IT. It also provides an overview of the PredTox landscape, including various in vitro and in silico technologies, applications in precision medicine, and global market drivers and forecasts. Contact information is given for BCC Research, the organization that published the webcast and related market report.
Cheminformatics Education with PubChem (Sunghwan Kim)
Presented on November 13, 2020, as part of the "Integrating Bioinformatics Education Series" (https://ualr.edu/bioinformatics/education-series/), organized by the Arkansas IDeA Network of Biomedical Research Excellence (Arkansas INBRE) (https://inbre.uams.edu/).
Sunghwan Kim
National Library of Medicine, National Institutes of Health, Rockville, Maryland, United States
PubChem for drug discovery in the age of big data and artificial intelligence (Sunghwan Kim)
Presented at the American Chemical Society Middle Atlantic Regional Meeting (MARM) 2021 (June 10, 2021).
==== Abstract ====
With the emergence of the age of big data and artificial intelligence, biomedical research communities have a great interest in exploiting the massive amount of chemical and biological data available in the public domain. PubChem (https://pubchem.ncbi.nlm.nih.gov) is one of the largest sources of publicly available chemical information, with +270 million substance descriptions, +110 million unique compounds, +285 million bioactivity outcomes from more than one million biological assay experiments. PubChem provides a wide range of chemical information, including structure, pharmacology, toxicology, drug target, metabolism, chemical vendors, patents, regulations, clinical trials, and many others. These contents can be accessed interactively through web browsers as well as programmatically using computer scripts. They can also be downloaded in bulk through the PubChem File Transfer Protocol (FTP) site. PubChem data has been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). This presentation provides an overview of PubChem data, tools, and services useful for drug discovery.
Similar to Mining Small Molecules for Drug Discovery (20)
Basics of Quantum and Computational Chemistry (Girinath Pillai)
Fundamentals of theoretical, quantum and computational chemistry. The methods and approaches help in predicting electronic structure properties as well as spectral data.
Visualisation techniques are used in the area of small molecules, drug molecules, protein and to understand complex functions and interaction points to infer the mechanisms
Autodock Made Easy with MGL Tools - Molecular Docking (Girinath Pillai)
Restructured tutorial for AutoDock and AutoGrid with MGL Tools, prepared in 2011 and adapted from the original AutoDock MGL Tools tutorial.
A video tutorial with the latest enhancements and options is available on YouTube: https://www.youtube.com/watch?v=n53gJE8SHOM
This document provides an overview and installation instructions for machine learning basics using various tools and libraries. It discusses installing and setting up Orange, KNIME, Anaconda, and related Python libraries. Key steps include downloading installers, setting paths, defining workspaces, installing extensions, and creating workflows in Orange and KNIME. Popular cheminformatics and deep learning libraries supported include RDKit, DeepChem, numpy, and scikit-learn.
Machine Learning in Chemistry and Drug Candidate Selection (Girinath Pillai)
Application of machine learning and its importance in chemistry, drug discovery and materials science, and the need for the right dataset of chemical structures and activities. Drug candidate selection criteria are important to avoid failures.
How 3D structures to be considered for 3D QSAR? (Girinath Pillai)
3D molecular descriptors for 3D QSAR.
Physical significance and meaning of descriptors are important.
QSAR should be reliable and reproducible
Drug Discovery
Molecular Dynamics for Beginners : Detailed Overview (Girinath Pillai)
Detailed presentation of what is molecular dynamics, how it is performed, why it is performed, applications, limitations and software resources on how to perform calculations are discussed.
Target Identification - Gene Disease and Protein Target Prediction (Girinath Pillai)
Target Identification with relationship between Genes, Interacting Partners and Disease Associations. Protein Target Prediction and Binding Site predictions
Why Drug Design and Computational Methods are important? (Girinath Pillai)
The document discusses in silico drug design and computer-aided drug discovery (CADD). It describes drug design as an iterative process involving chemistry, biophysics, and other fields. The workflow involves algorithms, programming, data mining, and other techniques. CADD techniques include modeling disease, drug inhibition, and drug interactions in 3D. Specific CADD techniques discussed include docking, similarity analysis, motif identification, subcellular location prediction, stability and solubility indices, half-life prediction, protein surface scanning, and secondary structure analysis for target identification and modeling.
Mining Small Molecules for Drug Discovery
1. @giribio
Girinath G. Pillai, PhD @giribio
Mining small molecules
for Drug Discovery
Girinath G. Pillai, PhD
@giribio
2. @giribio
Note & Disclaimer
● AI/ML in Drug Discovery is not yet fully mature (it will take time,
like the Human Genome Project)
● Slides contain content, pictures and videos taken from the web, articles, lectures and
tutorials; the respective authors own their copyrights.
● No conflict of interest
Technical Slides : slideshare.net/giribio
Hands-on Videos : youtube.com/giribio (AutoDock, Modeller, PaDEL, QSAR, MD, etc)
Workflows & Notebooks : github.com/giribio
3. @giribio
Agenda
What to expect?
➔ FDA & Drug Approvals
➔ Drug Discovery
➔ Chemical Representation
➔ Chemical Databases
➔ Herbal/Natural Compound Databases
9. @giribio
Cause of Failure - Oral Drugs (Early)
1960s to 1980s
RA Prentis et al, Br. J. Clin. Pharm. 1988
T Kennedy Drug Discov. Today 1997
10. @giribio
Cause of Failure - Oral Drugs (Late)
2000s to 2010s
Estimation of ADME at early stages
Generate models to predict ADME/PK
MJ Waring et al, Nat. Rev. Drug Discov. 2015
J Arrowsmith & P Miller Nat. Rev. Drug Discov. 2013
11. @giribio
The Objectives of Drug Discovery
Multi-parameter optimisation
➔ Identify chemistries with an optimal balance of properties
➔ Quickly identify situations when such a balance is not possible
➔ Fail fast, fail cheap
➔ Only when confident
➔ Avoid missed opportunities
13. @giribio
Representing Chemicals
➢ Trivial name, e.g. Baking Soda, Aspirin, Citric Acid, etc.
○ Identifies the compound, but gives no (or little) information about
what it consists of
➢ Chemical formula, e.g. C6H12O6.
○ Specifies the type and quantity of the atoms in the compound, but
not its structure (i.e. how the atoms are connected by bonds)
➢ Systematic name, e.g. 1,2-dibromo-3-chloropropane.
○ Identifies the atoms present and how they are connected by bonds.
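The point that a chemical formula records atom types and counts but not connectivity can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `parse_formula` and `molecular_weight` are hypothetical, the parser assumes simple formulas without brackets or charges, and the atomic weights are rounded values for a handful of elements.

```python
import re
from collections import Counter

# Rounded standard atomic weights (g/mol) for a few common elements (illustrative subset).
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
                  "Cl": 35.45, "Br": 79.904}


def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C6H12O6' (no brackets or charges)."""
    counts = Counter()
    # Each match is an element symbol followed by an optional count.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def molecular_weight(formula: str) -> float:
    """Sum atomic weights over the parsed atom counts."""
    return sum(ATOMIC_WEIGHTS[el] * n for el, n in parse_formula(formula).items())


# Glucose and fructose share the formula C6H12O6, so the formula alone
# cannot tell them apart: the bonding information is missing.
glucose_counts = parse_formula("C6H12O6")
dbcp_counts = parse_formula("C3H5Br2Cl")  # 1,2-dibromo-3-chloropropane
```

A systematic name or a line notation is needed to recover how those atoms are actually connected.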
14. @giribio
Girinath G. Pillai, PhD @giribio
What are Small Molecules?
➢ A small molecule is defined as a low molecular weight organic
compound.
➢ Most drugs are small molecules, which allows passage across cell
membranes and supports oral bioavailability.
➢ They are also able to bind to proteins and enzymes, thereby
altering function, which can lead to a therapeutic effect.
15. @giribio
Digital Representations
➢ How do we communicate structural information between
humans and the computer?
○ Line notations, e.g. Wiswesser Line Notation (and later SMILES)
➢ How do we represent the atoms and bonds in a molecule
internally in a computer?
○ Atom lookup and connection tables
➢ Trivial name : proline
➢ Systematic name : pyrrolidine-2-carboxylic acid
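An atom lookup and connection table can be sketched as two plain Python lists. The example below is an assumption-laden illustration rather than the format used by any particular toolkit: it hand-encodes the heavy atoms of proline (pyrrolidine-2-carboxylic acid), and the names `ATOMS`, `BONDS`, `implicit_hydrogens` and `molecular_formula` are hypothetical. Hydrogens are inferred from the usual valences of C, N and O.

```python
from collections import Counter

# Atom lookup table: heavy atoms of proline, indexed from 0.
ATOMS = ["N", "C", "C", "C", "C", "C", "O", "O"]

# Connection table: (atom index, atom index, bond order).
BONDS = [
    (0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 0, 1),  # pyrrolidine ring
    (1, 5, 1),  # ring carbon 2 to the carboxyl carbon
    (5, 6, 2),  # C=O
    (5, 7, 1),  # C-OH
]

# Usual valences for neutral atoms (assumption: no charges or radicals).
VALENCE = {"C": 4, "N": 3, "O": 2}


def implicit_hydrogens(atoms, bonds):
    """Hydrogens per heavy atom = usual valence minus the bond orders already used."""
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [VALENCE[el] - u for el, u in zip(atoms, used)]


def molecular_formula(atoms, bonds):
    """Formula with C and H first, then the remaining elements alphabetically."""
    counts = Counter(atoms)
    counts["H"] = sum(implicit_hydrogens(atoms, bonds))
    order = ["C", "H"] + sorted(el for el in counts if el not in ("C", "H"))
    return "".join(f"{el}{counts[el] if counts[el] > 1 else ''}"
                   for el in order if counts[el])


print(molecular_formula(ATOMS, BONDS))  # C5H9NO2
```

Unlike a formula, the connection table keeps the bonding pattern, so the formula can be derived from it but not the other way round.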
19. @giribio
Chemical Compounds for Machines
ML methods need a computer-friendly way to input the atomistic system:
[Figure: the atomistic system in a form that is "easy for us" versus a bit string such as 010110101010001011100100010001111110 that is "easy for CPU"]
Issues for ML:
● arbitrary size
● arbitrary order
Ideal features:
● general
● compact
● unique
● invariant *
● smooth
● fast
* invariants are determined by the physics of the quantity to predict from the descriptor!
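The "arbitrary size" and "arbitrary order" issues can be illustrated with a toy fixed-length fingerprint. This is a sketch under stated assumptions, not a method taken from the slides: each atom is described by its element and its sorted neighbour elements, and that feature is hashed with CRC32 into a bit vector. The function name `fingerprint`, the feature definition and the 32-bit length are all hypothetical choices.

```python
import zlib


def fingerprint(atoms, bonds, n_bits=32):
    """Hash per-atom environments into a fixed-length bit string."""
    neighbours = [[] for _ in atoms]
    for a, b in bonds:
        neighbours[a].append(atoms[b])
        neighbours[b].append(atoms[a])
    bits = [0] * n_bits
    for element, nbrs in zip(atoms, neighbours):
        # Sorting the neighbours makes the feature independent of bond order in the input.
        feature = f"{element}|{''.join(sorted(nbrs))}"
        bits[zlib.crc32(feature.encode()) % n_bits] = 1
    return "".join(map(str, bits))


# Ethanol heavy atoms (C-C-O) written in two different atom orders.
ethanol_a = (["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = (["O", "C", "C"], [(0, 1), (1, 2)])

fp_a = fingerprint(*ethanol_a)
fp_b = fingerprint(*ethanol_b)
```

Both inputs produce the same 32-character string, so the descriptor is compact, fast and invariant to atom ordering. It is not unique, since different environments can hash to the same bit, and it is not smooth, which is why the list of ideal features is hard to satisfy all at once.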
20. @giribio
Descriptors for Chemistry
ML methods need a computer-friendly way to input the atomistic system:
● Global Descriptor
● Local/Atomic Descriptor
[Figure: example bit strings illustrating a global descriptor and local/atomic descriptors]
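The difference between a global descriptor and local/atomic descriptors can be sketched as follows. This is an illustrative assumption, not the descriptors shown on the slide: the global descriptor is a vector of element counts for the whole molecule, and each local descriptor is a one-hot element encoding plus the atom's degree. The names `ELEMENTS`, `global_descriptor` and `local_descriptors` are hypothetical.

```python
# Fixed element vocabulary (assumption: only C, N and O heavy atoms).
ELEMENTS = ["C", "N", "O"]


def global_descriptor(atoms):
    """One vector for the whole molecule: how many atoms of each element."""
    return [atoms.count(el) for el in ELEMENTS]


def local_descriptors(atoms, bonds):
    """One vector per atom: one-hot element encoding followed by the atom's degree."""
    degree = [0] * len(atoms)
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return [[int(el == ref) for ref in ELEMENTS] + [d]
            for el, d in zip(atoms, degree)]


# Ethanol heavy atoms: C-C-O.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

print(global_descriptor(atoms))         # [2, 0, 1]
print(local_descriptors(atoms, bonds))  # [[1, 0, 0, 1], [1, 0, 0, 2], [0, 0, 1, 1]]
```

A global descriptor suits whole-molecule properties, while local descriptors give one row per atom and therefore grow with molecule size.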
21. @giribio
Database?
A database is an “organized collection of information.”
Information in a database can be in any format, including text, numbers, images,
audio, video, and many others (and combinations of these).
Information must be “organized” for efficient retrieval.
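What "organized for efficient retrieval" means in practice can be sketched with Python's built-in `sqlite3` module. The table layout, the index name `idx_formula` and the helper `find_by_formula` are assumptions made for this illustration; the three rows are well-known compounds with rounded molecular weights.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compounds (name TEXT PRIMARY KEY, formula TEXT, mol_weight REAL)"
)
conn.executemany(
    "INSERT INTO compounds VALUES (?, ?, ?)",
    [
        ("aspirin", "C9H8O4", 180.16),
        ("glucose", "C6H12O6", 180.16),
        ("proline", "C5H9NO2", 115.13),
    ],
)
# The index is the "organization" that lets lookups by formula avoid a full scan.
conn.execute("CREATE INDEX idx_formula ON compounds (formula)")


def find_by_formula(formula):
    """Return the names of all compounds with the given molecular formula."""
    rows = conn.execute(
        "SELECT name FROM compounds WHERE formula = ? ORDER BY name", (formula,)
    )
    return [name for (name,) in rows]
```

Calling `find_by_formula("C5H9NO2")` returns the matching names; with millions of rows, the index is what keeps such a query fast.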
22. @giribio
Database Categories
Primary databases contain experimentally derived data that are directly
submitted by researchers (also called “primary data”). In essence, these
databases serve as archives that keep the original data, so they are also
known as archival databases.
Secondary databases contain secondary data, which are derived from analyzing
and interpreting primary data. These databases often provide value-added
information related to the primary data, by using information from other
databases and scientific literature.
Essentially, secondary databases serve as reference libraries for the scientific
community, providing highly curated reviews about primary data. They are also
known as curated databases or knowledgebases.
23. @giribio
Data Provenance
“Data provenance” refers to a record trail that describes the origin or source of a
piece of data and the process by which it entered a database.
Simply put, data provenance deals with the questions
“where did the data come from?” and
“how and why is the data in its present place?”.
Although data provenance information is critical to the reliability of a data
source (and its data), this information is not easy to manage.
24. @giribio
Small Molecule Databases
➢ Chemical DBs: collection of molecules reported, published and proposed
➢ FDA Drugs: collection of approved drugs with their pharmacological and target details
➢ Natural Products: collection of compounds/extracts from plants, animals, etc.
➢ Commercial DBs: collection of purchasable as well as on-demand synthesis molecules
25. @giribio
Approved Drug Database
➢ DrugBank: comprehensive information on drug molecules
○ Combines detailed drug data (i.e. chemical, pharmacological and pharmaceutical) with
comprehensive drug target information (i.e. sequence, structure, and pathway). Allows
searching for similar compounds.
➢ SuperDrug: 2500 3D structures of active ingredients of essential marketed
drugs.
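Searching for similar compounds is commonly done by comparing fingerprint bit strings with the Tanimoto coefficient, the number of bits set in both fingerprints divided by the number set in either. The sketch below assumes that approach purely for illustration; the slides do not say which similarity measure DrugBank uses, and the names `tanimoto` and `rank_by_similarity` as well as the toy fingerprints are hypothetical.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto coefficient of two equal-length bit strings made of '0' and '1'."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(a == "1" and b == "1" for a, b in zip(fp_a, fp_b))
    either = sum(a == "1" or b == "1" for a, b in zip(fp_a, fp_b))
    # Convention chosen here: two empty fingerprints have similarity 0.
    return both / either if either else 0.0


def rank_by_similarity(query: str, library: dict) -> list:
    """Sort library entries (name -> fingerprint) by decreasing similarity to the query."""
    scored = [(tanimoto(query, fp), name) for name, fp in library.items()]
    return [name for score, name in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Toy 6-bit fingerprints, not derived from real molecules.
library = {"mol_a": "110100", "mol_b": "100110", "mol_c": "001001"}
ranking = rank_by_similarity("110100", library)
```

Here `mol_a` matches the query exactly, `mol_b` shares two of four set bits, and `mol_c` shares none, so the ranking follows that order.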
26. @giribio
Girinath G. Pillai, PhD @giribio
Chemical Database
➢ PubChem: chemical information repository at the U.S. NIH
○ Maintained by NCBI as three linked databases (Substance, Compound and BioAssay),
together with vendor details and pharmacology data
➢ ChEMBL: literature-extracted biological activity information
○ Curated database of small molecules covering the interactions and functional effects of small
molecules binding to their macromolecular targets, alongside a series of related drug discovery
databases. 1 million bioactive (small, drug-like) compounds with 8200 drug targets
➢ ZINC15
○ Curated collection of 21 million commercially available chemical compounds with 3D
coordinates, provided in ready-to-dock 3D formats
➢ BindingDB
○ Contains 910,836 binding data points for 6,263 protein targets and 378,980 small molecules.
➢ ChemSpider: a chemical database integrated with RSC’s publishing process
○ Collection of chemical compounds with services for converting chemical names to chemical
structures, generating SMILES and InChI strings, and predicting many
physicochemical parameters
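Several of these databases can also be queried programmatically. As a minimal sketch, the helper below builds a request URL for PubChem's PUG REST interface, looking up a compound by name and asking for a few properties as JSON. The function name `pubchem_property_url` and the default property list are illustrative choices and do not come from the slides. The code only builds the URL and sends no network request.

```python
from urllib.parse import quote

# Base address of PubChem's PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(
    name,
    properties=("MolecularFormula", "MolecularWeight", "InChIKey"),
):
    """Build a PUG REST URL: compound looked up by name, properties returned as JSON."""
    if not properties:
        raise ValueError("at least one property is required")
    # Percent-encode the name so spaces and slashes cannot break the URL path.
    encoded_name = quote(name, safe="")
    return (
        f"{PUG_REST_BASE}/compound/name/{encoded_name}"
        f"/property/{','.join(properties)}/JSON"
    )


if __name__ == "__main__":
    print(pubchem_property_url("aspirin"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client. For bulk queries, check PubChem's usage policy before sending many requests.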
27. @giribio
Chemical Databases (contd…)
➢ Mcule
○ Commercial database of small molecules
➢ Cambridge Structural Database (CSD)
○ Repository for small molecule crystal structures in CIF format. The CSD is compiled and
maintained by the CCDC.
➢ ChemIDplus
○ Database of compounds and structures from the US National Library of Medicine
➢ ChemBank
○ Public, web-based informatics environment created by the Broad Institute's Chemical Biology
Program. Includes freely available data derived from small molecules and small molecule
screens, and resources for studying the data
➢ Ligand Expo
○ Formerly Ligand Depot. Provides chemical and structural information about small molecules
within the structure entries of the Protein Data Bank
28. @giribio
Natural Product Database
➢ COCONUT (COlleCtion of Open Natural ProdUcTs)
○ Open database aggregating natural products from open sources
➢ TCM
○ Free small-molecule database on traditional Chinese medicine for virtual screening.
Reported to be the world's largest TCM database, containing 170 000 compounds
➢ IMPPAT
○ Database of Indian Medicinal Plants, Phytochemistry and Therapeutics
➢ AromaDb
○ Database of aroma molecules from medicinal and aromatic plants, with their phytochemistry and
therapeutic potential
➢ Dr. Duke's Phytochemical and Ethnobotanical Databases
○ Facilitate in-depth plant, chemical, bioactivity and ethnobotany searches using
scientific or common names
➢ CMAUP
○ Database of the Collective Molecular Activities of Useful Plants
29. @giribio
Other Chemical Databases
➢ ChEBI (Chemical Entities of Biological Interest)
○ Freely available dictionary of molecular entities focused on ‘small’ chemical compounds, provided
by the European Bioinformatics Institute
➢ KEGG DRUG
○ Comprehensive drug information resource for approved drugs in Japan, USA and Europe,
unified based on the chemical structures and/or the chemical components, and associated
with target, metabolizing enzyme and other molecular interaction network information.
Provided by the Kyoto Encyclopedia of Genes and Genomes
77. @giribio
Some Questions!
➢ Predict drug-like molecules? Toxicity?
○ New Strategies
➢ How can we search efficiently? Intelligently?
○ New data structures and algorithms
○ Optimizing old structures
➢ How can we understand this much data?
○ Cluster and visualize millions of data points
○ Define commercially accessible space.
➢ Are there other useful things we can do with this?
○ Discover new polymers, etc.
○ Wonder about the origin of life.
○ Combinatorially combine all known chemicals.
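The slides leave the question of efficient search open. One widely used approach, shown here purely as an illustration, encodes each molecule as a bit-vector fingerprint and ranks candidates by the Tanimoto coefficient: bits set in both fingerprints divided by bits set in either. The fingerprints below are toy integers and not real chemical fingerprints, and the names `tanimoto` and `most_similar` are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as Python ints (bit vectors)."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")    # bits set in at least one fingerprint
    return common / total if total else 0.0


def most_similar(query, library, top_n=3):
    """Rank library entries (name -> fingerprint) by similarity to the query."""
    scored = [(tanimoto(query, fp), name) for name, fp in library.items()]
    # Highest similarity first; ties broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top_n]]


if __name__ == "__main__":
    # Toy 6-bit fingerprints, made up for the example.
    library = {"mol_a": 0b101101, "mol_b": 0b101100, "mol_c": 0b010010}
    for name, score in most_similar(0b101101, library):
        print(f"{name}: {score:.2f}")
```

Bitwise AND/OR on integers keeps each comparison cheap. A linear scan like this still has to visit every compound, so screening millions of molecules is where the "new data structures and algorithms" mentioned above become relevant.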
78. @giribio
CONCLUSION
Data search is an art: dig for as much data as you can.
Initially weight selection 25% on score/quality and 75% on diversity; as the
set of leads shrinks, shift to 75% score/quality and 25%
diversity.
Consider enrichment factors
Aim for good synergy between human expertise and
computational tools
Avoid missed opportunities
Understand the significance of parameters/properties
Evaluate and decide on the tool/approach
Check the reliability of the data used
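The enrichment factor mentioned above is commonly defined as the hit rate in the top-ranked fraction of a screened list divided by the hit rate in the whole list. The sketch below assumes that definition. The function name, the rounding of the selection size and the sample ranking are illustrative assumptions and do not come from the slides.

```python
def enrichment_factor(ranked_labels, fraction):
    """Enrichment factor at the top `fraction` of a ranked list.

    ranked_labels: booleans ordered best-scored first (True = known active).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_total = len(ranked_labels)
    actives_total = sum(ranked_labels)
    if n_total == 0 or actives_total == 0:
        raise ValueError("need a non-empty list with at least one active")
    # Assumption: selection size is rounded to the nearest compound, minimum 1.
    n_selected = max(1, round(n_total * fraction))
    actives_selected = sum(ranked_labels[:n_selected])
    return (actives_selected / n_selected) / (actives_total / n_total)


if __name__ == "__main__":
    # Made-up ranking of 10 compounds containing 3 actives.
    ranked = [True, True, False, True, False, False, False, False, False, False]
    print(f"EF at 20%: {enrichment_factor(ranked, 0.2):.2f}")
```

A value above 1 means the ranking places actives near the top more often than random selection would. A value of exactly 1 corresponds to no enrichment.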
79. @giribio
Until you try yourself and train others, you cannot be an expert.
If you think you have finished collecting all the data, you are wrong. Go and find more data.
80. @giribio
THANKS
Do you have any questions?
www.zastrain.com
@giribio
We do accept research interns