Presentation to the EPA (August 2016) about the BioAssay Express project, from Collaborative Drug Discovery. Describes the history and potential of the project, with the intention of opening a dialog about incorporating EPA toxicity data.
Virtual Screening and Hit Prioritization (Puneet Kacker)
This document discusses virtual ligand screening (VLS) as an alternative to high-throughput screening for identifying potential drug candidates. It describes the VLS process, which involves selecting a target and compound library, preparing the target and ligands, running a docking simulation to analyze ligand-target binding, and prioritizing hits. The document outlines advantages of computational methods like VLS compared to experimental screening, as well as some limitations. It also provides examples of free and commercial docking engines that can be used and highlights challenges in VLS like accounting for receptor flexibility.
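The final VLS step described above — prioritizing hits from docking output — can be sketched in a few lines. The ligand IDs and scores below are invented for illustration; a real workflow would read them from a docking engine's results file.

```python
# Hypothetical docking results: (ligand ID, docking score in kcal/mol).
# More negative scores indicate stronger predicted binding.
results = [
    ("ZINC0001", -9.2),
    ("ZINC0002", -6.1),
    ("ZINC0003", -10.4),
    ("ZINC0004", -7.8),
]

def prioritize_hits(docking_results, cutoff=-7.0, top_n=3):
    """Keep ligands scoring at or below the cutoff, best (most negative) first."""
    hits = [r for r in docking_results if r[1] <= cutoff]
    hits.sort(key=lambda r: r[1])
    return hits[:top_n]

for ligand, score in prioritize_hits(results):
    print(f"{ligand}: {score:.1f} kcal/mol")
```

In practice the cutoff and ranking function would be tuned to the scoring function of the chosen docking engine, since scores are not directly comparable across programs.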
The document describes using systems biology modeling tools to model the metabolic network of MRSA acetate kinase. It outlines conducting protein homology modeling using PMP to generate structural models of acetate kinase, evaluating models using ERRAT, chemical modeling of protein kinase inhibitors using OCHEM and PubChem, and metabolic network modeling using BioModels and Virtual Cell. The aim is to understand how protein, chemical, and metabolic network modeling can provide more information for research into new antibiotic targets. Key steps include selecting templates for homology modeling, evaluating models with ERRAT, extracting chemical data from OCHEM and PubChem, importing an acetate kinase model from BioModels into Virtual Cell, validating the model, and running simulations in Virtual Cell.
LigBuilder v2.0 is a multi-purpose program for structure-based de novo drug design and optimization. It detects ligand binding sites on target proteins, designs ligand molecules that could bind to the sites using a genetic algorithm approach, and screens the designed ligands. An example application of LigBuilder v2.0 is provided, where ligands are designed for a given target protein cavity. The key steps involve identifying the cavity, designing ligands, docking the top ligands into the cavity, and analyzing the results. LigBuilder v2.0 is concluded to be an important tool in drug design for its ability to automatically design ligands and evaluate binding without requiring manual design.
This document summarizes a presentation on breeding data management and analysis tools using the Cassavabase database. It describes how Cassavabase collects phenotypic and genotypic data from various programs and ensures data quality. Over 9 million phenotypic observations, 2488 trials and 34,000 genotypes have been collected. It also discusses expanding interoperability through APIs and partnerships with other crop databases. The presentation demonstrates using Cassavabase to access trial data, visualize genotypes, perform statistical analysis, genomic prediction and multi-trait selection to increase genetic gain.
High-throughput screening (HTS) is a scientific method used in drug discovery that allows researchers to quickly test millions of chemical, genetic, or pharmacological compounds using robotics, detectors, and other automated tools. The key tool is a microtiter plate containing hundreds to thousands of wells, each with a different compound. Automated systems transfer plates between stations for mixing, incubation, and analysis to generate large amounts of experimental data. Effective experimental design, quality control, and data analysis methods are needed to identify meaningful results, or "hits", from large HTS datasets. Recent advances allow screening millions of reactions much faster and with less reagent volume than before.
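The quality-control and hit-identification steps mentioned above can be made concrete with a small numerical sketch. The control and sample values below are invented; the Z'-factor is a standard HTS assay-quality metric, and a simple three-sigma rule against negative controls is one common way to call hits.

```python
import statistics

def z_prime(pos_controls, neg_controls):
    """Z'-factor: assay quality metric; > 0.5 conventionally indicates a good assay."""
    mu_p, mu_n = statistics.mean(pos_controls), statistics.mean(neg_controls)
    sd_p, sd_n = statistics.stdev(pos_controls), statistics.stdev(neg_controls)
    return 1 - 3 * (sd_p + sd_n) / abs(mu_p - mu_n)

def call_hits(sample_signals, neg_controls, n_sd=3):
    """Flag wells deviating more than n_sd standard deviations from negatives."""
    mu, sd = statistics.mean(neg_controls), statistics.stdev(neg_controls)
    return [i for i, s in enumerate(sample_signals) if abs(s - mu) > n_sd * sd]

# Toy plate data (arbitrary fluorescence units, invented for illustration)
pos = [95, 98, 102, 101]
neg = [10, 12, 9, 11]
samples = [11, 80, 10, 13, 97, 12]
print(f"Z' = {z_prime(pos, neg):.2f}")
print("hit wells:", call_hits(samples, neg))
```

Real HTS pipelines add plate-level normalization and corrections for positional artifacts before any hit calling.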
CINF66: Visualizing Molecules In and Out of Context (Jeff White)
This document discusses using data science techniques to recommend related molecules based on different contexts. It describes analyzing chemical similarity based on molecular structure, related papers in literature, and user behavior data. The approaches were validated by comparing automatically grouped molecule clusters to known clusters in literature and behavior datasets. Recommendations based on fingerprint similarities worked best to predict both literature and behavior relatedness. Going forward, the team aims to develop a molecular recommender system that presents molecules in different contexts and improve their system for extracting chemical entities from text.
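The fingerprint-similarity approach described above is usually scored with the Tanimoto coefficient. The bit sets below are invented stand-ins for hashed structural fingerprints; a real system would compute them with a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    inter = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return inter / union if union else 0.0

def recommend(query_fp, library, top_n=2):
    """Rank library molecules by fingerprint similarity to the query."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_n]

# Invented on-bit sets standing in for hashed structural fingerprints
library = {
    "mol_A": {1, 4, 7, 9},
    "mol_B": {1, 4, 7, 8, 9},
    "mol_C": {2, 3, 5},
}
query = {1, 4, 7, 9}
print(recommend(query, library))
```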
Bioassay (commonly used shorthand for biological assay), or biological standardization, is a type of scientific experiment. A bioassay involves the use of a live animal (in vivo) or tissue (in vitro) to determine the biological activity of a substance, such as a hormone or drug. Bioassays are typically conducted to measure the effects of a substance on a living organism and are essential in the development of new drugs and in monitoring environmental pollutants. Both in vivo and in vitro bioassays are procedures by which the potency or the nature of a substance is estimated by studying its effects on living matter. A bioassay can also be used to determine the concentration of a particular constituent of a mixture.
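Potency estimation from bioassay readouts can be illustrated with a toy dose-response series. The data and the log-linear interpolation below are a minimal sketch, not a substitute for proper curve fitting (e.g., a four-parameter logistic model).

```python
import math

def ec50_interpolated(doses, responses):
    """Estimate the dose giving half-maximal response by log-linear
    interpolation between the two bracketing data points."""
    half = max(responses) / 2
    for i in range(len(responses) - 1):
        lo, hi = responses[i], responses[i + 1]
        if lo <= half <= hi:
            frac = (half - lo) / (hi - lo)
            log_d = math.log10(doses[i]) + frac * (
                math.log10(doses[i + 1]) - math.log10(doses[i]))
            return 10 ** log_d
    raise ValueError("half-maximal response not bracketed by the data")

# Toy dose-response data (dose in uM, response in % of control), invented
doses = [0.01, 0.1, 1.0, 10.0, 100.0]
responses = [2, 15, 50, 85, 98]
print(f"EC50 ~ {ec50_interpolated(doses, responses):.2f} uM")
```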
2010 CASCON - Towards an integrated network of data and services for the life sciences (Michel Dumontier)
Towards an integrated network of data and services for the life sciences. Modern biological knowledge discovery requires access to machine-understandable data that can be searched, retrieved, and subsequently analyzed using a wide array of analytical software and services. The Semantic Automated Discovery and Integration (SADI) framework is a set of conventions to formalize web service inputs and outputs using OWL ontologies that enable the automatic discovery and invocation of Semantic Web services. In this talk, I will walk through a worked example in the design and deployment of chemical semantic web services using the Chemical Development Toolkit, chemical descriptors from the Chemical Information Ontology (CHEMINF), and the Semanticscience Integrated Ontology (SIO) as a unifying, upper level ontology of basic types and relations. I will discuss how one can make use of the SADI-enabled SHARE client to reason about data obtained from Bio2RDF, the largest linked open data project, and automatically invoke chemical semantic web services to determine a chemical's drug-likeness. If you want to see the potential of the Semantic Web being realized, this talk is for you.
This document provides an overview of databases and tools relevant to systems immunology. It discusses several freely available and licensed databases containing gene expression, drug, pathway, and disease data. Issues with third party data like cleanup requirements and need for downloadability are also covered. Examples are given of integrating data from sources like GEO, DrugBank, Connectivity Map, and ImmPort to enable meta-analyses addressing immunological questions.
BioAssay Express: Creating and exploiting assay metadata (Philip Cheung)
The challenge of accurately characterizing bioassays is a real pain point for many drug discovery organizations. Research has shown that some organizations have legacy assay collections exceeding 20,000 protocols, the great majority of which are not accurately characterized. This problem is compounded by the fact that many new protocol registrations are still not following FAIR (Findability, Accessibility, Interoperability, and Reusability) Data principles.
BioAssay Express is a tool focused on transforming the traditional protocol description from unstructured free-form text into a well-curated data store based upon FAIR Data principles. By using well-defined annotations for assays, the tool enables precise ontology-based searches without having to resort to imprecise keyword searches.
This talk explores a number of important new features designed to help scientists accelerate the drug discovery process. Example use cases include enabling drug repositioning projects, improving SAR models, identifying appropriate machine learning data sets, and fine-tuning integrative-omic pathways.
An aspirational goal for our team is to build a metadata schema based on semantic web vocabularies that is comprehensive to the extent that the text description becomes optional. One of the many possibilities is to take the initial prospective ELN entry for a bioassay protocol and feed it directly to an automated instrument. While there are many challenges involved in creating the ELN-to-robot loop, we will provide some insights into our collaborations with UCSF automation experts.
In summary, the ability to quickly and accurately search or analyze bioassay data (public or internal) is a rate limiting problem in drug discovery. We will present the latest developments toward removing this bottleneck.
https://plan.core-apps.com/acs_sd2019/abstract/6f58993d-a716-49ad-9b09-609edde5a3f4
Presentation to the ImmPort Science Meeting, February 27, 2014, on the proper treatment of value sets in the ImmPort Immunology Database and Analysis Portal.
Presented by Richard Kidd at "The Future Information Needs of Pharmaceutical & Medicinal Chemistry", Monday 28 November 2011, at the Linnean Society, Burlington House, London, run by the RSC CICAG group.
This document discusses systems biology and some of its tools. It defines systems biology as the study of interactions between parts of biological systems to understand how they function. Biological networks involve interactions between pathways. Networks can be modeled as nodes and edges. Tools described for modeling and analyzing networks include Cytoscape for visualization, CellDesigner for drawing networks, and STRING for protein-protein interaction data. Databases of pathways, interactions and models are also listed.
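The "nodes and edges" view of a biological network can be made concrete in a few lines of code. The gene names below form an invented toy example; real analyses would use a dedicated tool such as Cytoscape or a library like networkx.

```python
# Minimal node-and-edge representation of a toy protein interaction network
# (edge list invented for illustration)
edges = [
    ("TP53", "MDM2"),
    ("TP53", "ATM"),
    ("TP53", "BRCA1"),
    ("MDM2", "UBE3A"),
]

def adjacency(edge_list):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for a, b in edge_list:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

adj = adjacency(edges)
# The highest-degree node is a candidate "hub" of the network
hub = max(adj, key=lambda n: len(adj[n]))
print(hub, sorted(adj[hub]))
```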
Virtual screening uses computer-based methods to filter large databases of chemical compounds to identify a subset of compounds that are most likely to bind to and activate a target linked to a disease. It helps address the challenge of exploring the vast chemical space compared to the limited number of compounds that can be experimentally screened. The document discusses various virtual screening methods including ligand-based approaches like similarity searching and pharmacophore modeling as well as structure-based approaches like molecular docking that predict binding orientations. It also covers best practices for applying filters to select for drug-like and lead-like compounds.
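A widely used drug-likeness filter of the kind mentioned above is Lipinski's rule of five. The sketch below assumes the descriptors (molecular weight, logP, hydrogen-bond donors and acceptors) have already been computed by a cheminformatics toolkit; the candidate values are invented.

```python
def passes_lipinski(mw, logp, h_donors, h_acceptors):
    """Rule of five: allow at most one violation of the four thresholds."""
    violations = sum([
        mw > 500,
        logp > 5,
        h_donors > 5,
        h_acceptors > 10,
    ])
    return violations <= 1

# Invented descriptor values for illustration
candidates = {
    "cand_1": dict(mw=342.4, logp=2.1, h_donors=2, h_acceptors=5),
    "cand_2": dict(mw=712.9, logp=6.3, h_donors=4, h_acceptors=12),
}
drug_like = [name for name, d in candidates.items() if passes_lipinski(**d)]
print(drug_like)
```

Lead-like filters follow the same pattern with tighter thresholds (e.g., lower molecular weight and logP limits).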
Designing a community resource - Sandra Orchard (EMBL-ABR)
The document discusses designing a new community resource called the Complex Portal to describe protein complexes. It emphasizes conducting user studies, using community standards to enable data sharing and tool interoperability, and obtaining community input to ensure the resource meets researcher needs. Standards like PSI-MI and controlled vocabularies allow the resource to integrate data from other sources and enable sophisticated searches. Outreach is important to establish the resource as the primary reference.
This document provides an overview of standards and best practices for making computational models reusable through the use of model repositories and standard formats. It discusses the COMBINE initiative for standardizing the encoding of models and simulations. The document encourages authors to make their models and data FAIR (Findable, Accessible, Interoperable, Reusable) by using community standards for publishing, exchanging, and archiving models. Examples of open model repositories and standards-compliant tools and libraries are provided to demonstrate how authors can improve sharing and reuse of their models.
Ontologies and Semantic Web technologies play an important role in the life sciences to help make data more interoperable and reusable. There are now many publicly available ontologies that enable biologists to describe everything from gene function through to animal physiology and disease.
Various efforts such as the Open Biomedical Ontologies (OBO) foundry provide central registries for biomedical ontologies and ensure they remain interoperable through a set of common shared development principles.
At EMBL-EBI we contribute to the development of biomedical ontologies and make extensive use of them in the annotation of public datasets. Biological data typically comes with rich and often complex metadata, so ontologies provide a standard way to capture “what the data is about” and give us hooks to connect to more data about similar things.
These ontology annotations have been put to good use in a number of large-scale data integration efforts and there’s an increasing recognition of the need for ontologies in making data FAIR (Findable, Accessible, Interoperable and Reusable).
EMBL-EBI builds a number of integrative data platforms where ontologies are at the core of our domain models. One example is the Open Targets platform, where disease data from 18 different databases can be aggregated, grouped by therapeutic area in the ontology, and used to identify potential drug targets.
The ontologies team at EMBL-EBI provide a suite of services that are aimed at making ontologies more accessible for both humans and machines. We work with scientific data curators and software developers to integrate ontologies and semantics into both the data generation and data presentation workflows. We provide:
– An ontology lookup service (OLS) that provides search and visualisation services for more than 200 ontologies
– Services for automating the annotation of metadata and learning from previous annotations (Zooma)
– An ontology mapping and alignment service (OXO)
– Tools for working with metadata and ontologies in spreadsheets (Webulous)
– Software for enriching documents in search engines to support “semantic” query expansion
I’ll present how we are using these services at EMBL-EBI to scale up the semantic annotation of metadata. I’ll talk about our open source technology stack and describe how we utilise a polyglot persistence approach (graph databases, triple stores, document stores, etc.) to optimise how we deliver ontologies and semantics to our users.
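As an illustration of machine access to these services, OLS exposes a public REST API that can be queried over HTTP. The endpoint path and response fields below are assumptions based on the current OLS service and should be checked against its documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed OLS search endpoint; verify against the current OLS documentation
OLS_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"

def build_search_url(term, ontology=None):
    """Construct an OLS free-text search URL, optionally restricted to one ontology."""
    params = {"q": term}
    if ontology:
        params["ontology"] = ontology
    return f"{OLS_SEARCH}?{urlencode(params)}"

def search_labels(term, ontology=None):
    """Fetch labels of matching ontology terms (requires network access)."""
    with urlopen(build_search_url(term, ontology)) as resp:
        docs = json.load(resp)["response"]["docs"]
    return [d["label"] for d in docs]

print(build_search_url("diabetes mellitus", ontology="efo"))
```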
Code sharing for microbiomics analysis is proposed through standardized R packages and GitHub. This facilitates reproducible, efficient and collaborative analysis. Examples of standardized preprocessing, diversity analysis and visualization tools are provided. The microbiome package and wiki provide ready-made analysis examples to build upon.
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models (Sean Ekins)
This document discusses using larger datasets to build, validate, and share machine learning models for drug discovery. It notes that datasets were historically very small, but are now much larger, containing thousands to tens of thousands of compounds. Larger datasets allow modeling a broader range of properties and endpoints. The document examines examples of published models for various targets and properties built using large datasets. It emphasizes the importance of making these models openly accessible so they can be more widely used and built upon by others.
Continued development of ChEBI towards better usability for the systems biology... (Neil Swainston)
This document discusses improvements to the Chemical Entities of Biological Interest (ChEBI) database to better support systems biology and metabolic modeling. The project will develop an open source application programming interface (API), called libChEBI, for programmatic access to ChEBI. libChEBI will integrate ChEBI with modeling software, metabolomics data standards, and other databases. The project will also enhance ChEBI's database, curation process, and community involvement to incorporate all known metabolites from key organisms and support metabolic reconstruction. These improvements aim to maximize sharing and reuse of systems biology models.
Cheminformatics is the application of computer science to solve chemical problems. It involves acquiring chemical data through experiments or simulations, managing the information in databases, and analyzing the data. Key aspects of cheminformatics include computer-assisted synthesis design, representing chemical structures digitally, and using mathematical models to analyze chemical data. Cheminformatics plays an important role in drug discovery by aiding processes like target identification, lead discovery, and molecular modeling.
This document discusses structure searching in Reaxys, beginning with an introduction to Reaxys and its contents. It then covers essentials of structure searching such as supported structure editors, differences between editor capabilities and Reaxys search features, and Reaxys' substance model. Examples of simple and sophisticated structure searching techniques are provided. The document concludes with an example of reaction similarity searching to find reactions related to a Diels-Alder reaction.
This document discusses how APIs are enabling new ways of accessing and mashing up drug discovery data from various sources. It provides examples of existing bio/chem/med APIs and case studies of companies that are using APIs in novel ways for tasks like patent chemistry searching and therapeutic intelligence analysis. The document advocates for making APIs more accessible to allow broader exploration of data that can uncover new use cases and insights, while also noting challenges around usability, data discovery, and security.
The National Center for Biomedical Ontology (NCBO) provides several software tools to enable semantically aware applications. These include BioPortal, an online library of biomedical ontologies, Ontology Widgets which allow integration of ontologies into websites, and the Annotator, a web service that semantically tags text with ontology terms. The NCBO also plans to develop a comprehensive index of online biomedical resources.
Mixtures QSAR: modelling collections of chemicals (Alex Clark)
This document discusses representing and modeling chemical mixtures. It proposes a new data format called Mixfile or MInChI to hierarchically define mixtures and their components, including concentrations. This format aims to support cheminformatics applications like property prediction. Examples are given modeling theophylline solubility and gas absorption using mixture data. The document also describes applying similar methods to model polymer entropy of mixing using a spreadsheet dataset converted to the mixtures format. It concludes that defining mixtures in digital formats will enable greater analysis, modeling and use of mixture data.
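The hierarchical mixture representation described above can be sketched as nested records. The structure below is loosely inspired by the Mixfile concept; the field names are invented for illustration and do not follow the actual Mixfile schema.

```python
# Hypothetical hierarchical mixture record: each component may itself
# contain sub-components, mirroring the nested structure of real mixtures.
mixture = {
    "name": "theophylline solution",
    "contents": [
        {"name": "theophylline", "quantity": 8.3, "units": "mg/mL"},
        {"name": "solvent", "contents": [
            {"name": "water", "ratio": 0.9},
            {"name": "ethanol", "ratio": 0.1},
        ]},
    ],
}

def leaf_components(node):
    """Walk the hierarchy and yield the names of all leaf components."""
    children = node.get("contents", [])
    if not children:
        yield node["name"]
    for child in children:
        yield from leaf_components(child)

print(list(leaf_components(mixture)))
```

Flattening a mixture to its leaf components like this is the kind of operation a cheminformatics application would need before, say, looking up component structures for property prediction.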
Mixtures InChI: a story of how standards drive upstream products (Alex Clark)
This document discusses the development of Mixtures InChI (MInChI), a standard for representing chemical mixtures in a machine-readable format. MInChI was developed to address the lack of standards for mixture informatics and interoperability. The document outlines the development of open source tools to generate and edit MInChI notation, as well as efforts to build a community and integrate MInChI into commercial products and databases to enable widespread use and generation of mixture data. Future work discussed includes finalizing the MInChI specification, extending it to additional chemical entities, developing associated properties and metadata, and implementing MInChI at large scale.
2010 CASCON - Towards a integrated network of data and services for the life ...Michel Dumontier
Towards a integrated network of data and services for the life sciences Modern biological knowledge discovery requires access to machine-understandable data that can be searched, retrieved, and subsequently analyzed using a wide array of analytical software and services. The Semantic Automated Discovery and Integration (SADI) framework is a set of conventions to formalize web service inputs and outputs using OWL ontologies that enable the automatic discovery and invocation of Semantic Web services. In this talk, I will walk through a worked example in the design and deployment of chemical semantic web services using the Chemical Development Toolkit, chemical descriptors from the Chemical Information Ontology (CHEMINF), and the Semanticscience Integrated Ontology (SIO) as a unifying, upper level ontology of basic types and relations. I will discuss how one can make use of the SADI-enabled SHARE client to reason about data obtained from Bio2RDF, the largest linked open data project, and automatically invoke chemical semantic web services to determine a chemical's drug-likeness. If you want to see the potential of the Semantic Web being realized, this talk is for you.
This document provides an overview of databases and tools relevant to systems immunology. It discusses several freely available and licensed databases containing gene expression, drug, pathway, and disease data. Issues with third party data like cleanup requirements and need for downloadability are also covered. Examples are given of integrating data from sources like GEO, DrugBank, Connectivity Map, and ImmPort to enable meta-analyses addressing immunological questions.
BioAssay Express: Creating and exploiting assay metadataPhilip Cheung
The challenge of accurately characterizing bioassays is a real pain point for many drug discovery organizations. Research has shown that some organizations have legacy assay collections exceeding 20,000 protocols, the great majority of which are not accurately characterized. This problem is compounded by the fact that many new protocol registrations are still not following FAIR (Findability, Accessibility, Interoperability, and Reusability) Data principles.
BioAssay Express is a tool focused on transforming the traditional protocol description from an unstructured free form text into a well-curated data store based upon FAIR Data principles. By using well-defined annotations for assays, the tool enables precise ontology based searches without having to resort to imprecise keyword searches.
This talk explores a number of new important features designed to help scientists accelerate the drug discovery process. Some example use-cases include: enabling drug repositioning projects; improving SAR models; identifying appropriate machine learning data sets; fine-tuning integrative-omic pathways;
An aspirational goal for our team is to build a metadata schema based on semantic web vocabularies that is comprehensive to the extent that the text description becomes optional. One of the many possibilities is to take the initial prospective ELN entry for a bioassay protocol and feed it directly to an automated instrument. While there are many challenges involved in creating the ELN-to-robot loop, we will provide some insights into our collaborations with UCSF automation experts.
In summary, the ability to quickly and accurately search or analyze bioassay data (public or internal) is a rate limiting problem in drug discovery. We will present the latest developments toward removing this bottleneck.
https://plan.core-apps.com/acs_sd2019/abstract/6f58993d-a716-49ad-9b09-609edde5a3f4
Presentation to ImmPort Science Meeting, February 27, 2014 on the proper treatment of value sets in the Immport Immunology Database and Analysis Portal
Presented by Richard Kidd at "The Future Information Needs of Pharmaceutical & Medicinal Chemistry", Monday 28 November 2011 at The Linnean Society, Burlington Square, London run by the RSC CICAG group.
This document discusses systems biology and some of its tools. It defines systems biology as the study of interactions between parts of biological systems to understand how they function. Biological networks involve interactions between pathways. Networks can be modeled as nodes and edges. Tools described for modeling and analyzing networks include Cytoscape for visualization, CellDesigner for drawing networks, and STRING for protein-protein interaction data. Databases of pathways, interactions and models are also listed.
Virtual screening uses computer-based methods to filter large databases of chemical compounds to identify a subset of compounds that are most likely to bind to and activate a target linked to a disease. It helps address the challenge of exploring the vast chemical space compared to the limited number of compounds that can be experimentally screened. The document discusses various virtual screening methods including ligand-based approaches like similarity searching and pharmacophore modeling as well as structure-based approaches like molecular docking that predict binding orientations. It also covers best practices for applying filters to select for drug-like and lead-like compounds.
Designing a community resource - Sandra OrchardEMBL-ABR
The document discusses designing a new community resource called the Complex Portal to describe protein complexes. It emphasizes conducting user studies, using community standards to enable data sharing and tool interoperability, and obtaining community input to ensure the resource meets researcher needs. Standards like PSI-MI and controlled vocabularies allow the resource to integrate data from other sources and enable sophisticated searches. Outreach is important to establish the resource as the primary reference.
This document provides an overview of standards and best practices for making computational models reusable through the use of model repositories and standard formats. It discusses the COMBINE initiative for standardizing the encoding of models and simulations. The document encourages authors to make their models and data FAIR (Findable, Accessible, Interoperable, Reusable) by using community standards for publishing, exchanging, and archiving models. Examples of open model repositories and standards-compliant tools and libraries are provided to demonstrate how authors can improve sharing and reuse of their models.
Ontologies and Semantic Web technologies play an important role in the life sciences to help make data more interoperable and reusable. There are now many publicly available ontologies that enable biologists to describe everything from gene function through to animal physiology and disease.
Various efforts such as the Open Biomedical Ontologies (OBO) foundry provide central registries for biomedical ontologies and ensure they remain interoperable through a set of common shared development principles.
At EMBL-EBI we contribute to the development of biomedical ontologies and make extensive use of them in the annotation of public datasets. Biological data typically comes with rich and often complex metadata, so the ontologies provide a standard way to capture “what the data is about” and gives us hooks to connect to more data about similar things.
These ontology annotations have been put to good use in a number of large-scale data integration efforts and there’s an increasing recognition of the need for ontologies in making data FAIR (Findable, Accessible, Interoperable and Reusable).
EMBL-EBI builds a number of integrative data platforms where ontologies are at the core of our domain models. One example is the Open Targets platform, where data about disease from 18 different databases can be aggregated and grouped based on therapeutic areas in the ontology, and used to identify potential drug targets.
The ontologies team at EMBL-EBI provide a suite of services that are aimed at making ontologies more accessible for both humans and machines. We work with scientific data curators and software developers to integrate ontologies and semantics into both the data generation and data presentation workflows. We provide:
– An ontology lookup service (OLS) that provides search and visualisation services for more than 200 ontologies
– Services for automating the annotation of metadata and learning from previous annotations (Zooma)
– An ontology mapping and alignment service (OXO)
– Tools for working with metadata and ontologies in spreadsheets (Webulous)
– Software for enriching documents in search engines to support “semantic” query expansion
I’ll present how we are using these services at EMBL-EBI to scale up the semantic annotation of metadata. I’ll talk about our open source technology stack and describe how we utilise a polyglot persistence approach (graph databases, triple stores, document stores etc.) to optimize how we deliver ontologies and semantics to our users.
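As a sketch of how a client might call the OLS search service described above, the fragment below just builds a query URL. The endpoint path matches the commonly documented OLS REST API, but treat it as an assumption, since the service's routes have changed across versions.

```python
from urllib.parse import urlencode

# Build a query URL for the EMBL-EBI Ontology Lookup Service (OLS) search
# endpoint. The path is an assumption based on the OLS REST API docs.
def ols_search_url(query, ontology=None,
                   base="https://www.ebi.ac.uk/ols/api/search"):
    params = {"q": query}
    if ontology:
        # Restrict the search to one ontology, e.g. "efo"
        params["ontology"] = ontology
    return f"{base}?{urlencode(params)}"

print(ols_search_url("diabetes", ontology="efo"))
# https://www.ebi.ac.uk/ols/api/search?q=diabetes&ontology=efo
```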
Code sharing for microbiomics analysis is proposed through standardized R packages and GitHub. This facilitates reproducible, efficient and collaborative analysis. Examples of standardized preprocessing, diversity analysis and visualization tools are provided. The microbiome package and wiki provide ready-made analysis examples to build upon.
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models - Sean Ekins
This document discusses using larger datasets to build, validate, and share machine learning models for drug discovery. It notes that datasets were historically very small, but are now much larger, containing thousands to tens of thousands of compounds. Larger datasets allow modeling a broader range of properties and endpoints. The document examines examples of published models for various targets and properties built using large datasets. It emphasizes the importance of making these models openly accessible so they can be more widely used and built upon by others.
Continued development of ChEBI towards better usability for the systems biolo... - Neil Swainston
This document discusses improvements to the Chemical Entities of Biological Interest (ChEBI) database to better support systems biology and metabolic modeling. It will develop an open source application programming interface (API) called libChEBI to programmatically access ChEBI. libChEBI will integrate ChEBI with modeling software, metabolomics data standards, and other databases. It will also enhance ChEBI's database, curation process, and community involvement to incorporate all known metabolites from key organisms and support metabolic reconstruction. These improvements aim to maximize sharing and reuse of systems biology models.
Cheminformatics is the application of computer science to solve chemical problems. It involves acquiring chemical data through experiments or simulations, managing the information in databases, and analyzing the data. Key aspects of cheminformatics include computer-assisted synthesis design, representing chemical structures digitally, and using mathematical models to analyze chemical data. Cheminformatics plays an important role in drug discovery by aiding processes like target identification, lead discovery, and molecular modeling.
This document discusses structure searching in Reaxys, beginning with an introduction to Reaxys and its contents. It then covers essentials of structure searching such as supported structure editors, differences between editor capabilities and Reaxys search features, and Reaxys' substance model. Examples of simple and sophisticated structure searching techniques are provided. The document concludes with an example of reaction similarity searching to find reactions related to a Diels-Alder reaction.
This document discusses how APIs are enabling new ways of accessing and mashing up drug discovery data from various sources. It provides examples of existing bio/chem/med APIs and case studies of companies that are using APIs in novel ways for tasks like patent chemistry searching and therapeutic intelligence analysis. The document advocates for making APIs more accessible to allow broader exploration of data that can uncover new use cases and insights, while also noting challenges around usability, data discovery, and security.
The National Center for Biomedical Ontology (NCBO) provides several software tools to enable semantically aware applications. These include BioPortal, an online library of biomedical ontologies, Ontology Widgets which allow integration of ontologies into websites, and the Annotator, a web service that semantically tags text with ontology terms. The NCBO also plans to develop a comprehensive index of online biomedical resources.
Mixtures QSAR: modelling collections of chemicals - Alex Clark
This document discusses representing and modeling chemical mixtures. It proposes a new data format called Mixfile or MInChI to hierarchically define mixtures and their components, including concentrations. This format aims to support cheminformatics applications like property prediction. Examples are given modeling theophylline solubility and gas absorption using mixture data. The document also describes applying similar methods to model polymer entropy of mixing using a spreadsheet dataset converted to the mixtures format. It concludes that defining mixtures in digital formats will enable greater analysis, modeling and use of mixture data.
Mixtures InChI: a story of how standards drive upstream products - Alex Clark
This document discusses the development of Mixtures InChI (MInChI), a standard for representing chemical mixtures in a machine-readable format. MInChI was developed to address the lack of standards for mixture informatics and interoperability. The document outlines the development of open source tools to generate and edit MInChI notation, as well as efforts to build a community and integrate MInChI into commercial products and databases to enable widespread use and generation of mixture data. Future work discussed includes finalizing the MInChI specification, extending it to additional chemical entities, developing associated properties and metadata, and implementing MInChI at large scale.
Mixtures as first class citizens in the realm of informatics - Alex Clark
Presented at Cambridge (UK) cheminformatics meeting, February 2021. Mixtures of chemicals are underutilised from an informatics point of view, and this presentation shows some of the work done by Collaborative Drug Discovery, IUPAC and InChI Trust to remedy this.
See recording: https://www.youtube.com/watch?v=0ILc0owuEzQ&list=PLfj_gc4RCduuwv9p8lh2xS1EhQ3p_Nd9S&index=1 ... my part starts at 1:05:00
Mixtures: informatics for formulations and consumer products - Alex Clark
The document proposes standards for representing mixtures in a machine-readable format. It introduces Mixfile and MInChI (Mixtures InChI) as hierarchical and concise formats for describing mixtures. Examples of formulations are provided to demonstrate how components, concentrations, and metadata can be encoded. Potential applications of the standards are discussed, such as enabling sophisticated searches of mixture data from publications and vendors to facilitate properties prediction and hazards assessment. Adoption of the standards could help ensure the longevity and sharing of mixture data.
Chemical mixtures: File format, open source tools, example data, and mixtures... - Alex Clark
This document discusses representing chemical mixtures using an open format called Mixfile. It proposes Mixfile as a standard format for mixtures, analogous to Molfile for individual molecules. Tools were created to edit and manipulate Mixfiles. Over 5,600 real-world mixture examples were extracted from text and represented in the Mixfile format. A MInChI notation was also defined as a condensed representation of mixtures. Future work is proposed to integrate mixture definitions and lookups into electronic lab notebooks and improve automated extraction of mixture information from text.
Bringing bioassay protocols to the world of informatics, using semantic annot... - Alex Clark
This document discusses bringing bioassay protocols into the world of informatics by using semantic annotations. It describes how measurements from bioassays contain many details that are usually only available as text, and outlines an approach using ontologies, natural language processing, and machine learning to extract this information and make it accessible for searching, comparing datasets, and identifying trends. The goal is to make all bioassay protocol data machine readable by developing common templates and annotation standards that can be applied to existing and new assay data sources.
Autonomous model building with a preponderance of well annotated assay protocols - Alex Clark
Combining large amounts of publicly available structure-activity data with assays that have carefully curated annotations opens the door to a number of ways to analyze the data behind the scenes. Combining fully machine readable input for a diverse variety of projects with modelling techniques that can be used without fussy parametrization allows models to be created and updated whenever new data arrives. Predictions from these models can be integrated into normal searching and visualization workflows, without any need for the user to opt-in or make extra decisions. This approach is novel and different from the way structure-activity models are normally deployed: useful predictions can be presented ubiquitously with literally zero additional work on behalf of the user. We will present our efforts to date regarding ways to both passively and actively draw attention to important drug discovery trends while exploring compounds and assays.
Representing molecules with minimalism: A solution to the entropy of informatics - Alex Clark
Cheminformatics as we know it is possible because so many molecular structures can be represented with datastructures and rules that are at first glance quite trivial. This first impression is highly misleading, since even within supposedly well behaved domains, edge cases arising from issues such as resonance, tautomerization, symmetry and stereochemistry - to name but a few - quickly add up. To supplement these genuine challenges, there is a whole additional class of problems caused by the mismatch between chemists' understanding of molecules and the datatypes that are necessary to capture a structure for informatics purposes. This line is blurred by the convenience of representing structures in a form that is very closely related to the diagram styles that have been in use since the dawn of chemistry. There are currently four major approaches to structure representation: connection tables (e.g. MDL Molfile), sketches (e.g. ChemDraw), canonical strings (e.g. SMILES and InChI) and atomic models (numerous 3D formats). Not only do all of these approaches have valid use cases, but they are deceptively incompatible with each other, even when addressing identical needs. Almost without exception, format conversions are not commutative, and every translation involves losing some amount of data. Given that recording chemical structures in machine readable form has become such a critical part of scientific research, it is essential to define a fundamental representation that captures the key structural definition asserted by the experimental chemist, for a broad and useful range of molecules, and ideally in a way that is closely related to visual drawing mnemonics. The number of data concepts needed to satisfy these conditions is quite small, and is mostly satisfied by the most commonly used subset of the venerable MDL Molfile format. 
This presentation will discuss how this subset, with a few minor corrections and clarifications, can and should be used as the reference standard for molecules, and how the informatics community can benefit from having well defined standards.
SLAS2016: Why have one model when you could have thousands? - Alex Clark
Society for Laboratory Automation & Screening, San Diego, January 2016. Presented by Dr. Alex M. Clark. Describes the use of open data resources (ChEMBL) to build target-activity models for drug discovery and toxicity prediction, on a massive scale, using a fully automated process. Concludes with a demo of the PolyPharma app, which shows how these models can be used for prospective drug discovery.
The anatomy of a chemical reaction: Dissection by machine learning algorithms - Alex Clark
This document discusses using machine learning algorithms to analyze chemical reaction data. It describes how current reaction reporting formats are not well-suited for computational analysis. A more structured reporting format is proposed to fully describe reactions in a digitally friendly way, including specifying reactants, products, quantities, yields, and metrics like atom efficiency. This structured data would allow modeling of reaction substitutability and enable large-scale machine learning of chemical transformations.
Compact models for compact devices: Visualisation of SAR using mobile apps - Alex Clark
Presented at American Chemical Society meeting, Boston, 2015. Describes how cheminformatics algorithms and visualisation interfaces have advanced on mobile apps to cover a diverse variety of functionality, increasingly calculated on the device itself rather than deferring to a web service. Culminates in a demo of the PolyPharma app prototype (see http://cheminf20.org/2015/08/06/the-polypharma-app-a-mash-up-of-ideas-and-technology)
Green chemistry in chemical reactions: informatics by design - Alex Clark
Chemical informatics technology can be of assistance to chemists for describing reactions in numerous ways, including calculating green chemistry metrics such as process mass intensity, E-factor and atom economy. To facilitate this, chemical reactions have to be described in more precise detail than is the norm for most chemists. There are also numerous practical ways to add more green chemistry functionality to lab notebooks, such as enumerating searchable reaction transforms for environmentally favourable reactions, automatically looking up toxicity and hazard information, and others which are mentioned in the slides.
This presentation was given at the Green Chemistry & Engineering conference in 2015 (American Chemical Society Green Chemistry Institute).
Green chemistry is an important subject that needs to be a part of every chemist's education, as well as a part of the daily routine of the professional synthetic chemist. This talk describes how a new app can be used to bring green chemistry metrics to reaction descriptions, once they are captured in a proper cheminformatics format. It also describes some of the additional data resources that can be incorporated into the user experience, and how this helps both students and professionals.
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014) - Alex Clark
Mobile apps for cheminformatics are quite powerful on their own, but can be significantly boosted by connecting them with cloud-hosted functionality. This talk explores the range of functionality that can be covered simply by making use of apps with stateless webservices, i.e. anonymous access without persistent data.
Building a mobile reaction lab notebook (ACS Dallas 2014) - Alex Clark
This document discusses building a mobile electronic lab notebook focused on chemical reactions called the Green Lab Notebook. It would allow users to draw chemical structures, balance reactions, and calculate quantities, yields, and green metrics. Key features include digitally capturing reaction data, prioritizing computer-friendly data structures and intuitive workflows, and linking to external databases for solvent data, sustainable feedstocks, and curated green reaction transforms. The goal is to facilitate recording, analyzing, and promoting the reuse of experimental reaction data in a sustainable chemistry context.
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Presented at the German Chemoinformatics Conference in Fulda, 2013: entitled "Putting together the pieces: building a reaction-centric electronic lab notebook for mobile devices".
Cheminformatics workflows using the mobile + cloud platform. Presentation by Dr. Alex M. Clark of Molecular Materials Informatics at the NETTAB 2013 meeting in Venice, Italy. The presentation introduces the significance of mobile apps in science, and the scope of their capabilities in chemical structure informatics. The bulk of the talk describes an account of a preliminary workflow using open science data to search for viable leads for a cure for tuberculosis. The workflow described makes use of a combination of mobile, cloud and conventional desktop-based technology, all stitched together by facile communication, sharing and collaboration features.
Open Drug Discovery Teams @ Hacking Health Montreal - Alex Clark
Alex Clark of Molecular Materials Informatics (http://molmatinf.com) presents the Open Drug Discovery Teams project to the Hacking Health Montreal audience, April 2013.
2. The 21st century (supposedly)
Two PubChem assay descriptions (504854 and 449764):

A Cell Based Secondary Assay To Explore Vero Cell Cytotoxicity of Purified and Synthesized Compounds that Inhibit Mycobacterium Tuberculosis (4)

This functional assay was developed for detection of compounds inhibiting Vero E6 cell viability, as a secondary screen to the beta-lactam sensitizing M. tuberculosis bactericidal assay.

In this assay, we treated Vero E6 cells with compounds selected as hits in the M. tuberculosis assay for 72 hours over a 10 point 2-fold dilution series, ranging from 0.195 uM to 100 uM. Following 72 hours of treatment, relative viable cell number was determined using CellTiter-Glo from Promega. Each plate contained 64 replicates of vehicle treated cells, which served as negative controls.

Outcome: Compounds that showed <70% cell viability for at least one concentration were defined as "Active". If the % viability at all doses was >70%, the compound was defined as "Inactive".

...

A High Throughput Confirmatory Assay used to Identify Novel Compounds that Inhibit Mycobacterium Tuberculosis in the absence of Glycerol

Outcome: Compounds that showed >30% inhibition for at least one concentration in the Mtb dose response were defined as "Active". If the inhibition at all doses was <30% in the Mtb assay, the compound was defined as "Inactive". In the primary screen a compound was deemed "Inactive" if it had a Percent Inhibition <70.31%. Compounds with a Percent Inhibition >70.31% that were not selected for follow up dose response were labeled "Inconclusive."

The following tiered system has been implemented at Southern Research Institute for use with the PubChem Score. Compounds in the primary screen are scored on a scale of 0-40 based on inhibitory activity, where a score of 40 corresponds to 100% inhibition. In the confirmatory dose response screen, active compounds were scored on a scale of 41-80 based on the IC50 result in the Mtb assay, while compounds that did not confirm as actives were given the score 0.

...

69% similar (?)
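The outcome rules quoted in the two assay descriptions above reduce to simple thresholds. A minimal Python sketch (function names are mine; the thresholds and dilution series come directly from the protocol text):

```python
# Hit-calling rules from the two protocol texts above (illustrative only).

def vero_cytotoxicity_call(viability_pct):
    """Secondary assay: 'Active' if <70% viability at any concentration."""
    return "Active" if any(v < 70.0 for v in viability_pct) else "Inactive"

def mtb_dose_response_call(inhibition_pct):
    """Confirmatory assay: 'Active' if >30% inhibition at any concentration."""
    return "Active" if any(i > 30.0 for i in inhibition_pct) else "Inactive"

# The 10-point 2-fold dilution series from the first protocol:
doses_um = [100.0 / 2**i for i in range(10)]   # 100 uM down to ~0.195 uM

print(vero_cytotoxicity_call([95, 90, 80, 65]))  # "Active"
print(round(doses_um[-1], 3))                    # 0.195
```

This is exactly the kind of logic that stays trapped in free text until the protocol is annotated with machine readable terms.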
4. Common Assay Template
• Built from semantic web ontologies:
BioAssay Ontology (BAO)
Drug Target Ontology (DTO)
Cell Line Ontology (CLO)
Gene Ontology (GO)
... and others
• Each annotated term is a URI: compatible with the universe of linked data
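To make "each annotated term is a URI" concrete, the fragment below composes one annotation as an RDF statement in N-Triples form. The BAO namespace follows the ontology's published convention, but the specific term IDs and the assay URI are made-up placeholders, not real entries.

```python
# Compose a single semantic annotation as an N-Triples statement.
# BAO_9999998 / BAO_9999999 are placeholder IDs, not real BAO terms.
BAO = "http://www.bioassayontology.org/bao#"

def annotation_triple(assay_uri, property_id, value_id):
    """Render (assay, property, value) as one N-Triples line."""
    return f"<{assay_uri}> <{BAO}{property_id}> <{BAO}{value_id}> ."

triple = annotation_triple(
    "http://example.org/assay/504854",  # hypothetical assay record URI
    "BAO_9999998",                      # e.g. an 'assay format' property
    "BAO_9999999",                      # e.g. a 'cell-based format' value
)
print(triple)
```

Because the subject, predicate and object are all URIs, the annotation can be merged with any other linked-data source without renaming anything.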
6. New Templates
• Editor and schema are open source:
http://github.com/cdd/bioassay-template
• The Common Assay Template (CAT) designed mainly for assays from the NIH Molecular Libraries Program: to capture high level characteristics
• Personalised templates can be created
• Other general templates are planned: Toxicity?
7. Content Creation
• Refining the template & user interface, and generating training data, using text from PubChem assays
8. Machine Learning
• Existing data used for learning: proposed annotations
[diagram: natural language processing → Bayesian models]
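The "natural language processing → Bayesian models" pipeline on this slide can be illustrated with a toy word-based naive Bayes scorer. This is my own minimal stand-in, not the production code: assay text is tokenised, and each candidate annotation is scored by how likely its training texts make those words.

```python
import math
from collections import Counter, defaultdict

# Toy naive Bayes annotation proposer: a stand-in for the slide's
# NLP + Bayesian-model pipeline, with add-one smoothing.
class AnnotationProposer:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_totals = Counter()            # label -> total words seen

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_totals[label] += len(words)

    def score(self, text, label):
        """Smoothed log-likelihood of the text under one label."""
        total = self.label_totals[label]
        vocab = len(self.word_counts[label]) + 1
        return sum(math.log((self.word_counts[label][w] + 1) / (total + vocab))
                   for w in text.lower().split())

    def propose(self, text):
        return max(self.label_totals, key=lambda lb: self.score(text, lb))

p = AnnotationProposer()
p.train("vero cell viability cytotoxicity luminescence", "cell-based")
p.train("purified enzyme kinase inhibition biochemical", "biochemical")
print(p.propose("cell viability measured by luminescence"))  # "cell-based"
```

The real system learns from existing curated assays, so every confirmed annotation improves the next round of proposals.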
9. Ontology Branches
• Each annotation is part of a tree structure
• Inheritance provides extra layers of meaning
• From common assay template + ontologies
• Updated as ontologies are extended or modified
10. Searching
• Design a search using semantic terms
• Similar interface to annotation
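Because each annotation sits in an ontology tree, a semantic search for a term can also match assays annotated with any descendant term. A minimal sketch with a toy ontology (these terms are illustrative, not taken from BAO):

```python
# Toy ontology: child -> parent. A query matches an annotation if the
# query term is the annotation itself or any of its ancestors.
PARENT = {
    "luciferase reporter assay": "reporter gene assay",
    "reporter gene assay": "cell-based assay",
    "cell-based assay": "assay format",
}

def ancestors(term):
    """Yield the term and every ancestor up to the root."""
    while term is not None:
        yield term
        term = PARENT.get(term)

def matches(annotation, query):
    return query in ancestors(annotation)

print(matches("luciferase reporter assay", "cell-based assay"))  # True
print(matches("cell-based assay", "reporter gene assay"))        # False
```

This is the inheritance payoff from the previous slide: annotating with a specific term automatically makes the assay findable under every broader term.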
13. Interoperability
• Service auto-downloads assays from PubChem: whitelisted sources (mainly MLPCN)
• Curated assays can be downloaded via API using semantic web formats (Turtle/RDF/JSON-LD)
• Adding finalised content currently by manual upload
• Discussions with PubChem regarding two way communication, e.g. sending annotations back to the original assay record
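A client for the curated-assay download described above might look like the following sketch. The route and query parameter are hypothetical, since the slide only states that curated assays are available via API in Turtle/RDF/JSON-LD; content negotiation via the Accept header is a standard way such an API could select the format.

```python
import urllib.request

# Hypothetical API route: the slide does not specify the actual paths.
def build_request(assay_id, fmt="text/turtle",
                  base="https://www.bioassayexpress.com"):
    """Prepare a request for one curated assay in a semantic web format."""
    url = f"{base}/api/assay?id={assay_id}"   # made-up route
    return urllib.request.Request(url, headers={"Accept": fmt})

req = build_request("504854")
print(req.full_url)             # the composed URL
print(req.get_header("Accept")) # text/turtle
```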
14. Deployment
• www.bioassayexpress.com is public
  – currently used as a read-only service
  – evolving rapidly with new experimental features
  – free as in beer, not speech
• Private installations and derived products are planned
• Key features will be integrated into CDD Vault & ELN
• Core components & data model open source (GitHub): community adoption is welcomed
15. Probes Report
• Possible use case (work in progress)
• Rows: NIH probe compounds
• Columns: curated assay targets
• Retroactive look at the data: more of it, better quality
16. Future Features
• Semantic annotation of assays: make it really easy
  – ELNs are immediate beneficiaries
  – tag, organise, search for assay protocols
• Model building: thousands of datasets - which of them are measuring the same property in the same way?
• Deep machine learning analysis (open ended!)
• Advanced schema: encode the entire protocol
• An alternative to text: annotate first, automatic report
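For the model-building question above ("which datasets are measuring the same property in the same way?"), structured annotations make the check trivial: group assays whose annotation tuples agree. A sketch with made-up annotation keys and values:

```python
from collections import defaultdict

# Group assays whose (property, units, detection) annotations agree.
# Keys and values here are invented for illustration.
assays = {
    "AID-1": {"property": "IC50", "units": "uM", "detection": "luminescence"},
    "AID-2": {"property": "IC50", "units": "uM", "detection": "luminescence"},
    "AID-3": {"property": "EC50", "units": "uM", "detection": "fluorescence"},
}

groups = defaultdict(list)
for aid, ann in assays.items():
    key = tuple(sorted(ann.items()))   # canonical form of the annotations
    groups[key].append(aid)

# Only groups with 2+ assays can be pooled into one model.
comparable = [ids for ids in groups.values() if len(ids) > 1]
print(comparable)  # [['AID-1', 'AID-2']]
```

With free-text protocols this grouping requires a human to read each record; with curated annotations it is a dictionary lookup.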
17. Toxicity Data
• One approach might be to design a new schema (e.g. the Toxicity Assay Template)
• Submission of data such as ToxCast into PubChem, divided into assays and compounds...
• ... trigger automatic downloading into BioAssay Express, allowing them to be annotated
• Annotating a large number of toxicity assays would open a universe of possibilities
18. Acknowledgments
• Biologists: Janice Kranz, Haifa Ghandour, Karen Featherstone
• Collaborative Drug Discovery: Barry Bunin, Kellan Gregory, and the rest of the team
• More information:
  http://github.com/cdd/bioassay-template
  http://www.bioassayexpress.com
  http://collaborativedrug.com
PeerJ Comp Sci 2:e61 (2016)