This document provides an overview of connectivity between chemistry, biology, and published documents. It discusses the challenges of extracting this information ("D-A-R-C-P") from publications and patents. While some commercial and open-source efforts curate this data, most of it remains buried in documents. Automated extraction has limitations compared to expert curation. The document argues that authors should directly connect their results to databases to improve the flow of information.
Progress in drug discovery and chemical biology is hugely enabled by curated document-assay-result-compound-target relationships (D-A-R-C-P) in open databases from resources such as the Guide to Pharmacology and ChEMBL. These are synergistically integrated into PubChem, which pre-computes chemical similarity and connectivity between over 95 million structures and 5.6 million BioAssay results. It also links chemistry to documents via various additional routes, including MeSH and large-scale submissions from publishers. However, these efforts are patchy and very few journals facilitate such connectivity. There thus remains a massive shortfall in public D-A-R-C-P capture from decades of papers and patents. This presentation will cover these aspects and discuss their partial amelioration by options such as author-driven depositions and open lab-book approaches as used by Open Source Malaria.
The patent literature has historically been complex and inaccessible to the searches required for effective IP management and maintenance of a competitive position, particularly when it comes to chemical structure information. The availability of raw patent text feeds in a structured form has allowed the application of text-to-structure and image-to-structure conversion techniques. The problem then became one of applying this solution across massive data sets in an accurate and scalable manner, to deliver a turnkey patent informatics system with automatically extracted, searchable chemical structures. SureChem, an advanced cloud application, uses a tournament of methods to achieve higher coverage and accuracy than any single approach. This product was launched and licensed to a user community under a freemium business model. Latterly, user feedback and market shifts indicated a need to link biological data into patents too (sequences, genes, targets, diseases, etc.). This created an opportunity to transition SureChem to EMBL-EBI, a public organisation with the remit of data dissemination and sharing, and deep experience of biodata, including the large ChEMBL database of structure-activity relationship data. In 2014 SureChem became SureChEMBL. The presentation will review the development of SureChem, discuss the marketplace for patent informatics, and look ahead to future development plans for SureChEMBL.
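The "tournament of methods" idea can be sketched as running several independent converters on the same input and trusting only an answer that a majority of them agree on. A minimal sketch, in which the converter functions are hypothetical stand-ins for real name-to-structure or image-to-structure tools:

```python
from collections import Counter

def tournament(name, converters):
    """Run several independent name-to-structure converters and keep
    the structure most of them agree on (majority vote)."""
    candidates = [conv(name) for conv in converters]
    candidates = [c for c in candidates if c is not None]  # drop failed conversions
    if not candidates:
        return None
    structure, votes = Counter(candidates).most_common(1)[0]
    # Require agreement from at least two methods before trusting the result.
    return structure if votes >= 2 else None

# Toy converters returning SMILES strings; a real pipeline would wrap
# actual name-to-structure engines or optical structure recognition tools.
conv_a = lambda name: "CCO" if name == "ethanol" else None
conv_b = lambda name: "CCO" if name == "ethanol" else None
conv_c = lambda name: "C(C)O" if name == "ethanol" else None

print(tournament("ethanol", [conv_a, conv_b, conv_c]))  # CCO
```

In practice the vote would operate on standardised structures (so that "CCO" and "C(C)O" count as the same answer), which is one reason compound standardisation matters in such a pipeline.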
The internet has changed the way we access chemistry data, providing access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with the counts for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of the noise in the larger databases, their value becomes highly dependent on the specific application; one example is using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Enabling HTS Hit Follow-up via Cheminformatics, File Enrichment, and Outsourcing (Graham Smith): the history of parallel chemistry for lead discovery at Pfizer Sandwich, from beginning to outsourcing.
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery? (Dr. Haxel Consult)
Fernando Huerta (RISE Bioscience & Materials, SE)
Alexander Minidis (Collaborative Drug Discovery - CDD VAULT, Sweden)
How much information do scientists need to design new potential drugs?
A thorough overview of public scientific information sources (open access) and methods to collect, process, analyse and visualize this information will be presented. A direct application of such freely available information, in conjunction with freeware, will be described in relation to the efforts of the scientific community to find effective medicines for the Zika virus.
Tens of thousands of chemicals are currently in commerce, and hundreds more are introduced every year. Because current chemical testing is resource intensive, only a small fraction of chemicals have been adequately evaluated for potential human health effects. New technologies and computational tools have shown promise for closing this knowledge gap. In the U.S. EPA’s ToxCast effort, the use of ~700 high-throughput in vitro assays has broadly characterized the biological activity and potential mechanisms of ~1,800 chemicals. Coupling the high-throughput in vitro assays with additional in vitro pharmacokinetic assays and in vitro-to-in vivo extrapolation modeling allows conversion of in vitro bioactive concentrations to estimates of an administered dose (mg/kg/day). High-throughput exposure models are generating exposure estimates based on key aspects of chemical production, fate, transport, and personal use. The path for incorporating new approach methods and technologies for prioritization and assessment of chemical alternatives poses multiple scientific challenges. These challenges include sufficient coverage of toxicological mechanisms to meaningfully interpret negative test results, development of increasingly relevant test systems, computational modeling to integrate experimental data, characterizing uncertainty, and efficient validation of the test systems and computational models. The presentation will cover progress at the U.S. EPA in the development and application of these technologies and approaches in evaluating alternatives and systematically addressing each of these challenges. This abstract does not necessarily reflect U.S. EPA policy.
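In its simplest linear form, the in vitro-to-in vivo extrapolation step described above reduces to dividing a bioactive concentration by the steady-state plasma concentration predicted for a unit dose. A minimal sketch with made-up numbers; the function name and values are illustrative, not EPA code:

```python
def administered_equivalent_dose(bioactive_conc_uM, css_uM_per_mg_kg_day):
    """Reverse dosimetry sketch: scale an in vitro bioactive concentration
    (uM) by the steady-state plasma concentration predicted for a
    1 mg/kg/day dose (uM), assuming linear pharmacokinetics.
    Returns an administered equivalent dose in mg/kg/day."""
    return bioactive_conc_uM / css_uM_per_mg_kg_day

# Illustrative numbers: an assay AC50 of 3 uM and a predicted Css of
# 1.5 uM per 1 mg/kg/day give an estimated bioactive dose of 2 mg/kg/day.
aed = administered_equivalent_dose(3.0, 1.5)
print(aed)  # 2.0
```

The resulting dose estimate can then be compared against exposure model predictions to prioritize chemicals whose estimated exposures approach bioactive doses.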
Access to both experimental and predicted environmental fate and transport data is facilitated by the US-EPA CompTox Chemicals Dashboard. Providing access to various types of data associated with ~900,000 chemical substances, the dashboard is a web-based application supporting computational toxicology research in environmental chemistry. When experimental physicochemical and fate and transport data are not available, QSAR models developed using curated datasets are used for the prediction of properties. These include bioaccumulation factors, bioconcentration factors, and biodegradation and fish biotransformation half-lives. For chemicals of interest that are not already registered in the dashboard, real-time predictions based on structural inputs are available. This presentation will provide an overview of the dashboard with a focus on the availability of environmental fate and transport data, access to real-time predictions, and our ongoing efforts to harvest and curate available experimental data from the literature and online databases. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and as a result the dashboard surfaces hundreds of thousands of data points. Other data includes experimental and predicted physicochemical property data, in vitro bioassay data for over 4000 chemicals and 2000 assays, and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include an interactive read-across module, real-time physicochemical and toxicity endpoint prediction and an integrated search to PubMed. This presentation will provide an overview of the latest release of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
FAIR Data and Model Management for Systems Biology (and SOPs too!) - Carole Goble
MultiScale Biology Network Springboard meeting, Nottingham, UK, 1 June 2015
FAIR Data and model management for Systems Biology
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is, the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Yes, data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure the reproducibility of results. Funding agencies expect data (and increasingly software) management, retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes pre- and post-publication. And the multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Data and model management for the Systems Biology community is a multi-faceted challenge, including: the development and adoption of appropriate community standards (and the navigation of the standards maze); the sustaining of international public archives capable of servicing quantitative biology; and the development of the necessary tools and know-how for researchers within their own institutes, so that they can steward their assets in a sustainable, coherent and credited manner while minimising burden and maximising personal benefit.
The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has grown out of several efforts in European programmes (the SysMO and ERASysAPP ERA-Nets and the ISBE ESFRI) and national initiatives (de.NBI, the German Virtual Liver Network, SystemsX, UK SynBio centres). It aims to support Systems Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Sys Bio, highlighting the challenges multi-scale biology presents.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Overview of the SureChEMBL system and web interface.
https://www.surechembl.org/search/
SureChEMBL is a freely available web resource for chemistry patent searching. It is based on a fully automatic and dynamic text and image mining pipeline.
ChemSpider is being built with the intention of being a chemical structure-centric community for chemists. With over 16 million chemical structures as of August 2007, and with data deposition and curation mechanisms in place for text, structures and spectra, ChemSpider intends to be a meeting place and collaborative environment for chemists to work together.
The US EPA’s CompTox Chemistry Dashboard provides access to various types of data associated with ~760,000 chemical substances. These data include experimental and predicted property data, high-throughput screening assay data and hazard and environmental exposure data. With millions of individual data points and annotations associated with hundreds of thousands of chemicals, data quality is a priority. With tens of thousands of individual users per month browsing the data on the dashboard, the ability of users to provide feedback has allowed us to identify, confirm and address issues in the data. This has required the implementation of novel approaches for data feedback via the user interface, ranging from general feedback on the dashboard down to individual data points contained in a table. We are presently investigating ways to garner feedback on our ToxCast bioassay data to facilitate the curation of tens of thousands of data points. This presentation will provide an overview of our existing capabilities in the CompTox Chemistry Dashboard for gathering crowdsourced data from the user base and its impact on assisting in the curation of data.
This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Non-targeted analysis (NTA) uses high-resolution mass spectrometry to better understand the identity of a wide variety of chemicals present in environmental samples (and other matrices). However, data processing remains challenging due to the vast number of chemicals detected in samples, software and computational requirements of data processing, and inherent uncertainty in confidently identifying chemicals from candidate lists. Analysis of the resultant mass spectrometry information relies on cheminformatics to identify and rank chemicals and the US EPA has developed functionality within the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) to address challenges related to this analysis. These tools include the generation of “MS-Ready” structures to optimize database searching, retention time prediction for candidate reduction, consensus ranking using chemical metadata, and in silico MS/MS fragmentation prediction for spectral matching. Combining these tools into a comprehensive workflow improves certainty in candidate identification. This presentation will review how the CompTox Chemicals Dashboard via its flexible search capabilities, rich data for ~900,000 chemical substances, and visualization approaches within this open chemistry resource provides a freely available software tool to support structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
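Consensus ranking of candidate structures can be sketched as a composite score over candidate metadata, with data-source counts rewarding well-known chemicals and retention time error penalising poor chromatographic fits. The field names and weighting below are hypothetical, not the Dashboard's actual scheme:

```python
def rank_candidates(candidates):
    """Rank formula-match candidates by a composite score.
    Each candidate is a dict with hypothetical metadata fields:
      'sources'  - number of databases listing the chemical
      'rt_error' - absolute predicted-vs-observed retention time error (min)
    More data sources and smaller retention time error rank higher."""
    def score(c):
        return c["sources"] - 0.5 * c["rt_error"]  # illustrative weighting
    return sorted(candidates, key=score, reverse=True)

# Toy candidate list for a single molecular formula match.
hits = [
    {"name": "isomer A", "sources": 12, "rt_error": 0.4},
    {"name": "isomer B", "sources": 30, "rt_error": 2.0},
    {"name": "isomer C", "sources": 3,  "rt_error": 0.1},
]
print([c["name"] for c in rank_candidates(hits)])  # ['isomer B', 'isomer A', 'isomer C']
```

A real workflow would fold in further evidence, such as MS/MS fragmentation match scores, before assigning a final rank.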
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL... (ChemAxon)
SureChEMBL is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that annotates, extracts and indexes chemistry from full-text patent documents by means of continuous, automated text and image mining. SureChEMBL is perhaps the only open, freely available, live patent chemistry resource in a field that has traditionally been commercial.
Since its launch last September, the SureChEMBL interface has provided sophisticated keyword- and chemistry-based querying and exporting functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline leverage ChemAxon technologies extensively for name-to-structure conversion, as well as compound standardisation, registration and searching.
In addition to providing an overview of the system, recent developments and improvements will be described. These include the introduction of various data interchange and exporting options, such as flat files and a data feed client. Furthermore, our future plans for the SureChEMBL system will be outlined. To date, such plans include complementing the chemical annotations with biological ones, covering genes, proteins, diseases and indications. We also plan to enrich the chemical annotations with a relevance score indicating their importance in the patent document.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and as a result the dashboard surfaces hundreds of thousands of data points. Other data includes experimental and predicted physicochemical property data, in vitro bioassay data and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include real-time physicochemical and toxicity endpoint prediction and an integrated search to PubMed. This presentation will provide an overview of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
This presentation was made to the University of North Carolina in Chapel Hill on 9/20/21. It gave a general introduction to cheminformatics before demonstrating how to navigate the Dashboard.
• An introduction to the dashboard
• Substances vs structures
• Structure formats for data exchange and connectivity (SMILES, InChIs, molfiles)
• Identifiers – CASRN, chemical names, systematic names
• Data curation approaches: substance-structure ambiguity
• ChemReg: substance registration
• Data gathering for systematic reviews
• Curated lists
• Properties/Fate and Transport
• Access to Exposure Data
• Hazard data in the dashboard – ToxVal data (sourced from >40 databases, >50,000 chemicals, >900,000 data points)
• The Executive Summary of data
• Single chemical searches vs Batch searches
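As a small worked example for the identifiers bullet above, a CAS Registry Number (CASRN) carries a check digit that can be validated in a few lines, which is useful during data curation for catching transcription errors. The function is a sketch, not part of the Dashboard:

```python
def casrn_is_valid(casrn):
    """Validate a CAS Registry Number (e.g. '7732-18-5') using the
    published check-digit rule: each digit except the last is multiplied
    by its position counted from the right, and the sum modulo 10 must
    equal the final (check) digit."""
    digits = casrn.replace("-", "")
    if not digits.isdigit() or len(digits) < 2:
        return False
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * i for i, d in enumerate(reversed(body), start=1))
    return total % 10 == check

print(casrn_is_valid("7732-18-5"))  # True  (water)
print(casrn_is_valid("50-00-0"))    # True  (formaldehyde)
print(casrn_is_valid("7732-18-4"))  # False (corrupted check digit)
```

Note that a valid check digit only proves the number is well-formed; confirming that a CASRN actually refers to the intended substance still requires registry lookup and curation.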
Materials Data Facility as Community Database to Share Nano-manufacturing Rec... (Globus)
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Ben Galewsky from the National Center for Supercomputing Applications (NCSA).
At a time when the data explosion has simply been redefined as “Big”, the hurdles associated with building a subject-specific data repository for chemistry are daunting. Combining a multitude of non-standard data formats for chemicals, related properties, reactions, spectra etc., together with the confusion of licensing and embargoing, and providing for data exchange and integration with services and platforms external to the repository, the challenge is significant. All this at a time when semantic technologies are touted as the fundamental technology to enhance integration and discoverability. Funding agencies are demanding change, especially a change towards access to open data to parallel their expectations around Open Access publishing. The Royal Society of Chemistry has been funded by the Engineering and Physical Sciences Research Council of the UK to deliver a “chemical database service” for UK scientists. This presentation will provide an overview of the challenges associated with this project and our progress in delivering a chemistry repository capable of handling the complex data types associated with chemistry. The benefits of such a repository in terms of providing data to develop prediction models to further enable scientific discovery will be discussed, and the potential impact on the future of scientific publishing will also be examined.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program integrates advances in biology, chemistry, exposure and computer science to help prioritize chemicals for further research based on potential human health risks. This work involves computational and data-driven approaches that integrate chemistry, exposure and biological data. As an outcome of these efforts the National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, legacy in vivo animal data, consumer use and production information, exposure models and chemical structure databases with associated properties. A series of software applications and databases have been produced over the past decade to deliver these data, but recent developments have focused on a new software architecture that assembles the resources into a single platform. Our web application, the CompTox Chemistry Dashboard, provides access to data associated with ~750,000 chemical substances. These data include experimental and predicted physicochemical property data, bioassay screening data associated with the ToxCast program, product and functional use information and a myriad of related data of value to environmental scientists.
The dashboard provides chemical-based searching based on chemical names, synonyms and CAS Registry Numbers. Flexible search capabilities allow for chemical identification based on non-targeted analysis studies using mass spectrometry. Chemical identification using both mass and formula-based searching utilizes rank-ordering of results via functional use statistics, thereby providing a solution to help prioritize chemicals for further review when detected in environmental media.
This presentation will provide an overview of the dashboard, its capabilities for delivering data to the environmental chemistry community and how the architecture provides a foundation for the development of additional applications to support chemical risk assessment. This abstract does not reflect U.S. EPA policy.
Communication of chemistry in the internet era, while improved, remains challenged in terms of lossless data exchange. While there are moves afoot within the publishing industry to produce “data journals”, including embracing some of the new approaches for making data available to the community, many challenges remain. Chemistry data sharing, at even the most basic level, remains a challenge for many chemistry journals. The vast majority of chemistry data is provided as PDF files or trapped on webpages, and is therefore not available for reuse and repurposing without a significant amount of effort to extract the data. Some of the responsibility resides with the scientists, who need to be educated and encouraged in the adoption of appropriate exchange formats and utilization of online platforms for data hosting and dissemination. There are certain practices which, if adopted, could increase both the availability and utility of data for the community. These include recognition that data, in itself, has value above and beyond inclusion in peer-reviewed publications, the adoption of standard (not necessarily open) formats, clear data licensing, and distribution of the data across multiple platforms. This presentation will provide an overview of ongoing efforts within the National Center for Computational Toxicology to publish chemistry data, both in databases and associated with peer-reviewed publications, in a manner that makes our data and models consumable by the community.
This abstract does not reflect U.S. EPA policy.
Presentation on the Chemical Analysis Metadata Platform (ChAMP), a new project to characterize and organize metadata about chemical analysis methods. The project will develop an ontology, controlled vocabularies, and design rules.
The construction of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to discriminate the influence of the quality versus quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago. Specific examples of quality issues for the EPISuite data include multiple records for the same chemical structure with different measured property values, inconsistency between the structure, chemical name and CAS registry number for single records, the inability to convert the SMILES strings into chemical structures, hypervalency in the chemical structures and the absence of stereochemistry for thousands of data records. Relative to the era of EPISuite development, modern cheminformatics tools allow for more advanced capabilities in terms of chemical structure representation and storage, as well as enabling automated data validation and standardization approaches to examine data quality. This presentation will review both our manual and automated approaches to examining key datasets related to the EPISuite training and test data. This includes approaches to validate between chemical structure representations (e.g. molfile and SMILES) and identifiers (chemical names and registry numbers), as well as approaches to standardize the data into QSAR-consumable formats for modeling. We have quantified and segregated the data into various quality categories to allow us to thoroughly investigate the resulting models that can be developed from these data slices and to examine to what extent efforts into the development of large high-quality datasets have the expected pay-off in terms of prediction performance. This abstract does not reflect U.S. EPA policy.
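One of the quality checks listed above, multiple records for the same chemical structure with conflicting measured values, can be sketched in a few lines (illustrative only; the structure keys and tolerance are hypothetical):

```python
# Sketch of one automated curation step: flag structures whose duplicate
# records disagree on the measured property value by more than a tolerance.
from collections import defaultdict

def conflicting_records(records, tolerance=0.01):
    """Group (structure_key, value) pairs and return groups whose measured
    values disagree by more than `tolerance`."""
    by_structure = defaultdict(list)
    for key, value in records:
        by_structure[key].append(value)
    return {k: vals for k, vals in by_structure.items()
            if max(vals) - min(vals) > tolerance}

records = [
    ("KEY-A", 1.230), ("KEY-A", 1.235),  # agree within tolerance
    ("KEY-B", 0.500), ("KEY-B", 2.100),  # conflict: needs manual review
]
flagged = conflicting_records(records)
```

Records that survive this filter can be merged; flagged groups are candidates for the manual review described in the abstract.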
Development of machine learning-based prediction models for chemical modulators of RXR-alpha
Sunghwan Kim
Presented at the 2018 Research Festival at the National Institutes of Health (NIH) in Bethesda, MD (September 13, 2018).
==== Abstract ====
The retinoid X receptor (RXR) is a nuclear hormone receptor that functions as a transcription factor with roles in development, cell differentiation, metabolism, and cell death. Chemicals that interfere with the RXR signaling pathway may cause adverse effects on human health. In this study, public-domain bioactivity data available in PubChem (https://pubchem.ncbi.nlm.nih.gov) were used to develop machine learning-based prediction models for chemical modulators of RXR-alpha, which is a subtype of RXR that plays a role in metabolic signaling pathways, dermal cysts, cardiac development, insulin sensitization, etc. The models were constructed from quantitative high-throughput screening (qHTS) data from the Tox21 project, using popular supervised machine learning methods (including support vector machine, random forest, neural network, k-nearest neighbors, decision tree, and naïve Bayes). The general applicability of the developed models was evaluated with external data sets from ChEMBL and the NCATS Chemical Genomics Center (NCGC). This study showcases how open data in the public domain can be used to develop prediction models for bioactivity of small molecules.
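As a toy illustration of the supervised-learning setup (stdlib only; the study itself used established implementations of the listed methods on real Tox21 qHTS data, not this code), a minimal k-nearest-neighbors classifier over binary fingerprints might look like:

```python
# Toy k-NN activity classifier over binary fingerprints. The fingerprints
# and labels below are invented; real models use thousands of qHTS records.

def tanimoto(a, b):
    """Tanimoto similarity between two same-length binary fingerprints."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0

def knn_predict(train, fp, k=3):
    """Majority vote over the k most similar training compounds."""
    nearest = sorted(train, key=lambda t: -tanimoto(t[0], fp))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0  # 1 = active modulator, 0 = inactive

train = [
    ([1, 1, 0, 0], 1), ([1, 1, 1, 0], 1), ([1, 0, 0, 0], 1),
    ([0, 0, 1, 1], 0), ([0, 1, 1, 1], 0), ([0, 0, 0, 1], 0),
]
pred = knn_predict(train, [1, 1, 0, 1])  # resembles the active compounds
```

External validation, as with the ChEMBL and NCGC sets in the abstract, amounts to running `knn_predict` on compounds withheld from `train` and comparing predictions with measured outcomes.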
Background of the project and simple use cases of using the Open PHACTS API and KNIME to extract compound, target and indication entities from millions of patent documents and infer meaningful links among them. Open PHACTS Linked Data meeting in Vienna.
The CompTox Chemistry Dashboard was developed by the Environmental Protection Agency’s National Center for Computational Toxicology. This dashboard has been architected in a manner that allows for the deployment of multiple “applications”, both as publicly available databases, and for deployment under the constraints of confidential business information (CBI). The public dashboard provides access to multiple types of data for ~750,000 chemicals. This includes, when available for a chemical substance, physicochemical parameters, toxicity and bioassay data, consumer use and analytical data. Fate, exposure, and hazard calculations can benefit from access to the data aggregation and curation efforts that underpin the public dashboard. Also, regulators can benefit from the integration of their own data within their closed infrastructure environments. This presentation will provide a review of the chemistry dashboard architecture and its present application providing access to data to the research and regulatory communities. We will also review present developments in the area of delivering an application programming interface, web services, and software components for integration into third party applications providing access to the data exposed via the dashboard. This abstract does not reflect U.S. EPA policy.
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open resources
Dr. Haxel Consult
Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
While the raison d'être of patents is Intellectual Property (IP), there is growing awareness of the scientific value of their data content. This is particularly so in medicinal chemistry and associated bioactivity domains, where disclosed compounds and associated data not only exceed what is published in papers several-fold and surface years earlier, but are also, paradoxically, completely open (i.e. no paywalls). Scientists have traditionally extracted their own relationships or used commercial sources, but the last few years have seen a “big bang” in patent extractions submitted to open databases, including nearly 20 million structures now in PubChem.
This tutorial will:
Outline the statistics of patent chemistry in various open sources
Introduce a spectrum of open resources and tools
Enable an understanding of target identification, bioactivity and SAR extraction from patents and connecting these relationships to papers
Cover aspects of medicinal chemistry patent mining
Include hands-on exercises using open source antimalarial research as examples
The focus will be on public databases and patent office portals, since these can be transparently demonstrated. However, the essential complementarity with commercial resources will be touched on. Those engaged in Competitive Intelligence will also find the material relevant.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and as a result the dashboard surfaces hundreds of thousands of data points. Other data includes experimental and predicted physicochemical property data, in vitro bioassay data for over 4000 chemicals and 2000 assays, and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include an interactive read-across module, real-time physicochemical and toxicity endpoint prediction and an integrated search to PubMed. This presentation will provide an overview of the latest release of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
FAIR Data and Model Management for Systems Biology (and SOPs too!)
Carole Goble
MultiScale Biology Network Springboard meeting, Nottingham, UK, 1 June 2015
FAIR Data and model management for Systems Biology
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is, the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Yes, data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management, retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes for pre- and post-publication. And the multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Data and model management for the Systems Biology community is a multi-faceted challenge, including: the development and adoption of appropriate community standards (and the navigation of the standards maze); the sustaining of international public archives capable of servicing quantitative biology; and the development of the necessary tools and know-how for researchers within their own institutes so that they can steward their assets in a sustainable, coherent and credited manner while minimising burden and maximising personal benefit.
The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has grown out of several efforts in European programmes (the SysMO and ERASysAPP ERA-Nets and the ISBE ESFRI) and national initiatives (de.NBI, the German Virtual Liver Network, SystemsX, UK SynBio centres). It aims to support Systems Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Sys Bio, highlighting the challenges multi-scale biology presents.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Overview of the SureChEMBL system and web interface.
https://www.surechembl.org/search/
SureChEMBL is a freely available web resource for chemistry patent searching. It is based on a fully automatic and dynamic text and image mining pipeline.
ChemSpider is being built with the intention of being a chemical structure-centric community for chemists. With over 16 million chemical structures as of August 2007, and with data deposition and curation mechanisms in place for text, structures and spectra, ChemSpider intends to be a meeting place and collaborative environment for chemists to work together.
The US EPA’s CompTox Chemistry Dashboard provides access to various types of data associated with ~760,000 chemical substances. These data include experimental and predicted property data, high-throughput screening assay data and hazard and environmental exposure data. With millions of individual data points and annotations associated with hundreds of thousands of chemicals, data quality is a priority. With tens of thousands of individual users per month browsing the data on the dashboard, the ability of users to provide feedback has allowed us to identify, confirm and address issues in the data. This has required the implementation of novel approaches for data feedback via the user interface, ranging from general feedback on the dashboard down to individual data points contained in a table. We are presently investigating ways to garner feedback on our ToxCast bioassay data to facilitate the curation of tens of thousands of data points. This presentation will provide an overview of our existing capabilities in the CompTox Chemistry Dashboard for gathering crowdsourced data from the user base and its impact on assisting in the curation of data.
This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Non-targeted analysis (NTA) uses high-resolution mass spectrometry to better understand the identity of a wide variety of chemicals present in environmental samples (and other matrices). However, data processing remains challenging due to the vast number of chemicals detected in samples, software and computational requirements of data processing, and inherent uncertainty in confidently identifying chemicals from candidate lists. Analysis of the resultant mass spectrometry information relies on cheminformatics to identify and rank chemicals and the US EPA has developed functionality within the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) to address challenges related to this analysis. These tools include the generation of “MS-Ready” structures to optimize database searching, retention time prediction for candidate reduction, consensus ranking using chemical metadata, and in silico MS/MS fragmentation prediction for spectral matching. Combining these tools into a comprehensive workflow improves certainty in candidate identification. This presentation will review how the CompTox Chemicals Dashboard via its flexible search capabilities, rich data for ~900,000 chemical substances, and visualization approaches within this open chemistry resource provides a freely available software tool to support structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
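One building block of the mass-based candidate searching described above is computing a monoisotopic mass from a molecular formula, so that measured masses can be matched against candidate structures. A minimal sketch (element table abbreviated for illustration; production code covers the full periodic table and adduct handling):

```python
# Compute a monoisotopic mass from a simple Hill-order molecular formula.
import re

MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074,
                "O": 15.994915, "S": 31.972071, "Cl": 34.968853}

def monoisotopic_mass(formula):
    """Parse a formula like 'C8H10N4O2' and sum the element masses."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass

mass = monoisotopic_mass("C8H10N4O2")  # caffeine, ~194.0804 Da
```

Matching then reduces to comparing the measured mass (after adduct correction) against such computed values within an instrument-dependent tolerance, as in the dashboard's formula and mass search.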
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL
ChemAxon
SureChEMBL is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that annotates, extracts and indexes chemistry from full-text patent documents by means of continuous, automated text and image mining. SureChEMBL is perhaps the only open, freely available, live patent chemistry resource in a field that has traditionally been commercial.
Since its launch last September, the SureChEMBL interface has provided sophisticated keyword and chemistry-based querying and exporting functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline leverage ChemAxon technologies extensively for name-to-structure conversion, as well as compound standardisation, registration and searching.
In addition to providing an overview of the system, recent developments and improvements will be described. These include the introduction of various data interchange and exporting options, such as flat files and a data feed client. Our future plans for the SureChEMBL system will also be outlined. To date, such plans include complementing the chemical annotations with biological ones, covering genes, proteins, diseases and indications, and enriching the chemical annotations with a relevance score indicating their importance in the patent document.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and as a result the dashboard surfaces hundreds of thousands of data points. Other data includes experimental and predicted physicochemical property data, in vitro bioassay data and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include real-time physicochemical and toxicity endpoint prediction and an integrated search to PubMed. This presentation will provide an overview of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
This presentation was made to the University of North Carolina in Chapel Hill on 9/20/21. It gave a general introduction to cheminformatics before covering how to navigate the Dashboard.
• An introduction to the dashboard
• Substances vs structures
• Structure formats for data exchange and connectivity (SMILES, InChIs, molfiles)
• Identifiers – CASRN, chemical names, systematic names
• Data curation approaches: substance-structure ambiguity
• ChemReg: substance registration
• Data gathering for systematic reviews
• Curated lists
• Properties/Fate and Transport
• Access to Exposure Data
• Hazard data in the dashboard – ToxVal data (sourced from >40 databases, >50,000 chemicals, >900,000 data points)
• The Executive Summary of data
• Single chemical searches vs Batch searches
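One curation concept from the list above, substance-structure ambiguity, can be illustrated with a small sketch (all identifiers below are invented placeholders, not real CASRN/structure assignments):

```python
# Flag registry numbers that have been associated with more than one
# structure key (e.g. an InChIKey) across source databases.
from collections import defaultdict

def ambiguous_casrns(assignments):
    """Map each CASRN to its distinct structure keys; return ambiguous ones."""
    seen = defaultdict(set)
    for casrn, structure_key in assignments:
        seen[casrn].add(structure_key)
    return {c: keys for c, keys in seen.items() if len(keys) > 1}

assignments = [
    ("0000-00-1", "KEY-A"), ("0000-00-1", "KEY-A"),  # sources agree
    ("0000-00-2", "KEY-B"), ("0000-00-2", "KEY-C"),  # conflicting sources
]
flagged = ambiguous_casrns(assignments)
```

Flagged identifiers are exactly the cases where a registration system such as ChemReg needs a curator to decide which substance-structure mapping is authoritative.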
Materials Data Facility as Community Database to Share Nano-manufacturing Rec...
Globus
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Ben Galewsky from the National Center for Supercomputing Applications (NCSA).
At a time when the data explosion has simply been redefined as “Big”, the hurdles associated with building a subject-specific data repository for chemistry are daunting. Combining a multitude of non-standard data formats for chemicals, related properties, reactions, spectra etc., together with the confusion of licensing and embargoing, and providing for data exchange and integration with services and platforms external to the repository, the challenge is significant. All this at a time when semantic technologies are touted as the fundamental technology to enhance integration and discoverability. Funding agencies are demanding change, especially a change towards access to open data to parallel their expectations around Open Access publishing. The Royal Society of Chemistry has been funded by the Engineering and Physical Sciences Research Council of the UK to deliver a “chemical database service” for UK scientists. This presentation will provide an overview of the challenges associated with this project and our progress in delivering a chemistry repository capable of handling the complex data types associated with chemistry. The benefits of such a repository in terms of providing data to develop prediction models to further enable scientific discovery will be discussed, and the potential impact on the future of scientific publishing will also be examined.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program integrates advances in biology, chemistry, exposure and computer science to help prioritize chemicals for further research based on potential human health risks. This work involves computational and data-driven approaches that integrate chemistry, exposure and biological data. As an outcome of these efforts, the National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, legacy in vivo animal data, consumer use and production information, exposure models and chemical structure databases with associated properties. A series of software applications and databases have been produced over the past decade to deliver these data, but recent developments have focused on a new software architecture that assembles the resources into a single platform. Our web application, the CompTox Chemistry Dashboard, provides access to data associated with ~750,000 chemical substances. These data include experimental and predicted physicochemical property data, bioassay screening data associated with the ToxCast program, product and functional use information and a myriad of related data of value to environmental scientists.
The open patent chemistry “big bang”: Implications, opportunities and caveats
Dr. Haxel Consult
In 2012, after the first IBM deposition, few would have predicted that PubChem compounds that included patent-extracted structures would exceed 20 million within three years (i.e. 30% of the total). The current major open patent chemistry submitters (in size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and SureChEMBL. This “big bang” has a range of utilities and implications. Firstly, pharmaceutical companies must now integrate their exploitation of both public and commercial patent chemistry because capture is divergent. Secondly, the academic community and small companies can now patent-mine extensively without commercial sources. Thirdly, first-filings of most lead series and clinical candidates can now be tracked. Fourthly, drug targets in ChEMBL can be intersected with Structure Activity Relationship (SAR) data sets from patents, some of which are now target-mapped in other databases (doi:10.1016/j.ddtec.2014.12.001). However, while this patent chemistry “big bang” is generally welcomed by database users, there are significant caveats. In particular, both automated and manual extraction bring in a variety of artefacts that add confounding structural “noise”. These include a) permutations of mixtures and chiral exemplifications, b) virtual structures (including isotopic analogues of approved drugs), c) an emerging trend of vendor “patent picking” for non-stocked compounds, d) the fact that 85% of public patent chemistry has no biological data links, and e) the fact that extractions from documents do not directly indicate IP status. These problems and some partial solutions will be discussed.
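As a sketch of how such structural “noise” might be triaged automatically (a crude string heuristic for illustration, not the method used by any of the databases named above):

```python
# Flag extracted SMILES that look like mixtures/salts (dot-disconnected
# components) or carry isotope labels, two of the noise classes discussed.

def noise_flags(smiles):
    """Return a list of simple noise indicators for one extracted SMILES."""
    flags = []
    if "." in smiles:
        flags.append("mixture/salt")  # dot-disconnected components
    if any(i > 0 and ch.isdigit() and smiles[i - 1] == "["
           for i, ch in enumerate(smiles)):
        flags.append("isotope-labelled")  # digit right after '[', e.g. [2H]
    return flags

clean = noise_flags("CCO")                    # ethanol: no flags
salt = noise_flags("CCO.Cl")                  # HCl salt form
deuterated = noise_flags("[2H]C([2H])([2H])O")  # labelled analogue
```

Real pipelines would use a cheminformatics toolkit for this, but even heuristics like these can segregate the bulk of mixture permutations and isotopic analogues for separate handling.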
Quality and noise in big chemistry databases
Chris Southan
Presented at the Aug 2019 ACS meeting by Antony Williams. Abstract: The internet has changed the way we access chemistry data, as well as providing access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with structure counts for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility, and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of some of the noise in the larger databases, their value becomes highly dependent on the specific application; an example is using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem.
The US EPA’s National Center for Computational Toxicology (NCCT) has been both measuring and aggregating data to support our research efforts for over a decade. We have delivered these data via a number of publicly accessible websites, so-called dashboards, to provide transparent access to the outputs of the center. Since the inception of our research, software technologies have changed dramatically, as have expectations regarding the methods by which to access data. Our informatics efforts provide access to millions of dollars of high-throughput screening data in open, downloadable formats, via web services and through a rich web interface. Similarly, we provide access to experimental and predicted data associated with ~760,000 substances to serve the environmental chemistry community, and open source code for predictive models. This presentation will provide an overview of the efforts of NCCT to provide transparent access to our research and data via our publications (and accompanying supplementary data), via our Open Data policies, and through our databases, software tools and web services. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Chemical Databases and Open Chemistry on the Desktop (Marcus Hanwell)
The modern chemist has access to large databases containing both experimental and calculated data. The power of HPC resources continues to increase, with more practitioners having routine access to powerful computational chemistry tools. This places an increasingly high burden on users to assimilate these resources into their workflows in order to use them effectively. The creation of an open, extensible application framework that puts computational tools, data, and domain-specific knowledge at the fingertips of chemists is increasingly important. A data-centric approach to chemistry, storing all data in a searchable database, will empower users to efficiently collaborate, innovate, and push the frontiers of research. Providing an open, user-friendly and extensible application will open up new tools to experimental chemists, while giving computational chemists the ability to address greater challenges. Additionally, by distributing experimental and computational data across the research community, incorporating cheminformatics analytics techniques, and providing visual search for chemical structures, the workflow of both groups can be significantly improved. This requires suitable data formats for data exchange, and databases with appropriate APIs for querying and uploading data in order to share effectively. This talk will discuss recent progress made in developing a suite of open chemistry applications on the desktop. The applications can query online databases, such as the NIH structure resolver service, download and manipulate structures, and prepare input files for standalone computational chemistry codes. Another application developed to submit jobs and to monitor and retrieve results from HPC resources will also be shown, along with a desktop chemistry database browser. The Quixote project aims to establish standards for data exchange in computational chemistry, along with data repositories for organizations.
Establishing these standards is important to promote open, reproducible chemistry, and their integration into user-friendly desktop applications will promote their integration in the standard workflow of researchers.
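As a sketch of the kind of online query the desktop applications above perform, the NIH/NCI structure resolver (“CACTUS”) exposes a simple REST-style URL scheme. The identifier "aspirin" and the helper function below are illustrative assumptions based on the resolver's published URL pattern, not details from the talk:

```python
from urllib.parse import quote

# NCI/CADD chemical identifier resolver base URL; the path pattern
# /chemical/structure/<identifier>/<representation> follows its
# documented URL scheme.
CACTUS_BASE = "https://cactus.nci.nih.gov/chemical/structure"

def resolver_url(identifier, representation="smiles"):
    """Build a resolver URL for a chemical name, InChIKey, CAS number, etc."""
    return f"{CACTUS_BASE}/{quote(identifier)}/{representation}"

print(resolver_url("aspirin"))
# https://cactus.nci.nih.gov/chemical/structure/aspirin/smiles
```

Fetching that URL (e.g. with urllib.request) returns the requested representation as plain text, which an application can then hand to a structure editor or a computational chemistry input generator.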
PubChem: a public chemical information resource for big data chemistry (Sunghwan Kim)
Presented at the Joint Statistical Meetings (JSM) 2020 (virtual) on August 3, 2020.
==== Abstract ====
The idea of “big data” has recently been drawing much attention from the scientific community as well as the general public. An example of big data in chemistry is the data contained in PubChem, a public database of chemical substance descriptions and their biological activities at the National Institutes of Health. PubChem is a sizeable system with 235 million depositor-provided substance descriptions, 96 million unique chemical structures, 1.1 million biological assays, and 268 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents and more. PubChem resources have been used in many studies for developing bioactivity and toxicity prediction models, discovering multi-target ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). This presentation provides an overview of how PubChem’s data, tools, and services can be used for bioassay data analysis and virtual screening (VS) and discusses important aspects of exploiting PubChem for drug discovery.
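Programmatic access of the kind the presentation describes typically goes through PubChem's PUG REST interface. The sketch below only composes a request URL following the published input/operation/output pattern; CID 2244 and the property list are illustrative, and no network call is made:

```python
# PubChem PUG REST requests follow the pattern
# <base>/<input specification>/<operation>/<output format>.
PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def cid_property_url(cid, properties, fmt="JSON"):
    """URL requesting selected computed properties for one compound CID."""
    return f"{PUG_BASE}/compound/cid/{cid}/property/{','.join(properties)}/{fmt}"

url = cid_property_url(2244, ["MolecularFormula", "MolecularWeight"])
print(url)
```

The same base pattern covers substance, assay and structure-search inputs, which is what makes scripted mining of the inter-relationships mentioned above practical.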
This presentation was given at a Triangle Area Mass Spectrometry meeting on 01/29/2019 in Research Triangle Park, North Carolina to provide a general overview of the CompTox Chemicals Dashboard to an audience of mass spectrometrists and people interested in the capabilities of the dashboard for chemical forensics, structure identification, etc.
The Center for Computational Toxicology and Exposure (CCTE) is part of the Office of Research and Development at the US Environmental Protection Agency. As part of its mission the center delivers access to chemicals related data via web-based freely accessible online Dashboards to disseminate data generated within the center as well as harvested and integrated from open databases around the world. The CompTox Chemicals Dashboard (available at https://comptox.epa.gov/dashboard) provides access to >1.2 million chemicals and associated data including experimental and predicted property data, in vivo hazard data, in vitro bioactivity data, exposure data, and various other data types. The curation of the chemicals dataset has included the development of over 400 segregated lists of chemicals that represent specific research areas of interest including disinfectant by-products, per- and polyfluoroalkyl substances (PFAS), extractables and leachables, and chemicals of emerging concern. The chemicals collection, the associated data, the lists and searches for mass and formulae makes the Dashboard an ideal foundation technology to support our colleagues working in the field of mass spectrometry, especially in targeted and non-targeted analysis. This presentation will provide an overview of the Dashboard, its value to the community in terms of providing access to the integrated and highly curated data, and its utility to support researchers in the field of mass spectrometry. New proof-of-concept projects will also be introduced including the development of a cheminformatics enabled methods database. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
2010 CASCON - Towards an integrated network of data and services for the life sciences (Michel Dumontier)
Modern biological knowledge discovery requires access to machine-understandable data that can be searched, retrieved, and subsequently analyzed using a wide array of analytical software and services. The Semantic Automated Discovery and Integration (SADI) framework is a set of conventions to formalize web service inputs and outputs using OWL ontologies that enable the automatic discovery and invocation of Semantic Web services. In this talk, I will walk through a worked example in the design and deployment of chemical semantic web services using the Chemistry Development Kit, chemical descriptors from the Chemical Information Ontology (CHEMINF), and the Semanticscience Integrated Ontology (SIO) as a unifying, upper-level ontology of basic types and relations. I will discuss how one can make use of the SADI-enabled SHARE client to reason about data obtained from Bio2RDF, the largest linked open data project, and automatically invoke chemical semantic web services to determine a chemical's drug-likeness. If you want to see the potential of the Semantic Web being realized, this talk is for you.
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database (Nathan Olson)
"Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database" presentation at the Standards for Pathogen Identification via NGS (SPIN) workshop hosted by the National Institute of Standards and Technology, October 2014, by Heike Sichtig, PhD, from the FDA and Luke Tallon from IGS UMSOM.
At a time when the data explosion has simply been redefined as “Big”, the hurdles associated with building a subject-specific data repository for chemistry are daunting. Combining a multitude of non-standard data formats for chemicals, related properties, reactions, spectra etc., together with the confusion of licensing and embargoing, and providing for data exchange and integration with services and platforms external to the repository, the challenge is significant. All this at a time when semantic technologies are touted as the fundamental technology to enhance integration and discoverability. Funding agencies are demanding change, especially a change towards access to open data to parallel their expectations around Open Access publishing. The Royal Society of Chemistry has been funded by the Engineering and Physical Sciences Research Council of the UK to deliver a “chemical database service” for UK scientists. This presentation will provide an overview of the challenges associated with this project and our progress in delivering a chemistry repository capable of handling the complex data types associated with chemistry. The benefits of such a repository in terms of providing data to develop prediction models to further enable scientific discovery will be discussed, and the potential impact on the future of scientific publishing will also be examined.
In recent years, the growth of scientific data and the increasing need for data sharing and collaboration in the field of environmental chemistry have led to the creation of various software and databases that facilitate research and development into the safety and toxicity of chemicals. The US EPA Center for Computational Toxicology and Exposure has been developing software and databases that serve the chemistry community for many years. This presentation will focus on several web-based software applications which have been developed at the US EPA and made available to the community. While the primary software application from the Center is the CompTox Chemicals Dashboard, almost a dozen proof-of-concept applications have been built serving various capabilities. The publicly accessible Cheminformatics Modules (https://www.epa.gov/chemicalresearch/cheminformatics) provide access to six individual modules allowing hazard comparison for sets of chemicals, structure-substructure-similarity searching, structure alerts, and batch QSAR prediction of both physicochemical and toxicity endpoints. A number of other applications in development include a chemical transformations database (ChET) and a database of analytical methods and open mass spectral data (AMOS). Each of these depends on the underlying DSSTox chemicals database, a rich source of chemistry data for over 1.2 million chemical substances. I will provide an overview of all tools in development and the integrated nature of the applications based on the underlying chemistry data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Presented to David Gloriam's Group, Copenhagen, Feb 2020
**********************************
The theme will be presented from the perspective of both past involvement in peptide curation in the Guide to Pharmacology (GtoPdb) and current searching for bioactive peptides in the wider ecosystem that includes ChEMBL and PubChem. The core problem is that peptides hang in a limbo land between bioinformatics (BLAST) and cheminformatics (Tanimoto), neither of which provides optimal searching. Curating peptides in GtoPdb presents many challenges, including mapping endogenous peptides to Swiss-Prot cleavage annotations. For synthetic peptides, equivocal specification of modifications and exact positions of radiolabels are also problematic. However, target-mapped, citation-supported quantitative binding parameters are curated where possible. For those peptides falling below the PubChem CID SMILES limit of approximately 70 residues, GtoPdb has been using Sugar and Splice from NextMove Software to convert them into CIDs. Specific problems associated with finding bioactive peptides in databases will be outlined.
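The Tanimoto side of that limbo can be illustrated with a toy calculation: cheminformatics similarity is typically the Tanimoto (Jaccard) coefficient over fingerprint feature sets. The dipeptide-style features below are invented for illustration only; real toolkits derive thousands of features from the full structure, which is why two long peptides differing in a single residue score as near-identical:

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two feature sets."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)

# Hypothetical substructure-feature sets for two related short peptides.
pep1 = {"ala-gly", "gly-ser", "ser-lys", "lys-arg"}
pep2 = {"ala-gly", "gly-ser", "ser-lys", "lys-his"}
print(round(tanimoto(pep1, pep2), 2))  # 0.6
```

A BLAST alignment would localise the single differing residue exactly; the fingerprint comparison only reports overall overlap, which is the crux of the peptide search problem described above.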
Vicissitudes of target validation for BACE1 and BACE2 (Chris Southan)
Introduction/Background & Aims
The beta-site amyloid precursor protein (APP) cleaving enzyme (BACE1) was implicated as a drug target for Alzheimer's Disease (AD) back in 1999. In 2011, the paralogue, BACE2, became a newly proposed target for type II diabetes (T2DM), having been reported to be the TMEM27 secretase regulating pancreatic beta-cell function [1]. By 2019 the accumulated evidence, including a swathe of failed clinical trials for BACE1 inhibitors, had produced a de facto de-validation of both targets in both diseases. As a learning exercise, the series of events leading up to this is reviewed here.
Method/Summary of work
Basic information about these two targets and the lead compounds against them were sourced via the IUPHAR/BPS Guide to Pharmacology (GtoPdb) as Target ids: 2330 and 2331, for BACE1 and 2, respectively. This was consolidated by a literature and patent review as well as following them in other databases. The most recent information on clinical trials was sourced from press releases.
Results/Discussion
GtoPdb annotates 24 lead compounds against BACE1 and 12 against BACE2. The corresponding counts mapped to these targets in ChEMBL are 8741 and 1377, making BACE1 one of the most actively pursued enzyme targets ever. Notwithstanding this massive global effort, during 2018 Merck’s verubecestat and J&J’s atabecestat BACE1 inhibitors not only failed their Phase III endpoints but even appeared to worsen cognition in prodromal patients. In 2019 Amgen/Novartis stopped Phase II/III trials of umibecestat, which also showed more cognitive decline in the treatment group compared to controls. BACE2 presented an anomalous situation in several ways. By 2016 both Novartis and Amgen had declared their inability to reproduce the TMEM27 secretase turnover reported in 2011. Notwithstanding this, Novartis and other companies have published patents on BACE2-specific inhibitors over several years and, paradoxically, verubecestat is more potent against BACE2 than BACE1 but was never tested for glucose-lowering. Equally puzzling is that one academic group is still publishing BACE2 inhibitors for T2D even post de-validation. One thing both targets have in common is the complete absence of genetic support from genome-wide disease association studies, but this warning sign went unheeded.
Conclusions
The massive waste of resources on the pursuit of BACE1 as an AD target over the last two decades is catastrophic. This tale of de-validation is compounded for this paralogous pair of enzymes by the fact that the original evidence for BACE2 as a T2D target was eventually refuted. The story of these targets highlights a range of crucial pharmacological pitfalls that must be avoided in the future.
Reference(s)
[1] Southan C, Hancock J.M. (2013) A tale of two drug targets: the evolutionary history of BACE1 and BACE2. Front Genet. 4:293.
In silico 360 Analysis for Drug Development (Chris Southan)
Introduction:
Consequent to a memorandum of understanding between the Karolinska Institutet and the International Union of Basic and Clinical Pharmacology (IUPHAR) in 2018, a report on academic drug development, including guidelines (ADEV), has been drafted [1]. As part of this exercise, we conceived a triage for comprehensive informatics profiling around the compound-target-disease axis. We have termed this “in silico 360” (INS360), the aim of which is to support ADEV teams, since they may lack either the internal expertise or the external support to do this on their own. Indeed, some past SciLifeLab Drug Discovery and Development Platform projects had been halted because of overlooked competitive impingements or insufficient target validation evidence.
Methods
We assessed the current database landscape, mostly public but also including commercial resources, for potential utility for INS360. We were guided primarily by content coverage, usability, and reputation. We also explored some open property prediction resources for assay interference and toxicological inferences.
Results:
As a first-stop-shop, we selected the IUPHAR/BPS Guide to PHARMACOLOGY with ~900 ligand-target relationships captured via expert curation of journal papers. Moving up in scale, we evaluated ChEMBL at 1.8 million compounds with 1.1 million assay descriptions and 7,000 targets. With yet another jump we could search the patent corpus with 18 million extracted compounds in SureChEMBL. We explored PubChem, which integrates these three with over 500 other sources linked to 96 million compounds, BioAssay results and connectivity into the NCBI Entrez system. The final jump in scale for document-to-chemistry navigation was represented by SciFinder with 155 million structures. On the target side, 360-exploration needs to encompass literature, structure, genetic variation, splicing, interactions, and disease pathways. From their UniProt links, both GtoPdb and ChEMBL provide these entry points. Navigating genetic association data in support of target validation was enabled by the OpenTargets portal and the GWAS Catalog. We also found servers that could produce prediction scores from chemical structures for a range of features important for de-risking development.
Conclusion:
This work scoped out initial resource choices for the INS360. We propose that not only ADEV operations but essentially any pharmacology research team has much to gain from this approach and many potential pitfalls can consequently be avoided when approaching key checkpoints, such as preparing a publication. However, support may be needed for both institutions and teams to get the best out of these complex and feature-rich databases.
[1] Southan C, (2019) Towards Academic Drug Development Guidelines, ChemRxiv pre-print no. 8869574
Will the correct BACE ORFs please stand up? (Chris Southan)
BACE1 and BACE2 are protease targets for Alzheimer's and diabetes, respectively, but their validation is now questioned
Phylogenetic analysis can add functional insights
This came up against two key problems:
A surprising prevalence of incorrect protein sequences predicted from genomes
Many BACE1 and BACE2 orthologues had truncation and/or indel errors.
Key phylogenetic representative genomes are languishing in an unfinished state
Some options for amelioration of these problems will be described
An update on the evolution of these enzymes will be shown
Look for new and potentially useful human 5HT2A-directed small molecule chemistry surfaced since the last meeting; check for compounds with 5HT2A as the primary target but also combined inhibitors; poll the key databases, literature and patents. Searching challenges arise from synonym soup, complex cross-reactivities (see PMID 29679900), in vitro data gaps and in vivo polypharmacology.
Looking at chemistry - protein - papers connectivity in ELIXIR (Chris Southan)
This is a poster for the UK ELIXIR meeting in Birmingham, UK, Nov 2018. It is the summary of a blog post https://cdsouthan.blogspot.com/2018/08/an-initial-look-at-elixir-chemistry.html that assesses chemistry <> protein <> papers connectivity (C-P-P) for five ELIXIR resources
Poster for the World Congress of Pharmacology 2018, Kyoto
Introduction: The pharmacological literature and patents connect compound structures to their bioactivity. However, entombing these relationships for millions of compounds among millions of PDFs is acknowledged as massively problematic. The situation is ameliorated by resources that extract the entity and data relationships the authors and inventors put “in” to their PDFs back “out” into structured database records. The IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb) has been doing this by stringent curation of ligands and their quantitative activity against protein targets [1]. Our citations are submitted to PubChem (PC), who then link to PubMed (PM) [2]. This study presents an overview of this connectivity.
Methods: For GtoPdb entries in PC Substance we used the PC interface to count our submitted PM links. This gives the PC > PM mapping counts from which we analysed the PM links. We then performed reciprocal analyses (i.e. PM > PC) by selecting PM sets. We then compared two journals by counting structure links by year and source.
Results: From 8988 GtoPdb-submitted ligand substances in PC (release 2017.5), 7309 are linked to 8980 PM entries. Of the 7309 there are 5632 links to chemical structures in PC, the rest being antibodies and larger peptides. From the 8980 PMIDs, the Journal of Medicinal Chemistry (JMC) accounted for 1003 as our most frequently cited primary source of structure-to-activity mappings. For the British Journal of Pharmacology (BJP) most of the 345 cross-references were development compounds. Further analysis showed that from 2014 to 2017 the BJP-to-PC links of ~30 structures per year are mostly from GtoPdb and the Comparative Toxicogenomics Database. However, going back to 2010-12, this increased to 500-800 connections, mainly derived from the IBM automated chemical extraction from abstracts. A similar pattern was observed for JMC.
Conclusion: Navigation between documents and databases is an essential competence for pharmacologists and drug discovery but the NCBI Entrez system is daunting. GtoPdb is a major contributor of high-quality links and provides a first-stop to guide users into the PC/PM systems. However, our results indicated potentially serious specificity issues with automated chemistry-to-journal linking from non-GtoPdb sources.
References: [1] Harding et al. (2018). Nucl. Acids Res. 46 (Database Issue), doi: 10.1093/nar/gkx1121.
Seminar on U.V. Spectroscopy (Samir Panda)
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-visible spectral region.
Ultraviolet-visible spectroscopy is an analytical method that measures the amount of light absorbed by the analyte.
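The measurement rests on the Beer-Lambert law, A = log10(I0/I) = εlc. A minimal worked example follows; the transmittance and molar absorptivity values are illustrative, not from any real analyte:

```python
import math

def absorbance(transmittance):
    """Absorbance A from fractional transmittance I/I0."""
    return -math.log10(transmittance)

def concentration(a, epsilon, path_cm=1.0):
    """Molar concentration c = A / (epsilon * l) for a given path length."""
    return a / (epsilon * path_cm)

a = absorbance(0.10)                   # 10% transmitted -> A = 1.0
c = concentration(a, epsilon=15000.0)  # mol/L for an assumed epsilon
```

Because absorbance is linear in concentration (within the law's validity range), a calibration series of known standards lets an unknown concentration be read off directly.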
A brief overview of the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... (Ana Luísa Pinho)
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their capacity to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
This PDF is about schizophrenia.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... (Sérgio Sacani)
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high-resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, and particulates.
1. An overview of connectivity between
documents, structures and bioactivity
Christopher Southan
Presented at University of Copenhagen, Feb 2020
Host: David Gloriam
3. The chemistry < - > biology join
• Chemistry “C” with significant bioactivity in vitro, in cellulo, in vivo, in clinico
• Applicability to drug discovery, pharmacology, chemical biology and enzymology
• Majority of primary quantitative data in papers and patent documents
• Does not cover all nuances and complexities of molecular mechanisms of action, e.g.
– indirect or complex targets, prodrugs, cellular assays, covalent inhibitors, activators
D – A – R – C – P
5. The “lost connectivity” problem
"We have spent millions putting chemistry into PDFs but
now we are spending millions more taking it back out”
(Anon)
Rough estimates of 50+ years of public legacy DARCP:
• “D” ~ 200K papers, ~50K patents
• “C” ~ 5 million structures
• “P” ~ 4000 human proteins, ~ 2000 other species
• Only a small proportion captured as DARCP in open sources
• Quality would be a key issue if “everything” was extracted
7. Unsung Heroes
Impediments to artisanal extraction of DARCP by Biocurators
• Entity disambiguation
• Unintentional obfuscation and errors by journal authors
• Occasional deliberate obfuscation by patent applicants
• Activity parameters (IC50, EC50, Ki, Kd) can have ~10-fold variation between publications for nominally the same assays
• Judging the reproducibility of the publications selected for extraction
• Variable publisher guidelines for entity specification and reporting standards
• Chemical structures often image-only
• Key data buried in supplementary data
• Limited author awareness of assay and target ontologies or gene naming
• Poor sustainability of funding and career structures
9. Commercial biocuration of DARCP
Excelra (formerly GVK BIO)
GOSTAR stats from 2015
• 1.3 million cpds from 112K papers (~15 per paper)
• 3.5 million cpds from 70K patents (~50 per patent)
• 3,882 human targets
• 9 million bioactivities
25. PubChem large-scale C-D submissions
• Generally a good thing (inc. 3 million patents) but with caveats
• Difficult to identify “aboutness” of key compounds
• Issues with indexing of non-PubMed, DOI-only journal papers
• Quality issues of automated CNER chemistry extraction
• Introduces parallel (//) c2d mappings into PubChem
• Massive ‘futile indexing’ of common chemistry
26. Automated entity look-ups on the fly from documents (including C-D)
• Being pushed in Europe PMC via EBI database look-ups
• PubMed/PubMed Central via NCBI databases
• Can be a gateway to DARCP but specificity caveats
31. Rounding off: so where do we go from here in terms of open DARCP capture?
32. Will this make a difference?
• This should increase the flow of A,R,C,P from D into repositories
• However, whether this will also extend to D-A-R-C-P flowing into major databases such as PubChem remains unclear
33. Proposed solution, but a counsel of perfection
“Mandating authors to explicitly connect their own DARCLP results
in a form that is FAIR, extrinsic to PDF, structured, with metadata,
machine readable, ontologised, transferable to open database
records and reciprocally linked to publications” (Southan 2019)
• Authors should become their own biocurators
• Has been technically feasible for over a decade
• Even in 2020 not one single journal insists on authors providing machine-readable DARCLP to flow into PubChem BioAssay
• Impediments include sociological factors and publishing models
34. Conclusions
• The bioscience community (including big data miners) still has its collective feet nailed to the floor by the five decades of valuable DARCP entombed behind firewalls and buried in patents
• Biocuration makes a crucial contribution but is limited in scale
• Automated extraction is advancing (e.g. via NLP) but is way behind the specificity of expert biocuration
• The existence of parallel (//) document <> chemistry systems (e.g. MeSH, IBM, SureChEMBL, Springer Nature, Thieme, Wikidata) in PubChem, and look-ups in EPMC, is enabling but also confusing
• The spread of Open Science ELNs is good to see, but findability, searchability and database submissions still need to be optimised
• The need remains to facilitate a flow of published (inc. preprints) author-specified bioactive chemistry direct to databases (even if the papers are FAIR)
37. Reciprocal links > virtuous circles (II)
• GtoMdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
38. Reciprocal links > virtuous circles (I)
• GtoPdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
47. Disinterment from the PDF tomb (I)
Image extraction > structure
• Real chemists sketch images in a jiffy
• The rest of us can use OSRA: Optical Structure Recognition Application
Editor's Notes
The simplest of starting points, at least the press release had a structure diagram
OSRA provides good starting points to edit and get SMILES
The structure does not have to be exactly right because a database similarity match is OK to see what it should have been