Progress in drug discovery and chemical biology is greatly enabled by curated document-assay-result-compound-protein (D-A-R-C-P) relationships in open databases from resources such as the Guide to Pharmacology and ChEMBL. These are synergistically integrated into PubChem, which pre-computes chemical similarity and connectivity between over 95 million structures and 5.6 million BioAssay results. It also links chemistry to documents via various additional routes, including MeSH and large-scale submissions from publishers. However, these efforts are patchy and very few journals facilitate such connectivity. There thus remains a massive shortfall in public D-A-R-C-P capture from decades of papers and patents. This presentation will cover these aspects and discuss their partial amelioration by options such as author-driven depositions and open lab-book approaches as used by Open Source Malaria.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program integrates advances in biology, chemistry, and computer science to help prioritize chemicals for further research based on potential human health risks. This work involves computational and data-driven approaches that integrate chemistry, exposure and biological data. As an outcome of these efforts, the National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. A series of software applications and databases has been produced over the past decade to deliver these data, but recent efforts have focused on a new software architecture that assembles the resources into a single platform. A new web application, the CompTox Chemistry Dashboard, provides access to data associated with ~720,000 chemical substances. These data include experimental and predicted physicochemical property data, bioassay screening data associated with the ToxCast program, product and functional use information and a myriad of related data of value to environmental scientists.
The dashboard supports chemical searching by chemical name, synonym and CAS Registry Number. Flexible search capabilities allow for chemical identification in non-targeted analysis studies using mass spectrometry. Both mass- and formula-based searching rank-order results via functional use statistics, thereby helping to prioritize chemicals for further review when detected in environmental media.
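The mass-based search with functional-use ranking described above can be sketched as follows. This is an illustrative outline only: the candidate records, monoisotopic masses and use counts below are placeholder data, not Dashboard content.

```python
# Illustrative sketch of mass-based candidate lookup with functional-use
# ranking. The records and use counts are hypothetical placeholders.
CANDIDATES = [
    # (name, monoisotopic mass in Da, number of reported functional uses)
    ("bisphenol A",        228.1150, 42),
    ("triphenylphosphine", 262.0911,  3),
    ("benzophenone",       182.0732, 17),
    ("2,4-D",              219.9694,  8),
]

def search_by_mass(query_mass, tolerance=0.005):
    """Return candidates within +/- tolerance Da of the query mass,
    rank-ordered by functional-use count (highest first)."""
    hits = [c for c in CANDIDATES if abs(c[1] - query_mass) <= tolerance]
    return sorted(hits, key=lambda c: c[2], reverse=True)

for name, mass, uses in search_by_mass(228.115):
    print(f"{name}: {mass} Da, {uses} functional uses")
```

The functional-use count serves as a simple prior: among candidates with indistinguishable masses, chemicals with more reported uses are more likely to occur in environmental media.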
This presentation will provide an overview of the CompTox Dashboard, its capabilities for delivering data to the environmental toxicology community and how the architecture provides a foundation for the development of additional applications to support chemical risk assessment. This abstract does not reflect U.S. EPA policy.
Web-based technologies coupled with a drive for improved communication between scientists have resulted in the proliferation of scientific opinion, data and knowledge at an ever-increasing rate. The increasing array of chemistry-related computer-based resources now available provides chemists with a direct path to the discovery of information that was previously accessed via library services and limited to commercial and costly resources. We propose that preclinical absorption, distribution, metabolism, excretion and toxicity data as well as pharmacokinetic properties from studies published in the literature (which use animal or human tissues in vitro or from in vivo studies) are precompetitive in nature and should be freely available on the web. This could be made possible by curating the literature and patents, data donations from pharmaceutical companies and by expanding the currently freely available ChemSpider database of over 21 million molecules with physicochemical properties. This will require linkage to PubMed, PubChem and Wikipedia as well as other frequently used public databases, and mining of the full-text publications to extract the pertinent experimental data. These data will need to be extracted using automated and manual methods, cleaned, and then published to ChemSpider or another database so that they are freely available to the biomedical research and clinical communities. Making these data accessible will improve the development of drug molecules with good ADME/Tox properties, facilitate computational model building for these properties and enable researchers to avoid repeating the failures of past drug discovery studies.
The EPA’s CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is a publicly accessible website providing access to data for ~875,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, product use information extracted from safety data sheets, and integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders, including a Per- and Polyfluoroalkyl Substances (PFAS) list containing thousands of chemicals. A procured testing library of hundreds of PFAS chemicals annotated into chemical categories has also been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and links to the open literature. Several specific search types have been developed to directly support the mass spectrometry non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within the database. This presentation will provide an overview of the dashboard, the ongoing expansion of the PFAS chemical library with associated categorization, and new physicochemical property and environmental fate and transport QSAR prediction models developed for these chemicals. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases, and as a result the dashboard surfaces hundreds of thousands of data points. Other data include experimental and predicted physicochemical property data, in vitro bioassay data and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include real-time physicochemical and toxicity endpoint prediction and an integrated search of PubMed. This presentation will provide an overview of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The increasing popularity of high mass accuracy non-target mass spectrometry methods has prompted extensive identification efforts based on chemical compound databases. Candidate structures are often retrieved by either exact mass or molecular formula from large resources such as PubChem, ChemSpider or the EPA CompTox Chemistry Dashboard. Additional data (e.g. fragmentation, physicochemical properties, reference and data source information) are then used to select potential candidates, depending on the experimental context. However, these strategies require that the substances of interest be present in these compound databases, which is often not the case, as no database can be fully inclusive. A prominent example with clear data gaps is surfactants, which are used in many products in our daily lives yet are often absent as discrete structures in compound databases. Linear alkylbenzene sulfonates (LAS) are a common, high-use and high-priority surfactant class with highly complex transformation behaviour in wastewater. Despite extensive reports in the environmental literature, few of the LAS and none of the related transformation products were present in any compound database during an investigation into Swiss wastewater effluents, despite these forming the most intense signals. The LAS surfactant class will be used to demonstrate how the coupling of environmental observations with high resolution mass spectrometry and detailed literature data (expert knowledge) on the transformation of these species can be used to progressively “fill the gaps” in compound databases. The LAS and their transformation products have been added to the CompTox Chemistry Dashboard (https://comptox.epa.gov/) using a combination of “representative structures” and “related structures” starting from the structural information contained in the literature.
By adding this information into a centralized open resource, future environmental investigations can now profit from the expert knowledge previously scattered throughout the literature. Note: This abstract does not reflect US EPA policy.
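Formula-based candidate retrieval of the kind described above depends on computing monoisotopic masses from molecular formulae. A minimal sketch (element masses from standard isotope tables; handles only simple formulae without parentheses or isotope labels):

```python
import re

# Monoisotopic masses of the most abundant isotopes (standard values,
# truncated for brevity; a real implementation covers the full periodic table).
MONOISOTOPIC = {
    "C": 12.0, "H": 1.00782503, "N": 14.0030740,
    "O": 15.9949146, "S": 31.9720707, "Na": 22.9897693,
}

def monoisotopic_mass(formula):
    """Compute the monoisotopic mass of a simple molecular formula, e.g. 'C2H6O'."""
    mass = 0.0
    # Match an element symbol (one uppercase letter, optional lowercase letter)
    # followed by an optional count.
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass

print(round(monoisotopic_mass("C2H6O"), 4))  # ethanol -> 46.0419
```

A database search then compares this computed mass against observed accurate masses within an instrument-dependent tolerance window.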
The iCSS CompTox Chemistry Dashboard is a publicly accessible dashboard provided by the National Center for Computational Toxicology at the U.S. EPA. It serves a number of purposes, including providing a chemistry database underpinning many of our public-facing projects (e.g. ToxCast and ExpoCast). The available data and searches provide a valuable path to structure identification using mass spectrometry as the source data. With an underlying database of over 720,000 chemicals, the dashboard has already been used to assist in identifying chemicals present in house dust. This poster reviews the benefits of the EPA’s platform and the underlying algorithms used for compound identification from high-resolution mass spectrometry data. Standard approaches for both mass and formula lookup are available, but the dashboard delivers a novel approach for hit ranking based on the functional use of the chemicals. The focus on high-quality data, novel ranking approaches and integration with other resources of value to mass spectrometrists makes the CompTox Dashboard a valuable resource for the identification of environmental chemicals. This abstract does not reflect U.S. EPA policy.
This presentation was made at the University of North Carolina at Chapel Hill on 9/20/21. It provided a general introduction to cheminformatics before demonstrating how to navigate the Dashboard, covering:
• An introduction to the dashboard
• Substances vs structures
• Structure formats for data exchange and connectivity (SMILES, InChIs, molfiles)
• Identifiers – CASRN, chemical names, systematic names
• Data curation approaches: substance-structure ambiguity
• ChemReg: substance registration
• Data gathering for systematic reviews
• Curated lists
• Properties/Fate and Transport
• Access to Exposure Data
• Hazard data in the dashboard – ToxVal data (sourced from >40 databases, >50,000 chemicals, >900,000 data points)
• The Executive Summary of data
• Single chemical searches vs Batch searches
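One of the curation steps implied by the identifier topics above is validating CAS Registry Numbers, which carry a built-in check digit: the digits before the check digit, read right to left, are weighted 1, 2, 3, ..., and the weighted sum modulo 10 must equal the check digit. A minimal validator:

```python
def valid_casrn(casrn):
    """Validate a CAS Registry Number (e.g. '7732-18-5') via its check digit.

    The digits before the final (check) digit, read right to left, are
    weighted 1, 2, 3, ...; the weighted sum modulo 10 must equal the check digit.
    """
    parts = casrn.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    digits, check = parts[0] + parts[1], int(parts[2])
    weighted = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return weighted % 10 == check

print(valid_casrn("50-00-0"))    # formaldehyde -> True
print(valid_casrn("7732-18-5"))  # water -> True
print(valid_casrn("50-00-1"))    # corrupted check digit -> False
```

A passing check digit does not prove the CASRN maps to the intended substance, but a failing one reliably flags transcription errors during registration and curation.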
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases within the DSSTox project and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for ~760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectrometry non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders, including, for example, scientists interested in Per- and Polyfluoroalkyl Substances (PFAS). Added lists include those sourced from the European Union as well as those developed in-house, now containing thousands of chemicals. A procured testing library of hundreds of PFAS chemicals annotated into chemical categories has been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and links to the open literature. This presentation will provide an overview of the dashboard, the developing library of PFAS chemicals and associated categorization, and new physicochemical property and environmental fate and transport QSAR prediction models developed for these chemicals.
The application of the dashboard to support mass spectrometry non-targeted analysis studies for the identification of PFAS chemicals will also be reviewed. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The construction of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to disentangle the influence of the quality versus the quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago. Specific examples of quality issues in the EPISuite data include multiple records for the same chemical structure with different measured property values, inconsistency between the structure, chemical name and CAS Registry Number for single records, SMILES strings that cannot be converted into chemical structures, hypervalency in the chemical structures and the absence of stereochemistry for thousands of data records. Relative to the era of EPISuite development, modern cheminformatics tools allow for more advanced capabilities in terms of chemical structure representation and storage, as well as enabling automated data validation and standardization approaches to examine data quality. This presentation will review both our manual and automated approaches to examining key datasets related to the EPISuite training and test data. This includes approaches to cross-validate chemical structure representations (e.g. molfile and SMILES) against identifiers (chemical names and registry numbers), as well as approaches to standardize the data into QSAR-consumable formats for modeling. We have quantified and segregated the data into various quality categories to allow us to thoroughly investigate the resulting models that can be developed from these data slices and to examine to what extent efforts into the development of large, high-quality datasets have the expected pay-off in terms of prediction performance. This abstract does not reflect U.S. EPA policy.
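The first quality issue described above, multiple records for the same chemical structure with differing measured values, can be detected with a simple grouping pass. This is a schematic sketch with hypothetical records keyed by an InChIKey-style structure identifier, not the actual curation pipeline:

```python
from collections import defaultdict

# Hypothetical property records: (structure key, measured logP value).
# The keys below are illustrative InChIKey-prefix-style strings, not real ones.
RECORDS = [
    ("IISBACLAFKSPIT", 3.32),   # record 1 for a structure
    ("IISBACLAFKSPIT", 3.40),   # duplicate record, conflicting measured value
    ("YXFVVABEGXRONW", 2.73),   # a structure with a single record
]

def flag_conflicts(records, tolerance=0.05):
    """Group records by structure key and return the keys (with their values)
    whose measured values disagree by more than the tolerance."""
    by_key = defaultdict(list)
    for key, value in records:
        by_key[key].append(value)
    return {k: vs for k, vs in by_key.items()
            if max(vs) - min(vs) > tolerance}

print(flag_conflicts(RECORDS))
```

Flagged groups can then be routed to manual review, averaged, or excluded, and the resulting "data slices" used to test how quality affects model performance.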
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are utilized to identify emerging contaminants and chemical signatures of interest detected in various media. At the US Environmental Protection Agency, the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is an open chemistry resource and web-based application containing data for ~900,000 substances that supports non-targeted and suspect screening analyses. Searching functionality includes identifier searches (e.g. systematic names, trade names and CAS Registry Numbers) and mass- and formula-based searches, while prototype developments include combined substructure-mass/formula searches and searching experimental mass spectral data against predicted fragmentation spectra. A specific type of data mapping in the database uses “MS-Ready” structures, produced by processing all registered substances to separate multi-component chemicals into their individual components, remove stereochemical bonds, and desalt and neutralize the structures. This MS-Ready processing supports batch searching using either masses or formulae to identify candidate chemicals and their mapped substances. A number of chemical lists (https://comptox.epa.gov/dashboard/chemical_lists) have also been developed to support the identification of chemicals related to agrochemistry, specifically pesticides (both active and inert constituents), insecticides and their metabolites, and environmental breakdown products. This presentation will provide an overview of how the CompTox Chemicals Dashboard supports mass spectrometry-based structure identification and non-targeted analysis of chemicals in agrochemistry. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
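As a rough illustration of the MS-Ready idea (not the Dashboard's actual implementation, which uses full cheminformatics toolkits and also handles charge neutralization), the desalting and stereochemistry-removal steps can be mimicked at the SMILES-string level:

```python
def ms_ready(smiles):
    """Simplified illustration of 'MS-Ready' processing on a SMILES string:
    keep the largest component of a multi-component (e.g. salt) record and
    strip stereo markers. Real workflows parse the structure with a
    cheminformatics toolkit; this sketch only shows the general shape."""
    # Desalt: components of a SMILES are '.'-separated; keep the largest.
    largest = max(smiles.split("."), key=len)
    # Remove stereo markers: '@' (tetrahedral centers), '/' and '\' (double bonds).
    for marker in ("@", "/", "\\"):
        largest = largest.replace(marker, "")
    return largest

print(ms_ready("CC(=O)[O-].[Na+]"))   # sodium acetate -> acetate component kept
print(ms_ready("C[C@@H](N)C(=O)O"))   # alanine -> stereo marker stripped
```

Because the mass spectrometer observes the desolvated, often desalted form of a substance, searching against MS-Ready structures lets one observed mass map back to all registered substances (salts, stereoisomers, mixtures) that share that form.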
The internet has changed the way we access chemistry data, providing access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with numbers for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million structures respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of this noise, the value of the larger databases becomes highly dependent on the specific application; one example is using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The development of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to disentangle the influence of the quality versus the quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago and, specifically, on the PHYSPROP dataset used to train the EPISuite prediction models. This presentation will review our approaches to examining key datasets, the delivery of curated data and the development of machine-learning models for thirteen separate property endpoints of interest to environmental science. We will also review how these data will be made freely accessible to the community via a new “chemistry dashboard”. This abstract does not reflect U.S. EPA policy.
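As a toy illustration of the kind of property modeling described (not the actual models, which use many molecular descriptors and machine-learning methods), a single-descriptor ordinary least-squares fit looks like this; the boiling points are standard experimental values for n-alkanes:

```python
# Minimal single-descriptor QSAR-style fit via ordinary least squares:
# predict normal boiling point from carbon count for n-alkanes (C5-C9).
DATA = [(5, 36.1), (6, 68.7), (7, 98.4), (8, 125.6), (9, 150.8)]  # (nC, bp in C)

def fit_ols(points):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

a, b = fit_ols(DATA)
print(f"slope = {a:.2f} C per carbon, intercept = {b:.2f} C")
```

Even this trivial model shows why data quality matters: a single corrupted record (wrong structure-value linkage) shifts both fitted coefficients, and that sensitivity grows with model complexity.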
The patent literature has historically been complex and inaccessible to the searches required for effective IP management and maintenance of a competitive position, particularly when it comes to chemical structure information. The availability of raw patent text feeds in a structured form has allowed the application of text-to-structure and image-to-structure conversion techniques. The problem then became one of applying this solution across massive data sets in an accurate and scalable manner to deliver a turnkey patent informatics system with automatically extracted and searchable chemical structures. SureChem, an advanced cloud application, uses a tournament of methods to achieve higher coverage and accuracy than any single approach. The product was launched and licensed to a user community under a freemium business model. Latterly, user feedback and market shifts indicated a need to link biological data into patents too (sequences, genes, targets, diseases, etc.). This created an opportunity to transition SureChem to EMBL-EBI, a public organisation with the remit of data dissemination and sharing, and deep experience of biodata, including the large ChEMBL database of structure-activity relationship data. In 2014 SureChem became SureChEMBL. The presentation will review the development of SureChem, discuss the marketplace for patent informatics, and look ahead to future development plans for SureChEMBL.
Presentation on the Chemical Analysis Metadata Platform (ChAMP), a new project to characterize and organize metadata about chemical analysis methods. The project will develop an ontology, controlled vocabularies, and design rules.
The Royal Society of Chemistry (RSC) is a major participant in providing access to chemistry-related data via the web. As an internationally renowned society for the chemical sciences, a scientific publisher and the host of the ChemSpider database for the community, the RSC continues to make dramatic strides in providing online access to data. ChemSpider provides access to over 30 million chemicals sourced from over 500 data suppliers and linked out to related information on the web. The platform is a crowdsourcing environment whereby members of the community can participate in validating and expanding the content of the database. Through a set of application programming interfaces, ChemSpider is used by various organizations and projects to serve up data for a range of purposes. These include structure identification for mass spectrometry instrument vendors, RSC databases such as the MarinLit natural products database and a European grant-based project funded by the Innovative Medicines Initiative. This presentation will provide an overview of various cheminformatics activities and projects that the RSC is involved with to serve the medicinal chemistry community. These include the Open PHACTS semantic web project, the PharmaSea project to identify new pharmaceutical leads from the ocean and the UK National Compound Collection to identify new lead compounds contained within PhD theses.
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are being made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for almost 760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectrometry non-targeted screening community, who are generating important data for detecting and assessing environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders including, for example, scientists interested in algal toxins and hydraulic fracturing chemicals. This presentation will provide an overview of the challenges associated with the curation of data from EPA’s December 2016 Hydraulic Fracturing Drinking Water Assessment Report, which covered chemicals reported to be used in hydraulic fracturing fluids and those found in produced water. The data have been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and links to the open literature. The application of the dashboard to support mass spectrometry non-targeted analysis studies will also be reviewed. This abstract does not reflect U.S. EPA policy.
Presentation for Texas A&M Superfund Research Center virtual learning series, Big Data in Environmental Science and Toxicology. More details at https://superfund.tamu.edu/big-data-session-2-aug-18-2021/
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and, as a result, the dashboard surfaces hundreds of thousands of data points. Other data include experimental and predicted physicochemical property data, in vitro bioassay data for over 4000 chemicals and ~1500 assays, and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include an interactive read-across module, real-time physicochemical and toxicity endpoint prediction, and an integrated search of PubMed. This presentation will provide an overview of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are advancing the identification of emerging contaminants in environmental matrices, improving the means by which exposure analyses can be conducted. However, confidence in structure identification of unknowns in NTA presents challenges to analytical chemists. Structure identification requires integration of complementary data types such as reference databases, fragmentation prediction tools, and retention time prediction models. The goal of this research is to optimize and implement structure identification functionality within the US EPA’s CompTox Chemistry Dashboard, an open chemistry resource and web application containing data for ~760,000 substances. Rank-ordering the number of sources associated with chemical records within the Dashboard (Data Source Ranking) improves the identification of unknowns by bringing the most likely candidate structures to the top of a search results list. Database searching has been further optimized with the generation of MS-Ready structures, which are de-salted, stripped of stereochemistry, and mixture-separated to replicate the form of a chemical observed via HRMS. Functionality to conduct batch searching of molecular formulae and monoisotopic masses was designed and released to improve searching. Finally, a scoring-based identification scheme was developed, optimized, and surfaced via the Dashboard using multiple data streams contained within the underlying database. The scoring-based identification scheme improved the identification of unknowns over previous efforts using data source ranking alone. Combining these steps within an open chemistry resource provides a freely available software tool for structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
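The MS-Ready transformation described above (mixture separation, desalting, stereochemistry removal) can be illustrated with a deliberately simplified sketch that works directly on SMILES strings. The salt-fragment list below is an illustrative assumption, and a production pipeline would use a full cheminformatics toolkit rather than string edits:

```python
# Toy sketch of MS-Ready structure generation at the SMILES-string level.
# The counterion list is illustrative only, not the EPA's desalting rules.

SALT_FRAGMENTS = {"Cl", "Br", "[Cl-]", "[Na+]", "[K+]", "O"}

def ms_ready(smiles):
    # 1. Mixture separation: components of a multi-component record
    #    are dot-separated in SMILES.
    components = smiles.split(".")
    # 2. Desalting: drop known counterion/solvent fragments.
    components = [c for c in components if c not in SALT_FRAGMENTS]
    # 3. Strip stereochemistry markers (@ for tetrahedral centres,
    #    / and \ for double-bond geometry).
    return [c.replace("@", "").replace("/", "").replace("\\", "")
            for c in components]

# Alanine hydrochloride: the salt fragment is dropped and the
# stereocentre flattened to the achiral form seen by HRMS.
print(ms_ready("C[C@H](N)C(=O)O.Cl"))  # ['C[CH](N)C(=O)O']
```

Real implementations also neutralize charges and canonicalize the result, steps that genuinely require a cheminformatics toolkit.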
The US EPA’s CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is a freely available web-based application providing access to data for ~900,000 chemical substances, the majority of these represented as chemical structures. The Dashboard also provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders and, in particular, a list of hundreds of disinfection by-product (DBP) chemicals reported in the literature and detected in the laboratory using mass spectrometric techniques. Many of these chemicals are explicit chemical structures that have been confirmed using purchased or synthesized reference standards. However, some of these chemicals are ambiguous in nature: no explicit positional isomer can be defined, but the formula and mass spectral fragmentation are sufficient to define a class of chemicals (e.g. dichlorophenol). Such chemicals may be represented with ambiguous chemical structure forms, so-called Markush structures, and mapped to the individual class members. Chemical records accessible via the Dashboard can include a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, product use information extracted from safety data sheets, and integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. Since DBP chemicals are primarily identified using mass spectrometric techniques, specific search types have been developed to directly support the non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within the database. This presentation will provide an overview of the Dashboard, the ongoing expansion of the DBP chemical list and the specific functionality supporting identification of DBPs by mass spectrometry.
This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
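As an illustration of the class-level ambiguity described in the abstract above, a formula-defined class such as dichlorophenol can be expanded into its individual members by enumerating chlorine placements on the phenol ring and removing duplicates under the ring's mirror symmetry. This is a minimal combinatorial sketch, not the Dashboard's Markush machinery:

```python
from itertools import combinations

# Enumerate the distinct dichlorophenol positional isomers.
# Phenol ring positions 2-6 can carry Cl (position 1 bears the OH);
# the ring's mirror symmetry maps position p to 8 - p (2<->6, 3<->5,
# 4 fixed), so mirrored placements describe the same isomer.

def dichlorophenol_isomers():
    isomers = set()
    for a, b in combinations(range(2, 7), 2):
        mirrored = tuple(sorted((8 - a, 8 - b)))
        # keep one canonical representative per symmetry class
        isomers.add(min((a, b), mirrored))
    return sorted(isomers)

print(dichlorophenol_isomers())
# [(2, 3), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5)]
```

The six tuples correspond to the six real dichlorophenols (2,3-; 2,4-; 2,5-; 2,6-; 3,4-; 3,5-), the individual class members to which an ambiguous record could be mapped.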
The Open PHACTS project delivers an online platform integrating a wide variety of data from across chemistry and the life sciences and an ecosystem of tools and services to query this data in support of pharmacological research, turning the semantic web from a research project into something that can be used by practising medicinal chemists in both academia and industry. In the summer of 2015 it was the first winner of the European Linked Data Award. At the Royal Society of Chemistry we have provided the chemical underpinnings to this system and in this talk we review its development over the past five years. We cover both our early work on semantic modelling of chemistry data for the Open PHACTS triplestore and more recent work building an all-purpose data platform, for which the Open PHACTS data has been an important test case: what has worked well, what's missing and where this is likely to go in future.
Presented by Richard Kidd at "The Future Information Needs of Pharmaceutical & Medicinal Chemistry", Monday 28 November 2011 at the Linnean Society, Burlington House, London, run by the RSC CICAG group.
An Integrated Approach To Drug Discovery Using Parallel Synthesis (Graham Smith)
An Integrated Approach To Drug Discovery Using Parallel Synthesis: the history of parallel chemistry for lead discovery at Pfizer Sandwich, from beginning to outsourcing.
Identification of unknowns in mass spectrometry based non-targeted analyses (NTA) requires the integration of complementary pieces of data to arrive at a confident, consensus structure. Researchers use chemical reference databases, spectral matching, fragment prediction tools, retention time prediction tools, and a variety of other data to arrive at tentative, probable, and, where possible, confirmed identifications. With the diverse, robust data contained within the US EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov), the goal of this research is to identify and implement a harmonized identification tool and workflow using previously generated chemistry data. Data have been compiled from product use, functional use prediction models, environmental media occurrence prediction models, and PubMed references, among other sources. We will report on our development of a visualization tool whereby users can visualize the relative contribution of identification-based metrics across a list of candidate structures and identify those with the greatest likelihood of occurrence. These data and visualization tools support NTA identification via the Dashboard and demonstrate an open, accessible tool for all users of HRMS data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
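Combining heterogeneous identification metrics of the kind listed above into a consensus score can be sketched as a weighted sum of values normalized across the candidate list. The metric names and weights below are illustrative assumptions, not the Dashboard's actual scoring scheme:

```python
# Illustrative sketch of metadata-based candidate ranking for NTA.
# Metric names and weights are assumptions chosen for demonstration.

WEIGHTS = {"data_sources": 0.4, "pubmed_refs": 0.4, "product_uses": 0.2}

def rank_candidates(candidates):
    # Normalize each metric to [0, 1] across the candidate list so that
    # metrics on very different scales (e.g. source counts vs literature
    # counts) contribute comparably, then combine with a weighted sum.
    maxima = {m: max(c[m] for c in candidates) or 1 for m in WEIGHTS}
    def score(c):
        return sum(w * c[m] / maxima[m] for m, w in WEIGHTS.items())
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"name": "candidate A", "data_sources": 12, "pubmed_refs": 340, "product_uses": 2},
    {"name": "candidate B", "data_sources": 3,  "pubmed_refs": 10,  "product_uses": 5},
]
print([c["name"] for c in rank_candidates(candidates)])
# ['candidate A', 'candidate B']
```

Normalizing before weighting is the key design choice: without it, a metric with large raw counts (such as literature references) would dominate the score regardless of its assigned weight.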
Building linked data large-scale chemistry platform - challenges, lessons and... (Valery Tkachenko)
Chemical databases have been around for decades, but in recent years we have observed a qualitative change from rather small in-house built proprietary databases to large-scale, open and increasingly complex chemistry knowledgebases. This tectonic shift has imposed new requirements for database design and system architecture as well as the implementation of completely new components and workflows which did not exist in chemical databases before. Probably the most profound change is being caused by the linked nature of modern resources - individual databases are becoming nodes and hubs of a huge and truly distributed web of knowledge. This change has important aspects such as data and format standards, interoperability, provenance, security, quality control and metainformation standards.
ChemSpider at the Royal Society of Chemistry was the first public chemical database to incorporate rigorous quality control, introducing both community curation and automated quality checks at the scale of tens of millions of records. Yet we have come to realize that this approach may now be incomplete in a quickly changing world of linked data. In this presentation we will talk about the challenges associated with building modern public and private chemical databases, as well as lessons that we have learned from our past and present experience. We will also talk about solutions for some common problems.
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r... (Dr. Haxel Consult)
Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
While the raison d'être of patents is Intellectual Property (IP), there is a growing awareness of the scientific value of their data content. This is particularly so in medicinal chemistry and associated bioactivity domains, where disclosed compounds and associated data not only exceed those published in papers several-fold and surface years earlier, but are also, paradoxically, completely open (i.e. no paywalls). Scientists have traditionally extracted their own relationships or used commercial sources, but the last few years have seen a “big bang” in patent extractions submitted to open databases, including nearly 20 million structures now in PubChem.
This tutorial will:
• Outline the statistics of patent chemistry in various open sources
• Introduce a spectrum of open resources and tools
• Enable an understanding of target identification, bioactivity and SAR extraction from patents and connecting these relationships to papers
• Cover aspects of medicinal chemistry patent mining
• Include hands-on exercises using open source antimalarial research as examples
The focus will be on public databases and patent office portals, since these can be transparently demonstrated. However, the essential complementarity with commercial resources will be touched on. Those engaged in Competitive Intelligence will also find the material relevant.
Looking at chemistry - protein - papers connectivity in ELIXIR (Chris Southan)
This is a poster for the UK ELIXIR meeting in Birmingham, UK, Nov 2018. It is the summary of a blog post, https://cdsouthan.blogspot.com/2018/08/an-initial-look-at-elixir-chemistry.html, that assesses chemistry <> protein <> papers connectivity (C-P-P) for five ELIXIR resources.
This presentation was made to the University of North Carolina in Chapel Hill on 9/20/21. The presentation gave a general introduction to cheminformatics before covering how to navigate the Dashboard:
• An introduction to the dashboard
• Substances vs structures
• Structure formats for data exchange and connectivity (SMILES, InChIs, molfiles)
• Identifiers – CASRN, chemical names, systematic names
• Data curation approaches: substance-structure ambiguity
• ChemReg: substance registration
• Data gathering for systematic reviews
• Curated lists
• Properties/Fate and Transport
• Access to Exposure Data
• Hazard data in the dashboard – ToxVal data (sourced from >40 databases, >50,000 chemicals, >900,000 data points)
• The Executive Summary of data
• Single chemical searches vs Batch searches
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases within the DSSTox project, and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for ~760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectrometry non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders, including, for example, scientists interested in per- and polyfluoroalkyl substances (PFAS). Added lists include those sourced from the European Union as well as lists developed in-house, now containing thousands of chemicals. A procured testing library of hundreds of PFAS chemicals annotated into chemical categories has been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and links to the open literature. This presentation will provide an overview of the dashboard, the developing library of PFAS chemicals and associated categorization, and new physicochemical property and environmental fate and transport QSAR prediction models developed for these chemicals.
The application of the dashboard to support mass spectrometry non-targeted analysis studies for the identification of PFAS chemicals will also be reviewed. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The construction of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to discriminate the influence of the quality versus quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago. Specific examples of quality issues for the EPISuite data include multiple records for the same chemical structure with different measured property values, inconsistency between the structure, chemical name and CAS registry number for single records, the inability to convert the SMILES strings into chemical structures, hypervalency in the chemical structures and the absence of stereochemistry for thousands of data records. Relative to the era of EPISuite development, modern cheminformatics tools allow for more advanced capabilities in terms of chemical structure representation and storage, as well as enabling automated data validation and standardization approaches to examine data quality. This presentation will review both our manual and automated approaches to examining key datasets related to the EPISuite training and test data. This includes approaches to cross-validate chemical structure representations (e.g. molfile and SMILES) against identifiers (chemical names and registry numbers), as well as approaches to standardize the data into QSAR-consumable formats for modeling. We have quantified and segregated the data into various quality categories to allow us to thoroughly investigate the resulting models that can be developed from these data slices and to examine to what extent efforts into the development of large high-quality datasets have the expected pay-off in terms of prediction performance. This abstract does not reflect U.S. EPA policy.
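One identifier check of the kind described above can be fully automated: CAS Registry Numbers carry a built-in check digit, so mistyped or corrupted CASRNs in a dataset can be flagged without any external lookup. A minimal implementation of the published CAS checksum rule:

```python
import re

# Validate a CAS Registry Number via its check digit: each digit
# (excluding the check digit) is multiplied by its position counted
# from the right, and the sum modulo 10 must equal the check digit.

def valid_casrn(casrn):
    # CASRN format: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit
    if not re.fullmatch(r"\d{2,7}-\d{2}-\d", casrn):
        return False
    digits = casrn.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * i for i, d in enumerate(reversed(body), start=1))
    return total % 10 == check

print(valid_casrn("7732-18-5"))  # water: True
print(valid_casrn("7732-18-4"))  # corrupted check digit: False
```

A check like this only catches malformed registry numbers; detecting a *valid* CASRN attached to the wrong structure still requires the cross-validation against names and structure representations described in the abstract.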
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are utilized to identify emerging contaminants and chemical signatures of interest detected in various media. At the US Environmental Protection Agency, the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is an open chemistry resource and web-based application containing data for ~900,000 substances that supports non-targeted and suspect screening analyses. Searching functionality includes identifier searches (e.g. systematic names, trade names and CAS Registry Numbers) and mass- and formula-based searches, and prototype developments include combined substructure-mass/formula searches and searching experimental mass spectral data against predicted fragmentation spectra. A specific type of data mapping in the database uses “MS-Ready” structures: all registered substances are processed to separate multi-component chemicals into their individual components, remove stereochemical bonds, and desalt and neutralize the structures. This MS-Ready processing supports batch searching using either masses or formulae to identify candidate chemicals and their mapped substances. A number of chemical lists (https://comptox.epa.gov/dashboard/chemical_lists) have also been developed to support the identification of chemicals related to agrochemistry, specifically pesticides (both active and inert constituents), insecticides and their metabolites and environmental breakdown products. This presentation will provide an overview of how the CompTox Chemicals Dashboard supports mass spectrometry based structure identification and non-targeted analysis of chemicals in agrochemistry. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
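At its core, the formula- and mass-based batch searching described above reduces to computing monoisotopic masses and matching an observed mass within a tolerance window. The element table below is truncated to a few common elements and the candidate list is illustrative; a real search runs against the full MS-Ready database:

```python
import re

# Sketch of a formula-based batch search: compute monoisotopic masses
# from molecular formulae and match an observed mass within a tolerance.
# The element table covers only a few common elements for illustration.

MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074,
                "O": 15.994915, "S": 31.972071, "Cl": 34.968853}

def monoisotopic_mass(formula):
    # Parse element symbols with optional counts, e.g. "C6H4Cl2O".
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * int(count or 1)
    return mass

def batch_search(observed_mass, candidates, tol=0.005):
    # Return candidate names whose mass lies within +/- tol Da.
    return [name for name, formula in candidates
            if abs(monoisotopic_mass(formula) - observed_mass) <= tol]

candidates = [("phenol", "C6H6O"), ("aniline", "C6H7N"), ("benzene", "C6H6")]
print(batch_search(94.0419, candidates))  # ['phenol']
```

In practice the tolerance would be expressed in ppm rather than absolute daltons, and the observed mass would first be corrected for the adduct (e.g. [M+H]+) before matching.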
The internet has changed the way we access chemistry data, enabling data to proliferate quickly and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with the numbers for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of the noise in the larger databases, their value becomes highly dependent on the specific application; an example is using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The development of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to discriminate the influence of the quality versus quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago and, specifically, on the PHYSPROP dataset used to train the EPISuite prediction models. This presentation will review our approaches to examining key datasets, the delivery of curated data and the development of machine-learning models for thirteen separate property endpoints of interest to environmental science. We will also review how these data will be made freely accessible to the community via a new “chemistry dashboard”. This abstract does not reflect U.S. EPA policy
The patent literature has historically been complex and inaccessible to searches required for effective IP management and maintenance of a competitive position, particularly when it comes to chemical structure information. The availability of raw patent text feeds in a structured form have allowed the application of text-to-structure and image-to-structure conversion techniques. The problem then became one of applying this solution across massive data sets in an accurate and scalable manner to deliver a turnkey patent informatics system with automatically extracted, and searchable chemical structures. SureChem, an advanced cloud application, uses a tournament of methods to achieve higher coverage and accuracy than any single approach. This product was launched and licensed by a user community with a freemium business model. Latterly, user feedback and market shifts indicated a need to link biological data into patents too (sequences, genes, targets, diseases, etc). This created an opportunity to transition SureChem to EMBL-EBI, a public organisation with the remit of data dissemination and sharing, and deep experience of biodata, including the large ChEMBL database of Structure Activity Relationship Data. In 2014 SureChem became SureChEMBL. The presentation will review the development of SureChem, discuss the marketplace for patent informatics, and look ahead to future development plans for SureChEMBL.
Presentation on the Chemical Analysis Metadata Platform (ChAMP) as a new project to characterize and organize metadata about chemical analysis methods. The project will develop an ontology, controlled vocabularies, and design rules
The Royal Society of Chemistry (RSC) is a major participant in providing access to chemistry related data via the web. As an internationally renowned society for the chemical sciences, a scientific publisher and the host of the ChemSpider database for the community, RSC continues to make dramatic strides in providing online access to data. ChemSpider provides access to over 30 million chemicals sourced from over 500 data suppliers and linked out to related information on the web. The platform is a crowdsourcing environment whereby members of the community can participate in validating and expanding the content of the database. With a set of application programming interfaces ChemSpider is used by various organizations and projects to serve up data for various purposes. These include structure identification for mass spectrometry instrument vendors, RSC databases such as the Marinlit natural products database and a European grant-based project from the Innovative Medicines Initiative fund. This presentation will provide an overview of various cheminformatics activities and projects that RSC is involved with to serve the medicinal chemistry community. This will include the Open PHACTS semantic web project, the PharmaSea project to identify new pharmaceutical leads from the ocean and the UK National Compound Collection to identify new lead compounds contained within PhD theses.
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are being made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for almost 760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectroscopy non-targeted screening community, who are generating important data for detecting and assessing environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interests to relevant stakeholders including, for example, scientists interested in algal toxins and hydraulic fracturing chemicals. This presentation will provide an overview of the challenges associated with the curation of data from EPA’s December 2016 Hydraulic Fracturing Drinking Water Assessment Report that represented chemicals reported to be used in hydraulic fracturing fluids and those found in produced water. The data have been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and open literature. The application of the dashboard to support mass spectrometry non-targeted analysis studies will also be reviewed. This abstract does not reflect U.S. EPA policy.
Presentation for Texas A&M Superfund Research Center virtual learning series, Big Data in Environmental Science and Toxicology. More details at https://superfund.tamu.edu/big-data-session-2-aug-18-2021/
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and as a result the dashboard surfaces hundreds of thousands of data points. Other data includes experimental and predicted physicochemical property data, in vitro bioassay data for over 4000 chemicals and ~1500 assays, and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include an interactive read-across module, real-time physicochemical and toxicity endpoint prediction and an integrated search to PubMed. This presentation will provide an overview of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are advancing the identification of emerging contaminants in environmental matrices, improving the means by which exposure analyses can be conducted. However, confidence in structure identification of unknowns in NTA presents challenges to analytical chemists. Structure identification requires integration of complementary data types such as reference databases, fragmentation prediction tools, and retention time prediction models. The goal of this research is to optimize and implement structure identification functionality within the US EPA’s CompTox Chemistry Dashboard, an open chemistry resource and web application containing data for ~760,000 substances. Rank-ordering the number of sources associated with chemical records within the Dashboard (Data Source Ranking) improves the identification of unknowns by bringing the most likely candidate structures to the top of a search results list. Database searching has been further optimized with the generation of MS-Ready Structures. MS-Ready structures are de-salted, stripped of stereochemistry, and mixture separated to replicate the form of a chemical observed via HRMS. Functionality to conduct batch searching of molecular formulae and monoisotopic masses was designed and released to improve searching efforts. Finally, a scoring-based identification scheme was developed, optimized, and surfaced via the Dashboard using multiple data streams contained within the database underlying the Dashboard. The scoring-based identification scheme improved the identification of unknowns over previous efforts using data source ranking alone. Combining these steps within an open chemistry resource provides a freely available software tool for structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The US EPA’s CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is a freely available web-based application providing access to data for ~900,000 chemical substances, the majority of these represented as chemical structures. The Dashboard also provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders and, in particular, a list of hundreds of disinfection by-product (DBP) chemicals reported in the literature and detected in the laboratory using mass spectrometric techniques. Many of these chemicals are explicit structures that have been confirmed using purchased or synthesized reference standards. However, some of these chemicals are ambiguous in nature: no explicit positional isomer can be defined, but the formula and mass spectral fragmentation are sufficient to define a class of chemicals (e.g. dichlorophenol). Such chemicals may be represented with ambiguous chemical structure forms, so-called Markush structures, and mapped to the individual class members. Chemical records in the Dashboard can include a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, product use information extracted from safety data sheets, and integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. Since DBP chemicals are primarily identified using mass spectrometric techniques, specific search types have been developed to directly support the non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within the database. This presentation will provide an overview of the Dashboard, the ongoing expansion of the DBP chemical list and specific functionality supporting identification of DBPs by mass spectrometry.
This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The Open PHACTS project delivers an online platform integrating a wide variety of data from across chemistry and the life sciences and an ecosystem of tools and services to query this data in support of pharmacological research, turning the semantic web from a research project into something that can be used by practising medicinal chemists in both academia and industry. In the summer of 2015 it was the first winner of the European Linked Data Award. At the Royal Society of Chemistry we have provided the chemical underpinnings to this system and in this talk we review its development over the past five years. We cover both our early work on semantic modelling of chemistry data for the Open PHACTS triplestore and more recent work building an all-purpose data platform, for which the Open PHACTS data has been an important test case: what has worked well, what's missing and where this is likely to go in the future.
Presented by Richard Kidd at "The Future Information Needs of Pharmaceutical & Medicinal Chemistry", Monday 28 November 2011 at The Linnean Society, Burlington House, London, run by the RSC CICAG group.
An Integrated Approach To Drug Discovery Using Parallel Synthesis (Graham Smith)
An Integrated Approach To Drug Discovery Using Parallel Synthesis: the history of parallel chemistry for lead discovery at Pfizer Sandwich, from beginning to outsourcing.
Identification of unknowns in mass spectrometry-based non-targeted analyses (NTA) requires the integration of complementary pieces of data to arrive at a confident, consensus structure. Researchers use chemical reference databases, spectral matching, fragment prediction tools, retention time prediction tools, and a variety of other data to arrive at tentative, probable and, where possible, confirmed identifications. With the diverse, robust data contained within the US EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov), the goal of this research is to identify and implement a harmonized identification tool and workflow using previously generated chemistry data. Data have been compiled from product use, functional use prediction models, environmental media occurrence prediction models, and PubMed references, among other sources. We will report on our development of a visualization tool whereby users can visualize the relative contribution of identification-based metrics for a list of candidate structures and identify those with the greatest likelihood of occurrence. These data and visualization tools support NTA identification via the Dashboard and demonstrate an open, accessible tool for all users of HRMS data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
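A harmonized score over such data streams can be sketched as a weighted sum of normalized metrics per candidate structure; the field names and weights below are hypothetical illustrations, not the Dashboard's actual scheme.

```python
def rank_candidates(candidates, weights):
    """Rank candidate structures by a weighted sum of normalized
    (0-1) metadata metrics; missing metrics count as zero."""
    def score(c):
        return sum(w * c.get(k, 0.0) for k, w in weights.items())
    return sorted(candidates, key=score, reverse=True)

# Hypothetical candidates sharing one molecular formula
candidates = [
    {"name": "candidate_A", "data_sources": 0.9, "literature_refs": 0.2},
    {"name": "candidate_B", "data_sources": 0.5, "literature_refs": 0.9},
]
weights = {"data_sources": 0.6, "literature_refs": 0.4}
ranked = rank_candidates(candidates, weights)
```

Here candidate_B scores 0.66 against candidate_A's 0.62, so the literature-supported structure rises to the top even though it has fewer data sources; tuning the weights is exactly the optimization step the abstract describes.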
Building linked data large-scale chemistry platform - challenges, lessons and... (Valery Tkachenko)
Chemical databases have been around for decades, but in recent years we have observed a qualitative change from rather small in-house built proprietary databases to large-scale, open and increasingly complex chemistry knowledgebases. This tectonic shift has imposed new requirements for database design and system architecture, as well as requiring the implementation of completely new components and workflows which did not exist in chemical databases before. Probably the most profound change is being caused by the linked nature of modern resources: individual databases are becoming nodes and hubs of a huge and truly distributed web of knowledge. This change has important aspects such as data and format standards, interoperability, provenance, security, quality control and metainformation standards.
ChemSpider at the Royal Society of Chemistry was the first public chemical database to incorporate rigorous quality control by introducing both community curation and automated quality checks at the scale of tens of millions of records. Yet we have come to realize that this approach may now be incomplete in a quickly changing world of linked data. In this presentation we will talk about challenges associated with building modern public and private chemical databases, as well as lessons that we have learned from our past and present experience. We will also talk about solutions for some common problems.
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r... (Dr. Haxel Consult)
Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
While the raison d'être of patents is Intellectual Property (IP), there is a growing awareness of the scientific value of their data content. This is particularly so in medicinal chemistry and associated bioactivity domains, where disclosed compounds and associated data not only exceed those published in papers by several-fold and surface years earlier, but are also, paradoxically, completely open (i.e. no paywalls). Scientists have traditionally extracted their own relationships or used commercial sources, but the last few years have seen a “big bang” in patent extractions submitted to open databases, including nearly 20 million structures now in PubChem.
This tutorial will:
Outline the statistics of patent chemistry in various open sources
Introduce a spectrum of open resources and tools
Enable an understanding of target identification, bioactivity and SAR extraction from patents and connecting these relationships to papers
Cover aspects of medicinal chemistry patent mining
Include hands on exercises using open source antimalarial research as examples
The focus will be on public databases and patent office portals, since these can be transparently demonstrated. However, the essential complementarity with commercial resources will be touched on. Those engaged in Competitive Intelligence will also find the material relevant.
Looking at chemistry - protein - papers connectivity in ELIXIR (Chris Southan)
This is a poster for the UK ELIXIR meeting in Birmingham UK, Nov 2018. It is the summary of a blog post https://cdsouthan.blogspot.com/2018/08/an-initial-look-at-elixir-chemistry.html that assesses chemistry <> protein <> papers connectivity (C-P-P) for five ELIXIR resources.
The open patent chemistry “big bang”: Implications, opportunities and caveats (Dr. Haxel Consult)
In 2012, after the first IBM deposition, few would have predicted that PubChem compounds that included patent-extracted structures would exceed 20 million within three years (i.e. 30% of the total). The current major open patent chemistry submitters (in size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and SureChEMBL. This “big bang” has a range of utilities and implications. Firstly, pharmaceutical companies must now integrate their exploitation of both public and commercial patent chemistry because capture is divergent. Secondly, the academic community and small companies can now patent-mine extensively without commercial sources. Thirdly, first-filings of most lead series and clinical candidates can now be tracked. Fourthly, drug targets in ChEMBL can be intersected with Structure Activity Relationship (SAR) data sets from patents, some of which are now target-mapped in other databases (doi:10.1016/j.ddtec.2014.12.001). However, while this patent chemistry “big bang” is generally welcomed by database users, there are significant caveats. In particular, both automated and manual extraction bring in a variety of artefacts that add confounding structural “noise”. These include a) permutations of mixtures and chiral exemplifications, b) virtual structures (including isotopic analogues of approved drugs), c) an emerging trend of vendor “patent picking” for non-stocked compounds, d) the fact that 85% of public patent chemistry has no biological data links and e) that extractions from documents do not directly indicate IP status. These problems and some partial solutions will be discussed.
Learn how large-scale normalized data empowers the critical early phases of drug discovery.
To address the core concerns about data quality, comprehensiveness and comparability, the Reaxys product team has developed a completely new repository for bioactivity information. Reaxys Medicinal Chemistry stands as a unique source for normalized data on in vitro efficacy, in vivo animal models, compound metabolism, pharmacokinetics and toxicity. This presentation takes a look at how this approach to data supports critical early discovery methods such as in silico screening and target profiling.
Next Generation Data and Opportunities for Clinical Pharmacologists (Philip Bourne)
Presentation at the Pre-meeting Workshop Next-Generation Clinical Pharmacology: Integrating Systems Pharmacology, Data-Driven Therapeutics, and Personalized Medicine. American Society for Clinical Pharmacology and Therapeutics Annual Meeting Atlanta GA March 18, 2014.
PubChem as a resource for chemical information training (Sunghwan Kim)
Presented at the 257th American Chemical Society (ACS) National Meeting in Orlando, FL (March 31, 2019). [CINF 13]
==== Abstract ====
Libraries at many large academic institutions provide chemical information training programs for students. However, these programs are based on commercial chemical information resources, which come with non-trivial subscription fees. These fees are often too expensive for small organizations, including many primarily undergraduate institutions (PUIs) and community colleges (CCs). This leads to disparities in access to chemical information, as well as in learning opportunities, among students. This issue may be addressed at least in part by developing free online training programs based on public chemical databases, such as PubChem (https://pubchem.ncbi.nlm.nih.gov). PubChem has great potential as an online resource for chemical education, but it also has important issues that students and teachers should keep in mind, such as data accuracy, data provenance, structure standardization, terminologies and so on. In this presentation, we will discuss various aspects of PubChem as a resource for chemical information training.
Presented to David Gloriam's Group, Copenhagen, Feb 2020
**********************************
The theme will be presented from the perspective of both past involvement in peptide curation in the Guide to Pharmacology (GtoPdb) and current searching for bioactive peptides in the wider ecosystem that includes ChEMBL and PubChem. The core problem is that peptides hang in limbo between bioinformatics (BLAST) and cheminformatics (Tanimoto), neither of which provides optimal searching. Curating peptides in GtoPdb presents many challenges, including mapping endogenous peptides to Swiss-Prot cleavage annotations. For synthetic peptides, equivocal specification of modifications and exact positions of radiolabels are also problematic. However, target-mapped, citation-supported quantitative binding parameters are curated where possible. For those peptides falling below the PubChem CID SMILES limit of approximately 70 residues, GtoPdb has been using Sugar and Splice from NextMove Software to convert them into CIDs. Specific problems associated with finding bioactive peptides in databases will be outlined.
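The cheminformatics half of that limbo rests on fingerprint comparison: a Tanimoto coefficient over bit sets, which ignores residue order and so discriminates poorly between large, similar peptides. A minimal sketch of the coefficient itself (toy bit indices, not a real fingerprinting method):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two fingerprint
    bit sets: shared on-bits over total on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy fingerprints as sets of "on" bit indices
assert tanimoto({1, 2, 3, 4}, {2, 3, 4, 5}) == 0.6
```

Two peptides differing only by swapped residues can yield near-identical fingerprints, whereas a BLAST-style alignment would readily separate them; hence neither approach alone searches peptides optimally.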
Vicissitudes of target validation for BACE1 and BACE2 (Chris Southan)
Introduction/Background & Aims
The beta-site amyloid precursor protein (APP) cleaving enzyme 1 (BACE1) was implicated as a drug target for Alzheimer's Disease (AD) back in 1999. In 2011, the paralogue, BACE2, became a newly proposed target for type II diabetes (T2DM), having been reported to be the TMEM27 secretase regulating pancreatic beta-cell function [1]. By 2019 the accumulated evidence, including a swathe of failed clinical trials for BACE1 inhibitors, had produced a de facto de-validation of both targets in both diseases. As a learning exercise, the series of events leading up to this is reviewed here.
Method/Summary of work
Basic information about these two targets and the lead compounds against them were sourced via the IUPHAR/BPS Guide to Pharmacology (GtoPdb) as Target ids: 2330 and 2331, for BACE1 and 2, respectively. This was consolidated by a literature and patent review as well as following them in other databases. The most recent information on clinical trials was sourced from press releases.
Results/Discussion
GtoPdb annotates 24 lead compounds against BACE1 and 12 against BACE2. The corresponding counts mapped to these targets in ChEMBL are 8741 and 1377, making BACE1 one of the most actively pursued enzyme targets ever. Notwithstanding the massive global effort, during 2018 Merck’s verubecestat and J&J’s atabecestat BACE1 inhibitors not only failed their Phase III endpoints but even appeared to worsen cognition in prodromal patients. In 2019 Amgen/Novartis stopped Phase II/III trials of umibecestat, which also showed more cognitive decline in the treatment group compared to controls. BACE2 presented an anomalous situation in several ways. By 2016 both Novartis and Amgen had declared their inability to reproduce the TMEM27 secretase turnover reported in 2011. Nevertheless, Novartis and other companies have published patents on BACE2-specific inhibitors over several years, and paradoxically verubecestat is more potent against BACE2 than BACE1 but was never tested for glucose lowering. Equally puzzling is that one academic group is still publishing BACE2 inhibitors for T2D even post de-validation. One thing both targets have in common is the complete absence of genetic support from genome-wide disease association studies, but this warning sign went unheeded.
Conclusions
The massive waste of resources on the pursuit of BACE1 as an AD target over the last two decades is catastrophic. This tale of de-validation is compounded for this paralogous pair of enzymes by the fact that the original evidence for BACE2 as a T2D target was eventually refuted. The story of these targets highlights a range of crucial pharmacological pitfalls that must be avoided in the future.
Reference(s)
[1] Southan C, Hancock J.M. (2013) A tale of two drug targets: the evolutionary history of BACE1 and BACE2. Front Genet. 4:293.
In silico 360 Analysis for Drug Development (Chris Southan)
Introduction:
Consequent to a memorandum of understanding between the Karolinska Institutet and the International Union of Basic and Clinical Pharmacology (IUPHAR) in 2018, a report on academic drug development, including guidelines (ADEV), has been drafted [1]. As part of this exercise, we conceived a triage for comprehensive informatics profiling around the compound-target-disease axis. We have termed this “in silico 360” (INS360), the aim of which is to support ADEV teams, since they may lack either the internal expertise or the external support to do this on their own. Indeed, some past SciLifeLab Drug Discovery and Development Platform projects had been halted because of overlooked competitive impingements or insufficient target validation evidence.
Methods:
We assessed the current database landscape, mostly public but including commercial, for potential utility for INS360. We were guided primarily by content coverage, usability, and reputation. We also explored some open property prediction resources for assay interference and toxicological inferences.
Results:
As a first-stop-shop, we selected the IUPHAR/BPS Guide to PHARMACOLOGY with ~900 ligand-target relationships captured via expert curation of journal papers. Moving up in scale, we evaluated ChEMBL at 1.8 million compounds with 1.1 million assay descriptions and 7,000 targets. With yet another jump we could search the patent corpus with 18 million extracted compounds in SureChEMBL. We explored PubChem, which integrates these three with over 500 other sources linked to 96 million compounds, BioAssay results and connectivity into the NCBI Entrez system. The final jump in scale for document-to-chemistry navigation was represented by SciFinder with 155 million structures. On the target side, 360-exploration needs to encompass literature, structure, genetic variation, splicing, interactions, and disease pathways. From their UniProt links, both GtoPdb and ChEMBL provide these entry points. Navigating genetic association data in support of target validation was enabled by the OpenTargets portal and the GWAS Catalog. We also found servers that could produce prediction scores from chemical structures for a range of features important for de-risking development.
Conclusion:
This work scoped out initial resource choices for the INS360. We propose that not only ADEV operations but essentially any pharmacology research team has much to gain from this approach and many potential pitfalls can consequently be avoided when approaching key checkpoints, such as preparing a publication. However, support may be needed for both institutions and teams to get the best out of these complex and feature-rich databases.
[1] Southan C, (2019) Towards Academic Drug Development Guidelines, ChemRxiv pre-print no. 8869574
Will the correct BACE ORFs please stand up? (Chris Southan)
BACE1 and BACE2 are protease targets for Alzheimer's and diabetes, respectively, but their validation is now questioned
Phylogenetic analysis can add functional insights
This came up against two key problems:
A surprising prevalence of incorrect protein sequences predicted from genomes
Many BACE1 and BACE2 orthologues had truncation and/or indel errors.
Key phylogenetic representative genomes are languishing in an unfinished state
Some options for amelioration of these problems will be described
An update on the evolution of these enzymes will be shown
Look for new and potentially useful human 5HT2A-directed small-molecule chemistry surfaced since the last meeting; check for compounds with 5HT2A as the primary target but also combined inhibitors; poll round the key databases, literature and patents. Searching challenges arise from synonym soup, complex cross-reactivities (see PMID 29679900), in vitro data gaps and in vivo polypharmacology.
Quality and noise in big chemistry databases (Chris Southan)
Presented at Aug 2019 ACS by Antony Williams. Abstract: The internet has changed the way we access chemistry data, providing access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with numbers for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of the noise in the larger databases, their value becomes highly dependent on the specific application; one example is using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem.
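The overlap and circularity between sources can be probed by collapsing deposited records onto a shared structure key; the sketch below assumes records already carry such a key (InChIKeys are commonly used for this) and its field names are hypothetical.

```python
def merge_by_structure_key(records):
    """Collapse deposited records that share a structure key,
    accumulating the list of contributing sources per structure."""
    merged = {}
    for rec in records:
        entry = merged.setdefault(rec["key"], {"sources": []})
        entry["sources"].append(rec["source"])
    return merged

# Hypothetical depositions: two sources deposit the same structure
records = [
    {"key": "AAA", "source": "vendor_X"},
    {"key": "AAA", "source": "curated_Y"},
    {"key": "BBB", "source": "vendor_X"},
]
merged = merge_by_structure_key(records)
```

Structures seen only from a single uncurated vendor are then easy to flag for closer scrutiny, whereas agreement across independent curated sources raises confidence.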
Poster for the World Congress of Pharmacology 2018, Kyoto
Introduction: The pharmacological literature and patents connect compound structures to their bioactivity. However, entombing these relationships for millions of compounds among millions of PDFs is acknowledged as massively problematic. The situation is ameliorated by resources that extract the entity and data relationships the authors and inventors put “in” to their PDFs back “out” into structured database records. The IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb) has been doing this by stringent curation of ligands and their quantitative activity against protein targets [1]. Our citations are submitted to PubChem (PC), who then link to PubMed (PM) [2]. This study presents an overview of this connectivity.
Methods: For GtoPdb entries in PC Substance we used the PC interface to count our submitted PM links. This gives the PC > PM mapping counts from which we analysed the PM links. We then performed reciprocal analyses (i.e. PM > PC) by selecting PM sets, and compared two journals by counting structure links by year and source.
Results: Of the 8988 GtoPdb-submitted ligand substances in PC (release 2017.5), 7309 are linked to 8980 PM entries. Of the 7309, there are 5632 links to chemical structures in PC, the rest being antibodies and larger peptides. From the 8980 PMIDs, the Journal of Medicinal Chemistry (JMC) accounted for 1003, making it our most frequently cited primary source of structure-to-activity mappings. For the British Journal of Pharmacology (BJP), most of the 345 cross-references were development compounds. Further analysis showed that from 2014 to 2017 the BJP-to-PC links of ~30 structures per year are mostly from GtoPdb and the Comparative Toxicogenomics Database. However, going back to 2010-12, this increased to 500-800 connections, mainly derived from the IBM automated chemical extraction from abstracts. A similar pattern was observed for JMC.
Conclusion: Navigation between documents and databases is an essential competence for pharmacologists and drug discovery but the NCBI Entrez system is daunting. GtoPdb is a major contributor of high-quality links and provides a first-stop to guide users into the PC/PM systems. However, our results indicated potentially serious specificity issues with automated chemistry-to-journal linking from non-GtoPdb sources.
References: [1] Harding et al. (2018). Nucl. Acids Res. 46 (Database Issue), doi: 10.1093/nar/gkx1121.
GtoPdb: A resource for cell-based perturbogens (Chris Southan)
Poster for ELRIG, Mölndal, 11/12 May 2017.
This poster will also be presented at BioITWorld, Boston, May 23-25
A resource for the selection and interpretation of cell-based perturbogens: the IUPHAR/BPS Guide to PHARMACOLOGY
Christopher Southan, Elena Faccenda, Joanna L. Sharman, Adam J. Pawson, Simon D. Harding, Jamie A Davies,
Translational research requires the integration of the in vitro molecular mechanisms of action (mmoa) of small molecules, cell-based screening studies, animal models and eventual clinical trials. The International Union of Basic and Clinical Pharmacology (IUPHAR)/British Pharmacological Society (BPS) database, GtoPdb http://www.guidetopharmacology.org/, provides expert-annotated molecular interactions between endogenous receptor ligands, probes, lead compounds, clinical drugs and their protein targets. It thus provides a core set of quantitative pharmacological relationships that can be interrogated for many purposes, including by those running cell-based screens, not only during result interpretation but also to identify key compounds for scoping and consolidation experiments. As described in [1], GtoPdb is populated by records extracted from pharmacology and medicinal chemistry journals, and released quarterly. Quality is ensured by curatorial stringency and our unique model of content selection based on recommendations from IUPHAR target class subcommittees of international experts collaborating with the in-house curators. The database now has over 14,000 binding values (mainly IC50, Ki or Kd) between 8000 ligands and 1500 human proteins (mainly primary but also secondary off-target interactions), representing ~7% of the proteome as druggable. Our coverage is complementary to other sources. For example, of the 6565 structures we recently submitted to PubChem as CIDs, 5206 were not in DrugBank and 1535 were not in ChEMBL. This includes recommended tool compounds with relatively defined mmoa (including 110 from the Structural Genomics Consortium Probe Portal). We also have 75% overlap with vendors for procurement and 80% with patent extractions, which in many cases allow mapping to SAR data sets from first-filings (some of which we point to). In a cell screening context, 1254 of our targets intersect with proteins in the Reactome pathway database.
This is one way to select chemical perturbation points that could be detected by assay readouts. From Nov 2015 we have been funded by the Wellcome Trust to extend into immunopharmacology (within the existing database schema), which is now driving overall GtoPdb content expansion. Parties engaged in cell-based assays that use, or could use, the compounds we cover are encouraged to use GtoPdb, contact us with queries or possible analogue expansions, and/or alert us to prospective new content. [1] Southan C et al. (2016) Nucleic Acids Res. 44(D1):D1054-68, PMID: 26464438
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Nutraceutical market, scope and growth: Herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market, which includes goods like functional foods, drinks, and dietary supplements that provide health advantages beyond basic nutrition, is growing significantly. As healthcare expenses rise, the population ages, and people increasingly want natural and preventative health solutions, this industry is expanding quickly. Product formulation innovations and the use of cutting-edge technology for customized nutrition further drive market expansion. With its worldwide reach, the nutraceutical industry is expected to keep growing and to provide significant opportunities for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Richard's entangled adventures in wonderland (Richard Gill)
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep... (University of Maribor)
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
(May 29th, 2024) Advancements in Intravital Microscopy - Insights for Preclini... (Scintica Instrumentation)
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for ultra-fast, high-resolution imaging of cellular processes over time and space in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enable researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allowing for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
What are greenhouse gases and how many gases affect the Earth? (moosaasad1975)
What are greenhouse gases, how do they affect the Earth and its environment, what is the future of the environment and the Earth, and how are weather and climate affected?
1. Why is connecting chemistry-to-biology in open sources more difficult than it should be?
Presented at UCL School of Pharmacy, London, 13 June 2019
Hosted by Professor Matthew Todd
Christopher Southan
2. Abstract
Progress in drug discovery and chemical biology is hugely enabled by curated document-assay-result-compound-target relationships (D-A-R-C-P) in open databases from resources such as the Guide to Pharmacology and ChEMBL. These are synergistically integrated into PubChem, which pre-computes chemical similarity and connectivity between over 95 million structures and 5.6 million BioAssay results. It also links chemistry to documents via various additional routes, including MeSH and large-scale submissions from publishers. However, these efforts are patchy and very few journals facilitate such connectivity. There thus remains a massive shortfall in public D-A-R-C-P capture from decades of papers and patents. This presentation will cover these aspects and discuss their partial amelioration by options such as author-driven depositions and open lab-book approaches as used by Open Source Malaria.
4. The core of the problem
“We have spent millions putting chemistry into PDFs but now we are spending more millions taking it back out” (Anon)
5. The chemistry <-> biology join
• Chemistry that does something significant in vitro, in cellulo, in vivo or in clinic
• Major bioactivity domains from drug discovery, chemical biology and ecology
• Some cases not adequately covered by this simple relationship chain (e.g. heparin
as indirect inhibitor of thrombin or where P could be a bacteria or protozoan)
• The majority of data still primarily archived in papers and patent documents
• Upper limit statistics for quality publications essentially unknown
D – A – R – C – P
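As an illustration only, the D-A-R-C-P relationship chain can be sketched as a single linked record. The field names and values below are invented for this sketch; they are not the schema of GtoPdb, ChEMBL, or any other database.

```python
from dataclasses import dataclass

# Hypothetical, minimal record type for the D-A-R-C-P chain:
# Document -> Assay -> Result -> Compound -> Protein (target).
# Field names and example values are placeholders, not a real schema.

@dataclass
class Darcp:
    document: str   # e.g. a PubMed ID or patent number
    assay: str      # assay description or BioAssay identifier
    result: float   # activity value, e.g. a pIC50
    compound: str   # compound identifier, e.g. a SMILES string or CID
    protein: str    # target identifier, e.g. a UniProt accession

example = Darcp(
    document="PMID (placeholder)",
    assay="enzyme inhibition assay (placeholder)",
    result=6.5,      # dummy pIC50 value
    compound="CCO",  # dummy SMILES
    protein="UniProt accession (placeholder)",
)
```

Cases such as heparin's indirect inhibition of thrombin, or a whole-organism target, do not fit this flat shape and would need a richer model.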
6. So how much disinterred chemistry is out there?
8. Unsung Heroes
Expert extraction of D-A-R-C-P by biocurators is hard for many reasons, including:
• Poor continuity of funding and career support
• Entity disambiguation challenges
• Unintentional obfuscation, ambiguity and errors by authors (and occasionally deliberate obfuscation by patent applicants)
• Difficult to capture nuances and complexities of molecular mechanisms of
action (e.g. prodrugs or no molecular target)
• Even primary activity parameters (IC50, Ki, Kd) have ~ 10-fold variation
between publications for nominally the same assays
• Judging the quality and potential reproducibility of the publications selected
for extraction
• Publisher guidelines are only slowly beginning to address the above
• Authors' engagement with assay and target ontologies is limited
9. Disinterment from the PDF tomb (I)
Image extraction > structure
• Real chemists sketch images in a jiffy
• The rest of us can use OSRA: the Optical Structure Recognition Application
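A minimal sketch of driving OSRA from a script, assuming OSRA is installed and on the PATH. By default OSRA prints recognised structures as SMILES; check `osra --help` for your version's options before relying on this.

```python
import shutil
import subprocess

def osra_command(image_path: str) -> list[str]:
    # Basic invocation; real use may add options (e.g. output format).
    return ["osra", image_path]

def image_to_smiles(image_path: str) -> str:
    """Run OSRA on a structure image and return its text output
    (one SMILES per recognised structure, by default)."""
    if shutil.which("osra") is None:
        raise RuntimeError("osra not found on PATH")
    out = subprocess.run(osra_command(image_path),
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()
```

The returned SMILES is a starting point to edit in a structure editor, not a guaranteed-correct answer.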
11. Commercial biocuration of D-A-R-C-P
Excelra (formerly GVKBIO)
GOSTAR stats from 2015
• 1.3 million cpds from 112K papers (~15 per paper)
• 3.5 million cpds from 70K patents (~50 per patent)
• 3,882 human targets
17. Recent large-scale chem < > doc PubChem submissions
• Generally a good thing but with caveats
• Difficult to automate filtration to identify the “aboutness” of key compounds
• Issues with indexing of non-PubMed, DOI-only journal papers
• Quality concerns with CNER chemistry extraction
• Introduces another parallel document <> structure mapping system into PubChem
18. Reciprocal links > virtuous circles (I)
• GtoPdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
19. Reciprocal links > virtuous circles (II)
• GtoMdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
29. Conclusions
• The bioscience community (including big-data miners) still has its collective feet nailed to the floor by the five-decade backlog of scientifically valuable bioactive chemistry relationships entombed in PDF papers and patents
• Biocuration of D-A-R-C-P makes a crucial contribution, but at limited scale
• Automated entity extraction is advancing but remains well behind the specificity of mechanistic biocuration, and is publisher-constrained
• The existence of several parallel document <> chemistry systems (e.g. MeSH, IBM, ChEMBL, EPMC, Springer Nature, Thieme, Wikidata) is enabling but also confusing
• The spread of Open Science ELNs is good to see, but findability, searchability and database submissions still need to be optimised
• The need remains to facilitate a direct flow of published author-specified bioactive chemistry (including from preprints) to databases (even if the papers are FAIR)
30. Proposed core of the solution
“Mandating authors to explicitly connect chemical structures to
their experimental bioactivity results in a form (extrinsic to PDF)
that is FAIR, structured, includes metadata, machine readable,
ontologised, transferable to open database records and
reciprocally linked to their publications” (Southan 2019)
• This is, of course, a counsel of perfection
• In essence, authors should become biocurators
• Currently only a few papers with data sets submitted to PubChem BioAssay
by authors would conform
• Has been technically feasible for at least a decade
• Impediments are thus sociological and publishing models
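As a sketch of what an author-supplied, machine-readable deposition might look like, the record below uses invented keys and dummy values; PubChem BioAssay and other deposition systems each define their own submission formats, so this is illustrative only.

```python
import json

# Hypothetical author-deposition record connecting a document, an
# assay, a result, a compound and a target. Keys are invented for
# illustration, not any deposition system's actual schema.
deposition = {
    "document": {"doi": None, "preprint": False},     # fill with real DOI
    "assay": {"description": "enzyme inhibition", "ontology_term": None},
    "results": [
        {"compound_smiles": "CCO",   # dummy SMILES
         "parameter": "IC50",
         "value": 1.5,
         "units": "uM"},
    ],
    "target": {"uniprot": None, "name": "example target"},
}

record_json = json.dumps(deposition, indent=2)
```

The point is the shape, not the keys: structured, with metadata, machine readable, and transferable to an open database record alongside the publication.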
• The simplest of starting points: at least the press release had a structure diagram
• OSRA provides a good starting point to edit and obtain SMILES
• The structure does not have to be exactly right, because a database similarity match will show what it should have been
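That approximate-structure similarity check can be sketched against PubChem's PUG REST service. This builds the query URL for the 2D-similarity route as documented for PUG REST; verify the endpoint name and `Threshold` parameter against the current PubChem documentation before relying on it.

```python
from urllib.parse import quote

PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def similarity_url(smiles: str, threshold: int = 90) -> str:
    """Build a PUG REST 2D-similarity query URL for a SMILES string.

    Fetching this URL (e.g. with urllib or requests) should return the
    CIDs of structures similar to the input at the given threshold.
    """
    return (f"{PUG}/compound/fastsimilarity_2d/smiles/"
            f"{quote(smiles, safe='')}/cids/JSON?Threshold={threshold}")
```

Even an imperfect SMILES from OSRA, fed through a query like this, is usually enough to surface the structure the document actually meant.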