This presentation highlighted how data curation impacts the reliability of QSAR models. We examined key datasets related to environmental endpoints, validating consistency across chemical structure representations (e.g., molfile and SMILES) and identifiers (chemical names and registry numbers), and applied approaches to standardize the data into QSAR-ready formats prior to modeling. This allowed us to quantify and segregate the data into quality categories, improving our ability to evaluate the models that can be developed from these data slices and to quantify to what extent efforts to develop high-quality datasets have the expected pay-off in terms of prediction performance. The most accurate models that we build will be accessible via our public-facing platform and will be used for screening and prioritizing chemicals for further testing.
The influence of data curation on QSAR Modeling – Presented at American Chemi... – Kamel Mansouri
This presentation examined the impact of data quality on the construction of QSAR models being developed within the EPA's National Center for Computational Toxicology. We have developed a public-facing platform to provide access to predictive models. As part of the work we have attempted to disentangle the influence of the quality versus quantity of data available to develop and validate QSAR models. This abstract does not reflect U.S. EPA policy.
The development of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to discriminate the influence of the quality versus quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago and, specifically, on the PHYSPROP dataset used to train the EPISuite prediction models. This presentation will review our approaches to examining key datasets, the delivery of curated data and the development of machine-learning models for thirteen separate property endpoints of interest to environmental science. We will also review how these data will be made freely accessible to the community via a new “chemistry dashboard”. This abstract does not reflect U.S. EPA policy.
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS – Kamel Mansouri
• OPERA is a free and open source/open data suite of QSAR models providing predictions for toxicity endpoints and physicochemical, environmental fate, and ADME properties.
• In addition to predictions, OPERA provides accuracy estimates, applicability domain assessment, and experimental data when available.
• Recent additions to OPERA include models for estrogenic activity, androgenic activity, and acute oral systemic toxicity developed through international collaborative modeling projects, and updates to models predicting plasma protein binding and intrinsic hepatic clearance.
The construction of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to discriminate the influence of the quality versus quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago. Specific examples of quality issues in the EPISuite data include multiple records for the same chemical structure with different measured property values, inconsistency between the structure, chemical name and CAS registry number for single records, the inability to convert SMILES strings into chemical structures, hypervalency in the chemical structures, and the absence of stereochemistry for thousands of data records. Relative to the era of EPISuite development, modern cheminformatics tools allow for more advanced capabilities in terms of chemical structure representation and storage, as well as enabling automated data validation and standardization approaches to examine data quality. This presentation will review both our manual and automated approaches to examining key datasets related to the EPISuite training and test data. This includes approaches to validate consistency between chemical structure representations (e.g., molfile and SMILES) and identifiers (chemical names and registry numbers), as well as approaches to standardize the data into QSAR-consumable formats for modeling. We have quantified and segregated the data into various quality categories to allow us to thoroughly investigate the resulting models that can be developed from these data slices and to examine to what extent efforts into the development of large high-quality datasets have the expected pay-off in terms of prediction performance. This abstract does not reflect U.S. EPA policy.
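One of the identifier checks described above, consistency of CAS registry numbers, lends itself to automation because the CAS format carries a built-in check digit. A minimal sketch in Python (the validation rule is part of the CAS number format; the function name is ours):

```python
def valid_cas(cas: str) -> bool:
    """Validate a CAS Registry Number via its check digit.

    A CAS number has the form NNNNNNN-NN-R, where R is the check digit:
    the digits before R, read right to left, are weighted 1, 2, 3, ...
    and the weighted sum modulo 10 must equal R.
    """
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    body, check = parts[0] + parts[1], int(parts[2])
    weighted = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return weighted % 10 == check

print(valid_cas("7732-18-5"))  # water -- True
print(valid_cas("7732-18-4"))  # corrupted check digit -- False
```

A check like this catches transcription errors in registry numbers, but a full curation workflow must still cross-validate the number against the structure and chemical name.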
Free online access to experimental and predicted chemical properties through ... – Kamel Mansouri
The increasing number and size of public databases is facilitating the collection of chemical structures and associated experimental data for QSAR modeling. However, the performance of QSAR models is highly dependent not only on the modeling methodology, but also on the quality of the data used. In this study we developed robust QSAR models for endpoints of environmental interest with the aim of supporting the regulatory process. We used the publicly available PHYSPROP database, which includes a set of thirteen common physicochemical and environmental fate properties, including logP, melting point, Henry's law constant, and biodegradability, among others. Curation and standardization workflows were applied to use the highest-quality data and generate QSAR-ready structures. The developed models are in agreement with the five OECD principles that require QSARs to be simple and reliable. These models were applied to a set of ~700k chemicals to produce predictions for display on the EPA CompTox Chemistry Dashboard. In addition to the predictions, this free web and mobile application provides access to the experimental data used for training as well as detailed reports including general model performance, specific applicability domain and prediction accuracy, and the nearest neighboring structures used for prediction. The dashboard also provides access to model QMRFs (QSAR Model Reporting Format), downloadable PDFs containing additional details about the modeling approaches, the data, and molecular descriptor interpretation.
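The applicability-domain assessment and nearest-neighbour reporting mentioned above can be illustrated with a simple k-nearest-neighbours scheme: a query chemical is considered in-domain when its descriptor-space distance to the training set is small. This is a sketch only; the threshold and the actual domain indices used in practice are derived from the training-set distance distribution, not fixed by hand as here.

```python
import math

def knn_in_domain(train, query, k=3, threshold=1.0):
    """In-domain if the mean Euclidean distance from `query` to its
    k closest training points is at most `threshold`.
    `train` and `query` are descriptor vectors (tuples of floats)."""
    dists = sorted(math.dist(t, query) for t in train)
    return sum(dists[:k]) / k <= threshold

# Toy 2-D descriptor vectors (hypothetical values)
train = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.8), (0.2, 0.9)]
print(knn_in_domain(train, (0.4, 0.5)))   # close to the training data -- True
print(knn_in_domain(train, (8.0, 9.0)))   # far outside it -- False
```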
The iCSS CompTox Dashboard is a publicly accessible dashboard provided by the National Center for Computational Toxicology at the US-EPA. It serves a number of purposes, including providing a chemistry database underpinning many of our public-facing projects (e.g. ToxCast and ExpoCast). The available data and searches provide a valuable path to structure identification using mass spectrometry as the source data. With an underlying database of over 720,000 chemicals, the dashboard has already been used to assist in identifying chemicals present in house dust. However, it can also be applied to many other purposes, e.g., the identification of agrochemicals in waste streams. This presentation will provide a review of the EPA’s platform and underlying algorithms used for the purpose of compound identification using high-resolution mass spectrometry data. In order to examine its performance for structure identification, especially in terms of rank-ordering database hits, we have compared it with the ChemSpider database, a well-regarded public database that has become one of the community standards for structure identification. The study has shown that the CompTox Dashboard outperforms ChemSpider in terms of structure identification and ranking, providing improved outcomes for mass spectrometry analysis of “known unknowns”.
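At its core, mass-based database lookup matches a measured monoisotopic mass against candidate formulas within a parts-per-million tolerance. A minimal sketch (the element table covers only a few elements, and the ppm cutoff is illustrative, not the dashboard's setting):

```python
import re

# Monoisotopic masses for a few common elements (IUPAC values, truncated)
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915, "S": 31.972071}

def mono_mass(formula: str) -> float:
    """Monoisotopic mass of a simple formula like 'C8H10N4O2'."""
    return sum(MASS[el] * int(n or 1)
               for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula))

def match_mass(measured, candidates, ppm=5.0):
    """Return candidate formulas within `ppm` of the measured mass."""
    return [f for f in candidates
            if abs(mono_mass(f) - measured) / measured * 1e6 <= ppm]

# Caffeine (C8H10N4O2) has a monoisotopic mass of ~194.0804
print(match_mass(194.0804, ["C8H10N4O2", "C9H14N2O3", "C6H6O"]))
```

Real workflows additionally score isotope distributions and spacing, as noted elsewhere in this collection, before rank-ordering the surviving hits.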
As part of our efforts to develop a public platform to provide access to predictive models, we have attempted to disentangle the influence of the quality versus quantity of data available to develop and validate QSAR models. Starting from a thorough manual review of the data underlying the well-known EPI Suite software, we developed automated processes for the validation of the data using a KNIME workflow. This includes approaches to validate different chemical structure representations (e.g., molfile and SMILES) and identifiers (chemical names and registry numbers), and methods to standardize the data into QSAR-consumable formats for modeling. Our efforts to quantify and segregate data into various quality categories have allowed us to thoroughly investigate the resulting models developed from these data slices, as well as to examine whether or not efforts into the development of large high-quality datasets have the expected pay-off in terms of prediction performance. Machine-learning approaches have been applied to create a series of models that have been used to generate predicted physicochemical and environmental parameters for over 700,000 chemicals. These data are available online via the EPA’s iCSS Chemistry Dashboard. This abstract does not reflect U.S. EPA policy.
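The segregation of records into quality categories can be sketched as rule-based tiering: each record earns a tier according to which validation checks it passes. The field names and tier cutoffs below are hypothetical, not those of the actual KNIME workflow:

```python
def quality_tier(record: dict) -> str:
    """Assign a curation record to a coarse quality tier based on the
    number of validation checks it passes (illustrative scheme)."""
    checks = [
        record.get("structure_parses", False),       # SMILES/molfile is valid
        record.get("identifiers_agree", False),      # name/CAS/structure consistent
        not record.get("conflicting_values", True),  # one measured value per structure
    ]
    passed = sum(checks)
    return {3: "high", 2: "medium", 1: "low"}.get(passed, "reject")

rec = {"structure_parses": True, "identifiers_agree": True,
       "conflicting_values": False}
print(quality_tier(rec))  # high
```

Training models separately on each tier is what makes it possible to measure how much prediction performance the high-quality slice actually buys.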
Virtual screening of chemicals for endocrine disrupting activity through CER... – Kamel Mansouri
Endocrine disrupting chemicals (EDCs) are xenobiotics that mimic the interaction of natural hormones at the receptor level and alter synthesis, transport and metabolism pathways. The prospect of EDCs causing adverse health effects in humans and wildlife has led to the development of scientific and regulatory approaches for evaluating bioactivity. This need is being partially addressed by the use of high-throughput screening (HTS) in vitro approaches and computational modeling. In the framework of the Endocrine Disruptor Screening Program (EDSP), the U.S. EPA led two worldwide consortiums to “virtually” (i.e., in silico) screen chemicals for their potential estrogenic and androgenic activities. The Collaborative Estrogen Receptor (ER) Activity Prediction Project (CERAPP) [1] predicted activities for 32,464 chemicals and the Collaborative Modeling Project for Androgen Receptor (AR) Activity (CoMPARA) generated predictions on the CERAPP list with additional simulated metabolites, totaling 55,450 unique structures. Modelers and computational toxicology scientists from 30 international groups contributed structure-based models and results for activity prediction to one or both projects, with methods ranging from QSARs to docking to predict binding, agonism and antagonism activities. Models were based on a common training set of 1746 chemicals having ToxCast/Tox21 HTS in vitro assay results (18 assays for ER and 11 for AR) integrated into computational networks. The models were then validated using curated literature data from different sources (~7,000 results for ER and ~5,000 results for AR). To overcome the limitations of single approaches, CERAPP and CoMPARA models were each combined into consensus models reaching high predictive accuracy. These consensus models were extended beyond the initially designed datasets by implementing them into the free and open-source application OPERA to avoid running every single model on new chemicals [2]. 
This implementation was used to screen the entire EPA DSSTox database of ~750,000 chemicals, and the predicted ER and AR activities are made freely available on the CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard) [3].
Researchers at EPA’s National Center for Computational Toxicology integrate advances in biology, chemistry, and computer science to examine the toxicity of chemicals and help prioritize chemicals for further research based on potential human health risks. The goal of this research program is to evaluate thousands of chemicals quickly, at much reduced cost and in a shorter time frame relative to traditional approaches. The data generated by the Center includes characterization of thousands of chemicals across hundreds of high-throughput screening assays, consumer use and production information, pharmacokinetic properties, literature data, physical-chemical properties as well as the predictive computational modeling of toxicity and exposure. We have developed a number of databases and applications to deliver the data to the public, academic community, industry stakeholders, and regulators. This presentation will provide an overview of our work to develop an architecture that integrates diverse large-scale data from the chemical and biological domains, our approaches to disseminate these data, and the delivery of models supporting predictive computational toxicology. In particular, this presentation will review our new publicly-accessible CompTox Dashboard as the first application built on our newly developed architecture. This abstract does not reflect U.S. EPA policy.
Researchers at the EPA’s National Center for Computational Toxicology integrate advances in biology, chemistry, and computer science to examine the toxicity of chemicals and help prioritize chemicals for further research based on potential human health risks. The intention of this research program is to quickly evaluate thousands of chemicals for potential risk, but at much reduced cost relative to historical approaches. This work involves computational and data-driven approaches including high-throughput screening, modeling, text-mining and the integration of chemistry, exposure and biological data. We have developed a number of databases and applications that deliver on the vision of a deeper understanding of chemicals and their effects on exposure and biological processes, supporting a large community of scientists in their research efforts. This presentation will provide an overview of our work to bring together diverse large-scale data from the chemical and biological domains, our approaches to integrate and disseminate these data, and the delivery of models supporting computational toxicology. This abstract does not reflect U.S. EPA policy.
The iCSS CompTox Chemistry Dashboard is a publicly accessible dashboard provided by the National Center for Computational Toxicology at the US-EPA. It serves a number of purposes, including providing a chemistry database underpinning many of our public-facing projects (e.g. ToxCast and ExpoCast). The available data and searches provide a valuable path to structure identification using mass spectrometry as the source data. With an underlying database of over 720,000 chemicals, the dashboard has already been used to assist in identifying chemicals present in house dust. This poster reviews the benefits of the EPA’s platform and underlying algorithms used for the purpose of compound identification using high-resolution mass spectrometry data. Standard approaches for both mass and formula lookup are available, but the dashboard delivers a novel approach for hit ranking based on the functional use of the chemicals. The focus on high-quality data, novel ranking approaches and integration with other resources of value to mass spectrometrists makes the CompTox Dashboard a valuable resource for the identification of environmental chemicals. This abstract does not reflect U.S. EPA policy.
Tens of thousands of chemicals are currently in commerce, and hundreds more are introduced every year. Because current chemical testing is resource intensive, only a small fraction of chemicals have been adequately evaluated for potential human health effects. New technologies and computational tools have shown promise for closing this knowledge gap. In the U.S. EPA’s ToxCast effort, the use of ~700 high-throughput in vitro assays has broadly characterized the biological activity and potential mechanisms of ~1,800 chemicals. Coupling the high-throughput in vitro assays with additional in vitro pharmacokinetic assays and in vitro-to-in vivo extrapolation modeling allows conversion of in vitro bioactive concentrations to estimates of an administered dose (mg/kg/day). High-throughput exposure models are generating exposure estimates based on key aspects of chemical production, fate, transport, and personal use. The path for incorporating new approach methods and technologies for prioritization and assessment of chemical alternatives poses multiple scientific challenges. These challenges include sufficient coverage of toxicological mechanisms to meaningfully interpret negative test results, development of increasingly relevant test systems, computational modeling to integrate experimental data, characterizing uncertainty, and efficient validation of the test systems and computational models. The presentation will cover progress at the U.S. EPA in the development and application of these technologies and approaches in evaluating alternatives and systematically addressing each of these challenges. This abstract does not necessarily reflect U.S. EPA policy.
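In its simplest linear form, the in vitro-to-in vivo extrapolation (IVIVE) step described above reduces to a single scaling: divide the bioactive in vitro concentration by the steady-state plasma concentration produced by a unit dose. The sketch below assumes linear pharmacokinetics and ignores protein binding, clearance and population variability, all of which real workflows account for; the numbers are hypothetical.

```python
def admin_equiv_dose(ac50_uM: float, css_uM_per_mgkgday: float) -> float:
    """Reverse dosimetry under linear pharmacokinetics: convert an
    in vitro bioactive concentration (AC50, uM) into an administered
    equivalent dose (mg/kg/day) using the steady-state plasma
    concentration produced by a 1 mg/kg/day dose (Css)."""
    return ac50_uM / css_uM_per_mgkgday

# Hypothetical chemical: AC50 = 3 uM, Css = 1.5 uM per mg/kg/day
print(admin_equiv_dose(3.0, 1.5))  # 2.0 mg/kg/day
```

Comparing this administered equivalent dose against a high-throughput exposure estimate is what enables risk-based prioritization of chemicals.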
The iCSS Chemistry Dashboard is a publicly accessible dashboard provided by the National Center for Computational Toxicology at the US-EPA. It serves a number of purposes, including providing a chemistry database underpinning many of our public-facing projects (e.g. ToxCast and ExpoCast). The available data and searches provide a valuable path to structure identification using mass spectrometry as the source data. With an underlying database of over 720,000 chemicals, the dashboard has already been used to assist in identifying chemicals present in house dust. However, it can also be applied to many other purposes, e.g., the identification of agrochemicals in waste streams. This presentation will provide a review of the EPA’s platform and underlying algorithms used for the purpose of compound identification using high-resolution mass spectrometry data. We will also discuss progress towards a high-throughput non-targeted analysis platform for use by the mass spectrometry community. This abstract does not reflect U.S. EPA policy.
This presentation was given at Procter & Gamble in Mason, OH, in June 2017 in a symposium regarding "Big Data: Perspectives from Diverse Fields on How to Extract, Filter and Manipulate Large Data Sets to Answer Scientific Questions"
This presentation highlights known challenges with the production of high-quality chemical databases and outlines recent efforts made to address them. Specific examples illustrate these challenges within the U.S. Environmental Protection Agency (EPA) Computational Toxicology Program, including consolidating EPA’s ACToR and DSSTox databases, augmenting computed properties and list search features, and introducing quality metrics to assess confidence in chemical structure assignments across hundreds of thousands of chemical substance records. The past decade has seen enormous investment in the generation and release of data from studies of chemicals and their toxicological effects. There is, however, commonly little attention given to provenance and, more generally, to the quality of the data. The presentation emphasizes the importance of rigorous data review procedures, progress in web-based public access to accurate chemical data sets for use in predictive modeling, and the benefits these efforts will deliver to toxicologists as they embrace the “Big Data” era.
This abstract does not necessarily represent the views of the U.S. Environmental Protection Agency.
There is a growing need for rapid chemical screening and prioritization to inform regulatory decision-making on thousands of chemicals in the environment. We have previously used high-resolution mass spectrometry to examine household vacuum dust samples using liquid chromatography time-of-flight mass spectrometry (LC-TOF/MS). Using a combination of exact mass, isotope distribution, and isotope spacing, molecular features were matched with a list of chemical formulas from the EPA’s Distributed Structure-Searchable Toxicity (DSSTox) database. This has further developed our understanding of how openly available chemical databases, together with the appropriate searches, could be used for the purpose of compound identification. We report here on the utility of the EPA’s iCSS Chemistry Dashboard for the purpose of compound identification using searches against a database of over 720,000 chemicals. We also examine the benefits of QSAR prediction for the purpose of retention time prediction to allow for alignment of both chromatographic and mass spectral properties. This abstract does not reflect U.S. EPA policy.
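Combining exact-mass matching with QSAR-predicted retention times, as proposed above, can be sketched as a joint scoring of database candidates: hits outside the mass tolerance are discarded, and the remainder are ranked by a weighted sum of mass error and retention-time deviation. The scoring scheme, weights and data below are illustrative, not the dashboard's actual algorithm.

```python
def rank_candidates(measured_mass, measured_rt, candidates,
                    mass_tol_ppm=5.0, rt_weight=0.5):
    """Rank (name, monoisotopic_mass, predicted_rt) candidates against a
    measured mass (Da) and retention time (min): filter by ppm mass
    error, then sort by mass error plus weighted RT deviation."""
    scored = []
    for name, mass, pred_rt in candidates:
        ppm = abs(mass - measured_mass) / measured_mass * 1e6
        if ppm > mass_tol_ppm:
            continue  # outside the mass tolerance window
        score = ppm + rt_weight * abs(pred_rt - measured_rt)
        scored.append((score, name))
    return [name for _, name in sorted(scored)]

# Hypothetical candidates: (name, monoisotopic mass, predicted RT)
cands = [("A", 194.0805, 6.1), ("B", 194.0810, 9.8), ("C", 200.1000, 6.0)]
print(rank_candidates(194.0804, 6.2, cands))  # ['A', 'B']
```

The value of the retention-time term is exactly the chromatographic alignment the abstract describes: two formulas indistinguishable by mass alone can be separated by how well their predicted retention times match the observation.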
An examination of data quality on QSAR Modeling in regards to the environment... – Kamel Mansouri
Sustainable chemistry is the design and use of chemicals that minimize impacts to human health, ecosystems and the environment. To assess sustainability, chemicals must be evaluated not only for their toxicity to humans and other species, but also for environmental persistence and potential formation of toxic products as a result of biotic and abiotic transformations. Traditional approaches to evaluate these characteristics are resource intensive and normally lack biologically mechanistic information that might facilitate a “safety by design” approach. A more promising approach would exploit recent advances in high-throughput (HT) and high-content (HC) screening methods coupled with computational methods for data analysis and predictive modelling. The elements of a framework to assess sustainable chemistry could rely on integration of non-testing approaches such as (Q)SAR and read-across, coupled with prediction models derived from HT/HC methods anchored to biological pathways (e.g., Adverse Outcome Pathways). Acceptance and use of such integrated approaches necessitates a level of validation that demonstrates scientific confidence for specific decision contexts. Here we illustrate a scientific confidence framework for Tox21 approaches underpinned by a mechanistic basis, and illustrate how this will drive the development of enhanced non-testing approaches. This framework also focuses the development of prediction models that are hybrid in nature yet local in terms of their chemistry. Specific examples highlight how the extensive testing library within ToxCast was profiled with respect to its chemistry, resulting in new insights that direct strategic testing as well as the formulation of new predictive models, specifically SARs. This abstract does not necessarily reflect U.S. EPA policy.
Humans are potentially exposed to tens of thousands of man-made chemicals in the environment. It is well known that some environmental chemicals mimic natural hormones and thus have the potential to be endocrine disruptors. Most of these environmental chemicals have never been tested for their ability to disrupt the endocrine system, in particular, their ability to interact with the estrogen receptor. EPA needs tools to prioritize thousands of chemicals, for instance in the Endocrine Disruptor Screening Program (EDSP). This project was intended to be a demonstration of the use of predictive computational models on HTS data, including ToxCast and Tox21 assays, to prioritize a large chemical universe of 32,464 unique structures for one specific molecular target – the estrogen receptor. CERAPP combined multiple computational models for prediction of estrogen receptor activity, and used the predicted results to build a unique consensus model. Models were developed in collaboration between 17 groups in the U.S. and Europe and applied to predict the common set of chemicals. Structure-based techniques such as docking and several QSAR modeling approaches were employed, mostly using a common training set of 1677 compounds provided by U.S. EPA, to build a total of 42 classification models and 8 regression models for binding, agonist and antagonist activity. All predictions were evaluated on ToxCast data and on an external validation set collected from the literature. In order to overcome the limitations of single models, a consensus was built weighting models based on their prediction accuracy scores (including sensitivity and specificity against training and external sets). Individual model scores ranged from 0.69 to 0.85, showing high prediction reliabilities. The final consensus predicted 4,001 chemicals as actives, to be considered high priority for further testing, and 6,742 as suspicious chemicals. This abstract does not necessarily reflect U.S. EPA policy.
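The accuracy-weighted consensus described above can be sketched for binary activity calls: each model votes active/inactive, votes are weighted by the model's accuracy score, and the weighted fraction is thresholded. The weights and the 0.5 cutoff below are illustrative, not CERAPP's exact weighting scheme.

```python
def consensus(predictions, accuracies):
    """Accuracy-weighted consensus over binary classifiers.

    `predictions` are 0/1 activity calls from individual models and
    `accuracies` their accuracy scores; returns the consensus call
    and the weighted activity score."""
    total = sum(accuracies)
    score = sum(p * w for p, w in zip(predictions, accuracies)) / total
    return (1 if score >= 0.5 else 0), round(score, 3)

# Three models call a chemical active/active/inactive,
# with accuracy scores spanning the 0.69-0.85 range cited above
print(consensus([1, 1, 0], [0.85, 0.78, 0.69]))
```

Weighting by accuracy lets strong models outvote weak ones, which is how a consensus can exceed the reliability of any single contributing model.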
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Kamel Mansouri
Endocrine disrupting chemicals (EDCs) are xenobiotics that mimic the interaction of natural hormones at the receptor level and alter synthesis, transport and metabolism pathways. The prospect of EDCs causing adverse health effects in humans and wildlife has led to the development of scientific and regulatory approaches for evaluating bioactivity. This need is being partially addressed by the use of high-throughput screening (HTS) in vitro approaches and computational modeling. In the framework of the Endocrine Disruptor Screening Program (EDSP), the U.S. EPA led two worldwide consortia to “virtually” (i.e., in silico) screen chemicals for their potential estrogenic and androgenic activities. The Collaborative Estrogen Receptor (ER) Activity Prediction Project (CERAPP) [1] predicted activities for 32,464 chemicals and the Collaborative Modeling Project for Androgen Receptor (AR) Activity (CoMPARA) generated predictions on the CERAPP list with additional simulated metabolites, totaling 55,450 unique structures. Modelers and computational toxicology scientists from 30 international groups contributed structure-based models and results for activity prediction to one or both projects, with methods ranging from QSARs to docking to predict binding, agonism and antagonism activities. Models were based on a common training set of 1746 chemicals having ToxCast/Tox21 HTS in vitro assay results (18 assays for ER and 11 for AR) integrated into computational networks. The models were then validated using curated literature data from different sources (~7,000 results for ER and ~5,000 results for AR). To overcome the limitations of single approaches, the CERAPP and CoMPARA models were each combined into consensus models reaching high predictive accuracy. These consensus models were extended beyond the initially designed datasets by implementing them in the free and open-source application OPERA, avoiding the need to run every single model on new chemicals [2].
This implementation was used to screen the entire EPA DSSTox database of ~750,000 chemicals, and the predicted ER and AR activities are freely available on the CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard) [3].
Learn how large-scale normalized data empowers the critical early phases of drug discovery.
To address core concerns about data quality, comprehensiveness and comparability, the Reaxys product team has developed a completely new repository for bioactivity information. Reaxys Medicinal Chemistry stands as a unique source of normalized data on in vitro efficacy, in vivo animal models, compound metabolism, pharmacokinetics and toxicity. This presentation looks at how this approach to data supports critical early discovery methods such as in silico screening and target profiling.
Developing tools for high resolution mass spectrometry-based screening via th...Andrew McEachran
Non-targeted and suspect screening studies using high resolution mass spectrometry (HRMS) have revolutionized how chemicals are detected in complex matrices. However, data processing remains challenging due to the vast number of chemicals detected in samples, software and computational requirements of data processing, and inherent uncertainty in confidently identifying chemicals from candidate lists. The US EPA has developed functionality within the CompTox Chemistry Dashboard (https://comptox.epa.gov) to address challenges related to data processing and analysis in HRMS. These tools include the generation of “MS-Ready” structures to optimize database searching, retention time prediction for candidate reduction, consensus ranking using chemical metadata, and in silico MS/MS fragmentation prediction for spectral matching. Combining these tools into a comprehensive workflow improves certainty in candidate identification. This presentation will introduce the tools and combined workflow including visualization and access via the Chemistry Dashboard. These tools, data, and visualization approaches within an open chemistry resource provide a freely available software tool to support structure identification and non-targeted analysis. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
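At the core of database searching against HRMS data is matching an observed exact mass to candidate molecular formulas within a ppm tolerance. The sketch below is a generic illustration of that step, not the Dashboard's actual implementation; the element table is limited to a few common elements and the formula parser handles only simple Hill-notation formulas:

```python
import re

# Monoisotopic masses (u) for a few common elements
MASS = {"C": 12.0, "H": 1.0078250319, "N": 14.0030740052,
        "O": 15.9949146221, "S": 31.97207069}

def monoisotopic_mass(formula: str) -> float:
    """Sum monoisotopic element masses for a simple formula like C8H10N4O2."""
    return sum(MASS[el] * int(n or 1)
               for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula))

def match_candidates(observed_mass: float, candidates, ppm: float = 5.0):
    """Return candidate formulas whose monoisotopic mass lies within
    the given ppm tolerance of the observed neutral mass."""
    hits = []
    for formula in candidates:
        m = monoisotopic_mass(formula)
        if abs(m - observed_mass) / m * 1e6 <= ppm:
            hits.append(formula)
    return hits

# An observed neutral mass of 194.0804 matches caffeine's formula
hits = match_candidates(194.0804, ["C8H10N4O2", "C9H14N2O2"])
```

In practice this mass lookup is only the first filter; the Dashboard's workflow then reduces and ranks the candidate list using retention time prediction, metadata-based consensus ranking and in silico fragmentation.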
SETAC Rome Non-Target Screening For Chemical DiscoveryEmma Schymanski
SETAC Special Session: Solutions for emerging pollutants – Towards a holistic chemical quality status assessment in European freshwater resources
Non-target Screening for Holistic Chemical Monitoring and Compound Discovery: Open Science, Real-time and Retrospective Approaches
Emma L. Schymanski [1*], Reza Aalizadeh [2], Nikiforos Alygizakis [3], Juliane Hollender [4], Martin Krauss [5], Tobias Schulze [5], Jaroslav Slobodnik [3], Nikolaos S. Thomaidis [2] and Antony J. Williams [6]
[1] Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
[2] National and Kapodistrian University of Athens, Department of Chemistry, Athens, Greece.
[3] Environmental Institute, Koš, Slovak Republic
[4] Eawag: Swiss Federal Institute for Aquatic Science and Technology, Dübendorf, Switzerland
[5] UFZ – Helmholtz Centre for Environmental Research, Leipzig, Germany
[6] National Center for Computational Toxicology, US EPA, Research Triangle Park, Durham, NC, USA
*presenting author: emma.schymanski@uni.lu
Keywords: Non-target screening, Open Science, Mass Spectrometry, Monitoring
Non-target screening (NTS) with high resolution mass spectrometry (HR-MS) provides opportunities to discover chemicals, their dynamics and their effects on the environment far beyond the current 45 “priority pollutants” or even “known” chemicals. Open science and the exchange of information (for example, between scientists and regulatory authorities) have a critical role to play in the continuing evolution of NTS. Using a variety of case studies from Europe, this talk will highlight how open science activities such as MassBank.EU (https://massbank.eu), the NORMAN Suspect Exchange (http://www.norman-network.com/?q=node/236) and the NORMAN Digital Sample Freezing Platform (http://norman-data.eu), as well as the US EPA CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard/), can support NTS. Further, it will show how initiatives such as near “real time” monitoring of the River Rhine and retrospective screening via so-called “digital freezing” platforms have opened up new potential for exploring the dynamics and distribution even of as-yet-unidentified chemicals. Collaborative European and international activities facilitate data exchange amongst analytical data scientists and enable quick, effective and reproducible provisional compound identification in digitally archived HR-MS data. This is leading to new ways of assessing and prioritizing the next generation of “emerging pollutants” in the environment, enabling a pro-active approach to environmental assessment unthinkable only a few years ago. Note: This abstract does not reflect US EPA policy.
As part of our efforts to develop a public platform to provide access to predictive models, we have attempted to disentangle the influence of the quality versus quantity of data available to develop and validate QSAR models. Building on a thorough manual review of the data underlying the well-known EPI Suite software, we developed automated processes for validating the data using a KNIME workflow. This includes approaches to validate different chemical structure representations (e.g., molfile and SMILES) and identifiers (chemical names and registry numbers), and methods to standardize the data into QSAR-consumable formats for modeling. Our efforts to quantify and segregate data into various quality categories have allowed us to thoroughly investigate the resulting models developed from these data slices, as well as to examine whether or not efforts invested in the development of large, high-quality datasets have the expected pay-off in terms of prediction performance. Machine-learning approaches have been applied to create a series of models that have been used to generate predicted physicochemical and environmental parameters for over 700,000 chemicals. These data are available online via the EPA’s iCSS Chemistry Dashboard. This abstract does not reflect U.S. EPA policy.
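One identifier check of the kind used in such curation workflows can be shown concretely. The sketch below (Python rather than the KNIME workflow described above, and purely illustrative) validates CAS Registry Numbers using their published check-digit rule, a test that can flag corrupted or mismatched registry numbers before records enter a training set:

```python
import re

def valid_cas(casrn: str) -> bool:
    """Validate a CAS Registry Number via its check digit.

    Format is 2-7 digits, a hyphen, 2 digits, a hyphen, 1 check digit.
    The check digit equals the sum of the preceding digits, each
    multiplied by its position counted from the right, modulo 10.
    """
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", casrn)
    if not m:
        return False
    digits = m.group(1) + m.group(2)
    check = int(m.group(3))
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == check

print(valid_cas("50-00-0"))  # formaldehyde -> True
print(valid_cas("50-00-1"))  # corrupted check digit -> False
```

A record whose registry number fails this test (or whose name, SMILES and molfile resolve to different structures) would be routed to a lower quality category rather than silently used for modeling.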
Researchers at EPA’s National Center for Computational Toxicology integrate advances in biology, chemistry, and computer science to examine the toxicity of chemicals and help prioritize chemicals for further research based on potential human health risks. The goal of this research program is to quickly evaluate thousands of chemicals, but at a much reduced cost and shorter time frame relative to traditional approaches. The data generated by the Center includes characterization of thousands of chemicals across hundreds of high-throughput screening assays, consumer use and production information, pharmacokinetic properties, literature data, physical-chemical properties as well as the predictive computational modeling of toxicity and exposure. We have developed a number of databases and applications to deliver the data to the public, academic community, industry stakeholders, and regulators. This presentation will provide an overview of our work to develop an architecture that integrates diverse large-scale data from the chemical and biological domains, our approaches to disseminate these data, and the delivery of models supporting predictive computational toxicology. In particular, this presentation will review our new publicly-accessible CompTox Dashboard as the first application built on our newly developed architecture. This abstract does not reflect U.S. EPA policy.
Researchers at the EPA’s National Center for Computational Toxicology integrate advances in biology, chemistry, and computer science to examine the toxicity of chemicals and help prioritize chemicals for further research based on potential human health risks. The intention of this research program is to quickly evaluate thousands of chemicals for potential risk but with much reduced cost relative to historical approaches. This work involves computational and data driven approaches including high-throughput screening, modeling, text-mining and the integration of chemistry, exposure and biological data. We have developed a number of databases and applications that are delivering on the vision of developing a deeper understanding of chemicals and their effects on exposure and biological processes that are supporting a large community of scientists in their research efforts. This presentation will provide an overview of our work to bring together diverse large scale data from the chemical and biological domains, our approaches to integrate and disseminate these data, and the delivery of models supporting computational toxicology. This abstract does not reflect U.S. EPA policy.
The iCSS CompTox Chemistry Dashboard is a publicly accessible dashboard provided by the National Center for Computational Toxicology at the US EPA. It serves a number of purposes, including providing a chemistry database underpinning many of our public-facing projects (e.g., ToxCast and ExpoCast). The available data and searches provide a valuable path to structure identification using mass spectrometry as the source data. With an underlying database of over 720,000 chemicals, the dashboard has already been used to assist in identifying chemicals present in house dust. This poster reviews the benefits of the EPA’s platform and the underlying algorithms used for compound identification using high-resolution mass spectrometry data. Standard approaches for both mass and formula lookup are available, but the dashboard also delivers a novel approach for hit ranking based on the functional use of the chemicals. The focus on high-quality data, novel ranking approaches and integration with other resources of value to mass spectrometrists makes the CompTox Dashboard a valuable resource for the identification of environmental chemicals. This abstract does not reflect U.S. EPA policy.
Tens of thousands of chemicals are currently in commerce, and hundreds more are introduced every year. Because current chemical testing is resource intensive, only a small fraction of chemicals have been adequately evaluated for potential human health effects. New technologies and computational tools have shown promise for closing this knowledge gap. In the U.S. EPA’s ToxCast effort, the use of ~700 high-throughput in vitro assays has broadly characterized the biological activity and potential mechanisms of ~1,800 chemicals. Coupling the high-throughput in vitro assays with additional in vitro pharmacokinetic assays and in vitro-to-in vivo extrapolation modeling allows conversion of in vitro bioactive concentrations to estimates of an administered dose (mg/kg/day). High-throughput exposure models are generating exposure estimates based on key aspects of chemical production, fate, transport, and personal use. The path for incorporating new approach methods and technologies for prioritization and assessment of chemical alternatives poses multiple scientific challenges. These challenges include sufficient coverage of toxicological mechanisms to meaningfully interpret negative test results, development of increasingly relevant test systems, computational modeling to integrate experimental data, characterization of uncertainty, and efficient validation of the test systems and computational models. The presentation will cover progress at the U.S. EPA in developing and applying these technologies and approaches to evaluating alternatives and systematically addressing each of these challenges. This abstract does not necessarily reflect U.S. EPA policy.
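In its simplest form, the in vitro-to-in vivo extrapolation step is reverse dosimetry under a linear-pharmacokinetics assumption: the administered equivalent dose is the in vitro bioactive concentration divided by the steady-state plasma concentration produced per unit dose. The sketch below uses hypothetical numbers; a real IVIVE calculation would also account for plasma protein binding, clearance and population variability:

```python
def administered_equivalent_dose(ac50_um: float,
                                 css_um_per_mgkgday: float) -> float:
    """Reverse dosimetry: convert an in vitro bioactive concentration
    (AC50, in uM) to an administered equivalent dose (mg/kg/day),
    assuming steady-state plasma concentration (Css) scales linearly
    with the administered dose."""
    return ac50_um / css_um_per_mgkgday

# Hypothetical chemical: AC50 = 3.0 uM, Css = 1.5 uM per 1 mg/kg/day
aed = administered_equivalent_dose(3.0, 1.5)  # -> 2.0 mg/kg/day
```

Comparing such an administered equivalent dose against a model-based exposure estimate gives the margin used for prioritization.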
The iCSS Chemistry Dashboard is a publicly accessible dashboard provided by the National Center for Computational Toxicology at the US EPA. It serves a number of purposes, including providing a chemistry database underpinning many of our public-facing projects (e.g., ToxCast and ExpoCast). The available data and searches provide a valuable path to structure identification using mass spectrometry as the source data. With an underlying database of over 720,000 chemicals, the dashboard has already been used to assist in identifying chemicals present in house dust. However, it can also be applied to many other purposes, e.g., the identification of agrochemicals in waste streams. This presentation will provide a review of the EPA’s platform and the underlying algorithms used for compound identification using high-resolution mass spectrometry data. We will also discuss progress towards a high-throughput non-targeted analysis platform for use by the mass spectrometry community. This abstract does not reflect U.S. EPA policy.
This presentation was given at Proctor and Gamble in Mason, OH, in June 2017 in a symposium regarding "Big Data: Perspectives from Diverse Fields on How to Extract, Filter and Manipulate Large Data Sets to Answer Scientific Questions"
This presentation highlights known challenges with the production of high-quality chemical databases and outlines recent efforts made to address these challenges. Specific examples will be provided illustrating these challenges within the U.S. Environmental Protection Agency (EPA) Computational Toxicology Program. This includes consolidating EPA’s ACToR and DSSTox databases, augmenting computed properties and list search features, and introducing quality metrics to assess confidence in chemical structure assignments across hundreds of thousands of chemical substance records. The past decade has seen enormous investments in the generation and release of data from studies of chemicals and their toxicological effects. There is, however, commonly little attention given to provenance and, more generally, to the quality of the data. The presentation will emphasize the importance of rigorous data review procedures, progress in web-based public access to accurate chemical data sets for use in predictive modeling, and the benefits these efforts will deliver in helping toxicologists embrace the “Big Data” era.
This abstract does not necessarily represent the views of the U.S. Environmental Protection Agency.
There is a growing need for rapid chemical screening and prioritization to inform regulatory decision-making on thousands of chemicals in the environment. We have previously used high-resolution mass spectrometry to examine household vacuum dust samples using liquid chromatography time-of-flight mass spectrometry (LC-TOF/MS). Using a combination of exact mass, isotope distribution, and isotope spacing, molecular features were matched with a list of chemical formulas from the EPA’s Distributed Structure-Searchable Toxicity (DSSTox) database. This has further developed our understanding of how openly available chemical databases, together with the appropriate searches, could be used for the purpose of compound identification. We report here on the utility of the EPA’s iCSS Chemistry Dashboard for the purpose of compound identification using searches against a database of over 720,000 chemicals. We also examine the benefits of QSAR prediction for the purpose of retention time prediction to allow for alignment of both chromatographic and mass spectral properties. This abstract does not reflect U.S. EPA policy.
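Retention time prediction of the kind mentioned can, in its simplest form, be a regression of observed retention time on a hydrophobicity descriptor for a calibration set, with the fitted line then used to screen candidate structures. The sketch below is a generic ordinary-least-squares illustration with made-up calibration data, not the actual QSAR model used:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Hypothetical calibration set: (logP, observed retention time in minutes)
logp = [1.0, 2.0, 3.0, 4.0]
rt = [4.1, 6.0, 8.1, 9.8]
a, b = fit_line(logp, rt)

def predict_rt(logp_value: float) -> float:
    """Predict a retention time from a candidate's logP."""
    return a + b * logp_value
```

Candidates whose predicted retention time falls far from the observed one can then be down-ranked, aligning chromatographic evidence with the mass spectral match.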
An examination of data quality on QSAR Modeling in regards to the environment...Kamel Mansouri
The development of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to discriminate the influence of the quality versus quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago and, specifically, on the PHYSPROP dataset used to train the EPISuite prediction models. This presentation will review our approaches to examining key datasets, the delivery of curated data and the development of machine-learning models for thirteen separate property endpoints of interest to environmental science. We will also review how these data will be made freely accessible to the community via a new “chemistry dashboard”. This abstract does not reflect U.S. EPA policy.
Green chemistry in chemical reactions: informatics by designAlex Clark
Chemical informatics technology can be of assistance to chemists for describing reactions in numerous ways, including calculating green chemistry metrics such as process mass intensity, E-factor and atom economy. To facilitate this, chemical reactions have to be described in more precise detail than is the norm for most chemists. There are also numerous practical ways to add more green chemistry functionality to lab notebooks, such as enumerating searchable reaction transforms for environmentally favourable reactions, automatically looking up toxicity and hazard information, and others which are mentioned in the slides.
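The three metrics named above are simple mass ratios; a minimal Python sketch with illustrative numbers (not values from the talk) might look like:

```python
# Green chemistry metrics as simple mass ratios (illustrative sketch).
def atom_economy(product_mw, reactant_mws):
    """Atom economy (%) = MW(product) / sum of MW(reactants) * 100."""
    return 100.0 * product_mw / sum(reactant_mws)

def e_factor(total_waste_mass, product_mass):
    """E-factor = mass of waste generated per mass of product."""
    return total_waste_mass / product_mass

def process_mass_intensity(total_input_mass, product_mass):
    """PMI = total mass entering the process per mass of product."""
    return total_input_mass / product_mass
```

These functions only require that masses be reported in consistent units; computing them automatically is exactly where an electronic lab notebook needs fully specified reaction records.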
This presentation was given at the Green Chemistry & Engineering conference in 2015 (American Chemical Society Green Chemistry Institute).
The Data Driven University - Automating Data Governance and Stewardship in Au... – Pieter De Leenheer
Data Governance and Stewardship requires automation of business semantics management at its core in order to achieve data trust between the business and IT communities in the organization. University divisions operate highly autonomously, are decentralized, and are often geographically distributed. Hence, they benefit from a collaborative and agile approach to Data Governance and Stewardship that adapts to this nature.
In this lecture, we start by reviewing the 'C' in ICT and reflect on a dilemma: what is the most important quality of shared data, truth or trust? We review the wide spectrum of business semantics. We visit the different phases of growing data pain as an organization expands, and we map each phase onto this spectrum of semantics.
Next, we introduce our principles and framework for business semantics management to support Data Governance and Stewardship focusing on the structural (what), processual (how) and organizational (who) components. We illustrate with use cases from Stanford University, George Washington University and Public Science and Innovation Administrations.
[Study] Tech entrepreneurs: who are they? – FrenchWeb.fr
Who are the tech entrepreneurs in France? How did they develop their projects? What are their aspirations? At what point do they take the leap? FrenchWeb set out to understand who France's digital entrepreneurs are.
•U.S. Congress mandated that the EPA screen chemicals for their potential to be endocrine disruptors
•Led to development of the Endocrine Disruptor Screening Program (EDSP)
•Initial focus was on environmental estrogens, but program expanded to include androgens and thyroid pathway disruptors
CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity – Kamel Mansouri
In order to protect human health from chemicals that can mimic natural hormones, the U.S. Congress mandated the U.S. EPA to screen chemicals for their potential to be endocrine disruptors through the Endocrine Disruptor Screening Program (EDSP). However, the number of chemicals to which humans are exposed is too large (tens of thousands) to be accommodated by the EDSP Tier 1 battery, so combinations of in vitro high-throughput screening (HTS) assays and computational models are being developed to help prioritize chemicals for more detailed testing. Previously, CERAPP (Collaborative Estrogen Receptor Activity Prediction Project) demonstrated the effectiveness of combining many QSAR models trained on HTS data to prioritize a large chemical list for estrogen receptor activity. The limitations of single models were overcome by combining all models built by the consortium into consensus predictions. CoMPARA is a larger-scale collaboration between 35 international groups, following the steps of CERAPP to model androgen receptor activity using a common training set of 1746 compounds provided by U.S. EPA. Eleven HTS ToxCast/Tox21 in vitro assays were integrated into a computational network model to detect true AR activity. Bootstrap uncertainty quantification was used to remove potential false positives/negatives. Reference chemicals (158) from the literature were used to validate the model, which showed 95.2% and 97.5% balanced accuracies for AR agonists and antagonists, respectively. A library of ~80k chemical structures, including ~11k chemicals curated from PubChem literature data using ScrubChem tools, was integrated with CoMPARA's consensus predictions, which combined several structure-based and QSAR modeling approaches. The results of this project will be used to prioritize a large set of more than 50k chemicals for further testing over the next phases of ToxCast/Tox21, among other projects. This work does not reflect the official policy of any federal agency.
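Balanced accuracy, the validation statistic quoted above, is the mean of sensitivity and specificity over binary labels; a minimal sketch of its computation (the labels below are illustrative, not the CoMPARA reference data):

```python
# Balanced accuracy for binary classification: mean of sensitivity (TPR)
# and specificity (TNR). Labels: 1 = active, 0 = inactive.
def balanced_accuracy(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn)   # recall on actives
    specificity = tn / (tn + fp)   # recall on inactives
    return (sensitivity + specificity) / 2
```

Unlike raw accuracy, this statistic is not inflated when one class (here, inactives) dominates the reference set.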
There is a growing need for rapid chemical screening and prioritization to inform regulatory decision-making on thousands of chemicals in the environment. We have previously used high-resolution mass spectrometry to examine household vacuum dust samples using liquid chromatography time-of-flight mass spectrometry (LC-TOF/MS). Using a combination of exact mass, isotope distribution, and isotope spacing, molecular features were matched with a list of chemical formulas from the EPA’s Distributed Structure-Searchable Toxicity (DSSTox) database. This has further developed our understanding of how openly available chemical databases, together with the appropriate searches, could be used for the purpose of compound identification. We report here on the utility of the EPA’s iCSS Chemistry Dashboard for the purpose of compound identification using searches against a database of over 720,000 chemicals. We also examine the benefits of QSAR prediction for the purpose of retention time prediction to allow for alignment of both chromatographic and mass spectral properties. This abstract does not reflect U.S. EPA policy.
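The exact-mass matching step described above can be sketched as a ppm-tolerance search over candidate formulas; the element masses and the candidate list below are illustrative, not the DSSTox workflow itself:

```python
# Sketch: match an observed monoisotopic mass against candidate formulas
# within a ppm tolerance. Monoisotopic masses for a few common elements.
MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def monoisotopic_mass(formula_counts):
    """Sum element monoisotopic masses, e.g. {'C': 8, 'H': 10, ...}."""
    return sum(MONO[el] * n for el, n in formula_counts.items())

def match(observed_mass, candidates, ppm=5.0):
    """Return names of candidate formulas within `ppm` of the observed mass."""
    hits = []
    for name, counts in candidates.items():
        m = monoisotopic_mass(counts)
        if abs(observed_mass - m) / m * 1e6 <= ppm:
            hits.append(name)
    return hits
```

In practice the candidate list would come from a database query, and isotope distribution and spacing (as described above) would be used to filter the remaining hits.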
Using open bioactivity data for developing machine-learning prediction models... – Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 22, 2018).
==== Abstract ====
The retinoid X receptor (RXR) is a nuclear hormone receptor that functions as a transcription factor with roles in development, cell differentiation, metabolism, and cell death. Chemicals that interfere with the RXR signaling pathway may cause adverse effects on human health. In this study, open bioactivity data available at PubChem (https://pubchem.ncbi.nlm.nih.gov) were used to develop prediction models for chemical modulators of RXR-alpha, which is a subtype of RXR that plays a role in metabolic signaling pathways, dermal cysts, cardiac development, insulin sensitization, etc. The models were constructed from quantitative high-throughput screening (qHTS) data from the Tox21 project, using various supervised machine learning methods (including support vector machine, random forest, neural network, k-nearest neighbors, decision tree, and naïve Bayes). The performance of the models was evaluated with an external data set containing bioactivity data submitted by ChEMBL and the NCATS Chemical Genomics Center (NCGC). This study showcases how open data in the public domain can be used to develop prediction models for chemical toxicity.
The ionization state of a chemical, reflected in its pKa values, affects lipophilicity, solubility, protein binding and the ability of a chemical to cross the plasma membrane. These properties govern pharmacokinetic parameters such as absorption, distribution, metabolism, excretion and toxicity; pKa is thus a fundamental chemical property used in many models of chemical toxicity.
Experimentally determining pKa is not feasible for high-throughput assays. Predicting pKa is challenging: existing models have been developed only on restricted chemical space (e.g., anilines, phenols, benzoic acids, primary amines), and the lack of a generalized model impedes ADME modeling.
No free and open-source models exist for heterogeneous chemical classes, although several proprietary programs do. In this work, pKa open data bundled with DataWarrior (http://www.openmolecules.org/) were used to develop predictive models for pKa. After data cleaning, there were ~3100 and ~3900 monoprotic chemicals with an acidic or basic pKa, respectively. 1D and 2D chemical descriptors (AlogP, topological polar surface area, etc.) in addition to 12 fingerprints (presence or absence of chemical groups) were generated using PaDEL software. Three datasets were used: acidic, basic, and acidic and basic combined.
Thirteen feature sets were examined: the 1D/2D descriptors and the 12 fingerprint types. Using the Extreme Gradient Boosting algorithm, the MACCS and Substructure Count fingerprints yielded the best results, with models showing an R-squared of ~0.78 and an RMSE of 1.42.
Recently, Deep Learning models have shown remarkable progress in image recognition and natural language processing. To determine whether Deep Learning algorithms would increase model performance, we examined the same datasets and found that the Deep Learning models were somewhat superior to Extreme Gradient Boosting, with an R-squared of ~0.80 and an RMSE of ~1.38.
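The R-squared and RMSE figures quoted for these models are standard regression metrics; a stdlib-only sketch of how they are computed from predicted versus experimental pKa values (the study's Extreme Gradient Boosting and Deep Learning models themselves are not reproduced here):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between experimental and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

An RMSE of ~1.4 pKa units means predictions are, on average, within roughly 1.4 log units of the measured value.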
This work does not reflect U.S. EPA policy.
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi... – Kamel Mansouri
AAAS annual meeting (Boston, Feb 2017)
Humans are potentially exposed to tens of thousands of man-made chemicals in the environment. It is well known that some environmental chemicals mimic natural hormones and thus have the potential to be endocrine disruptors. Most of these environmental chemicals have never been tested for their ability to disrupt the endocrine system, in particular, their ability to interact with the estrogen receptor. EPA needs tools to prioritize thousands of chemicals, for instance in the Endocrine Disruptor Screening Program (EDSP). The Collaborative Estrogen Receptor Activity Prediction Project (CERAPP) was intended to demonstrate the use of predictive computational models on HTS data, including ToxCast and Tox21 assays, to prioritize a large chemical universe of 32,464 unique structures for one specific molecular target: the estrogen receptor. CERAPP combined multiple computational models for prediction of estrogen receptor activity and used the predicted results to build a unique consensus model. Models were developed in collaboration between 17 groups in the U.S. and Europe and applied to predict the common set of chemicals. Structure-based techniques such as docking and several QSAR modeling approaches were employed, mostly using a common training set of 1677 compounds provided by U.S. EPA, to build a total of 42 classification models and 8 regression models for binding, agonist and antagonist activity. All predictions were evaluated against ToxCast data and an external validation set collected from the literature. To overcome the limitations of single models, a consensus was built by weighting models based on their prediction accuracy scores (including sensitivity and specificity against the training and external sets). Individual model scores ranged from 0.69 to 0.85, showing high prediction reliability. The final consensus predicted 4001 chemicals as actives to be considered high priority for further testing, and 6742 as suspicious chemicals.
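The accuracy-weighted consensus described above can be sketched as a weighted vote over the individual models' binary calls; the weights and predictions below are made-up illustrations, not the actual CERAPP model scores:

```python
# Illustrative accuracy-weighted consensus over binary model predictions
# (1 = active, 0 = inactive). Each model's vote is weighted by its
# accuracy score; the consensus call uses a simple threshold.
def weighted_consensus(predictions, weights, threshold=0.5):
    score = sum(w * p for w, p in zip(weights, predictions)) / sum(weights)
    label = 1 if score >= threshold else 0
    return label, score
```

A chemical flagged active by the more accurate models thus outvotes a disagreement from weaker ones, which is what lets the consensus outperform any single model.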
The same approach is now being applied in a larger-scale project to predict the potential androgen receptor (AR) activity of chemicals. This project, called CoMPARA (Collaborative Modeling Project for Androgen Receptor Activity), is a collaboration between 35 international groups working on a common set of ~55k chemicals.
This abstract does not necessarily reflect U.S. EPA policy.
The EPA CompTox Chemistry Dashboard provides access to data associated with ~760,000 chemical substances. The available data includes experimental and predicted physicochemical properties, environmental fate and transport data, in vivo and in silico toxicity data, in vitro bioassay data, exposure data and a variety of other types of information. The data are under continuous expansion and curation and the experimental data have been used to develop QSAR and QSPR models. A number of these models are available via a web interface so that users can submit a chemical structure and predict properties in real time. The dashboard also provides access to pre-compiled chemical lists and categories, including pesticides, and chemicals detected in the environment via non-targeted mass spectrometry analysis. The data are searchable using chemical identifiers (systematic names, trade names, CAS Registry Numbers), by structure, mass and formula. Batch searches allow for data associated with thousands of chemicals to be obtained in a few seconds, with just a few button clicks, and downloaded to the desktop. This presentation will provide an overview of the Dashboard and its applications to accessing source data associated with agriculturally related chemicals. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
This presentation was given at a Triangle Area Mass Spectrometry meeting on 01/29/2019 in Research Triangle Park, North Carolina to provide a general overview of the CompTox Chemicals Dashboard to an audience of mass spectrometrists and people interested in the capabilities of the dashboard for chemical forensics, structure identification etc.
Presentation for Texas A&M Superfund Research Center virtual learning series, Big Data in Environmental Science and Toxicology. More details at https://superfund.tamu.edu/big-data-session-2-aug-18-2021/
This presentation was given at the ASMS Sanibel Conference "Unraveling the Exposome" and provided a general overview of the dashboard and how it integrates with many of the projects that we support, with a special focus on list generation, mass and formula searching based on MS-Ready structures and some of the prototypes that we have been developing to support non-targeted analysis.
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are advancing the identification of emerging contaminants in environmental matrices, improving the means by which exposure analyses can be conducted. However, confidence in structure identification of unknowns in NTA presents challenges to analytical chemists. Structure identification requires integration of complementary data types such as reference databases (commercial or open), fragmentation prediction tools, and retention time prediction models. One goal of our research is to optimize and implement structure identification functionality within the US EPA’s CompTox Chemicals Dashboard, an open chemistry resource and web application containing data for ~900,000 substances. Database searching using mass or formula-based inputs has been optimized for structure identification using “MS-Ready Structures”: de-salted, stripped of stereochemistry, and mixture-separated to replicate the form of a chemical observed via HRMS. Functionality to conduct batch searching of molecular formulae and monoisotopic masses has also been implemented. While the increasing number of free online databases is of value in supporting chemical structure verification and elucidation, there are known issues regarding data quality, and careful data curation is a necessary part of the development of these resources. This presentation will provide an overview of our latest enhancements to the dashboard to support mass spectrometry, the incorporation of specific datasets (e.g. to support breath research and household dust analysis), and the value of metadata and predicted fragmentation spectral matching to support structure identification. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
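The "MS-Ready" processing steps (mixture separation, desalting, stereochemistry removal) can be illustrated with a deliberately simplified string-level sketch; real pipelines operate on chemical structures with a cheminformatics toolkit rather than on SMILES text:

```python
# Simplified sketch of "MS-Ready" processing on a SMILES string:
# keep the largest component of a dot-separated mixture/salt and
# strip stereo markers. Illustrative only; production pipelines use
# a cheminformatics toolkit, not text manipulation.
def ms_ready(smiles: str) -> str:
    # mixture separation / desalting: keep the largest covalent component
    fragment = max(smiles.split("."), key=len)
    # stereochemistry removal: drop bond (/ \) and atom (@) stereo markers
    for marker in ("/", "\\", "@@", "@"):
        fragment = fragment.replace(marker, "")
    return fragment
```

The point of the transformation is that HRMS observes the neutral, desalted, stereochemistry-blind form of a chemical, so database records must be mapped to that same form before mass or formula matching.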
Communication of chemistry in the internet era, while it has improved, remains challenged in terms of the exchange of data in a lossless fashion. While there are moves afoot within the publishing industry to produce “data journals”, including embracing some of the new approaches for making data available to the community, many challenges remain. Chemistry data sharing, at even the most basic level, remains a challenge for many chemistry journals. The vast majority of chemistry data is provided as PDF files or trapped on webpages and therefore not available for reuse and repurposing without a significant amount of effort to extract the data. Some of the responsibility resides with the scientists who need to be educated and encouraged in the adoption of appropriate exchange formats and utilization of online platforms for data hosting and dissemination. There are certain practices which, if adopted, could increase both the availability and utility of data for the community. This includes recognition that data, in itself, has value above and beyond inclusion in peer-reviewed publications, the adoption of standard (not necessarily open) formats, clear data licensing, and distribution of the data across multiple platforms. This presentation will provide an overview of ongoing efforts within the National Center for Computational Toxicology to publish chemistry data, both in databases and associated with peer-reviewed publications, in a manner that makes our data and models consumable by the community.
This abstract does not reflect U.S. EPA policy.
In recent years, the growth of scientific data and the increasing need for data sharing and collaboration in the field of environmental chemistry has led to the creation of various software and databases that facilitate research and development into the safety and toxicity of chemicals. The US-EPA Center for Computational Toxicology and Exposure has been developing software and databases that serve the chemistry community for many years. This presentation will focus on several web-based software applications which have been developed at the USEPA and made available to the community. While the primary software application from the Center is the CompTox Chemicals Dashboard almost a dozen proof-of-concept applications have been built serving various capabilities. The publicly accessible Cheminformatics Modules (https://www.epa.gov/chemicalresearch/cheminformatics) provides access to six individual modules to allow for hazard comparison for sets of chemicals, structure-substructure-similarity searching, structure alerts and batch QSAR prediction of both physicochemical and toxicity endpoints. A number of other applications in development include a chemical transformations database (ChET) and a database of analytical methods and open mass spectral data (AMOS). Each of these depends on the underlying DSSTox chemicals database, a rich source of chemistry data for over 1.2 million chemical substances. I will provide an overview of all tools in development and the integrated nature of the applications based on the underlying chemistry data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are utilized to identify emerging contaminants and chemical signatures of interest detected in various media. At the US Environmental Protection Agency the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is an open chemistry resource and web-based application containing data for ~900,000 substances and supports non-targeted and suspect screening analyses. Searching functionality includes identifier searches (e.g. systematic names, trade names and CAS Registry Numbers) and mass and formula-based searches, and prototype developments include combined substructure-mass/formula searches and searching experimental mass spectral data against predicted fragmentation spectra. A specific type of data mapping in the database uses “MS-Ready” structures, a way to process all registered substances by separating multi-component chemicals into their individual components, removing stereochemical bonds, and desalting and neutralizing. This MS-Ready processing supports batch searching using either mass or formulae to identify candidate chemicals and their mapped substances. A number of chemical lists (https://comptox.epa.gov/dashboard/chemical_lists) have also been developed to support the identification of chemicals related to agrochemistry, specifically pesticides (both active and inert constituents), insecticides, and their metabolites and environmental breakdown products. This presentation will provide an overview of how the CompTox Chemicals Dashboard supports mass spectrometry based structure identification and non-targeted analysis of chemicals in agrochemistry. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The Center for Computational Toxicology and Exposure (CCTE) is part of the Office of Research and Development at the US Environmental Protection Agency. As part of its mission the center delivers access to chemicals related data via web-based freely accessible online Dashboards to disseminate data generated within the center as well as harvested and integrated from open databases around the world. The CompTox Chemicals Dashboard (available at https://comptox.epa.gov/dashboard) provides access to >1.2 million chemicals and associated data including experimental and predicted property data, in vivo hazard data, in vitro bioactivity data, exposure data, and various other data types. The curation of the chemicals dataset has included the development of over 400 segregated lists of chemicals that represent specific research areas of interest including disinfectant by-products, per- and polyfluoroalkyl substances (PFAS), extractables and leachables, and chemicals of emerging concern. The chemicals collection, the associated data, the lists and searches for mass and formulae makes the Dashboard an ideal foundation technology to support our colleagues working in the field of mass spectrometry, especially in targeted and non-targeted analysis. This presentation will provide an overview of the Dashboard, its value to the community in terms of providing access to the integrated and highly curated data, and its utility to support researchers in the field of mass spectrometry. New proof-of-concept projects will also be introduced including the development of a cheminformatics enabled methods database. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
They monitor common gases, weather parameters, and particulates.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing in which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, including worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993 Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22nt and 61nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that are causing the silencing by RNA-RNA interactions.
Types of RNAi (non-coding RNA)
miRNA
Length: 23-25 nt
Trans-acting
Binds its target mRNA with mismatches
Causes translational inhibition
siRNA
Length: 21 nt
Cis-acting
Binds its target mRNA with a perfectly complementary sequence
piRNA (Piwi-interacting RNA)
Length: 25-36 nt
Expressed in germ cells
Regulates transposon activity
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is a large (>500 kDa) RNA-multiprotein complex that triggers mRNA degradation in response to dsRNA.
The double-stranded siRNA is unwound by an ATP-independent helicase.
The active component of RISC is the Argonaute (Ago) protein, an endonuclease that cleaves the target mRNA.
DICER: endonuclease (RNase III family)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN:
1. PAZ (PIWI/Argonaute/Zwille): recognition of the target mRNA.
2. PIWI (P-element induced wimpy testis): breaks the phosphodiester bond of the mRNA (RNase H-like activity).
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression.
A brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Richard's entangled adventures in wonderland – Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Seminar on U.V. Spectroscopy – Samir Panda
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
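The quantity actually measured in UV-Vis spectroscopy is absorbance, related to transmitted light and to analyte concentration by the Beer–Lambert law, A = log10(I0/I) = ε·c·l; a short stdlib sketch:

```python
import math

def absorbance(incident_intensity, transmitted_intensity):
    """Beer-Lambert absorbance: A = log10(I0 / I)."""
    return math.log10(incident_intensity / transmitted_intensity)

def concentration(absorbance_value, molar_absorptivity, path_length_cm=1.0):
    """Analyte concentration (mol/L): c = A / (epsilon * l)."""
    return absorbance_value / (molar_absorptivity * path_length_cm)
```

For example, a sample transmitting 10% of the incident light has A = 1; with a known molar absorptivity, that absorbance converts directly to concentration.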
Nutraceutical market, scope and growth: Herbal drug technology – Lokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market, which includes products such as functional foods, beverages, and dietary supplements that provide health benefits beyond basic nutrition, is growing significantly. Growth is driven by rising healthcare costs, an aging population, and increasing demand for natural and preventative health solutions. Innovations in product formulation and the use of advanced technology for personalized nutrition further drive market expansion. With its worldwide reach, the nutraceutical industry is expected to keep growing, providing significant opportunities for research and investment across categories including vitamins, minerals, probiotics, and herbal supplements.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... – Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their capacity to elicit complex behavior composed of discrete elements. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for the ultra-fast high-resolution imaging of cellular processes over time and space and were studied in its natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provide insights into the progression of disease, response to treatments or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enables researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allows for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancements of novel therapeutic strategies.
The importance of data curation on QSAR Modeling: PHYSPROP open data as a case study. Presented at QSAR2016, Miami, FL, June 13-17, 2016
1. Office of Research and Development
The importance of data curation on QSAR Modeling: PHYSPROP open data as a case study
Kamel Mansouri,
Christopher Grulke
Ann Richard
Richard Judson
Antony Williams
NCCT, U.S. EPA
The views expressed in this presentation are those of the author and do not
necessarily reflect the views or policies of the U.S. EPA
QSAR 2016
14 June 2016, Miami, FL
2. Office of Research and Development
National Center for Computational Toxicology
Recent Cheminformatics development at NCCT
• We are building a new cheminformatics architecture
• PUBLIC dashboard gives access to curated chemistry
• Focus on integrating EPA and external resources
• Aggregating and curating data, visualization elements and
“services” to underpin other efforts
• RapidTox
• Read-across
• Predictive modeling
• Non-targeted screening
3.
Developing “NCCT Models”
• Interest in physicochemical properties to include in exposure modeling,
augmented with ToxCast HTS in vitro data etc.
• Our approach to modeling:
– Obtain high quality training sets
– Apply appropriate modeling approaches
– Validate performance of models
– Define the applicability domain and limitations of the models
– Use models to predict properties across our full datasets
• Work has been initiated using available physicochemical data
4.
PHYSPROP Data: Available from:
http://esc.syrres.com/interkow/EpiSuiteData.htm
• Water solubility
• Melting Point
• Boiling Point
• LogP (KOWWIN: Octanol-water partition coefficient)
• Atmospheric Hydroxylation Rate
• LogBCF (Bioconcentration Factor)
• Biodegradation Half-life
• Ready biodegradability
• Henry's Law Constant
• Fish Biotransformation Half-life
• LogKOA (Octanol/Air Partition Coefficient)
• LogKOC (Soil Adsorption Coefficient)
• Vapor Pressure
5.
The Approach
• To build models we need the set of chemicals and their associated property values
• Our curation process
– Decide on the “chemical” by checking levels of consistency
– We did NOT validate each measured property value
– Perform initial analysis manually to understand how to clean the data
(chemical structure and ID)
– Automate the process (and test iteratively)
– Process all datasets using final method
6.
KNIME Workflow to Evaluate the Dataset
7.
LogP dataset: 15,809 chemicals (structures)
• CAS Checksum: 12163 valid, 3646 invalid (>23%)
• Invalid names: 555
• Invalid SMILES: 133
• Valence errors: 322 Molfile, 3782 SMILES (~24%)
• Duplicates check:
–31 DUPLICATE MOLFILES
–626 DUPLICATE SMILES
–531 DUPLICATE NAMES
• SMILES vs. Molfiles (structure check)
–1279 differ in stereochemistry (~8%)
–362 “Covalent Halogens”
–191 differ as tautomers
–436 are different compounds (~3%)
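The CAS checksum test counted above follows the standard CAS Registry Number check-digit rule: reading the digits before the check digit from right to left, multiply them by 1, 2, 3, …, sum the products, and compare the sum modulo 10 to the check digit. A minimal sketch (the regex format check is an assumption about how strictly the dataset formats its numbers):

```python
import re

def valid_cas(cas: str) -> bool:
    """Validate a CAS Registry Number (e.g. '7732-18-5') via its check digit.

    The check digit equals the sum of the preceding digits, each multiplied
    by its 1-based position counted from the right, modulo 10.
    """
    if not re.fullmatch(r"\d{2,7}-\d{2}-\d", cas):
        return False
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * i for i, d in enumerate(reversed(body), start=1))
    return total % 10 == check

# Water: 7732-18-5 passes; a corrupted check digit fails.
print(valid_cas("7732-18-5"))  # True
print(valid_cas("7732-18-4"))  # False
```

A checksum failure only proves the registry number is corrupted; a passing checksum does not prove the number belongs to the listed structure, which is why the name and structure cross-checks are still needed.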
8.
Examples of Errors: Valence Errors; Different Compounds (structure images not reproduced)
9.
Examples of Errors: Duplicate Structures; Covalent Halogens (structure images not reproduced)
10.
Data Files & Quality flags
• The data files have FOUR representations of a
chemical, plus the property value.
http://esc.syrres.com/interkow/EpiSuiteData.htm
Four levels of consistency exist among:
• The Molblock
• The SMILES string
• The chemical name (based on
ACD/Labs dictionary)
• The CAS Number (based on a
DSSTox lookup)
11.
Quality FLAGS in the LogP data
• 4 Stars ENHANCED: 4 levels of consistency, with stereo information
• 4 Stars: 4 levels of consistency, stereo ignored
• 3 Stars PLUS: 3 out of 4 levels; the 4th is a tautomer
• 3 Stars ENHANCED: 3 levels of consistency, with stereo information
• 3 Stars: 3 levels of consistency, stereo ignored
• 2 Stars PLUS: 2 out of 4 levels; the 3rd is a tautomer
• 1 Star: what's left
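The star-flag logic can be read as a small decision function over three facts: how many of the four representations (molblock, SMILES, name, CAS number) agree, whether stereochemistry is consistent, and whether the odd one out is only a tautomer. The encoding below is an illustrative assumption, not the EPA implementation:

```python
def quality_flag(n_consistent: int, stereo_ok: bool = False,
                 mismatch_is_tautomer: bool = False) -> str:
    """Map consistency checks to a star flag (illustrative encoding only)."""
    if n_consistent == 4:
        return "4 Stars ENHANCED" if stereo_ok else "4 Stars"
    if n_consistent == 3:
        if mismatch_is_tautomer:
            return "3 Stars PLUS"
        return "3 Stars ENHANCED" if stereo_ok else "3 Stars"
    if n_consistent == 2 and mismatch_is_tautomer:
        return "2 Stars PLUS"
    return "1 Star"  # everything that passes no stronger test

print(quality_flag(4, stereo_ok=True))             # 4 Stars ENHANCED
print(quality_flag(3, mismatch_is_tautomer=True))  # 3 Stars PLUS
print(quality_flag(1))                             # 1 Star
```

Encoding the flags as an explicit function makes the later "3-4 STAR" slicing of the datasets a simple filter over precomputed labels.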
12.
Structure standardization: from initial structures to QSAR-ready structures
• Remove duplicates
• Normalize tautomers
• Clean salts and counterions
• Remove inorganics and mixtures
• Final inspection
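The standardization steps can be sketched as a chain of SMILES-level filters. Real workflows, such as the published KNIME workflow, use a full cheminformatics toolkit; the fragment-splitting and carbon-detection heuristics below are simplifying assumptions for illustration, and tautomer normalization is omitted because it genuinely requires a toolkit:

```python
import re

def strip_counterions(smiles: str) -> str:
    """Keep the largest dot-separated fragment (crude salt/counterion removal)."""
    return max(smiles.split("."), key=len)

def is_organic(smiles: str) -> bool:
    """Heuristic: require a carbon atom ('C' not starting a two-letter symbol
    like Cl/Ca, or aromatic 'c'). Not robust for all SMILES, illustration only."""
    return re.search(r"C(?![a-z])|c(?![a-z])", smiles) is not None

def to_qsar_ready(records: dict) -> dict:
    """records: id -> raw SMILES. Returns id -> cleaned SMILES, dropping
    inorganics and collapsing duplicates (same cleaned SMILES string)."""
    cleaned, seen = {}, set()
    for cid, smi in records.items():
        smi = strip_counterions(smi)
        if not is_organic(smi):
            continue  # drop inorganics
        if smi in seen:
            continue  # drop duplicates
        seen.add(smi)
        cleaned[cid] = smi
    return cleaned

raw = {"a": "CCO.Cl", "b": "[Na+].[Cl-]", "c": "CCO"}
print(to_qsar_ready(raw))  # {'a': 'CCO'}
```

In a production pipeline each dropped record would be logged with a reason code, so the "final inspection" step can review what was removed rather than silently losing data.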
13.
KNIME workflow
UNC, DTU, EPA Consensus
Aim of the workflow:
• Combine (not reproduce) different procedures and ideas
• Minimize the differences between the structures used for prediction
by different groups
• Produce a flexible, free and open-source workflow to be shared
Mansouri et al. (http://ehp.niehs.nih.gov/15-10267/)
14.
Summary:
Property   Initial flagged file   Updated 3-4 STAR   Curated QSAR-ready
AOP        818                    818                745
BCF        685                    618                608
BioHC      175                    151                150
Biowin     1265                   1196               1171
BP         5890                   5591               5436
HL         1829                   1758               1711
KM         631                    548                541
KOA        308                    277                270
LogP       15809                  14544              14041
MP         10051                  9120               8656
PC         788                    750                735
VP         3037                   2840               2716
WF         5764                   5076               4836
WS         2348                   2046               2010
15.
Development of a QSAR model
• Curation of the data
» Flagged and curated files available for sharing
• Preparation of training and test sets
» Inserted as a field in SDFiles and csv data files
• Calculation of an initial set of descriptors
» PaDEL 2D descriptors and fingerprints generated and shared
• Selection of a mathematical method
» Several approaches tested: KNN, PLS, SVM…
• Variable selection technique
» Genetic algorithm
• Validation of the model’s predictive ability
» 5-fold cross validation and external test set
• Define the Applicability Domain
» Local (nearest neighbors) and global (leverage) approaches
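The validation steps above (5-fold cross-validation plus an external test set) reduce to generating disjoint folds and scoring each held-out portion with R2/Q2 and RMSE. A minimal, model-agnostic sketch in plain Python:

```python
import math

def kfold_indices(n: int, k: int = 5):
    """Yield (train_idx, test_idx) pairs for k disjoint folds over n samples."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def rmse(y_true, y_pred):
    """Root-mean-square error of predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination; called Q2 when computed on held-out data."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
print(rmse(y, y))  # 0.0
print(r2(y, [1.1, 2.1, 2.9, 3.9]))
```

In the workflow described here, 25% of each curated dataset is held out as the external test set and the folds are drawn only from the remaining 75%, which is why the tables that follow report CV, training, and test columns separately.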
16.
QSARs validity, reliability, applicability and adequacy for regulatory purposes
ORCHESTRA. Theory, guidance and application on QSAR and REACH; 2012. http://home.deib.polimi.it/gini/papers/orchestra.pdf
17.
The 5 OECD principles: the conditions for the validity of QSARs
1) A defined endpoint: any physicochemical, biological or environmental effect that can be measured and therefore modelled.
2) An unambiguous algorithm: ensure transparency in the description of the model algorithm.
3) A defined domain of applicability: define limitations in terms of the types of chemical structures, physicochemical properties and mechanisms of action for which the models can generate reliable predictions.
4) Appropriate measures of goodness-of-fit, robustness and predictivity: (a) the internal fitting performance of a model; (b) the predictivity of a model, determined by using an appropriate external test set.
5) Mechanistic interpretation, if possible: mechanistic associations between the descriptors used in a model and the endpoint being predicted.
18.
NCCT models
Prop   Vars   5-fold CV (75%)   Training (75%)          Test (25%)
              Q2      RMSE      N       R2      RMSE    N       R2      RMSE
BCF    10     0.84    0.55      465     0.85    0.53    161     0.83    0.64
BP     13     0.93    22.46     4077    0.93    22.06   1358    0.93    22.08
LogP   9      0.85    0.69      10531   0.86    0.67    3510    0.86    0.78
MP     15     0.72    51.8      6486    0.74    50.27   2167    0.73    52.72
VP     12     0.91    1.08      2034    0.91    1.08    679     0.92    1
WS     11     0.87    0.81      3158    0.87    0.82    1066    0.86    0.86
HL     9      0.84    1.96      441     0.84    1.91    150     0.85    1.82
19.
NCCT models (continued)
Prop    Vars   5-fold CV (75%)   Training (75%)         Test (25%)
               Q2      RMSE      N      R2      RMSE    N      R2      RMSE
AOH     13     0.85    1.14      516    0.85    1.12    176    0.83    1.23
BioHL   6      0.89    0.25      112    0.88    0.26    38     0.75    0.38
KM      12     0.83    0.49      405    0.82    0.5     136    0.73    0.62
KOC     12     0.81    0.55      545    0.81    0.54    184    0.71    0.61
KOA     2      0.95    0.69      202    0.95    0.65    68     0.96    0.68
For the classification model, balanced accuracy (BA) and sensitivity-specificity (Sn-Sp) replace Q2/R2 and RMSE:
               BA      Sn-Sp     N      BA      Sn-Sp      N      BA      Sn-Sp
R-Bio   10     0.8     0.82-0.78 1198   0.8     0.82-0.79  411    0.79    0.81-0.77
20.
LogP Model: weighted kNN, 9 descriptors
• Weighted 5-nearest neighbors
• Training set: 10531 chemicals; test set: 3510 chemicals
• 5-fold cross-validation: Q2 = 0.85, RMSE = 0.69
• Fitting: R2 = 0.86, RMSE = 0.67
• Test: R2 = 0.86, RMSE = 0.78
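A distance-weighted kNN regressor of the kind described here predicts the weighted average of the k nearest training responses, with weights inversely proportional to distance. This sketch uses Euclidean distance over raw descriptor vectors; the actual model operates on selected, scaled PaDEL descriptors chosen by a genetic algorithm:

```python
import math

def knn_predict(train_X, train_y, query, k=5, eps=1e-6):
    """Distance-weighted kNN regression: inverse-distance weighted mean of
    the k nearest training responses (eps avoids division by zero)."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Toy 1-D descriptor space: prediction near a cluster tracks the cluster.
X = [(0.0,), (1.0,), (2.0,), (10.0,)]
y = [0.0, 1.0, 2.0, 10.0]
print(knn_predict(X, y, (1.1,), k=3))  # close to 1.0: nearest neighbor dominates
```

A side benefit of kNN is that the neighbors themselves are interpretable output: the distances feed the local applicability-domain check, and the neighbors' experimental values can be reported alongside each prediction, as the standalone application does.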
21.
Standalone application:
Input:
–MATLAB .mat file, or an ASCII file containing only a matrix of variables
–SDF file or SMILES strings of QSAR-ready structures; in this case the program will calculate PaDEL 2D descriptors and make the predictions
• The program will extract the molecule names from the input CSV or SDF (or assign arbitrary names if absent) as IDs for the predictions.
Output:
• Depending on the extension, the output can be a text file or CSV with:
–A list of molecule IDs and predictions
–Applicability domain
–Accuracy of the prediction
–Similarity index to the 5 nearest neighbors
–The 5 nearest neighbors from the training set: experimental value, prediction, InChIKey
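The ID-handling behavior described above (take molecule names from the input, or assign arbitrary names when missing) can be sketched for the SDF case: the title line of each record serves as the ID, with a generated fallback. The parsing below is a simplified assumption about V2000 SDF layout, not the application's actual parser:

```python
def sdf_ids(sdf_text: str) -> list:
    """Return one ID per SDF record: the record's title line if present,
    otherwise a generated 'Mol_<n>' placeholder."""
    ids, lines, start = [], sdf_text.splitlines(), 0
    for i, line in enumerate(lines):
        if line.strip() == "$$$$":  # record terminator
            title = lines[start].strip() if start < i else ""
            ids.append(title or f"Mol_{len(ids) + 1}")
            start = i + 1
    return ids

# Two records: the first is named, the second has an empty title line.
sdf = "aspirin\n\n\n  0  0\nM  END\n$$$$\n\n\n\n  0  0\nM  END\n$$$$\n"
print(sdf_ids(sdf))  # ['aspirin', 'Mol_2']
```

Stable IDs matter here because the output rows (prediction, applicability domain, nearest neighbors) must be joinable back to the input structures.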
22.
The iCSS Chemistry Dashboard
at https://comptox.epa.gov
23.
24.
Conclusion
• QSAR prediction models (kNN) produced for all properties
• 700k chemical structures pushed through NCCT_Models
• Supplementary data will include appropriate files with flags
– full dataset plus QSAR ready form
• Full performance statistics available for all models
• Models will be deployed as prediction engines in the future
– one chemical at a time and batch processing (to be done
after RapidTox Project)
25.
Acknowledgements
National Center for Computational Toxicology
26.
Thank you for your attention