The influence of data curation on QSAR Modeling – Presented at American Chemi... – Kamel Mansouri
This presentation examined the impact of data quality on the construction of QSAR models being developed within the EPA’s National Center for Computational Toxicology. We have developed a public-facing platform to provide access to predictive models, and as part of this work we have attempted to disentangle the influence of the quality versus the quantity of data available to develop and validate QSAR models. This abstract does not reflect U.S. EPA policy.
The internet has changed the way we access chemistry data and provides access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with the record counts for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility and offer synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of the noise in the larger databases, their value becomes highly dependent on the specific application, one example being the use of these databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Access to both experimental and predicted environmental fate and transport data is facilitated by the US EPA CompTox Chemicals Dashboard. Providing access to various types of data associated with ~900,000 chemical substances, the dashboard is a web-based application supporting computational toxicology research in environmental chemistry. When experimental physicochemical and fate and transport data are not available, QSAR models developed using curated datasets are used for the prediction of properties, including bioaccumulation factors, bioconcentration factors, and biodegradation and fish biotransformation half-lives. For chemicals of interest that are not already registered in the dashboard, real-time predictions based on structural inputs are available. This presentation will provide an overview of the dashboard with a focus on the availability of environmental fate and transport data, access to real-time predictions, and our ongoing efforts to harvest and curate available experimental data from the literature and online databases. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Non-targeted analysis (NTA) uses high-resolution mass spectrometry to better understand the identity of a wide variety of chemicals present in environmental samples (and other matrices). However, data processing remains challenging due to the vast number of chemicals detected in samples, the software and computational requirements of data processing, and the inherent uncertainty in confidently identifying chemicals from candidate lists. Analysis of the resulting mass spectrometry information relies on cheminformatics to identify and rank chemicals, and the US EPA has developed functionality within the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) to address challenges related to this analysis. These tools include the generation of “MS-Ready” structures to optimize database searching, retention time prediction for candidate reduction, consensus ranking using chemical metadata, and in silico MS/MS fragmentation prediction for spectral matching. Combining these tools into a comprehensive workflow improves certainty in candidate identification. This presentation will review how the CompTox Chemicals Dashboard, via its flexible search capabilities, rich data for ~900,000 chemical substances, and visualization approaches, provides a freely available, open chemistry resource to support structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
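As a rough illustration of what the “MS-Ready” structure preparation described above might involve, the following sketch uses RDKit to desalt, neutralize and strip stereochemistry from an input structure; the specific steps and the example salt are assumptions based on the abstract's description, not the EPA's actual implementation.

```python
# Minimal sketch of "MS-Ready"-style structure preparation using RDKit.
# Illustrative only: it shows the kinds of steps the abstract describes
# (desalting, neutralization, removal of stereochemistry) so that database
# searches match the form of the structure observed by the mass spectrometer.
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def ms_ready_smiles(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparseable SMILES: {smiles}")
    mol = rdMolStandardize.FragmentParent(mol)        # keep the largest fragment (desalt)
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize charges where possible
    Chem.RemoveStereochemistry(mol)                   # stereochemistry is not resolved by MS
    return Chem.MolToSmiles(mol)

# Example: a sodium salt collapses to its neutral parent structure.
print(ms_ready_smiles("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"))
```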
The construction of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to separate the influence of the quality versus the quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPI Suite software that was initially developed over two decades ago. Specific examples of quality issues in the EPI Suite data include multiple records for the same chemical structure with different measured property values, inconsistency between the structure, chemical name and CAS Registry Number within single records, SMILES strings that cannot be converted into chemical structures, hypervalency in the chemical structures, and the absence of stereochemistry for thousands of data records. Relative to the era of EPI Suite development, modern cheminformatics tools allow for more advanced capabilities in terms of chemical structure representation and storage, as well as enabling automated data validation and standardization approaches to examine data quality. This presentation will review both our manual and automated approaches to examining key datasets related to the EPI Suite training and test data. This includes approaches to validate consistency between chemical structure representations (e.g. molfile and SMILES) and identifiers (chemical names and registry numbers), as well as approaches to standardize the data into QSAR-consumable formats for modeling. We have quantified and segregated the data into various quality categories to allow us to thoroughly investigate the resulting models that can be developed from these data slices and to examine to what extent efforts invested in the development of large, high-quality datasets have the expected pay-off in terms of prediction performance. This abstract does not reflect U.S. EPA policy.
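One of the consistency checks described above can be illustrated with a short sketch: testing whether a record's molfile and SMILES fields encode the same structure by comparing InChIKeys. This is a hypothetical, minimal example and not the EPA curation workflow itself, which also cross-checks names and CAS Registry Numbers.

```python
# Sketch of one consistency check: do a record's molfile and SMILES fields
# encode the same chemical structure? Comparing standard InChIKeys is one
# simple way to flag mismatched or unparseable records for manual review.
from rdkit import Chem

def structure_fields_agree(molblock: str, smiles: str) -> bool:
    mol_from_block = Chem.MolFromMolBlock(molblock)
    mol_from_smiles = Chem.MolFromSmiles(smiles)
    if mol_from_block is None or mol_from_smiles is None:
        return False  # unparseable representation: flag the record
    return Chem.MolToInchiKey(mol_from_block) == Chem.MolToInchiKey(mol_from_smiles)
```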
Tens of thousands of chemicals are currently in commerce, and hundreds more are introduced every year. Because current chemical testing is resource intensive, only a small fraction of chemicals have been adequately evaluated for potential human health effects. New technologies and computational tools have shown promise for closing this knowledge gap. In the U.S. EPA’s ToxCast effort, the use of ~700 high-throughput in vitro assays has broadly characterized the biological activity and potential mechanisms of ~1,800 chemicals. Coupling the high-throughput in vitro assays with additional in vitro pharmacokinetic assays and in vitro-to-in vivo extrapolation modeling allows conversion of in vitro bioactive concentrations into estimates of an administered dose (mg/kg/day). High-throughput exposure models are generating exposure estimates based on key aspects of chemical production, fate, transport, and personal use. The path for incorporating new approach methods and technologies for prioritization and assessment of chemical alternatives poses multiple scientific challenges. These challenges include sufficient coverage of toxicological mechanisms to meaningfully interpret negative test results, development of increasingly relevant test systems, computational modeling to integrate experimental data, characterizing uncertainty, and efficient validation of the test systems and computational models. The presentation will cover progress at the U.S. EPA in the development and application of these technologies and approaches in evaluating alternatives and systematically addressing each of these challenges. This abstract does not necessarily reflect U.S. EPA policy.
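The in vitro-to-in vivo extrapolation step mentioned above can be illustrated, under a simple steady-state assumption, as dividing the in vitro bioactive concentration by the steady-state plasma concentration predicted for a unit (1 mg/kg/day) dose. The sketch below uses hypothetical numbers and is not the EPA's toxicokinetic model, which is parameterized with measured hepatic clearance and plasma binding data.

```python
# Toy illustration of reverse dosimetry (IVIVE): the administered equivalent
# dose (AED) that would produce a plasma concentration equal to the in vitro
# bioactive concentration, assuming a simple steady-state toxicokinetic model.
def administered_equivalent_dose(ac50_uM: float, css_uM_per_mg_kg_day: float) -> float:
    """AED in mg/kg/day for an in vitro AC50 (uM), given Css per unit dose (uM per mg/kg/day)."""
    return ac50_uM / css_uM_per_mg_kg_day

# e.g. an assay AC50 of 5 uM and a predicted Css of 2.5 uM per 1 mg/kg/day dose
print(administered_equivalent_dose(5.0, 2.5))  # -> 2.0 mg/kg/day
```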
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and, as a result, the dashboard surfaces hundreds of thousands of data points. Additional data include experimental and predicted physicochemical property data, in vitro bioassay data for over 4,000 chemicals and 2,000 assays, and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include an interactive read-across module, real-time physicochemical and toxicity endpoint prediction, and an integrated search of PubMed. This presentation will provide an overview of the latest release of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The development of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to separate the influence of the quality versus the quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPI Suite software that was initially developed over two decades ago and, specifically, on the PHYSPROP dataset used to train the EPI Suite prediction models. This presentation will review our approaches to examining key datasets, the delivery of curated data and the development of machine-learning models for thirteen separate property endpoints of interest to environmental science. We will also review how these data will be made freely accessible to the community via a new “chemistry dashboard”. This abstract does not reflect U.S. EPA policy.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program integrates advances in biology, chemistry, exposure and computer science to help prioritize chemicals for further research based on potential human health risks. This work involves computational and data-driven approaches that integrate chemistry, exposure and biological data. As an outcome of these efforts, the National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, legacy in vivo animal data, consumer use and production information, exposure models and chemical structure databases with associated properties. A series of software applications and databases have been produced over the past decade to deliver these data, but recent work has focused on the development of a new software architecture that assembles the resources into a single platform. Our web application, the CompTox Chemistry Dashboard, provides access to data associated with ~750,000 chemical substances. These data include experimental and predicted physicochemical property data, bioassay screening data associated with the ToxCast program, product and functional use information and a myriad of related data of value to environmental scientists.
The dashboard provides chemical searching based on chemical names, synonyms and CAS Registry Numbers. Flexible search capabilities allow for chemical identification in non-targeted analysis studies using mass spectrometry. Chemical identification using both mass- and formula-based searching utilizes rank-ordering of results via functional use statistics, thereby providing a solution to help prioritize chemicals for further review when they are detected in environmental media.
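A minimal sketch of the kind of mass-window search and functional-use rank-ordering described above is given below; the candidate records and the functional-use field are hypothetical placeholders rather than actual dashboard data or APIs.

```python
# Sketch of a mass-based candidate search with functional-use rank-ordering.
# The candidate records and the "functional_use_count" field are hypothetical
# stand-ins for the kind of metadata the dashboard uses to rank hits.
PPM = 1e-6

def search_by_mass(candidates, observed_mass, tol_ppm=5.0):
    """Return candidates within a ppm mass window, ranked by functional-use count."""
    window = observed_mass * tol_ppm * PPM
    hits = [c for c in candidates if abs(c["monoisotopic_mass"] - observed_mass) <= window]
    return sorted(hits, key=lambda c: c["functional_use_count"], reverse=True)

candidates = [
    {"name": "Chemical A", "monoisotopic_mass": 180.0423, "functional_use_count": 42},
    {"name": "Chemical B", "monoisotopic_mass": 180.0429, "functional_use_count": 3},
    {"name": "Chemical C", "monoisotopic_mass": 181.0740, "functional_use_count": 99},
]
print(search_by_mass(candidates, observed_mass=180.0426))  # A ranked before B; C excluded
```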
This presentation will provide an overview of the dashboard, its capabilities for delivering data to the environmental chemistry community and how the architecture provides a foundation for the development of additional applications to support chemical risk assessment. This abstract does not reflect U.S. EPA policy.
ICSA Presents: Scalable Performance Testing - How Spirent Makes That Possible – Sailaja Tennati
@ICSA_Labs Brian Monkman discusses how he is working with Spirent's latest testing solution to help with #performance testing of #security devices at scale. This presentation was shared during #RSAC and #Interop 2014.
The CompTox Chemistry Dashboard was developed by the Environmental Protection Agency’s National Center for Computational Toxicology. The dashboard has been architected in a manner that allows for the deployment of multiple “applications”, both as publicly available databases and for deployment under the constraints of confidential business information (CBI). The public dashboard provides access to multiple types of data for ~750,000 chemicals. This includes, when available for a chemical substance, physicochemical parameters, toxicity and bioassay data, and consumer use and analytical data. Fate, exposure, and hazard calculations can benefit from access to the data aggregation and curation efforts that underpin the public dashboard, and regulators can benefit from the integration of their own data within their closed infrastructure environments. This presentation will provide a review of the chemistry dashboard architecture and its present application providing access to data for the research and regulatory communities. We will also review current developments in delivering an application programming interface, web services, and software components for integration into third-party applications, providing access to the data exposed via the dashboard. This abstract does not reflect U.S. EPA policy.
As part of our efforts to develop a public platform to provide access to predictive models, we have attempted to disentangle the influence of the quality versus the quantity of data available to develop and validate QSAR models. Following a thorough manual review of the data underlying the well-known EPI Suite software, we developed automated processes for the validation of the data using a KNIME workflow. These include approaches to validate different chemical structure representations (e.g. molfile and SMILES) and identifiers (chemical names and registry numbers), and methods to standardize the data into QSAR-consumable formats for modeling. Our efforts to quantify and segregate data into various quality categories have allowed us to thoroughly investigate the resulting models developed from these data slices, as well as to examine whether or not efforts invested in the development of large, high-quality datasets have the expected pay-off in terms of prediction performance. Machine-learning approaches have been applied to create a series of models that have been used to generate predicted physicochemical and environmental parameters for over 700,000 chemicals. These data are available online via the EPA’s iCSS Chemistry Dashboard. This abstract does not reflect U.S. EPA policy.
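To make the “data slices” idea concrete, the following schematic sketch trains the same model type on a cleaner and a noisier slice of simulated data and compares external performance. The descriptors, response values and quality categories are placeholders, not the PHYSPROP/EPI Suite data or the actual modeling workflow.

```python
# Schematic comparison of model performance across curation-quality slices:
# the same learner is trained on a "high quality" slice and a noisier slice,
# and external R2 is compared. All data here are simulated placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def evaluate_slice(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))

X = rng.normal(size=(500, 20))                        # placeholder molecular descriptors
y_clean = X[:, 0] * 2 - X[:, 1] + rng.normal(0, 0.2, 500)
y_noisy = y_clean + rng.normal(0, 1.0, 500)           # extra noise mimics uncurated records

print("high-quality slice R2:", round(evaluate_slice(X, y_clean), 2))
print("uncurated slice R2:  ", round(evaluate_slice(X, y_noisy), 2))
```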
The US EPA’s CompTox Chemistry Dashboard provides access to various types of data associated with ~760,000 chemical substances. These data include experimental and predicted property data, high-throughput screening assay data, and hazard and environmental exposure data. With millions of individual data points and annotations associated with hundreds of thousands of chemicals, data quality is a priority. With tens of thousands of individual users per month browsing the data on the dashboard, the ability of users to provide feedback has allowed us to identify, confirm and address issues in the data. This has required the implementation of novel approaches for data feedback via the user interface, ranging from general feedback on the dashboard down to individual data points contained in a table. We are presently investigating ways to garner feedback on our ToxCast bioassay data to facilitate the curation of tens of thousands of data points. This presentation will provide an overview of our existing capabilities in the CompTox Chemistry Dashboard for gathering crowdsourced feedback from the user base and its impact on assisting in the curation of data.
This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The US-EPA National Center for Computational Toxicology (NCCT) has been generating data and building software applications and web-based chemistry databases for over a decade. During this period the center has analyzed thousands of chemicals in hundreds of bioassays, researched high-throughput physicochemical property measurements and investigated approaches for high-throughput toxicokinetics. NCCT continues to expand the battery of assays and the number of chemicals under examination and is now investigating the application of transcriptomics. In parallel to these experimental efforts, and to support our efforts to develop new approaches to prioritize chemicals based on potential human health risks, we aggregate and curate data streams of various types to support prediction models. Over the past few years some of these data have been delivered through prototype web-based “dashboards” for public consumption. The latest of these web applications, the CompTox Chemicals Dashboard, is an integrated access point for information associated with 875,000 chemical substances and provides experimental and predicted data of various types, including physicochemical and fate and transport data, bioactivity data, exposure data and integrated literature searches. Real-time predictions and generalized read-across are possible, and advanced search capabilities are available to support EPA-related projects including mass spectrometry non-targeted analysis. This presentation will provide an overview of the CompTox Chemicals Dashboard and its role in delivering access to the outputs of NCCT. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent... – Dr. Haxel Consult
Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
As of August 2017, the major automated patent chemistry extractions (in ascending size: NextMove, SCRIPDB, IBM and SureChEMBL) are among the submitters of 21.5 million CIDs out of the PubChem total of 93.8 million. The following aspects will be expanded in this presentation, starting with the advantages: a) while the relative coverage between open and commercial sources is difficult to determine (PMID 26457120), it is clear that the majority of patent-exemplified structures of medicinal chemistry interest (i.e. from C07 plus A61) are now in PubChem; b) this allows most first-filings of lead series and clinical candidates to be tracked; c) the PubChem toolbox has query, analysis, clustering and linking features difficult to match in commercial sources; d) many structures can be associated with bioactivity data; e) connections between manually curated papers and patents can be made via the 0.48 million CID intersects with ChEMBL. However, looking more closely also reveals disadvantages: a) extraction coverage is compromised by dense image tables and the poor OCR quality of WO documents; b) SureChEMBL is the only major open pipeline continuously running in situ, but it has a PubChem updating lag; c) automated extraction generates structural “noise” that degrades chemistry quality; d) PubChem patent document metadata indexing is patchy (although better for SureChEMBL in situ); e) nothing in the records indicates IP status; f) continual re-extraction of common chemistry results in over-mapping (e.g. 126,949 patents for aspirin and 14,294 for atorvastatin); g) authentic compounds are contaminated with spurious mixtures and never-made virtuals, including thousands of deuterated drugs; h) linking between assay data and targets is still a manual exercise. However, all things considered, the PubChem patent “big bang” presents users with the best of both worlds (PMID 26194581). Academics or smaller enterprises who cannot afford commercial solutions can now mine patents extensively. For those who have such subscriptions, PubChem has become an essential adjunct and complementary source for the analysis of patent chemistry and associated bio-entities such as diseases and drug targets.
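The aspirin over-mapping figure quoted above can, in principle, be reproduced by counting the patent cross-references PubChem lists for a CID through its PUG REST service. The endpoint pattern and response layout in the sketch below are assumptions based on the public PUG REST documentation, and current counts will differ from the August 2017 numbers.

```python
# Illustration of the over-mapping example: count patent cross-references for a
# PubChem CID via the PUG REST xrefs service. URL pattern and response layout
# are assumptions based on public PUG REST documentation; counts change over time.
import requests

def patent_xref_count(cid: int) -> int:
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/xrefs/PatentID/JSON"
    data = requests.get(url, timeout=30).json()
    return len(data["InformationList"]["Information"][0]["PatentID"])

print(patent_xref_count(2244))  # CID 2244 = aspirin
```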
Researchers at EPA’s National Center for Computational Toxicology integrate advances in biology, chemistry, and computer science to examine the toxicity of chemicals and help prioritize chemicals for further research based on potential human health risks. The goal of this research program is to quickly evaluate thousands of chemicals at a much reduced cost and over a shorter time frame relative to traditional approaches. The data generated by the Center include characterization of thousands of chemicals across hundreds of high-throughput screening assays, consumer use and production information, pharmacokinetic properties, literature data, and physical-chemical properties, as well as predictive computational modeling of toxicity and exposure. We have developed a number of databases and applications to deliver the data to the public, the academic community, industry stakeholders, and regulators. This presentation will provide an overview of our work to develop an architecture that integrates diverse large-scale data from the chemical and biological domains, our approaches to disseminating these data, and the delivery of models supporting predictive computational toxicology. In particular, this presentation will review our new publicly accessible CompTox Dashboard as the first application built on our newly developed architecture. This abstract does not reflect U.S. EPA policy.
Non-targeted and suspect screening studies using high-resolution mass spectrometry (HRMS) have revolutionized the detection of chemicals in complex matrices. However, data processing remains challenging due to the vast number of chemicals detected in samples, the software and computational requirements of data processing, and the inherent uncertainty in confidently identifying chemicals from candidate lists. The US EPA has developed functionality within the CompTox Chemicals Dashboard (https://comptox.epa.gov) to address challenges related to data processing and analysis in HRMS. These tools include the generation of “MS-Ready” structures to optimize database searching, retention time prediction for candidate reduction, consensus ranking using chemical metadata, and in silico MS/MS fragmentation prediction for spectral matching. Combining these tools into a comprehensive workflow improves certainty in candidate identification. This presentation will introduce the tools and the combined workflow, including visualization and access via the CompTox Chemicals Dashboard. Together, these tools, data, and visualization approaches within an open chemistry resource provide a publicly available software platform to support structure identification and non-targeted analyses. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Researchers at the EPA’s National Center for Computational Toxicology integrate advances in biology, chemistry, and computer science to examine the toxicity of chemicals and help prioritize chemicals for further research based on potential human health risks. The intention of this research program is to quickly evaluate thousands of chemicals for potential risk at much reduced cost relative to historical approaches. This work involves computational and data-driven approaches including high-throughput screening, modeling, text-mining and the integration of chemistry, exposure and biological data. We have developed a number of databases and applications that are delivering on the vision of a deeper understanding of chemicals and their effects on exposure and biological processes, and that support a large community of scientists in their research efforts. This presentation will provide an overview of our work to bring together diverse large-scale data from the chemical and biological domains, our approaches to integrating and disseminating these data, and the delivery of models supporting computational toxicology. This abstract does not reflect U.S. EPA policy.
As a service provider for hit identification, Exquiron needs to offer a state-of-the-art cheminformatics, data analysis and reporting platform to its clients. For historical reasons, this platform was, until recently, based on Accelrys’ PipelinePilot. An effort was started at the end of 2013 to evaluate and migrate all required workflows to the KNIME platform using the Infocom/ChemAxon nodes. With the help of the ChemAxon consulting team and support from KNIME, complex protocols were successfully migrated to the new environment. The presentation will highlight two specific examples of this effort.
The iCSS CompTox Chemistry Dashboard is a publicly accessible dashboard provided by the National Center for Computational Toxicology at the US-EPA. It serves a number of purposes, including providing the chemistry database underpinning many of our public-facing projects (e.g. ToxCast and ExpoCast). The available data and searches provide a valuable path to structure identification using mass spectrometry as the source data. With an underlying database of over 720,000 chemicals, the dashboard has already been used to assist in identifying chemicals present in house dust. This poster reviews the benefits of the EPA’s platform and the underlying algorithms used for compound identification from high-resolution mass spectrometry data. Standard approaches for both mass and formula lookup are available, but the dashboard also delivers a novel approach for hit ranking based on the functional use of the chemicals. The focus on high-quality data, novel ranking approaches and integration with other resources of value to mass spectrometrists makes the CompTox Dashboard a valuable resource for the identification of environmental chemicals. This abstract does not reflect U.S. EPA policy.
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are being made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for almost 760,000 chemical substances, the majority of which are represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectrometry non-targeted screening community, who are generating important data for detecting and assessing environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders including, for example, scientists interested in algal toxins and hydraulic fracturing chemicals. This presentation will provide an overview of the challenges associated with the curation of data from EPA’s December 2016 Hydraulic Fracturing Drinking Water Assessment Report, which covered chemicals reported to be used in hydraulic fracturing fluids and those found in produced water. The data have been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, hazard and exposure predictions, and links to the open literature. The application of the dashboard to support mass spectrometry non-targeted analysis studies will also be reviewed. This abstract does not reflect U.S. EPA policy.
Sustainable chemistry is the design and use of chemicals that minimize impacts to human health, ecosystems and the environment. To assess sustainability, chemicals must be evaluated not only for their toxicity to humans and other species, but also for environmental persistence and the potential formation of toxic products as a result of biotic and abiotic transformations. Traditional approaches to evaluating these characteristics are resource intensive and normally lack the biologically mechanistic information that might facilitate a “safety by design” approach. A more promising approach would exploit recent advances in high-throughput (HT) and high-content (HC) screening methods coupled with computational methods for data analysis and predictive modelling. The elements of a framework to assess sustainable chemistry could rely on the integration of non-testing approaches such as (Q)SAR and read-across, coupled with prediction models derived from HT/HC methods anchored to biological pathways (e.g., Adverse Outcome Pathways). Acceptance and use of such integrated approaches necessitates a level of validation that demonstrates scientific confidence for specific decision contexts. Here we illustrate a scientific confidence framework for Tox21 approaches underpinned by a mechanistic basis, and illustrate how this will drive the development of enhanced non-testing approaches. This framework also focuses the development of prediction models that are hybrid in nature yet local in terms of their chemistry. Specific examples highlight how the extensive testing library within ToxCast was profiled with respect to its chemistry, resulting in new insights that direct strategic testing as well as formulate new predictive models, specifically structure–activity relationships (SARs). This abstract does not necessarily reflect U.S. EPA policy.
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...Kamel Mansouri
This presentation highlighted how data curation impacts the reliability of QSAR models. We examined key datasets related to environmental endpoints, cross-validating chemical structure representations (e.g., MOL file and SMILES) against identifiers (chemical names and registry numbers), and applied workflows to standardize the data into QSAR-ready formats prior to modeling. This allowed us to quantify and segregate the data into quality categories, which improved our ability to evaluate the models that can be developed from these data slices and to quantify to what extent the effort of developing high-quality datasets has the expected pay-off in terms of predictive performance. The most accurate models that we build will be accessible via our public-facing platform and will be used for screening and prioritizing chemicals for further testing.
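A minimal sketch of the kind of structure standardization step mentioned above, assuming RDKit is available; the specific rules shown (keeping the largest organic fragment, neutralizing charges, and canonicalizing the SMILES) are generic illustrations, not the exact EPA workflow.

```python
# Minimal QSAR-ready standardization sketch using RDKit (illustrative rules only).
from typing import Optional

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def to_qsar_ready(smiles: str) -> Optional[str]:
    """Return a standardized, canonical SMILES, or None if the structure fails parsing."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.Cleanup(mol)               # fix common normalization issues
    mol = rdMolStandardize.FragmentParent(mol)        # keep the parent fragment, drop salts
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize charges where possible
    return Chem.MolToSmiles(mol)                      # canonical SMILES for deduplication

if __name__ == "__main__":
    print(to_qsar_ready("CC(=O)[O-].[Na+]"))  # sodium acetate -> neutral parent acid
```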
USUGM 2014 - Gerald Wyckoff (Chemalytics): Development of the Chemalytics Pl...ChemAxon
Structure-based virtual screening is an important tool in the drug discovery process. The use of computational tools has allowed the screening of large libraries of chemical compounds to identify putative ligand-receptor interactions. The identification of valid targets and therapeutic compounds has long-term importance both to public health and to the economic strength of the pharmaceutical industry. Receptor-based virtual screening (VS) is a technique in which computational tools are used to dock small-molecular-weight compounds into a protein receptor or enzyme. This technique is most often used in drug discovery, where a large library of chemical structures can be docked and scored to assess the potential of a compound to bind to a drug target. However, high-throughput virtual screening is computationally intensive, and the cost of building, maintaining, and managing a dedicated computing cluster limits access to these technologies to large universities and commercial enterprises. Internet-based, or “cloud”, computing is a business service model in which computational resources are accessed affordably, scalably, and securely as needed. Our product utilizes this cloud infrastructure to deliver virtual screening to clients who either do not wish to or cannot maintain their own infrastructure. Our elegant and highly efficient system for managing the job queue and maximizing the use of computational resources allows us to provide reduced-cost access to our tools for academic and government researchers. This confluence of residual processing power and need has given rise to our concept of the “bucket list”: a “free” job queue that unassigned agents can work through during the time between finishing a paid job and their “death” at the end of their provisioned hour. We are working with ChemAxon to expand the capabilities of the current system through the following technical achievements: (1) integration of additional chemical libraries and library filtering tools to focus search space prior to docking; (2) enhancement of the end user’s ability to evaluate results through integration of data analysis and visualization tools; (3) integration of additional licensed, proprietary, and public domain tools for additional functionality. This work is funded by NIH’s National Institute of General Medical Sciences through SBIR Phase II grant GM097902.
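The “bucket list” idea described above — idle compute agents picking up free jobs between paid work until their provisioned hour ends — can be sketched as a simple two-queue scheduling policy. The following Python sketch is a hypothetical illustration of that policy, not the actual production implementation.

```python
# Hypothetical two-queue scheduler sketch: paid jobs take priority,
# idle agents drain a "bucket list" of free jobs until their provisioned hour ends.
import queue
import time

paid_jobs = queue.Queue()
bucket_list = queue.Queue()   # free jobs run only when no paid work is waiting

def agent_loop(agent_id: str, lifetime_s: float = 3600.0) -> None:
    """Run jobs until the agent's provisioned lifetime expires."""
    deadline = time.monotonic() + lifetime_s
    while time.monotonic() < deadline:
        try:
            job = paid_jobs.get_nowait()        # paid work always comes first
        except queue.Empty:
            try:
                job = bucket_list.get_nowait()  # otherwise do free "bucket list" work
            except queue.Empty:
                time.sleep(0.1)                 # nothing to do; wait briefly
                continue
        job()                                    # a job is just a callable here

if __name__ == "__main__":
    paid_jobs.put(lambda: print("dock paid library chunk"))
    bucket_list.put(lambda: print("dock free academic library chunk"))
    agent_loop("agent-1", lifetime_s=0.5)
```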
Validation is the process of checking that your model is consistent with established (stereochemical) standards; in other words, it is the process of evaluating reliability.
In this presentation, various aspects of validation are discussed.
Free online access to experimental and predicted chemical properties through ...Kamel Mansouri
The increasing number and size of public databases is facilitating the collection of chemical structures and associated experimental data for QSAR modeling. However, the performance of QSAR models depends strongly not only on the modeling methodology, but also on the quality of the data used. In this study we developed robust QSAR models for endpoints of environmental interest with the aim of supporting the regulatory process. We used the publicly available PHYSPROP database, which includes a set of thirteen common physicochemical and environmental fate properties, including logP, melting point, Henry’s law constant, and biodegradability, among others. Curation and standardization workflows were applied to retain the highest quality data and generate QSAR-ready structures. The developed models are in agreement with the five OECD principles, which require QSARs to be simple and reliable. These models were applied to a set of ~700k chemicals to produce predictions for display on the EPA CompTox Chemistry Dashboard. In addition to the predictions, this free web and mobile application provides access to the experimental data used for training as well as detailed reports covering overall model performance, the applicability domain and accuracy for each specific prediction, and the nearest neighboring structures used for prediction. The dashboard also provides access to model QMRFs (QSAR Model Reporting Format), downloadable PDFs containing additional details about the modeling approaches, the data, and molecular descriptor interpretation.
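As a hedged illustration of this kind of modeling setup — not the actual EPA models, descriptors, or training data — the following Python sketch trains a simple kNN regressor on Morgan-fingerprint descriptors and reports the training neighbors underpinning each prediction, loosely mirroring the “nearest neighboring structures” shown in the dashboard’s prediction reports.

```python
# Illustrative QSAR sketch: fingerprint descriptors + kNN regression.
# The dataset, descriptor choice, and model are assumptions for demonstration only.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.neighbors import KNeighborsRegressor

def fingerprint(smiles: str, n_bits: int = 1024) -> np.ndarray:
    """Morgan fingerprint as a dense float array."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array([int(b) for b in fp.ToBitString()], dtype=float)

# Tiny toy training set: (SMILES, measured logP-like value) -- illustrative numbers.
train = [("CCO", -0.31), ("CCCCCC", 3.9), ("c1ccccc1", 2.1), ("CC(=O)O", -0.17)]
X = np.vstack([fingerprint(s) for s, _ in train])
y = np.array([v for _, v in train])

model = KNeighborsRegressor(n_neighbors=2).fit(X, y)

query = "CCCO"
xq = fingerprint(query).reshape(1, -1)
pred = model.predict(xq)[0]
dist, idx = model.kneighbors(xq)                 # the neighbors underpin the prediction report
neighbors = [train[i][0] for i in idx[0]]
print(f"{query}: predicted {pred:.2f}, nearest training neighbors {neighbors}")
```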
Building linked data large-scale chemistry platform - challenges, lessons and...Valery Tkachenko
Chemical databases have been around for decades, but in recent years we have observed a qualitative change from rather small in-house built proprietary databases to large-scale, open and increasingly complex chemistry knowledgebases. This tectonic shift has imposed new requirements for database design and system architecture as well as the implementation of completely new components and workflows which did not exist in chemical databases before. Probably the most profound change is being caused by the linked nature of modern resources - individual databases are becoming nodes and hubs of a huge and truly distributed web of knowledge. This change has important aspects such as data and format standards, interoperability, provenance, security, quality control and metainformation standards.
ChemSpider at the Royal Society of Chemistry was the first public chemical database to incorporate rigorous quality control, introducing both community curation and automated quality checks at the scale of tens of millions of records. Yet we have come to realize that this approach may now be incomplete in a quickly changing world of linked data. In this presentation we will discuss the challenges associated with building modern public and private chemical databases, the lessons we have learned from our past and present experience, and solutions to some common problems.
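The automated quality checks mentioned above can be illustrated with a small, hedged sketch; the specific rules shown (parse failure, valence problems, an InChI round-trip mismatch) are generic examples of record-level checks, not ChemSpider's actual validation pipeline.

```python
# Illustrative automated record checks for a chemical database (generic rules only).
from rdkit import Chem

def check_record(name: str, smiles: str) -> list[str]:
    """Return a list of quality flags for one (name, structure) record."""
    flags = []
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return ["unparseable structure"]
    try:
        Chem.SanitizeMol(mol)                      # catches valence and aromaticity problems
    except Exception as err:
        flags.append(f"sanitization failed: {err}")
        return flags
    # Round-trip through InChI; a mismatch can indicate representation issues.
    inchi = Chem.MolToInchi(mol)
    back = Chem.MolFromInchi(inchi)
    if back is None or Chem.MolToSmiles(back) != Chem.MolToSmiles(mol):
        flags.append("InChI round-trip mismatch")
    return flags

if __name__ == "__main__":
    print(check_record("benzene", "c1ccccc1"))            # expected: no flags
    print(check_record("bad valence", "C(C)(C)(C)(C)C"))   # expected: sanitization flag
```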
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuantUniversity
As the complexity of AI and machine learning processes increases, robust data pipelines need to be developed for industrial-scale model development and deployment. In regulated industries such as finance and healthcare, where automated decision making is increasingly being used, tracking the design of experiments from inception to deployment is critical to ensure that a robust process is adopted. Model life-cycle management solutions are proposed to track experiments, design robust experiments for hyperparameter tuning, optimization and selection of models, and for monitoring. The number of choices and parameters that need to be tracked makes it significantly challenging to trace experiments and to address reproducibility concerns.
In this talk, we discuss QuTrack, a Blockchain-based approach to track experiment and model changes, primarily for AI and ML models. In addition, we discuss how change analytics can be used for process improvement and to enhance the model development and deployment processes.
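The sketch below illustrates, under stated assumptions, how experiment and model changes can be recorded in an append-only, hash-chained log — the blockchain-style core of the kind of tracking described above. It is a generic illustration, not QuTrack's actual design.

```python
# Generic hash-chained experiment log sketch (illustrative; not QuTrack's implementation).
import hashlib
import json
import time

class ExperimentLedger:
    """Append-only log where each entry commits to the previous one via its hash."""

    def __init__(self) -> None:
        self.chain: list[dict] = []

    def record(self, experiment_id: str, params: dict, metrics: dict) -> dict:
        prev_hash = self.chain[-1]["hash"] if self.chain else "0" * 64
        body = {
            "experiment_id": experiment_id,
            "params": params,
            "metrics": metrics,
            "timestamp": time.time(),
            "prev_hash": prev_hash,
        }
        body["hash"] = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.chain.append(body)
        return body

    def verify(self) -> bool:
        """Recompute every hash; tampering with an earlier entry breaks the chain."""
        for i, entry in enumerate(self.chain):
            content = {k: v for k, v in entry.items() if k != "hash"}
            digest = hashlib.sha256(json.dumps(content, sort_keys=True).encode()).hexdigest()
            if digest != entry["hash"]:
                return False
            if i > 0 and entry["prev_hash"] != self.chain[i - 1]["hash"]:
                return False
        return True

if __name__ == "__main__":
    ledger = ExperimentLedger()
    ledger.record("exp-001", {"lr": 0.01, "depth": 6}, {"auc": 0.87})
    ledger.record("exp-002", {"lr": 0.005, "depth": 8}, {"auc": 0.89})
    print("chain valid:", ledger.verify())
```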
Use of spark for proteomic scoring seattle presentationlordjoe
Slides presented to the Seattle Spark Meetup on August 12, 2015. Note: the work on Accumulators is a separate GitHub project, https://github.com/lordjoe/SparkAccumulators
Deep learning methods applied to physicochemical and toxicological endpointsValery Tkachenko
Chemical and pharmaceutical companies, and government agencies regulating both chemical and biological compounds, all strive to develop new methods to provide efficient prioritization, evaluation and safety assessment for the hundreds of new chemicals that enter the market annually. While there is a lot of historical data available within the various agencies, organizations and companies, significant gaps remain in both the quantity and quality of the available data, as well as in optimal predictive methods. Traditional QSAR methods are based on sets of features (fingerprints) that represent the functional characteristics of chemicals. Unfortunately, due to both data gaps and limitations in the development of QSAR models, read-across approaches have become a popular area of research. Successes in the application of artificial neural networks, and specifically deep learning neural networks, have delivered new optimism that the lack of data and limited feature sets can be overcome by using deep learning methods. In this poster we will present a comparison of various machine learning methods applied to several toxicological and physicochemical parameter endpoints. This abstract does not reflect U.S. EPA policy.
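As a hedged, generic illustration of the kind of deep learning model compared in such studies — the architecture, descriptors, and data below are assumptions, not the poster's actual setup — a minimal feed-forward regressor for a physicochemical endpoint could look like this:

```python
# Minimal feed-forward neural network sketch for a physicochemical endpoint
# (synthetic data; layer sizes and descriptors are illustrative assumptions).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_samples, n_descriptors = 500, 50
X = rng.normal(size=(n_samples, n_descriptors))               # stand-in molecular descriptors
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=n_samples)   # synthetic endpoint values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(128, 64, 32),        # three hidden layers
                     activation="relu",
                     max_iter=3000,
                     random_state=0)
model.fit(X_tr, y_tr)
print("external R^2:", round(r2_score(y_te, model.predict(X_te)), 3))
```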
A Threshold Fuzzy Entropy Based Feature Selection: Comparative StudyIJMER
Feature selection is one of the most common and critical tasks in database classification. It reduces computational cost by removing insignificant and unwanted features and, consequently, makes the diagnosis process accurate and comprehensible. This paper presents the measurement of feature relevance based on fuzzy entropy, tested with a Radial Basis Function (RBF) network classifier, Bagging (bootstrap aggregating), Boosting and stacking on datasets from various fields. Twenty benchmark datasets available in the UCI Machine Learning Repository and KDD were used for this work. The accuracy obtained from these classification processes shows that the proposed method is capable of producing good, accurate results with fewer features than the original datasets.
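The following Python sketch shows one common way to score each feature by fuzzy entropy and keep features below an entropy threshold. The membership function (min–max normalization) and the threshold value are assumptions for illustration, not the paper's exact formulation.

```python
# Fuzzy-entropy feature ranking sketch (De Luca-Termini entropy on min-max
# normalized feature values; membership choice and threshold are illustrative).
import numpy as np

def fuzzy_entropy(feature: np.ndarray, eps: float = 1e-12) -> float:
    """Average De Luca-Termini fuzzy entropy of one feature's membership values."""
    lo, hi = feature.min(), feature.max()
    mu = (feature - lo) / (hi - lo + eps)          # min-max normalization as membership degree
    h = -(mu * np.log(mu + eps) + (1.0 - mu) * np.log(1.0 - mu + eps))
    return float(h.mean())

def select_features(X: np.ndarray, threshold: float = 0.3) -> list[int]:
    """Keep indices of features whose fuzzy entropy falls below the threshold."""
    scores = [fuzzy_entropy(X[:, j]) for j in range(X.shape[1])]
    return [j for j, s in enumerate(scores) if s < threshold]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    crisp = np.repeat([0.0, 1.0], 50)              # near-crisp feature -> low fuzzy entropy
    fuzzy = rng.uniform(0.4, 0.6, size=100)        # ambiguous feature -> high fuzzy entropy
    X = np.column_stack([crisp, fuzzy, rng.normal(size=100)])
    print("selected feature indices:", select_features(X, threshold=0.3))
```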
26. The NewEdge Platform (Molecular Modeling & Simulation): capabilities. Starting point: molecular libraries. Early-stage discovery — Hit identification: virtual screening on shape / fingerprint / pharmacophore / docking; Hit filtration: screening based on target specificity, or QSAR-based screening if relevant data are available; Hit to Lead: fragment-based and scaffold-hopping approaches; Library generation: clues for fragment replacement, clues for growth, hybrid libraries. Late-stage discovery — Lead optimization: site-directed optimization clues, residue contributions towards binding energy. End point: a selected candidate with high reliability.
28. Hit filtration technologies: Filtration of identified hits using specificity criteria narrows down the list. VLife’s NewEdge platform offers multiple hit filtration methods based on this principle.
29. Technologies for ‘Hit to Lead’: Filtered hits require a deeper understanding of chemical-space requirements and clues for scaffold hopping to achieve novelty. VLife’s NewEdge platform offers multiple tools for obtaining novel leads from hits.
30. Technologies for library generation: The identified lead is used to explore chemical space further by generating a lead-like library. VLife’s NewEdge platform offers multiple library generation tools.
32. NewEdge: end-to-end capabilities — protein structure analysis, active site analysis, homology modeling, property visualization, docking, QSAR analysis, database querying and virtual screening. NewEdge platform: application summary — a single platform covering multiple scenarios; the slide tabulates approaches I–VII according to whether activity data are available and whether a target structure (close or remote homolog) is available, mapped to the corresponding applications (primary lead chemistry, pharmacophore identification, conformer generation, combinatorial library generation).
33. Section – II Innovations from VLife www.vlifesciences.com
47. Activity prediction benchmarking: VLifeSCOPE — comparison of VLifeSCOPE with force-field-based docking as a means of predicting the likely experimental MIC. Accuracy measure: rank-order comparison of each molecule in the data set with its MIC. Reference: Modeling and interactions of Aspergillus fumigatus lanosterol 14-α demethylase ‘A’ with azole antifungals (Bioorganic & Medicinal Chemistry, 2004, 12, 2937–2950). With VLifeSCOPE, the predicted rank order for the first four compounds exactly matches the experimental finding, while the binding-energy-based rank order is completely off track.
48. QSAR benchmarking I: GQSAR — comparison of the patent-pending GQSAR method with other 2D QSAR and 3D QSAR methods for accuracy of predicted activity. Accuracy measure: established statistical measures, pred_r2 and q2. Reference: Group-Based QSAR (G-QSAR): Mitigating Interpretation Challenges in QSAR, Subhash Ajmani, Kamalakar Jadhav, Sudhir A. Kulkarni, QSAR & Combinatorial Science, 2009, 28 (1), 36–51. VLife’s GQSAR is more accurate than similar technologies and far more insightful for lead optimization.
49. QSAR benchmarking II: kNN-MFA — comparison of the kNN-MFA method with other QSAR methods for accuracy of prediction in the case of non-linear relationships. Accuracy measure: established statistical measures, pred_r2 and q2 (reported for steroid, cancer and anti-inflammatory data sets). Reference: Three-Dimensional QSAR Using the k-Nearest Neighbor Method and Its Interpretation, Subhash Ajmani, Kamalakar Jadhav, Sudhir A. Kulkarni, Journal of Chemical Information and Modeling, 2006, 46, 24–31. VLife’s kNN-MFA method is consistently more accurate than similar technologies across widely varying chemistries.
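A hedged sketch of the kNN idea underlying kNN-MFA — predicting the activity of a query from its k nearest neighbors in a molecular-field descriptor space and evaluating with a cross-validated q2-style statistic — is shown below; the descriptors and data are synthetic stand-ins, not VLife's actual MFA fields or benchmark sets.

```python
# kNN-on-field-descriptors sketch with a leave-one-out q2-style statistic
# (synthetic descriptors; not VLife's kNN-MFA implementation).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(7)
n_mols, n_field_points = 40, 200
X = rng.normal(size=(n_mols, n_field_points))                 # stand-in field values per grid point
y = X[:, :3].mean(axis=1) + 0.05 * rng.normal(size=n_mols)    # synthetic activity

knn = KNeighborsRegressor(n_neighbors=3, weights="distance")
y_loo = cross_val_predict(knn, X, y, cv=LeaveOneOut())        # leave-one-out predictions

# q2 (cross-validated r2): 1 - PRESS / total sum of squares about the mean.
press = np.sum((y - y_loo) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
q2 = 1.0 - press / ss_tot
print("leave-one-out q2:", round(q2, 3))
```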
50. Docking benchmarking – I: GRIP — comparison with multiple other technologies for accuracy. Accuracy measure: a difference of < 1 Å between the predicted and laboratory-determined result. Reference: standard comparison data taken from ‘Deciphering common failures in molecular docking of ligand-protein complexes’ by G.M. Verkhivker, D. Bouzida, D.K. Gehlhaar, P.A. Rejto, S. Arthurs, A.B. Colson, S.T. Freer, V. Larson, B.A. Luty, T. Marronne, P.W. Rose, J. Comput. Aided Mol. Des., 2000, 14, 731–751.
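A minimal sketch of the accuracy criterion described above — the heavy-atom RMSD between a predicted pose and the experimentally determined pose, judged against a 1 Å threshold — is given below. The coordinate arrays are illustrative, and atom correspondence is assumed to be already established.

```python
# Pose-accuracy sketch: RMSD between predicted and experimental ligand coordinates,
# judged against a 1 Å success threshold (atom ordering assumed identical).
import numpy as np

def rmsd(predicted: np.ndarray, experimental: np.ndarray) -> float:
    """Root-mean-square deviation over matched atoms (no re-alignment performed)."""
    diff = predicted - experimental
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

if __name__ == "__main__":
    crystal = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.2, 1.1, 0.3]])
    pose = crystal + np.array([0.2, -0.1, 0.15])       # small rigid offset from the crystal pose
    value = rmsd(pose, crystal)
    print(f"RMSD = {value:.2f} A -> {'success' if value < 1.0 else 'failure'} at the 1 A criterion")
```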
51. Docking benchmarking – II: GRIP — comparison with multiple other technologies for speed and the ability to handle complex molecules. Speed measure: minutes taken per docking. Molecular complexity measure: number of rotatable bonds within the molecule. VLife’s GRIP docking is faster, more accurate, and better able to handle complex molecules than a wide spectrum of competing technologies.
67. Section – IV Post-discovery strategic research services www.vlifesciences.com
68. VLifeRVHTS platform: disclosable approach — multiple scenarios, single platform. The target knowledgebase covers 1,066 protein targets (T1 … T1066); the target-specific compound knowledgebase covers all co-crystallized compounds (CC) and known compounds (KC) with reported activity for each respective protein target. An Intelligent Rule Based System (IRBS), derived from the VLifeRVHTS platform, is built from binding studies of the target-specific compound database against the target knowledgebase. After target selection, the query (lead) compound is screened for binding with the IRBS, non-matching targets are filtered out with VLifeRVHTS, and interaction studies on the selected targets yield new putative targets for the query compound, together with priorities and suggestions.