This talk was given at the EBI on the Wellcome Trust Genome Campus and outlines problems with chemical information standardization, along with various efforts to tackle them.
Tools and approaches for data deposition into nanomaterial databases - Valery Tkachenko
Sustainable research progress in many scientific disciplines critically depends on the existence of robust specialized databases that integrate and structure all available experimental information in the respective fields. The need for such a reference database is especially critical for nanoscience and nanomaterial research, given the significant diversity of shapes, sizes, and properties of engineered nanomaterials and the difficulty of synthesizing engineered nanoparticles with controlled properties. The acquisition of data from public sources is inefficient, time-consuming, and limited in scope. Moreover, it is not clear where the resources to support this activity on a perpetual basis would come from. The NIH has recently announced its intention to provide special funds toward data deposition by experimental investigators through the ‘data sharing plan’ required for each proposal. However, this points to a current weakness: laboratories use different data collection approaches, each of which requires interpretation by the staff hosting the database. It would be far more efficient and useful if each investigator worked from a template with key terms that could be modified to add new or important additional data or parameters. We will discuss tools and approaches to facilitate the collection and direct deposition of experimental data into the Nanomaterial Registry (https://www.nanomaterialregistry.org/) - a versatile, semantically enriched, template-based platform for registering diverse data pertaining to nanomaterials research.
Chemistry Validation and Standardization Platform v2.0 - Valery Tkachenko
In recent years there has been explosive growth in the number of public chemical databases available online, a number of them containing tens of millions of chemical structures. Examples include PubChem, ChemSpider and ChEMBL, and users of these databases have become increasingly aware of the data quality issues associated with these public resources. Seamless integration and mapping between databases, even for some common chemicals, is challenged by differing approaches to chemical standardization prior to registration into a database. The lack of standards for representing and handling chemical information certainly contributes to this problem. The Chemistry Validation and Standardization Platform (CVSP), originally developed to support the European Innovative Medicines Initiative project known as OpenPHACTS, was designed to provide an open platform for processing and standardizing chemical compounds. The system has been used to process millions of chemical compounds for dissemination through public websites and, unlike other validation and standardization systems, it supports both standard and custom rulesets. We will provide an overview of CVSP 2.0, the next generation of the platform, extending support to new cheminformatics toolkits and adding capabilities such as collaborative rule authoring.
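To make the kind of rule-based processing described above concrete, here is a minimal sketch using RDKit's MolStandardize module as a stand-in; CVSP's own rulesets and toolkit are not shown, and the cleanup steps chosen here are illustrative assumptions.

```python
# A hedged sketch of structure standardization prior to registration,
# using RDKit's MolStandardize (not CVSP's actual ruleset engine).
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    mol = rdMolStandardize.Cleanup(mol)               # normalize and reionize
    mol = rdMolStandardize.FragmentParent(mol)        # keep the parent fragment
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize charges
    return Chem.MolToSmiles(mol)

# A sodium acetate salt record standardizes to the neutral parent acid
print(standardize("CC(=O)[O-].[Na+]"))  # -> CC(=O)O
```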
Open Science Data Repository - the platform for materials research - Valery Tkachenko
Over the last few years we have seen tremendous growth in data repositories, pushed and supported by funding bodies and various data preservation initiatives. As a result we now have a variety of scientific resources, combined into a broad network and indexed through directories like BioSharing and re3data. This network, while growing quickly, is still in the early days of adopting semantic web standards and does not yet support deep data indexing and discoverability, let alone the fact that mechanisms of intellectual property protection are, at best, as simple as making data public or private. The lack of standards and well-defined models for describing the structure of scientific information further inhibits the free flow of information that is essential for scientific discovery. Not surprisingly, one of the most affected areas is materials science, where the inherent complexity of the field makes the situation even more severe. In this talk we present a chemistry information platform designed to support a variety of data formats along with metadata, sophisticated ways of collaborating, and secure data exchange. We will discuss the challenges we have faced in developing such a platform, as well as the solutions we have come up with.
Clustering the Royal Society of Chemistry chemical repository to enable enhan... - Valery Tkachenko
The Royal Society of Chemistry has hosted the ChemSpider database and associated platforms for over five years. Technologies have made significant progress over that period but, more importantly, community needs in terms of the variety of data types as well as search performance have increased. The preprocessing of chemicals for improved similarity searching and compound database navigation is seen as one crucial component of the major development effort to architect a new data repository. This component is being engineered and implemented in collaboration with the group of Professor Oliver Kohlbacher at the University of Tübingen, who have developed an approach for clustering large chemical libraries based on a fast, parallel, and purely CPU-based algorithm for 2D binary fingerprint similarity calculation. Using this method, the complete similarity network of our seed set of tens of millions of chemicals has been analyzed at a Tanimoto threshold of 0.6, and all similarity links were fed into our database. These links are highly beneficial and will allow us to create more complex and enriching visualizations of similar compounds, with associated bioactivity data and physicochemical properties, for users of the RSC chemical repository. This presentation will provide an overview of our experiences in applying clustering to our compound data and how it will be used to enrich data navigation on the RSC data repository.
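The core operation underlying such a similarity network, comparing 2D binary fingerprints pairwise and keeping links above a Tanimoto threshold of 0.6, can be sketched in a few lines with RDKit; this toy example is not the parallel algorithm referenced above.

```python
# Tanimoto similarity links between 2D binary fingerprints (toy library).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

THRESHOLD = 0.6
links = []
for i in range(len(fps)):
    # Compare fingerprint i against all later fingerprints in one call
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:])
    for j, sim in enumerate(sims, start=i + 1):
        if sim >= THRESHOLD:
            links.append((i, j, round(sim, 2)))  # an edge in the network

print(links)
```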
Building a semantic chemistry platform with the Royal Society of Chemistry - Valery Tkachenko
We live in an exponentially expanding world of “big data”. Social networks, global portals and other distributed systems have been grappling with this problem for a few years now. Scientific applications commonly lag behind mainstream trends due to the complexity of the scientific domain. The Royal Society of Chemistry is building the Global Chemistry Network, connecting a variety of resources both in-house and external, bridging gaps and advancing the chemical sciences. One of the main issues in the world of big data is ease of navigation and the comprehensiveness of search capabilities. This is where the semantic web approach meets the world of big data. We will present our approaches to building a global federated chemistry platform that connects multiple domains of chemistry using semantic web technologies.
Model organisms such as budding yeast provide a common platform to interrogate and understand cellular and physiological processes. Knowledge about model organisms, whether generated during the course of scientific investigation or extracted from published articles, is made available by model organism databases (MODs) such as the Saccharomyces Genome Database (SGD) for powerful, data-driven bioinformatic analyses. Integrative platforms such as InterMine offer a standard platform for MOD data exploration and data mining. Yet today’s bioinformatic analyses also require access to a significantly broader set of structured biomedical data, such as that found in the emerging network of Linked Open Data (LOD). If MOD data could be provisioned as FAIR (Findable, Accessible, Interoperable, and Reusable), then scientists could leverage a greater amount of interoperable data in knowledge discovery.
The goal of this proposal is to increase the utility of MOD data by implementing standards-compliant data access interfaces that interoperate with Linked Data. We will focus our efforts on developing interfaces for data access, data retrieval, and query answering for SGD. Our software will publish InterMine data as LOD that are semantically annotated with ontologies and can be retrieved in standardized formats (e.g. JSON-LD, Turtle). We will facilitate the exploration of MOD data for hypothesis testing by implementing efficient query answering using Linked Data Fragments, and by developing a set of graphical user interfaces to search for data of interest, explore connections, and answer questions that leverage the wider LOD network. Finally, we will develop a locally and cloud-deployable image to enable the rapid deployment of the proposed infrastructure. Our efforts to increase interoperability and ease of deployment for biomedical data repositories will increase research productivity and reduce costs associated with data integration and warehouse maintenance.
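As a rough illustration of what publishing a MOD record as semantically annotated Linked Data looks like, here is a minimal rdflib sketch; the locus identifier and vocabulary choices are invented stand-ins, not the project's actual data model.

```python
# Minimal sketch: one MOD record as RDF, serialized in the two formats named
# above. The locus ID and the ontology term used are illustrative only.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

SGD = Namespace("https://www.yeastgenome.org/locus/")
SO = Namespace("http://purl.obolibrary.org/obo/SO_")  # Sequence Ontology

g = Graph()
gene = SGD["S000000000"]                  # placeholder locus identifier
g.add((gene, RDF.type, SO["0000704"]))    # SO:0000704 = gene
g.add((gene, RDFS.label, Literal("EXAMPLE-GENE")))

print(g.serialize(format="turtle"))
print(g.serialize(format="json-ld"))
```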
The availability of high-quality metadata is key to facilitating discovery in the large variety of scientific datasets that are increasingly becoming publicly available. However, despite the recent focus on metadata, the diversity of metadata representation formats and the poor support for semantic markup typically result in metadata that are of poor quality. There is a pressing need for a metadata representation format that provides strong interoperation capabilities together with robust semantic underpinnings. In this talk, we describe such a format, together with open-source Web-based tools that support the acquisition, search, and management of metadata. We outline an initial evaluation using metadata from a variety of biomedical repositories.
The Royal Society of Chemistry and its adoption of semantic web technologies ... - Valery Tkachenko
Semantic web technologies have quickly penetrated all areas of traditional and new database systems and have become the de facto standard for information exchange and communication. The Royal Society of Chemistry has built a new chemistry data repository with the semantic web at the core of the system. Every module of the data repository contains a semantic web layer and is able to interact internally and externally using standard approaches and formats, including RDF, appropriate ontologies, SPARQL querying and so on. In this presentation we will review the challenges associated with developing this new system based on semantic web technologies, and how the approach we have taken offers distinct advantages over the original data model used to build the ChemSpider database. These advantages include extensibility, an ontological underpinning, federated integration, and the adoption of modern standards rather than the constraints of a standard SQL model.
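A semantic layer of this kind is typically consumed through a SPARQL endpoint. The sketch below shows a generic query via SPARQLWrapper; the endpoint URL is a hypothetical placeholder, not the repository's actual address.

```python
# Hypothetical example of querying a SPARQL endpoint exposed by such a layer.
# Requires SPARQLWrapper (pip install sparqlwrapper); the URL is invented.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/chemistry/sparql")  # placeholder
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?compound ?label WHERE {
        ?compound rdfs:label ?label .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["compound"]["value"], row["label"]["value"])
```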
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata - Michel Dumontier
Biomedical researchers will remain stymied in their ability to take full advantage of the Big Data revolution if they can never find the datasets they need to analyze, if there is a lack of clarity about what particular datasets contain, and if data are insufficiently described.
CEDAR, an NIH BD2K Center of Excellence, aims to develop methods and tools to vastly ease the burden of authoring good experimental metadata, and to maximally use this information to zero in on datasets of interest.
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration - Stuart Chalk
Integration of the combined JSmol/JSpecView molecular viewer/spectral viewer software into the Eureka Research Workbench. It can display molecular structures, spectra, and a linked view in which clicking on a spectral peak shows the corresponding molecular motion (IR).
Researchers at EPA’s National Center for Computational Toxicology integrate advances in biology, chemistry, and computer science to examine the toxicity of chemicals and help prioritize chemicals for further research based on potential human health risks. The goal of this research program is to evaluate thousands of chemicals quickly, at much reduced cost and on a shorter time frame relative to traditional approaches. The data generated by the Center include the characterization of thousands of chemicals across hundreds of high-throughput screening assays, consumer use and production information, pharmacokinetic properties, literature data, and physical-chemical properties, as well as predictive computational modeling of toxicity and exposure. We have developed a number of databases and applications to deliver the data to the public, the academic community, industry stakeholders, and regulators. This presentation will provide an overview of our work to develop an architecture that integrates diverse large-scale data from the chemical and biological domains, our approaches to disseminating these data, and the delivery of models supporting predictive computational toxicology. In particular, this presentation will review our new CompTox Chemistry Dashboard and the developing architecture to support real-time property and toxicity endpoint prediction. This abstract does not reflect U.S. EPA policy.
Semantic web technologies offer a potential mechanism for the representation and integration of thousands of biomedical databases. Many of these databases offer cross-references to other data sources, but these are generally incomplete and prone to error. In this paper, we conduct an empirical analysis of the link structure of life science Linked Data, obtained from the Bio2RDF project. Three different link graphs for datasets, entities and terms are characterized by degree, connectivity, and clustering metrics, and their correlation is measured as well. Furthermore, we utilize the symmetry and transitivity of entity links to build a benchmark and evaluate several popular entity matching approaches. Our findings indicate that the life science data network can help find hidden links, can be used to validate links, and may offer a mechanism to integrate a wider set of resources to support biomedical knowledge discovery.
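The graph metrics named above (degree, connectivity, clustering) are straightforward to compute once the link structure is loaded; here is a toy networkx sketch with invented dataset links, not the actual Bio2RDF graphs.

```python
# Characterizing a toy dataset-link graph by degree, clustering and
# connectivity; the edges below are illustrative, not Bio2RDF data.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("drugbank", "kegg"), ("drugbank", "pubchem"),
    ("kegg", "pubchem"), ("pubchem", "chebi"),
])

print(dict(g.degree()))            # degree of each dataset node
print(nx.average_clustering(g))    # average clustering coefficient
print(nx.is_connected(g))          # is the link graph connected?
print(nx.node_connectivity(g))     # size of the smallest node cut
```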
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are advancing the identification of emerging contaminants in environmental matrices, improving the means by which exposure analyses can be conducted. However, confidence in structure identification of unknowns in NTA presents challenges to analytical chemists. Structure identification requires integration of complementary data types such as reference databases, fragmentation prediction tools, and retention time prediction models. The goal of this research is to optimize and implement structure identification functionality within the US EPA’s CompTox Chemistry Dashboard, an open chemistry resource and web application containing data for ~760,000 substances. Rank-ordering the number of sources associated with chemical records within the Dashboard (Data Source Ranking) improves the identification of unknowns by bringing the most likely candidate structures to the top of a search results list. Database searching has been further optimized with the generation of MS-Ready Structures. MS-Ready structures are de-salted, stripped of stereochemistry, and mixture separated to replicate the form of a chemical observed via HRMS. Functionality to conduct batch searching of molecular formulae and monoisotopic masses was designed and released to improve searching efforts. Finally, a scoring-based identification scheme was developed, optimized, and surfaced via the Dashboard using multiple data streams contained within the database underlying the Dashboard. The scoring-based identification scheme improved the identification of unknowns over previous efforts using data source ranking alone. Combining these steps within an open chemistry resource provides a freely available software tool for structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
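The MS-Ready transformations described (de-salting, removing stereochemistry, splitting mixtures) can be approximated with RDKit as below; this is a hedged sketch, not the EPA's actual MS-Ready workflow, and charge neutralization is omitted for brevity.

```python
# Rough approximation of MS-Ready structure generation with RDKit.
from rdkit import Chem

def ms_ready_forms(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    forms = []
    # Mixture/salt separation: each disconnected component becomes a record
    for frag in Chem.GetMolFrags(mol, asMols=True):
        Chem.RemoveStereochemistry(frag)       # strip stereo descriptors
        forms.append(Chem.MolToSmiles(frag))
    return forms

# A chiral salt splits into stereo-free components for HRMS matching
print(ms_ready_forms("C[C@H](O)C(=O)[O-].[Na+]"))
```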
FAIRPORT domain-specific metadata using W3C DCAT & SKOS with ontology views - Tim Clark
FAIRPORT is an international project to develop a lightweight interoperability architecture for biomedical - and potentially other - data repositories.
This slide deck is a presentation to the FAIRPORT technical team. It describes a proposed model for supporting domain-specific search metadata using a common schema model across all repositories.
The proposal makes use of the following existing technologies, with minor extensions (a minimal sketch follows the list):
- the W3C DCAT model for dataset description
- the W3C SKOS knowledge organization system
- OWL2 Ontology Language
- Dublin Core Vocabulary
- NCBO Bioportal biomedical ontologies collection
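As a rough illustration of how these pieces fit together, the sketch below describes a dataset with DCAT and Dublin Core terms and points its theme at an ontology concept; the dataset URI and metadata values are invented.

```python
# Minimal DCAT + Dublin Core dataset description (invented dataset and values).
from rdflib import Graph, Literal, RDF, URIRef
from rdflib.namespace import DCAT, DCTERMS

g = Graph()
ds = URIRef("https://example.org/dataset/demo-expression-study")  # invented
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Demo expression study")))
g.add((ds, DCTERMS.publisher, Literal("Example Lab")))
# Domain-specific search metadata: dcat:theme pointing at an ontology term
g.add((ds, DCAT.theme, URIRef("http://purl.obolibrary.org/obo/OBI_0000070")))

print(g.serialize(format="turtle"))
```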
Annotopia open annotation services platform - Tim Clark
Annotopia is an open-access, open-source, open annotation services platform developed for scientific annotation of documents and datasets on the web using the W3C Open Annotation model http://www.openannotation.org/spec/core/.
Using Annotopia, virtually any client application including lightweight web clients, can create, selectively share, and access annotation of web documents and data. This can be done regardless of the ownership of the base objects being annotated.
Annotopia supports unstructured, semi-structured and fully-structured (semantic) annotation; manual and automated (textmining) annotation; permissions, groups, and sharing. It also provides access to specialized vocabulary and text analytics services.
Annotopia is an open source platform licensed under Apache 2.0.
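The annotation model referenced above structures every annotation as a body attached to a target. The sketch below shows that shape using the later W3C Web Annotation JSON-LD serialization (the successor to Open Annotation); the identifiers are invented.

```python
# The body/target shape of a web annotation, in Web Annotation JSON-LD form.
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {
        "type": "TextualBody",
        "value": "This figure appears inconsistent with Table 2.",
    },
    "target": "https://example.org/articles/123#figure-2",  # invented target
}
print(json.dumps(annotation, indent=2))
```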
Presentation to the ImmPort Science Meeting, February 27, 2014, on the proper treatment of value sets in the ImmPort Immunology Database and Analysis Portal.
exFrame: a Semantic Web Platform for Genomics Experiments - Tim Clark
Slides from a talk given at Bio-Ontologies 2013, Berlin, Germany, 20 July 2013.
Emily Merrill*, Stephane Corlosquet*, Paolo Ciccarese†*, Tim Clark*†‡, Sudeshna Das†*
* Massachusetts General Hospital
† Harvard Medical School
‡ School of Computer Science, University of Manchester
Bio2RDF is an open-source project that offers a large and connected knowledge graph of Life Science Linked Data. Each dataset is expressed using its own vocabulary, thereby hindering the ability to integrate, search, query, and browse data across similar or identical types of data. With growth and content changes in the source data, a manual approach to maintaining mappings has proven untenable. The aim of this work is to develop a (semi-)automated procedure to generate high-quality mappings between Bio2RDF and SIO using BioPortal ontologies. Our preliminary results demonstrate that our approach is promising in that it can find new mappings using a transitive closure over ontology mappings. Further development of the methodology, coupled with improvements in the ontology, will offer a better-integrated view of the Life Science Linked Data.
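The transitive-closure idea mentioned above can be sketched simply: treat known term-to-term mappings as edges of a graph, and any terms in the same connected component become candidate mappings. The term identifiers below are illustrative.

```python
# Finding candidate mappings by transitive closure over known mappings.
import networkx as nx

mappings = [
    ("bio2rdf:Gene", "ontoA:Gene"),    # known mapping to ontology A
    ("ontoA:Gene", "sio:SIO_010035"),  # known mapping to SIO (illustrative ID)
]

g = nx.Graph(mappings)
for component in nx.connected_components(g):
    print("transitively mapped:", sorted(component))
# Yields the new candidate mapping bio2rdf:Gene <-> sio:SIO_010035
```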
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are being made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for almost 760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectrometry non-targeted screening community, who are generating important data for detecting and assessing environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders, including, for example, scientists interested in algal toxins and hydraulic fracturing chemicals. This presentation will provide an overview of the challenges associated with the curation of data from EPA’s December 2016 Hydraulic Fracturing Drinking Water Assessment Report, which covered chemicals reported to be used in hydraulic fracturing fluids and those found in produced water. The data have been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, hazard and exposure predictions, and links to the open literature. The application of the dashboard to support mass spectrometry non-targeted analysis studies will also be reviewed. This abstract does not reflect U.S. EPA policy.
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. This document describes a consensus among participating stakeholders in the Health Care and Life Sciences domain on the description of datasets using the Resource Description Framework (RDF). This specification meets key functional requirements, reuses existing vocabularies to the extent possible, and addresses elements of data description, versioning, provenance, discovery, exchange, query, and retrieval.
Imaging abdomen trauma: uterine trauma, part 11 - Dr Ahmed Esawy
Covers blunt abdominal trauma, penetrating abdominal trauma, FAST abdominal ultrasound, haemoperitoneum, pneumoperitoneum, the American Association for the Surgery of Trauma (AAST) grading, subcapsular haematoma, parenchymal laceration, uterine rupture, uterine laceration, uterine contusion, and fetal trauma. Includes a range of cases for oral radiodiagnosis examinations worldwide, with CT, MRI, and plain X-ray images.
This is a presentation given at the Opal Events meeting "Drug Discovery Partnerships: Filling the Pipeline". I was speaking in a session with Jean-Claude Bradley on "Pre-competitive Collaboration: Sharing Data to Increase Predictability". This presentation discussed some of the work we are doing on Open PHACTS. My thanks especially to Carole Goble, Lee Harland and Sean Ekins for their comments.
In recent years there has been a dramatic increase in the number of freely accessible online databases serving the chemistry community. The internet provides chemistry data that can be used for data mining, for computational models, and for integration into systems that aid drug discovery. There is, however, a responsibility to ensure that the data are of high quality, so that time is not wasted on erroneous searches, models are underpinned by accurate data, and the improved discoverability of online resources is not marred by incorrect data. In this article we provide an overview of some of the authors' experiences using online chemical compound databases, critique the approaches taken to assemble data, and suggest approaches to deliver definitive reference data sources.
A Semantic Web based Framework for Linking Healthcare Information with Comput... - Koray Atalag
Presented at Health Informatics New Zealand (HINZ 2017) Conference, 1-3 Nov 2017, Rotorua, New Zealand. Authorship: Koray Atalag, Reza Kalbasi, David Nickerson
The University of Auckland
Novel opportunities for computational biology and sociology in drug discovery - Avinash Tiwari
Current drug discovery is impossible without sophisticated modeling and computation. In this review we outline previous advances in computational biology and, by tracing the steps involved in pharmaceutical development, explore a range of novel, high-value opportunities for computational innovation in modeling the biological process of disease and the social process of drug discovery. These opportunities include text mining for new drug leads, modeling molecular pathways and predicting the efficacy of drug cocktails, analyzing genetic overlap between diseases and predicting alternative drug use. Computation can also be used to model research teams and innovative regions and to estimate the value of academy–industry links for scientific and human benefit. Attention to these opportunities could promise punctuated advance and will complement the well-established computational work on which drug discovery currently relies.
Ontology-Driven Clinical Intelligence: Removing Data Barriers for Cross-Disci... - Remedy Informatics
The presentation describes how Remedy Informatics is advocating and innovating "flexible standardization" through an ontology-driven approach to clinical research. It shows in greater detail how a foundational, standardized Mosaic Ontology can be extended for more specific research applications, down to focused research on individual diseases.
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases within the DSSTox project, and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for ~760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectrometry non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders, including, for example, scientists interested in Per- & Polyfluoroalkyl Substances (PFAS). Added lists include those sourced from the European Union as well as lists developed in-house, now containing thousands of chemicals. A procured testing library of hundreds of PFAS chemicals annotated into chemical categories has been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and links to the open literature. This presentation will provide an overview of the dashboard, the developing library of PFAS chemicals and associated categorization, and new physicochemical property and environmental fate and transport QSAR prediction models developed for these chemicals. The application of the dashboard to support mass spectrometry non-targeted analysis studies for the identification of PFAS chemicals will also be reviewed. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:... - ChemAxon
SureChEMBL is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that annotates, extracts and indexes chemistry from full-text patent documents by means of continuous, automated text and image mining. SureChEMBL is perhaps the only open, freely available, live patent chemistry resource in a field that has traditionally been commercial.
Since its launch last September, the SureChEMBL interface has provided sophisticated keyword- and chemistry-based querying and exporting functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline extensively leverage ChemAxon technologies for name-to-structure conversion, as well as compound standardisation, registration and searching.
In addition to providing an overview of the system, recent developments and improvements will be described. These include the introduction of various data interchange and exporting options, such as flat files and a data feed client. Furthermore, our future plans for the SureChEMBL system will be outlined. These plans include complementing the chemical annotations with biological ones, covering genes, proteins, diseases and indications. We also plan to further enrich the chemical annotations with a relevance score, indicating their importance in the patent document.
Next generation electronic medical records and search: a test implementation i... - lucenerevolution
Presented by David Piraino, Chief Imaging Information Officer, Imaging Institute, Cleveland Clinic,
and Daniel Palmer, Imaging Institute, Cleveland Clinic
Most patient-specific medical information is document-oriented, with varying amounts of associated metadata. Most patient medical information is textual and semi-structured. Electronic Medical Record (EMR) systems are not optimized to present this textual information to users in the most understandable ways: present EMRs show information only in a reverse-chronological, patient-specific manner. This talk describes the construction and use of Solr search technologies to provide relevant historical information at the point of care while interpreting radiology images.
Radiology reports over a 4-year period were extracted from our Radiology Information System (RIS) and passed through a text processing engine to extract the results, impression, exam description, location, history, and date. Fifteen cases reported during clinical practice were used as test cases to determine whether "similar" historical cases were found. The results were evaluated by the number of searches that returned any result in less than 3 seconds, and by the number of cases illustrating the questioned diagnosis in the top 10 results returned, as determined by a bone and joint radiologist. Methods to better optimize the search results were also reviewed.
On average, 7.8 of the 10 highest-rated reports showed a similar case highly related to the present case. The best search showed 10 out of 10 good examples; the lowest-matching search showed 2 out of 10. The talk will highlight this specific use case and the issues and advances of using Solr search technology in medicine, with a focus on point-of-care applications.
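For context, a "find similar reports" lookup of this kind maps to a simple Solr query; the sketch below uses pysolr with an invented core name and field names, not the Cleveland Clinic configuration.

```python
# Hypothetical Solr query for reports similar to the current case.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/radiology_reports", timeout=3)
results = solr.search(
    "impression:(subcapsular haematoma spleen)",  # terms from the current case
    rows=10,                                       # top 10 results
    fl="id,exam_description,impression,score",     # fields to return
)
for doc in results:
    print(doc["id"], doc.get("impression"))
```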
Translational Biomedical Informatics 2010: Infrastructure and Scaling – Brian Athey, PhD; Professor of Biomedical Informatics and Director for Academic Informatics, University of Michigan Medical School; Chair Designate for Computational Medicine and Bioinformatics, University of Michigan; Associate Director, Michigan Institute for Clinical Health Research; Principal Investigator, National Center for Integrative Biomedical Informatics
Ontology-Driven Clinical Intelligence: A Path from the Biobank to Cross-Disea... - Remedy Informatics
The discovery of clinical insights through effective management and reuse of data requires several conditions to be optimized: Data need to be digital, data need to be structured, and data need to be standardized in terms of metadata and ontology. This presentation describes a bioinformatics system that combines a next-generation biobank management model mapped to applicable international standards and guidelines with a master ontology that controls all input and output and is able to add unique properties to meet the specialized needs of clinicians for cross-disease research.
Evolution of public chemistry databases: past and the future - Valery Tkachenko
Over the last few years we have seen tremendous growth in chemical databases. As a result we now have a variety of scientific resources, combined into a broad network and indexed through directories like BioSharing and re3data. This network, while growing quickly, is still in the early days of adopting semantic web standards and does not yet support deep data indexing and discoverability, let alone the fact that mechanisms of intellectual property protection are, at best, as simple as making data public or private. The lack of standards and well-defined models for describing the structure of scientific information further inhibits the free flow of information that is essential for scientific discovery.
In this talk we will share our experience spanning decades of building chemical databases such as PubChem, ChemSpider, OpenPHACTS and National Database Services, and will outline fundamental problems associated with chemical databases as such, as well as data quality issues and approaches to the modern architecture of large-scale chemical databases.
Materials design is a grand challenge of materials science, and the main approach to solving it is still intuition-based. Such an approach requires substantial time and financial resources, with months to years spent conducting experiments and characterization. Therefore, any kind of model that can be used at the very first stage of materials design to narrow the selection space is a helpful tool for the synthetic chemist. Likewise, an automated search for materials with human-defined target properties across the entire chemical space, i.e. inverse materials design, is a highly desired tool for exploring the materials design space.
De novo design is, moreover, not a completely new task in the development of new organic molecules with target properties: many generative approaches are already in use, alongside screening libraries of existing molecules, searching for drugs against a particular target, or generating new molecules from a very simple initial structure.
Here we would like to present a new approach for generating new materials with desired properties. We used an autoencoder neural network architecture to encode materials composition and crystal structure as a vector in a latent space. In this setting, any Quantitative Structure-Property Relationship (QSPR) model built on that vector can be interpreted as a function in the latent space and can be used to predict properties of existing materials as well as of prophetic ones. This approach achieves accuracy comparable to classic computational methods such as DFT when predicting values of energies or charges, but significantly surpasses them in terms of computational time.
The proposed method was tested on generating superhard materials, but can easily be extended to any target properties, provided a database of material properties is available for training.
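A minimal sketch of the architecture described above, an autoencoder compressing a material representation into a latent vector with a property model trained on that latent space, is given below; the dimensions and data are toy stand-ins, assuming TensorFlow/Keras.

```python
# Toy autoencoder + latent-space property model (invented dimensions/data).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 128)   # stand-in composition/structure descriptors
y = np.random.rand(1000, 1)     # stand-in target property

inp = layers.Input(shape=(128,))
z = layers.Dense(16, activation="relu", name="latent")(inp)  # latent vector
out = layers.Dense(128, activation="sigmoid")(z)
autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

# "QSPR as a function in the latent space": train a regressor on encodings
encoder = keras.Model(inp, z)
Z = encoder.predict(X, verbose=0)
prop_model = keras.Sequential([layers.Dense(32, activation="relu"),
                               layers.Dense(1)])
prop_model.compile(optimizer="adam", loss="mse")
prop_model.fit(Z, y, epochs=5, verbose=0)
print(prop_model.predict(Z[:3], verbose=0))  # predicted property values
```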
Metal-organic frameworks: from database to supramolecular effects in complexa... - Valery Tkachenko
Metal-organic frameworks (MOFs) attract a lot of interest due to their unique structure-dependent properties. Their internal pores comparable to the size of small molecules are naturally refined for various absorbance effects. Possessed properties lie in a foundation of multiple applications, such as catalysis, gas storage/separation and especially – clean energy related ones.
Theoretical calculations are a usual way of decreasing experimental costs while investigating properties of new materials, especially at a design stage. Electronic structure calculations like density functional theory (DFT) in most cases provide an appropriate accuracy in matching experimentally measured data such as adsorbate interaction energies. However, as in the case of experimental studies, large-scale materials screening studies with DFT calculations are rather time-consuming, and it can be carried out only for structures with relatively small unit cell.
Here we would like to present a theoretical and experimental results describing calculation of electron density in metal-organic frameworks. We built a model trained to predict partial charges on MOF atoms based on DFT calculations. The relative error of the model allows us to conclude that models do not decrease the level of accuracy and do not superinduce additional error comparing to DFT. At the same time, computational cost of the model is several orders of magnitude less. Models also demonstrated transferability and allowed to make prediction e.g. for MOFs containing metals not presented in the train set.
We have also built a force field (FF) of two-center and three-center interatomic potentials constructed using the predicted charges. The FF proved able to reproduce MOF crystal structures. As a final test, we applied the developed model and FF to newly synthesized lanthanide-containing MOFs to estimate the influence of supramolecular effects on metal complexation selectivity.
As a result, we have built a model that predicts one of the basic MOF properties at comparatively low computational cost and tested it against experimental data, both taken from literature sources and measured ourselves.
Public repositories containing diverse chemical and biological data are among the main sources of knowledge for biomedical research. Unfortunately, extracting these data and transforming them into a well-interpretable form is a complex exercise. Ongoing community efforts mainly focus on the analysis of term co-occurrence, text annotation based on term similarity, and related tasks [1].
Here we present an approach based on natural-language-processing techniques, intended to shift the search for similar texts on chemical topics from the word to the document level. PubMed records were used to train word2vec and doc2vec models. The resulting text representations can be used to search for similar abstracts; similarity depends on the representation itself rather than on the co-presence of particular terms (neighboring compounds, similar publication date, etc.).
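A minimal doc2vec sketch with gensim is shown below; the three-abstract corpus, vector size and epoch count are placeholders, not the settings used for PubMed:

    # Sketch: document-level similarity with Doc2Vec.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    abstracts = [
        "nmr spectra of substituted benzenes were recorded",
        "partial charges in metal organic frameworks from dft",
        "machine learning prediction of aqueous solubility",
    ]
    corpus = [TaggedDocument(words=a.split(), tags=[i])
              for i, a in enumerate(abstracts)]
    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

    # Embed an unseen abstract and retrieve the most similar documents
    query = model.infer_vector("dft charges for framework materials".split())
    print(model.dv.most_similar([query], topn=2))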
Document-level clustering was also implemented to provide insight into the structure of the PubMed text corpus. This approach can serve as an alternative to standard topic-modeling techniques for discovering hidden semantic features in an unsupervised manner.
Machine learning methods for chemical properties and toxicity based endpointsValery Tkachenko
In the last decade there has been increasing interest in using in silico tools for the potential risk assessment of newly released chemicals, given the large number of chemicals entering the market yearly and the great uncertainty about their possible hazardous effects. Various tools and methods based on machine learning techniques already exist and have been used in a wide range of applications, starting from quantitative structure-property relationships and expanding into predictive toxicology. A lot of historical data has accumulated across multiple publicly available databases and can be exploited with novel machine learning methods. Unfortunately, because of differing datasets, metrics and validation strategies, significant gaps remain in both the quantity and quality of available data, as well as in optimal predictive methods. This work is an attempt to develop a multitask system that serves as a searchable, curated collection of multiple chemical datasets together with ready-to-use machine learning methods, built solely on open source frameworks and libraries. We have implemented a set of traditional ('shallow') machine learning methods, self-tuned using grid search and k-fold cross-validation, such as Naïve Bayes, k-Nearest Neighbors, Random Forest, Boosted Decision Trees, Regularized Logistic Regression, and Support Vector Machines, based on the open source scikit-learn (http://scikit-learn.org/stable/). Deep Neural Network models of varying complexity have also been implemented using Keras (https://keras.io/), an open deep learning library, with TensorFlow (www.tensorflow.org) as the backend. The machine learning models were trained and evaluated to predict measures of toxicity from the physical characteristics of chemical structures, using the same datasets as the Toxicity Estimation Software Tool (https://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test). The Deep Learning models showed very good performance characteristics and were found useful in predicting toxicological and physicochemical endpoints. The results of this work support an optimistic view that some current obstacles in cheminformatics can be overcome using Deep Learning methods.
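As an illustration of the self-tuning setup (grid search wrapped around k-fold cross-validation for a shallow model), here is a minimal scikit-learn sketch; the descriptor matrix, labels and parameter grid are invented:

    # Sketch: grid search + 5-fold cross-validation over a shallow classifier.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.random((500, 128))    # stand-in for chemical descriptors
    y = rng.integers(0, 2, 500)   # stand-in for toxic / non-toxic labels

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=5,
        scoring="roc_auc",
    )
    grid.fit(X, y)
    print(grid.best_params_, round(grid.best_score_, 3))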
Chemical workflows supporting automated research data collectionValery Tkachenko
Acquisition of data from public sources is inefficient, time consuming and limited in scope. The NIH has recently posted its intention to financially support data deposition by investigators through the 'data sharing plan' for each funded proposal. However, this plan also points to a current weakness of centralized data sharing and acquisition: all laboratories use different data collection and formatting approaches. These inconsistencies in data formatting by individual labs lead to the need to invest significant resources in data curation and interpretation by the technical staff maintaining centralized data collection resources such as caNanoLab or the Nanomaterial Registry. It would be far more efficient and useful if there were a standardized data collection and deposition template with standard key terms (such as Minimal Information About Nanomaterials, MIAN) that each investigator could modify to add new or important additional data or parameters. These new features could ultimately be adopted in the classification scheme and guide the scope of the expanding database. This approach would be a win-win, as it would bring structure to the investigator's laboratory, consistency in data reporting, and a means of transmitting data to the database in parallel with publication, eliminating the acquisition step from the process. In this talk we will outline our experience building the Open Science Data Repository, a federated database system for direct acquisition, curation and management of research data, including nanomaterial data capture, transformation, and streamlined submission to nanomaterial knowledgebases. The key part of the system is a microservices-based architecture which exposes a RESTful API suitable for direct integration into Workflow Management Systems, as well as built-in modules facilitating and enforcing various lab-specific standard operating procedures.
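For illustration only, a deposition call against such an API might look like the sketch below; the endpoint URL, payload fields and token are invented and do not describe the actual OSDR API:

    # Hypothetical sketch of a REST deposition; not the real OSDR endpoint.
    import requests

    payload = {
        "dataset": "nanomaterial-characterization",
        "template": "MIAN",   # minimal-information template, per the abstract
        "records": [{"material": "TiO2", "size_nm": 21.5, "assay": "DLS"}],
    }
    resp = requests.post(
        "https://osdr.example.org/api/v1/depositions",  # invented URL
        json=payload,
        headers={"Authorization": "Bearer <token>"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())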
Deep learning methods applied to physicochemical and toxicological endpointsValery Tkachenko
Chemical and pharmaceutical companies, and the government agencies regulating both chemical and biological compounds, all strive to develop new methods that provide efficient prioritization, evaluation and safety assessment for the hundreds of new chemicals that enter the market annually. While a lot of historical data is available within the various agencies, organizations and companies, significant gaps remain in both the quantity and quality of the data, as well as in optimal predictive methods. Traditional QSAR methods are based on sets of features (fingerprints) which represent the functional characteristics of chemicals. Because of both data gaps and limitations in the development of QSAR models, read-across approaches have become a popular area of research. Successes in the application of Artificial Neural Networks, and specifically Deep Learning Neural Networks, have delivered new optimism that the lack of data and limited feature sets can be overcome using Deep Learning methods. In this poster we will present a comparison of various machine learning methods applied to several toxicological and physicochemical parameter endpoints. This abstract does not reflect U.S. EPA policy.
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictionsValery Tkachenko
While we have seen tremendous growth in machine learning methods over the last two decades, there is still no one-size-fits-all solution. The next era of cheminformatics, and of pharmaceutical research in general, is focused on mining heterogeneous big data, which is accumulating at an ever growing pace, and this will likely require more sophisticated algorithms such as Deep Learning (DL). DL has seen increasing use recently and has shown powerful advantages in learning from images and languages as well as many other areas. However, the accessibility of this technique for cheminformatics is hindered, as it is not readily available to non-experts. It was therefore our goal to develop a DL framework embedded into a general research data management platform (Open Science Data Repository) which can be used as an API, as a standalone tool, or integrated into new software as an autonomous module. In this poster we will present results comparing the performance of classic machine learning methods (Naïve Bayes, logistic regression, Support Vector Machines, etc.) with Deep Learning, and will discuss challenges associated with Deep Learning Neural Networks (DNNs). DNN models of different complexity (up to 6 hidden layers) were built and tuned (varying the number of hidden units per layer, multiple activation functions, optimizers, dropout fraction, regularization parameters, and learning rate) using Keras (https://keras.io/) and TensorFlow (www.tensorflow.org), and applied to various use cases in the prediction of physicochemical properties, ADME and toxicity, and the calculation of materials properties. It was also shown that using nVidia GPUs significantly accelerates the calculations, although memory consumption puts some limits on the performance and applicability of standard toolkits 'as is'.
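A minimal sketch of such a tunable DNN in Keras follows; the hyperparameter values shown are illustrative defaults, not the tuned settings from the poster:

    # Sketch: a configurable Keras DNN (layers, units, dropout, learning rate).
    from tensorflow import keras
    from tensorflow.keras import layers

    def build_dnn(n_features, hidden_layers=3, units=256,
                  activation="relu", dropout=0.25, lr=1e-3):
        model = keras.Sequential([keras.Input(shape=(n_features,))])
        for _ in range(hidden_layers):   # up to 6 in the poster
            model.add(layers.Dense(units, activation=activation))
            model.add(layers.Dropout(dropout))
        model.add(layers.Dense(1, activation="sigmoid"))  # binary endpoint
        model.compile(optimizer=keras.optimizers.Adam(lr),
                      loss="binary_crossentropy",
                      metrics=[keras.metrics.AUC()])
        return model

    model = build_dnn(n_features=1024)   # e.g. 1024-bit fingerprints
    model.summary()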
Using publicly available resources to build a comprehensive knowledgebase of ...Valery Tkachenko
There is a variety of public resources on the Internet containing information about various aspects of the chemical, biological and pharmaceutical domains. The quality, maturity, hosting organizations and team sizes behind these data resources vary widely; as a consequence, content cannot always be trusted, and the effort of extracting information and preparing it for reuse is repeated again and again at various levels. This problem is especially serious in applications for QSAR, QSPR and QNAR modeling. On the other hand, the authors of this poster believe, based on their own extensive experience building various types of chemical, analytical and biological databases over decades, that the process of building such a knowledgebase can be systematically described and automated. This poster will outline the work performed on text- and data-mining various public resources on the Web, the data curation process, and making this information publicly available through a portal and a RESTful API. We will also demonstrate how such a knowledgebase can be used for real-time QSAR and QSPR predictions.
Need and benefits for structure standardization to facilitate integration and...Valery Tkachenko
There are a large number of US government databases housing diverse collections of chemical data, including bioassay data (PubChem), toxicity data (CompTox Chemistry Dashboard) and environmental data (a large collection of EPA databases), to name just a few. In many cases integration between the databases at the chemical structure level is via alphanumeric text identifiers such as CAS Numbers, or via InChIs (International Chemical Identifiers). Structure-based integration is highly dependent on the initial inputs providing the chemical structures to the InChI generation algorithm. To ensure optimal integration between various databases, community standards and agreement regarding the standardization of chemical structures would be beneficial, not only for the integration of US government databases and resources but also for the international scientific community and hosts of online databases. This presentation will discuss our progress towards delivering a fully Open Source chemical standardization platform as an exemplar for the community to build on and enhance. The system utilizes the CDK (Chemistry Development Kit), RDKit and other open source components. The resource expands on our previous work on the Chemical Validation and Standardization Platform and has been tested using the open data collection provided by the EPA CompTox Chemistry Dashboard.
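The platform's actual ruleset is not reproduced here, but the flavor of such standardization can be sketched with RDKit's MolStandardize module (RDKit being one of the open source components named above):

    # Sketch: basic structure standardization with RDKit.
    from rdkit import Chem
    from rdkit.Chem.MolStandardize import rdMolStandardize

    def standardize(smiles):
        mol = Chem.MolFromSmiles(smiles)
        mol = rdMolStandardize.Cleanup(mol)               # sanitize and normalize
        mol = rdMolStandardize.FragmentParent(mol)        # keep the parent fragment
        mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize where possible
        return Chem.MolToSmiles(mol)

    # A sodium acetate record collapses to the neutral parent acid
    print(standardize("CC(=O)[O-].[Na+]"))   # -> CC(=O)O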
Development and comparison of deep learning toolkit with other machine learni...Valery Tkachenko
The next era of cheminformatics, and of pharmaceutical research in general, is focused on mining heterogeneous big data, which is accumulating at an ever growing pace, and this will likely require more sophisticated algorithms such as deep learning. Deep learning has seen increasing use and has shown powerful advantages in learning from images and languages as well as many other areas. However, the accessibility of this technique for cheminformatics is hindered, as it is not readily available to non-experts and is currently absent from the major cheminformatics tools. It is therefore our goal to develop a deep learning algorithm and toolkit which can be used standalone or integrated into new software being developed by us, such as the Open Science Data Repository (OSDR). We will show how classic machine learning (CML) methods (Naïve Bayes, logistic regression, Support Vector Machines, etc.) compare to cutting-edge deep learning, and discuss challenges associated with deep neural network (DNN) learning models. The open source scikit-learn (http://scikit-learn.org/stable/) ML Python library was used for building, tuning, and validating all CML models. The DNN models of different complexity (up to 6 hidden layers) were built and tuned (varying the number of hidden units per layer, multiple activation functions, optimizers, dropout fraction, regularization parameters, and learning rate) using Keras (https://keras.io/), a deep learning library, with TensorFlow (www.tensorflow.org) as the backend. All the developed pipelines start with stratified splitting of the input dataset into train (80%) and test (20%) sets. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were computed for each model for ADME/Tox and other physicochemical properties. The DNN models were found to be very good at predicting activities and can outperform most of the CML models.
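A minimal sketch of the described evaluation pipeline, i.e. a stratified 80/20 split followed by ROC/AUC on the held-out set (the classifier choice and the random stand-in data are assumptions):

    # Sketch: stratified split and ROC/AUC evaluation.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.random((400, 64))
    y = rng.integers(0, 2, 400)

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)  # stratified 80/20

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, scores)   # points for the ROC plot
    print("AUC:", round(roc_auc_score(y_te, scores), 3))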
Living in a world of federated knowledge challenges, principles, tools and ...Valery Tkachenko
Over the years a multitude of chemical formats and approaches have been created to address various aspects of handling chemical information and building databases of chemical knowledge. As a result, the current landscape is severely affected by the lack of well-accepted, community-recognized formats, protocols, metadata standards, validation routines, and standards for handling, storing and representing data; by the lack of open toolkits conforming to the same standards; and by the lack of platforms that allow interactive and collaborative work on all of the above problems. While organizations such as the RDA and IUPAC, as well as some government agencies and institutes, are concerned with and trying to address the problem, it remains a severe pain point. In this presentation we will talk about our experience building a federated knowledgebase called the Open Science Data Repository. It supports deposition of raw and structured chemical and analytical data in various formats; runs validation and standardization protocols; is built in a highly modular way that allows both its API and its components to be used in the Cloud or deployed on premises behind firewalls; supports a variety of use cases including collaborative data curation, rich analytics and visualization, real-time machine learning, format conversion, and preparation of depositions into PubChem and ChemSpider from a variety of sources; and fully supports the FAIR principles for research data.
Open chemistry registry and mapping platform based on open source cheminforma...Valery Tkachenko
The Open PHACTS project (openphacts.org) is a European initiative, constituting a public-private partnership to enable easier, cheaper and faster drug discovery. The project is supported by the Open PHACTS Foundation (www.openphactsfoundation.org) and funded by contributions from several pharmaceutical companies. As part of Open PHACTS, a 'Chemical Registration Service' was created to register chemicals of interest to the project, allowing compound linkage between data sets. A key concept is the support for 'scientific lenses', which allows hierarchical mapping of chemical entities, including supporting characteristics such as charge state, tautomerism and stereochemistry. Open PHACTS aggregated various databases, including ChEMBL, ChEBI, HMDB, DrugBank, PDB, MeSH, and WikiPathways. A new project builds on the Chemical Registration Service to establish an open chemistry registry and mapping service for general data set linkage. This expansion requires support for multiple cheminformatics formats, the conversion and mapping of various identifiers, harmonized but configurable standardization, validation of the chemical structures, and the creation of new identifiers, to produce scientific lenses, or 'link sets'. Furthermore, these identifiers will be related to the compounds' chemical names (IUPAC and trivial) and related chemical structures. This presentation will describe our ongoing work to create a fully open source, easy-to-install platform which supports the ideas introduced by the Open PHACTS project and expands them with community data including, for example, the data now available from the EPA CompTox Chemistry Dashboard (comptox.epa.gov). This new platform supports multiple chemical formats and provides identifier conversion and cross-validation between datasets. The project is completely based on open source cheminformatics toolkits and is available as a set of libraries, docker images and a web frontend, following FAIR and Open Data principles. The openness of this platform will allow scientists to process their own datasets and make them interoperable with other online chemical databases.
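One building block of such a registry, joining records through a canonical identifier, can be sketched with RDKit's InChIKey generation; the two toy datasets are invented:

    # Sketch: mapping two datasets through InChIKeys.
    from rdkit import Chem

    dataset_a = {"aspirin": "CC(=O)Oc1ccccc1C(=O)O"}
    dataset_b = {"acetylsalicylic acid": "OC(=O)c1ccccc1OC(C)=O"}

    def inchikey(smiles):
        return Chem.MolToInchiKey(Chem.MolFromSmiles(smiles))

    keys_a = {inchikey(s): name for name, s in dataset_a.items()}
    keys_b = {inchikey(s): name for name, s in dataset_b.items()}

    # Same InChIKey -> same compound, despite different names and SMILES forms
    for k in keys_a.keys() & keys_b.keys():
        print(k, "links", keys_a[k], "<->", keys_b[k])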
Using the structured product labeling format to index versatile chemical dataValery Tkachenko
Structured Product Labeling (SPL) is a document markup standard approved by the Health Level Seven (HL7) standards organization and adopted by the FDA as a mechanism for exchanging product and facility information. Product information provided by companies in SPL format may be accessed from the FDA Online Label Repository (labels.fda.gov) and the National Library of Medicine DailyMed web site (dailymed.nlm.nih.gov). The FDA also maintains and publishes SPL Indexing Files for Pharmacologic Class, Substance, Product Concept, Biological Drug Substance, and Billing Units. Data from the Indexing Files can be linked to data in both SPL resources and external resources via chemical and non-chemical identifiers. In this talk we will present the latest addition to SPL, which allows indexing of data on proteins, polymers and structurally diverse substances. We will also discuss the potential value of SPL for integration between public chemistry databases, especially those hosted by the United States government.
In the last few years the number and size of chemical databases have been steadily increasing, as has the complexity of the information residing in them, creating truly multidimensional chemical spaces. Yet the most common user interface approach remains the search-and-browse workflow, which essentially prevents proper navigation through such databases and hides data patterns that may belong to other dimensions. As we at the Royal Society of Chemistry are building a chemical database service, it is potentially useful to be able to visualize large chemical spaces, ranging in size from tens of thousands to tens of millions of compounds. Dimensionality reduction techniques such as PCA have been used to produce two-dimensional displays of large chemical spaces via scatterplots. Standard chart-plotting libraries allow interactive scatterplots to be produced but do not scale well to large numbers of data points. Our new visualisation tool, OMPOL, is a browser-based tool for displaying and interacting with these data sets, allowing people to smoothly and responsively pan and zoom the plots, view the names and structures associated with the data points, select regions of chemical space, and find typical and atypical members of those regions.
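The PCA step can be sketched in a few lines of scikit-learn and matplotlib; the random bit vectors stand in for real fingerprints, and OMPOL itself (a browser-based tool) is not shown:

    # Sketch: projecting fingerprints to a 2D chemical-space scatterplot.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    fps = rng.integers(0, 2, size=(10_000, 512))   # stand-in 512-bit fingerprints

    xy = PCA(n_components=2).fit_transform(fps)    # 2D coordinates
    plt.scatter(xy[:, 0], xy[:, 1], s=1, alpha=0.3)
    plt.xlabel("PC1"); plt.ylabel("PC2")
    plt.title("Chemical space (PCA projection)")
    plt.show()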
The need for a high quality reaction database underpins synthetic reaction planning, as highlighted by the roadmap of the Dial-a-Molecule grand challenge [1] (the aims of which are to be able to predict the outcome of a reaction a priori and therefore generate products on demand, and also to optimise a reaction).
A number of reaction databases are available [2]; most of these focus on storing basic reaction schemes and details, linking to publications for more information. Their main limitation is that, because their major source is the abstraction of published literature, insufficient structured reaction detail is recorded:
• for someone else to reproduce the reaction
• to fully record all reaction products (not just the target product)
• to record previous attempts on the way to the optimised reaction route, so that this "work-up" can be correlated to allow better prediction of reaction outcomes.
As a result, the reactions domain of the chemical data repository that the Royal Society of Chemistry is developing will capture (a sketch of such a record follows the list):
• reactions and processes directly from Electronic Lab Notebooks
• reactions which gave low yields or unintended products
• processes, parameters and equipment in S88 process-recipe style [3] for maximum reproducibility
• multistep reactions
• reactants, products, etc. that are not just small organic molecules
• raw characterisation data linked to products
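A hypothetical sketch of what such a structured reaction record might look like is given below; the schema and field names are invented for illustration and are not the repository's actual data model:

    # Hypothetical structured reaction record (invented schema).
    reaction_record = {
        "source": "electronic-lab-notebook",
        "steps": [{
            "reactants": [{"inchikey": "...", "amount_mmol": 1.0}],
            "reagents": [{"name": "Pd(PPh3)4", "amount_mol_pct": 5}],
            "conditions": {"temperature_C": 80, "time_h": 12, "solvent": "toluene"},
            "equipment": "round-bottom flask, reflux condenser",  # S88-style detail
        }],
        "products": [
            {"inchikey": "...", "yield_pct": 62, "intended": True},
            {"inchikey": "...", "yield_pct": 11, "intended": False},  # side product
        ],
        "characterization": [{"type": "1H NMR", "file": "product1_1H.jdx"}],
    }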
We will demonstrate a first version, populated with reactions text-mined from RSC articles and examples of notebook reactions and processes as recorded by an academic research group at Cornell University.
[1] Dial a Molecule Grand Challenge, http://generic.wordpress.soton.ac.uk/dial-a-molecule/ (accessed Oct 8, 2015)
[2] Organic Chemistry Resources Worldwide, http://www.organicworldwide.net/content/reaction-databases (accessed Oct 8, 2015)
[3] ISA, "Batch Control Part 1: Model and Terminology," The International Society for Measurement and Control, ISA Press, ISA - S88.01-1995
The Open PHACTS project delivers an online platform integrating a wide variety of data from across chemistry and the life sciences, together with an ecosystem of tools and services to query these data in support of pharmacological research, turning the semantic web from a research project into something that can be used by practising medicinal chemists in both academia and industry. In the summer of 2015 it was the first winner of the European Linked Data Award. At the Royal Society of Chemistry we have provided the chemical underpinnings of this system, and in this talk we review its development over the past five years. We cover both our early work on semantic modelling of chemistry data for the Open PHACTS triplestore and more recent work building an all-purpose data platform, for which the Open PHACTS data has been an important test case: what has worked well, what is missing, and where this is likely to go in the future.
Building linked data large-scale chemistry platform - challenges, lessons and...Valery Tkachenko
Chemical databases have been around for decades, but in recent years we have observed a qualitative change from rather small, in-house, proprietary databases to large-scale, open and increasingly complex chemistry knowledgebases. This tectonic shift has imposed new requirements on database design and system architecture, as well as the implementation of completely new components and workflows which did not exist in chemical databases before. Probably the most profound change is caused by the linked nature of modern resources: individual databases are becoming nodes and hubs of a huge and truly distributed web of knowledge. This change has important aspects such as data and format standards, interoperability, provenance, security, quality control and metainformation standards.
ChemSpider at the Royal Society of Chemistry was the first public chemical database to incorporate rigorous quality control, introducing both community curation and automated quality checks at the scale of tens of millions of records. Yet we have come to realize that this approach may now be incomplete in a quickly changing world of linked data. In this presentation we will talk about challenges associated with building modern public and private chemical databases, as well as lessons that we have learned from our past and present experience. We will also talk about solutions to some common problems.
Text mining to produce large chemistry datasets for community accessValery Tkachenko
While in an ideal world all data would be deposited by the producing scientist directly into a database, in the real world most chemical data is instead presented in a form designed for human rather than machine consumption. Text mining has the potential to extract these data back into a computer-understandable form. As all United States patents are available free of charge, they make a perfect corpus for extracting a large number of experimental properties of compounds, as well as chemical reactions.
We report on our text-mining activities to extract millions of textual NMR spectra, hundreds of thousands of physicochemical properties (with their associated compounds) and over a million chemical reactions. All extracted results are to be deposited into online databases allowing the community to benefit from the results of this work.
Using Mestrelab Research's MNova product, we have converted the textual NMR spectra to graphical spectra and validated each spectrum against its associated chemical structure, so as to detect cases where the NMR spectrum could not have been produced by the associated structure.
In the case of melting points, the resultant dataset of over a quarter of a million compound/melting-temperature relationships is the largest public dataset the authors are aware of. We have used this dataset to produce a predictive model with results comparable to those from manually curated datasets. Our experience with modelling these data has demonstrated that we are working at the edge of current algorithmic and computing capabilities for predictive model building, with the resultant matrix containing over 200 billion descriptors. The melting point model and the data it was derived from are freely available from http://www.ochem.eu.
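In the same spirit, a minimal fingerprint-based melting point model can be sketched as follows; the three-compound toy dataset and the ridge regressor are illustrative assumptions, not the OCHEM workflow:

    # Sketch: melting point regression on Morgan fingerprints.
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.linear_model import Ridge

    # Toy (SMILES, melting point in deg C) pairs standing in for the mined data
    data = [("CCO", -114.1), ("c1ccccc1", 5.5), ("CC(=O)O", 16.6)]

    def fingerprint(smiles):
        mol = Chem.MolFromSmiles(smiles)
        return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))

    X = np.array([fingerprint(s) for s, _ in data])
    y = np.array([mp for _, mp in data])

    model = Ridge().fit(X, y)
    print("Predicted mp (deg C):", model.predict(X[:1])[0])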
Opportunities in chemical structure standardizationValery Tkachenko
1. Opportunities in Chemical Structure Standardization
Valery Tkachenko, Science Data Software, Rockville, USA
Expanding IUPAC Standards for Chemical Information, EMBL-EBI Workshop, March 20-21st 2017
3. [Slide 3: workflow diagram. Literature data and processed experimental data feed a structured nanomaterials data repository (data collection, curation, integration, and ontology-based structuring), which in turn supports data analysis and modeling, predictive data models and tools, experimental design, experimental validation of disease effects, and decision support. Credits: Karmann Mills and Anthony Hickey, RTI International, RTP, NC 27709; Alex Tropsha, Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, NC 27599.]
8. [Very incomplete] list of common problems
• Violations of chemical and common sense
• Violations of valence bond theory
• Unsupported format and chemical model features
• Information loss during conversion
• Tautomers
• Stereochemical issues
• Mixtures
• Other classes of chemicals (materials, formulations, biologicals, structurally diverse, etc.)
• Equivalence/mapping issues
• Identifiers/names issues
• Etc, etc, etc…
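As a small illustration of the first two items, a cheminformatics toolkit's sanitization step (RDKit is used here purely as an example) rejects a structure that violates valence rules:

    # Sketch: a pentavalent carbon fails RDKit's valence checks.
    from rdkit import Chem

    bad = Chem.MolFromSmiles("C(C)(C)(C)(C)C")  # five bonds to one carbon
    print(bad)   # None -> the structure was rejected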
10. Solution
• Agreed and machine-readable (digital) standards
• Open-source (transparent) solution
• Organizations AND community support and involvement
• Accessible solution
• Data triaging at data repositories level
• Real-time validation/standardization (API, library, “docker”, etc)
14. OpenPHACTS CRS shortcomings…
• Platform-dependent
• Toolkit-dependent (potential licensing issues)
• No deployable library
• No [convenient] API
15. …OpenPHACTS CRS¹ - ongoing work
• Platform independent (no longer tied to Microsoft)
  • .NET Core, Python
  • Linux
  • NoSQL
• Toolkit independent
  • Indigo
  • RDKit (in progress)
  • CDK (planned)
• Docker image
• RESTful API
¹ Was open-sourced and is now supported by the OpenPHACTS Foundation
18. Meet the Team
• Alexandru Korotcov (Data Science)
• Rick Zakharov (Technology)
• Valery Tkachenko (Support)
• Boris Sattarov (Cheminformatics)
Slides: https://www.slideshare.net/valerytkachenko16
Editor's Notes
• Open PHACTS was developed to support the key questions of drug discovery.
• Business questions have been at the heart of Open PHACTS and have driven the development of the platform.
• Mx/psa: how calculated, and who did it?
• Mash-up, with your data too: the top layer joins the datasets together, but all of them are needed, including commercial ones.
• Data provided by many publishers, originally in many formats: relational, SD files and RDF. We worked closely with publishers; data licensing was a major issue.
• Over 5 billion triples across 14 datasets, and growing.
• Hosted on beefy hardware; the aim is to keep data in memory, with extensive memcaching.
• Pose complex queries to extract data.