This talk was given at the EBI on the Wellcome Trust Genome Campus and outlines problems with chemical information standardization, along with various efforts to tackle them.
Tools and approaches for data deposition into nanomaterial databases - Valery Tkachenko
Sustainable research progress in many scientific disciplines critically depends on the existence of robust specialized databases that integrate and structure all available experimental information in the respective fields. The need for such a reference database is especially critical for nanoscience and nanomaterial research, given the significant diversity of shapes, sizes, and properties of engineered nanomaterials and the difficulty of synthesizing engineered nanoparticles with controlled properties. The acquisition of data from public sources is inefficient, time-consuming, and limited in scope. Moreover, it is not clear where the resources to support this activity on a perpetual basis would come from. The NIH has recently announced its intention to provide special funds toward data deposition by experimental investigators through the ‘data sharing plan’ required for each proposal. However, this points to a current weakness: laboratories use different data collection approaches, each of which requires interpretation by the staff hosting the database. It would be far more efficient and useful if each investigator worked from a template with key terms that could be modified to add new or important additional data or parameters. We will discuss tools and approaches to facilitate the collection and direct deposition of experimental data into the Nanomaterial Registry (https://www.nanomaterialregistry.org/) - a versatile, semantically enriched, template-based platform for registering diverse data pertaining to nanomaterials research.
Chemistry Validation and Standardization Platform v2.0 - Valery Tkachenko
In recent years there has been explosive growth in the number of public chemical databases available online, a number of them containing tens of millions of chemical structures. Examples include PubChem, ChemSpider and ChEMBL, and users of these databases have become increasingly aware of the data quality issues associated with these public resources. Seamless integration and mapping between databases, even for some common chemicals, is challenged by differing approaches to chemical standardization prior to registration into a database. The lack of standards for representing and handling chemical information certainly contributes to this problem. The Chemistry Validation and Standardization Platform (CVSP), originally developed to support the European Innovative Medicines Initiative project known as OpenPHACTS, was designed to provide an open platform for processing and standardizing chemical compounds. The system has been used to process millions of chemical compounds for dissemination through public websites and, unlike other validation and standardization systems, it supports both standard and custom rulesets. We will provide an overview of CVSP 2.0, the next generation of the platform, extending support to new cheminformatics toolkits and adding capabilities such as collaborative rule authoring.
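To make the kind of rule-based processing described above concrete, here is a minimal sketch using RDKit's MolStandardize module as a stand-in; CVSP's own rulesets and toolkit are not shown, and the cleanup steps chosen here are illustrative assumptions.

```python
# A hedged sketch of structure standardization prior to registration,
# using RDKit's MolStandardize (not CVSP's actual ruleset engine).
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    mol = rdMolStandardize.Cleanup(mol)               # normalize and reionize
    mol = rdMolStandardize.FragmentParent(mol)        # keep the parent fragment
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize charges
    return Chem.MolToSmiles(mol)

# A sodium acetate salt record standardizes to the neutral parent acid
print(standardize("CC(=O)[O-].[Na+]"))  # -> CC(=O)O
```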
Open Science Data Repository - the platform for materials research - Valery Tkachenko
Over the last few years we have seen tremendous growth in data repositories, pushed and supported by funding bodies and various data preservation initiatives. As a result we now have a variety of scientific resources, combined into a broad network and indexed through directories like BioSharing and re3data. This network, while growing quickly, is still in the early days of adopting semantic web standards and does not yet support deep data indexing and discoverability, let alone the fact that mechanisms of intellectual property protection are, at best, as simple as making data public or private. The lack of standards and well-defined models for describing the structure of scientific information further inhibits the free flow of information that is essential for scientific discovery. Not surprisingly, one of the most affected areas is materials science, where the inherent complexity of the field makes the situation even more severe. In this talk we present a chemistry information platform designed to support a variety of data formats along with metadata, sophisticated ways of collaborating, and secure data exchange. We will discuss the challenges we have faced in developing such a platform, as well as the solutions we have come up with.
Clustering the Royal Society of Chemistry chemical repository to enable enhan... - Valery Tkachenko
The Royal Society of Chemistry has hosted the ChemSpider database and associated platforms for over five years. Technologies have made significant progress over that period but, more importantly, community needs in terms of the variety of data types as well as search performance have increased. The preprocessing of chemicals for improved similarity searching and compound database navigation is seen as one crucial component of the major development effort to architect a new data repository. This component is being engineered and implemented in collaboration with the group of Professor Oliver Kohlbacher at the University of Tübingen, who have developed an approach for clustering large chemical libraries based on a fast, parallel, and purely CPU-based algorithm for 2D binary fingerprint similarity calculation. Using this method, the complete similarity network of our seed set of tens of millions of chemicals has been analyzed at a Tanimoto threshold of 0.6, and all similarity links were fed into our database. These links are highly beneficial and will allow us to create more complex and enriching visualizations of similar compounds, with associated bioactivity data and physicochemical properties, for users of the RSC chemical repository. This presentation will provide an overview of our experiences in applying clustering to our compound data and how it will be used to enrich data navigation on the RSC data repository.
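The core operation underlying such a similarity network, comparing 2D binary fingerprints pairwise and keeping links above a Tanimoto threshold of 0.6, can be sketched in a few lines with RDKit; this toy example is not the parallel algorithm referenced above.

```python
# Tanimoto similarity links between 2D binary fingerprints (toy library).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

THRESHOLD = 0.6
links = []
for i in range(len(fps)):
    # Compare fingerprint i against all later fingerprints in one call
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:])
    for j, sim in enumerate(sims, start=i + 1):
        if sim >= THRESHOLD:
            links.append((i, j, round(sim, 2)))  # an edge in the network

print(links)
```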
Building a semantic chemistry platform with the Royal Society of Chemistry - Valery Tkachenko
We live in an exponentially expanding world of “big data”. Social networks, global portals and other distributed systems have been grappling with this problem for a few years now. Scientific applications commonly lag behind mainstream trends due to the complexity of the scientific domain. The Royal Society of Chemistry is building the Global Chemistry Network, connecting a variety of resources both in-house and external, bridging gaps and advancing the chemical sciences. One of the main issues in the world of big data is ease of navigation and the comprehensiveness of search capabilities. This is where the semantic web approach meets the world of big data. We will present our approaches to building a global federated chemistry platform that connects multiple domains of chemistry using semantic web technologies.
Model organisms such as budding yeast provide a common platform to interrogate and understand cellular and physiological processes. Knowledge about model organisms, whether generated during the course of scientific investigation or extracted from published articles, is made available by model organism databases (MODs) such as the Saccharomyces Genome Database (SGD) for powerful, data-driven bioinformatic analyses. Integrative platforms such as InterMine offer a standard platform for MOD data exploration and data mining. Yet today’s bioinformatic analyses also require access to a significantly broader set of structured biomedical data, such as that found in the emerging network of Linked Open Data (LOD). If MOD data could be provisioned as FAIR (Findable, Accessible, Interoperable, and Reusable), then scientists could leverage a greater amount of interoperable data in knowledge discovery.
The goal of this proposal is to increase the utility of MOD data by implementing standards-compliant data access interfaces that interoperate with Linked Data. We will focus our efforts on developing interfaces for data access, data retrieval, and query answering for SGD. Our software will publish InterMine data as LOD that are semantically annotated with ontologies and can be retrieved in standardized formats (e.g. JSON-LD, Turtle). We will facilitate the exploration of MOD data for hypothesis testing by implementing efficient query answering using Linked Data Fragments, and by developing a set of graphical user interfaces to search for data of interest, explore connections, and answer questions that leverage the wider LOD network. Finally, we will develop a locally and cloud-deployable image to enable the rapid deployment of the proposed infrastructure. Our efforts to increase interoperability and ease of deployment for biomedical data repositories will increase research productivity and reduce costs associated with data integration and warehouse maintenance.
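As a rough illustration of what publishing a MOD record as semantically annotated Linked Data looks like, here is a minimal rdflib sketch; the locus identifier and vocabulary choices are invented stand-ins, not the project's actual data model.

```python
# Minimal sketch: one MOD record as RDF, serialized in the two formats named
# above. The locus ID and the ontology term used are illustrative only.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

SGD = Namespace("https://www.yeastgenome.org/locus/")
SO = Namespace("http://purl.obolibrary.org/obo/SO_")  # Sequence Ontology

g = Graph()
gene = SGD["S000000000"]                  # placeholder locus identifier
g.add((gene, RDF.type, SO["0000704"]))    # SO:0000704 = gene
g.add((gene, RDFS.label, Literal("EXAMPLE-GENE")))

print(g.serialize(format="turtle"))
print(g.serialize(format="json-ld"))
```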
The availability of high-quality metadata is key to facilitating discovery in the large variety of scientific datasets that are increasingly becoming publicly available. However, despite the recent focus on metadata, the diversity of metadata representation formats and the poor support for semantic markup typically result in metadata that are of poor quality. There is a pressing need for a metadata representation format that provides strong interoperation capabilities together with robust semantic underpinnings. In this talk, we describe such a format, together with open-source Web-based tools that support the acquisition, search, and management of metadata. We outline an initial evaluation using metadata from a variety of biomedical repositories.
The Royal Society of Chemistry and its adoption of semantic web technologies ... - Valery Tkachenko
Semantic web technologies have quickly penetrated all areas of traditional and new database systems and have become the de facto standard for information exchange and communication. The Royal Society of Chemistry has built a new chemistry data repository with the semantic web at the core of the system. Every module of the data repository contains a semantic web layer and is able to interact internally and externally using standard approaches and formats, including RDF, appropriate ontologies, SPARQL querying and so on. In this presentation we will review the challenges associated with developing this new system based on semantic web technologies, and how the approach we have taken offers distinct advantages over the original data model used to build the ChemSpider database. These advantages include extensibility, an ontological underpinning, federated integration, and the adoption of modern standards rather than the constraints of a standard SQL model.
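A semantic layer of this kind is typically consumed through a SPARQL endpoint. The sketch below shows a generic query via SPARQLWrapper; the endpoint URL is a hypothetical placeholder, not the repository's actual address.

```python
# Hypothetical example of querying a SPARQL endpoint exposed by such a layer.
# Requires SPARQLWrapper (pip install sparqlwrapper); the URL is invented.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/chemistry/sparql")  # placeholder
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?compound ?label WHERE {
        ?compound rdfs:label ?label .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["compound"]["value"], row["label"]["value"])
```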
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata - Michel Dumontier
Biomedical researchers will remain stymied in their ability to take full advantage of the Big Data revolution if they can never find the datasets they need to analyze, if there is a lack of clarity about what particular datasets contain, and if data are insufficiently described.
CEDAR, an NIH BD2K Center of Excellence, aims to develop methods and tools to vastly ease the burden of authoring good experimental metadata, and to maximally use this information to zero in on datasets of interest.
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration - Stuart Chalk
Integration of the combined JSmol/JSpecView molecular viewer/spectral viewer software into the Eureka Research Workbench. It can display molecular structures, spectra, and a linked view in which clicking on a spectral peak shows the corresponding molecular motion (IR).
Researchers at EPA’s National Center for Computational Toxicology integrate advances in biology, chemistry, and computer science to examine the toxicity of chemicals and help prioritize chemicals for further research based on potential human health risks. The goal of this research program is to evaluate thousands of chemicals quickly, at much reduced cost and on a shorter time frame relative to traditional approaches. The data generated by the Center include the characterization of thousands of chemicals across hundreds of high-throughput screening assays, consumer use and production information, pharmacokinetic properties, literature data, and physical-chemical properties, as well as predictive computational modeling of toxicity and exposure. We have developed a number of databases and applications to deliver the data to the public, the academic community, industry stakeholders, and regulators. This presentation will provide an overview of our work to develop an architecture that integrates diverse large-scale data from the chemical and biological domains, our approaches to disseminating these data, and the delivery of models supporting predictive computational toxicology. In particular, this presentation will review our new CompTox Chemistry Dashboard and the developing architecture to support real-time property and toxicity endpoint prediction. This abstract does not reflect U.S. EPA policy.
Semantic web technologies offer a potential mechanism for the representation and integration of thousands of biomedical databases. Many of these databases offer cross-references to other data sources, but these are generally incomplete and prone to error. In this paper, we conduct an empirical analysis of the link structure of life science Linked Data, obtained from the Bio2RDF project. Three different link graphs for datasets, entities and terms are characterized by degree, connectivity, and clustering metrics, and their correlation is measured as well. Furthermore, we utilize the symmetry and transitivity of entity links to build a benchmark and evaluate several popular entity matching approaches. Our findings indicate that the life science data network can help find hidden links, can be used to validate links, and may offer a mechanism to integrate a wider set of resources to support biomedical knowledge discovery.
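The graph metrics named above (degree, connectivity, clustering) are straightforward to compute once the link structure is loaded; here is a toy networkx sketch with invented dataset links, not the actual Bio2RDF graphs.

```python
# Characterizing a toy dataset-link graph by degree, clustering and
# connectivity; the edges below are illustrative, not Bio2RDF data.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("drugbank", "kegg"), ("drugbank", "pubchem"),
    ("kegg", "pubchem"), ("pubchem", "chebi"),
])

print(dict(g.degree()))            # degree of each dataset node
print(nx.average_clustering(g))    # average clustering coefficient
print(nx.is_connected(g))          # is the link graph connected?
print(nx.node_connectivity(g))     # size of the smallest node cut
```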
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are advancing the identification of emerging contaminants in environmental matrices, improving the means by which exposure analyses can be conducted. However, confidence in structure identification of unknowns in NTA presents challenges to analytical chemists. Structure identification requires integration of complementary data types such as reference databases, fragmentation prediction tools, and retention time prediction models. The goal of this research is to optimize and implement structure identification functionality within the US EPA’s CompTox Chemistry Dashboard, an open chemistry resource and web application containing data for ~760,000 substances. Rank-ordering the number of sources associated with chemical records within the Dashboard (Data Source Ranking) improves the identification of unknowns by bringing the most likely candidate structures to the top of a search results list. Database searching has been further optimized with the generation of MS-Ready Structures. MS-Ready structures are de-salted, stripped of stereochemistry, and mixture separated to replicate the form of a chemical observed via HRMS. Functionality to conduct batch searching of molecular formulae and monoisotopic masses was designed and released to improve searching efforts. Finally, a scoring-based identification scheme was developed, optimized, and surfaced via the Dashboard using multiple data streams contained within the database underlying the Dashboard. The scoring-based identification scheme improved the identification of unknowns over previous efforts using data source ranking alone. Combining these steps within an open chemistry resource provides a freely available software tool for structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
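The MS-Ready transformations described (de-salting, removing stereochemistry, splitting mixtures) can be approximated with RDKit as below; this is a hedged sketch, not the EPA's actual MS-Ready workflow, and charge neutralization is omitted for brevity.

```python
# Rough approximation of MS-Ready structure generation with RDKit.
from rdkit import Chem

def ms_ready_forms(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    forms = []
    # Mixture/salt separation: each disconnected component becomes a record
    for frag in Chem.GetMolFrags(mol, asMols=True):
        Chem.RemoveStereochemistry(frag)       # strip stereo descriptors
        forms.append(Chem.MolToSmiles(frag))
    return forms

# A chiral salt splits into stereo-free components for HRMS matching
print(ms_ready_forms("C[C@H](O)C(=O)[O-].[Na+]"))
```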
FAIRPORT domain-specific metadata using W3C DCAT & SKOS with ontology views - Tim Clark
FAIRPORT is an international project to develop a lightweight interoperability architecture for biomedical - and potentially other - data repositories.
This slide deck is a presentation to the FAIRPORT technical team. It describes a proposed model for supporting domain-specific search metadata using a common schema model across all repositories.
The proposal makes use of the following existing technologies, with minor extensions (a minimal sketch follows the list):
- the W3C DCAT model for dataset description
- the W3C SKOS knowledge organization system
- OWL2 Ontology Language
- Dublin Core Vocabulary
- NCBO Bioportal biomedical ontologies collection
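As a rough illustration of how these pieces fit together, the sketch below describes a dataset with DCAT and Dublin Core terms and points its theme at an ontology concept; the dataset URI and metadata values are invented.

```python
# Minimal DCAT + Dublin Core dataset description (invented dataset and values).
from rdflib import Graph, Literal, RDF, URIRef
from rdflib.namespace import DCAT, DCTERMS

g = Graph()
ds = URIRef("https://example.org/dataset/demo-expression-study")  # invented
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Demo expression study")))
g.add((ds, DCTERMS.publisher, Literal("Example Lab")))
# Domain-specific search metadata: dcat:theme pointing at an ontology term
g.add((ds, DCAT.theme, URIRef("http://purl.obolibrary.org/obo/OBI_0000070")))

print(g.serialize(format="turtle"))
```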
Annotopia open annotation services platform - Tim Clark
Annotopia is an open-access, open-source, open annotation services platform developed for scientific annotation of documents and datasets on the web using the W3C Open Annotation model http://www.openannotation.org/spec/core/.
Using Annotopia, virtually any client application including lightweight web clients, can create, selectively share, and access annotation of web documents and data. This can be done regardless of the ownership of the base objects being annotated.
Annotopia supports unstructured, semi-structured and fully-structured (semantic) annotation; manual and automated (textmining) annotation; permissions, groups, and sharing. It also provides access to specialized vocabulary and text analytics services.
Annotopia is an open source platform licensed under Apache 2.0.
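The annotation model referenced above structures every annotation as a body attached to a target. The sketch below shows that shape using the later W3C Web Annotation JSON-LD serialization (the successor to Open Annotation); the identifiers are invented.

```python
# The body/target shape of a web annotation, in Web Annotation JSON-LD form.
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {
        "type": "TextualBody",
        "value": "This figure appears inconsistent with Table 2.",
    },
    "target": "https://example.org/articles/123#figure-2",  # invented target
}
print(json.dumps(annotation, indent=2))
```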
Presentation to the ImmPort Science Meeting, February 27, 2014, on the proper treatment of value sets in the ImmPort Immunology Database and Analysis Portal.
exFrame: a Semantic Web Platform for Genomics Experiments - Tim Clark
Slides from a talk given at Bio-Ontologies 2013, Berlin, Germany, 20 July 2013.
Emily Merrill*, Stephane Corlosquet*, Paolo Ciccarese†*, Tim Clark*†‡, Sudeshna Das†*
* Massachusetts General Hospital
† Harvard Medical School
‡ School of Computer Science, University of Manchester
Bio2RDF is an open-source project that offers a large and connected knowledge graph of Life Science Linked Data. Each dataset is expressed using its own vocabulary, thereby hindering the ability to integrate, search, query, and browse data across similar or identical types of data. With growth and content changes in the source data, a manual approach to maintaining mappings has proven untenable. The aim of this work is to develop a (semi-)automated procedure to generate high-quality mappings between Bio2RDF and SIO using BioPortal ontologies. Our preliminary results demonstrate that our approach is promising in that it can find new mappings using a transitive closure over ontology mappings. Further development of the methodology, coupled with improvements in the ontology, will offer a better-integrated view of the Life Science Linked Data.
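The transitive-closure idea mentioned above can be sketched simply: treat known term-to-term mappings as edges of a graph, and any terms in the same connected component become candidate mappings. The term identifiers below are illustrative.

```python
# Finding candidate mappings by transitive closure over known mappings.
import networkx as nx

mappings = [
    ("bio2rdf:Gene", "ontoA:Gene"),    # known mapping to ontology A
    ("ontoA:Gene", "sio:SIO_010035"),  # known mapping to SIO (illustrative ID)
]

g = nx.Graph(mappings)
for component in nx.connected_components(g):
    print("transitively mapped:", sorted(component))
# Yields the new candidate mapping bio2rdf:Gene <-> sio:SIO_010035
```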
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are being made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for almost 760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectrometry non-targeted screening community, who are generating important data for detecting and assessing environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders, including, for example, scientists interested in algal toxins and hydraulic fracturing chemicals. This presentation will provide an overview of the challenges associated with the curation of data from EPA’s December 2016 Hydraulic Fracturing Drinking Water Assessment Report, which covered chemicals reported to be used in hydraulic fracturing fluids and those found in produced water. The data have been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, hazard and exposure predictions, and links to the open literature. The application of the dashboard to support mass spectrometry non-targeted analysis studies will also be reviewed. This abstract does not reflect U.S. EPA policy.
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. This document describes a consensus among participating stakeholders in the Health Care and Life Sciences domain on the description of datasets using the Resource Description Framework (RDF). This specification meets key functional requirements, reuses existing vocabularies to the extent possible, and addresses elements of data description, versioning, provenance, discovery, exchange, query, and retrieval.
Imaging abdomen trauma: uterine trauma, part 11 - Dr Ahmed Esawy
Covers blunt abdominal trauma, penetrating abdominal trauma, FAST abdominal ultrasound, haemoperitoneum, pneumoperitoneum, the American Association for the Surgery of Trauma (AAST) grading, subcapsular haematoma, parenchymal laceration, uterine rupture, uterine laceration, uterine contusion, and fetal trauma. Includes a range of cases for oral radiodiagnosis examinations worldwide, with CT, MRI, and plain X-ray images.
This is a presentation given at the Opal Events meeting "Drug Discovery Partnerships: Filling the Pipeline". I was speaking in a session with Jean-Claude Bradley on "Pre-competitive Collaboration: Sharing Data to Increase Predictability". This presentation discussed some of the work we are doing on Open PHACTS. My thanks especially to Carole Goble, Lee Harland and Sean Ekins for their comments.
In recent years there has been a dramatic increase in the number of freely accessible online databases serving the chemistry community. The internet provides chemistry data that can be used for data mining, for computational models, and for integration into systems that aid drug discovery. There is, however, a responsibility to ensure that the data are of high quality, so that time is not wasted on erroneous searches, models are underpinned by accurate data, and the improved discoverability of online resources is not marred by incorrect data. In this article we provide an overview of some of the authors' experiences using online chemical compound databases, critique the approaches taken to assemble data, and suggest approaches to deliver definitive reference data sources.
A Semantic Web based Framework for Linking Healthcare Information with Comput... - Koray Atalag
Presented at Health Informatics New Zealand (HINZ 2017) Conference, 1-3 Nov 2017, Rotorua, New Zealand. Authorship: Koray Atalag, Reza Kalbasi, David Nickerson
The University of Auckland
Novel opportunities for computational biology and sociology in drug discovery - Avinash Tiwari
Current drug discovery is impossible without sophisticated modeling and computation. In this review we outline previous advances in computational biology and, by tracing the steps involved in pharmaceutical development, explore a range of novel, high-value opportunities for computational innovation in modeling the biological process of disease and the social process of drug discovery. These opportunities include text mining for new drug leads, modeling molecular pathways and predicting the efficacy of drug cocktails, analyzing genetic overlap between diseases and predicting alternative drug use. Computation can also be used to model research teams and innovative regions and to estimate the value of academy–industry links for scientific and human benefit. Attention to these opportunities could promise punctuated advance and will complement the well-established computational work on which drug discovery currently relies.
Ontology-Driven Clinical Intelligence: Removing Data Barriers for Cross-Disci... - Remedy Informatics
The presentation describes how Remedy Informatics is advocating and innovating "flexible standardization" through an ontology-driven approach to clinical research. It shows in greater detail how a foundational, standardized Mosaic Ontology can be extended for more specific research applications, down to focused research on individual diseases.
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases within the DSSTox project, and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for ~760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectrometry non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders, including, for example, scientists interested in Per- & Polyfluoroalkyl Substances (PFAS). Added lists include those sourced from the European Union as well as lists developed in-house, now containing thousands of chemicals. A procured testing library of hundreds of PFAS chemicals annotated into chemical categories has been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and links to the open literature. This presentation will provide an overview of the dashboard, the developing library of PFAS chemicals and associated categorization, and new physicochemical property and environmental fate and transport QSAR prediction models developed for these chemicals. The application of the dashboard to support mass spectrometry non-targeted analysis studies for the identification of PFAS chemicals will also be reviewed. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:... - ChemAxon
SureChEMBL is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that annotates, extracts and indexes chemistry from full-text patent documents by means of continuous, automated text and image mining. SureChEMBL is perhaps the only open, freely available, live patent chemistry resource in a field that has traditionally been commercial.
Since its launch last September, the SureChEMBL interface has provided sophisticated keyword- and chemistry-based querying and exporting functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline extensively leverage ChemAxon technologies for name-to-structure conversion, as well as compound standardisation, registration and searching.
In addition to providing an overview of the system, recent developments and improvements will be described. These include the introduction of various data interchange and exporting options, such as flat files and a data feed client. Furthermore, our future plans for the SureChEMBL system will be outlined. These plans include complementing the chemical annotations with biological ones, covering genes, proteins, diseases and indications. We also plan to further enrich the chemical annotations with a relevance score, indicating their importance in the patent document.
Next generation electronic medical records and search: a test implementation i... - lucenerevolution
Presented by David Piraino, Chief Imaging Information Officer, Imaging Institute, Cleveland Clinic,
and Daniel Palmer, Imaging Institute, Cleveland Clinic
Most patient-specific medical information is document-oriented, with varying amounts of associated metadata. Most patient medical information is textual and semi-structured. Electronic Medical Record (EMR) systems are not optimized to present this textual information to users in the most understandable ways: present EMRs show information only in a reverse-chronological, patient-specific manner. This talk describes the construction and use of Solr search technologies to provide relevant historical information at the point of care while interpreting radiology images.
Radiology reports over a 4-year period were extracted from our Radiology Information System (RIS) and passed through a text processing engine to extract the results, impression, exam description, location, history, and date. Fifteen cases reported during clinical practice were used as test cases to determine whether "similar" historical cases were found. The results were evaluated by the number of searches that returned any result in less than 3 seconds, and by the number of cases illustrating the questioned diagnosis in the top 10 results returned, as determined by a bone and joint radiologist. Methods to better optimize the search results were also reviewed.
On average, 7.8 of the 10 highest-rated reports showed a similar case highly related to the present case. The best search showed 10 out of 10 good examples; the lowest-matching search showed 2 out of 10. The talk will highlight this specific use case and the issues and advances of using Solr search technology in medicine, with a focus on point-of-care applications.
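For context, a "find similar reports" lookup of this kind maps to a simple Solr query; the sketch below uses pysolr with an invented core name and field names, not the Cleveland Clinic configuration.

```python
# Hypothetical Solr query for reports similar to the current case.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/radiology_reports", timeout=3)
results = solr.search(
    "impression:(subcapsular haematoma spleen)",  # terms from the current case
    rows=10,                                       # top 10 results
    fl="id,exam_description,impression,score",     # fields to return
)
for doc in results:
    print(doc["id"], doc.get("impression"))
```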
Translational Biomedical Informatics 2010: Infrastructure and Scaling – Brian Athey, PhD; Professor of Biomedical Informatics and Director for Academic Informatics, University of Michigan Medical School; Chair Designate for Computational Medicine and Bioinformatics, University of Michigan; Associate Director, Michigan Institute for Clinical Health Research; Principal Investigator, National Center for Integrative Biomedical Informatics
Ontology-Driven Clinical Intelligence: A Path from the Biobank to Cross-Disea... - Remedy Informatics
The discovery of clinical insights through effective management and reuse of data requires several conditions to be optimized: Data need to be digital, data need to be structured, and data need to be standardized in terms of metadata and ontology. This presentation describes a bioinformatics system that combines a next-generation biobank management model mapped to applicable international standards and guidelines with a master ontology that controls all input and output and is able to add unique properties to meet the specialized needs of clinicians for cross-disease research.
Evolution of public chemistry databases: past and the future - Valery Tkachenko
Over the last few years we have seen tremendous growth in chemical databases. As a result we now have a variety of scientific resources, combined into a broad network and indexed through directories like BioSharing and re3data. This network, while growing quickly, is still in the early days of adopting semantic web standards and does not yet support deep data indexing and discoverability, let alone the fact that mechanisms of intellectual property protection are, at best, as simple as making data public or private. The lack of standards and well-defined models for describing the structure of scientific information further inhibits the free flow of information that is essential for scientific discovery.
In this talk we will share our experience spanning decades of building chemical databases such as PubChem, ChemSpider, OpenPHACTS and National Database Services, and will outline fundamental problems associated with chemical databases as such, as well as data quality issues and approaches to the modern architecture of large-scale chemical databases.
Materials design is a grand challenge of materials science, and the main approach to solving it is still intuition-based. Such an approach requires substantial time and financial resources, with months to years spent conducting experiments and characterization. Therefore, any kind of model that can be used at the very first stage of materials design to narrow the selection space is a helpful tool for the synthetic chemist. Likewise, an automated search for materials with human-defined target properties across the entire chemical space, i.e. inverse materials design, is a highly desired tool for exploring the materials design space.
De novo design is, moreover, not a completely new task in the development of new organic molecules with target properties: many generative approaches are already in use, alongside screening libraries of existing molecules, searching for drugs against a particular target, or generating new molecules from a very simple initial structure.
Here we would like to present a new approach for generating new materials with desired properties. We used an autoencoder neural network architecture to encode materials composition and crystal structure as a vector in a latent space. In this setting, any Quantitative Structure-Property Relationship (QSPR) model built on that vector can be interpreted as a function in the latent space and can be used to predict properties of existing materials as well as of prophetic ones. This approach achieves accuracy comparable to classic computational methods such as DFT when predicting values of energies or charges, but significantly surpasses them in terms of computational time.
The proposed method was tested on generating superhard materials, but can easily be extended to any target properties, provided a database of material properties is available for training.
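A minimal sketch of the architecture described above, an autoencoder compressing a material representation into a latent vector with a property model trained on that latent space, is given below; the dimensions and data are toy stand-ins, assuming TensorFlow/Keras.

```python
# Toy autoencoder + latent-space property model (invented dimensions/data).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 128)   # stand-in composition/structure descriptors
y = np.random.rand(1000, 1)     # stand-in target property

inp = layers.Input(shape=(128,))
z = layers.Dense(16, activation="relu", name="latent")(inp)  # latent vector
out = layers.Dense(128, activation="sigmoid")(z)
autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

# "QSPR as a function in the latent space": train a regressor on encodings
encoder = keras.Model(inp, z)
Z = encoder.predict(X, verbose=0)
prop_model = keras.Sequential([layers.Dense(32, activation="relu"),
                               layers.Dense(1)])
prop_model.compile(optimizer="adam", loss="mse")
prop_model.fit(Z, y, epochs=5, verbose=0)
print(prop_model.predict(Z[:3], verbose=0))  # predicted property values
```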
Metal-organic frameworks: from database to supramolecular effects in complexa... - Valery Tkachenko
Metal-organic frameworks (MOFs) attract a lot of interest due to their unique structure-dependent properties. Their internal pores comparable to the size of small molecules are naturally refined for various absorbance effects. Possessed properties lie in a foundation of multiple applications, such as catalysis, gas storage/separation and especially – clean energy related ones.
Theoretical calculations are a usual way of decreasing experimental costs while investigating properties of new materials, especially at a design stage. Electronic structure calculations like density functional theory (DFT) in most cases provide an appropriate accuracy in matching experimentally measured data such as adsorbate interaction energies. However, as in the case of experimental studies, large-scale materials screening studies with DFT calculations are rather time-consuming, and it can be carried out only for structures with relatively small unit cell.
Here we would like to present a theoretical and experimental results describing calculation of electron density in metal-organic frameworks. We built a model trained to predict partial charges on MOF atoms based on DFT calculations. The relative error of the model allows us to conclude that models do not decrease the level of accuracy and do not superinduce additional error comparing to DFT. At the same time, computational cost of the model is several orders of magnitude less. Models also demonstrated transferability and allowed to make prediction e.g. for MOFs containing metals not presented in the train set.
We have also built a force field (FF) of two-center and three-center interatomic potentials constructed using the predicted charges. The FF proved able to reproduce MOF crystal structures. As a final test, we applied the developed model and FF to newly synthesized lanthanide-containing MOFs to estimate the influence of supramolecular effects on metal complexation selectivity.
As a result, we have built a model that predicts one of the basic MOF properties at comparatively low computational cost and tested it against experimental data, both taken from literature sources and measured ourselves.
Public repositories containing diverse chemical and biological data are among the main sources of knowledge for biomedical research. Unfortunately, extracting these data and transforming them into a well-interpretable form is a complex exercise. Ongoing community efforts mainly focus on the analysis of term co-occurrence, text annotation based on term similarity, and related tasks [1].
Here we present an approach based on natural-language-processing techniques, intended to shift the search for similar texts on chemical topics from the word to the document level. PubMed records were used to train word2vec and doc2vec models. The resulting text representations can be used to search for similar abstracts; similarity depends on the representation itself rather than on the co-presence of particular terms (neighboring compounds, similar publication date, etc.).
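A minimal doc2vec sketch with gensim is shown below; the three-abstract corpus, vector size and epoch count are placeholders, not the settings used for PubMed:

    # Sketch: document-level similarity with Doc2Vec.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    abstracts = [
        "nmr spectra of substituted benzenes were recorded",
        "partial charges in metal organic frameworks from dft",
        "machine learning prediction of aqueous solubility",
    ]
    corpus = [TaggedDocument(words=a.split(), tags=[i])
              for i, a in enumerate(abstracts)]
    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

    # Embed an unseen abstract and retrieve the most similar documents
    query = model.infer_vector("dft charges for framework materials".split())
    print(model.dv.most_similar([query], topn=2))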
Document-level clustering was also implemented to provide insight into the structure of the PubMed text corpus. This approach can serve as an alternative to standard topic-modeling techniques for discovering hidden semantic features in an unsupervised manner.
Machine learning methods for chemical properties and toxicity based endpointsValery Tkachenko
In the last decade there has been increasing interest in using in silico tools for the potential risk assessment of newly released chemicals, given the large number of chemicals entering the market yearly and the great uncertainty about their possible hazardous effects. Various tools and methods based on machine learning techniques already exist and have been used in a wide range of applications, starting from quantitative structure-property relationships and expanding into predictive toxicology. A lot of historical data has accumulated across multiple publicly available databases and can be exploited with novel machine learning methods. Unfortunately, because of differing datasets, metrics and validation strategies, significant gaps remain in both the quantity and quality of available data, as well as in optimal predictive methods. This work is an attempt to develop a multitask system that serves as a searchable, curated collection of multiple chemical datasets together with ready-to-use machine learning methods, built solely on open source frameworks and libraries. We have implemented a set of traditional ('shallow') machine learning methods, self-tuned using grid search and k-fold cross-validation, such as Naïve Bayes, k-Nearest Neighbors, Random Forest, Boosted Decision Trees, Regularized Logistic Regression, and Support Vector Machines, based on the open source scikit-learn (http://scikit-learn.org/stable/). Deep Neural Network models of varying complexity have also been implemented using Keras (https://keras.io/), an open deep learning library, with TensorFlow (www.tensorflow.org) as the backend. The machine learning models were trained and evaluated to predict measures of toxicity from the physical characteristics of chemical structures, using the same datasets as the Toxicity Estimation Software Tool (https://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test). The Deep Learning models showed very good performance characteristics and were found useful in predicting toxicological and physicochemical endpoints. The results of this work support an optimistic view that some current obstacles in cheminformatics can be overcome using Deep Learning methods.
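As an illustration of the self-tuning setup (grid search wrapped around k-fold cross-validation for a shallow model), here is a minimal scikit-learn sketch; the descriptor matrix, labels and parameter grid are invented:

    # Sketch: grid search + 5-fold cross-validation over a shallow classifier.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.random((500, 128))    # stand-in for chemical descriptors
    y = rng.integers(0, 2, 500)   # stand-in for toxic / non-toxic labels

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=5,
        scoring="roc_auc",
    )
    grid.fit(X, y)
    print(grid.best_params_, round(grid.best_score_, 3))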
Chemical workflows supporting automated research data collectionValery Tkachenko
Acquisition of data from public sources is inefficient, time consuming and limited in scope. The NIH has recently posted its intention to financially support data deposition by investigators through the 'data sharing plan' for each funded proposal. However, this plan also points to a current weakness of centralized data sharing and acquisition: all laboratories use different data collection and formatting approaches. These inconsistencies in data formatting by individual labs lead to the need to invest significant resources in data curation and interpretation by the technical staff maintaining centralized data collection resources such as caNanoLab or the Nanomaterial Registry. It would be far more efficient and useful if there were a standardized data collection and deposition template with standard key terms (such as Minimal Information About Nanomaterials, MIAN) that each investigator could modify to add new or important additional data or parameters. These new features could ultimately be adopted in the classification scheme and guide the scope of the expanding database. This approach would be a win-win, as it would bring structure to the investigator's laboratory, consistency in data reporting, and a means of transmitting data to the database in parallel with publication, eliminating the acquisition step from the process. In this talk we will outline our experience building the Open Science Data Repository, a federated database system for direct acquisition, curation and management of research data, including nanomaterial data capture, transformation, and streamlined submission to nanomaterial knowledgebases. The key part of the system is a microservices-based architecture which exposes a RESTful API suitable for direct integration into Workflow Management Systems, as well as built-in modules facilitating and enforcing various lab-specific standard operating procedures.
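For illustration only, a deposition call against such an API might look like the sketch below; the endpoint URL, payload fields and token are invented and do not describe the actual OSDR API:

    # Hypothetical sketch of a REST deposition; not the real OSDR endpoint.
    import requests

    payload = {
        "dataset": "nanomaterial-characterization",
        "template": "MIAN",   # minimal-information template, per the abstract
        "records": [{"material": "TiO2", "size_nm": 21.5, "assay": "DLS"}],
    }
    resp = requests.post(
        "https://osdr.example.org/api/v1/depositions",  # invented URL
        json=payload,
        headers={"Authorization": "Bearer <token>"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())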
Deep learning methods applied to physicochemical and toxicological endpointsValery Tkachenko
Chemical and pharmaceutical companies, and the government agencies regulating both chemical and biological compounds, all strive to develop new methods that provide efficient prioritization, evaluation and safety assessment for the hundreds of new chemicals that enter the market annually. While a lot of historical data is available within the various agencies, organizations and companies, significant gaps remain in both the quantity and quality of the data, as well as in optimal predictive methods. Traditional QSAR methods are based on sets of features (fingerprints) which represent the functional characteristics of chemicals. Because of both data gaps and limitations in the development of QSAR models, read-across approaches have become a popular area of research. Successes in the application of Artificial Neural Networks, and specifically Deep Learning Neural Networks, have delivered new optimism that the lack of data and limited feature sets can be overcome using Deep Learning methods. In this poster we will present a comparison of various machine learning methods applied to several toxicological and physicochemical parameter endpoints. This abstract does not reflect U.S. EPA policy.
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictionsValery Tkachenko
While we have seen tremendous growth in machine learning methods over the last two decades, there is still no one-size-fits-all solution. The next era of cheminformatics, and of pharmaceutical research in general, is focused on mining heterogeneous big data, which is accumulating at an ever growing pace, and this will likely require more sophisticated algorithms such as Deep Learning (DL). DL has seen increasing use recently and has shown powerful advantages in learning from images and languages as well as many other areas. However, the accessibility of this technique for cheminformatics is hindered, as it is not readily available to non-experts. It was therefore our goal to develop a DL framework embedded into a general research data management platform (Open Science Data Repository) which can be used as an API, as a standalone tool, or integrated into new software as an autonomous module. In this poster we will present results comparing the performance of classic machine learning methods (Naïve Bayes, logistic regression, Support Vector Machines, etc.) with Deep Learning, and will discuss challenges associated with Deep Learning Neural Networks (DNNs). DNN models of different complexity (up to 6 hidden layers) were built and tuned (varying the number of hidden units per layer, multiple activation functions, optimizers, dropout fraction, regularization parameters, and learning rate) using Keras (https://keras.io/) and TensorFlow (www.tensorflow.org), and applied to various use cases in the prediction of physicochemical properties, ADME and toxicity, and the calculation of materials properties. It was also shown that using nVidia GPUs significantly accelerates the calculations, although memory consumption puts some limits on the performance and applicability of standard toolkits 'as is'.
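A minimal sketch of such a tunable DNN in Keras follows; the hyperparameter values shown are illustrative defaults, not the tuned settings from the poster:

    # Sketch: a configurable Keras DNN (layers, units, dropout, learning rate).
    from tensorflow import keras
    from tensorflow.keras import layers

    def build_dnn(n_features, hidden_layers=3, units=256,
                  activation="relu", dropout=0.25, lr=1e-3):
        model = keras.Sequential([keras.Input(shape=(n_features,))])
        for _ in range(hidden_layers):   # up to 6 in the poster
            model.add(layers.Dense(units, activation=activation))
            model.add(layers.Dropout(dropout))
        model.add(layers.Dense(1, activation="sigmoid"))  # binary endpoint
        model.compile(optimizer=keras.optimizers.Adam(lr),
                      loss="binary_crossentropy",
                      metrics=[keras.metrics.AUC()])
        return model

    model = build_dnn(n_features=1024)   # e.g. 1024-bit fingerprints
    model.summary()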
Using publicly available resources to build a comprehensive knowledgebase of ...Valery Tkachenko
There is a variety of public resources on the Internet containing information about various aspects of the chemical, biological and pharmaceutical domains. The quality, maturity, hosting organizations and team sizes behind these data resources vary widely; as a consequence, content cannot always be trusted, and the effort of extracting information and preparing it for reuse is repeated again and again at various levels. This problem is especially serious in applications for QSAR, QSPR and QNAR modeling. On the other hand, the authors of this poster believe, based on their own extensive experience building various types of chemical, analytical and biological databases over decades, that the process of building such a knowledgebase can be systematically described and automated. This poster will outline the work performed on text- and data-mining various public resources on the Web, the data curation process, and making this information publicly available through a portal and a RESTful API. We will also demonstrate how such a knowledgebase can be used for real-time QSAR and QSPR predictions.
Need and benefits for structure standardization to facilitate integration and...Valery Tkachenko
There are a large number of US government databases housing diverse collections of chemical data, including bioassay data (PubChem), toxicity data (CompTox Chemistry Dashboard) and environmental data (a large collection of EPA databases), to name just a few. In many cases integration between the databases at the chemical structure level is via alphanumeric text identifiers such as CAS Numbers, or via InChIs (International Chemical Identifiers). Structure-based integration is highly dependent on the initial inputs providing the chemical structures to the InChI generation algorithm. To ensure optimal integration between various databases, community standards and agreement regarding the standardization of chemical structures would be beneficial, not only for the integration of US government databases and resources but also for the international scientific community and hosts of online databases. This presentation will discuss our progress towards delivering a fully Open Source chemical standardization platform as an exemplar for the community to build on and enhance. The system utilizes the CDK (Chemistry Development Kit), RDKit and other open source components. The resource expands on our previous work on the Chemical Validation and Standardization Platform and has been tested using the open data collection provided by the EPA CompTox Chemistry Dashboard.
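The platform's actual ruleset is not reproduced here, but the flavor of such standardization can be sketched with RDKit's MolStandardize module (RDKit being one of the open source components named above):

    # Sketch: basic structure standardization with RDKit.
    from rdkit import Chem
    from rdkit.Chem.MolStandardize import rdMolStandardize

    def standardize(smiles):
        mol = Chem.MolFromSmiles(smiles)
        mol = rdMolStandardize.Cleanup(mol)               # sanitize and normalize
        mol = rdMolStandardize.FragmentParent(mol)        # keep the parent fragment
        mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize where possible
        return Chem.MolToSmiles(mol)

    # A sodium acetate record collapses to the neutral parent acid
    print(standardize("CC(=O)[O-].[Na+]"))   # -> CC(=O)O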
Development and comparison of deep learning toolkit with other machine learni...Valery Tkachenko
The next era of cheminformatics, and of pharmaceutical research in general, is focused on mining heterogeneous big data, which is accumulating at an ever growing pace, and this will likely require more sophisticated algorithms such as deep learning. Deep learning has seen increasing use and has shown powerful advantages in learning from images and languages as well as many other areas. However, the accessibility of this technique for cheminformatics is hindered, as it is not readily available to non-experts and is currently absent from the major cheminformatics tools. It is therefore our goal to develop a deep learning algorithm and toolkit which can be used standalone or integrated into new software being developed by us, such as the Open Science Data Repository (OSDR). We will show how classic machine learning (CML) methods (Naïve Bayes, logistic regression, Support Vector Machines, etc.) compare to cutting-edge deep learning, and discuss challenges associated with deep neural network (DNN) learning models. The open source scikit-learn (http://scikit-learn.org/stable/) ML Python library was used for building, tuning, and validating all CML models. The DNN models of different complexity (up to 6 hidden layers) were built and tuned (varying the number of hidden units per layer, multiple activation functions, optimizers, dropout fraction, regularization parameters, and learning rate) using Keras (https://keras.io/), a deep learning library, with TensorFlow (www.tensorflow.org) as the backend. All the developed pipelines start with stratified splitting of the input dataset into train (80%) and test (20%) sets. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were computed for each model for ADME/Tox and other physicochemical properties. The DNN models were found to be very good at predicting activities and can outperform most of the CML models.
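A minimal sketch of the described evaluation pipeline, i.e. a stratified 80/20 split followed by ROC/AUC on the held-out set (the classifier choice and the random stand-in data are assumptions):

    # Sketch: stratified split and ROC/AUC evaluation.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.random((400, 64))
    y = rng.integers(0, 2, 400)

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)  # stratified 80/20

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, scores)   # points for the ROC plot
    print("AUC:", round(roc_auc_score(y_te, scores), 3))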
Living in a world of federated knowledge challenges, principles, tools and ...Valery Tkachenko
Over the years a multitude of chemical formats and approaches have been created to address various aspects of handling chemical information and building databases of chemical knowledge. As a result, the current landscape is severely affected by the lack of well-accepted, community-recognized formats, protocols, metadata standards, validation routines, and standards for handling, storing and representing data; by the lack of open toolkits conforming to the same standards; and by the lack of platforms that allow interactive and collaborative work on all of the above problems. While organizations such as the RDA and IUPAC, as well as some government agencies and institutes, are concerned with and trying to address the problem, it remains a severe pain point. In this presentation we will talk about our experience building a federated knowledgebase called the Open Science Data Repository. It supports deposition of raw and structured chemical and analytical data in various formats; runs validation and standardization protocols; is built in a highly modular way that allows both its API and its components to be used in the Cloud or deployed on premises behind firewalls; supports a variety of use cases including collaborative data curation, rich analytics and visualization, real-time machine learning, format conversion, and preparation of depositions into PubChem and ChemSpider from a variety of sources; and fully supports the FAIR principles for research data.
Open chemistry registry and mapping platform based on open source cheminforma...Valery Tkachenko
The Open PHACTS project (openphacts.org) is a European initiative, constituting a public-private partnership to enable easier, cheaper and faster drug discovery. The project is supported by the Open PHACTS Foundation (www.openphactsfoundation.org) and funded by contributions from several pharmaceutical companies. As part of Open PHACTS, a 'Chemical Registration Service' was created to register chemicals of interest to the project, allowing compound linkage between data sets. A key concept is the support for 'scientific lenses', which allows hierarchical mapping of chemical entities, including supporting characteristics such as charge state, tautomerism and stereochemistry. Open PHACTS aggregated various databases, including ChEMBL, ChEBI, HMDB, DrugBank, PDB, MeSH, and WikiPathways. A new project builds on the Chemical Registration Service to establish an open chemistry registry and mapping service for general data set linkage. This expansion requires support for multiple cheminformatics formats, the conversion and mapping of various identifiers, harmonized but configurable standardization, validation of the chemical structures, and the creation of new identifiers, to produce scientific lenses, or 'link sets'. Furthermore, these identifiers will be related to the compounds' chemical names (IUPAC and trivial) and related chemical structures. This presentation will describe our ongoing work to create a fully open source, easy-to-install platform which supports the ideas introduced by the Open PHACTS project and expands them with community data including, for example, the data now available from the EPA CompTox Chemistry Dashboard (comptox.epa.gov). This new platform supports multiple chemical formats and provides identifier conversion and cross-validation between datasets. The project is completely based on open source cheminformatics toolkits and is available as a set of libraries, docker images and a web frontend, following FAIR and Open Data principles. The openness of this platform will allow scientists to process their own datasets and make them interoperable with other online chemical databases.
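One building block of such a registry, joining records through a canonical identifier, can be sketched with RDKit's InChIKey generation; the two toy datasets are invented:

    # Sketch: mapping two datasets through InChIKeys.
    from rdkit import Chem

    dataset_a = {"aspirin": "CC(=O)Oc1ccccc1C(=O)O"}
    dataset_b = {"acetylsalicylic acid": "OC(=O)c1ccccc1OC(C)=O"}

    def inchikey(smiles):
        return Chem.MolToInchiKey(Chem.MolFromSmiles(smiles))

    keys_a = {inchikey(s): name for name, s in dataset_a.items()}
    keys_b = {inchikey(s): name for name, s in dataset_b.items()}

    # Same InChIKey -> same compound, despite different names and SMILES forms
    for k in keys_a.keys() & keys_b.keys():
        print(k, "links", keys_a[k], "<->", keys_b[k])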
Using the structured product labeling format to index versatile chemical dataValery Tkachenko
Structured Product Labeling (SPL) is a document markup standard approved by the Health Level Seven (HL7) standards organization and adopted by the FDA as a mechanism for exchanging product and facility information. Product information provided by companies in SPL format may be accessed from the FDA Online Label Repository (labels.fda.gov) and the National Library of Medicine DailyMed web site (dailymed.nlm.nih.gov). The FDA also maintains and publishes SPL Indexing Files for Pharmacologic Class, Substance, Product Concept, Biological Drug Substance, and Billing Units. Data from the Indexing Files can be linked to data in both SPL resources and external resources via chemical and non-chemical identifiers. In this talk we will present the latest addition to SPL, which allows indexing of data on proteins, polymers and structurally diverse substances. We will also discuss the potential value of SPL for integration between public chemistry databases, especially those hosted by the United States government.
In the last few years the number and size of chemical databases have been steadily increasing, as has the complexity of the information residing in them, creating truly multidimensional chemical spaces. Yet the most common user interface approach remains the search-and-browse workflow, which essentially prevents proper navigation through such databases and hides data patterns that may belong to other dimensions. As we at the Royal Society of Chemistry are building a chemical database service, it is potentially useful to be able to visualize large chemical spaces, ranging in size from tens of thousands to tens of millions of compounds. Dimensionality reduction techniques such as PCA have been used to produce two-dimensional displays of large chemical spaces via scatterplots. Standard chart-plotting libraries allow interactive scatterplots to be produced but do not scale well to large numbers of data points. Our new visualisation tool, OMPOL, is a browser-based tool for displaying and interacting with these data sets, allowing people to smoothly and responsively pan and zoom the plots, view the names and structures associated with the data points, select regions of chemical space, and find typical and atypical members of those regions.
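The PCA step can be sketched in a few lines of scikit-learn and matplotlib; the random bit vectors stand in for real fingerprints, and OMPOL itself (a browser-based tool) is not shown:

    # Sketch: projecting fingerprints to a 2D chemical-space scatterplot.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    fps = rng.integers(0, 2, size=(10_000, 512))   # stand-in 512-bit fingerprints

    xy = PCA(n_components=2).fit_transform(fps)    # 2D coordinates
    plt.scatter(xy[:, 0], xy[:, 1], s=1, alpha=0.3)
    plt.xlabel("PC1"); plt.ylabel("PC2")
    plt.title("Chemical space (PCA projection)")
    plt.show()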
The need for a high quality reaction database underpins synthetic reaction planning, as highlighted by the roadmap of the Dial-a-Molecule grand challenge [1] (the aims of which are to be able to predict the outcome of a reaction a priori and therefore generate products on demand, and also to optimise a reaction).
A number of reaction databases are available [2]; most of these focus on storing basic reaction schemes and details, linking to publications for more information. Their main limitation is that, because their major source is the abstraction of published literature, insufficient structured reaction detail is recorded:
• for someone else to reproduce the reaction
• to fully record all reaction products (not just the target product)
• to record previous attempts on the way to the optimised reaction route, so that this "work-up" can be correlated to allow better prediction of reaction outcomes.
As a result, the reactions domain of the chemical data repository that the Royal Society of Chemistry is developing will capture (a sketch of such a record follows the list):
• reactions and processes directly from Electronic Lab Notebooks
• reactions which gave low yields or unintended products
• processes, parameters and equipment in S88 process-recipe style [3] for maximum reproducibility
• multistep reactions
• reactants, products, etc. that are not just small organic molecules
• raw characterisation data linked to products
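A hypothetical sketch of what such a structured reaction record might look like is given below; the schema and field names are invented for illustration and are not the repository's actual data model:

    # Hypothetical structured reaction record (invented schema).
    reaction_record = {
        "source": "electronic-lab-notebook",
        "steps": [{
            "reactants": [{"inchikey": "...", "amount_mmol": 1.0}],
            "reagents": [{"name": "Pd(PPh3)4", "amount_mol_pct": 5}],
            "conditions": {"temperature_C": 80, "time_h": 12, "solvent": "toluene"},
            "equipment": "round-bottom flask, reflux condenser",  # S88-style detail
        }],
        "products": [
            {"inchikey": "...", "yield_pct": 62, "intended": True},
            {"inchikey": "...", "yield_pct": 11, "intended": False},  # side product
        ],
        "characterization": [{"type": "1H NMR", "file": "product1_1H.jdx"}],
    }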
We will demonstrate a first version, populated with reactions text-mined from RSC articles and examples of notebook reactions and processes as recorded by an academic research group at Cornell University.
[1] Dial a Molecule Grand Challenge, http://generic.wordpress.soton.ac.uk/dial-a-molecule/ (accessed Oct 8, 2015)
[2] Organic Chemistry Resources Worldwide, http://www.organicworldwide.net/content/reaction-databases (accessed Oct 8, 2015)
[3] ISA, "Batch Control Part 1: Model and Terminology," The International Society for Measurement and Control, ISA Press, ISA - S88.01-1995
The Open PHACTS project delivers an online platform integrating a wide variety of data from across chemistry and the life sciences, together with an ecosystem of tools and services to query these data in support of pharmacological research, turning the semantic web from a research project into something that can be used by practising medicinal chemists in both academia and industry. In the summer of 2015 it was the first winner of the European Linked Data Award. At the Royal Society of Chemistry we have provided the chemical underpinnings of this system, and in this talk we review its development over the past five years. We cover both our early work on semantic modelling of chemistry data for the Open PHACTS triplestore and more recent work building an all-purpose data platform, for which the Open PHACTS data has been an important test case: what has worked well, what is missing, and where this is likely to go in the future.
Building linked data large-scale chemistry platform - challenges, lessons and...Valery Tkachenko
Chemical databases have been around for decades, but in recent years we have observed a qualitative change from rather small, in-house, proprietary databases to large-scale, open and increasingly complex chemistry knowledgebases. This tectonic shift has imposed new requirements on database design and system architecture, as well as the implementation of completely new components and workflows which did not exist in chemical databases before. Probably the most profound change is caused by the linked nature of modern resources: individual databases are becoming nodes and hubs of a huge and truly distributed web of knowledge. This change has important aspects such as data and format standards, interoperability, provenance, security, quality control and metainformation standards.
ChemSpider at the Royal Society of Chemistry was the first public chemical database to incorporate rigorous quality control, introducing both community curation and automated quality checks at the scale of tens of millions of records. Yet we have come to realize that this approach may now be incomplete in a quickly changing world of linked data. In this presentation we will talk about challenges associated with building modern public and private chemical databases, as well as lessons that we have learned from our past and present experience. We will also talk about solutions to some common problems.
Text mining to produce large chemistry datasets for community accessValery Tkachenko
While in an ideal world all data would be deposited by the producing scientist directly into a database, in the real world most chemical data is instead presented in a form designed for human rather than machine consumption. Text mining has the potential to extract these data back into a computer-understandable form. As all United States patents are available free of charge, they make a perfect corpus for extracting a large number of experimental properties of compounds, as well as chemical reactions.
We report on our text-mining activities to extract millions of textual NMR spectra, hundreds of thousands of physicochemical properties (with their associated compounds) and over a million chemical reactions. All extracted results are to be deposited into online databases allowing the community to benefit from the results of this work.
Using Mestrelab Research's MNova product, we have converted the textual NMR spectra to graphical spectra and validated each spectrum against its associated chemical structure, so as to detect cases where the NMR spectrum could not have been produced by the associated structure.
In the case of melting points, the resultant dataset of over a quarter of a million compound/melting-temperature relationships is the largest public dataset the authors are aware of. We have used this dataset to produce a predictive model with results comparable to those from manually curated datasets. Our experience with modelling these data has demonstrated that we are working at the edge of current algorithmic and computing capabilities for predictive model building, with the resultant matrix containing over 200 billion descriptors. The melting point model and the data it was derived from are freely available from http://www.ochem.eu.
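In the same spirit, a minimal fingerprint-based melting point model can be sketched as follows; the three-compound toy dataset and the ridge regressor are illustrative assumptions, not the OCHEM workflow:

    # Sketch: melting point regression on Morgan fingerprints.
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.linear_model import Ridge

    # Toy (SMILES, melting point in deg C) pairs standing in for the mined data
    data = [("CCO", -114.1), ("c1ccccc1", 5.5), ("CC(=O)O", 16.6)]

    def fingerprint(smiles):
        mol = Chem.MolFromSmiles(smiles)
        return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))

    X = np.array([fingerprint(s) for s, _ in data])
    y = np.array([mp for _, mp in data])

    model = Ridge().fit(X, y)
    print("Predicted mp (deg C):", model.predict(X[:1])[0])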
Opportunities in chemical structure standardizationValery Tkachenko
1. Opportunities in Chemical Structure Standardization
Valery Tkachenko, Science Data Software, Rockville, USA
Expanding IUPAC Standards for Chemical Information, EMBL-EBI Workshop, March 20-21st 2017
3. [Slide 3: workflow diagram. Literature data and processed experimental data feed a structured nanomaterials data repository (data collection, curation, integration, and ontology-based structuring), which in turn supports data analysis and modeling, predictive data models and tools, experimental design, experimental validation of disease effects, and decision support. Credits: Karmann Mills and Anthony Hickey, RTI International, RTP, NC 27709; Alex Tropsha, Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, NC 27599.]
8. [Very incomplete] list of common problems
• Violations of chemical and common sense
• Violations of valence bond theory
• Unsupported format and chemical model features
• Information loss during conversion
• Tautomers
• Stereochemical issues
• Mixtures
• Other classes of chemicals (materials, formulations, biologicals, structurally diverse, etc.)
• Equivalence/mapping issues
• Identifiers/names issues
• Etc, etc, etc…
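As a small illustration of the first two items, a cheminformatics toolkit's sanitization step (RDKit is used here purely as an example) rejects a structure that violates valence rules:

    # Sketch: a pentavalent carbon fails RDKit's valence checks.
    from rdkit import Chem

    bad = Chem.MolFromSmiles("C(C)(C)(C)(C)C")  # five bonds to one carbon
    print(bad)   # None -> the structure was rejected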
10. Solution
• Agreed and machine-readable (digital) standards
• Open-source (transparent) solution
• Organizations AND community support and involvement
• Accessible solution
• Data triaging at data repositories level
• Real-time validation/standardization (API, library, “docker”, etc)
14. OpenPHACTS CRS shortcomings…
• Platform-dependent
• Toolkit-dependent (potential licensing issues)
• No deployable library
• No [convenient] API
15. …OpenPHACTS CRS¹ - ongoing work
• Platform independent (no longer tied to Microsoft)
  • .NET Core, Python
  • Linux
  • NoSQL
• Toolkit independent
  • Indigo
  • RDKit (in progress)
  • CDK (planned)
• Docker image
• RESTful API
¹ Was open-sourced and is now supported by the OpenPHACTS Foundation
18. Meet the Team
• Alexandru Korotcov (Data Science)
• Rick Zakharov (Technology)
• Valery Tkachenko (Support)
• Boris Sattarov (Cheminformatics)
Slides: https://www.slideshare.net/valerytkachenko16
Editor's Notes
• Open PHACTS was developed to support the key questions of drug discovery.
• Business questions have been at the heart of Open PHACTS and have driven the development of the platform.
• Mx/psa: how calculated, and who did it?
• Mash-up, with your data too: the top layer joins the datasets together, but all of them are needed, including commercial ones.
• Data provided by many publishers, originally in many formats: relational, SD files and RDF. We worked closely with publishers; data licensing was a major issue.
• Over 5 billion triples across 14 datasets, and growing.
• Hosted on beefy hardware; the aim is to keep data in memory, with extensive memcaching.
• Pose complex queries to extract data.