PubChem QC project. In this project we perform quantum chemistry calculations on the molecules in the PubChem database. Currently 1,100,000 molecules are available at http://pubchemqc.riken.jp/ . The results are in the public domain.
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining (Mikel Emaldi Manrique)
This document proposes a method to detect related semantic datasets based on frequent subgraph mining. The method extracts the most frequent subgraphs from RDF graphs using the SUBDUE algorithm. These subgraphs are then matched across datasets to identify potential links. The method is evaluated against gold standard links and baselines, showing precise but limited recall. Future work to improve recall is discussed, such as using string similarity techniques.
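The core idea can be sketched without the full SUBDUE machinery: abstract each RDF triple to a coarse structural pattern, keep only the frequent ones, and treat patterns shared between two datasets as candidate link points. The pattern abstraction and support threshold below are illustrative assumptions, not the paper's actual algorithm:

```python
from collections import Counter

def frequent_patterns(triples, min_support=2):
    """Count (subject-kind, predicate, object-kind) edge patterns.

    Each triple is abstracted so structurally similar edges group
    together; only patterns meeting min_support are kept.
    """
    def kind(node):
        # crude typing: URIs keep their namespace, literals collapse
        return node.rsplit("/", 1)[0] if node.startswith("http") else "literal"
    counts = Counter((kind(s), p, kind(o)) for s, p, o in triples)
    return {pat for pat, n in counts.items() if n >= min_support}

def related(ds_a, ds_b, min_support=2):
    """Datasets sharing frequent patterns are candidate link targets."""
    return frequent_patterns(ds_a, min_support) & frequent_patterns(ds_b, min_support)
```

Two datasets that both frequently use the same predicate over the same namespace would be flagged as related and handed to a finer-grained matcher.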
Universal SMILES: Finally a canonical SMILES string (baoilleach)
The document discusses the development of a "Universal SMILES" string that can generate a canonical SMILES identifier for molecules. It describes taking the canonical labels from the InChI and using them to traverse the molecular graph in a set way, encoding the results as a SMILES string. This approach was able to generate canonical SMILES for over 99.7% of molecules tested from large databases, with the main exceptions due to differences in stereochemistry perception between the InChI and the toolkit used. The Universal SMILES represents a significant step towards a single canonical representation for small molecules.
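The mechanism can be illustrated with a toy stand-in: derive stable atom ranks (here via a Morgan-style refinement, whereas the actual Universal SMILES takes the ranks from the InChI canonical labels) and then traverse the graph in rank order so that any input atom ordering yields the same string. Bond orders and stereochemistry are ignored in this sketch:

```python
def canonical_ranks(atoms, bonds):
    """Iterative refinement: start from element symbols and repeatedly
    split ranks by the sorted neighbour ranks until the partition is
    stable. (Universal SMILES uses InChI labels instead of this.)"""
    n = len(atoms)
    nbrs = [[] for _ in range(n)]
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)
    ranks = [sorted(set(atoms)).index(s) for s in atoms]
    while True:
        keys = [(ranks[i], tuple(sorted(ranks[j] for j in nbrs[i])))
                for i in range(n)]
        order = sorted(set(keys))
        new = [order.index(k) for k in keys]
        if new == ranks:
            return ranks
        ranks = new

def write_string(atoms, bonds, ranks):
    """Depth-first traversal in canonical rank order, emitting a
    SMILES-like string with parentheses for branches."""
    nbrs = [[] for _ in range(len(atoms))]
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)
    seen = set()
    def dfs(i):
        seen.add(i)
        branches = [dfs(j) for j in sorted(nbrs[i], key=lambda j: ranks[j])
                    if j not in seen]
        if not branches:
            return atoms[i]
        return atoms[i] + "".join("(%s)" % s for s in branches[:-1]) + branches[-1]
    return dfs(min(range(len(atoms)), key=lambda i: ranks[i]))
```

Feeding in the heavy atoms of ethanol in two different orders produces the same output string, which is exactly the invariance property a canonical representation needs.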
This document provides an introduction to retrosynthesis prediction and machine learning approaches for the task. It describes how retrosynthesis involves tracing reactions backward from a target product to predict required reactants. Classical computer-aided methods used reaction templates requiring domain expertise, while modern machine learning methods use neural networks to learn retrosynthesis without templates or can predict the most suitable template. Representative deep learning models discussed include sequence-to-sequence, graph neural networks, and transformer-based methods.
This document summarizes work using high-throughput computing on the Open Science Grid to generate large materials databases. Key points:
- The researchers used over 2.6 million CPU hours on the Open Science Grid to run thousands of ab initio calculations for materials properties like diffusion coefficients.
- This enabled the creation of the world's largest database of diffusion data from a single research group, with properties for over 350 material systems.
- The databases are publicly available online and help discover new scientific insights not possible from smaller datasets.
- The researchers are now using the same high-throughput approach on the Open Science Grid to calculate other materials properties at scale, like excess formation volumes in alloys.
Materials Project computation and database infrastructure (Anubhav Jain)
The document describes the Materials Project computation infrastructure, which uses the Atomate framework to automatically run density functional theory simulations on over 85,000 materials in a high-throughput manner, with the results stored in a MongoDB database for users to explore and analyze in order to accelerate materials innovation. The Materials Project infrastructure aims to make it easy for researchers to generate large amounts of computational data on materials properties through standardized and scalable workflows.
100 million compounds, 100K protein structures, 2 million reactions, 1 million journal articles, 20 million patents and 15 billion substructures. Is 20TB really Big Data? With modern hardware and efficient algorithms, many classic cheminformatics problems can be handled with today’s datasets. Noel O’Boyle, Daniel Lowe, John May and Roger Sayle of NextMove Software discuss how traditional cheminformatics tasks can be performed on large chemical datasets through techniques like precomputing substructures, optimised substructure searching, and graph databases.
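The precomputed-substructure trick mentioned above can be sketched in a few lines: pack each molecule's precomputed fragment ids into a single integer bitmask, then screen out molecules that cannot possibly contain the query with one bitwise test before running any expensive graph matching. The fragment vocabulary here is a made-up placeholder:

```python
def make_key(fragments, vocab):
    """Pack a molecule's precomputed fragment ids into one int bitmask."""
    key = 0
    for f in fragments:
        key |= 1 << vocab[f]
    return key

def prescreen(query_key, mol_keys):
    """A molecule can only contain the query substructure if every
    query bit is also set in the molecule's key."""
    return [i for i, m in enumerate(mol_keys) if query_key & m == query_key]
```

Only the survivors of the prescreen go on to full subgraph-isomorphism testing, which is where the bulk of the speedup on large datasets comes from.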
The document discusses improving chemical structure depictions in software. It describes lessons learned in developing better algorithms for layout, orientation, ring templates, and rendering. Key areas of focus are reducing overlaps, improving macrocycle depictions, and using standardized fonts and parameters for high quality publication-grade output. Comparisons of different cheminformatics toolkits on a test set of structures show RDKit generally performs well, while areas for further enhancement in CDK and other tools are discussed.
A schema generation approach for column-oriented NoSQL data stores (KIRAN V)
This document proposes two approaches to maintain schema information for column-oriented NoSQL databases like Apache HBase: 1) an online method that uses a generalized framework to parse inserted objects and maintain a global schema, and 2) an offline method that uses a genetic algorithm to select the best object from the data store to construct a "superschema". The system design and results evaluating the performance and accuracy of the two proposed approaches are also presented.
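The online method can be sketched as a fold over inserted objects: each new document widens a running global schema that maps dotted column paths to the set of observed value types. This is a simplified illustration of the idea, not the paper's framework:

```python
def merge_into_schema(schema, obj, path=""):
    """Fold one inserted object into the running global schema.

    schema maps dotted column paths to sets of observed type names,
    so heterogeneous inserts widen the schema instead of breaking it.
    """
    for key, value in obj.items():
        col = f"{path}.{key}" if path else key
        if isinstance(value, dict):
            merge_into_schema(schema, value, col)
        else:
            schema.setdefault(col, set()).add(type(value).__name__)
    return schema
```

Because the schema only ever grows, it can be maintained incrementally as rows are inserted, which is what distinguishes the online method from the offline genetic-algorithm approach.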
The open patent chemistry “big bang”: Implications, opportunities and caveats (Dr. Haxel Consult)
The document summarizes the implications of the large influx of patent chemistry data into PubChem from various sources performing chemical named entity recognition (CNER) on patent texts. Over 30 million structures have been added from these sources. While this "Big Bang" greatly expands the available chemistry, there are also caveats to consider like fragmentation of structures, inclusion of mixtures and virtual structures, and the fact that most added structures lack associated bioactivity data. The opportunities for data mining are significant but care must be taken to understand the limitations and artifacts of the automated extraction methods.
The Materials Project: An Electronic Structure Database for Community-Based M... (Anubhav Jain)
The document summarizes the Materials Project, an electronic structure database for materials design maintained by Lawrence Berkeley National Laboratory. It describes how the Materials Project uses high-throughput density functional theory calculations to compute properties of over 50,000 materials in its database. Users can search for materials, analyze computed properties, and design new materials using tools on the project's website.
Discovering advanced materials for energy applications (with high-throughput ... (Anubhav Jain)
This document summarizes a talk on discovering advanced materials for energy applications using high-throughput computing and mining the scientific literature. It discusses how materials discovery and optimization typically take decades due to the vast number of possible atomic configurations. Density functional theory provides a way to computationally screen millions of potential materials by automating calculations on supercomputers. Examples are given of new battery cathode and thermoelectric materials that have been discovered through high-throughput density functional theory calculations and later experimentally confirmed.
Mixtures QSAR: modelling collections of chemicals (Alex Clark)
This document discusses representing and modeling chemical mixtures. It proposes a new data format called Mixfile or MInChI to hierarchically define mixtures and their components, including concentrations. This format aims to support cheminformatics applications like property prediction. Examples are given modeling theophylline solubility and gas absorption using mixture data. The document also describes applying similar methods to model polymer entropy of mixing using a spreadsheet dataset converted to the mixtures format. It concludes that defining mixtures in digital formats will enable greater analysis, modeling and use of mixture data.
Mixtures InChI: a story of how standards drive upstream products (Alex Clark)
This document discusses the development of Mixtures InChI (MInChI), a standard for representing chemical mixtures in a machine-readable format. MInChI was developed to address the lack of standards for mixture informatics and interoperability. The document outlines the development of open source tools to generate and edit MInChI notation, as well as efforts to build a community and integrate MInChI into commercial products and databases to enable widespread use and generation of mixture data. Future work discussed includes finalizing the MInChI specification, extending it to additional chemical entities, developing associated properties and metadata, and implementing MInChI at large scale.
Mixtures as first class citizens in the realm of informatics (Alex Clark)
Presented at Cambridge (UK) cheminformatics meeting, February 2021. Mixtures of chemicals are underutilised from an informatics point of view, and this presentation shows some of the work done by Collaborative Drug Discovery, IUPAC and InChI Trust to remedy this.
See recording: https://www.youtube.com/watch?v=0ILc0owuEzQ&list=PLfj_gc4RCduuwv9p8lh2xS1EhQ3p_Nd9S&index=1 ... my part starts at 1:05:00
Mixtures: informatics for formulations and consumer products (Alex Clark)
The document proposes standards for representing mixtures in a machine-readable format. It introduces Mixfile and MInChI (Mixtures InChI) as hierarchical and concise formats for describing mixtures. Examples of formulations are provided to demonstrate how components, concentrations, and metadata can be encoded. Potential applications of the standards are discussed, such as enabling sophisticated searches of mixture data from publications and vendors to facilitate properties prediction and hazards assessment. Adoption of the standards could help ensure the longevity and sharing of mixture data.
Chemical mixtures: File format, open source tools, example data, and mixtures... (Alex Clark)
This document discusses representing chemical mixtures using an open format called Mixfile. It proposes Mixfile as a standard format for mixtures, analogous to Molfile for individual molecules. Tools were created to edit and manipulate Mixfiles. Over 5,600 real-world mixture examples were extracted from text and represented in the Mixfile format. A MInChI notation was also defined as a condensed representation of mixtures. Future work is proposed to integrate mixture definitions and lookups into electronic lab notebooks and improve automated extraction of mixture information from text.
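The hierarchical nature of the format is the key design point: a mixture is a tree of components, and a leaf's effective concentration is the product of the fractions along its path. The sketch below uses simplified field names ("name", "contents", "fraction") as stand-ins; real Mixfiles record quantities with explicit units rather than bare fractions:

```python
def flatten(component, scale=1.0):
    """Walk a Mixfile-like tree, multiplying fractional concentrations
    down the hierarchy to get each leaf's effective fraction."""
    frac = component.get("fraction", 1.0) * scale
    children = component.get("contents", [])
    if not children:
        return [(component.get("name", "?"), frac)]
    leaves = []
    for child in children:
        leaves.extend(flatten(child, frac))
    return leaves
```

For example, a 10% dilution of a 70:30 ethanol/water stock flattens to 7% ethanol overall, which is the figure a property-prediction model would actually consume.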
Bringing bioassay protocols to the world of informatics, using semantic annot... (Alex Clark)
This document discusses bringing bioassay protocols into the world of informatics by using semantic annotations. It describes how measurements from bioassays contain many details that are usually only available as text, and outlines an approach using ontologies, natural language processing, and machine learning to extract this information and make it accessible for searching, comparing datasets, and identifying trends. The goal is to make all bioassay protocol data machine readable by developing common templates and annotation standards that can be applied to existing and new assay data sources.
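The simplest layer of such a pipeline is dictionary lookup: match known ontology labels against the free-text protocol and emit term identifiers. The term ids below are hypothetical placeholders, and the real system layers natural language processing and machine learning on top of lookups like this:

```python
def annotate(protocol_text, vocabulary):
    """Match known ontology labels against free assay text.

    vocabulary maps a lower-case label to a term identifier; every
    label found in the text yields one annotation.
    """
    text = protocol_text.lower()
    return sorted(term for label, term in vocabulary.items() if label in text)
```

Annotations produced this way become searchable facets, so "all luciferase assays in this cell line" turns into a structured query instead of a text grep.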
Autonomous model building with a preponderance of well annotated assay protocols (Alex Clark)
Combining large amounts of publicly available structure-activity data with assays that have carefully curated annotations opens the door to a number of ways to analyze the data behind the scenes. Combining fully machine readable input for a diverse variety of projects with modelling techniques that can be used without fussy parametrization allows models to be created and updated whenever new data arrives. Predictions from these models can be integrated into normal searching and visualization workflows, without any need for the user to opt-in or make extra decisions. This approach is novel and different from the way structure-activity models are normally deployed: useful predictions can be presented ubiquitously with literally zero additional work on behalf of the user. We will present our efforts to date regarding ways to both passively and actively draw attention to important drug discovery trends while exploring compounds and assays.
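A model family that fits the "no fussy parametrization" requirement is a Laplace-corrected Bernoulli naive Bayes over fingerprint bits: it has no tuning knobs, so it can be rebuilt unattended whenever new data arrives. The sketch below is an illustrative assumption about such a background modeller, with molecules represented as sets of fingerprint bit ids:

```python
from math import log

def build_model(actives, inactives):
    """Laplace-corrected Bernoulli naive Bayes over fingerprint bits.

    Each molecule is a set of bit ids; each bit gets a log-odds weight
    for active vs inactive, with +1 smoothing so unseen bits are safe.
    """
    bits = set().union(*actives, *inactives)
    na, ni = len(actives), len(inactives)
    weights = {}
    for b in bits:
        pa = (sum(b in m for m in actives) + 1) / (na + 2)
        pi = (sum(b in m for m in inactives) + 1) / (ni + 2)
        weights[b] = log(pa) - log(pi)
    return weights

def score(model, mol_bits):
    """Sum the weights of the bits present; higher means more active-like."""
    return sum(model.get(b, 0.0) for b in mol_bits)
```

Because building and scoring are this cheap, predictions can be surfaced passively inside search and visualization workflows, which is the deployment model the abstract describes.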
Representing molecules with minimalism: A solution to the entropy of informatics (Alex Clark)
Cheminformatics as we know it is possible because so many molecular structures can be represented with datastructures and rules that are at first glance quite trivial. This first impression is highly misleading, since even within supposedly well behaved domains, edge cases arising from issues such as resonance, tautomerization, symmetry and stereochemistry - to name but a few - quickly add up. To supplement these genuine challenges, there is a whole additional class of problems caused by the mismatch between chemists' understanding of molecules and the datatypes that are necessary to capture a structure for informatics purposes. This line is blurred by the convenience of representing structures in a form that is very closely related to the diagram styles that have been in use since the dawn of chemistry. There are currently four major approaches to structure representation: connection tables (e.g. MDL Molfile), sketches (e.g. ChemDraw), canonical strings (e.g. SMILES and InChI) and atomic models (numerous 3D formats). Not only do all of these approaches have valid use cases, but they are deceptively incompatible with each other, even when addressing identical needs. Almost without exception, format conversions are not commutative, and every translation involves losing some amount of data. Given that recording chemical structures in machine readable form has become such a critical part of scientific research, it is essential to define a fundamental representation that captures the key structural definition asserted by the experimental chemist, for a broad and useful range of molecules, and ideally in a way that is closely related to visual drawing mnemonics. The number of data concepts needed to satisfy these conditions is quite small, and is mostly satisfied by the most commonly used subset of the venerable MDL Molfile format. 
This presentation will discuss how this subset, with a few minor corrections and clarifications, can and should be used as the reference standard for molecules, and how the informatics community can benefit from having well defined standards.
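The "commonly used subset" is small enough to parse in a few lines. This sketch handles only the core of a V2000 connection table: the counts line, the atom block (fixed-width coordinates plus element symbol), and the bond block (1-based atom indices plus bond order), ignoring the properties block:

```python
def parse_molfile(text):
    """Parse the commonly used core of a V2000 molfile."""
    lines = text.splitlines()
    counts = lines[3]                      # 3 header lines precede it
    natoms, nbonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + natoms]:
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        atoms.append((x, y, z, line[31:34].strip()))
    bonds = []
    for line in lines[4 + natoms:4 + natoms + nbonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds

# A minimal hand-written example: the heavy-atom skeleton of ethanol.
ETHANOL_MOL = """ethanol
  sketch

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0
    1.0000    0.0000    0.0000 C   0  0
    2.0000    0.0000    0.0000 O   0  0
  1  2  1  0
  2  3  1  0
M  END
"""
```

That this much of the format can be recovered with fixed-width slicing is part of the argument for treating the subset as a reference standard.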
Presentation to the EPA (August 2016) about the BioAssay Express project, from Collaborative Drug Discovery. Describes the history and potential of the project, with the intention of opening a dialog about incorporating EPA toxicity data.
SLAS2016: Why have one model when you could have thousands? (Alex Clark)
Society for Laboratory Automation & Screening, San Diego, January 2016. Presented by Dr. Alex M. Clark. Describes the use of open data resources (ChEMBL) to build target-activity models for drug discovery and toxicity prediction, on a massive scale, using a fully automated process. Concludes with a demo of the PolyPharma app, which shows how these models can be used for prospective drug discovery.
The anatomy of a chemical reaction: Dissection by machine learning algorithms (Alex Clark)
This document discusses using machine learning algorithms to analyze chemical reaction data. It describes how current reaction reporting formats are not well-suited for computational analysis. A more structured reporting format is proposed to fully describe reactions in a digitally friendly way, including specifying reactants, products, quantities, yields, and metrics like atom efficiency. This structured data would allow modeling of reaction substitutability and enable large-scale machine learning of chemical transformations.
Compact models for compact devices: Visualisation of SAR using mobile apps (Alex Clark)
Presented at American Chemical Society meeting, Boston, 2015. Describes how cheminformatics algorithms and visualisation interfaces have advanced on mobile apps to cover a diverse variety of functionality, increasingly calculated on the device itself rather than deferring to a web service. Culminates in a demo of the PolyPharma app prototype (see http://cheminf20.org/2015/08/06/the-polypharma-app-a-mash-up-of-ideas-and-technology)
Green chemistry in chemical reactions: informatics by design (Alex Clark)
Chemical informatics technology can be of assistance to chemists for describing reactions in numerous ways, including calculating green chemistry metrics such as process mass intensity, E-factor and atom economy. To facilitate this, chemical reactions have to be described in more precise detail than is the norm for most chemists. There are also numerous practical ways to add more green chemistry functionality to lab notebooks, such as enumerating searchable reaction transforms for environmentally favourable reactions, automatically looking up toxicity and hazard information, and others which are mentioned in the slides.
This presentation was given at the Green Chemistry & Engineering conference in 2015 (American Chemical Society Green Chemistry Institute).
Green chemistry is an important subject that needs to be a part of every chemist's education, as well as a part of the daily routine of the professional synthetic chemist. This talk describes how a new app can be used to bring green chemistry metrics to reaction descriptions, once they are captured in a proper cheminformatics format. It also describes some of the additional data resources that can be incorporated into the user experience, and how this helps both students and professionals.
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014) - Alex Clark
Mobile apps for cheminformatics are quite powerful on their own, but can be significantly boosted by connecting them with cloud-hosted functionality. This talk explores the range of functionality that can be covered simply by making use of apps with stateless webservices, i.e. anonymous access without persistent data.
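To illustrate the stateless idea, here is a minimal sketch in which every request is fully self-describing, so the server needs no login or session state; the endpoint URL is hypothetical:

```python
# Stateless-webservice illustration: all information travels in the
# request itself. The endpoint name below is hypothetical.
from urllib.parse import urlencode

def render_request_url(smiles: str, width: int = 300, height: int = 200) -> str:
    """Build a self-contained GET request for a hypothetical
    structure-rendering service; no stored state is needed to answer it."""
    base = "https://example.org/api/render"
    return base + "?" + urlencode({"smiles": smiles, "w": width, "h": height})

print(render_request_url("CCO"))
# because every parameter is in the URL, identical requests are cacheable
```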
Building a mobile reaction lab notebook (ACS Dallas 2014) - Alex Clark
This document discusses building a mobile electronic lab notebook focused on chemical reactions called the Green Lab Notebook. It would allow users to draw chemical structures, balance reactions, and calculate quantities, yields, and green metrics. Key features include digitally capturing reaction data, prioritizing computer-friendly data structures and intuitive workflows, and linking to external databases for solvent data, sustainable feedstocks, and curated green reaction transforms. The goal is to facilitate recording, analyzing, and promoting the reuse of experimental reaction data in a sustainable chemistry context.
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Presented at the German Chemoinformatics Conference in Fulda, 2013: entitled "Putting together the pieces: building a reaction-centric electronic lab notebook for mobile devices".
Cheminformatics workflows using the mobile + cloud platform
Presentation by Dr. Alex M. Clark of Molecular Materials Informatics at the NETTAB 2013 meeting in Venice, Italy. The presentation introduces the significance of mobile apps in science, and the scope of their capabilities in chemical structure informatics. The bulk of the talk describes an account of a preliminary workflow using open science data to search for viable leads for a cure for tuberculosis. The workflow described makes use of a combination of mobile, cloud and conventional desktop-based technology, all stitched together by facile communication, sharing and collaboration features.
Compositions of iron-meteorite parent bodies constrain the structure of the pr... - Sérgio Sacani
Magmatic iron-meteorite parent bodies are the earliest planetesimals in the Solar System, and they preserve information about conditions and planet-forming processes in the solar nebula. In this study, we include comprehensive elemental compositions and fractional-crystallization modeling for iron meteorites from the cores of five differentiated asteroids from the inner Solar System. Together with previous results of metallic cores from the outer Solar System, we conclude that asteroidal cores from the outer Solar System have smaller sizes, elevated siderophile-element abundances, and simpler crystallization processes than those from the inner Solar System. These differences are related to the formation locations of the parent asteroids because the solar protoplanetary disk varied in redox conditions, elemental distributions, and dynamics at different heliocentric distances. Using highly siderophile-element data from iron meteorites, we reconstruct the distribution of calcium-aluminum-rich inclusions (CAIs) across the protoplanetary disk within the first million years of Solar-System history. CAIs, the first solids to condense in the Solar System, formed close to the Sun. They were, however, concentrated within the outer disk and depleted within the inner disk. Future models of the structure and evolution of the protoplanetary disk should account for this distribution pattern of CAIs.
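The paper's own fractional-crystallization model is not reproduced here, but such models are commonly built on the Rayleigh fractionation law; a minimal sketch with illustrative numbers:

```python
# Standard Rayleigh fractionation law, often the core of
# fractional-crystallization models; values below are illustrative only.
def rayleigh_liquid(c0, D, F):
    """Concentration left in the liquid after fractional crystallization:
    C_L = C0 * F**(D - 1), where F is the melt fraction remaining and
    D the solid/liquid partition coefficient."""
    return c0 * F ** (D - 1)

# A compatible element (D > 1, e.g. Ir in solid metal) is progressively
# stripped from the melt as an asteroidal core crystallizes:
for F in (1.0, 0.5, 0.1):
    print(F, round(rayleigh_liquid(1.0, 4.0, F), 4))
```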
Order : Trombidiformes (Acarina) Class : Arachnida
Mites normally feed on the under surface of the leaves, but the symptoms are more easily seen on the upper surface.
Tetranychids produce blotching (Spots) on the leaf-surface.
Tarsonemids and Eriophyids produce distortion (twist), puckering (Folds) or stunting (Short) of leaves.
Eriophyids produce distinct galls or blisters (fluid-filled sacs in the outer layer).
This presentation offers a general idea of seed structure, seed production, seed management and allied technologies. It also covers the concept of gene erosion and the practices used to control it. Nursery and gardening are widely explored, along with their importance in the related domain.
Evaluation and Identification of J'BaFofi the Giant Spider of Congo and Moke... - MrSproy
ABSTRACT
The J'BaFofi, or "Giant Spider," is a largely legendary arachnid reportedly inhabiting the dense rain forests of the Congo. Despite numerous anecdotal accounts and cultural references, scientific validation remains elusive. My study aims to evaluate the existence of the J'BaFofi through the analysis of historical reports, indigenous testimonies and modern exploration efforts.
TOPIC OF DISCUSSION: CENTRIFUGATION SLIDESHARE.pptx - shubhijain836
Centrifugation is a powerful technique used in laboratories to separate components of a heterogeneous mixture based on their density. This process utilizes centrifugal force to rapidly spin samples, causing denser particles to migrate outward more quickly than lighter ones. As a result, distinct layers form within the sample tube, allowing for easy isolation and purification of target substances.
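The separating force applied to the sample is usually quoted as relative centrifugal force (RCF); a small sketch of the standard formula relating it to rotor radius and speed:

```python
def relative_centrifugal_force(radius_cm: float, rpm: float) -> float:
    """RCF (in multiples of g) from rotor radius and speed:
    RCF = 1.118e-5 * r(cm) * rpm^2  (standard centrifugation formula)."""
    return 1.118e-5 * radius_cm * rpm ** 2

# e.g. a 10 cm rotor spinning at 3000 rpm:
print(round(relative_centrifugal_force(10, 3000)))  # ~1006 x g
```

Note the quadratic dependence on speed: doubling the rpm quadruples the force driving dense particles outward.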
Dr. Firoozeh Kashani-Sabet is an innovator in Middle Eastern Studies and approaches her work, particularly focused on Iran, with a depth and commitment that has resulted in multiple book publications. She is notable for her work with the University of Pennsylvania, where she serves as the Walter H. Annenberg Professor of History.
Anti-Universe And Emergent Gravity and the Dark Universe - Sérgio Sacani
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
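As a quick back-of-envelope check on the quoted acceleration scale a0 = cH0, assuming H0 ≈ 70 km/s/Mpc:

```python
# Numerical estimate of the acceleration scale a0 = c * H0 from the
# abstract; H0 = 70 km/s/Mpc is an assumed round value.
c = 2.998e8            # speed of light, m/s
Mpc = 3.086e22         # metres per megaparsec
H0 = 70e3 / Mpc        # Hubble constant in 1/s
a0 = c * H0
print(f"{a0:.1e} m/s^2")   # ~6.8e-10 m/s^2, the scale of the extra force
```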
Presentation of our paper, "Towards Quantitative Evaluation of Explainable AI Methods for Deepfake Detection", by K. Tsigos, E. Apostolidis, S. Baxevanakis, S. Papadopoulos, V. Mezaris. Presented at the ACM Int. Workshop on Multimedia AI against Disinformation (MAD’24) of the ACM Int. Conf. on Multimedia Retrieval (ICMR’24), Thailand, June 2024. https://doi.org/10.1145/3643491.3660292 https://arxiv.org/abs/2404.18649
Software available at https://github.com/IDT-ITI/XAI-Deepfakes
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or within another organism as an endobiont.
Microbial interactions may be positive, such as mutualism, proto-cooperation and commensalism, or negative, such as parasitism, predation or competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Ammensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as a relationship in which each organism in the interaction benefits from the association. It is an obligatory relationship in which mutualist and host are metabolically dependent on each other.
A mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
A mutualistic relationship allows organisms to exist in habitats that could not be occupied by either species alone.
A mutualistic relationship between organisms allows them to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are an excellent example of mutualism.
They are associations of specific fungi with certain genera of algae. In a lichen, the fungal partner is called the mycobiont and the algal partner is called the phycobiont.
II. Syntrophism:
It is an association in which the growth of one organism either depends on, or is improved by, a substrate provided by another organism.
In syntrophism both organisms in the association benefit.
Compound A → (utilized by population 1) → Compound B → (utilized by population 2) → Compound C → (utilized by both populations 1 + 2) → Products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B, but cannot metabolize beyond compound B without the co-operation of population 2. Population 2 is unable to utilize compound A, but it can metabolize compound B, forming compound C. Populations 1 and 2 together can then carry out a metabolic sequence leading to an end product that neither population could produce alone.
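The two-population relay described above can be made concrete with a toy sketch, reducing each population to the single conversion step it can perform:

```python
# Toy model of syntrophism: population 1 can only take A -> B,
# population 2 only B -> C; the end product appears only when both act.
POP1 = {"A": "B"}          # conversion population 1 can catalyse
POP2 = {"B": "C"}          # conversion population 2 can catalyse

def ferment(substrate, *populations):
    """Apply each population's conversion step in turn, where possible."""
    for pop in populations:
        substrate = pop.get(substrate, substrate)
    return substrate

print(ferment("A", POP1))        # B  (population 1 alone stalls here)
print(ferment("A", POP2))        # A  (population 2 cannot touch A)
print(ferment("A", POP1, POP2))  # C  (co-operation yields the end product)
```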
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane production by methanogenic bacteria depends upon interspecies hydrogen transfer from fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 from carbohydrates; these are then utilized by methanogenic bacteria (e.g. Methanobacterium) to produce methane.
ii. Lactobacillus arabinosus and Enterococcus faecalis:
In minimal medium, Lactobacillus arabinosus and Enterococcus faecalis are able to grow together but not alone.
A synergistic relationship occurs in which E. faecalis requires folic acid, which is produced by L. arabinosus, while L. arabinosus requires phenylalanine, which is produced by E. faecalis.
2. COORDINATION INCHI
Goal
• Ideally
• All drawings of a chemical entity produce the same InChI/C
• One InChI/C can never match two drawings of different molecules
• Probably impossible, but can we get close enough to be useful?
3. COORDINATION INCHI
Deliverable
• Training set for inorganic compounds:
- real-world compounds (CSD, PubChem, misc)
- some drawn well, others drawn badly
• Prognosis for issues to expect:
a. current InChI works fine, or
b. new layer is required, or
c. intractable problems persist
• Use as a definitive pass/fail validation key
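The pass/fail validation idea can be sketched as follows; `to_inchi` here is a trivial stand-in canonicalizer (a sorted-character string), not a real InChI generator:

```python
# Sketch of the pass/fail validation key: every alternative drawing of a
# compound must map to the reference identifier. `to_inchi` is a toy
# stand-in, NOT real InChI code.
def to_inchi(drawing: str) -> str:
    return "".join(sorted(drawing.replace(" ", "")))

VALIDATION_KEY = {"ferrocene": to_inchi("Fe C5H5 C5H5")}

def validate(name: str, drawings: list) -> bool:
    """Pass only if all drawings collapse to the reference identifier."""
    return all(to_inchi(d) == VALIDATION_KEY[name] for d in drawings)

# two 'drawings' that differ only in layout should both pass:
print(validate("ferrocene", ["C5H5 Fe C5H5", "FeC5H5C5H5"]))  # True
```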
4. COORDINATION INCHI
Source Data
• Cambridge Structural Database:
- ≤ 500K inorganics that aren't polymers
- 2D coordinates, intelligent bonds, H-counts
- selected ~500 by diverse clustering
• PubChem:
- picked ~200 from large subset of garbage
- most had to be redrawn
• Miscellaneous:
- privately curated data ~500 compounds
- carefully drawn inorganic valences
9. COORDINATION INCHI
Rule 1
• If your representation does not imply the correct molecular formula, then you are wrong
• Most cheminformatics formats/editors/use patterns fail this test for nontrivial inorganics
10. COORDINATION INCHI
Rule 2
• In order of preference:
(a) correct valence for early main groups
(b) inferred electron delocalisation paths
(c) realistic bond orders & formal charges
(d) sensible oxidation states on metals
(e) symmetry
• Usually possible to satisfy all conditions, with frequent exception of symmetry
11. COORDINATION INCHI
Rule 3
• Non-trivial inorganics usually offer many
correct ways to draw
• Avoid overspecification
- more metadata can be added later
• Use only minimum information needed to:
- satisfy rule 1 (imply formula)
- optimise for rule 2
- resolve genuinely different molecules
14. COORDINATION INCHI
Algorithm: Implementation
• atom priority → [element, hcount, chg*]
• bond → <0, 0..1, 1, 1..2, 2, 2..3, 3+>
• iterate: atom priority → [a, ⇪{b1, a1}, {b2, a2}, …]
• if degenerate, bump lowest priority atom & repeat
• outcome: atom priority = walk order
• can now serialise in various different ways, e.g.
SMILES-esque, InChI-esque
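A toy version of the refine-until-stable loop these bullets describe, on a hand-built graph; the bond-class encoding and the tie-bumping step are simplified away:

```python
# Minimal sketch of iterative atom-priority refinement (Morgan-style):
# start from local atom invariants, then repeatedly extend each atom's
# key with its neighbours' (bond, rank) pairs until the ranking stops
# splitting. Tie-bumping for still-degenerate atoms is omitted.
def refine_ranks(atoms, bonds):
    """atoms: list of (element, h_count, charge) tuples, one per atom.
    bonds: dict mapping atom index -> list of (bond_order, neighbour_index).
    Returns a rank per atom; equal ranks mean still-symmetric atoms."""
    keys = list(atoms)
    ranks = rank_of(keys)
    while True:
        keys = [(ranks[i],
                 tuple(sorted((b, ranks[j]) for b, j in bonds.get(i, []))))
                for i in range(len(atoms))]
        new = rank_of(keys)
        if new == ranks:          # no further splitting: stable
            return ranks
        ranks = new

def rank_of(keys):
    order = sorted(set(keys))
    return [order.index(k) for k in keys]

# acetic acid CH3-C(=O)-OH as a toy graph: 0=CH3, 1=C, 2==O, 3=OH
atoms = [("C", 3, 0), ("C", 0, 0), ("O", 0, 0), ("O", 1, 0)]
bonds = {0: [(1, 1)], 1: [(1, 0), (2, 2), (1, 3)], 2: [(2, 1)], 3: [(1, 1)]}
print(refine_ranks(atoms, bonds))  # every atom gets a distinct rank
```

Once every atom has a distinct rank, the ranking fixes a walk order, which can then be serialised SMILES-esque or InChI-esque as the slide notes.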
15. COORDINATION INCHI
Algorithm: Outcome
• The algorithm's weakest link is detecting delocalisation islands
• The user's weakest link is implying correct hydrogen counts
• Remarkably tolerant to multiple ways of
drawing inorganic bonds
• Preliminary results are promising for
disambiguating inorganics correctly