PubChem QC project. In this project we perform quantum chemistry calculations on the molecules in the PubChem database. Currently 1,100,000 molecules are available at http://pubchemqc.riken.jp/ . The results are in the public domain.
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining (Mikel Emaldi Manrique)
This document proposes a method to detect related semantic datasets based on frequent subgraph mining. The method extracts the most frequent subgraphs from RDF graphs using the SUBDUE algorithm. These subgraphs are then matched across datasets to identify potential links. The method is evaluated against gold standard links and baselines, showing precise but limited recall. Future work to improve recall is discussed, such as using string similarity techniques.
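The core idea can be sketched without the full SUBDUE machinery: abstract each RDF triple to a coarse structural pattern, keep only the frequent ones, and treat patterns shared between two datasets as candidate link points. The pattern abstraction and support threshold below are illustrative assumptions, not the paper's actual algorithm:

```python
from collections import Counter

def frequent_patterns(triples, min_support=2):
    """Count (subject-kind, predicate, object-kind) edge patterns.

    Each triple is abstracted so structurally similar edges group
    together; only patterns meeting min_support are kept.
    """
    def kind(node):
        # crude typing: URIs keep their namespace, literals collapse
        return node.rsplit("/", 1)[0] if node.startswith("http") else "literal"
    counts = Counter((kind(s), p, kind(o)) for s, p, o in triples)
    return {pat for pat, n in counts.items() if n >= min_support}

def related(ds_a, ds_b, min_support=2):
    """Datasets sharing frequent patterns are candidate link targets."""
    return frequent_patterns(ds_a, min_support) & frequent_patterns(ds_b, min_support)
```

Two datasets that both frequently use the same predicate over the same namespace would be flagged as related and handed to a finer-grained matcher.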
Universal SMILES: Finally a canonical SMILES string (baoilleach)
The document discusses the development of a "Universal SMILES" string that can generate a canonical SMILES identifier for molecules. It describes taking the canonical labels from the InChI and using them to traverse the molecular graph in a set way, encoding the results as a SMILES string. This approach was able to generate canonical SMILES for over 99.7% of molecules tested from large databases, with the main exceptions due to differences in stereochemistry perception between the InChI and the toolkit used. The Universal SMILES represents a significant step towards a single canonical representation for small molecules.
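The mechanism can be illustrated with a toy stand-in: derive stable atom ranks (here via a Morgan-style refinement, whereas the actual Universal SMILES takes the ranks from the InChI canonical labels) and then traverse the graph in rank order so that any input atom ordering yields the same string. Bond orders and stereochemistry are ignored in this sketch:

```python
def canonical_ranks(atoms, bonds):
    """Iterative refinement: start from element symbols and repeatedly
    split ranks by the sorted neighbour ranks until the partition is
    stable. (Universal SMILES uses InChI labels instead of this.)"""
    n = len(atoms)
    nbrs = [[] for _ in range(n)]
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)
    ranks = [sorted(set(atoms)).index(s) for s in atoms]
    while True:
        keys = [(ranks[i], tuple(sorted(ranks[j] for j in nbrs[i])))
                for i in range(n)]
        order = sorted(set(keys))
        new = [order.index(k) for k in keys]
        if new == ranks:
            return ranks
        ranks = new

def write_string(atoms, bonds, ranks):
    """Depth-first traversal in canonical rank order, emitting a
    SMILES-like string with parentheses for branches."""
    nbrs = [[] for _ in range(len(atoms))]
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)
    seen = set()
    def dfs(i):
        seen.add(i)
        branches = [dfs(j) for j in sorted(nbrs[i], key=lambda j: ranks[j])
                    if j not in seen]
        if not branches:
            return atoms[i]
        return atoms[i] + "".join("(%s)" % s for s in branches[:-1]) + branches[-1]
    return dfs(min(range(len(atoms)), key=lambda i: ranks[i]))
```

Feeding in the heavy atoms of ethanol in two different orders produces the same output string, which is exactly the invariance property a canonical representation needs.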
This document provides an introduction to retrosynthesis prediction and machine learning approaches for the task. It describes how retrosynthesis involves tracing reactions backward from a target product to predict required reactants. Classical computer-aided methods used reaction templates requiring domain expertise, while modern machine learning methods use neural networks to learn retrosynthesis without templates or can predict the most suitable template. Representative deep learning models discussed include sequence-to-sequence, graph neural networks, and transformer-based methods.
This document summarizes work using high-throughput computing on the Open Science Grid to generate large materials databases. Key points:
- The researchers used over 2.6 million CPU hours on the Open Science Grid to run thousands of ab initio calculations for materials properties like diffusion coefficients.
- This enabled the creation of the world's largest database of diffusion data from a single research group, with properties for over 350 material systems.
- The databases are publicly available online and help discover new scientific insights not possible from smaller datasets.
- The researchers are now using the same high-throughput approach on the Open Science Grid to calculate other materials properties at scale, like excess formation volumes in alloys.
Materials Project computation and database infrastructure (Anubhav Jain)
The document describes the Materials Project computation infrastructure, which uses the Atomate framework to automatically run density functional theory simulations on over 85,000 materials in a high-throughput manner, with the results stored in a MongoDB database for users to explore and analyze in order to accelerate materials innovation. The Materials Project infrastructure aims to make it easy for researchers to generate large amounts of computational data on materials properties through standardized and scalable workflows.
100 million compounds, 100K protein structures, 2 million reactions, 1 million journal articles, 20 million patents and 15 billion substructures. Is 20TB really Big Data? With modern hardware and efficient algorithms, many classic cheminformatics problems can be handled with today’s datasets. Noel O’Boyle, Daniel Lowe, John May and Roger Sayle of NextMove Software discuss how traditional cheminformatics tasks can be performed on large chemical datasets through techniques like precomputing substructures, optimised substructure searching, and graph databases.
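The precomputed-substructure trick mentioned above can be sketched in a few lines: pack each molecule's precomputed fragment ids into a single integer bitmask, then screen out molecules that cannot possibly contain the query with one bitwise test before running any expensive graph matching. The fragment vocabulary here is a made-up placeholder:

```python
def make_key(fragments, vocab):
    """Pack a molecule's precomputed fragment ids into one int bitmask."""
    key = 0
    for f in fragments:
        key |= 1 << vocab[f]
    return key

def prescreen(query_key, mol_keys):
    """A molecule can only contain the query substructure if every
    query bit is also set in the molecule's key."""
    return [i for i, m in enumerate(mol_keys) if query_key & m == query_key]
```

Only the survivors of the prescreen go on to full subgraph-isomorphism testing, which is where the bulk of the speedup on large datasets comes from.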
The document discusses improving chemical structure depictions in software. It describes lessons learned in developing better algorithms for layout, orientation, ring templates, and rendering. Key areas of focus are reducing overlaps, improving macrocycle depictions, and using standardized fonts and parameters for high quality publication-grade output. Comparisons of different cheminformatics toolkits on a test set of structures show RDKit generally performs well, while areas for further enhancement in CDK and other tools are discussed.
A schema generation approach for column-oriented NoSQL data stores (KIRAN V)
This document proposes two approaches to maintain schema information for column-oriented NoSQL databases like Apache HBase: 1) an online method that uses a generalized framework to parse inserted objects and maintain a global schema, and 2) an offline method that uses a genetic algorithm to select the best object from the data store to construct a "superschema". The system design and results evaluating the performance and accuracy of the two proposed approaches are also presented.
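The online method can be sketched as a fold over inserted objects: each new document widens a running global schema that maps dotted column paths to the set of observed value types. This is a simplified illustration of the idea, not the paper's framework:

```python
def merge_into_schema(schema, obj, path=""):
    """Fold one inserted object into the running global schema.

    schema maps dotted column paths to sets of observed type names,
    so heterogeneous inserts widen the schema instead of breaking it.
    """
    for key, value in obj.items():
        col = f"{path}.{key}" if path else key
        if isinstance(value, dict):
            merge_into_schema(schema, value, col)
        else:
            schema.setdefault(col, set()).add(type(value).__name__)
    return schema
```

Because the schema only ever grows, it can be maintained incrementally as rows are inserted, which is what distinguishes the online method from the offline genetic-algorithm approach.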
The open patent chemistry “big bang”: Implications, opportunities and caveats (Dr. Haxel Consult)
The document summarizes the implications of the large influx of patent chemistry data into PubChem from various sources performing chemical named entity recognition (CNER) on patent texts. Over 30 million structures have been added from these sources. While this "Big Bang" greatly expands the available chemistry, there are also caveats to consider like fragmentation of structures, inclusion of mixtures and virtual structures, and the fact that most added structures lack associated bioactivity data. The opportunities for data mining are significant but care must be taken to understand the limitations and artifacts of the automated extraction methods.
The Materials Project: An Electronic Structure Database for Community-Based M... (Anubhav Jain)
The document summarizes the Materials Project, an electronic structure database for materials design maintained by Lawrence Berkeley National Laboratory. It describes how the Materials Project uses high-throughput density functional theory calculations to compute properties of over 50,000 materials in its database. Users can search for materials, analyze computed properties, and design new materials using tools on the project's website.
Discovering advanced materials for energy applications (with high-throughput ... (Anubhav Jain)
This document summarizes a talk on discovering advanced materials for energy applications using high-throughput computing and mining the scientific literature. It discusses how materials discovery and optimization typically take decades due to the vast number of possible atomic configurations. Density functional theory provides a way to computationally screen millions of potential materials by automating calculations on supercomputers. Examples are given of new battery cathode and thermoelectric materials that have been discovered through high-throughput density functional theory calculations and later experimentally confirmed.
Mixtures QSAR: modelling collections of chemicals (Alex Clark)
This document discusses representing and modeling chemical mixtures. It proposes a new data format called Mixfile or MInChI to hierarchically define mixtures and their components, including concentrations. This format aims to support cheminformatics applications like property prediction. Examples are given modeling theophylline solubility and gas absorption using mixture data. The document also describes applying similar methods to model polymer entropy of mixing using a spreadsheet dataset converted to the mixtures format. It concludes that defining mixtures in digital formats will enable greater analysis, modeling and use of mixture data.
Mixtures InChI: a story of how standards drive upstream products (Alex Clark)
This document discusses the development of Mixtures InChI (MInChI), a standard for representing chemical mixtures in a machine-readable format. MInChI was developed to address the lack of standards for mixture informatics and interoperability. The document outlines the development of open source tools to generate and edit MInChI notation, as well as efforts to build a community and integrate MInChI into commercial products and databases to enable widespread use and generation of mixture data. Future work discussed includes finalizing the MInChI specification, extending it to additional chemical entities, developing associated properties and metadata, and implementing MInChI at large scale.
Mixtures as first class citizens in the realm of informatics (Alex Clark)
Presented at Cambridge (UK) cheminformatics meeting, February 2021. Mixtures of chemicals are underutilised from an informatics point of view, and this presentation shows some of the work done by Collaborative Drug Discovery, IUPAC and InChI Trust to remedy this.
See recording: https://www.youtube.com/watch?v=0ILc0owuEzQ&list=PLfj_gc4RCduuwv9p8lh2xS1EhQ3p_Nd9S&index=1 ... my part starts at 1:05:00
Mixtures: informatics for formulations and consumer products (Alex Clark)
The document proposes standards for representing mixtures in a machine-readable format. It introduces Mixfile and MInChI (Mixtures InChI) as hierarchical and concise formats for describing mixtures. Examples of formulations are provided to demonstrate how components, concentrations, and metadata can be encoded. Potential applications of the standards are discussed, such as enabling sophisticated searches of mixture data from publications and vendors to facilitate properties prediction and hazards assessment. Adoption of the standards could help ensure the longevity and sharing of mixture data.
Chemical mixtures: File format, open source tools, example data, and mixtures... (Alex Clark)
This document discusses representing chemical mixtures using an open format called Mixfile. It proposes Mixfile as a standard format for mixtures, analogous to Molfile for individual molecules. Tools were created to edit and manipulate Mixfiles. Over 5,600 real-world mixture examples were extracted from text and represented in the Mixfile format. A MInChI notation was also defined as a condensed representation of mixtures. Future work is proposed to integrate mixture definitions and lookups into electronic lab notebooks and improve automated extraction of mixture information from text.
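The hierarchical nature of the format is the key design point: a mixture is a tree of components, and a leaf's effective concentration is the product of the fractions along its path. The sketch below uses simplified field names ("name", "contents", "fraction") as stand-ins; real Mixfiles record quantities with explicit units rather than bare fractions:

```python
def flatten(component, scale=1.0):
    """Walk a Mixfile-like tree, multiplying fractional concentrations
    down the hierarchy to get each leaf's effective fraction."""
    frac = component.get("fraction", 1.0) * scale
    children = component.get("contents", [])
    if not children:
        return [(component.get("name", "?"), frac)]
    leaves = []
    for child in children:
        leaves.extend(flatten(child, frac))
    return leaves
```

For example, a 10% dilution of a 70:30 ethanol/water stock flattens to 7% ethanol overall, which is the figure a property-prediction model would actually consume.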
Bringing bioassay protocols to the world of informatics, using semantic annot... (Alex Clark)
This document discusses bringing bioassay protocols into the world of informatics by using semantic annotations. It describes how measurements from bioassays contain many details that are usually only available as text, and outlines an approach using ontologies, natural language processing, and machine learning to extract this information and make it accessible for searching, comparing datasets, and identifying trends. The goal is to make all bioassay protocol data machine readable by developing common templates and annotation standards that can be applied to existing and new assay data sources.
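The simplest layer of such a pipeline is dictionary lookup: match known ontology labels against the free-text protocol and emit term identifiers. The term ids below are hypothetical placeholders, and the real system layers natural language processing and machine learning on top of lookups like this:

```python
def annotate(protocol_text, vocabulary):
    """Match known ontology labels against free assay text.

    vocabulary maps a lower-case label to a term identifier; every
    label found in the text yields one annotation.
    """
    text = protocol_text.lower()
    return sorted(term for label, term in vocabulary.items() if label in text)
```

Annotations produced this way become searchable facets, so "all luciferase assays in this cell line" turns into a structured query instead of a text grep.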
Autonomous model building with a preponderance of well annotated assay protocols (Alex Clark)
Combining large amounts of publicly available structure-activity data with assays that have carefully curated annotations opens the door to a number of ways to analyze the data behind the scenes. Combining fully machine readable input for a diverse variety of projects with modelling techniques that can be used without fussy parametrization allows models to be created and updated whenever new data arrives. Predictions from these models can be integrated into normal searching and visualization workflows, without any need for the user to opt-in or make extra decisions. This approach is novel and different from the way structure-activity models are normally deployed: useful predictions can be presented ubiquitously with literally zero additional work on behalf of the user. We will present our efforts to date regarding ways to both passively and actively draw attention to important drug discovery trends while exploring compounds and assays.
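A model family that fits the "no fussy parametrization" requirement is a Laplace-corrected Bernoulli naive Bayes over fingerprint bits: it has no tuning knobs, so it can be rebuilt unattended whenever new data arrives. The sketch below is an illustrative assumption about such a background modeller, with molecules represented as sets of fingerprint bit ids:

```python
from math import log

def build_model(actives, inactives):
    """Laplace-corrected Bernoulli naive Bayes over fingerprint bits.

    Each molecule is a set of bit ids; each bit gets a log-odds weight
    for active vs inactive, with +1 smoothing so unseen bits are safe.
    """
    bits = set().union(*actives, *inactives)
    na, ni = len(actives), len(inactives)
    weights = {}
    for b in bits:
        pa = (sum(b in m for m in actives) + 1) / (na + 2)
        pi = (sum(b in m for m in inactives) + 1) / (ni + 2)
        weights[b] = log(pa) - log(pi)
    return weights

def score(model, mol_bits):
    """Sum the weights of the bits present; higher means more active-like."""
    return sum(model.get(b, 0.0) for b in mol_bits)
```

Because building and scoring are this cheap, predictions can be surfaced passively inside search and visualization workflows, which is the deployment model the abstract describes.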
Representing molecules with minimalism: A solution to the entropy of informatics (Alex Clark)
Cheminformatics as we know it is possible because so many molecular structures can be represented with datastructures and rules that are at first glance quite trivial. This first impression is highly misleading, since even within supposedly well behaved domains, edge cases arising from issues such as resonance, tautomerization, symmetry and stereochemistry - to name but a few - quickly add up. To supplement these genuine challenges, there is a whole additional class of problems caused by the mismatch between chemists' understanding of molecules and the datatypes that are necessary to capture a structure for informatics purposes. This line is blurred by the convenience of representing structures in a form that is very closely related to the diagram styles that have been in use since the dawn of chemistry. There are currently four major approaches to structure representation: connection tables (e.g. MDL Molfile), sketches (e.g. ChemDraw), canonical strings (e.g. SMILES and InChI) and atomic models (numerous 3D formats). Not only do all of these approaches have valid use cases, but they are deceptively incompatible with each other, even when addressing identical needs. Almost without exception, format conversions are not commutative, and every translation involves losing some amount of data. Given that recording chemical structures in machine readable form has become such a critical part of scientific research, it is essential to define a fundamental representation that captures the key structural definition asserted by the experimental chemist, for a broad and useful range of molecules, and ideally in a way that is closely related to visual drawing mnemonics. The number of data concepts needed to satisfy these conditions is quite small, and is mostly satisfied by the most commonly used subset of the venerable MDL Molfile format. 
This presentation will discuss how this subset, with a few minor corrections and clarifications, can and should be used as the reference standard for molecules, and how the informatics community can benefit from having well defined standards.
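The "commonly used subset" is small enough to parse in a few lines. This sketch handles only the core of a V2000 connection table: the counts line, the atom block (fixed-width coordinates plus element symbol), and the bond block (1-based atom indices plus bond order), ignoring the properties block:

```python
def parse_molfile(text):
    """Parse the commonly used core of a V2000 molfile."""
    lines = text.splitlines()
    counts = lines[3]                      # 3 header lines precede it
    natoms, nbonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + natoms]:
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        atoms.append((x, y, z, line[31:34].strip()))
    bonds = []
    for line in lines[4 + natoms:4 + natoms + nbonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds

# A minimal hand-written example: the heavy-atom skeleton of ethanol.
ETHANOL_MOL = """ethanol
  sketch

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0
    1.0000    0.0000    0.0000 C   0  0
    2.0000    0.0000    0.0000 O   0  0
  1  2  1  0
  2  3  1  0
M  END
"""
```

That this much of the format can be recovered with fixed-width slicing is part of the argument for treating the subset as a reference standard.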
Presentation to the EPA (August 2016) about the BioAssay Express project, from Collaborative Drug Discovery. Describes the history and potential of the project, with the intention of opening a dialog about incorporating EPA toxicity data.
SLAS2016: Why have one model when you could have thousands? (Alex Clark)
Society for Laboratory Automation & Screening, San Diego, January 2016. Presented by Dr. Alex M. Clark. Describes the use of open data resources (ChEMBL) to build target-activity models for drug discovery and toxicity prediction, on a massive scale, using a fully automated process. Concludes with a demo of the PolyPharma app, which shows how these models can be used for prospective drug discovery.
The anatomy of a chemical reaction: Dissection by machine learning algorithms (Alex Clark)
This document discusses using machine learning algorithms to analyze chemical reaction data. It describes how current reaction reporting formats are not well-suited for computational analysis. A more structured reporting format is proposed to fully describe reactions in a digitally friendly way, including specifying reactants, products, quantities, yields, and metrics like atom efficiency. This structured data would allow modeling of reaction substitutability and enable large-scale machine learning of chemical transformations.
Compact models for compact devices: Visualisation of SAR using mobile apps (Alex Clark)
Presented at American Chemical Society meeting, Boston, 2015. Describes how cheminformatics algorithms and visualisation interfaces have advanced on mobile apps to cover a diverse variety of functionality, increasingly calculated on the device itself rather than deferring to a web service. Culminates in a demo of the PolyPharma app prototype (see http://cheminf20.org/2015/08/06/the-polypharma-app-a-mash-up-of-ideas-and-technology)
Green chemistry in chemical reactions: informatics by design (Alex Clark)
Chemical informatics technology can be of assistance to chemists for describing reactions in numerous ways, including calculating green chemistry metrics such as process mass intensity, E-factor and atom economy. To facilitate this, chemical reactions have to be described in more precise detail than is the norm for most chemists. There are also numerous practical ways to add more green chemistry functionality to lab notebooks, such as enumerating searchable reaction transforms for environmentally favourable reactions, automatically looking up toxicity and hazard information, and others which are mentioned in the slides.
This presentation was given at the Green Chemistry & Engineering conference in 2015 (American Chemical Society Green Chemistry Institute).
Green chemistry is an important subject that needs to be a part of every chemist's education, as well as a part of the daily routine of the professional synthetic chemist. This talk describes how a new app can be used to bring green chemistry metrics to reaction descriptions, once they are captured in a proper cheminformatics format. It also describes some of the additional data resources that can be incorporated into the user experience, and how this helps both students and professionals.
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014) - Alex Clark
Mobile apps for cheminformatics are quite powerful on their own, but can be significantly boosted by connecting them with cloud-hosted functionality. This talk explores the range of functionality that can be covered simply by making use of apps with stateless webservices, i.e. anonymous access without persistent data.
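To illustrate the stateless idea, here is a minimal sketch in which every request is fully self-describing, so the server needs no login or session state; the endpoint URL is hypothetical:

```python
# Stateless-webservice illustration: all information travels in the
# request itself. The endpoint name below is hypothetical.
from urllib.parse import urlencode

def render_request_url(smiles: str, width: int = 300, height: int = 200) -> str:
    """Build a self-contained GET request for a hypothetical
    structure-rendering service; no stored state is needed to answer it."""
    base = "https://example.org/api/render"
    return base + "?" + urlencode({"smiles": smiles, "w": width, "h": height})

print(render_request_url("CCO"))
# because every parameter is in the URL, identical requests are cacheable
```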
Building a mobile reaction lab notebook (ACS Dallas 2014) - Alex Clark
This document discusses building a mobile electronic lab notebook focused on chemical reactions called the Green Lab Notebook. It would allow users to draw chemical structures, balance reactions, and calculate quantities, yields, and green metrics. Key features include digitally capturing reaction data, prioritizing computer-friendly data structures and intuitive workflows, and linking to external databases for solvent data, sustainable feedstocks, and curated green reaction transforms. The goal is to facilitate recording, analyzing, and promoting the reuse of experimental reaction data in a sustainable chemistry context.
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Presented at the German Chemoinformatics Conference in Fulda, 2013: entitled "Putting together the pieces: building a reaction-centric electronic lab notebook for mobile devices".
Cheminformatics workflows using the mobile + cloud platform
Presentation by Dr. Alex M. Clark of Molecular Materials Informatics at the NETTAB 2013 meeting in Venice, Italy. The presentation introduces the significance of mobile apps in science, and the scope of their capabilities in chemical structure informatics. The bulk of the talk describes an account of a preliminary workflow using open science data to search for viable leads for a cure for tuberculosis. The workflow described makes use of a combination of mobile, cloud and conventional desktop-based technology, all stitched together by facile communication, sharing and collaboration features.
Compositions of iron-meteorite parent bodies constrain the structure of the pr... - Sérgio Sacani
Magmatic iron-meteorite parent bodies are the earliest planetesimals in the Solar System, and they preserve information about conditions and planet-forming processes in the solar nebula. In this study, we include comprehensive elemental compositions and fractional-crystallization modeling for iron meteorites from the cores of five differentiated asteroids from the inner Solar System. Together with previous results of metallic cores from the outer Solar System, we conclude that asteroidal cores from the outer Solar System have smaller sizes, elevated siderophile-element abundances, and simpler crystallization processes than those from the inner Solar System. These differences are related to the formation locations of the parent asteroids because the solar protoplanetary disk varied in redox conditions, elemental distributions, and dynamics at different heliocentric distances. Using highly siderophile-element data from iron meteorites, we reconstruct the distribution of calcium-aluminum-rich inclusions (CAIs) across the protoplanetary disk within the first million years of Solar-System history. CAIs, the first solids to condense in the Solar System, formed close to the Sun. They were, however, concentrated within the outer disk and depleted within the inner disk. Future models of the structure and evolution of the protoplanetary disk should account for this distribution pattern of CAIs.
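The paper's own fractional-crystallization model is not reproduced here, but such models are commonly built on the Rayleigh fractionation law; a minimal sketch with illustrative numbers:

```python
# Standard Rayleigh fractionation law, often the core of
# fractional-crystallization models; values below are illustrative only.
def rayleigh_liquid(c0, D, F):
    """Concentration left in the liquid after fractional crystallization:
    C_L = C0 * F**(D - 1), where F is the melt fraction remaining and
    D the solid/liquid partition coefficient."""
    return c0 * F ** (D - 1)

# A compatible element (D > 1, e.g. Ir in solid metal) is progressively
# stripped from the melt as an asteroidal core crystallizes:
for F in (1.0, 0.5, 0.1):
    print(F, round(rayleigh_liquid(1.0, 4.0, F), 4))
```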
Order : Trombidiformes (Acarina) Class : Arachnida
Mites normally feed on the under surface of the leaves, but the symptoms are more easily seen on the upper surface.
Tetranychids produce blotching (Spots) on the leaf-surface.
Tarsonemids and Eriophyids produce distortion (twist), puckering (Folds) or stunting (Short) of leaves.
Eriophyids produce distinct galls or blisters (fluid-filled sacs in the outer layer).
This presentation offers a general idea of seed structure, seed production, seed management and allied technologies. It also covers the concept of gene erosion and the practices used to control it. Nursery and gardening are widely explored, along with their importance in the related domain.
Evaluation and Identification of J'BaFofi the Giant Spider of Congo and Moke... - MrSproy
ABSTRACT
The J'BaFofi, or "Giant Spider," is a largely legendary arachnid reportedly inhabiting the dense rain forests of the Congo. Despite numerous anecdotal accounts and cultural references, scientific validation remains elusive. My study aims to evaluate the existence of the J'BaFofi through the analysis of historical reports, indigenous testimonies and modern exploration efforts.
TOPIC OF DISCUSSION: CENTRIFUGATION SLIDESHARE.pptx - shubhijain836
Centrifugation is a powerful technique used in laboratories to separate components of a heterogeneous mixture based on their density. This process utilizes centrifugal force to rapidly spin samples, causing denser particles to migrate outward more quickly than lighter ones. As a result, distinct layers form within the sample tube, allowing for easy isolation and purification of target substances.
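The separating force applied to the sample is usually quoted as relative centrifugal force (RCF); a small sketch of the standard formula relating it to rotor radius and speed:

```python
def relative_centrifugal_force(radius_cm: float, rpm: float) -> float:
    """RCF (in multiples of g) from rotor radius and speed:
    RCF = 1.118e-5 * r(cm) * rpm^2  (standard centrifugation formula)."""
    return 1.118e-5 * radius_cm * rpm ** 2

# e.g. a 10 cm rotor spinning at 3000 rpm:
print(round(relative_centrifugal_force(10, 3000)))  # ~1006 x g
```

Note the quadratic dependence on speed: doubling the rpm quadruples the force driving dense particles outward.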
Dr. Firoozeh Kashani-Sabet is an innovator in Middle Eastern Studies and approaches her work, particularly focused on Iran, with a depth and commitment that has resulted in multiple book publications. She is notable for her work with the University of Pennsylvania, where she serves as the Walter H. Annenberg Professor of History.
Anti-Universe And Emergent Gravity and the Dark Universe - Sérgio Sacani
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
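As a quick back-of-envelope check on the quoted acceleration scale a0 = cH0, assuming H0 ≈ 70 km/s/Mpc:

```python
# Numerical estimate of the acceleration scale a0 = c * H0 from the
# abstract; H0 = 70 km/s/Mpc is an assumed round value.
c = 2.998e8            # speed of light, m/s
Mpc = 3.086e22         # metres per megaparsec
H0 = 70e3 / Mpc        # Hubble constant in 1/s
a0 = c * H0
print(f"{a0:.1e} m/s^2")   # ~6.8e-10 m/s^2, the scale of the extra force
```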
Presentation of our paper, "Towards Quantitative Evaluation of Explainable AI Methods for Deepfake Detection", by K. Tsigos, E. Apostolidis, S. Baxevanakis, S. Papadopoulos, V. Mezaris. Presented at the ACM Int. Workshop on Multimedia AI against Disinformation (MAD’24) of the ACM Int. Conf. on Multimedia Retrieval (ICMR’24), Thailand, June 2024. https://doi.org/10.1145/3643491.3660292 https://arxiv.org/abs/2404.18649
Software available at https://github.com/IDT-ITI/XAI-Deepfakes
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or within another organism as an endobiont.
Microbial interactions may be positive, such as mutualism, proto-cooperation and commensalism, or negative, such as parasitism, predation or competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Ammensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as a relationship in which each organism in the interaction benefits from the association. It is an obligatory relationship in which mutualist and host are metabolically dependent on each other.
A mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
A mutualistic relationship allows organisms to exist in habitats that could not be occupied by either species alone.
A mutualistic relationship between organisms allows them to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are an excellent example of mutualism.
They are associations of specific fungi with certain genera of algae. In a lichen, the fungal partner is called the mycobiont and the algal partner is called the phycobiont.
II. Syntrophism:
It is an association in which the growth of one organism either depends on, or is improved by, a substrate provided by another organism.
In syntrophism both organisms in the association benefit.
Compound A → (utilized by population 1) → Compound B → (utilized by population 2) → Compound C → (utilized by both populations 1 + 2) → Products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B, but cannot metabolize beyond compound B without the co-operation of population 2. Population 2 is unable to utilize compound A, but it can metabolize compound B, forming compound C. Populations 1 and 2 together can then carry out a metabolic sequence leading to an end product that neither population could produce alone.
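The two-population relay described above can be made concrete with a toy sketch, reducing each population to the single conversion step it can perform:

```python
# Toy model of syntrophism: population 1 can only take A -> B,
# population 2 only B -> C; the end product appears only when both act.
POP1 = {"A": "B"}          # conversion population 1 can catalyse
POP2 = {"B": "C"}          # conversion population 2 can catalyse

def ferment(substrate, *populations):
    """Apply each population's conversion step in turn, where possible."""
    for pop in populations:
        substrate = pop.get(substrate, substrate)
    return substrate

print(ferment("A", POP1))        # B  (population 1 alone stalls here)
print(ferment("A", POP2))        # A  (population 2 cannot touch A)
print(ferment("A", POP1, POP2))  # C  (co-operation yields the end product)
```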
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane production by methanogenic bacteria depends upon interspecies hydrogen transfer from fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 from carbohydrates; these are then utilized by methanogenic bacteria (e.g. Methanobacterium) to produce methane.
ii. Lactobacillus arabinosus and Enterococcus faecalis:
In minimal medium, Lactobacillus arabinosus and Enterococcus faecalis are able to grow together but not alone.
A synergistic relationship occurs in which E. faecalis requires folic acid, which is produced by L. arabinosus, while L. arabinosus requires phenylalanine, which is produced by E. faecalis.
2. COORDINATION INCHI
Goal
• Ideally
• All drawings of a chemical entity produce the same InChI/C
• One InChI/C can never match two drawings of different molecules
• Probably impossible, but can we get close enough to be useful?
3. COORDINATION INCHI
Deliverable
• Training set for inorganic compounds:
- real-world compounds (CSD, PubChem, misc)
- some drawn well, others drawn badly
• Prognosis for issues to expect:
a. current InChI works fine, or
b. new layer is required, or
c. intractable problems persist
• Use as a definitive pass/fail validation key
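The pass/fail validation idea can be sketched as follows; `to_inchi` here is a trivial stand-in canonicalizer (a sorted-character string), not a real InChI generator:

```python
# Sketch of the pass/fail validation key: every alternative drawing of a
# compound must map to the reference identifier. `to_inchi` is a toy
# stand-in, NOT real InChI code.
def to_inchi(drawing: str) -> str:
    return "".join(sorted(drawing.replace(" ", "")))

VALIDATION_KEY = {"ferrocene": to_inchi("Fe C5H5 C5H5")}

def validate(name: str, drawings: list) -> bool:
    """Pass only if all drawings collapse to the reference identifier."""
    return all(to_inchi(d) == VALIDATION_KEY[name] for d in drawings)

# two 'drawings' that differ only in layout should both pass:
print(validate("ferrocene", ["C5H5 Fe C5H5", "FeC5H5C5H5"]))  # True
```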
4. COORDINATION INCHI
Source Data
• Cambridge Structural Database:
- ≤ 500K inorganics that aren't polymers
- 2D coordinates, intelligent bonds, H-counts
- selected ~500 by diverse clustering
• PubChem:
- picked ~200 from large subset of garbage
- most had to be redrawn
• Miscellaneous:
- privately curated data ~500 compounds
- carefully drawn inorganic valences
9. COORDINATION INCHI
Rule 1
• If your representation does not imply the correct molecular formula, then you are wrong
• Most cheminformatics formats/editors/use patterns fail this test for nontrivial inorganics
10. COORDINATION INCHI
Rule 2
• In order of preference:
(a) correct valence for early main groups
(b) inferred electron delocalisation paths
(c) realistic bond orders & formal charges
(d) sensible oxidation states on metals
(e) symmetry
• Usually possible to satisfy all conditions, with frequent exception of symmetry
11. COORDINATION INCHI
Rule 3
• Non-trivial inorganics usually offer many
correct ways to draw
• Avoid overspecification
- more metadata can be added later
• Use only minimum information needed to:
- satisfy rule 1 (imply formula)
- optimise for rule 2
- resolve genuinely different molecules
14. COORDINATION INCHI
Algorithm: Implementation
• atom priority → [element, hcount, chg*]
• bond → <0, 0..1, 1, 1..2, 2, 2..3, 3+>
• iterate: atom priority → [a, ⇪{b1, a1}, {b2, a2}, …]
• if degenerate, bump lowest priority atom & repeat
• outcome: atom priority = walk order
• can now serialise in various different ways, e.g.
SMILES-esque, InChI-esque
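A toy version of the refine-until-stable loop these bullets describe, on a hand-built graph; the bond-class encoding and the tie-bumping step are simplified away:

```python
# Minimal sketch of iterative atom-priority refinement (Morgan-style):
# start from local atom invariants, then repeatedly extend each atom's
# key with its neighbours' (bond, rank) pairs until the ranking stops
# splitting. Tie-bumping for still-degenerate atoms is omitted.
def refine_ranks(atoms, bonds):
    """atoms: list of (element, h_count, charge) tuples, one per atom.
    bonds: dict mapping atom index -> list of (bond_order, neighbour_index).
    Returns a rank per atom; equal ranks mean still-symmetric atoms."""
    keys = list(atoms)
    ranks = rank_of(keys)
    while True:
        keys = [(ranks[i],
                 tuple(sorted((b, ranks[j]) for b, j in bonds.get(i, []))))
                for i in range(len(atoms))]
        new = rank_of(keys)
        if new == ranks:          # no further splitting: stable
            return ranks
        ranks = new

def rank_of(keys):
    order = sorted(set(keys))
    return [order.index(k) for k in keys]

# acetic acid CH3-C(=O)-OH as a toy graph: 0=CH3, 1=C, 2==O, 3=OH
atoms = [("C", 3, 0), ("C", 0, 0), ("O", 0, 0), ("O", 1, 0)]
bonds = {0: [(1, 1)], 1: [(1, 0), (2, 2), (1, 3)], 2: [(2, 1)], 3: [(1, 1)]}
print(refine_ranks(atoms, bonds))  # every atom gets a distinct rank
```

Once every atom has a distinct rank, the ranking fixes a walk order, which can then be serialised SMILES-esque or InChI-esque as the slide notes.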
15. COORDINATION INCHI
Algorithm: Outcome
• The algorithm's weakest link is detecting delocalisation islands
• The user's weakest link is implying correct hydrogen counts
• Remarkably tolerant to multiple ways of
drawing inorganic bonds
• Preliminary results are promising for
disambiguating inorganics correctly