This document discusses various aspects of molecular file formats used for data exchange of small molecules, biopolymers, and reactions. It addresses topics such as representation of hydrogens, valence models, benchmarking different file readers, and preserving labels in files. The author presents results from testing 24 different molecular file readers on their ability to correctly interpret an MDL benchmark file containing over 10,000 connection tables with varying elements, charges, and environments. While many readers can read MDL files, they do not always agree on semantics, though performance is generally better on subsets relevant to pharmaceutical applications. The benchmarking led to improvements in several open-source and commercial toolkits.
Automated Extraction of Reactions from the Patent Literaturedan2097
We have created a pipeline of recently enhanced open source components for extracting chemical reactions from full text chemical literature. OSCAR4 is used to recognise chemical entities and resolve to structures where appropriate. OPSIN is used to resolve systematic chemical names to structures. Chemical Tagger performs part of speech tagging allowing the interpretation of phrases in chemical syntheses. The final output is a semantic representation (chemical components and their roles, reaction conditions, actions including workup, yield and properties of the product). We then attempt to map all atoms in the product(s) to reactants. If successful we also attempt to calculate the stoichiometry of the reaction. The system has been deployed on over 56,000 USPTO patents published since 2008. The level of recall is useful and most extracted reactions make chemical sense. The pipeline is generally applicable to reactions in chemical literature including journals and theses.
CINF 51: Analyzing success rates of supposedly 'easy' reactionsNextMove Software
Chemists, like insects, come in a bewildering number of varieties and specializations. Traditional retrosynthesis tools are aimed at expert synthetic chemists to assist them with challenging total syntheses, or at process chemists searching for optimal routes via obscure reaction mechanisms. In this talk, we instead consider the role of computer software to support non-experts in synthetic chemistry, such as medicinal and computational chemists. Here the challenge is not in choosing the reaction, but instead preventing silly mistakes with the most widely applied classes of named reactions. Anecdotal experience with the content of pharmaceutical ELNs shows that low yield reactions often correlate with the presence of known incompatible functional groups, such as a second halide in Suzuki couplings.
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable "read-only" Electronic Lab Notebook.
In addition to reactions details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross concept querying (e.g. "GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction"). Reactions are classified and assigned to leafs in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.
Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.
Unlocking chemical information from tables and legacy articlesNextMove Software
Many tools for text-mining are designed to work with unstructured text. Here we present the results of our efforts to apply text-mining to the semi-structured content of tables.
We will cover the difficulties of coping with the various different ways that the contents of a table may be specified and the challenges of resolving references to elsewhere in the document. We report on the extraction of melting points, boiling points, NMR spectra and biological activity data from tabular data.
In collaboration with the Royal Society of Chemistry (RSC) we have also investigated the application of these tools to the RSC’s back archive, both in tables and free text. We cover the difficulties in adapting tools optimized for patents to journal articles and the difficulties in handling the older, less structured, text that dates from as far back as 1841.
The information extracted from this project, both from patents and the RSC’s back archive, will form a key contribution to the RSC’s public data repository. In all cases the evidence text for the extracted information is provided along with a link back to the document from which it was extracted, ensuring that the provenance of the information can be verified.
CINF 35: Structure searching for patent information: The need for speedNextMove Software
Chemical databases grow larger every year. Without investing in additional hardware or improved software, the time to search these databases will in turn grow longer annually. With an ever-increasing number of pharmaceutical patents, the amount of chemical data associated with these is growing at a rate with which hardware advances alone cannot keep up.
Using automated mining of U.S. and European patents, we have extracted large collections of structural data in the form of reactions, mixtures, and exemplified compounds. Additional information such as protein targets and diseases are also extracted from each patent and associated with the structural data. We will describe how this data can be queried with natural language phrases and how these phrases are interpreted as structural queries.
Through innovations in substructure and similarity search algorithms, it is possible to search and retrieve hundreds of millions of chemical records in fractions of a second. We will demonstrate how this is achieved on a regular desktop machine using just-in-time and ahead-of-time compilation techniques.
A benchmark of substructure searching tools given at the Cambridge Cheminformatics Network Meeting (May 27th). Slides have added annotated to aid description.
Automated Extraction of Reactions from the Patent Literaturedan2097
We have created a pipeline of recently enhanced open source components for extracting chemical reactions from full text chemical literature. OSCAR4 is used to recognise chemical entities and resolve to structures where appropriate. OPSIN is used to resolve systematic chemical names to structures. Chemical Tagger performs part of speech tagging allowing the interpretation of phrases in chemical syntheses. The final output is a semantic representation (chemical components and their roles, reaction conditions, actions including workup, yield and properties of the product). We then attempt to map all atoms in the product(s) to reactants. If successful we also attempt to calculate the stoichiometry of the reaction. The system has been deployed on over 56,000 USPTO patents published since 2008. The level of recall is useful and most extracted reactions make chemical sense. The pipeline is generally applicable to reactions in chemical literature including journals and theses.
CINF 51: Analyzing success rates of supposedly 'easy' reactionsNextMove Software
Chemists, like insects, come in a bewildering number of varieties and specializations. Traditional retrosynthesis tools are aimed at expert synthetic chemists to assist them with challenging total syntheses, or at process chemists searching for optimal routes via obscure reaction mechanisms. In this talk, we instead consider the role of computer software to support non-experts in synthetic chemistry, such as medicinal and computational chemists. Here the challenge is not in choosing the reaction, but instead preventing silly mistakes with the most widely applied classes of named reactions. Anecdotal experience with the content of pharmaceutical ELNs shows that low yield reactions often correlate with the presence of known incompatible functional groups, such as a second halide in Suzuki couplings.
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable "read-only" Electronic Lab Notebook.
In addition to reactions details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross concept querying (e.g. "GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction"). Reactions are classified and assigned to leafs in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.
Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.
Unlocking chemical information from tables and legacy articlesNextMove Software
Many tools for text-mining are designed to work with unstructured text. Here we present the results of our efforts to apply text-mining to the semi-structured content of tables.
We will cover the difficulties of coping with the various different ways that the contents of a table may be specified and the challenges of resolving references to elsewhere in the document. We report on the extraction of melting points, boiling points, NMR spectra and biological activity data from tabular data.
In collaboration with the Royal Society of Chemistry (RSC) we have also investigated the application of these tools to the RSC’s back archive, both in tables and free text. We cover the difficulties in adapting tools optimized for patents to journal articles and the difficulties in handling the older, less structured, text that dates from as far back as 1841.
The information extracted from this project, both from patents and the RSC’s back archive, will form a key contribution to the RSC’s public data repository. In all cases the evidence text for the extracted information is provided along with a link back to the document from which it was extracted, ensuring that the provenance of the information can be verified.
CINF 35: Structure searching for patent information: The need for speedNextMove Software
Chemical databases grow larger every year. Without investing in additional hardware or improved software, the time to search these databases will in turn grow longer annually. With an ever-increasing number of pharmaceutical patents, the amount of chemical data associated with these is growing at a rate with which hardware advances alone cannot keep up.
Using automated mining of U.S. and European patents, we have extracted large collections of structural data in the form of reactions, mixtures, and exemplified compounds. Additional information such as protein targets and diseases are also extracted from each patent and associated with the structural data. We will describe how this data can be queried with natural language phrases and how these phrases are interpreted as structural queries.
Through innovations in substructure and similarity search algorithms, it is possible to search and retrieve hundreds of millions of chemical records in fractions of a second. We will demonstrate how this is achieved on a regular desktop machine using just-in-time and ahead-of-time compilation techniques.
A benchmark of substructure searching tools given at the Cambridge Cheminformatics Network Meeting (May 27th). Slides have added annotated to aid description.
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...NextMove Software
Of the many chemical reactions performed by synthetic chemists in the pharmaceutical industry and academia, some are potentially more hazardous than others. Fortunately, best practices, compliance and education helps ensure that incidents are rare, but as highlighted by the recent explosion and building evacuation at two UK universities in March 2015, constant vigilance is necessary to ensure a safe work environment. The primary problem is not that chemical safety information, for example from MSDS/SDS data sheets, Bretherick's Handbook or the internet, is readily available, but that the volume of such information makes it difficult for an experimentalist to identify relevant risks in a timely manner.
In this talk, we describe our attempts to encode the Environmental Protection Agency's (EPA's) guidance entitled 'A Method for Determining Compatibility of Hazardous Waste', 1980, in an XML file format. Typical current state-of-the-art methods for alerting potential chemical safety hazards, for example in ELNs, simply annotate reactants with codes extracted from their MSDS/SDS data sheets, such as Global Harmonized System of Classification and Labelling of Chemicals (GHS) or EU R-phrases/S-phrases, leaving the chemist to manually assess whether any of the described incompatibilities is relevant. In this work, we use combinations of SMARTS patterns (for chemical classes) and InChIs (for specific molecules), to capture known reagent incompatibilities, that may be safe in isolation. Specific alerts describe documented incompatibilities between compounds (e.g. acetone and H2O2) while more general alerts can capture known or inferred incompatibilities between functional groups (e.g. ketones and peroxides). The encoding is hierarchical allowing only the most relevant warnings to be triggered. Alerts are encoded in a flexible XML format, facilitating extension and exchange. Advanced features of this XML format, include the ability to specify reactant quantities, and the use of predicted properties (such as of products) in rules.
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
Prediction is much harder than analysis. Consider hurricanes and tornadoes; it's much easier to follow the path of destruction by locating devastated neighborhoods, than to forecast the paths of such weather systems in advance. Likewise for many chemical reactions, such as nitration (by refluxing with nitric acid and sulfuric acid) where the appearance of one or more nitro groups indicates a nitration reaction, but predicting where on a non-trivial organic molecule this functional group appears is a much harder challenge. In this sense, reaction analysis is much simpler than (either forward or retrosynthetic) synthesis planning.
NextMove Software's namerxn is an expert system for classifying reactions (from reaction SMILES, MDL connection tables or ChemDraw sketches) typically assigning each reaction instance to a leaf classification in the Royal Society of Chemistry's RXNO ontology. These tools can be helpful in the analysis of regioselectivity preferences of reactions.
This talk consists of two parts. A technical part describing the recent algorithmic and methodological improvements to the namerxn software, including describing some of the more challenging of the 1000+ reactions it currently identifies. And a scientific part that investigates the regioselective preferences of some of these reactions.
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
The molfile serves as a de facto standard for chemical information exchange. It is perhaps the most widely supported format with its core syntax being easy to understand, parse, and generate. Beyond the core syntax, more advanced features such as sgroups and enhanced stereochemistry are rarely supported, often only being partially implemented and used. Additionally, several vendors, toolkits, and service providers have added extended syntax to their molfiles to solve particular corner cases or representation problems. This talk will provide a brief summary of the less widely supported features of the molfile including sgroups and enhanced stereochemistry. Additionally, a survey of how extensions can/have been implemented and which extensions exist "in the wild". Special attention will be paid to capturing of coordination bonds for extending the InChI to handle organometallics.
Automatic extraction of bioactivity data from patentsNextMove Software
Structure-Activity Relationship (SAR) analysis is important for the development of novel small molecule drugs. Such analyses rely on bioactivity data either from in-house or published data, with data from the latter currently being extracted manually at much expensive.
Here we report on an entirely automated system for extracting bioactivity data that we are developing, initially targeting US patents. The system relies on combining the results of many technologies: chemical entity recognition, chemical name to structure, table processing, chemical compound number resolution, chemical sketch interpretation, and even in some cases reconstitution of molecules from a generic core and R-group definitions. Where possible, the target and the assay description are also identified.
To assess the precision/recall of our system we compare our results with those manually extracted from US patents by BindingDB. We also compare the data we’ve extracted with the data present in ChEMBL from journal articles, to analyse whether there are significant differences between activity data in journal articles and patents e.g. differences in targets of interest.
The Synthetically Accessible Virtual Inventory (SAVI) project is an international collaboration between partners in government laboratories, small companies, not-for-profits, and large corporations, to computationally generate a very large number of reliably and inexpensively synthesizable novel screening sample structures. SAVI handles reactions not by virtue of applying simple SMIRKS to a set of building blocks of unknown availability. It instead combines a set of transforms richly annotated with chemical context, coming from, or being newly developed in the mold of, the original LHASA project knowledgebase, with a set of highly annotated, reliably available, purchasable starting materials. These components are tied together for SAVI product generation with the chemoinformatics toolkit CACTVS with custom developments for this project. Each product is annotated with a number of computed properties seen as important in current drug design, including rules for identifying potentially reactive or promiscuous compounds. After having produced and made publicly available the first (beta) set of 283 million SAVI products annotated with proposed one-step syntheses, we will be reporting on the second full production run aimed at creating a database of one billion high-quality, easily synthesizable screening samples. We will present the current status, ongoing developments, as well as scientific and technical challenges of the project.
Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACSSimBioSys_Inc
Underpinning the computer-aided synthesis design system, ARChem, are algorithms that extract synthetic knowledge from large reaction databases. The generation of reaction rules that facilitate retrosynthetic analysis, as well as the extraction of information about expected yields, regioselectivity, functional group compatibility, and stereo-chemistry are discussed in these slides.
ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.
Free online access to experimental and predicted chemical properties through ...Kamel Mansouri
The increasing number and size of public databases is facilitating the collection of chemical structures and associated experimental data for QSAR modeling. However, the performance of QSAR models is highly dependent not only on the modeling methodology, but also on the quality of the data used. In this study we developed robust QSAR models for endpoints of environmental interest with the aim of helping the regulatory process. We used the publicly available PHYSPROP database that includes a set of thirteen common physicochemical and environmental fate properties, including logP, melting point, Henry’s coefficient, and biodegradability among others. Curation and standardization workflows have been applied to use the highest quality data and generate QSAR-ready structures. The developed models are in agreement with the five OECD principles that requires QSARs to be simple and reliable. These models were applied to a set of ~700k chemicals to produce predictions for display on the EPA CompTox Chemistry Dashboard. In addition to the predictions, this free web and mobile application provides access to the experimental data used for training as well as detailed reports including general model performances, specific applicability domain and prediction accuracy, and the nearest neighboring structures used for prediction. The dashboard also provides access to model QMRFs (QSAR modeling report format) which is a downloadable pdf containing additional details about the modeling approaches, the data, and molecular descriptor interpretation.
A talk I gave on August 19, 2012 at the 2012 ACS 244th national meeting held in Philadelphia, PA. This talk was given as a part of a series of 8 minutes long "lightening talks" on a variety of topics in chemistry and cheminfomatics. This talk looked at errors and missing information from literature in the field of chemistry.
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...NextMove Software
Of the many chemical reactions performed by synthetic chemists in the pharmaceutical industry and academia, some are potentially more hazardous than others. Fortunately, best practices, compliance and education helps ensure that incidents are rare, but as highlighted by the recent explosion and building evacuation at two UK universities in March 2015, constant vigilance is necessary to ensure a safe work environment. The primary problem is not that chemical safety information, for example from MSDS/SDS data sheets, Bretherick's Handbook or the internet, is readily available, but that the volume of such information makes it difficult for an experimentalist to identify relevant risks in a timely manner.
In this talk, we describe our attempts to encode the Environmental Protection Agency's (EPA's) guidance entitled 'A Method for Determining Compatibility of Hazardous Waste', 1980, in an XML file format. Typical current state-of-the-art methods for alerting potential chemical safety hazards, for example in ELNs, simply annotate reactants with codes extracted from their MSDS/SDS data sheets, such as Global Harmonized System of Classification and Labelling of Chemicals (GHS) or EU R-phrases/S-phrases, leaving the chemist to manually assess whether any of the described incompatibilities is relevant. In this work, we use combinations of SMARTS patterns (for chemical classes) and InChIs (for specific molecules), to capture known reagent incompatibilities, that may be safe in isolation. Specific alerts describe documented incompatibilities between compounds (e.g. acetone and H2O2) while more general alerts can capture known or inferred incompatibilities between functional groups (e.g. ketones and peroxides). The encoding is hierarchical allowing only the most relevant warnings to be triggered. Alerts are encoded in a flexible XML format, facilitating extension and exchange. Advanced features of this XML format, include the ability to specify reactant quantities, and the use of predicted properties (such as of products) in rules.
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
Prediction is much harder than analysis. Consider hurricanes and tornadoes; it's much easier to follow the path of destruction by locating devastated neighborhoods, than to forecast the paths of such weather systems in advance. Likewise for many chemical reactions, such as nitration (by refluxing with nitric acid and sulfuric acid) where the appearance of one or more nitro groups indicates a nitration reaction, but predicting where on a non-trivial organic molecule this functional group appears is a much harder challenge. In this sense, reaction analysis is much simpler than (either forward or retrosynthetic) synthesis planning.
NextMove Software's namerxn is an expert system for classifying reactions (from reaction SMILES, MDL connection tables or ChemDraw sketches) typically assigning each reaction instance to a leaf classification in the Royal Society of Chemistry's RXNO ontology. These tools can be helpful in the analysis of regioselectivity preferences of reactions.
This talk consists of two parts. A technical part describing the recent algorithmic and methodological improvements to the namerxn software, including describing some of the more challenging of the 1000+ reactions it currently identifies. And a scientific part that investigates the regioselective preferences of some of these reactions.
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
The molfile serves as a de facto standard for chemical information exchange. It is perhaps the most widely supported format with its core syntax being easy to understand, parse, and generate. Beyond the core syntax, more advanced features such as sgroups and enhanced stereochemistry are rarely supported, often only being partially implemented and used. Additionally, several vendors, toolkits, and service providers have added extended syntax to their molfiles to solve particular corner cases or representation problems. This talk will provide a brief summary of the less widely supported features of the molfile including sgroups and enhanced stereochemistry. Additionally, a survey of how extensions can/have been implemented and which extensions exist "in the wild". Special attention will be paid to capturing of coordination bonds for extending the InChI to handle organometallics.
Automatic extraction of bioactivity data from patentsNextMove Software
Structure-Activity Relationship (SAR) analysis is important for the development of novel small molecule drugs. Such analyses rely on bioactivity data either from in-house or published data, with data from the latter currently being extracted manually at much expensive.
Here we report on an entirely automated system for extracting bioactivity data that we are developing, initially targeting US patents. The system relies on combining the results of many technologies: chemical entity recognition, chemical name to structure, table processing, chemical compound number resolution, chemical sketch interpretation, and even in some cases reconstitution of molecules from a generic core and R-group definitions. Where possible, the target and the assay description are also identified.
To assess the precision/recall of our system we compare our results with those manually extracted from US patents by BindingDB. We also compare the data we’ve extracted with the data present in ChEMBL from journal articles, to analyse whether there are significant differences between activity data in journal articles and patents e.g. differences in targets of interest.
The Synthetically Accessible Virtual Inventory (SAVI) project is an international collaboration between partners in government laboratories, small companies, not-for-profits, and large corporations, to computationally generate a very large number of reliably and inexpensively synthesizable novel screening sample structures. SAVI handles reactions not by virtue of applying simple SMIRKS to a set of building blocks of unknown availability. It instead combines a set of transforms richly annotated with chemical context, coming from, or being newly developed in the mold of, the original LHASA project knowledgebase, with a set of highly annotated, reliably available, purchasable starting materials. These components are tied together for SAVI product generation with the chemoinformatics toolkit CACTVS with custom developments for this project. Each product is annotated with a number of computed properties seen as important in current drug design, including rules for identifying potentially reactive or promiscuous compounds. After having produced and made publicly available the first (beta) set of 283 million SAVI products annotated with proposed one-step syntheses, we will be reporting on the second full production run aimed at creating a database of one billion high-quality, easily synthesizable screening samples. We will present the current status, ongoing developments, as well as scientific and technical challenges of the project.
Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACSSimBioSys_Inc
Underpinning the computer-aided synthesis design system, ARChem, are algorithms that extract synthetic knowledge from large reaction databases. The generation of reaction rules that facilitate retrosynthetic analysis, as well as the extraction of information about expected yields, regioselectivity, functional group compatibility, and stereo-chemistry are discussed in these slides.
ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.
Free online access to experimental and predicted chemical properties through ...Kamel Mansouri
The increasing number and size of public databases is facilitating the collection of chemical structures and associated experimental data for QSAR modeling. However, the performance of QSAR models is highly dependent not only on the modeling methodology, but also on the quality of the data used. In this study we developed robust QSAR models for endpoints of environmental interest with the aim of helping the regulatory process. We used the publicly available PHYSPROP database that includes a set of thirteen common physicochemical and environmental fate properties, including logP, melting point, Henry’s coefficient, and biodegradability among others. Curation and standardization workflows have been applied to use the highest quality data and generate QSAR-ready structures. The developed models are in agreement with the five OECD principles that requires QSARs to be simple and reliable. These models were applied to a set of ~700k chemicals to produce predictions for display on the EPA CompTox Chemistry Dashboard. In addition to the predictions, this free web and mobile application provides access to the experimental data used for training as well as detailed reports including general model performances, specific applicability domain and prediction accuracy, and the nearest neighboring structures used for prediction. The dashboard also provides access to model QMRFs (QSAR modeling report format) which is a downloadable pdf containing additional details about the modeling approaches, the data, and molecular descriptor interpretation.
A talk I gave on August 19, 2012 at the 2012 ACS 244th national meeting held in Philadelphia, PA. This talk was given as a part of a series of 8 minutes long "lightening talks" on a variety of topics in chemistry and cheminfomatics. This talk looked at errors and missing information from literature in the field of chemistry.
EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 ...ChemAxon
Without standards the world would be a place of utter chaos. With the ever increasing complexity of modern life, standards specifications ensure interoperability between the multitude of systems that society has grown to depend upon. In the world of cheminformatics, a recent trend has been the promotion of authority prescribed “de jure” standards (such as InChI and HELM), over the more traditional “de facto” standards (such as V2000 Mol files, SMILES strings). Voluntary “de facto” standards are selected by the community, reusing practical solutions that work well, and thereby become dominant, whilst less practical approaches are ignored, in a process akin to Darwinian selection. Obligatory “de jure” standards, however, are imposed on an industry rather than selected by it, often by bureaucrats and lawyers rather than experts. A recent example of this is ISO international standard 11238, entitled “Health informatics — Identification of Medicinal Products — Data elements and structures for the unique identification and exchange of regulated information on substances”. This standard covers file formats for exchanging chemical structures between government agencies including the US Food & Drug Administration (FDA). Amongst the implementation challenges required by this standard is the ability to handle MDL V2000 connection tables with normalized whitespace (i.e. stripped of carriage-returns, linefeeds and the multiple spaces used to align columns). To the author’s knowledge, no cheminformatics suite in the world could meet this requirement at the time the standard was ratified by several countries standards bodies in 2011. With luck, poor “de jure” standards will be ignored by legislators, rather than imposing unreasonable burdens on the communities they were designed to help.
Launching digital biology, 12 May 2015, Bremenbioflux
Intro. It is not a secret that in biology laboratories hours of manual work are considered a compulsory part of the experiment. During a day of work, lab researchers have to pipette the right amounts of fluids in tubes, carry them from one machine to another, program and handle each machine individually, label and document carefully each step and then convert the results to data and analyse it. For a simple routine experiment, each of the mentioned tasks is performed at least 10 times/day. Past decade, a big effort has been done to produce machines (e.g., pipetting robots) that would automate some of the tasks in the lab. However, these machines were developed under the industrial mindset to maximize the throughput of a single task. Thus, these machines are of large size, task-specific, difficult to use (they usually come with dedicated drivers and software) and most importantly, extremely expensive. A solution is the use of digital microfluidics to enable the advance from automated biology to digital biology. In my vision, a digital lab should be:
• fully integrated, running all the tasks on the same machine
• easy to use, with a web-based software for biological design of new experiments and hardware control
• general-purpose, allowing easy reconfiguration and design of new experiments
• cheap, offering open-source and do-it-yourself assembly kits
Talk. In the talk, I will present an overview of the road to digital biology, covering all the main aspects, from computer-aided design to hardware production and biological applications.
Hands on. Also, prepare for some real engineering action :). I will execute live a biochemical application (enzymatic reaction of β-galactosidase with Xgal) on my homemade digital biochip. We will then discuss the current challenges in the development process and everyone will get a chance to play with the device. And of course, I will happily consider any engineering advice or idea you have :).
In this deck from the 2017 MVAPICH User Group, Adam Moody from Lawrence Livermore National Laboratory presents: MVAPICH: How a Bunch of Buckeyes Crack Tough Nuts.
"High-performance computing is being applied to solve the world's most daunting problems, including researching climate change, studying fusion physics, and curing cancer. MPI is a key component in this work, and as such, the MVAPICH team plays a critical role in these efforts. In this talk, I will discuss recent science that MVAPICH has enabled and describe future research that is planned. I will detail how the MVAPICH team has responded to address past problems and list the requirements that future work will demand."
Watch the video: https://wp.me/p3RLHQ-hp6
Investigation of Metal and Chemical Hydrides for Hydrogen Storage in Novel Fu...chrisrobschu
DOE Funded Activities
Objectives:
•Use engineering analyses to screen H2 storage systems against DoD targets & requirements (FY15)
•Identify suitable hydrogen storage materials and suitable vehicle demonstration platforms
•Develop a preliminary design of an integrated UUV design with a solid hydrogen storage system
•Complete detailed design of the hydrogen storage system
•Complete integrated system design
ONR/NUWC Funded Activities
Objectives:
•
Design and build a small bench-scale, alane-based, hydrogen storage vessel
•
Perform preliminary testing on the bench-scale, storage system
•
Package and ship bench-scale vessel and alanematerial to the Navy NUWC
•
Provide technical support to Navy NUWC for their further testing and evaluation
Doe amr st134_motyka_2016_p
A presentation of the on-going work on interoperability within the toolchain. A new domain OSLC KM is introduced, some experiments for reusing models are also presented and, some videos are also used to present some user stories.
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
Presented by Roger Sayle at the 11th International Conference on Chemical Structures (ICCS) 2018.
Exponential database growth will always be a technical challenge but evolutionary strategies offer to hold off the inevitable in the short term and revolutionary strategies promise a longer-term solution.
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
The Cahn-Ingold-Prelog (CIP) priority rules have been the corner stone in written communication of stereo-chemical configuration for more than half a century. The rules rank ligands around a stereocentre allowing an atom order and layout invariant stereo-descriptor to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that beyond trivial cases, different software may assign different labels to the same structure diagram.
There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software that provide CIP labels and where they disagree. Providing an IUPAC verified free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy in exchange of written chemical information for all.
Chemical sketches are ubiquitous in the published literature. Unlike connection table formats that precisely capture chemistry for database entry, the primary purpose of a sketch format is to produce a high quality image for conveying information to other chemists. Chemical sketches can be presented in a variety of chemistry-specific formats as well as image formats, with the later presenting additional challenges to interpretation. Since 2001 the United States Patent Office has redrawn all chemical sketches in ChemDraw, yielding to date over 25 million freely available CDX files.
Correctly extracting chemistry from these files required tackling of many areas including disambiguation of ambiguous labels (e.g. B, D, P, V, Ac), interpreting labels (e.g. COOH), interpretation of free text overlaid on the structure (e.g. brackets for a repeated group) and assignment of reaction role.
We report our work on extracting chemical structures and reactions from sketches and demonstrate the improvements in quality that tackling the intricacies of sketches provides over more naïve approaches. One notable improvement is the ability to better distinguish between specific compounds, fragments, generic structures and reaction schemes. We compare the chemistry extracted from sketches with the results from text-mining, and show that a large amount of chemistry is only available from one medium or the other. We also explore cases where the combination of the output from sketches and text enables extraction of data that either method in isolation could not e.g. Markush structures, reactions where the product is given as a sketch.
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...NextMove Software
ACS National Meeting Boston Fall 2015
A Matched Molecular Series (MMS) is a set of molecules that differ by substitution at the same scaffold location [1]. For two molecules, this is equivalent to a Matched Molecular Pair.
We present a graphical interface for querying a database of bioactivity or physicochemical property data using a matched series. Using the database, predictions are made using the Matsy method [2] which suggests what R groups will improve the particular property value of interest.
An interesting aspect of our approach is that the interface treats the distinct R groups attached to a particular scaffold as first-class entities that can be manipulated and rearranged to see the effect on the predictions. This makes it easy, for example, to compare the predictions based simply on matched-pair information versus information from longer length series.
References:
[1] Wawer, M.; Bajorath, J. J. Med. Chem. 2011, 54, 2944.
[2] O’Boyle, N.M.; Boström, J.; Sayle, R.A.; Gill, A. J. Med. Chem. 2014, 57, 2704.
The Roman Empire A Historical Colossus.pdfkaushalkr1407
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
Embracing GenAI - A Strategic ImperativePeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Macroeconomics- Movie Location
This will be used as part of your Personal Professional Portfolio once graded.
Objective:
Prepare a presentation or a paper using research, basic comparative analysis, data organization and application of economic information. You will make an informed assessment of an economic climate outside of the United States to accomplish an entertainment industry objective.
Palestine last event orientationfvgnh .pptxRaedMohamed3
An EFL lesson about the current events in Palestine. It is intended to be for intermediate students who wish to increase their listening skills through a short lesson in power point.
Reading and Writing Molecular File Formats for Data Exchange of Small Molecules, Biopolymers and Reactions
1. Reading and Writing Molecular File
Formats for Data Exchange of Small
Molecules, Biopolymers and Reactions
Roger Sayle
NextMove Software, Cambridge, UK
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
2. connection table compression
• One distinguishing feature between computational
chemistry (COMP) and cheminformatics (CINF) is the
representation of hydrogens.
• Typically in organic chemistry, approximately half of
the bonds in a “full” connection table are the bonds
to terminal hydrogen atoms.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
4. valence models
• To make life interesting, developers of molecule file
formats that support implicit hydrogens (Daylight
and MDL) allow omission of the number of implicit
hydrogens on an atom, to save space, instead relying
on the deriving the number of implicit hydrogens for
the atom’s default valence in a given environment.
• Both formats can fully specify implicit hydrogen
count/valence, but alas not all readers/writers
support these conventions.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
5. example valence model
• As an example, Daylight’s valence model for SMILES
states that all aromatic nitrogens have no implicit
hydrogens by default.
• If a hydrogen is present, it should be written as
“[nH]”, and if it’s charged as “[nH+]”.
• The number of hydrogens never needs to be
guessed, and the molecular formula is precisely
specified.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
6. the peril of Valence models
A popular desktop sketching program is responsible for
the poor reaction yields of traditional alchemists.
Saving the above reaction as an MDL RXN file and
reloading it changes the hydrogen count due to bugs in
the software. In practice, such errors lose the
distinction between sodium metal and sodium hydride
(etc.) in pharmaceutical ELNs.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
7. the mdl/accelrys valence model
• The correct valence is specified by MDL/ISIS (and as
documented in Accelrys’ “Chemical Representation”)
• Neutral carbon should be four valent.
– Everyone agrees on this (hopefully).
• +1 Nitrogen cation should be four valent.
• +1 Carbon cation should be three valent.
• -4 charge Boron should a have valence 1.
– OEChem says 0, OpenBabel says 3, RDKit says 7...
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
8. the “mdlbench.sdf” benchmark
• To test the fidelity of MDL valence implementations
in available SD file readers, we evaluate the SMILES
generated from a standard reference test file.
• This SD file contains 10,208 connection tables:
– 114 different elements plus “D” and “T”
– 11 charge states (from -4 to +6 inclusive)
– 8 environments (minimum valences from 0 to 7)
– 116*11*8 = 10208
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
9. recall/precision and subsets
• Rather than assess programs on the 10K theoretically
possible atomic valences, a more realistic benchmark
is to consider the environments encountered in
practice (such as the 199 found in Accrelyrs’ MDDR
database, version 2011.2).
• An alternative restriction is to consider only the 10
“organic” elements (B,C,N,O,P,S,F,Cl,Br,I).
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
12. mdlbench.sdf summary
• Although many toolkits can read MDL SD files, they
don’t all perfectly agree on the semantics.
• Different readers, sometimes written by the same
company, can interpret the exact same MOL file as
different molecules.
• Fortunately, most compounds encountered in the
pharmaceutical industry fall into the widely
understood “well-behaved” subset.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
13. mdlbench.sdf postscript
• Following this experiment, feedback was passed back
to several vendors/developers and the benchmark
including the expected results were published online.
• Patches were submitted to RDKit and OpenBabel to
achieve 100% conformance on this benchmark.
• Many vendors, including ChemAxon, Dotmatics,
Optibrium and Xemistry, and open source projects,
including MayaChem and Balloon, have announced
improvements based on this work.
• As a result the community converges on a standard.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
14. part 2: further support
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
15. label preservation (MDL ALIAS)
Can be converted to a molfile as
NextMove07171309292D
5 4 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0
1.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.5000 -0.8660 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.5000 0.8660 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
2.5000 0.8660 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 2 0 0 0 0
2 4 1 0 0 0 0
4 5 1 0 0 0 0
A 1
Alexa555
A 5
LH-RH
M END
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
16. v3000 mol files
• Accelrys’ introduction of the version 3000 molfile
introduces a chicken and egg problem for the
cheminformatics community.
• Adoption of the format is hampered by the limited
support for v3000 molfiles amongst third-party tools,
but equally the v3000 files in circulation create
problems for developers.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
17. reactions as 1st class citizens
• Although all toolkits support small molecule
exchange, support for reactions is generally poorer.
• This situation is compounded by treating reactions as
different objects/types than regular molecules, and
sometimes introducing format distinctions based
upon the content of a file.
• Many formats, including MDL mol file (via CPSS), ISIS
Sketch, SMILES, ChemDraw CDX and CDXML merrily
encode both molecules and reactions.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
18. support for empty reactions
• An interesting class of informatics problems concerns
molecules with no atoms, and reactions with no
reactants and/or products.
• In SDF and RDF file formats, these records can have
associated data, but it is not uncommon for this to
be skipped/lost by many tools.
• A SMILES example might be “>> title”.
• It is not uncommon for chemists to draw nothing
after an arrow to indicate a reaction failure, or
nothing before one to indicate compound purchase.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
19. agents, solvents and catalysts
An extremely useful extension to the MDL RXN file
format introduced by ChemAxon is support for agents.
After the reactant count and product count fields,
allowing an agent count avoid information loss,
capturing items drawn above/below a reaction arrow.
CSc1ccccc1(F)>OO>CS(=O)c1ccc(cc1)F
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
20. sketch file support
• The community should tackle better handling of
sketch file formats, such as ISIS Sketch, CDX, CDXML,
Marvin and ACD/Labs sk2.
• The challenge is that these formats are intended to
be consumed by human beings not computers, but
they do capture details otherwise lost in translation.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
21. decrypting cdx & cDXML files
• CambridgeSoft are to be congratulated for publicly
documenting their CDX and CDXML file formats.
• Unfortunately this online ChemDraw developer
resource is no longer being kept up to date.
• Mistakes: “arrow” is encoded by object 0x8021.
• New tags: object 0x802b encodes “annotation”.
• Proprietary property tags: USPTO’s “PageDefinition”.
• Support for reading and writing isotopic information
in CDXML files has been contributed to Open Babel.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
23. chemdraw superatom correction
Chemist drew Chemist Intended Chemist got
DMF
K2CO3
HBTU
mCPBA
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
24. salt/component grouping
• Keeping track of the intended number and formulae
of reactants, products and agents in a reaction,
requires preserving salt form associations.
• This can be implemented by honoring the “group”
information from the sketch as single disconnected
components in MDL RXN and RD file output.
• These associations are traditionally lost in SMILES...
...>CC(=O)[O-].[Cl-].[Cl-].[Fe].[K+].[Pd+2]...>...
but can be retained via ChemAxon/GGA extensions
to SMILES that appear as “[Na+].[Cl-] |f:0.1| salt”.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
25. zero order bonds
• Clark[1] has proposed a useful extension to MDL file
format for representing bonds of order zero, “M
ZBO” which solve a number representation issues in
organometallics.
• In this presentation, we propose a novel extension to
SMILES to encode/preserve these: [K]..OC(=O)O..[K]
1. Alex M. Clark, “Accurate Specification of Molecular Structures: The Case for Zero-Order
Bonds and Explicit Hydrogen Counting”, J. Chem. Inf. Model. 51:3149-3157, 2011.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
26. biomolecules as MDL sgroups
• Peptides, nucleic acids and sugars can be annotated
in a number of ways in molfiles, typically as
abbreviated Sgroups that define leaving groups.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
PEPTIDE1{C.Y.I.Q.N.C.P.L.G.[am]}$PEPTIDE1,PEPTIDE1,1:R3-6:R3$$$
27. variable attachment points
• Reported oligosaccharide structures often contain
linkages whose exact attachment point is unknown.
• Glc(β1→?)Glc has four possible structures, including:
• Such structures can be round-tripped between IUPAC
line notations, SMILES and MDL v2000 and v3000
molfiles (sometimes using common extensions).
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
28. variable attachment points
• Conveniently, these formats use the same “dummy
atom” semantics to represent a set of atoms.
• SMILES does not support positional variation, but
ChemAxon use an extension: CCCl.Cl* |m:4:1.0|
• V3000 molfiles from both Accelrys Draw and Marvin
store variable attachments in the bond block
M V30 7 1 8 7 ENDPTS=(3 3 2 1) ATTACH=ANY
• V2000 molfiles from ChemAxon Marvin and ACDLabs
ChemSketch use their own incompatible extensions.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
29. acknowledgements
• Plamen Petrov, AstraZeneca, Molndal, Sweden.
• Sorel Muresan, AkzoNobel, Stenungsund, Sweden.
• Richard Hall, Astex Pharmaceuticals, Cambridge, UK.
• Markus Sitzmann, NCI, Washington DC, USA.
• Wolf-Dietrich Ihlenfeldt, Xemistry GmbH, Germany.
• Evan Bolton, PubChem, NCBI, Washington DC, USA.
• Daniel Lowe and Noel O’Boyle, NextMove Software,
Cambridge, UK.
• Thank you for your time. Questions?
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013