The document discusses extracting reaction information from pharmaceutical electronic laboratory notebooks (ELNs) and some of the challenges involved. It describes exporting data from the complex database schemas of ELNs, converting file formats like sketches to standard reaction formats, standardizing reactions, determining reaction identity when reactions are duplicated, and classifying reactions into an ontology. The goal is to make the synthetic knowledge contained in ELNs accessible for applications like yield prediction, retrosynthesis, and reducing costs through more efficient routes.
Automated Extraction of Reactions from the Patent Literature dan2097
We have created a pipeline of recently enhanced open source components for extracting chemical reactions from full text chemical literature. OSCAR4 is used to recognise chemical entities and resolve to structures where appropriate. OPSIN is used to resolve systematic chemical names to structures. Chemical Tagger performs part of speech tagging allowing the interpretation of phrases in chemical syntheses. The final output is a semantic representation (chemical components and their roles, reaction conditions, actions including workup, yield and properties of the product). We then attempt to map all atoms in the product(s) to reactants. If successful we also attempt to calculate the stoichiometry of the reaction. The system has been deployed on over 56,000 USPTO patents published since 2008. The level of recall is useful and most extracted reactions make chemical sense. The pipeline is generally applicable to reactions in chemical literature including journals and theses.
Reading and Writing Molecular File Formats for Data Exchange of Small Molecul... NextMove Software
This document discusses various aspects of molecular file formats used for data exchange of small molecules, biopolymers, and reactions. It addresses topics such as representation of hydrogens, valence models, benchmarking different file readers, and preserving labels in files. The author presents results from testing 24 different molecular file readers on their ability to correctly interpret an MDL benchmark file containing over 10,000 connection tables with varying elements, charges, and environments. While many readers can read MDL files, they do not always agree on semantics, though performance is generally better on subsets relevant to pharmaceutical applications. The benchmarking led to improvements in several open-source and commercial toolkits.
Standardized Representations of ELN Reactions for Categorization and Duplicat... NextMove Software
1) Electronic lab notebooks contain rich experimental data on synthetic methods but transforming this data into a standardized format suitable for analysis poses challenges.
2) Exact duplicates are common in reaction data sets, but true variations are relatively rare once reactions are normalized for structure, roles, and other factors.
3) Pivoting from individual experiments to de-duplicated, categorized reactions allows insights into reaction conditions and step sequences but requires addressing complications like molecule normalization, superatoms, and reaction roles.
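The pivot from individual experiments to de-duplicated reactions can be illustrated with a minimal sketch. It assumes each experiment is already available as a reaction SMILES string in the `reactants>agents>products` layout and that every component is already a canonical SMILES; real ELN data would first need the molecule normalization, superatom expansion, and role assignment mentioned above. The function names `reaction_key` and `group_duplicates` are hypothetical, not part of any NextMove tool.

```python
from collections import defaultdict


def reaction_key(rxn_smiles: str) -> str:
    """Build an order-insensitive key from a reaction SMILES string.

    Assumption: components are '.'-separated canonical SMILES, so sorting
    them within each role is enough to make drawing order irrelevant.
    """
    parts = rxn_smiles.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 'reactants>agents>products': {rxn_smiles!r}")
    return ">".join(
        ".".join(sorted(c for c in part.split(".") if c)) for part in parts
    )


def group_duplicates(experiments):
    """Map each reaction key to the list of experiment ids that share it."""
    groups = defaultdict(list)
    for exp_id, rxn_smiles in experiments:
        groups[reaction_key(rxn_smiles)].append(exp_id)
    return dict(groups)
```

Experiments that differ only in the order in which components were drawn fall under the same key, while a change of reactant (for example an acid chloride in place of an acid) produces a separate key.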
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases NextMove Software
We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable "read-only" Electronic Lab Notebook.
In addition to reaction details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross-concept querying (e.g. "GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction"). Reactions are classified and assigned to leaves in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.
Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.
A benchmark of substructure searching tools given at the Cambridge Cheminformatics Network Meeting (May 27th). Slides have been annotated to aid description.
CINF 35: Structure searching for patent information: The need for speed NextMove Software
Chemical databases grow larger every year. Without investing in additional hardware or improved software, the time to search these databases will in turn grow longer annually. With an ever-increasing number of pharmaceutical patents, the amount of chemical data associated with these is growing at a rate with which hardware advances alone cannot keep up.
Using automated mining of U.S. and European patents, we have extracted large collections of structural data in the form of reactions, mixtures, and exemplified compounds. Additional information such as protein targets and diseases is also extracted from each patent and associated with the structural data. We will describe how this data can be queried with natural language phrases and how these phrases are interpreted as structural queries.
Through innovations in substructure and similarity search algorithms, it is possible to search and retrieve hundreds of millions of chemical records in fractions of a second. We will demonstrate how this is achieved on a regular desktop machine using just-in-time and ahead-of-time compilation techniques.
CINF 51: Analyzing success rates of supposedly 'easy' reactions NextMove Software
The document discusses reaction prediction and retrosynthesis tools. It notes that analysis of reaction data from electronic lab notebooks (ELNs) at major pharmaceutical companies reveals surprisingly high rates of synthesis failure not reported in literature. Understanding the causes of these failures, like low predicted octanol-water partition coefficients correlating with lower success rates for parallel Suzuki coupling reactions, may be more important than developing new reaction predictions. The document advocates analyzing both successful and unsuccessful reaction data to improve predictive tools.
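One way to look for the kind of correlation described above is to bin experiments by a predicted property and compare success rates per bin. The sketch below is illustrative only: it assumes each experiment has been reduced to a hypothetical `(predicted_logp, succeeded)` pair, and neither the bin width nor the function name comes from the presentation.

```python
import math
from collections import defaultdict


def success_rate_by_bin(records, bin_width=1.0):
    """Return {bin lower edge: fraction of successful experiments}.

    records: iterable of (predicted_logp, succeeded) pairs, where
    succeeded is a bool. Both fields are assumptions for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # lower edge -> [successes, total]
    for logp, succeeded in records:
        lower = math.floor(logp / bin_width) * bin_width
        counts[lower][0] += int(succeeded)
        counts[lower][1] += 1
    return {lower: s / n for lower, (s, n) in sorted(counts.items())}
```

Keeping the failed experiments in `records` is the point of the exercise: a data set containing only successes would give a rate of 1.0 in every bin and hide any trend.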
Unlocking chemical information from tables and legacy articles NextMove Software
Many tools for text-mining are designed to work with unstructured text. Here we present the results of our efforts to apply text-mining to the semi-structured content of tables.
We will cover the difficulties of coping with the various ways in which the contents of a table may be specified and the challenges of resolving references to elsewhere in the document. We report on the extraction of melting points, boiling points, NMR spectra and biological activity data from tabular data.
In collaboration with the Royal Society of Chemistry (RSC) we have also investigated the application of these tools to the RSC’s back archive, both in tables and free text. We cover the difficulties in adapting tools optimized for patents to journal articles and the difficulties in handling the older, less structured, text that dates from as far back as 1841.
The information extracted from this project, both from patents and the RSC’s back archive, will form a key contribution to the RSC’s public data repository. In all cases the evidence text for the extracted information is provided along with a link back to the document from which it was extracted, ensuring that the provenance of the information can be verified.
Eugene Garfield: the father of chemical text mining and artificial intelligen... NextMove Software
Eugene Garfield wrote his 1961 PhD thesis on an algorithm for translating chemical names to molecular formulas, laying the foundations for chemical text mining and natural language processing in cheminformatics. Garfield developed a system using a dictionary of chemical morphemes and rules to decompose names and derive formulas. Modern name-to-structure software and chemical text mining tools have expanded on Garfield's work, but his insights into using molecular formulas for indexing rather than structural representations remain relevant. Garfield's 1961 thesis is recognized as pioneering the fields of chemical information science and text mining in chemistry.
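The idea of decomposing a name into dictionary morphemes and deriving a formula by rule can be shown with a toy sketch. This is not Garfield's algorithm: it assumes a tiny hand-written dictionary covering only unbranched C1 to C10 hydrocarbons and simple alcohols, and all names in the code are hypothetical.

```python
# Toy morpheme dictionary: chain-length stems and formula-producing suffixes.
STEMS = {"meth": 1, "eth": 2, "prop": 3, "but": 4, "pent": 5,
         "hex": 6, "hept": 7, "oct": 8, "non": 9, "dec": 10}
SUFFIXES = {
    "ane": lambda n: {"C": n, "H": 2 * n + 2},           # alkane
    "ene": lambda n: {"C": n, "H": 2 * n},               # one double bond
    "yne": lambda n: {"C": n, "H": 2 * n - 2},           # one triple bond
    "anol": lambda n: {"C": n, "H": 2 * n + 2, "O": 1},  # simple alcohol
}


def to_formula(counts):
    """Format element counts in Hill order (C, H, then alphabetical)."""
    order = [e for e in ("C", "H") if e in counts]
    order += sorted(e for e in counts if e not in ("C", "H"))
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


def name_to_formula(name):
    """Split a name into stem + suffix morphemes and derive its formula."""
    name = name.strip().lower()
    for stem, n in STEMS.items():
        suffix = name[len(stem):]
        if name.startswith(stem) and suffix in SUFFIXES:
            if suffix in ("ene", "yne") and n < 2:
                break  # a multiple bond needs at least two carbons
            return to_formula(SUFFIXES[suffix](n))
    raise ValueError(f"cannot decompose name: {name!r}")
```

Because the output is a molecular formula rather than a connection table, the sketch never has to decide where a substituent or multiple bond sits, which is part of why formula-level indexing is simpler than full name-to-structure conversion.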
CINF 170: Regioselectivity: An application of expert systems and ontologies t... NextMove Software
Prediction is much harder than analysis. Consider hurricanes and tornadoes; it's much easier to follow the path of destruction by locating devastated neighborhoods, than to forecast the paths of such weather systems in advance. Likewise for many chemical reactions, such as nitration (by refluxing with nitric acid and sulfuric acid) where the appearance of one or more nitro groups indicates a nitration reaction, but predicting where on a non-trivial organic molecule this functional group appears is a much harder challenge. In this sense, reaction analysis is much simpler than (either forward or retrosynthetic) synthesis planning.
NextMove Software's namerxn is an expert system for classifying reactions (from reaction SMILES, MDL connection tables or ChemDraw sketches) typically assigning each reaction instance to a leaf classification in the Royal Society of Chemistry's RXNO ontology. These tools can be helpful in the analysis of regioselectivity preferences of reactions.
This talk consists of two parts: a technical part describing the recent algorithmic and methodological improvements to the namerxn software, including some of the more challenging of the 1000+ reactions it currently identifies, and a scientific part that investigates the regioselective preferences of some of these reactions.
The document discusses improving chemical structure depictions in software. It describes lessons learned in developing better algorithms for layout, orientation, ring templates, and rendering. Key areas of focus are reducing overlaps, improving macrocycle depictions, and using standardized fonts and parameters for high quality publication-grade output. Comparisons of different cheminformatics toolkits on a test set of structures show RDKit generally performs well, while areas for further enhancement in CDK and other tools are discussed.
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an... NextMove Software
The document summarizes a presentation about using large graph databases for chemical similarity searching. It describes building a graph database of 68 billion molecular substructures from 340 million molecules and using graph edit distance to perform sublinear-scaling searches through the database to identify similar molecules. This approach scales better to large datasets than traditional fingerprint-based similarity methods.
Pharmaceutical industry best practices in lessons learned: ELN implementation... NextMove Software
This document summarizes an implementation of Merck's reaction review policy in another pharmaceutical company's electronic laboratory notebook (ELN). Key points include:
- The policy was implemented to reduce laboratory accidents by learning from past incidents.
- It involved adding new fields to the ELN like "reaction vessel size" to capture scale-dependent hazards.
- Algorithms were developed to categorize compounds by hazards and flag risky experiments based on criteria like reactive functional groups, physical properties, and reaction types.
- Future work aims to more accurately represent mixtures, predict compound properties, and integrate the system with other chemistry databases and ontologies.
Building on Sand: Standard InChIs on non-standard molfiles NextMove Software
This document discusses differences in how various cheminformatics toolkits interpret chemical structure file formats like MDL molfiles. It finds that toolkits often assign different interpretations to aspects like atomic mass, valence, stereochemistry, and representations of complex structures. Existing extensions to the formats, like ctab blocks, aim to address challenges in representing special cases but normalization remains difficult. Standardization efforts could help minimize unexpected behavior between tools.
The document summarizes the Chemical Validation and Standardization Platform (CVSP) used by Open PHACTS to validate and standardize chemical structure data from various sources. CVSP performs validation of chemical structures, generates standardized representations, and establishes parent-child relationships between structures. It has validated over 1.3 million records from ChEMBL and over 6,500 from DrugBank, identifying various issues. Standardized structures and relationships are exported in RDF/turtle format to integrate with the Open PHACTS semantic web platform.
Standardization and Generation of Parents for Open PHACTS Chemical Registry S... Ken Karapetyan
The document describes standards and processes for generating parent structures from chemical registry data. It outlines validation checks performed on chemical structures and issues assigned severity levels. Standardization procedures are described, including disconnecting certain atoms from metals, ionizing acids, dearomatizing structures, and removing chiral flags and stereocenters in some cases. Parent structures are generated by applying different modifications like making structures insensitive to isotopes, stereo, or tautomers. Primary compound keys may be standard InChI or absolute SMILES strings, but a non-standard InChI approach is proposed to better distinguish stereo and tautomers. Feedback is requested to improve the chemical structure standardization and parenting procedures.
This document describes a new fluorimetric method for detecting and quantifying siderophores using Calcein Blue dye. Siderophores are iron-chelating compounds released by bacteria under iron-deficient conditions and can be used as markers for bacterial detection. The method exploits the property that Calcein Blue fluorescence is quenched by iron but regained when a stronger chelator like a siderophore removes the iron. Standard strains, clinical isolates, and media compositions were tested. A standard curve using the siderophore desferal allowed quantification of siderophores down to 50 nM. This sensitive, simple fluorescence-based method provides a new tool for bacterial detection within 7-8 hours.
Preparation of pyrimido[4,5-b][1,6]naphthyridin-4(1H)-one derivatives elshimaa eid
This document describes the preparation of pyrimido[4,5-b][1,6]naphthyridin-4(1H)-one derivatives using a zeolite-nanogold catalyst. An efficient one-pot synthesis is developed involving the cyclocondensation of 6-amino-2-thioxo-2,3-dihydropyrimidin-4(1H)-one, aromatic aldehydes, and 1-benzylpiperidin-4-one in ethanol at 80°C. The nanogold catalyst is characterized and found to contain 4-6 nm gold nanoparticles dispersed on zeolite. Several derivatives are synthesized in good yields and characterized. Molecular dock
This document summarizes the synthesis and characterization of isomeric Yb(III) coordination polymers formed from reactions of YbX3 salts (X = NO3- or CF3SO3-) with 4,4'-bipyridine-N,N'-dioxide (4,4'-bpdo) or 3,3'-bipyridine-N,N'-dioxide (3,3'-bpdo) ligands. Depending on the choice of ligand and anion, either linear 1D chains or 2D and 3D coordination polymer networks were obtained. The 3,3'-bpdo ligand yielded structures similar to 4,4'-bpdo when NO3- was used, but with CF3
Collaboration for Innovation: Enriching the Knowledge Pool BIOVIA
The document discusses controlling the morphology of polymer membranes through precipitation processes. It describes how slowly adding a nonsolvent to a polymer solution causes an exchange between solvent and nonsolvent, resulting in a precipitation point where the polymer changes from a sol to gel. This precipitation point defines the morphology of the formed membrane. The study examines using mixed solvents and different PVDF polymer types to control membrane performances by altering the precipitation point during formation. The goal is to develop porous polymer membranes with optimized properties for filtration and separation applications.
This tutorial document provides instructions for preparing protein and ligand files for use with the Autodock and Autodock Tools molecular docking programs. It describes how to set up access to the programs, prepare PDB files of the protein and ligand by removing unwanted elements and adding hydrogens and charges. It also explains how to select rotatable bonds in the ligand and specify flexible residues in the protein to allow for induced fit docking.
Exploring SAR between Patents and PubChem Chris Southan
Using Chemicalize.org with other open resources, Christopher Southan demonstrates how to extract structure-activity relationship (SAR) information from patents and explore intersections in PubChem. Southan shows how to name-to-structure convert selected structures, display analogue series similarity, and bulk upload extracted structures to PubChem for further analysis. While Chemicalize enables small-scale patent and literature mining, challenges include discerning relevant sections and locating drug-relevant structures. Southan provides examples mining a DPPIV inhibitor patent, intersecting results with other sources, and "walking" to related patents. With synergies between Chemicalize and other open resources, academic drug discovery can be advanced.
One-pot synthesis of Cu(II) 2,2′-bipyridyl complexes of 5-hydroxy-hydurilic acid rkkoiri
This document describes the one-pot synthesis of two new copper(II) complexes containing the ligands 5-hydroxy-hydurilic acid (complex 1) and alloxanic acid (complex 2) from the reaction of a barbiturate derivative (LH4) with Cu(II) 2,2'-bipyridyl complexes. It also reports the synthesis of a third complex (complex 3) from the reaction of LH4 with copper nitrate that retains the ligand framework. The complexes were characterized using X-ray crystallography, spectroscopy, and electrochemistry. Complexes 1 and 3 were found to cleave DNA and showed cytotoxic activity against cancer cells, while complex 2 was insoluble and not
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
Eugene Garfield wrote his 1961 PhD thesis on an algorithm for translating chemical names to molecular formulas, laying the foundations for chemical text mining and natural language processing in cheminformatics. Garfield developed a system using a dictionary of chemical morphemes and rules to decompose names and derive formulas. Modern name-to-structure software and chemical text mining tools have expanded on Garfield's work, but his insights into using molecular formulas for indexing rather than structural representations remain relevant. Garfield's 1961 thesis is recognized as pioneering the fields of chemical information science and text mining in chemistry.
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
Prediction is much harder than analysis. Consider hurricanes and tornadoes; it's much easier to follow the path of destruction by locating devastated neighborhoods, than to forecast the paths of such weather systems in advance. Likewise for many chemical reactions, such as nitration (by refluxing with nitric acid and sulfuric acid) where the appearance of one or more nitro groups indicates a nitration reaction, but predicting where on a non-trivial organic molecule this functional group appears is a much harder challenge. In this sense, reaction analysis is much simpler than (either forward or retrosynthetic) synthesis planning.
NextMove Software's namerxn is an expert system for classifying reactions (from reaction SMILES, MDL connection tables or ChemDraw sketches) typically assigning each reaction instance to a leaf classification in the Royal Society of Chemistry's RXNO ontology. These tools can be helpful in the analysis of regioselectivity preferences of reactions.
This talk consists of two parts. A technical part describing the recent algorithmic and methodological improvements to the namerxn software, including describing some of the more challenging of the 1000+ reactions it currently identifies. And a scientific part that investigates the regioselective preferences of some of these reactions.
The document discusses improving chemical structure depictions in software. It describes lessons learned in developing better algorithms for layout, orientation, ring templates, and rendering. Key areas of focus are reducing overlaps, improving macrocycle depictions, and using standardized fonts and parameters for high quality publication-grade output. Comparisons of different cheminformatics toolkits on a test set of structures show RDKit generally performs well, while areas for further enhancement in CDK and other tools are discussed.
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
The document summarizes a presentation about using large graph databases for chemical similarity searching. It describes building a graph database of 68 billion molecular substructures from 340 million molecules and using graph edit distance to perform sublinear-scaling searches through the database to identify similar molecules. This approach scales better to large datasets than traditional fingerprint-based similarity methods.
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
This document summarizes an implementation of Merck's reaction review policy in another pharmaceutical company's electronic laboratory notebook (ELN). Key points include:
- The policy was implemented to reduce laboratory accidents by learning from past incidents.
- It involved adding new fields to the ELN like "reaction vessel size" to capture scale-dependent hazards.
- Algorithms were developed to categorize compounds by hazards and flag risky experiments based on criteria like reactive functional groups, physical properties, and reaction types.
- Future work aims to more accurately represent mixtures, predict compound properties, and integrate the system with other chemistry databases and ontologies.
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
This document discusses differences in how various cheminformatics toolkits interpret chemical structure file formats like MDL molfiles. It finds that toolkits often assign different interpretations to aspects like atomic mass, valence, stereochemistry, and representations of complex structures. Existing extensions to the formats, like ctab blocks, aim to address challenges in representing special cases but normalization remains difficult. Standardization efforts could help minimize unexpected behavior between tools.
The document summarizes the Chemical Validation and Standardization Platform (CVSP) used by Open PHACTS to validate and standardize chemical structure data from various sources. CVSP performs validation of chemical structures, generates standardized representations, and establishes parent-child relationships between structures. It has validated over 1.3 million records from ChEMBL and over 6,500 from DrugBank, identifying various issues. Standardized structures and relationships are exported in RDF/turtle format to integrate with the Open PHACTS semantic web platform.
Standardization and Generation of Parents for Open PHACTS Chemical Registry S...Ken Karapetyan
The document describes standards and processes for generating parent structures from chemical registry data. It outlines validation checks performed on chemical structures and issues assigned severity levels. Standardization procedures are described, including disconnecting certain atoms from metals, ionizing acids, dearomatizing structures, and removing chiral flags and stereocenters in some cases. Parent structures are generated by applying different modifications like making structures insensitive to isotopes, stereo, or tautomers. Primary compound keys may be standard InChI or absolute SMILES strings, but a non-standard InChI approach is proposed to better distinguish stereo and tautomers. Feedback is requested to improve the chemical structure standardization and parenting procedures.
This document describes a new fluorimetric method for detecting and quantifying siderophores using Calcein Blue dye. Siderophores are iron-chelating compounds released by bacteria under iron-deficient conditions and can be used as markers for bacterial detection. The method exploits the property that Calcein Blue fluorescence is quenched by iron but regained when a stronger chelator like a siderophore removes the iron. Standard strains, clinical isolates, and media compositions were tested. A standard curve using the siderophore desferal allowed quantification of siderophores down to 50 nM. This sensitive, simple fluorescence-based method provides a new tool for bacterial detection within 7-8 hours.
Preparation of pyrimido[4,5-b][1,6]naphthyridin-4(1H)-one derivatives elshimaa eid
This document describes the preparation of pyrimido[4,5-b][1,6]naphthyridin-4(1H)-one derivatives using a zeolite-nanogold catalyst. An efficient one-pot synthesis is developed involving the cyclocondensation of 6-amino-2-thioxo-2,3-dihydropyrimidin-4(1H)-one, aromatic aldehydes, and 1-benzylpiperidin-4-one in ethanol at 80°C. The nanogold catalyst is characterized and found to contain 4-6 nm gold nanoparticles dispersed on zeolite. Several derivatives are synthesized in good yields and characterized. Molecular dock
This document summarizes the synthesis and characterization of isomeric Yb(III) coordination polymers formed from reactions of YbX3 salts (X = NO3- or CF3SO3-) with 4,4'-bipyridine-N,N'-dioxide (4,4'-bpdo) or 3,3'-bipyridine-N,N'-dioxide (3,3'-bpdo) ligands. Depending on the choice of ligand and anion, either linear 1D chains or 2D and 3D coordination polymer networks were obtained. The 3,3'-bpdo ligand yielded structures similar to 4,4'-bpdo when NO3- was used, but with CF3
Collaboration for Innovation: Enriching the Knowledge Pool BIOVIA
The document discusses controlling the morphology of polymer membranes through precipitation processes. It describes how slowly adding a nonsolvent to a polymer solution causes an exchange between solvent and nonsolvent, resulting in a precipitation point where the polymer changes from a sol to gel. This precipitation point defines the morphology of the formed membrane. The study examines using mixed solvents and different PVDF polymer types to control membrane performances by altering the precipitation point during formation. The goal is to develop porous polymer membranes with optimized properties for filtration and separation applications.
This tutorial document provides instructions for preparing protein and ligand files for use with the Autodock and Autodock Tools molecular docking programs. It describes how to set up access to the programs, prepare PDB files of the protein and ligand by removing unwanted elements and adding hydrogens and charges. It also explains how to select rotatable bonds in the ligand and specify flexible residues in the protein to allow for induced fit docking.
Exploring SAR between Patents and PubChem Chris Southan
Using Chemicalize.org with other open resources, Christopher Southan demonstrates how to extract structure-activity relationship (SAR) information from patents and explore intersections in PubChem. Southan shows how to name-to-structure convert selected structures, display analogue series similarity, and bulk upload extracted structures to PubChem for further analysis. While Chemicalize enables small-scale patent and literature mining, challenges include discerning relevant sections and locating drug-relevant structures. Southan provides examples mining a DPPIV inhibitor patent, intersecting results with other sources, and "walking" to related patents. With synergies between Chemicalize and other open resources, academic drug discovery can be advanced.
One pot synthesis of Cu(II) 2,2′-bipyridyl complexes of 5-hydroxy-hydurilic acid rkkoiri
This document describes the one-pot synthesis of two new copper(II) complexes containing the ligands 5-hydroxy-hydurilic acid (complex 1) and alloxanic acid (complex 2) from the reaction of a barbiturate derivative (LH4) with Cu(II) 2,2'-bipyridyl complexes. It also reports the synthesis of a third complex (complex 3) from the reaction of LH4 with copper nitrate that retains the ligand framework. The complexes were characterized using X-ray crystallography, spectroscopy, and electrochemistry. Complexes 1 and 3 were found to cleave DNA and showed cytotoxic activity against cancer cells, while complex 2 was insoluble and not
The document discusses using field programmable gate arrays (FPGAs) to efficiently simulate protein folding through Monte Carlo simulations. FPGAs offer high programmability and efficiency compared to CPUs and GPUs for this compute-intensive biological application. FPGAs are identified as the best candidate platform due to their flexibility, performance, and lower power consumption and cost compared to application-specific integrated circuits.
This document provides step-by-step instructions for setting up a self-hosted WordPress blog in less than 5 minutes. It begins by explaining the benefits of a self-hosted blog compared to free blogging platforms. It then outlines the four steps to set up a blog: 1) select a domain name, 2) buy the domain and hosting account, 3) install WordPress using cPanel, and 4) access the WordPress dashboard. Specific recommendations are provided for buying hosting from Hostgator and using "Blogenator" as the coupon code for 25% off. Instructions for installing WordPress via Fantastico Deluxe in cPanel are also included.
This document discusses implementing cooperative learning in ESL classrooms. It addresses:
1) Benefits of cooperative learning like increased student talk and motivation.
2) Approaches for mixed-ability classrooms like tiered tasks that appropriately challenge all students.
3) Answers frequently asked questions about cooperative learning logistics like how to form groups, manage noise levels, and keep groups engaged.
This document provides a template for writing an informal letter in English. It includes common phrases used for greetings, asking about the recipient's news, giving one's own news, making apologies, invitations, requests, thanks, congratulations, good luck wishes, suggestions, and sign-offs. Sample sentences are provided for each section to demonstrate how to structure different parts of an informal letter.
The document discusses habitual behavior in the present and past tenses. For the present, the simple present tense is used to describe habitual actions or permanent situations, often with frequency adverbs. For the past, the past simple or "used to" are used to refer to regular past actions or habits, while "would" refers to past habits but not situations. Various alternatives to express habitual behavior are also outlined such as the present continuous, adjectives, "tend to", and "will".
Anatomy of a Methodology: To introduce structure into a structure. Please do let me know if you want a copy. I have a write up that is more detailed. You may write to me at saumya _ ganguly (at) gmail (dot) com
Hamed Hashemian is writing to apply for a posted position. He includes his relevant experience working for various employers over the past four years, as well as his qualifications which include a diploma and degree from Barcelona University. He expresses his availability for an interview and looks forward to hearing from the recipient. Hamed closes by signing off with his name.
The Congressional Budget Office presented its Budget and Economic Outlook for 2015 to 2025. The CBO director Douglas Elmendorf discussed projections that showed growing budget deficits, a rising federal debt level, increasing spending primarily on healthcare programs, and steady economic growth over the next decade. Key inputs in the CBO's economic projections included slower potential labor force growth and steady potential productivity and inflation.
Having several co-owners in a business often complicates its development and can sometimes even destroy it. Aligning their positions is critically important for the growth of such companies, especially when the owners also serve as top managers.
The document provides guidance on writing a formal letter in English. It outlines the typical sections included in a formal letter such as the opening which usually states the reason for writing, asking questions politely, referring to previous correspondence, or complaining about an issue. The letter should close by thanking the recipient for their time and indicating they can be contacted for additional information. Formal letters use polite vocabulary, complex sentence structure, and punctuation like semi-colons. In contrast, informal letters employ casual language, simpler sentences, and punctuation like exclamation points.
Peripheral blood stem cell transplantation (PBSCT) involves collecting stem cells from a patient's bloodstream and later infusing them back into the patient after chemotherapy or radiation therapy. PBSCT has replaced bone marrow as the most common stem cell transplantation procedure. Stem cells are collected from the bloodstream using growth factors alone or with chemotherapy, and the minimum number needed for a safe transplant is 2 million CD34+ cells per kilogram of body weight. PBSCT results in faster recovery time compared to bone marrow transplants due to higher numbers of stem cells and T cells collected.
This document contains a collection of useful phrases for speaking in English, organized into categories such as home/family, studies/work, holidays/travel, hobbies/sports, music/going out, and special occasions. It also includes phrases for discussing and analyzing photographs, stating opinions, and agreeing or disagreeing with other speakers. The document serves as a reference for English speakers who need helpful expressions in different conversational contexts.
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum... DataStax
Leveraging your operational data for advanced and predictive analytics enables deeper insights and greater value for cloud applications. DSE Analytics is a complete platform for Operational Analytics, including data ingestion, stream processing, batch analysis, and machine learning.
In this talk we will provide an overview of DSE Analytics as it applies to data science tools and techniques, and demonstrate these via real world use cases and examples.
Brian Hess
Rob Murphy
Rocco Varela
About the Speakers
Brian Hess, Senior Product Manager, Analytics, DataStax
Brian has been in the analytics space for over 15 years ranging from government to data mining applied research to analytics in enterprise data warehousing and NoSQL engines, in roles ranging from Cryptologic Mathematician to Director of Advanced Analytics to Senior Product Manager. In all these roles he has pushed data analytics and processing to massive scales in order to solve problems that were previously unsolvable.
The document discusses naming in software development and argues that adding "Exception" as a suffix to exception class names is often redundant and uninformative. It provides examples of exception class names from Java that are more descriptive without the suffix, such as "IntegerDivisionByZero" instead of "ArithmeticException". The document concludes that exception class names should describe the actual problem rather than using generic terms or adding unnecessary suffixes like "Exception".
This document discusses several approaches for embedding knowledge bases and relations into continuous vector spaces using neural networks. It first describes earlier models like semantic embedding and semantic matching energy which used single hidden layers. It then explains more complex models like neural tensor networks that use tensors to model relations. The document also discusses applications of these embeddings for tasks like link prediction, question answering, and knowledge base expansion. It provides details on model formulations, scoring functions, training objectives, and datasets used for evaluation.
Efficient Searching and Similarity of Unmapped Reactions: Application to ELN ... NextMove Software
The document discusses challenges in analyzing reaction data from electronic laboratory notebooks (ELNs) to better understand chemical reactions. It outlines approaches to standardize reaction representation, define reaction identity, improve reaction depiction and searching, calculate reaction similarity, and classify reactions. The goal is to enable medicinal chemists to make more effective use of reaction data in ELNs to improve drug discovery processes.
1. The document discusses cheminformatics topics including substructure searching using SMARTS, calculation of topological polar surface area, quantitative structure-activity relationships (QSAR) modeling, and Lipinski's rule of five.
2. It provides an example of using SMARTS to search for functional groups known to cause toxicological problems.
3. QSAR modeling involves using molecular descriptors and regression techniques to develop mathematical models that correlate molecular structure to properties or biological activities.
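As a rough illustration of the rule-of-five filter mentioned above, the sketch below counts violations from descriptors that are assumed to have been computed already by a cheminformatics toolkit. The `Descriptors` container, the function names, and the "at most one violation" cutoff are assumptions made for this example, not something taken from the summarized document; the aspirin values are approximate.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors (hypothetical container for this sketch)."""
    mol_weight: float       # molecular weight in Da
    logp: float             # calculated octanol/water logP
    h_bond_donors: int      # number of N-H and O-H groups
    h_bond_acceptors: int   # number of N and O atoms


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed convention: tolerate at most one violation."""
    return lipinski_violations(d) <= max_violations


# Approximate values for aspirin, used only as a sanity check.
aspirin = Descriptors(mol_weight=180.16, logp=1.2,
                      h_bond_donors=1, h_bond_acceptors=4)
# A made-up large, lipophilic compound that breaks three criteria.
greasy = Descriptors(mol_weight=812.0, logp=7.3,
                     h_bond_donors=2, h_bond_acceptors=12)
```

In practice the descriptor values would come from a toolkit rather than being typed in by hand; only the thresholds themselves are fixed by the rule.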
Open-source tools for querying and organizing large reaction databases Greg Landrum
Gregory Landrum presented on open-source tools for querying and organizing large reaction databases using the RDKit. He discussed public reaction data sources extracted from patents, handling reactions with RDKit, using fingerprints to analyze reactions, and applying machine learning and clustering to reaction fingerprints to validate their ability to distinguish reactions and group similar ones together. He also explored analyzing functional group changes between reactants and products of reactions.
nano catalysis as a prospectus of green chemistry Ankit Grover
Nanocatalysis and green chemistry prospects.
Nanocatalysts have higher activity, selectivity, and efficiency than traditional catalysts due to their high surface area to volume ratio. They can be designed for sustainability by having properties like recyclability, durability, and cost-effectiveness. Examples discussed include gold nanoparticle catalysts for oxidation reactions and magnetically separable nanoparticle catalysts. Nanocatalyst applications highlighted are water splitting for hydrogen production and storage, and fuel cells.
Thermodynamics for medicinal chemistry design Peter Kenny
This document discusses concepts in medicinal chemistry design and thermodynamics. It begins by outlining some challenges in drug discovery, such as targeting weakly linked disease targets and predicting toxicity. It then discusses molecular design approaches, including controlling compound properties and sampling chemical space. Key concepts discussed include target engagement potential, property-based design to find an optimal "sweet spot", and using thermodynamics and molecular interactions to analyze activity and properties. The document questions the use of rules and guidelines in medicinal chemistry and advocates analyzing data to understand actual trends rather than assuming functional forms. It also discusses issues with ligand efficiency metrics and advocates using residuals to quantify activity compared to observed trends in the data.
The document discusses chemical kinetics modeling of combustion reactions. It describes the need for detailed chemical kinetic models to understand fuel oxidation, pollutant formation, and combustion chemistry phenomena. Complex fuel mixtures require the use of reduced model fuels and reaction mechanisms. Detailed mechanisms are generated using generic reaction classes and can involve hundreds to thousands of species and reactions. Mechanism reduction techniques like lumping and skeletal reduction are used to reduce mechanism size and computational cost for modeling applications.
Molecular design: One step back and two paths forward Peter Kenny
I presented this at the RACI Biomolecular on the Beach conference in December 2011. A correlation inflation teaser followed by alkane/water logP and SAR/SPR based on relationships between structures. The photograph in the title slide was taken in Asunción.
This document discusses the automation of computational chemistry calculations and protocols to reliably generate molecular property data. It addresses validating computational methods, analyzing results for errors and outliers, and comparing output to experimental data. The goal is to provide high-quality "experimental" data through automated high-throughput computation while ensuring valid results and identifying unusual computations. Workflows, parsing tools, and dissemination methods are presented for managing large numbers of jobs and analyzing results.
This document summarizes research conducted by the Craig group on the synthesis of tetrahydropyridine substrates for C-H activation. The group has synthesized various 1,4-bis(arylsulfonyl)-1,2,3,4-tetrahydropyridines and studied their reactivity via C-H activation and cross-coupling reactions. The author aims to apply this methodology to aziridine-based substrates by synthesizing 1-(arylsulfonyl)-1,2,3,4-tetrahydropyridines and investigating their potential for selective functionalization via C-H bond activation. The document outlines the synthetic route towards these tetrahydropyridine substrates involving ring-opening of
Some new directions for pharmaceutical molecular design Peter Kenny
I used this talk on visits to International Medical University (Kuala Lumpur), Nanyang Technological University (Singapore) and Novartis Institute for Tropical Diseases (Singapore)
The document provides instructions for completing Test #9 for the CHE 299: Material and Energy Balances course. It specifies that the test must be completed alone without any assistance. It also provides naming conventions for saving and submitting the test file. The test contains 6 multiple choice or calculation questions covering various chemistry topics like phase changes, exothermic reactions, conversion reactions, and enthalpy. Full work and steps must be shown for calculations.
This document discusses various types of stimuli-responsive hydrogels, with a focus on ultrashort peptide hydrogels. It provides background on the history and properties of hydrogels. It describes different types of hydrogels including those based on self-assembled peptide structures like alpha-helices, beta-sheets, and coiled coils. The document discusses potential biomedical applications of hydrogels for tissue engineering and drug delivery. It outlines the methodology and results of synthesizing various ultrashort peptide samples and testing their ability to form hydrogels in response to pH and metal ions. Samples that formed hydrogels followed a design of a six residue peptide with decreasing hydrophobicity and an alanine-
Aspects of pharmaceutical molecular design (Belgrade version) Peter Kenny
Peter W Kenny discusses various aspects of pharmaceutical molecular design. Some key challenges in drug discovery include exploiting weakly linked disease targets, predicting toxicity, and measuring free drug concentrations in vivo. Molecular design aims to control compound behavior through molecular properties. Both hypothesis-driven and prediction-driven approaches are used, along with sampling chemical space. Target engagement potential is proposed as a basis for design, with objectives of high target binding, low anti-target binding, and control of free drug concentrations. Property-based design searches for an optimal "sweet spot" of affinity and ADMET properties. Hypothesis-driven design uses structure-activity relationships as an efficient framework, while prediction-driven design builds predictive models.
Design of fragment screening libraries (Feb 2010 version) Peter Kenny
I have lectured on design of fragment screening libraries a number of times and, to be honest, my material is getting a bit dated. This presentation is from Feb 2010 when I was visiting CSIRO and the photo in the title slide was taken in Tierra del Fuego.
This document provides an introduction to chemical reaction engineering (CRE). CRE studies the rates and mechanisms of chemical reactions and the design of reactors. It is important because it involves converting raw materials into products through both physical and chemical treatment steps. Reactions take place in reactors, so CRE focuses on designing and controlling reactions. Reactor design considers economics, kinetics, heat and mass transfer, and other factors. Reactions are classified as homogeneous or heterogeneous. The rate of a reaction is the rate at which a chemical species loses its identity per unit volume and can be expressed as the rate of formation or disappearance of that species.
Molecular design: How to and how not to? Peter Kenny
This document discusses various topics related to molecular design and quantitative structure-activity relationships (QSAR). It notes some challenges in drug discovery like targeting poorly linked disease targets. It also discusses hypothesis-driven versus prediction-driven molecular design and challenges in predicting toxicity. Various methods for analyzing correlations in structure-activity data are described, including issues like data binning inflating correlations. The document advocates analyzing continuous data as continuous and considering relationships between molecular structures rather than just descriptors. It also discusses limitations of commonly used partitioning systems like octanol/water and highlights alternative approaches.
This document summarizes research characterizing a chlorite dismutase (Cld) enzyme from Klebsiella pneumoniae. The enzyme, KpCld, is part of a subfamily of dimeric Clds found in non-perchlorate respiring bacteria. While it shares structural similarities in its active site with efficient O2-producing Clds, it exhibits limited turnover due to degradation of its heme cofactor. Experiments show KpCld can generate O2 from chlorite, and a K. pneumoniae mutant lacking Cld has reduced growth in the presence of chlorate under nitrate-respiring conditions, suggesting KpCld functions to detoxify endogenously produced chlorite. The
Aspects of pharmaceutical molecular design (Fidelta version) Peter Kenny
This document discusses various aspects of pharmaceutical molecular design. It touches on three key points:
1) Pharmaceutical molecular design aims to control compound behavior through manipulation of molecular properties in a hypothesis-driven or prediction-driven manner.
2) Hypothesis-driven design frameworks help efficiently assemble structure-activity relationships to better understand molecules and ask insightful questions.
3) Prediction-driven design assumes predictive models can be built with sufficient accuracy, though issues like non-uniform sampling of chemical space and overfitting remain challenges.
Micellar Effect On Dephosphorylation Of Bis-4-Chloro-3,5-Dimethylphenylphosph... IOSR Journals
The rate enhancement depends on the hydrophobicity of the nucleophile. The micelle-catalyzed reaction between bis-4-chloro-3,5-dimethylphenylphosphate ester (bis-4-CDMPP) and hydroxide or hydroperoxide anions was examined in buffered medium (pH 8-10). The first-order rate constant (Kψ) for the reaction of hydroxide ion with bis-4-CDMPP goes through a maximum with increasing concentration of cetyltrimethylammonium bromide (CTABr). Micelles of CTABr are very effective catalysts for the reactions of phosphate diesters. Rate constants measured with HO2- ions are approximately two to three times those measured with OH- ions in the presence of CTABr.
Similar to Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions from Pharmaceutical ELNs (20)
The document summarizes a presentation given at the 6th RDKit User Group Meeting in Cambridge, UK in September 2018. It discusses DeepSMILES, a modification of the SMILES notation for chemical structures that is designed for use in machine learning applications. DeepSMILES was presented by Noel O'Boyle and Andrew Dalke. The document provides examples of DeepSMILES notation for ring closures, branches, and installation instructions.
Building a bridge between human-readable and machine-readable representations... NextMove Software
This document discusses approaches for representing biopolymers such as peptides in both human-readable and machine-readable formats. It describes names that are preferred by machines, such as SMILES and IUPAC condensed representations, as well as names preferred by humans using the standard 3-letter amino acid codes and other conventions. The document then discusses developing a naming system that is suitable for both humans and machines by text-mining peptide representations from literature to identify commonly used abbreviations, codes, and modifications. The goal is to develop a systematic naming approach that captures this real-world peptide terminology while also enabling computational interpretation and processing.
A de facto standard or a free-for-all? A benchmark for reading SMILES NextMove Software
The document discusses a benchmark for evaluating how accurately different cheminformatics toolkits can read SMILES strings representing stereochemistry, implicit hydrogens, and aromatic systems. The author tested 15 toolkits on test cases examining cis-trans stereochemistry, implicit valence as defined by the SMILES specification, and their ability to consistently interpret the aromatic nature and hydrogen counts of molecules represented by SMILES strings. While stereochemistry is generally handled well, adherence to the SMILES valence model and consistent aromatic interpretation vary more between tools. The benchmark aims to identify such differences and facilitate improvements to interoperability.
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution NextMove Software
The document discusses recent advances in chemical and biological search systems, focusing on the competition between traditional linear methods and newer sublinear methods. It presents several optimizations for substructure and fingerprint similarity searching, including loading the entire database into memory, using a compact binary representation, hardware population counting, sorting fingerprints by population count, reciprocal multiplication, and just-in-time compilation to generate specialized machine code for queries. Benchmarks show these methods achieving speeds over 200 million compounds screened per second on a single GPU.
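Two of the optimizations listed above, population counting and sorting fingerprints by population count, can be sketched in a few lines. The example below is an illustrative pure-Python version, not the implementation benchmarked in the talk: fingerprints are assumed to be plain Python integers used as bit sets, and the names `build_index` and `search` are invented for this sketch. It relies on the standard bound that the Tanimoto coefficient cannot exceed the ratio of the smaller to the larger popcount, so only a contiguous slice of the popcount-sorted index needs to be scored.

```python
from bisect import bisect_left, bisect_right


def popcount(x: int) -> int:
    """Number of set bits (Python 3.10+ also offers int.bit_count())."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit-set fingerprints."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def build_index(fingerprints):
    """Sort fingerprints by popcount, remembering their original positions."""
    entries = sorted((popcount(fp), i, fp) for i, fp in enumerate(fingerprints))
    counts = [count for count, _, _ in entries]
    return counts, entries


def search(index, query: int, threshold: float):
    """Return (original_index, score) pairs with score >= threshold."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    counts, entries = index
    qc = popcount(query)
    # Tanimoto <= min(|A|,|B|) / max(|A|,|B|), so a hit must have a popcount
    # between threshold*qc and qc/threshold. The integer bounds are rounded
    # outwards to stay safe against floating-point error.
    lo = int(threshold * qc)
    hi = int(qc / threshold) + 1
    start = bisect_left(counts, lo)
    end = bisect_right(counts, hi)
    hits = []
    for _, i, fp in entries[start:end]:
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((i, score))
    return hits


library = [0b1111, 0b1110, 0b0001, 0b11111111, 0b1011]
index = build_index(library)
hits = search(index, 0b1111, 0.7)
```

The higher the threshold, the narrower the popcount window, so fewer fingerprints are scored at all; the exact Tanimoto check afterwards keeps the result identical to a brute-force scan.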
The document summarizes a study that compared implementations of the Cahn-Ingold-Prelog rules for assigning stereochemistry descriptors to molecules by different chemical software programs. Fourteen test structures were used that covered all the CIP sequence rules. The results showed some inconsistencies between programs in assigning descriptors. The authors have initiated a collaboration between software developers to refine and standardize the CIP rules and improve consistency between programs.
The document discusses improvements to the MaxMinPicker algorithm in the RDKit for selecting diverse subsets of compounds from large datasets. It describes the MaxMin concept of repeatedly selecting the compound furthest from those already picked in order to optimize diversity. The key improvements discussed are avoiding distance matrices, preserving distance bounds between iterations, and tracking candidates in linked lists, which together reduce run time on large datasets from days to hours.
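The MaxMin idea itself can be sketched without a precomputed distance matrix by caching, for every candidate, the distance to its nearest already-picked item. The code below is a simplified illustration under that assumption; it is not the RDKit implementation and omits its bound-based pruning and linked-list bookkeeping. The function name `maxmin_pick` and its parameters are hypothetical.

```python
def maxmin_pick(items, distance, n_pick, first=0):
    """Greedy MaxMin selection.

    items    -- sequence of objects to choose from
    distance -- callable returning a non-negative distance between two items
    n_pick   -- number of items to select
    first    -- index of the seed item (assumed to be chosen by the caller)

    Returns the list of picked indices in selection order.
    """
    n = len(items)
    if n_pick > n:
        raise ValueError("cannot pick more items than are available")
    if n_pick <= 0:
        return []

    picked = [first]
    picked_set = {first}
    # min_dist[i]: distance from item i to its nearest picked item so far.
    min_dist = [distance(items[i], items[first]) for i in range(n)]

    while len(picked) < n_pick:
        # Pick the unpicked item whose nearest picked neighbour is furthest.
        candidates = (i for i in range(n) if i not in picked_set)
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        picked_set.add(nxt)
        # Only distances to the newly picked item need to be computed.
        for i in range(n):
            if i not in picked_set:
                d = distance(items[i], items[nxt])
                if d < min_dist[i]:
                    min_dist[i] = d
    return picked


points = [0.0, 1.0, 2.0, 10.0, 11.0]
selection = maxmin_pick(points, lambda a, b: abs(a - b), 3)
```

This version needs memory linear in the number of items and one distance evaluation per unpicked item per pick, which is the property that makes matrix-free picking practical for large sets.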
The document discusses several issues related to digital chemical representations and InChI. It provides a brief history of chemical notation systems and discusses some limitations of InChI, including representing tautomers, polymers, and neutral component duplication. The document also notes ongoing work to address stereochemistry issues and support for experimental features like polymer representation. However, many challenges remain for InChI to fully represent the complexity of real-world chemicals.
Challenges and successes in machine interpretation of Markush descriptions NextMove Software
The document summarizes challenges and successes in machine interpretation of Markush descriptions, which are chemical structures that allow for variability. Three main approaches to tackling Markush structures are discussed: 1) Simplifying the problem by extracting individual compounds from Markush examples, 2) Directly encoding Markush sketches and definitions, and 3) Performing generic structural searches using natural language queries rather than graph representations. The results of each approach are presented along with comparisons to previous efforts in automatically interpreting Markush structures.
This document discusses PubChem's potential as a database for biologics like peptides and oligosaccharides. It notes that PubChem contains over 500,000 peptides and 80,000 oligosaccharides, more than are in specialized databases. The document explores analyzing the structural and sequence diversity of peptides and oligosaccharides in PubChem. It finds thousands of amino acids, hundreds of thousands of peptides, and variants formed by modification or sequence changes. Representing structures as sequences allows clustering related structures and identifying disulfide-bridged peptides. While intended as a small molecule database, PubChem represents an untapped resource for biologics data.
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o... NextMove Software
The Cahn-Ingold-Prelog (CIP) priority rules have been the cornerstone of written communication of stereochemical configuration for more than half a century. The rules rank ligands around a stereocentre, allowing a stereodescriptor that is invariant to atom order and layout to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that, beyond trivial cases, different software may assign different labels to the same structure diagram.
There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software packages that provide CIP labels, showing where they disagree. Providing an IUPAC-verified free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy of written chemical information exchange for all.
Advanced grammars for state-of-the-art named entity recognition (NER) NextMove Software
The document describes advanced grammars for named entity recognition using LeadMine text mining software. It discusses how LeadMine uses CaffeineFix technology to specify dictionaries and regular expressions to match entities. It provides examples of entity types recognized by LeadMine such as chemicals, proteins, diseases, and reactions. It also describes how LeadMine generates plural forms and normalizes entities. The document outlines some complex grammars used, including grammars for chemicals, numbers, dates, and more.
This document discusses several challenges in standardizing chemical information exchange. It summarizes issues with SMILES and molfile standardization, inconsistencies in InChI and HELM implementations, ambiguity in peptide and tautomer notations, and challenges representing polymers, inorganics, and chemical hazards. The document argues that developing open standards for SMARTS, reactions, and physical properties would help address these challenges. However, fully representing complex chemical concepts like tautomers, biomolecules, and mixtures remains at the cutting edge of cheminformatics.
Automatic extraction of bioactivity data from patentsNextMove Software
Structure-Activity Relationship (SAR) analysis is important for the development of novel small molecule drugs. Such analyses rely on bioactivity data either from in-house or published data, with data from the latter currently being extracted manually at great expense.
Here we report on an entirely automated system for extracting bioactivity data that we are developing, initially targeting US patents. The system relies on combining the results of many technologies: chemical entity recognition, chemical name to structure, table processing, chemical compound number resolution, chemical sketch interpretation, and even in some cases reconstitution of molecules from a generic core and R-group definitions. Where possible, the target and the assay description are also identified.
To assess the precision/recall of our system we compare our results with those manually extracted from US patents by BindingDB. We also compare the data we’ve extracted with the data present in ChEMBL from journal articles, to analyse whether there are significant differences between activity data in journal articles and patents e.g. differences in targets of interest.
The document summarizes six not-so-easy pieces related to cheminformatics and the RDKit toolkit: 1) tautomer matching without enumeration, 2) adding support for nucleic acids, 3) revisions to inorganic structures, 4) comments on reaction rule sets, 5) InChI's new polymer support, and 6) standardizing the interpretation of SMIRKS patterns. The talk discusses challenges and recent advances in each area made within RDKit.
Roger Sayle gave a presentation on chemical structure representation in PubChem at the 252nd ACS National Meeting. Some key points:
1) PubChem distinguishes between deposited structures (substances) and normalized structures (compounds), retaining both to provide a unique and invaluable feature in its architecture. It contains over 209 million substances and nearly 92 million compounds.
2) Determining molecular identity can be challenging due to alternate representations, protonation states, tautomerism, and errors. PubChem utilizes standardization services and algorithms to normalize structures.
3) PubChem has implemented innovations like distinguishing substances and compounds, developing canonical SMILES representations, and normalizing tautomers and resonance forms to scale
GHS and NFPA diamonds: where they come from and how they can be usefulNextMove Software
The document discusses GHS and NFPA hazard diamonds and how they can be useful for analyzing reactions in electronic lab notebooks. It describes challenges in identifying hazard types and severity from chemical structures alone. Solutions discussed include only alerting on the most hazardous GHS categories 1 and NFPA 4, and deriving hazard categories from physical properties when data is unavailable for over 99% of chemicals. Case studies explore how hazard categories are determined and examples of interpreting chemical properties to identify explosion and reactivity hazards.
Line notations for nucleic acids (both natural and therapeutic)NextMove Software
This document summarizes an overview presentation on efforts to update and extend the 1970 IUPAC/IUBMB recommendations on nucleic acid notations. It discusses representational challenges for different levels of nucleic acids from bioinformatics to natural RNA variants to nucleic acid therapeutics. Examples are provided of proposed syntax for non-standard features that closely resembles other databases. There is ongoing work to further expand representations as the chemical space of nucleic acid therapeutics continues to grow.
The document discusses techniques for interpreting chemical sketches found in documents to make the embedded chemistry searchable. It describes challenges in interpreting sketches, such as ambiguous symbols and representations of attachment points. The presentation evaluates an approach to extracting structures from chemical reaction sketches, substituents, and tables of variable compounds found in patents. Over 600,000 unique structures were extracted from US patent applications, many not found through other text or structure mining methods. Limitations in interpreting more complex sketches are also outlined.
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions from Pharmaceutical ELNs
1. Extraction, Analysis, Atom Mapping,
Classification and Naming of
Reactions from Pharmaceutical ELNs
Roger Sayle, Daniel Lowe, Noel O’Boyle
NextMove Software, Cambridge, UK
Michael Kappler, Hoffmann-La Roche, Nutley, NJ, USA
Nick Tomkinson, AstraZeneca, Alderley Park, UK
6th Joint Sheffield Conference on Chemoinformatics, Tuesday 23rd July 2013
2. Introduction
• Pharmaceutical ELNs contain a wealth of
synthetic chemistry knowledge, particularly on
failed reactions, often not described in the
literature or available from public sources.
• This presentation describes several of the
technical informatics challenges encountered
during the process of exploiting reaction
information from ELNs.
3. Motivation #1: economics
• The primary motivation for reaction informatics is
reducing cost of goods, through higher yields, fewer
failed reactions and more direct synthetic routes.
Yuta Fujiwara et al., “Practical and innate carbon-hydrogen functionalization of heterocycles”,
Nature Vol. 492, pp. 95-99, 6th December 2012.
4. Motivation #2: yield prediction
• Nadin et al.[1] hypothesize that low LogP is a major
cause of synthesis failure in parallel synthesis of
combinatorial libraries.
• GSK data set from Pickett et al.[2] of 2500 Suzuki
couplings in a 50x50 library of MMP-12 inhibitors.
– 1704 compounds measured, mean logP = 3.56 (1.44)
– 566 compounds not made, mean logP = 2.83 (1.52)
– Student’s t-test for different distributions, p < 2×10^-22.
1. Nadin, Hattotuwagama and Churcher, “Lead-Oriented Synthesis: A New Opportunity for
Synthetic Chemistry”, Angew. Chem. Int. Ed. 51:1114, 2012.
2. Pickett et al., ACS Med. Chem. Lett. 2(1):28, 2011.
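The significance quoted on this slide can be checked from the summary statistics alone. The following is a minimal sketch, assuming the bracketed values are standard deviations and that a pooled-variance (Student's) test was used; the function name is hypothetical, and the p-value uses a normal approximation to the t distribution, which is reasonable for samples of this size.

```python
import math


def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Student's two-sample t statistic (pooled variance) from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_err, df


# Values from the slide; reading "(1.44)" and "(1.52)" as standard
# deviations is an assumption.
t_stat, dof = pooled_t_statistic(3.56, 1.44, 1704, 2.83, 1.52, 566)

# Two-sided p-value via the normal approximation (valid for large df).
p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
print(f"t = {t_stat:.2f}, df = {dof}, p ~ {p_value:.1e}")
```

Under these assumptions the approximate p-value falls well below the 2×10^-22 bound quoted above.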
6. Motivation #3: virtual libraries
• Statistics of ELN reactions may be used to redefine
RECAP-style rules for retrosynthesis [1,2].
• For example, reduction of nitro groups to amines is
one of the top 5 most common ELN reactions, but
nitration reactions are amongst the rarest.
• This implies that nitro-containing compounds are
purchased as reagents rather than made in-house.
1. Xiao Qing Lewell, Duncan B. Judd, Stephen P. Watson and Michael M. Hann, “RECAP –
Retrosynthetic Combinatorial Analysis Procedure”, JCICS 38(3):511-522, 1998.
2. Vainio, Kogej and Raubacher, “Automated Recycling of Chemistry for Virtual Screening and
Library Design”, J. Chem. Inf. Model. 52(7):1777-1786, 2012.
7. The challenges
1. Export of the data from the ELN.
2. High fidelity conversion to other file formats.
3. Reaction normalization/standardization.
4. Reaction identity (canonicalization).
5. Reaction naming and classification.
9. Database export
• Typically, ELNs are implemented as complex schemas
within relational databases (Oracle), supporting
transactions, auditing and security privileges.
• Not uncommonly the vendor provided functionality
or APIs for data export are slow and/or buggy.
• In addition to reactions and structures, there is often
a requirement to export all associated data, including
textual and numeric data, tables, even LCMS and
NMR spectra.
11. File format conversion
• The source format for many reactions is typically a
sketch, in either CDX, CDXML, ISIS Sketch or Marvin
file format.
• For data processing, reactions are much easier to
handle as reaction SMILES, MDL RXN or RD files and
possibly even variants of MOL and SD file formats.
• Alas, reaction file formats are generally poorly
handled by many cheminformatics tools.
• Additionally, reaction file formats can rarely encode
all of the same information.
12. Decrypting CDX & CDXML files
• CambridgeSoft are to be congratulated for publicly
documenting their CDX and CDXML file formats.
• Unfortunately this online ChemDraw developer
resource is no longer being kept up to date.
• New tags: object 0x802b encodes “annotation”.
• Mistakes: “arrow” is encoded by object 0x8021.
• Proprietary property tags: USPTO’s “PageDefinition”.
• Support for reading and writing isotopic information
in CDXML files has been contributed to Open Babel.
13. Agents and catalysts
How to express reaction role, i.e. molecules drawn
above and below the reaction arrow.
CSc1ccccc1(F)>OO>CS(=O)c1ccc(cc1)F
Although standard MDL RXN files only capture reactants and
products, a useful ChemAxon extension is to add a third count
for agents. Downstream tools can optionally remove or
reclassify these agents to produce strictly compliant MDL files.
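The three-field layout of the reaction SMILES above (reactants>agents>products) can be handled with plain string operations. This is a minimal sketch rather than a full parser: the names `Reaction`, `parse_reaction_smiles` and `drop_agents` are hypothetical, and it assumes no extensions are appended after the SMILES.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    reactants: List[str]
    agents: List[str]
    products: List[str]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Split 'reactants>agents>products' into lists of component SMILES."""
    fields = rxn.strip().split(">")
    if len(fields) != 3:
        raise ValueError(f"expected exactly two '>' separators: {rxn!r}")
    return Reaction(*([c for c in field.split(".") if c] for field in fields))


def drop_agents(rxn: str) -> str:
    """Remove the agent field, as a downstream tool might for strict output."""
    parsed = parse_reaction_smiles(rxn)
    return ".".join(parsed.reactants) + ">>" + ".".join(parsed.products)


example = "CSc1ccccc1(F)>OO>CS(=O)c1ccc(cc1)F"
print(parse_reaction_smiles(example).agents)  # the peroxide drawn over the arrow
print(drop_agents(example))
```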
14. Valence issues
A tricky problem is working around the different MDL
valence interpretations used by cheminformatics tools.
For example, it is not uncommon for the alchemical reaction
[Pb]>>[Au] to become reinterpreted as [PbH2]>>[Au] after
writing and re-reading from MDL formats. Such errors lose the
distinction between sodium metal and sodium hydride, resulting
in incorrect molecular formulae, or fail to distinguish metals
from radicals, causing problems for substructure searching.
15. Rich text format
• Exporting formatted text from electronic lab
notebooks, such as experimental/preparation write-ups,
requires converting Microsoft Rich Text Format (RTF).
• This can be translated into HTML or ASCII, and then
long lines wrapped for inclusion in SD/RD data fields.
• Special code is required for handling non-U.S.
character sets, and whilst Western European,
Russian, Chinese and Japanese were expected,
finding Arabic, Thai and Vietnamese text in major
pharmaceutical ELNs came as a surprise.
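The wrapping step can be sketched with the standard library once the RTF has been reduced to plain text. The function name and the default width below are assumptions for illustration; the one firm constraint used is that a blank line terminates a data item in SD files, so empty lines in the write-up must be dropped.

```python
import textwrap


def sd_data_field(name: str, text: str, width: int = 78) -> str:
    """Format plain text as an SD-file data item with wrapped lines."""
    lines = []
    for paragraph in text.splitlines():
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # a blank line would end the data item prematurely
        lines.extend(textwrap.wrap(paragraph, width=width))
    # Header line, wrapped data lines, then the terminating blank line.
    return "\n".join([f">  <{name}>"] + lines + [""]) + "\n"


write_up = "The mixture was stirred overnight.\n\n" + "Extracted with EtOAc. " * 6
print(sd_data_field("PREPARATION", write_up, width=40))
```

Python strings are Unicode, so non-U.S. text survives the wrapping; choosing an output encoding that the downstream SD/RD reader accepts is a separate decision.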
16. Salt/component grouping
• Keeping track of the intended number and formulae
of reactants, products and agents in a reaction,
requires preserving salt form associations.
• This is implemented by honoring the “group”
information from the sketch as single disconnected
components in MDL RXN and RD file output.
• These associations are traditionally lost in SMILES...
...>CC(=O)[O-].[Cl-].[Cl-].[Fe].[K+].[Pd+2]...>...
but can be retained via ChemAxon/GGA extensions.
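One way such an extension can carry the grouping is the ChemAxon extended SMILES fragment field, `|f:...|`, where dots join fragment indices within a group and commas separate groups. The sketch below regroups the agent components from the slide; the particular grouping shown (potassium acetate, a palladium chloride unit, iron left alone) is an assumption for illustration, and the function name is hypothetical. In a full reaction SMILES the indices count fragments across the whole string, which this sketch sidesteps by using a plain dot-separated list.

```python
import re
from typing import List


def group_components(cxsmiles: str) -> List[str]:
    """Re-associate dot-separated components using a '|f:...|' extension."""
    smiles, _, extension = cxsmiles.partition(" ")
    components = smiles.split(".")
    groups: List[List[int]] = []
    match = re.search(r"f:([0-9.,]+)", extension)
    if match:
        groups = [[int(i) for i in g.split(".")] for g in match.group(1).split(",")]

    result, emitted = [], set()
    for idx in range(len(components)):
        if idx in emitted:
            continue
        # Ungrouped components form a group of their own.
        group = next((g for g in groups if idx in g), [idx])
        result.append(".".join(components[i] for i in group))
        emitted.update(group)
    return result


# The grouping indices here are hypothetical, not taken from the slide.
agents = "CC(=O)[O-].[Cl-].[Cl-].[Fe].[K+].[Pd+2] |f:0.4,1.2.5|"
print(group_components(agents))
```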
17. Three types of superatom/label
1. Recognized superatoms are expanded to their
explicit all atom representations.
2. Unrecognized superatoms/labels that are bonded to
a molecule are encoded as dummy asterisk atoms
where the text is preserved as an MDL atom alias.
– Support for writing MDL aliases contributed to RDKit.
3. Disconnected unrecognized labels are preserved as
supplementary data fields in SD and RD file formats.
21. Reaction standardization
Chemists may draw the reaction components in
multiple ways (e.g. tautomers and protonation states)
22. Reaction standardization
• An often overlooked aspect of ELNs is the need to
enforce “business rules” to consistently represent a
reaction, in the same way that normalized molecules
are stored in registration systems.
• Pharmaceutical ELNs contain structures where nitros
are represented arbitrarily, and even cases where
azide representations differ on each side of an arrow.
• Unfortunately, the rules used for molecules (such as
InChI) may be inappropriate for reactions, where
metal co-ordination and radicals play a major role.
23. Reaction identity
ELNs frequently contain repeated (duplicate) reactions.
We need to define when two reactions are the same.
24. Reaction identity
• When translating from the experiment-centric view
of an ELN to the reaction-centric view of a reaction
database one asks when are two reactions the same.
• A pragmatic/operational definition might be that two
experiments with identical sets of reactants and
products, but differing quantities, conditions,
catalysts and solvents are variations of each other.
• Whether a component is a reactant, catalyst, solvent
or reagent may be consistently defined by atom-
mapping; reactants contribute atoms to the product.
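The operational definition above maps naturally onto a dictionary key built from the reactant and product sets. This is a minimal sketch, assuming every component has already been standardized and written as a canonical SMILES by an upstream toolkit; the record layout and experiment ids are hypothetical.

```python
from collections import defaultdict


def reaction_key(reactants, products):
    """Identity key: unordered reactant and product sets, nothing else."""
    return (frozenset(reactants), frozenset(products))


def group_variations(experiments):
    """Map each distinct reaction to the experiments that are variations of it."""
    groups = defaultdict(list)
    for exp in experiments:
        groups[reaction_key(exp["reactants"], exp["products"])].append(exp["id"])
    return dict(groups)


experiments = [
    {"id": "EXP-1", "reactants": ["Brc1ccccc1", "OB(O)c1ccccc1"],
     "agents": ["[Pd]"], "products": ["c1ccc(cc1)-c1ccccc1"]},
    # Same reactants in a different order with a different catalyst.
    {"id": "EXP-2", "reactants": ["OB(O)c1ccccc1", "Brc1ccccc1"],
     "agents": ["[Ni]"], "products": ["c1ccc(cc1)-c1ccccc1"]},
    {"id": "EXP-3", "reactants": ["Clc1ccccc1", "OB(O)c1ccccc1"],
     "agents": ["[Pd]"], "products": ["c1ccc(cc1)-c1ccccc1"]},
]
print(sorted(group_variations(experiments).values()))
```

Agents, quantities and conditions are deliberately left out of the key, so EXP-1 and EXP-2 collapse into one reaction while EXP-3 stays separate.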
25. Reaction classification
• For searching and analysis it is often convenient to
algorithmically assign each reaction to a type, often a
named reaction such as Negishi coupling, Diels-Alder
cycloaddition, nitro reduction or chiral separation.
• This is implemented using a database of SMIRKS-like
transforms that may be pre-compiled for efficient
matching and portability across informatics toolkits.
• This approach provides both classification under the
RSC’s RXNO ontology and reaction atom mapping for
component role assignment.
26. Simple classifications
• Examples of simple reaction classifications
A.B>>C      Regular reaction
A.B>>       Failed reaction
>>C         Compound purchase
A>>A        Purification
A.B>>A      Separation (and chiral separation)
A.B>>C.A    Catalysts or unreacted reagents
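The six patterns above can be told apart by comparing the reactant and product sets. The sketch below is one possible reading of the table, assuming agent-free reaction SMILES whose components are already canonical; the function name and the order of the checks are choices made for this illustration.

```python
def classify_simple(rxn: str) -> str:
    """Classify an agent-free reaction SMILES by its component sets."""
    reactant_field, sep, product_field = rxn.partition(">>")
    if not sep:
        raise ValueError(f"expected 'reactants>>products': {rxn!r}")
    reactants = {c for c in reactant_field.split(".") if c}
    products = {c for c in product_field.split(".") if c}
    if not reactants and not products:
        raise ValueError("empty reaction")
    if not products:
        return "Failed reaction"
    if not reactants:
        return "Compound purchase"
    if reactants == products:
        return "Purification"
    if products < reactants:  # every product was already a reactant
        return "Separation"
    if products & reactants:  # something passes through unchanged
        return "Catalysts or unreacted reagents"
    return "Regular reaction"


for pattern in ["A.B>>C", "A.B>>", ">>C", "A>>A", "A.B>>A", "A.B>>C.A"]:
    print(f"{pattern:10s} {classify_simple(pattern)}")
```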
27. Categorization of ELN reactions
1. J. Carey, D. Laffan, C. Thomson, M. Williams, Org. Biomol. Chem. 2337, 2006.
2. S. Roughley and A. Jordan, J. Med. Chem. 54:3451-3479, 2011.
[Pie chart: distribution of ELN reactions across reaction categories. Legend: Heteroatom alkylation and arylation; Acylation and related processes; C-C bond formations; Heterocycle formation; Protections; Deprotections; Reductions; Oxidations; Functional group conversion; Functional group addition; Resolution. Slice labels as extracted: 34%, 17%, 5%, 2%, 3%, 6%, 10%, 1%, 15%, 2%, 5%.]
28. Reaction ontology
• Reactions are classified into a common subset of the
Carey et al. classes and the RSC’s RXNO ontology.
• There are 12 super-classes
– e.g. 3 C-C bond formation (RXNO:0000002).
• These contain 84 class/categories.
– e.g. 3.5 Pd-catalyzed C-C bond formation (RXNO:0000316)
• These contain ~300 named reactions/types.
– e.g. 3.5.3 Negishi coupling (RXNO:0000088)
• These require >400 SMIRKS-like transformations.
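The dotted codes make the three-level hierarchy easy to walk programmatically. A small sketch using only the three entries quoted on this slide; the dictionary and function names are hypothetical.

```python
# Only the entries quoted on the slide; a real table would hold all classes.
RXN_CLASSES = {
    "3": ("C-C bond formation", "RXNO:0000002"),
    "3.5": ("Pd-catalyzed C-C bond formation", "RXNO:0000316"),
    "3.5.3": ("Negishi coupling", "RXNO:0000088"),
}


def lineage(code: str):
    """Return the codes from super-class down to the given code."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def describe(code: str):
    """Look up (code, name, RXNO id) for each known level of the lineage."""
    return [(c,) + RXN_CLASSES[c] for c in lineage(code) if c in RXN_CLASSES]


for level in describe("3.5.3"):
    print(level)
```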
29. Example SMIRKS-like transform
• Reactions are specified as SMIRKS transformations:
[BrD1h0+0:1][#6:2].[#7X3v3+0:3][H]>>[#6:2][#7:3]
1.6.2 BROMO_N_ALKYLATION
• As demonstrated in the example above, these
patterns may operate on explicit hydrogen atoms for
brevity, but these are “compiled” via more efficient
SMARTS-like patterns for matching during naming.
• The nitrogen match becomes “[#7X3v3h>0+0:3]”.
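The rewrite of the nitrogen atom can be imitated with a regular expression. This is only a sketch of the idea, not the compiler described on the slide: it handles a single `[H]` written immediately after its parent atom on the reactant side, and it emits the slide's `h>0` notation, which is SMARTS-like rather than standard SMARTS. The function names are hypothetical.

```python
import re

# A bracket atom (body, optional charge, optional map) followed by "[H]".
_ATOM_WITH_H = re.compile(r"\[([^\[\]]*?)([+-]\d*)?(:\d+)?\]\[H\]")


def fold_explicit_hydrogens(pattern: str) -> str:
    """Replace 'atom[H]' with an 'h>0' constraint on the atom itself."""
    def repl(match):
        body = match.group(1)
        charge = match.group(2) or ""
        atom_map = match.group(3) or ""
        return f"[{body}h>0{charge}{atom_map}]"
    return _ATOM_WITH_H.sub(repl, pattern)


def compile_transform(smirks: str) -> str:
    """Fold explicit hydrogens on the reactant side only."""
    left, sep, right = smirks.partition(">>")
    return fold_explicit_hydrogens(left) + sep + right


source = "[BrD1h0+0:1][#6:2].[#7X3v3+0:3][H]>>[#6:2][#7:3]"
print(compile_transform(source))
```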
30. ETL* summary statistics
ELN Experiment Category        Fraction
A  NULL CLOBs                    0.30%
B  Empty sketches                0.24%
C  No reaction (molecule?)       8.52%
D  No reactants                  0.63%
E  No products                   0.06%
F  Regular Reactions            88.93%
G  Markush Reactions             1.32%
   Total                       100.0%
Export success for a typical pharmaceutical ELN is
currently about 90.94% (see D+E+F+G above).
* Extract-Transform-Load (ETL), a data warehousing term.
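The quoted success rate follows directly from the table. A short check, using `Decimal` so that the percentages add up exactly; the variable names are arbitrary.

```python
from decimal import Decimal

# Category fractions (percent) as given in the summary table.
FRACTIONS = {
    "A": Decimal("0.30"),   # NULL CLOBs
    "B": Decimal("0.24"),   # Empty sketches
    "C": Decimal("8.52"),   # No reaction (molecule?)
    "D": Decimal("0.63"),   # No reactants
    "E": Decimal("0.06"),   # No products
    "F": Decimal("88.93"),  # Regular reactions
    "G": Decimal("1.32"),   # Markush reactions
}

total = sum(FRACTIONS.values())
export_success = sum(FRACTIONS[key] for key in "DEFG")
print(f"total = {total}%, export success (D+E+F+G) = {export_success}%")
```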
31. Analysis of ELN reaction yields
Data courtesy of Nick Tomkinson, AstraZeneca RDI, Alderley Park, UK.
32. Analysis of ELN reaction yields
Data courtesy of Nick Tomkinson, AstraZeneca RDI, Alderley Park, UK.
34. Conclusions
• In an attempt to better understand and hopefully
improve the productivity of synthetic chemists,
new computational methods have been developed
to process “real world” organic reaction data.
• The fruits of this work now enable medicinal
chemists and informaticians to make greater use of
the wealth of information in their in-house ELNs.
35. Acknowledgements
• Anna Paola Pellicioli and Greg Landrum, Novartis, CH.
• Ethan Hoff and Manli Zheng, AbbVie, IL, USA.
• Colin Batchelor, RSC, Cambridge, UK.
• David Drake, AstraZeneca, Alderley Park, UK.
• Plamen Petrov, AstraZeneca, Molndal, SE.
• Daniel Stoffler, Hoffmann-La Roche, Basel, CH.
• Pat Walters, Vertex Pharmaceuticals, MA, USA.
• Andrew Wooster, GSK, RTP, NC, USA.
• Thank you for your time.
36. Co-ordinate handling
• Several approaches for handling 2D co-ordinates
– Preserve original co-ordinates as drawn by chemist
– Center and/or rescale for downstream depiction tools
– Regenerate all 2D co-ordinates algorithmically
– Clean up long/short bonds introduced by superatom
expansion, attempting to preserve original orientation.
37. Nested reactions
• Some ELN configurations allow more than one
reaction or experiment per lab notebook page, i.e.
multiple reaction sketches in different “tabs”.
• Two solutions to this breaking of the 1-to-1 mapping
between COLLECTION_ID and reaction include:
1. Nested Reactions: Using the MDL RD file’s ability to
embed each reaction step as data fields in a single
record.
2. Splitting Reactions: Where each reaction has its
own record, possibly duplicating shared data.
38. Meta data fields
• In addition to the data explicitly recorded by the
chemist and captured by the ELN, it is also often
useful to export meta-data from an ELN schema.
• This includes data fields such as experiment creation
date, last modification date, experiment status
(open/closed), chemist name, chemist user id, etc.
39. Incremental updates
• An important feature for a production data
warehouse for ELN reaction data is the ability to keep
its contents valid/fresh with live data.
• This can be implemented by supporting incremental
updates that export only those experiments that have
been modified or created since a given date or in a
range of dates.
• One subtlety is the handling of “closed” experiments
(i.e. those signed off by a supervisor), whose status
change date need not match the last modified date.
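The subtlety about closed experiments can be made concrete by testing both timestamps when selecting records for an incremental export. A minimal sketch with hypothetical field names; a real ELN schema will name and store these dates differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class Experiment:
    experiment_id: str
    last_modified: date
    status: str = "open"
    status_changed: Optional[date] = None  # e.g. date of supervisor sign-off


def changed_between(experiments: Iterable[Experiment], start: date,
                    end: Optional[date] = None) -> List[Experiment]:
    """Select experiments modified, or whose status changed, in [start, end]."""
    end = end or date.max
    selected = []
    for exp in experiments:
        stamps = [exp.last_modified]
        if exp.status_changed is not None:
            stamps.append(exp.status_changed)
        if any(start <= stamp <= end for stamp in stamps):
            selected.append(exp)
    return selected


notebook = [
    Experiment("EXP-1", date(2013, 1, 10)),
    # Last edited in January but only signed off in June.
    Experiment("EXP-2", date(2013, 1, 12), "closed", date(2013, 6, 3)),
    Experiment("EXP-3", date(2013, 6, 20)),
]
print([e.experiment_id for e in changed_between(notebook, date(2013, 6, 1))])
```

Filtering on the last modified date alone would miss EXP-2 in this example, leaving its status stale in the warehouse.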
40. Reaction role assignment
• Normalized reaction roles may be assigned via
reaction atom mapping algorithms.
• Reaction components that contribute atoms to the
product are defined to be reactants, and the
remaining components as catalysts and solvents.
• Hence, c1ccccc1[N+](=O)[O-].[Ni]>>c1ccccc1N
may be canonicalized as
c1ccccc1[N+](=O)[O-]>[Ni]>c1ccccc1N
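Given an atom-mapped reaction, the role rule above reduces to a set intersection on map numbers. The sketch below assumes the mapping has already been produced by an upstream atom-mapping tool and simply reads the `:n]` labels with a regular expression; the function names and the hand-mapped input are for illustration only.

```python
import re

_MAP_NUMBER = re.compile(r":(\d+)\]")


def map_numbers(smiles: str) -> set:
    """Collect the atom-map numbers present in one component."""
    return {int(n) for n in _MAP_NUMBER.findall(smiles)}


def assign_roles(mapped_rxn: str) -> str:
    """Move left-hand components that share no mapped atom with a product
    into the agent field of the reaction SMILES."""
    left, agent_field, right = mapped_rxn.split(">")
    products = [c for c in right.split(".") if c]
    product_maps = set().union(*(map_numbers(p) for p in products))

    reactants = []
    agents = [c for c in agent_field.split(".") if c]
    for component in left.split("."):
        if not component:
            continue
        if map_numbers(component) & product_maps:
            reactants.append(component)
        else:
            agents.append(component)
    return ".".join(reactants) + ">" + ".".join(agents) + ">" + ".".join(products)


mapped = ("[cH:1]1[cH:2][cH:3][cH:4][cH:5][c:6]1[N+:7](=O)[O-].[Ni]"
          ">>[cH:1]1[cH:2][cH:3][cH:4][cH:5][c:6]1[NH2:7]")
print(assign_roles(mapped))
```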
41. Future work
• Preserving superatom definitions as S groups in
V2000 and V3000 format Mol/RXN files.
• Enhanced stereochemistry (non-tetrahedral and axial
chiralities for catalyst optimization).
• Improved support for Marvin and ISIS sketches.
• Support for IDBS e-Workbook ELN.