This document provides an introduction to retrosynthesis prediction and machine learning approaches for the task. It describes how retrosynthesis involves tracing reactions backward from a target product to predict required reactants. Classical computer-aided methods used reaction templates requiring domain expertise, while modern machine learning methods use neural networks to learn retrosynthesis without templates or can predict the most suitable template. Representative deep learning models discussed include sequence-to-sequence, graph neural networks, and transformer-based methods.
This lecture outlines the different strategies for finding a fragment hit and the subsequent elaboration strategies used in order to increase potency to develop a lead compound in drug discovery.
Lecture 9 slides: Machine learning for Protein Structure ... butest
This document introduces machine learning approaches for protein structure prediction. It discusses using machine learning to predict a protein's structure given its sequence by looking at regions of the protein and learning to classify them. Two main questions are addressed: how to describe protein structures and how to train predictors on examples. Common machine learning techniques for this problem include decision trees, neural networks, and logic programs. The importance of testing predictors on unseen data to avoid overfitting is also covered.
This document provides an outline for a presentation on biological networks, including introducing biological networks, describing their basic components and types, methods for predicting and building networks, sources of interaction data, tools for network visualization and analysis, and a demonstration of building, visualizing and analyzing biological networks using Cytoscape. The presentation covers topics like nodes and edges in networks, features used to analyze networks, methods for predicting networks from sequences and omics data, integrated databases for interaction data, and popular tools for searching, visualizing and performing network analysis.
Computational Drug Discovery: Machine Learning for Making Sense of Big Data i... Chanin Nantasenamat
In this lecture, I provide an overview of how computers can be instrumental in drug discovery efforts. Topics covered include: big data as a result of omics efforts; bioinformatics; cheminformatics; biological space; chemical space; and how computers, particularly machine learning (and data science), can be applied in the context of drug discovery.
A video of this lecture is also provided on the "Data Professor" YouTube channel available at http://bit.ly/dataprofessor
This document discusses the identification and generation of pharmacophores. A pharmacophore is a specific 3D arrangement of functional groups within a molecule that are necessary for binding to an enzyme or receptor. Pharmacophore identification is important for understanding ligand-receptor interactions. Pharmacophore models are derived from the common features of known active molecules and define the spatial relationships between these features. Several computational methods can be used to generate pharmacophore hypotheses, including systematic search, distance geometry, and clique detection algorithms.
Peptidomimetics are compounds whose essential elements (pharmacophore) mimic a natural peptide or protein in 3D space and which retain the ability to interact with the biological target and produce the same biological effect.
Peptidomimetics are designed to circumvent some of the problems associated with a natural peptide, for example:
Poor stability against proteolysis (short duration of activity).
Poor bioavailability.
Limited receptor selectivity or potency (often can be substantially improved).
1) The document discusses various methods for determining the 3D structure of proteins, including X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy.
2) X-ray crystallography involves purifying the protein, crystallizing it, collecting diffraction data from X-rays hitting the crystal, using this data to determine phases and calculate an electron density map, and building an atomic model through refinement.
3) NMR spectroscopy involves dissolving the purified protein and using nuclear magnetic resonance to measure distances between atomic nuclei, allowing the structure to be calculated.
The document discusses the process of preparing a chemical database for virtual screening or compound acquisition. It begins with assembling collections from in-house and external databases. The collection is then cleaned by removing invalid structures and standardizing structure representations. Property filtering is used to focus on lead-like compounds. Known active molecules are searched for structural similarity. Alternative structures like stereoisomers are explored. Representatives are selected from clustered structures using descriptors and similarity metrics. 3D structures are generated and a final list of compounds is assembled for screening, with some random additions, completing the preparation.
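The similarity search step in the summary above can be sketched in a few lines. This is a minimal illustration, assuming fingerprints are stored as sets of "on" bit positions and that the Tanimoto coefficient is the similarity metric; the summary does not name a metric, and the compound names and bit sets below are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

def rank_by_similarity(query_fp, library):
    """Return (score, name) pairs, most similar to the query first."""
    scored = [(tanimoto(query_fp, fp), name) for name, fp in library.items()]
    return sorted(scored, reverse=True)

# Hypothetical known active and a toy library (bit positions are invented).
known_active = {1, 2, 3}
library = {
    "cmpd_a": {1, 2, 3, 4},
    "cmpd_b": {7, 8},
    "cmpd_c": {1, 2},
}
ranking = rank_by_similarity(known_active, library)
```

In a real workflow the fingerprints would come from a cheminformatics toolkit rather than hand-written sets, and the same score could feed the clustering step that picks representatives.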
Combinatorial chemistry and high throughput screening SaikiranKulkarni
Combinatorial chemistry is a collection of techniques which allow for the synthesis of multiple compounds at the same time.
Combinatorial chemistry is one of the important new methodologies developed by researchers in the pharmaceutical industry to reduce the time and costs associated with producing effective and competitive new drugs. By accelerating the process of chemical synthesis, this method is having a profound effect on all branches of chemistry, but especially on drug discovery.
The screening of chemical libraries with traditional methods, such as high-throughput screening (HTS), is expensive and time consuming. Quantitative structure–activity relationship (QSAR) modeling is an alternative method that can assist in the selection of lead molecules by using the information from reference active and inactive compounds. This approach requires good molecular descriptors that are representative of the molecular features responsible for the relevant molecular activity.
Drug and Chemical Databases 2018 - Drug Discovery Girinath Pillai
Latest collection of Chemical and Drug Databases for Biological Research as well as Drug Design studies. Databases statistics, links and overview data with CADD introduction.
This presentation explains homology modeling, with examples of protein primary and secondary structure and images that make it easy to understand.
1 -val_gillet_-_ligand-based_and_structure-based_virtual_screening Deependra Ban
The document discusses ligand-based and structure-based virtual screening techniques. It begins with an introduction to the speaker's background and expertise in chemoinformatics. It then provides an overview of the drug discovery challenge and how virtual screening can help address it. The remainder of the document focuses on explaining different virtual screening methods, including ligand-based approaches like similarity searching, pharmacophore mapping, and machine learning methods, as well as structure-based protein-ligand docking. Limitations of methods are discussed along with ways to improve performance, such as using multiple active structures, data fusion techniques, and accounting for conformational flexibility.
ADMET properties prediction using AI will accelerate the process of drug discovery.
This slide deck mostly focuses on using graph-based deep learning techniques to predict drug properties.
Global and local restrictions Peptidomimetics ASHOK GAUTAM
Peptidomimetics are small protein-like chains designed to mimic peptides but with greater stability and specificity. They are created either by modifying existing peptides or designing new structures that mimic peptides. Peptidomimetics incorporate conformational constraints locally or globally to restrict flexibility and exclude potential conformations, allowing for more targeted interaction with biological targets. Conformational constraints are needed to improve properties like stability, activity, and selectivity for applications like drug development and targeted cancer therapies. Common constraints include cyclization, disulfide bonds, and restricted amino acids.
3D QSAR approaches relate the biological activity of compounds to their 3D structural properties using statistical analysis. CoMFA is a commonly used 3D-QSAR method that involves aligning molecules, placing them in a grid, calculating electrostatic and steric field properties at each point, and correlating these descriptors to biological activity using PLS analysis. CoMFA results are often displayed as contour plots that identify regions where certain molecular properties increase or decrease activity. X-ray crystallography and NMR spectroscopy can provide experimental data on bioactive conformations.
This document discusses de novo drug design, which involves using a target receptor's 3D structure to design new molecules that can interact with it, without relying on existing leads. The key steps of computer-based de novo design are generating primary constraints from the receptor, deriving interaction sites in the binding pocket, building up ligand structures using methods like growing and linking fragments, scoring the ligands to evaluate binding affinity, and applying secondary constraints related to drug properties. Successful applications of de novo design include HIV protease inhibitors and COMT inhibitors.
The document discusses using neural networks to predict secondary protein structure from amino acid sequences. It describes training a feed-forward neural network using backpropagation. The network takes a sliding window of 17 amino acids as input and predicts each amino acid as helix, sheet, or coil. Different network architectures are tested and strategies to improve accuracy like adding more biological input features or changing the network architecture are discussed.
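The sliding-window input described above can be sketched as follows. This is an illustration only: the window size of 17 comes from the summary, while the one-hot encoding with an extra padding slot, the padding character, and the example sequence are assumptions rather than details taken from the slides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
WINDOW = 17                            # window size stated in the summary
HALF = WINDOW // 2

def one_hot(residue):
    """Return a 21-element vector: 20 residue slots plus one padding/unknown slot."""
    vec = [0] * (len(AMINO_ACIDS) + 1)
    idx = AMINO_ACIDS.find(residue)
    vec[idx if idx >= 0 else len(AMINO_ACIDS)] = 1
    return vec

def windows(sequence):
    """Yield one flattened input vector per residue, centred on that residue."""
    padded = "-" * HALF + sequence + "-" * HALF  # pad so the ends get full windows
    for i in range(len(sequence)):
        window = padded[i:i + WINDOW]
        yield [bit for residue in window for bit in one_hot(residue)]

# Hypothetical sequence; each vector would be fed to the network to
# classify the central residue as helix, sheet, or coil.
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
inputs = list(windows(sequence))
```

Each residue therefore produces a vector of 17 x 21 = 357 inputs, and the network's three outputs correspond to the helix, sheet, and coil classes.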
Molecular Activity Prediction Using Graph Convolutional Deep Neural Network C... Masahito Ohue
Molecular Activity Prediction Using Graph Convolutional Deep Neural Network Considering Distance on a Molecular Graph
Int’l Workshop on Mathematical Modeling and Problem Solving (MPS)
2019 Int’l Conference on Parallel and Distributed Processing Techniques & Applications (PDPTA’19)
Session 2. July 29, 2019 @Luxor, Las Vegas
https://americancse.org/events/csce2019/program/pdp_csc_ipc_msv_gcc_29
This document discusses structure-based and ligand-based drug design approaches. Structure-based design uses the 3D structure of biological targets to dock potential drug molecules. Ligand-based design analyzes similar molecules that bind to the target to derive pharmacophore models or quantitative structure-activity relationships (QSAR) to predict new candidates. Specific structure-based methods covered include docking tools like AutoDock and CDOCKER, and accounting for protein and complex flexibility. Ligand-based methods discussed are QSAR techniques like Comparative Molecular Field Analysis (CoMFA) and Comparative Molecular Similarity Indices Analysis (CoMSIA). In conclusion, computational approaches like these are valuable for drug discovery by facilitating the identification and testing of new ligand
Molecular and Quantum Mechanics in drug design Ajay Kumar
This document discusses and compares molecular mechanics and quantum mechanics methods for drug design. It provides an overview of molecular mechanics, which uses classical physics to model potential energy surfaces, and common molecular mechanics force fields such as AMBER and CHARMM. It also describes quantum mechanics principles, density functional theory, and semi-empirical methods. Key differences between molecular mechanics and quantum mechanics are noted, such as system size, time required, and accuracy. Applications of each method in drug design are mentioned.
This document discusses protein threading modeling methods. Protein threading, also called fold recognition, is used to model proteins that have the same fold as proteins with known structures but no homologous sequences. It differs from homology modeling which is used for proteins that have homologous sequences. Protein threading works by using statistical knowledge of relationships between structures in the Protein Data Bank and the sequence of the protein being modeled. It is based on observations that there are a limited number of folds in nature and most new structures have similar folds to ones already in the PDB. The document then describes the general steps of the protein threading method.
Molecular docking by harendra ...power point presentation Harendra Bisht
Molecular docking is a computational method used to predict how a small molecule, like a drug, binds to a larger target molecule, like a protein. It works by fitting the structures of the two molecules together to find the highest affinity binding mode. The docking process involves defining the active site on the target protein, generating possible positions for the small molecule to bind, scoring the interactions between them, and identifying the best binding pose. Docking can help researchers design new drugs that effectively interact with protein targets.
The document summarizes the rules for SMILES (Simplified Molecular Input Line Entry System) notation. It discusses 6 rules: 1) atom specifications including element symbols, isotopes, charges, and implicit hydrogens, 2) bond specifications such as single, double, triple, and aromatic bonds, 3) branching denoted with parentheses, 4) ring closures using bond numbers, 5) disconnections represented by a period, and 6) isomerism including isotopes, double bond orientations, stereochemistry, and atom mapping in reactions. SMILES provides a systematic way to represent molecular structures and reactions through a grammar of symbols and notation following these six rules.
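Several of the rules listed above (atoms, bonds, branches, ring closures, disconnections) can be illustrated with a small tokenizer. This is a simplified sketch, not a full SMILES parser: it covers bracket atoms, the organic subset, and aromatic lowercase atoms, and it only checks that parentheses and ring-closure labels pair up. The function names are hypothetical.

```python
import re

# One alternative per rule; bracket atoms are kept as a single token.
TOKEN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops])"  # rule 1: atoms
    r"|(?P<bond>[-=#:/\\])"                             # rule 2: bonds
    r"|(?P<branch>[()])"                                # rule 3: branches
    r"|(?P<ring>%\d{2}|\d)"                             # rule 4: ring closures
    r"|(?P<dot>\.)"                                     # rule 5: disconnection
)

def tokenize(smiles):
    """Split a SMILES string into (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def is_balanced(smiles):
    """Check that branches close and every ring label appears in pairs."""
    depth, open_rings = 0, set()
    for kind, text in tokenize(smiles):
        if text == "(":
            depth += 1
        elif text == ")":
            depth -= 1
            if depth < 0:
                return False
        elif kind == "ring":
            open_rings ^= {text}  # first occurrence opens, second closes
    return depth == 0 and not open_rings

acetic_acid = tokenize("CC(=O)O")
```

For example, `is_balanced("c1ccccc1")` holds for benzene, while `"c1ccccc"` fails because ring label 1 is never closed. Stereochemistry and atom maps inside brackets are passed through untouched, since the bracket atom is treated as one opaque token.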
PREDICTION AND ANALYSIS OF ADMET PROPERTIES OF NEW.pptx MO.SHAHANAWAZ
Details on the prediction and analysis of ADMET properties of new molecules and its importance in drug discovery, including descriptors for ADMET prediction and datasets used in ADMET prediction.
Fragment-based drug design (FBDD) uses small molecular fragments that bind weakly to a target protein's binding site. These fragments can then be grown, merged, or linked to improve binding affinity. FBDD provides starting points for challenging targets like protein-protein interactions. It increases the use of biophysics to characterize compound binding. FBDD also gives small research groups access to tools for identifying chemical probes of biological systems.
Genome editing methods such as ZFNs, TALENs, and CRISPR/Cas9 use engineered nucleases to create targeted double-stranded breaks in DNA which are then repaired through endogenous cellular processes. These nucleases can be used to modify genomes through techniques like gene knockout, targeted mutation insertion/deletion/correction, and studying gene function. CRISPR/Cas9 uses a guide RNA and Cas9 nuclease to target specific DNA sequences for editing. The four main steps for CRISPR are: 1) selecting target sequences near a PAM site, 2) designing and cloning gRNA, 3) delivering Cas9 and gRNA into cells, and 4) DNA repair after cleavage results in gene modification
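Step 1 above, selecting target sequences near a PAM site, can be sketched as a simple sequence scan. The sketch assumes the NGG PAM of the commonly used Streptococcus pyogenes Cas9 and a 20-nucleotide protospacer; neither value is stated in the summary. It scans only the strand it is given, and the function name and test sequence are hypothetical.

```python
import re

def find_targets(dna, guide_len=20):
    """Return (start, protospacer, pam) for every NGG PAM on the given strand."""
    dna = dna.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len
    return [(m.start(), m.group(1), m.group(2)) for m in re.finditer(pattern, dna)]

# Toy sequence: 20 A's followed by a TGG PAM, then filler.
example = "A" * 20 + "TGG" + "CCC"
targets = find_targets(example)
```

A fuller tool would also scan the reverse complement and score candidates for off-target matches before any gRNA is designed.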
In this project, we investigated the use of association rules to extract useful knowledge from raw ontological data. To this end, we proposed an approach to pass from graph representation to transactional data. Then, we used different technological solutions, such as the FP-growth algorithm and Hadoop, to improve the performance of frequent itemset extraction. Check our code on GitHub: https://github.com/8-chems/OntologyMiner
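To make the association-rule idea concrete, here is a minimal sketch of support and confidence computed over transactional data. It uses brute-force counting over item pairs, not FP-growth or Hadoop, and the transactions are invented for illustration rather than taken from the project.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def pair_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for single-item rules lhs -> rhs."""
    items = sorted(set().union(*transactions))
    found = []
    for lhs in items:
        for rhs in items:
            if lhs == rhs:
                continue
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue
            confidence = s / support({lhs}, transactions)
            if confidence >= min_confidence:
                found.append((lhs, rhs, s, confidence))
    return found

# Hypothetical transactions, e.g. the set of classes attached to each entity.
transactions = [
    {"Protein", "Enzyme"},
    {"Protein", "Enzyme", "Kinase"},
    {"Protein"},
    {"Enzyme", "Kinase"},
]
rules = pair_rules(transactions, min_support=0.5, min_confidence=0.9)
```

On this toy data the only rule that passes both thresholds is Kinase -> Enzyme, since every transaction containing Kinase also contains Enzyme.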
Analytic Dependency Loops in Architectural Models of Cyber-Physical Systems Ivan Ruchkin
Rigorous engineering of safety-critical Cyber-Physical Systems (CPS) requires integration of heterogeneous modeling methods from different disciplines. It is often necessary to view this integration from the perspective of analyses -- algorithms that read and change models. Although analytic integration supports formal contract-based verification of model evolution, it suffers from the limitation of analytic dependency loops. Dependency loops between analyses cannot be resolved based on the existing contract-based verification. This paper makes a step towards using rich architectural description to resolve circular analytic dependencies. We characterize the dependency loop problem and discuss three algorithmic approaches to resolving such loops: analysis iteration, constraint solving, and genetic search. These approaches take advantage of information in multi-view architectures to resolve analytic dependency loops.
Numerical methods are used for solving mathematical problems that cannot be solved
analytically. Computational chemical engineering uses modern numerical analysis tools for solving differential
equation models which arise in chemical engineering. These tools allow us to design, operate, and optimize the
chemical production of industrial and consumer goods. This paper provides a brief introduction to
computational chemical engineering
Experimental Design for Distributed Machine Learning with Myles BakerDatabricks
This document discusses experimental design for distributed machine learning models. It outlines common problems in machine learning modeling like selecting the best algorithm and evaluating a model's expected generalization error. It describes steps in a machine learning study like collecting data, building models, and designing experiments. The goal of experimentation is to understand how model factors affect outcomes and obtain statistically significant conclusions. Techniques discussed for analyzing distributed model outputs include precision-recall curves, confusion matrices, and hypothesis testing methods like the chi-squared test and McNemar's test. The document emphasizes that experimental design for distributed learning poses new challenges around data characteristics, computational complexity, and reproducing results across models.
NS-CUK Journal club: H.E.Lee, Review on " A biomedical knowledge graph-based ...ssuser4b1f48
1) The document proposes a deep learning framework called DeepLGF to predict drug-drug interactions by combining local and global feature extraction from biomedical knowledge graphs.
2) DeepLGF uses graph neural networks and knowledge graph embedding methods to extract local drug features from chemical structures and biological functions, and global features from the relationships between drugs and other biological entities.
3) Experimental results on prediction tasks using several drug interaction datasets demonstrate that DeepLGF outperforms other state-of-the-art models and has promising applications in drug development and clinical use.
The document discusses various object-oriented methodologies including Rumbaugh, Booch, and Jacobson methodologies. It provides details on Rumbaugh's Object Modeling Technique (OMT) which separates modeling into object, dynamic, and functional models. It describes Booch's methodology which uses class, object, state transition, and other diagrams. It also discusses Jacobson's methodologies including Object-Oriented Software Engineering (OOSE) which is use case driven, and Object-Oriented Business Engineering (OOBE) which uses use cases. The document then covers topics on software quality assurance including types of errors, testing strategies like black box and white box testing, and testing approaches like top-down
Open Chemistry: Input Preparation, Data Visualization & AnalysisMarcus Hanwell
The document outlines an open-source software development project called Open Chemistry that aims to integrate desktop chemistry applications, high-performance computing resources, and database/informatics resources. It describes several software applications being developed as part of Open Chemistry, including Avogadro 2 for structure editing and visualization, MoleQueue for running computational jobs on local and remote systems, and MongoChem for storing and searching chemistry data. The goal of Open Chemistry is to advance computational chemistry tools through open-source development and tight integration of related applications.
Review of Existing Methods in K-means Clustering AlgorithmIRJET Journal
This document reviews existing methods for improving the K-means clustering algorithm. K-means is widely used but has limitations such as sensitivity to outliers and initial centroid selection. The document summarizes several proposed approaches, including using MapReduce to select initial centroids and form clusters for large datasets, reducing execution time by cutting off iterations, improving cluster quality by selecting centroids systematically, and using sampling techniques to reduce I/O and network costs. It concludes that improved algorithms address K-means limitations better than the traditional approach.
Presented on 4 Dec 2014
@Jeju, South Korea
Title: An efficient method for assessing the impact of refactoring candidates on maintainability based on matrix computation
Using open bioactivity data for developing machine-learning prediction models...Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 22, 2018).
==== Abstract ====
The retinoid X receptor (RXR) is a nuclear hormone receptor that functions as a transcription factor with roles in development, cell differentiation, metabolism, and cell death. Chemicals that interfere the RXR signaling pathway may cause adverse effects on human health. In this study, open bioactivity data available at PubChem (https://pubchem.ncbi.nlm.nih.gov) were used to develop prediction models for chemical modulators of RXR-alpha, which is a subtype of RXR that plays a role in metabolic signaling pathways, dermal cysts, cardiac development, insulin sensitization, etc. The models were constructed from quantitative high-throughput screening (qHTS) data from the Tox21 project, using various supervised machine learning methods (including support vector machine, random forest, neural network, k-nearest neighbors, decision tree, and naïve Bayes). The performance of the models was evaluated with an external data set containing bioactivity data submitted by ChEMBL and the NCATS Chemical Genomics Center (NCGC). This study showcases how open data in the public domain can be used to develop prediction models for chemical toxicity.
Writing unit tests became a daily routine task of software developers. But the quality of these tests vary a lot across projects and developers. Are your tests really testing what your class does? Are they easy to read? Are they easy to modify? After having a short theoretical overview of the purpose of unit tests and test driven development, I would share plenty of small tricks to improve the quality of unit tests. Although the examples will be written in Java using the JUnit and Mockito libraries, the ideas are mostly language and technology agnostic.
The slides were originally presented at the SDL DevOps Meetup in the Amsterdam office.
2020.04.07 automated molecular design and the bradshaw platform webinarPistoia Alliance
This presentation described how data-driven chemoinformatics methods may automate much of what has historically been done by a medicinal chemist. It explored what is reasonable to expect “AI” approaches might achieve, and what is best left with a human expert. The implications of automation for the human-machine interface were explored and illustrated with examples from Bradshaw, GSK’s experimental automated design environment.
The document is a thesis proposal by Justin Sybrandt at Clemson University that outlines his past and proposed work on exploiting latent features in text and graphs. It summarizes Sybrandt's peer-reviewed work using embeddings to generate biomedical hypotheses from text and validate hypotheses through ranking. It also discusses pending work on heterogeneous bipartite graph embeddings and partitioned hypergraphs. The proposal provides background on Sybrandt's hypothesis generation work and outlines his proposed future research directions involving graph embeddings.
Use of GAN's to analyze chemical reactionsMatthew Clark
The document discusses using conditional generative adversarial networks (cGANs) to analyze chemical reactions by representing molecules as 3D electron density fields, which are better representations than traditional chemical drawings. It describes how cGANs can be trained on transformations between reactant and product electron density fields to predict chemical reactions, and presents some challenges with interpreting density fields and representing many-to-one reactions. The work shows promise but requires improvements in distinguishing similar elements, selecting better test reactions, and reproducing known chemical rules.
This document is a slide presentation by Sri Krishnamurthy on machine learning applications in credit risk. The presentation discusses using machine learning algorithms like supervised learning algorithms for prediction and classification, and unsupervised learning algorithms like clustering, to analyze credit risk data. It provides examples of how clustering algorithms like K-means and hierarchical clustering can be used to group credit risk applicants. The presentation also discusses challenges of adopting open-source software in enterprises and potential use cases for a regulatory sandbox for testing financial technology solutions.
Novel algorithms for detection of unknown chemical molecules with specific bi...Aboul Ella Hassanien
The document proposes novel algorithms for detecting unknown chemical molecules with specific biological activities. It introduces two approaches: 1) a qualitative structure-activity relationships approach using molecular descriptors and machine learning models, and 2) a graph algorithms based approach using a new coding system and kernel functions. For the latter, it presents a new atoms similarity algorithm and paths of stars algorithm, applying them to drug activity prediction tasks with competitive accuracy compared to other methods. The algorithms aim to reduce the time and cost of classifying chemical compounds.
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Anubhav Jain
1) The document discusses evaluating machine learning algorithms for materials science using the Matbench protocol.
2) Matbench provides standardized datasets, testing procedures, and an online leaderboard to benchmark and compare machine learning performance.
3) This allows different groups to evaluate algorithms independently and identify best practices for materials science predictions.
Visual data mining combines traditional data mining methods with information visualization techniques to explore large datasets. There are three levels of integration between visualization and automated mining methods - no/limited integration, loose integration where methods are applied sequentially, and full integration where methods are applied in parallel. Different visualization methods exist for univariate, bivariate and multivariate data based on the type and dimensions of the data. The document describes frameworks and algorithms for visual data mining, including developing new algorithms interactively through a visual interface. It also summarizes a document on using data mining and visualization techniques for selective visualization of large spatial datasets.
The document provides details on method development for chromatography. It discusses defining key terms, developing a test method plan, optimizing methods through experimental design techniques like factorial design. The method development process involves studying samples, setting goals, reviewing literature, selecting an approach, optimizing parameters, and finalizing the method. Critical parameters like column length and temperature, flow rate, mobile phase composition are identified for optimization. Formal validation is required once the method is developed.
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTjpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon
reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been
referred to as the "New Great Game." This research centres on the power struggle, considering
geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil
politics, and conventional and nontraditional security are all explored and explained by the researcher.
Using Mackinder's Heartland, Spykman Rimland, and Hegemonic Stability theories, examines China's role
in Central Asia. This study adheres to the empirical epistemological method and has taken care of
objectivity. This study analyze primary and secondary research documents critically to elaborate role of
china’s geo economic outreach in central Asian countries and its future prospect. China is thriving in trade,
pipeline politics, and winning states, according to this study, thanks to important instruments like the
Shanghai Cooperation Organisation and the Belt and Road Economic Initiative. According to this study,
China is seeing significant success in commerce, pipeline politics, and gaining influence on other
governments. This success may be attributed to the effective utilisation of key tools such as the Shanghai
Cooperation Organisation and the Belt and Road Economic Initiative.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...IJECEIAES
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to
precisely delineate tumor boundaries from magnetic resonance imaging (MRI)
scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating
the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The
model is rigorously trained and evaluated, exhibiting remarkable performance
metrics, including an impressive global accuracy of 99.286%, a high-class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted
IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of
our proposed model. These findings underscore the model’s competence in precise brain tumor localization, underscoring its potential to revolutionize medical
image analysis and enhance healthcare outcomes. This research paves the way
for future exploration and optimization of advanced CNN models in medical
imaging, emphasizing addressing false positives and resource efficiency.
ACEP Magazine edition 4th launched on 05.06.2024Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
International Conference on NLP, Artificial Intelligence, Machine Learning an...gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELgerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
2. Table of Contents
• Introduction
• Retrosynthesis prediction
• Dataset description
• Overview of general approaches: Template-based, Template-free, Selection-based
• Proposed methods
• Classical computer-aided methods
• Machine learning based methods
• Challenges
• Practice
• RDKit
• OpenNMT
• Related works
• Future directions
• Reference
• Appendix
4. Retrosynthesis prediction
• What is retrosynthesis prediction?
• Retrosynthesis or retrosynthetic pathway planning is the process of tracing back the
forward reaction, predicting which reactants are required to synthesize the target product.
5. Retrosynthesis prediction
• Retrosynthesis is a crucial step in discovering new materials and drugs.
[Figure: discovery pipeline. Desired properties → candidate product → candidate reactants (retrosynthesis prediction) → test by chemist]
6. Retrosynthesis prediction
• Each step in discovering new materials and drugs has its own error, so the result must be verified by a chemist.
• This verification is expensive.
[Figure: discovery pipeline. Desired properties → candidate product → candidate reactants (retrosynthesis prediction) → test by chemist]
8. Retrosynthesis prediction
• If retrosynthesis prediction can be done with high accuracy, it could unlock a fully automated material/drug discovery pipeline.
[Figure: discovery pipeline. Desired properties → candidate product → candidate reactants (retrosynthesis prediction) → test by robot]
9. Dataset description
• SMILES (Simplified Molecular-Input Line-Entry System) [1]
• SMILES is a line notation for describing the structure of chemical species [2].
• Generation of SMILES
• A SMILES string is obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph.
[1] Weininger et al. [2] https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
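The depth-first generation described above can be illustrated with a minimal sketch. It is not the full SMILES algorithm: it assumes an acyclic molecule with single bonds only, so there are no ring closure labels or bond symbols. The names `dfs_smiles`, `atoms`, and `bonds` are hypothetical and chosen for illustration.

```python
from typing import Dict, List


def dfs_smiles(atoms: Dict[int, str], bonds: Dict[int, List[int]], start: int) -> str:
    """Write a SMILES-like string for an acyclic, single-bonded molecular graph."""
    visited = set()

    def visit(node: int) -> str:
        visited.add(node)
        out = atoms[node]
        # Neighbours not yet visited are the children of this node in the DFS tree.
        children = [n for n in bonds[node] if n not in visited]
        for i, child in enumerate(children):
            sub = visit(child)
            # Every child except the last one is a branch and goes in parentheses.
            out += sub if i == len(children) - 1 else f"({sub})"
        return out

    return visit(start)


# Isopropanol: C0 - C1(-C2) - O3, hydrogens left implicit.
atoms = {0: "C", 1: "C", 2: "C", 3: "O"}
bonds = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(dfs_smiles(atoms, bonds, start=0))
```

Starting the traversal from a different atom gives a different but equivalent string, which is why SMILES strings are usually canonicalized before they are compared.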
10. Dataset description
• SMILES in detail
• The carbon (C) label is omitted in the skeletal graph drawing.
• Hydrogen (H) atoms are left implicit in SMILES.
• Ring structures are written by breaking each ring at an arbitrary point to make an acyclic structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms.
• Branches are described with parentheses.
• A bond is represented using one of the symbols: ., -, =, #, $, :, /, \
• "." indicates that two parts are not bonded together.
[1] Weininger et al. [2] https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
11. Dataset description
• Benchmarks:
1. USPTO (United States Patent and Trademark Office)
• The USPTO benchmark contains SMILES representations of a single target product (input) and its reactants (target).
• Variants
• USPTO-50k
• USPTO-500K
• USPTO-MIT
2. Pistachio [32]
3. Reaxys [25]
[25] reaxys.com [32] Mayfield et al.
12. Overview of general approaches: Template-based
• Template-based approaches [2, 3, 4, 5, 14, 15, 16, 17] use known chemical reactions, encoded as reaction templates.
• A reaction template contains sub-graph reaction patterns that describe how the reaction occurs between reactants and product.
• Pros
• High interpretability
• Cons
• Low generalizability to unseen templates
• Requires domain knowledge to extract the reaction templates
13. Overview of general approaches: Template-free
• Template-free approaches [6, 7, 8, 9, 10, 12] learn a mapping function f from a product to a set of reactants by extracting features directly from data.
• Seq2Seq framework
• [6, 7, 8, 12]
• Graph2Graph framework
• [9, 10]
• Pros
• Generalizability
• Does not require domain knowledge
• Cons
• Invalid/inaccessible predictions
• Low interpretability
14. Overview of general approaches: Selection-based
• Selection-based approaches [11] select reactants from a candidate set of purchasable compounds.
• The objective of [11] is to discover retrosynthetic routes from a given desired product to commercially available reactants.
• Pros
• Accessibility of the prediction
• Does not require domain knowledge
• Cons
• Limited novelty
[Figure: Rank := f(product; purchasable pool)]
[11] Guo et al.
16. Classical computer-aided methods
• Before deep learning, computer-aided retrosynthesis was mainly conducted using reaction templates [2, 3, 4, 15, 16, 17].
• These methods are mainly about how to use known reactions and extract meaningful reaction context.
• Characteristics
• Needs chemical expertise
• Heuristics
• Computationally expensive
• Chemical space is vast
• Subgraph isomorphism problem*1
• Not scalable
• Not generalizable
*1: Appendix-1
17. Classical computer-aided methods
• The first computer-aided retrosynthesis:
• [18] Corey et al., “Computer-assisted analysis in organic synthesis.”, Science, 1985
• E. J. Corey received the 1990 Nobel Prize in Chemistry for his development of retrosynthetic analysis.
• [19] The Logic of Chemical Synthesis: Multistep Synthesis of Complex Carbogenic Molecules (Nobel lecture), 1991
[18, 19] Corey et al.
19. Classical computer-aided methods: Recent work [3] 2017 – Key Idea
• Key Idea
• It uses product similarity and reactant similarity to rank the templates of precedent reactions.
[3] Coley et al.
20. Classical computer-aided methods: Recent work [3] 2017 – Method (Similarity)
• How to measure molecular similarity*2?
• Molecular fingerprints are a way of encoding the structure of a molecule. They can be computed with the RDKit library.
• The most common measure is Tanimoto similarity, but there is no canonical definition of molecular similarity (subgraph isomorphism problem*1).
• $T(A, B) = |A \cap B| / |A \cup B|$, where $A$, $B$: molecular fingerprints
[Figure: img from [20]]
*1: Appendix-1, *2: Appendix-2
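As a small illustration of the Tanimoto similarity mentioned on this slide, the sketch below treats a fingerprint as the set of indices of its "on" bits. This representation and the function name `tanimoto` are assumptions made for the example; in practice the fingerprints would come from a toolkit such as RDKit.

```python
from typing import AbstractSet


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Convention chosen here: two empty fingerprints have similarity 0.
        return 0.0
    return len(a & b) / union


fp_a = {1, 2, 3}
fp_b = {2, 3, 4}
print(tanimoto(fp_a, fp_b))  # 2 shared bits out of 4 distinct bits
```

The score is 1 only when both fingerprints set exactly the same bits, and 0 when they share none.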
21. Classical computer-aided methods: Recent work [3] 2017 – Method (Using similarity)
• Example of using similarity in [3]
• Total similarity := product similarity × reactants (precursor) similarity
• Candidates are ranked by total similarity.
[3] Coley et al.
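The ranking rule above (total similarity as the product of the two similarities) can be sketched as follows. This is a toy under stated assumptions, not the implementation of [3]: fingerprints are sets of on-bit indices, the proposed precursors for each precedent are taken as given rather than generated by applying a template, and `Candidate` and `rank_candidates` are hypothetical names.

```python
from typing import AbstractSet, List, NamedTuple, Tuple


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class Candidate(NamedTuple):
    name: str
    precedent_product: AbstractSet[int]    # fingerprint of the precedent's product
    precedent_reactants: AbstractSet[int]  # fingerprint of the precedent's reactants
    proposed_reactants: AbstractSet[int]   # fingerprint of the precursors proposed for the target


def rank_candidates(target: AbstractSet[int], candidates: List[Candidate]) -> List[Tuple[str, float]]:
    """Score each candidate by product similarity * reactant similarity, best first."""
    scored = []
    for c in candidates:
        product_sim = tanimoto(target, c.precedent_product)
        reactant_sim = tanimoto(c.proposed_reactants, c.precedent_reactants)
        scored.append((c.name, product_sim * reactant_sim))
    return sorted(scored, key=lambda item: item[1], reverse=True)


target_fp = {1, 2, 3, 4}
ranked = rank_candidates(target_fp, [
    Candidate("B", {1, 2}, {8}, {8}),
    Candidate("A", {1, 2, 3, 4}, {5, 6}, {5, 6, 7}),
])
print(ranked)
```

Because the two similarities are multiplied, a precedent must resemble the target on both the product and the reactant side to rank highly.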
22. Classical computer-aided methods: Recent work [3] 2017 – Results
• Result of [3]
• [3] outperforms seq2seq in the results table*3. Note, however, that the seq2seq baseline is template-free whereas [3] is template-based.
• Contribution
• It mimics the retrosynthetic strategy by using molecular similarity, without the need to encode any chemical knowledge.
• Limitation
• It inherently disfavors creative retrosynthetic strategies because it relies on precedent reactions.
*3: Appendix-3
24. Machine learning based methods
• Data-driven methods using machine learning and deep learning have been actively studied since the mid-2010s.
• The need for expertise has been reduced.
• They are more scalable and generalizable.
• Representative proposed methods
• Template-based
• NeuralSim [14], Graph Logic Network (GLN) [5]
• Template-free
• Seq2Seq [21], Molecular Transformer (MT) [6, 7], Latent variable Transformer (LV-MT) [8], Self-Corrected Transformer (SCROP) [22], Graph2Graph (G2G) [9], GraphRetro [10]
• Selection-based
• Bayesian-Retro [11]
26. Machine learning based methods
Template-based: NeuralSim [14] 2017 – Key Idea
• Key Idea
• Given a target product, it uses a neural network to predict the most suitable rule among the reaction templates.
[14] Segler et al.
27. Machine learning based methods
Template-based: NeuralSim [14] 2017 – Method
• It uses simple models such as an MLP and a Highway network [23].
• It defines rule selection as multiclass classification.
• The molecular descriptor [24] is defined as a sum of molecular fingerprints.
[14] Segler et al. [23] Srivastava et al. [24] pdf file
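Rule selection as multiclass classification can be sketched as a softmax over one score per rule. This is only an illustration: a single linear layer stands in for the MLP/Highway network of [14], and `rank_rules`, `fingerprint`, and `weights` are hypothetical names with made-up values.

```python
import math
from typing import List, Tuple


def rank_rules(fingerprint: List[float], weights: List[List[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Score every rule with a linear layer, apply softmax, return the top-k (rule index, probability)."""
    logits = [sum(w * x for w, x in zip(row, fingerprint)) for row in weights]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


# Toy input: a 3-bit product descriptor and three candidate rules (one weight row per rule).
product_descriptor = [1.0, 0.0, 1.0]
rule_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
print(rank_rules(product_descriptor, rule_weights, k=2))
```

In this framing, each reaction rule is one class, so the output layer grows with the size of the rule set.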
28. Machine learning based methods
Template-based: NeuralSim [14] 2017 – Results
• Experiments
• Dataset: Reaxys database [25]
• Number of classes: 8720
• Contribution
• It shows that neural networks can learn in which molecular context a particular rule can be applied.
• Limitation
• The performance is affected by the cardinality of the rule set.
• The larger the rule set, the lower the performance.
[14] Segler et al.
29. Machine learning based methods
Template-based: Graph Logic Network [5] 2019
• Template-based: Graph Logic Network (GLN) [5] (NeurIPS 2019)
[5] Dai et al.
30. Machine learning based methods
Template-based: Graph Logic Network [5] 2019 – Key Idea
• Key Idea
• It models the joint distribution of reaction templates and reactants using logic variables.
• It learns when rules from reaction templates should be applied.
[5] Dai et al.
31. Machine learning based methods
Template-based: Graph Logic Network [5] 2019 – Background
• Retrosynthesis Template
• Applying a retrosynthesis template can be decomposed into two logic steps:
• Match template
• Match reactants
[5] Dai et al.
32. Machine learning based methods
Template-based: Graph Logic Network [5] 2019 – Method
• Match template
• Match reactants
• Uncertainty
• Template score function
• Reactants score function
[5] Dai et al.
33. Machine learning based methods
Template-based: Graph Logic Network [5] 2019 – Method
• Final joint probability
• Parameterized by a GNN (Graph Neural Network)*4
[5] Dai et al. *4: Appendix-4
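The equations for the score functions and the final joint probability are not preserved in this transcript. As a hedged sketch of how the pieces listed above fit together, the block below writes out the two-step factorization with energy-style scores. The symbols are assumptions introduced here for illustration: $O$ is the target product, $T$ a template, $R$ a reactant set, $w_1$ and $w_2$ the GNN-parameterized template and reactant score functions, and $\phi$ the 0/1 logic indicators for "template matches" and "reactants match".

```latex
\begin{align}
p(T \mid O) &\propto \exp\big(w_1(T, O)\big)\,\phi_O(T) \\
p(R \mid T, O) &\propto \exp\big(w_2(R, T, O)\big)\,\phi_{O,T}(R) \\
p(R, T \mid O) &\propto \exp\big(w_1(T, O) + w_2(R, T, O)\big)\,\phi_O(T)\,\phi_{O,T}(R)
\end{align}
```

Under this reading, the indicators zero out any template that does not match the product and any reactant set that does not match the template, while the learned scores express the uncertainty over the remaining choices.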
34. Machine learning based methods
Template-based: Graph Logic Network [5] 2019 – Method
• MLE with efficient inference
• Gradient approximation
[5] Dai et al.
35. Machine learning based methods
Template-based: Graph Logic Network [5] 2019 – Results
• Top-k results
• Contribution
• Interpretability: integration of probabilistic models and templates (chemical rules)
• Limitation
• It shares the limitations of template-based methods.
• Scalability
[5] Dai et al.
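Top-k results of this kind are typically reported as top-k accuracy: a prediction counts as correct if the ground-truth reactants appear among the first k ranked outputs. The sketch below assumes predictions and targets are already canonicalized SMILES strings so that exact string comparison is meaningful. The function name and the toy data are hypothetical.

```python
from typing import List


def top_k_accuracy(predictions: List[List[str]], targets: List[str], k: int) -> float:
    """Fraction of examples whose target appears in the first k ranked predictions."""
    hits = sum(1 for preds, target in zip(predictions, targets) if target in preds[:k])
    return hits / len(targets)


ranked_predictions = [
    ["CCO.CC", "CC"],  # correct answer ranked second
    ["C", "CO"],       # correct answer ranked second
    ["N"],             # correct answer missing
]
ground_truth = ["CC", "CO", "O"]

print(top_k_accuracy(ranked_predictions, ground_truth, k=1))
print(top_k_accuracy(ranked_predictions, ground_truth, k=2))
```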
36. Machine learning based methods
Template-free: Seq2Seq [21] 2017
[21] Liu et al.
37. Machine learning based methods
Template-free: Seq2Seq [21] 2017 – Method
• It tokenizes SMILES and treats retrosynthesis as machine translation.
• It uses a bidirectional LSTM for the encoder and decoder.
• It uses beam search to produce a set of reactants.
[21] Liu et al.
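To make the tokenization step concrete, here is a minimal regex-based SMILES tokenizer. The pattern is an assumption written for this example, not the one used in [21]: it keeps bracket atoms, the two-letter elements Cl and Br, and two-digit ring closures as single tokens, and splits everything else into single characters.

```python
import re
from typing import List

# Order matters: multi-character tokens must be tried before single characters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#$:/\\().+\-@]")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; raise if some character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


print(tokenize("CC(=O)Cl"))
print(tokenize("c1ccccc1[N+](=O)[O-]"))
```

The token sequences of the product and of the reactants then play the role of source and target sentences in the translation model.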
38. Machine learning based methods
Template-free: Seq2Seq [21] 2017 – Results
• Results
• It performs comparably to the rule-based expert system baseline.
• Contribution
• It shows that a fully data-driven seq2seq model can learn retrosynthetic pathways.
• Limitations
• It produces grammatically invalid SMILES and chemically implausible predictions.
• It is a naïve application of the seq2seq model.
• Predictions generated by a vanilla seq2seq model with beam search typically exhibit low diversity, with only minor differences in the suffix [8].
[21] Liu et al., [8] Chen et al.
39. Machine learning based methods
Template-free: Seq2Seq [21] 2017 – Results
• Grammatically invalid SMILES
• Grammatically valid but chemically implausible
[21] Liu et al.
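The difference between the two failure modes can be made concrete with a very rough grammar check. The sketch below only verifies that brackets and parentheses are balanced and that every single-digit ring closure label is opened and closed. It is an assumption-laden toy: it ignores two-digit `%nn` ring closures and says nothing about chemical plausibility, for which a toolkit such as RDKit would be needed.

```python
def basic_smiles_check(smiles: str) -> bool:
    """Return False for some grammatically invalid SMILES (unbalanced brackets or ring labels)."""
    depth = 0            # current parenthesis nesting
    open_rings = set()   # ring closure digits seen an odd number of times
    in_bracket = False   # inside a [...] atom, where digits are not ring labels
    for ch in smiles:
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif in_bracket:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first occurrence opens, second closes
    return depth == 0 and not open_rings and not in_bracket


for s in ["c1ccccc1", "c1ccccc", "CC(=O", "CC(=O)O", "[NH4+]"]:
    print(s, basic_smiles_check(s))
```

A string that passes this check can still be chemically implausible, for example a carbon with too many bonds, which is the second failure mode listed on the slide.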
40. Machine learning based methods
Template-free: Molecular Transformer [6, 7] 2019
[6] Schwaller et al., [7] Lee et al.
41. Machine learning based methods
Template-free: Molecular Transformer [6, 7] 2019 – Key Idea
• Key Idea
• Like [21], it tokenizes SMILES and treats retrosynthesis as machine translation.
• It uses a Transformer instead of an LSTM.
• It performs better than seq2seq [21] but has the same limitations.
[6] Schwaller et al., [7] Lee et al. [21] Liu et al.
43. Machine learning based methods
Template-free: LV-MT [8] 2019 – Key Idea
• It extends the Molecular Transformer (MT) to generalize better to rare reactions and to produce diverse paths.
• Key Idea
• It proposes novel pretraining methods.
• Random bond cut
• Template-based bond cut
• It trains a mixture model with the online hard-EM algorithm.
[8] Chen et al.
44. • Pretrain methods
• Random bond cut
• For each input target product, it generates new examples by selecting a random
bond to break.
• Template-based bond cut
• Instead of randomly breaking bonds, it uses the templates to break bonds.
• The model is pre-trained on these auxiliary examples, and then used as initialization
to be fine-tuned on the actual retrosynthesis data.
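The random bond cut can be sketched on a toy graph, assuming a molecule is given as a number of atoms plus a list of bonds. The helper names `connected_components` and `random_bond_cut` are hypothetical; a real pipeline would operate on molecules with a cheminformatics toolkit and write the resulting fragments back out as SMILES.

```python
import random


def connected_components(n_atoms, bonds):
    """Return the atom-index sets of each connected fragment."""
    adj = {i: set() for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    seen, fragments = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        stack, fragment = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            fragment.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        fragments.append(sorted(fragment))
    return fragments


def random_bond_cut(n_atoms, bonds, rng):
    """Break one randomly chosen bond and return (cut bond, fragments)."""
    cut = rng.choice(bonds)
    remaining = [b for b in bonds if b != cut]
    return cut, connected_components(n_atoms, remaining)


# Toy 4-atom chain 0-1-2-3; each call yields one auxiliary pretraining pair
# (product graph -> fragments).
cut, fragments = random_bond_cut(4, [(0, 1), (1, 2), (2, 3)], random.Random(0))
print(cut, fragments)
```

The template-based variant would replace `rng.choice(bonds)` with the bonds indicated by a matching template.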
Template-free: LV-MT [8] 2019 – Method (Pretrain)
[8] Chen et al
45. • Why are latent variables introduced?
• It tackles the problem of generating diverse predictions.
• The outputs of beam search tend to be similar to each other.
• Given a target SMILES string x and reactants SMILES string y, a mixture model
introduces a multinomial latent variable z ∈ { 1, · · · , K } to capture different reaction
types, and decomposes the marginal likelihood as:
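The formula itself did not survive extraction. A standard mixture decomposition consistent with the definitions above and with the term p(y | z, x; θ) used on the next slide would read as follows; the form of the prior p(z | x; θ) is an assumption here.

```latex
p(y \mid x; \theta)
  = \sum_{z=1}^{K} p(y, z \mid x; \theta)
  = \sum_{z=1}^{K} p(z \mid x; \theta)\, p(y \mid z, x; \theta)
```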
Template-free: LV-MT [8] 2019 – Method (Latent Var.)
[8] Chen et al
46. • Hard-EM algorithm
1. Take a mini-batch of training examples.
2. Enumerate all K values of z and compute the loss for each.
• Dropout should be turned off [26].
3. For each example, select the value of z that yields the minimum loss.
• For p(y | z, x; θ), the encoder-decoder network is shared among mixture components, and
the embedding of z is fed as an input to the decoder so that y is conditioned on it.
4. Back-propagate through the selected component, so only one component receives gradients per example.
• Dropout should be turned back on [26].
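The selection step of hard-EM can be sketched as follows. This is a minimal illustration under stated assumptions: `loss_fn(x, y, z)` stands in for the negative log-likelihood of the shared encoder-decoder conditioned on component z, the function name `hard_em_assign` is hypothetical, and components are indexed from 0 rather than 1.

```python
def hard_em_assign(batch, loss_fn, num_components):
    """For each (x, y) example, pick the component z with the minimum loss.

    Returns a list of (best_z, best_loss) pairs. In training, this pass would
    run with dropout off, and only best_loss would be back-propagated (with
    dropout back on), so one component receives gradients per example.
    """
    assignments = []
    for x, y in batch:
        losses = [loss_fn(x, y, z) for z in range(num_components)]
        best_z = min(range(num_components), key=losses.__getitem__)
        assignments.append((best_z, losses[best_z]))
    return assignments


# Toy stand-in for the model loss: component z "predicts" x + z.
def toy_loss(x, y, z):
    return abs(y - (x + z))


print(hard_em_assign([(1, 1), (1, 3)], toy_loss, num_components=3))
# [(0, 0), (2, 0)]
```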
[8] Chen et al., [26] Shen et al.
Template-free: LV-MT [8] 2019 – Method (Latent Var.)
47. • Results*5
*5: We report better hyper-parameters and the results in Appendix 5.
Template-free: LV-MT [8] 2019 – Results
48. • Contributions
• It proposes novel pretraining methods for retrosynthesis.
• It uses a mixture-model Transformer for diverse predictions.
• Limitations
• Top-1 performance gets worse as more latent classes are used.
• The latent variable does not appear to contain information about the reaction class.
Template-free: LV-MT [8] 2019 – Results
[8] Chen et al
49. • Template-free: Self-Corrected Transformer (SCROP) [22] (2020)
[22] Zheng et al.
Template-free: SCROP [22] 2020
50. • Template-free: Self-Corrected Transformer (SCROP) [22] (2020)
• Key Idea
• It uses a Transformer to correct invalid predicted SMILES.
• It builds syntax-correction data from a trained Transformer by collecting pairs of invalid
predictions and their ground truths.
• It trains a second Transformer as a syntax corrector on the syntax-correction data.
• At test time, it keeps the top-1 candidate produced by the syntax corrector and uses it to
replace the original invalid prediction.
[22] Zheng et al.
Template-free: SCROP [22] 2020 – Key Idea
51. • Results
• Compared to the plain Transformer (SCROP-noSC), performance improves by 0.4 to 1.7%.
Template-free: SCROP [22] 2020 – Results
[22] Zheng et al.
52. • Invalid SMILES rates
• Limitations
• Why SCROP? Invalid SMILES can be removed with RDKit, without a learned model.
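The filtering alternative can be sketched without any learned model. To keep the example dependency-free, `looks_valid` below is a toy syntax check (balanced parentheses and brackets, paired ring-closure digits) standing in for a real parser; both function names are assumptions made for this example.

```python
def looks_valid(smiles):
    """Toy stand-in for a SMILES parser: checks nesting and ring closures only."""
    depth, in_bracket, open_rings = 0, False, set()
    for ch in smiles:
        if in_bracket:
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: a ring digit opens, then closes
    return bool(smiles) and depth == 0 and not in_bracket and not open_rings


def clean_candidates(candidates, is_valid=looks_valid):
    """Drop invalid and repeated candidates while preserving beam order."""
    seen, kept = set(), []
    for smiles in candidates:
        if is_valid(smiles) and smiles not in seen:
            seen.add(smiles)
            kept.append(smiles)
    return kept


beam = ["CC(=O)O", "CC(=O", "c1ccccc1", "c1ccccc", "CC(=O)O"]
print(clean_candidates(beam))  # ['CC(=O)O', 'c1ccccc1']
```

With RDKit installed, `is_valid` could instead be `lambda s: Chem.MolFromSmiles(s) is not None`, since `Chem.MolFromSmiles` returns `None` for strings it cannot parse.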
[22] Zheng et al.
Template-free: SCROP [22] 2020 – Results
53. • Template-free: Graph2Graph (G2G) [9] (ICML 2020)
[9] Shi et al.
Template-free: G2G [9] 2020
54. • Key Idea
• It decomposes retrosynthesis into a two-step procedure:
• Breaking the target product
• Transforming the broken target product
• It trains a Reaction Center Identification (RCI) module that produces synthon(s) by breaking bonds in the
product graph.
• It trains a Variational Graph Translation module that produces reactants via a series of graph
transformations.
Template-free: G2G [9] 2020 – Key Idea
[9] Shi et al.
55. • Reaction Center Identification (RCI)
• It uses an R-GCN [27] to learn graph representations.
• Overview
1. Given a chemical reaction, it derives a binary label matrix over atom pairs.
2. It computes node embeddings and a graph embedding.
3. To estimate the reactivity score of atom pair (i, j), the edge embedding is formed by
concatenating several features.
4. The final reactivity score of the atom pair (i, j) is calculated from this edge embedding.
5. The RCI is optimized by minimizing the cross entropy with respect to the binary labels.
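The training objective in the last step can be sketched as a mean binary cross entropy over atom pairs. This is a simplified sketch, assuming `scores[i][j]` is a predicted reactivity in (0, 1) and `labels[i][j]` is the binary label; any class weighting used in [9] is not reproduced, and the name `rci_loss` is hypothetical.

```python
import math


def rci_loss(scores, labels, eps=1e-12):
    """Mean binary cross entropy over all ordered atom pairs (i, j), i != j."""
    n = len(scores)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            s = min(max(scores[i][j], eps), 1.0 - eps)  # clamp for log safety
            y = labels[i][j]
            total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
            count += 1
    return total / count


# Toy 3-atom product where only the bond between atoms 0 and 1 is a reaction center.
labels = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
confident = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.1], [0.1, 0.1, 0.0]]
uninformed = [[0.5] * 3 for _ in range(3)]
print(rci_loss(confident, labels) < rci_loss(uninformed, labels))  # True
```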
Template-free: G2G [9] 2020 – Method (RCI)
[9] Shi et al. [27] Schlichtkrull et al.
56. • Reactants generation via Variational Graph Translation (VGT).
1. It receives synthons from the RCI and transforms them into reactants.
2. It generates a sequence of graph transformation actions and applies them to
the initial synthon graph.
• It formulates graph generation as a Markov Decision Process (MDP).
Template-free: G2G [9] 2020 – Method (VGT)
[9] Shi et al.
57. • Reactants generation via Variational Graph Translation (VGT).
• Overview
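1. Let the transformation trajectory be the sequence of graph transformation actions; the graph
transformation is deterministic once the trajectory is defined.
2. Each intermediate graph is the result of applying the actions taken so far to the initial synthon graph.
3. Under the MDP assumption, each action depends on the earlier actions only through the current
intermediate graph.
4. Finally, the graph transformation can be factorized over the steps of the trajectory.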
Template-free: G2G [9] 2020 – Method (VGT)
[9] Shi et al.
58. • Reactants generation via Variational Graph Translation (VGT).
• Overview (cont’d)
5. Each action is defined as a tuple.
6. It decomposes the action distribution into 3 parts:
i. Termination prediction
ii. Nodes selection
iii. Edge labeling
7. It uses variational inference by introducing an approximate posterior.
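The statement that the transformation is deterministic once the trajectory is fixed can be illustrated on a toy graph. The encoding is an assumption made for this sketch: a graph is a dict mapping an atom pair to a bond label, an action is a tuple `(i, j, label)`, and termination is simply the end of the action list; [9] may define actions differently.

```python
def apply_trajectory(synthon_bonds, actions):
    """Apply (i, j, label) actions in order and return the resulting bond dict."""
    bonds = dict(synthon_bonds)  # copy so the input synthon is left untouched
    for i, j, label in actions:
        key = (min(i, j), max(i, j))  # undirected edge
        bonds[key] = label  # add a new bond or relabel an existing one
    return bonds


# Toy synthon with one single bond; the trajectory attaches two more atoms.
synthon = {(0, 1): 1}
trajectory = [(1, 2, 1), (3, 2, 2)]
print(apply_trajectory(synthon, trajectory))
# {(0, 1): 1, (1, 2): 1, (2, 3): 2}
```

In the full model, each action would be sampled from the termination, node-selection, and edge-labeling distributions conditioned on the current intermediate graph.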
[9] Shi et al.
Template-free: G2G [9] 2020 – Method (VGT)
59. • Top-k result
[9] Shi et al.
Reaction class is given | Reaction class is unknown
Template-free: G2G [9] 2020 – Results
60. • Module performance
• Contribution
• It offers a novel formulation of retrosynthesis prediction as a graph-to-graphs translation task.
• Limitation
• A well-tuned Molecular Transformer performs better.
Template-free: G2G [9] 2020 – Results
[9] Shi et al.
61. • Template-free: GraphRetro [10] (arXiv 2020)
Template-free: GraphRetro [10] 2020
[10] Somnath et al.
62. • Template-free: GraphRetro [10] (arXiv 2020)
• Key Idea
• It also uses the idea of breaking and modifying graphs, like G2G [9].
• G2G [9] modifies the graph at the level of atoms, whereas GraphRetro operates at the level of molecular
fragments called leaving groups.
• G2G: Sequential generation
• GraphRetro: Leaving group selection
Template-free: GraphRetro [10] 2020 – Key Idea
[10] Somnath et al.
63. • Top-k result
Template-free: GraphRetro [10] 2020 - Results
[10] Somnath et al.
64. • Module performance
• Contribution
• Choosing a leaving group is a good idea for retrosynthesis problems
• Limitation
• Domain knowledge is required to create a leaving group vocabulary
Template-free: GraphRetro [10] 2020 - Results
[10] Somnath et al.
67. Machine learning based
Selection-based: Bayesian Retrosynthesis [11] – Key Idea
• Key Idea
• It uses a pre-trained forward model as the likelihood in Bayes’ theorem and works with an approximate
posterior distribution over reactants.
• It uses Monte Carlo search to explore synthetic routes.
[11] Guo et al.
68. Machine learning based
Selection-based: Bayesian Retrosynthesis [11] – Method
• Method
• Likelihood is the Boltzmann distribution with an inverse temperature.
• Energy function: Tanimoto distance between target product and predicted product
• Approximate posterior
• Exact computation across all candidates is generally infeasible.
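The likelihood described above can be sketched as follows, assuming fingerprints are given as sets of "on" bit indices. The function names and the value of the inverse temperature `beta` are hypothetical, and the result is unnormalized.

```python
import math


def tanimoto_distance(fp_a, fp_b):
    """1 - |A ∩ B| / |A ∪ B| for fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / len(union)


def boltzmann_likelihood(target_fp, predicted_fp, beta=10.0):
    """Unnormalized likelihood exp(-beta * energy), energy = Tanimoto distance."""
    return math.exp(-beta * tanimoto_distance(target_fp, predicted_fp))


target = {1, 2, 3}
print(tanimoto_distance(target, {2, 3, 4}))    # 0.5
print(boltzmann_likelihood(target, {1, 2, 3}))  # 1.0
```

Here `predicted_fp` would come from the product that the forward model predicts for a candidate reactant set, so candidates whose predicted product matches the target receive the highest likelihood.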
Predicted product by forward model (Molecular Transformer)
[11] Guo et al.
69. Machine learning based
Selection-based: Bayesian Retrosynthesis [11] – Method (SMC)
• Method (Cont’d)
• Sampling from the posterior
• Sequential Monte Carlo (SMC)
• Cons
• Particle impoverishment [38]
• Rapid loss of diversity
• Computation cost of using forward model (Molecular Transformer)
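Particle impoverishment can be seen in a single resampling step. The sketch below uses plain multinomial resampling and the usual effective-sample-size diagnostic; it is a generic SMC illustration with hypothetical particle names and weights, not the exact sampler of [11].

```python
import random


def effective_sample_size(weights):
    """(sum w)^2 / sum w^2: equals N for uniform weights, 1 if one weight dominates."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def resample(particles, weights, rng):
    """Multinomial resampling: draw N particles in proportion to their weights."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(particles, weights=probs, k=len(particles))


particles = ["route_a", "route_b", "route_c", "route_d"]  # hypothetical candidates
weights = [0.0, 0.0, 1.0, 0.0]  # one candidate dominates the likelihood
print(effective_sample_size(weights))                   # 1.0
print(resample(particles, weights, random.Random(0)))   # four copies of 'route_c'
```

After resampling, every particle is a copy of the same candidate, which is the rapid loss of diversity listed above.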
[11] Guo et al., [38] Stavropoulos et al.
70. Machine learning based
Selection-based: Bayesian Retrosynthesis [11] – Method
• Method (Cont’d)
• SMC accelerated by surrogate likelihood.
• It trains a gradient boosting regression tree that predicts the likelihood given by the Molecular
Transformer.
[11] Guo et al.
72. Table of Contents
• Introduction
• Retrosynthesis prediction
• Dataset description
• Overview of general approaches: Template-based, Template-free, Selection-based
• Proposed methods
• Classical computer-aided methods
• Machine learning based methods
• Challenges
• Practice
• RDKit
• OpenNMT
• Related works
• Future directions
• Reference
• Appendix
73. Challenges
Challenge 1. Balancing between template-free and template-based model
Challenge 2. Multi-Step retrosynthesis
Challenge 3. Extremely large space of synthesis routes
Challenge 4. Molecule decoding (Graph generation)
[3] Coley et al., [14] Segler et al.
74. Challenges:
1. Balancing between template-free and template-based model
• How about a hybrid model that uses uncertainty?
Template-based
• Pros
• High interpretability
• Cons
• Low generalizability
• Requires domain knowledge
Template-free
• Pros
• Generalizability
• Cons
• Invalid/inaccessible predictions
• Low interpretability
75. • Most real-world molecules cannot be synthesized in a single step.
• A synthesis can take up to 60 steps or even more.
• Error accumulation
• Extremely large space
• The most recent work [13] uses neural-guided A* search.
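To make the search idea concrete, here is a heavily simplified A*-style planner. It is a sketch under stated assumptions, not the Retro* algorithm of [13]: states are sets of molecules that still need to be made, `expand` is a stand-in for a single-step retrosynthesis model returning `(cost, reactants)` options, and the heuristic defaults to zero where [13] would use a learned value estimate. All names and the toy reaction table are hypothetical.

```python
import heapq
import itertools


def zero_heuristic(unsolved):
    return 0.0


def plan_route(target, expand, building_blocks, heuristic=zero_heuristic,
               max_iter=10_000):
    """Return (total_cost, route) for the cheapest route found, or None."""
    tie = itertools.count()  # tie-breaker so the heap never compares states
    start = frozenset([target]) - building_blocks
    frontier = [(heuristic(start), 0.0, next(tie), start, [])]
    best_cost = {start: 0.0}
    for _ in range(max_iter):
        if not frontier:
            return None
        _, cost, _, unsolved, route = heapq.heappop(frontier)
        if not unsolved:  # everything reduces to purchasable building blocks
            return cost, route
        mol = min(unsolved)  # deterministic choice of the molecule to expand
        for step_cost, reactants in expand(mol):
            rest = (unsolved - {mol}) | (frozenset(reactants) - building_blocks)
            new_cost = cost + step_cost
            if new_cost < best_cost.get(rest, float("inf")):
                best_cost[rest] = new_cost
                step = (mol, tuple(reactants))
                heapq.heappush(frontier, (new_cost + heuristic(rest), new_cost,
                                          next(tie), rest, route + [step]))
    return None


# Toy single-step "model": molecule -> list of (cost, reactants).
REACTIONS = {"D": [(1.0, ("B", "C")), (5.0, ("A",))], "C": [(1.0, ("A",))]}
result = plan_route("D", lambda m: REACTIONS.get(m, []), {"A", "B"})
print(result)  # (2.0, [('D', ('B', 'C')), ('C', ('A',))])
```

Costs could be taken as negative log-probabilities from the single-step model, so that the cheapest route is the most probable one.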
[13] Chen et al.
Challenges:
2. Multi-Step retrosynthesis
76. • Each molecule could be synthesized from hundreds of different possible sets of reactants.
• How should the quality of a synthesis route be measured?
Challenges:
3. Extremely large space of synthesis routes
77. • Modeling complex distributions over graphs and then efficiently sampling is challenging!
• Why is it challenging?
• Non-unique
• High dimensional nature of graphs
• Complex, non-local dependencies between nodes and edges.
• Proposed methods
• Graph VAE [29] (ICANN 2018)
• Graph RNN [30] (ICML 2018)
• GRAN [31] (NeurIPS 2019)
• Junction tree VAE [35] (ICML 2018)
[29] Simonovsky et al., [30] You et al., [31] Liao et al., [35] Jin et al.
Challenges:
4. Molecule decoding (Graph generation)
79. Practice: RDKit
• Data pre-processing (RDKit)
• RDKit[20] is an open-source library for Cheminformatics.
• https://www.rdkit.org
• Why RDKit?
• Visualizing
• Substructure searching
• Calculate molecule similarity
• Validity check
• Various functions for cheminformatics
• We uploaded an RDKit tutorial notebook:
• https://github.com/wonjun-dev/contrastive-retro
80. Practice: OpenNMT
• OpenNMT
• OpenNMT [28] is an open-source library for neural machine translation.
• https://opennmt.net
• Why OpenNMT?
• It supports various models for the encoder-decoder framework.
• Built-in functions.
• Easy to engineer.
• Cons
• Very large codebase
• Limited flexibility
• Disjoint procedure (training, inference, and performance check are separate steps)*7
[28] Klein et al., *7: We made a fully automated script.
81. Practice: OpenNMT – Where you should change
• OpenNMT
• Primary files in OpenNMT
• Data loader
• preprocess.py
• inputter.py (./onmt/inputters)
• Options
• opts.py (./onmt) => Options for training, translation, preprocessing, etc. You can
add your own options here.
• Train
• train.py => Entry point of training
• train_single.py (./onmt) => Second entry point of training
• trainer.py (./onmt) => Main training loop
• loss.py (./onmt/utils) => Several classes for loss functions
• Model
• model_builder (./onmt)
• model.py (./onmt/models) => Model class
• model_saver (./onmt/models)
• Translation
• translate.py => Entry point of translation
• translator.py (./onmt/translate) => Translator class
• Performance check
• parse_output.py (./parse) => Parse predicted output and calculate accuracy via RDKit.
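The performance check can be sketched as a top-k exact-match computation. This is not the contents of `parse_output.py`; the function name `top_k_accuracy` and its parameters are assumptions, and the `canonicalize` hook defaults to the identity so the example runs without RDKit.

```python
def top_k_accuracy(predictions, targets, ks=(1, 3, 5, 10),
                   canonicalize=lambda s: s):
    """predictions[i] is the ranked candidate list for targets[i]."""
    hits = {k: 0 for k in ks}
    for candidates, target in zip(predictions, targets):
        canon_target = canonicalize(target)
        canon = [canonicalize(c) for c in candidates]
        for k in ks:
            if canon_target in canon[:k]:
                hits[k] += 1
    n = len(targets)
    return {k: hits[k] / n for k in ks}


# Placeholder strings stand in for reactant SMILES.
preds = [["A", "B"], ["C", "D"], ["E", "F"]]
truth = ["A", "D", "X"]
print(top_k_accuracy(preds, truth, ks=(1, 2)))
```

With RDKit, `canonicalize` could wrap `Chem.MolToSmiles(Chem.MolFromSmiles(s))` and map unparsable strings to a sentinel, so that different spellings of the same molecule count as a match.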
82. Practice: OpenNMT – Automated script
• OpenNMT
• We provide a fully automated (training to parsing) script.
• https://github.com/wonjun-dev/contrastive-retro @master branch
• run_experiment_mt.sh
• Train – Inference (Translate) – Performance check (Parse) – Averaging
• arg[0] : GPU id
• arg[1]: seed
• run_average.py
• The performance variation of MT and LV-MT is quite large depending on seed.
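Because results vary with the seed, it helps to report the mean and standard deviation over several runs. The sketch below is not `run_average.py`; the function name is hypothetical and the accuracy values are made-up placeholders used only to show the calculation.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """runs: {seed: {k: accuracy}} -> {k: (mean, sample std)} across seeds."""
    ks = sorted(next(iter(runs.values())))
    summary = {}
    for k in ks:
        values = [result[k] for result in runs.values()]
        summary[k] = (mean(values), stdev(values))  # stdev needs >= 2 seeds
    return summary


# Placeholder numbers, NOT results from the slides.
runs = {0: {1: 0.40, 10: 0.60}, 1: {1: 0.44, 10: 0.66}}
for k, (m, s) in summarize_runs(runs).items():
    print(f"top-{k}: {m:.3f} +/- {s:.3f}")
```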
84. Related works
• Forward synthesis
• Given reactants and reagents, predict the products.
• [7, 34, 36, 37]
• Reaction center prediction
• The task of identifying the reaction center is related to the step of deriving the synthons
(intermediate outcomes) in retrosynthesis.
• [9, 10, 33, 34]
• Graph generation
• Generative models for real-world graphs, including social, chemical and knowledge graphs
• [29, 30, 31, 35]
86. Future directions
• Training chemical language models like BERT
• Learning better chemical representation
• Atomic or molecular embedding considering chemical properties
• Robust to SMILES augmentation
• Contrastive learning
• Template-Generative Hybrid model
• Graph encoding – SMILES decoding
• Graph decoding is challenging
• Predictive model for subgraph isomorphism
• Subgraph isomorphism is an NP-complete problem, so it is not scalable.
87. References
[1] Weininger et al. “A chemical language and information system. 1. introduction to methodology and encoding
rules.” Journal of Chemical Information and Modeling, 1988.
[2] Christ et al. “Mining electronic laboratory notebooks: Analysis, retrosynthesis, and reaction based
enumeration.” Journal of Chemical Information and Modeling, 2012.
[3] Coley et al. “Computer-assisted retrosynthesis based on molecular similarity.” ACS Central Science, 2017.
[4] Klucznik et al. “Efficient syntheses of diverse, medicinally relevant targets planned by computer and executed
in the laboratory.” Chem, 2018.
[5] Dai et al. “Retrosynthesis prediction with conditional graph logic network”. NeurIPS, 2019.
[6] Schwaller et al. “Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction.” ACS
Central Science, 2019.
[7] Lee et al. “Molecular transformer unifies reaction prediction and retrosynthesis across pharma chemical space.”
Chemical Communications, 2019.
[8] Chen et al. “Learning to make generalizable and diverse predictions for retrosynthesis.” arXiv preprint 2019.
[9] Shi et al. “A graph to graphs framework for retrosynthesis prediction.”, ICML, 2020
[10] Somnath et al. “Learning graph models for template-free retrosynthesis.”, arXiv, 2020
[11] Guo et al. “A Bayesian algorithm for retrosynthesis.”, arXiv, 2020
[12] Lin et al. “Automatic retrosynthetic route planning using template-free models.”, Chem. Sci., 2020
[13] Chen et al. “Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search”, ICML, 2020
88. References
[14] Segler et al., “Neural-Symbolic machine learning for retrosynthesis and reaction prediction.”, Chemistry-A European
Journal, 2017
[15] Satoh et al., “A novel approach to retrosynthetic analysis using knowledge bases derived from reaction databases.”,
Chem. Inf. Comput. Sci., 1999
[16] Law et al., “Route designer: A retrosynthetic analysis tool utilizing automated retrosynthetic rule generation.”, Chem.
Inf., 2009
[17] Gasteiger et al., “A collection of computer methods for synthesis design and reaction prediction.”, Recl. Trav. Chim.
Pays-Bas, 1992
[18] Corey et al., “Computer-assisted analysis in organic synthesis.”, Science, 1985
[19] Corey et al., “The logic of chemical synthesis: Multistep synthesis of complex carbogenic molecules. (Nobel lecture)”,
1991
[20] http://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf
[21] Liu et al., “Retrosynthetic reaction prediction using neural sequence-to-sequence models.”, ACS Cent. Sci., 2017
[22] Zheng et al., “Predicting retrosynthetic reactions using self-corrected transformer neural networks.”, J. Chem. Inf.
Model., 2020
[23] Srivastava et al., “Highway networks”, NIPS, 2015
[24] https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fchem.201605499&file=chem201605499-sup-0001-misc_information.pdf
[25] http://www.reaxys.com, Reaxys is a registered trademark of RELX Intellectual Properties SA used under license.
[26] Shen et al., “Mixture models for diverse machine translation: Tricks of the trade.”, arXiv, 2019
89. References
[27] Schlichtkrull et al., “Modeling relational data with graph convolutional networks.”, In European
Semantic Web Conference, 2018
[28] Klein et al., “OpenNMT: Open-Source Toolkit for Neural Machine Translation.”, arXiv, 2017
[29] Simonovsky et al., “GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders.”,
ICANN, 2018
[30] You et al., “GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models.”, ICML, 2018
[31] Liao et al., “Efficient Graph Generation with Graph Recurrent Attention Networks.”, NeurIPS, 2019
[32] Mayfield et al., “Pistachio 2.0 edn software.”, 2018
[33] Coley et al., “A graph-convolutional neural network model for the prediction of chemical reactivity.”,
Chemical Science 2019
[34] Coley et al., “Predicting organic reaction outcomes with Weisfeiler-Lehman Network.”, NeurIPS, 2017
[35] Jin et al., “Junction Tree Variational Autoencoder for molecular graph generation.”, ICML, 2018
[36] Bradshaw et al., “A generative model for electron paths.”, ICLR, 2019
[37] Do et al., “Graph transformation policy network for chemical reaction prediction.”, KDD, 2019
[38] Stavropoulos et al., “Sequential Monte Carlo method in practice.”, Springer, 2001
90. Appendix
1. Subgraph isomorphism problem
• It is a computational task in which two graphs G and H are given as input, and one must
determine whether G contains a subgraph that is isomorphic to H.
• NP-Complete
2. Molecular similarity metrics (x and y are molecular fingerprint)
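The formulas on this slide did not survive extraction. Two commonly used fingerprint similarity metrics are given below as a hedged reconstruction, treating x and y as the sets of "on" bits; whether the original slide listed exactly these metrics is an assumption.

```latex
\mathrm{Tanimoto}(x, y) = \frac{|x \cap y|}{|x \cup y|}
\qquad
\mathrm{Dice}(x, y) = \frac{2\,|x \cap y|}{|x| + |y|}
```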
91. Appendix
3. Reaction class
• Meta-information about type of chemical reactions.
• In USPTO, there are 10 reaction classes
93. Appendix
5. Better hyper-parameters of MT and the results.
• Dropout p=0.25 is better than p=0.1.
• We can remove invalid and repeated SMILES via RDKit.
• Also, using 6 layers and increasing the dropout rate is better than using 4 layers.
Model                            Top 1   Top 3   Top 5   Top 10
MT [8]                           0.420   0.570   0.619   0.657
MT (p=0.25, w/o inval/repeat)    0.432   0.645   0.709   0.771