Drug Repurposing using Deep Learning on Knowledge Graphs (Databricks)
Discovering new drugs is a lengthy and expensive process. This means that finding new uses for existing drugs can help create new treatments in less time and at lower cost. The difficulty is in finding these potential new uses.
How do we find these undiscovered uses for existing drugs?
We can unify the available structured and unstructured data sets into a knowledge graph. This is done by fusing the structured data sets, and performing named entity extraction on the unstructured data sets. Once this is done, we can use deep learning techniques to predict latent relationships.
In this talk we will cover:
Building the knowledge graph
Predicting latent relationships
Using the latent relationships to repurpose existing drugs
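As a concrete illustration of predicting latent relationships, here is a minimal, hypothetical sketch of TransE-style link prediction over a toy knowledge graph. The entities, relation vector, and embedding values are invented for illustration and are not from the talk.

```python
import math

# Toy knowledge-graph embeddings (hypothetical entities and values).
# TransE-style scoring: a triple (head, relation, tail) is plausible when
# head + relation is close to tail, i.e. ||h + r - t|| is small.
embeddings = {
    "drug:metformin":   [0.1, 0.9],
    "drug:aspirin":     [0.8, 0.1],
    "disease:diabetes": [0.2, 1.0],
}
relations = {"treats": [0.1, 0.1]}

def score(head, rel, tail):
    """Negative Euclidean distance: higher means a more plausible edge."""
    h, r, t = embeddings[head], relations[rel], embeddings[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Rank candidate drugs for a latent "treats" edge to a disease.
candidates = ["drug:metformin", "drug:aspirin"]
ranked = sorted(candidates,
                key=lambda d: score(d, "treats", "disease:diabetes"),
                reverse=True)
print(ranked[0])  # the highest-scoring repurposing candidate
```

In practice the embeddings would be learned from the fused knowledge graph rather than hand-set, and the top-ranked unseen edges become repurposing hypotheses.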
Use of Spark for proteomic scoring: Seattle presentation (lordjoe)
Slides presented to the Seattle Spark Meetup on August 12 2015 - Note the work on Accumulators is a separate GitHub project https://github.com/lordjoe/SparkAccumulators
Open chemistry registry and mapping platform based on open source cheminforma... (Valery Tkachenko)
The Open PHACTS project (openphacts.org) is a European initiative, constituting a public–private partnership to enable easier, cheaper and faster drug discovery. The project is supported by the OpenPHACTS Foundation (www.openphactsfoundation.org) and funded by contributions from several pharmaceutical companies. As part of Open PHACTS, a 'Chemical Registration Service' was created to register chemicals of interest to the project, allowing compound linkage between data sets. A key concept is the support for 'scientific lenses', which allow hierarchical mapping of chemical entities, including supporting characteristics such as charge state, tautomerism and stereochemistry. Open PHACTS aggregated various databases, including ChEMBL, ChEBI, HMDB, DrugBank, PDB, MeSH, and WikiPathways. A new project builds on the Chemical Registration Service to establish an open chemistry registry and mapping service for general data set linkage. This expansion requires the support of multiple cheminformatics formats, the conversion and mapping of various identifiers, harmonized but configurable standardization, validation of the chemical structures, and the creation of new identifiers, to produce scientific lenses, or 'link sets'. Furthermore, these identifiers will be related to the compounds' chemical names (IUPAC and trivial) and related chemical structures. This presentation will describe our ongoing work to create a fully open source, easy-to-install platform, which supports the ideas introduced by the Open PHACTS project and expands them with community data including, for example, the data now available from the EPA CompTox Chemistry Dashboard (comptox.epa.gov). This new platform supports multiple chemical formats and provides for identifier conversion and cross-validation between datasets. The project is completely based on open source cheminformatics toolkits and is available as a set of libraries, docker images and a web frontend based on FAIR and Open Data principles.
The openness of this platform will allow for scientists to process their own datasets, and make them interoperable with other online chemical databases.
Presented at Artificial Intelligence and Machine Learning for Advanced Drug Discovery & Development 2019 on 28th May 2019 by Dr Ed Griffen of MedChemica Ltd
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas... (Kamel Mansouri)
This presentation highlighted how data curation impacts the reliability of QSAR models. We examined key datasets related to environmental endpoints to validate across chemical structure representations (e.g., mol file and SMILES) and identifiers (chemical names and registry numbers), and approaches to standardize data into QSAR-ready formats prior to modeling procedures. This allowed us to quantify and segregate data into quality categories. This improved our ability to evaluate the resulting models that can be developed from these data slices, and to quantify to what extent efforts developing high-quality datasets have the expected pay-off in terms of predicting performance. The most accurate models that we build will be accessible via our public-facing platform and will be used for screening and prioritizing chemicals for further testing.
TalkingData is the largest independent big data service company in China. Their network covers 70% of the mobile services nationwide, with 3 billion ad clicks per day, of which 90% are potentially fraudulent. Click fraud is happening at an overwhelming volume, leading to misuse of data and wasted money. Hence, Kaggle (a U.S. platform for predictive modeling and analytics competitions) has partnered with TalkingData to help resolve this issue.
This paper builds predictive models using both traditional and Big Data methods to determine whether a smartphone app will be downloaded after a user clicks an advertisement. We used the 7 GB "TalkingData AdTracking Fraud Detection Challenge" data set provided by a Kaggle competition. Four classification models are implemented with this massive data set in order to predict fraud using both traditional and Big Data methods. We define a click as fraudulent when the user clicks an advertisement without downloading the app. Because the traditional platform cannot build models with data sets over a gigabyte, we use a sample of the data for the traditional models and the full data set for the models in the Big Data Spark ML systems. We also present the accuracy and performance of the models implemented in both systems.
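The labeling rule described above (a click with no subsequent download counts as fraud) can be sketched as follows; the record fields and values are simplified stand-ins for the competition schema, not actual competition data.

```python
# Hypothetical click records (fields simplified for illustration).
clicks = [
    {"ip": "1.2.3.4", "app": 12, "is_attributed": 0},  # clicked, never downloaded
    {"ip": "5.6.7.8", "app": 3,  "is_attributed": 1},  # clicked and downloaded
    {"ip": "9.9.9.9", "app": 12, "is_attributed": 0},
]

def is_fraud(click):
    # A click is labeled fraudulent when the ad was clicked but the
    # app was never downloaded (is_attributed == 0).
    return click["is_attributed"] == 0

labels = [is_fraud(c) for c in clicks]
print(sum(labels) / len(labels))  # fraction of fraudulent clicks
```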
Using open bioactivity data for developing machine-learning prediction models... (Sunghwan Kim)
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 22, 2018).
==== Abstract ====
The retinoid X receptor (RXR) is a nuclear hormone receptor that functions as a transcription factor with roles in development, cell differentiation, metabolism, and cell death. Chemicals that interfere with the RXR signaling pathway may cause adverse effects on human health. In this study, open bioactivity data available at PubChem (https://pubchem.ncbi.nlm.nih.gov) were used to develop prediction models for chemical modulators of RXR-alpha, which is a subtype of RXR that plays a role in metabolic signaling pathways, dermal cysts, cardiac development, insulin sensitization, etc. The models were constructed from quantitative high-throughput screening (qHTS) data from the Tox21 project, using various supervised machine learning methods (including support vector machine, random forest, neural network, k-nearest neighbors, decision tree, and naïve Bayes). The performance of the models was evaluated with an external data set containing bioactivity data submitted by ChEMBL and the NCATS Chemical Genomics Center (NCGC). This study showcases how open data in the public domain can be used to develop prediction models for chemical toxicity.
Join us for a complimentary webinar where Dr. Matthew Clark, Life Sciences Services Consultant at Elsevier, will share best practices around developing predictive models using Reaxys. Given the importance of understanding solubility during the early drug discovery stages, Dr. Clark will showcase an example around creating a predictive model of solubility in water using various Reaxys data points. In this webinar, he will provide insights into:
Analyzing the predictive ability of the model by comparing Reaxys data to a well-known solubility modelling set
How to examine key Reaxys data points and link them to their original source for citation and cross-checking purposes
Best practices around using workflow tools like KNIME for making customized models that focus on specific chemical classes
ADMET properties prediction using AI will accelerate the process of drug discovery.
This slide mostly focuses on using graph-based deep learning techniques to predict drug properties.
Machine learning session 6: decision trees, random forest (Abhimanyu Dwivedi)
Concepts include decision tree with its examples. Measures used for splitting in decision tree like gini index, entropy, information gain, pros and cons, validation. Basics of random forests with its example and uses.
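The splitting measures mentioned above (Gini index, entropy, information gain) can be computed directly. This is a generic sketch with toy labels, not code from the session.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

parent = ["yes"] * 5 + ["no"] * 5
left, right = ["yes"] * 4 + ["no"], ["no"] * 4 + ["yes"]
print(round(gini(parent), 3))                       # 0.5 for a 50/50 split
print(round(information_gain(parent, left, right), 3))
```

A decision tree chooses, at each node, the split that maximizes information gain (or minimizes the weighted Gini impurity of the children).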
Top 10 Data Science Practitioner Pitfalls (Sri Ambati)
Over-fitting, misread data, NAs, collinear column elimination and other common issues play havoc in the day of practicing data scientist. In this talk, Mark Landry, one of the world’s leading Kagglers, will review the top 10 common pitfalls and steps to avoid them.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Processing malaria HTS results using KNIME: a tutorial (Greg Landrum)
Walks through a couple of KNIME Workflows for working with HTS Data.
The workflows are derived from the work described in this publication: https://f1000research.com/articles/6-1136/v2
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Invited talk at the Journées Nationales du GDR GPL 2024
Richard's aventures in two entangled wonderlands (Richard Gill)
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... (Sérgio Sacani)
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V... (Wasswaderrick3)
In this book, we use conservation of energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/velocity, and then from this we derive the Poiseuille flow equation, the transition flow equation and the turbulent flow equation. In situations where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross-sectional areas connected together. We also extend our techniques of energy conservation to a sphere falling in a viscous medium under the effect of gravity. We demonstrate Stokes' equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium. We also look at the general equation of terminal velocity.
The thematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills.
Seminar of U.V. Spectroscopy by SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... (Ana Luísa Pinho)
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep... (University of Maribor)
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
Phenomics assisted breeding in crop improvement (IshaGoswami9)
As the global population increases toward roughly 9 billion by 2050, and with climate change, it is difficult to meet the food requirements of such a large population. Facing the challenges presented by resource shortages, climate change, and an increasing global population, crop yield and quality need to be improved in a sustainable way over the coming decades. Genetic improvement by breeding is the best way to increase crop productivity. With the rapid progression of functional genomics, an increasing number of crop genomes have been sequenced and dozens of genes influencing key agronomic traits have been identified. However, current genome sequence information has not been adequately exploited for understanding the complex characteristics of multiple genes, owing to a lack of crop phenotypic data. Efficient, automatic, and accurate technologies and platforms that can capture phenotypic data that can be linked to genomics information for crop improvement at all growth stages have become as important as genotyping. Thus, high-throughput phenotyping has become the major bottleneck restricting crop breeding. Plant phenomics has been defined as the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes during crop growing stages at the organism level, including the cell, tissue, organ, individual plant, plot, and field levels. With the rapid development of novel sensors, imaging technology, and analysis methods, numerous infrastructure platforms have been developed for phenotyping.
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx (RASHMI M G)
This presentation covers abnormal, or anomalous, secondary growth in plants. Secondary growth is an increase in plant girth due to the vascular cambium or cork cambium. Anomalous secondary growth does not follow the normal pattern of a single vascular cambium producing xylem internally and phloem externally.
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr... (Travis Hills MN)
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
Nutraceutical market, scope and growth: Herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market, which includes goods like functional foods, drinks, and dietary supplements that provide health advantages beyond basic nutrition, is growing significantly. As healthcare expenses rise, the population ages, and people increasingly want natural and preventative health solutions, this industry is growing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant opportunities for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Large scale classification of chemical reactions from patent data
1. Gregory Landrum
NIBR Informatics, Basel
Novartis Institutes for BioMedical Research
10th International Conference on Chemical Structures/
10th German Conference on Chemoinformatics
Large scale classification of chemical
reactions from patent data
2. Outline
2
§ Public data sources and reactions
§ Fingerprints for reactions
§ Validation:
• Machine learning
• Clustering
§ Application: models for predicting yield
3. Public data sources in cheminformatics
3
an aside at the beginning
§ Publicly available data sources for small molecules and
their biological activities/interactions:
• PDB, PubChem, ChEMBL, etc.
§ Publicly available data sources for the chemistry behind
how those molecules were actually made (i.e. reactions):
• pretty much nothing until recently
§ Plenty of data locked up in large commercial databases,
and pharmaceutical companies’ ELNs, very very little in
the open
The “public/open” point is important for
collaboration and reproducibility
4. A large, public source of chemical reactions
4
Not just what we made, but how we made it
§ Text-mining applied to open patent data to extract chemical reactions:
1.12 million reactions[1]
§ Reactions classified using namerxn, when possible, into 318 standard
types: >599,000 classified reactions[2]
[1] Lowe DM: “Extraction of chemical structures and reactions from the literature.” PhD
thesis. University of Cambridge: Cambridge, UK; 2012.
[2] Reaction classification from Roger Sayle and Daniel Lowe (NextMove Software)
http://nextmovesoftware.com/blog/2014/02/27/unleashing-over-a-million-reactions-into-the-
wild/
6. Got the reactions, what about reaction fingerprints?
6
Criteria for them to be useful
§ Question 1: do they contain bits that are helpful in
distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a
reaction classifier?
§ Question 2: are similar reactions similar according to the
fingerprints?
Test: do related reactions cluster together?
7. Our toolbox: the RDKit
§ Open-source C++ toolkit for cheminformatics
§ Wrappers for Python (2.x), Java, C#
§ Functionality:
• 2D and 3D molecular operations
• Descriptor generation for machine learning
• PostgreSQL database cartridge for substructure and similarity searching
• Knime nodes
• IPython integration
• Lucene integration (experimental)
• Supports Mac/Windows/Linux
§ Releases every 6 months
§ business-friendly BSD license
§ Code: https://github.com/rdkit
§ http://www.rdkit.org
8. Similarity and reactions
8
What are we talking about?
§ These two reactions are both type: “1.2.5 Ketone reductive amination”
It’s obvious that these are the same, right?
9. Similarity and reactions
9
What are we talking about?
§ These two reactions are both type: “1.2.5 Ketone reductive amination”
It’s obvious that these are the same, right?
10. Got the reactions, what about reaction fingerprints?
10
Start simple: use difference fingerprints:
Similar idea here:
1) Ridder, L. & Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of
Metabolites. ChemMedChem 3, 821–832 (2008).
2) Patel, H., Bodkin, M. J., Chen, B. & Gillet, V. J. Knowledge-Based Approach to de NovoDesign Using Reaction
Vectors. J. Chem. Inf. Model. 49, 1163–1184 (2009).
FP_Reacts = Σ_{i ∈ Reactants} FP_i
FP_Products = Σ_{i ∈ Products} FP_i
FP_Rxn = FP_Products - FP_Reacts
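A minimal sketch of the difference-fingerprint idea, using plain Python Counters of fragment counts as stand-in fingerprints (a real implementation would use hashed structural/Morgan count fingerprints, e.g. from the RDKit; the fragment keys here are invented):

```python
from collections import Counter

# Hypothetical per-molecule count fingerprints: fragment -> count.
reactants = [Counter({"c1ccccc1": 1, "C=O": 1}), Counter({"N": 1})]
products  = [Counter({"c1ccccc1": 1, "CN": 1})]

def sum_fps(mols):
    """FP_Reacts / FP_Products: elementwise sum over the molecules' fingerprints."""
    total = Counter()
    for fp in mols:
        total.update(fp)
    return total

def difference_fingerprint(reactants, products):
    # FP_Rxn = FP_Products - FP_Reacts. Counts may go negative, so we
    # build a plain dict rather than a Counter truncated at zero.
    reac, prod = sum_fps(reactants), sum_fps(products)
    return {k: prod[k] - reac[k]
            for k in set(reac) | set(prod)
            if prod[k] != reac[k]}

# Fragments consumed get negative counts; fragments formed get positive
# ones; unchanged fragments (the benzene ring here) cancel out.
print(difference_fingerprint(reactants, products))
```

The cancellation of unchanged fragments is exactly what makes the difference fingerprint describe the transformation rather than the molecules themselves.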
11. Refine the fingerprints a bit
11
Text-mined reactions often include catalysts,
reagents, or solvents in the reactants
Explore two options for handling this:
1. Decrease the weight of reactant molecules where too many
of the bits are not present in the product fingerprint
2. Decrease the weight of reactant molecules where too many
atoms are unmapped
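Option 1 above might be sketched like this; the threshold and low-weight values are illustrative assumptions, not parameters from the talk, and the bit fingerprints are hypothetical Counters.

```python
from collections import Counter

def reactant_weight(reactant_fp, product_fp, threshold=0.5, low_weight=0.1):
    """Down-weight a reactant molecule when too few of its fingerprint
    bits appear in the product fingerprint (a likely reagent or solvent)."""
    bits = list(reactant_fp)
    if not bits:
        return low_weight
    in_product = sum(1 for b in bits if b in product_fp)
    frac = in_product / len(bits)
    # Full weight when most bits survive into the product; otherwise
    # treat the molecule as a probable reagent/solvent.
    return 1.0 if frac >= threshold else low_weight

product   = Counter({"b1": 1, "b2": 1, "b3": 1})
substrate = Counter({"b1": 1, "b2": 1, "b9": 1})  # 2/3 of bits in product
solvent   = Counter({"x1": 1, "x2": 1})           # 0/2 bits in product
print(reactant_weight(substrate, product))  # 1.0
print(reactant_weight(solvent, product))    # 0.1
```

Option 2 would replace the bit-overlap fraction with the fraction of atom-mapped atoms in the reactant, but the weighting scheme is the same.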
12. Are the fingerprints useful?
12
§ Question 1: do they contain bits that are helpful in
distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a
reaction classifier?
§ Question 2: are similar reactions similar with the
fingerprints
Test: do related reactions cluster together?
13. Machine learning and chemical reactions
§ Validation set:
• The 68 reaction types with at least 2000 instances from the patent data set
- “Resolution” reaction types removed (e.g. 11.9 Separation and 11.1 Chiral separation)
- Final: 66 reaction types
§ Process:
• Training set is 200 random instances of each reaction type
• Test set is 800 random instances of each reaction type
• Learning: random forest (scikit-learn)
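The train/test protocol above can be sketched with scikit-learn on synthetic data. The class count, fingerprint length, and per-class instance counts here are scaled-down stand-ins, not the real 66-class patent set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_classes, n_train, n_test, n_bits = 5, 50, 100, 64

# Synthetic stand-in for reaction fingerprints: each class gets its own
# per-bit "on" probability profile.
profiles = rng.random((n_classes, n_bits))

def sample(n_per_class):
    X, y = [], []
    for c in range(n_classes):
        X.append((rng.random((n_per_class, n_bits)) < profiles[c]).astype(int))
        y.extend([c] * n_per_class)
    return np.vstack(X), np.array(y)

# Fixed numbers of training and test instances per class, as in the talk.
X_train, y_train = sample(n_train)
X_test, y_test = sample(n_test)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

Using a fixed number of instances per class keeps the multiclass accuracy numbers directly comparable across classes.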
14. Learning reaction classes
Results for test data
Overall:
• Recall: 0.94
• Precision: 0.94
• Accuracy: 0.94
For a 66-class classifier, this looks pretty good!
15. Learning reaction classes
Confusion matrix for test data: ~94% accuracy; much of the confusion is between related types (e.g. Bromo Suzuki coupling, Bromo Suzuki-type coupling, Bromo N-arylation).
16. Are the fingerprints useful?
§ Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a reaction classifier?
§ Question 2: are similar reactions similar according to the fingerprints?
Test: do related reactions cluster together?
17. Clustering reactions
§ Reaction similarity validation set:
• The 66 most common reaction types from the patent data set
• Look at the homogeneity of clusters with at least 10 members
[Figure: an example cluster whose members are all type 1.2.5 Ketone reductive amination]
Interpretation: <30% of clusters are <90% homogeneous; <40% of clusters are <80% homogeneous
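One natural way to score cluster homogeneity is the fraction of a cluster's members that carry its majority reaction type. A small sketch with toy clusters (the labels and sizes are illustrative):

```python
from collections import Counter

def homogeneity(cluster_labels):
    """Fraction of a cluster's members carrying its majority label."""
    counts = Counter(cluster_labels)
    return counts.most_common(1)[0][1] / len(cluster_labels)

def frac_below(clusters, threshold, min_size=10):
    """Fraction of clusters (with at least min_size members) whose
    homogeneity falls below the given threshold."""
    scores = [homogeneity(c) for c in clusters if len(c) >= min_size]
    return sum(h < threshold for h in scores) / len(scores)

# Toy clusters labelled by reaction type; the first is 90% homogeneous.
clusters = [
    ["1.2.5 Ketone reductive amination"] * 9 + ["7.1.1 Nitro to amino"],
    ["7.1.1 Nitro to amino"] * 10,
    ["3.1.1 Bromo Suzuki coupling"] * 12,
]
print(frac_below(clusters, 0.95))  # ≈ 0.33: one of three clusters is < 95% homogeneous
```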
18. Using the fingerprints
Can we help classify the remaining 600K reactions?
§ Apply the 66-class random forest to generate class predictions for the unclassified reactions in order to find reactions we missed
§ Cluster the unclassified reactions, look for big clusters of unclassified reactions, and (manually) assign classes to them.
§ Both of these approaches have been successful
19. Predicting yields
§ The data set includes text-mined yield information as well as calculated yields.
§ For modeling: prefer the text-mined value, but take the calculated one if that’s the only thing available
§ Look at stats for the 93 reaction classes that have at least 500 members with yields, a min yield > 0, and a max yield < 110%:
21. Try building models for yield
§ Start with class 7.1.1 “nitro to amino”
§ Break into low-yield (<50%) and high-yield (>70%) classes; 14% are low-yield
22. Try building models for yield: things that don’t work
§ Try building a random forest using the atom-pair based reaction fingerprints
That’s performance on the training set
23. Try building models for yield: things that don’t work
§ Try building a random forest using the atom-pair based reactant fingerprints
That’s performance on the training set
24. Try building models for yield: things that don’t work?
§ Look at the ROC curve for the training-set data
[ROC annotations: first wrong “low-yield” prediction; nine wrong “low-yield” predictions]
The model is doing a great job of ordering compounds, but a bad job of classifying compounds
25. Unbalanced data and ensemble classifiers
an aside
§ Usual decision rule for a two-class ensemble classifier: take the result that the majority of the models (decision trees for random forests) vote for.
§ That’s a decision boundary = 0.5
§ If the dataset is unbalanced, why should we expect balanced behavior from the classifier?
§ Idea: use the composition of the training set to decide what the decision boundary should be.
For example: if the data set is ~20% “low yield”, then assign “low yield” to any example where at least 20% of the trees say “low yield”
26. Try building models for yield: getting close to working
§ Try building a random forest using the atom-pair based reactant fingerprints
§ What about moving the decision boundary to 0.2 to reflect the unbalanced data set?
That’s performance on the training set
Starting to look ok. What about the test set?
27. Try building models for yield: getting close to working
§ Results from a random forest using the atom-pair based reactant fingerprints with the shifted decision boundary (test set)
Not too terrible.
28. Try building models for yield: some more models
§ Aldehyde reductive amination (no shift): test set
§ Williamson ether synthesis (boundary 0.3): test set
29. Try building models for yield: some more models
§ Chloro N-Alkylation (no shift): test set
§ Chloro N-Alkylation (0.4 shift): test set
30. Wrapping up
§ Dataset: 1+ million reactions text mined from patents (publicly available) with reaction classes assigned
§ Fingerprints: weighted atom-pair delta and functional-group delta fingerprints implemented using the RDKit
§ Fingerprint Validation:
• Multiclass random-forest classifier ~94% accurate
• Similarity measure works: similar reactions cluster together
§ Combination of clustering + functional group analysis allows identification of new reaction classes
§ We’re also able to use the fingerprints to build reasonable models for yield
32. Advertising
3rd RDKit User Group Meeting
22-24 October 2014
Merck KGaA, Darmstadt, Germany
Talks, “talktorials”, lightning talks, social activities, and a hackathon on the 24th.
Registration: http://goo.gl/z6QzwD
Full announcement: http://goo.gl/ZUm2wm
We’re looking for speakers. Please contact greg.landrum@gmail.com