
Large scale classification of chemical reactions from patent data

Presentation from the 10th International Conference on Chemical Structures / 10th German Conference on Chemoinformatics in Noordwijkerhout

  1. Large scale classification of chemical reactions from patent data
     Gregory Landrum, NIBR Informatics, Basel
     Novartis Institutes for BioMedical Research
     10th International Conference on Chemical Structures / 10th German Conference on Chemoinformatics
  2. Outline
     § Public data sources and reactions
     § Fingerprints for reactions
     § Validation:
       • Machine learning
       • Clustering
     § Application: models for predicting yield
  3. Public data sources in cheminformatics (an aside at the beginning)
     § Publicly available data sources for small molecules and their biological activities/interactions:
       • PDB, PubChem, ChEMBL, etc.
     § Publicly available data sources for the chemistry behind how those molecules were actually made (i.e. reactions):
       • pretty much nothing until recently
     § Plenty of data is locked up in large commercial databases and pharmaceutical companies’ ELNs; very, very little is in the open
     The “public/open” point is important for collaboration and reproducibility
  4. A large, public source of chemical reactions: not just what we made, but how we made it
     § Text-mining applied to open patent data to extract chemical reactions: 1.12 million reactions [1]
     § Reactions classified using namerxn, when possible, into 318 standard types: >599000 classified reactions [2]
     [1] Lowe DM: “Extraction of chemical structures and reactions from the literature.” PhD thesis. University of Cambridge: Cambridge, UK; 2012.
     [2] Reaction classification from Roger Sayle and Daniel Lowe (NextMove Software) http://nextmovesoftware.com/blog/2014/02/27/unleashing-over-a-million-reactions-into-the-wild/
  5. More about the classes
     Frequency of reaction classes (20 most common classes):
       Count  Class  Name
       44675  2.1.2  Carboxylic acid + amine reaction
       39297  1.7.9  Williamson ether synthesis
       28194  2.1.1  Amide Schotten-Baumann
       26739  1.3.7  Chloro N-arylation
       22400  1.6.2  Bromo N-alkylation
       20465  7.1.1  Nitro to amino
       20405  1.6.4  Chloro N-alkylation
       17226  6.2.2  CO2H-Me deprotection
       16602  6.1.1  N-Boc deprotection
       16021  6.2.1  CO2H-Et deprotection
       12952  1.2.1  Aldehyde reductive amination
       12250  2.2.3  Sulfonamide Schotten-Baumann
       10659  11.9   Separation
        8538  3.1.5  Bromo Suzuki-type coupling
        7261  1.7.7  Mitsunobu aryl ether synthesis
        7102  6.3.7  Methoxy to hydroxy
        7071  3.3.1  Sonogashira coupling
        6472  3.1.1  Bromo Suzuki coupling
        6383  1.8.5  Thioether synthesis
        5791  9.1.6  Hydroxy to chloro
  6. Got the reactions, what about reaction fingerprints? Criteria for them to be useful
     § Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
       Test: can we use them with a machine-learning approach to build a reaction classifier?
     § Question 2: are similar reactions similar with the fingerprints?
       Test: do related reactions cluster together?
  7. Our toolbox: the RDKit
     § Open-source C++ toolkit for cheminformatics
     § Wrappers for Python (2.x), Java, C#
     § Functionality:
       • 2D and 3D molecular operations
       • Descriptor generation for machine learning
       • PostgreSQL database cartridge for substructure and similarity searching
       • Knime nodes
       • IPython integration
       • Lucene integration (experimental)
       • Supports Mac/Windows/Linux
     § Releases every 6 months
     § Business-friendly BSD license
     § Code: https://github.com/rdkit
     § http://www.rdkit.org
  8. Similarity and reactions: what are we talking about?
     § These two reactions are both type: “1.2.5 Ketone reductive amination”
     It’s obvious that these are the same, right?
  10. Got the reactions, what about reaction fingerprints?
      Start simple: use difference fingerprints:
        $FP_{Reactants} = \sum_{i \in Reactants} FP_i$
        $FP_{Products} = \sum_{i \in Products} FP_i$
        $FP_{Rxn} = FP_{Products} - FP_{Reactants}$
      Similar idea here:
      1) Ridder, L. & Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem 3, 821–832 (2008).
      2) Patel, H., Bodkin, M. J., Chen, B. & Gillet, V. J. Knowledge-Based Approach to de Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 49, 1163–1184 (2009).
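The difference-fingerprint definition can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: fingerprints are represented as dictionaries mapping a feature id to a count, the feature names are made up, and the function names `sum_fingerprints` and `difference_fingerprint` are hypothetical rather than part of the RDKit.

```python
from collections import Counter


def sum_fingerprints(fingerprints):
    """Add count fingerprints (feature id -> count) element-wise."""
    total = Counter()
    for fp in fingerprints:
        total.update(fp)  # update() adds counts for mapping arguments
    return total


def difference_fingerprint(reactant_fps, product_fps):
    """Return FP_Rxn = FP_Products - FP_Reactants, keeping negative counts."""
    diff = Counter(sum_fingerprints(product_fps))
    diff.subtract(sum_fingerprints(reactant_fps))  # subtract() keeps negatives
    # Features whose counts cancel carry no information about the change.
    return {feat: n for feat, n in diff.items() if n != 0}


# Toy example with made-up feature ids (not real atom-pair codes).
reactants = [{"C=O": 1, "C-C": 2}, {"N-H": 2, "C-N": 1}]
products = [{"C-N": 2, "C-C": 2, "N-H": 1}]
rxn_fp = difference_fingerprint(reactants, products)
print(rxn_fp)
```

Features that are unchanged by the reaction (here `C-C`) drop out, so the remaining entries describe only what was formed and what was consumed.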
  11. Refine the fingerprints a bit
      Text-mined reactions often include catalysts, reagents, or solvents in the reactants
      Explore two options for handling this:
      1. Decrease the weight of reactant molecules where too many of the bits are not present in the product fingerprint
      2. Decrease the weight of reactant molecules where too many atoms are unmapped
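One way the first option could look is sketched below. The slide gives no cutoff or weight, so `min_overlap` and `low_weight` are purely hypothetical values, as are the function names and the toy feature ids; treat this as an illustration of the idea rather than the implementation used in the talk.

```python
def reactant_weight(reactant_fp, product_fp, min_overlap=0.2, low_weight=0.1):
    """Down-weight a reactant when few of its features appear in the product.

    min_overlap and low_weight are hypothetical values chosen for illustration.
    """
    if not reactant_fp:
        return low_weight
    shared = sum(1 for feat in reactant_fp if feat in product_fp)
    overlap = shared / len(reactant_fp)
    return 1.0 if overlap >= min_overlap else low_weight


def weighted_reactant_sum(reactant_fps, product_fp):
    """Sum reactant fingerprints, scaling each one by its weight."""
    total = {}
    for fp in reactant_fps:
        weight = reactant_weight(fp, product_fp)
        for feat, count in fp.items():
            total[feat] = total.get(feat, 0.0) + weight * count
    return total


# Toy data: a substrate that shares features with the product, and a
# solvent-like molecule that shares none.
substrate = {"C-C": 2, "C=O": 1}
solvent = {"Cl-C": 2}
product = {"C-C": 2, "C-O": 1}
weighted = weighted_reactant_sum([substrate, solvent], product)
print(weighted)
```

The solvent-like molecule still contributes, but far less than the substrate, so it has less influence on the resulting reaction fingerprint.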
  12. Are the fingerprints useful?
      § Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
        Test: can we use them with a machine-learning approach to build a reaction classifier?
      § Question 2: are similar reactions similar with the fingerprints?
        Test: do related reactions cluster together?
  13. Machine learning and chemical reactions
      § Validation set:
        • The 68 reaction types with at least 2000 instances from the patent data set
          - “Resolution” reaction types removed (e.g. 11.9 Separation and 11.1 Chiral separation)
          - Final: 66 reaction types
      § Process:
        • Training set is 200 random instances of each reaction type
        • Test set is 800 random instances of each reaction type
        • Learning: random forest (scikit-learn)
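The per-class sampling described in the process above might be sketched like this. It assumes the training and test instances are disjoint, which the slide does not state explicitly; the function name `split_per_class`, the seed, and the toy label list are hypothetical.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train, n_test, seed=42):
    """Pick n_train + n_test random indices per class, without overlap."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label, members in sorted(by_class.items()):
        if len(members) < n_train + n_test:
            raise ValueError(f"class {label!r} has too few instances")
        picked = rng.sample(members, n_train + n_test)
        train.extend(picked[:n_train])
        test.extend(picked[n_train:])
    return train, test


# Toy label list standing in for the reaction classes of the data set.
labels = ["1.2.5"] * 2000 + ["7.1.1"] * 2500
train_idx, test_idx = split_per_class(labels, n_train=200, n_test=800)
print(len(train_idx), len(test_idx))
```

The resulting index lists could then be used to select fingerprint rows for a classifier such as scikit-learn's `RandomForestClassifier`.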
  14. Learning reaction classes: results for test data
      Overall:
        • Recall: 0.94
        • Precision: 0.94
        • Accuracy: 0.94
      For a 66-class classifier, this looks pretty good!
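For reference, here is one way the three overall numbers could be computed for a multiclass problem. The slide does not say how precision and recall were averaged across classes; macro-averaging (an unweighted mean over classes) is an assumption made for this sketch, and the labels are toy data.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, accuracy) for label lists."""
    classes = sorted(set(y_true) | set(y_pred))
    true_n = Counter(y_true)   # instances per true class
    pred_n = Counter(y_pred)   # instances per predicted class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = sum(
        hits[c] / pred_n[c] if pred_n[c] else 0.0 for c in classes
    ) / len(classes)
    recall = sum(
        hits[c] / true_n[c] if true_n[c] else 0.0 for c in classes
    ) / len(classes)
    accuracy = sum(hits.values()) / len(y_true)
    return precision, recall, accuracy


# Toy two-class example with equal class sizes.
y_true = ["A"] * 4 + ["B"] * 4
y_pred = ["A", "A", "A", "B", "B", "B", "B", "B"]
precision, recall, accuracy = macro_scores(y_true, y_pred)
print(precision, recall, accuracy)
```

When every class has the same number of test instances, as with 800 per class here, macro-averaged recall and accuracy coincide.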
  15. Learning reaction classes: confusion matrix for test data
      ~94% accuracy; much of the confusion is between related types
      [Figure: confusion matrix for test data; highlighted classes: Bromo Suzuki coupling, Bromo Suzuki-type coupling, Bromo N-arylation]
  16. Are the fingerprints useful?
      § Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
        Test: can we use them with a machine-learning approach to build a reaction classifier?
      § Question 2: are similar reactions similar with the fingerprints?
        Test: do related reactions cluster together?
  17. Clustering reactions
      § Reaction similarity validation set:
        • The 66 most common reaction types from the patent data set
        • Look at the homogeneity of clusters with at least 10 members
      [Figure: example cluster members, all labelled “1.2.5 Ketone reductive amination”; plot label: Integration]
      Interpretation: <30% of clusters are <90% homogeneous
      Interpretation: <40% of clusters are <80% homogeneous
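The homogeneity check can be illustrated with a small sketch. The slide does not define homogeneity; the version below assumes it is the fraction of a cluster's members that belong to the cluster's most common reaction class. The function names and toy clusters are hypothetical.

```python
from collections import Counter


def cluster_homogeneity(member_labels):
    """Fraction of members carrying the cluster's most common label."""
    counts = Counter(member_labels)
    return counts.most_common(1)[0][1] / len(member_labels)


def fraction_below(clusters, threshold, min_size=10):
    """Fraction of sufficiently large clusters with homogeneity < threshold."""
    large = [c for c in clusters if len(c) >= min_size]
    if not large:
        return 0.0
    below = sum(1 for c in large if cluster_homogeneity(c) < threshold)
    return below / len(large)


# Toy clusters: each list holds the reaction class of every member.
clusters = [
    ["1.2.5"] * 10,                # homogeneity 1.0
    ["1.2.5"] * 8 + ["1.2.1"] * 2,  # homogeneity 0.8
    ["7.1.1"] * 3,                 # ignored: fewer than 10 members
    ["1.7.9"] * 19 + ["1.6.2"],    # homogeneity 0.95
]
print(fraction_below(clusters, 0.9), fraction_below(clusters, 0.8))
```

Small clusters are skipped because a cluster of two or three reactions is trivially homogeneous and says little about the similarity measure.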
  18. Using the fingerprints: can we help classify the remaining 600K reactions?
      § Apply the 66 class random forest to generate class predictions for the unclassified compounds in order to find reactions we missed
      § Cluster the unclassified molecules, look for big clusters of unclassified molecules, and (manually) assign classes to them.
      § Both of these approaches have been successful
  19. Predicting yields
      § The data set includes text-mined yield information as well as calculated yields.
      § For modeling: prefer the text-mined value, but take the calculated one if that’s the only thing available
      § Look at stats for the 93 reaction classes that have at least 500 members with yields, a min yield > 0 and a max yield < 110%:
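The yield-selection rule and the class filter could be sketched as follows. The slide is ambiguous about whether the min/max conditions apply to each class as a whole or to individual reactions; this sketch assumes the former. The record layout `(reaction class, text-mined yield, calculated yield)` and the function names are hypothetical.

```python
def pick_yield(text_mined, calculated):
    """Prefer the text-mined yield; fall back to the calculated one."""
    return text_mined if text_mined is not None else calculated


def usable_classes(records, min_members=500, max_yield=110.0):
    """Group yields by class and keep classes that pass the filters.

    records: iterable of (reaction_class, text_mined, calculated) tuples,
    where a missing yield is None.
    """
    by_class = {}
    for rxn_class, text_mined, calculated in records:
        value = pick_yield(text_mined, calculated)
        if value is not None:
            by_class.setdefault(rxn_class, []).append(value)
    return {
        rxn_class: yields
        for rxn_class, yields in by_class.items()
        if len(yields) >= min_members
        and min(yields) > 0
        and max(yields) < max_yield
    }


# Toy records; min_members is lowered so the tiny example passes the filter.
records = [
    ("7.1.1", 85.0, 90.0),   # both available: text-mined value wins
    ("7.1.1", None, 60.0),   # only the calculated value is available
    ("1.7.9", 120.0, None),  # pushes the class maximum above the limit
    ("1.7.9", 50.0, None),
    ("6.1.1", None, None),   # no yield at all
]
kept = usable_classes(records, min_members=2)
print(kept)
```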
  20. Predicting yields
      § Look at the most populated classes:
  21. Try building models for yield
      § Start with class 7.1.1 “nitro to amino”
      § Break into low-yield (<50%) and high-yield (>70%) classes. 14% are low-yield
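A minimal sketch of the labelling step, assuming that reactions with yields between the two cutoffs are simply left out of the two-class problem (the slide does not say what happens to them). The function name and the toy yields are hypothetical.

```python
def yield_label(yield_pct, low_cut=50.0, high_cut=70.0):
    """Return 'low', 'high', or None for yields in the gap between cutoffs."""
    if yield_pct < low_cut:
        return "low"
    if yield_pct > high_cut:
        return "high"
    return None  # assumption: ambiguous mid-range yields are not modelled


# Toy yields in percent.
yields = [12.0, 49.9, 50.0, 65.0, 70.0, 70.1, 95.0]
labels = [yield_label(y) for y in yields]
kept = [lab for lab in labels if lab is not None]
low_fraction = kept.count("low") / len(kept)
print(labels, low_fraction)
```

Leaving a gap between the cutoffs keeps borderline reactions from blurring the two classes.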
  22. Try building models for yield: things that don’t work
      § Try building a random forest using the atom-pair based reaction fingerprints
      That’s performance on the training set
  23. Try building models for yield: things that don’t work
      § Try building a random forest using the atom-pair based reactant fingerprints
      That’s performance on the training set
  24. Try building models for yield: things that don’t work?
      § Look at the ROC curve for the training-set data
      [Figure: ROC curve, annotated at the first wrong “low-yield” prediction and at nine wrong “low-yield” predictions]
      The model is doing a great job of ordering compounds, but a bad job of classifying compounds
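The gap between ordering and classifying can be made concrete with a toy example. The scores below stand in for the fraction of trees voting “low-yield” and are invented for illustration; `roc_auc` and `recall_at` are hypothetical helper names, with the AUC computed as the fraction of (positive, negative) pairs ranked correctly.

```python
def roc_auc(scores, is_positive):
    """AUC as the fraction of positive/negative pairs ordered correctly."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in pos
        for sn in neg
    )
    return wins / (len(pos) * len(neg))


def recall_at(scores, is_positive, threshold):
    """Fraction of positives whose score reaches the threshold."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    return sum(1 for s in pos if s >= threshold) / len(pos)


# Invented vote fractions: the three low-yield examples rank on top,
# yet none of them collects a majority of the votes.
scores = [0.45, 0.35, 0.30, 0.15, 0.10, 0.10, 0.05, 0.05, 0.02, 0.0]
is_low = [True, True, True] + [False] * 7

auc = roc_auc(scores, is_low)
recall_majority = recall_at(scores, is_low, 0.5)
print(auc, recall_majority)
```

A perfect ranking paired with zero recall at the 0.5 cutoff is the same pattern the slide describes: good ordering, poor classification.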
  25. Unbalanced data and ensemble classifiers (an aside)
      § Usual decision rule for a two-class ensemble classifier: take the result that the majority of the models (decision trees for random forests) vote for.
      § That’s a decision boundary = 0.5
      § If the dataset is unbalanced, why should we expect balanced behavior from the classifier?
      § Idea: use the composition of the training set to decide what the decision boundary should be. For example: if the data set is ~20% “low yield”, then assign “low yield” to any example where at least 20% of the trees say “low yield”
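The shifted decision rule is easy to sketch once the fraction of trees voting “low yield” is available for each example. The vote fractions here are invented and the function names are hypothetical. With scikit-learn, the output of `predict_proba` could play a similar role, although that is an assumption: it averages per-tree class probabilities rather than counting hard votes.

```python
def minority_fraction(train_labels, minority="low"):
    """Share of the training set that belongs to the minority class."""
    return sum(1 for lab in train_labels if lab == minority) / len(train_labels)


def predict_with_boundary(low_vote_fractions, boundary):
    """Assign 'low' when at least `boundary` of the trees vote 'low'."""
    return ["low" if frac >= boundary else "high" for frac in low_vote_fractions]


# Training set that is 20% low-yield, as in the example on the slide.
train_labels = ["low"] * 20 + ["high"] * 80
boundary = minority_fraction(train_labels)

# Invented fractions of trees voting "low" for five examples.
votes = [0.55, 0.25, 0.20, 0.19, 0.02]
default = predict_with_boundary(votes, 0.5)
shifted = predict_with_boundary(votes, boundary)
print(default)
print(shifted)
```

Only the cutoff changes; the forest and its votes stay the same, which is why the approach works when the model already orders examples well.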
  26. Try building models for yield: getting close to working
      § Try building a random forest using the atom-pair based reactant fingerprints
      § What about moving the decision boundary to 0.2 to reflect the unbalanced data set?
      That’s performance on the training set
      Starting to look ok. What about the test set?
  27. Try building models for yield: getting close to working
      § Results from a random forest using the atom-pair based reactant fingerprints with the shifted decision boundary (test set)
      Not too terrible.
  28. Try building models for yield: some more models
      § Aldehyde reductive amination (no shift), test set
      § Williamson ether synthesis (boundary 0.3), test set
  29. Try building models for yield: some more models
      § Chloro N-Alkylation (no shift), test set
      § Chloro N-Alkylation (0.4 shift), test set
  30. Wrapping up
      § Dataset: 1+ million reactions text mined from patents (publicly available) with reaction classes assigned
      § Fingerprints: weighted atom-pair delta and functional-group delta fingerprints implemented using the RDKit
      § Fingerprint validation:
        • Multiclass random-forest classifier ~94% accurate
        • Similarity measure works: similar reactions cluster together
      § Combination of clustering + functional group analysis allows identification of new reaction classes
      § We’re also able to use the fingerprints to build reasonable models for yield
  31. Acknowledgements
      § NextMove Software:
        • Roger Sayle
        • Daniel Lowe
      § NIBR:
        • Anna Pelliccioli
        • Sereina Riniker
        • Mike Tarselli
  32. Advertising: 3rd RDKit User Group Meeting
      22-24 October 2014, Merck KGaA, Darmstadt, Germany
      Talks, “talktorials”, lightning talks, social activities, and a hackathon on the 24th.
      Registration: http://goo.gl/z6QzwD
      Full announcement: http://goo.gl/ZUm2wm
      We’re looking for speakers. Please contact greg.landrum@gmail.com
