The anatomy of a chemical reaction: Dissection by machine learning algorithms

Presented at the American Chemical Society meeting, Boston, 2015. The open data revolution stands to make a profound contribution to cheminformatics, but only if scientists compose their data in a way that is readable by machines as well as humans. This talk describes some of the dos and don'ts of preparing chemical reactions for the benefit of machine learning algorithms.

Published in: Science

  1. The anatomy of a chemical reaction: Dissection by machine learning algorithms. Alex M. Clark, Ph.D. August 2014. © 2015 Molecular Materials Informatics, Inc. http://molmatinf.com
  2. MOLECULAR MATERIALS INFORMATICS. 21st Century Publishing: chemist, experiment, write up, confirm, μ pub, URI, viewing, searching, machine learning
  3. All your byte are belong to us • Just because a reaction scheme is digital… • … doesn’t mean it’s of any use to a computer.
  4. Production Raster Graphics. Generic molfile:
       15 16 0 0 0 0 0 0 0 0999 V2000
       -3.9510 4.0500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       -5.2500 3.3000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       -2.6519 3.3000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       -5.2500 1.8000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       -3.9510 1.0500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       -2.6519 1.8000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       -1.2306 3.7694 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       -0.3482 2.5611 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       -1.2240 1.3407 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
       -6.5490 4.0500 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
       -7.8481 3.3000 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
       -6.5490 5.5500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
       1.1518 2.5673 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       1.9072 1.2714 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       1.8964 3.8695 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
       1 2 1 0 0 0 0
       1 3 2 0 0 0 0
       2 4 2 0 0 0 0
       4 5 1 0 0 0 0
       5 6 2 0 0 0 0
       6 3 1 0 0 0 0
       3 7 1 0 0 0 0
       7 8 2 0 0 0 0
       8 9 1 0 0 0 0
       9 6 1 0 0 0 0
       2 10 1 0 0 0 0
       10 11 1 0 0 0 0
       10 12 2 0 0 0 0
       8 13 1 0 0 0 0
       13 14 1 0 0 0 0
       13 15 1 0 0 0 0
       M CHG 2 10 1 11 -1
       M END
  5. Production Vector Graphics • Manuscripts usually delivered as PDFs
  6. Spreadsheets • Data gives the impression of organisation • Very high degrees of freedom, nothing for structures
  7. Common Scheme
  8. Digitally Friendly: primary reactant, secondary reactants, catalyst, solvent, intermediate, byproducts, final product, reagent
  9. Representation • For machines: representation must be very rigidly defined • For humans: can generate diagram programmatically • MDL RXN/RDfile ~50% there • DataSheet XML with Experiment aspect http://molmatinf.com/fmtaspect.html • Table (columns: Structure, Step, Role, Stoich.): eight components, six in step 1 and two in step 2, all with stoichiometry 1; roles are Reactant, Reagent or Product
  10. Balancing
  11. Quantities
  12. Green Metrics • Totals for reactants, products & waste • For each non-waste product: yield, PMI, E-factor, Atom-E… always calculated, always recorded
  13. Reaction Transforms • Reaction = specific description of experiment • Transform = the generic form of a reaction
  14. Convenience • Apply to a molecule... (figure label: 10 g)
  15. Decision Making: product search results
        Yield   PMI     E-factor   Atom Economy
        100%    2.18    1.18       100%
        84%     12.49   11.49      93.3%
        82%     19.17   18.17      87.4%
        100%    8.26    7.26       56.3%
        63%     8.93    7.93       73.3%
  16. Model Building • Most reaction data is noisy and incomplete • Imagine opportunities with quantity & quality... • For example: model solvent substitution
  17. Conclusions & Future • Most published reactions intractable to machines • Most reaction informatics formats 50% complete • Full description has immediate benefits... • ... eventual large scale machine learning. • μPublications with provenance: the path to open repositories, but requires attention to content
  18. Acknowledgments • Antony Williams • Sean Ekins • Leah McEwen • Open data advocates • Inquiries to info@molmatinf.com • http://molmatinf.com http://molsync.com http://cheminf20.org @aclarkxyz
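
The green metrics named on the Green Metrics and Decision Making slides can be sketched as below. This is a minimal illustration, not the author's implementation: the `Component` class, the `green_metrics` function, and every mass and molecular weight in the example are hypothetical. The formulas are the commonly used definitions (PMI = total input mass / product mass; E-factor = waste mass / product mass; atom economy = product molecular weight over the summed reactant molecular weights, each weighted by stoichiometry), and the sketch assumes a single non-waste product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One line of a reaction record (hypothetical structure)."""
    name: str
    role: str          # "reactant", "reagent", "solvent", "catalyst" or "product"
    mol_weight: float  # g/mol
    mass: float        # grams charged (inputs) or isolated (product)
    stoich: float = 1.0


def green_metrics(components, product_name):
    """Return yield, PMI, E-factor and atom economy as fractions/ratios."""
    product = next(c for c in components if c.name == product_name)
    inputs = [c for c in components if c.role != "product"]
    reactants = [c for c in inputs if c.role == "reactant"]

    total_in = sum(c.mass for c in inputs)
    # Moles per stoichiometric equivalent; the smallest value is the limiting reactant.
    limiting_equiv = min(c.mass / c.mol_weight / c.stoich for c in reactants)
    product_equiv = product.mass / product.mol_weight / product.stoich

    return {
        "yield": product_equiv / limiting_equiv,
        "pmi": total_in / product.mass,
        # Everything that went in and is not the desired product counts as waste.
        "e_factor": (total_in - product.mass) / product.mass,
        "atom_economy": (product.stoich * product.mol_weight)
        / sum(c.stoich * c.mol_weight for c in reactants),
    }


# Hypothetical reaction A + B -> P run in 20 g of solvent.
example = [
    Component("A", "reactant", 100.0, 10.0),
    Component("B", "reactant", 50.0, 5.0),
    Component("solvent", "solvent", 72.0, 20.0),
    Component("P", "product", 150.0, 12.0),
]
metrics = green_metrics(example, "P")
for key, value in metrics.items():
    print(f"{key}: {value:.3f}")
```

Under these definitions the E-factor is always PMI minus 1, which agrees with every row of the table on slide 15 (for example 2.18 and 1.18).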

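Slide 4 shows a molfile as an example of a structure a machine can actually read. A rough sketch of what reading one involves follows. The function name `parse_molfile_v2000` and the four-atom example record are hypothetical, and splitting on whitespace is a simplification: the V2000 format is fixed-width, so this only works while the leading count fields stay separated (small molecules). Charges are taken from the `M  CHG` property line only.

```python
def parse_molfile_v2000(text):
    """Parse atoms, bonds and M CHG charges from a small V2000 molfile."""
    lines = text.splitlines()
    # Three header lines come first; line 4 is the counts line.
    counts = lines[3].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, _z, symbol = line.split()[:4]
        atoms.append({"symbol": symbol, "x": float(x), "y": float(y), "charge": 0})

    bonds = []
    bond_start = 4 + n_atoms
    for line in lines[bond_start:bond_start + n_bonds]:
        a, b, order = (int(tok) for tok in line.split()[:3])
        bonds.append((a, b, order))  # atom indices are 1-based

    for line in lines[bond_start + n_bonds:]:
        fields = line.split()
        if fields[:2] == ["M", "CHG"]:
            # Format: M CHG <n> then n pairs of (atom index, charge).
            for i in range(int(fields[2])):
                index, charge = int(fields[3 + 2 * i]), int(fields[4 + 2 * i])
                atoms[index - 1]["charge"] = charge
    return atoms, bonds


# Hypothetical record: a C-NO2 fragment written in the same style as the slide.
EXAMPLE = "\n".join([
    "nitro fragment",
    "  hypothetical example",
    "",
    "  4  3  0  0  0  0  0  0  0  0999 V2000",
    "   -5.2500    3.3000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -6.5490    4.0500    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0",
    "   -7.8481    3.3000    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0",
    "   -6.5490    5.5500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "  2  3  1  0  0  0  0",
    "  2  4  2  0  0  0  0",
    "M  CHG  2   2   1   3  -1",
    "M  END",
])
atoms, bonds = parse_molfile_v2000(EXAMPLE)
print([(a["symbol"], a["charge"]) for a in atoms], bonds)
```

Once the atoms and bonds are explicit like this, a program can search, balance and compare structures, which is not possible with a raster or vector picture of the same scheme.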