This document discusses using machine learning algorithms to analyze chemical reaction data. It describes how current reaction reporting formats are not well-suited for computational analysis. A more structured reporting format is proposed to fully describe reactions in a digitally friendly way, including specifying reactants, products, quantities, yields, and metrics like atom efficiency. This structured data would allow modeling of reaction substitutability and enable large-scale machine learning of chemical transformations.
3. MOLECULAR MATERIALS INFORMATICS
All your byte are belong to us
• Just because a reaction scheme is digital…
• … doesn’t mean it’s of any use to a computer.
3
16. MOLECULAR MATERIALS INFORMATICS
Model Building
• Most reaction data is noisy and incomplete
• Imagine opportunities with quantity & quality...
16
1
2
3
4
5 1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
• For example: model solvent substitution
17. MOLECULAR MATERIALS INFORMATICS
Conclusions & Future
• Most published reactions intractible to machines
• Most reaction informatics formats 50% complete
• Full description has immediate benefits...
• ... eventual large scale machine learning.
• μPublications with provenance: the path to open
repositories - but requires attention to content
17