The collection, curation and modeling of Open Melting Point measurements



Jean-Claude Bradley and Andrew Lang presented "The collection, curation and modeling of Open Melting Point measurements" at the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry on August 26, 2011. The talk also covers the role of Open Notebook Science and Google Apps Scripts in this effort.

Published in: Education, Technology


  1. The collection, curation and modeling of Open Melting Point measurements
     5th Meeting on U.S. Government Chemical Databases and Open Chemistry, August 26, 2011
     Jean-Claude Bradley (Department of Chemistry, Drexel University)
     Andrew Lang (Department of Mathematics, Oral Roberts University)
     Antony Williams (ChemSpider, Royal Society of Chemistry)
  2. The Problem of Data Quality in Chemistry
     • Lack of provenance
     • Reliance on a system of “trusted sources”
     In the case of melting points:
     • CRC Handbook
     • Merck Index
     • Chemical Vendor Catalogs (e.g. Sigma-Aldrich)
     • Peer-Reviewed Journals
  6. Strategy for the curation of melting points
     • Rely on redundancy when possible
     • Provide the maximum level of provenance when necessary (Open Notebook Science)
     • Adhere to Open Data, Open Descriptors and Open Algorithms for measurements and modeling
     Using technology, we can begin to replace the “trusted source” model with one based on transparency and provenance
  7. The Chemical Information Validation Sheet
     567 curated and referenced measurements from the Fall 2010 Chemical Information Retrieval course
  8. Investigating the m.p. inconsistencies of EGCG
  9. Most popular data sources
  10. Alfa Aesar donates melting points to the public
  11. Open Melting Point Explorer
  12. Outliers
      EPA/PhysProp (also donated all data to the public)
      MDPI dataset
  13. Outliers for ethanol: Alfa Aesar and Oxford MSDS
  14. Inconsistencies and SMILES problems within MDPI dataset
  15. MDPI Dataset labeled with High Trust Level
  16. EPA/PHYSPROP Structure Errors (Incorrect Valence): 2,315 out of 43,543 structures (about 5.3%) contained pentavalent nitrogens
  17. EPA/PHYSPROP Errors: Structure displayed is for the neutral compound dopamine, but the associated CAS Number and chemical name in the file are for the hydrobromide salt.
  18. Common errors in datasets
      • multiple melting points for the same compound in the same database
      • stereochemistry issues
      • sign inversion
      • conversion errors (Kelvin/Celsius, Fahrenheit/Celsius)
      • bad SMILES (non-rendering)
      • salts associated with SMILES for free base
      • using boiling point for melting point
  19. Open melting point datasets
      Double+ validated: 2706 compounds (7413 highly curated measurements; range: 0.01 to 5 °C). Compounds with at least one chiral center or cis/trans isomerism, inorganics, and salts were removed.
      Entire dataset: 19933 unique compounds (27684 measurements; no inorganics or salts)
  20. Open Models with Open Data Using Open Descriptors (CDK)
  21. Modeling Results
  22. Melting point prediction service
  23. Melting point predictions and measurements on iPhone/iPad (Alex Clark)
  24. Publication of double+ validated melting point dataset to Nature Precedings and Lulu
  25. For all Formats of ONS Projects
  26. Open Melting Point Datasets
      Currently 20,000 compounds with Open MPs
  27. Some melting points can’t be resolved from the literature alone: 4-benzyltoluene
  28. Motivation: Faster Science, Better Science
  29. Open Lab Notebook page measuring the melting point of 4-benzyltoluene
  30. Using melting point for temperature dependent solubility prediction
  31. Crowdsourcing Solubility Data
  32. Integration of Multiple Web Services to Recommend Solvents for Reactions
  33.
  34. All ONS web services
  35. Google Apps Scripts web services
  36. Google Apps Scripts for conveniently exploring melting point data
  37. Comparison of model with triple validated measurements
      Straight chain carboxylic acids from 1 to 10 carbons
      Straight chain alcohols from 1 to 10 carbons
  38. Cyclic primary amines from 3 to 6 carbons (cyclobutylamine flagged for validation; only a single source available)
  39. Google Apps Scripts for planning reactions and creating schemes
  40. Open Melting Points in Supplementary Data Pages of Wikipedia (Martin Walker)
  41. Conclusions
      • For science to progress quickly, there is great benefit in moving away from a “trusted source” model to one based on transparency and data provenance
      • Open Notebook Science offers an efficient way to make research transparent and discoverable
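
The curation ideas on the "Common errors in datasets" and "Open melting point datasets" slides can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the tolerance, the sample values, and the use of a plain average are all assumptions made for illustration. In particular, reading "range: 0.01 to 5 °C" as "at least two measurements whose spread lies in that interval" is one possible interpretation of the slide, not a confirmed rule.

```python
from collections import defaultdict
from statistics import mean


def explain_discrepancy(reference_c, suspect_c, tol=1.0):
    """List common recording errors that would map suspect_c onto reference_c.

    Both values are melting points nominally in degrees Celsius.
    The tolerance (tol) is an arbitrary illustrative choice.
    """
    corrected = {
        "sign inversion": -suspect_c,
        "Kelvin recorded as Celsius": suspect_c - 273.15,
        "Fahrenheit recorded as Celsius": (suspect_c - 32.0) * 5.0 / 9.0,
    }
    return [label for label, value in corrected.items()
            if abs(value - reference_c) <= tol]


def agreeing_compounds(measurements, min_range=0.01, max_range=5.0):
    """Keep compounds with two or more measurements that agree closely.

    measurements: iterable of (compound_id, melting_point_c) pairs.
    Returns {compound_id: mean melting point} for compounds whose
    max - min spread lies between min_range and max_range (assumed rule).
    """
    by_compound = defaultdict(list)
    for compound, mp_c in measurements:
        by_compound[compound].append(mp_c)

    kept = {}
    for compound, values in by_compound.items():
        if len(values) < 2:
            continue  # a single source gives no redundancy to rely on
        spread = max(values) - min(values)
        if min_range <= spread <= max_range:
            kept[compound] = mean(values)
    return kept


# Hypothetical sample data, not taken from the Open Melting Point datasets.
sample = [
    ("A", 80.0), ("A", 82.0),   # two sources that agree within 5 degrees
    ("B", 10.0), ("B", 40.0),   # two sources that disagree badly
    ("C", 5.0),                 # single source only
    ("D", 20.0), ("D", 20.0),   # identical values, spread below 0.01
]

print(explain_discrepancy(-114.1, 114.1))  # ethanol-like value with a flipped sign
print(agreeing_compounds(sample))
```

A disagreement that none of the listed transformations explains, such as compound B above, would still need to be resolved by going back to the sources or by a new measurement recorded with full provenance.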