Hosting public domain chemicals data online for the community – the challenges of handling materials

The Royal Society of Chemistry hosts one of the world's richest collections of online chemistry data that is free-to-access for the community. ChemSpider presently hosts over 30 million unique chemical compounds together with associated data, accessible via a number of search techniques. With almost 50,000 unique users per day from around the world, the site offers scientists the ability to investigate the world of small molecules via property searches, analytical data and predictive models. The challenges associated with providing a similar platform for “materials” are manifold, but addressing them would offer a valuable service to the materials community. This presentation will provide an overview of how ChemSpider was built, our efforts to expand its capabilities into a more encompassing data repository, and some of the challenges faced in embracing the diverse world of materials informatics and online data access.

Published in: Science

  1. Hosting public domain chemicals data online for the community – the challenges of handling materials
     Antony Williams
     Opportunities in Materials Informatics, University of Wisconsin-Madison
     February 9th, 2015
     0000-0002-2668-4821
  2. About Me…
     • I am NOT a materials chemist
     • I am an NMR spectroscopist by training
     • Worked on a LIMS while at Kodak
     • 10 years in commercial cheminformatics
     • Built the ChemSpider database as a hobby
     • Worked on validating compounds on Wikipedia
     • Manage cheminformatics team for RSC
     • Believer in the value of social networking and Open Data for science
     • Dane Morgan asked me to tell jokes…
  3. I would tell a chemistry joke…
     But all of the good ones…
  4. An ambitious idea….
     • Let’s map together all online chemistry data and build systems to integrate it
     • Heck, let’s integrate chemistry and biology data and add in disease data too if we can
     • Let’s extract property data and model it and see if we can extract new relationships – quantitative and qualitative
     • Let’s make it all available on the web…for free
  5. What about this….
     • We’re going to map the world
     • We’re going to take photos of as many places as we can and link them together
     • We’ll let people annotate and curate the map
     • Then let’s make it available free on the web
     • We’ll make it available for decision making
     • Put it on Mobile Devices, give it away…
  6. Where is chemistry online?
     • Encyclopedic articles (Wikipedia)
     • Chemical vendor databases
     • Metabolic pathway databases
     • Property databases
     • Patents with chemical structures
     • Drug Discovery data
     • Scientific publications
     • Compound aggregators
     • Blogs/Wikis and Open Notebook Science
  7. Chemistry on the Internet…
     • Most searching for chemistry on the internet…
     • Name searching Google/Bing/Yahoo
     • Name searching Wikipedia
     • Name searching Wolfram Alpha
     • Name, name, name, name…searching
     • Structure searching DOZENS of websites, each with different information or…
  8. Chemistry on the Internet…
     • Most searching for chemistry on the internet…
     • Name searching Google/Bing/Yahoo
     • Name searching Wikipedia
     • Name searching Wolfram Alpha
     • Name, name, name, name…searching
     • Structure searching DOZENS of websites, each with different information or…
     • Search ONE website integrating the others!
  9. • ~30 million chemicals and growing
     • Data sourced from >500 different sources
     • Crowd sourced curation and annotation
     • Ongoing deposition of data from our journals and our collaborators
     • Structure centric hub for web-searching
     • …and a really big dictionary!!!
     • Note…NOT all websites connected
  10. ChemSpider
  11. ChemSpider
  12. ChemSpider
  13. Experimental/Predicted Properties
  14. Literature references
  15. Patents references
  16. RSC Books
  17. Google Books
  18. Vendors and data sources
  19. APIs
  20. APIs
  21. Organic Chemistry is hard…
  22. …it has alkynes of trouble
  23. Flavors of Chemistry
  24. Molfiles
       10  9  0  0  1  0  0  0  0  0  1 V2000
         31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
         26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
         31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
         30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
         25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
         28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
         27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
         26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
         32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
         30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
        3  1  2  0  0  0  0
        4  1  1  0  0  0  0
        9  1  1  0  0  0  0
        7  2  1  0  0  0  0
        5  2  2  0  0  0  0
        8  2  1  0  0  0  0
        6  4  1  0  0  0  0
        4 10  1  6  0  0  0
        7  6  1  0  0  0  0
      M  END
  25. Molfiles
      • Molfiles are the primary exchange format between structure drawing packages
      • Can be different between different drawing packages
      • Most commonly carry X,Y coordinates for layout
      • Can support polymers, organometallics, etc.
      • Can carry 3D coordinates
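
To make the molfile layout on slides 24 and 25 concrete, here is a minimal sketch that reads the connection table shown on slide 24. It is an illustration only: the function name `parse_ctab` is made up for this example, the parser splits on whitespace (real V2000 files are fixed-width and begin with three header lines that the slide omits), and it reads just the fields needed to list atoms, coordinates and bonds.

```python
from collections import Counter

# Connection table transcribed from slide 24, counts line onward.
# Assumption: whitespace-separated fields; the three molfile header lines are absent.
MOLBLOCK = """\
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
"""


def parse_ctab(text):
    """Return (atoms, bonds) from a V2000-style connection table."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for ln in lines[1:1 + n_atoms]:
        x, y, z, symbol = ln.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        # first atom, second atom, bond order, stereo flag
        a, b, order, stereo = (int(tok) for tok in ln.split()[:4])
        bonds.append((a, b, order, stereo))
    return atoms, bonds


atoms, bonds = parse_ctab(MOLBLOCK)
heavy_atoms = Counter(symbol for symbol, *_ in atoms)
is_flat = all(z == 0.0 for _, _, _, z in atoms)  # 2D layout: every Z is zero
stereo_bonds = [bond for bond in bonds if bond[3] != 0]
print(dict(heavy_atoms), len(bonds), is_flat, stereo_bonds)
```

The block carries only X,Y layout coordinates (all Z values are zero), and exactly one bond has a non-zero stereo flag, which is the kind of detail that can be lost or altered when files pass between drawing packages.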
  26. SMILES
      • SMILES is a common format
      • Can support polymers, organometallics, etc.
      • Does NOT carry X,Y or Z coordinates for layout so requires layout algorithms – can be problematic!
      • Generally different between drawing packages
  27. Stereo
  28. Tautomeric forms
  29. Vendor-dependent SMILES
      ACD/Labs: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O
      OpenEye:  CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
      ChEMBL:   CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=
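
The point of slide 29, that a plain string comparison of SMILES from different packages is unreliable, can be illustrated with a rough sketch. It assumes the line breaks inside the ACD/Labs and OpenEye strings on the slide are only wrapping, so the fragments are joined here; the ChEMBL string is truncated in the source and is left out. The helper `heavy_atom_counts` is a made-up name and is not a SMILES parser: it only counts element symbols, so it cannot say whether two strings describe the same stereoisomer. A cheminformatics toolkit with canonicalization would be needed for that.

```python
import re
from collections import Counter

# Strings from slide 29, with the slide's line wraps removed (assumption).
ACD_LABS = "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O"
OPENEYE = "CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C"

# Bracket atoms, two-letter halogens, then the organic subset (aliphatic and aromatic).
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles):
    """Count non-hydrogen element symbols in a SMILES string (crude check only)."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            element = BRACKET_ELEMENT.match(token).group(1)
        else:
            element = token
        element = element.capitalize()  # aromatic 'c' is counted as 'C'
        if element != "H":
            counts[element] += 1
    return counts


same_text = ACD_LABS == OPENEYE
same_composition = heavy_atom_counts(ACD_LABS) == heavy_atom_counts(OPENEYE)
print(same_text, same_composition, dict(heavy_atom_counts(ACD_LABS)))
```

Both strings give the same heavy-atom composition even though the text differs, and the OpenEye string additionally encodes double-bond geometry with `/` that the ACD/Labs string does not, so the two are not guaranteed to describe exactly the same structure.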
  30. Chemists are good…
  31. The InChI Identifier
  32. InChI
      • SINGLE code base managed by IUPAC – integrated into drawing packages. No variability as with SMILES
      • InChI Strings can be reversed to structures – same problem as with SMILES – no layout
      • Adopted by the community (databases, blogs, Wikipedia) – good for searching the internet
  33. Multiple Layers
  34. Tautomers
  35. Stereo
  36. InChI Strings Hash to InChIKeys
  37. Structure search the web
  38. Exact Search
  39. Skeleton Search
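
Slides 36 to 39 have no text beyond their titles, so the following is a sketch under stated assumptions rather than a description of how ChemSpider implements search. A standard InChIKey has 27 characters in three hyphen-separated blocks: a 14-letter hash of the connectivity, a 10-character block covering the remaining layers (ending in a standard/non-standard flag and a version letter), and a single protonation character. The sketch assumes that an "exact" search compares the whole key and a "skeleton" search compares only the first block. The two example keys are quoted from memory as those of alanine without stereo and of L-alanine; treat them as illustrative and verify them before relying on them.

```python
import re

# 14-letter connectivity block, 8-letter detail block + flag + version, protonation letter.
INCHIKEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Split an InChIKey into its blocks; raise ValueError if the format is wrong."""
    match = INCHIKEY.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # hash of formula and connectivity
        "detail": detail,            # hash of stereo, isotope and other layers
        "standard": flag == "S",     # 'S' marks a standard InChIKey
        "version": version,
        "protonation": protonation,
    }


def exact_match(key_a, key_b):
    return key_a == key_b


def skeleton_match(key_a, key_b):
    return split_inchikey(key_a)["skeleton"] == split_inchikey(key_b)["skeleton"]


# Illustrative keys (assumption: alanine without stereo, and L-alanine).
ALANINE_NO_STEREO = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"
L_ALANINE = "QNAYBMKLOCPYGJ-REOHGBXYSA-N"

print(exact_match(ALANINE_NO_STEREO, L_ALANINE))
print(skeleton_match(ALANINE_NO_STEREO, L_ALANINE))
```

Because the key is a fixed-length hash, it is easy to index with a general web search engine, but it cannot be reversed to a structure without a lookup table.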
  40. Data Quality/Standardization
      • MANY structures meant to be something online are MISREPRESENTED.
      • Commonly you will have better success finding information by name searches than structure – with many caveats of course…
      • Validating chemical structure representations is laborious work – and it’s shocking to review data…
  41. Data Quality Issues
      Williams and Ekins, DDT, 16: 747-750 (2011)
      Science Translational Medicine 2011
  42. Data quality is a known issue
  43. Data quality is a known issue
  44. Substructure      # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
      Gonane            34          5                   8                    21                           0
      Gon-4-ene         55          12                  3                    33                           7
      Gon-1,4-diene     60          17                  10                   23                           10
      Only 34 out of 149 structures were correct!
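
The arithmetic behind the closing line of slide 44 can be checked with a short sketch. The numbers are taken directly from the table; the variable names are only for this example.

```python
# (hits, correct, no stereo, incomplete stereo, complete but incorrect stereo)
ROWS = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}

# Each row's categories should account for every hit.
rows_consistent = all(hits == sum(rest) for hits, *rest in ROWS.values())

total_hits = sum(row[0] for row in ROWS.values())
total_correct = sum(row[1] for row in ROWS.values())
percent_correct = 100 * total_correct / total_hits

print(rows_consistent, total_hits, total_correct, f"{percent_correct:.1f}%")
for name, (hits, correct, *_) in ROWS.items():
    print(f"{name}: {100 * correct / hits:.1f}% correct")
```

The four categories in each row add up to the hit count, and overall fewer than a quarter of the retrieved structures were correct.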
  45. Patent data in public databases
  46. Patent data in public databases
  47. You just can’t trust atoms!
  48. You just can’t trust atoms!
      They make up everything…
  49. ALL variants of Yohimbine!!!
  50. What’s Methane? OLD PUBCHEM
  51. What ELSE is Methane???
  52. NEW PUBCHEM
  53. Depiction vs Accurate Representation
  54. Depiction vs Accurate Representation
  55. What is the Structure of Vitamin K1?
  56. Standardize
      • Use the SRS as a guidance document for standardization
      • Adjust as necessary to our needs
  57. Nitro groups
  58. Salt and Ionic Bonds
  59. Ammonium salts
  60. Can we MAKE Quality Data?
      • We are building systems for everyone to validate and standardize their data
  61. DICTIONARIES are powerful
      • Search all forms of structure IDs
      • Systematic name(s)
      • Trivial Name(s)
      • SMILES
      • InChI Strings
      • InChIKeys
      • Database IDs
      • Registry Number
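
As a sketch of how a dictionary lookup might first decide what kind of identifier it has been given (slide 61), the function below sorts a query into a few of the listed types. This is an assumption about one possible approach, not ChemSpider's implementation; `classify_identifier` and `cas_checksum_ok` are hypothetical names. The CAS Registry Number check uses the published check-digit rule: weight the digits before the check digit by 1, 2, 3, … from right to left, sum, and take the result modulo 10.

```python
import re

INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")
CAS_RN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def cas_checksum_ok(text):
    """True if text looks like a CAS Registry Number with a valid check digit."""
    match = CAS_RN.fullmatch(text)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def classify_identifier(query):
    """Guess the identifier type of a query string (illustrative only)."""
    query = query.strip()
    if query.startswith("InChI="):
        return "InChI String"
    if INCHIKEY.fullmatch(query):
        return "InChIKey"
    if cas_checksum_ok(query):
        return "Registry Number"
    # Telling SMILES from names reliably needs a real parser, so no guess is made.
    return "name or SMILES"


for q in ["7732-18-5", "7732-18-4", "InChI=1S/H2O/h1H2", "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "water"]:
    print(q, "->", classify_identifier(q))
```

A number that fails the check digit, such as `7732-18-4`, falls through to the catch-all bucket, which is one cheap way to catch transcription errors before they reach the dictionary.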
  62. Many Names, One Structure
  63. But big and often noisy
  64. Text-Mining and Markup…
  65. Text-Mining and Markup…
  66. With links out to platforms
  67. Dictionaries are invaluable
  68. Text Mining on IUPAC Names
      The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  69. Text Mining on IUPAC Names
      The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  70. Name to Structure Conversion
  71. Name to Structure Conversion
  72. ChemSpider “Annotations”
      • Users can add
      • Descriptions, Syntheses and Commentaries
      • Links to PubMed articles
      • Links to articles via DOIs
      • Add spectral data
      • Add Crystallographic Information Files
      • Add photos
      • Add MP3 files
      • Add Videos
  73. Spectral Data
      • Spectral data to be deposited in standard formats – JCAMP or images
      • All spectra available at: http://www.chemspider.com/spectra.aspx
      • Data are deposited on a regular basis
      • Students
      • Chemical vendors
      • Growing collection now
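
For readers unfamiliar with the JCAMP format mentioned on slide 73, a JCAMP-DX file is plain text in which each metadata record is written as `##LABEL=value`. The sketch below reads those records into a dictionary. The sample file content is invented for this example and is not taken from ChemSpider, and `parse_jcamp_header` is a hypothetical helper that ignores the data table and the format's finer label-normalization rules.

```python
# Invented sample content (assumption), shaped like a small JCAMP-DX file.
SAMPLE = """\
##TITLE=Example 1H NMR spectrum
##JCAMP-DX=4.24
##DATA TYPE=NMR SPECTRUM
##XUNITS=HZ
##YUNITS=ARBITRARY UNITS
##NPOINTS=4
##XYDATA=(X++(Y..Y))
0 1 2 3 4
##END=
"""


def parse_jcamp_header(text):
    """Collect '##LABEL=value' records; data lines without '##' are skipped."""
    header = {}
    for line in text.splitlines():
        if not line.startswith("##"):
            continue
        label, sep, value = line[2:].partition("=")
        if not sep:
            continue
        header[label.strip().upper()] = value.strip()
    return header


header = parse_jcamp_header(SAMPLE)
print(header["TITLE"], "|", header["DATA TYPE"], "|", header["XUNITS"])
```

Metadata such as the data type and axis units travels with the numbers in this format, which is what an image of a spectrum loses.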
  74. Student Submissions
  75. Data on ChemSpider
  76. Spectral Data EXTRACTION
  77. ORIGINAL
      EXTRACTED
  78. It’s exactly the WRONG WAY!
      • We should NOT be mining data out of future publications
      • Structures should be submitted “correctly”
      • Spectra should be digital spectral formats, not images
      • ESI should be RICH and interactive, preferably with OPEN DATA
  79. An Adventure into the World of
      Small but significant contribution..
  80. ChemSpider SyntheticPages
  81. Micropublishing with Peer Review
      (a chemical synthesis blog?)
  82. Multi-Step Synthesis
  83. Interactive Data
  84. Chemistry data is of value?
      • Reference databases generate hundreds of millions of dollars/euros per year
      • So much data generated that could go public
      • Maybe 5% of all data generated is published
      • There is no “Journal of Failed Experiments”
      • Funding agencies start to demand Open Data
      • Scientists want funding but also recognition
  85. A shift to Openness
  86. How will I get recognized?
      • Who in the room has an ORCID?
  87. Deposition of Research Data
      • If we manage compounds, syntheses and analytical data…
      • If we have security and provenance of data…
      • If we deliver user interfaces to satisfy the various use cases…
      • Then we have delivered electronic lab notebooks for chemistry laboratories. ELNs are research data repositories
  88. Recognition: need to have Impact
  89. Quantitating scientists?
  90. National Information Standards Organization and “Altmetrics”
      http://www.niso.org/apps/group_public/download.php/13295/niso_altmetrics_white_paper_draft_v4.pdf
  91. What are we building?
      • We are building the “RSC Data Repository”
      • Containers for compounds, reactions, analytical data, tabular data
      • Algorithms for data validation and standardization
      • Flexible indexing and search technologies
      • A platform for modeling data and hosting existing models and predictive algorithms
  92. Compounds
  93. Reactions
  94. Analytical data
  95. Crystallography data
  96. Deposition of Data
      • Developing systems that provide feedback to users regarding data quality
      • Validate/standardize chemical compounds
      • Check for balanced reactions
      • Check spectral data
      • EXAMPLE Future work
      • Properties – compare experimental to pred.
      • Automated structure verification - NMR
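
One of the deposition checks on slide 96, "check for balanced reactions", can be sketched as an element count on each side of the equation. This is an illustration under simplifying assumptions, not the RSC system: the function names are hypothetical, formulas are plain element-count strings such as `H2O` (no parentheses, charges, hydrates or isotopes), and each side is given as a list of (coefficient, formula) pairs.

```python
import re
from collections import Counter

ELEMENT_COUNT = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Count atoms in a simple formula like 'CH4' or 'H2O' (no parentheses)."""
    counts = Counter()
    for symbol, number in ELEMENT_COUNT.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts


def side_total(side):
    """Sum element counts over (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in parse_formula(formula).items():
            total[element] += coefficient * n
    return total


def is_balanced(reactants, products):
    return side_total(reactants) == side_total(products)


# CH4 + 2 O2 -> CO2 + 2 H2O
balanced = is_balanced([(1, "CH4"), (2, "O2")], [(1, "CO2"), (2, "H2O")])
# CH4 + O2 -> CO2 + H2O (coefficients missing)
unbalanced = is_balanced([(1, "CH4"), (1, "O2")], [(1, "CO2"), (1, "H2O")])
print(balanced, unbalanced)
```

A check like this only confirms atom conservation; it says nothing about whether the reaction is chemically reasonable, which is why it sits alongside structure validation and spectral checks in the list above.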
  97. So we know about ORGANICS
      • Comment – you don’t know all of the challenges until you start to work in the area!
      • We, and cheminformatics companies, have solved MANY, but not all of the issues regarding organic chemistry management
      • The majority of our approaches do not map to materials
      • No standard ways to represent compounds
      • No InChI for materials
  98. Questions to consider…
      • Organics are hard enough!
      • What are your best dictionaries of materials?
      • We have chemical ontologies. Status for materials?
      • Is open annotation of your databases possible?
      • What standards do you have for materials data exchange?
  99. Polymorphism is common
  100. Known Challenges
       • Many materials are non-stoichiometric
       • How to represent composite materials (e.g. supported catalysts)?
       • Methods to distinguish novelty in materials (equivalent to diversity in organic structures)?
       • Many more I will learn at this workshop..?
  101. Collaboration is key
  102. Internet Data
       The Future
       Commercial Software
       Pre-competitive Data
       Open Science
       Open Data
       Publishers
       Educators
       Open Databases
       Chemical Vendors
       Small organic molecules
       Undefined materials
       Organometallics
       Nanomaterials
       Polymers
       Minerals
       Particle bound
       Links to Biologicals
  103. Thank you
       Email: williamsa@rsc.org
       ORCID: 0000-0002-2668-4821
       Twitter: @ChemConnector
       Personal Blog: www.chemconnector.com
       SLIDES: www.slideshare.net/AntonyWilliams
