ChemSpider as an integration hub for interlinked chemistry data

The internet has provided access to unprecedented quantities of data. In the domain of chemistry specifically, over the past decade the web has become populated with tens of millions of chemical structures and related properties, both experimental and predicted, together with tens of thousands of spectra and syntheses. The data have, to a large extent, remained disparate and disconnected. In recent years, with the wave of Web 2.0 participation, any chemist can contribute to both the sharing and validation of chemistry-related data, whether via Wikipedia, the online encyclopedia, or one of the multiple public compound databases. Toxicologists commonly wish to source data, either for reference purposes or to support the development of models; when experimental data are not available, predicted data will suffice. This presentation will offer a perspective on the type and quality of chemistry data available today, our experiences of building the ChemSpider public compound database to link together chemistry on the internet, and our efforts to both encourage and enable even greater integration and connectivity of chemistry data for the community.

Published in: Technology, Education

Transcript of "ChemSpider as an integration hub for interlinked chemistry data"

  1. ChemSpider as an Integration Hub for Interlinked Chemistry Data. Antony Williams, SETAC, November 18th 2013
  2. How Much Data Online? • How much data regarding environmental toxicology and chemistry is online? • How can it all be mapped together?
  3. A Grand Challenge… • Let’s map together all historical chemistry data and build systems to integrate new data • Let’s integrate chemistry, toxicology and biology data and add in disease data too • Let’s model the data and see if we can extract new relationships – quantitative and qualitative • Let’s make it all available on the web
  4. What about this… • We’re going to map the world • We’re going to take photos of as many places as we can and link them together • We’ll let people annotate and curate the map • Then let’s make it available free on the web • We’ll make it available for decision making • Put it on Mobile Devices, Give it Away
  5. The World of Online Chemistry • Property databases • Compound aggregators • Screening assay results • Scientific publications • Encyclopedic articles (Wikipedia) • Metabolic pathway databases • ADME/Tox data – eTOX for example • Blogs/Wikis and Open Notebook Science
  6. How to Map Data Together • Download the structure representations and map together at the structure level • Integrate and mesh chemical names, chemical properties, analytical data • Carry URL links and retain external links to original data sets (assume no link decay) • It sounds easy…
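The mapping step on the slide above can be sketched roughly as follows. This is a minimal illustration only, not ChemSpider's actual pipeline: it assumes each source has already been reduced to records carrying a standard InChIKey, a name and a URL, and the source names, field names and `example.org` URLs are hypothetical.

```python
from collections import defaultdict


def merge_by_structure(sources):
    """Group records from several sources under a shared structure key.

    `sources` maps a (hypothetical) source name to a list of records, each a
    dict with "inchikey", "name" and "url" fields. Records with the same
    InChIKey are treated as the same chemical.
    """
    hub = defaultdict(lambda: {"names": set(), "links": []})
    for source_name, records in sources.items():
        for rec in records:
            entry = hub[rec["inchikey"]]
            entry["names"].add(rec["name"])  # mesh names
            entry["links"].append((source_name, rec["url"]))  # keep a link back to the source
    return dict(hub)


# Hypothetical input: two sources that both describe caffeine under different names.
sources = {
    "source_a": [
        {"inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "name": "caffeine",
         "url": "https://example.org/a/1"},
    ],
    "source_b": [
        {"inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "name": "1,3,7-trimethylxanthine",
         "url": "https://example.org/b/42"},
        {"inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "name": "ethanol",
         "url": "https://example.org/b/7"},
    ],
}

hub = merge_by_structure(sources)
```

The hub stores only names and links back to each source rather than the sources' full data, which matches the "connect, do not harvest" idea; real data would also need structure standardization before the keys can be trusted.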
  7. ChemSpider • Build a HUB connecting as many data sources as possible • NOT to harvest all data from each data source • Today we have >29 million unique chemicals from >500 data sources • Focus on improving data quality • Allow users to enhance, curate and annotate
  8. RSC’s ChemSpider
  9. Identifiers are very useful! But what about when they are “closed”?
  10. CAS Numbers Validation?
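One check that can be run on a CAS Registry Number without access to the registry is its check digit: the final digit equals the weighted sum of the preceding digits (weights 1, 2, 3, … counted from the right) modulo 10. The sketch below shows only that internal consistency test; the slide does not say which validation ChemSpider applies, and the function name is hypothetical.

```python
import re


def has_valid_cas_check_digit(cas: str) -> bool:
    """Return True if `cas` is formatted like a CAS number and its check digit matches.

    This says nothing about whether the number is assigned to the right
    compound, only that it is internally consistent.
    """
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


water_ok = has_valid_cas_check_digit("7732-18-5")   # water
typo_caught = not has_valid_cas_check_digit("7732-18-4")
```

A number that passes can still be attached to the wrong structure, so this only filters out typing and transcription errors.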
  11. Various Registration Numbers
  12. Mappings and Inconsistencies: Imatinib Mesylate (ChemSpider, Drugbank, PubChem)
  13. The InChI Identifier
  14. InChIStrings Hash to InChIKeys
  15. Vancomycin – Search the Internet
  16. Vancomycin: Search Molecular SKELETON / Search Full Molecule
  17. Full Skeleton Search: 529 Hits
  18. Full Molecule Search: 294 Hits
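A standard InChIKey has three hyphen-separated blocks: 14 characters hashed from the connectivity (the molecular skeleton), 10 characters covering the remaining layers such as stereochemistry plus flag characters, and one protonation character. The sketch below assumes that the "skeleton" search on the slides corresponds to matching the first block only and the "full molecule" search to matching the whole key; the slides do not confirm this. The two keys are believed to be the standard InChIKeys for L-alanine and for alanine with no stereochemistry specified, and should be treated as illustrative.

```python
def skeleton_block(inchikey: str) -> str:
    """Return the 14-character connectivity block of a standard InChIKey."""
    parts = inchikey.split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return parts[0]


def same_skeleton(key_a: str, key_b: str) -> bool:
    """Loose match: same connectivity; stereo and other layers are ignored."""
    return skeleton_block(key_a) == skeleton_block(key_b)


def same_molecule(key_a: str, key_b: str) -> bool:
    """Strict match: the complete keys must be identical."""
    return key_a == key_b


L_ALANINE = "QNAYBMKLOCPYGJ-REOHZZNBSA-N"         # stereo defined
ALANINE_NO_STEREO = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"  # stereo not specified

loose = same_skeleton(L_ALANINE, ALANINE_NO_STEREO)
strict = same_molecule(L_ALANINE, ALANINE_NO_STEREO)
```

Under this reading, a skeleton search returns more hits than a full molecule search because it also picks up records with different or missing stereochemistry.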
  19. Historical Data for reference • As evidence that InChI is proliferating and data is improving: • Three years ago there were only 104 hits on the complete InChI online • Only 4 were correct
  20. What you might not know about Chemistry Databases on the Internet
  21. NCGC Pharma Collection
  22. NCGC Pharma Collection
  23. NCGC Pharma Collection
  24. PHYSPROP Database • The freely downloadable database under the EPI Suite prediction software • Very basic filters suggest data quality issues
  25. The Stereochemistry Challenge: 12,500 chemicals with “missed” stereo
  26. NIST Webbook
  27. PubChem
  28. Patents
  29. Patents
  30. But ChemSpider is curated, right?
  31. Originally 15 compounds “called” Yohimbine; 54 Skeletons for Yohimbine
  32. Crowdsourced Curation • Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
  33. Search “Vitamin H”
  34. “Curate” Identifiers
  35. “Curate” Identifiers
  36. “Curate” Identifiers
  37. Chemical name dictionaries for: • Text-mining (publications, patents) • Used to index PubMed and link Google Patents • Linking to other databases – think Biology! • When structures are not available names link • Searching the web • Names link to structures link to InChIs
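Dictionary-based name linking of the kind listed on the slide above can be sketched as a lookup from curated names and synonyms to a structure record. This is a toy illustration under stated assumptions: the record identifiers are hypothetical placeholders, the dictionary holds only three entries, and a production text-mining system would need far more than exact matching.

```python
import re

# Hypothetical curated dictionary: lower-cased name or synonym -> record identifier.
# "Vitamin H" is a synonym of biotin, so both names point at the same record.
NAME_TO_RECORD = {
    "biotin": "record-1",
    "vitamin h": "record-1",
    "vincristine": "record-2",
}


def tag_chemical_names(text, dictionary):
    """Return (matched text, record id) pairs for dictionary names found in `text`."""
    # Try longer names first so multi-word names are not shadowed by shorter ones.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), dictionary[m.group(0).lower()]) for m in pattern.finditer(text)]


hits = tag_chemical_names("Vitamin H (biotin) and vincristine were assayed.", NAME_TO_RECORD)
```

Because synonyms resolve to one record, a document mentioning only "Vitamin H" still links to the same structure as one mentioning "biotin", which is why curating the identifiers matters.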
  38. I want to know about “Vincristine”
  39. Vincristine: Identifiers to link
  40. Vincristine: Vendors and Sources, Linked by Structure
  41. Vincristine: Patents, Linked by Name
  42. Vincristine: Articles, Linked by Name
  43. What needs to happen? • Standards • Standardization of structures • More sharing of data – downloadable data collections for mapping, meshing and integration • InChI adoption • Collaboration • Stop reinventing the wheel • Share data, share efforts and speed the process
  44. Adopting Modified FDA Rules
  45. Nitro groups
  46. Salt and Ionic Bonds
  47. Ammonium salts
  48. What if we could capture it all? Digitally Enhancing the RSC Archive
  49. Start with data in publications
  50. Text Mining: The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were charged into a glass reaction vessel equipped with a mechanical stirrer, thermometer and reflux condenser. The reaction mixture was heated at reflux with stirring, for a period of about one-half hour. After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  51. ChemSpider Reactions
  52. Turn “Figures” Into Data (FIGURE / EXTRACTED DATA)
  53. Conclusions • There are some amazing online resources for environmental toxicology and chemistry already! • ChemSpider has an important role in quality data and linking resources • Crowdsourced deposition, validation and curation works • Standards are an important part of data linking • MORE collaboration and data sharing can benefit us all
  54. Thank you • Email: williamsa@rsc.org • Twitter: @ChemConnector • Personal Blog: www.chemconnector.com • SLIDES: www.slideshare.net/AntonyWilliams