
ChemSpider – An Online Database and Registration System Linking the Web

This presentation was given at the EBI Meeting in Cambridge on October 11th 2011 regarding Chemical Registration and Standardization.

Transcript of "ChemSpider – An Online Database and Registration System Linking the Web"

  1. ChemSpider – An Online Database and Registration System Linking the Web
      Antony Williams and Valery Tkachenko
      EBI Chemical Registry Systems Workshop, October 2011
  2. www.chemspider.com
  3. ChemSpider…
      • >26 million unique molecules from >400 sources
      • .NET, SQL Server and GGA Indigo toolkit
      • Multiple Open Source components – Jmol, JSpecView, Balloon, OpenBabel, MediaWiki
      • Slices of data are Open but the entire data collection is not Open
      • Crowdsourced depositions and curations
      • Uses InChIs for navigating and linking the web
  4. Vancomycin
  5. Vancomycin: Search Molecular SKELETON / Search Full Molecule
  6. Full Skeleton Search: 104 Hits
  7. Full Molecule Search: 4 Hits
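The gap between the two hit counts can be illustrated with a minimal sketch. It assumes, since the slides do not say how ChemSpider implements the search, that a "skeleton" match compares only the first 14-character block of a standard InChIKey (which encodes connectivity), while a "full molecule" match compares the whole key. The function names and the small `records` table are hypothetical, and the InChIKeys are quoted from memory and should be checked against a trusted source.

```python
def skeleton_block(inchikey: str) -> str:
    """Return the first block of a standard InChIKey (the connectivity hash)."""
    return inchikey.split("-")[0]


def search(records: dict, query_key: str, skeleton_only: bool = False) -> list:
    """Return names whose InChIKey matches the query, fully or by skeleton only."""
    if skeleton_only:
        target = skeleton_block(query_key)
        return [name for name, key in records.items()
                if skeleton_block(key) == target]
    return [name for name, key in records.items() if key == query_key]


# Hypothetical mini-database: name -> InChIKey (keys quoted from memory).
records = {
    "L-alanine": "QNAYBMKLOCPYGJ-REOHVCBQSA-N",
    "D-alanine": "QNAYBMKLOCPYGJ-UWTATZPHSA-N",
    "alanine (no stereo)": "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",
    "ethanol": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
}

full_hits = search(records, records["L-alanine"])
skeleton_hits = search(records, records["L-alanine"], skeleton_only=True)
```

Here the full search returns a single record, while the skeleton search also returns the stereoisomer and the record with no stereo defined, which mirrors the 4 versus 104 hits shown for vancomycin.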
  8. ChemSpider…
      • >26 million unique molecules from >400 sources
      • .NET, SQL Server and GGA Indigo toolkit
      • Multiple Open Source components – Jmol, JSpecView, Balloon, OpenBabel, MediaWiki
      • Slices of data are Open but the entire data collection is not Open
      • Crowdsourced depositions and curations
      • Uses InChIs for navigating and linking the web
      • Uses Names for navigating and linking the web
  9. I want to know about “Vincristine”. If all algorithms work then everything on the page is correct by default except the name-structure relationship!
  10. Vincristine: Identifiers and Properties
  11. Vincristine: Vendors and Sources Linked by Structure
  12. Vincristine: Patents Linked by Name
  13. Vincristine: Articles Linked by Name
  14. ORIGINAL ChemSpider
      • “Create a system for linking and navigating databases on the web”
      • Use the power of InChI, and the proliferation of InChIs in databases, to make connections
      • Developed on .NET and SQL Server for speed of implementation and existing skill sets
      • Seeded with PubChem database of 10.5M chemicals and expanded using other sources to 20M
  15. How do we build it?
      • We deal in Molfiles or SDF files – with coordinates
      • Deposit anything that has an InChI – we support what InChI can handle, good and bad
      • Standardization based on “InChI standardization”
      • InChIs aggregate (certain) tautomers
      • We link out to external sites using their IDs
  16. InChIs – both on ChemSpider
  17. Downsides of InChI
      • InChI was a moving target (multiple versions) but overall worked as planned
      • Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”
      • InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
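The "first version in wins" behaviour of an InChI-keyed deduplicator can be sketched as follows. This is an illustration under stated assumptions, not ChemSpider's code: the `Registry` class, its method names and the placeholder molfile strings are all hypothetical, and the InChI is assumed to have been generated upstream from the deposited molfile.

```python
class Registry:
    """Toy registry that deduplicates deposits on their InChI string."""

    def __init__(self):
        self._by_inchi = {}

    def deposit(self, inchi: str, molfile: str, source: str) -> bool:
        """Register a structure; return True if it created a new compound."""
        record = self._by_inchi.get(inchi)
        if record is None:
            # First deposit: its molfile (layout, depiction) becomes THE record.
            self._by_inchi[inchi] = {"molfile": molfile, "sources": [source]}
            return True
        # Later deposits with the same InChI only add a source link;
        # their own molfile is not kept, even if it is drawn better.
        record["sources"].append(source)
        return False

    def get(self, inchi: str) -> dict:
        return self._by_inchi[inchi]


registry = Registry()
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
created_first = registry.deposit(ethanol, "<molfile from source A>", "source A")
created_second = registry.deposit(ethanol, "<molfile from source B>", "source B")
```

After both deposits the registry holds one compound whose depiction comes from source A and whose source list names both depositors, which is why 2D layout errors in the first deposit persist.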
  18. Side Effects of InChI Usage
  19. SMILES by comparison…
  20. Side Effects of InChI Usage
  21. Standardization Issues: depiction based on molfile
  22. Downsides of Overall Approach
      • Meshing data together based on InChIs worked for simple molecules
      • 2D layout errors inherited or limited by algorithm
      • Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds differing by one stereocenter, named the same, meant to be the same, are not the same
  23. Yohimbine
  24. Originally 15 compounds “called” Yohimbine; 54 skeletons for Yohimbine
  25. ChemSpider as an Aggregator
      • ChemSpider has inherited many errors and continues to do so, but we are now far more careful about pre-filtering
      • Cannot deposit chemicals without an InChI
      • Deprecated compounds remain deprecated
      • Curated name-structure relationships do NOT remove the related structure
          • If the name Taxol is removed from 20 asserted “incorrect structures”, those compounds remain in the database
  26. Chemistry Databases on the Internet
      • Some public databases are “trusted” as primary sources
      • Trust is granted without investigation or understanding of the content
      • What do we know about some of the online resources?
  27. PHYSPROP Database
      • The freely downloadable database under the EPI Suite prediction software
      • Very basic filters suggest data quality issues
  28. The Stereochemistry challenge: 12500 chemicals with “missed” stereo
  29. Searches on ChemSpider
      • Most searches are text-based: people searching for information about known chemicals
      • Creating accurate name-structure dictionaries is critical
  30. NIST Webbook
  31. PubChem
  32. NPC Browser http://tripod.nih.gov/npc/
  33. NPC Browser http://tripod.nih.gov/npc/
  35. NPC Browser http://tripod.nih.gov/npc/
  36. Synonyms on PubChem
      • 1,3-DICHLORO-PROPAN-2-ONE
      • (2R,3R)-Butanediol bis(methanesulfonate)
      • Ethyl-1-propenyl ether, mixture of cis and trans
      • PSS-[2-[(Chloromethyl)phenyl]ethyl]-Heptaisobutyl substituted
      • 1-Chlorobenzylethyl-3,5,7,9,11,13,15-heptaisobutylpentacyclo [9.5.1.1(3,9).1(5,15).1(7,13)]octasiloxane
  37. Synonyms on PubChem
  38. Data Proliferation
  44. What is meant by a name?
  45. Choose a Starting Point
  46. “The First 10”
  47. What is getting into Our Databases?
      • Large aggregators are inheriting junk data
      • Data HAS proliferated from ChemSpider through PubChem – in process of deprecating and redepositing
      • A lot of data is for chemicals that will never exist (probably)
  48. Standardization of Patent Data???
  49. Standardization of Patent Data???
  50. WYSIWYG compounds
  51. WYSIWYG compounds
  52. Text Mining Chemical Name Errors
  53. “DPA”
  54. All aggregators suffer dilution!
  55. Structures have timelines
  56. Name-Structure Dictionaries…
  57. Depiction for Humans
  58. Human Depiction versus Algorithms
  59. Human Depiction versus Algorithms
  60. Identifier Dictionaries
      • Reciprocal curation processes…share curation with each other
      • If a database has a compound already then use InChIKeys to match “suggested” validation against the compound
      • A series of “added” and “removed” synonyms against InChIKeys for matching
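A shared curation feed of “added” and “removed” synonyms keyed by InChIKey might be consumed roughly as below. The feed layout (tuples of InChIKey, action, synonym), the function name and the sample data are assumptions made for illustration; the slides do not define a feed format. The InChIKeys are quoted from memory.

```python
def apply_synonym_feed(dictionary: dict, feed: list) -> list:
    """Apply curated synonym changes to compounds the database already holds.

    dictionary maps InChIKey -> set of synonyms.
    feed is a list of (inchikey, action, synonym) with action "added"/"removed".
    Returns the feed entries skipped because the compound is not in the database.
    """
    skipped = []
    for inchikey, action, synonym in feed:
        names = dictionary.get(inchikey)
        if names is None:
            # The suggestion only applies if we already have the compound.
            skipped.append((inchikey, action, synonym))
        elif action == "added":
            names.add(synonym)
        elif action == "removed":
            names.discard(synonym)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return skipped


ETHANOL = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
L_ALANINE = "QNAYBMKLOCPYGJ-REOHVCBQSA-N"

# Hypothetical local dictionary containing one wrongly attached name.
dictionary = {ETHANOL: {"ethanol", "ethyl alcohol", "methanol"}}
feed = [
    (ETHANOL, "removed", "methanol"),
    (ETHANOL, "added", "EtOH"),
    (L_ALANINE, "added", "L-alanine"),  # compound not held locally
]
skipped = apply_synonym_feed(dictionary, feed)
```

In a real exchange the changes would more likely be queued as suggestions for a curator than applied directly; the sketch applies them immediately only to keep the example short.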
  61. Proof of Concept Data Curation Sharing
  62. Structure Validation using feed
      • Look for approved synonyms
      • Compare feed InChIKey with database InChIKey
      • If different, flag for inspection
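The three validation steps can be sketched in a few lines. The function name, the name-to-InChIKey layout of `database` and the deliberately wrong sample record are hypothetical; only the look up, compare, and flag logic comes from the slide. The InChIKeys are quoted from memory.

```python
def flag_mismatches(database: dict, feed: list) -> list:
    """Flag approved synonyms whose feed InChIKey differs from the database's.

    database maps a lower-cased name -> InChIKey held locally.
    feed is a list of (approved_synonym, inchikey) pairs from a curation partner.
    Returns (synonym, database_key, feed_key) tuples for manual inspection.
    """
    flagged = []
    for synonym, feed_key in feed:
        db_key = database.get(synonym.lower())  # step 1: look for the synonym
        if db_key is None:
            continue  # name not known locally, nothing to compare
        if db_key != feed_key:  # step 2: compare the two InChIKeys
            flagged.append((synonym, db_key, feed_key))  # step 3: flag
    return flagged


ETHANOL = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
L_ALANINE = "QNAYBMKLOCPYGJ-REOHVCBQSA-N"
D_ALANINE = "QNAYBMKLOCPYGJ-UWTATZPHSA-N"

# Hypothetical local records; "l-alanine" points at the wrong stereoisomer.
database = {"ethanol": ETHANOL, "l-alanine": D_ALANINE}
feed = [("Ethanol", ETHANOL), ("L-Alanine", L_ALANINE)]
flagged = flag_mismatches(database, feed)
```

The mismatch is flagged rather than corrected automatically, since either side of the comparison could be the one in error.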
  63. Open PHACTS: partnership between European Community and EFPIA
      • Freely accessible for knowledge discovery and verification
          • Data on small molecules
          • Pharmacological profiles
          • Pharmacokinetics
          • ADMET data
          • Biological targets and pathways
          • Proprietary and public data sources
  64. Adopting Modified FDA Rules
      • As already used by ChEMBL…
  65. Nitro groups
  66. Salt and Ionic Bonds
  67. Ammonium salts
  68. Parent and Child
      • Chemical entities reduced to primary component plus relationships
          • salt forms
          • solvates
          • combinations
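One possible shape for a parent-child record is sketched below. The `ChemicalEntity` class, its field names and the relationship labels are assumptions for illustration only; the slides state that a data model is under development but do not describe it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChemicalEntity:
    """A chemical record linked to its primary component (the parent)."""

    name: str
    # How this entity relates to its parent: "salt form", "solvate",
    # "combination", or None for a parent entity itself.
    relationship: Optional[str] = None
    # Excluded from repr/eq to avoid infinite recursion through children.
    parent: Optional["ChemicalEntity"] = field(
        default=None, repr=False, compare=False
    )
    children: List["ChemicalEntity"] = field(default_factory=list)

    def add_child(self, name: str, relationship: str) -> "ChemicalEntity":
        """Create a child entity and link it in both directions."""
        child = ChemicalEntity(name, relationship, parent=self)
        self.children.append(child)
        return child


parent = ChemicalEntity("vancomycin")
salt = parent.add_child("vancomycin hydrochloride", "salt form")
```

Searching on the parent can then surface every related salt, solvate or combination, while each child keeps its own identifiers and properties.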
  69. ChemSpider Standardization
      • Entire ChemSpider database will be standardized using modified FDA rule set
      • Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, Names) will be regenerated
      • Standardization procedures automatically applied to all future depositions
  70. Project Status
      • Standardization pipelining process initiated
      • Rule implementation and checking – iterative work with Open PHACTS pharma members
      • Data model development to support parent-child relationships
      • In dialog with the FDA about latest form of recommendations
  71. Conclusions
      • ChemSpider has an important role in quality data
      • Crowdsourced deposition, validation and curation works, but engagement has been low to date
      • Standardization of our entire backfile is necessary
      • Designing the standardization processes with input from pharma and general chemists is necessary
  72. Acknowledgments
      • The ChemSpider team
      • Our data providers, depositors, collaborators and curators
      • Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)
  73. Thank you
      • Email: williamsa@rsc.org
      • Twitter: ChemConnector
      • Blog: www.chemspider.com/blog
      • Personal Blog: www.chemconnector.com
      • SLIDES: www.slideshare.net/AntonyWilliams