ACS: Towards a Gold Standard Database

Talk given at ACS, 29 March 2012

  1. Towards a Gold Standard: Improving the Quality of Public Domain Chemistry Databases. Antony J. Williams (1), Sean Ekins (2). (1) Royal Society of Chemistry, Wake Forest, NC 27587; (2) Collaborations in Chemistry, Fuquay Varina, NC 27526.
  2. The future: crowdsourced drug discovery. Williams et al., Drug Discovery World, Winter 2009
  3. Chemistry structures are proliferating on the web
     • Sources: safety data, toxicity data, blogs and wikis, property databases, experimental results, scientific publications, compound aggregators, Open Notebook Science, metabolic pathway databases, encyclopedic articles (Wikipedia)
     • Users take them at face value. They SHOULD NOT!!!
     • Immense quantities of scientific information are contained in the thousands of databases
     • Progress can, however, be inhibited by errors in these databases, with downstream effects when the data is reused. http://bit.ly/zWGaps
  4. What is the Structure of Vitamin K1?
  5. What Mechanisms Do We Have to Alert the Community?
     • Email database owner and hope for a response
     • Blog it
       - Tony has been blogging about database quality for years and nobody was listening, other than the people at PubChem
       - For some databases, when he blogged they listened and would edit!
     • Tweet it
     • Dec 2010: we felt something had to be said definitively about structure quality
     • Publish it: wrote to Science, Nature and then PLoS Computational Biology http://bit.ly/qtJF2f
     • Perhaps the phone?
  6. April 27, 2011: Then came the NPC Browser. Science Translational Medicine, 2011
  7. But wait, hold on: did anyone peer review the database? The database was released and, within days, a quick analysis of structure quality revealed hundreds of errors in structures. Williams and Ekins, DDT, 16: 747-750 (2011)
  8. NPC Browser http://tripod.nih.gov/npc/
  9. Neomycin in NPC Browser http://tripod.nih.gov/npc/
  10. Neomycin in ChemSpider
  11. How many contribute to clean-up?
     • Fewer than a dozen contributors to data
     • The majority are project members
     • The crowd is small…
     • This is the same for all cheminformatics crowd-based efforts
  12. What Mechanisms Do We Have to Alert the Community? Publishing is too slow
     • Tony blogged April 28th, 1 day after release http://bit.ly/jn8wLC
     • I blogged April 29th http://bit.ly/lXHInG suggesting the need for a gold standard database
     • After more extensive analysis we sent a manuscript to Science Translational Medicine: rejected
     • Drug Discovery Today: accepted… 8 months after we pointed out the issue, even before the NPC Browser release
     Williams and Ekins, DDT, 16: 747-750 (2011)
  13. Responses from Community and NCGC
     • Comments on initial blog
     • NCGC added a disclaimer, which I blogged about May 23rd http://bit.ly/m4Tx2b
     • Sept 8th 2011: email from Tudor Oprea (cc'ed to 60 others). He has also been pointing out database errors for years.
     • Followed by one from Chris Austin offering to meet us
     • Several individuals thanked us for the alert
  14. More Extensive Analysis and Solutions
     • More analysis of NPC Browser errors
       - “analysis of the NPC browser ‘HTS amenable compounds’ subset of data for 7600 compounds identified fundamental errors in stereochemistry, valency issues and charge imbalances in a few minutes work using a rudimentary software tool”
     • Analysis of other chemistry databases and errors
     • Other types of databases and errors
     • Offered solutions
     Towards a Gold Standard: Regarding Quality in Public Domain Chemistry Databases and Approaches to Improving the Situation. Antony J. Williams, Sean Ekins and Valery Tkachenko, Drug Discovery Today, In Press 2012
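The quoted analysis mentions valency issues and charge imbalances found with "a rudimentary software tool". The slides do not say which tool was used, so the following is only a minimal sketch of what such checks could look like. The record layout (atoms as element, formal charge and hydrogen count; bonds as index pairs with an order), the tiny valence table and the function name `check_structure` are all assumptions made for illustration. Stereochemistry checks are not attempted here because they need a real cheminformatics toolkit.

```python
# Allowed total bond orders per (element, formal charge).
# Deliberately tiny and illustrative; NOT a complete valence model.
ALLOWED_VALENCE = {
    ("H", 0): {1},
    ("C", 0): {4},
    ("N", 0): {3},
    ("N", 1): {4},
    ("O", 0): {2},
    ("O", -1): {1},
    ("Cl", 0): {1},
    ("Cl", -1): {0},
    ("Na", 1): {0},
}


def check_structure(atoms, bonds):
    """Return a list of human-readable problems found in one record.

    atoms: list of (element, formal_charge, hydrogen_count)
    bonds: list of (atom_index_a, atom_index_b, bond_order)
    """
    problems = []
    # Start each atom's valence at its attached hydrogen count,
    # then add the order of every bond it takes part in.
    valence = [h for (_, _, h) in atoms]
    for a, b, order in bonds:
        valence[a] += order
        valence[b] += order
    for idx, (element, charge, _) in enumerate(atoms):
        allowed = ALLOWED_VALENCE.get((element, charge))
        if allowed is None:
            problems.append(
                f"atom {idx}: no valence rule for {element} with charge {charge:+d}"
            )
        elif valence[idx] not in allowed:
            problems.append(
                f"atom {idx}: {element} has valence {valence[idx]}, "
                f"expected one of {sorted(allowed)}"
            )
    # A record for a neutral substance (including its counterions)
    # should have formal charges that sum to zero.
    net = sum(charge for (_, charge, _) in atoms)
    if net != 0:
        problems.append(f"net charge {net:+d}: charge imbalance")
    return problems


# Hypothetical test record: sodium acetate, CH3-C(=O)-O(-) with Na(+).
SODIUM_ACETATE_ATOMS = [("C", 0, 3), ("C", 0, 0), ("O", 0, 0), ("O", -1, 0), ("Na", 1, 0)]
SODIUM_ACETATE_BONDS = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

print(check_structure(SODIUM_ACETATE_ATOMS, SODIUM_ACETATE_BONDS))
# Dropping the sodium counterion leaves an unbalanced acetate anion.
print(check_structure(SODIUM_ACETATE_ATOMS[:4], SODIUM_ACETATE_BONDS))
```

Checks this simple are cheap to run over thousands of records, which is consistent with the "few minutes work" described in the quote, although a production filter would rely on an established toolkit rather than a hand-written table.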
  15. Data Errors in the NPC Browser: Analysis of Steroids

     Substructure    # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
     Gonane          34          5                   8                    21                           0
     Gon-4-ene       55          12                  3                    33                           7
     Gon-1,4-diene   60          17                  10                   23                           10

     Towards a Gold Standard: Regarding Quality in Public Domain Chemistry Databases and Approaches to Improving the Situation. Antony J. Williams, Sean Ekins and Valery Tkachenko, Drug Discovery Today, In Press 2012
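To make the steroid figures on slide 15 easier to compare, the short sketch below turns the counts into a percentage of correct hits per substructure. The counts are transcribed from the slide; the column reading (hits, correct, no stereochemistry, incomplete, complete but incorrect) and the names `STEROID_COUNTS` and `percent_correct` are assumptions of this sketch, not something the authors provided.

```python
# Counts transcribed from slide 15, in the assumed column order:
# (hits, correct, no stereo, incomplete stereo, complete but incorrect stereo)
STEROID_COUNTS = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}


def percent_correct(counts):
    """Map each substructure to the percentage of hits with correct structures."""
    result = {}
    for name, (hits, correct, none, incomplete, incorrect) in counts.items():
        # Sanity check on the transcription: the four categories
        # should add up to the total number of hits.
        if correct + none + incomplete + incorrect != hits:
            raise ValueError(f"{name}: categories do not sum to {hits}")
        result[name] = round(100.0 * correct / hits, 1)
    return result


for name, pct in percent_correct(STEROID_COUNTS).items():
    print(f"{name}: {pct}% of hits correct")
```

Under this reading the four categories add up to the hit count in every row, and roughly 15%, 22% and 28% of the hits are correct for gonane, gon-4-ene and gon-1,4-diene respectively.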
  16. Why this matters to us and YOU the CROWD?
  17. What You Might Not Know About Chemistry Databases on the Internet
     • Data-sharing between open databases is cyclic
     • This can proliferate errors in the “Linked Data”
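The point about cyclic data-sharing can be illustrated with a small graph sketch: if each database is a node and each "deposits data into" relationship is a directed edge, a cycle means an error can travel around and return to its origin looking like independent confirmation. The database names below (`DB_A` to `DB_D`) are purely hypothetical placeholders, and `find_cycle` is an illustrative helper using a standard depth-first search, not anything described in the talk.

```python
def find_cycle(feeds):
    """Return one cycle of databases as a list, or None if there is none.

    feeds maps a database name to the list of databases it deposits data into.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for nxt in feeds.get(node, ()):
            if state.get(nxt, WHITE) == GREY:
                # nxt is already on the current path: we found a loop.
                return path[path.index(nxt):] + [nxt]
            if state.get(nxt, WHITE) == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for start in feeds:
        if state.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# Hypothetical sharing arrangement: A feeds B, B feeds C, C feeds A again.
EXAMPLE_FEEDS = {
    "DB_A": ["DB_B"],
    "DB_B": ["DB_C"],
    "DB_C": ["DB_A"],
    "DB_D": ["DB_A"],
}

print(find_cycle(EXAMPLE_FEEDS))
```

In a loop like this, a wrong structure deposited once can be re-imported by its original source, which is one reason the original source of an error becomes hard to determine.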
  18. Public Domain Databases
     • Our databases are a mess…
     • Non-curated databases are proliferating errors
     • We source and deposit data between databases
     • Original sources of errors are hard to determine
     • Curation is time-consuming and challenging
  19. Molecule Data Quality Impacts in silico drug discovery
     • vast ligand and protein–protein interaction databases
     • develop computational models
     • global mapping of pharmacological space
     • drug-target networks of approved drugs
     • prediction of off-target effects
  20. Different types of databases and errors
     • Bayer paper on target validation: 2/3 of papers did not live up to claims
     • MDL Drug Data Report (MDDR): errors
     • Errors in clinical research databases vary from 2.3% to 26.9%
     • Multicenter analysis by MS-based proteomics identified generic problems in databases when characterizing proteins
       - search engines could not distinguish different identifiers
       - many algorithms calculated molecular weight incorrectly
     • One database had between 2.1% and 13.6% of annotated Pfam hits unjustified
     • Ligand–protein X-ray structures: these can also have errors with far-reaching consequences
  21. Solutions
     • Structure validation and standardization
     • Curation
     • Annotation
     • Structure filters
       - Incorrect valency, atom labels, aromatic bonds, stereochemistry, salts, duplication
     • Structure standardization guidelines
       - Provided by the FDA (Substance Registration System Unique Ingredient Identifier (UNII): http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/default.htm)
     • Need a record of molecule provenance
     • Can we track databases and quality? www.scidbs.com
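Two of the items on slide 21, duplicate filtering and a record of molecule provenance, lend themselves to a small sketch. The talk does not specify a record format, so everything below is an assumption for illustration: the `StructureRecord` fields, the placeholder structure keys (`KEY-1`, `KEY-2`), the database names and both helper functions are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StructureRecord:
    structure_key: str   # canonical identifier string; placeholder values below
    name: str
    source_db: str       # database this copy of the structure was obtained from
    deposited_from: str  # upstream source named by that database, "" if unknown


def group_duplicates(records):
    """Return {structure_key: [records]} for keys that occur more than once."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.structure_key].append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


def missing_provenance(records):
    """Return the records that do not name an upstream source."""
    return [rec for rec in records if not rec.deposited_from]


# Hypothetical records; keys and database names are placeholders only.
RECORDS = [
    StructureRecord("KEY-1", "compound one", "DB_A", "vendor catalogue"),
    StructureRecord("KEY-1", "compound one (salt form?)", "DB_B", ""),
    StructureRecord("KEY-2", "compound two", "DB_A", "journal article"),
]

for key, recs in group_duplicates(RECORDS).items():
    print(key, "appears in", sorted(r.source_db for r in recs))
print("no provenance:", [r.name for r in missing_provenance(RECORDS)])
```

Keeping the upstream source alongside each structure is what would let a curator trace a duplicate or conflicting entry back to where it first appeared.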
  22. RSC Introduces “Validation Service”
  23. Scidbs.com
  24. Scidbs.com: DB logo, type of DB, contact, owner, website, license, curation, etc.
  25. Data should be:
     • Free from structure errors
     • Free from data errors
     • Free from experimental errors
     Are we asking too much? Is it even possible?
     Yet when we alert others:
     • When we raise our hands we are ignored
     • Our scientific community needs to wake up
  26. 26. Today NPC browser has fewer errors..so do ALL databases! More people aware of molecule quality online. Trust is earned not just granted! The future database user is more informed Tomorrow Peer reviewers test the databases that are in manuscripts NIH checks databases before release! COLLABORATION between government DBs. PLEASE!!! We need minimal compound database standards (MCDS)
  27. Acknowledgement
     • We thank the paper reviewers and blog commenters for their constructive comments
     • Chris Lipinski
     • This work was unfunded (but was the right thing to do!)
     www.scidbs.com
