Structure representations in public chemistry databases: The challenges of validating the chemical structures for 200 top-selling drugs

Internet-based public domain databases containing chemical compounds have grown in number, capability and content in recent years. There are now many databases containing millions of chemical compounds associated with different types of data including chemical names, properties, analytical data, and with associated mapping to proteins, assay data, clinical information and so on. These disparate data sources suffer from one common issue – quality of data. This presentation will provide an overview of our efforts to source the appropriate structural representations for 200 top-selling drugs from public domain sources. This intra- and inter-laboratory comparison of approaches, processes and necessary agreements exposed the challenges associated with aggregating structure-based data. The project also provided data regarding the distribution of quality issues associated with many of the community’s popular databases.



  1. 1. Structure representations in public chemistry databases: The challenges of validating the chemical structures for 200 top-selling drugs
       Antony Williams, ACS Denver, September 2011
  2. 2. Upfront Acknowledgment - All Authors…
       • Royal Society of Chemistry – Antony Williams, David Sharpe
       • University of North Carolina, Chapel Hill – Alex Tropsha, Denis Fourches, Eugene Muratov, Andrew Fant
       • Chemotargets SL – Ricard Garcia-Serna
       • IMIM-Hospital del Mar Research Institute and Universitat Pompeu Fabra – Jordi Mestres
       • AstraZeneca – Sorel Muresan, Christopher Southan
       • ACD/Labs – Andrey Erin
  3. 3. Internet-Based Chemistry
       • Internet-based chemistry resources are:
           • Diverse in quality
           • Confusing
           • Uncoordinated
           • Fixable – with a lot of effort
  4. 5.
       • Open PHACTS: partnership between European Community and EFPIA
       • Freely accessible for knowledge discovery and verification.
           • Data on small molecules
           • Pharmacological profiles
           • Pharmacokinetics
           • ADMET data
           • Biological targets and pathways
           • Proprietary and public data sources.
  5. 6. Stop Whining – Fix it
  6. 7. What needs to happen?
       • Standards
           • Standardization of structures
               • ChEBI/PubChem sharing
               • InChI adoption
       • Collaboration
           • Stop reinventing the wheel
           • Share data, share efforts and speed the process
       • Vision is not good enough – Execute!
  7. 8. Standards : Structure Standardization
  8. 9. Standards : Structure Standardization
  9. 10. Standards : Structure Standardization
  10. 11. Collaboration
  11. 12. Then this won’t happen…
  12. 14. Top 200 Drugs on Wikipedia http://en.wikipedia.org/wiki/List_of_bestselling_drugs
  13. 15. The Project Challenge PART ONE
       • Agree on the set of chemical names to work with
       • Independently create an SDF file in each “lab”
       • Compare differences and agree on final structures
       • Issue “Gold Standard” SDF file to team
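The "compare differences" step above can be sketched in a few lines. This is only an illustration, not the team's actual tooling: it assumes each lab's SDF file has already been reduced to a mapping from generic drug name to a single structure identifier (for example an InChIKey). The function name `find_disagreements`, the lab names, the drug names, and the identifiers are all hypothetical placeholders.

```python
from collections import defaultdict


def find_disagreements(lab_results):
    """Group labs by the structure identifier they assigned to each drug name.

    lab_results maps lab name -> {generic drug name -> structure identifier}.
    Returns {drug name -> {identifier -> sorted list of labs}} for every drug
    where the labs did not all report the same identifier. A lab with no
    entry for a drug is recorded under the key None.
    """
    names = set()
    for assignments in lab_results.values():
        names.update(assignments)

    disagreements = {}
    for name in sorted(names):
        by_identifier = defaultdict(list)
        for lab, assignments in lab_results.items():
            by_identifier[assignments.get(name)].append(lab)
        if len(by_identifier) > 1:
            disagreements[name] = {
                ident: sorted(labs) for ident, labs in by_identifier.items()
            }
    return disagreements


# Hypothetical inputs: placeholder names and identifiers, not project data.
labs = {
    "lab_a": {"drug_1": "KEY-X", "drug_2": "KEY-Y"},
    "lab_b": {"drug_1": "KEY-X", "drug_2": "KEY-Z"},
    "lab_c": {"drug_1": "KEY-X"},
}
result = find_disagreements(labs)
for name, groups in result.items():
    print(name, groups)
```

Only the drugs returned here would need a manual discussion before a structure goes into the agreed file; everything else already has consensus.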
  14. 16. The Project Challenge PART TWO
       • Use Gold Standard SDF file to investigate data quality on these compounds in Internet databases
       • Two checks
           • Search by chemical name – does it return the correct compound? If not, how is it different?
           • Search by “structure” – SMILES, Molfile, InChI string or InChIKey
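One possible way to answer "how is it different?" for a name search is to compare the InChIKey of the returned record with the gold-standard InChIKey block by block. The sketch below relies on the published InChIKey layout (a 14-letter first block derived from connectivity, a 10-letter second block covering stereo, isotopes and flags, and a final protonation character). The function names and the keys are hypothetical placeholders, and this is not necessarily how the project scored its results.

```python
import re

# InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


def split_inchikey(key):
    """Return the (first block, second block, protonation char) of an InChIKey."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return match.groups()


def classify_hit(gold_key, found_key):
    """Describe how a database hit relates to the gold-standard structure."""
    if found_key is None:
        return "not found"
    gold = split_inchikey(gold_key)
    found = split_inchikey(found_key)
    if gold == found:
        return "exact match"
    if gold[0] != found[0]:
        return "different connectivity"
    if gold[1] != found[1]:
        return "same connectivity, second block differs (e.g. stereo or isotopes)"
    return "same first and second block, different protonation flag"


# Placeholder keys with a valid shape; they do not correspond to real compounds.
GOLD = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
print(classify_hit(GOLD, "AAAAAAAAAAAAAA-BBBBBBBBSA-N"))
print(classify_hit(GOLD, "AAAAAAAAAAAAAA-CCCCCCCCSA-N"))
print(classify_hit(GOLD, "DDDDDDDDDDDDDD-BBBBBBBBSA-N"))
print(classify_hit(GOLD, None))
```

A "same connectivity, second block differs" result is the typical signature of a record drawn with missing or different stereochemistry, which is why it is kept separate from an outright wrong compound.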
  15. 17. 200 Top-Selling Drugs (2006)
       • Biologicals removed immediately
       • Single compounds versus mixtures identified
       • Decision to NOT exclude racemates
       • List of 152 drugs to analyze
       • Generic names used
  16. 18. Different Approaches
       • ACD/Labs – curated commercial dictionary
       • RSC|ChemSpider and UNC Chapel Hill – manual curation
       • Chemotargets/IMIM – lookup against database
       • AstraZeneca – lookup against database
  17. 19. Different Approaches
  18. 20. Different Approaches
  19. 21. Different Approaches
  20. 22. Different Approaches
  21. 23. Choose a Starting Point
  22. 24. Comparisons
  23. 25. Observations
       • Manual curation – slow and imperfect process.
           • A loop of assertions
           • Software tool issues
       • Lookup – fast and imperfect
           • Totally dependent on initial investment in time
       • InChIs
           • Very useful for comparison
           • Imperfect
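To illustrate why InChIs are useful for comparison, the sketch below splits two InChI strings into their layers and reports which layers differ. It is a simplification under stated assumptions: the helper names are hypothetical, the second field is assumed to be the formula, and repeated layer prefixes (which can occur after fixed-H or isotopic layers) simply overwrite earlier ones. The alanine strings are a small illustrative example and are not taken from the drug list in this talk.

```python
def inchi_layers(inchi):
    """Split an InChI string into {layer prefix: layer text}.

    Simplified: assumes the second field is the formula, and a repeated
    prefix (possible in non-trivial InChIs) overwrites the earlier value.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError(f"not an InChI string: {inchi!r}")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:  # skip empty fields rather than failing on part[0]
            layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a, inchi_b):
    """Return the sorted layer names whose content differs between two InChIs."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return sorted(key for key in set(a) | set(b) if a.get(key) != b.get(key))


# Illustrative strings for alanine: with and without defined stereochemistry.
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
NO_STEREO = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

print(differing_layers(L_ALA, NO_STEREO))
print(differing_layers(L_ALA, D_ALA))
```

When only the stereo layers (`t`, `m`, `s`) differ, two sources agree on the skeleton and disagree on stereochemistry, which is the kind of discrepancy a curator would want flagged separately from a formula or connectivity mismatch.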
  24. 26. Structure Representations
  25. 27. Representing Racemates
  26. 28. Representing Racemates - Formoterol
  27. 29. Racemic Mixtures
  28. 30. Racemic Mixtures X
  29. 31. “ The First 10”
  30. 32. Collaboration on Curation
       • If we could collaborate on curation…share through standards and open interfaces
  31. 33. Proof of Concept Data Curation Sharing
  32. 34. SciDBs.com (Coming soon)
  33. 35. Conclusions
       • It is DIFFICULT to aggregate high quality structure datasets of even common drugs!
       • InChI is very enabling but enhanced stereo necessary
       • Is there a need to be “right”?
       • Publication will provide:
           • Recommendations for structure standardization
           • Rank ordering of resources
           • Suggestions for InChI enhancement
           • SDF file
           • Curation feed of structures and synonyms
  34. 36. Thank you
       Email: williamsa@rsc.org
       Twitter: ChemConnector
       Blog: www.chemspider.com/blog
       Personal Blog: www.chemconnector.com
       SLIDES: www.slideshare.net/AntonyWilliams