Cleaning up chemistry for the pharma industry: delivering a flexible platform for interrogating the FDA DailyMed website


The original abstract is below. Ultimately, this work was not funded by Microsoft, and we did not deliver it on SharePoint Server. Nevertheless, we DO depend heavily on Microsoft technology to do what we do, specifically .NET and SQL Server.

DailyMed is a website hosted by the FDA that provides access to information about marketed drugs. This information includes FDA-approved labels (package inserts), and the site serves as a standard, comprehensive, up-to-date look-up and download resource for medication content and labeling as found in package inserts. While working to enhance the dataset by making it searchable by chemical structure and substructure, we determined that the data contained numerous chemistry errors. We have therefore used a combination of text mining and automated and manual curation to improve the quality of the dataset. In so doing, we have also made querying of the data more flexible. Specifically, we have used Microsoft SharePoint technology to create a portal allowing both text-based and structure-based querying. We will report on the advantages such an approach delivers in terms of flexible interrogation of DailyMed.

Published in: Technology, Education

  1. Cleaning up chemistry for the pharma industry: delivering a flexible platform for interrogating the FDA DailyMed website. Antony Williams
  2. Vision
       • Use the DailyMed FDA website data as a data source
       • Use Microsoft SharePoint Server as a platform to demonstrate integrated ChemSpider technology
       • Deliver some “Chemistry” on the BioIT Alliance website
       • Get funding to support ChemSpider
  3. Reality
  4. Chemistry on the Internet
       • The Internet can clearly benefit chemists searching for information
       • Much of the information is based on assertions: user beware!
       • The quality of the available information varies widely. How does the user know what is and is not “correct”?
  5.
       • 21.5 million structures, 150 data sources and growing
       • Flexible searching
       • Deposition of structures, spectra, crowdsourced curation and annotation
  6. Complex Data and Information
  7. 21.5 Million Structures, Varied Sources
       • There are “bad structures” on the database
       • There are bad structure-name pairs
       • Users have associated “incorrect information”
  8. Data Curation
  9. Caution! Question Everything!
  10. Question Everything
  11. Vancomycin
       • Who will curate?
       • PubChem is not resourced to clean these errors
       • How would you clean such a large dataset?
  12. Vancomycin. ChemSpider: 1 compound, 3 days
  13. DailyMed
       • “DailyMed provides high quality information about marketed drugs.
       • This information includes FDA approved labels (package inserts).”
  14. The FDA’s DailyMed
  15. The Intention
       • Make DailyMed structure searchable via ChemSpider
       • In the process, curate data on ChemSpider and validate data on DailyMed
       • Improve the curation platform on ChemSpider
       • Perform markup of DailyMed articles to enhance the reading experience
  16. Structures on DailyMed: Poor Representations
  17. Structures on DailyMed: Lack of Stereochemistry
  18. Incorrect Structures: Simply Wrong
  19. Incorrect Structures: Scanning (?) Issues
  20. Incorrect Structures: “HOO-BOY!!!!!”
  21. Does it Matter?
       • Does it matter to the consumer that the structures are wrong? No. What matters is that what is in the bottle is the right medication!
       • To make DailyMed structure searchable it DOES matter
       • To data mine DailyMed it matters
       • To mark up DailyMed it matters
  22. The Process
       • Import all XML files from DailyMed
       • Use “home built” entity extraction based on our dictionary of chemical names
       • Articles online here:
       • Example article:
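
The two steps on the "The Process" slide (import the label XML, then run dictionary-based entity extraction) can be illustrated with a minimal sketch. This is not the ChemSpider implementation, which the slides do not describe. The XML layout, the `NAME_DICTIONARY` contents, the `ID-…` identifiers and the function names are all assumptions made for illustration, and the sample text is invented rather than taken from a real label.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily simplified label; real DailyMed XML is far richer.
# The sentence is invented and is not taken from an actual package insert.
SAMPLE_LABEL = (
    "<label><title>Tolinase</title><section>"
    "<text>Invented sentence naming tolazamide. "
    "Tolazamide again, then aspirin.</text>"
    "</section></label>"
)

# Hypothetical dictionary: lower-cased chemical name -> structure identifier.
NAME_DICTIONARY = {
    "tolazamide": "ID-0001",
    "aspirin": "ID-0002",
    "warfarin": "ID-0003",
}


def build_matcher(dictionary):
    """Compile one case-insensitive regex covering every dictionary name."""
    # Longest names first, so a long name is tried before any shorter
    # name that happens to be its prefix.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def count_mentions(xml_text, dictionary):
    """Return a Counter of dictionary names found in the label text."""
    matcher = build_matcher(dictionary)
    root = ET.fromstring(xml_text)
    # itertext() walks every element, so markup is ignored.
    text = " ".join(root.itertext())
    return Counter(m.group(0).lower() for m in matcher.finditer(text))


if __name__ == "__main__":
    for name, hits in count_mentions(SAMPLE_LABEL, NAME_DICTIONARY).items():
        print(name, NAME_DICTIONARY[name], hits)
```

Because the matcher is rebuilt from the dictionary, adding a name to the dictionary changes every later extraction run, which is the behaviour the later "Dictionaries are Easily Enhanced" slide relies on.
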
  23. State of the Data
  24. Tolinase: DailyMed on ChemSpider
  25. OTHER Mentioned Chemicals
  26. One Name, Multiple Structures: no stereo, full stereo, partial stereo, partial stereo
  27. Editing a Record
       • Do NOT deprecate the record; remove the association between the name and the chemical structure
  29. Partial Stereochemistry
  30. Loop of Assertions
       • Reduce to ONE structure, with full explicit stereo
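
The curation rule running through the "One Name", "Editing a Record" and "Loop of Assertions" slides (one name linked to several structures, keep the fully stereo-defined one, unlink the others without deprecating their records) can be sketched as a small data structure. The class names, fields and stereocentre counts below are hypothetical; the slides do not show how ChemSpider stores these links.

```python
from dataclasses import dataclass, field


@dataclass
class Structure:
    record_id: str
    stereocentres: int  # total stereocentres in the molecule
    defined: int        # how many of them are explicitly specified

    @property
    def stereo_level(self):
        """Classify the record as 'full', 'partial' or 'none'."""
        if self.defined == self.stereocentres:
            return "full"  # also covers molecules with no stereocentres
        return "none" if self.defined == 0 else "partial"


@dataclass
class NameIndex:
    structures: dict = field(default_factory=dict)  # record_id -> Structure
    links: dict = field(default_factory=dict)       # name -> set of record_ids

    def add(self, name, structure):
        """Register a structure record and link it to a name."""
        self.structures[structure.record_id] = structure
        self.links.setdefault(name, set()).add(structure.record_id)

    def curate(self, name):
        """Keep only the single fully defined structure for a name.

        Returns the record ids that were unlinked. The records themselves
        stay in self.structures: only the name association is removed.
        """
        ids = self.links.get(name, set())
        full = {i for i in ids if self.structures[i].stereo_level == "full"}
        if len(full) != 1:
            return set()  # ambiguous case: leave it for a human curator
        removed = ids - full
        self.links[name] = full
        return removed
```

In this sketch, a name linked to four records (one without stereo, one with full stereo and two with partial stereo) ends up linked to the fully defined record only, while all four records remain in the index.
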
  31. How bad can it get??? And who is right????
  32. Name-Structure Pairs
       • Cleaning up the associations of names and structures is torturous and time-consuming
       • Decisions get made and can be challenged
       • Names are not “removed”; they are still on the database
       • Such a curated “dictionary” is very valuable
  33. ChemMantis
       • Chemical Markup And Nomenclature Transformation Integrated System: ChemMantis
       • A platform for entity extraction for chemistry documents, markup and integration to online information sources such as Wikipedia, ChemSpider, Entrez…
       • Web-based submission, markup and publishing platform now hosting the ChemSpider Journal of Chemistry
  34. Back to DailyMed
  35. Quality of Structures!!!
  36. ChemMantis Markup
  37. Species Markup
  38. Dictionaries are Easily Enhanced
       • Copy-paste into the appropriate entity dictionary
       • Impacts all future markups
       • Expanding knowledgebases of information
       • Linked out to rich sources of information
  40. Outlinks…
  41. Where To From Here?
       • The platform is built; it is all eyeballs for curation now
       • As structure-identifier pairs are curated, DailyMed will improve
       • The project is now on hold: no resources to continue
  42. If We Had Our Way…
       • Convert every DailyMed label to a ChemMantis marked-up document
       • Use the XML segregation of the tablet labels to tag where chemicals are in the label
       • Allow data mining based on “where” in a label the chemicals are, e.g. drug-drug interactions
       • Mark up and mine property data out of the labels using new dictionaries related to properties such as IC50 and toxicity
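
The "where in a label" idea from the "If We Had Our Way…" slide can be sketched by recording the section in which each chemical name occurs. The element names (`section`, `title`, `text`), the name list and the function name are assumptions for illustration only, and the sample text is invented and makes no pharmacological claim. Actual DailyMed label XML would need its own element names and namespaces.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified label with invented text in each section.
LABEL_XML = """<label>
  <section><title>Description</title>
    <text>Invented text naming tolazamide.</text></section>
  <section><title>Drug Interactions</title>
    <text>Invented text naming tolazamide and aspirin.</text></section>
</label>"""

# Hypothetical list of dictionary names to look for.
CHEMICAL_NAMES = ["tolazamide", "aspirin"]


def mentions_by_section(xml_text, names):
    """Map each section title to the set of chemical names found in it."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    found = defaultdict(set)
    for section in ET.fromstring(xml_text).iter("section"):
        title = section.findtext("title", default="(untitled)")
        body = section.findtext("text", default="")
        for match in pattern.finditer(body):
            found[title].add(match.group(1).lower())
    return dict(found)


if __name__ == "__main__":
    tagged = mentions_by_section(LABEL_XML, CHEMICAL_NAMES)
    # Names that occur in the (hypothetical) interactions section.
    print(sorted(tagged.get("Drug Interactions", set())))
```

Keying the result by section title is what would allow a query such as "which chemicals are mentioned in interaction sections" across many labels.
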
  43. Conclusions
       • The internet enables chemistry, and at a reduced cost
       • Question quality! All online information is suspect
       • Crowdsourcing for expansion, curation and integration can both improve the quality of existing information and add new content
       • If the FDA doesn’t have responsibility for what is on tablet labels, who does? The answer is simply an assertion!
  44. Interesting Sites
       • ChemSpider
       • ChemSpider Journal of Chemistry
       • The InChI resolver (goes live at ACS Spring)
       • The ChemSpider blog
       • Contact: [email_address]