Checking, Curating And Qualifying Chemistry

An overview of how we curate and annotate small molecules, and how that work forms the basis of ChemMantis. A presentation given to the PDB team at Rutgers University.

Checking, Curating And Qualifying Chemistry Presentation Transcript

  • 1. Checking, Curating and Qualifying Chemistry to Build a Structure Centric Community for Chemists. Rutgers University, 12/2/2008. Antony Williams
  • 2. ChemSpider - A Search Engine for Chemists
    • Questions a chemist might ask…
      • What is the melting point of n-butanol?
      • What is the chemical structure of Xanax?
      • Chemically, what is phenolphthalein?
      • What are the stereocenters of cholesterol?
      • Where can I find publications about xylene?
      • What are the different trade names for Ketoconazole?
      • What is the NMR spectrum of Aspirin?
      • What are the safety handling issues for Thymol Blue?
      • ChemSpider can answer all of these questions
  • 3. Tell Me About Glutathione
  • 4. Tell Me About Glutathione
  • 5. Tell Me About Glutathione
  • 6. Tell Me About Glutathione
  • 7. Tell Me About Glutathione
  • 8. Link outs
  • 9. Links out to KEGG (Kyoto Encyclopedia of Genes and Genomes)
  • 10. How many names does a compound have?
  • 11. ChemSpider Data Content
    • Over 21.5 million unique chemical structures from ca. 150 data sources
      • Online Databases – PubChem, DrugBank, KEGG, Wikipedia
      • Literature – PubMed, J Het Chem, Nature, RSC, Open Access
      • Chemical Vendors – over 40 different vendors and growing
      • Personal Depositions – individual contributions
      • Content database vendors
      • Analytical data collections
      • Patents
      • Web scraping
      • Content is linked back to the original data sources
  • 12. Complex Search
  • 13. The Quality of Data Online…
    • Aggregating data opens up quality issues
    • Structure-identifier associations are “dirty”
    • Structures are COMMONLY incorrect
    • Manual curation of small databases is enough work – what about millions of structures?
    • Structures are far from perfect. What is a “correct structure”?
      • Full stereochemistry?
      • Historical timeline of structure?
      • Who is the authority?
  • 14. Quality is a Major Issue: Search Butanol (OLD EXAMPLE, now fixed)
  • 15. Wikipedia Chemistry Curation project
    • Only ca. 5000 organic structures, 7000 total structures
    • Almost a year of work so far for a team of 6 people
    • Many errors removed in the process. Curation process is a daily event for users/depositors
    • Slow and tortuous process
    • http://en.wikipedia.org/wiki/Talk:Tacrolimus#IUPAC_Name_and_structure
  • 16. Wikipedia Curation
    • Looking for self-consistency across a Wikipedia Page
    • Primary key is the article TITLE
    • The chemical shown needs to match the title
    • Cyclic self-consistency – and decisions must get made
  • 17. Other issues…
  • 18. Charges
  • 19. Sugars – Machine Readable vs Aesthetics: Haworth, Stereo, Fischer
  • 20. Wikipedia – Crowdsourcing Chemistry
  • 21. Thymol Blue on ChemSpider
    • Data online includes:
      • UV-vis spectrum
      • Measured experimental properties
      • Link to Wikipedia article
      • Links to chromatography details
      • Multiple identifiers/trade names etc.
      • Links to vendors/suppliers/other databases
      • Safety information
      • http://www.chemspider.com/q/thymol%20blue
  • 22. Crowd-sourcing Curation
    • How to curate data for millions of structures?
    • Robot processes can clean up depositions
      • Search for Chloride and check molecular formula for Cl
      • Check for stereochemistry and remove names with stereo
    • Provide a simple-to-use platform to curate, annotate and tag data
    • Provide curator administration to prevent vandalism (Veropedia)
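The two robot checks on slide 22 can be sketched in a few lines of Python. This is a minimal illustration only, not ChemSpider's actual implementation: the function names, the simple formula tokenizer, and the short list of stereo descriptors are all assumptions made for the example. It also assumes that "remove names with stereo" means dropping names that carry stereo descriptors when the deposited structure has no stereochemistry defined.

```python
import re

# Element symbols are one capital letter plus an optional lowercase letter.
ELEMENT_TOKEN = re.compile(r"[A-Z][a-z]?")

# Assumed, deliberately incomplete set of stereo descriptors found in names:
# (R), (S), (E), (Z), (2R,3S), cis-, trans-, D-, L-
STEREO_DESCRIPTOR = re.compile(r"\((?:\d*[RSEZ],?)+\)|\b(?:cis|trans)-|\b[DL]-")


def elements_in_formula(formula):
    """Return the set of element symbols in a molecular formula string."""
    return set(ELEMENT_TOKEN.findall(formula))


def chloride_name_matches_formula(name, formula):
    """A name mentioning 'chloride' should come with a formula containing Cl."""
    if "chloride" not in name.lower():
        return True  # the rule does not apply to this name
    return "Cl" in elements_in_formula(formula)


def drop_stereo_names(names, structure_has_stereo):
    """Remove stereo-specific names when the structure defines no stereo."""
    if structure_has_stereo:
        return list(names)
    return [n for n in names if not STEREO_DESCRIPTOR.search(n)]
```

For example, a deposition named "sodium chloride" with formula C4H10O would be flagged by the first check, and "(S)-butan-2-ol" would be dropped from a record whose structure has no defined stereocenters.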
  • 23. Post Comments
    • Anyone can “Post Comments” associated with a structure. To curate data we require a login so that changes can be tracked
  • 24. Multi-level Curation and Approval
  • 25. Crowd-sourcing Chemistry
    • Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
    • ALSO
    • Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
  • 26. Vancomycin
    • Originally 12 structures with vancomycin
      • Incomplete stereochemistry
      • Complete but different stereochemistry
      • Different charge states
    • 1 remains after community collaboration with ChEBI
  • 27. “Collaboration” with ChEBI
  • 28. Ginkgolide B
  • 29. DailyMed
  • 30. Quality of Structures
  • 31. Quality of Structures!!!
  • 32.  
  • 33. “Entity Extraction”
    • Rule-based recognition of systematic names:
      • Use a lexicon of name fragments
      • Rules for identifying bounds of a name
    • Look-up dictionary:
      • Drug Names
      • Trivial Names
      • Numbers: Registry IDs, EINECS/ELINCS
      • Massive look-up dictionary of validated identifiers on ChemSpider
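The two approaches on slide 33 can be combined in a small sketch: look each word up in a dictionary first, then fall back to a rule that accepts a word if it can be tiled exactly by known name fragments. The dictionary, the fragment lexicon, and the "at least two fragments" threshold below are tiny hypothetical stand-ins chosen for illustration; a real system would need far larger resources and proper handling of locants, brackets, and multi-word names.

```python
import re

# Hypothetical look-up dictionary of trivial and drug names (lowercase).
DICTIONARY = {"ethanol", "aspirin", "xanax"}

# Hypothetical lexicon of systematic-name fragments.
FRAGMENTS = ["di", "tri", "chloro", "meth", "eth", "acet",
             "ane", "anol", "one", "yl", "silane"]


def count_fragments(word, fragments=FRAGMENTS):
    """Return the fewest fragments that tile `word` exactly, or None."""
    best = [None] * (len(word) + 1)
    best[0] = 0  # the empty prefix needs zero fragments
    for end in range(1, len(word) + 1):
        for frag in fragments:
            start = end - len(frag)
            if start >= 0 and best[start] is not None and word[start:end] == frag:
                candidate = best[start] + 1
                if best[end] is None or candidate < best[end]:
                    best[end] = candidate
    return best[len(word)]


def find_entities(text):
    """Return (word, source) pairs, where source is 'dictionary' or 'rule'."""
    found = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group()
        lowered = word.lower()
        if lowered in DICTIONARY:
            found.append((word, "dictionary"))
        else:
            n = count_fragments(lowered)
            # Require two or more fragments so ordinary words such as "one"
            # are not mistaken for chemical names.
            if n is not None and n >= 2:
                found.append((word, "rule"))
    return found
```

Run on the sentence "The mixture was washed with dichloromethane and recrystallized from ethanol.", this sketch tags dichloromethane through the fragment rule and ethanol through the dictionary, and leaves the ordinary English words alone.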
  • 34. Name Recognition
    • Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol).
    • The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%)
  • 35. Name Recognition
    • Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol).
    • The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%)
  • 36. ChemMantis
    • Chemical Markup And Nomenclature Transformation Integrated System
  • 37. Document markup
  • 38. Markup – 3 seconds!
  • 39. On the fly conversion
  • 40. Shorthand Formulae Supported
  • 41. One Click to more Info…
  • 42. Names and Structures
    • Dichloroacetone
    • Trichloromethylsilane
  • 43. Ambiguity
  • 44. Ambiguity in Abbreviations - DPA
  • 45. IUPAC PAC Articles
  • 46. Patents
  • 47.
    • Single Configuration File defines entities for markup
    • Algorithms can be built for certain entities but the majority are dictionaries – vendors, physical properties, analytical terms
    • We can extend our system – should we integrate to PDB somehow?
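As a rough illustration of slide 47, a single configuration object could declare each entity type as either a dictionary or an algorithm. Everything in the sketch below is hypothetical: the config layout, the entity names, the vendor "Acme Chemicals", and the `mark_up` function are invented for the example and do not describe the real ChemMantis configuration file. The one algorithmic entity shown is the CAS Registry Number, whose final digit is a checksum (each preceding digit, counted from the right, is multiplied by its position, and the sum is taken modulo 10).

```python
import re


def is_valid_cas(candidate):
    """Check the CAS Registry Number check digit, e.g. 7732-18-5 (water)."""
    digits = candidate.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check


# Hypothetical configuration: one entry per entity type to mark up.
ENTITY_CONFIG = {
    "vendor": {
        "kind": "dictionary",
        "terms": {"acme chemicals"},  # made-up vendor name, lowercase
    },
    "cas_number": {
        "kind": "algorithm",
        "pattern": re.compile(r"\b\d{2,7}-\d{2}-\d\b"),
        "validate": is_valid_cas,
    },
}


def mark_up(text, config=ENTITY_CONFIG):
    """Return (matched text, entity type) pairs in document order."""
    hits = []
    lowered = text.lower()
    for entity_type, spec in config.items():
        if spec["kind"] == "dictionary":
            for term in spec["terms"]:
                pattern = r"\b" + re.escape(term) + r"\b"
                for m in re.finditer(pattern, lowered):
                    hits.append((m.start(), text[m.start():m.end()], entity_type))
        else:
            for m in spec["pattern"].finditer(text):
                if spec["validate"](m.group()):
                    hits.append((m.start(), m.group(), entity_type))
    return [(matched, entity_type) for _, matched, entity_type in sorted(hits)]
```

With this layout, adding a new dictionary-backed entity type means adding one more entry to the configuration rather than writing new markup code.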
  • 48. Nature Publications
  • 49. Entity Balloons
    • Structures are the language of chemistry
    • Show structures to chemists and search/link from there
    • Link to PDB?
  • 50. Other Dictionaries - Species
    • We are considering
      • Bacteria
      • Fungi
      • Enzymes
      • Viruses
      • PDB codes?
  • 51. Integrations Out to Other Sources
  • 52. Reactions
  • 53. Conclusions
    • The quality of structure-based data online should always be questioned – that includes ChemSpider
    • Data on ChemSpider are being added and curated on a daily basis, but we always need more eyeballs helping
    • ChemSpider has a large validated structure-name dictionary
    • Chemical name extraction and document markup are very enabling