Checking, Curating And Qualifying Chemistry

An overview of what we do to curate and annotate small molecules, and how that work forms the basis of ChemMantis. A presentation given to the PDB team at Rutgers University.

Presentation Transcript

  • Checking, Curating and Qualifying Chemistry to Build a Structure Centric Community for Chemists. Rutgers University, 12/2/2008. Antony Williams
  • ChemSpider - A Search Engine for Chemists
    • Questions a chemist might ask…
      • What is the melting point of n-butanol?
      • What is the chemical structure of Xanax?
      • Chemically, what is phenolphthalein?
      • What are the stereocenters of cholesterol?
      • Where can I find publications about xylene?
      • What are the different trade names for Ketoconazole?
      • What is the NMR spectrum of Aspirin?
      • What are the safety handling issues for Thymol Blue?
      • ChemSpider can answer all of these questions
  • Tell Me About Glutathione
  • Link outs
  • Links out to KEGG (Kyoto Encyclopedia of Genes and Genomes)
  • How many names does a compound have?
  • ChemSpider Data Content
    • Over 21.5 million unique chemical structures from ca. 150 data sources
      • Online Databases – PubChem, DrugBank, KEGG, Wikipedia
      • Literature – PubMed, J Het Chem, Nature, RSC, Open Access
      • Chemical Vendors – over 40 different vendors and growing
      • Personal Depositions – individual contributions
      • Content database vendors
      • Analytical data collections
      • Patents
      • Web scraping
      • Content is linked back to the original data sources
  • Complex Search
  • The Quality of Data Online…
    • Aggregating data opens up quality issues
    • Structure-identifier associations are “dirty”
    • Structures are COMMONLY incorrect
    • Manual curation of small databases is enough work – what about millions of structures?
    • Structures are far from perfect. What is a “correct structure”?
      • Full stereochemistry?
      • Historical timeline of structure?
      • Who is the authority?
  • Quality is a Major Issue: Search Butanol (old example, now fixed)
  • Wikipedia Chemistry Curation project
    • Only ca. 5000 organic structures, 7000 total structures
    • Almost a year of work so far for a team of 6 people
    • Many errors removed in the process. Curation process is a daily event for users/depositors
    • Slow and torturous process
    • http://en.wikipedia.org/wiki/Talk:Tacrolimus#IUPAC_Name_and_structure
  • Wikipedia Curation
    • Looking for self-consistency across a Wikipedia Page
    • Primary key is the article TITLE
    • The chemical shown needs to match the title
    • Cyclic self-consistency – and decisions must get made
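The self-consistency check described above can be sketched in a few lines. This is only an illustration, assuming a hypothetical `LOOKUP` table that resolves names and structure strings to a canonical structure key; the keys, the `check_page` function and the field names are invented for the example and are not part of Wikipedia or ChemSpider.

```python
# Hypothetical lookup: every identifier resolves to a canonical structure key.
# The keys are made-up labels, not real InChIKeys or record IDs.
LOOKUP = {
    "ethanol": "KEY-ETHANOL",
    "ethyl alcohol": "KEY-ETHANOL",
    "CCO": "KEY-ETHANOL",
    "methanol": "KEY-METHANOL",
    "CO": "KEY-METHANOL",
}


def check_page(title, fields):
    """Compare every field on a page with the structure implied by the title.

    The article title is treated as the primary key. Returns a dict mapping
    each field name to 'ok', 'mismatch' or 'unresolved'.
    """
    expected = LOOKUP.get(title.lower())
    report = {}
    for field, value in fields.items():
        # Try the value as written (structure strings are case-sensitive),
        # then lower-cased (names are not).
        key = LOOKUP.get(value) or LOOKUP.get(value.lower())
        if key is None or expected is None:
            report[field] = "unresolved"
        elif key == expected:
            report[field] = "ok"
        else:
            report[field] = "mismatch"
    return report


report = check_page(
    "Ethanol",
    {"iupac_name": "ethanol", "smiles": "CO", "synonym": "grain alcohol"},
)
print(report)
```

A 'mismatch' or 'unresolved' entry is where a human decision is still needed: the check only flags the inconsistency, it does not say which field is wrong.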
  • Other issues…
  • Charges
  • Sugars – Machine Readable vs Aesthetics (Haworth, Stereo, Fischer)
  • Wikipedia – Crowdsourcing Chemistry
  • Thymol Blue on ChemSpider
    • Data online includes:
      • UV-vis spectrum
      • Measured experimental properties
      • Link to Wikipedia article
      • Links to chromatography details
      • Multiple identifiers/trade names etc.
      • Links to vendors/suppliers/other databases
      • Safety information
      • http://www.chemspider.com/q/thymol%20blue
  • Crowd-sourcing Curation
    • How to curate data for millions of structures?
    • Robot processes can clean up depositions
      • Search for Chloride and check molecular formula for Cl
      • Check for stereochemistry and remove names with stereo
    • Provide a simple-to-use platform to curate, annotate and tag data
    • Provide curator administration to prevent vandalism (Veropedia)
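The two robot checks named above (a "chloride" name against a formula without Cl, and stereo descriptors in a name against a structure without stereo) could look roughly like the sketch below. The record layout, the function names and the list of stereo markers are assumptions made for illustration; the actual ChemSpider robots are not described in the slides.

```python
import re

# Assumed, incomplete set of stereo descriptors: (R), (2R,3S), (E), (Z),
# (+), (-), (+/-), and the prefixes cis-, trans-, D-, L-.
STEREO_MARKERS = re.compile(
    r"\((?:[0-9]*[RS][,0-9RS]*|[EZ]|[+-]|\+/-)\)|\b(?:cis|trans|[DL])-"
)


def elements_in(formula):
    """Return the set of element symbols in a formula such as 'C4H9Cl'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def name_conflicts(name, formula, has_stereo):
    """List the reasons a name looks inconsistent with its structure."""
    problems = []
    if "chloride" in name.lower() and "Cl" not in elements_in(formula):
        problems.append("name says chloride but formula has no Cl")
    if not has_stereo and STEREO_MARKERS.search(name):
        problems.append("name has stereo descriptors but structure has none")
    return problems


def clean_names(names, formula, has_stereo):
    """Keep only the names that pass every rule."""
    return [n for n in names if not name_conflicts(n, formula, has_stereo)]


# Names deposited against a stereo-free C4H10O record (illustrative data).
deposited = ["butan-1-ol", "n-butanol", "butyl chloride", "(R)-butan-2-ol"]
kept = clean_names(deposited, "C4H10O", has_stereo=False)
print(kept)
```

Rules like these only remove names that are clearly inconsistent. A name that passes both checks can still be wrong, which is why the remaining bullets rely on human curators.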
  • Post Comments
    • Anyone can “Post Comments” associated with a structure. To curate data we require login to track
  • Multi-level Curation and Approval
  • Crowd-sourcing Chemistry
    • Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
    • ALSO
    • Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
  • Vancomycin
    • Originally 12 structures with vancomycin
      • Incomplete stereochemistry
      • Complete but different stereochemistry
      • Different charge states
    • 1 remains after community collaboration with ChEBI
  • “Collaboration” with ChEBI
  • Ginkgolide B
  • DailyMed
  • Quality of Structures
  • Quality of Structures!!!
  • “Entity Extraction”
    • Rule-based recognition of systematic names:
      • Use a lexicon of name fragments
      • Rules for identifying bounds of a name
    • Look-up dictionary:
      • Drug Names
      • Trivial Names
      • Numbers: Registry IDs, EINECS/ELINCS
      • Massive look-up dictionary of validated identifiers on ChemSpider
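A toy version of the two approaches above, dictionary look-up first and rule-based recognition as a fallback, might look like this. The dictionary entries, record labels, fragment list and length threshold are all placeholders chosen for the example; a real system would use a far larger fragment list and proper rules for finding name boundaries.

```python
import re

# Tiny stand-in for the look-up dictionary. The record labels are
# placeholders, not real ChemSpider IDs.
DICTIONARY = {
    "dichloromethane": "record-1",
    "ethanol": "record-2",
    "aspirin": "record-3",
}

# A handful of systematic-name fragments (assumed, far from complete).
FRAGMENTS = ["meth", "eth", "prop", "but", "phenyl", "amino", "chloro",
             "di", "tri", "yl", "an", "ane", "ene", "ol", "one"]
FRAGMENT = "|".join(
    re.escape(f) for f in sorted(FRAGMENTS, key=len, reverse=True)
)
FRAGMENT_RE = re.compile(FRAGMENT, re.IGNORECASE)
# A candidate name is a run of fragments, locants and punctuation only.
NAME_RE = re.compile("(?:" + FRAGMENT + r"|[0-9,()\[\]-])+", re.IGNORECASE)


def extract_entities(text):
    """Return (token, source) pairs; source is 'dictionary' or 'rule'."""
    found = []
    for raw in text.split():
        token = raw.strip(".,;:")
        if token.lower() in DICTIONARY:
            found.append((token, "dictionary"))
        elif (len(token) >= 6            # skip short words such as "one"
              and FRAGMENT_RE.search(token)   # needs at least one fragment
              and NAME_RE.fullmatch(token)):
            found.append((token, "rule"))
    return found


sentence = ("The solid was washed with dichloromethane, "
            "then 2-chloropropane was added in ethanol.")
entities = extract_entities(sentence)
print(entities)
```

Checking the dictionary first means that validated identifiers win over the rules; the rules only catch systematic names the dictionary has never seen.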
  • Name Recognition
    • Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol).
    • The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid, which was recrystallized from 95% ethanol (1.28 g, 91%).
  • ChemMantis
    • Chemical Markup And Nomenclature Transformation Integrated System
  • Document markup
  • Markup – 3 seconds!
  • On the fly conversion
  • Shorthand Formulae Supported
  • One Click to more Info…
  • Names and Structures
    • Dichloroacetone
    • Trichloromethylsilane
  • Ambiguity
  • Ambiguity in Abbreviations - DPA
  • IUPAC PAC Articles
  • Patents
    • Single Configuration File defines entities for markup
    • Algorithms can be built for certain entities but the majority are dictionaries – vendors, Phys Properties, Analytical
    • We can extend our system – should we integrate to PDB somehow?
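One way to picture a single configuration that defines the entities for markup is sketched below. The `CONFIG` layout, the entity names, the vendor name and the `[type:text]` output format are all invented for illustration; the slides do not show the actual ChemMantis configuration file. Most entities are plain dictionaries, and one is defined by a small algorithm (here, a regular expression).

```python
import re

# Hypothetical configuration: each entity type is backed either by a
# dictionary of terms or by an algorithm (a regular expression here).
# "ExampleChem" is a made-up vendor name.
CONFIG = {
    "vendor": {"kind": "dictionary", "terms": ["ExampleChem"]},
    "technique": {"kind": "dictionary", "terms": ["NMR", "UV-vis"]},
    "quantity": {
        "kind": "algorithm",
        "pattern": r"\b\d+(?:\.\d+)?\s?(?:mg|g|mmol|mL)\b",
    },
}


def compile_config(config):
    """Build one regex with a named group per entity type."""
    parts = []
    for entity, spec in config.items():
        if spec["kind"] == "dictionary":
            # Longest terms first so longer entries are preferred.
            terms = sorted(spec["terms"], key=len, reverse=True)
            body = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
        else:
            body = spec["pattern"]
        parts.append(f"(?P<{entity}>{body})")
    return re.compile("|".join(parts))


def mark_up(text, config):
    """Wrap each recognised entity as [type:text] in a single pass."""
    pattern = compile_config(config)
    return pattern.sub(lambda m: f"[{m.lastgroup}:{m.group(0)}]", text)


marked = mark_up("ExampleChem supplied 1.08 g for NMR analysis.", CONFIG)
print(marked)
```

Adding a new entity type, such as species names or PDB codes, would then mean adding one more entry to the configuration rather than changing the markup code.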
  • Nature Publications
  • Entity Balloons
    • Structures are the language of chemistry
    • Show structures to chemists and search/link from there
    • Link to PDB?
  • Other Dictionaries - Species
    • We are considering
      • Bacteria
      • Fungi
      • Enzymes
      • Viruses
      • PDB codes?
  • Integrations Out to Other Sources
  • Reactions
  • Conclusions
    • The quality of structure-based data online should always be questioned – that includes ChemSpider
    • Data on ChemSpider are being added and curated on a daily basis, but we always need more eyeballs helping
    • ChemSpider has a large validated structure-name dictionary
    • Chemical name extraction and document markup are very enabling