
Delivering Curated Chemistry to the World via Crowdsourced Deposition and Annotation on ChemSpider

RSC|ChemSpider is one of the world’s largest online resources for chemistry-related data and services. Developed to deliver access to structure-based chemistry data via the internet, the ChemSpider platform hosts over 26 million unique chemical compounds aggregated from over 400 data sources. It provides an environment for the community to annotate and curate these existing data and to deposit new data to the system. The search system delivers flexible querying capabilities together with links to external sites for publication and patent data. This presentation will review the present capabilities of the ChemSpider system, providing direct examples of how to use the system to source high-quality data of value to chemists. We will discuss some of the challenges associated with validating data quality and examine how ChemSpider is a part of the new “semantic web for chemistry”. ChemSpider has also spawned a number of additional projects, including ChemSpider SyntheticPages for hosting openly peer-reviewed chemical synthesis articles, Learn Chemistry Wiki for students learning chemistry, and SpectraSchool for learning spectroscopy.

Published in: Technology

Transcript of "Delivering Curated Chemistry to the World via Crowdsourced Deposition and Annotation on ChemSpider"

  1. Delivering Curated Chemistry to the World via Crowdsourced Deposition and Annotation on ChemSpider. Antony Williams, University of Illinois in Chicago, January 27th 2012
  2. The World of Online Chemistry: Property databases; Compound aggregators; Screening assay results; Scientific publications; Encyclopedic articles (Wikipedia); Metabolic pathway databases; ADME/Tox data – eTOX for example; Blogs/Wikis and Open Notebook Science; Contributing Open Source code to projects
  3. We Have … Too Much Data!!!
  4. e-Science and Primary Data: How much data generated in a lab, that COULD go public, is lost forever?
  5. TotallySynthetic.com
  6. e-Science and Primary Data: How much data generated in a lab, that COULD go public, is lost forever? Public Domain reference databases of value? Syntheses; Properties; Spectra; CIFs; Images
  7. PubChem
  8. ChEMBL
  9. Collaborative Knowledge Management
  10. e-Science and Primary Data: How much data generated in a lab, that COULD go public, is lost forever? Public Domain reference databases of value? Syntheses; Properties; Spectra; CIFs; Images. Much of chemistry is chemical structure-based – where and how could we host these data?
  11. RSC’s ChemSpider
  12. Available Information… Linked to vendors, safety data, toxicity, metabolism
  13. Available Information…
  14. Crowdsourced “Annotations”: Users can add Descriptions/Syntheses/Commentaries; Links to PubMed articles; Links to articles via DOIs; Add spectral data; Add Crystallographic Information Files; Add photos; Add MP3 files; Add Videos
  15. Spectra
  16. Spectra
  17. Data on the Web
  18. Chemistry Data online is messy: We have inherited errors; All public compound databases, including ours, have errors; “Incorrect” structures – assertions, timelines etc; “Incorrect” names associated with structures; Properties; Links; Publications; ENORMOUS CHALLENGE
  19. The Structure of Vitamin K?
  20. MeSH: A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
  21. The Structure of Vitamin K1?
  22. What is the Structure of Vitamin K1?
  23. CAS’s Common Chemistry
  24. Wikipedia
  25. ChEBI – Manual Curation
  26. “2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione” Variants of systematic names on PubChem: 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl; 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl; 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl; 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl; 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl; 2-methyl-3-[(E)-3,7,11,15-tetramethyl; 2-methyl-3-(3,7,11,15-tetramethyl; 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  27. Question Everything online: www.dhmo.org
  28. It’s all on Wikipedia…
  29. Chemistry on The Internet Is Messy
  30. It’s Methane…
  31. What’s Methane?
  32. What’s Methane?
  33. What ELSE is Methane???
  34. EPA’s DailyMed
  35. EPA’s DailyMed
  36. EPA’s DailyMed
  37. PHYSPROP Database: The freely downloadable database under the EPISuite prediction software. Very Basic filters suggest data quality issues
  38. The Stereochemistry challenge: 12500 chemicals with “missed” stereo
  39. With Great Fanfare…
  40. NPC Browser http://tripod.nih.gov/npc/
  41. NPC Browser http://tripod.nih.gov/npc/
  42. Openness and Quality Issues: Williams and Ekins, DDT, 16: 747-750 (2011); Science Translational Medicine 2011
  43. Public Domain Databases: Our databases are a mess…; Non-curated databases are proliferating errors; We source and deposit data between databases; Original sources of errors hard to determine; Curation is time-consuming and challenging
  44. Stop Whining – Fix it
  45. Crowdsourced Curation: Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
  46. Search “Vitamin H”
  47. “Curate” Identifiers
  48. “Curate” Identifiers
  49. “Curate” Identifiers
  50. Standards: Structure Standardization
  51. Standards: Structure Standardization
  52. Standards: Structure Standardization
  53. What needs to happen? Standards: Standardization of structures; ChEBI/PubChem sharing; InChI adoption
  54. The InChI Identifier
  55. Multiple Layers
  56. InChIStrings Hash to InChIKeys
  57. Vancomycin – Search the Internet
  58. Vancomycin: Search Molecular Skeleton; Search Full Molecule
  59. Full Skeleton Search: 104 Hits
  60. Full Molecule Search: 4 Hits
  61. Crowdsourcing Works: >130 people have deposited data and participated in data curation; Different level curators check each other; More curators and depositors are encouraged!
  62. What needs to happen? Standards: Standardization of structures; ChEBI/PubChem sharing; InChI adoption. Collaboration: Stop reinventing the wheel; Share data, share efforts and speed the process
  63. Antony Williams vs Identifiers: Dad, Tony, others; 5 email addresses; ChemSpiderman (blog, Twitter account, Facebook, Friendfeed); OpenID…; Passport ID; License; SSN; Green Card
  64. Aspirin names and synonyms: Text searches depend on correct association; 335 suggested identifiers for Aspirin just on PubChem!; Disambiguation dictionaries are necessary, not just for authors!
  65. The Final Search Strategy
  66. All Those Names, One Structure
  67. Ambiguity in Identifiers
  68. Curated Dictionaries Matter
  69. Success Depends on Dictionaries
  70. Validated Name-Structure Dictionaries: Chemical name dictionaries are used for: Text-mining (publications, patents); Used to index PubMed and link to Google Patents; Linking to other databases – think Biology!; When structures are not available drug names link; Searching the web; Names link to structures link to InChIs
  71. I want to know about “Vincristine”: If all algorithms work then everything on the page is correct by default except the name-structure relationship!
  72. Vincristine: Identifiers and Properties
  73. Vincristine: Vendors and Sources, Linked by Structure
  74. Vincristine: Patents, Linked by Name
  75. Vincristine: Articles, Linked by Name
  76. Challenges of Complex Molecules: Yohimbine
  77. Originally 15 compounds “called” Yohimbine; 54 Skeletons for Yohimbine
  78. Pharma Information Tombs: Internal and external content; Built to meet primary use-case; Tailored indexes and GUIs; Internal unique language & metadata; Poor interoperability/integration; Powerpoint, Documents, Excel; Many suppliers of systems and content in a single workflow. In vivo; Pipeline; Literature; Patents; News; SAR; CSRs; Safety; Etc
  79. What could create change? Harvard Business Review (2010): “One change would make a substantial difference [to drug R&D]: the creation of agreed-upon standards for digitally representing drug assets.”
  80. It is so difficult to navigate… IP? What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Competitors? Working On Now? Connections to disease? Expressed in right cell type?
  81. Open PHACTS Project: Develop a set of robust standards…; Implement the standards in a semantic integration hub; Deliver services to support drug discovery programs in pharma and public domain; 22 partners, 8 pharmaceutical companies, 3 biotechs; 36 months project; Guiding principle is open access, open usage, open source (key to standards adoption)
  82. ChemSpider Resources for Chemistry
  83. The Future: Internet Data; Commercial Software; Pre-competitive Data; Open Science; Open Data; Publishers; Educators; Open Databases; Chemical Vendors. Small organic molecules; Undefined materials; Organometallics; Nanomaterials; Polymers; Minerals; Particle bound; Links to Biologicals
  84. The Future of Chemistry on the Web? Public compound databases federate & build a linked environment of validated data!; Data validation needs are not ignored; Publishers layer on information to make publications discoverable; Public-Private databases can be linked; Open Data proliferate; The “Semantic Web” in action
  85. Acknowledgments: The ChemSpider team; Our data providers, depositors, collaborators and curators; Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel); Sean Ekins @collabchem
  86. Thank you. Email: williamsa@rsc.org; Twitter: ChemConnector; Blog: www.chemspider.com/blog; Personal Blog: www.chemconnector.com; SLIDES: www.slideshare.net/AntonyWilliams
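
Slides 56 and 58 to 60 contrast a "skeleton" search with a "full molecule" search after noting that InChI strings hash to InChIKeys. The sketch below illustrates one way such a distinction could be made with InChIKeys alone. It assumes standard 27-character keys, in which the first 14-character block is derived from the connectivity of the molecule and the second block covers the remaining layers such as stereochemistry. The function names are hypothetical and are not ChemSpider's API, and the slides do not say how ChemSpider implements these searches. The aspirin key is the widely published one; the alanine keys are quoted from memory and should be checked against a database before reuse.

```python
def split_inchikey(key):
    """Split an InChIKey into its three hyphen-separated blocks (14, 10, 1 chars)."""
    parts = key.strip().upper().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return parts[0], parts[1], parts[2]


def skeleton_search(query, keys):
    """Return every key sharing the query's first block (same connectivity)."""
    skeleton = split_inchikey(query)[0]
    normalized = {k.strip().upper() for k in keys}
    return sorted(k for k in normalized if split_inchikey(k)[0] == skeleton)


def full_search(query, keys):
    """Return only keys identical to the query (same connectivity and stereo)."""
    target = "-".join(split_inchikey(query))
    normalized = {k.strip().upper() for k in keys}
    return sorted(k for k in normalized if k == target)


# Illustrative records only; see the caveat in the paragraph above.
EXAMPLE_KEYS = [
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin
    "QNAYBMKLOCPYGJ-REOHAZHBSA-N",  # alanine, one enantiomer
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # alanine, the other enantiomer
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",  # alanine, stereo unspecified
]

if __name__ == "__main__":
    query = "QNAYBMKLOCPYGJ-REOHAZHBSA-N"
    print("skeleton hits:", skeleton_search(query, EXAMPLE_KEYS))
    print("full hits:", full_search(query, EXAMPLE_KEYS))
```

With these example records, the skeleton search returns all three alanine keys while the full search returns only the exact key queried, which mirrors the gap between the 104 skeleton hits and 4 full-molecule hits reported for vancomycin.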