ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

With an intention to provide a high quality free internet resource of chemistry related data for the community, ChemSpider has aggregated almost 25 million compounds linked out to over 400 data sources and provided a platform for the community to both deposit and curate data. This experiment in crowdsourcing for chemistry has now been running for over three years. This presentation will review a number of aspects of the project including (a) the level of community participation in depositing and curating data; (b) the nature of data and content supplied by the community; (c) how ChemSpider is used by the community; (d) using game-based systems to assist in data curation; (e) algorithmic-based approaches to data validation and filtering; and (f) sharing data curation efforts with other online databases.

Usage Rights

CC Attribution-NonCommercial License

    Presentation Transcript

    • ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists? Antony Williams, ACS Denver, August 30th 2011
    • What’s said on the web is true…
      • “We then established a collaboration with professor Sum Ting Wong, a fugitive from the North Korean University Hu Yu Hai Ding, currently in Rome (Italy).”
      • “This was identified as the new protein Wai So Dim (WSD).”
    • Who is Sandy Lawson? Ask Google
    • Who is Sandy… to me?
      • Mentor in computer-generated nomenclature
      • Educational Technologist
      • Innovator
      • Ethical
      • “Gentleman Sandy”
    • What is the Structure of Vitamin K1?
    • ChemSpider
      • The Free Chemical Database
      • A central hub for chemists to source information
        • >26 million unique chemical records
        • Aggregated from >400 data sources
        • Chemicals, spectra, CIF files, movies, images, podcasts, links to patents, publications, predictions
      • A central hub for chemists to deposit & curate data
    • ChemSpider general statements
      • ChemSpider : one of many important resources
      • The “Google and Wikipedia of Chemistry”
      • A vision of “Linking all chemistry on the internet”
      • Most people in this room probably know about it
      • New people discover us regularly
      • Our distinct roles are:
        • Hosting and exposing data for the community
        • Curating and validating chemistry-related data
    • I want to know about “Vincristine”
      • If all algorithms work, then everything on the page is correct by default, except the name!
    • Vincristine: Identifiers and Properties
    • Vincristine: Vendors and Sources
    • Vincristine: Patents
    • Vincristine: Articles
    • Searches: The INTERNET
      • All ChemSpider and Internet searches are “simply algorithms”, but synonym searching is based on an assertion
    • InChIs
    • Validated Names for Searching…
    • What you might not know about Chemistry Databases on the Internet
      • Data-sharing between the databases is cyclic – proliferating errors – “Linked Data”
    • What you might not know about Chemistry Databases on the Internet
      • Some public databases are “trusted” as primary sources
      • Trust is granted without investigation or understanding of the content
      • Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
      • What do we know about some of the online resources?
    • PHYSPROP Database
      • The freely downloadable database under the EPI Suite prediction software
      • Very Basic filters suggest data quality issues
    • The Stereochemistry challenge. 12500 chemicals with “missed” stereo
    • NIST Webbook
    • PubChem
    • What you might not know about Chemistry Databases on the Internet
      • Make sure you blame the database hosts!!! (???)
      • Errors are primarily deposited by, and inherited from, the data suppliers
      • Chemistry databases depend enormously on structure representations…
    • What you might not know about Chemistry Databases on the Internet
      • Despite all of the blog posts, lectures, presentations and pleas, data quality is not improving
    • NPC Browser http://tripod.nih.gov/npc/
    • Patents
    • WYSIWYG compounds
    • But ChemSpider is curated, right?
    • Originally 15 compounds “called” Yohimbine 54 Skeletons for Yohimbine
    • All aggregators suffer dilution!
    • What is the structure of Discodermolide?
    • How to distinguish…who’s wrong?
    • Neither is wrong
    • Data Curation…long torturous task
      • Data curation – JUST structure-name validation is a long, torturous, iterative task.
      • How about validating “data” – PhysChem data such as logP data, boiling points, melting points, spectra
    • Curating Melting Point Data http://tinyurl.com/3e44vbx
    • Melting Point Validation Work
    • Some melting points can’t be resolved only with literature: 4-benzyltoluene
    • Data Curation…long torturous task
      • Data curation – JUST structure-name validation is a long, torturous, iterative task.
      • How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C.Bradley’s talk), spectra
      • The crowd in crowdsourcing is …generally small
      • Which of the large databases are doing careful curation? How can we share the workload?
    • ChemSpider can “do it” for us
      • ChemSpider provides a curation interface
      • All curation activities are available for review, online immediately, iteratively checked
      • Curators have different abilities based on their profile: There are only a few “Master Curators”.
      • Can we “share” the curation workload?
    • Identifier Dictionaries
      • Reciprocal curation processes…share curation with each other.
      • If a database already has a compound, then use InChIKeys to match the “suggested” validation against that compound.
      • A series of “added” and “removed” synonyms against InChIKeys for matching.
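
The identifier-dictionary idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SynonymDelta` record, its field names, and the `apply_deltas` function are hypothetical and are not taken from ChemSpider, since the slides do not describe the actual feed format. The sketch assumes each database keeps a set of synonyms per InChIKey and receives a list of "added" and "removed" synonym actions from a partner database.

```python
from dataclasses import dataclass


@dataclass
class SynonymDelta:
    """One shared curation action (hypothetical record format)."""
    inchikey: str
    synonym: str
    action: str  # "added" or "removed"


def apply_deltas(local, deltas):
    """Apply shared curation actions to a local name dictionary.

    `local` maps InChIKey -> set of synonyms. Actions for compounds the
    local database does not hold are left unapplied and returned, so the
    host can decide what to do with them.
    """
    skipped = []
    for delta in deltas:
        names = local.get(delta.inchikey)
        if names is None:
            skipped.append(delta)          # compound not in this database
        elif delta.action == "added":
            names.add(delta.synonym)       # accept a validated synonym
        elif delta.action == "removed":
            names.discard(delta.synonym)   # drop a synonym judged wrong
        else:
            raise ValueError(f"unknown action: {delta.action!r}")
    return skipped
```

Matching on the InChIKey rather than on the name is the point of the design: the name is the thing being curated, so it cannot also serve as the join key.
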
    • Proof of Concept Data Curation Sharing
    • Structure Validation using feed
      • Look for approved synonyms
      • Compare feed InChIKey with database InChIKey
      • If different, flag for inspection
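
The three steps on this slide can be sketched as follows. This is a minimal illustration, not ChemSpider's implementation: the function name `flag_mismatches` and the two dictionaries (name to InChIKey) are assumptions made for the example, and names are compared case-insensitively as a simplification.

```python
def flag_mismatches(local_names, approved_feed):
    """Compare a local name -> InChIKey table against a feed of approved synonyms.

    Returns a list of (name, local_inchikey, feed_inchikey) tuples for every
    approved synonym that the local database attaches to a different structure.
    Names missing from the local database are ignored.
    """
    # Normalise case once so "Aspirin" and "aspirin" are treated as one name.
    local = {name.casefold(): key for name, key in local_names.items()}

    flagged = []
    for name, feed_key in approved_feed.items():
        local_key = local.get(name.casefold())
        if local_key is not None and local_key != feed_key:
            flagged.append((name, local_key, feed_key))  # needs human inspection
    return flagged
```

The function only flags disagreements; it does not decide which side is right, which matches the slide's "flag for inspection" step.
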
    • Identifier Dictionaries
      • Who will participate???
    • Batch Validation Also Works!
      • Batch validation of name-structure relationships
      • “Background Processing framework”
      • Hexamethylchickenwire Chloride = C12H23O5
      • Define a set of synonym filters and process the entire backfile. We will also use synonym filters at deposition.
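
One possible synonym filter, suggested by the "Hexamethylchickenwire Chloride = C12H23O5" example, is to check that a halide named in the synonym actually appears in the molecular formula. The sketch below is a hypothetical filter written for illustration; the slides do not list the filters ChemSpider uses, and the names `HALIDE_TERMS`, `halide_filter` and `process_backfile` are assumptions. A hit means "inspect this record", not "this record is wrong".

```python
import re

# Name fragments and the element each one implies (illustrative, not exhaustive).
HALIDE_TERMS = {"fluoride": "F", "chloride": "Cl", "bromide": "Br", "iodide": "I"}


def elements_in(formula):
    """Return the set of element symbols found in a molecular formula string."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def halide_filter(name, formula):
    """Return the elements the name implies but the formula lacks."""
    present = elements_in(formula)
    lowered = name.lower()
    return sorted(
        element
        for term, element in HALIDE_TERMS.items()
        if term in lowered and element not in present
    )


def process_backfile(records):
    """Run the filter over (name, formula) pairs and collect suspicious ones."""
    suspicious = []
    for name, formula in records:
        missing = halide_filter(name, formula)
        if missing:
            suspicious.append((name, formula, missing))
    return suspicious
```

Because the same `halide_filter` function works on a single record, it could be run both over the backfile and at deposition time.
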
    • Community Contribution to ChemSpider
      • ChemSpider as a host for community contributions
        • Curation and validation input
        • Structures
        • Movies
        • Images
        • Analytical data – especially spectra
    • Spectra
    • www.SpectralGame.com http://www.jcheminf.com/content/1/1/9
    • Spectral Game
    • Data Curation
    • Reversed Spectrum
    • Download, reprocess, redeposit
    • True Curation of Data
    • Batchwise validation of NMR data
    • Automated C13 Verification
    • Mixture Identified
    • NMR Verification
      • H1 NMR: 77% of spectra consistent
      • C13 NMR: 67% of spectra consistent
      • Algorithms NOT perfect but did identify:
        • Misreferenced data
        • Reversed spectra
        • 22 mixtures identified
        • Signal-to-noise was poor – missing peaks
      • What about 2D NMR verification?
    • ChemSpider ID 24528095 HHCOSY
    • ChemSpider ID 24528095 HSQC
    • Crowdsourced Spectral Data
      • Spectral data available at http://www.chemspider.com/spectra.aspx
      • Regular data depositions
      • Generally licensed as Open Data
      • Chemical vendors now contributing spectral data – up to 800 spectra presently being acquired
      • All data are welcome – who will they benefit?
        • www.SpectralGame.com
        • http://spectraschool.rsc.org/
    • SpectraSchool
    • Community Contribution to ChemSpider
      • ChemSpider as a host for community contributions
        • Curation and validation input
        • Analytical data – especially spectra
        • Movies, images
        • Is it just structures?
      • ChemSpider SyntheticPages as a host for reaction syntheses
    • ChemSpider SyntheticPages
    • Submission Process
      • Simple template-based submission process
      • Submissions reviewed by editorial board. Published as is or comments sent to author
      • Online Peer Review process
      • Supported data include web movies, images, live spectra, etc.
      • DOI issued to author
    • Is it working?
      • Show of hands…
        • How many of you know CSSP?
        • Have any of you submitted to CSSP?
      • Low submissions but some dedicated authors
      • It is NOT a technology issue
        • Students need permission to publish
        • Publishing syntheses might prevent publication
        • CSSP would grow if we abstracted supp. info – templated supp. info submissions could help.
    • Crowdsourcing – does it work?
      • 131 people have EVER deposited or curated data on ChemSpider
      • ChemSpider SyntheticPages has a small group of dedicated authors
      • Database hosts and vendors make the largest contributions of data
      • ChemSpider staff do the most curation
    • If it was not just about me…
      • We might have a community built encyclopedia
      • I might know where the best restaurants are
      • I might get good advice on books to read
      • I might know which movies to watch
      • I might know which plumber to call
      • Data might just be Open
    • How will it improve?
      • Participation
      • and
      • contribution
    • RSC’s LearnChemistry:Share
      • Improved Quality of data is essential
      • Open PHACTS : partnership between European Community and EFPIA
      • Freely accessible for knowledge discovery and verification.
        • Data on small molecules
        • Pharmacological profiles
        • ADMET data
        • Biological targets and pathways
        • Proprietary and public data sources.
    • Conclusions
      • ChemSpider has an important role in quality data
      • Crowdsourced deposition, validation and curation works, but engagement has been low to date
      • Primary challenge – engaging the community to help create what they want. Rewards and recognition?
      • MORE collaboration can benefit us all
      • All indicators are good for continued growth
    • Acknowledgments
      • The ChemSpider team
      • Craig Knox, DrugBank
      • Our data providers, depositors, collaborators and curators
      • Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)
    • Thank you
      • Email: williamsa@rsc.org
      • Twitter: ChemConnector
      • Blog: www.chemspider.com/blog
      • Personal Blog: www.chemconnector.com
      • SLIDES: www.slideshare.net/AntonyWilliams