Whitney Symposium Lecture June 2008
A Presentation Given at the General Electric Whitney Symposium on Networks

Published in: Education, Technology

  • Transcript

    • 1. Crowd-Sourcing to Build a Structure Centric Community for Chemists
      • Antony Williams
      • Whitney Symposium 2008 - Networks
    • 2. Social Networking for Chemists
    • 3. Network Drug Discovery Tools
    • 4. Beware the Networks!
    • 5. Collaborative Authoring in Academia
      • Group level collaboration via Wikis
    • 6. Collaborative Authoring for Drug Discovery
      • Pfizerpedia
    • 7. Collaborative Knowledge Management for Chemists – Wikipedia, Built by a Network
    • 8. and biologists…WikiProteins
    • 9. WikiProteins: What Is Tegafur?
    • 10. Commonly Lacking…
      • Approaches generally lack “structural intelligence”
        • Structures have properties (Mw, MF, exp. & pred. properties)
        • Collections of structures need to be searchable by structure
        • Most data collections are “self-contained” and rarely connect to other resources via “structure”
    • 11. A Search Engine for Chemists
      • Questions a chemist might ask…
        • What is the melting point of n-butanol?
        • What is the chemical structure of Xanax?
        • Chemically, what is Viagra?
        • What are the stereocenters of cholesterol?
        • Where can I find publications about Taxol?
        • What are the different trade names for Ketoconazole?
        • What is the NMR spectrum of Aspirin?
        • What are the safety handling issues for Thymol Blue?
        • ChemSpider can answer all of these questions
    • 12. ChemSpider Data Content
      • Over 20 million unique chemical structures:
        • Online Databases – PubChem, DrugBank, HMDB, Wikipedia
        • Chemical Vendors – over 40 different vendors and growing
        • Personal Depositions – individual contributions
        • Journal Publishers
        • Content database vendors
        • Analytical data collections
        • Patents (9 MILLION structures to search patents)
        • Web scraping
        • Content is linked back to the original data sources
    • 13. A Structure Centric Community for Chemists
      • A FREE ACCESS platform for deposition, management, curation, annotation and extension of information associated with chemical structures
      • Semantically connect to other sites providing access to knowledge, data and information of determined quality
      • Search by alphanumeric text, chemical structure and substructure and combination searches
      • Predict properties for submitted structures
    • 14. Tell me about Aspirin
    • 15. Tell me about Aspirin
    • 16. Links out to KEGG Kyoto Encyclopedia of Genes and Genomes
    • 17. Tell me about Aspirin
    • 18. Tell me about Aspirin
    • 19. Tell me about Aspirin
    • 20. Tell me about Aspirin
    • 21. Abstract Compounds?
      • Is there any information about “Quesnoin”?
      • Type in the name (and there may be many) or other identifier
      • Paste a chemical structure
      • Draw the structure
    • 22. Example Search
    • 23. Example Search
    • 24. Example Search 2
      • What compounds have a mass of 300+/-0.001?
      • or search a combination of intrinsic/predicted properties
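
The mass-window question on this slide can be sketched as a simple range filter. This is only an illustration of the idea, not ChemSpider's actual implementation: the `Record` type, the record identifiers and the mass values below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str    # hypothetical identifier, not a real ChemSpider ID
    mono_mass: float  # monoisotopic mass in Da (illustrative values only)


def search_by_mass(records, target, tolerance):
    """Return records whose mass lies within target +/- tolerance (inclusive)."""
    low, high = target - tolerance, target + tolerance
    return [r for r in records if low <= r.mono_mass <= high]


# Made-up records, chosen so that two fall inside the 300 +/- 0.001 window.
RECORDS = [
    Record("rec-1", 299.9985),
    Record("rec-2", 300.0004),
    Record("rec-3", 300.0009),
    Record("rec-4", 300.0150),
]

hits = search_by_mass(RECORDS, target=300.0, tolerance=0.001)
print([r.record_id for r in hits])
```

A combined search over intrinsic and predicted properties would, under the same assumptions, simply add further range conditions to the filter.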
    • 25. Example Search 2
    • 26. Complex Search
    • 27. Search Open Access Journals – ChemSpider
    • 28. Search PubMed – ChemSpider
    • 29. The Quality of Data Online…
      • Aggregating data opens up quality issues
      • Structure-identifier associations are “dirty”
      • Structures are COMMONLY incorrect – stereochem issues
      • Manual curation of small databases is enough work – what about millions of structures?
      • Structures are far from perfect. What is a “correct structure”?
        • Full stereochemistry?
        • Historical timeline of structure?
        • Who is the authority?
    • 30. Who holds THE Quality Authority?
      • Chemical Abstracts Service is the structural authority today. 1400 (?) employees, world standard in chemistry information
      • 101 years of knowledge, process and expertise. MANUAL curation is key. Robotic curation is enabling
      • How can an online, free access system peacefully co-exist with the authority?
    • 31. Quality is a Major Issue – Search Butanol
    • 32. Crowd-sourcing Database Compilation
    • 33. Wikipedia – Crowdsourcing Chemistry
    • 34. Wikipedia Chemistry Curation project
      • Only ca. 5000 organic structures, 7000 total structures
      • MONTHS of work so far for a team of 6 people
      • Many errors removed in the process. Curation process is a daily event for users/depositors
      • Slow and torturous process for stereo molecules.
    • 35. Thymol Blue on ChemSpider
      • Data online includes:
        • UV-vis spectrum
        • Measured experimental properties
        • Link to Wikipedia article
        • Links to chromatography details
        • Multiple identifiers/trade names etc.
        • Links to vendors/suppliers/other databases
        • Safety information
    • 36. Differences between ChemSpider/Wikipedia
      • Wikipedia
        • ~5000 organics, 2000 others
        • Text
        • Detailed compound monographs
        • ????
        • No Prediction of properties
        • Active editors – about 50 (?)
        • No Analytical Data
      • ChemSpider
        • >20 million unique structures
        • Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, …
        • Compound monographs linked
        • 5000 people/day; 1100 registered
        • Active depositors/curators – 30
    • 37. Differences between Wikipedia/ChemSpider
      • ChemSpider
        • Growing reputation as focused on quality
        • Chemistry is the focus of ‘Spider
        • Mixed “licensing”
        • Growing team of WP:Chem advocates, curators and admins
        • “Out of a basement” on three servers and 5 volunteers
        • Primarily Microsoft .NET technologies with OS components
      • Wikipedia
        • Worldwide reputation as quality source
        • Chemistry is a subset of the ‘Pedia
        • GFDL licensing for everything
        • Strong team of WP:Chem advocates, curators and admins
        • Established infrastructure and Wikipedia Foundation Team
        • Supported by tried and tested Media-Wiki platform
    • 38. Crowd-sourcing Curation
      • How to curate data for millions of structures?
      • Robot processes can clean up depositions
        • Search for Chloride and check molecular formula for Cl
        • Check for stereochemistry and remove names with stereo
      • Provide a simple-to-use platform to curate, annotate and tag data
      • Provide curator administration to prevent vandalism (Veropedia)
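
The first robot check listed on this slide (a name mentions chloride, so the molecular formula should contain Cl) might look roughly like the sketch below. It assumes simple formulas without charges, isotopes or brackets; the function names are hypothetical and this is not the actual ChemSpider curation code.

```python
import re


def formula_elements(formula):
    """Return the set of element symbols in a simple formula such as 'C7H7Cl'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def flag_chloride_mismatch(name, formula):
    """True when a name mentions chloro/chloride but the formula has no Cl."""
    mentions_cl = re.search(r"chlor(o|ide)", name, flags=re.IGNORECASE) is not None
    return mentions_cl and "Cl" not in formula_elements(formula)


# A consistent record passes; a name/formula pair that disagrees is flagged.
print(flag_chloride_mismatch("sodium chloride", "NaCl"))
print(flag_chloride_mismatch("benzyl chloride", "C7H8"))  # deliberately wrong formula
```

Flagged records would then be handed to human curators rather than changed automatically, which matches the multi-level approach the slide describes.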
    • 39. Multi-level Curation and Approval
    • 40. Post Comments
      • Anyone can “Post Comments” associated with a structure. To curate data we require a login so that changes can be tracked
    • 41. Crowd-sourcing Chemistry
      • Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
      • ALSO
      • Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
    • 42. But, when registered and logged in…
      • Ability to curate and add to the database
        • Add structures
        • “Clean” structures
        • Add data (spectra, CIFs, images)
        • Add links to other pages (URLs)
        • Add publication details
    • 43. Adding to the Database - Structure
    • 44. Adding New Text Data Add Publication Add Identifier Add URL
    • 45. Adding Supplementary Info to a Structure
    • 46. Can ChemSpider Enable Discovery?
      • Yes, chemists can search by text, structure, substructure or properties to look at relationships and probe drug discovery
    • 47. ChemSpider – Research in Progress
      • Supporting Open Notebook Science as a repository – JC Bradley at Drexel University
      • For the purpose of online virtual screening
      • Applying descriptors of various types to filter a database of 20 million compounds
      • In progress:
        • Utilizing SimBioSys’ LASSO Descriptor
        • Collaboration based on NISS’ ChemModLab
    • 48. LASSO Ligand Activity by Surface Similarity Order
    • 49. LASSO Descriptors on ChemSpider SEMANTIC WEB in action
    • 50. LASSO Searching Method 1
      • Ask the question “What are the top 1000 molecules with similar LASSO descriptors to the actives for the Estrogen Receptor”
    • 51. It WORKS - Enrichment Plot
      • 60% of the actives were recovered in the top 1% of the database.
      • “Environmental binders” are weak binders
      • The top ranked compounds may well be active ER binders
      • Likely candidates for experimental investigation
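
The statement that 60% of the actives were recovered in the top 1% of the database can be made concrete with a small calculation. The sketch below uses a toy ranked list built to mirror that figure; it is not the actual Estrogen Receptor data set, and the function names are assumptions introduced here.

```python
def fraction_of_actives_recovered(ranked_labels, top_fraction):
    """Fraction of all actives found in the top slice of a ranked list.

    ranked_labels: list of bools, best-scored first; True marks a known active.
    """
    total_actives = sum(ranked_labels)
    if total_actives == 0:
        raise ValueError("no actives in the ranked list")
    cutoff = max(1, round(len(ranked_labels) * top_fraction))
    return sum(ranked_labels[:cutoff]) / total_actives


def enrichment_factor(ranked_labels, top_fraction):
    """Recovered fraction divided by the fraction of the database screened."""
    return fraction_of_actives_recovered(ranked_labels, top_fraction) / top_fraction


# Toy ranking: 1000 compounds, 10 actives, 6 of them in the first 10 positions.
ranked = [True] * 6 + [False] * 4 + [False] * 500 + [True] * 4 + [False] * 486

recovered = fraction_of_actives_recovered(ranked, top_fraction=0.01)
ef = enrichment_factor(ranked, top_fraction=0.01)
print(recovered, ef)
```

For this toy list the recovered fraction is 0.6, so the enrichment over random selection is about 60-fold.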
    • 52. Tipping Point
      • Tipping point - the point at which a slow gradual change becomes irreversible and then proceeds with gathering pace
    • 53. ChemSpider Forums/Blogs
    • 54. ChemSpider TouchGraph
    • 55. What would we most like to do?
      • Enable “Collaborative Science”. What would that look like?
      • Access to chemical supplies when people need them
      • Awareness of available literature, patents, databases of curated content – whether Open Access or not. Transaction fees (or not) are between user and provider
      • Host Open Notebook Science exchanges
    • 56. “ChemSpider Inside”
      • Instrument vendors integrated ChemSpider into their metabolism ID project – ChemSpider linked to all Mass Spec Instruments doing Metabolite ID?
      • Wikipedia roundtrip linking to ChemSpider
      • Google indexing ChemSpider at “fixed rate”
      • Integration to desktop drawing packages
      • Members of Microsoft BioIT Alliance
      • Discussions on Taverna’s Workflow Sourceforge group
      • Hosting Open Access articles shortly…
    • 57. Where to from here? Short term
      • Integrated text and structure/substructure searching of the Open Access literature is in development
      • Web-based scraping of structure-based information – examples in place
      • Enhanced web services layer to integrate searches
      • Deposit updated Patent Database (9 million structures)
      • Reaction handling and deposition
    • 58. Where to from here? Mid-term
      • Spidering for Chemistry – extract data from articles, webpages and data sources AND stay within copyright
      • WiChempedia project – wiki-layers on top of ChemSpider, alongside Wikipedia curation project.
      • Deeper integration to text-based searching and conversion of chemical names to structures for online structure searching:
        • Improved integration with NCBI Entrez system
        • Deliver “dedicated websites” for specific publishers
    • 59. Where to from here? Mid-Term
      • An extensible datamodel “on the fly” allows us to easily expand to integrate abstract data to structures
      • Data mine and curate “parameters” – physicochemical and physiological parameters to enable QSAR analysis, data modeling and provision of models online (UNC-Chapel Hill, NISS)
    • 60. Our Challenges
      • There are “no employees”
      • ChemSpider is non-funded
      • System is hyper-dependent on ISP, power and limited compute power
      • We are upsetting a lot of people – evangelists, cheminformatics system vendors, publishers, data content providers
    • 61. Acknowledgments
      • The ChemSpider team of volunteer developers
      • ChemSpider Advisory Group
      • Our curators, depositors and users
      • Suppliers of commercial software – Microsoft, ACD/Labs, OpenEye, ChemAxon, SimBioSys
      • SureChem – Structure Based Online Patent Searching
    • 62. Further reading
      • Internet-based tools for communication and collaboration in chemistry, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 502-506, doi:10.1016/j.drudis.2008.03.015
      • A perspective of publicly accessible/open-access chemistry databases, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 495-501, doi:10.1016/j.drudis.2008.03.017