Whitney Symposium Lecture June 2008


A Presentation Given at the General Electric Whitney Symposium on Networks

Published in: Education, Technology

    1. 1. Crowd-Sourcing to Build a Structure Centric Community for Chemists Antony Williams Whitney Symposium 2008 - Networks
    2. 2. Social Networking for Chemists
    3. 3. Network Drug Discovery Tools www.curehunter.com
    4. 4. Beware the Networks!
    5. 5. Collaborative Authoring in Academia
          • Group-level collaboration via wikis
    6. 6. Collaborative Authoring for Drug Discovery
          • Pfizerpedia
    7. 7. Collaborative Knowledge Management for Chemists – Wikipedia, Built by a Network
    8. 8. and biologists…WikiProteins
    9. 9. WikiProteins What Is Tegafur?
    10. 10. Commonly Lacking…
          • Approaches generally lack “structural intelligence”
              • Structures have properties (Mw, MF, experimental and predicted properties)
              • Collections of structures need to be searchable by structure
              • Most data collections are “self-contained” and rarely connect to other resources via “structure”
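As a small illustration of a property that follows directly from a structure's molecular formula (MF), the sketch below computes an average molecular weight (Mw). It is not ChemSpider's implementation: the `molecular_weight` function and the `ATOMIC_WEIGHTS` table are hypothetical names, the table holds only a few rounded standard atomic weights, and the parser assumes simple formulas without parentheses or charges.

```python
import re

# Rounded standard atomic weights for a few elements (illustrative subset).
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "Cl": 35.45}


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C9H8O4'."""
    total = 0.0
    # Each match is (element symbol, optional count); a missing count means 1.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


# Aspirin has the molecular formula C9H8O4.
aspirin_mw = molecular_weight("C9H8O4")
print(f"{aspirin_mw:.3f}")  # prints 180.159
```

An element missing from the table raises a `KeyError`, which is deliberate in this sketch: a silent zero would produce a plausible-looking but wrong weight.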
    11. 11. A Search Engine for Chemists
          • Questions a chemist might ask…
              • What is the melting point of n-butanol?
              • What is the chemical structure of Xanax?
              • Chemically, what is Viagra?
              • What are the stereocenters of cholesterol?
              • Where can I find publications about Taxol?
              • What are the different trade names for Ketoconazole?
              • What is the NMR spectrum of Aspirin?
              • What are the safety handling issues for Thymol Blue?
              • ChemSpider can answer all of these questions
    12. 12. ChemSpider Data Content
          • Over 20 million unique chemical structures:
              • Online databases – PubChem, DrugBank, HMDB, Wikipedia
              • Chemical vendors – over 40 different vendors and growing
              • Personal depositions – individual contributions
              • Journal publishers
              • Content database vendors
              • Analytical data collections
              • Patents (9 MILLION structures to search patents)
              • Web scraping
              • Content is linked back to the original data sources
    13. 13. A Structure Centric Community for Chemists
          • A FREE ACCESS platform for deposition, management, curation, annotation and extension of information associated with chemical structures
          • Semantically connect to other sites providing access to knowledge, data and information of determined quality
          • Search by alphanumeric text, chemical structure and substructure, and combination searches
          • Predict properties for submitted structures
    14. 14. Tell me about Aspirin
    15. 15. Tell me about Aspirin
    16. 16. Links out to KEGG Kyoto Encyclopedia of Genes and Genomes
    17. 17. Tell me about Aspirin
    18. 18. Tell me about Aspirin
    19. 19. Tell me about Aspirin
    20. 20. Tell me about Aspirin
    21. 21. Abstract Compounds?
          • Is there any information about “Quesnoin”?
          • Type in the name (and there may be many) or other identifier
          • Paste a chemical structure
          • Draw the structure
    22. 22. Example Search
    23. 23. Example Search
    24. 24. Example Search 2
          • What compounds have a mass of 300 ± 0.001?
          • Or search a combination of intrinsic/predicted properties
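The mass question on this slide amounts to a tolerance-window filter. The sketch below shows the idea on an in-memory list; it does not use ChemSpider's search interface. The `Compound` record, the `search_by_mass` function, and the compound names and masses are all hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    monoisotopic_mass: float  # in daltons


def search_by_mass(compounds, target, tolerance):
    """Return compounds whose mass lies within target ± tolerance."""
    return [c for c in compounds if abs(c.monoisotopic_mass - target) <= tolerance]


# Hypothetical records with invented masses; not real ChemSpider data.
records = [
    Compound("compound A", 300.0004),
    Compound("compound B", 300.0020),
    Compound("compound C", 299.9995),
    Compound("compound D", 180.0423),
]

hits = search_by_mass(records, target=300.0, tolerance=0.001)
print([c.name for c in hits])  # prints ['compound A', 'compound C']
```

A linear scan like this is fine for a handful of records; for millions of structures a database index on the mass column would be the more likely design, which is an assumption about scale rather than a statement about how ChemSpider works.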
    25. 25. Example Search 2
    26. 26. Complex Search
    27. 27. Search Open Access Journals – ChemSpider
    28. 28. Search PubMed – ChemSpider
    29. 29. The Quality of Data Online…
          • Aggregating data opens up quality issues
          • Structure-identifier associations are “dirty”
          • Structures are COMMONLY incorrect – stereochem issues
          • Manual curation of small databases is enough work – what about millions of structures?
          • Structures are far from perfect. What is a “correct structure”?
              • Full stereochemistry?
              • Historical timeline of structure?
              • Who is the authority?
    30. 30. Who holds THE Quality Authority?
          • Chemical Abstracts Service is the structural authority today. 1400 (?) employees, world standard in chemistry information
          • 101 years of knowledge, process and expertise. MANUAL curation is key. Robotic curation is enabling
          • How can an online, free access system peacefully co-exist with the authority?
    31. 31. Quality is a Major Issue- Search Butanol
    32. 32. Crowd-sourcing Database Compilation
    33. 33. Wikipedia – Crowdsourcing Chemistry
    34. 34. Wikipedia Chemistry Curation project
          • Only ca. 5000 organic structures, 7000 total structures
          • MONTHS of work so far for a team of 6 people
          • Many errors removed in the process. Curation process is a daily event for users/depositors
          • Slow and torturous process for stereo molecules
    35. 35. Thymol Blue on ChemSpider
          • Data online includes:
              • UV-vis spectrum
              • Measured experimental properties
              • Link to Wikipedia article
              • Links to chromatography details
              • Multiple identifiers/trade names etc.
              • Links to vendors/suppliers/other databases
              • Safety information
    36. 36. Differences between ChemSpider/Wikipedia
          Wikipedia                      | ChemSpider
          No Analytical Data             |
          Active editors – about 50 (?)  | Active depositors/curators – 30
          No Prediction of properties    |
          ????                           | 5000 people/day; 1100 registered
          Detailed compound monographs   | Compound monographs linked
          Text                           | Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, …
          ~5000 organics, 2000 others    | >20 million unique structures
    37. 37. Differences between Wikipedia/ChemSpider
          ChemSpider                                                  | Wikipedia
          Growing reputation as focused on quality                    | Worldwide reputation as quality source
          Chemistry is the focus of ‘Spider                           | Chemistry is a subset of the ‘Pedia
          Mixed “licensing”                                           | GFDL licensing for everything
          Growing team of WP:Chem advocates, curators and admins      | Strong team of WP:Chem advocates, curators and admins
          “Out of a basement” on three servers and 5 volunteers       | Established infrastructure and Wikimedia Foundation team
          Primarily Microsoft .NET technologies with OS components    | Supported by tried and tested MediaWiki platform
    38. 38. Crowd-sourcing Curation
          • How to curate data for millions of structures?
          • Robot processes can clean up depositions
              • Search for Chloride and check molecular formula for Cl
              • Check for stereochemistry and remove names with stereo
          • Provide a simple-to-use platform to curate, annotate and tag data
          • Provide curator administration to prevent vandalism (Veropedia)
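The first robot check on this slide (a name containing "Chloride" should belong to a structure whose molecular formula contains Cl) can be sketched as a simple rule lookup. This is a minimal sketch under stated assumptions, not the robot ChemSpider runs: `NAME_RULES`, `elements_in_formula` and `flag_name_formula_mismatches` are hypothetical names, the "bromide" rule is an added illustration that the slide does not mention, and the formula parser handles only simple formulas.

```python
import re

# Name fragment -> element the formula must contain.
# "chloride" comes from the slide; "bromide" is an illustrative extra rule.
NAME_RULES = {"chloride": "Cl", "bromide": "Br"}


def elements_in_formula(formula):
    """Return the set of element symbols in a simple formula such as 'C7H7Cl'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def flag_name_formula_mismatches(name, formula):
    """Return the name fragments whose required element is missing from the formula."""
    present = elements_in_formula(formula)
    lowered = name.lower()
    return [
        fragment
        for fragment, element in NAME_RULES.items()
        if fragment in lowered and element not in present
    ]


# A consistent record passes; a name attached to the wrong formula is flagged.
print(flag_name_formula_mismatches("Sodium chloride", "NaCl"))  # prints []
print(flag_name_formula_mismatches("Benzyl chloride", "C7H8"))  # prints ['chloride']
```

A check like this only flags candidates for review. Deciding whether the name or the structure is wrong would still fall to a curator.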
    39. 39. Multi-level Curation and Approval
    40. 40. Post Comments
          • Anyone can “Post Comments” associated with a structure. To curate data, we require a login so that changes can be tracked
    41. 41. Crowd-sourcing Chemistry
          • Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
          • ALSO
          • Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
    42. 42. But, when registered and logged in…
          • Ability to curate and add to the database
              • Add structures
              • “Clean” structures
              • Add data (spectra, CIFs, images)
              • Add links to other pages (URLs)
              • Add publication details
    43. 43. Adding to the Database - Structure
    44. 44. Adding New Text Data Add Publication Add Identifier Add URL
    45. 45. Adding Supplementary Info to a Structure
    46. 46. Can ChemSpider Enable Discovery?
          • Yes, chemists can search by text, structure, substructure or properties to look at relationships and probe drug discovery
    47. 47. ChemSpider – Research in Progress
          • Supporting Open Notebook Science as a repository – JC Bradley at Drexel University
          • For the purpose of online virtual screening
          • Applying descriptors of various types to filter a database of 20 million compounds
          • In progress:
              • Utilizing SimBioSys’ LASSO Descriptor
              • Collaboration based on NISS’ ChemModLab
    48. 48. LASSO Ligand Activity by Surface Similarity Order
    49. 49. LASSO Descriptors on ChemSpider SEMANTIC WEB in action
    50. 50. LASSO Searching Method 1
          • Ask the question: “What are the top 1000 molecules with similar LASSO descriptors to the actives for the Estrogen Receptor?”
    51. 51. It WORKS - Enrichment Plot
          • 60% of the actives were recovered in the top 1% of the database
          • “Environmental binders” are weak binders
          • The top ranked compounds may well be active ER binders
          • Likely candidates for experimental investigation
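A figure such as "60% of the actives recovered in the top 1% of the database" is usually summarized as a recovery fraction and an enrichment factor, where the enrichment factor compares the hit rate in the top-ranked slice with the hit rate expected from random selection. The sketch below computes both for a ranked list. The `recovery_and_enrichment` function is a hypothetical name, and the ranking is synthetic, constructed only to reproduce the slide's proportions; it is not the data behind the talk's plot.

```python
def recovery_and_enrichment(ranked_is_active, top_n):
    """Return (fraction of actives recovered in the top_n, enrichment factor)."""
    n_total = len(ranked_is_active)
    n_actives = sum(ranked_is_active)
    if n_actives == 0 or not 0 < top_n <= n_total:
        raise ValueError("need at least one active and 0 < top_n <= list length")
    hits = sum(ranked_is_active[:top_n])
    recovery = hits / n_actives
    # EF = (hits / top_n) / (n_actives / n_total), rearranged into one division.
    enrichment = (hits * n_total) / (top_n * n_actives)
    return recovery, enrichment


# Synthetic ranking: 1000 compounds, 10 actives, 6 of them ranked in the top 10 (1%).
ranked = [True] * 6 + [False] * 494 + [True] * 4 + [False] * 496

recovery, ef = recovery_and_enrichment(ranked, top_n=10)
print(recovery, ef)  # prints 0.6 60.0
```

On these synthetic numbers, recovering 60% of the actives in the top 1% corresponds to a 60-fold enrichment over random selection.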
    52. 52. Tipping Point
          • Tipping point: the point at which a slow gradual change becomes irreversible and then proceeds with gathering pace
    53. 53. ChemSpider Forums/Blogs
          • Forum.chemspider.com
          • www.chemspider.com/blog
    54. 54. ChemSpider TouchGraph
    55. 55. What would we most like to do?
          • Enable “Collaborative Science”. What would that look like?
          • Access to chemical supplies when people need them
          • Awareness of available literature, patents, databases of curated content – whether Open Access or not. Transaction fees (or not) are between user and provider
          • Host Open Notebook Science exchanges
    56. 56. “ChemSpider Inside”
          • Instrument vendors integrated ChemSpider to their metabolism ID project – ChemSpider linked to all Mass Spec Instruments doing Metabolite ID?
          • Wikipedia roundtrip linking to ChemSpider
          • Google indexing ChemSpider at “fixed rate”
          • Integration to desktop drawing packages
          • Members of Microsoft BioIT Alliance
          • Discussions on Taverna’s Workflow Sourceforge group
          • Hosting Open Access articles shortly…
    57. 57. Where to from here? Short term
          • Integrated text and structure/substructure searching of the Open Access literature is in development
          • Web-based scraping of structure-based information – examples in place
          • Enhanced web services layer to integrate searches
          • Deposit updated Patent Database (9 million structures)
          • Reaction handling and deposition
    58. 58. Where to from here? Mid-term
          • Spidering for Chemistry – extract data from articles, webpages and data sources AND stay within copyright
          • WiChempedia project – wiki-layers on top of ChemSpider, alongside Wikipedia curation project
          • Deeper integration to text-based searching and conversion of chemical names to structures for online structure searching:
              • Improved integration with NCBI Entrez system
              • Deliver “dedicated websites” for specific publishers
    59. 59. Where to from here? Mid-Term
          • An extensible data model “on the fly” allows us to easily expand to integrate abstract data to structures
          • Data mine and curate “parameters” – physicochemical and physiological parameters to enable QSAR analysis, data modeling and provision of models online (UNC-Chapel Hill, NISS)
    60. 60. Our Challenges
          • There are “no employees”
          • ChemSpider is non-funded
          • System is hyper-dependent on ISP, power and limited compute power
          • We are upsetting a lot of people – evangelists, cheminformatics system vendors, publishers, data content providers
    61. 61. Acknowledgments
          • The ChemSpider team of volunteer developers
          • ChemSpider Advisory Group
          • Our curators, depositors and users
          • Suppliers of commercial software – Microsoft, ACD/Labs, OpenEye, ChemAxon, SimBioSys
          • SureChem – Structure Based Online Patent Searching
    62. 62. Further reading
          • www.chemspider.com/blog
          • Internet-based tools for communication and collaboration in chemistry, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 502-506, doi:10.1016/j.drudis.2008.03.015
          • A perspective of publicly accessible/open-access chemistry databases, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 495-501, doi:10.1016/j.drudis.2008.03.017