Crowd-Sourcing to Build a Structure Centric Community for Chemists

Antony Williams

Whitney Symposium Lecture, June 2008

  1. Crowd-Sourcing to Build a Structure Centric Community for Chemists
     Antony Williams
     Whitney Symposium 2008 - Networks
  2. Social Networking for Chemists
  3. Network Drug Discovery Tools: www.curehunter.com
  4. Beware the Networks!
  5. Collaborative Authoring in Academia
     • Group level collaboration via Wikis
  6. Collaborative Authoring for Drug Discovery
     • Pfizerpedia
  7. Collaborative Knowledge Management for Chemists – Wikipedia, Built by a Network
  8. and biologists…WikiProteins
  9. WikiProteins: What Is Tegafur?
  10. Commonly Lacking…
     • Approaches generally lack “structural intelligence”
        • Structures have properties (Mw, MF, exp. & pred. properties)
        • Collections of structures need to be searchable by structure
     • Most data collections are “self-contained” and rarely connect to other resources via “structure”
  11. A Search Engine for Chemists
     • Questions a chemist might ask…
        • What is the melting point of n-butanol?
        • What is the chemical structure of Xanax?
        • Chemically, what is Viagra?
        • What are the stereocenters of cholesterol?
        • Where can I find publications about Taxol?
        • What are the different trade names for Ketoconazole?
        • What is the NMR spectrum of Aspirin?
        • What are the safety handling issues for Thymol Blue?
     • ChemSpider can answer all of these questions
  12. ChemSpider Data Content
     • Over 20 million unique chemical structures:
        • Online Databases – PubChem, Drugbank, HMDB, Wikipedia
        • Chemical Vendors – over 40 different vendors and growing
        • Personal Depositions – individual contributions
        • Journal Publishers
        • Content database vendors
        • Analytical data collections
        • Patents (9 MILLION Structures to search patents)
        • Web scraping
     Content is linked back to the original data sources
  13. A Structure Centric Community for Chemists
     • A FREE ACCESS platform for deposition, management, curation, annotation and extension of information associated with chemical structures
     • Semantically connect to other sites providing access to knowledge, data and information of determined quality
     • Search by alphanumeric text, chemical structure and substructure and combination searches
     • Predict properties for submitted structures
  14. Tell me about Aspirin
  15. Tell me about Aspirin
  16. Links out to KEGG (Kyoto Encyclopedia of Genes and Genomes)
  17. Tell me about Aspirin
  18. Tell me about Aspirin
  19. Tell me about Aspirin
  20. Tell me about Aspirin
  21. Abstract Compounds?
     • Is there any information about “Quesnoin”?
     • Type in the name (and there may be many) or other identifier
     • Paste a chemical structure
     • Draw the structure
  22. Example Search
  23. Example Search
  24. Example Search 2
     • What compounds have a mass of 300 +/- 0.001?
     • or search a combination of intrinsic/predicted properties
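The mass-window question on slide 24 can be sketched in a few lines. This is a minimal illustration, not ChemSpider's implementation: the function names, the small isotope table and the two example records are assumptions made for the sketch, and the formula parser only handles simple formulae without brackets or charges.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope for a few common elements.
# A deliberately small, illustrative table; extend it for real use.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "S": 31.97207100,
    "Cl": 34.96885268,
}


def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C9H8O4'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC_MASS[symbol] * (int(count) if count else 1)
    return total


def mass_window(records, target, tolerance):
    """Return the names of records whose mass lies within target +/- tolerance."""
    return [
        name
        for name, formula in records
        if abs(monoisotopic_mass(formula) - target) <= tolerance
    ]


# Hypothetical records: (name, molecular formula).
records = [("aspirin", "C9H8O4"), ("n-butanol", "C4H10O")]
matches = mass_window(records, target=180.0423, tolerance=0.001)
```

A tolerance of 0.001 is tight enough that the choice between monoisotopic and average mass matters; the sketch assumes monoisotopic mass, which the slide does not specify.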
  25. Example Search 2
  26. Complex Search
  27. Search Open Access Journals – ChemSpider
  28. Search PubMed – ChemSpider
  29. The Quality of Data Online…
     • Aggregating data opens up quality issues
     • Structure-identifier associations are “dirty”
     • Structures are COMMONLY incorrect – stereochem issues
     • Manual curation of small databases is enough work – what about millions of structures?
     • Structures are far from perfect. What is a “correct structure”?
     • Full stereochemistry?
     • Historical timeline of structure?
     • Who is the authority?
  30. Who holds THE Quality Authority?
     • Chemical Abstracts Service is the structural authority today. 1400 (?) employees, world standard in chemistry information
     • 101 years of knowledge, process and expertise. MANUAL curation is key. Robotic curation is enabling
     • How can an online, free access system peacefully co-exist with the authority?
  31. Quality is a Major Issue - Search Butanol
  32. Crowd-sourcing Database Compilation
  33. Wikipedia – Crowdsourcing Chemistry
  34. Wikipedia Chemistry Curation project
     • Only ca. 5000 organic structures, 7000 total structures
     • MONTHS of work so far for a team of 6 people
     • Many errors removed in the process. Curation process is a daily event for users/depositors
     • Slow and torturous process for stereo molecules.
  35. Thymol Blue on ChemSpider
     • Data online includes:
        • UV-vis spectrum
        • Measured experimental properties
        • Link to Wikipedia article
        • Links to chromatography details
        • Multiple identifiers/trade names etc.
        • Links to vendors/suppliers/other databases
        • Safety information
  36. Differences between ChemSpider/Wikipedia
     ChemSpider | Wikipedia
     >20 million unique structures | ~5000 organics, 2000 others
     Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, … | Text
     Prediction of properties | No
     Analytical Data | No
     Active depositors/curators – 30 | Active editors – about 50 (?)
     5000 people/day; 1100 registered | ????
     Compound monographs linked | Detailed compound monographs
  37. Differences between Wikipedia/ChemSpider
     Wikipedia | ChemSpider
     Supported by tried and tested MediaWiki platform. | Primarily Microsoft .NET technologies with OS components
     Established infrastructure and Wikimedia Foundation team | “Out of a basement” on three servers and 5 volunteers
     Chemistry is a subset of the ‘Pedia | Chemistry is the focus of ‘Spider
     GFDL licensing for everything | Mixed “licensing”
     Strong team of WP:Chem advocates, curators and admins | Growing team of WP:Chem advocates, curators and admins
     Worldwide reputation as quality source | Growing reputation as focused on quality
  38. Crowd-sourcing Curation
     • How to curate data for millions of structures?
     • Robot processes can clean up depositions
        • Search for Chloride and check molecular formula for Cl
        • Check for stereochemistry and remove names with stereo
     • Provide a simple-to-use platform to curate, annotate and tag data
     • Provide curator administration to prevent vandalism (Veropedia)
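The two robot checks named on slide 38 could look roughly like the sketch below. It is an illustration under stated assumptions, not ChemSpider's actual robot: the function names, the record layout and the stereo-descriptor pattern are hypothetical, and the second check reads "remove names with stereo" as "flag stereo-specific names attached to a structure that has no stereochemistry defined".

```python
import re

# Hypothetical pattern for stereo descriptors in names:
# (R), (2R,3S), (E), (Z), (+), (-), cis-, trans-. A real rule set would be broader.
STEREO_IN_NAME = re.compile(
    r"\((?:\d*[RSEZ])(?:,\d*[RSEZ])*\)|\((?:\+|-)\)|\b(?:cis|trans)-"
)


def formula_elements(formula):
    """Element symbols present in a simple molecular formula such as 'C4H9Cl'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def flag_suspect_names(formula, names, structure_has_stereo):
    """Return (name, reason) pairs for names that contradict the structure."""
    elements = formula_elements(formula)
    flagged = []
    for name in names:
        if "chloride" in name.lower() and "Cl" not in elements:
            flagged.append((name, "name mentions chloride but formula has no Cl"))
        elif not structure_has_stereo and STEREO_IN_NAME.search(name):
            flagged.append((name, "name carries stereo descriptors but structure has none"))
    return flagged


# Hypothetical record: a butanol structure with no stereocenters defined.
flags = flag_suspect_names(
    "C4H10O",
    ["n-butanol", "butyl chloride", "(R)-butan-2-ol"],
    structure_has_stereo=False,
)
```

Flagging rather than deleting keeps a human curator in the loop, which fits the curator-administration step listed on the same slide.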
  39. Multi-level Curation and Approval
  40. Post Comments
     • Anyone can “Post Comments” associated with a structure. To curate data we require login to track
  41. Crowd-sourcing Chemistry
     • Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
     • ALSO
     • Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
  42. But, when registered and logged in…
     • Ability to curate and add to the database
        • Add structures
        • “Clean” structures
        • Add data (spectra, CIFs, images)
        • Add links to other pages (URLs)
        • Add publication details
  43. Adding to the Database - Structure
  44. Adding New Text Data: Add Publication, Add URL, Add Identifier
  45. Adding Supplementary Info to a Structure
  46. Can ChemSpider Enable Discovery?
     • Yes, chemists can search by text, structure, substructure or properties to look at relationships and probe drug discovery
  47. ChemSpider – Research in Progress
     • Supporting Open Notebook Science as a repository – JC Bradley at Drexel University
     • For the purpose of online virtual screening
     • Applying descriptors of various types to filter a database of 20 million compounds
     • In progress:
        • Utilizing SimBioSys’ LASSO Descriptor
        • Collaboration based on NISS’ ChemModLab
  48. LASSO: Ligand Activity by Surface Similarity Order
  49. LASSO Descriptors on ChemSpider
     • SEMANTIC WEB in action
  50. LASSO Searching Method 1
     • Ask the question “What are the top 1000 molecules with similar LASSO descriptors to the actives for the Estrogen Receptor”
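The "top 1000 most similar molecules" query on slide 50 is a top-k ranking over descriptor vectors. The sketch below shows the general shape of such a query only. It does not reproduce the LASSO method: the descriptor vectors are made-up numbers, and cosine similarity to the centroid of the actives is a stand-in similarity measure chosen for the illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def top_k_similar(database, actives, k):
    """Return the k (id, vector) entries most similar to the actives' centroid."""
    query = centroid(actives)
    return heapq.nlargest(k, database, key=lambda item: cosine(item[1], query))


# Hypothetical two-component descriptors for three database entries.
database = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
actives = [[1.0, 0.0], [1.0, 0.2]]
top_two = [cid for cid, _ in top_k_similar(database, actives, k=2)]
```

`heapq.nlargest` avoids sorting the whole database, which matters when k is 1000 and the database holds millions of entries.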
  51. It WORKS - Enrichment Plot
     • 60% of the actives were recovered in the top 1% of the database.
     • “Environmental binders” are weak binders
     • The top ranked compounds may well be active ER binders
     • Likely candidates for experimental investigation
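The recovery figure on slide 51 is a point on an enrichment curve: rank the database, take the top fraction, and count how many known actives fall inside it. A minimal sketch follows, with hypothetical function names and synthetic data. Recovering 60% of the actives in the top 1% corresponds to an enrichment factor of 0.60 / 0.01 = 60 over random selection.

```python
def fraction_recovered(ranked_ids, active_ids, top_fraction):
    """Fraction of known actives found in the top `top_fraction` of a ranked list."""
    cutoff = max(1, round(len(ranked_ids) * top_fraction))
    hits = sum(1 for cid in ranked_ids[:cutoff] if cid in active_ids)
    return hits / len(active_ids)


def enrichment_factor(ranked_ids, active_ids, top_fraction):
    """Recovery relative to random selection, which would recover `top_fraction`."""
    return fraction_recovered(ranked_ids, active_ids, top_fraction) / top_fraction


# Synthetic example: 1000 ranked compounds, 5 actives, 3 of them in the top 10.
ranked = list(range(1000))
actives = {0, 3, 7, 500, 900}
recovered = fraction_recovered(ranked, actives, 0.01)
factor = enrichment_factor(ranked, actives, 0.01)
```

Evaluating `fraction_recovered` over a range of `top_fraction` values gives the points of an enrichment plot.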
  52. Tipping Point
     • Tipping point - the point at which a slow gradual change becomes irreversible and then proceeds with gathering pace
  53. ChemSpider Forums/Blogs
     • Forum.chemspider.com
     • www.chemspider.com/blog
  54. ChemSpider TouchGraph
  55. What would we most like to do?
     • Enable “Collaborative Science”. What would that look like?
     • Access to chemical supplies when people need them
     • Awareness of available literature, patents, databases of curated content – whether Open Access or not. Transaction fees (or not) are between user and provider
     • Host Open Notebook Science exchanges
  56. “ChemSpider Inside”
     • Instrument vendors integrated ChemSpider to their metabolism ID project – ChemSpider linked to all Mass Spec Instruments doing Metabolite ID?
     • Wikipedia roundtrip linking to ChemSpider
     • Google indexing ChemSpider at “fixed rate”
     • Integration to desktop drawing packages
     • Members of Microsoft BioIT Alliance
     • Discussions on Taverna’s Workflow Sourceforge group
     • Hosting Open Access articles shortly…
  57. Where to from here? Short term
     • Integrated text and structure/substructure searching of the Open Access literature is in development
     • Web-based scraping of structure-based information – examples in place
     • Enhanced web services layer to integrate searches
     • Deposit updated Patent Database (9 million structures)
     • Reaction handling and deposition
  58. Where to from here? Mid-term
     • Spidering for Chemistry – extract data from articles, webpages and data sources AND stay within copyright
     • WiChempedia project – wiki-layers on top of ChemSpider, alongside Wikipedia curation project.
     • Deeper integration to text-based searching and conversion of chemical names to structures for online structure searching:
     • Improved integration with NCBI Entrez system
     • Deliver “dedicated websites” for specific publishers
  59. Where to from here? Mid-term
     • An extensible datamodel “on the fly” allows us to easily expand to integrate abstract data to structures
     • Data mine and curate “parameters” – physicochemical and physiological parameters to enable QSAR analysis, data modeling and provision of models online (UNC-Chapel Hill, NISS)
  60. Our Challenges
     • There are “no employees”
     • ChemSpider is non-funded
     • System is hyper-dependent on ISP, power and limited compute power
     • We are upsetting a lot of people – evangelists, cheminformatics system vendors, publishers, data content providers
  61. Acknowledgments
     • The ChemSpider team of volunteer developers
     • ChemSpider Advisory Group
     • Our curators, depositors and users
     • Suppliers of commercial software – Microsoft, ACD/Labs, OpenEye, ChemAxon, SimBioSys
     • SureChem – Structure Based Online Patent Searching
  62. Further reading
     • www.chemspider.com/blog
     • Internet-based tools for communication and collaboration in chemistry, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 502-506, doi:10.1016/j.drudis.2008.03.015
     • A perspective of publicly accessible/open-access chemistry databases, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 495-501, doi:10.1016/j.drudis.2008.03.017