Crowdsourcing, Collaborations and
                Text-Mining in a
        World of Open Chemistry

A Presentation at Nature Publishing Group Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry


This is a presentation I gave at the Nature Publishing Group offices in London, UK. It covers general information about ChemSpider and our efforts with ChemMantis.

  1. Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry
      Antony Williams
  2. Imagine a time when ….
      The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar)
      Chemistry articles are indexed and searchable by a free online service
      The web is linked together through the “language of chemistry”
      Publicly funded research data can be shared and discussed in the Open, maybe as ONS?
      Cheminformatics has as much of a public face as bioinformatics
      Building a Structure Centric Community for Chemists
  3. ChemSpider - A Search Engine for Chemists
      Questions a chemist might ask…
        What is the melting point of n-butanol?
        What is the chemical structure of Xanax?
        Chemically, what is phenolphthalein?
        What are the stereocenters of cholesterol?
        Where can I find publications about xylene?
        What are the different trade names for Ketoconazole?
        What is the NMR spectrum of Aspirin?
        What are the safety handling issues for Thymol Blue?
      ChemSpider can answer all of these questions
  4. What is a Structure?
      Ask a computer…ask a chemist
  5. Tell Me About Glutathione
  6. Tell Me About Glutathione
  7. Tell Me About Glutathione
  8. Tell Me About Glutathione
  9. Tell Me About Glutathione
  10. Tell Me About Glutathione
  11. Link outs
  12. Links out to KEGG
      Kyoto Encyclopedia of Genes and Genomes
  13. How many names does a compound have?
  14. ChemSpider Data Content
      Over 21.5 million unique chemical structures from ca. 150 data sources
        Online Databases – PubChem, Drugbank, KEGG, Wikipedia
        Literature – PubMed, J Het Chem, Nature, RSC, Open Access
        Chemical Vendors – over 40 different vendors and growing
        Personal Depositions – individual contributions
        Content database vendors
        Analytical data collections
        Patents
        Web scraping
      Content is linked back to the original data sources
  15. Other Searches
      What compounds have a mass of 300 ± 0.001?
      or search a combination of intrinsic/predicted properties
  16. Other Searches
  17. Complex Search
  18. The Quality of Data Online…
      Aggregating data opens up quality issues
      Structure-identifier associations are “dirty”
      Structures are COMMONLY incorrect
      Manual curation of small databases is enough work – what about millions of structures?
      Structures are far from perfect. What is a “correct structure”? Full stereochemistry? Historical timeline of structure? Who is the authority?
  19. Who holds THE Quality Authority?
      Chemical Abstracts Service is the structural authority today.
      1400 employees, world standard in chemistry information
      101 years of knowledge, process and expertise.
      How can an online, free access system peacefully co-exist with the authority?
  20. Quality is a Major Issue - Search Butanol
      OLD EXAMPLE..now fixed
  21. Wikipedia Chemistry Curation project
      Only ca. 5000 organic structures, 7000 total structures
      Almost a year of work so far for a team of 6 people
      Many errors removed in the process.
      Curation process is a daily event for users/depositors
      Slow and torturous process
      http://en.wikipedia.org/wiki/Talk:Tacrolimus#IUPAC_Name_and_structure
  22. Wikipedia Curation
      Looking for self-consistency across a Wikipedia Page
      Primary key is the article TITLE
      The chemical shown needs to match the title
      Cyclic self-consistency – and decisions must get made
  23. Viagra or Sildenafil
  24. Other issues…
  25. Charges
  26. Sugars – Machine Readable vs Aesthetics
      Haworth, Stereo, Fischer
  27. Wikipedia – Crowdsourcing Chemistry
  28. Thymol Blue on ChemSpider
      Data online includes:
        UV-vis spectrum
        Measured experimental properties
        Link to Wikipedia article
        Links to chromatography details
        Multiple identifiers/trade names etc.
        Links to vendors/suppliers/other databases
        Safety information
      http://www.chemspider.com/q/thymol%20blue
  29. Differences between ChemSpider/Wikipedia
      ChemSpider | Wikipedia
      >21 million unique structures | ~5000 organics, 2000 others
      Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, … | Text
      Prediction of properties | No
      Analytical Data | No, but links.
      Active depositors/curators – 30 | Active editors > 50 (?)
      6000 people/day; 1900 registered | ????
      Compound monographs linked | Detailed compound monographs
  30. Differences between Wikipedia/ChemSpider
      Wikipedia | ChemSpider
      Supported by tried and tested Media-Wiki platform. | Primarily Microsoft .NET technologies with OS components
      Established infrastructure and Wikipedia Foundation Team | “Out of a basement” on three servers and 5 volunteers
      Chemistry is a subset of the ‘Pedia | Chemistry is the focus of ‘Spider
      GFDL licensing for everything | Mixed “licensing”
      Strong team of WP:Chem advocates, curators and admins | Growing team of advocates, curators and users
      Worldwide reputation as quality source – good and bad | Growing reputation as focused on quality
  31. Crowd-sourcing Curation
      How to curate data for millions of structures?
      Robot processes can clean up depositions
        Search for Chloride and check molecular formula for Cl
        Check for stereochemistry and remove names with stereo
      Provide a simple-to-use platform to curate, annotate and tag data
      Provide curator administration to prevent vandalism (Veropedia)
  32. Post Comments
      Anyone can “Post Comments” associated with a structure. To curate data we require login to track
  33. Multi-level Curation and Approval
  34. Crowd-sourcing Chemistry
      Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
      ALSO
      Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
  35. DailyMed
  36. Quality of Structures
  37. Quality of Structures!!!
  38. Structure-Centric
      We want to search “information” by structure, substructure, similarity of structure
      Specific focus on Open Chemistry at present
      Standard approaches would be:
        Identify chemical names “entity extraction”
        Convert chemical names to structures and index
      ChemSpider has a validated dictionary of structure-name pairs
      Use name extraction, name-conversion and dictionary look-up. THEN curate.
  39. “Entity Extraction”
      Rule-based recognition of systematic names:
        Use a lexeme of name fragments
        Rules for identifying bounds of a name
      Look-up dictionary:
        Drug Names
        Trivial Names
        Numbers: Registry IDs, EINECS/ELINCS
      Massive look-up dictionary of validated identifiers on ChemSpider
  40. Building a Structure Centric Community for Chemists
  41. Name Recognition
      Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91 %)
  42. Name Recognition
      Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91 %)
  43. How Many Chemical Names?
      “She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
  44. How Many Chemical Names?
      “She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
  45. ChemMantis
      Chemical Markup And Nomenclature Transformation Integrated System
  46. Making Open Access Articles Searchable
      Proof of Concept
      Can we HOST Chemistry Open Access articles on ChemSpider and add value?
      Can we identify chemical names in Open Access articles in a user-friendly manner?
      Can we convert names to structures in Open Access articles, expand ChemSpider and provide structure searching of Open Access chemistry articles?
      Can we provide an environment for chemists to mark up their own articles and crowd-source markup of an archive?
  47. Document markup
      ChemSpider now hosting Open Access articles from MDPI, Molecular Diversity Preservation International
      Hosting the Molbank collection at present
  48. A Standard for Document Markup?
      NLM-DTD: National Library of Medicine; Document Type Definition
      Approved markup definitions to apply to journal articles – extended as necessary for our purposes
  49. NLM/DTD markup
  50. Chemistry and Biology
      Menus can be extended as necessary
  51. Document markup
  52. Markup – 3 seconds!
  53. On the fly conversion
  54. Shorthand Formulae Supported
  55. One Click to more Info…
  56. Structure Image Conversion
  57. Two Seconds Later
  58. Not Always Perfect….
  59. A Platform for Markup
      Can we provide a platform for document markup for chemists?
      Workflow:
        Upload Word docs, RTF files or point to HTML and load
        Apply entity extraction, convert names to structures, mark up automatically and ask for user participation
        Publish final version with NLM-DTD markup
        Deposit all structures on ChemSpider under embargo and wait for article DOI to release
  60. Challenges
      Computer software can generate chemical names better than the majority of chemists
      The majority of chemical names are generated by humans, and
        Incorrect – convert to the wrong structure or are ambiguous
        One name, Multiple Structures
  61. Names and Structures
      Dichloroacetone
      Trichloromethylsilane
  62. Ambiguity
  63. Ambiguity in Abbreviations - DPA
  64. Ambiguity in Abbreviations - THF
  65. Import is Easy
      Make articles Public/Private (embargo date soon)
      Auto-markup and check by user
  66. IUPAC PAC Articles
  67. Supports Word .DOC, HTML, RTF
  68. Drexel University Documents
  69. Drexel University Documents
  70. Drexel University Documents
  71. Patents
  72. Single Configuration File defines entities for markup
      Algorithms can be built for certain entities but the majority are dictionaries – vendors, Phys Properties, Analytical
      We can extend our system to support your needs based on dictionaries – what does NPG need/not need?
  73. Nature Publications
  74. Entity Balloons
      Structures are the language of chemistry
      Show structures to chemists and search/link from there
  75. Other Dictionaries - Species
      We are considering
        Bacteria
        Fungi
        Enzymes
        Viruses
        PDB codes….
  76. Integrations Out to Other Sources
  77. Integrations Out to Other Sources
  78. Reactions
  79. Manual Curation is Always Necessary
  80. Text-Indexing and ChemSpider?
      ChemSpider text-indexes almost 500,000 Open Access and Free Access articles
      Collection is growing and more publishers have already agreed. Including theses in the future.
  81. Open Access Literature Search
  82. Conclusions
      The quality of structure-based data online should always be questioned – that includes ChemSpider
      Data on ChemSpider are being added and curated on a daily basis but we need more eyeballs helping always
      ChemSpider has a large validated structure-name dictionary
      Chemical name extraction and document markup is very enabling
  83. Oops…
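
The mass query on the "Other Searches" slide (what compounds have a mass of 300 +/- 0.001?) can be pictured as a tolerance filter over monoisotopic masses. The sketch below is only an illustration, not ChemSpider's implementation: the function names, the three-record list and the very simple formula parser (no brackets, charges or isotopes) are all assumptions made for the example.

```python
import re

# Monoisotopic masses of the most abundant isotopes (rounded to 8 decimals).
MONOISOTOPIC = {
    "H": 1.00782503,
    "C": 12.0,
    "N": 14.00307401,
    "O": 15.99491462,
    "Cl": 34.96885268,
}


def monoisotopic_mass(formula):
    """Sum isotope masses for a flat formula such as 'C9H8O4'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def search_by_mass(records, target, tolerance):
    """Return names of records whose mass lies within target +/- tolerance."""
    return [
        name
        for name, formula in records
        if abs(monoisotopic_mass(formula) - target) <= tolerance
    ]


# Hypothetical miniature "database" of (name, molecular formula) pairs.
RECORDS = [
    ("ethanol", "C2H6O"),
    ("dichloromethane", "CH2Cl2"),
    ("aspirin", "C9H8O4"),
]

print(search_by_mass(RECORDS, 180.0423, 0.001))
```

A window as tight as 0.001 only makes sense against exact (monoisotopic) masses; with average molecular weights the same query would miss most of its intended hits.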
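
The robot check mentioned on the "Crowd-sourcing Curation" slide (search for Chloride and check the molecular formula for Cl) might look roughly like the following. This is a hedged sketch under stated assumptions: the record layout, the function names and the sample records are hypothetical, and one record carries a deliberately wrong formula so that the check has something to flag.

```python
import re


def elements_in(formula):
    """Return the set of element symbols in a flat molecular formula."""
    # '[A-Z][a-z]?' keeps 'Cl' distinct from 'C', so 'CH4' contains no Cl.
    return set(re.findall(r"[A-Z][a-z]?", formula))


def flag_chloride_mismatches(records):
    """Flag names that mention 'chloride' while the formula has no Cl."""
    flagged = []
    for name, formula in records:
        if "chloride" in name.lower() and "Cl" not in elements_in(formula):
            flagged.append(name)
    return flagged


# Hypothetical deposited records: (name, molecular formula).
RECORDS = [
    ("sodium chloride", "NaCl"),
    ("benzyl chloride", "C7H7Cl"),
    ("methyl chloride", "CH4"),  # deliberately wrong formula for the demo
    ("dichloromethane", "CH2Cl2"),
]

print(flag_chloride_mismatches(RECORDS))
```

A check like this only raises candidates for review. As the slides stress, a curator still has to decide whether the name or the structure is the part that is wrong.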
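
The dictionary look-up half of the "Entity Extraction" slide can be sketched as a whole-word match of text against a name-to-structure dictionary. The three-entry dictionary, the SMILES strings chosen to stand in for structures, and the function names are assumptions for illustration; the slides describe ChemSpider's own dictionary as a massive validated one, and rule-based recognition of systematic names is not attempted here.

```python
import re

# Tiny illustrative dictionary: lower-case name -> SMILES string.
NAME_TO_SMILES = {
    "dichloromethane": "ClCCl",
    "ethanol": "CCO",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
}


def build_pattern(names):
    """Compile one case-insensitive, whole-word pattern for all names."""
    # Longest names first so a longer entry wins over a shorter prefix.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def extract_entities(text, dictionary=NAME_TO_SMILES):
    """Return (offset, matched text, SMILES) for each dictionary hit."""
    pattern = build_pattern(dictionary)
    return [
        (match.start(), match.group(1), dictionary[match.group(1).lower()])
        for match in pattern.finditer(text)
    ]


sample = (
    "The mixture was filtered and washed with dichloromethane. "
    "The red solid was recrystallized from ethanol 95%."
)
for offset, name, smiles in extract_entities(sample):
    print(offset, name, smiles)
```

Whole-word matching keeps "ethanol" from firing inside a longer word, but it cannot tell whether an ordinary English word is being used as a chemical or trade name, which is the problem the "How Many Chemical Names?" slides illustrate.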