Facilitating Scientific Discovery
through Crowdsourcing and
Distributed Participation
Antony Williams
NETTAB
October 17th ...
If it was not just about me…
• Together we might:
•
•
•
•
•
•
•

build an encyclopedia
…and rate restaurants
…share book r...
If it was not just about me…
• Together we might:
•
•
•
•
•
•
•

build an encyclopedia
…and rate restaurants
…share book r...
Crowdsource the galaxy
Various ways to contribute
Where Am I From?
What can be done with Big Data
Patients Like Me
Patients Like Me
Let’s Change the World
• Let’s map together all historical chemistry
data and build systems to integrate new data
• Heck, ...
That’s a BIG Request
What About Something Smaller?
• We’re going to map the world
• We’re going to take photos of as many
places as we can and ...
I’m from here…
Wikipedia
Wikipedia
Moel-Y-Parc
I care…I want to contribute…
The Power of Contribution
How do you spell Afonwen?
And the Welsh know!
Whoa…
•
•
•
•

So the world can be mapped…
We can enter a 3D environment within the map
We can add annotations
We can use ...
Chemistry Data is Everywhere
In a Galaxy far, far away…
• Build a structure-centric hub
• Aggregate structure-based data and integrate
• Link to additi...
A LITTLE Chemistry First
Structural Diagrams
Structural Diagrams
Analytical Data
Does Stereochemistry Matter?
Does one stereocenter matter?
 Distaval, Talimol, Nibrol,
Sedimide, Quietoplex,
Contergan, Neurosedyn,
Softenon, Thalidom...
Structural Representations
The InChI Standard
InChIKeys
Search the Web by Structure
The Quality of Chemical Data Online
What is the Structure of Vitamin K?
A lipid cofactor that is required for normal
blood...
What is the Structure of Vitamin K1?
What is the Structure of Vitamin K1?
Chemical Abstracts Service
Wikipedia
Wolfram Alpha
DailyMed
People Use Trusted Resources…
We can all help clean it up…
How will it improve?
Participation
and
contribution
Building a chemistry hub ..issues
ALL Different, ALL “Domoic Acids”
ONE is “correct”
ALL Different, ALL “Domoic Acids”
The EXPERTS must get it right?!
Web 2.0 Contribution
• We have been contributing
to the web for a along time
already – but how much in
chemistry?
• A few ...
Challenging a Publication
This story is only 11 years ago!
Oops…
>2 Years to Resolution
The New Way of Challenging
Challenging Science…
Collaboration towards completion
Detailed constructive dialog
Oxidation by Sodium Hydride?
The Blogosphere Analyzes…
How much is in the archives?
Data Mine the Archives
• Imagine the power of data-mining the archives
of the publishers
• Generate nanopublications from ...
Open Notebook Science Analysis
Oxidation by Sodium Hydride?

THIS IS A COP OUT!!!
What is Hexacyclinol?
The Blogosphere “Discusses”…
What is real, what is fake?

http://www.youtube.com/watch?v=hMpAoC-h5SA
Question Everything Online:
www.dhmo.org
Chemistry is Dangerous!

http://tinyurl.com/cl2awnj
Chemistry is Dangerous
Florida DJs May Face Felony for April
Fools' Water Joke
“… told their listeners that "dihydrogen
mo...
How do you recognize good vs bad?
Is this real?
Junk vs Real
•

“We then established a collaboration with
professor Sum Ting Wong, a fugitive from
the North Korean Univer...
What is real, what is fake?
Helping to change science
•
•
•
•

Participation and contribution
Immediacy of action
Platforms for contribution
Openness…...
Getting Called Out in Public…
Rules for Licensing Data
Challenged in the Twittersphere
Annotating Articles Today…
Attribution to me…
Back to Chemistry….
• ANYBODY can annotate a record on ChemSpider
• Registered users can deposit new data
•
•
•
•
•
•
•

C...
CURATION Search “Vitamin H”
“Curate” Identifiers
“Curate” Identifiers
Data Validation is Exacting Work
It is so difficult to navigate…
IP?
IP?
What’s the
What’s the
structure?
structure?
Are they in
Are they in
our file?
our ...
• Open PHACTS Project
• Develop a set of robust standards
• Integrate Chemistry and Biology data by implementing the
stand...
ChemSpider serving RDF and
the semantic web
•
•

Using RDF permalinks
http://www.chemspider.com/Chemical-Structure.7787.rd...
RDF and the semantic web
The issue of identifiers
Aspirin on ChemSpider
What is Gleevec
Imatinib

Mesylate

ChemSpider

Drugbank

PubChem
Dynamic Equality
Strict

Relaxed

Analysing

Browsing

chemspider:gleevec
chemspider:gleevec

drugbank:gleevec
drugbank:gl...
But what about Data Quality???
CVSP : chemical validation
Free chemistry validation platform that performs:
•Structure val...
Validation
Standardization
• Custom processing let’s user to put together workflow
from pre-defined standardization modules list
Data Review
DrugBank
DrugBank dataset (6516
records)
~60 records that can’t be dearomatized unambiguously

DB04283

DB04462
Nonsensical bonds
~30 records with bonds that do not make
sense

DB04283
DDB04009
Mismatches – which is correct?
~40 records where InChIs did not match the structure
DrugBank ID: DB00755
InChI=1S/C20H28O2...
Stereobond issues
7 records with 2 stereo bonds at chiral
atoms
J. Brechner, IUPAC
Graphical
Representation of
stereochem....
Chemistry Data out to OPS
Data
Chemistry Validation and Standardization
Platform (CVSP)
at cvsp.chemspider.com
•Validation...
Data is being
imported from
ChemSpider to
Open PHACTS in
RDF/turtle
RDF/VOID
– VoID is an RDF Schema vocabulary for expressing metadata about RDF
datasets.
• skos:exactMatch (Simple Knowledg...
RDF Export
Ongoing Updates to Linksets
•
•
•
•

Ongoing data checking activities on ChemSpider
Does Crowdsourcing work?
But WHY would...
Crowdsourcing – does it work?
• >200 people EVER have deposited or curated
data
• Database hosts make the largest contribu...
Contributions
Curation Activities
•
•
•
•

2009 – 8255 curations by 43 people
2010 – 10014 curations by 66 people
2011 – 16025 curations...
Depositors
•
•
•
•

2009
2010
2011
2012

91 unique depositors
120 unique depositors
99 unique depositors
120 unique deposi...
Rewards and Recognition
• The badgesonomy culture of recognition is
growing.
• Badges are commonplace
• FourSquare
• Klout
Rewards and Recognition
• Rewards and Recognition
integrated across all
ChemSpider related projects
• Including paths to e...
The Alt-Metrics Manifesto
http://altmetrics.org/manifesto/
ImpactStory
Future Recognition on
Scientists AltMetrics
Usage, Citations, Social Media, etc
Detailed Usage Statistics
What makes a Scientist Notable?
What will it be in the future???
We Are All Being Quantified…
How I am Quantified…
Likely Enabled by
• Persistent unique digital identifier
• Integrates to workflows such as manuscript
and grant submission...
Summary
• A grand vision for semantic data linking is
coming to fruition. We have a lot do.
• Data quality enhancements ca...
Inside our Publication Archive
• How much data is in RSC archive, in the
publications and in the supplementary info?
• How...
What if we could capture it all?
Digitally Enhancing the RSC Archive
Start with data in publications
Turn “Figures” Into Data

Spectral
FIGURE

Extracted
Spectrum
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4thiadiazol-5-yl)urea prepared in Example 6 , thiony...
ChemSpider Reactions
Thank You
Email: williamsa@rsc.org
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/...
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation
Upcoming SlideShare
Loading in...5
×

Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation

1,418

Published on

Science has evolved from the isolated individual tinkering in the lab, through the era of the “gentleman scientist” with his or her assistant(s), to group-based then expansive collaboration and now to an opportunity to collaborate with the world. With the advent of the internet the opportunity for crowd-sourced contribution and large-scale collaboration has exploded and, as a result, scientific discovery has been further enabled. The contributions of enormous open data sets, liberal licensing policies and innovative technologies for mining and linking these data has given rise to platforms that are beginning to deliver on the promise of semantic technologies and nanopublications, facilitated by the unprecedented computational resources available today, especially the increasing capabilities of handheld devices. The speaker will provide an overview of his experiences in developing a crowdsourced platform for chemists allowing for data deposition, annotation and validation. The challenges of mapping chemical and pharmacological data, especially in regards to data quality, will be discussed. The promise of distributed participation in data analysis is already in place.

Published in: Technology, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,418
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
6
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation

  1. 1. Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation Antony Williams NETTAB October 17th 2013
  2. 2. If it was not just about me… • Together we might: • • • • • • • build an encyclopedia …and rate restaurants …share book reviews …and movie reviews …and reviews of service providers …organize sit-ins and social action …and more data might just be Open
  3. 3. If it was not just about me… • Together we might: • • • • • • • build an encyclopedia …and rate restaurants …share book reviews …and movie reviews …and reviews of service providers …organize sit-ins and social action …and more data might just be Open
  4. 4. Crowdsource the galaxy
  5. 5. Various ways to contribute
  6. 6. Where Am I From?
  7. 7. What can be done with Big Data
  8. 8. Patients Like Me
  9. 9. Patients Like Me
  10. 10. Let’s Change the World • Let’s map together all historical chemistry data and build systems to integrate new data • Heck, let’s integrate chemistry and biology data and add in disease data too • Lets model the data and see if we can extract new relationships – quantitative and qualitative • Let’s make it all available on the web
  11. 11. That’s a BIG Request
  12. 12. What About Something Smaller? • We’re going to map the world • We’re going to take photos of as many places as we can and link them together • We’ll let people annotate and curate the map • Then let’s make it available free on the web • We’ll make it available for decision making • Put it on Mobile Devices, Give it Away
  13. 13. I’m from here…
  14. 14. Wikipedia
  15. 15. Wikipedia
  16. 16. Moel-Y-Parc
  17. 17. I care…I want to contribute…
  18. 18. The Power of Contribution
  19. 19. How do you spell Afonwen?
  20. 20. And the Welsh know!
  21. 21. Whoa… • • • • So the world can be mapped… We can enter a 3D environment within the map We can add annotations We can use the data, we can reference it, we can extract it, we can make decisions with it • And we can do it on our lap, in our hands • Let’s crowdsource chemistry
  22. 22. Chemistry Data is Everywhere
  23. 23. In a Galaxy far, far away… • Build a structure-centric hub • Aggregate structure-based data and integrate • Link to additional data sources • • • • Patents Publications Vendors Models
  24. 24. A LITTLE Chemistry First
  25. 25. Structural Diagrams
  26. 26. Structural Diagrams
  27. 27. Analytical Data
  28. 28. Does Stereochemistry Matter?
  29. 29. Does one stereocenter matter?  Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide
  30. 30. Structural Representations
  31. 31. The InChI Standard
  32. 32. InChIKeys Search the Web by Structure
  33. 33. The Quality of Chemical Data Online What is the Structure of Vitamin K? A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione) derived from plants, VITAMIN K2 (menaquinone) from bacteria & synthetic naphthoquinone provitamins, VITAMIN K3 (menadione).
  34. 34. What is the Structure of Vitamin K1?
  35. 35. What is the Structure of Vitamin K1?
  36. 36. Chemical Abstracts Service
  37. 37. Wikipedia
  38. 38. Wolfram Alpha
  39. 39. DailyMed
  40. 40. People Use Trusted Resources…
  41. 41. We can all help clean it up…
  42. 42. How will it improve? Participation and contribution
  43. 43. Building a chemistry hub ..issues ALL Different, ALL “Domoic Acids”
  44. 44. ONE is “correct” ALL Different, ALL “Domoic Acids”
  45. 45. The EXPERTS must get it right?!
  46. 46. Web 2.0 Contribution • We have been contributing to the web for a along time already – but how much in chemistry? • A few blogs, an increasing amount of tweeting but what about data sharing in chemistry?
  47. 47. Challenging a Publication This story is only 11 years ago!
  48. 48. Oops…
  49. 49. >2 Years to Resolution
  50. 50. The New Way of Challenging
  51. 51. Challenging Science…
  52. 52. Collaboration towards completion
  53. 53. Detailed constructive dialog
  54. 54. Oxidation by Sodium Hydride?
  55. 55. The Blogosphere Analyzes…
  56. 56. How much is in the archives?
  57. 57. Data Mine the Archives • Imagine the power of data-mining the archives of the publishers • Generate nanopublications from the articles and make available to the reasoning engines • Now imagine extracting all the “data” and freeing it up to the semantic web. • MORE LATER ON THAT!
  58. 58. Open Notebook Science Analysis
  59. 59. Oxidation by Sodium Hydride? THIS IS A COP OUT!!!
  60. 60. What is Hexacyclinol?
  61. 61. The Blogosphere “Discusses”…
  62. 62. What is real, what is fake? http://www.youtube.com/watch?v=hMpAoC-h5SA
  63. 63. Question Everything Online: www.dhmo.org
  64. 64. Chemistry is Dangerous! http://tinyurl.com/cl2awnj
  65. 65. Chemistry is Dangerous Florida DJs May Face Felony for April Fools' Water Joke “… told their listeners that "dihydrogen monoxide" was coming out of the taps throughout the Fort Myers area.”
  66. 66. How do you recognize good vs bad?
  67. 67. Is this real?
  68. 68. Junk vs Real • “We then established a collaboration with professor Sum Ting Wong, a fugitive from the North Korean University Hu Yu Hai Ding” • “..identified as the new protein Wai So Dim”
  69. 69. What is real, what is fake?
  70. 70. Helping to change science • • • • Participation and contribution Immediacy of action Platforms for contribution Openness…whatever that is
  71. 71. Getting Called Out in Public… Rules for Licensing Data
  72. 72. Challenged in the Twittersphere
  73. 73. Annotating Articles Today…
  74. 74. Attribution to me…
  75. 75. Back to Chemistry…. • ANYBODY can annotate a record on ChemSpider • Registered users can deposit new data • • • • • • • Compounds Reactions Links Analytical Data Images Movies, Data sets • Registered users can validate existing data
  76. 76. CURATION Search “Vitamin H”
  77. 77. “Curate” Identifiers
  78. 78. “Curate” Identifiers
  79. 79. Data Validation is Exacting Work
  80. 80. It is so difficult to navigate… IP? IP? What’s the What’s the structure? structure? Are they in Are they in our file? our file? What’s What’s similar? similar? Pharmacology Pharmacology data? data? What’s the What’s the target? target? Known Known Pathways? Pathways? Competitors? Competitors? Connections Connections to disease? to disease? Working On Working On Now? Now? Expressed in Expressed in right cell type? right cell type?
  81. 81. • Open PHACTS Project • Develop a set of robust standards • Integrate Chemistry and Biology data by implementing the standards in a semantic integration hub • Deliver services to support drug discovery programs in pharma and public domain • INITIALLY 22 partners, 8 pharmaceutical companies, 3 biotechs Guiding principle is open access, open usage, open source Guiding principle is open access, open usage, open source -- Key to standards adoption -Key to standards adoption
  82. 82. ChemSpider serving RDF and the semantic web • • Using RDF permalinks http://www.chemspider.com/Chemical-Structure.7787.rd • • • Using a Search Term http://www.chemspider.com/rdf.ashx?q=cyclohexane http://rdf.chemspider.com/cyclohexane
  83. 83. RDF and the semantic web
  84. 84. The issue of identifiers
  85. 85. Aspirin on ChemSpider
  86. 86. What is Gleevec Imatinib Mesylate ChemSpider Drugbank PubChem
  87. 87. Dynamic Equality Strict Relaxed Analysing Browsing chemspider:gleevec chemspider:gleevec drugbank:gleevec drugbank:gleevec LinkSet#1 { chemspider:gleevec hasParent imatinib ... drugbank:gleevec exactMatch imatinib ... }
  88. 88. But what about Data Quality??? CVSP : chemical validation Free chemistry validation platform that performs: •Structure validation • • • • • • Atoms Bonds Valence Stereo If aromatic - check that uniquely dearomatized Strongest acid not ionized first in partially-ionized system •Cross-matching of SDF fields • synonyms • InChIs • Smiles
  89. 89. Validation
  90. 90. Standardization • Custom processing let’s user to put together workflow from pre-defined standardization modules list
  91. 91. Data Review
  92. 92. DrugBank DrugBank dataset (6516 records) ~60 records that can’t be dearomatized unambiguously DB04283 DB04462
  93. 93. Nonsensical bonds ~30 records with bonds that do not make sense DB04283 DDB04009
  94. 94. Mismatches – which is correct? ~40 records where InChIs did not match the structure DrugBank ID: DB00755 InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-1218-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,15H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+ DruGBank ID: DB00614
  95. 95. Stereobond issues 7 records with 2 stereo bonds at chiral atoms J. Brechner, IUPAC Graphical Representation of stereochem. configurations Section: ST-1.1.10 DB08128 DB06287
  96. 96. Chemistry Data out to OPS Data Chemistry Validation and Standardization Platform (CVSP) at cvsp.chemspider.com •Validation •Standardization •Parent generation RDF Export
  97. 97. Data is being imported from ChemSpider to Open PHACTS in RDF/turtle
  98. 98. RDF/VOID – VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets. • skos:exactMatch (Simple Knowledge Organisation System) e.g. To link compounds in OPS with compounds in ChEBI. • skos:closeMatch e.g. To link Stereo Insensitive Parents to their Children within OPS. • skos:relatedMatch e.g. To link Parent compounds that contain others as Fragments. – Recommendations on VoID specified by Manchester Uni. here: http://www.cs.man.ac.uk/~graya/ops/2012/ED-datadesc/
  99. 99. RDF Export
  100. 100. Ongoing Updates to Linksets • • • • Ongoing data checking activities on ChemSpider Does Crowdsourcing work? But WHY would people validate ChemSpider? What can be done to encourage participation? • How will we deliver on Barend’s Million Minds approach – he has ideas…see later
  101. 101. Crowdsourcing – does it work? • >200 people EVER have deposited or curated data • Database hosts make the largest contributions • ChemSpider staff do the most curation
  102. 102. Contributions
  103. 103. Curation Activities • • • • 2009 – 8255 curations by 43 people 2010 – 10014 curations by 66 people 2011 – 16025 curations by 116 people 2012 – 13127 curations by 74 people
  104. 104. Depositors • • • • 2009 2010 2011 2012 91 unique depositors 120 unique depositors 99 unique depositors 120 unique depositors • “The crowd is small – very small”
  105. 105. Rewards and Recognition • The badgesonomy culture of recognition is growing. • Badges are commonplace • FourSquare • Klout
  106. 106. Rewards and Recognition • Rewards and Recognition integrated across all ChemSpider related projects • Including paths to expose such recognition on AltMetrics platforms • MAY work for young scientists
  107. 107. The Alt-Metrics Manifesto http://altmetrics.org/manifesto/
  108. 108. ImpactStory
  109. 109. Future Recognition on
  110. 110. Scientists AltMetrics
  111. 111. Usage, Citations, Social Media, etc
  112. 112. Detailed Usage Statistics
  113. 113. What makes a Scientist Notable? What will it be in the future???
  114. 114. We Are All Being Quantified…
  115. 115. How I am Quantified…
  116. 116. Likely Enabled by • Persistent unique digital identifier • Integrates to workflows such as manuscript and grant submission • Supports automated linkages with your professional activities
  117. 117. Summary • A grand vision for semantic data linking is coming to fruition. We have a lot do. • Data quality enhancements can come through crowdsourcing (and intelligent robots) • Participation may be driven by new approaches to rewards and recognition • Publishers are understanding the value of data, but still guard it, and generally don’t have systems to handle it. It’s changing. • There is so much value in the historical data
  118. 118. Inside our Publication Archive • How much data is in RSC archive, in the publications and in the supplementary info? • How many compounds for ChemSpider? • How many syntheses for ChemSpider reactions? • How many characterization measurements? • Property Data • Spectral Data • Graphs and charts to be used for modeling?
  119. 119. What if we could capture it all? Digitally Enhancing the RSC Archive
  120. 120. Start with data in publications
  121. 121. Turn “Figures” Into Data Spectral FIGURE Extracted Spectrum
  122. 122. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-Nmethyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  123. 123. ChemSpider Reactions
  124. 124. Thank You Email: williamsa@rsc.org Twitter: ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×