Bio Logical Mass Collaboration3
This presentation describes two modes of web-based knowledge acquisition in the domain of bioinformatics: "pull" models, such as social tagging systems, that engage passive altruism, and "push" models, such as the Mechanical Turk, that actively guide and incentivise the knowledge acquisition process.

Published in: Technology, Health & Medicine

  1. bioLogical mass collaboration
     Benjamin Good, University of British Columbia
     Symposium on (Bio)semantics for complex systems biology, Leiden University Medical Center, 12 March 2009
  2. mass collaboration - calling on a million minds...
  3. bioLogic
     [Diagram: relation R over X, Y pairs]
  4. The plan for today: Mostly-manual strategies for creating bioLogical knowledge
     • pull ➡ social tagging
     • push ➡ frames and games
  5. pull: 1. incentive
     • passive altruism: actions taken for individual gain result in collective benefit.
  6. pull: 2. example
     • hyperlinks: individual website authors did not intend to make Google possible...
  7. Social tagging (image from Lund (2006) http://xtech06.usefulinc.com/schedule/paper/75)
  8. bioLogic captured: URI hasTag T
  9. More data captured
     [Diagram: a Tagging Event connecting]
     • Resource Tagged: http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg
     • Tagger: JaneTagger
     • Tagging Context: 2007-8-29
     • Associated Tags: hippocampus, mri, image, wikipedia
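The tagging record on slide 9 can be sketched as a small data structure. This is only an illustrative sketch: the class name `TaggingEvent`, its field names, and the reading of 2007-8-29 as the date of the event are assumptions, not part of the talk. The `triples` method flattens one event into the `URI hasTag T` statements from slide 8.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class TaggingEvent:
    """One tagging act: who tagged which resource, when, and with which tags.

    Hypothetical structure for illustration; field names are assumptions.
    """
    resource: str            # URI of the resource tagged
    tagger: str              # user who made the post
    tagged_on: date          # assumed to be the date of the tagging event
    tags: Tuple[str, ...]    # associated tags, in the order entered

    def triples(self) -> List[Tuple[str, str, str]]:
        """Flatten the event into (URI, 'hasTag', T) statements."""
        return [(self.resource, "hasTag", tag) for tag in self.tags]


event = TaggingEvent(
    resource="http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg",
    tagger="JaneTagger",
    tagged_on=date(2007, 8, 29),
    tags=("hippocampus", "mri", "image", "wikipedia"),
)

for triple in event.triples():
    print(triple)
```

Keeping the tagger and date next to the tags is what distinguishes the fuller record from the bare `hasTag` triple: the same tag on the same resource by different people stays distinguishable.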
  10. Tags
      • Not the same as either professionally or automatically generated keywords. - (Al-Khalifa & Davis 2007)
      • Can be used to improve Web search - (Morrison 2008)
  11. Tagging in science?
      • How does social tagging compare to professional indexing in the life sciences?
      • (Good, Tennis, Wilkinson in preparation)
  12. “Tuned responses of astrocytes and their influence on hemodynamic signals in the visual cortex”
  13. growth of CiteULike
      [Chart: Number of distinct PubMed documents tagged per month. y-axis: N distinct PMIDs/month (0 to 100,000); x-axis: date (1999 to 2024). Series: CiteULike Observed, CiteULike Extrapolated, 95% lower bound, 95% upper bound, MEDLINE, Linear (MEDLINE), Linear (CiteULike Extrapolated), Extrapolated Upper Bound, Extrapolated Lower Bound]
  14. but..
      [Charts: Tags per PubMed Citation: CiteULike Aggregate; MeSH Descriptors per PubMed Citation. x-axis: N tags; y-axis: Density]
  15. because..
      [Charts: Posts per PubMed Citation: Connotea; Posts per PubMed Citation: CiteULike. x-axis: N posts; y-axis: N citations]
  16. open social tagging - in science
      ➡ low numbers of tags per post
      ➡ low numbers of posts per document
      ➡ low value of tags as descriptors..
  17. adding value to each tag
      • social semantic tagging
        ➡ tagging with encoded concepts instead of strings of letters
        ➡ = the Entity Describer (E.D.)
      Good, Kawas, Wilkinson (2007) Bridging the gap between social tagging and semantic annotation. Nature Precedings
  18. Tagging with Connotea
  19. Typical tagging
      • User types in all tags
      • Type-ahead displays previously used tags
  20. Tagging with E.D.
  21. Adding a semantic tag
  22. Adding a semantic tag
  23. More data captured for each tag
  24. E.D. can be customized
      • Tag with: genes, gene ontology terms, terms from OWL ontologies
      • Recently used to conduct a successful experiment in BioMoby Web service annotation
  25. but!
      • Does not address the volume problem - more participation is needed to make social tagging a useful source of bioLogical knowledge.
  26. The plan for today: Mostly-manual strategies for creating bioLogical knowledge
      • pull ➡ social tagging
      • push ➡ frames and games
  27. push
      • Key difference from pull model is that system designers push specific requests to users
      • many incentive options: financial, psychological...
  28. Pushy pattern
      1. design frame for knowledge to be collected
      2. choose incentive system
      3. design interface
      4. collect knowledge
      5. aggregate knowledge
  29. Mechanical Turk: pushing with money
      • A “marketplace for work” hosted by Amazon Inc.: “artificial artificial intelligence”
  30. Mechanical Turk and NLP
      • Snow et al (2008)
        - used workers on the AMT to label text for use in training/testing NLP algorithms.
        - word sense disambiguation, affect recognition and several more.
      Snow et al (2008) Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Empirical Methods in Natural Language Processing, pp. 254-263
  31. Snow et al (2008) cont.: Results for affect recognition
      • labels = 7000
      • cost = $2
      • time = 5.9 hours
      • when aggregated, results equal to or better than expert labelers in most cases.
      Snow et al (2008) Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Empirical Methods in Natural Language Processing, pp. 254-263
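To put the affect-recognition figures on slide 31 in perspective, the short calculation below derives throughput and unit cost from the three numbers on the slide (7000 labels, $2, 5.9 hours). The variable names are illustrative only, and the derived rates are simple ratios, not figures reported by Snow et al.

```python
# Figures taken from slide 31 (affect recognition task).
labels = 7000
cost_usd = 2.0
hours = 5.9

# Derived ratios (not reported on the slide; plain arithmetic).
labels_per_dollar = labels / cost_usd
labels_per_hour = labels / hours
cost_per_label = cost_usd / labels

print(f"{labels_per_dollar:.0f} labels per dollar")
print(f"{labels_per_hour:.0f} labels per hour")
print(f"${cost_per_label:.5f} per label")
```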
  32. ESP game, pushing with fun
      Von Ahn and Dabbish (2004) Labeling Images with a Computer Game http://www.cs.cmu.edu/~biglou/ESP.pdf
  33. ESP game results (2004)
      • >4 million images labeled
      • >23,000 players
      • Given 5,000 players online simultaneously, could label all of the images accessible to Google in a month
      • (See the “Google image labeling game”…)
  34. iCAPTURer: assessing push for bioLogical knowledge
      • Can we acquire bio-ontological knowledge from untrained volunteers in a scalable, Web-based manner?
      • 2 experiments in the context of scientific conferences
      Good et al. 2006. Fast, cheap, and out of control: a zero-curation model for ontology development.
      Good and Wilkinson 2007. Ontology engineering using volunteer labor
  35. iCAPTURer 1: Goals
      1. Identify concepts from text
      2. Link concepts to synonyms and to hyponyms (‘x is_a y’) rooted in the UMLS Semantic Network
      Good et al. 2006. Fast, cheap, and out of control: a zero-curation model for ontology development.
  36. iCAPTURer 1 - terminology builder
      Abstracts
      ➡ Automatic term extraction - Text2Onto
      ➡ Candidate terms: Taste bar Cell foo smooth muscle cell immune response Glucose cell Cell biology queen
      ➡ Volunteers filter terms and extend terminology
      ➡ Validated terms: smooth muscle cell, immune response, apoptosis
  37. iCAPTURer 1 - taxonomy builder
      Validated terms: T-cell activation, smooth muscle cell, apoptosis
      ➡ Volunteers assign parents
      ➡ UMLS Semantic Network: Generic Concept, Entity, Event, Physical_Object, Conceptual_Entity, Process, Activity
  38. iCAPTURer 1 - taxonomy builder
      [Diagram: UMLS Semantic Network (Generic Concept, Entity, Event, Physical_Object, Conceptual_Entity, Process, Activity) with smooth muscle cell, T-cell activation and apoptosis attached]
  39. iCAPTURer 1 results regarding volunteers
      • Recruiting went surprisingly well.
      • Volume of contributions highly skewed - a few did most of the work
  40. Participation curve
      [Chart: percent of total knowledge added (y-axis, 0 to 0.14) per volunteer (x-axis, volunteers 1 to 64)]
  41. knowledge gathered
      1) Collection: 2 days, 68 participants
         • Terms: 232 auto. + 429 man. = 661
         • Hyponyms: 207
         • Synonyms: 340
      2) Evaluation: 3 days, 65 participants, 11,545 votes
         [Charts: A: Terms sorted by fraction "true" votes; B: Synonyms sorted by fraction "true" votes; C: Hyponyms sorted by fraction "true" votes. y-axis: % “true” votes. Panel annotations: 93% true > false, 54% true > false, 49% true > false]
  42. knowledge gathered (same content as slide 41)
  43. Initial acquisition versus evaluation
      [Chart: number of assertions gathered (axis marks at 1,000 and 11,000) for knowledge capture, conducted at YI forum, and evaluation, via email request]
  44. Initial acquisition versus evaluation
      [Chart: number of assertions gathered (axis marks at 1,000 and 11,000)]
      • Knowledge capture, conducted at YI forum: “I assert that t cell activation is a kind of immune response”
        - Forms
        - Tree navigation
        - Conference setting
        - 2 days
        - 68 people
      • Evaluation, via email request: “I agree that t cell activation is a kind of immune response”
        - Multiple choice (voting)
        - Home setting
        - 3 days
        - 65 people
  45. iCAPTURer 2 pattern
      1. Infer complete ontology
      2. Present each edge as a multiple choice question {true, false, I don’t know}
      3. Aggregate votes to decide on each triple
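Step 3 of the pattern on slide 45 can be sketched as a simple tally per triple. This is a minimal sketch under stated assumptions: the function name `aggregate_votes`, the "more true than false" decision rule, and the second example triple are hypothetical and are not taken from the iCAPTURer implementation. "I don't know" answers are counted but do not affect the decision.

```python
from collections import Counter


def aggregate_votes(votes):
    """Decide each triple from (triple, answer) pairs.

    answer is one of "true", "false", "I don't know".
    Assumed rule: accept when "true" outnumbers "false", reject when
    "false" outnumbers "true", otherwise leave the triple undecided.
    """
    tallies = {}
    for triple, answer in votes:
        tallies.setdefault(triple, Counter())[answer] += 1

    decisions = {}
    for triple, counts in tallies.items():
        if counts["true"] > counts["false"]:
            decisions[triple] = "accept"
        elif counts["false"] > counts["true"]:
            decisions[triple] = "reject"
        else:
            decisions[triple] = "undecided"
    return decisions


# Hypothetical votes for illustration only.
edge_a = ("T-cell activation", "subClassOf", "immune response")
edge_b = ("apoptosis", "subClassOf", "smooth muscle cell")
votes = [
    (edge_a, "true"), (edge_a, "true"), (edge_a, "I don't know"),
    (edge_b, "false"), (edge_b, "true"), (edge_b, "false"),
]

for triple, decision in aggregate_votes(votes).items():
    print(triple, "->", decision)
```

A plain majority inherits any systematic bias in the voters, so a stricter threshold than "true > false" may be worth considering when voters tend to agree too readily.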
  46. iCAPTURer 2 knowledge sought: X subClassOf Y ? (immunology)
  47. iCAPTURer2 results
      • Same pattern of participation
      • Only 66% correct overall in assessing subClass assertions
      • highly biased towards saying ‘yes’.
      [Chart: fraction of subclass judgments made (y-axis) per volunteer (x-axis, volunteers 1 to 25)]
  48. iCAPTURer summary
      • Scientifically relevant tasks are harder: the population pool is smaller but, in my experience, generally very willing.
      • Engaging the competitive instinct was helpful in obtaining the responses we did.
      • Much room for further investigation.
  49. Small steps
      • but apparently in a promising direction
  50. Filling in Freebase with Typewriter: X is a Y ? http://typewriter.freebaseapps.com/ March 9, 2009
  51. Filling in Freebase with Typewriter: X is a Y ? http://typewriter.freebaseapps.com/ March 9, 2009
  52. To achieve mass collaborative bioLogical knowledge assembly, make it possible for people to contribute
      • in multiple modes
        - as creators
        - as evaluators
        - as system builders (open APIs are crucial)
      • and for multiple reasons
        - personal information management
        - fun, competition
        - finance
      [Diagram: relation R over X, Y pairs]
  53. “...how you envision future developments...”
      Automation
  54. “...how you envision future developments...”
      Automation + Human computation
  55. “...how you envision future developments...”
      Automation + Human computation = increasingly high-throughput bioLogical knowledge representation
  56. “...how your own expertise would fit into this realm...”
      [Diagram labels: more bioLogical knowledge; requires; representation; analyses; machine learning; community action; ben; knows a bit about]
      http://biordf.net/~bgood/
  57. Thanks to
      • developers: Eddie Kawas, Paul Lu
      • advisor: Mark Wilkinson
      • Barend Mons for the invitation and Marco Roos for the accommodation!
      http://biordf.net/~bgood/
