This presentation describes two modes of web-based knowledge acquisition in the domain of bioinformatics: "pull" models, such as social tagging systems that engage passive altruism, and "push" models, such as the Mechanical Turk, that actively guide and incentivise the knowledge acquisition process.
Large data sets comprising multiple correlated attributes may include phenomena that are hard to identify and understand using traditional data analysis and visualization methods. HeatMiner is a new visual data mining technology that visualizes data as three-dimensional heatmaps. Even complex patterns missed by other methods are easy to recognize in a 3D heatmap at a glance. Try HeatMiner with your own data at the Cloud’N’Sci.fi Algorithms-as-a-Service marketplace!
Presentation at the Agua Viva Canarias Festival - Bluefin tuna (Atún rojo) - Sebastián Losada
Presentation given at the AguaViva Canarias Festival on the use of spatial measures for the protection of bluefin tuna.
Presentation for the Meeting of the network Scientists for Cycling (ECF) in Sevilla, 2011. It shows information about Esther Anaya's work and research fields.
Pascal Hartmann is a German sociologist and an experienced strategy executive and theory designer. He is also the Director of the R & D Department at Logon Architecture. With an eye to the future, his work embraces the architectural heritage of the city in a sustainable fashion.
Gene Wiki and Mark2Cure update for BD2K - Benjamin Good
An introduction to the Gene Wiki project with an emphasis on the use of the new WikiData project. Also describes Mark2Cure, a citizen science initiative focused on biomedical text mining.
An update on the Gene Wiki project, an introduction to the knowledge.bio semantic search application, and an introduction to the biobranch.org collaborative decision tree builder.
Solutions for the Texas Energy Shortage - Rick Borry
Ron Seidel, PE, principal at RBS Energy Consulting and a board member of Principal Solar, Inc., will discuss and answer questions about his recent whitepaper, "Solutions for the Texas Energy Shortage."
Ron's whitepaper is very timely: in the summer of 2011, Texas experienced extremely low reserve-margin periods throughout the state, causing average wholesale electricity prices to skyrocket to more than twice their normal level. Given that Texas is expected to add another 14 million people to its population between 2010 and 2030, these shortages raise alarms about the state's ability to meet future energy demand. Success will depend on finding the most effective way to incentivize the development of more capacity.
Unlike many other states, Texas has had a competitive retail market for electricity since 2001, replacing the traditional cost-of-service-based regulated market. The market requires customers to choose a competitive electricity supplier and allows retail suppliers to set their prices without regulatory interference. However, regulatory action has resulted in caps being placed on system-wide wholesale power prices with the intent of protecting consumers. It is these system-wide offer caps that have limited prices, reduced potential profitability for wholesalers and restrained the development of new generation.
Download the complete whitepaper at www.principalsolarinstitute.org/documents.
Chief Product Officer Tim Brown's presentation from Exponential Sydney's Digital Leaders Dinner, delivering insights around the future and evolution of online advertising and attribution. For more details, tweet us at @exponentialinc or visit our website: www.exponential.com
Integrating Pathway Databases with Gene Ontology Causal Activity Models - Benjamin Good
The Gene Ontology (GO) Consortium (GOC) is developing a new knowledge representation approach called ‘causal activity models’ (GO-CAM). A GO-CAM describes how one or several gene products contribute to the execution of a biological process. In these models (implemented as OWL instance graphs anchored in Open Biological Ontology (OBO) classes and relations), gene products are linked to molecular activities via semantic relationships like ‘enables’, molecular activities are linked to each other via causal relationships such as ‘positively regulates’, and sets of molecular activities are defined as ‘parts’ of larger biological processes. This approach provides the GOC with a more complete and extensible structure for capturing knowledge of gene function. It also allows for the representation of knowledge typically seen in pathway databases.
Here, we present details and results of a rule-based transformation of pathways represented using the BioPAX exchange format into GO-CAMs. We have automatically converted all Reactome pathways into GO-CAMs and are currently working on the conversion of additional resources available through Pathway Commons. By converting pathways into GO-CAMs, we can leverage OWL description logic reasoning over OBO ontologies to infer new biological relationships and detect logical inconsistencies. Further, the conversion helps to increase standardization for the representation of biological entities and processes. The products of this work can be used to improve source databases, for example by inferring new GO annotations for pathways and reactions and can help with the formation of meta-knowledge bases that integrate content from multiple sources.
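As a toy illustration of the instance-graph structure described above, the sketch below represents GO-CAM-style assertions as plain triples in Python. The identifiers and labels are invented, not real GO or Reactome IDs, and actual GO-CAMs are OWL instance graphs rather than flat triple lists.

```python
# Toy GO-CAM-style instance graph: individuals linked by semantic relations
# such as 'enables', 'positively_regulates', and 'part_of'.
# All identifiers below are hypothetical.
gocam = [
    # (subject, relation, object)
    ("gene_product:KinaseX",       "enables",              "activity:kinase_activity_1"),
    ("activity:kinase_activity_1", "positively_regulates", "activity:transcription_1"),
    ("activity:kinase_activity_1", "part_of",              "process:signal_transduction"),
    ("activity:transcription_1",   "part_of",              "process:signal_transduction"),
]

def activities_enabled_by(graph, gene_product):
    """Return the molecular activities a gene product enables."""
    return [o for s, r, o in graph if s == gene_product and r == "enables"]

def parts_of(graph, process):
    """Return the activities asserted to be part of a biological process."""
    return [s for s, r, o in graph if r == "part_of" and o == process]

print(activities_enabled_by(gocam, "gene_product:KinaseX"))
print(parts_of(gocam, "process:signal_transduction"))
```

In the real system, these relations come from OBO ontologies and the graphs are amenable to OWL description-logic reasoning, which is what enables the inference and consistency checking mentioned above.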
Pathways2GO: Converting BioPAX pathways to GO-CAMs - Benjamin Good
Presentation at the Gene Ontology Consortium Annual Meeting. Describing the automatic conversion of biochemical pathways in the Reactome Knowledge Base into the Gene Ontology 'Causal Activity Model' representation.
Building a Biomedical Knowledge Garden - Benjamin Good
Describes the tribulations of building a large biomedical knowledge graph. Provides a comparison between the UMLS and Wikidata in terms of content and structure. Concludes with the idea of anchoring the knowledge graph in Wikidata items and properties.
When the Heart BD2K grant was originally written, we proposed to build something called “Big Data World” to help advance citizen science, scientific crowdsourcing and science education, especially in bioinformatics. This past year, that idea has become Science Game Lab ( https://sciencegamelab.org ), a collaboration between the Su laboratory at Scripps Research, Playmatics LLC, and, recently, the creators of WikiPathways.
Opportunities and challenges presented by Wikidata in the context of biocuration - Benjamin Good
Abstract—Wikidata is a world-readable and world-writable knowledge base maintained by the Wikimedia Foundation. It offers the opportunity to collaboratively construct a fully open-access knowledge graph spanning biology, medicine, and all other domains of knowledge. To realize this potential, social and technical challenges must be overcome, many of which are familiar to the biocuration community. These include community ontology building, high-precision information extraction, provenance, and license management. By working together with Wikidata now, we can help shape it into a trustworthy, unencumbered central node in the Semantic Web of biomedical data.
(Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery - Benjamin Good
PubMed now indexes roughly 25 million articles and is growing by more than a million per year. The scale of this “Big Knowledge” repository renders traditional, article-based modes of user interaction unsatisfactory, demanding new interfaces for integrating and summarizing widely distributed knowledge. Natural language processing (NLP) techniques coupled with rich user interfaces can help meet this demand, providing end-users with enhanced views into public knowledge, stimulating their ability to form new hypotheses.
Knowledge.Bio provides a Web interface for exploring the results from text-mining PubMed. It works with subject, predicate, object assertions (triples) extracted from individual abstracts and with predicted statistical associations between pairs of concepts. While agnostic to the NLP technology employed, the current implementation is loaded with triples from the SemRep-generated SemmedDB database and putative gene-disease pairs obtained using Leiden University Medical Center’s ‘Implicitome’ technology.
Users of Knowledge.Bio begin by identifying a concept of interest using text search. Once a concept is identified, associated triples and concept-pairs are displayed in tables. These tables have text-based and semantic filters to help refine the list of triples to relations of interest. The user then selects relations for insertion into a personal knowledge graph implemented using cytoscape.js. The graph is used as a note-taking or ‘mind-mapping’ structure that can be saved offline and then later reloaded into the application. Clicking on edges within a graph or on the ‘evidence’ element of a triple displays the abstracts where that relation was detected, thus allowing the user to judge the veracity of the statement and to read the underlying articles.
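The personal knowledge-graph workflow described above (select relations, save offline, reload later) can be sketched minimally as follows. The node names, predicate, and PMID are invented for illustration; the actual application renders the graph with cytoscape.js.

```python
import json

# Minimal save/reload cycle for a personal knowledge graph built from
# subject-predicate-object relations. All names and IDs are hypothetical.
graph = {
    "nodes": ["GENE_X", "disease_Y"],
    "edges": [
        {"subject": "GENE_X", "predicate": "ASSOCIATED_WITH",
         "object": "disease_Y", "evidence": ["PMID:0000000"]},
    ],
}

saved = json.dumps(graph)      # save offline (e.g. write to a file)
reloaded = json.loads(saved)   # later, reload into the application

# Clicking an edge would display the abstracts listed under 'evidence',
# letting the user judge the veracity of the statement.
print(reloaded["edges"][0]["evidence"])
```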
Knowledge.Bio is a free, open-source application that provides deep, personal, concise, shareable views into the “Big Knowledge” scattered across the biomedical literature.
Application: http://knowledge.bio
Source code: https://bitbucket.org/sulab/kb1/
Building a massive biomedical knowledge graph with citizen science - Benjamin Good
The life sciences are faced with a rapidly growing array of technologies for measuring the molecular states of living things. From sequencing platforms that can assemble the complete genome sequence of a complex organism involving billions of nucleotides in a few days to imaging systems that can just as rapidly churn out millions of snapshots of cells, biology is truly faced with a data deluge.

To translate this information into new knowledge that can guide the search for new medicines, biomedical researchers increasingly need to build on the existing knowledge of the broad community. Prior knowledge can help guide searches through the masses of new data. Unfortunately, most biomedical knowledge is represented solely in the text of journal articles. Given that more than a million such articles are published every year, the challenge of using this knowledge effectively is substantial. Ideally, knowledge such as the interrelations between genes, drugs and diseases would be represented in a knowledge graph that enabled queries like: “show me all the genes related to this disease or related to any drugs used to treat this disease”. Systems exist that attempt to extract this information automatically from text, but the quality of their output remains far below what can be obtained by human readers.

We are developing a new platform that taps the language comprehension abilities of citizen scientists to help excavate a queryable knowledge graph from the biomedical literature. In proof-of-concept experiments, we have demonstrated that lay-people are capable of extracting meaningful information from complex biological text. The information extracted using this community intelligence framework can surpass the efforts of individual experts in quality while also offering the potential to achieve massive scale. In this presentation we will describe the results of early experiments and introduce our prototype citizen science platform: http://mark2cure.org.
Branch: An interactive, web-based tool for building decision tree classifiers - Benjamin Good
A crucial task in modern biology is the prediction of complex phenotypes, such as breast cancer prognosis, from genome-wide measurements. Machine learning algorithms can sometimes infer predictive patterns, but there is rarely enough data to train and test them effectively, and the patterns that they identify are often expressed in forms (e.g. support vector machines, neural networks, random forests composed of tens of thousands of trees) that are very difficult to understand. In addition, it is generally unclear how to include prior knowledge in the course of their construction.
Decision trees provide an intuitive visual form that can capture complex interactions between multiple variables. Effective methods exist for inferring decision trees automatically but it has been shown that these techniques can be improved upon via the manual interventions of experts. Here, we introduce Branch, a new Web-based tool for the interactive construction of decision trees from genomic datasets. Branch offers the ability to: (1) upload and share datasets intended for classification tasks (in progress), (2) construct decision trees by manually selecting features such as genes for a gene expression dataset, (3) collaboratively edit decision trees, (4) create feature functions that aggregate content from multiple independent features into single decision nodes (e.g. pathways) and (5) evaluate decision tree classifiers in terms of precision and recall. The tool is optimized for genomic use cases through the inclusion of gene and pathway-based search functions.
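As a rough illustration of points (2) and (4) above, here is a toy decision tree with a hand-chosen single-gene split and a pathway-style feature function that aggregates several genes into one decision node. The gene names, thresholds, and labels are hypothetical and not drawn from Branch itself.

```python
# Toy manually built decision tree over gene-expression features.
# All gene names, thresholds, and class labels are invented.

def pathway_mean(sample, genes):
    """Feature function: mean expression across a set of genes (a 'pathway')."""
    return sum(sample[g] for g in genes) / len(genes)

def classify(sample):
    # Root node: a single-gene split chosen by the expert.
    if sample["GENE_A"] > 2.0:
        return "poor prognosis"
    # Child node: an aggregate, pathway-level split.
    if pathway_mean(sample, ["GENE_B", "GENE_C", "GENE_D"]) > 1.5:
        return "poor prognosis"
    return "good prognosis"

samples = [
    {"GENE_A": 3.1, "GENE_B": 0.2, "GENE_C": 0.3, "GENE_D": 0.1},
    {"GENE_A": 0.4, "GENE_B": 2.0, "GENE_C": 2.5, "GENE_D": 1.8},
    {"GENE_A": 0.2, "GENE_B": 0.1, "GENE_C": 0.2, "GENE_D": 0.3},
]
print([classify(s) for s in samples])
```

Evaluating such a tree against held-out labels would yield the precision and recall figures that point (5) refers to.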
Branch enables expert biologists to easily engage directly with high-throughput datasets without the need for a team of bioinformaticians. The tree building process allows researchers to rapidly test hypotheses about interactions between biological variables and phenotypes in ways that would otherwise require extensive computational sophistication. In so doing, this tool can both inform biological research and help to produce more accurate, more meaningful classifiers.
A prototype of Branch is available at http://biobranch.org/
The Cure: Making a game of gene selection for breast cancer survival prediction - Benjamin Good
Background: Molecular signatures for predicting breast cancer prognosis could greatly improve care through personalization of treatment. Computational analyses of genome-wide expression datasets have identified such signatures, but these signatures leave much to be desired in terms of accuracy, reproducibility and biological interpretability. Methods that take advantage of structured prior knowledge (e.g. protein interaction networks) show promise in helping to define better signatures but most knowledge remains unstructured. Crowdsourcing via scientific discovery games is an emerging methodology that has the potential to tap into human intelligence at scales and in modes previously unheard of.
Objective: The main objective of this study was to test the hypothesis that knowledge linking expression patterns of specific genes to breast cancer outcomes could be captured from players of an open, Web-based game. We envisioned capturing knowledge both from the player’s prior experience and from their ability to interpret text related to candidate genes presented to them in the context of the game.
Methods: We developed and evaluated an online game called “The Cure” that captured information from players regarding genes for use in predictors of breast cancer survival. Information gathered from game play was aggregated using a voting approach and used to create rankings of genes. The top genes from these rankings were evaluated using annotation enrichment analysis, comparison to prior predictor gene sets, and by using them to train and test machine learning systems for predicting 10-year survival.
Results: Between its launch in Sept. 2012 and Sept. 2013, The Cure attracted more than 1,000 registered players who collectively played nearly 10,000 games. Gene sets assembled through aggregation of the collected data showed significant enrichment for genes known to be related to key concepts such as Cancer, Disease Progression, and Recurrence (P < 1.1e-07). In terms of the accuracy of models trained using them, these gene sets provided comparable performance to gene sets generated using other methods including those used in commercial tests. The Cure is available at http://genegames.org/cure/
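The voting-based aggregation described in the Methods can be sketched as follows; the game plays and gene names below are invented for illustration, not data from The Cure.

```python
from collections import Counter

# Each game play contributes votes for the genes the player selected;
# genes are then ranked by total votes across all plays.
plays = [
    ["TP53", "BRCA1", "MYC"],   # genes chosen in one game
    ["TP53", "EGFR"],
    ["BRCA1", "TP53"],
]

votes = Counter(gene for play in plays for gene in play)
ranking = [gene for gene, _ in votes.most_common()]
print(ranking[:2])
```

The top genes from such a ranking are what the study then evaluated via enrichment analysis and as features for survival classifiers.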
Poster: Microtask crowdsourcing for disease mention annotation in PubMed abstracts - Benjamin Good
Benjamin M. Good, Max Nanis, Andrew I. Su
Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses that would otherwise be impossible. As a result, many biological natural language processing (BioNLP) projects attempt to address this challenge. However, the state of the art in BioNLP still leaves much room for improvement in terms of precision, recall and the complexity of knowledge structures that can be extracted automatically. Expert curators are vital to the process of knowledge extraction but are always in short supply. Recent studies have shown that workers on microtasking platforms such as Amazon’s Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text.
Here, we investigated the use of AMT to capture disease mentions in PubMed abstracts. We used the recently published NCBI Disease corpus as a gold standard for refining and benchmarking the crowdsourcing protocol. After merging the responses from 5 AMT workers per abstract with a simple voting scheme, we achieved a maximum F-measure of 0.815 (precision 0.823, recall 0.807) over 593 abstracts as compared to the NCBI annotations on the same abstracts. Comparisons were based on exact matches to annotation spans. The results can also be tuned to optimize for precision (max = 0.98 when recall = 0.23) or recall (max = 0.89 when precision = 0.45). It took 7 days and cost $192.90 to complete all 593 abstracts considered here (at $0.06/abstract, with 50 additional abstracts used for spam detection).
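A minimal sketch of such a voting scheme follows: keep a disease-mention span if at least k of the n workers marked it, then score exact-span matches against the gold standard. The spans and the threshold are invented for illustration; the study's actual merging and tuning may differ.

```python
from collections import Counter

def merge_by_vote(worker_annotations, min_votes):
    """worker_annotations: one set of (start, end) spans per worker."""
    counts = Counter(span for spans in worker_annotations for span in spans)
    return {span for span, c in counts.items() if c >= min_votes}

def precision_recall(predicted, gold):
    tp = len(predicted & gold)  # exact span matches only
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Five workers' annotations for one abstract (hypothetical character spans).
workers = [
    {(0, 12), (30, 44)},
    {(0, 12)},
    {(0, 12), (30, 44), (50, 60)},
    {(30, 44)},
    {(0, 12), (50, 60)},
]
gold = {(0, 12), (30, 44)}

# Raising min_votes trades recall for precision, and vice versa.
merged = merge_by_vote(workers, min_votes=3)
print(merged, precision_recall(merged, gold))
```

Varying `min_votes` from 1 to 5 is one simple way to produce the precision/recall trade-off curve mentioned above.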
This experiment demonstrated that microtask-based crowdsourcing can be applied to the disease mention recognition problem in the text of biomedical research articles. The F-measure of 0.815 indicates that there is room for improvement in the crowdsourcing protocol but that, overall, AMT workers are clearly capable of performing this annotation task.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs - Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... - Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
GridMate - End to end testing is a critical piece to ensure quality and avoid... - ThomasParaiso2
End-to-end testing is a critical piece of ensuring quality and avoiding regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
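PowSyBl's own APIs are not reproduced here; as a self-contained illustration of what one of its grid simulation tools (a power flow, in the DC approximation) computes, the following toy example solves a three-bus network by hand. The bus numbering, susceptances, and injections are all invented.

```python
# Toy DC power flow on a three-bus network, solved by hand.
# Lines: (from_bus, to_bus, susceptance in per-unit); bus 0 is the slack bus.
lines = [(0, 1, 10.0), (1, 2, 10.0), (0, 2, 10.0)]

# Net injections at the non-slack buses: bus 1 is a load, bus 2 a generator.
p1, p2 = -1.0, 0.5

# Reduced susceptance matrix for buses 1 and 2:
#   [ b01+b12   -b12    ] [theta1]   [p1]
#   [ -b12      b02+b12 ] [theta2] = [p2]
b11, b12m, b22 = 20.0, -10.0, 20.0

# Solve the 2x2 system for the voltage angles (slack angle is 0).
det = b11 * b22 - b12m * b12m
theta = {
    0: 0.0,
    1: (b22 * p1 - b12m * p2) / det,
    2: (b11 * p2 - b12m * p1) / det,
}

# Line flows follow from angle differences: f_ij = b_ij * (theta_i - theta_j).
flows = {(i, j): b * (theta[i] - theta[j]) for i, j, b in lines}
print(flows)
```

PowSyBl's load-flow engines solve the full AC problem with far richer network models, but the input/output shape (a network in, angles and flows out) is the same idea.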
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery (CI/CD) process includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has created gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerabilities and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He has around 20 years of solution-engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behavior in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined tools from two critical Linux projects -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
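The core idea, dropping bytes whose removal leaves the program's observed behavior unchanged, can be sketched in a few lines. The target "program" below is a stand-in function, not AFL or DIAR's actual algorithm, which operates on real instrumented binaries.

```python
# Toy seed trimming: drop every byte whose removal does not change the
# program's observed behavior, keeping the seed lean before fuzzing.

def program_behavior(data: bytes) -> str:
    # Stand-in target: only the 4-byte header and the presence of b'<' matter.
    if data[:4] != b"SEED":
        return "reject"
    return "parse" if b"<" in data else "skip"

def trim_seed(seed: bytes) -> bytes:
    baseline = program_behavior(seed)
    i = 0
    while i < len(seed):
        candidate = seed[:i] + seed[i + 1:]
        if program_behavior(candidate) == baseline:
            seed = candidate   # byte was uninteresting: drop it
        else:
            i += 1             # byte matters: keep it and move on
    return seed

seed = b"SEEDxxxx<tag>yyyy"
print(trim_seed(seed))   # only the behavior-relevant bytes survive
```

Here the filler bytes disappear while the header and the `<` marker survive, so mutations during fuzzing are concentrated on bytes that can actually change execution.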
- These are slides from the talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2022.
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT stylesheets and schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating, explaining, or refactoring code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! (SOFTTECHHUB)
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms, and is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Bio Logical Mass Collaboration
1. bioLogical mass collaboration
Benjamin Good
University of British Columbia
Symposium on (Bio)semantics for complex systems biology, Leiden University Medical Center
12 March 2009
9. More data captured
[Diagram: a tagging event links a Tagger (Jane), a tagged Resource (http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg), a date (2007-8-29), a tagging context, and the associated tags: hippocampus, mri, image, wikipedia]
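The tagging-event model pictured on this slide can be sketched as a small record type. This is a hypothetical Python illustration (the class name and fields are my own, not from the talk):

```python
from dataclasses import dataclass


@dataclass
class TaggingEvent:
    """One tagging event: who tagged what, when, and with which tags."""
    tagger: str    # the person applying the tags
    resource: str  # URL of the tagged resource
    date: str      # when the tagging occurred
    tags: list     # free-text tags associated with the resource


# The event from the slide, expressed in this model
event = TaggingEvent(
    tagger="Jane",
    resource="http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg",
    date="2007-8-29",
    tags=["hippocampus", "mri", "image", "wikipedia"],
)
```

The point of capturing the full event rather than bare tags is that the tagger, the date, and the context all become queryable data.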
10. Tags
• Not the same as either professionally or automatically generated keywords (Al-Khalifa & Davis 2007)
• Can be used to improve Web search (Morrison 2008)
11. Tagging in science?
• How does social tagging compare to professional indexing in the life sciences? (Good, Tennis, Wilkinson, in preparation)
12. “Tuned responses of astrocytes and their influence on hemodynamic signals in the visual cortex”
16. open social tagging in science
➡ low numbers of tags per post
➡ low numbers of posts per document
➡ low value of tags as descriptors
17. adding value to each tag
• social semantic tagging
➡ tagging with encoded concepts instead of strings of letters
➡ = the Entity Describer (E.D.)
Good, Kawas, Wilkinson (2007) Bridging the gap between social tagging and semantic annotation. Nature Precedings
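Semantic tagging boils down to resolving a free-text label to a stable concept identifier, so that two users' tags denote the same entity rather than two strings. A hypothetical Python sketch (the index and the example.org URI are invented for illustration and are not the E.D.'s actual API):

```python
def resolve(label, concept_index):
    """Map a free-text tag to encoded concept URIs via a label index."""
    return concept_index.get(label.strip().lower(), [])


# Toy label -> concept index, standing in for an ontology lookup service
concept_index = {
    "hippocampus": ["http://example.org/brain-ontology#Hippocampus"],
}

# Two differently-cased string tags resolve to the same encoded concept
resolved = [resolve(t, concept_index) for t in ["hippocampus", "Hippocampus"]]
```

Aggregating over concept URIs instead of raw strings is what lets a semantic tagging system treat "hippocampus" and "Hippocampus" as one annotation.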
24. E.D. can be customized
• Tag with: genes, gene ontology terms, terms from OWL ontologies
• Recently used to conduct a successful experiment in BioMoby Web service annotation
25. but!
• Does not address the volume problem - more participation is needed to make social tagging a useful source of bioLogical knowledge.
26. The plan for today
Mostly-manual strategies for creating bioLogical knowledge
• pull ➡ social tagging
• push ➡ frames and games
27. push
• Key difference from the pull model is that system designers push specific requests to users
• many incentive options: financial, psychological...
28. Pushy pattern
1. design frame for knowledge to be collected
2. choose incentive system
3. design interface
4. collect knowledge
5. aggregate knowledge
29. Mechanical Turk: pushing with money
• A “marketplace for work” hosted by Amazon Inc.
• “artificial artificial intelligence”
30. Mechanical Turk and NLP
• Snow et al (2008) used workers on the AMT to label text for use in training/testing NLP algorithms: word sense disambiguation, affect recognition, and several more.
Snow et al (2008) Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Empirical Methods in Natural Language Processing, pp. 254-263
31. Snow et al (2008) cont.
Results for affect recognition:
• labels = 7000
• cost = $2
• time = 5.9 hours
• when aggregated, results equal or better than expert labelers in most cases
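The aggregation the slide refers to is, in its simplest form, a per-item majority vote over the non-expert labels. A hypothetical sketch (the data is invented for illustration; Snow et al also explore weighted schemes):

```python
from collections import Counter


def aggregate(labels_per_item):
    """Majority vote: pick the most common label among each item's votes."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_per_item.items()}


# Toy affect-recognition votes from several non-expert annotators
votes = {
    "headline-1": ["joy", "joy", "surprise"],
    "headline-2": ["anger", "anger", "anger"],
}
consensus = aggregate(votes)
```

Several cheap, noisy labels combined this way can match one expert label, which is what made the $2 / 5.9-hour result possible.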
32. ESP game, pushing with fun
Von Ahn and Dabbish (2004) Labeling Images with a Computer Game
http://www.cs.cmu.edu/~biglou/ESP.pdf
33. ESP game results (2004)
• >4 million images labeled
• >23,000 players
• Given 5,000 players online simultaneously, could label all of the images accessible to Google in a month
• (See the “Google image labeling game”…)
34. iCAPTURer: assessing push for bioLogical knowledge
• Can we acquire bio-ontological knowledge from untrained volunteers in a scalable, Web-based manner?
• 2 experiments in the context of scientific conferences
Good et al. 2006. Fast, cheap, and out of control: a zero-curation model for ontology development.
Good and Wilkinson 2007. Ontology engineering using volunteer labor.
35. iCAPTURer 1
Goals:
1. Identify concepts from text
2. Link concepts to synonyms and to hyponyms (‘x is_a y’) rooted in the UMLS Semantic Network
44. Initial acquisition versus evaluation
[Chart: number of assertions gathered: ~11,000 during knowledge capture at the YI forum versus ~1,000 during the evaluation conducted via email request]
45. Initial acquisition versus evaluation
[Chart, annotated: ~11,000 assertions gathered during knowledge capture (“I assert that t cell activation is a kind of immune response”) versus ~1,000 during evaluation (“I agree that t cell activation is a kind of immune response”)]
• Knowledge capture at YI forum: forms, tree navigation, conference setting, 3 days, 68 people
• Evaluation via email request: multiple choice (voting), home setting, 2 days, 65 people
46. iCAPTURer 2 pattern
1. Infer complete ontology
2. Present each edge as a multiple choice question {true, false, I don’t know}
3. Aggregate votes to decide on each triple
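Step 3 of this pattern can be sketched as a vote threshold over the informative answers, with “I don’t know” responses excluded. A hypothetical Python sketch (the threshold and labels are my assumptions, not from the iCAPTURer papers):

```python
def decide(votes, threshold=0.5):
    """Accept a candidate triple when the fraction of 'true' votes among
    informative votes (i.e. excluding "I don't know") exceeds the threshold."""
    informative = [v for v in votes if v in ("true", "false")]
    if not informative:
        return "undecided"
    frac_true = informative.count("true") / len(informative)
    return "accept" if frac_true > threshold else "reject"


# e.g. votes on the edge "t cell activation is_a immune response"
verdict = decide(["true", "true", "false", "I don't know"])
```

Filtering out “I don’t know” before thresholding keeps honest abstentions from diluting the signal, though (as the next slide shows) it cannot correct for voters biased towards “yes”.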
48. iCAPTURer 2 results
[Chart: fraction of subClass judgments made per volunteer, volunteers 1-25, y-axis 0 to 1.2]
• Same pattern of participation
• Only 66% correct overall in assessing subClass assertions
• Highly biased towards saying ‘yes’
49. iCAPTURer summary
• Scientifically relevant tasks are harder - the population pool is smaller, but in my experience generally very willing.
• Engaging the competitive instinct was helpful in obtaining the responses we did.
• Much room for further investigation.
51. Filling in Freebase with Typewriter
[Diagram: “X is a Y”]
http://typewriter.freebaseapps.com/
March 9, 2009
53. To achieve mass collaborative bioLogical knowledge assembly, make it possible for people to contribute in multiple modes
- as creators
- as evaluators
- as system builders (open APIs are crucial)
and for multiple reasons
- personal information management
- fun, competition
- finance
[Diagram: relations R linking entities X and Y]
56. “...how you envision future developments...”
Automation + Human computation = increasingly high-throughput bioLogical knowledge representation
57. “...how your own expertise would fit into this realm...”
[Diagram: more bioLogical analyses require knowledge representation and machine learning; ben knows a bit about these and about community action]
58. Thanks to
• developers: Eddie Kawas, Paul Lu
• advisor: Mark Wilkinson
• Barend Mons for the invitation and Marco Roos for the accommodation!
http://biordf.net/~bgood/