2. Crowdsourcing Web semantics
• Crowdsourcing is increasingly used to augment the results of algorithms solving Semantic Web problems
• Great challenges:
  – Which form of crowdsourcing for which task?
  – How to design the crowdsourcing exercise effectively?
  – How to combine human- and machine-driven approaches?
4. Typology of crowdsourcing
• Macrotasks
• Microtasks
• Challenges
• Self-organized crowds
• Crowdfunding
Source: (Prpić, Shukla, Kietzmann and McCarthy 2015)
5. Dimensions of crowdsourcing
• What to crowdsource?
– Tasks you can’t complete in-house or using computers
– A question of time, budget, resources, ethics etc.
• Who is the crowd?
– Crowdsourcing ≠ ‘turkers’
– Open call, biased through use of platforms and promotion channels
– No traditional means to manage and incentivize
– Crowd has little to no context about the project
• How to crowdsource?
– Macro vs. microtasks
– Complex workflows
– Assessment and aggregation
– Aligning incentives
– Hybrid systems
– Learn to optimize from previous interactions
6. Hybrid systems (or ‘social machines’)
Tutorial@ISWC2013, 06-Mar-15
[Diagram: the physical world (people and devices) and the virtual world (a network of social interactions), connected via design and composition, participation and data supply, and a model of social interaction]
Source: Dave Robertson
8. Overview
• Compare two forms of crowdsourcing (challenges and paid microtasks) in the context of data quality assessment and repair
• Experiments on DBpedia
  – Identify potential errors, classify, and repair them
  – TripleCheckMate challenge vs. Mechanical Turk
9. What to crowdsource
• Incorrect object
  Example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth "3" .
• Incorrect data type or language tag
  Example: dbpedia:Torishima_Izu_Islands foaf:name "鳥島"@en .
• Incorrect link to external Web pages
  Example: dbpedia:John-Two-Hawks dbpedia-owl:wikiPageExternalLink <http://cedarlakedvd.com/> .
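Error classes like these can be pre-filtered with simple heuristics before handing triples to the crowd. A minimal sketch, assuming hand-written checks (these rules and thresholds are illustrative, not the authors' pipeline):

```python
# Hypothetical heuristic pre-filters for two of the error classes above.
# These checks are assumptions for illustration, not the DBpedia
# extraction framework's actual validation logic.

import re

def looks_like_bad_date(literal: str) -> bool:
    """Flag date values that cannot be a real date, e.g. the literal "3"."""
    return not re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", literal)

def looks_like_bad_lang_tag(literal: str, tag: str) -> bool:
    """Flag @en-tagged literals containing CJK characters, e.g. "鳥島"@en."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff"
                  for ch in literal)
    return tag == "en" and has_cjk

# Triples flagged by such filters would be candidates for the
# crowdsourced Find/Verify stages described below.
print(looks_like_bad_date("3"))              # the Dave_Dobbyn example
print(looks_like_bad_lang_tag("鳥島", "en"))  # the Torishima example
```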
10. Who is the crowd
• Find stage – challenge (TripleCheckMate [Kontoskostas2013]): LD experts, difficult task, final prize
• Verify stage – microtasks (Mechanical Turk, http://mturk.com): workers, easy task, micropayments
Adapted from [Bernstein2010]
12. How to crowdsource: microtask design
• Selection of foaf:name or rdfs:label to extract human-readable descriptions
• Values extracted automatically from Wikipedia infoboxes
• Link to the Wikipedia article via foaf:isPrimaryTopicOf
• Preview of external pages by embedding an HTML iframe
[Screenshots: task interfaces for incorrect object, incorrect data type or language tag, and incorrect outlink]
13. Experiments
• Crowdsourcing approaches
  – Find stage: challenge with LD experts
  – Verify stage: microtasks (5 assignments per triple)
• Gold standard
  – Two of the authors independently generated a gold standard for all contest triples
  – Conflicts resolved via mutual agreement
• Validation metric: precision
14. Overall results
                                  LD experts             Microtask workers
Number of distinct participants   50                     80
Total time                        3 weeks (predefined)   4 days
Total triples evaluated           1,512                  1,073
Total cost                        ~US$ 400 (predefined)  ~US$ 43
15. Precision results: Incorrect objects
• Turkers can reduce the error rates of LD experts from the Find stage
• 117 DBpedia triples had predicates related to dates with incorrect/incomplete values, e.g.:
  "2005 Six Nations Championship"  Date  12 .
• 52 DBpedia triples had erroneous values originating from the source, e.g.:
  "English (programming language)"  Influenced by  ? .
• Experts classified all these triples as incorrect
• Workers compared the values against Wikipedia and correctly classified these triples as "correct"

Triples compared   LD Experts   MTurk (majority voting, n=5)
509                0.7151       0.8977
16. Precision results: Incorrect data types
[Bar chart: number of triples per data type (Date, English, Millimetre, Nanometre, Number, Number with decimals, Second, Volt, Year, Not specified/URI), split into expert true/false positives and crowd true/false positives]

Triples compared   LD Experts   MTurk (majority voting, n=5)
341                0.8270       0.4752
17. Precision results: Incorrect links
• We analyzed the 189 misclassifications by the experts
  [Pie chart: Freebase links 50%, Wikipedia images 39%, external links 11%]
• The 6% of misclassifications by the workers correspond to pages in a language other than English

Triples compared   Baseline   LD Experts   MTurk (majority voting, n=5)
223                0.2598     0.1525       0.9412
18. Conclusions
• The effort of LD experts should be invested in tasks that demand domain-specific skills
  – Experts seem less motivated to perform tasks that come with additional overhead (e.g., checking an external link, going back to Wikipedia, repairing data)
• The MTurk crowd was exceptionally good at object and link checks
  – But seems not to understand the intricacies of data types
19. Future work
• Additional experiments with data types
• Workflows for new types of errors, e.g., tasks which experts cannot answer on the fly
• Training the crowd
• Building the DBpedia ‘social machine’
  – Add automatic QA tools
  – Close the feedback loop (from crowd to experts)
  – Change the parameters of the ‘Find’ step to achieve seamless integration
  – Change the crowdsourcing model of ‘Find’ to improve retention
20. CROWDSOURCING IMAGE ANNOTATION
Improving paid microtasks through gamification and adaptive furtherance incentives
O. Feyisetan, E. Simperl, M. Van Kleek, N. Shadbolt
WWW 2015, to appear
21. Overview
• Make paid microtasks more cost-effective through gamification*
  *use of game elements and mechanics in a non-game context
23. Who is the crowd: paid microtask platforms
[Kaufmann, Schulze, Viet, 2011]
24. Research hypotheses
• Contributors will perform better if tasks are more engaging
• Increased accuracy through higher inter-annotator agreement
• Cost savings through reduced unit costs
• Micro-targeting incentives when players attempt to quit improves retention
25. How to crowdsource: microtask design
• Image labeling tasks published on a microtask platform
  – Free-text labels, varying numbers of labels per image, taboo words
  – 1st setting: ‘standard’ tasks, including basic spam control
  – 2nd setting: the same requirements and rewards, but contributors were asked to carry out the task in Wordsmith
26. How to crowdsource: gamification
• Levels – 9 levels, from newbie to Wordsmith, as a function of images tagged
• Badges – function of the number of images tagged
• Bonus points – for new tags
• Treasure points – for multiples of bonus points
• Feedback alerts – related to badges, points, levels
• Leaderboard – hourly scores and level of the top 5 players
• Activities widget – real-time updates on other players
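These mechanics can be sketched as simple functions of the tag count. The concrete thresholds and intermediate level names below are invented; the deck only states that levels and badges are a function of images tagged:

```python
# Hypothetical sketch of the level and bonus-point mechanics.
# Thresholds and intermediate level names are assumptions.

import bisect

LEVEL_THRESHOLDS = [0, 1, 5, 10, 25, 50, 100, 150, 200]  # 9 levels, assumed
LEVEL_NAMES = ["Newbie", "Apprentice", "Novice", "Adept", "Skilled",
               "Seasoned", "Expert", "Master", "Wordsmith"]

def level_for(images_tagged: int) -> str:
    """Map a player's tag count onto one of the 9 levels."""
    idx = bisect.bisect_right(LEVEL_THRESHOLDS, images_tagged) - 1
    return LEVEL_NAMES[idx]

def bonus_points(tags: list, known_tags: set) -> int:
    """One bonus point per previously unseen tag (assumed rate)."""
    return sum(1 for t in tags if t not in known_tags)

print(level_for(0), level_for(12), level_for(200))
```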
27. How to crowdsource: incentives
• Money – 5 cents extra for the effort
• Power – see how other players tagged the images
• Leaderboard – advance to the ‘Global’ leaderboard seen by everyone
• Badges – receive the ‘Ultimate’ badge and a shiny new avatar
• Levels – advance straight to the next level
• Access – quicker access to treasure points

Incentives assigned at random or targeted based on Bayes inference (number of tagged images as feature)
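The targeting rule can be sketched with Bayes' rule over the stated feature: estimate how likely a player with a given tag count is to quit, from counts observed in earlier sessions. The bucketing and the training counts below are invented for illustration:

```python
# Sketch of Bayes-rule quit-risk estimation from historical sessions.
# Feature bucketing and counts are assumptions, not the paper's data.

def quit_probability(bucket, history):
    """P(quit | bucket) = P(bucket | quit) * P(quit) / P(bucket)."""
    n = len(history)
    quits = [h for h in history if h["quit"]]
    p_quit = len(quits) / n
    p_bucket = sum(1 for h in history if h["bucket"] == bucket) / n
    p_bucket_given_quit = (sum(1 for h in quits if h["bucket"] == bucket)
                           / len(quits))
    return p_bucket_given_quit * p_quit / p_bucket if p_bucket else 0.0

history = ([{"bucket": "low", "quit": True}] * 6 +
           [{"bucket": "low", "quit": False}] * 2 +
           [{"bucket": "high", "quit": True}] * 1 +
           [{"bucket": "high", "quit": False}] * 7)

# A furtherance incentive would be offered when quit risk is high.
print(quit_probability("low", history))
print(quit_probability("high", history))
```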
29. Experiments (2)
• 1st experiment: ‘newcomers’
  – Tag one image with two keywords ~ complete Wordsmith level 1
  – 200 images, US$ 0.02 per task, 3 judgements per task
• 2nd experiment: ‘novice’
  – Tag 11 images ~ complete Wordsmith level 2
  – 2,200 images, US$ 0.10 per task, 600 workers
  – Furtherance incentives: none, random, or targeted (Bayes inference based on number of images tagged)
30. Results: 1st experiment
Metric               CrowdFlower   Wordsmith
Total workers        600           423
Total keywords       1,200         41,206
Unique keywords      111           5,708
Avg. agreement       5.72%         37.7%
Mean images/person   1             32
Max images/person    1             200
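One plausible reading of the "Avg. agreement" metric is the mean pairwise overlap of the keyword sets that different workers gave the same image; the exact definition is an assumption here, and the labels below are made up:

```python
# Hedged sketch: average pairwise Jaccard overlap of worker keyword
# sets for one image. The paper may define agreement differently.

from itertools import combinations

def pairwise_agreement(keyword_sets):
    """Mean Jaccard similarity over all pairs of workers' keyword sets."""
    pairs = list(combinations(keyword_sets, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

labels = [{"dog", "grass"}, {"dog", "park"}, {"dog"}]
print(round(pairwise_agreement(labels), 3))
```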
31. Results: 2nd experiment
Metric               CrowdFlower   Wordsmith (no furtherance incentives)
Total workers        600           514
Total keywords       13,200        35,890
Unique keywords      1,323         4,091
Avg. agreement       6.32%         10.9%
Mean images/person   11            27
Max images/person    1             351
32. Results: 2nd experiment, incentives
• Incentives led to increased participation
  – Power: mean images/person = 132
  – People came back (20 times!) and played longer (43 hours, vs. 3 hours without incentives)
33. Conclusions and future work
• Task design matters as much as payment
• Gamification achieves high accuracy at lower cost and with improved engagement
• Workers appreciate social features, but their main motivation is still task-driven
  – Feedback mechanisms, peer learning, training