Crowdsourcing has long established itself as a viable alternative to corpus annotation by domain experts for tasks such as document relevance assessment. The crowdsourcing process traditionally relies on high degrees of label redundancy in order to mitigate the detrimental effects of individually noisy worker submissions. Such redundancy comes at the cost of increased label volume, and, subsequently, monetary requirements. In practice, especially as the size of datasets increases, this is undesirable.
In this paper, we focus on an alternative method that instead exploits document information to infer relevance labels for unjudged documents. We present an active learning scheme for document selection that aims at maximising the overall relevance label prediction accuracy for a given budget of available relevance judgements, by exploiting system-wide estimates of label variance and mutual information. Our experiments are based on TREC 2011 Crowdsourcing Track data and show that our method is able to achieve state-of-the-art performance while requiring 17–25% less budget.
This paper has been accepted for presentation at the 25th ACM International Conference on Information and Knowledge Management (CIKM).
2. Outline
● Past work
○ Exploiting document content for vote aggregation
● Ongoing extensions
○ Crowdsourcing under extreme budget constraints
○ Information theoretic approaches
○ Experiments and results
○ Conclusion
3. State of the Art
● Crowdsourced relevance assessment cheap and effective
● Quality control via redundancy yields strong performance
● Untapped source of information: document content
● Key idea: Locality of relevance
Davtyan et al. 2015: Exploiting Document Content for Efficient Aggregation of Crowdsourcing Votes
5. Methods
● (informal) Problem statement: Given a set of relevance assessments,
how accurately can we infer the relevance of unjudged Web pages?
○ Solution ideas:
■ Assign the same relevance label as the nearest judged neighbor.
■ Borrow relevance assessments from the n nearest neighbors and
then assign the majority label.
■ Smooth expected relevance across similarity space (KDE, GPs)
○ Baseline:
■ Majority Voting for label aggregation, and coin toss for unjudged
Web pages.
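The nearest-neighbor ideas above can be sketched as follows; the 2-D feature vectors, the Euclidean metric, and all names are illustrative assumptions, not the document representation used in the paper:

```python
import numpy as np

def knn_majority_label(X_judged, y_judged, X_unjudged, k=3):
    """Assign each unjudged document the majority relevance label
    of its k nearest judged neighbors (Euclidean distance)."""
    preds = []
    for x in X_unjudged:
        dists = np.linalg.norm(X_judged - x, axis=1)
        nearest = np.argsort(dists)[:k]          # indices of the k closest judged docs
        preds.append(np.bincount(y_judged[nearest]).argmax())  # majority vote
    return np.array(preds)

# Toy example: two judged clusters in a 2-D feature space
X_judged = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
y_judged = np.array([0, 0, 1, 1])    # 0 = non-relevant, 1 = relevant
X_unjudged = np.array([[0.05, 0.05], [0.95, 1.0]])
preds = knn_majority_label(X_judged, y_judged, X_unjudged, k=2)
```

Setting k = 1 recovers the first solution idea (copy the single nearest neighbor's label).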
7. Motivation for our work
Consider the task of search relevance assessment
● Extremely budget-constrained scenario
● Can only ask humans to rate a few Web pages per query
● In the previous figure: fewer than one vote per document
12. A Generic Model of Crowdsourcing
Kazai et al. 2011: Worker types and personality traits in crowdsourcing relevance labels
Davtyan et al. 2015: Exploiting Document Content for Efficient Aggregation of Crowdsourcing Votes
14. Preliminaries
● RequestVote
○ Sample a random vote from the crowd
● AggregateVotes
○ Gaussian processes (GPs) for inferring relevance labels of
unjudged documents.
○ Described by a mean function (here: constant),
○ and a covariance function (here: linear covariance).
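A minimal sketch of such a GP with a linear covariance function, using scikit-learn's DotProduct kernel; the paper's own implementation, features, and hyperparameters are not shown here, so everything below is illustrative (normalize_y centres the targets and stands in for the constant mean function):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

# Judged documents: feature vectors and aggregated relevance labels in [0, 1]
X_judged = np.array([[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]])
y_judged = np.array([0.0, 0.0, 1.0, 1.0])

# Linear covariance (DotProduct) plus observation noise (WhiteKernel)
gp = GaussianProcessRegressor(
    kernel=DotProduct() + WhiteKernel(noise_level=1e-2),
    normalize_y=True,
)
gp.fit(X_judged, y_judged)

# Posterior mean and std deviation for unjudged documents
X_unjudged = np.array([[0.1, 0.05], [0.95, 0.9]])
mean, std = gp.predict(X_unjudged, return_std=True)
pred = (mean > 0.5).astype(int)   # threshold the posterior mean into labels
```

The posterior std returned here is what the sampling criteria on the following slides operate on.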
15. PickDocument
● Which subset of documents should we select for labeling?
○ A typical active learning problem
○ Focus on optimal data acquisition
○ Baseline: random sampling
● Select the points the classifier is most uncertain about
○ uncertainty-based sampling
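Uncertainty-based sampling can be sketched as picking the pool document with the largest GP posterior standard deviation; the random pool and kernel choice below are illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

rng = np.random.default_rng(0)
X_pool = rng.random((50, 2))       # unjudged candidate documents
X_judged = X_pool[:3]              # pretend three are already judged
y_judged = np.array([0.0, 1.0, 0.0])

gp = GaussianProcessRegressor(kernel=DotProduct() + WhiteKernel(noise_level=1e-2))
gp.fit(X_judged, y_judged)

# Ask the crowd about the document with the largest posterior std deviation
_, std = gp.predict(X_pool, return_std=True)
next_doc = int(np.argmax(std))
```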
16. Solution
● Variance-based sampling:
○ Variance as a proxy for “uncertainty”: for a Gaussian, entropy is
a monotone function of variance.
○ Variance-based sampling thus approximates maximum-entropy sampling.
○ In Gaussian processes, the posterior variance does not depend on
the actual observed values of the random variables, only on where
they are observed.
17. Solution
● Selecting the subset of points that maximises variance is NP-complete [2]
● However, this criterion is submodular
○ Submodularity (informally): the marginal gain a single element
contributes when added to a set decreases as the set grows
(diminishing returns).
○ Due to Nemhauser et al. (1978) [3], a greedy algorithm achieves a
(1 − 1/e)-approximation of the optimum.
[2] Krause et al. 2008: Near-optimal sensor placements in Gaussian processes
[3] Nemhauser et al. 1978: An analysis of approximations for maximizing submodular set functions
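The greedy strategy behind the (1 − 1/e) guarantee can be sketched directly on the GP posterior variance; note that the observed labels never appear, only the kernel matrix. The linear kernel, noise level, and all function names are illustrative:

```python
import numpy as np

def posterior_var(i, selected, K, noise=1e-2):
    """GP posterior variance of point i given the selected index set.
    Depends only on the kernel matrix K, never on observed labels."""
    if not selected:
        return K[i, i]
    S = list(selected)
    K_ss = K[np.ix_(S, S)] + noise * np.eye(len(S))
    k_is = K[i, S]
    return K[i, i] - k_is @ np.linalg.solve(K_ss, k_is)

def greedy_variance_selection(X, budget, noise=1e-2):
    """Greedy max-variance selection; submodularity yields the
    (1 - 1/e) approximation guarantee of Nemhauser et al. (1978)."""
    K = X @ X.T                                  # linear covariance
    selected = []
    for _ in range(budget):
        cands = [i for i in range(len(X)) if i not in selected]
        selected.append(max(cands, key=lambda i: posterior_var(i, selected, K, noise)))
    return selected

X = np.random.default_rng(1).random((20, 3))
picks = greedy_variance_selection(X, budget=5)
```

Because the variance computation needs no labels, the whole selection can be planned before a single crowd vote is requested.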
19. Mutual Information based sampling
● Variance-based sampling is only concerned with reducing
uncertainty at the sampled points.
● We care about system-wide uncertainty.
● Maximise the mutual information between the selected documents
and the rest of the space.
● Equivalent to maximally reducing the entropy of the rest of the
space (D ∖ A) given the selected documents.
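A toy sketch of the greedy mutual-information rule of Krause et al. (2008), which scores a candidate by its variance given the selected set A relative to its variance given the unselected rest; the kernel and parameters below are illustrative assumptions:

```python
import numpy as np

def cond_var(i, S, K, noise=1e-2):
    """GP posterior variance of point i given the index set S."""
    if not S:
        return K[i, i]
    S = list(S)
    K_ss = K[np.ix_(S, S)] + noise * np.eye(len(S))
    k_is = K[i, S]
    return K[i, i] - k_is @ np.linalg.solve(K_ss, k_is)

def greedy_mi_selection(X, budget, noise=1e-2):
    """Greedily pick documents that are uncertain given the selected
    set A but still informative about the unselected rest:
    maximise var(i | A) / var(i | rest)."""
    K = X @ X.T                                  # linear covariance
    A = []
    for _ in range(budget):
        cands = [i for i in range(len(X)) if i not in A]
        def score(i):
            rest = [j for j in cands if j != i]
            return cond_var(i, A, K, noise) / cond_var(i, rest, K, noise)
        A.append(max(cands, key=score))
    return A

X = np.random.default_rng(2).random((15, 3))
picks = greedy_mi_selection(X, budget=4)
```

The denominator is what distinguishes this from pure variance-based sampling: a point that the rest of the space already explains well scores low, pushing selection toward documents that are informative system-wide.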
24. Conclusions
● Active Learning for Crowdsourcing Vote Sampling
● Two information-theoretic criteria
○ Variance
○ Mutual information
● Saves up to 25% budget at constant quality
● Can be computed efficiently (greedy)
● Does not depend on sampled observations
● In the future: application to other modalities (images, videos)