Crowdsourcing has long established itself as a viable alternative to corpus annotation by domain experts for tasks such as document relevance assessment. The crowdsourcing process traditionally relies on high degrees of label redundancy in order to mitigate the detrimental effects of individually noisy worker submissions. Such redundancy comes at the cost of increased label volume, and, subsequently, monetary requirements. In practice, especially as the size of datasets increases, this is undesirable.
In this paper, we focus on an alternative method that instead exploits document information to infer relevance labels for unjudged documents. We present an active learning scheme for document selection that aims at maximising the overall relevance label prediction accuracy for a given budget of available relevance judgements, by exploiting system-wide estimates of label variance and mutual information. Our experiments are based on TREC 2011 Crowdsourcing Track data and show that our method is able to achieve state-of-the-art performance while requiring 17–25% less budget.
This paper has been accepted for presentation at the 25th ACM International Conference on Information and Knowledge Management (CIKM).
2. Outline
● Past work
○ Exploiting document content for vote aggregation
● Ongoing extensions
○ Crowdsourcing under extreme budget constraints
○ Information theoretic approaches
○ Experiments and results
○ Conclusion
3. State of the Art
● Crowdsourced relevance assessment cheap and effective
● Quality control via redundancy yields strong performance
● Untapped source of information: document content
● Key idea: Locality of relevance
Davtyan et al. 2015: Exploiting Document Content for Efficient Aggregation of Crowdsourcing Votes
5. Methods
● (informal) Problem statement: Given a set of relevance assessments,
how accurately can we infer the relevance of unjudged Web pages?
○ Solution ideas:
■ Assign the same relevance label as the nearest judged neighbor.
■ Borrow relevance assessments from the n nearest neighbors and
then assign the majority label.
■ Smooth expected relevance across similarity space (KDE, GPs)
○ Baseline:
■ Majority Voting for label aggregation, and coin toss for unjudged
Web pages.
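The nearest-neighbor ideas above can be sketched as follows; the 2-D feature vectors, the Euclidean metric, and all names are illustrative assumptions, not the document representation used in the paper:

```python
import numpy as np

def knn_majority_label(X_judged, y_judged, X_unjudged, k=3):
    """Assign each unjudged document the majority relevance label
    of its k nearest judged neighbors (Euclidean distance)."""
    preds = []
    for x in X_unjudged:
        dists = np.linalg.norm(X_judged - x, axis=1)
        nearest = np.argsort(dists)[:k]          # indices of the k closest judged docs
        preds.append(np.bincount(y_judged[nearest]).argmax())  # majority vote
    return np.array(preds)

# Toy example: two judged clusters in a 2-D feature space
X_judged = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
y_judged = np.array([0, 0, 1, 1])    # 0 = non-relevant, 1 = relevant
X_unjudged = np.array([[0.05, 0.05], [0.95, 1.0]])
preds = knn_majority_label(X_judged, y_judged, X_unjudged, k=2)
```

Setting k = 1 recovers the first solution idea (copy the single nearest neighbor's label).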
7. Motivation for our work
Consider the task of search relevance assessment
● Extremely budget-constrained scenario
● Can only ask humans to rate a few Web pages per query
● In the previous figure: fewer than one vote per document
12. A Generic Model of Crowdsourcing
Kazai et al. 2011: Worker types and personality traits in crowdsourcing relevance labels
Davtyan et al. 2015: Exploiting Document Content for Efficient Aggregation of Crowdsourcing Votes
14. Preliminaries
● RequestVote
○ Sample a random vote from the crowd
● AggregateVotes
○ Gaussian processes (GPs) for inferring relevance labels of
unjudged documents.
○ Described by a mean function (here: constant),
○ and a covariance function (here: linear covariance).
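A minimal sketch of such a GP with a linear covariance function, using scikit-learn's DotProduct kernel; the paper's own implementation, features, and hyperparameters are not shown here, so everything below is illustrative (normalize_y centres the targets and stands in for the constant mean function):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

# Judged documents: feature vectors and aggregated relevance labels in [0, 1]
X_judged = np.array([[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]])
y_judged = np.array([0.0, 0.0, 1.0, 1.0])

# Linear covariance (DotProduct) plus observation noise (WhiteKernel)
gp = GaussianProcessRegressor(
    kernel=DotProduct() + WhiteKernel(noise_level=1e-2),
    normalize_y=True,
)
gp.fit(X_judged, y_judged)

# Posterior mean and std deviation for unjudged documents
X_unjudged = np.array([[0.1, 0.05], [0.95, 0.9]])
mean, std = gp.predict(X_unjudged, return_std=True)
pred = (mean > 0.5).astype(int)   # threshold the posterior mean into labels
```

The posterior std returned here is what the sampling criteria on the following slides operate on.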
15. PickDocument
● Which subset of documents should we select for labeling?
○ A typical active learning problem
○ Focus on optimal data acquisition
○ Baseline: random sampling
● Select the points the classifier is most uncertain about
○ uncertainty-based sampling
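Uncertainty-based sampling can be sketched as picking the pool document with the largest GP posterior standard deviation; the random pool and kernel choice below are illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

rng = np.random.default_rng(0)
X_pool = rng.random((50, 2))       # unjudged candidate documents
X_judged = X_pool[:3]              # pretend three are already judged
y_judged = np.array([0.0, 1.0, 0.0])

gp = GaussianProcessRegressor(kernel=DotProduct() + WhiteKernel(noise_level=1e-2))
gp.fit(X_judged, y_judged)

# Ask the crowd about the document with the largest posterior std deviation
_, std = gp.predict(X_pool, return_std=True)
next_doc = int(np.argmax(std))
```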
16. Solution
● Variance-based sampling:
○ Variance as a proxy for “uncertainty”: for a Gaussian, entropy is
a monotone function of variance.
○ Variance-based sampling thus approximates maximum-entropy sampling.
○ In Gaussian processes, the posterior variance does not depend on
the actual observed values of the random variables, only on where
they are observed.
17. Solution
● Selecting the subset of points that maximises variance is NP-complete [2]
● However, this criterion is submodular
○ Submodularity (informally): the marginal gain a single element
contributes when added to a set decreases as the set grows
(diminishing returns).
○ Due to Nemhauser et al. (1978) [3], a greedy algorithm achieves a
(1 − 1/e)-approximation of the optimum.
[2] Krause et al. 2008: Near-optimal sensor placements in Gaussian processes
[3] Nemhauser et al. 1978: An analysis of approximations for maximizing submodular set functions
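The greedy strategy behind the (1 − 1/e) guarantee can be sketched directly on the GP posterior variance; note that the observed labels never appear, only the kernel matrix. The linear kernel, noise level, and all function names are illustrative:

```python
import numpy as np

def posterior_var(i, selected, K, noise=1e-2):
    """GP posterior variance of point i given the selected index set.
    Depends only on the kernel matrix K, never on observed labels."""
    if not selected:
        return K[i, i]
    S = list(selected)
    K_ss = K[np.ix_(S, S)] + noise * np.eye(len(S))
    k_is = K[i, S]
    return K[i, i] - k_is @ np.linalg.solve(K_ss, k_is)

def greedy_variance_selection(X, budget, noise=1e-2):
    """Greedy max-variance selection; submodularity yields the
    (1 - 1/e) approximation guarantee of Nemhauser et al. (1978)."""
    K = X @ X.T                                  # linear covariance
    selected = []
    for _ in range(budget):
        cands = [i for i in range(len(X)) if i not in selected]
        selected.append(max(cands, key=lambda i: posterior_var(i, selected, K, noise)))
    return selected

X = np.random.default_rng(1).random((20, 3))
picks = greedy_variance_selection(X, budget=5)
```

Because the variance computation needs no labels, the whole selection can be planned before a single crowd vote is requested.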
19. Mutual Information based sampling
● Variance-based sampling is only concerned with reducing
uncertainty at the sampled points.
● We care about system-wide uncertainty.
● Maximise the mutual information between the selected documents
and the rest of the space.
● Equivalent to maximally reducing the entropy of the rest of the
space (D ∖ A) given the selected documents.
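A toy sketch of the greedy mutual-information rule of Krause et al. (2008), which scores a candidate by its variance given the selected set A relative to its variance given the unselected rest; the kernel and parameters below are illustrative assumptions:

```python
import numpy as np

def cond_var(i, S, K, noise=1e-2):
    """GP posterior variance of point i given the index set S."""
    if not S:
        return K[i, i]
    S = list(S)
    K_ss = K[np.ix_(S, S)] + noise * np.eye(len(S))
    k_is = K[i, S]
    return K[i, i] - k_is @ np.linalg.solve(K_ss, k_is)

def greedy_mi_selection(X, budget, noise=1e-2):
    """Greedily pick documents that are uncertain given the selected
    set A but still informative about the unselected rest:
    maximise var(i | A) / var(i | rest)."""
    K = X @ X.T                                  # linear covariance
    A = []
    for _ in range(budget):
        cands = [i for i in range(len(X)) if i not in A]
        def score(i):
            rest = [j for j in cands if j != i]
            return cond_var(i, A, K, noise) / cond_var(i, rest, K, noise)
        A.append(max(cands, key=score))
    return A

X = np.random.default_rng(2).random((15, 3))
picks = greedy_mi_selection(X, budget=4)
```

The denominator is what distinguishes this from pure variance-based sampling: a point that the rest of the space already explains well scores low, pushing selection toward documents that are informative system-wide.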
24. Conclusions
● Active Learning for Crowdsourcing Vote Sampling
● Two information-theoretic criteria
○ Variance
○ Mutual information
● Saves up to 25% budget at constant quality
● Can be computed efficiently (greedy)
● Does not depend on sampled observations
● In the future: application to other modalities (images, videos)