SlideShare a Scribd company logo
Feeling Lucky? Multi-armed bandits for Ordering Judgements
in Pooling-based Evaluation
David E. Losada
Javier Parapar, Álvaro Barreiro
ACM SAC, 2016
Evaluation
is crucial
compare retrieval algorithms, design
new search solutions, ...
information retrieval evaluation:
3 main ingredients
docs
information retrieval evaluation:
3 main ingredients
queries
information retrieval evaluation:
3 main ingredients
relevance
judgements
relevance assessments are incomplete
relevance assessments are incomplete
...
search system 1 search system 2 search system 3 search system n
relevance assessments are incomplete
...
search system 1 search system 2 search system 3 search system n
relevance assessments are incomplete
1. WSJ13
2. WSJ17
.
.
100. AP567
101. AP555
.
.
.
1. FT941
2. WSJ13
.
.
100. WSJ19
101. AP555
.
.
.
1. ZF207
2. AP881
.
.
100. FT967
101. AP555
.
.
.
1. WSJ13
2. CR93E
.
.
100. AP111
101. AP555
.
.
.
...
rankings of docs by estimated relevance (runs)
relevance assessments are incomplete
1. WSJ13
2. WSJ17
.
.
100. AP567
101. AP555
.
.
.
1. FT941
2. WSJ13
.
.
100. WSJ19
101. AP555
.
.
.
1. ZF207
2. AP881
.
.
100. FT967
101. AP555
.
.
.
1. WSJ13
2. CR93E
.
.
100. AP111
101. AP555
.
.
.
...
pool
depth
rankings of docs by estimated relevance (runs)
relevance assessments are incomplete
101. AP555
.
.
.
101. AP555
.
.
.
101. AP555
.
.
.
101. AP555
.
.
.
...
pool
depth
rankings of docs by estimated relevance (runs)
relevance assessments are incomplete
101. AP555
.
.
.
101. AP555
.
.
.
101. AP555
.
.
.
101. AP555
.
.
.
...
pool
depth
rankings of docs by estimated relevance (runs)
WSJ13
WSJ17 AP567
WSJ19AP111 CR93E
ZF207AP881FT967
pool
...
relevance assessments are incomplete
101. AP555
.
.
.
101. AP555
.
.
.
101. AP555
.
.
.
101. AP555
.
.
.
...
pool
depth
rankings of docs by estimated relevance (runs)
WSJ13
WSJ17 AP567
WSJ19AP111 CR93E
ZF207AP881FT967
pool
...
human assessments
finding relevant docs is the key
Most productive use of assessors' time
is spent on judging relevant docs
(Sanderson & Zobel, 2005)
Effective adjudication methods
Give priority to pooled docs that are
potentially relevant
Can signifcantly reduce the num. of
judgements required to identify a given
num. of relevant docs
But most existing methods are adhoc...
Our main idea...
Cast doc adjudication as a
reinforcement learning problem
Doc judging is an iterative process where
we learn as judgements come in
Doc adjudication as a reinforcement learning problem
Initially we know nothing about the quality of the runs
? ? ? ?...
As judgements
come in...
And we can adapt and allocate more docs for judgement from
the most promising runs
Multi-armed bandits
...
unknown probabilities of giving a prize
Multi-armed bandits
...
unknown probabilities of giving a prize
play and observe the reward
Multi-armed bandits
...
unknown probabilities of giving a prize
Multi-armed bandits
...
unknown probabilities of giving a prize
play and observe the reward
Multi-armed bandits
...
unknown probabilities of giving a prize
Multi-armed bandits
...
unknown probabilities of giving a prize
play and observe the reward
Multi-armed bandits
...
unknown probabilities of giving a prize
exploration vs exploitation
exploits current knowledge
spends no time sampling inferior actions
maximizes expected reward
on the next action
explores uncertain actions
gets more info about expected payofs
may produce greater total reward
in the long run
allocation methods: choose next action (play) based on past plays and obtained rewards
implement diferent ways to trade between exploration and exploitation
Multi-armed bandits for ordering judgements
...
machines = runs
...
play a machine = select a run and get the next (unjudged) doc
1. WSJ13
2. CR93E
.
.
(binary) reward = relevance/non-relevance of the selected doc
Allocation methods tested
...
random ϵn
-greedy
with prob 1-ϵ plays the machine
with the highest avg reward
with prob ϵ plays a
random machine
prob of exploration (ϵ) decreases
with the num. of plays
Upper Confdence Bound
(UCB)
computes upper confdence
bounds for avg rewards
conf. intervals get narrower
with the number of plays
selects the machine with the
highest optimistic estimate
Allocation methods tested: Bayesian bandits
prior probabilities of giving a relevant doc: Uniform(0,1) ( or, equivalently, Beta(α,β), α,β=1 )
U(0,1) U(0,1) U(0,1) U(0,1)
...
evidence (O ∈ {0,1}) is Bernoulli (or, equivalently, Binomial(1,p) )
posterior probabilities of giving a relevant doc: Beta(α+O, β+1-O) (Beta: conjugate prior
for Binomial)
Allocation methods tested: Bayesian bandits
...
we iteratively update our estimations using Bayes:
Allocation methods tested: Bayesian bandits
...
we iteratively update our estimations using Bayes:
Allocation methods tested: Bayesian bandits
...
we iteratively update our estimations using Bayes:
Allocation methods tested: Bayesian bandits
...
we iteratively update our estimations using Bayes:
Allocation methods tested: Bayesian bandits
...
we iteratively update our estimations using Bayes:
Allocation methods tested: Bayesian bandits
...
we iteratively update our estimations using Bayes:
Allocation methods tested: Bayesian bandits
...
we iteratively update our estimations using Bayes:
Allocation methods tested: Bayesian bandits
...
we iteratively update our estimations using Bayes:
two strategies to select the next machine:
Bayesian Learning Automaton (BLA): draws a sample from each the posterior distribution
and selects the machine yieding the highest sample
MaxMean (MM): selects the machine with the highest expectation of the posterior distribution
test different document adjudication strategies in
terms of how quickly they find the relevant
docs in the pool
experiments
# rel docs found at diff. number of
judgements performed
experiments: data
experiments: baselines
...WSJ13
WSJ17 AP567
WSJ19AP111 CR93E
ZF207AP881FT967
pool
...
AP111, AP881, AP567, CR93E, FT967, WSJ13, ...
DocId: sorts by Doc Id
experiments: baselines
1. WSJ13
2. WSJ17
.
.
100. AP567
...
1. FT941
2. WSJ13
.
.
100. WSJ19
1. WSJ13
2. CR93E
.
.
100. AP111
WSJ13, FT941, ZF207, WSJ17, CR93E, AP881 ...
Rank: rank #1 docs go 1st, then rank #2 docs, ...
1. ZF207
2. AP881
.
.
100. FT967
experiments: baselines
1. WSJ13
2. WSJ17
3. AP567
.
.
...
1. FT941
2. WSJ13
3. WSJ19
.
.
1. WSJ13
2. CR93E
3. AP111
.
.
MoveToFront (MTF) (Cormack et al 98)
starts with uniform priorities for all runs (e.g. max priority=100)
selects a random run (from those with max priority)
1. ZF207
2. AP881
3. FT967
.
.
100 100 100 100
experiments: baselines
1. WSJ13
2. WSJ17
3. AP567
.
.
...
1. FT941
2. WSJ13
3. WSJ19
.
.
1. WSJ13
2. CR93E
3. AP111
.
.
MoveToFront (MTF) (Cormack et al 98)
starts with uniform priorities for all runs (e.g. max priority=100)
selects a random run (from those with max priority)
1. ZF207
2. AP881
3. FT967
.
.
100 100 100 100
experiments: baselines
1. WSJ13
2. CR93E
3. AP111
.
.
MoveToFront (MTF) (Cormack et al 98)
extracts & judges docs from the selected run
stays in the run until a non-rel doc is found
100
experiments: baselines
1. WSJ13
2. CR93E
3. AP111
.
.
MoveToFront (MTF) (Cormack et al 98)
extracts & judges docs from the selected run
stays in the run until a non-rel doc is found
100
WSJ13
experiments: baselines
1. WSJ13
2. CR93E
3. AP111
.
.
MoveToFront (MTF) (Cormack et al 98)
extracts & judges docs from the selected run
stays in the run until a non-rel doc is found
100
WSJ13, CR93E
experiments: baselines
1. WSJ13
2. CR93E
3. AP111
.
.
MoveToFront (MTF) (Cormack et al 98)
extracts & judges docs from the selected run
stays in the run until a non-rel doc is found
100
WSJ13, CR93E, AP111
experiments: baselines
1. WSJ13
2. CR93E
3. AP111
.
.
MoveToFront (MTF) (Cormack et al 98)
extracts & judges docs from the selected run
stays in the run until a non-rel doc is found
when a non-rel doc is found, priority is decreased
100 99
WSJ13, CR93E, AP111
experiments: baselines
1. WSJ13
2. WSJ17
3. AP567
.
.
...
1. FT941
2. WSJ13
3. WSJ19
.
.
1. WSJ13
2. CR93E
3. AP111
.
.
MoveToFront (MTF) (Cormack et al 98)
and we jump again to another max priority run
1. ZF207
2. AP881
3. FT967
.
.
100 100 99 100
experiments: baselines
1. WSJ13
2. WSJ17
3. AP567
.
...
1. FT941
2. WSJ13
3. WSJ19
.
1. WSJ13
2. CR93E
3. AP111
.
Moffat et al.'s method (A) (Moffat et al 2007)
based on rank-biased precision (RBP)
sums a rank-dependent score for each doc
1. ZF207
2. AP881
3. FT967
.
score
0.20
0.16
0.13
.
experiments: baselines
1. WSJ13
2. WSJ17
3. AP567
.
...
1. FT941
2. WSJ13
3. WSJ19
.
1. WSJ13
2. CR93E
3. AP111
.
Moffat et al.'s method (A) (Moffat et al 2007)
based on rank-biased precision (RBP)
sums a rank-dependent score for each doc
1. ZF207
2. AP881
3. FT967
.
score
0.20
0.16
0.13
.
all docs are ranked by decreasing accummulated score
and the ranked list defines the order in which docs are judged
WSJ13: 0.20+0.16+0.20+...
experiments: baselines
Moffat et al.'s method (B) (Moffat et al 2007)
evolution over A's method
considers not only the rank-dependent doc's
contributions but also the runs' residuals
promotes the selection of docs from runs with many
unjudged docs
Moffat et al.'s method (C) (Moffat et al 2007)
evolution over B's method
considers not only the rank-dependent doc's and the residuals
promotes the selection of docs from effective runs
experiments: baselines
MTF: best performing baseline
experiments: MTF vs bandit-based models
experiments: MTF vs bandit-based models
Random: weakest approach
BLA/UCB/ϵn
-greedy are suboptimal
(sophisticated exploration/exploitation trading
not needed)
MTF and MM: best performing methods
improved bandit-based models
MTF: forgets quickly about past rewards
(a single non-relevance doc triggers a jump)
non-stationary
bandit-based
solutions:
not all historical
rewards count the
same
MM-NS and BLA-NS
non-stationary
variants of MM and
BLA
stationary bandits
Beta( , ), , =1α β α β
rel docs add 1 to α
non-rel docs add 1 to β
(after n iterations)
Beta(αn
,βn
)
αn
=1+jrels
βn
=1+jrets
– jrels
jrels
: # judged relevant docs (retrieved by s)
jrets
: # judged docs (retrieved by s)
all judged docs count the same
non-stationary bandits
Beta( , ), , =1α β α β
jrels
= rate*jrels
+ reld
jrets
= rate*jrets
+ 1
(after n iterations)
Beta(αn
,βn
)
αn
=1+jrels
βn
=1+jrets
– jrels
rate>1: weights more early relevant docs
rate<1: weights more late relevant docs
rate=0: only the last judged doc counts
(BLA-NS, MM-NS)
rate=1: stationary version
experiments: improved bandit-based models
conclusions
multi-arm bandits: formal & effective framework for
doc adjudication in a pooling-based evaluation
it's not good to increasingly reduce exploration
(UCB, ϵn
-greedy)
it's good to react quickly to non-relevant docs
(non-stationary variants)
future work
query-related
variabilities
hierarchical
bandits
stopping
criteria
metasearch
reproduce our experiments & test new ideas!
http://tec.citius.usc.es/ir/code/pooling_bandits.html
(our R code, instructions, etc)
David E. Losada
Javier Parapar, Álvaro Barreiro
Feeling Lucky? Multi-armed bandits for Ordering Judgements
in Pooling-based Evaluation
Acknowledgements:
MsSaraKelly. picture pg 1 (modified).CC BY 2.0.
Sanofi Pasteur. picture pg 2 (modified).CC BY-NC-ND 2.0.
pedrik. picture pgs 3-5.CC BY 2.0.
Christa Lohman. picture pg 3 (left).CC BY-NC-ND 2.0.
Chris. picture pg 4 (tag cloud).CC BY 2.0.
Daniel Horacio Agostini. picture pg 5 (right).CC BY-NC-ND 2.0.
ScaarAT. picture pg 14.CC BY-NC-ND 2.0.
Sebastien Wiertz. picture pg 15 (modified).CC BY 2.0.
Willard. picture pg 16 (modified).CC BY-NC-ND 2.0.
Jose Luis Cernadas Iglesias. picture pg 17 (modified).CC BY 2.0.
Michelle Bender. picture pg 25 (left).CC BY-NC-ND 2.0.
Robert Levy. picture pg 25 (right).CC BY-NC-ND 2.0.
Simply Swim UK. picture pg 37.CC BY-SA 2.0.
Sarah J. Poe. picture pg 55.CC BY-ND 2.0.
Kate Brady. picture pg 58.CC BY 2.0.
August Brill. picture pg 59.CC BY 2.0.
This work was supported by the
“Ministerio de Economía y Competitividad”
of the Goverment of Spain and
FEDER Funds under
research projects
TIN2012-33867 and TIN2015-64282-R.

More Related Content

Similar to Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation.

cs-171-07-Games and Adversarila Search.ppt
cs-171-07-Games and Adversarila Search.pptcs-171-07-Games and Adversarila Search.ppt
cs-171-07-Games and Adversarila Search.ppt
Samiksha880257
 
Main Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docxMain Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docx
infantsuk
 
ch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.pptch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.ppt
SanGeet25
 
Week8 Live Lecture for Final Exam
Week8 Live Lecture for Final ExamWeek8 Live Lecture for Final Exam
Week8 Live Lecture for Final Exam
Brent Heard
 
Probability unit2.pptx
Probability unit2.pptxProbability unit2.pptx
Probability unit2.pptx
SNIGDHABADIDA2127755
 
GA.pptx
GA.pptxGA.pptx
Final examexamplesapr2013
Final examexamplesapr2013Final examexamplesapr2013
Final examexamplesapr2013
Brent Heard
 
Memorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsMemorization of Various Calculator shortcuts
Memorization of Various Calculator shortcuts
PrincessNorberte
 
Computational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding RegionsComputational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding Regions
butest
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
jfs-masters-1
jfs-masters-1jfs-masters-1
jfs-masters-1
James Swafford
 
Data classification sammer
Data classification sammer Data classification sammer
Data classification sammer
Sammer Qader
 
Lab23 chisquare2007
Lab23 chisquare2007Lab23 chisquare2007
Lab23 chisquare2007
sbarkanic
 
blast and fasta
 blast and fasta blast and fasta
blast and fasta
Nagendrasahu6
 
Minmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptxMinmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptx
PriyadharshiniG41
 
Statistical tests
Statistical testsStatistical tests
Statistical tests
martyynyyte
 
Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)
Eric Choi
 
Ch. 11 Simulations Good
Ch. 11 Simulations GoodCh. 11 Simulations Good
Ch. 11 Simulations Good
christjt
 
UNIT - I Reinforcement Learning .pptx
UNIT - I Reinforcement Learning .pptxUNIT - I Reinforcement Learning .pptx
UNIT - I Reinforcement Learning .pptx
DrUdayKiranG
 
Solutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George CasellaSolutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George Casella
JamesR0510
 

Similar to Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation. (20)

cs-171-07-Games and Adversarila Search.ppt
cs-171-07-Games and Adversarila Search.pptcs-171-07-Games and Adversarila Search.ppt
cs-171-07-Games and Adversarila Search.ppt
 
Main Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docxMain Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docx
 
ch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.pptch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.ppt
 
Week8 Live Lecture for Final Exam
Week8 Live Lecture for Final ExamWeek8 Live Lecture for Final Exam
Week8 Live Lecture for Final Exam
 
Probability unit2.pptx
Probability unit2.pptxProbability unit2.pptx
Probability unit2.pptx
 
GA.pptx
GA.pptxGA.pptx
GA.pptx
 
Final examexamplesapr2013
Final examexamplesapr2013Final examexamplesapr2013
Final examexamplesapr2013
 
Memorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsMemorization of Various Calculator shortcuts
Memorization of Various Calculator shortcuts
 
Computational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding RegionsComputational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding Regions
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
jfs-masters-1
jfs-masters-1jfs-masters-1
jfs-masters-1
 
Data classification sammer
Data classification sammer Data classification sammer
Data classification sammer
 
Lab23 chisquare2007
Lab23 chisquare2007Lab23 chisquare2007
Lab23 chisquare2007
 
blast and fasta
 blast and fasta blast and fasta
blast and fasta
 
Minmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptxMinmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptx
 
Statistical tests
Statistical testsStatistical tests
Statistical tests
 
Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)
 
Ch. 11 Simulations Good
Ch. 11 Simulations GoodCh. 11 Simulations Good
Ch. 11 Simulations Good
 
UNIT - I Reinforcement Learning .pptx
UNIT - I Reinforcement Learning .pptxUNIT - I Reinforcement Learning .pptx
UNIT - I Reinforcement Learning .pptx
 
Solutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George CasellaSolutions to Statistical infeence by George Casella
Solutions to Statistical infeence by George Casella
 

Recently uploaded

The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
Sérgio Sacani
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
Leonel Morgado
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
LengamoLAppostilic
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
by6843629
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
terusbelajar5
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
MaheshaNanjegowda
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
Anagha Prasad
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
European Sustainable Phosphorus Platform
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
AbdullaAlAsif1
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 

Recently uploaded (20)

The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 

Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation.

  • 1. Feeling Lucky? Multi-armed bandits for Ordering Judgements in Pooling-based Evaluation David E. Losada Javier Parapar, Álvaro Barreiro ACM SAC, 2016
  • 2. Evaluation is crucial compare retrieval algorithms, design new search solutions, ...
  • 3. information retrieval evaluation: 3 main ingredients docs
  • 4. information retrieval evaluation: 3 main ingredients queries
  • 5. information retrieval evaluation: 3 main ingredients relevance judgements
  • 7. relevance assessments are incomplete ... search system 1 search system 2 search system 3 search system n
  • 8. relevance assessments are incomplete ... search system 1 search system 2 search system 3 search system n
  • 9. relevance assessments are incomplete 1. WSJ13 2. WSJ17 . . 100. AP567 101. AP555 . . . 1. FT941 2. WSJ13 . . 100. WSJ19 101. AP555 . . . 1. ZF207 2. AP881 . . 100. FT967 101. AP555 . . . 1. WSJ13 2. CR93E . . 100. AP111 101. AP555 . . . ... rankings of docs by estimated relevance (runs)
  • 10. relevance assessments are incomplete 1. WSJ13 2. WSJ17 . . 100. AP567 101. AP555 . . . 1. FT941 2. WSJ13 . . 100. WSJ19 101. AP555 . . . 1. ZF207 2. AP881 . . 100. FT967 101. AP555 . . . 1. WSJ13 2. CR93E . . 100. AP111 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs)
  • 11. relevance assessments are incomplete 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs)
  • 12. relevance assessments are incomplete 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs) WSJ13 WSJ17 AP567 WSJ19AP111 CR93E ZF207AP881FT967 pool ...
  • 13. relevance assessments are incomplete 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs) WSJ13 WSJ17 AP567 WSJ19AP111 CR93E ZF207AP881FT967 pool ... human assessments
  • 14. finding relevant docs is the key Most productive use of assessors' time is spent on judging relevant docs (Sanderson & Zobel, 2005)
  • 15. Effective adjudication methods Give priority to pooled docs that are potentially relevant Can signifcantly reduce the num. of judgements required to identify a given num. of relevant docs But most existing methods are adhoc...
  • 16. Our main idea... Cast doc adjudication as a reinforcement learning problem Doc judging is an iterative process where we learn as judgements come in
  • 17. Doc adjudication as a reinforcement learning problem Initially we know nothing about the quality of the runs ? ? ? ?... As judgements come in... And we can adapt and allocate more docs for judgement from the most promising runs
  • 19. Multi-armed bandits ... unknown probabilities of giving a prize play and observe the reward
  • 21. Multi-armed bandits ... unknown probabilities of giving a prize play and observe the reward
  • 23. Multi-armed bandits ... unknown probabilities of giving a prize play and observe the reward
  • 25. exploration vs exploitation exploits current knowledge spends no time sampling inferior actions maximizes expected reward on the next action explores uncertain actions gets more info about expected payofs may produce greater total reward in the long run allocation methods: choose next action (play) based on past plays and obtained rewards implement diferent ways to trade between exploration and exploitation
  • 26. Multi-armed bandits for ordering judgements ... machines = runs ... play a machine = select a run and get the next (unjudged) doc 1. WSJ13 2. CR93E . . (binary) reward = relevance/non-relevance of the selected doc
  • 27. Allocation methods tested ... random ϵn -greedy with prob 1-ϵ plays the machine with the highest avg reward with prob ϵ plays a random machine prob of exploration (ϵ) decreases with the num. of plays Upper Confdence Bound (UCB) computes upper confdence bounds for avg rewards conf. intervals get narrower with the number of plays selects the machine with the highest optimistic estimate
  • 28. Allocation methods tested: Bayesian bandits prior probabilities of giving a relevant doc: Uniform(0,1) ( or, equivalently, Beta(α,β), α,β=1 ) U(0,1) U(0,1) U(0,1) U(0,1) ... evidence (O ∈ {0,1}) is Bernoulli (or, equivalently, Binomial(1,p) ) posterior probabilities of giving a relevant doc: Beta(α+O, β+1-O) (Beta: conjugate prior for Binomial)
  • 29. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes:
  • 30. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes:
  • 31. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes:
  • 32. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes:
  • 33. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes:
  • 34. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes:
  • 35. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes:
  • 36. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes: two strategies to select the next machine: Bayesian Learning Automaton (BLA): draws a sample from each the posterior distribution and selects the machine yieding the highest sample MaxMean (MM): selects the machine with the highest expectation of the posterior distribution
  • 37. test different document adjudication strategies in terms of how quickly they find the relevant docs in the pool experiments # rel docs found at diff. number of judgements performed
  • 39. experiments: baselines ...WSJ13 WSJ17 AP567 WSJ19AP111 CR93E ZF207AP881FT967 pool ... AP111, AP881, AP567, CR93E, FT967, WSJ13, ... DocId: sorts by Doc Id
  • 40. experiments: baselines 1. WSJ13 2. WSJ17 . . 100. AP567 ... 1. FT941 2. WSJ13 . . 100. WSJ19 1. WSJ13 2. CR93E . . 100. AP111 WSJ13, FT941, ZF207, WSJ17, CR93E, AP881 ... Rank: rank #1 docs go 1st, then rank #2 docs, ... 1. ZF207 2. AP881 . . 100. FT967
  • 41. experiments: baselines 1. WSJ13 2. WSJ17 3. AP567 . . ... 1. FT941 2. WSJ13 3. WSJ19 . . 1. WSJ13 2. CR93E 3. AP111 . . MoveToFront (MTF) (Cormack et al 98) starts with uniform priorities for all runs (e.g. max priority=100) selects a random run (from those with max priority) 1. ZF207 2. AP881 3. FT967 . . 100 100 100 100
  • 42. experiments: baselines 1. WSJ13 2. WSJ17 3. AP567 . . ... 1. FT941 2. WSJ13 3. WSJ19 . . 1. WSJ13 2. CR93E 3. AP111 . . MoveToFront (MTF) (Cormack et al 98) starts with uniform priorities for all runs (e.g. max priority=100) selects a random run (from those with max priority) 1. ZF207 2. AP881 3. FT967 . . 100 100 100 100
  • 43. experiments: baselines 1. WSJ13 2. CR93E 3. AP111 . . MoveToFront (MTF) (Cormack et al 98) extracts & judges docs from the selected run stays in the run until a non-rel doc is found 100
  • 44. experiments: baselines 1. WSJ13 2. CR93E 3. AP111 . . MoveToFront (MTF) (Cormack et al 98) extracts & judges docs from the selected run stays in the run until a non-rel doc is found 100 WSJ13
  • 45. experiments: baselines 1. WSJ13 2. CR93E 3. AP111 . . MoveToFront (MTF) (Cormack et al 98) extracts & judges docs from the selected run stays in the run until a non-rel doc is found 100 WSJ13, CR93E
  • 46. experiments: baselines 1. WSJ13 2. CR93E 3. AP111 . . MoveToFront (MTF) (Cormack et al 98) extracts & judges docs from the selected run stays in the run until a non-rel doc is found 100 WSJ13, CR93E, AP111
  • 47. experiments: baselines 1. WSJ13 2. CR93E 3. AP111 . . MoveToFront (MTF) (Cormack et al 98) extracts & judges docs from the selected run stays in the run until a non-rel doc is found when a non-rel doc is found, priority is decreased 100 99 WSJ13, CR93E, AP111
  • 48. experiments: baselines 1. WSJ13 2. WSJ17 3. AP567 . . ... 1. FT941 2. WSJ13 3. WSJ19 . . 1. WSJ13 2. CR93E 3. AP111 . . MoveToFront (MTF) (Cormack et al 98) and we jump again to another max priority run 1. ZF207 2. AP881 3. FT967 . . 100 100 99 100
  • 49. experiments: baselines 1. WSJ13 2. WSJ17 3. AP567 . ... 1. FT941 2. WSJ13 3. WSJ19 . 1. WSJ13 2. CR93E 3. AP111 . Moffat et al.'s method (A) (Moffat et al 2007) based on rank-biased precision (RBP) sums a rank-dependent score for each doc 1. ZF207 2. AP881 3. FT967 . score 0.20 0.16 0.13 .
  • 50. experiments: baselines 1. WSJ13 2. WSJ17 3. AP567 . ... 1. FT941 2. WSJ13 3. WSJ19 . 1. WSJ13 2. CR93E 3. AP111 . Moffat et al.'s method (A) (Moffat et al 2007) based on rank-biased precision (RBP) sums a rank-dependent score for each doc 1. ZF207 2. AP881 3. FT967 . score 0.20 0.16 0.13 . all docs are ranked by decreasing accummulated score and the ranked list defines the order in which docs are judged WSJ13: 0.20+0.16+0.20+...
  • 51. experiments: baselines Moffat et al.'s method (B) (Moffat et al 2007) evolution over A's method considers not only the rank-dependent doc's contributions but also the runs' residuals promotes the selection of docs from runs with many unjudged docs Moffat et al.'s method (C) (Moffat et al 2007) evolution over B's method considers not only the rank-dependent doc's and the residuals promotes the selection of docs from effective runs
  • 52. experiments: baselines MTF: best performing baseline
  • 53. experiments: MTF vs bandit-based models
  • 54. experiments: MTF vs bandit-based models Random: weakest approach BLA/UCB/ϵn -greedy are suboptimal (sophisticated exploration/exploitation trading not needed) MTF and MM: best performing methods
  • 55. improved bandit-based models MTF: forgets quickly about past rewards (a single non-relevance doc triggers a jump) non-stationary bandit-based solutions: not all historical rewards count the same MM-NS and BLA-NS non-stationary variants of MM and BLA
  • 56. stationary bandits Beta( , ), , =1α β α β rel docs add 1 to α non-rel docs add 1 to β (after n iterations) Beta(αn ,βn ) αn =1+jrels βn =1+jrets – jrels jrels : # judged relevant docs (retrieved by s) jrets : # judged docs (retrieved by s) all judged docs count the same non-stationary bandits Beta( , ), , =1α β α β jrels = rate*jrels + reld jrets = rate*jrets + 1 (after n iterations) Beta(αn ,βn ) αn =1+jrels βn =1+jrets – jrels rate>1: weights more early relevant docs rate<1: weights more late relevant docs rate=0: only the last judged doc counts (BLA-NS, MM-NS) rate=1: stationary version
  • 58. conclusions multi-arm bandits: formal & effective framework for doc adjudication in a pooling-based evaluation it's not good to increasingly reduce exploration (UCB, ϵn -greedy) it's good to react quickly to non-relevant docs (non-stationary variants)
  • 60. reproduce our experiments & test new ideas! http://tec.citius.usc.es/ir/code/pooling_bandits.html (our R code, instructions, etc)
  • 61. David E. Losada Javier Parapar, Álvaro Barreiro Feeling Lucky? Multi-armed bandits for Ordering Judgements in Pooling-based Evaluation Acknowledgements: MsSaraKelly. picture pg 1 (modified).CC BY 2.0. Sanofi Pasteur. picture pg 2 (modified).CC BY-NC-ND 2.0. pedrik. picture pgs 3-5.CC BY 2.0. Christa Lohman. picture pg 3 (left).CC BY-NC-ND 2.0. Chris. picture pg 4 (tag cloud).CC BY 2.0. Daniel Horacio Agostini. picture pg 5 (right).CC BY-NC-ND 2.0. ScaarAT. picture pg 14.CC BY-NC-ND 2.0. Sebastien Wiertz. picture pg 15 (modified).CC BY 2.0. Willard. picture pg 16 (modified).CC BY-NC-ND 2.0. Jose Luis Cernadas Iglesias. picture pg 17 (modified).CC BY 2.0. Michelle Bender. picture pg 25 (left).CC BY-NC-ND 2.0. Robert Levy. picture pg 25 (right).CC BY-NC-ND 2.0. Simply Swim UK. picture pg 37.CC BY-SA 2.0. Sarah J. Poe. picture pg 55.CC BY-ND 2.0. Kate Brady. picture pg 58.CC BY 2.0. August Brill. picture pg 59.CC BY 2.0. This work was supported by the “Ministerio de Economía y Competitividad” of the Goverment of Spain and FEDER Funds under research projects TIN2012-33867 and TIN2015-64282-R.