Feeling Lucky? Multi-armed bandits for Ordering Judgements
in Pooling-based Evaluation
David E. Losada
Javier Parapar, Álvaro Barreiro
ACM SAC, 2016
Evaluation
is crucial
compare retrieval algorithms, design
new search solutions, ...
information retrieval evaluation:
3 main ingredients
docs
queries
relevance judgements
relevance assessments are incomplete
...
search system 1 search system 2 search system 3 search system n
relevance assessments are incomplete
run 1: 1. WSJ13, 2. WSJ17, ..., 100. AP567, 101. AP555, ...
run 2: 1. FT941, 2. WSJ13, ..., 100. WSJ19, 101. AP555, ...
run 3: 1. ZF207, 2. AP881, ..., 100. FT967, 101. AP555, ...
run n: 1. WSJ13, 2. CR93E, ..., 100. AP111, 101. AP555, ...
rankings of docs by estimated relevance (runs)
relevance assessments are incomplete
pool depth: only the top docs of each run (e.g. the top 100) enter the pool; docs below the cut (101. AP555, ...) are not pooled
pool: WSJ13, WSJ17, AP567, WSJ19, AP111, CR93E, ZF207, AP881, FT967, ...
human assessments
finding relevant docs is the key
Assessors' time is most productively spent judging relevant docs
(Sanderson & Zobel, 2005)
Effective adjudication methods
Give priority to pooled docs that are potentially relevant
Can significantly reduce the num. of judgements required to identify a given num. of relevant docs
But most existing methods are ad hoc...
Our main idea...
Cast doc adjudication as a
reinforcement learning problem
Doc judging is an iterative process where
we learn as judgements come in
Doc adjudication as a reinforcement learning problem
Initially we know nothing about the quality of the runs
? ? ? ?...
As judgements
come in...
And we can adapt and allocate more docs for judgement from
the most promising runs
Multi-armed bandits
...
unknown probabilities of giving a prize
play and observe the reward
exploration vs exploitation
exploits current knowledge
spends no time sampling inferior actions
maximizes expected reward
on the next action
explores uncertain actions
gets more info about expected payoffs
may produce greater total reward
in the long run
allocation methods: choose next action (play) based on past plays and obtained rewards
implement different ways to trade between exploration and exploitation
Multi-armed bandits for ordering judgements
machines = runs
play a machine = select a run and get the next (unjudged) doc
(binary) reward = relevance/non-relevance of the selected doc
Allocation methods tested
...
random
ϵn-greedy
with prob 1-ϵ plays the machine
with the highest avg reward
with prob ϵ plays a
random machine
prob of exploration (ϵ) decreases
with the num. of plays
Upper Confidence Bound
(UCB)
computes upper confidence
bounds for avg rewards
conf. intervals get narrower
with the number of plays
selects the machine with the
highest optimistic estimate
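As a rough sketch (not the authors' implementation), the two selection rules above might look like this in Python, treating each run as an arm with binary rewards; the function names and the `c/t` decay schedule for ϵn are illustrative assumptions:

```python
import math
import random

def ucb_select(counts, rewards, t):
    """UCB1: pick the run with the highest optimistic estimate.
    counts[i]: times run i was played; rewards[i]: sum of its 0/1 rewards."""
    for i, n in enumerate(counts):
        if n == 0:                     # play every run once before using the bound
            return i
    bounds = [rewards[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
              for i in range(len(counts))]
    return max(range(len(counts)), key=lambda i: bounds[i])

def eps_greedy_select(counts, rewards, t, c=1.0):
    """epsilon_n-greedy: exploration probability decays with the num. of plays t."""
    eps = min(1.0, c / max(t, 1))
    if random.random() < eps:          # explore: random machine
        return random.randrange(len(counts))
    means = [rewards[i] / counts[i] if counts[i] else 0.0 for i in range(len(counts))]
    return max(range(len(counts)), key=lambda i: means[i])   # exploit: best avg reward
```

With `c=0` the ϵn-greedy rule becomes purely greedy, which makes the exploit branch easy to check in isolation.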
Allocation methods tested: Bayesian bandits
prior probabilities of giving a relevant doc: Uniform(0,1) ( or, equivalently, Beta(α,β), α,β=1 )
U(0,1) U(0,1) U(0,1) U(0,1)
...
evidence (O ∈ {0,1}) is Bernoulli (or, equivalently, Binomial(1,p) )
posterior probabilities of giving a relevant doc: Beta(α+O, β+1-O) (Beta: conjugate prior
for Binomial)
Allocation methods tested: Bayesian bandits
...
we iteratively update our estimations using Bayes:
two strategies to select the next machine:
Bayesian Learning Automaton (BLA): draws a sample from each posterior distribution
and selects the machine yielding the highest sample
MaxMean (MM): selects the machine with the highest expectation of the posterior distribution
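A minimal sketch of the Beta-Bernoulli update and the two selection strategies (illustrative Python; `BetaArm`, `bla_select` and `mm_select` are hypothetical names, not the authors' R code):

```python
import random

class BetaArm:
    """Beta(alpha, beta) belief over a run's probability of yielding a relevant doc."""
    def __init__(self):
        self.alpha, self.beta = 1.0, 1.0     # Uniform(0,1) prior = Beta(1,1)

    def update(self, relevant):
        """Bernoulli evidence O in {0,1}: posterior is Beta(alpha+O, beta+1-O)."""
        self.alpha += relevant
        self.beta += 1 - relevant

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

def bla_select(arms):
    """BLA: sample from each posterior, pick the highest sample (Thompson sampling)."""
    return max(range(len(arms)),
               key=lambda i: random.betavariate(arms[i].alpha, arms[i].beta))

def mm_select(arms):
    """MaxMean: pick the arm with the highest posterior expectation."""
    return max(range(len(arms)), key=lambda i: arms[i].mean())
```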
experiments
test different document adjudication strategies in terms of how quickly they find the relevant docs in the pool
# rel docs found at diff. numbers of judgements performed
experiments: data
experiments: baselines
pool: WSJ13, WSJ17, AP567, WSJ19, AP111, CR93E, ZF207, AP881, FT967, ...
DocId: sorts by Doc Id
AP111, AP881, AP567, CR93E, FT967, WSJ13, ...
experiments: baselines
run 1: 1. WSJ13, 2. WSJ17, ..., 100. AP567
run 2: 1. FT941, 2. WSJ13, ..., 100. WSJ19
run 3: 1. ZF207, 2. AP881, ..., 100. FT967
run n: 1. WSJ13, 2. CR93E, ..., 100. AP111
Rank: rank #1 docs go 1st, then rank #2 docs, ...
WSJ13, FT941, ZF207, WSJ17, CR93E, AP881 ...
experiments: baselines
MoveToFront (MTF) (Cormack et al 98)
starts with uniform priorities for all runs (e.g. max priority=100)
selects a random run (from those with max priority)
extracts & judges docs from the selected run (e.g. WSJ13, CR93E, AP111 in turn)
stays in the run until a non-rel doc is found
when a non-rel doc is found, the run's priority is decreased (100 → 99)
and we jump again to another max-priority run
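The MTF procedure can be sketched as follows (illustrative Python, assuming `runs` are ranked doc-id lists and `qrels` is the set of relevant doc ids; not the authors' implementation):

```python
import random

def move_to_front(runs, qrels, max_priority=100):
    """MoveToFront (Cormack et al. 98), minimal sketch.
    Returns the order in which pooled docs get judged."""
    priority = [max_priority] * len(runs)
    position = [0] * len(runs)              # next unjudged rank per run
    judged, order = set(), []
    while any(position[i] < len(runs[i]) for i in range(len(runs))):
        active = [i for i in range(len(runs)) if position[i] < len(runs[i])]
        best = max(priority[i] for i in active)
        s = random.choice([i for i in active if priority[i] == best])
        # stay in the selected run until a non-relevant doc is judged
        while position[s] < len(runs[s]):
            doc = runs[s][position[s]]
            position[s] += 1
            if doc in judged:               # already judged via another run
                continue
            judged.add(doc)
            order.append(doc)
            if doc not in qrels:            # non-relevant: decrease priority, jump
                priority[s] -= 1
                break
    return order
```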
experiments: baselines
Moffat et al.'s method (A) (Moffat et al 2007)
based on rank-biased precision (RBP)
sums a rank-dependent score for each doc (0.20 for rank 1, 0.16 for rank 2, 0.13 for rank 3, ...)
all docs are ranked by decreasing accumulated score (e.g. WSJ13: 0.20+0.16+0.20+...)
and the ranked list defines the order in which docs are judged
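Method A's accumulation can be sketched as follows (illustrative Python; with RBP persistence p=0.8 the per-rank contributions are 0.20, 0.16, 0.128, ..., matching the scores above):

```python
def rbp_adjudication_order(runs, p=0.8):
    """Moffat et al.'s method (A), minimal sketch: each doc accumulates a
    rank-biased-precision contribution (1-p)*p**(rank-1) from every run that
    retrieved it; docs are then judged by decreasing accumulated score."""
    score = {}
    for run in runs:
        for rank, doc in enumerate(run, start=1):
            score[doc] = score.get(doc, 0.0) + (1 - p) * p ** (rank - 1)
    return sorted(score, key=score.get, reverse=True)
```

For example, a doc at rank 1 in one run and rank 2 in another accumulates 0.20 + 0.16 = 0.36 and is adjudicated first.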
experiments: baselines
Moffat et al.'s method (B) (Moffat et al 2007)
evolution over A's method
considers not only the rank-dependent doc contributions but also the runs' residuals
promotes the selection of docs from runs with many unjudged docs
Moffat et al.'s method (C) (Moffat et al 2007)
evolution over B's method
considers not only the rank-dependent contributions and the residuals, but also run effectiveness
promotes the selection of docs from effective runs
experiments: baselines
MTF: best performing baseline
experiments: MTF vs bandit-based models
Random: weakest approach
BLA/UCB/ϵn-greedy are suboptimal
(sophisticated exploration/exploitation trading not needed)
MTF and MM: best performing methods
improved bandit-based models
MTF: forgets quickly about past rewards
(a single non-relevant doc triggers a jump)
non-stationary bandit-based solutions: not all historical rewards count the same
MM-NS and BLA-NS: non-stationary variants of MM and BLA
stationary bandits
Beta(α,β), α,β=1
rel docs add 1 to α
non-rel docs add 1 to β
(after n iterations) Beta(αn, βn) with αn = 1 + jrel_s, βn = 1 + jret_s − jrel_s
jrel_s: # judged relevant docs (retrieved by run s)
jret_s: # judged docs (retrieved by run s)
all judged docs count the same
non-stationary bandits
Beta(α,β), α,β=1
jrel_s = rate·jrel_s + rel_d
jret_s = rate·jret_s + 1
(after n iterations) Beta(αn, βn) with αn = 1 + jrel_s, βn = 1 + jret_s − jrel_s
rate>1: weights early relevant docs more
rate<1: weights late relevant docs more
rate=0: only the last judged doc counts (BLA-NS, MM-NS)
rate=1: stationary version
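The discounted update above can be sketched as (illustrative Python; `ns_update` is a hypothetical helper, `relevant` is the 0/1 judgement rel_d of the newly judged doc):

```python
def ns_update(jrel, jret, relevant, rate=0.0):
    """Non-stationary update sketch: past counts are discounted by `rate`
    before the new judgement is added. rate=1 recovers the stationary case;
    rate=0 keeps only the last judgement (BLA-NS, MM-NS)."""
    jrel = rate * jrel + relevant       # discounted relevant-doc count
    jret = rate * jret + 1              # discounted judged-doc count
    alpha = 1 + jrel                    # Beta(alpha, beta) posterior parameters
    beta = 1 + jret - jrel
    return jrel, jret, alpha, beta
```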
experiments: improved bandit-based models
conclusions
multi-armed bandits: a formal & effective framework for doc adjudication in pooling-based evaluation
it's not good to increasingly reduce exploration (UCB, ϵn-greedy)
it's good to react quickly to non-relevant docs (non-stationary variants)
future work
query-related
variabilities
hierarchical
bandits
stopping
criteria
metasearch
reproduce our experiments & test new ideas!
http://tec.citius.usc.es/ir/code/pooling_bandits.html
(our R code, instructions, etc)
David E. Losada
Javier Parapar, Álvaro Barreiro
Feeling Lucky? Multi-armed bandits for Ordering Judgements
in Pooling-based Evaluation
Acknowledgements:
MsSaraKelly. picture pg 1 (modified).CC BY 2.0.
Sanofi Pasteur. picture pg 2 (modified).CC BY-NC-ND 2.0.
pedrik. picture pgs 3-5.CC BY 2.0.
Christa Lohman. picture pg 3 (left).CC BY-NC-ND 2.0.
Chris. picture pg 4 (tag cloud).CC BY 2.0.
Daniel Horacio Agostini. picture pg 5 (right).CC BY-NC-ND 2.0.
ScaarAT. picture pg 14.CC BY-NC-ND 2.0.
Sebastien Wiertz. picture pg 15 (modified).CC BY 2.0.
Willard. picture pg 16 (modified).CC BY-NC-ND 2.0.
Jose Luis Cernadas Iglesias. picture pg 17 (modified).CC BY 2.0.
Michelle Bender. picture pg 25 (left).CC BY-NC-ND 2.0.
Robert Levy. picture pg 25 (right).CC BY-NC-ND 2.0.
Simply Swim UK. picture pg 37.CC BY-SA 2.0.
Sarah J. Poe. picture pg 55.CC BY-ND 2.0.
Kate Brady. picture pg 58.CC BY 2.0.
August Brill. picture pg 59.CC BY 2.0.
This work was supported by the “Ministerio de Economía y Competitividad” of the Government of Spain and FEDER Funds under research projects TIN2012-33867 and TIN2015-64282-R.

More Related Content

Viewers also liked

Predictive Modeling in Underwriting
Predictive Modeling in UnderwritingPredictive Modeling in Underwriting
Predictive Modeling in UnderwritingKevin Pledge
 
Uplift Modeling Workshop
Uplift Modeling WorkshopUplift Modeling Workshop
Uplift Modeling Workshopodsc
 
Advanced Pricing in General Insurance
Advanced Pricing in General InsuranceAdvanced Pricing in General Insurance
Advanced Pricing in General InsuranceSyed Danish Ali
 
Insurance pricing
Insurance pricingInsurance pricing
Insurance pricingLincy PT
 

Viewers also liked (6)

Predictive Modeling in Underwriting
Predictive Modeling in UnderwritingPredictive Modeling in Underwriting
Predictive Modeling in Underwriting
 
Uplift Modeling Workshop
Uplift Modeling WorkshopUplift Modeling Workshop
Uplift Modeling Workshop
 
Advanced Pricing in General Insurance
Advanced Pricing in General InsuranceAdvanced Pricing in General Insurance
Advanced Pricing in General Insurance
 
Actuarial Analytics in R
Actuarial Analytics in RActuarial Analytics in R
Actuarial Analytics in R
 
Princing insurance contracts with R
Princing insurance contracts with RPrincing insurance contracts with R
Princing insurance contracts with R
 
Insurance pricing
Insurance pricingInsurance pricing
Insurance pricing
 

Similar to Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation

Simple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimizationSimple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimizationOlivier Teytaud
 
Main Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docxMain Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docxinfantsuk
 
ch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.pptch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.pptSanGeet25
 
Week8 Live Lecture for Final Exam
Week8 Live Lecture for Final ExamWeek8 Live Lecture for Final Exam
Week8 Live Lecture for Final ExamBrent Heard
 
Final examexamplesapr2013
Final examexamplesapr2013Final examexamplesapr2013
Final examexamplesapr2013Brent Heard
 
Memorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsMemorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsPrincessNorberte
 
Computational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding RegionsComputational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding Regionsbutest
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random ForestsCloudxLab
 
Data classification sammer
Data classification sammer Data classification sammer
Data classification sammer Sammer Qader
 
Lab23 chisquare2007
Lab23 chisquare2007Lab23 chisquare2007
Lab23 chisquare2007sbarkanic
 
Minmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptxMinmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptxPriyadharshiniG41
 
Statistical tests
Statistical testsStatistical tests
Statistical testsmartyynyyte
 
Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)Eric Choi
 
Ch. 11 Simulations Good
Ch. 11 Simulations GoodCh. 11 Simulations Good
Ch. 11 Simulations Goodchristjt
 
UNIT - I Reinforcement Learning .pptx
UNIT - I Reinforcement Learning .pptxUNIT - I Reinforcement Learning .pptx
UNIT - I Reinforcement Learning .pptxDrUdayKiranG
 

Similar to Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation (20)

Anova.ppt
Anova.pptAnova.ppt
Anova.ppt
 
Simple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimizationSimple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimization
 
Main Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docxMain Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docx
 
ch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.pptch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.ppt
 
Week8 Live Lecture for Final Exam
Week8 Live Lecture for Final ExamWeek8 Live Lecture for Final Exam
Week8 Live Lecture for Final Exam
 
Probability unit2.pptx
Probability unit2.pptxProbability unit2.pptx
Probability unit2.pptx
 
GA.pptx
GA.pptxGA.pptx
GA.pptx
 
Final examexamplesapr2013
Final examexamplesapr2013Final examexamplesapr2013
Final examexamplesapr2013
 
Memorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsMemorization of Various Calculator shortcuts
Memorization of Various Calculator shortcuts
 
Computational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding RegionsComputational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding Regions
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
jfs-masters-1
jfs-masters-1jfs-masters-1
jfs-masters-1
 
Data classification sammer
Data classification sammer Data classification sammer
Data classification sammer
 
Lab23 chisquare2007
Lab23 chisquare2007Lab23 chisquare2007
Lab23 chisquare2007
 
blast and fasta
 blast and fasta blast and fasta
blast and fasta
 
Minmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptxMinmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptx
 
Statistical tests
Statistical testsStatistical tests
Statistical tests
 
Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)
 
Ch. 11 Simulations Good
Ch. 11 Simulations GoodCh. 11 Simulations Good
Ch. 11 Simulations Good
 
UNIT - I Reinforcement Learning .pptx
UNIT - I Reinforcement Learning .pptxUNIT - I Reinforcement Learning .pptx
UNIT - I Reinforcement Learning .pptx
 

Recently uploaded

Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Recently uploaded (20)

Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation

  • 1. Feeling Lucky? Multi-armed bandits for Ordering Judgements in Pooling-based Evaluation David E. Losada Javier Parapar, Álvaro Barreiro ACM SAC, 2016
  • 2. Evaluation is crucial compare retrieval algorithms, design new search solutions, ...
  • 3. information retrieval evaluation: 3 main ingredients docs
  • 4. information retrieval evaluation: 3 main ingredients queries
  • 5. information retrieval evaluation: 3 main ingredients relevance judgements
  • 7. relevance assessments are incomplete ... search system 1 search system 2 search system 3 search system n
  • 8. relevance assessments are incomplete ... search system 1 search system 2 search system 3 search system n
  • 9. relevance assessments are incomplete 1. WSJ13 2. WSJ17 . . 100. AP567 101. AP555 . . . 1. FT941 2. WSJ13 . . 100. WSJ19 101. AP555 . . . 1. ZF207 2. AP881 . . 100. FT967 101. AP555 . . . 1. WSJ13 2. CR93E . . 100. AP111 101. AP555 . . . ... rankings of docs by estimated relevance (runs)
  • 10. relevance assessments are incomplete 1. WSJ13 2. WSJ17 . . 100. AP567 101. AP555 . . . 1. FT941 2. WSJ13 . . 100. WSJ19 101. AP555 . . . 1. ZF207 2. AP881 . . 100. FT967 101. AP555 . . . 1. WSJ13 2. CR93E . . 100. AP111 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs)
  • 11. relevance assessments are incomplete 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs)
  • 12. relevance assessments are incomplete 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs) WSJ13 WSJ17 AP567 WSJ19AP111 CR93E ZF207AP881FT967 pool ...
  • 13. relevance assessments are incomplete 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs) WSJ13 WSJ17 AP567 WSJ19AP111 CR93E ZF207AP881FT967 pool ... human assessments
  • 14. finding relevant docs is the key Most productive use of assessors' time is spent on judging relevant docs (Sanderson & Zobel, 2005)
  • 15. Effective adjudication methods Give priority to pooled docs that are potentially relevant Can signifcantly reduce the num. of judgements required to identify a given num. of relevant docs But most existing methods are adhoc...
  • 16. Our main idea... Cast doc adjudication as a reinforcement learning problem Doc judging is an iterative process where we learn as judgements come in
  • 17. Doc adjudication as a reinforcement learning problem Initially we know nothing about the quality of the runs ? ? ? ?... As judgements come in... And we can adapt and allocate more docs for judgement from the most promising runs
  • 19. Multi-armed bandits ... unknown probabilities of giving a prize play and observe the reward
  • 21. Multi-armed bandits ... unknown probabilities of giving a prize play and observe the reward
  • 23. Multi-armed bandits ... unknown probabilities of giving a prize play and observe the reward
  • 25. exploration vs exploitation exploits current knowledge spends no time sampling inferior actions maximizes expected reward on the next action explores uncertain actions gets more info about expected payofs may produce greater total reward in the long run allocation methods: choose next action (play) based on past plays and obtained rewards implement diferent ways to trade between exploration and exploitation
  • 26. Multi-armed bandits for ordering judgements ... machines = runs ... play a machine = select a run and get the next (unjudged) doc 1. WSJ13 2. CR93E . . (binary) reward = relevance/non-relevance of the selected doc
  • 27. Allocation methods tested ... random ϵn -greedy with prob 1-ϵ plays the machine with the highest avg reward with prob ϵ plays a random machine prob of exploration (ϵ) decreases with the num. of plays Upper Confdence Bound (UCB) computes upper confdence bounds for avg rewards conf. intervals get narrower with the number of plays selects the machine with the highest optimistic estimate
  • 28. Allocation methods tested: Bayesian bandits prior probabilities of giving a relevant doc: Uniform(0,1) ( or, equivalently, Beta(α,β), α,β=1 ) U(0,1) U(0,1) U(0,1) U(0,1) ... evidence (O ∈ {0,1}) is Bernoulli (or, equivalently, Binomial(1,p) ) posterior probabilities of giving a relevant doc: Beta(α+O, β+1-O) (Beta: conjugate prior for Binomial)
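The conjugate update on this slide is a one-liner; a small sketch (the function name `update_beta` is mine):

```python
def update_beta(alpha, beta, relevant):
    """Conjugate update: a Bernoulli observation O ∈ {0,1} on a
    Beta(alpha, beta) prior yields a Beta(alpha + O, beta + 1 - O)
    posterior, so each judgement just increments one parameter."""
    o = 1 if relevant else 0
    return alpha + o, beta + 1 - o
```

Starting from the uniform prior Beta(1,1), a relevant doc moves a run to Beta(2,1) and a non-relevant one to Beta(1,2).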
  • 29. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes:
  • 36. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes' rule two strategies to select the next machine: Bayesian Learning Automaton (BLA): draws a sample from each posterior distribution and selects the machine yielding the highest sample MaxMean (MM): selects the machine with the highest expectation of the posterior distribution
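The two Bayesian selection strategies can be sketched over the runs' Beta posteriors; `bla_select` and `maxmean_select` are hypothetical names for this illustration (the authors' code is R).

```python
import random

def bla_select(alphas, betas, rng=random):
    """BLA (Thompson-sampling style): draw one sample from each run's
    Beta posterior and pick the run with the highest draw."""
    samples = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
    return max(range(len(samples)), key=lambda i: samples[i])

def maxmean_select(alphas, betas):
    """MaxMean: pick the run whose Beta posterior has the highest
    mean, E[Beta(a, b)] = a / (a + b)."""
    means = [a / (a + b) for a, b in zip(alphas, betas)]
    return max(range(len(means)), key=lambda i: means[i])
```

MaxMean is deterministic given the counts, while BLA keeps exploring: a run with a wide posterior can still win the draw even if its mean is lower.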
  • 37. experiments test different document adjudication strategies in terms of how quickly they find the relevant docs in the pool: # rel docs found at diff. numbers of judgements performed
  • 39. experiments: baselines ...WSJ13 WSJ17 AP567 WSJ19AP111 CR93E ZF207AP881FT967 pool ... AP111, AP881, AP567, CR93E, FT967, WSJ13, ... DocId: sorts by Doc Id
  • 40. experiments: baselines 1. WSJ13 2. WSJ17 . . 100. AP567 ... 1. FT941 2. WSJ13 . . 100. WSJ19 1. WSJ13 2. CR93E . . 100. AP111 WSJ13, FT941, ZF207, WSJ17, CR93E, AP881 ... Rank: rank #1 docs go 1st, then rank #2 docs, ... 1. ZF207 2. AP881 . . 100. FT967
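The Rank baseline above is a round-robin over rank positions; a minimal sketch (the function name is mine, and I assume duplicates are judged only once):

```python
def rank_baseline_order(runs):
    """Rank baseline: all rank-1 docs first, then rank-2 docs, ...,
    skipping any doc already selected from an earlier run/rank."""
    order, seen = [], set()
    depth = max(len(run) for run in runs)
    for k in range(depth):                 # rank position, 0-based
        for run in runs:
            if k < len(run) and run[k] not in seen:
                seen.add(run[k])
                order.append(run[k])
    return order
```

For the slide's example, WSJ13 (rank 1 in several runs) is judged once, followed by the remaining rank-1 docs, then the rank-2 docs, and so on.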
  • 41. experiments: baselines 1. WSJ13 2. WSJ17 3. AP567 . . ... 1. FT941 2. WSJ13 3. WSJ19 . . 1. WSJ13 2. CR93E 3. AP111 . . MoveToFront (MTF) (Cormack et al 98) starts with uniform priorities for all runs (e.g. max priority=100) selects a random run (from those with max priority) 1. ZF207 2. AP881 3. FT967 . . 100 100 100 100
  • 47. experiments: baselines 1. WSJ13 2. CR93E 3. AP111 . . MoveToFront (MTF) (Cormack et al 98) extracts & judges docs from the selected run stays in the run until a non-rel doc is found when a non-rel doc is found, priority is decreased 100 99 WSJ13, CR93E, AP111
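The MTF procedure walked through on these slides can be sketched as follows. This is an illustrative Python reading of Cormack et al.'s rule, with `qrels` standing in for the human assessor and names of my choosing:

```python
import random

def move_to_front(runs, qrels, rng=random):
    """MoveToFront sketch: judge down the highest-priority run until a
    non-relevant doc is seen, then lower that run's priority and jump
    to a (random) run of maximum priority. Returns docs in judging order.

    runs  : list of ranked doc-id lists
    qrels : set of relevant doc ids (simulated assessor)
    """
    priority = [100] * len(runs)
    position = [0] * len(runs)
    judged, order = set(), []
    while any(position[i] < len(runs[i]) for i in range(len(runs))):
        active = [i for i in range(len(runs)) if position[i] < len(runs[i])]
        top = max(priority[i] for i in active)
        r = rng.choice([i for i in active if priority[i] == top])
        while position[r] < len(runs[r]):
            doc = runs[r][position[r]]
            position[r] += 1
            if doc in judged:          # already judged via another run
                continue
            judged.add(doc)
            order.append(doc)
            if doc not in qrels:       # non-relevant: demote run, jump
                priority[r] -= 1
                break
    return order
```

The single demotion per non-relevant doc is what makes MTF "forget quickly", the behaviour the non-stationary bandits later imitate.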
  • 48. experiments: baselines 1. WSJ13 2. WSJ17 3. AP567 . . ... 1. FT941 2. WSJ13 3. WSJ19 . . 1. WSJ13 2. CR93E 3. AP111 . . MoveToFront (MTF) (Cormack et al 98) and we jump again to another max priority run 1. ZF207 2. AP881 3. FT967 . . 100 100 99 100
  • 49. experiments: baselines 1. WSJ13 2. WSJ17 3. AP567 . ... 1. FT941 2. WSJ13 3. WSJ19 . 1. WSJ13 2. CR93E 3. AP111 . Moffat et al.'s method (A) (Moffat et al 2007) based on rank-biased precision (RBP) sums a rank-dependent score for each doc 1. ZF207 2. AP881 3. FT967 . score 0.20 0.16 0.13 .
  • 50. experiments: baselines 1. WSJ13 2. WSJ17 3. AP567 . ... 1. FT941 2. WSJ13 3. WSJ19 . 1. WSJ13 2. CR93E 3. AP111 . Moffat et al.'s method (A) (Moffat et al 2007) based on rank-biased precision (RBP) sums a rank-dependent score for each doc 1. ZF207 2. AP881 3. FT967 . score 0.20 0.16 0.13 . all docs are ranked by decreasing accumulated score and the ranked list defines the order in which docs are judged WSJ13: 0.20+0.16+0.20+...
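Method A can be sketched in a few lines. Each appearance of a doc at rank k (1-based) contributes the RBP weight (1-p)·p^(k-1); with p = 0.8 this gives the 0.20, 0.16, 0.13 scores shown on the slide. The function name is mine:

```python
from collections import defaultdict

def rbp_adjudication_order(runs, p=0.8):
    """Moffat et al.'s method A sketch: accumulate RBP-style weights
    (1 - p) * p**(k - 1) over all runs, then judge docs in decreasing
    order of accumulated score."""
    score = defaultdict(float)
    for run in runs:
        for k, doc in enumerate(run, start=1):
            score[doc] += (1 - p) * p ** (k - 1)
    return sorted(score, key=lambda d: score[d], reverse=True)
```

A doc that appears near the top of several runs (like WSJ13 on the slide) accumulates a large score and is judged early.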
  • 51. experiments: baselines Moffat et al.'s method (B) (Moffat et al 2007) evolution over method A: considers not only the rank-dependent docs' contributions but also the runs' residuals; promotes the selection of docs from runs with many unjudged docs Moffat et al.'s method (C) (Moffat et al 2007) evolution over method B: considers not only the rank-dependent docs' contributions and the residuals but also the runs' estimated effectiveness; promotes the selection of docs from effective runs
  • 52. experiments: baselines MTF: best performing baseline
  • 53. experiments: MTF vs bandit-based models
  • 54. experiments: MTF vs bandit-based models Random: weakest approach BLA/UCB/ϵn-greedy are suboptimal (sophisticated exploration/exploitation trade-offs are not needed) MTF and MM: best performing methods
  • 55. improved bandit-based models MTF forgets quickly about past rewards (a single non-relevant doc triggers a jump) non-stationary bandit-based solutions: not all historical rewards count the same MM-NS and BLA-NS: non-stationary variants of MM and BLA
  • 56. stationary bandits prior: Beta(α,β), α,β=1 rel docs add 1 to α, non-rel docs add 1 to β (after n iterations) Beta(αn,βn), with αn = 1 + jrels and βn = 1 + jrets − jrels jrels: # judged relevant docs (retrieved by run s) jrets: # judged docs (retrieved by run s) all judged docs count the same non-stationary bandits prior: Beta(α,β), α,β=1 updates: jrels = rate·jrels + reld, jrets = rate·jrets + 1 (after n iterations) Beta(αn,βn), with αn = 1 + jrels and βn = 1 + jrets − jrels rate>1: weights early relevant docs more rate<1: weights late relevant docs more rate=0: only the last judged doc counts (BLA-NS, MM-NS) rate=1: stationary version
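The discounted update above can be sketched directly from the slide's recurrences; `ns_update` is an illustrative name, and `reld` is the binary relevance of the newly judged doc:

```python
def ns_update(jrel, jret, relevant, rate):
    """Non-stationary update (MM-NS / BLA-NS sketch): past counts are
    discounted by `rate` before the new observation is added.
    rate = 1 recovers the stationary version; rate < 1 weights recent
    judgements more; rate = 0 keeps only the last judgement."""
    jrel = rate * jrel + (1 if relevant else 0)   # discounted relevant count
    jret = rate * jret + 1                        # discounted judged count
    alpha = 1 + jrel                              # Beta posterior params
    beta = 1 + jret - jrel
    return jrel, jret, alpha, beta
```

With rate = 0 the posterior depends only on the last doc judged, which mimics MTF's jump on a single non-relevant doc while keeping the Bayesian selection machinery.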
  • 58. conclusions multi-armed bandits: a formal & effective framework for doc adjudication in pooling-based evaluation it's not good to increasingly reduce exploration (UCB, ϵn-greedy) it's good to react quickly to non-relevant docs (non-stationary variants)
  • 60. reproduce our experiments & test new ideas! http://tec.citius.usc.es/ir/code/pooling_bandits.html (our R code, instructions, etc)
  • 61. David E. Losada Javier Parapar, Álvaro Barreiro Feeling Lucky? Multi-armed bandits for Ordering Judgements in Pooling-based Evaluation Acknowledgements: MsSaraKelly. picture pg 1 (modified).CC BY 2.0. Sanofi Pasteur. picture pg 2 (modified).CC BY-NC-ND 2.0. pedrik. picture pgs 3-5.CC BY 2.0. Christa Lohman. picture pg 3 (left).CC BY-NC-ND 2.0. Chris. picture pg 4 (tag cloud).CC BY 2.0. Daniel Horacio Agostini. picture pg 5 (right).CC BY-NC-ND 2.0. ScaarAT. picture pg 14.CC BY-NC-ND 2.0. Sebastien Wiertz. picture pg 15 (modified).CC BY 2.0. Willard. picture pg 16 (modified).CC BY-NC-ND 2.0. Jose Luis Cernadas Iglesias. picture pg 17 (modified).CC BY 2.0. Michelle Bender. picture pg 25 (left).CC BY-NC-ND 2.0. Robert Levy. picture pg 25 (right).CC BY-NC-ND 2.0. Simply Swim UK. picture pg 37.CC BY-SA 2.0. Sarah J. Poe. picture pg 55.CC BY-ND 2.0. Kate Brady. picture pg 58.CC BY 2.0. August Brill. picture pg 59.CC BY 2.0. This work was supported by the “Ministerio de Economía y Competitividad” of the Government of Spain and FEDER Funds under research projects TIN2012-33867 and TIN2015-64282-R.