Feeling Lucky? Multi-armed bandits for Ordering Judgements
in Pooling-based Evaluation
David E. Losada
Javier Parapar, Álvaro Barreiro
ACM SAC, 2016
Evaluation
is crucial
compare retrieval algorithms, design
new search solutions, ...
information retrieval evaluation:
3 main ingredients
docs
queries
relevance judgements
relevance assessments are incomplete
search system 1 search system 2 search system 3 ... search system n
rankings of docs by estimated relevance (runs):
1. WSJ13, 2. WSJ17, ..., 100. AP567, 101. AP555, ...
1. FT941, 2. WSJ13, ..., 100. WSJ19, 101. AP555, ...
1. ZF207, 2. AP881, ..., 100. FT967, 101. AP555, ...
1. WSJ13, 2. CR93E, ..., 100. AP111, 101. AP555, ...
...
pool depth: only the top-ranked docs of each run (here, ranks 1 to 100) enter the pool
pool (docs ranked above the pool depth in at least one run): WSJ13, WSJ17, AP567, WSJ19, AP111, CR93E, ZF207, AP881, FT967, ...
docs below the pool depth (101. AP555, ...) stay out of the pool
human assessments: only the pooled docs are judged
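A minimal sketch of how such a pool could be built, assuming each run is a list of doc ids ordered by estimated relevance. The function name `build_pool` and the toy runs (truncated rankings, with a depth of 3 instead of 100) are illustrative assumptions, not part of the talk.

```python
def build_pool(runs, depth):
    """Return the set of docs ranked within `depth` in at least one run."""
    pool = set()
    for run in runs:
        pool.update(run[:depth])
    return pool


# Toy runs using doc ids from the slides (hypothetical, truncated rankings).
runs = [
    ["WSJ13", "WSJ17", "AP567", "AP555"],
    ["FT941", "WSJ13", "WSJ19", "AP555"],
    ["ZF207", "AP881", "FT967", "AP555"],
    ["WSJ13", "CR93E", "AP111", "AP555"],
]
pool = build_pool(runs, depth=3)
```

AP555 sits below the depth in every run, so it never reaches the assessors; this is why the resulting relevance assessments are incomplete.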
finding relevant docs is the key
Most productive use of assessors' time
is spent on judging relevant docs
(Sanderson & Zobel, 2005)
Effective adjudication methods
Give priority to pooled docs that are
potentially relevant
Can significantly reduce the num. of
judgements required to identify a given
num. of relevant docs
But most existing methods are ad hoc...
Our main idea...
Cast doc adjudication as a
reinforcement learning problem
Doc judging is an iterative process where
we learn as judgements come in
Doc adjudication as a reinforcement learning problem
Initially we know nothing about the quality of the runs
? ? ? ?...
As judgements
come in...
And we can adapt and allocate more docs for judgement from
the most promising runs
Multi-armed bandits
...
unknown probabilities of giving a prize
play and observe the reward
exploration vs exploitation
exploitation:
exploits current knowledge
spends no time sampling inferior actions
maximizes expected reward on the next action
exploration:
explores uncertain actions
gets more info about expected payoffs
may produce greater total reward in the long run
allocation methods: choose next action (play) based on past plays and obtained rewards
implement different ways to trade between exploration and exploitation
Multi-armed bandits for ordering judgements
...
machines = runs
...
play a machine = select a run and get the next (unjudged) doc
1. WSJ13
2. CR93E
.
.
(binary) reward = relevance/non-relevance of the selected doc
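The mapping above can be sketched as a single `play` step. This is only an illustration: the function name, the `judged` dictionary and the `qrels` lookup (which stands in for the human assessor) are assumptions, and the judgements are made up.

```python
def play(run, judged, qrels):
    """Play one machine: judge the next unjudged doc of `run`.

    Returns (doc, reward), or (None, None) if the run is exhausted.
    `qrels` (doc id -> 0/1) is a stand-in for the human assessor.
    """
    for doc in run:
        if doc not in judged:
            reward = qrels.get(doc, 0)  # binary reward: 1 = relevant
            judged[doc] = reward
            return doc, reward
    return None, None


run = ["WSJ13", "CR93E", "AP111"]
qrels = {"WSJ13": 1, "CR93E": 1, "AP111": 0}  # made-up judgements
judged = {}
first = play(run, judged, qrels)
second = play(run, judged, qrels)
```

Sharing `judged` across runs means a doc retrieved by several runs is judged only once.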
Allocation methods tested
...
random
ϵ_n-greedy
with prob 1-ϵ plays the machine with the highest avg reward
with prob ϵ plays a random machine
prob of exploration (ϵ) decreases with the num. of plays
Upper Confidence Bound (UCB)
computes upper confidence bounds for avg rewards
conf. intervals get narrower with the number of plays
selects the machine with the highest optimistic estimate
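A rough sketch of the two selection rules. The slides do not give the formulas, so the decay schedule `eps = min(1, c*K / (d*d*t))`, the constants `c` and `d`, and the UCB1-style bound `mean + sqrt(2 ln t / n)` are assumptions based on common formulations, not necessarily the exact variants used in the experiments.

```python
import math
import random


def epsilon_n_greedy(counts, rewards, t, c=0.01, d=0.1, rng=random):
    """Pick a machine index; the exploration prob shrinks as plays grow.

    Assumed schedule: eps = min(1, c*K / (d*d*t)), t = total plays so far.
    """
    k = len(counts)
    eps = min(1.0, c * k / (d * d * max(t, 1)))
    if rng.random() < eps:
        return rng.randrange(k)  # explore: random machine
    means = [r / n if n else 0.0 for r, n in zip(rewards, counts)]
    return max(range(k), key=means.__getitem__)  # exploit: best avg reward


def ucb1(counts, rewards, t):
    """Pick the machine with the highest upper confidence bound (UCB1 form)."""
    for i, n in enumerate(counts):
        if n == 0:  # play every machine once before trusting the bounds
            return i
    bounds = [r / n + math.sqrt(2.0 * math.log(t) / n)
              for r, n in zip(rewards, counts)]
    return max(range(len(counts)), key=bounds.__getitem__)
```

Here `counts[i]` is the number of plays of machine `i` and `rewards[i]` its number of relevant docs; a rarely played machine gets a wide interval, so UCB may still pick it even when its average reward is low.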
Allocation methods tested: Bayesian bandits
prior probabilities of giving a relevant doc: Uniform(0,1) (or, equivalently, Beta(α,β), α,β=1), one U(0,1) prior per machine
evidence (O ∈ {0,1}) is Bernoulli (or, equivalently, Binomial(1,p))
posterior probabilities of giving a relevant doc: Beta(α+O, β+1-O) (Beta: conjugate prior for Binomial)
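The conjugate update can be written in a couple of lines. A small sketch, with made-up judgements and hypothetical function names:

```python
def update(alpha, beta, o):
    """Posterior after one Bernoulli observation o (1 = relevant, 0 = not)."""
    return alpha + o, beta + 1 - o


def posterior_mean(alpha, beta):
    """Expected probability that the machine gives a relevant doc."""
    return alpha / (alpha + beta)


alpha, beta = 1, 1  # Uniform(0,1) prior, i.e. Beta(1,1)
for o in [1, 1, 0]:  # made-up judgements for one run
    alpha, beta = update(alpha, beta, o)
# Two relevant docs and one non-relevant doc give Beta(3, 2), mean 0.6.
```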
Allocation methods tested: Bayesian bandits
...
we iteratively update our estimations using Bayes:
two strategies to select the next machine:
Bayesian Learning Automaton (BLA): draws a sample from each posterior distribution
and selects the machine yielding the highest sample
MaxMean (MM): selects the machine with the highest expectation of the posterior distribution
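Both strategies can be sketched over a list of per-machine `(alpha, beta)` pairs. The function names and the example posteriors are hypothetical; `random.betavariate` from the Python standard library is used to sample the Beta posteriors.

```python
import random


def select_bla(posteriors, rng=random):
    """BLA: sample each Beta posterior, play the machine with the top sample."""
    samples = [rng.betavariate(a, b) for a, b in posteriors]
    return max(range(len(samples)), key=samples.__getitem__)


def select_mm(posteriors):
    """MM: play the machine with the highest posterior mean a / (a + b)."""
    means = [a / (a + b) for a, b in posteriors]
    return max(range(len(means)), key=means.__getitem__)


posteriors = [(3, 2), (1, 4), (8, 1)]  # hypothetical (alpha, beta) per run
```

MM is purely greedy, while BLA still explores: a machine with a wide posterior can occasionally yield the highest sample.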
experiments
test different document adjudication strategies in terms of how quickly they find the relevant docs in the pool
# rel docs found at diff. number of judgements performed
experiments: data
experiments: baselines
DocId: sorts by Doc Id
pool: WSJ13, WSJ17, AP567, WSJ19, AP111, CR93E, ZF207, AP881, FT967, ...
judging order: AP111, AP567, AP881, CR93E, FT967, WSJ13, ...
experiments: baselines
Rank: rank #1 docs go 1st, then rank #2 docs, ...
runs:
1. WSJ13, 2. WSJ17, ..., 100. AP567
1. FT941, 2. WSJ13, ..., 100. WSJ19
1. WSJ13, 2. CR93E, ..., 100. AP111
1. ZF207, 2. AP881, ..., 100. FT967
...
judging order: WSJ13, FT941, ZF207, WSJ17, CR93E, AP881 ...
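A possible sketch of the Rank baseline. The slide does not say how ties within the same rank are broken, so visiting the runs in list order is an assumption; the toy runs keep only the top two docs of each run shown on the slide.

```python
def rank_order(runs):
    """Judging order: rank #1 docs of every run, then rank #2 docs, ...

    Duplicates are skipped; within a rank, runs are visited in list order
    (an assumption, the slide does not state a tie-break).
    """
    order, seen = [], set()
    for rank in range(max(len(run) for run in runs)):
        for run in runs:
            if rank < len(run) and run[rank] not in seen:
                seen.add(run[rank])
                order.append(run[rank])
    return order


runs = [
    ["WSJ13", "WSJ17"],
    ["FT941", "WSJ13"],
    ["WSJ13", "CR93E"],
    ["ZF207", "AP881"],
]
order = rank_order(runs)
```

With this run order the sketch yields WSJ13, FT941, ZF207, WSJ17, CR93E, AP881, the same order as the slide's example.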
experiments: baselines
MoveToFront (MTF) (Cormack et al 98)
starts with uniform priorities for all runs (e.g. max priority=100)
selects a random run (from those with max priority)
runs (priority 100 each):
1. WSJ13, 2. WSJ17, 3. AP567, ...
1. FT941, 2. WSJ13, 3. WSJ19, ...
1. WSJ13, 2. CR93E, 3. AP111, ...
1. ZF207, 2. AP881, 3. FT967, ...
...
extracts & judges docs from the selected run
stays in the run until a non-rel doc is found
selected run (priority 100): 1. WSJ13, 2. CR93E, 3. AP111, ...
judged so far: WSJ13, CR93E, AP111
when a non-rel doc is found, priority is decreased (100 → 99)
and we jump again to another max priority run
priorities: 100 100 99 100
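The MTF steps described above can be sketched as follows. This is a reading of the slides, not the original implementation: the function name, the `qrels` lookup (a stand-in for the assessor) and the made-up judgements are assumptions, and details such as skipping docs already judged through another run are choices of this sketch.

```python
import random


def move_to_front(runs, qrels, max_priority=100, rng=random):
    """Return the order in which pooled docs get judged under MTF."""
    priority = [max_priority] * len(runs)
    pos = [0] * len(runs)  # next rank to inspect in each run
    judged, order = {}, []

    def skip_judged(i):
        # Move past docs that were already judged (possibly via another run).
        while pos[i] < len(runs[i]) and runs[i][pos[i]] in judged:
            pos[i] += 1

    while True:
        for i in range(len(runs)):
            skip_judged(i)
        alive = [i for i in range(len(runs)) if pos[i] < len(runs[i])]
        if not alive:
            return order
        top = max(priority[i] for i in alive)
        i = rng.choice([j for j in alive if priority[j] == top])
        # Stay in the selected run until a non-relevant doc shows up.
        while pos[i] < len(runs[i]):
            doc = runs[i][pos[i]]
            judged[doc] = qrels.get(doc, 0)
            order.append(doc)
            skip_judged(i)
            if judged[doc] == 0:
                priority[i] -= 1  # penalize the run, then jump elsewhere
                break


runs = [
    ["WSJ13", "WSJ17", "AP567"],
    ["FT941", "WSJ13", "WSJ19"],
    ["WSJ13", "CR93E", "AP111"],
    ["ZF207", "AP881", "FT967"],
]
qrels = {"WSJ13": 1, "CR93E": 1}  # made-up judgements; other docs non-rel
order = move_to_front(runs, qrels, rng=random.Random(0))
```

Every pooled doc is judged exactly once; only the order depends on the priorities and on the random choice among max priority runs.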
experiments: baselines
Moffat et al.'s method (A) (Moffat et al 2007)
based on rank-biased precision (RBP)
sums a rank-dependent score for each doc
rank 1: score 0.20
rank 2: score 0.16
rank 3: score 0.13
...
runs:
1. WSJ13, 2. WSJ17, 3. AP567, ...
1. FT941, 2. WSJ13, 3. WSJ19, ...
1. WSJ13, 2. CR93E, 3. AP111, ...
1. ZF207, 2. AP881, 3. FT967, ...
...
all docs are ranked by decreasing accumulated score
and the ranked list defines the order in which docs are judged
WSJ13: 0.20+0.16+0.20+...
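A small sketch of the accumulated score, assuming the usual RBP weight (1 - p) * p^(rank - 1). The value p = 0.8 is inferred from the scores on the slide (0.20, 0.16, 0.13) rather than stated there, and the tie-break by doc id is an arbitrary choice of this sketch.

```python
from collections import defaultdict


def method_a_order(runs, p=0.8):
    """Rank pooled docs by accumulated RBP weight (1 - p) * p**(rank - 1)."""
    score = defaultdict(float)
    for run in runs:
        for i, doc in enumerate(run):  # i = rank - 1
            score[doc] += (1 - p) * p ** i
    # Highest accumulated score first; ties broken by doc id (assumption).
    order = sorted(score, key=lambda doc: (-score[doc], doc))
    return order, dict(score)


runs = [
    ["WSJ13", "WSJ17", "AP567"],
    ["FT941", "WSJ13", "WSJ19"],
    ["WSJ13", "CR93E", "AP111"],
    ["ZF207", "AP881", "FT967"],
]
order, score = method_a_order(runs)
```

On these truncated runs WSJ13 accumulates 0.20 + 0.16 + 0.20 = 0.56 and is judged first.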
experiments: baselines
Moffat et al.'s method (B) (Moffat et al 2007)
evolution of method A
considers not only the rank-dependent doc contributions but also the runs' residuals
promotes the selection of docs from runs with many unjudged docs
Moffat et al.'s method (C) (Moffat et al 2007)
evolution of method B
considers the rank-dependent doc contributions, the residuals and the effectiveness of the runs
promotes the selection of docs from effective runs
experiments: baselines
MTF: best performing baseline
experiments: MTF vs bandit-based models
Random: weakest approach
BLA/UCB/ϵ_n-greedy are suboptimal
(sophisticated exploration/exploitation trading not needed)
MTF and MM: best performing methods
improved bandit-based models
MTF: forgets quickly about past rewards
(a single non-relevant doc triggers a jump)
non-stationary bandit-based solutions: not all historical rewards count the same
MM-NS and BLA-NS: non-stationary variants of MM and BLA
stationary bandits
prior: Beta(α,β), α,β=1
rel docs add 1 to α
non-rel docs add 1 to β
(after n iterations) Beta(α_n, β_n)
α_n = 1 + jrel_s
β_n = 1 + jret_s - jrel_s
jrel_s: # judged relevant docs (retrieved by s)
jret_s: # judged docs (retrieved by s)
all judged docs count the same
non-stationary bandits
prior: Beta(α,β), α,β=1
jrel_s = rate*jrel_s + rel_d
jret_s = rate*jret_s + 1
(after n iterations) Beta(α_n, β_n)
α_n = 1 + jrel_s
β_n = 1 + jret_s - jrel_s
rate>1: weights more early relevant docs
rate<1: weights more late relevant docs
rate=0: only the last judged doc counts (BLA-NS, MM-NS)
rate=1: stationary version
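The discounted update can be sketched as below. Here `rel_d` is taken to be the binary relevance (0/1) of the doc just judged, which is an interpretation of the slide's notation; the function names and the judgement history are made up.

```python
def ns_update(jrel, jret, rel_d, rate):
    """Discount past counts by `rate`, then add the newest judgement."""
    return rate * jrel + rel_d, rate * jret + 1


def beta_params(jrel, jret):
    """Posterior Beta(alpha_n, beta_n) from the (discounted) counts."""
    return 1 + jrel, 1 + jret - jrel


def run_history(judgements, rate):
    """Apply the update to a sequence of 0/1 judgements for one run."""
    jrel, jret = 0.0, 0.0
    for rel_d in judgements:
        jrel, jret = ns_update(jrel, jret, rel_d, rate)
    return beta_params(jrel, jret)


history = [1, 1, 1, 0]  # made-up: three relevant docs, then a non-relevant one
stationary = run_history(history, rate=1)  # (4.0, 2.0)
forgetful = run_history(history, rate=0)   # (1.0, 2.0)
```

With rate=1 the run still looks good after the non-relevant doc (posterior mean 4/6), while with rate=0 only the last judgement counts and the mean drops to 1/3, which mimics MTF's quick jump.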
experiments: improved bandit-based models
conclusions
multi-armed bandits: formal & effective framework for
doc adjudication in a pooling-based evaluation
it's not good to increasingly reduce exploration
(UCB, ϵ_n-greedy)
it's good to react quickly to non-relevant docs
(non-stationary variants)
future work
query-related variabilities
hierarchical bandits
stopping criteria
metasearch
reproduce our experiments & test new ideas!
http://tec.citius.usc.es/ir/code/pooling_bandits.html
(our R code, instructions, etc)
Acknowledgements:
MsSaraKelly. picture pg 1 (modified).CC BY 2.0.
Sanofi Pasteur. picture pg 2 (modified).CC BY-NC-ND 2.0.
pedrik. picture pgs 3-5.CC BY 2.0.
Christa Lohman. picture pg 3 (left).CC BY-NC-ND 2.0.
Chris. picture pg 4 (tag cloud).CC BY 2.0.
Daniel Horacio Agostini. picture pg 5 (right).CC BY-NC-ND 2.0.
ScaarAT. picture pg 14.CC BY-NC-ND 2.0.
Sebastien Wiertz. picture pg 15 (modified).CC BY 2.0.
Willard. picture pg 16 (modified).CC BY-NC-ND 2.0.
Jose Luis Cernadas Iglesias. picture pg 17 (modified).CC BY 2.0.
Michelle Bender. picture pg 25 (left).CC BY-NC-ND 2.0.
Robert Levy. picture pg 25 (right).CC BY-NC-ND 2.0.
Simply Swim UK. picture pg 37.CC BY-SA 2.0.
Sarah J. Poe. picture pg 55.CC BY-ND 2.0.
Kate Brady. picture pg 58.CC BY 2.0.
August Brill. picture pg 59.CC BY 2.0.
This work was supported by the
“Ministerio de Economía y Competitividad”
of the Government of Spain and
FEDER Funds under
research projects
TIN2012-33867 and TIN2015-64282-R.

More Related Content

Viewers also liked

Predictive Modeling in Underwriting
Predictive Modeling in UnderwritingPredictive Modeling in Underwriting
Predictive Modeling in Underwriting
Kevin Pledge
 
Uplift Modeling Workshop
Uplift Modeling WorkshopUplift Modeling Workshop
Uplift Modeling Workshop
odsc
 
Advanced Pricing in General Insurance
Advanced Pricing in General InsuranceAdvanced Pricing in General Insurance
Advanced Pricing in General Insurance
Syed Danish Ali
 
Actuarial Analytics in R
Actuarial Analytics in RActuarial Analytics in R
Actuarial Analytics in R
Revolution Analytics
 
Princing insurance contracts with R
Princing insurance contracts with RPrincing insurance contracts with R
Princing insurance contracts with R
Giorgio Alfredo Spedicato
 
Insurance pricing
Insurance pricingInsurance pricing
Insurance pricing
Lincy PT
 

Viewers also liked (6)

Predictive Modeling in Underwriting
Predictive Modeling in UnderwritingPredictive Modeling in Underwriting
Predictive Modeling in Underwriting
 
Uplift Modeling Workshop
Uplift Modeling WorkshopUplift Modeling Workshop
Uplift Modeling Workshop
 
Advanced Pricing in General Insurance
Advanced Pricing in General InsuranceAdvanced Pricing in General Insurance
Advanced Pricing in General Insurance
 
Actuarial Analytics in R
Actuarial Analytics in RActuarial Analytics in R
Actuarial Analytics in R
 
Princing insurance contracts with R
Princing insurance contracts with RPrincing insurance contracts with R
Princing insurance contracts with R
 
Insurance pricing
Insurance pricingInsurance pricing
Insurance pricing
 

Similar to Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation

Anova.ppt
Anova.pptAnova.ppt
Anova.ppt
satyamsk
 
Simple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimizationSimple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimization
Olivier Teytaud
 
cs-171-07-Games and Adversarila Search.ppt
cs-171-07-Games and Adversarila Search.pptcs-171-07-Games and Adversarila Search.ppt
cs-171-07-Games and Adversarila Search.ppt
Samiksha880257
 
Main Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docxMain Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docx
infantsuk
 
ch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.pptch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.ppt
SanGeet25
 
Week8 Live Lecture for Final Exam
Week8 Live Lecture for Final ExamWeek8 Live Lecture for Final Exam
Week8 Live Lecture for Final Exam
Brent Heard
 
Probability unit2.pptx
Probability unit2.pptxProbability unit2.pptx
Probability unit2.pptx
SNIGDHABADIDA2127755
 
GA.pptx
GA.pptxGA.pptx
Final examexamplesapr2013
Final examexamplesapr2013Final examexamplesapr2013
Final examexamplesapr2013
Brent Heard
 
Memorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsMemorization of Various Calculator shortcuts
Memorization of Various Calculator shortcuts
PrincessNorberte
 
Computational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding RegionsComputational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding Regions
butest
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
jfs-masters-1
jfs-masters-1jfs-masters-1
jfs-masters-1
James Swafford
 
Data classification sammer
Data classification sammer Data classification sammer
Data classification sammer
Sammer Qader
 
Lab23 chisquare2007
Lab23 chisquare2007Lab23 chisquare2007
Lab23 chisquare2007
sbarkanic
 
blast and fasta
 blast and fasta blast and fasta
blast and fasta
Nagendrasahu6
 
Minmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptxMinmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptx
PriyadharshiniG41
 
Statistical tests
Statistical testsStatistical tests
Statistical tests
martyynyyte
 
Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)
Eric Choi
 
Ch. 11 Simulations Good
Ch. 11 Simulations GoodCh. 11 Simulations Good
Ch. 11 Simulations Good
christjt
 

Similar to Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation (20)

Anova.ppt
Anova.pptAnova.ppt
Anova.ppt
 
Simple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimizationSimple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimization
 
cs-171-07-Games and Adversarila Search.ppt
cs-171-07-Games and Adversarila Search.pptcs-171-07-Games and Adversarila Search.ppt
cs-171-07-Games and Adversarila Search.ppt
 
Main Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docxMain Task Submit the Following 1. Calculate the sample size.docx
Main Task Submit the Following 1. Calculate the sample size.docx
 
ch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.pptch_5 Game playing Min max and Alpha Beta pruning.ppt
ch_5 Game playing Min max and Alpha Beta pruning.ppt
 
Week8 Live Lecture for Final Exam
Week8 Live Lecture for Final ExamWeek8 Live Lecture for Final Exam
Week8 Live Lecture for Final Exam
 
Probability unit2.pptx
Probability unit2.pptxProbability unit2.pptx
Probability unit2.pptx
 
GA.pptx
GA.pptxGA.pptx
GA.pptx
 
Final examexamplesapr2013
Final examexamplesapr2013Final examexamplesapr2013
Final examexamplesapr2013
 
Memorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsMemorization of Various Calculator shortcuts
Memorization of Various Calculator shortcuts
 
Computational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding RegionsComputational Biology, Part 4 Protein Coding Regions
Computational Biology, Part 4 Protein Coding Regions
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
jfs-masters-1
jfs-masters-1jfs-masters-1
jfs-masters-1
 
Data classification sammer
Data classification sammer Data classification sammer
Data classification sammer
 
Lab23 chisquare2007
Lab23 chisquare2007Lab23 chisquare2007
Lab23 chisquare2007
 
blast and fasta
 blast and fasta blast and fasta
blast and fasta
 
Minmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptxMinmax and alpha beta pruning.pptx
Minmax and alpha beta pruning.pptx
 
Statistical tests
Statistical testsStatistical tests
Statistical tests
 
Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)Data Analytics Project_Eun Seuk Choi (Eric)
Data Analytics Project_Eun Seuk Choi (Eric)
 
Ch. 11 Simulations Good
Ch. 11 Simulations GoodCh. 11 Simulations Good
Ch. 11 Simulations Good
 

Recently uploaded

Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 

Recently uploaded (20)

Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 

Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation

  • 1. Feeling Lucky? Multi-armed bandits for Ordering Judgements in Pooling-based Evaluation David E. Losada Javier Parapar, Álvaro Barreiro ACM SAC, 2016
  • 2. Evaluation is crucial compare retrieval algorithms, design new search solutions, ...
  • 3. information retrieval evaluation: 3 main ingredients docs
  • 4. information retrieval evaluation: 3 main ingredients queries
  • 5. information retrieval evaluation: 3 main ingredients relevance judgements
  • 7. relevance assessments are incomplete ... search system 1 search system 2 search system 3 search system n
  • 8. relevance assessments are incomplete ... search system 1 search system 2 search system 3 search system n
  • 9. relevance assessments are incomplete 1. WSJ13 2. WSJ17 . . 100. AP567 101. AP555 . . . 1. FT941 2. WSJ13 . . 100. WSJ19 101. AP555 . . . 1. ZF207 2. AP881 . . 100. FT967 101. AP555 . . . 1. WSJ13 2. CR93E . . 100. AP111 101. AP555 . . . ... rankings of docs by estimated relevance (runs)
  • 10. relevance assessments are incomplete 1. WSJ13 2. WSJ17 . . 100. AP567 101. AP555 . . . 1. FT941 2. WSJ13 . . 100. WSJ19 101. AP555 . . . 1. ZF207 2. AP881 . . 100. FT967 101. AP555 . . . 1. WSJ13 2. CR93E . . 100. AP111 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs)
  • 11. relevance assessments are incomplete 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs)
  • 12. relevance assessments are incomplete 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs) WSJ13 WSJ17 AP567 WSJ19AP111 CR93E ZF207AP881FT967 pool ...
  • 13. relevance assessments are incomplete 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . 101. AP555 . . . ... pool depth rankings of docs by estimated relevance (runs) WSJ13 WSJ17 AP567 WSJ19AP111 CR93E ZF207AP881FT967 pool ... human assessments
  • 14. finding relevant docs is the key Most productive use of assessors' time is spent on judging relevant docs (Sanderson & Zobel, 2005)
  • 15. Effective adjudication methods Give priority to pooled docs that are potentially relevant Can signifcantly reduce the num. of judgements required to identify a given num. of relevant docs But most existing methods are adhoc...
  • 16. Our main idea... Cast doc adjudication as a reinforcement learning problem Doc judging is an iterative process where we learn as judgements come in
  • 17. Doc adjudication as a reinforcement learning problem Initially we know nothing about the quality of the runs ? ? ? ?... As judgements come in... And we can adapt and allocate more docs for judgement from the most promising runs
  • 19. Multi-armed bandits: machines with unknown probabilities of giving a prize; play and observe the reward
  • 25. exploration vs exploitation. Exploitation: exploits current knowledge; spends no time sampling inferior actions; maximizes expected reward on the next action. Exploration: explores uncertain actions; gets more info about expected payoffs; may produce greater total reward in the long run. Allocation methods: choose next action (play) based on past plays and obtained rewards; implement different ways to trade between exploration and exploitation.
  • 26. Multi-armed bandits for ordering judgements. machines = runs; play a machine = select a run and get the next (unjudged) doc (e.g. from the run 1. WSJ13, 2. CR93E, ...); (binary) reward = relevance/non-relevance of the selected doc
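The mapping on this slide (machine = run, play = next unjudged doc, reward = relevance) can be illustrated with a minimal Python sketch. The function names `play` and `reward`, the toy runs and the toy judgements `qrels` are illustrative assumptions, not part of the original slides or of the authors' R code.

```python
from typing import Dict, List, Optional, Set


def play(run: List[str], judged: Set[str]) -> Optional[str]:
    """Play a machine: return the next doc of `run` not judged yet (None if exhausted)."""
    for doc in run:
        if doc not in judged:
            return doc
    return None


def reward(doc: str, qrels: Dict[str, int]) -> int:
    """Binary reward: 1 if the doc is judged relevant, 0 otherwise."""
    return 1 if qrels.get(doc, 0) > 0 else 0


# Toy data (hypothetical): two runs and a few relevance judgements.
runs = [["WSJ13", "CR93E", "AP111"], ["FT941", "WSJ13", "WSJ19"]]
qrels = {"WSJ13": 1, "WSJ19": 1}
judged: Set[str] = set()

first_doc = play(runs[0], judged)      # top doc of the first run
judged.add(first_doc)
first_reward = reward(first_doc, qrels)

second_doc = play(runs[1], judged)     # top doc of the second run
judged.add(second_doc)
third_doc = play(runs[1], judged)      # WSJ13 is already judged, so it is skipped
```

Because a doc can be retrieved by several runs, playing a machine has to skip docs that were already judged through another run.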
  • 27. Allocation methods tested. Random. ϵn-greedy: with prob 1-ϵ plays the machine with the highest avg reward; with prob ϵ plays a random machine; prob of exploration (ϵ) decreases with the num. of plays. Upper Confidence Bound (UCB): computes upper confidence bounds for avg rewards; conf. intervals get narrower with the number of plays; selects the machine with the highest optimistic estimate.
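A rough Python sketch of the two non-Bayesian selection rules follows. The slides do not give the exact formulas, so the decay schedule `min(1, c*K/(d*d*n))` and the UCB1 bonus `sqrt(2 ln n / n_j)` (both from Auer et al.'s 2002 bandit analysis) are assumptions here, as are the function names and the parameters `c` and `d`.

```python
import math
import random
from typing import List


def epsilon_n(n: int, num_arms: int, c: float = 1.0, d: float = 1.0) -> float:
    """Exploration probability after n plays; it shrinks as n grows (assumed schedule)."""
    return min(1.0, c * num_arms / (d * d * max(n, 1)))


def eps_greedy_choice(means: List[float], n: int, rng: random.Random) -> int:
    """With prob eps play a random machine, otherwise the one with the highest avg reward."""
    if rng.random() < epsilon_n(n, len(means)):
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda j: means[j])


def ucb1_choice(means: List[float], plays: List[int], n: int) -> int:
    """Play the machine with the highest optimistic estimate (avg reward + confidence bonus)."""
    for j, n_j in enumerate(plays):
        if n_j == 0:          # every machine is played once before bounds are compared
            return j
    return max(
        range(len(means)),
        key=lambda j: means[j] + math.sqrt(2.0 * math.log(n) / plays[j]),
    )
```

With equal average rewards, UCB prefers the machine that has been played less, because its confidence interval is still wide.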
  • 28. Allocation methods tested: Bayesian bandits. Prior probabilities of giving a relevant doc: Uniform(0,1) (or, equivalently, Beta(α,β), α,β=1), one U(0,1) prior per machine. Evidence (O ∈ {0,1}) is Bernoulli (or, equivalently, Binomial(1,p)). Posterior probabilities of giving a relevant doc: Beta(α+O, β+1-O) (Beta: conjugate prior for Binomial).
  • 29. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes:
  • 36. Allocation methods tested: Bayesian bandits ... we iteratively update our estimations using Bayes. Two strategies to select the next machine: Bayesian Learning Automaton (BLA): draws a sample from each posterior distribution and selects the machine yielding the highest sample. MaxMean (MM): selects the machine with the highest expectation of the posterior distribution.
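The Beta update and the two selection strategies can be sketched in a few lines of Python. This is only an illustration under the Beta(1,1) prior stated on the slides; the function names are made up for this example and the authors' own implementation is in R.

```python
import random
from typing import List, Tuple

Beta = Tuple[float, float]  # (alpha, beta) parameters of one machine's posterior


def update(alpha: float, beta: float, o: int) -> Beta:
    """Bayes update after observing o in {0,1}: Beta(alpha + o, beta + 1 - o)."""
    return alpha + o, beta + 1 - o


def mm_choice(params: List[Beta]) -> int:
    """MaxMean: machine with the highest posterior mean alpha / (alpha + beta)."""
    return max(range(len(params)), key=lambda j: params[j][0] / sum(params[j]))


def bla_choice(params: List[Beta], rng: random.Random) -> int:
    """BLA: draw one sample from each posterior and pick the highest sample."""
    samples = [rng.betavariate(a, b) for a, b in params]
    return max(range(len(samples)), key=lambda j: samples[j])


# Two machines, both starting from the uniform prior Beta(1,1).
params: List[Beta] = [(1.0, 1.0), (1.0, 1.0)]
params[0] = update(*params[0], 1)   # machine 0 gave a relevant doc
params[1] = update(*params[1], 0)   # machine 1 gave a non-relevant doc
```

MM is purely exploitative given the current posteriors, while BLA keeps exploring because a machine with a lower mean can still yield the highest sample.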
  • 37. experiments: test different document adjudication strategies in terms of how quickly they find the relevant docs in the pool. Measure: # rel docs found at diff. numbers of judgements performed
  • 39. experiments: baselines. DocId: sorts by Doc Id. Pool: WSJ13, WSJ17, AP567, WSJ19, AP111, CR93E, ZF207, AP881, FT967, ... Resulting order: AP111, AP881, AP567, CR93E, FT967, WSJ13, ...
  • 40. experiments: baselines. Rank: rank #1 docs go 1st, then rank #2 docs, ... Runs: [1. WSJ13, 2. WSJ17, ..., 100. AP567]; [1. FT941, 2. WSJ13, ..., 100. WSJ19]; [1. WSJ13, 2. CR93E, ..., 100. AP111]; [1. ZF207, 2. AP881, ..., 100. FT967]; ... Resulting order: WSJ13, FT941, ZF207, WSJ17, CR93E, AP881 ...
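The Rank baseline can be sketched as a level-by-level traversal of the runs. The slides do not say how docs at the same rank are ordered among themselves, so visiting the runs in list order is an assumption here, and `rank_order` is a hypothetical helper name.

```python
from typing import List


def rank_order(runs: List[List[str]], depth: int) -> List[str]:
    """Order pooled docs by rank: all rank #1 docs first, then rank #2 docs, and so on.

    Docs retrieved by several runs are kept only at their first appearance.
    Within one rank, runs are visited in list order (an assumption).
    """
    seen = set()
    order: List[str] = []
    for r in range(depth):
        for run in runs:
            if r < len(run) and run[r] not in seen:
                seen.add(run[r])
                order.append(run[r])
    return order


# Top-2 docs of four runs, using the doc ids from the slide.
toy_runs = [
    ["WSJ13", "WSJ17"],
    ["FT941", "WSJ13"],
    ["ZF207", "AP881"],
    ["WSJ13", "CR93E"],
]
toy_order = rank_order(toy_runs, depth=2)
```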
  • 41. experiments: baselines. MoveToFront (MTF) (Cormack et al 98): starts with uniform priorities for all runs (e.g. max priority=100); selects a random run (from those with max priority). Runs: [1. WSJ13, 2. WSJ17, 3. AP567, ...]; [1. FT941, 2. WSJ13, 3. WSJ19, ...]; [1. WSJ13, 2. CR93E, 3. AP111, ...]; [1. ZF207, 2. AP881, 3. FT967, ...]; ... Priorities: 100, 100, 100, 100
  • 43. experiments: baselines. MoveToFront (MTF) (Cormack et al 98): extracts & judges docs from the selected run; stays in the run until a non-rel doc is found. Selected run: [1. WSJ13, 2. CR93E, 3. AP111, ...], priority 100
  • 44. experiments: baselines. MoveToFront (MTF) (Cormack et al 98): extracts & judges docs from the selected run; stays in the run until a non-rel doc is found. Selected run: [1. WSJ13, 2. CR93E, 3. AP111, ...], priority 100. Docs extracted so far: WSJ13
  • 45. experiments: baselines. MoveToFront (MTF) (Cormack et al 98): extracts & judges docs from the selected run; stays in the run until a non-rel doc is found. Selected run: [1. WSJ13, 2. CR93E, 3. AP111, ...], priority 100. Docs extracted so far: WSJ13, CR93E
  • 46. experiments: baselines. MoveToFront (MTF) (Cormack et al 98): extracts & judges docs from the selected run; stays in the run until a non-rel doc is found. Selected run: [1. WSJ13, 2. CR93E, 3. AP111, ...], priority 100. Docs extracted so far: WSJ13, CR93E, AP111
  • 47. experiments: baselines. MoveToFront (MTF) (Cormack et al 98): extracts & judges docs from the selected run; stays in the run until a non-rel doc is found; when a non-rel doc is found, priority is decreased (100 → 99). Selected run: [1. WSJ13, 2. CR93E, 3. AP111, ...]. Docs extracted so far: WSJ13, CR93E, AP111
  • 48. experiments: baselines. MoveToFront (MTF) (Cormack et al 98): and we jump again to another max priority run. Runs: [1. WSJ13, 2. WSJ17, 3. AP567, ...]; [1. FT941, 2. WSJ13, 3. WSJ19, ...]; [1. WSJ13, 2. CR93E, 3. AP111, ...]; [1. ZF207, 2. AP881, 3. FT967, ...]; ... The run just left now has priority 99; the others keep priority 100.
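The MTF walkthrough above can be condensed into a short Python sketch. It follows only what the slides state (uniform starting priorities, random choice among max-priority runs, stay until a non-relevant doc, then decrease the priority); the function names, the `budget` parameter and the handling of exhausted runs are assumptions made for this example.

```python
import random
from typing import Dict, List, Optional, Set, Tuple


def next_unjudged(run: List[str], seen: Set[str]) -> Optional[str]:
    """First doc of the run that has not been judged yet (None if exhausted)."""
    for doc in run:
        if doc not in seen:
            return doc
    return None


def move_to_front(
    runs: List[List[str]],
    qrels: Dict[str, int],
    budget: int,
    rng: random.Random,
    max_priority: int = 100,
) -> Tuple[List[str], List[int]]:
    """Return the judging order produced by MTF and the final run priorities."""
    priority = [max_priority] * len(runs)
    seen: Set[str] = set()
    order: List[str] = []
    while len(order) < budget:
        active = [i for i, run in enumerate(runs) if next_unjudged(run, seen) is not None]
        if not active:
            break  # the whole pool has been judged
        top = max(priority[i] for i in active)
        current = rng.choice([i for i in active if priority[i] == top])
        # Stay in the selected run until a non-relevant doc is found.
        while len(order) < budget:
            doc = next_unjudged(runs[current], seen)
            if doc is None:
                break
            seen.add(doc)
            order.append(doc)
            if qrels.get(doc, 0) == 0:
                priority[current] -= 1  # penalise the run and jump to another one
                break
    return order, priority
```

For the run on the slides, with WSJ13 and CR93E relevant and AP111 non-relevant (hypothetical judgements), the sketch extracts the three docs in that order and lowers the run's priority from 100 to 99.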
  • 49. experiments: baselines. Moffat et al.'s method (A) (Moffat et al 2007): based on rank-biased precision (RBP); sums a rank-dependent score for each doc. Score by rank: rank 1: 0.20, rank 2: 0.16, rank 3: 0.13, ... Runs: [1. WSJ13, 2. WSJ17, 3. AP567, ...]; [1. FT941, 2. WSJ13, 3. WSJ19, ...]; [1. WSJ13, 2. CR93E, 3. AP111, ...]; [1. ZF207, 2. AP881, 3. FT967, ...]; ...
  • 50. experiments: baselines. Moffat et al.'s method (A) (Moffat et al 2007): all docs are ranked by decreasing accumulated score and the ranked list defines the order in which docs are judged. E.g. WSJ13: 0.20+0.16+0.20+...
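The score accumulation of method A can be sketched as follows. The rank scores on the slide (0.20, 0.16, 0.13) are consistent with the usual RBP weight (1-p)·p^(rank-1) at p = 0.8, so that weight and that value of p are assumed here; the function names and the alphabetical tie-break are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rbp_weight(rank: int, p: float = 0.8) -> float:
    """Rank-dependent score of a doc at a 1-based rank (assumed RBP weight)."""
    return (1.0 - p) * p ** (rank - 1)


def method_a_order(runs: List[List[str]], p: float = 0.8) -> Tuple[List[str], Dict[str, float]]:
    """Sum each doc's rank-dependent scores over all runs and sort by decreasing total."""
    score: Dict[str, float] = defaultdict(float)
    for run in runs:
        for rank, doc in enumerate(run, start=1):
            score[doc] += rbp_weight(rank, p)
    # Ties are broken alphabetically by doc id (an arbitrary choice for this sketch).
    order = sorted(score, key=lambda d: (-score[d], d))
    return order, dict(score)


# Top-3 docs of the four runs shown on the slide.
slide_runs = [
    ["WSJ13", "WSJ17", "AP567"],
    ["FT941", "WSJ13", "WSJ19"],
    ["ZF207", "AP881", "FT967"],
    ["WSJ13", "CR93E", "AP111"],
]
judging_order, totals = method_a_order(slide_runs)
```

On these truncated runs WSJ13 accumulates 0.20 + 0.16 + 0.20 = 0.56, as in the slide's example, and is therefore judged first.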
  • 51. experiments: baselines. Moffat et al.'s method (B) (Moffat et al 2007): evolution over method A; considers not only the rank-dependent doc contributions but also the runs' residuals; promotes the selection of docs from runs with many unjudged docs. Moffat et al.'s method (C) (Moffat et al 2007): evolution over method B; considers the rank-dependent doc contributions and the residuals, and promotes the selection of docs from effective runs.
  • 52. experiments: baselines MTF: best performing baseline
  • 53. experiments: MTF vs bandit-based models
  • 54. experiments: MTF vs bandit-based models. Random: weakest approach. BLA/UCB/ϵn-greedy are suboptimal (sophisticated exploration/exploitation trading not needed). MTF and MM: best performing methods.
  • 55. improved bandit-based models. MTF: forgets quickly about past rewards (a single non-relevant doc triggers a jump). Non-stationary bandit-based solutions: not all historical rewards count the same. MM-NS and BLA-NS: non-stationary variants of MM and BLA.
  • 56. stationary bandits: prior Beta(α,β), α,β=1; rel docs add 1 to α; non-rel docs add 1 to β; (after n iterations) Beta(α_n, β_n), α_n = 1 + jrel_s, β_n = 1 + jret_s - jrel_s; jrel_s: # judged relevant docs (retrieved by s); jret_s: # judged docs (retrieved by s); all judged docs count the same. non-stationary bandits: prior Beta(α,β), α,β=1; jrel_s = rate*jrel_s + rel_d; jret_s = rate*jret_s + 1; (after n iterations) Beta(α_n, β_n), α_n = 1 + jrel_s, β_n = 1 + jret_s - jrel_s; rate>1: weights more early relevant docs; rate<1: weights more late relevant docs; rate=0: only the last judged doc counts (BLA-NS, MM-NS); rate=1: stationary version
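The effect of `rate` is easy to check with a small Python sketch of the update rule on this slide. The function names and the toy judgement sequence are assumptions for illustration; only the two update equations and the Beta parameters come from the slide.

```python
from typing import List, Tuple


def ns_update(jrel: float, jret: float, rel: int, rate: float) -> Tuple[float, float]:
    """One non-stationary update for run s after judging a doc with relevance rel in {0,1}."""
    return rate * jrel + rel, rate * jret + 1


def posterior(jrel: float, jret: float) -> Tuple[float, float]:
    """Beta parameters: alpha_n = 1 + jrel_s, beta_n = 1 + jret_s - jrel_s."""
    return 1 + jrel, 1 + jret - jrel


def replay(judgements: List[int], rate: float) -> Tuple[float, float]:
    """Apply the update to a sequence of judgements and return the resulting Beta parameters."""
    jrel, jret = 0.0, 0.0
    for rel in judgements:
        jrel, jret = ns_update(jrel, jret, rel, rate)
    return posterior(jrel, jret)


sequence = [1, 1, 0]                          # two relevant docs, then a non-relevant one
stationary = replay(sequence, rate=1.0)       # all judged docs count the same
last_only = replay(sequence, rate=0.0)        # only the last judged doc counts
```

With rate=1 the run keeps a posterior mean of 3/5 after the non-relevant doc, whereas with rate=0 the mean drops to 1/3, which mirrors how quickly MTF reacts to a single non-relevant doc.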
  • 58. conclusions. Multi-armed bandits: formal & effective framework for doc adjudication in a pooling-based evaluation. It's not good to increasingly reduce exploration (UCB, ϵn-greedy). It's good to react quickly to non-relevant docs (non-stationary variants).
  • 60. reproduce our experiments & test new ideas! http://tec.citius.usc.es/ir/code/pooling_bandits.html (our R code, instructions, etc)
  • 61. David E. Losada Javier Parapar, Álvaro Barreiro Feeling Lucky? Multi-armed bandits for Ordering Judgements in Pooling-based Evaluation Acknowledgements: MsSaraKelly. picture pg 1 (modified).CC BY 2.0. Sanofi Pasteur. picture pg 2 (modified).CC BY-NC-ND 2.0. pedrik. picture pgs 3-5.CC BY 2.0. Christa Lohman. picture pg 3 (left).CC BY-NC-ND 2.0. Chris. picture pg 4 (tag cloud).CC BY 2.0. Daniel Horacio Agostini. picture pg 5 (right).CC BY-NC-ND 2.0. ScaarAT. picture pg 14.CC BY-NC-ND 2.0. Sebastien Wiertz. picture pg 15 (modified).CC BY 2.0. Willard. picture pg 16 (modified).CC BY-NC-ND 2.0. Jose Luis Cernadas Iglesias. picture pg 17 (modified).CC BY 2.0. Michelle Bender. picture pg 25 (left).CC BY-NC-ND 2.0. Robert Levy. picture pg 25 (right).CC BY-NC-ND 2.0. Simply Swim UK. picture pg 37.CC BY-SA 2.0. Sarah J. Poe. picture pg 55.CC BY-ND 2.0. Kate Brady. picture pg 58.CC BY 2.0. August Brill. picture pg 59.CC BY 2.0. This work was supported by the “Ministerio de Economía y Competitividad” of the Government of Spain and FEDER Funds under research projects TIN2012-33867 and TIN2015-64282-R.