BANDIT ALGORITHMS
Max Pagels, Machine Learning Specialist
max.pagels@sc5.io, @maxpagels
Understanding the world using reinforcement learning
DATA & AI
ABOUT ME
MAX PAGELS
MSc, BSc, Computer Science (University of Helsinki),
former CS researcher
Currently a machine learning specialist at Helsinki-
based software consultancy SC5, working on applied AI
& cloud-native solutions
Specific areas of interest include probabilistic
supervised learning, deep neural networks and
(algorithmic) complexity theory
Other hats I sometimes wear: full-stack developer,
technical interviewer, architect
TRADITIONAL PROGRAMMING, IN A NUTSHELL
INPUT -> ALGORITHM -> OUTPUT
MACHINE LEARNING, IN A NUTSHELL
INPUT & OUTPUT -> LEARNING ALGORITHM -> PROGRAM
REINFORCEMENT LEARNING, IN A NUTSHELL
AGENT -> ACTION -> WORLD
WORLD -> OBSERVATION + REWARD -> AGENT
LEARNING HOW TO PLAY TETRIS
AGENT -> ACTION (Rotate/Up/Down/Left/Right etc.) -> WORLD
WORLD -> OBSERVATION + REWARD (score since previous action) -> AGENT
LEARNING HOW TO PLAY TETRIS
AGENT -> ACTION (Rotate/Up/Down/Left/Right etc.) -> WORLD
WORLD -> STATE (if fully observable) + REWARD (score since previous action) -> AGENT
IF REWARDS AREN’T DELAYED, IT’S A BANDIT PROBLEM
THE (STOCHASTIC)
MULTI-ARMED
BANDIT PROBLEM
Let’s say you are in a Las Vegas casino, and
there are three vacant slot machines (one-
armed bandits). Pulling the arm of a machine
yields either $0 or $1.
Given these three bandit machines, each with
an unknown payoff strategy, how should we
choose which arm to pull to maximise* the
expected reward over time?



* Equivalently, we want to minimise the expected
regret over time (minimise how much money we
lose).
FORMALLY(ISH)
Problem
We have three one-armed bandits. Each arm a
yields a reward of $1 according to some fixed
unknown probability P_a, or a reward of $0 with
probability 1 - P_a.
Objective
Find P_a, ∀a ∈ {1,2,3}; then always play argmax_a P_a
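To make the setup concrete, here is a minimal simulation sketch in Python. The win probabilities below are made up for illustration; in the real problem they are unknown to the player.

import random

# Hypothetical win probabilities for the three arms (unknown to the player)
TRUE_PROBS = [0.02, 0.21, 0.17]

def pull(arm):
    """Pull one arm: returns $1 with probability P_a, otherwise $0."""
    return 1 if random.random() < TRUE_PROBS[arm] else 0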
OBSERVATION #1
This is a partial information problem: when we
pull an arm, we don’t get to see the rewards of
the arms we didn’t pull.
(For math aficionados: a bandit problem is an
MDP with a single, terminal, state).
OBSERVATION #2
We need to create some approximation, or best
guess, of how the world works in order to
maximise our reward over time.
We need to create a model.
“All models are wrong but some are useful” —
George Box
OBSERVATION #3
Clearly, we need to try (explore) the arms of all
the machines, to get a sense of which one is
best.
OBSERVATION #4
Though exploration is necessary, we also need
to choose the best arm as much as possible
(exploit) to maximise our reward over time.
THE EXPLORE/EXPLOIT TUG OF WAR
< ACQUIRING KNOWLEDGE MAXIMISING REWARD >
HOW DO WE SOLVE THE BANDIT PROBLEM?
A NAÏVE ALGORITHM
Step 1: For the first N pulls
Evenly pull all the arms, keeping track of the
number of pulls & wins per arm
Step 2: For the remaining M pulls
Choose the arm with the highest expected
reward (i.e. mean reward)
A NAÏVE ALGORITHM
Step 1: For the first N pulls
Evenly pull all the arms, keeping track of the
number of pulls & wins per arm
Step 2: For the remaining M pulls
Choose the arm with the highest expected
reward (i.e. mean reward)
Arm 1: {trials: 100, wins: 2}
Arm 2: {trials: 100, wins: 21}
Arm 3: {trials: 100, wins: 17}
Arm 1: 2/100 = 0.02
Arm 2: 21/100 = 0.21
Arm 3: 17/100 = 0.17
A NAÏVE ALGORITHM
Step 1: For the first N pulls
Evenly pull all the arms, keeping track of the
number of pulls & wins per arm
Step 2: For the remaining M pulls
Choose the arm with the highest expected
reward (i.e. mean reward)
Arm 1: {trials: 100, wins: 2}
Arm 2: {trials: 100, wins: 21}
Arm 3: {trials: 100, wins: 17}
Arm 1: 2/100 = 0.02
Arm 2: 21/100 = 0.21
Arm 3: 17/100 = 0.17
Explore
Exploit
EPOCH-GREEDY [1]
Step 1: For the first N pulls
Evenly pull all the arms, keeping track of the
number of pulls & wins per arm
Step 2: For the remaining M pulls
Choose the arm with the highest expected
reward (i.e. mean reward)
Arm 1: {trials: 100, wins: 2}
Arm 2: {trials: 100, wins: 21}
Arm 3: {trials: 100, wins: 17}
Arm 1: 2/100 = 0.02
Arm 2: 21/100 = 0.21
Arm 3: 17/100 = 0.17
Explore
Exploit
[1]: http://hunch.net/~jl/projects/interactive/sidebandits/bandit.pdf
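A minimal Python sketch of the explore-then-exploit scheme above, reusing the pull() simulator from earlier. Note that the Epoch-Greedy paper proper interleaves exploration and exploitation epochs; this is the naïve one-shot variant shown on the slide.

import random

def explore_then_exploit(pull, n_arms=3, explore_pulls=300, exploit_pulls=1000):
    """Naïve explore-first strategy: pull each arm evenly, then commit to the
    arm with the highest observed mean reward. pull(arm) returns 0 or 1."""
    trials = [0] * n_arms
    wins = [0] * n_arms

    # Step 1: explore evenly
    for t in range(explore_pulls):
        arm = t % n_arms
        wins[arm] += pull(arm)
        trials[arm] += 1

    # Step 2: exploit the empirically best arm for the remaining pulls
    best = max(range(n_arms), key=lambda a: wins[a] / trials[a])
    total_reward = sum(wins)
    for _ in range(exploit_pulls):
        total_reward += pull(best)
    return total_reward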
EPSILON-GREEDY / ε-GREEDY [2]
Choose a value for the hyperparameter ε (e.g. 0.1 = 10%)
Loop forever
With probability ε:
Choose an arm uniformly at random, keep track of trials and wins
With probability 1-ε:
Choose the arm with the highest expected reward
Explore
Exploit
[2]: people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf
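A corresponding sketch of ε-greedy (ε = 0.1 and three arms are just example values):

import random

def epsilon_greedy(pull, n_arms=3, epsilon=0.1, steps=10_000):
    """ε-greedy: explore a uniformly random arm with probability ε,
    otherwise exploit the arm with the highest observed mean reward."""
    trials = [0] * n_arms
    wins = [0] * n_arms
    total_reward = 0

    for _ in range(steps):
        if random.random() < epsilon or min(trials) == 0:
            arm = random.randrange(n_arms)  # explore (also covers unseen arms)
        else:
            arm = max(range(n_arms), key=lambda a: wins[a] / trials[a])  # exploit
        reward = pull(arm)
        wins[arm] += reward
        trials[arm] += 1
        total_reward += reward
    return total_reward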
OBSERVATIONS
For N = 300 and M = 1000, the Epoch-Greedy algorithm will choose
something other than the best arm 200 times during exploration, or about 15% of the 1,300 total pulls.
At each turn, the ε-Greedy algorithm will choose something other than the
best arm with probability P = (ε/n) × (n-1), where n is the number of arms. It will
keep exploring with this probability no matter how many time steps have
elapsed.
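For example, with ε = 0.1 and n = 3 arms, that probability is constant:

epsilon, n_arms = 0.1, 3
p_suboptimal = (epsilon / n_arms) * (n_arms - 1)
print(p_suboptimal)  # ≈ 0.067: a suboptimal arm is chosen ~6.7% of the time, forever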
ε-GREEDY: SUB-OPTIMAL LINEAR REGRET* - O(N)
[Plot: cumulative regret grows linearly with timesteps]
*Assuming no annealing
THOMPSON SAMPLING ALGORITHM
For each arm, initialise a uniform probability distribution (prior)
Loop forever
Step 1: sample randomly from the probability distribution of each
arm
Step 2: choose the arm with the highest sample value
Step 3: observe the reward for the chosen arm and update the
hyperparameters of its probability distribution (posterior)
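A minimal Beta-Bernoulli sketch of the loop above, assuming $0/$1 rewards so that a Beta(1, 1) prior (i.e. uniform) is conjugate, as the next slides describe:

import random

def thompson_sampling(pull, n_arms=3, steps=10_000):
    """Beta-Bernoulli Thompson Sampling: keep a Beta(wins + 1, losses + 1)
    posterior per arm, sample once from each, play the highest sample."""
    wins = [0] * n_arms
    losses = [0] * n_arms
    total_reward = 0

    for _ in range(steps):
        # Step 1: draw one sample per arm from its current posterior
        samples = [random.betavariate(wins[a] + 1, losses[a] + 1) for a in range(n_arms)]
        # Step 2: choose the arm with the highest sampled value
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Step 3: observe the reward and update that arm's posterior counts
        reward = pull(arm)
        wins[arm] += reward
        losses[arm] += 1 - reward
        total_reward += reward
    return total_reward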
EXAMPLE WITH TWO ARMS (BLUE & GREEN)
Assumption: rewards follow a Bernoulli process, which allows us to use a Beta distribution as a
conjugate prior.
FIRST, INITIALISE A UNIFORM RANDOM PROB.
DISTRIBUTION FOR BLUE AND GREEN
FIRST, INITIALISE A UNIFORM RANDOM PROB.
DISTRIBUTION FOR BLUE AND GREEN
EXAMPLE WITH TWO ARMS (BLUE & GREEN)
RANDOMLY SAMPLE FROM BOTH ARMS (BLUE GETS THE
HIGHER VALUE IN THIS EXAMPLE)
EXAMPLE WITH TWO ARMS (BLUE & GREEN)
PULL BLUE, OBSERVE REWARD (LET’S SAY WE GOT A
REWARD OF 1)
EXAMPLE WITH TWO ARMS (BLUE & GREEN)
UPDATE DISTRIBUTION OF BLUE
EXAMPLE WITH TWO ARMS (BLUE & GREEN)
REPEAT THE PROCESS: SAMPLE, CHOOSE ARM WITH
HIGHEST SAMPLE VALUE, OBSERVE REWARD, UPDATE.
EXAMPLE WITH TWO ARMS (BLUE & GREEN)
AFTER 100 TIMESTEPS, THE DISTRIBUTIONS MIGHT LOOK
LIKE THIS
EXAMPLE WITH TWO ARMS (BLUE & GREEN)
AFTER 1,000 TIMESTEPS, THE DISTRIBUTIONS MIGHT LOOK
LIKE THIS: THE PROCESS HAS CONVERGED.
EXAMPLE WITH TWO ARMS (BLUE & GREEN)
THOMPSON SAMPLING: LOGARITHMIC REGRET - O(LOG(N))
[Plot: cumulative regret grows logarithmically with timesteps]
MULTI-ARMED BANDITS ARE
GOOD…
So far, we’ve covered the “vanilla” multi-armed bandit problem. Solving this
problem lets us find the globally best arm to maximise our expected reward
over time.
However, depending on the problem, the globally best arm may not always
be the best arm.
…BUT CONTEXTUAL BANDITS
ARE THE HOLY GRAIL
In the contextual bandit setting, at each time step we also get to see a context vector;
we then pull an arm and receive a reward associated with that context. The
objective is now to maximise the expected reward over time given the
context at each time step.
Solving this problem gives us a higher reward over time in situations where
the globally best arm isn’t always the best arm.
Example algorithm: Thompson Sampling with Linear Payoffs [3]
[3]: https://arxiv.org/abs/1209.3352
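A heavily simplified sketch of the idea behind Thompson Sampling with Linear Payoffs: each arm keeps a Gaussian posterior over a weight vector and we sample one weight vector per arm at every step. The noise scale v and the per-arm model structure are illustrative choices here; see [3] for the exact algorithm and parameter schedule.

import numpy as np

class LinearThompsonArm:
    """One arm's Bayesian linear model: expected reward ≈ context · weights."""
    def __init__(self, dim, v=0.5):
        self.B = np.eye(dim)    # precision matrix (prior: identity)
        self.f = np.zeros(dim)  # running sum of context * reward
        self.v = v              # exploration/noise scale (hyperparameter)

    def sample_expected_reward(self, context):
        B_inv = np.linalg.inv(self.B)
        mean = B_inv @ self.f
        weights = np.random.multivariate_normal(mean, (self.v ** 2) * B_inv)
        return float(context @ weights)

    def update(self, context, reward):
        self.B += np.outer(context, context)
        self.f += reward * context

def choose_arm(arms, context):
    """Pick the arm whose sampled model predicts the highest reward for this context."""
    return max(range(len(arms)), key=lambda a: arms[a].sample_expected_reward(context))

# Usage sketch: arms = [LinearThompsonArm(dim=5) for _ in range(3)]
# each round: a = choose_arm(arms, context); ...; arms[a].update(context, reward)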
…ACTUALLY, ADVERSARIAL
CONTEXTUAL BANDITS ARE THE
HOLY GRAIL
Until now, we’ve assumed that each one-armed bandit pays out $1 according to
some unknown but fixed probability P. In real-world scenarios, the world
rarely behaves this well. Instead of a simple dice roll, the bandit may pay out
differently depending on the day/hour/amount of money in the machine. It
works as an adversary of sorts.
Bandit algorithms that are capable of dealing with changes in payoff
structures in a reasonable amount of time tend to work better than
stochastic bandits.
Example algorithm: EXP4 (a.k.a. the “Monster” algorithm) [4]
[4]: https://arxiv.org/abs/1002.4058
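EXP4 itself maintains weights over a set of expert policies and doesn’t fit on a slide, but its simpler non-contextual relative EXP3 shows the adversarial flavour: exponential weights over arms and importance-weighted reward estimates. The exploration rate γ below is an illustrative choice.

import math
import random

def exp3(pull, n_arms=3, gamma=0.1, steps=10_000):
    """EXP3: exponential weights for the adversarial bandit, rewards in [0, 1]."""
    weights = [1.0] * n_arms
    total_reward = 0

    for _ in range(steps):
        total_w = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration
        probs = [(1 - gamma) * w / total_w + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        reward = pull(arm)
        total_reward += reward
        # Importance-weighted reward estimate: only the played arm is updated
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
    return total_reward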
CONTEXTUAL BANDIT
ALGORITHMS ARE 1) STATE-OF-
THE-ART & 2) COMPLEX
• Many algorithms are <5 years old. Some of the most interesting ones
are less than a year old.
• Be prepared to read lots of academic papers whilst implementing the
algorithms.
• Don’t hesitate to contact the authors if there is some detail you don’t
understand (thanks John Langford and Shipra Agrawal!).
• Always remember to simulate and validate bandits using real data.
• Always remember to log actions and rewards of deployed bandits.
PRACTICAL TIPS & CONSIDERATIONS
FOR BANDITS
• If you are implementing your own online-learnable MAB and serving
predictions/receiving rewards over an API, a single process may suffice
• If using Thompson Sampling, sampling 100 arms won’t take more than
a few milliseconds -> possible to implement a single-threaded real-time
learning bandit that can scale to hundreds of requests per second
• Thousands of arms or thousands of requests per second will require
multiple processes + synchronisation
• For contextual bandits, multiple processes + batching is required
• For some applications, you may need some heuristic for what counts as a
“zero reward”
• Remember to leverage caching on the web, otherwise the UX suffers
• In some cases, you might not want convergence -> tweak algorithm to
always explore with some nonzero probability
• In most applications, the number of arms can vary at any given timestep
REAL-WORLD APPLICATIONS OF BANDITS
BIVARIATE & MULTIVARIATE TESTING
Vanilla MAB
UI MENU OPTIMISATION
CODEC SELECTION
Image credit: Microsoft/Skype
ONLINE MULTIPLAYER
Image credit: Microsoft/Bungie
PERSONALISED RECOMMENDATIONS
Contextual
MAB
PERSONALISED RECOMMENDATIONS
Multiple-play MAB
SELFIE OPTIMISATION
Vanilla MAB
(one per
user)
Image: Tinder Tech Blog
… & COUNTLESS OTHER POSSIBILITIES
DIY: CONTEXTUAL BANDIT IMPLEMENTED IN
PYTHON + AZURE FUNCTIONS + API + REDIS
= OPTIMISATION ENGINE
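A rough sketch of how bandit state for such a service might live in Redis, shown here for a plain Beta-Bernoulli bandit for brevity (a contextual version would store model parameters rather than simple counters). The key layout and function names are assumptions for illustration; in the setup above, the two functions would be exposed as HTTP endpoints, e.g. via Azure Functions.

import random
import redis

r = redis.Redis()  # assumes a local Redis instance

def choose_arm(arm_ids):
    """Thompson-sample each arm from win/loss counters stored in Redis."""
    best_arm, best_sample = None, -1.0
    for arm in arm_ids:
        stats = r.hgetall(f"bandit:{arm}")  # hypothetical key layout
        wins = int(stats.get(b"wins", 0))
        losses = int(stats.get(b"losses", 0))
        sample = random.betavariate(wins + 1, losses + 1)
        if sample > best_sample:
            best_arm, best_sample = arm, sample
    return best_arm

def record_reward(arm, reward):
    """Update the chosen arm's counters; log the event separately for offline validation."""
    field = "wins" if reward > 0 else "losses"
    r.hincrby(f"bandit:{arm}", field, 1)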
AS-A-SERVICE: AZURE CUSTOM
DECISION SERVICE
IF YOU CAN MODEL IT AS A
GAME, YOU CAN USE A BANDIT
ALGORITHM TO SOLVE IT*.
*Given rewards that aren’t delayed.
THANK YOU
A special thanks to Microsoft for hosting & Practical AI for organising!
QUESTIONS?
