Real-world Reinforcement Learning
Max Pagels, Machine Learning Partner
@maxpagels
www.linkedin.com/in/maxpagels
Job: Fourkind
Education: BSc & MSc comp. sci, University of Helsinki
Background: CS researcher, full-stack dev, front-end dev,
data scientist
Interests: Immediate-reward RL, ML reductions,
incremental learning
Some industries: maritime, insurance, ecommerce, gaming,
telecommunications, transportation, media, education,
logistics
What is reinforcement learning?
What is reinforcement
learning?
In a reinforcement learning setting, one takes actions in an
environment & receives rewards. The ultimate goal is to
maximise rewards over time.
Environment
Agent
Goal: learn to act so as to maximise reward over time.
Action, Reward
What is reinforcement
learning?
State
What is reinforcement
learning?
A good real-world analogy is teaching your dog a new
command. If the dog correctly performs (acts) the command
you (the environment) gave, he or she is given a treat (a
reward). Over time, your dog will learn to act as commanded
in order to maximise reward over time.
What is reinforcement
learning?
Reinforcement learning isn’t entirely dissimilar from the
notion of classical conditioning or Pavlovian response:
“Classical conditioning (also known as Pavlovian or
respondent conditioning) refers to a learning procedure in
which a biologically potent stimulus (e.g. food) is paired with
a previously neutral stimulus (e.g. a bell). It also refers to the
learning process that results from this pairing, through which
the neutral stimulus comes to elicit a response (e.g. salivation)
that is usually similar to the one elicited by the potent
stimulus.”
Classical conditioning,
https://en.wikipedia.org/wiki/Classical_conditioning
What is reinforcement
learning?
In the beginning, a reinforcement learning agent knows
nothing about the world. It must explore different options to
learn what works and what doesn’t.
What is reinforcement
learning?
In addition, an agent must also exploit its knowledge in order
to actually maximise rewards over time.
What is reinforcement
learning?
Balancing exploration & exploitation is what reinforcement
learning is all about.
What is reinforcement
learning?
We’ll get back to the details later. Before that, let’s think about
why you might want to use reinforcement learning, and how
to do it in a way that actually works in the real world.
The case for using reinforcement
learning
The case for using
reinforcement learning
Intentionally provocative statement: you can’t really call
machine learning systems intelligent unless they are
reinforcement learning systems.
Let’s dissect this through some observations.
The case for using
reinforcement learning
Observation #1: any system that doesn’t use machine
learning generates data that is ultimately based on human
expertise.
The case for using
reinforcement learning
Observation #2: any supervised machine learning system that
uses such data is effectively learning from data generated by
human expertise.
The case for using
reinforcement learning
Observation #3: humans aren’t great at everything.
The case for using
reinforcement learning
Observation #4: deploying a supervised learning system itself
generates data from a new distribution. However, it still has
its roots in human expertise.
The case for using
reinforcement learning
Is this type of source information really the way to go? Is it
really the correct signal?
The case for using
reinforcement learning
I don’t think so. Let me elaborate with an example.
The case for using
reinforcement learning
Which of the following would I be most interested in?
The case for using
reinforcement learning
Which of the following would I be most interested in?
The case for using
reinforcement learning
Personal opinion: the only way to uncover the correct signal is
to assume nothing, try out different things (explore), and
learn to act optimally (exploit) based on environmental
feedback. It’s causal by nature. Everything else is a hack*.
* supervised learning can be a massively useful, perhaps even glorious,
hack, but it is still a hack.
Almost all production machine learning systems:
a Deploy → Log → Learn loop.
The case for using
reinforcement learning
A fundamentally correct machine learning system:
a Deploy → Explore → Log → Learn loop.
The case for using
reinforcement learning
The case for using
reinforcement learning
If you agree with this train of thought, it raises a question: why
don’t we use more reinforcement learning?
The problem with reinforcement learning
The problem with
reinforcement learning
Put bluntly: it’s very difficult.
The problem with
reinforcement learning
Max’s Difficulty Continuum, from Straightforward* to Hard as nails:
supervised learning sits at the straightforward end, full
reinforcement learning at the hard-as-nails end.
* not necessarily easy
The problem with
reinforcement learning
Why is reinforcement learning so difficult to do?
The theoretical framework underpinning full RL algorithms is
the Markov Decision Process (MDP).
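The MDP itself isn’t spelled out in the deck; as a reminder, the standard formulation (general background, not specific to this talk) is a tuple

(S, A, P, R, \gamma)

where S is the set of states, A the set of actions, P(s' \mid s, a) the transition probabilities, R(s, a) the reward function, and \gamma \in [0, 1] a discount factor. The goal is a policy \pi maximising the expected discounted return E\left[\sum_t \gamma^t R(s_t, a_t)\right].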
The problem with
reinforcement learning
Reinforcement learning is about learning how to act optimally in
such environments.
The problem with
reinforcement learning
It can work really well if you have a reasonable number of
possible states.
The problem with
reinforcement learning
The problem with
reinforcement learning
Unfortunately, for many real-world problems, we have an
insane number of possible states.
The problem with
reinforcement learning
An insane state space requires an insane amount of training
data to learn a good agent.
The problem with
reinforcement learning
For real-world problems, there usually isn’t an insane amount
of data on tap.
The problem with
reinforcement learning
The standard way to deal with this is to build an
environment simulator that generates an endless supply of
states & rewards.
This works in constrained, fully digital settings like games.
But for loads of real-world problems, you literally can’t build
a simulator.
The problem with
reinforcement learning
How on earth would a simulator know I
enjoy Tudor history?
The problem with
reinforcement learning
It can’t.
The problem with
reinforcement learning
That’s not the only problem.
The problem with
reinforcement learning
In a full reinforcement learning setting, rewards can arrive
immediately, or sometime in the future.
The problem with
reinforcement learning
Let’s say you have a sequence of ten yes/no decisions to make.
1. If you decide yes at step 1, you get a small immediate
reward and no rewards for the remaining 9 steps.
2. If you say no at step 1, and then follow a very specific
sequence of yeses and nos for the remaining steps, you
get a large reward.
It would make sense to sacrifice short-term rewards in this
case, because the payoff at the end is large.
The problem with
reinforcement learning
Consequence: you need to be able to learn to assign (partial)
rewards to actions that possibly happened a long time ago.
This is known as the credit assignment problem. Solving
this problem means full RL algorithms necessarily depend
on the number of observations, exacerbating the sample
complexity issue even more.
The problem with
reinforcement learning
What all of this means in practice for full RL:
The problem with
reinforcement learning
Despite all the issues, RL is still much too promising to
give up on. If we solve RL in real-world settings, we stand to
advance the state of the art significantly.
The problem with
reinforcement learning
So how do we do it? Currently, via a set of clever tricks and
simplifications. We aren’t yet able to solve all real-world RL
problems, but you’d be surprised what we can solve today.
How can you do reinforcement learning
in the real world?
How can you do
reinforcement learning in
the real world?
Currently: via some simplifications.
Let’s look at the Difficulty Continuum again, and add some
pros & cons.
Max’s Difficulty Continuum, from Straightforward* to Hard as nails:
supervised learning sits at the straightforward end, full
reinforcement learning at the hard-as-nails end.
* not necessarily easy
How can you do
reinforcement learning in
the real world?
Max’s Difficulty Continuum (Straightforward* to Hard as nails)
● Supervised learning: incorrect signal; independent of the
number of observations
● Full reinforcement learning: correct signal; depends on the
number of observations
* not necessarily easy
How can you do
reinforcement learning in
the real world?
If we can find a way to get rid of the dependence on sample
size, yet preserve the correctness of signal as well as possible,
we are onto something.
But can we?
How can you do
reinforcement learning in
the real world?
Yes. By making some simple yet critical modifications to
the full RL problem, we can make reinforcement learning
agents capable of solving a huge amount of real-world
problems. Not all problems, but a significant portion.
How can you do
reinforcement learning in
the real world?
Simplification #1: we are going to require that the reward for
an action is revealed (almost) immediately and, more
importantly, that it is attributable only to the previous action.
How can you do
reinforcement learning in
the real world?
Environment
Agent
Action, Reward, State
Requirement on the reward: it arrives quickly, and is
attributable to a single action.
How can you do
reinforcement learning in
the real world?
Q: Isn’t the immediate reward requirement a problem?
A: It depends. Though tricky, there is a huge class of
problems for which you can find short-term proxy rewards
that align well with long-term rewards. This is especially true
in online applications.
How can you do
reinforcement learning in
the real world?
Proxy reward examples
● News site: long-term reward is user satisfaction; short-term
proxy is dwell time
● Weight loss program: long-term reward is kilos lost;
short-term proxy is exercise time
● Video site: long-term reward is annual viewing time;
short-term proxy is seconds viewed following an action
● General-purpose: if you can build a predictor that accurately
predicts the long-term reward using short-term features, use
the prediction as a short-term reward (sketched below)
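As a hedged sketch of that general-purpose idea (the synthetic data, feature names and choice of scikit-learn regressor are illustrative assumptions, not from the talk): train a model that predicts the long-term reward from short-term features, then hand its prediction to the bandit as the immediate reward.

```python
# Illustrative only: synthetic data standing in for historical
# (short-term features -> long-term reward) observations.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
short_term_features = rng.normal(size=(1000, 2))          # e.g. dwell time, clicks
long_term_reward = 3 * short_term_features[:, 0] + rng.normal(scale=0.1, size=1000)

proxy_model = GradientBoostingRegressor().fit(short_term_features, long_term_reward)

def proxy_reward(features_after_action: np.ndarray) -> float:
    """Immediate reward handed to the bandit: the predicted long-term reward."""
    return float(proxy_model.predict(features_after_action.reshape(1, -1))[0])

print(proxy_reward(np.array([0.5, -1.0])))                 # roughly 1.5 for this toy data
```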
How can you do
reinforcement learning in
the real world?
Simplification #2: we are going to require that possible states
do not depend on previous actions we took.
How can you do
reinforcement learning in
the real world?
Environment
Agent
Action, Reward, State
Requirement on the reward: it arrives quickly, and is
attributable to a single action.
Requirement on the state: it doesn’t depend on previous
actions.
How can you do
reinforcement learning in
the real world?
Given these simplifications, we have what is known as
immediate-reward reinforcement learning, more commonly
called contextual bandits.
How can you do
reinforcement learning in
the real world?
With no dependence on the number of observations, we have
a setting that is still RL, but closer to supervised learning in
terms of tractability.
How can you do
reinforcement learning in
the real world?
Max’s Difficulty Continuum (Straightforward* to Hard as nails)
● Supervised learning: incorrect signal; independent of the
number of observations
● Full reinforcement learning: correct signal; depends on the
number of observations
* not necessarily easy
How can you do
reinforcement learning in
the real world?
Max’s Difficulty Continuum (Straightforward* to Hard as nails)
● Supervised learning: incorrect signal; independent of the
number of observations
● Full reinforcement learning: correct signal; depends on the
number of observations
● Contextual bandits: rightish signal; independent of the
number of observations
* not necessarily easy
How can you do
reinforcement learning in
the real world?
The contextual bandit (CB) problem, in CB lingo:
Repeatedly do:
1. Observe features x (analogous to state in RL)
2. Choose action a given x
3. Receive immediate reward r for the action
Objective: maximise expected reward over time.
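As a toy illustration of this loop (the simulated environment, feature dimensionality, epsilon value and linear model are invented for this sketch, not taken from the talk):

```python
# Minimal epsilon-greedy contextual bandit: observe x, choose a, receive r, update.
import numpy as np

rng = np.random.default_rng(42)
n_actions, n_features, epsilon = 4, 3, 0.1

true_weights = rng.normal(size=(n_actions, n_features))   # hidden reward structure
weights = np.zeros((n_actions, n_features))                # agent's per-action estimates

total_reward = 0.0
for t in range(1, 5001):
    x = rng.normal(size=n_features)                        # 1. observe features x
    if rng.random() < epsilon:                             # 2. choose action a given x:
        a = int(rng.integers(n_actions))                   #    explore with probability epsilon...
    else:
        a = int(np.argmax(weights @ x))                    #    ...otherwise exploit
    r = true_weights[a] @ x + rng.normal(scale=0.1)        # 3. receive immediate reward r
    weights[a] += 0.01 * (r - weights[a] @ x) * x          # online update of the chosen arm
    total_reward += r

print(f"average reward over {t} rounds: {total_reward / t:.3f}")
```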
How can you do
reinforcement learning in
the real world?
Given the simplifications, contextual bandit problems are
solvable using much less data than full RL problems. This
makes CBs an excellent candidate for solving real-world
problems.
How can you do
reinforcement learning in
the real world?
Next question: how might we go about solving a contextual
bandit problem?
How can you do
reinforcement learning in
the real world?
Let’s take a break: 20 minutes
Next question: how might we go about solving a contextual
bandit problem?
How can you do
reinforcement learning in
the real world?
One possible solution: ML reductions
One possible solution: ML
reductions
There are two approaches to solving a machine learning
problem:
1. Design new algorithms
2. Figure out how to reuse existing algorithms
The subfield of machine learning reductions focuses on 2).
It’s one of my favourite ML topics.
One possible solution: ML
reductions
General approach: transform your original problem and its data
distribution into something that can be solved by an existing,
simpler algorithm. Solve that, then roll the solution back up to
solve your original problem.
One possible solution: ML
reductions
Some of these may be hard to believe, but using either a single
reduction or a stack of reductions, you can reduce at least the
following:
One possible solution: ML
reductions
● Importance-weighted binary classification to binary
classification
● Regression to binary classification
● Quantile regression to binary classification
● Multiclass classification to binary classification
● Cost-sensitive multiclass classification to
importance-weighted binary classification
● Cost-sensitive multiclass classification to regression
● Ranking to binary classification
● Contextual bandits to multiclass classification
● Contextual bandits to binary classification
● Contextual bandits to regression
● Semibandits to supervised learning
One possible solution: ML
reductions
Putting our ML reductionist hat on, let’s take a closer look at
the agent part of the contextual bandit process.
Environment
Agent
Goal: learn to act so as to maximise reward over time.
Action, Reward, Features
One possible solution: ML
reductions
Agent
Exploration policy
Job: at each timestep, observe state
& play action, either the best one
or one according to some
exploration strategy
Features Action
One possible solution: ML
reductions
Exploration policy
Policy
Job: at each timestep, observe state
& output the best action
Features
Action
Exploration strategy
Job: at each timestep, decide
whether to choose the best action,
or try some other action
One possible solution: ML
reductions
You could argue finding the best way to explore is basically
what RL is all about. It’s such a broad topic that we’ll skip it*
in this talk, and focus on the policy itself.
*give me a shout after the talk if this is something you’d like to learn
more about.
One possible solution: ML
reductions
A policy is a learned function that takes a state as input and
outputs a prediction of the best action.
Replace “state” with “features” and “action” with “class” and
you get:
…a learned function that takes features as input and
outputs a prediction of the best class.
Another way to think about this: a policy is a classifier that
acts.
One possible solution: ML
reductions
*Puts reductionist hat on*: all of this sounds an awful lot like
supervised learning.
One possible solution: ML
reductions
Supervised learning assumes a full information setting, so we
can’t use it directly. The bad, and beautiful, thing about
reinforcement learning is that you never get to see rewards
for actions you didn’t take.
One possible solution: ML
reductions
However, it is possible to fill in “fake” reward information in
such a way that you get a dataset without missing
observations.
One possible solution: ML
reductions
This doesn’t seem possible, but it is (we’ll learn one technique
later on). And this is massively exciting, because it means we
can solve the policy part of contextual bandits with any
supervised learning classifier.
One possible solution: ML
reductions
By any classifier, I do mean any. We treat the classifier as an
oracle, a black box whose inner workings we don’t even need
to know about. Any classifier (assuming sufficient
expressiveness) will do:
● Gradient boosted classifiers
● Neural nets
● Logistic regression
● Decision trees
● KNN
● SVMs
● Random Forests
● ...
One possible solution: ML
reductions
Exploration policy
Supervised classifier oracle
Job: at each timestep, observe state
& output the best action
Modified
features
Action
Exploration strategy
Job: at each timestep, decide
whether to choose the best action,
or try some other action
One possible solution: ML
reductions
Exciting conclusion: we can reduce contextual bandits to
supervised learning + exploration, and solve the learning part
using an oracle learner.
But how do we deal with the partial information problem
inherent to all RL?
One possible solution: ML
reductions
Contextual bandits & the partial
information problem
The reinforcement learning setting, including the contextual
bandit setting, suffers from some severe selection bias,
because we never get to see rewards for actions we
didn’t take.
It makes evaluating the goodness of a policy less than
straightforward. Let’s look at an example.
Contextual bandits & the
partial information problem
Let’s pretend we’ve collected data (also known as experience)
from a contextual bandit agent that chooses between 4
actions (e.g. news articles) according to some exploration
policy π.
Contextual bandits & the
partial information problem
Let’s imagine we’ve logged the following reward sequence
(expected reward: 9/5 = 1.8):
Contextual bandits & the
partial information problem
(a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1)
Contextual bandits & the
partial information problem
Let’s imagine we’ve logged the following reward sequence
(expected reward: 9/5 = 1.8):
(a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1)
Now, let’s say we want to improve on the existing system and
train a new policy using the logged data. It chooses:
(a: 1, x, r: ?) (a: 3, x, r: ?) (a: 2, x, r: ?) (a: 1, x, r: ?) (a: 4, x, r: ?)
How can we tell if our new policy is better?
Contextual bandits & the
partial information problem
Let’s imagine we’ve logged the following reward sequence
(expected reward: 9/5 = 1.8):
(a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1)
Now, let’s say we want to improve on the existing system and
train a new policy using the logged data. It chooses:
(a: 1, x, r: 1) (a: 3, x, r: ?) (a: 2, x, r: ?) (a: 1, x, r: 4) (a: 4, x, r: ?)
If we only use the rewards for actions where the new policy
agrees with the logged one, we get an expected reward of
5/2 = 2.5. But is this policy actually better? Not necessarily.
Contextual bandits & the
partial information problem
Let’s imagine we’ve logged the following reward sequence
(expected reward: 9/5 = 1.8):
(a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1)
Now, let’s say we want to improve on the existing system and
train a new policy using the logged data. It chooses:
(a: 1, x, r: 1) (a: 3, x, r: 0) (a: 2, x, r: 0) (a: 1, x, r: 4) (a: 4, x, r: 0)
Setting unseen rewards to zero doesn’t help, either: now the
policy seems worse (expectation 1.0), but we don’t really
know since we are just guessing unseen rewards.
Contextual bandits & the
partial information problem
Suppose the actual best sequence (hidden from us) is the
one our new policy would have chosen:
(a: 1, x, r: 1) (a: 3, x, r: 4) (a: 2, x, r: 5) (a: 1, x, r: 4) (a: 4, x, r: 3)
We have a “perfect” policy, with an expected reward of 3.4
(1.9 times better than our previous one), but neither of our
previous attempts at evaluation estimated this well at all.
What we need is a way of filling in the missing rewards that is
unbiased, in order to build an unbiased estimator.
In math notation, our previous (bad) zero-filling estimator can
be formalised as follows:
Contextual bandits & the
partial information problem
Where:
n: the number of logged rounds (observations)
x: the features observed during each round
a: the action chosen by the policy during each round
r: the reward observed for the (x,a) pair during each round (missing
observations zero filled)
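The estimator itself appeared as an image in the original slides. Reconstructed from the definitions above (the standard form, so an assumption rather than a literal transcription), the zero-filling estimate of a new policy \pi’s value is

\hat{V}_{\text{zero}}(\pi) = \frac{1}{n} \sum_{t=1}^{n} r(x_t, a_t), \quad a_t = \pi(x_t)

where r(x_t, a_t) is taken from the log when the logged action matches a_t, and zero-filled otherwise.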
In order to overcome these bias issues, we are going to leverage
one piece of information that we can collect but haven’t used
yet: action probabilities (the probability of choosing a
particular arm at a given timestep).
Since a contextual bandit policy both explores and exploits, at
any given time step, there’s some probability a given action
will be chosen.
So, in addition to features x, action a, and observed reward r
at each timestep, we also have p, the probability the action
was chosen, giving us an (x,a,p,r) quad.
Contextual bandits & the
partial information problem
Let’s tweak our bad estimator. If our new policy disagrees with
the logged action at any given time, we fill in a zero reward as
before.
Contextual bandits & the
partial information problem
However, if our new policy agrees, we take the observed reward
and inversely weight it by the probability it was chosen in
our logged data. This estimator is known as IPS (inverse
propensity scoring, a.k.a. inverse probability weighting).
It is possible to show that an IPS estimator provides an
unbiased estimate of the reward. In fact, the proof is so short
that we can do it now.
Contextual bandits & the
partial information problem
Theorem
Contextual bandits & the
partial information problem
Theorem
Contextual bandits & the
partial information problem
Proof
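The theorem and its proof were shown as images in the deck; a standard reconstruction (assuming the usual IPS formulation, with p_t the logged probability of the action actually taken) goes as follows.

Theorem. If every action has a non-zero probability of being chosen by the logging policy, then

\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{t=1}^{n} \frac{r_t \, \mathbf{1}\{\pi(x_t) = a_t\}}{p_t}

is an unbiased estimate of the expected reward of the new policy \pi.

Proof (single round). Taking the expectation over the logged action a drawn with probability p(a \mid x),

E_{a \sim p(\cdot \mid x)}\!\left[ \frac{r(x, a) \, \mathbf{1}\{\pi(x) = a\}}{p(a \mid x)} \right] = \sum_{a} p(a \mid x) \, \frac{r(x, a) \, \mathbf{1}\{\pi(x) = a\}}{p(a \mid x)} = r(x, \pi(x)),

which is exactly the reward \pi would have received. Averaging over n independent rounds keeps the estimate unbiased. ∎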
IPS isn’t the only estimator. Other candidates include:
● Direct method (DM): estimate reward directly using a
separate predictor
● Doubly Robust (DR): combine IPS & DM
● Clipping, Weighted IPS, MTR (upcoming)
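For reference, a common form of the doubly robust estimator (not shown in the deck; \hat{r} denotes the direct method’s reward model) is

\hat{V}_{\text{DR}}(\pi) = \frac{1}{n} \sum_{t=1}^{n} \left[ \hat{r}(x_t, \pi(x_t)) + \frac{\left(r_t - \hat{r}(x_t, a_t)\right) \mathbf{1}\{\pi(x_t) = a_t\}}{p_t} \right]

It gives accurate estimates as long as either the reward model \hat{r} or the logged probabilities p_t are accurate, hence “doubly robust”.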
Contextual bandits & the
partial information problem
What does all this mean? We’ll get to the most interesting bit
shortly, but first, let’s return to the problem of actually
implementing a contextual bandit.
Oracle learners
As we saw before, you can reduce contextual bandits to
exploration + supervised learning, and use any supervised
learning algorithm as an oracle learner.
Oracle learners
Exploration policy
Supervised classifier oracle
Job: at each timestep, observe state
& output the best action
Action
Exploration strategy
Job: at each timestep, decide
whether to choose the best action,
or try some other action
Features
Let’s say we want to use multiclass logistic regression as the
oracle. Since we don’t observe all possible rewards at each
timestep, we can’t use it directly.
Oracle learners
Exploration policy
Supervised classifier oracle
Job: at each timestep, observe state
& output the best action
Features
Action
Exploration strategy
Job: at each timestep, decide
whether to choose the best action,
or try some other action
If we did, we’d be learning from incomplete data (as we saw
before) and the classifier wouldn’t work well.
Oracle learners
Exploration policy
Supervised classifier oracle
Job: at each timestep, observe state
& output the best action
Features
Action
Exploration strategy
Job: at each timestep, decide
whether to choose the best action,
or try some other action
We would also run into massive class imbalance issues, since
the majority of the reward information we do have is from
whatever the logged policy thought was best.
Oracle learners
Exploration policy
Supervised classifier oracle
Job: at each timestep, observe state
& output the best action
Features
Action
Exploration strategy
Job: at each timestep, decide
whether to choose the best action,
or try some other action
Let’s fiddle around with the data to make it compatible with
oracle classification algorithms.
Oracle learners
Exploration policy
Supervised classifier oracle
Job: at each timestep, observe state
& output the best action
Modified
features
Action
Exploration strategy
Job: at each timestep, decide
whether to choose the best action,
or try some other action
Given experience (x,a,p,r) and a supervised classification
algorithm, set rewards as follows (for each timestep):
● For the reward of the action that was taken, set r = r/p(a)
● For all other actions, set r = 0
Oracle learners
Given experience (x,a,p,r) and a supervised classification
algorithm, set rewards as follows (for each timestep):
● For the reward of the action that was taken, set r = r/p(a)
● For all other actions, set r = 0
This is simply IPS!
Oracle learners
Result: all missing rewards filled in in an unbiased fashion,
creating a supervised learning problem. The class imbalance
issue is also gone.
It’s really that simple.
Note: not all oracle learners need this tweak, but most classification
algorithms do.
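A minimal sketch of that reward transformation, assuming experience stored as (x, a, p, r) tuples and a dense one-target-per-action layout (the function name, numpy layout and toy numbers are illustrative, not from the talk):

```python
import numpy as np

def experience_to_supervised(experience, n_actions):
    """Turn logged (x, a, p, r) tuples into per-action reward targets:
    the taken action gets r / p(a), every other action gets 0 (the IPS fill-in)."""
    features, targets = [], []
    for x, a, p, r in experience:
        filled = np.zeros(n_actions)
        filled[a] = r / p              # inverse propensity weighting of the observed reward
        features.append(x)
        targets.append(filled)
    return np.array(features), np.array(targets)

# Tiny example: two logged rounds, three possible actions.
logged = [(np.array([1.0, 0.0]), 2, 0.5, 1.0),
          (np.array([0.0, 1.0]), 0, 0.25, 2.0)]
X, R = experience_to_supervised(logged, n_actions=3)
print(R)   # [[0. 0. 2.] [8. 0. 0.]]
```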
Oracle learners
Policy evaluation
Using an unbiased estimator to fill in missing rewards allows
us to solve contextual bandits with oracle learners. That’s
neat, but not the best part.
Policy evaluation
We never explicitly mentioned what assumptions our logged
quads (x,a,r,p) must satisfy in order for us to estimate rewards
in an unbiased fashion.
Policy evaluation
Answer: apart from assumptions related to the contextual
bandit setting itself (most importantly, that every action had a
non-zero probability of being chosen when the data was
logged), pretty much nothing.
Policy evaluation
We can take any logged experience of the form (x,a,r,p) and
evaluate a new policy offline, just like we do in supervised
learning.
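As a sketch of what IPS-based offline evaluation might look like in code (the function and the toy data are illustrative assumptions, not from the talk):

```python
import numpy as np

def evaluate_policy_ips(new_policy, experience):
    """Estimate the expected reward of `new_policy` from logged (x, a, p, r) quads:
    count r / p whenever the new policy agrees with the logged action, else 0."""
    estimates = [(r / p) if new_policy(x) == a else 0.0 for x, a, p, r in experience]
    return float(np.mean(estimates))

# Illustrative usage: a hard-coded policy and two logged rounds.
logged = [(np.array([1.0, 0.0]), 1, 0.5, 2.0),
          (np.array([0.0, 1.0]), 0, 0.25, 1.0)]
policy = lambda x: 1 if x[0] > x[1] else 0
print(evaluate_policy_ips(policy, logged))   # 4.0 for this toy log
```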
Policy evaluation
What if the experience was generated by 10 different policies,
each deployed after the other?
Doesn’t matter*.
Policy evaluation
What if the experience was generated by a policy using an
entirely different learning algorithm (e.g. gradient boosting vs.
logistic regression)?
Doesn’t matter.
Policy evaluation
What if the experience was generated by a policy just
randomly exploring, possibly without any machine learning at
all?
Doesn’t matter.
Policy evaluation
Regardless of the policy that generated our experience, we can
use it for training a new policy and evaluating it offline. We
can run hundreds of experiments a day, testing new
hyperparameters, exploration options, learning algorithms,
features etc.
And we can do this without using a simulator, using real
world data collected from real users.
Policy evaluation
Putting it all together, this gives us a pretty fantastic recipe for
success:
1. Implement data collection system, collecting quads
(x,a,p,r)
2. Deploy your policy (at first, could even be a random
choice sans machine learning)
3. Train a better policy using experience, deploy
4. Repeat 3-4 using your ever-growing experience data
Policy evaluation
Robust Bandit Architectures
So far, we’ve covered the use case for contextual bandits, and
important aspects including offline evaluation and learning
via reductions. But in order to build a robust contextual
bandit system for real-world use, there are some architecture
patterns and techniques that will allow us to avoid common
pitfalls.
Bandit Architecture
Let’s sketch a recommended architecture, starting with
constituent components, before explaining how they all fit
together.
Bandit Architecture
First, we need some client-facing prediction API. The API
should respond to requests by fetching the necessary context
x, consulting the exploration policy (model) for an action, and
returning the action a to the user along with a prediction
identifier i.
For reasons we’ll explain later, it should log the prediction ID
in addition to the familiar (x,a,p) tuple*. It’s also a good idea
to log a timestamp t for the prediction and a policy version v.
* The reward r arrives in the future.
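A framework-free sketch of such a prediction endpoint (every name here is an illustrative assumption; a real implementation would sit behind your web framework of choice):

```python
import json
import random
import time
import uuid

POLICY_VERSION = "v1"   # illustrative policy version tag

def handle_prediction_request(x, exploration_policy, log_sink):
    """Given context x, ask the exploration policy for an action a and its probability p,
    log (i, v, t, x, a, p), and return a together with the prediction identifier i."""
    a, p = exploration_policy(x)
    i = str(uuid.uuid4())                       # prediction identifier
    t = time.time()
    log_sink(json.dumps({"i": i, "v": POLICY_VERSION, "t": t, "x": x, "a": a, "p": p}))
    return {"action": a, "prediction_id": i}

# Illustrative usage with a uniform-random policy over 4 actions (pure exploration).
uniform_policy = lambda x: (random.randrange(4), 0.25)
print(handle_prediction_request({"user_segment": 3}, uniform_policy, print))
```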
Bandit Architecture
Prediction API
Request
x
(a,i)
(i,v,t,x,a,p)
Secondly, we need some join server in which to store the
logged experience (i,v,t,x,a,r,p). This can be a transactional
database, an in-memory key-value store, or a NoSQL solution.
Ideally, you’ll want something capable of fast joins, with
support for expiration times and notifications thereof.
Bandit Architecture
Join server (DB)
(i,v,t,x,a,p)
We also need an API to handle rewards. Assuming the user
sends rewards for a particular prediction ID, the API need not
do anything more than send the reward, prediction ID and
timestamp to the join server.
Bandit Architecture
Reward API
(i,r) (i,t,r)
Good ML architectures save lots of artefacts. We are going to
need an object store for various things including storing
experience (training data) and models themselves.
Bandit Architecture
Object store
payload
Let’s see what we have sketched so far. We have APIs to
handle prediction requests and reward ingestion; a join server
to store experience; and a general-purpose object store for
artefacts.
Bandit Architecture
Object store
Prediction API
Request
x
(a,i)
(i,v,t,x,a,p)
Join server
Reward API
(i,r) (i,t,r)
What about the join server? In practical bandit settings, you
may never see a reward for a user action. E.g. if you are
learning from clicks on a website, you have to assume a reward
of zero after some period of inaction.
Bandit Architecture
Object store
Prediction API
Request
x
(a,i)
(i,v,t,x,a,p)
Join server
Reward API
(i,r) (i,t,r)
This is best handled by, after a predetermined amount of time,
left joining (i,v,t,x,a,p) tuples with their reward r (if any). You
can use the prediction identifier i to reliably tie predictions to
rewards. If no reward is found, set r to 0.
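One way the join could look in code (a dictionary-based sketch for illustration; in practice the join server or database would do this):

```python
def join_rewards(predictions, rewards, default_reward=0.0):
    """Left join logged predictions (i, v, t, x, a, p) with rewards (i, t, r) on the
    prediction id i; if no reward arrived within the waiting period, fill in 0."""
    reward_by_id = {i: r for i, _, r in rewards}
    joined = []
    for i, v, t, x, a, p in predictions:
        r = reward_by_id.get(i, default_reward)
        joined.append((i, v, t, x, a, r, p))
    return joined

# Illustrative usage: two predictions, only one of which received a reward.
preds = [("id-1", "v1", 1000.0, {"f": 1}, 2, 0.25),
         ("id-2", "v1", 1001.0, {"f": 0}, 0, 0.50)]
rews = [("id-1", 1005.0, 1.0)]
print(join_rewards(preds, rews))
```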
Bandit Architecture
Object store
Prediction API
Request
x
(a,i)
(i,v,t,x,a,p)
Join server
Reward API
(i,r) (i,t,r)
This procedure yields full (i,v,t,x,a,r,p) tuples we can use for
learning. It’s a good idea to periodically store these in an object
store.
Bandit Architecture
Object store
Prediction API
Request
x
(a,i)
(i,v,t,x,a,p)
Join server
Reward API
(i,r) (i,t,r)
(i,v,t,x,a,r,p)
For policy updating, we need some service that periodically
fetches data from our object store and trains a new policy.
Data can be filtered according to freshness & model version, if
desired. The new policy can be saved back to the same store.
Bandit Architecture
Object store
(i,v,t,x,a,r,p)
tuples
Learner
Policy (model)
Note that the learner can also be offline, i.e. a Jupyter
notebook or similar. As long as everything is saved in the
object store, you can experiment offline with production data
whilst keeping the production environment intact.
Bandit Architecture
Object store
(i,v,t,x,a,r,p)
tuples
Learner
Policy (model)
This is our complete architecture. It’s robust, horizontally scalable on the client side,
minimises mistakes when tying predictions to rewards, allows for model versioning, and
supports offline & online learning using the same data.
Object store
Prediction API
Request
x
(a,i)
(i,v,t,x,a,p)
Join server
Reward API
(i,r) (i,t,r)
(i,v,t,x,a,r,p)
Learner
(i,v,t,x,a,r,p)
tuples Policy (model)
Policy (model)
Summary
Let’s recap some of the key takeaways from this talk.
Summary
Supervised learning is useful, but doesn’t really uncover the
right signal in many cases.
Full reinforcement learning does uncover the correct signal
and is causal by nature, but it is also very difficult to apply to
real-world problems because of the sample complexity
required for credit assignment.
Contextual bandits provide a happy medium by relaxing the
full RL setting to only consider immediate rewards.
Summary
Summary
The Difficulty Continuum (Straightforward* to Hard as nails)
● Supervised learning: incorrect signal; independent of the
number of observations
● Full reinforcement learning: correct signal; depends on the
number of observations
● Contextual bandits: rightish signal; independent of the
number of observations
* not necessarily easy
Contextual bandits can be reduced to exploration +
supervised learning, allowing us to take advantage of
ready-made, state-of-the-art learning algorithms.
Contextual bandit policies can be evaluated offline, using
experience quads (x,a,p,r) generated by any previous policy.
A properly implemented contextual bandit learning system is
a self-improving loop: better policies generate more reward,
and provide more data for improving further.
Contextual bandits allow you to solve a host of real-world
problems, using real data instead of simulation, in a causal
manner.
Summary
If you have a problem where it is possible to explore, and a desire to make a
machine learning system capable of uncovering new things, consider
immediate-reward RL.
Thank you! Questions?
max.pagels@fourkind.com
www.fourkind.com
@fourkindnow

More Related Content

What's hot

Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)Suhyun Cho
 
[226]대용량 텍스트마이닝 기술 하정우
[226]대용량 텍스트마이닝 기술 하정우[226]대용량 텍스트마이닝 기술 하정우
[226]대용량 텍스트마이닝 기술 하정우NAVER D2
 
MMCF: Multimodal Collaborative Filtering for Automatic Playlist Conitnuation
MMCF: Multimodal Collaborative Filtering for Automatic Playlist ConitnuationMMCF: Multimodal Collaborative Filtering for Automatic Playlist Conitnuation
MMCF: Multimodal Collaborative Filtering for Automatic Playlist ConitnuationHojin Yang
 
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기NAVER D2
 
Reinforcement learning and the Frozen Lake Problem
Reinforcement learning and the Frozen Lake ProblemReinforcement learning and the Frozen Lake Problem
Reinforcement learning and the Frozen Lake ProblemVishal Kumar
 
Q Learning과 CNN을 이용한 Object Localization
Q Learning과 CNN을 이용한 Object LocalizationQ Learning과 CNN을 이용한 Object Localization
Q Learning과 CNN을 이용한 Object Localization홍배 김
 
Python과 Tensorflow를 활용한 AI Chatbot 개발 및 실무 적용
Python과 Tensorflow를 활용한  AI Chatbot 개발 및 실무 적용Python과 Tensorflow를 활용한  AI Chatbot 개발 및 실무 적용
Python과 Tensorflow를 활용한 AI Chatbot 개발 및 실무 적용Susang Kim
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlpankit_ppt
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 
딥러닝 기반 자연어 언어모델 BERT
딥러닝 기반 자연어 언어모델 BERT딥러닝 기반 자연어 언어모델 BERT
딥러닝 기반 자연어 언어모델 BERTSeonghyun Kim
 
Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...Data Science London
 
Convolutional neural neworks
Convolutional neural neworksConvolutional neural neworks
Convolutional neural neworksLuis Serrano
 
RLCode와 A3C 쉽고 깊게 이해하기
RLCode와 A3C 쉽고 깊게 이해하기RLCode와 A3C 쉽고 깊게 이해하기
RLCode와 A3C 쉽고 깊게 이해하기Woong won Lee
 
The Dark Side of Malware Analysis - Andrea Pompili - Codemotion Rome 2015
The Dark Side of Malware Analysis - Andrea Pompili - Codemotion Rome 2015The Dark Side of Malware Analysis - Andrea Pompili - Codemotion Rome 2015
The Dark Side of Malware Analysis - Andrea Pompili - Codemotion Rome 2015Codemotion
 
Cuadernillo LÉXICO-SEMANTICO.pdf
Cuadernillo LÉXICO-SEMANTICO.pdfCuadernillo LÉXICO-SEMANTICO.pdf
Cuadernillo LÉXICO-SEMANTICO.pdfflgavaleriehennicke
 
[PR12] understanding deep learning requires rethinking generalization
[PR12] understanding deep learning requires rethinking generalization[PR12] understanding deep learning requires rethinking generalization
[PR12] understanding deep learning requires rethinking generalizationJaeJun Yoo
 
스타크래프트2 강화학습(StarCraft II Reinforcement Learning)
스타크래프트2 강화학습(StarCraft II Reinforcement Learning)스타크래프트2 강화학습(StarCraft II Reinforcement Learning)
스타크래프트2 강화학습(StarCraft II Reinforcement Learning)Chris Hoyean Song
 
Natural Language Processing using Text Mining
Natural Language Processing using Text MiningNatural Language Processing using Text Mining
Natural Language Processing using Text MiningSushanti Acharya
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDongHyun Kwak
 

What's hot (20)

Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)
 
[226]대용량 텍스트마이닝 기술 하정우
[226]대용량 텍스트마이닝 기술 하정우[226]대용량 텍스트마이닝 기술 하정우
[226]대용량 텍스트마이닝 기술 하정우
 
MMCF: Multimodal Collaborative Filtering for Automatic Playlist Conitnuation
MMCF: Multimodal Collaborative Filtering for Automatic Playlist ConitnuationMMCF: Multimodal Collaborative Filtering for Automatic Playlist Conitnuation
MMCF: Multimodal Collaborative Filtering for Automatic Playlist Conitnuation
 
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
 
Reinforcement learning and the Frozen Lake Problem
Reinforcement learning and the Frozen Lake ProblemReinforcement learning and the Frozen Lake Problem
Reinforcement learning and the Frozen Lake Problem
 
Q Learning과 CNN을 이용한 Object Localization
Q Learning과 CNN을 이용한 Object LocalizationQ Learning과 CNN을 이용한 Object Localization
Q Learning과 CNN을 이용한 Object Localization
 
Python과 Tensorflow를 활용한 AI Chatbot 개발 및 실무 적용
Python과 Tensorflow를 활용한  AI Chatbot 개발 및 실무 적용Python과 Tensorflow를 활용한  AI Chatbot 개발 및 실무 적용
Python과 Tensorflow를 활용한 AI Chatbot 개발 및 실무 적용
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 
딥러닝 기반 자연어 언어모델 BERT
딥러닝 기반 자연어 언어모델 BERT딥러닝 기반 자연어 언어모델 BERT
딥러닝 기반 자연어 언어모델 BERT
 
Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...
 
Convolutional neural neworks
Convolutional neural neworksConvolutional neural neworks
Convolutional neural neworks
 
RLCode와 A3C 쉽고 깊게 이해하기
RLCode와 A3C 쉽고 깊게 이해하기RLCode와 A3C 쉽고 깊게 이해하기
RLCode와 A3C 쉽고 깊게 이해하기
 
The Dark Side of Malware Analysis - Andrea Pompili - Codemotion Rome 2015
The Dark Side of Malware Analysis - Andrea Pompili - Codemotion Rome 2015The Dark Side of Malware Analysis - Andrea Pompili - Codemotion Rome 2015
The Dark Side of Malware Analysis - Andrea Pompili - Codemotion Rome 2015
 
Cuadernillo LÉXICO-SEMANTICO.pdf
Cuadernillo LÉXICO-SEMANTICO.pdfCuadernillo LÉXICO-SEMANTICO.pdf
Cuadernillo LÉXICO-SEMANTICO.pdf
 
[PR12] understanding deep learning requires rethinking generalization
[PR12] understanding deep learning requires rethinking generalization[PR12] understanding deep learning requires rethinking generalization
[PR12] understanding deep learning requires rethinking generalization
 
스타크래프트2 강화학습(StarCraft II Reinforcement Learning)
스타크래프트2 강화학습(StarCraft II Reinforcement Learning)스타크래프트2 강화학습(StarCraft II Reinforcement Learning)
스타크래프트2 강화학습(StarCraft II Reinforcement Learning)
 
Natural Language Processing using Text Mining
Natural Language Processing using Text MiningNatural Language Processing using Text Mining
Natural Language Processing using Text Mining
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Sun’iy neyron modeli
Sun’iy neyron modeliSun’iy neyron modeli
Sun’iy neyron modeli
 

Similar to RL REAL-WORLD APPLICATIONS

Lect 8 learning types (M.L.).pdf
Lect 8 learning types (M.L.).pdfLect 8 learning types (M.L.).pdf
Lect 8 learning types (M.L.).pdfHassanElalfy4
 
Leveraging Diversity to Find What Works and Amplify
Leveraging Diversity to Find What Works and Amplify Leveraging Diversity to Find What Works and Amplify
Leveraging Diversity to Find What Works and Amplify Mike Cardus
 
Theory of constraints._eliyahu_m._goldra
Theory of constraints._eliyahu_m._goldraTheory of constraints._eliyahu_m._goldra
Theory of constraints._eliyahu_m._goldraJuan Colin
 
Types of machine learning
Types of machine learningTypes of machine learning
Types of machine learningHimaniAloona
 
Telling your story: improving your presentation in 10 easy steps
Telling your story: improving your presentation in 10 easy stepsTelling your story: improving your presentation in 10 easy steps
Telling your story: improving your presentation in 10 easy stepsFiona Passantino
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learningbutest
 
Lessons learned from building practical deep learning systems
Lessons learned from building practical deep learning systemsLessons learned from building practical deep learning systems
Lessons learned from building practical deep learning systemsXavier Amatriain
 
Machine Learning Interview Questions
Machine Learning Interview QuestionsMachine Learning Interview Questions
Machine Learning Interview QuestionsRock Interview
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningKmPooja4
 
xAPI Live - Why do I need something new? Day Hikes in xAPI
xAPI Live - Why do I need something new?  Day Hikes in xAPIxAPI Live - Why do I need something new?  Day Hikes in xAPI
xAPI Live - Why do I need something new? Day Hikes in xAPIRISC Inc
 
Machine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-codeMachine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-codeOsama Ghandour Geris
 
Semi supervised learning machine learning made simple
Semi supervised learning  machine learning made simpleSemi supervised learning  machine learning made simple
Semi supervised learning machine learning made simpleDevansh16
 
Basics of coding
Basics of codingBasics of coding
Basics of codingSanaaSharda
 
cs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdfcs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdfKuan-Tsae Huang
 

Similar to RL REAL-WORLD APPLICATIONS (20)

Chaptr 7 (final)
Chaptr 7 (final)Chaptr 7 (final)
Chaptr 7 (final)
 
Lect 8 learning types (M.L.).pdf
Lect 8 learning types (M.L.).pdfLect 8 learning types (M.L.).pdf
Lect 8 learning types (M.L.).pdf
 
Softskills
SoftskillsSoftskills
Softskills
 
Leveraging Diversity to Find What Works and Amplify
Leveraging Diversity to Find What Works and Amplify Leveraging Diversity to Find What Works and Amplify
Leveraging Diversity to Find What Works and Amplify
 
Organizational behaviour
Organizational behaviourOrganizational behaviour
Organizational behaviour
 
Theory of constraints._eliyahu_m._goldra
Theory of constraints._eliyahu_m._goldraTheory of constraints._eliyahu_m._goldra
Theory of constraints._eliyahu_m._goldra
 
Types of machine learning
Types of machine learningTypes of machine learning
Types of machine learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Telling your story: improving your presentation in 10 easy steps
Telling your story: improving your presentation in 10 easy stepsTelling your story: improving your presentation in 10 easy steps
Telling your story: improving your presentation in 10 easy steps
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Lessons learned from building practical deep learning systems
Lessons learned from building practical deep learning systemsLessons learned from building practical deep learning systems
Lessons learned from building practical deep learning systems
 
Machine Learning Interview Questions
Machine Learning Interview QuestionsMachine Learning Interview Questions
Machine Learning Interview Questions
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
xAPI Live - Why do I need something new? Day Hikes in xAPI
xAPI Live - Why do I need something new?  Day Hikes in xAPIxAPI Live - Why do I need something new?  Day Hikes in xAPI
xAPI Live - Why do I need something new? Day Hikes in xAPI
 
Machine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-codeMachine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-code
 
Semi supervised learning machine learning made simple
Semi supervised learning  machine learning made simpleSemi supervised learning  machine learning made simple
Semi supervised learning machine learning made simple
 
Build a great Technical Team
Build a great Technical TeamBuild a great Technical Team
Build a great Technical Team
 
Basics of coding
Basics of codingBasics of coding
Basics of coding
 
cs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdfcs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdf
 
Principles of learning.
Principles of learning.Principles of learning.
Principles of learning.
 

Recently uploaded

Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 

Recently uploaded (20)

Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 

RL REAL-WORLD APPLICATIONS

  • 20-21. The case for using reinforcement learning Which of the following would I be most interested in?
  • 22. The case for using reinforcement learning Personal opinion: the only way to uncover the correct signal is to assume nothing, try out different things (explore), and learn to act optimally (exploit) based on environmental feedback. It’s causal by nature. Everything else is a hack*. * supervised learning can be a massively useful, perhaps even glorious, hack, but it is still a hack.
  • 23. The case for using reinforcement learning Almost all production machine learning systems: a loop of Learn, Log, and Deploy.
  • 24. The case for using reinforcement learning A fundamentally correct machine learning system: a loop of Learn, Log, Explore, and Deploy.
  • 25. The case for using reinforcement learning If you agree with this train of thought, it raises a question: why don’t we use more reinforcement learning?
  • 26. The problem with reinforcement learning
  • 27. The problem with reinforcement learning Put bluntly: it’s very difficult.
  • 28. The problem with reinforcement learning Max’s Difficulty Continuum: supervised learning sits at the straightforward* end, full reinforcement learning at the hard-as-nails end. * not necessarily easy
  • 29. The problem with reinforcement learning Why is reinforcement learning so difficult to do?
  • 30. The theoretical framework underpinning full RL algorithms is the Markov Decision Process (MDP). The problem with reinforcement learning
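For reference, a standard way to write the MDP down (my notation, not shown on the slide) is as a tuple of states, actions, transition probabilities, rewards, and a discount factor:

```latex
% Standard MDP tuple: states, actions, transition kernel, reward function, discount factor.
M = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
P(s' \mid s, a), \quad R(s, a), \quad \gamma \in [0, 1)

% The agent's objective: maximise the expected discounted return
\mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right]
```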
  • 31. Reinforcement learning is about learning how to act optimally in such environments. The problem with reinforcement learning
  • 32. It can work really well if you have a reasonable number of possible states. The problem with reinforcement learning
  • 33. The problem with reinforcement learning Unfortunately, for many real-world problems, we have an insane amount of possible states.
  • 34. The problem with reinforcement learning An insane state space requires an insane amount of training data to learn a good agent.
  • 35. The problem with reinforcement learning For real-world problems, there usually isn’t an insane amount of data on tap.
  • 36. The problem with reinforcement learning The standard way to deal with this is to build an environment simulator, that generates an endless supply of states & rewards. This works in constrained, fully digital settings like games. But for loads of real-world problems, you literally can’t build a simulator.
  • 37. The problem with reinforcement learning How on earth would a simulator know I enjoy Tudor history?
  • 38. The problem with reinforcement learning It can’t.
  • 39. The problem with reinforcement learning That’s not the only problem.
  • 40. The problem with reinforcement learning In a full reinforcement learning setting, rewards can arrive immediately, or sometime in the future.
  • 41. The problem with reinforcement learning Let’s say you have a sequence of ten yes/no decisions to make. 1. If you decide yes at step 1, you get a small immediate reward and no rewards for the remaining 9 steps. 2. If you say no at step 1, and then follow a very specific sequence of yeses and nos for the remaining steps, you get a large reward. It would make sense to sacrifice short-term rewards in this case, because the payoff at the end is large.
  • 42. The problem with reinforcement learning Consequence: you need to be able to learn to assign (partial) rewards to actions that possibly happened a long time ago. This is known as the credit assignment problem. Solving this problem means full RL algorithms necessarily depend on the number of observations, exacerbating the sample complexity issue even more.
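To make the credit assignment point concrete, here is a small illustrative sketch of my own (not from the talk): it spreads the delayed payoff of the ten-step example back over earlier decisions using discounted returns, one common mechanism for this; the rewards and discount factor are made up:

```python
def discounted_returns(rewards, gamma=0.95):
    """G_t = r_t + gamma * r_{t+1} + ... : how much credit each step ultimately receives."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# The ten yes/no decisions: saying "yes" immediately earns a small reward and nothing
# afterwards; the patient sequence earns nothing until a large final reward.
greedy = [1] + [0] * 9
patient = [0] * 9 + [100]

print(discounted_returns(greedy))   # credit concentrates on the first step
print(discounted_returns(patient))  # early steps still receive credit for the distant payoff
```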
  • 43. The problem with reinforcement learning What all of this means in practice for full RL:
  • 44. The problem with reinforcement learning Despite all the issues, RL is still much too promising to give up on. If we solve RL in real-world settings, we stand to advance the state of the art significantly.
  • 45. The problem with reinforcement learning So how do we do it? Currently, via a set of clever tricks and simplifications. We aren’t yet able to solve all real-world RL problems, but you’d be surprised what we can solve today.
  • 46. How can you do reinforcement learning in the real world?
  • 47. How can you do reinforcement learning in the real world? Currently: via some simplifications. Let’s look at the Difficulty Continuum again, and add some pros & cons.
  • 48. How can you do reinforcement learning in the real world? Max’s Difficulty Continuum again: supervised learning sits at the straightforward* end, full reinforcement learning at the hard-as-nails end. * not necessarily easy
  • 49. How can you do reinforcement learning in the real world? Max’s Difficulty Continuum, with pros & cons: supervised learning (straightforward*) gives an incorrect signal but is independent of the number of observations; full reinforcement learning (hard as nails) gives the correct signal but depends on the number of observations. * not necessarily easy
  • 50. If we can find a way to get rid of the dependence on sample size, yet preserve the correctness of signal as well as possible, we are onto something. But can we? How can you do reinforcement learning in the real world?
  • 51. Yes. By making some simple yet critical modifications to the full RL problem, we can make reinforcement learning agents capable of solving a huge amount of real-world problems. Not all problems, but a significant portion. How can you do reinforcement learning in the real world?
  • 52. Simplification #1: we are going to require that the reward for an action is revealed (almost) immediately and, more importantly, that it is attributable only to the previous action. How can you do reinforcement learning in the real world?
  • 53. How can you do reinforcement learning in the real world? The agent-environment loop (state, action, reward), with a requirement on the reward: it arrives quickly, and is attributable to a single action.
  • 54. Q: Isn’t the immediate reward requirement a problem? A: It depends. Though tricky, there is a huge class of problems for which you can find short-term proxy rewards that align well with long-term rewards. This is especially true in online applications. How can you do reinforcement learning in the real world?
  • 55. How can you do reinforcement learning in the real world? Proxy reward examples: ● News site. Long-term reward: user satisfaction. Short-term proxy: dwell time. ● Weight loss program. Long-term reward: kilos lost. Short-term proxy: exercise time. ● Video site. Long-term reward: annual viewing time. Short-term proxy: seconds viewed following an action. ● General-purpose. If you can build a predictor that accurately predicts the long-term reward using short-term features, use the prediction as a short-term reward.
  • 56. Simplification #2: we are going to require that possible states do not depend on previous actions we took. How can you do reinforcement learning in the real world?
  • 57. How can you do reinforcement learning in the real world? The agent-environment loop again, now with two requirements. Reward: arrives quickly, and is attributable to a single action. State: doesn’t depend on previous actions.
  • 58. Given these simplifications, we have what is known as immediate-reward reinforcement learning, or contextual bandits as it’s more commonly known. How can you do reinforcement learning in the real world?
  • 59. With no dependence on the number of observations, we have a setting that is still RL, but closer to supervised learning in terms of tractability. How can you do reinforcement learning in the real world?
  • 60. How can you do reinforcement learning in the real world? The continuum again: supervised learning (straightforward*) gives an incorrect signal but is independent of the number of observations; full reinforcement learning (hard as nails) gives the correct signal but depends on the number of observations. * not necessarily easy
  • 61. How can you do reinforcement learning in the real world? Adding contextual bandits to the continuum: supervised learning (straightforward*) gives an incorrect signal but is independent of the number of observations; contextual bandits give a rightish signal and are independent of the number of observations; full reinforcement learning (hard as nails) gives the correct signal but depends on the number of observations. * not necessarily easy
  • 62. The contextual bandit (CB) problem, in CB lingo: Repeatedly do: 1. Observe features x (analogous to state in RL) 2. Choose action a given x 3. Receive immediate reward r for the action Objective: maximise expected reward over time. How can you do reinforcement learning in the real world?
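As a concrete (toy) illustration of this loop, here is a runnable sketch of my own: a uniformly random logging policy over four actions, with a made-up reward function standing in for the environment, logging (x,a,p,r) quads for later use:

```python
import random

random.seed(0)
ACTIONS = [0, 1, 2, 3]

def toy_reward(x, a):
    # Hypothetical environment: action 2 works best for segment 0, action 0 for segment 1.
    return 1.0 if a == (2 if x == 0 else 0) else 0.0

logged = []
for t in range(1000):
    x = t % 2                      # 1. observe features x (here: a toy user segment)
    a = random.choice(ACTIONS)     # 2. choose action a given x (pure exploration here)
    p = 1.0 / len(ACTIONS)         #    record the probability the action was chosen
    r = toy_reward(x, a)           # 3. receive immediate reward r
    logged.append((x, a, p, r))    # log the (x, a, p, r) quad for learning & evaluation

print(sum(r for _, _, _, r in logged) / len(logged))   # ~0.25: the random policy's value
```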
  • 63. Given the simplifications, contextual bandit problems are solvable using much less data than full RL problems. This makes CBs an excellent candidate for solving real-world problems. How can you do reinforcement learning in the real world?
  • 64. Next question: how might we go about solving a contextual bandit problem? How can you do reinforcement learning in the real world?
  • 65. Let’s take a break: 20 minutes
  • 66. Next question: how might we go about solving a contextual bandit problem? How can you do reinforcement learning in the real world?
  • 67. One possible solution: ML reductions
  • 68. One possible solution: ML reductions There are two approaches to solving a machine learning problem: 1. Design new algorithms 2. Figure out how to reuse existing algorithms The subfield of machine learning reductions focuses on 2). It’s one of my favourite ML topics.
  • 69. One possible solution: ML reductions General approach: reduce your original data distribution into something that can be solved by an existing, simpler algorithm. Solve that, then roll the solution back up to solve your original problem.
  • 70. One possible solution: ML reductions Some of these may be hard to believe, but using either a single reduction or a stack of reductions, you can reduce at least the following:
  • 71. One possible solution: ML reductions ● Importance-weighted binary classification to binary classification ● Regression to binary classification ● Quantile regression to binary classification ● Multiclass classification to binary classification ● Cost-sensitive multiclass classification to importance-weighted binary classification ● Cost-sensitive multiclass classification to regression ● Ranking to binary classification ● Contextual bandits to multiclass classification ● Contextual bandits to binary classification ● Contextual bandits to regression ● Semibandits to supervised learning
  • 72. One possible solution: ML reductions Putting our ML reductionist hat on, let’s take a closer look at the agent part of the contextual bandit process.
  • 73. One possible solution: ML reductions The agent-environment loop again: the agent observes features, plays an action, and receives a reward. Goal: learn to act so as to maximise reward over time.
  • 74. One possible solution: ML reductions Inside the agent sits an exploration policy. Job: at each timestep, observe the state (features) & play an action, either the best one or one according to some exploration strategy.
  • 75. One possible solution: ML reductions The exploration policy itself splits into two parts. Policy (job: at each timestep, observe the state & output the best action). Exploration strategy (job: at each timestep, decide whether to choose the best action, or try some other action).
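A minimal sketch of this split (mine, purely illustrative): epsilon-greedy as the exploration strategy, and a per-(context, action) reward-average table standing in for the learned policy:

```python
import random
from collections import defaultdict

ACTIONS = [0, 1, 2, 3]
value = defaultdict(float)    # running mean reward per (context, action)
count = defaultdict(int)

def policy(x):
    """Policy: output the action currently believed to be best for context x."""
    return max(ACTIONS, key=lambda a: value[(x, a)])

def exploration_strategy(x, epsilon=0.1):
    """Exploration strategy: usually exploit the policy, occasionally try something else."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return policy(x)

def update(x, a, r):
    """Keep the running reward averages up to date after a reward arrives."""
    count[(x, a)] += 1
    value[(x, a)] += (r - value[(x, a)]) / count[(x, a)]
```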
  • 76. You could argue finding the best way to explore is basically what RL is all about. It’s such a broad topic that we’ll skip it* in this talk, and focus on the policy itself. *give me a shout after the talk if this is something you’d like to learn more about. One possible solution: ML reductions
  • 77. A policy is a learned function that takes a state as input and outputs a prediction of the best action. Replace “state” with “features” and “action” with “class” and you get: …a learned function that takes features as input and outputs a prediction of the best class. Another way to think about this: a policy is a classifier that acts. One possible solution: ML reductions
  • 78. *Puts reductionist hat on*: all of this sounds an awful lot like supervised learning. One possible solution: ML reductions
  • 79. Supervised learning assumes a full information setting, so we can’t use it directly. The bad, and beautiful, thing about reinforcement learning is that you never get to see rewards for actions you didn’t take. One possible solution: ML reductions
  • 80. However, it is possible to fill in “fake” reward information in such a way that you get a dataset without missing observations. One possible solution: ML reductions
  • 81. This doesn’t seem possible, but it is (we’ll learn one technique later on). And this is massively exciting, because it means we can solve the policy part of contextual bandits with any supervised learning classifier. One possible solution: ML reductions
  • 82. By any classifier, I do mean any. We treat the classifier as an oracle, a black box whose inner workings we don’t even need to know about. Any classifier (assuming sufficient expressiveness) will do: ● Gradient boosted classifiers ● Neural nets ● Logistic regression ● Decision trees ● KNN ● SVMs ● Random Forests ● ... One possible solution: ML reductions
  • 83. One possible solution: ML reductions The same picture, with the policy replaced by a supervised classifier oracle (job: at each timestep, observe the modified features & output the best action). The exploration strategy is unchanged (job: at each timestep, decide whether to choose the best action, or try some other action).
  • 84. Exciting conclusion: we can reduce contextual bandits to supervised learning + exploration, and solve the learning part using an oracle learner. But how do we deal with the partial information problem inherent to all RL? One possible solution: ML reductions
  • 85. Contextual bandits & the partial information problem
  • 86. The reinforcement learning setting, including the contextual bandit setting, suffers from some severe selection bias, because we never get to see rewards for actions we didn’t take. It makes evaluating the goodness of a policy less than straightforward. Let’s look at an example. Contextual bandits & the partial information problem
  • 87. Let’s pretend we’ve collected data (also known as experience) from a contextual bandit agent that chooses between 4 actions (e.g. news articles) according to some exploration policy π. Contextual bandits & the partial information problem
  • 88. Contextual bandits & the partial information problem Let’s imagine we’ve logged the following reward sequence (expected reward: 9/5 = 1.8): (a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1)
  • 89. Contextual bandits & the partial information problem Now, let’s say we want to improve on the existing system and train a new policy using the logged data. It chooses: (a: 1, x, r: ?) (a: 3, x, r: ?) (a: 2, x, r: ?) (a: 1, x, r: ?) (a: 4, x, r: ?) How can we tell if our new policy is better?
  • 90. Contextual bandits & the partial information problem Again, the logged sequence has an expected reward of 9/5 = 1.8. Our new policy chooses: (a: 1, x, r: 1) (a: 3, x, r: ?) (a: 2, x, r: ?) (a: 1, x, r: 4) (a: 4, x, r: ?) If we only use rewards for actions observed, we get an expected reward of 5/2 = 2.5. But is this policy actually better? Not necessarily.
  • 91. Contextual bandits & the partial information problem Setting unseen rewards to zero doesn’t help, either: our new policy’s choices become (a: 1, x, r: 1) (a: 3, x, r: 0) (a: 2, x, r: 0) (a: 1, x, r: 4) (a: 4, x, r: 0), so now the policy seems worse (expectation 1.0), but we don’t really know since we are just guessing unseen rewards.
  • 92. Contextual bandits & the partial information problem Suppose the actual best sequence (hidden from us) is the one our new policy would have chosen: (a: 1, x, r: 1) (a: 3, x, r: 4) (a: 2, x, r: 5) (a: 1, x, r: 4) (a: 4, x, r: 3) We have a “perfect” policy, with an expected reward of 3.4 (roughly 1.9 times better than our previous one), but both our previous attempts at evaluation didn’t estimate this well at all. What we need is a way of filling in fake rewards in a way that is unbiased, in order to build an unbiased estimator.
  • 93. Contextual bandits & the partial information problem In math notation, our previous (bad) zero-filling estimator can be formalised as follows. Where: n: the number of logged rounds; x: the features observed during each round; a: the action chosen by the policy during each round; r: the reward observed for the (x,a) pair during each round (missing observations zero-filled).
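The estimator itself appeared as an equation image on the slide; a reconstruction consistent with the definitions above (my notation) would be:

```latex
% Zero-filling estimator of a new policy pi's value over n logged rounds:
\hat{V}_{\text{0-fill}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} r_i \, \mathbf{1}\left[ \pi(x_i) = a_i \right]
% A logged reward counts only when pi agrees with the logged action; otherwise it contributes 0.
```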
  • 94. In order to overcome these bias issues, we are going to leverage one piece of information that we can collect but haven’t used yet: action probabilities (the probability of choosing a particular arm at a given timestep). Since a contextual bandit policy both explores and exploits, at any given time step, there’s some probability a given action will be chosen. So, in addition to features x, action a, and observed reward r at each timestep, we also have p, the probability the action was chosen, giving us an (x,a,p,r) quad. Contextual bandits & the partial information problem
  • 95. Contextual bandits & the partial information problem Let’s tweak our bad estimator. If our new policy disagrees with the logged action at any given time, we fill in a zero reward as before. However, if our new policy agrees, we take the observed reward and inversely weight it by the probability it was chosen in our logged data. This estimator is known as IPS (inverse propensity scoring, a.k.a. inverse probability weighting).
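Again in my own notation (the slide showed the formula as an image), the IPS estimate of a new policy π over n logged rounds is:

```latex
% IPS estimate of a new policy pi's value over n logged rounds, with logged probabilities p_i:
\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{r_i \, \mathbf{1}\left[ \pi(x_i) = a_i \right]}{p_i}
```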
  • 96. It is possible to show that an IPS estimator provides an unbiased estimate of the reward. In fact, the proof is so short that we can do it now. Contextual bandits & the partial information problem
  • 97-98. Contextual bandits & the partial information problem Theorem and proof (shown as equations on the slides): the IPS estimate of a policy’s reward is unbiased.
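A reconstruction of that argument in my notation (the original theorem and proof were shown as equations on the slides), for a single round with context x, logging probabilities p(a|x) > 0, and rewards r(x,a):

```latex
% For one round with context x, logging probabilities p(a | x) > 0 and rewards r(x, a):
\mathbb{E}_{a \sim p(\cdot \mid x)}\left[ \frac{r(x, a) \, \mathbf{1}\left[ \pi(x) = a \right]}{p(a \mid x)} \right]
  = \sum_{a} p(a \mid x) \, \frac{r(x, a) \, \mathbf{1}\left[ \pi(x) = a \right]}{p(a \mid x)}
  = r\big(x, \pi(x)\big)
% In expectation, the IPS term equals the reward the new policy pi would have received;
% averaging over rounds therefore gives an unbiased estimate of pi's expected reward.
```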
  • 99. IPS isn’t the only estimator. Other candidates include: ● Direct method (DM): estimate reward directly using a separate predictor ● Doubly Robust (DR): combine IPS & DM ● Clipping, Weighted IPS, MTR (upcoming) Contextual bandits & the partial information problem
  • 100. What does all this mean? We’ll get to the most interesting bit shortly, but first, let’s return to the problem of actually implementing a contextual bandit. Oracle learners
  • 101. Oracle learners As we saw before, you can reduce contextual bandits to exploration + supervised learning, and use any supervised learning algorithm as an oracle learner.
  • 102. Oracle learners Let’s say we want to use multiclass logistic regression as the oracle. Since we don’t observe all possible rewards at each timestep, we can’t use it directly.
  • 103. Oracle learners If we did, we’d be learning from incomplete data (as we saw before) and the classifier wouldn’t work well.
  • 104. Oracle learners We would also run into massive class imbalance issues, since the majority of the reward information we do have is from whatever the logged policy thought was best.
  • 105. Oracle learners Let’s fiddle around with the data to make it compatible with oracle classification algorithms.
  • 106. Given experience (x,a,p,r) and a supervised classification algorithm, set rewards as follows (for each timestep): ● For the reward of the action that was taken, set r = r/p(a) ● For all other actions, set r = 0 Oracle learners
  • 107. Given experience (x,a,p,r) and a supervised classification algorithm, set rewards as follows (for each timestep): ● For the reward of the action that was taken, set r = r/p(a) ● For all other actions, set r = 0 This is simply IPS! Oracle learners
  • 108. Result: all missing rewards filled in in an unbiased fashion, creating a supervised learning problem. The class imbalance issue is also gone. It’s really that simple. Note: not all oracle learners need this tweak, but most classification algorithms do. Oracle learners
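A small sketch of this transformation (my own code, not the talk's reference implementation): turn logged (x,a,p,r) quads into per-action reward vectors that an ordinary cost-sensitive or importance-weighted multiclass learner can consume; the four-action setup mirrors the earlier toy example:

```python
N_ACTIONS = 4   # matches the four-action toy example; purely illustrative

def to_supervised(logged):
    """logged: iterable of (x, a, p, r) quads -> list of (x, per-action reward vector)."""
    dataset = []
    for x, a, p, r in logged:
        rewards = [0.0] * N_ACTIONS     # every unobserved action gets reward 0
        rewards[a] = r / p              # the observed action gets its IPS-weighted reward
        dataset.append((x, rewards))
    return dataset

# Example: three rounds logged by a uniform-random policy (p = 0.25).
example_log = [(0, 2, 0.25, 1.0), (1, 2, 0.25, 0.0), (0, 0, 0.25, 0.0)]
print(to_supervised(example_log))   # first round becomes (0, [0.0, 0.0, 4.0, 0.0])
```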
  • 110. Using an unbiased estimator to fill in missing rewards allows us to solve contextual bandits with oracle learners. That’s neat, but not the best part. Policy evaluation
  • 111. We never explicitly mentioned what assumptions our logged quads (x,a,r,p) must satisfy in order for us to estimate rewards in an unbiased fashion. Policy evaluation
  • 112. Answer: apart from assumptions related to the contextual bandit setting itself, pretty much nothing. Policy evaluation
  • 113. We can take any logged experience of the form (x,a,r,p) and evaluate a new policy offline, just like we do in supervised learning. Policy evaluation
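A hedged sketch (mine) of what that offline evaluation can look like with the IPS estimator, where candidate_policy is any function from features to an action:

```python
def ips_value(logged, candidate_policy):
    """IPS estimate of a candidate policy's expected reward from logged (x, a, p, r) quads."""
    total = 0.0
    for x, a, p, r in logged:
        if candidate_policy(x) == a:    # only rounds where the policies agree contribute
            total += r / p              # reward inversely weighted by the logging probability
    return total / len(logged)

# Usage: compare two hypothetical policies on the same log.
# ips_value(logged, lambda x: 2 if x == 0 else 0)
# ips_value(logged, lambda x: 1)
```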
  • 114. What if the experience was generated by 10 different policies, each deployed after the other? Doesn’t matter*. Policy evaluation
  • 115. What if the experience was generated by a policy using an entirely different learning algorithm (e.g. gradient boosting vs. logistic regression)? Doesn’t matter. Policy evaluation
  • 116. What if the experience was generated by a policy just randomly exploring, possibly without any machine learning at all? Doesn’t matter. Policy evaluation
  • 117. Regardless of the policy that generated our experience, we can use it for training a new policy and evaluating it offline. We can run hundreds of experiments a day, testing new hyperparameters, exploration options, learning algorithms, features etc. And we can do this without using a simulator, using real world data collected from real users. Policy evaluation
  • 118. Putting it all together, this gives us a pretty fantastic recipe for success: 1. Implement data collection system, collecting quads (x,a,p,r) 2. Deploy your policy (at first, could even be a random choice sans machine learning) 3. Train a better policy using experience, deploy 4. Repeat 3-4 using your ever-growing experience data Policy evaluation
  • 119. Putting it all together, this gives us a pretty fantastic recipe for success: 1. Implement data collection system, collecting quads (x,a,p,r) 2. Deploy your policy (at first, could even be a random choice sans machine learning) 3. Train a better policy using experience, deploy 4. Repeat 3-4 using your ever-growing experience data Policy evaluation (Important!)
  • 120. Regardless of the policy that generated our experience, we can use it for training a new policy and evaluating it offline. We can run hundreds of experiments a day, testing new hyperparameters, exploration options, learning algorithms, features etc. And we can do this without using a simulator, using real world data collected from real users. Policy evaluation
  • 122. So far, we’ve covered the use case for contextual bandits, and important aspects including offline evaluation and learning via reductions. But in order to build a robust contextual bandit system for real-world use, there are some architecture patterns and techniques that will allow us to avoid common pitfalls. Bandit Architecture
  • 123. Let’s sketch a recommended architecture, starting with constituent components, before explaining how they all fit together. Bandit Architecture
  • 124. Bandit Architecture First, we need some client-facing prediction API. The API should respond to requests by fetching the necessary context x, consulting the exploration policy (model) for an action, and returning the action a to the user along with a prediction identifier i. For reasons we’ll explain later, it should log the prediction ID in addition to the familiar (x,a,p) tuple*. It’s also a good idea to log a timestamp t for the prediction and a policy version v. * The reward r arrives in the future.
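A rough sketch (my own, not the talk's implementation) of such a prediction endpoint; exploration_policy, fetch_context, and log are hypothetical stand-ins for your model, feature store, and logging sink:

```python
import time
import uuid

MODEL_VERSION = "v1"   # hypothetical policy version identifier

def predict(request, exploration_policy, fetch_context, log):
    """Handle one prediction request and log the (i, v, t, x, a, p) tuple."""
    x = fetch_context(request)            # context features for this request
    a, p = exploration_policy(x)          # chosen action and the probability it was chosen
    i = str(uuid.uuid4())                 # prediction identifier
    t = time.time()
    log({"i": i, "v": MODEL_VERSION, "t": t, "x": x, "a": a, "p": p})
    return {"action": a, "prediction_id": i}   # the reward r arrives later, keyed by i
```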
  • 125. Bandit Architecture Secondly, we need some join server in which to store the logged experience (i,v,t,x,a,r,p). This can be a transactional database, in-memory key-value store, or a NoSQL solution. Ideally, you’ll want something capable of fast joins, with support for expiration times and notifications thereof. As with the prediction API, it should store the prediction ID in addition to the familiar (x,a,r,p) tuple, along with the timestamp t of the prediction.
  • 126. Bandit Architecture We also need an API to handle rewards. Assuming the user sends rewards for a particular prediction ID, the API need not do anything more than send the reward, prediction ID and timestamp (i,t,r) to the join server.
  • 127. Bandit Architecture Good ML architectures save lots of artefacts. We are going to need an object store for various things, including storing experience (training data) and the models themselves.
  • 128. Bandit Architecture Let’s see what we have sketched so far. We have APIs to handle prediction requests and reward ingestion; a join server, to store experience; and a general-purpose object store for artifacts.
  • 129. Bandit Architecture What about the join server? In practical bandit settings, you may never see a reward for a user action. E.g. if you are learning from clicks on a website, you have to assume a reward of zero after some period of inaction.
  • 130. Bandit Architecture This is best handled by, after a predetermined amount of time, left joining the logged (i,v,t,x,a,p) tuples with their reward r (if any). You can use the prediction identifier i to reliably tie predictions to rewards. If no reward is found, set r to 0.
  • 131. Bandit Architecture This procedure yields full (i,v,t,x,a,r,p) tuples we can use for learning. It’s a good idea to periodically store these in an object store.
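A hedged sketch (mine) of the join just described, keyed on the prediction identifier and zero-filling rewards that never arrived within the window:

```python
def join_experience(predictions, rewards):
    """
    predictions: list of dicts with keys i, v, t, x, a, p (logged by the prediction API)
    rewards:     dict mapping prediction id i -> reward r (logged by the reward API)
    returns:     full (i, v, t, x, a, r, p) tuples ready for training
    """
    joined = []
    for pred in predictions:
        r = rewards.get(pred["i"], 0.0)   # no reward within the window -> assume 0
        joined.append((pred["i"], pred["v"], pred["t"],
                       pred["x"], pred["a"], r, pred["p"]))
    return joined
```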
  • 132. Bandit Architecture For policy updating, we need some service that periodically fetches data from our object store and trains a new policy. Data can be filtered according to freshness & model version, if desired. The new policy can be saved back to the same store.
  • 133. Bandit Architecture Note that the learner can also be offline, e.g. a Jupyter notebook or similar. As long as everything is saved in the object store, you can experiment offline with production data whilst keeping the production environment intact.
  • 134. Bandit Architecture This is our complete architecture. It’s robust, horizontally scalable on the client side, minimises mistakes when tying predictions to rewards, and allows for model versioning and offline & online learning using the same data. (Diagram: requests hit the Prediction API, which logs (i,v,t,x,a,p); the Reward API sends (i,t,r) to the join server; the join server emits full (i,v,t,x,a,r,p) tuples to the object store; the learner trains a new policy (model) from those tuples and publishes it back for serving.)
  • 136. Let’s recap some of the key takeaways from this talk. Summary
  • 137. Supervised learning is useful, but doesn’t really uncover the right signal in many cases. Full reinforcement learning does uncover the correct signal, is causal by nature, but is also very difficult to apply to real-world problems because of the sample complexity required for credit assignment. Contextual bandits provide a happy medium by relaxing the full RL setting to only consider immediate rewards. Summary
  • 138. Summary The continuum once more: supervised learning (straightforward) gives an incorrect signal but is independent of the number of observations; contextual bandits give a rightish signal and are independent of the number of observations; full reinforcement learning (hard as nails) gives the correct signal but depends on the number of observations.
  • 139. Contextual bandits can be reduced to exploration + supervised learning, allowing us to take advantage of ready-made, state-of-the-art learning algorithms. Contextual bandit policies can be evaluated offline, using experience quads (x,a,p,r) generated by any previous policy. A properly implemented contextual bandit learning system is a self-improving loop: better policies generate more reward, and provide more data for improving further. Contextual bandits allow you to solve a host of real-world problems, using real data instead of simulation, in a causal manner. Summary
  • 140. If you have a problem where it is possible to explore, and a desire to make a machine learning system capable of uncovering new things, consider immediate-reward RL.