
Real-world Reinforcement Learning



An introduction to immediate-reward reinforcement learning.



  1. 1. Real-world Reinforcement Learning Max Pagels, Machine Learning Partner @maxpagels www.linkedin.com/in/maxpagels
  2. 2. Job: Fourkind Education: BSc, MSc comp. sci, University of Helsinki Background: CS researcher, full-stack dev, front-end dev, data scientist Interests: Immediate-reward RL, ML reductions, incremental/online learning, generative design Some industries: maritime, insurance, ecommerce, gaming, telecommunications, transportation, media, education, logistics
  3. 3. What is reinforcement learning?
  4. 4. What is reinforcement learning? In a reinforcement learning setting, one takes actions in an environment & receives rewards. The ultimate goal is to maximise rewards over time.
  5. 5. What is reinforcement learning? [Diagram: the agent observes a State from the Environment, takes an Action, and receives a Reward.] Goal: learn to act so as to maximise reward over time.
  6. 6. What is reinforcement learning? A good real-world analogy is teaching your dog a new command. If the dog correctly performs (acts) the command you (the environment) gave, he or she is given a treat (a reward). Over time, your dog will learn to act as commanded in order to maximise reward.
  7. 7. What is reinforcement learning? Reinforcement learning isn’t entirely dissimilar from the notion of classical conditioning or Pavlovian response: “Classical conditioning (also known as Pavlovian or respondent conditioning) refers to a learning procedure in which a biologically potent stimulus (e.g. food) is paired with a previously neutral stimulus (e.g. a bell). It also refers to the learning process that results from this pairing, through which the neutral stimulus comes to elicit a response (e.g. salivation) that is usually similar to the one elicited by the potent stimulus.” Classical conditioning, https://en.wikipedia.org/wiki/Classical_conditioning
  8. 8. What is reinforcement learning? In the beginning, a reinforcement learning agent knows nothing about the world. It must explore different options to learn what works and what doesn’t.
  9. 9. What is reinforcement learning? In addition, an agent must also exploit its knowledge in order to actually maximise rewards over time.
  10. 10. What is reinforcement learning? Balancing exploration & exploitation is what reinforcement learning is all about.
  11. 11. What is reinforcement learning? We’ll get back to the details later. Before that, let’s think about why you might want to use reinforcement learning, and how to do it in a way that actually works in the real world.
  12. 12. The case for using reinforcement learning
  13. 13. The case for using reinforcement learning Intentionally provocative statement: you can’t really call machine learning systems intelligent unless they are reinforcement learning systems. Let’s dissect this through some observations.
  14. 14. The case for using reinforcement learning Observation #1: any system that doesn’t use machine learning generates data that is ultimately based on human expertise.
  15. 15. The case for using reinforcement learning Observation #2: any supervised machine learning system that uses such data is effectively learning from data generated by human expertise.
  16. 16. The case for using reinforcement learning Observation #3: humans aren’t great at everything.
  17. 17. The case for using reinforcement learning Observation #4: deploying a supervised learning system itself generates data from a new distribution. However, it still has its roots in human expertise.
  18. 18. The case for using reinforcement learning Is this type of source information really the way to go? Is it really the correct signal?
  19. 19. The case for using reinforcement learning I don’t think so. Let me elaborate with an example.
  20. 20. The case for using reinforcement learning Which of the following would I be most interested in?
  21. 21. The case for using reinforcement learning Which of the following would I be most interested in?
  22. 22. The case for using reinforcement learning Personal opinion: the only way to uncover the correct signal is to assume nothing, try out different things (explore), and learn to act optimally (exploit) based on environmental feedback. It’s causal by nature. Everything else is a hack*. * supervised learning can be a massively useful, perhaps even glorious, hack, but it is still a hack.
  23. 23. The case for using reinforcement learning [Diagram: a loop of Learn, Log, Deploy. This is how almost all production machine learning systems work.]
  24. 24. The case for using reinforcement learning [Diagram: a loop of Learn, Log, Explore, Deploy. This is a fundamentally correct machine learning system.]
  25. 25. The case for using reinforcement learning If you agree with this train of thought, it raises a question: why don’t we use more reinforcement learning?
  26. 26. The problem with reinforcement learning
  27. 27. The problem with reinforcement learning Put bluntly: it’s very difficult.
  28. 28. The problem with reinforcement learning Max’s Difficulty Continuum: supervised learning sits at the “straightforward”* end, full reinforcement learning at the “hard as nails” end. (* not necessarily easy)
  29. 29. The problem with reinforcement learning Why is reinforcement learning so difficult to do?
  30. 30. The theoretical framework underpinning full RL algorithms is the Markov Decision Process (MDP). The problem with reinforcement learning
  31. 31. Reinforcement learning is about learning how to act optimally in such environments. The problem with reinforcement learning
  32. 32. It can work really well if you have a reasonable number of possible states. The problem with reinforcement learning
  33. 33. The problem with reinforcement learning Unfortunately, for many real-world problems, we have an insane amount of possible states.
  34. 34. The problem with reinforcement learning An insane state space requires an insane amount of training data to learn a good agent.
  35. 35. The problem with reinforcement learning For real-world problems, there usually isn’t an insane amount of data on tap.
  36. 36. The problem with reinforcement learning The standard way to deal with this is to build an environment simulator that generates an endless supply of states & rewards. This works in constrained, fully digital settings like games. But for loads of real-world problems, you literally can’t build a simulator.
  37. 37. The problem with reinforcement learning How on earth would a simulator know I enjoy Tudor history?
  38. 38. The problem with reinforcement learning It can’t.
  39. 39. The problem with reinforcement learning That’s not the only problem.
  40. 40. The problem with reinforcement learning In a full reinforcement learning setting, rewards can arrive immediately, or sometime in the future.
  41. 41. The problem with reinforcement learning Let’s say you have a sequence of ten yes/no decisions to make. 1. If you decide yes at step 1, you get a small immediate reward and no rewards for the remaining 9 steps. 2. If you say no at step 1, and then follow a very specific sequence of yeses and nos for the remaining steps, you get a large reward. It would make sense to sacrifice short-term rewards in this case, because the payoff at the end is large.
  42. 42. The problem with reinforcement learning Consequence: you need to be able to learn to assign (partial) rewards to actions that possibly happened a long time ago. This is known as the credit assignment problem. Solving this problem means full RL algorithms necessarily depend on the number of observations, exacerbating the sample complexity issue even more.
  43. 43. The problem with reinforcement learning What all of this means in practice for full RL:
  44. 44. The problem with reinforcement learning Despite all the issues, RL is still much too promising to give up on. If we solve RL in real-world settings, we stand to advance the state of the art significantly.
  45. 45. The problem with reinforcement learning So how do we do it? Currently, via a set of clever tricks and simplifications. We aren’t yet able to solve all real-world RL problems, but you’d be surprised what we can solve today.
  46. 46. How can you do reinforcement learning in the real world?
  47. 47. How can you do reinforcement learning in the real world? Currently: via some simplifications. Let’s look at the Difficulty Continuum again, and add some pros & cons.
  48. 48. How can you do reinforcement learning in the real world? Max’s Difficulty Continuum again: supervised learning at the “straightforward”* end, full reinforcement learning at the “hard as nails” end. (* not necessarily easy)
  49. 49. How can you do reinforcement learning in the real world? Max’s Difficulty Continuum: supervised learning (“straightforward”*: incorrect signal, independent of the number of observations) versus full reinforcement learning (“hard as nails”: correct signal, depends on the number of observations). (* not necessarily easy)
  50. 50. If we can find a way to get rid of the dependence on sample size, yet preserve the correctness of signal as well as possible, we are on to something. But can we? How can you do reinforcement learning in the real world?
  51. 51. Yes. By making some simple yet critical modifications to the full RL problem, we can make reinforcement learning agents capable of solving a huge amount of real-world problems. Not all problems, but a significant portion. How can you do reinforcement learning in the real world?
  52. 52. Simplification #1: we are going to require that the reward for an action is revealed (almost) immediately and, more importantly, that it is attributable only to the previous action. How can you do reinforcement learning in the real world?
  53. 53. How can you do reinforcement learning in the real world? [Diagram: Environment/Agent loop with State, Action and Reward. Requirement on the reward: it arrives quickly, and is attributable to a single action.]
  54. 54. Q: Isn’t the immediate reward requirement a problem? A: It depends. Though tricky, there is a huge class of problems for which you can find short-term proxy rewards that align well with long-term rewards. This is especially true in online applications. How can you do reinforcement learning in the real world?
  55. 55. How can you do reinforcement learning in the real world? Proxy reward examples: News site (long-term reward: user satisfaction; short-term proxy: dwell time). Weight loss program (long-term reward: kilos lost; short-term proxy: exercise time). Video site (long-term reward: annual viewing time; short-term proxy: seconds viewed following an action). General-purpose: if you can build a predictor that accurately predicts the long-term reward using short-term features, use the prediction as a short-term reward.
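  The general-purpose option can be sketched as follows, assuming scikit-learn is available; the feature columns, the numbers and the immediate_reward helper are all made up for illustration, not taken from the slides:

      import numpy as np
      from sklearn.ensemble import GradientBoostingRegressor

      # Hypothetical historical data: short-term signals observed soon after an
      # action (e.g. dwell time in seconds, sessions in the first week) paired
      # with the long-term reward that eventually materialised
      # (e.g. annual viewing hours).
      short_term_features = np.array([[120.0, 3], [45.0, 1], [300.0, 7], [10.0, 1]])
      long_term_reward = np.array([40.0, 5.0, 90.0, 1.0])

      proxy_model = GradientBoostingRegressor().fit(short_term_features, long_term_reward)

      def immediate_reward(short_term_signal):
          # Use the predicted long-term reward as the bandit's short-term proxy reward.
          return proxy_model.predict(np.array([short_term_signal]))[0]

      print(immediate_reward([60.0, 2]))  # reward fed back to the agent right away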
  56. 56. Simplification #2: we are going to require that possible states do not depend on previous actions we took. How can you do reinforcement learning in the real world?
  57. 57. How can you do reinforcement learning in the real world? [Diagram: Environment/Agent loop with State, Action and Reward. Requirement on the reward: it arrives quickly, and is attributable to a single action. Requirement on the state: it doesn’t depend on previous actions.]
  58. 58. Given these simplifications, we have what is known as immediate-reward reinforcement learning, or contextual bandits as it’s more commonly known. How can you do reinforcement learning in the real world?
  59. 59. With no dependence on the number of observations, we have a setting that is still RL, but closer to supervised learning in terms of tractability. How can you do reinforcement learning in the real world?
  60. 60. How can you do reinforcement learning in the real world? Max’s Difficulty Continuum again: supervised learning (“straightforward”*: incorrect signal, independent of the number of observations) versus full reinforcement learning (“hard as nails”: correct signal, depends on the number of observations). (* not necessarily easy)
  61. 61. How can you do reinforcement learning in the real world? Max’s Difficulty Continuum: supervised learning (“straightforward”*: incorrect signal, independent of the number of observations), contextual bandits (rightish signal, independent of the number of observations), full reinforcement learning (“hard as nails”: correct signal, depends on the number of observations). (* not necessarily easy)
  62. 62. The contextual bandit (CB) problem, in CB lingo: Repeatedly do: 1. Observe features x (analogous to state in RL) 2. Choose action a given x 3. Receive immediate reward r for the action Objective: maximise expected reward over time. How can you do reinforcement learning in the real world?
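  A minimal Python sketch of this loop, assuming a hypothetical policy object with best_action(x) and learn(x, a, p, r) methods, a hypothetical environment with observe_features() and reward(x, a), and simple epsilon-greedy exploration (none of these names come from the slides):

      import random

      def contextual_bandit_loop(policy, environment, actions, epsilon=0.1, n_rounds=10_000):
          for _ in range(n_rounds):
              x = environment.observe_features()      # 1. observe features x
              greedy = policy.best_action(x)          # the policy's current best guess
              if random.random() < epsilon:
                  a = random.choice(actions)          # explore
              else:
                  a = greedy                          # exploit
              # probability with which this particular action was chosen
              if a == greedy:
                  p = 1 - epsilon + epsilon / len(actions)
              else:
                  p = epsilon / len(actions)
              r = environment.reward(x, a)            # 3. receive immediate reward r
              policy.learn(x, a, p, r)                # update using the (x, a, p, r) quad

  Logging p alongside x, a and r is what makes the offline evaluation discussed later in the deck possible.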
  63. 63. Given the simplifications, contextual bandit problems are solvable using much less data than full RL problems. This makes CBs an excellent candidate for solving real-world problems. How can you do reinforcement learning in the real world?
  64. 64. Next question: how might we go about solving a contextual bandit problem? How can you do reinforcement learning in the real world?
  65. 65. Let’s take a break: 20 minutes
  66. 66. Next question: how might we go about solving a contextual bandit problem? How can you do reinforcement learning in the real world?
  67. 67. One possible solution: ML reductions
  68. 68. One possible solution: ML reductions There are two approaches to solving a machine learning problem: 1. Design new algorithms 2. Figure out how to reuse existing algorithms The subfield of machine learning reductions focuses on 2). It’s one of my favourite ML topics.
  69. 69. One possible solution: ML reductions General approach: reduce your original data distribution into something that can be solved by an existing, simpler algorithm. Solve that, and roll the solution back up to solve your original problem.
  70. 70. One possible solution: ML reductions Some of these may be hard to believe, but using either a single reduction or a stack of reductions, you can reduce at least the following:
  71. 71. One possible solution: ML reductions ● Importance-weighted binary classification to binary classification ● Regression to binary classification ● Quantile regression to binary classification ● Multiclass classification to binary classification ● Cost-sensitive multiclass classification to importance-weighted binary classification ● Cost-sensitive multiclass classification to regression ● Ranking to binary classification ● Contextual bandits to multiclass classification ● Contextual bandits to binary classification ● Contextual bandits to regression
  72. 72. One possible solution: ML reductions Putting our ML reductionist hat on, let’s take a closer look at the agent part of the contextual bandit process.
  73. 73. One possible solution: ML reductions [Diagram: Environment/Agent loop with Features, Action and Reward. Goal: learn to act so as to maximise reward over time.]
  74. 74. One possible solution: ML reductions [Diagram: the Agent is an exploration policy. Job: at each timestep, observe state & play an action, either the best one or one chosen according to some exploration strategy. Features in, action out.]
  75. 75. One possible solution: ML reductions [Diagram: the exploration policy splits into a policy and an exploration strategy. Policy job: at each timestep, observe state & output the best action (features in, action out). Exploration strategy job: at each timestep, decide whether to choose the best action, or try some other action.]
  76. 76. You could argue finding the best way to explore is basically what RL is all about. It’s such a broad topic that we’ll skip it* in this talk, and focus on the policy itself. *give me a shout after the talk if this is something you’d like to learn more about. One possible solution: ML reductions
  77. 77. A policy is a learned function that takes a state as input and outputs a prediction of the best action. Replace “state” with “features” and “action” with “class” and you get: ….a learned function that takes features as input and outputs a prediction of the best class. Another way to think about this: a policy is a classifier that acts. One possible solution: ML reductions
  78. 78. *Puts reductionist hat on*: all of this sounds an awful lot like supervised learning. One possible solution: ML reductions
  79. 79. Supervised learning assumes a full information setting, so we can’t use it directly. The bad, and beautiful, thing about reinforcement learning is that you never get to see rewards for actions you didn’t take. One possible solution: ML reductions
  80. 80. However, it is possible to fill in “fake” reward information in such a way that you get a dataset without missing observations. One possible solution: ML reductions
  81. 81. This doesn’t seem possible, but it is (we’ll learn one technique later on). And this is massively exciting, because it means we can solve the policy part of contextual bandits with any supervised learning classifier. One possible solution: ML reductions
  82. 82. By any classifier, I do mean any. We treat the classifier as an oracle, a black box whose inner workings we don’t even need to know about. Any classifier (assuming sufficient expressiveness) will do: ● Gradient boosted classifiers ● Neural nets ● Logistic regression ● Decision trees ● KNN ● SVMs ● Random Forests ● ... One possible solution: ML reductions
  83. 83. One possible solution: ML reductions [Diagram: the exploration policy now wraps a supervised classifier oracle. Oracle job: at each timestep, observe state (as modified features) & output the best action. Exploration strategy job: at each timestep, decide whether to choose the best action, or try some other action.]
  84. 84. Exciting conclusion: we can reduce contextual bandits to supervised learning + exploration, and solve the learning part using an oracle learner. But how do we deal with the partial information problem inherent to all RL? One possible solution: ML reductions
  85. 85. Contextual bandits & the partial information problem
  86. 86. The reinforcement learning setting, including the contextual bandit setting, suffers from severe selection bias, because we never get to see rewards for actions we didn’t take. It makes evaluating the goodness of a policy less than straightforward. Let’s look at an example. Contextual bandits & the partial information problem
  87. 87. Let’s pretend we’ve collected data (also known as experience) from a contextual bandit agent that chooses between 4 actions (e.g. news articles) according to some exploration policy π. Contextual bandits & the partial information problem
  88. 88. Contextual bandits & the partial information problem Let’s imagine we’ve logged the following reward sequence (expected reward: 9/5 = 1.8): (a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1)
  89. 89. Contextual bandits & the partial information problem Let’s imagine we’ve logged the following reward sequence (expected reward: 9/5 = 1.8): (a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1) Now, let’s say we want to improve on the existing system and train a new policy using the logged data. It chooses: (a: 1, x, r: ?) (a: 3, x, r: ?) (a: 2, x, r: ?) (a: 1, x, r: ?) (a: 4, x, r: ?) How can we tell if our new policy is better?
  90. 90. Contextual bandits & the partial information problem Recall the logged reward sequence (expected reward: 9/5 = 1.8): (a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1) Our new policy chooses: (a: 1, x, r: 1) (a: 3, x, r: ?) (a: 2, x, r: ?) (a: 1, x, r: 4) (a: 4, x, r: ?) If we only use rewards for the actions observed, we get an expected reward of 5/2 = 2.5. But is this policy actually better? Not necessarily.
  91. 91. Contextual bandits & the partial information problem Recall the logged reward sequence (expected reward: 9/5 = 1.8): (a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1) Our new policy chooses: (a: 1, x, r: 1) (a: 3, x, r: 0) (a: 2, x, r: 0) (a: 1, x, r: 4) (a: 4, x, r: 0) Setting unseen rewards to zero doesn’t help, either: now the policy seems worse (expectation 1.0), but we don’t really know, since we are just guessing the unseen rewards.
  92. 92. Contextual bandits & the partial information problem Suppose the actual best sequence (hidden from us) is the one our new policy would have chosen: (a: 1, x, r: 1) (a: 3, x, r: 4) (a: 2, x, r: 5) (a: 1, x, r: 4) (a: 4, x, r: 3) We have a “perfect” policy, with an expected reward of 3.4 (1.9 times better than our previous one), but neither of our previous attempts at evaluation estimated this well at all. What we need is a way of filling in fake rewards that is unbiased, in order to build an unbiased estimator.
  93. 93. Contextual bandits & the partial information problem In math notation, our previous (bad) zero-filling estimator can be formalised as follows, where: n is the number of logged rounds; x the features observed during each round; a the action chosen by the policy during each round; r the reward observed for the (x, a) pair during each round (missing observations zero-filled).
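  The formula itself appears only as an image on the slide; in the notation just defined, the zero-filling estimator is presumably

      \hat{V}_{\text{zero}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} r_i \, \mathbf{1}\{\pi(x_i) = a_i\}

  i.e. rounds where the new policy π disagrees with the logged action contribute zero reward, which reproduces the 1.0 estimate from the worked example above.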
  94. 94. In order to overcome these bias issues, we are going to leverage one piece of information that we can collect but haven’t used yet: action probabilities (the probability of choosing a particular arm at a given timestep). Since a contextual bandit policy both explores and exploits, at any given time step, there’s some probability a given action will be chosen. So, in addition to features x, action a, and observed reward r at each timestep, we also have p, the probability the action was chosen, giving us an (x,a,p,r) quad. Contextual bandits & the partial information problem
  95. 95. Contextual bandits & the partial information problem Let’s tweak our bad estimator. If our new policy disagrees with the logged action at any given time, we fill in a zero reward as before. However, if our new policy agrees, we take the observed reward and inversely weight it by the probability with which it was chosen in our logged data. This estimator is known as IPS (inverse propensity scoring, a.k.a. inverse probability weighting).
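  The slide’s formula isn’t reproduced in the transcript either; the standard IPS estimator, in the same notation with p_i the logged probability of the chosen action, is

      \hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{r_i \, \mathbf{1}\{\pi(x_i) = a_i\}}{p_i}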
  96. 96. It is possible to show that an IPS estimator provides an unbiased estimate of the reward. In fact, the proof is so short that we can do it now. Contextual bandits & the partial information problem
  97. 97. Contextual bandits & the partial information problem Theorem [shown as an image on the slide]
  98. 98. Contextual bandits & the partial information problem Theorem and proof [shown as images on the slide]
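  The usual one-line argument (a sketch, not necessarily the exact proof on the slides) conditions on x and takes the expectation over the logging policy’s action distribution p(· | x), assuming every action has p(a | x) > 0:

      \mathbb{E}_{a \sim p(\cdot \mid x)} \left[ \frac{r(x, a)\, \mathbf{1}\{\pi(x) = a\}}{p(a \mid x)} \right]
        = \sum_{a} p(a \mid x)\, \frac{r(x, a)\, \mathbf{1}\{\pi(x) = a\}}{p(a \mid x)}
        = r(x, \pi(x))

  So, in expectation, each IPS term equals the reward the new policy would have received, which is exactly the quantity we want to estimate.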
  99. 99. IPS isn’t the only estimator. Other candidates include: ● Direct method (DM): estimate reward directly using a separate predictor ● Doubly Robust (DR): combine IPS & DM ● Clipping, Weighted IPS, MTR (upcoming) Contextual bandits & the partial information problem
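  For reference (not shown on the slides), the doubly robust estimator combines a direct reward model \hat{r}(x, a) with an IPS-style correction:

      \hat{V}_{\text{DR}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{r}(x_i, \pi(x_i)) + \frac{(r_i - \hat{r}(x_i, a_i)) \, \mathbf{1}\{\pi(x_i) = a_i\}}{p_i} \right]

  It remains unbiased if either the reward model or the logged probabilities are accurate, hence “doubly robust”.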
  100. 100. What does all this mean? We’ll get to the most interesting bit shortly, but first, let’s return to the problem of actually implementing a contextual bandit. Oracle learners
  101. 101. Oracle learners As we saw before, you can reduce contextual bandits to exploration + supervised learning, and use any supervised learning algorithm as an oracle learner. [Diagram: an exploration policy wrapping a supervised classifier oracle, as on the earlier reduction slide.]
  102. 102. Oracle learners Let’s say we want to use multiclass logistic regression as the oracle. Since we don’t observe all possible rewards at each timestep, we can’t use it directly. [Same diagram as before.]
  103. 103. Oracle learners If we did, we’d be learning from incomplete data (as we saw before) and the classifier wouldn’t work well. [Same diagram as before.]
  104. 104. Oracle learners We would also run into massive class imbalance issues, since the majority of the reward information we do have is from whatever the logged policy thought was best. [Same diagram as before.]
  105. 105. Oracle learners Let’s fiddle around with the data to make it compatible with oracle classification algorithms. [Same diagram as before, but the oracle now receives modified features.]
  106. 106. Given experience (x,a,p,r) and a supervised classification algorithm, set rewards as follows (for each timestep): ● For the reward of the action that was taken, set r = r/p(a) ● For all other actions, set r = 0 Oracle learners
  107. 107. Given experience (x,a,p,r) and a supervised classification algorithm, set rewards as follows (for each timestep): ● For the reward of the action that was taken, set r = r/p(a) ● For all other actions, set r = 0 This is simply IPS! Oracle learners
  108. 108. Result: all missing rewards filled in in an unbiased fashion, creating a supervised learning problem. The class imbalance issue is also gone. It’s really that simple. Note: not all oracle learners need this tweak, but most classification algorithms do. Oracle learners
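  A minimal sketch of this data transformation, assuming logged experience arrives as a list of (x, a, p, r) tuples with integer action indices (a hypothetical format, not something prescribed by the slides):

      import numpy as np

      def ips_fill_rewards(logged, n_actions):
          """For each logged round, give the taken action reward r / p and every
          other action reward 0, producing a fully labelled dataset that an
          oracle (e.g. a cost-sensitive or per-action regression learner) can consume."""
          features, rewards = [], []
          for x, a, p, r in logged:
              filled = np.zeros(n_actions)
              filled[a] = r / p          # inverse propensity weighting
              features.append(x)
              rewards.append(filled)
          return np.array(features), np.array(rewards)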
  109. 109. Up next: the last, and most interesting bit of this talk. Oracle learners
  110. 110. Policy evaluation
  111. 111. Using an unbiased estimator to fill in missing rewards allows us to solve contextual bandits with oracle learners. That’s neat, but not the best part. Policy evaluation
  112. 112. We never explicitly mentioned what assumptions our logged quads (x,a,p,r) must satisfy in order for us to estimate rewards in an unbiased fashion. Policy evaluation
  113. 113. Answer: apart from assumptions related to the contextual bandit setting itself, pretty much nothing. Policy evaluation
  114. 114. We can take any logged experience of the form (x,a,p,r) and evaluate a new policy offline, just like we do in supervised learning. Policy evaluation
  115. 115. What if the experience was generated by 10 different policies, each deployed after the other? Doesn’t matter. Policy evaluation
  116. 116. What if the experience was generated by a policy using an entirely different learning algorithm (e.g. gradient boosting vs. logistic regression)? Doesn’t matter. Policy evaluation
  117. 117. What if the experience was generated by a policy just randomly exploring, possibly without any machine learning at all? Doesn’t matter. Policy evaluation
  118. 118. Regardless of the policy that generated our experience, we can use it for training a new policy and evaluating it offline. We can run hundreds of experiments a day, testing new hyperparameters, exploration options, learning algorithms, features etc. And we can do this without using a simulator, using real world data collected from real users. Policy evaluation
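  A minimal sketch of offline evaluation with the IPS estimator, assuming the same hypothetical logged format of (x, a, p, r) tuples and a new_policy callable that maps features to an action:

      def ips_offline_value(logged, new_policy):
          # Estimate the expected reward of new_policy from experience generated
          # by any other policy: only rounds where the two policies agree contribute,
          # and their rewards are inversely weighted by the logged probability.
          total = 0.0
          for x, a, p, r in logged:
              if new_policy(x) == a:
                  total += r / p
          return total / len(logged)

  This is how you might, for example, compare several candidate policies offline and only deploy the one with the highest estimated value, without ever showing the others to users.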
  119. 119. Putting it all together, this gives us a pretty fantastic recipe for success: 1. Implement data collection system, collecting quads (x,a,p,r) 2. Deploy your policy (at first, could even be a random choice sans machine learning) 3. Train a better policy using experience, deploy 4. Repeat step 3 using your ever-growing experience data Policy evaluation
  120. 120. Putting it all together, this gives us a pretty fantastic recipe for success: 1. Implement data collection system, collecting quads (x,a,p,r) 2. Deploy your policy (at first, could even be a random choice sans machine learning) 3. Train a better policy using experience, deploy 4. Repeat step 3 using your ever-growing experience data Policy evaluation (Important!)
  121. 121. Summary
  122. 122. Let’s recap some of the key takeaways from this talk. Summary
  123. 123. Supervised learning is useful, but doesn’t really uncover the right signal in many cases. Full reinforcement learning does uncover the correct signal, is causal by nature, but is also very difficult to apply to real-world problems because of the sample complexity required for credit assignment. Contextual bandits provide a happy medium by relaxing the full RL setting to only consider immediate rewards. Summary
  124. 124. Summary Max’s Difficulty Continuum: supervised learning (“straightforward”*: incorrect signal, independent of the number of observations), contextual bandits (rightish signal, independent of the number of observations), full reinforcement learning (“hard as nails”: correct signal, depends on the number of observations). (* not necessarily easy)
  125. 125. Contextual bandits can be reduced to exploration + supervised learning, allowing us to take advantage of ready-made, state-of-the-art learning algorithms. Contextual bandit policies can be evaluated offline, using experience quads (x,a,p,r) generated by any previous policy. A properly implemented contextual bandit learning system is a self-improving loop: better policies generate more reward, and provide more data for improving further. Contextual bandits allow you to solve a host of real-world problems, using real data instead of simulation, in a causal manner. Summary
  126. 126. If you have a problem where it is possible to explore, and a desire to make a machine learning system capable of uncovering new things, consider immediate-reward RL.
  127. 127. Thank you! Questions? max.pagels@fourkind.com www.fourkind.com @fourkindnow
