
John Maxwell, Data Scientist, Nordstrom at MLconf Seattle 2017


John Maxwell, a data scientist at Nordstrom, did his graduate work in international development economics, focusing on field experiments. He has since led research projects in Indonesia and Ethiopia related to microenterprise, developed large mathematical simulation models used for investment decisions by WSDOT, built dynamic pricing algorithms at Thriftbooks.com, and led the development of Nordstrom's open-source A/B testing service, Elwin. He currently focuses on contextual multi-armed bandit problems and machine learning infrastructure at Nordstrom.

Abstract

Solving the Contextual Multi-Armed Bandit Problem at Nordstrom:
The contextual multi-armed bandit problem, also known as associative reinforcement learning or bandits with side information, is a useful formulation of the multi-armed bandit problem that takes into account information about arms and users when deciding which arm to pull. The barrier to entry for both understanding and implementing contextual multi-armed bandits in production is high. The literature in this field pulls from disparate sources including (but not limited to) classical statistics, reinforcement learning, and information theory. Because of this, finding material that fills the gap between very basic explanations and academic journal articles is challenging. The goal of this talk is to help fill that gap with intermediate material and an example implementation. Specifically, I will explain key findings from some of the more cited papers in the contextual bandit literature, discuss the minimum requirements for implementation, and give an overview of a production system for solving contextual multi-armed bandit problems.


  1. Solving the Contextual Multi-Armed Bandit Problem at Nordstrom. John Maxwell, Nordstrom, 2017/05/19.
  2-4. Motivating the Problem: Limitations of A/B testing for product recommendations. Need to balance exploration and exploitation intelligently. People aren't all the same, though maybe similar.
  5. Exploration vs Exploitation. Explore first: explore up front, then exploit what you learned (like A/B testing).
  6. Exploration vs Exploitation. ε-greedy: mostly exploit, but also explore a little bit.
  7. Exploration vs Exploitation. Upper Confidence Bound (UCB): be optimistic when uncertain (a UCB sketch follows the slide list).
  8-13. UCB Illustrated, steps 1-6. [Six bar charts of average reward per arm (Arm 1, Arm 2; y-axis 0.50 to 1.75) with upper confidence bounds, highlighting which arm is chosen at each step.]
  14-17. Including Context. How can we use things we know about people and products (context) along with UCB? Train a ridge regression for each arm (regress rewards on contexts), then choose the arm using the UCB idea (Li et al. 2010; a LinUCB sketch follows the slide list):

  $$a_t = \arg\max_{a \in A_t} \Big( \underbrace{x_{t,a}^\top \hat{\theta}_a}_{\text{predicted payoff}} + \alpha \underbrace{\sqrt{x_{t,a}^\top A_a^{-1} x_{t,a}}}_{\text{standard deviation of payoff}} \Big)$$
  18-20. Including Context. This seems hard to implement: you have to invert a potentially large matrix on every call, and how do you deal with delayed rewards?
  21-22. Including Context. Notice how similar this is to classification; we have partial feedback, observing a reward only for the arm actually pulled:

      arm 1   arm 2   arm 3
      1       .       .
      .       .5      .
      .       .       2
      .8      .       .

  How can we get full feedback?
  23. Including Context. Inverse propensity scoring turns each observed reward into a cost estimate for the chosen arm (Agarwal et al. 2014; an IPS sketch follows the slide list):

  $$c_{i,t} = -\frac{r_{i,t}(a_i) \cdot \mathbb{I}\{\pi(x_{i,t}) = a_i\}}{p_{i,t}(a_i)}$$

      arm 1     arm 2     arm 3
      c_{1,1}   0         0
      0         c_{2,2}   0
      0         0         c_{3,3}
      c_{1,4}   0         0
  24-27. Including Context. If you think of the IPS-transformed rewards as costs, you can reduce this to cost-sensitive classification and use any cost-sensitive multi-class classification algorithm. The simplest is probably a least squares regression for each arm, with an argmin to choose the cost-minimizing arm. This part can be done offline (a sketch of the reduction follows the slide list).
  28-31. Implementation. Dora: a Node app that explores using ε-greedy. Logging and delayed joins. TensorFlow + TensorFlow Serving: a consistent way to train and serve the cost-sensitive classifier. (An ε-greedy logging sketch follows the slide list.)
  32. Questions? Twitter: @jhnmxwll, GitHub: jmmaxwell, site: john-maxwell.com, email: john [at] john-maxwell.com
  33. References.
      Agarwal, Alekh, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. 2014. "Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits." CoRR abs/1402.0555. http://arxiv.org/abs/1402.0555.
      Li, Lihong, Wei Chu, John Langford, and Robert E. Schapire. 2010. "A Contextual-Bandit Approach to Personalized News Article Recommendation." In Proceedings of the 19th International Conference on World Wide Web, 661-70. ACM.
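The sketches below annotate the deck; they are not from the slides themselves. First, a minimal sketch of the "optimistic when uncertain" idea illustrated on slides 7-13, using the standard UCB1 bonus sqrt(2 ln t / n_a). The `pull` callback and the simulated Bernoulli rewards are illustrative assumptions, not from the talk.

```python
import math
import random

def ucb1(n_arms, n_rounds, pull):
    """Pull the arm with the highest optimistic estimate: average reward
    plus a bonus that is large for rarely pulled (uncertain) arms."""
    counts = [0] * n_arms   # times each arm has been pulled
    sums = [0.0] * n_arms   # total reward collected per arm
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1     # pull every arm once to initialize
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Two arms with different true conversion rates; UCB should concentrate
# its pulls on the better arm while never abandoning the other entirely.
rates = [0.05, 0.08]
print(ucb1(2, 10_000, lambda a: 1.0 if random.random() < rates[a] else 0.0))
```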
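Next, a sketch of the disjoint LinUCB rule from slide 17 (Li et al. 2010): one ridge regression per arm, scored as the predicted payoff plus an uncertainty width. The class and method names are mine. Note the full matrix inverse on every call; that is exactly the implementation concern slides 18-20 raise, and maintaining the inverse incrementally (Sherman-Morrison) is the usual remedy.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: a ridge regression of reward on context per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # X'X + I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # X'r per arm

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta_hat = A_inv @ b                        # ridge estimate
            mean = x @ theta_hat                         # predicted payoff
            width = self.alpha * np.sqrt(x @ A_inv @ x)  # payoff uncertainty
            scores.append(mean + width)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```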
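A sketch of the inverse propensity scoring transform from slide 23: each logged (arm, reward, propensity) triple becomes one row of a full cost matrix, with the negated reward divided by the logging propensity in the chosen arm's column and zeros everywhere else. The flat log format is an assumed simplification.

```python
import numpy as np

def ips_costs(logs, n_arms):
    """Turn partial bandit feedback into full cost vectors via IPS.

    Each log entry is (chosen_arm, reward, propensity), where propensity
    is the probability the logging policy had of choosing that arm.
    """
    costs = np.zeros((len(logs), n_arms))
    for i, (arm, reward, propensity) in enumerate(logs):
        costs[i, arm] = -reward / propensity  # unchosen arms keep cost 0
    return costs

# Mirrors the structure of the table on slide 23: one nonzero cost per row.
logs = [(0, 1.0, 0.50), (1, 0.5, 0.25), (2, 2.0, 0.25), (0, 0.8, 0.50)]
print(ips_costs(logs, 3))
```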
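A sketch of the offline reduction on slides 24-27: fit one least squares (ridge) regression of IPS cost on context per arm, then serve the arm with the lowest predicted cost. `fit_per_arm` and `greedy_policy` are hypothetical names; the cost matrix is the output of the `ips_costs` sketch above.

```python
import numpy as np

def fit_per_arm(contexts, costs, ridge=1.0):
    """Ridge least squares of cost on context, fit separately per arm (offline)."""
    X = np.asarray(contexts)
    costs = np.asarray(costs)
    reg = ridge * np.eye(X.shape[1])
    return [np.linalg.solve(X.T @ X + reg, X.T @ costs[:, a])
            for a in range(costs.shape[1])]

def greedy_policy(weights, x):
    """Choose the cost-minimizing arm for context x."""
    return int(np.argmin([x @ w for w in weights]))
```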
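Finally, a sketch of the online ε-greedy decision described on slides 28-31. The talk's Dora is a Node service; this Python version only illustrates the decision rule and the propensity logging that make the later, possibly delayed, join of decisions to rewards (and hence the IPS transform) possible. The log schema is an assumption.

```python
import json
import random
import sys
import time

def choose_and_log(model_scores, epsilon, log_file):
    """ε-greedy: exploit the model's best arm, explore uniformly with prob. ε."""
    n = len(model_scores)
    best = max(range(n), key=lambda a: model_scores[a])
    arm = random.randrange(n) if random.random() < epsilon else best
    # Probability this policy had of serving the arm it actually served;
    # logged so rewards can be joined back later and IPS-weighted.
    propensity = (1 - epsilon) + epsilon / n if arm == best else epsilon / n
    log_file.write(json.dumps(
        {"ts": time.time(), "arm": arm, "propensity": propensity}) + "\n")
    return arm

print(choose_and_log([0.12, 0.08, 0.10], epsilon=0.1, log_file=sys.stdout))
```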
