
Brad Klingenberg, Director of Styling Algorithms, Stitch Fix at MLconf SF - 11/13/15


Combining Statistics and Expert Human Judgment for Better Recommendations: Most algorithmic recommendation engines target the consumer directly. Combining these recommendation algorithms with expert human selection and curation can make them more effective. But it also makes things more complicated. In this talk I’ll share lessons from combining statistics and human judgement for personal styling recommendations at Stitch Fix, where we are committed to our recommendations through the physical delivery of merchandise to clients. I’ll discuss both statistical and practical challenges of machine learning with humans in the loop: training with selection bias, making predictions for human consumption and measuring success.



  1. Combining Statistics and Expert Human Judgment for Better Recommendations. Brad Klingenberg, Stitch Fix (brad@stitchfix.com). MLconf San Francisco 2015. Three lessons.
  2. Lessons from having humans in the loop. Humans in the loop.
  3. Lessons from having humans in the loop. Humans in the loop: It works really well, but it’s complicated.
  4. Lessons from having humans in the loop. Humans in the loop: It works really well, but it’s complicated. Lesson 1: There’s more than one way to measure success.
  5. Lessons from having humans in the loop. Humans in the loop: It works really well, but it’s complicated. Lesson 1: There’s more than one way to measure success. Lesson 2: You have to think carefully about what you’re predicting.
  6. Lessons from having humans in the loop. Humans in the loop: It works really well, but it’s complicated. Lesson 1: There’s more than one way to measure success. Lesson 2: You have to think carefully about what you’re predicting. Lesson 3: Humans can say “no”, and this complicates experiments.
  7. Humans in the loop at Stitch Fix
  8. Stitch Fix
  9. Stitch Fix
  10. Stitch Fix
  11. Stitch Fix
  12. Styling at Stitch Fix. [Diagram: personal styling, inventory.]
  13. Styling at Stitch Fix: personalized recommendations. [Diagram: inventory, statistics, algorithmic recommendations.]
  14. Styling at Stitch Fix: expert human curation. [Diagram: algorithmic recommendations, human curation.]
  15. Lesson 1: There’s more than one way to measure success
  16. Traditional recommenders: learning through feedback.
  17. Humans in the loop: learning through feedback.
  18. Measuring success. In the end, you are usually interested in optimizing the final outcome, and this may make sense for the combined system. But when optimizing an algorithm, it is important to consider selection.
  19. Optimizing interaction. For a set of algorithms with the same marginal performance, we generally prefer the algorithms that...
  20. Optimizing interaction. For a set of algorithms with the same marginal performance, we generally prefer the algorithms that ● increase agreement and reduce needed searching (credible and useful recommendations)
  21. Optimizing interaction. For a set of algorithms with the same marginal performance, we generally prefer the algorithms that ● increase agreement and reduce needed searching (credible and useful recommendations) ● make the humans more efficient (effortless curation)
  22. Optimizing interaction. For a set of algorithms with the same marginal performance, we generally prefer the algorithms that ● increase agreement and reduce needed searching (credible and useful recommendations) ● make the humans more efficient (effortless curation) ● have a better user experience (fewer bad or annoying recommendations)
  23. Logging selection. This means logging and analyzing selection data.
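To make the logging idea concrete, here is a minimal sketch in Python. The SelectionEvent fields, the example records, and the rank cutoff of 3 used for "agreement" are invented for illustration; only the idea of recording stylist selections next to algorithmic ranks comes from the talk.

```python
# Hypothetical selection log: every algorithmic recommendation is recorded with
# its rank, whether the stylist selected it, and (if sent) the client outcome.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SelectionEvent:
    item_id: str
    algo_rank: int                  # position in the algorithm's ranked list
    stylist_selected: bool          # did the human curator choose it?
    client_success: Optional[bool]  # None if never sent, so never observed

log = [
    SelectionEvent("a", 1, True, True),
    SelectionEvent("b", 2, False, None),
    SelectionEvent("c", 3, True, False),
    SelectionEvent("d", 9, True, True),   # stylist had to search far down the list
]

selected = [e for e in log if e.stylist_selected]
n = len(selected)

success_rate = sum(e.client_success for e in selected) / n    # marginal performance of the combined system
agreement_at_3 = sum(e.algo_rank <= 3 for e in selected) / n  # selections coming from the top of the list
search_depth = sum(e.algo_rank for e in selected) / n         # how far stylists had to look

print(f"success: {success_rate:.2f}, agreement@3: {agreement_at_3:.2f}, search depth: {search_depth:.1f}")
```

Metrics like agreement and search depth are the interaction measures from the preceding slides; the success rate alone would not distinguish two algorithms that stylists experience very differently.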
  24. Lesson 2: You have to think carefully about what you’re predicting
  25. Training a model. What should you predict? Naive approach: ignore selection and train on success data. Advantages: ● “traditional” supervised problem ● simple historical data
  26. Censoring through selection. Problem: selection can censor your data.
  27. Censoring through selection. Problem: selection can censor your data.
  28. Censoring through selection. Problem: selection can censor your data. [Table: success (Yes/No) by whether arms are flaunted (Yes/No); for one value of the attribute success is observed with probabilities p and 1-p, for the other it is censored (“?”) because those items are not selected.]
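A tiny, invented illustration of the censoring problem (item names and outcomes are made up): success is only observed for items a stylist actually selected, so the naive training set silently drops everything the stylists filtered out, and treating the censored rows as failures would be worse still.

```python
# Hypothetical records: (item_id, stylist_selected, client_kept). The success
# outcome is only observed when the stylist selected the item; otherwise it is
# censored (None), not a true negative.
records = [
    ("sleeveless_top", False, None),   # never sent, so success never observed
    ("striped_dress",  True,  True),
    ("polka_blouse",   True,  False),
    ("cap_sleeve_tee", False, None),
]

# Naive success training set: only the selected rows survive.
naive_training = [(item, kept) for item, selected, kept in records if selected]
print(naive_training)   # [('striped_dress', True), ('polka_blouse', False)]

# Treating censored rows as failures would be even worse:
wrong_training = [(item, bool(kept)) for item, _, kept in records]
print(wrong_training)   # unsent items silently become "failures"
```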
  29. Predicting selection. What about predicting selection?
  30. Predicting selection. ● Simple, but selection is not really success ● There is a much more direct feedback loop
  31. Training a model. You should probably consider both: a selection model vs. a success model. It is most interesting when they disagree.
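A sketch of how the comparison might look; the features, data, and choice of logistic regression are placeholders, not the Stitch Fix models. The idea is simply to fit one model for stylist selection and one for client success, then surface the items where the two disagree most.

```python
# Hypothetical sketch: a selection model and a success model trained side by
# side, with items ranked by how strongly the two disagree.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                    # item/client features (invented)
selected = rng.random(500) < 0.5                 # did the stylist pick the item?
success = (rng.random(500) < 0.4) & selected     # only observable when selected

selection_model = LogisticRegression().fit(X, selected)
success_model = LogisticRegression().fit(X[selected], success[selected])

p_select = selection_model.predict_proba(X)[:, 1]
p_success = success_model.predict_proba(X)[:, 1]

# Largest disagreements: high predicted success but low predicted selection.
# These are the "bad disagreement" candidates worth inspecting for trust issues.
disagreement = p_success - p_select
for i in np.argsort(-disagreement)[:5]:
    print(f"item {i}: p(success)={p_success[i]:.2f}, p(selected)={p_select[i]:.2f}")
```

Items the success model likes but the selection model expects stylists to skip are exactly the cases explored on the following slides.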
  32. Good disagreement: ignoring an inappropriate recommendation. Client request: “I need an outfit for a glamorous night out!”
  33. Good disagreement: ignoring an inappropriate recommendation. Client request: “I need an outfit for a glamorous night out!”
  34. Bad disagreement: stylist not choosing something that would be successful. Predicted probability of success = 85% (?)
  35. Bad disagreement: stylist not choosing something that would be successful. Could lack trust in the recommendation: importance of transparency. Predicted probability of success = 85% (“based on her recent purchase”).
  36. Lesson 3: Humans can say “no”, and this complicates experiments. Or: “the downside of free will”.
  37. Testing with humans in the loop. Toy example: suppose we want to test a (bad) new policy.
  38. Testing with humans in the loop. Toy example: suppose we want to test a (bad) new policy. New rule: all fixes must contain polka dots!
  39. An experiment: Control vs. Test (Polka Dots Rule).
  40. Selective non-compliance. Humans may not comply. Or, they may comply only selectively. In the Test (Polka Dots Rule) group: “Please don’t send me any polka dots” - client X. The stylist: “Hmm, no.”
  41. Selective non-compliance: Control vs. Test (Polka Dots Rule).
  42. Selective non-compliance: Control vs. Test (Polka Dots Rule).
  43. Selective non-compliance. Humans help avoid bad choices - this is great for the client! But this can obscure the effect you are trying to measure.
  44. Selective non-compliance. Humans help avoid bad choices - this is great for the client! But this can obscure the effect you are trying to measure. Helpful analogy: non-compliance in clinical trials, which has been studied intensively.
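As a toy simulation of this (all numbers invented), compare an intent-to-treat estimate, which respects the original randomization, with a naive estimate based on who actually received polka dots. Because stylists override the rule exactly where it would hurt most, both estimates understate the damage the rule would do if it were always enforced.

```python
# Toy simulation of selective non-compliance (all numbers invented).
# Stylists override the polka-dots rule exactly for the clients who would
# dislike it most, which attenuates the measured effect of the (bad) policy.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

dislike = rng.uniform(0, 1, n)                 # latent dislike of polka dots
test = rng.random(n) < 0.5                     # randomized assignment to the new rule
control = ~test
treated = test & (dislike < 0.8)               # stylists quietly skip dots for strong dislikers

p_success = 0.30 - 0.15 * dislike * treated    # forcing dots on a disliker hurts success
success = rng.random(n) < p_success

itt = success[test].mean() - success[control].mean()        # intent-to-treat: respects randomization
naive = success[treated].mean() - success[control].mean()   # ignores who was actually overridden

print(f"Intent-to-treat estimate: {itt:+.3f}")
print(f"Naive 'as-treated' estimate: {naive:+.3f}")
# Both understate the roughly -0.075 drop the rule would cause under full compliance.
```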
  45. Lessons from having humans in the loop. Humans in the loop: It works really well, but it’s complicated. Lesson 1: There’s more than one way to measure success. Lesson 2: You have to think carefully about what you’re predicting. Lesson 3: Humans can say “no”, and this complicates experiments.
  46. Thanks! Questions? (we’re hiring!)
