From Practice to Theory in Learning from Massive Data by Charles Elkan at BigMine16
This talk will discuss examples of how Amazon applies machine learning to large-scale data, and open research questions inspired by these applications. One important question is how to distinguish users who can be influenced from those who are merely likely to respond. Another question is how to measure and maximize the long-term benefit of movie and other recommendations. A third question is how to share data while provably protecting the privacy of users. Note: Information in the talk is already public, and opinions expressed will be strictly personal.

  1. From practice to theory in learning from massive data. Charles Elkan, Amazon Fellow. August 14, 2016.
  2. Important: Information here is already public. Opinions are mine, not Amazon’s.
  3. [No text on this slide]
  4. Outline. Only 30 minutes! (1) Detecting anomalies in streaming data. (2) Making Spark usable for real-time predictions. (3) Amazon’s most important algorithm for recommendations. (4) Uplift: We want causation, not merely correlation.
  5. Outline. (1) Detecting anomalies in streaming data. (2) Making Spark usable for real-time predictions. (3) Amazon’s most important algorithm for recommendations. (4) Uplift: We want causation, not merely correlation.
  6. From practice to theory
  7. From theory to practice
  8. Now for everyone!
  9. Outline. (1) Detecting anomalies in streaming data. (2) Making Spark usable for real-time predictions. (3) Amazon’s most important algorithm for recommendations. (4) Uplift: We want causation, not merely correlation.
  10. From practice to practice
  11. Outline. (1) Detecting anomalies in streaming data. (2) Making Spark usable for real-time predictions. (3) Amazon’s most important algorithm for recommendations. (4) Uplift: We want causation, not merely correlation.
  12. Academic versus applied. In theory, researchers favor simplicity. In practice, they don’t. In industry, simplicity genuinely wins. Example: desiderata for recommender systems. (1) Respect the privacy of users; don’t be creepy. (2) Make recommendations understandable. (3) Make them responsive to the user’s most recent interests. (4) Generate them with millisecond latency.
  13. Amazon’s most important recommender system. (1) Respect the privacy of users; don’t be creepy. (2) Make recommendations understandable. (3) Make them responsive to the user’s most recent interests. (4) Generate them with millisecond latency.
  14. Outline. (1) Detecting anomalies in streaming data. (2) Making Spark usable for real-time predictions. (3) Amazon’s most important algorithm for recommendations. (4) Uplift: We want causation, not merely correlation.
  15. What data scientists do every day. Let x be a user and let R = 0 or 1 be a response. For example, R=1 means the user buys shoes in the next month. Routinely, we train models to predict the probability p(R=1|x). We send messages and coupons to users with high p(R=1|x).
  16. Is p(R=1|x) actually useful? In principle, no. "Our goal is not to predict the future; it is to change the future." • Merely predicting user behavior is of limited interest. We want to select treatments that influence users. • T = t means we choose treatment t. • For each available t, compute p(R=1|x,T=t). • Choose the t that gives the highest probability.
  17. The risk of ignoring uplift. Users are ranked by p(R=1|x), shown by the brown line. The blue dashed line shows p(R=1|x,T=t). The treatment t has a negative effect for users in the top 5%: p(R=1|x,T=t) < p(R=1|x).
  18. Politicians know this … If you are a Republican, don’t target confirmed Democrat voters! Instead: • Send persuasive messages to undecided voters. • Send “get out the vote” messages to confirmed supporters. • Send “please donate” messages to these people also.
  19. A common scenario for uplift. Many treatments are almost free to apply, such as sending email. The uplift question is then which treatment is most effective. For each user x, we want to know which t has the highest value of p(R=1|x,T=t). Keep in mind: The same treatment may be the best for all x.
  20. A public dataset. Published by Kevin Hillstrom, former VP of database marketing at Nordstrom. Studied in several published papers on uplift, notably by Nicholas Radcliffe, professor at the University of Edinburgh. • 64,000 past customers of an e-commerce site selling clothing. • Randomized to no email, men’s email, or women’s email. • Three outcomes: binary visit, binary purchase, and numerical spend.
  21. Looking at the data. Treatments have a larger effect on “visit” than on “purchase given visit” or on “spend given purchase.” We'll analyze uplift (i.e., the causal influence of treatments) for visits. Table from Hillstrom’s MineThatData email analytics challenge by Radcliffe.
  22. The linear probability model. Assume the linear function p(R=1|x) = b_0 + ∑_i b_i x_i. • Find coefficients b_i to minimize square loss. Square loss is proper, so predicted probabilities are calibrated. Avoid overfitting and predictions <0 or >1 by not having too many predictors. Commonly used in econometrics, not in ML. In practice, often quite similar to logistic regression.
  23. Including treatment indicators M and W: probability of visit = 7.5% + … + 6.5% IF (men’s past AND men’s email) + 6.6% IF (women’s past AND men’s email) + 6.1% IF (women’s past AND women’s email).
  24. The men’s email is effective for customers who have previously purchased men’s or women’s clothing. The women’s email is not effective for customers who have previously purchased only men’s clothing.
  25. Optimal treatment policy: • If only men’s previous purchases: send men’s email. • If only women’s purchases: send either email. • If both: send men’s email. Hypothesis: Women tend to buy clothing for their families, but men tend to buy clothing only for themselves.
  26. Validation. How can we confirm that we have found an optimal policy? Approach: (1) Train models of response for each treatment. (2) For each user x in a test set, plot both predicted probabilities. (3) Three separate test sets: users who previously purchased only women’s clothing, only men’s, or both. (4) The latter two sets should show p(R=1|x, T=M) > p(R=1|x, T=W) for most x.
  27. Results using random forests. Lower two panels: As expected, p(R=1|x, T=M) > p(R=1|x, T=W). Top panel: The two treatments M and W are equally effective.
  28. What comes next? Conclusion: Indeed, one treatment (the men’s email) can be optimal for all customers. The step beyond uplift modeling is reinforcement learning: Learning a sequence of actions that is best for each user. • The goal is to maximize total lifetime reward from each customer. • Learn simultaneously how customers evolve and how they respond to actions that we take.
  29. Questions? (1) Detecting anomalies in streaming data. (2) Making Spark usable for real-time predictions. (3) Amazon’s most important algorithm for recommendations. (4) Uplift: We want causation, not merely correlation.
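
The selection rule from the slide "Is p(R=1|x) actually useful?" (compute p(R=1|x,T=t) for each available t, then choose the t with the highest probability) can be sketched in Python as follows. This is only an illustration, not code from the talk: the function names, the treatment labels, and every probability are made up, and uplift is assumed here to mean the difference from a no-treatment baseline.

```python
from typing import Dict


def best_treatment(response_prob: Dict[str, float]) -> str:
    """Return the treatment t with the highest predicted p(R=1 | x, T=t)."""
    return max(response_prob, key=lambda t: response_prob[t])


def uplift(response_prob: Dict[str, float], treatment: str, baseline: str = "none") -> float:
    """Assumed definition: p(R=1 | x, T=treatment) minus p(R=1 | x, T=baseline)."""
    return response_prob[treatment] - response_prob[baseline]


# Hypothetical model outputs for two users (illustrative numbers only).
likely_buyer = {"none": 0.60, "email": 0.55}  # high response, harmed by the email
persuadable = {"none": 0.10, "email": 0.25}   # low response, helped by the email

for name, probs in [("likely_buyer", likely_buyer), ("persuadable", persuadable)]:
    print(name, best_treatment(probs), round(uplift(probs, "email"), 2))
```

In this toy data, ranking by response probability alone would target the first user, even though the email lowers that user's predicted response. That is the risk described on the slide "The risk of ignoring uplift."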
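
The linear probability model slide describes fitting p(R=1|x) = b_0 + ∑_i b_i x_i by minimizing square loss. Below is a minimal sketch of that fit using the normal equations in pure Python. The helper names and the toy data are assumptions for illustration. The talk does not say which solver or software was used, and the sketch does not handle singular designs.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_probability_model(rows: Sequence[Sequence[float]],
                                 responses: Sequence[float]) -> List[float]:
    """Least-squares coefficients [b0, b1, ...] for p(R=1|x) = b0 + sum_i b_i * x_i."""
    design = [[1.0] + [float(v) for v in r] for r in rows]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, responses)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up toy data: a single predictor, x1 = 1 if the user received an email.
rows = [[0]] * 10 + [[1]] * 10
responses = [1] + [0] * 9 + [1, 1, 1] + [0] * 7  # 1 of 10 respond without email, 3 of 10 with
b0, b1 = fit_linear_probability_model(rows, responses)
print(round(b0, 3), round(b1, 3))
```

With a single 0/1 treatment indicator, the fitted intercept is the response rate of the untreated group and the coefficient is the difference in response rates, so the coefficient can be read directly as an estimated treatment effect on this toy data.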
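
The three interaction terms shown on the "probability of visit" slide are enough to reproduce the ordering in the optimal treatment policy. The sketch below adds up only those displayed terms. The terms hidden behind the slide's "…" are not reconstructed, and the sketch assumes they do not depend on which email is sent. The dictionary keys and function names are hypothetical labels chosen for this example.

```python
from typing import Dict, Tuple

# Interaction coefficients as displayed on the slide, in percentage points.
# Keys are (purchase history, email sent). Assumption: the elided "…" terms
# do not vary with the email, so they cancel when comparing emails.
INTERACTION: Dict[Tuple[str, str], float] = {
    ("mens", "mens_email"): 6.5,
    ("womens", "mens_email"): 6.6,
    ("womens", "womens_email"): 6.1,
}

EMAILS = ("mens_email", "womens_email")


def email_lift(history: Tuple[str, ...], email: str) -> float:
    """Sum the displayed coefficients that apply to this history and email."""
    return sum(INTERACTION.get((h, email), 0.0) for h in history)


def best_email(history: Tuple[str, ...]) -> str:
    """Email with the larger summed lift for the given purchase history."""
    return max(EMAILS, key=lambda e: email_lift(history, e))


for history in [("mens",), ("womens",), ("mens", "womens")]:
    lifts = {e: round(email_lift(history, e), 1) for e in EMAILS}
    print(history, lifts, best_email(history))
```

Under these assumptions, customers with only men's purchases get 6.5 points from the men's email and nothing from the women's email. Customers with only women's purchases get 6.6 versus 6.1 points, a gap small enough to be consistent with "send either email." Customers with both get about 13.1 versus 6.1 points, favoring the men's email.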
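
Step 4 of the validation approach expects p(R=1|x, T=M) > p(R=1|x, T=W) "for most x" in a test set. The slides check this visually with plots. One simple numeric summary, sketched below, is the share of test users for whom the first treatment has the higher predicted probability. The function name and the predicted probabilities are hypothetical and do not come from the talk's random forest models.

```python
from typing import Sequence


def share_where_first_wins(prob_first: Sequence[float],
                           prob_second: Sequence[float]) -> float:
    """Fraction of users whose predicted response is strictly higher under the first treatment."""
    if len(prob_first) != len(prob_second) or len(prob_first) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    wins = sum(a > b for a, b in zip(prob_first, prob_second))
    return wins / len(prob_first)


# Hypothetical predictions for four test users who previously bought only men's clothing.
p_mens_email = [0.22, 0.18, 0.30, 0.12]
p_womens_email = [0.15, 0.19, 0.21, 0.10]

share = share_where_first_wins(p_mens_email, p_womens_email)
print(share)
```

A share well above one half on the men's-only and both-categories test sets would agree with the policy, while a share near one half on the women's-only set would agree with "send either email."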
