From practice to theory
in learning from massive data
Charles Elkan
Amazon Fellow
August 14, 2016
Important
Information here is already public.
Opinions are mine, not Amazon’s.
Outline
Only 30 minutes!
1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation
From practice to theory
From theory to practice
Now for everyone!
From practice to practice
Academic versus applied
In theory, researchers favor simplicity. In practice, they don’t.
In industry, simplicity genuinely wins.
Example: Desiderata for recommender systems:
1. Respect the privacy of users; don’t be creepy.
2. Make recommendations understandable.
3. Make them responsive to the user’s most recent interests.
4. Generate them with millisecond latency.
Amazon’s most important recommender system
1. Respect the privacy of users; don’t be creepy.
2. Make recommendations understandable.
3. And responsive to the user’s most recent interests.
4. Generate them with millisecond latency.
What data scientists do every day
Let x be a user and let R = 0 or 1 be a response. For example, R=1
means the user buys shoes in the next month.
Routinely, we train models to predict the probability p(R=1|x).
We send messages and coupons to users with high p(R=1|x).
Is p(R=1|x) actually useful?
In principle, no. "Our goal is not to predict the future; it is to
change the future."
• Merely predicting user behavior is of limited interest.
We want to select treatments that influence users.
• T = t means we choose treatment t.
• For each available t, compute p(R=1|x,T=t).
• Choose the t that gives highest probability.
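The selection rule above can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the `models` dictionary, the treatment names, and the probabilities are all made up, and each "model" is just a function returning an estimate of p(R=1|x,T=t).

```python
from typing import Callable, Dict, Hashable, Tuple

# Hypothetical type: maps a user's features x to an estimate of p(R=1 | x, T=t).
ResponseModel = Callable[[dict], float]


def best_treatment(x: dict, models: Dict[Hashable, ResponseModel]) -> Tuple[Hashable, float]:
    """Return (t, p) where t has the highest estimated p(R=1 | x, T=t)."""
    scores = {t: model(x) for t, model in models.items()}
    t_best = max(scores, key=scores.get)
    return t_best, scores[t_best]


# Toy response models with invented numbers, for illustration only.
models = {
    "no_email": lambda x: 0.10,
    "coupon": lambda x: 0.10 + (0.05 if x["recent_buyer"] else -0.02),
}

choice_recent, _ = best_treatment({"recent_buyer": True}, models)
choice_lapsed, _ = best_treatment({"recent_buyer": False}, models)
```

In this toy setup the coupon is chosen for the recent buyer and no email for the lapsed user, because the invented coupon model lowers the lapsed user's response probability.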
The risk of ignoring uplift
Users are ranked by p(R=1|x), shown by the brown line.
The blue dashed line shows p(R=1|x,T=t).
The treatment t has a negative effect for users in the top 5%:
p(R=1|x,T=t) < p(R=1|x).
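A small worked example may make the risk concrete. The numbers below are invented, and the sketch assumes that p(R=1|x) can be read as the response probability without the treatment, so that uplift is the difference p(R=1|x,T=t) - p(R=1|x).

```python
# Invented numbers: (user id, baseline p(R=1|x), treated p(R=1|x,T=t)).
users = [
    ("a", 0.90, 0.85),
    ("b", 0.60, 0.70),
    ("c", 0.30, 0.45),
    ("d", 0.10, 0.12),
]


def uplift(baseline: float, treated: float) -> float:
    """Estimated change in response probability caused by the treatment."""
    return treated - baseline


# Ranking by predicted response puts the user the treatment hurts at the top.
by_response = sorted(users, key=lambda u: u[1], reverse=True)
# Ranking by uplift puts the most persuadable user at the top.
by_uplift = sorted(users, key=lambda u: uplift(u[1], u[2]), reverse=True)

top_by_response = by_response[0][0]
top_by_uplift = by_uplift[0][0]
```

Here the user with the highest predicted response has negative uplift, so targeting by p(R=1|x) alone would spend the treatment where it does harm.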
Politicians know this …
If you are a Republican, don’t target confirmed Democrat voters!
Instead:
• Send persuasive messages to undecided voters.
• Send “get out the vote” messages to confirmed supporters.
• Send “please donate” messages to these people also.
A common scenario for uplift
Many treatments are almost free to apply, such as sending email.
The uplift question is then which treatment is most effective.
For each user x, we want to know which t has highest value
p(R=1|x,T=t).
Keep in mind: The same treatment may be the best for all x.
A public dataset
Published by Kevin Hillstrom, former VP of database marketing
at Nordstrom.
Studied in several published papers on uplift, notably by Nicholas
Radcliffe, professor at the University of Edinburgh.
• 64,000 past customers of an e-commerce site selling clothing.
• Randomized to no email, men’s email, or women’s email.
• Three outcomes: visit (binary), purchase (binary), and spend (numerical).
Looking at the data
Treatments have a larger effect on “visit” than on “purchase
given visit” or on “spend given purchase.”
We'll analyze uplift (i.e., the causal influence of treatments)
for visits.
Table from Hillstrom’s MineThatData email analytics challenge by Radcliffe.
The linear probability model
Assume the linear function p(R=1|x) = b0 + ∑i bi * xi.
• Find coefficients bi to minimize square loss.
Square loss is proper, so predicted probabilities are calibrated.
Avoid overfitting and predictions <0 or >1 by not having too
many predictors.
Commonly used in econometrics, not in ML. In practice, often
quite similar to logistic regression.
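As a rough sketch of what fitting such a model involves, the code below minimizes square loss by solving the normal equations in plain Python. The function names and the toy data are assumptions for illustration. A real analysis would more likely use a numerical library, and this version does not guard against a singular design matrix.

```python
def fit_linear_probability(X, y):
    """Least-squares fit of p(R=1|x) = b0 + sum_i b_i * x_i via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting (fails if A^T A is singular).
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for i in range(k):
            if i != c:
                f = A[i][c]
                A[i] = [vi - f * vc for vi, vc in zip(A[i], A[c])]
    return [A[i][k] for i in range(k)]  # [b0, b1, ...]


def predict(b, x):
    """Predicted probability; not clipped, so it can leave [0, 1]."""
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))


# Toy data: one binary predictor (1 = received an email), invented outcomes.
X = [[0], [0], [0], [0], [1], [1], [1], [1]]
y = [0, 0, 0, 1, 0, 1, 1, 0]
b = fit_linear_probability(X, y)
```

With a single binary predictor, the fitted values are the observed response rates in each group, which is one way to see why the predictions are calibrated on the training data.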
Including treatment indicators M and W
probability of visit = 7.5% + …
  + 6.5% IF (men’s past AND men’s email)
  + 6.6% IF (women’s past AND men’s email)
  + 6.1% IF (women’s past AND women’s email)
The men’s email is effective for customers who have
previously purchased men’s or women’s clothing.
The women’s email is not effective for customers who have
previously purchased only men’s clothing.
Optimal treatment policy:
• If only men’s previous purchases: send men’s email.
• If only women’s purchases: send either email.
• If both: send men’s email.
Hypothesis: Women tend to buy clothing for their families,
but men tend to buy clothing only for themselves.
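The policy can be read off the three interaction coefficients shown for the fitted model. The sketch below uses only those three displayed terms and assumes the elided terms do not depend on the email sent. The `tie_margin` threshold for declaring "either" is a hypothetical choice, not something stated in the talk.

```python
# Interaction coefficients shown on the slide, in percentage points of visit probability.
# Keys are (past-purchase category, email sent). Elided model terms are ignored (assumption).
COEF = {
    ("mens", "M"): 6.5,
    ("womens", "M"): 6.6,
    ("womens", "W"): 6.1,
}


def treatment_effect(past, email):
    """Sum the displayed interaction terms that apply to this customer and email."""
    return sum(COEF.get((category, email), 0.0) for category in past)


def recommend(past, tie_margin=1.0):
    """Return 'M', 'W', or 'either' if the effects differ by less than tie_margin points."""
    m = treatment_effect(past, "M")
    w = treatment_effect(past, "W")
    if abs(m - w) < tie_margin:
        return "either"
    return "M" if m > w else "W"


policy = {
    "only men's": recommend({"mens"}),
    "only women's": recommend({"womens"}),
    "both": recommend({"mens", "womens"}),
}
```

Under these assumptions the three segments map to the men's email, either email, and the men's email respectively, matching the policy listed above.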
Validation
How can we confirm that we have found an optimal policy?
Approach:
1. Train models of response for each treatment.
2. For each user x in a test set, plot both predicted probabilities.
3. Three separate test sets: users who previously purchased only
women’s clothing, only men’s, or both.
4. The latter two sets should show p(R=1|x, T=M) > p(R=1|x, T=W)
for most x.
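Step 4 can be checked numerically as well as visually. The sketch below assumes the two response models have already produced predictions for each test user. The row layout, the function name, and the numbers are invented for illustration.

```python
from collections import defaultdict


def share_preferring_m(test_rows):
    """test_rows: iterable of (segment, p_m, p_w) with p_m = p(R=1|x,T=M), p_w = p(R=1|x,T=W).

    Returns {segment: fraction of test users with p_m > p_w}.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for segment, p_m, p_w in test_rows:
        totals[segment] += 1
        wins[segment] += p_m > p_w  # True counts as 1
    return {s: wins[s] / totals[s] for s in totals}


# Invented predictions for a handful of test users.
rows = [
    ("mens", 0.20, 0.12),
    ("mens", 0.18, 0.13),
    ("both", 0.25, 0.17),
    ("both", 0.15, 0.16),
    ("womens", 0.16, 0.15),
    ("womens", 0.14, 0.15),
]
shares = share_preferring_m(rows)
```

A share well above one half in the "mens" and "both" segments would support the policy. A share near one half in the "womens" segment would be consistent with the two emails being about equally effective there.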
Results using random forests:
Lower two panels: As expected,
p(R=1|x, T=M) > p(R=1|x, T=W).
Top panel: The two treatments
M and W are equally effective.
What comes next?
Conclusion: Indeed, one treatment (the men’s email) can be
optimal for all customers.
The step beyond uplift modeling is reinforcement learning:
Learning a sequence of actions that is best for each user.
• The goal is to maximize total lifetime reward from each
customer.
• Learn simultaneously how customers evolve and how
they respond to actions that we take.
Questions?
1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation

From Practice to Theory in Learning from Massive Data by Charles Elkan at BigMine16

  • 1. 1 From practice to theory in learning from massive data Charles Elkan Amazon Fellow August 14, 2016
  • 2. Important Information here is already public. Opinions are mine, not Amazon’s.
  • 3. 3
  • 4. Outline Only 30 minutes! 1. Detecting anomalies in streaming data 2. Making Spark usable for real-time predictions 3. Amazon’s most important algorithm for recommendations 4. Uplift: We want causation, not merely correlation
  • 5. Outline 1. Detecting anomalies in streaming data 2. Making Spark usable for real-time predictions 3. Amazon’s most important algorithm for recommendations 4. Uplift: We want causation, not merely correlation
  • 7. From theory to practice
  • 9. Outline 1. Detecting anomalies in streaming data 2. Making Spark usable for real-time predictions 3. Amazon’s most important algorithm for recommendations 4. Uplift: We want causation, not merely correlation
  • 10. From practice to practice
  • 11.
  • 12. Outline 1. Detecting anomalies in streaming data 2. Making Spark usable for real-time predictions 3. Amazon’s most important algorithm for recommendations 4. Uplift: We want causation, not merely correlation
  • 13. 13 Academic versus applied In theory, researchers favor simplicity. In practice, they don’t. In industry, simplicity genuinely wins. Example: Desiderata for recommender systems: 1. Respect the privacy of users; don’t be creepy. 2. Make recommendations understandable. 3. Make them responsive to the user’s most recent interests. 4. Generate them with millisecond latency.
  • 14. 14 Amazon’s most important recommender system 1. Respect the privacy of users; don’t be creepy. 2. Make recommendations understandable. 3. And responsive to the user’s most recent interests. 4. Generate them with millisecond latency.
  • 15. Outline 1. Detecting anomalies in streaming data 2. Making Spark usable for real-time predictions 3. Amazon’s most important algorithm for recommendations 4. Uplift: We want causation, not merely correlation
  • 16. What data scientists do every day Let x be a user and let R = 0 or 1 be a response. For example, R=1 means the user buys shoes in the next month. Routinely, we train models to predict the probability p(R=1|x). We send messages and coupons to users with high p(R=1|x). 16
  • 17. Is p(R=1|x) actually useful? In principle, no. "Our goal is not to predict the future; it is to change the future." • Merely predicting user behavior is of limited interest. We want to select treatments that influence users. • T = t means we choose treatment t. • For each available t, compute p(R=1|x,T=t). • Choose the t that gives the highest probability.
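The selection rule on this slide can be sketched in a few lines. This is only an illustration: the user representation, the treatment names, and the hand-written response models below are hypothetical stand-ins for fitted estimates of p(R=1|x,T=t), not anything described in the talk.

```python
from typing import Callable, Dict, Mapping

# Assumption: a user x is a mapping from feature name to numeric value.
User = Mapping[str, float]
# Assumption: one response model per treatment t, returning an estimate of p(R=1 | x, T=t).
ResponseModel = Callable[[User], float]


def best_treatment(x: User, models: Dict[str, ResponseModel]) -> str:
    """Return the treatment t whose model gives the highest estimated p(R=1 | x, T=t)."""
    return max(models, key=lambda t: models[t](x))


# Hypothetical hand-written models, purely for illustration.
toy_models: Dict[str, ResponseModel] = {
    "no_email": lambda x: 0.10,
    "coupon": lambda x: 0.10 + 0.05 * x["recent_visits"],
    "newsletter": lambda x: 0.12,
}

print(best_treatment({"recent_visits": 0.0}, toy_models))  # newsletter
print(best_treatment({"recent_visits": 1.0}, toy_models))  # coupon
```

The point of the sketch is that the decision depends on comparing treatments for the same user, not on how high p(R=1|x) is on its own.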
  • 18. The risk of ignoring uplift. [Figure] Users are ranked by p(R=1|x), shown by the brown line. The blue dashed line shows p(R=1|x,T=t). The treatment t has a negative effect for users in the top 5%: p(R=1|x,T=t) < p(R=1|x).
  • 19. Politicians know this … If you are a Republican, don’t target confirmed Democrat voters! Instead: • Send persuasive messages to undecided voters. • Send “get out the vote” messages to confirmed supporters. • Send “please donate” messages to these people also.
  • 20. A common scenario for uplift. Many treatments are almost free to apply, such as sending email. The uplift question is then which treatment is most effective. For each user x, we want to know which t gives the highest value of p(R=1|x,T=t). Keep in mind: The same treatment may be the best for all x.
  • 21. A public dataset. Published by Kevin Hillstrom, former VP of database marketing at Nordstrom. Studied in several published papers on uplift, notably by Nicholas Radcliffe, professor at the University of Edinburgh. • 64,000 past customers of an e-commerce site selling clothing. • Randomized to no email, men’s email, or women’s email. • Three outcomes: binary visit, binary purchase, and numerical spend.
  • 22. Looking at the data. Treatments have a larger effect on “visit” than on “purchase given visit” or on “spend given purchase.” We'll analyze uplift (i.e., the causal influence of treatments) for visits. Table from Hillstrom’s MineThatData email analytics challenge by Radcliffe.
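Because the customers were randomized across the three arms, a first estimate of average uplift on visits is simply the difference in visit rate between each email arm and the no-email arm. The following is a minimal sketch of that calculation; the arm names and the records are made up for illustration and are not taken from the Hillstrom data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumption: each record is (treatment arm, visited), with visited equal to 0 or 1.
Record = Tuple[str, int]


def visit_rates(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of users who visited, computed separately for each arm."""
    visits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for arm, visited in records:
        totals[arm] += 1
        visits[arm] += visited
    return {arm: visits[arm] / totals[arm] for arm in totals}


def average_uplift(records: Iterable[Record], control: str = "no_email") -> Dict[str, float]:
    """Visit rate of each treated arm minus the visit rate of the control arm."""
    rates = visit_rates(records)
    return {arm: rate - rates[control] for arm, rate in rates.items() if arm != control}


# Made-up records: 10 control users, 10 sent the men's email, 20 sent the women's email.
toy_records = (
    [("no_email", 1)] * 1 + [("no_email", 0)] * 9
    + [("mens_email", 1)] * 2 + [("mens_email", 0)] * 8
    + [("womens_email", 1)] * 3 + [("womens_email", 0)] * 17
)

for arm, uplift in average_uplift(toy_records).items():
    print(arm, round(uplift, 3))
```

This gives one number per treatment for the whole population. The per-user question, which treatment is best for a given x, needs a model that conditions on x as well.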
  • 23. The linear probability model. Assume the linear function p(R=1|x) = b_0 + Σ_i b_i x_i. • Find the coefficients b_i that minimize square loss. Square loss is proper, so predicted probabilities are calibrated. Avoid overfitting, and predictions below 0 or above 1, by not having too many predictors. Commonly used in econometrics, not in ML. In practice, often quite similar to logistic regression.
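As a rough illustration of the linear probability model, the sketch below fits the coefficients by ordinary least squares using the normal equations. It is dependency-free for the sake of the example; in practice a library least-squares routine would be the usual choice. The function names and the toy data are assumptions, and the solver assumes the predictors are not collinear.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] on copies so the inputs stay unchanged.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def fit_linear_probability_model(xs: Sequence[Sequence[float]], rs: Sequence[int]) -> List[float]:
    """Return [b0, b1, ...] minimizing square loss for p(R=1|x) = b0 + sum_i bi * xi."""
    design = [[1.0] + list(x) for x in xs]  # prepend a constant 1 for the intercept b0
    k = len(design[0])
    # Normal equations: (X^T X) b = X^T r.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xtr = [sum(row[i] * r for row, r in zip(design, rs)) for i in range(k)]
    return solve_linear_system(xtx, xtr)


def predict(coeffs: Sequence[float], x: Sequence[float]) -> float:
    """Predicted probability; nothing forces it into [0, 1]."""
    return coeffs[0] + sum(b * xi for b, xi in zip(coeffs[1:], x))


# Made-up data: one binary predictor; 1 of 10 respond when x=0, 3 of 10 when x=1.
xs = [[0.0]] * 10 + [[1.0]] * 10
rs = [1] + [0] * 9 + [1] * 3 + [0] * 7
coeffs = fit_linear_probability_model(xs, rs)
print([round(b, 3) for b in coeffs])
```

With a single indicator predictor the fitted intercept is the response rate when the indicator is 0, and the slope is the difference in response rates, which is why coefficients of this model can be read directly as percentage-point effects.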
  • 24. Including treatment indicators M and W: probability of visit = 7.5% + … + 6.5% IF (men’s past AND men’s email) + 6.6% IF (women’s past AND men’s email) + 6.1% IF (women’s past AND women’s email)
  • 25. The men’s email is effective for customers who have previously purchased men’s or women’s clothing. The women’s email is not effective for customers who have previously purchased only men’s clothing.
  • 26. Optimal treatment policy: • If only men’s previous purchases: send men’s email. • If only women’s purchases: send either email. • If both: send men’s email. Hypothesis: Women tend to buy clothing for their families, but men tend to buy clothing only for themselves.
  • 27. Validation How can we confirm that we have found an optimal policy? Approach: 1. Train models of response for each treatment. 2. For each user x in a test set, plot both predicted probabilities. 3. Three separate test sets: users who previously purchased only women’s clothing, only men’s, or both. 4. The latter two sets should show p(R=1|x, T=M) > p(R=1|x, T=W) for most x.
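Steps 2 to 4 of this approach can also be summarized numerically rather than graphically. The sketch below assumes two already-fitted response models, one per email, and reports for each test set the share of users for whom the men's email has the higher predicted probability. The models, the feature name, and the test users are hypothetical; the talk itself plots the two probabilities rather than counting.

```python
from typing import Callable, Dict, List, Mapping

# Assumption: a user x is a mapping from feature name to numeric value.
User = Mapping[str, float]
# Assumption: a fitted model maps a user to an estimated response probability.
Model = Callable[[User], float]


def share_where_m_beats_w(
    test_sets: Dict[str, List[User]], p_m: Model, p_w: Model
) -> Dict[str, float]:
    """For each named test set, the fraction of users with p(R=1|x,T=M) > p(R=1|x,T=W)."""
    return {
        name: sum(p_m(x) > p_w(x) for x in users) / len(users)
        for name, users in test_sets.items()
    }


# Hypothetical stand-ins for models trained separately on each treatment arm.
def toy_p_m(x: User) -> float:
    return 0.16


def toy_p_w(x: User) -> float:
    return 0.10 + 0.02 * x["recency"]


# Made-up test users, split by purchase history as on the slide.
toy_test_sets: Dict[str, List[User]] = {
    "mens_only": [{"recency": 1.0}, {"recency": 2.0}],
    "both": [{"recency": 1.0}, {"recency": 4.0}],
}

print(share_where_m_beats_w(toy_test_sets, toy_p_m, toy_p_w))
```

A share close to 1 in a test set supports sending the men's email to that group; a share near one half suggests the two emails are about equally effective there.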
  • 28. Results using random forests: Lower two panels: As expected, p(R=1|x, T=M) > p(R=1|x, T=W). Top panel: The two treatments M and W are equally effective.
  • 29. What comes next? Conclusion: Indeed, one treatment (the men’s email) can be optimal for all customers. The step beyond uplift modeling is reinforcement learning: Learning a sequence of actions that is best for each user. • The goal is to maximize total lifetime reward from each customer. • Learn simultaneously how customers evolve and how they respond to actions that we take.
  • 30. Questions? 1. Detecting anomalies in streaming data 2. Making Spark usable for real-time predictions 3. Amazon’s most important algorithm for recommendations 4. Uplift: We want causation, not merely correlation