Counterfactual evaluation of machine learning models
Michael Manapat
@mlmanapat
Stripe
Stripe
• APIs and tools for commerce and payments
import stripe
stripe.api_key = "sk_test_BQokiaJOvBdiI2HdlWgHa4ds429x"
stripe.Charge.create(
    amount=40000,
    currency="usd",
    source={
        "number": '4242424242424242',
        "exp_month": 12,
        "exp_year": 2016,
        "cvc": "123"
    },
    description="PyData registration for mlm@stripe.com"
)
Machine Learning at Stripe
• Ten engineers/data scientists as of July 2015 (no
distinction between roles)
• Focus has been on fraud detection
• Continuous scoring of businesses for fraud
risk
• Synchronous classification of charges as
fraud or not
Making a Charge
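[Diagram: (1) the customer's browser sends raw credit card information (CC: 4242..., Exp: 7/15) to the Stripe API, which (2) returns a token (tokenization); (3) the browser sends the order information to your server, which (4) charges the token through the Stripe API (payment).]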
Charge Outcomes
• Nothing (best case)
• Refunded
• Charged back ("disputed")
• Fraudulent disputes result
from "card testing" and "card
cashing"
• Can take > 60 days to get
raised
Model Building
• December 31st, 2013
• Train a binary classifier for
disputes on data from Jan
1st to Sep 30th
• Validate on data from Oct
1st to Oct 31st (need to wait
~60 days for labels)
• Based on validation data, pick
a policy for actioning scores:

block if score > 50
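The date-based split described above can be sketched as follows. This is a minimal illustration, not the pipeline used at Stripe: the `split_by_date` helper, the tuple layout, and the sample charges are all hypothetical, and the 60-day label delay is taken from the slide.

```python
from datetime import date, timedelta

# Assumption from the slide: disputes can take roughly 60 days to be raised.
LABEL_DELAY = timedelta(days=60)


def split_by_date(charges, today, train_end, valid_end):
    """Split (charge_date, features, label) tuples into train/validation sets.

    Refuses to validate on charges whose labels cannot have matured yet.
    """
    newest_labeled = today - LABEL_DELAY
    if valid_end > newest_labeled:
        raise ValueError("validation window ends before labels have matured")
    train = [c for c in charges if c[0] <= train_end]
    valid = [c for c in charges if train_end < c[0] <= valid_end]
    return train, valid


# Hypothetical charges: (date, features, label); label None means "not known yet".
charges = [
    (date(2013, 3, 15), {"amount": 1200}, 0),
    (date(2013, 9, 30), {"amount": 40000}, 1),
    (date(2013, 10, 12), {"amount": 900}, 0),
    (date(2013, 12, 20), {"amount": 5000}, None),  # too recent to have a label
]

train, valid = split_by_date(
    charges,
    today=date(2013, 12, 31),
    train_end=date(2013, 9, 30),
    valid_end=date(2013, 10, 31),
)
print(len(train), "training charges,", len(valid), "validation charges")
```

The most recent charge falls in neither set, which is the gap the next slide asks about: nothing is known about performance over the last two months.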
Questions
• Validation data is > 2 months old. How is the
model doing?
• What are the production precision and recall?
• Business complains about high false positive
rate: what would happen if we changed the
policy to "block if score > 70"?
Next Iteration
• December 31st, 2014. We
repeat the exercise from a
year earlier
• Train a model on data from
Jan 1st to Sep 30th
• Validate on data from Oct
1st to Oct 31st (need to wait
~60 days for labels)
• Validation results look ~ok
(but not great)
Next Iteration
• We put the model into production, and the
results are terrible
• From spot-checking and complaints from
customers, the performance is worse than
even the validation data suggested
• What happened?
Next Iteration
• We already had a model running that was blocking a
lot of fraud
• Training and validating only on data for which we had
labels: charges that the existing model didn't catch
• Far fewer examples of fraud in number. Essentially
building a model to catch the "hardest" fraud
• Possible solution: we could run both models in
parallel...
Evaluation and Retraining
Require the Same Data
• For evaluation, policy changes, and
retraining, we want the same thing:
• An approximation of the distribution of charges
and outcomes that would exist in the absence
of our intervention (blocking)
First attempt
• Let through some fraction of charges that we
would ordinarily block
• Straightforward to compute precision
import random

if score > 50:
    # Randomly let 5% of would-be-blocked charges through to observe outcomes
    if random.random() < 0.05:
        allow()
    else:
        block()
Recall
• Total "caught" fraud = 4,000 * (1/0.05) = 80,000
• Total fraud = 80,000 + 10,000 = 90,000
• Recall = 80,000 / 90,000 = 89%
1,000,000 charges   Score < 50   Score > 50
Total                  900,000      100,000
Not Fraud              890,000        1,000
Fraud                   10,000        4,000
Unknown                      0       95,000
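The arithmetic on this slide can be checked with a few lines. This is a sketch using only the counts in the table; the variable names are illustrative.

```python
ALLOW_RATE = 0.05  # probability of randomly allowing a charge the policy would block

# Labeled outcomes from the table.
fraud_below, ok_below = 10_000, 890_000  # not blocked by the policy: always allowed
fraud_above, ok_above = 4_000, 1_000     # score > 50: the 5% sample that was let through

weight = 1 / ALLOW_RATE  # each sampled charge stands in for 20 charges

# Precision of "block if score > 50": the common weight cancels out.
precision = fraud_above / (fraud_above + ok_above)

# Recall: scale the sampled fraud up to estimate all fraud the policy would block.
caught_fraud = fraud_above * weight
total_fraud = caught_fraud + fraud_below
recall = caught_fraud / total_fraud

print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
```

Precision needs no reweighting here because every labeled charge above the threshold was sampled with the same probability; recall does, because it mixes sampled and unsampled populations.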
Training
• Train only on charges that were not blocked
• Include weights of 1/0.05 = 20 for charges that
would have been blocked if not for the random
reversal
from sklearn.ensemble import RandomForestRegressor
...
r = RandomForestRegressor(n_estimators=10)
# weights: 20 for randomly allowed charges with score > 50, 1 otherwise
r.fit(X, Y, sample_weight=weights)
Training
• Use weights in validation (on hold-out set) as
well
from sklearn.model_selection import train_test_split
# Split the weights together with the data so they stay aligned
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    data, target, weights, test_size=0.2)
r = RandomForestRegressor(...)
...
r.score(X_test, y_test, sample_weight=w_test)
Better Approach
• We're letting through 5% of all charges we think
are fraudulent. Policy:
[Figure labels: "Very likely to be fraud", "Could go either way"]
Better Approach
• Propensity function: maps
classifier scores to P(Allow)
• The higher the score, the lower the
probability that we let the charge
through
• Get information on the area we
want to improve on
• Letting through less "obvious"
fraud ("budget" for evaluation)
Better Approach
def propensity(score):
    # Piecewise linear/sigmoidal
    ...

ps = propensity(score)  # P(Allow) for this score
original_block = score > 50
# Block with probability 1 - P(Allow)
selected_block = random.random() >= ps
if selected_block:
    block()
else:
    allow()
log_record(
    id, score, ps, original_block,
    selected_block)
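One possible shape for the propensity function is sketched below. The slides leave the body unspecified, so everything here is an assumption: the linear coefficients and the floor were chosen only because they reproduce the P(Allow) values in the example log records (0.30 at score 55, 0.25 at 60, 0.20 at 65, 0.0005 at 100), and `decide` is a hypothetical helper.

```python
import random

BLOCK_THRESHOLD = 50
FLOOR = 0.0005  # keep P(Allow) above 0 so the weight 1 / P(Allow) is always defined


def propensity(score):
    """Hypothetical piecewise-linear map from classifier score to P(Allow)."""
    if score <= BLOCK_THRESHOLD:
        return 1.0  # charges the policy would allow are always allowed
    return max(FLOOR, 0.85 - 0.01 * score)


def decide(score, rng):
    """Return the fields to log for one charge."""
    p_allow = propensity(score)
    return {
        "score": score,
        "p_allow": p_allow,
        "original_block": score > BLOCK_THRESHOLD,
        # rng.random() is in [0, 1), so a charge with P(Allow) = 1.0 is never blocked.
        "selected_block": rng.random() >= p_allow,
    }


rng = random.Random(0)  # seeded only to make this sketch repeatable
records = [decide(score, rng) for score in (10, 45, 55, 65, 100, 60)]
for record in records:
    print(record)
```

Logging `p_allow` next to each decision is what makes the later reweighting possible, so the value logged should be the one actually used for the draw.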
ID   Score   P(Allow)   Original Action   Selected Action   Outcome
1    10      1.0        Allow             Allow             OK
2    45      1.0        Allow             Allow             Fraud
3    55      0.30       Block             Block             -
4    65      0.20       Block             Allow             Fraud
5    100     0.0005     Block             Block             -
6    60      0.25       Block             Allow             OK
Analysis
• In any analysis, we only consider samples that
were allowed (since we don't have labels
otherwise)
• We weight each sample by 1 / P(Allow)
• cf. weighting by 1/0.05 = 20 in the uniform
probability case
ID   Score   P(Allow)   Weight   Original Action   Selected Action   Outcome
1    10      1.0        1        Allow             Allow             OK
2    45      1.0        1        Allow             Allow             Fraud
4    65      0.20       5        Block             Allow             Fraud
6    60      0.25       4        Block             Allow             OK
Evaluating the "block if score > 50" policy
Precision = 5 / 9 = 0.56
Recall = 5 / 6 = 0.83
Evaluating the "block if score > 40" policy
Precision = 6 / 10 = 0.60
Recall = 6 / 6 = 1.00
Evaluating the "block if score > 62" policy
Precision = 5 / 5 = 1.00
Recall = 5 / 6 = 0.83
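The three policy evaluations above can be reproduced from the four allowed rows. This is a sketch under the slide's setup; `evaluate_policy` and the tuple layout are illustrative names, not part of the original talk.

```python
# Allowed charges from the example table: (id, score, p_allow, is_fraud).
ALLOWED = [
    (1, 10, 1.0, False),
    (2, 45, 1.0, True),
    (4, 65, 0.20, True),
    (6, 60, 0.25, False),
]


def evaluate_policy(rows, threshold):
    """Estimate precision and recall of "block if score > threshold"."""
    blocked = fraud = blocked_fraud = 0.0
    for _id, score, p_allow, is_fraud in rows:
        weight = 1.0 / p_allow  # each allowed charge stands in for 1 / P(Allow) charges
        would_block = score > threshold
        if would_block:
            blocked += weight
        if is_fraud:
            fraud += weight
        if would_block and is_fraud:
            blocked_fraud += weight
    precision = blocked_fraud / blocked if blocked else float("nan")
    recall = blocked_fraud / fraud if fraud else float("nan")
    return precision, recall


for threshold in (50, 40, 62):
    precision, recall = evaluate_policy(ALLOWED, threshold)
    print(f"block if score > {threshold}: "
          f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Because the threshold is just a parameter, any candidate policy can be scored against the same log without touching production.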
Analysis
• The propensity function controls the
"exploration/exploitation" trade-off
• Precision, recall, etc. are estimators
• Variance of the estimators decreases the
more we allow through
• Bootstrap to get error bars (pick rows from the
table uniformly at random with replacement)
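A bootstrap over the logged rows might look like the sketch below. It uses a plain percentile interval, which is one common choice and an assumption here; the function names, the number of resamples, and the seed are illustrative.

```python
import random

# Allowed charges from the example table: (score, p_allow, is_fraud).
ALLOWED = [(10, 1.0, False), (45, 1.0, True), (65, 0.20, True), (60, 0.25, False)]


def weighted_precision(rows, threshold):
    """Weighted precision of "block if score > threshold"; None if nothing is blocked."""
    blocked = sum(1.0 / p for s, p, f in rows if s > threshold)
    blocked_fraud = sum(1.0 / p for s, p, f in rows if s > threshold and f)
    return blocked_fraud / blocked if blocked else None


def bootstrap_interval(rows, threshold, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the weighted precision estimate."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Pick rows uniformly at random with replacement, same size as the table.
        sample = [rng.choice(rows) for _ in rows]
        estimate = weighted_precision(sample, threshold)
        if estimate is not None:  # skip resamples with no blocked charges
            estimates.append(estimate)
    if not estimates:
        raise ValueError("no resample contained a blocked charge")
    estimates.sort()
    lower = estimates[int(alpha / 2 * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


lower, upper = bootstrap_interval(ALLOWED, threshold=50)
print(f"precision interval: [{lower:.2f}, {upper:.2f}]")
```

With only four rows the interval is far too wide to be useful; the point is the mechanics, which carry over unchanged to a real log.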
New models
• Train on weighted data
• Weights => approximate distribution without any
blocking
• Can test arbitrarily many models and policies
offline
• Li, Chen, Kleban, Gupta: "Counterfactual
Estimation and Optimization of Click Metrics for
Search Engines"
Technicalities
• Independence and random seeds
• Application-specific issues
Conclusion
• If some policy is actioning model scores, you
can inject randomness in production to
understand the counterfactual
• Instead of a "champion/challenger" A/B test, you
can evaluate arbitrarily many models and
policies in this framework
Thanks!
• Work by Ryan Wang (@ryw90), Roban Kramer
(@robanhk), and Alyssa Frazee (@acfrazee)
• We're hiring engineers/data scientists!
• mlm@stripe.com (@mlmanapat)

Counterfactual evaluation of machine learning models for fraud detection

  • 1. Counterfactual evaluation of machine learning models Michael Manapat @mlmanapat Stripe
  • 2. Stripe • APIs and tools for commerce and payments import stripe stripe.api_key = "sk_test_BQokiaJOvBdiI2HdlWgHa4ds429x" stripe.Charge.create( amount=40000, currency="usd", source={ "number": '4242424242424242', "exp_month": 12, "exp_year": 2016, "cvc": "123" }, description="PyData registration for mlm@stripe.com" )
  • 3. Machine Learning at Stripe • Ten engineers/data scientists as of July 2015 (no distinction between roles) • Focus has been on fraud detection • Continuous scoring of businesses for fraud risk • Synchronous classification of charges as fraud or not
  • 4. Making a Charge Customer's browser CC: 4242... Exp: 7/15 Stripe API Your server (1) Raw credit card information (2) Token (3) Order information (4) Charge token tokenization payment
  • 5. Charge Outcomes • Nothing (best case) • Refunded • Charged back ("disputed") • Fraudulent disputes result from "card testing" and "card cashing" • Can take > 60 days to get raised 5
  • 6. Model Building • December 31st, 2013 • Train a binary classifier for disputes on data from Jan 1st to Sep 30th • Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels) • Based on validation data, pick a policy for actioning scores:
 block if score > 50
  • 7. Questions • Validation data is > 2 months old. How is the model doing? • What are the production precision and recall? • Business complains about high false positive rate: what would happen if we changed the policy to "block if score > 70"?
  • 8. Next Iteration • December 31st, 2014. We repeat the exercise from a year earlier • Train a model on data from Jan 1st to Sep 30th • Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels) • Validation results look ~ok (but not great)
  • 9. Next Iteration • We put the model into production, and the results are terrible • From spot-checking and complaints from customers, the performance is worse than even the validation data suggested • What happened?
  • 10. Next Iteration • We already had a model running that was blocking a lot of fraud • Training and validating only on data for which we had labels: charges that the existing model didn't catch • Far fewer examples of fraud in number. Essentially building a model to catch the "hardest" fraud • Possible solution: we could run both models in parallel...
  • 11. Evaluation and Retraining Require the Same Data • For evaluation, policy changes, and retraining, we want the same thing: • An approximation of the distribution of charges and outcomes that would exist in the absence of our intervention (blocking)
  • 12. First attempt • Let through some fraction of charges that we would ordinarily block • Straightforward to compute precision
        if score > 50:
            if random.random() < 0.05:
                allow()
            else:
                block()
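The random reversal above can be sketched as follows. This is a minimal illustration, not Stripe's implementation: the `decide` and `precision` helpers are hypothetical names, and the 5% rate and threshold of 50 are taken from the slides.

```python
import random

ALLOW_RATE = 0.05  # fraction of would-be-blocked charges let through (from the slide)
THRESHOLD = 50     # the "block if score > 50" policy

def decide(score, rng=random):
    """Return (action, reversed_flag) for one charge.

    reversed_flag is True when the policy would have blocked the charge
    but it was randomly allowed so that its outcome can be observed.
    """
    if score > THRESHOLD:
        if rng.random() < ALLOW_RATE:
            return "allow", True
        return "block", False
    return "allow", False

def precision(outcomes):
    """Estimate the precision of the blocking policy.

    `outcomes` holds one boolean per reversed charge (True = fraud).
    Every reversed charge had the same 5% chance of being let through,
    so the plain fraction of fraud among them is the estimate.
    """
    return sum(outcomes) / len(outcomes)
```

Because the selection probability is uniform, no weighting is needed for precision; weights only matter once reversed charges are combined with charges the policy never blocked, as in the recall calculation.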
  • 13. Recall • Total "caught" fraud = (4,000 * 1/0.05) • Total fraud = (4,000 * 1/0.05) + 10,000 • Recall = 80,000 / 90,000 = 89%
        1,000,000 charges   Score < 50   Score > 50
        Total                  900,000      100,000
        Not Fraud              890,000        1,000
        Fraud                   10,000        4,000
        Unknown                      0       95,000
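The recall arithmetic above can be checked in a few lines. The counts come straight from the table; the variable names are illustrative, and the precision figure is derived from the same table rather than stated on the slide.

```python
ALLOW_RATE = 0.05

# Counts from the table above (1,000,000 charges in total).
fraud_below = 10_000        # fraud with score < 50, never blocked
fraud_above_seen = 4_000    # fraud among the 5% of score > 50 charges let through
ok_above_seen = 1_000       # non-fraud among the same 5%

# Each observed charge above the threshold stands in for 1 / 0.05 = 20 charges.
weight = 1 / ALLOW_RATE
caught_fraud = fraud_above_seen * weight
total_fraud = caught_fraud + fraud_below

recall = caught_fraud / total_fraud
precision = fraud_above_seen / (fraud_above_seen + ok_above_seen)

print(f"recall = {recall:.3f}")
print(f"precision = {precision:.3f}")
```

The 95,000 "Unknown" charges never enter the calculation directly; the weight of 20 on the 5,000 observed charges is what accounts for them.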
  • 14. Training • Train only on charges that were not blocked • Include weights of 1/0.05 = 20 for charges that would have been blocked if not for the random reversal
        from sklearn.ensemble import RandomForestRegressor
        ...
        r = RandomForestRegressor(n_estimators=10)
        r.fit(X, Y, sample_weight=weights)
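  • 15. Training • Use weights in validation (on hold-out set) as well
        # sklearn.cross_validation was removed in scikit-learn 0.20;
        # train_test_split now lives in sklearn.model_selection
        from sklearn.model_selection import train_test_split
        # Split the weights together with the data so each row keeps its own weight
        X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
            data, target, weights, test_size=0.2)
        r = RandomForestRegressor(...)
        ...
        r.score(X_test, y_test, sample_weight=w_test)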
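One way the `weights` passed to `fit` and `score` might be built is sketched below. The `sample_weight` helper and the example scores are hypothetical; the sketch assumes each allowed charge still carries the score the production model gave it when the block/allow decision was made.

```python
ALLOW_RATE = 0.05
THRESHOLD = 50

def sample_weight(score):
    """Weight for one allowed charge in the training or hold-out set."""
    # Charges above the threshold were observed only because of the random
    # reversal, so each one stands in for 1 / 0.05 = 20 similar charges.
    return 1 / ALLOW_RATE if score > THRESHOLD else 1.0

scores = [10, 45, 65, 60]  # illustrative production scores, not real data
weights = [sample_weight(s) for s in scores]
```

The resulting list lines up row for row with the feature matrix, so it can be passed as `sample_weight` and split alongside the data.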
  • 16. Better Approach • We're letting through 5% of all charges we think are fraudulent. • [Figure: the policy over the score range, with regions labeled "Very likely to be fraud" and "Could go either way"]
  • 17. Better Approach • Propensity function: maps classifier scores to P(Allow) • The higher the score, the lower the probability that we let the charge through • Get information on the area we want to improve on • Letting through less "obvious" fraud ("budget" for evaluation)
  • 18. Better Approach
        def propensity(score):
            # Piecewise linear/sigmoidal; returns P(Allow)
            ...
        ps = propensity(score)
        original_block = score > 50
        # ps is P(Allow), so block only when the random draw falls outside it
        selected_block = random.random() >= ps
        if selected_block:
            block()
        else:
            allow()
        log_record(
            id, score, ps, original_block, selected_block)
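The slides leave the body of `propensity` open. Below is one hypothetical piecewise-linear choice, with the slope and floor picked only so that it agrees with the example P(Allow) values in the slide 19 table; it is an assumption, not the function used in production. It treats the propensity as P(Allow), so a charge is blocked when the random draw is at or above it.

```python
import random

THRESHOLD = 50
MIN_ALLOW = 0.0005  # floor keeps P(Allow) > 0 so that 1 / P(Allow) stays finite

def propensity(score):
    """Hypothetical piecewise-linear map from classifier score to P(Allow)."""
    if score <= THRESHOLD:
        return 1.0  # the policy never blocks these charges
    return max(MIN_ALLOW, 0.85 - 0.01 * score)

def choose_action(score, rng=random):
    """Return (p_allow, original_block, selected_block) for one charge."""
    p_allow = propensity(score)
    original_block = score > THRESHOLD
    # Allow with probability p_allow, block otherwise.
    selected_block = rng.random() >= p_allow
    return p_allow, original_block, selected_block
```

The floor matters: a charge with P(Allow) = 0 can never be observed, and its inverse-propensity weight would be undefined.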
  • 19.
        ID   Score   P(Allow)   Original Action   Selected Action   Outcome
        1    10      1.0        Allow             Allow             OK
        2    45      1.0        Allow             Allow             Fraud
        3    55      0.30       Block             Block             -
        4    65      0.20       Block             Allow             Fraud
        5    100     0.0005     Block             Block             -
        6    60      0.25       Block             Allow             OK
  • 20. Analysis • In any analysis, we only consider samples that were allowed (since we don't have labels otherwise) • We weight each sample by 1 / P(Allow) • cf. weighting by 1/0.05 = 20 in the uniform probability case
  • 21. Evaluating the "block if score > 50" policy • Precision = 5 / 9 = 0.56 • Recall = 5 / 6 = 0.83
        ID   Score   P(Allow)   Weight   Original Action   Selected Action   Outcome
        1    10      1.0        1        Allow             Allow             OK
        2    45      1.0        1        Allow             Allow             Fraud
        4    65      0.20       5        Block             Allow             Fraud
        6    60      0.25       4        Block             Allow             OK
  • 22. Evaluating the "block if score > 40" policy • Precision = 6 / 10 = 0.60 • Recall = 6 / 6 = 1.00
        ID   Score   P(Allow)   Weight   Original Action   Selected Action   Outcome
        1    10      1.0        1        Allow             Allow             OK
        2    45      1.0        1        Allow             Allow             Fraud
        4    65      0.20       5        Block             Allow             Fraud
        6    60      0.25       4        Block             Allow             OK
  • 23. Evaluating the "block if score > 62" policy • Precision = 5 / 5 = 1.00 • Recall = 5 / 6 = 0.83
        ID   Score   P(Allow)   Weight   Original Action   Selected Action   Outcome
        1    10      1.0        1        Allow             Allow             OK
        2    45      1.0        1        Allow             Allow             Fraud
        4    65      0.20       5        Block             Allow             Fraud
        6    60      0.25       4        Block             Allow             OK
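The three policy evaluations above can be reproduced from the four allowed rows with a short inverse-propensity-weighted calculation. The `evaluate` helper is an illustrative name; the data are the rows of the table.

```python
# Allowed samples from the table above: (id, score, p_allow, is_fraud).
ALLOWED = [
    (1, 10, 1.0, False),
    (2, 45, 1.0, True),
    (4, 65, 0.20, True),
    (6, 60, 0.25, False),
]

def evaluate(samples, threshold):
    """Weighted precision and recall of the policy "block if score > threshold"."""
    blocked = fraud = blocked_fraud = 0.0
    for _id, score, p_allow, is_fraud in samples:
        weight = 1.0 / p_allow  # each sample stands in for 1 / P(Allow) charges
        would_block = score > threshold
        if would_block:
            blocked += weight
        if is_fraud:
            fraud += weight
        if would_block and is_fraud:
            blocked_fraud += weight
    precision = blocked_fraud / blocked if blocked else float("nan")
    recall = blocked_fraud / fraud if fraud else float("nan")
    return precision, recall

for threshold in (50, 40, 62):
    p, r = evaluate(ALLOWED, threshold)
    print(f"score > {threshold}: precision={p:.2f} recall={r:.2f}")
```

Only the threshold changes between runs; the logged samples and their weights are reused, which is what makes it cheap to compare many candidate policies offline.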
  • 24. Analysis • The propensity function controls the "exploration/exploitation" tradeoff • Precision, recall, etc. are estimators • Variance of the estimators decreases the more we allow through • Bootstrap to get error bars (pick rows from the table uniformly at random with replacement)
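The bootstrap step above might look like the sketch below, which resamples the allowed rows uniformly with replacement and recomputes weighted recall each time. The helper names are hypothetical, and the percentile interval is one common choice that the slides do not specify.

```python
import random

# Allowed samples from the slide 21 table: (score, p_allow, is_fraud).
ALLOWED = [(10, 1.0, False), (45, 1.0, True), (65, 0.20, True), (60, 0.25, False)]

def recall_at(threshold):
    """Build a statistic: weighted recall of "block if score > threshold"."""
    def stat(samples):
        fraud = caught = 0.0
        for score, p_allow, is_fraud in samples:
            if is_fraud:
                weight = 1.0 / p_allow
                fraud += weight
                if score > threshold:
                    caught += weight
        return caught / fraud if fraud else None  # undefined without any fraud
    return stat

def bootstrap_interval(samples, stat, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` (interval type is an assumption)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Pick rows uniformly at random with replacement.
        resample = rng.choices(samples, k=len(samples))
        value = stat(resample)
        if value is not None:
            estimates.append(value)
    if not estimates:
        raise ValueError("statistic was undefined on every resample")
    estimates.sort()
    low = estimates[int(alpha / 2 * len(estimates))]
    high = estimates[min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))]
    return low, high

low, high = bootstrap_interval(ALLOWED, recall_at(50))
print(f"recall 95% interval: [{low:.2f}, {high:.2f}]")
```

With only four rows the interval is extremely wide; in practice the same procedure would run over the full log of allowed charges, where allowing more charges through narrows it.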
  • 25. New models • Train on weighted data • Weights => approximate distribution without any blocking • Can test arbitrarily many models and policies offline • Li, Chen, Kleban, Gupta: "Counterfactual Estimation and Optimization of Click Metrics for Search Engines"
  • 26. Technicalities • Independence and random seeds • Application-specific issues
  • 28. Conclusion • If some policy is actioning model scores, you can inject randomness in production to understand the counterfactual • Instead of a "champion/challenger" A/B test, you can evaluate arbitrarily many models and policies in this framework
  • 29. Thanks! • Work by Ryan Wang (@ryw90), Roban Kramer (@robanhk), and Alyssa Frazee (@acfrazee) • We're hiring engineers/data scientists! • mlm@stripe.com (@mlmanapat)