Python as part of a production machine learning stack by Michael Manapat PyData SV 2014

Over the course of three years, we've built Stripe from scratch and scaled it to process billions of dollars of transaction volume a year by making it easy and painless for merchants to get set up and start accepting payments. While the vast majority of transactions facilitated by Stripe are honest, we do need to protect our merchants from rogue individuals and groups seeking to "test" or "cash" stolen credit cards. To combat this sort of activity, Stripe uses Python (together with Scala and Ruby) as part of its production machine learning pipeline to detect and block fraud in real time. In this talk, I'll go through the scikit-learn-based modeling process for a sample data set derived from production data to illustrate how we train and validate our models. We'll also walk through how we deploy the models and monitor them in our production environment, and how Python has allowed us to do this at scale.


  1. Python as part of a production machine learning stack
     Michael Manapat, @mlmanapat, Stripe
  2. Outline
     - Why we need ML at Stripe
     - Simple models with sklearn
     - Pipelines with Luigi
     - Scoring as a service
  3. Stripe is a technology company focusing on making payments easy
     - Short application
  4. Tokenization
     Diagram labels: Customer browser, Stripe, Stripe.js, Token, Merchant server, Stripe, API call, Result
  5. API Call
        import stripe

        stripe.Charge.create(
            amount=400,
            currency="usd",
            card="tok_103xnl2gR5VxTSB",
            email="customer@example.com",
        )
  6. Fraud / business violations
     - Terms of service violations (weapons)
     - Merchant fraud (card "cashers")
     - Transaction fraud
     - No machine learning a year ago
  7. Fraud / business violations
     - Terms of service violations
       E-cigarettes, drugs, weapons, etc.
     How do we find these automatically?
  8. Merchant sign up flow
     Application submission -> Website scraped -> Text scored -> Application reviewed
  9. Merchant sign up flow
     Application submission -> Website scraped -> Text scored -> Application reviewed
     Machine learning pipeline and service
  10. Building a classifier: e-cigarettes
        import pandas

        # pandas loads pickled frames with read_pickle
        data = pandas.read_pickle('ecigs')
        data.head()

                                                        text  violator
        0  please verify your age i am 21 years or older ...      True
        1  coming soon toggle me drag me with your mouse ...     False
        2  drink moscow mules cart 0 log in or create an ...     False
        3  vapors electronic cigarette buy now insuper st...      True
        4     t-shirts shorts hawaii about us silver coll...     False

        [5 rows x 2 columns]
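  11. Features for text classification
        from sklearn.feature_extraction.text import CountVectorizer

        cv = CountVectorizer()  # instantiate the vectorizer before fitting

        features = cv.fit_transform(data['text'])

     Sparse matrix of word counts from input text (omitting feature selection)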
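
As a rough illustration of what the sparse word-count matrix holds, the sketch below builds the same kind of counts with only the standard library. It is not scikit-learn's implementation; the whitespace tokenizer, the `build_vocabulary` and `count_features` names, and the two sample strings are assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: col for col, tok in enumerate(tokens)}


def count_features(text, vocabulary):
    """Return a sparse row: {column index: count} for tokens in the vocabulary."""
    counts = Counter(tok for tok in text.lower().split() if tok in vocabulary)
    return {vocabulary[tok]: n for tok, n in counts.items()}


# Hypothetical sample texts, not taken from the talk's data set.
texts = ["buy vapor buy now", "drink moscow mules now"]
vocabulary = build_vocabulary(texts)
rows = [count_features(text, vocabulary) for text in texts]
```

Each row stores only the nonzero columns, which is the point of a sparse representation: most words in the vocabulary do not appear on any given website.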
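  12. Features for text classification
        # train_test_split lived in sklearn.cross_validation in older releases
        from sklearn.model_selection import train_test_split

        X_train, X_test, y_train, y_test = train_test_split(
            features, data['violator'], test_size=0.2)

     - Avoid leakage
     - Other cross-validation methods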
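
One common reading of "avoid leakage" is that nothing learned from the held-out rows should influence training. The sketch below, using only the standard library, splits the raw texts first and builds the vocabulary from the training portion alone. The `split_rows` helper, the fixed seed, and the sample rows are assumptions for illustration; the talk does not say which safeguards were used.

```python
import random


def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a test_size fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_size)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (text, violator) rows.
rows = [
    ("vapor cigarette buy now", True),
    ("t-shirts shorts hawaii", False),
    ("e-liquid starter kit", True),
    ("drink moscow mules", False),
    ("coming soon", False),
]
train, test = split_rows(rows, test_size=0.2)

# The vocabulary comes from training text only, so words seen only in the
# held-out rows cannot shape the feature space.
train_vocabulary = {tok for text, _ in train for tok in text.split()}
```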
  13. Training
        from sklearn.linear_model import LogisticRegression

        model = LogisticRegression()
        model.fit(X_train, y_train)

     Serializer reads from
        model.intercept_
        model.coef_
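
Because a fitted logistic regression is fully described by its intercept and coefficients, a scorer can be rebuilt from those two values alone. The sketch below shows that computation with the standard library; the `score` function, the JSON layout, and the numbers in it are hypothetical and are not the serializer format used in the talk.

```python
import json
import math


def score(serialized_model, feature_counts):
    """Return P(violator) from a serialized intercept and coefficient map."""
    model = json.loads(serialized_model)
    z = model["intercept"]
    for token, count in feature_counts.items():
        # Tokens the model never saw contribute nothing.
        z += model["coef"].get(token, 0.0) * count
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical serialized form of model.intercept_ and model.coef_.
serialized = json.dumps({"intercept": -1.0, "coef": {"vapor": 2.0, "shirts": -0.5}})
p_violator = score(serialized, {"vapor": 1, "hawaii": 3})
```

A scorer like this needs no scikit-learn at prediction time, which may be one reason to serialize only the intercept and coefficients.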
  
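  14. Validation
        import matplotlib.pyplot
        from sklearn.metrics import roc_curve

        probs = model.predict_proba(X_test)

        # Column 1 holds the predicted probability of the positive class
        fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])

        matplotlib.pyplot.plot(fpr, tpr)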
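
For readers unfamiliar with how the ROC points are derived, the following standard-library sketch sweeps a threshold over the scores and records the false and true positive rates at each step. It is a simplified stand-in for `roc_curve` (it does not, for instance, drop collinear points), and the labels and scores are made up.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists, one point per distinct score threshold."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y and p)
        fp = sum(1 for y, p in zip(labels, predicted) if not y and p)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Hypothetical labels and predicted probabilities.
labels = [True, True, False, True, False]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
fpr, tpr = roc_points(labels, scores)
area = auc(fpr, tpr)
```

Lowering the threshold flags more sites, so both rates can only rise; the curve shows what each extra true positive costs in false positives.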
  15. ROC: Receiver operating characteristic (plot)
  16. Pipeline
     - Fetch website snapshots from S3
     - Fetch classifications from SQL/Impala
     - Sanitize text (strip HTML)
     - Run feature generation and selection
     - Train and serialize model
     - Export validation statistics
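
The "sanitize text" step could look something like the sketch below, which uses the standard library's `html.parser` to keep visible text and drop tags, scripts, and styles. The `TextExtractor` class and the lowercasing and whitespace normalization are assumptions; the talk does not describe how its sanitizer worked.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Lowercase and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).lower().split())


def sanitize(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


cleaned = sanitize("<html><script>var x = 1;</script><h1>Buy  Now</h1><p>Vapor kits</p></html>")
```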
  17. Luigi
        import luigi

        class GetSnapshots(luigi.Task):
            def run(self):
                ...

        class GenFeatures(luigi.Task):
            def requires(self):
                return GetSnapshots()
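
To make the `requires`/`run` relationship concrete, here is a toy runner written with only the standard library. It is not Luigi and omits most of what Luigi provides (targets, parameters, a scheduler, Hadoop integration); the `Task` base class, the `build` function, and the `TrainModel` task are hypothetical stand-ins that show only how declared dependencies determine execution order.

```python
class Task:
    """Toy stand-in for a pipeline task; not luigi.Task."""

    def requires(self):
        return []

    def run(self, log):
        log.append(type(self).__name__)


class GetSnapshots(Task):
    pass


class GenFeatures(Task):
    def requires(self):
        return [GetSnapshots()]


class TrainModel(Task):  # hypothetical third stage
    def requires(self):
        return [GenFeatures()]


def build(task, log, done=None):
    """Run a task's dependencies depth-first, then the task, each at most once."""
    done = set() if done is None else done
    name = type(task).__name__
    if name in done:
        return
    for dependency in task.requires():
        build(dependency, log, done)
    task.run(log)
    done.add(name)


order = []
build(TrainModel(), order)
```

Asking for the last stage is enough: the runner walks the declared dependencies and executes the upstream tasks first.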
  18. Luigi runs tasks on Hadoop cluster
  19. Scoring as a service
     Application submission -> Website scraped -> Text scored -> Application reviewed
     Diagram labels: Thrift RPC, Scoring Service
  20. Scoring as a service
        struct ScoringRequest {
            1: string text
            2: optional string model_name
        }

        struct ScoringResponse {
            1: double score  // Experiments?
            2: double request_duration
        }
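
A handler behind such an interface might be shaped like the sketch below. It mirrors the two structs with plain Python dataclasses rather than Thrift-generated code, and the `MODELS` registry, the `"default"` model name, and the keyword-based stand-in model are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ScoringRequest:
    text: str
    model_name: Optional[str] = None


@dataclass
class ScoringResponse:
    score: float
    request_duration: float


def keyword_model(text: str) -> float:
    """Stand-in model: fraction of tokens that are flagged keywords."""
    flagged = {"vapor", "cigarette"}
    tokens = text.lower().split()
    return sum(tok in flagged for tok in tokens) / len(tokens) if tokens else 0.0


# Hypothetical registry; selecting by name is one way to support experiments.
MODELS: Dict[str, Callable[[str], float]] = {"default": keyword_model}


def handle(request: ScoringRequest) -> ScoringResponse:
    started = time.perf_counter()
    model = MODELS[request.model_name or "default"]
    score = model(request.text)
    return ScoringResponse(score=score, request_duration=time.perf_counter() - started)


response = handle(ScoringRequest(text="vapor electronic cigarette buy now"))
```

Returning the duration alongside the score lets callers and monitoring see latency per request without extra instrumentation.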
  21. Why a service?
     - Same code base for training/scoring
     - Reduced duplication/easier deploys
     - Experimentation
  22. Why a service? (continued)
     - Log requests and responses (Parquet/Impala)
     - Centralized monitoring (Graphite)
  23. Summary
     - Simple models with sklearn
     - Pipelines with Luigi
     - Scoring as a service
     Thanks! @mlmanapat
