Python as part of a production machine learning stack by Michael Manapat PyData SV 2014

Python as part of a production machine learning stack

Michael Manapat
@mlmanapat
Stripe
  
Outline

- Why we need ML at Stripe
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service
  
Stripe is a technology company focusing on making payments easy

- Short application
  
Tokenization

[Diagram: Customer browser -> (Stripe.js) -> Stripe -> Token; Merchant server -> (API call) -> Stripe -> Result]
  
API Call

    import stripe

    stripe.Charge.create(
        amount=400,
        currency="usd",
        card="tok_103xnl2gR5VxTSB",
        email="customer@example.com"
    )

Fraud / business violations

- Terms of service violations (weapons)
- Merchant fraud (card "cashers")
- Transaction fraud

- No machine learning a year ago
  
Fraud / business violations

- Terms of service violations

E-cigarettes, drugs, weapons, etc.

How do we find these automatically?
  
Merchant sign up flow

Application submission -> Website scraped -> Text scored -> Application reviewed
  
Merchant sign up flow

Application submission -> Website scraped -> Text scored -> Application reviewed

Machine learning pipeline and service
  
Building a classifier: e-cigarettes

    import pandas

    # Load the labeled website text from a pickled DataFrame
    data = pandas.read_pickle('ecigs')
    data.head()

       text                                                  violator
    0  " please verify your age i am 21 years or older ...  True
    1  coming soon toggle me drag me with your mouse ...    False
    2  drink moscow mules cart 0 log in or create an ...    False
    3  vapors electronic cigarette buy now insuper st...    True
    4  t-shirts shorts hawaii about us silver coll...       False

    [5 rows x 2 columns]
  
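Features for text classification

    from sklearn.feature_extraction.text import CountVectorizer

    # Build a vocabulary and count word occurrences per document
    cv = CountVectorizer()
    features = cv.fit_transform(data['text'])

Sparse matrix of word counts from input text (omitting feature selection)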
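
To make "sparse matrix of word counts" concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is a stand-in for illustration, not how CountVectorizer is implemented: the whitespace tokenizer, the helper names `build_vocabulary` and `count_words`, and the two sample texts are all assumptions.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def count_words(text, vocabulary):
    """Return a sparse row: {column index: count} for tokens in the vocabulary."""
    counts = Counter(text.lower().split())
    return {vocabulary[tok]: n for tok, n in counts.items() if tok in vocabulary}


# Hypothetical sample texts, not taken from the talk's data set.
texts = ["buy vapors buy now", "drink moscow mules now"]
vocabulary = build_vocabulary(texts)
rows = [count_words(text, vocabulary) for text in texts]
```

Each row stores only the non-zero columns, which is why the representation stays small even when the vocabulary is large.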
  
Features for text classification

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the rows for validation
    X_train, X_test, y_train, y_test = train_test_split(
        features, data['violator'],
        test_size=0.2)

- Avoid leakage
- Other cross-validation methods
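
One way leakage can creep into a text pipeline is building the vocabulary from every document before splitting, so the feature space already reflects test-set words. The slide does not say which kind of leakage was meant, so treat this as one possible reading. The sketch below splits the raw rows first and derives the vocabulary from the training portion only; `split_rows`, the fixed seed, and the sample rows are assumptions.

```python
import random


def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_size)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (text, violator) rows for illustration only.
rows = [
    ("vapors electronic cigarette buy now", True),
    ("t-shirts shorts hawaii about us", False),
    ("please verify your age", True),
    ("coming soon toggle me", False),
    ("drink moscow mules cart", False),
]

train, test = split_rows(rows, test_size=0.2)

# Build the vocabulary from training text only, so no test-set word
# influences the feature space.
train_vocabulary = {tok for text, _ in train for tok in text.split()}
```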
  
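Training

    from sklearn.linear_model import LogisticRegression

    # Fit a logistic regression on the training split
    model = LogisticRegression()
    model.fit(X_train, y_train)

Serializer reads from

    model.intercept_
    model.coef_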
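
For a binary logistic regression, the intercept and coefficients are enough to reproduce the positive-class probability outside sklearn: sigmoid(intercept + coef · x). The sketch below shows that round trip in pure Python. The JSON layout, the function names, and the weights are assumptions for illustration, not the serialization format used at Stripe.

```python
import json
import math


def serialize(intercept, coef):
    """Dump the two fitted parameters to a JSON string."""
    return json.dumps({"intercept": intercept, "coef": coef})


def score(serialized, features):
    """Positive-class probability for a sparse row {column index: count}."""
    params = json.loads(serialized)
    z = params["intercept"] + sum(
        params["coef"][col] * value for col, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights; with sklearn these would come from
# model.intercept_[0] and model.coef_[0].tolist().
blob = serialize(intercept=-1.0, coef=[2.0, 0.0, -0.5])
probability = score(blob, {0: 1, 2: 2})  # z = -1.0 + 2.0 - 1.0 = 0.0
```

Because the scorer needs nothing but two arrays, it can run in a process that never imports sklearn.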

	
  
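Validation

    import matplotlib.pyplot
    from sklearn.metrics import roc_curve

    # Column 1 of predict_proba holds the positive-class probability
    probs = model.predict_proba(X_test)
    fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
    matplotlib.pyplot.plot(fpr, tpr)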
  
ROC: Receiver operating characteristic
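
An ROC curve plots the true positive rate against the false positive rate as the decision threshold sweeps from the highest score down. The pure-Python sketch below computes those points and the area under the curve. It is not sklearn's `roc_curve` implementation (it assumes distinct scores and keeps every point); the function names and the toy labels and scores are assumptions.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists, one point per threshold, highest score first.

    Assumes distinct scores and at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    fpr, tpr = [0.0], [0.0]
    true_pos = false_pos = 0
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
        fpr.append(false_pos / negatives)
        tpr.append(true_pos / positives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
        for i in range(1, len(fpr))
    )


# Toy labels and scores for illustration only.
labels = [True, False, True, False]
scores = [0.9, 0.8, 0.7, 0.1]
fpr, tpr = roc_points(labels, scores)
area = auc(fpr, tpr)
```

In this toy case the area equals the fraction of (violator, non-violator) pairs in which the violator gets the higher score: three of four.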
  




	
  
Pipeline

- Fetch website snapshots from S3
- Fetch classifications from SQL/Impala
- Sanitize text (strip HTML)
- Run feature generation and selection
- Train and serialize model
- Export validation statistics
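
The "sanitize text (strip HTML)" step can be sketched with the standard library alone. The version below is an assumption about what such a step might look like, not the code used in the talk: it drops tags and the contents of `<script>` and `<style>`, lowercases, and collapses whitespace. The class name, function name, and sample snapshot are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def sanitize(html):
    """Strip tags, lowercase, and collapse whitespace."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.parts).lower().split())


# Hypothetical snapshot, not real merchant data.
snapshot = "<html><script>var x = 1;</script><h1>Vapors</h1><p>Buy  now</p></html>"
clean = sanitize(snapshot)
```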
  
Luigi

    import luigi

    class GetSnapshots(luigi.Task):
        def run(self):
            ...

    class GenFeatures(luigi.Task):
        # Declare the upstream task that must complete first
        def requires(self):
            return GetSnapshots()

Luigi runs tasks on Hadoop cluster
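
The `requires()` method is what lets a scheduler work out the execution order: every dependency runs before the task that names it. The toy below illustrates that ordering in pure Python. It is not Luigi and does not use `luigi.Task`; the `Task` base class, the `build` function, and the extra `TrainModel` task are hypothetical.

```python
class Task:
    """Toy stand-in for a pipeline task; this is not luigi.Task."""

    def requires(self):
        return []

    def run(self, log):
        log.append(type(self).__name__)


class GetSnapshots(Task):
    pass


class GenFeatures(Task):
    def requires(self):
        return [GetSnapshots()]


class TrainModel(Task):
    def requires(self):
        return [GenFeatures()]


def build(task, log, done=None):
    """Run every dependency of `task` depth-first, then the task itself."""
    done = set() if done is None else done
    name = type(task).__name__
    if name in done:
        return log
    for dependency in task.requires():
        build(dependency, log, done)
    task.run(log)
    done.add(name)
    return log


order = build(TrainModel(), [])
```

Asking for the last task is enough; the upstream tasks are pulled in through their declared dependencies.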
  
"
Scoring as a service

Application submission -> Website scraped -> Text scored -> Application reviewed

[Diagram labels: Thrift RPC, Scoring Service]
  
Scoring as a service

    struct ScoringRequest {
        1: string text
        2: optional string model_name
    }

    struct ScoringResponse {
        1: double score    // Experiments?
        2: double request_duration
    }
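
A handler behind those two structs might look like the sketch below. It mirrors the fields with plain dataclasses rather than Thrift-generated classes and leaves out the RPC transport entirely. The `ScoringHandler` class, the model registry, and the stand-in models are assumptions for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ScoringRequest:
    text: str
    model_name: Optional[str] = None


@dataclass
class ScoringResponse:
    score: float
    request_duration: float


class ScoringHandler:
    """Routes a request to a named model and times the call."""

    def __init__(self, models: Dict[str, Callable[[str], float]], default: str):
        self.models = models
        self.default = default

    def score(self, request: ScoringRequest) -> ScoringResponse:
        start = time.perf_counter()
        model = self.models[request.model_name or self.default]
        value = model(request.text)
        duration = time.perf_counter() - start
        return ScoringResponse(score=value, request_duration=duration)


# Hypothetical models: plain functions standing in for trained classifiers.
handler = ScoringHandler(
    models={
        "ecigs": lambda text: 1.0 if "vapors" in text else 0.0,
        "always_zero": lambda text: 0.0,
    },
    default="ecigs",
)
response = handler.score(ScoringRequest(text="vapors electronic cigarette"))
```

The optional `model_name` gives callers a way to route a request to an alternative model, which is one way the experimentation hinted at in the struct comment could be supported.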

Why a service?

- Same code base for training/scoring
- Reduced duplication/easier deploys
- Experimentation
- Log requests and responses (Parquet/Impala)
- Centralized monitoring (Graphite)
  
Summary

- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service

Thanks!
@mlmanapat
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool
ShapeBlue56 views
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software373 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue110 views

Python as part of a production machine learning stack by Michael Manapat PyData SV 2014

  • 1. Python as part of a production machine learning stack Michael Manapat @mlmanapat Stripe
  • 2. Outline -Why we need ML at Stripe -Simple models with sklearn -Pipelines with Luigi -Scoring as a service
  • 3. Stripe is a technology company focusing on making payments easy -Short application
  • 4. Tokenization (diagram labels: Customer browser, Stripe, Stripe.js, Token, Merchant server, Stripe, API call, Result)
  • 5. API Call
 import stripe

 stripe.Charge.create(
     amount=400,
     currency="usd",
     card="tok_103xnl2gR5VxTSB",
     email="customer@example.com"
 )
  • 6. Fraud / business violations -Terms of service violations (weapons) -Merchant fraud (card “cashers”) -Transaction fraud -No machine learning a year ago
  • 7. Fraud / business violations -Terms of service violations: E-cigarettes, drugs, weapons, etc. How do we find these automatically?
  • 8. Merchant sign up flow: Application submission -> Website scraped -> Text scored -> Application reviewed
  • 9. Merchant sign up flow: Application submission -> Website scraped -> Text scored -> Application reviewed; Machine learning pipeline and service
  • 10. Building a classifier: e-cigarettes
 import pandas

 data = pandas.read_pickle('ecigs')
 data.head()
 
 text violator
 0 " please verify your age i am 21 years or older ... True
 1 coming soon toggle me drag me with your mouse ... False
 2 drink moscow mules cart 0 log in or create an ... False
 3 vapors electronic cigarette buy now insuper st... True
 4 t-shirts shorts hawaii about us silver coll... False
 
 [5 rows x 2 columns]  
  • 11. Features for text classification
 from sklearn.feature_extraction.text import CountVectorizer

 cv = CountVectorizer()
 features = cv.fit_transform(data['text'])

 Sparse matrix of word counts from input text (omitting feature selection)
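
 To make the "sparse matrix of word counts" concrete, here is a minimal pure-Python sketch of what a count vectorizer computes. It is a simplified stand-in, not scikit-learn's implementation: the whitespace tokenization, the function name `count_features`, and the dict-per-row sparse layout are all assumptions made for illustration.

```python
from collections import Counter


def count_features(texts):
    """Return (vocabulary, rows); rows[i] maps column index -> word count."""
    # Assign each distinct token a column index, in sorted order.
    tokens = sorted({tok for text in texts for tok in text.split()})
    vocabulary = {tok: i for i, tok in enumerate(tokens)}
    rows = []
    for text in texts:
        counts = Counter(text.split())
        # Store only non-zero entries, which is what makes the matrix sparse.
        rows.append({vocabulary[tok]: n for tok, n in counts.items()})
    return vocabulary, rows


# Hypothetical two-document corpus in the style of the scraped text.
texts = ["vapors electronic cigarette buy now", "buy t-shirts buy shorts"]
vocabulary, rows = count_features(texts)
```

 Each row holds entries only for the words that actually occur in that document, so memory grows with document length rather than with vocabulary size.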
  • 12. Features for text classification
 from sklearn.model_selection import train_test_split

 X_train, X_test, y_train, y_test = train_test_split(
     features, data['violator'],
     test_size=0.2)

 -Avoid leakage -Other cross-validation methods
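
 One way leakage can creep into a text pipeline is by building the vocabulary from documents that later end up in the test set. The sketch below splits first and derives the vocabulary from the training portion only. It is an illustration under stated assumptions, not the talk's code: `split_then_fit`, the fixed seed, and the toy corpus are hypothetical.

```python
import random


def split_then_fit(texts, labels, test_size=0.2, seed=0):
    """Shuffle, hold out a test fraction, and build the vocabulary from training text only."""
    order = list(range(len(texts)))
    random.Random(seed).shuffle(order)
    n_test = max(1, int(round(len(order) * test_size)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # The vocabulary never sees held-out documents, so test-only words get no column.
    vocabulary = sorted({tok for i in train_idx for tok in texts[i].split()})
    train = [(texts[i], labels[i]) for i in train_idx]
    test = [(texts[i], labels[i]) for i in test_idx]
    return vocabulary, train, test


# Toy corpus; every document uses different words.
texts = ["buy vapors now", "drink moscow mules", "electronic cigarette",
         "t-shirts shorts hawaii", "coming soon"]
labels = [True, False, True, False, False]
vocabulary, train, test = split_then_fit(texts, labels)
```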
  • 13. Training
 from sklearn.linear_model import LogisticRegression

 model = LogisticRegression()
 model.fit(X_train, y_train)

 Serializer reads from
 model.intercept_
 model.coef_
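
 A fitted logistic regression is fully described by its intercept and coefficients, which is why a serializer only needs those two attributes. The following sketch shows one way such parameters could be written out and used for scoring without scikit-learn. The JSON layout, the function names `serialize_model` and `score`, and the parameter values are assumptions for illustration; the talk does not describe the actual serialization format.

```python
import json
import math


def serialize_model(intercept, coef, vocabulary):
    """Dump the fitted parameters to a JSON string (hypothetical format)."""
    return json.dumps({"intercept": intercept, "coef": coef, "vocabulary": vocabulary})


def score(serialized, text):
    """Return P(violator) for `text` using only the serialized parameters."""
    params = json.loads(serialized)
    z = params["intercept"]
    for token in text.split():
        index = params["vocabulary"].get(token)
        if index is not None:  # words unseen at training time contribute nothing
            z += params["coef"][index]  # one addition per occurrence = count * weight
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function


# Made-up parameters; with scikit-learn's binary LogisticRegression they would
# come from model.intercept_[0] and model.coef_[0].tolist().
blob = serialize_model(-1.0, [2.0, -0.5], {"cigarette": 0, "shorts": 1})
p = score(blob, "electronic cigarette buy now")
```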
  
  • 14. Validation
 import matplotlib.pyplot
 from sklearn.metrics import roc_curve

 probs = model.predict_proba(X_test)

 fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])

 matplotlib.pyplot.plot(fpr, tpr)
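
 For readers who want to see what the false and true positive rates mean, here is a small pure-Python sketch that sweeps a threshold over the scores and records the resulting points. It is not scikit-learn's `roc_curve`; the helper names `roc_points` and `auc` and the toy labels and scores are hypothetical, and the code assumes at least one positive and one negative example.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) points, one per distinct threshold, from (0, 0) to (1, 1)."""
    positives = sum(1 for y in y_true if y)
    negatives = len(y_true) - positives
    # Sweep the threshold from the highest score down.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (s, y) in enumerate(ranked):
        if y:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != s:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: three of the four (positive, negative) pairs are ranked correctly.
y_true = [True, False, True, False]
scores = [0.9, 0.8, 0.7, 0.1]
points = roc_points(y_true, scores)
area = auc(points)
```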
  • 15. ROC: Receiver operating characteristic
 
  
  • 16. Pipeline -Fetch website snapshots from S3 -Fetch classifications from SQL/Impala -Sanitize text (strip HTML) -Run feature generation and selection -Train and serialize model -Export validation statistics
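
 The "sanitize text (strip HTML)" step can be sketched with the standard library's `html.parser`. This is one possible approach rather than the pipeline's actual code: the class name `TextExtractor`, the function `sanitize`, and the choices to drop script/style contents and lowercase the output are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def sanitize(html):
    """Return lowercased, whitespace-normalized text from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).lower().split())


# Hypothetical snapshot of a merchant page.
clean = sanitize("<html><script>var x = 1;</script><body>"
                 "<h1>Buy Now</h1><p>Electronic cigarette</p></body></html>")
```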
  • 17. Luigi
 import luigi

 class GetSnapshots(luigi.Task):
     def run(self):
         ...

 class GenFeatures(luigi.Task):
     def requires(self):
         return GetSnapshots()
  • 18. Luigi runs tasks on Hadoop cluster
  • 19. Scoring as a service: Application submission -> Website scraped -> Text scored -> Application reviewed; Thrift RPC; Scoring Service
  • 20. Scoring as a service
 struct ScoringRequest {
   1: string text
   2: optional string model_name
 }

 struct ScoringResponse {
   1: double score  // Experiments?
   2: double request_duration
 }
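
 The request and response shapes above can be mirrored in plain Python to show how an optional `model_name` might select between models, which is one way the "experiments" note could play out. This sketch involves no Thrift or RPC at all; the `handle` function, the `default_model` parameter, the toy models, and the choice of seconds for `request_duration` are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ScoringRequest:
    text: str
    model_name: Optional[str] = None  # optional, as in the Thrift struct


@dataclass
class ScoringResponse:
    score: float
    request_duration: float  # seconds spent handling the request (unit assumed)


def handle(request: ScoringRequest,
           models: Dict[str, Callable[[str], float]],
           default_model: str = "default") -> ScoringResponse:
    """Pick a model by name, score the text, and time the call."""
    start = time.perf_counter()
    model = models[request.model_name or default_model]
    score = model(request.text)
    return ScoringResponse(score=score,
                           request_duration=time.perf_counter() - start)


# Toy stand-in models; a real service would load serialized classifiers instead.
models = {
    "default": lambda text: 0.1,
    "ecigs": lambda text: 0.9 if "cigarette" in text else 0.1,
}
response = handle(ScoringRequest(text="electronic cigarette", model_name="ecigs"), models)
```

 Callers that omit `model_name` fall through to the default model, so a new model can be tried on a subset of requests without changing existing clients.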
  • 21. Why a service? -Same code base for training/scoring -Reduced duplication/easier deploys -Experimentation
  • 22. -Log requests and responses (Parquet/Impala) -Centralized monitoring (Graphite)
  • 23. Summary -Simple models with sklearn -Pipelines with Luigi -Scoring as a service Thanks! @mlmanapat