Deploying Python Machine Learning Models with Apache Spark with Brandon Hamric and Alex Meyer

Deploying machine learning models seems like it should be a relatively easy task: take your model and pass it some features in production. The reality is that the code written during the prototyping phase of model development doesn’t always work when applied at scale or on “real” data. This talk will explore 1) common problems at the intersection of data science and data engineering, 2) how you can structure your code so there is minimal friction between prototyping and production, and 3) how you can use Apache Spark to run predictions on your models in batch or streaming contexts.

You will take away how to address some of the productionizing issues that data scientists and data engineers face while deploying machine learning models at scale, along with a better understanding of how to work collaboratively to minimize the disparity between prototyping and productizing.

  1. Brandon Hamric & Alex Meyer, Eventbrite: Deploying Python Machine Learning Models with Apache Spark #SAISDS2
  2. Introduction
  3. About Eventbrite
      • Global ticketing and event technology platform that provides creators of events of all shapes and sizes with tools and resources to seamlessly plan, promote, and produce live experiences around the world
      • Can be accessed online or via mobile apps, scales from basic registration and ticketing to a fully featured event management platform
      • 203 million tickets processed in 2017
      • Powered 3 million events in 170+ countries in 2017
      • 700k creators supported in 2017
  4. About Us
      • Eventbrite
      • We're data engineers
      • We ship models for Eventbrite data scientists
      • Started out at Eventbrite on Discovery - Event Recommendations
      • Built new data infrastructure to support all business needs
      • Our team creates, maintains, and supports the data infrastructure, tasks, and pipelines that serve other engineers, business insights, and product
  5. Brandon Hamric - bhamric@eventbrite.com
      • Principal Data Engineer/Architect @ Eventbrite
      • Co-founded Rescue Forensics (YC W15)
      • 10 years experience in data engineering
      • Worked with Spark since 2014
  6. Alex Meyer - alexm@eventbrite.com
      • Senior Data Engineer @ Eventbrite
      • MS in Computer Science - Distributed Systems (Vanderbilt University)
      • 4 years experience in data engineering
      • Worked with Spark since 2014
  7. Structured Predictors
  8. Common Predictor Workflow
      • High coupling between engineers and data scientists
      • Mostly serial workflow
      • High barrier to entry
      • Too many contributors
      • Code duplication
  9. Improved Predictor Workflow
      • Low coupling between engineers and data scientists
      • Independent Workflows
      • Data scientists own their models end-to-end
      • Data Engineering isn't a bottleneck
  10. Predictor Code (stages: Data prep and cleanup, Feature Extraction, Model Management, Prediction)
      • Training and prediction code can be inconsistent
      • Sample data prep can be different than prod data prep
      • Training and prediction code can be inconsistent
      • Mostly written for vertical scaling
      • Version management is hard
      • Can use a lot of memory
      • It can be hard to switch between models
      • Bulk vs single-item
  11. Predictor Deployment Problems
      • Shared code between dev, batch, and streaming is an afterthought
      • Most models are written for vertical scaling first
      • Deployment is ad-hoc without a common structure
      • Model iteration is slow because of lack of automation
      • Model versioning isn't consistent without a library
  12. Predictor Structure
      • Data prep and cleanup
        – Notebook: Query to a local CSV
        – Offline Prediction: Convert to incremental query
        – Streaming Prediction: Convert to read stream
      • Feature Extraction
        – Notebook: Pandas DataFrames and Python functions
        – Offline Prediction: Convert to Spark DataFrame or RDD operations
        – Streaming Prediction: Convert to DataFrame operations or foreachBatch in Spark 2.4
      • Load Model
        – Notebook: Load from a local pickle
        – Offline Prediction: Load from S3 or HDFS onto executors
        – Streaming Prediction: Load from S3 or HDFS onto executors
      • Predict
        – Notebook: Mixed into scoring logic
        – Offline Prediction: Mapper or UDF on features
        – Streaming Prediction: UDF on feature rows
  13. Predictor Class
      • Manages Model
        – Versioning
        – Storage
        – Loading
      • Outlines structure
        – Data loading
        – Feature extraction
        – Prediction
      • Batch and streaming
      • Enables Automation
  14. Example Predictor and Demo
  15. Demo - Latent Dirichlet Allocation (LDA)
      • Generate topics on Eventbrite's event description corpus
      • Get topic probabilities per event
      • We can use topics to improve search, browse, and personalization
      • LDA Wiki
      • LDA Scikit Learn Model
      • <open notebook>
  16. Takeaways
      • A consistent predictor structure makes deployment of distributed prediction easy to automate
      • Streaming and batch prediction can share code
      • Use bulk feature extraction and prediction often
      • We may open source our predictor library
      • We're hiring!
  17. Thanks! Questions? Feel free to reach out!
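
The "Predictor Class" slide describes one class that manages the model (versioning, storage, loading) and outlines data loading, feature extraction, and prediction. The speakers' library is not shown in the slides, so the following is only a minimal sketch of that idea under stated assumptions: the names `Predictor`, `KeywordTopicPredictor`, `save_model`, `load_model`, `extract_features`, and `predict_bulk` are hypothetical, storage is a local directory rather than S3 or HDFS, and the "model" is a toy keyword lookup standing in for a real model such as LDA.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable, List, Set, Tuple


class Predictor:
    """Hypothetical base class: versioning, storage, and loading in one place."""

    name = "base"
    version = "1"

    def __init__(self, model_dir: str) -> None:
        self.model_dir = Path(model_dir)
        self._model: Any = None  # cached on this instance after the first load

    def model_path(self) -> Path:
        # Versioned file name, so several model versions can sit side by side.
        return self.model_dir / f"{self.name}-v{self.version}.pkl"

    def save_model(self, model: Any) -> Path:
        self.model_dir.mkdir(parents=True, exist_ok=True)
        path = self.model_path()
        with path.open("wb") as fh:
            pickle.dump(model, fh)
        return path

    def load_model(self) -> Any:
        # Load lazily; a long-lived worker would pay the load cost only once.
        if self._model is None:
            with self.model_path().open("rb") as fh:
                self._model = pickle.load(fh)
        return self._model

    def extract_features(self, rows: List[Dict[str, Any]]) -> List[Any]:
        raise NotImplementedError

    def predict_bulk(self, rows: Iterable[Dict[str, Any]]) -> List[Any]:
        raise NotImplementedError


class KeywordTopicPredictor(Predictor):
    """Toy stand-in for a real model (this is NOT LDA): most keyword hits wins."""

    name = "keyword-topics"
    version = "2"

    def extract_features(self, rows: List[Dict[str, Any]]) -> List[Set[str]]:
        # Bulk feature extraction: one token set per event description.
        return [set(row["description"].lower().split()) for row in rows]

    def predict_bulk(self, rows: Iterable[Dict[str, Any]]) -> List[Tuple[int, str]]:
        rows = list(rows)  # accept any iterable, e.g. one partition of rows
        model = self.load_model()  # {"topic": ["keyword", ...]}
        results = []
        for row, tokens in zip(rows, self.extract_features(rows)):
            scores = {topic: len(tokens & set(kws)) for topic, kws in model.items()}
            # sorted() makes ties resolve to the alphabetically first topic.
            best = max(sorted(scores), key=scores.get)
            results.append((row["event_id"], best))
        return results


# "Notebook" phase: train (here: hand-write) a model and store it under a version.
model_dir = tempfile.mkdtemp()
predictor = KeywordTopicPredictor(model_dir)
saved_path = predictor.save_model(
    {"music": ["concert", "band", "dj"], "tech": ["python", "spark", "data"]}
)

# Prediction phase: the same class loads the stored model and predicts in bulk.
events = [
    {"event_id": 1, "description": "Live band and DJ night"},
    {"event_id": 2, "description": "Intro to Spark with Python"},
]
predictions = predictor.predict_bulk(events)
print(predictions)
```

Because `predict_bulk` takes any iterable of rows, the same method could plausibly be called once per partition from a Spark job (for example through `RDD.mapPartitions`), which is one way the notebook, batch, and streaming paths on the "Predictor Structure" slide could share code. That wiring is an assumption here, not something the slides show.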
