Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley


This talk discusses developments within Apache Spark to allow deployment of MLlib models and pipelines within Structured Streaming jobs. MLlib has proven success and wide adoption for fitting Machine Learning (ML) models on big data. Scalability, expressive Pipeline APIs, and Spark DataFrame integration are key strengths.

Separately, the development of Structured Streaming has provided Spark users with intuitive, performant tools for building Continuous Applications. The smooth integration of batch and streaming APIs and workflows greatly simplifies many production use cases. Given the adoption of MLlib and Structured Streaming in production systems, a natural next step is to combine them: deploy MLlib models and Pipelines for scoring (prediction) in Structured Streaming.

However, before Apache Spark 2.3, many ML Pipelines could not be deployed in streaming. This talk covers key improvements within MLlib to support streaming prediction: currently supported functionality and opportunities for future improvement. With Spark 2.3, almost all MLlib workflows can be deployed for scoring in streaming, and we will demonstrate this live. The ability to deploy full ML Pipelines, including featurization, greatly simplifies moving complex ML workflows from development to production. We will also discuss technical challenges, such as featurization via Estimators vs. Transformers, and DataFrame column metadata.

  1. Deploying MLlib for Scoring in Structured Streaming. Joseph Bradley. June 5, 2018, Spark + AI Summit
  2. About me: Joseph Bradley. Software engineer at Databricks; Apache Spark committer & PMC member.
  3. About Databricks. Team: started the Spark project (now Apache Spark) at UC Berkeley in 2009. Product: Unified Analytics Platform. Mission: Making Big Data Simple. Try for free today: databricks.com
  4. App: monitoring web sessions for bots (diagram): Web Activity Logs feed a streaming job (Compute Features, Run Prediction), while a web app checks cached predictions via an API and kills a bot user's login session.
  5. App: monitoring web sessions for bots (diagram continued, highlighting the Compute Features and Run Prediction steps).
  6. Productionizing Machine Learning (diagram): Data Science / ML serializes models; Prediction Servers deserialize them, make predictions, and return results to End Users.
  7. Challenge: teams & environments (same diagram: models must cross team and system boundaries between Data Science / ML and the Prediction Servers).
  8. Challenge: featurization logic (same diagram: several stages of feature logic precede the Model, and all of them must be reproduced on the prediction side).
  9. Challenges in productionizing ML: sharing models across teams and across systems & environments, while maintaining identical behavior both now and in the future.
  10. In this talk. Our toolkit: ML Pipelines & Structured Streaming. Issues in Apache Spark 2.2. Fixes in Apache Spark 2.3. Tips & resources.
  11. ML Pipelines in Apache Spark: an example dataset of product-review Text with 1-5 star Labels (e.g., "I bought the game..." / 4, "Do NOT bother try..." / 1).
  12. ML Pipelines: featurization. Feature extraction adds a Words column (tokenized text) and a Features column (numeric vectors) to the original dataset.
  13. ML Pipelines: model. The predictive model adds Prediction and Probability columns (e.g., prediction 4 with probability 0.8) on top of the extracted features.
  14. ML Pipelines: successes. Apache Spark integration simplifies deployment, ETL, and integration into complete analytics pipelines with SQL (& streaming!). Scalability & speed. Pipelines for featurization, modeling & tuning.
  15. ML Pipelines: adoption. 1000s of commits, 100s of contributors, 10,000s of users (on Databricks alone), and many production use cases.
  16. Structured Streaming. One single Dataset / DataFrame API for batch & streaming. End-to-end exactly-once guarantees, extending into the sources/sinks (e.g., MySQL, S3). Understands event time: handles late-arriving data and supports sessionization based on event time.
  17. Challenges in productionizing ML, revisited: sharing models across teams and across systems & environments, while maintaining identical behavior both now and in the future. Relevant pieces: ML Pipeline persistence; Apache Spark deployments; featurization in Pipelines; backwards compatibility.
  18. In this talk. Our toolkit: ML Pipelines & Structured Streaming. Issues in Apache Spark 2.2. Fixes in Apache Spark 2.3. Tips & resources.
  19. 2-pass Transformers. The algorithmic pattern: scan the data to collect stats, collect those stats to the driver, then scan the data again to apply the transform (using the stats). Example: VectorAssembler finds the lengths of Vector columns, computes the total number of features, and creates a new Vector column of that length. This scan-collect-scan pattern fails with Structured Streaming.
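The scan-collect-scan pattern, and why it breaks on unbounded input, can be sketched in plain Python (a conceptual toy in the style of VectorAssembler, not MLlib's actual implementation; the helper names are hypothetical):

```python
# Toy 2-pass transformer in the style of VectorAssembler.
# Pass 1 scans the whole dataset to learn each vector column's length;
# pass 2 uses those lengths to build the assembled feature vector.
# On an unbounded stream, pass 1 never finishes, so the pattern fails.

def fit_lengths(rows):
    # Pass 1: scan the data to collect stats (vector lengths) on the "driver".
    lengths = {}
    for row in rows:
        for col, vec in row.items():
            lengths.setdefault(col, len(vec))
    return lengths

def assemble(rows, lengths):
    # Pass 2: scan the data again, concatenating columns in a fixed order.
    order = sorted(lengths)
    return [[x for col in order for x in row[col]] for row in rows]

rows = [{"a": [1.0, 2.0], "b": [3.0]},
        {"a": [4.0, 5.0], "b": [6.0]}]
stats = fit_lengths(rows)        # impossible to complete on an infinite stream
features = assemble(rows, stats)
print(features)                  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```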
  20. Handling invalid values. Invalid values include NaN and null values, out-of-bounds values (e.g., for Bucketizer), and incorrect Vector lengths (e.g., for VectorAssembler). Robust deployments must handle invalid data. ML Pipelines use the handleInvalid Param with options "skip" / "keep" / "error", but coverage is only partial.
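The three handleInvalid policies can be mimicked in plain Python for a Bucketizer-style transform (a sketch of the semantics only, not MLlib's code; `bucketize` is a hypothetical helper):

```python
import math

def bucketize(values, splits, handle_invalid="error"):
    """Map each value to a bucket index, given sorted split points."""
    out = []
    for v in values:
        invalid = v is None or math.isnan(v) or v < splits[0] or v >= splits[-1]
        if invalid:
            if handle_invalid == "skip":   # drop the invalid row entirely
                continue
            if handle_invalid == "keep":   # route to a dedicated extra bucket
                out.append(len(splits) - 1)
                continue
            raise ValueError(f"invalid value: {v!r}")  # "error": fail fast
        # Largest i with splits[i] <= v; v < splits[i+1] is then guaranteed.
        out.append(max(i for i in range(len(splits) - 1) if splits[i] <= v))
    return out

data = [0.5, float("nan"), 1.5]
print(bucketize(data, [0.0, 1.0, 2.0], "skip"))  # [0, 1]: invalid row dropped
print(bucketize(data, [0.0, 1.0, 2.0], "keep"))  # [0, 2, 1]: extra bucket 2
```

Note how "skip" silently shortens the output, which is why the cheat sheet later in this deck warns about it.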
  21. In this talk. Our toolkit: ML Pipelines & Structured Streaming. Issues in Apache Spark 2.2. Fixes in Apache Spark 2.3. Tips & resources.
  22. Most Transformers & Models "just work." As of Apache Spark 2.3, batch & streaming scoring/transform are essentially identical: PipelineModel.transform() works on streaming Datasets and DataFrames, and a new unit test framework covers both batch & streaming. Fixes & tests are tracked in SPARK-21926 & SPARK-22644.
  23. Fixes for 2-pass Transformers: VectorAssembler. It assembles multiple columns into one feature Vector and needs the lengths of the Vector columns, which it can extract from metadata (added by, e.g., OneHotEncoder) or compute from the data; computing from the data fails with Structured Streaming. The fix: VectorSizeHint manually adds the Vector length to column metadata, and is required only for Structured Streaming.
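In the same toy terms: once sizes are declared up front (as VectorSizeHint declares them in column metadata), assembly becomes a single pass, so each streaming row can be scored as it arrives (conceptual sketch; `assemble_row` and the size dict are hypothetical):

```python
def assemble_row(row, declared_sizes):
    # Sizes come from declared metadata (the VectorSizeHint idea), not
    # from a prior scan of the data, so a single pass suffices.
    out = []
    for col in sorted(declared_sizes):
        vec = row[col]
        if len(vec) != declared_sizes[col]:  # validate against the declared size
            raise ValueError(
                f"{col}: expected {declared_sizes[col]}, got {len(vec)}")
        out.extend(vec)
    return out

sizes = {"a": 2, "b": 1}                     # declared up front, never learned
print(assemble_row({"a": [1.0, 2.0], "b": [3.0]}, sizes))  # [1.0, 2.0, 3.0]
```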
  24. Fixes for 2-pass Transformers: OneHotEncoder. It transforms a categorical column to a 0/1 Vector and needs the number of categories, which it can extract from metadata (added by, e.g., StringIndexer) or compute from the data; computing from the data is a bug (hidden state) when train & test data have different categories. The fix: OneHotEncoderEstimator, whose fit() stores the categories for use in transform(), matching behavior at training & test time.
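The fit/transform split can be illustrated with a toy encoder (conceptual only, not MLlib's OneHotEncoderEstimator): fit() learns the category count once, and transform() reuses it, so a batch that happens to be missing some categories is still encoded consistently.

```python
class ToyOneHotEncoder:
    """Toy fit/transform split in the style of OneHotEncoderEstimator."""

    def fit(self, values):
        # Learn the state (number of categories) once, at training time.
        self.num_categories = max(values) + 1
        return self

    def transform(self, values):
        # Reuse the learned size; never re-derive it from the current batch.
        return [[1.0 if i == v else 0.0 for i in range(self.num_categories)]
                for v in values]

enc = ToyOneHotEncoder().fit([0, 1, 2])  # training data has 3 categories
print(enc.transform([1]))                # [[0.0, 1.0, 0.0]]: length 3,
                                         # even though this batch lacks 0 and 2
```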
  25. Handling invalid values. Improvements in Spark 2.3: VectorIndexer, StringIndexer, OneHotEncoderEstimator; Bucketizer, QuantileDiscretizer; RFormula. Most coverage handles NaN; some handles null. Fixes targeted for Spark 2.4: VectorAssembler; RFormula (pass handleInvalid to all sub-stages).
  26. Demo: Streaming Scoring in 2.3
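The core of such a demo can be sketched in PySpark. This is a sketch only: the paths, the inputSchema, and a live `spark` session are assumptions, and it is not the talk's exact demo code.

```python
from pyspark.ml import PipelineModel

# Load a previously fitted (and persisted) Pipeline.
model = PipelineModel.load("/models/bot-detector")   # hypothetical path

# Read the same data source as a stream; streaming sources need an explicit schema.
streamingDF = spark.readStream.schema(inputSchema).parquet("/data/web-activity")

# The same transform() call used for batch scoring now scores each micro-batch.
scoredDF = model.transform(streamingDF)

query = (scoredDF.select("features", "prediction")
         .writeStream
         .format("memory")           # in-memory sink, handy for demos
         .queryName("predictions")   # query with: SELECT * FROM predictions
         .start())
```

The point of the talk is that the scoring line, model.transform(streamingDF), is unchanged from the batch version; only the read and write sides switch to streaming APIs.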
  27. In this talk. Our toolkit: ML Pipelines & Structured Streaming. Issues in Apache Spark 2.2. Fixes in Apache Spark 2.3. Tips & resources.
  28. Cheat sheet: fixing your Pipeline to work with Structured Streaming. Update uses of OneHotEncoder and VectorAssembler (RFormula should be OK). Check how invalid values are handled; beware handleInvalid="skip", which drops invalid Rows. Test! In custom logic (custom SQL, Transformers, Models), beware of 2-pass Transformers (hidden state).
  29. Remaining work. Locality Sensitive Hashing (LSH) Models do not work (SPARK-24465); they require Spark SQL support for nested UDTs (SPARK-12878). VectorAssemblerEstimator: a nicer API than VectorSizeHint (SPARK-24467). Handling invalid values: expanded support and better defaults for the handleInvalid Param.
  30. Beyond this talk. This talk covers deployment in streaming; not covered here: deployment outside of Spark, deployment in batch jobs, model management, feature management, experiment management, monitoring, A/B testing, and serving APIs.
  31. Resources. Overview of productionizing Apache Spark ML models: webinar with Richard Garris, http://go.databricks.com/apache-spark-mllib-2.x-how-to-productionize-your-machine-learning-models. Batch scoring: Apache Spark docs, https://spark.apache.org/docs/latest/ml-pipeline.html#ml-persistence-saving-and-loading-pipelines. Streaming scoring: guide and example notebook, https://docs.databricks.com/spark/latest/mllib/mllib-pipelines-and-stuctured-streaming.html. Sub-second scoring: webinar with Sue Ann Hong, https://www.brighttalk.com/webcast/12891/268455/productionizing-apache-spark-mllib-models-for-real-time-prediction-serving
  32. https://databricks.com/careers
  33. Thank You! Questions? Shout out to Bago Amirbekian, Weichen Xu, and the many other contributors to this work. Office hours today at 3:50pm at the Databricks booth.
