
Apache Spark Model Deployment


Tech-Talk at Bay Area Spark Meetup
Apache Spark™ has rapidly become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications? Like all things in engineering, it depends.
In this meetup, we will discuss best practices from Databricks on how our customers productionize machine learning models and do a deep dive with actual customer case studies and live demos of a few example architectures and code in Python and Scala. We will also briefly touch on what is coming in Apache Spark 2.X with model serialization and scoring options.



  1. Apache Spark™ Model Deployment – Bay Area Spark Meetup, June 30, 2016. Richard Garris, Big Data Solutions Architect focused on Advanced Analytics
  2. About Me – Richard L Garris • rlgarris@databricks.com • @rlgarris [Twitter] • Big Data Solutions Architect @ Databricks • 12+ years designing enterprise data solutions for everyone from startups to the Global 2000 • Prior work experience: PwC, Google, Skytree • Ohio State Buckeye and CMU alumnus
  3. About Apache Spark MLlib – Started at Berkeley AMPLab (Apache Spark 0.8). Now (Apache Spark 2.0): contributions from 75+ orgs, ~250 individuals • development driven by Databricks: roadmap + 50% of PRs • growing coverage of distributed algorithms. [Stack diagram: Spark, SparkSQL, Streaming, MLlib, GraphFrames]
  4. MLlib Goals – A general machine learning library for big data • Scalable & robust • Coverage of common algorithms • Leverages Apache Spark • Tools for practical workflows • Integration with existing data science tools
  5. Apache Spark MLlib • spark.mllib • The original API (pre Spark 1.4) • A lower-level library that works with Spark RDDs • Uses LabeledPoint, Vectors, and tuples • In maintenance mode only as of Spark 2.X

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // Load and parse the data
    val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    // Build the model
    val numIterations = 100
    val stepSize = 0.00000001
    val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

    // Evaluate the model on training examples and compute the training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
  6. Apache Spark – ML Pipelines • spark.ml • Spark 1.4+ • spark.ml pipelines make it possible to create more complex models • Integrated with DataFrames

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.regression.LinearRegression

    // Let's initialize our linear regression learner
    val lr = new LinearRegression()

    // Now we set the parameters for the method
    lr.setPredictionCol("Predicted_PE")
      .setLabelCol("PE")
      .setMaxIter(100)
      .setRegParam(0.1)

    // We will use the new spark.ml pipeline API. If you have worked
    // with scikit-learn this will be very familiar.
    // (vectorizer is the feature-assembly stage defined earlier in the demo)
    val lrPipeline = new Pipeline()
    lrPipeline.setStages(Array(vectorizer, lr))

    // Let's first train on the entire dataset to see what we get
    val lrModel = lrPipeline.fit(trainingSet)
  7. The Agile Modeling Process: Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results
  8. The Agile Modeling Process: Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results. The last two steps – Deploy Model and Measure / Evaluate Results – are the focus of this talk.
  9. What is a Model?
  10. But What Really is a Model? A model is a complex pipeline of components: • Data Sources • Joins • Featurization Logic • Algorithm(s) • Transformers • Estimators • Tuning Parameters
  11. ML Pipelines – a very simple pipeline: [Diagram] Load data → Extract features → Train model → Evaluate
  12. ML Pipelines – a real pipeline: [Diagram] Datasources 1–3 → feature extraction and feature transforms 1–3 → train models 1 and 2 → ensemble → evaluate
  13. Why ML persistence? [Diagram] Data Science: prototype (Python/R) → create model. Software Engineering: re-implement the model for production (Java) → deploy model.
  14. Why ML persistence? Data Science: prototype (Python/R) → create Pipeline (extract raw features, transform features, select key features, fit multiple models, combine results to make a prediction). Software Engineering: re-implement the Pipeline for production (Java) → deploy the Pipeline. Costs: extra implementation work, different code paths, synchronization overhead.
  15. With ML persistence... Data Science: prototype (Python/R) → create Pipeline → persist the model or Pipeline: model.save("s3n://..."). Software Engineering: load the Pipeline (Scala/Java) with Model.load("s3n://...") → deploy in production.
  16. Demo – Model Serialization in Apache Spark 2.0 using Parquet
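To make the save/load round trip concrete, here is a minimal sketch of the flow the demo covers, using the Spark 2.0 persistence API; the S3 path and the newData DataFrame are placeholders, and lrModel is the fitted pipeline from slide 6:

    import org.apache.spark.ml.PipelineModel

    // Persist the fitted pipeline; Spark 2.0 writes the stage metadata as
    // JSON and the model data (e.g., coefficients) as Parquet
    lrModel.write.overwrite().save("s3n://my-bucket/models/lr-pipeline")

    // Later, possibly in a separate Scala or Java process, reload the whole
    // pipeline (featurization stages included) and score new data
    val restored = PipelineModel.load("s3n://my-bucket/models/lr-pipeline")
    val predictions = restored.transform(newData)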
  17. What are the Requirements for a Robust Model Deployment System?
  18. Your Model Scoring Environment – Customer SLAs: response time, throughput (predictions per second), uptime / reliability. Tech stack: C / C++, legacy (mainframe), Java, Docker.
  19. Model Scoring: Offline vs Online – Offline: internal use (batch) • emails, notifications (batch) • schedule-based or event-trigger-based. Online: customer waiting on the response (human real-time) • super low-latency with a fixed response window (transactional fraud, ad bidding).
  20. Model Scoring Considerations – Not all models return a yes / no. Example: a login bot detector behaves differently depending on the probability score (see the sketch after this slide): 0.0–0.4 ☞ allow login • 0.4–0.6 ☞ challenge question • 0.6–0.75 ☞ send SMS • 0.75–0.9 ☞ refer to agent • 0.9–1.0 ☞ block. Example: item recommendations – the output is a ranking of the top n items; the API sends a user ID + number of items and returns a sorted set of items to recommend; optionally, pass context-sensitive information to tailor the results.
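A minimal sketch of the login-bot threshold logic above; the function name and action labels are illustrative, and the score is assumed to be the model's probability output:

    // Map a bot-probability score to the graduated actions from slide 20
    def loginAction(score: Double): String = score match {
      case s if s < 0.4  => "allow login"
      case s if s < 0.6  => "challenge question"
      case s if s < 0.75 => "send SMS"
      case s if s < 0.9  => "refer to agent"
      case _             => "block"
    }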
  21. Model Updates and Versioning • Model update frequency (nightly, weekly, monthly, quarterly) • Model version tracking • Model release process: Dev ‣ Test ‣ Staging ‣ Production • Model update process: benchmark (or shadow models), phase-in (20% traffic), or big bang
  22. Model Governance Considerations • Models carry both reward and risk for the business: well-designed models prevent fraud, reduce churn, and increase sales; poorly designed models increase fraud, can damage the company's brand, and can cause compliance violations or other risks • Models should be governed by the company's policies and procedures, applicable laws and regulations, and the organization's management goals • Models have to be transparent, explainable, traceable, and interpretable for auditors / regulators • Models may need reason codes for rejections (e.g., if I decline someone credit, why?) • Models should have an approval and release process • Models also cannot violate any discrimination laws or use features that could be traced to religion, gender, ethnicity, or other protected attributes
  23. Model A/B Testing • A/B testing – comparing two versions to see which performs better • Historical data works for evaluating models in testing, but production experiments are required to validate the model hypothesis • The A/B framework should support the steps of the agile modeling process (Set Business Goals through Measure / Evaluate Results) and the model update process: benchmark (or shadow models), phase-in (20% traffic), or big bang
  24. Model Monitoring • Monitoring is the process of observing the model's performance, logging its behavior, and alerting when the model degrades • Logging should capture exactly the data fed into the model at the time of scoring (see the sketch after this slide) • Model alerting is critical to detect unusual or unexpected behaviors
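A rough sketch of scoring-time logging under that guideline; the logger name, feature encoding, and the model value (the spark.mllib model from slide 5) are assumptions for illustration:

    import org.apache.spark.mllib.linalg.Vectors
    import org.slf4j.LoggerFactory

    val scoringLog = LoggerFactory.getLogger("model.scoring")

    // Log the exact feature vector alongside the prediction so degraded
    // behavior can later be traced back to the inputs that produced it
    def scoreAndLog(features: Array[Double]): Double = {
      val prediction = model.predict(Vectors.dense(features))
      scoringLog.info(s"features=${features.mkString(",")} prediction=$prediction")
      prediction
    }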
  25. Open Loop vs Closed Loop • Open loop – a human being is involved • Closed loop – no human involved • Model scoring – almost always closed loop; some models alert agents or customer service • Model training – usually open loop, with a data scientist in the loop to update the model
  26. Online Learning • A closed-loop, entirely machine-driven modeling process is risky • You need proper model monitoring and safeguards to prevent abuse / sensitivity to noise • MLlib supports online learning through streaming models (streaming k-means and streaming logistic regression; see the sketch after this slide) • Alternative – use a more complex model to better fit new data rather than using online learning
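As a rough sketch of MLlib's streaming-model support, here is streaming logistic regression updating continuously on a DStream; the input directories, batch interval, and feature count are placeholder assumptions:

    import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val numFeatures = 10  // must be known up front to initialize the weights

    // Labeled examples arrive as text files in these (placeholder) directories
    val trainingData = ssc.textFileStream("hdfs:///training/").map(LabeledPoint.parse)
    val testData = ssc.textFileStream("hdfs:///test/").map(LabeledPoint.parse)

    val model = new StreamingLogisticRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))
      .setStepSize(0.5)

    model.trainOn(trainingData)  // the model updates as each micro-batch arrives
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()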
  27. Model Deployment Architectures
  28. Architecture #1 – Offline Recommendations. Nightly batch: train ALS model → save ranked offers to NoSQL → send offers to customers and display ranked offers in web / mobile (see the sketch below).
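A hedged sketch of that nightly batch, assuming ratings arrive as a CSV file and leaving the NoSQL write to be filled in per store; the path, rank, and iteration count are illustrative:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("data/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // Train the collaborative-filtering model on last night's ratings
    val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */)

    // Precompute the top 10 ranked offers per user and push them to NoSQL,
    // one connection per partition (e.g., a Redis SET per user ID)
    model.recommendProductsForUsers(10).foreachPartition { users =>
      // open the NoSQL connection here
      users.foreach { case (userId, recs) =>
        // write the ranked offer list for userId
        ()
      }
    }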
  29. Architecture #2 – Precomputed Features with Streaming. Spark Streaming reads web logs → pre-computes features → the features drive an action such as killing a user's login session (see the sketch below).
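A rough sketch of that flow, assuming web logs arrive over a socket; extractSessionId and killSession are hypothetical helpers stubbed out for illustration:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hypothetical helpers, assumed to exist in the real application
    def extractSessionId(line: String): String = line.split(" ")(0)  // placeholder parsing
    def killSession(sessionId: String): Unit = println(s"killing session $sessionId")  // placeholder action

    val ssc = new StreamingContext(sc, Seconds(5))
    val logs = ssc.socketTextStream("loghost", 9999)  // placeholder log source

    // Pre-compute a per-session feature: login attempts in the current window
    val loginAttempts = logs
      .filter(_.contains("LOGIN"))
      .map(line => (extractSessionId(line), 1))
      .reduceByKey(_ + _)

    // Sessions whose feature crosses a threshold get their login session killed
    loginAttempts
      .filter { case (_, attempts) => attempts > 20 }
      .foreachRDD(_.foreach { case (sessionId, _) => killSession(sessionId) })

    ssc.start()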
  30. Architecture #3 – Local Apache Spark™. Train the model in Spark → save the model to S3 / HDFS → copy the model to production → run Spark local against new data to produce predictions (see the sketch below).
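A minimal sketch of the scoring side of this architecture, assuming a Spark 2.0 PipelineModel was copied down from S3 / HDFS; the paths and output column name are placeholders:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.SparkSession

    // A single-process Spark running inside the serving host
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("local-scoring")
      .getOrCreate()

    val model = PipelineModel.load("/opt/models/lr-pipeline")  // copied from S3 / HDFS
    val newData = spark.read.parquet("/opt/data/incoming")
    model.transform(newData).select("Predicted_PE").show()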
  31. Demo – Example of Offline Recommendations using ALS and Redis as a NoSQL Cache
  32. Try Databricks Community Edition
  33. 2016 Apache Spark Survey
  34. Spark Summit EU – Brussels, October 25–27. The CFP closes at 11:59pm on July 1st. For more information and to submit: https://spark-summit.org/eu-2016/
