Apache Spark™
Model Deployment
Bay Area Spark Meetup – June 30, 2016
Richard Garris – Big Data Solution Architect Focused on Advanced Analytics
About Me
Richard L Garris
• rlgarris@databricks.com
• @rlgarris [Twitter]
Big Data Solutions Architect @ Databricks
12+ years designing Enterprise Data Solutions for everyone from
startups to Global 2000
Prior Work Experience PwC, Google, Skytree
Ohio State Buckeye and CMU alumnus
About Apache Spark MLlib
Started at Berkeley AMPLab
(Apache Spark 0.8)
Now (Apache Spark 2.0)
• Contributions from 75+ orgs, ~250 individuals
• Development driven by Databricks: roadmap + 50% of
PRs
• Growing coverage of distributed algorithms
[Spark stack: Spark SQL · Streaming · MLlib · GraphFrames]
MLlib Goals
General Machine Learning library for big data
• Scalable & robust
• Coverage of common algorithms
• Leverages Apache Spark
Tools for practical workflows
Integration with existing data science tools
Apache Spark MLlib
• spark.mllib
• The original MLlib API (pre-Spark 1.4)
• A lower-level library built on Spark RDDs
• Uses LabeledPoint, Vectors, and Tuples
• In maintenance mode as of Spark 2.x
// Imports for the RDD-based spark.mllib API
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate the model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
Apache Spark – ML Pipelines
• spark.ml
• Spark 1.4 and later
• spark.ml pipelines make it possible to build more complex models
• Integrated with DataFrames
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression

// Let's initialize our linear regression learner
val lr = new LinearRegression()

// Now we set the parameters for the method
lr.setPredictionCol("Predicted_PE")
  .setLabelCol("PE").setMaxIter(100).setRegParam(0.1)

// We will use the new spark.ml pipeline API. If you have
// worked with scikit-learn this will be very familiar.
// (vectorizer is a feature-assembly stage defined earlier in the demo)
val lrPipeline = new Pipeline()
lrPipeline.setStages(Array(vectorizer, lr))

// Let's first train on the entire dataset to see what we get
val lrModel = lrPipeline.fit(trainingSet)
The Agile Modeling Process
Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results → (back to Set Business Goals)
The Agile Modeling Process
Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results
Focus of this talk: the Deploy Model and Measure / Evaluate Results stages
What is a Model?
But What Really is a Model?
A model is a complex pipeline of components
• Data Sources
• Joins
• Featurization Logic
• Algorithm(s)
• Transformers
• Estimators
• Tuning Parameters
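As a concrete illustration, here is a minimal spark.ml sketch showing how a single Pipeline object captures several of these components (featurization logic, an algorithm, and its tuning parameters); all column names and stages are hypothetical:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Featurization logic: index a categorical column, assemble a feature vector
// (column names here are illustrative)
val indexer = new StringIndexer().setInputCol("country").setOutputCol("countryIdx")
val assembler = new VectorAssembler()
  .setInputCols(Array("countryIdx", "age", "numLogins"))
  .setOutputCol("features")

// Algorithm + tuning parameters
val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.1)

// The Pipeline ties the transformers and the estimator together
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
// val model = pipeline.fit(trainingData) // trainingData: a DataFrame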
ML Pipelines
A very simple pipeline: Load data → Extract features → Train model → Evaluate
ML Pipelines
A real pipeline: Datasource 1 / 2 / 3 → Extract features → Feature transforms 1–3 → Train model 1 and Train model 2 → Ensemble → Evaluate
Why ML persistence?
Data Science: Prototype (Python/R) → Create model
Software Engineering: Re-implement model for production (Java) → Deploy model
Why ML persistence?
Data Science: Prototype (Python/R) → Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction
Software Engineering: Re-implement Pipeline for production (Java) → Deploy Pipeline
The cost: extra implementation work, different code paths, synchronization overhead
With ML persistence...
Data Science: Prototype (Python/R) → Create Pipeline → Persist the model or Pipeline: model.save("s3n://...")
Software Engineering: Load the Pipeline (Scala/Java): Model.load("s3n://...") → Deploy in production
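As a concrete illustration of the save/load hand-off, here is a hedged sketch assuming a fitted spark.ml pipeline and DataFrames like those in the earlier demo code; the path and names are illustrative:

import org.apache.spark.ml.PipelineModel

// Data science side: fit and persist the whole pipeline
val fitted = pipeline.fit(trainingData)
fitted.write.overwrite().save("s3n://bucket/models/example-v1")

// Engineering side (Scala/Java): load and score – no re-implementation
val loaded = PipelineModel.load("s3n://bucket/models/example-v1")
val scored = loaded.transform(newData)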
Demo
Model Serialization in Apache Spark 2.0 using Parquet
What are the Requirements for a Robust Model Deployment System?
Your Model Scoring Environment
Customer SLAs
• Response time
• Throughput (predictions per second)
• Uptime / Reliability
Tech Stack
• C / C++
• Legacy (mainframe)
• Java
• Docker
Model Scoring: Offline vs Online
Offline
• Internal use (batch)
• Emails, notifications (batch)
• Schedule-based or event-trigger-based
Online
• Customer waiting on the response (human real-time)
• Super low latency with a fixed response window (transactional fraud, ad bidding)
Model Scoring Considerations
Not all models return a yes / no.
Example: Login Bot Detector
Different behavior depending on the probability score:
0.0–0.4 ☞ Allow login
0.4–0.6 ☞ Challenge question
0.6–0.75 ☞ Send SMS
0.75–0.9 ☞ Refer to agent
0.9–1.0 ☞ Block
Example: Item Recommendations
Output is a ranking of the top n items
API – send user ID + number of items
Returns a sorted set of items to recommend
Optional – pass context-sensitive information to tailor results
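A minimal sketch of the score-band routing in the login-bot example; the Action names are hypothetical, and the thresholds follow the slide:

// Possible actions for the login-bot detector (illustrative names)
sealed trait Action
case object AllowLogin extends Action
case object ChallengeQuestion extends Action
case object SendSms extends Action
case object ReferToAgent extends Action
case object Block extends Action

// Map a probability score to an action using the bands above
def route(score: Double): Action = score match {
  case s if s < 0.4  => AllowLogin
  case s if s < 0.6  => ChallengeQuestion
  case s if s < 0.75 => SendSms
  case s if s < 0.9  => ReferToAgent
  case _             => Block
}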
Model Updates and Versioning
• Model Update Frequency
(nightly, weekly, monthly, quarterly)
• Model Version Tracking
• Model Release Process
• Dev ‣ Test ‣ Staging ‣ Production
• Model update process
• Benchmark (or Shadow Models)
• Phase-In (20% traffic)
• Big Bang
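The phase-in option above can be as simple as hashing a stable user ID so that roughly 20% of traffic is scored by the new model. A minimal sketch (all names are illustrative):

// Route ~20% of users to the new model by hashing a stable user ID
def useNewModel(userId: String, rolloutPct: Int = 20): Boolean =
  (((userId.hashCode % 100) + 100) % 100) < rolloutPct

// newModel / currentModel / features are stand-ins for your scoring objects
val score =
  if (useNewModel(userId)) newModel.predict(features)
  else currentModel.predict(features)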
Model Governance Considerations
• Models carry both reward and risk for the business
– Well-designed models prevent fraud, reduce churn, and increase sales
– Poorly designed models increase fraud, can harm the company’s brand, and can cause compliance violations or other risks
• Models should be governed by the company's policies and procedures, laws and regulations, and the organization's management goals
• Models have to be transparent, explainable, traceable, and interpretable for auditors / regulators
• Models may need reason codes for rejections (e.g., if someone is declined credit, why?)
• Models should have an approval and release process
• Models also cannot violate discrimination laws or use features that could be traced to religion, gender, ethnicity, etc.
Model A/B Testing
Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results
• A/B testing – comparing two versions to see which performs better
• Historical data works for evaluating models in testing, but production experiments are required to validate the model hypothesis
• The A/B framework should support these steps of the model update process:
• Benchmark (or shadow) models
• Phase-in (20% of traffic)
• Big bang
Model Monitoring
• Monitoring is the process of observing the model’s performance, logging its behavior, and alerting when the model degrades
• Logging should capture exactly the data fed into the model at the time of scoring
• Model alerting is critical to detect unusual or unexpected behaviors
Open Loop vs Closed Loop
• Open loop – a human being is involved
• Closed loop – no human is involved
Model scoring – almost always closed loop; some models alert agents or customer service
Model training – usually open loop, with a data scientist in the loop to update the model
Online Learning
• Closed-loop, entirely machine-driven modeling is risky
• It needs proper model monitoring and safeguards to prevent abuse / sensitivity to noise
• MLlib supports online learning through streaming models (streaming k-means and streaming logistic regression)
• Alternative – retrain a more complex model to better fit new data rather than using online learning
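For reference, a minimal sketch of the streaming-model API mentioned above; trainingStream and testStream are assumed DStreams supplied by your application:

import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

val numFeatures = 10 // illustrative
val model = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingStream)       // trainingStream: DStream[LabeledPoint]
model.predictOn(testStream).print() // testStream: DStream[Vector]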
Model Deployment Architectures
Architecture #1: Offline Recommendations
Nightly batch: Train ALS Model → Ranked Offers → Save Offers to NoSQL → Send Offers to Customers and Display Ranked Offers in Web / Mobile
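A hedged sketch of the nightly ALS training step using the spark.ml API; the DataFrames and column names are illustrative:

import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setUserCol("userId").setItemCol("offerId").setRatingCol("rating")
  .setRank(10).setMaxIter(10).setRegParam(0.1)

// ratings: DataFrame of historical user/offer interactions
val model = als.fit(ratings)

// Score candidate (user, offer) pairs; transform adds a "prediction" column
val rankedOffers = model.transform(userOfferPairs)
// rankedOffers would then be written to the NoSQL store for serving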
Architecture #2: Precomputed Features with Streaming
Web Logs → Spark Streaming → Pre-compute Features → Features → Kill User’s Login Session
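A hedged sketch of the feature pre-computation step, assuming a text log feed and a hypothetical feature store:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val logs = ssc.socketTextStream("loghost", 9999) // stand-in for the log feed

// e.g. failed-login counts per user over a 5-minute sliding window
val failedLogins = logs
  .filter(_.contains("LOGIN_FAILED"))
  .map(line => (line.split(" ")(0), 1)) // assume userId is the first field
  .reduceByKeyAndWindow(_ + _, Seconds(300), Seconds(10))

// Push each batch to the store the scorer reads from
failedLogins.foreachRDD { rdd =>
  rdd.foreach { case (user, count) =>
    // featureStore.put(user, count) // hypothetical sink
  }
}

ssc.start()
ssc.awaitTermination()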
Architecture #3: Local Apache Spark™
Train Model in Spark → Save Model to S3 / HDFS → Copy Model to Production → Run Spark Local (New Data → Predictions)
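A hedged sketch of the scoring side of this architecture, assuming the saved model directory has been copied from S3/HDFS to the production host; paths are illustrative:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

// Spark runs inside (or alongside) the serving process, in local mode
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-scoring")
  .getOrCreate()

val model = PipelineModel.load("/models/example-v1") // copied from S3 / HDFS
val newData = spark.read.parquet("/data/incoming")
val predictions = model.transform(newData)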
Demo
• Example of Offline Recommendations using ALS and
Redis as a NoSQL Cache
Try Databricks Community Edition
2016 Apache Spark Survey
Spark Summit EU
Brussels
October 25-27
The CFP closes at 11:59pm on July 1st
For more information and to submit:
https://spark-summit.org/eu-2016/