Apache Spark™
Model Deployment
Bay Area Spark Meetup – June 30, 2016
Richard Garris – Big Data Solution Architect Focused on Advanced Analytics
About Me
Richard L Garris
• rlgarris@databricks.com
• @rlgarris [Twitter]
Big Data Solutions Architect @ Databricks
12+ years designing Enterprise Data Solutions for everyone from
startups to Global 2000
Prior Work Experience PwC, Google, Skytree
Ohio State Buckeye and CMU alumnus
About Apache Spark MLlib
Started at Berkeley AMPLab
(Apache Spark 0.8)
Now (Apache Spark 2.0)
• Contributions from 75+ orgs, ~250 individuals
• Development driven by Databricks: roadmap + 50% of
PRs
• Growing coverage of distributed algorithms
[Spark stack: Spark SQL · Streaming · MLlib · GraphFrames]
MLlib Goals
General Machine Learning library for big data
• Scalable & robust
• Coverage of common algorithms
• Leverages Apache Spark
Tools for practical workflows
Integration with existing data science tools
Apache Spark MLlib
• spark.mllib
• The original MLlib API (pre-Spark 1.4)
• A lower-level library built on Spark RDDs
• Uses LabeledPoint, Vectors, and Tuples
• In maintenance mode as of Spark 2.x
// Imports for the RDD-based spark.mllib API
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate the model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
Apache Spark – ML Pipelines
• spark.ml
• Spark 1.4 and later
• spark.ml pipelines make it possible to build more complex models
• Integrated with DataFrames
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression

// Let's initialize our linear regression learner
val lr = new LinearRegression()

// Now we set the parameters for the method
lr.setPredictionCol("Predicted_PE")
  .setLabelCol("PE").setMaxIter(100).setRegParam(0.1)

// We will use the new spark.ml pipeline API. If you have
// worked with scikit-learn this will be very familiar.
// (vectorizer is a feature-assembly stage defined earlier in the demo)
val lrPipeline = new Pipeline()
lrPipeline.setStages(Array(vectorizer, lr))

// Let's first train on the entire dataset to see what we get
val lrModel = lrPipeline.fit(trainingSet)
The Agile Modeling Process
Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results → (back to Set Business Goals)
The Agile Modeling Process
Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results
Focus of this talk: the Deploy Model and Measure / Evaluate Results stages
What is a Model?
But What Really is a Model?
A model is a complex pipeline of components
• Data Sources
• Joins
• Featurization Logic
• Algorithm(s)
• Transformers
• Estimators
• Tuning Parameters
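As a concrete illustration, here is a minimal spark.ml sketch showing how a single Pipeline object captures several of these components (featurization logic, an algorithm, and its tuning parameters); all column names and stages are hypothetical:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Featurization logic: index a categorical column, assemble a feature vector
// (column names here are illustrative)
val indexer = new StringIndexer().setInputCol("country").setOutputCol("countryIdx")
val assembler = new VectorAssembler()
  .setInputCols(Array("countryIdx", "age", "numLogins"))
  .setOutputCol("features")

// Algorithm + tuning parameters
val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.1)

// The Pipeline ties the transformers and the estimator together
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
// val model = pipeline.fit(trainingData) // trainingData: a DataFrame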
ML Pipelines
A very simple pipeline: Load data → Extract features → Train model → Evaluate
ML Pipelines
A real pipeline: Datasource 1 / 2 / 3 → Extract features → Feature transforms 1–3 → Train model 1 and Train model 2 → Ensemble → Evaluate
Why ML persistence?
Data Science: Prototype (Python/R) → Create model
Software Engineering: Re-implement model for production (Java) → Deploy model
Why ML persistence?
Data Science: Prototype (Python/R) → Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction
Software Engineering: Re-implement Pipeline for production (Java) → Deploy Pipeline
The cost: extra implementation work, different code paths, synchronization overhead
With ML persistence...
Data Science: Prototype (Python/R) → Create Pipeline → Persist the model or Pipeline: model.save("s3n://...")
Software Engineering: Load the Pipeline (Scala/Java): Model.load("s3n://...") → Deploy in production
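As a concrete illustration of the save/load hand-off, here is a hedged sketch assuming a fitted spark.ml pipeline and DataFrames like those in the earlier demo code; the path and names are illustrative:

import org.apache.spark.ml.PipelineModel

// Data science side: fit and persist the whole pipeline
val fitted = pipeline.fit(trainingData)
fitted.write.overwrite().save("s3n://bucket/models/example-v1")

// Engineering side (Scala/Java): load and score – no re-implementation
val loaded = PipelineModel.load("s3n://bucket/models/example-v1")
val scored = loaded.transform(newData)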
Demo
Model Serialization in Apache Spark 2.0 using Parquet
What are the Requirements for a Robust Model Deployment System?
Your Model Scoring Environment
Customer SLAs
• Response time
• Throughput (predictions per second)
• Uptime / Reliability
Tech Stack
• C / C++
• Legacy (mainframe)
• Java
• Docker
Model Scoring: Offline vs Online
Offline
• Internal use (batch)
• Emails, notifications (batch)
• Schedule-based or event-trigger-based
Online
• Customer waiting on the response (human real-time)
• Super low latency with a fixed response window (transactional fraud, ad bidding)
Model Scoring Considerations
Not all models return a yes / no.
Example: Login Bot Detector
Different behavior depending on the probability score:
0.0–0.4 ☞ Allow login
0.4–0.6 ☞ Challenge question
0.6–0.75 ☞ Send SMS
0.75–0.9 ☞ Refer to agent
0.9–1.0 ☞ Block
Example: Item Recommendations
Output is a ranking of the top n items
API – send user ID + number of items
Returns a sorted set of items to recommend
Optional – pass context-sensitive information to tailor results
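A minimal sketch of the score-band routing in the login-bot example; the Action names are hypothetical, and the thresholds follow the slide:

// Possible actions for the login-bot detector (illustrative names)
sealed trait Action
case object AllowLogin extends Action
case object ChallengeQuestion extends Action
case object SendSms extends Action
case object ReferToAgent extends Action
case object Block extends Action

// Map a probability score to an action using the bands above
def route(score: Double): Action = score match {
  case s if s < 0.4  => AllowLogin
  case s if s < 0.6  => ChallengeQuestion
  case s if s < 0.75 => SendSms
  case s if s < 0.9  => ReferToAgent
  case _             => Block
}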
Model Updates and Versioning
• Model Update Frequency
(nightly, weekly, monthly, quarterly)
• Model Version Tracking
• Model Release Process
• Dev ‣ Test ‣ Staging ‣ Production
• Model update process
• Benchmark (or Shadow Models)
• Phase-In (20% traffic)
• Big Bang
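The phase-in option above can be as simple as hashing a stable user ID so that roughly 20% of traffic is scored by the new model. A minimal sketch (all names are illustrative):

// Route ~20% of users to the new model by hashing a stable user ID
def useNewModel(userId: String, rolloutPct: Int = 20): Boolean =
  (((userId.hashCode % 100) + 100) % 100) < rolloutPct

// newModel / currentModel / features are stand-ins for your scoring objects
val score =
  if (useNewModel(userId)) newModel.predict(features)
  else currentModel.predict(features)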
Model Governance Considerations
• Models carry both reward and risk for the business
– Well-designed models prevent fraud, reduce churn, and increase sales
– Poorly designed models increase fraud, can harm the company’s brand, and can cause compliance violations or other risks
• Models should be governed by the company's policies and procedures, laws and regulations, and the organization's management goals
• Models have to be transparent, explainable, traceable, and interpretable for auditors / regulators
• Models may need reason codes for rejections (e.g., if someone is declined credit, why?)
• Models should have an approval and release process
• Models also cannot violate discrimination laws or use features that could be traced to religion, gender, ethnicity, etc.
Model A/B Testing
Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results
• A/B testing – comparing two versions to see which performs better
• Historical data works for evaluating models in testing, but production experiments are required to validate the model hypothesis
• The A/B framework should support these steps of the model update process:
• Benchmark (or shadow) models
• Phase-in (20% of traffic)
• Big bang
Model Monitoring
• Monitoring is the process of observing the model’s performance, logging its behavior, and alerting when the model degrades
• Logging should capture exactly the data fed into the model at the time of scoring
• Model alerting is critical to detect unusual or unexpected behaviors
Open Loop vs Closed Loop
• Open loop – a human being is involved
• Closed loop – no human is involved
Model scoring – almost always closed loop; some models alert agents or customer service
Model training – usually open loop, with a data scientist in the loop to update the model
Online Learning
• Closed-loop, entirely machine-driven modeling is risky
• It needs proper model monitoring and safeguards to prevent abuse / sensitivity to noise
• MLlib supports online learning through streaming models (streaming k-means and streaming logistic regression)
• Alternative – retrain a more complex model to better fit new data rather than using online learning
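For reference, a minimal sketch of the streaming-model API mentioned above; trainingStream and testStream are assumed DStreams supplied by your application:

import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

val numFeatures = 10 // illustrative
val model = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingStream)       // trainingStream: DStream[LabeledPoint]
model.predictOn(testStream).print() // testStream: DStream[Vector]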
Model Deployment Architectures
Architecture #1: Offline Recommendations
Nightly batch: Train ALS Model → Ranked Offers → Save Offers to NoSQL → Send Offers to Customers and Display Ranked Offers in Web / Mobile
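A hedged sketch of the nightly ALS training step using the spark.ml API; the DataFrames and column names are illustrative:

import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setUserCol("userId").setItemCol("offerId").setRatingCol("rating")
  .setRank(10).setMaxIter(10).setRegParam(0.1)

// ratings: DataFrame of historical user/offer interactions
val model = als.fit(ratings)

// Score candidate (user, offer) pairs; transform adds a "prediction" column
val rankedOffers = model.transform(userOfferPairs)
// rankedOffers would then be written to the NoSQL store for serving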
Architecture #2: Precomputed Features with Streaming
Web Logs → Spark Streaming → Pre-compute Features → Features → Kill User’s Login Session
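A hedged sketch of the feature pre-computation step, assuming a text log feed and a hypothetical feature store:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val logs = ssc.socketTextStream("loghost", 9999) // stand-in for the log feed

// e.g. failed-login counts per user over a 5-minute sliding window
val failedLogins = logs
  .filter(_.contains("LOGIN_FAILED"))
  .map(line => (line.split(" ")(0), 1)) // assume userId is the first field
  .reduceByKeyAndWindow(_ + _, Seconds(300), Seconds(10))

// Push each batch to the store the scorer reads from
failedLogins.foreachRDD { rdd =>
  rdd.foreach { case (user, count) =>
    // featureStore.put(user, count) // hypothetical sink
  }
}

ssc.start()
ssc.awaitTermination()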
Architecture #3: Local Apache Spark™
Train Model in Spark → Save Model to S3 / HDFS → Copy Model to Production → Run Spark Local (New Data → Predictions)
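A hedged sketch of the scoring side of this architecture, assuming the saved model directory has been copied from S3/HDFS to the production host; paths are illustrative:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

// Spark runs inside (or alongside) the serving process, in local mode
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-scoring")
  .getOrCreate()

val model = PipelineModel.load("/models/example-v1") // copied from S3 / HDFS
val newData = spark.read.parquet("/data/incoming")
val predictions = model.transform(newData)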
Demo
• Example of Offline Recommendations using ALS and
Redis as a NoSQL Cache
Try Databricks Community Edition
2016 Apache Spark Survey
Spark Summit EU
Brussels
October 25-27
The CFP closes at 11:59pm on July 1st
For more information and to submit:
https://spark-summit.org/eu-2016/