Lambda architecture on Spark, Kafka for real-time large scale ML

© Cloudera, Inc. All rights reserved.
Oryx 2 Overview
Sean Owen | Cloudera | @sean_r_owen

Consider the Music Recommender
• Collect Play Data & Do Data Science
• Build Taste Model Offline
• Learn Quickly from New Plays
• Recommend New Songs Now

From Exploratory to Operational?
Exploratory Analytics | Operational Analytics
• Explore Data
• Pick Model
• Build Model at Scale, Offline
• Continuously Update Model?
• Score Model in Real-Time?

Large Scale or Real-Time?
Large-Scale, Offline, Batch
vs
Real-Time, Online, Streaming
Why Don’t We Have Both?
λ!

The Lambda Architecture
• Batch Layer
  • High latency, high throughput
  • Compute official result
• Speed Layer
  • Low latency
  • Compute approximate update to last known result
• Serving Layer
  • Real-time
  • Merge batch/speed results
www.ymc.ch/en/lambda-architecture-part-1
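The three layers above can be sketched in a few lines. This is a minimal illustration only, assuming a toy "play count per song" result; the function names and the sample events are hypothetical and are not part of Oryx or any library.

```python
from collections import Counter

def batch_layer(all_events):
    """Recompute the official result from the full history (slow, exact)."""
    return Counter(song for _, song in all_events)

def speed_layer(recent_events):
    """Compute an approximate delta from events seen since the last batch run."""
    return Counter(song for _, song in recent_events)

def serving_layer(batch_result, speed_delta, song):
    """Merge the last batch result with the speed-layer delta at query time."""
    return batch_result[song] + speed_delta[song]

# Hypothetical (user, song) play events
history = [("alice", "song-a"), ("bob", "song-a"), ("alice", "song-b")]
recent = [("carol", "song-a")]

official = batch_layer(history)
delta = speed_layer(recent)
merged_a = serving_layer(official, delta, "song-a")
merged_b = serving_layer(official, delta, "song-b")
print(merged_a, merged_b)
```

When the next batch run folds the recent events into the official result, the delta can be discarded, which is why the speed layer only has to be approximately right.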

λ Architecture fits ML + Hadoop
• Batch Layer
  • Train, evaluate, tune model over all data in hours
• Speed Layer
  • Update model approximately in minutes or seconds
• Serving Layer
  • Make prediction, recommendation from model in milliseconds
Spark Streaming, MLlib

www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg

History (or: 5th time’s a charm)
Taste (2005 – 2009)
- Recommender toolkit in Java
- Local only
- Serves results
Apache Mahout (2009 – 2014)
- Adds Hadoop-based model building at scale
- But no serving
Myrrix (2011 – 2013)
- Mahout recs reimagined
- Adds serving to Hadoop-based model build
Oryx 1 (2013 –)
- Extends to classification, clustering
- PMML
- Merge with cloudera/ml
Oryx 2 (2014 –)
- Same APIs / goals
- Rewrite
- Full lambda architecture
- Kafka + Spark + YARN

Complementary, Not Competitive
Most ML-on-Hadoop tools are for building models only, and excel at this.
Oryx and similar projects do everything else around this: continuous update, serving.

Architecture
[Diagram: the Input (Kafka topic) feeds recent input to the Batch Layer and the Speed Layer (both Spark Streaming) and receives input from the Serving Layers. The Batch Layer also reads historical input from HDFS and publishes models; the Speed Layer publishes model updates. Both go to the Models + Updates (Kafka topic), which the Speed Layer and the Serving Layers read. The Serving Layers answer queries.]

Data Transport
• Input Kafka topic
  • Any type; usually strings
  • From external or Serving Layer
• Update Kafka topic
  • Serialized models (PMML) produced by Batch Layer
  • Model updates / deltas produced by Speed Layer
[Diagram: the Input (Kafka topic) and Models + Updates (Kafka topic) portion of the architecture]
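A consumer of the update topic has to tell full models apart from deltas. The sketch below shows one way to do that, assuming each message carries a key of "MODEL" or "UP" and that deltas are JSON objects with `id` and `vector` fields. Those keys, the JSON layout and the stub PMML string are illustrative assumptions, and a plain Python list stands in for the Kafka topic.

```python
import json

def apply_update(state, key, message):
    """Apply one (key, message) pair read from the update topic."""
    if key == "MODEL":
        # A full model from the Batch Layer replaces everything known so far.
        state["pmml"] = message
        state["vectors"] = {}
    elif key == "UP":
        # A delta from the Speed Layer patches the current model in place.
        delta = json.loads(message)
        state["vectors"][delta["id"]] = delta["vector"]
    else:
        raise ValueError("unknown key: " + key)
    return state

# In-memory stand-in for the update topic (placeholder PMML, hypothetical deltas)
update_topic = [
    ("MODEL", "<PMML version=\"4.2.1\"/>"),
    ("UP", json.dumps({"id": "user-1", "vector": [0.1, 0.2]})),
    ("UP", json.dumps({"id": "user-1", "vector": [0.3, 0.4]})),
]

state = {"pmml": None, "vectors": {}}
for key, message in update_topic:
    state = apply_update(state, key, message)
print(state["vectors"])
```

Because every consumer replays the same ordered topic, each one can rebuild the same in-memory state without talking to the others.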

Batch Layer
• Spark Streaming
• Persists input topic data to HDFS from Kafka
• Builds “model” occasionally from historical and new data
• Hours
• ML: can use MLlib
• ML: tunes hyperparameters
• Publishes models as PMML to update topic
[Diagram: the Batch Layer (Spark Streaming) reads recent input and historical input from HDFS and produces a model]
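The build, tune and publish cycle described above can be sketched without Spark. The example below assumes a toy one-parameter ridge regression in place of an MLlib model, a dict in place of PMML, and a Python list in place of the update topic; all names and data are hypothetical.

```python
def train(data, lam):
    """Fit y = w * x with L2 regularization strength lam."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def evaluate(w, data):
    """Mean squared error of the model on held-out data."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

def batch_generation(historical, new, candidates, update_topic):
    """One batch run: train per candidate, keep the best, publish it."""
    data = historical + new
    # Hold out every fourth example for evaluation, train on the rest
    test = data[::4]
    training = [d for i, d in enumerate(data) if i % 4 != 0]
    scored = [(evaluate(train(training, lam), test), lam) for lam in candidates]
    _, best_lam = min(scored)
    model = {"w": train(training, best_lam), "lam": best_lam}
    update_topic.append(model)  # stand-in for publishing PMML to Kafka
    return model

historical = [(1, 2), (2, 4), (3, 6), (4, 8)]
new = [(5, 10), (6, 12), (7, 14), (8, 16)]
update_topic = []
model = batch_generation(historical, new, [0.0, 1.0, 10.0], update_topic)
print(model)
```

Each candidate hyperparameter value costs a full training pass, which is one reason this layer is measured in hours at real scale.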

Speed Layer
• Spark Streaming
• Listens for new PMML models
• Listens to input topic too
• Computes approximate updates to model implied by input and publishes to update topic
• Seconds
[Diagram: the Speed Layer (Spark Streaming) reads recent input and the current model and produces model updates]
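To make "approximate update" concrete, here is a sketch for a matrix-factorization recommender. It applies a single gradient step to one user's feature vector when a new play arrives. This is a generic technique chosen for illustration and is not necessarily the update rule Oryx uses; `target`, `rate` and the vectors are assumed values.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def approximate_update(user_vec, item_vec, target=1.0, rate=0.5):
    """One gradient step moving the user's score for the item toward target."""
    error = target - dot(user_vec, item_vec)
    return [u + rate * error * v for u, v in zip(user_vec, item_vec)]

# A user the model knows nothing about plays an item
user_vec = [0.0, 0.0]
item_vec = [1.0, 0.0]

before = dot(user_vec, item_vec)
updated = approximate_update(user_vec, item_vec)
after = dot(updated, item_vec)

# The delta that would be published to the update topic (hypothetical format)
delta = {"id": "user-1", "vector": updated}
print(before, after, delta)
```

The step touches one vector and ignores the rest of the data, so it finishes in well under a second; the next batch run replaces it with a properly retrained model.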

Serving Layer
• Tomcat + JAX-RS
• (Can deploy on YARN)
• REST API
• Listens for new PMML models and updates from update topic
• Scores model / answers queries
• Writes to input topic too
• No shared state; scales horizontally
• Milliseconds
[Diagram: several Serving Layer instances each read the model and model updates, answer queries, and write input]
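Answering a recommendation query from an in-memory model can be sketched as follows, assuming a factorization model held as plain dicts of feature vectors. The model contents and the `recommend` function are hypothetical, and the REST layer is left out.

```python
import heapq

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def recommend(model, user, how_many=2):
    """Score every item for the user and return the top item IDs."""
    user_vec = model["users"][user]
    scores = ((dot(user_vec, vec), item) for item, vec in model["items"].items())
    return [item for _, item in heapq.nlargest(how_many, scores)]

# Hypothetical in-memory model, as loaded from the update topic
model = {
    "users": {"u1": [1.0, 0.0]},
    "items": {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.5, 0.5]},
}

top = recommend(model, "u1")
print(top)
```

The function reads only its own copy of the model, so any number of instances can answer queries side by side. It also scores every item on each call, so the work per query grows with the number of items and features.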

Logical Architecture
Serving Layer Speed Layer Batch Layer
Tier          Serving Layer         Speed Layer                Batch Layer
App Tier      oryx-app-serving      oryx-app-mllib, oryx-app   oryx-app-mllib, oryx-app
ML Tier       -                     oryx-ml                    oryx-ml
Lambda Tier   oryx-lambda-serving   oryx-lambda                oryx-lambda
• Lambda Tier: generic Lambda-Architecture support
• ML Tier: ML-specific specialization
• App Tier: prebuilt recommender, clustering, classification implementations

Recommendation Benchmarks
• Scoring on the fly is not cheap
• 1M users/items ≈ 1GB heap at scale (≈ 200 features)
• Feature and item counts determine latency and throughput
• Java 8 + 16-core 2.3GHz Xeon
  • Smallish models ≈ 100s QPS, 10s ms latency
  • Huge models ≈ single digit QPS, 100s ms latency
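The heap figure can be sanity-checked with simple arithmetic. The sketch below assumes each feature is stored as a 4-byte float and counts only the raw vector data; the function name is hypothetical.

```python
def model_heap_bytes(num_vectors, num_features, bytes_per_value=4):
    """Raw size of the feature vectors, ignoring object and map overhead."""
    return num_vectors * num_features * bytes_per_value

estimate = model_heap_bytes(1_000_000, 200)
print(estimate / 1e9)  # 0.8
```

Under these assumptions the vectors alone take about 0.8 GB, so roughly 1GB once per-object and lookup-structure overhead is included is plausible.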

Key Technology Roster
• Spark 1.3.1
• MLlib
• Streaming
• Kafka 0.8.2.1
• Hadoop 2.6
• HDFS
• YARN
• JavaEE 7
• JAX-RS 2
• Jersey 2
• Servlet 3.1
• Tomcat 8
• JPMML + PMML 4.2.1
CDH 5.4+

Status
• Cloudera Labs project
• Partial collaboration with Intel
• Not shipped with CDH
• Not supported; no plans to support it yet
• 2.0.0 beta 3
• Suitable for POCs
• 2.0.0 by end of year
• Best For
  • Recommender engines
  • Real-time anomaly detection
  • Real-time classification
  • Problems where both scale and latency are important
  • CDH users

Get Started in ~1 Hour
http://oryx.io

Thank you
@sean_r_owen
sowen@cloudera.com

The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com
Public registration is now open!
Who: Featuring data scientists from Salesforce,
Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco
