Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia

Deep Learning and Streaming
in Apache Spark 2.2
Matei Zaharia
@matei_zaharia

Evolution of Big Data Systems
Tremendous potential, but very
hard to use at first:
• Low-level APIs (MapReduce)
• Separate systems for each
workload (SQL, ETL, ML, etc)

How Spark Tackled this Problem
1) Composable, high-level APIs
• Functional programs in Scala, Python, Java, R
• Opens big data to many more users
2) Unified engine
• Combines batch, interactive, streaming
• Simplifies building end-to-end apps
SQLStreaming ML Graph
…

Expanding Spark to New Areas
Structured Streaming
Deep Learning
1
2

Real-Time Applications Today
Increasingly important to put big data in production
• Real-time reporting, model serving, etc
But very hard to build:
• Disparate code for streaming & batch
• Complex interactions with
external systems
• Hard to operate and debug
Goal: unified API for end-to-end continuous apps
Batch
Job
Ad-hoc
Queries
Input
Stream
Atomic
Output
Continuous
Application
Static Data
Batch
Jobs
>_

Structured Streaming
New end-to-end streaming API built on Spark SQL
• Simple APIs: DataFrames, Datasets and SQL – same as in batch.
Event-time processing and out-of-order data.
• End-to-end exactly once: Transactional both in processing & output.
• Complete app lifecycle: Code upgrades, ad-hoc queries and more.
Marked GA in Apache Spark 2.2

Simple APIs: Benchmark
7
KStream<String, ProjectedEvent> filteredEvents = kEvents.filter((key, value) -> {
return value.event_type.equals("view");
}).mapValues((value) -> {
return new ProjectedEvent(value.ad_id, value.event_time);
});
KTable<String, String> kCampaigns = builder.table("campaigns", "campaign-state");
KTable<String, CampaignAd> deserCampaigns = kCampaigns.mapValues((value) -> {
Map<String, String> campMap = Json.parser.readValue(value);
return new CampaignAd(campMap.get("ad_id"), campMap.get("campaign_id"));
});
KStream<String, String> joined =
filteredEvents.join(deserCampaigns, (value1, value2) -> {
return value2.campaign_id;
},
Serdes.String(), Serdes.serdeFrom(new ProjectedEventSerializer(),
new ProjectedEventDeserializer()));
KStream<String, String> keyedByCampaign = joined.selectKey((key, value) -> value);
KTable<Windowed<String>, Long> counts = keyedByCampaign.groupByKey()
.count(TimeWindows.of(10000), "time-windows");
Filter by click type and project
Join with campaigns table
Group and windowed count
streams

});
},
});
KTable<Windowed<String>, Long> counts =keyedByCampaign.groupByKey()
8
DataFrames
streams
events
.where("event_type = 'view'")
.join(table("campaigns"), "ad_id")
.groupBy(
window('event_time, "10 seconds"),
'campaign_id)
.count()

9
streams
SQL
SELECT COUNT(*)
FROM events
JOIN campaigns USING ad_id
WHERE event_type = 'view'
GROUP BY
window(event_time, "10 seconds"),
campaign_id)
});
},
});
KTable<Windowed<String>, Long> counts =keyedByCampaign.groupByKey()
streams

DataFrame,
Dataset or SQL
input = spark.readStream
.format("kafka")
.option("subscribe", "topic")
.load()
result = input
.select("device", "signal")
.where("signal > 15")
result.writeStream
.format("parquet")
.start("dest-path")
Logical
Plan
Read from
Kafka
Project
device, signal
Filter
signal > 15
Write to
Kafka
Under the Covers
Structured Streaming automatically incrementalizes
the provided batch computation
Series of Incremental
Execution Plans
Kafka
Source
Optimized
Operator
codegen, off-
heap, etc.
Kafka
Sink
Optimized
Physical Plan
process
newdata
t = 1 t = 2 t = 3
process
newdata
process
newdata

Structured Streaming reuses
the Spark SQL Optimizer
and Tungsten Engine.
11https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark
Throughput
At ~200ms Latency
700K
15M
65M
0
10
20
30
40
50
60
70
Kafka
Streams
Flink Structured
Streaming
Millions
5xlower cost
Performance: Benchmark

What About Latency?
Continuous processing mode for execution without microbatches
• <1 ms latency (same as per-record streaming systems)
• No changes to user code
• Proposal in SPARK-20928
Databricks blog post: tinyurl.com/spark-continuous-processing

Structured Streaming Use Cases
Cloud big data platform serving 500+ orgs
Metrics pipeline: 14B events/h on 10 nodes
Dashboards Analyze usage trends in real time
Alerts Notify engineers of critical issues
Ad-hoc Analysis Diagnose issues when they occur
ETL Clean and store historical data

Cloud big data platform serving 500+ orgs
Metrics pipeline: 14B events/h on 10 nodes
=
Metrics
Filter
ETL
Dashboards
Ad-hoc
Analysis
Alerts

Monitor quality of live video in production
across dozens of online properties
Analyze data from 1000s of WiFi hotspots
to find anomalous behavior
More info: see talks at Spark Summit 2017

Deep Learning has Huge Potential
Unprecedented ability to work with unstructured data
such as images and text

But Deep Learning is Hard to Use
Current APIs (TensorFlow, Keras, BigDL, etc) are low-level
• Build a computation graph from scratch
• Scale-out typically requires manual parallelization
Hard to expose models in larger applications
Very similar to early big data APIs (MapReduce)

Our Goal
Enable an order of magnitude more users to build
applications using deep learning
Provide scale & production use out of the box

Deep Learning Pipelines
A new high-level API for deep learning that integrates with
Apache Spark’s ML Pipelines
• Common use cases in just a few lines of code
• Automatically scale out on Spark
• Expose models in batch/streaming apps & Spark SQL
Builds on existing DL engines (TensorFlow, Keras, BigDL)

Image Loading
from sparkdl import readImages
image_df = readImages(sample_img_dir)

Applying Popular Models
Popular pre-trained models included as MLlib Transformers
predictor = DeepImagePredictor(inputCol="image",
outputCol="predicted_labels",
modelName="InceptionV3")
predictions_df = predictor.transform(image_df)

Fast Model Training via Transfer Learning
Example: identify James Bond cars

SoftMax
GIANT PANDA 0.9
RED PANDA 0.05
RACCOON 0.01
…
Classifier
Transfer Learning
DeepImageFeaturizer

Transfer Learning as an ML Pipeline
MLlib Pipeline
Image
Loading Preprocessing
Logistic
Regression
DeepImageFeaturizer

Transfer Learning Code
featurizer = DeepImageFeaturizer(modelName="InceptionV3”)
lr = LogisticRegression()
p = Pipeline(stages=[featurizer, lr])
model = p.fit(train_images_df)
Automatically distributed across cluster!

Distributed Model Tuning Code
myEstimator = KerasImageFileEstimator(
inputCol='input', outputCol='output', modelFile='/model.h5')
params1 = {'batch_size':10, epochs:10}
params2 = {'batch_size':5, epochs:20}
myParamMaps = ParamGridBuilder()
.addGrid(myEstimator.kerasParams, [params1, params2]).build()
cv = CrossValidator(myEstimator, myEvaluator, myParamMaps)
cvModel = cv.fit()

Sharing and Applying Models
Take a trained model / Pipeline, register a SQL UDF usable by
anyone in the organization
In Spark SQL:
registerKerasUDF("my_object_recognition_function",
keras_model_file="/mymodels/007model.h5")
select image, my_object_recognition_function(image) as objects
from traffic_imgs
Can now apply in streaming, batch or interactive queries!

Other Upcoming Features
Distributed training of one model via TensorFlowOnSpark
(https://github.com/yahoo/TensorFlowOnSpark)
More built-in data types: text, time series, etc

Scalable Deep Learning made Simple
High-level API for Deep Learning, integrated with MLlib
Scales common tasks with transformers and estimators
Expose deep learning models in MLlib and Spark SQL
Early release of Deep Learning Pipelines:
github.com/databricks/spark-deep-learning

Conclusion
As new use cases mature for big data, systems will naturally
move from specialized/complex to unified
We’re applying the lessons from early Spark to streaming & DL
• High-level, composable APIs
• Flexible execution (SQL optimizer, continuous processing)
• Support for end-to-end apps

https://spark-summit.org/eu-2017/
15% discount code: MateiAMS

Free preview release:
dbricks.co/2sK35XT

Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia

Similar to Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia (20)

More from GoDataDriven

More from GoDataDriven (20)

Recently uploaded

Recently uploaded (20)

Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia