Deep Learning and Streaming in Apache Spark 2.x
Matei Zaharia
@matei_zaharia
Welcome to Spark Summit Europe
Our largest European summit yet
102 talks
1,200 attendees
11 tracks
What’s New in Spark?
Cost-based optimizer (Spark 2.2)
Python and R improvements
• PyPI & CRAN packages (Spark 2.2)
• Python ML plugins (Spark 2.3)
• Vectorized Pandas UDFs (Spark 2.3)
Kubernetes support (targeting 2.3)
[Chart: running time (s) — Spark 2.2 UDFs vs. vectorized UDFs]
[Chart: running time (s) on queries Q1 and Q2 — Pandas vs. Spark]
Spark: The Definitive Guide
To be released this winter
Free preview chapters and
code on Databricks website:
dbricks.co/spark-guide
Two Fast-Growing Workloads
Both are important but complex with current tools
We think we can simplify both with Apache Spark!
Streaming & Deep Learning
Why are Streaming and DL Hard?
Similar to early big data tools!
Tremendous potential, but very hard to use at first:
• Low-level APIs (MapReduce)
• Separate systems for each task (SQL, ETL, ML, etc.)
Spark’s Approach
1) Composable, high-level APIs
• Build apps from components
2) Unified engine
• Run complete, end-to-end apps
SQL • Streaming • ML • Graph • …
Expanding Spark to New
Areas
1) Structured Streaming
2) Deep Learning
Structured Streaming
Streaming today requires separate APIs & systems
Structured Streaming is a high-level, end-to-end API
• Simple interface: run any DataFrame or SQL code incrementally
• Complete apps: combine with batch & interactive queries
• End-to-end reliability: exactly-once processing
Became GA in Apache Spark 2.2
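The "end-to-end exactly-once" bullet rests on one key trick: if the sink tracks which micro-batch IDs it has already committed, replaying a batch after a failure becomes a no-op. A minimal pure-Python sketch of that idea (conceptual only — `IdempotentSink` is an illustrative name, not Spark's sink API):

```python
# Sketch: a sink that commits each micro-batch at most once. Replaying a
# batch after a failure (same batch id) is silently skipped, so the output
# reflects each input record exactly once.

class IdempotentSink:
    def __init__(self):
        self.rows = []
        self.committed = set()

    def add_batch(self, batch_id, rows):
        if batch_id in self.committed:   # retry of an already-committed batch
            return
        self.rows.extend(rows)
        self.committed.add(batch_id)

sink = IdempotentSink()
sink.add_batch(0, ["a", "b"])
sink.add_batch(0, ["a", "b"])   # replayed after a failure -- ignored
sink.add_batch(1, ["c"])
# sink.rows is ["a", "b", "c"], with no duplicates from the retry
```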
Structured Streaming Use
Cases
Monitor quality of live video streaming
Anomaly detection on millions of WiFi hotspots
100s of customer apps in production on Databricks
Largest apps process tens of trillions of records per month
Real-time game analytics at scale
// Filter view events and project the fields we need
KStream<String, ProjectedEvent> filtered = kEvents.filter((key, value) -> {
    return value.event_type.equals("view");
}).mapValues((value) -> {
    return new ProjectedEvent(value.ad_id, value.event_time);
});

// Load the campaigns table and deserialize its JSON values
KTable<String, String> kCampaigns = builder.table("campaigns", "cmp-state");
KTable<String, CampaignAd> deserCampaigns = kCampaigns.mapValues((value) -> {
    Map<String, String> campMap = Json.parser.readValue(value);
    return new CampaignAd(campMap.get("ad_id"), campMap.get("campaign_id"));
});

// Join events with campaigns, then count views per campaign in 10 s windows
KStream<String, String> joined = filtered.join(deserCampaigns,
    (value1, value2) -> {
        return value2.campaign_id;
    },
    Serdes.String(), Serdes.serdeFrom(new ProjectedEventSerializer(),
        new ProjectedEventDeserializer()));
KStream<String, String> keyedData = joined.selectKey((key, value) -> value);
KTable<Windowed<String>, Long> counts = keyedData.groupByKey()
    .count(TimeWindows.of(10000), "time-windows");
Example: Benchmark
Kafka Streams (above) vs. DataFrames (below):
events
.where("event_type = 'view'")
.join(table("campaigns"), "ad_id")
.groupBy(
window('event_time, "10 seconds"),
'campaign_id)
.count()
[Diagram: the batch plan (Scan Files → Aggregate → Write to Sink) is automatically transformed into an incremental plan (Scan New Files → Stateful Agg. → Update Sink)]
4× lower cost: Structured Streaming reuses the Spark SQL
optimizer and Tungsten engine.
https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark
Performance: Benchmark System Throughput
[Chart: throughput in millions of records/s — Kafka Streams 700K, Apache Flink 15M, Structured Streaming 65M, with 4× fewer nodes]
What About Latency?
Continuous processing mode to run without
microbatches
• <1 ms latency (same as per-record streaming systems)
• No changes to user code
• Proposal in SPARK-20928
Key idea: same API can target both streaming & batch
Find out more in today’s deep dive
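A toy model of why dropping micro-batches cuts latency: in micro-batch mode each record waits for its batch boundary before it can be processed, while continuous mode handles records as they arrive. The numbers are illustrative, not Spark measurements:

```python
def microbatch_latency(arrival_times, interval):
    # latency = time from a record's arrival to the end of its micro-batch
    return [((t // interval) + 1) * interval - t for t in arrival_times]

def continuous_latency(arrival_times, per_record_cost=0.001):
    # per-record processing: only a small fixed cost remains
    return [per_record_cost for _ in arrival_times]

arrivals = [0.1, 0.4, 0.9, 1.2]           # arrival times in seconds
mb = microbatch_latency(arrivals, 0.5)    # 500 ms micro-batches
ct = continuous_latency(arrivals)         # continuous processing
# worst-case micro-batch latency approaches the batch interval;
# continuous latency stays at the per-record cost
```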
Expanding Spark to New
Areas
1) Structured Streaming
2) Deep Learning
Deep Learning has Huge Potential
Unprecedented ability to work with unstructured
data such as images and text
But Deep Learning is Hard to Use
Current APIs (TensorFlow, Keras, etc.) are low-level
• Build a computation graph from scratch
Scale-out requires manual parallelization
Hard to use models in larger applications
Very similar to early big data APIs
Deep Learning on Spark
Image support in MLlib: SPARK-21866 (Spark 2.3)
DL framework integrations: TensorFlowOnSpark,
MMLSpark, Intel BigDL
Higher-level APIs: Deep Learning Pipelines
New in TensorFlowOnSpark
Library to run distributed TF on Spark clusters & data
• Built at Yahoo!, where it powers photos, videos & more
Yahoo! and Databricks collaborated to add:
• ML pipeline APIs
• Support for non-YARN and AWS clusters
github.com/yahoo/TensorFlowOnSpark
Talk tomorrow at 17:00
Deep Learning Pipelines
Low-level DL frameworks are powerful, but common
use cases should be much simpler to build
Goal: Enable an order of magnitude more
users to build production apps using deep
learning
Deep Learning Pipelines
Key idea: High-level API built on ML Pipelines model
• Common use cases are just a few lines of code
• All operators automatically scale over Spark
• Expose models in batch, streaming & SQL apps
Uses existing DL engines (TensorFlow, Keras, etc)
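The "expose models in batch, streaming & SQL apps" idea can be sketched without Spark: once prediction is an ordinary function, the identical logic serves batch scoring and per-record streaming (and, in Spark, registration as a SQL UDF). All names below are illustrative, not the Deep Learning Pipelines API:

```python
def predict_label(image):
    # stand-in for a real DL model: any callable scoring function works here
    return "cat" if sum(image) % 2 == 0 else "dog"

def batch_predict(images):
    # batch context: score a whole dataset at once
    return [predict_label(img) for img in images]

def stream_predict(image_stream):
    # streaming context: score records one by one as they arrive
    for img in image_stream:
        yield predict_label(img)

batch_labels = batch_predict([[1, 2, 3], [1, 2, 4]])
stream_labels = list(stream_predict(iter([[1, 2, 3], [1, 2, 4]])))
# both contexts produce the same labels from the same model function
```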
Example: Using Existing Model
predictor = DeepImagePredictor(inputCol="image",
outputCol="labels",
modelName="InceptionV3")
predictions_df = predictor.transform(image_df)
SELECT image, my_predictor(image) AS labels
FROM uploaded_images
Example: Model Search
est = KerasImageFileEstimator()
grid = ParamGridBuilder() \
    .addGrid(est.modelFile, ["InceptionV3", "ResNet50"]) \
    .addGrid(est.kerasParams, [{'batch': 32}, {'batch': 64}]) \
    .build()
CrossValidator(estimator=est, evaluator=eval, estimatorParamMaps=grid).fit(image_df)
[Diagram: the Spark driver fans out four parallel fits — InceptionV3 @ batch size 32, ResNet50 @ batch size 32, InceptionV3 @ batch size 64, ResNet50 @ batch size 64]
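The grid expands to the cross product of the parameter lists, and CrossValidator fits one model per combination — which Spark can distribute across the cluster. A sketch of the expansion (illustrative, not ParamGridBuilder's implementation):

```python
from itertools import product

model_files = ["InceptionV3", "ResNet50"]
keras_params = [{"batch": 32}, {"batch": 64}]

# Cross product of all parameter values: 2 models x 2 batch sizes
# yields 4 candidate configurations, each fit as a separate model.
grid = [{"modelFile": m, "kerasParams": p}
        for m, p in product(model_files, keras_params)]
```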
Deep Learning Pipelines Demo
Sue Ann Hong

Editor's Notes

  • #10 Make this more about how easy it is.
  • #13 Comparable latency to flink
  • #14 We’ve been experimenting with this at DB and we’re excited to contribute it back.