Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das

“In Spark 2.0, we have extended DataFrames and Datasets to handle real time streaming data. This not only provides a single programming abstraction for batch and streaming data, but also brings support for event-time based processing, out-of-order/delayed data, sessionization and tight integration with non-streaming data sources and sinks. In this talk, I will take a deep dive into the concepts and the API and show how this simplifies building complex “Continuous Applications”.” - T.D.

Databricks Blog: "Structured Streaming In Apache Spark 2.0: A new high-level API for streaming"
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

// About the Presenter //
Tathagata Das is an Apache Spark Committer and a member of the PMC. He’s the lead developer behind Spark Streaming, and is currently employed at Databricks. Before Databricks, you could find him at the AMPLab of UC Berkeley, researching datacenter frameworks and networks with professors Scott Shenker and Ion Stoica.

Follow T.D. on -
Twitter: https://twitter.com/tathadas
LinkedIn: https://www.linkedin.com/in/tathadas


// Transcript //

  1. A Deep Dive into Structured Streaming. Tathagata "TD" Das (@tathadas), Spark Summit 2016.
  2. Who am I? Project Mgmt. Committee (PMC) member of Apache Spark. Started Spark Streaming in grad school at AMPLab, UC Berkeley. Software engineer at Databricks, involved with all things streaming in Spark.
  3. Streaming in Apache Spark. Spark Streaming changed how people write streaming apps. [Stack diagram: SQL, Streaming, MLlib, and GraphX on top of Spark Core.] Functional, concise, and expressive. Fault-tolerant state management. Unified stack with batch processing. More than 50% of users consider it the most important part of Apache Spark.
  4. Streaming apps are growing more complex.
  5. Streaming computations don't run in isolation. They need to interact with batch data, interactive analysis, machine learning, etc.
  6. Use case: IoT Device Monitoring. IoT events from Kafka. ETL into long-term storage: prevent data loss, prevent duplicates. Status monitoring: handle late data, aggregate on windows on event time. Interactively debug issues: consistency with the event stream. Anomaly detection: learn models offline, use online + continuous learning.
  7. Use case: IoT Device Monitoring (same pipeline as slide 6). Taken together, these are Continuous Applications, not just streaming any more.
  8. Pain points with DStreams. 1. Processing with event time, dealing with late data: the DStream API exposes batch time, hard to incorporate event time. 2. Interoperating streaming with batch AND interactive: RDD/DStream have similar APIs, but still require translation. 3. Reasoning about end-to-end guarantees: requires carefully constructing sinks that handle failures correctly; data consistency in the storage while it is being updated.
  9. Structured Streaming
  10. The simplest way to perform streaming analytics is not having to reason about streaming at all.
  11. New Model. [Diagram: trigger every 1 sec; at times 1, 2, 3 the input table grows to "data up to 1/2/3" and the query runs over it.] Input: data from the source as an append-only table. Trigger: how frequently to check the input for new data. Query: operations on the input, the usual map/filter/reduce plus new window and session ops.
  12. New Model (continued). [Diagram: at each trigger, output is produced for data up to 1, 2, 3.] Result: the final operated table, updated every trigger interval. Output: what part of the result to write to the data sink after every trigger. Complete output: write the full result table every time.
  13. New Model (continued). [Same diagram, showing delta output.] Result: the final operated table, updated every trigger interval. Output: what part of the result to write to the data sink after every trigger. Complete output: write the full result table every time. Delta output: write only the rows that changed in the result from the previous batch. Append output: write only new rows. *Not all output modes are feasible with all queries.
  14. API: Dataset/DataFrame. Static, bounded data and streaming, unbounded data: a single API!
  15. Batch ETL with DataFrames: input = spark.read.format("json").load("source-path"); result = input.select("device", "signal").where("signal > 15"); result.write.format("parquet").save("dest-path"). Read from a JSON file, select some devices, write to a Parquet file. (See Sketch A after the transcript.)
  16. Streaming ETL with DataFrames: input = spark.read.format("json").stream("source-path"); result = input.select("device", "signal").where("signal > 15"); result.write.format("parquet").startStream("dest-path"). Read from a JSON file stream (replace load() with stream()), select some devices (the code does not change), write to a Parquet file stream (replace save() with startStream()). (See Sketch B after the transcript.)
  17. Streaming ETL with DataFrames (continued; same code as slide 16). read…stream() creates a streaming DataFrame and does not start any of the computation; write…startStream() defines where and how to output the data and starts the processing.
  18. Streaming ETL with DataFrames (continued; same code as slide 16). [Diagram: at triggers 1, 2, 3 the input is treated as an append-only table, and in append mode only the new rows of the result are written to the output.]
  19. Continuous Aggregations. Continuously compute the average signal across all devices: input.avg("signal"). Continuously compute the average signal of each type of device: input.groupBy("device-type").avg("signal"). (See Sketch C after the transcript.)
  20. Continuous Windowed Aggregations: input.groupBy($"device-type", window($"event-time-col", "10 min")).avg("signal"). Continuously compute the average signal of each type of device over the last 10 minutes using event time. Simplifies event-time stream processing (not possible in DStreams). Works on both streaming and batch jobs. (See Sketch C after the transcript.)
  21. Joining streams with static data: kafkaDataset = spark.read.kafka("iot-updates").stream(); staticDataset = ctxt.read.jdbc("jdbc://", "iot-device-info"); joinedDataset = kafkaDataset.join(staticDataset, "device-type"). Join streaming data from Kafka with static data via JDBC to enrich the streaming data, without having to think about the fact that you are joining streaming data. (See Sketch D after the transcript.)
  22. Output Modes. Defines what is output every time there is a trigger; different output modes make sense for different queries. Append mode with non-aggregation queries: input.select("device", "signal").write.outputMode("append").format("parquet").startStream("dest-path"). Complete mode with aggregation queries: input.agg(count("*")).write.outputMode("complete").format("parquet").startStream("dest-path"). (See Sketch B after the transcript.)
  23. Query Management: query = result.write.format("parquet").outputMode("append").startStream("dest-path"); then query.stop(), query.awaitTermination(), query.exception(), query.sourceStatuses(), query.sinkStatus(). The query is a handle to the running streaming computation for managing it: stop it, wait for it to terminate, get status, get the error if it terminated. Multiple queries can be active at the same time; each query has a unique name for keeping track. (See Sketch E after the transcript.)
  24. Query Execution. Logically: Dataset operations on a table (i.e. as easy to understand as batch). Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously). [Diagram: DataFrame -> Logical Plan -> Catalyst optimizer -> continuous, incremental execution.]
  25. Structured Streaming. High-level streaming API built on Datasets/DataFrames: event time, windowing, sessions, sources and sinks; end-to-end exactly-once semantics. Unifies streaming, interactive, and batch queries: aggregate data in a stream, then serve using JDBC; add, remove, and change queries at runtime; build and apply ML models.
  26. What can you do with this that's hard with other engines? True unification: same code + same super-optimized engine for everything. Flexible API tightly integrated with the engine: choose your own tool, Dataset/DataFrame/SQL. Greater debuggability and performance: benefits of Spark in-memory computing, elastic scaling, fault tolerance, straggler mitigation, …
  27. Underneath the Hood
  28. Batch Execution on Spark SQL. [Diagram: DataFrame/Dataset -> Logical Plan, an abstract representation of the query.]
  29. Batch Execution on Spark SQL (continued). [Diagram of the Catalyst planner: SQL AST / DataFrame / Dataset -> Unresolved Logical Plan -> Analysis (with Catalog) -> Logical Plan -> Logical Optimization -> Optimized Logical Plan -> Physical Planning -> Physical Plans -> Cost Model -> Selected Physical Plan -> Code Generation -> RDDs.] A helluva lot of magic! (See Sketch F after the transcript.)
  30. Batch Execution on Spark SQL (continued). [Diagram: DataFrame/Dataset -> Logical Plan -> Planner -> Execution Plan.] Run super-optimized Spark jobs to compute results. Code optimizations: bytecode generation, JVM intrinsics, vectorization, operations on serialized data. Memory optimizations: compact and fast encoding, off-heap memory. Project Tungsten, Phases 1 and 2.
  31. Continuous Incremental Execution. The planner knows how to convert streaming logical plans to a continuous series of incremental execution plans, each processing the next chunk of streaming data. [Diagram: DataFrame/Dataset -> Logical Plan -> Planner -> Incremental Execution Plans 1, 2, 3, 4, …]
  32. Continuous Incremental Execution (continued). The planner polls for new data from the sources, then incrementally executes the new data and writes to the sink. [Diagram: Incremental Execution 1, Offsets: [19-105], Count: 87; Incremental Execution 2, Offsets: [106-197], Count: 92.]
  33. Continuous Aggregations. Maintain the running aggregate as in-memory state, backed by a WAL in the file system for fault tolerance. State data is generated and used across incremental executions. [Diagram: Incremental Execution 1, Offsets: [19-105], running count 87, state: 87; Incremental Execution 2, Offsets: [106-179], count 87+92 = 179, state: 179.]
  34. Fault-tolerance. All data and metadata in the system needs to be recoverable / replayable. [Diagram: planner, source, sink, state, and incremental executions 1 and 2.]
  35. Fault-tolerance: Fault-tolerant Planner. Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS. Offsets are written to the fault-tolerant WAL before execution.
  36. Fault-tolerance: Fault-tolerant Planner (continued). A failed planner fails the current execution.
  37. Fault-tolerance: Fault-tolerant Planner (continued). On restart, the planner reads the log to recover from failures and re-executes the exact range of offsets: offsets are read back from the WAL and the same executions are regenerated from them.
  38. Fault-tolerance: Fault-tolerant Sources. Structured Streaming sources are by design replayable (e.g. Kafka, Kinesis, files) and generate exactly the same data given the offsets recovered by the planner.
  39. Fault-tolerance: Fault-tolerant State. Intermediate "state data" is maintained in versioned, key-value maps in Spark workers, backed by HDFS. The planner makes sure the "correct version" of state is used to re-execute after a failure. State is fault-tolerant with a WAL.
  40. Fault-tolerance: Fault-tolerant Sink. Sinks are by design idempotent and handle re-executions to avoid double-committing the output.
  41. Offset tracking in WAL + state management + fault-tolerant sources and sinks = end-to-end exactly-once guarantees.
  42. Fast, fault-tolerant, exactly-once stateful stream processing without having to reason about streaming.
  43. Release Plan: Spark 2.0 [June 2016]. Basic infrastructure and API: event time, windows, aggregations; Append and Complete output modes; support for a subset of batch queries. Sources and sinks: sources are files (Kafka coming soon after the 2.0 release); sinks are files and an in-memory table. An experimental release to set the future direction: not ready for production, but good to experiment with and provide feedback.
  44. Release Plan: Spark 2.1+. Stability and scalability. Support for more queries: multiple aggregations, sessionization, more output modes, watermarks and late data. Sources and sinks: public APIs, ML integrations. Make Structured Streaming ready for production workloads as soon as possible.
  45. Stay tuned on our Databricks blogs for more information and examples on Structured Streaming. Try the latest version of Apache Spark and the preview of Spark 2.0. Try Apache Spark with Databricks: http://databricks.com/try
  46. Structured Streaming: making Continuous Applications easier, faster, and smarter. Follow me @tathadas. AMA @ Databricks booth: today, now - 2:00 PM; tomorrow, 12:15 PM - 1:00 PM.
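
// Code Sketches //

The sketches below expand the code fragments from the slides into fuller Scala form, and are referenced from the transcript above. They are illustrative only: paths, column names, and connection strings are placeholders, and the streaming calls follow the pre-release Spark 2.0 API used in the talk (stream(), startStream()), which became readStream / writeStream in the final 2.0 release.

Sketch A (slide 15): batch ETL with DataFrames, using the standard DataFrame API.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("iot-etl").getOrCreate()

val input = spark.read
  .format("json")
  .load("source-path")          // read a directory of JSON files

val result = input
  .select("device", "signal")   // project only the columns we need
  .where("signal > 15")         // keep the strong-signal readings

result.write
  .format("parquet")
  .save("dest-path")            // write the filtered data as Parquet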
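
Sketch B (slides 16-18 and 22): the same ETL as a streaming query. This mirrors the pre-release API shown on the slides; only the read and write calls change from Sketch A, the transformation code is identical. It continues from the SparkSession defined in Sketch A.

val input = spark.read
  .format("json")
  .stream("source-path")        // streaming source: pick up new JSON files as they arrive

val result = input
  .select("device", "signal")   // transformation code unchanged from the batch job
  .where("signal > 15")

result.write
  .format("parquet")
  .outputMode("append")         // non-aggregation query, so append mode (slide 22)
  .startStream("dest-path")     // defines the sink and starts the continuous processing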
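
Sketch C (slides 19-20): continuous and windowed aggregations over the streaming input from Sketch B. The device-type and event-time-col column names are the slides' placeholders; "10 minutes" is an illustrative window size.

import org.apache.spark.sql.functions.window
import spark.implicits._        // enables the $"column" syntax used below

// Average signal across all devices (the slide abbreviates this as input.avg("signal")).
val avgAll = input.groupBy().avg("signal")

// Average signal of each type of device, updated continuously.
val avgByType = input.groupBy("device-type").avg("signal")

// Average signal of each device type over 10-minute event-time windows.
val avgByWindow = input
  .groupBy($"device-type", window($"event-time-col", "10 minutes"))
  .avg("signal")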
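
Sketch D (slide 21): joining a stream with static data. The Kafka reader mirrors the pre-release call on the slide; the JDBC URL, table name, and join column are placeholders. The join itself is ordinary DataFrame API.

import java.util.Properties

// Streaming side, as written on the slide (pre-release Kafka source API).
val kafkaDataset = spark.read
  .kafka("iot-updates")
  .stream()

// Static side: device metadata loaded over JDBC.
val staticDataset = spark.read
  .jdbc("jdbc:postgresql://host/devices", "iot-device-info", new Properties())

// Enrich the stream; the join is written exactly as it would be for two static tables.
val joinedDataset = kafkaDataset.join(staticDataset, "device-type")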
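
Sketch E (slide 23): managing a running query through its handle, again with the pre-release method names shown on the slide (the handle became StreamingQuery in the released API). It continues from the result DataFrame of Sketch B.

val query = result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")     // returns a handle to the running streaming computation

// Operations on the handle, as listed on the slide:
query.awaitTermination()        // block until the query stops or fails
// query.stop()                 // stop the query
// query.exception()            // the error, if the query terminated with one
// query.sourceStatuses()       // progress of each source
// query.sinkStatus()           // progress of the sink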
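
Sketch F (slides 28-30): inspecting the plans the Catalyst planner builds for a batch query. explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan; this is standard Dataset API.

val plans = spark.read
  .format("json")
  .load("source-path")
  .select("device", "signal")
  .where("signal > 15")

plans.explain(true)             // show the full chain of Catalyst plans for this query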
