Making Structured Streaming Ready for Production

In mid-2016, we introduced Structured Streaming, a new stream processing engine built on Spark SQL that revolutionized how developers can write stream processing applications without having to reason about streaming at all. It allows users to express their streaming computations the same way they would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive. It truly unifies batch, streaming and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.

The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as streaming source and sink. Since then, we have put in a lot of work to make it ready for production use. In this talk, Tathagata Das will cover in more detail the major features we have added, the recipes for using them in production, and the exciting new features we have planned for future releases. Some of these features are as follows:

- Design and use of the Kafka Source
- Support for watermarks and event-time processing
- Support for more operations and output modes

Speaker: Tathagata Das

This talk was originally presented at Spark Summit East 2017.

Transcript: Making Structured Streaming Ready for Production

  1. Making Structured Streaming Ready For Production Tathagata “TD” Das @tathadas Spark Summit East 8th February 2017
  2. About Me Spark PMC Member Built Spark Streaming in UC Berkeley Currently focused on Structured Streaming 2
  3. building robust stream processing apps is hard 3
  4. Complexities in stream processing 4 Complex Data Diverse data formats (json, avro, binary, …) Data can be dirty, late, out-of-order Complex Systems Diverse storage systems and formats (SQL, NoSQL, parquet, ... ) System failures Complex Workloads Event time processing Combining streaming with interactive queries, machine learning
  5. Structured Streaming stream processing on Spark SQL engine fast, scalable, fault-tolerant rich, unified, high level APIs deal with complex data and complex workloads rich ecosystem of data sources integrate with many storage systems 5
  6. Philosophy the simplest way to perform stream processing is not having to reason about streaming at all 6
  7. Treat Streams as Unbounded Tables 7 data stream unbounded input table new data in the data stream = new rows appended to an unbounded table
  8. New Model Trigger: every 1 sec Time Input data up to t = 3 Query Input: data from source as an append-only table Trigger: how frequently to check input for new data Query: operations on input usual map/filter/reduce new window, session ops t=1 t=2 t=3 data up to t = 1 data up to t = 2
  9. New Model result up to t = 1 Result Query Time data up to t = 1 Input data up to t = 2 result up to t = 2 data up to t = 3 result up to t = 3 Result: final operated table updated after every trigger Output: what part of result to write to storage after every trigger Complete output: write full result table every time Output [complete mode] write all rows in result table to storage t=1 t=2 t=3
  10. New Model t=1 t=2 t=3 Result Query Time Input data up to t = 3 result up to t = 3 Output [append mode] write new rows since last trigger to storage Result: final operated table updated after every trigger Output: what part of result to write to storage after every trigger Complete output: write full result table every time Append output: write only new rows that got added to result table since previous batch *Not all output modes are feasible with all queries data up to t = 1 data up to t = 2 result up to t = 1 result up to t = 2
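     A minimal sketch of how these output modes are chosen on the writer, using the input DataFrame defined on the following slides; sink formats and paths are placeholders, and as the slide notes, not all output modes are feasible for all queries:
     // complete mode: an aggregation query rewrites its full result table every trigger
     input.groupBy("device").count()
       .writeStream
       .outputMode("complete")
       .format("console")                   // console sink, handy for debugging
       .start()
     // append mode: a selection/filter query writes only the rows added since the last trigger
     input.select("device", "signal").where("signal > 15")
       .writeStream
       .outputMode("append")
       .format("parquet")
       .option("checkpointLocation", "/checkpoints/append-example")   // placeholder path
       .start("dest-path")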
  11. static data = bounded table streaming data = unbounded table API - Dataset/DataFrame Single API !
  12. Batch Queries with DataFrames input = spark.read .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.write .format("parquet") .save("dest-path") Read from Json file Select some devices Write to parquet file
  13. Streaming Queries with DataFrames input = spark.readStream .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.writeStream .format("parquet") .start("dest-path") Read from Json file stream Replace read with readStream Select some devices Code does not change Write to Parquet file stream Replace save() with start()
  14. DataFrames, Datasets, SQL input = spark.readStream .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.writeStream .format("parquet") .start("dest-path") Logical Plan Streaming Source Project device, signal Filter signal > 15 Streaming Sink Streaming Query Execution Spark SQL converts batch-like query to series of incremental execution plans operating on new batches of data Series of Incremental Execution Plans process new files t = 1 t = 2 t = 3 process new files process new files
  15. Fault-tolerance with Checkpointing Checkpointing - metadata (e.g. offsets) of current batch stored in a write ahead log in HDFS/S3 Query can be restarted from the log Streaming sources can replay the exact data range in case of failure Streaming sinks can dedup reprocessed data when writing, idempotent by design end-to-end exactly-once guarantees process new files t = 1 t = 2 t = 3 process new files process new files write ahead log
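     A minimal sketch of the restart behaviour described above: re-running the same writeStream with the same checkpointLocation (placeholder paths below) resumes from the offsets recorded in the write-ahead log, and the source replays the unprocessed range:
     // first run: every batch's offsets are recorded in the checkpoint location
     val query = result.writeStream
       .format("parquet")
       .option("checkpointLocation", "/checkpoints/etl")   // placeholder HDFS/S3 path
       .start("dest-path")
     // after a failure or stop(), starting again with the same checkpointLocation
     // picks up from the last recorded offsets instead of reprocessing everything
     val restarted = result.writeStream
       .format("parquet")
       .option("checkpointLocation", "/checkpoints/etl")
       .start("dest-path")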
  16. Complex Streaming ETL 16
  17. Traditional ETL Raw, dirty, un/semi-structured data is dumped as files Periodic jobs run every few hours to convert raw data to structured data ready for further analytics 17
  18. Traditional ETL Hours of delay before taking decisions on latest data Unacceptable when time is of essence [intrusion detection, anomaly detection, etc.] 18
  19. Streaming ETL w/ Structured Streaming Structured Streaming enables raw data to be available as structured data as soon as possible 19
  20. Streaming ETL w/ Structured Streaming 20 Example - Json data being received in Kafka - Parse nested json and flatten it - Store in structured Parquet table - Get end-to-end failure guarantees val rawData = spark.readStream .format("kafka") .option("subscribe", "topic") .option("kafka.bootstrap.servers",...) .load() val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json").as("data")) .select("data.*") val query = parsedData.writeStream .option("checkpointLocation", "/checkpoint") .partitionBy("date") .format("parquet") .start("/parquetTable/")
  21. Reading from Kafka [Spark 2.1] 21 Supports Kafka 0.10.0.1 Specify options to configure How? kafka.bootstrap.servers => broker1 What? subscribe => topic1,topic2,topic3 // fixed list of topics subscribePattern => topic* // dynamic list of topics assign => {"topicA":[0,1] } // specific partitions Where? startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345} } val rawData = spark.readStream .format("kafka") .option("kafka.bootstrap.servers",...) .option("subscribe", "topic") .load()
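     A minimal sketch combining the options listed above; broker addresses and the topic pattern are placeholders:
     val rawData = spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")   // placeholder brokers
       .option("subscribePattern", "topic.*")                            // dynamic list of topics
       .option("startingOffsets", "earliest")                            // default is latest
       .load()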
  22. Reading from Kafka 22 val rawData = spark.readStream .format("kafka") .option("subscribe", "topic") .option("kafka.bootstrap.servers",...) .load() The rawData DataFrame has the following columns:
     key        value      topic      partition  offset  timestamp
     [binary]   [binary]   "topicA"   0          345     1486087873
     [binary]   [binary]   "topicB"   3          2890    1486086721
  23. Transforming Data Cast binary value to string Name it column json 23 val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json").as("data")) .select("data.*")
  24. Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data 24 val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json").as("data")) .select("data.*") json { "timestamp": 1486087873, "device": "devA", …} { "timestamp": 1486082418, "device": "devX", …} data (nested) timestamp device … 1486087873 devA … 1486086721 devX … from_json("json") as "data"
  25. Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data Flatten the nested columns 25 val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json").as("data")) .select("data.*") data (nested) timestamp device … 1486087873 devA … 1486086721 devX … timestamp device … 1486087873 devA … 1486086721 devX … select("data.*") (not nested)
  26. Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data Flatten the nested columns 26 val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json").as("data")) .select("data.*") powerful built-in APIs to perform complex data transformations from_json, to_json, explode, ... 100s of functions
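     The slides abbreviate the parsing step as from_json("json"); the released function takes a column and an explicit schema. A minimal sketch with an assumed schema matching the sample records above:
     import org.apache.spark.sql.functions.{col, from_json}
     import org.apache.spark.sql.types._
     // assumed schema for { "timestamp": 1486087873, "device": "devA", ... }
     val jsonSchema = new StructType()
       .add("timestamp", LongType)          // epoch seconds in the sample data
       .add("device", StringType)
     val parsedData = rawData
       .selectExpr("cast (value as string) as json")
       .select(from_json(col("json"), jsonSchema).as("data"))
       .select("data.*")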
  27. Writing to Parquet table Save parsed data as Parquet table in the given path Partition files by date so that future queries on time slices of data are fast e.g. query on last 48 hours of data 27 val query = parsedData.writeStream .option("checkpointLocation", ...) .partitionBy("date") .format("parquet") .start("/parquetTable")
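     partitionBy("date") needs a date column in parsedData; a minimal sketch that derives it before writing, assuming epoch-second timestamps as in the sample data:
     import org.apache.spark.sql.functions.{col, from_unixtime, to_date}
     // derive a date column from the event timestamp for partitioning
     val withDate = parsedData
       .withColumn("date", to_date(from_unixtime(col("timestamp"))))
     val query = withDate.writeStream
       .option("checkpointLocation", "/checkpoints/etl")   // placeholder path
       .partitionBy("date")
       .format("parquet")
       .start("/parquetTable")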
  28. Checkpointing Enable checkpointing by setting the checkpoint location to save offset logs start actually starts a continuously running StreamingQuery in the Spark cluster 28 val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable/")
  29. Streaming Query query is a handle to the continuously running StreamingQuery Used to monitor and manage the execution 29 val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable/") process new data t = 1 t = 2 t = 3 process new data process new data StreamingQuery
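     A minimal sketch of the monitoring and management calls available on the handle (all methods of StreamingQuery):
     println(query.status)          // is a trigger active? waiting for new data?
     println(query.lastProgress)    // metrics of the last completed trigger: input/processing rates, durations, ...
     query.explain()                // print the incremental physical plan being executed
     query.stop()                   // stop the continuous execution
     // query.awaitTermination()    // or block the current thread until the query stops or fails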
  30. Data Consistency on Ad-hoc Queries Data available for complex, ad-hoc analytics within seconds Parquet table is updated atomically, ensures prefix integrity Even if distributed, ad-hoc queries will see either all updates from streaming query or none, read more in our blog https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html 30 seconds! complex, ad-hoc queries on latest data
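     A minimal sketch of the ad-hoc side: a plain batch query over the same Parquet path the streaming query keeps updating (path and columns as in the earlier slides):
     val recent = spark.read
       .parquet("/parquetTable")
       .where("date >= date_sub(current_date(), 2)")   // e.g. roughly the last 48 hours of partitions
       .groupBy("device")
       .count()
     recent.show()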
  31. Advanced Streaming Analytics 31
  32. Event time Aggregations Many use cases require aggregate statistics by event time E.g. what's the #errors in each system in the 1 hour windows? Many challenges Extracting event time from data, handling late, out-of-order data DStream APIs were insufficient for event-time stuff 32
  33. Event time Aggregations Windowing is just another type of grouping in Struct. Streaming number of records every hour Support UDAFs! 33 parsedData .groupBy(window("timestamp","1 hour")) .count() parsedData .groupBy( "device", window("timestamp","10 mins")) .avg("signal") avg signal strength of each device every 10 mins
  34. Stateful Processing for Aggregations Aggregates have to be saved as distributed state between triggers Each trigger reads previous state and writes updated state State stored in memory, backed by write ahead log in HDFS/S3 Fault-tolerant, exactly-once guarantee! 34 process new data t = 1 sink src t = 2 process new data sink src t = 3 process new data sink src state state write ahead log state updates are written to log for checkpointing state
  35. Watermarking and Late Data Watermark [Spark 2.1] boundary in event time trailing behind max observed event time Windows older than watermark automatically deleted to limit the amount of intermediate state 35 event time max event time watermark
  36. Watermarking and Late Data Gap is a configurable allowed lateness Data newer than watermark may be late, but allowed to aggregate Data older than watermark is "too late" and dropped, state removed 36 max event time event time watermark allowed lateness late data allowed to aggregate data too late, dropped
  37. Watermarking and Late Data 37 max event time event time watermark allowed lateness of 10 mins parsedData .withWatermark("timestamp", "10 minutes") .groupBy(window("timestamp","5 minutes")) .count() late data allowed to aggregate data too late, dropped
  38. Watermarking to Limit State [Spark 2.1] 38 parsedData .withWatermark("timestamp", "10 minutes") .groupBy(window("timestamp","5 minutes")) .count() [diagram: windowed counts plotted by event time vs. processing time] system tracks max observed event time (12:14); watermark updated to 12:14 - 10m = 12:04 for next trigger, state < 12:04 deleted; data at 12:08 is late, but considered in counts; data too late (older than the watermark) is ignored in counts, state dropped more details in online programming guide
  39. Arbitrary Stateful Operations [Spark 2.2] mapGroupsWithState allows any user-defined stateful ops to a user-defined state fault-tolerant, exactly-once supports type-safe langs Scala and Java 39 dataset .groupByKey(groupingFunc) .mapGroupsWithState(mappingFunc) def mappingFunc( key: K, values: Iterator[V], state: KeyedState[S]): U = { // update or remove state // return mapped value }
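     The slide shows the pre-release signature with KeyedState; in the released Spark 2.2 API the state handle is called GroupState. A minimal running-count sketch, where Event and the events Dataset are hypothetical:
     import org.apache.spark.sql.streaming.GroupState
     import spark.implicits._
     case class Event(device: String, signal: Double)   // hypothetical input type
     case class RunningCount(count: Long)                // user-defined state
     // mapping function: read previous state, update it, return a mapped value
     def updateCount(device: String, events: Iterator[Event],
                     state: GroupState[RunningCount]): (String, Long) = {
       val old = state.getOption.getOrElse(RunningCount(0L))
       val updated = RunningCount(old.count + events.size)
       state.update(updated)                             // state is checkpointed for fault tolerance
       (device, updated.count)
     }
     val counts = events                                 // events: Dataset[Event], assumed to exist
       .groupByKey(_.device)
       .mapGroupsWithState(updateCount _)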
  40. Many more updates! StreamingQueryListener [Spark 2.1] Receive regular progress heartbeats for health and perf monitoring Automatic in Databricks!! Kafka Batch Queries [Spark 2.2] Run batch queries on Kafka just like a file system Kafka Sink [Spark 2.2] Write to Kafka, can only give at-least-once guarantee as Kafka doesn't support transactional updates Update Output mode [Spark 2.2] Only updated rows in result table are written to sink 40
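     Minimal sketches of the Spark 2.2 additions listed above; broker addresses, topic names, and checkpoint paths are placeholders:
     // Kafka batch query: read a fixed offset range like any other batch source
     val batchDf = spark.read
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "topic")
       .option("startingOffsets", "earliest")
       .option("endingOffsets", "latest")
       .load()
     // Kafka sink with update output mode: only rows updated since the last trigger are written
     // (at-least-once, as noted above)
     val toKafka = parsedData
       .groupBy("device").count()
       .selectExpr("device AS key", "CAST(count AS STRING) AS value")
       .writeStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("topic", "device-counts")
       .option("checkpointLocation", "/checkpoints/kafka-sink")
       .outputMode("update")
       .start()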
  41. More Info Structured Streaming Programming Guide http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Databricks blog posts for more focused discussions https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html and more, stay tuned!! 41
  42. Comparison with Other Engines 42 Read the blog to understand this table
