
What's new with Apache Spark's Structured Streaming?

Overview of the changes from Apache Spark streaming dstreams to structured streaming.


  1. 1. What’s new with Apache Spark’s Structured Streaming? Miklos Christine 3/21/2017
  2. 2. $ whoami Solutions Architect @ Databricks • Apache Spark Advocate • Build and architect big data platforms for streaming and batch processing Previously: • Sales Engineer @ Cloudera • Software Engineer @ Cisco
  3. 3. building robust stream processing apps is hard
  4. 4. Complexities in stream processing Complex Data Diverse data formats (json, avro, binary, …) Data can be dirty, late, out-of-order Complex Systems Diverse storage systems and formats (SQL, NoSQL, parquet, ... ) System failures Complex Workloads Event time processing Combining streaming with interactive queries, machine learning
  5. 5. Spark Streaming 1.x APIs (DStreams) Difficulties: • Separate StreamingContext() in addition to the SparkContext • Additional library packages and dependencies • Kafka / Kinesis connector libraries • State management / window functions • Serialization issues & code changes on upgrades • Only 1 StreamingContext() per application
  6. 6. Spark Streaming 1.x (aka DStreams) // Function to create a new StreamingContext and set it up def creatingFunc(): StreamingContext = { // Create a StreamingContext val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds)) // Get the input stream from the source val topic1_Stream = createKafkaStream(ssc, kafkaTopic1, kafkaBrokers) // … … … logic // To make sure data is not deleted by the time we query it interactively ssc.remember(Minutes(1)) println("Creating function called to create new StreamingContext") ssc }
  7. 7. Spark Streaming 1.x (aka DStreams) // Get or create a streaming context. val ssc = StreamingContext.getActiveOrCreate(creatingFunc) // This starts the streaming context in the background. ssc.start() // This is to ensure that we wait for some time before the background streaming job starts. This will put this cell on hold for 5 times the batchIntervalSeconds. ssc.awaitTerminationOrTimeout(batchIntervalSeconds * 5 * 1000)
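     The createKafkaStream() helper used on slide 6 is not defined anywhere in the deck. A minimal sketch of what such a helper might look like, assuming the spark-streaming-kafka-0-10 direct stream API and illustrative broker/group settings:

        import org.apache.kafka.common.serialization.StringDeserializer
        import org.apache.spark.streaming.StreamingContext
        import org.apache.spark.streaming.dstream.DStream
        import org.apache.spark.streaming.kafka010._

        // Hypothetical helper: subscribes to one topic and returns a DStream of record values.
        def createKafkaStream(ssc: StreamingContext, topic: String, brokers: String): DStream[String] = {
          val kafkaParams = Map[String, Object](
            "bootstrap.servers" -> brokers,
            "key.deserializer" -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            "group.id" -> "example-group",       // illustrative consumer group
            "auto.offset.reset" -> "latest"
          )
          KafkaUtils.createDirectStream[String, String](
            ssc,
            LocationStrategies.PreferConsistent,
            ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams)
          ).map(_.value)
        }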
  8. 8. Structured Streaming stream processing on Spark SQL engine fast, scalable, fault-tolerant rich, unified, high level APIs deal with complex data and complex workloads rich ecosystem of data sources integrate with many storage systems
  9. 9. you should not have to reason about streaming
  10. 10. you should write simple batch queries & Spark should automatically streamify them
  11. 11. Treat Streams as Unbounded Tables A data stream is treated as an unbounded input table: new data arriving in the stream = new rows appended to an unbounded table
  12. 12. New Model Input: data from the source as an append-only table. Trigger: how frequently to check the input for new data (e.g. every 1 sec). Query: operations on the input, the usual map/filter/reduce plus new window and session ops. [Diagram: the input table grows with data up to t = 1, t = 2, t = 3.]
  13. 13. New Model Result: the final operated table, updated after every trigger. Output: what part of the result to write to storage after every trigger. Complete output mode: write the full result table every time. [Diagram: at t = 1, 2, 3 the query produces the result up to that trigger and writes all rows of the result table to storage.]
  14. 14. New Model Result: the final operated table, updated after every trigger. Output: what part of the result to write to storage after every trigger. Complete output mode: write the full result table every time. Append output mode: write only the new rows added to the result table since the previous trigger. [Diagram: at t = 1, 2, 3 only the rows new since the last trigger are written to storage.]
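     A minimal sketch of how the trigger and output modes above map onto the DataFrame API. Here counts is assumed to be an aggregated streaming DataFrame, result is the filtered stream from slide 17, the paths are illustrative, and Trigger.ProcessingTime is the Spark 2.2 spelling of the trigger API:

        import org.apache.spark.sql.streaming.Trigger

        // Complete mode: rewrite the full result table after every trigger.
        counts.writeStream
          .outputMode("complete")
          .trigger(Trigger.ProcessingTime("1 second"))      // check the input every 1 sec
          .format("console")
          .start()

        // Append mode: write only the rows added to the result since the last trigger.
        result.writeStream
          .outputMode("append")
          .format("parquet")
          .option("checkpointLocation", "/tmp/checkpoint")  // illustrative path
          .start("/tmp/dest")                               // illustrative path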
  15. 15. static data = bounded table streaming data = unbounded table API - Dataset/DataFrame Single API !
  16. 16. Batch Queries with DataFrames input = spark.read .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.write .format("parquet") .save("dest-path") Read from Json file Select some devices Write to parquet file
  17. 17. Streaming Queries with DataFrames input = spark.readStream .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.writeStream .format("parquet") .start("dest-path") Read from Json file stream Replace read with readStream Select some devices Code does not change Write to Parquet file stream Replace save() with start()
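     Slides 16-17 show the Python-style API; the same streaming query in Scala would look roughly like the sketch below. Note that, unlike the batch reader, the streaming file source needs an explicit schema, so jsonSchema here is an assumed StructType for the device data:

        val streamInput = spark.readStream
          .schema(jsonSchema)          // streaming file sources require a user-specified schema
          .format("json")
          .load("source-path")

        val streamResult = streamInput
          .select("device", "signal")
          .where("signal > 15")

        streamResult.writeStream
          .format("parquet")
          .option("checkpointLocation", "dest-path/_checkpoint")  // illustrative location
          .start("dest-path")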
  18. 18. DataFrames, Datasets, SQL Spark automatically streamifies! Spark SQL converts the batch-like query (logical plan: streaming source -> project device, signal -> filter signal > 15 -> streaming sink) into a series of incremental execution plans operating on new batches of data: process new files at t = 1, t = 2, t = 3. input = spark.readStream .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.writeStream .format("parquet") .start("dest-path")
  19. 19. Fault-tolerance with Checkpointing Checkpointing: metadata (e.g. offsets) of the current batch is stored in a write-ahead log in HDFS/S3, and the query can be restarted from that log. Streaming sources can replay the exact data range in case of failure. Streaming sinks can dedup reprocessed data when writing, making them idempotent by design. Together this gives end-to-end exactly-once guarantees. [Diagram: process new files at t = 1, 2, 3, with each batch recorded in the write-ahead log.]
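     A small sketch of the restart behavior described here: as long as a restarted query is given the same checkpointLocation, it resumes from the offsets recorded in the write-ahead log (the paths are illustrative):

        // First run: offsets for each micro-batch are committed under the checkpoint location.
        val query = parsedData.writeStream
          .format("parquet")
          .option("checkpointLocation", "/checkpoints/etl")   // illustrative path
          .start("/tables/events")                            // illustrative path

        // After a crash or query.stop(), starting the same query again with the same
        // checkpointLocation replays any uncommitted range from the source, so the
        // Parquet output stays end-to-end exactly-once.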
  20. 20. Complex Streaming ETL
  21. 21. Traditional ETL Raw, dirty, un/semi-structured data is dumped as files Periodic jobs run every few hours to convert the raw data into structured tables ready for further analytics [Diagram: file dump arrives within seconds, but the structured table is only ready hours later]
  22. 22. Traditional ETL Hours of delay before decisions can be made on the latest data Unacceptable when time is of the essence [intrusion detection, anomaly detection, etc.]
  23. 23. Streaming ETL w/ Structured Streaming Structured Streaming enables raw data to be available as structured data as soon as possible (within seconds, not hours)
  24. 24. Streaming ETL w/ Structured Streaming Example - Json data being received in Kafka - Parse nested json and flatten it - Store in structured Parquet table - Get end-to-end failure guarantees val rawData = spark.readStream .format("kafka") .option("subscribe", "topic") .option("kafka.bootstrap.servers", ...) .load() val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json").as("data")) .select("data.*") val query = parsedData.writeStream .option("checkpointLocation", "/checkpoint") .partitionBy("date") .format("parquet")
  25. 25. Reading from Kafka [Spark 2.1] Support Kafka 0.10.0.1 Specify options to configure How? kafka.bootstrap.servers => broker1 What? subscribe => topic1,topic2,topic3 // fixed list of topics subscribePattern => topic* // dynamic list of topics assign => {"topicA":[0,1]} // specific partitions Where? startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}} val rawData = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load()
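     A sketch combining the options listed on this slide; the broker addresses, topic pattern, and offsets are illustrative values:

        val rawData = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // illustrative brokers
          .option("subscribePattern", "topic.*")                          // dynamic list of topics (Java regex)
          .option("startingOffsets", "earliest")                          // or latest / per-partition JSON
          .load()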
  26. 26. Reading from Kafka The rawData DataFrame has the columns key [binary], value [binary], topic, partition, offset, timestamp, with one row per Kafka record (e.g. "topicA", partition 0, offset 345, timestamp 1486087873) val rawData = spark.readStream .format("kafka") .option("subscribe", "topic") .option("kafka.bootstrap.servers", ...) .load()
  27. 27. Transforming Data Cast binary value to string Name it column json val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json").as("data")) .select("data.*")
  28. 28. Transforming Data val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json").as("data")) .select("data.*") Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data json { "timestamp": 1486087873, "device": "devA", …} { "timestamp": 1486082418, "device": "devX", …} data (nested) timestamp device … 1486087873 devA … 1486086721 devX … from_json("json") as "data"
  29. 29. Transforming Data val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json").as("data")) .select("data.*") data (nested) timestamp device … 1486087873 devA … 1486086721 devX … timestamp device … 1486087873 devA … 1486086721 devX … select("data.*") (not nested) Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data Flatten the nested columns
  30. 30. Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data Flatten the nested columns powerful built-in APIs to perform complex data transformations from_json, to_json, explode, ... 100s of functions (see our blog post) val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json").as("data")) .select("data.*")
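     The slides elide the schema argument, but from_json takes the column to parse plus a schema. A minimal sketch with an assumed schema for the device events shown on slide 28:

        import org.apache.spark.sql.functions.{col, from_json}
        import org.apache.spark.sql.types._

        // Assumed schema for the JSON payload from the slides.
        val jsonSchema = new StructType()
          .add("timestamp", LongType)
          .add("device", StringType)
          .add("signal", IntegerType)

        val parsedData = rawData
          .selectExpr("cast(value as string) as json")           // binary Kafka value -> string
          .select(from_json(col("json"), jsonSchema).as("data")) // parse into a nested struct
          .select("data.*")                                      // flatten the nested columns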
  31. 31. Writing to Parquet table Save the parsed data as a Parquet table in the given path Partition files by date so that future queries on time slices of the data are fast, e.g. a query on the last 48 hours of data val query = parsedData.writeStream .option("checkpointLocation", ...) .partitionBy("date") .format("parquet") .start("/parquetTable")
  32. 32. Checkpointing Enable checkpointing by setting the checkpoint location, where the offset logs are saved start() actually starts a continuously running StreamingQuery in the Spark cluster val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable/")
  33. 33. Streaming Query query is a handle to the continuously running StreamingQuery, used to monitor and manage the execution val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable/") [Diagram: the query processes new data at t = 1, 2, 3]
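     A few of the monitoring and management calls available on the returned handle (all part of the StreamingQuery API):

        query.id                  // unique identifier of this query
        query.status              // is a trigger active, is data available to process
        query.lastProgress        // input rate, processing rate, batch durations, source offsets
        query.recentProgress      // the last few progress updates
        query.awaitTermination()  // block this thread until the query stops or fails
        query.stop()              // stop the continuous execution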
  34. 34. Data Consistency on Ad-hoc Queries Data is available for complex, ad-hoc analytics within seconds The Parquet table is updated atomically, ensuring prefix integrity Even if distributed, ad-hoc queries will see either all updates from the streaming query or none; read more in our blog https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
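     A sketch of the ad-hoc side: while the streaming query keeps appending, a separate batch query can read the same Parquet table at any time. The table path comes from the earlier slides; the 48-hour filter is illustrative:

        // Runs in another notebook cell or application while the stream is writing.
        val events = spark.read.parquet("/parquetTable")
        events
          .where("date >= date_sub(current_date(), 2)")  // e.g. roughly the last 48 hours of partitions
          .groupBy("device")
          .count()
          .show()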
  35. 35. Advanced Streaming Analytics
  36. 36. Event time Aggregations Many use cases require aggregate statistics by event time E.g. what's the number of errors in each system in 1-hour windows? Many challenges: extracting event time from the data, handling late and out-of-order data The DStream APIs were insufficient for event-time processing
  37. 37. Event time Aggregations Windowing is just another type of grouping in Structured Streaming, and UDAFs are supported! Number of records every hour: parsedData .groupBy(window("timestamp","1 hour")) .count() Avg signal strength of each device every 10 mins: parsedData .groupBy( "device", window("timestamp","10 mins")) .avg("signal")
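     A sketch of running the hourly count as a full query. The cast assumes timestamp arrives as epoch seconds, the console sink is illustrative, and complete output mode is used because this aggregation has no watermark:

        import org.apache.spark.sql.functions.{col, window}

        val hourlyCounts = parsedData
          .withColumn("eventTime", col("timestamp").cast("timestamp"))   // epoch seconds -> timestamp
          .groupBy(window(col("eventTime"), "1 hour"))                   // tumbling 1-hour event-time windows
          .count()

        hourlyCounts.writeStream
          .outputMode("complete")   // re-emit the whole windowed result every trigger
          .format("console")
          .start()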
  38. 38. Stateful Processing for Aggregations Aggregates have to be saved as distributed state between triggers Each trigger reads the previous state and writes updated state State is stored in memory, backed by a write-ahead log in HDFS/S3 Fault-tolerant, exactly-once guarantee! [Diagram: at t = 1, 2, 3 each micro-batch reads from the source, updates the state, and writes to the sink; state updates are written to the write-ahead log for checkpointing]
  39. 39. Watermarking and Late Data Watermark [Spark 2.1]: a threshold on how late an event is expected to be in event time It trails behind the max seen event time, and the trailing gap is configurable [Diagram: with a max event time of 12:30 PM and a trailing gap of 10 mins, the watermark is 12:20 PM; data older than the watermark is not expected]
  40. 40. Watermarking and Late Data Data newer than the watermark may be late, but is still allowed to aggregate Data older than the watermark is "too late" and dropped Windows older than the watermark are automatically deleted to limit the amount of intermediate state [Diagram: late data between the watermark and the max event time is aggregated; data older than the watermark is dropped]
  41. 41. Watermarking and Late Data parsedData .withWatermark("timestamp", "10 minutes") .groupBy(window("timestamp","5 minutes")) .count() Allowed lateness of 10 mins: late data newer than the watermark is allowed to aggregate, data older than the watermark is not expected [Diagram: event-time axis showing the max event time and the trailing watermark]
  42. 42. Watermarking to Limit State [Spark 2.1] parsedData .withWatermark("timestamp", "10 minutes") .groupBy(window("timestamp","5 minutes")) .count() The system tracks the max observed event time; the watermark is updated to max event time - 10 min for the next trigger (e.g. a max of 12:14 gives a watermark of 12:04), and state for windows older than the watermark is deleted. Data that is late but newer than the watermark is still counted; data older than the watermark is ignored and its state dropped. More details in the online programming guide. [Diagram: processing time vs. event time, showing late records and the advancing watermark]
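     A sketch of the watermarked aggregation as a full query writing to Parquet. The cast and paths are illustrative; in append mode a window is written out only once the watermark passes the end of that window:

        import org.apache.spark.sql.functions.{col, window}

        val windowedCounts = parsedData
          .withColumn("eventTime", col("timestamp").cast("timestamp"))  // epoch seconds -> timestamp
          .withWatermark("eventTime", "10 minutes")                     // tolerate 10 minutes of lateness
          .groupBy(window(col("eventTime"), "5 minutes"))
          .count()

        windowedCounts.writeStream
          .outputMode("append")                                         // emit each window once it is finalized
          .format("parquet")
          .option("checkpointLocation", "/tmp/watermark-checkpoint")    // illustrative path
          .start("/tmp/windowed-counts")                                // illustrative path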
  43. 43. Arbitrary Stateful Operations [Spark 2.2] mapGroupsWithState allows applying any user-defined stateful operation to a user-defined state -- supports timeouts -- fault-tolerant, exactly-once -- supports Scala and Java dataset .groupByKey(groupingFunc) .mapGroupsWithState(mappingFunc) def mappingFunc( key: K, values: Iterator[V], state: KeyedState[S]): U = { // update or remove state // set timeouts // return mapped value }
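     A fleshed-out sketch of the skeleton above, written against the released Spark 2.2 API where the state class is named GroupState rather than KeyedState. The DeviceEvent/RunningAvg case classes and the deviceEvents Dataset are assumptions for illustration, and spark.implicits._ is assumed to be imported for the encoders:

        import org.apache.spark.sql.streaming.GroupState

        case class DeviceEvent(device: String, signal: Int)               // assumed input type
        case class RunningAvg(device: String, count: Long, avgSignal: Double)

        def mappingFunc(device: String,
                        events: Iterator[DeviceEvent],
                        state: GroupState[(Long, Double)]): RunningAvg = {
          val (oldCount, oldSum) = state.getOption.getOrElse((0L, 0.0))   // previous state, if any
          val batch = events.toSeq
          val newCount = oldCount + batch.size
          val newSum = oldSum + batch.map(_.signal.toDouble).sum
          state.update((newCount, newSum))                                // carry totals to the next trigger
          RunningAvg(device, newCount, if (newCount == 0) 0.0 else newSum / newCount)
        }

        // deviceEvents: Dataset[DeviceEvent] parsed from the stream (assumed)
        val runningAvgs = deviceEvents
          .groupByKey(_.device)
          .mapGroupsWithState(mappingFunc _)
        // When written out with writeStream, this requires outputMode("update").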
  44. 44. Many more updates! StreamingQueryListener [Spark 2.1] Receive regular progress heartbeats for health and perf monitoring Automatic in Databricks, come to the Databricks booth for a demo!! Kafka Batch Queries [Spark 2.2] Run batch queries on Kafka just like a file system Kafka Sink [Spark 2.2] Write to Kafka, can only give at-least-once guarantees as Kafka doesn't support transactional updates Kinesis Source Read from Amazon Kinesis
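     A minimal sketch of attaching a StreamingQueryListener to receive those progress heartbeats; the println handlers are placeholders for real monitoring code:

        import org.apache.spark.sql.streaming.StreamingQueryListener
        import org.apache.spark.sql.streaming.StreamingQueryListener._

        spark.streams.addListener(new StreamingQueryListener {
          override def onQueryStarted(event: QueryStartedEvent): Unit =
            println("query started")
          override def onQueryProgress(event: QueryProgressEvent): Unit =
            println(event.progress.json)         // rates, durations, and offsets as JSON
          override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
            println("query terminated")
        })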
  45. 45. More Info Structured Streaming Programming Guide http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Databricks blog posts for more focused discussions https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html and more to come, stay tuned!!
  46. 46. GET TICKETS NOW!! EARLY BIRD PRICING!!
  47. 47. Need time to convince your manager? Spark Summit Code: ChicagoMU Good for 15% off starting 4/8.
  48. 48. Comparison with Other Engines Read our blog to understand this table
  49. 49. Thank you!!
