
A Deep Dive into Structured Streaming in Apache Spark

This presentation was given at the Apache Spark Meetup in Milano by Databricks software engineer and Apache Spark contributor Burak Yavuz. It covers how to write end-to-end, fault-tolerant continuous applications using the Structured Streaming APIs available in Apache Spark 2.x.


  1. 1. A Deep Dive into Structured Streaming. Burak Yavuz, Spark Meetup @ Milano, 18 January 2017
  2. 2. Streaming in Apache Spark: Spark Streaming changed how people write streaming apps. [Spark stack: SQL, Streaming, MLlib, GraphX on Spark Core.] Functional, concise and expressive. Fault-tolerant state management. Unified stack with batch processing. More than 50% of users consider it the most important part of Apache Spark.
  3. 3. Streaming apps are growing more complex 3
  4. 4. Streaming computations don’t run in isolation Need to interact with batch data, interactive analysis, machine learning, etc.
  5. 5. Use case: IoT Device Monitoring. IoT events from Kafka. ETL into long-term storage: prevent data loss, prevent duplicates. Status monitoring: handle late data, aggregate on windows on event time. Interactively debug issues: consistency with the event stream. Anomaly detection: learn models offline, use online + continuous learning.
  6. 6. Use case: IoT Device Monitoring (continued). Taken together, these requirements form a Continuous Application, not just streaming any more.
  7. 7. Pain points with DStreams: 1. Processing with event time and dealing with late data (the DStream API exposes batch time, making event time hard to incorporate). 2. Interoperating streaming with batch AND interactive (the RDD and DStream APIs are similar, but still require translation). 3. Reasoning about end-to-end guarantees (requires carefully constructing sinks that handle failures correctly, and keeping data in the storage consistent while it is being updated).
  8. 8. Structured Streaming
  9. 9. The simplest way to perform streaming analytics
 is not having to reason about streaming at all
  10. 10. New Model. Trigger: every 1 sec. [Diagram: at triggers 1, 2, 3 the input table holds data up to 1, 2, 3, and the query runs over it.] Input: data from the source as an append-only table. Trigger: how frequently to check the input for new data. Query: operations on the input; the usual map/filter/reduce plus new window and session ops.
  11. 11. New Model. Trigger: every 1 sec. [Diagram: each trigger produces the result for data up to 1, 2, 3.] Result: the final operated table, updated at every trigger interval. Output: what part of the result to write to the data sink after every trigger. Complete output mode: write the full result table every time (output all the rows in the result table).
  12. 12. New Model. Trigger: every 1 sec. [Diagram: in append mode, only the new result rows from each trigger are emitted.] Result: the final operated table, updated at every trigger interval. Output: what part of the result to write to the data sink after every trigger. Complete output mode: write the full result table every time. Append output mode: write only the new rows added to the result table since the previous batch. *Not all output modes are feasible with all queries.
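To make the model concrete, here is a minimal PySpark sketch (not from the slides) that runs one non-aggregating query in append mode and one aggregating query in complete mode on a one-second trigger; the JSON source path and schema are assumptions chosen to mirror the deck's later examples.

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("output-modes-sketch").getOrCreate()

# Streaming file sources need an explicit schema.
schema = StructType([
    StructField("device", StringType()),
    StructField("device_type", StringType()),
    StructField("signal", DoubleType()),
])
input = spark.readStream.schema(schema).json("source-path")

# Append mode: only rows added to the result since the last trigger are written.
filtered = input.select("device", "signal").where("signal > 15")
filtered.writeStream.outputMode("append").format("console") \
    .trigger(processingTime="1 second").start()

# Complete mode: the whole result table is rewritten at every trigger.
averaged = input.groupBy("device_type").agg(avg("signal"))
averaged.writeStream.outputMode("complete").format("console") \
    .trigger(processingTime="1 second").start()

spark.streams.awaitAnyTermination()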
  13. 13. Static, bounded data and streaming, unbounded data: a single API, the Dataset/DataFrame.
  14. 14. Dataset: a distributed, bounded or unbounded collection of data with a known schema. A high-level abstraction for selecting, filtering, mapping, reducing, and aggregating structured data. Common API across Scala, Java, Python, R.
  15. 15. Datasets with SQL. Standards based: supports the most popular constructs in HiveQL and ANSI SQL. Compatible: use JDBC to connect with popular BI tools. Example: SELECT type, avg(signal) FROM devices GROUP BY type
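As a small illustration of the SQL example above, a hedged PySpark sketch; the JSON file path and the temporary view name "devices" are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register device data (hypothetical file) as a view and query it with SQL.
spark.read.json("device-data.json").createOrReplaceTempView("devices")
spark.sql("SELECT type, avg(signal) FROM devices GROUP BY type").show()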
  16. 16. Dynamic Datasets (DataFrames). Concise: great for ad-hoc interactive analysis. Interoperable: based on R / pandas, easy to go back and forth. Example: spark.table("device-data") .groupBy("type") .agg(avg("signal")) .map(lambda …) .collect()
  17. 17. Statically-typed Datasets. No boilerplate: automatically convert to/from domain objects. Safe: compile-time checks for correctness.
case class DeviceData(`type`: String, signal: Int)
// Convert data to Java objects
val ds: Dataset[DeviceData] = spark.read.json("data.json").as[DeviceData]
// Compute a histogram of signal by device type.
val hist = ds.groupByKey(_.`type`).mapGroups { (deviceType, data) =>
  val buckets = new Array[Int](10)
  data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
  (deviceType, buckets)
}
  18. 18. Unified, Structured APIs in Spark 2.0. When errors are caught: SQL reports syntax errors and analysis errors at runtime; DataFrames report syntax errors at compile time and analysis errors at runtime; Datasets report both syntax and analysis errors at compile time.
  19. 19. Batch ETL with DataFrames
input = spark.read .format("json") .load("source-path")
result = input .select("device", "signal") .where("signal > 15")
result.write .format("parquet") .save("dest-path")
Read from JSON file. Select some devices. Write to Parquet file.
  20. 20. Streaming ETL with DataFrames
input = spark.readStream .format("json") .load("source-path")
result = input .select("device", "signal") .where("signal > 15")
result.writeStream .format("parquet") .start("dest-path")
Read from JSON file stream: replace read with readStream. Select some devices: this code does not change. Write to Parquet file stream: replace save() with start().
  21. 21. Streaming ETL with DataFrames readStream…load() creates a streaming DataFrame, does not start any of the computation writeStream…start() defines where & how to output the data and starts the processing input = spark.readStream .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.writeStream .format("parquet") .start("dest-path") 

  22. 22. Streaming ETL with DataFrames. [Diagram: at triggers 1, 2, 3, only the new rows in the result (append mode) are written out.]
input = spark.readStream .format("json") .load("source-path")
result = input .select("device", "signal") .where("signal > 15")
result.writeStream .format("parquet") .start("dest-path")

  23. 23. Demo
  24. 24. Continuous Aggregations. Continuously compute the average signal across all devices: input.avg("signal"). Continuously compute the average signal of each type of device: input.groupBy("device_type") .avg("signal"). A runnable version is sketched below.
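The two snippets above are shorthand; a fuller PySpark reading might look like the sketch below, where the JSON source path and the column names (device_type, signal) are assumptions mirroring the deck's earlier examples.

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
input = spark.readStream.schema("device_type STRING, signal DOUBLE").json("source-path")

# Average signal across all devices (one global aggregate).
overall = input.select(avg("signal").alias("avg_signal"))

# Average signal for each type of device.
per_type = input.groupBy("device_type").avg("signal")

# Streaming aggregations are emitted here with the complete output mode.
overall.writeStream.outputMode("complete").format("console").start()
per_type.writeStream.outputMode("complete").format("console").start()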
  25. 25. Continuous Windowed Aggregations. input.groupBy( $"device_type", window($"event_time", "10 min")) .avg("signal") Continuously compute the average signal of each type of device over the last 10 minutes, using the event_time column. Simplifies event-time stream processing (not possible in DStreams). Works on both streaming and batch jobs.
  26. 26. Watermarking for handling late data [Spark 2.1]. input.withWatermark("event_time", "15 min") .groupBy( $"device_type", window($"event_time", "10 min")) .avg("signal") Watermark = a threshold on event time for classifying data as "too late". Continuously compute the 10-minute average while allowing data to be up to 15 minutes late.
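A hedged PySpark sketch of the watermarked, windowed aggregation above; the source path, schema, and console sink are assumptions, while the column names and intervals follow the slide.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg

spark = SparkSession.builder.getOrCreate()
input = (spark.readStream
    .schema("device_type STRING, signal DOUBLE, event_time TIMESTAMP")
    .json("source-path"))

# 10-minute averages per device type, tolerating data up to 15 minutes late.
windowed = (input
    .withWatermark("event_time", "15 minutes")
    .groupBy("device_type", window("event_time", "10 minutes"))
    .avg("signal"))

# With a watermark, append mode can emit each window once it is finalized.
windowed.writeStream.outputMode("append").format("console").start()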
  27. 27. Joining streams with static data kafkaDataset = spark.readStream  .format("kafka")  .option("subscribe", "iot-updates")  .option("kafka.bootstrap.servers", ...) .load() staticDataset = spark.read  .jdbc("jdbc://", "iot-device-info") joinedDataset =  kafkaDataset.join(staticDataset, "device_type") 27 Join streaming data from Kafka with static data via JDBC to enrich the streaming data … … without having to think that you are joining streaming data
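In practice the Kafka value column arrives as bytes, so a fuller version of the join would parse it before enriching; in the sketch below the event schema, broker address, JDBC URL, table name, and credentials are all hypothetical, loosely following the slide.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Streaming side: IoT events from Kafka, JSON-encoded in the value column.
event_schema = StructType([
    StructField("device_type", StringType()),
    StructField("signal", DoubleType()),
    StructField("event_time", TimestampType()),
])
kafka = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "iot-updates")
    .load())
events = (kafka
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

# Static side: device metadata read once over JDBC.
device_info = spark.read.jdbc(
    "jdbc:postgresql://db-host:5432/iot", "iot_device_info",
    properties={"user": "iot", "password": "secret"})

# Enrich the stream; Spark plans this like any other join.
enriched = events.join(device_info, "device_type")
enriched.writeStream.format("console").start()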
  28. 28. Output Modes Defines what is outputted every time there is a trigger Different output modes make sense for different queries 28 input.select("device", "signal") .writeStream .outputMode("append") .format("parquet") .start("dest-path") Append mode with non-aggregation queries input.agg(count("*")) .writeStream .outputMode("complete") .format("memory") .queryName("my_table") .start() Complete mode with aggregation queries
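One reason to pair complete mode with the memory sink, as in the second snippet, is that the named result table can then be queried interactively from the same session; a small hedged example, reusing the query name from the slide:

# While the streaming aggregation named "my_table" is running,
# its in-memory result table can be queried like any other table.
spark.sql("SELECT * FROM my_table").show()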
  29. 29. Query Management query = result.writeStream .format("parquet") .outputMode("append") .start("dest-path") query.stop() query.awaitTermination() query.exception() query.status() query.recentProgress() 29 query: a handle to the running streaming computation for managing it - Stop it, wait for it to terminate - Get status - Get error, if terminated Multiple queries can be active at the same time Can ask for current status, and metrics
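In the Python API the same handle exposes status and recentProgress as properties rather than methods; a short sketch (the checkpoint and output paths are hypothetical, and `result` is the DataFrame from the earlier examples):

query = (result.writeStream
    .format("parquet")
    .outputMode("append")
    .option("checkpointLocation", "checkpoint-path")  # hypothetical path
    .start("dest-path"))

print(query.status)           # current state, e.g. waiting for the next trigger
print(query.recentProgress)   # metrics from recent triggers
if query.exception() is not None:
    print(query.exception())  # the error, if the query terminated with one
query.stop()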
  30. 30. Query Execution. Logically: Dataset operations on a table (i.e. as easy to understand as batch). Physically: Spark automatically runs the query in a streaming fashion (i.e. incrementally and continuously). DataFrame → Logical Plan → continuous, incremental execution via the Catalyst optimizer.
  31. 31. Structured Streaming High-level streaming API built on Datasets/DataFrames Event time, windowing, sessions, sources & sinks End-to-end exactly once semantics Unifies streaming, interactive and batch queries Aggregate data in a stream, then serve using JDBC Add, remove, change queries at runtime Build and apply ML models
  32. 32. What can you do with this that’s hard with other engines? True unification Combines batch + streaming + interactive workloads Same code + same super-optimized engine for everything Flexible API tightly integrated with the engine Choose your own tool - Dataset/DataFrame/SQL Greater debuggability and performance Benefits of Spark in-memory computing, elastic scaling, fault-tolerance, straggler mitigation, …
  33. 33. Underneath the Hood
  34. 34. Batch Execution on Spark SQL. DataFrame/Dataset → Logical Plan: an abstract representation of the query.
  35. 35. Batch Execution on Spark SQL. The Planner takes a SQL AST, DataFrame, or Dataset through: Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs. A helluva lot of magic!
  36. 36. Batch Execution on Spark SQL. DataFrame/Dataset → Logical Plan → Planner → Execution Plan: run super-optimized Spark jobs to compute results. Code optimizations: bytecode generation, JVM intrinsics, vectorization, operations on serialized data. Memory optimizations: compact and fast encoding, off-heap memory. Project Tungsten, Phases 1 and 2.
  37. 37. Continuous Incremental Execution. The planner knows how to convert streaming logical plans into a continuous series of incremental execution plans, each processing the next chunk of streaming data. DataFrame/Dataset → Logical Plan → Planner → Incremental Execution Plans 1, 2, 3, 4, …
  38. 38. Continuous Incremental Execution. The planner polls for new data from the sources, then incrementally executes it and writes to the sink. Incremental Execution 1: offsets [19-105], count 87. Incremental Execution 2: offsets [106-197], count 92.
  39. 39. Continuous Aggregations. Maintain the running aggregate as in-memory state, backed by a WAL in the file system for fault tolerance. State data is generated and used across incremental executions. Incremental Execution 1: offsets [19-105], state 87, running count 87. Incremental Execution 2: offsets [106-197], state 179, running count 87 + 92 = 179.
  40. 40. Fault-tolerance. All data and metadata in the system needs to be recoverable / replayable. [Diagram: planner, source, state, and sink across Incremental Executions 1 and 2.]
  41. 41. Fault-tolerance: Fault-tolerant Planner. Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS. Offsets are written to the fault-tolerant WAL before execution.
  42. 42. Fault-tolerance: Fault-tolerant Planner. Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS. A failed planner fails the current execution.
  43. 43. Fault-tolerance: Fault-tolerant Planner. Reads the log to recover from failures and re-execute the exact range of offsets. On restart, offsets are read back from the WAL and the same executions are regenerated from those offsets.
  44. 44. Fault-tolerance: Fault-tolerant Sources. Structured Streaming sources are replayable by design (e.g. Kafka, Kinesis, files) and generate exactly the same data given the offsets recovered by the planner.
  45. 45. Fault-tolerance: Fault-tolerant State. Intermediate "state data" is maintained in versioned, key-value maps in Spark workers, backed by HDFS. The planner makes sure the "correct version" of the state is used to re-execute after a failure. State is fault-tolerant with a WAL.
  46. 46. Fault-tolerance: Fault-tolerant Sink. Sinks are idempotent by design and handle re-executions to avoid double committing the output.
  47. 47. 47 offset tracking in WAL + state management + fault-tolerant sources and sinks = end-to-end exactly-once guarantees
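From the user's side, this recovery machinery is driven by a single option: give the query a checkpoint location, and restarting the same code with the same location resumes from the offsets and state recorded there. A minimal sketch, with hypothetical paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
input = spark.readStream.schema("device STRING, signal DOUBLE").json("source-path")
result = input.select("device", "signal").where("signal > 15")

# The checkpoint location holds the offset WAL and any state; rerunning this
# exact code after a failure picks up where the previous run left off.
query = (result.writeStream
    .format("parquet")
    .outputMode("append")
    .option("checkpointLocation", "hdfs://namenode/checkpoints/iot-etl")
    .start("dest-path"))
query.awaitTermination()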
  48. 48. 48 Fast, fault-tolerant, exactly-once stateful stream processing without having to reason about streaming
  49. 49. Blogs for Deeper Technical Dive. Blog 1: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html Blog 2: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  50. 50. Comparison with Other Engines. Read the blog to understand this table.
  51. 51. Previous Status: Spark 2.0.2 Basic infrastructure and API - Event time, windows, aggregations - Append and Complete output modes - Support for a subset of batch queries Source and sink - Sources: Files, Kafka - Sinks: Files, in-memory table, unmanaged sinks (foreach) Experimental release to set the future direction Not ready for production but good to experiment with and provide feedback
  52. 52. Current Status: Spark 2.1.0. Event-time watermarks and old-state eviction. Status monitoring and metrics: current status and metrics (processing rates, state data size, etc.); Codahale/Dropwizard metrics (reporting to Ganglia, Graphite, etc.). Stable, future-proof checkpoint format for upgrading code across Spark versions. Many, many, many stability and performance improvements.
  53. 53. Future Direction Stability, stability, stability Support for more queries Sessionization More output modes Support for more sinks ML integrations Make Structured Streaming ready for production workloads by Spark 2.2/2.3
  54. 54. https://spark-summit.org/east-2017/
