
Performant Streaming in Production: Preventing Common Pitfalls when Productionizing Streaming Jobs



Running a stream in a development environment is relatively easy. In production, however, several aspects can cause serious issues if they are not addressed properly.



  1. 1. Performant Streaming in Production: Part 1. Max Thöne, Resident Solutions Architect; Stefan van Wouw, Sr. Resident Solutions Architect
  2. 2. Notebooks ▪ To explore the demos we have shown, find the link to the notebooks here
  3. 3. About the Speakers: Stefan van Wouw, Sr. Resident Solutions Architect, Databricks; Max Thöne, Resident Solutions Architect, Databricks
  4. 4. This talk. Part 1: Introduction (what parts of a stream should be tuned); Input Parameters (finding the optimal mini-batch size). Part 2: State Parameters (limiting the state dimension); Output Parameters (do not be a bully to downstream jobs); Deployment (considerations after deploying to PROD)
  5. 5. Introduction
  6. 6. Suppose we have a stream set up like this. STRUCTURED STREAMING: message-source based stream. spark.readStream .format("kafka") .option("kafka.bootstrap.servers", "...") .option("subscribe", "topic") .load() .selectExpr("cast(value as string) as json") .select(from_json("json", schema).alias("data")) .writeStream .format("delta") .option("path", "/deltaTable/") .trigger(processingTime="1 minute") .option("checkpointLocation", "...") .start()
  7. 7. Or a stream like this. STRUCTURED STREAMING: file-source based stream. spark.readStream .format("delta") .load("/salesDeltaIn/") .withColumn("item_id", col("data.item_id")) .writeStream .format("delta") .option("path", "/deltaTableOut/") .trigger(processingTime="1 minute") .option("checkpointLocation", "...") .start()
  8. 8. Maybe even a stream like this (joins). [Diagram: input of records 1..N flowing through Structured Streaming and joined against state of records 1..M] spark.readStream … .join(itemDF, "item_id") … .writeStream … .start()
  9. 9. Maybe even a stream like this (stateful operations). [Diagram: input of records 1..N flowing through Structured Streaming against state of records 1..M] spark.readStream … .groupBy("item_id") .count() … .writeStream … .start()
  10. 10. Scale dimensions: input size n (records in the mini-batch) and state size m (records to be compared against). [Diagram: input of n records flowing through Structured Streaming against a state of m records]
  11. 11. How do we correctly tune this?
  12. 12. Let’s use this example! 1. Main input stream: salesSDF = ( spark.readStream .format("delta") .table("sales") ) 2. Join item category lookup: itemSalesSDF = ( salesSDF .join(spark.table("items"), "item_id") ) 3. Aggregate sales per item category per hour: itemSalesPerHourSDF = ( itemSalesSDF .groupBy(window(..., "1 hour"), "item_category") .sum("revenue") ) [Diagram: input of n records flowing through Structured Streaming against a state of m records]
  13. 13. Input Parameters
  14. 14. Limiting the input dimension: limit n in O(n⨯m), where n is the input size (records in the mini-batch) and m is the state size (records to be compared against). [Diagram: input of n records flowing through Structured Streaming against a state of m records]
  15. 15. Why are input parameters important? ▪ Allows you to control the mini-batch size. ▪ Optimal mini-batch size → Optimal cluster usage. ▪ Suboptimal mini-batch size → performance cliff. ▪ Shuffle Spill ▪ Different Query Plan (Sort Merge Join vs Broadcast Join)
  16. 16. What input parameters are we talking about? File sources ▪ Any: maxFilesPerTrigger ▪ Delta Lake additionally: maxBytesPerTrigger. Message sources ▪ Kafka: maxOffsetsPerTrigger ▪ Kinesis: fetchBufferSize ▪ EventHubs: maxEventsPerTrigger. These options control the size of each mini-batch and are especially important in relation to shuffle partitions (see the sketch below).
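A minimal PySpark sketch of source-side rate limiting; the bootstrap server, topic name, and concrete limit values are illustrative assumptions, not values from the talk.

    # Kafka source: cap the number of records consumed per mini-batch.
    kafkaSDF = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092")
        .option("subscribe", "sales_topic")
        .option("maxOffsetsPerTrigger", 200000)   # total offsets per trigger, across all partitions
        .load()
    )

    # File and Delta sources use maxFilesPerTrigger instead (Delta additionally supports
    # maxBytesPerTrigger), e.g. .option("maxFilesPerTrigger", 100) on the readStream.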
  17. 17. Input Parameters Example: Stream-Static Join. What is a stream-static join? ▪ Joining a streaming df to a static df ▪ Induces a shuffling step. 1. Main input stream: salesSDF = ( spark.readStream .format("delta") .table("sales") ) 2. Join item category lookup: itemSalesSDF = ( salesSDF .join(spark.table("items"), "item_id") )
  18. 18. Input Parameters: Not tuning maxFilesPerTrigger. What happens when maxFilesPerTrigger is not set? ▪ For Delta: the default is 1000 files (each file in this example is ~200 MB). ▪ For message sources and other file-based input: the default is unlimited. ▪ This leads to a massive mini-batch! ▪ When you have shuffle operations → spill.
  19. 19. Input Parameters: Tuning maxFilesPerTrigger. Base it on the shuffle partition size (see the sketch below). ▪ Rule of thumb 1: optimal shuffle partition size is ~100-200 MB. ▪ Rule of thumb 2: set shuffle partitions equal to the number of cores (20 in this example). ▪ Use the Spark UI to tune maxFilesPerTrigger until you get ~100-200 MB per partition. ▪ Note: size on disk is not a good proxy for size in memory, because the on-disk file size differs from the size the data takes up in cluster memory.
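A minimal sketch of the two rules of thumb, assuming the talk's 20-core example cluster; the maxFilesPerTrigger value of 6 is the result the speakers arrive at for their dataset, not a universal default.

    # Rule of thumb 2: one shuffle partition per core avoids idle cores and tiny partitions.
    spark.conf.set("spark.sql.shuffle.partitions", 20)

    # Rule of thumb 1: iterate on maxFilesPerTrigger until the Spark UI shows
    # ~100-200 MB per shuffle partition and no shuffle spill.
    salesSDF = (
        spark.readStream
        .format("delta")
        .option("maxFilesPerTrigger", 6)
        .table("sales")
    )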
  20. 20. Tuning maxFilesPerTrigger: Result. Significant performance improvement by removing spill. ▪ maxFilesPerTrigger tuned to 6 files. ▪ Shuffle partitions tuned to 20. ▪ Processed records per second increased by 30%.
  21. 21. Sort Merge Join vs Broadcast Hash Join. We are not done yet! ▪ Currently we use a Sort Merge Join. ▪ Our static DF is small enough to broadcast (see the sketch below). ▪ Broadcasting leads to 70% higher throughput! ▪ We can also increase maxFilesPerTrigger, because the broadcast removes the shuffle and with it the risk of shuffle spill.
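A minimal sketch of enforcing the Broadcast Hash Join for the stream-static lookup, assuming the items table fits comfortably in memory; salesSDF is the streaming DataFrame from the earlier example.

    from pyspark.sql.functions import broadcast

    # Broadcast the small static side so the streaming side no longer needs to be shuffled.
    itemSalesSDF = (
        salesSDF
        .join(broadcast(spark.table("items")), "item_id")
    )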
  22. 22. Demo: Input Parameters ▪ Explore the demo notebooks on the first topic: Input Parameters
  23. 23. Input Parameters: Summary Main takeaways ▪ Set shuffle partitions to # Cores (assuming no skew) ▪ Tune maxFilesPerTrigger so you end up with 150-200 MB / Shuffle Partition ▪ Try to make use of broadcasting whenever possible
  24. 24. Performant Streaming in Production: Part 2. Max Thöne, Resident Solutions Architect; Stefan van Wouw, Sr. Resident Solutions Architect
  25. 25. State Parameters
  26. 26. Limiting the state dimension: limit m in O(n⨯m), where n is the input size (records in the mini-batch) and m is the state size (records to be compared against). [Diagram: input of n records flowing through Structured Streaming against a state of m records]
  27. 27. Limiting the state dimension: what we mean by state. ▪ State Store backed operations: stateful (windowed) aggregations, drop duplicates, stream-stream joins. ▪ Delta Lake table or external system: stream-static join / MERGE. [Diagram: input records flowing through Structured Streaming against state records]
  28. 28. Why are state parameters important? ▪ Optimal parameters → Optimal cluster usage ▪ If not controlled, state explosion can occur ▪ Slower stream performance over time ▪ Heavy shuffle spill (Joins/MERGE) ▪ Out of memory errors (State Store backed operations)
  29. 29. What parameters are we talking about? State Store specific: ▪ how much history to compare against (watermarking) ▪ which state store backend to use (RocksDB / default; see the sketch below). State Store agnostic (stream-static join / MERGE): ▪ how much history to compare against (query predicate).
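A minimal sketch of switching the state store backend. The provider class below is the one documented for the Databricks runtime; treat it as an assumption on other Spark distributions, where only the default (HDFS-backed) provider may be available.

    # Use the RocksDB-backed state store instead of the default provider, keeping large
    # state off the JVM heap and reducing the risk of out-of-memory errors.
    spark.conf.set(
        "spark.sql.streaming.stateStore.providerClass",
        "com.databricks.sql.streaming.state.RocksDBStateStoreProvider"
    )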
  30. 30. State parameters example ▪ Extending the earlier code sample with a stateful aggregation (see the watermark sketch below), e.g. calculating the number of sales per item category per hour. ▪ Two types of state dimension here: a. the static side of the stream-static join (items); b. a State Store backed operation (windowed stateful aggregation). 1. Main input stream: salesSDF = ( spark.readStream .format("delta") .table("sales") ) 2. Join item category lookup: itemSalesSDF = ( salesSDF .join(spark.table("items"), "item_id") ) 3. Aggregate sales per item category per hour: itemSalesPerHourSDF = ( itemSalesSDF .groupBy(window(..., "1 hour"), "item_category") .sum("revenue") )
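A minimal sketch of bounding that aggregation state with a watermark; the 2-hour tolerance and the event_time column name are illustrative assumptions.

    from pyspark.sql.functions import window

    itemSalesPerHourSDF = (
        itemSalesSDF
        .withWatermark("event_time", "2 hours")            # allow state for old windows to be dropped
        .groupBy(window("event_time", "1 hour"), "item_category")
        .sum("revenue")
    )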
  31. 31. Demo: State Parameters ▪ Explore the demo notebooks on the second topic: State Parameters
  32. 32. State Parameters: Summary. Main takeaways ▪ Limit state accumulation with an appropriate watermark ▪ The more granular the aggregate key / window, the more state ▪ Delta-backed state might provide more flexibility at the cost of latency
  33. 33. Output Parameters
  34. 34. How output parameters influence the scale dimensions. [Diagram: input size n (records in the mini-batch) and state size m (records to be compared against) flowing through Structured Streaming]
  35. 35. Why are output parameters important? ▪ Streaming jobs tend to create many small files ▪ Reading a folder with many small files is slow ▪ Degrading performance for downstream jobs / self-joins
  36. 36. What output parameters are we talking about? ▪ Manually using repartition ▪ Delta Lake: Auto Optimize (see the sketch below) https://docs.databricks.com/delta/optimizations/auto-optimize.html
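A minimal sketch of both options; the partition count, paths, and use of spark.sql for the table properties are illustrative assumptions (the Auto Optimize properties themselves are described in the linked Databricks docs).

    # Option 1: repartition each mini-batch so the sink writes fewer, larger files.
    (itemSalesSDF
        .repartition(8)
        .writeStream
        .format("delta")
        .option("path", "/deltaTableOut/")
        .option("checkpointLocation", "/checkpoints/itemSales")
        .start())

    # Option 2 (Delta Lake on Databricks): enable Auto Optimize on the target table so
    # small files are compacted at write time.
    spark.sql("""
        ALTER TABLE delta.`/deltaTableOut/`
        SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true,
                           delta.autoOptimize.autoCompact = true)
    """)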
  37. 37. Demo: Output Parameters ▪ Explore the demo notebooks on the third topic: Output Parameters
  38. 38. Output Parameters: Summary. Main takeaways ▪ A high number of small files impacts performance ▪ A 10x speed difference can easily be demonstrated
  39. 39. How to keep your streams performant after deployment
  40. 40. Multiple streams per Spark cluster ▪ Some small streams do not warrant their own cluster ▪ Packing them together in one Spark application might be a good option (see the sketch below), but they then share the driver process, which can impact performance. [Diagram: several Structured Streaming queries running inside a single Spark application]
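A minimal sketch of packing two small streams into one Spark application; the table names, output paths, and checkpoint locations are illustrative assumptions.

    # Both queries share the same SparkSession, driver, and scheduler.
    q1 = (spark.readStream.format("delta").table("sales")
          .writeStream.format("delta")
          .option("checkpointLocation", "/checkpoints/sales_copy")
          .start("/delta/salesOut"))

    q2 = (spark.readStream.format("delta").table("returns")
          .writeStream.format("delta")
          .option("checkpointLocation", "/checkpoints/returns_copy")
          .start("/delta/returnsOut"))

    # Keep the application alive until any of the queries stops or fails.
    spark.streams.awaitAnyTermination()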
  41. 41. Temporary changes to load (elasticity) ▪ Temporarily scaling up a streaming cluster to handle a backlog ▪ You can only usefully scale out while #cores <= #shuffle partitions; beyond that, extra cores sit idle.
  42. 42. Permanent changes to load (capacity planning) ▪ A permanent load increase warrants capacity planning ▪ Requires a checkpoint wipe-out, since the number of shuffle partitions is fixed per checkpoint location! ▪ Think of a strategy to recover state (if necessary)
  43. 43. Summary
  44. 44. Summary. Input Parameters: limit input size; tune shuffle partitions / cores (30% faster); enforce broadcasting when possible (2x faster). State Parameters: limit state accumulation; limit how far you look back (history). Output Parameters: prevent generating many small files (10x faster). Deployment: capacity planning is needed due to deployment-bound parameters; have a strategy for checkpoint reset.
  45. 45. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
