Delta Lake Streaming: Under the Hood


With the Lakehouse emerging as the future of data architecture, Delta is becoming the de facto storage format for data pipelines. By using Delta to build curated data lakes, users achieve efficiency and reliability end-to-end. Curated data lakes involve multiple hops in the end-to-end pipeline, executed on a regular (often daily) schedule; as data travels through each hop, its quality improves until it is suitable for end-user consumption. Real-time capabilities, meanwhile, are a key advantage for any business, and Delta's seamless integration with Structured Streaming makes them easy to achieve. Overall, Delta Lake as a streaming source is a natural fit, and we are already seeing adoption rise among our users.



In this talk, we will discuss the functional components of Structured Streaming with Delta as a streaming source. We will deep dive into Query Progress Logs (QPL) and their significance for operating streams in production, show how to track the progress of any streaming job and map it to the source Delta table using the QPL, explain exactly what gets persisted in the checkpoint directory, and map the contents of the checkpoint directory to QPL metrics to understand their significance for Delta streams.


  1. Delta Lake Streaming: Under the Hood (Structured Streaming Internals)
  2. Speaker: Shasidhar Eranti, Senior Resident Solutions Engineer, Databricks
  3. Sample stream: spark.readStream.format("delta") ... .writeStream.format("delta").table("delta_stream").option("checkpointLocation", "…").start() (diagram: Process, Checkpoint, Time)
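
For reference, here is a minimal, runnable PySpark sketch of the stream on this slide. The table names and checkpoint path are placeholders, not from the talk, and `toTable` (Spark 3.1+) stands in for the notebook shorthand `.table(...).start()` shown on the slide:

```python
# Minimal sketch of the slide's Delta-to-Delta stream (placeholder names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (
    spark.readStream
    .format("delta")
    .table("source_table")            # placeholder: any registered Delta table
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/delta_stream")  # placeholder path
    .toTable("delta_stream")          # starts the query and returns a handle
)
```

The `query` handle returned here is what the later snippets use to read progress metrics.
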
  4. Physical components of a Delta stream: 1. Source 2. Sink 3. Checkpoint 4. Transaction log
  5. Focus area: the checkpoint and the transaction log
  6. How do we deep dive? 1. Structured Streaming internals 2. Delta table properties 3. Common pitfalls & mitigation strategies
  7. What will we deep dive into? 1. Structured Streaming internals 2. Delta table properties 3. Common pitfalls & mitigation strategies
  8. Structured Streaming internals
  9. Structured Streaming internals: Query Progress Logs (QPL), streaming semantics with Delta Lake, and the streaming checkpoint, with Delta as source & sink
  10. Query Progress Log (Structured Streaming)
  11. Query Progress Log (QPL): a JSON log generated by every microbatch; provides batch execution details as metrics; used to render the streaming dashboard in notebook cells
  12. Query Progress Log metrics categories: microbatch execution, source/sink, stream performance, batch duration, streaming state
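
Beyond the notebook dashboard, the same QPL JSON can be read programmatically from the StreamingQuery handle. A small sketch, assuming `query` is the handle from the earlier snippet:

```python
import json

# lastProgress is None until the first microbatch has completed
if query.lastProgress is not None:
    print(json.dumps(query.lastProgress, indent=2))  # full QPL of the latest batch

# recentProgress holds a bounded buffer of the most recent QPLs
for p in query.recentProgress:
    print(p["batchId"], p["numInputRows"])
```
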
  13. Metrics categories: batch execution metrics
      Key metrics: id (unique stream id; maps to the checkpoint directory), batchId (microbatch id)
      JSON (notebook UI):
        "id" : "f87419cf-e92c-4d8a-b801-0ac1518da5e6",
        "runId" : "d7e7fe6b-6386-4276-a936-2485a1522190",
        "name" : "simple_stream",
        "batchId" : 1
  14. Metrics categories: Delta source and sink metrics
      Key metrics:
        startOffset/endOffset: the offsets at which the batch started and ended
          reservoirVersion: version of the Delta table at which the current stream execution started
          index: file index within the transaction
          isStartingVersion: true if the stream starts from this reservoir version
        numInputRows: count of rows ingested in the microbatch
      JSON:
        "sources" : [ {
          "description" : "DeltaSource[dbfs:/user/hive/..]",
          "startOffset" : {
            "sourceVersion" : 1,
            "reservoirId" : "483e5927-1c26-4af5-b7a2-31a1fee8983a",
            "reservoirVersion" : 3,
            "index" : 2,
            "isStartingVersion" : true
          },
          "endOffset" : {
            "sourceVersion" : 1,
            "reservoirId" : "483e5927-1c26-4af5-b7a2-31a1fee8983a",
            "reservoirVersion" : 3,
            "index" : 3,
            "isStartingVersion" : true
          },
          "numInputRows" : 1
        } ],
        "sink" : {
          "description" : "DeltaSink[dbfs:/user/hive/warehouse/..]",
          "numOutputRows" : -1
        }
      (numOutputRows of -1 indicates the Delta sink does not report output row counts)
  15. Metrics categories: performance metrics
      Key metrics:
        inputRowsPerSecond: the rate at which data arrives from this source into the stream (not the ingestion rate at the source itself)
        processedRowsPerSecond: the rate at which Spark processes data from this source
        An input rate higher than the processing rate indicates the stream is falling behind.
      JSON (notebook UI):
        "inputRowsPerSecond" : 0.016666666666666666,
        "processedRowsPerSecond" : 0.2986857825567503
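
As a sketch, that falling-behind condition can be checked directly against the QPL buffer (same hypothetical `query` handle as above):

```python
# Flag microbatches where data arrived faster than it was processed.
for p in query.recentProgress:
    if p["inputRowsPerSecond"] > p["processedRowsPerSecond"]:
        print(f"batch {p['batchId']} fell behind: "
              f"in={p['inputRowsPerSecond']:.2f} rows/s, "
              f"proc={p['processedRowsPerSecond']:.2f} rows/s")
```
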
  16. Streaming semantics with Delta Lake
  17. Sample Delta Lake stream (Structured Streaming), source and sink: spark.readStream.format("delta").option("maxFilesPerTrigger", 5).table("delta_keyval") ... .writeStream.format("delta").table("delta_stream").trigger("60 seconds").option("checkpointLocation", "…").start()
  18. Source side: maxFilesPerTrigger sets the number of files ingested per microbatch; if not specified, the stream falls back to the default value (1000); it always applies to Delta streams
  19. Sink side: the stream uses a processing-time trigger in microbatch mode; if no trigger is specified, the default runs each microbatch as soon as the previous one finishes
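
Putting slides 17 to 19 together, here is a runnable PySpark version of this stream. Note that in PySpark the processing-time trigger is spelled `trigger(processingTime="60 seconds")`; the slide's `.trigger("60 seconds")` is shorthand. The checkpoint path is a placeholder:

```python
query = (
    spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 5)       # cap on files ingested per microbatch (default 1000)
    .table("delta_keyval")
    .writeStream
    .format("delta")
    .trigger(processingTime="60 seconds")  # fire a microbatch every 60 seconds
    .option("checkpointLocation", "/tmp/checkpoints/delta_kv")  # placeholder path
    .toTable("delta_stream")
)
```
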
  20. Delta source to stream mapping: source table details (diagram: table data and history, version 0 containing files 0 through 7)
  21. Delta source to stream mapping, Delta-to-Delta stream (source table, stream, destination table): spark.readStream.format("delta").option("maxFilesPerTrigger", 5).table("delta_keyval").writeStream.format("delta").table("delta_kv_stream").option("checkpointLocation", "…").start()
  22. Delta source to stream mapping, Delta-to-Delta stream: query progress log for the first batch, against the source table history (and the destination table history after the first batch)
        "runId" : "324f1e17-4fae-4e1a-...",
        "batchId" : 0,
        ...
        "sources" : [ {
          "description" : "DeltaSource[...]",
          "startOffset" : null,
          "endOffset" : {
            "sourceVersion" : 1,
            "reservoirId" : "744f0c51-48e6-482d-...",
            "reservoirVersion" : 0,
            "index" : 4,
            "isStartingVersion" : true
          },
          "numInputRows" : 10240
        } ]
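
To reproduce the "source table history" half of this mapping yourself, the table history can be pulled with the Delta APIs. A sketch, assuming the delta-spark package is installed and using the deck's table name:

```python
from delta.tables import DeltaTable

# Each QPL reservoirVersion corresponds to a version row in the table history.
history = DeltaTable.forName(spark, "delta_keyval").history()
history.select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)

# SQL equivalent:
spark.sql("DESCRIBE HISTORY delta_keyval").show(truncate=False)
```
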
  23. First batch (batchId 0): files 0 to 4 are processed in batch 0, since maxFilesPerTrigger is set to 5
        QPL for batch 0: startOffset = null; endOffset = { reservoirVersion: 0, index: 4, isStartingVersion: true }; numInputRows = 10240
  24. Second batch (batchId 1): files 5 to 7 are processed in batch 1 (each file holds 2048 rows, so three files yield 6144)
        QPL for batch 1: startOffset = { reservoirVersion: 0, index: 4, isStartingVersion: true }; endOffset = { reservoirVersion: 0, index: 7, isStartingVersion: true }; numInputRows = 6144
  25. Third batch (batchId 2): no files are left to process in batch 2
        QPL for batch 2: startOffset = { reservoirVersion: 0, index: 7, isStartingVersion: true }; endOffset = { reservoirVersion: 0, index: 7, isStartingVersion: true }; numInputRows = 0
  26. Third batch (batchId 2, with new data): a new commit adds version 1 with files 0 through 7
        QPL for batch 2: startOffset = { reservoirVersion: 0, index: 7, isStartingVersion: true }; endOffset = { reservoirVersion: 1, index: 4, isStartingVersion: false }; numInputRows = 10240
  27. Third batch (batchId 2, with new data): files 0 to 4 from version 1 are processed in batch 2 (same QPL as the previous slide)
  28. Last batch (batchId 4): no new files are left to process in batch 4; once caught up, the offset points at the next possible version (version 2) with index -1
        QPL for batch 4: startOffset = { reservoirVersion: 2, index: -1, isStartingVersion: false }; endOffset = { reservoirVersion: 2, index: -1, isStartingVersion: false }; numInputRows = 0
  29. Streaming Checkpoint
  30-31. Streaming stages (Structured Streaming checkpoint), shown on a simple stream with status in the notebook: Step 1 Construct microbatch (fetch source offsets, commit offsets); Step 2 Process microbatch; Step 3 Commit microbatch
  32. Streaming checkpoint: tracks the streaming query; all data is stored as JSON; contents: offsets, metadata, commits
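
A quick way to see those three pieces is to list the checkpoint directory. A sketch for a checkpoint on the local filesystem (on DBFS, `dbutils.fs.ls` would play the same role); the path is a placeholder:

```python
import os

checkpoint = "/tmp/checkpoints/delta_stream"   # placeholder path
for root, _, files in os.walk(checkpoint):
    for f in sorted(files):
        print(os.path.join(root, f))

# Typical layout:
#   metadata           stream id, written once when the stream first starts
#   offsets/0, 1, ...  one file per microbatch, written when the batch starts
#   commits/0, 1, ...  one file per microbatch, written when the batch completes
```
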
  33. Streaming checkpoint, offsets (Step 1, construct microbatch): one file per microbatch; the offset file is written when the batch starts; it stores (a) batch and streaming state details and (b) source details
  34. Streaming checkpoint, offset mapping (Step 1, construct microbatch): files 0 to 4 are processed in batch 0 (the first batch)
        filename: dbfs:/mnt/stream/simple_stream/checkpoint/offsets/0
        content (matching the query progress log):
          "startOffset" : null,
          "endOffset" : {
            "sourceVersion" : 1,
            "reservoirId" : "483e5927-1c26-4af5-b7a2-",
            "reservoirVersion" : 0,
            "index" : 4,
            "isStartingVersion" : true
          }
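
The offset file itself is plain text. A sketch that parses out the Delta source offset, assuming the usual layout for a single-source stream (a version header line, one line of batch metadata, then one serialized offset per source) and a local placeholder path:

```python
import json

with open("/tmp/checkpoints/delta_stream/offsets/0") as fh:  # placeholder path
    lines = fh.read().splitlines()

header         = lines[0]               # version marker, e.g. "v1"
batch_metadata = json.loads(lines[1])   # watermark, batch timestamp, relevant confs
delta_offset   = json.loads(lines[2])   # the Delta source offset (single source)

print(delta_offset["reservoirVersion"],
      delta_offset["index"],
      delta_offset["isStartingVersion"])
```
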
  35. Streaming checkpoint, metadata: metadata of the stream; the stream id is generated when the stream first starts and remains the same for the lifetime of the checkpoint directory
  36. Streaming checkpoint, commits (Step 3, commit microbatch): one file per microbatch; the file is written only when the batch completes; on query restart, the number of commit files is compared with the number of offset files: if equal, start a new batch; if not, finish the previously started batch first
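
That restart check can be sketched as a file-count comparison over the two subdirectories (placeholder path again):

```python
import os

cp = "/tmp/checkpoints/delta_stream"   # placeholder path
offsets = {f for f in os.listdir(os.path.join(cp, "offsets")) if f.isdigit()}
commits = {f for f in os.listdir(os.path.join(cp, "commits")) if f.isdigit()}

if offsets == commits:
    print("all batches committed: the next run starts a new microbatch")
else:
    print("pending batch(es):", sorted(offsets - commits),
          "-> the next run finishes the started microbatch first")
```
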
  37. Summary: the QPL is the source of truth for stream execution (Query Progress Logs); the Delta log to QPL mapping explains stream execution (Delta streaming); offsets and commits are key to maintaining stream state (streaming checkpoint)
  38. What next? Other variations of Delta streams (Trigger.Once, maxBytesPerTrigger); Delta table properties; common pitfalls & mitigation strategies
  39. Feedback: your feedback is important to us. Don't forget to rate and review the sessions.