
Near Real-Time Data Warehousing with Apache Spark and Delta Lake


Timely data in a data warehouse is a challenge many of us face, and there is often no straightforward solution.
Using a combination of batch and streaming data pipelines, you can leverage the Delta Lake format to provide an enterprise data warehouse at near real-time frequency. Delta Lake eases the ETL workload by enabling ACID transactions in a warehousing environment. Coupled with Structured Streaming, this lets you achieve a low-latency data warehouse. In this talk, we cover how to use Delta Lake to improve the latency of ingestion and storage of your data warehouse tables, and how you can use Spark Structured Streaming to build the aggregations and tables that drive your data warehouse.


  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Jasper Groot, Eventbrite: Near Real-Time Data Warehousing with Apache Spark and Delta Lake
  3. Introduction • Data engineering in the event industry for 4+ years • Using Spark for 3+ years • Currently at Eventbrite
  4. Outline • Structured Streaming – In a nutshell • Delta Lake – How it works • Data Warehousing – Detailed example using these tools – Gotchas
  5. Structured Streaming - In a nutshell • Introduced in Spark 2.0 • Streams are unbounded DataFrames • Familiar API for anyone who has used DataFrames
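To make the "unbounded DataFrame" idea concrete, here is a minimal sketch (not from the deck) using Spark's built-in rate test source as a stand-in for a real stream; the streaming DataFrame supports the same operations as a static one:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # The built-in "rate" source emits (timestamp, value) rows continuously,
    # so the resulting DataFrame is unbounded.
    stream_df = (spark.readStream
                 .format("rate")
                 .option("rowsPerSecond", 10)
                 .load())

    # Familiar DataFrame operations apply directly to the stream.
    evens = stream_df.filter(stream_df.value % 2 == 0)

    query = (evens.writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()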
  6. Structured Streaming (image-only slide)
  7. Structured Streaming - How streaming DataFrames differ • More restrictive operations – Distinct – Joins – Aggregations (must come after joins)
  8. Structured Streaming - Recovery • Recovery is done through checkpointing • Checkpointing uses write-ahead logs • Stores running aggregates and progress • Checkpoint location must be an HDFS-compatible file system • There are limitations on resuming from a checkpoint after updating application logic
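In code, checkpointing is a per-query option. A small sketch continuing the rate-source stream above (the HDFS paths are placeholders):

    # Restarting the query with the same checkpoint path resumes from the
    # recorded offsets and running aggregates. The checkpoint must live on
    # an HDFS-compatible file system.
    query = (stream_df.writeStream
             .format("parquet")
             .option("path", "hdfs:///warehouse/rate_output")
             .option("checkpointLocation", "hdfs:///checkpoints/rate_query")
             .start())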
  9. Structured Streaming + Data Warehousing • Importance of watermarking • Managing late data • Using foreachBatch
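As an illustration of watermarking (not from the deck; the events stream and its event_time and event_type columns are hypothetical), a watermark tells Spark how long to keep state around for late data:

    from pyspark.sql.functions import window

    # Accept events up to 10 minutes late; anything later is dropped, and
    # the corresponding window state can be finalized and cleaned up.
    windowed_counts = (events
                       .withWatermark("event_time", "10 minutes")
                       .groupBy(window("event_time", "5 minutes"), "event_type")
                       .count())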
  12. Delta Lake • Open-sourced in 2019 • Parquet under the hood • Enables ACID transactions • Supports looking back in time • UPDATE & DELETE existing records • Schema management options
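A hedged sketch of the UPDATE/DELETE and time-travel features (the table path and column names are illustrative assumptions):

    from delta.tables import DeltaTable

    table = DeltaTable.forPath(spark, "s3://bucket/tables/orders")  # placeholder path

    # UPDATE and DELETE existing records in place, transactionally.
    table.update(condition="status = 'pending'", set={"status": "'expired'"})
    table.delete("created_at < '2018-01-01'")

    # Look back in time: read the table as of an earlier version.
    v0 = (spark.read.format("delta")
          .option("versionAsOf", 0)
          .load("s3://bucket/tables/orders"))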
  13. Delta Lake • Files can be backed by – AWS S3 – Azure Blob Storage – HDFS • Able to convert datasets between Parquet and Delta Lake • Some SQL support
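For example, converting an existing Parquet dataset in place (the path is a placeholder):

    from delta.tables import DeltaTable

    # Writes a transaction log alongside the existing Parquet files,
    # turning the directory into a Delta table without rewriting the data.
    DeltaTable.convertToDelta(spark, "parquet.`s3://bucket/tables/events`")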
  14. Delta Lake - ACID Transactions • Works using a transaction log • Transaction log tracks state • Files will not be deleted during read • Optimistic conflict resolution
  15.–19. Delta Lake - ACID Transactions (image-only slides)
  20. Delta Lake - ACID Transactions (code slide: Update, Insert)
  21. Delta Lake - ACID Transactions (image-only slide)
  22. Delta Lake - ACID Transactions (code slide: defining our dataset; aliases for merge; join condition)
  23. Delta Lake - ACID Transactions (code slide: values to update)
  24. Delta Lake - ACID Transactions (code slide: if the join condition is not met, insert)
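The original code on slides 22–24 is not preserved in this transcript; below is a hedged reconstruction of a Delta merge following those captions (the table path and column names are assumptions):

    from delta.tables import DeltaTable

    # Defining our dataset: the target Delta table and a batch of new records.
    target = DeltaTable.forPath(spark, "s3://bucket/tables/users")
    updates = spark.read.format("json").load("s3://bucket/incoming/users")

    (target.alias("t")                                    # aliases for merge
     .merge(updates.alias("u"), "t.user_id = u.user_id")  # join condition
     # Values to update when an existing row matches.
     .whenMatchedUpdate(set={"email": "u.email",
                             "updated_at": "u.updated_at"})
     # If the join condition is not met, insert the record.
     .whenNotMatchedInsertAll()
     .execute())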
  25.–27. Delta Lake - ACID Transactions (image-only slides)
  28. Delta Lake - ACID Transactions • Delta tracks operations on files • Not all operations take effect immediately • A new log file is created for each transaction
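You can see those per-transaction log entries through the table history; a small sketch (the path is a placeholder):

    from delta.tables import DeltaTable

    # Each commit appends a new JSON file under _delta_log/; history()
    # surfaces those transactions as a DataFrame.
    table = DeltaTable.forPath(spark, "s3://bucket/tables/users")
    (table.history(10)
     .select("version", "timestamp", "operation")
     .show(truncate=False))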
  29.–31. Delta Lake - ACID Transactions (image-only slides)
  32. Delta Lake - File Management: Cleaning up • Delta provides a VACUUM command • VACUUM can be run with a retention period – Default 7-day retention period – VACUUM with a low retention period can corrupt active writers • VACUUM does not get logged
  33. Delta Lake - File Management (image-only slide)
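A hedged example of the cleanup call (the path is a placeholder; retention is given in hours):

    from delta.tables import DeltaTable

    table = DeltaTable.forPath(spark, "s3://bucket/tables/users")

    # Delete files no longer referenced by the table and older than the
    # retention period; the default is 168 hours (7 days).
    table.vacuum()     # default retention
    table.vacuum(168)  # explicit retention in hours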
  34. Pulling it all together - Structured Streaming • Leverages many strengths of the DataFrame API • Gives a clean way to manage late data • Makes it manageable to join multiple streams
  35. Pulling it all together - Delta Lake • Gives us ACID transactions • Logs what has taken place • Requires some file management
  36. Data Warehousing • There are many ways to model data; let's stick to an example: – Star schema – Source is MySQL – Sink is S3 – Possibilities to export from S3
  37. Data Warehousing (image-only slide)
  38. Schema (image-only slide)
  39.–40. Data Warehousing (image-only slides)
  41. Data Warehousing (code slide) • Read stream from Kafka • Value comes in as binary • Parse the message using a fixed schema
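The slide's code is not preserved; a hedged reconstruction of that read-and-parse step (the broker address, topic name, and message fields are assumptions):

    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import (StringType, StructField, StructType,
                                   TimestampType)

    # Fixed schema for the change messages coming off MySQL.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("email", StringType()),
        StructField("updated_at", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
           .option("subscribe", "mysql.users")                # placeholder topic
           .load())

    # Kafka delivers the value as binary; cast to string, then parse it
    # with the fixed schema.
    parsed = (raw
              .select(from_json(col("value").cast("string"), schema)
                      .alias("row"))
              .select("row.*"))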
  42. Data Warehousing (code slide) • Parse the MySQL data • Write the stream as Delta, leveraging checkpoints
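Continuing the sketch above, the write side might look like this (paths are placeholders):

    # Write the parsed stream as a Delta table; the checkpoint lets the
    # query recover exactly where it left off after a restart.
    query = (parsed.writeStream
             .format("delta")
             .outputMode("append")
             .option("checkpointLocation", "s3://bucket/checkpoints/users")
             .start("s3://bucket/tables/users_staging"))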
  43. Data Warehousing - Type 2 Dimensions • A valid new record must – update the previous version – insert itself as the new version • Delta merge is the way to go • Process batches using foreachBatch (see the merge sketch after slide 49)
  44.–45. Data Warehousing (image-only slides)
  46. Data Warehousing - foreachBatch method • Receives each micro-batch as a DataFrame, along with a batchId • You are free to handle the DataFrame as you see fit
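A minimal sketch of wiring a stream into foreachBatch (names and paths are illustrative; parsed is the stream from the earlier sketch):

    def upsert_to_dim(batch_df, batch_id):
        # Each micro-batch arrives as an ordinary DataFrame, so batch-only
        # operations such as a Delta merge are available here.
        # (See the SCD type 2 merge sketch after slide 49.)
        pass

    query = (parsed.writeStream
             .foreachBatch(upsert_to_dim)
             .option("checkpointLocation", "s3://bucket/checkpoints/dim_users")
             .start())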
  47. Data Warehousing (code slide) • NULL merge key guarantees insert • Join to the table to merge into • Filter to the most recent records • Select only the batch data and merge key
  48. Data Warehousing (image-only slide)
  49. Data Warehousing (code slide) • Match condition: only match current records • Set the previous latest record as non-current • Insert all data if there is no previous iteration
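The code on slides 47–49 is not preserved; below is a hedged reconstruction of the type 2 merge those captions describe (the table path and columns such as user_id, email, and is_current are assumptions):

    from delta.tables import DeltaTable

    def scd2_merge(batch_df, batch_id):
        dim = DeltaTable.forPath(spark, "s3://bucket/tables/dim_users")

        # Rows that change an existing current record must both close the
        # old version and insert a new one. Staging such rows a second time
        # with a NULL merge key guarantees the insert branch fires, because
        # NULL never satisfies the join condition.
        changers = (batch_df.alias("b")
                    .join(dim.toDF().alias("d"), "user_id")
                    .where("d.is_current = true AND b.email <> d.email")
                    .selectExpr("NULL AS merge_key", "b.*"))

        staged = changers.union(
            batch_df.selectExpr("user_id AS merge_key", "*"))

        (dim.alias("d")
         .merge(staged.alias("s"), "d.user_id = s.merge_key")
         # Match condition: only match current records.
         .whenMatchedUpdate(
             condition="d.is_current = true AND d.email <> s.email",
             # Set the previous latest record as non-current.
             set={"is_current": "false", "end_date": "s.updated_at"})
         # No match (including every NULL merge key row): insert the data
         # as the new current version.
         .whenNotMatchedInsert(values={
             "user_id": "s.user_id",
             "email": "s.email",
             "start_date": "s.updated_at",
             "end_date": "null",
             "is_current": "true"})
         .execute())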
  50. Data Warehousing (image-only slide)
  51. Data Warehousing - Gotchas: File Management • Smaller trigger windows mean more files – More files mean slower reads • How useful is table history? • File size optimization
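One common mitigation is the compaction pattern from the Delta Lake documentation (the path and partition count are illustrative, and this assumes a Delta release that supports the dataChange write option):

    # Rewrite the table's data into fewer, larger files. dataChange=false
    # marks the commit as a pure compaction so that downstream streaming
    # readers do not treat the rewritten files as new data.
    (spark.read.format("delta")
     .load("s3://bucket/tables/fact_orders")
     .repartition(16)
     .write.format("delta")
     .option("dataChange", "false")
     .mode("overwrite")
     .save("s3://bucket/tables/fact_orders"))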
  52. Data Warehousing - Gotchas: Streaming joins • Watermarks required for stream-to-stream joins • Be aware of the latency of your streams • Handle late data beyond the watermark – Set failure conditions for your streaming applications
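A sketch of a stream-to-stream join with watermarks on both sides (the two streams and their columns are hypothetical); the time-range condition lets Spark bound the join state it must keep:

    from pyspark.sql.functions import expr

    orders = order_stream.withWatermark("order_time", "30 minutes")
    payments = payment_stream.withWatermark("payment_time", "2 hours")

    # Join each payment to the order it pays for, allowing payments to
    # arrive up to one hour after the order.
    joined = orders.join(
        payments,
        expr("""
            order_id = payment_order_id AND
            payment_time >= order_time AND
            payment_time <= order_time + interval 1 hour
        """))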
  53. 53. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT
