Parallelization of Structured Streaming Jobs Using Delta Lake


We’ll tackle the problem of running streaming jobs from another perspective using Databricks Delta Lake, examining some of the issues we faced at Tubi while running regular Structured Streaming, and give a quick overview of why we transitioned from parquet data files to Delta and the problems it solved for us in running our streaming jobs.


Parallelization of Structured Streaming Jobs Using Delta Lake

  1. © Tubi, proprietary and confidential. Parallelization of Structured Streaming Jobs using Delta - Oliver Lewis, Sr. Data Engineer
  2. (no transcribed text)
  3. Datalake: throughput 40,000 requests/s; 800M aggregate records/day; 500GB volume/day
  4. Analytics Powered By Stream of Immutable Events
  5. Analytics Powered By Stream of Immutable Events
  6. Engineering Challenges with a Stream-First Architecture: • Datalake file right-sizing • Backfill / data deletion process is a nightmare • Multiple streams writing to the same location
  7. Delta @ Tubi: • Optimization of ingested parquet files • Data deletion use cases (GDPR/CCPA) • _spark_metadata failures in backfill operations
  8. Example of a simple structured streaming job
  9. Strategies to Backfill: 1) Write a batch job to backfill. 2) Gracefully terminate the streaming job. Gotcha: do not simply replace readStream with read and writeStream with write, because flatMapGroupsWithState is implicitly converted to mapGroups in batch mode and you’ll lose state management entirely.
  10. Issues in backfilling large datasets. Suppose we set start_date to 2016-01-01 and end_date to 2020-05-31 and run the job. There are several problems with structuring the job like this: 1) It would be a nuisance if the job ran for a long time before failing. 2) State management cannot hold such a large state.
  11. Encapsulate the task. What we need is a small batch that can be TRIGGERED, EXECUTED, and COMPLETED, so that at any given time we do not store too much state on the executors and can clearly track completion. At scale, any date can be sent as input and we should generate the same output, i.e. the task should be idempotent.
  12. Performance. To make the backfill go faster, our immediate intuition is to increase the size of the cluster. Example: 3886 tasks on 64 cores took 8.2 min; with 3886 cores we could complete the job in ~8 s. So our intuition is CORRECT: increasing the cluster size helps as long as the number of cores is less than or equal to the number of tasks.
  13. Performance: • If the number of cores is greater than the number of tasks, you have a large cluster that is not being fully utilized. • This is an important limitation of our initial intuition: spinning up a larger cluster does not always increase performance.
  14. Performance
  15. Performance
  16. Backfilling in parallel: 1) Separate the business logic from the execution logic. 2) Run multiple streams in parallel; each job is submitted to the Spark scheduler, which is responsible for executing it depending on the number of free cores available. 3) Use Scala parallel collections (.par).
  17. Par collections limitations: 1) Rob Pike: concurrency is not parallelism. 2) Par collections do launch Spark jobs in parallel, but the Spark scheduler may not actually execute the jobs in parallel.
  18. Futures and the Fair Scheduler Pool: 1) By default each pool gets an equal share of the cluster, but inside each pool jobs run in FIFO order: spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1") 2) You can further configure each pool's schedulerMode, minShare, and weight.
  19. DEMO: https://dbc-5f2acc18-b29d.cloud.databricks.com/?o=6211501775943445#notebook/3133024054071987/
  20. Performance Chart
  21. Failure and recovery handling: • We should be able to handle failures and retries within the job. • Build a simple StateStore to monitor which state the job is currently in. • Once a job has finished successfully, we can remove it from the state store.
  22. Thank You. Careers: https://corporate.tubitv.com/company/careers/ Blog: https://code.tubitv.com/ Contact: https://www.linkedin.com/in/oliveralewis/ olewis@tubi.tv
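The example job on slide 8 is not captured in the transcript. As a stand-in, here is a minimal sketch of a simple Structured Streaming job of the kind the deck describes; the Kafka source, topic, and paths are illustrative assumptions, not Tubi's actual code.

```scala
// Hedged sketch of a simple Structured Streaming job (slide 8).
// Broker, topic, and paths are illustrative assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("simple-stream").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumed broker
  .option("subscribe", "events")                    // assumed topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")

events.writeStream
  .format("parquet") // later "delta", once the lake moved to Delta
  .option("checkpointLocation", "/checkpoints/simple-stream")
  .option("path", "/datalake/events")
  .start()
  .awaitTermination()
```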
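The idempotent task on slide 11 can be sketched as follows: the output location is a pure function of the input date, and a re-run overwrites the same partition, so submitting the same date twice yields the same result. All names are illustrative, and a mutable map stands in for the datalake.

```scala
// Sketch of slide 11's idempotent per-date task (illustrative names).
// The output path is derived only from the input date, and a re-run
// overwrites the same partition, so the task is safe to retry.
import java.time.LocalDate
import scala.collection.mutable

final case class BackfillTask(date: LocalDate) {
  // Deterministic partition path, a pure function of the input date.
  def outputPath: String = s"/datalake/agg/date=$date"

  // The real task would read one day of raw data, aggregate it, and
  // overwrite `outputPath`; a mutable map stands in for the datalake.
  def run(lake: mutable.Map[String, Int]): Unit =
    lake(outputPath) = date.getDayOfYear // stand-in for the day's output
}

val lake = mutable.Map.empty[String, Int]
val task = BackfillTask(LocalDate.parse("2016-01-01"))
task.run(lake)
task.run(lake) // idempotent: a second run changes nothing
```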
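The arithmetic behind slide 12's numbers checks out: 3886 tasks on 64 cores run in about ceil(3886 / 64) = 61 sequential waves, and 8.2 minutes over 61 waves is roughly 8 seconds per wave, so with 3886 cores the whole job is a single wave of ~8 seconds.

```scala
// Back-of-the-envelope check of slide 12's numbers.
val tasks = 3886
val cores = 64
val totalMinutes = 8.2

val waves = math.ceil(tasks.toDouble / cores).toInt // 61 sequential waves
val secondsPerWave = totalMinutes * 60 / waves      // ~8.1 s per wave
// With 3886 cores every task starts at once: one wave, ~8 s total.
```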
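The .par approach on slide 16 might look like the sketch below, assuming a hypothetical runBackfillForDate function that encapsulates the per-date business logic (paths and aggregation are illustrative). Note that on Scala 2.13+ this also requires the separate scala-parallel-collections module.

```scala
// Hypothetical sketch of slide 16: submit one Spark job per date from
// a parallel collection. Each call blocks its worker thread, so many
// jobs reach the Spark scheduler concurrently.
import java.time.LocalDate

// Business logic, separated from execution logic (illustrative).
def runBackfillForDate(date: LocalDate): Unit =
  spark.read.parquet(s"/datalake/raw/date=$date")
    .groupBy("user_id").count()
    .write.mode("overwrite").parquet(s"/datalake/agg/date=$date")

val dates: Seq[LocalDate] =
  Iterator.iterate(LocalDate.parse("2016-01-01"))(_.plusDays(1))
    .takeWhile(!_.isAfter(LocalDate.parse("2020-05-31")))
    .toSeq

// Execution logic: fan the dates out over a thread pool. Spark, not
// .par, ultimately decides how many jobs truly run in parallel.
dates.par.foreach(runBackfillForDate)
```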
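Slide 18's combination of Futures with a fair scheduler pool could be sketched as below. This assumes the cluster runs with spark.scheduler.mode=FAIR and a pool named "backfill" defined in the allocation file, plus a hypothetical runBackfillForDate helper holding the per-date business logic; the pool property is thread-local, which is why it is set inside each Future.

```scala
// Hypothetical sketch of slide 18: one Future per date, tagged with a
// fair scheduler pool. Names and pool configuration are assumptions.
import java.time.LocalDate
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

def backfill(date: LocalDate): Future[Unit] = Future {
  // setLocalProperty is per-thread, so it must run inside the Future.
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "backfill")
  runBackfillForDate(date) // hypothetical per-date business logic
}

// `dates` is a Seq[LocalDate] of backfill dates, assumed defined elsewhere.
val results = Future.sequence(dates.map(backfill))
Await.result(results, Duration.Inf)
```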
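The simple StateStore on slide 21 might look like the minimal in-memory sketch below; all names are illustrative, and a real implementation would persist state to durable storage so it survives driver restarts.

```scala
// Minimal in-memory state store for tracking per-date backfill jobs
// (illustrative sketch of slide 21's idea).
import scala.collection.mutable

sealed trait JobState
case object Running extends JobState
final case class Failed(attempts: Int) extends JobState

final class StateStore(maxRetries: Int = 3) {
  private val states = mutable.Map.empty[String, JobState]

  def start(date: String): Unit = states(date) = Running

  // A successfully finished job is removed from the store entirely.
  def succeed(date: String): Unit = states.remove(date)

  // Record a failure; return whether the job should be retried.
  def fail(date: String): Boolean = {
    val attempts = states.get(date) match {
      case Some(Failed(n)) => n + 1
      case _               => 1
    }
    states(date) = Failed(attempts)
    attempts < maxRetries
  }

  def pending: Set[String] = states.keySet.toSet
}

val store = new StateStore(maxRetries = 2)
store.start("2016-01-01")
store.start("2016-01-02")
store.succeed("2016-01-01")          // done: dropped from the store
val retry = store.fail("2016-01-02") // first failure: retry allowed
```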
