Scale-Out Using Spark in Serverless Herd Mode!

Spark is a beast of a technology and can do amazing things, especially with large datasets. But some big data pipelines require processing the data in small chunks and running them through a large Spark cluster can be inefficient and expensive.

  1. Scale-Out Using Spark in Serverless Herd Mode! Opher Dubrovsky + Ilai Malka
  2. Spark Has Super Powers: 1. Data - process huge amounts of data. 2. Scale - easily scale by adding nodes. But…
  3. BUT - Every Superhero Has a Weakness. Beware, Kryptonite!
  4. Kryptonite I - Shuffle: moves data between executors and nodes in a cluster (see the sketch below).
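As a concrete illustration (not from the original slides), here is a minimal PySpark sketch of a wide transformation that forces a shuffle; the paths and the user_id column are hypothetical:

```python
from pyspark.sql import SparkSession

# A wide transformation such as groupBy forces Spark to redistribute
# (shuffle) rows across executors so that all rows sharing a key land on
# the same node - the "Kryptonite" described above.
spark = SparkSession.builder.appName("shuffle-example").getOrCreate()

events = spark.read.parquet("s3://my-bucket/events/")    # hypothetical input path
counts = events.groupBy("user_id").count()               # this step triggers a shuffle
counts.write.mode("overwrite").parquet("s3://my-bucket/counts/")  # hypothetical output
```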
  5. Kryptonite II - Skewed Data: nodes wait for the longest task to complete, causing processing delays + extra costs!
  6. Kryptonite III - The Cluster Model Causes Excessive Costs & Delays!! The cluster is sized for a reasonable load, so it sits idle (wasted cost) when load is low and falls behind when load spikes unexpectedly.
  7. The Serverless Approach - No Delays & Cost Efficient!! Spin up resources only when needed; pay only for processing.
  8. Agenda ● A serverless approach for scaling out Spark ● A real-life case study ● Use cases to consider
  9. Who Are We? Ilai Malka - Big Data Engineer. Motto: Serverless is the revolution! Opher Dubrovsky - Big Data Dev Lead. Motto: Focus on 90% improvements!
  10. Nielsen Marketing Cloud (NMC) ● Acquired by Nielsen in 2015 (eXelate) ● Builds marketing segments + device graph ● Data used for: ○ Targeting (campaigns) ○ Business decisions
  11. NMC in a Nutshell: 5 petabytes, ~60 TB/day, ~150B rows/day, ~6,000 nodes/day, cloud native.
  12. DataOut - 250 billion events/day, running on serverless!!! Watch the video: /bit.ly/TMA-Serverless
  13. DataOut in Numbers - Incredible Power & Scale! Events: 250 billion (top day) vs. 120 billion (average day); Files: 17 million vs. 5 million; Data: 55 TB vs. 20-23 TB; Scale: 1 TB ←→ 6 TB.
  14. Old Architecture: Input Files (S3) → Transform (Spark on EMR - cluster model) → Output Files (S3) → Deliver (Lambda Workers - serverless model) → Ad Platforms.
  15. Performance Did Not Scale Well!! [Chart: Instance Throughput (MB/Hour/Instance) vs. # of instances - ~2,523 MB/hour at 2 instances (the sweet spot), a 35% drop by 14 instances (~1,628), and a 60% drop at 40 instances (~1,015).]
  16. Initial Solution - Multi Cluster: input files (S3) → several independent Spark clusters → output files (S3).
  17. Looking for a Better Solution
  18. Goals: 1. Scale well - up/down quickly. 2. Cost - pay only for processing time. 3. Bursts - be able to handle bursts. A serverless-like system!!!
  19. Old Architecture (recap): Input Files (S3) → Transform (Spark on EMR - cluster model) → Output Files (S3) → Deliver (Lambda Workers) → Ad Platforms.
  20. Serverless Herd Mode - The Details
  21. Task Queue + Isolated Spark Pods
  22. Implementation: a Work Manager (backed by Postgres) pushes tasks to SQS; isolated Spark pods running on EC2 pull tasks from SQS and report when done; Elastic Kubernetes Service (EKS) hosts the pods, with a Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler making the scaling decisions (worker-loop sketch below).
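A minimal sketch of what each pod's worker loop could look like, assuming boto3 and PySpark; the queue URL, message format, and the way completion is reported are assumptions rather than the authors' actual code:

```python
import json
import boto3
from pyspark.sql import SparkSession

# Hypothetical worker loop for one isolated Spark pod: pull a task from SQS,
# process it with a local single-executor Spark session, then delete the
# message so the task is not redelivered (i.e. "report done").
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/spark-tasks"  # placeholder

spark = SparkSession.builder.master("local[*]").appName("spark-pod").getOrCreate()

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        task = json.loads(msg["Body"])            # e.g. {"input": "...", "output": "..."}
        df = spark.read.parquet(task["input"])    # all data for the task is handled locally
        df.write.mode("overwrite").parquet(task["output"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Because every pod owns a whole task and runs Spark locally, there is no cross-node shuffle and a slow task only delays its own pod, which is what the next two slides point out.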
  23. No Shuffle
  24. Skewed Data Has No Effect ● Single executor ● All data available locally ● No executors waiting for others to finish
  25. Results: Scale - tasks, Spark pods, and EC2 instances scale according to demand!
  26. Results: Burst - [Chart: tasks, Spark pods, and EC2 instances ramping up to absorb a burst.]
  27. MB per Instance Hour - [Chart: Instance Throughput (MB/Hour/Instance) vs. # of instances for Spark pods: roughly flat, rising from ~3,172 MB/hour at 2 instances to ~4,400-4,600 MB/hour and holding through 14 instances.]
  28. Spark Pods vs. Spark Cluster (using similar-cost instances): [Chart: Spark pods hold ~4,300-4,600 MB/hour/instance as instances are added, while the Spark cluster falls from ~2,523 at its sweet spot to ~1,628.]
  29. Costs, before vs. after: 55% savings! ($15,000 / year)
  30. Looking Under the Hood
  31. Instance Choice - Type: m5d.2xlarge, m5ad.2xlarge, m5dn.2xlarge (8 vCPU, 32 GB memory); each EC2 instance hosts 5 pods with 5 GB of memory each.
  32. Spark Config (illustrative sketch below)
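The actual configuration on this slide was not captured in the transcript. As an illustration only, a single-pod setup consistent with the neighboring slides (one executor, data handled locally, ~5 GB of pod memory) might look roughly like this; every value is an assumption:

```python
from pyspark.sql import SparkSession

# Illustrative single-pod Spark configuration: local mode means one JVM,
# no cluster manager and no network shuffle, with memory kept under the
# ~5 GB pod limit from the "Instance Choice" slide.
spark = (
    SparkSession.builder
    .master("local[*]")                           # single process, all cores of the pod
    .appName("spark-pod")
    .config("spark.driver.memory", "4g")          # headroom below the 5 GB pod limit
    .config("spark.sql.shuffle.partitions", "8")  # small per-task data, few partitions
    .getOrCreate()
)
```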
  33. Setting Up the Horizontal Pod Autoscaler: Desired_Pods = ceil(number of tasks in queue / 2). Example: Desired_Pods = 2, Current_Pods = 1 → add 1 pod! Task wait time: ~2 processing cycles. (Sketch below.)
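The slide's scaling rule, written out as a small sketch; the min/max bounds are assumptions, and in practice the queue length would be fed to the HPA as an external or custom metric:

```python
import math

TASKS_PER_POD = 2  # from the slide: Desired_Pods = ceil(tasks_in_queue / 2)

def desired_pods(tasks_in_queue: int, min_pods: int = 1, max_pods: int = 100) -> int:
    """The slide's formula plus simple (assumed) lower and upper bounds."""
    desired = math.ceil(tasks_in_queue / TASKS_PER_POD)
    return max(min_pods, min(desired, max_pods))

# Example from the slide: 3-4 queued tasks with 1 running pod
# gives desired_pods == 2, so the HPA adds one pod.
assert desired_pods(3) == 2
assert desired_pods(4) == 2
```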
  34. Scale-Down Problem: Desired_Pods = 2, Current_Pods = 4 → remove 2 pods! Problem: a pod may be killed while still processing a task. Lost per day → 240 tasks == 34 processing hours.
  35. Graceful Shutdown of Pods: with a plain Kubernetes KILL the pod terminates immediately; with a graceful kill the pod finishes its current task and then terminates.
  36. Implementation - in the Kubernetes pod spec: terminationGracePeriodSeconds: 3600 plus a lifecycle preStop hook: kill -2 1; while [ -f /tmp/MsgFile ]; do sleep 1; done. Kubernetes tells the pod to prepare to die, the application completes the task in progress and signals that it is ready to die, and only then does the pod terminate (application-side sketch below).
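On the application side, the handshake implied by that preStop hook could look roughly like the Python sketch below: the pod creates a marker file at startup, and on SIGINT it finishes the current task and removes the file so the preStop loop can exit. The file path matches the hook above, but the rest of the protocol is an assumption:

```python
import os
import signal
import time

MARKER_FILE = "/tmp/MsgFile"   # the preStop hook waits while this file exists
shutting_down = False

def on_sigint(signum, frame):
    """preStop sends SIGINT (kill -2 1): stop taking new tasks, finish the current one."""
    global shutting_down
    shutting_down = True

def process_next_task():
    """Placeholder for one iteration of the SQS worker loop sketched earlier."""
    time.sleep(1)

signal.signal(signal.SIGINT, on_sigint)
open(MARKER_FILE, "w").close()       # "I am alive": created when the pod starts

while not shutting_down:
    process_next_task()              # the task in progress always runs to completion

os.remove(MARKER_FILE)               # "ready to die": lets the preStop loop return
```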
  37. Accessing the Spark UI
  38. Spark UI of a Specific Pod
  39. What Is It Good For?
  40. ML Classification of Videos & Images
  41. Bursty Data Pipelines
  42. ML Training on Large, Divisible Datasets
  43. Spark Serverless Herd ● Ability to handle bursts!! ● Cost effectiveness!! Consider trying out the methodology!
  44. Feedback: Your feedback is important to us. Don't forget to rate and review the sessions.
  45. Find Us At: .../in/opher-dubrovsky, .../in/ilai-malka. Watch the video: /bit.ly/TMA-Serverless. Blog: medium.com/nmc-techblog/tagged/big-data
