
Reliable Performance at Scale with Apache Spark on Kubernetes

Kubernetes is an open-source containerization framework that makes it easy to manage applications in isolated environments at scale. In Apache Spark 2.3, Spark introduced support for native integration with Kubernetes. Palantir has been deeply involved with the development of Spark’s Kubernetes integration from the beginning, and our largest production deployment now runs an average of ~5 million Spark pods per day, as part of tens of thousands of Spark applications.

Over the course of our adventures in migrating deployments from YARN to Kubernetes, we have overcome a number of performance, cost, & reliability hurdles: differences in shuffle performance due to smaller filesystem caches in containers; Kubernetes CPU limits causing inadvertent throttling of containers that run many Java threads; and lack of support for dynamic allocation leading to resource wastage. We intend to briefly describe our story of developing & deploying Spark-on-Kubernetes, as well as lessons learned from deploying containerized Spark applications in production.
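As a concrete illustration of the CPU-throttling point above, the sketch below shows the standard Spark-on-Kubernetes CPU settings involved. The values are illustrative, not the configuration Palantir runs, and in practice they are usually passed as --conf flags to spark-submit.

    import org.apache.spark.SparkConf

    // Illustrative values only: each executor schedules 2 concurrent tasks and
    // requests 2 CPUs from Kubernetes, but is allowed to burst to 4 CPUs so that
    // JVM-internal threads (GC, shuffle I/O, netty) are not CFS-throttled by a
    // limit equal to the request. Omitting the limit entirely is another option
    // where cluster policy allows it.
    val conf = new SparkConf()
      .set("spark.executor.cores", "2")
      .set("spark.kubernetes.executor.request.cores", "2")
      .set("spark.kubernetes.executor.limit.cores", "4")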

We will also describe our recently open-sourced extension (https://github.com/palantir/k8s-spark-scheduler) to the Kubernetes scheduler to better support Spark workloads & facilitate Spark-aware cluster autoscaling; our limited implementation of dynamic allocation on Kubernetes; and ongoing work that is required to support dynamic resource management & stable performance at scale (i.e., our work with the community on a pluggable external shuffle service API). Our hope is that our lessons learned and ongoing work will help other community members who want to use Spark on Kubernetes for their own workloads.

Reliable Performance at Scale with Apache Spark on Kubernetes

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Will Manning + Matt Cheah, Palantir Technologies. Reliable Performance at Scale with Spark on Kubernetes. #UnifiedDataAnalytics #SparkAISummit
  3. About us: Will Manning. Joined Palantir. First production adopter of YARN & Parquet. Helped form our Data Flows & Analytics product groups. Responsible for Engineering and Architecture for Compute (including Spark). (Timeline on slide: 2013, 2015, 2016.)
  4. About us: Matt Cheah. Joined Palantir. Migrated Spark cluster management from standalone to YARN, and later from YARN to Kubernetes. Spark committer / open source Spark developer. (Timeline on slide: 2014, 2018.)
  5. Agenda: 1. A (Very) Quick Primer on Palantir; 2. Why We Moved from YARN; 3. Spark on Kubernetes; 4. Key Production Challenges (Kubernetes Scheduling, Shuffle Resiliency).
  6. A (Very) Quick Primer on Palantir … and how Spark on Kubernetes helps power Palantir Foundry
  7. Who are we? Headquartered: Palo Alto, CA. Presence: Global. Employees: ~2500 / Mostly Engineers. Founded: 2004. Software / Data Integration.
  8. Supporting Counterterrorism: From intelligence operations to mission planning in the field.
  9. Energy: Institutions must evolve or die. Technology and data are driving this evolution.
  10. Manufacturing: Ferrari uses Palantir Foundry to increase performance + reliability.
  11. Aviation Safety: Palantir and Airbus founded Skywise to help make air travel safer and more economical.
  12. Cancer Research: Syntropy brings together the greatest minds and institutions to advance research toward the common goal of improving human lives.
  13. Products Built for a Purpose: Integrate, manage, secure, and analyze all of your enterprise data. Amplify and extend the power of data integration.
  14. Enabling analytics in Palantir Foundry: executing untrusted code on behalf of trusted users in a multitenant environment. – Users can author code (e.g., using Spark SQL or pySpark) to define complex data transformations or to perform analysis. – Our users want to write code once and have it keep working the same way indefinitely.
  15. Enabling analytics in Palantir Foundry (continued). – Users can author code (e.g., using Spark SQL or pySpark) to define complex data transformations or to perform analysis. – Our users want to write code once and have it keep working the same way indefinitely. – Foundry is responsible for executing arbitrary code on users’ behalf securely. – Even though the customer might trust the user, Palantir infrastructure can’t.
  16. Enabling collaboration across organizations: using Spark on Kubernetes to enable multitenant compute. (Diagram: Engineers from Airline A, Airbus Employees, Engineers from Airline B.)
  17. Enabling analytics in Palantir Foundry (same points as slide 15), highlighting the two requirements this talk focuses on: Repeatable Performance and Security.
  18. Why We Moved from YARN
  19. Hadoop/YARN Security. Two modes: 1. Kerberos
  20. Hadoop/YARN Security. Two modes: 1. Kerberos; 2. None (only mode until Hadoop 2.x). NB: I recommend reading Steve Loughran’s “Madness Beyond the Gate” to learn more.
  21. Hadoop/YARN Security. Containerization as of 3.1.1 (late 2018).
  22. Performance in YARN. YARN’s scheduler attempts to maximize utilization.
  23. Performance in YARN. YARN’s scheduler attempts to maximize utilization. Spark on YARN with dynamic allocation is great at: – extracting maximum performance from static resources – providing bursts of resources for one-off batch work – running “just one more thing”
  24. Performance in YARN. YARN’s scheduler attempts to maximize utilization. Spark on YARN with dynamic allocation is great at: – extracting maximum performance from static resources – providing bursts of resources for one-off batch work – running “just one more thing”
  25. YARN: Clown Car Scheduling. (Image Credit: 20th Century Fox Television)
  26. Performance in YARN. YARN’s scheduler attempts to maximize utilization. Spark on YARN with dynamic allocation is terrible at: – providing consistency from run to run – isolating performance between different users/tenants (i.e., if you kick off a big job, then my job is likely to run slower)
  27. So… Kubernetes? ✓ Native containerization ✓ Extreme extensibility (e.g., scheduler, networking/firewalls) ✓ Active community with a fast-moving code base ✓ Single platform for microservices and compute* (*Spoiler alert: the Kubernetes scheduler is excellent for web services, not optimized for batch)
  28. Spark on Kubernetes
  29. Timeline: Sep ’16: begin initial prototype of Spark on K8s; Nov ’16: establish working group with community; Mar ’17: minimal integration complete, begin experimental deployment; Oct ’17: first PR merged into upstream master; Jan ’18: begin first migration from YARN to K8s; Jun ’18: complete first migration from YARN to K8s in production.
  30. Spark on K8s Architecture. Client runs spark-submit with arguments; spark-submit converts the arguments into a PodSpec for the driver.
  31. Spark on K8s Architecture. Client runs spark-submit with arguments; spark-submit converts the arguments into a PodSpec for the driver. A K8s-specific implementation of the SchedulerBackend interface requests executor pods in batches.
  32. Spark on K8s Architecture (CREATE DRIVER POD). spark-submit input: … --master k8s://example.com:8443 --conf spark.kubernetes.image=example.com/appImage --conf … com.palantir.spark.app.main.Main. spark-submit output (driver PodSpec): … spec: containers: - name: example image: example.com/appImage command: ["/opt/spark/entrypoint.sh"] args: ["--driver", "--class", "com.palantir.spark.app.main.Main"] …
  33. Spark on K8s Architecture (diagram; spark.executor.instances = 4, spark.kubernetes.allocation.batch.size = 2): the driver pod issues CREATE for the first batch of 2 executor pods. (See the batch-allocation config sketch after the transcript.)
  34. Spark on K8s Architecture (diagram): executor pods #1 and #2 come up and REGISTER with the driver.
  35. Spark on K8s Architecture (diagram): the driver issues CREATE for 2 more executor pods.
  36. Spark on K8s Architecture (diagram): executor pods #3 and #4 also REGISTER with the driver.
  37. Spark on K8s Architecture (diagram): the driver sends RUN TASKS to all four registered executor pods.
  38. Early Challenge: Disk Performance (diagram, YARN): a YARN Node Manager with memory M hosts two executors with heap E each, leaving an OS file system cache of up to M – 2E in front of the file system.
  39. Early Challenge: Disk Performance (diagram, Kubernetes): each executor pod has memory C with executor heap E = ~C, so the per-pod OS file system cache is C – E = ~0.
  40. Early Challenge: Disk Performance. The OS filesystem cache is now container-local, i.e., dramatically smaller, so disks must be fast without hitting the FS cache. Solution: use NVMe drives for temp storage. The Docker disk interface is slower than direct disk access. Solution: use EmptyDir volumes for temp storage. (See the local-storage config sketch after the transcript.)
  41. Key Production Challenges
  42. Challenge I: Kubernetes Scheduling. – The built-in Kubernetes scheduler isn’t really designed for distributed batch workloads (e.g., MapReduce). – Historically, optimized for microservice instances or single-pod, one-off jobs (what k8s natively supports).
  43. Challenge II: Shuffle Resiliency. – The External Shuffle Service is unavailable in Kubernetes. – Jobs must be written more carefully to avoid executor failure (e.g., OOM) and the subsequent need to recompute lost blocks.
  44. Kubernetes Scheduling
  45. Reliable, Repeatable Runtimes. Recall: a key goal is to make runtimes of the same workload consistent from run to run.
  46. Reliable, Repeatable Runtimes. Recall: a key goal is to make runtimes of the same workload consistent from run to run. Corollary: when a driver is deployed and wants to do work, it should receive the same resources from run to run.
  47. Reliable, Repeatable Runtimes. Recall: a key goal is to make runtimes of the same workload consistent from run to run. Corollary: when a driver is deployed and wants to do work, it should receive the same resources from run to run. Problem: using the vanilla k8s scheduler led to partial starvation as the cluster became saturated.
  48. Kubernetes Scheduling (diagram): round 1 begins with pods P1, P2, and P3 in the scheduling queue and nothing running yet.
  49. Kubernetes Scheduling (diagram): P1 and P3 fit on the cluster and move to the running pods; P2 does not.
  50. Kubernetes Scheduling (diagram): P2 goes “back to the end of the line!” and must wait for round 2.
  51. Kubernetes Scheduling (diagram): round 2; P1 and P3 are running and P2 is “still waiting…”.
  52. Naïve Spark-on-K8s Scheduling (diagram): round 1; a driver pod D enters the scheduling queue.
  53. Naïve Spark-on-K8s Scheduling (diagram): D is scheduled and starts running.
  54. Naïve Spark-on-K8s Scheduling (diagram): round 2; the running driver D requests executors E1 and E2, which enter the scheduling queue.
  55. Naïve Spark-on-K8s Scheduling (diagram): E1 and E2 are scheduled alongside D, and the application can run.
  56. The “1000 Drivers” Problem (diagram): round 1; drivers D1, D2, …, D1000 all enter the scheduling queue.
  57. The “1000 Drivers” Problem (diagram): D1 is scheduled and starts running while D2 … D1000 wait in the queue.
  58. The “1000 Drivers” Problem (diagram): round 2; D1’s executors E1 … EN join the queue behind D2 … D1000.
  59. The “1000 Drivers” Problem (diagram): as more drivers are scheduled, every driver’s executors queue behind the remaining drivers, so running drivers hold resources without being able to do any work. (Uh oh!)
  60. So… Kubernetes? ✓ Native containerization ✓ Extreme extensibility (e.g., scheduler, networking/firewalls) ✓ Active community with a fast-moving code base ✓ Single platform for microservices and compute* (*Spoiler alert: the Kubernetes scheduler is excellent for web services, not optimized for batch)
  61. K8s Spark Scheduler. Idea: use the Kubernetes scheduler extender API to add – Gang scheduling of drivers & executors – FIFO (within instance groups)
  62. K8s Spark Scheduler. Goal: build entirely with native K8s extension points. – Scheduler extender API: fn(pod, [node]) -> Option[node] – Custom resource definition: ResourceReservation – Driver annotations for resource requests (executors). (See the annotation example after the transcript.)
  63. K8s Spark Scheduler (flow): get cluster usage (running pods + reservations) -> bin pack pending resource requests in FIFO order -> if driver: reserve resources (with the CRD); if executor: find an unbound reservation and bind it. (See the gang-scheduling sketch after the transcript.)
  64. K8s Spark Scheduler (diagram): in round 1, D1 is scheduled and reservations R1 and R2 are created for its executors; in round 2+, E1 and E2 bind to those reservations instead of queueing behind D2 … D1000.
  65. Spark-Aware Autoscaling. Idea: use the resource request annotations for unscheduled drivers to project the desired cluster size. – Again, we use a CRD to represent this unsatisfied Demand. – Sum resources for pods + reservations + demand. – In the interest of time, out of scope for today.
  66. Spark pods per day
  67. Pod processing time
  68. “Soft” dynamic allocation. Problem: static allocation wastes resources.
  69. “Soft” dynamic allocation. Problem: static allocation wastes resources. Idea: voluntarily give up executors that aren’t needed, with no preemption (same from run to run).
  70. “Soft” dynamic allocation. Problem: static allocation wastes resources. Idea: voluntarily give up executors that aren’t needed, with no preemption (same from run to run). Problem: no external shuffle service, so executors store shuffle files.
  71. “Soft” dynamic allocation. Problem: static allocation wastes resources. Idea: voluntarily give up executors that aren’t needed, with no preemption (same from run to run). Problem: no external shuffle service, so executors store shuffle files. Idea: the driver already tracks shuffle file locations, so it can determine which executors are safe to give up.
  72. “Soft” dynamic allocation. – Saves $$$$, with ~no runtime variance if consistent from run to run. – Inspired by a 2018 Databricks blog post [1]. – Merged into our fork in ~January 2019. – Recently adapted by @vanzin (thanks!) and merged upstream [2]. (See the shuffle-tracking config sketch after the transcript.) [1] https://databricks.com/blog/2018/05/02/introducing-databricks-optimized-auto-scaling.html [2] https://github.com/apache/spark/pull/24817
  73. K8s Spark Scheduler. See our engineering blog on Medium – https://medium.com/palantir/spark-scheduling-in-kubernetes-4976333235f3 – or check out the source for yourself (Apache v2 license): https://github.com/palantir/k8s-spark-scheduler
  74. Shuffle Resiliency
  75. Shuffle Resiliency. Shuffles have a map side and a reduce side: mapper executors write temporary data to local disk, and reducer executors contact mapper executors to retrieve the written data. (Diagram: mapper and reducer executors on YARN Node Managers, each with local disk.)
  76. Shuffle Resiliency. If an executor crashes, all data written by that executor’s map tasks are lost: Spark must re-schedule the map tasks on other executors, and we cannot remove executors to save resources because we would lose shuffle files.
  77. Shuffle Resiliency. Spark’s external shuffle service preserves shuffle data in cases of executor loss: it is required to make preemption non-pathological and prevents loss of work when executors crash. (Diagram: a shuffle service in each YARN Node Manager serving shuffle files from local disk. See the YARN config sketch after the transcript.)
  78. Shuffle Resiliency. Problem: for security, containers have isolated storage. (Diagram: on YARN, the shuffle service and the mapper executor share the Node Manager’s local disk; on Kubernetes, each executor pod and shuffle service pod has its own local disk.)
  79. Shuffle Resiliency. Idea: asynchronously back up shuffle files to a distributed storage system. (Diagram: inside an executor pod, the map task thread writes to local disk while a backup thread uploads the files. See the backup sketch after the transcript.)
  80. Shuffle Resiliency. 1. Reducers first try to fetch from other executors. 2. They download from remote storage if the mapper is unreachable. (Diagram: a reducer pod fetching from a live mapper pod directly, and from remote storage when the mapper pod is dead. See the fetch-fallback sketch after the transcript.)
  81. Shuffle Resiliency. – Wanted to generalize the framework for storing shuffle data in arbitrary storage systems. – API in progress: https://issues.apache.org/jira/browse/SPARK-25299 – Goal: open source the asynchronous backup strategy by the end of 2019.
  82. Thanks to the team!
  83. Q&A
  84. DON’T FORGET TO RATE AND REVIEW THE SESSIONS. SEARCH: SPARK + AI SUMMIT
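To accompany slides 30-37 (the executor-allocation walkthrough), a minimal sketch of the two settings that drive it. The instance count and batch size are the values shown on the slides; the master URL and image are illustrative, and the upstream config name spark.kubernetes.container.image is used for the image setting.

    import org.apache.spark.SparkConf

    // spark.executor.instances = 4 and allocation.batch.size = 2, as in the
    // slide 33-37 animation: the driver's Kubernetes scheduler backend creates
    // the four executor pods two at a time.
    val conf = new SparkConf()
      .setMaster("k8s://example.com:8443")                              // illustrative API server
      .set("spark.kubernetes.container.image", "example.com/appImage")  // illustrative image
      .set("spark.executor.instances", "4")
      .set("spark.kubernetes.allocation.batch.size", "2")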
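For slide 40, a hedged sketch of pointing Spark's scratch space at NVMe-backed local storage using the standard spark.kubernetes.*.volumes options. The host path /mnt/nvme0 is invented for illustration; by default Spark on Kubernetes backs its local directories with emptyDir volumes, and in newer Spark releases volumes named spark-local-dir-* are picked up as local scratch space.

    import org.apache.spark.SparkConf

    // Illustrative only: mount an NVMe-backed hostPath volume into each executor
    // pod so shuffle and spill files land on fast local disks.
    val conf = new SparkConf()
      .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path", "/tmp/spark-local")
      .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly", "false")
      .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path", "/mnt/nvme0")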
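Slide 62's "driver annotations for resource requests" can be attached through Spark's generic spark.kubernetes.driver.annotation.* mechanism. The annotation keys below are illustrative placeholders, not necessarily the names k8s-spark-scheduler expects; its README is the authority.

    import org.apache.spark.SparkConf

    // Stamp the driver pod with the resources its executors will need, so the
    // scheduler extender can reserve them up front. Keys are placeholders.
    val conf = new SparkConf()
      .set("spark.kubernetes.driver.annotation.executor-count", "4")
      .set("spark.kubernetes.driver.annotation.executor-cpu", "2")
      .set("spark.kubernetes.driver.annotation.executor-mem", "8g")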
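Slide 63's flow, rendered as a toy Scala sketch of the gang-scheduling decision. The real k8s-spark-scheduler is written in Go, and these types are simplified stand-ins for the extender's wire format: a driver is admitted only if the driver plus all of its future executors can be bin-packed onto the cluster; otherwise the whole application stays queued (the extender processes requests in FIFO order).

    // Toy model only: simplified stand-ins for pods, nodes, and the extender contract.
    final case class Resources(cpu: Int, memMb: Long) {
      def fits(req: Resources): Boolean = cpu >= req.cpu && memMb >= req.memMb
      def minus(req: Resources): Resources = Resources(cpu - req.cpu, memMb - req.memMb)
    }
    final case class Node(name: String, free: Resources)
    final case class SparkApp(driver: Resources, executor: Resources, executorCount: Int)

    // fn(pod, [node]) -> Option[node], approximated here at application granularity:
    // return a placement only if the driver AND every executor can be reserved now.
    def gangSchedule(app: SparkApp, nodes: Seq[Node]): Option[Map[String, Int]] = {
      val pods = app.driver +: Seq.fill(app.executorCount)(app.executor)
      val free = scala.collection.mutable.Map(nodes.map(n => n.name -> n.free): _*)
      val placement = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
      val allPlaced = pods.forall { pod =>
        free.find { case (_, cap) => cap.fits(pod) } match {
          case Some((nodeName, cap)) =>
            free(nodeName) = cap.minus(pod)   // tentatively reserve capacity
            placement(nodeName) += 1
            true
          case None => false                  // one pod doesn't fit: keep the app queued
        }
      }
      if (allPlaced) Some(placement.toMap) else None
    }

Executors that show up later bind to the reservations created by this step (slides 63-64) rather than re-entering the bin pack.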
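The upstream descendant of the "soft" dynamic allocation described on slides 68-72 is shuffle-tracking dynamic allocation (the PR cited on slide 72). A hedged sketch of the relevant settings as they exist in Spark 3.0 and later, with illustrative sizing values:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      // Track which executors still hold live shuffle data instead of relying on an
      // external shuffle service; only executors with no needed shuffle files are released.
      .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      // Illustrative bounds and idle timeout.
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")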
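For contrast with slide 77, the classic YARN setup pairs dynamic allocation with the NodeManager-hosted external shuffle service. These are the standard configs; the auxiliary-service setup on the NodeManagers is assumed and not shown.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Executors register shuffle output with the external shuffle service, so the
      // files remain fetchable even if the executor is preempted or crashes.
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.enabled", "true")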
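Slide 79's idea as a toy Scala sketch: after a map task finishes writing a shuffle file locally, a background thread copies it to remote storage. RemoteShuffleStore is invented for illustration and is not the SPARK-25299 API.

    import java.nio.file.{Files, Path}
    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    // Invented stand-in for "a distributed storage system" (e.g., an object store).
    trait RemoteShuffleStore {
      def put(key: String, bytes: Array[Byte]): Unit
    }

    final class AsyncShuffleBackup(store: RemoteShuffleStore) {
      // A dedicated thread so uploads never block the map task threads.
      private val backupEc =
        ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

      // Called by the map task once a shuffle file has been fully written to local disk.
      def backup(appId: String, shuffleFile: Path): Future[Unit] = Future {
        store.put(s"$appId/${shuffleFile.getFileName}", Files.readAllBytes(shuffleFile))
      }(backupEc)

      def shutdown(): Unit = backupEc.shutdown()
    }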
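And slide 80's read path as the matching toy sketch: a reducer tries the mapper executor first and falls back to the remote copy when that executor is gone. The fetchFromExecutor function and RemoteShuffleReader are invented abstractions, mirroring the backup sketch above.

    import scala.util.{Failure, Success, Try}

    // Invented read-side counterpart to RemoteShuffleStore.
    trait RemoteShuffleReader {
      def get(key: String): Array[Byte]
    }

    // 1. Try the live mapper executor; 2. download from remote storage if unreachable.
    def fetchShuffleBlock(
        key: String,
        fetchFromExecutor: String => Try[Array[Byte]],  // e.g., a network fetch to the mapper pod
        remote: RemoteShuffleReader): Array[Byte] =
      fetchFromExecutor(key) match {
        case Success(bytes) => bytes
        case Failure(_)     => remote.get(key)          // mapper is dead: use the backup copy
      }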
