Improving Apache Spark for Dynamic Allocation and Spot Instances

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Loading in …3
×

Check these out next

1 of 16 Ad

Improving Apache Spark for Dynamic Allocation and Spot Instances

Download to read offline

This presentation will explore the new work in Spark 3.1 adding the concept of graceful decommissioning, and how we can use it to improve Spark’s performance with both dynamic allocation and spot/preemptible instances. Together we’ll explore how Spark’s dynamic allocation has evolved over time, and why the different changes have been needed. We’ll also look at the multi-company collaboration that made it possible to deliver this feature, and I’ll end with encouraging pointers on how to get more involved in Spark’s development.


  1. Improving Spark for Dynamic Allocation & Spot Instances. Holden Karau | Data / AI Summit | @holdenkarau. (Apple logo is a trademark of Apple Inc.)
  2. Who am I? • Holden Karau • She / her • Apache Spark PMC • Contributor to a lot of other projects • Co-author of High Performance Spark, Learning Spark, and Kubeflow for Machine Learning • http://bit.ly/holdenSparkVideos • https://youtube.com/user/holdenkarau
  3. Apple logo is a trademark of Apple Inc.
  4. Let us start at the beginning • Spark achieves resilience through re-computation, which is part of how we go fast • This poses challenges with removing executors that may contain data • We "solved" it for YARN/Mesos back in the day • I drank waaaay too much coffee and came up with an alternative • But no one really liked it because we didn't need it, so I closed the Google doc and forgot about it • Don’t worry, we’ll get to the code soon :)
  5. But then…. • The "cloud" became really popular • Kubernetes became popular • Everything caught on fire :/
  6. Our Protagonist Remembers • I started drinking a lot of coffee • We dusted off that old design and wrote some code • And then I got hit by a car • More people wrote more code • We had a VOTE • We wrote waaaaay more code • Everyone lived happily ever after? Photo by Lukas from Pexels
  7. How did DA work on YARN? • Scale up is "easy" (add more resources) • Scale down required a stay-resident program on each YARN node to serve any files • Spark stored its shuffle data as files • Persisted in-memory data was still lost when scaling down an executor. Photo by Markus Spiske from Pexels
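The YARN-era setup described above hinges on the node-local external shuffle service, which keeps serving shuffle files after an executor is removed. A minimal configuration sketch, with illustrative values in spark-defaults.conf syntax:

```properties
# Classic dynamic allocation on YARN: the external shuffle service on each
# node serves shuffle files even after the executor that wrote them is gone.
spark.dynamicAllocation.enabled          true
spark.shuffle.service.enabled            true
# Illustrative scaling bounds; tune for your workload
spark.dynamicAllocation.minExecutors     1
spark.dynamicAllocation.maxExecutors     50
```

Note that this only protects shuffle files; as the slide says, persisted in-memory blocks are still lost when an executor is scaled down.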
  8. Why did the cloud impact this? • If you wanted the ~50% cost saving of spot/preemptible instances, you might lose entire machines • Yes, Spark can "handle" this, but does so by recomputing data (expensive) • You can't depend on leaving a program around to serve files when the server is just gone • So we need to find a way to migrate the data
  9. Ok sure the cloud, but K8s? • Kubernetes doesn't like the idea of scheduling a stay-resident program on every node • Also, most people don't like the idea of shared disk here either (across jobs/users) • So we need to find a way to migrate the data
  10. SPARK-20624 • Yee-haw! • Ok, but more seriously, how does it work? Great question, let's open up the code • BlockManagerDecommissioner.scala is where most of the magic happens
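The decommissioner's core job is simple to state: before the executor goes away, push each of its blocks to a peer that will outlive it. A heavily simplified, self-contained sketch of that loop (names like `Peer` and `migrateAll` are illustrative, not Spark's actual API):

```scala
// Conceptual sketch of decommission-time block migration, loosely modeled
// on what BlockManagerDecommissioner does. NOT Spark code: Block, Peer,
// and migrateAll are made-up names for illustration only.
object MigrationSketch {
  final case class Block(id: String)

  // A surviving executor that can accept replicated blocks.
  final class Peer(val host: String) {
    val stored = scala.collection.mutable.Buffer.empty[Block]
    def replicate(b: Block): Boolean = { stored += b; true }
  }

  // Spread the dying executor's blocks round-robin across live peers,
  // returning any blocks that could not be migrated. The real code also
  // retries failures and treats RDD and shuffle blocks differently.
  def migrateAll(blocks: Seq[Block], peers: Seq[Peer]): Seq[Block] =
    blocks.zipWithIndex.flatMap { case (b, i) =>
      if (peers.isEmpty) Some(b)                       // nowhere to go
      else if (peers(i % peers.size).replicate(b)) None // migrated
      else Some(b)                                      // left behind
    }

  def main(args: Array[String]): Unit = {
    val peers = Seq(new Peer("exec-2"), new Peer("exec-3"))
    val leftover =
      migrateAll(Seq(Block("rdd_0_0"), Block("shuffle_1_0_0")), peers)
    println(s"blocks left behind: ${leftover.size}")
  }
}
```

The real implementation runs this as background threads while the executor is in a decommissioning state, and can also fall back to external storage for shuffle blocks, as the later configuration slide shows.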
  11. Collaboration http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Decommissioning-SPIP-td29701.html https://github.com/apache/spark/pulls?q=is%3Apr+decommission+is%3Aclosed+
  12. Ok what about the car? Getting hit by a car sucks a lot. It slowed down dev work while I did rehab to be able to walk & type again. Shout out to everyone who helped me recover (from my wife, girlfriend, partners, and my friends, to the hospital staff, nursing home, PT, OT, ambulance, my employer for giving me time off, and the Spark community for understanding I needed time off <3)
  13. On a Happy Note: You can try this now (it’s early though, so please be careful) • Enable the following: - spark.decommission.enabled - spark.storage.decommission.enabled - spark.storage.decommission.rddBlocks.enabled - spark.storage.decommission.shuffleBlocks.enabled • Want to get fancy? Optionally enable: - spark.shuffle.externalStorage.enabled - And configure a storage backend (spark.shuffle.externalStorage.backend)
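The flags listed on the slide can be dropped into spark-defaults.conf (or passed individually with `--conf` at submit time). A sketch with illustrative values, using only the config names the slide gives:

```properties
# Spark 3.1+ graceful decommissioning, per the slide
spark.decommission.enabled                        true
spark.storage.decommission.enabled                true
spark.storage.decommission.rddBlocks.enabled      true
spark.storage.decommission.shuffleBlocks.enabled  true
```

The optional external-storage flags from the slide would follow the same pattern, with a backend-specific value for spark.shuffle.externalStorage.backend.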
  14. Future work • Heuristics to migrate data • Improve container pre-emption selection • Better heuristics around when to scale up and down containers
  15. Please review this talk :)
  16. TM and © 2021 Apple Inc. All rights reserved.
