
Spark Tuning (BIDA, May 11, 2016)

This is a version of a talk I presented at Spark Summit East 2016 with Rachel Warren. In this version, I also discuss memory management on the JVM with pictures from Alexey Grishchenko, Sandy Ryza, and Mark Grover.



  1. 1. Spark Tuning for Enterprise System Administrators Anya T. Bida, PhD Rachel B. Warren
  2. 2. Don't worry about missing something... Video: https://www.youtube.com/watch?v=DNWaMR8uKDc&feature=youtu.be Presentation: http://www.slideshare.net/anyabida Cheat-sheet: http://techsuppdiva.github.io/ Anya: https://www.linkedin.com/in/anyabida Rachel: https://www.linkedin.com/in/rachelbwarren !2
  3. 3. About Anya: Operations Engineer. About Rachel: Spark & Scala Enthusiast / Data Engineer, Alpine Data (alpinenow.com)
  4. 4. About You* Intermittent Reliable Optimal Spark practitioners mySparkApp Success *
  5. 5. Intermittent Reliable Optimal mySparkApp Success
  6. 6. Default != Recommended Example: By default, spark.executor.memory = 1g 1g allows small jobs to finish out of the box. Spark assumes you'll increase this parameter.
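As a sketch of what raising this parameter looks like in practice, it can go in spark-defaults.conf or be passed per job at submit time (the 4g value below is purely illustrative; size it against your pool limits):

```
# spark-defaults.conf (illustrative value; size against your cluster's pool limits)
spark.executor.memory  4g
```

The same setting can be passed per job with spark-submit's --conf flag instead of changing cluster-wide defaults.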
 !6
  7. 7. Which parameters are important? ! How do I configure them? !7 Default != Recommended
  8. 8. The Spark Tuning Cheat-Sheet: Filter* data before an expensive reduce or aggregation. Consider* coalesce(). Use* data structures that require less memory. Serialize*: PySpark serializing is built-in; Scala/Java? persist(StorageLevel.[*]_SER). Recommended: kryoserializer. (* See "Optimize partitions," "GC investigation," "Checkpointing," and tuning.html#tuning-data-structures.)
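For the serialization bullet, a minimal configuration sketch (the KryoSerializer class name is Spark's own; the registrator class below is a hypothetical example, not something from this deck):

```
# spark-defaults.conf — serialization settings from the cheat-sheet
spark.serializer        org.apache.spark.serializer.KryoSerializer
# Optionally register application classes for smaller serialized output;
# com.example.MyKryoRegistrator is a hypothetical placeholder:
# spark.kryo.registrator  com.example.MyKryoRegistrator
```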
  9. 9. Intermittent Reliable Optimal mySparkApp Success Memory trouble Initial config
  11. 11. !11 How many in the audience have their own cluster?
  13. 13. Fair Schedulers !13
 YARN:
 <allocations>
   <queue name="sample_queue">
     <minResources>4000 mb,0vcores</minResources>
     <maxResources>8000 mb,8vcores</maxResources>
     <maxRunningApps>10</maxRunningApps>
     <weight>2.0</weight>
     <schedulingPolicy>fair</schedulingPolicy>
   </queue>
 </allocations>
 SPARK:
 <allocations>
   <pool name="sample_queue">
     <schedulingMode>FAIR</schedulingMode>
     <weight>1</weight>
     <minShare>2</minShare>
   </pool>
 </allocations>
  17. 17. Fair Schedulers !17
 YARN:
 <allocations>
   <queue name="sample_queue">
     <minResources>4000 mb,0vcores</minResources>
     <maxResources>8000 mb,8vcores</maxResources>
     <maxRunningApps>10</maxRunningApps>
     <weight>2.0</weight>
     <schedulingPolicy>fair</schedulingPolicy>
   </queue>
 </allocations>
 SPARK:
 <allocations>
   <pool name="sample_queue">
     <schedulingMode>FAIR</schedulingMode>
     <weight>1</weight>
     <minShare>2</minShare>
   </pool>
 </allocations>
 Use these parameters!
  18. 18. Fair Schedulers !18
 YARN:
 <allocations>
   <user name="sample_user">
     <maxRunningApps>6</maxRunningApps>
   </user>
   <userMaxAppsDefault>5</userMaxAppsDefault>
 </allocations>
  20. 20. What is the memory limit for mySparkApp? !20
  21. 21. !21 Sidebar: Spark Architecture: Driver, Cluster Manager, Executors. Mark Grover: http://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-by-mark-grover-and-ted-malaska
  22. 22. !22 What is the memory limit for mySparkApp? Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
  24. 24. !24 What is the memory limit for mySparkApp? Max Memory in "pool" x 3/4 = mySparkApp_mem_limit. Limitation: <maxResources>___mb</maxResources>
  25. 25. !25 What is the memory limit for mySparkApp? Max Memory in "pool" x 3/4 = mySparkApp_mem_limit. Reserve 25% for overhead.
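The rule of thumb on this slide can be written out as arithmetic; a small sketch (the 8000 mb figure reuses the sample queue's <maxResources> from the Fair Scheduler slides):

```python
def my_spark_app_mem_limit_mb(pool_max_mb):
    """Rule of thumb from the slides: reserve 25% of the pool for overhead,
    leaving 3/4 of the pool's max memory for mySparkApp."""
    return pool_max_mb * 3 / 4

# With the sample queue's <maxResources>8000 mb</maxResources>:
print(my_spark_app_mem_limit_mb(8000))  # 6000.0
```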
  28. 28. !28 What is the memory limit for mySparkApp? Max Memory in "pool" x 3/4 = mySparkApp_mem_limit; mySparkApp_mem_limit > driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
  30. 30. !30 What is the memory limit for mySparkApp? Max Memory in "pool" x 3/4 = mySparkApp_mem_limit; mySparkApp_mem_limit > driver.memory + (executor.memory x dynamicAllocation.maxExecutors). Limitation: Driver must not be larger than a single node.
  31. 31. !31 [Diagram: a Driver Container inside a YARN node; yarn.nodemanager.resource.memory-mb bounds the node, spark.driver.memory sizes the driver.]
  34. 34. !34 What is the memory limit for mySparkApp? Max Memory in "pool" x 3/4 = mySparkApp_mem_limit; mySparkApp_mem_limit > driver.memory + (executor.memory x dynamicAllocation.maxExecutors). Verify my calculations respect this limitation.
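The verification step can be scripted; a sketch with illustrative values (a 1 g driver and four 1 g executors against a 6000 mb limit, i.e. 3/4 of an 8000 mb pool):

```python
def fits_mem_limit(limit_mb, driver_mb, executor_mb, max_executors):
    """Check the slides' inequality:
    mySparkApp_mem_limit > driver.memory + executor.memory * dynamicAllocation.maxExecutors"""
    return limit_mb > driver_mb + executor_mb * max_executors

# Illustrative: 1g driver + 4 x 1g executors against a 6000 mb limit.
print(fits_mem_limit(6000, 1024, 1024, 4))  # True (1024 + 4096 = 5120 < 6000)
```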
  36. 36. Intermittent Reliable Optimal mySparkApp Success Memory trouble Initial config
  38. 38. mySparkApp memory issues
  39. 39. Reduce the memory needed for mySparkApp. How? Gracefully handle memory limitations. How? mySparkApp memory issues
  41. 41. Reduce the memory needed for mySparkApp. How? Gracefully handle memory limitations. How? mySparkApp memory issues here let's talk about one scenario
  42. 42. Reduce the memory needed for mySparkApp. How? Gracefully handle memory limitations. How? mySparkApp memory issues persist(storageLevel.[*]_SER)
  44. 44. Reduce the memory needed for mySparkApp. How? Gracefully handle memory limitations. How? mySparkApp memory issues persist(storageLevel.[*]_SER) Recommended: kryoserializer *
  48. 48. Spark 1.1-1.5, Recommendation: Increase spark.storage.memoryFraction
  49. 49. Spark 1.1-1.5, Recommendation: Increase spark.storage.memoryFraction. Spark 1.6, Recommendation: UnifiedMemoryManager. Alexey Grishchenko: https://0x0fff.com/spark-memory-management/
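Following Grishchenko's write-up of the Spark 1.6 UnifiedMemoryManager, the split can be sketched numerically (defaults shown are the 1.6 values: 300 MB reserved, spark.memory.fraction = 0.75, spark.memory.storageFraction = 0.5; later releases changed these defaults, so check your version):

```python
RESERVED_MB = 300  # memory Spark 1.6 reserves for internal objects

def unified_memory_mb(heap_mb, memory_fraction=0.75, storage_fraction=0.5):
    """Spark 1.6 UnifiedMemoryManager split: usable = (heap - reserved) * fraction,
    then storage and execution share it, borrowing from each other as needed."""
    usable = (heap_mb - RESERVED_MB) * memory_fraction
    storage = usable * storage_fraction  # soft boundary, not a hard cap
    execution = usable - storage
    return usable, storage, execution

print(unified_memory_mb(4096))  # (2847.0, 1423.5, 1423.5)
```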
  50. 50. [Diagram: an Executor Container inside a YARN node; yarn.nodemanager.resource.memory-mb bounds the node, and the container holds spark.executor.memory plus spark.yarn.executor.memoryOverhead.] Alexey Grishchenko: https://0x0fff.com/spark-memory-management/ Sandy Ryza: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
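The container sizing in this diagram can be written as arithmetic; a sketch (the default-overhead formula below, max(384 MB, 10% of executor memory), matches the Spark 1.6-era docs; earlier releases used a smaller percentage, so verify against your version):

```python
def yarn_container_mb(executor_mem_mb, overhead_mb=None):
    """Total YARN container request for one executor:
    spark.executor.memory + spark.yarn.executor.memoryOverhead.
    If the overhead is unset, approximate the 1.6-era default."""
    if overhead_mb is None:
        overhead_mb = max(384, int(0.10 * executor_mem_mb))
    return executor_mem_mb + overhead_mb

print(yarn_container_mb(4096))  # 4505 (4096 + 409 overhead)
```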
  51. 51. Sidebar: Spark Architecture [Diagram: Driver, Cluster Manager, and multiple Executor containers, each labeled with yarn.nodemanager.resource.memory-mb, spark.yarn.executor.memoryOverhead, and spark.executor.memory.]
  53. 53. Intermittent Reliable Optimal mySparkApp Success Memory trouble Initial config Instead of 2.5 hours, myApp completes in 1 hour.
  54. 54. Cheat-sheet techsuppdiva.github.io/
  55. 55. Intermittent Reliable Optimal mySparkApp Success Memory trouble Initial config HighPerformanceSpark.com
  56. 56. Further Reading: !58
 • Spark Tuning Cheat-sheet: techsuppdiva.github.io
 • Apache Spark Documentation: https://spark.apache.org/docs/latest
 • Checkpointing: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing and https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-rdd-checkpointing.adoc
 • Learning Spark, by H. Karau, A. Konwinski, P. Wendell, M. Zaharia, 2015
  57. 57. More Questions? !59 Video: https://www.youtube.com/watch?v=DNWaMR8uKDc&feature=youtu.be Presentation: http://www.slideshare.net/anyabida Cheat-sheet: http://techsuppdiva.github.io/ Anya: https://www.linkedin.com/in/anyabida Rachel: https://www.linkedin.com/in/rachelbwarren
 Thanks!
