
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016

by Anya Bida and Rachel Warren from Alpine Data
https://spark-summit.org/east-2016/events/spark-tuning-for-enterprise-system-administrators/

Spark offers the promise of speed, but many enterprises are reluctant to make the leap from Hadoop to Spark. Indeed, system administrators will face many challenges with tuning Spark performance. This talk is a gentle introduction to Spark tuning for the enterprise system administrator, based on experience assisting two enterprise companies running Spark in yarn-cluster mode. The initial challenges can be categorized in two FAQs. First, with so many Spark tuning parameters, how do I know which parameters are important for which jobs? Second, once I know which Spark tuning parameters I need, how do I enforce them for the various users submitting various jobs to my cluster? This introduction to Spark tuning will enable enterprise system administrators to overcome common issues quickly and focus on more advanced challenges. The audience will understand the "cheat-sheet" posted here: http://techsuppdiva.github.io/

Key takeaways:

FAQ 1: With so many Spark tuning parameters, how do I know which parameters are important for which jobs?
Solution 1: The Spark tuning cheat-sheet! A visualization that guides the system administrator to quickly overcome the most common hurdles to algorithm deployment: http://techsuppdiva.github.io/

FAQ 2: Once I know which Spark tuning parameters I need, how do I enforce them at the user level? Job level? Algorithm level? Project level? Cluster level?
Solution 2: We'll approach these challenges using job and cluster configuration, the Spark context, and third-party tools, of which Alpine will be one example. We'll operationalize Spark parameters at the user, job, algorithm, workflow pipeline, or cluster level.


Spark Tuning For Enterprise System Administrators, Spark Summit East 2016

  1. Spark Tuning for Enterprise System Administrators. Anya T. Bida, PhD, and Rachel B. Warren.
  2. Don't worry about missing something... Presentation: http://www.slideshare.net/anyabida Cheat-sheet: http://techsuppdiva.github.io/ Anya: https://www.linkedin.com/in/anyabida Rachel: https://www.linkedin.com/in/rachelbwarren
  3. About Anya: Operations Engineer. About Rachel: Spark & Scala enthusiast / Data Engineer. About Alpine Data (alpinenow.com): Alpine deploys Spark in production for our enterprise customers.
  4. About You* (*enterprise system administrators). (Diagram: mySparkApp success runs from Intermittent to Reliable to Optimal.)
  5. (Diagram: mySparkApp success, from Intermittent to Reliable to Optimal.)
  6. Default != Recommended. Example: by default, spark.executor.memory = 1g. 1g allows small jobs to finish out of the box; Spark assumes you'll increase this parameter.
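A minimal sketch of raising that default at submission time; the 4g figure is an illustrative assumption, not a recommendation from the deck:

    // Equivalent to: spark-submit --executor-memory 4g ...
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("mySparkApp")
      .set("spark.executor.memory", "4g")  // default is 1g; Spark assumes you'll raise it
    val sc = new SparkContext(conf)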
  7. Default != Recommended. Which parameters are important? How do I configure them?
  8. The Spark Tuning Cheat-Sheet:
     • Filter* data before an expensive reduce or aggregation; consider* coalesce().
     • Use* data structures that require less memory.
     • Serialize*: PySpark serializing is built-in; for Scala/Java, persist(storageLevel.[*]_SER). Recommended: kryoserializer.
     Footnotes from the cheat-sheet: tuning.html#tuning-data-structures; see "Optimize partitions"; see "GC investigation"; see "Checkpointing".
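A hedged illustration of the serialization advice above, in spark-shell style (sc predefined); the record type, input path, and storage-level choice are assumptions for the sketch, not prescriptions from the deck:

    import org.apache.spark.storage.StorageLevel

    // Hypothetical record type, used only to illustrate Kryo class registration.
    case class MyRecord(id: Long, value: String)

    // On the SparkConf, before the context starts:
    //   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    //   conf.registerKryoClasses(Array(classOf[MyRecord]))

    val records = sc
      .textFile("hdfs:///data/records.txt")              // hypothetical input path
      .map(line => MyRecord(line.hashCode.toLong, line))

    // _SER storage levels keep partitions as serialized bytes: more CPU, less memory.
    records.persist(StorageLevel.MEMORY_ONLY_SER)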
  9. (Diagram: the success spectrum, now annotated with two hurdles: mySparkApp memory issues and the shared cluster.)
  12. Fair Schedulers

      YARN:

        <allocations>
          <queue name="sample_queue">
            <minResources>4000 mb,0vcores</minResources>
            <maxResources>8000 mb,8vcores</maxResources>
            <maxRunningApps>10</maxRunningApps>
            <weight>2.0</weight>
            <schedulingPolicy>fair</schedulingPolicy>
          </queue>
        </allocations>

      SPARK:

        <allocations>
          <pool name="sample_queue">
            <schedulingMode>FAIR</schedulingMode>
            <weight>1</weight>
            <minShare>2</minShare>
          </pool>
        </allocations>
  13. Fair Schedulers: the same YARN and Spark allocation files as the previous slide, with the callout "Configure these parameters too!"
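A sketch of how an application actually lands in those pools, assuming the names above; spark-submit's --queue flag selects the YARN queue, and the spark.scheduler.pool local property selects the in-application fair-scheduler pool. The allocation-file path is an assumption:

    // Submit to the YARN queue defined in the fair-scheduler allocations:
    //   spark-submit --master yarn --queue sample_queue ... mySparkApp.jar

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("mySparkApp")
      .set("spark.scheduler.mode", "FAIR")
      // Path to the Spark allocations file shown above (location is an assumption):
      .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")

    val sc = new SparkContext(conf)
    // Jobs submitted from this thread now run in the "sample_queue" pool:
    sc.setLocalProperty("spark.scheduler.pool", "sample_queue")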
  14. Fair Schedulers

      YARN:

        <allocations>
          <user name="sample_user">
            <maxRunningApps>6</maxRunningApps>
          </user>
          <userMaxAppsDefault>5</userMaxAppsDefault>
        </allocations>
  15. What is the memory limit for mySparkApp?
  16. What is the memory limit for mySparkApp?
      Max memory in "pool" x 3/4 = mySparkApp_mem_limit (reserve 25% for overhead)
      Limitation: the pool's cap, e.g. <maxResources>8000 mb</maxResources>
  18. What is the memory limit for mySparkApp?
      Max memory in "pool" x 3/4 = mySparkApp_mem_limit
      mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
  19. What is the memory limit for mySparkApp?
      Max memory in "pool" x 3/4 = mySparkApp_mem_limit
      mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
      Limitation: driver and executor memory must not be larger than a single node:
      (yarn.nodemanager.resource.memory-mb - 1 GB) / executor.memory ~ # executors per node
  20. What is the memory limit for mySparkApp?
      Max memory in "pool" x 3/4 = mySparkApp_mem_limit
      mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
      Limitation: maxExecutors should not exceed the pool allocation. YARN: <maxResources>8vcores</maxResources>
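A worked example of the arithmetic above, with entirely hypothetical numbers (the 8000 MB pool from the earlier <maxResources> line, 1g driver and executors, 16 GB NodeManagers):

    // Pool budget, minus the 25% overhead reserve:
    val poolMaxMb     = 8000              // <maxResources>8000 mb, ...</maxResources>
    val appMemLimitMb = poolMaxMb * 3 / 4 // 6000 MB available to mySparkApp

    // Solve the limit equation for dynamicAllocation.maxExecutors:
    val driverMemoryMb   = 1024           // spark.driver.memory = 1g
    val executorMemoryMb = 1024           // spark.executor.memory = 1g
    val maxExecutors = (appMemLimitMb - driverMemoryMb) / executorMemoryMb  // = 4

    // Node-fit check from the previous slide:
    val nodeMemoryMb     = 16384          // yarn.nodemanager.resource.memory-mb
    val executorsPerNode = (nodeMemoryMb - 1024) / executorMemoryMb         // = 15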
  21. I want a little more information...
      Top 5 Mistakes When Writing Spark Applications, by Mark Grover and Ted Malaska of Cloudera: http://www.slideshare.net/hadooparchbook/top-5-mistakes-when-writing-spark-applications
      How-to: Tune Your Apache Spark Jobs (Part 2), by Sandy Ryza of Cloudera: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
      I want lots more...
  23. (Diagram: the success spectrum again, annotated with mySparkApp memory issues and the shared cluster.)
  24. mySparkApp memory issues. Reduce the memory needed for mySparkApp. How? Gracefully handle memory limitations. How?
  25. mySparkApp memory issues. Reduce the memory needed for mySparkApp. How? Here, let's talk about one scenario.
  26. mySparkApp memory issues. Reduce the memory needed for mySparkApp. How? persist(storageLevel.[*]_SER); recommended: kryoserializer (see the serialization sketch after slide 8).
  27. mySparkApp memory issues. Reduce the memory needed for mySparkApp. How? Gracefully handle memory limitations. How?
  28. mySparkApp memory issues. Gracefully handle memory limitations. How? Here, let's talk about one scenario.
  29. Symptoms:
      • mySparkApp has been running for several hours
      • A container is lost; one container fails, then the rest fail one by one
      • The first container to fail was the driver
      • The driver is a SPOF (single point of failure)
  30. Investigate:
      • Driver failures are often caused by collecting unbounded data to the driver.
      • I verified only bounded data is brought to the driver, but still the driver fails intermittently.
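As a hedged aside on "bounded" vs. "unbounded" data at the driver, in spark-shell style (sc predefined; the paths and sizes are assumptions):

    val results = sc.textFile("hdfs:///data/results")  // hypothetical input

    // Unbounded: collect() pulls every element into the driver heap and can OOM it.
    // val everything = results.collect()

    // Bounded alternatives keep the driver's footprint fixed:
    val sample = results.take(1000)                // at most 1000 elements at the driver
    results.saveAsTextFile("hdfs:///out/results")  // or write out from the executors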
  31. Potential Solution: RDD.checkpoint()
      Use in these cases:
      • high-traffic cluster
      • network blips
      • preemption
      • disk space nearly full
      Function: saves the RDD to stable storage (e.g., HDFS or S3).
      How-to: SparkContext.setCheckpointDir(directory: String), then RDD.checkpoint()
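A minimal sketch of that how-to, again spark-shell style; the checkpoint directory and input path are placeholders:

    sc.setCheckpointDir("hdfs:///checkpoints/mySparkApp")  // stable storage, e.g. HDFS or S3

    val cleaned = sc
      .textFile("hdfs:///data/events")  // hypothetical input
      .map(_.trim)                      // stand-in transformation

    cleaned.checkpoint()  // marks the RDD; it is written to the checkpoint dir...
    cleaned.count()       // ...when the next action materializes it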
  32. (Diagram: the success spectrum.) Instead of 2.5 hours, myApp completes in 1 hour.
  33. Cheat-sheet: techsuppdiva.github.io/
  34. (Diagram: the success spectrum.) HighPerformanceSpark.com
  35. Further Reading:
      • Learning Spark, by H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, 2015, O'Reilly: https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
      • Scheduling: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
      • Tuning the Spark conf: Mark Grover and Ted Malaska (Cloudera): http://www.slideshare.net/hadooparchbook/top-5-mistakes-when-writing-spark-applications; Sandy Ryza (Cloudera): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
      • Checkpointing: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
      • Troubleshooting: Miklos Christine (Databricks): https://spark-summit.org/east-2016/events/operational-tips-for-deploying-spark/
      • High Performance Spark, by R. Warren and H. Karau, coming in 2016, O'Reilly: http://highperformancespark.com/
  36. More Questions? Presentation: http://www.slideshare.net/anyabida Cheat-sheet: http://techsuppdiva.github.io/ Anya: https://www.linkedin.com/in/anyabida Rachel: https://www.linkedin.com/in/rachelbwarren

