
Apache Spark Performance is too hard. Let's make it easier


Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will then describe how this is critical to the design of Spark and how this tight interplay allows very efficient execution. We will also discuss various sources of metrics on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. Users and operators who are aware of these concepts will become more effective at their interactions with Spark.


  1. Spark performance is too hard, let's make it easier (Sean Suchter, CTO @ Pepperdata)
  2. Pepperdata does performance (for Big Data): 15 thousand production nodes, 50 million jobs/year, 200 trillion performance data points
  3. Today’s talk will cover…
     • How code translates to execution
     • How to find common, known problems
     • For the rest of the problems…
       – Why debugging performance problems is hard
       – Data elements needed for a complete view of application performance from separate tools
       – Bringing these elements together in a single tool
  4. Brief terminology about Spark
     • An app contains multiple jobs
     • A job contains multiple stages
     • A stage contains multiple tasks
     • Executors run tasks
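     A minimal sketch of how these terms map to code, assuming a SparkContext named sc and hypothetical HDFS paths: each action submits a job, and the shuffle splits each job into stages.

       // Transformations only build the lineage; nothing runs yet.
       val counts = sc.textFile("hdfs:/dict.txt")
         .flatMap(_.split(" "))
         .map((_, 1))
         .reduceByKey(_ + _)                    // shuffle => stage boundary
       // Each action submits one job, executed as tasks on executors.
       counts.count()                           // job 0 (two stages)
       counts.saveAsTextFile("hdfs:/out.txt")   // job 1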
  5. Example App
     A word count app:

       val textFile = sc.textFile("hdfs:/dict.txt")
       val counts = textFile.flatMap(line => line.split(" "))
                            .map(word => (word, 1))
                            .reduceByKey(_ + _)
       counts.saveAsTextFile("hdfs:/wordcounts.txt")

     1. Declares input from external storage
     2. Specifies transformations
     3. Triggers an action
  6. Distributed Architecture
     Spark executes a job using multiple machines: the Spark driver process sends tasks to Spark executor processes 1 through N. (Diagram.)
  7. Stages
     (Diagram mapping the word-count code from slide 5 onto stages.)
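     The stage boundaries Spark infers can also be inspected from code; a minimal sketch, assuming the counts RDD from slide 5:

       // Print the RDD lineage; each indented ShuffledRDD block in the
       // output marks a stage boundary.
       println(counts.toDebugString)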
  8. Shuffle and Re-partitioning
     (Diagram.)
  9. Stages and Tasks in Example Job
     (Diagram: tasks 0…n in the first stage, tasks n+1…n+m in the second.)
  10. Debugging known problems: the easier case…
  11. Spark History Server (screenshot)
  12. Spark History Server (screenshot)
  13. Intro: Dr. Elephant (MapReduce)
  14. What does Dr. Elephant do?
      • Performance monitoring and tuning service
      • Finds common mistakes, indicates best practices
  15. Spark Application Heuristics (screenshot)
  16. Spark Application Heuristics (screenshot)
  17. 3 Classes of Spark Heuristics
      • Configuration Settings
      • Simple Alarms on Stage/Job Failure
      • Data-Dependent Tuning Suggestions
  18. Configuration Heuristic
      • Displays some basic config settings for your app
      • Complains if some settings are not explicitly set
      • Recommends configuring an external shuffle service, especially if dynamic allocation is enabled (see the sketch below)
      • These recommendations won’t change over multiple runs of an application
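      A minimal sketch of that recommendation as Spark configuration (the keys are standard Spark properties; the same settings can go in spark-defaults.conf):

        import org.apache.spark.SparkConf
        // The external shuffle service keeps serving shuffle files even
        // after an executor is reclaimed, which dynamic allocation needs.
        val conf = new SparkConf()
          .set("spark.shuffle.service.enabled", "true")
          .set("spark.dynamicAllocation.enabled", "true")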
  19. Stages and Jobs Heuristics
      • Simple alarms showing stage and job failure rates
      • Good for seeing when there’s a problem
  20. Executors Heuristic
      • Looks at the distribution across executors of several different metrics
      • Outliers in these distributions probably indicate:
        – Suboptimal partitioning (see the sketch below)
        – One or more slow executors due to external circumstances (cluster weather)
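      One way to check for the partitioning problems this heuristic flags is to count records per partition; a minimal sketch, assuming an RDD named counts:

        // A few outsized partitions in this output suggest skew.
        counts.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
          .collect()
          .sortBy(-_._2)
          .take(5)
          .foreach(println)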
  21. Partitions Heuristic
      • Ideally data for each task will fit into the RAM available to that task.
      • Sandy Ryza (formerly of Cloudera) has an excellent blog post on Spark tuning; his heuristic for the number of partitions:

          (observed shuffle write) × (observed shuffle spill memory) × (spark.executor.cores)
          ÷ [(observed shuffle spill disk) × (spark.executor.memory) × (spark.shuffle.memoryFraction) × (spark.shuffle.safetyFraction)]

        http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
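      A worked sketch of that formula with hypothetical observed values (the shuffle numbers would be read off the Spark UI; the configuration values are illustrative):

        // Hypothetical metrics from a run with too few partitions:
        val shuffleWrite   = 4.0e9   // observed shuffle write, bytes
        val spillMemory    = 12.0e9  // observed shuffle spill (memory), bytes
        val spillDisk      = 3.0e9   // observed shuffle spill (disk), bytes
        val executorMemory = 8.0e9   // spark.executor.memory, in bytes
        val executorCores  = 4.0     // spark.executor.cores
        val memoryFraction = 0.2     // spark.shuffle.memoryFraction
        val safetyFraction = 0.8     // spark.shuffle.safetyFraction
        // Total in-memory shuffle size / memory available per task:
        val partitions = (shuffleWrite * spillMemory / spillDisk) /
          (executorMemory * memoryFraction * safetyFraction / executorCores)
        // => 16e9 / 0.32e9 = 50 partitions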
  22. More Heuristics? Yes, please!
      Dr. Elephant is open source: https://github.com/linkedin/dr-elephant
  23. Is there an enterprise version?
  24. Pepperdata Application Profiler
      • Benefits to our users:
        – Provides simple answers to simple questions
        – Combination of metrics for experts
        – Simple actionable insights for all users
        – Pepperdata support
      • Why stay close to open source?
        – Heuristics
  25. Pepperdata Application Profiler (screenshot)
  26. Debugging novel problems: the harder case…
  27. 2 reasons this is hard
  28. Reason #1
      Same external symptom (“too slow”), but many possible causes:
      • code
      • data
      • configuration
      • cluster weather
  29. Reason #2
      Existing tools provide limited visibility:
      • Spark Web UI is the most popular
        – Good view of the query execution plan (jobs/stages/DAG)
        – Limited view of aggregate performance data
      • Time series
        – Ganglia, Ambari, CM, etc. provide time-series data for the cluster (but not specific to Spark apps)
        – Spark Sink metrics can be fed to InfluxDB and others, yielding partial Spark app metrics (see the sketch below)
      • Code execution not connected to resource consumption
      • Load from other apps unaccounted for
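      A minimal sketch of routing Spark’s sink metrics to an external time-series store, assuming a Graphite-compatible endpoint at graphite.example.com (host, port, and period are illustrative; the canonical place for these settings is a metrics.properties file):

        import org.apache.spark.SparkConf
        // Send driver and executor metrics to Graphite every 10 seconds.
        val conf = new SparkConf()
          .set("spark.metrics.conf.*.sink.graphite.class",
               "org.apache.spark.metrics.sink.GraphiteSink")
          .set("spark.metrics.conf.*.sink.graphite.host", "graphite.example.com")
          .set("spark.metrics.conf.*.sink.graphite.port", "2003")
          .set("spark.metrics.conf.*.sink.graphite.period", "10")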
  30. 3 data elements form a complete picture of Spark application performance
      1. Code execution plan
         – Indicates which block of code is being executed, and where
      2. Time series view
         – Visual of the application’s resource consumption
         – Outliers in resource usage are very easy to detect
      3. Cluster weather
         – A view of all applications that run on the cluster
  31. Spark Web UI: first half of the solution
  32. Logical code execution plan from Spark: Jobs / Stages / DAG
  33. Physical execution plan from Spark: Executors / Tasks
  34. Time series view: second half of the solution
  35. Time series view of resource consumption for the app
  36. Bring them together: best of both worlds
  37. Code Analyzer = execution plan + time series
  38. GC across all stages of the app
  39. Let’s examine GC activity in Stage 4
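      One way to dig into a view like this is to turn on GC logging in the executors themselves; a minimal sketch (the JVM flags shown are for JDK 8 and are illustrative):

        import org.apache.spark.SparkConf
        // GC details land in each executor's stdout for later inspection.
        val conf = new SparkConf()
          .set("spark.executor.extraJavaOptions",
               "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")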
  40. Executor skew increased stage duration 2x
  41. Executor 6 does twice as much work; a possible solution is to increase the number of partitions (see the sketch below)
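      A minimal sketch of that fix on the word-count example: pass an explicit partition count to the shuffle (200 is illustrative and should be sized from the data):

        // More, smaller partitions spread the work across more tasks.
        val counts = textFile.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _, 200) // explicit numPartitions for the shuffle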
  42. What if it’s not your fault? Cluster weather
  43. How does cluster weather impact your app?
  44. No apparent reason for the delay in the Spark Web UI
  45. Time series shows a slower run of the app with much lower resources
  46. View cluster weather for the slower run of the app
  47. Cluster weather reveals the reason for CPU constraints on the slower app
  48. Cluster weather reveals the reason for memory constraints on the slower app
  49. Cluster weather reveals the reason for HDFS constraints on the slower app
  50. Code Analyzer for Apache Spark
      • Free during Early Access starting today
      • Early Access is for development teams
      • To learn more, visit booth #101
      • info@pepperdata.com
      pepperdata.com/products/code-analyzer
  51. Other performance tools mentioned
      • Dr. Elephant: github.com/linkedin/dr-elephant
      • Application Profiler: www.pepperdata.com/products/application-profiler/
  52. To recap
      • Use heuristics to find known problems
      • Execution plan + time series = powerful visualization
      • Knowing cluster weather can prevent time wasted debugging performance “issues” that aren’t the app’s fault
  53. Spark Summit Talk Plugs
      • Tuesday 11:40 AM: Connect Code to Resource Consumption to Scale Your Production Spark Applications (Vinod @ Pepperdata)
      • Tuesday 12:50 PM: Kubernetes SIG Big Data Birds-of-a-Feather session (many)
      • Tuesday 3:20 PM: Apache Spark on Kubernetes (Anirudh @ Google, Tim @ Hyperpilot)
      • Wednesday 11:00 AM: HDFS on Kubernetes – Lessons Learned (Kimoon @ Pepperdata)
      • Wednesday 11:00 AM: Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop (Carl @ LinkedIn, Simon @ Pepperdata)
  54. Thank You.
      www.pepperdata.com/products/code-analyzer/
      ssuchter@pepperdata.com
