
Apache Spark Performance: Past, Future and Present

Apache Spark performance is notoriously difficult to reason about. Spark’s parallelized architecture makes it difficult to identify bottlenecks while jobs are running, and as a result, users often struggle to determine how to optimize their jobs for the best performance. This talk takes a deep dive into techniques for identifying resource bottlenecks in Spark. I’ll begin with the past and discuss instrumentation that was added to Spark to measure how long jobs spend waiting on disk and network I/O. Next, I’ll discuss future-looking work from the research community that explores an alternative architecture for Spark based on single-resource monotasks. Using monotasks makes it trivial for users to understand bottlenecks and predict their workloads’ performance under different hardware and software configurations. This future-looking approach requires a radical re-architecting of Spark’s internals, so I’ll end with the present and describe how lessons from that work could be applied to Spark today to give users much more information about the performance of their workloads.



  1. Spark Performance: Past, Future, and Present. Kay Ousterhout. Joint work with Christopher Canel, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun.
  2. About Me: Apache Spark PMC member. Recent PhD graduate from UC Berkeley; thesis work on the performance of large-scale data analytics. Co-founder at Kelda (kelda.io).
  3. How can I make this faster?
  4. How can I make this faster?
  5. Should I use a different cloud instance type?
  6. Should I trade more CPU for less I/O by using better compression?
  7. How can I make this faster? ???
  8. How can I make this faster? ???
  9. How can I make this faster? ???
  10. Major performance improvements are possible via tuning and configuration… if only you knew which knobs to turn.
  11. This talk. Past: performance instrumentation in Spark. Future: a new architecture that provides performance clarity. Present: improving Spark’s performance instrumentation.
  12. Example Spark Job (Word Count). Split the input file into words and emit a count of 1 for each: spark.textFile("hdfs://…") .flatMap(lambda l: l.split(" ")) .map(lambda w: (w, 1)) .reduceByKey(lambda a, b: a + b) .saveAsTextFile("hdfs://…")
  13. Example Spark Job (Word Count). For each word, combine the counts and save the output: spark.textFile("hdfs://…") .flatMap(lambda l: l.split(" ")) .map(lambda w: (w, 1)) .reduceByKey(lambda a, b: a + b) .saveAsTextFile("hdfs://…")
  14. Spark Word Count Job. Map stage (split the input file into words and emit a count of 1 for each): spark.textFile("hdfs://…") .flatMap(lambda l: l.split(" ")) .map(lambda w: (w, 1)). Reduce stage (for each word, combine the counts and save the output): .reduceByKey(lambda a, b: a + b) .saveAsTextFile("hdfs://…"). [Diagram: tasks for each stage run across Worker 1 … Worker n.]
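To make the two stages concrete, here is a plain-Python sketch (no Spark required) of what each transformation in the word-count job computes. The function names and sample input are illustrative, not Spark APIs; Spark runs the same logic in parallel across workers.

```python
from collections import defaultdict

def map_stage(lines):
    """Split each input line into words and emit a (word, 1) pair per word,
    mirroring the job's flatMap + map."""
    for line in lines:
        for word in line.split(" "):
            yield (word, 1)

def reduce_stage(pairs):
    """Combine the counts for each word, mirroring reduceByKey(lambda a, b: a + b)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_stage(map_stage(["a b a", "b c"]))  # {"a": 2, "b": 2, "c": 1}
```

In Spark, the boundary between these two functions is the shuffle: map output is partitioned by word so that each reduce task sees all counts for its words.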
  15. Spark Word Count Job. Reduce stage (for each word, combine the counts and save the output): .reduceByKey(lambda a, b: a + b) .saveAsTextFile("hdfs://…"). [Diagram: reduce tasks run across Worker 1 … Worker n.]
  16. What happens in a reduce task? (1) Request a few shuffle blocks; (2) process local data; (3) write output to disk; (4) process data fetched remotely; (5) continue fetching remote data. [Timeline diagram with compute, network, and disk rows; a marked interval shows the time to handle one shuffle block.]
  17. What happens in a reduce task? At different points the task is bottlenecked on network and disk, bottlenecked on network, or bottlenecked on CPU. [Same timeline diagram.]
  18. What happens in a reduce task? At different points the task is bottlenecked on network and disk, bottlenecked on network, or bottlenecked on CPU. [Same timeline diagram.]
  19. What instrumentation exists today? Instrumentation is centered on the single, main task thread: shuffle read blocked time and executor computing time (!). [Timeline diagram with compute, network, and disk rows.]
  20. What instrumentation exists today? [Actual timeline version of the diagram.]
  21. What instrumentation exists today? Instrumentation is centered on the single, main task thread: shuffle read and shuffle write blocked time. Input read and output write blocked time are not instrumented. Possible to add!
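The blocked-time metrics mentioned above are exposed per task by Spark’s monitoring REST API (e.g., `/api/v1/applications/<app-id>/stages/<stage-id>`). As a sketch, one can compute what fraction of a task’s run time was spent blocked on shuffle fetches; the field names below follow the REST API’s task metrics, but the sample numbers are made up, times are treated as milliseconds for simplicity, and you should check the exact units reported by your Spark version.

```python
def shuffle_blocked_fraction(task_metrics):
    """Fraction of executor run time spent blocked waiting on shuffle reads.
    task_metrics is shaped like one task's metrics from Spark's REST API."""
    run_time = task_metrics["executorRunTime"]
    wait_time = task_metrics["shuffleReadMetrics"]["fetchWaitTime"]
    return wait_time / run_time if run_time else 0.0

sample_task = {  # illustrative numbers, not real measurements
    "executorRunTime": 2000,
    "shuffleReadMetrics": {"fetchWaitTime": 500},
}
frac = shuffle_blocked_fraction(sample_task)  # 0.25
```

A high fraction suggests the task is network-bound on shuffle reads; a low fraction only tells you the main thread wasn’t blocked, which is exactly the limitation the following slides discuss.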
  22. Instrumenting read and write time. The earlier picture (compute, then write the shuffle block to disk) is a lie. Reality: Spark processes and then writes one record at a time; most writes get buffered, and occasionally the buffer is flushed.
  23. Challenges with reality: record-level instrumentation has too much overhead, and Spark doesn’t know when buffers get flushed (HDFS does!).
  24. Opportunities to improve instrumentation: tasks use fine-grained pipelining to parallelize resources, and instrumented times are blocked times only (the task is doing other things in the background).
  25. This talk. Past: performance instrumentation in Spark. Future: a new architecture that provides performance clarity. Present: improving Spark’s performance instrumentation.
  26. [Timeline diagram: Tasks 1–8.] 4 concurrent tasks on a worker.
  27. [Timeline diagram: Tasks 1–8.] Concurrent tasks may contend for the same resource (e.g., network).
  28. What’s the bottleneck? At time t, different tasks may be bottlenecked on different resources, and a single task may be bottlenecked on different resources at different times.
  29. How much faster would my job be with 2x disk throughput? How would runtimes for these disk writes change? How would that change the timing of (and contention for) other resources?
  30. Today: tasks use pipelining to parallelize multiple resources. Proposal: build systems using monotasks that each consume just one resource.
  31. Monotasks: each task uses one resource. One of today’s tasks (network read, then CPU, then disk write) becomes a network monotask, a compute monotask, and a disk monotask. Monotasks don’t start until all of their dependencies complete.
  32. Dedicated schedulers control contention: a network scheduler, a CPU scheduler (1 monotask per core), and a disk drive scheduler (1 monotask per disk). [Diagram: the monotasks for one of today’s tasks.]
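As a toy illustration of how per-resource schedulers and monotask dependencies interact, the simulation below runs single-resource monotasks through bounded per-resource slots (one monotask per core or disk, as on the slide). The task structure, durations, and scheduling policy are all made up for illustration; this is not Spark’s scheduler or the monotasks prototype.

```python
def simulate(monotasks, slots):
    """monotasks: {name: (resource, duration, [dependency names])}.
    slots: {resource: max concurrent monotasks on that resource}.
    Greedily schedules each ready monotask at its earliest possible start;
    returns {name: finish_time}."""
    finish = {}
    # For each resource, track when each of its slots next becomes free.
    busy_until = {res: [0.0] * n for res, n in slots.items()}
    remaining = dict(monotasks)
    while remaining:
        # Monotasks are ready once all of their dependencies have finished.
        ready = [name for name, (_, _, deps) in remaining.items()
                 if all(d in finish for d in deps)]
        best = None
        for name in ready:
            res, dur, deps = remaining[name]
            deps_done = max((finish[d] for d in deps), default=0.0)
            slot = min(range(len(busy_until[res])),
                       key=lambda i: busy_until[res][i])
            start = max(deps_done, busy_until[res][slot])
            if best is None or start < best[0]:
                best = (start, name, res, dur, slot)
        start, name, res, dur, slot = best
        busy_until[res][slot] = start + dur
        finish[name] = start + dur
        del remaining[name]
    return finish

# One of today's tasks, decomposed: network read -> compute -> disk write.
tasks = {
    "net1":  ("network", 2.0, []),
    "cpu1":  ("cpu",     3.0, ["net1"]),
    "disk1": ("disk",    1.0, ["cpu1"]),
}
finish = simulate(tasks, {"network": 1, "cpu": 1, "disk": 1})  # disk1 ends at 6.0
```

Because every monotask touches exactly one resource, the simulated per-resource timelines directly show how busy each resource was, which is the clarity the monotasks design is after.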
  33. Spark today: tasks have non-uniform resource use, and 4 multi-resource tasks run concurrently. Monotasks: single-resource monotasks are scheduled by per-resource schedulers. Monotasks are API-compatible and reach performance parity with Spark, and performance telemetry becomes trivial!
  34. How much faster would the job run if: 4x more machines; input stored in memory (no disk read, no CPU time to deserialize); flash drives instead of disks; faster shuffle read/write time? 10x improvements are predicted with at most 23% error. [Bar chart comparing predicted and actual job runtimes.]
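Under the monotasks model, such what-if questions reduce to scaling per-resource times and seeing which resource becomes the new bottleneck. The sketch below uses a deliberately simplified version of that idea (stage time = time on the most-used resource, ignoring dependencies and overlap); the real monotasks performance model is more detailed, and the numbers here are illustrative.

```python
def predict_runtime(resource_seconds, speedups=None):
    """resource_seconds: total time each resource is needed,
    e.g. {"cpu": 100, "network": 60, "disk": 80}.
    speedups: factor by which a resource gets faster,
    e.g. {"disk": 2.0} for 2x disk throughput.
    Assumes the stage is bound by its most-used resource (a simplification)."""
    speedups = speedups or {}
    return max(t / speedups.get(r, 1.0) for r, t in resource_seconds.items())

base = {"cpu": 100.0, "network": 60.0, "disk": 80.0}
now = predict_runtime(base)                         # 100.0: CPU-bound
faster_disk = predict_runtime(base, {"disk": 2.0})  # 100.0: 2x disks don't help
faster_cpu = predict_runtime(base, {"cpu": 4.0})    # 80.0: disk becomes the bottleneck
```

Note how the model captures the question from the earlier slide: doubling disk throughput changes nothing if the job is CPU-bound, while speeding up the CPU shifts the bottleneck to disk.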
  35. Monotasks: break jobs into single-resource tasks. Using single-resource monotasks provides clarity without sacrificing performance, but it is a massive change to Spark’s internals (>20K lines of code).
  36. This talk. Past: performance instrumentation in Spark. Future: a new architecture that provides performance clarity. Present: improving Spark’s performance instrumentation.
  37. Spark today: task resource use changes at fine time granularity, and 4 multi-resource tasks run concurrently. Monotasks: single-resource tasks lead to complete, trivial performance metrics. Can we get monotask-like per-resource metrics for each task today?
  38. Can we get per-task resource use? Measure machine resource utilization? [Timeline diagram with compute, network, and disk rows.]
  39. [Timeline diagram: Tasks 1–8.] 4 concurrent tasks on a worker.
  40. [Timeline diagram: Tasks 1–8.] Concurrent tasks may contend for the same resource (e.g., network), and contention is controlled by lower layers (e.g., the operating system).
  41. Can we get per-task resource use? Machine utilization includes other tasks, and we can’t directly measure per-task I/O: it happens in the background, mixed with other tasks’ I/O. [Timeline diagram with compute, network, and disk rows.]
  42. Can we get per-task resource use? Existing metrics give the total data read; how long did it take? Use machine utilization metrics to get bandwidth!
  43. Existing per-task I/O counters (e.g., shuffle bytes read) + machine-level utilization (and bandwidth) metrics = complete metrics about the time spent using each resource.
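The arithmetic behind that combination is simple: divide each per-task byte counter by the bandwidth the machine actually achieved on that resource. A minimal sketch, with made-up counter values and bandwidths:

```python
def per_resource_time(bytes_by_resource, bandwidth_by_resource):
    """Estimate time (seconds) spent using each resource by combining
    per-task I/O counters (bytes) with machine-level measured bandwidth
    (bytes/sec). Assumes the measured bandwidth applies to this task's I/O."""
    return {res: bytes_by_resource[res] / bandwidth_by_resource[res]
            for res in bytes_by_resource}

times = per_resource_time(
    {"network": 200e6, "disk": 100e6},   # e.g., shuffle bytes read, bytes written
    {"network": 100e6, "disk": 50e6},    # measured machine-level bandwidths
)  # {"network": 2.0, "disk": 2.0}
```

This is exactly the monotask-style per-resource breakdown, recovered without changing Spark’s task model: the counters come from Spark, the bandwidths from machine-level utilization metrics.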
  44. Goal: provide performance clarity. The only way to improve performance is to know what to speed up. Why do we care about performance clarity? A typical performance evaluation involves a group of experts; in practice, performance tuning falls to one novice.
  45. Goal: provide performance clarity. The only way to improve performance is to know what to speed up. Some instrumentation exists already, focused on blocked times in the main task thread. There are many opportunities to improve it: (1) add read/write instrumentation at a lower level (e.g., HDFS); (2) add machine-level utilization info; (3) calculate per-resource time. More details at kayousterhout.org.
