Top 5 mistakes when writing Spark applications: a presentation at Strata + Hadoop World 2016, New York.

  1. Top 5 mistakes when writing Spark applications
     tiny.cloudera.com/spark-mistakes
     Mark Grover | Software Engineer, Cloudera | @mark_grover
     Ted Malaska | Technical Group Architect, Blizzard | @TedMalaska
  2. About the book
     • @hadooparchbook
     • hadooparchitecturebook.com
     • github.com/hadooparchitecturebook
     • slideshare.com/hadooparchbook
  3. Mistakes people make when using Spark
  4. Mistakes we've made when using Spark
  5. Mistakes people make when using Spark
  6. Mistake #1
  7. # Executors, cores, memory !?!
     • 6 nodes
     • 16 cores each
     • 64 GB of RAM each
  8. Decisions, decisions, decisions
     • Number of executors (--num-executors)
     • Cores for each executor (--executor-cores)
     • Memory for each executor (--executor-memory)
     • 6 nodes, 16 cores each, 64 GB of RAM
  9. Spark Architecture recap
  10. Answer #1 – Most granular
     • Have the smallest executors possible
     • 1 core each
     • 64 GB/node / 16 executors/node = 4 GB/executor
     • Total of 16 cores x 6 nodes = 96 cores => 96 executors
     (Diagram: worker node running Executors 1–16)
  11. Answer #1 – Most granular
     • Have the smallest executors possible
     • 1 core each
     • 64 GB/node / 16 executors/node = 4 GB/executor
     • Total of 16 cores x 6 nodes = 96 cores => 96 executors
     (Diagram: worker node running Executors 1–16)
  12. Why?
     • Not using the benefits of running multiple tasks in the same executor
  13. Answer #2 – Least granular
     • 6 executors in total => 1 executor per node
     • 64 GB memory each
     • 16 cores each
     (Diagram: worker node running a single executor)
  14. Answer #2 – Least granular
     • 6 executors in total => 1 executor per node
     • 64 GB memory each
     • 16 cores each
     (Diagram: worker node running a single executor)
  15. Why?
     • Need to leave some memory overhead for OS/Hadoop daemons
  16. Answer #3 – With overhead
     • 6 executors – 1 executor/node
     • 63 GB memory each
     • 15 cores each
     (Diagram: worker node running a single executor, with 1 GB and 1 core set aside as overhead)
  17. Answer #3 – With overhead
     • 6 executors – 1 executor/node
     • 63 GB memory each
     • 15 cores each
     (Diagram: worker node running a single executor, with 1 GB and 1 core set aside as overhead)
  18. Let's assume…
     • You are running Spark on YARN, from here on…
  19. 3 things
     • 3 other things to keep in mind
  20. #1 – Memory overhead
     • --executor-memory controls the heap size
     • Need some overhead (controlled by spark.yarn.executor.memoryOverhead) for off-heap memory
     • Default is max(384 MB, 0.07 * spark.executor.memory)
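     (Worked example, using the formula above: a 21 GB heap needs max(384 MB, 0.07 x 21 GB) ≈ 1.5 GB of overhead on top, so the YARN container is really about 22.5 GB; that is why the calculation on slide 24 backs the heap down to roughly 19 GB per executor.)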
  21. #2 – YARN AM needs a core: Client mode
  22. #2 – YARN AM needs a core: Cluster mode
  23. #3 – HDFS Throughput
     • 15 cores per executor can lead to bad HDFS I/O throughput.
     • Best is to keep under 5 cores per executor
  24. Calculations
     • 5 cores per executor
       – For max HDFS throughput
     • Cluster has 6 * 15 = 90 cores in total (after taking out Hadoop/YARN daemon cores)
     • 90 cores / 5 cores/executor = 18 executors
     • Each node has 3 executors
     • 63 GB / 3 = 21 GB, and 21 x (1 - 0.07) ~ 19 GB
     • 1 executor for the AM => 17 executors
     (Diagram: worker node with 3 executors plus overhead)
  25. Correct answer
     • 17 executors in total
     • 19 GB memory/executor
     • 5 cores/executor
     * Not etched in stone
     (Diagram: worker node with 3 executors plus overhead)
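     As a rough sketch, that sizing translates into a spark-submit invocation like the following (assuming YARN cluster mode; the class and jar names are placeholders, not from the deck):

        spark-submit \
          --master yarn \
          --deploy-mode cluster \
          --num-executors 17 \
          --executor-cores 5 \
          --executor-memory 19G \
          --class com.example.MyApp \
          my-app.jar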
  26. Dynamic allocation helps with this, though, right?
     • Dynamic allocation allows Spark to dynamically scale the cluster resources allocated to your application based on the workload.
     • Works with Spark on YARN
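     A hedged sketch of what turning this on looks like at submit time (the min/max values are illustrative only, and on YARN the external shuffle service also has to be configured on the NodeManagers):

        spark-submit \
          --master yarn \
          --conf spark.dynamicAllocation.enabled=true \
          --conf spark.shuffle.service.enabled=true \
          --conf spark.dynamicAllocation.minExecutors=2 \
          --conf spark.dynamicAllocation.maxExecutors=17 \
          --executor-cores 5 \
          --executor-memory 19G \
          my-app.jar

     Note that dynamic allocation only takes over the number of executors; as the next slide shows, you still pick the cores and memory per executor yourself.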
  27. Decisions with Dynamic Allocation
     • Number of executors (--num-executors)
     • Cores for each executor (--executor-cores)
     • Memory for each executor (--executor-memory)
     • 6 nodes, 16 cores each, 64 GB of RAM
  28. Read more
     • From a great blog post on this topic by Sandy Ryza:
       http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
  29. Mistake #2
  30. Application failure
     15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47):
     java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
       at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
       at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
       at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
       at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
       at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
       at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
       at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
       at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
  31. Why?
     • No Spark shuffle block can be greater than 2 GB
  32. Ok, what's a shuffle block again?
     • In MapReduce terminology, a file written from one Mapper for a Reducer
     • The Reducer makes a local copy of this file (reducer local copy) and then 'reduces' it
  33. Defining shuffle and partition
     Each yellow arrow in this diagram represents a shuffle block. Each blue block is a partition.
  34. Once again
     • Overflow exception if shuffle block size > 2 GB
  35. What's going on here?
     • Spark uses ByteBuffer as the abstraction for blocks:
       val buf = ByteBuffer.allocate(length.toInt)
     • ByteBuffer is limited by Integer.MAX_VALUE (2 GB)!
  36. Spark SQL
     • Especially problematic for Spark SQL
     • Default number of partitions to use when doing shuffles is 200
       – This low number of partitions leads to high shuffle block size
  37. Umm, ok, so what can I do?
     1. Increase the number of partitions
        – Thereby reducing the average partition size
     2. Get rid of skew in your data
        – More on that later
  38. Umm, how exactly?
     • In Spark SQL, increase the value of spark.sql.shuffle.partitions
     • In regular Spark applications, use rdd.repartition() or rdd.coalesce() (the latter to reduce #partitions, if needed)
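     A minimal sketch of both options (assuming a Spark 1.x-era shell with sc and sqlContext in scope; the input path is a placeholder):

        // Spark SQL: raise the shuffle partition count from the default of 200
        sqlContext.setConf("spark.sql.shuffle.partitions", "400")

        // Plain RDDs: repartition up (full shuffle) or coalesce down (avoids a full shuffle)
        val someRdd = sc.textFile("hdfs:///path/to/input")
        val morePartitions  = someRdd.repartition(400)
        val fewerPartitions = someRdd.coalesce(50)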
  39. But, how many partitions should I have?
     • Rule of thumb is around 128 MB per partition
  40. But! There's more!
     • Spark uses a different data structure for bookkeeping during shuffles when the number of partitions is less than 2000 vs. more than 2000.
  41. Don't believe me?
     • In MapStatus.scala:
       def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
         if (uncompressedSizes.length > 2000) {
           HighlyCompressedMapStatus(loc, uncompressedSizes)
         } else {
           new CompressedMapStatus(loc, uncompressedSizes)
         }
       }
  42. Ok, so what are you saying?
     If the number of partitions is < 2000, but not by much, bump it to be slightly higher than 2000.
  43. Can you summarize, please?
     • Don't have partitions that are too big
       – Your job will fail due to the 2 GB limit
     • Don't have too few partitions
       – Your job will be slow, not making use of parallelism
     • Rule of thumb: ~128 MB per partition
     • If #partitions < 2000, but close, bump to just > 2000
     • Track SPARK-6235 for removing the various 2 GB limits
  44. Mistake #3
  45. Slow jobs on Join/Shuffle
     • Your dataset takes 20 seconds to run over with a map job, but takes 4 hours when joined or shuffled. What's wrong?
  46. Mistake – Skew
     (Diagram: "Normal" is one single thread; "Distributed" is many single threads running in parallel. The holy grail of distributed systems.)
  47. Mistake – Skew
     (Diagram: normal vs. distributed, single threads.)
     What about skew, because that is a thing.
  48. Mistake – Skew: Answers
     • Salting
     • Isolated Salting
     • Isolated Map Joins
  49. Mistake – Skew: Salting
     • Normal Key: "Foo"
     • Salted Key: "Foo" + random.nextInt(saltFactor)
  50. Managing Parallelism
  51. Mistake – Skew: Salting
  52. Add Example Slide
  53. Mistake – Skew: Salting
     • Two Stage Aggregation
       – Stage one does the operations on the salted keys
       – Stage two does the operations on the unsalted key results
     (Flow: Data Source → Map: convert to salted key & value tuple → Reduce by salted key → Map: convert results to key & value tuple → Reduce by key → Results.)
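     A minimal Scala sketch of that two-stage salted aggregation (not from the deck; records and saltFactor are made up, and a tuple key is used instead of string concatenation so the real key is easy to recover):

        import scala.util.Random

        val saltFactor = 100
        val records = sc.parallelize(Seq(("Foo", 1), ("Foo", 1), ("Bar", 1)))  // imagine "Foo" is the hot key

        // Stage one: spread the hot key over saltFactor sub-keys, then reduce.
        val saltedSums = records
          .map { case (k, v) => ((k, Random.nextInt(saltFactor)), v) }
          .reduceByKey(_ + _)

        // Stage two: drop the salt and reduce the (much smaller) partial results by the real key.
        val sums = saltedSums
          .map { case ((k, _), v) => (k, v) }
          .reduceByKey(_ + _)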
  54. Mistake – Skew: Isolated Salting
     • Second stage only required for isolated keys
     (Flow: Data Source → Map: convert to key & value, isolating the skewed keys and converting them to salted key & value tuples → Reduce by key & salted key → Filter isolated keys from salted keys → Map: convert results to key & value tuple → Reduce by key → Union to results.)
  55. Mistake – Skew: Isolated Map Join
     • Filter out the isolated keys and use a map join/aggregate on those
     • And a normal reduce on the rest of the data
     • This can remove a large amount of data being shuffled
     (Flow: Data Source → Filter normal keys from isolated keys → Reduce by normal key, map join for the isolated keys → Union to results.)
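     A sketch of the isolated map join/aggregate idea for a simple sum (illustrative only; hotKeys and events are made-up names, and the set of skewed keys is assumed to be small and known up front):

        val hotKeys   = Set("foo")                          // the few keys known to be skewed
        val hotKeysBc = sc.broadcast(hotKeys)
        val events    = sc.parallelize(Seq(("foo", 1), ("bar", 2), ("foo", 3), ("baz", 4)))

        val isolated = events.filter { case (k, _) => hotKeysBc.value.contains(k) }
        val normal   = events.filter { case (k, _) => !hotKeysBc.value.contains(k) }

        // Isolated keys: aggregate map-side; with only a handful of distinct keys,
        // each partition emits a tiny partial result and the follow-up shuffle is negligible.
        val isolatedAgg = isolated
          .mapPartitions { it =>
            val acc = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
            it.foreach { case (k, v) => acc(k) += v }
            acc.iterator
          }
          .reduceByKey(_ + _)

        // Everything else takes the normal shuffle path; union the two results.
        val result = normal.reduceByKey(_ + _).union(isolatedAgg)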
  56. Managing Parallelism: Cartesian Join
     (Diagram: every map task writes a shuffle temp file for every reduce task, and the amount of data being shuffled blows up 10x, 100x, 1000x, 10000x, 100000x, 1000000x or more.)
  57. Managing Parallelism
     • How To fight Cartesian Join – Nested Structures
     (Diagram: Table X holds A,1 A,2 A,3 and Table Y holds A,4 A,5 A,6. A join produces all nine combinations A,1,4 through A,3,6, whereas a nested structure keeps a single row for key A with the values nested inside it.)
  58. Managing Parallelism
     • How To fight Cartesian Join – Nested Structures

       create table nestedTable (
         col1 string,
         col2 string,
         col3 array<struct<
           col3_1: string,
           col3_2: string>>)

       val rddNested = sc.parallelize(Array(
         Row("a1", "b1", Seq(Row("c1_1", "c2_1"),
                             Row("c1_2", "c2_2"),
                             Row("c1_3", "c2_3"))),
         Row("a2", "b2", Seq(Row("c1_2", "c2_2"),
                             Row("c1_3", "c2_3"),
                             Row("c1_4", "c2_4")))), 2)
  59. Mistake #4
  60. Out of luck?
     • Do you ever run out of memory?
     • Do you ever have more than 20 stages?
     • Is your driver doing a lot of work?
  61. Mistake – DAG Management
     • Shuffles are to be avoided
     • ReduceByKey over GroupByKey
     • TreeReduce over Reduce
     • Use Complex/Nested Types
  62. Mistake – DAG Management: Shuffles
     • Map-side reduction, where possible
     • Think about partitioning/bucketing ahead of time
     • Do as much as possible with a single shuffle
     • Only send what you have to send
     • Avoid skew and Cartesians
  63. ReduceByKey over GroupByKey
     • ReduceByKey can do almost anything that GroupByKey can do
       • Aggregations
       • Windowing
       • Use memory
     • But you have more control
     • ReduceByKey has a fixed limit on memory requirements
     • GroupByKey is unbounded and dependent on the data
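     A small word-count illustration of the difference (not from the deck):

        val words = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))

        // Preferred: partial sums are combined before the shuffle, so memory per key stays bounded.
        val countsReduce = words.reduceByKey(_ + _)

        // Avoid for aggregations: every value for a key is shuffled and buffered before you touch it.
        val countsGroup = words.groupByKey().mapValues(_.sum)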
  64. TreeReduce over Reduce
     • TreeReduce & Reduce both return a result to the driver
     • TreeReduce does more work on the executors
     • While Reduce brings everything back to the driver
     (Diagram: with Reduce, all four partitions send 100% of their results straight to the driver; with TreeReduce, the four 25% pieces are first combined on executors before reaching the driver.)
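     For example (illustrative numbers only):

        val nums = sc.parallelize(1 to 1000000, 100)

        // reduce: every partition's partial result goes straight to the driver.
        val total1 = nums.reduce(_ + _)

        // treeReduce: partials are first combined on executors in `depth` rounds,
        // so the driver only merges a handful of results.
        val total2 = nums.treeReduce(_ + _, depth = 2)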
  65. Complex Types
     • Top N list
     • Multiple types of aggregations
     • Windowing operations
     • All in one pass
  66. Complex Types
     • Think outside of the box: use objects to reduce by
     • (Make something simple)
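     One hedged sketch of "using objects to reduce by": a small case class carrying several aggregates, so a single reduceByKey pass yields a sum, a count, and a max per key (the class and field names are made up, not from the deck):

        case class Stats(sum: Long, count: Long, max: Long) {
          def merge(that: Stats): Stats =
            Stats(sum + that.sum, count + that.count, math.max(max, that.max))
        }

        val events = sc.parallelize(Seq(("a", 3L), ("a", 7L), ("b", 1L)))

        // All three aggregations happen in one shuffle instead of three separate jobs.
        val statsByKey = events
          .mapValues(v => Stats(v, 1L, v))
          .reduceByKey((a, b) => a.merge(b))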
  67. Mistake #5
  68. Ever seen this?
     Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
       at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
       at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
       at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
       at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
       at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
       at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
       at ...
  69. But!
     • I already included protobuf in my app's maven dependencies?
  70. Ah!
     • My protobuf version doesn't match Spark's protobuf version!
  71. Shading
     <plugin>
       <groupId>org.apache.maven.plugins</groupId>
       <artifactId>maven-shade-plugin</artifactId>
       <version>2.2</version>
       ...
       <relocations>
         <relocation>
           <pattern>com.google.protobuf</pattern>
           <shadedPattern>com.company.my.protobuf</shadedPattern>
         </relocation>
       </relocations>
  72. Future of shading
     • Spark 2.0 has some libraries shaded
     • Guava is fully shaded
  73. Summary
  74. 5 Mistakes
     • Size up your executors right
     • 2 GB limit on Spark shuffle blocks
     • Evil thing about skew and Cartesians
     • Learn to manage your DAG, yo!
     • Do shady stuff, don't let classpath leaks mess you up
  75. THANK YOU.
     tiny.cloudera.com/spark-mistakes
     Mark Grover | @mark_grover
     Ted Malaska | @TedMalaska
