Spark is already well-established in the world of Big Data, but when working with clients I have noticed a lack of understanding of how it works beyond its very nice API. This talk aims to give the audience a practical framework for diagnosing and fixing the most common performance problems in Spark applications.
@mszymani #Devoxx #TantusData #spark
Sizing executors
Executor: a JVM process able to run one or more tasks.
Example config: --executor-cores 3 --executor-memory 10g
[Diagram: a pool of executors picking up pending tasks and reporting completed tasks]
[Diagram: the same executor pool, with the driver coordinating the executors and tracking pending and completed tasks]
Sizing executors
[Diagram: the same total resources can be laid out differently across cluster nodes - one large executor per node vs several smaller executors per node - and the possible layouts are compared side by side]
Sizing executors
• Spark can benefit from running multiple tasks in the same JVM
• Too many cores per executor leads to problems: HDFS client I/O contention, long GC pauses
• 1-4 cores per executor is a good starting point
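A sizing along these lines might look like the following command-line sketch. The numbers, class name, and jar are illustrative assumptions, not a recommendation:

```shell
# Illustrative sizing: on a 16-core / 64 GB node, keep some cores and
# memory aside for the OS and HDFS/YARN daemons, then split the rest
# into executors that stay within the 1-4 cores guideline.
spark-submit \
  --num-executors 3 \
  --executor-cores 4 \
  --executor-memory 18g \
  --class com.example.MyJob \
  my-job.jar
```

The right split depends on the workload; the point is to compare a few layouts rather than assume one big executor per node is best.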
We are up & running
What to watch out for:
• Failing tasks
• GC heavy tasks
• Shuffle read/write sizes
• Long running tasks
We are up & running
• Too much data per task?
• Is the parallelism level high enough?
• Does the executor need more memory?
def groupByKey(numPartitions: Int)
def repartition(numPartitions: Int)
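When individual tasks process too much data, raising the number of partitions spreads the work out. A minimal sketch of both signatures in use, assuming an existing `SparkContext` named `sc` (the input path and partition counts are made up for illustration):

```scala
// Hypothetical input path; with too few partitions, each task gets too much data.
val events = sc.textFile("hdfs:///data/events")

// repartition(numPartitions) shuffles the data into more partitions,
// raising the parallelism level for everything downstream:
val spread = events.repartition(200)

// Shuffle operations also accept a partition count directly, which
// avoids a separate repartition step:
val pairs  = spread.map(line => (line.split(",")(0), 1))
val counts = pairs.reduceByKey(_ + _, 200)
```

Note that `repartition` is itself a shuffle, so passing `numPartitions` to the shuffle operation you already need is usually cheaper than repartitioning first.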
Caching
• A branch in the execution plan (an RDD consumed by more than one downstream computation) is a candidate for caching
• The Spark UI shows how much memory each cached RDD is taking
• You cannot control eviction priority - it's LRU
• Don't cache to disk if recomputation is cheap
• Caching with a replication factor - only when recreating the data is extremely costly
• Checkpointing (which truncates the lineage) vs caching (which keeps it)
• Shuffle data is automatically persisted
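A sketch of caching a branching RDD, assuming an existing `SparkContext` named `sc`; the path and log format are made up for illustration:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical input; `parsed` is consumed by two downstream actions,
// which makes it a branch in the plan and a candidate for caching.
val parsed = sc.textFile("hdfs:///data/logs").map(_.split("\t"))
parsed.persist(StorageLevel.MEMORY_ONLY)   // cache() is shorthand for this

val errors   = parsed.filter(_(1) == "ERROR").count()
val warnings = parsed.filter(_(1) == "WARN").count()

// A replicated level such as MEMORY_ONLY_2 keeps two copies per
// partition - only worth it when recomputation is extremely costly:
// parsed.persist(StorageLevel.MEMORY_ONLY_2)

parsed.unpersist()   // release the memory once both branches are done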
Optimize shuffle - recap
• Control number of partitions
• Use mapValues instead of map if you can
• Broadcast variables
• Filter before shuffle
• Avoid groupByKey, use reduceByKey
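The recap points can be sketched on a small pair RDD, assuming an existing `SparkContext` named `sc`; the data and the lookup table are made up for illustration:

```scala
val sales = sc.parallelize(Seq(("apples", 3), ("pears", 5), ("apples", 2)))

// Filter before the shuffle so less data crosses the network:
val nonZero = sales.filter { case (_, qty) => qty > 0 }

// reduceByKey combines values on the map side before shuffling;
// groupByKey would ship every individual value across the network first:
val totals = nonZero.reduceByKey(_ + _)

// mapValues leaves the keys (and therefore the partitioner) untouched,
// so a later shuffle on the same keys can be skipped; map would not:
val doubled = totals.mapValues(_ * 2)

// Broadcast a small lookup table instead of joining with a shuffle:
val prices  = sc.broadcast(Map("apples" -> 2.0, "pears" -> 1.5))
val revenue = doubled.map { case (fruit, qty) =>
  (fruit, qty * prices.value.getOrElse(fruit, 0.0))
}
```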
General notes
• Tests vs no tests? - Test your code!
• You are probably not the only user of the cluster
• What are you optimizing for?
• Share the knowledge
• Spark actually works :)
goo.gl/7eTtvH