Tuning
Q4’s Research Report
linhtm@runsystem.net
Agenda
1. Tuning Spark parameters
a. Control Spark’s resource usage
b. Advanced parameters
c. Dynamic Allocation
2. Tips for tuning your Spark program
3. Example use case of tuning Spark
algorithm
Tuning Spark Parameters
The easy way
If your Spark application is slow, just give it
more system resources.
Is there anything simpler?
Spark Architecture Simplified
Control Spark’s resource usage
• spark-submit command parameters (some are only
available when running on YARN)

num-executors    Number of executors to launch (default: 2)
executor-cores   Number of cores per executor (default: 1)
executor-memory  Memory per executor (default: 1g)
driver-cores     Number of cores used by the driver, YARN cluster mode only (default: 1)
driver-memory    Memory for the driver (default: 1g)
Calculate the right values
• For example: 4 servers for Spark, each with
64 GB RAM and 16 cores. How should we set the
spark-submit parameters?
– --num-executors 4 --executor-memory 63g --executor-cores 15
– --num-executors 7 --executor-memory 29g --executor-cores 7
– --num-executors 11 --executor-memory 19g --executor-cores 5
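The third option can be derived step by step. A minimal sketch of that reasoning in plain Python follows; the helper name and the reserved-resource assumptions (1 core and 1 GB per node for the OS and Hadoop daemons, one executor slot given up to the YARN ApplicationMaster, roughly 10% of container memory lost to overhead) are illustrative, not a Spark API:

```python
def size_executors(num_nodes, cores_per_node, ram_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Derive --num-executors / --executor-cores / --executor-memory."""
    usable_cores = cores_per_node - 1        # reserve 1 core for OS/daemons
    usable_ram = ram_gb_per_node - 1         # reserve 1 GB for OS/daemons
    executors_per_node = usable_cores // cores_per_executor
    num_executors = executors_per_node * num_nodes - 1   # minus 1 for the AM
    container_mem = usable_ram // executors_per_node     # GB per container
    heap_mem = int(container_mem / (1 + overhead_fraction))  # heap after overhead
    return num_executors, cores_per_executor, heap_mem

print(size_executors(4, 16, 64))  # -> (11, 5, 19)
```

With 5 cores per executor this reproduces the last option above: 11 executors, 5 cores each, 19 GB of heap.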
Spark Executor’s Memory Model
• Memory requested from YARN for each container =
spark.executor.memory + spark.yarn.executor.memoryOverhead
• spark.yarn.executor.memoryOverhead =
max(spark.executor.memory * 0.10, 384 MB)
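The container-size formula above can be sketched in a few lines of Python (the function name is illustrative; values are in MB):

```python
def yarn_container_mb(executor_memory_mb):
    # memoryOverhead default: max(10% of executor memory, 384 MB)
    overhead = max(int(executor_memory_mb * 0.10), 384)
    return executor_memory_mb + overhead

print(yarn_container_mb(1024))   # -> 1408  (small heap: the 384 MB floor applies)
print(yarn_container_mb(19456))  # -> 21401 (19g heap: the 10% term dominates)
```

Note the 384 MB floor: for small executors the overhead is proportionally much larger than 10%.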
More advanced parameters
spark.shuffle.memoryFraction    Fraction of Java heap to use for aggregation and cogroups during shuffles (default: 0.2)
spark.reducer.maxSizeInFlight   Maximum size of map outputs to fetch simultaneously from each reduce task (default: 48m)
spark.shuffle.consolidateFiles  If set to "true", consolidates intermediate files created during a shuffle (default: false)
spark.shuffle.file.buffer       Size of the in-memory buffer for each shuffle file output stream (default: 32k)
spark.storage.memoryFraction    Fraction of Java heap to use for Spark's memory cache (default: 0.6)
spark.akka.frameSize            Maximum message size to allow in "control plane" communication (for serialized tasks and task results), in MB (default: 10)
spark.akka.threads              Number of actor threads to use for communication (default: 4)
Advanced Spark memory
Demo Spark UI
Using Dynamic Allocation
• Dynamically scales the set of cluster resources
allocated to your application up and down based on
the workload
• Only available when using YARN as the cluster
manager
• Requires an external shuffle service, so a shuffle
service must be configured with YARN
Dynamic Allocation parameters (1)

spark.shuffle.service.enabled (default: false)
    Enables the external shuffle service. This service preserves the
    shuffle files written by executors so the executors can be safely
    removed.
spark.dynamicAllocation.enabled (default: false)
    Whether to use dynamic resource allocation.
spark.dynamicAllocation.executorIdleTimeout (default: 60s)
    If an executor has been idle for more than this duration, the
    executor will be removed.
spark.dynamicAllocation.cachedExecutorIdleTimeout (default: Infinity)
    If an executor which has cached data blocks has been idle for more
    than this duration, the executor will be removed.
spark.dynamicAllocation.initialExecutors (default: spark.dynamicAllocation.minExecutors)
    Initial number of executors to run.
Dynamic Allocation parameters (2)

spark.dynamicAllocation.maxExecutors (default: Infinity)
    Upper bound for the number of executors.
spark.dynamicAllocation.minExecutors (default: 0)
    Lower bound for the number of executors.
spark.dynamicAllocation.schedulerBacklogTimeout (default: 1s)
    If there have been pending tasks backlogged for more than this
    duration, new executors will be requested.
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (default: schedulerBacklogTimeout)
    Same as spark.dynamicAllocation.schedulerBacklogTimeout, but used
    only for subsequent executor requests.
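Putting the parameters above together, a minimal submit command for dynamic allocation on YARN might look like the following sketch (the min/max bounds, class name, and jar are placeholder assumptions, not values from this deck):

```shell
spark-submit \
  --master yarn \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --class com.example.MyApp my-app.jar
```

Note that --num-executors is omitted: with dynamic allocation enabled, Spark decides the executor count between the min and max bounds.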
Dynamic Allocation in Action
Dynamic Allocation - The verdict
• Dynamic Allocation helps use your cluster resources
more efficiently
• But it is only effective when the Spark application is
a long-running one with long stages that have
different numbers of tasks (Spark Streaming?)
• In addition, when an executor is removed, all of its
cached data is no longer accessible
Tips for Tuning Your Spark
Program
Tuning Memory Usage
• Prefer arrays of objects and primitive types over
the standard Java or Scala collection classes (e.g.
HashMap).
• Avoid nested structures with many small objects
and pointers when possible.
• Use numeric IDs or enumeration objects instead of
strings for keys.
• If you have less than 32 GB of RAM, set the JVM flag
-XX:+UseCompressedOops to make pointers four
bytes instead of eight.
Other Tuning Tips (1)
● Use KryoSerializer instead of the default JavaSerializer
● Know when to persist an RDD and determine the right
storage level
○ MEMORY_ONLY
○ MEMORY_AND_DISK
○ MEMORY_ONLY_SER
○ …
● Choose the right level of parallelism
○ spark.default.parallelism
○ repartition
○ the second (numPartitions) argument of methods in
PairRDDFunctions
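The serializer and parallelism knobs above can also be set at submit time. A sketch, assuming the 11-executor, 5-core sizing from earlier and the common rule of thumb of roughly 2 tasks per core (the jar name is a placeholder):

```shell
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.default.parallelism=110 \
  my-app.jar
```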
Other tuning tips (2)
• Broadcast large variables
• Do not collect large RDDs (filter first)
• Be careful with operations that require a data
shuffle (join, reduceByKey, groupByKey, …)
• Avoid groupByKey; use reduceByKey,
aggregateByKey, or combineByKey (lower level) if
possible.
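The groupByKey vs reduceByKey comparison on the next slides boils down to map-side combining. A toy model in plain Python (no Spark; partition data and function names are illustrative) counts how many records would cross the network in each case:

```python
from collections import defaultdict

# Two "partitions" of (key, value) pairs, as an RDD might hold them.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

def shuffled_records_groupByKey(parts):
    # groupByKey: every record crosses the network unchanged.
    return sum(len(p) for p in parts)

def shuffled_records_reduceByKey(parts):
    # reduceByKey: map-side combine first, so each partition sends
    # at most one (key, partial sum) record per distinct key.
    total = 0
    for p in parts:
        combined = defaultdict(int)
        for k, v in p:
            combined[k] += v
        total += len(combined)
    return total

print(shuffled_records_groupByKey(partitions))   # -> 6
print(shuffled_records_reduceByKey(partitions))  # -> 4
```

The gap widens as values per key grow: groupByKey shuffles every value, while reduceByKey shuffles at most (partitions × distinct keys) records.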
groupByKey vs reduceByKey (1)

groupByKey vs reduceByKey (2)
Example use case of tuning Spark
algorithm
Tuning CF algorithm in RW project
• 1st algorithm, no parameter tuning: 27 mins
• 1st algorithm, parameters tuned: 18 mins
• 2nd algorithm (from Spark code), parameters tuned:
~7 mins 30 s
• 3rd algorithm (improved Spark code), parameters
tuned: ~6 mins 30 s
Q&A

Thank You!
