Tuning
Q4’s Research Report
linhtm@runsystem.net
Agenda
1. Tuning Spark parameters
a. Control Spark’s resource usage
b. Advanced parameters
c. Dynamic Allocation
2. Tips for tuning your Spark program
3. Example use case of tuning Spark
algorithm
Tuning Spark Parameters
The easy way
If your Spark application is slow, just give it
more system resources.
Is there anything simpler?
Spark Architecture Simplified
Control Spark’s resource usage
• spark-submit command parameters (some are only
available when running on YARN)

num-executors    Number of executors to launch (default: 2)
executor-cores   Number of cores per executor (default: 1)
executor-memory  Memory per executor (default: 1g)
driver-cores     Number of cores used by the driver, YARN cluster mode only (default: 1)
driver-memory    Memory for the driver (default: 1g)
Calculate the right values
• For example: 4 servers for Spark, each with
64 GB RAM and 16 cores. How should we set the
spark-submit parameters?
– --num-executors 4 --executor-memory 63g --executor-cores 15
– --num-executors 7 --executor-memory 29g --executor-cores 7
– --num-executors 11 --executor-memory 19g --executor-cores 5
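The third option can be derived step by step. A minimal sketch of that reasoning in plain Python follows; the helper name and the reserved-resource assumptions (1 core and 1 GB per node for the OS and Hadoop daemons, one executor slot given up to the YARN ApplicationMaster, roughly 10% of container memory lost to overhead) are illustrative, not a Spark API:

```python
def size_executors(num_nodes, cores_per_node, ram_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Derive --num-executors / --executor-cores / --executor-memory."""
    usable_cores = cores_per_node - 1        # reserve 1 core for OS/daemons
    usable_ram = ram_gb_per_node - 1         # reserve 1 GB for OS/daemons
    executors_per_node = usable_cores // cores_per_executor
    num_executors = executors_per_node * num_nodes - 1   # minus 1 for the AM
    container_mem = usable_ram // executors_per_node     # GB per container
    heap_mem = int(container_mem / (1 + overhead_fraction))  # heap after overhead
    return num_executors, cores_per_executor, heap_mem

print(size_executors(4, 16, 64))  # -> (11, 5, 19)
```

With 5 cores per executor this reproduces the last option above: 11 executors, 5 cores each, 19 GB of heap.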
Spark Executor’s Memory Model
• Memory requested from YARN for each container =
spark.executor.memory + spark.yarn.executor.memoryOverhead
• spark.yarn.executor.memoryOverhead =
max(spark.executor.memory * 0.10, 384 MB)
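The container-size formula above can be sketched in a few lines of Python (the function name is illustrative; values are in MB):

```python
def yarn_container_mb(executor_memory_mb):
    # memoryOverhead default: max(10% of executor memory, 384 MB)
    overhead = max(int(executor_memory_mb * 0.10), 384)
    return executor_memory_mb + overhead

print(yarn_container_mb(1024))   # -> 1408  (small heap: the 384 MB floor applies)
print(yarn_container_mb(19456))  # -> 21401 (19g heap: the 10% term dominates)
```

Note the 384 MB floor: for small executors the overhead is proportionally much larger than 10%.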
More advanced parameters
spark.shuffle.memoryFraction    Fraction of Java heap to use for aggregation and cogroups during shuffles (default: 0.2)
spark.reducer.maxSizeInFlight   Maximum size of map outputs to fetch simultaneously from each reduce task (default: 48m)
spark.shuffle.consolidateFiles  If set to "true", consolidates intermediate files created during a shuffle (default: false)
spark.shuffle.file.buffer       Size of the in-memory buffer for each shuffle file output stream (default: 32k)
spark.storage.memoryFraction    Fraction of Java heap to use for Spark's memory cache (default: 0.6)
spark.akka.frameSize            Maximum message size to allow in "control plane" communication (for serialized tasks and task results), in MB (default: 10)
spark.akka.threads              Number of actor threads to use for communication (default: 4)
Advanced Spark memory
Demo Spark UI
Using Dynamic Allocation
• Dynamically scales the set of cluster resources
allocated to your application up and down based on
the workload
• Only available when using YARN as the cluster
manager
• Requires an external shuffle service, so a shuffle
service must be configured with YARN
Dynamic Allocation parameters (1)

spark.shuffle.service.enabled (default: false)
    Enables the external shuffle service. This service preserves the
    shuffle files written by executors so the executors can be safely
    removed.
spark.dynamicAllocation.enabled (default: false)
    Whether to use dynamic resource allocation.
spark.dynamicAllocation.executorIdleTimeout (default: 60s)
    If an executor has been idle for more than this duration, the
    executor will be removed.
spark.dynamicAllocation.cachedExecutorIdleTimeout (default: Infinity)
    If an executor which has cached data blocks has been idle for more
    than this duration, the executor will be removed.
spark.dynamicAllocation.initialExecutors (default: spark.dynamicAllocation.minExecutors)
    Initial number of executors to run.
Dynamic Allocation parameters (2)

spark.dynamicAllocation.maxExecutors (default: Infinity)
    Upper bound for the number of executors.
spark.dynamicAllocation.minExecutors (default: 0)
    Lower bound for the number of executors.
spark.dynamicAllocation.schedulerBacklogTimeout (default: 1s)
    If there have been pending tasks backlogged for more than this
    duration, new executors will be requested.
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (default: schedulerBacklogTimeout)
    Same as spark.dynamicAllocation.schedulerBacklogTimeout, but used
    only for subsequent executor requests.
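Putting the parameters above together, a minimal submit command for dynamic allocation on YARN might look like the following sketch (the min/max bounds, class name, and jar are placeholder assumptions, not values from this deck):

```shell
spark-submit \
  --master yarn \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --class com.example.MyApp my-app.jar
```

Note that --num-executors is omitted: with dynamic allocation enabled, Spark decides the executor count between the min and max bounds.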
Dynamic Allocation in Action
Dynamic Allocation - The verdict
• Dynamic Allocation helps use your cluster resources
more efficiently
• But it is only effective when the Spark application is
a long-running one with long stages that have
different numbers of tasks (Spark Streaming?)
• In addition, when an executor is removed, all of its
cached data is no longer accessible
Tips for Tuning Your Spark
Program
Tuning Memory Usage
• Prefer arrays of objects and primitive types over
the standard Java or Scala collection classes (e.g.
HashMap).
• Avoid nested structures with many small objects
and pointers when possible.
• Use numeric IDs or enumeration objects instead of
strings for keys.
• If you have less than 32 GB of RAM, set the JVM flag
-XX:+UseCompressedOops to make pointers four
bytes instead of eight.
Other Tuning Tips (1)
● Use KryoSerializer instead of the default JavaSerializer
● Know when to persist an RDD and determine the right
storage level
○ MEMORY_ONLY
○ MEMORY_AND_DISK
○ MEMORY_ONLY_SER
○ …
● Choose the right level of parallelism
○ spark.default.parallelism
○ repartition
○ the second (numPartitions) argument of methods in
PairRDDFunctions
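The serializer and parallelism knobs above can also be set at submit time. A sketch, assuming the 11-executor, 5-core sizing from earlier and the common rule of thumb of roughly 2 tasks per core (the jar name is a placeholder):

```shell
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.default.parallelism=110 \
  my-app.jar
```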
Other tuning tips (2)
• Broadcast large variables
• Do not collect large RDDs (filter first)
• Be careful with operations that require a data
shuffle (join, reduceByKey, groupByKey, …)
• Avoid groupByKey; use reduceByKey,
aggregateByKey, or combineByKey (lower level) if
possible.
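The groupByKey vs reduceByKey comparison on the next slides boils down to map-side combining. A toy model in plain Python (no Spark; partition data and function names are illustrative) counts how many records would cross the network in each case:

```python
from collections import defaultdict

# Two "partitions" of (key, value) pairs, as an RDD might hold them.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

def shuffled_records_groupByKey(parts):
    # groupByKey: every record crosses the network unchanged.
    return sum(len(p) for p in parts)

def shuffled_records_reduceByKey(parts):
    # reduceByKey: map-side combine first, so each partition sends
    # at most one (key, partial sum) record per distinct key.
    total = 0
    for p in parts:
        combined = defaultdict(int)
        for k, v in p:
            combined[k] += v
        total += len(combined)
    return total

print(shuffled_records_groupByKey(partitions))   # -> 6
print(shuffled_records_reduceByKey(partitions))  # -> 4
```

The gap widens as values per key grow: groupByKey shuffles every value, while reduceByKey shuffles at most (partitions × distinct keys) records.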
groupByKey vs reduceByKey (1)

groupByKey vs reduceByKey (2)
Example use case of tuning Spark
algorithm
Tuning CF algorithm in RW project
• 1st algorithm, no parameter tuning: 27 mins
• 1st algorithm, parameters tuned: 18 mins
• 2nd algorithm (from Spark code), parameters tuned:
~7 mins 30 s
• 3rd algorithm (improved Spark code), parameters
tuned: ~6 mins 30 s
Q&A

Thank You!
