6. Default != Recommended
Example: By default, spark.executor.memory = 1g
1g allows small jobs to finish out of the box.
Spark assumes you'll increase this parameter.
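As a sketch of raising this default, assuming a PySpark entry point (the app name and the 4g value are illustrative, not a recommendation for any specific cluster):

```python
from pyspark import SparkConf

# Default spark.executor.memory is only 1g; most real jobs need more.
conf = (SparkConf()
        .setAppName("mySparkApp")
        .set("spark.executor.memory", "4g"))  # illustrative value
```

The same setting can be passed on the command line with `spark-submit --conf spark.executor.memory=4g`.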
7. Which parameters are important?
How do I configure them?
Default != Recommended
8. The Spark Tuning Cheat-Sheet
• Filter* data before an expensive reduce or aggregation; consider* coalesce()
• Use* data structures that require less memory (see tuning.html#tuning-data-structures)
• Serialize:* PySpark serialization is built-in; in Scala/Java, use persist(StorageLevel.[*]_SER). Recommended: KryoSerializer
* See "Optimize partitions," "GC investigation," and "Checkpointing."
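A minimal PySpark sketch of these items, assuming a hypothetical RDD of (key, value) pairs (the data and partition counts are illustrative):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(master="local[2]", appName="cheat-sheet-sketch")
events = sc.parallelize([("a", 1), ("b", 2), ("a", 3)] * 1000)

# Filter BEFORE the expensive aggregation, not after.
active = events.filter(lambda kv: kv[1] > 1)

# Consider fewer, fuller partitions ahead of a wide operation.
active = active.coalesce(2)

# PySpark data is already serialized when cached; in Scala/Java you
# would use one of the StorageLevel.*_SER levels here instead.
active.persist(StorageLevel.MEMORY_ONLY)

totals = active.reduceByKey(lambda a, b: a + b)
totals.collect()
```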
29. What is the memory limit for mySparkApp?
Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
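The two formulas can be checked with a worked example. The numbers below are assumed for illustration (a 100 GB pool, 10g driver, 13g executors, 5 max executors), not values from the deck:

```python
# Available to mySparkApp: 3/4 of the pool.
pool_gb = 100
mem_limit_gb = pool_gb * 3 // 4  # 75 GB

# What mySparkApp can actually claim at full dynamic allocation.
driver_memory_gb = 10
executor_memory_gb = 13
max_executors = 5
app_mem_gb = driver_memory_gb + executor_memory_gb * max_executors  # 10 + 65 = 75

# The sizing fits exactly within the limit.
assert app_mem_gb <= mem_limit_gb
```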
31. Limitation: the driver must not be larger than a single node.
34.
Parameter              Default         Recommended
spark.executor.cores   1 (YARN mode)   5 or less
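A sketch of applying the recommendation, assuming a PySpark entry point (the app name is illustrative):

```python
from pyspark import SparkConf

# YARN default is 1 core per executor; 5 or fewer is recommended
# to keep HDFS client throughput from degrading.
conf = (SparkConf()
        .setAppName("mySparkApp")
        .set("spark.executor.cores", "5"))
```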
37.
executors per node = (cores per node) / (5 cores per executor)
executor.memory = (memory per node) / (executors per node)
maxExecutors = (executors per node) x (num nodes)
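Working the three formulas through with assumed hardware (10 nodes with 16 cores and 64 GB each; these numbers are illustrative, not from the deck):

```python
cores_per_node = 16
memory_per_node_gb = 64
num_nodes = 10
cores_per_executor = 5  # the "5 or less" recommendation

executors_per_node = cores_per_node // cores_per_executor      # 16 // 5 = 3
executor_memory_gb = memory_per_node_gb // executors_per_node  # 64 // 3 = 21
max_executors = executors_per_node * num_nodes                 # 3 x 10 = 30
```

In practice, leave some memory per node unclaimed for the OS and YARN overhead rather than allocating the full 64 GB.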
39. What is the memory limit for mySparkApp?
Verify that your calculations respect this limitation.
44. mySparkApp memory issues
• Reduce the memory needed for mySparkApp. How?
• Gracefully handle memory limitations. How?
48. Reduce the memory needed for mySparkApp. How?
persist(StorageLevel.[*]_SER)
50. Reduce the memory needed for mySparkApp. How?
persist(StorageLevel.[*]_SER)
Recommended: KryoSerializer *
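A sketch of enabling the recommended serializer, assuming a PySpark entry point (Kryo mainly benefits Scala/Java RDDs; PySpark data is already serialized by Python):

```python
from pyspark import SparkConf

# Kryo is faster and more compact than Java serialization.
conf = (SparkConf()
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer"))
```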
54. mySparkApp memory issues
• Reduce the memory needed for mySparkApp. How?
• Gracefully handle memory limitations. How?
Here, let's talk about one scenario.
56. Symptoms:
• mySparkApp has been running for several hours.
• A container is lost.
• Several executors are lost.
• Behavior is intermittent (sometimes it succeeds, sometimes it fails).
61. Potential Solution: RDD.checkpoint()
Use in these cases:
• high-traffic cluster
• network blips
• preemption
• disk space nearly full
Function: saves the RDD to stable storage (e.g. HDFS or S3)
How-to: cache first!
SparkContext.setCheckpointDir(directory: String)
RDD.checkpoint()
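The how-to above can be wired together as a short PySpark sketch, assuming a local context and a writable checkpoint directory (the path and data are illustrative; in production point the directory at HDFS or S3):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(master="local[2]", appName="checkpoint-sketch")
sc.setCheckpointDir("/tmp/spark-checkpoints")  # hdfs:// or s3:// in production

rdd = sc.parallelize(range(100000)).map(lambda x: x * 2)

# Cache first, so checkpointing does not recompute the whole lineage.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.checkpoint()
rdd.count()  # an action triggers both the cache and the checkpoint
```

After the checkpoint is written, the RDD's lineage is truncated, so a lost executor recomputes from stable storage instead of from the beginning.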