This is a version of a talk I presented at Spark Summit East 2016 with Rachel Warren. In this version, I also discuss memory management on the JVM with pictures from Alexey Grishchenko, Sandy Ryza, and Mark Grover.
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
by Anya Bida and Rachel Warren from Alpine Data
https://spark-summit.org/east-2016/events/spark-tuning-for-enterprise-system-administrators/
Spark offers the promise of speed, but many enterprises are reluctant to make the leap from Hadoop to Spark. Indeed, System Administrators will face many challenges with tuning Spark performance. This talk is a gentle introduction to Spark Tuning for the Enterprise System Administrator, based on experience assisting two enterprise companies running Spark in yarn-cluster mode. The initial challenges can be categorized in two FAQs. First, with so many Spark Tuning parameters, how do I know which parameters are important for which jobs? Second, once I know which Spark Tuning parameters I need, how do I enforce them for the various users submitting various jobs to my cluster? This introduction to Spark Tuning will enable enterprise system administrators to overcome common issues quickly and focus on more advanced Spark Tuning challenges. The audience will understand the “cheat-sheet” posted here: http://techsuppdiva.github.io/
Key takeaways:
FAQ 1: With so many Spark Tuning parameters, how do I know which parameters are important for which jobs?
Solution 1: The Spark Tuning cheat-sheet! A visualization that guides the System Administrator to quickly overcome the most common hurdles to algorithm deployment. [1] http://techsuppdiva.github.io/
FAQ 2: Once I know which Spark Tuning parameters I need, how do I enforce them at the user level? Job level? Algorithm level? Project level? Cluster level?
Solution 2: We’ll approach these challenges using job & cluster configuration, the Spark context, and 3rd party tools – of which Alpine will be one example. We’ll operationalize Spark parameters according to user, job, algorithm, workflow pipeline, or cluster levels.
Breaking Spark: Top 5 mistakes to avoid when using Apache Spark in production, by Neelesh Srinivas Salian
Spark has been growing in deployments for the past year. The increasing amount of data being analyzed and processed through the framework is massive and continues to push the boundaries of the engine. Drawing on experiences across 150+ production deployments, Neelesh Srinivas Salian explores common issues observed in a cluster environment setup with Apache Spark and offers guidelines to help setup a real-world environment when planning an Apache Spark deployment in a cluster. Attendees can use these observations to improve the usability and supportability of Apache Spark and avoid such issues in their projects.
Topics include:
Scaling the architecture
Memory configurations
End-user code
Incompatible dependencies
Administration- and operation-related issues
From common errors seen in running Spark applications (e.g., OutOfMemory, NoClassFound, disk I/O bottlenecks, History Server crashes, cluster under-utilization) to the advanced settings used to resolve large-scale Spark SQL workloads (such as HDFS block size vs. Parquet block size, and how best to run the HDFS Balancer to redistribute file blocks), you will get the full scoop in this information-packed presentation.
Harnessing Spark and Cassandra with Groovy, by Steve Pember
This talk is an introduction to a powerful combination in the big data space: Apache Spark and Cassandra. Spark is a cluster-computing framework that allows users to perform calculations against resilient in-memory datasets using a functional programming interface. Cassandra is a linearly scalable, fault tolerant, decentralized datastore. These two technologies are complicated, but integrate well and provide such a level of utility that whole companies have formed around them.
In this talk we’ll learn how Spark and Cassandra can be leveraged within your Groovy application, even though Spark normally expects a Scala environment. We’ll talk about Spark and Cassandra from a high level and walk through code examples. We’ll discuss the pitfalls of working with these technologies - like modeling your data appropriately to ensure even distribution in Cassandra, and general packaging woes with Spark - and ways to avoid them. Finally, we’ll explore how we at ThirdChannel are using these technologies.
Getting Buzzed on Buzzwords: Using Cloud & Big Data to Pentest at Scale, by Bishop Fox
You’ve heard about cloud, big data, server-less infrastructure, web scale, and other buzzwords that cause VCs to throw money at people - but how does this help you? If you’re getting bored going over the same checklist in your pentests then you’re missing out on what some of these new technologies can offer you. Using some of the newer cloud technologies not only can you automate all of your workflows, but you can do so with almost zero maintenance at a low cost with almost infinite scalability! This talk will show you how to blow conventional pentesters out of the water using some cool new technologies along with a little bit of trickery.
Some of the topics we’ll go over include:
Cheap and scalable rainbow tables with BigQuery, 5TB in 10 seconds
SQS & Lambda, like Burp Intruder but 10K QPS
Scalable GPU clusters on the cheap with Spot Instances and Elastic Beanstalk
Cloud exit nodes, rotating IPs via Elastic Beanstalk and nano instances
Cost-effective fuzzing with Elastic Beanstalk and Spot Instances
(This was originally presented on November 16, 2018 at Kiwicon 2038).
SF Solr Meetup - Interactively Search and Visualize Your Big Data, by gethue
Open up your user base to the data! Contrary to programming and SQL, almost everybody knows how to search. This talk describes through an interactive demo based on open source Hue how users can graphically search their data in Hadoop. The underlying technical details of the application and its interaction with Apache Solr will be clarified.
The session will detail how to get started with data indexing in just a few clicks as well as explore several data analysis scenarios with the latest Solr Analytics Facets and Spark Streaming. Through a Web browser, attendees will be shown how to explore and visualize data for quick answers. The search dashboard in Hue, with its draggable charts and dynamic interface, lets any non-technical user look for documents or patterns.
Attendees of this talk will learn how to get started with interactive search visualization in their Solr cluster.
Riot Games serves an international base of players that creates terabytes of data daily. With their data centers quickly reaching capacity, Riot Games migrated their entire warehouse to AWS in order to scale operations more effectively. Learn how Riot Games used DynamoDB and Elastic MapReduce to import millions of rows of customer metrics. Hear about the technical specifics involved, and the lessons learned along the way of migrating large amounts of data to AWS.
How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent..., by Amazon Web Services
Parse is a BaaS for mobile developers that is built entirely on AWS. With over 150,000 mobile apps hosted on Parse, the stability of the platform is our primary concern, but it coexists with rapid growth and a demanding release schedule. This session is a technical discussion of the current architecture and the design decisions that went into scaling the platform rapidly and robustly over the past year and a half. We talk about some of the lessons learned managing and scaling MongoDB, Cassandra, Redis, and MySQL in the cloud. We also discuss how Parse went from launching individual instances using Chef to managing clusters of hosts with Auto Scaling groups, with instance discovery and registry handled by ZooKeeper, thus enabling us to manage vastly larger sets of services with fewer human resources. This session is useful to anyone who is trying to scale up from startup to established platform without sacrificing agility.
Building a highly scalable website requires understanding the core building blocks of your application environment. In this talk we dive into Jahia's core components to understand how they interact and how, by (1) respecting a few architectural practices and (2) fine-tuning Jahia components and the JVM, you will be able to build a highly scalable service.
At Yahoo!, over the past year we have helped migrate hundreds of our grids' users to YARN. Our YARN clusters have in aggregate run over 18 million jobs with more than 3 billion tasks, consuming over 10 thousand years of compute time, with one single cluster running 90 thousand jobs a day. From this experience we would like to share what we have learned about running YARN well, how this is different from running a 1.0-based cluster, and what it takes to migrate your jobs from 1.0 to YARN.
Serverless is a new framework that allows developers to easily harness AWS Lambda and API Gateway to build and deploy full-fledged API services without needing to deal with any ops-level overhead or paying for servers when they're not in use. It's kinda like Heroku on-demand for single functions.
The job throughput and Apache Hadoop cluster utilization benefits of YARN and MapReduce v2 are widely known. Who wouldn’t want job throughput increased by 2x? Most likely you’ve heard (repeatedly) about the key benefits that could be gained from migrating your Hadoop cluster from MapReduce v1 to YARN: namely, improved job throughput and cluster utilization, as well as permitting different computational frameworks to run on Hadoop. What you probably haven’t heard about are the configuration tweaks needed to ensure your existing MR v1 jobs can run on your YARN cluster, as well as YARN-specific configuration settings. In this session we’ll start with a list of recommended YARN configurations, and then step through the most common use cases we’ve seen in the field. Production migrations can quickly go awry without proper guidance. Learn from others’ misconfigurations to get your YARN cluster configured right the first time.
In recent years a wide range of new technologies have disrupted traditional data management. We're now in the middle of a revolution in data processing methods. Choosing allegiances in revolution is risky. In this talk, Doug will present the underlying causes of the revolution and predict how the data world might look once we're through it.
Learn why 451 Research believes Infochimps is well-positioned with an easy-to-consume managed service for those without Hadoop expertise, as well as a stack of technologically interesting projects for the 'devops' crowd.
Opening with a market positioning statement and ending with a competitive and SWOT analysis, Matt Aslett provides a comprehensive impact report.
This slide deck introduces the environment you need to prepare when setting up a Hadoop cluster with CDH (Cloudera's Distribution Including Apache Hadoop), including hardware specifications, the system environment, and software versions.
Related resources collected in the future will be kept in the following two Diigo libraries:
https://www.diigo.com/user/phate334/cloudera
https://www.diigo.com/user/phate334/hadoop
The new YARN framework promises to make Hadoop a general-purpose platform for Big Data and enterprise data hub applications. In this talk, you'll learn about writing and taking advantage of applications built on YARN.
Unlock Hadoop Success with Cloudera Navigator Optimizer, by Cloudera, Inc.
Cloudera Navigator Optimizer analyzes existing SQL workloads to provide instant insights into your workloads and turns that into an intelligent optimization strategy so you can unlock peak performance and efficiency with Hadoop.
In this document, we will present a very brief introduction to Big Data (what is Big Data?), Hadoop (how does Hadoop fit into the picture?), and Cloudera Hadoop (what is the difference between Cloudera Hadoop and regular Hadoop?).
Please note that this document is for Hadoop beginners looking for a place to start.
Data Science at Scale Using Apache Spark and Apache Hadoop, by Cloudera, Inc.
Learn about the skills and tools a data scientist needs and how to start training to be one.
There's so much noise about what a data scientist is or isn't that it can be challenging to identify the skills needed to start training a team or becoming one yourself. What exactly is a data scientist and where do you start?
Cloudera's Director of Data Science, Sean Owen, will start by walking through the different skills a data scientist should have and why businesses need them. Afterwards, Tom Wheeler, Cloudera's Principal Curriculum Developer, will introduce the latest data science course developed by Cloudera University, designed to help people take their first steps toward becoming a data scientist.
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ..., by Anya Bida
Abstract: Imagine we have Ada, our data science intern. Let's run through a very simple word-count Spark job and find a handful of potential failure points. Dozens of failures can and should happen when running Spark jobs on commodity hardware. Given the basic foundation for infrastructure-level expectations, this talk gives Ada tools to ensure her job isn’t caught dead. Once the simple example job runs reliably, with the potential to scale, our data scientist can apply the same toolset to focus on some more interesting algorithms. Turn SNAFUs into successes by anticipating and handling infra failures gracefully.
Note: this talk is a spark-focused extension of Part I, "Just Enough DevOps For Data Scientists" from Scale by The Bay 2018
https://www.youtube.com/watch?v=RqpnBl5NgW0&t=19s
Bio: Anya Bida (https://www.linkedin.com/in/anyabida/)
Spark started at Facebook as an experiment when the project was still in its early phases. Spark's appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time the system was used by a handful of people to process small amounts of data. However, we've come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook, in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned and scaled Apache Spark at Facebook to run on 10s of thousands of machines, processing 100s of petabytes of data, and used by 1000s of data scientists, engineers and product analysts every day. In this talk, we'll focus on three areas:
Scaling Compute: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared-storage) clusters.
Optimizing Core Engine: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second.
Scaling Users: How we make Spark easy to use, and faster to debug, to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as take a look to the future at data property type accumulators, which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool, however when you have 100 computers the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
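As a quick illustration of the accumulator-for-debugging idea described above, here is a minimal Scala sketch (not code from the talk; the input path and field layout are made up) that counts malformed records while a job runs:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: path and field layout are hypothetical.
val spark = SparkSession.builder().appName("debug-with-accumulators").getOrCreate()
val sc = spark.sparkContext

val badRecords = sc.longAccumulator("badRecords")    // shows up per stage in the Spark UI

val parsed = sc.textFile("hdfs:///data/input.csv")   // assumed input path
  .flatMap { line =>
    val cols = line.split(",")
    if (cols.length < 2) { badRecords.add(1); None }  // count, then drop, malformed lines
    else Some((cols(0), cols(1)))
  }

parsed.count()                                        // an action forces evaluation
println(s"bad records seen: ${badRecords.value}")     // caveat: recomputed partitions can inflate this value
```

The final comment is exactly the caveat the abstract alludes to: if a partition is recomputed, the accumulator is incremented again, so values from transformations should be treated as debugging hints rather than exact counts.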
Debugging Apache Spark - Scala & Python super happy fun times 2017, by Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in our job.
Just enough DevOps for Data Scientists (Part II), by Databricks
Imagine we have Ada, our data science intern. Let's run through a very simple word-count Spark job and find a handful of potential failure points. Dozens of failures can and should happen when running Spark jobs on commodity hardware. Given the basic foundation for infrastructure-level expectations, this talk gives Ada tools to ensure her job isn’t caught dead. Once the simple example job runs reliably, with the potential to scale, our data scientist can apply the same toolset to focus on some more interesting algorithms. Turn SNAFUs into successes by anticipating and handling infra failures gracefully.
Note: this talk is a spark-focused extension of Part I, "Just Enough DevOps For Data Scientists" from Scale by The Bay 2018
https://www.youtube.com/watch?v=RqpnBl5NgW0&t=19s
Getting Started with Apache Spark on Kubernetes, by Databricks
Community adoption of Kubernetes (instead of YARN) as a scheduler for Apache Spark has been accelerating since the major improvements from Spark 3.0 release. Companies choose to run Spark on Kubernetes to use a single cloud-agnostic technology across their entire stack, and to benefit from improved isolation and resource sharing for concurrent workloads. In this talk, the founders of Data Mechanics, a serverless Spark platform powered by Kubernetes, will show how to easily get started with Spark on Kubernetes.
Debugging PySpark - Spark Summit East 2017, by Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as take a look to the future at data property type accumulators, which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Video: https://www.youtube.com/watch?v=A0jYQlxc2FU&feature=youtu.be
Debugging PySpark: Spark Summit East talk by Holden Karau (Spark Summit)
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as take a look to the future at data property type accumulators, which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Understanding Spark Tuning: Strata New York, by Rachel Warren
How to design a Spark Auto Tuner.
The first section covers how to set basic Spark settings, e.g., executor memory, driver memory, dynamic allocation, shuffle settings, number of partitions, etc. The second section covers how to collect historical data about a Spark job, and the third section discusses designing an auto-tuner application which will programmatically configure Spark jobs using that historical data.
Apache Spark is an amazing distributed system, but part of the bargain we’ve made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Tuning Apache Spark is somewhat of a dark art, although thankfully, when it goes wrong, all we tend to lose is several hours of our day and our employer’s money.
Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using both historical and live job information, using systems like Apache BEAM, Mahout, and internal Spark ML jobs as workloads. Much of the data required to effectively tune jobs is already collected inside of Spark. You just need to understand it. Holden, Rachel, and Anya outline sample auto-tuners and discuss the options for improving them and applying similar techniques in your own work. They also discuss what kind of tuning can be done statically (e.g., without depending on historic information) and look at Spark’s own built-in components for auto-tuning (currently dynamically scaling cluster size) and how you can improve them.
Even if the idea of building an auto-tuner sounds as appealing as using a rusty spoon to debug the JVM on a haunted supercomputer, this talk will give you a better understanding of the knobs available to you to tune your Apache Spark jobs.
Also, to be clear, Holden, Rachel, and Anya don’t promise to stop your pager going off at 2:00am, but hopefully this helps.
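As a rough sketch of the third idea above, programmatically configuring a job from collected history, here is what that could look like; the JobHistory shape, the lookupHistory helper, and the 1.2x headroom factor are all hypothetical, not the design from the talk:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical shape of the per-job history an auto-tuner might collect.
case class JobHistory(peakExecutorMemGb: Double, shufflePartitions: Int)

// Stand-in for a lookup against a store of previous runs (hypothetical helper).
def lookupHistory(jobName: String): Option[JobHistory] =
  Some(JobHistory(peakExecutorMemGb = 3.2, shufflePartitions = 400))

val jobName = "nightly-aggregation"
val history = lookupHistory(jobName)

// Derive settings from history, adding ~20% headroom, with safe fallbacks when no history exists.
val executorMem = history.map(h => s"${math.ceil(h.peakExecutorMemGb * 1.2).toInt}g").getOrElse("4g")
val partitions  = history.map(_.shufflePartitions).getOrElse(200)

val spark = SparkSession.builder()
  .appName(jobName)
  .config("spark.executor.memory", executorMem)
  .config("spark.sql.shuffle.partitions", partitions.toString)
  .getOrCreate()
```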
Apache Spark Performance is too hard. Let's make it easier, by Databricks
Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will then describe how this is critical to the design of Spark and how this tight interplay allows very efficient execution. We will also discuss various sources of metrics on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. Users and operators who are aware of these concepts will become more effective at their interactions with Spark.
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018, by Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ..., by Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2..., by pchutichetpong
M Capital Group (“MCG”) expects to see growing demand and an evolving supply picture, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), alongside an ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Adjusting OpenMP PageRank: SHORT REPORT / NOTES, by Subhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method..., by 2023240532
Quantitative Data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder
6. Default != Recommended
Example: By default, spark.executor.memory = 1g
1g allows small jobs to finish out of the box.
Spark assumes you'll increase this parameter.
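For example, a minimal sketch of raising that default when building the Spark context (the 4g value is illustrative, not a recommendation from the deck):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical job-level override of the 1g default; the same key can also be set
// cluster-wide in spark-defaults.conf or per submission with --executor-memory.
val conf = new SparkConf()
  .setAppName("mySparkApp")
  .set("spark.executor.memory", "4g")

val spark = SparkSession.builder().config(conf).getOrCreate()
```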
7. Which parameters are important? How do I configure them?
Default != Recommended
8. Filter* data before an expensive reduce or aggregation; consider* coalesce().
Use* data structures that require less memory.
Serialize*: PySpark serializing is built-in; Scala/Java? persist(storageLevel.[*]_SER). Recommended: kryoserializer.
* See tuning.html#tuning-data-structures and the cheat-sheet sections "Optimize partitions," "GC investigation," and "Checkpointing."
The Spark Tuning Cheat-Sheet
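To picture several of those tips together, here is a small Scala sketch (app name, paths, and the partition count are made up; this is not code from the deck):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch only: app name, paths, and numbers below are hypothetical.
val spark = SparkSession.builder()
  .appName("mySparkApp")
  // The cheat-sheet's recommended serializer; usually faster and more compact than Java serialization.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val lines = spark.sparkContext.textFile("hdfs:///data/events")   // assumed input path

// Filter before the expensive aggregation, and persist the reused data in serialized form.
val filtered = lines.filter(_.nonEmpty)
filtered.persist(StorageLevel.MEMORY_ONLY_SER)

val counts = filtered
  .map(line => (line.split(",")(0), 1L))
  .reduceByKey(_ + _)

// coalesce() down to a few partitions before writing small output.
counts.coalesce(8).saveAsTextFile("hdfs:///out/counts")           // assumed output path
```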
28. Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
mySparkApp_mem_limit > driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
What is the memory limit for mySparkApp?
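To make the rule of thumb concrete, here is a tiny worked check with hypothetical numbers (a 100 GB pool, a 4g driver, 7g executors, and at most 10 executors):

```scala
// All numbers below are hypothetical, purely to illustrate the slide's formula.
val poolMaxGb          = 100.0                     // max memory available in the "pool" (YARN queue)
val mySparkAppMemLimit = poolMaxGb * 3.0 / 4.0     // = 75 GB

val driverMemoryGb   = 4.0                         // spark.driver.memory = 4g
val executorMemoryGb = 7.0                         // spark.executor.memory = 7g
val maxExecutors     = 10                          // spark.dynamicAllocation.maxExecutors = 10

val requestedGb = driverMemoryGb + executorMemoryGb * maxExecutors   // 4 + 7 * 10 = 74 GB
assert(requestedGb <= mySparkAppMemLimit)          // 74 GB fits under the 75 GB limit
```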
30. Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
mySparkApp_mem_limit > driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
What is the memory limit for mySparkApp?
Limitation: Driver must not be larger than a single node.
34. Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
mySparkApp_mem_limit > driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
What is the memory limit for mySparkApp?
Verify my calculations respect this limitation.
39. Reduce the memory needed for mySparkApp. How?
Gracefully handle memory limitations. How?
mySparkApp memory issues
41. Reduce the memory needed for mySparkApp. How?
Gracefully handle memory limitations. How?
mySparkApp memory issues
Here let's talk about one scenario.
43. Reduce the memory needed for mySparkApp. How?
Gracefully handle memory limitations. How?
mySparkApp memory issues
persist(storageLevel.[*]_SER)
45. Reduce the memory needed for mySparkApp. How?
Gracefully handle memory limitations. How?
mySparkApp memory issues
persist(storageLevel.[*]_SER)
Recommended: kryoserializer *