by Anya Bida and Rachel Warren from Alpine Data
https://spark-summit.org/east-2016/events/spark-tuning-for-enterprise-system-administrators/
Spark offers the promise of speed, but many enterprises are reluctant to make the leap from Hadoop to Spark. Indeed, system administrators face many challenges in tuning Spark performance. This talk is a gentle introduction to Spark tuning for the enterprise system administrator, based on experience assisting two enterprise companies running Spark in yarn-cluster mode. The initial challenges can be categorized in two FAQs. First, with so many Spark tuning parameters, how do I know which parameters are important for which jobs? Second, once I know which Spark tuning parameters I need, how do I enforce them for the various users submitting various jobs to my cluster? This introduction will enable enterprise system administrators to overcome common issues quickly and focus on more advanced Spark tuning challenges. The audience will understand the "cheat-sheet" posted here: http://techsuppdiva.github.io/

Key takeaways:

FAQ 1: With so many Spark tuning parameters, how do I know which parameters are important for which jobs?
Solution 1: The Spark Tuning cheat-sheet! A visualization that guides the system administrator to quickly overcome the most common hurdles to algorithm deployment. http://techsuppdiva.github.io/

FAQ 2: Once I know which Spark tuning parameters I need, how do I enforce them at the user level? Job level? Algorithm level? Project level? Cluster level?
Solution 2: We'll approach these challenges using job and cluster configuration, the Spark context, and third-party tools, of which Alpine will be one example. We'll operationalize Spark parameters according to user, job, algorithm, workflow pipeline, or cluster levels.
3. About Anya / About Rachel
Anya Bida: Operations Engineer
Rachel Warren: Spark & Scala Enthusiast / Data Engineer
About Alpine Data
alpinenow.com
Alpine deploys Spark in production for our enterprise customers.
6. Default != Recommended
Example: By default, spark.executor.memory = 1g.
1g allows small jobs to finish out of the box; Spark assumes you'll increase this parameter.
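For instance, a minimal Scala sketch of raising that default in application code; the 4g value is an illustrative assumption, not a recommendation from the talk:

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.memory defaults to 1g; most production jobs need more.
// 4g is an assumed example value -- size it to your workload and nodes.
val conf = new SparkConf()
  .setAppName("mySparkApp")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)

The same setting can also be passed at submit time, e.g. spark-submit --conf spark.executor.memory=4g.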
7. Which parameters are important? How do I configure them?
Default != Recommended
8. The Spark Tuning Cheat-Sheet
[Cheat-sheet diagram; the recoverable tips:]
• Filter* data before an expensive reduce or aggregation; consider* coalesce().
• Use* data structures that require less memory.
• Serialize*: in PySpark, serializing is built-in. Scala/Java? persist(StorageLevel.*_SER); recommended: KryoSerializer (tuning.html#tuning-data-structures).
* Footnotes point to the sections "Optimize partitions.", "GC investigation.", and "Checkpointing."
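A hedged Scala sketch of the serialization tip above, assuming Spark 1.x-era APIs and a hypothetical input path:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Register Kryo, then persist an RDD in serialized form (one of the *_SER levels).
val conf = new SparkConf()
  .setAppName("serializedCache")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val data = sc.textFile("hdfs:///some/input")  // hypothetical path
data.persist(StorageLevel.MEMORY_ONLY_SER)    // serialized: less memory, more CPU
println(data.count())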
16. What is the memory limit for mySparkApp?
Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
Reserve 25% for overhead.
Limitation: the pool allocation, e.g. YARN fair scheduler: <maxResources>8000 mb</maxResources>
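A worked example using the slide's own pool size: with <maxResources>8000 mb</maxResources>,

mySparkApp_mem_limit = 8000 MB x 3/4 = 6000 MB

leaving 2000 MB (25%) of the pool reserved for overhead.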
18. What is the memory limit for mySparkApp?
Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
19. What is the memory limit for mySparkApp?
Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
Limitation: each driver and executor must not be larger than a single node:
executor.memory ~ (yarn.nodemanager.resource.memory-mb - 1 GB) / (# executors per node)
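An illustrative sizing (numbers assumed, not from the slides): on nodes with yarn.nodemanager.resource.memory-mb = 32768 (32 GB) running 4 executors per node,

executor.memory ~ (32768 MB - 1024 MB) / 4 = 7936 MB

so roughly 7g per executor once you round down.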
20. What is the memory limit for mySparkApp?
Max Memory in "pool" x 3/4 = mySparkApp_mem_limit
mySparkApp_mem_limit = driver.memory + (executor.memory x dynamicAllocation.maxExecutors)
Limitation: maxExecutors should not exceed the pool allocation, e.g. YARN: <maxResources>8vcores</maxResources>
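Putting the formulas together with assumed values: given the 6000 MB limit derived from the 8000 mb pool above, driver.memory = 1g and executor.memory = 1g allow

dynamicAllocation.maxExecutors <= (6000 MB - 1024 MB) / 1024 MB ~ 4

so a maxExecutors above 4 would push mySparkApp past its pool share.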
21. I want a little more information...
• Top 5 Mistakes When Writing Spark Applications, by Mark Grover and Ted Malaska of Cloudera
  http://www.slideshare.net/hadooparchbook/top-5-mistakes-when-writing-spark-applications
• How-to: Tune Your Apache Spark Jobs (Part 2), by Sandy Ryza of Cloudera
  http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
I want lots more... (see "Further Reading" below)
30. Container is lost.
Symptoms:
• mySparkApp has been running for several hours
• I notice one container fails, then the rest fail one by one
• The first container to fail was the driver
• The driver is a SPOF (single point of failure)
31. Investigate:
• Driver failures are often caused by collecting unbounded data to the driver.
• I verified only bounded data is brought to the driver, but still the driver fails intermittently.
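A hedged illustration of that first bullet (not code from the talk): collect() pulls an entire RDD into driver memory, while take(n) bounds what the driver must hold.

// Assumes an existing SparkContext `sc`; the input path is hypothetical.
val events = sc.textFile("hdfs:///logs/events")

// Risky: materializes the whole RDD on the driver; unbounded input can OOM it.
// val everything = events.collect()

// Safer: bring back only a bounded sample for inspection.
val sample = events.take(100)
sample.foreach(println)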
32. Potential Solution: RDD.checkpoint()
Use in these cases:
• high-traffic cluster
• network blips
• preemption
• disk space nearly full
Function:
• saves the RDD to stable storage (e.g. HDFS or S3)
How-to:
SparkContext.setCheckpointDir(directory: String)
RDD.checkpoint()
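A minimal runnable sketch of that how-to, assuming a hypothetical checkpoint directory on HDFS:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpointDemo"))

// Checkpoints must live on stable storage visible to all nodes, e.g. HDFS.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // hypothetical path

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
rdd.checkpoint()  // marks the RDD; the data is written on the next action
rdd.count()       // action triggers the checkpoint and truncates the lineage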
36. Further Reading:
• Learning Spark, by H. Karau, A. Konwinski, P. Wendell, M. Zaharia, 2015, O'Reilly
  https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
• Scheduling:
  https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
• Tuning the Spark conf:
  Mark Grover and Ted Malaska (Cloudera): http://www.slideshare.net/hadooparchbook/top-5-mistakes-when-writing-spark-applications
  Sandy Ryza (Cloudera): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
• Checkpointing:
  http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
• Troubleshooting:
  Miklos Christine (Databricks): https://spark-summit.org/east-2016/events/operational-tips-for-deploying-spark/
• High Performance Spark, by R. Warren and H. Karau, coming in 2016, O'Reilly
  http://highperformancespark.com/