SlideShare a Scribd company logo
1 of 57
Understanding Spark Tuning
(Magical spells to stop your pager going off at 2:00am)
Holden Karau, Rachel Warren
Rachel
- Rachel Warren → She/ Her
- Data Scientist / Software engineer at Salesforce Einstein
- Formerly at Alpine Data (with Holden)
- Lots of experience scaling Spark in different production environments
- The other half of the High Performance Spark team :)
- @warre_n_peace
- Linked in: https://www.linkedin.com/in/rachelbwarren/
- Slideshare: https://www.slideshare.net/RachelWarren4/
- Github: https://github.com/rachelwarren
Holden:
● My name is Holden Karau
● Prefered pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC :)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams & live coding: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
Who we think you wonderful humans are?
● Nice enough people
● I’m sure you love pictures of cats
● Might know something about using Spark, or are using it in production
● Maybe sys-admin or developer
● Are tired of spending so much time fussing with Spark Settings to get jobs to
run
Lori Erickson
The goal of this talk is to give you the resources to programatically tune
your Spark jobs so that they run consistently and efficiently
In terms of and $$$$$
What we will cover?
- A run down of the most important settings
- Getting the most out of Spark’s built-in Auto Tuner Options
- A few examples of errors and performance problems that can be addressed
by tuning
- A job can go out of tune over time as the world and it changes, much like Holden’s Vespa.
- How to tune jobs “statically” e.g. without historical data
- How to collect historical data (meet Robin Sparkles: https://github.com/high-
performance-spark/robin-sparkles )
- An example of using static and historical information to programmatically
configure Spark jobs
- The latest and greatest auto tuner tools
I can haz application :p
val conf = new SparkConf()
.setMaster("local")
.setAppName("my_awesome_app")
val sc = SparkContext.getOrCreate(newConf)
val rdd = sc.textFile(inputFile)
val words: RDD[String] = rdd.flatMap(_.split(“ “).
map(_.trim.toLowerCase))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCount.saveAsTextFile(outputFile)
Trish Hamme
Settings go here
This is a shuffle
I can haz application :p
val conf = new SparkConf()
.setMaster("local")
.setAppName("my_awesome_app")
val sc = SparkContext.getOrCreate(newConf)
val rdd = sc.textFile(inputFile)
val words: RDD[String] = rdd.flatMap(_.split(“ “).
map(_.trim.toLowerCase))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCount.saveAsTextFile(outputFile)
Trish Hamme
Start of application
End Stage 1
Stage 2
Action, Launches Job
How many resources to give my application?
● Spark.executor.memory
● Spark.driver.memory
● spark.executor.vcores
● Enable dynamic allocation
○ (or set # number of executors)
val conf = new SparkConf()
.setMaster("local")
.setAppName("my_awesome_app")
.set("spark.executor.memory", ???)
.set("spark.driver.memory", ???)
.set("spark.executor.vcores", ???)
Spark Execution Environment
Other App
My Spark App
- Node can have several executors
- But on executor can only be on
one node
- Executors have same amount of
memory and cores
- One task per core
- Task is the compute for one
partition
- Rdd is distributed set of partitions
Executor and Driver Memory
- Driver Memory
- As small as it can be without failing (but that can be pretty big)
- Will have to be bigger is collecting data to the driver, or if there are many partitions
- Executor memory + overhead < less than the size of the container
- Think about binning
- if you have 12 gig nodes making an 8 gig executor is maybe silly
- Pros Of Fewer Larger Executors Per Node
- Maybe less likely to oom
- Some tasks can take a long time
- Cons of Fewer Large Executors (Pros of More Small Executors)
- Some people report slow down with more than 5ish cores … (more on that later)
- If using dynamic allocation may be harder to “scale up” on a busy cluster
Vcores
- Remember 1 core = 1 task. So number of concurrent tasks is limited by total
cores
- Sort of, unless you change it. Terms and conditions apply to Python users.
- In HDFS too many cores per executor may cause issue with too many
concurrent hdfs threads
- maybe?
- 1 core per executor takes away some benefit of things like broadcast
variables
- Think about “burning” cpu and memory equally
- If you have 60Gb ram & 10 core nodes, making default executor size 30 gb but with ten cores
maybe not so great
How To Enable Dynamic Allocation
Dynamic Allocation allows Spark to add and subtract executors between
Jobs over the course of an application
- To configure
- spark.dynamicAllocation.enabled=true
- spark.shuffle.service.enabled=true ( you have to configure external shuffle service on each worker)
- spark.dynamicAllocation.minExecutors
- spark.dynamicAllocation.maxExecutors
- spark.dynamicAllocation.initialExecutors
- To Adjust
- Spark will add executors when there are pending tasks (spark.dynamicAllocation.schedulerBacklogTimeout)
- and exponentially increase them as long as tasks in the backlog persist
(spark...sustainedSchedulerBacklogTimeout)
- Executors are decommissioned when they have been idle for
spark.dynamicAllocation.executorIdleTimeout
Why To Enable Dynamic Allocation
When
- Most important for shared or cost sensitive environments
- Good when an application contains several jobs of differing sizes
- The only real way to adjust resources throughout an application
Improvements
- If jobs are very short adjust the timeouts to be shorter
- For jobs that you know are large start with a higher number of initial executors to avoid slow spin up
- If you are sharing a cluster, setting max executors can prevent you from hogging it
Run it!
Matthew Hoelscher
Oh no! It failed :( How could we adjust it?
Suppose that in the driver log, we see a “container lost exception” and on the
executor logs we see:
java.lang.OutOfMemoryError: Java heap space
This points to an out of memory error on the executors
hkase
Addressing Executor OOM
- If we have more executor memory to give it, try that!
- Lets try increasing the number of partitions so that each executor will process
smaller pieces of the data at once
- Spark.default.parallelism = 10
- Or by adding the number of partitions to the code e.g. reduceByKey(numPartitions = 10)
- Many more things you can do to improve the code
Low Cluster Utilization: Idle Executors Susanne Nilsson
What to do about it?
- If we see idle executors but the total size of our job is small, we may just be
requesting too many executors
- If all executors are idle it maybe because we are doing a large computation in
the driver
- If the computation is very large, and we see idle executors, this maybe
because the executors are waiting for a “large” task → so we can increase
partitions
- At some point adding partitions will slow the job down
- But only if not too much skew
Toshiyuki IMAI
Shuffle Spill to Disk/ Slow Shuffle Fung0131
Preventing Shuffle Spill to Disk
- Larger executors
- Configure off heap storage
- More partitions can help (ideally the labor of all the partitions on one executor
can “fit” in that executor’s memory)
- We can adjust shuffle settings
- Increase shuffle memory fraction (spark.shuffle.memory.fraction)
- Try increasing:
- spark.shuffle.file.buffer
- Configure an external shuffle service, so that the shuffle files will not need to
be stored in the spark executors
- spark.shuffle.io.serverThreads
- spark.shuffle.io.backLog
jaci XIII
Signs of Too Many Partitions
Number of partitions is the size of the data each core is computing … smaller
pieces are easier to process only up to a point
- Spark needs to keep metadata about each partition on the driver
- Driver memory errors & driver overhead errors
- Very long task “spin up” time
- Too many partitions at read usually caused by small part files
- Lots of pending tasks & Low memory utilization
- Long file write time for relatively small I/O “size” (especially with blockstores)
Dorian Wallender
PYTHON SETTINGS
● Application memory overhead
○ We can tune this based on if an app is PySpark or not
○ Infact in the proposed PySpark on K8s PR this is done for us
○ More tuning may still be required
● Buffers & batch sizes oh my
○ spark.sql.execution.arrow.maxRecordsPerBatch
○ spark.python.worker.memory - default to 512 but default mem for Python can be lower :(
■ Set based on amount memory assigned to Python to reduce OOMs
○ Normal: automatic, sometimes set wrong - code change required :(
Nessima E.
Tuning Can Help With
- Low cluster utilization ($$$$)
- Out of memory errors
- Spill to Disk / Slow shuffles
- GC errors / High GC overhead
- Driver / Executor overhead exceptions
- Reliable deployment
Melinda Seckington
Things you can’t “tune away”
- Silly Shuffles
- You can make each shuffle faster through tuning but you cannot change the fundamental size
or number of shuffles without adjusting code
- Think about how much you are moving data, and how to do it more efficiently
- Be very careful with column based operations on Spark SQL
- Unbalanced shuffles (caused by unbalanced keys)
- if one or two tasks are much larger than the others
- #Partitions is bounded by #distinct keys
- Bad Object management/ Data structures choices
- Can lead to memory exceptions / memory overhead exceptions / gc errors
- Serialization exceptions*
- *Exceptions due to cost of serialization can be tuned
- Python
- This makes Holden sad, buuuut. Arrow?
Neil Piddock
But enough doom, lets watch software fail.
Christian Walther
But what if we run a billion Spark jobs per day?
Tuning Could Depend on 4 Factors:
1. The execution environment
2. The size of the input data
3. The kind of computation (ML, python, streaming)
_______________________________
1. The historical runs of that job
We can get all this information programmatically!
Execution Environment
What we need to know about where the job will run
- How much memory do I have available to me: on a single node/ on the cluster
- In my queue
- How much CPU: on a single node, on the cluster → corresponds to total
number of concurrent tasks
- We can get all this information from the yarn api (https://dzone.com/articles/how-to-use-the-yarn-
api-to-determine-resources-ava-1)
- Can I configure dynamic allocation?
- Cluster Health
Size of the input data
- How big is the input data on Disk
- How big will it be in memory?
- “Coefficient of In Memory Expansion: shuffle Spill Memory / shuffle Spill disk
- <- can get historically or guess
- How many part files? (the default number of partitions at read)
- Cardinality and type?
Setting # partitions on the first try
Assuming you have many distinct keys, you want to try to make partitions small enough that each partition
fits in the memory “available to each task” to avoid spilling to disk or failing (the “Sandy Ryza formula”)
https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Size of input data in memory
Amount of memory available per task on the executors
How much memory should each task use?
def availableTaskMemoryMB(executorMemory : Long): Double = {
val memFraction = sparkConf.getDouble("spark.memory.fraction", 0.6)
val storageFraction = sparkConf.getDouble("spark.memory.storageFraction", 0.5)
val nonStorage = 1-storageFraction
val cores = sparkConf.getInt("spark.executor.cores", 1)
Math.ceil((executorMemory * memFraction * nonStorage )/ cores)
}
Partitions could be inputDataSize/ memory per task
def determinePartitionsFromInputDataSize(inputDataSize: Double) : Int = {
Math.round(inputDataSize/availableTaskMemoryMB()).toInt
}
Historical Data
What kinds of historical data could we use?
- Historical cluster info
- (was one node sad? How many executors were used?)
- Stage Metrics
- duration (only on an isolated cluster)
- executor CPU
- vcores/Second
- Task Metrics
- Shuffle read & shuffle write → how much data is being moved & how big is input data
- “Total task time” = time for each task * number of tasks
- Shuffle read spill memory / shuffle spill disk = input data expansion in memory
- What kind of computation was run (python/streaming)
- Who was running the computation
How do we save the metrics?
- For a human: Spark history server can be persisted even if the cluster is
terminated
- For a program: Can use listeners to save everything from web UI
- (See Spark Measure and read these at the start of job)
- Can also get significantly more by capturing the full event stream
- We can save information at task/ stage/ job level.
Meet Robin Sparkles!!!!!
- Saves historical data using listeners from Spark Measure
- ( https://github.com/LucaCanali/sparkMeasure)
- Create directory scheme to save task and stage data for
successive runs
- Read in n previous runs before creating conf
- Use the result of the previous runs in tandem with current Spark
Conf to Optimize some Spark Settings. (WIP)
- Note: not to be used in production, simply the start of a spell
book.
Lets go to the
Mall!
Start a Spark listener and save metrics
Extend Spark Measure flight recorder which automatically saves metrics:
class RobinStageListener(sc: SparkContext, override val metricsFileName: String)
extends ch.cern.sparkmeasure.FlightRecorderStageMetrics(sc.getConf) {
}
Start Spark Listener
val myStageListener = new RobinStageListener(sc,
stageMetricsPath(runNumber))
sc.addSparkListener(myStageListener)
Add listener to program code
def run(sc: SparkContext, id: Int, metricsDir: String,
inputFile: String, outputFile: String): Unit = {
val metricsCollector = new MetricsCollector(sc, metricsDir)
metricsCollector.startSparkJobWithRecording(id)
//some app code
}
Read in the metrics
val STAGE_METRICS_SUBDIR = "stage_metrics"
val metricsDir = s"$metricsRootDir/${appName}"
val stageMetricsDir = s"$metricsDir/$STAGE_METRICS_SUBDIR"
def stageMetricsPath(n: Int): String = s"$metricsDir/run=$n"
def readStageInfo(n : Int) =
ch.cern.sparkmeasure.Utils.readSerializedStageMetrics(stageMetricsPath(n))
To use: read in metrics, then create optimized conf
val conf = new SparkConf() ….
val (newConf: SparkConf, id: Int) = Runner.getOptimizedConf(metricsDir, conf)
val sc = SparkContext.getOrCreate(newConf)
Runner.run(sc, id, metricsDir, inputFile, outputFile)
Put it all together!
A programmatic way of defining the number of
partitions …
Partitions
Performance
Parallelism / number of partitions
Keep increasing the number of partitions until the metric we care about stops
improving
- Spark.default.parallelism is only default if not in code
- Different stages could / should have different partitions
- can also compute for each stage and then set in shuffle phase:
- rdd.reduceByKey(numPartitions = x)
- You can design your job to get number of partitions from a variable in the conf
- Which stage to tune if we can only do one:
- We can search for the longest stage
- If we are tuning for parallelism, we might want to capture the stage “with the biggest shuffle”
- We use “total shuffle write” which is the number of bytes written in shuffle files during the stage
(this should be setting agnostic)
What metric defines better?
- Depends on your business … often a trade off between speed and efficiency
- If you just want speed Holden has several cloud solutions to sell you (or would if she knew
anyone in the sales department).
- cluster is “fully utilized”
- Stage duration * executors = sum(task time)/ executors) + fuge factor
- WARNING: Stage duration is dependent on instance & exec spin up, network delays excetera
- Stage is “Fastest”-> Executor Cpu Time (total cpu time for all the tasks)
- Could use anything that Spark measures:
- Executor Serialization time
- Deserialization time
- Executor Idle time
Compute the number of partitions given a list of web UI input for each stage
def fromStageMetricSharedCluster(previousRuns: List[StageInfo]): Int = {
previousRuns match {
case Nil =>
//If this is the first run and parallelism is not provided, use the number of concurrent tasks
//We could also look at the file on disk
possibleConcurrentTasks()
case first :: Nil =>
val fromInputSize = determinePartitionsFromInputDataSize(first.totalInputSize)
Math.max(first.numPartitionsUsed + math.max(first.numExecutors,1), fromInputSize)
}
If we have several runs of the same computation
case _ => val first = previousRuns(previousRuns.length - 2)
val second = previousRuns(previousRuns.length - 1)
if(morePartitionsIsBetter(first,second)) { //Increase the number of partitions from everything that we have tried
Seq(first.numPartitionsUsed, second.numPartitionsUsed).max + second.numExecutors
} else{ //If we overshot the number of partitions, use whichever run had the best executor cpu time
previousRuns.sortBy(_.executorCPUTime).head.numPartitionsUsed
} }
}
How to use Robin-Sparkles
- Assume user constructs the Spark Context
- Configure listener to write to a persistent location
- Maybe need to be in an external location if you are using a web based system
- Modify the application code that creates the Spark context to use Robin
Sparkles to set the settings
Note: not applicable for use where Spark context is created for you except for
tuning number of partitions or variables that can change at the job (rather than
application level)
She could do so much more!
- Additional settings to tune
- Set executor memory and driver memory based on system
- Executor health and system monitoring
- Could surface warnings about data skew
- Shuffle settings could be adjusted
- Currently we don’t get info if the stage fails
- Our reader would actually fail if stage data is malformed, we should read partial stage
- We could use the firehost listener that records every event from the event stream, to monitor
failures
- These other listeners would contain additional information that is not in the web ui
But first a nap
Donald Lee Pardue
Other Tools
Sometimes the right answer isn’t tuning, it’s telling the user to change the code
(see Sparklens) or telling the administrator to look at their cluster
Melinda Seckington
Sparklens - Tells us what to tune or refactor
● AppTimelineAnalyzer
● EfficiencyStatisticsAnalyzer
● ExecutorTimelineAnalyzer
● ExecutorWallclockAnalyzer
● HostTimelineAnalyzer
● JobOverlapAnalyzer
● SimpleAppAnalyzer
● StageOverlapAnalyzer
● StageSkewAnalyzer
Kitty Terwolbeck
Dr Elephant
● Spark ConfigurationHeurestic
○ val SPARK_DRIVER_MEMORY_KEY = "spark.driver.memory"
○ val SPARK_EXECUTOR_MEMORY_KEY = "spark.executor.memory"
○ val SPARK_EXECUTOR_INSTANCES_KEY = "spark.executor.instances"
○ val SPARK_EXECUTOR_CORES_KEY = "spark.executor.cores"
○ val SPARK_SERIALIZER_KEY = "spark.serializer"
○ val SPARK_APPLICATION_DURATION = "spark.application.duration"
○ val SPARK_SHUFFLE_SERVICE_ENABLED = "spark.shuffle.service.enabled"
○ val SPARK_DYNAMIC_ALLOCATION_ENABLED = "spark.dynamicAllocation.enabled"
○ val SPARK_DRIVER_CORES_KEY = "spark.driver.cores"
○ val SPARK_DYNAMIC_ALLOCATION_MIN_EXECUTORS =
"spark.dynamicAllocation.minExecutors"
○ etc.
● Among other things
Other monitoring/linting/tuning tools
● Kamon
● Sparklint
● Sparkoscope
● Graphite + Spark
Jakub Szestowicki
Some upcoming talks
● May
○ JOnTheBeach (Spain) Friday - General Purpose Big Data Systems are
eating the world - Is Tool Consolidation inevitable?
● June
○ Spark Summit SF (Accelerating Tensorflow & Accelerating Python +
Dependencies)
○ Scala Days NYC
○ FOSS Back Stage & BBuzz
● July
○ Curry On Amsterdam
○ OSCON Portland
● August
○ JupyterCon (NYC)
High Performance Spark!
You can buy it today! Hopefully you got it signed earlier today if not … buy it
and come see us again!
The settings didn’t get its own chapter is in the appendix, (doing things on time is hard)
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print rather than e-book.
Thanks and keep in touch

More Related Content

What's hot

Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystifiedOmid Vahdaty
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkHolden Karau
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookDatabricks
 

What's hot (20)

Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New York
 
Apache spark
Apache sparkApache spark
Apache spark
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Debugging Apache Spark
Debugging Apache SparkDebugging Apache Spark
Debugging Apache Spark
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
 

Similar to Spark Tuning Guide to Improve Performance & Prevent Failures

Understanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkUnderstanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkRachel Warren
 
Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018Holden Karau
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsSamir Bessalah
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdfAmit Raj
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovMaksud Ibrahimov
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Holden Karau
 
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Spark Summit
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...confluent
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 

Similar to Spark Tuning Guide to Improve Performance & Prevent Failures (20)

Understanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New YorkUnderstanding Spark Tuning: Strata New York
Understanding Spark Tuning: Strata New York
 
Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018Spark Autotuning - Strata EU 2018
Spark Autotuning - Strata EU 2018
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark Jobs
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 
Scaling PHP apps
Scaling PHP appsScaling PHP apps
Scaling PHP apps
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
 
Spark 101
Spark 101Spark 101
Spark 101
 
20140708hcj
20140708hcj20140708hcj
20140708hcj
 
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 

Recently uploaded

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Spark Tuning Guide to Improve Performance & Prevent Failures

  • 1. Understanding Spark Tuning (Magical spells to stop your pager going off at 2:00am) Holden Karau, Rachel Warren
  • 2. Rachel - Rachel Warren → She/ Her - Data Scientist / Software engineer at Salesforce Einstein - Formerly at Alpine Data (with Holden) - Lots of experience scaling Spark in different production environments - The other half of the High Performance Spark team :) - @warre_n_peace - Linked in: https://www.linkedin.com/in/rachelbwarren/ - Slideshare: https://www.slideshare.net/RachelWarren4/ - Github: https://github.com/rachelwarren
  • 3. Holden: ● My name is Holden Karau ● Prefered pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC :) ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Code review livestreams & live coding: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Github https://github.com/holdenk ● Spark Videos http://bit.ly/holdenSparkVideos
  • 4.
  • 5. Who we think you wonderful humans are? ● Nice enough people ● I’m sure you love pictures of cats ● Might know something about using Spark, or are using it in production ● Maybe sys-admin or developer ● Are tired of spending so much time fussing with Spark Settings to get jobs to run Lori Erickson
  • 6. The goal of this talk is to give you the resources to programatically tune your Spark jobs so that they run consistently and efficiently In terms of and $$$$$
  • 7. What we will cover? - A run down of the most important settings - Getting the most out of Spark’s built-in Auto Tuner Options - A few examples of errors and performance problems that can be addressed by tuning - A job can go out of tune over time as the world and it changes, much like Holden’s Vespa. - How to tune jobs “statically” e.g. without historical data - How to collect historical data (meet Robin Sparkles: https://github.com/high- performance-spark/robin-sparkles ) - An example of using static and historical information to programmatically configure Spark jobs - The latest and greatest auto tuner tools
  • 8. I can haz application :p val conf = new SparkConf() .setMaster("local") .setAppName("my_awesome_app") val sc = SparkContext.getOrCreate(newConf) val rdd = sc.textFile(inputFile) val words: RDD[String] = rdd.flatMap(_.split(“ “). map(_.trim.toLowerCase)) val wordPairs = words.map((_, 1)) val wordCounts = wordPairs.reduceByKey(_ + _) wordCount.saveAsTextFile(outputFile) Trish Hamme Settings go here This is a shuffle
  • 9. I can haz application :p val conf = new SparkConf() .setMaster("local") .setAppName("my_awesome_app") val sc = SparkContext.getOrCreate(newConf) val rdd = sc.textFile(inputFile) val words: RDD[String] = rdd.flatMap(_.split(“ “). map(_.trim.toLowerCase)) val wordPairs = words.map((_, 1)) val wordCounts = wordPairs.reduceByKey(_ + _) wordCount.saveAsTextFile(outputFile) Trish Hamme Start of application End Stage 1 Stage 2 Action, Launches Job
  • 10. How many resources to give my application? ● Spark.executor.memory ● Spark.driver.memory ● spark.executor.vcores ● Enable dynamic allocation ○ (or set # number of executors) val conf = new SparkConf() .setMaster("local") .setAppName("my_awesome_app") .set("spark.executor.memory", ???) .set("spark.driver.memory", ???) .set("spark.executor.vcores", ???)
  • 11. Spark Execution Environment Other App My Spark App - Node can have several executors - But on executor can only be on one node - Executors have same amount of memory and cores - One task per core - Task is the compute for one partition - Rdd is distributed set of partitions
  • 12. Executor and Driver Memory - Driver Memory - As small as it can be without failing (but that can be pretty big) - Will have to be bigger is collecting data to the driver, or if there are many partitions - Executor memory + overhead < less than the size of the container - Think about binning - if you have 12 gig nodes making an 8 gig executor is maybe silly - Pros Of Fewer Larger Executors Per Node - Maybe less likely to oom - Some tasks can take a long time - Cons of Fewer Large Executors (Pros of More Small Executors) - Some people report slow down with more than 5ish cores … (more on that later) - If using dynamic allocation may be harder to “scale up” on a busy cluster
  • 13. Vcores - Remember 1 core = 1 task. So number of concurrent tasks is limited by total cores - Sort of, unless you change it. Terms and conditions apply to Python users. - In HDFS too many cores per executor may cause issue with too many concurrent hdfs threads - maybe? - 1 core per executor takes away some benefit of things like broadcast variables - Think about “burning” cpu and memory equally - If you have 60Gb ram & 10 core nodes, making default executor size 30 gb but with ten cores maybe not so great
  • 14. How To Enable Dynamic Allocation Dynamic Allocation allows Spark to add and subtract executors between Jobs over the course of an application - To configure - spark.dynamicAllocation.enabled=true - spark.shuffle.service.enabled=true ( you have to configure external shuffle service on each worker) - spark.dynamicAllocation.minExecutors - spark.dynamicAllocation.maxExecutors - spark.dynamicAllocation.initialExecutors - To Adjust - Spark will add executors when there are pending tasks (spark.dynamicAllocation.schedulerBacklogTimeout) - and exponentially increase them as long as tasks in the backlog persist (spark...sustainedSchedulerBacklogTimeout) - Executors are decommissioned when they have been idle for spark.dynamicAllocation.executorIdleTimeout
  • 15. Why To Enable Dynamic Allocation When - Most important for shared or cost sensitive environments - Good when an application contains several jobs of differing sizes - The only real way to adjust resources throughout an application Improvements - If jobs are very short adjust the timeouts to be shorter - For jobs that you know are large start with a higher number of initial executors to avoid slow spin up - If you are sharing a cluster, setting max executors can prevent you from hogging it
  • 17. Oh no! It failed :( How could we adjust it? Suppose that in the driver log, we see a “container lost exception” and on the executor logs we see: java.lang.OutOfMemoryError: Java heap space This points to an out of memory error on the executors hkase
  • 18. Addressing Executor OOM - If we have more executor memory to give it, try that! - Lets try increasing the number of partitions so that each executor will process smaller pieces of the data at once - Spark.default.parallelism = 10 - Or by adding the number of partitions to the code e.g. reduceByKey(numPartitions = 10) - Many more things you can do to improve the code
  • 19. Low Cluster Utilization: Idle Executors Susanne Nilsson
  • 20. What to do about it? - If we see idle executors but the total size of our job is small, we may just be requesting too many executors - If all executors are idle it maybe because we are doing a large computation in the driver - If the computation is very large, and we see idle executors, this maybe because the executors are waiting for a “large” task → so we can increase partitions - At some point adding partitions will slow the job down - But only if not too much skew Toshiyuki IMAI
  • 21. Shuffle Spill to Disk/ Slow Shuffle Fung0131
  • 22. Preventing Shuffle Spill to Disk - Larger executors - Configure off heap storage - More partitions can help (ideally the labor of all the partitions on one executor can “fit” in that executor’s memory) - We can adjust shuffle settings - Increase shuffle memory fraction (spark.shuffle.memory.fraction) - Try increasing: - spark.shuffle.file.buffer - Configure an external shuffle service, so that the shuffle files will not need to be stored in the spark executors - spark.shuffle.io.serverThreads - spark.shuffle.io.backLog jaci XIII
  • 23. Signs of Too Many Partitions Number of partitions is the size of the data each core is computing … smaller pieces are easier to process only up to a point - Spark needs to keep metadata about each partition on the driver - Driver memory errors & driver overhead errors - Very long task “spin up” time - Too many partitions at read usually caused by small part files - Lots of pending tasks & Low memory utilization - Long file write time for relatively small I/O “size” (especially with blockstores) Dorian Wallender
  • 24. PYTHON SETTINGS ● Application memory overhead ○ We can tune this based on if an app is PySpark or not ○ Infact in the proposed PySpark on K8s PR this is done for us ○ More tuning may still be required ● Buffers & batch sizes oh my ○ spark.sql.execution.arrow.maxRecordsPerBatch ○ spark.python.worker.memory - default to 512 but default mem for Python can be lower :( ■ Set based on amount memory assigned to Python to reduce OOMs ○ Normal: automatic, sometimes set wrong - code change required :( Nessima E.
  • 25. Tuning Can Help With - Low cluster utilization ($$$$) - Out of memory errors - Spill to Disk / Slow shuffles - GC errors / High GC overhead - Driver / Executor overhead exceptions - Reliable deployment Melinda Seckington
  • 26. Things you can’t “tune away” - Silly Shuffles - You can make each shuffle faster through tuning but you cannot change the fundamental size or number of shuffles without adjusting code - Think about how much you are moving data, and how to do it more efficiently - Be very careful with column based operations on Spark SQL - Unbalanced shuffles (caused by unbalanced keys) - if one or two tasks are much larger than the others - #Partitions is bounded by #distinct keys - Bad Object management/ Data structures choices - Can lead to memory exceptions / memory overhead exceptions / gc errors - Serialization exceptions* - *Exceptions due to cost of serialization can be tuned - Python - This makes Holden sad, buuuut. Arrow? Neil Piddock
  • 27. But enough doom, lets watch software fail. Christian Walther
  • 28. But what if we run a billion Spark jobs per day?
  • 29. Tuning Could Depend on 4 Factors: 1. The execution environment 2. The size of the input data 3. The kind of computation (ML, python, streaming) _______________________________ 1. The historical runs of that job We can get all this information programmatically!
  • 30. Execution Environment What we need to know about where the job will run - How much memory do I have available to me: on a single node/ on the cluster - In my queue - How much CPU: on a single node, on the cluster → corresponds to total number of concurrent tasks - We can get all this information from the yarn api (https://dzone.com/articles/how-to-use-the-yarn- api-to-determine-resources-ava-1) - Can I configure dynamic allocation? - Cluster Health
  • 31. Size of the input data - How big is the input data on Disk - How big will it be in memory? - “Coefficient of In Memory Expansion: shuffle Spill Memory / shuffle Spill disk - <- can get historically or guess - How many part files? (the default number of partitions at read) - Cardinality and type?
  • 32. Setting # partitions on the first try Assuming you have many distinct keys, you want to try to make partitions small enough that each partition fits in the memory “available to each task” to avoid spilling to disk or failing (the “Sandy Ryza formula”) https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ Size of input data in memory Amount of memory available per task on the executors
  • 33. How much memory should each task use? def availableTaskMemoryMB(executorMemory : Long): Double = { val memFraction = sparkConf.getDouble("spark.memory.fraction", 0.6) val storageFraction = sparkConf.getDouble("spark.memory.storageFraction", 0.5) val nonStorage = 1-storageFraction val cores = sparkConf.getInt("spark.executor.cores", 1) Math.ceil((executorMemory * memFraction * nonStorage )/ cores) }
  • 34. Partitions could be inputDataSize/ memory per task def determinePartitionsFromInputDataSize(inputDataSize: Double) : Int = { Math.round(inputDataSize/availableTaskMemoryMB()).toInt }
  • 36. What kinds of historical data could we use? - Historical cluster info - (was one node sad? How many executors were used?) - Stage Metrics - duration (only on an isolated cluster) - executor CPU - vcores/Second - Task Metrics - Shuffle read & shuffle write → how much data is being moved & how big is input data - “Total task time” = time for each task * number of tasks - Shuffle read spill memory / shuffle spill disk = input data expansion in memory - What kind of computation was run (python/streaming) - Who was running the computation
  • 37. How do we save the metrics? - For a human: Spark history server can be persisted even if the cluster is terminated - For a program: Can use listeners to save everything from web UI - (See Spark Measure and read these at the start of job) - Can also get significantly more by capturing the full event stream - We can save information at task/ stage/ job level.
  • 38. Meet Robin Sparkles!!!!! - Saves historical data using listeners from Spark Measure - ( https://github.com/LucaCanali/sparkMeasure) - Create directory scheme to save task and stage data for successive runs - Read in n previous runs before creating conf - Use the result of the previous runs in tandem with current Spark Conf to Optimize some Spark Settings. (WIP) - Note: not to be used in production, simply the start of a spell book. Lets go to the Mall!
  • 39. Start a Spark listener and save metrics Extend Spark Measure flight recorder which automatically saves metrics: class RobinStageListener(sc: SparkContext, override val metricsFileName: String) extends ch.cern.sparkmeasure.FlightRecorderStageMetrics(sc.getConf) { } Start Spark Listener val myStageListener = new RobinStageListener(sc, stageMetricsPath(runNumber)) sc.addSparkListener(myStageListener)
  • 40. Add listener to program code def run(sc: SparkContext, id: Int, metricsDir: String, inputFile: String, outputFile: String): Unit = { val metricsCollector = new MetricsCollector(sc, metricsDir) metricsCollector.startSparkJobWithRecording(id) //some app code }
  • 41. Read in the metrics val STAGE_METRICS_SUBDIR = "stage_metrics" val metricsDir = s"$metricsRootDir/${appName}" val stageMetricsDir = s"$metricsDir/$STAGE_METRICS_SUBDIR" def stageMetricsPath(n: Int): String = s"$metricsDir/run=$n" def readStageInfo(n : Int) = ch.cern.sparkmeasure.Utils.readSerializedStageMetrics(stageMetricsPath(n))
  • 42. To use: read in metrics, then create optimized conf val conf = new SparkConf() …. val (newConf: SparkConf, id: Int) = Runner.getOptimizedConf(metricsDir, conf) val sc = SparkContext.getOrCreate(newConf) Runner.run(sc, id, metricsDir, inputFile, outputFile)
  • 43. Put it all together! A programmatic way of defining the number of partitions … Partitions Performance
  • 44. Parallelism / number of partitions Keep increasing the number of partitions until the metric we care about stops improving - Spark.default.parallelism is only default if not in code - Different stages could / should have different partitions - can also compute for each stage and then set in shuffle phase: - rdd.reduceByKey(numPartitions = x) - You can design your job to get number of partitions from a variable in the conf - Which stage to tune if we can only do one: - We can search for the longest stage - If we are tuning for parallelism, we might want to capture the stage “with the biggest shuffle” - We use “total shuffle write” which is the number of bytes written in shuffle files during the stage (this should be setting agnostic)
  • 45. What metric defines better? - Depends on your business … often a trade off between speed and efficiency - If you just want speed Holden has several cloud solutions to sell you (or would if she knew anyone in the sales department). - cluster is “fully utilized” - Stage duration * executors = sum(task time)/ executors) + fuge factor - WARNING: Stage duration is dependent on instance & exec spin up, network delays excetera - Stage is “Fastest”-> Executor Cpu Time (total cpu time for all the tasks) - Could use anything that Spark measures: - Executor Serialization time - Deserialization time - Executor Idle time
  • 46. Compute the number of partitions given a list of web UI input for each stage def fromStageMetricSharedCluster(previousRuns: List[StageInfo]): Int = { previousRuns match { case Nil => //If this is the first run and parallelism is not provided, use the number of concurrent tasks //We could also look at the file on disk possibleConcurrentTasks() case first :: Nil => val fromInputSize = determinePartitionsFromInputDataSize(first.totalInputSize) Math.max(first.numPartitionsUsed + math.max(first.numExecutors,1), fromInputSize) }
  • 47. If we have several runs of the same computation case _ => val first = previousRuns(previousRuns.length - 2) val second = previousRuns(previousRuns.length - 1) if(morePartitionsIsBetter(first,second)) { //Increase the number of partitions from everything that we have tried Seq(first.numPartitionsUsed, second.numPartitionsUsed).max + second.numExecutors } else{ //If we overshot the number of partitions, use whichever run had the best executor cpu time previousRuns.sortBy(_.executorCPUTime).head.numPartitionsUsed } } }
  • 48. How to use Robin-Sparkles - Assume user constructs the Spark Context - Configure listener to write to a persistent location - Maybe need to be in an external location if you are using a web based system - Modify the application code that creates the Spark context to use Robin Sparkles to set the settings Note: not applicable for use where Spark context is created for you except for tuning number of partitions or variables that can change at the job (rather than application level)
  • 49. She could do so much more! - Additional settings to tune - Set executor memory and driver memory based on system - Executor health and system monitoring - Could surface warnings about data skew - Shuffle settings could be adjusted - Currently we don’t get info if the stage fails - Our reader would actually fail if stage data is malformed, we should read partial stage - We could use the firehost listener that records every event from the event stream, to monitor failures - These other listeners would contain additional information that is not in the web ui
  • 50. But first a nap Donald Lee Pardue
  • 51. Other Tools Sometimes the right answer isn’t tuning, it’s telling the user to change the code (see Sparklens) or telling the administrator to look at their cluster Melinda Seckington
  • 52. Sparklens - Tells us what to tune or refactor ● AppTimelineAnalyzer ● EfficiencyStatisticsAnalyzer ● ExecutorTimelineAnalyzer ● ExecutorWallclockAnalyzer ● HostTimelineAnalyzer ● JobOverlapAnalyzer ● SimpleAppAnalyzer ● StageOverlapAnalyzer ● StageSkewAnalyzer Kitty Terwolbeck
  • 53. Dr Elephant ● Spark ConfigurationHeurestic ○ val SPARK_DRIVER_MEMORY_KEY = "spark.driver.memory" ○ val SPARK_EXECUTOR_MEMORY_KEY = "spark.executor.memory" ○ val SPARK_EXECUTOR_INSTANCES_KEY = "spark.executor.instances" ○ val SPARK_EXECUTOR_CORES_KEY = "spark.executor.cores" ○ val SPARK_SERIALIZER_KEY = "spark.serializer" ○ val SPARK_APPLICATION_DURATION = "spark.application.duration" ○ val SPARK_SHUFFLE_SERVICE_ENABLED = "spark.shuffle.service.enabled" ○ val SPARK_DYNAMIC_ALLOCATION_ENABLED = "spark.dynamicAllocation.enabled" ○ val SPARK_DRIVER_CORES_KEY = "spark.driver.cores" ○ val SPARK_DYNAMIC_ALLOCATION_MIN_EXECUTORS = "spark.dynamicAllocation.minExecutors" ○ etc. ● Among other things
  • 54. Other monitoring/linting/tuning tools ● Kamon ● Sparklint ● Sparkoscope ● Graphite + Spark Jakub Szestowicki
  • 55. Some upcoming talks ● May ○ JOnTheBeach (Spain) Friday - General Purpose Big Data Systems are eating the world - Is Tool Consolidation inevitable? ● June ○ Spark Summit SF (Accelerating Tensorflow & Accelerating Python + Dependencies) ○ Scala Days NYC ○ FOSS Back Stage & BBuzz ● July ○ Curry On Amsterdam ○ OSCON Portland ● August ○ JupyterCon (NYC)
  • 56. High Performance Spark! You can buy it today! Hopefully you got it signed earlier today if not … buy it and come see us again! The settings didn’t get its own chapter is in the appendix, (doing things on time is hard) Cats love it* *Or at least the box it comes in. If buying for a cat, get print rather than e-book.
  • 57. Thanks and keep in touch

Editor's Notes

  1. Photo from https://www.flickr.com/photos/lorika/4148361363/in/photolist-7jzriM-9h3my2-9Qn7iD-bp55TS-7YCJ4G-4pVTXa-7AFKbm-bkBfKJ-9Qn6FH-aniTRF-9LmYvZ-HD6w6-4mBo3t-8sekvz-mgpFzD-5z6BRK-de513-8dVhBu-bBZ22n-4Vi2vS-3g13dh-e7aPKj-b6iHHi-4ThGzv-7NcFNK-aniTU6-Kzqxd-7LPmYs-4ok2qy-dLY9La-Nvhey-Kte6U-74B7Ma-6VfnBK-6VjrY7-58kAY9-7qUeDK-4eoSxM-6Vjs5A-9v5Pvb-26mja-4scwq3-GHzAL-672eVr-nFUomD-4s8u8F-5eiQmQ-bxXXCc-5P9cCT-5GX8no
  2. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using both historical and live job information, using systems like Apache BEAM, Mahout, and internal Spark ML jobs as workloads. Much of the data required to effectively tune jobs is already collected inside of Spark. You just need to understand it. Holden, Rachel, and Anya outline sample auto-tuners and discuss the options for improving them and applying similar techniques in your own work. They also discuss what kind of tuning can be done statically (e.g., without depending on historic information) and look at Spark’s own built-in components for auto-tuning (currently dynamically scaling cluster size) and how you can improve them. Even if the idea of building an auto-tuner sounds as appealing as using a rusty spoon to debug the JVM on a haunted supercomputer, this talk will give you a better understanding of the knobs available to you to tune your Apache Spark jobs. Also, to be clear, Holden, Rachel, and Anya don’t promise to stop your pager going off at 2:00am, but hopefully this helps.
  3. FIX THIS …. Spark Messos / Yarn / S3 / Standalone cluster
  4. FIX THIS …. Spark Messos / Yarn / S3 / Standalone cluster
  5. Mention that it doesn’t always look like this and that there are so other ways of Think about how to present this because the executors are different sizes;...
  6. A few larger executors is bad because of GC Look up which HDFS deamon this is
  7. Adjust this
  8. Spark will add executors when there are pending tasks (spark.dynamicAllocation.schedulerBacklogTimeout) and exponentially increase them as long as tasks in the backlog persist (spark...sustainedSchedulerBacklogTimeout) Executors are decommissioned when they have been idle for
  9. Spark will add executors when there are pending tasks (spark.dynamicAllocation.schedulerBacklogTimeout) and exponentially increase them as long as tasks in the backlog persist (spark...sustainedSchedulerBacklogTimeout) Executors are decommissioned when they have been idle for
  10. Now we are going to look at some ways that we can fall over
  11. RESEARCH OTHER METRICS OF CLUSTER UTALIZATION
  12. if partitions = total core cores, then next stage will wait on slowest partition If we have more partitions than the number of total cores then several tasks with less records could be executed while one tasks with many records is executed. At some point there are too many partitions and overhead gets high and job will slow down
  13. Shuffle spill (memory) is the size of the deserialized form of the data in memory at the time when we spill it, whereas shuffle spill (disk) is the size of the serialized form of the data on disk after we spill it. This is why the latter tends to be much smaller than the former. Note that both metrics are aggregated over the entire duration of the task (i.e. within each task you can spill multiple times).
  14. External shuffle service is storing shuffle files somewhere else
  15. Relative CPU and GC time maybe define them ...
  16. Requesting too many resources should say that earlier
  17. Give that configuration
  18. Fix to be cleared
  19. https://upload.wikimedia.org/wikipedia/commons/a/a3/Cat_with_book_2320356661.jpg
  20. I should talk about executor cpu time Vcores per second…. Look into spark lint and fill this out more smartly
  21. Link to spark measure !!! Add picture of robin sparkles
  22. This is where to put the graph
  23. Update it but
  24. NB from Anya:
  25. NB from Anya:
  26. NB from Anya: