Fine Tuning and Enhancing Performance of Apache Spark Jobs
Apache Spark defaults provide decent performance for large data sets but leave room for significant performance gains if able to tune parameters based on resources and job.
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Blake Becerra, Kira Lindke, Kaushik Tadikonda
Our Setup
▪ Data Validation Tool for ETL
▪ Millions of comparisons and aggregations
▪ One of the larger datasets initially took 4+ hours and was unstable
▪ Challenge: improve reliability and performance
▪ After months of research and tuning, the same application takes 35 minutes
Configuring Cluster
▪ Test changes with
▪ spark.driver.cores
▪ spark.driver.memory
▪ executor-memory
▪ executor-cores
▪ spark.cores.max
▪ Reserve cores (~1 core per node) for daemons
▪ num_executors = total_cores / num_cores
▪ num_partitions
▪ Too much memory per executor can result in excessive GC delays
▪ Too little memory can lose the benefits of running multiple tasks in a single JVM
▪ Look at stats (network, CPU, memory, etc.) and tweak to improve performance (a config sketch follows)
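As an illustration only, a SparkSession built along these lines can serve as a starting point; the node size, core counts, and memory values below are assumptions, not recommendations from the talk.

import org.apache.spark.sql.SparkSession

// Hypothetical starting point for 16-core / 64 GB worker nodes,
// reserving ~1 core per node for OS and daemons. Tune against your own metrics.
val spark = SparkSession.builder()
  .appName("etl-validation")
  .config("spark.driver.cores", "2")
  .config("spark.driver.memory", "8g")
  .config("spark.executor.cores", "5")            // executor-cores
  .config("spark.executor.memory", "12g")         // executor-memory: not so large that GC stalls
  .config("spark.cores.max", "60")                // total cores the application may claim
  .config("spark.sql.shuffle.partitions", "200")  // num_partitions for shuffles
  .getOrCreate()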
Skew
▪ Can severely downgrade performance
▪ Extreme imbalance of work in the cluster
▪ Tasks within a stage (like a join) take uneven amounts of time to finish
▪ How to check
▪ Spark UI shows the job waiting on only some of the tasks
▪ Look for large variances in memory usage within a job (primarily works if it is at the beginning of the job and doing data ingestion; otherwise it can be misleading)
▪ Executor missing heartbeat
▪ Check partition sizes of the RDD while debugging to confirm (see the sketch below)
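A quick way to confirm skew while debugging is to count records per partition; here `df` is a placeholder for the DataFrame under inspection.

// Count rows in each partition and print the largest ones.
// Note: this triggers a job, so use it on a sample or during debugging only.
val partitionSizes = df.rdd
  .mapPartitionsWithIndex { case (idx, iter) => Iterator((idx, iter.size)) }
  .collect()

partitionSizes.sortBy { case (_, n) => -n }.take(10).foreach {
  case (idx, n) => println(s"partition $idx -> $n rows")
}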
Example – skew in ingestion of data
(charts: uneven partition sizes vs. even partition sizes)
Handling skew cont’d
▪ Blind repartition is the most naïve approach, but effective
▪ Great for "narrow transformations"
▪ Good for increasing the number of partitions
▪ Use coalesce, not repartition, to decrease partitions (see the sketch below)
▪ Choosing a better approach specifically for joins is covered later
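A minimal sketch of the blind approach, with an assumed DataFrame `df` and illustrative partition counts:

// Increase parallelism by spreading rows evenly across more partitions (full shuffle).
val evenlySpread = df.repartition(400)

// Reduce the number of partitions without a full shuffle, e.g. before writing output.
val fewerPartitions = evenlySpread.coalesce(50)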
Handling skew - ingestion
▪ Use Spark options to do partitioned reads with JDBC
▪ partitionColumn
▪ lower/upperBound - used to determine stride
▪ numPartitions – maximum number of partitions
▪ partitionColumn
▪ Ideally is not skewed (such as primary key)
▪ Stride can have skew
▪ Possible trick – mod function
▪ Example of working with slow JDBC databases:
▪ Initial query took ~40 minutes
▪ Took it down to 10 minutes
Example
lowerBound: 0
upperBound: 1000
numPartitions: 10
=> Stride is equal to 100 and the partitions correspond to the following queries:
SELECT * FROM tableName WHERE partitionModColumn < 100
SELECT * FROM tableName WHERE partitionModColumn >= 100 AND partitionModColumn < 200
...
SELECT * FROM tableName WHERE partitionModColumn >= 900
Also works with: partitioned data ingestion
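A sketch of the partitioned JDBC read from the example above, combined with the mod-function trick; it assumes an existing SparkSession `spark`, and the connection URL, table, and column names are placeholders.

// 10 partitions over partitionModColumn in [0, 1000) => stride of 100,
// producing the WHERE clauses shown above.
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder connection string
  .option("dbtable",
    "(SELECT t.*, MOD(t.id, 1000) AS partitionModColumn FROM tableName t) sub")  // mod trick on a skewed key
  .option("partitionColumn", "partitionModColumn")
  .option("lowerBound", "0")
  .option("upperBound", "1000")
  .option("numPartitions", "10")
  .load()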
Example – reading from object storage, files, etc.
▪ Data is read in with the same skew it has in storage
▪ Read in, then repartition to get an even distribution
Cache/Persist
▪ Reuse a DataFrame with transformations (see the sketch below)
▪ Unpersist when done
▪ Without cache, the DataFrame is built from scratch each time
▪ Don't over-persist
▪ Worse memory performance
▪ Possible slowdown
▪ GC pressure
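A minimal cache/unpersist sketch; `df` and the column names "amount" and "key" are placeholders for any DataFrame that several actions reuse.

import org.apache.spark.sql.functions.col

// Cache a DataFrame that multiple downstream actions will reuse,
// so its lineage is not recomputed from scratch each time.
val expensive = df.filter(col("amount") > 0).groupBy(col("key")).count()
expensive.cache()

val total   = expensive.count()
val topRows = expensive.orderBy(col("count").desc).limit(10).collect()

// Release the cached blocks once the DataFrame is no longer needed.
expensive.unpersist()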
Other Performance
Improvements
▪ Try seq.par.foreach instead of just seq.foreach
▪ Increases parallelization
▪ Race conditions and non-deterministic results
▪ Use accumulators or synchronization to protect
▪ Avoid UDFs if possible (see the sketch below)
▪ Deserializes every row to an object
▪ Applies the lambda
▪ Then reserializes it
▪ More garbage generated
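A sketch of the UDF point: the built-in column function keeps the work inside Spark's optimized execution engine, while the UDF forces every value through deserialization and back. The DataFrame `df` and the column name "name" are placeholders.

import org.apache.spark.sql.functions.{col, udf, upper}

// UDF version: each row's value is deserialized to a JVM String, passed through
// the lambda, then re-serialized, generating extra garbage.
val toUpperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
val withUdf = df.withColumn("name_upper", toUpperUdf(col("name")))

// Built-in version: same result, but no per-row object churn.
val withBuiltin = df.withColumn("name_upper", upper(col("name")))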
Join Optimization
▪ Different join strategies
▪ Data preprocessing - clever filtering
▪ Avoiding unnecessary scans
▪ Locality - use of the same partitioner
▪ Salting - skewed join keys (a sketch follows)
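A salting sketch for a skewed join key; the DataFrames `largeDf` and `smallDf`, the join key "key", and the salt factor of 16 are illustrative assumptions.

import org.apache.spark.sql.functions.{array, explode, lit, rand}

val saltBuckets = 16

// Large, skewed side: append a random salt so hot keys spread across partitions.
val saltedLarge = largeDf.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Smaller side: replicate each row once per salt value so every salted key still matches.
val saltedSmall = smallDf.withColumn(
  "salt", explode(array((0 until saltBuckets).map(i => lit(i)): _*)))

// Join on the original key plus the salt, then drop the helper column.
val joined = saltedLarge.join(saltedSmall, Seq("key", "salt")).drop("salt")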
Things to remember
▪ Follow good partitioning strategies
▪ Too few partitions – less parallelism
▪ Too many partitions – scheduling issues
▪ Improve scan efficiency
▪ Try to use the same partitioner between DataFrames for a join
▪ Skipped stages are good
▪ Caching prevents repeated exchanges where data is re-used
Fair Scheduling
▪ Jobs are scheduled in a round-robin fashion
▪ Multiple jobs can run simultaneously if submitted by threads inside an application
▪ Better resource utilization
▪ Harder to debug and makes performance tuning more difficult
Fair Scheduling and Pools
▪ Enabled by setting spark.scheduler.mode to FAIR
▪ Allows jobs to be grouped into pools with prioritization options
▪ Jobs created from threads without a pool defined always go to the "default" pool
▪ Pools, weight and minShare are specified in fairscheduler.xml (see the sketch below)
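A minimal sketch of enabling FAIR scheduling and submitting work from a thread into a named pool; the pool name "etl_pool" and the allocation file path are assumptions that would need matching entries in fairscheduler.xml.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fair-scheduling-demo")
  .config("spark.scheduler.mode", "FAIR")
  // Pools with weight and minShare are declared in this XML file (hypothetical path).
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

new Thread(() => {
  // Jobs started from this thread run in the "etl_pool" pool;
  // without this property they would fall into the "default" pool.
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl_pool")
  spark.range(0L, 10000000L).count()
}).start()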
Serialization
Java Serializer
▪ Default for most types
▪ Can work with any class
▪ More flexible
▪ Slower
Kryo Serializer
▪ Default for shuffling RDDs with simple types
▪ Significantly faster and more compact
▪ Set your SparkConf to use "org.apache.spark.serializer.KryoSerializer" (a sketch follows)
▪ Register classes in order to take full advantage
https://spark.apache.org/docs/latest/tuning.html#data-serialization
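A sketch of switching to Kryo and registering classes up front; `MyRecord` is a placeholder for an application class.

import org.apache.spark.SparkConf

// Placeholder application class to register with Kryo.
case class MyRecord(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes avoids writing full class names alongside every serialized object.
  .registerKryoClasses(Array(classOf[MyRecord]))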
CompressedOops
▪ The JVM flag -XX:+UseCompressedOops allows usage of 4-byte pointers instead of 8-byte
▪ Does not work as heap size grows beyond 32 GB
▪ Need a 48 GB heap without compressed pointers to hold the same number of objects as 32 GB
Garbage Collection Tuning
▪ Contention when allocating more objects
▪ Full GC occurring before tasks finish
▪ Too many RDDs cached
▪ Frequent garbage collection
▪ Long time spent in GC
▪ Missed heartbeats
ParallelGC (default)
Heap space is divided into Young and Old generations
▪ Minor GC occurs frequently:
▪ Increase Eden and Survivor space
▪ Major GC occurs frequently:
▪ Increase young space if too many objects are being promoted
▪ Increase old space
▪ Decrease spark.memory.fraction
▪ Full GC occurs before task finishes:
▪ Try clearing the young space with more frequent GC triggers
▪ Increase memory
G1GC
▪ Breaks heap into thousands of regions
▪ Recommended if heap sizes are large (> 8 GB) and GC times are long
▪ Lower latency, traded for higher CPU usage on GC bookkeeping
For large-scale applications, these properties help prevent full collections (a configuration sketch follows the list):
▪ -XX:ParallelGCThreads = n
▪ -XX:ConcGCThreads = [n, 2n]
▪ -XX:InitiatingHeapOccupancyPercent = 35
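A sketch of passing these flags to the executors; the thread counts below are placeholders to be sized against each executor's core count, following the n / [n, 2n] guideline above.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC " +
    "-XX:ParallelGCThreads=8 " +   // "n": roughly the cores available to the executor
    "-XX:ConcGCThreads=16 " +      // somewhere in [n, 2n]
    "-XX:InitiatingHeapOccupancyPercent=35")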
Takeaways
▪ Performance tuning is iterative
▪ Tuning is case by case
▪ Take advantage of the Spark UI, logs, and available monitoring
▪ Focus on the major slowdowns, not on one particular trick
▪ You can't be perfect