Performance tuning of Apache Spark
Melbourne Apache Spark meetup
Maksud Ibrahimov
February 2016
Who am I?
—  Chief Data Scientist at InfoReady - a leading Australian data &
analytics business
—  PhD in Artificial Intelligence from the University of Adelaide
—  Over the last 10+ years, I have worked with and improved the operations of
major companies in mining, manufacturing, retail and logistics
through machine learning, optimisation and simulation
—  Particular interest in applying these algorithms in cluster
computing environments, hence Apache Spark
—  User of Spark since 1.0 release
What is Apache Spark?
18/02/16 Performance tuning of Apache Spark 3
—  Apache Spark is a fast and general engine for large-scale data
processing.
—  Run programs up to 100x faster than Hadoop MapReduce in memory,
or 10x faster on disk.
—  Write applications quickly in Java, Scala, Python, R
—  SQL
—  Streaming
—  Graph processing
—  Machine learning
How easy is it to write programs in Spark?
—  Fairly easy to start. Within 1-2 days you can start writing simple
programs on a single machine
—  Not too hard to deploy and run on a cluster with preconfigured
deployment options, such as Amazon EMR or the Hortonworks distribution
—  Once you start writing programs that run on a cluster with a few nodes,
you may notice that execution is not as fast as expected
Generally performance can be improved by tuning
the following areas
•  Partitioning. Do you take full advantage of Spark's parallel capabilities?
Do you spill to disk?
•  Runtime configuration. Is your configuration tuned to your task?
•  Optimal code. Can you perform your computation more efficiently?
Algorithm complexity analysis.
•  Cluster and hardware. What hardware and how many nodes do you need?
How do you run jobs quicker while keeping costs down?
•  Persistence. Do you perform unnecessary recomputes by failing to cache
RDDs?
•  Isolating bottlenecks. How do you find which resource is your
bottleneck? Block-time analysis.
Key concepts to understand for performance tuning
—  Spark performance metrics
—  Memory model
—  Partitioning
—  DAG and shuffles
—  Persistence
Spark programs consist of jobs, stages and tasks
—  Each Spark program runs as a job
—  The DAG scheduler splits jobs into stages
—  Tasks belong to a stage. A task is a unit of work that runs on an
executor and corresponds to a single partition
—  Each task either partitions its output for a “shuffle”, or
sends the output back to the driver
[Diagram: a job is split into stages; each stage consists of tasks]
Shuffle anatomy
[Diagram: tasks in Stage 1 perform the shuffle write; tasks in Stage 2 perform the shuffle read]
—  A shuffle redistributes data among partitions
—  Files are written to disk at the end of one stage and read by the
next stage
—  Reducing the number of shuffles will generally improve
performance
Spark memory model
—  Execution memory: shuffles
—  Storage memory: caching
—  Pre-1.6.0, memory ratios had to be configured manually
—  1.6.0: unified memory management
[Diagram: storage memory, execution memory, and the file system cache]
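As a sketch of the difference: before 1.6.0 the storage/execution split was fixed by configuration, while the unified model lets the two regions borrow from each other. The property names and defaults below are from the Spark configuration documentation:

```properties
# Pre-1.6.0: static split of the executor heap (legacy settings)
spark.storage.memoryFraction   0.6   # fraction of heap for cached RDDs
spark.shuffle.memoryFraction   0.2   # fraction of heap for shuffle buffers

# 1.6.0+: unified region shared by storage and execution
spark.memory.fraction          0.75  # fraction of heap for the unified region
spark.memory.storageFraction   0.5   # portion of that region protected from eviction
```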
How to debug performance?
—  Web UI is your friend
—  Failed executors: JVM crashes, memory issues, config issues, network
—  Identify stragglers
•  Is a particular node running slow? Turn speculation on
•  Data skew: max >> median
•  GC issues
•  jstack, jmap, or the UI stack dump
—  Recomputation
—  rdd.toDebugString or the Web UI
—  Metrics to watch
•  GC time. Much of it is gone in Spark 1.6 thanks to Tungsten
•  Disk spill
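One of the straggler mitigations above, speculation, is just a configuration switch; these are standard Spark properties (defaults shown):

```properties
spark.speculation             true   # re-launch slow-running tasks on another executor
spark.speculation.multiplier  1.5    # how many times slower than the median a task must be
spark.speculation.quantile    0.75   # fraction of tasks that must finish before speculating
```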
Using UI to find the cause of the skew
Find the problematic partition. The majority of such
problems are related to disk I/O
Cause: rdd.persist(StorageLevel.MEMORY_AND_DISK)
The same RDD can be split differently
[Diagram: the same 100 GB RDD split as 4 partitions of 25 GB each, or as 100 partitions of 1 GB each]
Spilling to disk
Small tasks vs Large tasks
[Diagram: with small tasks (1 sec each), the tasks pool runs four at a time within execution memory and never spills; with large tasks (60 sec each), the four concurrent tasks exceed execution memory and spill to disk]
Partitions
—  Partitions determine the degree of parallelism
—  Partitions too big
•  Few tasks, each taking a long time to execute
•  More memory needed per task => disk spills
•  More chance of data skew
—  Partitions too small
•  The overhead of launching a task dominates its runtime
—  Rule of thumb: task runtime just under 1 s and more than 100 ms
—  To control the number of partitions use coalesce() and repartition(). Note
that both may trigger a shuffle. In some cases an additional shuffle at the
beginning may improve performance
18/02/16 Performance tuning of Apache Spark 17
Partitions
[Graph: execution time vs number of partitions, with a minimum inside an optimal partition-size range]
Partitioning strategies
—  Fixed number of partitions
•  numPartitions = 100
—  Fixed number of records per partition
•  numPartitions = rdd.count / 100
—  Fixed memory size of a partition. Calculate the number of partitions based
on the memory consumption of the RDD, or of a sample of it
•  partitionSpace = 100 MB
•  rowsPerPartition = partitionSpace / meanRowSpace(rdd)
•  numPartitions = rdd.count / rowsPerPartition
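The fixed-memory-size strategy above can be sketched in plain Scala. Here a hypothetical `bytesPerRow` argument stands in for meanRowSpace(rdd), which in a real job you would estimate from a sample of the RDD:

```scala
// Pick the partition count so that each partition holds roughly
// `targetPartitionBytes` of data (default 100 MB).
def numPartitionsFor(totalRows: Long, bytesPerRow: Long,
                     targetPartitionBytes: Long = 100L * 1024 * 1024): Int = {
  // How many rows fit in one partition of the target size (at least 1).
  val rowsPerPartition = math.max(1L, targetPartitionBytes / bytesPerRow)
  // Round up so the last partial partition is still counted.
  math.max(1, math.ceil(totalRows.toDouble / rowsPerPartition).toInt)
}
```

The result would then be fed to repartition(); for a billion 100-byte rows and a 100 MB target this gives on the order of a thousand partitions.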
CPU vs Memory. What should I add to increase
performance of my job?
—  Parameters I can play with:
•  Number of cores per node
•  Amount of memory per node
•  Both are related to the number of nodes
•  Amount of storage. Normally not a problem
•  Network. Normally fixed
Where is your bottleneck?
[Diagram: candidate bottlenecks: RAM, CPU, network, storage I/O]
Block-time analysis
Source: Kay Ousterhout, Spark Summit 2015
How job completion time would change if the network
were infinitely fast
Source: Kay Ousterhout, Spark Summit 2015
How do we do it?
—  The goal is 100% CPU utilisation on each of the nodes
—  Limit your resources to a single core and just 1 GB of memory
—  Run the job on a subset of the data, so that the full job runs in about a
minute. This way you can iterate through your tuning experiments
much faster.
—  Tweak the memory and partition size until there are no disk spills. This is
the amount of memory needed per core
—  Scale cores and memory proportionally
—  Make sure the partition size is the same as for the full job
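The "limit your resources" step above can be done straight from spark-submit; the flags are standard, while the class and jar names are placeholders:

```shell
# Run the tuning experiment on a single core with 1 GB of memory
spark-submit \
  --master "local[1]" \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --class com.example.MyJob \
  my-job.jar
```

Once the small run no longer spills, scale --executor-cores and --executor-memory up proportionally.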
Key lessons
—  Understand the memory model
—  Avoid expensive shuffles if possible
—  Choose number/size of partitions
—  Use persistence when reusing RDDs
—  Do not spill to disk
—  Make the job CPU bound and scale for performance
—  Experiment on small subsets and limited resources
Thank you!