Performance tuning of Apache Spark
Melbourne Apache Spark meetup
Maksud Ibrahimov
February 2016
Who am I?
—  Chief Data Scientist at InfoReady - a leading Australian data &
analytics business
—  PhD in Artificial Intelligence from the University of Adelaide
—  Over the last 10+ years, I have worked with and improved the operations of
major companies in mining, manufacturing, retail and logistics
through machine learning, optimisation and simulation
—  Particular interest in applying these algorithms in cluster
computing environments, hence Apache Spark
—  User of Spark since 1.0 release
What is Apache Spark?
18/02/16 Performance tuning of Apache Spark 3
—  Apache Spark is a fast and general engine for large-scale data
processing.
—  Run programs up to 100x faster than Hadoop MapReduce in memory,
or 10x faster on disk.
—  Write applications quickly in Java, Scala, Python, R
—  SQL
—  Streaming
—  Graph processing
—  Machine learning
How easy is it to write programs in Spark?
—  Fairly easy to start. Within 1-2 days you can start writing simple
programs on a single machine
—  Not too hard to deploy and run on a cluster with preconfigured
deployment options, such as Amazon EMR or the Hortonworks distribution
—  Once you start writing programs that run on a cluster with a few nodes,
you may notice that execution is not as fast as expected
Generally performance can be improved by tuning
the following areas
•  Partitioning. Do you take full advantage of Spark's parallel capabilities?
Do you spill to disk?
•  Runtime configuration. Is your configuration tuned to your task?
•  Optimal code. Can you perform your computation more efficiently?
Algorithm complexity analysis.
•  Cluster and hardware. What hardware and how many nodes do you need?
How do you run jobs quicker while keeping costs down?
•  Persistence. Do you perform unnecessary recomputes by failing to cache
RDDs?
•  Isolating bottlenecks. How do you find which resource is your
bottleneck? Block-time analysis.
Key concepts to understand for performance tuning
—  Spark performance metrics
—  Memory model
—  Partitioning
—  DAG and shuffles
—  Persistence
Spark programs consist of jobs, stages and tasks
—  Each Spark program runs as a job
—  The DAG scheduler splits jobs into stages
—  Tasks belong to a stage. A task is a unit of work that runs on an
executor and corresponds to a single partition
—  Each task either partitions its output for a “shuffle”, or
sends the output back to the driver
[Diagram: a job is split into stages; each stage consists of tasks]
Shuffle anatomy
[Diagram: tasks in Stage 1 perform the shuffle write; tasks in Stage 2 perform the shuffle read]
—  A shuffle redistributes data among partitions
—  Files are written to disk at the end of one stage and read by the
next stage
—  Reducing the number of shuffles will generally improve
performance
Spark memory model
—  Execution memory: shuffles
—  Storage memory: caching
—  Pre-1.6.0, memory ratios had to be configured manually
—  1.6.0: unified memory management
[Diagram: storage memory, execution memory, and the file system cache]
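As a sketch of the difference: before 1.6.0 the storage/execution split was fixed by configuration, while the unified model lets the two regions borrow from each other. The property names and defaults below are from the Spark configuration documentation:

```properties
# Pre-1.6.0: static split of the executor heap (legacy settings)
spark.storage.memoryFraction   0.6   # fraction of heap for cached RDDs
spark.shuffle.memoryFraction   0.2   # fraction of heap for shuffle buffers

# 1.6.0+: unified region shared by storage and execution
spark.memory.fraction          0.75  # fraction of heap for the unified region
spark.memory.storageFraction   0.5   # portion of that region protected from eviction
```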
How to debug performance?
—  Web UI is your friend
—  Failed executors: JVM crashes, memory issues, config issues, network
—  Identify stragglers
•  Is a particular node running slow? Turn speculation on
•  Data skew: max >> median
•  GC issues
•  jstack, jmap, or the UI stack dump
—  Recomputation
—  rdd.toDebugString or the Web UI
—  Metrics to watch
•  GC time. Much of it is gone in Spark 1.6 thanks to Tungsten
•  Disk spill
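One of the straggler mitigations above, speculation, is just a configuration switch; these are standard Spark properties (defaults shown):

```properties
spark.speculation             true   # re-launch slow-running tasks on another executor
spark.speculation.multiplier  1.5    # how many times slower than the median a task must be
spark.speculation.quantile    0.75   # fraction of tasks that must finish before speculating
```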
Using UI to find the cause of the skew
Find the problematic partition. The majority of such
problems are related to disk I/O
Cause: rdd.persist(StorageLevel.MEMORY_AND_DISK)
The same RDD can be split differently
[Diagram: the same 100 GB RDD split as 4 partitions of 25 GB each, or as 100 partitions of 1 GB each]
Spilling to disk
Small tasks vs Large tasks
[Diagram: with small tasks (1 sec each), the tasks pool runs four at a time within execution memory and never spills; with large tasks (60 sec each), the four concurrent tasks exceed execution memory and spill to disk]
Partitions
—  Partitions determine the degree of parallelism
—  Partitions too big
•  Few tasks, each taking a long time to execute
•  More memory needed per task => disk spills
•  More chance of data skew
—  Partitions too small
•  The overhead of launching a task dominates its runtime
—  Rule of thumb: task runtime just under 1 s and more than 100 ms
—  To control the number of partitions use coalesce() and repartition(). Note
that both may trigger a shuffle. In some cases an additional shuffle at the
beginning may improve performance
18/02/16 Performance tuning of Apache Spark 17
Partitions
[Graph: execution time vs number of partitions, with a minimum inside an optimal partition-size range]
Partitioning strategies
—  Fixed number of partitions
•  numPartitions = 100
—  Fixed number of records per partition
•  numPartitions = rdd.count / 100
—  Fixed memory size of a partition. Calculate the number of partitions based
on the memory consumption of the RDD, or of a sample of it
•  partitionSpace = 100 MB
•  rowsPerPartition = partitionSpace / meanRowSpace(rdd)
•  numPartitions = rdd.count / rowsPerPartition
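The fixed-memory-size strategy above can be sketched in plain Scala. Here a hypothetical `bytesPerRow` argument stands in for meanRowSpace(rdd), which in a real job you would estimate from a sample of the RDD:

```scala
// Pick the partition count so that each partition holds roughly
// `targetPartitionBytes` of data (default 100 MB).
def numPartitionsFor(totalRows: Long, bytesPerRow: Long,
                     targetPartitionBytes: Long = 100L * 1024 * 1024): Int = {
  // How many rows fit in one partition of the target size (at least 1).
  val rowsPerPartition = math.max(1L, targetPartitionBytes / bytesPerRow)
  // Round up so the last partial partition is still counted.
  math.max(1, math.ceil(totalRows.toDouble / rowsPerPartition).toInt)
}
```

The result would then be fed to repartition(); for a billion 100-byte rows and a 100 MB target this gives on the order of a thousand partitions.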
CPU vs Memory. What should I add to increase
performance of my job?
—  Parameters I can play with:
•  Number of cores per node
•  Amount of memory per node
•  Both are related to the number of nodes
•  Amount of storage. Normally not a problem
•  Network. Normally fixed
Where is your bottleneck?
[Diagram: candidate bottlenecks: RAM, CPU, network, storage I/O]
Block-time analysis
Source: Kay Ousterhout, Spark Summit 2015
How job completion time would change if the network
were infinitely fast
Source: Kay Ousterhout, Spark Summit 2015
How do we do it?
—  The goal is 100% CPU utilisation on each of the nodes
—  Limit your resources to a single core and just 1 GB of memory
—  Run the job on a subset of the data, so that the full job runs in about a
minute. This way you can iterate through your tuning experiments
much faster.
—  Tweak the memory and partition size until there are no disk spills. This is
the amount of memory needed per core
—  Scale cores and memory proportionally
—  Make sure the partition size is the same as for the full job
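The "limit your resources" step above can be done straight from spark-submit; the flags are standard, while the class and jar names are placeholders:

```shell
# Run the tuning experiment on a single core with 1 GB of memory
spark-submit \
  --master "local[1]" \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --class com.example.MyJob \
  my-job.jar
```

Once the small run no longer spills, scale --executor-cores and --executor-memory up proportionally.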
Key lessons
—  Understand the memory model
—  Avoid expensive shuffles if possible
—  Choose number/size of partitions
—  Use persistence when reusing RDDs
—  Do not spill to disk
—  Make the job CPU bound and scale for performance
—  Experiment on small subsets and limited resources
Thank you!