HAYSSAM SALEH
Spark / Mesos Cluster Optimization
Paris Spark Meetup, April 28th 2016 @Criteo
[Architecture diagram: source data (clicks, navigation, reviews, events, media, accounts, visitor sessions, searches, banners, ...) lands in the DataLake; a Spark Job transforms it and loads the Elasticsearch DataMart]
Project	Context
Interactive Discovery
Elasticsearch DataMart (KPI)
OBJECTIVES
— 100 GB of data / day => 50 million requests
— 40 TB / year
— How we went from a 4-hour job on:
¢ 6 nodes
¢ 8 cores & 32 GB per node
— to a 20-minute job on:
¢ 4 nodes, 8 cores & 8 GB per node
SUMMARY
¢ Spark	concepts
¢ Spark	UI	offline
¢ Application	Optimization
— Shuffling
— Partitioning
— Closures
¢ Parameters	Optimization
— Spark	Shuffling
— Mesos application	distribution
¢ Elasticsearch Optimization
— Google	“elasticsearch performance	tuning”	->	blog.ebiznext.com
SPARK: THE CONCEPTS
¢ Application
— The main program (driver code)
¢ Job
— One roundtrip Driver -> Cluster, triggered by a Spark action
¢ Stage
— Delimited by a shuffle boundary
¢ Task
— A thread working on a single RDD partition
¢ Partition
— RDDs are split into partitions
— A partition is the unit of work of a task
¢ Executor
— A system process running tasks
[Diagram: one Application (driver code) runs n Jobs (one per Spark action); each Job contains n Stages (split at shuffle boundaries); each Stage runs n Tasks, one task per RDD partition]
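A minimal sketch of how these concepts map onto code (the input path is illustrative): one application whose single action triggers one job, split into two stages at the shuffle boundary, with one task per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConceptsDemo {
  def main(args: Array[String]): Unit = {            // Application: the main program (driver code)
    val sc = new SparkContext(new SparkConf().setAppName("concepts"))

    val pairs = sc.textFile("hdfs:///tmp/input.txt")  // RDD split into partitions
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))                         // narrow transformations: stage 1

    val counts = pairs.reduceByKey(_ + _)             // shuffle boundary: stage 2 starts here

    counts.count()                                    // action => one job (driver -> cluster roundtrip)
    sc.stop()
  }
}
```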
SPARK UI	OFFLINE
spark-env.sh (on the driver)
[Screenshots of the Spark UI: application, jobs, stages and tasks views]
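As a sketch of how the UI can be replayed offline (directory and app name are illustrative, not from the talk): enable event logging in the application and point the history server at the same location.

```scala
// Hedged sketch: write event logs so the Web UI can be rebuilt after the job finishes.
val conf = new org.apache.spark.SparkConf()
  .setAppName("my-app")
  .set("spark.eventLog.enabled", "true")                   // record every job / stage / task event
  .set("spark.eventLog.dir", "hdfs:///spark/event-logs")   // location shared with the history server

// The history server then replays these logs:
//   $SPARK_HOME/sbin/start-history-server.sh
// with spark.history.fs.logDirectory pointing at the same directory
// (typically exported through SPARK_HISTORY_OPTS in spark-env.sh on the driver host).
```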
APPLICATION OPTIMIZATION: SHUFFLING
WHY OPTIMIZE SHUFFLING
Distance between data & CPU      Duration (scaled)
L1 cache                         1 second
RAM                              3 minutes
Node-to-node communication       3 days
TRANSFORMATIONS LEADING TO SHUFFLING
¢ repartition
¢ cogroup
¢ ...join
¢ ...ByKey
¢ distinct
SHUFFLING OPTIMIZATION 1/2
[Diagram: map turns each (k, v, w) record into a (k, v) pair; groupByKey then shuffles every single record across the network to build (k1, [v1, v4, v5]), (k2, [v2, v7, v8]), (k3, [v3, v6, v9]) before the values are summed]
Shuffling: every record crosses the network.
Bonus: OOM, since all values of a key must fit on one executor.
SHUFFLING OPTIMIZATION 2/2
[Diagram: map turns each (k, v, w) record into a (k, v) pair; reduceByKey first combines locally within each partition (step 1/2), e.g. (k1, v4+v5) and (k2, v7+v8), then shuffles only these partial sums and combines them again (step 2/2) into (k1, v1+v4+v5), (k2, v2+v7+v8), (k3, v3+v6+v9)]
Less shuffling: only one partial value per key and per partition crosses the network.
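A short sketch of the two diagrams above (assuming a SparkContext `sc`; the data is illustrative): both lines compute the same sums, but with very different shuffle volumes.

```scala
val pairs = sc.parallelize(Seq(("k1", 1), ("k2", 2), ("k3", 3), ("k1", 4), ("k1", 5)))

// 1/2: groupByKey ships every (k, v) record across the network, then sums;
// a key with many values can blow up a single executor (the "bonus" OOM).
val sums1 = pairs.groupByKey().mapValues(_.sum)

// 2/2: reduceByKey combines locally on each partition first, so only one
// partial sum per key and per partition is shuffled.
val sums2 = pairs.reduceByKey(_ + _)
```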
OPTIMIZATIONS BROUGHT BY SPARK SQL
(-) Weak typing
(-) Limited Java support
(-) Schema inference not supported
• requires scala.Product (case classes)
DATASETS (EXPERIMENTAL)
[Code samples: Scala and Java versions of the Dataset API]
(+) API similar to RDD, strongly typed
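A hedged sketch of the (then experimental) Spark 1.6 Dataset API, assuming an existing SQLContext `sqlContext`; the Person class and people.json path are illustrative.

```scala
import sqlContext.implicits._

case class Person(name: String, age: Long)

// Strongly typed Dataset[Person] instead of an untyped DataFrame
val people = sqlContext.read.json("people.json").as[Person]

// RDD-like, compile-time checked operations on Person fields
val adultNames = people.filter(p => p.age >= 18).map(p => p.name)
```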
REDUCE OBJECT SIZE
CLOSURES
Indirect reference: referring to a field captures the whole enclosing object in the closure. Use local variables instead.
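A minimal sketch of the problem and the fix (the WordMatcher class is hypothetical): referencing a field inside a closure drags the entire enclosing object into every task, inflating the serialized closure.

```scala
import org.apache.spark.rdd.RDD

class WordMatcher(val query: String) extends Serializable {
  // Indirect reference: `query` is really `this.query`, so the whole
  // WordMatcher instance is serialized and shipped with every task.
  def matchesBad(lines: RDD[String]): RDD[String] =
    lines.filter(line => line.contains(query))

  // Fix: copy the field into a local variable; only that small String
  // travels with the closure.
  def matchesGood(lines: RDD[String]): RDD[String] = {
    val q = query
    lines.filter(line => line.contains(q))
  }
}
```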
RIGHT PARTITIONING
PARTITIONING
¢ Symptoms
— A large number of short-lived tasks
¢ Solutions
— Review your implementation
— Define a custom partitioner
— Control the number of partitions with
¢ coalesce and repartition (see the sketch below)
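A sketch of those controls, assuming a SparkContext `sc`; the path, key extraction and partition counts are illustrative.

```scala
import org.apache.spark.HashPartitioner

val lines = sc.textFile("hdfs:///tmp/events")   // possibly too many small partitions
val fewer = lines.coalesce(20)                  // merge partitions, avoids a shuffle when shrinking
val more  = lines.repartition(200)              // full shuffle into more partitions

// Impose an explicit partitioner on key/value data so later ...ByKey / join
// operations reuse the same layout instead of reshuffling:
val byVisitor = lines.map(l => (l.takeWhile(_ != ';'), l))
  .partitionBy(new HashPartitioner(40))
```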
CHOOSE THE RIGHT SERIALIZATION ALGORITHM
REDUCE OBJECT SIZE
¢ Kryo Serialization
— Reduces space: up to 10x
— Performance gain: 2x to 3x
1.	Require	Spark	to	use	Kryo serialization
2.	Optimize	class	naming	=>	reduce	object	size
3.	Force	all	serialized	classes	to	register
Not applicable to the Dataset API, which uses Encoders instead
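A sketch of the three steps above; the Click and Visit classes stand in for the application's own domain objects.

```scala
import org.apache.spark.SparkConf

// Illustrative classes to register (use your own serialized types)
case class Click(visitorId: String, ts: Long)
case class Visit(visitorId: String, pages: Seq[String])

val conf = new SparkConf()
  // 1. Require Kryo instead of the default Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // 2. Register classes so Kryo writes a small numeric id instead of the full class name
  .registerKryoClasses(Array(classOf[Click], classOf[Visit]))
  // 3. Fail fast if any serialized class was not registered
  .set("spark.kryo.registrationRequired", "true")
```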
DATASETS PROMISE
MEMORY TUNING
SPARK MEMORY MANAGEMENT – SPARK 1.6.X
[Diagram: unified memory layout for a 16 GB executor heap]
— 300 MB reserved for Spark internals
— Memory assigned to Spark data (storage + shuffle): spark.memory.fraction = 0.75 => ≈ 11.78 GB
— Memory assigned to user data: 1 - spark.memory.fraction => ≈ 3.92 GB
— spark.memory.storageFraction = 0.5 splits the Spark region between storage and shuffle
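A sketch of where those numbers come from (defaults written out explicitly; Spark 1.6.x settings):

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "16g")
  .set("spark.memory.fraction", "0.75")        // share of (heap - 300 MB) managed by Spark
  .set("spark.memory.storageFraction", "0.5")  // storage half of that unified region

// (16384 MB - 300 MB) * 0.75 ≈ 11.78 GB for Spark data (storage + shuffle)
// (16384 MB - 300 MB) * 0.25 ≈  3.92 GB left for user data structures
```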
SPARK MEMORY MANAGEMENT – SPARK 1.6.X
¢ Storage fraction
— Hosts cached RDDs
— Hosts broadcast variables
— Data in this fraction is subject to eviction
¢ Shuffling fraction
— Holds intermediate shuffle data
— Spills to disk when full
— Data in this fraction cannot be evicted by other threads
SPARK MEMORY MANAGEMENT – SPARK 1.6.X
¢ The shuffle & storage fractions may borrow memory from each other under certain conditions:
— The shuffle fraction cannot extend beyond its defined size when the storage fraction already uses all of its own share
— The storage fraction can never evict data from the shuffle fraction, even when the shuffle fraction has grown beyond its defined size
THE SHUFFLE MANAGER
SPARK.SHUFFLE.MANAGER
HASH SHUFFLE MANAGER
[Diagram: inside one Executor, number of concurrent tasks = spark.executor.cores / spark.task.cpus (default 1) (spark.executor.cores: YARN & standalone only); each task writes one file per reduce partition (File 1, File 2, ...) to the local filesystem under spark.local.dir]
SORT SHUFFLE MANAGER
[Diagram: same executor/task layout; each task appends records to a single in-memory AppendOnly map, sorts them by reducer id and spills to disk (spark.shuffle.spill = true), producing one output file per task under spark.local.dir, sorted by reducer id (id1, id2, id3, ...)]
SHUFFLE MANAGER
¢ spark.shuffle.manager = hash
— Performs better when the number of mappers/reducers is small (generates M * R files)
¢ spark.shuffle.manager = sort
— Performs better for a large number of mappers/reducers
¢ Best of both worlds
— spark.shuffle.sort.bypassMergeThreshold
¢ spark.shuffle.manager = tungsten-sort ???
— Similar to sort with the following benefits:
¢ Off-heap allocation
¢ Works directly on serialized binary objects (much smaller than JVM objects) – aka memcopy
¢ Uses 8 bytes per record for sorting => takes advantage of the CPU L* caches
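A sketch of the corresponding settings (Spark 1.6.x; 200 is the documented default threshold, shown here explicitly):

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.manager", "sort")                    // "hash", "sort" or "tungsten-sort"
  // With the sort manager, keep the cheap hash-style path as long as the
  // number of reduce partitions stays below this threshold:
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")
```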
SPARK ON MESOS
COARSE GRAINED MODE
(+) Static resource allocation
[Diagram: the Spark Driver's CoarseMesosSchedulerBackend registers with the Mesos Master; each Mesos Worker launches one CoarseGrainedExecutorBackend whose single long-lived Executor runs many tasks]
COARSE GRAINED MODE
(-) Only one executor / node
(-) No control over the number of executors
[Diagram: desired layouts of 4 x (8 cores, 16 GB) executors on a 32 cores / 64 GB node and 2 x (16 cores, 16 GB) executors on a 32 cores / 32 GB node]
FINE GRAINED MODE
(-) Only one executor / node
(-) No guarantee of resource availability
(-) No control over the number of executors
[Diagram: the Spark Driver's MesosSchedulerBackend registers with the Mesos Master; each Mesos Worker runs a MesosExecutorBackend hosting one Executor per node, and tasks are scheduled individually by Mesos; spark.mesos.mesosExecutor.cores (default 1) is reserved per executor]
SOLUTION 1	:	USE MESOS ROLES
¢ Limit the number of cores per node offered to Spark on each Mesos worker
(-) Must be configured for each Spark application
• Assign a Mesos role to your Spark job (see the sketch below)
[Diagram: four workers, each offering 8 cores / 16 GB under the Spark role]
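A hedged sketch of that setup; the role name and resource amounts are illustrative, not from the talk.

```scala
// On each Mesos worker, advertise part of the node under a dedicated role, e.g.:
//   mesos-slave --resources="cpus(spark):8;mem(spark):16384;cpus(*):24;mem(*):49152" ...
//
// Then each Spark application must opt in to that role:
val conf = new org.apache.spark.SparkConf()
  .set("spark.mesos.role", "spark")   // only resource offers for this role are accepted
```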
SOLUTION 2: DYNAMIC ALLOCATION (COARSE GRAINED MODE ONLY)
¢ Principles
— Dynamically	add/remove	Spark	executors
¢ Requires a dedicated process to serve shuffled data after executors are removed (spawned on each Mesos worker through Marathon)
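A hedged sketch of the corresponding configuration (executor counts are illustrative):

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  // Executors come and go, so shuffle files must outlive them:
  .set("spark.shuffle.service.enabled", "true")

// The external shuffle service is the dedicated process mentioned above,
// kept alive on every Mesos worker (e.g. as a Marathon application),
// typically started with sbin/start-mesos-shuffle-service.sh.
```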
QUESTIONS ?
