10 Things I Wish I'd Known Before Using Spark in Production

“10 things I wish I'd known before using Spark in production!”

Himanshu Arora
Lead Data Engineer, NeoLynk France
h.arora@neolynk.fr
@him_aro
Nitya Nand Yadav
Data Engineer, NeoLynk France
n.yadav@neolynk.fr
@nityany

What we are going to cover...
1. RDD vs DataFrame vs DataSet
2. Data Serialisation Formats
3. Storage formats
4. Broadcast join
5. Hardware tuning
6. Level of parallelism
7. GC tuning
8. Common errors
9. Data skew
10. Data locality

1/10 - RDD vs DataFrames vs DataSets

● RDD - Resilient Distributed Dataset
➔ Spark's core abstraction.
➔ Low-level transformations and actions, with control at the partition level.
➔ Suits unstructured data such as media streams or text streams.
➔ Data is manipulated with functional programming constructs.
➔ No built-in query optimization.

● DataFrame
➔ High-level abstraction with rich semantics.
➔ Like a big distributed SQL table.
➔ High-level expressions (aggregation, average, sum, SQL queries).
➔ Performance and optimizations (predicate pushdown, QBO, CBO...).
➔ No compile-time type checks: errors only show up at runtime.

● DataSet
➔ A collection of strongly typed JVM objects, dictated by a case class you define in Scala or a class in Java.
➔ DataFrame = Dataset[Row].
➔ Performance and optimisations, as with DataFrames.
➔ Type safety at compile time.
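
The difference between the two type-checking models can be sketched as follows. This is a minimal illustration, not code from the talk: the `Car` case class and the `TypedVsUntyped` helper are hypothetical names chosen for the example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical record type, used only for this illustration.
final case class Car(make: String, model: String, engineSize: Double)

object TypedVsUntyped {
  // Dataset: the field access is checked by the compiler;
  // a typo such as c.engineSiz would not compile.
  def bigEnginesTyped(cars: Dataset[Car]): Dataset[Car] =
    cars.filter(c => c.engineSize > 1.5)

  // DataFrame: the column name is resolved at runtime; a misspelt name
  // compiles fine and fails with an AnalysisException when the query is analysed.
  def bigEnginesUntyped(cars: DataFrame): DataFrame =
    cars.filter(cars("engineSize") > 1.5)
}
```

With a `SparkSession` in scope and `import spark.implicits._`, a `Seq[Car]` can be turned into a `Dataset[Car]` with `.toDS()` and passed to either function. One trade-off worth keeping in mind: a typed lambda is opaque to the optimizer, whereas a column expression can still be pushed down to the data source.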

2/10 - Data Serialisation Format
➔ Data is shuffled between executors in serialized form.
➔ RDDs cached or persisted on disk are serialized too.
➔ Spark's default serializer: Java serialization (slow, bulky output).
➔ Better: Kryo serialization.
➔ Kryo: faster and more compact (up to 10x).
➔ DataFrames/DataSets use Tungsten serialization (even better than Kryo).

2/10 - Data Serialisation Format
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf: SparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // register your own custom classes with Kryo before the session is created:
  // the session works on a copy of the conf, so later changes are ignored
  .registerKryoClasses(Array(classOf[MyCustomClass]))
val sparkSession: SparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()

3/10 - Storage Formats
➔ Avoid text formats such as plain text, JSON and CSV where possible.
➔ Use compressed binary formats instead.
➔ Popular choices: Apache Parquet, Apache Avro, Apache ORC.
➔ The use case dictates the choice.

3/10 - Storage Formats: Apache Parquet & Avro
➔ Binary formats.
➔ Splittable.
➔ Parquet: columnar. Avro: row-based.
➔ Parquet: higher compression rates than row-based formats.
➔ Parquet: read-heavy workloads. Avro: write-heavy workloads.
➔ The schema is stored in the files themselves.
➔ Avro: better support for schema evolution.

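3/10 - Storage Formats
// Parquet
sparkSession.conf.set("spark.sql.parquet.compression.codec", "snappy")
val parquetDf = sparkSession.read.parquet("s3a://....")
parquetDf.write.parquet("s3a://....")
// Avro: needs the spark-avro module on the classpath (Spark 2.4+); on older versions
// the Databricks spark-avro package adds .avro(...) via import com.databricks.spark.avro._
sparkSession.conf.set("spark.sql.avro.compression.codec", "snappy")
val avroDf = sparkSession.read.format("avro").load("s3a://....")
avroDf.write.format("avro").save("s3a://....")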

3/10 - Benchmark: using Avro instead of JSON (benchmark chart not included)

4/10 - Broadcast Join
import org.apache.spark.sql.functions.broadcast
// Spark automatically broadcasts small DataFrames (up to 10 MB by default)
sparkConf
  .set("spark.sql.autoBroadcastJoinThreshold", "2147483648") // in bytes (2 GB)
  .set("spark.sql.broadcastTimeout", "900") // in seconds, default is 300
/*
A broadcast that takes longer than the timeout fails with:
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
*/
// force a broadcast; "id" is a placeholder for your own join column
val result = bigDataFrame.join(broadcast(smallDataFrame), Seq("id"))

5/10 - Hardware Tuning
Know Your Cluster
● Number of nodes
● Cores per node
● RAM per node
● Cluster manager (YARN, Mesos…)
Let’s assume:
● 5 nodes
● 16 cores each
● 64 GB RAM each
● YARN as resource manager
● Spark in client mode

5/10 - Hardware Tuning / Scenario #1 (Small executors)
--num-executors 80    # 16 cores x 5 nodes
--executor-cores 1
--executor-memory 4G  # 64 GB / 16 executors per node
➔ No benefit from running multiple tasks in the same JVM (broadcast variables, accumulators… are not shared).
➔ Risk of running out of memory when computing a partition.

5/10 - Hardware Tuning / Scenario #2 (Large executors)
--num-executors 5      # one executor per node
--executor-memory 64G  # all the RAM of a node
➔ Very long garbage collection pauses.
➔ Poor HDFS performance (too many concurrent threads per executor).

5/10 - Hardware Tuning / Scenario #3 (Right Balance)
--num-executors 14     # 15 usable cores per node / 5 cores per executor = 3 per node, x 5 nodes, - 1
--executor-cores 5
--executor-memory 18G  # 64 GB / 3 executors per node, minus ~10% overhead, rounded down
➔ The recommended number of concurrent threads for HDFS is 5.
➔ Always leave one core per node for the YARN daemons.
➔ Always leave one executor for the YARN ApplicationMaster.
➔ Off-heap memory overhead for YARN = 10% of executor memory.
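
The arithmetic behind Scenario #3 can be sketched as a small helper. This is only an illustration of the rules listed on the slide; `Cluster`, `ExecutorPlan` and `ExecutorSizing` are hypothetical names, and the rounding (integer division, then rounding down after removing the overhead) is an assumption that happens to reproduce the slide's numbers.

```scala
// Hypothetical types for the illustration; not part of Spark.
final case class Cluster(nodes: Int, coresPerNode: Int, ramGbPerNode: Int)
final case class ExecutorPlan(numExecutors: Int, executorCores: Int, executorMemoryGb: Int)

object ExecutorSizing {
  def plan(cluster: Cluster,
           coresPerExecutor: Int = 5,         // recommended concurrent threads for HDFS
           overheadFraction: Double = 0.10    // off-heap overhead reserved for YARN
          ): ExecutorPlan = {
    val usableCores      = cluster.coresPerNode - 1                  // one core for the YARN daemons
    val executorsPerNode = usableCores / coresPerExecutor
    val numExecutors     = executorsPerNode * cluster.nodes - 1      // one slot for the ApplicationMaster
    val memPerExecutorGb = cluster.ramGbPerNode / executorsPerNode   // integer division
    val heapGb           = math.floor(memPerExecutorGb * (1.0 - overheadFraction)).toInt
    ExecutorPlan(numExecutors, coresPerExecutor, heapGb)
  }
}
```

For the cluster assumed in this section, `ExecutorSizing.plan(Cluster(nodes = 5, coresPerNode = 16, ramGbPerNode = 64))` yields 14 executors with 5 cores and 18 GB each.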

5/10 - Benchmark
➔ Hardware tuning.
➔ Moving from the Java serializer to Kryo.

6/10 - Level of parallelism/partitions
rdd = sc.textFile('demo.gz')  # gzip is not splittable: the whole file lands in 1 partition
rdd = rdd.repartition(100)    # spread the data over 100 partitions
➔ The maximum size of a partition is limited by the memory available to an executor.
➔ Increasing the partition count leaves each partition with less data.
➔ Spark cannot split files compressed with a non-splittable codec (e.g. gzip) and creates only 1 partition, so repartition yourself.

7/10 - GC Tuning
➔ Quick wins to avoid long GC pauses when using a large JVM heap.
spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB
// if the driver creates too many objects (e.g. collect()),
// which is not a very good idea anyway
spark.driver.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB

8/10 - Knock knock… Who’s there?… An error :(
Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used.

● Possible causes
➔ Not enough executor memory.
➔ Too many executor cores (implies too much parallelism).
➔ Not enough Spark partitions.
➔ Data skew (let’s talk about that later…).
● Possible solutions
➔ Increase executor memory.
➔ Reduce the number of executor cores.
➔ Increase the number of Spark partitions.
➔ Persist in memory and on disk (or just on disk) with serialization.
➔ Use off-heap memory for caching.
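
The "persist with serialization" option can be sketched with Spark's `StorageLevel` constants. This is a minimal example; the `SerializedPersist` object and its method names are hypothetical, only `persist` and the storage levels come from Spark.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object SerializedPersist {
  // Keeps partitions in memory in serialized form and spills to disk
  // whatever does not fit, trading some CPU for a smaller memory footprint.
  def memoryAndDiskSerialized[T](rdd: RDD[T]): RDD[T] =
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

  // Keeps partitions on disk only.
  def diskOnly[T](rdd: RDD[T]): RDD[T] =
    rdd.persist(StorageLevel.DISK_ONLY)
}
```

Serialized levels are where the choice of serializer (Java vs Kryo) from section 2/10 pays off.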

19/01/31 21:03:13 INFO DAGScheduler: Host lost: ip-172-29-149-243.eu-west-1.compute.internal (epoch 16)
19/01/31 21:03:13 INFO BlockManagerMasterEndpoint: Trying to remove executors on host ip-172-29-149-243.eu-west-1.compute.internal from BlockManagerMaster.
19/01/31 21:03:13 INFO BlockManagerMaster: Removed executors on host ip-172-29-149-243.eu-west-1.compute.internal successfully.

{
  "Name": "ScaleInContainerPendingRatio",
  "Description": "Scale in on ContainerPendingRatio",
  "Action": {
    "SimpleScalingPolicyConfiguration": {
      "AdjustmentType": "CHANGE_IN_CAPACITY",
      "ScalingAdjustment": -1,
      "CoolDown": 300
    }
  },
  "Trigger": {
    "CloudWatchAlarmDefinition": {
      "ComparisonOperator": "LESS_THAN_OR_EQUAL",
      "EvaluationPeriods": 3,
      "MetricName": "ContainerPendingRatio",
      "Namespace": "AWS/ElasticMapReduce",
      "Dimensions": [
        {
          "Value": "$${emr.clusterId}",
          "Key": "JobFlowId"
        }
      ],
      "Period": 300,
      "Statistic": "AVERAGE",
      "Threshold": 0,
      "Unit": "COUNT"
    }
  }
}
ContainerPendingRatio = containers pending / containers allocated

Is this really my cluster...?

{
  "Name": "ScaleInMemoryPercentage",
  "Description": "Scale in on YARNMemoryAvailablePercentage",
  "Action": {
    "SimpleScalingPolicyConfiguration": {
      "AdjustmentType": "CHANGE_IN_CAPACITY",
      "ScalingAdjustment": -2,
      "CoolDown": 300
    }
  },
  "Trigger": {
    "CloudWatchAlarmDefinition": {
      "ComparisonOperator": "GREATER_THAN",
      "EvaluationPeriods": 3,
      "MetricName": "YARNMemoryAvailablePercentage",
      "Namespace": "AWS/ElasticMapReduce",
      "Dimensions": [
        {
          "Key": "JobFlowId",
          "Value": "$${emr.clusterId}"
        }
      ],
      "Period": 300,
      "Statistic": "AVERAGE",
      "Threshold": 95.0,
      "Unit": "PERCENT"
    }
  }
}

9/10 - Data Skew
➔ Data is not uniformly distributed across partitions.
➔ Typically shows up during joins, aggregations, etc.
➔ E.g. joining on a column that contains many nulls.
➔ Might cause java.lang.OutOfMemoryError: Java heap space.

df1.join(df2, Seq("make", "model"))

9/10 - Data Skew: possible solutions
➔ Repartition your data on a key (RDD) or column (DataFrame) that distributes the data evenly.
➔ Use non-skewed column(s) for the join.
➔ Replace null values of the join column with NULL_X (X being a random number).
➔ Salting.
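
The NULL_X replacement could look like the sketch below. It assumes a string join column, that no real key starts with "NULL_", and a join type in which unmatched rows are kept (e.g. a left outer join), since the replaced values match nothing, just as the nulls did. `NullKeySalting` and `saltNulls` are hypothetical names.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{coalesce, col, concat, floor, lit, rand}

object NullKeySalting {
  /** Replaces null values of a string join column with NULL_0 .. NULL_(n-1),
    * so that the former nulls are spread over up to n distinct keys. */
  def saltNulls(df: DataFrame, joinCol: String, n: Int): DataFrame = {
    // Random integer suffix in [0, n), rendered as a string.
    val randomSuffix: Column = floor(rand() * n).cast("string")
    df.withColumn(joinCol, coalesce(col(joinCol), concat(lit("NULL_"), randomSuffix)))
  }
}
```

After the join, the original nulls can be restored by mapping every value that starts with "NULL_" back to null if downstream code depends on them.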

df1.join(df2, Seq("make", "model", "engine_size"))

Let’s sprinkle some SALT on data skew...!

9/10 - What if no repartitioning key distributes the data evenly?
Salting key = actual partition key + random fake key
(where the fake key takes a value from 1 to N, N being the level of distribution, i.e. the number of partitions)

➔ Joining DataFrames: add a salt column to the bigger DataFrame and broadcast the smaller one, replicated with an additional column holding the values 1 to N.
➔ If both are too big to broadcast: salt one and iteratively broadcast the other.
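
The first case (salt the bigger DataFrame, broadcast the smaller one) can be sketched as below. This is one possible implementation under stated assumptions, not the speakers' code: `SaltedJoin`, `saltedJoin` and the `salt` column name are hypothetical, and neither input is assumed to already contain a column called `salt`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, broadcast, explode, floor, lit, rand}

object SaltedJoin {
  def saltedJoin(big: DataFrame, small: DataFrame, keys: Seq[String], n: Int): DataFrame = {
    // Bigger side: each row gets one random salt value in 1..n.
    val saltedBig = big.withColumn("salt", (floor(rand() * n) + 1).cast("int"))
    // Smaller side: each row is replicated n times, once per salt value,
    // so every salted row of the bigger side still finds its match.
    val saltedSmall = small.withColumn("salt", explode(array((1 to n).map(lit(_)): _*)))
    // Join on the original keys plus the salt, then drop the helper column.
    saltedBig.join(broadcast(saltedSmall), keys :+ "salt").drop("salt")
  }
}
```

The result contains the same rows as a plain inner join on `keys`; the cost is that the smaller side grows N times, so N should stay modest.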

10/10 - Data Locality
➔ Why is it important?

10/10 - Data Locality
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession
  .builder()
  .appName("spark-app")
  .config("spark.locality.wait", "60s") // default is 3s
  .config("spark.locality.wait.node", "0") // 0 = do not wait for a node-local slot
  .config("spark.locality.wait.process", "10s")
  .config("spark.locality.wait.rack", "30s")
  .getOrCreate()

Thank you... very much!
Slides: http://tiny.cc/tiq56y
@him_aro @nityany
