“10 things I wish I'd known before using Spark in production!”
Himanshu Arora
Lead Data Engineer, NeoLynk France
h.arora@neolynk.fr
@him_aro
Nitya Nand Yadav
Data Engineer, NeoLynk France
n.yadav@neolynk.fr
@nityany
What we are going to cover...
1. RDD vs DataFrame vs DataSet
2. Data Serialisation Formats
3. Storage formats
4. Broadcast join
5. Hardware tuning
6. Level of parallelism
7. GC tuning
8. Common errors
9. Data skew
10. Data locality
1/10 - RDD vs DataFrames vs DataSets
● RDD - Resilient Distributed Dataset
➔ Main abstraction of Spark.
➔ Low-level transformations and actions, with control at the partition level.
➔ Suited to unstructured data such as media streams or text streams.
➔ Manipulate data with functional programming constructs.
➔ No automatic optimization.
● DataFrame
➔ High-level abstraction with rich semantics.
➔ Like a big distributed SQL table.
➔ High-level expressions (aggregation, average, sum, SQL queries).
➔ Performance and optimizations (predicate pushdown, QBO, CBO, ...).
➔ No compile-time type checking: type errors only show up at runtime.
● DataSet
➔ A collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.
➔ DataFrame = Dataset[Row].
➔ Performance and optimizations.
➔ Type safety at compile time.
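The difference between runtime and compile-time checking can be illustrated without a cluster. The sketch below is a plain-Scala analogy, not Spark code: a `Map[String, Any]` stands in for an untyped `Row`, and the `Car` case class is a hypothetical record type standing in for the element type of a typed DataSet.

```scala
// Hypothetical record type, standing in for the case class behind a typed DataSet.
final case class Car(make: String, model: String, engineSize: Double)

object TypedVsUntyped {
  // Untyped access, analogous to reading a field from a Row by name:
  // a wrong field name or a wrong value type is only detected when the code runs.
  def engineSizeUntyped(row: Map[String, Any]): Either[String, Double] =
    row.get("engine_size") match {
      case Some(d: Double) => Right(d)
      case Some(other)     => Left(s"expected Double, found ${other.getClass.getSimpleName}")
      case None            => Left("no such field: engine_size")
    }

  // Typed access: the compiler checks both the field name and its type.
  // Misspelling the field or treating it as a String would not compile.
  def engineSizeTyped(car: Car): Double = car.engineSize

  def main(args: Array[String]): Unit = {
    val goodRow = Map[String, Any]("make" -> "Renault", "engine_size" -> 1.6)
    val badRow  = Map[String, Any]("make" -> "Renault", "engine_size" -> "1.6")

    println(engineSizeUntyped(goodRow)) // a Right: the value had the expected type
    println(engineSizeUntyped(badRow))  // a Left: the mistake surfaces only at runtime
    println(engineSizeTyped(Car("Renault", "Clio", 1.6)))
  }
}
```

With a DataFrame the failure mode is the `Left` branch (an error once the job runs); with a DataSet the equivalent mistake is rejected by the compiler before anything is submitted.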
2/10 - Data Serialisation Format
➔ Data is shuffled between executors in serialized form.
➔ RDDs cached or persisted to disk are serialized too.
➔ Spark's default serialization format: Java serialization (slow and large).
➔ Better choice: Kryo serialization.
➔ Kryo: faster and more compact (up to 10x).
➔ DataFrames/DataSets use Tungsten serialization (even better than Kryo).
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf: SparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // register your own custom classes with Kryo before the session is created,
  // otherwise the session does not pick up the registration
  .registerKryoClasses(Array(classOf[MyCustomClass]))
val sparkSession: SparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()
3/10 - Storage Formats
➔ Avoid text formats such as plain text, JSON and CSV where possible.
➔ Use compressed binary formats instead.
➔ Popular choices: Apache Parquet, Apache Avro and ORC.
➔ The use case dictates the choice.
3/10 - Storage Formats: Apache Parquet & Avro
➔ Both are binary formats.
➔ Both are splittable.
➔ Parquet is columnar; Avro is row-based.
➔ Parquet: higher compression rates than row-based formats.
➔ Parquet suits read-heavy workloads; Avro suits write-heavy workloads.
➔ The schema is stored in the files themselves.
➔ Avro: better support for schema evolution.
// Parquet
val sparkConf: SparkConf = new SparkConf()
  .set("spark.sql.parquet.compression.codec", "snappy")
val dataframe = sparkSession.read.parquet("s3a://....")
dataframe.write.parquet("s3a://....")

// Avro (separate snippet)
// .avro(...) needs `import com.databricks.spark.avro._` (Spark < 2.4);
// with the built-in Avro module of Spark 2.4+, use .format("avro") with load/save instead
val sparkConf: SparkConf = new SparkConf()
  .set("spark.sql.avro.compression.codec", "snappy")
val dataframe = sparkSession.read.avro("s3a://....")
dataframe.write.avro("s3a://....")
3/10 - Benchmark: using Avro instead of JSON
[Figure: benchmark chart]
4/10 - Broadcast Join
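import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.broadcast

// Spark automatically broadcasts small DataFrames (up to 10 MB by default)
val sparkConf: SparkConf = new SparkConf()
  .set("spark.sql.autoBroadcastJoinThreshold", "2147483648") // in bytes (2 GB)
  .set("spark.sql.broadcastTimeout", "900") // in seconds, default 300
/*
A broadcast that takes longer than the timeout fails with:
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
*/
// Force a broadcast with the broadcast() hint.
// "id" is a placeholder join column, not part of the original slide.
val result = bigDataFrame.join(broadcast(smallDataFrame), Seq("id"))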
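Conceptually, a broadcast join ships the whole small table to every executor, so each partition of the big table is joined locally with a hash lookup and the big table is never shuffled. A plain-Scala sketch of that idea on in-memory collections follows; the names and sample data are illustrative, and this is not how Spark implements it internally.

```scala
object BroadcastJoinSketch {
  // Inner-joins every partition of the big side against a read-only lookup map
  // built from the small side (the "broadcast" copy).
  def broadcastHashJoin[K, A, B](
      bigPartitions: Seq[Seq[(K, A)]],
      small: Seq[(K, B)]
  ): Seq[Seq[(K, A, B)]] = {
    // Built once; in Spark this copy would be sent to each executor.
    val lookup: Map[K, Seq[B]] =
      small.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }

    // Each partition is processed on its own: no rows move between partitions.
    bigPartitions.map { partition =>
      partition.flatMap { case (k, a) =>
        lookup.getOrElse(k, Seq.empty[B]).map(b => (k, a, b))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val sales = Seq(
      Seq(("FR", 10), ("DE", 20)),
      Seq(("FR", 30), ("US", 40))
    )
    val countries = Seq(("FR", "France"), ("DE", "Germany"))
    // The "US" row has no match in the small table and is dropped (inner join).
    broadcastHashJoin(sales, countries).foreach(println)
  }
}
```

The trade-off is memory: every executor holds a full copy of the small side, which is why the technique only pays off when that side is genuinely small.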
5/10 - Hardware Tuning
Know your cluster:
● Number of nodes
● Cores per node
● RAM per node
● Cluster manager (YARN, Mesos, …)
Let’s assume:
● 5 nodes
● 16 cores each
● 64 GB RAM each
● YARN as resource manager
● Spark in client mode
5/10 - Hardware Tuning / Scenario #1 (Small executors)
--num-executors 80      # 16 cores x 5 nodes
--executor-cores 1
--executor-memory 4GB   # 64 GB / 16 executors per node
➔ Tasks never share a JVM, so broadcast variables, accumulators, etc. are not shared either.
➔ Risk of running out of memory when computing a partition.
5/10 - Hardware Tuning / Scenario #2 (Large executors)
--num-executors 5
--executor-cores 16
--executor-memory 64GB
➔ Very long garbage collection pauses.
➔ Poor performance with HDFS (it handles many concurrent threads per executor badly).
5/10 - Hardware Tuning / Scenario #3 (Right Balance)
--executor-cores 5
--num-executors 14      # 15 usable cores per node / 5 cores per executor = 3 executors per node; 3 x 5 nodes = 15, minus 1
--executor-memory 18GB  # 64 GB / 3 executors per node = 21 GB, minus 10% overhead
➔ The recommended number of concurrent threads for HDFS is 5.
➔ Always leave one core per node for the YARN daemons.
➔ Always leave one executor for the YARN ApplicationMaster.
➔ Off-heap memory for YARN = 10% of executor memory.
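The arithmetic behind the balanced scenario can be written down as a small calculation. This is a sketch under the deck's stated rules (5 cores per executor, one core per node reserved for YARN, one executor slot reserved for the ApplicationMaster, 10% overhead); the names `Cluster`, `Sizing` and `size` are illustrative, and taking the 10% from the per-executor share and rounding it up is an assumption made here to reproduce the 18 GB figure.

```scala
object ExecutorSizing {
  final case class Cluster(nodes: Int, coresPerNode: Int, ramPerNodeGb: Int)
  final case class Sizing(executorCores: Int, numExecutors: Int, executorMemoryGb: Int)

  def size(cluster: Cluster, coresPerExecutor: Int = 5, overheadFraction: Double = 0.10): Sizing = {
    // One core per node is left to the YARN daemons: 16 - 1 = 15.
    val usableCores = cluster.coresPerNode - 1
    // 15 / 5 = 3 executors per node.
    val executorsPerNode = usableCores / coresPerExecutor
    // 3 x 5 nodes = 15, minus one slot for the ApplicationMaster = 14.
    val numExecutors = executorsPerNode * cluster.nodes - 1
    // 64 GB / 3 = 21 GB per executor (integer division).
    val shareGb = cluster.ramPerNodeGb / executorsPerNode
    // 10% of 21 GB, rounded up to 3 GB of off-heap overhead (assumption, see lead-in).
    val overheadGb = math.ceil(shareGb * overheadFraction).toInt
    Sizing(coresPerExecutor, numExecutors, shareGb - overheadGb)
  }

  def main(args: Array[String]): Unit =
    // The example cluster from the slides: 5 nodes, 16 cores and 64 GB each.
    println(size(Cluster(nodes = 5, coresPerNode = 16, ramPerNodeGb = 64)))
}
```

For the example cluster this yields 5 cores, 14 executors and 18 GB per executor, matching the flags above.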
5/10 - Benchmark
➔ Hardware tuning.
➔ Moving from the Java serializer to Kryo.
6/10 - Level of parallelism/partitions
➔ The maximum size of a partition is limited by the memory available to an executor.
➔ Increasing the partition count leaves each partition with less data.
➔ Spark cannot split files compressed with a non-splittable codec (e.g. gzip) and creates only 1 partition for them, so repartition yourself:
rdd = sc.textFile('demo.gz')
rdd = rdd.repartition(100)
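The effect of the partition count on partition size is simple division. The helper below is a rough sketch in plain Scala; the function names and the 128 MB target are illustrative assumptions, not values from the talk, and real partitions are rarely perfectly even.

```scala
object PartitionSizing {
  // Average partition size when totalBytes is spread evenly over `partitions`.
  def avgPartitionBytes(totalBytes: Long, partitions: Int): Long =
    math.ceil(totalBytes.toDouble / partitions).toLong

  // Smallest partition count that keeps the average partition at or under targetBytes.
  def partitionsFor(totalBytes: Long, targetBytes: Long): Int =
    math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)

  def main(args: Array[String]): Unit = {
    val mb    = 1024L * 1024
    val tenGb = 10L * 1024 * mb

    // A single partition (e.g. one non-splittable file) must fit in one executor.
    println(avgPartitionBytes(tenGb, 1) / mb)
    // After repartition(100), each partition holds roughly a hundredth of the data.
    println(avgPartitionBytes(tenGb, 100) / mb)
    // Partition count needed for an (assumed) 128 MB target.
    println(partitionsFor(tenGb, 128 * mb))
  }
}
```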
7/10 - GC Tuning
➔ Quick wins to avoid long GC pauses when using a large JVM heap.
spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB
// if the driver creates too many objects (e.g. collect()),
// which is not a very good idea anyway
spark.driver.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB
8/10 - Knock knock… Who’s there?… An error :(
Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used.
Possible causes:
➔ Not enough executor memory.
➔ Too many executor cores (implies too much parallelism).
➔ Not enough Spark partitions.
➔ Data skew (let’s talk about that later…).
Possible solutions:
➔ Increase executor memory.
➔ Reduce the number of executor cores.
➔ Increase the number of Spark partitions.
➔ Persist in memory and disk (or just disk) with serialization.
➔ Use off-heap memory for caching.
19/01/31 21:03:13 INFO DAGScheduler: Host lost: ip-172-29-149-243.eu-west-1.compute.internal (epoch 16)
19/01/31 21:03:13 INFO BlockManagerMasterEndpoint: Trying to remove executors on host ip-172-29-149-243.eu-west-1.compute.internal from BlockManagerMaster.
19/01/31 21:03:13 INFO BlockManagerMaster: Removed executors on host ip-172-29-149-243.eu-west-1.compute.internal successfully.
{
  "Name": "ScaleInContainerPendingRatio",
  "Description": "Scale in on ContainerPendingRatio",
  "Action": {
    "SimpleScalingPolicyConfiguration": {
      "AdjustmentType": "CHANGE_IN_CAPACITY",
      "ScalingAdjustment": -1,
      "CoolDown": 300
    }
  },
  "Trigger": {
    "CloudWatchAlarmDefinition": {
      "ComparisonOperator": "LESS_THAN_OR_EQUAL",
      "EvaluationPeriods": 3,
      "MetricName": "ContainerPendingRatio",
      "Namespace": "AWS/ElasticMapReduce",
      "Dimensions": [
        {
          "Value": "$${emr.clusterId}",
          "Key": "JobFlowId"
        }
      ],
      "Period": 300,
      "Statistic": "AVERAGE",
      "Threshold": 0,
      "Unit": "COUNT"
    }
  }
}
ContainerPendingRatio = containers pending / containers allocated
Is this really my cluster…?
{
  "Name": "ScaleInMemoryPercentage",
  "Description": "Scale in on YARNMemoryAvailablePercentage",
  "Action": {
    "SimpleScalingPolicyConfiguration": {
      "AdjustmentType": "CHANGE_IN_CAPACITY",
      "ScalingAdjustment": -2,
      "CoolDown": 300
    }
  },
  "Trigger": {
    "CloudWatchAlarmDefinition": {
      "ComparisonOperator": "GREATER_THAN",
      "EvaluationPeriods": 3,
      "MetricName": "YARNMemoryAvailablePercentage",
      "Namespace": "AWS/ElasticMapReduce",
      "Dimensions": [
        {
          "Key": "JobFlowId",
          "Value": "$${emr.clusterId}"
        }
      ],
      "Period": 300,
      "Statistic": "AVERAGE",
      "Threshold": 95.0,
      "Unit": "PERCENT"
    }
  }
}
9/10 - Data Skew
➔ A condition where data is not uniformly distributed across partitions.
➔ Shows up during joins, aggregations, etc.
➔ E.g. joining on a column that contains lots of nulls.
➔ Might cause java.lang.OutOfMemoryError: Java heap space.
df1.join(df2, Seq("make", "model"))
9/10 - Data Skew: possible solutions
➔ Repartition your data on a key (RDD) or a column (DataFrame) that distributes the data evenly.
➔ Use non-skewed column(s) for the join.
➔ Replace null values in the join column with NULL_X (where X is a random number).
➔ Salting.
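The null-replacement trick can be sketched in plain Scala, with `Option[String]` standing in for a nullable join column. The names `spreadNulls` and `buckets` are illustrative, not taken from the talk. Since a null key never matches anything in an equi-join, giving each null row a distinct placeholder does not change which rows match; it only stops all of them from hashing to the same partition.

```scala
import scala.util.Random

object NullKeySalting {
  // Replaces a missing join key with NULL_<x>, x drawn from 0 until buckets,
  // so rows without a key are spread over up to `buckets` different key values.
  def spreadNulls(key: Option[String], buckets: Int, rnd: Random): String =
    key.getOrElse(s"NULL_${rnd.nextInt(buckets)}")

  def main(args: Array[String]): Unit = {
    val rnd = new Random(42) // fixed seed, only to make the demo repeatable
    // A skewed column: 1000 rows without a key, two rows with one.
    val keys: Seq[Option[String]] =
      Seq.fill(1000)(Option.empty[String]) ++ Seq(Some("renault"), Some("peugeot"))

    // Rows per key after the replacement: the nulls now form 10 smaller groups.
    val rowsPerKey = keys
      .map(k => spreadNulls(k, 10, rnd))
      .groupBy(identity)
      .map { case (k, rows) => k -> rows.size }

    rowsPerKey.toSeq.sortBy(_._1).foreach(println)
  }
}
```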
df1.join(df2, Seq("make", "model", "engine_size"))
Let’s sprinkle some SALT on data skew…!
9/10 - Impossible to find a repartitioning key for even data distribution?
Salted key = actual partition key + random fake key
(where the fake key takes a value between 1 and N, with N being the level of distribution/partitions)
➔ Joining DataFrames: add a salt column to the bigger DataFrame and broadcast the smaller one (with an additional column containing every value from 1 to N).
➔ If both are too big to broadcast: salt one and iteratively broadcast the other.
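A salted join can be sketched on plain Scala collections to show why the result stays correct. The function names and sample data below are illustrative, and the sketch assumes the small side has unique keys; in Spark the same steps would be expressed with DataFrame columns rather than tuples.

```scala
import scala.util.Random

object SaltedJoinSketch {
  // Big side: append a random salt in 1..n to every key.
  def saltBig[A](big: Seq[(String, A)], n: Int, rnd: Random): Seq[((String, Int), A)] =
    big.map { case (k, a) => ((k, rnd.nextInt(n) + 1), a) }

  // Small side: replicate every row once per salt value 1..n,
  // so each salted key on the big side still finds its match.
  def explodeSmall[B](small: Seq[(String, B)], n: Int): Seq[((String, Int), B)] =
    small.flatMap { case (k, b) => (1 to n).map(salt => ((k, salt), b)) }

  // Inner join on the salted key, then drop the salt again.
  def saltedJoin[A, B](
      big: Seq[(String, A)],
      small: Seq[(String, B)],
      n: Int,
      rnd: Random
  ): Seq[(String, A, B)] = {
    val lookup = explodeSmall(small, n).toMap // assumes unique keys on the small side
    saltBig(big, n, rnd).flatMap { case ((k, salt), a) =>
      lookup.get((k, salt)).map(b => (k, a, b))
    }
  }

  def main(args: Array[String]): Unit = {
    val rnd   = new Random(7) // fixed seed, only to make the demo repeatable
    val big   = Seq.fill(1000)(("renault", 1)) ++ Seq(("peugeot", 1))
    val small = Seq(("renault", "FR"), ("peugeot", "FR"))

    // Unsalted, all 1000 "renault" rows share one key; salted, they spread over 4 keys.
    saltBig(big, 4, rnd)
      .groupBy(_._1)
      .map { case (k, rows) => k -> rows.size }
      .toSeq.sortBy(_._1)
      .foreach(println)

    // Every big row still finds its partner, so the join keeps all 1001 rows.
    println(saltedJoin(big, small, 4, rnd).size)
  }
}
```

The cost of the technique is visible in `explodeSmall`: the small side grows N times, so N should be only as large as needed to break up the hot keys.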
10/10 - Data Locality
➔ Why is it important?
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession
  .builder()
  .appName("spark-app")
  .config("spark.locality.wait", "60s") // default 3s
  .config("spark.locality.wait.node", "0") // set to 0 to skip the node-local level
  .config("spark.locality.wait.process", "10s")
  .config("spark.locality.wait.rack", "30s")
  .getOrCreate()
Thank you...very much !
Slides: http://tiny.cc/tiq56y
@him_aro @nityany

Matthew Sinclair
 

Recently uploaded (20)

PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 

10 things I wish I'd known before using Spark in production

  • 1. “10 things I wish I'd known before using Spark in production!”
  • 2. Himanshu Arora, Lead Data Engineer, NeoLynk France, h.arora@neolynk.fr, @him_aro
      Nitya Nand Yadav, Data Engineer, NeoLynk France, n.yadav@neolynk.fr, @nityany
  • 3. Partner. Follow the latest news from our JVM, PHP, JS and Agile Tribes on our social networks: JVM
  • 4.
  • 5. What we are going to cover...
      1. RDD vs DataFrame vs DataSet
      2. Data Serialisation Formats
      3. Storage formats
      4. Broadcast join
      5. Hardware tuning
      6. Level of parallelism
      7. GC tuning
      8. Common errors
      9. Data skew
      10. Data locality
  • 6. 1/10 - RDD vs DataFrames vs DataSets
  • 7. ● RDD - Resilient Distributed Dataset
      ➔ Main abstraction of Spark.
      ➔ Low-level transformations, actions and control at the partition level.
      ➔ Suited to unstructured data such as media streams or text streams.
      ➔ Manipulate data with functional programming constructs.
      ➔ No automatic optimization.
  • 8. ● DataFrame
      ➔ High-level abstractions, rich semantics.
      ➔ Like a big distributed SQL table.
      ➔ High-level expressions (aggregation, average, sum, SQL queries).
      ➔ Performance and optimizations (predicate pushdown, QBO, CBO...).
      ➔ No compile-time type checking: type errors only surface at runtime.
  • 9. ● DataSet
      ➔ A collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.
      ➔ DataFrame = DataSet[Row].
      ➔ Performance and optimisations.
      ➔ Type safety at compile time.
  • 10. 2/10 - Data Serialisation Format
      ➔ Data is shuffled between executors in serialized form.
      ➔ RDDs cached or persisted to disk are serialized too.
      ➔ Default serialization format of Spark: Java serialization (slow & large).
      ➔ Better: Kryo serialization.
      ➔ Kryo: faster and more compact (up to 10x).
      ➔ DataFrames/DataSets use Tungsten serialization (even better than Kryo).
  • 11. 2/10 - Data Serialisation Format
      import org.apache.spark.SparkConf
      import org.apache.spark.sql.SparkSession
      val sparkConf: SparkConf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // register your own custom classes with Kryo; do this before the session
        // is created, otherwise the registration is not picked up
        .registerKryoClasses(Array(classOf[MyCustomClass]))
      val sparkSession: SparkSession = SparkSession
        .builder()
        .config(sparkConf)
        .getOrCreate()
  • 12. 3/10 - Storage Formats
  • 13. 3/10 - Storage Formats
      ➔ Avoid text formats such as plain text, JSON and CSV where possible.
      ➔ Use compressed binary formats instead.
      ➔ Popular choices: Apache Parquet, Apache Avro, ORC, etc.
      ➔ The use case dictates the choice.
  • 14. 3/10 - Storage Formats: Apache Parquet & Avro
      ➔ Binary formats.
      ➔ Splittable.
      ➔ Parquet: columnar; Avro: row-based.
      ➔ Parquet: higher compression rates than row-based formats.
      ➔ Parquet: read-heavy workloads; Avro: write-heavy workloads.
      ➔ Schema is stored in the files themselves.
      ➔ Avro: better support for schema evolution.
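  • 15. 3/10 - Storage Formats
      // Parquet
      val sparkConf: SparkConf = new SparkConf()
        .set("spark.sql.parquet.compression.codec", "snappy")
      val dataframe = sparkSession.read.parquet("s3a://....")
      dataframe.write.parquet("s3a://....")
      // Avro: read.avro / write.avro are assumed to come from the external spark-avro
      // package (import com.databricks.spark.avro._); with the built-in Avro module of
      // Spark 2.4+, use .format("avro") with load/save instead
      val sparkConf: SparkConf = new SparkConf()
        .set("spark.sql.avro.compression.codec", "snappy")
      val dataframe = sparkSession.read.avro("s3a://....")
      dataframe.write.avro("s3a://....")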
  • 16. 3/10 - Benchmark: using Avro instead of JSON
  • 17. 4/10 - Broadcast Join
  • 18. 4/10 - Broadcast Join
      // Spark automatically broadcasts small DataFrames (max. 10 MB by default)
      val sparkConf: SparkConf = new SparkConf()
        .set("spark.sql.autoBroadcastJoinThreshold", "2147483648") // in bytes (2 GB)
        .set("spark.sql.broadcastTimeout", "900") // in seconds, default 300
      /* If the broadcast takes longer than the timeout, the job fails with:
         Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
         java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] */
      // force broadcast (broadcast comes from org.apache.spark.sql.functions)
      val result = bigDataFrame.join(broadcast(smallDataFrame))
  • 19. 5/10 - Hardware Tuning
      Know Your Cluster
      ● Number of nodes
      ● Cores per node
      ● RAM per node
      ● Cluster manager (Yarn, Mesos…)
      Let’s assume:
      ● 5 nodes
      ● 16 cores each
      ● 64 GB RAM each
      ● Yarn as resource manager
      ● Spark in client mode
  • 20. 5/10 - Hardware Tuning / Scenario #1 (Small executors)
      --num-executors = 80 // 16 cores x 5 nodes
      --executor-cores = 1
      --executor-memory = 4GB // 64 GB / 16 executors per node
      ➔ No benefit from running multiple tasks in the same JVM (broadcast variables, accumulators… are not shared).
      ➔ Risk of running out of memory when computing a partition.
  • 21. 5/10 - Hardware Tuning / Scenario #2 (Large executors)
      --num-executors = 5
      --executor-cores = 16
      --executor-memory = 64GB
      ➔ Very long garbage collection pauses.
      ➔ Poor performance with HDFS (handling many concurrent threads).
  • 22. 5/10 - Hardware Tuning / Scenario #3 (Right Balance)
      --executor-cores = 5
      --num-executors = 14 // 15 usable cores per node / 5 cores per executor = 3 executors per node; 3 x 5 nodes - 1 = 14
      --executor-memory = 18GB // 64 GB / 3 executors per node ≈ 21 GB, minus 10% overhead
      ➔ The recommended number of concurrent threads for HDFS is 5.
      ➔ Always leave one core per node for the Yarn daemons.
      ➔ Always leave one executor for the Yarn ApplicationMaster.
      ➔ Off-heap memory overhead for Yarn = 10% of executor memory.
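The arithmetic of Scenario #3 can be sketched in plain Scala. This is only an illustration of the slide's rules of thumb: `ExecutorPlan`, `ExecutorSizing` and the parameter names are hypothetical helpers, not part of Spark, and the integer rounding is an assumption chosen so the result matches the numbers on the slide.

```scala
// Hypothetical helper (not a Spark API): reproduces the slide's sizing rules.
final case class ExecutorPlan(executorCores: Int, numExecutors: Int, executorMemoryGb: Int)

object ExecutorSizing {
  def plan(nodes: Int,
           coresPerNode: Int,
           ramPerNodeGb: Int,
           coresPerExecutor: Int = 5,   // slide: about 5 concurrent threads for HDFS
           overheadPercent: Int = 10    // slide: 10% off-heap overhead for Yarn
          ): ExecutorPlan = {
    val usableCores = coresPerNode - 1                     // one core per node for Yarn daemons
    val executorsPerNode = usableCores / coresPerExecutor
    val numExecutors = nodes * executorsPerNode - 1        // one slot for the ApplicationMaster
    val memoryPerSlotGb = ramPerNodeGb / executorsPerNode  // integer division, rounds down
    val executorMemoryGb = memoryPerSlotGb * (100 - overheadPercent) / 100
    ExecutorPlan(coresPerExecutor, numExecutors, executorMemoryGb)
  }

  def main(args: Array[String]): Unit = {
    // The cluster assumed on the slides: 5 nodes, 16 cores and 64 GB RAM each.
    val p = plan(nodes = 5, coresPerNode = 16, ramPerNodeGb = 64)
    println(s"--executor-cores ${p.executorCores} --num-executors ${p.numExecutors} --executor-memory ${p.executorMemoryGb}G")
  }
}
```

For the assumed cluster this yields 5 cores, 14 executors and 18 GB per executor, the same values as the slide. Treat the output as a starting point to benchmark, not a guarantee.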
  • 23. 5/10 - Benchmark
      ➔ Hardware tuning.
      ➔ Moving from the Java serializer to Kryo.
  • 24. 6/10 - Level of parallelism/partitions
      rdd = sc.textFile('demo.gz')
      rdd = rdd.repartition(100)
      ➔ The maximum size of a partition is limited by the memory available to an executor.
      ➔ Increasing the partition count gives each partition less data.
      ➔ Spark cannot split files compressed with a non-splittable codec (e.g. gzip) and creates only 1 partition for them, so repartition yourself.
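One rough way to pick the argument to `repartition` is to divide the input size by a target partition size. The sketch below is plain Scala, not a Spark API; `PartitionCount` is a hypothetical helper and the 128 MiB target used in `main` is an assumption for illustration, not a value from the slides.

```scala
// Hypothetical helper (not a Spark API): how many partitions are needed so that
// each one holds at most targetBytes of data.
object PartitionCount {
  def forSize(totalBytes: Long, targetBytes: Long): Int = {
    require(totalBytes >= 0 && targetBytes > 0, "sizes must be non-negative / positive")
    // Ceiling division, with a floor of one partition.
    math.max(1L, (totalBytes + targetBytes - 1) / targetBytes).toInt
  }

  def main(args: Array[String]): Unit = {
    val tenGiB = 10L * 1024 * 1024 * 1024
    val target = 128L * 1024 * 1024 // assumed target partition size
    println(forSize(tenGiB, target)) // a value to pass to rdd.repartition(...)
  }
}
```

The right target depends on executor memory and on how much the data expands once decompressed and deserialized, so the result is best treated as a first guess.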
  • 25. 7/10 - GC Tuning
      ➔ Quick wins to avoid long GC pauses when using a large JVM heap.
      spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB
      // if creating too many objects in the driver (e.g. with collect()),
      // which is not a very good idea though
      spark.driver.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB
  • 26. 8/10 - Knock knock… Who’s there?… An error :(
      Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used.
  • 27. 8/10 - Knock knock… Who’s there?… An error :(
      Possible causes:
      ➔ Not enough executor memory.
      ➔ Too many executor cores (implies too much parallelism).
      ➔ Not enough Spark partitions.
      ➔ Data skew (let’s talk about that later…).
      Possible fixes:
      ➔ Increase executor memory.
      ➔ Reduce the number of executor cores.
      ➔ Increase the number of Spark partitions.
      ➔ Persist in memory and disk (or just disk) with serialization.
      ➔ Use off-heap memory for caching.
  • 28. 8/10 - Knock knock… Who’s there?… An error :(
      19/01/31 21:03:13 INFO DAGScheduler: Host lost: ip-172-29-149-243.eu-west-1.compute.internal (epoch 16)
      19/01/31 21:03:13 INFO BlockManagerMasterEndpoint: Trying to remove executors on host ip-172-29-149-243.eu-west-1.compute.internal from BlockManagerMaster.
      19/01/31 21:03:13 INFO BlockManagerMaster: Removed executors on host ip-172-29-149-243.eu-west-1.compute.internal successfully.
  • 29. {
        "Name": "ScaleInContainerPendingRatio",
        "Description": "Scale in on ContainerPendingRatio",
        "Action": {
          "SimpleScalingPolicyConfiguration": {
            "AdjustmentType": "CHANGE_IN_CAPACITY",
            "ScalingAdjustment": -1,
            "CoolDown": 300
          }
        },
        "Trigger": {
          "CloudWatchAlarmDefinition": {
            "ComparisonOperator": "LESS_THAN_OR_EQUAL",
            "EvaluationPeriods": 3,
            "MetricName": "ContainerPendingRatio",
            "Namespace": "AWS/ElasticMapReduce",
            "Dimensions": [
              { "Value": "$${emr.clusterId}", "Key": "JobFlowId" }
            ],
            "Period": 300,
            "Statistic": "AVERAGE",
            "Threshold": 0,
            "Unit": "COUNT"
          }
        }
      }
      Containers Pending / Containers allocated
  • 30. Is this really my cluster…?
  • 31. {
        "Name": "ScaleInMemoryPercentage",
        "Description": "Scale in on YARNMemoryAvailablePercentage",
        "Action": {
          "SimpleScalingPolicyConfiguration": {
            "AdjustmentType": "CHANGE_IN_CAPACITY",
            "ScalingAdjustment": -2,
            "CoolDown": 300
          }
        },
        "Trigger": {
          "CloudWatchAlarmDefinition": {
            "ComparisonOperator": "GREATER_THAN",
            "EvaluationPeriods": 3,
            "MetricName": "YARNMemoryAvailablePercentage",
            "Namespace": "AWS/ElasticMapReduce",
            "Dimensions": [
              { "Key": "JobFlowId", "Value": "$${emr.clusterId}" }
            ],
            "Period": 300,
            "Statistic": "AVERAGE",
            "Threshold": 95.0,
            "Unit": "PERCENT"
          }
        }
      }
  • 32. 9/10 - Data Skew
      ➔ A condition where data is not uniformly distributed across partitions.
      ➔ Shows up during joins, aggregations, etc.
      ➔ E.g. joining on a column containing lots of nulls.
      ➔ Might cause java.lang.OutOfMemoryError: Java heap space.
  • 34. 9/10 - Data Skew: possible solutions
      ➔ Repartition your data on a key (RDD) or column (DataFrame) that distributes the data evenly.
      ➔ Use non-skewed column(s) for the join.
      ➔ Replace null values of the join column with NULL_X (X is a random number).
      ➔ Salting.
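The NULL_X idea can be illustrated without Spark. The sketch below is a minimal plain-Scala version, assuming the join key is modelled as an `Option[String]`; `NullKeySpreading` and its parameters are hypothetical names, and the bucket count is an arbitrary choice.

```scala
import scala.util.Random

// Hypothetical helper (not a Spark API): gives every missing join key a random
// "NULL_<x>" placeholder so rows without a key no longer share one partition key.
object NullKeySpreading {
  def spread(key: Option[String], buckets: Int, rng: Random): String =
    key.getOrElse(s"NULL_${rng.nextInt(buckets)}") // x is drawn from 0 until buckets

  def main(args: Array[String]): Unit = {
    val rng = new Random()
    val keys = Seq(Some("user-1"), None, None, Some("user-2"), None)
    // Real keys are left untouched; missing keys are spread over 4 placeholders.
    keys.map(k => spread(k, buckets = 4, rng)).foreach(println)
  }
}
```

Null keys never match in an equi-join, so the placeholders do not change which rows join. They only matter where rows with a missing key must be kept, as in an outer join, and the `NULL_` prefix is assumed not to collide with any real key.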
  • 36. Let’s sprinkle some SALT on data skew…!
  • 37. 9/10 - Impossible to find a repartitioning key that distributes the data evenly?
      Salting key = actual partition key + random fake key
      (where the fake key takes a value between 1 and N, N being the level of distribution/number of partitions)
  • 38. ➔ Joining DFs: create a salt column on the bigger DF and broadcast the smaller one (with an additional column containing 1 to N).
      ➔ If both are too big to broadcast: salt one and iteratively broadcast the other.
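The salted join described above can be simulated on plain Scala collections to see why the result is unchanged. This is a sketch under assumptions, not the authors' implementation: `SaltedJoin` and its methods are hypothetical, sequences of tuples stand in for DataFrames, and a map lookup stands in for the broadcast side.

```scala
import scala.util.Random

// Hypothetical sketch (not a Spark API) of a salted inner join.
object SaltedJoin {
  // Bigger side: tag every row with one random salt in 1..n.
  def saltBig[V](big: Seq[(String, V)], n: Int, rng: Random): Seq[((String, Int), V)] =
    big.map { case (k, v) => ((k, rng.nextInt(n) + 1), v) }

  // Smaller side: replicate every row once per salt value 1..n.
  def explodeSmall[W](small: Seq[(String, W)], n: Int): Seq[((String, Int), W)] =
    for ((k, w) <- small; salt <- 1 to n) yield ((k, salt), w)

  // Inner join on the salted key (key, salt).
  def join[V, W](big: Seq[(String, V)], small: Seq[(String, W)], n: Int, rng: Random): Seq[(String, V, W)] = {
    val lookup: Map[(String, Int), Seq[W]] =
      explodeSmall(small, n).groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    for {
      ((k, salt), v) <- saltBig(big, n, rng)
      w <- lookup.getOrElse((k, salt), Seq.empty)
    } yield (k, v, w)
  }

  def main(args: Array[String]): Unit = {
    val big = Seq.fill(1000)(("hot", 1)) ++ Seq(("cold", 2)) // one heavily skewed key
    val small = Seq(("hot", "H"), ("cold", "C"))
    val joined = join(big, small, n = 8, new Random())
    println(joined.size) // same row count as the unsalted join
  }
}
```

Each big-side row still meets exactly one copy of its matching small-side row, so the join output is the same, while the hot key is now spread over up to N salted keys. The cost is that the smaller side grows N times, which is why the slide pairs salting with broadcasting.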
  • 39. 10/10 - Data Locality
      ➔ Why is it important?
  • 40. 10/10 - Data Locality
      val sparkSession = SparkSession
        .builder()
        .appName("spark-app")
        .config("spark.locality.wait", "60s") // default 3s
        .config("spark.locality.wait.node", "0") // set to 0 to skip node-local
        .config("spark.locality.wait.process", "10s")
        .config("spark.locality.wait.rack", "30s")
        .getOrCreate()
  • 42. Thank you... very much! Slides: http://tiny.cc/tiq56y @him_aro @nityany