Apache Spark
Best Practices
Eren Avşaroğulları
Data Science & Engineering Club Meetup Talk
Trinity College, Dublin
27 April 2019
Agenda
• Architecture
• Data Structures
• Persistency + GC Policy
• Partitioning + Data Skew
• Data Locality
• Job Scheduling
• Ser/De
• Event Sourcing on Transformations
• Checkpointing
Bio
• B.Sc & M.Sc in Electronics & Control Engineering
• Data Engineer @
• Working on Data Analytics (Data Transformations & Cleaning)
• Open Source Contributor @
{ Apache Spark | Pulsar | Heron }
erenavsarogullari
erenavsar@apache.org
Job Pipeline
Two types of Spark Operations on DataSet:
• Transformations: lazily evaluated (not computed immediately)
• Actions: trigger the computation and return a value
[Diagram: Data Source -> Transformation Operators over the DataSet (RDD -> RDD -> RDD) -> Sink (Value). Transformations build the chain; Actions trigger the computation.]
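The split between lazy transformations and eager actions can be sketched as follows. This is a minimal illustration, assuming Spark SQL is on the classpath and a local master is acceptable; the object name, the sample numbers and the `local[*]` setting are hypothetical and not taken from the talk.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object LazyEvaluationSketch {
  // Builds a small pipeline and triggers it once; returns the value produced by the action.
  def sumOfScaledEvens(spark: SparkSession): Int = {
    import spark.implicits._

    // Source: a small in-memory Dataset (hypothetical sample data).
    val numbers: Dataset[Int] = Seq(1, 2, 3, 4, 5, 6).toDS()

    // Transformations: only recorded in the plan, nothing is computed yet.
    val scaledEvens: Dataset[Int] = numbers
      .filter((n: Int) => n % 2 == 0)
      .map((n: Int) => n * 10)

    // Action: triggers the computation and returns a value to the driver.
    scaledEvens.reduce((a: Int, b: Int) => a + b)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for illustration only
      .appName("lazy-evaluation-sketch")
      .getOrCreate()
    try println(sumOfScaledEvens(spark))
    finally spark.stop()
  }
}
```

No Spark job is started until `reduce` is called; the `filter` and `map` calls only extend the plan.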
Architecture
Spark on YARN Topology (Client Mode)
[Diagram components:]
• Client Node: Driver (SparkContext)
• Master Node: YARN Resource Manager, HDFS Name Node
• Worker Nodes (three shown), each with: YARN Node Manager, Executor (Task, Partition, Cache), HDFS Data Node (Data Block)
• Zookeeper
1- Data Structures
• RDD
• DataFrame
• DataSet (Type Safe)
DataFrame and DataSet are tabular data structures for structured & semi-structured data processing.
They benefit from Project Tungsten and the Catalyst Optimizer.
Takeaway 1
DataFrame or DataSet APIs can be preferred over RDD for better performance.
2- Persistency
// Import Packages
import org.apache.spark.sql.Dataset
import spark.implicits._
// Define DataSet Model
case class Employee(id: String, name: String, department: String)
// Create DataSet by loading file
val employeeDS: Dataset[Employee] = spark.read.option("header", true).csv("$FilePath").as[Employee]
// Transform DataSet and Cache
val filteredEmployeeDS = employeeDS.filter(...).map(...).persist()
// Get count of DataSet
filteredEmployeeDS.count()
// Get content of DataSet
filteredEmployeeDS.take(10)
Takeaway 2
Consider caching a DataSet when multiple jobs use it.
Takeaway 3
Unpersist a DataSet once it is no longer used.
Spark evicts cached data using an LRU (Least Recently Used) policy.
3- Storage Level
Storage Modes:
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2 / MEMORY_AND_DISK_2
• OFF_HEAP
Takeaway 4
MEMORY_ONLY is the best option for production.
Takeaway 5
MEMORY_AND_DISK can be considered for resource-limited environments.
Takeaway 6
MEMORY_ONLY_SER uses less memory but is more CPU-intensive because of Ser/De.
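Persisting with an explicit storage level and releasing the cache afterwards might look like the sketch below. It assumes Spark SQL on the classpath; the object name, the sample data and the choice of `StorageLevel.MEMORY_ONLY` are illustrative assumptions rather than the author's code.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Runs two jobs against the same persisted Dataset and returns both counts.
  def countTwice(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Hypothetical sample data standing in for a transformed Dataset.
    val doubled: Dataset[Int] = Seq(1, 2, 3, 4, 5).toDS().map((n: Int) => n * 2)

    // Mark the Dataset for caching with an explicit storage level.
    doubled.persist(StorageLevel.MEMORY_ONLY)
    try {
      val total = doubled.count()                             // first job materializes the cache
      val large = doubled.filter((n: Int) => n > 4).count()   // second job can reuse the cached data
      (total, large)
    } finally {
      doubled.unpersist() // release the cached blocks once the Dataset is no longer needed
    }
  }
}
```

Swapping the level for `StorageLevel.MEMORY_AND_DISK` or `StorageLevel.MEMORY_ONLY_SER` changes only the `persist` argument.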
4- Driver/Executor GC Policy
JDK 8 Memory Model
[Diagram: JVM Heap (-Xms, -Xmx) split into Young Generation (-Xmn: Eden, Survivor 1, Survivor 2; collected by Minor GC) and Old Generation (Tenured; collected by Major GC), plus Metaspace.]
// Enable GC Logging
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
// Enable G1GC through SparkConf
spark.executor.extraJavaOptions=-XX:+UseG1GC // Important when caching the data on executors
spark.driver.extraJavaOptions=-XX:+UseG1GC // Important when collecting the data to driver
Takeaway 7
Enable GC logging first to analyze JVM pause times.
Takeaway 8
G1GC can be preferred when shorter pause times matter more than throughput.
ParallelGC can be preferred when throughput matters more and longer pause times are acceptable.
5- Partitioning
Three factors affect partition size and number:
• External DataSource (HDFS, Cassandra Table)
• Number of CPU cores
• Properties (Parallelism Level)
Supported Partitioners:
• Hash Partitioner (default)
• Range Partitioner
• Custom Partitioner
Partitioning Operations
• coalesce
• repartition
Takeaway 9
HDFS default partition size: 128MB (HDFS data block size)
Cassandra input split size: 64MB
Partition Index = key.hashCode() % numberOfPartitions
Takeaway 10
Use a suitable Partitioner to create well-balanced partitions.
Not every Partitioner type fits every dataset.
Takeaway 11
Too many small files: less efficient because of excessive parallelism.
Too few large files: less efficient because not all available cores in the cluster are utilized.
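The partition index formula can be sketched in plain Scala. This is an illustrative helper, not Spark's own class; the object and function names are hypothetical. Because `hashCode()` may be negative, the sketch normalizes the remainder to a non-negative index and, as an assumption, maps a null key to partition 0.

```scala
object HashPartitionSketch {
  // Maps a key to a partition index in [0, numberOfPartitions).
  def partitionIndex(key: Any, numberOfPartitions: Int): Int = {
    require(numberOfPartitions > 0, "numberOfPartitions must be positive")
    if (key == null) 0 // assumption: null keys go to the first partition
    else {
      val raw = key.hashCode() % numberOfPartitions
      // In Scala and Java, % keeps the sign of the dividend, so shift negatives up.
      if (raw < 0) raw + numberOfPartitions else raw
    }
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("Spain", "Canada", "Germany")
    keys.foreach(k => println(s"$k -> partition ${partitionIndex(k, 4)}"))
  }
}
```

All rows that share a key land in the same partition, which is why a few very frequent keys lead to unbalanced partitions.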
6- Level of Parallelism
[Diagram: Worker Nodes I..M host Executors I..N; each RDD Partition is processed by one Task on an Executor.]
// Default Parallelism
spark.default.parallelism: Default number of partitions in RDDs returned by transformations like
join, reduceByKey and parallelize when not set by the user.
Local mode and YARN: the total number of CPU cores.
spark.sql.shuffle.partitions: Configures the number of partitions to use when shuffling data for
joins or aggregations. Default is 200.
Takeaway 12
Each partition is processed by exactly one task (1 partition : 1 task).
Max number of partitions = total number of cores * (2 or 3)
Takeaway 13
spark.default.parallelism applies to RDDs.
spark.sql.shuffle.partitions applies to DataFrames / DataSets.
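The sizing rule of thumb above is simple arithmetic, sketched here in plain Scala. The object and function names are hypothetical, and the wave calculation assumes each core runs one task at a time.

```scala
object ParallelismSketch {
  // Rule of thumb from the takeaway: 2 to 3 partitions (tasks) per available core.
  def suggestedPartitionRange(totalCores: Int): (Int, Int) = {
    require(totalCores > 0, "totalCores must be positive")
    (totalCores * 2, totalCores * 3)
  }

  // Number of scheduling waves needed to run all tasks,
  // assuming one task per partition and one running task per core.
  def taskWaves(partitions: Int, totalCores: Int): Int = {
    require(partitions >= 0 && totalCores > 0)
    (partitions + totalCores - 1) / totalCores // integer ceiling division
  }

  def main(args: Array[String]): Unit = {
    val cores = 16 // hypothetical cluster size
    println(suggestedPartitionRange(cores))
    println(taskWaves(200, cores))
  }
}
```

For example, with 16 cores the default of 200 shuffle partitions needs 13 waves, while the rule of thumb suggests 32 to 48 partitions.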
7- Data Skew I
Id  Composer           Country  Instrument  Listeners
1   Paco de Lucia      Spain    Guitar      337K
2   Vicente Amigo      Spain    Guitar      106K
3   Paco Pena          Spain    Guitar      65K
4   Enrique Granados   Spain    Guitar      55K
5   Fernando Sor       Spain    Guitar      63K
6   Johannes Linstead  Canada   Guitar      65K
7   Govi               Germany  Guitar      115K
8   Ottmar Liebert     Germany  Guitar      1K

Composer           Year  Country  Instrument  Listeners
Paco de Lucia      1947  Spain    Guitar      337K
Vicente Amigo      1967  Spain    Guitar      106K
Paco Pena          1942  Spain    Guitar      65K
Enrique Granados   1867  Spain    Guitar      55K
Fernando Sor       1778  Spain    Guitar      63K
Johannes Linstead  1969  Canada   Guitar      65K
Govi               1949  Germany  Guitar      115K
Ottmar Liebert     1959  Germany  Guitar      1K

ds1.join(ds2, Seq("Country", "Instrument"))
[Diagram: the resulting RDD has three partitions (Partition I, II, III).]
7- Data Skew II
Id  Virtuoso           Country  Instrument  Listeners
1   Paco de Lucia      Spain    Guitar      337K
2   Vicente Amigo      Spain    Guitar      106K
3   Fernando Sor       Spain    Guitar      63K
4   Enrique Granados   Spain    Guitar      55K
5   Paco Pena          Spain    Guitar      65K
6   Johannes Linstead  Canada   Guitar      65K
7   Govi               Germany  Guitar      115K
8   Ottmar Liebert     Germany  Guitar      1K

Virtuoso           Year  Country  Instrument  Listeners
Paco de Lucia      1947  Spain    Guitar      337K
Vicente Amigo      1967  Spain    Guitar      106K
Fernando Sor       1778  Spain    Guitar      63K
Enrique Granados   1867  Spain    Guitar      55K
Paco Pena          1942  Spain    Guitar      65K
Johannes Linstead  1969  Canada   Guitar      65K
Govi               1949  Germany  Guitar      115K
Ottmar Liebert     1959  Germany  Guitar      1K

ds1.join(ds2, Seq("Country", "Instrument", "Listeners"))
[Diagram: the resulting RDD has seven partitions (Partition I to VII).]
Takeaway 14
Adding another well-distributed column to the join key, or salting the key, can help produce well-balanced partitions.
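The effect of the join key on balance can be checked in plain Scala, using the Country, Instrument and Listeners columns of the sample rows from the Data Skew I table. This is a minimal sketch without Spark. The object, the `Row` class and the deterministic salt (`id % saltBuckets`) are illustrative assumptions; in practice the salt is often random, and the other side of the join has to be replicated once per salt value.

```scala
object SkewSketch {
  // One sample row, reduced to the columns relevant for the join key.
  final case class Row(id: Int, country: String, instrument: String, listeners: String)

  val rows: Seq[Row] = Seq(
    Row(1, "Spain", "Guitar", "337K"),
    Row(2, "Spain", "Guitar", "106K"),
    Row(3, "Spain", "Guitar", "65K"),
    Row(4, "Spain", "Guitar", "55K"),
    Row(5, "Spain", "Guitar", "63K"),
    Row(6, "Canada", "Guitar", "65K"),
    Row(7, "Germany", "Guitar", "115K"),
    Row(8, "Germany", "Guitar", "1K")
  )

  // Counts how many rows share each key; a large maximum indicates skew.
  def rowsPerKey[K](key: Row => K): Map[K, Int] =
    rows.groupBy(key).map { case (k, group) => k -> group.size }

  // Salting: extend the skewed key with a small bucket number to split large groups.
  def saltedKey(row: Row, saltBuckets: Int): (String, String, Int) =
    (row.country, row.instrument, row.id % saltBuckets)

  def main(args: Array[String]): Unit = {
    println(rowsPerKey(r => (r.country, r.instrument)))              // skewed key
    println(rowsPerKey(r => (r.country, r.instrument, r.listeners))) // extra column
    println(rowsPerKey(r => saltedKey(r, 3)))                        // salted key
  }
}
```

With the two-column key, the largest group holds 5 of the 8 rows; adding Listeners brings every group down to 1 row, and salting with 3 buckets caps the largest group at 2 rows.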
8- Data Locality
Data Locality Level  Description
PROCESS_LOCAL        Data is in the same JVM as the running code.
NODE_LOCAL           Data is on the same node as the executor running the code (in the HDFS Data Node or in another executor's memory).
RACK_LOCAL           Data is on a different node in the same rack.
ANY                  Data is elsewhere on the network, not in the same rack.
NO_PREF              No data locality preference.
Takeaway 15
Spark runs the code as close to the data as possible.
Takeaway 16
PROCESS_LOCAL is the best option. Check the locality level when monitoring the job.
9- Job Scheduling
Across Applications:
• Static Allocation
• Dynamic Allocation
[Diagram: App I (Driver I) and App II (Driver II) share a Master and Workers I-III, each Worker hosting Executors. Shown once for Static Allocation and once for Dynamic Allocation, where an additional Executor appears.]
Within a Single Application:
• FIFO (default)
• FAIR
Takeaway 17
FAIR pools are useful for submitting jobs in parallel.
Takeaway 18
Dynamic Allocation can be useful for auto-scaling at runtime as traffic increases or decreases.
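Both takeaways map to Spark configuration properties, sketched below with `SparkConf`. The property names are Spark's own; the object name, the application name, the pool name used by callers and the executor bounds are hypothetical values chosen for illustration. Enabling the external shuffle service alongside dynamic allocation is an assumption about the deployment.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SchedulingConfSketch {
  // Configuration enabling FAIR scheduling within the application
  // and dynamic allocation across applications.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("scheduling-conf-sketch")
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")      // assumption: external shuffle service is available
      .set("spark.dynamicAllocation.minExecutors", "1")  // hypothetical lower bound
      .set("spark.dynamicAllocation.maxExecutors", "10") // hypothetical upper bound

  // Runs a job in a named FAIR pool for the current thread, then clears the pool assignment.
  def runInPool[A](sc: SparkContext, pool: String)(job: => A): A = {
    sc.setLocalProperty("spark.scheduler.pool", pool)
    try job
    finally sc.setLocalProperty("spark.scheduler.pool", null)
  }
}
```

Jobs submitted from different threads, each wrapped in `runInPool`, can then share executors instead of queuing behind one another as under FIFO.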
10- Serialization
Data Structure       Description
RDD                  JDK serialization by default. Kryo is recommended; it is not the default because it requires custom class registration.
DataFrame / DataSet  Tungsten serialization format is used.
When is it used?
• Shuffle: data is transferred between Worker Nodes
• Disk I/O: data is written to disk
• Memory: data is stored on the heap in serialized form
Takeaway 19
Kryo can be preferred over JDK serialization when using RDDs.
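Switching RDD serialization to Kryo is a configuration change plus class registration, sketched below. The `Employee` case class mirrors the model used earlier in the talk; the object name and application name are hypothetical.

```scala
import org.apache.spark.SparkConf

object KryoConfSketch {
  // Same shape as the Employee model from the Persistency slide.
  final case class Employee(id: String, name: String, department: String)

  // Configuration that replaces JDK serialization with Kryo and registers custom classes.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("kryo-conf-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Employee]))
}
```

Registering classes up front lets Kryo avoid writing the full class name with every serialized object.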
11- Event Sourcing on Transformations
Write Flow
Load -> Transformation(s) 1-3 -> Write 1 -> Transformation(s) 4-6 -> Write 2 -> Transformation(s) 7-8
(Each Write step writes the DataSet content to a distributed file system.)
Read Flow
• Read I: Write 1 -> Transformation(s) 4 -> Transformation(s) 5 -> Compute Update 5 Operation
• Read II: Write 2 -> Transformation(s) 7 -> Compute Update 7 Operation
Microservices Data Management Patterns
• CQRS (Command Query Responsibility Segregation)
• Event Sourcing
Takeaway 20
RDDs are immutable. Event Sourcing is a useful pattern at the serving layer for keeping an RDD audit history and computing historical data.
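The "Write" and "Read" steps of the flows above can be approximated by saving an intermediate DataSet and reloading it later, as in this simplified sketch. It is not a full event-sourcing implementation. The object name, the `Event` case class, the Parquet format and the target path are assumptions made for illustration.

```scala
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}

object StageSnapshotSketch {
  // Hypothetical record type for an intermediate result.
  final case class Event(id: Int, amount: Long)

  // "Write N": store the content of an intermediate DataSet on a (distributed) file system.
  def writeSnapshot(ds: Dataset[Event], path: String): Unit =
    ds.write.mode(SaveMode.Overwrite).parquet(path)

  // "Read N": reload the stored content so later transformations start from this point.
  def readSnapshot(spark: SparkSession, path: String): Dataset[Event] = {
    import spark.implicits._
    spark.read.parquet(path).as[Event]
  }
}
```

A read flow can then start from a stored snapshot and apply only the remaining transformations instead of replaying the pipeline from the initial load.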
12- Checkpointing
Checkpointing is a technique for fault tolerance and performance efficiency in computing systems.
• Avoids re-computing complex Spark jobs that require a shuffle
(e.g. complex Join, GroupBy, Sort)
• Helps re-compute a required DataSet quickly
(e.g. Single and Batch Transformations)
• Fault tolerance (the system can resume from the point of failure)
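A minimal sketch of checkpointing a DataSet after a shuffle-heavy step, assuming Spark SQL on the classpath. The object name, the sample data and the checkpoint directory passed in by the caller are hypothetical; on a cluster the directory would normally point to a distributed file system.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object CheckpointSketch {
  // Groups the sample data (which requires a shuffle), checkpoints the result,
  // and returns the number of groups read from the checkpointed DataSet.
  def groupCountAfterCheckpoint(spark: SparkSession, checkpointDir: String): Long = {
    import spark.implicits._

    // Checkpoint files are written under this directory.
    spark.sparkContext.setCheckpointDir(checkpointDir)

    val grouped: Dataset[(Int, Long)] = Seq(1, 2, 3, 4, 5, 6).toDS()
      .groupByKey((n: Int) => n % 3)
      .count()

    // Materializes the result and truncates the lineage, so later jobs
    // start from the stored data instead of redoing the shuffle.
    val checkpointed: Dataset[(Int, Long)] = grouped.checkpoint()
    checkpointed.count()
  }
}
```

Jobs built on `checkpointed` no longer depend on the original grouping stage.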
References
• https://spark.apache.org/docs/latest/
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
• https://stackoverflow.com/questions/36215672/spark-yarn-architecture
• https://microservices.io/patterns/data/event-sourcing.html
• https://coxautomotivedatasolutions.github.io/datadriven/spark/data%20skew/joins/data_skew/
Q & A
?
Thanks

Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 

Apache Spark Best Practices Meetup Talk

  • 1. Apache Spark Best Practices Eren Avşaroğulları Data Science & Engineering Club Meetup Talk Trinity College, Dublin 27 April 2019
  • 2. Agenda • Architecture • Data Structures • Persistency + GC Policy • Partitioning + Data Skew • Data Locality • Job Scheduling • Ser/De • Event Sourcing on Transformations • Checkpointing
  • 3. Bio • B.Sc & M.Sc on Electronics & Control Engineering • Data Engineer @ • Working on Data Analytics (Data Transformations & Cleaning) • Open Source Contributor @ { Apache Spark | Pulsar | Heron } erenavsarogullari erenavsar@apache.org
  • 4. Job Pipeline Two types of Spark operations on a DataSet: • Transformations: lazily evaluated (not computed immediately) • Actions: trigger the computation and return a value [Diagram: Data Source -> Transformations over the DataSet (RDD -> RDD -> RDD, Transformation Operators) -> Actions trigger computation -> Value (Sink)]
  • 5. Architecture Spark on YARN Topology (Client Mode) [Diagram: Client Node (Driver, SparkContext); Master Node (YARN Resource Manager, HDFS Name Node); Worker Nodes, each with a YARN Node Manager, an Executor (Task, Partition, Cache) and an HDFS Data Node (Data Block); Zookeeper]
  • 6. 1- Data Structures • RDD • DataFrame • DataSet (Type Safe) DataFrame and DataSet are tabular data structures for structured & semi-structured data processing, shown together with Project Tungsten and the Catalyst Optimizer. Takeaway 1: Prefer the DataFrame or DataSet APIs over RDD for better performance.
  • 7. 2- Persistency
      // Import packages
      import org.apache.spark.sql.Dataset
      import spark.implicits._
      // Define DataSet model
      case class Employee(id: String, name: String, department: String)
      // Create DataSet by loading file
      val employeeDS: Dataset[Employee] = spark.read.option("header", true).csv("$FilePath").as[Employee]
      // Transform DataSet and cache
      val filteredEmployeeDS = employeeDS.filter(...).map(...).persist()
      // Get count of DataSet
      filteredEmployeeDS.count()
      // Get content of DataSet
      filteredEmployeeDS.take(10)
    Takeaway 2: Consider caching a DataSet when multiple jobs reuse it. Spark evicts cached data with an LRU (Least Recently Used) policy. Takeaway 3: Unpersist a DataSet once it is no longer used.
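A minimal sketch of Takeaways 2 and 3: a DataSet is persisted while two actions reuse it and is released afterwards. The object name `PersistencySketch`, the in-memory sample rows (standing in for the CSV file on the slide) and the choice of `MEMORY_AND_DISK` are assumptions made for illustration only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.storage.StorageLevel

// DataSet model from the slide
case class Employee(id: String, name: String, department: String)

object PersistencySketch {
  // Returns (row count, first rows) for one department, caching the
  // filtered DataSet only while the two actions below need it.
  def countAndSample(spark: SparkSession, department: String): (Long, Seq[Employee]) = {
    import spark.implicits._

    // Hypothetical in-memory rows standing in for the CSV file
    val employeeDS: Dataset[Employee] = Seq(
      Employee("1", "Ada", "Data"),
      Employee("2", "Linus", "Platform"),
      Employee("3", "Grace", "Data")
    ).toDS()

    val filteredDS = employeeDS
      .filter(_.department == department)
      .persist(StorageLevel.MEMORY_AND_DISK)

    try {
      val total = filteredDS.count()          // first job fills the cache
      val sample = filteredDS.take(10).toSeq  // second job can read from the cache
      (total, sample)
    } finally {
      filteredDS.unpersist()                  // Takeaway 3: release the cached data
    }
  }
}
```

Wrapping the actions in `try`/`finally` is one way to make sure the cache is released even if a job fails; whether that is worth the ceremony depends on how long the application lives.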
  • 8. 3- Storage Level Storage Modes: • MEMORY_ONLY • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2 / MEMORY_AND_DISK_2 • OFF_HEAP Takeaway 4: MEMORY_ONLY is the best option for production when the data fits in memory. Takeaway 5: MEMORY_AND_DISK can be considered for memory-constrained environments. Takeaway 6: MEMORY_ONLY_SER uses less memory but is more CPU-intensive because of Ser/De.
  • 9. 4- Driver/Executor GC Policy JDK 8 Memory Model: JVM Heap (-Xms -Xmx) • Young Generation (-Xmn): Eden, Survivor 1, Survivor 2 (Minor GC) • Old Generation: Tenured (Major GC) • Metaspace
      // Enable GC logging
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
      // Enable G1GC through SparkConf
      spark.executor.extraJavaOptions=-XX:+UseG1GC   // Important when caching the data on executors
      spark.driver.extraJavaOptions=-XX:+UseG1GC     // Important when collecting the data to driver
    Takeaway 7: Enable GC logging first to analyze JVM pause times. Takeaway 8: G1GC can be preferred when shorter pause times are needed at the cost of some throughput; ParallelGC can be preferred when higher throughput is needed and longer pauses are acceptable.
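The executor setting above could be applied programmatically as in the following sketch. The object and method names are hypothetical; the JVM flags are the JDK 8 flags listed on the slide.

```scala
import org.apache.spark.SparkConf

object GcConfigSketch {
  // G1GC plus the GC logging flags from the slide (JDK 8 syntax)
  val gcOptions: String =
    "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

  // Builds a SparkConf whose executors start with G1GC and GC logging enabled.
  def executorGcConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .set("spark.executor.extraJavaOptions", gcOptions)
}
```

Note that in client mode the driver JVM is already running when the application creates its SparkConf, so `spark.driver.extraJavaOptions` is normally passed through `--driver-java-options` or the default properties file rather than set in code.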
  • 10. 5- Partitioning 3 factors affecting partition size & number: • External data source (HDFS, Cassandra table) • Number of CPU cores • Properties (parallelism level) Supported Partitioners: • Hash Partitioner (default) • Range Partitioner • Custom Partitioner Partitioning Operations: • coalesce • repartition Takeaway 9: HDFS default partition size: 128MB (HDFS data block size). Cassandra input split size: 64MB. Partition Index = key.hashCode() % numberOfPartitions (adjusted to be non-negative). Takeaway 10: Use a suitable Partitioner to create well-balanced partitions; not every Partitioner type fits every dataset. Takeaway 11: Too many small files: less efficient due to excessive parallelism. Too few large files: less efficient because not all available cores in the cluster are utilized.
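The partition index formula in Takeaway 9 can be sketched in plain Scala as follows. This is an illustration of the idea, not Spark's own source; the object and method names are hypothetical. A plain `%` can return a negative number when `hashCode` is negative, so the remainder is shifted back into range.

```scala
object HashPartitionSketch {
  // Maps a key to a partition index in [0, numPartitions).
  def partitionIndex(key: Any, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    if (key == null) 0 // null keys are sent to the first partition in this sketch
    else {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }
  }
}
```

For example, `HashPartitionSketch.partitionIndex(10, 4)` yields 2, and a key whose hash code is -3 lands in partition 1 rather than at an invalid negative index.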
  • 11. 6- Level of Parallelism [Diagram: Worker Nodes I to M host Executors I to N; each RDD Partition is processed by one Task] spark.default.parallelism: Default number of partitions in RDDs returned by transformations like join, reduceByKey and parallelize when not set by the user. Local mode: number of cores on the local machine; YARN: total number of cores on all executor nodes (or 2, whichever is larger). spark.sql.shuffle.partitions: Configures the number of partitions to use when shuffling data for joins or aggregations. Default is 200. Takeaway 12: Each partition is processed by exactly one task (1:1). Max number of partitions = total number of cores * (2 or 3). Takeaway 13: spark.default.parallelism applies to RDDs; spark.sql.shuffle.partitions applies to DataFrames / DataSets.
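The rule of thumb in Takeaway 12 can be turned into a small helper, sketched below under the assumption that the cluster size is known up front. The object, method and parameter names are hypothetical; the two property keys are the ones named on the slide.

```scala
object ParallelismSketch {
  // Upper bound suggested by Takeaway 12: total cores * (2 or 3)
  def maxPartitions(executors: Int, coresPerExecutor: Int, factor: Int = 2): Int = {
    require(executors > 0 && coresPerExecutor > 0, "cluster size must be positive")
    require(factor == 2 || factor == 3, "the slide suggests a factor of 2 or 3")
    executors * coresPerExecutor * factor
  }

  // Settings that could be passed to SparkConf or to spark-submit --conf
  def parallelismSettings(executors: Int, coresPerExecutor: Int, factor: Int = 2): Map[String, String] = {
    val n = maxPartitions(executors, coresPerExecutor, factor).toString
    Map(
      "spark.default.parallelism" -> n,    // RDD operations
      "spark.sql.shuffle.partitions" -> n  // DataFrame / DataSet shuffles
    )
  }
}
```

With 10 executors of 4 cores each, the helper suggests 80 partitions for a factor of 2 and 120 for a factor of 3; the right value still depends on data volume and should be checked against the job's actual task durations.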
  • 12. 7- Data Skew I ds1 (Id, Composer, Country, Instrument, Listeners): (1, Paco de Lucia, Spain, Guitar, 337K); (2, Vicente Amigo, Spain, Guitar, 106K); (3, Paco Peña, Spain, Guitar, 65K); (4, Enrique Granados, Spain, Guitar, 55K); (5, Fernando Sor, Spain, Guitar, 63K); (6, Johannes Linstead, Canada, Guitar, 65K); (7, Govi, Germany, Guitar, 115K); (8, Ottmar Liebert, Germany, Guitar, 1K) ds2 (Composer, Year, Country, Instrument, Listeners): (Paco de Lucia, 1947, Spain, Guitar, 337K); (Vicente Amigo, 1967, Spain, Guitar, 106K); (Paco Peña, 1942, Spain, Guitar, 65K); (Enrique Granados, 1867, Spain, Guitar, 55K); (Fernando Sor, 1778, Spain, Guitar, 63K); (Johannes Linstead, 1969, Canada, Guitar, 65K); (Govi, 1949, Germany, Guitar, 115K); (Ottmar Liebert, 1959, Germany, Guitar, 1K) ds1.join(ds2, Seq("Country", "Instrument")) [Diagram: the resulting RDD has Partition I, Partition II and Partition III; five of the eight rows share the key (Spain, Guitar)]
  • 13. 7- Data Skew II ds1 (Id, Virtuoso, Country, Instrument, Listeners): (1, Paco de Lucia, Spain, Guitar, 337K); (2, Vicente Amigo, Spain, Guitar, 106K); (3, Fernando Sor, Spain, Guitar, 63K); (4, Enrique Granados, Spain, Guitar, 55K); (5, Paco Peña, Spain, Guitar, 65K); (6, Johannes Linstead, Canada, Guitar, 65K); (7, Govi, Germany, Guitar, 115K); (8, Ottmar Liebert, Germany, Guitar, 1K) ds2 (Virtuoso, Year, Country, Instrument, Listeners): (Paco de Lucia, 1947, Spain, Guitar, 337K); (Vicente Amigo, 1967, Spain, Guitar, 106K); (Fernando Sor, 1778, Spain, Guitar, 63K); (Enrique Granados, 1867, Spain, Guitar, 55K); (Paco Peña, 1942, Spain, Guitar, 65K); (Johannes Linstead, 1969, Canada, Guitar, 65K); (Govi, 1949, Germany, Guitar, 115K); (Ottmar Liebert, 1959, Germany, Guitar, 1K) ds1.join(ds2, Seq("Country", "Instrument", "Listeners")) [Diagram: the resulting RDD has Partition I to Partition VII] Takeaway 14: Adding a more evenly distributed column to the join key, or salting the key, can help produce well-balanced partitions.
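Salting, mentioned in Takeaway 14, can be illustrated without a cluster by simulating hash partitioning on plain Scala collections. Everything below is a hypothetical sketch: the `Country|Instrument` key format, the `#` separator and the use of the row position as a deterministic salt (a random number is more common in practice) are assumptions.

```scala
object SaltingSketch {
  // Non-negative hash partition index for a string key
  def partitionIndex(key: String, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Appends a salt in [0, saltBuckets) to the key. The row position is used
  // instead of a random number so the result is reproducible.
  def saltedKey(key: String, rowPosition: Int, saltBuckets: Int): String =
    s"$key#${rowPosition % saltBuckets}"

  // Number of rows that land in each partition for the given keys
  def rowsPerPartition(keys: Seq[String], numPartitions: Int): Map[Int, Int] =
    keys.groupBy(partitionIndex(_, numPartitions)).map { case (p, ks) => p -> ks.size }
}
```

Five rows that all carry the key `Spain|Guitar` land in a single partition, while their salted variants spread over several partitions. In a real join the other side has to be replicated once per salt value so that matching rows still meet, which trades extra data volume for better balance.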
  • 14. 8- Data Locality Data Locality Levels: • PROCESS_LOCAL: Data is in the same JVM as the running code. • NODE_LOCAL: Data is on the same node as the executor running the code (in the HDFS Data Node or in another executor's memory). • RACK_LOCAL: Data is on a different node in the same rack. • ANY: Data is elsewhere on the network, not in the same rack. • NO_PREF: No data locality preference. Takeaway 15: Spark runs the code as close to the data as possible. Takeaway 16: PROCESS_LOCAL is the best option; check the locality level when monitoring the job.
  • 15. 9- Job Scheduling Across Applications: • Static Allocation • Dynamic Allocation Within a Single Application: • FIFO (default) • FAIR [Diagram: two applications (App I with Driver I, App II with Driver II) run Executors on Workers I to III under a shared Master, shown once for static and once for dynamic allocation] Takeaway 17: A FAIR pool is useful when jobs are submitted in parallel. Takeaway 18: Dynamic Allocation can be useful for scaling executors up and down at runtime as traffic increases or decreases.
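A configuration sketch for Takeaways 17 and 18 follows. The property keys are standard Spark settings; the object name and the executor bounds (1 and 10) are assumptions chosen for illustration.

```scala
import org.apache.spark.SparkConf

object SchedulingConfSketch {
  def schedulingConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      // Within one application: share resources between concurrent jobs
      .set("spark.scheduler.mode", "FAIR")
      // Across applications: grow and shrink the executor count at runtime
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")   // hypothetical lower bound
      .set("spark.dynamicAllocation.maxExecutors", "10")  // hypothetical upper bound
      // External shuffle service keeps shuffle files available when executors are removed
      .set("spark.shuffle.service.enabled", "true")
}
```

With FAIR mode enabled, a thread can route its jobs to a named pool through `sc.setLocalProperty("spark.scheduler.pool", "poolName")`, where the pool name is whatever the deployment defines.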
  • 16. 10- Serialization • RDD: JDK serialization by default. Kryo is recommended; it is not the default because it requires custom class registration. • DataFrame / DataSet: Tungsten serialization format is used. When is serialization used? • Shuffle: data is transferred between Worker Nodes • Disk I/O: data is written to disk • Memory: data is stored on the heap in serialized form. Takeaway 19: Prefer Kryo over JDK serialization when using RDDs.
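Enabling Kryo for RDD workloads, as suggested in Takeaway 19, might look like the sketch below. The `Listener` case class and the object name are hypothetical; `spark.serializer` and `registerKryoClasses` are the standard SparkConf entry points.

```scala
import org.apache.spark.SparkConf

// Hypothetical record type stored in RDDs
case class Listener(composer: String, country: String, listeners: Long)

object KryoConfSketch {
  def kryoConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registered classes are written as a small id instead of the full class name
      .registerKryoClasses(Array[Class[_]](classOf[Listener]))
}
```

Registration is the extra step the slide refers to: unregistered classes still work with Kryo, but each serialized object then carries its full class name.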
  • 17. 11- Event Sourcing on Transformations Microservices Data Management Patterns: • CQRS (Command Query Responsibility Segregation) • Event Sourcing Write Flow: Load -> Transformation(s) 1 -> Transformation(s) 2 -> Transformation(s) 3 -> Write 1 -> Transformation(s) 4 -> Transformation(s) 5 -> Transformation(s) 6 -> Write 2 -> Transformation(s) 7 -> Transformation(s) 8 (each Write stores the DataSet content in a distributed file system) Read Flow: • Read I: Write 1 -> Transformation(s) 4 -> Transformation(s) 5 (Compute Update 5 Operation) • Read II: Write 2 -> Transformation(s) 7 (Compute Update 7 Operation) Takeaway 20: RDDs are immutable. Event Sourcing is a useful pattern at the serving layer for keeping an audit history of RDDs and for computing historical data.
  • 18. 12- Checkpointing Checkpointing is a technique for fault tolerance and performance efficiency in computing systems. • Avoids re-computing complex Spark jobs that require a shuffle (e.g. complex Join, GroupBy, Sort) • Helps re-compute a required DataSet quickly (e.g. single and batch transformations) • Fault tolerance (the system can resume from the point of failure)
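A minimal checkpointing sketch, assuming a writable checkpoint directory is available: a shuffle-heavy aggregation is checkpointed so that later jobs read the saved result instead of replaying the lineage. The object name, the sample rows and the `checkpointDir` parameter are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CheckpointSketch {
  // Counts rows per country and checkpoints the aggregated result.
  def checkpointedCounts(spark: SparkSession, checkpointDir: String): DataFrame = {
    import spark.implicits._

    // Checkpoint files are written under this directory (HDFS on a real cluster)
    spark.sparkContext.setCheckpointDir(checkpointDir)

    // Hypothetical sample rows
    val countries = Seq("Spain", "Spain", "Canada", "Germany", "Spain").toDF("country")
    val counts = countries.groupBy("country").count() // requires a shuffle

    // Eager by default: materializes the data and truncates the lineage
    counts.checkpoint()
  }
}
```

Unlike `persist()`, the checkpointed data survives the loss of executors because it lives in the checkpoint directory, at the price of an extra write.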
  • 19. References • https://spark.apache.org/docs/latest/ • https://jaceklaskowski.gitbooks.io/mastering-apache-spark/ • https://stackoverflow.com/questions/36215672/spark-yarn-architecture • https://microservices.io/patterns/data/event-sourcing.html • https://coxautomotivedatasolutions.github.io/datadriven/spark/data%20skew/joins/data_skew/