Apache Spark Performance Tuning and Best Practices
By: Amit Raj, IIT Kharagpur
Our Agenda
01 Spark Introduction
02 Code Level Optimization
03 Outside Code Technique
04 Demo
05 Summary
Introduction
● Apache Spark is an open-source, in-memory computation framework.
● It gives high performance for both batch and streaming jobs.
● It is built for big data processing.
● It can be up to ~100x faster than MapReduce for certain workloads, largely because of in-memory computation.
Because Spark handles big data processing applications, it also consumes a lot of resources such as CPU, RAM, and storage. Optimising one or more of these together leads to significant cost reduction.
In the next 40 minutes we will learn about the approaches that help us do so.
Ways to Optimise
Code Level:-
Here we will learn the best practices to follow in order to achieve high performance with minimal resources, such as: caching, broadcasting, serialization, using DataSet/DataFrame over RDD, avoiding UDFs, filtering data at the earliest, and reducing shuffle.
Beyond Code:-
Here we will learn to tune config parameters at the cluster-resource level, such as: file format, level of parallelism, executor config, memory tuning, and batch interval.
Major Bottlenecks
● CPU
● Network Bandwidth
● Memory
Our goal is to optimise each of these as much as possible in order to reduce the resources used and the computation time, achieving optimum performance.
Caching
Suppose in our analytics project we have a text file of flight records; we have to read it and get the number of flights leaving from a particular country, and the same intermediate result is used multiple times.
● Raw data is in a text file
● Read the text file as DF1
● Group by origin country as DF2
Caching
JOB1:- Number of flights leaving the US as DF3
JOB2:- Number of flights leaving Singapore as DF4
JOB3:- Number of flights leaving India as DF5
Execution plan for JOB1 :- DF1 > DF2 > DF3
Execution plan for JOB2 :- DF1 > DF2 > DF4; after caching, just DF2 > DF4, with no need of the DF1 > DF2 step.
Execution plan for JOB3 :- DF1 > DF2 > DF5; after caching, just DF2 > DF5, with no need of the DF1 > DF2 step.
Here, instead of computing DF1 and DF2 again, we cache the last reusable DF in memory so that the other jobs can use it, reducing computation resources and saving time.
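A minimal sketch of this pattern, assuming a flight-records CSV with an ORIGIN_COUNTRY_NAME column (the path, schema, and country values are illustrative, not from the deck):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CachingDemo").master("local[*]").getOrCreate()

// DF1: read the raw flight data
val df1 = spark.read.option("header", "true").csv("/data/flights.csv")

// DF2: group once, then cache because three jobs reuse it
val df2 = df1.groupBy("ORIGIN_COUNTRY_NAME").count().cache()

// JOB1..JOB3 reuse the cached DF2 instead of re-reading and re-grouping
val df3 = df2.filter(df2("ORIGIN_COUNTRY_NAME") === "United States")
val df4 = df2.filter(df2("ORIGIN_COUNTRY_NAME") === "Singapore")
val df5 = df2.filter(df2("ORIGIN_COUNTRY_NAME") === "India")
df3.show(); df4.show(); df5.show()

Note that cache() is lazy: DF2 is materialized on the first action, after which the later jobs skip the read-and-group steps.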
Broadcasting
A broadcast variable lets us keep a read-only variable cached on each executor, so we don't have to send it with every task. This helps in reducing network bandwidth and time consumption.
When to Use a Broadcast Variable:-
Suppose we have lookup data that needs to be used by each executor while performing tasks.
With 100 partitions on a 10-executor cluster (every executor takes care of 10 partitions),
we need to execute at least 100 tasks, so we would have to send the lookup data 100 times to the executors (once with every task).
But if we use broadcast, we send the lookup data to each executor only once, so only 10 copies are sent.
Benefit = sending 100 copies vs sending 10 copies
// lookup maps we want available on every executor
val states = Map(("NY","New York"),("CA","California"),("FL","Florida"))
val countries = Map(("USA","United States of America"),("IN","India"))

// broadcast once; each executor caches a read-only copy
val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)
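Continuing the snippet above, each task reads the executor-local copy through .value; a small illustrative example (the sample rows are made up):

import spark.implicits._
val df = Seq(("James", "NY"), ("Maria", "CA")).toDF("name", "state")

// the closure captures only the broadcast handle; the map itself
// is fetched once per executor, not once per task
val resolved = df.map(row => (row.getString(0), broadcastStates.value(row.getString(1))))
  .toDF("name", "state_name")
resolved.show()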
– Continue
In the slide's diagram, m is a broadcast variable: it sits in the memory of each executor and is read during task execution.
Hence the driver doesn't need to ship the variable (m) with every task, which cuts network I/O and time.
Serialization
Serialization is needed when we write data to some storage; de-serialization is needed when we read it back from a source.
In the Spark ecosystem we deal with both of them all the time: during cache, broadcast, shuffle, etc.
Hence it becomes very important to optimize the serialization process.
Serialization
Kryo serialization over Java serialization:-
Kryo is much faster (often up to 10x) and more compact than Java serialization, but it doesn't support all serializable types and requires you to register the classes it doesn't handle out of the box.
Note that spark.serializer must be set before the SparkSession is created, e.g. on the builder:

val spark = SparkSession.builder()
  .appName("Broadcast")
  .master("local")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

A further optimization is to register your classes with Kryo in advance; this matters when row counts are large, because if you don't register a class, Kryo stores the full class name with every serialized object (i.e. for every row). Here conf is the SparkConf used to build the session:

conf.set("spark.kryo.registrationRequired", "true")
conf.registerKryoClasses(Array(classOf[Foo]))
DataSet/DataFrame over RDD
RDDs serialize and deserialize data whenever they distribute it across the cluster, such as during repartition and shuffle, and we all know that serialization and de-serialization are very expensive operations in Spark.
On the other hand, DataFrames store data in a compact binary format using off-heap (Tungsten) storage, so there is no Java object serialization and deserialization when data is distributed across the cluster. We see a big performance improvement in DataFrame over RDD.
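A hedged side-by-side of the same count in both APIs (the file path and column name carry over from the earlier caching sketch):

// RDD version: Java/Kryo serialization of whole objects on every shuffle
val rddCounts = spark.sparkContext.textFile("/data/flights.csv")
  .map(line => (line.split(",")(0), 1))
  .reduceByKey(_ + _)

// DataFrame version: Tungsten keeps rows in a compact binary format,
// so shuffles move bytes without Java object (de)serialization
val dfCounts = spark.read.option("header", "true").csv("/data/flights.csv")
  .groupBy("ORIGIN_COUNTRY_NAME").count()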
Avoid UDF
When we use UDFs we lose the optimizations Spark applies to our DataFrame/Dataset, because Catalyst cannot see inside a UDF. Hence, whenever an inbuilt Spark function exists we should use it and avoid UDFs as much as possible.
But if we do have to use one, we first define it like a normal Scala function and then register it with Spark's UDF facility:

import org.apache.spark.sql.functions.udf
val plusOne = udf((x: Int) => x + 1) // define the function as a UDF
spark.udf.register("plusOne", plusOne) // register it for use in SQL
spark.sql("SELECT plusOne(5)").show() // call the UDF
// +------+
// |UDF(5)|
// +------+
// |     6|
// +------+
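For contrast, the same result with a built-in expression, which Catalyst can still optimize (the df and its "value" column below are hypothetical):

spark.sql("SELECT 5 + 1 AS result").show()
import org.apache.spark.sql.functions.col
df.withColumn("plus_one", col("value") + 1) // built-in column arithmetic, no UDF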
Filter Data at Earliest
Example:- Suppose we have a dataset of employees with columns like employee number, age, gender, salary, department, city, address, past experience, marital status, etc.
But we have to find the number of employees belonging to a particular city. In this case we only need a groupBy on the city column; every other column is irrelevant, so prune them as early as possible:

df.select("city").groupBy("city").count().show() // prune columns first, then aggregate
df.groupBy("city").count().select("city", "count").show() // same query with selection after aggregation
[Slide diagram: two query plans with Scan, Filter, and Aggregate stages, contrasting where the filter is applied.]
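Row filters benefit the same way as column pruning; a small sketch, where the column names and city value are hypothetical:

import org.apache.spark.sql.functions.col
// filter as early as possible so every later stage processes fewer rows
df.filter(col("city") === "Bangalore")
  .groupBy("department")
  .count()
  .show()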
Shuffling
Shuffling is the mechanism Spark uses to redistribute data across different executors and even across machines. A Spark shuffle is triggered when we perform certain transformation operations like groupByKey(), reduceByKey(), or join() on an RDD or DataFrame. It involves:
● Disk I/O
● Data serialization and deserialization
● Network I/O
Reduce Shuffle Operation
We cannot completely avoid shuffle operations, but when possible we should try to reduce their number and remove any unused operations.
Spark provides the spark.sql.shuffle.partitions configuration to control the number of shuffle partitions; by tuning this property you can improve Spark performance.

spark.conf.set("spark.sql.shuffle.partitions", 100)

Here 100 is the shuffle partition count. We can tune this number by trial and error based on data size: if we have less data then we don't need 100 shuffle partitions, and if we have much bigger data and can execute a large number of parallel tasks then we can increase it to 200 or more.
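Shuffle volume can also be reduced by choosing pre-aggregating operators; a sketch assuming a simple RDD of (word, 1) pairs:

val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// groupByKey ships every (key, value) pair across the network before summing
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines within each partition first, shuffling far less data
val viaReduce = pairs.reduceByKey(_ + _)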
File Format
Suppose we have a system like this: DataSource > SparkJob1 > Database2 > SparkJob2 > Database3.
SparkJob1 reads the data from the source and writes it into Database2; then SparkJob2 reads from Database2, performs its calculation, and writes into Database3.
So Database2 involves both writing the data in and reading it back out.
In this scenario we should prefer writing the intermediate data in serialized and optimized formats like Avro, Parquet, etc.
Any transformation on these formats performs better than on text, CSV, or JSON.
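A minimal sketch of the handoff using Parquet as the intermediate store (intermediateDf and the paths are illustrative names, not from the deck):

// SparkJob1: write the intermediate result as Parquet instead of text/CSV/JSON
intermediateDf.write.mode("overwrite").parquet("/warehouse/stage/flights")

// SparkJob2: columnar Parquet allows column pruning and predicate pushdown on read
val staged = spark.read.parquet("/warehouse/stage/flights")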
Executor Config
● JOB > Stage > Task
● One job can have multiple stages; one stage can have multiple tasks.
● Number of cores = number of parallel tasks per executor.
● So we have to give a proper number of cores to each executor in order to optimise the resources.
● Allocating more cores per executor means more parallel tasks on each executor, which can lead to out-of-memory (OOM) errors.
● Allocating fewer cores per executor reduces parallelism and loses its benefit; the executor memory will also not be fully utilised.
● After many iterations, a common recommendation is about 5 cores per executor, to get the maximum benefit of parallelism with proper memory usage; a worked sizing example follows the command below.
./bin/spark-submit --driver-memory 8G --executor-memory 16G --num-executors 3 --executor-cores 5
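As a hedged worked example (the cluster sizes are assumed, not from the deck): on 3 nodes with 16 cores and 64 GB RAM each, reserving 1 core and ~1 GB per node for the OS and daemons leaves 15 cores and 63 GB; at 5 cores per executor that gives 3 executors per node, 9 in total (one slot usually goes to the driver or application master), each with roughly 63 GB / 3 = 21 GB, part of which must be left for off-heap memory overhead.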
Memory Tuning
There are three considerations in tuning memory usage:
● the amount of memory used by your objects (you may want your entire dataset to fit in memory),
● the cost of accessing those objects, and
● the overhead of garbage collection.
● Simple types such as String use less storage space than collection types like LinkedList and Map, since those collection objects not only have a header but also pointers (typically 8 bytes each) to the next object in the structure.
● We can also optimise memory usage by storing data in a serialized format, as sketched below.
● Java objects are fast to access but consume 2-5x more space than the "raw" data inside their fields.
● Using data structures with fewer objects and caching data in serialized form helps reduce the garbage-collection cost. Broadcast variables also help in reducing GC.
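A small sketch of caching in serialized form; this matters most for RDDs of Java objects, since DataFrames already store data compactly:

import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_SER keeps partitions as serialized byte arrays: fewer live
// objects for the GC to trace, at the cost of deserialization on access
val rdd = spark.sparkContext.textFile("/data/flights.csv") // illustrative path
rdd.persist(StorageLevel.MEMORY_ONLY_SER)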
Thank You !
Get in touch with us:
Amit Raj
Senior Data Engineer
IIT Kharagpur
amitraj.iitkgp@gmail.com / 7548095242