SlideShare a Scribd company logo
1
Tuning
Q4’s Research Report
linhtm@runsystem.net
Agenda
1. Tuning Spark parameter
a. Control Spark’s resource usage
b. Advanced Parameter
c. Dynamic Allocation
2. Tips for tuning your Spark program
3. Example use case of tuning Spark
algorithm
2
3
Tuning Spark Parameter
3
The easy way
If you Spark application is slow, just let it
have more system resources.
Is there anything simpler?
4
Spark Architecture Simplified
5
Control Spark’s resource usage
• spark-submit command’s parameter (some only
available when using in YARN)
6
Parameter Description Default value
num-executor Number of executors to launch 2
executor-cores Number of cores per executor 1
executor-memory Memory per executor 1G
driver-cores Number of cores used by the driver, only in YARN
cluster mode
1
driver-memory Memory for driver 1G
Calculate the right values
• For example: 4 servers for Spark, each server has
64gb ram, 16 cores. How should we set those spark-
submit’s parameters?
– --num-executors 4 --executor-memory 63g --
executor-cores 15
– --num-executors 7 --executor-memory 29GB --
executor-cores 7
– --num-executors 11 --executor-memory 19GB --
executor-cores 5
7
Spark Executor’s Memory Model
• Memory request from YARN for each container =
spark.executor.memory + spark.yarn.executor.
memoryOverhead
• spark.yarn.executor.memoryOverhead = max(spark.
executor.memory * 0.1, 384mb)
8
Move advanced parameters
9
spark.shuffle.memoryFraction Fraction of Java heap to use for aggregation and
cogroups during shuffles
0.2
spark.reducer.maxSizeInFlight Maximum size of map outputs to fetch simultaneously
from each reduce task
48m
spark.shuffle.consolidateFiles If set to "true", consolidates intermediate files created
during a shuffle
false
spark.shuffle.file.buffer Size of the in-memory buffer for each shuffle file output
stream
32k
spark.storage.memoryFraction Fraction of Java heap to use for Spark's memory cache 0.6
spark.akka.frameSize Number of actor threads to use for communication 4
spark.akka.threads Maximum message size to allow in "control plane"
communication (for serialized tasks and task results), in
MB
10
Advanced Spark memory
10
Demo Spark UI
11
Using Dynamic Allocation
• Dynamically scale the set of cluster resources
allocated to your application up and down based on
the workload
• Only available when using YARN as cluster
management tool
• Must use an external shuffle service, so must config
a shuffle service with YARN
12
Dynamic Allocation parameters (1)
13
spark.shuffle.service.enabled Enables the external shuffle service.
This service preserves the shuffle
files written by executors so the
executors can be safely removed
false
spark.dynamicAllocation.enabled Whether to use dynamic resource
allocation
false
spark.dynamicAllocation.
executorIdleTimeout
If an executor has been idle for more
than this duration, the executor will
be removed
60s
spark.dynamicAllocation.
cachedExecutorIdleTimeout
If an executor which has cached
data blocks has been idle for more
than this duration, the executor will
be removed
Infinity
spark.dynamicAllocation.initialExecutors Initial number of executors to run minExecutor
Dynamic Allocation parameters (2)
14
spark.dynamicAllocation.maxExecutors Upper bound for the number of
executors
Infinity
spark.dynamicAllocation.minExecutors Lower bound for the number of
executors
0
spark.dynamicAllocation.
schedulerBacklogTimeout
If there have been pending tasks
backlogged for more than this
duration, new executors will be
requested
1s
spark.dynamicAllocation.
sustainedSchedulerBacklogTimeout
Same as spark.dynamicAllocation.
schedulerBacklogTimeout, but
used only for subsequent executor
requests
schedulerBackl
ogTimeout
Dynamic Allocation in Action
15
Dynamic Allocation - The verdict
• Dynamic Allocation help using your cluster resource
more efficiently
• But only effective when Spark Application is a long
running one with different long stages with different
number of tasks (Spark Streaming?)
• In addition, when an executor is removed, all
cached data will no longer be accessible
16
17
Tips for Tuning Your Spark
Program
17
Tuning Memory Usage
• Prefer arrays of objects, and primitive types, instead
of the standard Java or Scala collection classes (e.g.
HashMap).
• Avoid nested structures with a lot of small objects
and pointers when possible.
• Using numeric IDs or enumeration objects instead of
strings for keys.
• If you have less than 32 GB of RAM, set the JVM flag
-XX:+UseCompressedOops to make pointers be four
bytes instead of eight.
18
Other Tuning Tips (1)
● Using KryoSerializer instead of default JavaSerilizer
● Know when to persist RDD and determine the right
level of storage level
○ MEMORY_ONLY
○ MEMORY_AND_DISK
○ MEMORY_ONLY_SER
○ …
● Choose the right level of parallelism
○ spark.default.parallelism
○ repartition
○ 2nd arguments for methods in spark.
PairRDDFunctions
19
Other tuning tips (2)
• Broadcast large variables
• Do not collect on large RDDs (should filter first)
• Careful when using operation that require data
shuffle (join, reduceByKey, groupByKey…)
• Avoid groupByKey, use reduceByKey or
aggregateByKey or combineByKey (low level) if
possible.
20
groupByKey vs reduceByKey (1)
21
groupByKey vs reduceByKey (2)
22
23
Example use case of tuning Spark
algorithm
23
Tuning CF algorithm in RW project
• 1st algorithm, no parameter tuning: 27mins
• 1st algorithm, parameters tuned: 18mins
• 2nd algorithm (from Spark code), parameters tuned:
~ 7mins 30s
• 3nd algorithm (improved Spark code), parameters
tuned: ~6mins 30s
24
25
Q&A
25
26
Thank You!

More Related Content

What's hot

How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
Databricks
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
Databricks
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Cloudera, Inc.
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
Omid Vahdaty
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Databricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
Databricks
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 

What's hot (20)

How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 

Similar to Spark tuning

Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
Adarsh Pannu
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
Amazon Web Services
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
datamantra
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Spark / Mesos Cluster Optimization
Spark / Mesos Cluster OptimizationSpark / Mesos Cluster Optimization
Spark / Mesos Cluster Optimization
ebiznext
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark Jobs
Samir Bessalah
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
Chicago spark meetup-april2017-public
Chicago spark meetup-april2017-publicChicago spark meetup-april2017-public
Chicago spark meetup-april2017-public
Guru Dharmateja Medasani
 
#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future Design#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future Design
PivotalOpenSourceHub
 
Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
Jason Hubbard
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
Amazon Web Services
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Rose Toomey
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2
Jianfeng Zhang
 
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Spark Summit
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache Spark
Jeremy Beard
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production
Paris Data Engineers !
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
Amazon Web Services
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
IBM
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
Amazon Web Services
 

Similar to Spark tuning (20)

Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Spark / Mesos Cluster Optimization
Spark / Mesos Cluster OptimizationSpark / Mesos Cluster Optimization
Spark / Mesos Cluster Optimization
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark Jobs
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Chicago spark meetup-april2017-public
Chicago spark meetup-april2017-publicChicago spark meetup-april2017-public
Chicago spark meetup-april2017-public
 
#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future Design#GeodeSummit - Off-Heap Storage Current and Future Design
#GeodeSummit - Off-Heap Storage Current and Future Design
 
Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2
 
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache Spark
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 

More from GMO-Z.com Vietnam Lab Center

高負荷に耐えうるWebApplication Serverの作り方
高負荷に耐えうるWebApplication Serverの作り方高負荷に耐えうるWebApplication Serverの作り方
高負荷に耐えうるWebApplication Serverの作り方
GMO-Z.com Vietnam Lab Center
 
Phương pháp và chiến lược đối ứng tải trong Web Application Server
Phương pháp và chiến lược đối ứng tải trong Web Application ServerPhương pháp và chiến lược đối ứng tải trong Web Application Server
Phương pháp và chiến lược đối ứng tải trong Web Application Server
GMO-Z.com Vietnam Lab Center
 
Ứng dụng NLP vào việc xác định ý muốn người dùng (Intent Detection) và sửa lỗ...
Ứng dụng NLP vào việc xác định ý muốn người dùng (Intent Detection) và sửa lỗ...Ứng dụng NLP vào việc xác định ý muốn người dùng (Intent Detection) và sửa lỗ...
Ứng dụng NLP vào việc xác định ý muốn người dùng (Intent Detection) và sửa lỗ...
GMO-Z.com Vietnam Lab Center
 
Tìm hiểu và triển khai ứng dụng Web với Kubernetes
Tìm hiểu và triển khai ứng dụng Web với KubernetesTìm hiểu và triển khai ứng dụng Web với Kubernetes
Tìm hiểu và triển khai ứng dụng Web với Kubernetes
GMO-Z.com Vietnam Lab Center
 
Xây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
Xây dựng hệ thống quản lý sân bóng sử dụng Yii FrameworkXây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
Xây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
GMO-Z.com Vietnam Lab Center
 
Nhận biết giao dịch lừa đảo sử dụng học máy
Nhận biết giao dịch lừa đảo sử dụng học máyNhận biết giao dịch lừa đảo sử dụng học máy
Nhận biết giao dịch lừa đảo sử dụng học máy
GMO-Z.com Vietnam Lab Center
 
Hệ thống giám sát nhận diện khuôn mặt
Hệ thống giám sát nhận diện khuôn mặtHệ thống giám sát nhận diện khuôn mặt
Hệ thống giám sát nhận diện khuôn mặt
GMO-Z.com Vietnam Lab Center
 
Image Style Transfer
Image Style TransferImage Style Transfer
Image Style Transfer
GMO-Z.com Vietnam Lab Center
 
Optimizing MySQL queries
Optimizing MySQL queriesOptimizing MySQL queries
Optimizing MySQL queries
GMO-Z.com Vietnam Lab Center
 
Surveillance on slam technology
Surveillance on slam technologySurveillance on slam technology
Surveillance on slam technology
GMO-Z.com Vietnam Lab Center
 
Blockchain & Smart Contract - Bắt đầu như thế nào và các ứng dụng
Blockchain & Smart Contract - Bắt đầu như thế nào và các ứng dụngBlockchain & Smart Contract - Bắt đầu như thế nào và các ứng dụng
Blockchain & Smart Contract - Bắt đầu như thế nào và các ứng dụng
GMO-Z.com Vietnam Lab Center
 
Giới thiệu Embulk
Giới thiệu Embulk Giới thiệu Embulk
Giới thiệu Embulk
GMO-Z.com Vietnam Lab Center
 
Giới thiệu docker và ứng dụng trong ci-cd
Giới thiệu docker và ứng dụng trong ci-cdGiới thiệu docker và ứng dụng trong ci-cd
Giới thiệu docker và ứng dụng trong ci-cd
GMO-Z.com Vietnam Lab Center
 
Tài liệu giới thiệu công ty GMO-Z.com Vietnam Lab Center
Tài liệu giới thiệu công ty GMO-Z.com Vietnam Lab CenterTài liệu giới thiệu công ty GMO-Z.com Vietnam Lab Center
Tài liệu giới thiệu công ty GMO-Z.com Vietnam Lab Center
GMO-Z.com Vietnam Lab Center
 
Chia se Agile
Chia se AgileChia se Agile
Agile retrospective
Agile retrospectiveAgile retrospective
Agile retrospective
GMO-Z.com Vietnam Lab Center
 
Giới thiệu Agile + Scrum
Giới thiệu Agile + ScrumGiới thiệu Agile + Scrum
Giới thiệu Agile + Scrum
GMO-Z.com Vietnam Lab Center
 
Create android app can send SMS and Email by React Natice
Create android app can send SMS and Email by React NaticeCreate android app can send SMS and Email by React Natice
Create android app can send SMS and Email by React Natice
GMO-Z.com Vietnam Lab Center
 
Introduce React Native
Introduce React NativeIntroduce React Native
Introduce React Native
GMO-Z.com Vietnam Lab Center
 

More from GMO-Z.com Vietnam Lab Center (20)

高負荷に耐えうるWebApplication Serverの作り方
高負荷に耐えうるWebApplication Serverの作り方高負荷に耐えうるWebApplication Serverの作り方
高負荷に耐えうるWebApplication Serverの作り方
 
Phương pháp và chiến lược đối ứng tải trong Web Application Server
Phương pháp và chiến lược đối ứng tải trong Web Application ServerPhương pháp và chiến lược đối ứng tải trong Web Application Server
Phương pháp và chiến lược đối ứng tải trong Web Application Server
 
Ứng dụng NLP vào việc xác định ý muốn người dùng (Intent Detection) và sửa lỗ...
Ứng dụng NLP vào việc xác định ý muốn người dùng (Intent Detection) và sửa lỗ...Ứng dụng NLP vào việc xác định ý muốn người dùng (Intent Detection) và sửa lỗ...
Ứng dụng NLP vào việc xác định ý muốn người dùng (Intent Detection) và sửa lỗ...
 
Tìm hiểu và triển khai ứng dụng Web với Kubernetes
Tìm hiểu và triển khai ứng dụng Web với KubernetesTìm hiểu và triển khai ứng dụng Web với Kubernetes
Tìm hiểu và triển khai ứng dụng Web với Kubernetes
 
Xây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
Xây dựng hệ thống quản lý sân bóng sử dụng Yii FrameworkXây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
Xây dựng hệ thống quản lý sân bóng sử dụng Yii Framework
 
Nhận biết giao dịch lừa đảo sử dụng học máy
Nhận biết giao dịch lừa đảo sử dụng học máyNhận biết giao dịch lừa đảo sử dụng học máy
Nhận biết giao dịch lừa đảo sử dụng học máy
 
Hệ thống giám sát nhận diện khuôn mặt
Hệ thống giám sát nhận diện khuôn mặtHệ thống giám sát nhận diện khuôn mặt
Hệ thống giám sát nhận diện khuôn mặt
 
Image Style Transfer
Image Style TransferImage Style Transfer
Image Style Transfer
 
Optimizing MySQL queries
Optimizing MySQL queriesOptimizing MySQL queries
Optimizing MySQL queries
 
Surveillance on slam technology
Surveillance on slam technologySurveillance on slam technology
Surveillance on slam technology
 
Blockchain & Smart Contract - Bắt đầu như thế nào và các ứng dụng
Blockchain & Smart Contract - Bắt đầu như thế nào và các ứng dụngBlockchain & Smart Contract - Bắt đầu như thế nào và các ứng dụng
Blockchain & Smart Contract - Bắt đầu như thế nào và các ứng dụng
 
Giới thiệu Embulk
Giới thiệu Embulk Giới thiệu Embulk
Giới thiệu Embulk
 
Giới thiệu docker và ứng dụng trong ci-cd
Giới thiệu docker và ứng dụng trong ci-cdGiới thiệu docker và ứng dụng trong ci-cd
Giới thiệu docker và ứng dụng trong ci-cd
 
Tài liệu giới thiệu công ty GMO-Z.com Vietnam Lab Center
Tài liệu giới thiệu công ty GMO-Z.com Vietnam Lab CenterTài liệu giới thiệu công ty GMO-Z.com Vietnam Lab Center
Tài liệu giới thiệu công ty GMO-Z.com Vietnam Lab Center
 
Chia se Agile
Chia se AgileChia se Agile
Chia se Agile
 
Agile retrospective
Agile retrospectiveAgile retrospective
Agile retrospective
 
Giới thiệu Agile + Scrum
Giới thiệu Agile + ScrumGiới thiệu Agile + Scrum
Giới thiệu Agile + Scrum
 
Create android app can send SMS and Email by React Natice
Create android app can send SMS and Email by React NaticeCreate android app can send SMS and Email by React Natice
Create android app can send SMS and Email by React Natice
 
Introduce React Native
Introduce React NativeIntroduce React Native
Introduce React Native
 
Git in real product
Git in real productGit in real product
Git in real product
 

Recently uploaded

Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
UiPathCommunity
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
Jen Stirrup
 

Recently uploaded (20)

Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
 

Spark tuning

  • 2. Agenda 1. Tuning Spark parameter a. Control Spark’s resource usage b. Advanced Parameter c. Dynamic Allocation 2. Tips for tuning your Spark program 3. Example use case of tuning Spark algorithm 2
  • 4. The easy way If you Spark application is slow, just let it have more system resources. Is there anything simpler? 4
  • 6. Control Spark’s resource usage • spark-submit command’s parameter (some only available when using in YARN) 6 Parameter Description Default value num-executor Number of executors to launch 2 executor-cores Number of cores per executor 1 executor-memory Memory per executor 1G driver-cores Number of cores used by the driver, only in YARN cluster mode 1 driver-memory Memory for driver 1G
  • 7. Calculate the right values • For example: 4 servers for Spark, each server has 64gb ram, 16 cores. How should we set those spark- submit’s parameters? – --num-executors 4 --executor-memory 63g -- executor-cores 15 – --num-executors 7 --executor-memory 29GB -- executor-cores 7 – --num-executors 11 --executor-memory 19GB -- executor-cores 5 7
  • 8. Spark Executor’s Memory Model • Memory request from YARN for each container = spark.executor.memory + spark.yarn.executor. memoryOverhead • spark.yarn.executor.memoryOverhead = max(spark. executor.memory * 0.1, 384mb) 8
  • 9. Move advanced parameters 9 spark.shuffle.memoryFraction Fraction of Java heap to use for aggregation and cogroups during shuffles 0.2 spark.reducer.maxSizeInFlight Maximum size of map outputs to fetch simultaneously from each reduce task 48m spark.shuffle.consolidateFiles If set to "true", consolidates intermediate files created during a shuffle false spark.shuffle.file.buffer Size of the in-memory buffer for each shuffle file output stream 32k spark.storage.memoryFraction Fraction of Java heap to use for Spark's memory cache 0.6 spark.akka.frameSize Number of actor threads to use for communication 4 spark.akka.threads Maximum message size to allow in "control plane" communication (for serialized tasks and task results), in MB 10
  • 12. Using Dynamic Allocation • Dynamically scale the set of cluster resources allocated to your application up and down based on the workload • Only available when using YARN as cluster management tool • Must use an external shuffle service, so must config a shuffle service with YARN 12
  • 13. Dynamic Allocation parameters (1) 13 spark.shuffle.service.enabled Enables the external shuffle service. This service preserves the shuffle files written by executors so the executors can be safely removed false spark.dynamicAllocation.enabled Whether to use dynamic resource allocation false spark.dynamicAllocation. executorIdleTimeout If an executor has been idle for more than this duration, the executor will be removed 60s spark.dynamicAllocation. cachedExecutorIdleTimeout If an executor which has cached data blocks has been idle for more than this duration, the executor will be removed Infinity spark.dynamicAllocation.initialExecutors Initial number of executors to run minExecutor
  • 14. Dynamic Allocation parameters (2) 14 spark.dynamicAllocation.maxExecutors Upper bound for the number of executors Infinity spark.dynamicAllocation.minExecutors Lower bound for the number of executors 0 spark.dynamicAllocation. schedulerBacklogTimeout If there have been pending tasks backlogged for more than this duration, new executors will be requested 1s spark.dynamicAllocation. sustainedSchedulerBacklogTimeout Same as spark.dynamicAllocation. schedulerBacklogTimeout, but used only for subsequent executor requests schedulerBackl ogTimeout
  • 16. Dynamic Allocation - The verdict • Dynamic Allocation help using your cluster resource more efficiently • But only effective when Spark Application is a long running one with different long stages with different number of tasks (Spark Streaming?) • In addition, when an executor is removed, all cached data will no longer be accessible 16
  • 17. 17 Tips for Tuning Your Spark Program 17
  • 18. Tuning Memory Usage • Prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). • Avoid nested structures with a lot of small objects and pointers when possible. • Using numeric IDs or enumeration objects instead of strings for keys. • If you have less than 32 GB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers be four bytes instead of eight. 18
  • 19. Other Tuning Tips (1) ● Using KryoSerializer instead of default JavaSerilizer ● Know when to persist RDD and determine the right level of storage level ○ MEMORY_ONLY ○ MEMORY_AND_DISK ○ MEMORY_ONLY_SER ○ … ● Choose the right level of parallelism ○ spark.default.parallelism ○ repartition ○ 2nd arguments for methods in spark. PairRDDFunctions 19
  • 20. Other tuning tips (2) • Broadcast large variables • Do not collect on large RDDs (should filter first) • Careful when using operation that require data shuffle (join, reduceByKey, groupByKey…) • Avoid groupByKey, use reduceByKey or aggregateByKey or combineByKey (low level) if possible. 20
  • 23. 23 Example use case of tuning Spark algorithm 23
  • 24. Tuning CF algorithm in RW project • 1st algorithm, no parameter tuning: 27mins • 1st algorithm, parameters tuned: 18mins • 2nd algorithm (from Spark code), parameters tuned: ~ 7mins 30s • 3nd algorithm (improved Spark code), parameters tuned: ~6mins 30s 24