Auto-scaling Apache Spark cluster using Deep Reinforcement Learning
Kundjanasith Thonglek (1), Kohei Ichikawa (1), Chatchawal Sangkeettrakan (2), Apivadee Piyatumrong (2)
(1) Nara Institute of Science and Technology (NAIST), Japan
(2) National Electronics and Computer Technology Center (Nectec), Thailand
OLA’2019 : International Conference on Optimization and Learning
Agenda
Introduction
Methodology
Evaluation
Conclusion
Introduction
Big data and advanced analytics technology are attracting much attention, not just because the size
of the data is big but also because the potential impact is big.
A real-time application might have to handle different sizes of input data at
different times, as well as different machine learning techniques for different purposes
at the same time.
Engineers need to handle large-scale data processing systems efficiently. However, data processing
science is also known to be a relatively new field that requires advanced knowledge of a
huge variety of techniques, tools, and theories.
Apache Spark
Apache Spark is a fast, in-memory data processing engine with elegant and
expressive development APIs to allow data workers to efficiently execute
streaming, machine learning or SQL workloads that require fast iterative access
to datasets.
Spark operation :
- Transformation : passes each dataset element through a function and returns a new RDD
representing the results
- Action : aggregates all the elements of the RDD using some function and returns the final
result to the driver program
[Figure: transformations turn an RDD into another RDD; an action turns an RDD into a value]
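The difference between the two kinds of operation can be illustrated without a cluster. The sketch below is a toy, single-process stand-in written in plain Python, not Spark's API: the class name ToyRDD and its methods are hypothetical and chosen only to mirror the description above. A transformation returns a new dataset object and does no work; an action runs the recorded functions and returns a plain value to the caller, which plays the role of the driver program.

```python
from functools import reduce
from typing import Callable, Iterable, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: holds data plus pending functions."""

    def __init__(self, data: Iterable, pending: List[Callable] = None):
        self._data = list(data)
        self._pending = list(pending or [])

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: record fn and return a NEW dataset; nothing runs yet.
        return ToyRDD(self._data, self._pending + [fn])

    def _compute(self) -> list:
        # Apply every recorded function to every element, in order.
        out = self._data
        for fn in self._pending:
            out = [fn(x) for x in out]
        return out

    def reduce(self, fn: Callable):
        # Action: aggregate all elements and return a plain value.
        return reduce(fn, self._compute())


numbers = ToyRDD([1, 2, 3, 4])
squares = numbers.map(lambda x: x * x)      # transformation -> ToyRDD
total = squares.reduce(lambda a, b: a + b)  # action -> value
print(total)  # 30
```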
Apache Spark cluster
The Key Components of Apache Spark cluster
[Figure: master node (driver program, Spark Context, cluster manager), worker nodes (executors) and data node; scaling applies to the worker nodes]
Master Node
- Spark Context : It is essentially a client of Spark’s
execution environment and acts as the master of
the Spark application.
Worker Node
- Executor : It is a distributed agent that is
responsible for executing tasks.
Problem statement
When should an Apache Spark cluster scale out or scale in its worker nodes
so that a task completes within the execution time limit
and the maximum number of worker nodes?
[Figure: resources over time for scale-out and scale-in]
The system supports real-time
processing to handle different sizes
of input data at different times.
The system can complete the task
within the bounded time and
resource constraints.
Objectives
We will create an auto-scaling system that scales an Apache Spark cluster automatically
on the OpenStack platform using a deep reinforcement learning technique.
Auto-Scaling system
SCALING TECHNIQUE
[Figure: two panels. Rule-Based Scaling Technique: the cluster sends its current state to the cluster management system, where a rule issues the scaling command. Data-Driven Scaling Technique: the cluster sends its current state and task status to the cluster management system, where a data model built by data modeling issues the scaling command.]
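For contrast with the data-driven technique, a rule-based scaler can be sketched in a few lines. The CPU thresholds (80% and 20%), the function name and the one-node step are illustrative assumptions, not values taken from the slides.

```python
def rule_based_scaling(cpu_percent: float, workers: int, max_workers: int,
                       high: float = 80.0, low: float = 20.0) -> int:
    """Return the change in worker count (+1, 0 or -1) from a fixed rule.

    The thresholds `high` and `low` are assumed values for illustration.
    """
    if cpu_percent > high and workers < max_workers:
        return +1   # scaling command: scale out by one worker
    if cpu_percent < low and workers > 1:
        return -1   # scaling command: scale in, keeping at least one worker
    return 0        # no scaling command


print(rule_based_scaling(cpu_percent=92.0, workers=2, max_workers=3))  # 1
print(rule_based_scaling(cpu_percent=10.0, workers=1, max_workers=3))  # 0
```

A rule like this reacts only to the thresholds it was given, which is the limitation a learned data model is meant to address.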
Methodology
Auto-scaling Apache Spark cluster using Deep Reinforcement Learning
Set up Environment -> Feature Selection -> Applied DQN -> Auto-scaling system
Set up Environment
- Set up the Apache Spark cluster on the OpenStack platform by configuring the Apache
Spark cluster template
Feature selection
- Analyse the features from the logs that we collect from the system API
Applied DQN
- DQN is a deep reinforcement learning technique that is suitable for this problem
Auto-scaling system
- Design our auto-scaling system to connect the compute and scaling modules
Set up Environment
The OpenStack system is prepared and stacked up with the Apache Spark cluster configuration in
the necessary templates: a master node template, a worker node template, a data node template,
and an Apache Spark cluster template, where one cluster must have at least one master node and one
worker node.
[Figure: Apache Spark cluster on the OpenStack platform]
Apache Spark cluster is launched on the OpenStack platform in
homogeneous mode.
Node :
- CPU 4 vCPU
- Memory 8 GB
- Storage disk 20 GB
Feature Selection
[Figure: Collector -> Analyze]
- m_a : The percentage of memory usage when Apache Spark operates an action
- m_t : The percentage of memory usage when Apache Spark operates a transformation
- c_u : The percentage of CPU usage for user processes
- c_s : The percentage of CPU usage for system processes
- b_i : The percentage of network usage for inbound network
- b_o : The percentage of network usage for outbound network
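The six features can be carried around as one state vector. A minimal sketch, assuming the percentages are already available as numbers between 0 and 100; the class and field names are hypothetical, and scaling to [0, 1] is an assumed preprocessing step rather than something stated in the slides.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass(frozen=True)
class ClusterState:
    """Six usage percentages (0-100); field names are hypothetical."""
    mem_action: float          # m_a
    mem_transformation: float  # m_t
    cpu_user: float            # c_u
    cpu_system: float          # c_s
    net_inbound: float         # b_i
    net_outbound: float        # b_o

    def as_vector(self) -> List[float]:
        # Scale each percentage to [0, 1] (an assumed preprocessing step).
        return [value / 100.0 for value in astuple(self)]


state = ClusterState(40.0, 55.0, 70.0, 10.0, 25.0, 5.0)
print(state.as_vector())
```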
Deep Reinforcement Learning
[Figure: the Deep Reinforcement Learning [ Agent ] observes the [ State ] of the Apache Spark cluster on the OpenStack platform and returns an [ Action ], subject to the [ Constraints ] and the [ Reward function ]]
[ State ] : c_u, c_s, b_i, b_o, m_t, m_a
[ Action ] : A_y^o | A_0^neutral | A_y^i
State
The current state of the
Apache Spark cluster is
acquired as the features.
Action
The scaling action, with
the number of worker nodes
to scale in the cluster.
Agent
A Deep Q-Network (DQN)
is the network that learns
the features and takes actions.
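Two pieces of the standard DQN recipe are easy to show in isolation: epsilon-greedy action selection and the one-step Q-learning target r + gamma * max Q(s', a'). The sketch below uses plain Python lists in place of a neural network's output. The discount factor gamma, the exploration rate epsilon and the example Q-values are assumptions, since the slides do not give hyperparameters.

```python
import random
from typing import List, Optional


def epsilon_greedy(q_values: List[float], epsilon: float,
                   rng: Optional[random.Random] = None) -> int:
    """Pick a random action index with probability epsilon, else the best."""
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def td_target(reward: float, next_q_values: List[float],
              gamma: float = 0.9, done: bool = False) -> float:
    """One-step Q-learning target used to train the Q-network."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical Q-values for five scaling actions (N = 3).
q = [0.1, 0.4, 0.3, 0.2, 0.0]
print(epsilon_greedy(q, epsilon=0.0))          # 1 (greedy choice)
print(td_target(reward=1.0, next_q_values=q))  # 1.0 + 0.9 * 0.4
```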
States & Constraints
The states are the possible environment statuses of the system under study. In our scenario,
the Apache Spark cluster is spawned with at least one master node and
one worker node, based on the pre-configured OpenStack template for scaling purposes.
If the maximum number of worker nodes is N then the number of possible states is N
Assumption : the maximum number of worker nodes is 3
[Figure: states S1, S2 and S3, each labelled with the constraints T, 3]
[ T, N ] are the environment constraints.
- Time constraint [ T ] : The expected bound on execution time.
- Resource constraint [ N ] : The maximum number of worker nodes.
Actions
The actions are what deep reinforcement learning uses to scale the Apache Spark cluster. There are three
possible scaling actions: (1) scaling-out, (2) not-scaling and (3) scaling-in.
If the maximum number of worker nodes is N then the number of possible actions is 2(N-1) + 1
Assumption : the maximum number of worker nodes is 3
[Figure: transitions between the three states: scaling-out actions A_1^o and A_2^o, the not-scaling action A_0^neutral, and scaling-in actions A_1^i and A_2^i]
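The count 2(N-1) + 1 can be checked by enumerating the actions. A minimal sketch: the tuple encoding (direction, y) and the rule that a transition is valid only while the worker count stays between 1 and N are assumptions drawn from the constraints described earlier.

```python
from typing import List, Tuple

Action = Tuple[str, int]  # (direction, number of worker nodes y)


def action_space(max_workers: int) -> List[Action]:
    """All scaling actions for a cluster with at most `max_workers` workers."""
    outs = [("out", y) for y in range(1, max_workers)]
    ins = [("in", y) for y in range(1, max_workers)]
    return outs + [("neutral", 0)] + ins


def next_workers(workers: int, action: Action, max_workers: int) -> int:
    """Worker count after the action; raises if a constraint is violated."""
    direction, y = action
    delta = {"out": y, "neutral": 0, "in": -y}[direction]
    result = workers + delta
    if not 1 <= result <= max_workers:
        raise ValueError(f"{action} is not valid with {workers} worker(s)")
    return result


N = 3
actions = action_space(N)
print(len(actions), 2 * (N - 1) + 1)   # 5 5
print(next_workers(1, ("out", 2), N))  # 3
```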
Reward Function
The reward equation gives the reward (r) to the agent when it makes a decision to scale the
cluster, which must have at least one worker node. The reward function utilizes the features that were selected
and explained earlier, as well as the constraints on the cluster state (m_a, m_t, c_u, c_s, b_i, b_o, T, N). Furthermore, it
must take into account the number of scaling worker nodes y made by the actions.
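\[
w(y) =
\begin{cases}
+y, & \text{when } A_y^{o} \text{; the agent takes the scaling-out action} \\
0, & \text{when } A_0^{neutral} \text{; the agent takes the not-scaling action} \\
-y, & \text{when } A_y^{i} \text{; the agent takes the scaling-in action}
\end{cases}
\]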
The reward function is defined as
r = [Equation: combines ( 1 - · ), m_a, m_t, c_u, c_s, b_i, b_o, w / (N - 1), ( 1 + · ), (T - t) / T and the divisor U]
Where t is the execution time of this round and U is the number of features
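The scaling weight w(y) translates directly into code. A minimal sketch: the string labels for the three actions are a hypothetical encoding, and only w(y) is shown, not the full reward.

```python
def scaling_weight(direction: str, y: int) -> int:
    """w(y): +y for scaling-out, 0 for not-scaling, -y for scaling-in."""
    if direction == "out":
        return +y
    if direction == "in":
        return -y
    if direction == "neutral":
        return 0
    raise ValueError(f"unknown scaling direction: {direction!r}")


print(scaling_weight("out", 2))      # 2
print(scaling_weight("neutral", 0))  # 0
print(scaling_weight("in", 1))       # -1
```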
System Architecture
[Figure: the OpenStack platform hosts the Apache Spark cluster and the Deep Reinforcement Learning node; components shown: Learning & Scaling Engine, Scaling-Mode Web Interface, Data Publishing Engine]
Evaluation
The auto-scaling system on the Apache Spark cluster using deep reinforcement learning is
evaluated with 5 GB of data processed via streaming. Each environment constraint is tested 100
times.
It is evaluated under two constraints :
(1) The limit execution time constraint ( T )
(2) The maximum number of worker nodes constraint ( N )
T = { 5, 6, 7, 8, 9, 10 } minutes
N = { 5, 6, 7, 8, 9, 10 } nodes
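The constraint grid can be enumerated as below. A minimal sketch, assuming every pairing of T and N counts as one environment constraint and that "100 times" applies to each pairing; the slides do not state this explicitly.

```python
from itertools import product

TIME_LIMITS_MIN = [5, 6, 7, 8, 9, 10]  # T, minutes
MAX_WORKERS = [5, 6, 7, 8, 9, 10]      # N, worker nodes
RUNS_PER_CONSTRAINT = 100

# Assumption: each (T, N) pair is tested independently.
constraints = list(product(TIME_LIMITS_MIN, MAX_WORKERS))
total_runs = len(constraints) * RUNS_PER_CONSTRAINT

print(len(constraints))  # 36
print(total_runs)        # 3600
```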
The Percentage of Job Failure with Different Optimization Models
[Figure: percentage of job failure for Deep Q-Network (DQN), our model, and Linear Regression (LR), the baseline]
The Sacrifice and Stabilize period of DQN and LR
# Experiment | T=5 LR  | T=5 DQN | T=6 LR  | T=6 DQN | T=7 LR  | T=7 DQN | T=8 LR  | T=8 DQN | T=9 LR | T=9 DQN
1 - 25       | 4       | 5, L=9  | 4       | 5, L=7  | 2       | 2, L=3  | 0       | 0       | 0      | 0
26 - 50      | 2       | 0       | 3       | 0       | 1       | 0       | 1, L=34 | 0       | 0      | 0
51 - 75      | 2       | 0       | 2, L=73 | 0       | 1       | 0       | 0       | 0       | 0      | 0
76 - 100     | 2, L=90 | 0       | 0       | 0       | 1, L=84 | 0       | 0       | 0       | 0      | 0
The maximum number of worker nodes constraint is 5 worker nodes.
Let L be the experiment round in which the last failure happened.
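Summing the table per time constraint makes the comparison easier to read. A minimal sketch, assuming each cell counts the failed jobs in that block of 25 experiments; the numbers are copied from the table and the L annotations are left out.

```python
# failures[T][model] lists the four blocks: 1-25, 26-50, 51-75, 76-100.
failures = {
    5: {"LR": [4, 2, 2, 2], "DQN": [5, 0, 0, 0]},
    6: {"LR": [4, 3, 2, 0], "DQN": [5, 0, 0, 0]},
    7: {"LR": [2, 1, 1, 1], "DQN": [2, 0, 0, 0]},
    8: {"LR": [0, 1, 0, 0], "DQN": [0, 0, 0, 0]},
    9: {"LR": [0, 0, 0, 0], "DQN": [0, 0, 0, 0]},
}

totals = {
    t: {model: sum(blocks) for model, blocks in by_model.items()}
    for t, by_model in failures.items()
}

for t, row in totals.items():
    print(f"T={t}: LR={row['LR']} DQN={row['DQN']}")
```

Under that reading, every DQN failure falls within the first 25 rounds, while LR still fails as late as round 90 for T = 5.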
Conclusion
● We study how to automatically optimize the scaling of compute nodes in an Apache
Spark cluster using a deep reinforcement learning technique.
● We found six significant features that have a direct impact on the
performance of real-time applications running on an Apache Spark
cluster.
● We improved the performance of the cluster
under two constraints:
the execution time limit
and the maximum number of
worker nodes per cluster.
Implementation
We provide Docker images on Docker Hub and source code on GitHub
https://hub.docker.com/r/kundjanasith/kitwai-engine/
https://hub.docker.com/r/kundjanasith/kitwai-ai/
https://github.com/Kundjanasith/scaling-sparkcluster/
Email : thonglek.kundjanasith.ti7@is.naist.jp
Thank You
Q & A
Kundjanasith Thonglek
Software Design & Analysis Laboratory, NAIST

Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf

  • 1. Auto-scaling Apache Spark cluster using Deep Reinforcement Learning. Kundjanasith Thonglek (1), Kohei Ichikawa (1), Chatchawal Sangkeettrakan (2), Apivadee Piyatumrong (2). (1) Nara Institute of Science and Technology (NAIST), Japan; (2) National Electronics and Computer Technology Center (Nectec), Thailand. OLA’2019 : International Conference on Optimization and Learning
  • 2. Agenda: Introduction, Methodology, Evaluation, Conclusion
  • 3. Introduction. Big data and advanced analytics technologies are attracting much attention, not just because the data are large but also because their potential impact is large. A real-time application might have to handle input data of different sizes at different times, as well as different machine learning techniques for different purposes at the same time. Engineers need to handle large-scale data processing systems efficiently. However, data processing science is a relatively new field that requires advanced knowledge of a huge variety of techniques, tools, and theories.
  • 4. Apache Spark. Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. Spark operations: - Transformation (for example, map): passes each dataset element through a function and returns a new RDD representing the results. - Action (for example, reduce): aggregates all the elements of the RDD using some function and returns the final result to the driver program. (Figure: transformations turn one RDD into the next; an action turns the final RDD into a value.)
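The difference between the two kinds of operation can be sketched without a cluster. The snippet below is only a plain-Python analogy, not Spark code: the built-in lazy `map` stands in for a transformation and `functools.reduce` stands in for an action. The `square` function and the sample data are made up for illustration.

```python
from functools import reduce

evaluated = []  # records which elements have actually been processed


def square(x):
    evaluated.append(x)
    return x * x


data = [1, 2, 3, 4]

# "Transformation": map() is lazy, so nothing has been computed yet.
squared = map(square, data)
assert evaluated == []

# "Action": reduce() consumes the lazy pipeline and returns a single value.
total = reduce(lambda a, b: a + b, squared)
print(total)  # 30
```

Only the final call forces the work to happen, which mirrors how a Spark action triggers the evaluation of the transformations that precede it.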
  • 5. Apache Spark cluster. The key components of an Apache Spark cluster (figure labels: Master Node, Data Node, Worker Node, Executor, Driver Program, Cluster Manager, Spark Context, scaling). Master Node - Spark Context: it is essentially a client of Spark’s execution environment and acts as the master of the Spark application. Worker Node - Executor: it is a distributed agent that is responsible for executing tasks.
  • 6. Problem statement. When should an Apache Spark cluster scale out or scale in its worker nodes so that a task completes within the execution time limit and the maximum number of worker nodes? (Figure: resources over time for scale-out and scale-in.)
  • 7. Objectives. We will create an auto-scaling system that scales an Apache Spark cluster automatically on the OpenStack platform using a deep reinforcement learning technique. The system supports real-time processing to handle input data of different sizes at different times. The system can complete the task within the bounded time and resource constraints.
  • 8. Auto-Scaling system. Scaling techniques: Rule-Based Scaling Technique (figure labels: cluster, cluster management system, rule, current state, scaling command) and Data-Driven Scaling Technique (figure labels: cluster, cluster management system, data model, data modeling, task status, current state, scaling command).
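To make the rule-based technique concrete, here is a minimal sketch of a fixed rule that maps a current state to a scaling command. The function name, the use of CPU usage as the only input, and the 80% and 20% thresholds are all hypothetical; the slides do not state which rule was used.

```python
def rule_based_command(cpu_percent, workers, max_workers, high=80.0, low=20.0):
    """Return a scaling command from one fixed rule (thresholds are assumed)."""
    if cpu_percent > high and workers < max_workers:
        return "scale-out"
    if cpu_percent < low and workers > 1:
        # Never drop below one worker node.
        return "scale-in"
    return "none"


print(rule_based_command(95.0, workers=2, max_workers=3))
print(rule_based_command(5.0, workers=2, max_workers=3))
```

A data-driven technique replaces the hard-coded thresholds with a model learned from observed task status.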
  • 9. Methodology: auto-scaling Apache Spark cluster using Deep Reinforcement Learning. (1) Set up environment: set up the Apache Spark cluster on the OpenStack platform by configuring an Apache Spark cluster template. (2) Feature selection: analyse the features from the logs that we collect from the system API. (3) Apply DQN: DQN is a deep reinforcement learning technique that is suitable for this problem. (4) Auto-scaling system: design our auto-scaling system to connect the compute and scaling modules.
  • 10. Set up Environment. The OpenStack system is prepared with the Apache Spark cluster configuration in the necessary templates (master node template, worker node template, data node template) and an Apache Spark cluster template, where one cluster must have at least one master node and one worker node. The Apache Spark cluster is launched on the OpenStack platform in homogeneous mode. Each node: CPU 4 vCPU, memory 8 GB, storage disk 20 GB.
  • 11. Feature Selection. The selected features are: the percentage of memory usage when Apache Spark performs an action ( ma ); the percentage of memory usage when Apache Spark performs a transformation ( mt ); the percentage of CPU usage for user processes ( cu ); the percentage of CPU usage for system processes ( cs ); the percentage of network usage for inbound traffic ( bi ); the percentage of network usage for outbound traffic ( bo ). (Figure: collectors feed an analysis step.)
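One way to carry these six features around is a small record type. The sketch below is an assumption about representation only: the class name `ClusterState`, the sample values, and the scaling of percentages to the range 0..1 are not taken from the slides.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class ClusterState:
    ma: float  # memory usage during Spark actions (%)
    mt: float  # memory usage during Spark transformations (%)
    cu: float  # CPU usage for user processes (%)
    cs: float  # CPU usage for system processes (%)
    bi: float  # inbound network usage (%)
    bo: float  # outbound network usage (%)

    def as_vector(self):
        """Scale each percentage to 0..1 (an assumed preprocessing step)."""
        return [value / 100.0 for value in astuple(self)]


# Made-up sample reading.
state = ClusterState(ma=40.0, mt=55.0, cu=70.0, cs=10.0, bi=25.0, bo=5.0)
U = len(state.as_vector())  # number of features
print(U, state.as_vector())
```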
  • 12. Deep Reinforcement Learning. State: the current state of the Apache Spark cluster, acquired as the features ( cu, cs, bi, bo, mt, ma ). Action: the scaling action together with the number y of worker nodes to scale, written A_y^o (scale-out), A_0^neutral (no scaling) or A_y^i (scale-in). Agent: a Deep Q-Network (DQN) is the network that learns from the features and takes actions. (Figure labels: OpenStack platform, Apache Spark cluster, Deep Reinforcement Learning, agent, constraints, reward function.)
  • 13. States & Constraints. The states are the possible environment statuses of the system under study. In our scenario, the Apache Spark cluster is spawned with at least one master node and one worker node, based on the pre-configured OpenStack template used for scaling. If the maximum number of worker nodes is N, then the number of possible states is N. Assumption: the maximum number of worker nodes is 3, giving states S1, S2 and S3, each under the constraints [ T, 3 ]. [ T, N ] are the environment constraints. - Time constraint [ T ]: the expected bound on execution time. - Resource constraint [ N ]: the maximum number of worker nodes.
  • 14. Actions. The actions for deep reinforcement learning to scale the Apache Spark cluster. There are three possible kinds of scaling action: (1) scaling-out, (2) not-scaling and (3) scaling-in. If the maximum number of worker nodes is N, then the number of possible actions is 2(N-1) + 1. Assumption: the maximum number of worker nodes is 3, so the actions are A_0^neutral, A_1^o, A_2^o, A_1^i and A_2^i, that is 2(3-1) + 1 = 5 actions.
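The count 2(N-1) + 1 can be checked by enumerating the actions. This is a minimal sketch; the tuple encoding `(direction, y)` and the helper `valid_actions`, which filters by the current number of worker nodes, are illustrative assumptions rather than the authors' implementation.

```python
def action_space(max_workers):
    """All scaling actions for a cluster capped at max_workers worker nodes."""
    actions = [("neutral", 0)]
    for y in range(1, max_workers):
        actions.append(("out", y))  # scale out by y worker nodes
        actions.append(("in", y))   # scale in by y worker nodes
    return actions


def valid_actions(workers, max_workers):
    """Actions that keep the worker count within 1..max_workers."""
    valid = []
    for direction, y in action_space(max_workers):
        if direction == "out":
            target = workers + y
        elif direction == "in":
            target = workers - y
        else:
            target = workers
        if 1 <= target <= max_workers:
            valid.append((direction, y))
    return valid


print(len(action_space(3)))  # 5
print(valid_actions(1, 3))
```

With N = 3 and one worker node running, only the neutral action and the two scale-out actions remain valid, since the cluster must keep at least one worker node.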
  • 15. Reward Function. The reward equation gives the reward (r) to the agent when it makes a decision to scale the cluster, which must keep at least one worker node. The reward function uses the features selected and explained earlier as well as the constraints of the cluster state (ma, mt, cu, cs, bi, bo, T, N). Furthermore, it must take into account the number of worker nodes y scaled by the action: w(y) = +y when the agent takes the scaling-out action A_y^o; w(y) = 0 when the agent takes the not-scaling action A_0^neutral; w(y) = -y when the agent takes the scaling-in action A_y^i. The reward function is defined as r = ( 1 - ) + ma + mt + cu + cs + bi + bo + w (N - 1) ( 1 + ) (T - t) T U, where t is the execution time of this round and U is the number of features.
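The piecewise term w(y) is simple enough to write down directly. The sketch below covers only w(y), not the full reward r; the function name and the string encoding of the action direction are hypothetical.

```python
def scaling_weight(direction, y):
    """w(y): +y for scaling-out, 0 for not-scaling, -y for scaling-in."""
    if direction == "out":
        return +y
    if direction == "neutral":
        return 0
    if direction == "in":
        return -y
    raise ValueError(f"unknown direction: {direction!r}")


print(scaling_weight("out", 2), scaling_weight("neutral", 0), scaling_weight("in", 1))
```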
  • 16. System Architecture (figure labels): OpenStack platform, Apache Spark cluster, Deep Reinforcement Learning node, Learning & Scaling Engine, Scaling-Mode, Web Interface, Data Publishing Engine.
  • 17. Evaluation. The auto-scaling system on the Apache Spark cluster using deep reinforcement learning is evaluated with 5 GB of data processed via streaming. Each environment constraint is tested 100 times. It is evaluated under two constraints: (1) the execution time limit ( T ) and (2) the maximum number of worker nodes ( N ). T = { 5, 6, 7, 8, 9, 10 } minutes; N = { 5, 6, 7, 8, 9, 10 } nodes.
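The size of this experiment can be sketched by enumerating the constraint settings. The snippet assumes that every T is paired with every N (a full cross product), which the slides do not state explicitly; the variable names are illustrative.

```python
from itertools import product

TIME_LIMITS_MIN = range(5, 11)  # T = 5..10 minutes
MAX_WORKERS = range(5, 11)      # N = 5..10 nodes
RUNS_PER_SETTING = 100

# Assumption: each (T, N) pair is one environment constraint setting.
settings = list(product(TIME_LIMITS_MIN, MAX_WORKERS))
total_runs = len(settings) * RUNS_PER_SETTING

print(len(settings), total_runs)  # 36 3600
```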
  • 18. The Percentage of Job Failure with Different Optimization Models (figure): our model, Deep Q-Network (DQN), compared with the baseline, Linear Regression (LR).
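  • 19. The Sacrifice and Stabilize period of DQN and LR. The maximum number of worker nodes constraint is 5 worker nodes. Let L be the experiment round in which the last failure happened. Columns are grouped by time constraint (T).
    | # Experiment | T=5 LR | T=5 DQN | T=6 LR | T=6 DQN | T=7 LR | T=7 DQN | T=8 LR | T=8 DQN | T=9 LR | T=9 DQN |
    |---|---|---|---|---|---|---|---|---|---|---|
    | 1 - 25 | 4 | 5, L=9 | 4 | 5, L=7 | 2 | 2, L=3 | 0 | 0 | 0 | 0 |
    | 26 - 50 | 2 | 0 | 3 | 0 | 1 | 0 | 1, L=34 | 0 | 0 | 0 |
    | 51 - 75 | 2 | 0 | 2, L=73 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
    | 76 - 100 | 2, L=90 | 0 | 0 | 0 | 1, L=84 | 0 | 0 | 0 | 0 | 0 |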
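The table can be totalled per model and time constraint. The sketch below assumes that each cell is the number of failed runs in that block of 25 experiments, which the slide does not state explicitly; the counts are copied from the table and the L annotations are dropped.

```python
# Failures per block of 25 experiments (rounds 1-25, 26-50, 51-75, 76-100),
# keyed by model and time constraint T, with N = 5 worker nodes.
failures = {
    "LR": {5: [4, 2, 2, 2], 6: [4, 3, 2, 0], 7: [2, 1, 1, 1],
           8: [0, 1, 0, 0], 9: [0, 0, 0, 0]},
    "DQN": {5: [5, 0, 0, 0], 6: [5, 0, 0, 0], 7: [2, 0, 0, 0],
            8: [0, 0, 0, 0], 9: [0, 0, 0, 0]},
}

totals = {
    model: {t: sum(counts) for t, counts in by_t.items()}
    for model, by_t in failures.items()
}

for model, by_t in totals.items():
    print(model, by_t)
```

Under that reading, all DQN failures fall in the first 25 rounds, while LR failures recur as late as round 90, which matches the slide's framing of an early sacrifice period followed by a stable one.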
  • 20. Conclusion ● We studied how to automatically optimize the scaling of computing nodes in an Apache Spark cluster using a deep reinforcement learning technique. ● We found six significant features that have a direct impact on the performance of real-time applications running on an Apache Spark cluster. ● We improved the performance of the cluster under two constraints: the execution time limit and the maximum number of worker nodes per cluster.
  • 21. Implementation. We provide Docker images on Docker Hub and source code on GitHub: https://hub.docker.com/r/kundjanasith/kitwai-engine/ https://hub.docker.com/r/kundjanasith/kitwai-ai/ https://github.com/Kundjanasith/scaling-sparkcluster/ Email : thonglek.kundjanasith.ti7@is.naist.jp
  • 22. Thank You. Q & A. Kundjanasith Thonglek, Software Design & Analysis Laboratory, NAIST