The document describes using deep reinforcement learning to automatically scale an Apache Spark cluster on an OpenStack platform. Key points:
- Features like CPU/memory usage and network traffic are selected to represent cluster state.
- A deep Q-network agent takes scaling actions (add/remove nodes) based on state to meet time/resource constraints.
- The approach improves on a linear regression baseline, reducing job failures in tests across different execution-time and worker-node limits.
- An open-source implementation including Docker images and source code is provided.
1. Auto-scaling Apache Spark cluster using Deep Reinforcement Learning
Kundjanasith Thonglek¹, Kohei Ichikawa¹, Chatchawal Sangkeettrakan², Apivadee Piyatumrong²
¹ Nara Institute of Science and Technology (NAIST), Japan
² National Electronics and Computer Technology Center (Nectec), Thailand
OLA’2019 : International Conference on Optimization and Learning
2. Agenda
Introduction
Methodology
Evaluation
Conclusion
3. Introduction
Big data and advanced analytics technologies are attracting much attention, not just because the size of the data is big but also because its potential impact is big.
A real-time application may have to handle different sizes of input data at different times, as well as different machine learning techniques for different purposes at the same time.
Engineers need to handle large-scale data processing systems efficiently. However, data processing science is a relatively new field that requires advanced knowledge of a huge variety of techniques, tools, and theories.
4. Apache Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets.
Spark operations (illustrated in the example below) :
- Transformation : passes each dataset element through a function and returns a new RDD representing the results
- Action : aggregates all the elements of the RDD using some function and returns the final result to the driver program
[Diagram: RDD → Transformation → RDD → ... → RDD → Action → Value]
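As a concrete illustration (not taken from the slides), a minimal PySpark sketch of a transformation followed by an action:

```python
# Minimal PySpark sketch of a transformation followed by an action.
# Assumes a local Spark installation; the data here is illustrative only.
from pyspark import SparkContext

sc = SparkContext("local[*]", "transformation-vs-action")

rdd = sc.parallelize([1, 2, 3, 4, 5])        # source RDD
squared = rdd.map(lambda x: x * x)           # Transformation: returns a new RDD, evaluated lazily
total = squared.reduce(lambda a, b: a + b)   # Action: aggregates elements, returns a value to the driver

print(total)  # 55
sc.stop()
```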
5. Apache Spark cluster
The Key Components of an Apache Spark cluster
[Diagram: Master Node (Driver Program, Spark Context, Cluster Manager), Worker Nodes (Executors), and Data Node; the worker nodes are the scaling target]
Master Node
- Spark Context : essentially a client of Spark’s execution environment that acts as the master of the Spark application
Worker Node
- Executor : a distributed agent that is responsible for executing tasks
6. Problem statement
When should an Apache Spark cluster scale out or scale in its worker nodes so that a task completes within both the execution time limit constraint and the maximum number of worker nodes constraint?
[Diagram: resource usage over time under scale-out vs. scale-in]
7. Objectives
We will create an auto-scaling system that scales an Apache Spark cluster automatically on the OpenStack platform using a deep reinforcement learning technique.
- The system supports real-time processing and handles different sizes of input data at different times.
- The system completes tasks within the bounded time and resource constraints.
8. Auto-Scaling system
SCALING TECHNIQUE
- Rule-Based Scaling Technique : the cluster management system reads the cluster's current state, applies a pre-defined rule, and issues a scaling command to the cluster.
- Data-Driven Scaling Technique : the cluster management system feeds the current state and task status into data modeling; the resulting data model issues the scaling command.
[Diagram: the two scaling loops between the cluster and the cluster management system]
9. Methodology
Auto-scaling Apache Spark cluster using Deep Reinforcement Learning
- Set up Environment : set up the Apache Spark cluster on the OpenStack platform by configuring the Apache Spark cluster template
- Feature Selection : analyze features from the logs collected through the system API
- Applied DQN : apply Deep Q-Network (DQN), a deep reinforcement learning technique suited to this problem
- Auto-scaling system : design our auto-scaling system to connect the compute and scaling modules
10. Set up Environment
The OpenStack system is prepared with the Apache Spark cluster configuration in the necessary templates (master node template, worker node template, data node template, and Apache Spark cluster template), where one cluster must have at least one master node and one worker node.
The Apache Spark cluster is launched on the OpenStack platform in homogeneous mode; a provisioning sketch follows the node specification below.
Node :
- CPU : 4 vCPU
- Memory : 8 GB
- Storage disk : 20 GB
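The slides do not show how nodes are provisioned; as a hedged illustration, one worker node of this flavor could be added or removed through the openstacksdk API roughly as follows (the cloud name, image, flavor, and network names are hypothetical placeholders, not values from the paper):

```python
# Hypothetical sketch: booting/terminating a Spark worker VM via openstacksdk.
# Cloud name, image, flavor, and network below are placeholders; the flavor
# is meant to mirror the node spec above (4 vCPU / 8 GB / 20 GB).
import openstack

conn = openstack.connect(cloud="my-openstack")  # reads credentials from clouds.yaml

def scale_out(index: int):
    """Boot one additional worker node from the worker template."""
    return conn.compute.create_server(
        name=f"spark-worker-{index}",
        image_id=conn.compute.find_image("spark-worker-template").id,
        flavor_id=conn.compute.find_flavor("m1.large").id,
        networks=[{"uuid": conn.network.find_network("private").id}],
    )

def scale_in(server_name: str):
    """Terminate one worker node by name."""
    server = conn.compute.find_server(server_name)
    if server is not None:
        conn.compute.delete_server(server)
```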
11. Feature Selection
Six features are collected and analyzed from the cluster (a sampling sketch follows the list) :
- The percentage of memory usage when Apache Spark operates an action ( m_a )
- The percentage of memory usage when Apache Spark operates a transformation ( m_t )
- The percentage of CPU usage for user processes ( c_u )
- The percentage of CPU usage for system processes ( c_s )
- The percentage of network usage for inbound traffic ( b_i )
- The percentage of network usage for outbound traffic ( b_o )
[Diagram: collectors gather metrics from the cluster and analyzers derive the features]
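The slides do not name the collection mechanism; one way to sample the CPU and network features on a node (an assumption for illustration, using the psutil library; the per-operation memory split m_a / m_t would additionally require Spark's own job-level metrics) is:

```python
# Illustrative metric sampling with psutil (an assumption; the paper's own
# collector is not shown). All returned values are percentages.
import psutil

def sample_features(interval: float = 1.0, link_bytes_per_sec: float = 125e6):
    """Sample (c_u, c_s, b_i, b_o) over `interval` seconds.

    link_bytes_per_sec is an assumed 1 Gbit/s NIC capacity, used to express
    the raw byte counters as a utilization percentage.
    """
    net0 = psutil.net_io_counters()
    cpu = psutil.cpu_times_percent(interval=interval)  # blocks for `interval`
    net1 = psutil.net_io_counters()
    b_i = 100.0 * (net1.bytes_recv - net0.bytes_recv) / (interval * link_bytes_per_sec)
    b_o = 100.0 * (net1.bytes_sent - net0.bytes_sent) / (interval * link_bytes_per_sec)
    return cpu.user, cpu.system, b_i, b_o
```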
12. Deep Reinforcement Learning
[Diagram: the deep reinforcement learning agent observes the Apache Spark cluster on the OpenStack platform and issues scaling actions, guided by the constraints and the reward function]
State
The current state of the Apache Spark cluster is acquired as the features: [ State ] : m_a, m_t, c_u, c_s, b_i, b_o.
Action
The scaling action together with the number of worker nodes y to scale: [ Action ] : A_y^o | A_0^neutral | A_y^i.
Agent
Deep Q-Network (DQN) is the network that learns from the features and takes the action (sketched below).
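The slides give only the agent's role; a minimal DQN sketch in PyTorch, mapping the six-feature state to the 2(N-1)+1 scaling actions, might look like the following (an assumption for illustration: the layer sizes, hyperparameters, and replay buffer are not from the paper, and a target network is omitted for brevity):

```python
# Minimal DQN sketch in PyTorch (illustrative; architecture and
# hyperparameters are assumptions, not taken from the paper).
import random
from collections import deque

import torch
import torch.nn as nn

N = 3                        # maximum number of worker nodes (example)
STATE_DIM = 6                # m_a, m_t, c_u, c_s, b_i, b_o
N_ACTIONS = 2 * (N - 1) + 1  # scale-in / neutral / scale-out actions

q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)   # (state, action, reward, next_state) tuples
GAMMA, EPSILON = 0.99, 0.1

def select_action(state):
    """Epsilon-greedy action over the Q-values for the current state."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(torch.tensor(state, dtype=torch.float32)).argmax())

def train_step(batch_size=32):
    """One TD(0) update on a random minibatch of transitions."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2 = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + GAMMA * q_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```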
13. States & Constraints
The states are the possible environment statuses of the system under study. In our scenario, the Apache Spark cluster is spawned with at least one master node and one worker node, based on the pre-configured OpenStack template, for scaling purposes.
If the maximum number of worker nodes is N, then the number of possible states is N.
Assumption : the maximum number of worker nodes is 3, giving the states S1, S2, S3, each under the constraints [ T, 3 ].
[ T, N ] are the environment constraints.
- Time constraint [ T ] : the expected bound on execution time.
- Resource constraint [ N ] : the maximum number of worker nodes.
14. Actions
The actions let deep reinforcement learning scale the Apache Spark cluster. There are three kinds of scaling action: (1) scaling-out, (2) not-scaling, and (3) scaling-in.
If the maximum number of worker nodes is N, then the number of possible actions is 2(N-1) + 1.
Assumption : the maximum number of worker nodes is 3, giving five actions: A_0^neutral, A_1^o, A_2^o, A_1^i, A_2^i, where A_y^o scales out by y workers and A_y^i scales in by y workers (see the sketch below).
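A tiny sketch of this action space (the (direction, y) tuple encoding is illustrative, not from the paper):

```python
# Enumerate the 2(N-1) + 1 scaling actions for a given worker-node cap N.
def action_space(N: int):
    actions = [("neutral", 0)]                    # A_0^neutral
    actions += [("out", y) for y in range(1, N)]  # A_1^o ... A_{N-1}^o
    actions += [("in", y) for y in range(1, N)]   # A_1^i ... A_{N-1}^i
    return actions

assert len(action_space(3)) == 2 * (3 - 1) + 1    # 5 actions, as on the slide
print(action_space(3))
# [('neutral', 0), ('out', 1), ('out', 2), ('in', 1), ('in', 2)]
```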
15. Reward Function
The reward equation gives a reward (r) to the agent when it decides to scale the cluster, which must keep at least one worker node. The reward function uses the features selected and explained earlier, together with the constraints of the cluster state ( m_a, m_t, c_u, c_s, b_i, b_o, T, N ). Furthermore, it takes into account the number of worker nodes y scaled by the action.
w(y) =
  +y, when A_y^o ; the agent takes a scaling-out action
   0, when A_0^neutral ; the agent takes a not-scaling action
  -y, when A_y^i ; the agent takes a scaling-in action

The reward function is defined as

r = ( 1 - w / (N - 1) ) + ( m_a + m_t + c_u + c_s + b_i + b_o ) / U + ( 1 + (T - t) / T )

where t is the execution time of this round and U is the number of features.
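A direct transcription of this reward into code (the grouping of terms follows this document's reconstruction of the garbled formula; names match the slide's symbols):

```python
# Reward function transcribed from the slide; the term grouping is this
# document's reading of the original layout.
def w(direction: str, y: int) -> int:
    """Signed scaled-worker count: +y for scale-out, 0 for neutral, -y for scale-in."""
    return {"out": +y, "neutral": 0, "in": -y}[direction]

def reward(features, direction, y, t, T, N):
    """features = (m_a, m_t, c_u, c_s, b_i, b_o) as percentages."""
    U = len(features)                  # number of features
    scaling_term = 1 - w(direction, y) / (N - 1)
    feature_term = sum(features) / U   # average feature utilization
    time_term = 1 + (T - t) / T        # > 1 when finishing under the limit
    return scaling_term + feature_term + time_term

# Example: a 5-minute-limit job finishes in 4 minutes after scaling out 1 of 3 workers
print(reward((40, 35, 60, 10, 20, 15), "out", 1, t=4, T=5, N=3))
```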
17. Evaluation
The auto-scaling system for the Apache Spark cluster using deep reinforcement learning is evaluated on a 5 GB dataset processed via streaming. Each environment constraint is tested 100 times.
It is evaluated under two constraints :
(1) The limit execution time constraint ( T )
(2) The maximum number of worker nodes constraint ( N )
T = { 5, 6, 7, 8, 9, 10 } minutes
N = { 5, 6, 7, 8, 9, 10 } nodes
18. The Percentage of Job Failure with Different Optimization Models
[Charts: percentage of job failure for Deep Q-Network (DQN, our model) vs. Linear Regression (LR, baseline)]
19. The Sacrifice and Stabilize period of DQN and LR
Time constraint (T)  |      5       |       6        |      7        |       8       |    9
# Experiment         |  LR | DQN    |  LR    | DQN   |  LR    | DQN  |  LR    | DQN  | LR | DQN
1 - 25               |  4  | 5, L=9 |  4     | 5, L=7|  2     | 2, L=3|  0     |  0  |  0 |  0
26 - 50              |  2  |   0    |  3     |   0   |  1     |   0  | 1, L=34|  0   |  0 |  0
51 - 75              |  2  |   0    | 2, L=73|   0   |  1     |   0  |  0     |  0   |  0 |  0
76 - 100             | 2, L=90 | 0  |  0     |   0   | 1, L=84|   0  |  0     |  0   |  0 |  0
The maximum number of worker nodes constraint is 5 worker nodes. Entries are failure counts, and L is the experiment round in which the last failure happened.
20. Conclusion
● We studied how to automatically optimize the scaling of computing nodes in an Apache Spark cluster using a deep reinforcement learning technique.
● We found six significant features that directly impact the performance of real-time applications running on an Apache Spark cluster.
● We improved the performance of the cluster under two constraints: the limit on execution time and the maximum number of worker nodes per cluster.
21. Implementation
We provide Docker images on Docker Hub and source code on GitHub.
https://hub.docker.com/r/kundjanasith/kitwai-engine/
https://hub.docker.com/r/kundjanasith/kitwai-ai/
https://github.com/Kundjanasith/scaling-sparkcluster/
Email : thonglek.kundjanasith.ti7@is.naist.jp
22. Thank You
Q & A
Kundjanasith Thonglek
Software Design & Analysis Laboratory, NAIST