Fuzzy Self-Learning Controllers for Elasticity Management in Dynamic Cloud Architectures

Pooyan Jamshidi
Imperial College London
p.jamshidi@imperial.ac.uk
Invited talk at Sharif University of Technology
12 April 2016

Motivation
~50% = wasted hardware
Actual traffic
Typical weekly traffic to Web-based applications (e.g., Amazon.com)

Motivation
Problem 1: ~75% wasted capacity
Actual demand
Problem 2: customers lost
Traffic during an unexpected burst of requests (e.g., end-of-year traffic to Amazon.com)

Motivation
Really like this?
Auto-scaling enables you to realize this ideal of on-demand provisioning.
[Figure: demand over time.]
Enacting changes to cloud resources is not real-time.

Motivation
Capacity we can provision
with Auto-Scaling
A realistic figure of dynamic provisioning

Research Challenges
• Challenge 1. Predicting parameter values ahead of time.
• Challenge 2. Qualitative specification of thresholds.
• Challenge 3. Robust control despite uncertainty in measurement data.

Predictable vs. Unpredictable Demand
[Figure: four example workload traces plotted over time.]

An Example of Auto-scaling Rule
These quantitative values must be determined by the user:
• requires deep knowledge of the application (CPU, memory, thresholds)
• requires performance modeling expertise (when and how to scale)
• requires a unified opinion of the user(s)
Examples: Amazon Auto Scaling, Microsoft Azure Watch, Microsoft Azure Auto-scaling Application Block
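To make the burden on the user concrete, here is a minimal sketch of a conventional threshold-based rule. It is not taken from any of the products named above; the class name `ThresholdRule` and every number in it (CPU thresholds, step size, VM bounds) are hypothetical values that a user would have to choose.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """A conventional auto-scaling rule: every number is a user decision."""
    cpu_high: float = 80.0  # hypothetical scale-out threshold (% CPU)
    cpu_low: float = 30.0   # hypothetical scale-in threshold (% CPU)
    step: int = 1           # hypothetical number of VMs added/removed
    min_vms: int = 1
    max_vms: int = 6

    def decide(self, cpu_percent: float, current_vms: int) -> int:
        """Return the target number of VMs for the next control interval."""
        if cpu_percent > self.cpu_high:
            target = current_vms + self.step
        elif cpu_percent < self.cpu_low:
            target = current_vms - self.step
        else:
            target = current_vms
        # Clamp to the allowed range.
        return max(self.min_vms, min(self.max_vms, target))


rule = ThresholdRule()
print(rule.decide(cpu_percent=90.0, current_vms=2))
```

Each default above encodes application knowledge and performance-modeling expertise, which is the difficulty this slide points at.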

Sources of Uncertainty in Elastic Software
P. Jamshidi, C. Pahl, N. Mendonca, “Managing Uncertainty in Autonomic Cloud Elasticity Controllers”, IEEE Cloud Computing, 2016.
P. Jamshidi, C. Pahl, “Software Architecture for the Cloud – a Roadmap towards Control-Theoretic, Model-Based Cloud Architecture”, LNCS, 2015.

A concrete example of uncertainty in the cloud
Uncertainty related to enactment latency:
The same scaling action (adding or removing a VM of precisely the same size) took a different amount of time to be enacted on the cloud platform (here, Microsoft Azure) at different points in time, and the differences were significant (up to a couple of minutes).
The enactment latency also differs across cloud platforms.

Goal!
• Take the burden away from the user
─ users specify the thresholds in qualitative, linguistic terms
─ the auto-scaling controller should be fully responsible for scaling decisions
─ the auto-scaling controller should be robust against uncertainty
• Offline benchmarking, trial-and-error, and expert knowledge are costly and not systematic
A. Gandhi, P. Dube, A. Karve, A. Kochut, L. Zhang, “Adaptive, Model-driven Autoscaling for Cloud Applications”, ICAC’14
[Figure: 95th-percentile response time (ms) versus arrival rate (req/s); marked values: 400 ms and 60 req/s.]

RobusT2Scale: Architectural Overview
[Diagram labels: Users; initial setting + elasticity rules + response-time SLA; RobusT2Scale (prediction/smoothing, fuzzy reasoning); environment monitoring; application monitoring; scaling actions.]

Internal Details of RobusT2Scale
Code: https://github.com/pooyanjamshidi/RobusT2Scale

Why did we decide to use type-2 fuzzy logic?
• Words can mean different things to different people.
• Different users often recommend different elasticity policies.
[Figure: Type-1 MF versus Type-2 MF; x-axis: Performance Index, y-axis: Possibility; regions of definite satisfaction, uncertain satisfaction, and definite dissatisfaction.]

How did we design the fuzzy controller?
- The fuzzy logic controller is completely defined by its
“membership functions” and “fuzzy rules”.
- Knowledge elicitation through a survey of 10 experts.

Survey processing and fuzzy MF construction
[Figure: survey responses for workload and response time are turned into membership functions (membership grade u, from 0 to 1, over a 0 to 100 scale) using the mean and sd of the responses; labels: UMF, LMF, Embedded, FOU.]

Fuzzy rule elicitation
Antecedents: workload and response time. Consequent columns: number of experts choosing each level. Last column: c_avg^l.

Rule  Workload   Response time  Normal  Effort  Medium      High         Maximum      c_avg^l
(l)                             (-2)    (-1)    Effort (0)  Effort (+1)  Effort (+2)
1     Very low   Instantaneous  7       2       1           0            0            -1.6
2     Very low   Fast           5       4       1           0            0            -1.4
3     Very low   Medium         0       2       6           2            0            0
4     Very low   Slow           0       0       4           6            0            0.6
5     Very low   Very slow      0       0       0           6            4            1.4
6     Low        Instantaneous  5       3       2           0            0            -1.3
7     Low        Fast           2       7       1           0            0            -1.1
8     Low        Medium         0       1       5           3            1            0.4
9     Low        Slow           0       0       1           8            1            1
10    Low        Very slow      0       0       0           4            6            1.6
11    Medium     Instantaneous  6       4       0           0            0            -1.6
12    Medium     Fast           2       5       3           0            0            -0.9
13    Medium     Medium         0       0       5           4            1            0.6
14    Medium     Slow           0       0       1           7            2            1.1
15    Medium     Very slow      0       0       1           3            6            1.5
16    High       Instantaneous  8       2       0           0            0            -1.8
17    High       Fast           4       6       0           0            0            -1.4
18    High       Medium         0       1       5           3            1            0.4
19    High       Slow           0       0       1           7            2            1.1
20    High       Very slow      0       0       0           6            4            1.4
21    Very high  Instantaneous  9       1       0           0            0            -1.9
22    Very high  Fast           3       6       1           0            0            -1.2
23    Very high  Medium         0       1       4           4            1            0.5
24    Very high  Slow           0       0       1           8            1            1
25    Very high  Very slow      0       0       0           4            6            1.6
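Example (rule 12, 10 experts’ responses): workload is Medium and response time is Fast; the votes for levels -2, -1, 0, +1, +2 are 2, 5, 3, 0, 0, which gives c_avg = -0.9.
$R^l$: IF (the workload ($x_1$) is $\tilde{F}_{i_1}$, AND the response-time ($x_2$) is $\tilde{G}_{i_2}$), THEN (add/remove $c_{avg}^l$ instances).
$$c_{avg}^l = \frac{\sum_{u=1}^{N_l} w_u^l \times C_u}{\sum_{u=1}^{N_l} w_u^l}$$
where $w_u^l$ is the number of experts who chose consequent level $u$ for rule $l$, and $C_u$ is the value of that level (from -2 to +2).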
Goal: pre-compute the costly calculations so that elasticity reasoning based on fuzzy inference is efficient at runtime.
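The consequent of each rule is a vote-weighted average of the consequent levels, so it can be computed offline. A minimal sketch, assuming the five levels are valued -2 to +2 as in the table; the function name `consequent_centroid` is illustrative and not taken from the RobusT2Scale code.

```python
# Values of the five consequent levels, from "Normal" to "Maximum Effort".
LEVELS = (-2, -1, 0, 1, 2)


def consequent_centroid(votes, levels=LEVELS):
    """Weighted average of level values, weighted by the number of expert votes."""
    if len(votes) != len(levels):
        raise ValueError("one vote count is needed per level")
    total = sum(votes)
    if total == 0:
        raise ValueError("at least one expert response is required")
    return sum(w * c for w, c in zip(votes, levels)) / total


# Rule 12 (workload Medium, response time Fast): votes 2, 5, 3, 0, 0.
rule_12 = consequent_centroid([2, 5, 3, 0, 0])
print(rule_12)  # -0.9
```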

Elasticity Reasoning @ Runtime
Liang, Q., Mendel, J. M. (2000). Interval type-2 fuzzy logic systems: theory and design. IEEE Transactions on Fuzzy Systems, 8(5), 535-550.
Scaling Actions
Monitoring Data

Fuzzification
[Figure: fuzzification of the monitoring data. The workload and response-time measurements are mapped onto membership functions (membership grade u, from 0 to 1, over a 0 to 100 scale); each measurement yields an upper and a lower grade, for example 0.5954 and 0.3797 for one fuzzy set and 0.2212 and 0.0000 for another.]
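With interval type-2 sets, fuzzifying a measurement yields a pair of grades (lower and upper) rather than one number. The sketch below assumes a Gaussian membership function with an uncertain standard deviation, which is one common way to build such a set; the slides do not state the exact shape used, and the parameter values are hypothetical.

```python
import math


def gaussian(x, mean, sd):
    """Type-1 Gaussian membership grade of x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2)


def it2_membership(x, mean, sd_low, sd_high):
    """Return (lower, upper) grades for a Gaussian set with sd in [sd_low, sd_high].

    The narrower Gaussian is the lower membership function (LMF) and the
    wider one is the upper membership function (UMF).
    """
    if not 0 < sd_low <= sd_high:
        raise ValueError("expected 0 < sd_low <= sd_high")
    return gaussian(x, mean, sd_low), gaussian(x, mean, sd_high)


# Hypothetical set centred at 50 on the 0-100 scale, evaluated at x = 40.
lower, upper = it2_membership(x=40.0, mean=50.0, sd_low=8.0, sd_high=12.0)
print(lower, upper)
```

The gap between the two grades is the footprint of uncertainty at that input.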

Inference Mechanism
[Figure: inference mechanism. The membership intervals of a rule's antecedents, here [0.3797, 0.5954] and [0.9377, 0.9568], are combined into the firing interval of the rule.]
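A rule's firing interval is obtained by applying a t-norm to the lower grades and to the upper grades separately. The sketch below uses the grades shown on this slide; which t-norm the controller uses (minimum or product) and which grade belongs to which input are assumptions here, so both t-norms are shown.

```python
def firing_interval(workload_grade, response_grade, t_norm=min):
    """Combine two (lower, upper) membership intervals into a firing interval."""
    (w_lo, w_hi), (r_lo, r_hi) = workload_grade, response_grade
    return t_norm(w_lo, r_lo), t_norm(w_hi, r_hi)


def product(a, b):
    """Product t-norm."""
    return a * b


workload = (0.3797, 0.5954)  # (lower, upper) grade, assumed to be the workload set
response = (0.9377, 0.9568)  # (lower, upper) grade, assumed to be the response-time set

f_min = firing_interval(workload, response)            # minimum t-norm
f_prod = firing_interval(workload, response, product)  # product t-norm
print(f_min, f_prod)
```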

Experimental Setting: Process View
[Figure: process view of the experimental setting, including six workload traces plotted over time.]

Estimation Errors w.r.t. Workload Patterns
[Figure, top: original data versus the estimate; x-axis: time (seconds), y-axis: number of hits; beta=0.10, gamma=0.94, RMSE=308.1565, RRSE=0.79703.]
[Figure, bottom: root relative squared error (0 to 1.4) for each workload pattern: big spike, dual phase, large variations, quickly varying, slowly varying, steep tri phase.]
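The two error measures reported here can be sketched as follows, using their standard definitions: RMSE is the root of the mean squared error, and RRSE normalises the squared error by that of a predictor that always outputs the mean of the actual data. The function names are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def rrse(actual, predicted):
    """Root relative squared error, relative to always predicting the mean."""
    mean = sum(actual) / len(actual)
    num = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    den = sum((a - mean) ** 2 for a in actual)
    if den == 0:
        raise ValueError("actual data has zero variance")
    return math.sqrt(num / den)


# Predicting the mean everywhere gives an RRSE of exactly 1.
print(rrse([1, 2, 3], [2, 2, 2]))
```

An RRSE below 1 means the estimate beats the mean predictor; a value above 1 means it does worse.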

Workload Prediction and Its Accuracy
[Figure, left: forecasting with double exponential smoothing; x-axis: time (seconds), y-axis: number of hits; series: observed data, smoothed data, forecast.]
[Figure, right: root mean squared error (RMSE) versus alpha and gamma.]
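Double exponential smoothing tracks a level and a trend and extrapolates both to forecast the workload. This is a minimal sketch of the textbook (Holt) form, not the code used in the talk; treating alpha as the level factor and gamma as the trend factor, and the initialisation from the first two samples, are assumptions.

```python
def double_exponential_smoothing(series, alpha, gamma, horizon=1):
    """Return (smoothed values, forecasts for the next `horizon` steps)."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]  # initial trend from the first two samples
    smoothed = [level]
    for y in series[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step-ahead estimate.
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # Blend the newly observed slope with the previous trend.
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
        smoothed.append(level)
    forecast = [level + h * trend for h in range(1, horizon + 1)]
    return smoothed, forecast


# A perfectly linear workload is extrapolated along the same line.
smoothed, forecast = double_exponential_smoothing(
    [10, 20, 30, 40, 50], alpha=0.5, gamma=0.5, horizon=2
)
print(forecast)
```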

Effectiveness of RobusT2Scale
SUT                Criteria  Big spike  Dual phase  Large variations  Quickly varying  Slowly varying  Steep tri phase
RobusT2Scale       rt_95     973ms      537ms       509ms             451ms            423ms           498ms
                   vm        3.2        3.8         5.1               5.3              3.7             3.9
Overprovisioning   rt_95     354ms      411ms       395ms             446ms            371ms           491ms
                   vm        6          6           6                 6                6               6
Underprovisioning  rt_95     1465ms     1832ms      1789ms            1594ms           1898ms          2194ms
                   vm        2          2           2                 2                2               2
SLA: rt_95 ≤ 600 ms for every 10 s control interval
• RobusT2Scale is superior to under-provisioning in guaranteeing the SLA, and it does not require excessive resources.
• RobusT2Scale is superior to over-provisioning in the resources it requires, while still guaranteeing the SLA.

Robustness of RobusT2Scale
[Figure: root mean square error (0 to 0.1) for alpha = 0.1, 0.5, 0.9, and 1.0 at a noise level of 10%.]

Initial work (SEAMS paper)
Current work: updating K in MAPE-K @ runtime (QoSA16, IEEE Cloud)
Design-time assistance
Multi-cloud

A Model-Free Reinforcement Learning Approach
[Diagram: an environment with states S1 to S5 and actions a1 to a6. The RL agent observes the state of the environment, takes an action, and receives a reward or punishment; from these it establishes a policy and a value function. Other labels: Calibrate, Establish, Determine, Derive.]

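FQL4KE: Logical Architecture
[Diagram: the autonomic controller combines a fuzzy logic controller (fuzzifier, inference engine, rule base, defuzzifier) with knowledge learning (fuzzy Q-learning). Monitoring of the cloud application on the cloud platform supplies the system state (w, rt, th, vm); w and rt feed the fuzzy logic controller, and the actuator applies the scaling action sa. The system goal is a further input.]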

Fuzzy Q-Learning
Algorithm 1: Fuzzy Q-Learning
Require: $\gamma$, $\eta$, $\epsilon$
1: Initialize q-values:
   $q[i, j] = 0$, $1 \le i \le N$, $1 \le j \le J$
2: Select an action for each fired rule (Eq. 5):
   $a_i = \arg\max_k q[i, k]$ with probability $1 - \epsilon$
   $a_i = \mathrm{random}\{a_k, k = 1, 2, \dots, J\}$ with probability $\epsilon$
3: Calculate the control action by the fuzzy controller (Eq. 1):
   $a = \sum_{i=1}^{N} \alpha_i(s) \times a_i$,
   where $\alpha_i(s)$ is the firing level of rule $i$
4: Approximate the Q function from the current q-values and the firing level of the rules:
   $Q(s(t), a) = \sum_{i=1}^{N} \alpha_i(s) \times q[i, a_i]$,
   where $Q(s(t), a)$ is the value of the Q function for the current state $s(t)$ in iteration $t$ and the action $a$
5: Take action $a$ and let the system go to the next state $s(t+1)$.
6: Observe the reinforcement signal $r(t+1)$ and compute the value of the new state:
   $V(s(t+1)) = \sum_{i=1}^{N} \alpha_i(s(t+1)) \cdot \max_k q[i, k]$
7: Calculate the error signal (Eq. 4):
   $\Delta Q = r(t+1) + \gamma \times V_t(s(t+1)) - Q(s(t), a)$,
   where $\gamma$ is a discount factor
8: Update q-values (Eq. 4):
   $q[i, a_i] = q[i, a_i] + \eta \cdot \Delta Q \cdot \alpha_i(s(t))$,
   where $\eta$ is a learning rate
9: Repeat the process for the new state until it converges
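The steps of Algorithm 1 can be sketched in a few lines. This is an illustrative reading of the algorithm, not the code from the linked repository; the class name `FuzzyQLearner`, the default parameter values, and the assumption that firing levels are already normalised are all hypothetical.

```python
import random


class FuzzyQLearner:
    def __init__(self, n_rules, actions, gamma=0.8, eta=0.1, epsilon=0.2, seed=0):
        self.actions = list(actions)
        self.gamma, self.eta, self.epsilon = gamma, eta, epsilon
        # Step 1: one q-value per (rule, action) pair, initialised to zero.
        self.q = [[0.0] * len(self.actions) for _ in range(n_rules)]
        self.rng = random.Random(seed)

    def act(self, firing):
        """Steps 2-4: return (control action a, Q(s, a), chosen action indices)."""
        chosen = []
        for row in self.q:
            if self.rng.random() < self.epsilon:  # explore
                k = self.rng.randrange(len(self.actions))
            else:  # exploit: index of the largest q-value (first one on ties)
                k = max(range(len(row)), key=row.__getitem__)
            chosen.append(k)
        # Rules with firing level 0 contribute nothing to either sum.
        a = sum(f * self.actions[k] for f, k in zip(firing, chosen))
        q_sa = sum(f * self.q[i][k] for i, (f, k) in enumerate(zip(firing, chosen)))
        return a, q_sa, chosen

    def update(self, firing, chosen, q_sa, reward, next_firing):
        """Steps 6-8: compute the error signal and update the q-values."""
        v_next = sum(f * max(row) for f, row in zip(next_firing, self.q))
        delta = reward + self.gamma * v_next - q_sa
        for i, (f, k) in enumerate(zip(firing, chosen)):
            self.q[i][k] += self.eta * delta * f
        return delta


# One learning step with two rules and three scaling actions (-1, 0, +1 VM).
learner = FuzzyQLearner(n_rules=2, actions=[-1, 0, 1], epsilon=0.0)
a, q_sa, chosen = learner.act([0.25, 0.75])
delta = learner.update([0.25, 0.75], chosen, q_sa, reward=1.0, next_firing=[0.5, 0.5])
print(a, delta, learner.q)
```

Step 5 (enacting the action and waiting for the next state) happens between `act` and `update` and is left to the caller.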
[Figure: membership functions, with grades from 0 to 1. Workload: Low, Medium, High, with breakpoints α, β, γ, δ. Response time: Bad, OK, Good, with breakpoints λ, μ, ν.]
The controller receives the current values of w and rt that correspond to the state of the system, s(t) (cf. Step 4 in Algorithm 1). The control signal sa represents the action a that the controller takes at each loop. We define the reward signal r(t) based on three criteria: (i) the number of violations of the desired response time, (ii) the amount of resources acquired, and (iii) throughput, as follows:
$$r(t) = U(t) - U(t-1), \qquad (6)$$
where U(t) is the utility value of the system at time t. Hence, if a controlling action leads to an increased utility, the action is appropriate. If the reward is close to zero, the action is not effective. A negative reward (punishment) warns that the situation became worse after taking the action. The utility function is defined as:
$$U(t) = w_1 \cdot \frac{th(t)}{th_{max}} + w_2 \cdot \left(1 - \frac{vm(t)}{vm_{max}}\right) + w_3 \cdot (1 - H(t)) \qquad (7)$$
$$H(t) = \begin{cases} \dfrac{rt(t) - rt_{des}}{rt_{des}} & rt_{des} \le rt(t) \le 2 \cdot rt_{des} \\ 1 & rt(t) \ge 2 \cdot rt_{des} \\ 0 & rt(t) \le rt_{des} \end{cases}$$
where th(t), vm(t) and rt(t) are the throughput, number of worker roles and response time of the system, respectively. w1, w2 and w3 are their corresponding weights, determining their relative importance.
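Equations (6) and (7) translate directly into code. A minimal sketch; the function names and the equal default weights are assumptions for illustration, not values from the paper.

```python
def violation(rt, rt_des):
    """H(t): normalised response-time violation, clipped to [0, 1]."""
    if rt <= rt_des:
        return 0.0
    if rt >= 2 * rt_des:
        return 1.0
    return (rt - rt_des) / rt_des


def utility(th, vm, rt, th_max, vm_max, rt_des, weights=(1 / 3, 1 / 3, 1 / 3)):
    """U(t) from Eq. (7): rewards throughput, few VMs, and meeting rt_des."""
    w1, w2, w3 = weights
    return (w1 * th / th_max
            + w2 * (1 - vm / vm_max)
            + w3 * (1 - violation(rt, rt_des)))


def reward(u_now, u_prev):
    """r(t) from Eq. (6): the change in utility caused by the last action."""
    return u_now - u_prev


# Hypothetical readings before and after a scale-out from 2 to 3 VMs.
u_prev = utility(th=400, vm=2, rt=900, th_max=1000, vm_max=6, rt_des=600)
u_now = utility(th=500, vm=3, rt=550, th_max=1000, vm_max=6, rt_des=600)
print(reward(u_now, u_prev))
```

A positive value means the action improved the utility; a negative value is a punishment.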
Code:
https://github.com/pooyanjamshidi/Fuzzy-Q-Learning

FQL4KE: Implementation Architecture
[Diagram labels: RobusT2Scale (.fis, learned rules); FQL (parameters γ, η, ε, r); ElasticBench (monitoring, actuator, load generator, cloud platform); signals: w, rt (controller inputs), w, rt, th, vm (system state), sa (scaling action); interfaces: WCF, REST.]

ElasticBench: The Experimental Platform
Code: https://github.com/pooyanjamshidi/ElasticBench
[Diagram labels. On-premise: LG (console), auto-scaling logic (controller), policy enforcer. Cloud platform (PaaS): LB (load balancer), L (web role), queue, three P (worker role) instances, cache, M (worker role), results (storage), blackboard (storage). Monitoring and actuator connect the two sides; interactions are numbered 1 to 12.]

Learning Strategies
[Figure: probability (0 to 1.2) over learning epochs (0 to 317) for the learning strategies S1, S2, S3, S4, and S5.]

Q-value Evolution
[Figure: evolution of the q-value q(9,3) over epochs 0 to 350 (values between -0.4 and 1.2), with phases marked as punishment, reward, and no change.]

Temporal Evolution of Acquired Nodes
[Figure: number of VMs (0 to 8) over experiment epochs (0 to 10000), with three phases: high exploration, less frequent exploration, mostly exploitation.]

Control Surface Evolution
[Figure: four snapshots (1 to 4) of the control surface as learning progresses.]

Workload Injected to the System
[Figure: six workload traces (user requests over time): big spike, dual phase, large variations, quickly varying, slowly varying, steep tri phase.]

Results
IV. Unlike supervised techniques that learn from training data, FQL4KE does not require off-line
training, which saves significant amounts of time and effort.
Table 3. Comparison of the effectiveness of FQL4KE, RobusT2Scale and Azure auto-scaling under different workloads.
Approach            Criteria  Big spike  Dual phase  Large variations  Quickly varying  Slowly varying  Steep tri phase
FQL4KE              rt_95     1212ms     548ms       991ms             1319ms           512ms           561ms
                    vm        2.2        3.6         4.3               4.4              3.6             3.4
RobusT2Scale        rt_95     1339ms     729ms       1233ms            1341ms           567ms           512ms
                    vm        3.2        3.8         5.1               5.3              3.7             3.9
Azure auto-scaling  rt_95     1409ms     712ms       1341ms            1431ms           1101ms          1412ms
                    vm        3.3        4           5.5               5.4              3.7             4

5. Conclusions
We propose a new learning based self-adaptation framework, called MAPE-KE, which is
particularly suited for engineering elastic systems that need to cope with uncertain
environments, such as cloud and big data, and need to be robust enough by taking the human
- FQL4KE performs better than Azure’s native auto-scaling service and better than or similar to RobusT2Scale.
- FQL4KE can learn to acquire resources for dynamic cloud systems.
- FQL4KE is flexible enough to allow the operator to set different elasticity strategies.

Runtime Overhead
[Figure: runtime overhead of the monitoring, learning, and actuation phases.]

Insight: MAPE-K -> MAPE-KE
[Diagram: three levels. Base level: the elastic system and its environment (cloud, sensors, actuators). Meta level: MAPE-K (monitoring, analysis, planning, execution, knowledge). Meta-meta level: KE (offline training, online learning, knowledge update). Other labels: Users, S, A.]

Current work
- Implementation on OpenStack with Intel, Ireland
- Policy learning through other ML techniques (e.g., GP, BO)

Challenge 1: ~75% wasted capacity
Challenge 2: customers lost
[Diagrams repeated from earlier slides: FQL4KE logical architecture and FQL4KE implementation architecture.]

More details? => http://www.doc.ic.ac.uk/~pjamshid/PDF/qosa16.pdf
Slides? => http://www.slideshare.net/pooyanjamshidi/
Code? => https://github.com/pooyanjamshidi
Thank you!

Submit to CloudWays 2016!
Paper submission deadline July 1st, 2016
Decision notification August 1st, 2016
Final version due August 8th, 2016
Workshop date September 5th, 2016
Collocated with ESOCC, Vienna
Topics: Cloud Architecture, Big Data, DevOps
- Details: https://sites.google.com/site/cloudways2016/call-for-papers
