A Reinforcement Learning Approach for Dynamic
Selection of Virtual Machines in Cloud Data Centres
Martin Duggan
National University Of
Ireland, Galway
m.duggan1@nuigalway.ie
Kieran Flesk
National University Of
Ireland, Galway
k.flesk2@nuigalway.ie
Jim Duggan
National University Of
Ireland, Galway
jim.duggan@nuigalway.ie
Enda Howley
National University Of
Ireland, Galway
ehowley@nuigalway.ie
Enda Barrett
National University Of
Ireland, Galway
enda.barrett@nuigalway.ie
Abstract—In recent years Machine Learning techniques have
proven to reduce energy consumption when applied to cloud
computing systems. Reinforcement Learning provides a promis-
ing solution for the reduction of energy consumption, while
maintaining a high quality of service for customers. We present
a novel single agent Reinforcement Learning approach for the
selection of virtual machines, creating a new energy efficiency
practice for data centres. Our dynamic Reinforcement Learning
virtual machine selection policy learns to choose the optimal
virtual machine to migrate from an over-utilised host. Our
experiment results show that a learning agent has the ability
to reduce energy consumption and decrease the number of
migrations when compared to a state-of-the-art approach.
Keywords—Live Migration, Energy, Reinforcement Learning.
I. INTRODUCTION
Recent research in cloud computing has highlighted the
increasing environmental impact of data centres with regards
to electricity usage and associated CO2 emissions. According
to Koomey et al., between 2005 and 2010, data centre
power consumption increased by 56% and in 2010, 1.3% of all
power consumed worldwide was due to data centre operations
[14]. Studies by Barriso et al. and Koomey et al. highlight that
increases in energy consumption range in the billions of kWh
[4], [13], [14]. Furthermore a study by Petty et al. in 2007
highlighted that the Information Communication Technology
(ICT) industry contributed to 2% of global CO2 emission each
year, putting it on par with the aviation industry [15]. However,
a report by Brown in 2007 directly addresses this issue by
stating that existing technologies and strategies could reduce
typical server energy usage by an estimated 25% [8]. This
report highlights that if state-of-the-art energy efficiency practices
were implemented throughout U.S. data centres, energy consumption
could be reduced by an estimated 55% compared
to current levels. Cloud computing leverages infrastructure as
a service (IaaS), platform as a service (PaaS) and software as
a service (SaaS) to provide virtualised computing resources to
customers. A key aspect of the IaaS platform is to provide
computer infrastructure, for example virtualised hardware,
virtual server space, network connections and bandwidth. The
IaaS layer consists of multiple servers and networks, which can be
distributed across and divided among numerous data centres.
The virtualised structure of the IaaS platform allows for the
re-allocation of resources through migrating a virtual machine
(VM) or a group of VMs among hosts. Migration is the
process of moving VMs from one physical host to another
based on resource allocation or power saving techniques,
however migration has a considerable impact on energy con-
sumption in data centres. Migration is triggered once a host
transitions into an over-utilised or under-utilised state; a
virtual machine (VM) or VMs must then be relocated to another
host that has sufficient resources, to ensure quality of service
(QoS) is guaranteed for both customer and cloud provider.
Live VM migration is commonly used in cloud data centres
due to its capability of maintaining high system performance
under dynamic workloads. Live migration has been studied
extensively and has shown to be of benefit to cloud providers
[1]. Live migration of a single VM requires computing
resources and can increase energy consumption
in a data centre. A group of migrations, however, requires a
significant amount of resources and energy, which may lead to
violations of the Service Level Agreement (SLA),
the contract between cloud provider and cloud customer.
When a service level agreement violation (SLAV) occurs, it
causes increased overhead penalties for the cloud provider.
The longer a host stays in an over-utilised state the greater the
energy consumption. Thus selecting the optimal VM to migrate
from a host presents a complex and challenging problem.
In this paper, we focus on the selection of a VM to
migrate from an over-utilised host. We propose a dynamic
Reinforcement Learning (RL) VM selection policy, enabling a
single learning agent to decide on an optimal VM for migration
from an over-utilised host depending on the current energy
consumption. This agent-based VM selection policy directly
addresses the issues highlighted in [8] by implementing a new
state-of-the-art energy efficiency practice. Our algorithm is
shown to reduce energy consumption and decrease migration
of VMs, leading to a greener cloud data centre.
The contributions of this paper are:
1) We present a novel Reinforcement Learning VM selection
policy (Lr-RL) with the capability to decide on
an appropriate VM to migrate from an over-utilised
host. We create a novel state-action space, based
on the CPU utilisation percentage of the host and
of the VM to be migrated, to show how a Reinforcement
Learning algorithm can improve upon a state-of-the-art
approach in terms of energy consumption.
2) We show how an autonomous VM selection policy has
the ability to reduce energy consumption, creating
a more efficient cloud data centre.
The rest of this paper is structured as follows: Section II
discusses related work, Section III introduces Reinforcement
Learning, Section IV presents our Dynamic RL-VM selection
Policy, Section V describes the experimental setup, Section VI
evaluates and analyses our results, and Section VII concludes
the paper.
II. RELATED WORK
In recent years much research has been conducted in the
areas of energy efficiency, dynamic resource selection and
allocation policies for cloud infrastructure. These approaches
can be classified into two main categories: (1) Threshold and
Non-Threshold, (2) Machine Learning.
A. Threshold and Non-Threshold Based Approaches
An example of a non-threshold approach is that of Verma
et al., who implemented a power aware application placement
framework called pMapper [19]. This is designed to utilise
power management applications such as CPU idling, Dynamic
Voltage and Frequency Scaling (DVFS) and consolidation
techniques that already exist in hypervisors. These techniques
are leveraged via separate modules, mainly the performance
manager which has a global overview of the system and
receives information such as SLAs and QoS parameters. The
migration manager deals directly with the VMs to implement
live migration, the power manager communicates with the
infrastructure layer to manage hardware energy policies and
then the arbitrator decides, based on information supplied by the
above-mentioned policies, on the optimal placement of VMs
through a bin packing algorithm.
Threshold based approaches for autonomic scaling of re-
sources are commonplace, and are used by cloud providers
such as Amazon EC2 in their Auto Scaling software. Threshold
approaches are based on the premise of setting an upper and
lower bound threshold, that when broken trigger the allocation
or consolidation of resources as necessary. Research conducted
in the area of threshold based approaches includes a proposed
architecture known as 'the 1000 Islands solution architecture'
by Zhu et al. [22]. Similar to Verma et al., they consider
three separate application categories based on different time
periods, and then designate an individual controller to each
category. The largest timescale is hours to days, then minutes
and finally seconds. Each group is regarded as a pod and has
a node controller managing dynamic allocation of the node’s
resources. As part of the node controller, there is a utilisation
controller which computes resource consumption and estimates
what resources are required in order to meet SLAs in the
future.
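To make the threshold premise concrete, the following is a minimal sketch (in Python, purely illustrative and not any provider's implementation) of an upper/lower bound trigger: breaking the upper bound requests more resources, breaking the lower bound triggers consolidation. The 30%/80% bounds are assumed values.

```python
# Minimal sketch of a generic threshold-based scaling trigger: the bounds
# (lower=0.3, upper=0.8) are illustrative assumptions, not specific defaults.

def threshold_action(cpu_utilisation, lower=0.3, upper=0.8):
    if cpu_utilisation > upper:
        return "scale_out"      # upper bound broken: allocate additional resources
    if cpu_utilisation < lower:
        return "consolidate"    # lower bound broken: consolidate/release resources
    return "no_action"

print(threshold_action(0.92))  # -> scale_out
print(threshold_action(0.12))  # -> consolidate
```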
B. Reinforcement Learning Based Approaches
In recent years, Reinforcement Learning has proven to
be a promising approach for optimal allocation of cloud
resources. Barrett et al. proposed a parallel RL framework for
optimisation of scaling resources instead of the threshold based
approach [3]. Barrett's approach requires agents to approximate
optimal policies, and each agent shares its information with
a global agent to improve overall performance. This approach
has been empirically proven to outperform the traditional rigid
threshold-based approaches. Bahati proposed an RL approach
to help simplify the management of existing threshold based
rules, where a primary controller applies rules to a system to
enforce its quality attribute and a secondary controller monitors
the effects of implementing these rules and adapts thresholds
accordingly [2]. Tesauro introduces a hybrid RL approach to
optimising server allocations in data centres, through training
a nonlinear function approximator in batch mode on a
data set, while an externally trained policy makes management
decisions within a given system [18]. Both Farahnakian et al.
[12] and Yuan et al. [21] demonstrate how RL can be used to
optimise the number of active hosts in operation in a given time
frame. An RL agent learns an on-line host detection policy
and dynamically consolidates machines in line with optimal
parameters. Both studies implement the minimum migration
time selection policy proposed by Beloglazov et al. [7], after the
detection of over-utilised hosts, in order to identify VMs
for migration. Tan et al. use an RL agent to shut down, or
make idle, hosts that are at minimal power consumption [17].
Dutreilh et al. proposed an RL framework for autonomic resource
allocation in cloud domains [10]. They show how having good
learning policies in early phases using appropriate initialisation
and convergence, helps speed up learning in problems that
typically have a large convergence time.
This research is motivated by the fact that all of the above
RL approaches have been shown to hold a statistical advantage over
threshold-based approaches. We implement and evaluate RL at
a lower level of abstraction to learn policies for the selection of
VMs with the aim to reduce energy consumption and provide
a greener cloud data centre.
C. Virtual Machine Selection Policy
The study conducted by Beloglazov et al. in 2011 remains
one of the most highly cited and respected pieces of research
in relation to the consolidation of VMs while maximizing
performance and efficiency in cloud data centres [7]. Bel-
oglazov examines the dynamic consolidation of VMs while
considering multiple hosts and VMs in an IaaS environment.
Importantly, Beloglazov models SLAs as a key component in
a solution to VM consolidation, which is also a main feature of this
paper. Beloglazov's proposed algorithm can be broken
into three sections: (1) Over-Utilised/Under-Utilised detection,
(2) VM selection policy and (3) VM placement. In this paper
we are only interested in parts (1) and (2) of Beloglazov's
research. (1) Over-utilised detection: building on past research,
Beloglazov suggests an adaptive selection policy known as
Local Regression (LR) for determining when VMs require
migration from the host in order not to violate SLAs [5]. LR,
first proposed by Cleveland, allows for the analysis of a local
subset of data, in this case hosts [9]. Given an over-utilisation
threshold along with a safety parameter, LR decides
that a host is likely to become over-utilised if its current
CPU utilisation multiplied by the safety parameter is
larger than the maximum possible utilisation. (2) VM selection:
VMs v are placed on a migration list V_h based on the shortest
period of time to complete the migration. The migration time
is estimated as the utilised RAM divided by the spare bandwidth
of the host h; the policy selects a suitable VM v through the
following condition:

v ∈ V_h | ∀ a ∈ V_h : RAM_u(v) / NET_h ≤ RAM_u(a) / NET_h

where RAM_u(a) is the total RAM currently utilised by
VM a and NET_h is the spare network bandwidth available on
host h.
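For illustration, a minimal Python sketch of the Mmt criterion above; the dictionary-based VM records and the 1 Gbit/s spare bandwidth figure are illustrative assumptions, not CloudSim structures.

```python
# Minimal sketch of the Mmt (minimum migration time) selection criterion:
# the migration time of a VM is approximated as its utilised RAM divided by
# the host's spare network bandwidth, so with a common denominator the choice
# reduces to the VM with the least utilised RAM.

def select_vm_mmt(vms, spare_bandwidth):
    """Return the candidate VM whose migration would complete fastest."""
    return min(vms, key=lambda vm: vm["ram_used"] / spare_bandwidth)

# Example: three candidate VMs on an over-utilised host with 1 Gbit/s spare.
vms = [{"id": 1, "ram_used": 512},
       {"id": 2, "ram_used": 128},
       {"id": 3, "ram_used": 2048}]
print(select_vm_mmt(vms, spare_bandwidth=1000)["id"])  # -> 2
```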
We have chosen Lr-Mmt as the state-of-the-art approach
against which we will benchmark the performance of our
proposed RL algorithm.
III. REINFORCEMENT LEARNING
In Reinforcement Learning (RL), an agent learns
through a trial-and-error process, by interacting with its
environment and observing the resulting reward signal [16]. RL
problems are modelled as Markov Decision Processes (MDPs),
which provide a mathematical framework for modelling sequential
decision making under uncertainty. An MDP is a tuple
(S, A, T, R): the agent takes an action a ∈ A in state s ∈ S,
which moves it to a future state s' ∈ S.
The probability that executing a in s results in a transition to s'
is defined as:

P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }

The agent receives a scalar reward r_t, which can be either
negative or positive. For any current state s_t and action a_t,
together with any next state s_{t+1}, the expected value of the
next reward is:

R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }
The goal of solving an MDP is to find a policy that maximises
accumulated rewards. In specific cases where a complete
environmental model is known, that is, S, A, T and R are fully
observable, the problem is reduced to a planning problem
and can be solved using traditional dynamic programming
techniques such as value iteration. If there is no complete
model available then one must either attempt to approximate
the missing model (model-based reinforcement learning) or
directly estimate the value function or policy (model-free
Reinforcement Learning).
In the absence of a complete environmental model, model-free
Reinforcement Learning algorithms such as Q-learning,
which is used in this paper, can be used to generate optimal
policies [20]. Q-learning belongs to a collection of algorithms
known as Temporal Difference (TD) methods, which
estimate the state-action value Q(s_t, a_t). Not needing a full
model of the environment, TD has the capability to make
predictions incrementally by bootstrapping the current estimate
onto previous estimates. After every state-action-reward-state
transition experienced, the TD algorithm Q-learning calculates
an estimated value known as a Q-value:

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
α is the learning rate, which determines how quickly
an agent learns. An α close to 1 ensures the most recently
obtained information is utilised, while an α close to 0 means
little learning will take place. An agent's degree of myopia can
be controlled by setting the discount factor γ between 0 and 1. The
closer γ is to 1, the greater the weight placed on future
rewards, whereas values close to 0 consider only the most recent
rewards. max_a Q(s_{t+1}, a_{t+1}) returns the maximum estimate for
the future state-action pair. Once the Q-value is calculated it is
then stored in the agent's Q-matrix.
Actions are chosen based on the policy π that the agent is
following. To ensure that the agent discovers the optimal
policy π, a trade-off between exploration and exploitation
must exist. An agent that always exploits the best action is
said to be following a greedy selection policy; however, such
an implementation never explores, thus paying no regard to
possibly more lucrative alternative actions. In this
paper an ε-greedy policy is used, so that the agent can
explore the entirety of the environment, with the value
of ε controlling the exploration rate.
Figure 1 illustrates the interaction the RL agent has with the
environment. Pseudo-code 1 provides an outline of the Q-learning algorithm.

Fig. 1. RL Environment Interaction [16]
Q-learning Algorithm - Pseudo-code 1
  Initialize Q-matrix arbitrarily, policy π
  Repeat (while s_t is not terminal)
    Observe s_t
    Select a_t using π
    Execute a_t
    Observe s_{t+1}, r_t
    Calculate Q:
      Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
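For illustration, a compact tabular Q-learning loop in Python corresponding to Pseudo-code 1, with the ε-greedy policy described above; the environment interface (reset/step/actions/terminal) is a generic assumption, not the paper's CloudSim implementation.

```python
# Tabular Q-learning with an epsilon-greedy policy, mirroring Pseudo-code 1.
# The env object (reset, terminal, actions, step) is an assumed interface.
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.8, gamma=0.8, epsilon=0.05):
    Q = defaultdict(float)                          # Q-matrix, all entries start at 0

    def choose_action(state, actions):
        if random.random() < epsilon:               # explore
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])  # exploit best known action

    for _ in range(episodes):
        state = env.reset()
        while not env.terminal(state):
            action = choose_action(state, env.actions(state))
            next_state, reward = env.step(action)
            next_actions = env.actions(next_state)
            best_next = max(Q[(next_state, a)] for a in next_actions) if next_actions else 0.0
            # TD update: bootstrap the current estimate onto the next one
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```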
IV. DYNAMIC RL-VM SELECTION POLICY
To the best of our knowledge this research is the first to
apply RL to the selection of individual VMs during migration
from an over-utilised host. Studies such as Beloglazov et al.
presents efficient VM selection techniques but however these
approaches cause high amounts of energy consumption [7].
Our aim is to investigate the effects an energy-aware learning
agent can have on live migration, which is a critical process
that consumes high amounts of energy in a data centre.
Our RL-VM selection policy (Lr-RL) learns to select a VM
to migrate from an over-utilised host. Each host contains a set
of VMs. Our Lr-RL algorithm determines the optimal VM
to migrate from a set of VMs. However, to which host that
VM will be migrated is out of the scope of this research
and will be considered in future work. To clarify, our Lr-RL
approach uses Beloglazov's Local Regression (Lr) technique
to determine if a host is over-utilised (Section II-C). Our RL
agent is implemented in place of their Mmt algorithm. By
giving the agent observability over key state variables such
as energy consumption, the agent gains an advantage over
Lr-Mmt (Section II), the algorithm against which we will
evaluate our approach.
A. State-Action Space
We created a novel percentile state-action space to incor-
porate the RL algorithm into an IaaS environment. The state
space s_t is defined as the current host utilisation (CPU usage),
denoted as h_u, returned as a percentage. Therefore, this allows
the state space to be defined as s_t ∈ S = {0 - 100}, and it is
obtained through the following equation:

s_t = Σ_{v=1}^{n} vm_u(v)   (1)

where v is the current VM selected, vm_u is the function
that calculates the current VM's utilisation of the host's CPU
and n is the number of VMs that can be migrated. For example,
each host contains a set of VMs; for each VM, the utilisation
percentage of the host's resources is calculated and these values
are then summed to determine the overall utilisation of that host,
in the form of a percentage.
The action space is represented as vm_u (defined above)
relative to the utilisation of its assigned host h, returned as a percentage:

a_t = (vm_u(v) / h_u(h)) · 100   (2)

The action space is defined in terms of a percentage, as a_t ∈ A
= {0 - 100}. Each state-action pair is mapped to a Q-value in
the agent's Q-matrix.
B. Reward Function
Reinforcement Learning maximises rewards through a mapping
of states to actions. In order to achieve this, a recurrent
interaction at discrete time steps between the agent and environment
is necessary. The RL agent receives a representation
of the environment in the form of the current state s_t, which
allows an action a_t to be returned based on the policy the agent
is following. The reward function R(s_t, a_t) is determined by
the action a_t taken in the current state s_t of the host. At the next
time step the environment returns a new representation of the
current state s_{t+1} and a numerical reward r_t based on the
previous action a_t undertaken.
For example, a host has an over-utilisation (in terms of
CPU utilisation) of 90%; this is then the agent's s_t. The RL
agent selects an a_t, for example a VM that is utilising 10%
of the host CPU. The host then transitions into s_{t+1}, which in
this case would be 80%. The agent receives a reward based on
the energy usage of the host once the VM has been migrated.
The reward is defined as presented in [6]: the host's power
consumption is nearly proportional to its CPU utilisation, so
the power consumption can be described by Equation (3):

P(µ) = 0.7 · P_max + 0.3 · P_max · µ   (3)

where P_max denotes the host's power consumption at full load
and µ represents the PM's CPU utilisation, which
changes over time.
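For illustration, a small sketch of the Equation (3) power model and an energy-aware reward built from it; P_max = 250 W is an assumed figure, and taking the negative post-migration power as the reward is one plausible reading of the description above rather than the paper's exact reward signal.

```python
# Sketch of the linear power model of Equation (3) and a reward derived from it.

def host_power(utilisation, p_max=250.0):
    """Power draw (Watts): 70% of P_max is static, 30% scales with CPU utilisation."""
    return 0.7 * p_max + 0.3 * p_max * utilisation

def energy_aware_reward(utilisation_after_migration, p_max=250.0):
    """Lower post-migration power draw yields a higher (less negative) reward."""
    return -host_power(utilisation_after_migration, p_max)

print(host_power(0.9))   # 242.5 W at 90% load
print(host_power(0.8))   # 235.0 W after migrating a VM that used 10% of the host
```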
C. Q-Learning Implementation
The following details how the Lr-RL algorithm functions.
First, the pseudo-code of the dynamic RL-VM selection policy
is provided (Pseudo-code 2), followed by a detailed explanation
of how the algorithm operates.
Dynamic RL-VM Selection Policy - Pseudo-code 2
  host ← overUtilisedHost
  VMs ← migrateableVMs
  possibleActions ← vmSize (CPU utilisation percentage of each VM)
  Choose VM from possibleActions using π
  Migrate VM
  Observe future host utilisation, reward
  Calculate Q:
    Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
  Update Q-matrix
The Lr-RL algorithm is invoked when a host is determined to
be over-utilised through the Lr function. The host is placed on
a list of over-utilised hosts. A host is selected from the list of
over-utilised hosts and the RL agent calculates the host’s state,
in terms of percentage (see equation 1). All VMs are mapped
as possible actions based on the percentage of CPU utilisation
of their host (see Equation 2). The RL agent selects an action
(i.e. a VM) based on the RL selection policy, i.e. ε-greedy. The
agent performs the action, migrating the selected VM from the
current host to a suitable alternative host. The agent then observes
the new host utilisation level and an energy-aware reward is
received (defined in Section IV-B). The agent calculates the
Q-value for the state-action pair, which is then mapped to the
Q-matrix. If the host is deemed to still be over-utilised, the RL
agent selects the next optimal VM to migrate. The process is
repeated until the host is no longer over-utilised.
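Putting the pieces together, the following is a self-contained Python sketch of the selection loop in Pseudo-code 2, assuming a toy Host class; "migration" here simply removes the chosen VM from the host, since placement is out of scope, and the over-utilisation test uses the safety-parameter check described in Section V rather than the full Lr method. None of this is CloudSim API.

```python
# Sketch of the Lr-RL selection loop: while the host is over-utilised, pick a
# VM epsilon-greedily, "migrate" it, observe the new utilisation and an
# energy-aware reward (Eq. 3), and update the Q-matrix.
import random
from dataclasses import dataclass, field

@dataclass
class Host:
    cpu_capacity: float                               # illustrative capacity (e.g. MIPS)
    vm_cpu: list = field(default_factory=list)        # CPU used by each resident VM

    def utilisation_pct(self):                        # Eq. (1): host state as a percentage
        return 100.0 * sum(self.vm_cpu) / self.cpu_capacity

def reward(utilisation_pct, p_max=250.0):             # negative Eq. (3) power draw
    u = utilisation_pct / 100.0
    return -(0.7 * p_max + 0.3 * p_max * u)

def lr_rl_select_and_migrate(host, Q, threshold=100.0, safety=1.2,
                             alpha=0.8, gamma=0.8, epsilon=0.05):
    while host.vm_cpu and host.utilisation_pct() * safety > threshold:
        state = round(host.utilisation_pct())
        # Eq. (2): each candidate action is the VM's share of current host load
        actions = [round(100.0 * c / sum(host.vm_cpu)) for c in host.vm_cpu]
        if random.random() < epsilon:                  # explore
            idx = random.randrange(len(actions))
        else:                                          # exploit best known action
            idx = max(range(len(actions)), key=lambda i: Q.get((state, actions[i]), 0.0))
        action = actions[idx]
        host.vm_cpu.pop(idx)                           # "migrate": placement is out of scope
        next_state = round(host.utilisation_pct())
        r = reward(next_state)
        next_actions = ([round(100.0 * c / sum(host.vm_cpu)) for c in host.vm_cpu]
                        if host.vm_cpu else [])
        best_next = max((Q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (r + gamma * best_next - old)

Q = {}
lr_rl_select_and_migrate(Host(cpu_capacity=1000.0, vm_cpu=[400.0, 300.0, 200.0]), Q)
```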
V. EXPERIMENT SETUP
As the target system is an IaaS, it was essential to
conduct the experiments on a large scale virtualised data
centre infrastructure simulation tool, as a real world system
would be too complex. The Cloudsim framework allows for
the representation of a power-aware data centre with LAN-
migration capabilities. For the experiment conducted in this
paper to be considered fair, we have set the Cloudsim param-
eters according to Beloglazov et al. [7]. This comprises 800
physical servers, consisting of 400 HP ProLiant ML110 G5 and
400 HP ProLiant ML110 G4 servers, replicated within the
simulator. The 30-day workload used in this experiment comes
from a real-world IaaS environment. PlanetLab files within
the CloudSim framework contain data from the CoMon project,
representing the CPU utilisation of over 100 VMs from servers
located in 500 locations worldwide. In order to make the
experiments more realistic, a 30-day workload experiment was
created on a random basis from the PlanetLab files, each containing
288 values representative of CPU workloads. VMs are assigned
to these 30 day workloads on a random basis in order to best
represent the stochastic characteristics of workload allocation
and demands within an IaaS environment. CloudSim offers a
default ceiling threshold of 100% for each host, with a safety
parameter of 1.2. This safety parameter acts as an over-utilisation
buffer. For example, if the current utilisation of a host is 85%
and the safety parameter is set to 1.2, this gives the
host an effective utilisation level of 102% and it is considered to be
over-utilised.
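For illustration, a minimal sketch of the safety-parameter check just described; the function name is illustrative and the 1.2 value matches the example above.

```python
# A host is flagged as over-utilised when its current utilisation, scaled by
# the safety parameter, exceeds 100% of capacity.

def is_over_utilised(utilisation_pct, safety_parameter=1.2):
    return utilisation_pct * safety_parameter > 100.0

print(is_over_utilised(85.0))   # True:  85% * 1.2 = 102%
print(is_over_utilised(80.0))   # False: 80% * 1.2 = 96%
```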
To ensure an agent converges to an optimal policy the
learning parameters must be set. These values were selected by
conducting a parameter sweep to determine the best performance
of the agent's learning abilities. The parameters
for the experiment in this paper are set as follows: α = 0.8,
γ = 0.8 and ε = 0.05.
A. Experiment Metrics
The following metrics are used to evaluate the Lr-RL
algorithm against the Lr-Mmt heuristic.
1) Energy Consumption: The total energy consumed by the
data centre per day in relation to computational resources, i.e.
servers (a simple accumulation sketch is given after this list).
Although other energy draws exist, such as cooling
and infrastructural demands, this area was deemed outside the
scope of this research.
2) Migrations of VMs: The total migrations of all VMs
on all servers, performed by the data centre over a 30 day
workload.
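As referenced in metric (1), a small Python sketch of how per-day energy can be accumulated from power samples taken at 5-minute intervals (the measurement interval mentioned in Section VI); the function name and the constant 200 W sample stream are illustrative assumptions, not part of the paper's CloudSim setup.

```python
# Accumulate daily energy (kWh) from per-host power samples at a fixed interval.

def daily_energy_kwh(power_samples_watts, interval_minutes=5):
    hours_per_sample = interval_minutes / 60.0
    return sum(p * hours_per_sample for p in power_samples_watts) / 1000.0

# 288 five-minute samples cover one day; a constant 200 W host uses 4.8 kWh/day.
print(daily_energy_kwh([200.0] * 288))
```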
VI. PRELIMINARY RESULTS
This section compares our Lr-RL algorithm against the
benchmark set by the Lr-Mmt technique. Both algorithms were
subject to the 30 day workload experiment, repeated 100 times.
For every iteration of the experiment, the RL agent’s Q-matrix
was initialised to 0.
Figure 2 presents the energy consumption results for
both algorithms. The results show that over the 30-day workload
period, Lr-RL consumed a total of 3,948.35 kWh compared
to 4,623.75 kWh for Lr-Mmt. The standard deviation for Lr-RL
was ±28.79 kWh in comparison with ±33.12 kWh for
Lr-Mmt. A paired t-test shows that there is a statistically
significant difference in the consumption of energy when
utilising Lr-RL and Lr-Mmt, resulting in a p-value <0.0067
with a 95% confidence interval (-6.474, -38.5533).
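For reference, such a paired t-test could be reproduced with SciPy as sketched below; the per-day kWh arrays shown are placeholders, not the paper's data.

```python
# Paired t-test on matched per-day energy figures for the two policies.
import numpy as np
from scipy import stats

lr_rl_daily_kwh = np.array([131.6, 127.3, 129.8])   # per-day energy, Lr-RL (illustrative)
lr_mmt_daily_kwh = np.array([154.1, 138.6, 151.2])  # per-day energy, Lr-Mmt (illustrative)

t_stat, p_value = stats.ttest_rel(lr_rl_daily_kwh, lr_mmt_daily_kwh)
print(t_stat, p_value)
```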
The migration results for both algorithms are shown in
Figure 3, highlighting that Lr-RL had a considerably lower
number of migrations than Lr-Mmt. Lr-RL selects VMs based
on the future energy usage of a host. This resulted in fewer
migrations from a single host for Lr-RL, while the host
maintains sufficient processing capability. Over the 30-day
workload period, Lr-RL had a total of 525,769 migrations
compared to Lr-Mmt with 797,496 migrations. The standard
deviation of migrations for Lr-RL was ±4,443.25 compared
to ±5,898.23 for Lr-Mmt. A paired t-test confirms that there
is a statistically significant difference between Lr-RL and Lr-Mmt,
with a p-value <0.0001 with a 95% confidence interval
(-6,358.85, -11,756.41).
Examining a single day workload in more detail further
highlights the improvements that Lr-RL could potentially
contribute to real world data centres. The correlation between
the decreased number of migrations and the energy reduction
for the Lr-RL algorithm is shown in Figure 4, measured at the
industry standard of 5 minute intervals. For day 1 of the 30 day
workload, Lr-Mmt had a total energy consumption of 138.55
kWh and 23,211 total migrations, while Lr-RL had an energy
consumption of 127.31 kWh and 19,437.4 total migrations.
Lr-RL saves on average 11.24 kWh and performs 3,773.6 fewer
migrations on the first day of this workload.

Fig. 2. Energy Consumption (daily energy consumption in kWh over the 30-day workload, Lr-RL vs. Lr-Mmt)

Fig. 3. Migrations of VMs (daily number of VM migrations over the 30-day workload, Lr-RL vs. Lr-Mmt)
On average, for a single-day workload, Lr-RL saves 22.51
kWh of energy, with 9,058 fewer migrations than Lr-Mmt. Lr-Mmt
requires nearly 12 VMs to be moved from a host, whereas
on average Lr-RL never requires more than 2 VMs to be
migrated. One reason for this is that Lr-Mmt chooses the VM
with the shortest migration time, accounting for only 3.06%
of overall host utilisation, whereas Lr-RL on average selects
a VM that accounts for 12.87% of overall host utilisation, thus
enabling a faster process whereby a host is no longer considered
to be in an over-utilised state.
Considering the energy-saving aspect of our results: on
average Lr-RL saved 22.51 kWh per day, which results in an
estimated saving of 8,577.5 kWh per year. According to
calculations by the EPA, this is equivalent to a reduction of 5.9
metric tons of CO2 emissions due to electricity generation and
potentially protecting 4.6 acres of forest land from destruction
[11].
The results highlight the adaptive nature of RL. An agent
with the capability to learn and adapt to changing workloads
results in a reduction of energy consumption in a cloud
domain. However, RL also has drawbacks: for an agent to learn,
a number of training episodes must be conducted, potentially
leading to a substantial amount of time for the agent to
converge on an optimal policy. Although the Lr-Mmt approach
consumes high amounts of energy and incurs high overhead
costs for cloud providers, the size of the VM (in terms of
RAM) to be migrated will not have a considerable impact on
compute resources such as bandwidth in a data centre.

Fig. 4. Energy & Migration Correlation for Workload Day 1
VII. CONCLUSION
This research presented a dynamic RL VM selection policy
(Lr-RL), where an agent learned to select a VM to migrate
from an over-utilised host. Based on an energy-aware-reward
function the agent reduces energy consumption and migrations.
To address the aims of the paper proposed in Section I:
1) Our Lr-RL approach improves upon the best-known
algorithm, Lr-Mmt, in terms of energy consumption.
Our agent-based approach learns to select the optimal
VM to be migrated. We created a novel percentile
state-action space, represented by the host CPU utilisation
as a percentage and the VM's usage of the host's
CPU, also as a percentage. The experiment results
demonstrated that Reinforcement Learning can be
implemented at a low level of abstraction for use
in an IaaS environment.
2) The energy-aware reward function provided energy
performance feedback to the agent when selecting an
appropriate VM for migration. From our EPA calculations,
our RL VM selection policy has the capability
to create a cognitive live migration framework that
has the potential to decrease CO2 emissions from a
cloud data centre.
Our research so far shows the potential benefits of an
agent-based approach when applied to energy consumption
problems in a cloud simulation domain. The SLAV model was
out of the scope of this research, as we wanted to highlight
the advancement RL could achieve in energy consumption. In
future work we plan to model the SLAV performance of both
algorithms. This work will enable an agent to observe both
SLAV and energy in order to decide on the optimal VM to migrate,
while improving the performance of a cloud data centre.
REFERENCES
[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski,
G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al. A view of cloud
computing. Communications of the ACM, 53(4):50–58, 2010.
[2] R. M. Bahati and M. A. Bauer. Towards adaptive policy-based
management. In Network Operations and Management Symposium
(NOMS), 2010 IEEE, pages 511–518. IEEE, 2010.
[3] E. Barrett, E. Howley, and J. Duggan. Applying reinforcement learning
towards automating resource allocation and application scalability in
the cloud. Concurrency and Computation: Practice and Experience,
25(12):1656–1674, 2013.
[4] L. A. Barroso and U. Hölzle. The case for energy-proportional
computing. Computer, (12):33–37, 2007.
[5] A. Beloglazov, J. Abawajy, and R. Buyya. Energy-aware resource
allocation heuristics for efficient management of data centers for cloud
computing. Future generation computer systems, 28(5):755–768, 2012.
[6] A. Beloglazov and R. Buyya. Adaptive threshold-based approach for
energy-efficient consolidation of virtual machines in cloud data centers.
2010.
[7] A. Beloglazov and R. Buyya. Optimal online deterministic algorithms
and adaptive heuristics for energy and performance efficient dynamic
consolidation of virtual machines in cloud data centers. Concurrency
and Computation: Practice and Experience, 24(13):1397–1420, 2012.
[8] R. Brown. Report to congress on server and data center energy effi-
ciency: Public law 109-431. Lawrence Berkeley National Laboratory,
2008.
[9] W. S. Cleveland. Robust locally weighted regression and smooth-
ing scatterplots. Journal of the American statistical association,
74(368):829–836, 1979.
[10] X. Dutreilh, S. Kirgizov, O. Melekhova, J. Malenfant, N. Rivierre, and
I. Truck. Using reinforcement learning for autonomic resource alloca-
tion in clouds: Towards a fully automated workflow. In ICAS 2011,
The Seventh International Conference on Autonomic and Autonomous
Systems, pages 67–74, 2011.
[11] EPA.gov. Calculations and references — Clean Energy — US EPA.
[12] F. Farahnakian, P. Liljeberg, and J. Plosila. Energy-efficient virtual ma-
chines consolidation in cloud data centers using reinforcement learning.
In Parallel, Distributed and Network-Based Processing (PDP), 2014
22nd Euromicro International Conference on, pages 500–507. IEEE,
2014.
[13] J. Koomey. Growth in data center electricity use 2005 to 2010. A report
by Analytical Press, completed at the request of The New York Times,
page 9, 2011.
[14] J. G. Koomey et al. Estimating total power consumption by servers in
the us and the world, 2007.
[15] C. Pettey. Gartner estimates ict industry accounts for 2 percent of global
co2 emissions. Dostupno na, 14:2013, 2007.
[16] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction,
volume 1. MIT press Cambridge, 1998.
[17] Y. Tan, W. Liu, and Q. Qiu. Adaptive power management using
reinforcement learning. In Proceedings of the 2009 International
Conference on Computer-Aided Design, pages 461–467. ACM, 2009.
[18] G. Tesauro, N. K. Jong, R. Das, and M. N. Bennani. A hybrid
reinforcement learning approach to autonomic resource allocation. In
Autonomic Computing, 2006. ICAC’06. IEEE International Conference
on, pages 65–73. IEEE, 2006.
[19] A. Verma, P. Ahuja, and A. Neogi. pmapper: power and migration
cost aware application placement in virtualized systems. In Middleware
2008, pages 243–264. Springer, 2008.
[20] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–
292, 1992.
[21] J. Yuan, X. Miao, L. Li, and X. Jiang. An online energy saving
resource optimization methodology for data center. Journal of Software,
8(8):1875–1880, 2013.
[22] X. Zhu, D. Young, B. J. Watson, Z. Wang, J. Rolia, S. Singhal,
B. McKee, C. Hyser, D. Gmach, R. Gardner, et al. 1000 islands:
Integrated capacity and workload management for the next generation
data center. In Autonomic Computing, 2008. ICAC’08. International
Conference on, pages 172–181. IEEE, 2008.
View publication statsView publication stats

More Related Content

What's hot

Energy-aware Load Balancing and Application Scaling for the Cloud Ecosystem
Energy-aware Load Balancing and Application Scaling for the Cloud EcosystemEnergy-aware Load Balancing and Application Scaling for the Cloud Ecosystem
Energy-aware Load Balancing and Application Scaling for the Cloud Ecosystem1crore projects
 
FDMC: Framework for Decision Making in Cloud for EfficientResource Management
FDMC: Framework for Decision Making in Cloud for EfficientResource Management FDMC: Framework for Decision Making in Cloud for EfficientResource Management
FDMC: Framework for Decision Making in Cloud for EfficientResource Management IJECEIAES
 
Ieeepro techno solutions 2014 ieee java project - deadline based resource p...
Ieeepro techno solutions   2014 ieee java project - deadline based resource p...Ieeepro techno solutions   2014 ieee java project - deadline based resource p...
Ieeepro techno solutions 2014 ieee java project - deadline based resource p...hemanthbbc
 
IMPROVEMENT OF ENERGY EFFICIENCY IN CLOUD COMPUTING BY LOAD BALANCING ALGORITHM
IMPROVEMENT OF ENERGY EFFICIENCY IN CLOUD COMPUTING BY LOAD BALANCING ALGORITHMIMPROVEMENT OF ENERGY EFFICIENCY IN CLOUD COMPUTING BY LOAD BALANCING ALGORITHM
IMPROVEMENT OF ENERGY EFFICIENCY IN CLOUD COMPUTING BY LOAD BALANCING ALGORITHMAssociate Professor in VSB Coimbatore
 
IRJET- An Efficient Energy Consumption Minimizing Based on Genetic and Power ...
IRJET- An Efficient Energy Consumption Minimizing Based on Genetic and Power ...IRJET- An Efficient Energy Consumption Minimizing Based on Genetic and Power ...
IRJET- An Efficient Energy Consumption Minimizing Based on Genetic and Power ...IRJET Journal
 
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...IJECEIAES
 
Hybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
Hybrid Task Scheduling Approach using Gravitational and ACO Search AlgorithmHybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
Hybrid Task Scheduling Approach using Gravitational and ACO Search AlgorithmIRJET Journal
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Migration Control in Cloud Computing to Reduce the SLA Violation
Migration Control in Cloud Computing to Reduce the SLA ViolationMigration Control in Cloud Computing to Reduce the SLA Violation
Migration Control in Cloud Computing to Reduce the SLA Violationrahulmonikasharma
 
EVALUATION OF CONGESTION CONTROL METHODS FOR JOINT MULTIPLE RESOURCE ALLOCATI...
EVALUATION OF CONGESTION CONTROL METHODS FOR JOINT MULTIPLE RESOURCE ALLOCATI...EVALUATION OF CONGESTION CONTROL METHODS FOR JOINT MULTIPLE RESOURCE ALLOCATI...
EVALUATION OF CONGESTION CONTROL METHODS FOR JOINT MULTIPLE RESOURCE ALLOCATI...IJCNCJournal
 
Failure aware resource provisioning for hybrid cloud infrastructure
Failure aware resource provisioning for hybrid cloud infrastructureFailure aware resource provisioning for hybrid cloud infrastructure
Failure aware resource provisioning for hybrid cloud infrastructureFreddie Zhang
 
Optimization of energy consumption in cloud computing datacenters
Optimization of energy consumption in cloud computing datacenters Optimization of energy consumption in cloud computing datacenters
Optimization of energy consumption in cloud computing datacenters IJECEIAES
 
Differentiating Algorithms of Cloud Task Scheduling Based on various Parameters
Differentiating Algorithms of Cloud Task Scheduling Based on various ParametersDifferentiating Algorithms of Cloud Task Scheduling Based on various Parameters
Differentiating Algorithms of Cloud Task Scheduling Based on various Parametersiosrjce
 
Ieeepro techno solutions ieee java project - budget-driven scheduling algor...
Ieeepro techno solutions   ieee java project - budget-driven scheduling algor...Ieeepro techno solutions   ieee java project - budget-driven scheduling algor...
Ieeepro techno solutions ieee java project - budget-driven scheduling algor...hemanthbbc
 

What's hot (18)

Energy-aware Load Balancing and Application Scaling for the Cloud Ecosystem
Energy-aware Load Balancing and Application Scaling for the Cloud EcosystemEnergy-aware Load Balancing and Application Scaling for the Cloud Ecosystem
Energy-aware Load Balancing and Application Scaling for the Cloud Ecosystem
 
FDMC: Framework for Decision Making in Cloud for EfficientResource Management
FDMC: Framework for Decision Making in Cloud for EfficientResource Management FDMC: Framework for Decision Making in Cloud for EfficientResource Management
FDMC: Framework for Decision Making in Cloud for EfficientResource Management
 
Ieeepro techno solutions 2014 ieee java project - deadline based resource p...
Ieeepro techno solutions   2014 ieee java project - deadline based resource p...Ieeepro techno solutions   2014 ieee java project - deadline based resource p...
Ieeepro techno solutions 2014 ieee java project - deadline based resource p...
 
IMPROVEMENT OF ENERGY EFFICIENCY IN CLOUD COMPUTING BY LOAD BALANCING ALGORITHM
IMPROVEMENT OF ENERGY EFFICIENCY IN CLOUD COMPUTING BY LOAD BALANCING ALGORITHMIMPROVEMENT OF ENERGY EFFICIENCY IN CLOUD COMPUTING BY LOAD BALANCING ALGORITHM
IMPROVEMENT OF ENERGY EFFICIENCY IN CLOUD COMPUTING BY LOAD BALANCING ALGORITHM
 
1732 1737
1732 17371732 1737
1732 1737
 
IRJET- An Efficient Energy Consumption Minimizing Based on Genetic and Power ...
IRJET- An Efficient Energy Consumption Minimizing Based on Genetic and Power ...IRJET- An Efficient Energy Consumption Minimizing Based on Genetic and Power ...
IRJET- An Efficient Energy Consumption Minimizing Based on Genetic and Power ...
 
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
 
Hybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
Hybrid Task Scheduling Approach using Gravitational and ACO Search AlgorithmHybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
Hybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
 
[IJET V2I5P18] Authors:Pooja Mangla, Dr. Sandip Kumar Goyal
[IJET V2I5P18] Authors:Pooja Mangla, Dr. Sandip Kumar Goyal[IJET V2I5P18] Authors:Pooja Mangla, Dr. Sandip Kumar Goyal
[IJET V2I5P18] Authors:Pooja Mangla, Dr. Sandip Kumar Goyal
 
B03410609
B03410609B03410609
B03410609
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
20120140504025
2012014050402520120140504025
20120140504025
 
Migration Control in Cloud Computing to Reduce the SLA Violation
Migration Control in Cloud Computing to Reduce the SLA ViolationMigration Control in Cloud Computing to Reduce the SLA Violation
Migration Control in Cloud Computing to Reduce the SLA Violation
 
EVALUATION OF CONGESTION CONTROL METHODS FOR JOINT MULTIPLE RESOURCE ALLOCATI...
EVALUATION OF CONGESTION CONTROL METHODS FOR JOINT MULTIPLE RESOURCE ALLOCATI...EVALUATION OF CONGESTION CONTROL METHODS FOR JOINT MULTIPLE RESOURCE ALLOCATI...
EVALUATION OF CONGESTION CONTROL METHODS FOR JOINT MULTIPLE RESOURCE ALLOCATI...
 
Failure aware resource provisioning for hybrid cloud infrastructure
Failure aware resource provisioning for hybrid cloud infrastructureFailure aware resource provisioning for hybrid cloud infrastructure
Failure aware resource provisioning for hybrid cloud infrastructure
 
Optimization of energy consumption in cloud computing datacenters
Optimization of energy consumption in cloud computing datacenters Optimization of energy consumption in cloud computing datacenters
Optimization of energy consumption in cloud computing datacenters
 
Differentiating Algorithms of Cloud Task Scheduling Based on various Parameters
Differentiating Algorithms of Cloud Task Scheduling Based on various ParametersDifferentiating Algorithms of Cloud Task Scheduling Based on various Parameters
Differentiating Algorithms of Cloud Task Scheduling Based on various Parameters
 
Ieeepro techno solutions ieee java project - budget-driven scheduling algor...
Ieeepro techno solutions   ieee java project - budget-driven scheduling algor...Ieeepro techno solutions   ieee java project - budget-driven scheduling algor...
Ieeepro techno solutions ieee java project - budget-driven scheduling algor...
 

Viewers also liked

Viewers also liked (16)

Firm Profile JSRA - 2017
Firm Profile JSRA - 2017Firm Profile JSRA - 2017
Firm Profile JSRA - 2017
 
Youthpass - Ana - signed
Youthpass - Ana - signedYouthpass - Ana - signed
Youthpass - Ana - signed
 
Recorrido literario por la figura de juan ramón
Recorrido literario por la figura de juan ramónRecorrido literario por la figura de juan ramón
Recorrido literario por la figura de juan ramón
 
Risk solutions company
Risk solutions companyRisk solutions company
Risk solutions company
 
Cultura
CulturaCultura
Cultura
 
Creación de una agencia de diseño de modas
Creación de una agencia de diseño de modasCreación de una agencia de diseño de modas
Creación de una agencia de diseño de modas
 
Sou de jesus
Sou de jesusSou de jesus
Sou de jesus
 
คำศัพท์ประกอบหน่วยที่1
คำศัพท์ประกอบหน่วยที่1คำศัพท์ประกอบหน่วยที่1
คำศัพท์ประกอบหน่วยที่1
 
Work12
Work12Work12
Work12
 
Penal general tarea_2
Penal general tarea_2Penal general tarea_2
Penal general tarea_2
 
Garry Salter CV
Garry Salter CVGarry Salter CV
Garry Salter CV
 
Planeacion de actividades
Planeacion de actividadesPlaneacion de actividades
Planeacion de actividades
 
Lição 05 Adulto 1 trimestre 2017
Lição 05 Adulto 1 trimestre 2017 Lição 05 Adulto 1 trimestre 2017
Lição 05 Adulto 1 trimestre 2017
 
Actividad 2 tic power point
Actividad 2 tic power pointActividad 2 tic power point
Actividad 2 tic power point
 
Curriculum Vitae
Curriculum VitaeCurriculum Vitae
Curriculum Vitae
 
Se isto não for amor
Se isto não for amorSe isto não for amor
Se isto não for amor
 

Similar to INTECHDublinConference-247-camera-ready

Fuzzy Based Algorithm for Cloud Resource Management and Task Scheduling
Fuzzy Based Algorithm for Cloud Resource Management and Task SchedulingFuzzy Based Algorithm for Cloud Resource Management and Task Scheduling
Fuzzy Based Algorithm for Cloud Resource Management and Task Schedulingijtsrd
 
Hybrid Based Resource Provisioning in Cloud
Hybrid Based Resource Provisioning in CloudHybrid Based Resource Provisioning in Cloud
Hybrid Based Resource Provisioning in CloudEditor IJCATR
 
A Study on Energy Efficient Server Consolidation Heuristics for Virtualized C...
A Study on Energy Efficient Server Consolidation Heuristics for Virtualized C...A Study on Energy Efficient Server Consolidation Heuristics for Virtualized C...
A Study on Energy Efficient Server Consolidation Heuristics for Virtualized C...Susheel Thakur
 
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTING
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTINGA SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTING
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTINGijccsa
 
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTING
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTINGA SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTING
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTINGijccsa
 
A Survey on Resource Allocation in Cloud Computing
A Survey on Resource Allocation in Cloud ComputingA Survey on Resource Allocation in Cloud Computing
A Survey on Resource Allocation in Cloud Computingneirew J
 
IRJET- Improving Data Availability by using VPC Strategy in Cloud Environ...
IRJET-  	  Improving Data Availability by using VPC Strategy in Cloud Environ...IRJET-  	  Improving Data Availability by using VPC Strategy in Cloud Environ...
IRJET- Improving Data Availability by using VPC Strategy in Cloud Environ...IRJET Journal
 
Performance analysis of an energy efficient virtual machine consolidation alg...
Performance analysis of an energy efficient virtual machine consolidation alg...Performance analysis of an energy efficient virtual machine consolidation alg...
Performance analysis of an energy efficient virtual machine consolidation alg...IAEME Publication
 
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTING
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTINGGROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTING
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTINGAIRCC Publishing Corporation
 
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTING
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTINGGROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTING
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTINGijcsit
 
A Study on Replication and Failover Cluster to Maximize System Uptime
A Study on Replication and Failover Cluster to Maximize System UptimeA Study on Replication and Failover Cluster to Maximize System Uptime
A Study on Replication and Failover Cluster to Maximize System UptimeYogeshIJTSRD
 
Load Balancing in Cloud Computing Through Virtual Machine Placement
Load Balancing in Cloud Computing Through Virtual Machine PlacementLoad Balancing in Cloud Computing Through Virtual Machine Placement
Load Balancing in Cloud Computing Through Virtual Machine PlacementIRJET Journal
 
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...neirew J
 
Conference Paper: Simulating High Availability Scenarios in Cloud Data Center...
Conference Paper: Simulating High Availability Scenarios in Cloud Data Center...Conference Paper: Simulating High Availability Scenarios in Cloud Data Center...
Conference Paper: Simulating High Availability Scenarios in Cloud Data Center...Ericsson
 
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...Load Balancing in Cloud Computing Environment: A Comparative Study of Service...
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...Eswar Publications
 
Cloud Computing: A Perspective on Next Basic Utility in IT World
Cloud Computing: A Perspective on Next Basic Utility in IT World Cloud Computing: A Perspective on Next Basic Utility in IT World
Cloud Computing: A Perspective on Next Basic Utility in IT World IRJET Journal
 
IRJET- An Adaptive Scheduling based VM with Random Key Authentication on Clou...
IRJET- An Adaptive Scheduling based VM with Random Key Authentication on Clou...IRJET- An Adaptive Scheduling based VM with Random Key Authentication on Clou...
IRJET- An Adaptive Scheduling based VM with Random Key Authentication on Clou...IRJET Journal
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 

Similar to INTECHDublinConference-247-camera-ready (20)

Fuzzy Based Algorithm for Cloud Resource Management and Task Scheduling
Fuzzy Based Algorithm for Cloud Resource Management and Task SchedulingFuzzy Based Algorithm for Cloud Resource Management and Task Scheduling
Fuzzy Based Algorithm for Cloud Resource Management and Task Scheduling
 
Hybrid Based Resource Provisioning in Cloud
Hybrid Based Resource Provisioning in CloudHybrid Based Resource Provisioning in Cloud
Hybrid Based Resource Provisioning in Cloud
 
A Study on Energy Efficient Server Consolidation Heuristics for Virtualized C...
A Study on Energy Efficient Server Consolidation Heuristics for Virtualized C...A Study on Energy Efficient Server Consolidation Heuristics for Virtualized C...
A Study on Energy Efficient Server Consolidation Heuristics for Virtualized C...
 
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTING
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTINGA SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTING
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTING
 
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTING
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTINGA SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTING
A SURVEY ON RESOURCE ALLOCATION IN CLOUD COMPUTING
 
A Survey on Resource Allocation in Cloud Computing
A Survey on Resource Allocation in Cloud ComputingA Survey on Resource Allocation in Cloud Computing
A Survey on Resource Allocation in Cloud Computing
 
IRJET- Improving Data Availability by using VPC Strategy in Cloud Environ...
IRJET-  	  Improving Data Availability by using VPC Strategy in Cloud Environ...IRJET-  	  Improving Data Availability by using VPC Strategy in Cloud Environ...
IRJET- Improving Data Availability by using VPC Strategy in Cloud Environ...
 
Performance analysis of an energy efficient virtual machine consolidation alg...
Performance analysis of an energy efficient virtual machine consolidation alg...Performance analysis of an energy efficient virtual machine consolidation alg...
Performance analysis of an energy efficient virtual machine consolidation alg...
 
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTING
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTINGGROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTING
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTING
 
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTING
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTINGGROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTING
GROUP BASED RESOURCE MANAGEMENT AND PRICING MODEL IN CLOUD COMPUTING
 
A Study on Replication and Failover Cluster to Maximize System Uptime
A Study on Replication and Failover Cluster to Maximize System UptimeA Study on Replication and Failover Cluster to Maximize System Uptime
A Study on Replication and Failover Cluster to Maximize System Uptime
 
Load Balancing in Cloud Computing Through Virtual Machine Placement
Load Balancing in Cloud Computing Through Virtual Machine PlacementLoad Balancing in Cloud Computing Through Virtual Machine Placement
Load Balancing in Cloud Computing Through Virtual Machine Placement
 
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...
An Approach to Reduce Energy Consumption in Cloud data centers using Harmony ...
 
Conference Paper: Simulating High Availability Scenarios in Cloud Data Center...
Conference Paper: Simulating High Availability Scenarios in Cloud Data Center...Conference Paper: Simulating High Availability Scenarios in Cloud Data Center...
Conference Paper: Simulating High Availability Scenarios in Cloud Data Center...
 
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...Load Balancing in Cloud Computing Environment: A Comparative Study of Service...
Load Balancing in Cloud Computing Environment: A Comparative Study of Service...
 
Cloud Computing: A Perspective on Next Basic Utility in IT World
Cloud Computing: A Perspective on Next Basic Utility in IT World Cloud Computing: A Perspective on Next Basic Utility in IT World
Cloud Computing: A Perspective on Next Basic Utility in IT World
 
IRJET- An Adaptive Scheduling based VM with Random Key Authentication on Clou...
IRJET- An Adaptive Scheduling based VM with Random Key Authentication on Clou...IRJET- An Adaptive Scheduling based VM with Random Key Authentication on Clou...
IRJET- An Adaptive Scheduling based VM with Random Key Authentication on Clou...
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
F1034047
F1034047F1034047
F1034047
 
ENERGY EFFICIENCY IN CLOUD COMPUTING
ENERGY EFFICIENCY IN CLOUD COMPUTINGENERGY EFFICIENCY IN CLOUD COMPUTING
ENERGY EFFICIENCY IN CLOUD COMPUTING
 

INTECHDublinConference-247-camera-ready

  • 1. See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/306056598 A Reinforcement Learning Approach for Dynamic Selection of Virtual Machines in Cloud Data Centres Conference Paper · August 2016 CITATIONS 0 READS 111 5 authors, including: Kieran Flesk National University of Ireland, Galway 1 PUBLICATION 0 CITATIONS SEE PROFILE Jim Duggan National University of Ireland, Galway 106 PUBLICATIONS 506 CITATIONS SEE PROFILE Enda Howley National University of Ireland, Galway 59 PUBLICATIONS 209 CITATIONS SEE PROFILE Enda Barrett 11 PUBLICATIONS 61 CITATIONS SEE PROFILE All content following this page was uploaded by Martin Duggan on 11 August 2016. The user has requested enhancement of the downloaded file. All in-text references underlined in blue are added to the original document and are linked to publications on ResearchGate, letting you access and read them immediately.
  • 2. A Reinforcement Learning Approach for Dynamic Selection of Virtual Machines in Cloud Data Centres Martin Duggan National University Of Ireland, Galway m.duggan1@nuigalway.ie Kieran Flesk National University Of Ireland, Galway k.flesk2@nuigalway.ie Jim Duggan National University Of Ireland, Galway jim.duggan@nuigalway.ie Enda Howley National University Of Ireland, Galway ehowley@nuigalway.ie Enda Barrett National University Of Ireland, Galway enda.barrett@nuigalway.ie Abstract—In recent years Machine Learning techniques have proven to reduce energy consumption when applied to cloud computing systems. Reinforcement Learning provides a promis- ing solution for the reduction of energy consumption, while maintaining a high quality of service for customers. We present a novel single agent Reinforcement Learning approach for the selection of virtual machines, creating a new energy efficiency practice for data centres. Our dynamic Reinforcement Learning virtual machine selection policy learns to choose the optimal virtual machine to migrate from an over-utilised host. Our experiment results show that a learning agent has the abilities to reduce energy consumption and decrease the number of migrations when compared to a state-of-the-art approach. Keywords—Live Migration, Energy, Reinforcement Learning. I. INTRODUCTION Recent research in cloud computing has highlighted the increasing environmental impact of data centres with regards to electricity usage and associated CO2 emissions. According to Koomey et al. between the years of 2005 to 2010, data centre power consumption increased by 56% and in 2010, 1.3% of all power consumed worldwide was due to data centre operations [14]. Studies by Barriso et al. and Koomey et al. highlight that increases in energy consumption range in the billions of kWh [4], [13], [14]. Furthermore a study by Petty et al. in 2007 highlighted that the Information Communication Technology (ICT) industry contributed to 2% of global CO2 emission each year, putting it on par with the aviation industry [15]. However, a report by Brown in 2007 directly addresses this issue by stating that existing technologies and strategies could reduce typical server energy usage by an estimated 25% [8]. This report highlights if state-of-the-art energy efficiency practices are implemented throughout the U.S. data centres, an estimated 55% of energy consumption could be reduced when compared to current levels. Cloud computing leverages infrastructure as a service (IaaS), platform as a service (PaaS) and software as a service (SaaS) to provide virtualised computing resources to customers. A key aspect of the IaaS platform is to provide computer infrastructure, for example virtualised hardware, virtual server space, network connections and bandwidth. The IaaS consists of multiple servers and networks, distributed throughout and can be divided among numerous data centres. The virtualised structure of the IaaS platform allows for the re-allocation of resources through migrating a virtual machine (VM) or a group of VMs among hosts. Migration is the process of moving VMs from one physical host to another based on resource allocation or power saving techniques, however migration has a considerable impact on energy con- sumption in data centres. 
Migration is triggered once a host transitions into an over-utilised or under-utilised state; a virtual machine (VM), or a group of VMs, must then be relocated to another host with sufficient resources so that quality of service (QoS) is guaranteed for both the customer and the cloud provider. Live VM migration is commonly used in cloud data centres because it maintains high system performance under dynamic workloads; it has been studied extensively and shown to benefit cloud providers [1]. Live migration of a single VM requires computing resources and can increase the energy consumption of a data centre. A group of migrations requires a significant amount of resources and energy, which may lead to violations of the Service Level Agreement (SLA), the contract between the cloud provider and the cloud customer. When an SLA violation (SLAV) occurs, the cloud provider incurs increased overhead penalties. Furthermore, the longer a host remains in an over-utilised state, the greater its energy consumption. Selecting the optimal VM to migrate from a host therefore presents a complex and challenging problem.

In this paper, we focus on the selection of a VM to migrate from an over-utilised host. We propose a dynamic Reinforcement Learning (RL) VM selection policy that enables a single learning agent to decide on an optimal VM for migration from an over-utilised host depending on the current energy consumption. This agent-based VM selection policy directly addresses the issues highlighted in [8] by implementing a new state-of-the-art energy efficiency practice. Our algorithm is shown to reduce energy consumption and decrease the number of VM migrations, leading to a greener cloud data centre. The contributions of this paper are:

1) To present a novel Reinforcement Learning VM selection policy (Lr-RL) capable of deciding on an appropriate VM to migrate from an over-utilised host. We create a novel state-action space, based on the CPU utilisation percentage of the host and of the VM to be migrated, to show how a Reinforcement Learning algorithm can improve upon a state-of-the-art approach in terms of energy consumption.

2) To show how an autonomous VM selection policy can reduce energy consumption, creating a more efficient cloud data centre.

The rest of this paper is structured as follows: Section II discusses related work, Section III introduces Reinforcement
Learning, Section IV presents our dynamic RL-VM selection policy, Section V describes the experimental setup, Section VI evaluates and analyses our results, and Section VII concludes the paper.

II. RELATED WORK

In recent years much research has been conducted in the areas of energy efficiency and dynamic resource selection and allocation policies for cloud infrastructure. These approaches can be classified into two main categories: (1) threshold and non-threshold approaches, and (2) Machine Learning approaches.

A. Threshold and Non-Threshold Based Approaches

An example of a non-threshold approach is that of Verma et al., who implemented a power-aware application placement framework called pMapper [19]. It is designed to utilise power management techniques that already exist in hypervisors, such as CPU idling, Dynamic Voltage and Frequency Scaling (DVFS) and consolidation. These techniques are leveraged via separate modules: the performance manager has a global overview of the system and receives information such as SLAs and QoS parameters; the migration manager deals directly with the VMs to implement live migration; the power manager communicates with the infrastructure layer to manage hardware energy policies; and the arbitrator decides, based on information supplied by the above policies, on the optimal placement of VMs through a bin-packing algorithm.

Threshold-based approaches for autonomic scaling of resources are commonplace, and are used by cloud providers such as Amazon EC2 in their Auto Scaling software. Threshold approaches are based on the premise of setting upper- and lower-bound thresholds that, when broken, trigger the allocation or consolidation of resources as necessary. Research in this area includes the '1000 islands' solution architecture proposed by Zhu et al. [22]. Similar to Verma et al., they consider three separate application categories based on different time periods and designate an individual controller to each category: the largest timescale is hours to days, then minutes, and finally seconds. Each group is regarded as a pod and has a node controller managing dynamic allocation of the node's resources. As part of the node controller, a utilisation controller computes resource consumption and estimates what resources will be required in order to meet SLAs in the future.

B. Reinforcement Learning Based Approaches

In recent years, Reinforcement Learning has proven to be a promising approach for optimal allocation of cloud resources. Barrett et al. proposed a parallel RL framework for the optimisation of resource scaling instead of a threshold-based approach [3]. Barrett's approach requires agents to approximate optimal policies, and each agent shares its information with a global agent to improve overall performance. This approach has been empirically shown to outperform traditional rigid threshold-based approaches. Bahati et al. proposed an RL approach to help simplify the management of existing threshold-based rules, where a primary controller applies rules to a system to enforce its quality attributes and a secondary controller monitors the effects of implementing these rules and adapts the thresholds accordingly [2]. Tesauro et al. introduce a hybrid RL approach to optimising server allocations in data centres, by training a nonlinear function approximator in batch mode on a data set while an externally trained policy makes management decisions within the system [18].
Both Farahnakian et al. [12] and Yuan et al. [21] demonstrate how RL can be used to optimise the number of active hosts in operation in a given time frame. An RL agent learns an on-line host detection policy and dynamically consolidates machines in line with optimal parameters. Both studies implement the minimum migration time selection policy proposed by Beloglazov et al. [7] to identify VMs for migration once over-utilised hosts have been detected. Tan et al. use an RL agent to shut down, or make idle, hosts that are at minimal power consumption [17]. Dutreilh et al. proposed an RL framework for autonomic resource allocation in cloud domains [10]. They show how good learning policies in the early phases, using appropriate initialisation and convergence criteria, help speed up learning in problems that typically have a large convergence time. This research is motivated by the fact that all of the above RL approaches have demonstrated a statistically significant advantage over threshold-based approaches. We implement and evaluate RL at a lower level of abstraction, learning policies for the selection of VMs with the aim of reducing energy consumption and providing a greener cloud data centre.

C. Virtual Machine Selection Policy

The study conducted by Beloglazov et al. in 2011 remains one of the most highly cited and respected pieces of research on the consolidation of VMs while maximising performance and efficiency in cloud data centres [7]. Beloglazov examines the dynamic consolidation of VMs while considering multiple hosts and VMs in an IaaS environment. Importantly, Beloglazov models SLAs as a key component of the VM consolidation solution, which is a main feature of this paper also. Beloglazov's proposed algorithm can be broken into three sections: (1) over-utilised/under-utilised host detection, (2) VM selection and (3) VM placement. In this paper we are only interested in sections 1 and 2 of Beloglazov's research.

(1) Over-utilised detection: Building on past research, Beloglazov suggests an adaptive detection policy known as Local Regression (LR) for determining when VMs require migration from a host in order not to violate SLAs [5]. LR, first proposed by Cleveland, allows for the analysis of a local subset of data, in this case hosts [9]. Given an over-utilisation threshold and a safety parameter, LR decides that a host is likely to become over-utilised if its current CPU utilisation multiplied by the safety parameter is larger than the maximum possible utilisation.

(2) VM selection: VMs $v$ are placed on a migration list $V_h$ based on the shortest time required to complete the migration. The migration time is estimated as the RAM utilised by the VM divided by the spare network bandwidth of the host $h$. The minimum migration time (Mmt) policy selects a VM $v$ satisfying:

$v \in V_h \mid \forall a \in V_h, \; \frac{RAM_u(v)}{NET_h} \le \frac{RAM_u(a)}{NET_h}$

where $RAM_u(a)$ is the total RAM currently utilised by VM $a$ and $NET_h$ is the spare network bandwidth available on host $h$.
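To make the Mmt criterion concrete, the following is a minimal Python sketch of how such a selection could be computed. The VirtualMachine and Host containers, their ram_used and spare_bandwidth fields, and the select_mmt_vm function are illustrative assumptions for this example, not the authors' implementation (which was built on CloudSim).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class VirtualMachine:
    name: str
    ram_used: float           # RAM currently utilised by the VM, RAM_u(v)

@dataclass
class Host:
    spare_bandwidth: float    # spare network bandwidth of the host, NET_h
    vms: List[VirtualMachine]

def select_mmt_vm(host: Host) -> Optional[VirtualMachine]:
    """Return the VM with the minimum estimated migration time,
    i.e. the VM minimising RAM_u(v) / NET_h (the Mmt policy)."""
    if not host.vms:
        return None
    return min(host.vms, key=lambda vm: vm.ram_used / host.spare_bandwidth)

# Example: the VM with the smallest RAM footprint is chosen for migration.
host = Host(spare_bandwidth=100.0,
            vms=[VirtualMachine("vm-a", ram_used=2048),
                 VirtualMachine("vm-b", ram_used=512)])
print(select_mmt_vm(host).name)  # vm-b
```

Because $NET_h$ is the same for every VM on a given host, the comparison reduces to picking the VM with the least utilised RAM, which is what the sketch does.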
We have chosen Lr-Mmt as the state-of-the-art approach against which we benchmark the performance of our proposed RL algorithm.

III. REINFORCEMENT LEARNING

In Reinforcement Learning (RL), an agent learns through a trial-and-error process by interacting with its environment and observing the resulting reward signal [16]. RL problems are modelled as Markov Decision Processes (MDPs), which provide a mathematical framework for modelling sequential decision making under uncertainty. An MDP is a tuple $\langle S, A, T, R \rangle$: an agent in state $s \in S$ selects an action $a \in A$ and moves to a future state $s' \in S$. The probability $P$ that executing $a$ in $s$ leads to a transition to $s'$ is defined by:

$P^{a}_{s s'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$

The agent receives a scalar reward $r_t$, which can be either negative or positive. Given any current state $s_t$ and action $a_t$, together with any next state $s_{t+1}$, the expected value of the next reward is:

$R^{a}_{s s'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}$

The goal of solving an MDP is to find a policy that maximises the accumulated reward. In specific cases where a complete environmental model is known, that is $S, A, T, R$ are fully observable, the problem reduces to a planning problem and can be solved using traditional dynamic programming techniques such as value iteration. If no complete model is available, one must either attempt to approximate the missing model (model-based Reinforcement Learning) or directly estimate the value function or policy (model-free Reinforcement Learning).

In the absence of a complete environmental model, model-free Reinforcement Learning algorithms such as Q-learning, which is used in this paper, can be used to generate optimal policies [20]. Q-learning belongs to a collection of algorithms known as Temporal Difference (TD) methods, which estimate the value of the state-action pair $Q(s_t, a_t)$. Because TD methods do not need a full model of the environment, they can make predictions incrementally by bootstrapping the current estimate onto previous estimates. After every state-action-reward-state transition experienced, the TD algorithm Q-learning calculates an estimated value known as a Q-value:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$

Here $\alpha$ is the learning rate, which determines how quickly an agent learns: an $\alpha$ set close to 1 ensures that the most recent information obtained is utilised, while an $\alpha$ close to 0 means that little learning will take place. An agent's degree of myopia can be controlled by setting the discount factor $\gamma$ between 0 and 1: the closer $\gamma$ is to 1, the greater the weight placed on future rewards, whereas values close to 0 consider only the most recent rewards. $\max_{a} Q(s_{t+1}, a)$ returns the maximum estimate for the future state-action pair. Once the Q-value is calculated, it is stored in the agent's Q-matrix.

Fig. 1. RL Environment Interaction [16]

Actions are chosen based on the policy $\pi$ that the agent is following. To ensure that the agent discovers the optimal policy $\pi$, a trade-off between exploration and exploitation must exist. An agent that always exploits the best known action is said to follow a greedy selection policy; however, such an implementation never explores and so pays no regard to possibly more lucrative alternative actions. In this paper an $\epsilon$-greedy policy is used, which ensures that the agent can explore the entirety of the environment, with the value of $\epsilon$ controlling the exploration rate.
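As an illustration only, and not the authors' CloudSim implementation, the tabular Q-learning update and the epsilon-greedy selection described above might be sketched in Python as follows. The state and action types, the q_table dictionary and the helper names are assumptions for the example; the parameter values match those reported later in Section V.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.8, 0.8, 0.05   # learning rate, discount factor, exploration rate

# Q-matrix: maps (state, action) pairs to estimated Q-values, initialised to 0.
q_table = defaultdict(float)

def epsilon_greedy(state, actions):
    """With probability EPSILON explore a random action, otherwise exploit
    the action with the highest current Q-value (greedy choice)."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    """One temporal-difference backup of the Q-learning rule."""
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (td_target - q_table[(state, action)])
```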
Figure 1 illustrates the interaction the RL agent has with the environment. Pseudo-code 1 outlines the Q-learning algorithm.

Q-learning Algorithm - Pseudo-code 1
  Initialise the Q-map arbitrarily, and the policy π
  Repeat (while s_t is not terminal):
    Observe s_t
    Select a_t using π
    Execute a_t
    Observe s_{t+1}, r_t
    Calculate Q:
      $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) ]$

IV. DYNAMIC RL-VM SELECTION POLICY

To the best of our knowledge, this research is the first to apply RL to the selection of individual VMs for migration from an over-utilised host. Studies such as Beloglazov et al. present efficient VM selection techniques, but these approaches cause high amounts of energy consumption [7]. Our aim is to investigate the effect an energy-aware learning agent can have on live migration, a critical process that consumes a high amount of energy in a data centre.

Our RL-VM selection policy (Lr-RL) learns to select a VM to migrate from an over-utilised host. Each host contains a set of VMs, and our Lr-RL algorithm determines the optimal VM to migrate from that set. The choice of host to which the VM is migrated is outside the scope of this research, but will be considered in future work. To clarify, our Lr-RL approach uses Beloglazov's Local Regression (Lr) technique to determine whether a host is over-utilised (Section II-C); our RL agent is implemented in place of the Mmt algorithm. By giving the agent observability over key state variables such as energy consumption, the agent has an advantage over Lr-Mmt (Section II), the algorithm against which we evaluate our approach.
A. State-Action Space

We created a novel percentile state-action space to incorporate the RL algorithm into an IaaS environment. The state $s_t$ is defined as the current host CPU utilisation, denoted $h_u$ and expressed as a percentage, so the state space is $s_t \in S = \{0, \dots, 100\}$. It is obtained through the following equation:

$s_t = \sum_{v=1}^{n} vm_u(v)$    (1)

where $v$ denotes a VM placed on the host, $vm_u$ is the function that calculates that VM's utilisation of the host's CPU and $n$ is the number of VMs that can be migrated. In other words, each host contains a set of VMs; for each VM the percentage of the host's CPU resources it utilises is calculated, and these percentages are summed to determine the overall utilisation of that host.

An action $a_t$ is represented by the selected VM's utilisation $vm_u$ of its assigned host $h$, returned as a percentage:

$a_t = \frac{vm_u(v)}{h_u(h)} \cdot 100$    (2)

The action space is therefore also defined in terms of percentages, $a_t \in A = \{0, \dots, 100\}$. Each state-action pair is mapped to a Q-value in the agent's Q-matrix.

B. Reward Function

Reinforcement Learning maximises rewards through the mapping of states to actions. To achieve this, a recurrent interaction between the agent and the environment at discrete time steps is necessary. The RL agent receives a representation of the environment in the form of the current state $s_t$, which allows an action $a_t$ to be returned based on the policy the agent is following. The reward function $R(s_t, a_t)$ is determined by the action $a_t$ taken in the current state $s_t$ of the host. At the next time step the environment returns a new representation of the current state, $s_{t+1}$, and a numerical reward $r_t$ based on the previous action $a_t$. For example, a host has an over-utilisation (in terms of CPU utilisation) of 90%; this is the agent's $s_t$. The RL agent selects an $a_t$, for example a VM that is utilising 10% of the host's CPU. The host then transitions into $s_{t+1}$, which in this case would be 80%. The agent receives a reward based on the energy usage of the host once the VM has been migrated. The reward is defined using the power model presented in [6]: a host's power consumption is nearly proportional to its CPU utilisation, so it can be described by Equation (3):

$P(\mu) = 0.7 \cdot P_{max} + 0.3 \cdot P_{max} \cdot \mu$    (3)

where $P_{max}$ denotes the host's power consumption at full load and $\mu$ represents the physical machine's CPU utilisation, which changes over time.

C. Q-Learning Implementation

The following details how the Lr-RL algorithm functions. First, the pseudo-code of the dynamic RL-VM selection policy is provided (Pseudo-code 2), followed by a detailed explanation of how the algorithm operates.

Dynamic RL-VM Selection Policy - Pseudo-code 2
  host ← overUtilisedHost
  VMs ← migrateableVMs
  possibleActions ← VM sizes
  Choose a VM from possibleActions using π
  Migrate the VM
  Observe the future host utilisation and the reward
  Calculate Q:
    $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) ]$
  Update the Q-matrix

The Lr-RL algorithm is invoked when a host is determined to be over-utilised through the Lr function, and the host is placed on a list of over-utilised hosts. A host is selected from this list and the RL agent calculates the host's state as a percentage (Equation 1). All of the host's VMs are mapped to possible actions based on the percentage of the host's CPU they utilise (Equation 2). The RL agent selects an action, i.e. a VM, using its selection policy (ε-greedy), as sketched below.
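The following Python sketch illustrates, under assumed data structures, how the state (Equation 1), the candidate actions (Equation 2) and an energy-based reward derived from the power model (Equation 3) could be computed. The VM and PhysicalHost classes, the mips fields and the negative-power reward shaping are assumptions made for this illustration; the paper only states that the reward is based on the host's energy usage after migration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VM:
    name: str
    mips_used: float          # CPU demand of this VM (illustrative unit)

@dataclass
class PhysicalHost:
    mips_capacity: float      # total CPU capacity of the host
    p_max: float              # power draw at full load, P_max (Watts)
    vms: List[VM]

def host_state(host: PhysicalHost) -> float:
    """Equation (1): host utilisation as the sum of its VMs' CPU usage, in percent."""
    return sum(vm.mips_used for vm in host.vms) / host.mips_capacity * 100.0

def vm_action(host: PhysicalHost, vm: VM) -> float:
    """Equation (2): the share of the host's current utilisation owned by one VM, in percent."""
    return vm.mips_used / sum(v.mips_used for v in host.vms) * 100.0

def host_power(host: PhysicalHost) -> float:
    """Equation (3): P(mu) = 0.7 * Pmax + 0.3 * Pmax * mu, with mu in [0, 1]."""
    mu = host_state(host) / 100.0
    return 0.7 * host.p_max + 0.3 * host.p_max * mu

def reward_after_migration(host: PhysicalHost) -> float:
    """Assumed reward shaping: lower post-migration power gives a higher reward."""
    return -host_power(host)
```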
The agent performs the action, migrating the selected VM from the current host to a suitable alternative host. The agent then observes the host's new utilisation level and receives an energy-aware reward (defined in Section IV-B). The agent calculates the Q-value for the state-action pair, which is then stored in the Q-matrix. If the host is still deemed over-utilised, the RL agent selects the next optimal VM to migrate, and the process is repeated until the host is no longer over-utilised.

V. EXPERIMENT SETUP

As the target system is an IaaS, it was essential to conduct the experiments on a large-scale virtualised data centre infrastructure simulation tool, as a real-world system would be too complex. The CloudSim framework allows for the representation of a power-aware data centre with LAN-migration capabilities. For the experiments conducted in this paper to be considered fair, we set the CloudSim parameters according to Beloglazov et al. [7]: 800 physical servers, consisting of 400 HP ProLiant ML110 G5 and 400 HP ProLiant ML110 G4 servers, are replicated within the simulator. The 30-day workload used in this experiment comes from a real-world IaaS environment: the PlanetLab files within the CloudSim framework contain data from the CoMon project, representing the CPU utilisation of over a hundred VMs from servers located at 500 locations worldwide. To make the experiments more realistic, a 30-day workload was created on a random basis from the PlanetLab files, each containing 288 values representative of CPU workloads. VMs are assigned to these 30-day workloads at random in order to best represent the stochastic characteristics of workload allocation and demand within an IaaS environment. CloudSim offers a default ceiling threshold of 100% for each host, with a safety parameter of 1.2. This safety parameter acts as an over-utilisation buffer: for example, if the current utilisation of a host is 85% and the safety parameter is set to 1.2, this gives the
host a utilisation level of 102%, so the host is considered to be over-utilised.

To ensure that the agent converges to an optimal policy, the learning parameters must be set appropriately. These values were selected by conducting a parameter sweep to determine the best performance of the agent's learning abilities. The parameters for the experiments in this paper are set as follows: $\alpha = 0.8$, $\gamma = 0.8$ and $\epsilon = 0.05$.

A. Experiment Metrics

The following metrics are used to evaluate the Lr-RL algorithm against the Lr-Mmt heuristic.

1) Energy consumption: the total energy consumed by the data centre per day in relation to computational resources, i.e. servers. Other energy draws exist, such as cooling and infrastructural demands, but these were deemed outside the scope of this research.

2) Migrations of VMs: the total number of migrations of all VMs on all servers performed by the data centre over the 30-day workload.

VI. PRELIMINARY RESULTS

This section compares our Lr-RL algorithm against the benchmark set by the Lr-Mmt technique. Both algorithms were subjected to the 30-day workload experiment, repeated 100 times. For every iteration of the experiment, the RL agent's Q-matrix was initialised to 0.

Figure 2 presents the energy consumption results for both algorithms. Over the 30-day workload period, Lr-RL consumed a total of 3,948.35 kWh compared to 4,623.75 kWh for Lr-Mmt. The standard deviation for Lr-RL was ±28.79 kWh, in comparison with ±33.12 kWh for Lr-Mmt. A paired t-test shows that there is a statistically significant difference in energy consumption between Lr-RL and Lr-Mmt, with a p-value < 0.0067 and a 95% confidence interval of (-6.474, -38.5533).

Fig. 2. Energy Consumption

The migration results for both algorithms are shown in Figure 3, highlighting that Lr-RL performed considerably fewer migrations than Lr-Mmt. Lr-RL selects VMs based on the future energy usage of a host; this resulted in fewer migrations from a single host for Lr-RL, while the host maintains sufficient processing capacity. Over the 30-day workload period, Lr-RL performed a total of 525,769 migrations compared to 797,496 for Lr-Mmt. The standard deviation for migrations for Lr-RL was ±4,443.25, compared to ±5,898.23 for Lr-Mmt. A paired t-test confirms that there is a statistically significant difference between Lr-RL and Lr-Mmt, with a p-value < 0.0001 and a 95% confidence interval of (-6,358.85, -11,756.41).

Fig. 3. Migrations of VMs

Examining a single-day workload in more detail further highlights the improvements that Lr-RL could potentially contribute to real-world data centres. The correlation between the decreased number of migrations and the energy reduction for the Lr-RL algorithm is shown in Figure 4, measured at the industry-standard interval of 5 minutes. For day 1 of the 30-day workload, Lr-Mmt had a total energy consumption of 138.55 kWh and 23,211 migrations, while Lr-RL had an energy consumption of 127.31 kWh and 19,437.4 migrations. Lr-RL therefore saves on average 11.24 kWh and performs 3,773.6 fewer migrations on the first day of this workload.

On average, for a single-day workload, Lr-RL saves 22.51 kWh of energy and performs 9,058 fewer migrations than Lr-Mmt. Lr-Mmt requires nearly 12 VMs to be moved from a host, whereas on average Lr-RL never requires more than 2 VMs to be migrated.
One reason for this is that Lr-Mmt chooses the VM with the shortest migration time, which on average accounts for only 3.06% of overall host utilisation, whereas Lr-RL on average selects a VM that accounts for 12.87% of overall host utilisation, enabling a faster process by which the host leaves the over-utilised state.

Considering the energy-saving aspect of our results: on average Lr-RL saved 22.51 kWh per day, which results in an estimated saving of 8,577.5 kWh per year. According to calculations by the EPA, this is equivalent to a reduction of 5.9 metric tons of CO2 emissions due to electricity generation, and to potentially protecting 4.6 acres of forest land from destruction [11].

The results highlight the adaptive nature of RL: an agent with the capability to learn and adapt to changing workloads reduces energy consumption in a cloud domain. However, RL also has drawbacks: for an agent to learn, a number of training episodes must be conducted, potentially requiring a substantial amount of time for the agent to converge on an optimal policy. Although the Lr-Mmt approach consumes high amounts of energy and incurs high overhead
costs for cloud providers, the size of the VM (in terms of RAM) to be migrated will not have a considerable impact on compute resources such as bandwidth in a data centre.

Fig. 4. Energy & Migration Correlation for Workload Day 1

VII. CONCLUSION

This research presented a dynamic RL VM selection policy (Lr-RL) in which an agent learns to select a VM to migrate from an over-utilised host. Based on an energy-aware reward function, the agent reduces both energy consumption and migrations. To address the aims of the paper proposed in Section I:

1) Our Lr-RL approach improves upon the best-known algorithm, Lr-Mmt, in terms of energy consumption. Our agent-based approach learns to select the optimal VM to be migrated. We created a novel percentile state-action space, represented by the host's CPU utilisation as a percentage and the VM's usage of the host's CPU, also as a percentage. The experimental results demonstrate that Reinforcement Learning can be implemented at a low level of abstraction for use in an IaaS environment.

2) The energy-aware reward function provided energy performance feedback to the agent when selecting an appropriate VM for migration. Based on our EPA calculations, our RL VM selection policy has the capability to create a cognitive live migration framework with the potential to decrease CO2 emissions from a cloud data centre.

Our research so far shows the potential benefits of an agent-based approach when applied to energy consumption problems in a cloud simulation domain. The SLAV model was out of the scope of this research, as we wanted to highlight the advances RL can achieve in energy consumption. In future work we plan to model the SLAV performance of both algorithms. This work will enable an agent to observe both SLAV and energy in order to decide on the most optimal VM to migrate, while improving the performance of a cloud data centre.

REFERENCES

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al. A view of cloud computing. Communications of the ACM, 53(4):50-58, 2010.
[2] R. M. Bahati and M. A. Bauer. Towards adaptive policy-based management. In Network Operations and Management Symposium (NOMS), 2010 IEEE, pages 511-518. IEEE, 2010.
[3] E. Barrett, E. Howley, and J. Duggan. Applying reinforcement learning towards automating resource allocation and application scalability in the cloud. Concurrency and Computation: Practice and Experience, 25(12):1656-1674, 2013.
[4] L. A. Barroso and U. Hölzle. The case for energy-proportional computing. Computer, (12):33-37, 2007.
[5] A. Beloglazov, J. Abawajy, and R. Buyya. Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing. Future Generation Computer Systems, 28(5):755-768, 2012.
[6] A. Beloglazov and R. Buyya. Adaptive threshold-based approach for energy-efficient consolidation of virtual machines in cloud data centers. 2010.
[7] A. Beloglazov and R. Buyya. Optimal online deterministic algorithms and adaptive heuristics for energy and performance efficient dynamic consolidation of virtual machines in cloud data centers. Concurrency and Computation: Practice and Experience, 24(13):1397-1420, 2012.
[8] R. Brown. Report to Congress on server and data center energy efficiency: Public law 109-431. Lawrence Berkeley National Laboratory, 2008.
[9] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829-836, 1979.
[10] X.
Dutreilh, S. Kirgizov, O. Melekhova, J. Malenfant, N. Rivierre, and I. Truck. Using reinforcement learning for autonomic resource allocation in clouds: Towards a fully automated workflow. In ICAS 2011, The Seventh International Conference on Autonomic and Autonomous Systems, pages 67-74, 2011.
[11] Epa.gov. Calculations and references, Clean Energy, US EPA.
[12] F. Farahnakian, P. Liljeberg, and J. Plosila. Energy-efficient virtual machines consolidation in cloud data centers using reinforcement learning. In Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on, pages 500-507. IEEE, 2014.
[13] J. Koomey. Growth in data center electricity use 2005 to 2010. A report by Analytical Press, completed at the request of The New York Times, page 9, 2011.
[14] J. G. Koomey et al. Estimating total power consumption by servers in the US and the world, 2007.
[15] C. Pettey. Gartner estimates ICT industry accounts for 2 percent of global CO2 emissions. 2007.
[16] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[17] Y. Tan, W. Liu, and Q. Qiu. Adaptive power management using reinforcement learning. In Proceedings of the 2009 International Conference on Computer-Aided Design, pages 461-467. ACM, 2009.
[18] G. Tesauro, N. K. Jong, R. Das, and M. N. Bennani. A hybrid reinforcement learning approach to autonomic resource allocation. In Autonomic Computing, 2006. ICAC'06. IEEE International Conference on, pages 65-73. IEEE, 2006.
[19] A. Verma, P. Ahuja, and A. Neogi. pMapper: power and migration cost aware application placement in virtualized systems. In Middleware 2008, pages 243-264. Springer, 2008.
[20] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.
[21] J. Yuan, X. Miao, L. Li, and X. Jiang. An online energy saving resource optimization methodology for data center. Journal of Software, 8(8):1875-1880, 2013.
[22] X. Zhu, D. Young, B. J. Watson, Z. Wang, J. Rolia, S. Singhal, B. McKee, C. Hyser, D. Gmach, R. Gardner, et al. 1000 islands: Integrated capacity and workload management for the next generation data center. In Autonomic Computing, 2008. ICAC'08. International Conference on, pages 172-181. IEEE, 2008.