Scaling in the Cloud - JMansur Thesis

SCALING IN THE CLOUD FOR FLASH EVENTS WITH LEAD-LAG
FILTERS
A Thesis
Presented to the
Faculty of
California Polytechnic University, Pomona
In Partial Fulﬁllment
Of the Requirements for the Degree
Master of Science
In
Computer Science
By
Jonathan Mansur
2014

Acknowledgement
I would like to first thank my thesis adviser, Dr. Lan Yang, whose calm advice helped temper
the occasional deadline panic. Next a thank you for Dr. Daisy Sang and Dr. Gilbert Young
for agreeing to be committee members, in particular, their commitments of time and energy.
I would like to thank Dr. Mohammad Husain whose class in secure communications pushed
my ability to research, but more importantly how to communicate the fruits of that research.
I would also extend thanks to Dr. Ciera Jaspan, although she is no longer a member of the
Cal Poly Pomona Faculty, her classes in software engineering and verification & validation
greatly improved how I approach software development. What I learned from her about test
driven development alone would’ve saved me a lot of self inflicted misery if I had applied it
more consistently. Thanks to the faculty and staff of the Computer Science department of
Cal Poly Pomona for their exceptional teaching of the curious art of [hopefully ]controlling
computers. In addition, thanks to the faculty and staff of Cal Poly Pomona for providing
an excellent learning environment with a spacious campus that provides ample inspiration
for regular exercise when traveling between classes.
I would like to thank my parents, Rick and Carolyn, for being my first teachers, even
if I was unappreciative at the time. My mother for teaching me the art of learning and
my father for helping develop my early technical knowledge. Additionally, I would extend
thanks to my grandparents for their support over the years. From the large to the the small,
such as, nodding along as I over-enthusiastically talk about some recently acquired piece of
knowledge, a service my recently departed and much loved, Grandma Wood was more than
happy to do. Lastly for my family, thanks to my older brother, James, and younger sister,
Crystal, whose constant chiding of ”smart-ass” seemed like it was worth proving.
iii

Acknowledgements. . . continued
Thanks to Cyber Monday, whose web service crushing sales inspired the idea for my thesis.
Additionally, thanks to speling, punctuation: grammar, whose tireless and frustrating eﬀorts
made this thesis comprehensible, hopefully. Along the same lines, a thank you for deadlines,
with their stress induced fervor helping fuel many tireless hours of testing and development.
A special thanks to Advil for reasons that readers of this thesis may understand (and
hopefully appreciate).
And ﬁnally, I would like to thank bunnies.
Why? Do bunnies need a reason to be thanked?
iv

Abstract
Cloud services make it possible to start or expand a web service in a matter of minutes
by selling computing resources on demand. This on demand computing can be used to
expand a web service to handle a flash event, which is when the demand for a web service
increases rapidly over a short period of time. To ensure that resources are not over or
under provisioned the web service must attempt to accurately predict the rising demand for
their service. In order to estimate this rising demand, first and second order filters will be
evaluated on the attributes of cost and response time, which should, and in fact do, show
that with an increase in cost lead/lag filters can keep response times down during a flash
event. To evaluate first and second order filters a test environment was created to simulate
an Infrastructure as a Service cloud, complete with a load balancer capable of allocating or
de-allocating resources while maintaining an even load across the active instances. Flash
events were created using a flash model that generated CPU load across the instances as an
analog to demand. Testing used both predictable and secondary flash events, unpredictable
flash events were excluded due to test environment limitations. Lead/lag filters will be
shown to outperform the leading competitors method of demand estimation when the flash
event scales up quickly.
v

Table of Contents
Acknowledgement iii
Abstract v
List of Tables viii
List of Figures ix
1 Introduction 1
2 Literature Survey 2
2.1 Virtual Machines and Hypervisors . . . . . . . . . . . . . . . . . . . . . . . 2
2.1.1 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1.2 Hardware-Layer Hypervisor . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.3 OS-Layer Hypervisor . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.4 The Xen Hypervisor . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Cloud Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Infrastructure as a Service . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Instance Management . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Load Balancing Service . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 Virtual Appliances . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Flash Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Flash Event Classiﬁcations . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Scaling Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 Scaling Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Demand Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.1 Current Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.2 Lead-Lag Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Research Goal 21
vi

4 Methodology 21
4.1 Test Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Flash Event Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Scaling Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 Demand Estimation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Test Phases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.6 Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5 Evaluation of results 27
5.1 Allocation Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1.1 Predictable Flash Event . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.1.2 Secondary Flash Event . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 Responsiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6 Conclusion 35
References 37
vii

List of Tables
2.1 Flash event data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1 Test hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Flash events for testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.1 Cost results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 Responsiveness results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
viii

List of Figures
2.1 Hypervisor types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 The Xen hypervisor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Timeline of important cloud events . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Load balancing architectures . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 The Slashdot effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Flash event phases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.7 Demand estimation comparison. . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1 The testing environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Demand response example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Generated demand inaccuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1 The control instance allocations for the predictable flash event. . . . . . . . 28
5.2 The EWMA instance allocations for the predictable flash event. . . . . . . . 29
5.3 The FUSD instance allocations for the predictable flash event. . . . . . . . 29
5.4 The lead filter instance allocations for the predictable flash event. . . . . . . 30
5.5 The lead-lag instance allocations for the predictable flash event. . . . . . . . 30
5.6 The control instance allocations for the secondary flash event. . . . . . . . . 31
5.7 The EWMA instance allocations for the secondary flash event. . . . . . . . 32
5.8 The FUSD instance allocations for the secondary flash event. . . . . . . . . 32
5.9 The lead filter instance allocations for the secondary flash event. . . . . . . 33
5.10 The lead-lag filter instance allocations for the secondary flash event. . . . . 33
ix

1 Introduction
A flash event occurs when a website experiences a sudden and large increase in demand,
such as, what Wikipedia experienced when Steve Jobs died on October 5, 2011 [1]. While
Wikipedia was able to handle the increased demand, not all web services can handle flash
events. A good example of a service failing to handle a flash event would be the Moto X
Cyber Monday sale [2]. Motorola potentially lost sales on that day because their servers
could not handle the demand for their product. To handle a flash event a website, or its
services, must be able quickly scale to serve new demand.
In order to scale quickly, a website/service operator could utilize an Infrastructure as
a Service(IaaS) from a public or private cloud to quickly increase the services resources.
In particular, the service operator can use the property of rapid elasticity [3] in the cloud
to respond to demand. To achieve rapid elasticity, cloud providers virtualize their infras-
tructure. Once virtualized the cloud provider can both start and sell virtual instances in a
manner of minutes using the properties of hypervisors and virtual machines. However, an
operator will have to have a method of estimating demand during the flash event in order
to handle the demand without allocating unneeded resources.
In this thesis works, I will be evaluating the usage of lead-lag filters to calculate antic-
ipated demand. This approach will be compared with those presented by Xiao, Song, and
Chen, in their 2013 paper “Dynamic Resource Allocation Using Virtual Machines for Cloud
Computing Environment” [4]. These methods will be implemented in a test environment
and compared on their ability to respond to a flash (more details in section 5).
This thesis will be structured as follows, the literature survey (section 2) will start
by examining hypervisors, virtual machines, and their properties (section 2.1). The Xen
hypervisor will be given special attention because of its role in creating IaaS. The next
section will cover cloud services (section 2.2) with a focus on services and properties that
improve scalability. From there, the main motivation of this thesis, flash events (section
2.3) will be covered with an emphasis on consequences and modeling. The last two sections
of the literature survey are about framing the problem of scaling for flash events (section
2.4) and how to estimate demand (section 2.5). After the literature survey, the research
1

goal (section 3), methodology (section 4), and evaluation of results, (section 5) will cover
what how the different estimation methods were evaluated and the results. The final section
(section 6) will present the conclusions of the research.
2 Literature Survey
2.1 Virtual Machines and Hypervisors
Hypervisors, or virtual machine monitors, are programs that can execute a virtual in-
stance of another system, called virtual machines (guest system), on a single host machine.
1974 is the starting point for virtualization, Popek and Goldberg defined the formal require-
ments to implement a virtual system [5]. Virtual systems stagnated between 1974 and 1998
because the common hardware architectures didn’t satisfy the virtualization requirements.
However, in 1998 VMWare managed to bring virtualization to the most common CPU ar-
chitecture of the era, the x86 CPU [6]. VMWare achieved this through paravirtualization,
the modification of the guest OS [7]. With the rise of x86-64 CPUs, Intel and AMD in-
troduced advanced features to enable better virtualization. In 2005 Intel introduced VT-x,
which allows instruction set virtualization, and in 2007 AMD introduced RVI, memory vir-
tualization. In 2002 the Xen hypervisor was open-sourced. The Xen hypervisor was defined
as a Hardware-Layer hypervisor, in a paper titled “Xen and the Art of Virtualization” by
Barham et al. [8]. This hypervisor would later go on to be used in the creation of Amazon’s
EC2 cloud service [9].
2.1.1 Properties
The requirements specified by Popek and Goldberg start by defining two types of in-
structions, innocuous and sensitive. Sensitive instructions are those instructions that could
cause interference between the host machine or other virtual machines, conversely, innocu-
ous instructions have no potential interference. One of the requirements for a virtual system
was that all sensitive instructions must also be privileged, meaning they can be intercepted
before they are executed (i.e. trapped). Once intercepted the hypervisor must execute
equivalent code for the guest instance that will not cause interference. If the hardware
2

architecture could satisfy the sensitive/privileged requirement, it would then have to satisfy
the following three properties.
Virtual Machine Properties
Efficiency
All innocuous instructions are directly executed on the hardware.
Resource Control
The hypervisor should maintain complete control of the hardware resources for the
virtual machine at all times.
Equivalence
Any program within the virtual machine should execute like it was on physical hard-
ware.
These properties allow the hypervisor to control and monitor virtual machine in ways
that are impossible to do with physical hardware. The first ability is to suspend and resume
a system while it is currently executing a process without invoking the operating systems
shutdown or hibernate features. The second ability is to monitor the virtual machines
resources without loading any special software in the virtual machine. A list of resources
that the hypervisor can track are; CPU utilization, memory usage, storage speed/utilization,
and network speed/traffic.
Paravirtualization bypasses the sensitive/privileged requirement by modifying the guest
OS to use an API that avoids or specially handles sensitive instructions. However, par-
avirtualization still achieves the three virtual machine properties making it an effective
technique for creating a virtualizable system.
2.1.2 Hardware-Layer Hypervisor
A hardware-layer hypervisor, as shown in Figure 2.1a, is also commonly known as a
bare metal hypervisor or a type 1 hypervisor. This hypervisor has direct access to the
underlying hardware, there is no host OS between the hypervisor and the hardware, and
is commonly used in server settings because of the improved performance over OS-layer
3

(a) Hardware-layer hypervisor (b) OS-layer hypervisor
Figure 2.1. Hypervisor types
hypervisors [10]. The lack of a Host OS limits the ability to monitor and dynamically
manage the hypervisor. Examples of a hardware-layer hypervisor are the Xen hypervisor
and VMWare’s ESXi hypervisor.
2.1.3 OS-Layer Hypervisor
The OS-layer hypervisor, shown in ﬁgure 2.1b, is also referred to as a hosted hypervisor
or a type 2 hypervisor. These hypervisors have slower performance because of the additional
overhead of the host OS between the hypervisor and the hardware. This type of hypervisor
has found usage with security researchers to help them isolate viruses and malware for
research. Another use includes system prototyping within IT to ensure software works
together. Examples of OS-layer hypervisors would include Oracle’s VirtualBox, VMWare’s
Player, and Parallels’ Desktop.
2.1.4 The Xen Hypervisor
Keir Fraser and Ian Pratt created the Xen hypervisor in the late 1990’s while working
in the Xenoserver research project at Cambridge University. The Xen hypervisor was open-
sourced in 2002, it is now controlled by the Xen Project, a collaboration with the Linux
Foundation. The ﬁrst major release of the Xen hypervisor, version 1.0, was in 2004, followed
shortly by Xen 2.0. The latest version of the Xen hypervisor, 4.3, was released on July 9,
4

Figure 2.2. The Xen hypervisor
2013 [11]. Another milestone for the Xen hypervisor occurred 2006, when it helped launch
Amazon’s EC2 cloud and cemented the role of virtualization within cloud infrastructures.
To overcome the monitoring and management restrictions of a hardware-layer hyper-
visor, the Xen Project divided the virtual machines into specific privilege domains. In
particular, a highly privileged domain, Dom0, has complete control of the hardware re-
sources and to all other executing virtual machines. Most virtual machines that run on the
Xen hypervisor, such as virtual instances for a cloud service, will run in the unprivileged
domain, DomU. Within this domain, the virtual machine must be given permission to access
the hardware. This virtual architecture is shown in figure 2.2.
2.2 Cloud Services
The early history of cloud computing is intertwined with the history of grid computing,
virtualization technology, and utility computing. Grid computing started in the mid-90s
as a way of handling large-scale problems with only commodity hardware, which at the
time would’ve required either a supercomputer or a large cluster. Shortly after this, in
1997, the term cloud computing was introduced by Ramnath Chellappa, a University of
Texas professor, while giving a talk about a new computing paradigm. The following
year, 1998, marked the start of low cost virtualization when VMWare was founded. More
importantly for cloud computing, 1999 marked the year that the first cloud computing
5

company was created, with the founding of Salesforce.com. The next major milestone for
cloud computing occurred in 2006 when Amazon launched their infrastructure-as-a-service,
the Elastic Compute Cloud [12]. After Amazon launched their cloud service, both Google
and Microsoft followed with their own offerings. A timeline of important events is shown
in figure 2.3.
1995Gridcomputingisstarted
1997“CloudComputing”coined
1998VMWareisfounded
1999Salesforce.comisfounded
2002Xenhypervisorisopen-sourced
2005IntelintroducesVT-x
2006AmazonlaunchesEC2
2007AMDintroducesRVI
Figure 2.3. Timeline of important cloud events
What both grid computing and cloud computing tried to achieve is a disruption of the
business model for computation. Before either of these computing models were available,
computational cost was usually a large onetime upfront cost (purchase of the hardware).
However, with these new models the concept of utility computing, paying directly for the
computation time needed, is possible. While both grid computing and cloud computing are
forms of utility computing, there are still differences between their business models.
Grid computing was the earlier concept for utility computing with a focus on large scale
problems. Most grids, such as TeraGrid, were set up in a sharing scheme to solve these large
problems. In this scheme, a group of organizations would join their computational resources
and then allocate them back out from a given criteria. The intention of the sharing was to
increase overall computational capacity by utilizing the downtime of other organization’s
resources [13]. The utility model came from how the resources were allocated by need,
instead of capacity.
Cloud computing was created with a focus on web services, which resulted in a different
6

business model and a different set of features. The biggest difference between the business
models was a change from computation sharing to a pay-as-you-go process. To simplify
the pay-as-you-go model, cloud providers created the following three classifications of cloud
services, which were formalized by National Institute of Standards and Technology(NIST)
[3]:
Infrastructure as a Service (IaaS)
is when access to the hardware resources, or virtualized equivalents, is the product
being sold.
Platform as a Service (PaaS)
is when hosting for user created applications are run on the cloud providers hardware
through their tools and APIs.
Software as a Service (SaaS)
is when the user accesses the cloud provider applications.
NIST also defines 5 essential properties of the cloud as follows [3]:
On-demand Self-service
A user can obtain and allocate computational resources without human intervention.
Broad Network Access
The cloud is readily accessible through standard network mechanisms (usually the
Internet).
Resource Pooling
The provider gathers the computational resources into a pool, wherein consumers can
purchase the resources they need.
Rapid Elasticity
Resources can be acquired and relinquished quickly.
7

Measured Service
Computational resources are monitored, by both the provider and user, in order to
optimize resource control.
For the purposes of this thesis only IaaS will be considered further.
2.2.1 Infrastructure as a Service
To provide Infrastructure as a Service (IaaS), companies such as Amazon, group their
computational resources into instances of various sizes and then sell access to them on an
hourly basis. In order to make the usage of such instances easier, most cloud providers offer
additional management services, such as, load balancing and utilization tracking. These
tools, covered in more detail in sections 2.2.2 and 2.2.3, create a basis for a scalable service.
2.2.2 Instance Management
If a cloud service satisfies NIST requirements, specified in section 2.2, then the user will
have the following tools available to control their utilization of cloud resources:
Utilization Monitor
measures the current resource load of the acquired resources. This will monitor various
aspects of the resources, such as, CPU load, memory usage, network traffic, and
storage usage.
Instance Acquirer
allocates additional resources/instances upon user request.
Instance Relinquisher
deallocates existing resources/instances upon user request.
As a consequence of all of these tools being on-demand self-service, it is possible to
create a scaling function that takes in the current resource utilization and outputs an ex-
pected resource utilization estimate. Once the estimate is generated then the instance
acquirer/relinquisher can adjust the current instance allocation to match the expected uti-
lization. This behavior can be summarized in the following equations 1 and 2.
8

uf = E(uc) (1)
Where,
uc is the current, or recent history, of the utilization of the resources
uf is the estimate of future resource utilization
E is a function that tries to estimate future resource utilization
Ralloc =
uf
Rc
(2)
Where,
uf is the estimate of future resource utilization
Rc is the resource’s computational capacity
Ralloc represents how many instances are needed to serve the expected utilization
2.2.3 Load Balancing Service
To improve the elasticity of their cloud services, several providers which include Google
[14], Amazon [15], and Rackspace [16], have added load balancing as a service. Load
balancers improve a services scalability by distributing the workload evenly across a number
of servers [17]. There are two types of load balancers, content-blind and content-aware.
Content-blind load balancers do not use the purpose of incoming connections when assigning
connections to servers. Conversely, content-aware load balancers do use the purpose of
incoming connections when assigning connections. Content-aware load balancers are only
necessary when the balancer is trying to balance very disparate services into a single set of
servers. However, flash events have been shown to focus on a particular set of homogeneous
resources [18], which would not require content-aware load balancing and therefore only
content-blind load balancers will be covered.
To distribute load to different servers, a load balancer will take in a series of TCP
connections and forward them to a set of servers, hopefully in an equitable manner. For
content-blind forwarding there are three possible solutions for two different network layers.
Additionally, each of these solutions can either be a two-way architecture, meaning the
9

(a) One-way architecture (b) Two-way architecture
Figure 2.4. Load balancing architectures
returning traffic goes through the load balancer, or one-way architecture, meaning the
returning traffic is sent directly from the loaded servers. The differences in forwarding
architectures are shown in figure 2.4.
Layer-2 forwarding uses direct routing, which places a device between the client-side
(service user) and serve-side (service provider/cloud user) router. This device works by
rewriting the destination MAC address to an internal server, in a way that hides the exis-
tence of an intermediary device while also distributing the connections. Using this forward-
ing method the returning traffic is sent directly from the servers to the service user because
they are all in the same IP subnet (one-way architecture)
Layer-3 forwarding has two different approaches, Network Address Translation (NAT),
which is a two-way architecture, and IP Tunneling, which is a one-way architecture. NAT
works by rewriting the destination address of the incoming, load balanced, and outgoing,
returning packets. This has the potential to make the load balancer a bottleneck in the
system because of the multitude of checksum calculations that need to be performed. The
next method is IP tunneling, which takes the incoming IP datagrams and encapsulates them
in a new datagram that also contains the virtual IP address, the IP address hosting the web
service, and the original source destination. The end-server can then use this information
to directly send the packets back to the service user directly (one-way architecture).
Once a forwarding method is selected, the system must implement a method of distribut-
ing the packets evenly. For content-blind system there are the following four methods:
Round Robin
incoming connections are assigned to servers in a circular manner.
10

Least Connections
an incoming connection is assigned to the server with the least number of connections.
Least Loaded
an incoming connection is assigned to the server with the lowest CPU utilization.
Random
Connections are assigned randomly as they arrive.
There are also weighted versions of the round robin and least connections method that
improve performance when load balancing heterogeneous servers. The least connections
and least loaded methods require additional information tracking that makes them more
costly to implement than the round robin method, at least for homogeneous servers.
2.2.4 Virtual Appliances
Virtual appliances were defined in 2003 by a group of Stanford University researchers as
the combination of a service and its supporting environment that can be executed on a hy-
pervisor [19]. These appliances can either be user generated or offered by the cloud provider.
To reduce the system’s execution and deployment overhead, some virtual appliances have
been built using a new operating system (OS) design called Just Enough Operating Sys-
tem (JeOS, pronounced like juice). By stripping the OS down to just storage, file system,
networking, and support packages, JeOS creates a light and easy to install system with a
small footprint [20]. Lightweight systems help scalability in a number of ways, one of which
is the reduction in startup time. For example, Xiao, Chen, and Luo were able to achieve
70% reduction in an appliance startup time with virtual instances that had only 1 GB of
virtual memory [21]. The startup time was compared against a system going through the
normal boot sequence to one that used the hypervisor’s resume feature.
While not directly related to scaling, virtual appliances are easier to maintain and install
for two reasons. The first reason is because the reduced support environment removes
potential conflicts, which in turn reduces testing time and problems. The second reason
is the virtualization makes setting up additional instances as easy as copy-and-pasting a
11

file. The copy-and-paste nature will also aid the maintenance of a system by easily creating
isolated environments to test software updates/patches.
Overall, virtual appliances offer a way to improve the scaling and maintenance of virtual
instances in a cloud IaaS. Scaling is aided by the lightweight design speeding up start
times and service encapsulation reducing the overhead cost of additional instances. The
encapsulation and lightweight nature also help maintanability by reducing the chance of
problems.
2.3 Flash Events
A flash event is when a large, and possibly unpredicted, number of legitimate users
attempt to access a service concurrently. While large demand for a service could be good
for a business, if the demand starts degrading the response time the business can lose money,
possibly to the extent of 1% sales drop for each 100ms of delay [22]. With a large enough
flash event, websites could become unreachable. Like what happened to the news sites,
CNN and MSNBC, during the 9/11 terrorist attack [23]. Research into flash events has
primarily focused on distinguishing them from distributed denial of service attacks (DDoS).
However, this research is hampered by the lack of traces from real flash events, though a
couple of good traces exist, such as the 1998 FIFA World Cup dataset and the traffic log
for Steve Jobs’ Wikipedia page after his death. From these, and other sources, Bhatia et
al. presented a taxonomy and modeling system for flash events in their paper “Modeling
Web-server Flash Events” [18], which will be covered in more detail in the following sections.
2.3.1 Flash Event Classifications
Before attempting to model flash events, Bhatia et al. divided flash events into the
following three categories:
Predictable Flash Event
the occurrence time of the flash event is known a priori
Unpredictable Flash Event
The occurrence time of the flash event is not known
12

Secondary Flash Event
When a secondary website/service is overloaded by being linked to from a more traf-
ficked website
Figure 2.5. The Slashdot effect
Predictable flash events usually occur when a popular event is scheduled, such as, a
popular product launch or a sporting event. Conversely, unpredictable flash events happen
when the root event is unplanned/unexpected, like the death of a celebrity or a natural
disaster. The last classification is for secondary flash events, which could be considered a
lesser more specific version of an unpredictable flash event. This type of flash event was
given its own classification because of how regularly it happens, in particular, it can go by
another name, “the Slashdoteffect”. What happens is a popular site, such as Slashdot, links
to an under resourced (much less popular) website and the resulting traffic flow overwhelms
linked website). An example of a moderate Slashdot effect is shown in figure 2.5.
2.3.2 Modeling
Bhatia et al. break up a flash event into two separate phases, the flash-phase and the
decay phase. They rejected other flash event models because of the inclusion of a sustained-
traffic phase, which their data showed was magnitudes smaller than the other phases [18].
In figure 2.6 the different phases are illustrated.
For their model, Bhatia et al. created two equations, one equation to model the flash
phase and the other to model the decay-phase. Before creating the equations, Bhatia et al.
13

Figure 2.6. Flash event phases
defined the traffic amplification factor (Af ) as the ratio of the peak rate requests (Rp) to
the average normal request rate (Rn). Af can be set to different values to simulate different
flash event magnitudes. It is also important to mark specific points in the flash event. The
starting point of the flash event is called T0. After that is the flash peak, which will indicate
when the flash-phase will be complete and Rp has been achieved, which is indicated by Tf .
The flash peak also indicates the start of the decay phase, which ends when the request
rate drops back to normal (Rn) and is indicated by Tp.
Flash Phase For the flash-phase they used a standard exponential growth model, which
assumed that increase in the current rate of traffic(dR
dt ) was proportional to the current
traffic rate(αR). Such that,
dR
dt
= αR
This differential equation can be solved by substituting values from the begining of the
flash event. In particular, t = t0 and R = Rn,
R = Rneα(t−t0)
(3)
14

In this equation α is the flash-constant. In order to solve the flash-constant they sub-
stitute values into the equation for when the flash event has reached the peak. This results
in the flash-constant being,
α =
ln Af
(tf − T0)
Substituting the flash-constant back into equation 3 they got the following equation that
could model the flash phase of a flash event,
R = Rne
ln Af
(tf −T0)
(t−t0)
(4)
Decay Phase Bhatia et al. used a similar technique to model the decay phase, however,
it was the decrease in traffic that was proportional to the current traffic rate,
dR
dt
= −βR
Where, β is the decay-constant. Solving the decay-constant and the differential equation
are similar to what was done for the flash-phase equation (equation 4) and will therefore
be skipped and the final equation presented as follows,
R = Rpe
− ln Af
(td−Tf )
(t−tf )
(5)
With equations 4 and 5, set with parameters Rp, Rn, t0, tf , and tp, it is possible to
model flash events of varying flash peaks, flash-phase lengths, and decay-phase lengths.
Flash Event Data Bhatia et al. gathered data from four flash events across all three
classification. For the predictable flash events they analyzed two events for the 1998 FIFA
World Cup semi-finals. For the unpredictable flash event they used the traffic logs for the
Wikipedia after the death of Steve Jobs. For the last event type, secondary flash event,
they analyzed the traffic to a website experiencing the Slashdot effect. The results of these
analyses is in table 2.1 [21].
15

Table 2.1. Flash event data
Flash Event Amplification
Factor (Af )
Normal
Rate (Rn)
Peak Rate
(Pn)
Flash
Time
(tf )
Decay
Time
(td)
FIFA Semi-
Final 1st
8.47 1,167,621 9,898,273 3 2
FIFA Semi-
Final 2nd
9.47 1,032,715 9,785,128 2 2
Wikipedia
after Steve
Jobs’ death
582.51 1,826 1,063,665 2 33
Slashdot Ef-
fect
78.00 1,000 78,000 2 56
Model Validity The last part of Bhatia et al. modeling effort included validation of their
flash model against real world flash events, in all three classifications. While the results are
too long to list here, overall the model performed well even though it had a slightly faster
flash-phase and a slightly slower decay-phase then the trace data.
2.4 Scaling Framework
For their scaling framework, Vaquero et al. proposed a simple rule framework where
each rule has a specific condition and associated action [24]. The condition must evaluate to
either true or false, and if it evaluates to true then the linked action will be performed. The
potential actions are limited by what the cloud service provides. Potential conditions could
include current demand, budget constraints, and/or whether authorization is granted. For
example, consider the following three rules;
If ((E(uc) > C) AND (I ≥ 8) AND P) Then I++
If ((E(uc) > C) AND (I < 8)) Then I++
If ((E(uc) < C) AND (uc < C) AND (I > 2)) Then I--
Where: I is the number of instances currently running and P indicates if permission
to acquire more instances is granted. The terms E(uc),uc, and C are as covered in section
2.2.2.
16

The rules describe a system that has a total of four properties. The ﬁrst property is
that there will always be a minimum of two instances allocated. For the second property,
instances will only be released when the expected utilization (E(uc)) and the current uti-
lization (uc) are below the current capacity(C), ideally this is some fraction of the current
capacity to ensure the system is not constantly trying to release instances. The third prop-
erty is that up until eight instances, the system will scale without any human involvement.
The fourth property is that to continue scaling a user must grant permission, this prevents
a cost overrun. The fourth property could be improved with addition of rules that would
report when the system is expected to exceed eight instances.
2.4.1 Scaling Problem
Scaling for a ﬂash event is like trying to solve the problem of how best to keep response
times low without excessive spending. To borrow notation from scheduling algorithms [25],
the problem of response time can be written as a special instance of Total Completion Time
objective function, like follows,
Lj (6)
Where: the release date (rj) is the same as the due date ( ¯dj), ensuring the completion time
(Cj) is greater than the due date, and Lj = Cj − ¯dj.
However, equation 6 alone does not capture the monetary constraints, to do this we
must minimize the sum of the usage time of each instance by its hourly cost
Ck ∗ ceil(Uk) (7)
Where: k is a unique virtual instance, Ck is the hourly cost of instance k, and Uk is the the
total time the instance executed.
The above equations make the following assumptions:
• The instances are all the same
• The only costs are only from instance execution
17

• An instance is only used once (this is to ensure that Ck ∗ ceil(Uk) is correct)
If both equation 6 and 7 are Minimized, then an optimal solution has been found.
Since the jobs are not known ahead of time an optimal solution cannot be guaranteed, but
heuristic algorithms can be used to allocate instances in an efficient manner.
2.5 Demand Estimation
Scaling for a flash event is essentially trying to satisfy current and expected demand. The
easiest way to satisfy this demand would be to buy as many instances as possible, however,
this would tend to overspend in most circumstances without any benefit. Therefore, the
best way to minimize cost and response time, formally defined in section 2.4, is to accurately
estimate future demand for the service. In the following sections 2.5.1 and 2.5.2, different
methods of estimating demand will be covered.
2.5.1 Current Algorithms
Xiao et al. proposed two methods of estimating demand using the monitorable behavior
of virtual instances, an exponentially weighted moving average (EWMA) and a fast up and
slow down (FUSD) method [4].
The EWMA method, weights the previous expectation (E(t − 1)) against the current
observed demand (O(t)) as follows,
E(t) = α ∗ E(t − 1) + (1 − α) ∗ O(t), 0 ≤ α ≤ 1 (8)
Values of α closer to 1 will increase responsiveness but lower the stability, while, con-
versely, a value of α closer to 0 will increase stability and lower responsiveness. In their
testing, Xiao et al. found that an α value of 0.7 was nearly ideal. However, this method
has a weakness, in that the current estimation will always be between the previous estimate
and the current observation. In other words, EWMA cannot anticipate increases in de-
mand. The second method, FUSD, was developed to overcome the weakness of EWMA. To
improve the method they allow α to be a negative value, resulting in the following equation,
18

E(t) = O(t) + |α| ∗ (O(t) − E(t − 1)), −1 ≤ α ≤ 0 (9)
This creates an acceleration effect by adding a weighted change from the previous es-
timate to the current observed demand. Their testing also showed an ideal value of -0.2.
This corresponds to the “fast up” part of their algorithm, the next part is meant to prevent
the system from underestimating demand, i.e. “slow down”. In order to slow the fall in
expectation they split the α, depending upon whether the observed demand is increasing
(O(t) > O(t − 1)) or decreasing (O(t) ≤ O(t − 1)), into an increasing α↑ and decreasing α↓.
This results in the following set of equations,
E(t) =



O(t) + |α↑| ∗ (O(t) − E(t − 1)) if O(t) > O(t − 1), −1 ≤ α↑ ≤ 0
α↓ ∗ E(t − 1) + (1 − α) ∗ O(t) if O(t) ≤ O(t − 1), 0 ≤ α↓ ≤ 1
(10)
The only information still needed to implement these methods is the length of the
measurement window, or more simply the length of the time slice where demand is measured
and averaged into a single value. Larger time slices help reduce the impact noise in the
data at the cost of some responsiveness.
2.5.2 Lead-Lag Filters
Within the control systems field there are a set of filters that behave similar to the
FUSD method, described in section 2.5.1. They are the first order lag filter, first order
lead filter, and second order lead-lag filter. One of the advantages of these filters is their
implementability within real-time systems through discretizations of the Laplace transforms.
I could not find any work related to the usage of lead/lag filters in estimating demand. I did,
however, find work related to transfer functions (the math used to create lead-lag systems)
being used to model short term electrical demand [26], which can experience flash events.
The coverage of these methods will start with the first order lag filter defined by Benoit
Boulet, in chapter 8 of his book ”Fundamentals of signals and systems”, as the following
Laplace transform [27],
19

H(s) =
ατs + 1
τs + 1
(11)
Where τ > 0 and 0 ≤ α < 1
When τ = 0 and the input signal is stable, the filter will decay at a rate equal to e−τx,
where x is the time in seconds. An example of this decay can be seen in figure 2.7 from
time 10s to 20s. The next type of filter is the lead filter, which could be considered a flash
up and flash down method because the filter amplifies changes in the input signal before
following a decay path once the input has stabilized. The Laplace transform for a lead filter
is as follows,
H(s) =
ατs + 1
τs + 1
(12)
Where τ > 0 and α > 1
In this system α controls the amplification of any changes in the input signal. figure
2.7 shows the complete behavior of a lead filter. The last type of filter is a second order
lead-lag filter which combines both the lead filter with the lag filter as follows,
H(s) = K ∗
ατas + 1
τas + 1
βτbs + 1
τbs + 1
(13)
Where τa > 0 and τb > 0 and αa > 1 and 0 ≤ αb < 1
With this the behavior of both first order lead filters and first order lag filters can be
combined to both flash up and slow down. The decaying nature of the filters is comparable to
what Xiao et al. had proposed. Therefore, performance behavior should be similar between
the different methods. One significant advantage that the filters have is the granular control
of both the flash up and slow down phases. This tunability could be used with machine
learning techniques in future work to create systems that adapt to changing flash events.
20

Figure 2.7. Demand estimation comparison.
3 Research Goal
The goal of this research is to show the effectiveness of a lead-lag filter in scaling a
service for a flash event within a cloud environment. To judge the effectiveness of a lead-lag
filter’s ability to anticipate demand, this method will be compared to methods proposed by
Xiao et al. The algorithms will be primarily judged by response time and resource cost, as
covered in section 2.4.1.
4 Methodology
The three methods for estimating future demand EWMA, FUSD, and lead/lag filters,
as covered in section 2.5, will be implemented and compared in a scalable test environment.
This environment will have a custom load balancer that can handle the addition and removal
of virtual instances. The estimation methods will be combined with scaling rules similar
to what was presented in section 2.4. The flash event model presented in section 2.3.2 will
be used to generate flash events for the implementations to handle. This traffic will then
be routed through the load balancer to custom virtual appliances, as described in section
2.2.4, that will handle the execution of tasks.
21

4.1 Test Environment
The test environment contains four primary components; tasks, the traffic generator,
the load balancer, and the worker nodes. The hardware for the traffic generator, load
balancer, and the worker nodes is given in table 4.1. The load balancer and worker nodes
are all virtual instances running running on separate hardware, i.e. one machine for the
load balancer and eight different machines for the worker nodes. The traffic generator is
not running in a virtual instance but is on an isolated machine.
Table 4.1. Test hardware
CPU CPU
Speed
(Ghz)
Cores/
Threads
Memory
Capacity
(GB)
Memory
Speed
(Mhz)
Virtual
Worker
Node
Intel Xeon
E3-1270 v2
3.5 4 / 8 4 1600 Yes
Load
Balancer
Intel Xeon
E3-1270 v2
3.5 4 / 8 4 1600 Yes
Traffic
Generator
Intel Core
i5-3570
3.4 4 / 4 16 1600 No
The traffic generator is responsible for creating tasks at a rate given by the flash model
specified in section 4.2. The traffic generator is also responsible for receiving tasks after
they’ve finished execution for analysis of response times. Once a task is created it is then
sent to the load balancer, which will forward the tasks to the worker nodes using a round
robin algorithm. However, the load balancer’s primary task is to apply the scaling rules
specified in section . To apply these rules the load balancer will have the ability to poll all
worker nodes to collect CPU utilization data, shutdown worker nodes to remove instances,
and start the nodes when adding instances. The load balancer will also track when instances
are started and stopped for analysis. Once a task arrives at the worker node it is placed in
a queue for processing. Each worker node has eight threads dedicated to the execution of
tasks. Once a task is finished executing it is sent back to the traffic generator for analysis.
There are in total eight worker nodes that can work on executing tasks. Figure 4.1 shows
how the tasks flow through the test environment.
To allow the traffic generator to overcome an at least eight to one computational advan-
tage of the worker nodes, each task is implemented to require 1s to execute. Tasks record
22

Figure 4.1. The testing environment
information on creation time, execution start time, the time they are sent back to the traffic
generator, and when the arrive at the traffic generator. The creation time and execution
time are used to calculate the response time of each task. Differences between the worker
nodes and the traffic generator system clock could skew the results by either inflating or
minimizing the actual start time. To prevent this skewing the tasks adjust the response time
by reducing it by the difference between the sent time and the arrival time. This works by
reducing the response time if the worker node’s clock is greater than the traffic generator’s
clock and conversely increasing it if the worker node is behind the traffic generator’s clock.
Completion time is not tracked because each task is executed in a non-preemptive manner,
this guarantees that each task will complete 1s after it is started.
4.2 Flash Event Model
To implement the flash model specified in section 2.3.2 the request rates will be changed
to percentage capacity of a single worker node, where 100% indicates complete utilization
of a single worker. For example, if six worker nodes are active with 80% utilization the
percentage capacity would be 480%. With this scheme it easier to model flash events within
the limited capacity of the test environment without exceeding capacity, which can obscure
differences between the different estimation techniques. The test environment cannot handle
modeling amplification factor greater than 120, which means only a predicted flash event
and a secondary flash event can be handled. Data from table 2.1 is used to generate the
following parameters (table 4.2) for the two flash event models.
23

Table 4.2. Flash events for testing
Flash
Event
Type
Amplification
Factor
Normal
Traffic(%)
Peak
Traffic
(%)
Flash
Time
(minutes)
Decay
Time
(Minutes)
Predicted 10 64 640 30 30
Secondary 80 8 640 15 420
4.3 Scaling Rules
Each scaling method will implement the following four rules:
1. Add instance when anticipated demand is greater than 90% of current capacity
2. Remove instance only if the removal will not increase demand above 85%
3. Total instances will not exceed 8
4. Total instances will not decrease below 1
All aspects but the demand estimation will be held constant to ensure that the results
between different tests can properly show the effects of demand estimation.
4.4 Demand Estimation Techniques
The following five techniques will be implemented to determine the performance of
lead/lag filters.
Current Demand
Control method to test how the scaling rules work without anticipation
Exponentially Weighted Moving Average (EWMA)
A method that lags the current demand
Flash-Up Slow-Down (FUSD)
Demand anticipation method proposed by Xiao et al.
First-Order Lead Filter
The maximum between the current demand a lead filter specified by equation 14
24

Second-Order Lead-Lag Filter
A second order lead-lag filter specified by equation 15
The first and second order filters are given by the following equations,
1.2s + 1
s + 1
(14)
The above equation uses the same lead characteristic as the FUSD method with an
overshoot on a transient of 1.2. This method uses the maximum between the current demand
and the output of lead filter to ensure that the anticipated demand doesn’t undershoot
causing a loss in responsiveness.
1.26s2 + 2.5s + 1
s2 + 2s + 1
(15)
The above filter was designed to combine both the lead and lag qualities of the FUSD
method in a single filter. Figure 4.2 shows an example of the above filters responding to a
simple demand model.
Figure 4.2. Demand response example
25

4.5 Test Phases
Traffic will be generated in four phases; the start up phase, the flash phase, the decay
phase, and the normalization phase. The start up phase will be 1 minute long where demand
generated will correspond to the normal rate. In this phase the load balancer will manually
adjust the instance capacity to be optimal. The goal of this phase is to get the system
into a steady state before the flash event starts. Therefore, data for this phase will not
be collected. The next two phases, flash and decay, will generate demand according to the
flash model specified in section 4.2. The last phase, normalization, is meant to measure
how long the system takes to return to optimal capacity from any over-provisioning.
4.6 Limitation
While running the flash event the actual demand lagged behind the generated demand.
To quantify this inaccuracy, a scenario was created where the demand will rise every five
minutes by 5% from 0% demand to 115% demand. Scaling rules were disabled to ensure a
second instance did not affect the results. The utilization rate is measured once a minute
giving five utilization rate data points per generated demand rate. However, the first and
last data points are excluded because they may include the transition from one demand
rate to another. The results of this test are in figure 4.3. In short, the actual demand was
about 80% of the generated demand.
The tests are also limited by a lack of variability, i.e. noise, in the generated demand. A
lack of noise would benefit methods more inclined to remove a worker instance, such as, the
lead filter and the lea-lag filter. Adding variability to the flash model would require further
research into how users access a service. This is beyond the current scope of research and
will only be mentioned.
26

Figure 4.3. Generated demand inaccuracy
5 Evaluation of results
5.1 Allocation Graphs
The following ﬁgures show how each scaling method allocates capacity based upon the
scaled demand. 100% demand indicates that one worker node is fully utilized (as described
in section 4.2). Demand was measured by how many tasks where sent per second ﬁltered
through a 5 second rolling average. Capacity is calculated by measuring how many instances
where active at any given time. The graphs are not adjusted to account for the inaccuracy
in the actual demand. Some of the below charts show temporary drops in demand, con-
sidering the fact that some of the demand estimation methods responded to these drops it
is reasonable to conclude that the drop in demand is real. The root cause of these drops
is unknown, however, the drop is consistent across tests and should not interfere with the
results.
27

5.1.1 Predictable Flash Event
The predictable flash events, shown below, start at approximately 64% capacity and
ramp up to 640% capacity at 1,800s (30 minutes), the flash phase. After that the demand
starts dropping, the decay phase, back to 64% capacity taking another 1,800s. Once the
decay phase has ended the flash event is completed at 3,600s (1 hour). This is all done per
table 4.2 in section 4.2.
Figure 5.1. The control instance allocations for the predictable flash event.
Instance allocation by the control method shows the latest points when instances should
be added during the flash phase. Additionally, instance de-allocation in the decay phase is
optimal because there is no lag to drop the demand.
28

Figure 5.2. The EWMA instance allocations for the predictable flash event.
The EWMA method allocates instances after the control method creating the potential
for a loss in responsiveness from under provisioning. Furthermore, the the decay phase is
over provisioned because of the EWMA lag.
Figure 5.3. The FUSD instance allocations for the predictable flash event.
The FUSD leads the control method during the flash, which help increase task
responsiveness. However, the FUSD method, like EWMA method, over provisions during
the flash phase. Although this over provisioning could aid responsiveness if the generated
demand was more random.
29

Figure 5.4. The lead filter instance allocations for the predictable flash event.
The lead filter performs similarly to the FUSD method during the flash phase but doesn’t
suffer the same over provisioning. In particular, during the decay phase, the lead filter
performs comparably to the control method.
Figure 5.5. The lead-lag instance allocations for the predictable flash event.
The lead-lag filter leads the demand like the FUSD and lead filter methods to help
increase responsiveness. However, during the decay phase the instances are de-allocated
sooner then the control method, and even the lead filter. This could reduce the instance
cost at a possible loss of responsiveness.
30

5.1.2 Secondary Flash Event
The secondary flash events, shown below, starts at approximately 8% capacity and
ramps up to 640% capacity at 900s (15 minutes), the flash phase. After this demand starts
dropping, the decay phase, back to 8% capacity in 25,200s (7 hours), which ends the flash
event at 26,100s (7 hours and 15 minutes). This is all done per table 4.2 in section 4.2.
Figure 5.6. The control instance allocations for the secondary flash event.
The control method will again set the standard to judge the other methods. The only
difference with this flash event is the time it takes to ramp up to peak demand is half the
time used in the predictable flash event, which caused effect discussed more in section 5.3.
31

Figure 5.7. The EWMA instance allocations for the secondary flash event.
The EWMA method performed much as it did in the predictable flash event by under
provisioning in the flash phase and over provisioning in the decay phase.
Figure 5.8. The FUSD instance allocations for the secondary flash event.
The FUSD method did not perform as well in the secondary flash event compared to the
predictable flash event because of the quicker flash phase. Although it still outperformed
the control method by allocating instances sooner.
32

Figure 5.9. The lead filter instance allocations for the secondary flash event.
The lead filter outperformed the FUSD method by allocating and de-allocating instances
earlier. This could produce a lower response time with an overall lower cost.
Figure 5.10. The lead-lag filter instance allocations for the secondary flash event.
The lead-lag filter outperformed the FUSD method and the lead filter by allocating and
de-allocating instances earlier. This could produce a slightly lower response time with a
slightly lower cost.
33

5.2 Cost
Table 5.1 normalized the cost, measured by total instance time, of each run to the
percent improvement over the control.
Table 5.1. Cost results
EWMA FUSD Lead
Filter
Lead-Lag
Filter
Predicted 0.94% -10.6% -2.1% -0.6%
Secondary 3.25% -2.4% -1.2% 0.4%
The EWMA method consistently came under the cost of the control. The only other
method to beat the control was the lead-lag filter during the secondary flash event, although
overall the lead-lag generated costs similar to the control event. Additionally, both the lead
filter and the lead-lag filter had lower costs than the FUSD method.
5.3 Responsiveness
Responsiveness was measured as the average of the 1% highest response times, this was
done to reduce the chance a single bad response to skew the results. Table 5.2 normalizes the
results to the percent improvement over the control. The responsiveness of the secondary
flash event was much worse than the predictable flash event, 9ms vs 4.3s in the control
method, which caused the lead and lead-lag filters to perform better than the FUSD method.
Table 5.2. Responsiveness results
EWMA FUSD Lead
Filter
Lead-Lag
Filter
Predicted -99% 232.6% 43.7% 55.9%
Secondary -99% 40.8% 2,937% 31,251%
The EWMA method created response times over 100 times slower than any other method
making it an unsuitable method for handling a flash event. Both the lead filter and the
lead-lag filter showed the most improvement over the control in the secondary flash because
flash rate accelerated much faster than the predictable flash event. However, in the slower
accelerating predictable flash event the FUSD had the advantage in responsiveness.
34

6 Conclusion
This thesis covered the history of virtualization, from 1973’s Popek and Goldberg re-
quirements to the introduction of x86-64 virtualization extensions in mid-2000. More im-
portantly, virtualization made cloud services elastic, i.e. malleable for a range of users
needs, by providing an abstraction of the physical hardware. Some cloud providers, such as
Google, offer a load balancing service to evenly spread demand across a series of resource
instances. Load balancing services improves cloud elasticity by distributing incoming con-
nections evenly across any active resource instances. This elasticity makes the cloud better
at responding to changes in demand rather than purchasing and using physical hardware.
Changes in demand occur in many ways, from user growth to increased service usage
from current users. However, there is a special type of change in demand, a flash event,
that happens so quickly that it can degrade a web services ability to operate. A flash event
is where the demand for a service rapidly increases over a very brief period of time, usually
just hours, to possibly over a 100 times the normal traffic rate. There are three common
varieties of flash events; predictable, unpredictable, and secondary. Flash events cause
problems when demand outstrips the underlying resources causing a drop in response time,
which Amazon found can cause a 1% drop in sales for every 100ms of increased response
time [22].
Scaling for a flash event requires the anticipation of future demand because of the time
it takes to allocate additional resources; this can range from a couple of seconds to minutes.
This thesis focused on the usage of lead/lag filters, which are Laplace domain equations that
can be tuned to over and/or under shoot demand. To fully evaluate the effectiveness of
lead-lag filters in anticipating demand they were compared against the leading method from
Xiao, et al. the flash-up slow-down method (FUSD). Whatever the anticipation method it
must be coupled with a set of scaling rules to ensure that the instances are allocated and
de-allocated only when intended.
In order to test these estimation methods a testing environment was created to simulate
an Infrastructure as a Service. This environment contained a load balancer and eight worker
nodes to handle traffic. Generated traffic was created in accordance with the flash event
35

model from Bhatia, et al, that uses tasks to create measurable CPU load. This CPU load
is then used during the flash event to approximate demand for the estimation method and
scaling rules. Only the predictable and secondary flash events were tested in the environment
due to resource limitations.
The results of this testing showed that the lowest cost estimation method was EWMA,
but this produced a response time that was magnitudes worse than any other methods.
The lead-lag filter estimation method had the next lowest cost by around 1% of the control.
This lower cost, however, didn’t come at a worsened response time over the lead filter
estimation method and actually beat the competitor algorithm, FUSD, in the secondary
flash event. The lead filter did not perform as well as the lead-lag filter in terms of cost or
responsiveness but still outperformed the FUSD method in cost and, for only the secondary
flash event, responsiveness. These results may not hold if the generated demand was noisier.
In conclusion, first and second order filters can improve responsiveness at a slight cost over
a direct method of measuring demand. Furthermore, the filters are an improvement over
the lead competitor’s algorithm, FUSD, in cost and are better at handling the rapidly
accelerating demand of a secondary flash event.
36

References
[1] D. Rushe, “Steve jobs, apple co-founder, dies at 56,” Oct. 2011,
http://www.guardian.co.uk/technology/2011/oct/06/steve-jobs-apple-cofounder-
dies.
[2] D. Woodside, “We owe you an apology,” Dec. 2013, http://motorola-
blog.blogspot.com/2013/12/we-owe-you-apology.html.
[3] P. Mell and T. Grance, “The nist deﬁnition of cloud computing: Recommendations of
the national institute of standards and technology,” Sep. 2011.
[4] Z. Xiao, W. Song, and Q. Chen, “Dynamic resource allocation using virtual machines
for cloud computing environment,” Parallel and Distributed Systems, IEEE Transac-
tions on, vol. 24, no. 6, pp. 1107–1117, June 2013.
[5] G. J. Popek and R. P. Goldberg, “Formal requirements for virtualizable third genera-
tion architectures,” in Proceedings of the Fourth ACM Symposium on Operating System
Principles, ser. SOSP ’73. New York, NY, USA: ACM, 1973, pp. 121–, http://0-
doi.acm.org.opac.library.csupomona.edu/10.1145/800009.808061.
[6] O. Agesen, A. Garthwaite, J. Sheldon, and P. Subrahmanyam, “The evolution of an
x86 virtual machine monitor,” SIGOPS Oper. Syst. Rev., vol. 44, no. 4, pp. 3–18, Dec.
2010, http://0-doi.acm.org.opac.library.csupomona.edu/10.1145/1899928.1899930.
[7] E. Bugnion, S. Devine, M. Rosenblum, J. Sugerman, and E. Y. Wang, “Bring-
ing virtualization to the x86 architecture with the original vmware workstation,”
ACM Trans. Comput. Syst., vol. 30, no. 4, pp. 12:1–12:51, Nov. 2012, http://0-
[8] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neuge-
bauer, I. Pratt, and A. Warﬁeld, “Xen and the art of virtualization,”
SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 164–177, Oct. 2003, http://0-
37

[9] J. Bezos, “Amazon ec2 beta,” Aug. 2006, http://aws.typepad.com/aws/2006/08/amazon ec2 beta.htm
[10] J. Sahoo, S. Mohapatra, and R. Lath, “Virtualization: A survey on concepts, taxonomy
and associated security issues,” in Computer and Network Technology (ICCNT), 2010
Second International Conference on, April 2010, pp. 222–226.
[11] X. Project, “Xen 4.3 release notes,” Jul. 2013,
http://wiki.xenproject.org/wiki/Xen 4.3 Release Notes (Accessed on March 3,
2014).
[12] G. Staﬀ, “30 years of accumulation: a timeline of cloud cimputing,” Mar. 2013,
http://gcn.com/articles/2013/05/30/gcn30-timeline-cloud.aspx (Accessed on March 3,
2014).
[13] I. Foster, Y. Zhao, I. Raicu, and S. Lu, “Cloud computing and grid computing 360-
degree compared,” in Grid Computing Environments Workshop, 2008. GCE ’08, nov
2008, pp. 1–10.
[14] Google, “Load balancing,” Mar. 2014, https://developers.google.com/compute/docs/load-
balancing/ (Accessed on March 4, 2014).
[15] Amazon, “Elastic load balancing,” Mar. 2014, http://aws.amazon.com/elasticloadbalancing/
(Accessed on March 3, 2014).
[16] Rackspace, “Load balancing as a service,” Mar. 2014,
http://www.rackspace.com/cloud/load-balancing/ (Accessed on March 3, 2014).
[17] K. Gilly, C. Juiz, and R. Puigjaner, “An up-to-date survey in web load
balancing,” World Wide Web, vol. 14, no. 2, pp. 105–131, 2011. [Online]. Available:
http://dx.doi.org/10.1007/s11280-010-0101-5
[18] S. Bhatia, G. Mohay, D. Schmidt, and A. Tickle, “Modelling web-server ﬂash events,”
in Network Computing and Applications (NCA), 2012 11th IEEE International Sym-
posium on, Aug 2012, pp. 79–86.
38

[19] C. Sapuntzakis, D. Brumley, R. Chandra, N. Zeldovich, J. Chow, M. S. Lam,
and M. Rosenblum, “Virtual appliances for deploying and maintaining software,” in
Proceedings of the 17th USENIX Conference on System Administration, ser. LISA ’03.
Berkeley, CA, USA: USENIX Association, 2003, pp. 181–194. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1051937.1051965
[20] D. Geer, “The os faces a brave new world,” Computer, vol. 42, no. 10, pp. 15–17, Oct
2009.
[21] Z. Xiao, Q. Chen, and H. Luo, “Automatic scaling of internet applications for cloud
computing services,” IEEE Transactions on Computers, vol. 99, no. PrePrints, p. 1,
2012.
[22] G. Linden, “Make data useful,” Nov. 2006, amazon, Presentation.
[23] J. Hu and G. Sandoval, “Web acts as hub for info on attacks,” Sep. 2001, accessed
http://news.cnet.com/2100-1023-272873.html on March 3, 2014.
[24] L. M. Vaquero, L. Rodero-Merino, and R. Buyya, “Dynamically scaling applications in
the cloud,” SIGCOMM Comput. Commun. Rev., vol. 41, no. 1, pp. 45–52, Jan. 2011,
http://0-doi.acm.org.opac.library.csupomona.edu/10.1145/1925861.1925869.
[25] J. Leung, Handbook of scheduling : algorithms, models, and performance analysis.
Boca Raton: Chapman & Hall/CRC, 2004.
[26] F. J. Nogales and A. J. Conejo, “Electricity price forecasting through transfer
function models,” The Journal of the Operational Research Society, vol. 57,
no. 4, pp. 350–356, 04 2006, copyright - Palgrave Macmillan Ltd 2006;
Last updated - 2013-10-04; CODEN - OPRQAK. [Online]. Available: http:
//search.proquest.com/docview/231377045?accountid=10357
[27] B. Boulet, Fundamentals of signals and systems. Hingham, Mass: Da Vinci Engineer-
ing Press, 2006.
39

Scaling in the Cloud - JMansur Thesis

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Scaling in the Cloud - JMansur Thesis

Similar to Scaling in the Cloud - JMansur Thesis (20)

Scaling in the Cloud - JMansur Thesis