DOSE: Deployment and Operations
for Software Engineers
The Cloud
© Len Bass 2019 2
Overview
• Structure
• Failure in the Cloud
• Scaling Service Capacity
© Len Bass 2019 3
Data centers in the cloud
• A cloud provider (AWS, Google, Microsoft) maintains
data centers around the world.
• Each data center has ~100,000 computers.
• Limited by power and cooling considerations.
© Len Bass 2019 4
Data Center
© Len Bass 2019 5
Organization of data centers
• Each data center has an independent power supply, independent fire control, independent security, etc.
• Data centers are collected into availability zones and
availability zones are collected into geographic
regions.
• AWS currently has 20 regions each with several
availability zones.
© Len Bass 2019 6
Allocating a virtual machine in
AWS - 1
• A user wishes to allocate a virtual machine in AWS
• The user specifies
• A region
• Availability zone
• Image to load into virtual machine
• …
© Len Bass 2019 7
Allocating a virtual machine in
AWS - 2
• AWS management software
• Finds a server in that region and
availability zone with spare capacity
• Allocates a virtual machine in that server
• Assigns a public IP address to that virtual machine
• Loads the image into that virtual machine
• The VM can then send and receive messages (see the sketch below)
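A minimal sketch of this allocation from the user's side, using the boto3 Python SDK (assumed here; the region, availability zone, and AMI ID are placeholders, not values from the slides):

```python
# Sketch: requesting a VM (EC2 instance) in a chosen region and availability zone.
# Assumes boto3 is installed and AWS credentials are configured; the AMI ID is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")        # region
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",                      # image to load (placeholder)
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1a"},         # availability zone
)
instance_id = response["Instances"][0]["InstanceId"]
print("Allocated", instance_id)
```

AWS's management software then finds a server with spare capacity in that zone and returns the instance identifier; the caller never sees which physical server was chosen.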
© Len Bass 2019 8
Overview
• Structure
• Failure in the Cloud
• Scaling Service Capacity
© Len Bass 2019 9
A distributed system is one in which the failure of a
computer you didn't even know existed can render your
own computer unusable.
Leslie Lamport
© Len Bass 2019 10
Failures in the cloud
• Cloud failures large and small
• The Long Tail
• Techniques for dealing with the long tail
© Len Bass 2019 11
Sometimes the whole cloud fails
Selected Cloud Outages - 2018
• Feb 15, Google Cloud Datastore
• March 2, AWS Eastern region networking problems
• April 6, Microsoft Office 365. Users locked out of their accounts
• May 31, AWS Eastern Region hardware failure caused by
“power event”
• June 17, Microsoft Azure. Heatwave in Europe and
malfunctioning air conditioning
• July 17, Google. Load balancer issue affected services globally
• July 16, Amazon Prime. Not AWS but Amazon sales
• Sept 5. Microsoft Office 365. Botched update to Azure
authentication
© Len Bass 2019 12
And sometimes just a part of it fails
…
© Len Bass 2019 13
A year in the life of a Google datacenter
• Typical first year for a new cluster:
• ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to
recover)
• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to
come back)
• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get
back)
• ~5 racks go wonky (40-80 machines see 50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity
losses)
• ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
• ~3 router failures (have to immediately pull traffic for an hour)
• ~dozens of minor 30-second blips for DNS
• ~1000 individual machine failures
• ~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, dead
horses, …
© Len Bass 2019 14
Amazon failure statistics
• In a data center with ~64,000 servers with 2 disks
each
~5 servers and ~17 disks fail every day.
© Len Bass 2019 15
Failures in the cloud
• Cloud failures large and small
• The Long Tail
• Techniques for dealing with the long tail
© Len Bass 2019 16
Short digression into probability
• A distribution describes the probability that any given
reading will have a particular value.
• Many phenomena in nature are “normally distributed”.
• Most values will cluster
around the mean with
progressively smaller
numbers of values going
toward the edges.
• In a normal distribution
the mean is equal to the
median
© Len Bass 2019 17
Long Tail
• In a long tail distribution, there are some values far from the mean.
• These values are sufficient to influence the mean.
• The mean and the median are dramatically different in a long tail distribution.
© Len Bass 2019 18
What does this mean?
• If there is a partial failure of the cloud, some activities will take a long time to complete and exhibit a long tail.
• The figure shows the distribution of 1000 AWS “launch instance” calls.
• 4.5% of calls were “long tail”.
© Len Bass 2019 19
Failures in the cloud
• Cloud failures large and small
• The Long Tail
• Techniques for dealing with the long tail
© Len Bass 2019 20
What can you do about
long tail problems?
• Recognize that failure has occurred
• Recover from failure
• Mask failure
© Len Bass 2019 21
Recognizing failure
• In a distributed system, the only
communication between distinct computers is
through messages.
• A request (message) to another machine goes unanswered.
• The requestor detects the failure by making the request asynchronously and setting a timeout (see the sketch below).
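A minimal sketch of timeout-based failure detection, using only the Python standard library (the URL is a placeholder):

```python
# Sketch: detect an unresponsive machine by setting a timeout on the request.
# The URL is a placeholder for the service being called.
import socket
import urllib.error
import urllib.request

def call_with_timeout(url, timeout_seconds=2.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read()
    except (socket.timeout, urllib.error.URLError) as exc:
        # No answer within the timeout: treat the remote machine as failed.
        print("request failed or timed out:", exc)
        return None

call_with_timeout("http://service.example.com/status")
```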
© Len Bass 2019 22
Health Check
• Health check. Requires one machine to act as a
monitor for other machines.
• The monitored machine periodically sends an “I am alive” message to the monitor.
• Ping/echo: the monitor sends a message to the machine being monitored and expects a reply within a specified period.
A sketch of the monitor side appears below.
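A minimal sketch of the monitor side of an “I am alive” health check; the machine names and the 30-second threshold are illustrative assumptions:

```python
# Sketch: a monitor that marks machines unhealthy if no "I am alive"
# message has arrived within a threshold. Machine names are illustrative.
import time

HEARTBEAT_TIMEOUT_SECONDS = 30.0
last_heartbeat = {}  # machine name -> time of last "I am alive" message

def record_heartbeat(machine):
    """Called whenever an 'I am alive' message arrives from a machine."""
    last_heartbeat[machine] = time.monotonic()

def unhealthy_machines():
    """Machines whose last heartbeat is older than the timeout."""
    now = time.monotonic()
    return [m for m, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT_SECONDS]

record_heartbeat("server-1")
print(unhealthy_machines())  # [] -- server-1 reported recently
```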
© Len Bass 2019 23
Recovering from failure
• Retry
• Instantiate another instance of failed service
• Graceful degradation
• Set a “circuit breaker” to prevent trying to use the failed service
again (sketched below).
• Netflix’s open source Hystrix library provides some
useful functions.
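A minimal sketch of the circuit-breaker idea (not the Hystrix API; the thresholds and class name are illustrative assumptions):

```python
# Sketch: a tiny circuit breaker. After too many consecutive failures the
# circuit "opens" and calls are rejected until a cool-down period passes.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None  # time when the circuit opened, or None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: not calling the service")
            self.opened_at = None  # cool-down over: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```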
© Len Bass 2019 24
Masking failure
• “Hedged” request. Suppose you wish to launch 10
instances. Issue 11 requests. Terminate the request
that has not completed when 10 are completed.
• “Alternative” request. In the above scenario, issue 10
requests. When 8 requests have completed issue 2
more. Cancel the last 2 to respond.
• Using these techniques reduces the time of the
longest of the 1000 launch instance requests from
202 sec to 51 sec.
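A minimal sketch of a hedged request using Python's concurrent.futures: issue one extra request, keep the first 10 results, and abandon whatever is still outstanding. The launch_instance function is a placeholder that only simulates variable latency:

```python
# Sketch: hedged requests. Launch 11 attempts when only 10 results are needed,
# keep the first 10 to finish, and abandon the straggler.
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

def launch_instance(i):
    time.sleep(random.uniform(0.1, 2.0))  # simulate variable (long-tailed) latency
    return f"instance-{i}"

NEEDED, HEDGE = 10, 1
pool = ThreadPoolExecutor(max_workers=NEEDED + HEDGE)
futures = [pool.submit(launch_instance, i) for i in range(NEEDED + HEDGE)]

results = []
for future in as_completed(futures):
    results.append(future.result())
    if len(results) == NEEDED:
        break

# Best-effort cancellation of the straggler; work already running in a
# thread cannot be interrupted, only abandoned (requires Python 3.9+).
pool.shutdown(wait=False, cancel_futures=True)
print(len(results), "instances obtained")
```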
© Len Bass 2019 25
Overview
• Structure
• Failure in the Cloud
• Scaling Service Capacity
© Len Bass 2019 26
Load Balancer
• One server may not suffice for all of the requests for
a given service.
• Have multiple servers supply the same service.
• Use a “load balancer” to distribute requests.
• Servers register with the load balancer upon initialization.
• The load balancer monitors the health of the servers and knows
which ones are healthy.
• The load balancer’s IP address is returned by DNS when a
client looks up the URL of the service.
© Len Bass 2019 27
Message sequence – client
makes a request
Servers
Clients
Load Balancer
© Len Bass 2019 28
Message sequence- request
arrives at load balancer
Servers
Clients
Load
Balancer
© Len Bass 2019 29
Message sequence – request is sent to one server
Servers
Clients
Load Balancer
© Len Bass 2019 30
Message sequence – reply goes
directly back to sender
Servers
Clients
Load Balancer
© Len Bass 2019 31
Note IP manipulation
• A server always sends its reply to what it thinks is the sender.
• The load balancer changes the destination IP but not the source,
so the reply goes directly back to the client.
• The load balancer (now acting as a proxy) can change the source
as well. In this case, the reply goes back to the load balancer,
which must change the destination (of the reply) back to the
original client.
© Len Bass 2019 32
Routing algorithms
• Load balancers use variety of algorithms to choose
instance for message
• Round robin. Rotate requests evenly
• Weighted round robin. Rotate requests according
to some weighting.
• Hashing. Hash the source IP address to determine the
instance. This means that a request from a particular
client is always sent to the same instance, as long as
that instance is still in service (see the sketch after this list).
• Note that these algorithms do not require knowledge
of an instance’s load.
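A minimal sketch of two of these choices, round robin and source-IP hashing; the server names are illustrative assumptions:

```python
# Sketch: two load-balancer routing choices that need no knowledge of load.
import hashlib
import itertools

SERVERS = ["server-a", "server-b", "server-c"]  # illustrative instances

# Round robin: rotate through the servers evenly.
_round_robin = itertools.cycle(SERVERS)
def choose_round_robin():
    return next(_round_robin)

# Hashing: the same client IP always maps to the same server
# (as long as the server list does not change).
def choose_by_hash(client_ip):
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

print(choose_round_robin(), choose_round_robin())
print(choose_by_hash("203.0.113.7"), choose_by_hash("203.0.113.7"))  # same server twice
```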
© Len Bass 2019 33
Suppose Load Balancer
Becomes Overloaded
– Load Balance the Load Balancers
© Len Bass 2019 34
Combining pictures
• The message’s destination IP address is modified multiple
times before the message gets to a server.
• The return message from the instance to the client may go
through the cloud gateway (proxy) or may go directly back to
the client.
[Diagram: client, cloud gateway, servers]
© Len Bass 2019 35
Summary
• The cloud consists of regions with availability zones
• A new VM is placed in an availability zone in a region
• Instances in the cloud can fail.
• Failures must be detected.
• The long tail is a cloud phenomenon that developers
must be aware of.
• Load balancers distribute requests among identical
servers.
© Len Bass 2019 36
Overview
• Autoscaling
• Distributed Coordination
© Len Bass 2019 37
Suppose servers become
overloaded
• As load grows, existing resources may not be
sufficient.
• Autoscaling is a mechanism for creating new
instances of a server.
• Set up a collection of rules that determine
• under what conditions new servers are added
• under what conditions servers are removed
© Len Bass 2019 38
First there were three servers
Servers
Load
Balancer
© Len Bass 2019 39
Now there are four
Issues:
• What makes the decision to add a new server?
• How does the new server get loaded with software?
• How does the load balancer know about the new server?
© Len Bass 2019 40
Making the decision
• Each server reports its CPU usage to a monitor.
• The monitor has a collection of rules that determine whether to
add a new server, e.g., a server has been over 80% utilization for
15 minutes (see the sketch below).
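A minimal sketch of such a rule; the 80%/15-minute thresholds follow the example above, and everything else is an illustrative assumption:

```python
# Sketch: decide whether to add a server based on reported CPU usage.
# cpu_history maps a server name to its recent utilization samples
# (one sample per minute, most recent last). Thresholds follow the slide.
UTILIZATION_THRESHOLD = 0.80
WINDOW_MINUTES = 15

def should_add_server(cpu_history):
    for server, samples in cpu_history.items():
        recent = samples[-WINDOW_MINUTES:]
        if len(recent) == WINDOW_MINUTES and all(s > UTILIZATION_THRESHOLD for s in recent):
            return True  # this server has been hot for the whole window
    return False

history = {"server-a": [0.85] * 15, "server-b": [0.40] * 15}
print(should_add_server(history))  # True: server-a exceeded 80% for 15 minutes
```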
© Len Bass 2019 41
Loading the new server with software
• When a new instance is created, it is created based on a launch
configuration, security groups, and other configuration specifications.
© Len Bass 2019 42
Making the load balancer aware of the new server
The new server is registered with a load
balancer by the monitor.
© Len Bass 2019 43
Overview
• Autoscaling
• Distributed Coordination
• Atomicity
• Locking
• Consensus algorithms
© Len Bass 2019 44
Atomicity
• Some operations must be done atomically (or appear
to be atomic).
• Race conditions result from non-atomic operations.
• Database updates may result in inconsistent data
if not atomic.
• Resources (e.g. memory) may be overwritten if updates are not
atomic.
• Interrupt handlers must be atomic.
• etc.
© Len Bass 2019 45
How do I achieve (or appear to
achieve) atomicity?
• Time stamps?
• Many protocols involve putting a time stamp on
messages for error detection and ordering
purposes.
• Time stamps are often used to identify log
messages used for debugging problems.
• Couldn’t time stamps be used to order activities?
• The problem is that clocks drift, so the clock time may
differ between computers.
© Len Bass 2019 46
Synchronizing clocks over networks
• Suppose two different computers are connected via a
network. How do they synchronize their clocks?
• If one computer sends its time reading to another, it takes
time for the message to arrive.
• NTP (Network Time Protocol) can be used to synchronize
time on a collection of computers.
• Accurate to around 1 millisecond in local area networks
• Accurate to around 10 milliseconds over public internet
• Congestion can cause errors of 100 milliseconds or
more.
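The core of NTP's compensation for message delay can be sketched as a small calculation: the client records when it sent the request and received the reply, the server records when it received the request and sent its reply, and the offset estimate assumes the network delay is roughly symmetric (a simplification of the real protocol):

```python
# Sketch: the basic NTP-style clock offset estimate.
# t0 = client send time, t1 = server receive time,
# t2 = server send time,  t3 = client receive time.
def estimate_offset(t0, t1, t2, t3):
    round_trip_delay = (t3 - t0) - (t2 - t1)
    offset = ((t1 - t0) + (t2 - t3)) / 2.0  # assumes symmetric network delay
    return offset, round_trip_delay

# Client clock is 5 ms behind the server; one-way delay is 10 ms each way.
offset, delay = estimate_offset(t0=0.000, t1=0.015, t2=0.016, t3=0.021)
print(offset, delay)  # ~0.005 s offset, ~0.020 s round trip
```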
© Len Bass 2019 47
Suppose NTP is insufficiently
accurate
• Financial industry is spending 100s of millions of dollars
to reduce latency between Chicago and New York by 3
milliseconds.
• Well within error range of NTP
• GPS time is accurate to within
• 14 nanoseconds (theoretically)
• 100 nanoseconds (mostly)
• Timestamp messages with GPS time
• Used by electric companies to measure phase angle
• Atomic clocks
• Used by Google to coordinate time across all of their
distributed systems.
© Len Bass 2019 48
How about using locks?
• One technique to synchronize is to lock critical resources.
• This can lead to deadlock – two processes each waiting for the
other to release a critical resource (sketched below):
• Process one gets a lock on row 1 of a database.
• Process two gets a lock on row 2.
• Process one waits for process two to release its lock on
row 2.
• Process two waits for process one to release its lock on
row 1.
• No progress.
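A minimal sketch of this scenario with two in-process locks standing in for the two database rows, plus the standard fix of always acquiring locks in one consistent order (illustrative only, not a database client):

```python
# Sketch: the two-row deadlock, with threading locks standing in for row locks.
# Acquiring both locks in a fixed global order removes the deadlock.
import threading

row1_lock = threading.Lock()
row2_lock = threading.Lock()

def transfer(first_row, second_row):
    # A deadlock-prone version would lock in the order given by the caller.
    # Fix: always lock in a fixed global order, regardless of caller order.
    first, second = sorted([first_row, second_row], key=id)
    with first:
        with second:
            pass  # update both rows here

t1 = threading.Thread(target=transfer, args=(row1_lock, row2_lock))
t2 = threading.Thread(target=transfer, args=(row2_lock, row1_lock))
t1.start(); t2.start(); t1.join(); t2.join()
print("both transactions completed without deadlock")
```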
© Len Bass 2019 49
More problems with locks
• Locks are logical structures maintained in software
or in persistent storage.
• Getting a lock across distributed systems is not an
atomic operation.
• It is possible that while requesting a lock another
process can acquire the lock. This can go on for a
long time (it is called livelock if there is no
possibility of ever acquiring a lock)
• Suppose the virtual machine holding the lock fails.
Then the owner of the lock can never release it.
© Len Bass 2019 50
Locks in distributed systems
• Utilizing locks in a distributed system has a more
fundamental problem.
• It takes time to send a message (create a lock, for
example) from one computer to another.
© Len Bass 2019 51
How much time do operations
in distributed systems take?
• Main memory reference: 100 ns
• Send 1 KB over a 1 Gbps network: 0.01 ms
• Read 4 KB randomly from SSD: 0.15 ms
• Read 1 MB sequentially from memory: 0.25 ms
• Round trip within the same datacenter: 0.5 ms
• Read 1 MB sequentially from SSD: 1 ms (4x memory)
• Disk seek: 10 ms (20x datacenter round trip)
• Read 1 MB sequentially from disk: 20 ms (80x memory, 20x SSD)
• Send packet CA->Netherlands->CA: 150 ms
* dean-keynote-ladis2009_scalable_distributed_google_system
© Len Bass 2019 52
Implications of latency
numbers
• State stored in persistent storage (disk or
SSD) will take longer to fetch than state
stored in memory.
• State stored in a different datacenter will take
longer to access than state stored locally,
especially across continents.
• Persistent storage is typically replicated, both for
performance (latency) reasons and for
availability (failure) reasons.
• => Keeping data consistent across these replicas
is important but difficult.
© Len Bass 2019 53
Is there a solution?
• The general problem is that you want to manage
synchronization of data across a distributed set of
servers even when some of the servers (fewer than half) fail.
• Paxos is a family of algorithms that use consensus
to manage concurrency. Complicated and difficult
to implement.
• An example of the implementation difficulty
• Choose one server as the master that keeps the
“authoritative” state.
• Now that server fails. You need to
• find a new master
• make sure it is up to date with the authoritative state.
© Len Bass 2019 54
Luckily
• Several open source systems are now available that
• Implement Paxos or an alternative consensus
algorithm
• Are reasonably easy to use.
• Sample systems
• Zookeeper
• Consul
• etcd
© Len Bass 2019 55
Zookeeper as an example
• Zookeeper provides a (mostly) guaranteed-consistent
data structure shared by the instances of a
distributed application.
• “Mostly” means within the eventual
consistency lag (which is small).
• Zookeeper keeps data in memory. This is why it
is fast.
• Zookeeper deals with managing failure as well as
consistency.
• Done using consensus algorithm.
© Len Bass 2019 56
Zookeeper znode structure
/             <data>
  /b1         <data>
    /b1/c1    <data>
    /b1/c2    <data>
  /b2         <data>
    /b2/c1    <data>
© Len Bass 2019 57
Zookeeper API
Function        Type
create          write
delete          write
exists          read
get children    read
get data        read
set data        write
+ others
• All calls return atomic views of state – they either succeed or
fail; no partial state is returned. Writes are also atomic: they
either succeed or fail, and if they fail there are no side effects.
(See the sketch below.)
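A minimal sketch of these calls using the kazoo Python client for Zookeeper (assumed here; the host address and znode paths are placeholders):

```python
# Sketch: basic Zookeeper operations via the kazoo client library.
# Host address and znode paths are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.create("/b1", b"hello", makepath=True)   # write: create a znode with data
print(zk.exists("/b1") is not None)         # read: does the znode exist?
data, stat = zk.get("/b1")                  # read: get data (and metadata)
print(zk.get_children("/"))                 # read: list children of the root
zk.set("/b1", b"updated")                   # write: replace the znode's data
zk.delete("/b1")                            # write: remove the znode

zk.stop()
```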
© Len Bass 2019 58
Example – Lock creation
• Locks have names.
• Client N:
• creates Lock1.
• If the create succeeds, Client N owns the lock.
• If it fails, Client N is placed on the waitlist for Lock1 and control is
returned to Client N.
• When Client N finishes, it deletes Lock1.
• If there is a waitlist, those clients are informed of the
deletion and try again to create Lock1.
• If Client N fails, Zookeeper deletes Lock1
and waitlisted clients are informed of the deletion
(see the sketch below).
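A minimal sketch of this pattern using kazoo's lock recipe (the Zookeeper host, lock path, and client identifier are placeholders). The recipe's lock znodes are ephemeral, which is what lets Zookeeper release the lock automatically if the client fails:

```python
# Sketch: a named distributed lock via kazoo's Lock recipe.
# If this client dies, its ephemeral lock znode disappears and waiting
# clients are notified, matching the behavior described above.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

lock = zk.Lock("/locks/Lock1", identifier="client-n")
with lock:   # blocks (waits in line) until this client owns Lock1
    print("client-n owns Lock1; doing protected work")
# leaving the 'with' block deletes this client's lock znode,
# which wakes up the next waiting client

zk.stop()
```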
© Len Bass 2019 59
Other use cases
• Leader election
• Group membership
• Synchronization
• Configuration
© Len Bass 2019 60
Summary
• Autoscaling creates new instances based on utilization
of existing instances
• Synchronization of data across a distributed system is
complicated.
• It is difficult because of
• Timing delays across distributed systems
• Possible failures of nodes of various types
• Open source systems implement complicated
consensus algorithms in a fashion that is relatively
easy to use.
