2. Deployment pipeline
• Developers commit code
• Code is compiled
• The binary is processed by a build-and-unit-test tool, which builds the service
• Integration tests are run, followed by performance tests
• The result is a machine image (assuming virtualization)
• The service (its image) is deployed to production
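The pipeline above can be sketched as a sequence of gates, where a failure at any stage stops the deployment. The stage names and functions here are illustrative, not taken from any specific build tool.

```python
# Hypothetical sketch of a deployment pipeline as a sequence of gates.
# Stage names and behaviors are illustrative only.

def run_pipeline(commit, stages):
    """Run each stage in order, stopping at the first failure."""
    artifact = commit
    for name, stage in stages:
        ok, artifact = stage(artifact)
        if not ok:
            return ("failed", name)
    return ("deployed", artifact)

STAGES = [
    ("compile",          lambda a: (True, a + ":binary")),
    ("build+unit-test",  lambda a: (True, a + ":service")),
    ("integration-test", lambda a: (True, a)),
    ("performance-test", lambda a: (True, a)),
    ("bake-image",       lambda a: (True, a + ":image")),
]
```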
3. Deployment Overview
Multiple instances of a service are executing.
• Red is the service being replaced with a new version
• Blue are clients
• Green are dependent services
• New versions pass through UAT / staging / performance tests before production
4. Additional considerations
• Recall the release plan.
• Each release requires coordination among
stakeholders and development teams.
– Can be explicitly performed
– Can be implicit in the architecture (we saw an
example of this)
• You as an architect need to know how frequently
new releases will be deployed.
– Some organizations deploy dozens of times a day
– Other organizations deploy once a week
– Still others once a month or longer
5. Monthly deployment
• There is time for coordination among
development teams to ensure that
components are consistent
• There is time for “release manager” to ensure
that a release is correct before moving it to
production
6. Weekly deployment
• Limited time for coordination among
development teams
• Time for “release manager” to ensure that a
release is correct before moving it to
production
7. Daily deployment
• No time for
– Coordination among development teams
– Release manager to validate release
• These items must be implicit in the
architecture.
8. Deployment goal and constraints
• Goal of a deployment is to move from current state (N
instances of version A of a service) to a new state (N instances
of version B of a service)
• Constraints associated with Continuous Deployment:
– Any development team can deploy their service at any time, i.e., a new version of a service can be deployed either before or after a new version of a client (no synchronization among development teams)
– It takes time to replace one instance of version A with an
instance of version B (order of minutes)
– Service to clients must be maintained while the new
version is being deployed.
9. Deployment strategies
• Two basic all-or-nothing strategies
– Red/Black – leave the N instances with version A as they are, allocate and provision N instances with version B, then switch to version B and release the instances with version A.
– Rolling Upgrade – allocate one instance, provision it with version B, release one version A instance. Repeat N times.
• Other deployment topics
– Partial strategies (canary testing, A/B testing). We will discuss them later; for now we are discussing all-or-nothing deployment.
– Rollback
– Packaging services into machine images
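The two all-or-nothing strategies can be sketched as below. The `FakeCloud` class is an in-memory stand-in for an infrastructure API, used only to show the instance-count difference (2N vs. N+1) between the strategies.

```python
class FakeCloud:
    """Minimal in-memory stand-in for an infrastructure API (illustrative)."""
    def __init__(self, n, version):
        self.live = {version: n}   # version -> running instance count
        self.peak = n              # maximum concurrent instances observed

    def provision(self, version):
        self.live[version] = self.live.get(version, 0) + 1
        self.peak = max(self.peak, sum(self.live.values()))

    def release(self, version):
        self.live[version] -= 1
        if self.live[version] == 0:
            del self.live[version]

def red_black(cloud, n, old, new):
    # Allocate and provision N new instances alongside the old ones...
    for _ in range(n):
        cloud.provision(new)
    # ...then switch traffic and release all version-A instances.
    for _ in range(n):
        cloud.release(old)

def rolling_upgrade(cloud, n, old, new):
    # Replace one instance at a time: at most N+1 instances ever exist.
    for _ in range(n):
        cloud.provision(new)
        cloud.release(old)
```

Running both against a pool of four instances shows Red/Black peaking at 2N = 8 instances while rolling upgrade peaks at N+1 = 5, matching the cost trade-off discussed next.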
10. Trade offs – Red/Black and
Rolling Upgrade
• Red/Black
– Only one version available to the
client at any particular time.
– Requires 2N instances (additional
costs)
– Accomplished by moving
environments
• Rolling Upgrade
– Multiple versions are available for
service at the same time
– Requires N+1 instances.
• Both are heavily used. Choice
depends on
– Cost
– Managing complications from
using rolling upgrade
Rolling upgrade in EC2 (steps in the flow):
1. Update Auto Scaling Group
2. Sort instances
3. Remove and deregister an old instance from the ELB
4. Confirm the upgrade spec
5. Terminate the old instance
6. Wait for the ASG to start a new instance
7. Register the new instance with the ELB
11. Types of failures during rolling upgrade
• Provisioning failure – a research topic
• Logical failure – inconsistencies, to be discussed
• Instance failure – handled by the Auto Scaling Group in EC2
12. What are the problems with
Rolling Upgrade?
• Recall that any development team can deploy
their service at any time.
• Three concerns
– Maintaining consistency between different versions of
the same service when performing a rolling upgrade
– Maintaining consistency among different services
– Maintaining consistency between a service and
persistent data
13. Maintaining consistency between
different versions of the same
service
• Key idea – differentiate between installing a new
version and activating a new version
• Involves “feature toggles” (described momentarily)
• Sequence
– Develop version B with new code under control of feature
toggle
– Install each instance of version B with the new code
toggled off.
– When all of the instances of version A have been replaced
with instances of version B, activate new code through
toggling the feature.
14. Issues
• Do you remember feature toggles?
• How do I manage features that extend across
multiple services?
• How do I activate all relevant instances at
once?
15. Feature toggle
• Place feature-dependent new code inside an “if” statement that executes when an external variable is true; the code being replaced becomes the “else” portion.
• Used to allow developers to check in
uncompleted code. Uncompleted code is toggled
off.
• During deployment, until new code is activated, it
will not be executed.
• Removing feature toggles when a new feature
has been committed is important.
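A minimal sketch of the toggle idea: new code sits behind an externally controlled flag and is inert until activated. The toggle name and pricing logic are illustrative.

```python
# Feature-toggle sketch. The toggle value is set by deployment tooling,
# not by the service's own code; names here are illustrative.

TOGGLES = {"new_pricing": False}

def price(amount):
    if TOGGLES["new_pricing"]:
        return amount * 0.9   # new code: installed but inert until toggled on
    else:
        return amount         # old code path (the "else" portion)
```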
16. Multi service features
• Most features will involve multiple services.
• Each service has some code under control of a
feature toggle.
• Activate feature when all instances of all services
involved in a feature have been installed.
– Maintain a catalog mapping each feature to the service versions that implement it.
– A feature toggle manager determines when all old
instances of each version have been replaced. This
could be done using registry/load balancer.
– The feature manager activates the feature.
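The catalog-plus-manager idea above can be sketched as follows. The registry interface (a mapping from service name to the versions currently running) is a hypothetical stand-in for what a registry/load balancer would report.

```python
# Sketch of a feature-toggle manager. The catalog maps each feature to
# the minimum service versions that carry its code; the registry is an
# illustrative stand-in for a service registry / load balancer.

class ToggleManager:
    def __init__(self, catalog):
        self.catalog = catalog   # feature -> {service: required version}
        self.active = set()

    def try_activate(self, feature, registry):
        """Activate only when every running instance of every involved
        service is at (or beyond) the version carrying the feature."""
        for service, version in self.catalog[feature].items():
            for running in registry[service]:
                if running < version:
                    return False   # old instances still serving
        self.active.add(feature)
        return True
```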
17. Activating feature
• The feature toggle manager changes the value of
the feature toggle. Two possible techniques to
get new value to instances.
– Push. Broadcasting the new value will instruct each
instance to use new code. If a lag of several seconds
between the first service to be toggled and the last
can be tolerated, there is no problem. Otherwise
synchronizing value across network must be done.
– Pull. Querying the manager by each instance to get
latest value may cause performance problems.
• A coordination mechanism such as ZooKeeper overcomes both problems.
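The push variant can be sketched with an in-process stand-in for a coordination service: one shared node holds the toggle value and notifies all watchers when it changes (in a real system this would be a ZooKeeper znode with watches; everything below is illustrative).

```python
# In-process stand-in for a coordination node such as a ZooKeeper znode.
# Illustrative only; a real deployment would use the coordination
# service's own watch mechanism.

class ToggleNode:
    """One shared 'znode' holding the toggle value, with watch callbacks."""
    def __init__(self, value=False):
        self.value = value
        self.watchers = []

    def watch(self, callback):
        self.watchers.append(callback)
        callback(self.value)          # watches fire once on registration

    def set(self, value):
        self.value = value
        for cb in self.watchers:      # push: all instances notified at once
            cb(value)

class Instance:
    def __init__(self, node):
        self.enabled = None
        node.watch(lambda v: setattr(self, "enabled", v))
```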
18. Maintaining consistency across
versions (summary)
• Install all instances before activating any new
code
• Use feature toggles to activate new code
• Use feature toggle manager to determine
when to activate new code
• Use ZooKeeper to coordinate activation with low overhead
19. Maintaining consistency among
different services
• Use case:
– Wish to deploy new version of service A without
coordinating with development team for clients of
service A.
• I.e. new version of service A should be backward
compatible in terms of its interfaces.
• May also require forward compatibility in certain
circumstances, e.g. rollback
20. Achieving Backwards Compatibility
• APIs can be extended but must always remain backward compatible.
• This leads to a translation layer:
– Clients call external APIs, which are unchanging but can be extended or have new ones added
– A translation layer maps the external APIs to internal APIs
– Internal APIs can change; changes require changes to the translation layer but do not propagate further
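The translation-layer pattern above can be sketched in a few lines. The function and field names are illustrative: here an internal API renames a field and changes units, and the translation layer absorbs the change so the external result shape never varies.

```python
# Translation-layer sketch: the external API stays fixed while the
# internal API changes. All names and shapes are illustrative.

def internal_get_account(account_id):
    # Internal API, version B: renamed and re-unitized fields.
    return {"acct": account_id, "bal_cents": 1250}

def get_balance(account_id):
    """External API: unchanging signature and result shape."""
    rec = internal_get_account(account_id)
    # The translation layer absorbs the internal change so that
    # clients never see it.
    return {"id": rec["acct"], "balance": rec["bal_cents"] / 100}
```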
21. What about dependent services?
• Dependent services that are within your control
should maintain backward compatibility
• Dependent services not within your control (third
party software) cannot be forced to maintain
backward compatibility.
– Minimize impact of changes by localizing interactions
with third party software within a single module.
– Keeping services independent and packaging as much
as possible into a virtual machine means that only
third party software accessed through message
passing will cause problems.
22. Forward Compatibility
• Gracefully handle unknown calls and database schema information
– Suppose your service receives a method call it does not recognize. It could be intended for a later version where this method is supported.
– Suppose your service retrieves a database table with an unknown field. It could have been added to support a later version.
• Forward compatibility allows a version of a service to be upgraded
or rolled back independently from its clients. It involves both
– The service handling unrecognized information
– The client handling returns that indicate unrecognized
information.
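A forward-compatible service tolerates both cases above rather than failing. A minimal sketch (method and field names are illustrative):

```python
# Forward-compatibility sketch: unknown methods get a recognizable
# signal rather than a crash, and unknown database fields are ignored.
# Names are illustrative.

KNOWN_METHODS = {"get", "put"}
KNOWN_FIELDS = {"id", "name"}

def handle_call(method, payload):
    if method not in KNOWN_METHODS:
        # Could be intended for a later version: signal, don't crash.
        return {"status": "unrecognized_method"}
    return {"status": "ok"}

def read_row(row):
    # Ignore fields added for a later version instead of failing on them.
    return {k: v for k, v in row.items() if k in KNOWN_FIELDS}
```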
23. Maintaining consistency between a
service and persistent data
• Assume new version is correct – we will discuss the situation where
it is incorrect in a moment.
• Inconsistency in persistent data can come about because data
schema or semantics change.
• Effect can be minimized by the following practices (if possible).
– Only extend schema – do not change semantics of
existing fields. This preserves backwards compatibility.
– Treat schema modifications as features to be toggled.
This maintains consistency among various services
that access data.
24. I really must change the schema
• In this case, apply pattern for backward
compatibility of interfaces to schemas.
• Use features of database system (I am
assuming a relational DBMS) to restructure
data while maintaining access to not yet
restructured data.
25. Summary of consistency discussion so
far.
• Feature toggles are used to maintain consistency
within instances of a service
• Backward compatibility pattern is used to maintain consistency between a service and its clients.
• Discouraging modification of schema will
maintain consistency between services and
persistent data.
– If schema must be modified, then synchronize
modifications with feature toggles.
26. Canary testing
• Canaries are a small number of instances of a new version
placed in production in order to perform live testing in a
production environment.
• Canaries are observed closely to determine whether the new
version introduces any logical or performance problems. If
not, roll out new version globally. If so, roll back canaries.
• Named after canaries
in coal mines.
27. Implementation of canaries
• Designate a collection of instances as canaries. They do not need to
be aware of their designation.
• Designate a collection of customers as testing the canaries. Can be,
for example
– Organizationally based
– Geographically based
• Then
– Activate feature or version to be tested for canaries. Can be
done through feature activation synchronization mechanism
– Route messages from canary customers to canaries. Can be
done through making registry/load balancer canary aware.
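A canary-aware router can be sketched as below: a designated set of customers is routed to the canary pool, and everyone else to the stable pool. Customer IDs and pool names are illustrative.

```python
# Sketch of a canary-aware registry/load balancer. Illustrative only.

def make_router(canary_customers, canary_pool, stable_pool):
    """Return a routing function that sends designated customers to
    canary instances and all others to the stable pool."""
    def route(customer_id):
        pool = canary_pool if customer_id in canary_customers else stable_pool
        return pool[hash(customer_id) % len(pool)]   # simple load spreading
    return route
```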
28. A/B testing
• Suppose you wish to test user response to a
system variant. E.g. UI difference or marketing
effort. A is one variant and B is the other.
• You simultaneously make available both
variants to different audiences and compare
the responses.
• Implementation is the same as canary testing.
29. Rollback
• New versions of a service may be unacceptable either
for logical or performance reasons.
• Two options in this case
– Roll back (undo the deployment)
– Roll forward (discontinue the current deployment and create a new release without the problem)
• The decision to roll back or roll forward is almost never automated because there are multiple factors to consider:
– Forward vs. backward recovery
– Consequences and severity of the problem
– Importance of the upgrade
30. States of upgrade.
• An upgrade can be in one of two states when
an error is detected.
– Installed (fully or partially) but new features not
activated
– Installed and new features activated.
31. Possibilities
• For this slide assume persistent data is correct.
• Installed but new features not activated
– Error must be in backward compatibility
– Halt deployment
– Roll back by reinstalling old version
– Roll forward by creating new version and installing
that
• Installed with new features activated
– Turn off new features
– If that is insufficient, we are at prior case.
32. Persistent data may be incorrect
• Keep a log of user requests (each with its own identification)
• Identification of incorrect persistent data: tag each data item with metadata giving
– the service and version that wrote the data
– the user request that caused the data to be written
• Correction of incorrect persistent data (simplistic version)
– Remove data written by the incorrect version of the service
– Install the correct version
– Replay the user requests that caused the incorrect data to be written
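The simplistic correction scheme above can be sketched as follows: every write is tagged with the writing version and originating request, bad writes are removed, and the logged requests behind them are replayed against the fixed service. All names are illustrative.

```python
# Sketch of tag-remove-replay correction of persistent data.
# Store, log, and version labels are illustrative.

def write(store, log, key, value, version, request_id):
    store[key] = {"value": value, "version": version, "request": request_id}
    log.append((request_id, key, value))

def correct(store, log, bad_version, fixed_writer):
    # Find the user requests behind the bad writes...
    bad_requests = {d["request"] for d in store.values()
                    if d["version"] == bad_version}
    # ...remove the data written by the incorrect version...
    for key in [k for k, d in store.items() if d["version"] == bad_version]:
        del store[key]
    # ...and replay those requests against the fixed service.
    for request_id, key, value in list(log):   # snapshot: replay also logs
        if request_id in bad_requests:
            fixed_writer(key, value, request_id)
```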
33. Persistent data correction
problems
I will not present good solutions to these problems.
1. Replaying user requests may involve requesting features that are
not in the current version.
– Requests can be queued until they can be correctly
re-executed
– User can be informed of error (after the fact)
2. There may be domino effects from incorrect data, i.e. other calculations may be affected.
– Keep pedigree for data items that allows determining
which additional data items are incorrect. Remove
them and regenerate them when requests replayed.
– Data that escaped the system, e.g. sent to other
system or shown to a user, cannot be retrieved.
34. Summary of rollback options
• Can roll back or roll forward
• Rolling back without consideration of
persistent data is relatively straightforward.
• Managing erroneous persistent data is
complicated and will likely require manual
processing.
35. Packaging of services
• The last portion of the deployment pipeline is
packaging services into machine images for
installation.
• Two dimensions
– Flat vs deep service hierarchy
– One service per virtual machine vs many services
per virtual machine
36. Flat vs Deep Service Hierarchy
• Trading off independence of teams and
possibilities for reuse.
• Flat Service Hierarchy
– Limited dependence among services & limited
coordination needed among teams
– Difficult to reuse services
• Deep Service Hierarchy
– Provides possibility for reusing services
– Requires coordination among teams to discover reuse
possibilities. This can be done during architecture
definition.
37. Services per VM Image
• One service per VM – each service is developed and embedded into its own VM image.
• Multiple services per VM – several services are developed and embedded into a single, shared VM image.
38. One Possible Race Condition with Multiple Services per VM
Initial state: VM image with version N of Service 1 and version N of Service 2.
• Developer 1 builds a new image with VN+1 | VN and begins the provisioning process with it.
• Concurrently, Developer 2 builds a new image with VN | VN+1 (without the new version of Service 1) and begins the provisioning process with it.
• Result: version N+1 of Service 1 is not deployed until the next build of the VM image.
• Could be prevented by the VM image build tool.
39. Another Possible Race Condition with Multiple Services per VM
Initial state: VM image with version N of Service 1 and version N of Service 2.
• Developer 2 builds a new image with VN+1 | VN+1 and begins the provisioning process with it.
• Developer 1 builds a new image with VN+1 | VN and begins the provisioning process with it, overwriting the image created by Developer 2.
• Result: version N+1 of Service 2 is not deployed until the next build of the VM image.
• Could be prevented by the provisioning tool.
40. Trade offs
• One service per VM
– Message from one service to another must go
through inter VM communication mechanism –
adds latency
– No possibility of race condition
• Multiple Services per VM
– Inter VM communication requirements reduced –
reduces latency
– Adds possibility of race condition caused by
simultaneous deployment
41. Summary of Deployment
• Rolling upgrade is common deployment strategy
• Introduces requirements for consistency among
– Different versions of the same service
– Different services
– Services and persistent data
• Other deployment considerations include
– Canary deployment
– A/B testing
– Rollback
– Business continuity
43. Overview
• What is Netflix
• Migrating to the cloud
– Data
– Build process
– Testing process
• Withstanding Amazon outage of April 21, 2011.
45. Netflix Corporation
Launched in 1998 after its founder was irritated at having to pay late fees on a DVD rental.
DVD Model
• Pay monthly membership fee that includes rentals, shipping and no late fees
• Maintain online queue of desired rentals
• When return last rental (depending on service plan), next item in queue is mailed to
you together with a return envelope.
• Customers rate movies and Netflix recommends based on your preferences
Unusual culture
• Unlimited vacation time for salaried workers
• Workers can take any percentage of their salary in stock options (up to 100%).
46. Streaming video - 1
Customers can watch Netflix streaming video on a wide variety of devices, many of which feed into a TV
– Roku set top box
– Blu-ray disk players
– Xbox 360
– TV directly
– PlayStation 3
– Wii
– DVRs
– Nintendo
Customers can stop and restart video at will. Netflix calls these locations in the films
“bookmarks”.
47. Streaming video - 2
In 2007 Netflix began to offer streaming video to its subscribers
Initially, one hour of streaming video was available to customers for every dollar they spent
on their plan
In January 2008, every customer became entitled to unlimited streaming video.
In May 2011, Netflix streaming video accounted for 22% of all internet traffic, and 30% of traffic during peak usage hours.
Three bandwidth tiers
• Continuous bandwidth to the client of 5 Mbit/s. HDTV, surround sound
• Continuous bandwidth to the client of 3Mbit/s – better than DVD
• Continuous bandwidth to the client of 1.5Mbit/s – DVD quality
48. Overview
• What is Netflix
• Migrating to the cloud
– Data
– Build process
– Testing process
• Withstanding Amazon outage of April 21, 2011.
49. Netflix’s Growth
In late 2008, Netflix had a single data center with Oracle as the main database system.
With the growth of subscriptions and streaming video, it was clear that they would soon
outgrow the data center.
50. Not Just More Requests
• Netflix was expanding internationally
• They had unpredictable usage spikes with product releases
– E.g. Wii, and Xbox
– Need infrastructure to handle usage spikes
• Datacenters are expensive
– Netflix was running Oracle on IBM hardware (not commodity hardware)
– Could switch to commodity hardware but would have to have in-house
expertise
• Can’t hire enough system and database administrators to grow fast
enough
51. Netflix Moves to Cloud
Four reasons cited by Netflix for moving to the cloud
1. Every layer of the software stack needed to scale horizontally, be more reliable,
redundant, and fault tolerant. This leads to reason #2
2. Outsourcing data center infrastructure to Amazon allowed Netflix engineers to focus
on building and improving their business.
3. Netflix is not very good at predicting customer growth or device engagement. They
underestimated their growth rate. The cloud supports rapid scaling.
4. Cloud computing is the future. This will help Netflix with recruiting engineers who are
interested in honing their skills, and will help scale the business. It will also ensure
competition among cloud providers helping to keep costs down.
Why Amazon and EC2? In 2008, Amazon was the leading supplier. Netflix wanted
an IaaS so they could focus on their core competencies.
52. Netflix applications
What applications does Netflix have?
• Video ratings, reviews, and recommendations
• Video streaming
• User registration, log-in
• Video queues
• Billing
• DVD disc management – inventory and shipping
• Video metadata management – movie cast information
Strategy was to move in a phased manner.
• A subset of applications would move in a particular phase
• Means that data may have to be replicated between cloud and data center during the
move.
53. Amazon’s S3 and SimpleDB
SimpleDB is Amazon’s non-relational data store
– Optimized for data access speed
– Indexes data for fast retrieval
– Physical drives are less dense to support this
Amazon’s Simple Storage Service (S3)
– Stores raw data
– Optimized on storing larger data sets inexpensively
54. Choosing Cloud-Based Data Store
SimpleDB and S3 both provide
• Disaster recovery
• Managed fail over and fail back
• Distribution across availability zones
• Persistence
• Eventual consistency
• SimpleDB has a query language
55. Cloud-Based Data Store II
SimpleDB and S3 do not provide:
• Transactions
• Locks
• Sequences
• Triggers
• Clocks
• Constraints
• Joins
• Schemas and associated integrity checking
• Native data types. All data in SimpleDB is string data
56. Migrating Data from Oracle to SimpleDB
Whole data sets were moved. E.g. instant watch bookmarks were lifted wholesale into
SimpleDB
Incremental data changes were kept in synch (bi-directional) between Oracle and
SimpleDB. Device support moved from Oracle to the cloud. During that transition,
a user may begin watching a movie on a device supported in the cloud and
continue watching on a device supported in Oracle. Instant watch bookmarks are
kept in synch for these types of possibilities.
Once the need for the Oracle version is removed, the synchronization is discontinued and the data no longer resides in the Oracle database.
57. Using SimpleDB and S3
Move normal DB functionality up the stack to the application.
Specifically:
• Do “Group By” and “Join” in application
• De-normalize tables to reduce necessity for joins
• Do without triggers
• Manage concurrency using time stamps and conditional Put and
Delete
• Do without clock operations
• The application checks constraints on read and repairs data as a side effect
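The "conditional Put" item above is the key to managing concurrency without locks: a write succeeds only if the record's timestamp still matches what the writer read. This sketch is an illustrative stand-in for SimpleDB's conditional operations, not their actual API.

```python
# Optimistic-concurrency sketch: timestamp check plus conditional put.
# Illustrative stand-in for a data store's conditional write.

def conditional_put(store, key, new_value, expected_ts, now):
    """Write only if the stored timestamp still matches what we read."""
    current = store.get(key)
    if current is not None and current["ts"] != expected_ts:
        return False   # another writer got there first; caller must retry
    store[key] = {"value": new_value, "ts": now}
    return True
```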
58. Exploiting SimpleDB
All datasets with a significant write load are partitioned across multiple SimpleDB domains
Since all data is stored as strings
• Store all date-times in ISO8601 format
• Zero pad any numeric columns used in sorts or WHERE clause inequalities
Atomic writes requiring both a delete of an attribute and an update of another attribute in the same
item are accomplished through deleting and rewriting the entire item.
SimpleDB is case sensitive to attribute names. Netflix implemented a common data access layer that
normalizes all attribute names (e.g. TO_UPPER)
Misspelled attribute names can fail silently so Netflix has a schema validator in the common data access
layer
Deal with eventual consistency by avoiding reads immediately after writes. If this is not possible, use
ConsistentRead.
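The two string-encoding rules above exist because SimpleDB compares everything lexicographically. A minimal sketch (the padding width is an illustrative choice):

```python
# Encoding sketch for a string-only store: zero-pad numbers and use
# ISO 8601 date-times so that lexicographic order matches real order.

import datetime

def encode_number(n, width=10):
    # Unpadded, "42" > "100" as strings; zero-padded,
    # "0000000042" < "0000000100" as intended.
    return str(n).zfill(width)

def encode_datetime(dt):
    # ISO 8601 strings sort chronologically.
    return dt.isoformat()
```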
59. Netflix Build Process - Ivy
Ant is a tool for automating the software build process (like Make).
Ivy is an extension to Ant for managing (recording, tracking, resolving and
reporting) project dependencies.
60. Netflix Build Process - Artifactory
Artifactory is a repository manager.
• Artifactory acts as a proxy between your build tool (Maven, Ant, Ivy, Gradle etc.) and the
outside world.
• It caches remote artifacts so that you don’t have to download them over and over again.
• It blocks unwanted (and sometimes security-sensitive) external requests for internal
artifacts and controls how and where artifacts are deployed, and by whom.
61. Netflix Build Process - Jenkins
Jenkins is an open source continuous integration tool.
Jenkins provides a so-called continuous integration system, making it easier for developers to integrate
changes to the project, and making it easier for users to obtain a fresh build.
Jenkins monitors executions of externally-run jobs, such as cron jobs and procmail jobs, even those that are
run on a remote machine. For example, with cron, Jenkins keeps e-mail outputs in a repository.
62. Migration Lessons
• The change from reliable to commodity hardware impacts many things
– E.g. in the old world memory based session state was fine
• Hardware failures were rare
– Not appropriate on the cloud
• Co-tenancy is hard
– In the cloud all resources are shared e.g.
• Hardware, network, storage, …
– This introduces variable performance depending on load
• At all levels of the stack
– Have to be willing to abandon sub-task or manage resources to avoid co-tenancy
• The best way to avoid failure is to fail consistently
– Netflix has designed their systems to be tolerant to failure
– They test this capability regularly
63. Unleash the Monkey
• As a result of disruption of services Netflix adopted a process of
testing the system
• They learned that the best way to test the system was at scale
• In order to learn how their systems handle failure they introduced
faults
– Along came the “Chaos Monkey”
• The Chaos Monkey randomly disables production instances
• This is done in a tightly controlled environment with engineers
standing by
– But it is done in the production environment
• The engineers then learn how their systems handle failures and can
improve the solution accordingly
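The Chaos Monkey idea can be reduced to a toy sketch: randomly disable one production instance, then check whether the remaining pool can still serve. This is illustrative only, not Netflix's implementation.

```python
# Toy Chaos Monkey sketch: kill a random instance, then verify the pool
# still serves. Instance names and the health check are illustrative.

import random

def chaos_step(instances, rng):
    """Randomly disable one production instance."""
    victim = rng.choice(sorted(instances))
    instances.discard(victim)
    return victim

def can_serve(instances):
    # Stand-in health check: redundancy means at least one survivor.
    return len(instances) > 0
```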
64. Increase the Chaos
• The Chaos Monkey was so successful Netflix created a “Simian Army” to test the system
• The “Army” includes (Netflix Test Suite – 1):
– Chaos monkey. Randomly kill a process and monitor the effect.
– Latency monkey. Randomly introduce latency and monitor the effect.
– Doctor monkey. The Doctor Monkey taps into health checks that run on
each instance as well as monitors other external signs of health (e.g. CPU
load) to detect unhealthy instances.
– Janitor Monkey. The Janitor Monkey ensures that the Netflix cloud
environment is running free of clutter and waste. It searches for unused
resources and disposes of them.
65. Netflix Test Suite - 2
• Conformity Monkey. The Conformity Monkey finds instances that don’t adhere to best-
practices and shuts them down. For example, if an instance does not belong to an auto-
scaling group, that is a potential problem.
• Security Monkey The Security Monkey is an extension of Conformity Monkey. It finds
security violations or vulnerabilities, such as improperly configured AWS security
groups, and terminates the offending instances. It also ensures that all our SSL and DRM
certificates are valid and are not coming up for renewal.
• 10-18 Monkey. The 10-18 Monkey (Localization-Internationalization) detects configuration and run-time problems in instances serving customers in multiple geographic regions, using different languages and character sets. The name comes from L10n and I18n, where the numbers count the letters omitted from “localization” and “internationalization”.
• Chaos Gorilla: The Gorilla is similar to the Chaos Monkey but simulates the outage of an
entire Amazon availability zone
66. Overview
• What is Netflix
• Migrating to the cloud
– Data
– Build process
– Testing process
• Withstanding Amazon outage of April 21, 2011
67. April 21, 2011 Outage – Background
An EBS cluster is composed of a set of EBS nodes.
When Node A loses connectivity to a node (Node B) to which it is replicating data,
Node A assumes node B has failed
In this case, it must find another node to which the data can be replicated. This is called re-mirroring.
When data is being re-mirrored, all nodes that have copies of that data retain the data and block all external access to it. Such a node is said to be “stuck.”
In the normal case, a node is stuck for only a few milliseconds.
68. April 21, 2011 Outage - Events
At 12:47 AM PDT, a configuration change was made to upgrade the capacity of the primary EBS
network within one availability zone in the Eastern Region.
• Standard practice is to shift traffic off of one of the redundant routers in the primary EBS
network to allow the upgrade to happen
• Traffic was shifted incorrectly to the lower capacity backup network.
Consequently, there was no functioning primary or secondary network in the affected portion of
the availability zone since traffic was shifted off of the primary and the secondary became
quickly overloaded.
When the error was detected and connectivity restored, a large number of nodes that had been
isolated began searching for other nodes to which they could re-mirror their data.
This caused a re-mirroring storm where all of the available space in the affected cluster was used
up and the attempts to gain access to another availability zone overwhelmed the primary
communication network.
This caused the whole region to be unavailable
69. April 21, 2011 Outage – Consequences
At 12:04 PM PDT, the outage was contained to the one affected availability zone. 13% of the nodes in the availability zone remained “stuck.”
Additional capacity was added to the availability zone to allow those nodes to become
unstuck.
In the meantime, some of the stuck volumes experienced hardware failures. Hardware
failures are a normal occurrence within a cluster.
Ultimately, the data on 0.07% of the nodes in the affected availability zone could not be recovered and was permanently lost.
70. Netflix and the Amazon Outage of April, 2011
On April 21, 2011, one of the availability zones in Amazon’s Eastern Region failed.
– Many organizations were unavailable for the duration of the outage (24 hours)
– Some organizations permanently lost data
Netflix was largely unaffected
– Customers experienced no disruption of service
– There were some increased latencies
– Higher than normal error rate
– No interruption in customers’ ability to find and watch movies
71. How Did Netflix Accomplish This?
Stateless services
• Services are largely stateless
• If a server fails the request can be re-routed
• Remember the discussion of modular redundancy?
– The lack of state means failover times are negligible
Data stored across availability zones
• In some cases it was not practical to re-architect the system to be stateless
• In these cases there are multiple hot copies of the data across availability zones
• In the event of a failure again the failover time is negligible
– What is the cost of doing this?
72. Design Decisions II
Graceful degradation: The Netflix systems are designed for failure. A lot of thought went into what to do when they fail.
• Fail fast – aggressive time outs so that failing components do not slow the whole
system down
• Fallbacks – every feature has the ability to fall back to a lower quality representation.
E.g. if Netflix cannot generate personalized recommendations, they will fall back to non-
personalized results
• Feature removal. If a feature is non-critical and is slow, then it may be removed from
any given page.
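The fail-fast and fallback bullets above can be sketched together: if the primary path exceeds its time budget, serve the lower-quality fallback instead of slowing the whole page. The time accounting is simplified to a parameter, and all names are illustrative, not Netflix's actual code.

```python
# Fail-fast + fallback sketch. Elapsed time is passed in for simplicity;
# a real system would measure it. All names are illustrative.

def with_fallback(primary, fallback, budget_ms, elapsed_ms):
    """If the primary call exceeds its time budget, fall back to a
    lower-quality representation rather than slowing the page down."""
    if elapsed_ms > budget_ms:
        return fallback()
    return primary()

def personalized_recommendations():
    return ["picked-for-you-1", "picked-for-you-2"]

def popular_titles():
    return ["top-10-1", "top-10-2"]
```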
N+1 redundancy
• The system is architected with more capacity than it needs at any time
• This allows them to cope with large spikes in load caused by users directly or the ripple
effect of transient failures
73. What Issues Were Experienced?
Netflix did have some issues, however, including:
• As the availability zone started to fail, Netflix decided to pull out altogether
• This required manual configuration changes to AWS
• Each service team was involved in moving their service out of the zone
• This was a time consuming and error prone process
Load balancers had to be manually adjusted to keep traffic out of the affected availability zone.
• Netflix uses Amazon’s Elastic Load Balancing (ELB)
• This first balances load to availability zones and then instances
• If you have many instances go down the others in that zone have to pick up the slack
• If you can’t bring up more nodes you will experience a cascading effect until all nodes go down
74. Amazon Outage of Aug 8, 2011
On Aug 8, 2011, the Eastern Region of AWS went down for about
half an hour due to connectivity issues that affected all
availability zones …
Netflix was down as well…
75. Summary
Netflix moved to the cloud because
• They doubted their ability to correctly predict the load
• They did not want to invest in data centers directly
Netflix chose SimpleDB as their original target database system and their migration
strategy involved
• Running Oracle and SimpleDB concurrently for a period
• Moving some features provided by Oracle up the application stack
Netflix has a number of test programs that inject faults or look for specific types of violations.
Netflix re-architected their build process to support sophisticated deployment practices and to allow for continuous integration.
Netflix did a number of things to promote availability and these enabled Netflix to
continue customer service through an extensive AWS outage.
80. Containers
• Loading full VM at deployment time
– Takes time for large machine images
– Results in many slightly different VM images for
different versions
– Depends on a particular processor even when virtualized.
• Containers are a proposed solution
– Machine image consists of OS + libraries
– Application portion runs inside of this machine image
and is loaded by the machine image once it is
instantiated.
84. Adoption of Containers
• Containers are an old concept
• Docker is a software system that packages applications
and their dependencies into containers
• Cloud providers are adopting containers since
they control the entire stack and can choose
OSs.
87. Simple text has been available
for a long time
• CMU wearable computer group early 1990s
• Google Glass modern display device
88. Computer power needed for more
sophisticated images
• Route planning – showing what is ahead if you
take a particular route
• Visualizing furniture in a room
• These applications require
– Location determination
– Orientation determination
• This computer power is not available on
mobile devices.
• This is a lead-in to “cloudlets”
89. Cloudlets
• Mobile devices do not have the CPU power
necessary for some applications.
• Consequently, they act as data sources for
apps in the cloud.
– This introduces latency in responding to a user
request
– This latency does not matter for some purposes –
e.g. airline reservations, music streaming
– But – it does matter for others, e.g. voice
recognition, translation, augmented reality.
90. How does one match computation
power and data with low latency?
• Move computation power closer to the data.
• But, but, but
– Data is inherently mobile (with some latency)
– Computation is tied to a server. Real computation
power is not so mobile.
– I am in an automobile and my mobile is constantly
moving and getting further away from any fixed
computation source
91. Anyway, won’t my mobile have
sufficient computational power?
• Mobiles will always inherently be less powerful
than static client and server hardware
• Mobiles have to trade off
– size,
– weight,
– battery life,
– storage
with computational power
• Computational power is always going to be of
lower priority than these other qualities.
92. Cloudlet infrastructure
• A cloudlet infrastructure is
– A collection of servers with more computational power
than my mobile
– Connected to the internet
– Connected to a power source
– One data hop away from my mobile
• Where would these servers live?
– Doctor’s offices
– Coffee shops
– Think wireless access points
• A new client would instantiate a VM with the desired
app in a local cloudlet server.
93. How would it work?
• Three models
1. Download application VM from cloud into cloudlet
server on demand.
• Time consuming
• Allows for bare cloudlet server
• Allows arbitrary applications
2. Pre load most popular application VMs
• Fast
• Limited in terms of number of applications available
3. Use containers to preload most popular platforms and
download remaining software of an app.
• Intermediate in time required to initiate
• Intermediate in flexibility of apps.
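The trade-off among the three models can be sketched with back-of-the-envelope startup estimates. All sizes and the link bandwidth below are assumptions chosen only to make the "slow / fast / intermediate" ordering concrete:

```python
# Rough startup-time estimates for the three cloudlet provisioning models.
# Image sizes and bandwidth are illustrative assumptions, not measurements.

BANDWIDTH_MBPS = 100          # cloudlet's internet link, in megabits/s

def transfer_seconds(size_mb: float) -> float:
    """Seconds to download size_mb megabytes over the assumed link."""
    return size_mb * 8 / BANDWIDTH_MBPS

vm_image_mb = 4_000           # full application VM image
app_layer_mb = 200            # app-only layer atop a preloaded container platform

model1 = transfer_seconds(vm_image_mb)   # 1: download whole VM on demand -> 320 s
model2 = 0.0                             # 2: VM preloaded -> ~0 s, limited app choice
model3 = transfer_seconds(app_layer_mb)  # 3: fetch only the app layer -> 16 s
```

Under these assumptions model 3 starts roughly 20x faster than model 1 while still allowing apps that were not preloaded, which is why it sits between the other two in both time and flexibility.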
94. Revisit AR application
• Consider what is
involved to show what
is ahead based on
head location.
– Location
– Orientation
• Cloudlets have
potential to supply
necessary
infrastructure.
95. Summary of cloudlets
• Locally available servers with higher
computational power than mobiles
• Enables real-time application of some
technologies such as augmented reality or real-
time translation
• Based on creating VM in local server with
necessary app
• Feasible with current technology
97. Internet of Things
• “Everything” is connected to the internet
– Toasters
– Refrigerators
– Automobiles
– …
• Every device has both data generation and
control aspects.
• From a cloud perspective, data generation is
important.
98. Data available from a device
• Current state
• Current activity
• Current location
• Current power usage
• Information about the environment of the
device
• …
99. How frequently is the data sampled?
• Suppose each device is sampled once a
minute
• Suppose each sample contains 1MB
• The number of devices being sampled may be
in the 100,000s or 1,000,000s (think
automobiles, or engines, or tires, or …)
• Petabytes of data streaming into a data center.
• “Big Data” is the buzz term used for this level
of data.
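The arithmetic behind "petabytes streaming in" follows directly from the assumptions on this slide (one 1 MB sample per device per minute; one million devices):

```python
# The data-volume arithmetic from the slide above
# (device count and sample size are the slide's illustrative assumptions).

devices = 1_000_000          # devices being sampled
sample_mb = 1                # 1 MB per sample
samples_per_day = 24 * 60    # one sample per minute

mb_per_day = devices * sample_mb * samples_per_day   # 1,440,000,000 MB
tb_per_day = mb_per_day / 1_000_000                  # 1,440 TB ≈ 1.4 PB per day
```

At roughly 1.4 PB per day, a single month of raw samples exceeds 40 PB, which is why conventional storage and query techniques do not suffice.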
101. Requirements for streaming dbms
Database management system (DBMS)           | Data stream management system (DSMS)
--------------------------------------------|------------------------------------------------
Persistent data (relations)                 | Volatile data streams
Random access                               | Sequential access
One-time queries                            | Continuous queries
(Theoretically) unlimited secondary storage | Limited main memory
Only the current state is relevant          | Order of the input must be considered
Relatively low update rate                  | Potentially extremely high update rate
Little or no time requirements              | Real-time requirements
Assumes exact data                          | Assumes outdated/inaccurate data
Plannable query processing                  | Variable data arrival and data characteristics
102. Managing the data
• Synopses
– Select sample of raw data to support a particular analysis.
– Moving average of data
– May be inaccurate when performing analysis
• Windows
– Only look at a portion of the data
– Last 10 elements
– Last 10 seconds
• Choice of approach for managing the data is driven by
the types of analyses that the data is used for.
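The two approaches above can be sketched in a few lines. This is a minimal illustration of a count-based window ("last 10 elements" style, shown here with a window of 3) maintaining a moving-average synopsis over a stream:

```python
# Minimal sketch of a count-based sliding window with a moving-average synopsis.

from collections import deque

class SlidingWindow:
    """Keep only the last n stream elements and summarize them."""
    def __init__(self, n: int):
        self.buf = deque(maxlen=n)   # oldest element falls off automatically

    def push(self, value: float) -> float:
        self.buf.append(value)
        return sum(self.buf) / len(self.buf)   # moving average over the window

w = SlidingWindow(3)
for reading in [10, 20, 30, 40]:
    avg = w.push(reading)
# after 40 arrives the window holds [20, 30, 40], so avg == 30.0
```

A time-based window ("last 10 seconds") would evict by timestamp instead of by count, but the shape is the same: bounded state, continuously updated summary, and some inaccuracy relative to the full raw stream.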
103. Demand is for ever faster
processing
• Businesses want to use the data for real time
reaction
• E.g. Amazon has “customers who looked at this
also looked at”. Suppose they used this
information and your browsing history to offer
you a deal on buying two items together rather
than having you decide individually on each.
• Just one example out of hundreds.
104. Issues
• Managing the volume of data
• Security – how do you know data is correctly
identified?
• Privacy – what kind of information sharing will
consumers tolerate?
105. Summary for Internet of Things
• If every device is connected to the internet,
the data generated will be massive
• Specialized data management systems
• Specialized analysis systems.
• Problems include security and privacy