Architecting for the Cloud: Deployment, Case Study, and Future Trends



These are the last lectures for the course "Architecting for the Cloud". There are lectures on deployment, a case study, and future trends.


  1. 1. Architecting for the Cloud Len and Matt Bass Deployment
  2. 2. Deployment pipeline • Developers commit code • Code is compiled • Binary is processed by a build and unit test tool which builds the service • Integration tests are run followed by performance tests. • Result is a machine image (assuming virtualization) • The service (its image) is deployed to production. 2
  3. 3. Deployment Overview • Multiple instances of a service are executing • Red is the service being replaced with a new version • Blue are clients • Green are dependent services (diagram: version A and version B instances, with UAT / staging / performance tests)
  4. 4. Additional considerations • Recall release plan. • Each release requires coordination among stakeholders and development teams. – Can be explicitly performed – Can be implicit in the architecture (we saw an example of this) • You as an architect need to know how frequently new releases will be deployed. – Some organizations deploy dozens of times a day – Other organizations deploy once a week – Still others once a month or longer
  5. 5. Monthly deployment • There is time for coordination among development teams to ensure that components are consistent • There is time for “release manager” to ensure that a release is correct before moving it to production
  6. 6. Weekly deployment • Limited time for coordination among development teams • Time for “release manager” to ensure that a release is correct before moving it to production
  7. 7. Daily deployment • No time for – Coordination among development teams – Release manager to validate release • These items must be implicit in the architecture.
  8. 8. Deployment goal and constraints • Goal of a deployment is to move from the current state (N instances of version A of a service) to a new state (N instances of version B of a service) • Constraints associated with Continuous Deployment: – Any development team can deploy their service at any time, i.e. a new version of a service can be deployed either before or after a new version of a client (no synchronization among development teams) – It takes time to replace one instance of version A with an instance of version B (on the order of minutes) – Service to clients must be maintained while the new version is being deployed. 8
  9. 9. Deployment strategies • Two basic all-or-nothing strategies – Red/Black – leave the N instances with version A as they are, allocate and provision N instances with version B, then switch to version B and release the instances with version A – Rolling Upgrade – allocate one instance, provision it with version B, release one version A instance; repeat N times • Other deployment topics – Partial strategies (canary testing, A/B testing). We will discuss them later; for now we are discussing all-or-nothing deployment. – Rollback – Packaging services into machine images 9
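The rolling-upgrade loop above can be sketched in a few lines. This is a toy illustration only: a real deployment drives the cloud provider's tooling, and the Python list here merely stands in for the fleet of instances.

```python
# Toy sketch of a rolling upgrade: allocate and provision one
# version-B instance, release one version-A instance, repeat N times.
# Note the fleet transiently holds N+1 instances inside the loop.
def rolling_upgrade(fleet, n, new_version="B", old_version="A"):
    for _ in range(n):
        fleet.append(new_version)  # allocate + provision one new instance
        fleet.remove(old_version)  # release one old instance
    return fleet
```

Contrast with Red/Black, which would build a complete second fleet of N version-B instances before switching traffic.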
  10. 10. Trade offs – Red/Black and Rolling Upgrade • Red/Black – Only one version is available to clients at any particular time – Requires 2N instances (additional cost) – Accomplished by switching environments • Rolling Upgrade – Multiple versions are available for service at the same time – Requires N+1 instances • Both are heavily used. Choice depends on – Cost – Managing the complications of a rolling upgrade (diagram: Rolling Upgrade in EC2: update Auto Scaling Group, sort instances, remove & deregister old instance from ELB, confirm upgrade spec, terminate old instance, wait for ASG to start new instance, register new instance with ELB) 10
  11. 11. Types of failures during rolling upgrade – Provisioning failure: a research topic – Logical failure: inconsistencies, to be discussed – Instance failure: handled by the Auto Scaling Group in EC2 11
  12. 12. What are the problems with Rolling Upgrade? • Recall that any development team can deploy their service at any time. • Three concerns – Maintaining consistency between different versions of the same service when performing a rolling upgrade – Maintaining consistency among different services – Maintaining consistency between a service and persistent data 12
  13. 13. Maintaining consistency between different versions of the same service • Key idea – differentiate between installing a new version and activating a new version • Involves “feature toggles” (described momentarily) • Sequence – Develop version B with new code under control of feature toggle – Install each instance of version B with the new code toggled off. – When all of the instances of version A have been replaced with instances of version B, activate new code through toggling the feature. 13
  14. 14. Issues • Do you remember feature toggles? • How do I manage features that extend across multiple services? • How do I activate all relevant instances at once? 14
  15. 15. Feature toggle • Place feature dependent new code inside of an “if” statement where the code is executed if an external variable is true. Removed code would be the “else” portion. • Used to allow developers to check in uncompleted code. Uncompleted code is toggled off. • During deployment, until new code is activated, it will not be executed. • Removing feature toggles when a new feature has been committed is important. 15
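The if/else structure described on this slide can be made concrete with a minimal sketch. The toggle store and feature name below are illustrative; in practice the toggle value comes from an external configuration source, not an in-process dict.

```python
# Minimal feature-toggle sketch (illustrative names). New code ships
# inside the "if" branch, installed but inactive until the external
# toggle flips; the "else" branch is the code the feature replaces.
TOGGLES = {"new_pricing": False}

def compute_price(base):
    if TOGGLES["new_pricing"]:
        # new code path, deployed toggled-off, activated later
        return round(base * 1.08, 2)
    else:
        # old code path; deleted once the toggle is retired
        return base
```

Once the feature is fully committed, the toggle and the else branch should be removed, as the slide notes.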
  16. 16. Multi service features • Most features will involve multiple services. • Each service has some code under control of a feature toggle. • Activate feature when all instances of all services involved in a feature have been installed. – Maintain a catalog with feature vs service version number. – A feature toggle manager determines when all old instances of each version have been replaced. This could be done using registry/load balancer. – The feature manager activates the feature. 16
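The catalog-plus-registry check described above can be sketched as follows. The catalog contents, service names, and registry shape are all assumptions for illustration; a real toggle manager would query the service registry or load balancer.

```python
# Sketch of a feature-toggle manager's activation check. The catalog
# maps a feature to the minimum service versions it needs; the
# registry reports the versions of the instances currently running.
FEATURE_CATALOG = {"checkout_v2": {"cart": 3, "payment": 5}}

def can_activate(feature, registry):
    """Activate only when every running instance of every service
    involved in the feature is at or above the required version
    (i.e. all old instances have been replaced)."""
    required = FEATURE_CATALOG[feature]
    return all(
        version >= required[svc]
        for svc, versions in registry.items()
        if svc in required
        for version in versions
    )
```

A limitation worth noting: a service listed in the catalog but absent from the registry passes vacuously, so a real implementation would also check that every required service is present.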
  17. 17. Activating feature • The feature toggle manager changes the value of the feature toggle. Two possible techniques to get new value to instances. – Push. Broadcasting the new value will instruct each instance to use new code. If a lag of several seconds between the first service to be toggled and the last can be tolerated, there is no problem. Otherwise synchronizing value across network must be done. – Pull. Querying the manager by each instance to get latest value may cause performance problems. • A coordination mechanism such as Zookeeper will overcome both problems. 17
  18. 18. Maintaining consistency across versions (summary) • Install all instances before activating any new code • Use feature toggles to activate new code • Use feature toggle manager to determine when to activate new code • Use Zookeeper to coordinate activation with low overhead 18
  19. 19. Maintaining consistency among different services • Use case: – Wish to deploy new version of service A without coordinating with development team for clients of service A. • I.e. new version of service A should be backward compatible in terms of its interfaces. • May also require forward compatibility in certain circumstances, e.g. rollback 19
  20. 20. Achieving Backwards Compatibility • APIs can be extended but must always be backward compatible. • Leads to a translation layer External APIs (unchanging but with ability to extend or add new ones) Translation to internal APIs Client Client Internal APIs (changes require changes to translation layer but do not propagate further)
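The translation-layer pattern in this diagram can be sketched minimally. The function names and parameters are hypothetical; the point is that only the translation function changes when the internal API moves.

```python
# Sketch of the translation-layer pattern: the external API keeps a
# stable signature while internal APIs may change freely.

def internal_get_user_v2(user_id, include_profile):
    # internal API changed in v2: it now takes an extra flag
    return {"id": user_id, "profile": {} if include_profile else None}

def get_user(user_id):
    """External API: unchanging. When the internal API moved to v2,
    only this translation function was touched; clients of
    get_user() were unaffected."""
    return internal_get_user_v2(user_id, include_profile=False)
```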
  21. 21. What about dependent services? • Dependent services that are within your control should maintain backward compatibility • Dependent services not within your control (third party software) cannot be forced to maintain backward compatibility. – Minimize impact of changes by localizing interactions with third party software within a single module. – Keeping services independent and packaging as much as possible into a virtual machine means that only third party software accessed through message passing will cause problems. 21
  22. 22. Forward Compatibility • Gracefully handle unknown calls and data base schema information – Suppose your service receives a method call it does not recognize. It could be intended for a later version where this method is supported. – Suppose your service retrieves a data base table with an unknown field. It could have been added to support a later version. • Forward compatibility allows a version of a service to be upgraded or rolled back independently from its clients. It involves both – The service handling unrecognized information – The client handling returns that indicate unrecognized information. 22
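The two forward-compatibility behaviors above (service-side tolerance, client-informing returns) can be sketched as follows. The request shape and method names are illustrative assumptions.

```python
# Forward-compatibility sketch: the service tolerates calls and
# fields it does not yet understand instead of failing hard, so
# either side can be upgraded or rolled back independently.
KNOWN_METHODS = {"get", "put"}

def handle(request):
    method = request.get("method")
    if method not in KNOWN_METHODS:
        # Possibly intended for a later version: return a status the
        # client can recognize and handle, rather than crashing.
        return {"status": "unrecognized_method", "method": method}
    # Silently ignore fields added by a newer schema version.
    known = {k: request[k] for k in ("method", "key") if k in request}
    return {"status": "ok", "handled": known}
```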
  23. 23. Maintaining consistency between a service and persistent data • Assume new version is correct – we will discuss the situation where it is incorrect in a moment. • Inconsistency in persistent data can come about because data schema or semantics change. • Effect can be minimized by the following practices (if possible). – Only extend schema – do not change semantics of existing fields. This preserves backwards compatibility. – Treat schema modifications as features to be toggled. This maintains consistency among various services that access data. 23
  24. 24. I really must change the schema • In this case, apply pattern for backward compatibility of interfaces to schemas. • Use features of database system (I am assuming a relational DBMS) to restructure data while maintaining access to not yet restructured data. 24
  25. 25. Summary of consistency discussion so far • Feature toggles are used to maintain consistency within instances of a service • The backward compatibility pattern is used to maintain consistency between a service and its clients • Discouraging modification of schemas will maintain consistency between services and persistent data – If a schema must be modified, then synchronize modifications with feature toggles. 25
  26. 26. Canary testing • Canaries are a small number of instances of a new version placed in production in order to perform live testing in a production environment. • Canaries are observed closely to determine whether the new version introduces any logical or performance problems. If not, roll out new version globally. If so, roll back canaries. • Named after canaries in coal mines. 26
  27. 27. Implementation of canaries • Designate a collection of instances as canaries. They do not need to be aware of their designation. • Designate a collection of customers as testing the canaries. Can be, for example – Organizationally based – Geographically based • Then – Activate feature or version to be tested for canaries. Can be done through feature activation synchronization mechanism – Route messages from canary customers to canaries. Can be done through making registry/load balancer canary aware. 27
  28. 28. A/B testing • Suppose you wish to test user response to a system variant. E.g. UI difference or marketing effort. A is one variant and B is the other. • You simultaneously make available both variants to different audiences and compare the responses. • Implementation is the same as canary testing. 28
  29. 29. Rollback • New versions of a service may be unacceptable either for logical or performance reasons. • Two options in this case • Roll back (undo deployment) • Roll forward (discontinue current deployment and create a new release without the problem). • Decision to rollback or roll forward is almost never automated because there are multiple factors to consider. • Forward or backward recovery • Consequences and severity of problem • Importance of upgrade 29
  30. 30. States of upgrade. • An upgrade can be in one of two states when an error is detected. – Installed (fully or partially) but new features not activated – Installed and new features activated. 30
  31. 31. Possibilities • For this slide assume persistent data is correct. • Installed but new features not activated – Error must be in backward compatibility – Halt deployment – Roll back by reinstalling old version – Roll forward by creating new version and installing that • Installed with new features activated – Turn off new features – If that is insufficient, we are at prior case. 31
  32. 32. Persistent data may be incorrect • Keep a log of user requests (each with its own identification) • Identification of incorrect persistent data – Tag each data item with metadata giving the service and version that wrote the data and the user request that caused the data to be written • Correction of incorrect persistent data (simplistic version) – Remove data written by the incorrect version of a service – Install the correct version – Replay the user requests that caused the incorrect data to be written 32
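The simplistic correction scheme above can be sketched concretely. The data structures here (a store keyed by request id, each value tagged with the writing version) are illustrative assumptions, not the actual Netflix or course implementation.

```python
# Sketch of the tag-remove-replay correction scheme: every write is
# tagged with the service version that produced it, so data written
# by a bad version can be deleted and regenerated by replaying the
# logged user requests against the corrected service.
def correct(store, request_log, bad_version, reapply):
    # 1. Identify and remove items written by the incorrect version.
    bad_requests = {rid for rid, (ver, _) in store.items() if ver == bad_version}
    for rid in bad_requests:
        del store[rid]
    # 2. Replay the user requests that produced them; `reapply`
    #    stands in for the corrected service's write path.
    for rid, payload in request_log:
        if rid in bad_requests:
            store[rid] = reapply(payload)
    return store
```

This ignores the domino effects discussed on the next slide; data derived from the bad items would also need pedigree tracking.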
  33. 33. Persistent data correction problems I will not present good solutions to these problems. 1. Replaying user requests may involve requesting features that are not in the current version. – Requests can be queued until they can be correctly re-executed – Users can be informed of the error (after the fact) 2. There may be domino effects from incorrect data, i.e. other calculations may be affected. – Keep a pedigree for data items that allows determining which additional data items are incorrect. Remove them and regenerate them when requests are replayed. – Data that escaped the system, e.g. sent to another system or shown to a user, cannot be retrieved. 33
  34. 34. Summary of rollback options • Can roll back or roll forward • Rolling back without consideration of persistent data is relatively straightforward. • Managing erroneous persistent data is complicated and will likely require manual processing. 34
  35. 35. Packaging of services • The last portion of the deployment pipeline is packaging services into machine images for installation. • Two dimensions – Flat vs deep service hierarchy – One service per virtual machine vs many services per virtual machine 35
  36. 36. Flat vs Deep Service Hierarchy • Trading off independence of teams and possibilities for reuse. • Flat Service Hierarchy – Limited dependence among services & limited coordination needed among teams – Difficult to reuse services • Deep Service Hierarchy – Provides possibility for reusing services – Requires coordination among teams to discover reuse possibilities. This can be done during architecture definition. 36
  37. 37. Services per VM Image (diagram contrasting one service per VM image with multiple services per VM image: in each case a service is developed and then embedded into a VM image)
  38. 38. One Possible Race Condition with Multiple Services per VM (timeline) Initial state: VM image with version N of Service 1 and version N of Service 2. Developer 1 builds a new image with VN+1|VN and begins the provisioning process with it. Developer 2 then builds a new image with VN|VN+1 – without the new version of Service 1 – and begins provisioning with it. Result: version N+1 of Service 1 is not deployed until the next build of the VM image. Could be prevented by the VM image build tool.
  39. 39. Another Possible Race Condition with Multiple Services per VM (timeline) Initial state: VM image with version N of Service 1 and version N of Service 2. Developer 2 builds a new image with VN+1|VN+1 and begins provisioning with it. Developer 1 builds a new image with VN+1|VN and begins provisioning with it, overwriting the image created by Developer 2. Result: version N+1 of Service 2 is not deployed until the next build of the VM image. Could be prevented by the provisioning tool.
  40. 40. Trade offs • One service per VM – Message from one service to another must go through inter VM communication mechanism – adds latency – No possibility of race condition • Multiple Services per VM – Inter VM communication requirements reduced – reduces latency – Adds possibility of race condition caused by simultaneous deployment 40
  41. 41. Summary of Deployment • Rolling upgrade is common deployment strategy • Introduces requirements for consistency among – Different versions of the same service – Different services – Services and persistent data • Other deployment considerations include – Canary deployment – A/B testing – Rollback – Business continuity 41
  42. 42. Architecting for the Cloud Case Study
  43. 43. Overview • What is Netflix • Migrating to the cloud – Data – Build process – Testing process • Withstanding Amazon outage of April 21, 2011.
  44. 44. Overview • What is Netflix • Migrating to the cloud – Data – Build process – Testing process • Withstanding Amazon outage of April 21, 2011.
  45. 45. Netflix Corporation Launched in 1998 after its founder was irritated at having to pay late fees on a DVD rental. DVD Model • Pay a monthly membership fee that includes rentals and shipping, with no late fees • Maintain an online queue of desired rentals • When you return your last rental (depending on service plan), the next item in your queue is mailed to you together with a return envelope • Customers rate movies and Netflix recommends based on their preferences Unusual culture • Unlimited vacation time for salaried workers • Workers can take any percentage of their salary in stock options (up to 100%).
  46. 46. Streaming video - 1 Customers can watch Netflix streaming video on a wide variety of devices many of which feed into a TV – Roku set top box – Blu-ray disk players – Xbox 360 – TV directly – PlayStation 3 – Wii – DVRs – Nintendo Customers can stop and restart video at will. Netflix calls these locations in the films “bookmarks”.
  47. 47. Streaming video - 2 In 2007 Netflix began to offer streaming video to its subscribers. Initially, one hour of streaming video was available to customers for every dollar they spent on their plan. In Jan 2008, every customer was entitled to unlimited streaming video. In May 2011, Netflix streaming video accounted for 22% of all internet traffic, and 30% of traffic during peak usage hours. Three bandwidth tiers • Continuous bandwidth to the client of 5 Mbit/s – HDTV, surround sound • Continuous bandwidth to the client of 3 Mbit/s – better than DVD • Continuous bandwidth to the client of 1.5 Mbit/s – DVD quality
  48. 48. Overview • What is Netflix • Migrating to the cloud – Data – Build process – Testing process • Withstanding Amazon outage of April 21, 2011.
  49. 49. Netflix’s Growth In late 2008, Netflix had a single data center with Oracle as the main database system. With the growth of subscriptions and streaming video, it was clear that they would soon outgrow the data center.
  50. 50. Not Just More Requests • Netflix was expanding internationally • They had unpredictable usage spikes with product releases – E.g. Wii, and Xbox – Need infrastructure to handle usage spikes • Datacenters are expensive – Netflix was running Oracle on IBM hardware (not commodity hardware) – Could switch to commodity hardware but would have to have in-house expertise • Can’t hire enough system and database administrators to grow fast enough
  51. 51. Netflix Moves to Cloud Four reasons cited by Netflix for moving to the cloud 1. Every layer of the software stack needed to scale horizontally, be more reliable, redundant, and fault tolerant. This leads to reason #2 2. Outsourcing data center infrastructure to Amazon allowed Netflix engineers to focus on building and improving their business. 3. Netflix is not very good at predicting customer growth or device engagement. They underestimated their growth rate. The cloud supports rapid scaling. 4. Cloud computing is the future. This will help Netflix with recruiting engineers who are interested in honing their skills, and will help scale the business. It will also ensure competition among cloud providers helping to keep costs down. Why Amazon and EC2? In 2008, Amazon was the leading supplier. Netflix wanted an IaaS so they could focus on their core competencies.
  52. 52. Netflix applications What applications does Netflix have? • Video ratings, reviews, and recommendations • Video streaming • User registration, log-in • Video queues • Billing • DVD disc management – inventory and shipping • Video metadata management – movie cast information Strategy was to move in a phased manner. • A subset of applications would move in a particular phase • Means that data may have to be replicated between cloud and data center during the move.
  53. 53. Amazon’s S3 and SimpleDB SimpleDB is Amazon’s non-relational data store – Optimized for data access speed – Indexes data for fast retrieval – Physical drives are less dense to support this Amazon’s Simple Storage Service (S3) – Stores raw data – Optimized on storing larger data sets inexpensively
  54. 54. Choosing Cloud-Based Data Store SimpleDB and S3 both provide • Disaster recovery • Managed fail over and fail back • Distribution across availability zones • Persistence • Eventual consistency • SimpleDB has a query language
  55. 55. Cloud-Based Data Store II SimpleDB and S3 do not provide: • Transactions • Locks • Sequences • Triggers • Clocks • Constraints • Joins • Schemas and associated integrity checking • Native data types. All data in SimpleDB is string data
  56. 56. Migrating Data from Oracle to SimpleDB Whole data sets were moved, e.g. instant watch bookmarks were lifted wholesale into SimpleDB. Incremental data changes were kept in synch (bi-directionally) between Oracle and SimpleDB. Device support moved from Oracle to the cloud. During that transition, a user might begin watching a movie on a device supported in the cloud and continue watching on a device supported in Oracle; instant watch bookmarks are kept in synch for these types of possibilities. Once the need for the Oracle version is removed, the synchronization is discontinued and the data no longer resides in the Oracle database.
  57. 57. Using SimpleDB and S3 Move normal DB functionality up the stack to the application. Specifically: • Do “Group By” and “Join” in the application • De-normalize tables to reduce the necessity for joins • Do without triggers • Manage concurrency using time stamps and conditional Put and Delete • Do without clock operations • The application checks constraints on read and repairs data as a side effect
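Doing "Group By" in the application, as this slide prescribes, amounts to a few lines of aggregation code. This sketch is illustrative, not Netflix's data access layer.

```python
# Application-level "GROUP BY ... SUM(...)": SimpleDB offers neither
# joins nor aggregation, so the service fetches rows and aggregates
# them itself.
from collections import defaultdict

def group_by_sum(rows, key, value):
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)
```

Joins are handled similarly (fetch both row sets, match on a key in application code), which is why the slide also recommends de-normalizing tables to reduce how often that is necessary.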
  58. 58. Exploiting SimpleDB All datasets with a significant write load are partitioned across multiple SimpleDB domains Since all data is stored as strings • Store all date-times in ISO8601 format • Zero pad any numeric columns used in sorts or WHERE clause inequalities Atomic writes requiring both a delete of an attribute and an update of another attribute in the same item are accomplished through deleting and rewriting the entire item. SimpleDB is case sensitive to attribute names. Netflix implemented a common data access layer that normalizes all attribute names (e.g. TO_UPPER) Misspelled attribute names can fail silently so Netflix has a schema validator in the common data access layer Deal with eventual consistency by avoiding reads immediately after writes. If this is not possible, use ConsistentRead.
  59. 59. Netflix Build Process - Ivy Ant is a tool for automating the software build process (like Make). Ivy is an extension to Ant for managing (recording, tracking, resolving and reporting) project dependencies. Ant, Ivy
  60. 60. Netflix Build Process - Artifactory Artifactory is a repository manager. • Artifactory acts as a proxy between your build tool (Maven, Ant, Ivy, Gradle etc.) and the outside world. • It caches remote artifacts so that you don’t have to download them over and over again. • It blocks unwanted (and sometimes security-sensitive) external requests for internal artifacts and controls how and where artifacts are deployed, and by whom. Artifactory
  61. 61. Netflix Build Process - Jenkins Jenkins is an open source continuous integration tool. Jenkins provides a so-called continuous integration system, making it easier for developers to integrate changes to the project, and making it easier for users to obtain a fresh build. Jenkins monitors executions of externally-run jobs, such as cron jobs and procmail jobs, even those that are run on a remote machine. For example, with cron, Jenkins keeps e-mail outputs in a repository.
  62. 62. Migration Lessons • The change from reliable to commodity hardware impacts many things – E.g. in the old world memory based session state was fine • Hardware failures were rare – Not appropriate on the cloud • Co-tenancy is hard – In the cloud all resources are shared e.g. • Hardware, network, storage, … – This introduces variable performance depending on load • At all levels of the stack – Have to be willing to abandon sub-task or manage resources to avoid co-tenancy • The best way to avoid failure is to fail consistently – Netflix has designed their systems to be tolerant to failure – They test this capability regularly
  63. 63. Unleash the Monkey • As a result of disruption of services Netflix adopted a process of testing the system • They learned that the best way to test the system was at scale • In order to learn how their systems handle failure they introduced faults – Along came the “Chaos Monkey” • The Chaos Monkey randomly disables production instances • This is done in a tightly controlled environment with engineers standing by – But it is done in the production environment • The engineers then learn how their systems handle failures and can improve the solution accordingly
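The Chaos Monkey's core behavior, randomly disabling a production instance, reduces to a very small loop. This is a toy sketch, not Netflix's tool: real runs happen in production, in a tightly controlled window, with engineers standing by.

```python
# Toy Chaos Monkey: pick a random running instance and terminate it,
# then let monitoring confirm the system absorbs the failure.
import random

def chaos_monkey(instances, rng=random):
    """Disable one randomly chosen instance from a set of ids."""
    victim = rng.choice(sorted(instances))  # sorted for determinism under a seeded rng
    instances.discard(victim)               # stands in for terminating the instance
    return victim
```

The value is not in the code but in the discipline: because failures are injected constantly, every service team must design for, and regularly re-verify, graceful failure handling.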
  64. 64. Increase the Chaos / Netflix Test Suite - 1 • The Chaos Monkey was so successful Netflix created a “Simian Army” to test the system • The “Army” includes: – Chaos monkey. Randomly kill a process and monitor the effect. – Latency monkey. Randomly introduce latency and monitor the effect. – Doctor monkey. The Doctor Monkey taps into health checks that run on each instance as well as monitors other external signs of health (e.g. CPU load) to detect unhealthy instances. – Janitor Monkey. The Janitor Monkey ensures that the Netflix cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them.
  65. 65. Netflix Test Suite - 2 • Conformity Monkey. The Conformity Monkey finds instances that don’t adhere to best practices and shuts them down. For example, if an instance does not belong to an auto-scaling group, that is a potential problem. • Security Monkey. The Security Monkey is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all SSL and DRM certificates are valid and are not coming up for renewal. • 10-18 Monkey. The 10-18 Monkey (Localization-Internationalization) detects configuration and run-time problems in instances serving customers in multiple geographic regions, using different languages and character sets. The name comes from the abbreviations L10n and I18n, in which 10 and 18 count the letters omitted from “localization” and “internationalization”. • Chaos Gorilla. The Gorilla is similar to the Chaos Monkey but simulates the outage of an entire Amazon availability zone.
  66. 66. Overview • What is Netflix • Migrating to the cloud – Data – Build process – Testing process • Withstanding Amazon outage of April 21, 2011
  67. 67. April 21, 2011 Outage – Background An EBS cluster is comprised of a set of EBS nodes. When Node A loses connectivity to a node (Node B) to which it is replicating data, Node A assumes Node B has failed. In this case, it must find another node to which the data can be replicated. This is called re-mirroring. When data is being re-mirrored, all nodes that have copies of that data retain the data and block all external access to it; this is called the node being “stuck”. In the normal case, a node is stuck for only a few milliseconds.
  68. 68. April 21, 2011 Outage - Events At 12:47 AM PDT, a configuration change was made to upgrade the capacity of the primary EBS network within one availability zone in the Eastern Region. • Standard practice is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen • Traffic was shifted incorrectly to the lower capacity backup network. Consequently, there was no functioning primary or secondary network in the affected portion of the availability zone since traffic was shifted off of the primary and the secondary became quickly overloaded. When the error was detected and connectivity restored, a large number of nodes that had been isolated began searching for other nodes to which they could re-mirror their data. This caused a re-mirroring storm where all of the available space in the affected cluster was used up and the attempts to gain access to another availability zone overwhelmed the primary communication network. This caused the whole region to be unavailable
  69. 69. April 21, 2011 Outage – Consequences At 12:04 PM PDT, the outage was contained to the one affected availability zone. 13% of the nodes in the availability zone remained “stuck”. Additional capacity was added to the availability zone to allow those nodes to become unstuck. In the meantime, some of the stuck volumes experienced hardware failures. Hardware failures are a normal occurrence within a cluster. Ultimately, the data on 0.07% of the nodes in the affected availability zone could not be recovered and was permanently lost.
  70. 70. Netflix and the Amazon Outage of April, 2011 On April 21, 2011, one of the availability zones in Amazon’s Eastern Region failed. – Many organizations were unavailable for the duration of the outage (24 hours) – Some organizations permanently lost data Netflix was largely unaffected – Customers experienced no disruption of service – There were some increased latencies – A higher than normal error rate – No interruption in customers’ ability to find and watch movies
  71. 71. How Did Netflix Accomplish This? Stateless services • Services are largely stateless • If a server fails the request can be re-routed • Remember the discussion of modular redundancy? – The lack of state means failover times are negligible Data stored across availability zones • In some cases it was not practical to re-architect the system to be stateless • In these cases there are multiple hot copies of the data across availability zones • In the event of a failure, again, the failover time is negligible – What is the cost of doing this?
  72. 72. Design Decisions II Graceful degradation: The Netflix systems are designed for failure. A lot of thought went into what to do when they fail. • Fail fast – aggressive time-outs so that failing components do not slow the whole system down • Fallbacks – every feature has the ability to fall back to a lower quality representation, e.g. if Netflix cannot generate personalized recommendations, they will fall back to non-personalized results • Feature removal – if a feature is non-critical and is slow, it may be removed from any given page. N+1 redundancy • The system is architected with more capacity than it needs at any time • This allows them to cope with large spikes in load caused by users directly or by the ripple effect of transient failures
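The fail-fast-with-fallback pattern above can be sketched in a few lines. The function names are illustrative; in Netflix's stack the "aggressive time-out" would be enforced by the RPC layer rather than a bare try/except.

```python
# Sketch of graceful degradation: attempt the personalized path,
# and on any failure (including a time-out raised by the RPC layer)
# degrade to a non-personalized result instead of slowing or
# breaking the whole page.
def recommendations(user_id, personalized, popular):
    try:
        return personalized(user_id)
    except Exception:
        # Fallback: lower-quality but always-available representation.
        return popular()
```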
  73. 73. What Issues Were Experienced? Netflix did have some issues, however, including: • As the availability zone started to fail Netflix decided to pull out altogether • This required manual configuration changes to AWS • Each service team was involved in moving their service out of the zone • This was a time-consuming and error-prone process Load balancers had to be manually adjusted to keep traffic out of the affected availability zone. • Netflix uses Amazon’s Elastic Load Balancing (ELB) • This first balances load to availability zones and then to instances • If you have many instances go down, the others in that zone have to pick up the slack • If you can’t bring up more nodes you will experience a cascading effect until all nodes go down
  74. 74. Amazon Outage of Aug 8, 2011 On Aug 8, 2011, the Eastern Region of AWS went down for about half an hour due to connectivity issues that affected all availability zones … Netflix was down as well…
  75. 75. Summary Netflix moved to the cloud because • They doubted their ability to correctly predict the load • They did not want to invest in data centers directly Netflix chose SimpleDB as their original target database system and their migration strategy involved • Running Oracle and SimpleDB concurrently for a period • Moving some features provided by Oracle up the application stack Netflix has a number of test programs that inject faults or look for specific types of violations. Netflix re-architected their build process to support sophisticated deployment practices and to allow for continuous integration. Netflix did a number of things to promote availability and these enabled Netflix to continue customer service through an extensive AWS outage.
  76. 76. Reference
  77. 77. Questions??
  78. 78. Architecting for the Cloud Future Trends
  79. 79. Topics • Containers • Augmented reality • Cloudlets • Internet of things – cars, appliances, etc.
  80. 80. Containers • Loading a full VM at deployment time – Takes time for large machine images – Results in many slightly different VM images for different versions – Depends on a particular processor even if virtualized • Containers are a proposed solution – The machine image consists of the OS + libraries – The application portion runs inside this machine image and is loaded by the machine image once it is instantiated
  81. 81. What is a Container
  82. 82. Containers vs VMs
  83. 83. Virtues of Containers
  84. 84. Adoption of Containers • Containers are an old concept • Docker is a software system that packages applications into containers • Cloud providers are adopting containers since they control the entire stack and can choose the OSs.
  85. 85. Topics • Containers • Augmented reality • Cloudlets • Internet of things – cars, appliances, etc.
  86. 86. Augmented Reality • Superimpose computer-generated images over visual images
  87. 87. Simple text overlays have been available for a long time • CMU wearable computer group, early 1990s • Google Glass is a modern display device
  88. 88. Computer power needed for more sophisticated images • Route planning – showing what is ahead if you take a particular route • Visualizing furniture in a room • These applications require – Location determination – Orientation determination • This computing power is not available on mobile devices. • This is a lead-in to “cloudlets”
  89. 89. Cloudlets • Mobile devices do not have the CPU power necessary for some applications. • Consequently, they act as data sources for apps in the cloud. – This introduces latency in responding to a user request – This latency does not matter for some purposes – e.g. airline reservations, music streaming – But – it does matter for others, e.g. voice recognition, translation, augmented reality.
  90. 90. How does one match computation power and data with low latency? • Move computation power closer to the data. • But, but, but – Data is inherently mobile (with some latency) – Computation is tied to a server. Real computation power is not so mobile. – I am in an automobile and my mobile is constantly moving and getting further away from any fixed computation source
  91. 91. Anyway, won’t my mobile have sufficient computational power? • Mobiles will always inherently be less powerful than static client and server hardware • Mobiles have to trade off – size – weight – battery life – storage against computational power • Computational power is always going to be a lower priority than these other qualities.
  92. 92. Cloudlet infrastructure • A cloudlet infrastructure is – A collection of servers with more computational power than my mobile – Connected to the internet – Connected to a power source – One data hop away from my mobile • Where would these servers live? – Doctor’s offices – Coffee shops – Think wireless access points • A new client would instantiate a VM with the desired app in a local cloudlet server.
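The client-side decision this infrastructure implies can be sketched as below. This is a hypothetical illustration (all names, URLs, and the latency-probe shape are invented): prefer a discovered one-hop cloudlet and fall back to the distant cloud when none is available.

```python
def choose_endpoint(local_cloudlets, cloud_endpoint):
    """Pick the lowest-latency one-hop cloudlet, else fall back to the cloud."""
    if local_cloudlets:
        best = min(local_cloudlets, key=lambda c: c["latency_ms"])
        return best["url"]
    return cloud_endpoint

# Hypothetical discovery results: two cloudlets one hop away.
nearby = [
    {"url": "http://cafe-cloudlet.local", "latency_ms": 5},
    {"url": "http://office-cloudlet.local", "latency_ms": 12},
]
print(choose_endpoint(nearby, "https://cloud.example.com"))
# -> http://cafe-cloudlet.local
print(choose_endpoint([], "https://cloud.example.com"))
# -> https://cloud.example.com
```

The real systems problem is everything around this function: discovering cloudlets, instantiating the application VM on them, and handing state over as the mobile moves.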
  93. 93. How would it work? • Three models 1. Download the application VM from the cloud into the cloudlet server on demand. • Time consuming • Allows for a bare cloudlet server • Allows arbitrary applications 2. Preload the most popular application VMs • Fast • Limited in terms of the number of applications available 3. Use containers to preload the most popular platforms and download the remaining software of an app. • Intermediate in time required to initiate • Intermediate in flexibility of apps.
  94. 94. Revisit AR application • Consider what is involved in showing what is ahead based on head position – Location – Orientation • Cloudlets have the potential to supply the necessary infrastructure.
  95. 95. Summary of cloudlets • Locally available servers with higher computational power than mobiles • Enables real time application of some technologies such as augmented reality or real time translation • Based on creating VM in local server with necessary app • Feasible with current technology
  96. 96. Topics • Containers • Augmented reality • Cloudlets • Internet of things – cars, appliances, etc.
  97. 97. Internet of Things • “Everything” is connected to the internet – Toasters – Refrigerators – Automobiles – … • Every device has both data generation and control aspects. • From a cloud perspective, data generation is important.
  98. 98. Data available from a device • Current state • Current activity • Current location • Current power usage • Information about the environment of the device • …
  99. 99. How frequently is the data sampled? • Suppose each device is sampled once a minute • Suppose each sample contains 1 MB • The number of devices being sampled may be in the 100,000s or 1,000,000s (think automobiles, or engines, or tires, or …) • Petabytes of data streaming into a data center • “Big Data” is the buzz term used for this level of data
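The "petabytes" claim above can be checked with the slide's own hypothetical figures (one 1 MB sample per device per minute, a million devices):

```python
# Back-of-the-envelope check of the data rate sketched above.
devices = 1_000_000          # sampled devices (e.g. automobiles)
sample_bytes = 1_000_000     # 1 MB per sample
samples_per_day = 60 * 24    # one sample per minute

bytes_per_minute = devices * sample_bytes
bytes_per_day = bytes_per_minute * samples_per_day

print(bytes_per_minute)  # 1_000_000_000_000 -> about 1 TB per minute
print(bytes_per_day)     # 1_440_000_000_000_000 -> about 1.44 PB per day
```

Even at these round numbers, a single fleet generates more than a petabyte per day, which is what motivates the streaming DBMSs on the next slide.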
  100. 100. Leads to streaming DBMSs • SQLstream • STREAM • Aurora / StreamBase Systems, Inc. • TelegraphCQ • NiagaraCQ • QStream • PIPES • webMethods Business Events • StreamGlobe • Odysseus • StreamInsight • InfoSphere Streams • Kinesis
  101. 101. Requirements for streaming DBMSs – database management systems (DBMS) vs. data stream management systems (DSMS): • Persistent data (relations) vs. volatile data streams • Random access vs. sequential access • One-time queries vs. continuous queries • (Theoretically) unlimited secondary storage vs. limited main memory • Only the current state is relevant vs. consideration of the order of the input • Relatively low update rate vs. potentially extremely high update rate • Little or no time requirements vs. real-time requirements • Assumes exact data vs. assumes outdated/inaccurate data • Plannable query processing vs. variable data arrival and data characteristics
  102. 102. Managing the data • Synopses – Select a sample of the raw data to support a particular analysis – e.g. a moving average of the data – May be inaccurate when performing analysis • Windows – Only look at a portion of the data – the last 10 elements – the last 10 seconds • The choice of approach for managing the data is driven by the types of analyses the data is used for.
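A count-based window (one of the two approaches above) can be sketched in a few lines; the class name is invented for illustration, and it assumes a simple numeric stream:

```python
from collections import deque

class WindowedAverage:
    """Moving average over the last `size` elements of a stream."""
    def __init__(self, size=10):
        # deque with maxlen drops the oldest element automatically,
        # so memory stays bounded no matter how long the stream runs.
        self.window = deque(maxlen=size)

    def push(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

w = WindowedAverage(size=3)
results = [w.push(v) for v in [1, 2, 3, 4]]
print(results[-1])  # average of the last 3 elements (2, 3, 4) -> 3.0
```

A time-based window would evict by timestamp instead of by count, but the bounded-memory idea is the same: the system never holds the whole stream, which is exactly the DSMS constraint from the previous slide.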
  103. 103. Demand is for ever faster processing • Businesses want to use the data for real-time reaction • E.g. Amazon has “customers who looked at this also looked at”. Suppose they use this information and your browsing history to offer you a deal on two items rather than have you decide individually to buy the two items. • Just one example out of hundreds.
  104. 104. Issues • Managing the volume of data • Security – how do you know data is correctly identified? • Privacy – what kind of information sharing will consumers tolerate?
  105. 105. Summary for Internet of Things • If every device is connected to the internet, the data generated will be massive • Specialized data management systems • Specialized analysis systems. • Problems include security and privacy