Monitoring for Pending Failure • Besides PING… • A dashboard of flashing lights • Monitoring ongoing CPU, memory uti...
State • An instance can be stateful or stateless • A stateful instance remembers information from one message to another. S...
Stateful Recovery • Strategy depends on how much loss of computation and events can be tolerated. • Strategy – 1 – C...
Stateful Recovery Strategy – 2 • Periodically save important state on persistent external device. • When image is activated...
Stateful Recovery Strategy – 3 • Periodically save important state on persistent external device • Log incoming messages on...
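A minimal Python sketch of how Strategy 3 could work. The slide text is cut off here, so this assumes that logged messages are replayed on top of the last saved state after a failure. `CounterService`, `checkpoint` and `recover` are hypothetical names, and an in-memory list stands in for the persistent external device.

```python
class CounterService:
    """Toy stateful instance: its state is a running total of message values."""

    def __init__(self, total=0):
        self.total = total

    def handle(self, amount):
        self.total += amount


def checkpoint(service):
    # Stand-in for saving important state to a persistent external device.
    return {"total": service.total}


def recover(saved_state, logged_messages):
    # Start a fresh instance from the last checkpoint, then replay the
    # messages that were logged after that checkpoint was taken.
    service = CounterService(total=saved_state["total"])
    for amount in logged_messages:
        service.handle(amount)
    return service


original = CounterService()
original.handle(5)
saved = checkpoint(original)       # periodic save
message_log = []                   # messages received since the checkpoint
for amount in (2, 3):
    message_log.append(amount)     # log first, then process
    original.handle(amount)

# Pretend the original instance has failed and provision a replacement.
replacement = recover(saved, message_log)
```

The more often state is saved, the shorter the log that has to be replayed; the trade-off depends on how much lost computation can be tolerated.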
Comments on Stateful recovery strategies • Only strategy 1 (provision with checkpointed image) is specific to cloud • Other...
Stateless images • If instance is stateless then – Infrastructure can send any message to any instance – Can ...
How do messages get to instances? • Two models – Push. Load balancer decides which instance should get mess...
Push Architecture Pattern (diagram): Clients, Load balancer ...
Push Pattern Description: Client sends a request (e.g. HTTP message) to the app in the cloud. Request arrives at a loa...
Monitor: The load balancer knows – CPU utilization for each VM through monitor – how many requests each VM has gotte...
Failure management within Push Pattern • Monitor will recognize failure of instance through non-responsiveness. • Load Bala...
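A small Python sketch of failure management in the push pattern. The slide is truncated, so this assumes that once the monitor reports an instance as non-responsive, the load balancer simply stops routing to it. `PushLoadBalancer` and its round-robin policy are illustrative assumptions, not the behaviour of any particular cloud load balancer.

```python
class PushLoadBalancer:
    """Round-robin balancer that skips instances reported as failed."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._next = 0

    def mark_failed(self, instance):
        # Called when the monitor sees the instance as non-responsive.
        self.healthy.discard(instance)

    def route(self):
        if not self.healthy:
            raise RuntimeError("no healthy instances available")
        # Walk the ring until a healthy instance is found.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next % len(self.instances)]
            self._next += 1
            if candidate in self.healthy:
                return candidate


balancer = PushLoadBalancer(["vm-a", "vm-b", "vm-c"])
first_round = [balancer.route() for _ in range(3)]
balancer.mark_failed("vm-b")
after_failure = [balancer.route() for _ in range(2)]
```

Requests that were already sent to the failed instance are not recovered by this sketch; the client would have to retry them.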
Pull architecture pattern (aka Producer-Consumer) (diagram): Clients, Load balancer/q...
Pull architecture description: Each request from the client is application specific and typed. The queue keeps separate queu...
Monitor: The monitor can now see – how long a request waits in a queue – the average queue length. This is an indica...
Failure Management within Pull Pattern • Controller knows when message has been processed. • If message is not processed wi...
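A Python sketch of failure management in the pull pattern. The slide text is cut off, so this assumes that a message which is taken but not acknowledged within a time limit becomes available to other instances again. `PullQueue`, `visibility_timeout` and the explicit `now` argument are hypothetical and only illustrate the idea.

```python
from collections import deque


class PullQueue:
    """Queue that redelivers a message if the consumer never acknowledges it."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.pending = deque()
        self.in_flight = {}  # message -> time at which it is redelivered

    def put(self, message):
        self.pending.append(message)

    def take(self, now):
        # Return expired in-flight messages to the queue first.
        for message, deadline in list(self.in_flight.items()):
            if now >= deadline:
                del self.in_flight[message]
                self.pending.append(message)
        if not self.pending:
            return None
        message = self.pending.popleft()
        self.in_flight[message] = now + self.visibility_timeout
        return message

    def ack(self, message):
        # The consumer confirms that processing finished.
        self.in_flight.pop(message, None)


queue = PullQueue(visibility_timeout=5)
queue.put("resize-photo-17")
taken_by_failed_worker = queue.take(now=0)   # this worker then crashes
nothing_available = queue.take(now=1)        # message is still in flight
redelivered = queue.take(now=6)              # timeout passed, taken again
queue.ack(redelivered)
```

Because a slow worker can finish after its message has been redelivered, handlers in this style are usually written so that processing the same message twice is harmless.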
Cleaning up: When instance fails it is not automatically deallocated. Consumer must deallocate failed instance. When instance de...
Data Failure • Data storage can be “ephemeral” or “persistent” • Ephemeral storage disappears if instance fails • Persistent ...
Data Consistency • Takes time to replicate data • Means that different replicas of the data may not be instantaneously cons...
Characterising Eventual Consistency in Amazon SimpleDB • The probability to read updated data in SimpleDB in US West –...
Operator error • After trying out something in AWS, may want to go back to original state • Not always that straight-forwar...
Undo for System Operators (diagram): Administrator, begin- do ...
Approach (diagram): Administrator, begin- do ...
Approach (diagram): Administrator, begin- do ...
Approach (diagram): Administrator, begin- do ...
Location of instances • Amazon divides the cloud into – Regions (currently eight) • US – east (North...
User Visible Failures • Operator error is largest cause of user visible errors in large Internet systems • Largest cause of...
Upgrade Frequency: Upgrades to systems are a very common occurrence. Upgrade frequency of some common systems: Appl...
Configuration parameters • Options are extensive – Hadoop – 206 – Cassandra – 36 – HBase – 64 • Massive...
Basic upgrade strategies • Rolling Upgrade – Perform upgrade one node at a time • Does not require add...
Potential error condition during rolling upgrade • Multiple versions are simultaneously active during rolling upgrade • O...
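To illustrate why a rolling upgrade leaves multiple versions active at once, here is a short Python sketch that upgrades one node at a time and records the cluster state after each step. The node names, version labels and the `rolling_upgrade` function are made up for the example.

```python
def rolling_upgrade(nodes, new_version):
    """Upgrade one node at a time, yielding the cluster state after each step."""
    state = dict(nodes)
    for name in sorted(state):
        state[name] = new_version   # take one node out, upgrade it, put it back
        yield dict(state)


cluster = {"node-1": "v1", "node-2": "v1", "node-3": "v1"}
steps = list(rolling_upgrade(cluster, "v2"))

# Steps during which clients can be served by either version.
mixed_steps = [step for step in steps if len(set(step.values())) > 1]
```

Every step except the last has both versions in service, which is the window in which the mixed-version problems described on this slide can occur.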
Mixed Version Race Condition (diagram): Client (browser), Server ...
Assumptions/Requirements for a Solution • Requirements – Clients never interact with decreasing versions. i.e. ...
Key Ideas of Proposed Solution - 1 • Consider different versions as separate endpoints for a message. Each version is www...
Key ideas of Proposed Solution - 2 • Load Balancer portion – Use a load balancer that routes messages to different ...
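The requirement that clients never interact with decreasing versions can be sketched as a routing rule. This is only an illustration under assumptions: the slides are truncated, so the sketch does not claim to be the proposed solution itself. `VersionAwareRouter` is a hypothetical name, versions are modelled as integers, and the router remembers the highest version each client has already been served by.

```python
class VersionAwareRouter:
    """Never routes a client to a version older than one it has already seen."""

    def __init__(self):
        self.floor = {}  # client id -> lowest version still acceptable

    def route(self, client_id, available_versions):
        minimum = self.floor.get(client_id, 0)
        candidates = [v for v in available_versions if v >= minimum]
        if not candidates:
            raise LookupError("no endpoint at an acceptable version")
        chosen = min(candidates)          # oldest version that is still allowed
        self.floor[client_id] = chosen
        return chosen


router = VersionAwareRouter()
first = router.route("browser-1", [1, 2])    # both versions active
second = router.route("browser-1", [2])      # only upgraded nodes reachable
third = router.route("browser-1", [1, 2])    # version 1 is no longer allowed
other = router.route("browser-2", [1, 2])    # a different client is unaffected
```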
Achieving Elasticity • Elasticity means the ability to create new (virtual) resources on demand • Providers allow consumer ...
Provisioning Latency • Small Instance – 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Com...
Provisioning Forecasting • Approaches to predict appropriate number of instances • Technique 1 (due to Sadeka Islam) ...
Latency of Communication • Measurements by Robin Meehan based on http-ping • Within EU region but across availability zone...
Security topics • Credentials and keys • Management of credentials and keys in the cloud • Multi-tenancy • Location dependenc...
Credentials and keys • A credential identifies you – As an individual – As having certain privileges ...
Basic Data protection (diagram): App outside, App inside of cloud ...
What can go wrong with the Basic Data Protection? • Suppose cloud provider has to respond to subpoena for data. Your data ...
Use of credentials • Log into app in the cloud • Attach a disk volume • Download application from a non-public lo...
Vulnerabilities to Credentials • Compromised inadvertently through social engineering means or carelessness • Held by disgr...
Goals for credential storage • Easy to do. If it is difficult to store credentials, people will avoid their use. A script ...
Options for getting credentials to App in the cloud • Send credentials from client outside the cloud – HTTPS will nego...
More options for getting credentials to App server • Build credentials into the image – App server is instantiated ...
Conclusion with respect to credential management • No insurmountable problem • Needs to be thought through – Who has a...
What is Multi-tenancy? (diagram): VM for, VM for, VM for custo...
Multi Tenancy Gets More Complicated (diagram): End users, VM for, VM...
Multi Tenancy Means “Sharing” • Consumers share hardware – CPU – Network – Storage media • Consumers sh...
What are the problems with Multi-tenancy? • Performance – other users or consumers will consume resources and, potentially...
Isolation assumptions • Virtual machines are isolated based on virtual memory technology and addressing scheme – Pr...
Personally Identifiable Information • Personally identifiable (US NIST) – Information which can be used to distingui...
Location dependency/governance • Some jurisdictions require that personal information for their jurisdiction is not stored...
What does this mean in the cloud? • Knowing location of data centers – Amazon provides locations of their data cente...
Use tokens as a replacement for PII • A token is an identifier that has no mathematical mapping to the individual being id...
Example of token use • Original data – John Doe – Sensitive information • Token table (kept locally to conform...
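A minimal Python sketch of the token idea, assuming the token table stays on local infrastructure and only tokenised records are sent to the cloud. `TokenTable` is a hypothetical class. Tokens are drawn at random with the standard library `secrets` module, so there is no mathematical mapping from a token back to the person; the table is the only link.

```python
import secrets


class TokenTable:
    """Local lookup table that is the only link between tokens and identities."""

    def __init__(self):
        self._token_for = {}
        self._identity_for = {}

    def tokenize(self, identity):
        if identity not in self._token_for:
            token = secrets.token_hex(8)   # random, not derived from identity
            self._token_for[identity] = token
            self._identity_for[token] = identity
        return self._token_for[identity]

    def resolve(self, token):
        return self._identity_for[token]


table = TokenTable()                       # kept locally
token = table.tokenize("John Doe")

# Only this record, which contains no PII, would be stored in the cloud.
cloud_record = {"id": token, "data": "Sensitive information"}
owner = table.resolve(cloud_record["id"])  # done locally when needed
```

A hash of the name would not qualify as a token in this sense, because anyone who can guess the name can recompute the hash.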
How about jurisdictional problem? • Tokens – Technique for decoupling PII from identifier. – Adds a level of ...
Questions
Netflix Corporation • Launched in 1998 after founder was irritated at having to pay late fees on a DVD rental. • DVD Model ...
Streaming video - 1 • Streaming video service introduced in 2008 • Customers can watch Netflix streaming video on a wide va...
Streaming video - 2 • Initially, one hour of streaming video was available to customers for every dollar they spent on th...
Internet statistics • In May, 2011, Netflix streaming video accounted for 22% of all internet traffic. 30% of traffic duri...
Netflix’s move to the cloud • In late 2008, Netflix had a single data center with Oracle as the main database system. • Wit...
Why EC2? • Four reasons cited by Netflix for moving to the cloud 1. Every layer of the software stack needed to scale hor...
Netflix applications • Video ratings, reviews, and recommendations • Video streaming • User registration, log-in • Video queues • Billing...
Netflix Reliability • Deep service dependency hierarchy • 1 billion incoming calls/day • Across 1000s of instances • Inter...
Approach to detecting faults • Fast network timeouts and retries • Separate threads on per-dependency thread pools • Semap...
If failure detected • Custom fallback – Each service has specific fallback plan • Fail silent – Service return...
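The retry-then-fallback idea on these two slides can be sketched in a few lines of Python. This is a generic illustration, not Netflix's code: `call_with_fallback` is a hypothetical helper, the dependency call is any function that may raise, and thread pools and real network timeouts are left out.

```python
def call_with_fallback(operation, fallback, retries=2):
    """Try the dependency a few times, then use the service-specific fallback."""
    for _ in range(retries + 1):
        try:
            return operation()
        except Exception:
            continue   # a timeout or error counts as a failed attempt
    return fallback()


attempts = {"count": 0}


def flaky_recommendations():
    # Fails twice (for example, on a fast timeout), then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("dependency too slow")
    return ["personalised list"]


def always_down():
    raise TimeoutError("dependency unavailable")


def popular_titles():
    # Custom fallback: a generic answer instead of a personalised one.
    return ["popular titles"]


recovered = call_with_fallback(flaky_recommendations, popular_titles, retries=2)
degraded = call_with_fallback(always_down, popular_titles, retries=2)
```

A "fail silent" fallback would be the same call with a fallback function that returns an empty result.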
Netflix test suite - 1 • Netflix has a variety of test programs they call the Simian Army. These programs include ...
Netflix test suite - 2 – Conformity Monkey. The Conformity Monkey finds instances that don’t adhere to bes...
Performance • Create new auto-scaling group for each new version of code – Copy entire configuration to new group ...
SmugMug • Photo sharing site • Survived April AWS outage • Recommendations – Spread across as many availability zone...
Others • Bizo – Use circuit breakers. Assume services will fail, cache data and monitor extensively to dete...
Enterprise DR under pressure? Issues… Good DR is on...
Using Cloud for Business Continuity • Two main usages of cloud for Business Continuity: – Provides highly available ...
Building Highly Reliable Systems with Cloud • Must address potential failures at two levels: – Hardware/Infrastructu...
DR As A Service – Requirements • Cost Effective DR-As-A-Service is essential to get the DR solution deployed • Deep archite...
Case Study: Building Reliable System using EC2 • Highly replicated ...
Case Study: Building Reliable System using EC2 (Contd) • Data backup in AWS – Amazon S3 is best for off-site data ba...
The Business Opportunity: “always-on” costs in cloud. Also, very hot one ...
Yuruware Bolt
Questions
Conclusions • Cloud Computing brings unique dependability challenges • Latency across the global links ...
References • How to keep your AWS credentials on an EC2 Instance Securely, Shlomo Swidler, http://shlomoswidler.com/2009/0...
References - 2 • Cloud Software Updates: Challenges and Opportunities, Neamtiu, Dumitras, http://www.ece.cmu.edu/~tdumitra/...
References - 3 • Why do upgrades fail and what can we do about it? Tudor Dumitras and Priya Narasimhan. 2009. Why do upgra...
References - 4 • How a consumer can measure elasticity for cloud platforms, Sadeka Islam, Kevin Lee, Alan Fekete, Anna Liu...
Q&A. Thank You! Research study opportunities in dependable cloud computing: • Software Architecture • Da...
WICSA 2012 tutorial

Tutorial on building highly dependable applications in the cloud

Published in: Technology, Business
  • Reduce cost, reduce complexity
  • Need to cut out more words on this slide – just tell the story!! Still need to do good EA, planning, monitoring, governance and management. Risk management approach to security, privacy. Plan for integration with existing assets. Come pick our brains at UNSW/NICTA.
  • NICTA will focus on six research groups of significant scale and focus in which we have genuine opportunity to be ranked in the top five in an area in the world. Research groups have been selected on the basis of current NICTA strengths in research and research leadership.
    – Software Systems: aims to develop game-changing techniques, frameworks and methodologies for the design of integrated, secure, reliable, performant and adaptive software architectures. Software systems has pervasive application in real-world applications ranging from enterprise ecosystems to embedded systems.
    – Networks: the networks research group will develop new theories, models and methods to support future networked applications and services. Networked systems will address issues such as radio spectrum scarcity, wired bandwidth abundance, context and content, improvements to computing, energy constraints, and data privacy.
    – Machine Learning: the science of interpreting and understanding data. The core problems are jointly statistical and computational. NICTA research will aim to develop machine learning as an engineering discipline, drawing on a spectrum of work from conceptual theory through algorithmics. Machine learning applications will aim to identify commonalities between problems, developing implementation frameworks that genuinely encourage reuse across different domains.
    – Computer Vision: aims to understand the world through images and video. NICTA will focus on areas including geometry, detection and recognition, optimisation, segmentation, scene understanding, shape/illumination and reflectance, biologically inspired approaches and the interfaces between them, drawing from approaches including statistical methods and learning and optimisation. Computer vision is a key enabling research discipline for many applications, including visual surveillance, bionic eye and mapping of the environment.
    – Control and Signal Processing: comprises a substantial group of sub-disciplines dealing with optimisation, estimation, detection, identification, behaviour modification, feedback control and stability of a very large class of dynamical systems. It is likely that NICTA will focus on problems of control and signal processing in large-scale decentralised systems, which are core to many new ICT systems. Techniques from information theory, Bayesian networks, large scale optimization etc. are employed to address this important class of problem.
    – Optimisation: the "science of better". Research will focus on the interface between constraint programming, operations research, satisfiability, search, automated reasoning, machine learning, simulation and game theory, exploring methods that combine algorithms from these different areas. Optimisation applications will address multi-faceted questions such as how best to schedule in a network, whether there is a better folding for a protein, or how best to operate a supply chain.
  • Also comment on Public vs Private, and need to prepare for Hybrid.
    – Rapid Elasticity: Elasticity is defined as the ability to scale resources both up and down as needed. To the consumer, the cloud appears to be infinite, and the consumer can purchase as much or as little computing power as they need. This is one of the essential characteristics of cloud computing in the NIST definition.
    – Measured Service: In a measured service, aspects of the cloud service are controlled and monitored by the cloud provider. This is crucial for billing, access control, resource optimization, capacity planning and other tasks.
    – On-Demand Self-Service: The on-demand and self-service aspects of cloud computing mean that a consumer can use cloud services as needed without any human interaction with the cloud provider.
    – Ubiquitous Network Access: Ubiquitous network access means that the cloud provider’s capabilities are available over the network and can be accessed through standard mechanisms by both thick and thin clients.
    – Resource Pooling: Resource pooling allows a cloud provider to serve its consumers via a multi-tenant model. Physical and virtual resources are assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).
  • We have this data from our own studies!! Ping Kevin to get our own reference...
  • We also have this sort of data ourselves!! From Australia, obviously!
  • Where does Amadeus sit? Can we identify a set of apps that’s cold standby now, and can be pushed into warm standby easily/cheaply using cloud?
  • Transcript of "WICSA 2012 tutorial"

    1. Architecting Highly Dependable Cloud Applications. Anna Liu, Len Bass. NICTA Copyright 2012. From imagination to impact.
    2. The Land Down Under
    3. Sydney
    4. About NICTA: National ICT Australia • Federal and state funded research company established in 2002 • Largest ICT research resource in Australia • National impact is an important success metric • ~700 staff/students working in 5 labs across major capital cities • 7 university partners • Providing R&D services, knowledge transfer to Australian (and global) ICT industry • NICTA technology is in over 1 billion mobile phones
    5. Research Areas at NICTA • Networks: Aruna Seneviratne • Machine Learning: Bob Williamson • Software Systems: Anna Liu, Gernot Heiser • Computer Vision: Nick Barnes, Richard Hartley, Peter Corke • Optimisation: Mark Wallace, Sylvie Thiebaux, Toby Walsh • Control & Signal Processing: Rob Evans
    6. Our team’s mission: help enterprises take full advantage as software extends into cloud! • Cost optimised • High availability • Onsite/offsite • Hybrid cloud • Real-time monitoring • Disaster recovery • Actionable analytics • Business continuity • Intelligent management • Systems resilience • Dynamic • Elastic • Real time • High performance. Our applied R&D capability spans cloud computing, web, SOA, distributed systems, data management, analytics, performance monitoring, DR, automated reasoning, ontologies, AI…
    7. Who are we? • Anna • Len
    8. Who are you? What would you like from this tutorial?
    9. Outline • Introduction – Cloud Computing Platforms – Nature and causes of outages and down-time – Characteristics of Dependability in Cloud • Achieving high dependability – The importance of stateless components – Techniques to handle performance problems – Techniques to handle availability problems – Techniques to handle security problems • Case Studies: Netflix, Yuruware • Conclusions
    10. Introduction • Intro to the cloud – xxx as a service, regions/zones • What is dependability • Why is dependability a concern in the cloud • Types of dependability and high level problem descriptions – performance – availability – security
    12. What is Cloud Computing? “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models.” (US National Institute of Standards and Technology)
    13. Characterising Cloud Computing (diagram): Measured Service, Resource Pooling, Self Service, Elasticity, Ubiquitous Network Access
    14. Five Characteristics – NIST Definition • On-demand Self-Service – A consumer can provision computing capabilities without human interaction • Broad network access – Computing capabilities are available over the network and accessed through standard mechanisms • Resource pooling – Provider’s computing resources are pooled to serve multiple consumers, with different resources dynamically assigned according to consumers’ demands • Rapid elasticity – Computing capabilities can be rapidly and elastically provisioned to quickly scale out, and rapidly released to scale in • Measured service – Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer
    15. Leading Provider: Amazon EC2. Let’s see how Amazon EC2, a leading commercial cloud, looks. “I want my cloud!”
    16. Step 1: Grab your credit card and create an account (10 min). Then you have access to a console. Step 2: Select where you want to create your virtual machines (US East, US West, Ireland or Singapore). Step 3: Hit this button.
    17. Step 4: Select a machine image • Many pre-configured images are available • You can register your machine images as well
    18. Step 5: Determine the amount of resources to allocate • <1.0 GHz CPU + 600 MB RAM → 0.01 USD/hour • 1.0 GHz CPU + 1.7 GB RAM → 0.04 USD/hour • 3.0 GHz x 8 CPUs + 68 GB RAM → 1.1 USD/hour • You can pay Win/SQL Server license fees in pay-per-hour
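To make the pay-per-hour pricing concrete, here is a small Python sketch that turns the hourly prices on this slide into a monthly figure. The instance labels, the `monthly_cost` function and the 30-day month are assumptions for illustration; the prices are the 2012 figures quoted on the slide, not current ones.

```python
# Hourly prices in USD as quoted on the slide (labels are made up here).
HOURLY_PRICE_USD = {
    "under_1ghz_600mb": 0.01,
    "1ghz_1.7gb": 0.04,
    "8x3ghz_68gb": 1.10,
}


def monthly_cost(instance_type, instances=1, hours_per_day=24, days=30):
    """Cost of running some instances for part or all of each day."""
    hourly = HOURLY_PRICE_USD[instance_type]
    return round(hourly * instances * hours_per_day * days, 2)


always_on = monthly_cost("1ghz_1.7gb")                        # 24 hours a day
office_hours = monthly_cost("1ghz_1.7gb", hours_per_day=12)   # stopped overnight
```

Stopping the instance for half of each day halves the bill, which is the motivation for switching machines off when they are not in use.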
    19. Step 6: Define a set of access control rules
    20. Step 7: Done! (< 5 minutes in total) • You have your virtual machine at ec2-184-74-14-28.us-west-1.compute.amazonaws.com. “I got my virtual machine!”
    21. Step 8: Connect to my virtual machine • Just SSH to the address • You have root access! (diagram: “You’re in an Amazon Datacenter in CA”; “This is my desktop in Sydney”)
    22. If you like Windows, just launch a Windows virtual machine and remote-desktop to it. (diagram: connected through a VPN connection; “You’re in an Amazon Datacenter in NV”; “This is my desktop in Sydney”)
    23. Step 9: Terminate or hibernate virtual machines when they are not in use • In some systems, we use a script to hibernate virtual machines at 8:00 PM • Restart instances in the morning if necessary. It takes just a couple of minutes
    24. Step 10: Check a bill in real-time • Hours to run virtual machines • Network in/out • VPN • Disk access • # of requests made • …
    25. Three Service Models – NIST definition (diagram labels: Technology exposed to customers; Providers; Software as a Service; Platform as a Service; Infrastructure as a Service; Datacenter Infrastructure)
    26. Three Delivery Models • Infrastructure as a Service (IaaS) – The consumer has control over operating systems, storage and deployed applications • Platform as a Service (PaaS) – Consumers can deploy applications created using programming languages and tools supported by the provider (e.g., Java Servlet) – The provider shields the complexity of its infrastructure: scale up/down, load balancing, replication, disaster recovery, database management, … • Software as a Service (SaaS) – Consumers use the provider’s applications – The consumer does not manage the underlying cloud infrastructure
    27. Leading Provider: Google App Engine. Let’s see how Google App Engine, a leading commercial PaaS, looks. “I want my PaaS!”
    28. Step 1: Create an account (5 min). GAE offers a large amount of quota for free. Step 2: Write an application using GAE’s framework.
    29. Step 3: Deploy your application on GAE! Scale up/down, load balancing, replication, disaster recovery, database management, … many functions are implemented by GAE’s
    30. Step 4: Check your resource usage (CPU, storage, # of API calls, …). Pay only when usage exceeds the free quota.
    31. Provider Services - 1 • Consumer is allocated some number of virtual machine instances. – Number of instances is under the control of the consumer – Provider allows consumer to set rules for “autoscaling”: automatically creating and removing instances – When a new instance is launched it has • Software as specified by either the consumer or the provider • Private IP address, available only from within the cloud. The private IP address exists for the life of the instance and will not change • Public IP address, addressable from outside the cloud. May change under certain circumstances
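An autoscaling rule of the kind mentioned on this slide can be sketched as a pure function. This is a hypothetical rule, not a provider API: the name `desired_instances`, the CPU thresholds and the instance limits are all assumptions chosen for the example.

```python
def desired_instances(current, avg_cpu_percent,
                      scale_out_above=70, scale_in_below=30,
                      minimum=1, maximum=10):
    """Return how many instances the rule asks for, given average CPU load."""
    if avg_cpu_percent > scale_out_above:
        return min(current + 1, maximum)   # create an instance
    if avg_cpu_percent < scale_in_below:
        return max(current - 1, minimum)   # remove an instance
    return current                         # load is in the comfortable band


busy = desired_instances(current=2, avg_cpu_percent=85)
quiet = desired_instances(current=2, avg_cpu_percent=10)
steady = desired_instances(current=3, avg_cpu_percent=50)
```

The gap between the two thresholds keeps the rule from creating and removing instances on every small fluctuation in load.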
    32. Provider Services – 2 • Cloud data centers – hosted in different geographic regions – Cloud provider responsible for physical security • SLAs from cloud providers are for 99.9%+ up time for the cloud. No guarantee for any individual instance • Cloud provider will replicate databases to different regions or within a region.
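As a worked example of what a 99.9% up-time SLA permits, the following Python sketch converts an availability percentage into allowed downtime. The function name is made up, and a 365-day year and 30-day month are assumed.

```python
def allowed_downtime_hours(availability_percent, period_hours):
    """Hours of downtime that still meet the stated availability."""
    return period_hours * (1 - availability_percent / 100)


HOURS_PER_YEAR = 365 * 24    # 8760
HOURS_PER_MONTH = 30 * 24    # 720

per_year = allowed_downtime_hours(99.9, HOURS_PER_YEAR)     # about 8.76 hours
per_month = allowed_downtime_hours(99.9, HOURS_PER_MONTH)   # about 43 minutes
```

Since the SLA covers the cloud as a whole and not any individual instance, an application still has to plan for its own instances failing more often than this.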
    33. Questions
    35. 35. What is dependability?• Dependability of a computing system is the ability to deliver service that can justifiably be trusted. – The service delivered by a system is its behaviour as it is perceived by its user(s); – a user is another system (physical, human) that interacts with the former at the service interface. – The function of a system is what the system is intended for, and is described by the system specification.[ A. Avizienis, J.-C. Laprie and B. Randell: Fundamental Concepts of Dependability.Research Report No 1145, LAAS-CNRS, April 2001]NICTA Copyright 2012 From imagination to impact 37
    36. 36. Parsing the definition• Dependability is relative – “justifiably be trusted”• May be different users with different expectations• Users can be systems or humans• Systems may deliver many services and dependability may be different for each serviceNICTA Copyright 2012 From imagination to impact 38
    37. 37. Dependability subsumes many other attributes [Figure: attributes subsumed by dependability]
    38. 38. Questions
    40. 40. Cloud vis-à-vis private data center• Cloud providers remove some of the problems of operating a private data center – Acquisition of physical hardware – Hiring/training data center staff – Physical security• Other problems remain basically the same – Security threats from internet connections – Separation of production/test environments – Patch installation• Other problems are new or exist in changed form – It is these other problems that we now focus on.
    41. 41. Cloud Specific Dependability Problems• Failure – Instance failure – Data failure/consistency – Operator error – Upgrade error• Performance – Latency of provisioning – Over/under provisioning – Latency of communication• Security/privacy – Credentials and keys – Multi-tenancy – Location dependency/governance• Disaster Recovery
    42. 42. Provisioning• Consumer or cloud infrastructure can launch or delete instance of virtual machine• When new instance launched it consists of – Virtual hardware with public and private IP address – Executable image – Virtual hard disk• Provisioning is important both in failure recovery and performanceNICTA Copyright 2012 From imagination to impact 44
    43. 43. Elasticity - Over or Under Provisioning• Elasticity is the defining characteristic of cloud – Traditional ‘scalability’ or ‘throughput’ measures are no longer helpful – “the ability of software to meet changing capacity demands, deploying and releasing relevant necessary resources on-demand”• There is often over or under provisioning
    45. 45. Instance Failure – recognition• Basic failure recognition mechanism is “heartbeat”.• Instance must periodically show it is still alive – Send a message – Respond to query• Must be an entity that is responsible for monitoring “aliveness” of instance – Entity can be infrastructure – Entity can be other portion of the application – Entity can be client• Failed instances are not automatically deletedNICTA Copyright 2012 From imagination to impact 47
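The heartbeat mechanism on this slide can be sketched in a few lines. This is a minimal illustration only, not any provider's actual implementation; the class name `HeartbeatMonitor`, the timeout value, and the injectable clock are all assumptions made for the example. Note that, as the slide says, a suspected instance is only reported, not deleted.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Hypothetical monitoring entity: remembers when each instance last showed it was alive."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s          # silence longer than this means "suspected failed"
        self.clock = clock                  # injectable so the example can be tested without sleeping
        self.last_seen: Dict[str, float] = {}

    def beat(self, instance_id: str) -> None:
        """Record a heartbeat (a message sent by, or a query answered by, the instance)."""
        self.last_seen[instance_id] = self.clock()

    def suspected_failed(self) -> List[str]:
        """Return instances that have been silent longer than the timeout.

        The instances are only reported; deallocating them is a separate step.
        """
        now = self.clock()
        return sorted(i for i, t in self.last_seen.items() if now - t > self.timeout_s)
```

The monitoring entity could be the infrastructure, another part of the application, or a client; the sketch does not care which one calls `suspected_failed()`.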
    46. 46. Monitoring for Pending Failure• Besides PING…• A dashboard of flashing lights• Monitoring ongoing CPU and memory utilization, disk activity, and network activity• Environmental controls, water/coolant flow, power and temperature [Photo: Akamai’s NOC in Cambridge, Massachusetts]
    47. 47. State• An instance can be stateful or stateless• A stateful instance remembers information from one message to another. State can be stored either within instance memory or on external memory device• A stateless instance must be sent necessary state associated with the message.• HTTP is a stateless protocol so every message must contain information allowing the instance to understand the context.• Recovery process is different for stateful instances than for stateless instances.
    48. 48. Stateful Recovery• Strategy depends on how much loss of computation and events can be tolerated.• Strategy - 1 – Checkpoint image periodically – On recovery, provision with checkpointed image and computation will restart from last checkpoint – Any computation and messages between last checkpoint and failure will be lost. – Assumes no state stored on external device.• Only for cloud because of checkpointing imageNICTA Copyright 2012 From imagination to impact 50
    49. 49. Stateful Recovery Strategy – 2• Periodically save important state on persistent external device.• When image is activated, it checks whether any state has been saved. If so, it reads that state and resumes computation• Any computation and messages between last checkpoint and failure will be lost• The difference from the prior strategy is that it does not assume a checkpointed image exists; state is explicitly checkpointed by the application
    50. 50. Stateful Recovery Strategy – 3• Periodically save important state on persistent external device• Log incoming messages on persistent external device• When image is activated, it checks whether any state has been saved. If so, it reads that state.• Activated image then reads log and replays activity.• No computation or messages will be lost unless there is a failure between message arrival and recording that message on the log. Acks to client will allow client to resend message if necessary.
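Strategy 3 (checkpoint plus message log with replay) can be sketched as follows. This is an illustrative toy under stated assumptions: `DurableStore` is a hypothetical in-memory stand-in for the persistent external device, and the "computation" is just a running total of integer messages.

```python
from typing import Dict, List


class DurableStore:
    """Stand-in for a persistent external device (assumed to survive instance failure)."""

    def __init__(self) -> None:
        self.checkpoint: Dict[str, int] = {"total": 0, "applied": 0}
        self.log: List[int] = []


class Counter:
    """Stateful instance whose state is a running total of the messages it received."""

    def __init__(self, store: DurableStore) -> None:
        self.store = store
        # On activation, read any saved state ...
        self.total = store.checkpoint["total"]
        applied = store.checkpoint["applied"]
        # ... then replay the messages that were logged after that checkpoint.
        for msg in store.log[applied:]:
            self.total += msg

    def receive(self, msg: int) -> None:
        self.store.log.append(msg)  # log first; only then would the client be acked
        self.total += msg

    def save_checkpoint(self) -> None:
        # Record the state together with how many logged messages it already reflects.
        self.store.checkpoint = {"total": self.total, "applied": len(self.store.log)}
```

If the instance fails after `receive(5)` but before the next checkpoint, a replacement built from the same store replays the logged message and reaches the same total. Dropping the log and keeping only the checkpoint gives Strategy 2, with the corresponding loss.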
    51. 51. Comments on Stateful recovery strategies• Only strategy 1 (provision with checkpointed image) is specific to cloud• Other strategies apply also to non-cloud environments.• Strategy 3 achieves least data loss since messages are logged and replayed upon recovery.NICTA Copyright 2012 From imagination to impact 53
    52. 52. Stateless images• If instance is stateless then – Infrastructure can send any message to any instance – Can create new instances for performance or reliability reasons. – Router/load balancer/controller is responsible for getting messages to instances [Figure: Clients, Load balancer, and Servers in the Cloud]
    53. 53. How do messages get to instances?• Two models – Push. Load balancer decides which instance should get message – Pull. Load balancer maintains queue of messages and instances retrieve messages from queue.NICTA Copyright 2012 From imagination to impact 55
    54. 54. Push Architecture Pattern [Figure: Clients, Load balancer, Monitor, Servers]
    55. 55. Push Pattern Description• Client sends a request (e.g. HTTP message) to the app in the cloud.• Request arrives at a load balancer.• Load balancer forwards request to one of the VMs.• Load balancer uses a scheduling strategy, e.g. round robin, to decide which VM gets the request.
    56. 56. Monitor• The load balancer knows – CPU utilization for each VM (through the monitor) – How many requests each VM has gotten – Possibly how long it took to service the requests• The monitor decides (based on rules) when new resources are needed
    57. 57. Failure management within Push Pattern• Monitor will recognize failure of instance through non-responsiveness.• Load Balancer will not send further messages to instance• Messages currently being processed by failed instance are lost• Client must detect message not processed (through timeout) and resend message.NICTA Copyright 2012 From imagination to impact 59
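The push pattern with round-robin scheduling and removal of a failed instance can be sketched as below. The class `PushBalancer` and the instance names are hypothetical; a real load balancer would also forward the request and rely on the monitor to call `mark_failed`.

```python
from typing import List


class PushBalancer:
    """Hypothetical push-style load balancer using round-robin scheduling."""

    def __init__(self, instances: List[str]) -> None:
        self.healthy = list(instances)
        self._next = 0  # index of the next request, used for round robin

    def mark_failed(self, instance: str) -> None:
        """Called when the monitor sees the instance is non-responsive."""
        if instance in self.healthy:
            self.healthy.remove(instance)

    def route(self) -> str:
        """Pick the instance that should receive the next request."""
        if not self.healthy:
            raise RuntimeError("no healthy instances")
        instance = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return instance
```

Requests that were already in flight on the failed instance are not recovered by this sketch; as the slide notes, the client has to time out and resend.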
    58. 58. Pull architecture pattern (aka Producer-Consumer) [Figure: Clients, Load balancer/queue manager, Monitor, Servers]
    59. 59. Pull architecture description• Each request from the client is application specific and typed.• The queue manager keeps separate queues for each application running on the VMs.• A VM requests the next message of a particular type (pull) and processes it.• When the VM has processed a message, it informs the controller to remove the message from the queue.
    60. 60. Monitor• The monitor can now see – How long a request waits in a queue – The average queue length• This is an indication of the load on the VMs that have applications that service requests of that type.• Allows better scheduling of messages to VMs.
    61. 61. Failure Management within Pull Pattern• Controller knows when message has been processed.• If message is not processed within time interval, controller can reassign it.• Failed instances will not request further messages and so take themselves out of service.• It is possible for a failed instance to recover and continue processing on a message that has been rescheduled so checks must be in place to keep a message from being double processed.NICTA Copyright 2012 From imagination to impact 63
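One way to picture the pull-side failure handling is a queue that leases each message to a worker for a limited time, reassigns it if the lease expires, and ignores a second completion of the same message. This is a sketch under assumptions: the name `PullQueue`, the lease length, and the "first completion wins" rule are illustrative choices, not a description of a specific queueing service.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Optional, Set, Tuple


class PullQueue:
    """Hypothetical controller: workers pull messages and report completion."""

    def __init__(self, lease_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.lease_s = lease_s
        self.clock = clock
        self.pending: Deque[Tuple[int, str]] = deque()
        self.in_flight: Dict[int, Tuple[str, float]] = {}  # id -> (body, lease expiry)
        self.done: Set[int] = set()
        self._next_id = 0

    def put(self, body: str) -> int:
        msg_id = self._next_id
        self._next_id += 1
        self.pending.append((msg_id, body))
        return msg_id

    def pull(self) -> Optional[Tuple[int, str]]:
        now = self.clock()
        # Messages not completed within the lease are made available again.
        for msg_id, (body, expiry) in list(self.in_flight.items()):
            if expiry <= now:
                del self.in_flight[msg_id]
                self.pending.append((msg_id, body))
        if not self.pending:
            return None
        msg_id, body = self.pending.popleft()
        self.in_flight[msg_id] = (body, now + self.lease_s)
        return msg_id, body

    def complete(self, msg_id: int) -> bool:
        """Return True for the first completion only, so results are not applied twice."""
        if msg_id in self.done:
            return False
        self.done.add(msg_id)
        self.in_flight.pop(msg_id, None)
        self.pending = deque(m for m in self.pending if m[0] != msg_id)
        return True
```

A worker that failed, recovered, and finished a message after it was reassigned would get `False` from `complete` and should discard its result.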
    62. 62. Cleaning up• When an instance fails it is not automatically deallocated• Consumer must deallocate failed instance.• When instance deallocated – Public and private IP addresses become available for reallocation – Possible to tell infrastructure that public IP address is to be assigned to replacement instance• Within AWS charging continues until instance deallocated.
    63. 63. Data Failure• Data storage can be “ephemeral” or “persistent”• Ephemeral storage disappears if instance fails• Persistent storage is maintained by cloud provider – Replicated automatically – Replicas may be geographically separated• May lead to problems with data consistencyNICTA Copyright 2012 From imagination to impact 65
    64. 64. Data Consistency• Takes time to replicate data• Means that different replicas of the data may not be instantaneously consistent• CAP Theorem. A distributed data store cannot simultaneously guarantee all three of – Consistency – Full availability – Partition tolerance (continued operation when replicas cannot communicate)• May take ½ second for data to become consistent• Most cloud providers offer “consistent reads” but at a potential cost in latency
    65. 65. Characterising Eventual Consistency in Amazon SimpleDB• The probability to read updated data in SimpleDB in US West – An application reads data X (ms) after it has written data• SimpleDB has two read operations – Eventual Consistent Read – Consistent Read• This pattern is consistent regardless of the time of day [Figure: probability of reading updated data against time since the write, for Consistent Read and Eventual Consistent Read]
    66. 66. Operator error• After trying out something in AWS, may want to go back to original state• Not always that straight-forward: – Attaching volume is no problem while the instance is running, detaching might be problematic – Creating / changing auto-scaling rules has effect on number of running instances • Cannot terminate additional instances, as the rule would create new ones! – Deleted / terminated / released resources are gone!NICTA Copyright 2012 From imagination to impact 68
    67. 67. Undo for System Operators [Figure: administrator timeline with begin-transaction, do, do, do, rollback; plus commit; plus pseudo-delete]
    68. 68. Approach [Figure, built up over three slides: administrator timeline (begin-transaction, do, do, do, rollback) above the Undo System, which senses cloud resource states, forms goal and initial states, generates a plan (set of actions), generates code, and executes it]
    71. 71. Location of instances• Amazon divides the cloud into – Regions (currently eight) • US – east (Northern Va), west (Oregon, Northern Calif), gov • Asia Pacific – Singapore, Tokyo • Europe – Ireland • South America (Sao Paulo) – Each region has some number of availability zones. • Each availability zone has distinct physical location, power sources • Communication – within availability zones is high speed, – across availability zones is lower speed, – across regions is lowest speed• Availability zones and regions can be exploited to improve availability
    72. 72. User Visible Failures• Operator error is largest cause of user visible errors in large Internet systems• Largest cause of operator error is configuration errors during upgrade – Data may be dated – Data is based on a world where monthly updates were considered frequent. Updates may be as frequent as weekly (Facebook) or even more frequently – Jan Bosch talks about “continuous deployment”. – I have not seen recent data describing sources of operator errorNICTA Copyright 2012 From imagination to impact 74
    73. 73. Upgrade Frequency• Upgrades to systems are a very common occurrence• Average release interval of some common systems – Facebook (platform): < 7 days – Google Docs: < 50 days – MediaWiki: 21 days (171 schema updates in 4.5 years) – Joomla: 30 days• This frequency suggests it is important to get the updates correct
    74. 74. Configuration parameters• Options are extensive – Hadoop – 206 – Cassandra – 36 – HBase – 64• Massive numbers of dependencies, many hidden – File path – Network address – Dynamically loaded libraries – Database schema – …NICTA Copyright 2012 From imagination to impact 76
    75. 75. Basic upgrade strategies• Rolling Upgrade – Perform upgrade one node at a time • Does not require additional resources • Allows for determination of correctness in an incremental fashion • Implies that multiple versions may be simultaneously in service • Takes time• Big flip – Perform upgrade to a cluster at a time • Keep users from accessing cluster until upgrade completed • Takes resources out of service until upgrade is completed• General industrial practice is Rolling UpgradeNICTA Copyright 2012 From imagination to impact 77
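A rolling upgrade can be sketched as a loop that upgrades one node, checks it, and only then moves on. The function name `rolling_upgrade`, the `healthy` callback, and the choice to roll back and stop at the first unhealthy node are assumptions for illustration; real deployments differ in how they check correctness and what they do on failure.

```python
from typing import Callable, Dict, List


def rolling_upgrade(
    nodes: Dict[str, int],
    new_version: int,
    healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade nodes one at a time; stop at the first node that fails its check.

    `nodes` maps node name to running version and is updated in place.
    Returns the names of the nodes that were upgraded successfully.
    """
    upgraded: List[str] = []
    for name in sorted(nodes):
        old_version = nodes[name]
        nodes[name] = new_version          # upgrade exactly one node
        if not healthy(name):              # incremental determination of correctness
            nodes[name] = old_version      # put this node back and halt the rollout
            break
        upgraded.append(name)
    return upgraded
```

Between iterations the cluster is running two versions at once, which is the condition that the mixed-version problems discussed next arise from.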
    76. 76. Potential error condition during rolling upgrade• Multiple versions are simultaneously active during rolling upgrade• Opens door to errors resulting from version incompatibility• During a single session a client can deal with multiple versions of a single component.• May result in “mixed-version” race condition• “…these race conditions occur frequently during rolling updates of large Internet systems, such as Facebook” From “To Upgrade or Not to Upgrade”
    77. 77. Mixed Version Race Condition [Figure: sequence between client (browser) and server] 1. Rolling upgrade starts on the server side. 2. Client sends initial request. 3. New version sends HTTP reply with embedded JavaScript. 4. Client issues AJAX callback. 5. Callback is handled by the old version: ERROR
    78. 78. Assumptions/Requirements for a Solution• Requirements – Clients never interact with decreasing versions. i.e. once a client interacts with version xxx, it will never interact with a version less than xxx. – Messages are balanced across all instances of an application, whether new or old versions.• Assumptions – Versions are backwards compatible. i.e. any message can be processed by the latest version without creating mixed-version race condition – Client behavior with respect to the versions with which it interacts is governed by mobile code sent to the browser from the server side.NICTA Copyright 2012 From imagination to impact 80
    79. 79. Key Ideas of Proposed Solution - 1• Consider different versions as separate endpoints for a message. Each version is www.sample.com/<version number>• Each instance knows its version number.• Client knows the largest version number with which it has interacted.NICTA Copyright 2012 From imagination to impact 81
    80. 80. Key ideas of Proposed Solution - 2• Load Balancer portion – Use a load balancer that routes messages to different endpoints – The load balancer is the entry point for messages. – Messages with /<version number> in the header are routed to an instance whose version is greater than or equal to that version number, according to the load balancing algorithm for those instances. – Messages without version information are routed according to normal load balancing• Load balancers are hierarchical – Ensure that top level is updated before used to route messages
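The version-aware routing idea can be sketched as follows. Everything here is illustrative: `route`, `Client`, the instance names, and the use of random choice as the "load balancing algorithm" are assumptions, and integer versions stand in for the `/<version number>` endpoint. The point is only that a client which remembers the largest version it has seen never gets routed to an older one.

```python
import random
from typing import Dict, Optional


def route(instances: Dict[str, int], min_version: Optional[int], rng: random.Random) -> str:
    """Pick an instance; with a version hint, only instances at or above it qualify."""
    if min_version is None:
        candidates = sorted(instances)  # no version information: normal load balancing
    else:
        candidates = sorted(n for n, v in instances.items() if v >= min_version)
    if not candidates:
        raise LookupError("no instance satisfies the requested version")
    return rng.choice(candidates)


class Client:
    """Remembers the largest version it has interacted with."""

    def __init__(self) -> None:
        self.max_version: Optional[int] = None

    def request(self, instances: Dict[str, int], rng: random.Random) -> int:
        name = route(instances, self.max_version, rng)
        version = instances[name]
        if self.max_version is None or version > self.max_version:
            self.max_version = version
        return version
```

With instances at versions 1, 1 and 2, a client may start on version 1, but once it has been served by version 2 every later request is restricted to version 2 instances.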
    82. 82. Achieving Elasticity• Elasticity means the ability to create new (virtual) resources on demand• Providers allow consumer to set up “autoscaling” rules. These rules make the demand automatic without necessity for operator manual action. – E.g. create a new instance when an existing instance is utilizing greater than 75% of CPU for more than 5 minutes.• Correct strategy for autoscaling is a matter of research because of the time it takes to create a new instance, provision it, boot it, and start an application.NICTA Copyright 2012 From imagination to impact 84
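The example rule on this slide (CPU above 75% for more than 5 minutes) can be expressed as a small check over utilization samples. This is a sketch, not a provider's rule engine; the function name `should_scale_out` and the sample format are assumptions. It looks only at the most recent unbroken run of samples above the threshold.

```python
from typing import List, Optional, Tuple


def should_scale_out(
    samples: List[Tuple[float, float]],
    threshold: float = 75.0,
    window_s: float = 300.0,
) -> bool:
    """Decide whether the rule fires.

    `samples` holds (timestamp in seconds, CPU percent) pairs, oldest first.
    The rule fires when CPU has stayed above `threshold` for more than
    `window_s` seconds up to the latest sample.
    """
    breach_start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, cpu in samples:
        last_ts = ts
        if cpu > threshold:
            if breach_start is None:
                breach_start = ts      # start of the current run above threshold
        else:
            breach_start = None        # any dip resets the run
    return breach_start is not None and last_ts is not None and last_ts - breach_start > window_s
```

Even when the rule fires promptly, the new instance is only useful after it has been created, provisioned, and booted, which is why the choice of threshold and window is described as a research question.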
    83. 83. Provisioning Latency• Small Instance – 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform with a base install of CentOS 5.3 AMI – Between 5 and 6 minutes in us-east-1c from launch to availability• Large Instance – 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform with a base install of CentOS 5.3 AMI – Between 11 and 18 minutes in us-east-1c [http://www.philchen.com/2009/04/21/how-long-does-it-take-to-launch-an-amazon-ec2-instance]
    84. 84. Provisioning Forecasting• Approaches to predict appropriate number of instances• Technique 1 (due to Sadeka Islam) – Calculate cost of having instances that are unused (overprovisioning) – Calculate cost of having requests go unsatisfied (underprovisioning) – Allocate additional instances to optimize costs under various usage scenarios• Technique 2 (due to Matthew Sladescu) – Sniff out events that might lead to surge in demand and use that to predict appropriate number of instances
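The cost trade-off behind Technique 1 can be illustrated with a toy calculation. This is not the cited author's actual method; it is a simplified sketch in which every name, the linear cost model, and the scenario format are assumptions. Each scenario is a (probability, demand) pair, and the function picks the instance count with the lowest expected cost of paying for instances plus leaving requests unsatisfied.

```python
from typing import List, Tuple


def best_instance_count(
    scenarios: List[Tuple[float, float]],
    capacity_per_instance: float,
    instance_cost: float,
    unmet_request_cost: float,
    max_instances: int,
) -> int:
    """Return the instance count that minimizes expected cost over the scenarios."""

    def expected_cost(n: int) -> float:
        total = 0.0
        for probability, demand in scenarios:
            capacity = n * capacity_per_instance
            unmet = max(0.0, demand - capacity)        # underprovisioning
            # Unused capacity is paid for implicitly through n * instance_cost.
            total += probability * (n * instance_cost + unmet * unmet_request_cost)
        return total

    return min(range(max_instances + 1), key=expected_cost)
```

With demand equally likely to be 100 or 300 requests, 100 requests per instance, and an instance cost of 1.0, a high penalty per unmet request pushes the answer toward covering the peak, while a low penalty makes it cheaper to run nothing.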
    85. 85. Latency of Communication• Measurements by Robin Meehan based on http-ping• Within EU region but across availability zones – Roundtrip to local host within cloud (control) avg = 1.0 ms – Roundtrip to public IP in same AZ avg = 1.4 ms• Out of cloud (local England facility) to within cloud – us-east = 231 ms – eu-west = 96 ms [http://smart421.wordpress.com/2011/02/15/amazon-web-services-inter-az-latency-measurements/] [http://smart421.wordpress.com/2011/01/17/which-amazon-web-services-region-should-you-use-for-your-service/]
    87. 87. Security topics• Credentials and keys• Management of credentials and keys in the cloud• Multi-tenancy• Location dependency/governanceNICTA Copyright 2012 From imagination to impact 89
    88. 88. Credentials and keys • A credential identifies you – As an individual – As having certain privileges – As having certain qualifications • Credentials are used in – Authentication (you are who you say you are) – Authorization (you have the rights to perform certain actions) – Non-repudiation (you cannot deny you did something) • A key is a magic number used in cryptography for – Encrypting/decrypting data – Digital credentialsNICTA Copyright 2012 From imagination to impact 90
    89. 89. Basic Data protection [Figure]• App outside of cloud: data unencrypted, communication encrypted• https: data is encrypted for transfer into the cloud• App inside of cloud: data unencrypted• Data is stored encrypted (by vendor)
    90. 90. What can go wrong with the Basic Data Protection?• Suppose cloud provider has to respond to subpoena for data. Your data may, potentially, be included.• Cloud provider must decrypt data to respond to subpoena.• You may wish to encrypt your data (double encryption) so that cloud provider can only provide encrypted data.• Of course, if subpoena is directed at you, you must comply with decrypted data.
    91. 91. Use of credentials• Log into app in the cloud• Attach a disk volume• Download application from a non-public location• Access particular databases.• For non-public applications, protect your credentials and your data will be protected.
    92. 92. Vulnerabilities to Credentials• Compromised inadvertently through social engineering means or carelessness• Held by disgruntled employee• Compromised through some sort of attackNICTA Copyright 2012 From imagination to impact 94
    93. 93. Goals for credential storage• Easy to do. If it is difficult to store credentials, people will avoid their use. A script can automate the provisioning of credentials but then the script needs to be protected• Possible to change in a running instance? Once an instance has been launched, can the credentials it uses be changed?• Possible to change for instances launched in the future? This issue is related to building credentials into scripts. If scripts have credentials built in then it makes it difficult to change them in the future.
    94. 94. Options for getting credentials to App in the cloud• Send credentials from client outside the cloud – HTTPS will negotiate encryption of credentials over the internet – Assumes credentials can be kept private on clients that have them. – Credentials need to be sent every time there is a new instance• Pass credentials in as a parameter during launch of instance – Credentials persist for the life of the instance so if credentials change, can re-instantiate instance – Means credentials are stored on a server, itself a vulnerability
    95. 95. More options for getting credentials to App server• Build credentials into the image – App server is instantiated from an image in the image library – Could install credentials in the image when building it – Makes it difficult to change credentials – Prevents reuse of image (or makes reusing image a very bad idea)• Keep credentials in persistent storage. – Access control list for persistent storage provides protection based on credentials – Credentials may be based on a different account
    96. 96. Conclusion with respect to credential management• No insurmountable problem• Needs to be thought through – Who has access to credentials? – Will I ever need to change credentials?
    97. 97. What is Multi-tenancy? [Figure: VM for customer 1, VM for customer 2, VM for customer 3; Hypervisor; Server; Local, Network, Storage; Data]
    98. 98. Multi Tenancy Gets More Complicated [Figure: End users; VM for customer 1, VM for customer 2, VM for customer 3; Hypervisor]
    99. 99. Multi Tenancy Means “Sharing”• Consumers share hardware – CPU – Network – Storage media• Consumers share software – Hypervisor• End users share applications – E.g. Salesforce.comNICTA Copyright 2012 From imagination to impact 101
    100. 100. What are the problems with Multi-tenancy?• Performance – other users or consumers will consume resources and, potentially, keep you from achieving your performance requirements. – Some providers allow consumers to reserve complete machines that would prevent multi-tenancy from occurring.• Security – other users could potentially break confidentiality or integrity – Provider uses isolation for security. Consumer must have trust in provider – Consumer uses encryption to protect data.NICTA Copyright 2012 From imagination to impact 102
    101. 101. Isolation assumptions• Virtual machines are isolated based on virtual memory technology and addressing scheme – Processor manufacturers have specialized hardware to support virtualization – Hypervisor introduces a new layer of privileged software that could be attacked.• Hypervisors provide facilities to isolate networks.• Disk isolation is the same as in a non-cloud environment. OSs or shared software provide facilities.NICTA Copyright 2012 From imagination to impact 103
    102. 102. Personally Identifiable Information• Personally identifiable (US NIST) – Information which can be used to distinguish or trace an individual’s identity, such as their name, social security number, biometric records, etc. alone, or when combined with other personal or identifying information which is linked or linkable to a specific individual, such as date and place of birth, mother’s maiden name, etc.• Personal data (EU) – ‘personal data’ shall mean any information relating to an identified or identifiable natural person (‘data subject’); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity
    103. 103. Location dependency/governance• Some jurisdictions require that personal information for their jurisdiction is not stored outside of the jurisdiction – The EU requires that personal information can leave the EU only for locations that have equivalent privacy guarantees – Australia has a similar policy – “If offshore cloud compromises your data, we’ll sue you, not them”, Victoria Privacy Commissioner• Some jurisdictions claim rights to access any data stored within their borders – US Patriot Act gives US government right to examine any data stored in the US.
    104. 104. What does this mean in the cloud?• Knowing location of data centers – Amazon provides locations of their data centers – Google does not• Does this mean just use Amazon data center in region compliant with your requirements? – Not so fast! – Back up locations may be chosen by provider. Could be anywhere – A complicated problem is to control back up location based on data content.• Amazon does have a gov region that almost certainly complies with US government regulationsNICTA Copyright 2012 From imagination to impact 106
    105. 105. Use tokens as a replacement for PII• A token is an identifier that has no mathematical mapping to the individual being identified – E.g. number people in tutorial arbitrarily – Your number becomes a unique identifier for your PII stored in the cloud – I keep mapping between you and your token privately according to jurisdictional lawsNICTA Copyright 2012 From imagination to impact 107
    106. 106. Example of token use• Original data – John Doe – Sensitive information• Token table (kept locally to conform to privacy laws) – John Doe – Token for John Doe• Data stored in cloud – Token – Sensitive information• Take join of token table and data table in cloud and the original data is restoredNICTA Copyright 2012 From imagination to impact 108
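The token example can be sketched as two small functions: one that splits the original records into a local token table and the rows sent to the cloud, and one that joins them back. The function names, the use of Python's `secrets.token_hex` to generate tokens, and the sample records are assumptions for illustration; any identifier with no mathematical relationship to the person would serve.

```python
import secrets
from typing import Dict, List, Tuple


def tokenize(records: List[Tuple[str, str]]) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Split (name, sensitive information) records into a local table and cloud rows."""
    token_table: Dict[str, str] = {}          # token -> name, kept locally
    cloud_rows: List[Tuple[str, str]] = []    # (token, sensitive information), stored in cloud
    for name, sensitive in records:
        token = secrets.token_hex(8)          # random, so it reveals nothing about the name
        token_table[token] = name
        cloud_rows.append((token, sensitive))
    return token_table, cloud_rows


def restore(token_table: Dict[str, str], cloud_rows: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Join the local token table with the cloud rows to recover the original data."""
    return [(token_table[token], sensitive) for token, sensitive in cloud_rows]
```

Only `cloud_rows` leaves the jurisdiction; without the token table, the rows cannot be tied back to a named individual by the identifier alone.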
    107. 107. How about jurisdictional problem?• Tokens – Technique for decoupling PII from identifier. – Adds a level of indirection and protects that level locally• Does this solve jurisdictional problems? – I don’t know – PerspecSys says it does [http://www.perspecsys.com/how-we-help/data-residency/]
    108. 108. Questions
    110. 110. Netflix Corporation• Launched in 1998 after founder was irritated at having to pay late fees on a DVD rental.• DVD Model – Pay monthly membership fee that includes rentals, shipping and no late fees – Maintain online queue of desired rentals – When return last rental (depending on service plan), next item in queue is mailed to you together with a return envelope.• Customers rate movies and Netflix recommends based on your preferencesNICTA Copyright 2012 From imagination to impact
    111. 111. Streaming video - 1• Streaming video service introduced in 2007• Customers can watch Netflix streaming video on a wide variety of devices many of which feed into a TV – Roku set top box – Blu-ray disc players – Xbox 360 – TV directly – PlayStation 3 – …• Customers can stop and restart video at will. Netflix calls these locations in the films “bookmarks”.
    112. 112. Streaming video - 2• Initially, one hour of streaming video was available to customers for every dollar they spent on their plan• In Jan, 2008, every customer was entitled to unlimited streaming video.• In Nov, 2011 Netflix changed billing model to have separate charges for DVDs and streamingNICTA Copyright 2012 From imagination to impact
    113. 113. Internet statistics• In May, 2011, Netflix streaming video accounted for 22% of all internet traffic. 30% of traffic during peak usage hours.• Three bandwidth tiers – Continuous bandwidth to the client of 5 Mbit/s. HDTV, surround sound – Continuous bandwidth to the client of 3Mbit/s – better than DVD – Continuous bandwidth to the client of 1.5Mbit/s – DVD qualityNICTA Copyright 2012 From imagination to impact 115
    114. 114. Netflix’s move to the cloud• In late 2008, Netflix had a single data center with Oracle as the main database system.• With the growth of subscriptions and streaming video, it was clear that they would soon outgrow the data center.• Two options: – Build more data centers – Use the cloud• Netflix chose the Amazon EC2 platform
    115. 115. Why EC2?• Four reasons cited by Netflix for moving to the cloud 1. Every layer of the software stack needed to scale horizontally, be more reliable, redundant, and fault tolerant. This leads to reason #2 2. Outsourcing data center infrastructure to Amazon allowed Netflix engineers to focus on building and improving their business. 3. Netflix is not very good at predicting customer growth or device engagement. They underestimated their growth rate. The cloud supports rapid scaling. 4. Cloud computing is the future. This will help Netflix with recruiting engineers who are interested in honing their skills, and will help scale the business. It will also ensure competition among cloud providers helping to keep costs down.• Why Amazon and EC2? In 2008, Amazon was the leading supplier. Netflix wanted an IaaS so they could focus on their core competencies.NICTA Copyright 2012 From imagination to impact
    116. 116. Netflix applications• Video ratings, reviews, and recommendations• Video streaming• User registration, log-in• Video queues• Billing• DVD disc management – inventory and shipping• Video metadata management – movie cast information
    117. 117. Netflix Reliability• Deep service dependency hierarchy• 1 billion incoming calls/day• Across 1000s of instances• Intermittent failure guaranteedNICTA Copyright 2012 From imagination to impact 119
    118. 118. Approach to detecting faults• Fast network timeouts and retries• Separate threads on per-dependency thread pools• Semaphores instead of threads for services that do not perform network calls• Circuit breaker – Service calls are decorated with code to test whether service is failing too often
    119. 119. If failure detected• Custom fallback – Each service has specific fallback plan• Fail silent – Service returns a null value and invoking service knows it has failed• API should be able to show what is happening now, in real time, not from some past time. Dashboard shown to operator has red/yellow/green lights for important servicesNICTA Copyright 2012 From imagination to impact 121
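The circuit breaker and fallback ideas can be sketched together. This is a generic illustration of the pattern, not Netflix's code; the class name `CircuitBreaker`, the failure threshold, and the reset interval are assumptions. After too many consecutive failures the breaker "opens" and calls go straight to the fallback until the reset interval has passed, at which point one trial call is allowed through.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical wrapper that stops calling a service that is failing too often."""

    def __init__(self, max_failures: int, reset_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_s = reset_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_s:
                return fallback()           # open: do not touch the failing service
            self.opened_at = None           # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()               # custom fallback (or a null value for fail silent)
        self.failures = 0                   # success closes the breaker
        return result
```

Passing `lambda: None` as the fallback corresponds to the fail-silent option; a service-specific fallback corresponds to the custom fallback option.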
    120. 120. Netflix test suite - 1 • Netflix has a variety of test programs they call the Simian Army. These programs include – Chaos monkey. Randomly kill a process and monitor the effect. – Latency monkey. Randomly introduce latency and monitor the effect. – Doctor monkey. The Doctor Monkey taps into health checks that run on each instance as well as monitors other external signs of health (e.g. CPU load) to detect unhealthy instances. – Janitor Monkey. The Janitor Monkey ensures that the Netflix cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them.NICTA Copyright 2012 From imagination to impact
    121. 121. Netflix test suite - 2
        – Conformity Monkey. Finds instances that don't adhere to best practices and shuts them down. For example, if an instance does not belong to an auto-scaling group, that is a potential problem.
        – Security Monkey. An extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all SSL and DRM certificates are valid and are not coming up for renewal.
        – 10-18 Monkey (Localization-Internationalization). Detects configuration and run time problems in instances serving customers in multiple geographic regions, using different languages and character sets. The name comes from the abbreviations L10n and I18n, where 10 and 18 are the numbers of letters omitted between the first and last letters of "localization" and "internationalization".
    122. 122. Performance
      • Create a new auto-scaling group for each new version of code
        – Copy entire configuration to the new group
        – Test behaviour under load by squeezing traffic in production to a smaller set of servers or generating artificial load against a single server
    123. 123. SmugMug
      • Photo sharing site
      • Survived the April 2011 AWS outage
      • Recommendations
        – Spread across as many availability zones as possible
        – Spread across regions if possible
        – Build for failure (like Chaos Monkey)
        – Understand how components fail (yours and the cloud provider's services)
    124. 124. Others
      • Bizo – Use circuit breakers. Assume services will fail, cache data and monitor extensively to detect failure.
      • SimpleGeo – share nothing, redundancy, automated failover, automated replication
      • Twilio
        – Unit of failure is a single host
          • Simple services, replicable
        – Short timeouts and quick retries
        – Idempotent service interfaces (stateless)
        – Relax consistency requirements
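Quick retries are only safe when the service interface is idempotent, because the caller cannot tell a lost request from a lost reply. The sketch below illustrates that pairing. `call_with_retries` and `FlakyStore` are hypothetical names, and the in-memory store stands in for a remote service.

```python
import time


def call_with_retries(operation, attempts=3, backoff=0.05, sleep=time.sleep):
    """Retry operation() on TimeoutError; only safe if operation is idempotent."""
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(backoff * 2 ** attempt)  # short, growing pause between tries


class FlakyStore:
    """Stand-in remote service whose first reply is lost after the write lands."""

    def __init__(self):
        self.data = {}
        self.calls = 0

    def put(self, key, value):
        self.calls += 1
        self.data[key] = value  # idempotent: repeating the call changes nothing
        if self.calls == 1:
            raise TimeoutError("reply lost")
        return value


store = FlakyStore()
result = call_with_retries(lambda: store.put("status", "shipped"),
                           sleep=lambda seconds: None)
```

Had `put` been an increment instead of an overwrite, the retry would have applied the change twice.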
    126. 126. Enterprise DR under pressure?
      Issues…
      • DR requirement is growing, driven by (a) changing customer expectations, and associated reputational risks; (b) Government & industry regulations
      • Infrastructure for DR is expensive: sophisticated DR is only affordable for a small % of applications; forces compromises/prioritisation
      • Confidence in initiating a recovery is often less than it should be (too long, too much loss), uncertain integrity
      • DR solutions are often too 'local', insufficiently resilient
      • Enterprise IT is becoming more complex
      [Figure: Good DR is only affordable for a few applications. Tiers: Good DR coverage (higher priority applications); Limited coverage; No cover]
      Cost of DR is increasing…
      • Improving business continuity (BC) and DR is the 2nd highest priority for enterprises for 2010/2011
      • BC/DR typically claims 6-7% of total IT budget
      • 32% of enterprises plan to increase spending on BC/DR by at least 5% in 2010/2011
      Source: Forrester global survey, 2,803 IT decision-makers, Sept 2010
      Hypothesis: We can use cloud to extend DR at 1/10th cost.
    127. 127. Using Cloud for Business Continuity
      • Two main usages of cloud for Business Continuity:
        – Provides highly available systems for day-to-day business
        – Serves as a technology platform to implement disaster recovery
      • Some definitions:
        – Business Continuity: "Activity performed by an organisation to ensure that critical business functions will be available to customers, suppliers, regulators and other entities…"
        – Disaster Recovery: "A small subset of business continuity. The process, policies and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organisation after a natural or human-induced disaster"
        – Fault Tolerance: "The property that enables a system to continue operating properly, possibly at a reduced quality level…"
    128. 128. Building Highly Reliable Systems with Cloud
      • Must address potential failures at two levels:
        – Hardware/Infrastructure
          • Prevent a Single Point of Failure (SPOF) by adding redundancy in all hardware components (i.e., redundant disks, redundant network devices, redundant power supply, etc.)
          • NOT all cloud providers provide 100% availability. Check your SLA!!
        – Application
          • Prepare a fail-over system to take over in case of a failure
          • Replicate databases to minimise downtime and loss of data
          • Replicate to a geographically different location (e.g., to avoid natural disasters such as floods)
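When checking an SLA, it helps to convert an availability percentage into hours of permitted downtime, and to see what redundancy buys. The figures below are illustrative and not taken from any provider's SLA. The redundancy formula assumes replicas fail independently, which correlated outages (for example a whole zone failing) can violate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability):
    """Hours of downtime per year allowed by an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability, replicas):
    """Availability of N redundant replicas, assuming independent failures."""
    return 1.0 - (1.0 - availability) ** replicas


# An illustrative 99.95% SLA still permits about 4.38 hours of downtime a year.
sla_downtime = downtime_hours_per_year(0.9995)

# Two independent replicas at 99% each give roughly 99.99% combined.
two_replicas = combined_availability(0.99, 2)
```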
    129. 129. DR As A Service – Requirements
      • Cost-effective DR-As-A-Service is essential to get the DR solution deployed
      • Deep architectural expertise does not exist in many businesses
      • Needs solutions that achieve dependability and that
        • Are non-intrusive at runtime
        • Do not require changes to application architecture
        • Work across platforms
        • Are cheaper and easier to use than current state of practice
    130. 130. Case Study: Building Reliable System using EC2
      • The highly replicated architecture of clouds makes them great foundations for business continuity solutions
      • Their globally distributed nature further enhances the disaster recovery capability of cloud
      • Availability limitations mean we need to be realistic about Hot vs Warm vs Cold standby options
      [Figure, configuration 1: Auto Scaling Rule (Minimum Size = 1, Availability Zones = A, B, C) creates an EC2 Instance and allocates an Elastic IP address xxx.xxx.xxx.xxx; Availability Zones A, B, C]
      [Figure, configuration 2: Request from Clients goes to an Elastic Load Balancer (Availability Zones = A, B, C), which forwards the request to EC2 Instances managed by an Auto Scaling Rule (Minimum Size = 2, Availability Zones = A, B, C); Availability Zones A, B, C]
    131. 131. Case Study: Building Reliable System using EC2 (Contd)
      • Data backup in AWS
        – Amazon S3 is best for off-site data backup
          • Stores large binary files
          • Designed to provide 99.999999999% durability
          • Objects are redundantly stored in multiple facilities in a Region
        – Back up using EBS
          • Uses a regular file system
          • Takes an image (or snapshot) of the partition
        – VM Import
          • Allows for easy replication from on-premise to cloud
          • Not trivial to replicate various configurations, such as network configuration and disk drives
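To get a feel for what 99.999999999% durability means, here is a small worked calculation. It assumes the figure is an annual, per-object probability of survival, which the slide itself does not state. The function name is hypothetical.

```python
from fractions import Fraction

# Eleven nines, kept as an exact fraction to avoid floating-point rounding.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count, durability=DURABILITY):
    """Expected number of objects lost per year under the assumption above."""
    return object_count * (1 - durability)


# Storing 10,000 objects: expected loss of one object every 10 million years.
losses = expected_annual_losses(10_000)
years_per_lost_object = 1 / losses
```

Durability describes data loss only. It says nothing about availability, so a durable object can still be temporarily unreachable during an outage.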
    132. 132. The Business Opportunity
      • Hot Standby
        – Run transactions on multiple sites but use only one
        – Mirror data via dedicated high speed network (e.g., SANs)
      • Warm Standby
        – Regularly backup app/data in a backup site
        – Launch systems upon a disaster
      • Cold Standby
        – Ship backup to offsite
        – Hardware is not already set up
        – Recover systems after disaster
      [Figure: cost versus downtime for Traditional DR and Cloud DR. Downtime axis labels: seconds (auto failover); minutes – hours (auto failover, minimum data loss); few hours – days (manual failover, few data …); few days – weeks (large data loss). Annotations: "always-on" costs in cloud. Also, very hot one is not feasible. Cost of warm and cold is comparable.]
    133. 133. Yuruware Bolt
    134. 134. Questions
    135. 135. Conclusions
      • Cloud Computing brings unique dependability challenges
        • Latency across the global links
        • Full automation means errors propagate faster than ever
        • Multi-tenancy issues
      • Many traditional dependability patterns would work, but need some new techniques in the Cloud-era
        • Traditional Patterns: stateless, etc
        • Upgrade, undo/redo
        • Simian armies, DR-As-A-Service
    140. 140. Q&A
      Thank You!
      Research study opportunities in dependable cloud computing:
      • Software Architecture
      • Data Management
      • Performance Engineering
      • Autonomic Computing
      To find out more, send your CV and undergraduate details to students@nicta.com.au