Scalability and Reliability in the Cloud


From AT&T Bootstrap Week: This session focuses on architecture and design concepts to ensure scalability and maximize reliability for server-based applications running in the cloud environment. The session will discuss techniques to consider for achieving scalability and reliability and tradeoffs to consider such as time vs. cost based on the needs for different types of applications.

Published in: Technology

  2. About This Session
     - Target audience: backend application developers deploying infrastructure into a cloud environment.
     - Covers concepts for scalability and reliability, with the goal of helping application developers understand some key considerations when designing and building the backend.
  3. Design Time Decisions
     When first building your application backend, consider a few important questions:
     - How fast should the application be recovered if a failure occurs?
     - How much downtime is acceptable?
     - Is the application maintaining stateful data?
     - What kind of information needs to be shared across multiple instances?
  4. Scalability
  5. What is Scalability?
     Scalability describes how well the application handles increased traffic volume.
  6. Scalability – Factors to Consider
     - Horizontal vs. vertical
     - Stateless vs. stateful
     - Understanding limitations
     - Connection management
     - Segmentation of traffic
     - Segmentation of responsibility (distributed architecture)
     - Clustering
     - Messaging
  7. What Type of Scalability? Vertical vs. Horizontal
     Vertical:
     - Scaling up a single node
     - Physical limitations: instances are very powerful but still have finite limits
     - Resources such as the number of sockets can only go so high
     Horizontal:
     - Scaling out across multiple nodes
     - Ability to distribute traffic over a number of nodes
     - Allows for more flexibility over time
  8. Will the App Maintain State? Stateless Applications
     - The application does not persist information about transactions.
     - Each transaction is independent and atomic.
     [Diagram: Request -> Application -> Response]
  9. Will the App Maintain State? Stateful Applications
     - The application needs to maintain data about transactions in progress.
     - Requires storage.
     - Persistence may also be required, depending on the …
     [Diagram: First Request and Subsequent Request -> Application -> DB]
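
One common way to handle the stateful case above while still scaling out is to keep in-progress data in storage shared by all instances, so that a subsequent request can land on any node. The sketch below only illustrates that idea and is not taken from the slides: `SharedStore` is a hypothetical in-memory stand-in for a real database or cache, and `AppInstance` and the cart example are assumed names.

```python
class SharedStore:
    """Hypothetical stand-in for external storage shared by every instance."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class AppInstance:
    """An application node that keeps no transaction state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, item):
        # Load the in-progress state, update it, and write it back,
        # so the next request may be served by a different instance.
        cart = list(self.store.get(session_id, []))
        cart.append(item)
        self.store.put(session_id, cart)
        return cart


store = SharedStore()
node_a = AppInstance("a", store)
node_b = AppInstance("b", store)

node_a.handle("session-1", "book")          # first request hits node a
result = node_b.handle("session-1", "pen")  # subsequent request hits node b
```

Because the state lives outside the instances, either node can be removed or replaced without losing the transaction in progress, at the cost of a storage round trip per request.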
  10. Understanding Limitations
      - Thorough testing is key to understanding bottlenecks.
      - Test real-world scenarios, including latency.
      - Push the system to the max to understand how it …
  11. Connection Management: Mobile Device Connections
      Mobile devices don't always behave the way you expect:
      - Connectivity is often very dynamic.
      - Devices move between 4G, 3G, 2G, no signal, and Wi-Fi.
      - Not all TCP events get reported, so sockets can remain open.
      If not handled correctly, these factors can be a time bomb no matter how far you scale a component vertically.
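
Since a socket can stay open after the device behind it has silently gone away, one possible mitigation is an application-level idle timeout. The following is a minimal sketch of that idea, not the presenter's implementation; `ConnectionTracker`, `touch`, `reap_idle`, and the 30-second timeout are all assumptions made for illustration.

```python
import time


class ConnectionTracker:
    """Tracks the last activity time of each connection and finds stale ones."""

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}

    def touch(self, conn_id):
        """Record activity (data or a heartbeat) on a connection."""
        self._last_seen[conn_id] = self._clock()

    def reap_idle(self):
        """Forget connections idle longer than the timeout and return their ids."""
        now = self._clock()
        stale = [c for c, seen in self._last_seen.items()
                 if now - seen > self.idle_timeout_s]
        for conn_id in stale:
            del self._last_seen[conn_id]
        return stale


# Usage with a fake clock: "phone-1" goes quiet, "phone-2" stays active.
fake_now = [0.0]
tracker = ConnectionTracker(idle_timeout_s=30, clock=lambda: fake_now[0])
tracker.touch("phone-1")
fake_now[0] = 20.0
tracker.touch("phone-2")
fake_now[0] = 40.0
stale_ids = tracker.reap_idle()
```

In a real server the caller would close the sockets whose ids are returned; the tracker itself only decides which connections have been quiet for too long.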
  12. Segmenting Traffic
      Once the application can be scaled out, traffic can be segmented in different ways:
      - Location (e.g., east coast vs. west coast)
      - Pre-assigned criteria: user ID, IP, or other dynamic criteria
      - Load balanced
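
As an illustration of segmenting by a pre-assigned criterion such as user ID, the sketch below hashes the ID and maps it onto a fixed list of nodes. This is one simple approach, assumed here for illustration; the function name `pick_node` and the node names are hypothetical.

```python
import hashlib


def pick_node(user_id: str, nodes: list[str]) -> str:
    """Map a user ID to one node, the same node every time for a given list."""
    # A cryptographic hash is used only for its stable, well-spread output;
    # Python's built-in hash() is randomized per process for strings.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


nodes = ["east-1", "east-2", "west-1"]
first = pick_node("user-42", nodes)
second = pick_node("user-42", nodes)   # same user, same node
```

A plain modulo mapping like this reassigns most users when the node list changes size; techniques such as consistent hashing reduce that churn if nodes are added or removed often.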
  13. Segmenting Responsibility
      Segmenting responsibility allows for a distributed architecture:
      - Each component can be scaled independently.
      - Allows for more flexibility in scaling.
      - Adds more complexity and potential messaging overhead.
  14. Clustering
      Clustering is the concept of having a group of nodes working together to provide the same capability:
      - Nodes are typically co-located.
      - Common data is shared as needed across the cluster.
      - Communication may be needed between nodes.
      [Diagram: four App Nodes connected to Shared Data]
  15. Messaging
      Once a clustered and/or distributed architecture is used, messaging will be needed between various components and/or nodes.
      Types of messaging:
      - JMS
      - Open source MQ packages
      - Custom designed
      - Use of APIs
  16. Example of Scaled Architecture
      [Diagram: two sites (Site 1 and Site 2), each with load balancers in front of web servers, Component 1 and Component 2 instances, and a database]
  17. Reliability/Availability
  18. What is Reliability/Availability?
      - Availability is typically measured by the amount of downtime your application has in a given year. Unplanned downtime and planned downtime are both counted.
      - Reliability is described by the likelihood of failure, based on actual measurements.
      - We'll focus more on availability.
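
To make the "downtime in a given year" measure concrete, an availability target expressed as a percentage can be converted into an annual downtime budget. The helper below is a small sketch under the assumption of a 365-day year; the function name is hypothetical and the percentages are example inputs, not figures from the session.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year the application may be down at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% available -> about {hours:.2f} hours of downtime per year")
```

For example, 99.9% availability leaves roughly 8.76 hours per year for planned and unplanned downtime combined, which is why each additional "nine" tends to raise cost sharply.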
  19. Reliability/Availability: Factors to Consider
      - Cost vs. need
      - Problem detection
      - Automation for recovery
      - Active/standby, active/active, hot standby vs. cold standby
      - Local and geo-redundancy
      - Multi-zone, multi-cloud
      - Test until you break the system
  20. Reliability Requirements
      Cost considerations:
      - Number of instances
      - Bandwidth requirements between sites
      - Complexity of software
      - Monitoring
      Need:
      - User experience
      - Customer requirements
      - Negative publicity
  21. Problem Detection
      Effective monitoring of the application is key to minimizing downtime:
      - Event reporting in the software
      - External monitoring: test for successful behavior
      - Auto-detection and alerting to minimize the cost of operations personnel
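
The external-monitoring and auto-detection points above might look like the following in code: a monitor runs a probe that tests for successful behavior and raises an alert only after several consecutive failures, so a single blip does not page anyone. This is a sketch under assumptions; `HealthMonitor`, the `probe` callable, and the threshold of 3 are illustrative choices, not part of the talk.

```python
from typing import Callable


class HealthMonitor:
    """Alerts once when a probe has failed a set number of times in a row."""

    def __init__(self, probe: Callable[[], bool], failure_threshold: int = 3):
        self.probe = probe                      # returns True when healthy
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self) -> bool:
        """Run one probe; return True exactly when an alert should fire."""
        try:
            healthy = self.probe()
        except Exception:
            healthy = False                     # a crashing probe counts as a failure
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.failure_threshold


# Usage with canned probe results standing in for real health checks.
results = iter([True, False, False, False, False, True])
monitor = HealthMonitor(lambda: next(results), failure_threshold=3)
alerts = [monitor.check() for _ in range(6)]
```

In practice the probe would exercise the application the way a client does (for example, an end-to-end request), and `check` would be called on a timer.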
  22. Automation for Recovery
      The faster a failed component recovers, the more reliable the application is:
      - Automatic detection and automatic recovery
      - Automated installation is key to minimizing setup time during recovery
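
A very small sketch of automatic recovery, assuming the component can be restarted by calling a function and checked with another: the loop below restarts the component a bounded number of times and reports which attempt succeeded. The names `recover`, `start`, and `is_healthy` are hypothetical, and a production supervisor would also add delays between attempts and escalate to a human when it gives up.

```python
from typing import Callable, Optional


def recover(start: Callable[[], None],
            is_healthy: Callable[[], bool],
            max_attempts: int = 3) -> Optional[int]:
    """Restart a component until it is healthy; return the attempt number or None."""
    for attempt in range(1, max_attempts + 1):
        start()                      # e.g. run the automated install/start script
        if is_healthy():
            return attempt
    return None                      # recovery failed; escalate


# Usage: a fake component that only comes up on the second start.
state = {"starts": 0}


def fake_start() -> None:
    state["starts"] += 1


def fake_is_healthy() -> bool:
    return state["starts"] >= 2


attempts_needed = recover(fake_start, fake_is_healthy)
```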
  23. Availability Models
      - N = number of nodes required for normal processing
      - N+1 = one additional node to provide redundancy in case of failure
      - N+K = K nodes provide additional redundancy
      [Diagram: node groups illustrating N, N+1, and N+K]
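
The N, N+1, and N+K models can be worked through with simple arithmetic: size N from peak load, add K spares, and check how many simultaneous node failures the deployment tolerates. The numbers and function names below are assumptions chosen for illustration.

```python
import math


def nodes_for_load(peak_load: float, per_node_capacity: float) -> int:
    """N: the number of nodes required for normal processing at peak load."""
    return math.ceil(peak_load / per_node_capacity)


def can_survive(total_nodes: int, n_required: int, failures: int) -> bool:
    """True if enough nodes remain after the given number of failures."""
    return total_nodes - failures >= n_required


# Assumed example: 900 requests/s at peak, 250 requests/s per node.
n = nodes_for_load(peak_load=900, per_node_capacity=250)   # N
n_plus_1 = n + 1                                            # N+1
n_plus_k = n + 2                                            # N+K with K = 2

one_failure_ok = can_survive(n_plus_1, n, failures=1)
two_failures_ok = can_survive(n_plus_1, n, failures=2)
```

With these inputs N is 4, so an N+1 deployment of 5 nodes still handles peak load after one failure but not after two, while N+K with K = 2 covers both.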
  24. Redundancy Models
      - Active/Cold Standby: the backup site is booted up when needed.
      - Active/Hot Standby: the backup site is running and ready to take over.
      - Active/Active: both sites are active and processing traffic.
  25. Local and Geo-Redundancy
      Local:
      - Backup instances are available within the same location.
      - Using availability zones within a region is very similar.
      Geographic:
      - Backup instances are available in another geographic location.
      - Typically in a separate region, to account for events such as natural disasters.
  26. Availability to the Max
      Multi-Zone/Multi-Region:
      - Multi-zone typically provides instances running in different physical locations, but in the same region.
      - Multi-region provides availability across different geographic regions.
      Multi-Cloud:
      - For applications that require the maximum possible availability.
      - Run in different cloud providers in different regions.
  27. Test Until You Break the System
      - Push the system to the max and observe the breaking points.
      - Fix the problem, then repeat.
      - The best way to find the problems that cause unplanned downtime is to test thoroughly with a mindset to break the system.
  28. Q&A
  29. THANK YOU! Greg