Scalability and Reliability in the Cloud


From AT&T Bootstrap Week: This session focuses on architecture and design concepts to ensure scalability and maximize reliability for server-based applications running in a cloud environment. It discusses techniques for achieving scalability and reliability, along with tradeoffs such as time vs. cost, based on the needs of different types of applications.




Usage Rights

© All Rights Reserved


Presentation Transcript

  • About This Session
      • Target audience: backend application developers deploying infrastructure into a cloud environment
      • Covers concepts for scalability and reliability, with the goal of helping application developers understand key considerations when designing and building the backend
  • Design Time Decisions: when first building your application backend, consider a few important questions
      • How quickly must the application recover if a failure occurs?
      • How much downtime is acceptable?
      • Is the application maintaining stateful data?
      • What kind of information needs to be shared across multiple instances?
  • Scalability
  • What is Scalability? Scalability describes how the application handles increased traffic volume
  • Scalability: Factors to Consider
      • Horizontal vs. vertical
      • Stateless vs. stateful
      • Understanding limitations
      • Connection management
      • Segmentation of traffic
      • Segmentation of responsibility (distributed architecture)
      • Clustering
      • Messaging
  • What Type of Scalability? Vertical vs. Horizontal
      • Vertical
          • Scaling up a single node
          • Physical limitations: instances are very powerful but still have finite limits
          • Resources such as the number of sockets can only go so high
      • Horizontal
          • Scaling out across multiple nodes
          • Ability to distribute traffic over a number of nodes
          • Allows for more flexibility over time
  • Will the App Maintain State? Stateless Applications
      • Application does not persist information about transactions
      • Each transaction is independent and atomic
      • [Diagram: Request → Application → Response]
  • Will the App Maintain State? Stateful Applications
      • Application needs to maintain data about transactions in progress
      • Requires storage
      • Persistence may also be required, depending on the ...
      • [Diagram: First Request and Subsequent Request → Application → DB]
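The difference between the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not anything from the presentation itself: `handle_stateless`, `handle_stateful`, and `SessionStore` are hypothetical names, and the in-memory dictionary only stands in for the shared storage (a database or cache) that a real stateful backend would need.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def handle_stateless(amount_cents: int, tax_rate: float) -> int:
    """Stateless: everything needed to answer arrives with the request."""
    return round(amount_cents * (1 + tax_rate))


@dataclass
class SessionStore:
    """Hypothetical stand-in for shared storage (a database or cache)."""
    _data: Dict[str, List[str]] = field(default_factory=dict)

    def append(self, session_id: str, item: str) -> None:
        self._data.setdefault(session_id, []).append(item)

    def get(self, session_id: str) -> List[str]:
        # Return a copy so callers cannot mutate stored state by accident.
        return list(self._data.get(session_id, []))


def handle_stateful(store: SessionStore, session_id: str, item: str) -> List[str]:
    """Stateful: a subsequent request depends on data from earlier requests."""
    store.append(session_id, item)
    return store.get(session_id)


# Because the state lives in the store rather than in the handler, the
# first and subsequent requests may be served by different instances.
shared = SessionStore()
first = handle_stateful(shared, "session-1", "book")
second = handle_stateful(shared, "session-1", "pen")
```

Keeping the state outside the application instance is what allows a stateful application to be scaled out across several nodes.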
  • Understanding Limitations
      • Thorough testing is key to understanding bottlenecks
      • Test real-world scenarios, including latency
      • Push the system to the max to understand how it ...
  • Connection Management: Mobile Device Connections
      • Mobile devices don't always behave as you expect
          • Connectivity is often very dynamic
          • Devices move between 4G, 3G, 2G, no signal, and Wi-Fi
          • Not all TCP events get reported, and sockets can remain open
      • If not handled correctly, these factors can be a time bomb no matter how far you scale a component vertically
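One common way to defuse the "sockets can remain open" problem is to track when each connection was last heard from and close the ones that have gone quiet. The sketch below assumes the server calls `touch` on every message or heartbeat; `ConnectionTracker` and the 30-second timeout are illustrative choices, not part of the presentation.

```python
import time
from typing import Dict, List, Optional


class ConnectionTracker:
    """Tracks last activity per connection and reaps idle ones.

    Hypothetical helper: a real server would also close the underlying
    socket for every connection ID returned by reap().
    """

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout
        self._last_seen: Dict[str, float] = {}

    def touch(self, conn_id: str, now: Optional[float] = None) -> None:
        """Record activity (data or heartbeat) on a connection."""
        self._last_seen[conn_id] = time.monotonic() if now is None else now

    def reap(self, now: Optional[float] = None) -> List[str]:
        """Forget and return connections idle for longer than the timeout."""
        current = time.monotonic() if now is None else now
        stale = [
            conn_id
            for conn_id, seen in self._last_seen.items()
            if current - seen > self.idle_timeout
        ]
        for conn_id in stale:
            del self._last_seen[conn_id]
        return stale

    def open_count(self) -> int:
        return len(self._last_seen)


tracker = ConnectionTracker(idle_timeout=30.0)
tracker.touch("phone-a", now=0.0)    # last heard from at t=0
tracker.touch("phone-b", now=20.0)   # last heard from at t=20
reaped = tracker.reap(now=40.0)      # phone-a has been silent for 40 s
```

The explicit `now` parameter exists only to make the example deterministic; in normal use the tracker falls back to `time.monotonic()`.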
  • Segmenting Traffic: once the application can be scaled out, traffic can be segmented in different ways
      • Location (e.g. east coast vs. west coast)
      • Pre-assigned criteria: user ID, IP, or other dynamic criteria
      • Load balanced
  • Segmenting Responsibility: segmenting responsibility allows for a distributed architecture
      • Each component can be scaled independently
      • Allows for more flexibility in scaling
      • Adds more complexity and potential messaging overhead
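As a rough sketch of segmenting by a pre-assigned criterion, the function below maps a user ID to one node of a pool using a stable hash, so the same user always lands on the same node. The function name, node names, and region table are made up for illustration; the slides do not prescribe a particular scheme.

```python
import hashlib
from typing import Dict, List


def node_for_user(user_id: str, nodes: List[str]) -> str:
    """Pick a node for a user with a stable hash of the user ID.

    hashlib is used instead of the built-in hash() because hash() of a
    string changes between Python processes.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# Hypothetical location-based segmentation: choose the pool by region
# first, then spread users across that pool by ID.
POOLS: Dict[str, List[str]] = {
    "east": ["east-1", "east-2"],
    "west": ["west-1", "west-2", "west-3"],
}


def route(user_id: str, region: str) -> str:
    return node_for_user(user_id, POOLS[region])
```

Note that with simple modulo hashing, adding or removing a node reassigns many users, which matters if the nodes hold per-user state.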
  • Clustering Clustering is the concept of having a group of nodes working App App App App Nod Nod Nod Nod together to provide the e e e e same capability  Nodes typically co- Share located d  Common data shared Data as needed across the cluster  Communication may be needed between nodes
  • Messaging Once a clustered  Types of Messaging and/or distributed  JMS architecture is used  Open Source MQ messaging will be packages needed between  Custom Designed various components  Use of APIs and/or nodes
  • Example of Scaled Architecture Load Load Load Load Balancer Balancer Balancer Balancer Web Compone Compone Web Compone Compone Web Server Compone nt 1 Compone nt 2 Web Server Compone nt 1 Compone nt 2 Server nt 1 nt 2 Server nt 1 nt 2 Database Database Site 1 Site 2
  • Reliability/Availability
  • What is Reliability/Availability?
      • Availability is typically measured by the amount of downtime your application has in a given year
          • Unplanned downtime and planned downtime are both counted
      • Reliability is described by the likelihood of failure, based on actual measurements
      • We'll focus more on availability
  • Reliability/Availability: Factors to Consider
      • Cost vs. need
      • Problem detection
      • Automation for recovery
      • Active/standby, active/active, hot standby vs. cold standby
      • Local and geo-redundancy
      • Multi-zone, multi-cloud
      • Test until you break the system
  • Reliability Requirements
      • Cost considerations
          • Number of instances
          • Bandwidth requirements between sites
          • Complexity of software
          • Monitoring
      • Need
          • User experience
          • Customer requirements
          • Negative publicity
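To make the downtime-per-year definition concrete, the following sketch converts between an availability fraction and hours of downtime. It assumes a 365-day year and counts planned and unplanned downtime together, as the slide does; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which the application was up."""
    return 1 - downtime_hours / period_hours


def allowed_downtime_hours(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget (planned plus unplanned) for a target availability."""
    return (1 - target) * period_hours


# A "three nines" (99.9%) target leaves roughly 8.76 hours of downtime
# per year; 99.99% leaves roughly 53 minutes.
three_nines_budget = allowed_downtime_hours(0.999)
four_nines_budget_minutes = allowed_downtime_hours(0.9999) * 60
```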
  • Problem Detection: effective monitoring of the application is key to minimizing downtime
      • Event reporting in the software
      • External monitoring: test for successful behavior
      • Automatic detection and alerting to minimize the cost of operations personnel
  • Automation for Recovery: the faster a failed component recovers, the higher the reliability
      • Automatic detection and automatic recovery
      • Automated installation is key to minimizing setup time during recovery
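External monitoring and automatic recovery can be combined in a small loop: probe the application, count consecutive failures, and trigger a recovery action once a threshold is reached. The sketch below is one possible shape under stated assumptions; `HealthMonitor`, the `probe` and `recover` callables, and the threshold of three are all hypothetical, and a real deployment would run `check_once` on a timer and send alerts as well.

```python
from typing import Callable


class HealthMonitor:
    """Calls recover() after a run of consecutive failed probes."""

    def __init__(
        self,
        probe: Callable[[], bool],
        recover: Callable[[], None],
        failure_threshold: int = 3,
    ) -> None:
        self.probe = probe
        self.recover = recover
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.recoveries = 0

    def check_once(self) -> bool:
        """Run one probe; return True if the application looked healthy."""
        try:
            healthy = bool(self.probe())
        except Exception:
            # A probe that raises (timeout, refused connection) is a failure.
            healthy = False

        if healthy:
            self.consecutive_failures = 0
            return True

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.recover()  # e.g. restart or reinstall the component
            self.recoveries += 1
            self.consecutive_failures = 0
        return False


# Simulated probe results: one success, three failures, one success.
results = iter([True, False, False, False, True])
restarts = []
monitor = HealthMonitor(
    probe=lambda: next(results),
    recover=lambda: restarts.append("restart"),
)
outcomes = [monitor.check_once() for _ in range(5)]
```

Requiring several consecutive failures before recovering avoids restarting a component because of a single dropped probe.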
  • Availability Models
      • N = number of nodes required for normal processing
      • N+1 = one additional node to provide redundancy in case of failure
      • N+K = K nodes provide additional redundancy
  • Redundancy Models
      • Active/Cold Standby: the backup site is booted up when needed
      • Active/Hot Standby: the backup site is running and ready to take over
      • Active/Active: both sites are active and processing traffic
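The N, N+1, and N+K models reduce to simple arithmetic, sketched below. The function names are illustrative, and the check assumes all nodes are interchangeable, which the slide does not state.

```python
def nodes_to_provision(n: int, k: int) -> int:
    """Total nodes for an N+K model: N for normal load plus K spares."""
    return n + k


def survives(total_nodes: int, required_nodes: int, failed_nodes: int) -> bool:
    """True if enough nodes remain to carry the normal processing load."""
    return total_nodes - failed_nodes >= required_nodes


# With N = 4: an N+1 deployment tolerates one failure but not two,
# while an N+2 deployment tolerates two.
n_plus_1 = nodes_to_provision(4, 1)
n_plus_2 = nodes_to_provision(4, 2)
```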
  • Local and Geo-Redundancy
      • Local
          • Backup instances are available within the same location
          • Use of availability zones within a region is very similar
      • Geographic
          • Backup instances are available in another geographic location
          • Typically in a separate region, to account for events such as natural disasters
  • Availability to the Max
      • Multi-Zone/Multi-Region
          • Multi-zone typically provides instances running in different physical locations, but in the same region
          • Multi-region provides different geographic regions of availability
      • Multi-Cloud
          • For applications that require the maximum possible availability
          • Run in different cloud providers in different regions
  • Test Until You Break the System
      • Push the system to the max and observe the breaking points
      • Fix the problem, repeat
      • The best way to find problems and prevent unplanned downtime is to test thoroughly with a mindset to break the system
  • Q&A
  • THANK YOU! Greg