HIGH SCALABILITY AND
           RELIABILITY IN THE
           CLOUD
           GREG THOMPSON
           HEAD OF ARCHITECTURE, APPS ENABLEMENT
           ALCATEL-LUCENT

@gmthomp   greg.thompson@alcatel-lucent.com
About This Session
   Target audience is backend application
    developers deploying infrastructure into a
    cloud environment
   Will cover concepts for scalability and
    reliability with the goal of helping application
    developers understand some key
    considerations when designing and building
    the backend.
Design Time Decisions
   When first building your application backend,
    consider a few important questions
     How fast should the application be recovered if a
      failure occurs?
     How much downtime is acceptable?
     Is the application maintaining stateful data?
     What kind of information needs to be shared across
      multiple instances?
Scalability
What is Scalability?
   Scalability describes
    how well the application
    handles increasing
    traffic volume
Scalability – Factors to Consider
   Horizontal vs. Vertical
   Stateless vs. Stateful
   Understanding Limitations
   Connection Management
   Segmentation of traffic
   Segmentation of responsibility (distributed arch)
   Clustering
   Messaging
What Type of Scalability?
Vertical vs. Horizontal
Vertical
   Scaling up a single
    node
     Physical limitations:
      instances are very
      powerful but still have
      finite limits
     Resources such as
      number of sockets
      can only go so high
Horizontal
   Scaling out across
    multiple nodes
     Ability to distribute
      traffic over a number
      of nodes
     Allows for more
      flexibility over time
Will the App Maintain State?
Stateless Applications
   Application does not
    persist information
    about transactions
   Each transaction is
    independent and
    atomic
   (Diagram: Request -> Application -> Response)
Will the App Maintain State?
Stateful Applications
   (Diagram: first and subsequent requests -> Application -> DB)
   Application needs to
    maintain data about
    transactions in
    progress
   Requires storage
   Persistence may also
    be required
    depending the
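
The difference between the two styles can be sketched in a few lines. This is a minimal illustration, not a real backend: the handler names, the session ID, and the in-memory dictionary standing in for the DB are all assumptions made for the example.

```python
from typing import Dict, List


def stateless_handler(request: dict) -> dict:
    """Stateless: everything needed to answer is in the request itself."""
    return {"total": sum(request["items"])}


class StatefulHandler:
    """Stateful: in-progress transaction data lives in a store shared by all instances."""

    def __init__(self, store: Dict[str, List[int]]):
        # The dict is a stand-in for the DB box in the slide's diagram.
        self.store = store

    def handle(self, session_id: str, item: int) -> dict:
        items = self.store.setdefault(session_id, [])
        items.append(item)
        return {"total": sum(items)}


# Two application instances share one store, so a subsequent request
# can land on a different instance than the first request did.
shared_store: Dict[str, List[int]] = {}
instance_a = StatefulHandler(shared_store)
instance_b = StatefulHandler(shared_store)
first = instance_a.handle("session-1", 5)
second = instance_b.handle("session-1", 7)
```

Keeping the state outside the application instance is what lets a stateful application still scale horizontally.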
Understanding Limitations
   Thorough testing is
    key to understanding
    bottlenecks
   Test real-world
    scenarios, including
    latency
   Push the system to
    the max to
    understand how it
Connection Management
Mobile Device Connections
   Mobile devices don’t always
    behave as you expect
       Connectivity is often very
        dynamic
       Devices move between 4G, 3G, 2G,
        no signal, and Wi-Fi
       Not all TCP events get
        reported, so sockets can remain
        open
   If not handled correctly, these
    factors can be a time bomb no
    matter how far you scale a
    component vertically
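
One common way to defuse this is an application-level idle timeout: track the last activity per connection and close the ones that have gone silent. The sketch below assumes a hypothetical `ConnectionRegistry` class and a 60-second timeout; a real server would also close the underlying socket for each reaped connection.

```python
import time
from typing import Dict, List, Optional


class ConnectionRegistry:
    """Tracks last activity per connection so silent ones can be closed."""

    def __init__(self, idle_timeout: float):
        self.idle_timeout = idle_timeout
        self._last_seen: Dict[str, float] = {}

    def touch(self, conn_id: str, now: Optional[float] = None) -> None:
        """Record traffic (or a heartbeat) from a connection."""
        self._last_seen[conn_id] = time.monotonic() if now is None else now

    def reap(self, now: Optional[float] = None) -> List[str]:
        """Forget and return the connections idle longer than the timeout."""
        now = time.monotonic() if now is None else now
        stale = [conn_id for conn_id, seen in self._last_seen.items()
                 if now - seen > self.idle_timeout]
        for conn_id in stale:
            del self._last_seen[conn_id]  # the caller would also close the socket
        return stale


# Explicit timestamps keep the example deterministic.
registry = ConnectionRegistry(idle_timeout=60.0)
registry.touch("phone-1", now=0.0)
registry.touch("phone-2", now=50.0)
stale = registry.reap(now=70.0)  # only phone-1 has been silent for more than 60 s
```

Calling `reap()` on a timer bounds how long a dead socket can hold resources, regardless of whether the TCP close was ever reported.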
Segmenting Traffic
   Once the application
    can be scaled out,
    traffic can be
    segmented in different
    ways
       Location (e.g. east coast
        vs. west coast)
       Pre-assigned criteria:
        user ID, IP, or other
        dynamic criteria
       Load balanced
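
Segmenting by a pre-assigned criterion such as user ID can be as simple as hashing the ID onto a list of nodes. This is a minimal sketch; the node names and the `pick_node` function are hypothetical.

```python
import hashlib
from typing import List


def pick_node(user_id: str, nodes: List[str]) -> str:
    """Map a user ID to one node using a stable hash."""
    # hashlib is used instead of the built-in hash(), which is salted per process
    # and would send the same user to different nodes after a restart.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


nodes = ["east-1", "east-2", "west-1"]
assignment = pick_node("user-42", nodes)
```

Note that plain modulo hashing remaps most users whenever the node list changes; consistent hashing is the usual alternative when nodes are added or removed often.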
Segmenting Responsibility
   Segmenting
    responsibility allows for
    a distributed
    architecture
       Each component can be
        scaled independently
       Allows for more flexibility
        in scaling
       Adds more complexity
        and potential messaging
        overhead
Clustering
   Clustering is the
    concept of having a
    group of nodes working
    together to provide the
    same capability
       Nodes typically
        co-located
       Common data shared
        as needed across the
        cluster
       Communication may be
        needed between nodes
   (Diagram: four app nodes backed by shared data)
Messaging
   Once a clustered
    and/or distributed
    architecture is used,
    messaging will be
    needed between
    various components
    and/or nodes
   Types of Messaging
       JMS
       Open source MQ
        packages
       Custom designed
       Use of APIs
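
The basic pattern behind all of these options is a producer handing messages to a consumer through a queue. The sketch below uses Python's in-process `queue.Queue` purely as a stand-in for a real broker; the message contents and the `worker` function are made up for the example.

```python
import queue
import threading
from typing import List


def worker(inbox: queue.Queue, results: List[str]) -> None:
    """Consume messages until the None sentinel arrives."""
    while True:
        message = inbox.get()
        if message is None:  # sentinel: shut down
            break
        results.append(message.upper())  # placeholder for real processing


inbox: queue.Queue = queue.Queue()
results: List[str] = []
consumer = threading.Thread(target=worker, args=(inbox, results))
consumer.start()

# The producer only knows about the queue, not about the consumer.
for text in ["order created", "order paid"]:
    inbox.put(text)
inbox.put(None)
consumer.join()
```

Because the producer never calls the consumer directly, either side can be scaled or replaced independently, which is the point of putting messaging between components.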
Example of Scaled Architecture
   (Diagram: two sites, Site 1 and Site 2, with the same layout.
    In each site, load balancers sit in front of web servers,
    Component 1 instances, and Component 2 instances,
    all backed by a database.)
Reliability/Availability
What is Reliability/Availability?
   Availability is typically
    measured by the amount of
    downtime your application
    has in a given year
       Unplanned downtime and
        planned downtime are both
        considered
   Reliability is described by the
    likelihood of failure based on
    actual measurements
   We’ll focus more on
    Availability
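
An availability target translates directly into a yearly downtime budget. The short calculation below assumes a 365-day year; the function name is only illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year allowed by an availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} hours/year")
```

For example, 99.9% availability leaves about 8.76 hours per year, and that budget has to cover planned as well as unplanned downtime.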
Reliability/Availability
Factors to Consider
   Cost vs. Need
   Problem detection
   Automation for recovery
   Active/standby, active/active, hot standby vs. cold
    standby
   Local and Geo-redundancy
   Multi-zone, multi-cloud
   Test Until You Break the System
Reliability Requirements
Cost Considerations
   Number of instances
   Bandwidth
    requirements
    between sites
   Complexity of
    software
   Monitoring
Need
   User experience
   Customer
    requirements
   Negative publicity
Problem Detection
   Effective monitoring of
    the application is key to
    minimizing downtime
       Event reporting in the
        software
       External monitoring –
        test for successful
        behavior
       Auto detection and
        alerting to minimize cost
        of operations personnel
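
External monitoring can be sketched as a probe that is run on a schedule and raises an alert only after several consecutive failures, which avoids paging someone for a single blip. The `Monitor` class, the threshold of three, and the canned probe results are assumptions for the example; a real probe would exercise the application, for instance with an HTTP request.

```python
from typing import Callable, List


class Monitor:
    """Runs a probe and alerts after a number of consecutive failures."""

    def __init__(self, probe: Callable[[], bool],
                 alert: Callable[[str], None], threshold: int = 3):
        self.probe = probe
        self.alert = alert
        self.threshold = threshold
        self.failures = 0

    def run_once(self) -> None:
        try:
            ok = self.probe()
        except Exception:
            ok = False  # a probe that crashes counts as a failed check
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures == self.threshold:  # alert once per outage, not every check
            self.alert(f"{self.threshold} consecutive probe failures")


# Canned probe results stand in for real checks against the application.
outcomes = iter([True, False, False, False, False])
alerts: List[str] = []
monitor = Monitor(probe=lambda: next(outcomes), alert=alerts.append)
for _ in range(5):
    monitor.run_once()
```

After the five checks above, exactly one alert has been raised, on the third consecutive failure.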
Automation for Recovery
   Faster recovery of a
    failed component
    improves availability
     Automatic detection
      and automatic
      recovery
     Automated installation
      is key for minimizing
      setup time during
      recovery
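
The effect of recovery time can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is mean time between failures and MTTR is mean time to recover. The figures below are invented for illustration only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure rate, different recovery times (hypothetical numbers).
manual = availability(mtbf_hours=1000, mttr_hours=4)       # operator rebuilds by hand
automated = availability(mtbf_hours=1000, mttr_hours=0.1)  # automatic detection and reinstall

print(f"manual recovery:    {manual:.5f}")
print(f"automated recovery: {automated:.5f}")
```

With the failure rate unchanged, shortening recovery alone raises availability, which is why automated detection and installation pay off.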
Availability Models
   N = number of nodes
    required for normal
    processing
   N+1 = one additional
    node to provide
    redundancy in case of
    failure
   N+K = K nodes provide
    additional redundancy
   (Diagram: N N; N N +1; N N K K)
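
Sizing a deployment under these models is simple arithmetic: work out N from the peak load and the capacity of one node, then add K spares. The load and capacity numbers below are hypothetical, as is the function name.

```python
import math


def nodes_to_provision(peak_load: float, per_node_capacity: float, k: int = 1) -> int:
    """Return N + K: nodes needed for normal processing plus K redundant nodes."""
    n = math.ceil(peak_load / per_node_capacity)
    return n + k


# 950 requests/s at 100 requests/s per node needs N = 10; N+1 provisions 11.
total = nodes_to_provision(peak_load=950, per_node_capacity=100, k=1)
```

The per-node capacity should come from load testing rather than from a vendor figure.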
Redundancy Models
   Active/Cold Standby
       Backup site is booted
        up when needed
   Active/Hot Standby
       Backup site is running
        and ready to take over
   Active/Active
       Both sites active and
        processing traffic
Local and Geo-Redundancy
   Local
     Backup instances
      are available within
      the same location
     Use of availability
      zones within a
      region is very similar
   Geographic
     Backup instances
      are available in
      another geographic
      location
     Typically in a
      separate region to
      account for events
      such as natural
      disasters
Availability to the Max
   Multi-Zone/Multi-Region
     Multi-zone typically
      provides instances
      running in different
      physical locations, but
      in the same region
     Multi-region provides
      availability in different
      geographic regions
   Multi-Cloud
     If your application
      requires the
      maximum possible
      availability
     Run in different
      cloud providers in
      different regions
Test Until You Break the System
   Push the system to
    the max and observe
    the breaking points
   Fix the problem,
    repeat
   The best way to find
    problems and prevent
    unplanned downtime
    is to test thoroughly
    with a mindset to
    break the system
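
The push-observe-fix loop can be automated as a load ramp that stops at the first load level where the error rate exceeds a limit. This is only a sketch: `find_breaking_point`, the 1% error limit, and the fake system that fails above 750 requests/s are all assumptions, and a real test would drive an actual load generator.

```python
from typing import Callable, Optional


def find_breaking_point(error_rate_at: Callable[[int], float],
                        start: int = 100, step: int = 100,
                        max_load: int = 10_000,
                        max_error_rate: float = 0.01) -> Optional[int]:
    """Ramp the load and return the first level whose error rate exceeds the limit."""
    load = start
    while load <= max_load:
        if error_rate_at(load) > max_error_rate:
            return load
        load += step
    return None  # the system did not break within the tested range


def fake_system(load: int) -> float:
    """Hypothetical system: healthy up to 750 requests/s, then 20% errors."""
    return 0.0 if load <= 750 else 0.2


breaking_point = find_breaking_point(fake_system)  # first failing step is 800
```

After each fix, rerun the ramp and check that the breaking point has moved up.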
Q&A
THANK YOU!
Greg Thompson
@gmthomps
greg.thompson@alcatel-lucent.com
