Building Fault-Tolerant
Applications in the Cloud
Ryan Holland
Ecosystem Solution Architect
Faults?
Facilities
Hardware
Networking
Code



People
What is “Fault-Tolerant”?
Degrees of risk mitigation - not binary




Automated

Tested!
Agenda
The AWS Approach

Building Blocks

Design Patterns
Old School Fault-Tolerance: Build Two
Cloud Computing Benefits
No Up-Front Capital Expense
Low Cost
Pay Only for What You Use
Self-Service Infrastructure
Easily Scale Up and Down
Improve Agility & Time-to-Market
Cloud Computing Fault-Tolerance Benefits
No Up-Front HA Capital Expense
Low Cost Backups
Pay for DR Only When You Use It
Self-Service DR Infrastructure
Easily Deliver Fault-Tolerant Applications
Improve Agility & Time-to-Recovery
AWS Cloud allows Overcast Redundancy

Have the shadow duplicate of your infrastructure ready to go when you need it…

…but only pay for what you actually use
Old Barriers to HA
are now Surmountable

Cost

Complexity

Expertise
AWS Building Blocks: Two Strategies

Inherently fault-tolerant services:
S3
SimpleDB
DynamoDB
CloudFront
SWF, SQS, SNS, SES
Route 53
Elastic Load Balancing
Elastic Beanstalk
ElastiCache
Elastic MapReduce
IAM

Services that are fault-tolerant with the right architecture:
Amazon EC2
VPC
EBS
RDS
The Stack:

Resources
Deployment
Management
Configuration
Networking
Facilities
Geographies
The Stack:

EC2 Instances
Amazon Machine Images
CloudWatch Alarms – Auto Scaling
CloudFormation – Elastic Beanstalk
Route 53 – Elastic IP – ELB
Availability Zones
Regions
Regional Diversity

Use Regions for:
  Latency
   • Customers
   • Data Vendors
   • Staff
  Compliance
  Disaster Recovery
  … and Fault Tolerance!
Proper Use of Multiple Availability Zones
Network Fault-Tolerance Tools
107.22.18.45   isn’t fault-tolerant but 50.17.200.146 is: EIP

Elastic Load Balancing

Automated DNS: Route53

Latency-Based Routing
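The latency-based routing idea above can be sketched in a few lines. This is illustrative only, not how Route 53 works internally, and the region names, latencies, and health-check states are made up: choose the lowest-latency endpoint that is still passing health checks.

```python
# Hypothetical per-client latency measurements, in milliseconds.
REGION_LATENCY_MS = {
    "us-east-1": 120.0,
    "sa-east-1": 18.0,
    "eu-west-1": 190.0,
}

def pick_region(latencies, healthy):
    """Return the lowest-latency region that is passing health checks."""
    candidates = {r: ms for r, ms in latencies.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy endpoints")
    return min(candidates, key=candidates.get)

# A Brazilian client with both regions healthy lands on sa-east-1;
# if sa-east-1 fails its health check, traffic shifts automatically.
print(pick_region(REGION_LATENCY_MS, {"us-east-1", "sa-east-1"}))  # sa-east-1
print(pick_region(REGION_LATENCY_MS, {"us-east-1", "eu-west-1"}))  # us-east-1
```

The point of the sketch is that failover is a routing decision, not a manual runbook step.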
New EC2 VPC feature:
Elastic Network Interface

Up to 8 interfaces with 30 addresses each
Span subnets
Attach/Detach
Public or Private
CloudFormation – Elastic Beanstalk




  Q: Is your stack unique?
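One way to picture the CloudFormation side of that answer: the whole stack is described as data, so it can be re-created identically in another AZ or region. This is a minimal hedged sketch, not a production template; the AMI ID, resource names, and instance type are placeholders.

```python
import json

# Hypothetical minimal CloudFormation template: one EC2 instance plus an
# Elastic IP, expressed as plain data so the stack is reproducible.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Sketch: one web server with a stable public address",
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-12345678",   # placeholder AMI ID
                "InstanceType": "t1.micro",
            },
        },
        "WebServerEIP": {
            "Type": "AWS::EC2::EIP",
            "Properties": {"InstanceId": {"Ref": "WebServer"}},
        },
    },
}

print(json.dumps(template, indent=2))
```

Because the template is just a document, "rebuild the stack somewhere else" becomes a single create-stack call rather than a manual procedure.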
CloudWatch – Alarms – Auto Scaling
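The alarm-to-scaling loop can be sketched as follows. This is illustrative pseudologic rather than the CloudWatch implementation; the thresholds, evaluation periods, and step sizes are invented.

```python
def evaluate_alarm(cpu_samples, threshold=70.0, periods=3):
    """ALARM when the last `periods` samples all breach the threshold,
    mimicking an alarm that requires consecutive breaching periods."""
    recent = cpu_samples[-periods:]
    return len(recent) == periods and all(s > threshold for s in recent)

def scale(desired, in_alarm, step=2, maximum=8, minimum=2):
    """Apply a simple step-scaling policy, clamped to group limits."""
    if in_alarm:
        return min(desired + step, maximum)
    return max(desired - 1, minimum)  # scale in slowly when healthy

samples = [55.0, 80.0, 85.0, 91.0]   # hypothetical CPU utilization history
print(scale(2, evaluate_alarm(samples)))  # alarm fires -> 4 instances
```

The fault-tolerance payoff is the same loop running in reverse: when an instance dies and group capacity drops, the policy replaces it without a human in the path.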
AMI’s
Maintenance is critical

Alternatives: Chef, Puppet, cfn-init, etc.

When in doubt: 64-bit

Replicate for DR
EC2 Instances
Consistent, reliable building block

100% API controlled

Reserved Instances

EBS

Immense Fleet Scale
Example:
a “fork-lifted” app
Example:
Fault-Tolerant
Why mess with all of that?
Design For Failure




SPOF
Build Loosely Coupled Systems

Copyright © 2011 Amazon Web Services
Tight
Coupling
Loose Coupling
using Queues
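The loose-coupling pattern above can be sketched with a stdlib queue standing in for SQS (the job names are illustrative): the producer never waits on the worker, so a slow or restarting worker does not block the front end.

```python
import queue
import threading

jobs = queue.Queue()   # stands in for an SQS queue
results = []

def worker():
    """Consumer tier: pulls work at its own pace."""
    while True:
        job = jobs.get()
        if job is None:        # shutdown sentinel
            break
        results.append(f"processed {job}")
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# Producer tier: enqueues and moves on, regardless of worker pace.
for order_id in (101, 102, 103):
    jobs.put(order_id)

jobs.put(None)   # tell the worker to stop
t.join()
print(results)   # ['processed 101', 'processed 102', 'processed 103']
```

With a durable queue in the middle, either side can fail and be replaced while the messages wait safely, which is exactly what the tight-coupling diagram cannot offer.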
Fault-Tolerant Front-end Systems

Addressing: Route 53, EIP

Distribution: Multi-AZ, ELB, CloudFront

Redundancy: Auto Scaling

Monitoring: CloudWatch

Platform: Elastic Beanstalk

(Diagram: an Elastic Beanstalk environment fronted by Route 53, an Elastic IP, and an Elastic Load Balancer, with Auto Scaling, CloudWatch, and CloudFront around it.)
Fault-Tolerant Data-Tier Systems

Tuned
Patched
Cached
Sharded
Replicated
Backed Up
Archived
Monitored
Fault-Tolerant Data-Tier Systems

Tuned
Patched
Cached
Sharded
Replicated
Backed Up
Archived
Monitored

…LOTS OF WORK
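One of the items in that list, sharding, can be sketched as deterministic key-to-shard routing. This is a simplified illustration (real systems typically use consistent hashing so shards can be added without remapping everything):

```python
import hashlib

def shard_for(key, num_shards=4):
    """Route a key to a fixed shard so reads and writes for that key
    always land on the same database."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical user keys spread deterministically across 4 shards.
assignments = {k: shard_for(k) for k in ("alice", "bob", "carol")}
print(assignments)
```

Every one of the list's items needs this kind of engineering effort when you run the data tier yourself, which is the motivation for the managed services on the next slide.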
AWS Fault-Tolerant Data-Tier Services

S3 – Amazon Simple Storage Service
Amazon SimpleDB
EMR – Amazon Elastic MapReduce
Amazon DynamoDB
RDS – Amazon Relational Database Service
Amazon ElastiCache
RDS Fault-Tolerant Features

Multi-AZ Deployments

Read Replicas

Automated Backups

Snapshots

(Diagram: primary RDS DB instance with a Multi-AZ standby in a second Availability Zone.)
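As a hedged sketch of how those features are switched on (assuming boto3; the identifiers, sizes, and credentials below are placeholders), the fault-tolerance choices are just parameters on the create call, which is left commented out here:

```python
# Parameters that enable the RDS fault-tolerance features listed above.
db_params = {
    "DBInstanceIdentifier": "app-db",          # placeholder name
    "Engine": "mysql",
    "DBInstanceClass": "db.m1.large",          # placeholder size
    "AllocatedStorage": 100,
    "MasterUsername": "admin",
    "MasterUserPassword": "change-me",         # placeholder credential
    "MultiAZ": True,                # synchronous standby in a second AZ
    "BackupRetentionPeriod": 7,     # automated backups, point-in-time restore
}

# With AWS credentials configured, the actual call would look like:
# import boto3
# boto3.client("rds").create_db_instance(**db_params)

print(sorted(db_params))
```

Compare this with the "LOTS OF WORK" slide: replication, failover, and backups collapse into two fields.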
Storage Gateway

(Diagram: clients and application servers in your datacenter use direct-attached or SAN disks through an on-premises AWS Storage Gateway VM; the gateway uploads data over SSL, via the Internet or Direct Connect, to the AWS Storage Gateway service and Amazon Simple Storage Service (S3), where it can be restored as Amazon Elastic Block Storage (EBS) volumes for Amazon Elastic Compute Cloud (EC2) instances.)
Test! Use a Chaos Monkey!

Prudent

Conservative

Professional

Open source

…and all the cool kids are doing it

http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
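In the spirit of Netflix's Chaos Monkey, a toy sketch of the idea (the instance IDs are made up, and `discard` stands in for a real terminate call): kill a random instance, then verify the fleet still has capacity.

```python
import random

def chaos_strike(instances, rng=random):
    """Remove one randomly chosen instance from the fleet, simulating
    an unannounced termination."""
    victim = rng.choice(sorted(instances))
    instances.discard(victim)   # stands in for an EC2 terminate call
    return victim

fleet = {"i-aaa", "i-bbb", "i-ccc"}          # hypothetical fleet
killed = chaos_strike(fleet, random.Random(42))  # seeded for repeatability
survivors_can_serve = len(fleet) >= 2        # the real test: capacity survives
print(killed, survivors_can_serve)
```

If losing one random instance during business hours scares you, the architecture is not fault-tolerant yet; that is the whole point of running the monkey continuously.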
Thank You!

13h00 aws 2012-fault_tolerant_applications

Editor's Notes

  • #2 We are going to talk today about building fault-tolerant systems, and more specifically look at how AWS enables the cost-effective and scalable design of these systems in ways that simply cannot be done otherwise.
  • #3 So what types of faults are we trying to survive? Really, there is a wide array of ways most applications can fail. The facilities themselves can have failures ranging from something extremely catastrophic, like the building catching fire, to something as simple as a power outage. Inside the facilities we rely on a number of systems to be operating: a network stack with routers, switches, and firewalls, as well as servers and storage devices, all of which can fail, either through hardware faults or configuration errors. And all of that is before we even get to the code for your applications and the people that manage it, both of which are also potential sources of failure.
  • #4 So what does fault-tolerant mean? First, it's important to point out that this isn't an absolute: there isn't a magic easy button, nor a one-size-fits-all approach to building applications that can survive every possible failure. Generally speaking, there are costs associated with mitigating the risk of different types of failures, as well as likelihoods of those failures occurring, so the design of these applications becomes an exercise in risk mitigation. Take hard drives: the risk of a hard drive failing is pretty high compared to an entire datacenter being destroyed; luckily, the cost of mitigating a failed hard drive is also far lower than building duplicate datacenters. The second bullet is very important: given that people, or human error, are probably the most common cause of application failures, truly fault-tolerant applications must leverage automation in the case of failure. This not only makes recovery much faster but also ensures it happens in a known and controlled manner. And lastly, if you don't test your design, you won't know if it works.
  • #6 So here's how we used to implement fault tolerance; it was really simple: build two of everything. Now there are some significant problems with this approach. The obvious one is cost, since your application just got 100% more expensive, and here in Brazil, which already has much higher server hardware costs, that can make the cost of mitigating many types of failures impractical for many applications. So what ends up happening is, going back to the risk-mitigation idea, someone has to look at the cost of purchasing, maintaining, and operating a second instance of the application and decide if it's worth the cost.
  • #7 I'm sure people are familiar with a lot of the commonly discussed benefits of cloud computing, from removing up-front capital costs to time-to-market and agility, and in the area of fault tolerance these benefits all translate very well.
  • #8 The up-front capital cost of adding a second server or mirroring storage is gone for HA; backups are far simpler to use and extremely cost-effective, and today, with our release of Glacier, which will revolutionize the way businesses back up and archive data, that's never been more true. From a DR perspective, you can stage infrastructure and only launch and pay for it when it's actually needed, versus paying 24/7 for infrastructure you hope to never use. Services that are part of AWS greatly simplify making your applications highly available and fault-tolerant, often at a m
  • #9 With DR this becomes very evident: think of how often you actually use your DR site (hopefully you're thinking of a really small number), and now think of how you're paying for it. With AWS we have massive amounts of infrastructure in 8 different regions around the world, and the ability to stage and programmatically deploy infrastructure to any of those regions. So you can stage your DR site and have it ready to spring into action if needed, but only pay for what you're actually using.
  • #10 The next evolution beyond DR is HA, and HA has traditionally had a number of barriers that limited which applications could be deployed in an HA manner, the first being cost, but also complexity. This is different from DR, where something is broken and you need some method of getting it back online, for example by using a second location. With HA you want components to be able to fail while the system still operates normally, because there are multiple servers that can perform that function, or multiple online replicas of the data. In the traditional datacenter this can be very difficult, complex, and costly, but at AWS we have built HA services you can leverage, which not only bends the cost curve but also makes it extremely simple to do.
  • #11 As you can see here, many of the services we provide are inherently fault-tolerant; we've done all the work to create them in a fashion that is resilient to failure and highly durable, so you don't have to. So now if you need a fault-tolerant NoSQL database, you don't have to worry about how to architect that: you can simply use DynamoDB. With the right design, and by leveraging the services we provide that are inherently fault-tolerant, you can focus on building your application rather than the infrastructure. Some of the services you see on the right are fault-tolerant with the right architecture, and what we mean by that is that we give you options on how to architect and deploy those services; RDS with MySQL, for example, is fault-tolerant when Multi-AZ deployments are used, since it replicates the data to multiple datacenters.
  • #12 So we know there are opportunities for failure at every layer of the stack, from disasters that affect entire geographies or individual buildings, all the way up to the server your application is running on. Now let's see how this translates in AWS and look at the services we have that provide fault tolerance.
  • #13 At AWS we’ve built fault tolerant systems at every level of that stack.
  • #14 Fault Separation: Amazon EC2 provides customers the flexibility to place instances within multiple geographic regions as well as across multiple Availability Zones. Each Availability Zone is designed with fault separation. This means that Availability Zones are physically separated within a typical metropolitan region, on different flood plains, in seismically stable areas. In addition to discrete uninterruptable power source (UPS) and onsite backup generation facilities, they are each fed via different grids from independent utilities to further reduce single points of failure. They are all redundantly connected to multiple tier-1 transit providers. It should be noted that although traffic flowing across the private networks between Availability Zones in a single region is on AWS-controlled infrastructure, all communications between regions is across public Internet infrastructure, so appropriate encryption methods should be used to protect sensitive data. Data are not replicated between regions unless proactively done so by the customer.
  • #15 Distinct physical locations. Low-latency network connections between AZs. Independent power, cooling, network, security. Always partition app stacks across 2 or more AZs. Elastic Load Balance across instances in multiple AZs. Don't confuse AZs with Regions!
  • #18 Note, the question is not "do you need to automate your deployment" or "should I use automation when I'm using the cloud?"; the answer to that is YES! The question is: if you're using fully standard PHP or Java stacks, why manage them? Beanstalk does that great, with zero lock-in. If what you need is more complex, perhaps CloudFormation (note, you can do BOTH!).
  • #22 Three-tier web app has been "fork-lifted" to the cloud. Everything in a single Availability Zone. Load balanced at the web tier and app tier using software load balancers. Master and standby database. Elastic IP on the front-end load balancer only. S3 used as DB backup instead of tape. How can you use AWS features to make this app more highly available?
  • #23 Three-tier web app has been "fork-lifted" to the cloud. Everything in a single Availability Zone. Load balanced at the web tier and app tier using software load balancers. Master and standby database. Elastic IP on the front-end load balancer only. S3 used as DB backup instead of tape. How can you use AWS features to make this app more highly available?
  • #25 Avoid single points of failure. Assume everything fails, and design backwards. Goal: applications should continue to function even if the underlying physical hardware fails or is removed or replaced. Design your recovery process. Trade off business needs vs. cost of high availability.
  • #28 Multiple DNS targets. Load balanced across Availability Zones. Auto-scaled web-cache servers with health checks. Auto-scaled web servers with health checks. Comprehensive config, data, and AMI backup. Monitoring, alarming, and logging.
  • #29 DB-tier load balancing or queueing. Auto-scaled database cache servers with health checks. Redundant relational database systems: mirrored, log-shipped, async or sync replicated; designed to scale horizontally (sharding). Durable NoSQL or KV-store data systems: no-SPOF design; supports automatic re-balancing, replication, and fault recovery. Monitoring, alarming, and logging.
  • #30 DB-tier load balancing or queueing. Auto-scaled database cache servers with health checks. Redundant relational database systems: mirrored, log-shipped, async or sync replicated; designed to scale horizontally (sharding). Durable NoSQL or KV-store data systems: no-SPOF design; supports automatic re-balancing, replication, and fault recovery. Monitoring, alarming, and logging.
  • #32 Multi-AZ deployments: synchronous replication across AZs; automatic failover to standby replica. Automated backups: enable point-in-time recovery of the DB instance; retention period configurable. Snapshots: user-initiated full backup of the DB; a new DB can be created from snapshots.