This is an internal talk I gave within Datalynx in May 2013.
It’s an introduction to the ideas and concepts involved in building cloud systems and applications for technical people who are new to the cloud. It also compares and contrasts these ideas to those we’re used to in “traditional” enterprise IT systems.
Notes on the “Hippo” slide:
The inclusion of persistent storage here may make it sound like a hippo and a pet are essentially the same; however, to my mind there is a key difference between them.
A pet’s persistent data is not transferable to other pets without some sort of intervention, be it a restore from backup, a manual copy, or similar. Typically it would be stored on disks which are not immediately and automatically accessible if the pet is offline or not functioning.
A hippo’s persistent data, by contrast, is automatically transferred to the hippo’s successor instance if the original hippo dies or otherwise ceases to function. Typically the data would exist on a storage mechanism such as AWS EBS or OpenStack Cinder volumes, whose lifecycle is separate from that of the hippo and which are immediately and automatically accessible by other instances if the hippo dies.
Notes on the “Design for Failure” section:
This section provides a few rough and ready calculations for failure rates of hard drives. The calculations here are quite simplistic and assume an even distribution of failures over time. Obviously this isn’t the case in reality; however, the idea here is to provide a rough illustration of the differences in how we experience failure between enterprise IT systems and cloud systems.
The “Enterprise Failures” slide is based on a hypothetical application where 10 servers are involved in its delivery and each server has 10 drives (including SAN/NAS/backup systems etc.). It also assumes that “enterprise class” drives with high MTBFs have been used.
The “Cloud Failures” slide is based on numbers for Microsoft’s Windows Azure data centre in Dublin, which houses around 600’000 servers, and again assumes an average of 10 drives per server. It also assumes that consumer drives with low MTBFs have been used.
My overriding aim here was to express, to technical people who are not used to truly large scale systems, why they need to take the attitude of assuming that anything can fail at any time, and to realise that the implicit assumption of hardware reliability that is often applied in enterprise IT doesn’t map onto the cloud.
16. ANATOMY OF A COW
Bootstrapped
Stateless
Usually Linux
Image by Pearson Scott Foresman via Wikimedia
17. BOOTSTRAPPING
Photo by neoroma via Flickr
Config OS
Install software
Write config files
Initialise services
At boot time
Without human input
18. BOOTSTRAPPING TOOLS
Photo by Velo Steve via Flickr
Puppet
Chef
Ansible
CFEngine
AWS CloudFormation
OpenStack Heat
Group Policy/System Center
etc. etc. …
19. STATELESS
Photo by Numinosity (Gary J Wood) via Flickr
No persistent data
Collects state / job data on boot
Ephemeral storage
Exception: Hippos
20. USUALLY LINUX
Photo by brian.gratwicke via Flickr
Fewer licensing considerations
Easier to automate
Easier to image
Smaller footprint
More common at large scale
Welcome. Going to cover some concepts and ideas which are key in cloud systems. Not new – have been around for a long time, particularly in HPC. But may be unfamiliar if you’re used to traditional enterprise IT.
A rough metaphor for the difference between traditional enterprise architectures and cloud systems. These ideas are not new or unique to the cloud. But cloud does really compel you to use them. With thanks to Tim Bell, who manages the DCs at CERN. Side note: he’s running 15’000 servers with 3 people thanks to the concepts we’ll discuss. 2 main animals in the server zoo.
The first are pets
Each pet is, more or less, unique. Require human intervention to build and set up. May even be entirely built by hand.
Sometimes named – e.g. Starbuck in Novartis. Each pet contains some unique data which must persist. By unique I mean data which is not accessible or available elsewhere except perhaps via a restore from backup or through some other manual intervention. I.e. it has a state which needs to be maintained.
Because pets are stateful, we scramble to fix them when they break. So we IT folk care a lot about whether our pets are working or not.
Next up cows!
Cows come in herds. They’re all basically the same. And that’s because they’re all built by other computers – no humans involved.
Because they come in herds, we number them. They only use shared data – no cow holds any unique data. Thus they have no state.
Because they have no state, individual cows don’t matter. We only care about the overall health of the herd. So when a cow breaks, we terminate it and more are automatically built to replace it. Next, a couple of special types of cattle. First: hippos!
Hippos are cows which do have some unique data which must persist. This may sound like a pet, but there is a key difference between a hippo and a pet. A hippo’s data is automatically transferred and so isn’t truly unique to that single hippo. When a hippo breaks, its replacement is automatically given the required data. Thus no restore from backup or other manual intervention required. The second special cow is the canary!
Canaries are experimental cattle. Key difference from hippos and cattle is that they have a new, untested configuration. Thus canaries are your dev/test/QA etc. environment. So now we know the difference between pets and cattle, let’s map that to technical terminology.
Traditional IT architectures use servers. These are the types of systems we’ve all been working on for many years. Typically you’re concerned about keeping them up, so you perform maintenance and troubleshooting to keep them running. Usually because they hold some unique state data. Thus servers are pets. Cloud systems use instances. Instances are fully automated. No single instance is ever guaranteed to survive for any significant length of time (more on this later). But we don’t care about individual instances, only the health of our pool of instances overall. Thus instances are cattle. So, as cattle are not pets, and vice versa… Instances are not servers – these two are fundamentally different! This is a very, very important concept to grasp. Treating instances as servers won’t work in the long run. It’ll also negate the cost and operational efficiency benefits of cloud architectures.
So a core design goal for cloud systems should be to get rid of pets and have lots of cattle instead! Due to the automated, low-maintenance nature of cattle, this means we can spend less time firefighting, and more on building and improving our applications/systems/services/etc.
I would argue that cattle are also good for regulators. Automation makes them easy to test and manage en masse. It also makes them highly homogeneous. Which makes them easy to monitor. And makes anomalies/issues/security breaches easier to spot. And if an issue is spotted on one cow, remediation takes minutes. Terminate it and get another one.
So let’s look at how to build a cow. There are three key things I’d highlight for this.
Bootstrapping is automating the build and config of a system. It can do anything you would normally do by hand. It’s normally done as the instance boots. May also run periodically and apply configuration updates. Key element of bootstrapping – once the config has been bootstrapped, no human input should be required at all to build a new instance!
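As a rough illustration of the bootstrapping idea in these notes, here is a minimal Python sketch. It is not how Puppet, Chef or any other real tool works; the `Instance` class, the step functions, the package name and the file paths are all hypothetical stand-ins. The point is only that the whole build is an ordered list of steps that code can run with no human input, and that running it again changes nothing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Instance:
    """Hypothetical in-memory stand-in for a freshly booted machine."""
    packages: Set[str] = field(default_factory=set)
    files: Dict[str, str] = field(default_factory=dict)
    services: Set[str] = field(default_factory=set)


# Each step mirrors a bullet on the bootstrapping slide. Names and paths are
# illustrative assumptions, not taken from any real tool.
def configure_os(inst: Instance) -> None:
    inst.files["/etc/motd"] = "built by bootstrap, not by hand\n"


def install_software(inst: Instance) -> None:
    inst.packages.add("webserver")


def write_config_files(inst: Instance) -> None:
    inst.files["/etc/webserver.conf"] = "listen 80\n"


def initialise_services(inst: Instance) -> None:
    inst.services.add("webserver")


BOOTSTRAP_STEPS: List[Callable[[Instance], None]] = [
    configure_os,
    install_software,
    write_config_files,
    initialise_services,
]


def bootstrap(inst: Instance) -> Instance:
    """Run every step in order; no prompts, no manual input."""
    for step in BOOTSTRAP_STEPS:
        step(inst)
    return inst


if __name__ == "__main__":
    cow_1 = bootstrap(Instance())
    cow_2 = bootstrap(Instance())
    print(cow_1 == cow_2)  # two cows built the same way are interchangeable
```

Because every step only sets state to a fixed value, the same function can also be run periodically to re-apply configuration, as mentioned above.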
There are many mature, stable tools available for bootstrapping. AWS has a specific feature for this type of thing – CloudFormation. OpenStack will have a CloudFormation-compatible counterpart later this year called Heat. Pick your own poison – doesn’t particularly matter which tool you use, so long as you’re bootstrapping!
As mentioned before, cattle are stateless. With the exception of hippos. Typically they have only ephemeral storage. A cow’s storage and its contents disappear when the cow is terminated. So each cow has to collect the data it needs to operate as it boots. An ideal cow boots with Just Enough OS and then “phones home” to ask “who am I and what should I do?”
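The “phone home” pattern can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `ControlPlane` class, its method names and the role names are hypothetical, and a real system would make this call over the network to a configuration service. What the sketch shows is that the cow holds no state of its own; all of it lives at “home”, so a replacement can pick up exactly where a terminated cow left off.

```python
import itertools
from collections import deque
from typing import Deque, Dict, Iterable


class ControlPlane:
    """Hypothetical 'home' that owns all state on behalf of the herd."""

    def __init__(self, roles: Iterable[str]) -> None:
        self._unfilled: Deque[str] = deque(roles)  # work nobody is doing yet
        self._ids = itertools.count(1)             # cows are numbered, not named

    def phone_home(self) -> Dict[str, object]:
        """Answer 'who am I and what should I do?' for a booting cow."""
        return {"id": next(self._ids), "role": self._unfilled.popleft()}

    def report_terminated(self, assignment: Dict[str, object]) -> None:
        """Put a dead cow's role back so the next cow to boot takes it over."""
        self._unfilled.append(str(assignment["role"]))


if __name__ == "__main__":
    home = ControlPlane(["web", "worker"])
    first = home.phone_home()
    second = home.phone_home()
    home.report_terminated(first)      # cow 1 breaks and is terminated
    replacement = home.phone_home()    # cow 3 boots and inherits its role
    print(first, second, replacement)
```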
Cattle almost always run Linux. Windows is poorly suited to cattle. It’s much harder to bootstrap away all human input on first boot. Windows management systems tend to be host-name sensitive, because AD is. Each Windows server has a truly unique identity – which I would count as state. Thus I would consider Windows an inherently stateful OS. It has a much heavier base resource footprint. It’s exceptionally rare at truly large scale. Hint: Enterprise IT is not large scale! (more on this later)
So now we know more about cattle, let’s talk about broader cloud architecture principles. Here there are four key things I’d like to call out.
Loose coupling is the exact opposite of most enterprise architectures I’ve seen deployed. Systems should be split into layers – a.k.a. tiers. The identities of individual instances within each tier must not matter. Rule of thumb – if any part of your architecture depends on some system having a particular FQDN, then it’s NOT loosely coupled! Communication must be asynchronous. Requests should be made between the tiers and systems without any waiting for a response. Normally done via a message bus and message queuing. This is another key thing to wrap your head around. Quick, simplified overview of message queuing: Each request from one instance to another is put in a queue. Any instance capable of answering the request can pick up the request message. The response goes back into the message queue. Any instance capable of processing the response can pick up the response message. Thus the instance which processes the response may not be the same one that made the request. Again – stateless systems!
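The message queuing overview above can be sketched with Python’s standard `queue` and `threading` modules. This is a single-process toy under stated assumptions: threads stand in for instances, two in-memory queues stand in for a real message bus, and the function name `run_demo`, the worker names and the message fields are all made up for the example. It shows requests being posted without waiting, and any available worker picking up any message.

```python
import queue
import threading
from typing import Dict


def run_demo(n_workers: int = 3, n_requests: int = 10) -> Dict[int, int]:
    requests: "queue.Queue" = queue.Queue()   # stand-in for the request queue
    responses: "queue.Queue" = queue.Queue()  # stand-in for the response queue

    def worker(name: str) -> None:
        # Any worker may pick up any request; the sender never addresses one.
        while True:
            msg = requests.get()
            if msg is None:  # shutdown signal for this worker
                break
            responses.put({
                "request_id": msg["id"],
                "result": msg["n"] ** 2,
                "handled_by": name,
            })

    threads = [
        threading.Thread(target=worker, args=(f"cow-{i}",))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()

    # The front tier posts its requests and moves on without waiting.
    for i in range(n_requests):
        requests.put({"id": i, "n": i})

    # One shutdown signal per worker, then wait for the herd to drain the queue.
    for _ in threads:
        requests.put(None)
    for t in threads:
        t.join()

    # Whoever reads the response queue matches replies to requests by id,
    # not by which instance sent or handled them.
    results: Dict[int, int] = {}
    while not responses.empty():
        reply = responses.get()
        results[reply["request_id"]] = reply["result"]
    return results


if __name__ == "__main__":
    print(run_demo())
```

Which worker handles which request varies from run to run, and nothing in the result depends on it. That is the property loose coupling is after.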
Again, exact opposite of most enterprise architectures in reality. Traditional approach is vertical scaling, a.k.a. scale up. Add more RAM, CPUs, spindles, etc. to improve performance. Cloud approach is horizontal scaling, a.k.a. scale out. Add more instances to improve performance. I.e. scale by getting more cows, not by making your cows fatter! This is enabled by loose coupling and the distribution of workload. Done right, it means you can scale each tier of your architecture separately. E.g. scaling your storage tier without scaling your front end tier as well.
Parallel processing within your applications is key to enabling loose coupling and horizontal scaling. In turn, loose coupling and horizontal scaling enable greater parallelisation of your processing. General idea is to break each task/request/action/etc. down into smallest possible/practical chunks. Ideally each chunk should be independent of all other chunks. Process all the chunks at once, combine the results to complete the task/request/etc. Again, scaling out not up! E.g. instead of getting a faster individual CPU to improve performance, get more CPUs.
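A minimal sketch of the chunk, process, combine idea, using Python’s standard `concurrent.futures`. The task (a sum of squares) and the function names are arbitrary choices for the example, and a thread pool on one machine is only a stand-in for a herd of instances; in a cloud system each chunk would typically be handed to a separate instance via a queue.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split(items: Sequence[int], chunk_size: int) -> List[Sequence[int]]:
    """Break the task into independent chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk: Sequence[int]) -> int:
    """Work on one chunk; needs nothing from any other chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items: Sequence[int],
                            chunk_size: int = 100,
                            workers: int = 4) -> int:
    chunks = split(items, chunk_size)
    # Hand all chunks to the pool at once; more workers means more chunks
    # in flight, which is the scale-out knob.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # Combine the partial results to complete the original task.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

Because the chunks are independent, the answer is the same whatever the chunk size or worker count; only the elapsed time changes.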
Still need to monitor cloud systems. But emphasis changes. Again don’t care about individual instances. Monitor the health of the herd instead. Another key difference is automation of responses. If a problem is detected, it should be fixed automatically. Again, no human intervention. Often accomplished by terminating the failing instances and starting new ones.
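The “terminate and replace” response can be sketched as a small reconcile function in Python. This is a toy model, not the API of CloudWatch or any other monitoring tool: the `launch` and `reconcile` names and the dict-based instance records are assumptions for illustration. The monitor only asks whether the herd has enough healthy members, and fixes any shortfall without a human.

```python
import itertools
from typing import Dict, List, Tuple

_ids = itertools.count(1)  # cows get numbers, not names


def launch() -> Dict[str, object]:
    """Hypothetical stand-in for asking the cloud for a new instance."""
    return {"id": next(_ids), "healthy": True}


def reconcile(herd: List[Dict[str, object]],
              desired_size: int) -> Tuple[List[Dict[str, object]], int]:
    """Drop failing instances and launch replacements up to desired_size.

    Returns the new herd and the number of instances that were terminated.
    """
    survivors = [cow for cow in herd if cow["healthy"]]
    terminated = len(herd) - len(survivors)
    while len(survivors) < desired_size:
        survivors.append(launch())
    return survivors, terminated


if __name__ == "__main__":
    herd = [launch() for _ in range(5)]
    herd[1]["healthy"] = False          # two cows start failing
    herd[3]["healthy"] = False
    herd, terminated = reconcile(herd, desired_size=5)
    print(terminated, [cow["id"] for cow in herd])
```

Running `reconcile` on a schedule gives the automated response described above: nobody diagnoses cow 2 or cow 4, they are simply replaced.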
Again many mature and stable monitoring tools exist. On AWS look at CloudWatch.
If there’s only one thing you take with you when you leave this room, this is it! I don’t think it’s possible to overstate how important this is in cloud architectures. In order to design for cloud systems it’s vital to understand failure. Failure isn’t something to be afraid of – it’s just a fact of life! Let’s put some numbers on failure with a few rough and ready calculations for hard drives. Should stress – these are simplistic calculations which assume an even distribution of failures over time. They’re meant only to be a rough illustration of the differences between the experience of failure in the enterprise vs. the cloud. In reality failure would probably not be as evenly distributed as assumed here!
Let’s look at a hypothetical enterprise application first. Say, in your typical enterprise application you have 10 servers. Say 10 drives per server, including SAN/NAS/backup etc. So 100 drives. Typical enterprise hard drive has an MTBF of around 1.2 million hours. That’s a statistical average across a large population of drives, not a promise that any individual drive lasts 1.2 million hours. Crunch the numbers and you get an expected annual failure rate of 0.73%. So 0.73% of the drives will fail each year. So roughly 1 expected failure in 15 months. At enterprise scale failure is an annual occurrence. So how does that compare to the cloud? Unlike the enterprise application, we don’t know exactly which physical boxes are involved. Our hypothetical application could be running anywhere within a cloud provider’s DC. And it may move between hardware over time. So we need to consider failure across the whole DC.
Microsoft run 600’000 physical servers for their Azure cloud in their Dublin DC. Let’s stick with 10 hard drives per server, so that’s 6 million drives. Clouds normally use cheaper consumer drives – typical MTBF 300’000 hours. Crunch the numbers again and you get an expected annual failure rate of 2.88%. So that’s around 20 expected failures per hour – one every 3 minutes. Or around 215’000 in 15 months. At cloud scale failure is an every-few-minutes occurrence. I’ve seen this at a smaller scale – the last enterprise HPC system I ran had 65 servers at peak, and most months I’d say that at least one of them was failing in some way.
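The arithmetic behind the enterprise and cloud numbers can be checked with a short Python sketch. The notes don’t say which formula was used, so the one here is an assumption: a constant failure rate (exponential) model, AFR = 1 - exp(-8760 / MTBF), which is a common simplification and reproduces the quoted 0.73% and 2.88%. The function names are made up for the example; the drive counts and MTBFs are the ones from the notes.

```python
import math

HOURS_PER_YEAR = 8760


def annual_failure_rate(mtbf_hours: float) -> float:
    """Chance that a drive fails within a year, assuming a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(drives: int, mtbf_hours: float) -> float:
    """Expected number of drive failures per year across the whole fleet."""
    return drives * annual_failure_rate(mtbf_hours)


if __name__ == "__main__":
    # Enterprise: 10 servers x 10 drives, MTBF 1.2 million hours.
    enterprise = expected_failures_per_year(100, 1_200_000)
    # Cloud: 600,000 servers x 10 drives, MTBF 300,000 hours.
    cloud = expected_failures_per_year(6_000_000, 300_000)

    print(f"enterprise AFR: {annual_failure_rate(1_200_000):.2%}")
    print(f"enterprise failures in 15 months: {enterprise * 15 / 12:.2f}")
    print(f"cloud AFR: {annual_failure_rate(300_000):.2%}")
    print(f"cloud failures per hour: {cloud / HOURS_PER_YEAR:.1f}")
    print(f"cloud failures in 15 months: {cloud * 15 / 12:,.0f}")
```

The gap comes far more from fleet size than from drive quality: the per-drive rate differs by a factor of about four, the number of drives by a factor of 60’000.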
Thus there is no SLA for any individual instance. So when we’re designing cloud architectures we must assume that anything could fail at any time. This is why we need cattle instead of pets. And why loose coupling and horizontal scaling are important. They are the means by which you build a system which continues to function in the face of constant and unpredictable failure. Any data that needs to persist must exist in at least three places. So that if one fails you’ve still got two left. AWS S3 and similar services guarantee this. Every component of your architecture must exist in duplicate. Preferably in different physical locations. AWS availability zones guarantee this. And again, this is why bootstrapping is important – automatic recovery from failure. Side note: there’s a detailed article, although now a few years old, from Google on this here: http://storagemojo.com/2007/02/19/googles-disk-failure-experience/
To make sure your architecture is robust enough to withstand failure, test, test, and test again. The only way to properly test is to actually inflict failures on your running, production system. Yes, that’s a scary thing to say. But that’s how it’s done at cloud scale, and should be done everywhere in my opinion. E.g. Google’s engineers periodically inflict disasters on their infrastructure without telling the maintenance people, to see what happens. And by disasters, I mean pulling the plug on whole DCs. Good article on this and other Google DC things here: http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/ To help everyone test in this way, there’s a special tool available – the last animal in the zoo…
The chaos monkey was created by Netflix and is actively used to test their production systems. Once released it goes around randomly terminating instances and otherwise screwing stuff up. If your system continues to function during a chaos monkey attack, it’ll probably survive real failures and disasters!
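To make the idea concrete, here is a toy Python simulation of a chaos-monkey-style test. It is not Netflix’s tool and doesn’t use its interface; the function names, the kill probability and the herd sizes are all assumptions chosen for the example. Each round, instances are terminated at random, the system is checked against a minimum capacity, and replacements are then launched automatically.

```python
import random
from typing import List


def chaos_round(herd: List[int], rng: random.Random,
                kill_probability: float = 0.3) -> List[int]:
    """Randomly terminate instances; return the ones left standing."""
    return [cow for cow in herd if rng.random() >= kill_probability]


def survives_chaos(desired: int = 6, minimum: int = 1, rounds: int = 50,
                   kill_probability: float = 0.3, seed: int = 0) -> bool:
    """True if capacity never drops below `minimum` during the attack."""
    rng = random.Random(seed)      # seeded so a run can be repeated
    herd = list(range(desired))    # instances are just numbers
    next_id = desired
    for _ in range(rounds):
        herd = chaos_round(herd, rng, kill_probability)
        if len(herd) < minimum:
            return False           # the service would have gone down
        # Automated recovery: launch replacements, no human involved.
        while len(herd) < desired:
            herd.append(next_id)
            next_id += 1
    return True


if __name__ == "__main__":
    for size in (2, 6, 12):
        print(size, survives_chaos(desired=size))
```

Varying `desired` and `kill_probability` gives a feel for how much spare capacity a tier needs before random terminations stop mattering.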