CLOUD ARCHITECTURE
CONCEPTS
CHRIS BINGHAM
MAY 2013
THE SERVER ZOO
Model of server types
Applicable beyond the cloud
Courtesy of Tim Bell from CERN
Photo by rbrwr via Flickr
PETS
Photo by chris friese via Flickr
UNIQUE
&
CONFIGURED
BY HAND
Photo by picto:graphic via Flickr
NAMED
&
STATEFUL
Photo by captainsubtle via Flickr
FIXED WHEN BROKEN
Photo by Ruud Hein via Flickr
CATTLE
Photo by twicepix via Flickr
IDENTICAL
&
AUTOMATED
Photo by cwasteson via Flickr
NUMBERED
&
STATELESS
Photo by vonguard via Flickr
Photo by blmurch via Flickr
REPLACED WHEN
BROKEN
Photo by Gusjer via Flickr
COW
+
PERSISTENT
STORAGE
=
HIPPO
Photo by chriswsn via Flickr
COW
+
EXPERIMENTAL
CONFIG
=
CANARY
INSTANCES
VS.
SERVERS
Pets = Servers
Cattle = Instances
Cattle ≠ Pets
∴
Instances ≠ Servers
Photo by wstryder via Flickr
MINIMISE PETS
MAXIMISE CATTLE
More time for
must-have pets
Better service
Do more with less
Photo by aWorldTourer via Flickr
REGULATORS
(SHOULD)
LOVE CATTLE
Highly consistent
Highly testable
Highly change controllable
Highly monitorable
Instant remediation
Photo by gordonplant via Flickr
ANATOMY OF A COW
Bootstrapped
Stateless
Usually Linux
Image by Pearson Scott Foresman via Wikimedia
BOOTSTRAPPING
Photo by neoroma via Flickr
Config OS
Install software
Write config files
Initialise services
At boot time
Without human
input
Photo by Velo Steve via Flickr
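A minimal sketch of the bootstrapping idea, assuming the desired configuration can be expressed as a set of files. The file names, contents, and the `bootstrap` function are hypothetical and stand in for what a tool such as Puppet or Chef would do at boot time. The point it illustrates is that the run needs no human input and is safe to repeat.

```python
import tempfile
from pathlib import Path

# Hypothetical desired state for one instance; names are illustrative only.
DESIRED_FILES = {
    "etc/app.conf": "listen_port = 8080\nrole = web\n",
    "etc/motd": "Built automatically. Do not edit by hand.\n",
}

def bootstrap(root: Path, desired: dict[str, str]) -> list[str]:
    """Write each config file under root if it is missing or different.

    Returns the relative paths that were changed, so a second run on an
    already-configured system returns an empty list.
    """
    changed = []
    for rel_path, content in desired.items():
        target = root / rel_path
        if target.exists() and target.read_text() == content:
            continue  # already in the desired state
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        changed.append(rel_path)
    return changed

# Run twice against a throwaway directory instead of the real filesystem.
with tempfile.TemporaryDirectory() as tmp:
    first_run = bootstrap(Path(tmp), DESIRED_FILES)
    second_run = bootstrap(Path(tmp), DESIRED_FILES)
```

Because the second run changes nothing, the same routine could also be run periodically to apply configuration updates.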
BOOTSTRAPPING TOOLS
Puppet
Chef
Ansible
CFEngine
AWS CloudFormation
OpenStack Heat
Group Policy/System Center
etc. etc. …
STATELESS
Photo by Numinosity (Gary J Wood) via Flickr
No persistent data
Collects state / job
data on boot
Ephemeral storage
Exception: Hippos
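One way to picture "collects state / job data on boot" is an instance that asks a central service what it should be. The sketch below simulates that service with an in-memory table; `ROLE_ASSIGNMENTS`, `Identity`, and `phone_home` are hypothetical names, and a real deployment would make a network call instead.

```python
from dataclasses import dataclass

# Stand-in for a central configuration service (hypothetical data).
ROLE_ASSIGNMENTS = {"i-0001": "web", "i-0002": "worker"}

@dataclass(frozen=True)
class Identity:
    instance_id: str
    role: str

def phone_home(instance_id: str, default_role: str = "worker") -> Identity:
    """Ask the (simulated) config service what this instance should do."""
    role = ROLE_ASSIGNMENTS.get(instance_id, default_role)
    return Identity(instance_id, role)

known = phone_home("i-0001")
replacement = phone_home("i-0099")  # a brand-new instance with no history
```

Nothing about the role lives on the instance itself, so a replacement can pick up where a terminated instance left off.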
USUALLY LINUX
Photo by brian.gratwicke via Flickr
Fewer licensing
considerations
Easier to automate
Easier to image
Smaller footprint
More common at
large scale
ELASTICITY
&
SCALABILITY
Loose coupling
Horizontal scaling
Parallel processing
Monitoring
Photo by rwkvisual via Flickr
LOOSE COUPLING
Tiered architectures
No hostname
dependencies
Asynchronous
communication
Message queuing
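The message-queuing pattern can be sketched in a single process with Python's standard `queue` and `threading` modules. This is only an illustration under simplifying assumptions: a real system would use a message bus between separate instances, and the "work" here (doubling a number) is a placeholder. Any worker may pick up any request, and the producer never waits for a specific worker.

```python
import queue
import threading

requests = queue.Queue()
responses = queue.Queue()

def worker(name: str) -> None:
    """Pick up any pending request, process it, and queue the response."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: no more work for this worker
            requests.task_done()
            break
        responses.put((item, item * 2, name))
        requests.task_done()

workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for t in workers:
    t.start()

for job in range(10):
    requests.put(job)  # fire and forget: no waiting for a reply
for _ in workers:
    requests.put(None)  # one shutdown sentinel per worker
for t in workers:
    t.join()

# Whoever reads the responses need not be whoever made the requests.
results = {}
while not responses.empty():
    job, answer, handled_by = responses.get()
    results[job] = answer
```

Which worker handles which job varies from run to run, and that is the point: no part of the flow depends on a particular worker's identity.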
HORIZONTAL SCALING
More servers, not
bigger servers
Distributed workload
Scale tiers
independently
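Scaling tiers independently comes down to sizing each tier from its own load. A rough sketch, with made-up load and capacity figures and a hypothetical `instances_needed` helper; the minimum of two instances per tier is an assumption chosen so that no tier depends on a single instance.

```python
import math

def instances_needed(load: float, capacity_per_instance: float, minimum: int = 2) -> int:
    """Instances required to serve `load`, never fewer than `minimum`."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    return max(minimum, math.ceil(load / capacity_per_instance))

# Hypothetical per-tier load (requests/s) and per-instance capacity.
tiers = {
    "frontend": (900.0, 100.0),
    "storage": (150.0, 100.0),
}
plan = {name: instances_needed(load, cap) for name, (load, cap) in tiers.items()}
```

Here the front end grows to nine instances while storage stays at two: more servers where they are needed, not bigger servers everywhere.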
PARALLEL PROCESSING
Photo by °Florian via Flickr
Break workload
into many chunks
Process many
chunks at once
Accelerates
processing
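A small sketch of chunked processing using Python's standard `concurrent.futures`. The workload (a sum of squares) and the chunk size are placeholders. Threads are used to keep the example self-contained; for CPU-bound work in CPython a process pool, or separate instances pulling chunks from a queue, would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items: list[int], size: int) -> list[list[int]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def process(chunk: list[int]) -> int:
    """Independent unit of work: here, a sum of squares."""
    return sum(x * x for x in chunk)

data = list(range(1, 101))
with ThreadPoolExecutor(max_workers=4) as pool:
    # Each chunk is processed independently of the others.
    partials = list(pool.map(process, chunked(data, 10)))

# Combine the partial results to complete the task.
total = sum(partials)
```

Each chunk depends on no other chunk, so adding workers is all it takes to process more chunks at once.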
MONITORING
Identify key
metrics
Automate
watching
Log continually
Automate
responses
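Automated responses can be pictured as a loop that watches one key metric and replaces any instance that breaches a threshold. The sketch below is a simulation only: the instance records, the `error_rate` metric, and the 5% threshold are all assumptions made for illustration.

```python
import itertools

_ids = itertools.count(1)

def new_instance() -> dict:
    """Create a simulated instance record (all fields are illustrative)."""
    return {"id": f"cow-{next(_ids)}", "error_rate": 0.0}

def heal(herd: list[dict], max_error_rate: float = 0.05) -> tuple[list[dict], list[str]]:
    """Replace every instance whose key metric breaches the threshold."""
    healthy, terminated = [], []
    for inst in herd:
        if inst["error_rate"] > max_error_rate:
            terminated.append(inst["id"])
            healthy.append(new_instance())  # automated response, no human input
        else:
            healthy.append(inst)
    return healthy, terminated

herd = [new_instance() for _ in range(4)]
herd[1]["error_rate"] = 0.30  # simulate one failing instance
herd, replaced = heal(herd)
```

The herd stays the same size throughout; only the failing member is swapped out.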
MONITORING TOOLS
Photo by C G-K via Flickr
Nagios
Cacti
Ganglia
AWS CloudWatch
System Center
DESIGN FOR FAILURE
This is the most important
concept of all!
Embrace failure!
ENTERPRISE FAILURES
100 drives
MTBF = 1’200’000
hours
AFR ≈ 0.73%
1 failure in ≈15 months
CLOUD FAILURES
6’000’000 drives
MTBF = 300’000 hours
AFR ≈ 2.88%
1 failure in ≈3 minutes
≈215’000 failures in 15
months
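The AFR figures on these two slides are consistent with a constant failure rate model, AFR = 1 − exp(−8760 / MTBF). That formula is an assumption inferred from the numbers, not something the slides state. A rough check under that assumption:

```python
import math

HOURS_PER_YEAR = 8760

def annual_failure_rate(mtbf_hours: float) -> float:
    """AFR under a constant failure rate: 1 - exp(-hours_per_year / MTBF)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

def minutes_between_failures(drives: int, mtbf_hours: float) -> float:
    """Expected minutes between failures across the whole fleet."""
    failures_per_year = drives * annual_failure_rate(mtbf_hours)
    return HOURS_PER_YEAR * 60 / failures_per_year

enterprise_afr = annual_failure_rate(1_200_000)   # about 0.73%
cloud_afr = annual_failure_rate(300_000)          # about 2.88%
cloud_gap_minutes = minutes_between_failures(6_000_000, 300_000)
enterprise_gap_months = (
    minutes_between_failures(100, 1_200_000) / (HOURS_PER_YEAR * 60) * 12
)
```

Under this model the cloud fleet sees a failure roughly every three minutes, while the 100-drive enterprise estate goes about sixteen months between failures, in the same ballpark as the slide's rough figure.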
DESIGN FOR FAILURE
Instances have no SLA
Assume anything can
fail at any time
Backup persistent data
Duplicate everything
TEST EVERYTHING
Create your own
disasters
Unleash the last
animal in the zoo…
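Creating your own disasters can be rehearsed on a toy model before touching anything real. The sketch below is a simulation and not any particular tool's API: a pool of three instances, one random termination per round, and automatic replacement afterwards. The pool size, round count, and seed are arbitrary assumptions.

```python
import random

def serve(pool: list[bool]) -> bool:
    """The service is up if at least one instance in the pool is alive."""
    return any(pool)

def chaos_round(pool: list[bool], rng: random.Random) -> list[bool]:
    """Terminate one randomly chosen instance (mark it dead)."""
    victim = rng.randrange(len(pool))
    return [alive and i != victim for i, alive in enumerate(pool)]

def replace_dead(pool: list[bool]) -> list[bool]:
    """Automated recovery: every dead instance is replaced by a fresh one."""
    return [True for _ in pool]

rng = random.Random(2013)  # fixed seed so the run is repeatable
pool = [True, True, True]
survived = True
for _ in range(100):
    pool = chaos_round(pool, rng)
    survived = survived and serve(pool)  # must stay up during the failure
    pool = replace_dead(pool)
```

With duplicated instances and automatic replacement the service stays up through every round. With a pool of one it would fail on the first.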
CONTACT
E-mail: chris.bingham@datalynx.ch
LinkedIn: ch.linkedin.com/in/binghamchris
Blog: clustersandclouds.wordpress.com

Cloud Architecture Concepts

Editor's Notes

  1. Welcome. Going to cover some concepts and ideas which are key in cloud systems. Not new – they have been around for a long time, particularly in HPC. But they may be unfamiliar if you’re used to traditional enterprise IT.
  2. A rough metaphor for the difference between traditional enterprise architectures and cloud systems. These ideas are not new or unique to the cloud. But cloud does really compel you to use them. With thanks to Tim Bell, who manages the DCs at CERN. Side note: he’s running 15’000 servers with 3 people thanks to the concepts we’ll discuss. 2 main animals in the server zoo.
  3. The first are pets
  4. Each pet is, more or less, unique. They require human intervention to build and set up. May even be entirely built by hand.
  5. Sometimes named – e.g. Starbuck in Novartis. Each pet contains some unique data which must persist. By unique I mean data which is not accessible or available elsewhere, except perhaps via a restore from backup or through some other manual intervention. I.e. it has a state which needs to be maintained.
  6. Because pets are stateful, we scramble to fix them when they break. So we IT folk care a lot about whether our pets are working or not.
  7. Next up: cows!
  8. Cows come in herds. They’re all basically the same. And that’s because they’re all built by other computers – no humans involved.
  9. Because they come in herds, we number them. They only use shared data – no cow holds any unique data. Thus they have no state.
  10. Because they have no state, individual cows don’t matter. We only care about the overall health of the herd. So when a cow breaks, we terminate it and more are automatically built to replace it. Next, a couple of special types of cattle. First: hippos!
  11. Hippos are cows which do have some unique data which must persist. This may sound like a pet, but there is a key difference between a hippo and a pet. A hippo’s data is automatically transferred and so isn’t truly unique to that single hippo. When a hippo breaks, its replacement is automatically given the required data. Thus no restore from backup or other manual intervention is required. The second special cow is the canary!
  12. Canaries are experimental cattle. Key difference from hippos and cattle is that they have a new, untested configuration. Thus canaries are your dev/test/QA etc. environment. So now we know the difference between pets and cattle, let’s map that to technical terminology.
  13. Traditional IT architectures use servers. These are the types of systems we’ve all been working on for many years. Typically you’re concerned about keeping them up, so you perform maintenance and troubleshooting to keep them running. Usually because they hold some unique state data. Thus servers are pets. Cloud systems use instances. Instances are fully automated. No single instance is ever guaranteed to survive for any significant length of time (more on this later). But we don’t care about individual instances, only the health of our pool of instances overall. Thus instances are cattle. So, as cattle are not pets, and vice versa… Instances are not servers – these two are fundamentally different! This is a very, very important concept to grasp. Treating instances as servers won’t work in the long run. It’ll also negate the cost and operational efficiency benefits of cloud architectures.
  14. So a core design goal for cloud systems should be to get rid of pets and have lots of cattle instead! Due to the automated, low-maintenance nature of cattle, this means we can spend less time firefighting, and more on building and improving our applications/systems/services/etc.
  15. I would argue that cattle are also good for regulators. Automation makes them easy to test and manage en masse. It also makes them highly homogeneous. Which makes them easy to monitor. And makes anomalies/issues/security breaches easier to spot. And if an issue is spotted on one cow, remediation takes minutes. Terminate it and get another one.
  16. So let’s look at how to build a cow. There are three key things I’d highlight for this.
  17. Bootstrapping is automating the build and config of a system. It can do anything you would normally do by hand. It’s normally done as the instance boots. May also run periodically and apply configuration updates. Key element of bootstrapping – once the config has been bootstrapped, no human input should be required at all to build a new instance!
  18. There are many mature, stable tools available for bootstrapping. AWS has a specific feature for this type of thing – CloudFormation. OpenStack will have a CloudFormation-compatible counterpart later this year called Heat. Pick your own poison – doesn’t particularly matter which tool you use, so long as you’re bootstrapping!
  19. As mentioned before, cattle are stateless. With the exception of hippos. Typically they have only ephemeral storage. A cow’s storage and its contents disappear when the cow is terminated. So each cow has to collect the data it needs to operate as it boots. An ideal cow boots with Just Enough OS and then “phones home” to ask “who am I and what should I do?”
  20. Cattle almost always run Linux. Windows is poorly suited to cattle. It’s much harder to bootstrap away all human input on first boot. Windows management systems tend to be host-name sensitive, because AD is. Each Windows server has a truly unique identity – which I would count as state. Thus I would consider Windows an inherently stateful OS. It has a much heavier base resource footprint. It’s exceptionally rare at truly large scale. Hint: Enterprise IT is not large scale! (more on this later)
  21. So now we know more about cattle, let’s talk about broader cloud architecture principles. Here there are four key things I’d like to call out.
  22. Loose coupling is the exact opposite of most enterprise architectures I’ve seen deployed. Systems should be split into layers – a.k.a. tiers. The identities of individual instances within each tier must not matter. Rule of thumb – if any part of your architecture depends on some system having a particular FQDN, then it’s NOT loosely coupled! Communication must be asynchronous. Requests should be made between the tiers and systems without any waiting for a response. Normally done via a message bus and message queuing. This is another key thing to wrap your head around. Quick, simplified overview of message queuing: Each request from one instance to another is put in a queue. Any instance capable of answering the request can pick up the request message. The response goes back into the message queue. Any instance capable of processing the response can pick up the response message. Thus the instance which processes the response may not be the same one that made the request. Again – stateless systems!
  23. Again, exact opposite of most enterprise architectures in reality. Traditional approach is vertical scaling, a.k.a. scale up. Add more RAM, CPUs, spindles, etc. to improve performance. Cloud approach is horizontal scaling, a.k.a. scale out. Add more instances to improve performance, i.e. scale by getting more cows, not by making your cows fatter! This is enabled by loose coupling and the distribution of workload. Done right, it means you can scale each tier of your architecture separately. E.g. scaling your storage tier without scaling your front end tier as well.
  24. Parallel processing within your applications is key to enabling loose coupling and horizontal scaling. In turn, loosely coupling and scaling horizontally enable greater parallelisation of your processing. General idea is to break each task/request/action/etc. down into the smallest possible/practical chunks. Ideally each chunk should be independent of all other chunks. Process all the chunks at once, combine the results to complete the task/request/etc. Again, scaling out not up! E.g. instead of getting a faster individual CPU to improve performance, get more CPUs.
  25. Still need to monitor cloud systems. But emphasis changes. Again, don’t care about individual instances. Monitor the health of the herd instead. Another key difference is automation of responses. If a problem is detected, it should be fixed automatically. Again, no human intervention. Often accomplished by terminating the failing instances and starting new ones.
  26. Again, many mature and stable monitoring tools exist. On AWS look at CloudWatch.
  27. If there’s only one thing you take with you when you leave this room, this is it! I don’t think it’s possible to overstate how important this is in cloud architectures. In order to design for cloud systems it’s vital to understand failure. Failure isn’t something to be afraid of – it’s just a fact of life! Let’s put some numbers on failure with a few rough and ready calculations for hard drives. Should stress – these are simplistic calculations which assume an even distribution of failures over time. They’re meant only to be a rough illustration of the differences between the experience of failure in the enterprise vs. the cloud. In reality failure would probably not be as evenly distributed as assumed here!
  28. Let’s look at a hypothetical enterprise application first. Say, in your typical enterprise application you have 10 servers. Say 10 drives per server, including SAN/NAS/backup etc. So 100 drives. Typical enterprise hard drive has an MTBF of around 1.2 million hours. That is a statistical average across a large population of drives, not a promise that any individual drive lasts 1.2 million hours. Crunch the numbers and you get an expected annual failure rate of 0.73%. So 0.73% of the drives will fail each year. So roughly 1 expected failure in 15 months. At enterprise scale failure is roughly an annual occurrence. So how does that compare to the cloud? Unlike the enterprise application, we don’t know exactly which physical boxes are involved. Our hypothetical application could be running anywhere within a cloud provider’s DC. And it may move between hardware over time. So we need to consider failure across the whole DC.
  29. Microsoft run 600’000 physical servers for their Azure cloud in their Dublin DC. Let’s stick with 10 hard drives per server, so that’s 6 million drives. Clouds normally use cheaper consumer drives – typical MTBF 300’000 hours. Crunch the numbers again and you get an expected annual failure rate of 2.88%. So that’s around 20 expected failures per hour – one every 3 minutes. Or around 215’000 in 15 months. At cloud scale failure is an every-few-minutes occurrence. I’ve seen this at a smaller scale – the last enterprise HPC system I ran had 65 servers at peak, and most months I’d say that at least one of them was failing in some way.
  30. Thus there is no SLA for any individual instance. So when we’re designing cloud architectures we must assume that anything could fail at any time. This is why we need cattle instead of pets. And why loose coupling and horizontal scaling are important. They are the means by which you build a system which continues to function in the face of constant and unpredictable failure. Any data that needs to persist must exist in at least three places. So that if one fails you’ve still got two left. AWS S3 and similar services guarantee this. Every component of your architecture must exist in duplicate. Preferably in different physical locations. AWS availability zones guarantee this. And again, this is why bootstrapping is important – automatic recovery from failure. Side note: there’s a detailed article, although now a few years old, from Google on this here: http://storagemojo.com/2007/02/19/googles-disk-failure-experience/
  31. To make sure your architecture is robust enough to withstand failure, test, test, and test again. The only way to properly test is to actually inflict failures on your running, production system. Yes, that’s a scary thing to say. But that’s how it’s done at cloud scale, and should be done everywhere in my opinion. E.g. Google’s engineers periodically inflict disasters on their infrastructure without telling the maintenance people, to see what happens. And by disasters, I mean pulling the plug on whole DCs. Good article on this and other Google DC things here: http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/ To help everyone test in this way, there’s a special tool available – the last animal in the zoo…
  32. The chaos monkey was created by Netflix and is actively used to test their production systems. Once released it goes around randomly terminating instances and otherwise screwing stuff up. If your system continues to function during a chaos monkey attack, it’ll probably survive real failures and disasters!
  33. Thank you, hope it’s been useful!