Cloud Architecture Concepts

This is an internal talk I gave within Datalynx in May 2013. It’s an introduction to the ideas and concepts involved in building cloud systems and applications, aimed at technical people who are new to the cloud. It also compares and contrasts these ideas with those we’re used to in “traditional” enterprise IT systems.

Notes on the “Hippo” slide: The inclusion of persistent storage here may make it sound like a hippo and a pet are essentially the same; however, to my mind there is a key difference between them. A pet’s persistent data is not transferable to other pets without some sort of intervention, be it a restore from backup, a manual copy, or similar. Typically it would be stored on disks which are not immediately and automatically accessible if the pet is offline or not functioning. A hippo’s persistent data, by contrast, is automatically transferred to the hippo’s successor instance if the original hippo dies or otherwise ceases to function. Typically the data would exist on a storage mechanism such as AWS EBS or OpenStack Cinder volumes, whose lifecycle is separate from that of the hippo and which are immediately and automatically accessible by other instances if the hippo dies.

Notes on the “Design for Failure” section: This section provides a few rough and ready calculations for failure rates of hard drives. The calculations here are quite simplistic and assume an even distribution of failures over time. Obviously this isn’t the case in reality; however, the idea here is to provide a rough illustration of the differences in how we experience failure between enterprise IT systems and cloud systems. The “Enterprise Failures” slide is based on a hypothetical application where 10 servers are involved in its delivery and each server has 10 drives (including SAN/NAS/backup systems etc.). It also assumes that “enterprise class” drives with high MTBFs have been used. The “Cloud Failures” slide is based on numbers for Microsoft’s Windows Azure data centre in Dublin, which houses around 600’000 servers, and again assumes an average of 10 drives per server. It also assumes that consumer drives with low MTBFs have been used. My overriding aim here was to express, to technical people who are not used to truly large scale systems, why they need to take the attitude of assuming that anything can fail at any time, and to realise that the implicit assumption of hardware reliability that is often applied in enterprise IT doesn’t map onto the cloud.

CLOUD ARCHITECTURE
CONCEPTS
CHRIS BINGHAM
MAY 2013

THE SERVER ZOO
Model of server types
Applicable beyond the cloud
Courtesy of Tim Bell from CERN
Photo by rbrwr via Flickr

PETS
Photo by chris friese via Flickr

UNIQUE
&
CONFIGURED
BY HAND
Photo by picto:graphic via Flickr

NAMED
&
STATEFUL
Photo by captainsubtle via Flickr

FIXED WHEN BROKEN
Photo by Ruud Hein via Flickr

CATTLE
Photo by twicepix via Flickr

IDENTICAL
&
AUTOMATED
Photo by cwasteson via Flickr

NUMBERED
&
STATELESS
Photo by vonguard via Flickr

Photo by blmurch via Flickr
REPLACED WHEN
BROKEN

Photo by Gusjer via Flickr
COW
+
PERSISTENT
STORAGE
=
HIPPO

Photo by chriswsn via Flickr
COW
+
EXPERIMENTAL
CONFIG
=
CANARY

INSTANCES
VS.
SERVERS
Pets = Servers
Cattle = Instances
Cattle ≠ Pets
∴
Instances ≠ Servers
Photo by wstryder via Flickr

MINIMISE PETS
MAXIMISE CATTLE
More time for
must-have pets
Better service
Do more with less
Photo by aWorldTourer via Flickr

REGULATORS
(SHOULD)
LOVE CATTLE
Highly consistent
Highly testable
Highly change controllable
Highly monitorable
Instant remediation
Photo by gordonplant via Flickr

ANATOMY OF A COW
Bootstrapped
Stateless
Usually Linux
Image by Pearson Scott Foresman via Wikimedia

BOOTSTRAPPING
Photo by neoroma via Flickr
Config OS
Install software
Write config files
Initialise services
At boot time
Without human
input

Photo by Velo Steve via Flickr
BOOTSTRAPPING TOOLS
Puppet
Chef
Ansible
CFEngine
AWS CloudFormation
OpenStack Heat
Group Policy/System Center
etc. etc. …

STATELESS
Photo by Numinosity (Gary J Wood) via Flickr
No persistent data
Collects state / job
data on boot
Ephemeral storage
Exception: Hippos

USUALLY LINUX
Photo by brian.gratwicke via Flickr
Fewer licensing
considerations
Easier to automate
Easier to image
Smaller footprint
More common at
large scale

ELASTICITY
&
SCALABILITY
Loose coupling
Horizontal scaling
Parallel processing
Monitoring
Photo by rwkvisual via Flickr

LOOSE COUPLING
Tiered architectures
No hostname
dependencies
Asynchronous
communication
Message queuing

HORIZONTAL SCALING
More servers, not
bigger servers
Distributed workload
Scale tiers
independently

PARALLEL PROCESSING
Photo by °Florian via Flickr
Break workload
into many chunks
Process many
chunks at once
Accelerates
processing

MONITORING
Identify key
metrics
Automate
watching
Log continually
Automate
responses

MONITORING TOOLS
Photo by C G-K via Flickr
Nagios
Cacti
Ganglia
AWS CloudWatch
System Center

DESIGN FOR FAILURE
This is the most important
concept of all!
Embrace failure!

ENTERPRISE FAILURES
100 drives
MTBF = 1’200’000
hours
AFR ≈ 0.73%
1 failure in ≈15 months

CLOUD FAILURES
6’000’000 drives
MTBF = 300’000 hours
AFR ≈ 2.88%
1 failure in ≈3 minutes
≈215’000 failures in 15
months

DESIGN FOR FAILURE
Instances have no SLA
Assume anything can
fail at any time
Backup persistent data
Duplicate everything

TEST EVERYTHING
Create your own
disasters
Unleash the last
animal in the zoo…

CONTACT
E-mail: chris.bingham@datalynx.ch
LinkedIn: ch.linkedin.com/in/binghamchris
Blog: clustersandclouds.wordpress.com

Editor's Notes

Welcome. Going to cover some concepts and ideas which are key in cloud systems. They are not new: they have been around for a long time, particularly in HPC. But they may be unfamiliar if you’re used to traditional enterprise IT.
A rough metaphor for the difference between traditional enterprise architectures and cloud systems. These ideas are not new or unique to the cloud, but cloud does really compel you to use them. With thanks to Tim Bell, who manages the DCs at CERN. Side note: he’s running 15’000 servers with 3 people thanks to the concepts we’ll discuss. There are 2 main animals in the server zoo.
The first are pets.
Each pet is, more or less, unique. Pets require human intervention to build and set up, and may even be built entirely by hand.
Sometimes named, e.g. Starbuck in Novartis. Each pet contains some unique data which must persist. By unique I mean data which is not accessible or available elsewhere, except perhaps via a restore from backup or through some other manual intervention. I.e. it has a state which needs to be maintained.
Because pets are stateful, we scramble to fix them when they break. So we IT folk care a lot about whether our pets are working or not.
Next up: cows!
Cows come in herds. They’re all basically the same, and that’s because they’re all built by other computers, with no humans involved.
Because they come in herds, we number them. They only use shared data: no cow holds any unique data. Thus they have no state.
Because they have no state, individual cows don’t matter. We only care about the overall health of the herd. So when a cow breaks, we terminate it and more are automatically built to replace it. Next, a couple of special types of cattle. First: hippos!
Hippos are cows which do have some unique data which must persist. This may sound like a pet, but there is a key difference between a hippo and a pet. A hippo’s data is automatically transferred and so isn’t truly unique to that single hippo. When a hippo breaks, its replacement is automatically given the required data. Thus no restore from backup or other manual intervention is required. The second special cow is the canary!
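
The hippo hand-over described in the note above can be sketched in a few lines of Python. This is only an illustrative model, not how any particular cloud implements it: the `Volume`, `Hippo` and `replace_hippo` names are hypothetical, and the "volume" is just an in-memory object standing in for something like an EBS or Cinder volume.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Volume:
    """Persistent storage whose lifecycle is separate from any instance."""
    volume_id: str
    data: dict = field(default_factory=dict)
    attached_to: Optional[str] = None


@dataclass
class Hippo:
    """A cow that carries a persistent volume (hypothetical model)."""
    instance_id: str
    volume: Optional[Volume] = None

    def attach(self, volume: Volume) -> None:
        volume.attached_to = self.instance_id
        self.volume = volume


def replace_hippo(dead: Hippo, new_instance_id: str) -> Hippo:
    """Build a successor and hand it the dead hippo's volume automatically."""
    volume = dead.volume
    dead.volume = None                      # the dead instance loses the volume
    successor = Hippo(new_instance_id)
    if volume is not None:
        successor.attach(volume)            # no restore from backup needed
    return successor


vol = Volume("vol-0001")
first = Hippo("hippo-1")
first.attach(vol)
first.volume.data["orders"] = 42

second = replace_hippo(first, "hippo-2")
print(second.volume.data, vol.attached_to)
```

The point of the sketch is that the data outlives `hippo-1` without anyone copying it by hand; a pet, by contrast, would need a restore or manual copy at the `replace_hippo` step.
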
Canaries are experimental cattle. The key difference from hippos and cattle is that they have a new, untested configuration. Thus canaries are your dev/test/QA etc. environment. So now we know the difference between pets and cattle, let’s map that to technical terminology.
Traditional IT architectures use servers. These are the types of systems we’ve all been working on for many years. Typically you’re concerned about keeping them up, so you perform maintenance and troubleshooting to keep them running, usually because they hold some unique state data. Thus servers are pets. Cloud systems use instances. Instances are fully automated. No single instance is ever guaranteed to survive for any significant length of time (more on this later). But we don’t care about individual instances, only the health of our pool of instances overall. Thus instances are cattle. So, as cattle are not pets, and vice versa, instances are not servers: these two are fundamentally different! This is a very, very important concept to grasp. Treating instances as servers won’t work in the long run. It’ll also negate the cost and operational efficiency benefits of cloud architectures.
So a core design goal for cloud systems should be to get rid of pets and have lots of cattle instead! Due to the automated, low-maintenance nature of cattle, this means we can spend less time firefighting and more on building and improving our applications/systems/services/etc.
I would argue that cattle are also good for regulators. Automation makes them easy to test and manage en masse. It also makes them highly homogeneous, which makes them easy to monitor and makes anomalies/issues/security breaches easier to spot. And if an issue is spotted on one cow, remediation takes minutes: terminate it and get another one.
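
One way to picture why a homogeneous herd makes anomalies easy to spot is a simple configuration fingerprint check. The sketch below is an assumption-laden illustration rather than a real compliance tool: the instance names, config keys and the `find_anomalies` helper are all hypothetical, and the "expected" configuration is simply taken to be whatever most of the herd has.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def fingerprint(config: Dict[str, str]) -> str:
    """Hash a config in a canonical key order so equal configs hash equally."""
    canonical = "\n".join(f"{key}={config[key]}" for key in sorted(config))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_anomalies(herd: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the instances whose config differs from the majority of the herd."""
    prints = {instance_id: fingerprint(cfg) for instance_id, cfg in herd.items()}
    expected, _ = Counter(prints.values()).most_common(1)[0]
    return sorted(iid for iid, fp in prints.items() if fp != expected)


baseline = {"ssh_root_login": "no", "app_version": "1.4.2"}
herd = {f"cow-{n}": dict(baseline) for n in range(1, 6)}
herd["cow-3"]["ssh_root_login"] = "yes"     # one cow has drifted

anomalies = find_anomalies(herd)
print(anomalies)
```

With pets, every machine would have its own fingerprint and a check like this would flag everything; with cattle, a single outlier stands out and can be terminated and replaced.
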
So let’s look at how to build a cow. There are three key things I’d highlight for this.
Bootstrapping is automating the build and config of a system. It can do anything you would normally do by hand. It’s normally done as the instance boots. It may also run periodically and apply configuration updates. Key element of bootstrapping: once the config has been bootstrapped, no human input should be required at all to build a new instance!
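
As a rough sketch of what a bootstrap run does, the Python below walks through the four steps from the slide (config OS, install software, write config files, initialise services) against a scratch directory. Everything here is simulated and hypothetical: the `bootstrap` function, the `PACKAGES` mapping and the file names are illustrative only, and a real setup would use one of the tools named on the next slide rather than hand-written code like this.

```python
import json
import tempfile
from pathlib import Path
from typing import List

# Hypothetical role-to-package mapping; a real tool would drive a package manager.
PACKAGES = {"web": ["nginx"], "worker": ["python3"]}


def bootstrap(root: Path, role: str) -> List[str]:
    """Build an instance under `root` with no human input; safe to re-run."""
    steps = []

    # 1. Configure the OS (represented here by writing a hostname file).
    etc = root / "etc"
    etc.mkdir(parents=True, exist_ok=True)
    (etc / "hostname").write_text(f"{role}-instance\n")
    steps.append("config OS")

    # 2. Install software (simulated: record what would be installed).
    installed = sorted(PACKAGES.get(role, []))
    (root / "installed.json").write_text(json.dumps(installed))
    steps.append("install software")

    # 3. Write config files.
    (etc / "app.conf").write_text(f"role={role}\n")
    steps.append("write config files")

    # 4. Initialise services (simulated by a marker file).
    (root / "services.started").write_text("\n".join(installed) + "\n")
    steps.append("initialise services")
    return steps


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first_run = bootstrap(root, "web")
    second_run = bootstrap(root, "web")     # a periodic re-run applies the same config
    files = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )

print(first_run)
print(files)
```

Running the function twice gives the same result, which is the property that lets bootstrapping also "run periodically and apply configuration updates" without a human checking each instance.
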
There are many mature, stable tools available for bootstrapping. AWS has a specific feature for this type of thing: CloudFormation. OpenStack will have a CloudFormation-compatible counterpart later this year called Heat. Pick your own poison: it doesn’t particularly matter which tool you use, so long as you’re bootstrapping!
As mentioned before, cattle are stateless, with the exception of hippos. Typically they have only ephemeral storage. A cow’s storage and its contents disappear when the cow is terminated. So each cow has to collect the data it needs to operate as it boots. An ideal cow boots with Just Enough OS and then “phones home” to ask “who am I and what should I do?”
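
The “phone home” idea can be sketched as a tiny in-memory service that hands each booting cow a role. This is a toy model under stated assumptions: the `Home` class, its `who_am_i` method, the role names and the queue naming are all hypothetical, and a real system would ask a metadata or configuration service over the network instead.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Assignment:
    role: str
    config: Dict[str, str]


class Home:
    """Hypothetical central service that tells booting cows what to be."""

    def __init__(self, desired: Dict[str, int]) -> None:
        self.desired = desired                  # role -> instances wanted
        self.members: Dict[str, str] = {}       # instance id -> assigned role

    def who_am_i(self, instance_id: str) -> Assignment:
        # Count current members per role, then fill the biggest shortfall.
        counts = {role: 0 for role in self.desired}
        for role in self.members.values():
            counts[role] += 1
        role = max(self.desired, key=lambda r: self.desired[r] - counts[r])
        self.members[instance_id] = role
        return Assignment(role, {"queue": f"{role}-jobs"})


home = Home({"web": 2, "worker": 1})
roles = [home.who_am_i(f"i-{n}").role for n in range(3)]
print(roles)
```

The instance itself carries nothing that identifies it in advance; everything it needs arrives in the `Assignment`, so it can be thrown away and replaced at any time.
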
Cattle almost always run Linux. Windows is poorly suited to cattle. It’s much harder to bootstrap away all human input on first boot. Windows management systems tend to be hostname-sensitive, because AD is. Each Windows server has a truly unique identity, which I would count as state. Thus I would consider Windows an inherently stateful OS. It has a much heavier base resource footprint. It’s exceptionally rare at truly large scale. Hint: enterprise IT is not large scale! (more on this later)
So now we know more about cattle, let’s talk about broader cloud architecture principles. Here there are four key things I’d like to call out.
Loose coupling is the exact opposite of most enterprise architectures I’ve seen deployed. Systems should be split into layers, a.k.a. tiers. The identities of individual instances within each tier must not matter. Rule of thumb: if any part of your architecture depends on some system having a particular FQDN, then it’s NOT loosely coupled! Communication must be asynchronous. Requests should be made between the tiers and systems without any waiting for a response. Normally done via a message bus and message queuing. This is another key thing to wrap your head around. Quick, simplified overview of message queuing: each request from one instance to another is put in a queue. Any instance capable of answering the request can pick up the request message. The response goes back into the message queue. Any instance capable of processing the response can pick up the response message. Thus the instance which processes the response may not be the same one that made the request. Again: stateless systems!
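
The simplified message queuing flow in the note above can be mimicked on a single machine with Python’s standard `queue` and `threading` modules. This is only a sketch: the threads stand in for instances, the two in-process queues stand in for a real message bus, and the worker names and payloads are made up for illustration.

```python
import queue
import threading

requests = queue.Queue()    # stands in for the request queue on a message bus
responses = queue.Queue()   # stands in for the response queue


def worker(name: str) -> None:
    """Any worker may pick up any request; none is addressed by hostname."""
    while True:
        message = requests.get()
        if message is None:                 # sentinel: shut this worker down
            break
        request_id, payload = message
        responses.put((request_id, payload.upper(), name))


workers = [threading.Thread(target=worker, args=(f"app-{n}",)) for n in range(3)]
for thread in workers:
    thread.start()

# The sender puts requests on the queue and does not wait for answers.
for request_id, payload in enumerate(["alpha", "beta", "gamma", "delta"]):
    requests.put((request_id, payload))

for _ in workers:                           # one sentinel per worker
    requests.put(None)
for thread in workers:
    thread.join()

# Whoever collects the responses need not be whoever sent the requests.
results = {}
handlers = set()
while not responses.empty():
    request_id, body, handled_by = responses.get()
    results[request_id] = body
    handlers.add(handled_by)

print(results)
```

Which worker handles which request varies from run to run, and nothing in the sender depends on it; that indifference to identity is the loose coupling the note describes.
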
Again, the exact opposite of most enterprise architectures in reality. The traditional approach is vertical scaling, a.k.a. scale up: add more RAM, CPUs, spindles, etc. to improve performance. The cloud approach is horizontal scaling, a.k.a. scale out: add more instances to improve performance, i.e. scale by getting more cows, not by making your cows fatter! This is enabled by loose coupling and the distribution of workload. Done right, it means you can scale each tier of your architecture separately, e.g. scaling your storage tier without scaling your front-end tier as well.
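
Scaling tiers independently comes down to sizing each tier from its own load and its own per-instance capacity. The arithmetic can be sketched as below; the tier names, load figures, capacities and the minimum of two instances are all invented for illustration and are not from the talk.

```python
import math
from typing import Dict, Tuple


def instances_needed(load: float, capacity_per_instance: float, minimum: int = 2) -> int:
    """Scale out: add instances until the tier's load is covered."""
    return max(minimum, math.ceil(load / capacity_per_instance))


# tier -> (requests per second to serve, requests per second one instance handles)
tiers: Dict[str, Tuple[float, float]] = {
    "web": (1200, 500),
    "app": (1200, 150),
    "storage": (300, 200),
}

plan = {name: instances_needed(load, cap) for name, (load, cap) in tiers.items()}
print(plan)

# Doubling only the app tier's load changes only the app tier's instance count.
tiers["app"] = (2400, 150)
new_plan = {name: instances_needed(load, cap) for name, (load, cap) in tiers.items()}
print(new_plan)
```

In a vertically scaled design the equivalent change would mean buying a bigger box for the whole application; here only the tier under pressure grows.
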
Parallel processing within your applications is key to enabling loose coupling and horizontal scaling. In turn, loose coupling and horizontal scaling enable greater parallelisation of your processing. The general idea is to break each task/request/action/etc. down into the smallest possible/practical chunks. Ideally each chunk should be independent of all other chunks. Process all the chunks at once, then combine the results to complete the task/request/etc. Again, scaling out not up! E.g. instead of getting a faster individual CPU to improve performance, get more CPUs.
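As a rough sketch of the chunk, process, and combine pattern described above, here is a small example using Python's standard `concurrent.futures`. The task (a sum of squares) and the function names are invented for illustration. A thread pool on one machine only shows the shape of the idea; in a cloud system each chunk would more likely be handed to a separate instance.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Break the work into independent chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    """Hypothetical unit of work: depends only on its own chunk."""
    return sum(value * value for value in chunk)


def parallel_sum_of_squares(items, chunk_size=4, max_workers=4):
    chunks = split_into_chunks(items, chunk_size)
    # Process all chunks concurrently; no chunk needs another chunk's result.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine the partial results to complete the overall task.
    return sum(partial_results)
```

Because the chunks are independent, adding more workers (or more instances) is the way to go faster, rather than making any one worker bigger.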
You still need to monitor cloud systems, but the emphasis changes. Again, we don’t care about individual instances; monitor the health of the herd instead. Another key difference is automation of responses: if a problem is detected, it should be fixed automatically. Again, no human intervention. This is often accomplished by terminating the failing instances and starting new ones.
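To make the "monitor the herd, replace the sick" idea concrete, here is a minimal in-memory sketch. Everything in it is hypothetical: the `Instance` and `Herd` classes and the `i-0001` style IDs are stand-ins for illustration, and no real cloud API is called.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


class Herd:
    """A group of interchangeable instances kept at a desired size."""

    def __init__(self, desired_size):
        self._ids = count(1)
        self.desired_size = desired_size
        self.instances = [self._launch() for _ in range(desired_size)]

    def _launch(self):
        # Stand-in for starting a fresh, bootstrapped instance.
        return Instance(instance_id=f"i-{next(self._ids):04d}")

    def healthy_fraction(self):
        """Herd-level health metric: share of instances reporting healthy."""
        if not self.instances:
            return 0.0
        return sum(i.healthy for i in self.instances) / len(self.instances)

    def heal(self):
        """Drop unhealthy instances, launch replacements, return how many were dropped."""
        survivors = [i for i in self.instances if i.healthy]
        replaced = len(self.instances) - len(survivors)
        while len(survivors) < self.desired_size:
            survivors.append(self._launch())
        self.instances = survivors
        return replaced
```

Note that `heal` never tries to repair a failing instance. It throws it away and starts a new one, which only works because the instances are cattle.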
Again, many mature and stable monitoring tools exist. On AWS, look at CloudWatch.
If there’s only one thing you take with you when you leave this room, this is it! I don’t think it’s possible to overstate how important this is in cloud architectures. In order to design for cloud systems it’s vital to understand failure. Failure isn’t something to be afraid of; it’s just a fact of life! Let’s put some numbers on failure with a few rough and ready calculations for hard drives. I should stress that these are simplistic calculations which assume an even distribution of failures over time. They’re meant only to be a rough illustration of the differences between the experience of failure in the enterprise vs. the cloud. In reality failure would probably not be as evenly distributed as assumed here!
Let’s look at a hypothetical enterprise application first. Say your typical enterprise application has 10 servers, with 10 drives per server, including SAN/NAS/backup etc. So 100 drives. A typical enterprise hard drive has an MTBF of around 1.2 million hours. MTBF is a statistical average across a large population of drives; it does not mean that any individual drive lasts 1.2 million hours. Crunch the numbers and you get an expected annual failure rate of 0.73%, i.e. around 0.73% of the drives will fail each year. So roughly 1 expected failure in 15 months. At enterprise scale, failure is roughly an annual occurrence. So how does that compare to the cloud? Unlike the enterprise application, we don’t know exactly which physical boxes are involved. Our hypothetical application could be running anywhere within a cloud provider’s DC, and it may move between hardware over time. So we need to consider failure across the whole DC.
Microsoft run 600’000 physical servers for their Azure cloud in their Dublin DC. Let’s stick with 10 hard drives per server, so that’s 6 million drives. Clouds normally use cheaper consumer drives, with a typical MTBF of 300’000 hours. Crunch the numbers again and you get an expected annual failure rate of 2.88%. So that’s around 20 expected failures per hour, one every 3 minutes, or around 215’000 in 15 months. At cloud scale, failure happens every few minutes. I’ve seen this at a smaller scale: the last enterprise HPC system I ran had 65 servers at peak, and most months I’d say that at least one of them was failing in some way.
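For anyone who wants to crunch the numbers themselves, here is one way to do it. The notes don't say which formula was used, so this sketch assumes a constant failure rate (an exponential model), where the chance of a drive failing within a year is 1 - exp(-8760 / MTBF). Under that assumption the results land on the 0.73% and 2.88% figures quoted for the two scenarios. The function names are made up for the example.

```python
import math

HOURS_PER_YEAR = 8760


def annual_failure_rate(mtbf_hours):
    """Chance that a drive fails within one year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(drive_count, mtbf_hours):
    """Expected number of drive failures per year across the whole population."""
    return drive_count * annual_failure_rate(mtbf_hours)


# Enterprise scenario: 10 servers x 10 drives, MTBF 1.2 million hours.
enterprise = expected_failures_per_year(100, 1_200_000)
# Cloud scenario: 600,000 servers x 10 drives, MTBF 300,000 hours.
cloud = expected_failures_per_year(6_000_000, 300_000)

print(f"Enterprise: AFR {annual_failure_rate(1_200_000):.2%}, "
      f"{enterprise:.2f} failures per year")
print(f"Cloud: AFR {annual_failure_rate(300_000):.2%}, "
      f"{cloud / HOURS_PER_YEAR:.1f} failures per hour")
```

The failure rate per drive only differs by a factor of about four between the two scenarios. The difference in experience comes almost entirely from the number of drives.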
Thus there is no SLA for any individual instance. So when we’re designing cloud architectures we must assume that anything could fail at any time. This is why we need cattle instead of pets, and why loose coupling and horizontal scaling are important: they are the means by which you build a system which continues to function in the face of constant and unpredictable failure. Any data that needs to persist must exist in at least three places, so that if one fails you’ve still got two left. AWS S3 and similar services guarantee this. Every component of your architecture must exist in duplicate, preferably in different physical locations. AWS availability zones guarantee this. And again, this is why bootstrapping is important: automatic recovery from failure. Side note: there’s a detailed article, although now a few years old, from Google on this here: http://storagemojo.com/2007/02/19/googles-disk-failure-experience/
To make sure your architecture is robust enough to withstand failure, test, test, and test again. The only way to properly test is to actually inflict failures on your running, production system. Yes, that’s a scary thing to say. But that’s how it’s done at cloud scale, and should be done everywhere in my opinion. E.g. Google’s engineers periodically inflict disasters on their infrastructure without telling the maintenance people, to see what happens. And by disasters, I mean pulling the plug on whole DCs. Good article on this and other Google DC things here: http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/ To help everyone test in this way, there’s a special tool available: the last animal in the zoo…
The chaos monkey was created by Netflix and is actively used to test their production systems. Once released, it goes around randomly terminating instances and otherwise screwing stuff up. If your system continues to function during a chaos monkey attack, it’ll probably survive real failures and disasters!
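The idea is simple enough to sketch as a toy simulation. To be clear, this is not Netflix's tool or its API. It is a made-up, in-memory model (the `chaos_round` and `survives_chaos` functions and the `web-N` instance names are all hypothetical) that randomly kills instances and checks whether enough capacity remains before automated recovery replaces them.

```python
import random


def chaos_round(instances, rng, kill_count=1):
    """Terminate up to kill_count randomly chosen instances; return their names."""
    victims = rng.sample(sorted(instances), k=min(kill_count, len(instances)))
    for victim in victims:
        instances.discard(victim)
    return victims


def survives_chaos(instance_count, minimum_needed, rounds, seed=0):
    """Return True if capacity never drops below minimum_needed during the attack."""
    rng = random.Random(seed)
    instances = {f"web-{i}" for i in range(instance_count)}
    next_id = instance_count
    for _ in range(rounds):
        chaos_round(instances, rng)
        if len(instances) < minimum_needed:
            return False  # the system could not tolerate this failure
        # Automated recovery: launch replacements for whatever was terminated.
        while len(instances) < instance_count:
            instances.add(f"web-{next_id}")
            next_id += 1
    return True
```

In this model a tier of three instances that needs two to function survives any number of single kills, while a lone pet does not survive the first one.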
Thank you, hope it’s been useful!

Editor's Notes