This is an internal talk I gave within Datalynx in May 2013.
It’s an introduction to the ideas and concepts involved in building cloud systems and applications for technical people who are new to the cloud. It also compares and contrasts these ideas to those we’re used to in “traditional” enterprise IT systems.
Notes on the “Hippo” slide:
The inclusion of persistent storage here may make it sound like a hippo and a pet are essentially the same. To my mind, however, there is a key difference between them.
A pet’s persistent data is not transferable to other pets without some sort of intervention, be it a restore from backup, a manual copy, or similar. Typically it would be stored on disks which are not immediately and automatically accessible if the pet is offline or not functioning.
A hippo’s persistent data, by contrast, is automatically transferred to the hippo’s successor instance if the original hippo dies or otherwise ceases to function. Typically the data would exist on a storage mechanism such as AWS EBS or OpenStack Cinder volumes, whose lifecycle is separate from that of the hippo and which are immediately and automatically accessible by other instances if the hippo dies.
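The difference can be illustrated with a toy model. This is only a sketch: the Volume and Instance classes and the replace_failed function are hypothetical names invented for the illustration, and they do not correspond to any real AWS or OpenStack API. The only point is that the volume object outlives the instance it was attached to.

```python
class Volume:
    """Persistent storage whose lifecycle is separate from any instance
    (a stand-in for something like an EBS or Cinder volume)."""

    def __init__(self):
        self.data = {}
        self.attached_to = None


class Instance:
    """A server with a local disk and, optionally, an attached volume."""

    def __init__(self, name, volume=None):
        self.name = name
        self.local_disk = {}  # lost when the instance is lost
        self.volume = volume
        if volume is not None:
            volume.attached_to = name


def replace_failed(failed, successor_name):
    """Start a successor for a failed instance.

    Any attached volume is re-attached to the successor automatically.
    The failed instance's local disk is not carried over.
    """
    return Instance(successor_name, volume=failed.volume)


# A pet keeps its state on its own disks.
pet = Instance("pet-1")
pet.local_disk["orders"] = 42
new_pet = replace_failed(pet, "pet-2")

# A hippo keeps its state on a volume that survives it.
hippo = Instance("hippo-1", volume=Volume())
hippo.volume.data["orders"] = 42
new_hippo = replace_failed(hippo, "hippo-2")

print("pet successor sees:", new_pet.local_disk)
print("hippo successor sees:", new_hippo.volume.data)
```

In this model the pet's successor starts empty and would need a restore or manual copy, while the hippo's successor picks up the data with no intervention.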
Notes on the “Design for Failure” section:
This section provides a few rough and ready calculations for failure rates of hard drives. The calculations here are quite simplistic and assume an even distribution of failures over time. Obviously this isn’t the case in reality, but the idea here is to provide a rough illustration of the differences in how we experience failure between enterprise IT systems and cloud systems.
The “Enterprise Failures” slide is based on a hypothetical application where 10 servers are involved in its delivery and each server has 10 drives (including SAN/NAS/backup systems etc.). It also assumes that “enterprise class” drives with high MTBFs have been used.
The “Cloud Failures” slide is based on numbers for Microsoft’s Windows Azure data centre in Dublin, which houses around 600,000 servers, and again assumes an average of 10 drives per server. It also assumes that consumer drives with low MTBFs have been used.
My overriding aim here was to express, to technical people who are not used to truly large scale systems, why they need to take the attitude of assuming that anything can fail at any time, and to realise that the implicit assumption of hardware reliability that is often applied in enterprise IT doesn’t map onto the cloud.
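For anyone who wants to play with the numbers, the kind of calculation behind the two slides can be sketched as follows. The server and drive counts are the ones given above. The MTBF values are placeholders assumed purely for illustration, not the figures used on the slides, and the model assumes the same even distribution of failures over time, so the average gap between failures is simply the MTBF divided by the number of drives.

```python
HOURS_PER_YEAR = 24 * 365


def hours_between_failures(servers, drives_per_server, mtbf_hours):
    """Average gap between drive failures across the whole fleet,
    assuming failures are spread evenly over time."""
    return mtbf_hours / (servers * drives_per_server)


def expected_failures_per_year(servers, drives_per_server, mtbf_hours):
    """Expected number of drive failures per year under the same assumption."""
    return servers * drives_per_server * HOURS_PER_YEAR / mtbf_hours


# ASSUMED values for illustration only; not the numbers from the slides.
ENTERPRISE_MTBF_HOURS = 1_200_000
CONSUMER_MTBF_HOURS = 300_000

# Enterprise: 10 servers with 10 drives each.
enterprise_gap = hours_between_failures(10, 10, ENTERPRISE_MTBF_HOURS)

# Cloud: around 600,000 servers with 10 drives each on average.
cloud_gap = hours_between_failures(600_000, 10, CONSUMER_MTBF_HOURS)
cloud_per_year = expected_failures_per_year(600_000, 10, CONSUMER_MTBF_HOURS)

print(f"Enterprise: one drive failure every {enterprise_gap / 24:.0f} days")
print(f"Cloud: one drive failure every {cloud_gap * 60:.0f} minutes")
print(f"Cloud: about {cloud_per_year:,.0f} drive failures per year")
```

With these assumed MTBFs, the enterprise application sees a drive failure roughly every 500 days, while the data centre sees one every few minutes. The exact figures matter much less than the gap between them.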