Scale and resilience in the cloud
The developer’s Azure Partner
Scaling out doesn’t work how you think
Auto scaling in response to the volume of web requests is slow and not very exact, because you have to measure averages over a time period
You can, of course, scale manually.
Databases and other back-end services quickly become bottlenecks, and some are difficult and/or expensive to scale
Azure SQL and Cosmos DB now both have auto/elastic scale options, but the base cost of those can be higher. It is not just a matter of “turning it on”, as you may end up paying more.
(distributed systems / microservices / buzzword)
Traditional “heavy” writes
[diagram: do something → call service]
Queues for “heavy” writes
(distributed systems / microservices / buzzword)
[diagram: post to queue → read from queue → do something → call service]
Azure queues are incredibly scalable, robust and cheap
Azure Functions or WebJobs can scale very fast based on the length of the queue
Database / IO
Use more than one “database” (a multi-model persistence approach)
Car history checking system
6 million cars with case history
~550 million mileage records for >50 million cars
A few £ per …
Write each line to …
Write a message to a queue for each …
Auto scale starts more servers in seconds – and shuts them down when the queue is empty
To infinity and beyond
You hear an awful lot about the “right” way to develop stuff in the cloud, usually peppered with things like “Netflix does x” or “Amazon does Y”.
There is a lot of good information out there – but you need to take a lot of it with a pinch of salt. Very few of us are developing anything at the same scale as Netflix.
Today, I am going to talk about two very simple things that are really your “101” for scale and resilience in the cloud. It’s a starting point that you can implement very easily on any cloud platform without making big changes to your existing code. These two things are, in my opinion, the biggest immediate difference in *mindset* compared to the on-premise mindset. I look at a lot of software that is already in the cloud, but most of it still has an “on-premise” mindset, so don’t dismiss what I have to say just because you are already cloud-based.
Of course there is a ton more you *can* do – just remember to assess what you actually *need* before building a Netflix-scale architecture to support a blog site. The things I am talking about today you should almost always do, even for very simple apps. If you are planning for millions of users then you are in the wrong talk. Please tune out for the next 14 minutes or so.
In my business, we have been building software on Azure since 2011. Some of it large scale, some of it fairly small. We have had the chance to try out many of the technologies as Azure has introduced them and lived through Azure finding its feet.
For several years we have also been an Azure Gold Partner and reseller, helping developers in other companies make the most of Azure.
That means my colleagues and I get to go into companies, much like yours, to look at how they use Azure and give advice on how to reduce costs, increase scale or improve security.
What I am talking about today are the two core lessons about designing for scale in the cloud that I have learnt over the last decade. I have various other talks coming up that go into much more detail about specific topics.
The first constraint is that auto scale doesn’t respond quickly to web traffic.
The problem is that, in order to respond to an increase in web traffic, you have to measure the number of hits or the CPU/memory load over a period of time before you can make a decision; web servers briefly hit 100% CPU all the time, for example, and traffic often spikes for just a few seconds. You need to measure over several minutes, at least, before you auto scale.
Once your web servers scale out, you now have all these extra servers hitting the database, which may then become overloaded. Databases can be a single point of contention and they don’t scale quickly, not even when scaled manually. Some database operations grow more than linearly with size. It can also be very expensive to scale a database out; some cloud databases quadruple the cost for every doubling in performance.
So, in reality, you need to predict spikes in traffic and scale out ahead of time. That can be costly and will still not save you when you have a sudden, unexpected increase in traffic.
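To make that lag concrete, here is a minimal Python sketch of a scale-out rule driven by an averaged metric. This is not Azure’s actual autoscale engine; the five-sample window and the 100 requests/sec threshold are illustrative assumptions.

```python
from collections import deque

def should_scale_out(samples, window=5, threshold=100.0):
    """Decide to add servers only when the average request rate over
    the last `window` samples exceeds `threshold` (requests/sec).
    A single spike never triggers scaling; only sustained load does,
    which is exactly why reactive autoscale responds slowly."""
    recent = list(samples)[-window:]
    if len(recent) < window:
        return False  # not enough history yet to decide
    return sum(recent) / window > threshold

# One sample per minute: a sharp spike at minute 4 does not trigger
# scaling, because the five-minute average is still below threshold.
rates = deque([20, 25, 30, 300, 40], maxlen=10)
print(should_scale_out(rates))  # -> False

# Only after several minutes of sustained load does the average cross
# the threshold, by which time users have already felt the slowdown.
for rate in (400, 450, 500, 480, 520):
    rates.append(rate)
print(should_scale_out(rates))  # -> True
```

The averaging is not optional: react to a single sample and every transient CPU spike would start (and bill you for) new servers.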
When a user “does” something, like placing an order, registering for an event or reporting an absence, this is how you’d write it. It’s easy.
It doesn’t scale well.
If there is a bug in your code, you lose data.
If the external service is down, you lose data.
If the database is overloaded, you lose data.
My first rule is to take anything that is more than a simple read or write and put it on a queue for a background process to handle: anything that requires complex processing or talking to other services.
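As a sketch of the pattern, using Python’s standard-library queue and a thread as stand-ins for a real Azure queue and an Azure Function, and a made-up `handle_order` example:

```python
import queue
import threading

# The web handler's only job: accept the work, post it to the queue,
# and return to the user immediately.
work_queue = queue.Queue()
processed = []

def handle_order(order):
    """Web-facing handler: enqueue and return fast.
    No database call, no external service, nothing slow or fragile."""
    work_queue.put(order)
    return "accepted"

def worker():
    """Background processor (an Azure Function or WebJob in practice):
    does the heavy work - database writes, calls to other services."""
    while True:
        order = work_queue.get()
        if order is None:  # sentinel to shut the worker down
            break
        processed.append(order)  # stand-in for the real processing
        work_queue.task_done()

t = threading.Thread(target=worker)
t.start()
for i in range(3):
    handle_order({"order_id": i})
work_queue.put(None)
t.join()
print(processed)  # all three orders processed, in order
```

The structure is the point, not the code: the user-facing path only ever touches the queue, and everything slow or failure-prone lives on the other side of it.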
Traditionally, introducing queues into a system was a big decision because you now had to set up and manage a queue system.
With Azure and other cloud providers, these are built-in, very cheap and rock-solid.
You can scale very rapidly based on the length of the queue. Typically you will have new servers up in less than one minute to handle the load.
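The scale-out rule itself is trivial compared with the averaged-metric version above. A sketch, where the 32 messages per instance and the cap of 20 instances are made-up tuning numbers:

```python
import math

def target_instances(queue_length, msgs_per_instance=32, max_instances=20):
    """Queue length maps directly to the number of workers needed:
    no averaging window, so the scale-out decision is near-instant."""
    if queue_length == 0:
        return 0  # scale to zero when the queue is empty
    return min(max_instances, math.ceil(queue_length / msgs_per_instance))

print(target_instances(0))      # -> 0, no work means no workers
print(target_instances(100))    # -> 4
print(target_instances(10000))  # -> 20, capped at the maximum
```

Compare this with CPU-based scaling: the queue length is a direct, current measure of outstanding work, so there is nothing to average.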
This is a huge resilience benefit as well; if there is a bug in your processing code or a back-end service is down, you can fix it and just replay the message, so no data is lost. If you did the processing inside the web app, you would irretrievably lose the data.
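A minimal sketch of that replay behaviour, using Python’s in-process `queue` as a stand-in for an Azure queue’s visibility-timeout mechanics; the flaky handler and the retry limit of three are illustrative:

```python
import queue

def process_with_replay(q, handler, max_attempts=3):
    """Dequeue-and-process loop: a message is only gone for good once
    the handler succeeds. On failure it goes back on the queue,
    mimicking Azure's visibility-timeout replay - no data is lost."""
    results, dead_letters = [], []
    attempts = {}
    while not q.empty():
        msg = q.get()
        try:
            results.append(handler(msg))
        except Exception:
            attempts[msg] = attempts.get(msg, 0) + 1
            if attempts[msg] < max_attempts:
                q.put(msg)  # replay: message reappears on the queue
            else:
                dead_letters.append(msg)  # park it for inspection
    return results, dead_letters

q = queue.Queue()
for m in ("ok-1", "bad", "ok-2"):
    q.put(m)

calls = {"bad": 0}
def flaky_handler(msg):
    if msg == "bad":
        calls["bad"] += 1
        if calls["bad"] < 2:  # fails once, succeeds on the replay
            raise RuntimeError("backend service was down")
    return msg.upper()

results, dead = process_with_replay(q, flaky_handler)
print(results, dead)  # the failed message was replayed, not lost
```

In the synchronous version, the `RuntimeError` would have surfaced as an error page and the order would simply be gone.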
For fun, last time I taught someone to add an Azure queue and set up a function to do background processing, the whole thing including setup and coding took 57 minutes end-to-end.
In a traditional system, you typically have a single database that does everything for you. That makes sense, because you have to look after everything.
In the cloud you usually have different data storage – persistence – tools available to you. The trick is to use a mix of them in a way that makes sense. For example, you may offload some types of “cold” data to another model, or you may pre-process responses and store them in a faster persistence system so you can respond to users without having to run complex queries on the main database.
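As an illustration of that pre-processing idea, with plain Python dictionaries standing in for the main database and for a fast key-value store such as Azure Table storage; the registration numbers and mileage figures are made up:

```python
# Main database: authoritative, but slow to query (simulated here).
mileage_records = [
    {"reg": "AB12CDE", "date": "2021-03-01", "miles": 41000},
    {"reg": "AB12CDE", "date": "2022-03-01", "miles": 52000},
    {"reg": "XY65ZZZ", "date": "2022-06-01", "miles": 18000},
]

# Fast read store: a key-value table holding a pre-computed summary
# per car, so reads never need to query the main database.
summary_store = {}

def rebuild_summaries():
    """Background job: pre-process the slow data into per-car
    summaries that the web tier can fetch by key in one lookup."""
    for rec in mileage_records:
        s = summary_store.setdefault(
            rec["reg"], {"records": 0, "latest_miles": 0})
        s["records"] += 1
        s["latest_miles"] = max(s["latest_miles"], rec["miles"])

def check_car(reg):
    """Web-facing read: one key lookup, no joins, no table scans."""
    return summary_store.get(reg)

rebuild_summaries()
print(check_car("AB12CDE"))  # answered from the fast store
```

The main database stays the source of truth; the fast store is disposable and can be rebuilt at any time, which is what makes cheap stores like table storage safe to use this way.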
Blob storage and table storage, in particular, are worth looking at. They are completely useless as a general database, but they are exceptionally cheap and fast for the right usecases.
If you need to scale up to millions of users then you obviously need much more advanced patterns. There are many of them and we use a few, such as Kubernetes clusters, actor models and other things. But, for most systems, the stuff I have covered here is more than enough, and is in any case a good starting point. Don’t do more than you need to.
As I mentioned above, my company is an Azure partner and reseller. If anything I have said today has piqued your interest, or you just need to talk to somebody about reducing cost, improving security or scaling better in Azure, feel free to contact me. We love to talk about Azure. We are not primarily a consultancy and we often give an initial consultation for free so you can try us out.