Fault tolerance in general is a challenging topic. Yet we need fault toleranct designs more badly than ever in order to provide robust, highly available systems - especially in times of scale out systems becoming more and more popular.
Unfortunately, most developers do not care too much about a fault tolerant design, either because they are scared by the complexity of the realm or because they do not care enough. One of the problems is that a lack of fault tolerant design does not hurt a lot in development or in QA, but it hurts a lot in production - as Michael Nygard said: "It's all about production!" (at least figuratively).
In this presentation I do *not* try to give a general introduction to fault tolerant design. Instead I pick a few generic case studies that demonstrate the results of missing fault tolerant design, try to sensitize a bit about the production relevance of fault tolerant design and then go along with a few selected patterns. I picked a few patterns which are surprisingly easy to implement and help to mitigate the problems of the former case studies.
This way I try to show two things:
1. A piece of architecture or design as a pattern is not necessarily hard to implement. Sometimes the code is written quicker than it takes to explain the pattern beforehand.
2. Even if fault tolerant design as a general topic might be hard, some parts of it can be implemented very easily and it's more than worth the coding effort if you look how much better your system behaves in production just from adding those few lines of code.