This article proposes a system-level approach to providing fault tolerance as a service in cloud computing environments. The key aspects are:
1) A resource manager monitors resources in the cloud and maintains a database and graph representing the state of nodes, VMs, and network links. This provides visibility of resources for managing fault tolerance.
2) Fault tolerance units (ft-units) implement generic fault tolerance mechanisms that can be applied transparently to VMs hosting applications. Examples include replication and failure detection ft-units.
3) A two-stage process first analyzes client requirements and matches them to available ft-units, then deploys the solution by combining suitable ft-units at runtime to provide the desired fault