This document summarizes a course on fault-tolerant computing in cloud systems. It defines cloud computing and discusses common types of failures in cloud services, such as overflows, timeouts, and hardware and software failures. It then outlines a proposed fault tolerance framework for cloud computing that combines fault prediction, process-level redundancy, and checkpoint/restart to provide transparent fault tolerance with low overhead and without requiring application modifications. The framework relies on fault prediction, failure detection, and a control daemon to determine when to checkpoint and when to recover applications.
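
To make the checkpoint-and-recover decision concrete, here is a minimal sketch of how a control daemon might combine a predicted failure risk with checkpoint/restart. It is an illustration under stated assumptions, not the framework's actual design: the names `ControlDaemon` and `run_job`, the risk threshold, the fallback interval, and the in-process simulated failure are all hypothetical. The framework described above checkpoints transparently from outside the application, whereas this toy keeps everything in a single process for readability.

```python
import pickle
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ControlDaemon:
    """Hypothetical control daemon that decides when to checkpoint."""
    risk_threshold: float = 0.5   # assumed cutoff for "failure is likely"
    max_interval: int = 5         # assumed fallback: checkpoint every N steps
    checkpoints_taken: int = 0
    _saved: Optional[bytes] = None
    _steps_since: int = 0

    def should_checkpoint(self, predicted_risk: float) -> bool:
        # Checkpoint when the predictor signals high risk, or when too
        # many steps have passed since the last checkpoint.
        return (predicted_risk >= self.risk_threshold
                or self._steps_since >= self.max_interval)

    def checkpoint(self, state: Any) -> None:
        # Serialize a snapshot of the state (stand-in for a process image).
        self._saved = pickle.dumps(state)
        self._steps_since = 0
        self.checkpoints_taken += 1

    def restore(self) -> Any:
        if self._saved is None:
            raise RuntimeError("no checkpoint available")
        return pickle.loads(self._saved)

    def tick(self, state: Any, predicted_risk: float) -> None:
        # Called after every completed step of the job.
        self._steps_since += 1
        if self.should_checkpoint(predicted_risk):
            self.checkpoint(state)


def run_job(risks: List[float], fail_at: int, daemon: ControlDaemon) -> dict:
    """Sum the step indices; step `fail_at` fails once and is recovered."""
    state = {"step": 0, "total": 0}
    daemon.checkpoint(state)  # initial checkpoint so recovery is always possible
    failed = False
    while state["step"] < len(risks):
        i = state["step"]
        if i == fail_at and not failed:
            failed = True
            state = daemon.restore()  # failure detected: roll back and retry
            continue
        state["total"] += i
        state["step"] = i + 1
        daemon.tick(state, risks[i])
    return state


daemon = ControlDaemon()
final = run_job([0.1, 0.1, 0.9, 0.1, 0.1], fail_at=4, daemon=daemon)
print(final, daemon.checkpoints_taken)
```

In this run the risk spike at the third step triggers a checkpoint, so the simulated failure at the last step only repeats one step of work rather than restarting the job from the beginning.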