The document proposes a fault tolerance policy, FTCPM, that uses checkpointing to improve reliability in peer-to-peer grid systems. Its checkpoint strategy saves job state to stable storage. When a failure occurs, the policy uses timestamps and rollback to identify the cause, then reschedules the job to an appropriate node using an algorithm called SMF. The system architecture is layered, with services for job management, fault tolerance, and SMF. The document also details two algorithms: CP, which schedules checkpoints dynamically, and SMF, which selects the most fitting resource for job assignment.
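
The checkpoint, rollback, and reschedule cycle summarized above can be illustrated with a minimal Python sketch. It is not the FTCPM implementation: every name here (`Node`, `save_checkpoint`, `select_most_fitting`, `run_with_recovery`), the fitness score, the fixed checkpoint interval, and the use of a local JSON file as "stable storage" are assumptions made for illustration. The CP algorithm is described as scheduling checkpoints dynamically, whereas this sketch uses a constant interval to stay short.

```python
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical resource description; the fields are illustrative only.
    name: str
    free_cpu: float
    failure_count: int = 0


def save_checkpoint(path, job_id, state, timestamp):
    """Write job state to a file atomically (a stand-in for stable storage)."""
    record = {"job_id": job_id, "state": state, "timestamp": timestamp}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read the most recent checkpoint record back from storage."""
    with open(path) as f:
        return json.load(f)


def select_most_fitting(nodes, exclude=()):
    """Pick the node with the best score (assumed fitness, not the paper's)."""
    candidates = [n for n in nodes if n.name not in exclude]
    if not candidates:
        raise ValueError("no candidate nodes available")
    return max(candidates, key=lambda n: n.free_cpu / (1 + n.failure_count))


def run_with_recovery(nodes, total_steps, interval, fail_at, path):
    """Run a toy job, injecting one failure at step `fail_at`.

    On failure the job rolls back to the last checkpoint and is
    rescheduled to the most fitting remaining node.
    """
    node = select_most_fitting(nodes)
    history = [node.name]
    step = 0
    failed_once = False
    while step < total_steps:
        if step == fail_at and not failed_once:
            failed_once = True
            node.failure_count += 1
            # Roll back: resume from the last saved step, or from scratch.
            step = load_checkpoint(path)["state"]["step"] if os.path.exists(path) else 0
            node = select_most_fitting(nodes, exclude={node.name})
            history.append(node.name)
            continue
        step += 1  # one unit of simulated work
        if step % interval == 0:
            # The step counter doubles as a logical timestamp here.
            save_checkpoint(path, "job-1", {"step": step}, timestamp=step)
    return step, history
```

Calling `run_with_recovery` with three nodes, ten steps, a checkpoint every three steps, and a failure injected at step seven would roll the job back to the checkpoint taken at step six and continue on a different node. Only the work done since the last checkpoint is repeated, which is the benefit the checkpoint strategy is meant to provide.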