Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems

  1. Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems
     G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti§, Yoshio Turner (HP Labs)
     §: currently at Meiosys, Inc.
  2. Broad Opportunity for Checkpoint-Restart in Server Management
     - Fault tolerance (minimize unplanned downtime)
       - Recover by restarting from checkpoint
     - Minimize planned downtime
       - Migrate application before hardware/OS maintenance
     - Resource management
       - Manage resource allocation in shared computing environments by migrating applications
  3. Need for General-Purpose Checkpoint-Restart
     - Existing checkpoint-restart methods are too limited:
       - No support for many OS resources that commercial applications use (e.g., sockets)
       - Limited to applications using specific libraries
       - Require application source and recompilation
       - Require use of specialized operating systems
     - Need a practical checkpoint-restart mechanism capable of supporting a broad class of applications
  4. Cruz: Our Solution for General-Purpose Checkpoint-Restart on Linux
     - Application-transparent: supports applications without modification or recompilation
     - Supports a broad class of applications (e.g., databases, parallel MPI apps, desktop apps)
       - Comprehensive support for user-level state, kernel-level state, and distributed computation and communication state
     - Runs on an unmodified Linux base kernel; checkpoint-restart is integrated via a kernel module
  5. Cruz Overview
     - Builds on Columbia University's Zap process migration
     - Our key extensions:
       - Support for migrating networked applications, transparent to communicating peers
         - Enables a role in managing servers running commercial applications (e.g., databases)
       - General method for checkpoint-restart of TCP/IP-based distributed applications
         - Also enables efficiencies compared to library-specific approaches
  6. Outline
     - Zap (Background)
     - Migrating Networked Applications
       - Network Address Migration
       - Communication State Checkpoint and Restore
     - Checkpoint-Restart of Distributed Applications
     - Evaluation
     - Related Work
     - Future Work
     - Summary
  7. Zap (Background)
     - Process migration mechanism
       - Kernel module implementation
     - Virtualization layer groups processes into pods with a private virtual name space
       - Intercepts system calls to expose only virtual identifiers (e.g., vpid)
       - Preserves resource names and dependencies across migration
     - Mechanism to checkpoint and restart pods
       - User- and kernel-level state
       - Primarily uses system call handlers
       - File system not saved or restored (assumes a network file system)
     (Figure: applications in pods run on the Zap layer, which sits between the Linux system call interface and the kernel.)
  8. Outline (repeated)
  9. Migrating Networked Applications
     - Migration must be transparent to remote peers to be useful in server management scenarios
       - Peers, including unmodified clients, must not perceive any change in the application's IP address
       - Communication state of live connections must be preserved
     - No prior solution provides these (including the original Zap)
     - Our solution:
       - Provide each pod a unique IP address that persists across migration
       - Checkpoint and restore the socket control state and socket data buffer state of all live sockets
  10. Network Address Migration
     - Pod attached to a virtual interface with its own IP and MAC addresses
       - Implemented using Linux's virtual interfaces (VIFs)
     - IP address assigned statically or through a DHCP client running inside the pod (using the pod's MAC address)
     - Intercept bind() and connect() to ensure pod processes use the pod's IP address
     - Migration: delete the VIF on the source host and create it on the new host
       - Migration limited to the subnet
     (Figure: the pod's DHCP client obtains the pod MAC address MAC-p1 via ioctl(), sends dhcprequest(MAC-p1), and receives dhcpack(IP-p1); the address is configured on virtual interface eth0:1 alongside the host's eth0 [IP-1, MAC-h1].)
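The bind()/connect() interception above can be modeled as a small sketch. This is illustrative only, not the Cruz kernel-module implementation; the names `POD_IP` and `intercepted_bind`, and the specific address, are hypothetical.

```python
# Sketch: how interception of bind() could pin a pod's sockets to the
# pod's own IP address. A wildcard bind is rewritten to the pod address;
# binding a foreign address is rejected. Names here are hypothetical.

POD_IP = "10.0.0.42"   # assumed address leased via the pod's MAC
WILDCARD = "0.0.0.0"

def intercepted_bind(requested_ip: str, port: int) -> tuple:
    """Return the (ip, port) actually bound for a pod process."""
    if requested_ip in (WILDCARD, ""):
        # INADDR_ANY is redirected to the pod's virtual interface,
        # so the socket survives migration with the pod's address.
        return (POD_IP, port)
    if requested_ip != POD_IP:
        raise PermissionError("pod processes may only bind the pod's address")
    return (requested_ip, port)
```

In this model a server that binds the wildcard address transparently ends up listening on the pod's migratable IP, which is what keeps peers unaware of the move.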
  11. Communication State Checkpoint and Restore
     - Communication state:
       - Control: socket data structure, TCP connection state
       - Data: contents of send and receive socket buffers
     - Challenges in communication state checkpoint and restore:
       - The network stack continues to execute even after application processes are stopped
       - No system call interface to read or write control state
       - No system call interface to read send socket buffers
       - No system call interface to write receive socket buffers
       - Consistency of control state and socket buffer state
  12. Communication State Checkpoint
     - Acquire network stack locks to freeze TCP processing
     - Save receive buffers using the socket receive system call in peek mode
     - Save send buffers by walking kernel structures
     - Copy control state from kernel structures
     - Modify two sequence numbers in the saved state to reflect empty socket buffers
       - Indicate current send buffers not yet written by the application
       - Indicate current receive buffers all consumed by the application
     Note: Checkpoint does not change the live communication state.
     (Figure: live state tracks the receive buffer between copied_seq and rcv_nxt and the send buffer between snd_una and write_seq; the checkpointed copy saves the buffer contents and control state, including timers and options, but rewrites the two sequence numbers so both buffers appear empty.)
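The sequence-number fix-up above can be sketched concretely. The field names (`copied_seq`, `rcv_nxt`, `snd_una`, `write_seq`) follow Linux's TCP socket state as shown in the slide's figure, but the dict representation and function name are illustrative, not kernel code.

```python
# Sketch: checkpoint copies the control state, then rewrites two sequence
# numbers so the saved state describes empty socket buffers. The buffer
# contents themselves are saved separately (peek-mode reads for the receive
# side, walking kernel structures for the send side).

def checkpoint_socket(live: dict) -> dict:
    saved = dict(live)  # the live state is left untouched
    # Receive side: pretend the application already consumed everything
    # up to rcv_nxt, so restore can replay the saved data on demand.
    saved["copied_seq"] = live["rcv_nxt"]
    # Send side: pretend the application has not yet written the bytes
    # still in the send buffer; restore re-writes them through the socket.
    saved["write_seq"] = live["snd_una"]
    return saved
```

The key property modeled here is the slide's note: checkpointing produces an adjusted copy while the live connection's state is unchanged.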
  13. Communication State Restore
     - Create a new socket
     - Copy the checkpointed control state into the socket structure
     - Restore checkpointed send buffer data using the socket write call
     - Deliver checkpointed receive buffer data to the application on demand
       - Copy checkpointed receive buffer data to a special buffer
       - Intercept the receive system call to deliver data from the special buffer until it is emptied
     (Figure: control state, including timers and options, is copied directly into the new socket; saved send data re-enters the stack via write(), and saved receive data reaches the application through the intercepted receive system call.)
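The on-demand delivery of checkpointed receive data can be modeled with a short sketch. `RestoredSocket` is a hypothetical stand-in for the intercepted receive path, not an interface Cruz exposes.

```python
# Sketch: checkpointed receive-buffer bytes are parked in a side buffer and
# drained by the intercepted receive call before any post-restart network
# data is returned, preserving in-order, exactly-once delivery.

class RestoredSocket:
    def __init__(self, saved_recv: bytes):
        self.pending = saved_recv    # checkpointed receive-buffer data
        self.live = bytearray()      # data arriving after restart

    def recv(self, n: int) -> bytes:
        if self.pending:
            # Serve the special buffer first, until it is emptied.
            out, self.pending = self.pending[:n], self.pending[n:]
            return out
        out = bytes(self.live[:n])
        del self.live[:n]
        return out
```

Once `pending` is drained, the interception becomes a no-op and the socket behaves normally.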
  14. Outline (repeated)
  15. Checkpoint-Restart of Distributed Applications
     - State of processes and messages in the channel must be checkpointed and restored consistently
     - Prior approaches are specific to a particular library, e.g., modifying the library to capture and restore messages in the channel
     - Cruz preserves the TCP connection state and IP addresses of each pod, implicitly preserving global communication state
       - Transparently supports TCP/IP-based distributed applications
       - Enables efficiencies compared to library-based implementations
     (Figure: processes on each node communicate through a library over TCP/IP across a shared communication channel; checkpoint is applied per node.)
  16. Checkpoint-Restart of Distributed Applications in Cruz
     - Global communication state saved and restored by saving and restoring the TCP communication state of each pod
       - Messages in flight need not be saved, since the TCP state triggers retransmission of these messages at restart
         - Eliminates the O(N^2) step of flushing the channel to capture messages in flight
       - Eliminates the need to re-establish connections at restart
     - Preserving the pod's IP address across restart eliminates the need to re-discover process locations in the library at restart
     (Figure: as on the previous slide, but the processes on each node are grouped into a pod; checkpoint is applied per pod over TCP/IP.)
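Why in-flight messages need no explicit capture can be sketched with a minimal model of TCP's retransmission invariant: any byte sent but not yet acknowledged remains in the sender's buffer and will be resent. The function name and stream model are illustrative, not Cruz code.

```python
# Sketch: bytes in [snd_una, snd_nxt) have been sent but not acknowledged.
# After restart, the restored TCP state retransmits exactly this window,
# so any message the network dropped while the stack was frozen at
# checkpoint is regenerated without being saved separately.

def retransmit_window(send_stream: bytes, snd_una: int, snd_nxt: int) -> bytes:
    """Return the bytes TCP is still responsible for redelivering."""
    return send_stream[snd_una:snd_nxt]
```

This is what removes the channel-flush step: per-pod TCP state already covers the channel's contents.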
  17. Consistent Checkpoint Algorithm in Cruz (Illustrative)
     - Algorithm has O(N) complexity (a blocking algorithm is shown for simplicity)
     - Can be extended to improve robustness and performance, e.g.:
       - Tolerate Agent and Coordinator failures
       - Overlap computation and checkpointing using copy-on-write
       - Allow nodes to continue without blocking for all nodes to complete the checkpoint
       - Reduce checkpoint size with incremental checkpoints
     (Figure: the Coordinator sends <checkpoint> to the Agent on each node; each agent disables pod communication (using netfilter rules in Linux), saves pod state, and replies <done>; once all agents have replied, the coordinator sends <continue>; each agent re-enables pod communication, resumes its pod, and replies <continue-done>.)
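The blocking coordination round from the figure can be sketched as an in-process simulation. The message names follow the slide; the `Agent`/`coordinate` classes and the single-process setting are illustrative assumptions, not the real distributed implementation.

```python
# Sketch: one blocking coordination round, O(N) in the number of nodes.
# Phase 1 freezes every pod's communication and saves its state; phase 2
# runs only after all <done> replies, so no pod resumes while another is
# still saving, which keeps the global state consistent.

class Agent:
    def __init__(self, name: str):
        self.name, self.log = name, []

    def on_checkpoint(self) -> str:
        # In the real system, communication is blocked via netfilter rules.
        self.log += ["disable_comm", "save_pod_state"]
        return "done"

    def on_continue(self) -> str:
        self.log += ["enable_comm", "resume_pod"]
        return "continue-done"

def coordinate(agents):
    # Phase 1: broadcast <checkpoint>, wait for <done> from every agent.
    assert all(a.on_checkpoint() == "done" for a in agents)
    # Phase 2: broadcast <continue>, wait for <continue-done>.
    assert all(a.on_continue() == "continue-done" for a in agents)
```

The extensions listed above (copy-on-write overlap, non-blocking continuation, incremental checkpoints) would relax the strictly sequential phases modeled here.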
  18. Outline (repeated)
  19. Evaluation
     - Cruz implemented for Linux 2.4.x on x86
     - Functionality verified on several applications, e.g., MySQL, the K Desktop Environment, and a multi-node MPI benchmark
     - Cruz incurs negligible runtime overhead (less than 0.5%)
     - An initial study shows the performance overhead of coordinating checkpoints is negligible, suggesting the scheme is scalable
  20. Performance Result – Negligible Coordination Overhead
     - Checkpoint behavior for a semi-Lagrangian atmospheric model benchmark in configurations from 2 to 8 nodes
     - Negligible latency in coordinating checkpoints (time spent in non-local operations) suggests the scheme is scalable
       - Coordination latency of 400-500 microseconds is a small fraction of the overall checkpoint latency of about 1 second
  21. Related Work
     - MetaCluster product from Meiosys
       - Capabilities similar to Cruz (e.g., checkpoint and restart of unmodified distributed applications)
     - Berkeley Lab Checkpoint/Restart (BLCR)
       - Kernel-module-based checkpoint-restart for a single node
       - No identifier virtualization: restart will fail in the event of an identifier (e.g., pid) conflict
       - No support for handling communication state: relies on application or library changes
     - MPVM, CoCheck, LAM/MPI
       - Library-specific implementations of parallel application checkpoint-restart, with the disadvantages described earlier
  22. Future Work
     - Many areas for future work, e.g.:
       - Improve portability across kernel versions by minimizing direct access to kernel structures
         - Recommend additional kernel interfaces where advantageous (e.g., for accessing socket attributes)
       - Implement performance optimizations to the coordinated checkpoint-restart algorithm
         - Evaluate performance on a wide range of applications and cluster configurations
       - Support systems with newer interconnects and communication abstractions (e.g., InfiniBand, RDMA)
  23. Summary
     - Cruz, a practical checkpoint-restart system for Linux
       - No change to applications or to the base OS kernel needed
     - Novel mechanisms support checkpoint-restart of a broader class of applications:
       - Migrating networked applications transparently to communicating peers
       - Consistent checkpoint-restart of general TCP/IP-based distributed applications
     - Cruz's broad capabilities will drive its use in solutions for fault tolerance, online OS maintenance, and resource management
  24. http://www.hpl.hp.com/research/dca
  25. Zap Virtualization
     - Groups processes into a pod (process domain) with a private virtual namespace
       - Uses system call interception to expose only virtual identifiers (e.g., virtual pids, virtual IPC identifiers)
       - Virtual identifiers eliminate conflicts with identifiers already in use within the OS on the restarting node
     - All dependent processes (e.g., forked child processes) are assigned to the same pod
       - Checkpoint and restart operate on an entire pod, which preserves resource dependencies across checkpoint and restart
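The vpid virtualization described above can be sketched as a per-pod translation table. `PodNamespace` and its method names are hypothetical; the real mechanism lives in a kernel module intercepting system calls.

```python
# Sketch: a pod keeps a private vpid <-> real-pid mapping. Processes only
# ever see vpids, so at restart the same vpid can be bound to a fresh real
# pid without colliding with pids already in use on the new node.

import itertools

class PodNamespace:
    def __init__(self):
        self._next = itertools.count(1)   # vpids are private to the pod
        self.v2r, self.r2v = {}, {}

    def register(self, real_pid: int) -> int:
        """Assign a virtual pid to a newly created process."""
        vpid = next(self._next)
        self.v2r[vpid], self.r2v[real_pid] = real_pid, vpid
        return vpid

    def getpid(self, real_pid: int) -> int:
        # Models an intercepted getpid(): only the vpid is exposed.
        return self.r2v[real_pid]

    def rebind(self, vpid: int, new_real_pid: int):
        """At restart, attach an existing vpid to a new real pid."""
        old = self.v2r[vpid]
        del self.r2v[old]
        self.v2r[vpid], self.r2v[new_real_pid] = new_real_pid, vpid
```

Because every identifier an application holds is virtual, resource names stay stable across checkpoint, migration, and restart.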
  26. Zap Checkpoint and Restart
     - Checkpoint:
       - Stop all processes in the pod with SIGSTOP
       - Save parent-child relationships from /proc
       - Capture the state of each process by accessing system call handlers and kernel data structures
     - Restart:
       - Recreate the original forest of processes in a new pod by forking recursively
       - Each process restores most of its resources using system calls (e.g., open files)
       - The kernel module restores sharing relationships (e.g., shared file descriptors) and other key resources (e.g., socket state)
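The recursive recreation of the process forest can be sketched abstractly. Here `spawn` stands in for fork-plus-state-restore; the depth-first shape ensures every child is created by its original parent, which is what preserves parent-child relationships. The function and its arguments are illustrative.

```python
# Sketch: rebuild the saved parent-child forest (captured from /proc)
# by "forking" depth-first, so each restored process creates its own
# children, reproducing the original process tree.

def recreate_forest(children: dict, root: int, spawn):
    """children maps a saved vpid to the list of its child vpids."""
    spawn(root)                      # restore this process first
    for child in children.get(root, []):
        recreate_forest(children, child, spawn)   # parent forks its kids
```

After the tree exists, per-process resources are restored via system calls and the kernel module patches up shared state, as the slide describes.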
  27. Outline (repeated)
  28. Outline (repeated)
  29. Performance Result – Impact of Dropping Packets at Checkpoint
     - Benchmark streams data at the maximum rate over a GigE link between 2 nodes
     - Shows TCP recovers peak throughput within 100 ms
       - This will be overshadowed by checkpoint latency in real applications
       - Optimizations can overlap TCP recovery entirely with checkpointing
