Improving the Scalability of Transparent Checkpointing for GPU Computing Systems

  1. Improving the Scalability of Transparent Checkpointing for GPU Computing Systems
     The 2012 IEEE Region 10 Conference (TENCON 2012), Cebu, Philippines, November 21, 2012
     Alfian Amrizal, S. Hirasawa, K. Komatsu, H. Takizawa, H. Kobayashi (Tohoku University)
  2. Outline
     • Introduction
     • Two-level CheCL
     • Performance Model
     • Evaluation and Discussion
     • Conclusion
  3. High-Performance Computing & Checkpoint
     • High-performance computing (HPC) systems are getting faster and larger in scale
       – They consist of huge numbers of CPUs and GPUs
       – The probability of encountering failures also increases
     • Checkpoint/restart (CPR) tools are important to make sure HPC systems can successfully finish their calculations
       – Long-running applications, e.g. SPECFEM3D
     [Figure: CPU-GPU in a heterogeneous HPC system]
  4. Difficulties in CPR of Heterogeneous Systems
     • Heterogeneous systems use both CPUs and GPUs
     • Conventional CPR tools such as BLCR and DMTCP do not assume GPUs ⇒ CPR fails
     • CheCL has been developed for checkpointing OpenCL applications running on CPU-GPU systems [Takizawa, IPDPS’11]
     [Figure: a compute node with a CPU and a GPU; labels: Host Memory, Device Memory, process, resource. The process is drawn as the snippet SCR_Start_checkpt(); SCR_Route_file(fn,fn2); … fwrite(data,…); … SCR_Complete_checkpt();. Captions: conventional CPR tools only save CPU state; CheCL allows conventional tools to save GPU state.]
  5. Difficulties in CPR of Heterogeneous Systems
     • Problem: checkpointing time increases with the # of nodes
  6. Writing Checkpoints to Global Storage is Ineffective
     • To withstand failures, large-scale heterogeneous systems need to checkpoint more frequently to the global storage (low BW)
     • However, the global storage is shared among nodes ⇒ CheCL’s checkpoint time increases with the # of nodes
     • CheCL is not scalable: the larger the number of nodes, the longer it takes to checkpoint
     • Objective
       – To establish an effective implementation of the checkpointing mechanism for heterogeneous HPC systems
     [Figure: four compute nodes, each running the same checkpoint snippet (SCR_Start_checkpt(); SCR_Route_file(fn,fn2); … fwrite(data,…); … SCR_Complete_checkpt();), writing over the network to one global storage; label: Network Contention]
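The scaling problem on this slide can be illustrated with a deliberately idealized bandwidth model. This is only a sketch under stated assumptions: the function names, the fair-sharing assumption for the global link, and the example numbers are hypothetical and are not taken from the slides or from CheCL.

```python
def global_ckpt_time(num_nodes, ckpt_size, global_bw):
    """Time for all nodes to write to one shared global storage.

    Assumption (not from the slides): the global link is shared evenly,
    so the total data volume is what limits the checkpoint time.
    """
    return num_nodes * ckpt_size / global_bw


def local_ckpt_time(ckpt_size, local_bw):
    """Time when every node writes to its own local storage in parallel.

    Assumption: local writes do not interfere with each other, so the
    time does not depend on the number of nodes.
    """
    return ckpt_size / local_bw


if __name__ == "__main__":
    # Hypothetical sizes (MB) and bandwidths (MB/s), for illustration only.
    for nodes in (1, 2, 4):
        g = global_ckpt_time(nodes, ckpt_size=1000, global_bw=100)
        l = local_ckpt_time(ckpt_size=1000, local_bw=100)
        print(f"{nodes} node(s): global {g:.1f} s, local {l:.1f} s")
```

Under these assumptions the global checkpoint time grows linearly with the node count while the local time stays flat, which is the qualitative behavior the slide describes.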
  9. Local CheCL
     • Avoid the network by utilizing each node’s local storage
       – Simultaneous checkpointing → Fast
       – Less reliable
     [Figure: four compute nodes, each running the same checkpoint snippet (SCR_Start_checkpt(); SCR_Route_file(fn,fn2); … fwrite(data,…); … SCR_Complete_checkpt();). Captions: add local storage to each node; interrupt this process; the global storage is large and reliable but slow.]
  12. Two-level CheCL
     • Writing ckpt files to the global storage is more reliable but time-consuming
     • Using the local storage of compute nodes is fast but sacrifices reliability
     • Proposal: Two-level CheCL uses both local and global storage ⇒ Local CheCL + Global CheCL
     [Figure: four compute nodes, each running the same checkpoint snippet, each with its own local storage, all connected to a shared global storage]
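One way to picture the combination of the two levels is a fixed schedule in which most checkpoints go to local storage and every few checkpoints go to global storage. The sketch below assumes a 3:1 local-to-global ratio, as used in the evaluation slides, and a global-first ordering; the function name and the scheduling rule itself are hypothetical, since the slides do not describe how CheCL actually chooses the level.

```python
def checkpoint_level(index, local_per_global=3):
    """Return the storage level for the index-th checkpoint (0-based).

    Hypothetical rule: one global checkpoint is followed by
    `local_per_global` local checkpoints, then the pattern repeats.
    """
    period = local_per_global + 1
    return "global" if index % period == 0 else "local"


if __name__ == "__main__":
    # Print the first eight checkpoint levels of the assumed schedule.
    print([checkpoint_level(i) for i in range(8)])
```

A fixed period keeps the sketch simple; a real tool could also pick the level adaptively, which is outside what the slides state.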
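  14. Performance Model
     • Total execution time of an OpenCL application running with Two-level CheCL is T_total
     • The original execution time is T_s
     [Figure: timeline of the original execution (length T_s) divided into intervals of length n, with boundaries labeled dG and dL]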
  15. Performance Model
     • Total time spent for checkpointing is T_C
     [Figure: timeline of length T_s + T_C in which each interval of length n is followed by a checkpoint overhead Cov]
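  16. Performance Model
     • Total time spent for checkpointing is T_C
     • Local CheCL ckpt overhead is C_L, Global CheCL ckpt overhead is C_G
     [Figure: timeline of length T_s + T_C with one global checkpoint (C_G) followed by three local checkpoints (C_L), each separated by an interval of length n; labels: 75%, 25%]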
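The checkpointing overhead can be sketched as a weighted sum of the two per-checkpoint costs. This assumes the 75% and 25% labels denote the shares of local and global checkpoints; the function and parameter names are hypothetical, and the formula is an interpretation of the figure, not an equation quoted from the slides.

```python
def checkpoint_overhead(num_ckpts, c_local, c_global, local_fraction=0.75):
    """Total checkpointing time T_C for a mix of local and global checkpoints.

    Assumed model: T_C = N * (p_L * C_L + (1 - p_L) * C_G), where N is the
    number of checkpoints and p_L is the fraction written locally.
    """
    global_fraction = 1.0 - local_fraction
    return num_ckpts * (local_fraction * c_local + global_fraction * c_global)


if __name__ == "__main__":
    # Hypothetical costs in seconds: local 1.0, global 5.0, eight checkpoints.
    print(checkpoint_overhead(8, c_local=1.0, c_global=5.0))
```

Setting `local_fraction=0.0` gives the global-only case, which makes it easy to compare the two configurations under the same assumed costs.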
  17. Performance Model
     • No failure occurs during the ckpt process. On average, failures occur at 0.5n
     • T_L is the time overhead when the process is recoverable by the latest checkpoint file
     [Figure: timeline of length T_s + T_C (intervals of length n separated by C_G, C_L, C_L, C_L) with the midpoint 0.5n marked in every interval]
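  18. Performance Model
     • No failure occurs during the ckpt process. On average, failures occur at 0.5n
     • T_L is the time overhead when the process is recoverable by the latest checkpoint file
     • # of failures: 85% / 15% [Moody, SC’10]
     [Figure: two timelines with the wasted time marked. In the first, a failure 0.5n after a local checkpoint (C_L) is followed by a restart R_L; in the second, a failure 0.5n after a global checkpoint (C_G) is followed by a restart R_G.]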
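  19. Performance Model
     • T_G is the time overhead when the process is only recoverable by the global checkpoint file
     [Figure: timeline with a global checkpoint (C_G), a local checkpoint (C_L), and a failure after 0.5n, followed by restarts labeled R_G and R_L]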
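The failure terms can be combined into an efficiency estimate along the following lines. This is a rough sketch, not the authors' model: it reads the 85%/15% labels as the shares of failures recoverable from the local and the global checkpoint, charges each failure 0.5n of lost work plus a restart cost, and adds a hypothetical `extra_rollback_global` parameter for the additional work lost when rolling back to an older global checkpoint. All names and example values are assumptions.

```python
def failure_overhead(num_failures, n, r_local, r_global,
                     p_local=0.85, extra_rollback_global=0.0):
    """Expected time lost to failures under the assumed model.

    t_l: cost of a failure recoverable from the latest (local) checkpoint.
    t_g: cost of a failure recoverable only from the global checkpoint.
    """
    t_l = 0.5 * n + r_local
    t_g = 0.5 * n + r_global + extra_rollback_global
    return num_failures * (p_local * t_l + (1.0 - p_local) * t_g)


def efficiency(t_s, t_c, t_f):
    """Efficiency T_s / T_total, assuming T_total = T_s + T_C + failure time."""
    return t_s / (t_s + t_c + t_f)


if __name__ == "__main__":
    # Hypothetical values in seconds, chosen only to exercise the functions.
    t_f = failure_overhead(num_failures=2, n=10.0, r_local=1.0, r_global=3.0)
    print(efficiency(t_s=80.0, t_c=6.0, t_f=t_f))
```

Keeping the extra rollback as an explicit parameter makes the main uncertainty of this sketch visible: how much work is lost on a global recovery depends on how far back the last global checkpoint is.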
  21. Experimental Setup
     • The evaluation was conducted on a GPU cluster of four compute nodes, each with:
       – Intel Core i7 930 CPU
       – NVIDIA Tesla C2070 GPU
       – Main memory of 24 GB
       – tmpfs RAM disk of 12 GB
     • CPR tools:
       – BLCR-0.8-4 (CPU state ckpt)
       – CheCL (GPU state ckpt)
     • Benchmark:
       – Molecular Dynamics (MD)
  22. Checkpoint Time Comparison for GPU Cluster
     • Accelerate up to > 4x
     [Figure: bar chart. Y-axis: Checkpoint Time (ms), 0 to 16000. X-axis: # of Nodes and Problem size (problem sizes 12288, 24574, 73728 for each of 1 node, 2 nodes, 4 nodes). Series: Global CheCL, Local CheCL.]
  23. Efficiency (T_s/T_total) Improvement (No Failure)
     • Two-level CheCL’s P_L:P_G = 3:1
     [Figure: line chart. Y-axis: Efficiency, 0% to 100%. X-axis: Checkpoint Frequencies (1x, 2x, 4x, 8x, 16x, 32x, 64x). Series: 2 nodes, Local and Global; 2 nodes, Global only; 4 nodes, Local and Global; 4 nodes, Global only.]
  24. Efficiency Improvement (MTTF = 3 minutes) [Schroeder, SciDAC’07]
     • Two-level CheCL’s P_L:P_G = 3:1
     [Figure: line chart. Y-axis: Efficiency, 0% to 100%. X-axis: Checkpoint Frequencies (1x, 2x, 4x, 8x, 16x, 32x, 64x). Series: 4 nodes, Local and Global; 4 nodes, Global only.]
  25. Trade-off Between Local/Global Ratio and Two-level CheCL’s Time Overhead
     [Figure: chart. Y-axis: Time overhead (ms), 0 to 4500. X-axis: Local/Global ratio, from (0:10) to (9:1) in steps of one.]
  26. Conclusion
     • Checkpointing is important for HPC system dependability
     • Two-level CheCL can improve system efficiency
     • Local CheCL can be used for high-speed checkpointing
     • There is a trade-off between Local and Global CheCL, which must be treated carefully in future implementations on large-scale GPU computing systems
