Improving the Scalability of Transparent Checkpointing for GPU Computing Systems

The 2012 IEEE Region 10 Conference
(TENCON 2012)
Cebu, Philippines
November 21, 2012

Alfian Amrizal, S. Hirasawa, K. Komatsu, H. Takizawa, H. Kobayashi
Tohoku University

Outline
• Introduction
• Two-level CheCL
• Performance Model
• Evaluation and Discussion
• Conclusion


High-Performance Computing & Checkpoint
• High-performance computing (HPC) systems are getting faster
and larger in scale
– Consist of huge numbers of CPUs and GPUs
– Probability of encountering failures also increases
• Checkpoint/restart (CPR) tools are important to make sure
HPC systems can successfully finish their calculation
– Long running applications; e.g. SPECFEM3D

[Figure: CPU-GPU in a heterogeneous HPC system]

Difficulties in CPR of Heterogeneous Systems
• Heterogeneous systems use both CPUs and GPUs
• Conventional CPR tools such as BLCR and DMTCP do not
account for GPUs ⇒ CPR fails

[Figure: a compute node with a CPU (host memory) and a GPU (device memory). The checkpointed process (SCR_Start_checkpt(); SCR_Route_file(fn,fn2); …; fwrite(data,…); …; SCR_Complete_checkpt();) and its resource are marked. Conventional CPR tools only save CPU state; CheCL allows conventional tools to save GPU state.]
• CheCL has been developed for checkpointing OpenCL
applications running on CPU-GPU systems [Takizawa, IPDPS’11]

Difficulties in CPR of Heterogeneous Systems
• Problem: checkpointing time increases with the # of nodes


Writing Checkpoints to Global Storage is Ineffective
• To withstand failures, large-scale heterogeneous systems need
to checkpoint more frequently to the global storage (low bandwidth)
• However, the global storage is shared among nodes
⇒ CheCL’s checkpoint time increases with the # of nodes
• CheCL is not scalable: the larger the number of nodes, the
longer it takes to checkpoint

[Figure: four compute nodes run the checkpoint process (…; fwrite(data,…); …; SCR_Complete_checkpt();) and write to the shared global storage at the same time, causing network contention.]

• Objective
– To establish an effective implementation of the checkpointing
mechanism for heterogeneous HPC systems


Local CheCL
• Avoid the network by using each node’s local storage
–  Simultaneous checkpointing → fast
–  Less reliable than global storage
[Figure: four compute nodes each run the checkpoint process (SCR_Start_checkpt(); SCR_Route_file(fn,fn2); …; fwrite(data,…); …) and write to their own local storage instead of the large, reliable but slow global storage. Labels: "Add local storage to each node"; "Interrupt this process".]

Two-level CheCL
• Writing checkpoint files to the global storage is more reliable but
time-consuming
• Writing to the local storage of each compute node is fast but sacrifices reliability

Proposal: Two-level CheCL uses both local and global storage ⇒ Local CheCL ＋ Global CheCL

[Figure: compute nodes, each with its own local storage, all connected to the shared global storage.]
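The two-level idea can be sketched as a simple schedule. The Python below is only an illustration, not CheCL's implementation: the function name `checkpoint_level` and the parameter `locals_per_global` are hypothetical, and the default of three local checkpoints per global checkpoint is an assumption taken from the PL:PG=3:1 setting used in the evaluation.

```python
def checkpoint_level(index: int, locals_per_global: int = 3) -> str:
    """Return "global" or "local" for the index-th checkpoint (0-based).

    Hypothetical helper: one global checkpoint is followed by
    `locals_per_global` local checkpoints, then the cycle repeats.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    if locals_per_global < 0:
        raise ValueError("locals_per_global must be non-negative")
    cycle = locals_per_global + 1
    return "global" if index % cycle == 0 else "local"


# First eight checkpoints of the assumed 3:1 schedule.
schedule = [checkpoint_level(i) for i in range(8)]
print(schedule)
```

Setting `locals_per_global=0` degenerates to Global CheCL only, which makes it easy to compare the two configurations with the same driver code.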


Performance Model

• Total execution time of an OpenCL application running with
Two-level CheCL is Ttotal
• The original execution time is Ts

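[Figure: timeline of the original execution (total length Ts), drawn as intervals of length n with markers dG and dL between them.]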

Performance Model

• Total time spent for checkpointing is TC

[Figure: timeline of total length Ts ＋ TC: intervals of length n, each followed by a checkpoint overhead Cov.]

Performance Model

• Total time spent for checkpointing is TC
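• Local CheCL checkpoint overhead is CL; Global CheCL checkpoint overhead is CG

[Figure: timeline of total length Ts ＋ TC: n, CG, n, CL, n, CL, n, CL, n. Labels: 75% and 25% (three local checkpoints per global checkpoint).]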
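The failure-free part of the model reduces to simple arithmetic. A minimal Python sketch follows, assuming one checkpoint after every interval of length n (so roughly Ts/n checkpoints in total) and a fraction `p_local` of them written locally. The function name, the parameter names, and the numbers in the usage lines are hypothetical, not measurements from the talk.

```python
def no_failure_efficiency(ts, n, cl, cg, p_local=0.75):
    """Efficiency Ts / Ttotal with Ttotal = Ts + TC and no failures.

    ts: original execution time; n: interval between checkpoints;
    cl, cg: overhead of one local / one global checkpoint;
    p_local: fraction of checkpoints written locally (assumed 75%).
    All times must use the same unit.
    """
    if ts <= 0 or n <= 0:
        raise ValueError("ts and n must be positive")
    if not 0.0 <= p_local <= 1.0:
        raise ValueError("p_local must be in [0, 1]")
    num_checkpoints = ts / n                      # assumption: one per interval
    avg_overhead = p_local * cl + (1.0 - p_local) * cg
    tc = num_checkpoints * avg_overhead           # total checkpointing time TC
    return ts / (ts + tc)


# Hypothetical values in seconds, chosen only to exercise the formula.
two_level = no_failure_efficiency(ts=3600, n=60, cl=1.0, cg=5.0)
global_only = no_failure_efficiency(ts=3600, n=60, cl=1.0, cg=5.0, p_local=0.0)
print(f"two-level: {two_level:.3f}, global only: {global_only:.3f}")
```

Because CL is smaller than CG in this sketch, shifting checkpoints to local storage always raises the failure-free efficiency; the cost of doing so only appears once failures are modeled.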

Performance Model

• No failure occurs during the checkpoint process itself. On average, a failure
occurs at 0.5n, i.e., halfway through a checkpoint interval of length n
• TL is the time overhead when the process is recoverable from the latest
checkpoint file.

[Figure: timeline of total length Ts ＋ TC (n, CG, n, CL, n, CL, n, CL, n) with the average failure point 0.5n marked in each interval.]

[Figure: wasted time on failure, with the # of failures split 85% / 15% [Moody, SC’10]. Case 1: n, CG, n, CL, 0.5n, failure, restart RL, then n, CL, n. Case 2: n, CG, 0.5n, failure, restart RG, then n, CL, n, CL, n.]

Performance Model

• TG is the time overhead when the process is recoverable only from the
global checkpoint file.

[Figure: timeline n, CG, n, CL, 0.5n up to the failure, with restart markers RG and RL, followed by n, CL, n.]
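The slides do not spell out formulas for TL and TG, so the following Python sketch is one possible reading of the timelines, not the authors' model. It assumes that a failure wastes 0.5n of computation on average, that a locally recoverable failure additionally costs one local restart RL, and that a failure recoverable only from the global checkpoint also loses every interval and local checkpoint completed since the last global checkpoint before paying a global restart RG. The 85%/15% split is assumed to mean locally recoverable versus global-only failures [Moody, SC’10]. All function names, parameter names, and numbers are hypothetical.

```python
def local_recovery_overhead(n, rl):
    """TL (assumed): half an interval of lost work plus a local restart."""
    return 0.5 * n + rl


def global_recovery_overhead(n, cl, rg, locals_since_global):
    """TG (assumed): intervals and local checkpoints since the last global
    checkpoint are lost, plus half an interval and a global restart."""
    return locals_since_global * (n + cl) + 0.5 * n + rg


def expected_failure_overhead(n, cl, rl, rg,
                              locals_per_global=3, p_local_ok=0.85):
    """Average overhead per failure under the assumptions above."""
    tl = local_recovery_overhead(n, rl)
    # Assume the failure is equally likely after any checkpoint in a cycle,
    # so 0..locals_per_global local checkpoints may have to be discarded.
    tg_cases = [global_recovery_overhead(n, cl, rg, k)
                for k in range(locals_per_global + 1)]
    tg = sum(tg_cases) / len(tg_cases)
    return p_local_ok * tl + (1.0 - p_local_ok) * tg


# Hypothetical values in seconds, chosen only to exercise the formulas.
overhead = expected_failure_overhead(n=60, cl=1.0, rl=1.0, rg=5.0)
print(f"expected overhead per failure: {overhead:.2f} s")
```

Under this reading, raising the number of local checkpoints per global checkpoint lowers the checkpointing cost but raises the TG term, which is the trade-off the talk returns to in its conclusion.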


Experimental Set Up
• The evaluation was conducted on a GPU cluster of
four compute nodes. Each compute node has:
– Intel Core i7 930 CPU
– NVIDIA Tesla C2070 GPU
– Main memory of 24 GB
– tmpfs RAM disk of 12 GB
• CPR tools:
– BLCR-0.8-4 (CPU state checkpoint)
– CheCL (GPU state checkpoint)
• Benchmark:
– Molecular Dynamics (MD)

Checkpoint Time Comparison for GPU Cluster
[Figure: bar chart of checkpoint time (ms, axis from 0 to 16000) for Global CheCL and Local CheCL, grouped by # of nodes (1, 2, 4) and problem size (12288, 24574, 73728). Annotation: "Accelerate up to > 4x".]

Efficiency (Ts/Ttotal) Improvement (No Failure)
[Figure: efficiency (0% to 100%) versus checkpoint frequency (1x, 2x, 4x, 8x, 16x, 32x, 64x), with Two-level CheCL’s PL:PG=3:1. Series: 2 nodes, Local and Global; 2 nodes, Global only; 4 nodes, Local and Global; 4 nodes, Global only.]

Efficiency Improvement (MTTF = 3 minutes)
[Schroeder, SciDAC’07]

[Figure: efficiency (0% to 100%) versus checkpoint frequency (1x, 2x, 4x, 8x, 16x, 32x, 64x), with Two-level CheCL’s PL:PG=3:1. Series: 4 nodes, Local and Global; 4 nodes, Global only.]

Trade-off Between Local/Global Ratio and Two-level CheCL’s Time Overhead

[Figure: time overhead (ms, axis from 0 to 4500) versus Local/Global ratio, from (0:10) to (9:1) in steps of one.]

Conclusion
• Checkpointing is important for HPC system
dependability
• Two-level CheCL can improve system efficiency
• Local CheCL can be used for high speed
checkpointing
• There is a trade-off between Local and Global CheCL
which must be treated carefully for future
implementation on large-scale GPU computing
systems
