
C/R Support for Heterogeneous HPC Applications


Published on

This poster is about C/R Support for Heterogeneous HPC Applications.

Published in: Science


  1. C/R SUPPORT FOR HETEROGENEOUS HPC APPLICATIONS
     {KONSTANTINOS.PARASYRIS, LEONARDO.BAUTISTA}@BSC.ES

     MOTIVATION
     • Over the last decade, supercomputers have grown in size and computing capability.
       – However, as the number of components increases, systems become more prone to failures.
     • Future exascale systems will be heterogeneous, consisting of many nodes that combine high-throughput general-purpose GPUs with high-performance multi-core CPUs.
       – GPUs are more error prone than CPUs. In TSUBAME, 40% of all failures are caused by GPU errors, while CPU-related failures account for less than 5%.

     CHALLENGES
     1. The amount of data to be checkpointed increases, as HPC applications are able to process more information.
     2. The decrease in MTBF leads to a higher checkpoint frequency.
     3. Portions of the application data are stored in CPU memory, whereas other portions are stored in GPU main memory.
     4. More programmer effort is required to provide an easy-to-use checkpoint mechanism across multiple nodes and multiple GPU devices.

     VISION
     Challenge 1: Support differential checkpointing, which reduces the amount of data stored by writing only the data whose values have changed.
     Challenge 2: Support multi-level checkpointing, which sustains a higher checkpoint frequency by using dedicated background processes.
     Challenge 3: Support checkpoint data distributed across multiple devices in the same node.
     Challenge 4: Provide a single API to support multi-node/multi-GPU applications.

     FTI API
     The following scheme shows the logic of an FTI integration:
     1. Initialize FTI: FTI_Init()
     2. Expose crucial data to FTI: FTI_Protect()
     3. Perform checkpoint: FTI_Checkpoint() or FTI_Snapshot()
     4. Finalize FTI: FTI_Finalize()
     In line 10 the developer allocates memory using a UVM address; in line 12 she allocates a device pointer.
     In lines 14, 15 and 16 the developer protects three different memory addresses (a host, a UVM and a device address) using exactly the same API.

     ACKNOWLEDGEMENT
     This work has been supported by EU H2020 ICT project LEGaTO, contract #780681.

     GPU SUPPORT
     Implementation of the GPU checkpoint. The writing of the checkpoint file is overlapped with the streaming of the data from the GPU.
     [Figure: pipeline timeline in which Chunk1 to Chunk4 pass through the H2D engine, the CPU and the I/O engine; the remaining overhead is marked.]
     Profiling of the code showed latencies on the CPU side when computing the integrity checksum of the checkpoint file. We exploit the GPU's computational power to compute the checksum on the GPU and let the CPU perform only data movements.
     [Figure: the same pipeline with the checksum offloaded. Rows: GPU MD5 kernel, H2D engine, CPU Task 1, CPU Task 2 (driver of the MD5-GPU implementation) and I/O engine, processing Chunk1 to Chunk4; the remaining overhead is marked.]

     EVALUATION
     Time to checkpoint and recover using the 4 FTI checkpoint levels:

     9.6 TB / 9.6K Procs    L1     L2     L3     L4     No-FTI
     POSIX - Write [s]      3.2    7      15.4   127    124
     MPI-IO - Write [s]     3.4    6.8    16     365    362
     POSIX - Read [s]       6      8      8      4      0.64

     [Figure: checkpoint time in seconds (0 to 120) on 1, 4, 8 and 16 nodes, for checkpoint sizes of 64 Gb and 128 Gb per node, comparing the Initial and Optimized implementations.]
     Checkpoint of Head2D, testing our multi-GPU/multi-node checkpoint methodology using UVM memory.

     MULTI-LEVEL CHECKPOINT
     With multi-level checkpoint support, the application can store the checkpoint file using different levels. Each level corresponds to a different point in the grid spanned by reliability and checkpoint overhead.
     • Level 1: SSD, PCM, NVM. Fastest checkpoint level. Low reliability; covers transient failures.
     • Level 2: Replication. Fast copy to a neighbor node. Tolerates single-node crashes.
     • Level 3: Encoding. Slow for large checkpoints. Very reliable; tolerates multiple node crashes.
     • Level 4: Classic checkpoint. Slowest of all levels. The most reliable; covers power outages.
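Because the four levels trade overhead against reliability, one natural policy is to take the cheap levels often and the expensive levels rarely. The sketch below illustrates that idea in Python. It is not FTI code: the `Level` class, the interval values and the `level_for_iteration` function are hypothetical names and numbers chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    number: int    # 1 = fastest, 4 = most reliable
    name: str
    interval: int  # hypothetical: take this level every `interval` iterations


# Hypothetical intervals: cheaper levels run often, costlier levels rarely.
LEVELS = (
    Level(1, "local storage", 10),
    Level(2, "replication to neighbor node", 40),
    Level(3, "encoding", 160),
    Level(4, "classic checkpoint", 640),
)


def level_for_iteration(iteration, levels=LEVELS):
    """Return the most reliable level that is due at this iteration, or None."""
    if iteration <= 0:
        return None
    due = [lv for lv in levels if iteration % lv.interval == 0]
    if not due:
        return None
    return max(due, key=lambda lv: lv.number)


if __name__ == "__main__":
    for it in (5, 10, 40, 160, 640):
        lv = level_for_iteration(it)
        print(it, lv.name if lv else "no checkpoint")
```

When several levels are due at the same iteration, the sketch picks the highest one, on the assumption that a more reliable checkpoint makes the cheaper one redundant for that step.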
     [Figure: two execution timelines, one with synchronous and one with asynchronous post-processing. Each shows application execution, checkpoint to local storage, the post-checkpoint procedure and the resulting checkpoint overhead; the asynchronous timeline adds an FTI dedicated thread.]
     FTI can dedicate one head process per node to post-process local checkpoints. The application processes store the checkpoint locally on the nodes, and the head process does the work needed to turn it into an L2, L3 or L4 checkpoint.
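The asynchronous scheme can be illustrated with a small Python sketch in which the application waits only for the local write, while a background worker performs the post-processing. This is a simplified stand-in, not FTI's implementation: a thread plays the role of the head process, a plain file copy into a second directory plays the role of L2 replication, and the names `head_worker` and `run_with_async_postprocessing` are hypothetical.

```python
import queue
import shutil
import tempfile
import threading
from pathlib import Path


def head_worker(tasks, partner_dir):
    """Copy each local checkpoint to the partner directory until told to stop."""
    while True:
        local_path = tasks.get()
        if local_path is None:  # sentinel: no more checkpoints will arrive
            break
        shutil.copy2(local_path, partner_dir / local_path.name)


def run_with_async_postprocessing(local_dir, partner_dir, states):
    """Write each state locally, then hand it to the background worker."""
    tasks = queue.Queue()
    head = threading.Thread(target=head_worker, args=(tasks, partner_dir))
    head.start()
    for step, state in enumerate(states):
        local_path = local_dir / f"ckpt_{step}.bin"
        local_path.write_bytes(state)  # the only part the application waits for
        tasks.put(local_path)          # the copy proceeds in the background
    tasks.put(None)
    head.join()  # wait until every queued checkpoint has been copied
    return sorted(p.name for p in partner_dir.iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        local = Path(root, "local")
        partner = Path(root, "partner")
        local.mkdir()
        partner.mkdir()
        print(run_with_async_postprocessing(local, partner, [b"a", b"bb"]))
```

Each step writes to its own file, so the worker never reads a checkpoint that the application is still modifying. A real system would also have to handle failures that occur while the copy is in flight.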