
Providing fault tolerance in extreme scale parallel applications



Presentation as given at the HPCDB workshop at SC’11 (Supercomputing 2011).

Published in: Technology


  1. Providing fault tolerance in extreme scale parallel applications
     What can the HPC community learn from the database community?
     Huub van Dam, Abhinav Vishnu, Bert de Jong (Hubertus.vanDam@pnnl.gov)
     HPCDB @ SC’11, Seattle, Friday, Nov 18, 2011
  2. Outline
     • Intro
     • Hard faults
       • Error detection
       • Redundancy for data protection
       • Transactional vs. phased updates
       • Fault recovery
     • Soft errors
       • Characteristics
       • Read errors
       • Compute errors: detectable and undetectable
       • Optimizing algorithms vs. step algorithms
     • Summary
  3. PNNL is taking on extreme scale computing
     • The extreme scale computing initiative attacks a broad range of aspects:
       • Parallelization and scalability
       • Hybrid computing
       • Performance analysis, debugging, etc.
       • Energy efficiency and consumption
       • Fault tolerance
     • Collaborative approach:
       • Abhinav Vishnu (Computer Scientist, Global Arrays developer)
       • Bert de Jong (Chemist, NWChem team leader)
       • Huub van Dam (Chemist, NWChem developer)
     • NWChem was used by the 2009 Gordon Bell finalist Apra et al.: E. Apra, A.P. Rendell, R.J. Harrison, T. Vinod, W.A. de Jong, S.S. Xantheas, SC’09, Portland, OR, SESSION: Gordon Bell Finalists, article 66, Doi:10.1145/1654059.1654127
  4. Faults are inevitable at scale
     [Figure: probability of a fault (0 to 1) versus number of cores (1 to 1,000,000,000, spanning tera scale, peta scale, and exa scale), with one curve each for per-core fault probabilities of 1.0e-8, 1.0e-7, and 1.0e-6; Jaguar (ORNL) is marked on the plot.]
     • The database community realized that faults are a fact of life back in the 1980s (e.g. J. Gray, Proc. 7th Int. Conf. Very Large Databases, Sep 9-11, 1981, 144-154; T. Haerder, A. Reuter, Computing Surveys, 1983, 15, 287-317, Doi:10.1145/289.291)
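The slide does not state the model behind the curves. A minimal sketch, assuming each core fails independently with the same probability p, gives the chance of at least one fault among N cores as 1 - (1 - p)^N. The function name `system_fault_probability` is made up for this illustration.

```python
import math


def system_fault_probability(per_core_prob: float, n_cores: int) -> float:
    """P(at least one fault) = 1 - (1 - p)^N, assuming independent cores."""
    # log1p/expm1 keep precision when p is tiny and N is huge.
    return -math.expm1(n_cores * math.log1p(-per_core_prob))


if __name__ == "__main__":
    # Same per-core probabilities and core counts as the axes of the figure.
    for p in (1.0e-8, 1.0e-7, 1.0e-6):
        for n in (1, 1_000, 1_000_000, 1_000_000_000):
            prob = system_fault_probability(p, n)
            print(f"p={p:.0e}  cores={n:>13,}  P(fault)={prob:.4f}")
```

Under this assumption the probability stays negligible at small core counts and approaches 1 once N is much larger than 1/p, which matches the qualitative message of the slide.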
  5. Different views: DB vs. HPC
     Databases | High Performance Computing
     Run for a long time | Run for a limited time
     Serve many queries / requests | Calculate one specific thing
     Entrusted with unique business critical data | If needed, the calculation can be repeated
     Reliability top priority | Performance top priority
     Up to some 5,000 cores (Google, Yahoo, Facebook, etc.) | Up to 225,000 cores
     Advanced fault tolerance strategies | Basic checkpoint restart up to now
  6. Data distribution managed through the Global Arrays Toolkit
     • PGAS programming model: distributed dense arrays that can be addressed in a shared-memory-like style with one-sided access, i.e. all processors can access all data irrespective of location
     [Figure: a shared object in the global address space is backed by physically distributed data; a process uses get to copy data into local memory, computes/updates it there, and uses put to write it back.]
     • Hides much of the necessary parallel infrastructure, but awareness of data locality is still needed for scalability
     • The model will have to change on exascale machines!
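To make the get / compute / put pattern on the slide above concrete, here is a toy single-process sketch. `ToyGlobalArray` and its methods are hypothetical names invented for this illustration; they are not the Global Arrays API, and plain Python lists stand in for the local memory of each process.

```python
class ToyGlobalArray:
    """A 1-D array split into equal blocks, one block per simulated process."""

    def __init__(self, length, n_procs):
        self.length = length
        self.block = -(-length // n_procs)  # ceiling division
        # Each inner list stands in for one process's local memory.
        self.local = [
            [0.0] * min(self.block, max(0, length - p * self.block))
            for p in range(n_procs)
        ]

    def owner(self, i):
        """Map a global index to (process rank, local offset)."""
        return divmod(i, self.block)

    def get(self, lo, hi):
        """One-sided read of the global range [lo, hi) into a local buffer."""
        out = []
        for i in range(lo, hi):
            rank, offset = self.owner(i)
            out.append(self.local[rank][offset])
        return out

    def put(self, lo, values):
        """One-sided write of a local buffer to the global range starting at lo."""
        for k, value in enumerate(values):
            rank, offset = self.owner(lo + k)
            self.local[rank][offset] = value


ga = ToyGlobalArray(length=10, n_procs=4)
ga.put(2, [1.0, 2.0, 3.0, 4.0])        # the range spans two owners
patch = ga.get(2, 6)                   # get: copy to local memory
patch = [2.0 * x for x in patch]       # compute/update locally
ga.put(2, patch)                       # put: write back
```

The caller never names the owning process, which is the convenience the slide describes. The `owner` method is where locality awareness would come in, since accesses that cross block boundaries involve more than one process.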
  7. How to handle hard faults?
     • Definition: a hard fault is an error that kills a process
     • Examples:
       • A power failure on a node
       • A process that hits a segmentation fault
     • Issues:
       • How to detect a fault
       • How to protect against data loss
       • How to determine the state of the application
       • How to salvage the state of the application
  8. Fault detection protocols
     [Figure: comparison of fault detection protocols on an InfiniBand network, including a ping message, which requires a response from the remote process, and an RDMA read. The figure labels the options from less reliable to most reliable, and no response is taken to mean that the node is dead.]
     A. Vishnu, H. van Dam, W. de Jong, P. Balaji, S. Song, High Perf. Comp. (HiPC), 2010 Int. Conf. on, pp.1-9, 19-22 Dec. 2010, Doi: 10.1109/HIPC.2010.5713195
  9. Fault detection alternative
     [Figure: Proc M and Proc N each hold a contract with a manager. Messages shown: contract renewal request, contract confirmation, and contract enquiry about N.]
     • Process N only participates if a valid contract is in place
     • If the contract expires, N terminates itself
     • The manager then concludes that N died
     • M checks the status of N with the manager
     • Time to error detection increases with contract length
     • Communication decreases with contract length
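The contract scheme on the slide above resembles a lease. A minimal sketch, assuming a single manager and explicit timestamps passed in by the caller so the behavior is deterministic; `LeaseManager` and `Worker` are hypothetical names, not part of any library mentioned in the talk.

```python
class LeaseManager:
    """Grants time-limited contracts and answers enquiries about them."""

    def __init__(self, lease_length):
        self.lease_length = lease_length
        self.expiry = {}  # rank -> time at which the contract lapses

    def renew(self, rank, now):
        """Grant or extend a contract and return its new expiry time."""
        self.expiry[rank] = now + self.lease_length
        return self.expiry[rank]

    def is_alive(self, rank, now):
        """A rank is presumed dead once its contract has lapsed."""
        return now < self.expiry.get(rank, float("-inf"))


class Worker:
    """A process that may only participate while its contract is valid."""

    def __init__(self, rank, manager, now):
        self.rank = rank
        self.manager = manager
        self.expiry = manager.renew(rank, now)

    def may_participate(self, now):
        return now < self.expiry

    def renew(self, now):
        if not self.may_participate(now):
            # Mirrors "if the contract expires, N terminates itself".
            raise RuntimeError("contract expired: process must terminate")
        self.expiry = self.manager.renew(self.rank, now)


manager = LeaseManager(lease_length=5.0)
proc_n = Worker(rank=1, manager=manager, now=0.0)
proc_n.renew(now=4.0)                         # contract now runs until t = 9
alive_at_8 = manager.is_alive(1, now=8.0)     # M enquires about N
alive_at_9_5 = manager.is_alive(1, now=9.5)   # N stopped renewing
```

The trade-off from the slide is visible in `lease_length`: a longer contract means fewer renewal messages, but a dead process goes unnoticed for up to one full contract period.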
  10. How is data redundancy used and data access orchestrated?
      [Figure: timeline for Proc 0 to Proc 3 passing through the phases Working, Updating primary copy, Updating shadow copy, and Done.]
      H.J.J. van Dam, W.A. de Jong, A. Vishnu, J. Chem. Theory Comput., 2011, 7, pp 66–75, Doi: 10.1021/ct100439u
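One plausible reading of the phases on the slide above is that the primary and shadow copies are never written in the same phase, so at least one of them is valid whenever a fault strikes. The sketch below illustrates that reading only; it is not taken from the cited paper. `ReplicatedBlock` is a hypothetical name, and both copies live in one object here, whereas in a real run they would sit on different processes.

```python
class ReplicatedBlock:
    """Data block with a primary and a shadow copy, updated in separate phases."""

    def __init__(self, values):
        self.primary = list(values)
        self.shadow = list(values)
        self.phase = "working"  # working -> primary -> shadow -> done

    def update(self, new_values, fail_in=None):
        """Write the primary copy, then the shadow; optionally fail mid-phase."""
        self.phase = "primary"
        self._write(self.primary, new_values, fail=(fail_in == "primary"))
        self.phase = "shadow"
        self._write(self.shadow, new_values, fail=(fail_in == "shadow"))
        self.phase = "done"

    @staticmethod
    def _write(target, new_values, fail):
        for i, value in enumerate(new_values):
            if fail and i == len(new_values) // 2:
                raise RuntimeError("simulated hard fault mid-write")
            target[i] = value

    def recover(self):
        """Restore the copy that was being written from the one that was not."""
        if self.phase == "primary":    # primary may be corrupt, shadow is old but valid
            self.primary = list(self.shadow)
        elif self.phase == "shadow":   # primary is new and valid, shadow may be corrupt
            self.shadow = list(self.primary)
        self.phase = "working"


block = ReplicatedBlock([1, 2, 3, 4])
try:
    block.update([10, 20, 30, 40], fail_in="primary")
except RuntimeError:
    block.recover()   # falls back to the old values; the task can be re-executed
```

A fault during the primary phase rolls the block back to its old contents, while a fault during the shadow phase rolls it forward to the new contents. In both cases the phase marker tells the recovery step which copy to trust.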
  11. Our approach vs. transactional
      Our approach | Transactions
      Rolled into one: (a) data transmission, (b) changing persistent data | Separated: (a) send data first, (b) change persistent data only at commit
      Memory efficient | Stores data until commit
      Three states for data: (1) available and valid, (2) available and corrupt, (3) unavailable | Two states for data: (1) available and valid, (2) unavailable
      Only one update per task allowed | Only one commit per task allowed
      M. Herlihy, J.E.B. Moss, “Transactional memory”, Proc. ISCA’93, 1993, Doi:10.1145/165123.165164
  12. What about soft errors?
      • Soft errors are intermittent deviations from the correct platform behavior
      • Examples:
        • Data as read is different from data as written
        • Instructions that mis-execute, e.g. Add(1,1) --> 3
      • Read errors can be detected using checksums
        • Check all inputs to a task
        • Maybe also check all inputs at the end of a task
      • Error correction is not needed if relying on duplicated persistent data
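The checksum idea on the slide above can be sketched with the CRC-32 from Python's standard library. The helper names `checksum` and `read_verified` are hypothetical, and falling back to a second copy is an assumption based on the slide's remark about duplicated persistent data.

```python
import struct
import zlib


def checksum(values):
    """CRC-32 over the little-endian IEEE-754 bytes of a list of floats."""
    return zlib.crc32(struct.pack(f"<{len(values)}d", *values))


def read_verified(primary, shadow, expected):
    """Return the first copy whose checksum matches; no error correction needed."""
    for copy in (primary, shadow):
        data = list(copy)
        if checksum(data) == expected:
            return data
    raise IOError("both copies fail their checksum")


written = [1.0, 2.0, 3.0]
stored_checksum = checksum(written)

# Simulate a read error in the primary copy: the sign bit of one value flips.
primary_as_read = [1.0, -2.0, 3.0]
task_input = read_verified(primary_as_read, written, stored_checksum)
```

Calling `read_verified` again at the end of a task would cover the slide's second suggestion of re-checking the inputs after the work is done.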
  13. Mis-executing instructions
      • Incorrectly executed instructions can be detected
        • By duplicating work and using a quorum
          • This is very expensive (at least a factor of 2, or 3 if using a quorum)
          • Feasible only if used selectively
        • By using estimates
          • Requires development of bounds on (many) quantities
          • Far fewer operations than blanket duplication
          • Can detect only a subset of deviations
      • Example of a bound, for the two-electron integrals:
        $\left|(ij|kl)\right| \le \sqrt{\max_{i,j}\left|(ij|ij)\right|}\,\sqrt{\max_{k,l}\left|(kl|kl)\right|}$
        $(ij|kl) = \iint f_i(r_1)\, f_j(r_1)\, |r_1 - r_2|^{-1}\, f_k(r_2)\, f_l(r_2)\, dr_1\, dr_2$
      M. Rebaudengo, M. Sonza Reorda, M. Torchiano, M. Violante, “Soft-Error Detection through Software Fault-Tolerance Techniques”, Proc. DFT’99, 210-218, 1999, Doi:10.1109/DFTVS.1999.802887
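The duplication approach on the slide above can be sketched as a majority vote over repeated executions. This is a minimal illustration with hypothetical names (`run_with_quorum`, `make_flaky_add`). It assumes results can be compared for exact equality, which holds for the integer example here but would need care for floating-point work.

```python
from collections import Counter


def run_with_quorum(task, args, replicas=3):
    """Run a task several times and return the majority result."""
    results = [task(*args) for _ in range(replicas)]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= replicas // 2:
        raise RuntimeError("no majority: the soft error could not be outvoted")
    return value


def make_flaky_add(bad_call):
    """Build an add() that mis-executes on one chosen call, like Add(1,1) --> 3."""
    calls = {"n": 0}

    def add(a, b):
        calls["n"] += 1
        return 3 if calls["n"] == bad_call else a + b

    return add


result = run_with_quorum(make_flaky_add(bad_call=2), (1, 1))
```

With three replicas every task costs three executions, which is why the slide calls the approach feasible only when applied selectively.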
  14. Impact on calculation
      • The soft error impact depends on the algorithm type
      • Optimization algorithms (defined result)
        • E.g. minimizing the energy as a function of atom positions
        • Termination condition expressed as a property of the result
        • Designed to iteratively reduce error
        • Automatically removes the impact of perturbations
        • Perturbation should not be too big
        • All invariants must be expressed explicitly so that they can be enforced
      • Step algorithms (defined effort)
        • E.g. a matrix-matrix multiplication, $A_{ij} = \sum_k B_{ik} C_{kj}$
        • Termination condition independent of the result
        • Any error perturbs the final answer
        • Perturbations must be actively minimized
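The contrast on the slide above can be illustrated with two small stand-ins that are not from the talk: a Newton iteration for a square root plays the optimization algorithm, and a dot product plays the step algorithm. The `corrupt_at` parameter is a hypothetical hook that injects one simulated soft error.

```python
def sqrt_newton(a, tol=1e-12, corrupt_at=None):
    """Defined result: iterate until x*x matches a, whatever happens on the way."""
    x = a
    step = 0
    while abs(x * x - a) > tol * a:
        x = 0.5 * (x + a / x)
        step += 1
        if step == corrupt_at:
            x *= 1.5  # simulated soft error in an intermediate value
    return x


def dot(u, v, corrupt_at=None):
    """Defined effort: exactly len(u) steps, then stop regardless of the result."""
    total = 0.0
    for k, (a, b) in enumerate(zip(u, v)):
        term = a * b
        if k == corrupt_at:
            term *= 1.5  # simulated soft error in one product
        total += term
    return total


root_clean = sqrt_newton(2.0)
root_hit = sqrt_newton(2.0, corrupt_at=3)               # takes extra iterations
dot_clean = dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
dot_hit = dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], corrupt_at=1)
```

The perturbed Newton run still ends at the square root of 2 because its termination test is a property of the result, whereas the perturbed dot product silently returns a different number. The perturbation must stay moderate, as the slide notes: a corruption that drove `x` to zero or to a non-finite value would break the iteration.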
  15. Soft errors
      • The best candidates for soft error resiliency are optimization algorithms
      • Some step algorithms can be transformed into optimization algorithms
      • Even optimization algorithms may require rewrites to express invariants explicitly
      • Even optimization algorithms may be perturbed such that recovery becomes impractical
  16. Summary
      • Fault tolerance in HPC is easier than in databases, because HPC algorithms make defined changes to the persistent state, so no rollback is required and re-executing tasks is no problem
      • Fault tolerance in HPC is harder than in databases, because all recovery has to be handled automatically without any human intervention
      • Reliable fault detection currently depends on hardware-supported features. Is there a better way?
      • Our data updates are not transactional. Should they be?
      • Soft errors are a challenge because they cannot always be detected
      • Soft errors are likely manageable for optimization algorithms
      • Questions / Comments?
  17. NWChem development is funded by:
      • Department of Energy: BER, ASCR, BES
      • PNNL LDRD
      EMSL: A national scientific user facility integrating experimental and computational resources for discovery and technological innovation