Enabling Application Integrated Proactive Fault Tolerance

Dai Yang, Josef Weidendorfer, Tilman Küstner, and Carsten Trinitis
Chair of Computer Architecture, Technical University of Munich (TUM)
Sibylle Ziegler
Klinik und Poliklinik für Nuklearmedizin, Ludwig-Maximilians-Universität München
14 September 2017
ENVELOPE – Efficiency and Reliability: Self-Organisation in HPC Systems, http://envelope.itec.kit.edu/
ParCo 2017

Abstract: Exascale computing is the next major milestone for the HPC community. Due to a steadily increasing probability of failures, current applications must be made malleable so that they can cope with dynamic resource changes. In this paper, we show first results with LAIK, a lightweight library for dynamically re-distributable application data. This allows compute nodes to be freed from their workload before a predicted failure. For a real-world application, we show that LAIK adds negligible overhead. In addition, we show the effect of different re-distribution strategies.
Motivation
• Complexity of HPC towards exascale computing: hide the complexity of HPC from the application programmer.
• Missing dynamics in HPC applications, with an increasing degree of heterogeneity.
• Efficiency: increase the efficiency of existing and new HPC applications.
• Reliability: increase the reliability of the HPC environment; global checkpointing and restart does not scale well enough for exascale.
• This work is part of the BMBF project ENVELOPE, funded by BMBF under grant 01IH16010D.
• Computing resources for this project have been provided by the Gauss Centre for Supercomputing / Leibniz Supercomputing Centre under grant pr63qi.
Goals
Background
• Application-integrated approach, in contrast to application-transparent, system-level approaches
• For both existing and new applications
Basic Idea
• Exchange, expand, or shrink the application ("malleable" application)
• The application should be able to retreat from nodes by itself
• Incrementally adoptable
• Data-oriented SPMD model (same as MPI)
• PGAS-like
LAIK (0) – Design Principles
• Modularized, plugin-based, expandable design
• Index space abstraction
• A bit of data management – no global array
• Automatic load balancing
• (Proactive) fault tolerance
• (Future) reactive fault tolerance using in-memory checkpointing
LAIK (1) – Design
• Application-integrated
• Typical data types (1D/2D/3D), (future) arbitrary data types
• Typical HPC communication backends: currently MPI (works with plain OpenMP as well)
LAIK (2) – At a Glance
• Partitioning over index spaces (a minimal usage sketch follows this list)
• Automatic data (re-)balancing by repartitioning:
  – uniform distribution per number of elements or per task
  – by element weight
  – (future) by profiling
• Fault tolerance:
  – proactive, via repartitioning
  – (future) reactive, via local in-memory checkpointing
• Communication backends:
  – working: MPI
  – work in progress: shared memory
• Work in progress: agents for system state information, via MQTT and TCP
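To make the programming model above concrete, here is a minimal sketch of a LAIK program: initialize an instance, create a 1D container of doubles, and switch it to a block partitioning over all tasks. Function names and signatures are modeled on the public LAIK repository (github.com/envelope-project/laik) but should be read as assumptions for illustration, not as the authoritative API.

    #include <laik.h>       /* assumption: LAIK's public header */
    #include <stdint.h>

    int main(int argc, char* argv[])
    {
        /* Initialize LAIK; the communication backend (e.g. MPI)
           is selected at startup. */
        Laik_Instance* inst  = laik_init(&argc, &argv);
        Laik_Group*    world = laik_world(inst);

        /* A container of 1,000,000 doubles over a 1D index space. */
        Laik_Data* d = laik_new_data_1d(inst, laik_Double, 1000000);

        /* Switch to a block partitioning over all tasks in 'world';
           LAIK derives the required communication from the
           transition (assumed signature). */
        laik_switchto_new_partitioning(d, world,
                                       laik_new_block_partitioner1(),
                                       LAIK_DF_None, LAIK_RO_None);

        /* Access the locally owned slice through its mapping. */
        double* base; uint64_t count;
        laik_get_map_1d(d, 0, (void**) &base, &count);
        for (uint64_t i = 0; i < count; i++)
            base[i] = 1.0;

        laik_finalize(inst);
        return 0;
    }

The key point is that the application never posts sends or receives itself: it only declares which partitioning it wants next, and the backend executes the resulting transition.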
LAIK (3) – Partitioning
• Controlled access pattern (read/write) and data flow (CopyIn/CopyOut)
• Supports coupling of different data containers
• Data consistency through given reduction operations upon multiple write accesses (sketched below)
• Flexible (malleable) data partitions for repartitioning
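Continuing the sketch above, the data-flow and reduction semantics could look as follows: every task writes partial results into the full index range, and the next switch declares a sum reduction that makes the container consistent again. laik_All, LAIK_DF_Preserve, and LAIK_RO_Sum are modeled on names in the LAIK repository and are assumptions here; partial_result is a hypothetical helper.

    /* Let every task write partial results into the full index
       range of d ("all" partitioning, multiple writers). */
    laik_switchto_new_partitioning(d, world, laik_All,
                                   LAIK_DF_None, LAIK_RO_None);
    double* base; uint64_t count;
    laik_get_map_1d(d, 0, (void**) &base, &count);
    for (uint64_t i = 0; i < count; i++)
        base[i] += partial_result(i);   /* hypothetical helper */

    /* The next switch declares a sum reduction: LAIK resolves the
       multiple writes by combining all task-local copies
       element-wise, restoring a consistent distributed state. */
    laik_switchto_new_partitioning(d, world,
                                   laik_new_block_partitioner1(),
                                   LAIK_DF_Preserve, LAIK_RO_Sum);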
LAIK (4) – Partitioning and Partitioners
• Types of partitionings and corresponding partitioners (see the sketch after this list):
  – Master: all data in only one task
  – Block: every task has a slice of the data
  – All: every task has everything
  – (future) Halo, bisection, and others
• Switching partitionings for data redistribution
• Data flow and consistency are checked and enforced
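As a sketch, the three partitioning types above might be requested for the same container d from the earlier example; laik_Master and laik_All are modeled on LAIK's predefined partitioners and are assumptions here.

    /* Master: all data in task 0 only, e.g. for sequential I/O. */
    laik_switchto_new_partitioning(d, world, laik_Master,
                                   LAIK_DF_Preserve, LAIK_RO_None);

    /* Block: each task owns one slice, for the bulk compute phase. */
    laik_switchto_new_partitioning(d, world,
                                   laik_new_block_partitioner1(),
                                   LAIK_DF_Preserve, LAIK_RO_None);

    /* All: every task holds a full replica, e.g. a read-only input. */
    laik_switchto_new_partitioning(d, world, laik_All,
                                   LAIK_DF_Preserve, LAIK_RO_None);

Because each switch declares the data flow explicitly, LAIK can check consistency and compute the minimal transfer set for every transition.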
LAIK (5) – Repartitioning
• Different repartitioning methods: continuous and incremental
• Steps (sketched in code below):
  1. Synchronize tasks, communicate the failed task numbers
  2. Create a new group excluding the failed tasks
  3. Get the partitioner and rerun it with the new group, yielding new balanced indexes
  4. Calculate the differences and the required data transfer actions (the transition)
  5. For each data container: execute the transition
  6. (optional) Remove the old group or migrate from the old to the new group
  7. Update the data-to-address-space mapping
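In application code, a proactive shrink following these steps could look roughly like this. laik_new_shrinked_group is modeled on the LAIK repository; how the failing task ID arrives (it would come from the agent system) is hypothetical here.

    /* Steps 1-2: build a new task group that excludes the task on
       the node predicted to fail (hypothetical input). */
    int failing[1] = { predicted_failing_task };
    Laik_Group* shrunk = laik_new_shrinked_group(world, 1, failing);

    /* Steps 3-5: rerun the block partitioner on the smaller group
       and switch. LAIK computes the delta between old and new
       partitioning (the transition) and migrates data away from
       the node that is about to fail. */
    laik_switchto_new_partitioning(d, shrunk,
                                   laik_new_block_partitioner1(),
                                   LAIK_DF_Preserve, LAIK_RO_None);

    /* Steps 6-7: tasks that own nothing in the new partitioning may
       now exit; the remaining tasks update their mappings and
       continue the computation. */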
MLEM – A Short Introduction
• The small-animal PET scanner MADPET-II
• 1152 detectors, 662,976 lines of response
• Field of view of 140 × 140 × 40 voxels, 784,000 voxels in total (consistency check below)
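The quoted sizes are mutually consistent: the lines of response are all unordered detector pairs, and the voxel count is the product of the field-of-view dimensions:

    \binom{1152}{2} = \frac{1152 \cdot 1151}{2} = 662976,
    \qquad
    140 \cdot 140 \cdot 40 = 784000.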
MLEM Algorithm
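The equation on this slide did not survive extraction; what it presumably showed is the standard MLEM update (Shepp and Vardi, 1982). With system matrix entries a_ij (the probability that an emission in voxel j is detected in line of response i), measured counts g_i, and current image estimate f^(k), one iteration reads:

    f_j^{(k+1)} \;=\; \frac{f_j^{(k)}}{\sum_i a_{ij}}
    \sum_i a_{ij}\, \frac{g_i}{\sum_{j'} a_{ij'}\, f_{j'}^{(k)}}

Each iteration is dominated by one forward projection (applying A) and one back projection (applying A^T to the measured-to-expected ratio), which is why partitioning the sparse matrix is the central concern when porting MLEM to LAIK.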
Steps Done for Porting MLEM to LAIK
• Adapted the matrix partitioning to use LAIK
• Improved the mapping algorithm of the sparse matrix to handle multiple independent slices (see the loop sketch below)
• Created data containers for all working vectors
• Added a loop to handle multiple slices
• Added a wrapper for handling the parameters for repartitioning
Experimental setup:
• System: CoolMUC-2 – NeXtScale nx360M5, Xeon E5-2697v3 (14 cores, 2.6 GHz), InfiniBand FDR14
• Test input: 12 GB sparse probability matrix, 10 iterations
• Simulated fault by enforcing shrinking after the 6th iteration
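A sketch of the added multi-slice loop: after repartitioning, a task may own several disjoint index ranges of a container, so each kernel has to iterate over all local mappings instead of assuming a single one. laik_my_mapcount and process_slice are hypothetical names used for illustration.

    /* Loop over all locally mapped slices of the container d. */
    int nmaps = laik_my_mapcount(d);    /* hypothetical accessor */
    for (int m = 0; m < nmaps; m++) {
        double* base; uint64_t count;
        laik_get_map_1d(d, m, (void**) &base, &count);
        process_slice(base, count);     /* hypothetical kernel */
    }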
Conclusion
• LAIK: a library to increase elasticity in parallel applications
  – adds partitioned index spaces as an abstraction
  – repartitioning as the central functionality
  – automatic load balancing
  – fault tolerance
  – modularized and expandable
  – increased elasticity in parallel codes
• Porting MLEM and results
  – limited effort required for porting the application
  – low overhead of LAIK
  – LAIK scales at least as well as the original application
Future Work
Work in progress:
• Porting further applications, e.g. LULESH
• Further scalability research using >10,000 cores on SuperMUC
• Agent system
• Shared memory backend
• Further optimization to reduce communication effort
Proposed:
• A solution to overcome MPI's weaknesses
• Local in-memory checkpointing
• Non-regular data structures
• Elastic index space sizes for hierarchical instantiations
Infos
• LAIK: https://github.com/envelope-project/laik
• MLEM project: https://github.com/envelope-project/mlem
Contact:
• Josef Weidendorfer: weidendo@in.tum.de
• Dai Yang: d.yang@tum.de
• Tilman Küstner: kuestner@in.tum.de
• Carsten Trinitis: carsten.trinitis@tum.de