NR-MPI: A Non-stop and Fault Resilient MPI Supporting Programmer Defined Data Backup and Restore for E-scale Super Computing Systems

Suo Guang


Fault resilience has become a major issue for HPC systems, particularly in view of future E-scale systems, which will consist of millions of CPU cores and other components. MPI-level fault-tolerance constructs, such as ULFM, are being proposed to support software-level fault tolerance. However, there are few systematic evaluations by application programmers using benchmarks or pseudo-applications. This paper proposes NR-MPI, a Non-stop and Fault Resilient MPI that supports programmer-defined data backup and restore. To help programmers write fault-tolerant programs, NR-MPI provides a set of easy-to-use programming interfaces and a state transition diagram for data backup and restore. This paper focuses on the design, implementation, and evaluation of NR-MPI. Specifically, it emphasizes failure detection in the MPI library, the programming interface extensions of NR-MPI, and examples of fault-tolerant programs based on NR-MPI. Furthermore, to support failure recovery of applications, NR-MPI implements its data backup interfaces on top of double in-memory checkpoint/restart. We conduct experiments with both the NPB benchmarks and Sweep3D on the TH supercomputer at NSCC-TJ. Experimental results show that NR-MPI-based fault-tolerant programs can recover from failures online without restarting, and that the overhead is small even for applications running on tens of thousands of cores.
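
The double in-memory checkpoint idea mentioned in the abstract can be illustrated with a minimal single-process simulation. This is only a sketch under stated assumptions and is not NR-MPI's actual interface: the names `Rank`, `buddy_of`, `backup`, and `restore` are hypothetical, the buddy is assumed to be the next rank in a ring, and at least two ranks are assumed. Each rank keeps one copy of its checkpoint in its own memory and a second copy in its buddy's memory, so a single failed rank can be rebuilt without touching disk.

```python
import copy


class Rank:
    """One simulated process: live data plus two in-memory checkpoint slots."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.data = None
        self.local_ckpt = None  # checkpoint of this rank's own data
        self.buddy_ckpt = None  # checkpoint held on behalf of the previous rank


def buddy_of(rank_id, size):
    # Assumption: the buddy is simply the next rank in a ring.
    return (rank_id + 1) % size


def backup(ranks):
    """Store every rank's data twice: locally and in its buddy's memory."""
    size = len(ranks)
    for r in ranks:
        r.local_ckpt = copy.deepcopy(r.data)
        ranks[buddy_of(r.rank_id, size)].buddy_ckpt = copy.deepcopy(r.data)


def restore(ranks, failed_id):
    """Replace the failed rank and roll every rank back to the last backup."""
    size = len(ranks)
    holder = ranks[buddy_of(failed_id, size)]
    predecessor = ranks[(failed_id - 1) % size]

    replacement = Rank(failed_id)
    # The failed rank's data survives only in its buddy's memory.
    replacement.data = copy.deepcopy(holder.buddy_ckpt)
    replacement.local_ckpt = copy.deepcopy(holder.buddy_ckpt)
    # Re-create the remote copy that the failed rank held for its predecessor.
    replacement.buddy_ckpt = copy.deepcopy(predecessor.local_ckpt)
    ranks[failed_id] = replacement

    # Survivors roll back to their own local checkpoint to stay consistent.
    for r in ranks:
        if r.rank_id != failed_id:
            r.data = copy.deepcopy(r.local_ckpt)
    return ranks
```

In this sketch the application decides what goes into `data` and when `backup` is called, which mirrors the programmer-defined backup and restore described above. A real implementation would exchange the buddy copies with MPI messages rather than direct assignment.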


Publishing Center of South Ural State University (454080, Lenin prospekt, 76, Chelyabinsk, Russia)