NR-MPI: A Non-stop and Fault Resilient MPI Supporting Programmer Defined Data Backup and Restore for E-scale Super Computing Systems
References
Top500, Top500 lists. http://www.top500.org, 2012.
F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer and M. Snir. Toward Exascale Resilience: 2014 update. Supercomputing Frontiers and Innovations, Vol. 1 no. 1, 2014
W. Bland. User Level Failure Mitigation in MPI, Euro-Par 2012: Parallel Processing Workshops Lecture Notes in Computer Science, vol.7640, p.499-504, 2013.
W. Gropp, MPICH2: A New Start for MPI Implementations, Recent Advances in Parallel Virtual Machine and Message Passing Interface, p.37–42, 2002.
M. Jette and M. Grondona, SLURM: Simple Linux Utility for Resource Management, in Proceedings of ClusterWorld Conference and Expo, San Jose, California, 2003.
G. Zheng, L. Shi and L. V. Kalé, FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI, Proc. Sixth IEEE Int'l Conf. Cluster Computing (Cluster '04), p.93-103, Sept. 2004.
D. Bailey, T. Harris, W. Saphir, R. Van Der Wijngaart, and A. Woo, The NAS Parallel Benchmarks 2.0, NASA Ames Research Center, Moffett Field, CA 2002.
A. Hoisie, O. Lubeck, and H. Wasserman, Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures Using Multidimensional Wavefront Applications, International Journal of High Performance Computing Applications, vol.14, p.330-346, 2000.
M. Xie, Y. Lu, K. Wang, L. Liu, H. Cao, and X. Yang, Tianhe-1A Interconnect and Message-Passing Services, IEEE Micro, vol.32, p.8-20, 2012.
X. Yang, X. Liao, K. Lu, Q. Hu, J. Song, and J. Su, The TianHe-1A Supercomputer: Its Hardware and Software, Journal of Computer Science and Technology, vol. 26, p.344-351, 2011.
G. Suo, Y. Lu, X. Liao, M. Xie, and H. Cao, NR-MPI: a Non-stop and Fault Resilient MPI, In Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2013), Seoul, Korea, p.190-199, 2013.
R. Koo and S. Toueg, Checkpointing and Rollback-Recovery for Distributed Systems, IEEE Transactions on Software Engineering, vol.13, p.23–31, 1987.
A. Bouteiller, P. Lemarinier, G. Krawezik and F. Cappello, Coordinated Checkpoint versus Message Log for Fault Tolerant MPI, in Proc. Fifth IEEE Int'l Conf. Cluster Computing (Cluster '03), p.242, 2003.
G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. Toward a scalable fault tolerant MPI for volatile nodes, In Proceedings of SC 2002. IEEE, 2002.
K. Ferreira, J. Stearley, I. J. H. Laros, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. G. Bridges, and D. Arnold, Evaluating the viability of process replication reliability for exascale systems, in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, New York, NY, USA, 2011, p.1–12.
G. E. Fagg and J. Dongarra, FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World, in Proceedings of the 7th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, London, UK, 2000, p.346–353.
A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello. MPICH-V project: A multiprotocol automatic fault tolerant MPI, The International Journal of High Performance Computing Applications, vol.20, p.319-333, 2006.
C. Engelmann and S. Bohm. Redundant execution of HPC applications with MR-MPI. In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), p.31–38, 2011.
D. Fiala, F. Mueller, C. Engelmann, and R. Riesen, Detection and correction of silent data corruption for large-scale high-performance computing. In Parallel & Distributed Processing Workshops & Phd Forum IEEE International Sympos, 7196(5), p.2069-2072, 2011.
W. Bland, Enabling Application Resilience with and without the MPI Standard, in Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012), Washington, DC, USA, 2012, p.746–751.
J. Hursey and R. Graham, Building a Fault Tolerant MPI Application: A Ring Communication Example, in 16th International Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS) held in conjunction with the 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Anchorage, Alaska, 2011.
M. Beck, J. J. Dongarra, G. E. Fagg, G. A. Geist, P. Gray, J. Kohl, M. Migliardi, K. Moore, T. Moore, P. Papadopoulos, S. L. Scott, and V. Sunderam. HARNESS: A Next Generation Distributed Virtual Machine, Future Generation Computer Systems, 15(5-6):571-582, 1999.
H. Jung, D. Shin, H. Han, J. W. Kim, H. Y. Yeom, and J. Lee, Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M3), in Proceedings of the 2005 ACM/IEEE conference on Supercomputing, Washington, DC, USA, 2005, p.32–46.
C. Wang. Transparent Fault Tolerance for Job Healing in HPC Environments, PhD thesis, North Carolina State University, 2009.
C. Huang, O. Lawlor, and L. V. Kalé. Adaptive MPI, In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003), LNCS 2958, p.306-322, College Station, Texas, October 2003.
A. Agbaria and R. Friedman, Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations, In 8th IEEE International Symposium on High Performance Distributed Computing, 1999.
S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing, International Journal of High Performance Computing Applications, 19(4):479-493, Winter 2005.
R. Wang, E. Yao, P. Balaji, D. Buntinas, M. Chen and G. Tan. Building Algorithmically Nonstop Fault Tolerant MPI Programs, In Proceedings of the 18th IEEE International Conference on High Performance Computing (HiPC 2011), December 2011, Bangalore, India.
Z. Wu, R. Wang, W. Xu, M. Chen, E. Yao, Supporting User-directed Fault Tolerance over Standard MPI, in 2012 IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS), p.696-697, 17-19 Dec. 2012.
J. S. Plank, K. Li, and M. A. Puening, Diskless Checkpointing, IEEE Trans. Parallel Distrib. Syst., vol. 9, p.972–986, 1998.
J. P. Walters and V. Chaudhary, Application-Level checkpointing techniques for parallel programs, in ICDCIT’06, Berlin, Heidelberg, 2006, p. 221–234.
J. S. Plank, M. Beck, G. Kingsley, and K. Li, Libckpt: transparent checkpointing under Unix, in TCON’95, Berkeley, CA, USA, 1995, p.18–18.
A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, in SC ’10, Washington, DC, USA, 2010, p.1–11.
W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, J. Dongarra. A Checkpoint-on-Failure protocol for algorithm-based recovery in standard MPI, In 18th Euro-Par, LNCS, vol. 7484, p.477-489. 2012.
K. Huang and J. A. Abraham, Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Trans. Comput., vol. 33, p.518–528, 1984.
D. Fiala, Detection and Correction of Silent Data Corruption for Large-Scale High Performance Computing, in IPDPSW ’11, Washington, DC, USA, 2011, p.2069–2072.
Z. Chen, Algorithm-based recovery for iterative methods without checkpointing, in HPDC ’11, New York, NY, USA, 2011, p.73–84.
X. Yang, Y. Du, P. Wang, H. Fu, and J. Jia, FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing, IEEE Trans. Parallel Distrib. Syst., vol.20, p.1471–1486, 2009.