NR-MPI: A Non-stop and Fault Resilient MPI Supporting Programmer Defined Data Backup and Restore for E-scale Super Computing Systems

Fault resilience has become a major issue for HPC systems, particularly in view of future E-scale systems, which will consist of millions of CPU cores and other components. MPI-level fault tolerant constructs, such as ULFM, are being proposed to support software level fault tolerance. However, there are few systematic evaluations by application programmers using benchmarks or pseudo applications. This paper proposes NR-MPI, a Non-stop and Fault Resilient MPI, supporting programmer defined data backup and restore. To help programmers write fault tolerant programs, NR-MPI provides a set of friendly programming interfaces and a state transition diagram for data backup and restore. This paper focuses on the design, implementation and evaluation of NR-MPI. Specifically, it emphasizes failure detection in the MPI library, the friendly programming interface extension of NR-MPI, and examples of fault tolerant programs based on NR-MPI. Furthermore, to support failure recovery of applications, NR-MPI implements data backup interfaces based on double in-memory checkpoint/restart. We conduct experiments with both NPB benchmarks and Sweep3D on the TH supercomputer in NSCC-TJ. Experimental results show that NR-MPI based fault tolerant programs can recover from failures online without restarting, and the overhead is small even for applications with tens of thousands of cores.


Introduction
Large scale scientific applications have been the main driving force for high-performance computing. Scientists need to analyze ever-larger data sets and to run ever-larger simulations, which drives the scale of high-performance computers toward millions of processor cores. In the future, extreme scale high-performance computers will consist of even more cores. In the Top500 [1] list of November 2015, the top-ranked supercomputer has 3,120,000 cores. As computer systems expand, the failure rate increases, so the mean time between failures (MTBF) decreases. However, many scientific applications need to run for weeks or even months. Therefore, the MTBF of these computers is becoming significantly shorter than the execution time of many current scientific applications. To support the execution of such applications, fault tolerance is imperative [2].
The lack of appropriate resilience solutions is a major problem at exascale. Currently the MPI Forum is working on a new fault tolerant MPI standard, and additional MPI-level constructs will be added to the future MPI standard. The most promising approach is User-Level Failure Mitigation (ULFM) [3]. However, there are still not enough experimental results to prove both the usability and the performance of ULFM.
In this paper, we present NR-MPI, a Non-stop and Fault Resilient MPI. The semantics of NR-MPI are derived mainly from FT-MPI and ULFM. The programming interface, which is designed for iteration-based scientific parallel applications, is less complicated than those of FT-MPI and ULFM. The interface suits programmers who only want to decide which data to back up and when to back it up in order to convert a parallel application into a fault tolerant one. The failure detector of NR-MPI differs from that of ULFM: ULFM detects process failures in the MPI library, while NR-MPI relies on an external failure detector, which is usually integrated with the process manager or resource manager. We implemented some of the fault tolerance constructs of ULFM on MPICH, and the implementation of NR-MPI is based on MPICH [4]. NR-MPI has no runtime overhead when there are no failures.
This paper focuses on the design, implementation and evaluation of NR-MPI, which is implemented on top of ULFM. The MPI Forum has not reached a consensus on the principles of a resilient MPI, although ULFM has been discussed extensively. We think that the following issues are important when implementing a fault tolerant MPI.
• How to detect failures in an MPI library.
• How to recover the state of an MPI library based on ULFM.
• How to recover the lost application data after failures using NR-MPI.
• What programming interfaces are needed in order to reduce the complexity of fault tolerant programming.
• How to convert a non-fault tolerant program into a fault tolerant one with NR-MPI.
The rest of this paper is organized as follows: Section 1 presents the design and implementation of NR-MPI. Section 2 shows the usage of the programming interface based on an example algorithm. Section 3 evaluates the performance of NR-MPI with NPB benchmarks. Section 4 gives the related work of NR-MPI. In Section 4.2, we conclude this paper and discuss future work.

Design and Implementation of NR-MPI
NR-MPI is designed on top of ULFM and implemented based on MPICH [4]. The original ULFM is implemented based on OpenMPI. We use MPICH instead of OpenMPI because the software stack of the high performance network is built on top of MPICH. The fault tolerant RMS is a modified version of SLURM [5].

Failure Model
In this paper, we assume a failure model in which fail-stop failures can occur at any time in any process during a parallel execution. There are two types of failure models: the fail-stop model and the Byzantine model. Fail-stop failures can be detected more easily. Byzantine failures can be detected by error checking based on ABFT, which can also be implemented on top of NR-MPI.

The Fault Tolerant RMS of NR-MPI
The Resource Management System (RMS), for example SLURM, is developed to manage and monitor parallel applications running on a cluster of computers. It is designed to coordinate a global and consistent system state upon failures. According to roles and locations, the RMS can be divided into two parts, the Resource Manager and the Process Manager, as shown in fig. 1. We add a Failure Arbiter (FA) to the Resource Manager and a Failure Detector (FD) to each Process Manager. The functions of the fault tolerant RMS are fault tolerant resource management, failure detection and failure notification. FDs detect process failures of parallel jobs using the SIGCHLD signal. The FA uses a periodic heartbeat to detect failures from FDs. In this way, the FA can detect all process failures of parallel jobs. Integrating the FDs and the FA into the RMS has two advantages. Firstly, they can be lightweight so as not to interfere with the performance of jobs. Secondly, the FD and FA can make use of the fault tolerant techniques already implemented in the RMS. For example, there are two Resource Managers online: one is active while the other is standing by, so that the failure of one Resource Manager does not affect the availability of the system.
Following the communication topology of the Resource Manager and Process Managers, the communication topology of the failure detecting system is also a tree. The root of the tree is the FA; intermediate and leaf nodes are FDs. Shared memory is used for communication between the FD and the MPI processes on the same node. A software enhanced communication network for high performance computing is used for communication between FDs (or between an FD and the FA), so that heartbeat and failure notification messages can be transferred reliably. In addition, to recover successfully, the failure notification messages should be received by all the MPI processes of a parallel job in the same order.
During MPI communications, the NR-MPI library checks for failure notification messages by reading the contents of the shared memory when sending or receiving data. For example, based on MPICH, NR-MPI checks the failure notification messages in the progress engine.
A failure notification message contains the list of failed ranks of the parallel job. This failure information helps programmers to recover from process failures, node failures and network failures. When a node crashes, the message contains all the ranks of the program on that node. When a part of the network fails, the message contains the ranks of all processes that the FA cannot communicate with due to the network failure.
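As a rough illustration of the detection path described in this section, the following Python sketch models a Failure Arbiter that records heartbeats from per-node Failure Detectors and, when a node falls silent, produces a notification listing all ranks on that node. The class name, method names, the node-to-rank map and the timeout of two heartbeat intervals are assumptions made for this example; they are not NR-MPI or SLURM interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class FailureArbiter:
    """Toy FA: remembers the last heartbeat time of each node's FD."""
    heartbeat_interval: float
    node_ranks: dict  # hypothetical map: node name -> MPI ranks on that node
    last_seen: dict = field(default_factory=dict)
    reported: set = field(default_factory=set)

    def heartbeat(self, node, now):
        """Record a heartbeat received from the FD on `node` at time `now`."""
        self.last_seen[node] = now

    def check(self, now):
        """Return the sorted ranks of nodes that newly exceeded the timeout."""
        timeout = 2 * self.heartbeat_interval  # assumed detection threshold
        failed = []
        for node, ranks in self.node_ranks.items():
            if node in self.reported:
                continue  # each failure is notified only once
            if now - self.last_seen.get(node, 0.0) > timeout:
                self.reported.add(node)
                failed.extend(ranks)
        return sorted(failed)
```

Returning the ranks in sorted order is one simple way to give every process the same view of a notification; the ordering guarantee in the real system comes from the RMS, not from this sketch.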

Recovery of NR-MPI Library
The main job of recovering the NR-MPI library is to recover the communicators. We take MPI_COMM_WORLD as an example, shown in fig. 2, to explain how it is recovered inside the library. Recovering the world communicator means repairing its corrupted attributes. The old communication context remains valid for the repaired communicator, so the main work is recovering the group and the virtual connection table. Both can be done together, mainly with existing MPI procedures. The steps of world communicator recovery are as follows.
(1) Failure detection. When failures occur, the surviving ranks enter the failure recovery process as soon as they receive failure information from the FDs.
(2) Shrinking the communicator. The surviving ranks remove the failed ranks from the world communicator by calling MPI_Comm_shrink. The MPI_Comm_shrink operation is defined in ULFM; it excludes the failed ranks from a failed communicator.
(3) Spawning replacements. The world communicator spawns processes as replacements for the failed ones by calling MPI_Comm_spawn. Note that the spawned processes are in a different process group and communicate with the survivors via an inter-communicator.
(4) Merging the inter-communicator. A new world group is constructed from the old world group and the spawned group. Note that the rank order of the new world group differs from that of the world group before the failures occurred.
(5) Reordering ranks in the new world group. This can be done by calling MPI_Comm_create.
Most of the steps of world communicator recovery, except MPI_Comm_shrink, are based on version 2 of the existing MPI standard. Moreover, the recovery process does not manipulate the virtual connection table directly, so the complexity of implementing NR-MPI is relatively low.
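The effect of steps (2) to (5) on rank ordering can be sketched without MPI. The Python model below represents the world communicator as a list whose index is the rank and whose value is a process label; the function name and the labels are hypothetical, and the code only imitates the bookkeeping, not the actual MPI calls.

```python
def recover_world(world, failed_ranks, spawned):
    """Model shrink, spawn, merge and reorder on a list of process labels.

    world        -- list where index = rank and value = process label
    failed_ranks -- iterable of ranks reported as failed
    spawned      -- labels of freshly spawned replacement processes
    Returns (merged_order, new_world).
    """
    failed = sorted(set(failed_ranks))
    if len(spawned) < len(failed):
        raise RuntimeError("not enough replacements")

    # (2) shrink: keep only the surviving (rank, label) pairs
    survivors = [(r, p) for r, p in enumerate(world) if r not in failed]
    # (3) spawn: each replacement is assigned one of the lost ranks
    replacements = list(zip(failed, spawned))
    # (4) merge: survivors come first, spawned processes last
    merged = survivors + replacements
    merged_order = [p for _, p in merged]
    # (5) reorder: restore the rank order used before the failure
    new_world = [p for _, p in sorted(merged, key=lambda pair: pair[0])]
    return merged_order, new_world
```

With `world = ["P0", "P1", "P2", "P3"]`, a failure of rank 2 and one spawned process `"P4"`, the merged order places `"P4"` last, while the reordered world puts it back at rank 2.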

State Transition Diagram with Failures
Upon failures, the lost application data also needs to be recovered. C/R, ABFT, application level data backup via MPI communications, or a combination of these can be used to recover the lost data. In this paper, we do not assume a specific data backup and restore technique. Instead, we define a data backup and recovery protocol for NR-MPI, so that the data backup and recovery techniques defined by programmers can be integrated with NR-MPI as a whole.
During the execution of an NR-MPI parallel program, any process may fail. In addition, failures may occur while recovering MPI data or application data. A failure is recovered if and only if all processes have recovered their MPI data and application data. The data to recover differs for different kinds of processes in different states. To help programmers implement NR-MPI parallel programs, we define a state transition diagram of an NR-MPI process, shown in fig. 3. When data recovery is via MPI communications, failures can be detected in the progress engine (the module that probes incoming messages and sends queued messages) of the MPI library. However, when there is no interaction between processes during data recovery, the processes will not enter REPLACE_RECOVER, so state transfers 7, 8, 9 and 15 will not be triggered either.
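To make the idea of a per-process state machine concrete, here is a small Python sketch. The state names follow the paper, but the transition relation is an assumption inferred from the state descriptions alone: the numbered transfers of the original diagram are not reproduced, and the real diagram may allow more or fewer moves.

```python
from enum import Enum


class State(Enum):
    INIT = "INIT"
    RUNNING = "RUNNING"
    FAILURE = "FAILURE"
    RECOVER = "RECOVER"
    REPLACE_RECOVER = "REPLACE_RECOVER"
    FINISH = "FINISH"
    ABORT = "ABORT"


# Assumed transition relation (illustrative only, not the paper's Table 2).
ALLOWED = {
    State.INIT: {State.RUNNING, State.REPLACE_RECOVER, State.FINISH, State.ABORT},
    State.RUNNING: {State.FAILURE, State.FINISH, State.ABORT},
    State.FAILURE: {State.RECOVER, State.ABORT},
    State.RECOVER: {State.RUNNING, State.FAILURE, State.ABORT},
    State.REPLACE_RECOVER: {State.RUNNING, State.ABORT},
    State.FINISH: set(),  # absorbing state
    State.ABORT: set(),   # absorbing state
}


def transfer(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transfer {current.name} -> {target.name}")
    return target


def run(path):
    """Walk a sequence of states and return the final one."""
    state = path[0]
    for target in path[1:]:
        state = transfer(state, target)
    return state
```

A surviving process would follow a path such as INIT, RUNNING, FAILURE, RECOVER, RUNNING, while a standby process selected as a replacement would go from INIT to REPLACE_RECOVER and then to RUNNING.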

Extending Interface of NR-MPI
Many parallel applications are iteration based, such as linear solvers and PDE based applications. We focus on fault tolerance for iteration based applications. To reduce the burden on programmers, NR-MPI provides a convenient programming interface, listed in Table 3.
NR_Register, NR_Backup, NR_Recover and NR_Need_backup are used to back up and restore the application data based on the double in-memory checkpoint [6]. Moreover, to guard against failures that occur during data backup and restore, we use ping-pong buffers to back up application data. NR_Backup and NR_Recover are simple examples of application level data backup and restore. Any data backup and restore technique can be used, and programmers can implement their own methods by overriding these four functions. NR_Get_state and NR_Set_state are used to get and set the state of the local process. NR_Get_failure_ranks is used to query the set of failed ranks in NR_Recover.
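The ping-pong idea can be illustrated with a minimal Python sketch: a backup is always written into the buffer that does not hold the last completed copy, and the valid index is switched only after the write finishes, so a failure in the middle of a backup leaves the previous copy intact. The class and its methods are hypothetical and only loosely mirror the roles of NR_Need_backup, NR_Backup and NR_Recover; the `fail_midway` flag exists purely to simulate a crash.

```python
import copy


class PingPongBackup:
    """Two alternating buffers; only a fully written one is marked valid."""

    def __init__(self, interval):
        self.interval = interval      # back up every `interval` iterations
        self.buffers = [None, None]
        self.valid = None             # index of the last completed backup

    def need_backup(self, iteration):
        return iteration % self.interval == 0

    def backup(self, data, fail_midway=False):
        target = 0 if self.valid != 0 else 1  # never overwrite the valid copy
        self.buffers[target] = None           # target is inconsistent from now on
        if fail_midway:
            raise RuntimeError("simulated failure during backup")
        self.buffers[target] = copy.deepcopy(data)
        self.valid = target                   # commit: switch the valid index

    def recover(self):
        if self.valid is None:
            raise LookupError("no completed backup available")
        return copy.deepcopy(self.buffers[self.valid])
```

The cost of this scheme is that two copies of the registered data are kept in memory, which is the memory overhead the double in-memory checkpoint trades for safety.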

Implementation of the NR-MPI
The structure of NR-MPI is shown in fig. 4. NR-MPI sits in the middle of the software stack. The fault tolerant RMS is a modified version of slurm-2.4.0-rc1 [XX]. Many different MPI implementations have been developed to support different supercomputers efficiently, and NR-MPI can be integrated with a wide range of existing MPI implementations. In this paper, NR-MPI is a modified version of mpich2-1.4.1p1 [21], integrated with a state management module, a data backup and restore module, a failure detecting module, and a failure recovery module. Their functions are as follows. The state management module provides a state managing interface for programmers, through which a process can query or set its own fault tolerant state, which is essential for NR-MPI. The data backup and restore module provides the programming interface, based on mutual data backup, to save the application data. The failure detecting module can query failures

NR-MPI Usage Example
With NR-MPI, programmers only need to add two segments to a non-fault tolerant program to make it fault tolerant, without changing the rest of the program. The data backup segment saves the application data periodically in case of failures, while the data restore segment restores the lost application data of failed processes. We take the conjugate gradient (CG) algorithm as an example to illustrate the usage of the programming interface, shown in fig. 5. NR-CG is the fault tolerant version of the CG algorithm based on NR-MPI. To support fault tolerance, we added data backup and restore code. Lines 19∼29 (data restore segment) and 39∼42 (data backup segment) are the additional code.
From the algorithm of NR-CG (lines 33∼38), we can see that the vector x and the iteration index j are the data to be backed up. So after initialization of the MPI library, NR-CG registers x and j with NR-MPI (lines 11∼12). Then, NR-CG sets the return position env, so that NR-CG can switch from the MPI library to the user-defined position in the program after the world communicator has been recovered from a failure (lines 13∼14). The callback function cg_callback is called by NR-MPI after recovering the MPI core data. In line 15, NR-CG gets its current state. Then, it sets x and j_start based on the program state. If the state is RECOVER or REPLACE_RECOVER, the program needs to restore the application data. For the two states, the code is the same here; however, it may differ for ABFT. Lines 39∼42 are used to back up data. Usually it is not necessary to back up data in every iteration, so NR_Need_backup is called to determine whether a backup is needed.
In this example, NR-CG uses longjmp to get to the user-defined position in the program. NR-MPI can also return a FAILURE status like the interfaces defined in FT-MPI, so that setjmp and longjmp can be omitted. We did not follow the style of returning error codes, because it requires more modifications. Furthermore, programmers can call any MPI routines after recovering MPI core data. For example, MPI_Comm_create can be used to create a new communicator that does not contain the recovered ranks, like the shrink operation in FT-MPI and OpenMPI.
After receiving notification from the FDs, the surviving processes recover MPI core data before calling cg_callback. The default action of cg_callback is to jump to line 13; longjmp clears the calling stack of the MPI library. At this moment, the state of the program is RECOVER, so lines 26∼29 are executed by the surviving processes, which help the failed processes restore their lost application data when they find that their partner has failed. Meanwhile, the replacements exit from INIT and run to line 15. Their state is REPLACE_RECOVER, so they get x and j_start from their partners (lines 22∼25). Finally, they set their state to RUNNING.
Figure 6 shows an execution of NR-CG upon a failure. There are 5 processes in NR-CG: 4 of them are normal processes and the other one is the stand-by process. During execution, P0 and P2 back up each other's application data, and P1 and P3 back up each other's application data. In phase 1, the 4 processes back up their application data successfully. In phase 2, after the processes back up their application data, P2 crashes. Once the crash of P2 is detected, recovery proceeds as follows.
(0) P0, P1 and P3 spawn a new process, named P4.
(1) P4 finds that it is the replacement of P2. Then P0, P1, P4 and P3 recover their MPI core data (creating a new world communicator).
(2) After recovering MPI core data, the 4 processes begin to recover the application data. P0, P1 and P3 restore their local application data to the last backed-up version.
(3) P0 finds that its partner (P2) has failed by calling NR_Get_failure_ranks, so it sends the application data of P2 to help it recover. Meanwhile, the new P2 receives its lost application data.
(4) All four processes have recovered their data to the latest backed-up version, and the NR-MPI parallel program has recovered from the failure.
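The data movement in this walkthrough can be mimicked in a few lines of Python. In the sketch below each process keeps a local copy of its own data and a copy of its partner's data; after a failure, survivors roll back to their local copy and the replacement receives its data from its partner. The `Proc` class, the helper names and the pairing rule `(rank + size // 2) % size` (which reproduces the P0/P2 and P1/P3 pairs for four processes) are assumptions for illustration, not NR-MPI code.

```python
def partner_of(rank, size):
    """Assumed pairing rule: rank r is paired with r + size/2 (mod size)."""
    return (rank + size // 2) % size


class Proc:
    def __init__(self, rank):
        self.rank = rank
        self.x = [0.0]            # application vector
        self.j = 0                # iteration index
        self.local_copy = None    # backup of this process's own data
        self.partner_copy = None  # backup held on behalf of the partner


def backup_all(procs):
    size = len(procs)
    for p in procs:
        p.local_copy = (list(p.x), p.j)
        procs[partner_of(p.rank, size)].partner_copy = (list(p.x), p.j)


def recover_all(procs, failed):
    size = len(procs)
    for r in failed:
        if partner_of(r, size) in failed:
            raise RuntimeError("both partners failed; data cannot be restored")
    for r in failed:
        procs[r] = Proc(r)  # replacement takes over the lost rank
    for p in procs:
        if p.rank in failed:
            helper = procs[partner_of(p.rank, size)]
            x, j = helper.partner_copy          # data sent by the partner
            hx, hj = helper.local_copy
            p.partner_copy = (list(hx), hj)     # re-establish mutual backup
        else:
            x, j = p.local_copy                 # survivor rolls back locally
        p.x, p.j = list(x), j
        p.local_copy = (list(x), j)
```

The error raised when both partners fail corresponds to the unrecoverable case in which the two processes that back up each other's data are lost together.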

Experimental Evaluation
To use NR-MPI, we have modified benchmarks from the NAS Parallel Benchmarks [7] (version 3.3) and Sweep3D [8]. Our experimental platform, whose configuration is shown in Table 4, is TH-1A [9,10], deployed at the National Supercomputer Center in Tianjin. There are two steps to modify a non-fault tolerant program. Firstly, identify the main iteration and the application data to back up. Secondly, add the data restore segment before the main iteration and the data backup segment inside it, if necessary. Moreover, if a program has several phases, that is, several main iterations, the programmers need to identify the application data and add the data restore and data backup segments for each main iteration. So the modification complexity is very low. For example, we added only a few lines of code to CG in NPB-3.3 to enable fault tolerance.
When there is no failure, the overhead of an MPI program without data backup and restore is the failure detection overhead. In our experiments, the failure detection overhead is almost zero, so we do not report it separately. Fig. 7 presents the execution time of non-fault tolerant and fault tolerant NPB benchmarks with NR-MPI. For the fault tolerant versions of the benchmarks, the application data is backed up every 1, 5 and 10 iterations, using the double in-memory checkpoint. Normal denotes the results without data backup; ft-1, ft-5 and ft-10 denote the results in which the application data is backed up every 1, 5 and 10 iterations, respectively. The duration of an iteration differs between benchmarks, so the time interval of data backup also differs between benchmarks.
From fig. 7, it can be seen that the runtime overheads of NR-MPI parallel programs differ between benchmarks. For NR-CG, the runtime is almost independent of the backup interval, because the amount of data to be backed up is small compared with the communication during execution. For NR-MG, NR-FT and NR-LU, the overheads are higher than for NR-CG. Taking NR-LU as an example, the runtime of ft-1 is 122% higher than that of normal execution for 2048 processes. As the backup interval increases, the overhead due to data backup decreases: for NR-LU, the runtime of ft-10 is 33% higher than that of normal execution for 2048 processes. NR-MG, NR-FT and NR-LU have higher runtime overhead because the data to be backed up is larger. For example, during normal execution, the LU benchmark only exchanges the surface data of a data cube, whereas NR-LU needs to back up the entire cube every 1, 5 or 10 iterations. So the runtime overhead of data backup is relatively higher than for NR-CG. When the number of processes increases, the runtime overhead decreases. For example, the runtime overhead of NR-LU is 15% when the backup interval is 10 iterations for 16384 processes, and the runtime overhead of NR-FT is 2% when the backup interval is 10 iterations for 32768 processes.
The backup interval (similar to the checkpoint interval) is another important criterion for NR-MPI parallel programs, because it determines the expected computation lost due to one failure. Fig. 8 shows the ft-10 backup intervals of the E class NPB benchmarks. The intervals are small. For example, the longest interval is 90 seconds, for FT.E.2048, which means the average loss of computation is 45 seconds. For the other benchmarks, the longest interval is 20 seconds.
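The two quantities used in this discussion are simple to compute. The sketch below assumes that a failure is equally likely at any moment within a backup interval, which gives an expected loss of half the interval, and defines overhead as the relative increase in runtime over the run without backup. The function names and the sample runtimes in the usage note are illustrative, not measured values.

```python
def expected_lost_computation(backup_interval_s):
    """Mean lost work in seconds, assuming the failure time is uniform
    over the backup interval."""
    return backup_interval_s / 2.0


def overhead_percent(t_fault_tolerant, t_normal):
    """Relative runtime increase of the fault tolerant run, in percent."""
    return 100.0 * (t_fault_tolerant - t_normal) / t_normal
```

For a 90 second interval the expected loss is 45 seconds, matching the figure quoted for FT.E.2048; a hypothetical run that takes 222 seconds instead of 100 seconds would correspond to an overhead of 122%.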

Overall Time Overhead due to One Failure
NR-MPI is similar to SCR in data backup and restore. For example, SCR also supports mutual file backup when checkpoint files are stored on local disk or memory disk. In addition, when using SCR, programmers also need to specify the variables to be saved, just as they back up key application data when using NR-MPI.
However, there are two differences between SCR and NR-MPI. Firstly, the data backup and restore efficiency of NR-MPI is higher than that of SCR, because SCR uses the file system interface to save and restore data, which requires additional copies between programs and the operating system. Moreover, SCR usually uses TCP/IP-based communication channels and cannot directly use fast interfaces, such as RDMA, to accelerate communication. Secondly, upon a failure, SCR needs to restart jobs, which is a time consuming operation in large parallel systems.
Table 5 gives the overall timing analysis of SCR and NR-MPI based on Sweep3D. We have modified two versions of Sweep3D: SCR-Sweep3D and NR-Sweep3D. The two versions save the same data during execution using different interfaces. The grid size of Sweep3D is 1024x1024x16384 and the parallelism is 16384. The iteration count is 10, and data is saved or checkpointed every iteration. The time to detect a failure is twice the heartbeat interval; in the experimental system, it is about 60s. The way to back up application data differs between the two versions: one uses SCR interfaces, while the other uses MPI communications. So NR-Sweep3D is faster than SCR-Sweep3D at backing up data. Rebuilding or routing data is specific to SCR, as is restarting jobs, while recovering the MPI library is only needed by NR-MPI. The time to read application data after failures also differs. All processes of SCR-Sweep3D need to read some application data from parallel file systems, except for the application data in local memory disk, while only the replacement processes of NR-Sweep3D need to receive lost application data and read lost application data from file systems. The average loss of computation is the same for the two versions, because both back up data every iteration. In conclusion, NR-Sweep3D outperforms SCR-Sweep3D in this comparison.

Fault Tolerant Techniques
In our previous work [11], we also presented NR-MPI, implemented based on MPICH. There, the failure recovery of the MPI library was based on our own fault tolerant MPI constructs. In this paper, the failure recovery of the MPI library is based on ULFM. There is little performance difference between this work and the previous work.
There are several fault tolerant techniques, such as Checkpoint/Restart [12] (C/R for short), message logging [13,14], process-level replication [15] and forward recovery [16]. C/R, which has been widely used in previous high performance computers, periodically saves the state of a computation to stable storage, such as parallel file systems. However, C/R requires a restart of the entire parallel job even when only one process of the job has failed. In a restart, all processes of a parallel job need to be reloaded. Then, all processes read the latest checkpoint to recover a consistent state. Both the checkpoint file I/O overhead and the restart overhead are unbearable for current large scale systems, let alone for future extreme scale systems. Log based methods, such as MPICH-V [17], can recover a process to its initial state and roll it forward by replaying the messages in the same order they were delivered before the crash. However, the main limitation is the necessity to log all messages of the execution. Process-level replication of parallel executions, such as MrMPI [18] and RedMPI [19], employs replicated processes performing the same task. If a process fails, its replica can take over its execution. Thus, redundant copies can decrease the overall failure rate. One major replication overhead comes from the management of the extra messages required for replication. For a double-replication execution, when a process sends a message to another process, four communications of that message take place. Forward recovery, such as FT-MPI (and OpenMPI [20,21]) and User-Level Failure Mitigation (ULFM), allows applications to continue running after a failure, while standard MPI does not specify the behavior of an MPI application after a failure. FT-MPI allows the semantics and associated failure modes to be completely controlled by the application. FT-MPI requires programmers to recover all the application state after failures. In addition, Harness [22], a distributed virtual machine, is used as the runtime system in the initial implementation of FT-MPI. Both the programming interface and the runtime system of FT-MPI are too complicated to use. ULFM allows the application to get notifications of errors and to use specific functions to reorganize the execution for forward recovery. However, ULFM provides only the basic interface and new semantics to enable applications and libraries to repair the state of MPI and tolerate failures. It is like an assembly language for fault tolerance and is somewhat complicated for programmers writing fault tolerant applications.
In addition, there are some hybrid approaches. M³ [23] is a user-transparent checkpoint system for fault tolerant MPI without restarting jobs; coordinated checkpointing is used to recover lost data upon failures. Job pause service [24], Adaptive MPI [25], StarFish [26] and LAM/MPI [27] are similar to M³, using system level checkpoints to recover the MPI runtime system. User-Directed Fault Tolerance (UDFT) [28,29], on top of standard MPI, provides user directed support for application level algorithmic fault tolerance.

Data Recovery for HPC
C/R is the most typical fault tolerance technique in previous research. There are two types of C/R. Traditional C/R can tolerate whole-system failures by periodically writing checkpoint data to stable storage, such as parallel file systems, and restarting from the latest checkpoint. Diskless C/R [30] saves the states of the processes directly into memory, eliminating the overhead of writing data to stable storage. Diskless C/R is faster, but traditional C/R can recover from more serious failures. C/R can be implemented at two levels: application-level checkpoint [31] and system-level checkpoint [32]. SCR [33] reduces the checkpoint and restart overheads by caching checkpoint data in the memory or local storage of the compute node. However, the spatial overhead of SCR is at least three times the size of the checkpoints. This is a huge overhead for future supercomputers, which will have a lower memory/processor ratio. In addition, SCR also needs a restart of the entire parallel job. Usually, the overhead of restarting a job is directly proportional to its parallelism. The Checkpoint-on-Failure protocol [34] can reduce checkpoint overhead by eliminating customary periodic checkpointing. Using NR-MPI, Checkpoint-on-Failure can further reduce the restart overhead.
Algorithm-based Fault Tolerance (ABFT) can also be used to recover data lost due to failures. Huang and Abraham [35] developed the ABFT technique to detect, locate and correct soft failures. For many matrix operations, the checksum relationship in the input checksum matrices still holds in the final computation results. Therefore, soft failures can be detected by checking the checksum relationship in the final computation results. If some processes fail during computation, their data is lost. Based on the checksum relationship, the lost data can be rebuilt from the data of the surviving processes. For example, Davies [36] proposed an algorithm-based recovery scheme for the HPL benchmark, based on the checksum relationship of the right-looking LU factorization algorithm, in which the checksum is maintained at every step of the computation. Chen [37] found that, for many iterative methods, if the data partitioning scheme satisfies certain conditions, the iterative methods maintain enough inherent redundant information for accurate recovery of the lost data. Yang [38] proposed an application-level fault-tolerant approach for parallel applications called the Fault-Tolerant Parallel Algorithm, which provides fast self-recovery upon failures. Upon a failure, all surviving processes re-compute the workload of the crashed processes in parallel. However, it requires programmers to redesign algorithms.
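The checksum idea behind ABFT recovery can be shown with a deliberately simple Python sketch. It assumes one extra checksum block that stores the element-wise sum of all process blocks, so that a single lost block equals the checksum minus the sum of the surviving blocks. The function names are hypothetical, and the schemes cited above maintain such relationships inside matrix algorithms rather than over static blocks.

```python
def make_checksum(blocks):
    """Element-wise sum of equally sized blocks, one block per process."""
    return [sum(column) for column in zip(*blocks)]


def rebuild_lost_block(checksum, surviving_blocks):
    """Recover a single lost block from the checksum and the survivors.

    Assumes exactly one block is missing and at least one block survives.
    """
    if not surviving_blocks:
        raise ValueError("need at least one surviving block")
    return [c - sum(column)
            for c, column in zip(checksum, zip(*surviving_blocks))]
```

With integer data the reconstruction is exact; with floating point data the rebuilt block may differ from the original by rounding error, and a single sum checksum cannot recover more than one lost block.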

Conclusion
This paper proposes a convenient, scalable and efficient fault tolerant MPI, named NR-MPI. Using the notifications from the fault tolerant RMS, failures are internally and automatically recovered by the NR-MPI runtime system. On the one hand, NR-MPI eliminates the job restart overhead by automatically recovering MPI communication states online for the surviving processes after failures. On the other hand, NR-MPI reduces the complexity of fault tolerant MPI by designing new MPI semantics. For example, duplicate messages and termination of programs do not need to be detected any more. We carried out detailed experiments to evaluate NR-MPI. The experimental results, from 2048 to 32768 processes, show that NR-MPI scales well.
NR-MPI also has limitations. Firstly, not all failures can be recovered. Insufficient communicator contexts, insufficient replacements, or the death of both partners that back up each other's data can cause recovery to fail. Secondly, programmers need to modify programs to use NR-MPI. Thirdly, upon failures, programmers have to roll back to a consistent position in the NR-MPI parallel program. However, the computation lost due to rollback is controlled by programmers. Fourthly, NR-MPI assumes that the crashed processes are necessary for the parallel job, so it reinitializes replacements upon failures. If replacements are not needed, they can be excluded from the recovered world communicator by calling MPI_Comm_create.
In the future, our work will focus on: 1) lazy allocation-based failure recovery, which requires the surviving processes to spawn replacements only when necessary; 2) more flexible data recovery algorithms to reduce the memory overhead of the double in-memory checkpoint; 3) combining NR-MPI with the new de facto message passing standard, to support fault tolerance for a broad range of extreme scale applications.

Figure 2. Recovery process of a communicator

Figure 5. Algorithm of NR-CG based on NR-MPI

Figure 6. Execution example of NR-CG

Figure 8. Backup intervals of E class NPB benchmarks

S. Guang, Y. Lu, X. Liao, M. Xie, H. Cao. 2016, Vol. 3, No. 1

Figure 1. Structure of the resource management system of NR-MPI

For NR-MPI, failures are detected by the fault tolerant RMS. To detect fail-stop failures, we add FDs to the Process Managers of the RMS on the computing nodes and add an FA to the Resource Manager.

Table 1. State definition
States: Descriptions
INIT: When a process calls MPI_Init, it enters INIT state. In INIT state, the processes, including standby processes, initialize their internal MPI data. At the end of INIT, the normal processes can exit, while the standby processes are waiting and processing failure notification messages. The standby processes exit INIT when they are selected as replacements upon failures or the program exits.
RUNNING: If the return status of MPI_Init indicates that initialization is OK, or if a replacement process recovers its lost data successfully, the process enters RUNNING state. A process can do user computation, user defined MPI communication, and backup application data (via C/R, ABFT, or MPI communication) in RUNNING state.
FINISH: When a process calls MPI_Finalize, it enters FINISH state. FINISH state is an absorbing state.
ABORT: Whenever a process finds unrecoverable failures, it enters ABORT state. To simplify the diagram, we only draw 4 state transfers to ABORT. ABORT is also an absorbing state.
FAILURE: When a normal process in RUNNING state finds failures in MPI library, it enters FAILURE state. This state is hidden in MPI library; the work in this state is to recover MPI core data.
RECOVER: After recovering MPI core data, a process enters RECOVER state. The work in this state is to recover MPI extend data and application data by programmers. If successful, the programmers should alter the state of the local process

Table 2. State transfer definition

Table 3. Programming interface

The failure detecting module can query failures efficiently from the external failure detector. When the processes are spawned by the process manager, the process manager creates a shared memory segment used to pass failure notification messages and adds FD_SHMID=shmid to the environment variables of the spawned processes. The failure recovery module can recover the corrupted MPI core data into a consistent state.

Table 4. Configuration of experiment platform

Table 5. Overall timing analysis of SCR and NR-MPI