Resilience within Ultrascale Computing System: Challenges and Opportunities from Nesus Project

Although resilience is already an established field in system science, with many methodologies and approaches available to deal with it, the unprecedented scale of computing, the massive data to be managed, new network technologies, and drastically new forms of massive-scale applications bring new challenges that need to be addressed. This paper reviews the challenges and approaches of resilience in ultrascale computing systems from multiple perspectives, addressing the resilience aspects of hardware-software co-design for ultrascale systems, resilience against (security) attacks, and new approaches and methodologies to resilience in ultrascale systems.


Introduction
Ultrascale computing is a new computing paradigm that arises naturally from the need for computing systems able to handle massive data in possibly very large scale distributed systems, enabling new forms of applications that can serve very large numbers of users in a timely manner never experienced before.
Ultrascale Computing Systems (UCSs) are envisioned as large-scale complex systems joining parallel and distributed computing systems that will be two to three orders of magnitude larger than today's systems (considering the number of Central Processing Unit (CPU) cores). It is very challenging to find sustainable solutions for UCSs due to their scale and the wide range of possible applications and involved technologies. For example, we need to deal with cross-fertilization among HPC, large-scale distributed systems, and big data management. One of the challenges regarding sustainable UCSs is resilience. Traditionally, it has been an important aspect in the area of critical infrastructure protection (e.g. the traditional electrical grid and smart grids). Furthermore, it has also become popular in the area of information and communication technology (ICT), ICT systems, computing, and large-scale distributed systems. In essence, resilience is the ability of a system to efficiently deliver and maintain (in a timely manner) a correct service despite failures and changes. It is important to contrast this term with the closely related "fault tolerance". The latter indicates only a well-defined behaviour of a system once an error occurs. For example, a system is resilient to the effect of an error (in one of its components) if it continues correct operation and service delivery (possibly degraded in some way), whereas it is fault tolerant to the error when it is able to detect and notify about the existence of the problem, with possible recovery to the correct state.
The existing practices of dependable design deal reasonably well with achieving and predicting dependability in systems that are relatively closed and unchanging. Yet, the tendency to make all kinds of large-scale systems more interconnected, open, and able to change without new intervention by designers makes existing techniques inadequate to deliver the same levels of dependability. For instance, evolution of the system itself and its uses impairs dependability: new components "create" system design faults or vulnerabilities by feature interaction or by triggering pre-existing bugs in existing components; likewise, new patterns of use arise, new interconnections open the system to attack by new potential adversaries, and so on.
Many new services and applications will be able to take advantage of ultrascale platforms, such as big data analytics, life science genomics and HPC sequencing, high energy physics (such as QCD), scalable robust multiscale and multi-physics methods, and diverse applications for analysing large and heterogeneous data sets related to social, financial, and industrial contexts. These applications have a need for Ultrascale Computing Systems (UCSs) due to scientific goals to simulate larger problems within a reasonable time period. However, it is generally agreed that applications will require substantial rewriting in order to scale and benefit from UCSs.
In this paper, we aim at providing an overview of Ultrascale Computing Systems (UCSs) and highlighting open problems. This includes:
• Exploring and reviewing the state-of-the-art approaches to continuous execution in the presence of failures in UCSs.
• Techniques to deal with hardware and system software failures or intentional changes within the complex system environment.
• Resilient, reactive schedulers that can survive errors at the node and/or cluster level; cluster-level monitoring and assessment of failures, with pro-active actions to remedy failures before they actually occur (such as migrating processes [13,58] or virtual machines [49]); and malleable applications that can adapt their resource usage at run-time.
In particular, we approach the problem from different angles, including fundamental issues such as reproducibility, repeatability, and resiliency against security attacks, and application-specific challenges such as hardware and software issues with big data and cloud-based cyber-physical systems. The paper then also discusses new opportunities in providing and supporting resilience in ultrascale systems.
This paper is organized as follows: Section 1 reviews the basic notions of faults, fault tolerance, and robustness, as well as several key issues that need to be tackled to ensure a robust execution on top of a UCS. Section 2 focuses on recent trends regarding the resilience of large-scale computing systems, covering hardware failures (see §2.1) and Algorithm-Based Fault Tolerance (ABFT) techniques, where the fault tolerance scheme is tailored to the algorithm, i.e. the application run. We will see that at this level, Evolutionary Algorithms (EAs) present all the characteristics needed to natively handle faulty executions, even at the scale foreseen in UCSs. Then, Section 3 reviews the challenges linked to the notions of repeatability and reproducibility in UCSs. The final section concludes the paper and provides some future directions and perspectives opened by this study.

Faults, errors and failures
Faults may be introduced into a system during its construction or through its exploitation (e.g. software bugs, hardware faults, problems with data transfer) [4,11,12,52]. A fault may cause a deviation of the system from the required operation, leading to an error (for instance, a software bug becomes apparent after a subroutine call). This transition is called a fault activation, i.e. a dormant fault (not producing any errors) becomes active. An error is detected if its presence is indicated by a message or a signal; present but undetected errors are called latent. Errors in the system may cause a (service) failure and, depending on its type, successive faults and errors may be introduced (error/failure propagation). The distinction between faults, errors, and failures is important because these terms create boundaries allowing the analysis of, and coping with, different threats. In essence, faults are the cause of errors (reflected in the state), which without proper handling may lead to failures (wrong and unexpected outcomes). Following these definitions, fault tolerance is the ability of a system to behave in a well-defined manner once an error occurs.

Fault models for distributed computing
There are five specific fault models relevant in distributed computing: omission, duplication, timing, crash, and byzantine failures [53].
Omission and duplication failures are linked with problems in communication. A send-omission corresponds to a situation when a message is not sent; a receive-omission, when a message is not received. Duplication failures occur in the opposite situation: a message is sent or received more than once.
Timing failures occur when time constraints concerning the service execution or data delivery are not met. This type is not limited to delays, since too early delivery of a service may also be undesirable.
The crash failure occurs in four variants, each additionally characterized by its persistence. Transient crash failures correspond to a service restart: amnesia-crash (the system is restored to a predefined initial state, independent of the previous inputs), partial-amnesia-crash (a part of the system stays in the state before the crash, while the rest is reset to the initial conditions), and pause-crash (the system is restored to the state it had before the crash). Halt-crash is a permanent failure encountered when the system or the service is not restarted and remains unresponsive.
The last model, the byzantine failure (also called arbitrary), covers any (very often unexpected and inconsistent) responses of a service or a system at arbitrary times. In this case, failures may emerge periodically with varying results, scope, effects, etc. This is the most general and serious type of failure [53].
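As an illustration, the omission, duplication, and byzantine models above can be sketched as a small fault-injection simulation. This is a minimal sketch for intuition only; the class and method names are illustrative and not taken from any real framework, and a fixed fault probability of 0.5 is assumed.

```python
import random

class FaultyChannel:
    """A message channel that injects one of the fault models above."""

    def __init__(self, model, seed=None):
        self.model = model          # "omission", "duplication", "byzantine", or None
        self.rng = random.Random(seed)
        self.delivered = []

    def send(self, msg):
        if self.model == "omission" and self.rng.random() < 0.5:
            return                          # message silently dropped
        if self.model == "duplication" and self.rng.random() < 0.5:
            self.delivered.append(msg)      # same message delivered twice
        if self.model == "byzantine" and self.rng.random() < 0.5:
            msg = self.rng.randint(0, 99)   # arbitrary corrupted content
        self.delivered.append(msg)

channel = FaultyChannel("duplication", seed=1)
for m in range(5):
    channel.send(m)
print(len(channel.delivered))  # may exceed 5 because of duplicated deliveries
```

A timing failure could be modelled analogously by attaching a (possibly too early or too late) delivery timestamp to each message.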

Dependable computing
Faults, errors, and failures are threats to a system's dependability. A system is described as dependable when it is able to fulfil a contract for the delivery of its services, avoiding frequent downtimes caused by failures.
Identification of threats does not automatically guarantee dependable computing. For this purpose, four main groups of appropriate methods have been defined [4]: fault prevention, fault tolerance, fault removal, and fault forecasting. As visible in fig. 1, all of them can be analysed from two points of view: either as means of avoidance/acceptance of faults or as approaches to support/assess dependability. Fault tolerance techniques aim to reduce (or even eliminate) the impact of faults on service delivery.

Fault tolerance
Fault tolerance techniques may be divided into two main, complementary categories [4]: error detection and recovery. Error detection may be performed during normal service operation or while it is suspended. The first approach in this category, concurrent detection, is based on various tests carried out by components (software and/or hardware) involved in the particular activity, or by elements specially designated for this function. For example, a component may calculate and verify checksums for the data it processes. On the other hand, a firewall is a good illustration of a designated piece of hardware (or software) oriented towards the detection of intrusions and other malicious activities. Preemptive detection is associated with the maintenance and diagnostics of a system or a service. The focus in this approach is laid on the identification of latent faults and dormant errors. It may be carried out at system startup, at service bootstrap, or during special maintenance sessions.
After an error or a fault is detected, recovery methods are applied. Depending on the problem type, error or fault handling techniques are used. The first group is focused on the elimination of errors from the system state, while the second is designed to prevent the activation of faults. In [4], the specific methods are separated from each other, whereas in practice this boundary is fuzzy and depends on the specific service and system types.
Generally, error handling is solved through:
1. Rollback [22] - the system is restored to the last known, error-free state. The approach here depends on the method used to track the changes of the state. A well-known technique is checkpointing: the state of a system is saved periodically (e.g. the snapshot of a process is stored on disk) as a potential recovery point for the future. Obviously, this solution is not straightforward in the case of distributed systems, and there are many factors to consider. In such an environment, checkpointing can be coordinated or not, with differences in reliability and in the cost of synchronisation of the distributed components (for details see [18,31,53]). Rollback can also be implemented through message logging. In this case, the communication between the components is tracked rather than their state. In case of an error, the system is restored by replaying the historical messages, allowing it to reach global consistency [53]. Sometimes both techniques are treated as one, as they usually complement each other.
2. Rollforward - the current, erroneous system state is discarded and replaced with a newly created and initialised one.
3. Compensation - solutions based on component redundancy and replication, sometimes referred to as fault masking. In the first case, additional components (usually hardware) are kept in reserve [31]. If failures or errors occur, they are used to compensate for the losses. For example, the Internet connection of a cloud platform should be based on solutions from at least two different Internet Service Providers (ISPs).
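The rollback scheme described above can be sketched in a few lines. In this minimal, single-process sketch (all names are illustrative), an in-memory snapshot stands in for a checkpoint written to disk; a transient error discards only the work performed since the last checkpoint, and the final result equals that of a failure-free run.

```python
import copy

class CheckpointedWorker:
    def __init__(self):
        self.state = {"step": 0, "total": 0}
        self._snapshot = copy.deepcopy(self.state)

    def checkpoint(self):
        self._snapshot = copy.deepcopy(self.state)  # save a recovery point

    def rollback(self):
        self.state = copy.deepcopy(self._snapshot)  # restore last good state

    def run(self, steps, fail_at=None, interval=3):
        while self.state["step"] < steps:
            try:
                if self.state["step"] == fail_at:
                    fail_at = None              # fail only once
                    raise RuntimeError("transient error")
                self.state["total"] += self.state["step"]
                self.state["step"] += 1
                if self.state["step"] % interval == 0:
                    self.checkpoint()
            except RuntimeError:
                self.rollback()                 # discard work since last checkpoint
        return self.state["total"]

w = CheckpointedWorker()
print(w.run(10, fail_at=7))  # same result as a failure-free run: sum(range(10)) = 45
```

In a distributed setting, the checkpoint would additionally have to be coordinated (or message logging used) so that the set of restored states is globally consistent.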
Replication is based on the dispersion of multiple copies of the service components. A schema with replicas used only for the purpose of fault tolerance is called passive (primary-backup) replication [31]. On the other hand, active replication is when the replicas participate in providing the service, leading to increased performance and the applicability of load-balancing techniques [31]. Coherence is the major challenge here, and various approaches are used to support it. For instance, read-write protocols are crucial in active replication, as all replicas have to have the same state. Another noteworthy example is clearly visible in volunteer-based platforms: an appropriate selection policy for the correct service response is needed when replicas return different answers, i.e. a method to reach quorum consensus is required [31]. These techniques are not exclusive and can be used together. If the system cannot be restored to a correct state through compensation, rollback may be attempted. If this fails, then rollforward may be used.
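The quorum-consensus selection policy mentioned above can be sketched as follows. This is a deliberately simplified, synchronous sketch (the replica callables stand in for real service endpoints, and the names are illustrative): when replicas disagree, the answer returned by at least a quorum of them wins.

```python
from collections import Counter

def quorum_read(replicas, quorum):
    """Query all replicas and return the answer backed by >= quorum votes."""
    answers = [r() for r in replicas]
    value, votes = Counter(answers).most_common(1)[0]
    if votes < quorum:
        raise RuntimeError("no quorum reached")
    return value

replicas = [lambda: 42, lambda: 42, lambda: 7]   # one replica returns a wrong answer
print(quorum_read(replicas, quorum=2))           # -> 42
```

Real protocols must additionally handle unresponsive replicas (timeouts) and choose read/write quorum sizes so that any two quorums intersect.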
The above-mentioned methods may be referred to as general-purpose techniques. These solutions are relatively generic, which aids their implementation for almost any distributed computation. It is also possible to delegate the responsibility for fault tolerance to the service (or application) itself, allowing the solution to be tailored for specific needs, therefore forming an application-specific approach. A perfect example in this context is ABFT, originally applied to distributed matrix operations [14], where the original matrices are extended with checksums before being scattered among the processing resources. This allows the detection, location, and correction of certain miscalculations, creating a disk-less checkpointing method. Similarly, in certain cases it is possible to continue the computation or the service operation despite the occurring errors. For instance, an unavailable resource resulting from a crash-stop failure can be excluded from further use. In this work, the idea will be further analysed and extended to the context of byzantine errors and nature-inspired distributed algorithms.
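The checksum-extension idea behind ABFT for matrix operations can be sketched concretely. In this minimal, single-process sketch of the Huang-Abraham-style scheme (function names are illustrative), A gains a checksum row (its column sums) and B a checksum column (its row sums); the product then carries both checksums, so a single corrupted entry can be located at the intersection of the failing row and column checks and corrected without any disk checkpoint.

```python
def abft_multiply(A, B):
    """Multiply checksum-extended copies of square matrices A and B."""
    n = len(A)
    Ac = A + [[sum(col) for col in zip(*A)]]            # column-checksum row
    Br = [row + [sum(row)] for row in B]                # row-checksum column
    return [[sum(Ac[i][k] * Br[k][j] for k in range(n))
             for j in range(n + 1)] for i in range(n + 1)]

def locate_and_fix(C):
    """Detect, locate, and correct a single erroneous element of C."""
    n = len(C) - 1
    bad_r = [i for i in range(n) if abs(sum(C[i][:n]) - C[i][n]) > 1e-9]
    bad_c = [j for j in range(n)
             if abs(sum(C[i][j] for i in range(n)) - C[n][j]) > 1e-9]
    if bad_r and bad_c:                                 # intersection = faulty cell
        i, j = bad_r[0], bad_c[0]
        C[i][j] = C[i][n] - sum(C[i][k] for k in range(n) if k != j)
    return [row[:n] for row in C[:n]]                   # strip the checksums

C = abft_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])
C[0][1] += 100                       # inject a single-element miscalculation
print(locate_and_fix(C))             # -> [[19, 22], [43, 50]]
```

In the distributed setting, each worker holds a slice of the extended matrices, so the check can be performed locally on partial results.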
Fault handling techniques are applied after the system is restored to an error-free state (using the methods described above). As the aim now is to prevent future activation of the detected faults, four subgroups may be created according to the intention of the operation. These are [4]: diagnosis (the error(s) are identified and their source(s) located), isolation (faulty components are logically or physically separated and excluded from the service), reconfiguration (the service/platform is reconfigured to substitute or bypass the faulty elements), and reinitialization (the configuration of the system is adapted to the new conditions).

Robustness
When a given system is resilient to a given type of fault, one generally claims that this system is robust. Yet rigorously defining robustness is not an easy task, and many contributions come with their own interpretation of what robustness is. Actually, there exists a systematic framework that makes it possible to define a robust system unambiguously; it should probably be applied to any system or approach claiming to propose a fault-tolerance mechanism. This framework, formalized in [57], answers the following three questions:
1. What behavior of the system makes it robust?
2. What uncertainties is the system robust against?
3. Quantitatively, exactly how robust is the system?
The first question is generally linked to the technique or the algorithm applied. The second explicitly lists the types of faults or disturbing elements targeted by the system; answering it is critical to delimit the application range of the designed system and to avoid counter-examples selected in a context not addressed by the robust mechanism. The third question is probably the most difficult to answer, and at the same time the most vital to characterize the limits of the system. Indeed, there is nearly always a threshold on the error/fault rate above which the proposed infrastructure fails to remain robust and breaks (in some sense).
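For simple redundancy schemes, the third question can be answered analytically, including the threshold at which robustness breaks. As a hedged illustration (using 2-out-of-3 majority voting as the example mechanism, with independent per-replica fault probability p), the system delivers a correct result while at most one replica faults:

```python
def tmr_reliability(p):
    """Probability that a 2-out-of-3 voted system is correct,
    given independent per-replica fault probability p."""
    ok = 1.0 - p
    return ok**3 + 3 * p * ok**2     # zero faults, or exactly one fault

for p in (0.01, 0.1, 0.5):
    print(f"p={p}: system reliability {tmr_reliability(p):.4f}")
# Voting helps only while p < 0.5: beyond that threshold the
# voted system is *less* reliable than a single component.
```

This kind of closed-form (or, for more complex systems, simulation-based) curve is exactly what an unambiguous robustness claim should report.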

Lessons Learned from Big Data Hardware
One of the direct consequences of the treatment of big data is, clearly, the requirement for extremely high processing power. And whereas research in the big data domain does not traditionally include research in processor and computer architecture, there is a clear correlation between advances in the two domains. While it is obviously difficult to predict future developments in processing architectures with high accuracy, we have identified two major trends that are likely to affect big data processing: the development of many-core devices and hardware/software co-design.
The many-core approach represents a step-change in the number of processing units available either in single devices or in tightly-coupled arrays. Exploiting techniques and solutions derived from the Network-on-Chip (NoC) [32] and Graphics Processing Unit (GPU) areas, many-core systems are likely to have a considerable impact on application development, pushing towards distributed-memory and data-flow computational models. At the same time, the standard assumption of "more tasks than processors" will be loosened (or indeed inverted), reducing to some extent the complexity of processes such as task mapping and load balancing.
Hardware/software co-design implies that applications will move towards a co-synthesis of hardware and software: the compilation process will change to generate at the same time the code to be executed by a processor and one or more hardware co-processing units to accelerate computation. Intel and ARM have already announced alliances with Altera and Xilinx, respectively, to offer tight coupling between their processors and reconfigurable logic, while Microsoft recently introduced the reconfigurable Catapult system to accelerate its Bing servers [47].
These trends, coupled with the evolution of VLSI fabrication processes (the sensitivity of a device to faults increases as feature size decreases), introduce new challenges to the application of fault tolerance in the hardware domain. In addition to increasing the probability of fabrication defects (not directly relevant to this article), the heterogeneous nature of these systems and their extreme density represent major challenges to reliability. Indeed, the notion of hardware fault itself is being affected and extended to include a wider variety of effects, such as variability and power/heat dissipation. This section does not in any way claim to represent an exhaustive survey of this very complex area, nor even a thorough discussion of the topic, but rather wants to provide a brief "snapshot" of a few interesting approaches to achieve fault tolerance in hardware, starting with a brief outline of some key concepts and fundamental techniques.

Fault tolerance in digital hardware
One of the traditional classification methods subdivides online faults in hardware systems (i.e. faults that occur during the lifetime of a circuit, rather than at fabrication) into two categories: permanent and transient (a third category, intermittent faults, is outside the scope of this discussion).
Permanent faults are normally introduced by irreversible physical damage to a circuit (for example, short circuits).Rather common in fabrication, they are rare in the lifetime of a circuit, but become increasingly less so as circuits age.Once a permanent fault appears, it will continue to affect the operation of the circuit forever.
Transient faults have a limited duration and will disappear with time. By far the most common example of transient faults is the Single-Event Upset (SEU), where radiation causes a change of state within a memory element in a circuit. This distinction is highly relevant in the context of fault tolerance, defined as the ability of a system to operate correctly in the presence of faults. Generally, the design of a fault-tolerant hardware system involves four successive steps [1]:
1. Fault detection: can the system detect the presence of a fault?
2. Fault diagnosis or localization: can the system identify (as precisely as needed) the exact nature and location of a fault?
3. Fault limitation or containment: can the impact of the fault on the operation of the system be circumscribed so that no irreversible damage (to the circuit or to the data) results?
4. Fault repair: can the functionality of the system be recovered?
While there is usually no significant difference between transient and permanent faults in the first three steps, the same does not apply to the last step: transient faults can allow the full recovery of the circuit functionality, whereas graceful degradation (e.g., [15]) is normally the objective in the case of permanent faults in the system.
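The asymmetry in the repair step can be made concrete with a minimal sketch (all names are illustrative): re-execution fully recovers from a transient fault, while a permanent fault persists across retries and forces a fallback to degraded operation.

```python
def run_with_retries(unit, retries=3):
    """Retry a computation: recovers transients, degrades on permanents."""
    for _ in range(retries):
        result = unit()
        if result is not None:
            return result           # transient fault: a retry recovered it
    return "degraded"               # permanent fault: fall back gracefully

def make_transient():
    """A unit that fails exactly once, then works (transient fault)."""
    state = {"failed": False}
    def unit():
        if not state["failed"]:
            state["failed"] = True
            return None
        return 42
    return unit

def permanent():
    return None                     # fails on every execution

print(run_with_retries(make_transient()))  # -> 42
print(run_with_retries(permanent))         # -> degraded
```

In hardware the "retry" would be a re-execution or a scrub-and-reload of the affected state, and the "degraded" branch corresponds to isolating the faulty resource.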

Fundamental techniques
Any implementation of fault tolerance (or indeed of fault detection) in hardware implies, directly or indirectly, the use of redundancy. Specific applications of redundancy, however, vary significantly depending on the features of the hardware system. In general, three "families" of redundant techniques can be identified. Once again, the examples presented in this section are not meant to be exhaustive, but simply to illustrate the different ways in which redundancy can be applied in the context of fault tolerance in hardware systems.

Data or information redundancy
This type of technique relies on the use of non-minimal coding to represent the data in a system. By far the most common implementation of data redundancy implies the use of error-detecting codes (EDC), when the objective is fault detection, and of error-correcting codes (ECC), when the objective is fault tolerance [1].
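A classic single-error-correcting ECC is the Hamming(7,4) code, sketched below in software for illustration (in hardware the XORs would be parity trees). Four data bits gain three parity bits, and the syndrome computed at decode time directly names the position of a flipped bit.

```python
def hamming74_encode(d):                 # d = [d1, d2, d3, d4]
    c = [0] * 8                          # 1-indexed codeword, c[1..7]
    c[3], c[5], c[6], c[7] = d           # data bits at positions 3, 5, 6, 7
    c[1] = c[3] ^ c[5] ^ c[7]            # parity over positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]            # parity over positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]            # parity over positions 4, 5, 6, 7
    return c[1:]

def hamming74_decode(code):
    c = [0] + list(code)
    s = (c[1] ^ c[3] ^ c[5] ^ c[7]) \
        | (c[2] ^ c[3] ^ c[6] ^ c[7]) << 1 \
        | (c[4] ^ c[5] ^ c[6] ^ c[7]) << 2
    if s:                                # non-zero syndrome = error position
        c[s] ^= 1                        # correct the flipped bit
    return [c[3], c[5], c[6], c[7]]

word = hamming74_encode([1, 0, 1, 1])
word[2] ^= 1                             # inject a single-event upset
print(hamming74_decode(word))            # -> [1, 0, 1, 1]
```

A simple parity bit, by contrast, is an EDC: it detects a single flip but carries no information about its position.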
It is worth highlighting that, even though these techniques rely on information redundancy, they also imply considerable hardware overhead, not only due to the requirement for additional storage (due to the non-minimal encoding), but also because the computation of the additional redundant bits implies the presence of (sometimes significant) additional logic.

Hardware redundancy
Hardware redundancy techniques exploit additional resources more directly to achieve fault detection or tolerance.In the general case, the best-known hardware redundancy approaches exploit duplication (Double Modular Redundancy, or DMR) for fault detection or triplication (Triple Modular Redundancy, or TMR) for fault tolerance.
TMR in particular is a widely used technique for safety critical systems: three identical systems operate on identical data, and a 2-out-of-3 voter is used to detect faults in one system and recover the correct result from the others.In its most common implementations (for example, in space missions), TMR is usually applied to complete systems, but the technique can operate at all levels of granularity (for example, it would be possible, if terribly inefficient, to design TMR systems for single logic gates).
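The 2-out-of-3 voting just described can be sketched in a few lines (a behavioural software model, not a gate-level design; the names are illustrative): three identical units compute on the same input, and the majority output masks a fault in any one of them.

```python
def tmr(units, x):
    """Run three redundant units and return the 2-out-of-3 majority output."""
    outputs = [u(x) for u in units]
    for o in outputs:
        if outputs.count(o) >= 2:        # majority vote
            return o
    raise RuntimeError("no majority: more than one unit faulty")

square = lambda x: x * x
faulty = lambda x: x * x + 1             # one unit with an injected fault
print(tmr([square, square, faulty], 6))  # -> 36
```

Note that the voter itself becomes a single point of failure, which is why critical designs sometimes triplicate the voters as well.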

Time redundancy
This type of approach relies, generally speaking, on the repetition of a computation and the comparison of the results between the different runs. In the simplest case, the same computation is repeated twice and should generate identical results, allowing the detection of SEUs. More sophisticated (but less generally applicable) approaches introduce differences between the two executions (e.g. by inverting input data or shifting operands) in order to be able to detect permanent faults as well.
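The limitation of plain re-execution, and the operand-inversion refinement, can be sketched as follows. This is a hedged software model (the "unit" here is assumed to compute an odd function, so a correct run on negated input should yield the negated output): a deterministic stuck-at fault passes re-execution unnoticed, but the inverted run exposes it.

```python
def detect_transient(unit, x):
    """Plain time redundancy: two runs disagree only on a transient fault."""
    return unit(x) != unit(x)

def detect_by_inversion(unit, x):
    """Run once normally, once on negated input; for a correct odd
    function unit(-x) == -unit(x), so a mismatch flags a fault."""
    return unit(-x) != -unit(x)

stuck_double = lambda x: (2 * x) | 1     # doubling unit with bit 0 stuck at 1
print(detect_transient(stuck_double, 4))     # False: both runs agree
print(detect_by_inversion(stuck_double, 4))  # True: the inverted run exposes it
```

Hardware implementations apply the same idea with bitwise-inverted or shifted operands and matching result re-alignment logic.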
It is worth noting that time redundancy techniques are rarely used when fault tolerance is sought (being essentially limited to detection) but in theory can be extended to allow it in case of transient faults.

Fault tolerant design
In the introduction to this section, we highlighted how the heterogeneity and density of the types of devices that are likely to become relevant in big data treatment considerably complicate the task of achieving fault-tolerant behaviour in hardware.
In particular, the heterogeneity introduced by the presence of programmable logic and the complexity of many-core devices imply that the notion of a single approach to fault tolerance applicable to every component of a system will have to be replaced by ad-hoc techniques. What follows is a short list of the main components of a complete system, together with a brief analysis of their fault-tolerance requirements and a few examples of approaches developed to achieve this goal, in order to illustrate some of the issues and difficulties that will have to be met.

Memories
Memory elements are probably the hardware components that require the highest degree of fault tolerance: their extremely regular structure implies that transistor density in memories is substantially greater than in any other device (the largest memory device commercially available in 2015 reaches a transistor count of almost 140 billion, compared, for example, to the 4.3 billion of the largest processor). This level of density has resulted in the introduction of fault-tolerant features even in commonly available commercial memories.
Reliability in memories takes essentially two forms: to protect against SEUs, the use of redundant ECC bits associated with each memory word is common and well-advertised [39], while marginally less known is the use of spare memory locations to replace permanently damaged ones. The latter technique, used extensively at fabrication for laser-based permanent reconfiguration, has also been applied in an on-line self-repair setting [16].

Programmable logic
Programmable logic devices (generally referred to as Field Programmable Gate Arrays) are regular circuits that can reach extremely high transistor counts.In 2015, the largest commercial FPGA device (the Virtex-Ultrascale XCVU440 by Xilinx) contains more than 20 billion transistors.
The regularity of FPGAs has sparked a significant amount of research into self-testing and self-repairing programmable devices since the late 1990s [2,34,37], but to the best of our knowledge this research has yet to impact consumer products (even considering potential fabrication-time improvement measures similar to those described in the previous section for memories), with the exception of radiation hardening for space applications.
In reality, the relationship between programmable logic and fault tolerance would merit a more complete analysis, since the interplay between the fabric of the FPGA itself and the circuit that is implemented within the fabric can lead to complex interactions. Such an analysis is, however, beyond the scope of this article. Interestingly in this context, even though its FPGA fabric itself does not appear to contain explicit features for fault tolerance, Xilinx supports a design tool to allow a degree of fault tolerance in implemented designs through its Isolation Design Flow, specifically aimed at fault containment.

Single processing cores
The main driving force in the development of high-performance processors has been, until recently at least, sheer computational speed. In the last few years, power consumption has become an additional strong design consideration, particularly since the pace of improvements in performance has started to slow. Since fault tolerance, with its redundancy requirements, has negative implications both for performance and for power consumption, relatively little research into fault-tolerant cores has reached the consumer market.
The situation is somewhat different outside of the high-performance end of the spectrum, where examples of processors specifically designed for fault tolerance exist (for example, the NGMP processor developed on behalf of the European Space Agency [3] or the Cortex-R series by ARM), demonstrating at least the feasibility of such implementations.
More recently, the RAZOR approach [23] represents a fault tolerance technique aimed specifically at detecting (and possibly correcting) timing errors within processor pipelines using a particular kind of time redundancy approach that exploits delays in the clock distribution lines.

On-chip networks
Networks are a crucial element of any system where processors have to share information, and therefore represent a fundamental aspect not only of many-core devices, but also of any multiprocessor system. Often rivalling the processing units themselves in size and complexity, networks and their routers have traditionally been fertile ground for research on fault tolerance.
Indeed, even when limiting the scope of the investigation to on-chip networks, numerous books and surveys exist that classify, describe, and analyse the most significant approaches to fault tolerance (for example, [7,45,48]). Very broadly, most of the fundamental redundancy techniques have been applied, in one form or another, to the problem of implementing fault-tolerant on-chip networks, ranging from data redundancy (e.g. parity or ECC encoding of transmitted packets), through hardware redundancy (e.g. additional routing logic), to time redundancy (e.g. repeated data transmission).

Many-Core arrays
An accurate analysis of fault tolerance in many-core devices is of course hampered by the lack of commercially available devices (the Intel MIC architecture, based on Xeon Phi co-processors, is a step in this direction, but at the moment it is limited to a maximum of 61 cores and relies on conventional programming models within a coarse-grained architecture). Using transistor density as a rough indicator of the fault sensitivity of a device (keeping in mind that issues related to heat dissipation can be included in the definition), it is no surprise that fault tolerance is generally considered one of the key enabling technologies for this type of device: once again, the regular structure of many-core architectures is likely to have a significant impact on transistor count. Today, for example, the transistor count in GPUs (the commercial devices that, arguably, bear the closest resemblance to many-core systems, both for the number of cores and for the programming model) is roughly twice that of Intel processors using similar fabrication processes, and even in the case of Intel this type of density was achieved only in multi-core (and hence regular) devices.
The lack of generally accessible many-core platforms implies that most of the existing approaches to fault tolerance in this kind of system remain at a somewhat higher abstraction layer and typically rely on mechanisms of task remapping through OS routines or dedicated middleware [9,10,33,59]. Specific hardware-level approaches, on the other hand, have been applied to GPUs (e.g., [55]) and could have an impact on many-core systems. Indeed, one of the few prototype many-core platforms with a degree of accessibility (ClearSpeed's CSX700 processor) boasts a number of dedicated hardware error-correction mechanisms, hinting at least at the importance of fault tolerance in this type of device, whereas no information is available on fault-tolerance mechanisms in the Intel MIC architecture.

Toward Inherent Software Resilience: ABFT nature of EAs
Evolutionary Algorithms (EAs) are a class of problem-solving techniques based on the Darwinian theory of evolution [19] which involve the search over a population of solutions.
A set of recent studies [20,26,30,35,42,43] illustrates what seems to be a natural resilience of EAs against a model of destructive failures (crash failures). With a properly designed execution, the system experiences graceful degradation [35]. This means that, up to some threshold and despite the failures, results are still delivered; however, the execution either requires more time, or the returned values are further from the optimum being searched.
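This graceful degradation can be illustrated with a deliberately tiny sketch (a (1+1)-style hill climber on a one-dimensional problem; all names and parameters are illustrative, not taken from the cited studies). When an evaluation is lost to a crash, the offspring is simply discarded; the search still converges, only more slowly.

```python
import random

def evolve(steps, crash_prob, seed=0):
    """Minimise |x| by mutation and selection, with crash-lost evaluations."""
    rng = random.Random(seed)
    x = 10.0                                 # starting solution
    for _ in range(steps):
        child = x + rng.gauss(0, 0.5)        # mutation
        if rng.random() < crash_prob:
            continue                         # evaluation lost to a crash
        if abs(child) < abs(x):
            x = child                        # survival of the fitter
    return abs(x)                            # distance from the optimum 0

for p in (0.0, 0.3):
    print(f"crash rate {p}: final error {evolve(2000, p):.4f}")
```

No checkpointing or recovery logic is needed: the redundancy of the population-based (here degenerate, single-individual) search absorbs the lost evaluations, which is precisely the ABFT-like character of EAs.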

Repeatability and Reproducibility Challenges in UCSs
Repeatability and reproducibility are important aspects of sound scientific research in all disciplines, yet they are more difficult to achieve than might be expected [5,54]. Repeatability denotes the ability of the original investigator to repeat the same experiment and achieve the same result [56]. Reproducibility, on the other hand, enables the verification of the validity of the conclusions and claims drawn from scientific experiments by other researchers, independent of the original investigator.
Repeatability is essential for all evidence-based sciences, as a counterpart to the formal proofs used in theoretical sciences or the discourse used widely in, e.g., the humanities. It is a key requirement for all sciences and studies relying on computational processes. The challenge of achieving repeatability, however, increases drastically with the complexity of the underlying computational processes, making their characteristics less intuitive to grasp, interpret, and verify. It thus becomes an enormous challenge in the area of ultrascale computing, given the massive number of computing steps involved and the numerous dependencies of an algorithm running on a software and hardware stack of considerable complexity.
While many disciplines have, over sometimes long periods, established a set of good practices for repeating and verifying their experiments (e.g. the experiment logbooks used in disciplines such as chemistry or physics, where researchers record their experiments), computational science lags behind, and many investigations are hard to repeat or reproduce [17]. This can be attributed to the lower maturity of computer science methods and practices in general, the fast-moving pace of the technology used to perform the experiments, or the multitude of different software components that must interact to perform them. Small variations in, for example, the version of a specific piece of software can have a great impact on the final result, which might deviate significantly from the expected outcome, as has prominently been shown e.g. for the analysis of CT scans in the medical domain [27]. More severely, the source of such changes might not even lie in the software specifically used for a certain task, but somewhere further down the stack of software the application depends on, including for example the operating system, system libraries, or the very specific hardware environment being used. Recognizing these needs, steps are being taken to assist researchers in ensuring their results are more easily reproducible [24,25]. The significant overhead of providing enough documentation to allow an exact reproduction of the experiment setup further adds to these difficulties.
Technical solutions to increase reproducibility in eScience research fall into several branches. One type of solution aims at recreating the technical environments in which experiments are executed. Simple approaches towards this goal include virtualising the complete environment the experiment is conducted in, e.g. by making a clone which can subsequently be redistributed and executed in virtual machines. Such approaches only partially enable reproducibility, as the cloned system potentially contains many more applications than are actually needed, and no identification of which components are actually required is provided. Thus, a more favourable approach is to recreate only the needed parts of the system. Code, Data, and Environment (CDE) [28] is such an approach: it detects the required components during the runtime of a process. CDE works on Linux operating system environments and requires the user to prepend the cde command to scripts or binaries. CDE then intercepts system calls and gathers all the files and binaries used in the execution. A package created thereof can then be transferred to a new environment.
CDE has a few shortcomings, especially in distributed system set-ups. External systems may be utilised, e.g. by calling Web Services that take over part of the computational tasks; CDE does not aim at detecting these calls. This challenge is exacerbated in more complex distributed set-ups such as may be encountered in ultrascale computational environments.
Not only external calls, but also calls to local service applications are an issue. These services normally run in the background and are started before the program execution, so not all the sources necessary to run them are detected. More problematic still, there is no explicit detection of such a service being a background or remote service. Thus, the fact that the capturing of the environment is incomplete remains unnoticed by users who are not familiar with all the details of the implementation. The Process Migration Framework (PMF) [8] is a solution similar to CDE, but it specifically takes the distributed aspect into account.
Another approach to enable better repeatability and reproducibility is the use of standardised methods and techniques to author experiments, such as workflows. Workflows allow a precise definition of the involved steps, the required environment, and the data flow between components. The modelling of workflows can be seen as an abstraction layer, as they describe the computational ecosystem of the software used during a process. Additionally, they provide an execution environment that integrates the required components to perform a process and executes all defined subtasks. Different scientific workflow management systems exist that allow scientists to combine services and infrastructure for their research. The most prominent examples of such systems are Taverna [41] and Kepler [36]. VisTrails [51] also adds versioning and provenance of the creation of the workflow itself. The Pegasus workflow engine [21] specifically aims at scalability and allows executing workflows on cluster, grid, and cloud infrastructures.
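A minimal illustration of what a workflow engine provides (a hypothetical sketch, not the API of Taverna, Kepler, VisTrails, or Pegasus): steps, their dependencies, and the data flow between them are declared explicitly, and the engine executes them in a valid order.

```python
def run_workflow(steps, dependencies):
    """Execute steps in dependency order, passing named outputs along.
    steps: {name: callable(inputs_dict) -> output}
    dependencies: {name: [names whose outputs the step consumes]}
    """
    results, done = {}, set()
    while len(done) < len(steps):
        progressed = False
        for name, func in steps.items():
            if name in done:
                continue
            deps = dependencies.get(name, [])
            if all(d in done for d in deps):
                # Run the step with exactly the outputs it declared.
                results[name] = func({d: results[d] for d in deps})
                done.add(name)
                progressed = True
        if not progressed:
            raise ValueError("cyclic or unsatisfiable dependencies")
    return results

# A three-step experiment: load -> clean -> analyse.
steps = {
    "load":    lambda _: [3, 1, 2, None],
    "clean":   lambda ins: [x for x in ins["load"] if x is not None],
    "analyse": lambda ins: sum(ins["clean"]) / len(ins["clean"]),
}
deps = {"clean": ["load"], "analyse": ["clean"]}
results = run_workflow(steps, deps)
```

Because every step declares its inputs explicitly, the same definition can be re-executed later and each intermediate result inspected, which is precisely what makes workflows attractive for reproducibility.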
Building on top of workflows, the concept of workflow-centric Research Objects (ROs) [6] tries to describe research workflows in the wider ecosystem they are embedded in. ROs are a means to aggregate or bundle resources used in a scientific investigation, such as a workflow, provenance from the results of its execution, and other digital resources such as publications or datasets. In addition, annotations are used to describe these objects further. Digital libraries exist for sharing workflows and Research Objects, such as the myExperiment platform. Workflows facilitate many aspects of reproducibility. However, unless experiments are designed from the beginning to be implemented as a workflow, there is a significant overhead in migrating an existing solution to a workflow. Furthermore, workflows are normally limited in the features they support, most prominently in the programming languages available. Thus, not all experiments can easily be implemented in this manner.
A model to describe a scientific process or experiment is presented in [38]. It allows researchers to describe their experiments in a manner similar to the Research Objects approach; however, this model is independent of a specific workflow engine and provides a more refined set of concepts to specify the software and hardware setup utilised.
Another important aspect of reproducibility is the verification of the results obtained. The verification and validation of experiments aims to prove whether the replicated or repeated experiment has the same characteristics and performs in the same way as the original experiment, even if the original implementation is faulty. Simple approaches that just compare a final experiment outcome, e.g. a performance measure of a machine learning experiment, do not provide sufficient evidence for this task, especially in settings where probabilistic learning is utilised and a close approximation of the original result would also be accepted. Furthermore, a comparison of final outcomes provides no means to trace where potential deviations originated in the experiment. It is therefore required to analyse the characteristics of an experimental process with respect to its significant properties, its determinism, and the levels at which significant states of the process can be compared. One further needs to define a set of measurements to be taken during an experiment and identify appropriate metrics to compare values obtained from different experiment executions, beyond simple similarity [44]. A framework to formalise such a verification has been introduced as the VFramework [40], which is specifically tailored to process verification. It describes what conditions must be met and what actions need to be taken in order to compare the executions of two processes.
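The checkpoint-wise comparison idea can be sketched as follows (a hypothetical simplification of what a framework like the VFramework formalises): each measured property of a run is compared against its own tolerance, so deviations can be localised at a specific checkpoint rather than only observed in the final outcome.

```python
def verify_runs(original, repeated, tolerances):
    """Compare two executions checkpoint by checkpoint, each measurement
    with its own numeric tolerance, to localise where deviations arise."""
    report = {}
    for name, tol in tolerances.items():
        report[name] = abs(original[name] - repeated[name]) <= tol
    return report

# Deterministic steps get zero tolerance; probabilistic outcomes get a band.
original = {"preprocessed_rows": 1000, "model_accuracy": 0.913}
repeated = {"preprocessed_rows": 1000, "model_accuracy": 0.911}
tolerances = {"preprocessed_rows": 0, "model_accuracy": 0.005}
report = verify_runs(original, repeated, tolerances)
```

If the row count differed, the deviation would be pinned to the preprocessing stage instead of surfacing only as an unexplained accuracy gap.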
In the context of ultrascale computing, the distributed nature of large experiments poses challenges for repeating results, as there are many potential sources of error and an increased demand for documentation. Also, approaches such as virtualisation or the recreation of computing environments hit the boundaries of feasibility, especially in larger distributed settings based on grid or cloud infrastructure, where the number of nodes to be stored becomes difficult to manage. Another challenge in ultrascale computing lies in the nature of the computing hardware utilised, which is often highly specialised towards certain tasks and much more difficult to capture and recreate in other settings.
Last, but not least, ultrascale computing is usually also tightly linked with massive volumes of data that need to be kept available and identifiable in sometimes highly dynamic environments. Proper data management and preservation have been prominently called for [29]. A key aspect of ultrascale computing in this context is the means to persistently identify the precise versions and subsets of data used in an experiment. The Working Group on Dynamic Data Citation of the Research Data Alliance has been developing recommendations on how to achieve such a machine-actionable citation mechanism for dynamic data, which is currently being evaluated in a number of pilots [46].
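The Working Group's recommendation is, roughly, to store the subsetting query together with a timestamp and a checksum of the result set against versioned data, rather than copying the subset itself; a toy sketch of this idea (all names hypothetical):

```python
import hashlib
import json
import time

def cite_subset(dataset, query_text, predicate):
    """Persistently identify a data subset by its query, a timestamp,
    and a hash of the result set -- not by storing the data again."""
    subset = [row for row in dataset if predicate(row)]
    digest = hashlib.sha256(
        json.dumps(subset, sort_keys=True).encode()).hexdigest()
    return {
        "query": query_text,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "result_hash": digest,
        "size": len(subset),
    }

dataset = [{"id": 1, "temp": 21.5}, {"id": 2, "temp": 35.0},
           {"id": 3, "temp": 19.0}]
citation = cite_subset(dataset, "temp > 30", lambda r: r["temp"] > 30)

# Re-executing the stored query against the same data version must
# reproduce the same hash, which verifies the cited subset.
recheck = cite_subset(dataset, "temp > 30", lambda r: r["temp"] > 30)
```

The hash makes the citation machine-actionable: a later re-execution can prove it retrieved exactly the cited subset, even when the full dataset keeps evolving.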

Conclusion
In this paper, we first proposed an overview of resilient computing in Ultrascale Computing Systems, i.e., cross-layered techniques dealing with hardware and software failures or attacks, as well as the necessary supporting services, including security and repeatability. We also described how new application needs, such as big data and cyber-physical systems, challenge existing computing paradigms and solutions.
New opportunities have been highlighted, but they certainly require further investigation, in particular large-scale experiments and validations. What emerges is the need for additional disruptive paradigms and solutions at all levels: hardware, languages, compilers, operating systems, middleware, services, and application-level solutions. Offering a

Figure 1. Means for dependable computing