Exascale Machines Require New Programming Paradigms and Runtimes

Extreme scale parallel computing systems will have tens of thousands of optionally accelerator-equipped nodes with hundreds of cores each, as well as deep memory hierarchies and complex interconnect topologies. Such exascale systems will provide hardware parallelism at multiple levels and will be energy constrained. Their extreme scale and the rapidly deteriorating reliability of their hardware components mean that exascale systems will exhibit low mean-time-between-failure values. Furthermore, existing programming models already require heroic programming and optimization efforts to achieve high efficiency on current supercomputers. Invariably, these efforts are platform-specific and non-portable. In this article, we explore the shortcomings of existing programming models and runtimes for large-scale computing systems. We propose and discuss important features of programming paradigms and runtimes to deal with exascale computing systems, with a special focus on data-intensive applications and resilience. Finally, we discuss code sustainability issues and propose several software metrics that are of paramount importance for code development for ultrascale computing systems.


Introduction
Ultrascale systems are envisioned as large-scale complex systems joining parallel and distributed computing systems that will be two to three orders of magnitude larger than today's systems, reaching millions to billions of elements. New programming models and runtimes will be necessary to use these new infrastructures efficiently. To achieve results on this front, the European Union funded the COST Action Nesus IC1305 [72]. Its goal is to establish an open European research network targeting sustainable solutions for ultrascale computing, aiming at cross fertilization among HPC, large-scale distributed systems and big data management. The network contributes to gluing together disparate researchers working across different areas and provides them with a meeting ground to exchange ideas, identify synergies and pursue common activities in research topics such as sustainable software solutions (applications and system software stack), data management, energy efficiency and resilience.
One key element on the ultrascale front is the necessity of new sustainable programming and execution models in the context of rapidly changing underlying computing architectures. There is a need to explore synergies among emerging programming models and runtimes from the HPC, distributed systems and big data management communities. To improve the programmability of

Improved programmability for extra large-scale systems
Supercomputers have become an essential tool in numerous research areas. Enabling future advances in science requires the development of efficient parallel applications that are able to meet the computational demands. Modern high-performance computing systems (HPCS) are composed of hundreds of thousands of computational nodes. Due to the rapidly increasing scale of those systems, programmers cannot have a complete view of the system. Programmability strongly determines the overall performance of a high-performance computing system. It is a substrate over which processors, memory and I/O devices exchange data and instructions. It should have high scalability, which will support the development of the next generation of exascale supercomputers. Programmers also need an abstraction that allows them to manage hundreds of millions to billions of concurrent threads. Abstraction allows programs to be organized into comprehensible fragments, which is very important for the clarity, maintenance and scalability of the system. It also increases programmability by allowing new languages to be defined on top of existing languages and completely new parallel programming languages to be defined. This makes abstraction an important part of most parallel paradigms and runtimes. Formerly, computer architectures were designed primarily for improved performance or for energy efficiency. In future exascale architectures, one of the top challenges will be enabling a programmable environment. In reality, programmability is a metric that is difficult to define and measure. The next generation architectures should minimize the chances of parallel computational errors while relieving the programmer from managing low-level tasks.
To explore this situation more precisely, one aim of this research is to investigate the limitations of current programming models and to evaluate the promise of the hybrid programming model for solving these scaling-related difficulties.

Limitations of the current programming models
Reaching exascale in terms of computing nodes requires a transition from the current control of thousands of threads to billions of threads, as well as the adaptation of performance models to cope with an increased level of failures. A single simple model that can be used to program at any scale is a utopian idea, as the last twenty years of 'standardized' parallel computing have shown. Unfortunately, exascale has become a reason for improving programming systems, that is, the existing implementations of programming models, rather than a reason for changing the programming models themselves. This approach was classified by Gropp and Snir in [2] as evolutionary. According to their view, the five most important characteristics of programming models that are affected by the exascale transition are thread scheduling, communication, synchronization, data distribution and control views.

Limitations of message passing programming model
The current vision of exascale systems is to exploit distributed-memory parallelism, and therefore the message passing model is likely to be used at least partially. Moreover, the most popular implementation of the model, MPI, has been shown to run on millions of cores for particular problems. MPI is based upon standard sequential programming languages, augmented with low-level message passing constructs, forcing users to deal with all aspects of parallelization, including the distribution of data and work to cores, communication and synchronization. MPI primarily favors static data distribution and is consequently not well suited to dynamic load balancing.
However, it has been shown that the all-to-all communication algorithms used in the message passing models are not scalable (most commonly used implementations often assume a fully connected network and have dense communication patterns) while all-to-some, one-sided or sparse communication patterns are more reliable.
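The scaling gap between dense and sparse patterns can be made concrete with a back-of-the-envelope count. The sketch below assumes a naive direct exchange in which every pair of processes swaps one message, and a sparse pattern in which each process talks to a fixed number of neighbours. The function names and the neighbour count of 6 are illustrative assumptions, and production MPI libraries use more refined all-to-all algorithms than this simple model.

```python
def all_to_all_messages(p: int) -> int:
    """Point-to-point messages in a naive dense all-to-all among p processes."""
    return p * (p - 1)


def sparse_messages(p: int, neighbors: int) -> int:
    """Messages when every process sends to a fixed number of neighbours."""
    # A process cannot have more neighbours than there are other processes.
    return p * min(neighbors, p - 1)


if __name__ == "__main__":
    # Assumed neighbour count of 6 (for instance, a 3D nearest-neighbour stencil).
    for p in (1_000, 1_000_000):
        print(p, all_to_all_messages(p), sparse_messages(p, 6))
```

Under these assumptions the dense pattern grows quadratically with the number of processes, while the sparse pattern grows only linearly.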
Furthermore, parallel I/O is a limiting factor in MPI systems, which suggests that the current MPI-IO model should be reconsidered. In particular, the limitations are related to collective access to I/O requests and to data partitioning.

Limitations of shared-memory programming models
The exascale system is expected to handle hundreds of cores in a single CPU or GPU. Using shared-memory systems is a feasible alternative to message passing for medium-size parallel systems, as it reduces the programming overhead by moving the parallelization burden from the programmer to the compiler.
The most popular shared-memory system, OpenMP, follows a parallelism control model that does not allow control of data distribution and uses non-scalable synchronization mechanisms such as locks or atomic sections. Moreover, the global view of data easily leads to inefficient programming, as it encourages synchronization joins of all threads and remote data accesses that look like local ones.
The emerging Partitioned Global Address Space (PGAS) model tries to overcome the scalability problems of the global shared-memory model [3]. The PGAS model is likely to be beneficial where non-global communication patterns can be implemented with minimal synchronization and with overlap of computation and communication. Moreover, the scalability of I/O mechanisms in PGAS depends only on the scalability of the underlying I/O infrastructure and is not limited by the model. However, scalability is limited to thousands of cores (with the exception of X10, which implements an asynchronous PGAS model). Load balancing is still an open issue for systems that implement the model. Furthermore, it is not yet possible to sub-structure threads into subgroups.

Limitations of heterogeneous programming
Clusters of heterogeneous nodes composed of multi-core CPUs and GPUs are increasingly being used for High Performance Computing due to their benefits in peak performance and energy efficiency. In order to fully harvest the computational capabilities of such architectures, application developers often employ a combination of different parallel programming paradigms (e.g. OpenCL, CUDA, MPI and OpenMP). However, heterogeneous computing also poses the new challenge of how to handle the diversity of execution environments and programming models. The Open Computing Language [60] introduces an open standard for general-purpose parallel programming of heterogeneous systems. An OpenCL program may target any OpenCL-compliant device, and today many vendors provide an implementation of the OpenCL standard. An OpenCL program comprises a host program and a set of kernels intended to run on a compute device. The standard also includes a language for kernel programming and an API for transferring data between host and device memory and for executing kernels.
Single-node hardware design is shifting towards heterogeneity. At the same time, many of today's largest HPC systems are clusters that combine heterogeneous compute device architectures. Although OpenCL has been designed to work with multiple devices, it only considers local devices available on a single machine. However, the host-device semantics can potentially be applied to remote, distributed devices accessible on different compute nodes. Porting single-node multi-device applications to clusters that combine heterogeneous compute device architectures is not straightforward and, in addition, requires a communication layer for data exchange between nodes. Writing programs for such platforms is error prone and tedious. Therefore, new abstractions, programming models and tools are required to deal with these problems.

Exascale promise of the hybrid programming model
Using a message passing model for inter-node parallelism and a shared-memory programming model for intra-node parallelism is nowadays seen as a promising path to reach exascale. The hybrid model is referred to as MPI+X, where X represents the programming model that supports threads. The most common X is OpenMP, although there are other options for X, such as OpenACC.
However, restrictions on the MPI+X model are still in place, for example on how MPI can be used in a multi-threaded process. In particular, threads cannot be individually identified as the source or target of MPI messages, and an MPI barrier synchronizes the execution of the processes but does not guarantee their synchronization in terms of memory views. The proposal in [4] to use MPI Endpoints in all-to-all communications is a step forward in facilitating high-performance communication between multi-threaded processes.
Furthermore, combining different programming styles, such as message passing with shared-memory programming, lends itself to hiding information between layers that may be important for optimization. The different runtime systems involved with these programming models lack a global view, which can have a severe impact on the overall performance.
However, the biggest problem of MPI+X is the competition for resources such as bandwidth (accessing memory via the inter-node interconnect) [2]. Furthermore, the memory footprint and efficient memory usage are an important obstacle, as the available memory per core or node is not expected to scale linearly with the number of cores and nodes, and the MPI+X functionality must cope with the expected decrease of space per core or node.
Multitasking is a means of increasing the ability to deal with fluctuations in the execution of threads due to fault handling or power management strategies. PGAS+multitasking provides a programming model analogous to MPI+X.
In exascale systems, storage and communication hierarchies will be deeper than the current ones. Therefore, hierarchical programming models are expected to replace the current two-level ones [1].
The one-sided communication model enables a shared-memory-like programming style. In MPI it is based on the concept of a communication window to which the MPI processes in a communicator statically attach contiguous segments of their local memory for exposure to other processes; access to the window is granted by synchronization operations. The model separates communication operations from synchronization for data consistency, allowing the programmer to delay and schedule the actual communications. However, the model is criticized for being difficult to use efficiently.
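The separation between communication and synchronization in the one-sided model can be illustrated with a toy, single-process simulation. The `Window` class below is a hypothetical stand-in and not the MPI API: it defers every remote write until the fence, which is only one behaviour the model permits, since a real implementation may complete a transfer earlier.

```python
class Window:
    """Toy model of a one-sided communication window (not the MPI API)."""

    def __init__(self, n_ranks: int, size: int) -> None:
        # Each rank exposes one contiguous segment of its local memory.
        self.segments = [[0] * size for _ in range(n_ranks)]
        # Communication operations issued since the last synchronization.
        self._pending = []

    def put(self, target: int, offset: int, values: list) -> None:
        """Record a remote write; nothing is visible at the target yet."""
        if offset < 0 or offset + len(values) > len(self.segments[target]):
            raise IndexError("put outside the exposed segment")
        self._pending.append((target, offset, list(values)))

    def fence(self) -> None:
        """Synchronization point: complete all pending operations."""
        for target, offset, values in self._pending:
            self.segments[target][offset:offset + len(values)] = values
        self._pending.clear()


if __name__ == "__main__":
    win = Window(n_ranks=2, size=4)
    win.put(target=1, offset=1, values=[7, 8])
    print(win.segments[1])  # target memory is still unchanged here
    win.fence()
    print(win.segments[1])  # the write is guaranteed visible after the fence
```

The sketch also hints at why the model is hard to use well: the programmer must reason about which data is consistent at which synchronization point.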

Innovative programming for heterogeneous computing systems
In recent years, heterogeneous systems have received a great amount of attention from the research community. Although several projects have recently been proposed to facilitate the programming of clusters with heterogeneous nodes [54-59, 68, 69], none of them combines support for high-performance inter-node data transfer, support for a wide range of different devices and a simplified programming model. Kim et al. [56] proposed the SnuCL framework, which extends the original OpenCL semantics to heterogeneous cluster environments. SnuCL relies on the OpenCL language with a few extensions to directly support the collective patterns of MPI. Indeed, in SnuCL the programmer is responsible for taking care of efficient data transfers between nodes. In that sense, end users of the SnuCL platform need to understand the semantics of MPI collective calls in order to write scalable programs.
Other works have also investigated the problem of extending the OpenCL semantics to access a cluster of nodes. The Many GPUs Package (MGP) [69] is a library and runtime system that uses the MOSIX VCL layer to enable unmodified OpenCL applications to be executed on clusters. Hybrid OpenCL [68] is based on the FOXC OpenCL runtime and extends it with a network layer that allows access to devices in a distributed system. The clOpenCL [59] platform comprises a wrapper library and a set of user-level daemons. Every call to an OpenCL primitive is intercepted by the wrapper, which redirects its execution to a specific daemon at a cluster node or to the local runtime. dOpenCL [55] extends the OpenCL standard such that arbitrary compute devices installed on any node of a distributed system can be used together within a single application. Distributed OpenCL [54] is a framework that allows the distribution of computing processes to many resources connected via a network, using JSON RPC as a communication layer. OpenCL Remote [58] is a framework that extends both OpenCL's platform model and memory model with a network client-server paradigm. Virtual OpenCL [57], based on the OpenCL programming model, exposes physical GPUs as decoupled virtual resources that can be transparently managed independently of the application execution.
An innovative approach to program clusters of nodes composed of multi-core CPUs and GPUs has been introduced through libWater [71], a library-based extension of the OpenCL programming paradigm that simplifies the development of applications for distributed heterogeneous architectures.
libWater aims to improve both productivity and implementation efficiency when parallelizing an application targeting a heterogeneous platform by achieving two design goals: (i) transparent abstraction of the underlying distributed architecture, such that devices belonging to a remote node are accessible like a local device; (ii) access to performance-related details, since it supports the OpenCL kernel logic. The libWater programming model extends the OpenCL standard by replacing the host code with a simplified interface. libWater also comes with a novel device query language (DQL) for OpenCL device management and discovery. A lightweight distributed runtime environment has been developed that dispatches the work between remote devices, based on asynchronous execution of both communications and OpenCL commands. The libWater runtime also collects and arranges dependencies between commands in the form of a powerful representation called the command DAG. The command DAG can be effectively exploited to improve scalability. For this purpose, a collective communication pattern recognition analysis and optimization has been introduced that matches multiple single point-to-point data transfers and dynamically replaces them with a more efficient collective operation (e.g. scatter, gather and broadcast) supported by MPI.
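The idea of replacing several point-to-point transfers with one collective can be sketched as a small pattern matcher. This is an illustration under simplifying assumptions and not libWater's actual analysis: the `Transfer` record and the `recognize` function are hypothetical, command dependencies in the DAG are ignored, and only broadcast and scatter are detected.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    src: int      # sending node
    dst: int      # receiving node
    buffer: str   # name of the source buffer
    offset: int   # start of the slice that is sent
    length: int   # number of elements sent


def _is_partition(group):
    """True if the slices are equally sized and tile a contiguous range."""
    ordered = sorted(group, key=lambda t: t.offset)
    size = ordered[0].length
    return all(t.length == size and t.offset == ordered[0].offset + i * size
               for i, t in enumerate(ordered))


def recognize(transfers):
    """Group point-to-point transfers into collectives where possible.

    Returns a list of (kind, src, transfers) tuples with kind in
    {"broadcast", "scatter", "p2p"}.
    """
    groups = defaultdict(list)
    for t in transfers:
        groups[(t.src, t.buffer)].append(t)

    plan = []
    for (src, _), group in groups.items():
        distinct_dsts = len({t.dst for t in group}) == len(group)
        slices = {(t.offset, t.length) for t in group}
        if len(group) > 1 and distinct_dsts and len(slices) == 1:
            # Same slice sent to several nodes: one broadcast.
            plan.append(("broadcast", src, group))
        elif len(group) > 1 and distinct_dsts and _is_partition(group):
            # Consecutive equal chunks sent to several nodes: one scatter.
            plan.append(("scatter", src, group))
        else:
            plan.extend(("p2p", src, [t]) for t in group)
    return plan
```

A runtime that defers command execution, as described above, is what makes such matching possible, because the transfers must be known before any of them is issued.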
Besides OpenCL-based approaches, CUDA solutions have also been proposed to simplify distributed systems programming. CUDASA [66] is an extension of the CUDA programming language that extends parallelism to multi-GPU systems and GPU-cluster environments. rCUDA [61] is a distributed implementation of the CUDA API that enables shared remote GPGPU in HPC clusters. cudaMPI [62] is a message passing library for distributed-memory GPU clusters that extends the MPI interface to work with data stored on the GPU using the CUDA programming interface. All of these approaches are limited to devices that support CUDA, i.e. NVidia GPU accelerators, and therefore they cannot be used to address heterogeneous systems that combine CPUs and accelerators from different vendors.
Other projects have investigated how to simplify the OpenCL programming interface. Sun et al. [65] proposed a task queuing extension for OpenCL that provides a high-level API based on the concepts of work pools and work units. Intel CLU [75], OCL-MLA [53] and SimpleOpencl [70] are lightweight APIs designed to help programmers rapidly prototype heterogeneous programs.
A more sophisticated approach was proposed in [67]. OmpSs relies on compiler technologies to generate host and kernel code from a sequential program annotated with pragmas. The OmpSs runtime internally uses a DAG for scheduling purposes. However, the DAG is not dynamically optimized as it is in libWater.

Data-intensive programming and runtimes
The data intensity of scientific and engineering applications drives the expansion of exascale systems and puts the focus on improving architectures, programming models and runtime systems for data-intensive computing. A major challenge is to utilize the available technologies and large-scale computing resources effectively to tackle scientific and societal challenges. This section describes the runtime requirements and scalable programming models for data-intensive applications, as well as new data access, communication and processing operations for data-intensive applications.

Next generation MPI
MPI is the most widely used standard [10] in current petascale systems, supporting, among others, the message passing model. It has proven high performance portability and scalability [9,17], as well as stability over the last 20 years. MPI provides a (nearly) fixed number of statically scheduled processes with a local view of the data distributed across the system. Nevertheless, ultrascale systems are not going to be built by incrementally scaling current systems, which will probably have a high impact on all levels of the software stack [2,18]. The international community agrees that changes need to be made to current software at all levels. Future parallel and distributed applications push for the exploration of alternative scalable and reliable programming models [13].
Current HPC systems need to scale up by three orders of magnitude to meet exascale. While a rise of this magnitude in the number of nodes is not expected, the critical growth will come from intra-node capacities. Fat nodes with a large number of lightweight heterogeneous processing elements, including accelerators, will be common in ultrascale platforms, mixing parallel and distributed systems. In addition, the memory per core ratio is expected to decrease, while the number of NUMA nodes will grow to alleviate the problem of memory coherence between hundreds of cores [16]. On the software side, weak scaling of applications running on such platforms will demand more computation resources to manage huge volumes of data. Today's MPI applications, most of which are built using the bulk-synchronous model [11] as a sequence of stages of communication and of computation over the exchanged data, will continue to be important, but multi-physics and adaptive meshing applications, with multiple components implemented using different programming models and with dynamic starting and finalization of such components, will become common in ultrascale. Except for the most regular applications, this synchronization model is already a bottleneck.
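The bulk-synchronous structure, and the reason it becomes a bottleneck, can be seen in a minimal thread-based sketch. The example assumes a ring of workers that each send a value to the right neighbour and then add the value received from the left. The function name and the use of Python threads in place of MPI processes are illustrative assumptions.

```python
import threading


def bsp_ring_sum(values, supersteps):
    """Each worker repeatedly adds the value received from its left neighbour."""
    n = len(values)
    state = list(values)   # local data of each worker
    inbox = [0] * n        # one message slot per worker
    barrier = threading.Barrier(n)

    def worker(rank):
        for _ in range(supersteps):
            # Communication stage: send the local value to the right neighbour.
            inbox[(rank + 1) % n] = state[rank]
            barrier.wait()  # global synchronization: all messages are delivered
            # Computation stage: work on the data that was exchanged.
            state[rank] += inbox[rank]
            barrier.wait()  # no inbox is overwritten while it is still in use

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


if __name__ == "__main__":
    print(bsp_ring_sum([1, 2, 3, 4], supersteps=2))
```

Every worker waits at each barrier for the slowest one, so any imbalance or noise on a single worker delays all of them in every stage.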
In this scenario, programming models need to face multiple challenges to efficiently exploit resources with a high level of programmability [14]: scalability and parallelism increase, energy efficient resource management and data movement, and I/O and resilience in applications among others.
MPI has successfully faced the scalability challenge at petascale with the so-called hybrid model, represented as MPI+X, meaning MPI to communicate between nodes and a shared-memory programming model (e.g. OpenMP for shared memory and OpenACC for accelerators) inside a node. This scheme provides load balancing and reduces the memory footprint and data copies in shared memory, and it is likely to continue in the future. Notwithstanding, the increase in node scale, heterogeneity and complexity of integration of processing elements will demand improved techniques for balancing the computational load between the potentially large number of processes running the kernels that compose the application [24].
Increasing imbalances in large-scale applications, aggravated by hardware power management, localized failures or system noise, require synchronization-avoiding algorithms that are adaptable to dynamic changes in the hardware and the applications. An example is collective algorithms that are based on a non-deterministic point-to-point communication pattern and are able to capture and deal with relevant network properties related to heterogeneity [23]. The MPI specification provides support to mitigate load imbalance issues through the one-sided communication model, non-blocking collectives, and the scalable neighbor collectives for communication along a user-defined virtual topology. In the meantime, scalability issues in the specification and its implementations have been detected [17]. In exascale applications they need to be either avoided, as in the case of the inherently non-scalable all-to-all collectives, or improved, as in the case of initialization or communicator creation.
To support the hybrid programming model, MPI defines levels of thread safety. Lower levels are suitable for bulk-synchronous applications, while higher levels require synchronization mechanisms, which lead to a significant performance degradation in current MPI libraries. The communication endpoints mechanism [19] is a proposal to extend the MPI standard to reduce contention inside a single process by allowing threads to be attached to different endpoints for sending and receiving messages.
Big data volumes and the power consumed in moving data across the system make data management one of the main concerns in future systems. In the distributed view, the common methodology of reading data from a centralized filesystem, spreading it over the processing elements and writing the results is inefficient in energy and performance, and failure prone. Data will be distributed across the system, and the placement of MPI processes in a virtual topology needs to adapt to the data layout to improve performance, which will require better mapping algorithms. MPI addresses these challenges in shared memory through a programming model based on shared data windows accessible by processes in the same node, hence avoiding horizontal data movement inside the node. However, the lack of data locality awareness leads to vertical movement of data across the memory hierarchy, which degrades performance. For instance, the communication buffers received by MPI processes and their access by the local OpenMP threads that compute on them will need smarter scheduling policies. Data-centric approaches are needed for describing data in the system and applying the computation where such data resides [12,15].
Another critical challenge for MPI in supporting exascale systems is resilience, a crosscutting issue affecting the whole software stack. Current checkpoint/restart methods are insufficient for future systems with a mean time between failures of a few hours and in the presence of silent errors, and traditional triple modular redundancy (TMR) is not affordable in an energy-efficient manner. New techniques of resilient computing have been proposed and developed, also in the MPI context [5,6]. One proposal for increasing resilience to node failures is to implement malleable applications, which are able to adapt their execution to the available resources in the presence of hardware errors and avoid restarting the application [7].
Alternatives to MPI come from the Partitioned Global Address Space (PGAS) languages and the High Productivity Computing Systems (HPCS) programming languages. PGAS programming models provide a global view of data with explicit communication, as in CAF [38] (Co-Array Fortran), or implicit communication, as in UPC [40] (Unified Parallel C). However, static scheduling and poor performance currently keep them far from replacing the well-established and successful MPI+X hybrid model. Moreover, OpenMP presents problems with nested parallelism and vertical locality, so the possibility of MPI+PGAS has been, and continues to be, evaluated [8,22] to provide a programming environment better suited to future platforms. HPCS languages, such as Chapel [63] and X10 [64], provide a global view of data and control. For instance, Chapel provides programming constructs at different levels of abstraction. It includes features for computation-centric parallelism based on tasks, as well as data-centric programming capabilities. For example, the locale construct describes the compute nodes in the target architecture and allows reasoning about locality and affinity and managing global views of distributed arrays.

Runtime requirements for data-intensive applications
Developing data-intensive applications over exascale platforms requires the availability of effective runtime systems. This subsection discusses which functional and non-functional requirements should be fulfilled by future runtime systems to support users in designing and executing complex data-intensive applications over large-scale parallel and distributed platforms in an effective and scalable way.
The functional requirements can be grouped into four areas: data management, tool management, design management, and execution management [39].
Data management. Data to be processed can be stored in different formats, such as relational databases, NoSQL databases, binary files, plain files, or semi-structured documents. The runtime system should provide mechanisms to store and access such data independently of their specific format. In addition, metadata formalisms should be provided to describe the relevant information associated with data (e.g., location, format, availability, available views), in order to enable their effective access, manipulation and processing.
Tool management. Data processing tools include programs and libraries for data selection, transformation, visualization, mining and evaluation. The runtime system should provide mechanisms to access and use such tools independently of their specific implementation. In this case too, metadata should be provided to describe the most important features of such tools (e.g., their function, location, usage).
Design management. From a design perspective, three main classes of data-intensive applications can be identified: single-task applications, in which a single sequential or parallel task is performed on a given data set; parameter sweeping applications, in which data are analyzed using multiple instances of a data processing tool with different parameters; and workflow-based applications, in which data-intensive applications are specified as possibly complex workflows.
A runtime system should provide an environment to effectively design all the above-mentioned classes of applications.
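Of the three classes, parameter sweeping is the simplest to sketch. The example below assumes a hypothetical data processing tool, `tool`, with two parameters, and runs one instance per parameter combination over the same data set using a thread pool from the Python standard library. All names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def tool(data, threshold, scale):
    """Hypothetical tool: count the scaled values above a threshold."""
    return sum(1 for x in data if x * scale > threshold)


def parameter_sweep(data, thresholds, scales, max_workers=4):
    """Run one tool instance per parameter combination on the same data set."""
    grid = list(product(thresholds, scales))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Instances are independent, so they can run concurrently.
        results = pool.map(lambda params: tool(data, *params), grid)
        return dict(zip(grid, results))


if __name__ == "__main__":
    print(parameter_sweep([1, 2, 3, 4], thresholds=[2, 5], scales=[1, 2]))
```

Because the instances share no state, a runtime system is free to distribute them over as many nodes as are available.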
Execution management. The system should provide a parallel/distributed execution environment to support the efficient execution of data-intensive applications designed by the users. Since applications range from single tasks to complex workflows, the runtime system should cope with such a variety of applications. In particular, the execution environment should provide the following functionalities, which are related to the different phases of application execution: accessing the data to be processed; allocating the needed compute resources; running the application based on the user specifications, which may be expressed as a workflow; and allowing users to monitor an application's execution.
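Running an application that is specified as a workflow amounts to executing tasks in an order that respects their dependencies while exploiting the parallelism between independent tasks. A minimal sketch, assuming Python 3.9 or later for `graphlib` and using hypothetical task names, is the following. It is not the scheduler of any of the systems cited in this article.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_workflow(tasks, deps, max_workers=4):
    """Execute tasks while respecting their dependencies.

    tasks: name -> function taking a dict of predecessor results
    deps:  name -> set of task names that must finish first
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the workflow has a cycle
    results, running = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose predecessors have all completed.
            for name in sorter.get_ready():
                inputs = {d: results[d] for d in deps.get(name, set())}
                running[pool.submit(tasks[name], inputs)] = name
            # Wait for at least one running task, then unlock its successors.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)
    return results


if __name__ == "__main__":
    tasks = {
        "load": lambda _: [3, 1, 2],
        "sort": lambda r: sorted(r["load"]),
        "total": lambda r: sum(r["load"]),
        "report": lambda r: (r["sort"], r["total"]),
    }
    deps = {"sort": {"load"}, "total": {"load"}, "report": {"sort", "total"}}
    print(run_workflow(tasks, deps)["report"])
```

In this example "sort" and "total" may run concurrently once "load" has finished. A monitoring facility could hook into the point where completed tasks are recorded.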
The non-functional requirements can be defined at three levels: user, architecture, and infrastructure.
From a user point of view, non-functional requirements to be satisfied include:
• Data protection. The system should protect data from both unauthorized access and intentional/incidental losses.
• Usability. The system should be easy to use, without the need for any specialized training.
From an architecture perspective, the following principles should inspire system design:
• Openness and extensibility. The architecture should be open to the integration of new data processing tools and libraries; moreover, existing tools and libraries should be open to extension and modification.
• Independence from infrastructure. The architecture should be designed to be as independent as possible from the underlying infrastructure; in other terms, the system services should be able to exploit the basic functionalities provided by different infrastructures.
Finally, from an infrastructure perspective, non-functional requirements include:
• Heterogeneous/Distributed data support. The infrastructure should be able to cope with very large and high dimensional data sets, stored in different formats in a single site, or geographically distributed across many sites.
• Availability. The infrastructure should be in a functioning condition even in the presence of failures that affect a subset of the hardware/software resources. Thus, effective mechanisms (e.g., redundancy) should be implemented to ensure dependable access to sensitive resources such as user data.
• Scalability. The infrastructure should be able to handle a growing workload (deriving from larger data to process or heavier algorithms to execute) in an efficient and effective way, by dynamically allocating the needed resources (processors, storage, network). Moreover, as soon as the workload decreases, the infrastructure should release the unneeded resources.
• Efficiency. The infrastructure should minimize resource consumption for a given task to execute. In the case of parallel/distributed tasks, efficient allocation of processing nodes should be guaranteed. Additionally, the infrastructure should be highly utilized so as to provide efficient services.
Even though several research systems fulfilling most of these requirements have been developed, such as Pegasus [25], Taverna [26], Kepler [27], ClowdFlows [28], E-Science Central [29], and COMPSs [30], they are designed to work on conventional HPC platforms, such as clusters, Grids, and, in some cases, Clouds. Therefore, it is necessary to study novel architectures, environments and mechanisms to fulfill the requirements discussed above, so as to effectively support the design and execution of data-intensive applications in future exascale systems.

Scalable programming models for data-intensive applications
Data-intensive applications often involve a large number of data processing tools that must be executed in a coordinated way to analyze huge amounts of data. This section discusses the need for scalable programming models to support the effective design and execution of data-intensive applications on a massive number of processors.
Implementing efficient data-intensive applications is not trivial and requires skills in parallel and distributed programming. For instance, it is necessary to express the task dependencies and their parallelism, to use mechanisms of synchronization and load balancing, and to properly manage the memory and the communication among tasks. Moreover, the computing infrastructures are heterogeneous and require different libraries and tools to interact with them. To cope with all these problems, different scalable programming models have been proposed for writing data-intensive applications [31].
Scalable programming models may be categorized based on their level of abstraction (i.e., high-level and low-level scalable models) and based on how they allow programmers to create applications (i.e., visual or code-based formalisms).
Using high-level scalable models, the programmers define only the high-level logic of applications while hiding the low-level details that are not fundamental for application design, including infrastructure-dependent execution details. The programmer is helped in application definition, and the application performance depends on the compiler that analyzes the application code and optimizes its execution on the underlying infrastructure. Low-level scalable models, instead, allow the programmers to interact directly with computing and storage elements of the underlying infrastructure and thus to define the application's parallelism directly. Defining an application in this way requires more skills, and the application performance strongly depends on the quality of the code written by the programmer.
Data-intensive applications can be designed through a visual programming formalism, which is a convenient design approach for high-level users, e.g., domain-expert analysts having a limited understanding of programming. In addition, a graphical representation of workflows intrinsically captures parallelism at the task level, without the need to make parallelism explicit through control structures [32]. A code-based formalism allows users to program complex applications more rapidly, in a more concise way, and with higher flexibility [33]. Code-based applications can be defined in different ways: i) with a language, or an extension of a language, that allows parallel applications to be expressed; ii) with annotations in the application code that permit the compiler to understand which instructions will be executed in parallel; and iii) using a library in the application code that adds parallelism to the application.
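As an illustration of the library-based option iii), the following minimal Python sketch uses the standard `concurrent.futures` module to add parallelism to an otherwise sequential word-count routine. The function names `word_count` and `parallel_word_count` and the input chunks are hypothetical and are not taken from any of the systems cited above; a thread pool is used only to keep the sketch self-contained.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def word_count(chunk):
    """Count the words in one independent chunk of text (hypothetical task)."""
    return Counter(chunk.split())


def parallel_word_count(chunks, workers=4):
    """Apply word_count to every chunk concurrently and merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The library call below is the only place where parallelism is expressed.
        for partial in pool.map(word_count, chunks):
            total.update(partial)
    return total


if __name__ == "__main__":
    print(parallel_word_count(["a b a", "b c"]))
```

The application logic (`word_count`) stays sequential; only the call to `pool.map` introduces concurrency. `ProcessPoolExecutor` from the same module exposes the same `map` interface and could be swapped in where CPU-bound tasks need separate processes.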
Given the variety of data-intensive applications (from single-task to workflow-based) and types of users (from end users to skilled programmers) that can be envisioned in future exascale systems, there will be a need for scalable programming models with different levels of abstraction (high-level and low-level) and different design formalisms (visual and code-based), according to the classification outlined above. Thus, the programming models should adapt to user needs by ensuring a good trade-off between ease in defining applications and efficiency of executing them on exascale architectures composed of a massive number of processors.

New data access, communication, and processing operations for data-intensive applications
This subsection discusses the need for new operations supporting data access, data exchange and data processing to enable scalable data-intensive applications on a large number of processing elements.
Data-intensive applications are software programs that have a significant need to process large volumes of data [21]. Such applications devote most of their processing time to running I/O operations and to exchanging and moving data among the processing elements of a parallel computing infrastructure. Parallel processing of data-intensive applications typically involves accessing, pre-processing, partitioning, aggregating, querying, and visualizing data which can be processed independently. These operations are executed using application programs running in parallel on a scalable computing platform that can be a large Cloud system or a massively parallel machine composed of many thousands of processors. In particular, the main challenges for programming data-intensive applications on exascale computing systems come from the potential scalability and resilience of mechanisms and operations made available to developers for accessing and managing data. Indeed, processing very large data volumes requires operations and new algorithms able to scale in loading, storing, and processing massive amounts of data that generally must be partitioned in very small data grains on which analysis is done by thousands to millions of simple parallel operations.
Evolutionary models have been recently proposed that extend or adapt traditional parallel programming models like MPI, OpenMP, and MapReduce (e.g., Pig Latin) to limit the communication overhead (in the case of message-passing models) or to limit the synchronization control (in the case of shared-memory models) [2]. On the other hand, new models, languages and APIs based on a revolutionary approach, such as X10, ECL, GA, SHMEM, UPC, and Chapel, have been developed. In this case, novel parallel paradigms are devised to address the requirements of massive parallelism.
Languages such as X10, UPC, GA and Chapel are based on a partitioned global address space (PGAS) memory model that can be suited to implement data-intensive exascale applications because it uses private data structures and limits the amount of shared data among parallel threads. Together with different approaches (e.g., Pig Latin and ECL) those models must be further investigated and adapted for providing data-centered scalable programming models useful to support the efficient implementation of exascale data analysis applications composed of up to millions of computing units that process small data elements and exchange them with a very limited set of processing elements. A scalable programming model based on basic operations for data-intensive/data-driven applications must include operations for parallel data access, data-driven local communication, data processing on limited groups of cores, near-data synchronization, in-memory querying, group-level data aggregation, and locality-based data selection and classification.
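The structure of such operations can be sketched in a few lines of Python. The sketch below is sequential and purely illustrative: it partitions a data set into small grains, processes each grain independently, and combines partial results inside small groups before a final reduction. The helper names (`partition`, `local_process`, `group_aggregate`, `mean`) and the grain and group sizes are assumptions made for this example; in a real exascale setting each grain and each group would be handled by different processing elements.

```python
from functools import reduce


def partition(data, grain_size):
    """Split data into small grains of at most grain_size elements."""
    return [data[i:i + grain_size] for i in range(0, len(data), grain_size)]


def local_process(grain):
    """Process one grain independently: return its (count, sum)."""
    return (len(grain), sum(grain))


def combine(a, b):
    """Merge two (count, sum) partial results."""
    return (a[0] + b[0], a[1] + b[1])


def group_aggregate(partials, group_size):
    """Group-level aggregation: combine partial results within small groups."""
    return [reduce(combine, group) for group in partition(partials, group_size)]


def mean(data, grain_size=4, group_size=2):
    """Mean of a non-empty data set computed via grains and group aggregation."""
    partials = [local_process(g) for g in partition(data, grain_size)]
    count, total = reduce(combine, group_aggregate(partials, group_size))
    return total / count


if __name__ == "__main__":
    print(mean(list(range(1, 11))))
```

Because every grain is processed without reference to the others, and partial results are exchanged only within a group, communication stays limited to a small set of neighbours, which is the property the operations listed above aim for.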
Supporting efficient data-intensive applications on exascale systems will require an accurate modeling of basic operations and of the programming languages/APIs that will include them. At the same time, a significant programming effort by developers will be needed to implement complex algorithms and data-driven applications such as those used, for example, in big data analysis and distributed data mining. Programmers must be able to design and implement scalable algorithms by using the operations sketched above. To reach this goal, a coordinated effort between the operation/language designers and the application developers would be very fruitful.

Resilience
As exascale systems grow in computational power and scale, failure rates inevitably increase. Therefore, one of the major challenges in these systems is to effectively and efficiently maintain the system reliability. This requires handling failures efficiently, so that the system can continue to operate with satisfactory performance. The timing constraints of the workload, as well as the heterogeneity of the system resources, constitute another critical issue that must be addressed by the scheduling strategy that is employed in such systems. Therefore, the next generation code will need to be resistant to failures. Advanced modeling and simulation techniques are the basic means of investigating fault tolerance in exascale systems, before performing the costly prototyping actions required for resilient code generation.

Modeling and simulation of failures in large-scale systems
Exascale computing provides a large-scale, heterogeneous distributed computing environment for the processing of demanding jobs. Resilience is one of the most important aspects of exascale systems. Due to the complexity of such systems, their performance is usually examined by simulation rather than by analytical techniques. Analytical modeling of complex systems is difficult and often requires several simplifying assumptions. Such assumptions might have an unpredictable impact on the results. For this reason, there have been many research efforts in developing tractable simulation models of large-scale systems.
In [34], simulation models are used to investigate performance issues in distributed systems where the processors are subject to failures. In this research, the author models failures as a Poisson process with a rate that reflects the failure probability of processors. Processor repair time is modeled as an exponentially distributed random variable with a mean value that reflects the average time required for the distributed processors to recover. The failure/repair model of this paper can be used in combination with other models in the case of large-scale distributed processors.
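A failure/repair model of this kind can be sketched as a small simulation. The Python code below is a minimal illustration and not the simulator of [34]: it assumes a single processor that alternates between up and down periods, with exponentially distributed times to failure (the inter-arrival times of a Poisson process) and exponentially distributed repair times. The parameter values and function names are hypothetical.

```python
import random


def simulate_availability(failure_rate, mean_repair_time, horizon, seed=0):
    """Fraction of the interval [0, horizon] during which one processor is up.

    Assumption: failures only occur while the processor is up, so the
    processor alternates between an up period and a repair period.
    """
    rng = random.Random(seed)
    t, up_time = 0.0, 0.0
    while t < horizon:
        time_to_failure = rng.expovariate(failure_rate)
        up_time += min(time_to_failure, horizon - t)
        t += time_to_failure
        if t >= horizon:
            break
        # expovariate takes a rate, i.e. the inverse of the mean repair time.
        t += rng.expovariate(1.0 / mean_repair_time)
    return up_time / horizon


def analytic_availability(failure_rate, mean_repair_time):
    """Steady-state availability MTTF / (MTTF + MTTR) of the same model."""
    mttf = 1.0 / failure_rate
    return mttf / (mttf + mean_repair_time)


if __name__ == "__main__":
    print(simulate_availability(0.01, 10.0, 1_000_000.0))
    print(analytic_availability(0.01, 10.0))
```

For this simple model the simulated value can be compared with the closed-form availability; the value of simulation appears when the model is combined with scheduling policies or heterogeneous resources, for which no closed form is available.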

Checkpointing in exascale systems
Application resilience is an important issue that must be addressed in order to realize the benefits of future systems. If a failure occurs, recovery can be handled by checkpoint-restart (CPR), that is, by terminating the job and restarting it from its last stored checkpoint. It has been argued that this approach will not scale efficiently to exascale, so different mechanisms are explored in the literature. Gamell et al. in [35] have implemented Fenix, a framework for enabling recovery from failures for MPI-based parallel applications in an online manner (i.e., without disrupting the job). This framework relies on application-driven, diskless, implicitly coordinated checkpointing. Selective checkpoints are created at specific points within the application, guaranteeing global consistency without requiring a coordination protocol.
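The idea of application-driven, diskless checkpointing with online recovery can be illustrated with a single-process Python sketch. It is not the Fenix API: the class `InMemoryCheckpoint`, the function `run`, and the injected failure are hypothetical, and a real framework would keep the checkpoint in the memory of a partner process rather than in the failing process itself.

```python
import copy


class InMemoryCheckpoint:
    """Holds the last saved (step, state) pair in memory, with no disk I/O."""

    def __init__(self):
        self._saved = None

    def save(self, step, state):
        self._saved = (step, copy.deepcopy(state))

    def restore(self):
        step, state = self._saved
        return step, copy.deepcopy(state)


def run(n_steps, checkpoint_every, fail_at=None):
    """Sum 0..n_steps-1 step by step, surviving one injected failure.

    Returns (final total, number of steps that had to be recomputed).
    """
    ckpt = InMemoryCheckpoint()
    state = {"total": 0}
    step = 0
    ckpt.save(step, state)
    failed_once = False
    recomputed = 0
    while step < n_steps:
        if fail_at is not None and step == fail_at and not failed_once:
            failed_once = True
            failed_step = step
            step, state = ckpt.restore()  # roll back without ending the job
            recomputed = failed_step - step
            continue
        state["total"] += step
        step += 1
        if step % checkpoint_every == 0:
            ckpt.save(step, state)  # checkpoint at an application-chosen point
    return state["total"], recomputed


if __name__ == "__main__":
    print(run(10, 4, fail_at=6))
```

The job is never terminated: after the injected failure it resumes from the last in-memory checkpoint and only the steps executed since that checkpoint are repeated.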
Zhao et al. in [36] investigate the suitability of a checkpointing mechanism for exascale computers, across both parallel and distributed filesystems. It is shown that a checkpointing mechanism on parallel filesystems is not suitable for exascale systems. However, the simulation results reveal that a distributed filesystem with local persistent storage could enable efficient checkpointing at exascale.
In [37], the authors define a model for future systems that addresses the problem of latent errors, i.e., errors that go undetected for some time. They use their proposed model to derive optimal checkpoint intervals for systems with latent errors. The importance of a multi-version checkpointing system is explored. They conclude that a multi-version model outperforms a single checkpointing scheme in all cases, while for exascale scenarios, the multi-version model increases efficiency significantly.
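To give a feeling for what an optimal checkpoint interval means, the sketch below uses Young's classical first-order approximation for fail-stop errors, T = sqrt(2 C M), where C is the cost of one checkpoint and M the mean time between failures. This is a simpler, well-known model and not the latent-error model of [37]; the numerical values are hypothetical.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimal checkpoint interval sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def waste_fraction(interval, checkpoint_cost, mtbf):
    """First-order fraction of time lost for a given checkpoint interval.

    C / T is the checkpointing overhead; T / (2 M) is the expected rework,
    since a failure loses on average half an interval of computation.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    c, m = 60.0, 3000.0  # hypothetical: 60 s per checkpoint, 50 min MTBF
    t = young_interval(c, m)
    print(t, waste_fraction(t, c, m))
```

Checkpointing more often than this interval wastes time writing checkpoints, while checkpointing less often wastes time recomputing lost work. As the mean time between failures shrinks at exascale, the minimum achievable waste grows, which motivates the alternative schemes discussed in this section.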
Many applications in large-scale systems have an inherent need for fault tolerance and high-quality results within strict timing constraints. The scheduling algorithm employed in such cases must guarantee that every job will meet its deadline, while providing at the same time high-quality (i.e., precise) results. In [41], the authors study the scheduling of parallel jobs in a distributed real-time system with possible software faults. They model the system with a queuing network model and evaluate the performance of the scheduling algorithms with simulation techniques. For each scheduling policy they provide an alternative version which allows imprecise computations. They propose a performance metric which takes into account not only the number of jobs guaranteed, but also the precision of the results of each guaranteed job. Their simulation results reveal that the alternative versions of the algorithms outperform their respective counterparts. The authors employ the technique of imprecise computations, combined with checkpointing, in order to enhance fault tolerance in real-time systems. They consider monotone jobs that consist of a mandatory part, followed by an optional part. For a job to produce an acceptable result, at least its mandatory part must be completed. The precision of the results is further increased if the optional part is allowed to execute longer. The aim is to guarantee that all jobs will complete at least their mandatory part before their deadline.
The authors employ application-directed checkpointing. When a software failure occurs during the execution of a job that has completed its mandatory part, there is no need to roll back and re-execute the job. In this case, the system accepts as its result the one produced by its mandatory part, assuming that a checkpoint takes place when a job completes its mandatory part. According to the research findings in [41], in large-scale systems where many software failures can occur, scheduling algorithms based on the technique of imprecise computations could be effectively employed for the fault-tolerant execution of parallel real-time jobs.
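The behaviour of a monotone job under imprecise computation can be sketched as follows. This Python example is an illustration only and does not reproduce the scheduling algorithms of [41]: the job approximates the square root of 2 by Newton iterations, the deadline is modelled as a budget of iteration steps, and a fault may be injected during the optional part. All names and values are hypothetical.

```python
def newton_step(x, target=2.0):
    """One refinement step towards sqrt(target); each step improves precision."""
    return 0.5 * (x + target / x)


def run_monotone_job(mandatory_steps, optional_steps, budget, fault_at=None):
    """Run a monotone job within a budget of steps (a stand-in for a deadline).

    Returns (result, number of optional steps that contributed to it).
    Raises RuntimeError if the mandatory part cannot complete in time.
    """
    if budget < mandatory_steps:
        raise RuntimeError("deadline missed: mandatory part cannot complete")
    x = 1.0
    for _ in range(mandatory_steps):
        x = newton_step(x)
    checkpoint = x  # checkpoint taken when the mandatory part completes
    completed = 0
    for i in range(min(optional_steps, budget - mandatory_steps)):
        if fault_at is not None and i == fault_at:
            # Fault in the optional part: accept the checkpointed (imprecise)
            # result instead of rolling back and re-executing the job.
            return checkpoint, 0
        x = newton_step(x)
        completed += 1
    return x, completed


if __name__ == "__main__":
    print(run_monotone_job(2, 3, budget=5))
    print(run_monotone_job(2, 3, budget=5, fault_at=1))
```

With a larger budget the optional part runs longer and the result is more precise; with a fault in the optional part the job still returns an acceptable result, namely the one checkpointed at the end of the mandatory part.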

Alternative programming models for fault tolerance in exascale systems
However, programming models that enable more appropriate recovery strategies than CPR are required in exascale systems. Towards this direction, Heroux in [42] presents four programming models for developing new algorithms, including:
• Skeptical Programming (SkP): SkP requires that algorithm developers should expect that silent data corruption is possible, so that they can develop validation tests.
• Local Failure, Local Recovery (LFLR): LFLR provides programmers with the ability to recover locally and continue application execution when a process is lost. This model requires more support from the underlying system layers.
• Selective Reliability Programming (SRP): SRP provides the programmer with the ability to selectively declare the reliability of specific data and compute regions.
The User Level Failure Mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in MPI. In [43], the authors present their experiences on using ULFM in a case study to explore the advantages and difficulties of using this interface to program fault-tolerant MPI applications. They found that ULFM is suitable for specific types of applications, but it provides few benefits for general MPI applications.
The issue of fault-tolerant MPI is also considered in [44]. Because the system currently kills all the remaining processes and restarts the application from the last saved checkpoint when an MPI process is lost, this approach is not expected to work for future extreme-scale systems. The authors address this scaling issue through the LFLR programming model. To realize this model, they design and implement a software framework using a prototype MPI with ULFM (MPI-ULFM).
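As an illustration of the validation tests called for by Skeptical Programming, the following Python sketch checks a matrix-vector product against a cheap checksum identity: the sum of the entries of y = Ax must equal the dot product of the column sums of A with x. This is one possible validation test chosen for the example, not a technique prescribed in [42]; the function names and the injected corruption are hypothetical.

```python
def matvec(a, x):
    """Plain matrix-vector product y = A x on lists of numbers."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def checksum_ok(a, x, y, tol=1e-9):
    """Validate y using the identity sum(y) == (column sums of A) . x."""
    col_sums = [sum(col) for col in zip(*a)]
    expected = sum(c * xj for c, xj in zip(col_sums, x))
    return abs(sum(y) - expected) <= tol * max(1.0, abs(expected))


def skeptical_matvec(a, x, corrupt=None, max_retries=1):
    """Compute A x, validate the result, and recompute if validation fails.

    `corrupt` is a hypothetical hook that injects a silent error into the
    first attempt only, for demonstration purposes.
    """
    for attempt in range(max_retries + 1):
        y = matvec(a, x)
        if corrupt is not None and attempt == 0:
            y = corrupt(y)
        if checksum_ok(a, x, y):
            return y
    raise RuntimeError("validation failed after retries")


if __name__ == "__main__":
    a, x = [[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]
    print(skeptical_matvec(a, x, corrupt=lambda y: [y[0], y[1] + 1.0]))
```

The check costs far less than the product itself, but it is not exhaustive: errors that cancel out in the sum would go unnoticed, so such tests reduce rather than eliminate the risk of silent data corruption.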

Code sustainability and other metrics
Software designers for supercomputers face new challenges. Their code must be efficient whatever the underlying platform is, while not wasting computing time on crossing abstraction layers. Several tools presented in the previous sections help designers and programmers abstract from the underlying hardware while achieving maximum performance.
The clear goal is to increase the raw performance of supercomputers, but it is no longer simply "the faster, the better". Two reasons show that taking care of raw performance alone is no longer sufficient:
• The lifetime of code is much longer than the lifetime of hardware;
• Other metrics (energy, plasticity, scalability) become more and more important.

Life cycle of codes
The HPC world comprises a few widely used codes that serve as base libraries and a majority of ad hoc codes, often designed and programmed mainly by non-computer scientists.
As an example for the first category, in the scientific computing domain, which aims at constructing mathematical models and numerical solution techniques for solving problems arising in science and engineering, ScaLAPACK [51] is a widely used library of high-performance linear algebra routines for parallel distributed memory machines. It is a base of large-scale scientific codes and can run on nearly every classical supercomputer. It encompasses the BLAS (Basic Linear Algebra Subprograms) and PBLAS (Parallel BLAS) libraries. This package is comprised of old C and Fortran codes and was first released in 1979 by NASA [52] for BLAS and in 1996 for ScaLAPACK [51]. In the last version (version 2.0.2, May 2012) large parts of the code still date from the first version of 1996, being raw computing code in Fortran or higher level code in C. This version also contains code from nearly every year from 1996 to 2012. This code was able to evolve up to the present day due to the community of users behind it. Most other less used libraries or software did not have this chance. But even this library has several sustainability problems, such as new hardware architectures: several supercomputer projects are planning to use GPU [50] or ARM [49] processors instead of classical standard x86 ones.
Concerning the second category the difficulties are even greater, as a large number of codes have been tested and evaluated only on a handful of supercomputers. Their scalability is unknown on different networks or memory topologies, for example. In this case, these codes are not sustainable, as they require a major rewrite to run on new architectures.
Hence, new programming paradigms such as skeletons [47] or YML [48] are needed to obtain sustainable codes that run efficiently on the latest generation of supercomputers. Concerning exascale computing the situation is even more dire, as the exact details of these architectures are still unclear.

New metrics
Beyond the sustainability of the code itself, other metrics are important for designers and programmers. Power consumption of supercomputers is reaching thresholds that prevent them from continuing to grow as before [46]. The three main metrics that programmers have to confront are:
• Raw power consumption: Depending on the particular instructions, library, memory and network access patterns, an application will draw different power at a given time and consume a different overall energy for the same work;
• Scalability: The capability of scaling up is key as future exascale systems will be composed of hundreds of thousands of cores;
• Plasticity: It is the capability of the software to adapt to the underlying hardware architecture (ARM/x86/GPU, network topology, memory hierarchy, ...) but also to reconfigure itself by changing the number of allocated resources or migrating between architectures at runtime.
At the moment most tools that provide programmers with insight on their code are aimed at raw computing performance or memory and network usage. Only a few tools exist to provide feedback to programmers on such needed metrics. Valgreen [45], for example, gives insight on the power consumption of codes. But at the moment manual evaluation is needed in order to evaluate these metrics for any code.
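The distinction between instantaneous power and overall energy for the same work can be made concrete with a short Python sketch. It integrates hypothetical power samples over time with the trapezoidal rule; the sample values and function names are assumptions for this example and are not measurements from any tool cited here.

```python
def energy_joules(samples):
    """Energy in joules from (time in s, power in W) samples sorted by time.

    Uses the trapezoidal rule between consecutive samples.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def average_power(samples):
    """Average power in watts over the sampled interval."""
    duration = samples[-1][0] - samples[0][0]
    return energy_joules(samples) / duration


if __name__ == "__main__":
    # Two hypothetical runs of the same work.
    slow_run = [(0.0, 100.0), (10.0, 100.0)]  # 100 W for 10 s
    fast_run = [(0.0, 200.0), (4.0, 200.0)]   # 200 W for 4 s
    print(energy_joules(slow_run), energy_joules(fast_run))
```

In this hypothetical case the faster run draws twice the power but consumes less energy overall, which is why both quantities need to be reported to programmers.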

Conclusion
In this article we explored and discussed programming models and runtimes required for scalable high performance computing systems that comprise a very large number of processors and threads. Currently no programming solutions exist that satisfy all the main requirements of such systems. Therefore, new programming models are required that support data locality, minimize data exchange and synchronization, while providing resilience and fault-tolerant mechanisms in order to tackle the increasing probability of failures in such large and complex systems.
New programming models and languages will be a key component of exascale systems. Their design and implementation is one of the pillars of the future exascale strategy that is based on the development of massively parallel hardware, small-grain parallel algorithms and scalable programming tools. All those components must be developed to make that strategy effective and useful. Furthermore, in order to reach actual sustainability, code must reinvent itself and be more independent of the underlying hardware.
One main element will be to create new communication channels between the runtime software and the development environment. Indeed, the latter has all the relevant high-level information concerning application structure and adaptation capabilities, but this information is usually lost when the application is actually run.
As exascale systems grow in computational power and scale, their resilience becomes increasingly important. Due to the complexity of such systems, fault tolerance must be achieved by employing more effective approaches than the traditional checkpointing scheme. Even though many alternative approaches have been proposed in the literature, further research is required towards this direction.
Finally, one way to provide a higher level of abstraction from design time to execution time, which will be investigated, is to extend the MPI standards to support this abstraction and to provide better scalability support.
The work presented in this article has been partially supported by EU under the COST program Action IC1305, "Network for Sustainable Ultrascale Computing (NESUS)". This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is properly cited.