Design and Implementation of the PULSAR Programming System for Large Scale Computing

The objective of the PULSAR project was to design a programming model suitable for largescale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, messagepassing and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.


Introduction Motivation
High-end supercomputers are on the steady path of growth in size and complexity.One can get a fairly reasonable picture of the road that lies ahead by examining the platforms that will be brought online under the DOEs CORAL initiative.In 2018, the DOE aims to deploy three different CORAL platforms, each over 150 petaflop peak performance level.Two systems, named Summit and Sierra, based on the IBM OpenPOWER platform with NVIDIA GPU-accelerators, were selected for Oak Ridge National Laboratory and Lawrence Livermore National Laboratory; an Intel system, based on the Xeon Phi platform and named Aurora, was selected for Argonne National Laboratory.
Summit and Sierra will follow the hybrid computing model, by coupling powerful latencyoptimized processors with highly parallel throughput-optimized accelerators.They will rely on IBM Power9 CPUs, NVIDIA Volta GPUs, and NVIDIA NVLink interconnect to connect the hybrid devices within each node, and a Mellanox Dual-Rail EDR Infiniband interconnect to connect the nodes.The Aurora system, on the contrary, will offer a more homogeneous model by utilizing the Knights Hill Xeon Phi architecture, which, unlike the current Knights Corner model, will be a stand-alone processor and not a slot-in coprocessor, and will also include integrated Omni-Path communication fabric.All platforms will benefit from recent advances in 3D-stacked memory technology.
Overall, both types of systems promise major performance improvements: CPU memory bandwidth is expected to be between 200 GB/s and 300 GB/s using HMC; GPU memory bandwidth is expected to approach 1 TB/s using HBM; GPU memory capacity is expected to reach 60 GB (NVIDIA Volta); NVLink is expected to deliver no less than 80 GB/s, and possibly as high at 200 GB/s, of CPU to GPU bandwidth.In terms of computing power, the Knights Hill is expected to be between 3.6 teraFLOPS and 9 teraFLOPS, while the NVIDIA Volta is expected to be around 10 teraFLOPS.And yet, taking a wider perspective, the challenges are severe for software developers who have to extract performance from these systems.The hybrid computing model seems to be here to stay, and memory systems will become even more complicated.It is clear that support for parallelism is going to have to dramatically increase, going up by at least an order of magnitude for the CORAL systems and achieving billion-way parallelism at exascale.The PULSAR project attempts to tackle these challenges with a simple programming model, based on the systolic processing model, augmented with virtualization.The programming model is simple and so is the runtime implementation.Processing is completely distributed and, therefore, very scalable.

Background
Systolic arrays are descendants of array-like architectures such as iterative arrays, cellular automata and processor arrays.A systolic array is a network of processors that rhythmically compute and pass data through the system.The seminal paper by Kung and Leiserson [42] defines systolic arrays as devices with "simple and regular geometries and data paths" with "pipelining as general methods of using these structures".
The systolic array paradigm is a departure from the von Neuman paradigm.While the von Neuman architecture is instruction-stream-driven by an instruction counter, the systolic array architecture is data-stream-driven by data counters.A systolic array is composed of matrix-like rows of units called cells or Data Processing Units (DPUs).DPUs operation is transport-triggered, i.e., triggered by the arrival of a data object.The DPUs are connected in a mesh-like topology (often two-dimensional).Each DPU is connected to a small number of nearest neighbor DPUs and performs a sequence of operations on data that flows through it.Often different data streams flow across the mesh in different directions.Figure 1 shows three canonical shapes of systolic arrays: square can be used for a dense matrix multiplication, diamond for a band matrix factorization, triangle for a dense matrix factorization.
Early on, Kung identified the main strength of systolic arrays as ability to addressing the problem of the I/O bottleneck: "Thus, a problem what was originally compute-bound can become I/O-bound during its execution.This unfortunate situation is the result of a mismatch between the computation and the architecture.Systolic architectures, which ensure multiple computations per memory access, can speed up compute-bound computations without increasing I/O requirements" [41].Other valuable properties of systolic arrays include highly scalable parallelism, modularity, regularity, local interconnection, high degree of pipelining, and highly synchronized multiprocessing.
Closely related to systolic arrays is the concept of wavefront arrays, where global synchronization is replaced by dataflow principles.Wavefront arrays are derived by tracing computa-tional wavefronts in the algorithm, and pipelining these wavefronts on the processor array.The computing network serves as a data-wave-propagating medium.
The computational wavefronts are data-driven.In a sense, they are similar to electromagnetic wavefronts, since each processor acts as a secondary source and is responsible for the activation of the next front.Despite the lack of global timing, the sequencing of tasks is correctly followed.Whenever data are available, the sender informs the receiver and the receiver accepts the data whenever required.This scheme can be implemented through a simple handshaking protocol, which ensures that the computational wavefronts propagate in an orderly manner.Wavefront arrays share the features of regularity, modularity, local interconnection, and pipelining: "A wavefront array equals a systolic array plus dataflow computing" [43].
Most importantly, computations expressed as a Direct Acyclic Graph (DAG) can be mapped to an array processor, by assigning multiple nodes of the DAG to each processing element, as long as the DAG is uniform (shift-invariant).Examples of algorithms, which belong to this class include: convolution, autoregressive filtering, discrete Fourier transforms and an array of dense linear algebra algorithms -matrix multiplication, LU factorization, QR factorization, triangular matrix inversion, and more.
The origins of systolic arrays can be traced back to parallel array computers such as the Solomon computer [29] and its successor ILLIAC IV [7,40].At the peak of interest in the mid 80s systolic arrays targeted special-purpose algorithm-oriented VLSI implementations, often as attached processors.The ideas also led to the design of the Warp machines [3], which were a series of increasingly general-purpose systolic array processors created by Carnegie Mellon University in conjunction with industrial partners: G. E., Honeywell and Intel, and funded by the U. S. Defense Advances Research Projects Agency (DARPA).Interest in systolic arrays died off by early 90s, mostly due to the high cost of implementing them as special-purpose hardware during a time in which Moore's law was relentlessly increasing the computing power and decreasing the cost of general-purpose processors.
In their seminal paper [42], Kung and Leiserson applied systolic arrays to problems in dense linear algebra: matrix multiplication, Gaussian elimination, and triangular solve.They also pointed out applications in signal processing: convolution, Finite Impulse Response (FIR) filter, and discrete Fourier transform.A large body of work on systolic arrays was devoted to applications in dense linear algebra [1,6,8,17,28,46,56].
Despite the loss of interest in systolic arrays per se, systolic principles lead to a series of efficient algorithms for general-purpose computer systems, the prime example being a series of algorithms for matrix multiplication, including: Cannon's [10], Fox's [26], BiMMeR [30], PUMMA [16], SUMMA [58], and DIMMA [15].
The paper by Fisher and Kung [24] offers an extensive overview of systolic array literature.General discussion and motivation for systolic arrays is given by Kung [41], and Fortes and Wah [25].Systematic treatment of the topic is provided by Robert [52,53] and Evans [22].The paper by Kung [43] coins the term wavefront arrays and discusses the mapping of task graphs to systolic architectures.The paper by Johnsson et al. [32] talks about general purpose systolic arrays that can be applied to a wider range of problems (reconfigurable computing).

Related Work
The emerging Exascale programming models, including languages, draw from the PGAS (Partitioned Global Address Space) and APGAS (Asynchronous PGAS) efforts.These efforts include the DARPA-sponsored HPCS (High Productivity Computing Systems) program [21] which stressed productivity rather than performance, with the latter being the prerequisite for the former.The older PGAS languages include Titanium [4, 19, 37-39, 44, 45, 50], UPC (Unified Parallel C) [11,18], and CAF (Co-Array Fortran) [23,51].They use the concept of globally visible address space with an explicit handling of addresses that are non-local.The ongoing efforts in the Fortran community ensure continuous support for CAF functionality as exemplified by CAF 2.0 [31,48,49,54] and incorporation of some of the CAF features into the Fortran 2008 standard [55].The DARPA's HPCS program introduced three more languages into the space: Fortress [2,33], X10 [59], and Chapel (Cascade High Productivity Language) [12,13].The last two are still maintained by IBM and Cray, respectively.A recent resurgence of activity around shared memory programming resulted in the OpenSHMEM [14] project that borrows some ideas from the PGAS model, but is much more library-centric, as opposed to requiring a completely new programming language.
The other approach to achieving good performance on the current Petascale and future Exascale hardware designs is to use software runtime systems.Two notable projects in this category are Charm++ [36] and PaRSEC [9], which deal with algorithms and their implementation represented as a Direct Acyclic Graph (DAG) of tasks connected with edges that communicate data between them -a concept clearly related to the dataflow paradigm.Many other systems offer similar paradigm but might not afford the same type of support for distributed memory parallelism [5,47].
A new execution model has been argued by the authors of ParalleX [27,34,35] and implemented by the HPX project [57] that now serves as a clear need for extensions of the C++ standard.Codelets Execution Model [20] can also be considered in the category of the new models of computation.

Outline
The rest of the paper is organized as follows.We describe the PULSAR programming model in Section 1, and further explain the construction and operation of a PULSAR instance in Section 2. We outline the runtime implementation in Section 3. Then the following sections are devoted to the detailed description of an example, namely Cannon's algorithm.We briefly review the algorithm in Section 5 and report performance results in Section 6.Finally, we state some final remarks in Section 6.

Programming Model
The PULSAR programming model relies on five abstractions to define the computation: VSA, VDP, channel, packet, tuple; and on two abstractions to map the computation to the actual hardware: thread, device.Virtual Systolic Array (VSA) A set of VDPs connected with channels.
Virtual Data Processor (VDP) The basic processing element in the VSA.
Channel A point-to-point connection between a pair of VDPs.
Packet The basic unit of information transferred in a channel.
Tuple A unique VDP identifier.Device Synonymous with an accelerator device (GPU, Xeon Phi, etc.) The sections to follow describe the roles of the different entities, how the VDP operation is defined, how the VSA is constructed, and how the VSA is mapped to the hardware.These operations are accessible to the user through PULSAR's Application Programming Interface (API).Because this API is quite small (12 core functions and 6 auxiliary functions), actual function names are used when describing the different actions.Currently, PULSAR is implemented in C and exports C bindings.

Tuple
Tuples are strings of integers.Each VDP is uniquely identified by a tuple.Tuples can be of any length, and different length tuples can be used in the same VSA.Two tuples are identical if they are of the same lengths and have identical values of all components.Tuples are created using the variadic function prt tuple new, which takes a (variable length) list of integers as its input.The user only creates tuples; after creation, the tuples are passed to VDP and channel constructors.They are destroyed by the runtime at the appropriate time of destroying those objects.As a general rule in PULSAR, the user only creates objects and loses their ownership after passing them to the runtime.

Packet
Packets are basic units of information exchanged through channels connecting VDPs.A packet contains a reference to a continuous piece of memory of a given size.Conceptually, packets are created by VDPs.The user can use the VDP function prt vdp packet new to create a new packet.A packet can be created from preallocated memory by providing the pointer.Alternatively, new memory can be allocated by providing a NULL pointer.The VDP can fetch a packet from an input channel using the prt vdp channel pop function, and push a packet to an output channel using the prt vdp channel push function.The prt vdp packet release function can be used to discard a packet.This does not translate to immediate deallocation, since a packet can have multiple active references.The runtime discards a packet when the number of active references goes down to zero.The VDP does not lose the ownership of the packet after pushing it to a channel.The packet can be used until the prt vdp packet release function is called.

Channel
Channels are unidirectional point-to-point connections between VDPs, used to exchange packets.Each VDP has a set of input channels and a set of output channels.Packets can be fetched from input channels and pushed to output channels.Channels in each set are assigned consecutive numbers starting from zero (or slots).Channels are created using the prt channel new function and providing tuples of source and destination VDPs, and slot numbers in the source and destination VDPs.The user does not destroy channels.The runtime destroys channels at the time of destroying the VDP.After creation, the channel needs to be inserted in the appropriate VDP, using the prt vdp channel insert function.The user needs to insert a full set of channels to each VDP.At the time of inserting the VDP in the VSA, the system joins channels that identify the same communication path.

VDP
The VDP is the basic processing element of the VSA.Each VDP is uniquely identified by a tuple.The VDP is assigned a function which defines its operation.Within that function, the VDP has access to a set of global parameters, its private, persistent local storage, and its channels.The runtime invokes that function when there are packets in all of the VDP's input channels.This is called firing.When the VDP fires, it can fetch packets from its input channels, call computational kernels, and push packets to its output channels.It is not required that these operations are invoked in any particular order.The VDP fires a prescribed number of times.When the VDP's counter goes down to zero, the VDP is destroyed.The VDP has access to its tuple and its counter.
At the time of the VDP creation, the user specifies if the VDP resides on a CPU or on an accelerator.This is an important distinction, because the code of a CPU VDP has synchronous semantics, while the code of an accelerator VDP has asynchronous semantics.In other words, for a CPU VDP, actions are executed as they are invoked, while for an accelerator VDP, actions are queued for execution after preceding actions complete.In the CUDA implementation, each VDP has its own stream.All kernel invocations have to be asynchronous calls, placed in the VDP's stream.Similarly, in the case of an accelerator VDP, all channel operations have asynchronous semantics.

VSA
The VSA contains all VDPs and their channel connections, and stores the information about the mapping of VDPs to the hardware.The VSA needs to be created first and then launched.An empty VSA is created using the prt vsa new function.Then VDPs can be inserted in the VSA using the prt vsa vdp insert function.Then the VSA can be executed using the prt vsa run function, and later destroyed using the prt vsa delete function.
At the time of creation, using the prt vsa new function, the user provides the number of CPU threads to launch per each distributed memory node, and the number of accelerator devices to use per each node.The user also provides a function for mapping VDPs to threads,  and another for mapping VDPs to devices.These functions have to return the global thread or device number, based on the VDP's tuple and the total thread count or device count.
VSA construction can be replicated or distributed.The replicated construction is more straightforward from the user's perspective.In the replicated construction, each MPI process inserts all the VDPs, and the system filters out the ones that do not belong in a given node, based on the VDP-to-thread and the VDP-to-device function.However, the VSA construction process is inherently distributed, so each process can also insert only the VDPs that belong in that process.

Construction and Operation
The VSA is first constructed, and then launched.The VSA is constructed by creating all VDPs and inserting them in the VSA.Each VDP, in turn, is constructed by creating all its channels and inserting then in the VDP.The operation of the VSA is defined through the operation of its VDPs.VDPs operate by launching computational kernels, and communicating by writing and reading packets to and from their channels.

VSA Construction and Launching
Figure 3 shows a simple code snippet for VSA construction.A VSA is created using the prt vsa new function, which returns a new VSA with an empty set of VDPs.After creation, the VSA has to be populated with VDPs.Then the VSA can be launched using the prt vsa run function.After execution, the VSA can be destroyed using the prt vsa delete function.The function destroys all resources associated with the VSA.

VDP Creation and Insertion
Figure 3 shows simple code snippets of VDP operation.A VDP is created using the prt vdp new function.The function returns a pointer to a new VDP with empty sets of input and output channels.After creation, the VDP has to be populated with channels.Then the VDP can be inserted into the VSA using the prt vsa vdp insert function.The user does not free the VDP.At the time of calling prt vsa vdp insert, the runtime takes ownership of the VDP.The VDP will be destroyed in the process of the VSA execution or at the time of calling prt vsa delete.
Design and Implementation of the PULSAR Programming System for Large Scale...The user has to define the VDP function.The runtime invokes that function when packets are available in all of the VDP channels, which is called firing.Inside that function, the user has access to the VDP object.In particular, the user has access to its tuple, counter, local store and global store.global store is the read-only global storage area, passed to the VSA at the time of its creation.local store is the VDP private local storage area, which is persistent between firings.tuple is the VDP unique tuple, assigned at the time of creation.counter is the VDP counter.At the first firing, the counter is equal to the value assigned at the time of the VDP creation.At each firing the counter is decremented by one.At the last firing the counter is equal one.

Channel Creation and Insertion
A channel is created using the prt channel new function.After creation, the channel can be inserted into the VDP using the prt vdp channel insert function.The user does not free the channel.At the time of calling prt vdp channel insert, the runtime takes ownership of the channel.The channel will be destroyed in the process of the VSA execution or at the time of calling prt vsa delete.

Mapping of VDPs to Threads and Devices
The user defines the placement of VDPs on CPUs and GPUs by providing the mapping function at the time of the VSA creation with prt vsa new.The runtime calls that function for each VDP and passes as parameters: the VDP tuple, the pointer to the global store, the total number of CPU threads at the VSA disposal in that launch, and the total number of devices in that launch.
The function has to return the mapping information in an object of type prt mapping t, with the fields location and rank, where the location can be either PRT LOCATION HOST or PRT LOCATION DEVICE, and the rank indicates the global rank of the unit.

VDP Operation
This section describes actions which can take place inside the VDP function, i.e., the function passed to prt vdp new.The user never calls that function; it is called by the runtime when packets are available in all active input channels of the VDP.Inside that function, computational kernel can be launched and packets can be created, deleted, pushed down output channels and fetched from input channels.
A new data packet can be created by calling the prt vdp packet new function and released by calling the prt vdp packet release function.The runtime will keep the packet and its data around until it completes all pending operations associated with the packet.However, the packet and its data should not be accessed after the release operation.
A packet can be received by calling the prt vdp channel pop function, and sent by calling the prt vdp channel push function.The packet is still available to the VDP after calling the send function and can be used and repeatedly sent until it is released.

Channel Deactivation and Reactivation
A channel can be deactivated by the VDP by calling the prt vdp channel off function.This indicates to the runtime that it can schedule the VDP (call the VDP function) without checking if there are packets in that channel.The VDP should not attempt to read packets from an inactive channel.
A channel can be reactivated by the VDP by calling the prt vdp channel on function.This indicates to the runtime that it cannot schedule the VDP (call the VDP function) if there are no incoming packets in that channel.By default, channels are active.

Handling of Tuples
A new tuple can be created by calling the variadic function prt tuple new.A tuple has to have at least one element.There is no upper limit on the number of elements.Tuples are dynamically allocated strings or integers, with the INT MAX constant at the end, serving as the termination symbol.As such, tuples can be freed by calling the C standard library free function.However, tuples should not be freed after passing them to the PULSAR runtime.The runtime will free all such tuples during its operation or at the time of calling prt vsa delete.

Runtime Implementation
Figure 4 shows the main objects in the runtime implementation and their relations.The VSA is the top-level object containing multiple threads and devices, threads being synonymous with CPU cores, and devices being synonymous with accelerator devices.It also contains a single instance of the communication proxy, which is a server-like object responsible for managing inter-node (MPI) and intra-node (PCI) communication.Each thread and device maintains a list of multiple VDP.Each VDP maintains two separate lists of channels, one for input channels, one for output channels.Each channel maintains a list of packets.
In a CPU-only scenario, there may be no devices; in a GPU-only scenario, there may be no threads.Depending on the distribution of VDPs to threads and devices, any particular thread or device may end up with an empty list of VDPs.There may be VDPs with no input channels -pure data producers, as well as VDPs with no output channels -pure data consumers.In practical scenarios, most VDPs will have a number of input and output channels.A list of packets in a channel will grow and shrink at runtime.

Tuple
Tuples are implemented as strings of integers, terminated with the INT MAX marker.Tuples support copy, concatenation and two types of comparisons.One checks for an exact match in size and content, the other implements lexicographical ordering.

Packet
A packet is the basic unit of data in PULSAR.It contains a pointer to a memory region and its size and the number of active references to the packet.A packet can be actively being used by a VDP while also residing in multiple channels.A VDP can keep using a packet after pushing While the packet is a very simple concept in the context of a CPU implementation, support for accelerators introduces an additional level of complexity, due to the fact that accelerators have separate memories.A shared-memory system with multiple accelerators basically looks to the programmer like a distributed-memory, global address space system.To handle this situation, PULSAR runtime keeps track of the location of the packet, which can be either in the host (CPU) memory, or in the memory of one of the accelerators.This is the reason a packet is not a stand-alone object, but is subordinate to a VDP.Packets are created by VDPs and inherit their initial locations from the VDPs that create them.
Another level of complexity is introduced by the fact that PULSAR cannot rely on CUDA functions for device memory allocations, without sacrificing asynchronous semantics, since CUDA allocations cannot be executed in a stream.Because of that, PULSAR implements its own device memory allocator, which grabs a large chunk of the device memory at the time of initialization, and assigns memory segments asynchronously at runtime.Currently, the implementation is very simple, with fixed size (configurable) segments and fixed size (also configurable) initial reservation.PULSAR maintains one such allocator per device, and each packet maintains a reference to its allocator.

Channel
Channels are packet carriers between VDPs.A channel knows the tuples and slots of its source and destination VDPs, and maintains a list of packets.A channel also knows the numbers (MPI ranks) of the nodes, where the source and destination VDPs reside, as well as its unique tag for MPI communication between the pair of nodes it connects.A channel connecting to a device VDP is also assigned a unique stream to allow for asynchronous communication (one that does not block kernel launches).
A channel provides two services to the VDP, fetching a packet (the pop operation), and sending a packet (the push operation).A VDP is only fired when there are packets in all its active input channels.Therefore, the pop operation is trivial, and simply fetches a packet from the queue.All the complexity of communication is in the push operation, which takes appropriate actions, depending on the boundaries crossed by the channel.If the source and the destination VDPs are both CPU VDPs, residing in the same node, the operation is as simple as moving the packet from one queue to another.Things are more complicated if the channel crosses node boundaries, in which case MPI is invoked.Yet a different scenario is implemented if a boundary between a host memory and a device memory has to be crossed.And finally, the most complex case is invoked when a packet residing in device memory is sent across the network.The last case results in a sequence of CUDA callbacks and non-blocking MPI calls, initiated by the channel and carried out by the communication proxy, with support from the CUDA runtime and MPI.
As an extension to the basic programming model, a channel contains an active/inactive flag, which allows the VDP for suspending channel activities.By deactivating a channel, the VDP pledges to not pop packets from that channel, which allows the VDP to fire with that channel being empty.Newly created channels are active by default.A VDP can deactivate and re-activate channels at will, as long as it does not attempt to fetch packets from an inactive channel.

VDP
A VDP is the basic execution unit of the VSA.A VDP knows its tuple and counter, and the lists of input and output channels.It is assigned a function, local store and global store.Support for accelerators requires a VDP to also know its location, i.e., if it is assigned to a CPU or an accelerator, and which accelerator, if there are multiple of them in a node.An accelerator VDP is also assigned its unique stream, so that multiple VDPs can execute at the same time, in different streams, and also kernel launches can overlap data transfers.
In order to hide the complexity of managing the memory system within a single node (host plus multiple devices), the VDP provides more abstract functions for accessing lower-level functions of packets and channels.Specifially, rather than calling packet and channel methods directly, the user calls VDP methods to perform operations on packets and channels.E.g., to create a packet, the user calls the VDP function prt vdp packet new, so that the packet can inherit its initial location from the creator VDP.For consistency, a packet is released by the VDP upon calling the function prt vdp packet release.Similarly, channels are accessed by VDP functions prt vdp channel push and prt vdp channel pop.First, this is consistent with the handling of packets.Second, the user designates channels by using their slot numbers, rather than references, which slightly increases the level of abstraction.

Threads and Devices
Threads and devices are internal PULSAR objects not directly accessed by the user.The user only deals with them indirectly by providing formulas for mapping of VDPs to threads and devices.Threads and devices know their local (within a node) and global ranks, and maintain lists of VDPs.
The main reason for the distinction between CPU threads and GPU devices is the synchronous behavior of the former and asynchronous behavior of the latter.While the code of a CPU VDP is expected to have the usual synchronous nature, the code of a GPU VDP is expected to be fully asynchronous.The reason is the asymmetry in the programming models for CPUs and GPUs, in which GPUs are not fully autonomous devices.Most of their actions have to be initiated by a CPU thread, and synchronous GPU code locks up the controlling thread.In a multi-GPU setup, this can be remedied by putting a separate CPU thread in charge of controlling each GPU.This still leaves the problem with asynchronous communication and multi-stream execution.
The requirement for the GPU VDPs to only contain asynchronous calls allows for maximum performance and the simplest runtime implementation.A single CPU thread is sufficient to launch all the computation and communication to multiple GPUs, and in the actual PULSAR implementation, it is the same thread that is also responsible for all MPI transactions.

VSA
The VSA is the main object in PULSAR, containing all the top-level information about the system, including: the total number of nodes and the rank of the local node, the number of CPU threads launched per node, and the number of GPU devices used per node.It contains the lists of threads and devices, and the communication proxy.It also contains a number of auxiliary structures, like a lookup table of local VDPs, and a list of channels connecting the local node to other nodes, as well as the list of memory allocators for all local devices.
The most complicated function provided by the VSA is VDP insertion.First, the VSA evaluates the mapping function to find out the VDP location.VDPs that do not belong in the local node, are immediately discarded.VDPs that do belong in the local node, are inserted in the appropriate thread or device, depending on the location returned by the mapping function.
Then local channels are merged.Each VDP is inserted with a complete set of channels.If the newly inserted VDP has channels connecting to other VDPs, inserted before, each pair of duplicate channels is merged into one channel.To support this operation, the VSA maintains a hash table, where the VDPs can be looked up by their tuples.
Then a unique (MPI) tag is assigned to each channel going out of the node.All channels connecting each pair of nodes have consecutive tags.This is necessary, because PULSAR implements VDP-to-VDP communication on top of MPI.MPI messages are received with the MPI ANY SOURCE, MPI ANY TAG flags and the destination VDP identified by the rank of the origin and the tag of the channel.Consecutive numbering of tags connecting each pair of nodes, as opposed to, e.g., global numbering in the whole system, prevents the problem of exhausting the 16-bit tag size limit of older MPI implementations.To support this operation, the VSA maintains lists of channels connecting the local node to all other nodes.
Finally, at the time of inserting a device VDP, a unique stream is created and assigned to the VDP to enable its asynchronous operation.
At the time of launch, the VSA basically launches threads and waits for their completion.Each CPU thread carries out its own execution.For a CPU-only run, nothing more is required.If the run is distributed and/or accelerators are involved, the VSA launches an extra thread for the proxy, which carries out the firings of device VDPs, as well as the MPI and PCI communication.

Proxy
The proxy carries out all tasks of asynchronous nature, including communication and firing of device VDPs.For the purpose of communicating, threads and devices register with the proxy as agents.The proxy maintains a list of sends requested, per agent, and sends posted, per agent.It also maintains a list of outstanding receive requests, one for all agents, as well as a list of outstanding local transfers, also one for all agents.During execution, the proxy continuously loops over the following actions: • Post another send for each agent.
• Complete another send for each agent.
• Post another receive.
• Complete another receive.• Cycle each device, i.e., fire another VDP on each device.
• Issue all local communications.
The most complicated transfer that may happen in the system is when a GPU sends a packet across the network.Figure 5 illustrates that situation.The following sequence of events takes place: 1.In the VDP function, the prt vdp channel push function is called do push a packet down a channel.At this point, the packet resides in device memory.A CUDA callback is placed in the VDP stream, to trigger the transfer, when all preceding VDP operations complete.2. CUDA runtime executes the callback, which places the transfer request in proxy's list of local requests.3. The proxy handles the transfer request by placing asynchronous device-to-host memory copy, across the PCI, in the outbound stream associated with the channel performing the communication.4. The proxy follows up with placing a callback in the channel's outbound stream, to trigger a transfer across the network, when the PCI transfer completes.5. CUDA runtime executes the callback, which places the transfer request in the proxy's list of local requests.6.The proxy places a send request in the list of sends requested by the device.7. The proxy issues the MPI Isend action, and moves the request from the list of sends requested to the list of sends posted.8.The proxy tests the send request for completion.When the request completes, the proxy removes the request from the list of sends posted and releases the packet.
A similar, although not identical, sequence of actions happens on the receiving side, where the message is received, the destination VDP identified, and appropriate steps taken depending on its location.Specifically, if the destination VDP resides in one of the devices, a transfer across the PCI is queued with the proxy.As long as the sequence is, there are no shortcuts here, since potentially two PCI buses and the network interface have to be crossed by the data.Little can be done about the latency of such transfers, however, all the emphasis in PULSAR is on latency hiding, rather than minimizing.The objective is to keep the buses and network interfaces saturated by multiple transfers going on at any given point in time.

Software Engineering
Although being an experimental project, PULSAR has a quite robust implementation.PUL-SAR is coded in C in a fairly object-oriented manner, and would be very suitable for a C++ implementation.PULSAR is very compact with only 21 .cfiles, 22 .hfiles and 6,600 lines of code.The compactness and clear structure is mostly thanks to the very crisp abstractions established in the design process (VDP, VSA, channel, packet, etc.) PULSAR has very few software dependencies.At the minimum, it requires Pthreads, and can be optionally compiled with MPI support and/or CUDA support.The build process involves compiling the sources and creating a library from the object files.Using MPI requires providing PULSAR only depends on the most basic data structures: a non-thread-safe double-linked list, a thread-safe double-liked list (implemented by protecting the non-thread-safe one with Pthread spinlocks), and a hash table.The basic linked list and hash table are implemented themselves as dependency-free, stand-alone structures.
PULSAR contains its own tracing routines, based on recording time of different events and writing an SVG file at the end of execution.PULSAR records computational tasks on CPUs and GPUs (VDP firings), MPI communications, and CUDA data transfers.CPU timestamps are taken using get time of day, GPU timestamps are take using cudaEventRecord.
PULSAR code, including the runtime (PRT) and examples, is available on the project website (https://bitbucket.org/icl/pulsar),which also contains extensive documentation, including: installation instructions, users' guide, reference manual, etc.The code is documented using Doxygen and the reference manual is produced automatically from Doxygen annotations.A simple version is available in HTML and an extended version in PDF, where function call graphs and data structures dependency graphs are also included.

Cannon's Matrix Multiplication
Cannon's algorithm for matrix multiplication is arguably the best known systolic algorithm.Here it makes for a perfect example due to its simplicity and compactness of implementation.Figure 6 shows the basic principle of Cannon's algorithm.A two-dimensional (2D) mesh of processors is used to compute the matrix product, C = A × B, by rotating the A matrix from left to right, and the B matrix from top to bottom, while each processor computes a block of C.

Figure 6. Cannon's matrix multiplication algorithm
Figure 7 shows the PULSAR code for the construction of the VSA.The basic premise of the implementation is to built a 2D mesh of VDPs, and let each VDP compute one tile of the result matrix C.Here nb is the size of each tile, and nt is the width and height of the matrix (number of tiles).The code loops over the vertical and horizontal dimensions of the matrix and, for each tile, creates a VDP, creates four channels (two vertical, two horizontal), inserts the channels in the VDP, and inserts the VDP in the VSA.
Each VDP holds its local tile of the C matrix, as well as its local tiles of matrices A and B, which are distributed in a skewed fashion, as depicted on Figure 6, i.e., tiles of the B matrix are shifted by one vertically, from column to column, and tiles of A are shifted by one horizontally, from row to row.All channels are initially inserted as inactive, so that each VDP can be launched without any data in the input channels, compute the local part of the product, send its local tile of A to the right, and send its local tile of B down.
Figure 8 shows a complete code of a VDP implementing the Cannon's algorithm.It starts with the declarations section, where tile size (nb) and matrix size (nt) are retrieved from global store, the location of the VDP tile in the matrix (m, n) is retrieved from the VDP tuple, an alias is created to the local store.The done variable is declared because the cuBLAS calls require passing of the constant by reference.
The first half of the code handles the first step, when the VDP's counter equals to the size of the matrix.If the VDP is a device VDP (is assigned to a GPU), then cuBLAS initializations are handled in the first step (creating cuBLAS handle and associating it with the VDP's stream).Then the local tile A is pushed to the right and the local tile of B is pushed down.Then the local product is computed, either using a cuBLAS call or a CBLAS call, depending on the VDP's location.Finally, the input channels are activated to provide data for the next step.
In the followup steps, a tile of A is read from the left and passed to the right, unless it is the last step of the algorithm (VDP reaches one).Tiles of B are handeld in the same manner, passing downwards.Then the product is computed using an appropriate call (either cuBLAS or CBLAS), and finally the transient packets, used for transfers of A and B, are released.Multithreaded BLAS is used for the CPU runs and cuBLAS is used for the GPU runs.
Figure 9 shows the scaling.The dashed lines show ideal scaling, taking the smallest parallel case (2 × 2 nodes) as the reference point.Figure 10 shows an execution trace of a small CPU run (2 × 2 nodes, 2 × 2 tiles per node, 2048 × 2048 tiles).Figure 11 shows an execution trace of the same run using GPUs.In the CPU trace, the timeline of each node is represented by two lines.The first one shows the execution of the matrix multiplications.The second one shows invocations of communication tasks.In the case of communication tasks, only the duration of asynchronous MPI calls is registered (not duration of the actual communication), which results In the CPU case, MPI communication is completely overlapped with computation, resulting in no gaps in the trace and almost idea scaling.In the GPU case, the DMA transfers do not keep up with the speed of execution, and the MPI transfers do not keep up with the DMA transfers, resulting in large gaps in the trace and poor scaling.At the same time, GPU execution still produces much higher overall performance than CPU execution.

Conclusion
This paper has presented the PULSAR system.PULSAR combines the key paradigms of systolic arrays (regularity, extensive data re-use and multilevel pipelining) with virtualization techniques, in order to provide a simple yet efficient programming model to design parallel algorithms on complex multicore systems with attached accelerators.
After detailing the nuts and bolts of the system, we have provided a comprehensive description of a PULSAR instance, namely the canonic Cannon's algorithm for matrix product.We have shown convincing performance results on Titan, which nicely demonstrate that a limited programming effort, as required by PULSAR, is not incompatible with an efficient implementation.Achieving a good trade-off between the ease of programming and the quality of the results was the primary objective of PULSAR.

Figure 2
conveys the basic ideas.

Figure 3 .
Figure 3. Code snippets for VDP operation (left) and VSA construction (right)

Figure 4 .
Figure 4. Structure of PULSAR runtime

Figure 5 .
Figure 5. Timeline for a transfer originating from a GPU and involving MPI