Towards A Data Centric System Architecture: SHARP

The Authors 2017. This paper is published with open access at SuperFri.org. Increased system size and a greater reliance on system parallelism to meet computational needs require innovative system architectures to address the simulation challenges. The SHARP technology is a step towards a data-centric architecture, where data is manipulated throughout the system. This paper introduces a new SHARP optimization, and studies aspects that impact application performance in a data-centric environment. The use of UD-Multicast to distribute aggregation results is introduced, reducing the latency of an eight-byte MPI_Allreduce() across 128 nodes by 16%. Use of reduction trees that avoid the inter-socket bus further improves the eight-byte MPI_Allreduce() latency across 128 nodes, with 28 processes per node, by 18%. The distribution of latency across processes in the communicator is studied, as is the capacity of the system to process concurrent aggregation operations.


Introduction
The challenge of providing ever higher levels of effective computing cycles for tightly coupled computer-based simulations continues to pose new technical hurdles. With each hurdle traversed, a new challenge comes to the forefront, with many architectural features emerging to address these problems. This has included the introduction of vector compute capabilities to single processor systems, such as the CDC Star-100 [28] and the Cray-1 [27], followed by the introduction of small-scale parallel vector computing, such as the Cray-XMP [5], custom-processor-based tightly-coupled MPPs, such as the CM-5 [21] and the Cray T3D [17], followed by systems of clustered commercial-off-the-shelf micro-processors, such as the Dell PowerEdge C8220 Stampede at TACC [30] and the Cray XK7 Titan computer at ORNL [24]. For a decade or so the latter systems relied mostly on Central Processing Unit (CPU) frequency upticks to provide the increase in computational power. But, as a consequence of the end of Dennard scaling [9], single-CPU frequency has plateaued, and contemporary HPC cluster performance increases depend on rising numbers of compute engines per silicon device to provide the desired computational capabilities. Today HPC systems use many-core host elements that utilize, for example, X86, Power, or ARM processors, General Purpose Graphical Processing Units (GPGPUs) and Field Programmable Gate Arrays (FPGAs) [15] to keep scaling the system performance. Network capabilities have also increased dramatically over the same period, with changes such as increases in bandwidth, decreases in latency, and communication technologies like InfiniBand RDMA that offload processing from the CPU to the network.
With increasing compute engine counts, system architectures have continued to be CPU-centric, with these system elements being involved in the vast majority of data manipulation. This has resulted in unnecessary data movement and undesirable competition between computational, communication, storage and other needs for the same computational resources. A data-centric system architecture, which co-locates computational resources and data throughout the system, enables data to be processed all across the system, and not only by CPUs at the edge. For example, data can be manipulated as it is being transferred within the data center network as part of a collective operation. This type of approach addresses latency and other performance bottlenecks that exist in the traditional CPU-centric architecture. Mellanox focuses on CPU offload technologies designed to process data as it moves through the network, either by the Host Channel Adapter (HCA) or the switch. This frees up CPU cycles for computation, reduces the amount of data transferred over the network, allows for efficient pipelining of network and computation, and provides for very low communication latencies. To accomplish a marked increase in application performance, there has been an effort to optimize frequently used communication patterns, such as collective operations, in addition to the continuous improvements to basic communication metrics, such as point-to-point bandwidth, latency, and message rate.
InfiniBand technologies are being transformed to support such data-centric system architectures. These include technologies such as SHARP for handling data reduction and aggregation, hardware-based tag matching, and network hardware data gather-scatter capabilities. These technologies are used to process data and network errors at the network level, without the need for data to reach a CPU, reducing the overall volume of transferred data and improving system resilience.
This paper extends the investigation of the SHARP technology previously introduced [12] for offloading aggregation and reduction operations to InfiniBand switches. The paper is organized as follows: Section 1 presents previous related work in offload technologies. Section 2 describes the new UD-Multicast protocol, which utilizes multiple children in the reduction tree to avoid using the internal node interconnect between sockets. Section 3 describes the benchmarks and applications investigated, and discusses the distribution of latencies across processes in a communicator and the network's ability to process multiple reduction operations concurrently. The final section provides a summary and discussion of the work presented.

Previous Work
In the past, extensive work has been done on improving the performance of blocking and nonblocking barrier and reduction algorithms.
Algorithmic work performed by Venkata et al. [33] developed short vector blocking and nonblocking reduction and barrier operations using a recursive K-ing type host-based approach, and extended work by Thakur [31]. Vadhiar et al. [32] presented implementations of blocking reduction, gather and broadcast operations using sequential, chain, binary, binomial tree and Rabenseifner algorithms. Hoefler et al. [16] studied several implementations of nonblocking MPI_Allreduce() operations, showing performance gains when using large communicators and large messages.
Some work aimed to optimize collective operations for specific topologies. Representative examples are refs. [6] and [22], which optimized collectives for mesh topologies and for hypercubes, respectively.
Other work presented hardware support for performance improvement. Conventionally, most implementations use the CPU to set up and manage collective operations, with the network just used as a data conduit. However, Quadrics [26] implemented support for broadcast and barrier in network device hardware. More recently, IBM's Blue Gene supercomputer included network-level hardware support for barrier and reduction operations. Its first version, Blue Gene/L [11], which uses a torus interconnect [1], provided up to a twofold throughput gain for all-to-all collective operations [2, 20]. On a 512 node system, the latency of the 16 byte MPI_Allreduce() was 4.22 µ-seconds. Later, a message passing framework, DCMF, for the next-generation supercomputer Blue Gene/P was introduced [18]. MPI collectives optimization algorithms for this generation of Blue Gene were analyzed in [10]. The recent version, Blue Gene/Q [14], provides additional performance improvements for MPI collectives [19]. On a 96,304 node system, the latency of a short allreduce is about 6.5 µ-seconds. IBM's PERCS system [4] fully offloads collective reduction operations to hardware. Finally, Mai et al. presented the NetAgg platform [23], which uses in-network middleboxes for partition/aggregation operations, to provide efficient network link utilization. Cray's Aries network [3] implemented 64 byte reduction support in the HCA, supporting reduction trees with a radix of up to 32. The eight byte MPI_Allreduce() latency for about 12,000 processes with 16 processes per host was close to ten µ-seconds.
Several APIs have been proposed for offloading collective operation management to the HCA. These include Mellanox's CORE-Direct protocol [13], Portals 4.0 triggered operations [7], and an extension to Portals 4.0 [29]. All of these support protocols that use end-point management of the collective operations, whereas in the current approach the end-points are involved only in collective initiation and completion, with the switching infrastructure supporting the collective operation management.

Aggregation Protocol
A goal of the new network co-processor architecture is to optimize the completion time of frequently used global communication patterns and to minimize their CPU utilization. The first set of patterns being targeted are global reductions of short vectors, which include barrier synchronization and small data reductions. As previously mentioned, the SHARP protocol has already been described in detail; therefore, only a brief description is provided in this section, highlighting the new hardware capability that is introduced.
SHARP provides an abstraction describing data reduction and aggregation. The protocol defines aggregation nodes (ANs), which form the nodes of a reduction tree. These trees overlay a physical network. Figure 1 shows an example of a physical network topology, with Fig. 2 describing a possible reduction tree constructed over this physical topology. The aggregation nodes are colored in red, with the leaves of the tree, the blue stars, being the sources of the data.

Figure 1. Physical Network Topology
Aggregation operations are defined for SHARP groups. These groups are formed as subtrees of SHARP trees, where multiple groups may be formed from a given SHARP tree. Figure 3 gives an example of a SHARP group of size eight.
An aggregation operation is performed with the participation of each member of the aggregation group. To initiate such an operation, members of the aggregation group send their aggregation request message to their leaf aggregation node. The aggregation request header contains all the information needed to perform the aggregation, and includes the data description, i.e. the data type, data size, and number of such elements, and the aggregation operation to be performed, such as a min or sum operation. An aggregation node receiving aggregation requests collects these from all its children and performs the aggregation operation once all the expected requests arrive. The root aggregation node performs the final aggregation, producing the result of the aggregation operation.
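The collect-then-reduce behavior described above can be illustrated with a minimal host-side sketch. It is not the switch implementation: the `AggregationNode` class, the `aggregate` function and the eight-member, two-level tree are hypothetical names and shapes chosen only for illustration, and the "wait for all children" step is modeled by simple recursion.

```python
from functools import reduce


class AggregationNode:
    """Hypothetical stand-in for an aggregation node (AN) in a reduction tree."""

    def __init__(self, children):
        # Children are either other AggregationNode objects or leaf vectors
        # (the payloads of the aggregation requests sent by group members).
        self.children = list(children)


def elementwise(op, a, b):
    """Apply a two-operand reduction operator element by element."""
    return [op(x, y) for x, y in zip(a, b)]


def aggregate(node, op):
    """Return the reduced vector for the subtree rooted at `node`."""
    if not isinstance(node, AggregationNode):
        return list(node)  # a leaf: one group member's data
    # An AN reduces only after the partial results of all children are present.
    partials = [aggregate(child, op) for child in node.children]
    return reduce(lambda a, b: elementwise(op, a, b), partials)


# Eight group members, two leaf ANs and one root AN (illustrative shape only).
members = [[float(rank), 1.0] for rank in range(8)]
root = AggregationNode([AggregationNode(members[:4]), AggregationNode(members[4:])])

total = aggregate(root, lambda x, y: x + y)
smallest = aggregate(root, min)
print(total, smallest)
```

Each leaf AN reduces the requests of its four members, and the root AN combines the two partial results, mirroring the roles of the leaf and root aggregation nodes in the text.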
This aggregation result is distributed in up to two of several possible ways. The destination may be one of several targets, including one of the requesting processes, such as in the case of MPI_Reduce(), all the group processes, such as in the case of an MPI_Allreduce() operation, or a separate process that may not be a member of the reduction group. An aggregation tree can be used to distribute the data in these cases.
The new hardware capability described in this paper is that the target may also be a user-defined InfiniBand multicast address. It is important to note that while multicast data distribution is supported by the underlying transport, it provides an unreliable delivery mechanism. Any reliability protocol needed must be provided on top of this mechanism. The SHARP protocol does not define the data transport, so that communication between ANs can occur using a range of transports, such as RDMA-enabled protocols like InfiniBand or RDMA over Converged Ethernet (RoCE). It also does not handle packet loss or reordering, and so requires a reliable transport that provides in-order delivery of packets to the upper layer.

SwitchIB-2-Based Aggregation Support
In the SwitchIB-2 implementation, the aggregation node logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC. The transport used for communication between ANs, and between ANs and hosts in the aggregation tree, is the InfiniBand Reliable Connection (RC) transport. The results are distributed from the root to the leaf nodes, or hosts, down the tree, or to a target InfiniBand multicast group.
The aggregation node implementation includes a high performance Arithmetic Logic Unit (ALU), used to perform the aggregation operations supported by the aggregation node. It can operate on 32- and 64-bit signed and unsigned integers and floating point data. The supported operations include sum, min and max, MPI's MinLoc and MaxLoc, and bitwise OR, AND, and XOR, which cover all the operations needed to support the MPI standard and the OpenSHMEM specification, with the exception of the product.
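Of the operations listed, MinLoc and MaxLoc are the least obvious, because they reduce (value, index) pairs rather than plain values. The sketch below follows the MPI definition of these operators, in which ties on the value are resolved in favor of the lowest index. The function names and the sample data are illustrative and are not taken from the switch firmware.

```python
from functools import reduce


def minloc(a, b):
    """MPI_MINLOC on (value, index) pairs: smaller value wins, ties take the lower index."""
    (va, ia), (vb, ib) = a, b
    if va < vb:
        return a
    if vb < va:
        return b
    return (va, min(ia, ib))


def maxloc(a, b):
    """MPI_MAXLOC on (value, index) pairs: larger value wins, ties take the lower index."""
    (va, ia), (vb, ib) = a, b
    if va > vb:
        return a
    if vb > va:
        return b
    return (va, min(ia, ib))


# One (value, rank) contribution per process; ranks 1 and 2 tie for the minimum.
contributions = [(3.0, 0), (1.0, 1), (1.0, 2), (5.0, 3)]
lowest = reduce(minloc, contributions)
highest = reduce(maxloc, contributions)
print(lowest, highest)
```

Because both operators are associative and commutative on such pairs, the result does not depend on the shape of the reduction tree.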
Requests are collected in the TCA, with the reduction performed only after all operands are available, in a predetermined and fixed order. SwitchIB-2 implements a predictable operation ordering to enable repeatable results regardless of the order of arrival of the aggregation requests.
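The reason a fixed operation order matters is that floating point addition is not associative, so summing the same operands in a different order can change the result. The sketch below contrasts reduction in arrival order with reduction in a fixed order. Ordering by child index is an assumption made for illustration; the source does not state which fixed order SwitchIB-2 uses.

```python
import itertools


def reduce_in_arrival_order(arrivals):
    """Sum (child_index, value) pairs in the order in which they arrived."""
    total = 0.0
    for _child, value in arrivals:
        total += value
    return total


def reduce_in_fixed_order(arrivals):
    """Buffer all requests, then sum them in ascending child-index order."""
    buffered = dict(arrivals)
    total = 0.0
    for child in sorted(buffered):
        total += buffered[child]
    return total


# Operands chosen so that rounding makes the sum order-dependent.
operands = [(0, 1e16), (1, 1.0), (2, -1e16), (3, 1.0)]

arrival_results = set()
fixed_results = set()
for order in itertools.permutations(operands):
    arrival_results.add(reduce_in_arrival_order(order))
    fixed_results.add(reduce_in_fixed_order(order))

print(sorted(arrival_results), sorted(fixed_results))
```

Over all 24 arrival orders, the arrival-order sum takes several distinct values, while the fixed-order sum always yields the same one, which is the repeatability property described above.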
When using hardware multicast to distribute the aggregation results, the result also needs to be distributed with a reliable protocol to ensure delivery of these results.
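One way to picture the combination of unreliable multicast and a reliable copy is a receiver that accepts whichever copy of a result arrives first and discards the later duplicate. The sketch below is a hypothetical illustration of that idea, not the HPC-X or SwitchIB-2 logic: the `ResultReceiver` class, the operation identifiers and the `expected_latency` helper are all assumptions, and the latency arguments are placeholders rather than measured values.

```python
class ResultReceiver:
    """Accepts the first copy of each aggregation result and drops duplicates."""

    def __init__(self):
        self.completed = {}  # operation id -> (transport, payload)

    def on_result(self, op_id, transport, payload):
        """Return True if this copy completed the operation, False if it was a duplicate."""
        if op_id in self.completed:
            return False
        self.completed[op_id] = (transport, payload)
        return True


def expected_latency(t_ud, t_rc, ud_loss_probability):
    """First-order model: the UD copy is used unless it is lost, then the RC copy is used."""
    return (1.0 - ud_loss_probability) * t_ud + ud_loss_probability * t_rc


receiver = ResultReceiver()
first = receiver.on_result(7, "UD", b"result-7")       # multicast copy arrives first
duplicate = receiver.on_result(7, "RC", b"result-7")   # reliable copy is redundant
fallback = receiver.on_result(8, "RC", b"result-8")    # multicast copy was lost
print(first, duplicate, fallback)
```

In this model the reliable copy affects completion time only for the operations whose multicast packet was lost, so with a very low loss probability the expected latency stays close to that of the multicast path.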

Benchmark Results
To evaluate the SHARP capabilities, both low-level MPI benchmarks and an application-level benchmark are used.
A 128 host system is used for these experiments. Each node has two 14-core Broadwell CPUs running at 2.60 GHz, with 256 GB of RAM. ConnectX-4 HCAs running at 100 Gb/s are used. The fabric uses a two-level fat-tree with SwitchIB-2 switches and eight leaf switches, each connecting to 16 hosts. The hosts run RedHat Linux 7.2, and the tests were carried out with OFED 3.4-2.1.9.0. A pre-release version of HPC-X, the Mellanox supported MPI, is used, which includes a set of MPI collective routines that access and use the SHARP hardware capabilities embedded in the SwitchIB-2 switches to optimize the performance of the corresponding MPI collectives.

MPI-Level SHARP Measurements
The OSU MPI_Allreduce() test [25] is used to measure the SHARP latency. Figure 4 shows the latency of MPI_Allreduce() operations as a function of message size and the mode of result distribution, with one process per node. Using UD multicast for distributing the result takes advantage of the O(1) multicast capabilities for improved performance, but is unreliable (the bit error rate being on the order of 10^-15), requiring the additional RC result distribution to provide the result when a UD packet is dropped. Using UD multicast and RC to distribute the results improves latency in the range of 15-58% relative to using RC only for this distribution, even with the duplicate result distribution. The improvement relative to the host-based approach is in the range of 143 to 385 percent.
SHARP reduction trees assume some sort of host-level aggregation prior to sending data to the leaf AN, because of the limitation on the AN's radix. Figure 5 shows the latency of the MPI_Allreduce() operation when using one connection per socket (2 channels) into the SHARP reduction tree, avoiding reduction over the internal chip network, and when using one connection per node (1 channel). As the results show, for messages up to 1024 bytes in size, the two-channel configuration reduces latency by more than ten percent. With larger messages, an increase in latency is observed. While the two-channel case eliminates the host-side inter-socket reduction step, it increases the leaf AN radix by a factor of two. As the vector length increases, this manifests itself as a larger latency relative to the one-channel case.
To get a better understanding of the spread in completion times across the communicator, several metrics are collected to characterize this behavior. Table 1 lists the average MPI_Allreduce() latencies, along with quartile data, minimum value and maximum value to describe the data distribution, using UD-multicast for result distribution, and one process per node. These are reported for the full collective operation (measured as the average of the collective operation) and for the completion of each of the individual ranks in the communicator. As expected, there is greater variance in individual completion times, as compared with the average per-collective completion time. Also, we see that the SHARP based collectives have a much smaller per-rank latency range.
For the SHARP capabilities to be useful in a general purpose production system, where multiple jobs run concurrently, potentially sharing ANs, it is useful to study the system's ability to support concurrent SHARP operations. The system's capacity to service concurrent collective operations is studied by running multiple collective operations at the same time, using completely overlapping SHARP-tree groups. The OSU-latency test was modified to run concurrent collective operations with non-overlapping MPI communicators, with the MPI process layout configured to achieve this overlap. As the results in Fig. 6 show for communicators of size eight, SHARP is able to accommodate many outstanding operations very well. Latency starts to degrade at a message size of 2048 bytes with eight concurrent operations, where as many as sixty-four operations are in flight. With sixteen concurrent operations, latency is impacted by about 30% with a message size of 64 bytes.
Table 2 presents the MPI_Allreduce() latency as a function of the number of outstanding SHARP operations each group is configured to allow. The eight byte data requires only one SHARP-level operation per MPI operation, whereas the 2048 byte reduction requires eight such operations. As expected, we see that the eight byte reduction is minimally impacted by the number of allowed outstanding SHARP operations, except in the eight communicator test, where there are insufficient resources for all communicators, and the test does not run as written. The 2048 byte MPI-level operation is negatively impacted by the lack of sufficient resources to pipeline the entire operation at once, but even with only two outstanding SHARP operations supported, there is the benefit of some pipelining, [...] than [...] times that of the eight operation case.
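The relationship between message size, SHARP-level operations and the number of outstanding SHARP operations can be made concrete with a small counting sketch. The 256-byte payload per SHARP operation is an inference from the statement that a 2048 byte reduction requires eight operations, not a figure given in the text, and the function names are hypothetical. The sketch only counts operations and pipeline rounds; it does not model latency.

```python
import math

# Assumption: inferred from "2048 bytes requires eight SHARP operations".
ASSUMED_PAYLOAD_BYTES = 256


def sharp_ops_needed(message_bytes, payload_bytes=ASSUMED_PAYLOAD_BYTES):
    """Number of SHARP-level operations needed for one MPI-level reduction."""
    return max(1, math.ceil(message_bytes / payload_bytes))


def pipeline_rounds(message_bytes, outstanding_ops, payload_bytes=ASSUMED_PAYLOAD_BYTES):
    """Rounds needed when at most `outstanding_ops` operations may be in flight."""
    return math.ceil(sharp_ops_needed(message_bytes, payload_bytes) / outstanding_ops)


def ops_in_flight(communicators, message_bytes, outstanding_ops):
    """Upper bound on concurrently outstanding operations across all communicators."""
    per_communicator = min(outstanding_ops, sharp_ops_needed(message_bytes))
    return communicators * per_communicator


print(sharp_ops_needed(8), sharp_ops_needed(2048))
print(pipeline_rounds(2048, 8), pipeline_rounds(2048, 2))
print(ops_in_flight(8, 2048, 8))
```

Under these assumptions, a 2048 byte reduction completes in one round with eight outstanding operations and needs four rounds with two, and eight concurrent communicators at 2048 bytes account for the sixty-four operations in flight mentioned for the concurrent test.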

Application Benchmarks
Table 3 shows the result of running the Algebraic Multi-Grid (AMG) [8] micro-benchmark on 64 nodes, with 28 processes per node. The AMG benchmark uses an eight byte data reduction. Across five of the AMG test cases (Laplace, 27 point, Jumps, def/pool1 and def/pool0), an average improvement of 1.8% in total test run time was measured when using

Discussion and Conclusions
To improve MPI-level aggregation performance, UD-Multicast is used to distribute results from the root of the aggregation tree, and SHARP trees that avoid using the host's inter-socket bus for aggregations are employed. Employing UD-multicast to distribute the aggregated values reduces overall operation latency, even though the result is sent twice to ensure reliable delivery, once using UD-Multicast and once with RC. The UD packets allow for fast result distribution, with a very low packet-loss rate. The RC packets sent to ensure data delivery arrive a little later, and impact latency only when the UD packets are lost. This improves the eight byte reduction latency at 128 nodes by 16%, and the 4096 byte latency by 58%. The distribution using UD-Multicast benefits from the switch's ability to replicate the data packet to all ports relevant to the multicast group in parallel, whereas the RC packet replication has some degree of serialization. In addition, for small message distribution, message rate, rather than bandwidth, is the primary performance limiter. The high message rate of 195 messages per port per µ-second supported by the SwitchIB-2 device is capable of handling the duplicated data.
Using the intra-host bus for data exchange between sockets can be expensive relative to the intra-socket communication. It is frequently more efficient to avoid using this bus when accessing the network, and therefore a similar approach has been investigated for SHARP reductions. As the results show, aggregating data on a per-socket basis also helps reduce operation latency for small message sizes, reducing the eight byte operation latency by 16% at 128 nodes. However, as the data size increases, competition for the PCIe bus bandwidth from the host to the network and the twofold increase in AN radix at the leaf switches make this particular optimization undesirable.
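The twofold increase in leaf AN radix follows directly from the test system's layout of eight leaf switches with 16 hosts each and two 14-core sockets per host. The short calculation below works through the numbers. It assumes one aggregation node per leaf switch whose children are the host connections, which is an illustrative simplification, and the helper names are hypothetical.

```python
# System parameters as described for the benchmark cluster.
LEAF_SWITCHES = 8
HOSTS_PER_LEAF = 16
SOCKETS_PER_HOST = 2
CORES_PER_SOCKET = 14


def leaf_an_children(channels_per_host):
    """Host connections feeding one leaf AN (assumes one AN per leaf switch)."""
    return HOSTS_PER_LEAF * channels_per_host


def ranks_reduced_per_channel(channels_per_host):
    """Processes whose data is combined on the host before entering the tree."""
    return SOCKETS_PER_HOST * CORES_PER_SOCKET // channels_per_host


total_hosts = LEAF_SWITCHES * HOSTS_PER_LEAF
total_ranks = total_hosts * SOCKETS_PER_HOST * CORES_PER_SOCKET

for channels in (1, 2):
    print(channels, leaf_an_children(channels), ranks_reduced_per_channel(channels))
print(total_hosts, total_ranks)
```

With one channel per host, each leaf AN has 16 children and each host combines 28 ranks across both sockets. With two channels, each leaf AN has 32 children, and each socket combines only its own 14 ranks, so no data crosses the inter-socket bus before entering the reduction tree.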
In general purpose data-centric environments, with multiple jobs running on the system at the same time, jobs compete for a fixed set of resources, unless special care has been taken to isolate the resources used by separate jobs. In the case of SHARP, the AN resources are an additional set of resources, beyond the host and other network resources, that may be shared. The impact of such sharing on the SHARP latencies has been studied by running concurrent reductions on the same reduction tree and limiting the number of concurrent aggregations.
To study the effectiveness of the protocol in a multi-job scenario, where some ANs may be used by multiple jobs, we ran up to sixteen collectives concurrently. This is expected to be a worst-case type of scenario, because the test forces the collective operation concurrency. Since application runs typically are not synchronized, and they do more than just run collective operations, the impact on concurrently running applications using the same AN resources is expected to be smaller. The results show that the impact on the small message reduction latency is small, but as the message size increases the impact of this sharing becomes noticeable due to the competition for bandwidth. At 2048 byte message size and 128 nodes, a small impact is noticed when two operations are running concurrently, but with four it is still advantageous to use the SHARP protocol over the host-based protocol.
We also observed that when there are insufficient resources to pipeline a reduction operation with independent resources, there are still benefits to such optimization when compared with the host-based approach. A 2048 byte message size at 128 nodes requires eight outstanding SHARP operations (OSOs) for the full message reduction to be concurrently in flight. However, providing only two such OSOs still reduces the operation latency relative to the host-based approach.
Finally, collective operations are known to amplify application load imbalance. Looking at the per-process spread in collective operations, we see that the SHARP based collectives are less susceptible to imbalance within the collective algorithms themselves, thus supporting application scalability better than the host-based algorithms.
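The per-process spread referred to here is characterized in Table 1 by the average, quartiles, minimum and maximum, computed both per collective and per rank. A minimal sketch of how such a summary could be computed is given below. The timing values are synthetic placeholders and not measurements from this work, and the helper name and the choice of the inclusive quartile method are assumptions.

```python
import statistics


def latency_summary(samples):
    """Mean, quartiles, minimum and maximum of a list of latency samples."""
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(samples),
        "min": min(samples),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(samples),
    }


# Synthetic data: completion_times[i][r] is the completion time of rank r
# in iteration i, in arbitrary units (placeholder values only).
completion_times = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 5.0],
]

# Per-collective view: one value per iteration, averaged over the ranks.
per_collective = [statistics.fmean(iteration) for iteration in completion_times]
# Per-rank view: every individual rank completion time.
per_rank = [t for iteration in completion_times for t in iteration]

collective_stats = latency_summary(per_collective)
rank_stats = latency_summary(per_rank)
print(collective_stats)
print(rank_stats)
```

Averaging over the ranks of each collective hides the slowest and fastest ranks, which is why the per-rank view shows a wider range than the per-collective view, as noted for Table 1.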
In conclusion, this paper has introduced the ability to use UD-multicast for aggregation result distribution and presented several aspects of the SHARP protocol not previously examined. Benchmark and application results show that the protocol is effective, and help to show how to best utilize the underlying SHARP capabilities in a general purpose data-centric environment.
This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is properly cited.

Figure 2. Logical SHARP Tree. Note that in the SHARP abstraction an Aggregation Node may be hosted by an end-node

Figure 4. 128 node MPI_Allreduce() average latency with different modes of result distribution. A comparison to the host-only algorithm is also included. Latency is reported in µ-seconds

Table 1. MPI_Allreduce() Latency (µsec) Distribution of a 127 Node Cluster with One Process Per Node

Table 2. Eight Process MPI_Allreduce() Average Latency (µsec) as a Function of the Number of Communicators Operating in Parallel and as a Function of Maximum Outstanding SHARP Operations (OSOs) Available

Table 3. AMG Figure-of-Merit (Higher is Better) Data for Five Different Tests, Run on 64 Nodes with 28 Processes Per Node, and a System Configured with Low System Noise