A Case for Embedded FPGA-based SoCs in Energy-Efficient Acceleration of Graph Problems

Nachiket Kapre; Pradeep Moorthy

doi:10.14529/jsfi150307

Authors

Nachiket Kapre Nanyang Technological University, Singapore
Pradeep Moorthy Nanyang Technological University, Singapore

Abstract

Sparse graph problems are notoriously hard to accelerate on conventional platforms because their irregular memory access patterns underutilize memory bandwidth. These bottlenecks on traditional x86-based systems mean that sparse graph problems scale very poorly, both in terms of performance and power efficiency. A cluster of embedded SoCs (systems-on-chip) with closely coupled FPGA accelerators can support distributed memory accesses with better-matched low-power processing. We first conduct preliminary experiments across a range of COTS (commercial off-the-shelf) embedded SoCs to establish their promise for energy-efficient acceleration of sparse problems. We select the Xilinx Zynq SoC with FPGA accelerators to construct a prototype 32-node Beowulf cluster. We develop specialized MPI routines and memory DMA offload engines to support irregular communication efficiently. In this setup, we use the ARM processor as a data marshaller for local DMA traffic as well as remote MPI traffic, while the FPGA may be used as a programmable accelerator. Across a set of benchmark graphs, we show that a 32-node embedded SoC cluster can exceed the energy efficiency of an Intel E5-2407 by as much as 1.7× at a total graph processing capacity of 91–95 MTEPS for graphs as large as 32 million nodes and edges.
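To make the irregular access pattern and the MTEPS (million traversed edges per second) metric concrete, the following is a minimal single-node sketch, not the kernel used in the paper. It assumes a compressed sparse row (CSR) layout and uses breadth-first search purely as an illustrative workload; the function names `build_csr`, `bfs_levels`, and `mteps` are hypothetical.

```python
from collections import deque


def build_csr(num_nodes, edges):
    """Build CSR arrays (row_ptr, col_idx) from a directed edge list."""
    row_ptr = [0] * (num_nodes + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1          # count out-edges per source node
    for v in range(num_nodes):
        row_ptr[v + 1] += row_ptr[v]   # prefix sum gives row offsets
    col_idx = [0] * len(edges)
    cursor = row_ptr[:-1].copy()       # next free slot in each row
    for src, dst in edges:
        col_idx[cursor[src]] = dst
        cursor[src] += 1
    return row_ptr, col_idx


def bfs_levels(row_ptr, col_idx, root):
    """Return (levels, traversed_edges) for a BFS starting at root."""
    levels = [-1] * (len(row_ptr) - 1)
    levels[root] = 0
    frontier = deque([root])
    traversed = 0
    while frontier:
        u = frontier.popleft()
        # Reading col_idx is sequential, but levels[v] below is indexed by
        # graph data, so each lookup may land anywhere in memory.
        for k in range(row_ptr[u], row_ptr[u + 1]):
            v = col_idx[k]
            traversed += 1
            if levels[v] == -1:
                levels[v] = levels[u] + 1
                frontier.append(v)
    return levels, traversed


def mteps(traversed_edges, seconds):
    """Million traversed edges per second."""
    return traversed_edges / seconds / 1e6
```

The data-dependent `levels[v]` lookup is the kind of access that defeats caches and wide memory bursts on conventional processors. Dividing an MTEPS figure by measured power in watts gives an energy-efficiency figure that can be compared across platforms.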
