Runtime-Aware Architectures: A First Approach
In the last few years, the traditional techniques for keeping hardware performance growing at the rate predicted by Moore's Law have been exhausted. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well-defined Instruction Set Architecture (ISA). This simple interface allowed applications to be developed without worrying too much about the underlying hardware, while hardware designers were able to aggressively exploit instruction-level parallelism (ILP) in superscalar processors. With the advent of multi-cores and parallel applications, this simple interface started to leak. As a consequence, the role of decoupling applications from the hardware once again moved to the runtime system. Using the underlying hardware efficiently from this runtime, without exposing its complexities to the application, has been the target of very active and prolific research in recent years.
Current multi-cores are designed as simple symmetric multiprocessors (SMP) on a chip. However, we believe that this is not enough to overcome all the problems that multi-cores already face. Our position is that the runtime has to drive the design of future multi-cores in order to overcome their restrictions in terms of power, memory, programmability and resilience. In this paper, we introduce a first approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime's perspective.