Simultac Fonton: A Fine-Grain Architecture for Extreme Performance beyond Moore's Law

Maciej Brodowicz, Thomas Sterling

Abstract


With nano-scale technology and the end of Moore's Law, architectural advances serve as the principal means of achieving enhanced efficiency and scalability into the exascale era. Ironically, the field that has demonstrated the greatest leaps of technology in the history of humankind has retained its roots in its earliest strategy, the von Neumann architecture model, which imposes tradeoffs that were suitable through the 1980s but are no longer valid for today's semiconductor technologies. Essentially all commercial computers, including HPC systems, have been and are von Neumann derivatives. The bottlenecks imposed by this heritage are the emphasis on ALU/FPU utilization, single instruction issue and sequential consistency, and the separation of memory and processing logic (the "von Neumann bottleneck"). Here the authors explore the possibility and implications of one class of non-von Neumann architecture based on cellular structures, asynchronous multi-tasking, distributed shared memory, and message-driven computation. "Continuum Computer Architecture" (CCA) is introduced as a genus of ultra-fine-grained architectures in which complexity of operation is an emergent behavior of simplicity of design combined with highly replicated elements. An exemplar species of CCA, "Simultac", is considered; it comprises billions of simple elements, "fontons", which merge data storage and movement with logical transformations. Employing the ParalleX execution model and a variation of the HPX+ runtime system software, the Simultac may provide the path to cost-effective data analytics and machine learning as well as dynamic adaptive simulations in the trans-exaOPS performance regime.







Publishing Center of South Ural State University (454080, Lenin prospekt, 76, Chelyabinsk, Russia)