Resilience within Ultrascale Computing System: Challenges and Opportunities from Nesus Project

Pascal Bouvry, Rudolf Mayer, Jakub Muszyński, Dana Petcu, Andreas Rauber, Gianluca Tempesti, Tuan Trinh, Sébastien Varrette


Ultrascale computing is a new computing paradigm that arises naturally from the need for computing systems able to handle massive data in possibly very large-scale distributed systems, enabling new forms of applications that can serve a very large number of users with unprecedented timeliness. Besides these benefits, however, ultrascale computing systems come with challenges, one of which is resilience. Although resilience is already an established field in system science, with many methodologies and approaches available to deal with it, the unprecedented scale of computing and of the data to be managed, new network technologies, and drastically new forms of massive-scale applications bring new challenges that need to be addressed. This paper reviews the challenges and approaches of resilience in ultrascale computing systems from multiple perspectives: the resilience aspects of hardware-software co-design for ultrascale systems, resilience against (security) attacks, new approaches and methodologies for resilience in ultrascale systems, and applications and case studies.


Publishing Center of South Ural State University (454080, Lenin prospekt, 76, Chelyabinsk, Russia)