An Autonomic Performance Environment for Exascale

Kevin A. Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, Rob Fowler

Abstract


Exascale systems will require new approaches to performance observation, analysis, and runtime decision-making to optimize for performance and efficiency. The standard "first-person" model, in which multiple operating system processes and threads observe themselves and record first-person performance profiles or traces for offline analysis, is not adequate to observe and capture interactions at shared resources in highly concurrent, dynamic systems. Further, it does not support mechanisms for runtime adaptation. Our approach, called APEX (Autonomic Performance Environment for eXascale), provides mechanisms for sharing information among the layers of the software stack, including hardware, operating and runtime systems, and application code, both new and legacy. The performance measurement components share information across layers, merging first-person data sets with information collected by third-person tools observing shared hardware and software states at the node and global levels. Critically, APEX provides a policy engine designed to guide runtime adaptation mechanisms to make algorithmic changes, re-allocate resources, or change scheduling rules when appropriate conditions occur.







Publishing Center of South Ural State University (454080, Lenin prospekt, 76, Chelyabinsk, Russia)