Deep Analysis of Job State Statistics on Lomonosov-2 Supercomputer
Abstract
It is common knowledge that the steadily growing capabilities of HPC systems are limited by a number of efficiency-related issues. The causes vary widely: hardware failures, incorrect job scheduling, peculiarities of the algorithm, specifics of the chosen programming technology, etc. Most of these issues can be detected by precise analysis, but studying every application run this way is very resource-intensive. We therefore performed a less complicated analysis of the whole supercomputer job flow. In this paper we share our experience of analyzing the job states that the SLURM resource manager assigns to user applications on the Lomonosov-2 system at the Supercomputing Center of Lomonosov Moscow State University. Statistics on job states were collected and revealed that the ratio of correctly finished jobs (those with the COMPLETED state) was rather low. The job owners were asked whether the distribution of their jobs' states is normal for their applications. This user feedback was processed, and as a result some new ways to gain efficiency were revealed.
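The kind of job state statistic described in the abstract can be illustrated with a minimal sketch. It is not the authors' tooling: it assumes the accounting data is available as pipe-separated `JobID|State` lines, the shape produced by `sacct -X -n -P --format=JobID,State`, and the sample records and the `state_shares` helper are hypothetical.

```python
from collections import Counter

# Hypothetical sample in the shape of `sacct -X -n -P --format=JobID,State`.
# The job IDs and states below are made up for illustration only.
SAMPLE = """\
1001|COMPLETED
1002|FAILED
1003|CANCELLED by 5012
1004|TIMEOUT
1005|COMPLETED
1006|NODE_FAIL
1007|COMPLETED
1008|TIMEOUT
"""


def state_shares(text):
    """Return the fraction of jobs in each state (hypothetical helper)."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        _job_id, _sep, state = line.partition("|")
        if not state.strip():
            continue  # skip malformed lines that carry no state field
        # "CANCELLED by <uid>" is folded into plain "CANCELLED".
        counts[state.split()[0]] += 1
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}


shares = state_shares(SAMPLE)
# Print states from most to least frequent.
for state, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{state:<10} {share:6.1%}")
```

A low share for COMPLETED in such a table is the kind of signal that, per the abstract, prompted the authors to ask job owners whether the distribution was expected for their applications.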
Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.