Improving Efficiency of Hybrid HPC Systems Using a Multi-agent Scheduler and Machine Learning Methods

Vladimir S. Zaborovsky; Lev V. Utkin; Vladimir A. Muliukha; Alexey A. Lukashin

doi:10.14529/jsfi230207

Authors

Vladimir S. Zaborovsky, Peter the Great St.Petersburg Polytechnic University, St.Petersburg, Russian Federation, https://orcid.org/0000-0003-2284-9833
Lev V. Utkin, Peter the Great St.Petersburg Polytechnic University, St.Petersburg, Russian Federation, https://orcid.org/0000-0002-5637-1420
Vladimir A. Muliukha, Peter the Great St.Petersburg Polytechnic University, St.Petersburg, Russian Federation, https://orcid.org/0000-0002-3583-7324
Alexey A. Lukashin, Peter the Great St.Petersburg Polytechnic University, St.Petersburg, Russian Federation, https://orcid.org/0000-0002-1906-2207

Keywords:

high performance computing, hybrid computing systems, machine learning, multi-agent scheduler, random survival forest, survival analysis, survival function, XAI

Abstract

One promising direction for improving hybrid reconfigurable high-performance computing platforms that operate as collaborative applied computing centers is to include them as an active component of the machine learning ecosystem. This opens up new opportunities to raise the actual performance achieved on various application tasks by making the management of available computing resources intelligent. The task scheduler is crucial to the efficiency of hybrid supercomputer platforms, which combine dozens of processor blocks with different architectures, including specialized graphics and reconfigurable accelerators. To form an optimal order of jobs in the HPC queue, the article proposes applying deep survival machine learning models, which increase the accuracy of estimates of the time needed for successful task execution and of the required amount of computing resources. The main peculiarity of these models is that they are trained on censored heterogeneous data collected by a multi-agent scheduler during previous periods of task execution. To ensure high accuracy, a random survival forest is used as part of the machine learning model; it provides survival and hazard functions within the framework of survival analysis. A specific weighted clustering procedure is proposed to divide tasks according to their execution times as well as their feature vectors. Numerical experiments with actual data illustrate the advantages of the presented approach.
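
The idea of estimating a survival function from censored task records can be illustrated with a minimal sketch. It uses the Kaplan-Meier estimator, a standard nonparametric method, and not the random survival forest described in the article. The function name `kaplan_meier`, the job durations, and the censoring flags are hypothetical and chosen only for illustration; a job is treated as censored when its completion was not observed (for example, it was stopped at a time limit).

```python
from collections import Counter


def kaplan_meier(durations, completed):
    """Return [(t, S(t))] at each distinct time where a job completed.

    durations: observed run times (hypothetical units, e.g. minutes)
    completed: True if the job finished successfully,
               False if the observation was censored
    """
    # Number of observed completions at each time point.
    events = Counter(t for t, c in zip(durations, completed) if c)
    # Number of jobs leaving the risk set at each time point
    # (completed or censored).
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit update: S(t) = S(t-) * (1 - d / n_at_risk).
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Censored jobs reduce the risk set without changing S(t).
        at_risk -= exits[t]
    return curve


# Invented example data: six jobs, two of them censored.
durations = [5, 8, 8, 12, 15, 20]
completed = [True, True, False, True, False, True]
curve = kaplan_meier(durations, completed)

for t, s in curve:
    print(f"S({t}) = {s:.4f}")
```

Here S(t) is the estimated probability that a task is still running after time t. Censored jobs are not discarded: they stay in the risk set until their last observed time, which is why survival models suit scheduler logs with incomplete runs.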
