Improving Efficiency of Hybrid HPC Systems Using a Multi-agent Scheduler and Machine Learning Methods
DOI:
https://doi.org/10.14529/jsfi230207
Keywords:
high performance computing, hybrid computing systems, machine learning, multi-agent scheduler, random survival forest, survival analysis, survival function, XAI
Abstract
One promising direction for improving hybrid reconfigurable high-performance computing platforms that operate as collaborative applied computing centers is to make them an active component of the machine learning ecosystem. This opens up new opportunities to improve real-world performance on a variety of application tasks through intelligent management of the available computing resources. The task scheduler is crucial to the efficiency of hybrid supercomputer platforms, which combine dozens of processor blocks with different architectures, including specialized graphics and reconfigurable accelerators. To form an optimal order of jobs in the HPC queue, the article proposes applying deep survival machine learning models, which increase the accuracy of estimates of the time needed for successful task execution and of the required amount of computing resources. The main peculiarity of these models is that they are trained on censored heterogeneous data collected by a multi-agent scheduler during previous periods of task execution. To ensure high accuracy, a random survival forest is used as part of the machine learning model; it provides survival and hazard functions within the framework of survival analysis. A specific weighted clustering procedure is proposed to divide tasks according to their execution times as well as their feature vectors. Numerical experiments with real data illustrate the performance gains of the presented approach.
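To make the terms "censored data" and "survival function" concrete, the following minimal sketch estimates a survival function S(t) = P(T > t) from job run times, some of which are censored (the job stopped being observed before it finished). It uses the standard Kaplan-Meier estimator, not the random survival forest described in the abstract, and the function name, variable names, and toy data are assumptions made purely for illustration.

```python
from collections import Counter


def kaplan_meier(durations, completed):
    """Return [(t, S(t))] at each distinct time where at least one job completed.

    durations: observed run times (hypothetical units, e.g. minutes)
    completed: True if the job finished; False if the observation was
               censored (e.g. the job was stopped before completion)
    """
    # Number of completions (events) observed at each time
    events = Counter(t for t, c in zip(durations, completed) if c)
    # Number of jobs leaving the risk set at each time (finished or censored)
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit update: multiply by the share of at-risk jobs still running
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Censored jobs shrink the risk set without changing the estimate
        at_risk -= exits[t]
    return curve


# Toy data invented for illustration; not taken from the article
durations = [5, 8, 8, 12, 15, 20]
completed = [True, True, False, True, False, True]
curve = kaplan_meier(durations, completed)

for t, s in curve:
    print(f"S({t}) = {s:.4f}")
```

Censored jobs still count toward the number at risk up to the moment they drop out, which is why such records carry information and should not simply be discarded. A model such as a random survival forest goes further by conditioning the estimated survival function on each task's feature vector.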
License
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-Non Commercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.