Facilitating HPC Operation and Administration via Cloud

Chaoqun Sha, Jingfeng Zhang, Lei An, Yongsheng Zhang, Zhipeng Wang, Tomi Ilijas, Nejc Bat, Miha Verlic, Qing Ji

Abstract


Experiencing tremendous growth, Cloud Computing offers a number of advantages over other distributed platforms. Bringing these advantages to High Performance Computing (HPC) has led to the development of HPCaaS (HPC as a Service), which has mainly focused on flexible access to resources, cost-effectiveness, and freedom from maintenance for end users. Beyond providing and using HPCaaS, HPC centers could leverage Cloud Computing technology further, for instance to facilitate the operation and administration of deployed HPC systems, a challenge commonly faced by most supercomputer centers.

This paper reports on EasyOP, a product developed to realize the idea that one or more Cloud or HPC facilities can be run through a centralized and unified control platform. EasyOP's main function is to send information about HPC systems' hardware and system software, failure alarms, job scheduling, etc. to the Wuxi cloud computing center. After analysis and processing, much valuable data, including alarm and job scheduling status, can be shared with HPC users through SMS, email, and WeChat. More importantly, with the data accumulated at the cloud computing center, EasyOP can offer several easy-to-use functions, such as user management, monthly/yearly reports, and one-screen monitoring. By the end of 2016, EasyOP had successfully served more than 50 HPC systems with almost 10,000 nodes and over 300 regular users.







Publishing Center of South Ural State University (454080, Lenin prospekt, 76, Chelyabinsk, Russia)