Workflows for Science: a Challenge when Facing the Convergence of HPC and Big Data

Authors

  • Rosa M Badia, Barcelona Supercomputing Center, Barcelona; Consejo Superior de Investigaciones Cientificas, Madrid
  • Eduard Ayguade, Barcelona Supercomputing Center, Barcelona; Universitat Politecnica de Catalunya, Barcelona
  • Jesus Labarta, Barcelona Supercomputing Center, Barcelona; Universitat Politecnica de Catalunya, Barcelona

DOI:

https://doi.org/10.14529/jsfi170102

Abstract

Workflows have traditionally been used as a means to describe and implement the computations (usually parametric studies and explorations searching for the best solution) that scientific researchers want to perform.

A workflow is not only a computing application but also a way of documenting a process. Scientific workflows can differ greatly in nature depending on the area of research, matching the actual experiment that the scientist wants to perform.

Workflow Management Systems are environments that offer researchers tools to define, publish, execute and document their workflows.

In some cases, scientific workflows are used to generate data; in others, they are used to analyse existing data; only in a few cases are workflows used both to generate and to analyse data. The design of experiments is in some cases generated blindly, without a clear idea of which points are relevant to compute or simulate, which results in a huge amount of computation performed with a brute-force strategy.

However, the evolution of systems and the large amount of data generated by the applications require in-situ analysis of the data, and thus new solutions for developing workflows that include both the simulation/computational part and the analytics part. Moreover, the fact that both components, computation and analytics, can be run together will make it possible to define more dynamic workflows, in which the analytics decide which new computations to run, making the exploration more efficient.

The first part of the paper reviews the current approaches that a set of scientific communities follow in the development of their workflows. Because it focuses on a selection of scientific communities and on use cases that rely on a specific Workflow Management System, this survey may be incomplete as a review of the literature on workflows, but we expect that the reader will appreciate the effort made to understand the needs and requirements of these scientific communities.

The second part of the paper proposes a new software architecture for developing a new family of end-to-end workflows that enable the management of dynamic workflows composed of simulations, analytics and visualization, including inputs and outputs from streams.

Published

2017-04-12

How to Cite

Badia, R. M., Ayguade, E., & Labarta, J. (2017). Workflows for Science: a Challenge when Facing the Convergence of HPC and Big Data. Supercomputing Frontiers and Innovations, 4(1), 27–47. https://doi.org/10.14529/jsfi170102