Workflows for Science: a Challenge when Facing the Convergence of HPC and Big Data

Rosa M Badia, Eduard Ayguade, Jesus Labarta


Workflows have traditionally been used as a means to describe and implement the computations (usually parametric studies and explorations searching for the best solution) that scientific researchers want to perform.

A workflow is not only the computing application, but also a way of documenting a process. Science workflows may be of very different natures depending on the area of research, matching the actual experiment that the scientist wants to perform.

Workflow Management Systems are environments that offer researchers tools to define, publish, execute and document their workflows.

In some cases, science workflows are used to generate data; in others they are used to analyse existing data; only in a few cases are workflows used both to generate and to analyse data. In some cases the design of experiments is generated blindly, without a clear idea of which points are worth computing or simulating, and ends up with a huge amount of computation performed following a brute-force strategy.

However, the evolution of systems and the large amount of data generated by the applications require in-situ analysis of the data, and thus new solutions for developing workflows that include both the simulation/computational part and the analytic part. Moreover, running both components, computation and analytics, together will make it possible to define more dynamic workflows, in which new computations are decided by the analytics in a more efficient way.
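The idea of analytics deciding which computations to run next can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the architecture proposed in the paper: the functions `simulate`, `analyse` and `dynamic_workflow` are hypothetical names, the "simulation" is a trivial quadratic, and a thread pool from the standard library stands in for a real workflow runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(x):
    """Stand-in for an expensive simulation task (hypothetical toy model)."""
    return (x - 2.0) ** 2


def analyse(results):
    """Stand-in for the in-situ analytics step: return the best point seen so far."""
    return min(results, key=results.get)


def dynamic_workflow(start, step, rounds):
    """Alternate simulation and analytics; the analytics choose the next batch."""
    results = {}
    centre = start
    with ThreadPoolExecutor() as pool:
        for _ in range(rounds):
            # Only simulate points that have not been computed yet.
            candidates = [
                c for c in (centre - step, centre, centre + step)
                if c not in results
            ]
            # The simulations of one batch run concurrently.
            for x, y in zip(candidates, pool.map(simulate, candidates)):
                results[x] = y
            # The analytics decide where the next simulations are placed.
            best = analyse(results)
            if best == centre:
                step /= 2.0  # no better neighbour found: refine the search
            centre = best
    return centre, len(results)


if __name__ == "__main__":
    best, evaluations = dynamic_workflow(start=0.0, step=1.0, rounds=10)
    print(best, evaluations)
```

In contrast to a brute-force sweep, which would fix all sample points in advance, each round here launches only the simulations that the previous analysis step identified as promising.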

The first part of the paper will review the current approaches that a set of scientific communities follow in developing their workflows. Because it is based on a selection of several scientific communities and use cases that use a specific Workflow Management System, this survey may be incomplete as a review of the literature on workflows, but we expect that the reader appreciates the effort made to capture the scientific communities' needs and requirements.

The second part of the paper will propose a new software architecture for developing a new family of end-to-end workflows, one that enables the management of dynamic workflows composed of simulations, analytics and visualization, including inputs/outputs from streams.





Publishing Center of South Ural State University (454080, Lenin prospekt, 76, Chelyabinsk, Russia)