Development of Computational Pipeline Software for Genome/Exome Analysis on the K Computer

Kento Aoyama, Masanori Kakuta, Yuri Matsuzaki, Takashi Ishida, Masahito Ohue, Yutaka Akiyama

Abstract


Pipeline software, which comprises chains of tools and applications for specific data processing, is widely used in bioinformatics research to analyze several data types, such as genome data. Recent trends in genome analysis require pipeline software that makes optimal use of computational resources so that the large-scale biological data accumulated daily can be handled efficiently. However, using pipeline software in bioinformatics tends to be problematic owing to its large memory and storage requirements, the increasing number of job submissions, and a wide range of software dependencies. This paper presents massively parallel genome/exome analysis pipeline software that addresses these difficulties and can be executed on a large number of K computer nodes. The proposed pipeline incorporates workflow management functionality that takes the task-dependency graph of internal executions into account by extending the dynamic task distribution framework. Evaluation experiments on the core pipeline functionality, performed using an actual exome dataset, demonstrate good scalability on over a thousand nodes. Additionally, because performance bottlenecks are a major challenge in pipeline parallelization, this study proposes several approaches to resolving them that draw on domain knowledge of the pipeline's internal executions.




Publishing Center of South Ural State University (454080, Lenin prospekt, 76, Chelyabinsk, Russia)