Neo-heterogeneous Programming and Parallelized Optimization of a Human Genome Re-sequencing Analysis Software Pipeline on TH-2 Supercomputer

Xiangke Liao; Shaoliang Peng; Yutong Lu; Yingbo Cui; Chengkun Wu; Heng Wang; Jiajun Wen

doi:10.14529/jsfi150104

Authors

Xiangke Liao, National University of Defense Technology, Changsha
Shaoliang Peng, National University of Defense Technology, Changsha
Yutong Lu, National University of Defense Technology, Changsha
Yingbo Cui, National University of Defense Technology, Changsha
Chengkun Wu, National University of Defense Technology, Changsha
Heng Wang, National University of Defense Technology, Changsha
Jiajun Wen, National University of Defense Technology, Changsha

Abstract

The growth of biological big data far outpaces the growth in compute power predicted by Moore's Law. Genomic data is accumulating explosively and demands enormous computing power, yet current computational methods cannot scale with it. In this paper, we harness large-scale computing resources to address the big data problems of genome processing on the TH-2 supercomputer. TH-2 adopts a neo-heterogeneous architecture and has 16,000 compute nodes with 32,000 Intel Xeon CPUs and 48,000 Xeon Phi MICs. Heterogeneity, scalability, and parallel efficiency pose great challenges for deploying the genome analysis software pipeline on TH-2. Runtime profiling shows that SOAP3-dp and SOAPsnp are the most time-consuming parts of the pipeline (up to 70% of total runtime), so they require deep parallel optimization and large-scale deployment. To address this, we first design a series of new parallel algorithms for SOAP3-dp and SOAPsnp, respectively, to eliminate spatial-temporal redundancy. We then propose a CPU/MIC collaborative parallel computing method within a node to fill otherwise idle CPU/MIC time slots. We also propose a series of scalable parallel algorithms and large-scale programming methods to reduce communication between nodes. Finally, we deploy and evaluate our work on TH-2 at different scales. At the largest scale, the whole process takes 8.37 hours on 8,192 nodes to analyze a 300 TB dataset of whole genome sequences from 2,000 humans, which can take as long as 8 months on a commodity server. The speedup is about 700x.
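
The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a 30-day month and decimal terabytes (1 TB = 1,000 GB), neither of which is stated in the paper, and the variable and function names are ours.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Assumptions (not from the paper): a month is 30 days, 1 TB = 1,000 GB.
HOURS_PER_MONTH = 30 * 24


def speedup(baseline_hours: float, parallel_hours: float) -> float:
    """Ratio of the baseline runtime to the parallel runtime."""
    return baseline_hours / parallel_hours


baseline_hours = 8 * HOURS_PER_MONTH  # "as long as 8 months" on one server
th2_hours = 8.37                      # reported runtime on 8,192 nodes

ratio = speedup(baseline_hours, th2_hours)
per_genome_gb = 300 * 1000 / 2000     # 300 TB spread over 2,000 genomes
genomes_per_hour = 2000 / th2_hours   # aggregate throughput on TH-2

print(f"speedup ~ {ratio:.0f}x")
print(f"data per genome ~ {per_genome_gb:.0f} GB")
print(f"throughput ~ {genomes_per_hour:.0f} genomes/hour")
```

Under these assumptions the ratio comes out just under 700, consistent with the "about 700x" quoted above, and each genome corresponds to roughly 150 GB of input data.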
