Model-Driven One-Sided Factorizations on Multicore Accelerated Systems
References
Intel Xeon Phi Coprocessor System Software Developers Guide. http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-system-software-developers-guide.
E. Agullo, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, J. Langou, H. Ltaief, P. Luszczek, and A. YarKhan. PLASMA Users Guide. Technical report, ICL, University of Tennessee, 2010.
J. Auerbach, D. F. Bacon, I. Burcea, P. Cheng, S. J. Fink, R. Rabbah, and S. Shukla. A compiler and runtime for heterogeneous computing. In Proceedings of the 49th Annual Design Automation Conference, DAC'12, pages 271-276, New York, NY, USA, 2012. ACM.
C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23(2):187-198, 2011.
R. Barik, Z. Budimlic, V. Cavé, S. Chatterjee, Y. Guo, D. Peixotto, R. Raman, J. Shirako, S. Tasirlar, Y. Yan, Y. Zhao, and V. Sarkar. The Habanero Multicore Software Research Project. In Proceedings of the 24th ACM SIGPLAN Conference Companion on Object Oriented Programming Systems Languages and Applications, OOPSLA’09, pages 735-736, New York, NY, USA, 2009. ACM.
N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec. 2008.
A. J. Bernstein. Analysis of programs for parallel processing. IEEE Transactions on Electronic Computers, EC-15(5):757-763, October 1966.
R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. SIGPLAN Not., 30:207-216, August 1995.
C. Cao, J. Dongarra, P. Du, M. Gates, P. Luszczek, and S. Tomov. clMAGMA: High Performance Dense Linear Algebra with OpenCL. In International Workshop on OpenCL, IWOCL 2013, Atlanta, Georgia, USA, May 13-14 2013.
E. Chan, E. S. Quintana-Orti, G. Quintana-Orti, and R. van de Geijn. Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In Proceedings of the nineteenth annual ACM symposium on parallel algorithms and architectures, SPAA’07, pages 116-125, New York, NY, USA, 2007. ACM.
NVIDIA CUBLAS library. https://developer.nvidia.com/cublas.
J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. In 10th International Conference on Parallel Processing and Applied Mathematics, PPAM 2013, Warsaw, Poland, September 8-11 2013.
K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: Programming the Memory Hierarchy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC’06, New York, NY, USA, 2006. ACM.
C. H. Gonzalez and B. B. Fraguela. A framework for argument-based task synchronization with automatic detection of dependencies. Parallel Computing, 39(9):475-489, 2013. Novel On-Chip Parallel Architectures and Software Support.
Intel. Math Kernel Library. http://software.intel.com/intel-mkl/.
G. Kahn. The semantics of a simple language for parallel programming. In IFIP Congress, pages 471-475, 1974.
J. Kurzak, P. Luszczek, A. YarKhan, M. Faverge, J. Langou, H. Bouwmeester, and J. Dongarra. Multithreading in the PLASMA Library. In Handbook of Multi and Many-Core Processing: Architecture, Algorithms, Programming, and Applications, Computer and Information Science Series. Chapman and Hall/CRC, April 26 2013.
L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558-565, July 1978.
MAGMA library. http://icl.cs.utk.edu/magma/.
R. Nath, S. Tomov, and J. Dongarra. An improved MAGMA GEMM for Fermi graphics processing units. Int. J. High Perf. Comput. Applic., 24(4):511-515, 2010. DOI: 10.1177/1094342010385729.
C. J. Newburn, R. Deodhar, S. Dmitriev, R. Murty, R. Narayanaswamy, J. Wiegert, F. Chinchilla, and R. McGuire. Offload compiler runtime for the Intel Xeon Phi™ coprocessor. In ISC, pages 239-254, 2013.
J. M. Perez, R. M. Badia, and J. Labarta. A dependency-aware task-based programming environment for multi-core architectures. In Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September - 1 October 2008, Tsukuba, Japan, pages 142-151. IEEE, 2008.
M. C. Rinard, D. J. Scales, and M. S. Lam. Jade: a high-level, machine-independent language for parallel programming. Computer, 26(6):28-38, 1993. DOI: 10.1109/2.214440.
J. E. Rodrigues. A graph model for parallel computations. Technical Report MIT/LCS/TR-64, MIT, Cambridge, MA, USA, Sept. 1969.
C. J. Rossbach, Y. Yu, J. Currey, J.-P. Martin, and D. Fetterly. Dandelion: A compiler and runtime for heterogeneous systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP’13, pages 49-68, New York, NY, USA, 2013. ACM.
F. Song, S. Tomov, and J. Dongarra. Enabling and Scaling Matrix Computations on Heterogeneous Multi-core and multi-GPU Systems. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS’12, pages 365-376, New York, NY, USA, 2012. ACM.
P. E. Strazdins. Lookahead and algorithmic blocking techniques compared for parallel matrix factorization. In 10th International Conference on Parallel and Distributed Computing and Systems, IASTED, Las Vegas, USA, 1998.
P. E. Strazdins. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Int. J. Parallel Distrib. Systems Networks, 4(1):26-35, 2001.
L. G. Valiant. Bulk-synchronous parallel computers. In M. Reeve, editor, Parallel Processing and Artificial Intelligence, pages 15-22. John Wiley & Sons, 1989.
V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC'08, Austin, TX, November 15-21 2008. IEEE Press. DOI: 10.1145/1413370.1413402.
A. YarKhan. Dynamic Task Execution on Shared and Distributed Memory Architectures. PhD thesis, University of Tennessee, December 2012.
A. YarKhan, J. Kurzak, and J. Dongarra. QUARK Users' Guide: QUeueing And Runtime for Kernels. Technical report, Innovative Computing Laboratory, University of Tennessee, 2011.