Exploring Scheduling Effects on Task Performance with TaskInsight
The complex memory hierarchies of today's machines make it very difficult to estimate the execution time of tasks: depending on where the data is placed in memory, tasks of the same type may end up performing differently. Multiple scheduling heuristics have improved performance by taking into account memory-related properties such as data locality and cache sharing. However, tasks in certain applications, or in certain phases of an application, may take little or no advantage of these optimizations. Without understanding when such optimizations are effective, we may incur unnecessary overhead at the runtime level.
In previous work, we introduced TaskInsight, a technique that characterizes how different task schedulers affect the memory behavior of an application by analyzing data reuse across tasks. We now use this tool to dynamically trace the scheduling decisions of multithreaded applications throughout their execution and to analyze how memory reuse can indicate when and why locality-aware optimizations are effective and impact performance.
By applying TaskInsight to several of the Montblanc benchmarks, we demonstrate how to detect the particular scheduling decisions that produced a variation in performance, along with the underlying reasons. This flexible insight is key for both the programmer and the runtime, as it allows assigning the optimal scheduling policy to particular executions or phases.