DSLs and Search for Linear Algebra Performance Optimization

Nelson, Thomas Harrison

Graduate Thesis Or Dissertation

DSLs and Search for Linear Algebra Performance Optimization Public Deposited

Analytics

Download PDF

Citations

Citeable URL: https://scholar.colorado.edu/concern/graduate_thesis_or_dissertations/gq67jr391

Abstract

Linear algebra and tensor algebra computations lie on the critical path of many scientific applications. These numerical problems have a mathematical structure that allows for complex transformations of the code. This thesis presents search strategies for two Domain-Specific Languages (DSLs) focusing on high performance numerical computing domains. The first DSL is Build-to-Order BLAS (BTO), a language for matrix computation. Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) to obtain portable high performance. However, many numerical algorithms require several BLAS calls in sequence, and those successive calls do not achieve optimal performance. The entire sequence needs to be optimized in concert. Instead of vendor-tuned BLAS, a programmer could start with source code in Fortran or C and use a state-of-the-art optimizing compiler. However, experiments in this thesis show that optimizing compilers often attain only one-quarter of the performance of hand-optimized code. I present a scalable search algorithm for BTO that reliably achieves high performance by choosing the best combination of loop fusion, array contraction, and multithreading for data parallelism. The second DSL is the Optimizing Compiler with Tensor OPeration Intelligence (OCTOPI). Tensor computations are characterized by arrays with numerous dimensions, inherent parallelism, and moderate data reuse. The best-performing implementation is heavily dependent on the tensor dimensionality and the target architecture. This optimization problem is especially challenging when the computation requires many iterations with tensors of small dimensions. I created a high-level search representation and input language as part of a toolchain to solve these problems. In this thesis, OCTOPI maps such tensor computations to GPUs, starting with a high-level tensor input language and producing efficient CUDA code as output. This approach combines tensor-specific mathematical transformations with a GPU decision algorithm and autotuning of a large parameter space. Generated code shows significant performance gains over sequential and OpenMP parallel code, and a comparison with OpenACC shows the importance of autotuning and other optimizations for achieving efficient results.

Creator