COT 3: CUDA Optimization Tutorial

Parallelizing and Optimizing Programs for GPU Acceleration using CUDA

Monday morning, October 10
Galveston Island, TX

Held in conjunction with
The Twentieth International Conference on
Parallel Architectures and Compilation Techniques

Intended audience

Anyone interested in GPU programming, as well as people with prior CUDA experience who want to improve the performance of their codes.


GPU hardware overview and CUDA introduction
Serial CPU C code example and porting to CUDA
Parallelization and step-by-step performance tuning
Introduction of irregular tree algorithm
Detailed optimization of six CUDA kernels
Summary, conclusion, and outlook


GPUs offer an order of magnitude higher performance, energy efficiency, and price/performance than CPUs, but it is substantially harder to write efficient programs for GPUs, especially if the computation is not very regular. In this tutorial, I will first talk about the key hardware features of GPUs and give an overview of the CUDA C programming language. Then I will introduce a regular n-body code and show how to port it to CUDA and parallelize it. Next, I will illustrate, in ten steps, how to optimize this code until it runs at close to peak performance on a high-end GPU.
After the break, I will present a more sophisticated but irregular n-body code that repeatedly builds an unbalanced tree data structure and performs complex traversals on it. Then I will discuss how each of the six kernels of this irregular algorithm can be implemented in CUDA to maximally exploit the GPU hardware. Running on one GPU, the final version of this code outperforms an optimized pthreads implementation running on 256 CPU cores. I will conclude the tutorial with a summary of the key optimization strategies and an outlook on the future.


Martin Burtscher, Texas State University-San Marcos

Speaker biography

Martin Burtscher is an Associate Professor in the Department of Computer Science at Texas State University-San Marcos. He received a combined BS/MS degree in computer science from the Swiss Federal Institute of Technology (ETH) Zurich in 1996 and a Ph.D. degree in computer science from the University of Colorado at Boulder in 2000. Martin's research interests include efficient parallelization of programs for multicore CPUs and GPUs as well as automatic performance assessment and optimization of HPC applications. He is a senior member of the IEEE, its Computer Society, and the ACM. Martin has co-authored over 60 peer-reviewed publications, including a book chapter in NVIDIA's GPU Computing Gems.