COT 3: CUDA Optimization Tutorial
Parallelizing and Optimizing Programs for GPU Acceleration using CUDA
Monday morning, October 10
Galveston Island, TX
Held in conjunction with the Twentieth International Conference on Parallel Architectures and Compilation Techniques
Intended audience: anyone interested in GPU programming, as well as people with prior CUDA experience who want to improve the performance of their code.
GPU hardware overview and CUDA introduction
Serial CPU C code example and porting to CUDA
Parallelization and step-by-step performance tuning
Introduction to an irregular tree algorithm
Detailed optimization of six CUDA kernels
Summary, conclusion, and outlook
GPUs can offer an order of magnitude higher performance, energy efficiency, and price/performance than CPUs, but it
is substantially harder to write efficient programs for GPUs, especially if the computation is not very regular. In
this tutorial, I will first describe the key hardware features of GPUs and give an overview of the CUDA C
programming language. Then I will introduce a regular n-body code and show how to port it to CUDA and
parallelize it. Next, I will illustrate, in ten steps, how to optimize this code until it runs at close to peak
performance on a high-end GPU.
After the break, I will present a more sophisticated but irregular n-body code that repeatedly builds an unbalanced
tree data structure and performs complex traversals on it. Then I will discuss how each of the six kernels of this
irregular algorithm can be implemented in CUDA to exploit the GPU hardware as fully as possible. The final version
of this code running on one GPU outperforms an optimized pthreads implementation running on 256 CPU cores. I will
conclude the tutorial with a summary of the key optimization strategies and an outlook on the future.
Martin Burtscher, Texas State University-San Marcos, firstname.lastname@example.org
Martin Burtscher is an Associate Professor in the Department of Computer Science at Texas State University-San Marcos.
He received a combined BS/MS degree in computer science from the Swiss Federal Institute of Technology (ETH) Zurich
in 1996 and a Ph.D. degree in computer science from the University of Colorado at Boulder in 2000. Martin's research
interests include efficient parallelization of programs for multicore CPUs and GPUs as well as automatic performance
assessment and optimization of HPC applications. He is a senior member of the IEEE, its Computer Society, and the ACM.
Martin has co-authored over 60 peer-reviewed publications, including a book chapter in NVIDIA's GPU Computing Gems.