COT 4: CUDA Optimization Tutorial


Parallelizing and Optimizing Programs for GPU Acceleration using CUDA


February 25, 2012
New Orleans, LA

Held in conjunction with the
17th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming

Intended audience

Anyone interested in GPU programming as well as people with prior CUDA experience who want to improve the performance of their codes.

Schedule

GPU hardware overview and CUDA introduction
Serial CPU C code example and porting to CUDA
Parallelization and step-by-step performance tuning
Break
Introduction to an irregular tree algorithm
Detailed optimization of six CUDA kernels
Summary, conclusion, and outlook

Abstract

GPUs offer an order of magnitude higher performance, energy efficiency, and price/performance than CPUs, but it is substantially harder to write efficient programs for GPUs, especially if the computation is not very regular. In this tutorial, I will first talk about the key hardware features of GPUs and give an overview of the CUDA C programming language. Then I will introduce a regular n-body code and show how to port it to CUDA and parallelize it. Next, I will illustrate, in ten steps, how to optimize this code until it runs at close to peak performance on a high-end GPU.
After the break, I will present a more sophisticated but irregular n-body code that repeatedly builds an unbalanced tree data structure and performs complex traversals on it. I will discuss how each of the six kernels of this irregular algorithm can be implemented in CUDA to maximally exploit the GPU hardware. The final version of this code running on one GPU outperforms an optimized pthreads implementation running on 256 CPU cores. I will conclude the tutorial with a summary of the key optimization strategies and an outlook into the future.

Presenter

Martin Burtscher, Texas State University-San Marcos, burtscher@txstate.edu

Speaker biography

Martin Burtscher is Associate Professor in the Department of Computer Science at Texas State University-San Marcos. He received the combined BS/MS degree in computer science from the Swiss Federal Institute of Technology (ETH) Zurich in 1996 and the Ph.D. degree in computer science from the University of Colorado at Boulder in 2000. Martin's research interests include efficient parallelization of programs for GPUs and multicore CPUs as well as automatic performance assessment and optimization of HPC applications. He is a senior member of the IEEE, its Computer Society, and the ACM. Martin has co-authored over 60 peer-reviewed publications, including a book chapter in NVIDIA's GPU Computing Gems, is the recipient of an NVIDIA Academic Partnership award, and is the PI of a CUDA Teaching Center.