COT 5: CUDA Optimization Tutorial
Parallelizing and Optimizing Programs for GPU Acceleration using CUDA
Morning of September 23, 2012
Minneapolis, MN
Held in conjunction with
The 21st International Conference on
Parallel Architectures and Compilation Techniques
Intended audience
Anyone who is curious about GPU programming as well as people with prior CUDA experience who want to improve the performance of their codes.
Schedule
GPU hardware overview and CUDA introduction
Serial CPU C code example and porting to CUDA
Parallelization and step-by-step performance tuning
Break
Introduction of irregular tree algorithm
Detailed optimization of four irregular kernels
Summary, conclusion, and outlook
Abstract
GPUs offer an order of magnitude higher performance, energy efficiency, and price/performance than multicore CPUs, but
it is substantially harder to write efficient programs for GPUs, especially if the programs are not very regular. In
this tutorial, I will first talk about the key hardware features of GPUs and give an overview of the CUDA C programming
language. Then I will introduce a regular n-body code and show how to port it to CUDA and parallelize it. Next, I will
illustrate, in ten steps, how to optimize this code until it runs at close to peak performance on a high-end GPU. After
the break, I will present a more sophisticated but irregular n-body code that repeatedly builds an unbalanced tree data
structure and performs complex traversals on it. Then, I will discuss how this irregular algorithm can be implemented
in CUDA to maximally exploit the GPU hardware. The final version of this code running on one GPU is over twenty times
faster than an optimized OpenMP implementation running on a high-end hex-core CPU. I will conclude the tutorial with a
summary of general optimization strategies and an outlook on the future.
Presenter
Martin Burtscher, Texas State University-San Marcos, burtscher@txstate.edu
Speaker biography
Martin Burtscher is Associate Professor in the Department of Computer Science at Texas State University-San Marcos.
He received the combined BS/MS degree in computer science from the Swiss Federal Institute of Technology (ETH) Zurich
in 1996 and the Ph.D. degree in computer science from the University of Colorado at Boulder in 2000. Martin's research
interests include efficient parallelization of programs for GPUs and multicore CPUs as well as automatic performance
assessment and optimization of HPC applications. He is a senior member of the IEEE, its Computer Society, and the ACM.
Martin has co-authored over 60 peer-reviewed publications, including a book chapter in NVIDIA's GPU Computing Gems, is
the recipient of an NVIDIA Academic Partnership award, and is the PI of a CUDA Teaching Center.