This material is presented to
ensure timely dissemination of scholarly and technical work. Copyright and all
rights therein are retained by authors or by other copyright holders. All persons
copying this information are expected to adhere to the terms and constraints
invoked by each author's copyright. In most cases, these works may not be
reposted without the explicit permission of the copyright holder.
Alex Fallin and Martin Burtscher.
Reducing Memory-Bus Energy Consumption of GPUs via Software-Based Bit-Flip Minimization.
Proceedings of the Workshop on Memory Centric High-Performance Computing. November 2022.
Energy consumption is a major concern in high-performance computing. One important contributing factor is the number of times the wires are charged and discharged, i.e., how often they switch from '0' to '1' and vice versa. We describe a software technique to minimize this switching activity in GPUs, thereby lowering the energy consumption. Our technique targets the memory bus, which comprises many high-capacitance wires that are frequently used. Our approach is to strategically change data values in the source code such that loading and storing them yields fewer bit flips. The new values are guaranteed to produce the same control flow and program output. Measurements on GPUs from two generations show that our technique allows programmers to save up to 9.3% of the whole-GPU energy consumption and 1.2% on average across eight graph-analytics CUDA codes without impacting performance.
Noushin Azami and Martin Burtscher.
Compressed In-memory Graphs for Accelerating GPU-based Analytics.
Proceedings of the 12th SC Workshop on Irregular Applications: Architectures and Algorithms. November 2022.
Processing large graphs has become an important irregular workload. We present Massively Parallel Log Graphs (MPLG) to accelerate GPU graph codes, including highly optimized codes. MPLG combines a compressed in-memory representation with low-overhead parallel decompression. This yields a speedup if the boost in memory performance due to the reduced footprint outweighs the overhead of the extra instructions to decompress the graph on the fly. However, achieving a sufficiently low overhead is difficult, especially on GPUs with their high-bandwidth memory. Prior work has only successfully employed similar ideas on CPUs, but those approaches exhibit limited parallelism, making them unsuitable for GPUs. On large real-world inputs, MPLG speeds up graph analytics by up to 67% on a Titan V GPU. Averaged over 15 graphs from several domains, it improves the performance of Rodinia's breadth-first search by 11.9%, Gardenia's connected components by 5.8%, and ECL's graph coloring by 5.0%.
Hochan Lee, William Ruys, Yineng Yan, Sean Stephens, Bozhi You, Henrique Fingler, Ian Henriksen, Arthur Peters, Martin Burtscher, Milos Gligoric, Karl Schulz, Keshav Pingali, Christopher J. Rossbach, Mattan Erez, and George Biros.
Parla: A High-level Orchestration System for Heterogeneous Architectures.
Proceedings of the 2022 ACM/IEEE International Conference for High-Performance Computing, Networking, Storage and Analysis. November 2022.
A major complexity of writing MPI+X applications is the diversity of node architectures. We introduce Parla, a heterogeneous task-based programming framework in Python, to simplify writing multi-device code. The Parla interface encourages rapid prototyping and incremental adoption. It is able to be easily integrated into existing applications and tools within the Python ecosystem. Parla targets performance portability by allowing specialized variants of each task along with awareness of device resources and data movement. Among the wide range of tasking system in Python, Parla is unique in providing a tasking system for coordinating kernel calls within a single process that adapts to the tasks utilization of device resources. We show that Parla can achieve performance competitive with optimized application libraries without sacrificing ease of development. Scheduling and data movement policies have been evaluated against a range of applications and synthetic workloads.
Ghadeer Alabandi and Martin Burtscher.
Improving the Speed and Quality of Parallel Graph Coloring.
ACM Transactions on Parallel Computing, Vol. 9, No. 3, Article 10 (35 pages). September 2022.
Graph coloring assigns a color to each vertex of a graph such that no two adjacent vertices get the same color. It is a key building block in many applications. In practice, solutions that require fewer distinct colors and that can be computed faster are typically preferred. Various coloring heuristics exist that provide different quality versus speed tradeoffs. The highest-quality heuristics tend to be slow. To improve performance, several parallel implementations have been proposed. This paper describes two improvements of the widely used LDF heuristic. First, we present a shortcutting approach to increase the parallelism by non-speculatively breaking data dependencies. Second, we present color reduction techniques to boost the solution of LDF. On 18 graphs from various domains, the shortcutting approach yields 2.5 times more parallelism in the mean, and the color-reduction techniques improve the result quality by up to 20%. Our deterministic CUDA implementation running on a Titan V is 2.9 times faster in the mean and uses as few or fewer colors as the best GPU codes from the literature.
Yiqian Liu, Noushin Azami, Corbin Walters, and Martin Burtscher.
The Indigo Program-Verification Microbenchmark Suite of Irregular Parallel Code Patterns.
Proceedings of the 2022 IEEE International Symposium on Performance Analysis of Systems and Software. May 2022.
Irregular programs are found in many domains and tend to exhibit input-dependent control flow and memory accesses. This paper introduces the Indigo suite of important irregular parallel code patterns for testing verification and other tools. We studied many irregular CPU and GPU programs and extracted the key code patterns. Then, we methodically built variations of these patterns to alter the control-flow and memory-access behavior and/or introduce bugs, yielding the thousands of OpenMP and CUDA microbenchmarks in the suite. Indigo includes a set of generators to systematically create an unbounded number of inputs for each microbenchmark, which is essential to exercise the wide range of possible behaviors of input-dependent codes. To manage the millions of code and input combinations, Indigo provides the flexibility to generate user-defined subsets of the suite. Experiments with a subset of buggy and bug-free codes illustrate that irregular programs pose a significant challenge to both static and dynamic program verification tools. Moreover, such tools can perform quite differently across code patterns that contain the same bug.
Alex Fallin, Aarti Kothari, Jiayuan He, Christopher Yanez, Keshav Pingali, Rajit Manohar, and Martin Burtscher.
A Simple, Fast, and GPU-friendly Steiner-Tree Heuristic.
Proceedings of the 23rd IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing. May 2022.
Rectilinear-Steiner-Minimum-Tree (RSMT) generation is a key step in VLSI routing. This paper describes a deterministic RSMT heuristic named SFP because it is simple, fast, and parallel. It is based on a rapid method for computing the optimal solution for three-pin nets, which it extends and applies to larger nets. Compared to FLUTE, a fast and high-quality RSMT algorithm, SFP incurs a wirelength penalty of 0.38 percent on the 16 ISPD 2008 inputs with 13 million nets. However, on a 16-core 2950X CPU, parallel SFP is 19 times faster than FLUTE, and on a Titan V GPU, it is 86 times faster. Serial SFP can generate an RSMT for a 10,000-pin net in under 1.5 seconds.
Ghadeer Alabandi, Jelena Tesic, Lucas Rusnak, and Martin Burtscher.
Discovering and Balancing Fundamental Cycles in Large Signed Graphs.
Proceedings of the 2021 ACM/IEEE International Conference for High-Performance Computing, Networking, Storage and Analysis, Article 68, pp. 1-17. November 2021.
Computing consensus states via global sign balancing is a key step in social network analysis. This paper presents graphB+, a fast algorithm for balancing signed graphs based on a new vertex and edge labeling technique, and a parallel implementation thereof for rapidly detecting and balancing all fundamental cycles. The main benefits of graphB+ are that the labels can be computed with linear time complexity, only require a linear amount of memory, and that the running time for balancing a cycle is linear in the length of the cycle times the vertex degrees but independent of the size of the graph. We parallelized graphB+ using OpenMP and CUDA. It takes 0.85 seconds on a Titan V GPU to balance the signs on the edges of an Amazon graph with 10 million vertices and 22 million edges, amounting to over 14 million fundamental cycles identified, traversed, and balanced per second.
Sepideh Maleki, Udit Agarwal, Martin Burtscher, and Keshav Pingali.
BiPart: A Parallel and Deterministic Hypergraph Partitioner.
Proceedings of the 2021 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 161-174. February 2021.
Hypergraph partitioning is used in many problem domains including VLSI design, linear algebra, Boolean satisfiability, and data mining. Most versions of this problem are NP-complete or NP-hard, so practical hypergraph partitioners generate approximate partitioning solutions for all but the smallest inputs. One way to speed up hypergraph partitioners is to exploit parallelism. However, existing parallel hypergraph partitioners are not deterministic, which is considered unacceptable in domains like VLSI design where the same partitions must be produced every time a given hypergraph is partitioned. In this paper, we describe BiPart, the first deterministic, parallel hypergraph partitioner. Experimental results show that BiPart outperforms state-of-the-art hypergraph partitioners in runtime and partition quality while generating partitions deterministically.
Ghadeer Alabandi, Evan Powers, and Martin Burtscher.
Increasing the Parallelism of Graph Coloring via Shortcutting.
Proceedings of the 2020 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 262-275. February 2020.
Graph coloring is an assignment of colors to the vertices of a graph such that no two adjacent vertices get the same color. It is a key building block in many applications. Finding a coloring with a minimal number of colors is often only part of the problem. In addition, the solution also needs to be computed quickly. Several parallel implementations exist, but they may suffer from low parallelism depending on the input graph. We present an approach that increases the parallelism without affecting the coloring quality. On 18 test graphs, our technique yields an average of 3.4 times more parallelism. Our CUDA implementation running on a Titan V is 2.9 times faster on average and uses as few or fewer colors as the best GPU codes from the literature.
Jiayuan He, Martin Burtscher, Rajit Manohar, and Keshav Pingali.
SPRoute: A Scalable Parallel Negotiation-based Global Router.
Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 1-8. November 2019.
The complexity of global routing increases rapidly as chip designs grow larger. In many global routers, maze routing is the most time-consuming stage. One way to reduce its runtime is parallelization. Existing parallel maze routers work either by identifying and routing independent nets or by partitioning the chip area into non-overlapping regions. In this paper, we describe a scalable parallel global router called SPRoute that initially exploits net-level parallelism, automatically lowers the parallelism when livelock is identified, and finally switches to fine-grain parallelism to guarantee convergence. We evaluate SPRoute on a 28-core machine on the ISPD 2008 global routing contest benchmark suite. It achieves an average speedup of 11.5 with a wirelength penalty of 0.6% on overflow-free benchmarks, and an average speedup of 4.5 with a total overflow penalty of 7% on hard-to-route benchmarks over sequential SPRoute. Compared to FastRoute 4.1, SPRoute achieves an average speedup of 11.0 and 3.1 on overflow-free benchmarks and hard-to-route benchmarks, respectively.
Saeed Taheri, Ian Briggs, Martin Burtscher, and Ganesh Gopalakrishnan.
DiffTrace: Efficient Whole-Program Trace Analysis and Diffing for Debugging.
Proceedings of the IEEE International Conference on Cluster Computing, pp. 1-12. September 2019.
We present a tool called DiffTrace that approaches debugging via whole program tracing and diffing of typical and erroneous traces. After collecting these traces, a user-configurable front-end filters out irrelevant function calls and then summarizes loops in the retained function calls based on state-of-the-art loop extraction algorithms. Information about these loops is inserted into concept lattices, which we use to compute salient dissimilarities to narrow down bugs. DiffTrace is a clean start that addresses debugging features missing in existing approaches. Our experiments on an MPI/OpenMP program called ILCS and initial measurements on LULESH, a DOE miniapp, demonstrate the advantages of the proposed debugging approach.
Martin Burtscher, Sindhu Devale, Sahar Azimi, Jayadharini Jaiganesh, and Evan Powers.
A High-Quality and Fast Maximal Independent Set Implementation for GPUs.
ACM Transactions on Parallel Computing, Vol. 5, No. 2, Article 8 (27 pages). December 2018.
Computing a maximal independent set is an important step in many parallel graph algorithms. This article introduces ECL-MIS, a maximal independent set implementation that works well on GPUs. It includes key optimizations to speed up computation, reduce the memory footprint, and increase the set size. Its CUDA implementation requires fewer than 30 kernel statements, runs asynchronously, and produces a deterministic result. It outperforms the maximal independent set implementations of Pannotia, CUSP, and IrGL on each of the 16 tested graphs of various types and sizes. On a Titan X GPU, ECL-MIS is between 3.9 and 100 times faster (11.5 times, on average). ECL-MIS running on the GPU is also faster than the parallel CPU codes Ligra, Ligra+, and PBBS running on 20 Xeon cores, which it outperforms by 4.1 times, on average. At the same time, ECL-MIS produces maximal independent sets that are up to 52% larger (over 10%, on average) compared to these preexisting CPU and GPU implementations. Whereas these codes produce maximal independent sets that are, on average, about 15% smaller than the largest possible such sets, ECL-MIS sets are less than 6% smaller than the maximum independent sets.
Saeed Taheri, Sindhu Devale, Ganesh Gopalakrishnan, and Martin Burtscher.
ParLoT: Efficient Whole-Program Call Tracing for HPC Applications.
Proceedings of the Seventh Workshop on Extreme-Scale Programming Tools (12 pages). November 2018.
The complexity of HPC software and hardware is quickly increasing. As a consequence, the need for efficient execution tracing to gain insight into HPC application behavior is steadily growing. Unfortunately, available tools either do not produce traces with enough detail or incur large overheads. An efficient tracing method that overcomes the tradeoff between maximum information and minimum overhead is therefore urgently needed. This paper presents such a method and tool, called ParLoT, with the following key features. (1) It describes a technique that makes low-overhead on-the-fly compression of whole-program call traces feasible. (2) It presents a new, highly efficient, incremental trace-compression approach that reduces the trace volume dynamically, which lowers not only the needed bandwidth but also the tracing overhead. (3) It collects all caller/callee relations, call frequencies, call stacks, as well as the full trace of all calls and returns executed by each thread, including in library code. (4) It works on top of existing dynamic binary instrumentation tools, thus requiring neither source-code modifications nor recompilation. (5) It supports program analysis and debugging at the thread, thread-group, and program level. This paper establishes that comparable capabilities are currently unavailable. Our experiments with the NAS parallel benchmarks running on the Comet supercomputer with up to 1,024 cores show that ParLoT can collect whole-program function-call traces at an average tracing bandwidth of just 56 kB/s per core.
Computing a Movie of Zooming into a Fractal.
Proceedings of the Workshop on Education for High-Performance Computing (1 page). November 2018.
Yi-Shan Lu, Samira Ataei, Jiayuan He, Wenmian Hua, Sepideh Maleki, Yihang Yang, Martin Burtscher, Keshav Pingali, and Rajit Manohar.
Parallel Tools for Asynchronous VLSI Systems.
Proceedings of the Workshop on Open-Source EDA Technology (4 pages). November 2018.
We propose to develop a collection of electronic design automation tools for asynchronous circuits. To reduce design turn-around time, we will implement parallel versions of the key algorithms using the Galois system. These tools will be open-sourced when mature.
Jayadharini Jaiganesh and Martin Burtscher.
A High-Performance Connected Components Implementation for GPUs.
Proceedings of the 2018 ACM International Symposium on High-Performance Parallel and Distributed Computing, pp. 92-104. June 2018.
Computing connected components is an important graph algorithm that is used, for example, in medicine, image processing, and biochemistry. This paper presents a fast connected-components implementation for GPUs called ECL-CC. It builds upon the best features of prior algorithms and augments them with GPU-specific optimizations. For example, it incorporates a parallelism-friendly version of pointer jumping to speed up union-find operations and uses various compute kernels to exploit the multiple levels of hardware parallelism. The resulting CUDA code is asynchronous, lock free, and employs load balancing. It is 1.8 times faster on average than the fastest prior GPU implementation running on a Titan X and also faster on most of the 18 real-world and synthetic graphs we tested.
Steven Claggett, Sahar Azimi, and Martin Burtscher.
SPDP: An Automatically Synthesized Lossless Compression Algorithm for Floating-Point Data.
Proceedings of the 2018 Data Compression Conference, pp. 337-346. March 2018.
Scientific computing produces, transfers, and stores massive amounts of single- and double-precision floating-point data, making this a domain that can greatly benefit from data compression. To gain insight into what makes an effective lossless compression algorithm for such data, we generated over nine million algorithms and selected the one that yields the highest compression ratio on 26 datasets. The resulting algorithm, called SPDP, comprises four data transformations that operate exclusively at word or byte granularity. Nevertheless, SPDP delivers the highest compression ratio on eleven datasets and, on average, outperforms all but one of the seven compared compressors. An analysis of SPDP's internals reveals how to build effective compression algorithms for scientific data.
Sepideh Maleki and Martin Burtscher.
Automatic Hierarchical Parallelization of Linear Recurrences.
Proceedings of the 23rd ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 128-138. March 2018.
Linear recurrences encompass many fundamental computations including prefix sums and digital filters. Later result values depend on earlier result values in recurrences, making it a challenge to compute them in parallel. We present a new work- and space-efficient algorithm to compute linear recurrences that is amenable to automatic parallelization and suitable for hierarchical massively-parallel architectures such as GPUs. We implemented our approach in a domain-specific code generator that emits optimized CUDA code. Our evaluation shows that, for standard prefix sums and single-stage IIR filters, the generated code reaches the throughput of memory copy for large inputs, which cannot be surpassed. On higher-order prefix sums, it performs nearly as well as the fastest handwritten code from the literature. On tuple-based prefix sums and 1D digital filters, our automatically parallelized code outperforms the fastest prior implementations.
Armen Dzhagaryan, Aleksandar Milenkovic, and Martin Burtscher.
Improving the Effectiveness of Data Transfers in Mobile Computing Using Lossless Compression Utilities.
Chapter 7 in Advances in Computer Communications and Networks, pp. 181-221. February 2017.
The data traffic originating on mobile computing devices has been growing exponentially over the last several years. Lossless data compression can increase communication throughput, reduce latency, save energy, and increase available storage. However, compression introduces additional overhead that may exceed any gains due to transferring or storing fewer bytes. Compression utilities on mobile computing platforms differ in compression ratio, compression and decompression speeds, and energy requirements. When transferring data, we would like to have an agent to determine whether compressed transfers are beneficial, and if so, select the most beneficial compression utility. A first step toward designing such an agent is to obtain a good understanding of various parameters impacting data transfers. This chapter presents results from an experimental study that evaluates the effectiveness of several compression utilities on a modern smartphone in three typical usage scenarios that involve local and network transfers. We define several metrics that capture the effectiveness of uncompressed and compressed data transfers and provide practical guidelines for selecting the most effective utilities and configurations for each usage scenario. We also introduce an analytical framework for estimating the effectiveness of various compression utilities in transferring mobile data.
Martin Burtscher, Farbod Hesaaraki, Hari Mukka, and Annie Yang.
Real-Time Synthesis of Compression Algorithms for Scientific Data.
Proceedings of the 2016 ACM/IEEE International Conference for High-Performance Computing, Networking, Storage and Analysis, pp. 264-275. November 2016.
Many scientific programs produce large amounts of floating-point data that are saved for later use. To minimize the storage requirement, it is worthwhile to compress such data as much as possible. However, existing algorithms tend to compress floating-point data relatively poorly. As a remedy, we have developed FPcrush, a tool that automatically synthesizes an optimized compressor for each given input. The synthesized algorithms are lossless and parallelized using OpenMP. This paper describes how FPcrush is able to perform this synthesis in real-time, i.e., even when accounting for the synthesis overhead, it compresses the 16 tested real-world single- and double-precision data files more quickly than parallel bzip2. Decompression is an order of magnitude faster and exceeds the throughput of multicore implementations of bzip2, gzip, and FPC. On all but two of the tested files, as well as on average, the customized algorithms deliver higher compression ratios than the other three tools.
Jared Coplin and Martin Burtscher.
Energy and Power Considerations of GPUs.
Chapter 19 in Advances in GPU Research and Practice, pp. 509-541. September 2016.
Annie Yang, Jared Coplin, Hari Mukka, Farbod Hesaaraki, and Martin Burtscher.
MPC: An Effective Floating-Point Compression Algorithm for GPUs.
Chapter 13 in Advances in GPU Research and Practice, pp. 327-347. September 2016.
Jared Coplin, Annie Yang, Andrew Poppe, and Martin Burtscher.
Increasing Telemetry Throughput Using Customized and Adaptive Data Compression.
Proceedings of the AIAA SPACE and Astronautics Forum and Exposition (10 pages). September 2016.
Due to the increasing generation of massive amounts of data by space-based instruments, it has become a challenge to transmit even a fraction of a typical spacecraft data volume back to Earth in a feasible amount of time. Thus, improvements in the ability to losslessly compress data on-board before transmission represent an important method of increasing overall data return rates. We describe a custom methodology for compressing spacecraft data on-board that provides significant improvements in both compression ratio and speed. We have used data returned by the five-probe THEMIS/ARTEMIS constellation to quantify the compression ratio and compression speed improvements over a variety of data types (e.g., time-series and particle data). Our approach results in a 38% improvement in compression ratio and up to a three-fold improvement in compression throughput and energy efficiency. We argue that such methods should be adopted by future space missions to maximize the data return to Earth, thus enabling greater insight and scientific discovery.
Sepideh Maleki, Annie Yang, and Martin Burtscher.
Higher-Order and Tuple-Based Massively-Parallel Prefix Sums.
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 539-552. June 2016.
Prefix sums are an important parallel primitive, especially in massively-parallel programs. This paper discusses two orthogonal generalizations thereof, which we call higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM, a GPU-friendly algorithm for computing prefix sums and other scans that directly supports higher orders and tuple values. Its templated CUDA implementation unifies all of these computations in a single 100-statement kernel. SAM is communication-efficient in the sense that it minimizes main-memory accesses. When computing prefix sums of a million or more values, it outperforms Thrust and CUDPP on both a Titan X and a K40 GPU. On the Titan X, SAM reaches memory-copy speeds for large input sizes, which cannot be surpassed. SAM outperforms CUB, the currently fastest conventional prefix sum implementation, by up to a factor of 2.9 on eighth-order prefix sums and by up to a factor of 2.6 on eight-tuple prefix sums.
Jared Coplin and Martin Burtscher.
Energy, Power, and Performance Characterization of GPGPU Benchmark Programs.
Proceedings of the Twelfth Workshop on High-Performance, Power-Aware Computing (10 pages). May 2016.
This paper studies the effects on energy consumption, power draw, and runtime of a modern compute GPU when changing the core and memory clock frequencies, enabling or disabling ECC, using alternate implementations, and varying the program inputs. We evaluate 34 applications from 5 benchmark suites and measure their power draw over time on a K20c GPU. Our results show that changing the frequency or the program implementation can alter the energy, power, and performance by a factor of two or more. Interestingly, some changes affect these three aspects very unevenly. ECC can greatly increase the runtime and energy consumption, but only on memory-bound codes. Compute-bound codes tend to behave quite differently from memory-bound codes, in particular regarding their power draw. On irregular programs, a small change in frequency can result in a large change in runtime and energy consumption.
Bahareh Goodarzi, Martin Burtscher, and Dhrubajyoti Goswami.
Parallel Graph Partitioning on a CPU-GPU Architecture.
Proceedings of the Twenty Fifth International Heterogeneity in Computing Workshop (9 pages). May 2016.
Graph partitioning has important applications in multiple areas of computing, including scheduling, social networks, and parallel processing. In recent years, GPUs have proven successful at accelerating several graph algorithms. However, the irregular nature of the real-world graphs poses a problem for GPUs, which favor regularity. In this paper, we discuss the design and implementation of a parallel multilevel graph partitioner for a CPU-GPU system. The partitioner aims to overcome some of the challenges arising due to memory constraints on GPUs and maximizes the utilization of GPU threads through suitable load-balancing schemes. We present a lock-free shared-memory scheme since fine-grained synchronization among thousands of threads imposes too high a performance overhead. The partitioner, implemented in CUDA, outperforms serial Metis and parallel MPI-based ParMetis. It performs similar to the shared-memory CPU-based parallel graph partitioner mt-metis.
Igor Szczyrba, Rafal Szczyrba, and Martin Burtscher.
Geometric Representations of the n-anacci Constants and Generalizations Thereof.
Journal of Integer Sequences, Vol. 19, Article 16.3.8 (14 pages). April 2016.
We introduce geometric representations of the sequence of the n-anacci constants and generalizations thereof that consist of the ratio limits generated by linear recurrences of an arbitrary order n with equal real weights p > 0. We represent the n-anacci constants and their generalizations geometrically by means of the dilation factors of dilations transforming collections of compact convex sets with increasing dimensions n.
Annie Yang, Hari Mukka, Farbod Hesaaraki, and Martin Burtscher.
MPC: A Massively Parallel Compression Algorithm for Scientific Data.
Proceedings of the IEEE International Conference on Cluster Computing, pp. 381-389. September 2015.
Due to their high peak performance and energy efficiency, massively parallel accelerators such as GPUs are quickly spreading in high-performance computing, where large amounts of floating-point data are processed, transferred, and stored. Such environments can greatly benefit from data compression if done sufficiently quickly. Unfortunately, most conventional compression algorithms are unsuitable for highly parallel execution. In fact, it is generally unknown how to design good compression algorithms for massively parallel systems. To remedy this situation, we study 138,240 lossless compression algorithms for single- and double-precision floating-point values that are built exclusively from easily parallelizable components. We analyze the best of these algorithms, explain why they compress well, and derive the Massively Parallel Compression (MPC) algorithm from them. This novel algorithm requires almost no internal state, achieves heretofore unreached compression ratios on several data sets, and roughly matches the best CPU-based algorithms in compression ratio while outperforming them by one to two orders of magnitude in throughput.
Saami Rahman, Martin Burtscher, Ziliang Zong, and Apan Qasem.
Maximizing Hardware Prefetch Effectiveness with Machine Learning.
Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, pp. 383-389. August 2015.
Modern processors are equipped with multiple hardware prefetchers, each of which targets a distinct level in the memory hierarchy and employs a separate prefetching algorithm. However, different programs require different subsets of these prefetchers to maximize their performance. Turning on all available prefetchers rarely yields the best performance and, in some cases, prefetching even hurts performance. This paper studies the effect of hardware prefetching on multithreaded code and presents a machine-learning technique to predict the optimal combination of prefetchers for a given application. This technique is based on program characterization and utilizes hardware performance events in conjunction with a pruning algorithm to obtain a concise and expressive feature set. The resulting feature set is used in three different learning models. All necessary steps are implemented in a framework that reaches, on average, 96% of the best possible prefetcher speedup. The framework is built from open-source tools, making it easy to extend and port to other architectures.
Armen Dzhagaryan, Aleksandar Milenkovic, and Martin Burtscher.
Quantifying Benefits of Lossless Compression Utilities on Modern Smartphones.
Proceedings of the 24th International Conference on Computer Communications and Networks, pp. 1-9. August 2015.
(best paper candidate) The data traffic originating on mobile computing devices has been growing exponentially over the last several years. Lossless data compression and decompression can be essential in increasing communication throughput, reducing communication latency, achieving energy-efficient communication, and making effective use of available storage. This paper experimentally evaluates several compression utilities and configurations on a modern smartphone. We characterize each utility in terms of its compression ratio, compression and decompression throughput, and energy efficiency for representative use cases. We find a wide variety of energy costs associated with data compression and decompression and provide practical guidelines for selecting the most energy efficient configurations for each use case. For data transfers over WLAN, the best configurations provide a 2.1-fold and 2.7-fold improvement in energy efficiency for compressed uploads and downloads, respectively, when compared to uncompressed data transfers. For data transfers over a mobile broad-band network, the best configurations provide a 2.7-fold and 3-fold improvement in energy efficiency for compressed uploads and downloads, respectively.
Saeed Taheri, Apan Qasem, and Martin Burtscher.
A Tool for Automatically Suggesting Source-Code Optimizations for Complex GPU Kernels.
Proceedings of the 2015 International Conference on Parallel and Distributed Processing Techniques and Applications (10 pages). July 2015.
Future computing systems, from handhelds to supercomputers, will undoubtedly be more parallel and heterogeneous than today's systems to provide more performance and energy efficiency. Thus, GPUs are increasingly being used to accelerate general-purpose applications, including applications with data-dependent, irregular control flow and memory access patterns. However, the growing complexity, exposed memory hierarchy, incoherence, heterogeneity, and parallelism will make accelerator-based systems progressively more difficult to program. In the foreseeable future, the vast majority of programmers will no longer be able to extract additional performance or energy-savings from next-generation systems because the programming will be too difficult. Automatic performance analysis and optimization recommendation tools have the potential to avert this situation. They embody expert knowledge and make it available to software developers when needed. In this paper, we describe and evaluate such a tool. It quantifies performance characteristics of GPU code through profiling, employs machine learning models to estimate the suitability and benefit of several known source-code optimizations, ranks the optimizations, and suggests the most promising ones to the user if the expected speedup is sufficiently high.
Igor Szczyrba, Rafal Szczyrba, and Martin Burtscher.
Analytic Representations of the n-anacci Constants and Generalizations Thereof.
Journal of Integer Sequences, Vol. 18, Article 15.4.5 (10 pages). May 2015.
We study generalizations of the sequence of the n-anacci constants that are constructed from the ratio limits generated by linear recurrences of an arbitrary order n with equal integer weights m. We derive the analytic representation of the class C-infinity of these ratio limits and prove that, for a fixed m, the ratio limits form a strictly increasing sequence converging to m+1. We also show that the generalized n-anacci constants form a totally ordered set.
Martin Burtscher, Wuxu Peng, Apan Qasem, Hongchi Shi, Dan Tamir, and Heather Thiry.
A Module-based Approach to Adopting the 2013 ACM Curricular Recommendations on Parallel Computing.
Proceedings of the 2015 ACM SIGCSE Symposium, pp. 36-41. March 2015.
The widespread deployment of multicore systems over the last decade has brought about major changes in the software and hardware landscape. The resulting importance of parallel computing is reflected in the 2013 Curriculum Guidelines developed by the joint ACM/IEEE taskforce. The document recommends increased coverage of parallel computing and describes a new Knowledge Area on this topic. These recommendations have already been adopted by several universities in the form of new parallel-programming courses. Implementing the recommendations in a complete curriculum, however, poses many challenges, including deciding on existing material to be removed, complying with administrative and ABET requirements, and maintaining caps on graduation credit hours. This paper describes an alternative approach for adopting the 2013 curricular recommendations on parallel computing. Specifically, we use a module-based approach that introduces parallel computing concepts and re-iterates them through a series of short, self-contained modules taught across several lower-division courses. Most of these concepts are then combined into a new senior-level capstone course on parallel programming. Each module covers parallelism aspects in the context of a conventional computer science topic, thus enabling us to include parallel computing without a major overhaul of the curriculum. Evaluations conducted during the first year show encouraging results for this early-and-often approach in terms of learning outcomes, student interest, and confidence gains.
Bo Li, Yijuan Lu, Chunyuan Li, Afzal Godil, Tobias Schreck, Masaki Aono, Martin Burtscher, Qiang Chen, Nihad K. Chowdhury, Bin Fang, Hongbo Fu, Takahiko Furuya, Haisheng Li, Jianzhuang Liu, Henry Johan, Ryuichi Kosaka, Hitoshi Koyanagi, Ryutarou Ohbuchi, Atsushi Tatsuma, Yajuan Wan, Chaoli Zhang, and Changqing Zou.
A Comparison of 3D Shape Retrieval Methods based on a Large-Scale Benchmark Supporting Multimodal Queries.
Computer Vision and Image Understanding, Vol. 131, pp. 1-27. February 2015.
Large-scale 3D shape retrieval has become an important research direction in content-based 3D shape retrieval. To promote this research area, two Shape Retrieval Contest (SHREC) tracks on large scale comprehensive and sketch-based 3D model retrieval have been organized by us in 2014. Both tracks were based on a unified large-scale benchmark that supports multimodal queries (3D models and sketches). This benchmark contains 13680 sketches and 8987 3D models, divided into 171 distinct classes. It was compiled to be a superset of existing benchmarks and presents a new challenge to retrieval methods as it comprises generic models as well as domain-specific model types. Twelve and six distinct 3D shape retrieval methods have competed with each other in these two contests, respectively. To measure and compare the performance of the participating and other promising Query-by-Model or Query-by-Sketch 3D shape retrieval methods and to solicit state-of-the-art approaches, we perform a more comprehensive comparison of twenty-six (eighteen originally participating algorithms and eight additional state-of-the-art or new) retrieval methods by evaluating them on the common benchmark. The benchmark, results, and evaluation tools are publicly available at our websites (http://www.itl.nist.gov/iad/vug/sharp/contest/2014/Generic3D/, 2014, http://www.itl.nist.gov/iad/vug/sharp/contest/2014/SBR/, 2014).
Jared Coplin and Martin Burtscher.
Effects of Source-Code Optimizations on GPU Performance and Energy Consumption.
Proceedings of the Eighth Workshop on General Purpose Processing Using GPUs (11 pages). February 2015.
This paper studies the effects of source-code optimizations on the performance, power draw, and energy consumption of a modern compute GPU. We evaluate 128 versions of two n-body codes: a compute-bound regular implementation and a memory-bound irregular implementation. Both programs include six optimizations that can be individually enabled or disabled. We measured the active runtime and the power consumption of each code version on three inputs, various GPU clock frequencies, two arithmetic precisions, and with and without ECC. This paper investigates which optimizations primarily improve energy efficiency, which ones mainly boost performance, and which ones help both aspects. Some optimizations also have the added benefit of reducing the power draw. Our analysis shows that individual and combinations of optimizations can alter the performance and energy consumption of a GPU kernel by up to a factor of five.
Molly A. O'Neil and Martin Burtscher.
Rethinking the Parallelization of Random-Restart Hill Climbing.
Proceedings of the Eighth Workshop on General Purpose Processing Using GPUs (10 pages). February 2015.
Random-restart hill climbing is a common approach to combinatorial optimization problems such as the traveling salesman problem (TSP). We present and evaluate an implementation of random-restart hill climbing with 2-opt local search applied to TSP. Our implementation is capable of addressing large problem sizes at high throughput. It is based on the key insight that the GPU's hierarchical hardware parallelism is best exploited with a hierarchical implementation strategy, where independent climbs are parallelized between blocks and the 2-opt evaluations are parallelized across the threads within a block. We analyze the performance impact of this and other optimizations on our heuristic TSP solver and compare its performance to existing GPU-based 2-opt TSP solvers as well as a parallel CPU implementation. Our code outperforms the existing implementations by up to 3X, evaluating up to 60 billion 2-opt moves per second on a single K40 GPU. It also outperforms an OpenMP implementation run on 20 CPU cores by up to 8X.
Jared Coplin and Martin Burtscher.
Power Characteristics of Irregular GPGPU Programs.
Proceedings of the 2014 International Workshop on Green Programming, Computing, and Data Processing (6 pages). November 2014.
This paper investigates the power profiles of irregular programs running on a K20 compute GPU and contrasts them with the profiles of regular programs. The paper further studies the effects on the power profile when changing the GPU's core and memory frequencies, using alternate implementations of the same algorithm, and varying the program input. Our results show that the power behavior of irregular applications often cannot be accurately captured by a single average. Rather, the entire profile, i.e., the power as a function of time, needs to be considered. In addition, lowering the frequency, employing alternate implementations, or using different inputs can drastically alter the power profile of irregular codes, meaning that measurements using one setting may not be representative of that program's power characteristics under a different setting.
Molly A. O'Neil and Martin Burtscher.
Microarchitectural Performance Characterization of Irregular GPU Kernels.
Proceedings of the IEEE International Symposium on Workload Characterization, pp. 130-139. October 2014.
GPUs are increasingly being used to accelerate general-purpose applications, including applications with data-dependent, irregular memory access patterns and control flow. However, relatively little is known about the behavior of irregular GPU codes, and there has been minimal effort to quantify the ways in which they differ from regular GPGPU applications. We examine the behavior of a suite of optimized irregular CUDA applications on a cycle-accurate GPU simulator. We characterize the performance bottlenecks in each program and connect source code with microarchitectural characteristics. We also assess the impact of improvements in cache and DRAM bandwidth and latency and discuss the implications for GPU architecture design. We find that, while irregular graph codes exhibit significantly more underutilized execution cycles due to thread divergence, load imbalance, and synchronization overhead than regular programs, these factors contribute less to performance degradation than we expected. It appears that code optimizations are often able to effectively address these performance hurdles. Insufficient bandwidth and long memory latency are the biggest limiters of performance. Surprisingly, we find that applications with irregular memory access patterns are more sensitive to changes in L2 latency and bandwidth than DRAM latency and bandwidth.
Rong Ge, Xizhou Feng, Martin Burtscher, and Ziliang Zong.
Performance and Energy Modeling for Cooperative Hybrid Computing.
Proceedings of the 9th IEEE International Conference on Networking, Architecture, and Storage, pp. 232-241. August 2014.
Accelerator-based heterogeneous systems can provide high performance and energy efficiency, both of which are key design goals in high performance computing. To fully realize the potential of heterogeneous architectures, software must optimally exploit the hosts' and accelerators' processing and power-saving capabilities. Yet, previous studies mainly focus on using hosts and accelerators to boost application performance. Power-saving features to improve the energy efficiency of parallel programs, such as Dynamic Voltage and Frequency Scaling (DVFS), remain largely unexplored. Recognizing that energy efficiency is a different objective than performance and should therefore be independently pursued, we study how to judiciously distribute computation between hosts and accelerators for energy optimization. We further explore energy-saving scheduling in combination with computation distribution for even larger gains. Moreover, we present PEACH, an analytical model for Performance and Energy Aware Cooperative Hybrid computing. With just a few system- and application-dependent parameters, PEACH accurately captures the performance and energy impact of computation distribution and energy-saving scheduling to quickly identify the optimal coupled strategy for achieving the best performance or the lowest energy consumption. PEACH thus eliminates the need for extensive profiling and measurement. Experimental results from two GPU-accelerated heterogeneous systems show that PEACH predicts the performance and energy of the studied codes with less than 3% error and successfully identifies the optimal strategy for a given objective.
Hassan Rabeti and Martin Burtscher.
Feature Selection by Tree Search of Correlation-Adjusted Class Distances.
Proceedings of the 2014 International Conference on Data Mining (9 pages). July 2014.
The rapidly growing dimensionality of datasets has made feature selection indispensable. We introduce the TS-CACD feature-selection algorithm, which uses a generalization of the Stern-Brocot tree to traverse the search space. This family of trees supports different divergence ratios, i.e., enables the search to focus on and reach certain areas of interest more quickly. TS-CACD uses a continuous filter method, which combines an inter/intra-class distance measure with a pair-wise ranked feature correlation measure. It requires almost no parameters, explicitly selects the most important features, and performs well.
Vladimir Uzelac, Aleksandar Milenkovic, Milena Milenkovic, and Martin Burtscher.
Using Branch Predictors and Variable Encoding for On-the-Fly Program Tracing.
IEEE Transactions on Computers, Vol. 63, No. 4, pp. 1008-1020. April 2014.
Unobtrusive capturing of program execution traces in real-time is crucial for debugging many embedded systems. However, tracing even limited program segments is often cost-prohibitive, requiring wide trace ports and large on-chip trace buffers. This paper introduces a new cost-effective technique for capturing and compressing program execution traces on-the-fly. It relies on branch predictor-like structures in the trace module and corresponding software modules in the debugger to significantly reduce the number of events that need to be streamed out of the target system. Coupled with an effective variable encoding scheme that adapts to changing program patterns, our technique requires merely 0.029 bits per instruction of trace port bandwidth, providing a 34-fold improvement over the commercial state-of-the-art and a five-fold improvement over academic proposals, at the low cost of under 5,000 logic gates.
Bo Li, Yijuan Lu, Chunyuan Li, Afzal Godil, Tobias Schreck, Masaki Aono, Martin Burtscher, Hongbo Fu, Takahiko Furuya, Henry Johan, Jianzhuang Liu, Ryutarou Ohbuchi, Atsushi Tatsuma, and Changqing Zou.
SHREC'14 Track: Extended Large Scale Sketch-Based 3D Shape Retrieval.
Proceedings of the Eurographics Workshop on 3D Object Retrieval (10 pages). April 2014.
Large scale sketch-based 3D shape retrieval has received more and more attentions in the community of content-based 3D object retrieval. The objective of this track is to evaluate the performance of different sketch-based 3D model retrieval algorithms using a large scale hand-drawn sketch query dataset on a comprehensive 3D model dataset. The benchmark contains 12,680 sketches and 8,987 3D models, divided into 171 distinct classes. In this track, 12 runs were submitted by 4 groups and their retrieval performance was evaluated using 7 commonly used retrieval performance metrics. We hope that this benchmark, the comparative evaluation results, and the corresponding evaluation code will further promote the progress of this research direction for the 3D model retrieval community.
Kamil Rocki, Martin Burtscher, and Reiji Suda.
The Future of Accelerator Programming: Abstraction, Performance or Can We Have Both?
Proceedings of the 29th ACM Symposium on Applied Computing, pp. 886-893. March 2014.
In a perfect world, code would only be written once and would run on different devices with high efficiency. To a degree, that used to be the case in the era of frequency scaling on a single core. However, due to power limitations, parallel programming has become necessary to obtain performance gains. But parallel architectures differ substantially from each other, often require specialized knowledge to exploit them, and typically necessitate reimplementation and fine tuning of programs. These slow tasks frequently result in situations where most of the time is spent reimplementing old rather than writing new code. The goal of our research is to find programming techniques that increase productivity, maintain high performance, and provide abstraction to free the programmer from these unnecessary and time-consuming tasks. However, such techniques usually come at the cost of substantial performance degradation. This paper investigates current approaches to portable accelerator programming, seeking to answer whether they make it possible to combine high efficiency with sufficient algorithm abstraction. It discusses OpenCL as a potential solution and presents three approaches of writing portable code: GPU-centric, CPU-centric, and combined. By applying the three approaches to a real-world program, we show that it is at least sometimes possible to run exactly the same code on many different devices with minimal performance degradation using parameterization. The main contributions of this paper are an extensive review of the current state-of-the-art and our original approach of addressing the stated problem with the copious-parallelism technique.
Martin Burtscher, Ivan Zecena, and Ziliang Zong.
Measuring GPU Power with the K20 Built-in Sensor.
Proceedings of the Seventh Workshop on General Purpose Processing Using GPUs, pp. 28-36. March 2014.
GPU-accelerated programs are becoming increasingly common in HPC, personal computers, and even handheld devices, making it important to optimize their energy efficiency. However, accurately profiling the power consumption of GPU code is not straightforward. In fact, we have identified multiple anomalies when using the on-board power sensor of K20 GPUs. For example, we have found that doubling a kernel's runtime more than doubles its energy usage, that kernels consume energy after they have stopped executing, and that running two kernels in close temporal proximity inflates the energy consumption of the later kernel. Moreover, we have observed that the power sampling frequency varies greatly and that the GPU sensor only performs power readings once in a while. We present a methodology to accurately compute the instant power and the energy consumption despite these issues.
Kamil Rocki and Martin Burtscher.
The Future of Accelerator Programming.
HPC wire. January 2014.
Ivan Zecena, Martin Burtscher, Tongdan Jin, and Ziliang Zong.
Evaluating the Performance and Energy Efficiency of N-Body Codes on Multi-Core CPUs and GPUs.
Proceedings of the 32nd IEEE International Performance Computing and Communications Conference, pp. 1-8. December 2013.
N-body simulations are computation-intensive applications that calculate the motion of a large number of bodies under pair-wise forces. Although different versions of n-body codes have been widely used in many scientific fields, the performance and energy efficiency of various n-body codes have not been comprehensively studied, especially when they are running on newly released multi-core CPUs and GPUs (e.g., Tesla K20). In this paper, we evaluate the performance and energy efficiency of five parallel n-body implementations on two different multi-core CPU systems and on two different types of GPUs. Our experimental results show that up to 71% of the energy can be saved by using all cores of a Xeon E5620 CPU instead of only one. We find hyper-threading to be able to further reduce the energy usage and runtime, but not by as much as adding more cores does. Finally, our experiments illustrate that GPU-based acceleration using a Tesla K20c can boost the performance and energy efficiency by orders of magnitude.
Martin Burtscher, Hongchi Shi, Wuxu Peng, Dan Tamir, Apan Qasem, and Heather Thiry.
Integrating Parallel Computing into the Undergraduate Curriculum at Texas State University: Experiences from the First Year.
Proceedings of the Workshop on Parallel, Distributed, and High-Performance Computing in Undergraduate Curricula (7 pages). November 2013.
The widespread deployment of multicore-based computer systems over the last decade has brought about drastic changes in the software and hardware landscape. Yet, many undergraduate computer science (CS) curricula have not embraced the pervasiveness of parallel computing. In their first years, CS undergraduates are typically exclusively trained to think and program sequentially. However, too firm a root in sequential thinking can be a nontrivial barrier for parallel thinking and computing. Thus, there is an urgent need to teach multicore and parallel computing concepts earlier and often in CS programs. This paper describes our efforts at addressing the rapidly widening gap between highly parallel computer architectures and the sequential programming approach taught in traditional CS courses. At Texas State University, we have adopted the early-and-often mode of integrating parallelism into the undergraduate curriculum. In this approach, parallel computing concepts are introduced and reiterated through a series of short, self-contained modules across several lower-division courses. Most of these concepts are then combined into a newly designed senior-level capstone course in multicore programming. Evaluations conducted during the first year show encouraging results for the early-and-often approach in terms of learning outcomes, student interest and confidence gains in computer science.
Rong Ge, Ryan Vogt, Jahangir Majumder, Arif Alam, Martin Burtscher, and Ziliang Zong.
Effects of Dynamic Voltage and Frequency Scaling on a K20 GPU.
Proceedings of the 2nd International Workshop on Power-aware Algorithms, Systems, and Architectures, pp. 826-833. October 2013.
Improving energy efficiency is an ongoing challenge in HPC because of the ever-increasing need for performance coupled with power and economic constraints. Though GPU-accelerated heterogeneous computing systems are capable of delivering impressive performance, it is necessary to explore all available power-aware technologies to meet the inevitable energy efficiency challenge. In this paper, we experimentally study the impacts of DVFS on application performance and energy efficiency for GPU computing and compare them with those of DVFS for CPU computing. Based on a power-aware heterogeneous system that includes dual Intel Sandy Bridge CPUs and the latest Nvidia K20c Kepler GPU, the study provides numerous new insights, general trends and exceptions of DVFS for GPU computing. In general, the effects of DVFS on a GPU differ from those of DVFS on a CPU. For example, on a GPU running compute-bound high-performance and high-throughput workloads, the system performance and the power consumption are approximately proportional to the GPU frequency. Hence, with a permissible power limit, increasing the GPU frequency leads to better performance without incurring a noticeable increase in energy. This paper further provides detailed analytical explanations of the causes of the observed trends and exceptions. The findings presented in this paper have the potential to impact future CPU and GPU architectures to achieve better energy efficiency and point out directions for designing effective DVFS schedulers for heterogeneous systems.
Aleksandar Milenkovic, Armen Dzhagaryan, and Martin Burtscher.
Performance and Energy Consumption of Lossless Compression/Decompression Utilities on Mobile Computing Platforms.
Proceedings of the IEEE 21st International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp. 254-263. August 2013.
Data compression and decompression utilities can be critical in increasing communication throughput, reducing communication latencies, achieving energy-efficient communication, and making effective use of available storage. This paper experimentally evaluates several such utilities for multiple compression levels on systems that represent current mobile platforms. We characterize each utility in terms of its compression ratio, compression and decompression throughput, and energy efficiency. We consider different use cases that are typical for modern mobile environments. We find a wide variety of energy costs associated with data compression and decompression and provide practical guidelines for selecting the most energy efficient configurations for each use case. The best performing configurations provide 6-fold and 4-fold improvements in energy efficiency for compressed uploads and downloads over WLAN, respectively, when compared to uncompressed data transfers.
Martin Burtscher and Hassan Rabeti.
GPU Acceleration of a Genetic Algorithm for the Synthesis of FSM-based Bimodal Predictors.
Proceedings of the 2013 International Conference on Parallel and Distributed Processing Techniques and Applications (8 pages). July 2013.
This paper presents a fast GPU implementation of a genetic algorithm for synthesizing bimodal predictor FSMs of a given size. Bimodal predictors, i.e., predictors that make binary yes/no predictions, are ubiquitous in microprocessors. Many of these predictors are based on finite-state machines (FSMs). However, there are countless possible FSMs and even heuristic searches for finding good FSMs can be slow when billions of predictions need to be assessed. We designed such a search heuristic that maps well onto GPU hardware. It is based on a multi-start genetic algorithm. On our six traces, the resulting FSMs are 1% to 29% more accurate than saturating up/down counters. On a Kepler-based GTX 680, the CUDA implementation evaluates 18 to 73 billion predictions per second, which is 14 to 18 times faster than a multicore version running on a hex-core Xeon X5690 with hyper-threading.
Martin Burtscher and Hassan Rabeti.
A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches.
Proceedings of the 27th IEEE International Parallel & Distributed Processing Symposium, pp. 1289-1298. May 2013.
This paper describes and evaluates a highly-scalable framework for running iterative local searches on heterogeneous HPC platforms. The user only needs to provide serial CPU or single-GPU code that implements a simple interface. The framework then executes this code in parallel using MPI between compute nodes and OpenMP and multi-GPU support within nodes. It handles all parallelization aspects, seed distribution and program termination, and it regularly records the currently best solution. We evaluate our framework on three supercomputers using a heuristic iterative hill-climbing TSP solver as well as a search for good finite-state machines. The framework scales to 2048 nodes (32,768 cores) on Ranger with less than a 5% drop in efficiency, searches over 12.2 trillion TSP tours per second on Stampede using 1024 nodes, and evaluates over 21.5 trillion FSM transitions per second using 256 CPUs and 384 GPUs on Keeneland.
Rupesh Nasre, Martin Burtscher, and Keshav Pingali.
Data-driven versus Topology-driven Irregular Computations on GPUs.
Proceedings of the 27th IEEE International Parallel & Distributed Processing Symposium, pp. 463-474. May 2013.
Irregular algorithms are algorithms with complex main data structures such as directed and undirected graphs, trees, etc. A useful abstraction for many irregular algorithms is its operator formulation in which the algorithm is viewed as the iterated application of an operator to certain nodes, called active nodes, in the graph. Each operator application, called an activity, usually touches only a small part of the overall graph, so non-overlapping activities can be performed in parallel. In topology-driven implementations, all nodes are assumed to be active so the operator is applied everywhere in the graph even if there is no work to do at some nodes. In contrast, in data-driven implementations the operator is applied only to nodes at which there might be work to do. Multicore implementations of irregular algorithms are usually data-driven because current multicores only support small numbers of threads and work-efficiency is important. Conversely, many irregular GPU implementations use a topology-driven approach because work inefficiency can be counterbalanced by the large number of GPU threads. In this paper, we study data-driven and topology-driven implementations of six important graph algorithms on GPUs. Our goal is to understand the tradeoffs between these implementations and how to optimize them. We find that data-driven versions are generally faster and scale better despite the cost of maintaining a worklist. However, topology-driven versions can be superior when certain algorithmic properties are exploited to optimize the implementation. These results led us to devise hybrid approaches that combine the two techniques and outperform both of them.
Rupesh Nasre, Martin Burtscher, and Keshav Pingali.
Atomic-free Irregular Computations on GPUs.
Proceedings of the Sixth Workshop on General Purpose Processing Using GPUs, pp. 96-107. March 2013.
Atomic instructions are a key ingredient of codes that operate on irregular data structures like trees and graphs. It is well known that atomics can be expensive, especially on massively parallel GPUs, and are often on the critical path of a program. In this paper, we present two high-level methods to eliminate atomics in irregular programs. The first method advocates synchronous processing using barriers. We illustrate how to exploit synchronous processing to avoid atomics even when the threads' memory accesses conflict with each other. The second method is based on exploiting algebraic properties of algorithms to elide atomics. Specifically, we focus on three key properties: monotonicity, idempotency and associativity, and show how each of them enables an atomic-free implementation. We illustrate the generality of the two methods by applying them to five irregular graph applications: breadth-first search, single-source shortest paths computation, Delaunay mesh refinement, pointer analysis and survey propagation, and show that both methods provide substantial speedup in each case on different GPUs.
Rupesh Nasre, Martin Burtscher, and Keshav Pingali.
Morph Algorithms on GPUs.
Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 147-156. February 2013.
There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, computing page-ranks, and finding shortest paths. However, these algorithms do not modify the graph structure, so their implementation is relatively easy compared to general graph algorithms like mesh generation and refinement, which morph the underlying graph in non-trivial ways by adding and removing nodes and edges. We know relatively little about how to implement morph algorithms efficiently on GPUs. In this paper, we present and study four morph algorithms: (i) a computational geometry algorithm called Delaunay Mesh Refinement (DMR), (ii) an approximate SAT solver called Survey Propagation (SP), (iii) a compiler analysis called Points-to Analysis (PTA), and (iv) Boruvka's Minimum Spanning Tree algorithm (MST). Each of these algorithms modifies the graph data structure in different ways and thus poses interesting challenges. We overcome these challenges using algorithmic and GPU-specific optimizations. We propose efficient techniques to perform concurrent subgraph addition, subgraph deletion, conflict detection and several optimizations to improve the scalability of morph algorithms. For an input mesh with 10 million triangles, our DMR code achieves an 80x speedup over the highly optimized serial Triangle program and a 2.3x speedup over a multicore implementation running with 48 threads. Our SP code is 3x faster than a multicore implementation with 48 threads on an input with 1 million literals. The PTA implementation is able to analyze six SPEC 2000 benchmark programs in just 74 milliseconds, achieving a geometric mean speedup of 9.3x over a 48-thread multicore version. Our MST code is slower than a multicore version with 48 threads for sparse graphs but significantly faster for denser graphs. This work provides several insights into how other morph algorithms can be efficiently implemented on GPUs.
Martin Burtscher, Rupesh Nasre, and Keshav Pingali.
A Quantitative Study of Irregular Programs on GPUs.
Proceedings of the IEEE International Symposium on Workload Characterization, pp. 141-151. November 2012.
GPUs have been used to accelerate many regular applications and, more recently, irregular applications in which the control flow and memory access patterns are data-dependent and statically unpredictable. This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures. For a suite of 13 benchmarks, we find that (i) irregularity at the warp level varies widely, (ii) control-flow irregularity and memory-access irregularity are largely independent of each other, and (iii) most kernels, including regular ones, exhibit some irregularity. A program's irregularity can change between different inputs, systems, and arithmetic precision but generally stays in a specific region of the irregularity space. Whereas some highly tuned implementations of irregular algorithms exhibit little irregularity, trading off extra irregularity for better locality or less work can improve overall performance.
Igor Szczyrba, Martin Burtscher, and Rafal Szczyrba.
Validating Critical Limits of the Universal Brain Injury Criterion.
Proceedings of the 2012 International Conference on Bioinformatics and Computational Biology, pp. 199-205. July 2012.
We present results of numerical simulations that further validate the critical limits we previously proposed for our universal Brain Injury Criterion (BIC). The BIC extends the applicability of the translational Head Injury Criterion (HIC) to arbitrary head motions by reformulating the acceleration-based HIC formula in terms of the energy transferred locally from the skull to the brain. Our simulations are based on a generalization of the Kelvin-Voigt (K-V) Close Head Injury model that includes a nonlinear strain-stress relation. We validate the proposed BIC limits against (i) the critical limit HIC15 = 700, (ii) the Diffuse Axonal Injury Tolerance Criterion (DAITC) for head rotations that has been derived from the K-V model and from experiments with animal brains, and (iii) recent experimental data on strain levels leading to permanent neuronal damage. Our results imply that for head rotations about various fixed axes, the critical BIC15 limits coincide with the HIC15 critical limit and are in agreement with the DAITC thresholds.
Paruj Ratanaworabhan, Martin Burtscher, Darko Kirovski, and Benjamin Zorn.
Hardware Support for Enforcing Isolation in Lock-Based Parallel Programs.
Proceedings of the 26th International Conference on Supercomputing, pp. 301-310. June 2012.
When lock-based parallel programs execute on conventional multicore hardware, faulty software can cause hard-to-debug race conditions in critical sections that violate the contract between locks and their protected shared variables. This paper proposes new hardware support for enforcing isolation of critical section execution. It can detect and tolerate races, allowing programs to execute race-free. Our hardware scheme targets the existing large code base of lock-based parallel programs written in type unsafe languages such as C and C++. Our approach works directly on unmodified executables. An evaluation of 13 programs from the SPLASH2 and PARSEC suites shows that the cost of the additional hardware and the impact on the overall execution time is minimal for these applications. Our mechanism is complementary to hardware transactional memory in that it uses similar structures but focuses on enhancing the reliability of existing lock-based programs.
Paruj Ratanaworabhan, Martin Burtscher, Darko Kirovski, Benjamin Zorn, Rahul Nagpal, and Karthik Pattabiraman.
Efficient Runtime Detection and Toleration of Asymmetric Races.
IEEE Transactions on Computers, Vol. 61, No. 4, pp. 548-562. April 2012.
We introduce ToleRace, a runtime system that allows programs to detect and even tolerate asymmetric data races. Asymmetric races are race conditions where one thread correctly acquires and releases a lock for a shared variable while another thread improperly accesses the same variable. ToleRace provides approximate isolation in the critical sections of lock-based parallel programs by creating a local copy of each shared variable when entering a critical section, operating on the local copies, and propagating the appropriate copies upon leaving the critical section. We start by characterizing all possible interleavings that can cause races and precisely describe the effect of ToleRace in each case. Then, we study the theoretical aspects of an oracle that knows exactly what type of interleaving has occurred. Finally, we present software implementations of ToleRace and evaluate them on multithreaded applications from the SPLASH2 and PARSEC suites.
Mario Mendez-Lojo, Martin Burtscher, and Keshav Pingali.
A GPU Implementation of Inclusion-based Points-to Analysis.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 107-116. February 2012.
Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate on pointer-based data structures such as graphs. For the most part, research has focused on GPU implementations of graph analysis algorithms that do not modify the structure of the graph, such as algorithms for breadth-first search and strongly-connected components. In this paper, we describe a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis. This algorithm is challenging to parallelize effectively on GPUs because it makes extensive modifications to the structure of the underlying graph and performs relatively little computation. In spite of this, our program, when executed on a 14 Streaming Multiprocessor GPU, achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores. Our implementation provides general insights into how to produce high-performance GPU implementations of graph algorithms, and it highlights key differences between optimizing parallel programs for multicore CPUs and for GPUs.
Aleksandar Milenkovic, Vladimir Uzelac, Milena Milenkovic, and Martin Burtscher.
Caches and Predictors for Real-time, Unobtrusive, and Cost-Effective Program Tracing in Embedded Systems.
IEEE Transactions on Computers, Vol. 60, No. 7, pp. 992-1005. July 2011.
The increasing complexity of modern embedded computer systems makes software development and system verification the most critical steps in the system development. To expedite verification and program debugging, chip manufacturers increasingly consider hardware infrastructure for program debugging and tracing, including logic to capture and filter traces, buffers to store traces, and a trace port through which the trace is read by the debug tools. In this paper, we introduce a new approach to capture and compress program execution traces in hardware. The proposed trace compressor encompasses two cost-effective structures, a stream descriptor cache and a last stream predictor. Information about the program flow is translated into a sequence of hit and miss events in these structures, thus dramatically reducing the number of bits that need to be sent out of the chip. We evaluate the efficiency of the proposed mechanism by measuring the trace port bandwidth on a set of benchmark programs. Our mechanism requires only 0.15 bits/instruction/CPU on average on the trace port, which is a six-fold improvement over state-of-the-art commercial solutions. The trace compressor requires an on-chip area that is equivalent to one third of a 1 kilobyte cache and it allows for continual and unobtrusive program tracing in real-time.
Olalekan A. Sopeju, Martin Burtscher, Ashay Rane, and James Browne.
AutoSCOPE: Automatic Suggestions for Code Optimizations Using PerfExpert.
Proceedings of the 2011 International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 19-25. July 2011.
Automated source-code performance optimization has four stages: measurement, diagnosis of bottlenecks, determination of optimizations, and rewriting of the source code. Each stage must be successfully implemented to enable the next stage. The PerfExpert tool supports automatic performance measurement and bottleneck diagnosis for multicore and multichip compute nodes, i.e., it implements the first two stages. This paper presents AutoSCOPE, a new system that extends PerfExpert by implementing the third stage. Based on PerfExpert's output, AutoSCOPE automatically determines appropriate source-code optimizations and compiler flags. We describe the process for selecting optimizations and evaluate the effectiveness of AutoSCOPE by applying it to three HPC production codes. Each of these codes is available in unoptimized and manually optimized versions. AutoSCOPE succeeds in selecting the same source-code transformations as were chosen by human experts in most cases. AutoSCOPE is an extensible framework to which additional optimizations and further rules for selecting optimizations can be added.
Molly A. O'Neil, Dan Tamir, and Martin Burtscher.
A Parallel GPU Version of the Traveling Salesman Problem.
Proceedings of the 2011 International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 348-353. July 2011.
This paper describes and evaluates an implementation of iterative hill climbing with random restart for determining high-quality solutions to the traveling salesman problem. With 100,000 restarts, this algorithm finds the optimal solution for four out of five 100-city TSPLIB inputs and yields a tour that is only 0.07% longer than the optimum on the fifth input. The presented implementation is highly parallel and optimized for GPU-based execution. Running on a single GPU, it evaluates over 20 billion tour modifications per second. It takes 32 CPUs with 8 cores each (256 cores total) to match this performance.
Igor Szczyrba, Martin Burtscher, and Rafal Szczyrba.
Computer Modeling of Diffuse Axonal Injury Mechanisms.
Proceedings of the 2011 International Conference on Bioinformatics and Computational Biology, pp. 401-407. July 2011.
We investigate numerically which properties of the human brain cause Diffuse Axonal Injuries (DAI) to appear in a scattered and pointwise manner near the gray/white matter boundary, mostly in the white matter. These simulations are based on our dually-nonlinear, viscoelastic, fluid Traumatic Brain Injury model, which includes a nonlinear stress/strain relation. We simulate rotational accelerations and decelerations of a human head that replicate realistic traumatic scenarios. The rotational loads are quantified by our Brain Injury Criterion, which extends the translational Head Injury Criterion to arbitrary head motions. Our simulations show that: (i) DAI occurrences near the gray/white matter boundary can be explained by the difference in the gray and the white matter's shear modulus values, (ii) the scattered/pointwise DAI character can be attributed to the nonlinear fluid aspect of the brain tissue, and (iii) the scattering of DAI deeper in the white matter appears to be caused by the complicated shape of the brain. Our results also show that the nonlinear stress/strain relation plays a secondary role in shaping basic DAI features.
Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Mendez-Lojo, Dimitrios Prountzos, and Xin Sui.
The Tao of Parallelism in Algorithms.
Proceedings of the ACM SIGPLAN 2011 Conference on Programming Language Design and Implementation, pp. 12-25. June 2011.
For more than thirty years, the parallel programming community has used the dependence graph as the main abstraction for reasoning about and exploiting parallelism in regular algorithms that use dense arrays, such as finite-differences and FFTs. In this paper, we argue that the dependence graph is not a suitable abstraction for algorithms in new application areas like machine learning and network analysis in which the key data structures are irregular data structures like graphs, trees, and sets. To address the need for better abstractions, we introduce a data-centric formulation of algorithms called the operator formulation in which an algorithm is expressed in terms of its action on data structures. This formulation is the basis for a structural analysis of algorithms that we call Tao-analysis. Tao-analysis can be viewed as an abstraction of algorithms that distills out algorithmic properties important for parallelization. It reveals that a generalized form of data-parallelism called amorphous data-parallelism is ubiquitous in algorithms, and that, depending on the Tao-structure of the algorithm, this parallelism may be exploited by compile-time, inspector-executor or optimistic parallelization, thereby unifying these seemingly unrelated parallelization techniques. Regular algorithms emerge as a special case of irregular algorithms, and many application-specific optimization techniques can be generalized to a broader context. These results suggest that the operator formulation and Tao analysis of algorithms can be the foundation of a systematic approach to parallel programming.
Jeff Diamond, Martin Burtscher, John McCalpin, Byoung-Do Kim, Stephen Kecker, and James Browne.
Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications.
Proceedings of the 2011 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 32-43. April 2011.
The computation nodes of modern supercomputers commonly consist of multiple multicore processors. To maximize the performance of such systems requires measurement, analysis, and optimization techniques that specifically target multicore environments. This paper first examines traditional unicore metrics and demonstrates how they can be misleading in a multicore system. Second, it examines and characterizes performance bottlenecks specific to multicore-based systems. Third, it describes performance measurement challenges that arise in multicore systems and outlines methods for extracting sound measurements that lead to performance optimization opportunities. The measurement and analysis process is based on a case study of the HOMME atmospheric modeling benchmark code from NCAR running on supercomputers built upon AMD Barcelona and Intel Nehalem quad-core processors. Applying the multicore bottleneck analysis to HOMME led to multicore aware source-code optimizations that increased performance by up to 35%. While the case studies were carried out on multichip nodes of supercomputers using an HPC application as the target for optimization, the pitfalls identified and the insights obtained should apply to any system that is composed of multicore processors.
Molly A. O'Neil and Martin Burtscher.
Floating-Point Data Compression at 75 Gb/s on a GPU.
Proceedings of the Fourth Workshop on General Purpose Processing Using GPUs, pp. 7:1-7:7. March 2011.
Numeric simulations often generate large amounts of data that need to be stored or sent to other compute nodes. This paper investigates whether GPUs are powerful enough to make real-time data compression and decompression possible in such environments, that is, whether they can operate at the 32- or 40-Gb/s throughput of emerging network cards. The fastest parallel CPU-based floating-point data compression algorithm operates below 20 Gb/s on eight Xeon cores, which is significantly slower than the network speed and thus insufficient for compression to be practical in high-end networks. As a remedy, we have created the highly parallel GFC compression algorithm for double-precision floating-point data. This algorithm is specifically designed for GPUs. It compresses at a minimum of 75 Gb/s, decompresses at 90 Gb/s and above, and can therefore improve internode communication throughput on current and upcoming networks by fully saturating the interconnection links with compressed data.
Martin Burtscher and Keshav Pingali.
An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm.
Chapter 6 in GPU Computing Gems Emerald Edition, pp. 75-92. January 2011.
This chapter describes the first CUDA implementation of the classical Barnes Hut n-body algorithm that runs entirely on the GPU. Unlike most other CUDA programs, our code builds an irregular tree-based data structure and performs complex traversals on it. It consists of six GPU kernels. The kernels are optimized to minimize memory accesses and thread divergence and are fully parallelized within and across blocks. Our CUDA code takes 5.2 seconds to simulate one time step with 5,000,000 bodies on a 1.3 GHz Quadro FX 5800 GPU with 240 cores, which is 74 times faster than an optimized serial implementation running on a 2.53 GHz Xeon E5540 CPU.
Martin Burtscher, Byoung-Do Kim, Jeff Diamond, John McCalpin, Lars Koesterke, and James Browne.
PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications.
Proceedings of the 2010 ACM/IEEE International Conference for High-Performance Computing, Networking, Storage and Analysis, pp. 1-11. November 2010.
HPC systems are notorious for operating at a small fraction of their peak performance, and the ongoing migration to multi-core and multi-socket compute nodes further complicates performance optimization. The readily available performance evaluation tools require considerable effort to learn and utilize. Hence, most HPC application writers do not use them. As remedy, we have developed PerfExpert, a tool that combines a simple user interface with a sophisticated analysis engine to detect probable core, socket, and node-level performance bottlenecks in each important procedure and loop of an application. For each bottleneck, PerfExpert provides a concise performance assessment and suggests steps that can be taken by the programmer to improve performance. These steps include compiler switches and optimization strategies with code examples. We have applied PerfExpert to several HPC production codes on the Ranger supercomputer. In all cases, it correctly identified the critical code sections and provided accurate assessments of their performance.
Vladimir Uzelac, Aleksandar Milenkovic, Martin Burtscher, and Milena Milenkovic.
Real-time Unobtrusive Program Execution Trace Compression Using Branch Predictor Events.
Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp. 97-106. October 2010.
Unobtrusive capturing of program execution traces in real-time is crucial in debugging cyber-physical systems. However, tracing even limited program segments is often cost-prohibitive, requiring wide trace ports and large on-chip trace buffers. This paper introduces a new cost-effective technique for capturing and compressing program execution traces in real time. It uses branch predictor-like structures in the trace module to losslessly compress the traces. This approach results in high compression ratios because it only has to transmit misprediction events to the software debugger. Coupled with an effective variable encoding scheme, our technique requires merely 0.036 bits/instruction of trace port bandwidth (a 28-fold improvement over the commercial state-of-the-art) at a cost of roughly 5,200 logic gates.
Xin Sui, Donald Nguyen, Martin Burtscher, and Keshav Pingali.
Parallel Graph Partitioning on Multicore Architectures.
Proceedings of the Languages and Compilers for Parallel Computing 23rd Annual Workshop, pp. 246-260. October 2010.
Graph partitioning is a common and frequent preprocessing step in many high-performance parallel applications on distributed- memory and shared-memory architectures. It is used to distribute graphs across memory and to improve spatial locality. There are several parallel implementations of graph partitioning for distributed-memory architectures. In this paper, we present a parallel graph partitioner that implements a variation of the Metis partitioner on a shared-memory, multicore architecture. We show that (1) the parallelism in this algorithm is an instance of the general amorphous data-parallelism pattern, and (2) a parallel implementation can be derived systematically from a sequential specification of the algorithm. The resulting program can be executed in parallel using the Galois system for optimistic parallelization. We show that the scalability of this parallel implementation compares favorably with that of a publicly available, hand-parallelized C implementation of the algorithm, ParMetis, but absolute performance is lower because of missing sequential optimizations in our system. On a set of 15 large, publicly available graphs, we achieve an average scalability of 2.98x on 8 cores with our implementation, compared with 1.77x for ParMetis, and we achieve an average speedup of 2.80x over Metis, compared with 3.60x for ParMetis. These results show that our systematic approach for parallelizing irregular algorithms on multicore architectures is promising.
Martin Burtscher, Benjamin Livshits, Gaurav Sinha, and Benjamin Zorn.
Proceedings of the USENIX Conference on Web Application Development (12 pages). June 2010.
Martin Burtscher and Paruj Ratanaworabhan.
gFPC: A Self-Tuning Compression Algorithm.
Proceedings of the 2010 Data Compression Conference, pp. 396-405. March 2010.
This paper presents and evaluates gFPC, a self-tuning implementation of the FPC compression algorithm for double-precision floating-point data. gFPC uses a genetic algorithm to repeatedly reconfigure four hash-function parameters, which enables it to adapt to changes in the data during compression. Self tuning increases the harmonic-mean compression ratio on thirteen scientific datasets from 22% to 28% with sixteen kilobyte hash tables and from 36% to 43% with one megabyte hash tables. Individual datasets compress up to 1.72 times better. The self-tuning overhead reduces the compression speed by a factor of four but makes decompression faster because of the higher compression ratio. On a 2.93 GHz Xeon processor, gFPC compresses at a throughput of almost one gigabit per second and decompresses at over seven gigabits per second.
Mario Mendez-Lojo, Donald Nguyen, Dimitrios Prountzos, Xin Sui, M. Amber Hassaan, Milind Kulkarni, Martin Burtscher, and Keshav Pingali.
Structure-driven Optimizations for Amorphous Data-parallel Programs.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 3-14. January 2010.
Irregular algorithms are organized around pointer-based data structures such as graphs and trees, and they are ubiquitous in applications. Recent work by the Galois project has provided a systematic approach for parallelizing irregular applications based on the idea of optimistic or speculative execution of programs. However, the overhead of optimistic parallel execution can be substantial. In this paper, we show that many irregular algorithms have structure that can be exploited and present three key optimizations that take advantage of algorithmic structure to reduce speculative overheads. We describe the implementation of these optimizations in the Galois system and present experimental results to demonstrate their benefits. To the best of our knowledge, this is the first system to exploit algorithmic structure to optimize the execution of irregular programs.
Vladimir Uzelac, Aleksandar Milenkovic, Milena Milenkovic, and Martin Burtscher.
Real-time, Unobtrusive, and Efficient Program Execution Tracing with Stream Caches and Last Stream Predictors.
Proceedings of the 2009 International Conference of Computer Design, pp. 173-178. October 2009.
This paper introduces a new hardware mechanism for capturing and compressing program execution traces unobtrusively in real-time. The proposed mechanism is based on two structures called stream cache and last stream predictor. We explore the effectiveness of a trace module based on these structures and analyze the design space. We show that our trace module, with less than 600 bytes of state, achieves a trace-port bandwidth of 0.15 bits/instruction/processor, which is over six times better than state-of-the-art commercial designs.
Carsten Burstedde, Martin Burtscher, Omar Ghattas, Georg Stadler, Tiankai Tu, and Lucas C. Wilcox.
ALPS: A Framework for Parallel Adaptive PDE Solution.
Journal of Physics: Conference Series, Vol. 180 (8 pages). August 2009.
Adaptive mesh refinement and coarsening (AMR) is essential for the numerical solution of partial differential equations (PDEs) that exhibit behavior over a wide range of length and time scales. Because of the complex dynamic data structures and communication patterns and frequent data exchange and redistribution, scaling dynamic AMR to tens of thousands of processors has long been considered a challenge. We are developing ALPS, a library for dynamic mesh adaptation of PDEs that is designed to scale to hundreds of thousands of compute cores. Our approach uses parallel forest-of-octree-based hexahedral finite element meshes and dynamic load balancing based on space-filling curves. ALPS supports arbitrary-order accurate continuous and discontinuous finite element/spectral element discretizations on general geometries. We present scalability and performance results for two applications from geophysics: seismic wave propagation and mantle convection.
Jeff Diamond, Byoung-Do Kim, Martin Burtscher, Steve Keckler, Keshav Pingali, and Jim Browne.
Multicore Optimization for Ranger.
Proceedings of the 2009 TeraGrid Conference (8 pages). June 2009.
As we enter the multicore era, it is of growing concern that an important class of high-performance parallel applications with near perfect weak scaling across nodes cannot take advantage of more than two cores per chip before saturating the on-chip memory hierarchy. This paper presents a first step towards solving this on-chip scalability issue at the source-code level. We begin with a detailed case study of two important simulation benchmarks on the Ranger supercomputing cluster. The two codes are Homme, a spectral element code, and Mangll, a finite element code. Whereas both applications had previously been classified as a computationally intense and memory light, we instead found both of them to be memory bound, achieving less than 0.5 IPC. Each application has lead to a different strategy for attaining a better memory access/computation balance. Both codes demonstrate substantially worse intra-chip (multicore) scaling than inter-node scaling. This study is primarily empirical and presents advanced techniques to locate and ameliorate intra-node bottlenecks in applications. As classical optimization for computation is replaced with the more difficult optimization for memory bandwidth, one goal of this study is to make the use of detailed architectural knowledge and low-level program counters more accessible to help a wider programming audience optimize their code. Based on our analysis of Homme, we apply several source transformations to reduce memory bandwidth, including utilizing idle FPUs to replace the use of non fundamental memory arrays with computation. The performance characteristics of Mangll suggest an optimization strategy for better exploiting the available memory bandwidth by rewriting loops so that the compiler can use vector instructions that load and store multiple data values in a single memory transaction.
Milind Kulkarni, Martin Burtscher, Calin Cascaval, and Keshav Pingali.
Lonestar: A Suite of Parallel Irregular Programs.
Proceedings of the 2009 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 65-76. April 2009.
Until recently, parallel programming has largely focused on the exploitation of data-parallelism in dense matrix programs. However, many important application domains, including meshing, clustering, simulation, and machine learning, have very different algorithmic foundations: they require building, computing with, and modifying large sparse graphs. In the parallel programming literature, these types of applications are usually classified as irregular applications, and relatively little attention has been paid to them. To study and understand the patterns of parallelism and locality in sparse graph computations better, we are in the process of building the Lonestar benchmark suite. In this paper, we characterize the first five programs from this suite, which target domains like data mining, survey propagation, and design automation. We show that even such irregular applications often expose large amounts of parallelism in the form of amorphous data-parallelism. Our speedup numbers demonstrate that this new type of parallelism can successfully be exploited on modern multi-core machines.
Martin Burtscher and Paruj Ratanaworabhan.
pFPC: A Parallel Compressor for Floating-Point Data.
Proceedings of the 2009 Data Compression Conference, pp. 43-52. March 2009.
This paper describes and evaluates pFPC, a parallel implementation of the lossless FPC compression algorithm for 64-bit floating-point data. pFPC can trade off compression ratio for throughput. For example, on a 4-core 3 GHz Xeon system, it compresses our nine datasets by 18% at a throughput of 1.36 gigabytes per second and by 41% at a throughput of 570 megabytes per second. Decompression is even faster. Our experiments show that the thread count should match or be a small multiple of the data's dimensionality to maximize the compression ratio and the chunk size should be at least equal to the system's page size to maximize the throughput.
Paruj Ratanaworabhan, Martin Burtscher, Darko Kirovski, Rahul Nagpal, Karthik Pattabiraman, and Benjamin Zorn.
Detecting and Tolerating Asymmetric Races.
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 173-184. February 2009.
This paper introduces ToleRace, a runtime system that allows programs to detect and even tolerate asymmetric data races. Asymmetric races are race conditions where one thread correctly acquires and releases a lock for a shared variable while another thread improperly accesses the same variable. ToleRace provides approximate isolation in the critical sections of lock-based parallel programs by creating a local copy of each shared variable when entering a critical section, operating on the local copies, and propagating the appropriate copies upon leaving the critical section. We start by characterizing all possible interleavings that can cause races and precisely describe the effect of ToleRace in each case. Then, we study the theoretical aspects of an oracle that knows exactly what type of interleaving has occurred. Finally, we present two software implementations of ToleRace and evaluate them on multithreaded applications from the SPLASH2 and PARSEC suites. Our implementation on top of a dynamic instrumentation tool, which works directly on executables and requires no source code modifications, incurs an overhead of a factor of two on average. Manually adding ToleRace to the source code of these applications results in an average overhead of 6.4 percent.
Milind Kulkarni, Martin Burtscher, Rajasekhar Inkulu, Calin Cascaval, and Keshav Pingali.
How Much Parallelism is There in Irregular Applications?
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 3-14. February 2009.
Irregular programs are programs organized around pointer-based data structures such as trees and graphs. Recent investigations by the Galois project have shown that many irregular programs have a generalized form of data-parallelism called amorphous data-parallelism. However, in many programs, amorphous data-parallelism cannot be uncovered using static techniques, and its exploitation requires runtime strategies such as optimistic parallel execution. This raises a natural question: how much amorphous data-parallelism actually exists in irregular programs? In this paper, we describe the design and implementation of a tool called ParaMeter that produces parallelism profiles for irregular programs. Parallelism profiles are an abstract measure of the amount of amorphous data-parallelism at different points in the execution of an algorithm, independent of implementation-dependent details such as the number of cores, cache sizes, load-balancing, etc. ParaMeter can also generate constrained parallelism profiles for a fixed number of cores. We show parallelism profiles for seven irregular applications, and explain how these profiles provide insight into the behavior of these applications.
Martin Burtscher and Paruj Ratanaworabhan.
FPC: A High-Speed Compressor for Double-Precision Floating-Point Data.
IEEE Transactions on Computers, Vol. 58, No. 1, pp. 18-31. January 2009.
Many scientific programs exchange large quantities of double-precision data between processing nodes and with mass storage devices. Data compression can reduce the number of bytes that need to be transferred and stored. However, data compression is only likely to be employed in high-end computing environments if it does not impede the throughput. This paper describes and evaluates FPC, a fast lossless compression algorithm for linear streams of 64-bit floating-point data. FPC works well on hard-to-compress scientific datasets and meets the throughput demands of high-performance systems. A comparison with five lossless compression schemes, BZIP2, DFCM, FSD, GZIP, and PLMI, on four architectures and thirteen datasets shows that FPC compresses and decompresses one to two orders of magnitude faster than the other algorithms at the same geometric-mean compression ratio. Moreover, FPC provides a guaranteed throughput as long as the prediction tables fit into the L1 data cache. For example, on a 1.6 GHz Itanium 2 server, the throughput is 670 megabytes per second regardless of what data are being compressed.
Igor Szczyrba, Martin Burtscher, and Rafal Szczyrba.
On the Role of a Nonlinear Stress-Strain Relation in Brain Trauma.
Proceedings of the 2008 International Conference on Bioinformatics & Computational Biology, pp. 265-271. July 2008.
We investigate how a nonlinear stress-strain relation (that leads to a stiffening of the brain matter under strain) influences the brain dynamics in traumatic situations. We numerically simulate rapid rotational accelerations and decelerations of a human head using our generalization of the viscoelastic Kelvin-Voigt brain injury model that includes an experimentally established dependency of stress on strain. The rotational loads are expressed in terms of the Brain Injury Criterion, which we developed to extend the translational Head Injury Criterion to arbitrary head motions. Under traumatic loads corresponding to HIC15 = 700, our model predicts that the brain stiffening reduces the maximal strain near the skull by up to 70%, but leads to a distribution of relatively high strain values throughout the brain. We show how the brain's complex geometry enhances the random spatial distribution of high strain values.
Martin Burtscher, Milind Kulkarni, Dimitrios Prountzos, and Keshav Pingali.
On the Scalability of an Automatically Parallelized Irregular Application.
Proceedings of the Languages and Compilers for Parallel Computing 21st Annual Workshop, pp. 109-123. July 2008.
Irregular applications, i.e., programs that manipulate pointer-based data structures such as graphs and trees, constitute a challenging target for parallelization because the amount of parallelism is input dependent and changes dynamically. Traditional dependence analysis techniques are too conservative to expose this parallelism. Even manual parallelization is difficult, time consuming, and error prone. The Galois system parallelizes such applications using an optimistic approach that exploits higher-level semantics of abstract data types. In this paper, we study the performance and scalability of a Galoised, i.e., automatically parallelized, version of Delaunay mesh refinement (DR) on a shared-memory system with 128 CPUs. DR is an important irregular application that is used, e.g., in graphics and finite-element codes. The parallelized program scales to 64 threads, where it reaches a speedup of 25.8. For large numbers of threads, the performance is hampered by the load imbalance and the nonuniform memory latency, both of which grow as the number of threads increases. While these two issues will have to be addressed in future work, we believe our results already show the Galois approach to be very promising.
Paruj Ratanaworabhan and Martin Burtscher.
Program Phase Detection based on Critical Basic Block Transitions.
Proceedings of the 2008 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 11-21. April 2008.
Many programs go through phases as they execute. Knowing where these phases begin and end can be beneficial. For example, adaptive architectures can exploit such information to lower their power consumption without much loss in performance. Architectural simulations can benefit from phase information by simulating only a small interval of each program phase, which significantly reduces the simulation time while still yielding results that are representative of complete simulations. This paper presents a lightweight profile-based phase detection technique that marks each phase change boundary in the programs binary at the basic block level with a critical basic block transition (CBBT). It is independent of execution windows and does not explicitly employ the notion of threshold to make a phase change decision. We evaluate the effectiveness of CBBTs for reconfiguring the L1 data cache size and for guiding architectural simulations. Our CBBT method is as effective at dynamically reducing the L1 data cache size as idealized cache reconfiguration schemes are. Using CBBTs to statically determine simulation intervals yields as low a CPI error as the well-known SimPoint method does. In addition, experimental results indicate the CBBTs effectiveness in both the self-trained and cross-trained inputs, demonstrating the CBBTs stability across different program inputs.
Igor Szczyrba, Martin Burtscher, and Rafal Szczyrba.
Computational Modeling of Brain Dynamics during Repetitive Head Motions.
Proceedings of the 2007 International Conference on Modeling, Simulation and Visualization Methods, pp. 143-149. June 2007.
We numerically model the effects of repetitive human head motions in traumatic scenarios that are associated with severe brain injuries. Our results are based on the linear Kelvin-Voigt brain injury model, which treats the brain matter as a viscoelastic solid, and on our nonlinear generalization of that model, which emulates the gel-like character of the brain tissue. To properly compare the various traumatic scenarios, we use the BIC scale, which we developed to generalize the HIC scale to arbitrary head motions. Our simulations of the brain dynamics in sagittal and horizontal 2D cross-sections of the skull interior indicate that a repetitive reversal of traumatic head rotations can increase the severity/likelihood of brain injuries due to resonance effects.
Igor Szczyrba, Martin Burtscher, and Rafal Szczyrba.
A Proposed New Brain Injury Tolerance Criterion Based on the Exchange of Energy between the Skull and the Brain.
Proceedings of the 2007 Summer Bioengineering Conference (2 pages). June 2007.
Martin Burtscher and Paruj Ratanaworabhan.
High Throughput Compression of Double-Precision Floating-Point Data.
Proceedings of the 2007 Data Compression Conference, pp. 293-302. March 2007.
This paper describes FPC, a lossless compression algorithm for linear streams of 64-bit floating-point data. FPC is designed to compress well while at the same time meeting the high throughput demands of scientific computing environments. On our thirteen datasets, it achieves a substantially higher average compression ratio than BZIP2, DFCM, FSD, GZIP, and PLMI. At comparable compression ratios, it compresses and decompresses 8 to 300 times faster than the other five algorithms.
Milena Milenkovic, Aleksandar Milenkovic, and Martin Burtscher.
Algorithms and Hardware Structures for Unobtrusive Real-Time Compression of Instruction and Data Address Traces.
Proceedings of the 2007 Data Compression Conference, pp. 283-292. March 2007.
Instruction and data address traces are widely used by computer designers for quantitative evaluations of new architectures and workload characterization, as well as by software developers for program optimization, performance tuning, and debugging. Such traces are typically very large and need to be compressed to reduce the storage, processing, and communication bandwidth requirements. However, preexisting general-purpose and trace-specific compression algorithms are designed for software implementation and are not suitable for runtime compression. Compressing program execution traces at runtime in hardware can deliver insights into the behavior of the system under test without any negative interference with normal program execution. Traditional debugging tools, on the other hand, have to stop the program frequently to examine the state of the processor. Moreover, software developers often do not have access to the entire history of computation that led to an erroneous state. In addition, stepping through a program is a tedious task and may interact with other system components in such a way that the original errors disappear, thus preventing any useful insight. The need for unobtrusive tracing is further underscored by the development of computer systems that feature multiple processing cores on a single chip. In this paper, we introduce a set of algorithms for compressing instruction and data address traces that can easily be implemented in an on-chip trace compression module and describe the corresponding hardware structures. The proposed algorithms are analytically and experimentally evaluated. Our results show that very small hardware structures suffice to achieve a compression ratio similar to that of a software implementation of gzip while being orders of magnitude faster. A hardware structure with slightly over 2 KB of state achieves a compression ratio of 125.9 for instruction address traces, whereas gzip achieves a compression ratio of 87.4. For data address traces, a hardware structure with 5 KB of state achieves a compression ratio of 6.1, compared to 6.8 achieved by gzip.
Ilya Ganusov and Martin Burtscher.
Future Execution: A Prefetching Mechanism that Uses Multiple Cores to Speed up Single Threads.
ACM Transactions on Architecture and Code Optimization, Vol. 3, No. 4, pp. 424-449. December 2006.
This paper describes future execution (FE), a simple hardware-only technique to accelerate individual program threads running on multi-core microprocessors. Our approach uses available idle cores to prefetch important data for the threads executing on the active cores. FE is based on the observation that many cache misses are caused by loads that execute repeatedly and whose address-generating program slices do not change (much) between consecutive executions. To exploit this property, FE dynamically creates a prefetching thread for each active core by simply sending a copy of all committed, register-writing instructions to an otherwise idle core. The key innovation is that on the way to the second core, a value predictor replaces each predictable instruction in the prefetching thread with a load immediate instruction, where the immediate is the predicted result that the instruction is likely to produce during its n^th next dynamic execution. Executing this modified instruction stream (i.e., the prefetching thread) on another core allows to compute the future results of the instructions that are not directly predictable, issue prefetches into the shared memory hierarchy, and thus reduce the primary threads' memory access time. We demonstrate the viability and effectiveness of future execution by performing cycle-accurate simulations of a two-way CMP running the single-threaded SPECcpu2000 benchmark suite. Our mechanism improves program performance by 12% on average over a baseline that already includes an optimized hardware stream prefetcher. We further show that FE is complementary to runahead execution and that the combination of these two techniques raises the average speedup to 20% above the performance of the baseline processor with the aggressive stream prefetcher.
Paruj Ratanaworabhan and Martin Burtscher.
Load Instruction Characterization and Acceleration of the BioPerf Programs.
Proceedings of the IEEE International Symposium on Workload Characterization, pp. 71-79. October 2006.
The load instructions of some of the bioinformatics applications in the BioPerf suite possess interesting characteristics: only a few static loads cover almost the entire dynamic load execution and they almost always hit in the data cache. Nevertheless, these load instructions represent a major performance bottleneck. They often precede or follow branches that are hard to predict, which makes their L1 hit latency difficult to hide even in dynamically scheduled execution cores. This paper investigates this behavior and suggests simple source-code transformations to improve the performance of these benchmark programs by up to 92%.
Ilya Ganusov and Martin Burtscher.
Efficient Emulation of Hardware Prefetchers via Event-Driven Helper Threading.
Proceedings of the 2006 International Conference on Parallel Architectures and Compilation Techniques, pp. 144-153. September 2006.
The advance of multi-core architectures provides significant benefits for parallel and throughput-oriented computing, but the performance of individual computation threads does not improve and may even suffer a penalty because of the increased contention for shared resources. This paper explores the idea of using available general-purpose cores in a CMP as helper engines for individual threads running on the active cores. We propose a lightweight architectural framework for efficient event-driven software emulation of complex hardware accelerators and describe how this framework can be applied to implement a variety of prefetching techniques. We demonstrate the viability and effectiveness of our framework on a wide range of applications from the SPEC CPU2000 and Olden benchmark suites. On average, our mechanism provides performance benefits within 5% of pure hardware implementations. Furthermore, we demonstrate that running event-driven prefetching threads on top of a baseline with a hardware stride prefetcher yields significant speedups for many programs. Finally, we show that our approach provides competitive performance improvements over other hardware approaches for multi-core execution while executing fewer instructions and requiring considerably less hardware support.
Martin Burtscher and Igor Szczyrba.
Computational Simulation and Visualization of Traumatic Brain Injuries.
Proceedings of the 2006 International Conference on Modeling, Simulation and Visualization Methods, pp. 101-107. June 2006.
We numerically model the human brain dynamics in realistic 2D cross-sections during traumatic scenarios corresponding to the Head Injury Criterion's critical value HIC36=1000. Our simulations are based on the Kelvin-Voigt Partial Differential Equations that treat the brain tissue as a viscoelastic solid and on our nonlinear generalization of these linear PDEs that treats the tissue as an incompressible, viscoelastic fluid. To better visualize and study the brain motion, we have developed curved-vector-field plots and movies, which allow the simultaneous analysis of velocity fields with a wide range of magnitudes (as appear in turbulent flows) as well as the corresponding trajectories of brain-matter parcels. The discovered complex brain tissue oscillations shed new light on the mechanisms of Closed Head Injuries. Our results may help establish a universal brain injury tolerance criterion.
TCgen 2.0: A Tool to Automatically Generate Lossless Trace Compressors.
Computer Architecture News, Vol. 34, No. 3, pp. 1-8. June 2006.
This tutorial explains the usage of TCgen, a tool for generating portable lossless trace compressors based on user-provided trace format descriptions. TCgen automatically translates these descriptions into optimized C source code. In many cases, the synthesized code is faster and compresses better than BZIP2, GZIP, MACHE, PDATS II, SBC, SEQUITUR, and VPC3, making it ideal for trace-based research and teaching environments as well as for trace archives. Version 2 includes several improvements and simplifies the integration of the generated code with other code through a streamlined API.
Paruj Ratanaworabhan, Jian Ke, and Martin Burtscher.
Fast Lossless Compression of Scientific Floating-Point Data.
Proceedings of the 2006 Data Compression Conference, pp. 133-142. March 2006.
In scientific computing environments, large amounts of floating-point data often need to be transferred between computers as well as to and from storage devices. Compression can reduce the number of bits that need to be transferred and stored. However, the runtime overhead due to compression may be undesirable in high-performance settings where short communication latencies and high bandwidths are essential. This paper describes and evaluates a new compression algorithm that is tailored to such environments. It typically compresses numeric floating-point values better and faster than other algorithms do. On our data sets, it achieves compression ratios between 1.2 and 4.2 as well as compression and decompression throughputs between 2.8 and 5.9 million 64-bit double-precision numbers per second on a 3GHz Pentium 4 machine.
Sandra J. Jackson and Martin Burtscher.
Self-Optimizing Finite State Machines for Confidence Estimators.
Proceedings of the 2006 Workshop on Introspective Architecture (8 pages). February 2006.
Modern microprocessors are increasingly relying on speculation as a key technique for improving performance and reducing power, temperature and energy. Confidence estimators are necessary to prevent likely misspeculations. These confidence estimators are commonly realized using a Finite State Machine (FSM) called a saturating counter. This paper presents a hardware method that allows FSMs to dynamically optimize their state transitions and confidence thresholds to improve CPU performance by automatically adapting to the current workload. The technique further allows the FSMs to continuously adjust to changing program conditions. These adaptable, self-optimizing confidence estimators are evaluated as a component in a value predictor on the C programs from the SPECcpu2000 benchmark suite. On average, the self-optimizing method achieves a miss rate reduction of 11% with a maximum of 47%. The self-optimizing confidence estimator can provide a speedup of 4%.
Christianto C. Liu, Ilya Ganusov, Martin Burtscher, and Sandip Tiwari.
Bridging the Processor-Memory Performance Gap with 3D IC Technology.
IEEE Design & Test of Computers, Vol. 22, No. 6, pp. 556-564. November 2005.
The growing gap between processor speeds and memory latency is becoming a serious issue. Three-dimensional integrated circuit (3D IC) technology can bridge the processor-memory gap by bringing DRAM main memory on chip, thereby reducing main memory latency, and by allowing for more levels and larger sizes of on-chip caches. Using stream prefetching and on-chip main memory, only a small L2/L3 cache combination is needed to obtain excellent performance. With proper system design, a two-layer, 3D microprocessor-memory structure can achieve performance that is within 5-7% of the performance of a processor with a perfect L2 cache.
Martin Burtscher, Ilya Ganusov, Sandra J. Jackson, Jian Ke, Paruj Ratanaworabhan, and Nana B. Sam.
The VPC Trace-Compression Algorithms.
IEEE Transactions on Computers, Vol. 54, No. 11, pp. 1329-1344. November 2005.
Execution traces, which are used to study and analyze program behavior, are often so large that they need to be stored in compressed form. This paper describes the design and implementation of four value prediction based compression (VPC) algorithms for traces that record the PC as well as other information about executed instructions. VPC1 directly compresses traces using value predictors, VPC2 adds a second compression stage, and VPC3 utilizes value predictors to convert traces into streams that can be compressed better and more quickly than the original traces. VPC4 introduces further algorithmic enhancements and is automatically synthesized. Of the 55 SPECcpu2000 traces we evaluate, VPC4 compresses 36 better, decompresses 26 faster and compresses 53 faster than BZIP2, MACHE, PDATS II, SBC and SEQUITUR. It delivers the highest geometric-mean compression rate, decompression speed and compression speed because of the predictors' simplicity and their ability to exploit local value locality. Most other compression algorithms can only exploit global value locality.
Christianto C. Liu, Ilya Ganusov, Martin Burtscher, and Sandip Tiwari.
Improving Microprocessor Performance through 3D IC Technology.
Proceedings of the Semiconductor Research Corporation's TECHCON 2005 Conference (4 pages). October 2005.
The growing gap between logic and memory performance has forced microprocessors to employ increasingly complex designs, including out-of-order execution and speculation, as well as memory hierarchies with more levels and larger caches to hide the main memory latency. In this paper, we examine how 3D IC technology can be effective in improving the interactions between processors and their memory. We simulate SPEC2000 integer and floating-point programs on an extended version of the SimpleScalar v4.0 simulator. The results reveal that stacking an L2 cache on top of the CPU provides little benefit, whereas stacking main memory shows significant performance improvements. We also investigate the benefits of very large L2/L3 caches, which can be stacked on top of the CPU through 3D technology. In addition, we show that stream prefetching is an effective technique to prefetch data into these large caches.
Ilya Ganusov and Martin Burtscher.
Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors.
Proceedings of the 2005 International Conference on Parallel Architectures and Compilation Techniques, pp. 350-360. September 2005.
This paper proposes a new hardware technique for using one core of a CMP to prefetch data for a thread running on another core. Our approach simply executes a copy of all non-control instructions in the prefetching core after they have executed in the primary core. On the way to the second core, each instruction's output is replaced by a prediction of the likely output that the n^th future instance of this instruction will produce. Speculatively executing the resulting instruction stream on the second core issues load requests that the main program will probably reference in the future. Unlike previously proposed thread-based prefetching approaches, our technique does not need any thread spawning points, features an adjustable lookahead distance, does not require complicated analyzers to extract prefetching threads, is recovery-free, and necessitates no storage for the prefetching threads. We demonstrate that for the SPECcpu2000 benchmark suite, our mechanism significantly increases the prefetching coverage and improves the primary core's performance by 10% on average over baseline that already includes an aggressive hardware stream prefetcher. We further show that our approach works well in combination with runahead execution.
Nana B. Sam and Martin Burtscher.
Improving Memory System Performance with Energy-Efficient Value Speculation.
Computer Architecture News, Vol. 33, No. 4, pp. 121-127. September 2005.
Microprocessor speeds have been improving much faster than memory speeds, resulting in the CPU spending a larger and larger amount of time waiting for data. Processor designers have employed several ways to improve memory performance, including hierarchical caching, prefetching, and faster memory chips. Yet, memory accesses still represent a major performance bottleneck and much remains to be done to tolerate the increasing memory latencies. Load-value prediction has been shown to effectively hide some of this latency. However, the hardware required to achieve good performance is substantial, making load-value prediction unappealing in light of increasing power constraints. In this paper, we present a novel predictor that significantly increases CPU performance while at the same time decreasing the energy consumption of the entire processor relative to a baseline with a well-performing hybrid load-value predictor.
Jian Ke, Martin Burtscher, and Evan Speight.
Tolerating Message Latency through the Early Release of Blocked Receives.
Proceedings of Euro-Par 2005, Lecture Notes in Computer Science, pp. 19-29. August 2005.
Large message latencies often lead to poor performance of parallel applications. In this paper, we investigate a latency-tolerating technique that immediately releases all blocking receives, even when the message has not yet (completely) arrived, and enforces execution correctness through page protection. This approach eliminates false message data dependencies on incoming messages and allows the computation to proceed as early as possible. We implement and evaluate our early-release technique in the context of an MPI runtime library. The results show that the execution speed of MPI applications improves by up to 60% when early release is enabled. Our approach also enables faster and easier parallel programming as it frees programmers from adopting more complex nonblocking receives and from tuning message sizes to explicitly reduce false message data dependencies.
Jian Ke, Martin Burtscher, and Evan Speight.
Reducing Communication Time through Message Prefetching.
Proceedings of the 2005 International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 557-563. June 2005.
The latency of large messages often leads to poor performance of parallel applications. In this paper, we investigate a novel latency reduction technique where message receivers prefetch messages from senders before the matching sends are called. When the send is finally called, only the parts of the message that have changed since the prefetch need to be transmitted, resulting in a smaller message. Our message prefetching technique initiates communication while the sender is still in the computation phase and thus overlaps computation with communication to hide part of the message latency. We implement and evaluate our technique in the context of an MPI run-time library. The results show that the execution speed of five MPI applications improves by up to 24% when message prefetching is enabled.
Martin Burtscher and Igor Szczyrba.
On the Role of the Brain's Geometry in Closed Head Injuries.
Proceedings of the 2005 Summer Bioengineering Conference (2 pages). June 2005.
Martin Burtscher and Igor Szczyrba.
Numerical Modeling of Brain Dynamics in Traumatic Situations - Impulsive Translations.
Proceedings of the 2005 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, pp. 205-211. June 2005.
We numerically model the brain dynamics during and after impulsive head translations using linear Partial Differential Equations (PDEs) describing viscoelastic solids and a nonlinear generalization of these PDEs describing incompressible, viscoelastic fluids. The brain matter motion and the sensitivity of the solutions with respect to the skull's geometry differ substantially in the two cases. In particular, the oscillatory rotational flows, which we discovered to appear in the matter after the translations stop, are quite distinct. The significance of the results for understanding the mechanisms of Closed Head Injuries (CHI) is discussed.
Ilya Ganusov and Martin Burtscher.
On the Importance of Optimizing the Configuration of Stream Prefetchers.
Proceedings of the 3rd Annual ACM SIGPLAN Workshop on Memory Systems Performance, pp. 54-61. June 2005.
This paper provides a detailed analysis of how the parameters of hardware prefetchers affect the memory system performance. In particular, we found the configuration of the frequently used stream prefetcher to have a major impact on the runtime, making parameter optimizations imperative when comparing a stream prefetcher with other prefetching techniques. For example, we show that adjusting the prefetch distance to the optimal value can increase the average speedup over the SPECcpu2000 benchmark suite from 40% to 70%. Moreover, our investigation of the performance of runahead prefetching relative to stream prefetching shows that choosing a non-optimal stream prefetcher as a baseline can distort the results by as much as a factor of two.
Nana B. Sam and Martin Burtscher.
Complex Load-Value Predictors: Why We Need Not Bother.
Proceedings of the Fourth Annual Workshop on Duplicating, Deconstructing, and Debunking, pp. 16-24. June 2005.
Memory accesses continue to represent a major performance bottleneck and much remains to be done to tolerate their latencies. A large body of work exists that presents load-value prediction as an effective means to hide some of the memory latency. To increase the prediction accuracy and hence the performance, researchers have proposed more complex and larger predictor design. This paper re-evaluates load-value predictors and examines the tradeoffs between predictor size, latency, energy consumption and performance. We demonstrate that even though complex predictors deliver the highest accuracy, they are unlikely to be implemented in hardware because they provide a much worse energy-performance tradeoff than simpler predictors with moderate sizes.
Nana B. Sam and Martin Burtscher.
On the Energy-Efficiency of Speculative Hardware.
Proceedings of the 2005 ACM International Conference on Computing Frontiers, pp. 361-370. May 2005.
Microprocessor trends are moving towards wider architectures and more aggressive speculation. With the increasing transistor budgets, energy consumption has become a critical design constraint. To address this problem, several researchers have proposed and evaluated energy-efficient variants of speculation mechanisms. However, such hardware is typically evaluated in isolation and its impact on the energy consumption of the rest of the processor, for example, due to wrong-path executions, is ignored. Moreover, the available metrics that would provide a thorough evaluation of an architectural optimization employ somewhat complicated formulas with hard-to-measure parameters. In this paper, we introduce a simple method to accurately compare the energy-efficiency of speculative architectures. Our metric is based on runtime analysis of the entire processor chip and thus captures the energy consumption due to the positive as well as the negative activities that arise from the speculation activities. We demonstrate the usefulness of our metric on the example of value speculation, where we found some proposed value predictors, including low-power designs, not to be energy-efficient.
Martin Burtscher and Nana B. Sam.
Automatic Generation of High-Performance Trace Compressors.
Proceedings of the 2005 International Symposium on Code Generation and Optimization, pp. 229-24. March 2005.
Program execution traces are frequently used in industry and academia. Yet, most trace-compression algorithms have to be re-implemented every time the trace format is changed, which takes time, is error prone, and often results in inefficient solutions. This paper describes and evaluates TCgen, a tool that automatically generates portable, customized, high-performance trace compressors. All the user has to do is provide a description of the trace format and select one or more predictors to compress the fields in the trace records. TCgen translates this specification into C source code and optimizes it for the specified trace format and predictor algorithms. On average, the generated code is faster and compresses better than the six other compression algorithms we have tested. For example, a comparison with SBC, one of the best trace-compression algorithms in the current literature, shows that TCgen's synthesized code compresses SPECcpu2000 address traces 23% more, decompresses them 24% faster, and compresses them 1029% faster.
Martin Burtscher and Ilya Ganusov.
Automatic Synthesis of High-Speed Processor Simulators.
Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 55-66. December 2004.
Microprocessor simulators are very popular in research and teaching environments. For example, functional simulators are often used to perform architectural studies, to fast-forward over uninteresting code, to generate program traces, and to warm up tables before switching to a more detailed but slower simulator. Unfortunately, most portable functional simulators are on the order of 100 times slower than native execution. This paper describes a set of novel techniques and optimizations to synthesize portable functional simulators that are only 6.6 times slower on average (16 times in the worst case) than native execution and 19 times faster than SimpleScalar's sim-fast on the SPECcpu2000 programs. When simulating a memory hierarchy, the synthesized code is 2.6 times faster than the equivalent ATOM code. Our fully automated synthesis approach works without access to source/assembly code or debug information. It generates C code, integrates optional user-provided code, performs unwanted-code removal, preserves basic blocks, generates low-overhead profiles, employs a simple heuristic to determine potential jump targets, only compiles important instructions, and utilizes mixed-mode execution, i.e., it interleaves compiled and interpreted simulation to maximize performance.
Jian Ke, Martin Burtscher, and Evan Speight.
Runtime Compression of MPI Messages to Improve the Performance and Scalability of Parallel Applications.
Proceedings of the SC 2004 High-Performance Computing, Networking and Storage Conference, pp. 59-65. November 2004.
Communication-intensive parallel applications spend a significant amount of their total execution time exchanging data between processes, which leads to poor performance in many cases. In this paper, we investigate message compression in the context of large-scale parallel message-passing systems to reduce the communication time of individual messages and to improve the bandwidth of the overall system. We implement and evaluate the cMPI message-passing library, which quickly compresses messages on-the-fly with a low enough overhead that a net execution time reduction is obtained. Our results on six large-scale benchmark applications show that their execution speed improves by up to 98% when message compression is enabled.
Nana B. Sam and Martin Burtscher.
Exploiting Type Information in Load-Value Predictors.
Proceedings of the Second Value-Prediction and Value-Based Optimization Workshop, pp. 32-39. October 2004.
To alleviate the high cost of main-memory accesses, computer architects have proposed various speculation mechanisms, including load-value prediction. A load-value predictor forecasts the result of load instructions, thus allowing dependent instructions to execute without having to wait for the memory access to complete. Unfortunately, costly mispredictions hinder the true potential of load-value prediction. This paper describes several approaches to build more accurate and faster load-value predictors with little extra hardware by exploiting the hardware type of the load instructions. Our techniques are easily implementable in any CPU that supports multiple load types, e.g., byte, word, single, etc. Compared to traditional load-value predictors, our schemes substantially reduce the number of mispredictions with a concomitant speedup of the CPU. Moreover, we show that the type of a load can effectively be used as the selector in a hybrid predictor.
VPC3: A Fast and Effective Trace-Compression Algorithm.
Proceedings of the 2004 SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pp. 167-176. June 2004.
Trace files are widely used in research and academia to study the behavior of programs. They are simple to process and guarantee repeatability. Unfortunately, they tend to be very large. This paper describes vpc3, a fundamentally new approach to compressing program traces. Vpc3 employs value predictors to bring out and amplify patterns in the traces so that conventional compressors can compress them more effectively. In fact, our approach not only results in much higher compression rates but also provides faster compression and decompression. For example, compared to bzip2, vpc3's geometric mean compression rate on SPECcpu2000 store address traces is 18.4 times higher, compression is ten times faster, and decompression is three times faster.
Martin Burtscher and Metha Jeeradit.
Compressing Extended Program Traces Using Value Predictors.
Proceedings of the 2003 International Conference on Parallel Architectures and Compilation Techniques, pp. 159-169. September 2003.
Trace files record the execution behavior of programs for future analysis. Unfortunately, nontrivial program traces tend to be very large and have to be compressed. While good compression schemes exist for traces that capture only the PCs of the executed instructions, these schemes can be ineffective on extended traces that include important additional information such as register values or effective addresses. Our novel, value-prediction-based approach compresses extended traces up to 22.8 times better and about two and a half times as well on average. In addition to the higher compression rate, our lossless single-pass algorithm has a fixed memory requirement and compresses traces faster than other algorithms. It achieves compression rates of up to 6170. This paper describes the design of our compression method and illustrates how value predictors can be used to effectively compress extended program traces.
Igor Szczyrba and Martin Burtscher.
On the Role of Ventricles in Diffuse Axonal Injuries.
Proceedings of the 2003 Summer Bioengineering Conference, pp. 147-148. June 2003.
Martin Burtscher and Benjamin G. Zorn.
Hybrid Load-Value Predictors.
IEEE Transactions on Computers, Vol. 51, No. 7, pp. 759-774. July 2002.
Load instructions diminish processor performance in two ways. First, due to the continuously widening gap between CPU and memory speed, the relative latency of load instructions grows constantly and already slows program execution. Second, memory reads limit the available instruction-level parallelism because instructions that use the result of a load must wait for the memory access to complete before they can start executing. Load-value predictors alleviate both problems by allowing the CPU to speculatively continue processing without having to wait for load instructions, which can significantly improve the execution speed. While several hybrid load-value predictors have been proposed and found to work well, no systematic study of such predictors exists. In this paper, we investigate the performance of all hybrids that can be built out of a register value, a last value, a stride 2-delta, a last four value, and a finite context method predictor. Our analysis shows that hybrids can deliver 25% more speedup than the best single-component predictors. An examination of the individual components of hybrids revealed that predictors with a poor standalone performance sometimes make excellent components in a hybrid while combining well-performing individual predictors often does not result in an effective hybrid. Our hybridization study identified the register value + stride 2-delta predictor as one of the best two-component hybrids. It matches or exceeds the speedup of two-component hybrids from the literature in spite of its substantially smaller and simpler design. Of all the predictors we studied, the register value + stride 2-delta + last four value hybrid performs best. It yields a harmonic-mean speedup over the eight SPECint95 programs of 17.2%.
Evan Speight and Martin Burtscher.
Delphi: Prediction-Based Page Prefetching to Improve the Performance of Shared Virtual Memory Systems.
Proceedings of the 2002 International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 49-55. June 2002.
Software distributed shared memory (SDSM) systems traditionally exhibit poor performance on applications with significant fine-grain or false sharing. Relaxed-consistency models and multiple-writer protocols improve the performance of SDSM systems significantly, but their performance still lags that of hardware shared memory implementations. This paper describes Delphi, a system that borrows techniques from microarchitectural research on value prediction and applies them to software distributed shared memory. We run a small software predictor on each node in the Delphi system to predict which virtual pages will be needed in the future. We use these predictions to prefetch pages, which reduces the number of accesses to invalid data and thereby reduces expensive network accesses. Experimental results show that Delphi is able to reduce the number of read misses to virtual pages by up to 62% on a set of well-known, scientific benchmarks with minimal runtime overhead in extra processing and memory requirements. This translates into a 13% reduction in execution time over a comparable base system that does not employ prediction techniques.
Martin Burtscher, Amer Diwan, and Matthias Hauswirth.
Static Load Classification for Improving the Value Predictability of Data-Cache Misses.
Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, pp. 222-233. June 2002.
While caches are effective at avoiding most main-memory accesses, the few remaining memory references are still expensive. Even one cache miss per one hundred accesses can double a program's execution time. To better tolerate the data-cache miss latency, architects have proposed various speculation mechanisms, including load-value prediction. A load-value predictor guesses the result of a load so that the dependent instructions can immediately proceed without having to wait for the memory access to complete. To use the prediction resources most effectively, speculation should be restricted to loads that are likely to miss in the cache and that are likely to be predicted correctly. Prior work has considered hardware- and profile-based methods to make these decisions. Our work focuses on making these decisions at compile time. We show that a simple compiler classification is effective at separating the loads that should be speculated from the loads that should not. We present results for a number of C and Java programs and demonstrate that our results are consistent across programming languages and across program inputs.
An Improved Index Function for (D)FCM Predictors.
Computer Architecture News, Vol. 30, No. 3, pp. 19-24. June 2002.
The most promising value predictors to date are the finite context method predictor and a recent improvement thereof, the differential finite context method predictor. Both predictors comprise two levels and the index into the second level is a function of the content of the first level. This index function is crucial for good performance. However, our research shows that the currently used select-fold-shift-xor function performs poorly on range-limited sequences of values. For example, it does not predict the results of byte loads well. The problem with the current function is that it often cannot reach the predictor's entire second-level table. We propose an improved index function that does not suffer from this shortcoming. On the 15 SPECcpu2000 C programs, our new index function improves the average load-value predictability by about 1% to 5% without increase in predictor size. On byte loads, the improvement is over 6% for 4096-entry predictors.
Martin Burtscher and Benjamin G. Zorn.
Hybridizing and Coalescing Load Value Predictors.
Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors, pp. 81-92. September 2000.
Most well-performing load value predictors are hybrids that combine multiple predictors into one. Such hybrids are often large. To reduce their size and to improve their performance, this paper presents two storage reduction techniques as well as a detailed analysis of the interaction between a hybrid's components. We found that state sharing and simple value compression can shrink the size of a predictor by a factor of two without compromising the performance. Our component analysis revealed that combining well-performing predictors does not always yield a good hybrid, whereas sometimes a poor predictor can make an excellent complement to another predictor in a hybrid. Performance evaluations using a cycle-accurate simulator running SPECint95 show that hybridizing can improve non-hybrids by thirty to fifty percent over a wide range of sizes. With fifteen kilobytes of state, our coalesced-hybrid yields a harmonic mean speedup of twelve and fifteen percent with a re-fetch and a re-execute misprediction recovery mechanism, respectively, which is higher than the speedup of other predictors we evaluate, some of which are six times larger.
Improving Context-Based Load Value Prediction.
Ph.D. Dissertation, Department of Computer Science, University of Colorado at Boulder (183 pages). April 2000.
Microprocessors are becoming faster at such a rapid pace that other components like random access memory cannot keep up. As a result, the latency of load instructions grows constantly and already often impedes processor performance. Fortunately, load instructions frequently fetch predictable sequences of values. Load value predictors exploit this behavior to predict the results of load instructions. Because the predicted values are available before the memory can deliver the true load values, the CPU is able to speculatively continue processing without having to wait for memory accesses to complete, which improves the execution speed. The contributions of this dissertation to the area of load value prediction include a novel technique to decrease the number of mispredictions, a predictor design that increases the hardware utilization and thus the number of correctly predicted load values, a detailed analysis of hybrid predictor combinations to determine components that complement each other well, and several approaches to substantially reduce the size of hybrid load value predictors without affecting their performance. One result of this research is a very small yet high-performing load value predictor. Cycle-accurate simulations of a four-way superscalar microprocessor running SPECint95 show that this predictor outperforms other predictors from the literature by twenty or more percent over a wide range of sizes. With about fifteen kilobytes of state, the smallest examined configuration, it surpasses the speedups delivered by other, five-times larger predictors both with a re-fetch and a re-execute misprediction recovery mechanism.
Martin Burtscher and Benjamin G. Zorn.
Exploring Last n Value Prediction.
Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, pp. 66-76. October 1999.
Most load value predictors retain a large number of previously loaded values for making future predictions. In this paper we evaluate the trade-off between tall and slim versus short and wide predictors of the same total size, i.e., between retaining a few values for a large number of load instructions and many values for a proportionately smaller number of loads. Our results show, for example, that even modest predictors holding sixteen kilobytes of values benefit from retaining four values per load instruction when running SPECint95. A detailed comparison of eight load value predictors on a cycle-accurate simulator of a superscalar out-of-order microprocessor shows that our implementation of a last four value predictor outperforms other predictors from the literature, often significantly. With 21kB of state, it yields a harmonic mean speedup of 12.5% with existing re-fetch misprediction recovery hardware and 13.7% with a not yet realized re-execution recovery mechanism.
Martin Burtscher and Benjamin G. Zorn.
Prediction Outcome History-Based Confidence Estimation for Load Value Prediction.
Journal of Instruction-Level Parallelism, Vol. 1 (25 pages). May 1999.
Load instructions occasionally incur very long latencies that can significantly affect system performance. Load value prediction alleviates this problem by allowing the CPU to speculatively continue processing without having to wait for the slow memory access to complete. Current load value predictors can only correctly predict about forty to seventy percent of the fetched load values. To avoid the cycle-penalty for mispredictions in the remaining cases, confidence estimators are employed. They inhibit all predictions that are not likely to be correct. In this paper we present a novel confidence estimator that is based on prediction outcome histories. Profiles are used to identify the high-confidence history patterns. Our confidence estimator is able to trade off coverage for accuracy and vice-versa with great flexibility and reaches an average prediction accuracy over SPECint95 of as high as 99.3%. Cycle-accurate pipeline-level simulations show that a simple last value predictor combined with our confidence estimator outperforms other predictors, sometimes by over 100%. Furthermore, this predictor is one of two predictors that yield a genuine speedup for all eight SPECint95 programs.
Martin Burtscher and Benjamin G. Zorn.
Profile-Supported Confidence Estimation for Load-Value Prediction.
Proceedings of the PACT'98 Workshop on Profile and Feedback-Directed Compilation (8 pages). October 1998.
Due to their occasional very long latency, load instructions are among the slowest instructions of current high-performance microprocessors. Unfortunately, the long latency of one instruction also delays the execution of its dependent instructions, which can significantly affect system performance. Load value prediction alleviates this problem by allowing the CPU to speculatively continue processing without having to wait for the slow memory access to complete. The potential of today's load value predictors for making correct predictions is only about 40 to 70 percent. Confidence estimators are employed to estimate how likely a prediction is to be correct and to keep the predictor from making a (probably incorrect) prediction if the confidence is below a preset threshold. Setting this threshold to a higher level increases the probability that the attempted predictions will be correct (higher accuracy) but results also in more missed opportunities for making correct predictions (lower coverage). The profile-supported confidence estimator we present in this paper is able to trade off coverage for accuracy and vice-versa with previously unseen flexibility and reaches an average prediction accuracy over SPECint95 of as high as 99.4%. A detailed pipeline-level simulation shows that our profile-supported load value predictor not only outperforms other predictors, but is also the only predictor that yields a speedup at all when a re-fetch misprediction recovery policy is used.