JCSE

JCSE, vol. 12, no. 2, pp. 37-49, 2018

DOI: http://dx.doi.org/10.5626/JCSE.2018.12.2.37

Packing Narrow-Width Operands to Improve GPU Performance

Xin Wang and Wei Zhang
Department of Electrical and Computer Engineering, Virginia Commonwealth University, Richmond, VA, USA

Abstract: Graphics processing units (GPUs), originally designed for graphics applications, have become a popular platform for accelerating general-purpose computations. By exploiting massive thread-level parallelism (TLP), GPUs can achieve high throughput and hide memory latency. A GPU typically employs a very large register file (RF) to support fast, low-cost context switching among tens of thousands of active threads. As a result, using the RF efficiently is critical for the GPU to achieve high performance. We observe that in many GPGPU applications, a large percentage of computed results have fewer significant bits than the full width of a 32-bit register. We therefore propose a GPU register packing scheme that dynamically exploits narrow-width operands by packing multiple operands into a single full-width register. Dynamic register packing frees RF space, which allows the GPU to expose more TLP by assigning additional thread blocks to streaming multiprocessors (SMs), and thus improves performance. Our experimental results indicate that dynamic register packing improves GPU performance by up to 1.96X, and by 1.18X on average.
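As a rough illustration of the idea summarized in the abstract, the following Python sketch models packing two narrow operands into one 32-bit register and shows how a lower per-thread register demand could raise the number of resident thread blocks. It is not the hardware scheme from the paper: the 16-bit narrowness threshold, the function names (`is_narrow`, `pack2`, `unpack2`, `resident_blocks`), and the register-file numbers in the usage note are all assumptions chosen for illustration, and the block count considers the register file as the only limit.

```python
MASK16 = 0xFFFF  # low 16 bits of a 32-bit register


def is_narrow(value: int, bits: int = 16) -> bool:
    """Return True if a signed value fits in `bits` bits (two's complement).

    The 16-bit default is an assumed threshold, not one taken from the paper.
    """
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return lowest <= value <= highest


def pack2(a: int, b: int) -> int:
    """Pack two narrow signed operands into one 32-bit register image."""
    if not (is_narrow(a) and is_narrow(b)):
        raise ValueError("both operands must fit in 16 bits")
    return ((a & MASK16) << 16) | (b & MASK16)


def _sign_extend16(x: int) -> int:
    """Interpret a 16-bit field as a signed integer."""
    return x - 0x10000 if x & 0x8000 else x


def unpack2(reg: int) -> tuple[int, int]:
    """Recover the two signed operands stored by pack2."""
    return _sign_extend16((reg >> 16) & MASK16), _sign_extend16(reg & MASK16)


def resident_blocks(rf_regs: int, regs_per_thread: int, threads_per_block: int) -> int:
    """Thread blocks that fit on one SM if the register file is the only limit."""
    return rf_regs // (regs_per_thread * threads_per_block)
```

For example, `unpack2(pack2(-3, 1000))` round-trips to `(-3, 1000)`. With a hypothetical 65,536-register file and 256 threads per block, reducing the per-thread demand from 32 to 24 registers raises `resident_blocks` from 8 to 10, which is the kind of additional TLP the abstract refers to.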

Keywords: GPU register file; Narrow-width operand; Dynamic register packing
