SSAGA: SMs Synthesized for Asymmetric GPGPU Applications

Authors:

Chidhambaranathan Rajamanikkam,

Koushik Chakraborty,

Sanghamitra Roy

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 22, Issue 3

Article No.: 49, Pages 1 - 20

https://doi.org/10.1145/3014163

Published: 21 April 2017

Abstract

The emergence of GPGPU applications, bolstered by flexible GPU programming platforms, has created a tremendous challenge in maintaining high energy efficiency in modern GPUs. In this article, we demonstrate that customizing a Streaming Multiprocessor (SM) of a GPU at a lower frequency is significantly more energy efficient compared to employing DVFS on an SM designed for a high-frequency operation. Using a system-level CAD technique, we propose SSAGA—Streaming Multiprocessors Synthesized for Asymmetric GPGPU Applications—an energy-efficient GPU design paradigm. SSAGA creates architecturally identical SM cores, customized for different voltage-frequency domains. Our rigorous cross-layer methodology demonstrates an average of 20% improvement in energy efficiency over a spatially multitasking GPU across a range of GPGPU applications.
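
The contrast the abstract draws between applying DVFS to a high-frequency SM and synthesizing an SM for a lower frequency can be illustrated with the first-order CMOS switching-power model, P = C_eff * Vdd^2 * f. The sketch below is not the article's methodology. The model ignores leakage, every numeric value is hypothetical, and the premise that a netlist synthesized for a relaxed clock target has a lower effective capacitance (for example, through smaller gates) is an assumption made only for illustration.

```python
def dynamic_power(c_eff, vdd, freq):
    """First-order CMOS switching power in watts: P = C_eff * Vdd^2 * f."""
    return c_eff * vdd ** 2 * freq


def energy_per_task(c_eff, vdd, freq, cycles):
    """Dynamic energy in joules for a task that needs a fixed cycle count."""
    runtime = cycles / freq  # seconds spent on the task
    return dynamic_power(c_eff, vdd, freq) * runtime


# All values below are hypothetical and chosen only for illustration.
CYCLES = 1_000_000

# SM designed for high frequency, running at its nominal operating point.
baseline = energy_per_task(c_eff=1.0e-9, vdd=1.0, freq=1.0e9, cycles=CYCLES)

# Same high-frequency SM scaled down with DVFS: capacitance is unchanged.
dvfs = energy_per_task(c_eff=1.0e-9, vdd=0.8, freq=0.5e9, cycles=CYCLES)

# SM synthesized for the low-frequency domain: same voltage and frequency as
# the DVFS case, with an ASSUMED 20% lower effective capacitance.
custom = energy_per_task(c_eff=0.8e-9, vdd=0.8, freq=0.5e9, cycles=CYCLES)

if __name__ == "__main__":
    for name, joules in [("baseline", baseline), ("dvfs", dvfs), ("custom", custom)]:
        print(f"{name:>8}: {joules:.3e} J ({joules / baseline:.2f}x baseline)")
```

Under these assumed values, both low-frequency options spend less dynamic energy per task than the baseline, and the customized design improves on DVFS only through the assumed capacitance reduction. Whether and by how much that holds for a real SM is what the article's cross-layer evaluation measures.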

Cited By

C. Zhao, W. Gao, F. Nie, and H. Zhou. 2022. A Survey of GPU Multitasking Methods Supported by Hardware Architecture. IEEE Transactions on Parallel and Distributed Systems 33, 6, 1451--1463. Online publication date: 1 June 2022. https://doi.org/10.1109/TPDS.2021.3115630

Index Terms

SSAGA: SMs Synthesized for Asymmetric GPGPU Applications
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 22, Issue 3

July 2017

440 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/3062395

Editor:
Naehyuck Chang
Korea Advanced Institute of Science and Technology, Korea

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 21 April 2017

Accepted: 01 October 2016

Revised: 01 September 2016

Received: 01 May 2016

Published in TODAES Volume 22, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation
