SSAGA: SMs Synthesized for Asymmetric GPGPU Applications

Published: 21 April 2017

Abstract

The emergence of GPGPU applications, bolstered by flexible GPU programming platforms, has made it challenging to maintain high energy efficiency in modern GPUs. In this article, we demonstrate that customizing a Streaming Multiprocessor (SM) of a GPU for a lower frequency is significantly more energy efficient than employing dynamic voltage and frequency scaling (DVFS) on an SM designed for high-frequency operation. Using a system-level CAD technique, we propose SSAGA (Streaming Multiprocessors Synthesized for Asymmetric GPGPU Applications), an energy-efficient GPU design paradigm. SSAGA creates architecturally identical SM cores, customized for different voltage-frequency domains. Our rigorous cross-layer methodology demonstrates an average 20% improvement in energy efficiency over a spatially multitasking GPU across a range of GPGPU applications.
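To make the comparison in the abstract concrete, the following is a minimal sketch based on the textbook first-order CMOS switching-energy model, not on the article's cross-layer methodology. It assumes that an SM synthesized for a low-frequency domain can use smaller cells and therefore switches less capacitance than a high-frequency SM that is merely slowed down with DVFS. The function names, the normalized values, and the 10% capacitance reduction are all hypothetical, and leakage energy is ignored.

```python
def switching_energy(c_eff: float, vdd: float, cycles: int) -> float:
    """First-order CMOS switching energy for a fixed amount of work.

    E = C_eff * Vdd^2 * cycles. Clock frequency cancels out, because
    power scales with f while run time scales with 1/f.
    """
    return c_eff * vdd ** 2 * cycles


def saving(reference: float, candidate: float) -> float:
    """Fractional energy saving of `candidate` relative to `reference`."""
    return 1.0 - candidate / reference


# All numbers below are hypothetical and normalized; none come from the article.
CYCLES = 1_000_000

# SM designed for high frequency, run at its nominal voltage.
baseline = switching_energy(c_eff=1.0, vdd=1.0, cycles=CYCLES)
# The same SM under DVFS: voltage lowered, capacitance unchanged.
dvfs = switching_energy(c_eff=1.0, vdd=0.8, cycles=CYCLES)
# An SM synthesized for the low-frequency domain: same lowered voltage,
# plus an assumed 10% lower switched capacitance from smaller cells.
custom = switching_energy(c_eff=0.9, vdd=0.8, cycles=CYCLES)

print(f"DVFS saving vs. baseline:   {saving(baseline, dvfs):.0%}")
print(f"Custom saving vs. baseline: {saving(baseline, custom):.0%}")
print(f"Custom saving vs. DVFS:     {saving(dvfs, custom):.0%}")
```

Under these assumptions the customized SM always consumes less switching energy than the DVFS-scaled one at the same voltage, because the capacitance term is the only one that differs. The sketch only illustrates why such a gap can exist; it does not reproduce the article's reported result.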

Cited By

  • (2022) A Survey of GPU Multitasking Methods Supported by Hardware Architecture. IEEE Transactions on Parallel and Distributed Systems 33:6 (1451-1463). DOI: 10.1109/TPDS.2021.3115630. Online publication date: 1-Jun-2022

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 22, Issue 3
July 2017
440 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/3062395
  • Editor: Naehyuck Chang
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 April 2017
Accepted: 01 October 2016
Revised: 01 September 2016
Received: 01 May 2016
Published in TODAES Volume 22, Issue 3

Author Tags

  1. DVFS
  2. GPU
  3. custom design
  4. decreased peak chip temperature
  5. decreased power consumption
  6. improved energy efficiency
  7. power gating

Qualifiers

  • Research-article
  • Research
  • Refereed
