Modeling and Controlling Many-Core HPC Processors: an Alternative to PID and Moving Average Algorithms

Online AM: 09 September 2024

Abstract

The race towards higher performance and computing power has led to chips with heterogeneous and complex designs that integrate an ever-growing number of cores on the same monolithic chip or chiplet silicon die. Higher integration density, compounded by the slowdown of technology-driven power reduction, makes power and thermal management increasingly relevant. Unfortunately, existing research lacks a detailed analysis and modeling of thermal, power, and electrical coupling effects and of how they must be jointly considered to perform dynamic control of complex and heterogeneous multiprocessor systems-on-chip (MPSoCs). To close this gap, we first provide a detailed thermal and power model targeting a modern high-performance computing (HPC) MPSoC. We consider real-world coupling effects such as actuators’ non-idealities and the exponential relation between the dissipated power, the temperature state, and the voltage level in a single processing element. We analyze how these factors affect the control algorithm's behavior and the challenges they pose. Based on this analysis, we propose a thermal capping strategy inspired by fuzzy control theory to replace the state-of-the-art PID controller, as well as an iterative root-finding method to optimally choose the shared voltage value among cores grouped in the same voltage domain. We evaluate the proposed controller with model-in-the-loop and hardware-in-the-loop co-simulations. We show an improvement over state-of-the-art methods of up to \(5\times\) in the maximum exceeded temperature, while application execution is on average \(3.56\%\) faster across all evaluation scenarios.
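As a rough illustration of the kind of problem the abstract describes, the sketch below uses plain bisection to pick the highest shared voltage for a voltage domain whose total power stays within a budget. It is not the method proposed in the paper. The power model (a quadratic dynamic term plus a leakage term that is exponential in voltage and temperature), all coefficients, the voltage range, and the names `core_power`, `domain_power`, and `shared_voltage` are assumptions made only for this example. Per-core frequencies are held fixed for simplicity.

```python
import math


def core_power(v, f_ghz, temp_c, k_dyn=1.0, k_leak=0.05, a_v=2.0, a_t=0.02):
    """Hypothetical per-core power model in watts (all coefficients are made up)."""
    dynamic = k_dyn * f_ghz * v ** 2                          # switching power
    leakage = k_leak * v * math.exp(a_v * v + a_t * temp_c)   # exponential in V and T
    return dynamic + leakage


def domain_power(v, freqs_ghz, temps_c):
    """Total power of all cores that share the supply voltage v."""
    return sum(core_power(v, f, t) for f, t in zip(freqs_ghz, temps_c))


def shared_voltage(budget_w, freqs_ghz, temps_c, v_min=0.5, v_max=1.2, tol=1e-4):
    """Bisection on g(v) = domain_power(v) - budget_w, which increases with v.

    Returns the lower end of the final bracket, so the result never exceeds
    the budget unless even v_min does.
    """
    def g(v):
        return domain_power(v, freqs_ghz, temps_c) - budget_w

    if g(v_max) <= 0:      # budget is loose: the maximum voltage is allowed
        return v_max
    if g(v_min) > 0:       # budget cannot be met: fall back to the minimum
        return v_min
    lo, hi = v_min, v_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid       # over budget: root lies below mid
        else:
            lo = mid       # within budget: root lies at or above mid
    return lo


freqs = [2.0, 2.5, 3.0]     # GHz, hypothetical operating points
temps = [70.0, 80.0, 85.0]  # degrees Celsius, hypothetical sensor readings
v_shared = shared_voltage(6.0, freqs, temps)
```

Bisection is used here only because it is simple and always converges when the function is monotonic over the bracket. A real controller would also have to account for the coupling between voltage and achievable frequency, which this sketch ignores.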


Published In

ACM Transactions on Autonomous and Adaptive Systems, Just Accepted
EISSN: 1556-4703
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 09 September 2024
Accepted: 01 August 2024
Revised: 19 May 2024
Received: 30 September 2023

Author Tags

  1. Modeling
  2. Control
  3. Nonlinear systems

Qualifiers

  • Research-article
