DOI: 10.1145/3635035.3635043

sKokkos: Enabling Kokkos with Transparent Device Selection on Heterogeneous Systems using OpenACC

Published: 19 January 2024

Abstract

This paper presents a new Kokkos feature that enables transparent device selection. Application developers cannot easily identify the most appropriate device in a heterogeneous system, because the best choice depends on the characteristics of both the application and the hardware. In Kokkos, each backend is associated with one specific programming model and hardware target, and programmers choose the backend at compile time. The new feature, implemented in the OpenACC backend, removes the burden of deciding which device to use and provides a highly productive programming solution for Kokkos applications. This work describes the implementation and reports a performance study conducted with a set of mini-benchmarks (AXPY and dot product), kernels (Lattice-Boltzmann method), and two mini-apps (LULESH and miniFE) on two heterogeneous systems with different hardware capabilities. Automatic and transparent device selection yields speedups of up to 35×.
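The abstract does not say how the device is chosen, so the following is only a toy illustration of why the best device depends on both the application and the hardware. It is not the sKokkos mechanism. It assumes a hypothetical `Device` record with a fixed launch overhead and a sustained memory throughput, and it picks the device with the lowest predicted time for a memory-bound kernel such as AXPY. All names (`Device`, `predicted_time_us`, `axpy_bytes`, `select_device`) and any numbers used with them are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical description of one device. The fields are illustrative
// assumptions for this sketch, not quantities taken from the paper.
struct Device {
    std::string name;
    double launch_overhead_us;  // fixed cost paid per kernel launch
    double bytes_per_us;        // sustained memory throughput
};

// Predicted time in microseconds for a memory-bound kernel moving `bytes`.
double predicted_time_us(const Device& d, double bytes) {
    return d.launch_overhead_us + bytes / d.bytes_per_us;
}

// AXPY (y = a*x + y) reads x, reads y, and writes y:
// three double-precision values of traffic per element.
double axpy_bytes(std::size_t n) {
    return 3.0 * sizeof(double) * static_cast<double>(n);
}

// Return the index of the device with the lowest predicted time.
std::size_t select_device(const std::vector<Device>& devices, double bytes) {
    assert(!devices.empty());
    std::size_t best = 0;
    double best_time = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < devices.size(); ++i) {
        const double t = predicted_time_us(devices[i], bytes);
        if (t < best_time) {
            best_time = t;
            best = i;
        }
    }
    return best;
}
```

With two made-up devices, say a "cpu" with 1 µs overhead and 50,000 bytes/µs and a "gpu" with 20 µs overhead and 500,000 bytes/µs, `select_device` returns the CPU for a 1,000-element AXPY and the GPU for a 10,000,000-element AXPY. The same kernel therefore lands on different devices as the problem size changes, which is the decision a transparent selector takes out of the programmer's hands.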


Published In

HPCAsia '24: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
January 2024
185 pages
ISBN: 9798400708893
DOI: 10.1145/3635035
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. Auto-tuning
2. C++ Metaprogramming
3. CPU
4. GPU
5. Heterogeneous Systems
6. Kokkos
7. OpenACC
8. Parallel Programming Models

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

HPCAsia 2024

Acceptance Rates

Overall Acceptance Rate 69 of 143 submissions, 48%
