research-article

sKokkos: Enabling Kokkos with Transparent Device Selection on Heterogeneous Systems using OpenACC

Authors: Pedro Valero-Lara, Keita Teranishi, Jeffrey Vetter, Marc Gonzalez-Tallada

HPCAsia '24: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region

Pages 23 - 34

https://doi.org/10.1145/3635035.3635043

Published: 19 January 2024

Abstract

This paper presents a new Kokkos feature that provides transparent device selection. Application developers cannot easily identify the most appropriate device in a heterogeneous system, because the choice depends on the characteristics of both the application and the hardware. In Kokkos, each backend is tied to one specific programming model and hardware target, and programmers choose the backend at compile time. The new feature, implemented in the OpenACC backend, removes the burden of deciding which device to use and thereby offers a highly productive programming solution for Kokkos applications. This work describes the implementation and reports a performance study conducted with mini-benchmarks (AXPY and dot product), kernels (Lattice-Boltzmann method), and two mini-apps (LULESH and miniFE) on two heterogeneous systems with different hardware capabilities. The new feature achieves speedups of up to 35× through automatic and transparent device selection.
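To make the selection problem concrete, the following is a minimal sketch and not the sKokkos implementation: the abstract does not state how the device is chosen, so the cost model here (a fixed launch overhead plus bytes moved divided by sustained bandwidth) is an assumption. The names `Device`, `predicted_seconds`, `select_device`, `axpy_bytes`, and `example_devices` are hypothetical, and the overhead and bandwidth figures are invented for illustration rather than measured on any system.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical device description; every number used below is illustrative.
struct Device {
    std::string name;
    double launch_overhead_s;  // fixed cost paid once per kernel launch
    double bandwidth_bytes_s;  // sustained memory bandwidth seen by the kernel
};

// Assumed cost model for a memory-bound kernel that moves `bytes` bytes.
double predicted_seconds(const Device& d, double bytes) {
    return d.launch_overhead_s + bytes / d.bandwidth_bytes_s;
}

// Return the index of the device with the lowest predicted time.
std::size_t select_device(const std::vector<Device>& devices, double bytes) {
    assert(!devices.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < devices.size(); ++i) {
        if (predicted_seconds(devices[i], bytes) <
            predicted_seconds(devices[best], bytes)) {
            best = i;
        }
    }
    return best;
}

// AXPY (y = a*x + y) reads x, reads y, and writes y: three doubles per element.
double axpy_bytes(std::size_t n) {
    return 3.0 * sizeof(double) * static_cast<double>(n);
}

// Two made-up devices: low overhead and low bandwidth versus the opposite.
std::vector<Device> example_devices() {
    return {Device{"cpu", 1e-6, 50e9}, Device{"gpu", 1e-5, 500e9}};
}
```

Under these assumed numbers, a small AXPY is predicted to run faster on the low-overhead device and a large one on the high-bandwidth device. The preferred device therefore changes with problem size, which illustrates why the choice depends on both the application and the hardware.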

Index Terms

sKokkos: Enabling Kokkos with Transparent Device Selection on Heterogeneous Systems using OpenACC
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
    2. Parallel programming languages

Published In

HPCAsia '24: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region

January 2024

185 pages

ISBN: 9798400708893

DOI: 10.1145/3635035

Copyright © 2024 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

HPCAsia 2024

HPCAsia 2024: International Conference on High Performance Computing in Asia-Pacific Region

January 25 - 27, 2024

Nagoya, Japan

Acceptance Rates

Overall Acceptance Rate 69 of 143 submissions, 48%
